Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

time to read 6 min | 1060 words

One of the hardest things that we did in RavenDB 4.0 would probably go completely unnoticed by users. We completely rewrote how RavenDB processes map/reduce queries. One of my more popular blog posts is still a Visual Explanation to Map/Reduce, and it still does a pretty good job of explaining what map/reduce is.

The map/reduce code in RavenDB 3.x is one of the more fragile things that we have. It requires you to keep in your head several completely different states that a particular reduction can be in, and how it transitions between them. Currently, there are probably two guys* who still understand how it works and one guy who is still able to find bugs in the implementation. It is also not as fast as we wished it would be.

So with RavenDB 4.0 we set out to build it from scratch, based in no small part on the fact that we had also written our storage engine for 4.0 and were able to take full advantage of that. You can read about the early design in this blog post, but I’m going to do a quick recap and explain how it works now.

The first stage in map/reduce is… well, the map. We run over the documents and extract the key portions we’ll need for the next part. We then immediately apply the reduce on each of the results independently. This gives us the final map/reduce results for a single document. More to the point, this also tells us what the reduce key for the results is. The reduce key is the value that the index grouped on.

We store all of the items with the same reduce key together. And here is where it gets interesting. Up until a certain point, we just store all of the values for a particular reduce key as an embedded value inside a B+Tree. That means that whenever any of the values change, we can add that value to the appropriate location and reduce all the matching values in one go. This works quite well until the total size of all the values exceeds about 4KB or so.

At this point, we can’t store the entire thing as an embedded value and we move all the values for that reduce key to its own dedicated B+Tree. This means that we start with a single 8KB page and fill it up, then split it, and so on. But there is a catch. The results of a map/reduce operation tend to be extremely similar to one another. At a minimum, they share the same properties and the same reduce key. That means that we would end up storing a lot of duplicate information. To resolve that, we also apply recursive compression. Whenever a page nears 8KB in size, we will compress all the results stored in that page as a single unit. This tends to give a great compression rate and can allow us to store up to 64KB of uncompressed data in a single page.
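To get a feel for why compressing a whole page of results as one unit pays off, here is a minimal Python sketch. It is not RavenDB’s code: `zlib` stands in for whatever compression the storage engine actually uses, and the `Company` / `Count` / `Total` result shape is a made-up example.

```python
import json
import zlib

# Hypothetical map/reduce results: same properties, same reduce key,
# only a small part of each entry actually differs.
results = [
    {"Company": "companies/1", "Count": 1, "Total": 10 + i}
    for i in range(200)
]

# Serialize every result on the "page" and compress them together as one unit.
raw = b"".join(json.dumps(r).encode("utf-8") for r in results)
compressed = zlib.compress(raw)  # zlib is a stand-in, not the real algorithm


def compression_ratio(raw_bytes: bytes, compressed_bytes: bytes) -> float:
    """How many uncompressed bytes each stored byte represents."""
    return len(raw_bytes) / len(compressed_bytes)


print(f"raw={len(raw)} bytes, compressed={len(compressed)} bytes, "
      f"ratio={compression_ratio(raw, compressed):.1f}x")
```

Because the property names and the reduce key repeat in every entry, the compressor only has to pay for the parts that differ.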

When adding items to a map/reduce index, we apply an optimization so it looks like:

results = reduce(results, newResults);

Basically, we can utilize the recursive nature of reduce to optimize things for the append only path.
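Here is a rough Python sketch of that append-only path. The index definition (orders grouped by `Company`) and the helper names `map_order`, `reduce_results` and `append` are illustrative assumptions, not RavenDB’s API. The point is only that the reduce output has the same shape as its input, so it can be fed back into itself.

```python
from collections import defaultdict


def map_order(order: dict) -> list:
    # Map: extract only the portions of the document the index needs.
    return [{"Company": order["Company"], "Count": 1, "Total": order["Total"]}]


def reduce_results(results: list) -> list:
    # Reduce: group by the reduce key (Company) and aggregate.
    # The output has the same shape as the input, so it can be reduced again.
    grouped = defaultdict(lambda: {"Count": 0, "Total": 0})
    for r in results:
        entry = grouped[r["Company"]]
        entry["Count"] += r["Count"]
        entry["Total"] += r["Total"]
    return [{"Company": key, **agg} for key, agg in grouped.items()]


def append(existing: list, new_document: dict) -> list:
    # results = reduce(results, newResults): only the new entries and the
    # already-reduced results are touched, not every original document.
    new_results = reduce_results(map_order(new_document))
    return reduce_results(existing + new_results)


results = []
for order in [
    {"Company": "companies/1", "Total": 10},
    {"Company": "companies/2", "Total": 5},
    {"Company": "companies/1", "Total": 20},
]:
    results = append(results, order)

print(results)
```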

When you delete or update documents and results change or are removed, things are more complex. We handle that by running a re-reduce on the results. Now, as long as the number of results is small (this depends on the size of your data, but typically up to a thousand or so) we’ll just run the reduce over the entire result set. Because the data is always held in a single location, this means that it is extremely efficient in terms of memory access and the tradeoff between computation and storage leans heavily to the side of just recomputing things from scratch.

When we have too many results (the total uncompressed size exceeds 64KB) we start splitting the B+Tree and adding a level to the tree. At this point, the cost of updating a value is now the cost of updating a leaf page and the reduce operation on the root page. When we have more data still, we will get yet another level, and so on.

The (rough) numbers are:

  • Up to 64KB (roughly 1000 results) – 1 reduce for the entire dataset
  • Up to 16 MB – 2 reduces (1 for up to 1000 results, 1 for up to 254 results)
  • Up to 4 GB – 3 reduces (1 for up to 1000 results, 2 for up to 254 results each)
  • Up to 1 TB – 4 reduces (1 for up to 1000 results, 3 for up to 254 results each)
  • I think you get how it works now, right? The next level up is 1 to 248 TB and will require 5 reduces.

These numbers assume that your reduce data is very small, on the order of a few dozen bytes. If you have large data, this means that the tree will expand faster, and you’ll get fewer reduces at the first level.
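The rough numbers in the list can be reproduced with a bit of arithmetic. This sketch assumes exactly what the list states: about 64KB of uncompressed results per leaf page, and a fan-out of 254 for every level above it. The constant and function names are mine.

```python
LEAF_CAPACITY_BYTES = 64 * 1024  # uncompressed data held by one compressed leaf page
BRANCH_FANOUT = 254              # results per page above the leaf level


def max_bytes_for_depth(reduces: int) -> int:
    """Upper bound on the data covered when an update costs `reduces` reduces."""
    return LEAF_CAPACITY_BYTES * BRANCH_FANOUT ** (reduces - 1)


def describe(size: float) -> str:
    # Format a byte count using binary units, capped at TB.
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if size < 1024 or unit == "TB":
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} TB"


for reduces in range(1, 6):
    print(f"{reduces} reduce(s): up to {describe(max_bytes_for_depth(reduces))}")
```

The fifth level comes out at roughly 248 TB, which is where the last number in the list comes from.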

Note that at the first level, if there is only an addition (new document, basically), we can process that as a single operation between two values and then proceed upward as the depth of the tree requires. There are also optimizations in place if we have multiple updates to the same reduce key. In that case, we can first apply all the updates, then do the reduce once for all of them in one shot.
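To make the cost model concrete, here is a toy two-level version in Python. It is a sketch under heavy simplification: plain lists instead of B+Tree pages, integers instead of reduce results, and `sum` as the reduce function. The class and its `values_reduced` counter are hypothetical, there only to show how little gets re-reduced on an update.

```python
from typing import Callable, List


class TwoLevelReduceTree:
    """Toy model: leaf pages of values plus a root holding one result per leaf."""

    def __init__(self, leaf_capacity: int, reduce_fn: Callable[[List[int]], int]):
        self.leaf_capacity = leaf_capacity
        self.reduce_fn = reduce_fn
        self.leaves: List[List[int]] = []
        self.leaf_results: List[int] = []  # one reduced value per leaf page
        self.root = reduce_fn([])
        self.values_reduced = 0            # how many inputs the reduces consumed

    def _reduce(self, values: List[int]) -> int:
        self.values_reduced += len(values)
        return self.reduce_fn(values)

    def add(self, value: int) -> None:
        if not self.leaves or len(self.leaves[-1]) == self.leaf_capacity:
            self.leaves.append([value])
            self.leaf_results.append(self._reduce([value]))
        else:
            self.leaves[-1].append(value)
            # Append-only path: a single operation between two values.
            self.leaf_results[-1] = self._reduce([self.leaf_results[-1], value])
        self.root = self._reduce(self.leaf_results)

    def update(self, leaf: int, slot: int, value: int) -> None:
        self.leaves[leaf][slot] = value
        # Re-reduce only the touched leaf page, then the root over per-leaf results.
        self.leaf_results[leaf] = self._reduce(self.leaves[leaf])
        self.root = self._reduce(self.leaf_results)


tree = TwoLevelReduceTree(leaf_capacity=1000, reduce_fn=sum)
for _ in range(3000):
    tree.add(1)

tree.values_reduced = 0
tree.update(leaf=1, slot=0, value=5)
print(tree.root, tree.values_reduced)
```

With 3,000 values spread over three leaves, the update re-reduces one leaf (1,000 values) and the root (3 values), not the whole data set.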

And all of that is completely invisible to the users, unless you want to peek inside, which is possible using the Map/Reduce visualizer:

image

This can give you insight deep into the guts of how RavenDB is handling map/reduce operations.

The current status is that map/reduce indexes are actually faster than normal indexes, because they are almost entirely our code, while a large portion of the normal indexing cost is with Lucene.

* That is an exaggeration, there is one guy who knows how it works. Okay, okay, I’ll admit that we can dive into the code and figure out what is going on, but it takes quite a bit of time if there is a significant issue there.

time to read 3 min | 513 words

There is a lot of chatter in the industry about the notion of 10x programmers. People who can routinely be an order of magnitude faster than mere mortals. Okay, that was a bit snarky, I’ll admit.

I have had the pleasure to interact with a lot of developers, from people whose conversation I could barely follow (way above my level) and whose code I mined for insight and ideas to the classic outsourcing developer who set up a conference call for assistance in writing “Hello World” to the console. I think that I have enough experience at this point to comment on the nature of developer productivity. More to the point, I know of quite a lot of ways to destroy it.

The whole 10x developer mentality assumes that a single (or very few) developers are actually able to make a major difference, and that is usually not the case. Let me try to explain why, and note that I assume a perfect world in which there is no need to burn this 10x dev with all nighters and hero mode.

The problem is what we are talking about when we are talking about a major difference. Usually, as developers, we can talk about making major technical changes. Let us consider the Windows Kernel Dispatcher Lock removal. That was 8 years ago and it is still something that pops into my mind when I consider big changes in the guts of software. This is something that is clearly beneficial, was quite complex to get right and required a lot of work. No idea if the people working on it were “10x” but I assume that the kernel team in Microsoft wasn’t pulled from the lowest bidder by the Shady Outsourcing R Us.

What real difference did it make for Windows? Well, it became faster, which is great. But I think it is fair to say that most people never heard about it, and of those who did, fewer cared.

The things that really matter for a product are a solid technical base, and then all the rest of the stuff. This can be the user interface, the documentation, the getting started guide and even the “yes dear” installer. It is the whole experience that matters, and you’ll not typically find someone who can do all of that significantly better than others.

One of the guys in the office is currently spending much more time writing the documentation and walkthrough for a feature than the time it took to actually write it. The problem with developers is that we tend to live in our own world and consider everything else that isn’t technical secondary.


But as good as the software is, the actual release to customers requires a lot more work that isn’t even remotely technical, be it marketing materials, working with partners or just making sure that the purchase workflow actually works.

time to read 4 min | 681 words

I mentioned that RavenDB 4.0 uses x509 client certificates for authentication, right? As it turns out, this can create some issues for us when we need to do more than just blind routing to the right location.

Imagine that our proxy is set up in front of the RavenDB Docker Swarm not just for handling routing but to also apply some sort of business logic. It can be that you want to bill clients on a per-client basis for their usage, or maybe even inspect the incoming data into RavenDB to protect against XSS (don’t ask, please). But those are strange requirements. Let us go with something that you can probably empathize with more easily.

We want to have a RavenDB cluster that uses a Let’s Encrypt certificate, but that certificate has a very short lifetime, typically around 3 months. So you probably don’t want to set up these certificates within RavenDB itself, because you’ll be replacing them all the time. So we want to write a proxy that would handle the entire process of fetching, renewing and managing Let’s Encrypt certificates for our database, but the certificates that the RavenDB cluster will use are internal ones, with much longer expiration times.

So far, so good. Except…

The problem that we have here is that here we have a problem. Previously, we used the SNI extension in our proxy to know where we are going to route the connection, but now we have different certificates for the proxy and for the RavenDB server. This means that if we try to just pass the connection through to the RavenDB node, the client will detect that it isn’t using a trusted certificate and fail the request. On the other hand, if we terminate the SSL connection at the proxy, we have another issue: we use x509 client certificates to ensure that the user actually has the access they desire. And we can’t just pass the client certificate forward, since we terminated the SSL connection.

Luckily, we don’t have to deal with a true man in the middle simulation here, because we can configure the RavenDB server to trust the proxy. All that is left now is to figure out how the proxy can tell the RavenDB server which client certificate the proxy authenticated. A common way to do that is to send the client certificate details over in a header, and that would work, but…

Sending the certificate details in a header has two issues for us. First, it would mean that we need to actually parse and mutate the incoming data stream. Not that big a deal, but it is something that I would like to avoid if possible. Second, and more crucial for us, we don’t want to have to validate the certificate on each and every request. What we want to do is take advantage of the fact that connections are reused and do all the authentication checks once, when the client connects to the server. Authentication doesn’t cost too much, but when you are aiming at tens of thousands of requests a second, you want to reduce costs as much as possible.

So we have two problems, but we can solve them together. Given that the RavenDB server can be configured to trust the proxy, we are going to do the following. Terminate the SSL connection at the proxy, and validate the client certificate (just validate the certificate, not check permissions or such) and then the magic happens. The proxy will generate a new certificate, signed with the proxy’s own key, registering the original client certificate thumbprint in the new client certificate (and caching that certificate, obviously). Then the proxy routes the request to its destination, signed with its own client certificate. The RavenDB server will recognize that this is a proxied certificate, pull the original certificate thumbprint from the proxied client certificate and use that to verify the permissions to assign to the user.
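The caching part of that flow might look roughly like the following Python sketch. Only the thumbprint computation (SHA-1 over the certificate’s DER bytes, the usual x509 thumbprint) is standard. `ProxiedCertificateCache` and `fake_issue` are hypothetical names, and actually signing a new certificate with the proxy’s key is deliberately left as a pluggable function.

```python
import hashlib
from typing import Callable, Dict


def thumbprint(cert_der: bytes) -> str:
    """x509 thumbprint: SHA-1 over the DER-encoded certificate, as upper-case hex."""
    return hashlib.sha1(cert_der).hexdigest().upper()


class ProxiedCertificateCache:
    """Issue one proxied certificate per client thumbprint and reuse it afterwards."""

    def __init__(self, issue: Callable[[str], bytes]):
        self._issue = issue              # signs a new cert embedding the thumbprint
        self._cache: Dict[str, bytes] = {}
        self.issued = 0

    def get(self, client_cert_der: bytes) -> bytes:
        original = thumbprint(client_cert_der)
        if original not in self._cache:
            self._cache[original] = self._issue(original)
            self.issued += 1
        return self._cache[original]


def fake_issue(original_thumbprint: str) -> bytes:
    # Hypothetical stand-in for "generate a certificate signed by the proxy's key
    # that records the original client thumbprint".
    return b"proxied-cert-for:" + original_thumbprint.encode("ascii")


cache = ProxiedCertificateCache(fake_issue)
client_cert = b"pretend these are the DER bytes of a validated client certificate"
first = cache.get(client_cert)
second = cache.get(client_cert)
print(cache.issued, first == second)
```

The expensive signing step runs once per client certificate, no matter how many connections that client opens.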

The proxy can then manage things like refreshing the certificates from Let’s Encrypt and RavenDB can get proxied requests.

time to read 4 min | 661 words

RavenDB 4.0 uses x509 client certificates for authentication. That is good, because it means that we get both encryption and authentication on both ends, but it does make it more complex to handle some deployment scenarios. It turns out that there is quite a big demand for doing things to the data that goes to and from RavenDB.

We’ll start with the simpler case, of having dynamic deployment on Docker, with nodes that may be moved from location to location. Instead of exposing the nastiness of the internal network to the outside world with URLs such as (https://129-123-312-1.rvn-srv.local:59421) we want to have nice and clean urls such as https://orders.rvn.cluster. The problem is that in order to do that, we need to put a proxy in place.

That is pretty easy when you deal with HTTP or plain TCP, but much harder when you deal with HTTPS and TLS because you also need to handle the encrypted stream. We looked at various options, such as Nginx and Traefik as well as a peek at Squid, but we ruled them out for various reasons, mostly related to the deployment pattern (Nginx doesn’t handle dynamic routing), feature set (Traefik doesn’t handle client certificates properly) and use case (Squid seems to be much more focused on being a cache). None of them supported the networking model we want (1:1 connection matches from client to server, which we would really like to preserve because it simplifies authentication costs significantly).

So I set out to explore what it would take to build an SSL Proxy to fit our needs. The first thing I looked at was how to handle routing. Given a user who types https://orders.rvn.cluster in the browser, how does this translate to actually hitting an internal Docker instance with a totally different port and host?

The answer, as it turned out, is that this is not a new problem. One of the ways to do that is to just intercept the traffic. We can do that because in this deployment model, we control both the proxy and the server, so we can put the certificate for “orders.rvn.cluster” in the proxy, decrypt the traffic and then forward it to the right location. That works, but it means that we have a man in the middle. Is there another option?

As it turns out, this is such a common problem that there are multiple solutions for it. These are SNI (Server Name Indication) and ALPN (Application Layer Protocol Negotiation), both of which allow the client to specify what they want to get from the server as part of the initial (and unencrypted) negotiation. This is pretty sweet from the point of view of the proxy, because it can make routing decisions without needing to do the TLS negotiation, but not so much for the user if they are currently trying to check “super-shady.site”, since while the contents of their request are masked, the destination is not. I’m not sure how big of a security problem this is (the end IP isn’t encrypted, after all, and even if you host a thousand sites on the same server, it isn’t that big a deal to narrow it down).

Anyway, the key here is that this is possible, so let’s make this happen. The solution is almost literally pulled from the StreamExtended readme page.

We get a TCP stream from a client, and we peek into it to read the TLS header, at which point we can pull the server name out. At this point, you’ll note, we haven’t touched SSL and we can forward the stream toward its destination without needing to inspect any other content, just carrying the raw bytes.
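To show what “peeking into the TLS header” involves, here is a small Python sketch that pulls the server name out of a raw ClientHello. This is not the StreamExtended code, just an illustration of the same idea using only the standard library. `extract_sni` and `real_client_hello` are names I made up, and the parser assumes the whole ClientHello arrives in a single TLS record.

```python
import ssl
import struct
from typing import Optional


def extract_sni(client_hello: bytes) -> Optional[str]:
    """Return the SNI host name from a raw TLS ClientHello record, or None."""
    try:
        if client_hello[0] != 0x16:       # record type: not a TLS handshake
            return None
        pos = 5                           # skip record header (type, version, length)
        if client_hello[pos] != 0x01:     # handshake type: not a ClientHello
            return None
        pos += 4                          # handshake type + 3-byte length
        pos += 2 + 32                     # client version + random
        pos += 1 + client_hello[pos]      # session id
        (suites_len,) = struct.unpack_from("!H", client_hello, pos)
        pos += 2 + suites_len             # cipher suites
        pos += 1 + client_hello[pos]      # compression methods
        (ext_total,) = struct.unpack_from("!H", client_hello, pos)
        pos += 2
        end = pos + ext_total
        while pos + 4 <= end:
            ext_type, ext_len = struct.unpack_from("!HH", client_hello, pos)
            pos += 4
            if ext_type == 0x0000:        # server_name extension
                # Layout: list length (2), name type (1), name length (2), name.
                name_type = client_hello[pos + 2]
                (name_len,) = struct.unpack_from("!H", client_hello, pos + 3)
                name_end = pos + 5 + name_len
                if name_type != 0 or name_end > len(client_hello):
                    return None
                return client_hello[pos + 5:name_end].decode("ascii")
            pos += ext_len
        return None
    except (IndexError, struct.error, UnicodeDecodeError):
        return None


def real_client_hello(hostname: str) -> bytes:
    """Use the ssl module to produce the bytes a client would send first."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    incoming, outgoing = ssl.MemoryBIO(), ssl.MemoryBIO()
    tls = context.wrap_bio(incoming, outgoing, server_hostname=hostname)
    try:
        tls.do_handshake()
    except ssl.SSLWantReadError:
        pass                              # expected: no server has answered yet
    return outgoing.read()


hello = real_client_hello("orders.rvn.cluster")
print(extract_sni(hello))
```

A proxy would read these bytes from the client socket, look up the destination for the returned name, and then replay the same bytes to that destination before switching to plain byte forwarding.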

This is great, because it means that things like client authentication can just work and authenticate against the final server without any complexity. But it can be a problem if we actually need to do something with the traffic. I’ll discuss how to handle this properly in the next post.

time to read 2 min | 245 words

With the RC release out of the way, we are starting on a much faster cadence of fixes and user visible changes as we get ready for the release.

In order to allow users to report issues and have them resolved as soon as possible, we now publish our nightly builds.

The nightly release is literally just whatever we have at the top of the branch at the time of the release. A nightly release goes through the following release cycle:

  • It compiles
  • Release it!

In other words, a nightly should be used only in development environments where you are fine with the database deciding that names must be “Green Jane” and it is fine to burp all over your data or investigate how hot we can make your CPU.

More seriously, nightlies are a way to keep up with what we are doing, and their stability is directly related to what we are currently doing. As we come closer to the release, the nightly builds’ stability is going to improve, but there are no safeguards there.

It means that the typical turnaround for most issues can be as low as 24 hours (and it gives me back the ability to say, “thanks for the bug report, fixed and will be available tonight”). All other releases remain at the same level of testing and preparedness.

time to read 3 min | 498 words

I have been talking a lot about major features and making things visible and all sorts of really cool things. What I haven’t been talking about is a lot of the work that has gone into the backend and all the stuff that isn’t sexy and bright. You probably don’t really care how the piping system in your house works, at least until the toilet doesn’t flush. A lot of what we did with RavenDB 4.0 was to look at all the pain points that we have run into and try to resolve them. This series of posts is meant to expose some of these hidden features. If we did our job right, you will never even know that these features exist, they are that good.

In RavenDB 3.x we had a feature called Document Compression. This allowed a user to save a significant amount of space by having the documents stored in a compressed form on disk. If you had large documents, you could typically see significant space savings from enabling this feature. With RavenDB 4.0, we removed it completely. The reason is that we need to store documents in a way that allows us to load them and work with them in their raw form without any additional work. This is key for many optimizations that apply to RavenDB 4.0.

However, that doesn’t mean that we gave up on compression entirely. Instead of compressing the whole document, which would require us to decompress any time that we wanted to do something to it, we selectively compress individual fields. Typically, large documents are large because they have either a few very large fields or a collection that contains many items. The blittable format used by RavenDB handles this in two ways. First, we don’t need to repeat field names every time, we store them once per document. Second, we can compress large field values on the fly.

Take this blog for instance, a lot of the data inside it is actually stored in large text fields (blog posts, comments, etc). That means that when stored in RavenDB 4.0, we can take advantage of the field compression and reduce the amount of space we use. At the same time, because we are only compressing selected fields, it means that we can still work with the document natively. A trivial example would be to pull the recent blog post titles. We can fetch just these values (and since they are pretty small already, they wouldn’t be compressed) directly, and not have to touch the large text field that is the actual post contents.
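As a rough illustration of the idea (not the blittable format itself), here is a Python sketch that compresses only the large fields of a document. The 128 byte threshold, the use of `zlib`, and the `store_document` / `read_field` helpers are all assumptions made for the example, and it only handles string fields.

```python
import zlib

COMPRESSION_THRESHOLD = 128  # hypothetical cutoff, in bytes


def store_document(doc: dict) -> dict:
    """Store each string field raw or compressed, depending on its size."""
    stored = {}
    for field, value in doc.items():
        raw = value.encode("utf-8")
        if len(raw) > COMPRESSION_THRESHOLD:
            stored[field] = ("compressed", zlib.compress(raw))
        else:
            stored[field] = ("raw", raw)
    return stored


def read_field(stored: dict, field: str) -> str:
    """Read one field; only that field is decompressed, and only if needed."""
    kind, payload = stored[field]
    if kind == "compressed":
        payload = zlib.decompress(payload)
    return payload.decode("utf-8")


post = {
    "Title": "RavenDB 4.0: map/reduce from scratch",
    "Body": "A long blog post body with plenty of repeated words. " * 200,
}
stored = store_document(post)

# Pulling the title never touches (or decompresses) the large Body field.
print(read_field(stored, "Title"))
print(len(post["Body"]), "->", len(stored["Body"][1]), "bytes stored for Body")
```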

Here is what this looks like in RavenDB 4.0 when I’m looking at the internal storage breakdown for all documents.

image

Even though I have been writing for over a decade, I don’t have enough posts yet to make a statistically meaningful difference; the total database sizes for both are 128MB.

time to read 2 min | 253 words

I ran into this post, claiming that typed languages can reduce bugs by 15%. I don’t have a clue if this is true, but I wanted to talk about a major feature that I really like with typed languages. They give you compiler errors.

That sounds strange, until you realize that the benefit isn’t when you are writing the code, but when you need to change it. One of the things that I frequently do when I’m modifying code, especially if this is a big change, is to make a change that will intentionally cause a compiler error, and work my way from there. A good example of that from a short time ago was a particular method that returned an object. That object was responsible for a LOT of allocations, and we needed to reduce that.

So I introduced pooling, and the change looked like this:

This change broke all the call sites, allowing me to go and refactor each one in turn, sometimes pushing the change one layer up in the code, until I recursively fixed all the changes and then… everything worked.

The idea isn’t to try to use the type system to prove that the system will work. That is what I have all these tests for. The idea is that I can utilize the structure of the code to be able to pivot it. Every time that I tried to do stuff like that in a language that wasn’t strongly typed, I ran into a lot of trouble.

time to read 6 min | 1182 words


This is written a day before the Jewish New Year, so I suppose that I’m a bit introspective. We recently had a discussion in the office about priorities and what things are important for RavenDB.  This is a good time to do so, just after the RC release, we can look back and see what we did right and what we did wrong.

RavenDB is an amazing database, even if I say so myself, but one of the things that I find extremely frustrating is that so much work is going into things that have nothing to do with the actual database. For example, the studio. At any given point, we have a minimum of 6 people working on the RavenDB Studio.

Here is a confession, as far as I’m concerned, the studio is a black box. I go and talk to the studio lead, and things happen. About the only time that I actually go into the studio code is when I did something and that broke the studio build. But I’ll admit that I usually go to one of the studio guys and have them fix it for me.

Yep, at this stage, I can chuck that “full stack” title right off the edge. I’m a backend guy now, almost completely. To be fair, when I started writing HTML (it wasn’t called web apps, and the fancy new stuff was called DHTML) the hot new thing was the notion that you shouldn’t use <blink/> all the time and we just got <table> for layout.  I’m rambling a bit, but I’ll get there, I think.

I find the way web applications are built today to be strange, horribly complex and utterly foreign (I find just the size of the node_modules folder scary). But this isn’t a post about an old timer telling you how good the days were when you had 16 colors to play with and users knew their place. This post is about how a product is perceived.

I mentioned that RavenDB is amazing, right? It is certainly the most complex project that I have ever worked on and it is chock-full of really good stuff. And none of that matters unless it is in the studio. In fact, that team of people working on the studio? That includes only the people actually working on the studio itself. It doesn’t include all the work done to support the studio on the server side. So it would be pretty fair to say that over half of the effort we put into RavenDB is actually spent on the studio at this point.

And it just sounds utterly crazy, right? We literally spent more time on building the animation for the cluster state changes, so you can see them as they happen, than we have spent writing the cluster behavior. Given my own inclinations, that can be quite annoying.

It is easy to point at all of this hassle and say: “This is nonsense, surely we can do better”. And indeed, a lot of our contemporaries get away with this for their user interface:

I’ll admit that this was tempting. It would free up so much time and effort that it is very tempting.

It would also be quite wrong. For several reasons. I’ll start from the good of the project.

Here is one example from the studio (rough cost, a week for the backend work, another week for various enhancement / fixes for that, about two weeks for the UI work, some issues still pending for this) showing the breakdown of the I/O operations made on a live RavenDB instance.

image

There is also some runtime cost to actually tracking this information, and technically speaking we would be able to get the same information with strace or similar.  So why would we want to spend all this time and effort on something that a user would already likely have?

Well, this particular graph is responsible for us being able to track down and resolve several extremely hard to pin down performance issues with how we write to disk. Here is what this looks like at a slightly higher level: the green is writes to the journal, blue is flushes to the data file and orange is fsyncs.

image

At this point, I have a guy in the office who has stared at these graphs so much that he can probably tell me the spin rate of the underlying disk just by taking a glance. Let us be conservative and call it a mere 35% improvement in the speed of writing to disk.

Similarly, the cluster behavior graph that I complained about? Just doing QA on the graph part allowed us to find several issues that we hadn’t noticed before, because they were suddenly visible and right there.

That wouldn’t justify the amount of investment that we put into them, though. We could have built diagnostics tools much more cheaply than that, after all. But they aren’t meant for us, all of these features are there for actual customer use in production, and if they are useful for us during development, they are going to be invaluable for users trying to diagnose things in production. So while I may consider the art of graph drawing black magic of the highest caliber, I can most certainly see the value in such features.

And then there is the product aspect. Giving a user a complete package, where they can hit the ground running and feel that they have a good working environment, is a really good way to keep said user. There is also the fact that as a database, things like the studio are not meant primarily for developers. In fact, the heaviest users of the studio are the admin staff managing RavenDB, and they have a very different perspective on things. Making the studio useful for such users was an interesting challenge, luckily handled by people who actually know what they are doing in this regard much better than me.

And last, but not least, and tying back to the title of this post. Very few people actually take the time to do a full study of any particular topic, we use a lot of shortcuts to make decisions. And seeing the level of investment put into the user interface is often a good indication of overall production quality. And yes, I’m aware of the “slap some lipstick on this pig and ship it” mentality, there is a reason it works, even if it is not such a good idea as a long term strategy. Having a solid foundation and having a good user interface is a lot more challenging, but gives a far better end result.

And yet, I’m a little bit sad (and mostly relieved) that I have areas in the project that I can neither work on nor understand.

time to read 1 min | 150 words

In the wake of the RavenDB 4.0 Release Candidate, you are going to be seeing quite a lot of us.

Here is the schedule for the rest of the year. In all of these conferences we are going to have a booth and demo RavenDB 4.0 live. We are going to demonstrate a distributed database on the conference network, so expect a lot of demos of the failover behavior.

I’ll be speaking at Build Stuff about Modeling in Non Relation World and Extreme Performance Architecture as well as giving a full day workshop about RavenDB 4.0.

time to read 2 min | 201 words

I’m really happy to show off our RavenDB 4.0 Python client, now in beta. This is the second (after the .NET one, obviously) of the new clients that are upcoming. In the pipeline we have JVM, Node.JS, Go and Ruby.

I fell in love with Python over a decade ago, almost incidentally, mainly because the Boo syntax is based on it. And I loved Boo enough to write a book about it. So I’m really happy that we can now write Python code to talk to RavenDB, with all the usual bells and whistles that accompany a full-fledged RavenDB client.

This is a small example, showing basic CRUD, but a more complex sample is when we are using Python scripts to drive functionality in our application, using batch process scripts. Here is what this looks like:

This gives us the ability to run scripts that will be notified by RavenDB when something happens in the database and react to it. This is a powerful tool in the hands of the system administrator, since they can use it to add functionality to the system with all the ease of use of the Python language.
