Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,161
Privacy Policy · Terms
filter by tags archive
time to read 3 min | 443 words

Imagine that you have a two nodes cluster, setup as master-master replication, and then you write a document to one of them. The node you wrote the document to now contacts the 2nd node to let it knows about the new document. The data is replicated, and everything is happy in this world.

But now let us imagine it with three nodes. We write to node 1, which will then replicate to nodes 2 and 3. But node 2 is also configured to replicate to node 3, and given that we have a new document in, it will do just that. Node 3 will detect that it already have this document and turn that into a no-op. But at the same time that node 3 is getting the document from node 2, it is also sending the document it got from node 1 to node 2.

This work, and it evens out eventually, because replicating a document that was already replicated is safe to do. And on high load systems, replication is batched, so you typically don't see a problem until you get to bigger cluster sizes.

Let us take the following six way cluster. In this graph, we are going to have 15 round trips across the network on each replication batch.*

* Nitpicker corner, yes, I know that the number of connections is ( N * (N-1) ) / 2, but N^2 looks better in the title.

mes-topology

The typical answer we have for this is to change the topology, instead of having a fully connected graph, with each node talking to all other nodes, we use something like this:

tree-topology

Real topologies typically have more than a single path, and it isn't a hierarchy, but this is to illustrate a point.

This work, but it requires the operations team to plan ahead when they deploy, and if you didn't allow for breakage, a single node going down can disconnect large portion of your cluster. That is not ideal.

So in RavenDB 3.5 we have taken steps to avoid it, nodes are now much smarter about the way they go about talking about their changes. Instead of getting all fired up and starting to send replication message all over the place, potentially putting some serious pressure on the network, the nodes will be smarter about it, and wait a bit to see if their siblings already got the documents from the same source. In which case, we now only need to ping them periodically to ensure that they are still connected, and we saved a whole bunch of bandwidth.

time to read 6 min | 1134 words

I have written extensively about the blittable format already, so I’ll not get into that again. But what I wanted to do in this post is to discuss the implication of the intersection of two very important features:

  • The blittable format requires no further action to be useful.
  • Voron is based on a memory mapped file concept.

Those two, brought together, are quite interesting.

To see why, let us consider the current state of affairs. In RavenDB 3.0, we store the data as json directly. Whenever we need to read a document, we need to load the document from disk, parse the json, load it into .NET objects, and only then do something with it. When we just got started with RavenDB, it didn’t actually matter to us. Our main concern was I/O, and that dominated all our costs. We spent multiple releases improving on that, and the solution was the prefetcher.

  • Prefetcher will load documents from the disk and make them ready to be indexed.
  • The prefetcher is running concurrently to indexing, so we can parallelize I/O and CPU work.

That allow us to reduce most of the I/O wait times, but it still left us with problems. If two indexes are working, and they each use their own prefetcher, then we have double the I/O cost, double the parsing cost, double the memory cost, double the GC cost. So in order to avoid that, we group indexes together that are roughly at the same space in their indexing. But that lead to a different set of problems, if we have one slow index, that would impact all the other indexes, so we need to have a way to “abandon” an index while it is indexing, to let the other indexes in the group the chance to run.

There is also another issue, when inserting documents into the database, we want to index them, but it seems stupid to take the index, write it to the disk, only to then load them from the disk, parse them, etc. So when we insert a new document, we add it to the prefetcher directly, saving us some work in the common case where indexes are caught up and only need to index new things. That, too, have a cost, it means that the lifetime of such objects tend to be much longer, which means that they are more likely to be pushed into Gen1 or Gen2, so they will not be collected for a while, and when they do, it will be a more expensive collection run.

Oh, and to top it off, all of the structure above need to consider available memory, load on the server, time for indexing batch, I/O rates, liveliness and probably a dozen other factors that don’t pop to mind right now. In short, this is complex.

With RavenDB 4.0, we set out to remove all of this complexity. A large part of the motivation for the blittable format and using Voron are driven by the reasoning below.

If we can get to a point where we can just access the values, and reading documents won’t incur a heavy penalty in CPU/memory, we could radically shift the cost structure. Let us see how. Now, the only cost for indexing is going to be pure I/O, paging the documents to memory when we access them. Actually indexing them is done by merely access the mapped memory directly, so we don’t actually need to allocate much memory during indexing.

Optimizing the actual I/O is pretty easily done by just asking the operating system, we can do that explicitly using PrefetchVirtualMemory or madvise(MADV_WILLNEED), or just let the OS handle that based on actual access pattern. So those are two separate issues that just went away completely. And without needing to spread the cost of loading the documents among all indexes, we no longer have a good reason to go with grouping indexes. So that is out the window, as well as all the complexity that is required to handle a slow index slowing down everyone.

And because newly written documents are likely to be memory resident (they have just been accessed, after all), we can just skip the whole “let us remember recently written documents for the indexes”, because by the time we index them, we are expecting them to still be in memory.

What is interesting here is that by using the right infrastructure we have been able to remove quite a lot of code. Now, the major part here is that being able to remove a lot of code is almost always great, the major change here is that all of the code we removed had to deal with a very large number of factors (if new documents are coming in, but indexing isn’t caught up to them, we need to stop putting the new documents into the perfetcher cache and clear it) that are hard to predict and sometimes interact in funny ways. By moving a lot of that complexity to “let us manage what parts of the file are memory resident”, we can simplify a lot of that complexity and even push much of it directly to the operation system.

This has other implications, because we now no longer need to run indexes in groups, and they can each run and do their own thing, we can now split them so each index has their own dedicated thread. Which mean, in turn, that if we have a very busy index, it is going to be very easy to point which one is the culprit. It also make it much easier for us to handle priorities. Because each index is a thread, it means that we can now rely on the OS prioritization. If you have an index that you really care about running as soon as possible, we can bump its priority higher. And by default, we can very easily mark the indexing thread as lower priority, so we can prioritize answer incoming requests over processing indexes.

Doing it in this manner means that we are able to ask the OS to handle the problem of starvation in the system, where an index doesn’t get to run because it has a lower priority. All of that is already handled in the OS scheduler, so we can lean on that.

Probably the hardest part in the design of RavenDB 4.0 is that we are thinking very hard about how to achieve our goals (and in many cases exceed them) not by writing code, but by not writing code. But by arranging things so the right thing would happen. Architecture and optimization by omission, so to speak.

As a reminder, we have the RavenDB Conference in Texas in a few months, which would be an excellent opportunity to learn about RavenDB 4.0 and the direction in which we are going.

image

time to read 2 min | 294 words

It looks like I’m on a rule for administrators and operations features in RavenDB 3.5, and in this case, I want to introduce the Administrator JS Console.

This is a way for a server administrator to execute arbitrary code on a running system. Here is a small such example:

image

This is 100% the case of giving the administrator the option for running with scissors.

image Yes, I’m feeling a bit nostalgic Smile.

The idea here is that we have the ability to query and modify the state of the server directly. We don’t have to rely on prepared-ahead-of-time end points, and only being able to do whatever it is we thought of beforehand.

If we run into a hard problem, we’ll actually be able to ask the server directly, what ails you, and when we find out, we can tell it, so let us fix that. For example, interrogating the server about the state of its memory, then telling it to flush a particular cache, or changing (at runtime, without any restarts or downtime) the value of a particular configuration option, or giving the server additional information it is missing.

We already saw three cases where having this ability would have been a major time saver, so we are really excited about this feature, and what it means to our ability to more easily support RavenDB in production.

time to read 4 min | 771 words

RavenDB has been using the low level Esent as our storage engine from day 1. We toyed with building our own storage engine in Munin, but it was only in 2013 that we started pay serious attention to that. A large part of that was the realization that Esent is great, but it isn’t without its own issues and bugs (it is relatively easy to get it into referencing invalid memory, for example, if you run multiple defrag operations), that we have to work around. But two major reasons were behind our decision to invest a lot of time and effort into building Voron.

When a customer has an issue, they don’t care that this is some separate component that is causing it, we need to be able to provide them with an answer, fast. And escalating to Microsoft works, but it is cumbersome and in many cases result in a game of telephone. This can be amusing when you see kids do it, but not so much fun when it is an irate customer with a serious problem on the phone.

The second reason is that Esent was not designed for our workload. It has done great by us, but it isn’t something that we can take and tweak and file away all the rough edges in our own usage. In order to provide the level of performance and flexibility we need, we have to own every critical piece in our stack, and Voron was the result of that. Voron was released as part of RavenDB 3.0, and we have some customers running the biggest databases on it, battle testing it in production for the past two+ years.

With RavenDB 4.0, we have made Voron the only storage engine we support, and have extended it further. In particular, we added low level data structures that changed some operations from O(M * logN)  to O(logN), push more functionality to the storage engine itself and simplified the concurrency model.

In Voron 3.0, the only data structure that we had was the B+Tree, we had multiple of those, and you could recurse them, but that was it. Data was stored in the following manner:

  • Documents tree (key: document id, value: document json)
  • Etag tree (key: etag, value: document key)

We had one B+Tree as the primary, which contained the actual data, and a number of additional indexes, which allow us to search on additional fields, then find the relevant data by lookin up in the data tree.

This had two issues. The first was that our code needed to manually make sure that we were updating all the index trees whenever we updated/deleted/created any values. The second was that a scan over an index would result in the code first doing an O(logN) search over the index tree, then for each matching result it do another lookup to the actual data tree.

In practice, this only showed up as a problem when you have over 200 million entries, in which case the performance cost was noticeable. But for that purpose, and for a bunch of other (which we’ll discuss in the next post) we made the following changes to Voron.

We now have 4 data structures supported.

  • B+Tree – same as before, variable size key & value.
  • Fixed Sized B+Tree – int64 key, and fixed size (can be configured at creation time) size of value. Same as the one above, but let us take advantage of various optimization when we know what the total size is.
  • Raw data section – allow to store data, and give an opaque id to access the data later.
  • Table – combination of raw data sections with any number of indexes.

The raw data section is interesting, because it allows us to just ask it to store a bunch of data, and get an id back (int64), and it has an O(1) cost for accessing those values using the id.

We then use this id as the value in the B+Tree, which means that our structure now looks like this:

  • Raw data sections – [document json, document json, etc]
  • Documents tree (key: document id, value: raw data section id)
  • Etags tree (key: etag, value: raw data section id)

So now getting the results from the etags tree is an seek into the tree O(logN), and then just O(1) cost for each document that we need to get from there.

Another very important aspect is the fact that Voron is based on memory mapped files, which allows some really interesting optimization opportunities. But that will be the subject of the next post.

time to read 2 min | 302 words

One of the most important features in RavenDB is replication, and the ability to use a replica node to get high availability and load balancing across all nodes.

And yes, there are users who choose to run on a single instance, or want to have a hot backup, but not pay for an additional license because they don’t need the high availability mode.

Hence, RavenDB 3.5 will introduce the Hot Spare mode:

image

A hot spare is a RavenDB node that is available only as a replication target. It cannot be used for answering queries or talking with clients, but it will get all the data from your server and keep it safe.

If there is a failure of the main node , an admin can light up the hot spare, which will turn the hot spare license into a normal license for a period of 96 hours. At that point, the RavenDB server will behave normally as a standard server, clients will be able to failover to it, etc. After the 96 hours are up, it will revert back to hot spare mode, but the administrator will not be able to activate it again without purchasing another hot spare license.

The general idea is that this license is going to be significantly cheaper than a full RavenDB license, but provides a layer of protection for the application. We would still recommend that you’ll get for a full cluster, but for users that just want the reassurance that they can hit the breaker and switch, without going into the full cluster investment, that is a good first step.

We’ll announce pricing for this when we release 3.5.

time to read 4 min | 617 words

So the first thing to tackle is the over the wire protocol. RavenDB is a REST based system, working purely within HTTP. A lot of that was because at the time of conception, REST was pretty much the thing, so it was natural to go ahead with that. For many reasons, that has been the right decision.

The ability to easily debug the over the wire protocol with a widely available tool like Fiddler makes it very easy to support in production, script it using curl or wget and in general make it easier to understand what is going on.

On the other hand, there are a bunch of things that we messed up on. In particular, RavenDB is using the HTTP headers to pass the the document metadata. At the time, that seemed like an elegant decision, and something that is easily in line with how REST is supposed to work. In practice, this limited the kind of metadata that we can use to stuff that can pass through HTTP headers, and forced some constraints on us (case insensitivity, for example), and there have been several cases where proxies of various kind inserted their own metadata that ended up in RavenDB, sometimes resulting in bad things happening.

When looking at RavenDB 4.0, we made the following decisions:

  • We are still going to use HTTP as the primary communication mechanism.
  • Unless we have a good reason to avoid it, we are going to be human readable over the wire.

So now instead of sending document metadata as HTTP headers, we send them inside the document, and use headers only to control HTTP itself. That said, we have taken the time to analyze our protocol, and we found multiple places where we can do much better. In particular, the existence of web sockets means that there are certain scenarios that just became tremendously simpler for us. RavenDB has several needs for bidirectional communications. Bulk inserts, Change API, subscriptions, etc.

The fact that web sockets are now pretty much available across the board means that we have a much easier time dealing with those scenarios. In fact, the presence of web sockets in general is going to have a major impact on how we are doing replication, but that will be the topic of another post.

Beyond the raw over the wire protocol, we also looked at a common source for bugs, issues and trouble: IIS. Hosting under IIS means that you are playing by IIS rules, which can be really hard to do if you aren’t a classic web application. In particular, IIS has certain ideas about size of requests, their duration, the time you have to shut down or start up, etc. In certain configurations, IIS will just decide to puke on you (if you have a long URL, it will return 404, with no indication why, for example), resulting in escalated support calls.  One particular support call happened because a database took too long to open (as a result of a rude shutdown by IIS), which resulted in IIS aborting the thread and hanging the entire server because of an abandoned mutex. Fun stuff like that.

In RavenDB 4.0 we are going to be using Kestrel exclusively, which simplify our life considerably. You can still front it with IIS, of course (or ngnix, etc), for operational, logging, etc reasons. But this means that RavenDB will be running its own process, worry about its own initialization, shutdown, etc. It makes our life much easier.

That is about it for the protocol bits, in the next post, I’ll talk about the most important aspect of a database, how the data actually gets stored on disk.

time to read 2 min | 222 words

One of the things that happens when you start using RavenDB is that you start using more of it, and more of its capabilities. The problem there is that often users end up with a common set of stuff that they use, and they need to apply it across multiple databases (and across multiple servers).

This is where the Global Configuration option comes into play. This allows you to define the behavior of the system once, and have it apply to all your databases, and using the RavenDB Cluster option, you can apply it across all you nodes in on go.

image

As you can see in the image, we are allowing to configure this globally, for the entire server (or multiple servers), instead of having to remember to configure it for each. The rest of the tabs are pretty much the same manner. You can configure global behavior (and override it on individual databases, of course). Aside from the fact that it is available cluster-wide, this isn’t a major feature, or a complex one, but it is one that is going to make it easier for the operations team to work with RavenDB.

time to read 3 min | 466 words

I mentioned in my previous post that one of the things that pushed the design of RavenDB 4.0 is retrospective analysis. In particular, , "if we had to do it all over again, rewind the clock to 2008, and design RavenDB from the ground up with all the knowledge that we currently have, what would we do differently?". The problem with this approach is that we have shipping software, we have stuff that customers have been using for over six years, so it isn’t like we can start from scratch, and as tempting a decision that would be, it is almost always the wrong decision. RavenDB 4.0 isn’t a re-write, it is a major architectural change, driven by own experience in what is painful.

That is a good guideline, but what does this mean? We had a few rounds of thoughts around this, and we ended up with the following decisions.

As a major version release, we aren’t bound by backward compatibility, and we are going to take full advantage of that. That means that a 4.0 server cannot be accessed by a 3.0 client, and a 3.0 server can’t replicate to a 4.0 server. From the point of view of the wire protocol, we have taken the chance to fix some long standing issues, but I’ll have another post about that. Internally, an upgrade from 3.0 to 4.0 will probably be done by a dedicated migration tool, unlike the previous “in place” upgrade procedure that we previously had. The reason for those decisions is that this gives us a lot of flexibility in fixing our implementation and change how we are doing things.

At the same time, from an external point of view, users of RavenDB should see as little change as we can get away with. Ideally, the process of upgrading a new piece of software to RavenDB 4.0 should be:

  • Update the Nuget package
  • Recompile

And that would be it. In practice, I’m sure that we’ll have edge cases and things that would require a bit more work from the user, but that is the goal, that as far as users are concerned, they don’t have to do a lot of extra work to upgrade.

We’ll probably have a lot of discussions around what exactly we can change, and what we must absolutely have. In some of the discussion we had with our customers we already learn that running on 32 bits is a hard requirement for some of them, which means that RavenDB 4.0 will support that, even if that makes our life quite a bit more complex.

As a reminder, we have the RavenDB Conference in Texas in a few months, which would be an excellent opportunity to learn about RavenDB 4.0 and the direction in which we are going.

image

time to read 4 min | 642 words

The first thing you’ll see when you run RavenDB 3.5 is the new empty screen:

image

You get the usual “Create a database” and the more interesting, Create Cluster. In RavenDB 3.0 clusters were ad hoc things, you created them by bridging together multiple nodes using replication. In RavenDB 3.5, we have made that into a first class concept. Clicking on the “create cluster” link will give you the following:

image

And then you can start adding additional nodes:

image

RavenDB cluster is using Raft (more specifically, Rachis, our Raft implementation) to bridge together multiple server into a single distributed consensus. This allows you to manage the entire cluster in one go.

Once you have created the cluster, you can always configure it (add / remove nodes) by clicking the cluster icon on the bottom.

image

Raft is a distributed consensus algorithm, which allows a group of server to agree on the order of a set of commands to execute (not strictly true, but for more details, see my previous posts in the topic).

We use Raft to connect server together, but what does this mean? RavenDB replication is still running in a multi master node, which means that each server can still accept requests. We use Raft for two distinct purposes.

The first of which is to distribute changes of operations across the cluster. Adding a new database, changing a configuration settings. This allows to be sure that such changes are accepted by the cluster before they are actually performed.

But the one that you’ll probably notice most of all is that shown in the cluster image, Raft is a strong leader system, and we use it to make sure that all the writes in the cluster are done on the leader. If the leader is taken down, we’ll transparently select a new leader, clients will be informed about it, and all new writes will go to the new leader. When the previous leader recover, it will join the cluster as a follower, and there will be no disruption of service.  This is done for all the databases in the cluster, as you can see from the follow topology view:

image

Note that while the leader selection is handled via Raft, the leader database is replicating to the other nodes using multi-master system, so if you need to wait until a document is present in a majority of the cluster, you need to wait for write assurance.

The stable leader in the presence of failure means that we won’t have the clients switching back and forth between nodes, they will be informed of the leader, and stick to it. This generates a much more stable network environment, and allowing you to configure the cluster details once and have it propagate everywhere is a great reduction in operational work. I’ll talk more about this in my next post in this series.

As a reminder, we have the RavenDB Conference in Texas in a few months, where we’ll present RavenDB 3.5 in all its glory.

image

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 5 days from now
  3. What happens when a sparse file allocation fails? - 7 days from now
  4. NTFS has an emergency stash of disk space - 9 days from now
  5. Challenge: Giving file system developer ulcer - 12 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}