Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,161
Privacy Policy · Terms
filter by tags archive
time to read 1 min | 160 words

FizzBuzz is a well known test to show that you can program. To be rather more exact, it is a simple test that does not tell you if you can program well, but if you cannot do FizzBuzz, you cannot program. This is a fail only kind of metric. We need this thing because sadly, we see people that fail FizzBuzz coming to interviews.

I have another test, which I feel is simpler than FizzBuzz, which can significantly reduce the field of candidates. I show them this code and ask them to analyze what is going on here:

Acceptable answers include puking, taking a few moments to breathe into a paper bag and mild to moderate professional swearing.

This is something that I actually run into (about 15 years ago, in the WebForms days) and I have used it ever since. That is a great way to measure just how much a candidate knows about the environment in which they operate.

time to read 1 min | 84 words

I run into the following code during code review and had an immediate and visceral reaction.

This is a (bad) attempt to add thread safety, because you are getting a value through a read only interface, but there is still the mutable instance to work with at the source, and now you have someone that observes the instance while it is being mutated, outside the lock.

The proper way to handle this is to copy the list (under the lock) and return a distinct copy.

time to read 1 min | 199 words

The end of the year is closing fast, and I run into the following metric (below). What you can see here is one of our RavenDB production instances over the past year. We are continuously dogfooding our own software, and there is a clear indication of the results.

What you can see here is the total memory used by RavenDB (production load, fairly constant over time)  for the past year. As we update RavenDB, we benefit from various optimizations, and the trend line is very encouraging.

image

Around August, we had a change that saved us a single allocation in some cases, here is the chance, you can see the impact it had:

image

We also started using a new feature in production around December, and that seems to have an additional memory cost, so we optimized that as well:

image

You can see the new build deployed around the 17th of the month.

time to read 6 min | 1182 words

The file pager needs to know what values it has in memory and what it needs from the disk. Instead of tracking values on a per page level, we are going to do that on a chunk basis, where each chunk in 2MB (256 pages). A single file is going to be limited to 8 GB in size, so we have a maximum of 4,096 chunks in a file. We can allocate a simple array of metadata for the entire file in a single shot. That means that we don’t have to do reallocation when we grow the size of the file (up to the 8GB maximum). Let’s consider what metadata we need to know about the chunks we have:

  • What is the status of the chunk (in memory, on the disk, being loaded or errored).
  • How many outstanding references we have for a chunk?
  • Where do we find the actual chunk data in memory, when it is loaded?

The whole thing is made complex because we have to consider concurrency. Multiple threads may try to load a chunk at the same time, we may need to release the memory of a chunk to make room for loading another, etc. We also need to consider issues such as I/O failures, optimizing I/O patterns, etc. For now, I/O will be handled by another post. I want to focus just on how we will deal with the metadata.

A major PITA with concurrency is how to handle reference tracking. If a thread is reading from a chunk, we cannot release it. That leads us to reference counting, but that is tough to do atomically. You have to deal with the ABA problem, to start with. For that reason, we want to limit chunk metadata to 8 bytes in total. This will allow us to use atomic instructions to modify the metadata safely.

Using just 8 bytes is a very low amount. We know that the chunks we’ll use are 2MB in size. We can assume that we’ll also align them on 2MB boundary. That means that the lower 20 bits are unused, we can repurpose them. On x64 and ARM64, the top 16 bits are also unused (not always true, since from 2019 we have IceLake that has PML5, which uses 57 bits, but very likely to be the case). In most systems, the 47th bit will be used for kernel vs. user memory, so that will be cleared as well. That means that we actually only need 64 – 17 – 20 = 27 bits to store the pointer value. We can repurpose the other 37 bits.

There are actually several ways in which we can do this. The compressed pointer method is just one of them. I decided to not go that route. Instead, we are going to have the following structure:

This is a packed bit field struct which can fit into a 64 bits value. Note that we have fields for the type of the value, the version (for ABA) and the number of references. In addition to that, we also have the actual value, which is specified in offsetInPages. Let’s talk about sizes here.

  • The tag field has four options, as you can see.
  • The version field is 16 bits, which means that it can have 65,536 possible values. It will be incremented on every change to the value and used to avoid false successes when updating the value concurrently.
  • The references field is 20 bits in size, giving us 1 million values here. That is the number of concurrent references that it can support. That looks like big enough value that we shouldn’t care about it.
  • The offsetInPages field is 26 bits in size. Assuming 4 KB pages, we can reference up to 256 GB of memory. We’ll want to support machines with higher memory than that, which is why we’ll also add the concept of base. For a single file, all the allocations must come in the same 256 GB range. I don’t expect that to be a big problem, and different files can have different bases.

The fact that all of that fits in 64 bits means that we can use simple Compare & Swap atomic operations and avoid the need for 128 bits atomic instructions. To be fair, cmpxchg16b has been around forever. I believe that you can do that on ARM as well, but I’m not sure how.

At any rate, let’s look at the ChunkMetadata struct in all its glory, then we’ll discuss what is going on:

The ChunkMetadata can be in one of four states:

  • Empty – there is no value
  • Error – we tried to load the chunk, but failed for some reason. In that case, the actual error code is stored in offsetInPages.
  • Loading – we are currently loading the chunk, and callers can decide to wait for this or try again later.
  • Value – there is a value in the chunk and it is available immediately.

When we get() a value we check what the current state of the metadata is and in all but the Value case we’ll return immediately. If there is a value, we can’t just return it to the caller. We need to increment the reference count. That is most of the code in the get() method. We increment the references, do a wrapping increment for the version (so each change will be unique) and then use an atomic operation to update the value. The idea is that two concurrent threads getting the value at the same time will always increment or decrement the references properly. That will be quite important later on.

After you are done with the chunk, you can release() it, which will decrement the reference count. Note that reference count of 0 is wrong, we aren’t handling actual releasing of values yet. That will come in another post.

The trySet() function is responsible for the other side, it will set the value or the error, taking care of the concurrency aspects of the ChunkMetadata. Of particular interest here, however, is the Futex.wake() call. That deserves some detail.

Consider the sequence of events for accessing a chunk. We may have two threads that try to get a chunk, but they find that it is not resident in memory. It needs to be loaded, but we don’t want both threads to do so at once. Therefore, the threads will compete on moving the chunk from the Empty state to the Loading state. After which, the thread that won the race will need to schedule the actual I/O. What does the other thread do in the meantime? It needs to wait until the I/O is completed. This is done using the waitForValue() method, where we interpret the first half of the chunk metadata (the one holding the version field) as a Futex.wait  value. In other words, the thread will sleep until the trySet() call will wake it.

That is enough talking about the ChunkMetadata, I think. We went over that in detail, for my next post, I want to talk about how we deal with what is likely to be the most interesting bit of the file pager, managing the actual chunks.

time to read 7 min | 1275 words

In the previous post, I showed how we can get a pretty nice pager (important for building a storage system) in under 100 lines of code using mmap(). If that was all of it, it would be a pretty short series of posts. However, I want to explore what it would take to take ownership of that part of the storage system and build our own from scratch. Let’s see what it would take to build a pager when we are doing the I/O.

In the mmap() implementation, I didn’t really have a lot of states. Just the mapping and that was pretty much it. When building our own, we need to track a whole lot more states. Off the top of my head, we need to:

  • Track what pages we handed out to callers.
  • Track usage of pages so we’ll know when to release them.
  • Manage concurrency explicitly between threads.
  • Handle several scenarios that were just… working on the mmap() implementation.

For example, let’s talk about what kind of state I need. Zig comes with a hash table (and does that beautifully for an unmanaged language), so I can do this, right?

pages: std.AutoHashMap(u64, Page),

That would be a mapping between the pages in memory and the memory we allocated for them. Except… that it doesn’t quite work like that. One of the key issues that we have to deal with is the fact that while most of the time we will ask for a page, we can also ask for a continuous run of pages.

We can safely assume that the caller is responsible for ensuring that there is no duplication. In other words, the following sequence of calls is invalid:

That is a very important limit to how much complexity we have to deal with, I have to note. Another thing to deal with is concurrency. How do we deal with scenarios where two threads want to get pages (which may or may not be the same)?

Anther consideration is about reduce the overall cost of I/O, we don’t want to issue too many operations, both for reads and for writes. That pushes us toward batching operations as much as possible. Here is the overall design that I have for the file pager:

image

In other words, even though we are dealing with 8KB pages, the pager itself will issue work with chunks of 2MB in size each time. The idea is that we can  amortize the cost of going to the disk by ensuring that we’ll do bulk I/O. That, in turn, means that we have to consider some aspects of our system very early on.

In the case of the mmap pager, we didn’t really need to think about caching, that was the responsibility of the operating system. In the case of this pager, we must have a cache, and if we cache a chunk, we can probably benefit greatly from locality of reference, which is always nice.

The 2MB chunk size design decision complicate our lives. The pager needs to handle both single pages access and work with values that may span multiple pages. As long as they reside in a single chunk, that is pretty easy. But we need to consider how we’ll manage to work with values that are bigger than 2MB in size. It’s interesting, because even at this very early stage, a design decision on how big the size we fetch from the disk will have impact for the implementation of the entire system.

As early as we are, we can make the following assumption / requirements from our callers:

  • Most of the access is going to be for single pages.
  • Some of the accesses will be for multiple pages, but under the 2 MB chunk limit.
  • Few accesses will need to work with multiple pages over the 2 MB limit.

That is important because it impacts the way we think about the system. Earlier in this post, I mentioned using a hash map to store the references to the pages. With chunks, we can probably adjust slightly and be done with it, right?

Except that we really can’t. One of the primary issues that we have to deal with is the fact that this is meant to be concurrent. A hash map isn’t going to support that and will need to be protected by a lock. Interestingly, most concurrent data structures pretty much require garbage collection of some sort and building them with an unmanaged system is quite complex.

How do we deal with this issue? It turns out that it is far simpler to have an array to hold those references and access each element using atomic instructions. Here we run into another design decision. Are we going to have a single file or multiple files? That matters because if we have a single file, we need to deal with increasing the file size on the fly. That means that the array of references would need to grow, and that is also complex with concurrent access. If we have multiple files, we can just create a completely new file as needed. We can allocate a single array at the maximum file size and not worry about it. There are other reasons why we might want to use multiple files (such as making it easier to release space back to the file system), so we’ll go with multiple files.

That means that we can reasonably set the maximum file size at 8GB (remember the big values issue, I think it is reasonable to set the max size of a value at 2GB, so 8GB is plenty). With 8GB files, we are talking about 4,096 chunks of 2 MB each. Assuming that we’ll use an 8 bytes struct to hold the data about each chunk, that means that we can safely allocate the maximum size of 32Kb upfront. If we need to increase the size of the file, we already allocated the place for its metadata. That gives us a far simpler system (no need to try to manage concurrent accesses) at a small memory cost.

Now, we can require that page allocations that are below 2 MB in size will always be aligned inside a page boundary. But what happens when we have a value whose size exceeds 2MB? The answer to that is that we are going to require the calling code to follow specific patterns for that. We require that any value that is greater than 2MB will be aligned on a 2MB boundary from the end of the final chunk. Here is what this looks like, the yellow marked pages are allocated on two separate chunks, and you can see how we aligned this on the end:

image

The nice thing about this approach is that we know that the caller will not do partial calls. If we asked for pages 5 - 10, there can be no call to page 6 on an independent basis. As such, when we ask for a value that is bigger than a single chunk, it will always be expressed as a load from the starting chunk to the end. That means that we can load the full value in a single I/O call.  Here, again, we have very low level concerns affecting how we lay out the data on disk.

There are other aspects that we need to consider, such as eviction policies, how to handle concurrency, etc. But that is enough for one post, I intentionally want to limit the scope of what we do to avoid getting mired in the details. Expect more in the next post in the series.

time to read 4 min | 633 words

Now that we know what we want to implement, let’s dig a bit deeper and see how to do it. An interesting way to implement a file pager is to… not do that. Instead, we can rely on the OS’ memory mapping to do most of the heavy lifting. Let’s see how we can do that.

The first thing that we need to manage is the setup and teardown of the pager, which you can see here:

There isn’t much here, we simply call mmap() and that is about… it. Let’s see how we can implement the actual pager behavior. We’ll start with the easy pieces here getting and releasing the memory from the pager:

You’ll notice that we don’t actually have anything here? Even the act of checking that the page is within the bound of the mapped memory is done by slicing the ptr directly. What about the blocking part? How do we actually move the data to memory? The answer is that we aren’t. When you access the pointer we return from the get(), we’ll just get a page fault and the OS will read the data from the disk.  The release() function also doesn’t need to do much, all the behavior is inside the mmap() implementation, after all.

A bit more complex is the part where we try to get the pages from the disk, here is the tryGet() implementation:

That is quite a bit of code for not much in practice. We create a temporary array and then call mincore() on the range of memory that we’ll return. If the entire range is not already in memory, we’ll call madvice() to load it in the background and return null. If the range is already in memory, just return it.

This isn’t 100% safe to do, by the way, there may be race conditions that would cause us to think that the data is in memory just as it is swapped to disk, but that is good enough for our needs. Especially because the whole thing is quite simple overall.

The next stage is to handle writes and syncing to disk. This is simplicity itself, in this model.

Since we handed out a buffer from the memory map itself, we don’t need to do any copying, we already modified that range of memory. And when we sync to the file, we can do that by a single msync() call. There are a few things to note here, though:

  • Because we are writing directly to the memory mapped file, it is possible that our changes will show up in the file before write and sync are called.
  • The msync() will sync the entire range, if we have smaller changes that we made, we can try to reduce the amount of memory that is synced by remembering what parts we have written to, but it ends up being quite a chore. And since the OS is already doing that for us, we can shell that to it directly.

And that is pretty much it. The whole pager is under 100 lines of code.

There are some things that I don’t handle, such as what happens if we want to extend the size of the file. That requires us to re-wire the mapping, if we are going by the strict reading of the API. But in both Linux & Windows, you can define a memory mapping that is greater than the file and that will automatically adjust as you grow the file. That is quite a nice feature for us and can save us a lot of management overhead internally.

With that out of the way, we can start implementing higher level functions in a storage system. But notice how we moved pretty much everything to the OS? What would it look like if we wanted to build that ourselves?

time to read 2 min | 397 words

A file pager is a component in database systems that is responsible for reading and writing pages (typically 8KB blocks) from the file system. The pager is responsible for the I/O operations and is crucial for the overall performance of the system. Ideally, it should manage details such as caching pages in memory, reduce I/O costs and continuously optimize the overall behavior of the storage.

That can be a pretty big chunk of  a storage system, and it can have a significant impact on the way the storage system behaves. Here is the most basic version that I can think of:

The idea is that whenever you need a particular page, you’ll call it using tryGet() which will return the document if it is already in memory, but it will not block. You can call getBlocking() to force the current thread to wait for the page to be in memory. That allows the calling code to perform some really nice optimizations.

Once we got the page, the Pager is charged with keeping it in memory until we will release it. Note that I’m talking about a Page, but that might actually contain multiple sequential pages. The release() call tells the Pager that the memory is no longer in active use, the Pager may decide to do something about that.

Finally, we have the write() method, which will write the data from the in-memory page to storage, and the sync() method, which will ensure that all previous writes are durable to disk.

There aren’t that many moving pieces, right? Not in particular that we don’t have the notion of transactions here, this is lower level than that. This API has the following properties:

  • The same page will always be represented in memory by the same location. However, if we release and get the page again, it may move.
  • The methods tryGet(), getBlocking() and release() have no locking or threading limits. You may call them in any context and the Pager will deal with any concurrency internally.
  • The write() and sync() calls, on the other hand, require synchronization by the client. There can be no concurrency between the two.

With that in place, we can build quite a sophisticated storage system. But we’ll focus on how the pager works for now.

There are a bunch of ways to implement this, so I’ll have at least a couple of posts on the topic. How would you approach implementing this?

time to read 5 min | 912 words

I got into a good discussion about how RavenDB implements some optimizations with transaction handling. The details got big enough (and hopefully interesting enough) that they warrant their own post.

When we are talking about transactions, one of the key factors in the cost of a transaction is the amount of time that it takes to persist that. This is different for local and distributed transactions.

For a local transaction, we can consider the transaction committed if it is durably stored on the disk.

For a distributed transaction, we can consider the transaction committed if it is durably stored on a majority of the disks in the cluster.

That factor right there is the primary reason why a distributed transaction is more expensive. For a local transaction, you are waiting for the disk. For a distributed transaction, you are waiting for multiple disks and the network.

One of the core optimizations for speeding up transactions is the ability to batch things. The cost of writing to the disk is more or less the same, regardless of how much you write (within an order of magnitude or so). In other words, writing 8 KB and writing 32 KB has pretty much the same cost. Writing 1 MB and writing 100 MB does not, but writing 1 MB vs 4 MB isn’t meaningfully different (sequential durable write, which is what we care for in the case of transactions).

The point of this post is how this is actually handled. RavenDB utilizes a process called transaction merging to reduce the number of times that we have to go to the disk. Concurrent transactions will be bundled into the same write call, massively increasing our throughput. To give you some context, without transaction merging, you can peak at a few hundreds transactions per second. With transaction merging, you can jump to high thousands of transactions per second. Here is how this works:

image

RavenDB actually takes this further, in addition to transaction merging, we also apply something we call async commit. Take a look at the following timeline:

image

A transaction is actually composed of two separate steps. First we need to execute whatever commands we have in the transaction, then we have to write the transaction changes to disk.

RavenDB is able to start processing the next transaction as soon as the previous one started the write to the disk. The idea is to parallelize compute and I/O, and we are able to benefit greatly as a result. Note that this is safe to do, since the next transaction won’t be committed until the prior transaction has been durably stored.

How does this work in practice? Whenever we have a new transaction, we add it to a queue. A dedicated thread will merge those transactions and pull them from the queue, running the separate transactions as one big operation. When we run out of pending transactions or hit certain size / timing limits, we’ll commit the merged transaction and start working on the next one while the commit is completing in the background.

There are certain algorithms that try to maximize throughput, such as Nagle. They do that by waiting for additional transactions to arrive before actually going to the disk. RavenDB doesn’t use that approach. If a system is idle and we get a single transaction, we’ll immediately execute and commit it.

But the fact that we don’t explicitly do Nagle doesn’t mean that it isn’t useful. Because we have to wait for the disk, what ends up happening is that under load, we start getting more pending transactions in the queue. Which will then be executed as a merged unit. In other words, RavenDB implements a dynamic batching approach, affected by the actual I/O constraints and the load on the system. If we have independent transactions, we’ll execute them immediately. As the load increases, we’ll start making more merged transactions. This way we can keep a fairly consistent response time even when the load of the system grows by leaps and bounds.

The situation is similar when we are talking about distributed transactions. RavenDB uses the Raft protocol for managing its distributed behavior. I’m going to focus just on the batching aspect of the behavior. RavenDB will send an AppendEntries message to the other members in the cluster every 200 ms or so. However, if we have a new command to send to the cluster, it will go out immediately over the network. An  important factor here is that we are using TCP, and we require acknowledgment from the other side before we send the next message. As a result of those behaviors, we have pretty much the same situation. Depending on the network latency and the processing time, we’ll send more entries in a single roundtrip.

In short, the overall behavior for RavenDB is that we’ll start the operation immediately on the first action (both for disk and network), and then we’ll batch anything that happens while the first operation is in flight and send that as a result.

After over a decade of working in this manner, I can tell that this has proven to be a highly adaptable system that results in the minimum number of knobs to mess with. It favors latency over throughput when there isn’t a lot of activity and shifts toward favoring throughput over latency as the load grows.

time to read 2 min | 324 words

imageThe phrase “work well under pressure” is something that I consider to be a red flag in a professional environment. My company builds a database that is used as the backend of business critical systems. If something breaks, there is a need to fix it. It costs money (sometimes a lot of money) for every minute of downtime.

Under such a scenario, I absolutely want the people handling the issue to remain calm, collected and analytical. In such a case, being able to work well under pressure is a huge benefit.

That is not how this term is typically used, however. The typical manner you’ll hear this phrase is to refer to the usual working environment. For example, working under time pressure to deliver certain functionality. That sort of pressure is toxic over time.

Excess stress is a well known contributor to health issues (mental and physical ones), it will cause you to make mistakes and it adds frictions all around.

From my perspective, the ability to work well under pressure is an absolutely important quality, which should be hoarded. You may need to utilize this ability in order to deal with a blocking customer issue, but should be careful not to spend that on non-critical stuff.

And by definition, most things are not critical. If everything is critical, you have a different problem.

That means that part of the task of the manager is to identify the places where pressure is applied and remove that. In the context of software, that may be delaying a release date or removing features to reduce the amount of work.

When working with technology, the most valuable asset you have is the people and the knowledge they have. And one of the easiest ways to lose that is to burn the candle at both ends. You get more light, sure, but you also get no candle.

time to read 3 min | 420 words

imageA RavenDB user has accidentally deleted a collection. They intended to do something else, probably, but…  They have a backup, but as you can imagine, this is a bad place to be in.

They talked to us and mentioned that they want a feature where deletion in the studio can be locked by a password or even two factor authentication, to prevent such a scenario.

We are not going to implement such a feature. From a technical perspective, this is a pretty easy thing to do, of course. My issue is that it doesn’t make sense for such a feature to exist. Let me dig into the details and explain what the problem is.

Locking deletes behind a password or two factor authentication is a security feature. That has a major impact on all aspects of the design. However, this is about preventing mistakes on the part of the user, not another security capability (this user can do deletes, this one cannot).

As such, this isn’t a security feature, but a UX one. The delete is already asking for confirmation, but it is the sort of thing that you rarely read,  as we all know.

The distinction between a security feature and a UX feature is important. If this is a security feature, that means that I need to prevent doing mass deletes everywhere. As the result of queries, iterating over ids, in patch operations, etc. If this is a UX issue, that is a whole different level.

Looking at other such destructive operations, where the user is allowed to do the operation, but we want to prevent accidents leads me to consider something like this:

image

Where we require the user to perform some action if there is a major risk. That shifts the burden to the user, but it means that we now need to consider how to apply this.

Are we dealing with just mass deletes? What about update queries?

The purpose here isn’t to prevent the user from making the operation, but to have them stop and consider for a moment. The problem is that for common operations, that is something that you would add a significant amount of friction to your daily work.

When working on importing data, for example, it is common to delete the previous run each time that you run (think, development time, every 3 minutes). Adding hurdles along the way is a PITA.

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 5 days from now
  3. What happens when a sparse file allocation fails? - 7 days from now
  4. NTFS has an emergency stash of disk space - 9 days from now
  5. Challenge: Giving file system developer ulcer - 12 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}