time to read 3 min | 537 words

Memtables are conceptually a very simple thing. You have the list of values that you were provided, as well as a skip list for searches.

Complications:

  • Memtables are meant to be used concurrently.
  • We are going to have to hold all of our values in memory. And I am really not sure that I want to be doing that.
  • When we switch between memtables (and under write conditions, we might be doing that a lot), I want to immediately clear the used memory, not wait for the GC to kick in.

The first thing to do was to port the actual SkipList from the leveldb codebase. That isn’t really hard, but I had to make sure that assumptions made for the C++ memory model are valid for the CLR memory model. In particular, .NET doesn’t have AtomicPointer, but Volatile.Read / Volatile.Write are a good replacement, it seems. I decided to port the one from leveldb because I don’t know what assumptions other list implementations have made. That was the first step in order to create a memtable. The second was to decide where to actually store the data.
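To give a feel for what that translation looks like, here is a minimal sketch (my own illustration, under assumptions; this is not the actual ported code) of a skip list node that publishes its next pointers with Volatile.Write and reads them with Volatile.Read, standing in for leveldb’s AtomicPointer:

    using System.Threading;

    // Sketch only: a skip list node whose next[] links are published with
    // Volatile.Write and observed with Volatile.Read. The single writer sets
    // the link last, so readers that see the pointer also see a fully built node.
    public class SkipListNode
    {
        public readonly byte[] Key;
        private readonly SkipListNode[] _next;

        public SkipListNode(byte[] key, int height)
        {
            Key = key;
            _next = new SkipListNode[height];
        }

        public SkipListNode Next(int level)
        {
            // Acquire semantics: safe to follow concurrently with inserts
            return Volatile.Read(ref _next[level]);
        }

        public void SetNext(int level, SkipListNode node)
        {
            // Release semantics: publish the node to concurrent readers
            Volatile.Write(ref _next[level], node);
        }
    }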

Here is the most important method for that part:

   public void Add(ulong seq, ItemType type, Slice key, Stream val)

The problem is that we cannot just reference this. We have to copy those values into memory that we control. Why is that? Because the user is free to change the Stream contents or the Slice’s array as soon as we return from this method. By the same token, we can’t just batch this stuff in memory, again because of the LOH. The way this is handled in leveldb never made much sense to me, so I am going to drastically change that behavior.

In my implementation, I decided to do the following:

  • Copy the keys to our own buffer, and keep them inside the skip list. This is what we will use for actually doing searches.
  • Change the SkipList to keep track of values, as well as the key.
  • Keep the actual values in unmanaged memory, instead of managed memory. That avoids the whole LOH issue and gives me immediate control over when the memory is disposed.

This took some careful coding, because I want to explicitly give up on the GC for this. That means that I need to make damn sure that I don’t have bugs that would generate memory leaks.

Each memtable would allocate 4MB of unmanaged memory, and would write the values to it. Note that you can write over 4MB (for example, by writing a very large value, or by writing a value whose length exceeds the 4MB limit). At that point, we would allocate more unmanaged memory, and hand over the memtable to compaction.
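To make the lifetime handling concrete, here is a rough sketch of what such an unmanaged value buffer could look like; the type name, the allocation strategy and the API are my own assumptions for illustration, not the actual implementation:

    using System;
    using System.Runtime.InteropServices;

    // Sketch only: a fixed-size unmanaged buffer for memtable values, freed
    // explicitly instead of waiting for the GC.
    public class UnmanagedValueBuffer : IDisposable
    {
        private readonly IntPtr _start;
        private readonly int _capacity;
        private int _used;

        public UnmanagedValueBuffer(int capacity = 4 * 1024 * 1024)
        {
            _capacity = capacity;
            _start = Marshal.AllocHGlobal(capacity);
        }

        // Copies the value into the unmanaged block and returns its offset,
        // or -1 when the buffer is full and the memtable should be handed
        // over to compaction.
        public int Write(byte[] value, int count)
        {
            if (_used + count > _capacity)
                return -1;
            Marshal.Copy(value, 0, IntPtr.Add(_start, _used), count);
            int offset = _used;
            _used += count;
            return offset;
        }

        public void Dispose()
        {
            // Explicit release: the memory is gone right now, not when the GC runs
            Marshal.FreeHGlobal(_start);
        }
    }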

The whole thing is pretty neat, even if I say so myself.

time to read 7 min | 1383 words

This is an example of an SST that stores:

  • tests/0000 –> values/0
  • tests/0001 –> values/1
  • tests/0002 –> values/2
  • tests/0003 –> values/3
  • tests/0004 –> values/4

As well as a bloom filter for rapid checks.

[image: hex dump of the full SST file]

If you are wondering about the binary format, that is what this post is all about. We actually start from the end: the last 48 bytes of the file are dedicated to the footer.

[image: hex dump of the 48 byte footer]

The footer format is:

  • The last 8 bytes of the file are a magic number: 0xdb4775248b80fb57ul – this means that we can quickly identify whether this is an SST file or not.
    Here is what this number looks like broken down to bytes:
    [image: the magic number broken down into individual bytes]
  • The other 40 bytes are dedicated to the metadata handle and the index handle.

Those are two pairs of longs, encoded using 7 bit encoding. In our case, here is what they look like:

[image: the bytes of the metadata handle and the index handle]

Let us see if we can parse them properly:

  • 100
  • 38
  • 143, 1 = 143
  • 14

Note that in order to encode 143 properly, we needed two bytes, because it is higher than 127 (we use the high bit of each byte to indicate whether there are more bytes to read). The first two values are actually the metadata handle (position: 100, count: 38), the second two are the index handle (position: 143, count: 14).
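If you want to follow along with the decoding, here is a small sketch of a 7 bit (varint) decoder of the kind used for these handles; the method name and signature are mine, for illustration only:

    // Sketch of 7 bit (varint) decoding: each byte contributes its low 7 bits,
    // and a set high bit means another byte follows.
    public static ulong Read7BitEncodedUInt64(byte[] buffer, ref int pos)
    {
        ulong value = 0;
        int shift = 0;
        while (true)
        {
            byte b = buffer[pos++];
            value |= (ulong)(b & 0x7F) << shift;
            if ((b & 0x80) == 0)   // high bit clear: this was the last byte
                return value;
            shift += 7;
        }
    }

Feeding it the bytes above gives 100 and 38 for the metadata handle, then 143 (from the two bytes 143, 1) and 14 for the index handle.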

We will start by parsing the metadata block first:

[image: hex dump of the metadata block]

You can see the relevant portions in the image.

The actual interesting bits here are the first three bytes. Here we have:

  • 0 – the number of shared bytes with the previous key (there is no previous key, which is why it is zero).
  • 25 – the number of non shared bytes (in this case, the full key, which is 25 bytes).
  • 2 – the size of the value; in this case, the value is the handle in the file of the data for the filter.

You can read the actual key name on the left, “filter.BuiltinBloomFilter”, and then we have the value itself:

[image: the filter handle bytes]

You can probably guess that this is the filter handle (position: 82, count: 18).

The rest of the data is two 4 byte integers. Those are the restart array (positions 130 to 133) and the restart count (positions 134 to 137). Restarts are a very important concept for reducing the size of the SST, but I’ll cover them in depth when talking about the actual data, not the metadata.

Next, we have the actual filter data itself, which looks like this:

[image: hex dump of the filter data]

This is actually a serialized bloom filter, which allows us to quickly decide if a specified key is here or not. There is a chance for errors, but errors can only be false positives, never false negatives. This turns out to be quite useful down the road, when we have multiple SST files and need to work with them in tandem. Even so, we can break it apart into a more detailed breakdown:

  • The first 8 bytes are the actual serialized bloom filter bits.
  • The 9th byte is the k value in the bloom filter algorithm. The default value is 6.
  • The last value (11) is the lg value (also used in the bloom filter algo).
  • The rest of the data is a bit more interesting. The 4 bytes preceding the 11 (those are 9,0,0,0) are the offset of valid data inside the filter data.
  • The four zeros preceding that are the position of the relevant data in the file.

Honestly, you can think about this as a black box; knowing that it holds the filter data is probably enough.
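If you do want to peek inside the black box anyway, the core idea is the standard bloom filter check: hash the key k times and test k bits. The sketch below is a generic illustration (not leveldb’s actual hashing scheme) of why a miss is definitive while a hit can be a false positive:

    // Generic bloom filter membership check, purely illustrative; leveldb uses
    // a different hashing scheme. 'bits' is the filter bit array and 'k' is the
    // number of probes (the 6 from the 9th byte in the dump above).
    public static bool MayContain(byte[] bits, int k, byte[] key)
    {
        int nBits = bits.Length * 8;
        uint h = Fnv1aHash(key);
        uint delta = (h >> 17) | (h << 15);      // derive the other probes from the first hash
        for (int i = 0; i < k; i++)
        {
            int bit = (int)(h % (uint)nBits);
            if ((bits[bit / 8] & (1 << (bit % 8))) == 0)
                return false;                    // a clear bit proves the key was never added
            h += delta;
        }
        return true;                             // every bit set: present, or a false positive
    }

    private static uint Fnv1aHash(byte[] key)
    {
        uint h = 2166136261;                     // FNV-1a, a stand-in for the real hash
        foreach (var b in key)
        {
            h = (h ^ b) * 16777619;
        }
        return h;
    }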

Now that we are done with the filter, we have to look at the actual index data. This is located at 143, but we know that the metadata block actually ends at 100 + 38 = 138, so why the gap? The answer is that after each block, we have a block type (compressed or not, basically) and the block CRC, which is used to determine if the file has been corrupted.

Back to the index block: it starts at 143 and goes on for 14 bytes, looking like this:

[image: hex dump of the index block]

The last 4 bytes (a 32 bit int equal to 1) are the number of restarts that we have for this block. And the 4 bytes preceding them (a 32 bit int equal to 0) are the offset of the restarts in the index block.

In this case, we have just one restart, and the offset of that is 0. Hold on about restarts, I’ll get there. Now let us move to the beginning of the index block. We have the following three bytes: 0,1,2.

Just like in the meta block case, those fall under (shared, non shared, value size) and are all 7 bit encoded ints. That means that there is no shared data with the previous key (because there is no previous key), the non shared data is 1 byte and the value size is 2. If you memorized your ASCII table, that single key byte, 117, is lower case ‘u’. The actual value is a block handle, this time for the actual data associated with this index entry. In this case, a block with position: 0 and count: 70.

[image: hex dump of the start of the data block]

Let us start analyzing this data. 0, 10, 8 tells us shared, non shared, value size. Indeed, the next 10 bytes spell out ‘tests/0000’ and the 8 after that are ‘values/0’. And what about the rest?

[image: hex dump of the rest of the data block]

Now we have 9,1,8. Shared is 9, non shared 1 and value size is 8. We take the first 9 bytes of the previous key, giving us ‘tests/000’ and append to it the non shared data, in this case, byte with value 49 (‘1’ in ASCII), giving us a full key of ‘tests/0001’. The next 8 bytes after that spell out ‘values/1’. And the rest is pretty much just like it.

Now, I promised that I would talk about restarts. You see how we can use the shared/non shared data to effectively compress data. However, that has a major hidden cost. In order to figure out what the key is, we need to read all the previous keys.

In order to avoid this, we use the notion of restarts. Every N keys (by default, 16), we will have a restart point and put the full key into the file. That means that we can skip ahead based on the position specified in the restart offset, and that in turn is governed by the number of restarts that we have in the index block.
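To make the shared/non shared mechanics and the restart behaviour concrete, here is an illustrative decoding sketch (names and simplifications are mine, not the actual code). At a restart point shared is always zero, so decoding can begin at any restart offset without reading anything that came before it:

    using System;
    using System.Collections.Generic;

    // Sketch: decode block entries starting either at offset 0 or at one of the
    // offsets stored in the restart array. The lengths are single bytes here for
    // simplicity; the real format uses 7 bit encoded ints.
    public static IEnumerable<KeyValuePair<byte[], byte[]>> ReadEntries(byte[] block, int pos, int end)
    {
        var prevKey = new byte[0];
        while (pos < end)
        {
            int shared = block[pos++];
            int nonShared = block[pos++];
            int valueSize = block[pos++];

            var key = new byte[shared + nonShared];
            Array.Copy(prevKey, key, shared);                 // reuse the prefix of the previous key
            Array.Copy(block, pos, key, shared, nonShared);   // append the non shared tail
            pos += nonShared;

            var value = new byte[valueSize];
            Array.Copy(block, pos, value, 0, valueSize);
            pos += valueSize;

            prevKey = key;
            yield return new KeyValuePair<byte[], byte[]>(key, value);
        }
    }

Running this over the dump above yields ‘tests/0000’ -> ‘values/0’, then ‘tests/0001’ -> ‘values/1’, and so on.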

And… that is pretty much it. This is the full and unvarnished binary dump of an SST. Obviously real ones are going to be more complex, but they all follow the same structure.

time to read 4 min | 637 words

When reading an SST, we have to deal with values of potentially large sizes. I want to avoid loading anything into managed memory if I can possibly avoid it. That, along with other considerations, has led me to use memory mapped files as the underlying abstraction for reading from the table.

Along the way, I think that I made the following assumptions:

  • Work in little endian only.
  • Can work in 32 bits, but 64 bits are preferred.
  • Doesn’t put pressure on the .NET memory manager.

In particular, I am worried about users wanting to store large values, or a large number of keys.

As it turned out, .NET’s memory mapped files are a pretty good answer for what I wanted to do. Sure, it is a bit of a pain with regards to how to handle things like WinRT / Silverlight, etc. But I am mostly focused on the server side for now. And I got some ideas on how to provide the mmap illusion on top of regular streams for platforms that don’t support it.

The fact that an SST is written by a single thread, and once it is written it is immutable, has drastically simplified the codebase. Although I have to admit that looking at a hex dump to figure out that I wrote to the wrong position is a bit of a bother, but more on that later. A lot of the code is basically the leveldb code, tweaked for .NET use. One important difference that I made was with regards to the actual API.

  • Keys are assumed to be small. Most of the time, less than 2KB in size, and there are optimizations in place to take advantage of that. (It will still work with keys bigger than that, but will consume more memory).
  • In general, through the codebase I tried to put major emphasis on performance and memory use even at this early stage.
  • Values are not assumed to be small.

What does the last one mean?

Well, we aren’t going to map the entire file into our memory space. To start with, it might be big enough to start fragmenting our virtual address space, but mostly because there is no need. We always map just a single block at a time, and usually we never bother to read the values into managed memory, instead just accessing the data directly from the memory mapped file.
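Here is a rough sketch (my own, not the actual codebase) of what mapping a single block at a time can look like with .NET’s memory mapped files; the method names and the idea of handing back a view stream over just the block are assumptions for illustration:

    using System.IO;
    using System.IO.MemoryMappedFiles;

    // Sketch: open the file as a memory mapped file once, then expose each
    // block as a stream over just its own slice of the mapping, so the bytes
    // are consumed straight from the mapped view rather than copied up front.
    public static MemoryMappedFile OpenTable(string path)
    {
        return MemoryMappedFile.CreateFromFile(path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read);
    }

    public static Stream OpenBlock(MemoryMappedFile mmf, long position, int count)
    {
        return mmf.CreateViewStream(position, count, MemoryMappedFileAccess.Read);
    }

For example, the index block from the dump shown earlier would be OpenBlock(mmf, 143, 14), without ever touching the rest of the file.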

Here is an example of reading the key for an entry from the SST:

[image: code sample from the original post showing how an entry’s key is read]
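Since the screenshot is not reproduced here, the following is a stand-in sketch of what such a read might look like; every name in it is an assumption of mine, not the actual API. It reads the key into a reusable buffer and leaves the value bytes alone:

    using System;
    using System.IO.MemoryMappedFiles;

    // Sketch only: read an entry's key from the mapped block into a reusable
    // buffer, without ever touching the value bytes.
    public class TableReaderSketch
    {
        private byte[] _keyBuffer = new byte[2 * 1024];   // keys are assumed to be small

        public ArraySegment<byte> ReadKey(MemoryMappedViewAccessor accessor, ref long pos)
        {
            int shared = ReadVarInt(accessor, ref pos);
            int nonShared = ReadVarInt(accessor, ref pos);
            int valueSize = ReadVarInt(accessor, ref pos);

            int keySize = shared + nonShared;
            if (_keyBuffer.Length < keySize)              // grow only for unusually large keys
                Array.Resize(ref _keyBuffer, keySize);

            // The shared prefix is already in the buffer from the previous entry;
            // only the non shared tail is read from the mapped view.
            accessor.ReadArray(pos, _keyBuffer, shared, nonShared);
            pos += nonShared;
            pos += valueSize;                             // skip the value entirely

            return new ArraySegment<byte>(_keyBuffer, 0, keySize);
        }

        private static int ReadVarInt(MemoryMappedViewAccessor accessor, ref long pos)
        {
            int value = 0, shift = 0;
            while (true)
            {
                byte b = accessor.ReadByte(pos++);
                value |= (b & 0x7F) << shift;
                if ((b & 0x80) == 0)
                    return value;
                shift += 7;
            }
        }
    }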

As you can see, we need to read just the key itself into memory; the value is not even touched. Also, there is some buffer management going on to make sure that we don’t need to re-allocate buffers as we are scanning through the table.

When you want to get a value, you call CreateValueStream, which gives you a Stream that you can work with. Here is how you use the API:

[image: code sample from the original post showing the CreateValueStream API in use]
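The screenshot itself isn’t available here; this is a hypothetical usage sketch based only on the description above, and every name in it other than CreateValueStream is an assumption:

    using System.IO;

    // Hypothetical usage sketch: Table, CreateIterator, Seek and IsValid are
    // placeholder names; only CreateValueStream comes from the description above.
    public static void CopyValue(Table table, Slice key, Stream destination)
    {
        using (var it = table.CreateIterator())
        {
            it.Seek(key);
            if (!it.IsValid)
                return;
            using (Stream value = it.CreateValueStream())
            {
                value.CopyTo(destination);   // consumed lazily, straight from the mapped file
            }
        }
    }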

The code shown there is actually part of the internal codebase; we are storing data inside the SST that will later help us optimize things, but that is something I’ll touch on at a later point in time.

Except for the slight worry that I am going to have to change the underlying approach from memory mapped files to streams if I need to run it outside the server/client, this is very cool.

Next on my list is to think on how to implement the memtable, again, without impacting too much on the managed memory.

time to read 7 min | 1252 words

Applying the lessons from the leveldb codebase isn’t going to be a straightforward task. I already discussed why we can’t just use leveldb directly, so I won’t go into that.

I decided to see if I can implement a leveldb inspired storage for RavenDB. And the first thing that I wanted to try is to build an SST. That seems like an easy first step. SSTs are just sorted data in a file, and it requires nothing much to write them. As mentioned, we can’t just do a line by line port of the code, as easy as that would be. The way leveldb manages memory is… not really suitable for what we can / should do in .NET.

  • I want this to be an efficient .NET implementation.
  • It is inspired, but not intended to be compatible with leveldb.

The leveldb API deals with byte arrays all over the place. This makes sense for a C++ codebase, but it is horrible for a .NET application, especially when we expect a lot of our values to be large enough to hit the LOH. That means fragmentation, and pain down the road. Instead, our API uses ArraySegment<byte> for keys, and Stream for values. Since keys are expected to be relatively small, I don’t foresee this being a problem. And the values are streams, so they are easily handled without introducing any cost from the API.

Another thing that leveldb does quite a lot is batch things in memory for a while. It may be the current block, it may be the current data block, it may be the index block, but it does so quite often. That works nicely for C++ apps with expected small values, but not so much for our expected use case. So I want to avoid as much as possible holding items in managed memory. Here is the API for creating an SST:

    var options = new StorageOptions();
    using (var file = File.Create("test.sst"))
    using (var temp = new FileStream(Path.GetTempFileName(), FileMode.CreateNew, FileAccess.ReadWrite,
                        FileShare.None, 4096, FileOptions.DeleteOnClose | FileOptions.SequentialScan))
    {
        var tblBuilder = new TableBuilder(options, file, temp);

        for (int i = 0; i < 100; i++)
        {
            var key = "tests/" + i.ToString("0000");
            tblBuilder.Add(key, new MemoryStream(Encoding.UTF8.GetBytes(key)));
        }

        tblBuilder.Finish();
    }

As you can see, we use two streams here: one to actually write the table, the second a temporary stream that we use to write the index block while we are working, which we then merge back into the actual SST file. Note that after building the table, the temp file can be deleted (indeed, we marked it as delete on close, so that happens automatically).

That part was easy, all it required was simple I/O for generating the file. The more interesting part is going to be reading the values out.
