Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,612
|
Comments: 51,243
Privacy Policy · Terms
filter by tags archive
time to read 2 min | 349 words

Well, I am very happy at the conclusion of this blog post series. Beside being one of the longest that I have done, this actually stretched my ability to count using roman numerals.

In summary, I am quite happy that I spent the time reading all of this code. The LevelDB codebase is really simple, when you grok what it actually does. There is nothing there that would totally stun a person. What there is there, however, is a lot of accumulated experience in building those sort of things.

You see this all over the place, in the format of the SST, in the way compaction is working, in the ability to add filters, write merging, etc. The leveldb codebase is a really good codebase to read, and I am very happy to have done so. Especially since doing this in C++ is way out of m comfort zone. It was also interesting to read what I believe is idiomatic C++ code.

Another observation about leveldb is that it is a hard core C++ product. You can’t really take that and use the same approach in a managed environment. In particular, efforts to port leveldb to java (https://github.com/dain/leveldb/) are going to run into hard issues with problems like managing the memory. Java, like .NET, has issues with allocating large byte arrays, and even from the brief look I took, working with leveldb on java using the codebase as it is would likely put a lot of pressure there.

Initially, I wanted to use leveldb as a storage engine for RavenDB. Since I couldn’t get it compiling & working on Windows (and yes, that is a hard requirement. And yes, it has to be compiling on my machine to apply), I thought about just porting it. That isn’t going to be possible. At least not in the trivial sense. Too much work is require to make it work properly.

Yes, I have an idea, no, you don’t get to hear it at this time Smile.

time to read 2 min | 243 words

I was all set to finish this series, when I realized that I missed something important, I didn’t covered, and indeed, I don’t currently understand, what filters are, and how are they used.

As usual, once I actually got to the part of the code where this is actually handled (instead of ignoring it), it made a lot more sense:

image

And that, following with the rest of the code that I read makes a lot more sense.

A SST is a sorted table, essentially. We have the keys and the blocks, etc. And we can find a value very quickly. But what if you could do this even more cheaply?

Using a bloom filter is a good way to never get false negative, and it will reduce the amount of work we have to do if we can’t find the key in the SST drastically (only need to read the filter value, don’t even need to do any additional checks). Quite elegant, even if I say so myself.

There is one thing to pay attention to and that is that you can define your own comparator, in which case you must also define you own filter policy. If you use a comparator that ignore casing, you also need to provider a filter that ignore casing.

time to read 2 min | 286 words

As mentioned before, every 4 MB or so, we will move the current memtable and switch to a new one, turning the current one into an immutable memtable, which will then be compacted to disk. This post is about that process. The work there is done by the surprisingly name: DBImpl::CompactMemTable, which then calls to WriteLevel0Table.

That method build the actual table, using the same approach I already covered. Honestly, the code is pretty obvious now that I spent enough time reading this. The only thing of interest is DeleteObsoleteFiles() method, which looks interesting if only because it is one of the few places leveldb actually does real work with files. Mostly, however, ParseFileName is interesting:

image

It give me a good way to look at the actual files being used. This will be important when we will investigate how leveldb handles recovery.

Leveldb will consider files obsolete if:

  • They are log files and aren't of the current/previous log number.
  • If it is a manifest file, keep the current one (or later if there are any).
  • If it is a table file, keep it if it is in use. A table file may have been compacted away.
  • If it is a lock file, info log or the current db file.

All of this is interesting, but on a minor level. I think that we are nearing the end of the codebase now. The only thing that I really want to go into is the recovery parts, and that would pretty much be it.

time to read 3 min | 500 words

This is how leveldb starts up a new db:

image

As you can imagine, this is quite useful to find out, since it means that everything we do on startup is recover. I do wonder why we have to take a lock immediately, though. I don't imagine that anything else can try to make use of the db yet, it hasn't been published to the outside world.

Recover is interesting (and yes, I know I wrote that a lot lately).

  • It starts by taking a lock on the db (File System lock, by holding to LOCK file).
  • If the db does not exists, we call NewDB(), which will create a default MANIFEST-00001 file and a CURRENT file pointing at that.
  • If the db exists, we call VersionSet::Recover(), this starts by reading the CURRENT file, which points to the appropriate MANIFEST file.
  • This then gives us the latest status of the db versions, in particular, it tells us what files belong to what levels and what they ranges are.
  • We check that the comparators we use is identical to the one used when creating the db.
  • The code make it looks like there might be multiple version records in the MANIFEST file, but I don't recall that. It might be just the way the API is structured, though. I just checked with the part that write it, and yes, it should have just one entry.
  • Once we have our versions, what next? We check that all of the expected files are actually there, because otherwise we might have a corrupted install.
  • The final thing we do is look for log files that are later than the latest we have in the MANIFEST. I am not sure if this indicates a recover from a crash or a way to be compatible with an older version. I guess that one way this can happen is if you crashed midway while committing a transaction.

When recovering, we are forcing checksum checks, to make sure that we don't get corrupt data (which might very well be the case, since the data can just stop at any point here. The relevant code here is leveldb::log:Reader, which takes care of reading potentially corrupt log file and reporting on its finding. I already went over how the log file is structured, the log reader just does the same thing in reverse, with a lot of paranoia. While reading from the log file, we build a memtable with the committed & safe changes. Here, too, we need to handle with memtable sizes, so we might generate multiple level 0 files during this process.

And... that is pretty much it.

I'll have another post summarizing this, and maybe another on what I intend to do with this information, but I'll keep that close to the chest for now.

time to read 3 min | 488 words

In leveldb we have the memtable, into which all of the current writes are going and the immutable memtable. So far, I haven't really checked what it actually means.

It looks like this happens on the write, by default, a mem table is limited to about 4 MB of space. We write to the memtable (backed by the tx log) until we get to the point where we go beyond the 4MB limit (note that again, if you have large values, you might go much higher than 4MB and then we switch to another memtable, change the existing memtable to be the immutable one, and move on.

Something that might trip you is that if you write 2 big values one after the other, in separate batches, the second one might need to wait for the compaction to complete.

Here is the test code:

   std::string big(1024 * 1024 * 5, 'a');
     
     for(int i=0; i < 10; i++ ){
        
         std::clock_t start = std::clock();
       
         std::ostringstream o;
         o << "test" << i;
         std::string key =  o.str();


         db->Put(writeOptions, key, big);
    
         std::clock_t end = std::clock();
   
         std::cout << i << ": " << 1000.0 *(end - start) / CLOCKS_PER_SEC << "ms " << std::endl;
   
     }

And the output is:

0: 20ms

1: 30ms

2: 50ms

3: 50ms

4: 50ms

5: 50ms

6: 40ms

7: 60ms

8: 50ms

9: 50ms

Not really that interesting, I'll admit. But when I tried it with 50 mb for each value, it felt like the entire machine grind to a halt. Considering the amount of times memory is copied around, and that it needs to be saved to at least two locations (log file & sst), that makes a lot of sense.

For references ,those were my timing, but I am not sure that I trust them.

0: 170ms

1: 310ms

2: 350ms

3: 460ms

4: 340ms

5: 340ms

6: 280ms

7: 530ms

8: 400ms

9: 1200ms

time to read 3 min | 462 words

Okay, now that I know how data actually gets to the disk and from it, it is time to read how leveldb handles snapshots. Snapshots seems to be very simple on the surface. On every write, we generate a sequence number. We store the number locally and use the oldest still living snapshot as the oldest version that we have to retain when doing compaction work.

I had hard time figuring out how it worked out with regards to using this in memory. Consider the following code:

     leveldb::WriteOptions writeOptions;
     writeOptions.sync = true;
     
     db->Put(writeOptions, "test", "one");
     
     const leveldb::Snapshot* snapshot = db->GetSnapshot();
     
     db->Put(writeOptions, "test", "two");
  
    leveldb::ReadOptions readOptions;
    readOptions.snapshot = snapshot;
    std::string val;
    status = db->Get(readOptions, "test", &val);

This will properly give val == “one” when we execute it.

As it turned out, I missed something when I read the code earlier for MemTables. The value that is actually stored as the key is [key][tag]. And the tag is the sequence number + write type. And because of the way it is sorted (little endian, always), it means that higher values are sorted first. And what that means in turn is that unless you specify a specific snapshot number (which is what this tag contains, most of the time), you are going to get the latest version. But if you specify a snapshot number, you’ll get the value that was there as of that snapshot.

And that, in turn, means that we can write code like this:

image

Where key.memtable_key() contains the required snapshot value. So we can just skip all the ones larger than what we want.

That is really cool, but what about when we go to disk? Pretty much in the same way. The actual key value include the sequence & tag value. But the comparator knows to filter it out when needed. This is quite nice, and an elegant way to handle this situation.

time to read 6 min | 1035 words

After figuring out how the TableCache works, I now want to dig a bit deeper into what the Tables are. That means that we are starting out in Table::Open. And this starts with:

image

So we start by reading the footer. Then we read the index block:

image

Construct a new table, and we are pretty much set. I went back to look at the TableBuilder, and I think that I am getting things better. There is the BlockHandle, which is basically just the offset / size in the file.

Then we have the actual index, which is located at the end of the file. This one has the format of:

key, offset, size

key, offset, size

key, offset, size

The only problem is that I don't see yet where this is actually used. And here it is, inside Table::InternalGet.

image

So we seek into the block data. And that matches pretty closely what I would expect here. And then we have this:

image

 

Some notes about this code, it looks like we are going to be loading the entire block into memory. But what happens if we have a very large value? For that matter, what happens if we have a very large value next to a very small value on the same block, but we wanted the small value?

I think that I am wrong here. Looking at the code ,I found this comment, which I previously missed:

image

And that explains it really nicely with regards to how blocks works in general. The cool part happens inside Block::Iter::Seek, we first do a binary search for the prefixes that we have inside the block, and then a linear search inside the restart section. By default, there are 16 items per restart, so that is going to generate a really fast turnaround in general.

One key point I think that bear repeating is with regards to my previous comment about sizes. We don't actually ever copy the entire block to memory, instead, we rely on the OS to do so for us, because the files are actually memory mapped. That means that we can easily access them with very little cost, even if we have mixed big & small values on the same block. That is because we can just skip the large value and not touch it,and rely on the OS to page the right pages for us. That said, when reading a block, not just iterating it, we are going to allocate it in memory:

image

So I am not sure about that. I think that I was mistaken previously. The only problem is that the buf is actually used for scratch purposes.  I think that this is right on both ends, looking at the code, we have PosixRandomAccessFile:

image

This is the standard approach, actually. Read it from the disk into the buffer, and go on with your life.  But we also have PosixMMapReadableFile implementation:

image

And this is a lot more reasonable and in line with what I was thinking. We don't use the scratch at all.

However, we still need to allocate this scratch buffer on every single write. Looking at the code, the decision on what to allocate seems to be done here:

image

As you can see, it is mmap_limit_ that will make that determination. That is basically limiting us to no memory maps on 32 bits (make sense, the 2 GB virtual address space is really small) and to a max of 1,000 mapped files for 64 bits. Given that I am assuming that you are going to run this on 64 bits a lot more often than on 32 bits (at least for server apps), it would make more sense...

Stopping right here. Leveldb is being used in the browser as well, and presumably on mobile devices, etc. That means that you can't really assume/require 64 bits. And given that most of the time, you are going to have blocks of up to 4 KB in size (except if you have very large keys), I think that this is reasonable. I would probably have done away with allocating the buffer in the happy case, but that is beside the point, probably, since most of the time I assume that the value are small enough.

I am looking at this through the eyes of someone who deals with larger values all the time, so it triggers a lot of introspection for me. And this is how we actually read a value from disk, I actually managed to also figure out how we write, and in what format. All together good day's (actually, it was mostly at night) fun.

time to read 6 min | 1034 words

In the previous post, I focused mostly on reading the code for writing a SST to disk. But I have to admit that I am not really following how you read them back in a way that would be easy to read.

In order to understand that, I think that the right place in the code would be the TableCache. The API for that is pretty slick, here are just the header (comments were stripped).

class TableCache {
 public:
  TableCache(const std::string& dbname, const Options* options, int entries);
  ~TableCache();

  Iterator* NewIterator(const ReadOptions& options,
                        uint64_t file_number,
                        uint64_t file_size,
                        Table** tableptr = NULL);

  Status Get(const ReadOptions& options,
             uint64_t file_number,
             uint64_t file_size,
             const Slice& k,
             void* arg,
             void (*handle_result)(void*, const Slice&, const Slice&));

  void Evict(uint64_t file_number);

 private:

  Status FindTable(uint64_t file_number, uint64_t file_size, Cache::Handle**);
};

There are a couple of interesting things here that show up right away. How do you know what is the number of entries. I guess that this is stored externally, but I am not sure where. I am going to figure that question out first.

And the answer is strange:

image

It isn't number of entries in the table, it is the number of files? That actually say something very important, since this means that the table cache is reading multiple SST files, rather than just one per cache. Looking at the rest of the API, it makes even more sense. We need to pass the file number that we are going to be reading. That one is going to be got from the Version we are using.

Side note: I just spend an hour or so setting up a tryout project for leveldb, so I can actually execute the code and see what it does. This had me learning cmake (I am using KDevelop to read the code) and I finally got it. Still haven't figure out how to step into leveldb's code. But that is a good step.

Also, the debug experience is about 2/10 compared to VS.

And damn, I am so not a C++ developer any longer. Then again, never was a real dev on linux, anyway.

The first thing the TableCache does is to setup a cache. This is interesting, and I followed the code, it create 16 internal caches and has between them. I think that this is done to get concurrency because all the internal cache methods looks like this:

image

Note the mutex lock. That is pretty much how it works for all of the cache work. Looking deeper into the code, I can see another reason why I should be grateful for staying away from C++. leveldb comes with its own hash table functionality. At any rate, the cache it using LRU to evict items, and it get the number of entries from the value that was passed in the ctor. That make me think that it hold the opened files.

Speaking of the cache, here is an example of the code using it:

image

The cache is also used to do block caching, this is why it takes a slice as an argument. I'm going to go over that later, because this looks quite interesting. The rest of this method looks like this:

image

So the only very interesting bit is the Table::Open. The rest is just opening the mmap file and storing it in the cache. Note that the actual file size is passed externally. I am not really sure what this means yet. I'll go over the table code later, right now I want to focus on the table cache.

Looking over the TableCache interface, we can see that we always get the file size from outside. That got me curious enough to figure out why. And the answer is that we always have it in the FileMetaData that we previously explored. I am not sure why that is so important, but I'll ignore it for now.

The rest of TableCache is pretty simple, although this made me grin:

image

More specifically, look at the RegisterCleanup, this is basically registering for the delete event, so they can unref the cache. Pretty nice, all around.

The rest of the code is just forwarding calls to the Table, so that is what I'll be reading next...

time to read 3 min | 502 words

It seems like I have been trying to get to the piece that actually write to disk for a very long while.

image

The API from the header file looks to be pretty nice. I would assume that you would call it something like this:

TableBuilder builder(options,writableFile);
builder.Add(“one”, “bar”);
builder.Add(“two”, “foo”);
builder.Finish();

There are some things that looks funky. Like Flush(), which has a comment that I don't follow: Can be used to ensure that two adjacent entries never live in the same data block.

Oh well, maybe I need to actually read the code first?

And immediately we have this:

image

I think that this explain why you have Flush(), this forces different blocks. Most of the time, you let leveldb handle that, but you can also control that yourself if you really know what you are doing.

There is something called FilterBlockPolicy, but I don't get what it is for, ignoring for now.

Calling Add() does some really nice things with regards to the implementation. Let us say that we are storing keys with similar postfixes. That is very common.

When considering leveldb for RavenDB, we decided that we would store the data in this manner:

/users/ayende/data = {}

/users/ayende/metadata = {}

Now, when we want to add them, we add the first one and then the second one, but we have a chance to optimize things.

We can arrange things so we would output:

0,19,2

/users/ayende/data

{}

15, 9, 3

metadata

{}

Note that for the second key, we can reuse the first 15 characters of the previous entry. And we just saved some bytes on the disk.

Now, blocks are held in memory, and by default they are flushed every 4 KB or so.

Side note: looking at the code, it is really scary just how many times data is copied from one memory location to another in the leveldb codebase. Client code to WriteBatch, WriteBatch to MemTable, MemTable to TableBuilder, etc.

Okay, so now I am pretty sure that I am following what is going on when writing to disk. Still not sure how you search in an SST. The data is sorted, but that looks like it would require multiple seeks .There is metadata & index blocks in there, but I am not quite sure how they would work. Need to read the reader side of things to really get things. Well, I'll do that next time...

time to read 4 min | 726 words

After going over the VersionSet, I understand how leveldb decide when to compact and what it decide to compact. More or less, at least.

This means that I now mostly can figure out what this does:

Status DBImpl::BackgroundCompaction()

A user may force manual compaction of a range, or we may have reasons of our own to decide to compact, based on leveldb heuristics. Either way, we get the Compaction object, which tells us what files we need to merge.

There is a check there whatever we can do a trivial compaction, that is, just move the file from the current level to level+1. The interesting thing is that we avoid doing that if this is going to cause issues in level+2 (require more expensive compaction later on).

But the interesting work is done in DoCompactionWork, where we actually do compaction of complex data.

The code refers to snapshots for the first time that I see. We only merge values that are in a valid snapshot. So data doesn't “go away” for users. While holding a snapshot active.

The actual work starts in here:

image

This give us the ability to iterate over all of the entries in all of the files that need compaction.

And then we go over it:

image

But note that on each iteration we do a check if we need to CompactMemTable(); I followed the code there, and we finally write stuff to disk! I am quite excited about this, but I'll handle that in a future blog post. I want to see how actual compaction works.

We then have this:

image

This is there to stop a compaction that would force a very expensive compaction next time, I think.

As a side note, this really drive me crazy:

image

Note that we have current_output() and FileSize() in here. I don't really care what naming convention you use, but I would really rather that you had one. If there is one for the leveldb code base, I can't say that I figured it out. It seems like mostly it is PascalCase, but every now & then we have this_style methods.

Back to the code, it took me a while to figure it out.

image

Will return values in sorted key order, that means that if you have the same key in multiple levels, we need to ignore the older values. After this is happening, we now have this:

image

This is where we are actually writing stuff out to the SST file! This is quite exciting :-). I have been trying to figure that out for a while now.

The rest of the code in the function is mostly stuff related to stats book keeping, but this looks important:

image

This generate the actual VersionEdit, which will remove all of the files that were compacted and add the new file that was created as a new Version to the VersionSet.

Good work so far, even if I say so myself. We actually go to where we are building the SST files. Now it is time to look at the code that build those table. Next post, Table Builder...

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. Recording (18):
    29 Sep 2025 - How To Run AI Agents Natively In Your Database
  2. Webinar (8):
    16 Sep 2025 - Building AI Agents in RavenDB
  3. RavenDB 7.1 (7):
    11 Jul 2025 - The Gen AI release
  4. Production postmorterm (2):
    11 Jun 2025 - The rookie server's untimely promotion
  5. RavenDB News (2):
    02 May 2025 - May 2025
View all series

Syndication

Main feed ... ...
Comments feed   ... ...
}