Reviewing Basho’s Leveldb

After taking a look at HyperLevelDB, it is time to see what Basho has changed in leveldb. They were kind enough to write a blog post detailing those changes. Unfortunately, unlike the HyperLevelDB write-up, it is pretty general and focused on their own product (which makes total sense). They call out the reduction of “stalls”, which may or may not be related to the write delay that leveldb intentionally introduces under load.

Okay, no choice about it, I am going to go over the commit log and see if I can find interesting stuff. The first tidbit that caught my eye is an improvement to the compaction process when there is on-disk corruption. Instead of stopping, it moves the bad data to the “lost” directory and carries on. Note that there is some data loss associated with this, of course, but that won’t necessarily be felt by the users.

As a note, I dislike this code formatting:

[image: screenshot of the code in question]

Like HyperLevelDB, Basho made a lot of changes to compaction, apparently for performance reasons:

  • No compactions are triggered by reads; that is too slow.
  • There are now multiple threads handling compactions, with different priorities between them. For example, flushing the immutable mem table is high priority, as is level 0 compaction, but standard compactions can wait.
  • Interestingly, when flushing data from memory to level 0, no compression is used.
  • After those were done, they also added additional logic to enforce locks that would give flushing from memory to disk and from level 0 downward much higher priority than everything else.
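To make the priority idea concrete, here is a minimal sketch of how such a work queue could be ordered. This is not Basho's code: the names `TaskKind`, `Task` and `CompactionQueue` are made up for illustration, it is single threaded (no locking, no worker threads), and ranking the mem table flush above level 0 compaction is my assumption, since the notes above only say that both are high priority.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical task kinds; a larger value means more urgent.
// Putting the flush above level 0 compaction is an assumption.
enum class TaskKind : int {
  kStandardCompaction = 0,
  kLevel0Compaction = 1,
  kMemTableFlush = 2,
};

struct Task {
  TaskKind kind;
  std::uint64_t sequence;  // submission order, used to break ties (FIFO)
};

// Returns true when `a` should be served after `b`.
struct LowerPriority {
  bool operator()(const Task& a, const Task& b) const {
    if (a.kind != b.kind) {
      return static_cast<int>(a.kind) < static_cast<int>(b.kind);
    }
    return a.sequence > b.sequence;  // within a kind, the older task wins
  }
};

// Single-threaded sketch: a real implementation would guard this with a
// mutex and hand tasks to worker threads.
class CompactionQueue {
 public:
  void Submit(TaskKind kind) { queue_.push(Task{kind, next_sequence_++}); }

  bool Empty() const { return queue_.empty(); }

  // Removes and returns the most urgent task.
  Task Next() {
    assert(!queue_.empty());
    Task task = queue_.top();
    queue_.pop();
    return task;
  }

 private:
  std::priority_queue<Task, std::vector<Task>, LowerPriority> queue_;
  std::uint64_t next_sequence_ = 0;
};
```

With this ordering, a standard compaction submitted first still waits behind any flush or level 0 compaction submitted later, which is the behavior the bullets describe.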

As an aside, another interesting thing I noticed: Basho also moved closing files and unmapping memory to a background thread. I am not quite sure why that is the case; I wouldn’t expect those operations to be very expensive.

Next on the list, improving caching. Mostly by taking into account actual file sizes and by introducing a reader/writer lock.

Like HyperLevelDB, they also went for larger files, although I think that in this case they went for significantly larger files than even HyperLevelDB did. Throttling is another difference. HyperLevelDB did away with write throttling altogether in favor of concurrent writes, while Basho’s leveldb went for a much more complex system of write throttling based on the current load, pending work, etc. The idea is to gain better load distribution overall. (Or maybe they didn’t think about the concurrent write strategy.)
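As a rough illustration of what load-based throttling means, here is a sketch that turns a snapshot of pending work into a per-write delay. Everything in it is an assumption of mine: the `LoadSnapshot` fields, the thresholds and the linear formula are invented for the example and are not taken from Basho's code.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical view of the current load.
struct LoadSnapshot {
  int level0_files;                        // files waiting in level 0
  std::uint64_t pending_compaction_bytes;  // backlog not yet compacted
};

// Returns how long a writer should be delayed under the given load.
// All constants below are illustrative, not tuned values.
std::chrono::microseconds WriteDelay(const LoadSnapshot& load) {
  assert(load.level0_files >= 0);
  constexpr int kLevel0SoftLimit = 8;
  constexpr std::int64_t kMicrosPerExtraFile = 100;
  constexpr std::uint64_t kBacklogBytesPerMicro = 1u << 20;  // 1 MiB
  constexpr std::int64_t kMaxDelayMicros = 1000;

  std::int64_t delay = 0;
  // Each level 0 file above the soft limit adds a fixed penalty.
  if (load.level0_files > kLevel0SoftLimit) {
    delay += kMicrosPerExtraFile * (load.level0_files - kLevel0SoftLimit);
  }
  // The compaction backlog adds one microsecond per MiB, clamped before
  // converting to a signed value.
  const std::uint64_t backlog_micros =
      load.pending_compaction_bytes / kBacklogBytesPerMicro;
  delay += static_cast<std::int64_t>(
      std::min<std::uint64_t>(backlog_micros, kMaxDelayMicros));
  // Never stall a single write for longer than the cap.
  return std::chrono::microseconds(std::min(delay, kMaxDelayMicros));
}
```

The point of a graduated delay like this is that writers slow down a little as work piles up, rather than running at full speed until they hit a hard stall.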

I wonder (but didn’t check) whether some of the changes were pulled back into the leveldb project, because there is some code here that I am pretty sure duplicates work already done in leveldb. In this case, the retiring of data that has already been superseded.

There is a lot of stuff that appears to relate to maintenance: scanning SST files for errors, perf counters, etc. It also looks like they decided to go to assembly for the actual CRC32 implementation. In fact, I am pretty sure that the asm is there to call the hardware CRC instruction inside the CPU, but I am unable to decipher it.
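For reference, this is what the checksum looks like in plain software. leveldb uses the CRC-32C (Castagnoli) variant, which is also the polynomial that the SSE4.2 `crc32` instruction implements. The sketch below is a deliberately slow bit-at-a-time version written for readability; it is not the leveldb or Basho implementation, and whether Basho's assembly really maps to the hardware instruction remains my guess from above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bit-at-a-time CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
// Readable but slow; table-driven or hardware versions compute the same
// value much faster.
std::uint32_t Crc32c(const unsigned char* data, std::size_t size) {
  assert(data != nullptr || size == 0);
  std::uint32_t crc = 0xFFFFFFFFu;  // standard initial value
  for (std::size_t i = 0; i < size; ++i) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; ++bit) {
      // If the low bit is set, shift and fold in the polynomial;
      // otherwise just shift. The mask is all ones or all zeros.
      const std::uint32_t mask = 0u - (crc & 1u);
      crc = (crc >> 1) ^ (0x82F63B78u & mask);
    }
  }
  return crc ^ 0xFFFFFFFFu;  // standard final inversion
}
```

The usual sanity check for CRC-32C is the ASCII string `123456789`, which should produce `0xE3069283`.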

What I find funny is that another change I just ran into is the introduction of a way to avoid copying data when Get()ing data from leveldb. If you’ll recall, I pointed that out as an issue a while ago in my first review of leveldb.
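To show the difference between the two styles, here is a toy in-memory store with both a copying read and a non-copying one. It is only an illustration of the idea: `ToyStore`, `GetCopy`, `GetView` and `ValueView` are hypothetical names, and this is not the API that Basho added.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical non-owning view over a stored value (pointer plus length).
struct ValueView {
  const char* data;
  std::size_t size;
};

class ToyStore {
 public:
  void Put(const std::string& key, const std::string& value) {
    table_[key] = value;
  }

  // Copying read: the caller's string receives its own copy of the bytes.
  bool GetCopy(const std::string& key, std::string* value) const {
    assert(value != nullptr);
    auto it = table_.find(key);
    if (it == table_.end()) return false;
    *value = it->second;
    return true;
  }

  // Non-copying read: the view points into the store's own memory. It is
  // only valid until the key is overwritten or the store is destroyed.
  bool GetView(const std::string& key, ValueView* view) const {
    assert(view != nullptr);
    auto it = table_.find(key);
    if (it == table_.end()) return false;
    view->data = it->second.data();
    view->size = it->second.size();
    return true;
  }

 private:
  std::map<std::string, std::string> table_;
};
```

The trade-off is the usual one: the view saves an allocation and a memcpy per read, but the caller has to respect the lifetime of the underlying buffer.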

And here is another pretty drastic change. In leveldb, only level 0 can have overlapping files, but Basho changed things so that the first 3 levels can have overlapping files. The idea is that you can do cheaper compactions this way, I am guessing.
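The flip side of overlapping levels is the read path. In a sorted, non-overlapping level at most one file can contain a given key, so a binary search finds it. In an overlapping level any file whose key range covers the key is a candidate. Here is a sketch of the two lookups; `FileRange` and both functions are made up for the example and are not leveldb's internal types.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical file metadata: the key range a file covers.
struct FileRange {
  std::string smallest;
  std::string largest;
  int file_number;
};

// Overlapping level: every file whose range covers the key must be checked.
std::vector<int> CandidatesOverlapping(const std::vector<FileRange>& files,
                                       const std::string& key) {
  std::vector<int> out;
  for (const FileRange& f : files) {
    if (f.smallest <= key && key <= f.largest) out.push_back(f.file_number);
  }
  return out;
}

// Sorted level: files are ordered by key and do not overlap, so a binary
// search on the largest key finds the single possible candidate.
std::vector<int> CandidatesSorted(const std::vector<FileRange>& files,
                                  const std::string& key) {
  assert(std::is_sorted(files.begin(), files.end(),
                        [](const FileRange& a, const FileRange& b) {
                          return a.largest < b.largest;
                        }));
  auto it = std::lower_bound(
      files.begin(), files.end(), key,
      [](const FileRange& f, const std::string& k) { return f.largest < k; });
  std::vector<int> out;
  if (it != files.end() && it->smallest <= key) {
    out.push_back(it->file_number);
  }
  return out;
}
```

So with three overlapping levels instead of one, a read may have to consult more files, which is presumably the price paid for the cheaper compactions.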

I am aware that this review is a bit of a mess, but I just went over the code and wrote down notes as I saw things. Overall, I think that I like HyperLevelDB’s changes better, but they have the advantage of starting from a much later codebase.