The design of a managed memory map database
Since MMMD appears to be an airport in Mexico, I think that I’ll call this project Nevar Voron. It seems… appropriate, somehow.
I don’t learn by just reading code; I really need to do some writing to actually understand the concepts. In order to do that, I decided to try to create a managed implementation based on the ideas from LMDB. Note that I am explicitly not trying to do a port; I am trying to do something different, but based on the concepts that I learned from reading the code.
Goals:
- Single process only – I have no need for something that can be used by multiple processes.
- No predefined file size – The reason LMDB does this is that it maps the entire file as a single continuous range. But there isn’t really any reason why you can’t grow the file and map the extra space in a different location (see the sketch after this list).
- Multiple trees (“tables”) – So we don’t have just a single B-Tree; we can have more than one. I won’t have a main DB exposed outside.
- ACID, Transactional, MVCC with single writer
- No File I/O – In LMDB, the code allows either standard file I/O or direct memory writes. The latter is dangerous if you are using C, because you can overwrite and corrupt data, but in .NET, that isn’t an issue, since we don’t do direct memory access.
- No duplicates – It really complicates the LMDB codebase, and I don’t think it is needed.
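To make the second bullet concrete, here is a minimal C# sketch of growing the file and mapping the extra space in a different location. None of these names come from the actual project; this is just my own illustration that nothing forces a single continuous mapping.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;

// Hypothetical sketch: instead of mapping the whole file as one continuous
// range (as LMDB does), each time the file grows we map only the new tail
// as a separate view, at whatever address the OS happens to give us.
public sealed class GrowableMappedFile : IDisposable
{
    private readonly FileStream _file;
    private readonly List<(MemoryMappedFile Map, MemoryMappedViewAccessor View)> _segments = new();

    public GrowableMappedFile(string path)
    {
        // FileShare.None fits the "single process only" goal above.
        _file = new FileStream(path, FileMode.OpenOrCreate,
            FileAccess.ReadWrite, FileShare.None);
    }

    // Extends the file by 'size' bytes and maps just the newly added region.
    public MemoryMappedViewAccessor Grow(long size)
    {
        long start = _file.Length;
        _file.SetLength(start + size);

        var map = MemoryMappedFile.CreateFromFile(_file, mapName: null,
            capacity: _file.Length, MemoryMappedFileAccess.ReadWrite,
            HandleInheritability.None, leaveOpen: true);
        var view = map.CreateViewAccessor(start, size);
        _segments.Add((map, view));
        return view;
    }

    public void Dispose()
    {
        foreach (var (map, view) in _segments)
        {
            view.Dispose();
            map.Dispose();
        }
        _file.Dispose();
    }
}
```

One consequence of this layout is that a page must never straddle two segments, since each segment can end up at a completely different address.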
Overall architecture:
One of the reasons I am explicitly not trying to do a port is that I still don’t like the LMDB codebase. I think that it requires a major refactoring, and the way it works is not really suitable for a managed implementation.

The first major component is going to be the B-Tree. This is responsible for reading & writing data from a memory mapped file region, but it is focused only on that. Things like transactions, page allocations, free pages, concurrency and anything else are the responsibility of others. This is important because the B-Tree is going to be useful for a lot of stuff. In LMDB, the B+Tree is used to maintain the free list, the main db, sub databases, etc. For that matter, just having a persistent B+Tree is going to be very useful in general.
After going away and playing with the code for a while, I can tell you that you need the following just to get started (a rough sketch of how these pieces fit together follows the list):
- Page – represents a page in memory, of course. A page is a sorted list of items, where an item can be a value or a reference to another page.
- Tree – the actual tree implementation, responsible for such things as maintaining the tree, reads & searches. The only outside dependency it has is on Transaction, and that is only when it needs to allocate a page.
- Transaction – manages transactions, either read-only or write. Handles page allocations. This one isn’t even started, because I want to see how far I can take the tree code before I actually need to complete it.
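To make the division of responsibilities concrete, here is a rough C# sketch of how these three pieces might relate. None of this is the actual code; the types, names, and signatures are my own guesses based on the description above.

```csharp
using System;
using System.Collections.Generic;

// Page: a sorted list of entries; in a leaf an entry carries a value,
// in a branch it carries a reference to a child page.
public sealed class Page
{
    public long PageNumber;
    public bool IsLeaf;
    public List<(byte[] Key, byte[] Value, long ChildPage)> Entries = new();
}

// Transaction: the tree's only outside dependency. It hands out existing
// pages and (in a write transaction) allocates new ones.
public interface ITransaction
{
    Page GetPage(long pageNumber);
    Page AllocatePage(); // only legal in a write transaction
}

// Tree: searching and maintenance only. Free pages, concurrency and
// everything else are somebody else's problem.
public sealed class Tree
{
    private readonly ITransaction _tx;
    private readonly long _rootPageNumber;

    public Tree(ITransaction tx, long rootPageNumber)
    {
        _tx = tx;
        _rootPageNumber = rootPageNumber;
    }

    public byte[] Read(byte[] key)
    {
        var page = _tx.GetPage(_rootPageNumber);
        while (!page.IsLeaf)
        {
            // Descend to the last child whose separator key is <= key
            // (a linear scan stands in for a real binary search).
            long child = page.Entries[0].ChildPage;
            foreach (var entry in page.Entries)
            {
                if (Compare(entry.Key, key) > 0)
                    break;
                child = entry.ChildPage;
            }
            page = _tx.GetPage(child);
        }

        foreach (var entry in page.Entries)
            if (Compare(entry.Key, key) == 0)
                return entry.Value;
        return null; // not found
    }

    private static int Compare(byte[] a, byte[] b) =>
        ((ReadOnlySpan<byte>)a).SequenceCompareTo(b);
}
```

The point of the ITransaction seam is exactly the one made above: the tree knows nothing about free lists, MVCC, or file layout; it only asks for pages.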
The basic idea is that I really like the ideas LMDB uses; I just don’t like the codebase. So I want to see if I can do the same thing with a codebase that would be easier for me to work with.
And, of course, this is a cool project, and quite an interesting one.
Comments
"Nevar" means snowing in Portuguese and Spanish.
Pedro, yeah, we already changed the name; this was posted a while ago.
Voron - "raven" in russian. Also in urban fantasy series "Kate Daniels" by Ilona Andrews, it is the name of main heroine stepfather. :-)
"Voron" means raven in Russian :-)
I think you should name it "Crow", because it would be cool to say: "Raven uses Crow as a backing store".
re: multi-process access - it's extremely handy for doing administration on a long-running server. E.g. in OpenLDAP we can bulk-import additional data concurrently with a running slapd. No service downtime.
re: duplicate key support - this is crucial for a fast indexing implementation.
You are a brave man!
Incidentally, have you considered implementing a Lucene Directory on top of Raven.Voron?
Perhaps that would improve the indexing performance. But you would also get more robust indexes.
SQL Server does duplicate keys with a hidden uniquifier column. That saves 99% of the codebase from dealing with duplicates.
Jan, I was reading the latest Kate Daniels book when I decided to name it this way.
A typical approach to indexing with duplicates is to append the primary key id to the index key. E.g., if you've indexed a field "FirstName" then you might have two records 1 and 7 with name "Joe". Your index would create keys "Joe1" and "Joe7". This gets wasteful because you're duplicating the key "Joe" in your storage every time you get a duplicate.

Many B-tree implementations do prefix compression on keys to mitigate this issue. The problem with that is (1) it's quite complex and (2) it's easy to break. Prefix compression is done by compressing out the common prefix of all of the keys on a given page. Where this breaks is if a new value is inserted into a page whose key has no prefix in common with the existing keys. In that case all of the keys on the page must be restored to their full length. If the page was densely packed before, this may result in the nodes no longer fitting on the page, causing a need for new page-split ops to make room.

This is particularly hairy if you were already in the middle of a rebalance op - in the normal case, page-splits can only be triggered by user insert operations, and rebalance/merges can only be triggered by user delete operations. When prefix compression is used, any tree maintenance op can trigger both in heavily nested succession.
With LMDB duplicate support, the key is only stored once, and there is no danger of cascading tree rebalances due to prefix compression state changes. It is both more space efficient and more time efficient than the alternative.
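For what it’s worth, here is a toy C# illustration of the “append the primary key to the index key” approach described in the comment above, with a separator byte added (my own tweak) so that "Joe" doesn’t accidentally match "Joel". The linear prefix scan stands in for what a real B-tree would do with a range search.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy composite-key index: the record id is appended to the indexed value,
// so duplicates become distinct keys ("Joe" + '\0' + "1", "Joe" + '\0' + "7")
// and looking up all the Joes becomes a prefix scan over sorted keys.
public sealed class CompositeKeyIndex
{
    private readonly SortedDictionary<string, long> _entries =
        new(StringComparer.Ordinal);

    public void Add(string firstName, long recordId) =>
        _entries.Add(firstName + "\0" + recordId, recordId);

    public IEnumerable<long> Find(string firstName) =>
        _entries
            .Where(kv => kv.Key.StartsWith(firstName + "\0", StringComparison.Ordinal))
            .Select(kv => kv.Value);
}
```

Note that, exactly as described above, "Joe" is stored once per duplicate; that redundancy is what prefix compression, or LMDB’s native duplicate support, is meant to eliminate.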