B+Trees and why I love them, part I
One of the things that I enjoy about learning new things is the way it changes how I look at the stuff that I already know. Reading the LMDB codebase, and implementing a persistent B+Tree, has given me a new depth of understanding about how relational databases (and Esent, too) work.
I am going to go back to level zero and assume that you have never heard about this. You probably know what a binary search tree is. And you know how they work. Here is such a tree:
Most tree algorithms spend a lot of time ensuring that the trees are balanced, because their most important property, O(log N) search performance, requires that. Binary search trees seem ideally suited for building databases, since they allow speedy lookups and backward & forward iteration through the data. Unfortunately, they suffer from a pretty crippling limitation: they have absolutely horrible performance when used on a persistent medium. Let us assume that I have a file that contains the following information (Key, Left offset, Right offset). I could search through that file by just walking the links. That, however, would result in quite terrible performance. Every time that we move between nodes, we have to do a seek, and that would kill any benefit we would gain from using a binary search tree.
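To make that cost concrete, here is a minimal sketch of what searching such a file-backed binary tree might look like. This is not the code from the post; the fixed record layout (an 8-byte key followed by two 8-byte child offsets, with -1 meaning "no child") is an assumption for illustration only.

using System;
using System.IO;

static class FileBst
{
    // Assumed record layout: [ key (8 bytes) | left offset (8 bytes) | right offset (8 bytes) ]
    public static bool Search(FileStream file, long rootOffset, long key)
    {
        var buffer = new byte[24];
        long offset = rootOffset;
        while (offset != -1)
        {
            file.Seek(offset, SeekOrigin.Begin);      // one disk seek per node we visit
            file.Read(buffer, 0, buffer.Length);
            long nodeKey = BitConverter.ToInt64(buffer, 0);
            if (nodeKey == key)
                return true;
            // follow the left or right link, paying another seek on the next iteration
            offset = key < nodeKey
                ? BitConverter.ToInt64(buffer, 8)
                : BitConverter.ToInt64(buffer, 16);
        }
        return false;
    }
}

With a million keys, a balanced tree is about 20 levels deep, and every one of those hops is a random seek somewhere in the file.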
Instead, we have a data structure that is aimed at reducing those costs: the B+Tree. On the face of it, the B+Tree is a truly, stupidly simple idea. Instead of having just 2 items in the tree, left & right, let us have more.
Now, since B+Trees are usually used in a persistent format, we start out by defining pages. That is because if we want to do random I/O, we have to be able to write to the file in pages. Therefore, we usually talk about B+Trees in terms of pages. Here is the above tree, rendered as a B+Tree:
Since the data is small, it all fits into a single page. That means that you can read the entire page into memory, and then do a binary search on the data page cheaply. The page above uses a page size of 4,096, so it can contain about 200 entries (of the size we just put in, which are very small). What happens when we have more than that? We split the page. The resulting data now looks like this (256 entries):
Now, if we want to do a search for key 152, we start by reading the root node. We can see that the page containing values greater than 102 is page 1, so we go there. We read that page, and do a binary search on its data to find our value. That was two seeks to get the data, instead of eight. And usually, the high level pages are already cached, so we would only need a single seek.
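In code, that lookup looks roughly like this. This is a simplified sketch under my own assumptions (a toy Page class and a readPage callback standing in for reading a page from disk), not the actual Voron or LMDB page layout:

using System;
using System.Collections.Generic;

class BranchEntry { public string Key; public int ChildPage; }

class Page
{
    public bool IsBranch;
    public List<BranchEntry> Branches = new List<BranchEntry>(); // used on branch pages
    public List<string> Keys = new List<string>();               // used on leaf pages
    public List<long> Values = new List<long>();
}

static class BTreeLookup
{
    // One page read (at most one seek) per level of the tree.
    public static long Search(Func<int, Page> readPage, int rootPage, string key)
    {
        var page = readPage(rootPage);                       // the root, usually already cached
        while (page.IsBranch)
        {
            // find the last branch key that is <= the search key, and follow that child
            int i = 0;
            while (i + 1 < page.Branches.Count &&
                   string.CompareOrdinal(key, page.Branches[i + 1].Key) >= 0)
                i++;
            page = readPage(page.Branches[i].ChildPage);     // one more seek per level
        }
        // on the leaf page, the binary search is pure in-memory work
        int idx = page.Keys.BinarySearch(key, StringComparer.Ordinal);
        return idx >= 0 ? page.Values[idx] : -1;
    }
}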
Let us see what happens when we put even more data in (512):
As you can see, we doubled the amount of information we store, but we remain at the same height. That is quite important in order to get good performance from the disk. The usual fan out is about 100, depending on your page size and the keys you selected.
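A rough back-of-the-envelope calculation shows where that number comes from; the header and entry sizes below are assumptions for illustration, not the actual Voron layout:

static int ApproximateFanOut()
{
    const int pageSize = 4096;
    const int headerSize = 64;           // assumed per-page header
    const int branchEntrySize = 8 + 32;  // assumed: 8-byte child page number + ~32 bytes for key & bookkeeping
    return (pageSize - headerSize) / branchEntrySize;   // ~100 children per branch page
}

With a fan out of about 100, three levels already cover roughly a million entries, which is why the tree height grows so slowly.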
I played with the settings a bit so you can get a good idea about what this looks like when you have some more depth. I wrote the following code:
for (int i = 0; i < 32; i++)
{
    ms.Position = 0;
    for (int j = 0; j < 32; j++)
    {
        env.Root.Add(tx, string.Format("{0,5}", i+j), ms);
    }
}
Which resulted in the following tree:
So far, so good, but this is actually the best case scenario. This is what happens when we are inserting data in a sequential manner. Let us mix things up a bit, okay?
for (int i = 0; i < 16; i++)
{
    ms.Position = 0;
    for (int j = 0; j < 4; j++)
    {
        env.Root.Add(tx, string.Format("{1,5}-{0,5}", i, j), ms);
    }
}
This code will generate values in an effectively random fashion, and you can see the impact it has on the tree:
What happens is that we have had to insert data in a lot of places, which resulted in a lot of page splits, which increase the depth of the tree. Deeper trees mean more operations required to actually get to the data on the leaf nodes. And yes, if you have a flashback to the "always use sequential ids" section in RDBMS class, this is the reason why.
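For completeness, here is a simplified sketch of what a leaf page split does (my own toy representation, keys only, with no real page layout): half the entries move to a new page and the first key of the new page is pushed into the parent as a separator. When the parent fills up it splits too, and when the root itself splits, the tree gains another level.

using System.Collections.Generic;

static class PageSplit
{
    // Split a full leaf: the right half moves to a new page, and the separator key
    // (the first key of the new page) is what gets inserted into the parent page.
    public static (List<string> Left, List<string> Right, string Separator) SplitLeaf(List<string> keys)
    {
        int middle = keys.Count / 2;
        var left = keys.GetRange(0, middle);
        var right = keys.GetRange(middle, keys.Count - middle);
        return (left, right, right[0]);
    }
}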
I think that this is enough for now; I'll do some more talking about the subject in my next post.
Comments
According to the Wikipedia page, most B+ trees have a fanout of 100 or more pointers to each child node. I guess in practice you're only really limited by the number of pointers you can fit into a node.
Btrees are self-balancing by definition, so how can insertion order affect tree depth?
@Rafal B+Trees are different from Binary Trees, see http://en.wikipedia.org/wiki/B%2B_tree compared to http://en.wikipedia.org/wiki/Self-balancing_binary_search_tree
Sorry 2nd link should end "Self-balancing_binary_search_tree"
I give up, try http://tinyurl.com/y9jm4d for the page on Self-balancing trees
If I've read the Wikipedia page correctly, the transition from two to more than two children is the idea behind the plain old B-tree. The "+" in the B+tree refers to the fact that data is only stored in the leaves, and that there are links between the leaves; presumably that will be a subject for a future post.
Rob - pretty much.
In LMDB there are no links between the leaves. They are unnecessary since it's a simple operation to look at the next node in the parent page. They're also not feasible due to the COW design.
There are other variations on the B+tree theme that use those links though, see the B-Link tree. I spent a fair amount of time playing with these before settling on LMDB's approach.
I'm using a similar data structure in a KVP storage of my own that was derived from B+trees, but found that it outperforms them (at least in managed .NET).
Basically, you have one (always in-memory, backed up on storage) sorted list (by key) of a structure that basically has: First-key-in-page, Page-number, Page-item-count, and then a set of pages that stores the information in a dictionary-like way (key-value, key-value). The dictionary can be sorted or not (you will need to sort it at least once when iterating data or doing page splits, but it is not necessary for insertion/deletion; I usually use a "dirty" flag for this).
So, you get a O(log N) + O(1) for reading/storing data (the first one to do the binary search on the sorted list, the second one to access the item in the page). I'm currently storing 4k items per page with excellent results.
Page splits are easy, since they don't require a tree traversal for node overflow. Also, locking usually doesn't need to happen on the whole structure, but just on the page(s) you are modifying.
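If I understand Marcelo's description correctly, the lookup path would look roughly like this; all the names and types here are my own guesses for illustration, not his actual code:

using System.Collections.Generic;

class PageIndexEntry { public string FirstKey; public int PageNumber; public int ItemCount; }

class PagedStore
{
    // Always in memory: one small entry per page, kept sorted by FirstKey.
    readonly List<PageIndexEntry> index = new List<PageIndexEntry>();
    // Pages are plain dictionaries; they only get sorted when iterating or splitting.
    readonly Dictionary<int, Dictionary<string, byte[]>> pages = new Dictionary<int, Dictionary<string, byte[]>>();

    public byte[] Get(string key)
    {
        // O(log N): binary search the in-memory page index for the last FirstKey <= key
        int lo = 0, hi = index.Count - 1, page = -1;
        while (lo <= hi)
        {
            int mid = (lo + hi) / 2;
            if (string.CompareOrdinal(index[mid].FirstKey, key) <= 0) { page = index[mid].PageNumber; lo = mid + 1; }
            else hi = mid - 1;
        }
        // O(1): hash lookup inside the page itself
        Dictionary<string, byte[]> items;
        byte[] value;
        if (page == -1 || !pages.TryGetValue(page, out items)) return null;
        return items.TryGetValue(key, out value) ? value : null;
    }
}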
Thanks for explaining B+ trees. I love learning about data structures. They taught us a few in college, but my day-to-day programming job typically requires only simple lists or dictionaries; it's nice to expand out of that.
Marcelo: definitely sounds like a nice idea, but not workable in an mmap'd data store.
Howard: Probably... never used mmap'd files. Maybe with a little change (instead of storing the values in the dictionary, just store a pointer to where the real value is). But as I mentioned, I'm using that from managed code, and one of the beauties of that is that I don't have to deal with pointers :) I can't tell you how well that scales, but at least on the projects I'm using it in, I'm handling over 10 million items without any issues... The only downside of the method is the need to store the primary sorted list in memory... Increasing the number of items per page will lower the requirements for that list, but will increase the time to do splits.
Marcelo, How do you handle 10,000,000 keys?
Marcelo, Okay, that explains things. If you are keeping the primary list in memory, that would tend to work much better. Problems start to happen when you can't / don't want to do that. See Bitcask scaling issues.
Ayende: Sure, as I mentioned, that's the downside. If you need that many items, at some point you will have to increase the page size, or put more memory on the machine... B+Trees don't suffer from that problem since you can always free memory by disposing of in-memory pages. I do that with pages, but can't with the main list. Maybe if I replaced the main list with a skip-list... but in the end that would defeat the whole point of that list.
@Matt - in B+trees all leaves are at the same distance from the root, so the tree is self-balancing, even if the balancing algorithm is not the same as in binary trees. If you do a page split you create two sibling pages - the tree can grow only if the insertion triggers a page split at the root node. So, I'm quite sure insertion order doesn't affect tree height at all.
..., sorry, not quite. There can be fluctuations in btree height, depending on how elements are divided between pages (but I think no more than [optimal height + 1] so btree is pretty immune to insertion order). But inserting in increasing key order will not give you the optimal height, it's rather a worst-case scenario for trees.
What do you do with variable length keys and values?
By the way, I'm not sure if I explained myself correctly, but I don't need to have the 10,000,000 keys IN-MEMORY. I only have the first key that it is stored in each page, so for a 4k items/page, I only need to have in memory about 2442 keys. That's not that much for the total number of items stored...
Hey Ayende.
Just wondering, what do you think of fractal trees (Tokudb: https://tinyurl.com/lycxb6l)
I don't know if you heard about http://research.microsoft.com/en-us/news/features/hekaton-122012.aspx but I thought you might find it interesting. It looks like you are not the only one thinking about the issue of performance and B+ trees.
The Bw-tree design still uses its own cache manager. LMDB's zero-copy design will run circles around it. Their use of so-called "elastic pages" guarantees they will be fighting the OS's VM manager when the DB size is much larger than RAM. Their engine is non-transactional and relies on an external transaction manager to provide full ACID semantics. Such a txn manager will obliterate all of their cache-friendliness and scaling claims. Their use of logging requires a garbage collection system, which will damage the consistency and predictability of their latency, just as it does with every other GC'd system.
Beating BerkeleyDB's performance by a factor of 20 isn't hard. (When they beat it by a factor of 3300, like LMDB does, then they'll be approaching the range of interesting.) The fact that B+trees are more efficient than skip lists is also already well-documented. (And anyone who tries to tell you skip lists are superior is either consciously lying or just ignorant. Looking at you, MemSQL.)
Hekaton will certainly be an improvement over the dreck they were using before. But by the time they add on all the transactional semantics they currently lack, much of the lustre of their Bw-trees will be gone.
I have an acquaintance who is raving about how wonderful TokuDB is. Definitely need to try out their code and see how it behaves.
The M$ guys are patting themselves on the back for writing a B+tree engine designed for "today's" Flash memory-based systems. But Flash has been around for decades and is on the verge of being superseded by MRAM/PRAM/etc.
These guys have been busy scratching their heads about problems with yesterday's hardware. It's wasted effort, because the next wave is about to hit and all their head-scratching will be irrelevant.
At least the Linux guys see what's in front of their noses. http://sched.co/13kqnUQ
NVDIMMs already on the market
http://arstechnica.com/information-technology/2013/04/memory-that-never-forgets-non-volatile-dimms-hit-the-market/
http://thessdreview.com/daily-news/latest-buzz/diablo-technologies-unveils-memory-channel-storage-outperforms-enterprise-ssd-and-pcie-flash-solutions/
LMDB's Single-Level-Store design gets instant benefit from today's and tomorrow's technology without wasting CPU cycles in unnecessary complexity. Everything else is just noise.
Marcelo, what happens when you write a lot of items that need to go into a particular page? Since you have a sorted list, if the writes happen in the middle / start of the list, you might need to move all the keys.
Ayende: I only need to insert a new key in the sorted list when a page is full and I need to do a page split. On a page split, I take half of the items in one page, move them to a new one, and insert the first key of that new page into the sorted list. That's why I have two levels, one with the sorted keys and another with the pages. Since a page is a dictionary, I don't need to sort keys on every insert; I only need to sort the keys in a page when iterating the collection / doing a range scan (I don't need that for reading a single value) or when doing a page split. For the specific case of having a lot of inserts happening with a key value lower than the lowest value in the list, I have a "spare" page where all those items go. When that page is full, I don't even split it. I simply sort it, store it, read the first key, insert it in the list (yes, at that moment that would imply moving all the keys one slot up, but as I mentioned, with 10M items I only have to move 2k keys, which is not really an issue), and create a new spare page.
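Continuing the hypothetical PagedStore sketch from above, the split described here might look roughly like this; again, this is my own guess at the code, not Marcelo's:

using System;
using System.Collections.Generic;
using System.Linq;

// Inside the PagedStore class from the earlier sketch (pages, index, PageIndexEntry are defined there).
void SplitPage(int pageNumber, ref int nextPageNumber)
{
    var full = pages[pageNumber];
    var sortedKeys = full.Keys.ToList();
    sortedKeys.Sort(StringComparer.Ordinal);            // pages are only sorted when they have to be

    // move the upper half of the keys into a brand new page
    var newPage = new Dictionary<string, byte[]>();
    foreach (var key in sortedKeys.Skip(sortedKeys.Count / 2))
    {
        newPage[key] = full[key];
        full.Remove(key);
    }
    int newPageNumber = nextPageNumber++;                // hypothetical page numbering scheme
    pages[newPageNumber] = newPage;

    // only now does the in-memory index get a new (FirstKey, PageNumber) entry
    var entry = new PageIndexEntry
    {
        FirstKey = sortedKeys[sortedKeys.Count / 2],
        PageNumber = newPageNumber,
        ItemCount = newPage.Count
    };
    int pos = index.FindIndex(e => string.CompareOrdinal(e.FirstKey, entry.FirstKey) > 0);
    index.Insert(pos < 0 ? index.Count : pos, entry);
}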
@Howard, @Ayende. I would love to hear more of your opinion on the Tokudb approach
Marco: Couldn't you use an SSTable style design to have an efficient lookup without having the entire sorted key list in memory?
@Jeremiah: As I said, I don't have the entire list of keys in memory. I just have the necessary keys to do the page lookup.