More thoughts about compression & storage
One of the things that I liked in LevelDB is that it pretty much had compression built in from day one. The entire format is built to save space just about wherever it can, from using 7 bit numbers to prefix compression to making compression a core part of the on disk data structure. Considering the cost of I/O, it is usually worth trading CPU time for I/O costs. LMDB, in contrast, went quite the opposite way. It actively discourages compression, since there is no support for it in the format. But it is more than that: LMDB is meant to be used by getting a memory pointer directly, and it actively encourages you to do things such as having a pointer reference from one record to another record.
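To make the "7 bit numbers" point concrete, here is a minimal sketch (my own illustration, not LevelDB's actual code) of varint encoding: each byte carries 7 bits of the value, and the high bit marks whether another byte follows, so small lengths and offsets fit in a single byte.

```csharp
using System.IO;

static class VarInt
{
    // Write an unsigned integer using as few bytes as possible.
    public static void Write(Stream output, uint value)
    {
        while (value >= 0x80)
        {
            output.WriteByte((byte)(value | 0x80)); // low 7 bits + "more follows" flag
            value >>= 7;
        }
        output.WriteByte((byte)value); // last byte has the high bit clear
    }

    // Read a value written by Write().
    public static uint Read(Stream input)
    {
        uint result = 0;
        int shift = 0;
        while (true)
        {
            int b = input.ReadByte();
            result |= (uint)(b & 0x7F) << shift;
            if ((b & 0x80) == 0)
                return result;
            shift += 7;
        }
    }
}
```

A length like 127 takes one byte instead of four, which adds up quickly when every key and value is prefixed with its length.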
The idea is that you can just persist your in memory state, with zero difference between the on disk structure and the in memory structure. This allows you to take full advantage of the memory mapped nature of the database.
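As a rough sketch of what that buys you, consider reading a fixed-layout struct straight out of a memory mapped file in .NET (the file name and struct here are made up for illustration): there is no deserialization step, because the on-disk layout is the in-memory layout.

```csharp
using System;
using System.IO.MemoryMappedFiles;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential, Pack = 1)]
struct RecordHeader
{
    public long Id;
    public int Size;
}

class Program
{
    static void Main()
    {
        // Hypothetical data file; the bytes are read in place from the OS page
        // cache, with no parsing or format conversion step.
        using (var mmf = MemoryMappedFile.CreateFromFile("data.mdb"))
        using (var accessor = mmf.CreateViewAccessor())
        {
            accessor.Read(0, out RecordHeader header);
            Console.WriteLine("Id = {0}, Size = {1}", header.Id, header.Size);
        }
    }
}
```

Compress the page, though, and that property is gone: you have to decompress before you can hand out a pointer.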
Voron, however, doesn’t really do that. Compressing the data can significantly cut the amount of I/O we use, and can allow us to pack much more data into the same space. Now, in a few places, Howard Chu (creator of LMDB) points out that you don’t need any support from LMDB to add compression, and points to this as proof. I don’t find this satisfactory. I think that this puts a lot of effort on the application side, and requires a lot of special hacks.
Now, what do we talk about when we talk about compression? We can, of course, compress each record individually. But that just means that we remove redundancies within that single value. In practice, it is almost always better to compress multiple values together, because then you can get much better compression ratios.
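As a quick sketch of why that is (the sample data here is made up, run it to see your own numbers), compare deflating a batch of similar records one by one with deflating them as a single blob; the structure shared across records only gets exploited in the second case.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;

class IndividualVsBatched
{
    static byte[] Deflate(byte[] data)
    {
        using (var output = new MemoryStream())
        {
            using (var deflate = new DeflateStream(output, CompressionMode.Compress))
                deflate.Write(data, 0, data.Length);
            return output.ToArray();
        }
    }

    static void Main()
    {
        // A hundred similar looking records (hypothetical sample data).
        var records = Enumerable.Range(0, 100)
            .Select(i => Encoding.UTF8.GetBytes("{\"Id\":" + i + ",\"Name\":\"User " + i + "\",\"Active\":true}"))
            .ToArray();

        // Compressing each record on its own only removes redundancy inside a single value.
        var individually = records.Sum(r => Deflate(r).Length);

        // Compressing them together also removes the redundancy shared between values.
        var together = Deflate(records.SelectMany(r => r).ToArray()).Length;

        Console.WriteLine("Individually: {0:N0} bytes, together: {1:N0} bytes", individually, together);
    }
}
```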
For reference, I took the opposite view a while ago; see here: http://ayende.com/blog/162466/some-thoughts-about-compression-storage
But what if we compress the whole page? Taking a sample page from one of our files, I can see that compressing it (using DeflateStream) takes essentially zero time, and that a 4KB page of sample data gets turned into about 800 bytes.
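For those who want to reproduce that kind of measurement, here is a rough sketch (the page contents are fabricated, so the exact numbers will differ from the sample page above):

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;

class PageCompression
{
    static void Main()
    {
        // Build a hypothetical 4KB page of repetitive key/value text.
        var text = string.Concat(Enumerable.Range(0, 100)
            .Select(i => "users/" + i + ";Name=User " + i + ";Active=true\n"));
        var page = Encoding.UTF8.GetBytes(text);
        Array.Resize(ref page, 4096);

        var sw = Stopwatch.StartNew();
        byte[] compressed;
        using (var output = new MemoryStream())
        {
            using (var deflate = new DeflateStream(output, CompressionMode.Compress))
                deflate.Write(page, 0, page.Length);
            compressed = output.ToArray();
        }
        sw.Stop();

        Console.WriteLine("{0} bytes -> {1} bytes in {2:F3} ms",
            page.Length, compressed.Length, sw.Elapsed.TotalMilliseconds);
    }
}
```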
That means a lot, because it means that I can shove more data into a single page. And if I can do that, it means that I can also (drastically) reduce the height of the tree. Since tree height is one of the major factors in I/O costs in B-Tree based systems, this might actually have a lot more performance implications than it would seem.
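A back-of-the-envelope calculation shows why (the data set size and fan-out numbers here are just illustrative assumptions): the height of a B-Tree grows with log_b(N), so packing five times as many entries into a page can shave a whole level off the tree, and with it one random read per lookup.

```csharp
using System;

class TreeHeightEstimate
{
    // Rough height estimate for a B-Tree: log of the entry count, base "entries per page".
    static int Height(long entries, int entriesPerPage) =>
        (int)Math.Ceiling(Math.Log(entries) / Math.Log(entriesPerPage));

    static void Main()
    {
        const long entries = 100_000_000;        // hypothetical data set
        Console.WriteLine(Height(entries, 100)); // ~4 levels at 100 entries per page
        Console.WriteLine(Height(entries, 500)); // ~3 levels if compression packs in 5x more
    }
}
```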
It is also a lot harder. For example, you would have a very hard time telling ahead of time whether an addition / deletion is going to fit within the page boundary, or whether it is going to require a split. For now, it is a really nice idea, with some interesting perf implications for later on, but I don’t think we’ll explore this very closely in the near future.
Comments
But is such compression any better than a compressed filesystem?
Rafal, The problem with compressed FS is that you need to ensure that this works with things like fsync and write through. But yes, that is probably something that we'll end up doing.
sooo, if compression is on by default with Voron does that just become a standard feature of RavenDB?
Khalid, At the moment, we aren't doing this at all.
Look at how SQL Server implements PAGE compression. The way they do it allows for efficient incremental updates at the cost of a lower compression ratio. They also do not have to decompress the entire page to read one row. I think you'll see your perf totally tank if single-row access requires decompressing an entire page.
Tobi, Any reference for this?
There are internals series on that which I have read. I don't have them handy. You'll find them by going through a few results of https://www.google.com/webhp?complete=1&hl=en#complete=1&hl=en&q=sql+server+page+compression+internals
Basically, they store the compression dictionary with the page header. Most compressors build the dict on the fly and never store it. SQL Server stores it so that it can do incremental compression.
If your decompression algorithm can decompress 100MB/s then you can decompress 25k pages per second. I don't think that's acceptable for mostly random single-row accesses. Without compression I'd expect 1-2 orders of magnitude more throughput.