Raven Storage–Voron’s Performance
Voron is the code name for another storage engine that we are currently trying out, based on LMDB. After taking it for a spin for a while, it is now pretty complete, and I decided to give it some perf test runs. So far, there has been zero performance work. All the usual caveats apply here, with regards to short test runs, etc.
Just like before, we found that this was horribly slow on the first run. The culprit was our debug code, which verified the entire structure whenever we added or removed something. Once we ran it in release mode, we started getting more interesting results.
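One common way to get that split between debug and release behavior is to put the verification behind a [Conditional("DEBUG")] attribute, so that release builds drop the calls entirely. A minimal sketch, in which the Tree class and ValidateStructure method are purely illustrative and not Voron's actual internals:

using System.Diagnostics;
using System.IO;

public class Tree
{
    // Calls to this method, including the evaluation of its arguments,
    // are removed entirely from builds that do not define DEBUG.
    [Conditional("DEBUG")]
    private void ValidateStructure(object tx)
    {
        // Hypothetical full-structure walk: key ordering, page boundaries,
        // parent/child links, free space accounting, etc.
    }

    public void Add(object tx, string key, Stream value)
    {
        // ... the actual insert work goes here ...
        ValidateStructure(tx); // costs nothing in release builds
    }
}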
Here is the test code:
using (var env = new StorageEnvironment(new MemoryMapPager("bench.data", flushMode)))
{
    using (var tx = env.NewTransaction(TransactionFlags.ReadWrite))
    {
        var value = new byte[100];
        new Random().NextBytes(value);
        var ms = new MemoryStream(value);
        for (long i = 0; i < Count; i++)
        {
            ms.Position = 0;
            env.Root.Add(tx, i.ToString("0000000000000000"), ms);
        }
        tx.Commit();
    }

    using (var tx = env.NewTransaction(TransactionFlags.ReadWrite))
    {
        DebugStuff.RenderAndShow(tx, tx.GetTreeInformation(env.Root).Root);
        tx.Commit();
    }
}
We write 1 million entries, each 100 bytes in size, with keys generated from an 8 byte counter. We run it in three modes:
fill seq none : 9,578 ms 104,404 ops / sec
fill seq buff : 10,802 ms 92,575 ops / sec
fill seq sync : 9,387 ms 106,528 ops / sec
None means no flushing to disk at all; we let the OS deal with that completely. Buffers means we flush to the OS, but not to the disk, and sync means we do a full fsync.
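To make the distinction concrete, here is a minimal sketch of what the three modes would map to on a plain FileStream; the FlushMode enum and Flusher class here are assumptions for illustration, not Voron's actual types:

using System.IO;

public enum FlushMode { None, Buffers, Sync }

public static class Flusher
{
    public static void FlushAfterCommit(FileStream file, FlushMode mode)
    {
        switch (mode)
        {
            case FlushMode.None:
                // Do nothing; the OS writes dirty pages back whenever it likes.
                break;
            case FlushMode.Buffers:
                // Push our buffers to the OS, but not all the way to the disk.
                file.Flush(flushToDisk: false);
                break;
            case FlushMode.Sync:
                // Equivalent to FlushFileBuffers / fsync; durable once it returns.
                file.Flush(flushToDisk: true);
                break;
        }
    }
}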
Now, this is a pretty stupid way to go about it, I have to say. This is doing everything in a single transaction, and we are actually counting the time to open & close the db here as well, but that is okay for now. We aren't interested in real numbers, just some rough ideas.
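If we wanted to spread that cost out instead of paying it all at once, the same loop could be split into multiple smaller transactions. This is just a sketch against the API shown above, with an arbitrary, untuned batch size:

const int batchSize = 10 * 1000; // arbitrary, not tuned

using (var env = new StorageEnvironment(new MemoryMapPager("bench.data", flushMode)))
{
    var value = new byte[100];
    new Random().NextBytes(value);
    var ms = new MemoryStream(value);

    for (long start = 0; start < Count; start += batchSize)
    {
        using (var tx = env.NewTransaction(TransactionFlags.ReadWrite))
        {
            var end = Math.Min(start + batchSize, Count);
            for (long i = start; i < end; i++)
            {
                ms.Position = 0;
                env.Root.Add(tx, i.ToString("0000000000000000"), ms);
            }
            // In sync mode this is one fsync per batch instead of one for the whole run.
            tx.Commit();
        }
    }
}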
Now, let us see how we read it:
using (var env = new StorageEnvironment(new MemoryMapPager("bench.data")))
{
    using (var tx = env.NewTransaction(TransactionFlags.Read))
    {
        var ms = new MemoryStream(100);
        for (int i = 0; i < Count; i++)
        {
            var key = i.ToString("0000000000000000");
            using (var stream = env.Root.Read(tx, key))
            {
                ms.Position = 0;
                stream.CopyTo(ms);
            }
        }
        tx.Commit();
    }
}
And this gives us:
read seq : 3,289 ms 304,032 ops / sec
And again, this is with opening & closing the entire db.
We could do better with pre-allocation of space on the disk, etc. But I wanted to keep things realistic and to allow us to grow.
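For reference, pre-allocation would amount to a one-time SetLength call on the data file before handing it to the pager. A sketch with an arbitrary 1 GB reservation, traded against the grow-on-demand behavior described above:

using System.IO;

// Reserve space up front so the memory mapped file does not need to be
// grown (and remapped) while the benchmark runs.
using (var fs = new FileStream("bench.data", FileMode.OpenOrCreate,
                               FileAccess.ReadWrite, FileShare.None))
{
    fs.SetLength(1024L * 1024 * 1024); // 1 GB, chosen arbitrarily
}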
Next, I wanted to see how much we would gain by parallelizing everything, so I wrote the following code:
using (var env = new StorageEnvironment(new MemoryMapPager("bench.data")))
{
    var countdownEvent = new CountdownEvent(parts);
    for (int i = 0; i < parts; i++)
    {
        var currentBase = i;
        ThreadPool.QueueUserWorkItem(state =>
        {
            using (var tx = env.NewTransaction(TransactionFlags.Read))
            {
                var ms = new MemoryStream(100);
                for (int j = 0; j < Count / parts; j++)
                {
                    var current = j * currentBase;
                    var key = current.ToString("0000000000000000");
                    using (var stream = env.Root.Read(tx, key))
                    {
                        ms.Position = 0;
                        stream.CopyTo(ms);
                    }
                }
                tx.Commit();
            }
            countdownEvent.Signal();
        });
    }
    countdownEvent.Wait();
}
I then ran it with 1 – 16 parts, to see how it behaves.
And the results pretty much match what I expected:
read seq : 3,317 ms 301,424 ops / sec
read parallel 1 : 2,539 ms 393,834 ops / sec
read parallel 2 : 1,950 ms 512,711 ops / sec
read parallel 4 : 2,201 ms 454,172 ops / sec
read parallel 8 : 2,139 ms 467,387 ops / sec
read parallel 16 : 2,010 ms 497,408 ops / sec
We get a 2x perf improvement from running on two cores. Running on 4 threads requires some shuffling around, which caused a small perf drop, and beyond that we mostly see more time spent in thread switching than anything else. As you can see, we get a really nice jump in performance the more cores we use, until we hit the actual machine limitations.
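One way to keep the thread switching in check is to cap the number of reader threads at the core count and give each thread a disjoint, contiguous slice of the key space. A sketch using the same API as above:

int parts = Math.Min(Environment.ProcessorCount, 16);
long perPart = Count / parts;

using (var env = new StorageEnvironment(new MemoryMapPager("bench.data")))
{
    var countdownEvent = new CountdownEvent(parts);
    for (int i = 0; i < parts; i++)
    {
        long from = i * perPart;
        long to = (i == parts - 1) ? Count : from + perPart;
        ThreadPool.QueueUserWorkItem(state =>
        {
            using (var tx = env.NewTransaction(TransactionFlags.Read))
            {
                var ms = new MemoryStream(100);
                // Each thread reads only its own contiguous key range.
                for (long current = from; current < to; current++)
                {
                    using (var stream = env.Root.Read(tx, current.ToString("0000000000000000")))
                    {
                        ms.Position = 0;
                        stream.CopyTo(ms);
                    }
                }
                tx.Commit();
            }
            countdownEvent.Signal();
        });
    }
    countdownEvent.Wait();
}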
Note that I made sure to clear the OS buffer cache before each test. If we don't do that, we get:
read seq : 2,562 ms 390,291 ops / sec
read parallel 1 : 2,608 ms 383,393 ops / sec
read parallel 2 : 1,868 ms 535,220 ops / sec
read parallel 4 : 1,646 ms 607,283 ops / sec
read parallel 8 : 1,673 ms 597,557 ops / sec
read parallel 16 : 1,581 ms 632,309 ops / sec
So far, I am pretty happy with those numbers. What I am not happy with is the current API, but I'll talk about that in a separate post.
Comments
Interesting project! Looking forward to seeing how this evolves.
These numbers don't make sense, though, just like the last set of benchmarks.
The delta between fsync and non-fsync writes should be much greater. I suspect you're only writing to the disk cache most of the time and that the writes aren't safe.
Can you describe what kind of disk is being used? I would like to look up the cache details.
Also, to see proper CPU usage and disk characteristics while benchmarking, I suggest using something for Windows that's similar to iostat or dstat. Task Manager is not going to tell much of a story that's worthwhile.
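For what it is worth, the PhysicalDisk performance counters are probably the closest built-in equivalent to iostat that is easy to reach from .NET. A rough sketch for sampling them during a run, assuming the standard English counter names:

using System;
using System.Diagnostics;
using System.Threading;

class DiskStats
{
    static void Main()
    {
        // "_Total" aggregates all physical disks; use a specific instance to isolate one drive.
        using (var writes = new PerformanceCounter("PhysicalDisk", "Disk Writes/sec", "_Total"))
        using (var queue = new PerformanceCounter("PhysicalDisk", "Avg. Disk Queue Length", "_Total"))
        {
            for (int i = 0; i < 30; i++)
            {
                Thread.Sleep(1000); // rate counters need two samples before they report anything useful
                Console.WriteLine("{0,10:N0} writes/sec   queue = {1:N2}",
                    writes.NextValue(), queue.NextValue());
            }
        }
    }
}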
Kelly, There is just one transaction, so just one set of fsync() calls. Also note that we close the db _in the benchmark_. That means that we are effectively flushing the file to the OS no matter what. Yes, we are using SSDs. Those are pretty early results, just there to show that we are on the right track.
Great results. BTW, do you employ the write scheduling algorithm you posted as the programming exercise?
Rafal, Not quite, we are doing something different.
Don't know if you've had a chance to listen.
http://slideshot.epfl.ch/play/suri_stonebraker#2
Prof. Michael Stonebraker has some opinions on DBs; check out slide 19, where he shows where the major bottlenecks for DBs are. Might be of use.
Preaching to the choir, Joram. Regarding that slide 19: LMDB does no locking, no latching, no recovery, and has no buffer pool. All of that crap is gone, totally unnecessary. All of that is a waste of time.
All old news. LMDB's basic principle, single-level-store, is aimed at eliminating the difference between on-disk format and in-memory format.
... sorry for the multiple one-line posts. Just came back from cocktail hour and the thoughts are only rolling in one at a time ...
It's important to remember, when talking about "eliminating the difference between on-disk and in-memory format" - this doesn't mean you can just use any arbitrary in-memory format. LMDB is still a page-oriented DB because memory pages are the fundamental unit of the virtual memory system. If you just throw architecture concepts out the window and use a free form in-memory data structure you will lose performance.
Slide 32, saying that disk-oriented data format is slower, is dead wrong. What is correct is to ensure that you have only one format, so there is no decode step needed to transfer from disk to memory, but as I pointed out, your format cannot be oblivious to the realities of the paged memory management system.
In LMDB we have the MDB_FIXEDMAP option for the true single-level-store use. The purpose of this option is to allow you to create records that use live pointers and store them in the memory map, and then use them directly during future reads. The presentation is at least correct in this regard - your aim is to eliminate the decode step.
Heh, those slides are from 2010. We first published these principles of LMDB in 2009.
http://www.openldap.org/lists/openldap-devel/200905/msg00036.html
Johnny-come-latelies...
Very true Howard.
I was, however, hoping the info about the cost of multithreading might be useful to Ayende.