Raven Storage–Voron’s Performance
Voron is the code name for another storage engine that we are currently trying out, based on LMDB. After taking it for a spin for a while, it is now pretty complete, and I decided to give it some perf test runs. So far, there has been zero performance work. All the usual caveats apply here, with regards to short test runs, etc.
Just like before, we found that this was horribly slow on the first run. The culprit was our debug code, which verified the entire structure whenever we added or removed something. Once we ran it in release mode, we started getting more interesting results.
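One common way to get that split between debug and release behavior is to put the verification behind a [Conditional("DEBUG")] attribute, so that release builds drop the calls entirely. A minimal sketch, in which the Tree class and ValidateStructure method are purely illustrative and not Voron's actual internals:

using System.Diagnostics;
using System.IO;

public class Tree
{
    // Calls to this method, including the evaluation of its arguments,
    // are removed entirely from builds that do not define DEBUG.
    [Conditional("DEBUG")]
    private void ValidateStructure(object tx)
    {
        // Hypothetical full-structure walk: key ordering, page boundaries,
        // parent/child links, free space accounting, etc.
    }

    public void Add(object tx, string key, Stream value)
    {
        // ... the actual insert work goes here ...
        ValidateStructure(tx); // costs nothing in release builds
    }
}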
Here is the test code:
using (var env = new StorageEnvironment(new MemoryMapPager("bench.data", flushMode)))
{
    using (var tx = env.NewTransaction(TransactionFlags.ReadWrite))
    {
        var value = new byte[100];
        new Random().NextBytes(value);
        var ms = new MemoryStream(value);
        for (long i = 0; i < Count; i++)
        {
            ms.Position = 0;
            env.Root.Add(tx, i.ToString("0000000000000000"), ms);
        }
        tx.Commit();
    }

    using (var tx = env.NewTransaction(TransactionFlags.ReadWrite))
    {
        DebugStuff.RenderAndShow(tx, tx.GetTreeInformation(env.Root).Root);
        tx.Commit();
    }
}
We write 1 million entries, each 100 bytes in size, with keys generated from an 8 byte counter. We run it in three modes:
fill seq none : 9,578 ms 104,404 ops / sec
fill seq buff : 10,802 ms 92,575 ops / sec
fill seq sync : 9,387 ms 106,528 ops / sec
None means no flushing to disk at all; we let the OS deal with that completely. Buffers means we flush to the OS, but not to the disk, and sync means we do a full fsync.
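To make the distinction concrete, here is a minimal sketch of what the three modes would map to on a plain FileStream; the FlushMode enum and Flusher class here are assumptions for illustration, not Voron's actual types:

using System.IO;

public enum FlushMode { None, Buffers, Sync }

public static class Flusher
{
    public static void FlushAfterCommit(FileStream file, FlushMode mode)
    {
        switch (mode)
        {
            case FlushMode.None:
                // Do nothing; the OS writes dirty pages back whenever it likes.
                break;
            case FlushMode.Buffers:
                // Push our buffers to the OS, but not all the way to the disk.
                file.Flush(flushToDisk: false);
                break;
            case FlushMode.Sync:
                // Equivalent to FlushFileBuffers / fsync; durable once it returns.
                file.Flush(flushToDisk: true);
                break;
        }
    }
}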
Now, this is a pretty stupid way to go about it, I have to say. This is doing everything in a single transaction, and we are actually counting the time to open & close the db here as well, but that is okay for now. We aren't interested in real numbers, just some rough ideas.
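If we wanted to spread that cost out instead of paying it all at once, the same loop could be split into multiple smaller transactions. This is just a sketch against the API shown above, with an arbitrary, untuned batch size:

const int batchSize = 10 * 1000; // arbitrary, not tuned

using (var env = new StorageEnvironment(new MemoryMapPager("bench.data", flushMode)))
{
    var value = new byte[100];
    new Random().NextBytes(value);
    var ms = new MemoryStream(value);

    for (long start = 0; start < Count; start += batchSize)
    {
        using (var tx = env.NewTransaction(TransactionFlags.ReadWrite))
        {
            var end = Math.Min(start + batchSize, Count);
            for (long i = start; i < end; i++)
            {
                ms.Position = 0;
                env.Root.Add(tx, i.ToString("0000000000000000"), ms);
            }
            // In sync mode this is one fsync per batch instead of one for the whole run.
            tx.Commit();
        }
    }
}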
Now, let us see how we read it:
using (var env = new StorageEnvironment(new MemoryMapPager("bench.data")))
{
    using (var tx = env.NewTransaction(TransactionFlags.Read))
    {
        var ms = new MemoryStream(100);
        for (int i = 0; i < Count; i++)
        {
            var key = i.ToString("0000000000000000");
            using (var stream = env.Root.Read(tx, key))
            {
                ms.Position = 0;
                stream.CopyTo(ms);
            }
        }
        tx.Commit();
    }
}
And this gives us:
read seq : 3,289 ms 304,032 ops / sec
And again, this is with opening & closing the entire db.
We could do better with pre-allocation of space on the disk, etc. But I wanted to keep things realistic and to allow us to grow.
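For reference, pre-allocation would amount to a one-time SetLength call on the data file before handing it to the pager. A sketch with an arbitrary 1 GB reservation, traded against the grow-on-demand behavior described above:

using System.IO;

// Reserve space up front so the memory mapped file does not need to be
// grown (and remapped) while the benchmark runs.
using (var fs = new FileStream("bench.data", FileMode.OpenOrCreate,
                               FileAccess.ReadWrite, FileShare.None))
{
    fs.SetLength(1024L * 1024 * 1024); // 1 GB, chosen arbitrarily
}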
Next, I wanted to see how much we would gain by parallelizing everything, so I wrote the following code:
using (var env = new StorageEnvironment(new MemoryMapPager("bench.data")))
{
    var countdownEvent = new CountdownEvent(parts);
    for (int i = 0; i < parts; i++)
    {
        var currentBase = i;
        ThreadPool.QueueUserWorkItem(state =>
        {
            using (var tx = env.NewTransaction(TransactionFlags.Read))
            {
                var ms = new MemoryStream(100);
                for (int j = 0; j < Count / parts; j++)
                {
                    var current = j * currentBase;
                    var key = current.ToString("0000000000000000");
                    using (var stream = env.Root.Read(tx, key))
                    {
                        ms.Position = 0;
                        stream.CopyTo(ms);
                    }
                }
                tx.Commit();
            }
            countdownEvent.Signal();
        });
    }
    countdownEvent.Wait();
}
I then ran it with 1 – 16 parts, to see how it behaves.
And the results pretty much match what I expected:
read seq : 3,317 ms 301,424 ops / sec
read parallel 1 : 2,539 ms 393,834 ops / sec
read parallel 2 : 1,950 ms 512,711 ops / sec
read parallel 4 : 2,201 ms 454,172 ops / sec
read parallel 8 : 2,139 ms 467,387 ops / sec
read parallel 16 : 2,010 ms 497,408 ops / sec
We get a 2x perf improvement from running on two cores. Running on 4 threads requires some shuffling around, which caused a small perf drop, and beyond that we mostly see more time spent in thread switching than anything else. As you can see, we get a really nice jump in performance the more cores we use, until we hit the actual machine limitations.
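One way to keep the thread switching in check is to cap the number of reader threads at the core count and give each thread a disjoint, contiguous slice of the key space. A sketch using the same API as above:

int parts = Math.Min(Environment.ProcessorCount, 16);
long perPart = Count / parts;

using (var env = new StorageEnvironment(new MemoryMapPager("bench.data")))
{
    var countdownEvent = new CountdownEvent(parts);
    for (int i = 0; i < parts; i++)
    {
        long from = i * perPart;
        long to = (i == parts - 1) ? Count : from + perPart;
        ThreadPool.QueueUserWorkItem(state =>
        {
            using (var tx = env.NewTransaction(TransactionFlags.Read))
            {
                var ms = new MemoryStream(100);
                // Each thread reads only its own contiguous key range.
                for (long current = from; current < to; current++)
                {
                    using (var stream = env.Root.Read(tx, current.ToString("0000000000000000")))
                    {
                        ms.Position = 0;
                        stream.CopyTo(ms);
                    }
                }
                tx.Commit();
            }
            countdownEvent.Signal();
        });
    }
    countdownEvent.Wait();
}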
Note that I made sure to clear the OS buffer cache before each test. If we don't do that, we get:
read seq : 2,562 ms 390,291 ops / sec
read parallel 1 : 2,608 ms 383,393 ops / sec
read parallel 2 : 1,868 ms 535,220 ops / sec
read parallel 4 : 1,646 ms 607,283 ops / sec
read parallel 8 : 1,673 ms 597,557 ops / sec
read parallel 16 : 1,581 ms 632,309 ops / sec
So far, I am pretty happy with those numbers. What I am not happy with is the current API, but I'll talk about that in a separate post.
Comments
Interesting project! Looking forward to seeing how this evolves.
These numbers don't make sense, though, just like the last set of benchmarks.
The delta between fsync and non-fsync writes should be much greater. I suspect you're only writing to the disk cache most of the time and that the writes aren't safe.
Can you describe what kind of disk is being used? I would like to look up the cache details.
Also, to see proper CPU usage and disk characteristics while benchmarking, I suggest using something for Windows that's similar to iostat or dstat. Task Manager is not going to tell much of a story that's worthwhile.
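For what it is worth, the PhysicalDisk performance counters are probably the closest built-in equivalent to iostat that is easy to reach from .NET. A rough sketch for sampling them during a run, assuming the standard English counter names:

using System;
using System.Diagnostics;
using System.Threading;

class DiskStats
{
    static void Main()
    {
        // "_Total" aggregates all physical disks; use a specific instance to isolate one drive.
        using (var writes = new PerformanceCounter("PhysicalDisk", "Disk Writes/sec", "_Total"))
        using (var queue = new PerformanceCounter("PhysicalDisk", "Avg. Disk Queue Length", "_Total"))
        {
            for (int i = 0; i < 30; i++)
            {
                Thread.Sleep(1000); // rate counters need two samples before they report anything useful
                Console.WriteLine("{0,10:N0} writes/sec   queue = {1:N2}",
                    writes.NextValue(), queue.NextValue());
            }
        }
    }
}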
Kelly, There is just one transaction, so just one set of fsync() calls. Also note that we close the db _in the benchmark_. That means that we are effectively flushing the file to the OS no matter what. Yes, we are using SSDs. Those are pretty early results, just there to show that we are on the right track.
Great results. BTW, do you employ the write scheduling algorithm you posted as the programming exercise?
Rafal, Not quite, we are doing something different.
Don't know if you've had a chance to listen.
http://slideshot.epfl.ch/play/suri_stonebraker#2
Prof. Michael Stonebraker has some opinions on DBs; check out slide 19, where he shows where the major bottlenecks for DBs are. Might be of use.
Preaching to the choir, Joram. Regarding that slide 19: LMDB does no locking, no latching, no recovery, and has no buffer pool. All of that crap is gone, totally unnecessary. All of that is a waste of time.
All old news. LMDB's basic principle, single-level-store, is aimed at eliminating the difference between on-disk format and in-memory format.
... sorry for the multiple one-line posts. Just came back from cocktail hour and the thoughts are only rolling in one at a time ...
It's important to remember, when talking about "eliminating the difference between on-disk and in-memory format" - this doesn't mean you can just use any arbitrary in-memory format. LMDB is still a page-oriented DB because memory pages are the fundamental unit of the virtual memory system. If you just throw architecture concepts out the window and use a free form in-memory data structure you will lose performance.
Slide 32, saying that disk-oriented data format is slower, is dead wrong. What is correct is to ensure that you have only one format, so there is no decode step needed to transfer from disk to memory, but as I pointed out, your format cannot be oblivious to the realities of the paged memory management system.
In LMDB we have the MDB_FIXEDMAP option for the true single-level-store use. The purpose of this option is to allow you to create records that use live pointers and store them in the memory map, and then use them directly during future reads. The presentation is at least correct in this regard - your aim is to eliminate the decode step.
Heh, those slides are from 2010. We first published these principles of LMDB in 2009.
http://www.openldap.org/lists/openldap-devel/200905/msg00036.html
Johnny-come-latelies...
Very true Howard.
I was, however, hoping the info about the cost of multithreading might be useful to Ayende.