Raven Storage–Voron’s Performance

time to read 10 min | 1804 words

Voron is the code name for another storage engine that we are currently trying, based on LMDB. After taking that for a spin for a while, it is not pretty complete, and I decided to give it some perf test runs. So far, there has been zero performance work. All the usual caveat applies here, with regards to short test runs, etc.

Just like before, we found that this is horribly slow on the first run. The culprit was our debug code that verified the entire structure whenever we added or removed something. One we run it in release mode we started getting more interesting results.

Here is the test code:

using (var env = new StorageEnvironment(new MemoryMapPager("bench.data", flushMode)))
{
    using (var tx = env.NewTransaction(TransactionFlags.ReadWrite))
    {
        var value = new byte[100];
        new Random().NextBytes(value);
        var ms = new MemoryStream(value);
        for (long i = 0; i < Count; i++)
        {
            ms.Position = 0;
            env.Root.Add(tx, i.ToString("0000000000000000"), ms);
        }

        tx.Commit();
    }
     using (var tx = env.NewTransaction(TransactionFlags.ReadWrite))
     {
         DebugStuff.RenderAndShow(tx, tx.GetTreeInformation(env.Root).Root);

         tx.Commit();
     }
}

We write 1 million entries with 100 bytes in size and 8 bytes of the key. We run it in three mode:

fill seq none                      :      9,578 ms    104,404 ops / sec
fill seq buff                      :     10,802 ms     92,575 ops / sec
fill seq sync                      :      9,387 ms    106,528 ops / sec

None means no flushing to disk, let the OS deals with that completely. Buffers means flush to the OS, but not to disk, and full means do a full fsync.

Now, this is pretty stupid way to go about it, I’ve to say. This is doing everything in a single transaction, and we are actually counting the time to close & open the db here as well, but that is okay for now. We aren’t interested in real numbers, just some rough ideas.

Now, let us see how we read it?

using (var env = new StorageEnvironment(new MemoryMapPager("bench.data")))
{
    using (var tx = env.NewTransaction(TransactionFlags.Read))
    {
        var ms = new MemoryStream(100);
        for (int i = 0; i < Count; i++)
        {
            var key = i.ToString("0000000000000000");
            using (var stream = env.Root.Read(tx, key))
            {
                ms.Position = 0;
                stream.CopyTo(ms);
            }
        }

        tx.Commit();
    }
}

And this gives us:

read seq                           :      3,289 ms    304,032 ops / sec

And again, this is with opening & closing the entire db.

We could do better with pre-allocation of space on the disk, etc. But I wanted to keep things realistic and to allow us to grow.

Next, I wanted to see how much we would gain by parallelizing everything, so I wrote the following code:

using (var env = new StorageEnvironment(new MemoryMapPager("bench.data")))
{
    var countdownEvent = new CountdownEvent(parts);
    for (int i = 0; i < parts; i++)
    {
        var currentBase = i;
        ThreadPool.QueueUserWorkItem(state =>
        {
            using (var tx = env.NewTransaction(TransactionFlags.Read))
            {
                var ms = new MemoryStream(100);
                for (int j = 0; j < Count / parts; j++)
                {
                    var current = j * currentBase;
                    var key = current.ToString("0000000000000000");
                    using (var stream = env.Root.Read(tx, key))
                    {
                        ms.Position = 0;
                        stream.CopyTo(ms);
                    }
                }

                tx.Commit();
            }

            countdownEvent.Signal();
        });
    }
    countdownEvent.Wait();
}

I then run it with 1 – 16 parts, so see how it behaves. Here are the details for this machine:

image

And the results pretty much match what I expected:

read seq                           :      3,317 ms    301,424 ops / sec
read parallel 1                    :      2,539 ms    393,834 ops / sec
read parallel 2                    :      1,950 ms    512,711 ops / sec
read parallel 4                    :      2,201 ms    454,172 ops / sec
read parallel 8                    :      2,139 ms    467,387 ops / sec
read parallel 16                   :      2,010 ms    497,408 ops / sec

We get a 2x perf improvement from running on two cores, running on 4 threads require some dancing around, which caused some perf drop, then we see more time spent in thread switching than anything else, pretty much. As you can see, we see a really pretty jump in performance the more cores we use, until we reach the actual machine limitations.

Note that I made sure to clear the OS buffer cache before each test. If we don't do that, we get:

read seq                           :      2,562 ms    390,291 ops / sec
read parallel 1                    :      2,608 ms    383,393 ops / sec
read parallel 2                    :      1,868 ms    535,220 ops / sec
read parallel 4                    :      1,646 ms    607,283 ops / sec
read parallel 8                    :      1,673 ms    597,557 ops / sec
read parallel 16                   :      1,581 ms    632,309 ops / sec

So far, I am pretty happy with those numbers. What I am not happy is the current API, but I’ll talk about this in a separate post.