What is your control group?


One of the areas where we think Voron can be improved is its free space utilization policy. In particular, smarter free space utilization can lead to better performance, since we won’t have to seek so much.

I spent some time working on that, and I got something that, on paper at least, looks much better performance wise. But… actual benchmarks showed little to no improvement, and in some cases actual degradation! That was the point when I realized that I actually needed some sort of control, to see what the absolute optimal scenario for us would be. So I wrote a null free space policy. With no free space reuse, Voron will always go to the end of the file, giving us the best case scenario of sequential writes.
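The idea can be sketched like so (a minimal illustration in Python, not Voron’s actual code; the class and method names here are mine):

```python
# Sketch of a "null" free space policy: never reuse freed pages, always
# allocate from the end of the file, so every flush lands as one
# sequential write.
class NullFreeSpacePolicy:
    def __init__(self):
        self.next_page = 0  # first unallocated page number

    def allocate(self, count):
        """Return the first page of a run of `count` fresh pages."""
        start = self.next_page
        self.next_page += count
        return start

    def free(self, page, count):
        # Deliberately a no-op: freed pages are never reused, the file
        # only grows. That is the best case for sequential writes.
        pass


policy = NullFreeSpacePolicy()
first = policy.allocate(2)   # pages 0-1
second = policy.allocate(8)  # pages 2-9
policy.free(first, 2)        # ignored by design
third = policy.allocate(1)   # page 10, not a reused page
```

Since `free` never returns pages to circulation, the file grows without bound; that is exactly why it is only useful as a control, not as a real policy.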

This gives us the following behavior:

Flush      1 with   2 pages -   8 kb writes and   1 seeks (  2 leaves,   0 branches,   0 overflows)
Flush      2 with   8 pages -  32 kb writes and   1 seeks (  7 leaves,   1 branches,   0 overflows)
Flush      3 with  10 pages -  40 kb writes and   1 seeks (  9 leaves,   1 branches,   0 overflows)
Flush     27 with  74 pages - 296 kb writes and   1 seeks ( 72 leaves,   2 branches,   0 overflows)
Flush     28 with  74 pages - 296 kb writes and   1 seeks ( 72 leaves,   2 branches,   0 overflows)
Flush     29 with  72 pages - 288 kb writes and   1 seeks ( 70 leaves,   2 branches,   0 overflows)
Flush  1,153 with 155 pages - 620 kb writes and   1 seeks (102 leaves,  53 branches,   0 overflows)
Flush  1,154 with 157 pages - 628 kb writes and   1 seeks (104 leaves,  53 branches,   0 overflows)
Flush  1,155 with 165 pages - 660 kb writes and   1 seeks (108 leaves,  57 branches,   0 overflows)
Flush  4,441 with 191 pages - 764 kb writes and   1 seeks (104 leaves,  87 branches,   0 overflows)
Flush  4,442 with 196 pages - 784 kb writes and   1 seeks (107 leaves,  89 branches,   0 overflows)
Flush  4,443 with 198 pages - 792 kb writes and   1 seeks (108 leaves,  90 branches,   0 overflows)
Flush  7,707 with 200 pages - 800 kb writes and   1 seeks (106 leaves,  94 branches,   0 overflows)
Flush  7,708 with 204 pages - 816 kb writes and   1 seeks (106 leaves,  98 branches,   0 overflows)
Flush  7,709 with 211 pages - 844 kb writes and   1 seeks (113 leaves,  98 branches,   0 overflows)
Flush  9,069 with 209 pages - 836 kb writes and   1 seeks (107 leaves, 102 branches,   0 overflows)
Flush  9,070 with 205 pages - 820 kb writes and   1 seeks (106 leaves,  99 branches,   0 overflows)
Flush  9,071 with 208 pages - 832 kb writes and   1 seeks (108 leaves, 100 branches,   0 overflows)

And with this, 10,000 transactions with 100 random values each give:

fill rnd buff separate tx          :    106,383 ms      9,400 ops / sec

And that tells me that even in the best case scenario, there is something else that is causing this problem, and it ain’t the cost of doing seeks. I dropped the number of transactions to 500 and ran it through a profiler, and I got the following:

(images: profiler results)

In other words, pretty much the entire time was spent just calling FlushViewOfFile. However, we have already optimized that quite a bit, haven’t we? Looking at the calls, it seems that we have just one FlushViewOfFile per transaction in this scenario.

In fact, looking at the actual system behavior, we can see:

(image: system I/O trace)

So seek-wise, we are good. What I can’t understand, however, is why we see those ReadFile calls. Looking at the data, it appears that we run into this whenever we access a new portion of the file, so this is the mmap subsystem paging the file contents into memory before we write to it. It is actually pretty great that it is able to page 1 MB at a time.
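The pattern can be illustrated with a small sketch (Python for brevity, not Voron’s code; the file and region sizes here are made up): the file is mapped once, and each transaction writes into a region of the mapping that was never touched before, so the OS has to page that part of the file in before the write can complete.

```python
import mmap
import os
import tempfile

fd, path = tempfile.mkstemp()
touched = []  # offsets whose first touch forces a page-in
try:
    size = 4 * 1024 * 1024  # 4 MB file
    os.ftruncate(fd, size)
    with mmap.mmap(fd, size) as mm:
        page = mmap.PAGESIZE
        for tx in range(4):
            # Each "transaction" writes to a fresh region further into
            # the file - the first touch of these pages faults them in
            # (the ReadFile calls seen in the trace above).
            offset = tx * 16 * page
            mm[offset:offset + page] = b"x" * page
            mm.flush(offset, page)  # one flush per transaction
            touched.append(offset)
finally:
    os.close(fd)
    os.unlink(path)
```

The faults themselves happen inside the kernel, so a user-mode sketch like this can only reproduce the access pattern, not show the ReadFile calls directly.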

Next, let us see what else we can do here. I ran the 500 tx test on an HDD drive, and it gave me the following results:

fill rnd sync separate tx          :     25,540 ms      1,958 ops / sec

But note that each commit actually involves two writes: one at the end of the file, and one at the beginning of the file (which is the actual final act of the commit). What happens if we just remove that part?

This gives me a very different number:

fill rnd sync separate tx          :     21,764 ms      2,297 ops / sec

So just seeking back and writing a single page cost us 17% of our performance ((2,297 - 1,958) / 1,958 ≈ 17%). Here are the details from running this test:

(image: test run details)

Now, this is a meaningless test, added just to check what the relative costs are. We have to do the header write, otherwise we can’t do real transactions.
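To make the two writes concrete, here is a sketch (plain Python file I/O, not Voron’s code; the header layout is invented for illustration) of a commit that appends the transaction’s pages at the end of the file and then seeks back to overwrite a header at offset 0:

```python
import os
import struct
import tempfile

PAGE = 4096
# Invented header layout: (last committed tx id, offset of its pages).
HEADER_FMT = "<QQ"

def commit(f, tx_id, pages):
    f.seek(0, os.SEEK_END)
    data_offset = f.tell()
    f.write(b"".join(pages))       # write 1: sequential, at end of file
    f.flush()
    os.fsync(f.fileno())
    f.seek(0)                      # the seek back measured above
    f.write(struct.pack(HEADER_FMT, tx_id, data_offset))  # write 2: header
    f.flush()
    os.fsync(f.fileno())           # the actual commit point
    return data_offset

with tempfile.TemporaryFile() as f:
    f.write(b"\0" * PAGE)          # reserve page 0 for the header
    off1 = commit(f, 1, [b"a" * PAGE, b"b" * PAGE])
    off2 = commit(f, 2, [b"c" * PAGE])
```

Only once the header write is durable does the transaction count as committed, which is why that second write cannot simply be dropped in a real system.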

For fun, I ran the same thing using sequential writes, giving me 3,619 ops / sec. Since in both cases we are actually doing sequential writes, the major difference is how much we actually wrote. This is the view of writing sequentially:

(image: sequential write I/O trace)

As you can see, we only have to write 8 – 10 pages per transaction, compared to 110 – 130 in the random case. And that obviously has a lot of implications.
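As a quick sanity check on those numbers, using the midpoints of the ranges above:

```python
# Back-of-the-envelope check using the numbers reported above.
seq_pages_per_tx = (8 + 10) / 2      # sequential case: ~9 pages per tx
rnd_pages_per_tx = (110 + 130) / 2   # random case: ~120 pages per tx
amplification = rnd_pages_per_tx / seq_pages_per_tx
print(f"{amplification:.1f}x more pages written per transaction")  # 13.3x

seq_ops, rnd_ops = 3619, 2297        # ops / sec from the two runs
slowdown = seq_ops / rnd_ops
print(f"sequential is {slowdown:.2f}x faster")  # 1.58x
```

So the random case writes roughly 13x as many pages per transaction, yet ends up only about 1.6x slower, since both workloads are ultimately laid down as sequential writes.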

All of this has taught me something very important. In the end, the actual free space policy matters, but not that much. So I need to select something that is good, but that is about it.