The cost of random writes


After pretty much getting to where we wanted with sequential writes, it is time to try to tackle the cost of random writes. Here is Esent & Voron doing random writes for 1 million entries:

[Image: Esent vs. Voron random write performance for 1 million entries]

Yes, it is sad that this is the case, because we can do about 40 times better for sequential writes.

However… what is the reason for the slowness? As usual, the answer is to profile, profile & profile. In fact, we want to do profiling on multiple machines, to check that we don’t have a machine bias.

At any rate, as you can see, we are actually pretty close to Esent in random writes, but why are we so much slower than our performance during sequential writes?

Well, we actually covered that already. The major reason for the cost is the fact that we have to deal with very different B-Tree shapes. You can see the difference between sequentially and randomly inserted trees here. The more random you are, the bigger your tree is, and the more pages you have to touch whenever you update. Our scenario calls for making 100 random modifications in every transaction. That means that on every transaction, we have to touch many different parts of the tree.
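
To get an intuition for why the tree shape matters so much, here is a rough sketch, not Voron's actual code: it models just the leaf level of a B-Tree as a sorted key space split into fixed-size pages (the fan-out and transaction counts are made-up numbers for illustration) and counts how many distinct pages a transaction of 100 inserts has to dirty.

```python
import bisect
import random

KEYS_PER_PAGE = 256   # assumed leaf fan-out, for illustration only
TX_COUNT = 500        # scaled down from the post's 5,000 transactions
ITEMS_PER_TX = 100

def avg_pages_touched(next_key):
    keys = []                 # sorted list of all keys inserted so far
    touched_per_tx = []
    for _ in range(TX_COUNT):
        touched = set()
        for _ in range(ITEMS_PER_TX):
            key = next_key()
            pos = bisect.bisect_left(keys, key)
            keys.insert(pos, key)
            touched.add(pos // KEYS_PER_PAGE)   # rough guess of the leaf page hit
        touched_per_tx.append(len(touched))
    return sum(touched_per_tx) / len(touched_per_tx)

# Sequential keys: every insert lands at the tail, so each transaction
# dirties only one or two leaf pages.
counter = iter(range(1, 10**9))
print("sequential avg pages/tx:", avg_pages_touched(lambda: next(counter)))

# Random keys: inserts scatter over the whole key space, so as the tree
# grows each transaction dirties dozens of distinct leaf pages.
print("random avg pages/tx:", avg_pages_touched(lambda: random.getrandbits(64)))
```

This ignores branch pages, page splits and free space handling, but it shows the same shape as the table below: sequential transactions keep hammering the same tail pages, while random transactions scatter their writes all over the tree.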

Here are the results (in pages) for how much it cost us (in this test, we ran 5,000 transactions of 100 items each):

            Sequential      Random
Average           7.06      111.37
Max              11.00      134.00
Total        35,320.00  556,973.00
Std Dev           0.35        7.09

In other words, this means that on average, a sequential transaction wrote 7 pages, or about 28KB, while the average random transaction wrote about 450KB!

The total size written for sequential was 137MB, out of 319MB allocated. The random writes, however, came to 2.1GB written out of 2.3GB allocated. And that explains quite a lot about the whole thing. (Note that we count only the journal writes, not the data writes, because those are the limiting factor for write performance.)
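
For reference, here is the arithmetic behind those numbers, assuming a 4KB page size (an assumption on my part, but it lines up with the figures above):

```python
PAGE_SIZE_KB = 4  # assumed page size, consistent with the numbers in the post

avg_sequential_pages, avg_random_pages = 7.06, 111.37
total_sequential_pages, total_random_pages = 35_320, 556_973

print(avg_sequential_pages * PAGE_SIZE_KB)              # ~28 KB per sequential transaction
print(avg_random_pages * PAGE_SIZE_KB)                  # ~445 KB per random transaction
print(total_sequential_pages * PAGE_SIZE_KB / 1024)     # ~138 MB written for sequential
print(total_random_pages * PAGE_SIZE_KB / 1024 / 1024)  # ~2.1 GB written for random
```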

Esent, by the way, writes about 465MB for sequential and 2.3GB for random writes. Now we know why we are pretty close: the vast majority of the time is spent just writing to the disk, and we are stalled by the sheer amount of data we have.

Interestingly enough, I don't think that there is a good way to handle that nicely, certainly not with a B-Tree. We'll see how we can work with that in a future post.