Reducing the cost of writing to disk
So, we found out that the major cost of random writes in our tests was actually writing to disk. Writing 500K sequential items resulted in about 300 MB being written. Writing 500K random items resulted in over 2.3 GB being written.
Note: I would like to point out Alex’s comment, which helped set up this post.
So the obvious thing to do would be to use compression. I decided to try it and see what it would give us, and the easiest thing to do is to just enable file compression at the NTFS level. But we can probably do better. The major cost we have is writes to the journal file, and we only ever read from the journal file when we recover the database. We are also usually writing multiple pages at a time for most transactions; for the scenario we care about, writing 100 random values in a single transaction, we usually write about 100 pages or so anyway. That means we have a pretty big buffer that we can try to compress all at once.
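To make that concrete, here is a minimal sketch of the approach, not Voron’s actual code: gather the transaction’s dirty pages into one buffer, compress the whole thing in one shot, and write the result to the journal with a small header so recovery knows how to decompress it. The 4 KB page size, the WriteTransactionToJournal name, the length-prefixed header, and the choice of the K4os.Compression.LZ4 binding are all assumptions for illustration; the post doesn’t say which LZ4 library was used.

```csharp
using System;
using System.IO;
using System.Text;
using K4os.Compression.LZ4; // one available .NET LZ4 binding; an assumption, the post doesn't name its library

class JournalCompressionSketch
{
    const int PageSize = 4 * 1024; // assumed page size, not stated in the post

    // Compress the whole batch of dirty pages for a transaction into a single
    // buffer, then write that buffer to the journal as one sequential write.
    public static void WriteTransactionToJournal(Stream journal, byte[][] dirtyPages)
    {
        // Concatenate the pages so the compressor sees the full ~100-page
        // transaction at once instead of one 4 KB page at a time.
        var raw = new byte[dirtyPages.Length * PageSize];
        for (int i = 0; i < dirtyPages.Length; i++)
            Buffer.BlockCopy(dirtyPages[i], 0, raw, i * PageSize, PageSize);

        // Allocate for the worst case so Encode always has enough room.
        var compressed = new byte[LZ4Codec.MaximumOutputSize(raw.Length)];
        int compressedLength = LZ4Codec.Encode(
            raw, 0, raw.Length,
            compressed, 0, compressed.Length);

        // Hypothetical header: recovery needs both sizes to decompress and replay.
        using (var writer = new BinaryWriter(journal, Encoding.UTF8, leaveOpen: true))
        {
            writer.Write(raw.Length);
            writer.Write(compressedLength);
            writer.Write(compressed, 0, compressedLength);
        }
        journal.Flush();
    }
}
```

On recovery, the reader would do the reverse: read the two lengths, decompress the payload (e.g. with LZ4Codec.Decode in this assumed binding), and replay the pages.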
Sadly, we don’t really see a meaningful improvement under that scenario. Using NTFS compression slowed us down considerably, while both LZ4 and Snappy greatly reduced the amount written to the file, but gave roughly the same performance.
Note that I have done just the minimal amount of work required to test this out. I changed how we write to the journal, and disabled reading from it. That was pretty disappointing, to tell you the truth; I fully expected that we’d see some interesting improvements there. The LZ4 compression factor is about x10 for our data set. That means that 100 – 121 pages are usually compressed down to 10 – 13 pages.
The good thing about this is that running it through the profiler shows that we are running on a high-I/O machine, where the amount and speed of I/O don’t impact our performance much.
To test it out, I ran the same test on an Amazon m3.2xlarge machine, and got the following results (writing to the C: drive):
So for now, I think I’ll do the following: we’ll focus on the costs beyond I/O, and profile heavily on systems with higher I/O costs.
Compression should work; in fact, my first thought was that we would be paying for less I/O with more CPU, but that is just a guess without looking at the numbers.
Here is the cost of committing using compression (lz4):
And here it is without compression:
So the good news is that we can visibly see that we reduced the I/O cost from 8 seconds to 3.8 seconds. Even with the compression cost of 2.6 seconds, we are still ahead: 3.8 + 2.6 = 6.4 seconds versus 8 seconds.
However, the cost of the collections we use is actually a lot worse from our perspective. So we are back to reducing computation costs first, and we’ll look at compression again at a later point in time.
Comments
This post would have been more useful with the corresponding CPU %. There has to be a trade-off.
Also, would you mind sharing the LZ4 and Snappy libs you used for .NET?
Vadi, The CPU % aren't meaningful for me, I'm happy to trade off some CPU for I/O. I have CPU cycles to spare, but I/O is my limiting factor.
Can the journal file be put on another spindle or SSD? I realize this would be an ops choice rather than a code choice.
Tyler, Sure, you can absolutely do that, and you'll see a nice speed up.
Interesting. I would have expected to see a bit more improvement, especially because competition with data sync writes should be substantially less. In some experiments with relatively large transaction sizes and a continuously high tx write load I have seen around 40-50% improvement in total throughput in a design comparable to Voron's.
For scenarios with many pages/tx a potential gain could be to do block compression in parallel to async writes (i.e. compress a block, start async write for the block and prepare the next compressed block in parallel). But with realistic average transaction sizes this is likely to cost at least as much as is gained.
From the cost on collections, it is somewhat curious to see from the two timing images that the cost of "AddRange" is only 136 ms for 5000 calls in the no-compression scenario vs. 3049 ms in the compression scenario. Can this be explained by many more elements being added per call?
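The overlap Alex describes above (compress a block, start its async write, and prepare the next compressed block while that write is in flight) could look roughly like the following sketch. It reuses the same assumed LZ4 binding as the earlier example; how blocks are produced and sized is left out, and none of this is Voron’s actual code.

```csharp
using System.IO;
using System.Threading.Tasks;
using K4os.Compression.LZ4; // same assumed LZ4 binding as the earlier sketch

class PipelinedJournalWriter
{
    // Compress block N+1 while the async write of block N is still in flight.
    // Only the overlap itself is shown; error handling and headers are omitted.
    public static async Task WriteBlocksAsync(FileStream journal, byte[][] blocks)
    {
        Task pendingWrite = Task.CompletedTask;

        foreach (var block in blocks)
        {
            // CPU work: compress the next block while the previous write is pending.
            var compressed = new byte[LZ4Codec.MaximumOutputSize(block.Length)];
            int length = LZ4Codec.Encode(block, 0, block.Length, compressed, 0, compressed.Length);

            // Keep blocks in order: wait for the previous write before starting this one.
            await pendingWrite;
            pendingWrite = journal.WriteAsync(compressed, 0, length);
        }

        await pendingWrite;
        await journal.FlushAsync();
    }
}
```

Awaiting the previous write before queueing the next one keeps the blocks in order in the journal, while still letting the compression of block N+1 overlap the I/O of block N.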
Oren, is there a reason why you don't hire someone from Greg's team and get them to work on Voron? Greg and his team have a lot of expertise in storage technologies and ran several hundred hours of perf tests on various disk types in their laboratory. I don't think your approach can ever compete with that, to be honest.
Daniel, a) I am pretty sure that Greg would consider that poaching, and that isn't nice. b) What Greg is doing and what we are doing is different. In particular, Greg is focused on immutability, LSM and a fully self-contained solution. We are doing mutable data, a B+Tree, and only provide the software part of the solution. c) I am pretty sure that we are at a rate right now that is quite beyond what we need to do. We can do over a million writes _a second_, or process over 10,000 transactions a second.
Alex, The I/O rates are pretty high there, so that explains a lot of the cost. Even with the reduction in size, we still have to go to the disk a lot. The collection cost was removed almost entirely, so that drops the cost even further.
Ok, given that the collection cost was largely removed, does that give you the projected improvement when using compression (i.e. 6.548 seconds vs. 8.606 seconds, or around 25% improvement)?
Did you end up going with compression for Voron or was it not worth the trouble?
Alex, It was very much worth the trouble, yes. In practice, it really improved our perf time. The other thing we did was greatly reduce fsync costs, for a further improvement in perf. I'll talk about that in a new post sometime.