Fast transaction log: Linux
We were doing some perf testing recently, and we got some odd results when running a particular benchmark on Linux. So we decided to check this on a much deeper level.
We got an AWS machine (i2.2xlarge, 61 GB RAM, 8 cores, 2x 800 GB SSD drives, running Ubuntu 14.04 with kernel version 3.13.0-74-generic, 1GB/sec EBS drives) and ran the following code on it. The test works against a 1GB pre-allocated file and “commits” 65,536 (64K) transactions with 16KB of data in each. The idea is that we are testing how fast we can write those transactions to the journal file, so we can consider them committed.
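As a rough sketch of that setup (this is not the benchmark's actual code), the sizes work out to 65,536 × 16KB = 1GB, and the file can be reserved up front with posix_fallocate. The names TX_SIZE, TX_COUNT, journal_size and preallocate_journal are made up for this illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

enum {
    TX_SIZE  = 16 * 1024,  /* 16KB of data per transaction */
    TX_COUNT = 64 * 1024   /* 65,536 transactions */
};

/* Total journal size in bytes: 65,536 * 16KB = 1GB. */
int64_t journal_size(void)
{
    return (int64_t)TX_SIZE * TX_COUNT;
}

/* Create (or reuse) a journal file and reserve `size` bytes for it up front.
 * Returns an open file descriptor, or -1 on failure. */
int preallocate_journal(const char *path, int64_t size)
{
    assert(size > 0);
    int fd = open(path, O_CREAT | O_RDWR, 0644);
    if (fd < 0)
        return -1;
    /* posix_fallocate returns an error number directly, not -1 with errno. */
    if (posix_fallocate(fd, 0, (off_t)size) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Calling preallocate_journal(path, journal_size()) before the timed loop keeps file growth out of the measurement, which is presumably why the file was pre-allocated.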
We tested the following scenarios:
- Doing buffered writes (pretty useless for any journal file, which needs to be reliable, but a good baseline metric).
- Doing buffered writes and calling fsync after each transaction (which is a pretty common way to handle committing to disk in databases).
- Doing buffered writes and calling fdatasync after each transaction (which is supposed to be slightly more efficient than calling fsync in this scenario).
- Using O_DSYNC flag and asking the kernel to make sure that after every write, everything will be synced to disk.
- Using O_DIRECT flag to bypass the kernel’s caching and go directly to disk.
- Using O_DIRECT | O_DSYNC flag to ensure that we don’t do any caching, and actually force the disk to do its work.
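To make the difference between these scenarios concrete, here is a minimal sketch (again, not the benchmark's code) of how each one maps to either extra open(2) flags or an explicit flush after every write. The enum and function names are hypothetical.

```c
#define _GNU_SOURCE   /* exposes O_DIRECT in <fcntl.h> on Linux/glibc */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical names for the six scenarios listed above. */
enum journal_mode {
    MODE_BUFFERED,
    MODE_FSYNC,
    MODE_FDATASYNC,
    MODE_DSYNC,
    MODE_DIRECT,
    MODE_DIRECT_DSYNC
};

/* Extra open(2) flags each scenario adds on top of O_WRONLY. */
int mode_open_flags(enum journal_mode mode)
{
    switch (mode) {
    case MODE_DSYNC:        return O_DSYNC;
    case MODE_DIRECT:       return O_DIRECT;
    case MODE_DIRECT_DSYNC: return O_DIRECT | O_DSYNC;
    default:                return 0;   /* plain buffered I/O */
    }
}

/* Explicit flush issued after each transaction's write(), if any.
 * Returns 0 on success, -1 on error. */
int mode_commit(enum journal_mode mode, int fd)
{
    assert(fd >= 0);
    switch (mode) {
    case MODE_FSYNC:     return fsync(fd);      /* data + all metadata */
    case MODE_FDATASYNC: return fdatasync(fd);  /* data + metadata needed to read it back */
    default:             return 0;              /* nothing to do, or handled by open flags */
    }
}
```

The first three scenarios open the file the same way and differ only in what they call after write(); the last three issue no extra call and rely on the flags passed to open().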
The code is written in C, and it is written to be pretty verbose and ugly. I apologize for how it looks, but the idea was to get some useful data out of this, not to generate beautiful code. It is quite probable that I made some mistake in writing the code, which is partly why I’m talking about this.
Here is the code, and the results of execution are below:
It was compiled using: gcc journal.c -O3 -o run && ./run
The results are quite interesting:
| Method | Time (ms) | Write cost (ms) | 
| Buffered | 525 | 0.03 | 
| Buffered + fsync | 72,116 | 1.10 | 
| Buffered + fdatasync | 56,227 | 0.85 | 
| O_DSYNC | 48,668 | 0.74 | 
| O_DIRECT | 47,065 | 0.71 | 
| O_DIRECT \| O_DSYNC | 47,877 | 0.73 | 
The buffered call is useless for a journal, but it is important as a baseline to compare against. The rest of the options ensure that the data resides on disk* after the call to write, and are suitable for actually getting safety from the journal.
* The caveat here is the use of O_DIRECT: the docs (and Linus) seem to be pretty much against it, and there are few details on how it works with regard to instructing the disk to actually bypass all buffers. O_DIRECT | O_DSYNC seems to be the recommended option, but that probably deserves more investigation.
Of course, I had this big long discussion on the numbers planned. And then I discovered that I was actually running this on the boot disk, and not on one of the SSD drives. That was a facepalm of epic proportions, one that I was only saved from by the problematic numbers that I was seeing.
Once I realized what was going on and fixed that, we got very different numbers.
| Method | Time (ms) | Write cost (ms) | 
| Buffered | 522 | 0.03 | 
| Buffered + fsync | 23,094 | 0.35 | 
| Buffered + fdatasync | 23,022 | 0.35 | 
| O_DSYNC | 19,555 | 0.29 | 
| O_DIRECT | 9,371 | 0.14 | 
| O_DIRECT \| O_DSYNC | 20,595 | 0.31 | 
On the HDD (the boot disk results in the first table), fdatasync takes about 22% less time than fsync (56,227 ms vs. 72,116 ms), but there is barely any difference as far as the SSD is concerned. This is because the SSD can do random updates (such as updating both the data and the metadata) much faster, since it doesn’t need to physically move a disk head.
Of more interest to us is that O_DSYNC is significantly faster on both SSD and HDD. About 15% can be saved by using it instead of fdatasync.
But I’m just drooling over the O_DIRECT write performance on SSD. It is so good; unfortunately, it isn’t safe to use on its own. We need both it and O_DSYNC to make sure that we only get confirmation of the write when it hits the physical disk, instead of the drive’s cache (that is why O_DIRECT alone is faster: we are basically writing directly to the drive’s cache).
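For reference, here is a minimal sketch of what a write under O_DIRECT | O_DSYNC could look like. One detail the numbers above don't show is that O_DIRECT may require the buffer address, the write length and the file offset to be aligned to the device's logical block size. The 4096-byte alignment and all the function names below are assumptions for illustration, not taken from the benchmark.

```c
#define _GNU_SOURCE   /* exposes O_DIRECT in <fcntl.h> on Linux/glibc */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Assumed alignment; it must be a multiple of the device's logical block size. */
#define JOURNAL_ALIGN 4096

/* Allocate a zeroed buffer that satisfies the assumed O_DIRECT alignment.
 * Returns NULL on failure or if `size` is not a multiple of JOURNAL_ALIGN.
 * Release with free(). */
void *alloc_tx_buffer(size_t size)
{
    void *buf = NULL;
    if (size == 0 || size % JOURNAL_ALIGN != 0)
        return NULL;
    if (posix_memalign(&buf, JOURNAL_ALIGN, size) != 0)
        return NULL;
    memset(buf, 0, size);
    return buf;
}

/* Open an existing journal so that write() only returns once the data
 * has been pushed past both the page cache and the drive's cache. */
int open_journal_durable(const char *path)
{
    return open(path, O_WRONLY | O_DIRECT | O_DSYNC);
}

/* Write one transaction at `offset`.
 * Returns 0 on success, -1 on error or short write. */
int write_tx(int fd, const void *buf, size_t size, off_t offset)
{
    assert((uintptr_t)buf % JOURNAL_ALIGN == 0);
    assert(offset % JOURNAL_ALIGN == 0);
    ssize_t n = pwrite(fd, buf, size, offset);
    return n == (ssize_t)size ? 0 : -1;
}
```

A 16KB transaction is a multiple of 4096, so the transaction size used in this test fits the assumed alignment without padding. Note that some file systems (tmpfs, for example) reject O_DIRECT at open time.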
The tests were run using ext4. I played with some mount options (noatime, nodiratime, etc.), but there wasn’t any big difference between them that I could see.
In my next post, I’ll test the same on Windows.
 

Comments
There are many SSDs (usually enterprise ones) that guarantee the cache will be written to the flash storage in the event of a power loss, usually by having a really big capacitor to ensure enough power to write the cache. Would it be possible to make bypassing the drive's cache optional for such a situation?
Mircea, in that case, absolutely, yes. But in those cases, the drives are usually already programmed to lie and say that they wrote to the storage when they actually wrote to the buffer, so it is the same thing.
...which is not a lie because the guarantee is for durability, not for putting the data into flash cells.
Do DIRECT and DSYNC map to the two Windows equivalents WRITE_THROUGH and NO_BUFFERING? It seems so. Would be interesting to see the same load on a comparable Windows machine.
Tobi, they are an inexact mapping, yes. And see the title of the next blog post, scheduled for tomorrow. :-)
+1 for Mircea. On the same issue, there are some applications that don't care about durability (for instance, statistics and analytics) and prefer performance over durability. Is it possible to get the buffered option as well?
Uri, We are going to provide several options there, yes. But the buffered option is problematic because you need to be careful about how / when to sync journal and data files.
We are probably going to have multiple steps along the way
Are you planning on doing this for OS X? I imagine that the numbers will be abysmal compared to Linux and Windows, but still would be good to know.
Jay
The comment about OSX probably being slow got my interest so I ported this (roughly) over.
The caveats are that fdatasync goes away and O_DIRECT is replaced by F_NOCACHE. Code is here: https://gist.github.com/tdunning/6dfaa070bb9e8e55ed7efa6a70425e18
Here are the results on my MacBookPro with SSD:
Seems not bad at all except for the baseline case.
The buffered version is really bad. And given the costs, I think that you aren't flushing really.
IIRC, Mac OS X also needs fcntl(F_FULLFSYNC)
I think that Oren is right based on documentation.
When I add F_FULLFSYNC, performance drops to a very steady 12MB/s with correspondingly astronomical times.
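For readers following along, a minimal sketch of the kind of call being discussed here, assuming a hypothetical helper named full_sync. On macOS, fsync() alone does not ask the drive to flush its own buffers, which is what F_FULLFSYNC adds. The #ifdef lets the same code fall back to plain fsync() on Linux.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Flush `fd` as far down as the platform allows.
 * Returns 0 on success, -1 on error. */
int full_sync(int fd)
{
    assert(fd >= 0);
#ifdef F_FULLFSYNC
    /* macOS: ask the drive to flush its buffers to permanent storage. */
    if (fcntl(fd, F_FULLFSYNC) != -1)
        return 0;
    /* Not every file system supports F_FULLFSYNC; fall back to fsync(). */
#endif
    return fsync(fd);
}
```

This is only an illustration of the call; it is not the code from the gist linked above.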
Once upon a time, I got the following instruction from a database vendor: if the database crashes, check whether any kind of buffering is used, e.g. a write-back cache in the RAID controller. Then you know what to blame :-)
The database vendor was very reliable, and the only time I lost a database was just after a firmware upgrade on some new servers.