The Guts n’ Glory of Database InternalsWriting to a data file
 So, we now know how we can write to a journal file efficiently, but a large part of doing that is relying on the fact that we are never actually going to read from the journal file. In other words, this is like keeping your receipts in case of an audit. You do that because you have to, but you really don’t want to ever need it, and you just throw it in the most efficient way to a drawer and will sort if out when you need to.
So, we now know how we can write to a journal file efficiently, but a large part of doing that is relying on the fact that we are never actually going to read from the journal file. In other words, this is like keeping your receipts in case of an audit. You do that because you have to, but you really don’t want to ever need it, and you just throw it in the most efficient way to a drawer and will sort if out when you need to.
In most database engines that implement a journal, there is this distinction, the journal is strict for durability and recovery, and the data file(s) are used to actually store the data in order to operate. In our case, we’ll assume a single journal and a single data file.
On every commit, we’ll write to the journal file, as previously discussed, and we ensure that the data is safely on the disk. But what happens to writes on the data file?
Simple, absolutely nothing. Whenever we need to write to the data file, we make buffered writes into the data file, which goes into the big chain of buffers that merge / reorder and try to optimize our I/O. Which is great, because we don’t really need to care about that. We can let the operating system handle all of that.
Up to a point, of course.
Every now and then, we’ll issue an fsync on the data file, forcing the file system to actually flush all those buffered writes to disk. We can do this in an async manner, and continue operations on the file. At the time that the fsync is done (which can be a lot of time, if we had a lot of write and a busy server), we know what is the minimum amount of data that was already written to the data file and persisted on disk. Since we can match it up to the position of the data on the journal, we can safely say that the next time we recover, we can start reading the journal from that location.
If we had additional writes, from later in the journal file, that ended up physically in the data file, it doesn’t matter, because they will be overwritten by the journal entries that we have.
Doing it this way allows us to generate large batches of I/O, and in most cases, allow the operating system the freedom to flush things from the buffers on its own timeline, we just make sure that this doesn’t get into degenerate case (where we’ll need to read tens of GB of journal files) by forcing this every now and then, so recovery is fast in nearly all cases.
All of this I/O tend to happen in async thread, and typical deployments will have separate volumes for logs and data files, so we can parallelize everything instead of competing with one another.
By the way, I’m running this series based on my own experience in building databases, and I’m trying to simplify it as much as it is possible to simplify such a complex topic. If you have specific questions / topics you’ll like me to cover, I’ll be happy to take them.
More posts in "The Guts n’ Glory of Database Internals" series:
- (08 Aug 2016) Early lock release
- (05 Aug 2016) Merging transactions
- (03 Aug 2016) Log shipping and point in time recovery
- (02 Aug 2016) What goes inside the transaction journal
- (18 Jul 2016) What the disk can do for you
- (15 Jul 2016) The curse of old age…
- (14 Jul 2016) Backup, restore and the environment…
- (11 Jul 2016) The communication protocol
- (08 Jul 2016) The enemy of thy database is…
- (07 Jul 2016) Writing to a data file
- (06 Jul 2016) Getting durable, faster
- (01 Jul 2016) Durability in the real world
- (30 Jun 2016) Understanding durability with hard disks
- (29 Jun 2016) Managing concurrency
- (28 Jun 2016) Managing records
- (16 Jun 2016) Seeing the forest for the trees
- (14 Jun 2016) B+Tree
- (09 Jun 2016) The LSM option
- (08 Jun 2016) Searching information and file format
- (07 Jun 2016) Persisting information
 

Comments
I would like to know more about the file structure of the journal and redo/undo/backup/restore/replication: https://en.wikipedia.org/wiki/Transaction_log
I was about to ask about fsync of the journal, but you have already described that subject one week ago: https://ayende.com/blog/174564/the-guts-n-glory-of-database-internals-durability-in-the-real-world
Windows as an OS is terrible at dirty page write-back. I would not want to rely on that. I have seen it write contiguous pages in the wrong order causing random IO. The size that it buffers can easily be to small. When you fsync the IO patterns you get are at the mercy of a 3rd party blackbox that does not seem to get a lot of attention. Also, it can cause an IO storm. Or, the IO DOP could be far too small because Windows likely has no idea what the best DOP for your storage is.
Windows appears to be able to detect certain sequential IO patterns and treat them specially. This can manifest itself as throttling writes and flushing a file when the app closes it. I have seen those effects. So you could run into random heuristics and effects.
It seems better to manage dirty buffers in the database engine. That way flushing is subject to a policy of your choice.
Turns out SQL Server does that. I have pondered this a lot and I like that system.
Carsten, Good point on the content of the transaction log, I'll post about it.
Fsync details are discussed in posts that will show up in 15 / 18 this month.
Tobi, The optimization patterns for Windows are pretty clear, it does read ahead / read behind if it detect it. And it allows to detect what stays in the page cache.
What is I/O DOP, I'm not familiar with the term.
Buffered I/O will absolutely look like flushing on close, sure, but that is the point. The perf difference is that big. The key issue is what happens when you issue an explicit flush.
In that case, Windows has a pretty well defined behavior, and I haven't see it writing out of sequence. It is possible that this is the case if you are seeing separate writes that happened while you are flushing, so they go to the next flush.
your website is copyrighted till 2015
Anon, thanks, fixed
Comment preview