Voron’s Journal: Design Decisions
I’ve spoken before about the idea behind Voron’s Journal. It allows us to do some pretty sweet things:
- Sequential writes.
- Incremental backups.
- Multi Version Concurrency Control (MVCC).
- Mutable trees.
We haven’t really done perf testing yet (we have some early benchmarks, but they aren’t ready to share), but things are looking up.
In particular, because we write to the log sequentially, we get much better write speed. I’ll discuss the mechanics of flushing the journal in a bit, and I have already talked about incremental backups.
What I really want to talk about now is how we implement MVCC and a major optimization: Mutable trees.
In LMDB, the engine uses a copy-on-write method to ensure that you can modify the data while other transactions are reading from it. This leads to a lot of writes (and free space). We can avoid that entirely by using the journal file, which ends up reducing the amount of writes we have to do by a factor of 10 or so in our random writes scenario. You can read more about this here.
The downside of having a journal is that you can’t really just keep the data there. You need to move it to the data file at some point. (CouchDB’s engine uses what is effectively only a journal file, but it needs to compact to reclaim space.)
And as it turns out, there are a lot of issues that we need to consider here. To start with, we have a minor write amplification. We are (at worst) writing twice as much data as we get. The reason I say at most twice as much is that there are optimizations in place to stop us from writing the same page multiple times. So if we wrote five values in five transactions, we would have to write the five transactions to the journal, but only update a single page in the data file.
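To make that deduplication concrete, here is a minimal sketch in Python (the structures and names are purely illustrative, not Voron’s actual code): every committed page goes to the journal, but at flush time only the latest version of each distinct page is written to the data file.

```python
journal = []          # append-only list of (page_number, page_bytes)
data_file = {}        # page_number -> page_bytes, stands in for the data file

def commit(tx_pages):
    """Append every page a transaction modified to the journal (sequential write)."""
    for page_no, page_bytes in tx_pages.items():
        journal.append((page_no, page_bytes))

def flush():
    """Apply the journal to the data file, writing each distinct page only once."""
    latest = {}
    for page_no, page_bytes in journal:
        latest[page_no] = page_bytes          # later journal entries win
    for page_no, page_bytes in latest.items():
        data_file[page_no] = page_bytes       # one write per distinct page
    journal.clear()

# five transactions, all touching the same page
for i in range(5):
    commit({7: f"value {i}".encode()})

print(len(journal))     # 5 entries written sequentially to the journal
flush()
print(len(data_file))   # but only 1 page ends up updated in the data file
```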
More interesting is how we are going to actually handle flushing the journal, or rather, when we are going to do that. We need to balance system speed vs. number of journal files vs. recovery time vs. performance spikes.
My current thinking is that we’ll start with a 1MB log file (see side note) and flush to disk every 1MB. When a log file is full, we create a new one, and if there is more than a single log file, we double the size of the new log file (up to 64 – 256MB, I guess). This allows us to very quickly size the log file up automatically without needing any explicit configuration. It also means that under load, we will get bigger log files, and hence better write performance. Once we are beyond the 1MB range (a size that I don’t think requires any optimizations at all), we need to start thinking about how to do work in parallel, both flushing to the data file and writing to the current log file.
Side note: optimizing small and growing
One of the things we noticed is that a lot of users try out the database with very small data sets. With RavenDB’s default configuration, we are set up to take about 64MB of disk space, and we got some complaints about that. It might be better to start small and increase the size than to start with a more reasonable default that hurts small dbs.
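Here is a rough sketch of that growth policy; the constants, the function name, and the exact growth condition are my reading of the post, not Voron’s actual code.

```python
MIN_LOG_SIZE = 1 * 1024 * 1024       # start with a 1MB log file
MAX_LOG_SIZE = 64 * 1024 * 1024      # the post says "64 - 256MB, I guess"

def next_log_size(current_size, pending_log_files):
    """Pick the size of the next journal file: double it while we are falling
    behind (more than one log file waiting to be flushed), up to the cap."""
    if pending_log_files > 1:
        return min(current_size * 2, MAX_LOG_SIZE)
    return current_size

# under sustained load the log grows quickly: 2MB -> 4MB -> 8MB -> 16MB
size = MIN_LOG_SIZE
for pending in (2, 2, 2, 2):
    size = next_log_size(size, pending)
    print(size // (1024 * 1024), "MB")
```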
What I think we’ll do is allow only one concurrent data flush, which is how we limit the amount of work that flushing can take. After every write, we will check how much space we have to flush. If the space is about 50% of the current log file size, we will initiate a flush. We will also initiate a flush if there have been no writes for 5 seconds. What we are trying to achieve with this logic is that, ideally, under load, we will always be both writing to the log file and flushing from it. The reason we want to do that is to ensure that we get consistent performance. We want to avoid spikes and stalls in performance. It is more important to be consistent than to allow high burst speeds.
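A sketch of that flush heuristic, assuming a single background flush, a ~50% threshold, and a 5 second idle timer (class and method names here are hypothetical, not Voron’s API):

```python
import threading
import time

class FlushScheduler:
    IDLE_FLUSH_SECONDS = 5

    def __init__(self, flush_fn):
        self._flush_fn = flush_fn
        self._flush_lock = threading.Lock()     # allow only one concurrent flush
        self._last_write = time.monotonic()

    def after_write(self, unflushed_bytes, current_log_size):
        # called after every write: flush once ~50% of the log is waiting
        self._last_write = time.monotonic()
        if unflushed_bytes >= current_log_size * 0.5:
            self._try_flush()

    def on_timer_tick(self):
        # called periodically: flush if there were no writes for 5 seconds
        if time.monotonic() - self._last_write >= self.IDLE_FLUSH_SECONDS:
            self._try_flush()

    def _try_flush(self):
        # if a flush is already running, skip rather than queue another one
        if self._flush_lock.acquire(blocking=False):
            try:
                self._flush_fn()
            finally:
                self._flush_lock.release()
```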
Comments
So you moved away from COW to a log-based model? The COW model always seemed like an odd trade-off.
Couldn't you just switch over to the stock InnoDB engine or some other battle-hardened engine?
Tobi, No, I couldn't do that. See the full talk for details: http://skillsmatter.com/podcast/scala/the-db-disassembly-kit
"reason we want to do that is to ensure that we get consistent performance"
From what you wrote it looks like you actually want consistent throughput. Consistent performance can also mean consistent latency (which does not seem to be a goal).
Catalin, I am not sure that I understand the difference between your consistent perf, latency and throughput. Ideally, I want to have consistent response times overall. The three terms seem interchangeable in this context.
When in this flow does indexing happen? Does the indexer work with both the journal and stored data?
Brian, This is several levels down from indexing. The indexer doesn't know about any of this.
Sorry, I should have been more clear. When is the data available for reading by the higher level code such as the indexer? Does it have to be flushed first or is data in the journal read as well?
Brian, The data is immediately available, it can be read from the journal.
LMDB's first basic concept was to be a memory-mapped database. Every decision after that was driven by this choice. The use of COW was mandatory, because if you are using a writable mmap, and you make changes to the database, you cannot guarantee the order in which the OS will persist those changes to disk. The only viable approach was to ensure that when you write to the database, you never write to a page that might currently be in use. A side benefit of this approach is that a system crash can never corrupt the database, so restarts after a crash are instantaneous; no recovery procedure is needed. This approach also naturally leads you to MVCC; you get it basically for free.
The strategy of using the journal to turn all writes into sequential writes is generally a good one. There's no free lunch though - eventually you have to merge the journaled ops back into the main DB file, and that means you're still going to have to do a lot of random I/Os at that point. Of course you can consolidate a lot of the I/Os by batching the journal updates, but there will still be a large number of random I/Os left. (Look at the performance delta between batched and non-batched random writes in the LMDB microbenchmarks - that's how much performance headroom you have to work with.)
Howard, The journal solves another problem: it drastically simplifies free space management (I can modify the "same" page, not a different page), even though that page actually resides in the journal file. That means that for a lot of updates, I don't have to go up the tree to modify things. That saves me a not insignificant amount of I/O.
And yes, we do need to flush the journal file. But it won't be random I/O. We actually batch the writes from the journal file to the data file, and we do so in a sequential manner. So yes, we might actually be skipping sectors (and might jump ahead a lot), but it ain't random.
We also get the benefit that we need to modify any page only once, so for hot pages, we save a lot of I/O.
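If it helps to picture it, a flush along those lines might look roughly like this on a POSIX system (my own sketch, not Voron's code): keep the newest copy of each journaled page, then write the batch to the data file in ascending page order, so the writes move forward through the file instead of seeking randomly.

```python
import os

PAGE_SIZE = 4096   # hypothetical page size

def flush_journal_to_data_file(journal_entries, data_fd):
    """journal_entries: iterable of (page_number, page_bytes), oldest first."""
    latest = {}
    for page_no, page_bytes in journal_entries:
        latest[page_no] = page_bytes                   # the newest copy wins

    for page_no in sorted(latest):                     # ascending page order
        os.pwrite(data_fd, latest[page_no], page_no * PAGE_SIZE)

    os.fsync(data_fd)                                  # make the merge durable
```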
If I've understood what you're saying, that means that you write modified pages into your journal (as opposed to just logging operations/deltas). At merge time, you write the most up-to-date version of a page from the journal back to the main DB file. This is an update-in-place operation, so if the system crashes during this merge, the DB file is corrupted.
But I guess that's OK, because the merge is idempotent, so the next time you start up you can redo the merge. It sounds promising. The only hassle I see is in tracking which pages are live in the journal, to satisfy various read requests.
Howard, No, if the system crashes midway, that is not going to do anything bad; you can just restart it.
We do have to track pages in the journal; we keep a page translation table for that. It tends to be pretty small, since we flush quite often.
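One way to picture that page translation table (purely illustrative, not Voron's implementation): reads consult the table first and fall back to the data file, and entries are dropped once their pages have been flushed.

```python
class PageTranslationTable:
    """Maps page numbers to their latest location inside the journal."""

    def __init__(self):
        self._table = {}   # page_number -> (journal_file_id, offset_in_journal)

    def record(self, page_no, journal_id, offset):
        # called when a transaction commits a modified page to the journal
        self._table[page_no] = (journal_id, offset)

    def lookup(self, page_no):
        # None means the page is not in the journal and must be read from the data file
        return self._table.get(page_no)

    def evict_flushed(self, flushed_page_numbers):
        # after a flush, the data file has these pages, so drop them from the table
        for page_no in flushed_page_numbers:
            self._table.pop(page_no, None)
```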