Voron’s Log


I was thinking, which is a bit dangerous, about Voron’s backup story, and things sort of started to roll down the hill from there.

Nitpicker note: None of the things that I am discussing here are actually novel in any way; I am well aware of that. I am articulating my thought process about a problem and how it went from an initial small feature to something much bigger. Yes, I am aware of previous research in the area. I know that Xyz is also doing that. Thank you in advance.

Voron’s backup story is a thing of beauty, and I can say that because I didn’t write it; I took it all from LMDB. Basically, we just copy the entire file in a safe and transactional manner. However, it cannot do incremental backups. So I started thinking about what would need to happen to get that, and I found that it is actually a pretty easy proposition.

Whenever we commit a transaction, we have a list of modified pages that we write back to the data file. An incremental backup strategy could be just:

  • Do a full backup – copy the full data file and put it somewhere.
  • Keep track of all the modified pages in a log file.
  • An incremental backup is then just copying the change log elsewhere and restarting the change log.

Restoring would happen by taking the full data file and applying the changes to it. This looks hard, but not when you consider that we always work in pages, so we can just have a change log format like this:

(image: the change log format – a series of page records, e.g. pages 2, 3, 13, 19, with page 3 appearing twice)
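A single record in such a change log might look something like this – a sketch of my own, with assumed names and a fixed 4KB page size, not Voron’s actual on-disk format:

```csharp
using System;
using System.IO;

// A sketch of one change log record: the page number, followed by the raw
// bytes of that page. Field names and the 4KB page size are assumptions for
// illustration, not Voron's actual format.
class LogPageRecord
{
    public const int PageSize = 4096;

    public long PageNumber;                     // which data file page to overwrite
    public byte[] Base = new byte[PageSize];    // the page contents

    public void WriteTo(BinaryWriter writer)
    {
        writer.Write(PageNumber);
        writer.Write(Base);
    }

    public static LogPageRecord ReadFrom(BinaryReader reader)
    {
        var record = new LogPageRecord { PageNumber = reader.ReadInt64() };
        record.Base = reader.ReadBytes(PageSize);
        return record;
    }

    static void Main()
    {
        // round-trip a record through a stream, the way a restore would read it
        var ms = new MemoryStream();
        new LogPageRecord { PageNumber = 13 }.WriteTo(new BinaryWriter(ms));
        ms.Position = 0;
        Console.WriteLine(LogPageRecord.ReadFrom(new BinaryReader(ms)).PageNumber); // 13
    }
}
```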

The process for applying the change log then becomes something like this:

foreach (LogEntry entry in GetLogEntryFiles())
{
    foreach (Page page in entry.Pages)
    {
        // overwrite the page in the data file with the logged contents
        WritePageToDataFile(page.PageNumber, page.Base);
    }
}

And voila, we have incremental backups.

Of course, this means that we will have to write the data twice: once for the incremental log file, and once for the actual data file. But that led me to another series of thoughts. One of the areas where we can certainly do better is random writes. This is an area we have a particular problem with; I already discussed this here. But if we already have a log file (which we need to support incremental backups), and if we already write twice… why not take advantage of that?

Instead of having a change log, we can have a write ahead log. That turns all of our writes into sequential writes, leading to much faster performance overall.

The downside of write ahead logging is that you need to do a few important things. To start with, there is a point where you need to spill the write ahead log into the actual file. That doubles the cost of writes, in comparison to LMDB’s usual manner of only a single write. It also means that you now have to implement some form of recovery, since we need to read the log at startup.

The benefits, however, are pretty good, I think. There is a reason why most databases have some form of Write Ahead Log to make sure that their writes are sequential. And the nice thing about it is that we can now flush only a very clearly defined section of the file, and only do fsyncs on the actual log file.

The idea now becomes:

In the data file header we add a record indicating the last flushed log file, and how far into that log file we have flushed. We pre-allocate a 64MB log file and start writing to it using a memory map.

We always write to the end of the mapped range, so the section of the file that needs to be flushed & fsynced is clearly defined, and it is always at the (logical) end of the file. Once every X (where X is time, size, or just idleness), we take all the records in the log file between the last flushed position and the current location, apply them to the data file, and then update the data file header accordingly.

The fun part about this is that it is absolutely safe in the case of failure. If we failed midway through for any reason, we can just re-apply the process from the beginning and there can be no corruption. There is also a good chance for additional optimizations here, because we can merge a lot of writes into a single operation and then make sure that we access the disk in a predictable and optimal fashion. All of this means that this can be an async process that is allowed to fail, which is very good from our perspective, because we can kick it off with low priority and avoid impacting regular write performance.
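The flush process might look roughly like this – an in-memory sketch of the idea with assumed names, not Voron’s actual code. The point it demonstrates is that re-running the flush after a crash is harmless, because overwriting the same pages again is idempotent:

```csharp
using System;
using System.Collections.Generic;

// A self-contained sketch of the flush-and-recover idea, using in-memory
// stand-ins for the files. All names here are assumptions. The key property:
// flushing is idempotent, so re-running it after a failure cannot corrupt
// the data file.
class FlushSketch
{
    // (pageNumber, pageContents) records, in commit order
    static List<(long PageNumber, byte[] Contents)> log = new();
    static Dictionary<long, byte[]> dataFile = new();
    static int lastFlushedIndex = 0;   // stands in for the data file header record

    public static void Flush()
    {
        int end = log.Count;
        for (int i = lastFlushedIndex; i < end; i++)
            dataFile[log[i].PageNumber] = log[i].Contents;  // overwrite the page
        // only after the pages are durably written do we advance the marker
        lastFlushedIndex = end;
    }

    static void Main()
    {
        log.Add((2, new byte[] { 1 }));
        log.Add((3, new byte[] { 2 }));
        log.Add((3, new byte[] { 3 }));    // page 3 written twice; last one wins
        Flush();
        Flush();                           // re-running is harmless (idempotent)
        Console.WriteLine(dataFile[3][0]); // 3
    }
}
```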

Until we get the data flushed to the data file, we maintain an in-memory map of pages. In the image above, we will have a map for pages 2, 3, 13, 19, and when they are requested, we will give out the address of the memory mapped log file entry at the appropriate location.
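That map can be as simple as a dictionary from page number to the offset of the newest copy of that page in the mapped log file – again, a sketch with assumed names:

```csharp
using System;
using System.Collections.Generic;

// Sketch: map page number -> offset of the newest copy of that page in the
// memory mapped log file. A later write overwrites the earlier entry, so a
// lookup always returns the most recent version. Names are assumptions.
class PageMapSketch
{
    static Dictionary<long, long> pagesInLog = new();

    public static void RecordWrite(long pageNumber, long offsetInLog) =>
        pagesInLog[pageNumber] = offsetInLog;

    // null means: the page is not in the log, read it from the data file
    public static long? TryGetFromLog(long pageNumber) =>
        pagesInLog.TryGetValue(pageNumber, out var offset) ? (long?)offset : null;

    static void Main()
    {
        RecordWrite(3, 0);
        RecordWrite(3, 4096);   // page 3 written again by a later transaction
        Console.WriteLine(TryGetFromLog(3));          // 4096
        Console.WriteLine(TryGetFromLog(7) == null);  // True
    }
}
```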

One thing that I want to avoid is stalls, like what happens to leveldb under heavy sustained writes. So when the log file is full, we just create a new log file and write to that. It is possible that under very heavy load the “flush to data file” process will take long enough that the new log file will also be full before the process is done. Unlike leveldb, we are going to just go ahead and create yet another log file. So when the “flush to data file” process resumes, it may need to process multiple log files. That is fine with us, and likely to be even better, because free space reuse means we can probably skip writing pages that were overwritten by newer transactions (for example, above, we only need to write page 3 once).

There are a bunch of “details” that still need to be done:

  • What happens if you have a single transaction that is bigger than the size of the log file?
    • We can reject that transaction, 64MB per transaction is a lot.
    • Probably it means that we need to have a log file dedicated to that transaction, but how do we detect / decide that? That doesn’t seem workable.
    • Probably better would be to keep the 64MB size, but say that this is a transaction start only, and that we need to see the transaction end before we consider it committed. This is very similar to the way the leveldb log works.
  • Need to think about way to reduce the impact of flushing the logs:
    • Maybe provide a way for the user to “suppress” this for a period of 1 minute. We can use that during things like bulk insert, to get very high speed, and for as long as the bulk insert runs, we will keep asking for an extension. (I don’t want to have on/off switches, because that means we have to be very careful about turning it on again.)
    • Maybe we can detect the rate of writes and adjust accordingly? But what happens when we are running multiple databases at the same time?
  • Transaction logs should be named 0000000001.txlog, 0000000002.txlog, etc., because I keep getting sent them as “database log” during debug scenarios.
  • Need to make sure that this works properly for large values, so that values that take more than a single page are supported.
    • To make things complicated, what happens if we are at the last 3 pages of the log file, but we need to write a value that is 5 pages long? We need to mark the transaction as “tx started”, move to the next log file, and continue the transaction.
    • For that matter, what if we have a single value that is bigger than the size of the log file? Is that something that we want to support? We can do the same with a “tx started” marker, but maybe also indicate it with “tx large value start”, “tx large value mid”, “tx large value end”.
    • For both “tx started” and “tx large value start” we need to be sure that we consider the transaction aborted unless we have the “tx end” part.
  • On startup, we need to read the log file(s) from the last flushed log location onward and build the map of the pages contained in the log file. This introduces a “recovery” step of a sort, but I think that this should be a pretty small one overall.
  • We need to make sure to delete old log files (unless users want them for incremental backup purposes). And we need to make sure that the memory map for read transactions is kept alive as long as there are any transactions that may be reading from it.
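The “a transaction only counts once we see its end marker” rule from the list above could be applied during that startup scan along these lines – the marker names are my own, not an actual Voron format:

```csharp
using System;
using System.Collections.Generic;

// Sketch: replay a log of (marker, pageNumber) records, applying a
// transaction's pages only if its TxEnd marker was written. A TxStart with
// no matching TxEnd (e.g. a crash mid-transaction) is simply discarded.
// Marker names are assumptions for illustration.
enum Marker { TxStart, Page, TxEnd }

class TxReplaySketch
{
    public static HashSet<long> Replay(List<(Marker Kind, long PageNumber)> records)
    {
        var applied = new HashSet<long>();
        var pending = new List<long>();
        foreach (var (kind, pageNumber) in records)
        {
            switch (kind)
            {
                case Marker.TxStart: pending.Clear(); break;
                case Marker.Page:    pending.Add(pageNumber); break;
                case Marker.TxEnd:   // only now is the transaction committed
                    foreach (var p in pending) applied.Add(p);
                    pending.Clear();
                    break;
            }
        }
        return applied; // pages from unterminated transactions are never applied
    }

    static void Main()
    {
        var records = new List<(Marker, long)>
        {
            (Marker.TxStart, 0), (Marker.Page, 2), (Marker.Page, 3), (Marker.TxEnd, 0),
            (Marker.TxStart, 0), (Marker.Page, 19), // crash: no TxEnd written
        };
        var applied = Replay(records);
        Console.WriteLine(applied.Contains(3));   // True
        Console.WriteLine(applied.Contains(19));  // False
    }
}
```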

Overall, I think that this is a pretty radical departure in our internal processes compared to how LMDB works, but I have high hopes for it. Even though it all started with “do we really need to support incremental backups?”

My expectation is that we will see much higher write speeds, and about the same numbers for reads, especially for our scenarios.