An issue was recently raised with a really scary title:
Intermittent Index corruption: VoronUnrecoverableErrorException.
Those are the kinds of issues that you know are going to be complex. In the past, fixing such issues was usually a Task Force effort and quite a challenge.
We asked for more information and started figuring out who would handle the issue (given the time of the year) when the user came back with:
After pressing the disk check issue with our hosting provider, we found out that one of the disks was reporting an error but according to our hosting, it was only because the manufacturer's guarantee expired, and not the actual disk failure. We swapped the disk anyway, and so far we are not seeing the issue.
I write a transactional database for a living, and the best example of why we want transactions is transferring money between accounts. It is ironic, therefore, that there is no such thing as transactions for money transfers in the real world.
If you care to know why, go back 200 years and consider how a bank would operate in an environment without instant communication. I would actually recommend doing that; it is a great case study in distributed system design. For example, did you know that the Templars used cryptography to send money almost a thousand years ago?
Recently I was reviewing my bank transactions and I found the following surprise. This screenshot is from yesterday (Dec 18), and it looks like a payment that I made is still “stuck in the tubes” two and a half weeks later.
I got in touch with the supplier in question to apologize for the delay. They didn’t understand what I was talking about. Here is what they see when they go to their bank: they got the money.
For fun, look at the number of different dates that you can see in their details.
Also, as of right now, my bank account still shows the money as pending approval (to be sent out from my bank).
I might want to recommend that they use a different database. Or maybe I should just convince the bank to approve the payment by the time of the next invoice and see if I can get a bit of that infinite money glitch.
RavenDB is a database, a transactional one. This means that we have to reach the disk and wait for it to complete persisting the data to stable storage before we can confirm a transaction commit. That represents a major challenge for ensuring high performance because disks are slow.
I’m talking about disks of all kinds: rate-limited cloud disks, HDDs, SSDs, or even NVMe drives. From the perspective of the database, all of them are slow. RavenDB spends a lot of time and effort making the system run fast, even though the disk is slow.
An interesting problem we routinely encounter is that our test suite would literally cause disks to fail because we stress them beyond warranty limits. We actually keep a couple of those around, drives that have been stressed to the breaking point, because it lets us test unusual I/O patterns.
We recently ran into strange benchmark results, and during the investigation, we realized we were actually running on one of those burnt-out drives. Here is what the performance looks like when writing 100K documents as fast as we can (10 active threads):
As you can see, there is a huge variance in the results. To understand exactly why, we need to dig a bit deeper into how RavenDB handles I/O. You can observe this in the I/O Stats tab in the RavenDB Studio:
There are actually three separate (and concurrent) sets of I/O operations that RavenDB uses:
Blue - journal writes - unbuffered direct I/O - in the critical path for transaction performance because this is how RavenDB ensures that the D(urability) in ACID is maintained.
Green - flushes - where RavenDB writes the modified data to the data file (until the flush, the modifications are kept in scratch buffers).
Red - sync - forcing the data to reside in a persistent medium using fsync().
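To make those three kinds of operations more concrete, here is a minimal sketch (not RavenDB’s actual code) of what each one looks like at the OS level, assuming a POSIX-like system. The file names and helpers are made up for illustration, and real unbuffered direct I/O (O_DIRECT) also requires sector-aligned buffers, which I’m skipping here.

import os

JOURNAL = "journal.log"   # illustrative paths, not RavenDB's actual layout
DATA = "data.file"

def journal_write(record):
    # Blue: durable write on the critical path of the transaction.
    # O_DSYNC makes the write itself wait until it reaches stable storage.
    fd = os.open(JOURNAL, os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_DSYNC, 0o644)
    try:
        os.write(fd, record)
    finally:
        os.close(fd)

def flush(modified_pages):
    # Green: copy modified pages from the scratch buffers into the data file.
    # These writes are buffered; durability comes later, from sync().
    fd = os.open(DATA, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        for offset, page in modified_pages.items():
            os.pwrite(fd, page, offset)
    finally:
        os.close(fd)

def sync():
    # Red: force the data file contents down to the persistent medium.
    fd = os.open(DATA, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)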
The writes to the journal (blue) are the most important ones for performance, since we must wait for them to complete successfully before we can acknowledge that the transaction was committed. The other two ensure that the data actually reached the file and that we have safely stored it.
It turns out that there is an interesting interaction between those three types. Both flushes (green) and syncs (red) can run concurrently with journal writes. But on bad disks, we may end up saturating the entire I/O bandwidth for the journal writes while we are flushing or syncing.
In other words, the background work will impact the system performance. That only happens when you reach the physical limits of the hardware, but it is actually quite common when running in the cloud.
To handle this scenario, RavenDB does a number of what I can only describe as shenanigans. Conceptually, here is how RavenDB works:
def txn_merger(self):
    while self._running:
        with self.open_tx() as tx:
            while tx.total_size < MAX_TX_SIZE and tx.time < MAX_TX_TIME:
                curOp = self._operations.take()
                if curOp is None:
                    break  # no more operations
                curOp.exec(tx)
            tx.commit()
            # here we notify the operations that we are done
            tx.notify_ops_completed()
The idea is that you submit the operation to the transaction merger, which can significantly improve performance by merging multiple operations into a single disk write. The actual operations wait to be notified (which happens after the transaction successfully commits).
If you want to know more about this, I have a full blog post on the topic. There is a lot of code to handle all sorts of edge cases, but that is basically the story.
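For completeness, here is a rough sketch of what the submitting side of such a merger can look like. The names here (Operation, submit, and the queue) are hypothetical and follow the pseudocode above rather than the actual RavenDB API; the point is the shape of the pattern: a shared queue plus a per-operation completion signal.

import queue
import threading

class Operation:
    def __init__(self, work):
        self.work = work                 # callable that applies the change to the transaction
        self.done = threading.Event()    # signaled by the merger after the commit
        self.error = None

    def exec(self, tx):
        self.work(tx)

operations = queue.Queue()

def submit(work):
    # Called from request threads: enqueue the operation and block until the
    # merger has committed the transaction that contains it.
    op = Operation(work)
    operations.put(op)
    op.done.wait()
    if op.error is not None:
        raise op.error

def notify_ops_completed(ops, error=None):
    # The merger calls this after commit() (the notify_ops_completed() step above).
    for op in ops:
        op.error = error
        op.done.set()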
Notice that processing a transaction is actually composed of two steps. First, there is the execution of the transaction operations (which reside in the _operations queue), and then there is the actual commit(), where we write to the disk. It is the commit portion that takes a lot of time.
Here is what the timeline will look like in this model:
We execute the transaction, then wait for the disk. This means that we are unable to saturate either the disk or the CPU. That is a waste.
To address that, RavenDB supports async commits (sometimes called early lock release). The idea is that while we are committing the previous transaction, we execute the next one. The code for that is something like this:
def txn_merger(self):
    prev_txn = completed_txn()
    while self._running:
        executedOps = []
        with self.open_tx() as tx:
            while tx.total_size < MAX_TX_SIZE and tx.time < MAX_TX_TIME:
                curOp = self._operations.take()
                if curOp is None:
                    break  # no more operations
                executedOps.append(curOp)
                curOp.exec(tx)
                if prev_txn.completed:
                    break
            # verify success of previous commit
            prev_txn.end_commit()
            # only here we notify the operations that we are done
            prev_txn.notify_ops_completed()
            # start the commit in async manner
            prev_txn = tx.begin_commit()
The idea is that we start writing to the disk, and while that is happening, we are already processing the operations in the next transaction. In other words, this allows both writing to the disk and executing the transaction operations to happen concurrently. Here is what this looks like:
This change has a huge impact on overall performance, especially because it can smooth out a slow disk by letting us process the operations of the next transaction while waiting for the disk. I wrote about this as well in the past.
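If you want to picture how the begin_commit() / end_commit() pair from the pseudocode could work, here is a toy sketch that pushes the commit’s disk write to a background thread. This is only an illustration of the idea, with made-up names; it is not how RavenDB actually implements it (the real machinery is considerably more involved).

import threading

class CommitHandle:
    # What tx.begin_commit() might hand back: the journal write runs on a
    # background thread while the merger keeps executing the next batch.
    def __init__(self, write_fn):
        self.completed = False
        self._error = None
        self._thread = threading.Thread(target=self._run, args=(write_fn,))
        self._thread.start()

    def _run(self, write_fn):
        try:
            write_fn()               # the slow part: the actual disk write
        except Exception as e:
            self._error = e
        finally:
            self.completed = True

    def end_commit(self):
        # Wait for the in-flight write and surface any failure before we
        # notify the operations that their transaction is durable.
        self._thread.join()
        if self._error is not None:
            raise self._error

With a handle like this, the merger loop can check completed to decide whether to keep batching, and it must call end_commit() before notifying the waiting operations.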
So far, so good; this is how RavenDB has behaved for about a decade or so. So what is the performance optimization?
This deserves an explanation. What this piece of code does is determine whether the transaction will complete in a synchronous or an asynchronous manner. It used to do that based on whether there were more operations to process in the queue: when we finished executing a transaction and had to decide whether to complete it asynchronously, we would check whether additional operations were waiting in the queue (currentOperationsCount).
The change modifies the logic so that we complete in an async manner if we executed any operation. The change is minor but has a really important effect on the system. The idea is that if we are going to write to the disk (since we have operations to commit), we’ll always complete in an async manner, even if there are no more operations in the queue.
The effect is that the next batch of operations starts processing immediately, instead of waiting for the commit to complete and only then starting to process. It is such a small change, but it had a huge impact on the system performance.
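In terms of the pseudocode above, the decision boils down to something like the following (the variable names are illustrative, not the actual code):

# Old behavior: complete asynchronously only if more operations are already
# waiting in the queue (the currentOperationsCount check).
commit_async = current_operations_count > 0

# New behavior: complete asynchronously whenever this transaction executed
# any operation at all, even if the queue is now empty, so the next batch
# can start executing while the journal write is still in flight.
commit_async = len(executedOps) > 0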
Here you can see the effect of this change when writing 100K docs with 10 threads. We tested it on both a good disk and a bad one, and the results are really interesting.
The bad disk chokes when we push a lot of data through it (gray line), and you can see it struggling to pick up. On the same disk, using the async version (yellow line), you can see it still struggles (because eventually, you need to hit the disk), but it is able to sustain much higher numbers and complete far more quickly (the yellow line ends before the gray one).
On the good disk, which is able to sustain the entire load, we are still seeing an improvement (Blue is the new version, Orange is the old one). We aren’t sure yet why the initial stage is slower (maybe just because this is the first test we ran), but even with the slower start, it was able to complete more quickly because its throughput is higher.