Optimizing RavenDB by adding Thread.Sleep(5)
This post is here because we recently had to add this code to RavenDB:
Yes, we added a sleep to RavenDB, and we did it to increase performance.
The story started out with a reported performance regression. On a previous version of RavenDB, the user was able to insert 32,000 documents per second. Same code, same machine, new version of RavenDB, but the performance dropped to 13,000 documents per second.
That is, as we call it internally, an Issue. More specifically, issue RavenDB-14777.
Deeper investigation revealed that the problem was that we are too fast, therefore we are too slow. Adding a sleep fixed the being too fast thing, so we were faster again.
You might need to read the previous paragraph a few times to make sense of it; I’m particularly proud of it. Here is what actually happened. Our bulk insert code reads from the network, and as soon as we have some data, we start parallelizing the write to disk and the read from the network. The idea is that we want to reduce the user time, so we maximize the amount of work we do. This is a fairly standard optimization for us and has paid many dividends in performance. The way it works, we read from the network until there is nothing available in memory and we have to wait for I/O, at which point we start writing to the disk and wait for the network I/O to resume the operation.
However, the issue is that the versions the user was trying also included a runtime change. The old version ran on .NET Core 2.2 and the new version runs on .NET Core 3.1. There have been many optimizations as a result of this change, and it seems that the read-from-network path has benefited from them.
As a result, we were able to read the data from the buffer more quickly, which meant that we ended up waiting for network I/O sooner. That meant we issued a lot more disk writes, each one smaller, because we were better at reading from the network. And that, in turn, slowed down the whole thing enough to be noticeable.
Our change means that we’ll only queue a new disk operation if there have been 5 milliseconds with no new network traffic (or a bunch of other conditions that you don’t really care about). This way, we retain the parallel work and do not saturate the disk with small writes.
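To make the effect of the trigger concrete, here is a minimal sketch in Python. It is not RavenDB's code (which is C#), and it models only the batching trigger, ignoring the overlap with in-flight disk writes and the other conditions mentioned above. The function name `split_into_batches` and the arrival timestamps are made up for illustration.

```python
from typing import List, Sequence


def split_into_batches(arrival_times_ms: Sequence[float],
                       idle_ms: float) -> List[List[float]]:
    """Group chunk arrival times into disk batches.

    A batch is closed (handed to the disk) once the network has been
    quiet for at least idle_ms. With idle_ms = 0 this models the old
    trigger, "no data available right now".
    """
    batches: List[List[float]] = []
    current: List[float] = []
    for t in arrival_times_ms:
        # The gap since the previous chunk tells us how long the
        # reader would have been waiting on network I/O.
        if current and t - current[-1] >= idle_ms:
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


# Hypothetical arrival times, in milliseconds, of eight network chunks.
arrivals = [0, 1, 2, 3, 10, 11, 12, 30]

old_trigger = split_into_batches(arrivals, idle_ms=0)
new_trigger = split_into_batches(arrivals, idle_ms=5)

print(len(old_trigger), "disk writes with the old trigger")
print(len(new_trigger), "disk writes with the 5 ms trigger")
```

With these made-up timestamps, the old trigger hands every chunk to the disk on its own, while the 5 ms trigger merges chunks that arrive close together into a few larger writes.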
As I said earlier, we had to pump the brakes to get into real high speed.
 


Comments
this doesn't make sense, isn't there a better pattern to handle this? producer-consumer / channels / dataflow?
Uri,
This is using a producer-consumer + batching mode. The only difference is the trigger for the batch. Instead of "no data available", we changed it to "no data available for 5 ms".
How many documents per second are you inserting now?
Andres,
We are seeing single bulk insert pushing > 25K docs / sec after this change
Hi, Thanks for the interesting post.
A question: I understood that you wait until no data is available anymore "from the left hand side" (network in this case) and only then start writing this data "to the right hand side" (disk in this case). This feels more like a serial process than a parallel one... I missed something... Could you explain to me again how this is "parallelizing the write to disk and the read from the network"? Maybe it's because of the "batching mode" that I get confused.
Is it because you don't want to start writing to disk while reading from network in order to save CPU for the network reading process?
Or do you actually resume reading from the network if new data is coming even before the current write to disk operation is completed?
Thanks in advance.
Sylvain,
We start writing to the disk in an async manner, and read more from the network at the same time, ready for the next write to disk.
Thanks Oren.
But then I don't see why the new delay improves things. Is it because without it you would write too small batches of data to disk, thus decreasing the useful/overhead ratio?
Sylvain,
Yes, the issue was that we read from the network too quickly, so we sent a smaller batch to disk. That ended up causing high latency because we kept having to wait for the disk.
When we wait longer for the network, we send bigger disk batches, so we get more parallelism of work.
Interesting problem! I glanced at the code and wondered if you considered breaking up the process a bit more, like having a central ConcurrentQueue or BlockingCollection that one thread just dumps items into from the network and a separate thread that just dequeues as fast as it likes for disk writes.
Adam,
We did that in the past, but it turns out that the additional complexity isn't worth it. We need to make sure that we aren't reading too much into memory, that we balance network and disk speeds, etc. This ended up being the best option.
OK, thanks.
So optimally it would be like: don't start writing data to disk before we get enough data to write from the network, or, if we don't get that amount of data within a given time range, give up and write what we got so far anyway (because we don't want to wait forever). I guess the 5 ms are this "given time range".
Thanks.
Well, it does not look like an improvement...
Sylvain,
Yes, that is the case. The idea is to ensure that we are always going to do work, both network and disk.
Andres,
The two tests weren't run on the same machine for those numbers. The 32K was on the user's machine, the 25K was on one of our test machines. With the issue, the same test machine that now sees 25K / sec was doing 8K / sec.
Couple of quick things on this:
The turnaround time from the problem report to the fix was really good. It would be interesting to see a future post on both the profiling and diagnostic info you had available to help track down and test this optimization and how the "5ms" magic number was the most optimal batch size (and if different durations work better or worse for different disk types). Wonder if self-tuning is a thing in the future, especially as you run on such diverse HW?
@Andrés - we were part of the discovery of the problem. The 4.2.8 perf on our test rig was about 32k docs/sec. 4.2.101 was about 13k docs/sec. The patch / updated version was slightly better than 4.2.8 (about 34k docs/sec). Not scientific, but the 4.2.10x versions are proving faster and less resource intensive in most regards.
TrevHunter,
What may not be apparent here is that the code does adaptive behavior.
We are going to read from the network as long as:
All of these together mean that after a short while, we are going to settle on reading from the network in batches that match the time it takes to write them to the disk. There isn't a lot of code here, but the behavior is quite sophisticated.
I'm curious, did you try other values than 5?
JustPassingBy,
Yes, we tested a whole bunch of values. See the details in the post. 5 ms was the best value.
If a malicious user sent a single byte every 4 ms, that would keep one of your threads busy for a potentially very long time. If you had enough such malicious users, you could run out of thread pool threads/sockets/memory/other resources. Does the above sound like a real problem or am I missing something?
Adrian,
Not really, that would hit the rate limits that we have set and bounce. Also note that we are assuming a non-malicious user here; this is not generally exposed to the wide world, after all. You need an authenticated certificate to run this.