time to read 4 min | 762 words

In my previous post I showed a performance conundrum: code that had been optimized to reduce heavy allocations ended up over twice as slow.

In particular, we had a problem here: the new code is 3.4 times slower than the old one, but how?

[image: sampling profiler comparison of the two versions]

Now, the real scenario we had involved concurrent access, so it was much harder to figure out, but I cheated a bit when producing this image: I used a sampling profiler instead of a tracing one. The major difference between the two is that a tracing profiler will also give you the number of calls. That is usually called out as something you do when you want to analyze algorithmic complexity, but I find it incredibly useful for figuring out what my code is actually doing.

And indeed, looking at the same code using a tracing profiler gives us the following two calls:

[images: tracing profiler results for the two versions]

And when looking at the diffs between those two, we have:

[image: diff of the two tracing profiler runs]

So for some reason we are making 54 million more calls to the Equals method in the optimized version, but why? Both versions are using the exact same dictionary, with the exact same key type and even the same keys.

In the real scenario we were facing, that wasn’t the case, which made the issue harder to analyze. We started looking into whether we were causing some sort of cache poisoning by having the buffer holder as the dictionary value instead of the array directly, but that didn’t pan out. We kept circling back to the number of Equals calls. Note that the number of calls to TryGetValue is the same, as is the number of calls to GetHashCode. So what is the diff?

The diff, quite simply, is not here at all.

The problem is in the RemoveBefore method. In the old version, if we removed all the entries for a page, we removed it completely from the dictionary. In the new version, we reset the buffer so it can be used again next time. The problem with that approach is that it keeps the dictionary pretty big, much bigger than it would be with the old version of the code. And that means that a lookup has to find the value (which is empty) and then check its contents. In the old version, we would just compute the hash code, find that there is no matching entry, and exit.

Indeed, all we had to do was change RemoveBefore to look like this:
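(The sketch below is only an approximation, not the actual Voron code. It reuses the hypothetical names from the sketch of the optimized version in the previous post: _pages is the ConcurrentDictionary mapping page numbers to a PagesBuffer, and each buffer tracks its live versions between Start and End.)

```csharp
// Hypothetical sketch of the fix, not the actual Voron code.
public void RemoveBefore(long oldestActiveTx)
{
    foreach (var kvp in _pages)
    {
        var buffer = kvp.Value;
        while (buffer.Start < buffer.End &&
               buffer.Positions[buffer.Start].TransactionId < oldestActiveTx)
            buffer.Start++;

        // The change: a fully drained page leaves the dictionary entirely, so
        // later lookups for it miss right away instead of locating an entry,
        // comparing keys with Equals, and only then finding it empty.
        if (buffer.Start == buffer.End)
            _pages.TryRemove(kvp.Key, out _);
    }
}
```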

And that gives us:

  • 14.0 seconds & 1.1 GB of memory for the old version
  • 12.8 seconds & 0.4 GB of memory for the new version

Which is a pretty good result overall. It gets better when you break it apart into its component parts.

[image: profiler results for the first component]

This is actually surprising, since we didn’t really set out to optimize this call very much, and it is pretty much unchanged in both versions. I think that this is likely because we keep the buffers around longer, so they are more likely to be in the cache.

[image: profiler results for the second component]

This shows more than double the speed we previously had, which is pretty awesome, since this code is actually called per transaction, so anything that reduces that cost is golden.

[image: profiler results for the third component]

This happens during a flush, and reducing its cost is important to reducing the time we hold the write lock, so this is pretty sweet.

time to read 4 min | 629 words

PageTable is a pretty critical piece of Voron. It is the component responsible for remapping modified pages in transactions, and it is the reason why we support MVCC and can avoid taking locks for the most part. It has been an incredibly stable part of our software, rarely changing and pretty much the same as it was when it was initially written in 2013. It has been the subject of multiple performance reviews in that time, but what counted as acceptable performance in 2013 is no longer acceptable today. PageTable came up recently in one of our performance reviews as a problematic component, responsible for too much CPU and far too many allocations.

Here is a drastically simplified implementation, which retains the salient points:
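(The sketch below is a hypothetical reconstruction of such an implementation, not the actual Voron code. PagePosition and _pages are invented names, and the copy-on-write immutable list per page is an assumption chosen to match the allocation behavior described next.)

```csharp
// Hypothetical, drastically simplified PageTable: one immutable version list per page.
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Collections.Immutable;

public class PageTable
{
    public struct PagePosition
    {
        public long TransactionId;
        public long ScratchPosition;
    }

    // page number -> all versions of that page still visible to some transaction
    private readonly ConcurrentDictionary<long, ImmutableList<PagePosition>> _pages = new();

    // Called under the write lock: record the pages modified by a transaction.
    public void SetItems(long txId, IEnumerable<long> modifiedPages)
    {
        foreach (var pageNumber in modifiedPages)
        {
            // AddOrUpdate needs two delegates; because they capture txId,
            // both are allocated for every single modified page.
            _pages.AddOrUpdate(
                pageNumber,
                _ => ImmutableList.Create(new PagePosition { TransactionId = txId }),
                (_, existing) => existing.Add(new PagePosition { TransactionId = txId }));
        }
    }

    // Called under the write lock: drop versions no live transaction can see.
    public void RemoveBefore(long oldestActiveTx)
    {
        foreach (var kvp in _pages)
        {
            var remaining = kvp.Value.RemoveAll(p => p.TransactionId < oldestActiveTx);
            if (remaining.IsEmpty)
                _pages.TryRemove(kvp.Key, out _);   // fully drained pages leave the table
            else
                _pages[kvp.Key] = remaining;        // yet another allocation
        }
    }

    // Called concurrently by readers: newest version visible to txId, if any.
    public bool TryGetValue(long txId, long pageNumber, out PagePosition value)
    {
        value = default;
        if (_pages.TryGetValue(pageNumber, out var versions) == false)
            return false;
        for (var i = versions.Count - 1; i >= 0; i--)
        {
            if (versions[i].TransactionId <= txId)
            {
                value = versions[i];
                return true;
            }
        }
        return false;
    }
}
```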

Here is the sample workload for this class, which just simulates ten thousand transactions. This little scenario takes 15.3 seconds and allocates a total of 1.1 GB of memory! That is a lot of allocations, and it means a tremendous amount of time spent in GC. The most problematic issue here is the SetItems method, which will allocate two different delegates for each modified page in the transaction. Then there is the total abandon with which we allocate additional memory in there. As you can imagine, we weren’t very happy about this, so we set out to fix this issue.
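(The driver below is a hypothetical stand-in for that workload; the batch size, page range, and pruning cadence are invented.)

```csharp
// Hypothetical workload driver for the PageTable sketch above.
using System;
using System.Linq;

var table = new PageTable();
var random = new Random(42);

for (long txId = 1; txId <= 10_000; txId++)
{
    // Each transaction modifies a batch of pages.
    var modified = Enumerable.Range(0, 128)
                             .Select(_ => (long)random.Next(0, 100_000))
                             .ToList();

    table.SetItems(txId, modified);

    // Readers inside the transaction look the pages back up.
    foreach (var pageNumber in modified)
        table.TryGetValue(txId, pageNumber, out _);

    // Every so often the oldest reader moves forward and old versions go away.
    if (txId % 64 == 0)
        table.RemoveBefore(txId - 32);
}
```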

We can take advantage of the fact that SetItems and RemoveBefore are only called under lock, while TryGetValue is called concurrently with everything else.

So I wrote the following code:
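(Again, what follows is a sketch of the direction rather than the real code: PagesBuffer is a hypothetical reusable holder for a page's versions, SetItems and RemoveBefore mutate it only under the write lock, and TryGetValue tolerates stale reads from concurrent readers.)

```csharp
// Hypothetical reworked PageTable: reusable per-page buffers instead of immutable lists.
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

public class PageTable
{
    public struct PagePosition
    {
        public long TransactionId;
        public long ScratchPosition;
    }

    // A growable window of versions for a single page, reused across transactions.
    private class PagesBuffer
    {
        public PagePosition[] Positions = new PagePosition[4];
        public int Start;   // first valid entry
        public int End;     // one past the last valid entry
    }

    private readonly ConcurrentDictionary<long, PagesBuffer> _pages = new();

    // Only called under the write lock, so plain mutation is safe here.
    public void SetItems(long txId, IEnumerable<long> modifiedPages)
    {
        foreach (var pageNumber in modifiedPages)
        {
            if (_pages.TryGetValue(pageNumber, out var buffer) == false)
            {
                buffer = new PagesBuffer();
                _pages.TryAdd(pageNumber, buffer);
            }
            if (buffer.End == buffer.Positions.Length)
                Array.Resize(ref buffer.Positions, buffer.Positions.Length * 2);
            buffer.Positions[buffer.End++] = new PagePosition { TransactionId = txId };
        }
    }

    // Only called under the write lock. Note what happens to a drained buffer:
    // it is reset for reuse, but its entry stays in the dictionary.
    public void RemoveBefore(long oldestActiveTx)
    {
        foreach (var kvp in _pages)
        {
            var buffer = kvp.Value;
            while (buffer.Start < buffer.End &&
                   buffer.Positions[buffer.Start].TransactionId < oldestActiveTx)
                buffer.Start++;

            if (buffer.Start == buffer.End)
                buffer.Start = buffer.End = 0;   // empty, keep it around for next time
        }
    }

    // Called concurrently with writers; a stale read only hides data the
    // reading transaction could not have used anyway.
    public bool TryGetValue(long txId, long pageNumber, out PagePosition value)
    {
        value = default;
        if (_pages.TryGetValue(pageNumber, out var buffer) == false)
            return false;
        for (var i = buffer.End - 1; i >= buffer.Start; i--)
        {
            if (buffer.Positions[i].TransactionId <= txId)
            {
                value = buffer.Positions[i];
                return true;
            }
        }
        return false;
    }
}
```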

This relies on allowing stale reads from concurrent readers, which we don’t care about since they wouldn’t be able to make use of the data anyway. It reduced the allocations to just 320 MB, but the runtime actually went up to 32 seconds.

That is quite annoying, as you can imagine, and much cursing ensued as a result. I then pulled out my trusty profiler and asked it kindly to figure out what piece of code needs to be hit with a rolling pin and given a stern talking-to about what is expected from code after it has been laboriously and carefully optimized. It is expected to sit nicely and be fast, or, by Git, I’ll revert it.

What the hell?! Here are the original implementation costs, and you can clearly see how much time we are spending on garbage collection.

[image: profiler results for the original implementation]

And here is the optimized version, which is actually slower, and actually used more memory?!

[image: profiler results for the optimized version]

There are a bunch of interesting things going on here. We can see that we are indeed spending a little less time in GC, and that both the RemoveBefore and SetItems methods are much cheaper, but the cost of TryGetValue is so much higher. In fact, if we compare the two, we have:

[image: comparison of the two profiler runs]

So we are 3.4 times higher, and somehow the cost of calling the concurrent dictionary’s TryGetValue has risen by 88%.

But the implementation is pretty much the same, and there isn’t anything else that looks like it could cause that much of a performance gap.

I’ll leave this riddle for now, because it drove me crazy for two whole days, and give you the details of what is going on in the next post.
