Getting fatal out of memory errors because we are managing memory too well
We ran into a serious situation on one of our test cases. We put the system through a lot, pushing it to the breaking point and beyond. And it worked, in fact, it worked beautifully. Up until the point where we started to use too many resources and crashed. While normally that would be expected, it really bugged us, because we had provisions in place to protect us against exactly that. Bulkheads were supposed to be sealed, operations rolled back, etc. We were supposed to react properly: reduce the cost of operations, prefer being up to being fast, the works.
That did not happen. From the outside, what happened is that we got to the point where we would trigger the “the sky is about to fall, let’s conserve everything we can” mode, but we didn’t see the reaction that we expected from the system. Oh, we started to use a lot fewer resources, but the resources that we weren’t using? They weren’t going back to the OS, they were still being held.
It’s easiest to talk about memory in this regard. We hold buffers in place to handle requests, and in order to avoid fragmentation, we typically make them large buffers that are resident on the large object heap.
When RavenDB detects that there is a low memory situation, it starts to scale back. It releases any held buffers, completes ongoing work, starts working on much smaller batches, etc. We saw that behavior, and we certainly saw the slowdown as RavenDB was willing to take less upon itself. But what we didn’t see is the actual release of resources as a result of this behavior.
And as it turned out, that was because we were too good at managing ourselves. A large part of the design of RavenDB 4.0 was around reducing the cost of garbage collections by reducing allocations as much as possible. This means that we run very few GCs. In fact, Gen 2 collections are rare in our environment. However, we need these Gen 2 collections to be able to clean up stuff that is in the finalizer queue. In fact, we typically need two such runs before the GC can be certain that the memory is not in use and actually collect it.
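To make the “two such runs” concrete, here is a minimal sketch (illustrative only, not RavenDB code) of a finalizable type holding native memory. The first Gen 2 collection that finds such an object unreachable doesn’t free it; it only puts it on the finalizer queue. Only after the finalizer has run can a second collection actually reclaim it.

using System;
using System.Runtime.InteropServices;

class NativeBuffer
{
    private IntPtr _ptr = Marshal.AllocHGlobal(1024 * 1024);

    ~NativeBuffer()
    {
        // Runs on the finalizer thread, only after a collection has already
        // found the object unreachable and queued it for finalization.
        Marshal.FreeHGlobal(_ptr);
        _ptr = IntPtr.Zero;
    }
}

// First Gen 2 collection: the object is found dead, but because it has a finalizer
// it is promoted and queued instead of being freed. The finalizer thread then
// releases the native allocation. Only a second Gen 2 collection can reclaim
// the managed object itself.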
In this particular situation, we had carefully written the code so that we would get very few GC collections running, and that led us to crash because we would run out of resources before the GC could realize that we were not actually using them at that point.
The solution, by the way, was to change the way we respond to low memory conditions. We’ll be less diligent about keeping all the memory around, and if it isn’t being used, we’ll start discarding it a lot sooner, so the GC has a better chance to actually realize that it isn’t being used and recover the memory. And instead of throwing the buffers away all at once when we hit low memory and hoping that the GC will be fast enough in collecting them, we’ll keep them around and reuse them, avoiding the additional allocations that processing more requests would have required.
Since the GC isn’t likely to be able to actually free them in time, we aren’t affecting the total memory consumed in this scenario, but we are able to reduce allocations by serving requests from the buffers that are already allocated. These two actions, being less rigorous about policing our memory and not freeing things when we get a low memory notification, confusingly enough both reduce the chance of getting into a low memory situation and reduce the chance of actually using too much memory once we are in one.
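As a rough sketch of the “keep them around and reuse them” approach (the type and names here are hypothetical, not RavenDB’s actual code), a simple pool that serves requests from already-allocated buffers instead of allocating new ones could look like this:

using System.Collections.Concurrent;

public class ReusableBufferPool
{
    // Large enough to land on the large object heap (objects over ~85,000 bytes).
    private const int BufferSize = 128 * 1024;

    private readonly ConcurrentQueue<byte[]> _free = new ConcurrentQueue<byte[]>();

    public byte[] Rent()
    {
        // Prefer a buffer we already hold; only allocate when none is available.
        // Under memory pressure this serves new requests without new allocations.
        if (_free.TryDequeue(out var buffer))
            return buffer;
        return new byte[BufferSize];
    }

    public void Return(byte[] buffer)
    {
        // Keep the buffer for reuse rather than dropping it and hoping the GC
        // collects it before the next allocation is needed.
        _free.Enqueue(buffer);
    }
}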
Comments
Instead of throwing the buffers away or fully keeping them around, have you considered the middle ground of switching from strong references to weak references when memory is low?
That way, if GC is fast enough, it can reclaim the buffers, but if GC is slow, you can reuse them.
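A sketch of what that middle ground might look like (purely illustrative, not something RavenDB does): hold returned buffers through WeakReference so the GC is free to reclaim them, while still reusing any that survive.

using System;
using System.Collections.Concurrent;

public class WeakBufferCache
{
    private readonly ConcurrentQueue<WeakReference<byte[]>> _cache =
        new ConcurrentQueue<WeakReference<byte[]>>();

    public void Return(byte[] buffer)
    {
        // Only a weak reference is kept, so the GC may collect the buffer at any time.
        _cache.Enqueue(new WeakReference<byte[]>(buffer));
    }

    public byte[] Rent(int size)
    {
        // Reuse a buffer if the GC hasn't gotten to it yet; otherwise allocate.
        while (_cache.TryDequeue(out var weak))
        {
            if (weak.TryGetTarget(out var buffer) && buffer.Length >= size)
                return buffer;
        }
        return new byte[size];
    }
}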
Svick, That gives me far less control, and I would rather have better predictability here.
@Svick when you are dealing with memory and hardware effects at the same time, you should always trade short term gains in performance for predictability, because in the end predictability will give you better performance in the long run.
From the description of this and similar cases in the past, it looks like you're playing some hide & seek with the OS and sometimes become a victim of that. First of all, how is it possible that when you run out of memory you can actually free some buffers? So in fact your program didn't run out of memory, it just tricked itself into believing that it has no memory left and started to panic. And what is the goal of freeing the buffers? To return them back to the OS so the OS can then give them to you again? Or just to lower the panic level back to normal? In any case, it looks like your program bravely solves a problem that it created in the first place.
Rafal, In this case, the problem was that we had released the memory, but the GC didn't run (see earlier post about requiring two full GC Gen 2 runs), which meant that we didn't have access to the buffers but didn't have space to allocate new ones.
The case we had was a high memory notification. At this point, we would release any pending buffers, clear caches, etc. We would also move to a much more conservative mode, in which we can reduce our memory consumption at the expense of overall perf. That, in turn, meant that we would not be running GC to clear the managed resources, and when we had a big enough amount of work that did require us to allocate we would die.
Ok, but isn't the runtime supposed to run a GC pass to free memory if it can't allocate, before throwing OOM? So how is it possible that it forgot to do so in your case?
Rafal, See this post: https://ayende.com/blog/181665-A/the-cost-of-finalizers Basically, when you have such a scenario, you'll need two such GCs (at Gen 2) to actually know that the memory is free. So what happens is that the GC runs once, because it can't allocate any more, but it couldn't free enough memory. If it ran again, given our specific situation, it would have free memory, but it doesn't know that, and since it just ran a GC, it will fail with OOM. The key here is that if we weren't so careful about memory management we would probably have had more Gen 2 collections earlier and the memory would already be marked as free :-)
Missed that earlier post... nice trap indeed.
@Ayende,
Knowing that a Gen 2 GC will always (actually, often...) collect objects with finalizers, why not trigger a Gen 2 GC, wait for pending finalizers, then trigger a Gen 2 GC again? This way the memory will surely be freed. Calling the GC manually is far from ideal, however in low memory situations I think it might be an acceptable tradeoff.
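In code, the pattern being suggested would look roughly like this (shown only to illustrate the comment; as the reply below notes, this is not something RavenDB wants to sprinkle around):

// Force a full collection, let pending finalizers run, then collect again so
// the memory those finalizers released can actually be reclaimed.
GC.Collect(2, GCCollectionMode.Forced);
GC.WaitForPendingFinalizers();
GC.Collect(2, GCCollectionMode.Forced);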
Pop Catalin, The problem is that we are not called on that. The GC does this on its own, and we don't want to just sprinkle GC.Collect calls everywhere.

Would those GC.Collect calls be everywhere, or just at the point where the "The Sky is About to Fall" switch is flung?

Paul, How would you detect the sky is falling before it fell?
I assumed that you already had this trigger-point in the system.