Self flagellation and the barbarians are at the gate
We got a report from a user about severe issues with RavenDB. The server reports resource exhaustion while plenty of resources are still available, and once that happens, it refuses to even restart itself, forcing a process kill.
As you can imagine, that was a pretty big deal for us, so we set out to investigate. And we found some interesting results.
One of the things that we like to keep in mind with RavenDB is that it is a safe choice. Whenever we need to decide between various tradeoffs, we’ll always choose the safe option. That means, among other things, that we are pretty careful about the way that we approach external input. And in this case, we are actively protecting ourselves from the outside world. One of the ways we do that is by limiting the number of requests that we will process concurrently.
The idea is that it is better to flat out reject requests than to put such a load on the system that it will eventually crash. Indeed, that has been such a successful tactic that to this day, there have been exactly zero production issues with it. To my knowledge, it hasn’t even been noticed by any of our users.
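To make the idea concrete, here is a minimal sketch of rejecting instead of queueing. It is not RavenDB's actual code: the `RequestLimiter` class, its `try_handle` method, and the 503 status are all hypothetical names chosen for illustration.

```python
import threading


class RequestLimiter:
    """Hypothetical gate: rejects work outright once the cap is reached."""

    def __init__(self, max_concurrent):
        # One slot per request that may be in flight at the same time.
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def try_handle(self, handler):
        # Non-blocking acquire: if no slot is free, reject immediately
        # rather than letting callers pile up behind the limit.
        if not self._slots.acquire(blocking=False):
            return 503, None
        try:
            return 200, handler()
        finally:
            # Always give the slot back, even if the handler raised.
            self._slots.release()


limiter = RequestLimiter(max_concurrent=1)
# The inner call arrives while the outer one still holds the only slot.
nested = limiter.try_handle(lambda: limiter.try_handle(lambda: "inner"))
after = limiter.try_handle(lambda: "ok")
```

The point of the non-blocking acquire is that an overloaded server answers quickly with a rejection, and the slot is free again as soon as the in-flight request completes.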
The actual issue is that we have an internal limit that is set by default to 256 concurrent transactions. And by default, we will accept up to 192 concurrent requests. Then I looked at the actual logs, and I found:
And that explains much, but not nearly all. We had this in our code base for roughly 8 months. There are still other things that protect us from those issues, not the least of which is that it is actually hard to generate that number of requests against us (you really have to try very hard, usually from multiple machines). But there was one scenario that we didn’t consider for the purpose of protecting ourselves from the barbarians at the gate: Multi Get requests.
Multi Get requests allow you to package multiple requests to RavenDB into a single physical request. Those requests are going to cost you a single round trip to the server, and you can run as many of those as you want. In the dump we received, we could see 17 pending Multi Get requests, and about 400 queries being executed, each of them requiring its own session. No wonder we got out-of-session errors.
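The mismatch can be sketched with the numbers from the dump. This is an illustrative model only: it assumes the request limit counts physical requests while every inner query draws on the internal limit of 256, and the even split of the roughly 400 queries across the 17 Multi Get requests is made up.

```python
SESSION_LIMIT = 256   # internal default quoted in the post
REQUEST_LIMIT = 192   # default cap on concurrent requests quoted in the post


def admit(multi_get_batches):
    """Return (passes_request_gate, fits_session_pool) for a set of batches.

    Each entry in multi_get_batches is the number of inner queries carried
    by one physical Multi Get request.
    """
    physical_requests = len(multi_get_batches)
    sessions_needed = sum(multi_get_batches)  # one session per inner query
    return (physical_requests <= REQUEST_LIMIT,
            sessions_needed <= SESSION_LIMIT)


# Hypothetical split: 17 Multi Get requests carrying 400 queries in total.
batches = [24] * 16 + [16]
passes_gate, fits_pool = admit(batches)
```

Under these assumptions, the 17 physical requests sail through the request gate, while the 400 inner queries ask for far more sessions than the pool can provide.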
Final note: for what it is worth, I changed our limits to 1,024 concurrent sessions and 512 concurrent requests, which is more reasonable considering the kind of hardware we usually run on. Multi Get has another 192 sessions that it can utilize, and the rest are dedicated to background processes.
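As a rough check on that budget, and assuming each regular request holds exactly one session (an assumption, not something stated above), the share left for background processes works out as follows. The constant names are illustrative.

```python
TOTAL_SESSIONS = 1024      # new session limit from the post
REQUEST_LIMIT = 512        # new cap on concurrent requests from the post
MULTI_GET_SESSIONS = 192   # extra sessions reserved for Multi Get

# Assumption: one session per regular request, so requests can use at most 512.
background_sessions = TOTAL_SESSIONS - REQUEST_LIMIT - MULTI_GET_SESSIONS
print(background_sessions)
```

With those numbers, 320 sessions remain for background work even when both the request cap and the Multi Get allowance are fully used.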
Comments
Maybe you should measure round-trip/queue time, or pending requests, or something like that. If it takes too long, you can reject low-priority requests/sessions.
If you can influence the clients you might ask them to slow down.
Something like the: http://en.wikipedia.org/wiki/TCP_window_scale_option
curious - why the constants instead of configuration options?
allan, Because people will change them. And then you have users complaining about "ravendb doesn't work", and it takes a lot of time & effort to figure out that they did a strange config and got very strange results. If this is a configuration value, you need to document it, support it and watch for it being changed.
What build is this in?
Oren, it is very interesting when you blog a post like this and don't list which builds are affected and which build has the fix. Especially seeing as a lot of your posts are written days before they are published.