Things we learned from production, part II – wake up or I kill you dead
Getting started is probably easier than shutting down. I mean, no one is going to begrudge us some time to get our feet under us, right?
As it turned out, this assumption is wrong on quite a few levels.
To start with, hosts such as IIS / Windows Service Manager will give you a certain time to start before they decide that you are hung and ruthlessly execute you without even thinking twice about it. This doesn’t even include the issue of admins with people breathing down their necks, who assume that a taste of mortality must convince RavenDB to try even harder the next time it is started, after the 7th time it was killed for not starting fast enough.
Because killing us during startup is pretty much the same as a standard crash, it means that we need to run recovery after this happens, which means that the next startup is going to take longer, and then…
I think you can get the picture, right?
But the issue here is actually much more complex.
It is actually easier to recover from a real crash (something like a process termination or kill -9). It is harder when it isn’t a real crash, but something like IIS just recycling the AppDomain. The reason it is harder is that the process itself keeps running, so anything that is scoped to the OS, like file handles, unmanaged resources, etc., is actually still alive. It means that during such a shutdown, you have to be very careful about detecting that you are going down and cleaning up after yourself properly.
Moving back to the actual startup issue: we have to start up fairly quickly, even if we just crashed. That makes sense, I guess. Now, that is fine and dandy, but that is just for the system database. What happens when you want to access a non-system database (for example, the Northwind database)?
In RavenDB, we load those databases lazily, so on the first request to that particular database, we will load it.
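To make the lazy loading concrete, here is a minimal sketch. It is written in Python rather than RavenDB's actual C# code, and `DatabaseRegistry` and `open_database` are hypothetical names made up for illustration. The point is only that the first request for a database is the one that pays the full startup cost.

```python
import threading


class DatabaseRegistry:
    """Loads each database on first request and caches it afterwards."""

    def __init__(self, open_database):
        self._open_database = open_database  # hypothetical loader callback
        self._databases = {}
        self._lock = threading.Lock()

    def get(self, name):
        with self._lock:
            if name not in self._databases:
                # The first request for this name pays the startup cost.
                self._databases[name] = self._open_database(name)
            return self._databases[name]


loads = []


def open_database(name):
    # Stand-in for real startup work (opening files, running recovery).
    loads.append(name)
    return {"name": name}


registry = DatabaseRegistry(open_database)
first = registry.get("Northwind")
second = registry.get("Northwind")  # served from the cache, no second load
```

Holding one lock for the whole load keeps the sketch short, but it also means a slow database blocks requests for every other database, which is a simplification and not something a real server would want.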
As it turned out, this simple and fairly obvious decision has caused no end of problems.
Starting up a database may take a while. In bad cases, that while may be long enough that the request times out. Now, what does it mean, a request timing out? You might get a 408 Request Timeout from the server, but that is the client perspective.
What happens on the server? Well, IIS handed over control of the request to RavenDB, and as far as IIS is concerned, RavenDB is sitting there doing nothing, well above its time limit. Now, IIS doesn’t have a way to tell RavenDB, stop processing this request. So what do you think it does?
Welcome to the nice land of Thread.Abort().
Now, if you have ever read about Thread.Abort(), you probably know that every single reference to that is filled with warnings about the need to be very careful about what you are doing, that it is a very bad idea in general and that you should take care to never use it. The reason it is such a bad idea is that you basically cut the thread at mid execution, leaving it no chance at all to actually handle things. It is an easy way to violate invariants.
In particular, it is a good way for your cleanup to never happen. Think about it: we are in the middle of our constructor, opening files, setting things up, and suddenly the floor is yanked right out from under us.
As it turned out, in those cases, we would leak some stuff out. The next time that you tried to access the database, you would get an error that said that the files were already opened by someone else. (To make things worse, those were unmanaged resources, so they wouldn’t get cleaned up by the system when the GC ran.)
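The leak is easier to see in a small sketch. Python has no equivalent of Thread.Abort(), so an exception raised partway through the constructor stands in for the abort here, and `Storage`, `tracking_open` and `interrupt_after` are hypothetical names. Because the caller never receives the half-built object, the constructor is the only place that can still release what was already opened.

```python
import os
import tempfile


class Storage:
    """Opens several files; releases them if construction is interrupted."""

    def __init__(self, paths, opener, interrupt_after=None):
        self.handles = []
        try:
            for index, path in enumerate(paths):
                if interrupt_after is not None and index == interrupt_after:
                    # Stand-in for the thread being aborted mid-constructor.
                    raise RuntimeError("aborted during initialization")
                self.handles.append(opener(path))
        except BaseException:
            # Nobody else holds a reference to this object, so nobody else
            # can close these handles: do it here, then re-raise.
            self.close()
            raise

    def close(self):
        while self.handles:
            self.handles.pop().close()


opened = []


def tracking_open(path):
    # Records every handle so the demo can check for leaks afterwards.
    handle = open(path, "a+b")
    opened.append(handle)
    return handle


with tempfile.TemporaryDirectory() as folder:
    paths = [os.path.join(folder, name) for name in ("data", "index", "journal")]
    try:
        Storage(paths, tracking_open, interrupt_after=2)
    except RuntimeError:
        pass
    leaked = [handle for handle in opened if not handle.closed]
```

Without the `except` block, the two handles opened before the interruption would stay open, and the next attempt to open the same files exclusively could fail in the way described above.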
That led to errors that were extremely hard to figure out, because they would only occur when running at a high load, with a db that crashed and was now recovering, and with a few other databases waiting as well. And going over the code, thinking multi-threading thoughts, none of that worked. At some point, I put so many locks there, just to figure out what was going on, that the code looked like this:
[Image: a chastity belt]
But the actual problem wasn’t another thread corrupting state. The problem was that the current thread was ruthlessly killed mid-operation.
Once we figured that one out, it was straightforward, but in no way easy, to devise a solution. We made sure that our db init code was robust against thread aborts, and then we moved the actual db initialization to a separate thread, one that wasn’t controlled by IIS, so we could actually get things done without having a hard time limit.
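The shape of that solution can be sketched as follows. This is a Python illustration under assumptions, not the RavenDB implementation: `LazyDatabase`, `slow_start` and the timeout values are all hypothetical. Initialization runs on a thread that the request host does not own, and each request waits for it only up to its own time limit.

```python
import threading
import time


class LazyDatabase:
    """Initializes on a background thread; callers wait only up to a limit."""

    def __init__(self, initialize):
        self._initialize = initialize  # hypothetical slow startup function
        self._ready = threading.Event()
        self._lock = threading.Lock()
        self._thread = None
        self._database = None

    def _run(self):
        self._database = self._initialize()
        self._ready.set()

    def get(self, timeout):
        with self._lock:
            if self._thread is None:
                # This thread belongs to us, not to the request host, so a
                # timed-out request cannot kill the initialization.
                self._thread = threading.Thread(target=self._run, daemon=True)
                self._thread.start()
        if not self._ready.wait(timeout):
            raise TimeoutError("database is still starting, try again later")
        return self._database


def slow_start():
    time.sleep(0.3)  # stand-in for recovery and file opening
    return "Northwind"


database = LazyDatabase(slow_start)
try:
    database.get(timeout=0.05)
    first_attempt = "ready"
except TimeoutError:
    first_attempt = "still starting"
second_attempt = database.get(timeout=5)
```

The first request gives up and reports an error while startup carries on undisturbed, and a later request gets the database once it is ready. The sketch does not handle a failing `initialize`; a real version would need to record the failure and surface it to waiting requests.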
In my next post, I’ll discuss the fallacy of the singleton and how much pain it caused us.
Comments
Nice idea with moving not-abortable stuff to a separate thread.
What about CER, CriticalFinalizer, SafeHandle and other things? And if we talk about the AppDomain being unloaded - I hope you are aware of the fact that you can integrate nicely with the WAS runtime?
This is my view on what you are/were doing: you receive a phone call at 7:00 am, but you are sleeping. You wake up, but don't pick up the phone. Instead you go to have breakfast first, and when you go back to the phone, you have a missed call. You should not do that. Instead, you should pick up the phone and suggest that she/he call you later because you need to have a cup of coffee first.
IIS calls you, you wake up, but you are not ready to serve the request. IMHO you should not work hard to serve the request as soon as possible. Instead, you should return an error and keep working.
Also, I think that after a crash you should recover all databases, but return errors when a client tries to access a database that is not yet recovered.
SQL Server, for example, behaves this way: right after startup, before the databases are recovered, it returns errors, but it picks up the phone.
For those who didn't know, the image displays standard-issue underwear in the Israeli army
This somehow explains why some posts (like this) are full of pain. I wouldn't trust any software with such a rich inner life.
this chastity belt is more than expressive:)
Vs, We make use of some of them, yes. And AppDomain unload is something that we can work with. What we had a really hard time with was the thread aborts.
Jesus, We actually have a slightly more complex behavior now. We would pick up the phone, ask you to wait, and start things up. If things finish up quickly, we will answer normally. If things do not finish quickly enough, we will give you an error.
Rafal, You actually need to do all of that in order to provide a rich feature set.
Ayende, I was referring to the psychological intensity of the shutdown and startup procedure, not to the feature set ;)
Rafal, Given a rich feature set, you need to deal with a lot of variables, and they impact both startup and shutdown.