[Unstable code] How a blocking remote call can take down an application
I mentioned that this line has the potential to destabilize an application, because it is a remote blocking call.
var cart = customerSrv.GetShoppingCart(customerId);
Neil Mosafi left the following comment:
I've never experienced other threads being blocked whilst making a sync service call. Even an Async call is essentially a sync call but done in another thread or using an iocompletion port. Or are you saying we should be making duplex service calls to avoid possible problems?
Let us start by saying that I am talking about pathological scenarios, nothing that you'll meet in everyday scenarios. However, "once in a million is next Tuesday" in our business. I have seen applications behave... strangely in production.
Let us focus on the trivial issues first, shall we?
- HTTP: Only 2 concurrent requests per host
This is fairly well known, and there are ways around it, but it is neither trivial nor something you can ignore.
Result: requests are serialized in the HTTP layer
- HTTPS: All of HTTP's limitations, plus ~4,000 requests per IP (not host) in any 2 minute window.
This is not well known, and while there are ways around it, it is not something that most people think of until the application fails.
Result: request is denied.
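One of the "ways around" the first limit can be sketched as follows. This is a minimal sketch, assuming the classic .NET Framework HTTP stack, where outgoing connections per host are governed by ServicePointManager. The class name and the value 16 are hypothetical, chosen only for illustration. It does nothing for the HTTPS limit.

```csharp
using System;
using System.Net;

public static class ConnectionLimitSketch
{
    // Hypothetical value; pick one based on what the remote host can actually handle.
    public const int AssumedLimit = 16;

    // Raises the default number of concurrent connections per host.
    // Only affects ServicePoint objects created after this assignment,
    // so it should run at application startup, before the first request.
    public static int Raise()
    {
        ServicePointManager.DefaultConnectionLimit = AssumedLimit;
        return ServicePointManager.DefaultConnectionLimit;
    }
}
```

Raising the limit only moves the bottleneck. If the remote side is slow, more concurrent connections just means more threads waiting at the same time.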
Those are the common ones, but with TCP based protocols, the server can hang the client in so many ways, it isn't even funny. TCP redirection loops, waiting on the listen queues, slow transfer rates, malformed TCP protocols and high packet loss are just the things that occur to me right now.
In general, we can divide the issues into two kinds: fail fast and block. Failing fast is what we want, blocking is what we have to deal with.
Now, how can a blocking call take down an application? Starting with a convoy and ending with a chain reaction.
Let us say that we are making the blocking call above, and for some reason, it takes longer to process than our SLA allows. In most scenarios, we would like to abort the current call and send an error downstream. What we don't want is to have a situation on our hands where we block. If we block, we hold a valuable thread that is doing nothing but waiting.
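The "abort and send an error downstream" idea can be sketched like this. It is a minimal sketch, assuming .NET 4.5 or later. BoundedCall and TryCall are hypothetical names, not part of any library, and the two second budget in the usage comment is an arbitrary stand-in for the SLA.

```csharp
using System;
using System.Threading.Tasks;

public static class BoundedCall
{
    // Runs a blocking call on a thread pool thread and waits at most `timeout` for it.
    // Returns true and sets `result` if the call finished in time, otherwise false.
    // If the call throws before the timeout, Wait rethrows it as an AggregateException.
    public static bool TryCall<T>(Func<T> blockingCall, TimeSpan timeout, out T result)
    {
        Task<T> task = Task.Run<T>(blockingCall);

        if (task.Wait(timeout))
        {
            result = task.Result;
            return true;
        }

        // Timed out: the caller is free to fail fast, but note that the pool
        // thread running blockingCall is STILL blocked until the call returns.
        result = default(T);
        return false;
    }
}

// Usage (hypothetical, mirrors the line from the post):
//   if (!BoundedCall.TryCall(() => customerSrv.GetShoppingCart(customerId),
//                            TimeSpan.FromSeconds(2), out var cart))
//   {
//       // report the error downstream instead of waiting
//   }
```

This only unblocks the caller. The underlying call keeps holding a thread, so it is a way to fail fast toward our own clients, not a way to get the thread back. Getting the thread back requires a timeout on the call itself, for example on the service proxy or the socket.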
In .NET, there are several types of threads that we utilize: thread pool threads (ASP.NET, WCF, QueueUserWorkItem, etc.), the main thread (in client applications), free threads (my own term for threads that the application created manually), IO threads (we mostly don't deal with them, they are an infrastructure concern) and private thread pools.
A thread is an expensive resource, so we tend to hang on to threads rather than creating new ones all the time. In particular, for most servers, we have a finite number of threads that are available for doing work.
Now, assume that some threads are blocked, or even just processing things more slowly. The concept of blocking remote calls means that we have now propagated this issue to all our clients, which will propagate it to their clients, etc. In fact, a convoy (serialization of processing work in one place) can easily lead to a chain reaction which will lead to a meltdown of the entire application.
And that is the good part.
The bad part is if all your threads are blocked for some reason. (I had a case once where some idiot ran a long query with serializable isolation on the log table. Guess what happened to the application in the meantime?) If all the threads are blocked, you can't do anything, you are dead in the water.
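To see the "all threads are blocked" case in isolation, here is a small simulation. It is a sketch under stated assumptions, not how ASP.NET or WCF manage their pools: a SemaphoreSlim stands in for a finite pool of worker slots, and a ManualResetEventSlim that is never set during the test stands in for a remote server that never replies. All names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class StarvationSketch
{
    // Returns true if a new request could get a worker slot within `patience`
    // while `stuckCalls` workers are blocked on a "remote call" that never returns.
    public static bool CanAcceptNewWork(int poolSize, int stuckCalls, TimeSpan patience)
    {
        if (stuckCalls > poolSize)
            throw new ArgumentException("stuckCalls cannot exceed poolSize");

        var slots = new SemaphoreSlim(poolSize, poolSize);    // the finite "pool"
        var remoteReplies = new ManualResetEventSlim(false);  // the hung server
        var started = new CountdownEvent(stuckCalls);
        var workers = new List<Thread>();

        for (int i = 0; i < stuckCalls; i++)
        {
            var worker = new Thread(() =>
            {
                slots.Wait();          // take a worker slot
                started.Signal();
                remoteReplies.Wait();  // the blocking call: holds the slot, does nothing
                slots.Release();
            });
            worker.IsBackground = true;
            worker.Start();
            workers.Add(worker);
        }

        started.Wait();                        // every stuck call now holds its slot
        bool accepted = slots.Wait(patience);  // a new request arrives
        if (accepted) slots.Release();

        remoteReplies.Set();                   // clean up: let the stuck calls finish
        foreach (var w in workers) w.Join();
        return accepted;
    }
}
```

With one free slot the new request gets through. With every slot held by a blocked call it does not, no matter how idle the CPU is.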
I will talk about approaches to dealing with this in a future post.
Comments
All I can say is that I entirely agree. From my perspective it's very simple. If you don't specify a timeout value you ask for trouble. I described the same problem but at a lower level on my blog: http://pabich.eu/blog/archive/2008/04/16/never-ever-synchronize-threads-without-specifying-a-timeout-value.aspx
well, i know that you're reading Release It, and the book proposes the Circuit Breaker pattern to deal with these kind of issues
i've provided a (simple) implementation of the pattern here:
http://davybrion.com/blog/2008/05/the-circuit-breaker/
it needs more work to be production-ready but it might be a good start
Soo, here's what I don't get - the line you show should be responsible for being "unblockable"?
Sounds a bit sensationalist because this is just any old service call? I could have completely missed the point of course..
I am pointing out the flaw in sync remote calls.
Yes, this IS how most service calls are done. The model is broken.