Writing unreliable software
This surprised me, to say the least. I ran into a bug during stress testing of SvnBridge: after a while, it would simply get stuck.
I am not the best at figuring out exactly what got an application stuck, but I generally manage to put in some effort before I call in the big guns. This time, I managed to form a working theory and prove that the symptoms I was seeing were consistent with it. I decided to dig into it a bit, and came up with interesting results.
The jury is still out on whether that was the real reason for the issue SvnBridge has, but I went to some trouble to research my theory. Along the way, I came to some interesting conclusions. Just a reminder: I recently read Release It!, and I consider it a very influential book. One of the things that kept coming up in the book is how a chain reaction can take down an application. A single server stops responding, and the others just crumble. The examples that were brought up were of two kinds: a flawed implementation of a connection pool, and a mismatch in the various capacities of the systems.
In SvnBridge, there is a place where I am doing something similar to this code:
revision = webService.GetLatestVersion()
items = webService.GetItems(revision, downloadPath)
items.ForEach { item | item.BeginDownload(webService) }
comment = webService.GetLog(revision)
items.WaitForAllToFinishDownloading()
SendToClient( Revision: revision, Comment: comment )
for item in items:
    item.WaitUntilLoaded()
    SendToClient( item.Data )
This isn't the exact code, but it is a good, and simple, representation of what is going on. This code can lead, under a set of special circumstances, to a hung server.
The important thing to understand here is that this code is using BeginDownload to perform an async invocation over HTTP. Another important data point is that we tend to hit the same physical server for a lot of our work.
Take a look at what is actually happening...
It is very important to observe that GetLog is never actually executed. The .NET framework allows (by default) only 2 connections to a server. GetLog will only be executed when there is a free slot, and the async requests are ahead of it in the queue. (Actually, I haven't verified the exact sequence in which this would be executed.)
Until the async requests complete, we are stuck. This is important because it is not limited to the current request. This means that one request can block another, and since a single checkout request can cause a few thousand sub-requests to the backend server, a single user can take down the system for a significant amount of time. (What I am actually seeing is a bit different; it looks like the async requests never return from the server under high load, but never mind that.)
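The starvation can be sketched in Python (hypothetical names; this is not SvnBridge's code). A 2-worker thread pool stands in for the framework's two-connections-per-server limit: a handful of slow downloads are queued first, and the tiny GetLog-style request cannot run until they release their slots.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Two workers stand in for .NET's default limit of 2 connections per server.
pool = ThreadPoolExecutor(max_workers=2)
completion_order = []

def slow_download(i):
    time.sleep(0.2)                        # a file download holding a "connection"
    completion_order.append(f"download-{i}")

def get_log():
    completion_order.append("get_log")     # tiny request, but queued last

downloads = [pool.submit(slow_download, i) for i in range(4)]
log = pool.submit(get_log)                 # must wait for a free slot
log.result()
pool.shutdown(wait=True)

# get_log could only start once enough downloads had freed their slots,
# so at least three downloads finish before it does.
print(completion_order.index("get_log") >= 3)
```

Scale the sleep up to a multi-megabyte download and the queue up to a few thousand items, and the small request is blocked for a very long time.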
There are solutions for all of that; connection groups and overriding the number of allowed connections per server are just a few of them.
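Those are .NET-specific knobs (ServicePoint exposes the connection limit, and ConnectionGroupName partitions the pool). A more general mitigation in the same spirit, which Release It! calls a bulkhead, is to cap how many of the shared connections any single request may hold, so one user's thousands of sub-requests cannot monopolize them. A Python sketch with hypothetical names, not SvnBridge's actual code:

```python
import threading
import time

# Bulkhead sketch: a single checkout request may hold at most
# MAX_PER_REQUEST backend connections at once, no matter how many
# items it wants to download.
MAX_PER_REQUEST = 2
slots = threading.BoundedSemaphore(MAX_PER_REQUEST)
downloaded = []

def begin_download(item):
    slots.acquire()                        # blocks once 2 downloads are in flight
    def run():
        try:
            time.sleep(0.05)               # stand-in for the HTTP download
            downloaded.append(item)
        finally:
            slots.release()                # free the slot for the next item
    t = threading.Thread(target=run)
    t.start()
    return t

threads = [begin_download(i) for i in range(6)]
for t in threads:
    t.join()
```

The caller still issues every download, but never has more than two in flight, leaving slots free for other requests (and for GetLog).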
I remember reading Release It! and being thankful that so much of it seems to be focused on Java (thankfully, I have never had to write my own connection pool, which seems to be a pastime in Java land).
What caused this post was actually my defensive approach failing with spectacular results. I started by defensively specifying a Timeout on my requests. It would kill the request, but it would also keep the server alive. Unfortunately, Timeout has no effect on async requests. I found this very surprising; even though it is documented, I consider it a bug. Even worse, the documentation example for BeginGetResponse exists specifically to fix this issue, to support timeouts in async requests.
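The workaround the documentation shows is to arm your own timer and abort the request when it fires. The shape of it, sketched in Python (a sleeping function stands in for a backend that never answers; in .NET you would call request.Abort() when your timer fires):

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def stuck_download():
    time.sleep(0.5)            # stands in for a backend that never answers
    return "data"

pool = ThreadPoolExecutor(max_workers=1)
future = pool.submit(stuck_download)
try:
    # Enforce our own deadline, since the request's Timeout property
    # is ignored for async calls.
    result = future.result(timeout=0.1)
except TimeoutError:
    result = None              # in .NET: request.Abort() here
print(result)                  # → None: the caller is unblocked, not hung
```

The point is that the deadline lives outside the request: the caller gives up and cleans up on its own schedule instead of trusting a property the async path never reads.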
Leaving aside my own annoyance at hitting this tripwire, it brought sharply to mind that we should be very aware of the things that we do, and how they affect the longevity and scalability of our solutions. I spent the end of last week and most of the beginning of this week doing just that, throwing huge amounts of code and requests at SvnBridge. This was the most interesting problem it had, and it was very interesting to see how we would solve it.
Comments
Does the software behave differently if you configure the framework to allow more concurrent requests?
I hit the problem of the timeout not working on async requests a while ago. I too think that this is a bug; it is documented, but it remains a bug.
Alk.
TFS proxies use UnsafeAuthenticatedConnectionSharing and ConnectionGroupName. All the proxies to a server use the same connection group name.
My guess is that what you observed is due to these connection group settings. Is there a way for you to find the ServicePoint.ConnectionLimit value for your app? There is a good chance that these are related.
"The .NET framework allows (by default) only 2 connections to a server. GetLog will only be executed when there is a free slot, and the async requests are ahead of it in the queue. (Actually, I haven't verified the exact sequence in which this would be executed.)"
Two things:
The queue doesn't wait for a response from the first request before sending another. As an example, if you invoke 14 very slow methods asynchronously and a 15th fast method asynchronously, you will get the 15th response back first. The fact that you can only have two sockets open to the server does not mean that you can only have two requests executing at a time. The queue means that you can only send 2 requests at a time (one over each socket). A queue on the other side means you will only receive 2 requests at a time.
2 connections != 2 concurrent executing requests
This generally doesn't factor into anything unless the requests or responses become large. In that scenario, if you have 3 responses that need to be received and 2 of them are large files (several megabytes), then the two multi-megabyte files will clog the connections (because of the time needed to receive the data). Opening the 3rd connection would allow the little response (#3) to come through.
Generally speaking, this only factors in if you are sending or receiving large requests/responses.
Although this can bite you if you are doing heavy webservice traffic between two machines (as SOAP is very bloated). It can cause an artificial throughput/performance bottleneck--especially if there is a high speed network connection between the machines.
Now for the second thing:
The queue isn't strictly FIFO. Request #N may be sent before Request #1. In simple test scenarios, I've seen the 15th async request received on the server (and processed) before the 1st (when sending the requests asynchronously very close together).
For the above, you can google "Http Pipelining" for more information as that's how .NET/IIS do it on the wire.
Next on the list of things to check, but I don't think so
The scenario that I have is file downloads, which is why I am hitting this.
Thanks for confirming that the queue is not FIFO
Oren,
This is a bit of a tangent, but what tools do you use for the nice graphics you use in your posts? Especially code mark-up with boxes and arrows?
Thanks!
MSPaint & PowerPoint