Writing unreliable software

This surprised me, to say the least. I ran into a bug during stress testing of SvnBridge: after a while, it would simply get stuck.

I am not the best at figuring out exactly what got an application stuck, but I generally manage to put in some effort before I call in the big guns. This time, I managed to come up with a working theory and show that the symptoms I was seeing were consistent with it. I decided to dig into it a bit, and came up with interesting results.

The jury is still out on whether that was the real reason for the issue that SvnBridge has, but I went to some trouble to research my theory. Along the way, I came up with some interesting conclusions. Just a reminder: I recently read Release It!, and I consider it a very influential book. One of the things that kept coming up in the book is how a chain reaction can take down an application. A single server stops responding, and the others just crumble. The examples that were brought up were of two kinds: a flawed implementation of a connection pool, and a mismatch in the various capacities of the systems.

In SvnBridge, there is a location where I am doing something similar to this code:

revision = webService.GetLatestVersion()

items = webService.GetItems(revision, downloadPath)
items.ForEach { item | item.BeginDownload(webService) }

comment = webService.GetLog(revision)

items.WaitForAllToFinishDownloading()

SendToClient( Revision: revision, Comment: comment)

for item in items:
	item.WaitUntilLoaded()
	SendToClient( item.Data )

This isn't the exact code, but it is a good, and simple, representation of what is going on. This code can lead, under a particular set of circumstances, to a hung server.

The important thing to understand here is that this code is using BeginDownload to perform an async invocation over HTTP. Another important data point is that we tend to hit the same physical server for a lot of our work.
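To make that concrete, here is a rough sketch of what a BeginDownload / WaitUntilLoaded pair might look like on top of HttpWebRequest. This is not the SvnBridge code; the class and member names are made up for illustration.

using System;
using System.IO;
using System.Net;
using System.Threading;

// Hypothetical item that downloads its data asynchronously from the backend server.
public class DownloadItem
{
    private readonly ManualResetEvent loaded = new ManualResetEvent(false);
    public byte[] Data { get; private set; }

    public void BeginDownload(string downloadUrl)
    {
        var request = (HttpWebRequest)WebRequest.Create(downloadUrl);
        // Hand the request to the framework; the callback fires when the
        // response headers arrive, and the connection stays in use until the
        // body is fully read and the response is closed.
        request.BeginGetResponse(OnResponse, request);
    }

    private void OnResponse(IAsyncResult ar)
    {
        var request = (HttpWebRequest)ar.AsyncState;
        using (var response = request.EndGetResponse(ar))
        using (var stream = response.GetResponseStream())
        using (var buffer = new MemoryStream())
        {
            var chunk = new byte[8192];
            int read;
            while ((read = stream.Read(chunk, 0, chunk.Length)) > 0)
                buffer.Write(chunk, 0, read);
            Data = buffer.ToArray();
        }
        loaded.Set();
    }

    public void WaitUntilLoaded()
    {
        loaded.WaitOne();
    }
}

The key point is that BeginGetResponse hands the request over to the framework, which schedules it on one of the connections it keeps to that server.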

Take a look at what is actually happening...

[Image: the async download requests occupy the available connections to the backend server, while GetLog waits in the queue]

It is very important to observe that GetLog is actually never executed. The .NET framework allows (by default) only two connections to a given server. GetLog will only be executed when there is a free slot, and the async requests are ahead of it in the queue. (Actually, I haven't verified the exact order in which these would be executed.)
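If you want to see that limit for yourself, you can ask for the ServicePoint of the backend server. The address below is just a placeholder, and the values shown assume nothing has overridden the defaults for the process:

using System;
using System.Net;

class ConnectionLimitCheck
{
    static void Main()
    {
        var backend = new Uri("http://tfs-backend.example.com/"); // hypothetical address
        ServicePoint servicePoint = ServicePointManager.FindServicePoint(backend);

        // For a client process this typically prints 2, unless it has been changed.
        Console.WriteLine("Default limit: " + ServicePointManager.DefaultConnectionLimit);
        Console.WriteLine("Limit for this host: " + servicePoint.ConnectionLimit);
    }
}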

Until the async requests complete, we are stuck. This is important because it is not limited to the current request. It means that one request can block another, and since a single checkout request can cause a few thousand sub-requests to the backend server, a single user can take down the system for a significant amount of time. (What I am actually seeing is a bit different; it looks like the async request never returns from the server under high load, but never mind that.)

There are solutions for all of that; connection groups and overriding the number of allowed connections per server are just a few of them.
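For example, something along these lines (the limit and the group name are illustrative, not recommendations):

using System;
using System.Net;

class ConnectionTuning
{
    static void Configure(HttpWebRequest request)
    {
        // Raise the per-host connection limit for the whole process.
        // The same thing can be done in app.config, under
        // system.net/connectionManagement.
        ServicePointManager.DefaultConnectionLimit = 50; // illustrative number

        // Or give a class of requests its own connection group, so that the
        // synchronous metadata calls do not queue behind the async downloads.
        request.ConnectionGroupName = "metadata"; // hypothetical group name
    }
}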

I remember reading Release It! and being thankful that so much of it seems to be focused on Java (thankfully, I have never had to write my own connection pool, which seems to be a pastime in Java land).

What caused this post was actually my defensive approach failing with spectacular results. I started by defensively specifying Timeout on my requests. It would kill the request, but it would also keep the server alive. Unfortunately, Timeout has no effect on async requests. I found this very surprising; even though it is documented, I consider this a bug. Even worse, the example in the BeginGetResponse documentation exists to fix exactly this issue, to support timeouts in async requests.
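The workaround, roughly along the lines of that documentation sample, is to register your own wait on the async handle and abort the request when the timeout expires. This is a sketch, not production code:

using System;
using System.Net;
using System.Threading;

class AsyncRequestWithTimeout
{
    static void Download(Uri uri, TimeSpan timeout)
    {
        var request = (HttpWebRequest)WebRequest.Create(uri);
        IAsyncResult result = request.BeginGetResponse(OnResponse, request);

        // Timeout is ignored for async requests, so we enforce our own:
        // if the async handle is not signaled in time, abort the request.
        ThreadPool.RegisterWaitForSingleObject(
            result.AsyncWaitHandle,
            TimeoutCallback,
            request,
            timeout,
            true); // execute the callback only once
    }

    static void TimeoutCallback(object state, bool timedOut)
    {
        if (!timedOut)
            return;
        // Aborting makes EndGetResponse throw a WebException,
        // so the caller can recover instead of hanging forever.
        ((HttpWebRequest)state).Abort();
    }

    static void OnResponse(IAsyncResult ar)
    {
        var request = (HttpWebRequest)ar.AsyncState;
        try
        {
            using (var response = request.EndGetResponse(ar))
            {
                // read the response here
            }
        }
        catch (WebException)
        {
            // the request timed out and was aborted, or failed outright
        }
    }
}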

Leaving aside my own annoyance at hitting this tripwire, it brought sharply to mind that we should be very aware of the things that we do, and how they affect the longevity and scalability of our solutions. I spent most of the beginning of this week and the end of last week doing just that, throwing huge amounts of code and requests at SvnBridge. This was the most interesting problem that it had, and it was very interesting to see how we would solve it.