Writing unreliable software

This surprised me, to say the least. I ran into a bug during stress testing of SvnBridge: after a while, it would simply get stuck.

I am not the best at figuring out exactly what got an application stuck, but I generally manage to put in some effort before I call in the big guns. This time, I managed to come up with a working theory and show that the symptoms I was seeing were consistent with it. I decided to dig into it a bit, and came up with interesting results.

The jury is still out on whether that was the real reason for the issue that SvnBridge has, but I went to some trouble to research my theory. Along the way, I came up with some interesting conclusions. Just a reminder: I recently read Release It!, and I consider it a very influential book. One of the things that kept coming up in the book is how a chain reaction can take down an application. A single server stops responding, and the others just crumble. The examples that were brought up were of two kinds: a flawed implementation of a connection pool, and a mismatch in the various capacities of the systems.

In SvnBridge, there is a location where I am doing something similar to this code:

revision = webService.GetLatestVersion()

items = webService.GetItems(revision, downloadPath)
items.ForEach { item | item.BeginDownload(webService) }

comment = webService.GetLog(revision)

items.WaitForAllToFinishDownloading()

SendToClient( Revision: revision, Comment: comment)

for item in items:
	item.WaitUntilLoaded()
	SendToClient( item.Data )

This isn't the exact code, but it is a good, and simple, representation of what is going on. This code can lead, under a particular set of circumstances, to a hung server.

The important thing to understand here is that this code is using BeginDownload to perform an async invocation over HTTP. Another important data point is that we tend to hit the same physical server for a lot of our work.
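To make that concrete, here is a rough sketch of what a BeginDownload / WaitUntilLoaded pair might look like on top of HttpWebRequest. This is not the SvnBridge code; the class and member names are made up for illustration.

using System;
using System.IO;
using System.Net;
using System.Threading;

// Hypothetical item that downloads its data asynchronously from the backend server.
public class DownloadItem
{
    private readonly ManualResetEvent loaded = new ManualResetEvent(false);
    public byte[] Data { get; private set; }

    public void BeginDownload(string downloadUrl)
    {
        var request = (HttpWebRequest)WebRequest.Create(downloadUrl);
        // Hand the request to the framework; the callback fires when the
        // response headers arrive, and the connection stays in use until the
        // body is fully read and the response is closed.
        request.BeginGetResponse(OnResponse, request);
    }

    private void OnResponse(IAsyncResult ar)
    {
        var request = (HttpWebRequest)ar.AsyncState;
        using (var response = request.EndGetResponse(ar))
        using (var stream = response.GetResponseStream())
        using (var buffer = new MemoryStream())
        {
            var chunk = new byte[8192];
            int read;
            while ((read = stream.Read(chunk, 0, chunk.Length)) > 0)
                buffer.Write(chunk, 0, read);
            Data = buffer.ToArray();
        }
        loaded.Set();
    }

    public void WaitUntilLoaded()
    {
        loaded.WaitOne();
    }
}

The key point is that BeginGetResponse hands the request over to the framework, which schedules it on one of the connections it keeps to that server.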

Take a look at what is actually happening...

[Image: the async download requests occupy the available connections to the backend server, while GetLog waits in the queue]

It is very important to observe that GetLog is actually never executed. The .NET framework allows (by default) only two connections to a given server. GetLog will only be executed when there is a free slot, and the async requests are ahead of it in the queue. (Actually, I haven't verified the exact order in which these would be executed.)
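If you want to see that limit for yourself, you can ask for the ServicePoint of the backend server. The address below is just a placeholder, and the values shown assume nothing has overridden the defaults for the process:

using System;
using System.Net;

class ConnectionLimitCheck
{
    static void Main()
    {
        var backend = new Uri("http://tfs-backend.example.com/"); // hypothetical address
        ServicePoint servicePoint = ServicePointManager.FindServicePoint(backend);

        // For a client process this typically prints 2, unless it has been changed.
        Console.WriteLine("Default limit: " + ServicePointManager.DefaultConnectionLimit);
        Console.WriteLine("Limit for this host: " + servicePoint.ConnectionLimit);
    }
}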

Until the async requests complete, we are stuck. This is important because it is not limited to the current request. It means that one request can block another, and since a single checkout request can cause a few thousand sub-requests to the backend server, a single user can take down the system for a significant amount of time. (What I am actually seeing is a bit different; it looks like the async request never returns from the server under high load, but never mind that.)

There are solutions for all of that; connection groups and overriding the number of allowed connections per server are just a few of them.
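For example, something along these lines (the limit and the group name are illustrative, not recommendations):

using System;
using System.Net;

class ConnectionTuning
{
    static void Configure(HttpWebRequest request)
    {
        // Raise the per-host connection limit for the whole process.
        // The same thing can be done in app.config, under
        // system.net/connectionManagement.
        ServicePointManager.DefaultConnectionLimit = 50; // illustrative number

        // Or give a class of requests its own connection group, so that the
        // synchronous metadata calls do not queue behind the async downloads.
        request.ConnectionGroupName = "metadata"; // hypothetical group name
    }
}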

I remember reading Release It! and being thankful that so much of it seems to be focused on Java (thankfully, I have never had to write my own connection pool, which seems to be a pastime in Java land).

What caused this post was actually my defensive approach failing with spectacular results. I started by defensively specifying Timeout on my requests. It would kill the request, but it would also keep the server alive. Unfortunately, Timeout has no effect on async requests. I found this very surprising; even though it is documented, I consider this a bug. Even worse, the example in the BeginGetResponse documentation exists to fix exactly this issue, to support timeouts in async requests.
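The workaround, roughly along the lines of that documentation sample, is to register your own wait on the async handle and abort the request when the timeout expires. This is a sketch, not production code:

using System;
using System.Net;
using System.Threading;

class AsyncRequestWithTimeout
{
    static void Download(Uri uri, TimeSpan timeout)
    {
        var request = (HttpWebRequest)WebRequest.Create(uri);
        IAsyncResult result = request.BeginGetResponse(OnResponse, request);

        // Timeout is ignored for async requests, so we enforce our own:
        // if the async handle is not signaled in time, abort the request.
        ThreadPool.RegisterWaitForSingleObject(
            result.AsyncWaitHandle,
            TimeoutCallback,
            request,
            timeout,
            true); // execute the callback only once
    }

    static void TimeoutCallback(object state, bool timedOut)
    {
        if (!timedOut)
            return;
        // Aborting makes EndGetResponse throw a WebException,
        // so the caller can recover instead of hanging forever.
        ((HttpWebRequest)state).Abort();
    }

    static void OnResponse(IAsyncResult ar)
    {
        var request = (HttpWebRequest)ar.AsyncState;
        try
        {
            using (var response = request.EndGetResponse(ar))
            {
                // read the response here
            }
        }
        catch (WebException)
        {
            // the request timed out and was aborted, or failed outright
        }
    }
}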

Leaving aside my own annoyance at hitting this tripwire, it brought sharply to mind that we should be very aware of the things that we do, and how they affect the longevity and scalability of our solutions. I spent most of the beginning of this week and the end of last week doing just that, throwing huge amounts of code and requests at SvnBridge. This was the most interesting problem that it had, and it was very interesting to see how we would solve it.