The thread in the haystack...

time to read 3 min | 546 words

I was urgently called today to solve a problem in one of my applications. The problem was that for some reason, the application would stop process incoming files, a restart would fix this issue for a short while, after which the problem would re-appear. The application is heavily multi threaded, and has a lot of logging built in to help diagnose issues that may crop up.

The problem is that the problem never even showed up in the logs. It was as if the application stopped watching the directory, but it should never do that. I began to consider framework bugs and was about to start investigating how to investigate this issue when I noticed a recurring pattern in the logs:

Cycle #123:
Checking status of Item #42: Not yet ready
Checking status of Item #43: Not yet ready
Checking status of Item #44: Not yet ready
Checking status of Item #45: Not yet ready
Checking status of Item #46: Not yet ready

Cycle #124:
Checking status of Item #42: Not yet ready
Checking status of Item #43: Not yet ready
Checking status of Item #44: Not yet ready
Checking status of Item #45: Not yet ready
Checking status of Item #46: Not yet ready

Hm... an item is sent to another machine for processing, and the communication between the two parts is done via a web service, on a fairly slow connection.

What happened was interesting, there was a large number of items that has not been processed yet, and at every cycle (~1 minute) all of them were checked indepdently. The problem was that when it got to the point where enough of them were being queries, the next cycle began before the first one could begin.

Since checking the status of all the items was done on a thread pool thread, and since processing a new item (and logging it) also accurrs on a thread pool thread. After the application has been running for a while, asking the second server for status starts to consume most of the thread pool threads, and processing of new items is delayed until their turn in the queue arrives (although that is not guranteed).

The "fix": change the cycle time for checking status to 10 minutes and asking the other side to start processing this on more than a dialy basis.