Challenge: What killed the application?
I have been doing a lot of heavy performance testing on Raven, and I ran into a lot of very strange scenarios. I found a lot of interesting stuff (a runaway cache causing OutOfMemoryException, unnecessary re-parsing, etc.). But one thing that I wasn’t able to resolve was the concurrency issue.
In particular, Raven would slow down and crash under load. I scoured the code, but I couldn’t work out what was going on. After several minutes of executing, request times would grow longer and longer, until finally the server would start raising errors on most requests.
I am ashamed to say that it took me a while to figure out what was actually going on. Can you figure it out?
Here is the client code:
Parallel.ForEach(Directory.GetFiles("Docs", "*.json"), file =>
{
    PostTo("http://localhost:9090/bulk_docs", file);
});
The Docs directory contains about 90,000 files, and there is no concurrent connection limit. Average processing time for each request when running in single-threaded mode was 100 – 200 ms.
That should be enough information to figure out what is going on.
Why did the application crash?
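If you want to poke at this without a Raven server, here is a minimal sketch for watching how many loop bodies Parallel.ForEach runs at the same time. It is not the test harness used above and it does not answer the question. ConcurrencyProbe and MeasurePeak are hypothetical names, and Thread.Sleep is only an assumed stand-in for the blocking PostTo call, which isn't shown here.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ConcurrencyProbe
{
    // Hypothetical helper: runs `body` over `items` with Parallel.ForEach and
    // returns the highest number of bodies observed executing simultaneously.
    // Pass maxDegree to cap the parallelism; leave it null for the default.
    public static int MeasurePeak<T>(T[] items, Action<T> body, int? maxDegree = null)
    {
        int inFlight = 0;
        int peak = 0;

        var options = new ParallelOptions();
        if (maxDegree.HasValue)
            options.MaxDegreeOfParallelism = maxDegree.Value;

        Parallel.ForEach(items, options, item =>
        {
            int now = Interlocked.Increment(ref inFlight);

            // Raise `peak` to `now` if it is a new maximum (compare-and-swap loop).
            int seen = Volatile.Read(ref peak);
            while (now > seen)
            {
                int prior = Interlocked.CompareExchange(ref peak, now, seen);
                if (prior == seen) break;
                seen = prior;
            }

            try { body(item); }
            finally { Interlocked.Decrement(ref inFlight); }
        });

        return peak;
    }

    public static void Main()
    {
        // Assumption: 200 fake "files", each taking ~100 ms of blocking work,
        // standing in for the real files and the real PostTo call.
        int[] items = Enumerable.Range(0, 200).ToArray();

        int defaultPeak = MeasurePeak(items, _ => Thread.Sleep(100));
        int cappedPeak = MeasurePeak(items, _ => Thread.Sleep(100), maxDegree: 4);

        Console.WriteLine("Peak concurrency (default options): " + defaultPeak);
        Console.WriteLine("Peak concurrency (capped at 4):     " + cappedPeak);
    }
}
```

The first number depends on the machine and on the thread pool's behavior at the time. The second can never exceed the cap. Comparing the two on your own box is a cheap way to test some of the guesses in the comments below.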
Comments
Is the directory in Directory.GetFiles("Docs","*.json") the same directory that http://localhost:9090/bulk_docs writes to, so you have an ever-increasing file count?
Wild guess is that it ran out of IP source port numbers?
Directory.GetFiles("Docs","*.json") should be Directory.EnumerateFiles("Docs","*.json") if you want to be Parallel.
Henning,
No, there is no association between the two.
Peter,
No, we haven't got that. But I have run into this before.
It usually only pops up when using HTTPS or authenticated connections, though.
LS,
Actually, no, we parallelize the action, not the enumeration, but thanks for letting me know about the new API.
Your test client is sending more requests than the server can handle. Maybe you're using some sort of queue on the server, which overflows.
Wild guessing from my side.
Is it because the directory contains too many files?
Depending how your test is set up, could it be that Parallel ForEach and Raven DB are getting worker threads from the same thread pool?
Hit OOM because the server was buffering all the posted files? It's gotta get the whole request (including file contents) into memory before passing it along, AFAIK.
What about the underlying database? Maybe it had some concurrency problems: deadlocks, transaction timeouts, or running out of pooled connections?
A wild guess: does the Directory.GetFiles() method return a non-generic collection instead of a generic one? If so, you should cast it.
It effectively DoS'd the server by uploading too many files at the one time (there were more parallel threads going on the client than the server could accept, so they started to timeout).
90,000 files @ 100-200ms each, no limit on the degrees of parallelization - lemme guess you had around 8,000 threads active, with 1MB stack allocated each, and hit OOM?
Was it getting the same set of files ..
Richard: it is single-threaded
You hit the max sockets/file descriptors per process/system.
msdn.microsoft.com/.../ms739169%28VS.85%29.aspx
The default is 64.
Unless you modify the registry to increase the limit, WININET makes at most two distinct connections to the same remote host, so you are only going to benefit from two threads. The other threads are going to block waiting for one of the two connections, and if you have more than two processors in your system you are going to spin up more and more threads out of the .NET thread pool, all of them blocking and each taking up 1-2 MB of virtual address space.
The thread-pool was spawning more and more threads (max by default is 250) because from its perspective the work was IO bound (waiting on the posts). It tries to saturate the CPU by spawning more threads.
Is PostTo doing an async post? I can't imagine how Parallel.ForEach would be bogging down the server since it limits the number of parallel tasks to the number of cores that you have. So if you are doing synchronous POST requests, it is only going to be posting 2-4 requests at a time, which is obviously not a lot.
Is it something to do with the fact that you're posting to the same uri over and over again?
I can imagine a scenario where at some point you decide to persist the documents, by recursively walking the documents to be written and because there are so many you end up blowing the stack somehow.
Does it have anything to do with TIME-WAIT? msdn.microsoft.com/en-us/library/ms819739.aspx
HttpWebRequest.KeepAlive was set to its default "true" value?
I am impressed by how many creative solutions have been posted. By coincidence, I faced the same issue 5 minutes ago. It was the thread pool. Breaking in the debugger and executing ThreadPool.SetMaxThreads in the immediate window helped, so I did not have to restart my long-running batch job.
The .NET thread pool does not create a thread unless there is a processor/core on your system that is doing nothing. If there are no processors available, the thread pool puts your request in a queue. You shouldn't use ThreadPool.SetMaxThreads. The problem is that a thread is created and it blocks immediately when WININET already has two connections to a given host. When it blocks, the processor it was running on is freed, and the thread pool takes a request out of its queue and schedules a thread. You end up with all these threads blocked, each taking up 1-2 MB of virtual address space, and they are all waiting for the same WININET resource to become available.
This article discusses the WININET limit http://support.microsoft.com/kb/183110
Maybe it could be a problem with the maximum number of HTTP connections per server:
stackoverflow.com/.../improving-performance-of-...
I don't know if Parallel.ForEach uses some sort of I/O completion port, but if so, I would think that the time spent blocked waiting for the socket reply (the HTTP request) would be used to open other file handles, and it would eventually run out of available file handles. If I remember correctly, file handles are capped at a not-so-large count in order to prevent buggy or malicious software from harming the system.
Are you enumerating the entire contents of bulk_docs for every request to check your filename is unique?
Because you dumped 90,000 tasks into the Parallel Framework task scheduler?
Actually, it handled that really nicely.