The Art of Debugging
I often end up being pulled into various debugging sessions. I find that having a structured approach to debugging can help you figure out what is going on and fix even the nastiest of bugs. This is a story about a few different bugs that I ran into.
We have some new features coming in RavenDB 5.4 which touch the manner in which we manage some data. When we threw enough data at RavenDB, it died. Quite rudely, I have to say. I described the actual bug here, if you care. There wasn’t really much to finding this one.
The process crashes, and we don't know why. A crashing process will usually give us good signs about what exactly is happening. In this case, attaching a debugger, running the code, and seeing where it failed was sufficient. I'll admit that I mostly pressed F10 and F11 until I found the offending function, and then I didn't believe it for some time.
Still in the same area, however, we began to run into a problem of impossible results. Invariants were broken, quite harshly. The problem was that we could only reproduce this after ~30 minutes or so of running a big benchmark, which does not make it easy to figure out what is going on. What is worse, there was a distinct gap between the time the bug actually happened and the time it showed up.
All of that made it very hard to figure out what was going on. This is where I think the structured approach shines.
Architecture – I hate concurrency debugging. I have roughly 20+ years of experience in writing parallel programs, and even that is not sufficient to make debugging concurrency easy, I'm afraid. As a result, the design of RavenDB goes to great lengths to avoid having to do concurrency coordination. There are very well defined locations where we apply concurrency (handling requests) and places where we operate in a serial manner (modifying data). The key is that we use multiple serial processes at the same time. Each index is bound to its own thread, for example, but the indexes are independent of one another. The error in question was in an index, and I was able to narrow it down to a particular area of the code. Somewhere along the way, we messed things up.
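To make the "multiple serial processes" idea concrete, here is an illustrative sketch (not RavenDB's actual code, all names here are invented): each "index" owns a dedicated thread that drains its own queue, so the index state is only ever mutated serially by that one thread. Multiple indexes run in parallel with each other, and with the request-handling threads that enqueue work, but no coordination is needed inside an index.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Usage: two independent indexes, each doing serial work on its own thread.
var orders = new SerialIndex("Orders/ByCompany");
var users = new SerialIndex("Users/ByName");
for (int i = 0; i < 5; i++)
{
    orders.Enqueue($"orders/{i}");
    users.Enqueue($"users/{i}");
}
Console.WriteLine(orders.Complete()); // → 5
Console.WriteLine(users.Complete());  // → 5

class SerialIndex
{
    private readonly BlockingCollection<string> _work = new();
    private readonly Thread _thread;
    private int _itemsIndexed; // touched only by _thread, so no locks

    public SerialIndex(string name)
    {
        _thread = new Thread(() =>
        {
            // Serial loop: the only place the index state is mutated.
            foreach (var doc in _work.GetConsumingEnumerable())
                _itemsIndexed++;
        }) { Name = name, IsBackground = true };
        _thread.Start();
    }

    // Called concurrently by request handlers; the queue does the handoff.
    public void Enqueue(string doc) => _work.Add(doc);

    // Finish the stream and wait for the index thread to drain it.
    public int Complete()
    {
        _work.CompleteAdding();
        _thread.Join();
        return _itemsIndexed;
    }
}
```

The design choice being illustrated: concurrency lives only at the queue boundary, so everything past it can be debugged as straight-line, single-threaded code.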
Reproducing – I had a way to reproduce the issue, it was just utterly unhelpful for debugging purposes. But since the process is a serial one, the reproduction was also fairly consistent. So I added a bunch of printf() calls that logged the interesting operations we performed:
And here is what this looked like:
Each line is a particular command that I’m executing. I wrote a trivial parser that would read from the file and perform the relevant operations.
For ease of explanation, imagine that I’m writing a hash table, and this set of operations prime it to expose the issue.
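Using that hash-table framing, the record/replay idea can be sketched like this (the PUT/DEL command names and helper functions are invented for illustration, not the actual logging code): every mutating operation is appended to a log file, and a trivial parser replays the log to rebuild the exact same state.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Record: append each mutating operation, one command per line.
void Record(TextWriter log, string op, params string[] args) =>
    log.WriteLine($"{op} {string.Join(' ', args)}");

// Replay: the "trivial parser" — read each line and apply the command.
Dictionary<string, string> Replay(string path)
{
    var table = new Dictionary<string, string>();
    foreach (var line in File.ReadLines(path))
    {
        var parts = line.Split(' ');
        switch (parts[0])
        {
            case "PUT": table[parts[1]] = parts[2]; break;
            case "DEL": table.Remove(parts[1]); break;
        }
    }
    return table;
}

// Usage: record a few operations, then replay them deterministically.
var logPath = Path.GetTempFileName();
using (var w = new StreamWriter(logPath))
{
    Record(w, "PUT", "users/1", "Oren");
    Record(w, "DEL", "users/1");
    Record(w, "PUT", "users/2", "Ayende");
}
var state = Replay(logPath);
Console.WriteLine(state.Count); // → 1
```

Because the replay is serial and deterministic, running it twice produces the same state both times, which is what makes the rest of the process workable.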
The file contained hundreds of millions of operations. Somewhere toward the end, we ran into the corrupted state. That is far too much to step through by hand, but I don't need to do that. I can write code to verify the state of the data structure.
There is one problem here: the cost. Verifying the data structure is an operation that is at least O(N log N + N) (probably O(N^2), to be honest; I didn't bother to run the proper analysis). I can't afford to run it after each command. The good news is that I don't need to.
Here is what I did:
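The original snippet isn't shown here, but the idea can be reconstructed as a sketch (the names are mine, not the actual code): replay the command log, running the expensive full verification only once every 100,000 operations. On a later pass, the starting point is raised past the known-good prefix and the check interval is reduced, shrinking the window around the bad command.

```csharp
using System;
using System.Collections.Generic;

void ReplayWithChecks(IEnumerable<string> commands,
                      Action<string> apply, Action verify,
                      long startCheckingAt = 0, long checkEvery = 100_000)
{
    long opCount = 0;
    foreach (var cmd in commands)
    {
        apply(cmd);
        opCount++;
        if (opCount >= startCheckingAt && opCount % checkEvery == 0)
            verify(); // the expensive O(N log N)+ check, run rarely
    }
}

// Usage: 400,000 commands with corruption injected at op 250,000.
var ops = new List<string>();
for (int i = 1; i <= 400_000; i++) ops.Add(i == 250_000 ? "corrupt" : "noop");

long lastGoodCheckpoint = 0, current = 0;
bool corrupted = false;
try
{
    ReplayWithChecks(ops,
        cmd => { current++; if (cmd == "corrupt") corrupted = true; },
        () =>
        {
            if (corrupted) throw new Exception("invariant broken");
            lastGoodCheckpoint = current;
        });
}
catch (Exception)
{
    // First failing checkpoint is 300,000, so the bad command is
    // somewhere in (200,000, 300,000].
    Console.WriteLine($"bug between op {lastGoodCheckpoint + 1} and op {current}");
}
```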
That did a full verification every 100,000 operations, which gave me confidence about roughly where the error was. Then I just repeated the process, raising the starting point for the checks and reducing the gap between them.
This is a pretty boring part, to be honest. You run the code, do some emails, check it again, update a couple of numbers, and move on. This bisection approach yields great results, however. It meant that I could very quickly narrow things down to the specific action that caused the error.
That was in step 48,292,932, by the way. The actual error was discovered far later.
Fixing – that part I'm going to skip; it is usually fairly obvious once you know what is going on. In the case above, there was a scenario where we were off by 4 bytes, and that caused… havoc.
The key here is that for most of the debugging process, I don’t really need to do any sort of thinking. I figure out what information I need to gather, then I write the code that would do that for me.
And then I basically keep running things until I have narrowed the field down to the point where the problem is clear and I can fix it.
Comments
It very much looks like you need to use something like QuickCheck; this John Hughes talk is eye opening. I've written CsCheck to include much of the same functionality. Another point is that unit testing frameworks should run each test in parallel by default (even against itself). Expecto in F# does this. You can run the test suite for 30 mins, say, in parallel, which can highlight concurrency and memory issues for free. I've not managed to make an equivalent testing library in C# yet but have thought of making a runner on top of Xunit. Xunit/NUnit are overly complex and limited in running tests in parallel.
Ant, Running tests in parallel is something that xunit certainly supports. It is something that is easy enough to do (and yes, it found bugs). But a key issue here is that this is a really bad place to be in. If you have to run tests in parallel for 30 minutes to smoke out bugs, your architecture is very brittle. Doing that (very) rarely for common infrastructure (such as accidentally sharing buffers) is one thing, because that is the nature of the beast. But to the greatest extent possible, you want a _single threaded mode of operation_.

Sure, we should build the simplest thing we can get away with. But sometimes you need to write a high performance immutable data structure, say, or make sure an algorithm is always correct, or just be happy your code doesn't have any rare bugs. Some simply stated problems are inherently difficult to get right. QuickCheck et al. are good for this. An example: create a function to allocate an integer total over some weights that adds up and is 'fair': int[] Allocate(int total, double[] weights). You can't do this successfully without something like QuickCheck (or writing a poor version of it).