The 7-year-old disk test machine

We test RavenDB on a wide variety of software and hardware, and a few weeks ago one of our guys came to me with a grave concern: we had a major performance regression on Linux. Major as in 75% slower than it had been a few weeks earlier.

Testing at that point showed that there was indeed a big performance gap between the same benchmark on that Linux machine and on a comparable machine running Windows. That was worrying, and it took us a while to figure out what was going on. The problem was that we had previously seen that exact same scenario. The I/O patterns that are most suitable for Linux are pretty bad for Windows, and vice versa, so optimizing for each requires a delicate hand. The expectation was that we had done something that overloaded the system somehow and caused a major regression.

A major discovery was that it wasn’t Linux per se that was slow. Testing the same thing on a significantly smaller machine showed much better performance. We still had to rule out a bunch of other things, such as a specific setting or behavior that we might trigger on that particular machine, but it seemed promising. And that was the point when we looked at the hardware. That particular Linux machine is an old development machine that has gone through several developer upgrade cycles, and when it was rebuilt, we used the most easily available disk that we had on hand.

That turned out to be a Crucial SSD 128GB M22 disk. To those of you who don’t keep a catalog of all hard disks and their model numbers, there is Google, which will tell you that this model has been out for nearly a decade, and that particular disk has been shuffling bits in our offices for about 7 years or so. In its life, it has been subjected to literally thousands of database benchmarks, reading and writing very large amounts of data.

I’m frankly shocked that it is still working, and it is likely that there is a lot of internal error correction going on. But the end result is that it predictably generates very unpredictable I/O patterns, and it is a great machine for testing what happens when things start to fail in a very ungraceful manner (a write to the local disk that takes 5 seconds but also blocks all other I/O operations in the system, for example).
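For readers who want to see whether one of their own disks behaves this way, here is a minimal Python sketch that times individual synchronous writes and flags the slow ones. It is an illustration only, not the benchmark tooling used on this machine; the function names `measure_write_latencies` and `find_stalls`, the block size, and the one-second threshold are all assumptions made for the example.

```python
import os
import tempfile
import time


def measure_write_latencies(directory, writes=20, block_size=4096):
    """Time `writes` synchronous block writes; return latencies in seconds."""
    payload = os.urandom(block_size)
    latencies = []
    # Hypothetical scratch file created inside the directory under test.
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(writes):
            start = time.perf_counter()
            os.write(fd, payload)
            # Ask the OS to flush to the device rather than stop at the page cache.
            os.fsync(fd)
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.remove(path)
    return latencies


def find_stalls(latencies, threshold_seconds=1.0):
    """Return (index, latency) pairs for writes at or above the threshold."""
    return [(i, lat) for i, lat in enumerate(latencies) if lat >= threshold_seconds]


if __name__ == "__main__":
    results = measure_write_latencies(tempfile.gettempdir(), writes=50)
    print("slowest write: %.4f s" % max(results))
    print("stalls:", find_stalls(results))
```

On a healthy disk the list of stalls would normally be empty. A single-threaded probe like this only shows how long each write took; it does not show whether other I/O in the system was blocked at the same time.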

I’m aware of things like nbd & trickle, but it was a lot more fun to discover that we can just run stuff on that particular machine and find out what happens when a lot of our assumptions are broken.