The randomly failing test

time to read 2 min | 349 words

We made a low level change in how RavenDB writes to the journal. It was verified by multiple code reviews, a whole battery of tests and plenty of production abuse. And yet, once in a blue moon, we would get a test failure. Utterly non-reproducible, and only happening once every week or two (out of hundreds or thousands of test runs). That was worrying, because this test was checking the behavior of RavenDB when it crashed midway through a transaction, which is kind of an important scenario for us.

It took a long while to finally figure out what was going on there. The first thing we ruled out was non-reproducibility caused by threading. The test was single threaded, and nothing could inject anything into the code.

The format of the test was something like this:

  • Write 1000 random fixed-size values to the database.
  • Close the database.
  • Corrupt the last page of the journal.
  • Start the database again and verify that the values from the last transaction are not in the database.
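
To make the shape of the test concrete, here is a rough sketch against a toy journaled store in Python. The real test runs against Voron itself; the page size, record layout and helper names below are all invented for illustration, not RavenDB's actual format:

```python
import hashlib
import os

PAGE = 64  # toy page size, much smaller than the real thing

def write_tx(journal: bytearray, key: bytes, value: bytes):
    """Append one transaction: length prefix + payload + checksum, padded to whole pages."""
    payload = key + b"=" + value
    record = len(payload).to_bytes(4, "little") + payload + hashlib.sha256(payload).digest()
    journal += record + b"\0" * ((-len(record)) % PAGE)

def recover(journal: bytes) -> dict:
    """Replay transactions from the start; stop at the first torn or corrupted record."""
    db, pos = {}, 0
    while pos + 4 <= len(journal):
        n = int.from_bytes(journal[pos:pos + 4], "little")
        payload = journal[pos + 4:pos + 4 + n]
        digest = journal[pos + 4 + n:pos + 4 + n + 32]
        if len(payload) < n or hashlib.sha256(payload).digest() != digest:
            break  # this transaction (and anything after it) is discarded
        key, value = payload.split(b"=", 1)
        db[key] = value
        pos += 4 + n + 32
        pos += (-pos) % PAGE  # records always start on a page boundary
    return db

def test_crash_mid_transaction():
    journal, expected = bytearray(), {}
    for i in range(1000):
        key, value = b"key-%d" % i, os.urandom(100)  # fixed size random values
        write_tx(journal, key, value)
        expected[key] = value

    # Simulate the crash: corrupt the last page of the journal, then recover.
    journal[-PAGE:] = os.urandom(PAGE)
    db = recover(bytes(journal))

    assert b"key-999" not in db                      # last transaction is gone
    assert db[b"key-998"] == expected[b"key-998"]    # everything before it survived

test_crash_mid_transaction()
```

The whole point is the last two asserts: after a crash in the middle of a transaction, that transaction must be gone and everything written before it must still be there.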

So far, awesome. So why would it fail?

The underlying reason was obvious, once we looked at it. The only thing that differed from test to test was the output of the random calls. But we were writing fixed-size buffers, so that shouldn't change anything. The data itself is meaningless.

As it turned out, the data is not quite meaningless. As part of the commit process, we compress the data before we write it to the journal. Different patterns of random bytes have different compression characteristics. In other words, a buffer of 100 random bytes may compress to 90 bytes or to 102 bytes. And that mattered. If a test run happened to get random inputs that took up enough compressed space to roll over into a new journal file, we would still corrupt the last page of that journal. But since we were already on a new journal, that last page hadn't been used yet, so the last transaction was never corrupted and the data was still in the database, effectively failing the test.
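
You can see the effect in isolation with a few lines of Python, using zlib as a stand-in for the compression that actually runs during commit. Every buffer is exactly 100 bytes, but the compressed sizes differ from buffer to buffer and from run to run:

```python
import random
import string
import zlib

# Ten buffers, all exactly 100 bytes of random printable data,
# standing in for the test's random fixed-size values.
sizes = []
for _ in range(10):
    buf = "".join(random.choices(string.ascii_letters + string.digits, k=100)).encode()
    sizes.append(len(zlib.compress(buf)))

print(sizes)                   # e.g. [93, 95, 92, 96, ...] - varies on every run
print(min(sizes), max(sizes))  # same input size, different output sizes
```

Multiply that small variance by a thousand transactions and the journal ends up at a slightly different length on every run; once in a long while that is enough to cross into a new journal file, and the test corrupts a page that no transaction ever touched.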