Finding a bug with code that isn’t there

time to read 3 min | 451 words

A user called us with a strange bug report. He said that the SQL ETL process inside of RavenDB was behaving badly. It would write the data from the RavenDB server to the MySQL database, but then it would immediately delete it.

From the MySQL logs, the user showed:

2021-10-12 13:04:18 UTC:20.52.47.2(65396):root@ravendb:[1304]:LOG: execute <unnamed>: INSERT INTO orders ("id") VALUES ('Tab/57708-A')
2021-10-12 13:04:19 UTC:20.52.47.2(65396):root@ravendb:[1304]:LOG: execute <unnamed>: DELETE FROM orders WHERE "id" IN ('Tab/57708-A')

As you can imagine, that isn’t an ideal scenario for ETL processes. It is also something that should absolutely not happen. RavenDB issues the delete before the insert, obviously. In fact, for your viewing pleasure, here is the relevant piece of code:

image

It doesn’t get any clearer than that, right? We issue any deletes we have, then we send the inserts. But the user insisted that they are seeing this behavior, and I tend to trust them. But we couldn’t reproduce the issue, and the code in question dates to 2017, so I was pretty certain it is correct.

Then I noticed the user’s configuration. To define a SQL ETL process in RavenDB, you need to define what tables we’ll be writing to as well as deciding what data to send to those tables. Here is what this looked like:

image

And here is the script:

image

Do you see the problem? It might be better to use an overload, which make it clearer:

image

In the table listing, we had an “orders” table, but the script sent the data to the “Orders” table.

RavenDB is a case insensitive database, but in this case, what happened behind the scenes is that the code used a case sensitive dictionary to keep track of the tables that we are working with. That meant that what we did was roughly:

And basically, the changes_by_table dictionary had two tables in it. One from the table definitions and one from the script. When we validate that the tables from the script are fine, we do that using case insensitive comparison, so that passed properly.

To make it worst, the order of items in a dictionary is not predictable. If this was the iteration order other way around, everything would appear to be working just fine.

We fixed the bug, but I found it really interesting that three separate people (very experienced with the codebase) had a look and couldn’t figure out how this bug can happen. It wasn’t in the code, it was in what wasn’t there.