Finding a bug with code that isn’t there
A user called us with a strange bug report. He said that the SQL ETL process inside of RavenDB was behaving badly. It would write the data from the RavenDB server to the MySQL database, but then it would immediately delete it.
From the MySQL logs, the user showed:
2021-10-12 13:04:18 UTC:20.52.47.2(65396):root@ravendb:[1304]:LOG: execute <unnamed>: INSERT INTO orders ("id") VALUES ('Tab/57708-A')
2021-10-12 13:04:19 UTC:20.52.47.2(65396):root@ravendb:[1304]:LOG: execute <unnamed>: DELETE FROM orders WHERE "id" IN ('Tab/57708-A')
As you can imagine, that isn’t an ideal scenario for ETL processes. It is also something that should absolutely not happen. RavenDB issues the delete before the insert, obviously. In fact, for your viewing pleasure, here is the relevant piece of code:
It doesn’t get any clearer than that, right? We issue any deletes we have, then we send the inserts. But the user insisted that they are seeing this behavior, and I tend to trust them. But we couldn’t reproduce the issue, and the code in question dates to 2017, so I was pretty certain it is correct.
Then I noticed the user’s configuration. To define a SQL ETL process in RavenDB, you need to define what tables we’ll be writing to as well as deciding what data to send to those tables. Here is what this looked like:
And here is the script:
Do you see the problem? It might be better to use an overload, which make it clearer:
In the table listing, we had an “orders” table, but the script sent the data to the “Orders” table.
RavenDB is a case insensitive database, but in this case, what happened behind the scenes is that the code used a case sensitive dictionary to keep track of the tables that we are working with. That meant that what we did was roughly:
And basically, the changes_by_table dictionary had two tables in it. One from the table definitions and one from the script. When we validate that the tables from the script are fine, we do that using case insensitive comparison, so that passed properly.
To make it worst, the order of items in a dictionary is not predictable. If this was the iteration order other way around, everything would appear to be working just fine.
We fixed the bug, but I found it really interesting that three separate people (very experienced with the codebase) had a look and couldn’t figure out how this bug can happen. It wasn’t in the code, it was in what wasn’t there.
Comments
Don't you just love those bugs?!
// Ryan
This is exactly my experience, too: bugs related to something missing are much harder to debug than bugs related to wrong, but encoded behavior.
I believe it is what Kahnemann coined „what you see is all there is“ related to his studies on human cognition.
Comment preview