Investigating a query sort issue in RavenDB
A user reported that a particular query returned the results in an unexpected order. The query in question looked something like the following:
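The original query isn't reproduced here, but a query of that shape, issued through the RavenDB .NET client, would look roughly like this. The index and field names below are made up for illustration; the important parts are the ordering by score() first, then by sales descending, and the include explanations() clause:

```csharp
using System;
using System.Linq;
using Raven.Client.Documents;

// Hypothetical index and field names; only the shape of the query matters here.
using var store = new DocumentStore
{
    Urls = new[] { "http://localhost:8080" },
    Database = "Shop"
};
store.Initialize();

using var session = store.OpenSession();

var results = session.Advanced
    .RawQuery<object>(@"
        from index 'Products/Search'
        where search(Name, $term)
        order by score(), Sales as long desc
        include explanations()")
    .AddParameter("term", "chocolate")
    .ToList();

Console.WriteLine($"Got {results.Count} results");
```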
Note that we first sort by score(), and then by the amount of sales. The problem was that documents that should have had the same score ended up sorted in different positions.
Running the query, we got results whose order didn't make sense. All the documents have the same data (so they should have the same score), and when sorting by descending sales, 2.4 million should rank higher than 62 thousand. What is going on?
We looked at the scores; here are the values for the part where we see the difference:
- 1.702953815460205
- 1.7029536962509155
Okay… that is interesting. You might notice that the query above has include explanations(), which will give you the details of why the data was sorted the way it was. The problem? The explanation values are identical on both documents. We managed to reduce the issue to a smaller data set (a few dozen documents, instead of tens of thousands), but the actual issue remained a mystery.
We had to dig into Lucene to figure out how the score is computed. In the land of indirection and virtual method calls, we ended up tracing the score computation for those two documents and figured out how Lucene computes the score.
Roughly speaking, Lucene sums the various sub-scores to get the final value (this is highly simplified). But I checked, and the data is the same for all of the documents. Why do we get different values? Let's look at things in more detail, shall we?
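This is not Lucene's actual code, but in spirit the relevant piece behaves like a compound scorer that sums the scores of its matching sub-scorers. A minimal sketch of that idea:

```csharp
using System.Collections.Generic;

// A highly simplified sketch: a compound scorer asks each of its matching
// sub-scorers for its score and sums them. The real Lucene code is spread
// across several classes and virtual calls; this only captures the
// "sum of sub-scores" shape described above.
interface IScorer
{
    float Score();
}

class SumScorer : IScorer
{
    private readonly List<IScorer> _matchingSubScorers;

    public SumScorer(List<IScorer> matchingSubScorers)
    {
        _matchingSubScorers = matchingSubScorers;
    }

    public float Score()
    {
        float sum = 0f;
        // The final score depends on the order in which the sub-scorers
        // are visited, which is exactly what ends up mattering below.
        foreach (var scorer in _matchingSubScorers)
            sum += scorer.Score();
        return sum;
    }
}
```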
Here is the deal: if we add all of the sub-scores together in a calculator, we get 1.702953756.
This is close to, but not quite, what we get from the score. That is probably because the calculator uses arbitrary-precision numbers, while we use floats. But the problem remains: all of the documents in the query have the exact same numbers, so why do we get different values?
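The individual sub-scores aren't reproduced here, but the float-versus-calculator gap is easy to demonstrate with any values. A minimal sketch (illustrative numbers, not the actual sub-scores), accumulating the same values in float and in double on .NET Core:

```csharp
using System;

// Illustrative values only: summing the same numbers as float and as double
// gives slightly different results, just like the score vs. the calculator above.
float floatSum = 0f;
double doubleSum = 0d;

for (int i = 0; i < 10; i++)
{
    float value = 0.1f;
    floatSum += value;   // accumulate in single precision
    doubleSum += value;  // accumulate in double precision
}

Console.WriteLine(floatSum);   // 1.0000001 (float accumulation has drifted)
Console.WriteLine(doubleSum);  // 1.0000000149011612 (the exact sum of ten 0.1f values)
```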
I then tried to figure out what was going on there. The way Lucene handles the values, each subsection of the scoring is computed on its own and then summed. That still didn't explain what was going on. Then I realized that Lucene uses a heap of mutable entries to store the scorers while it is scoring documents. Whenever a document is scored, the matching scorer is marked as a match and then pushed to the end of the heap. But the order of the items in the heap is not guaranteed.
Usually, this doesn't matter. But take the same set of values and sum them in two different orders, calling the results c and d. What do you think the values of c and d are?
- c = 1.4082127
- d = 1.4082128
Yes, the order of operations matters a lot for floating-point addition. This is expected behavior, given the way floating-point numbers are represented in memory, but it still manages to surprise you.
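To see how much the order matters, here is a minimal, self-contained sketch, with values chosen purely to make the effect obvious (they are not the actual scoring terms): the exact same three numbers, summed in two different orders, the way a reshuffled heap of scorers would sum them. The outputs in the comments are what .NET Core prints.

```csharp
using System;

// Illustrative values: 'tiny' is 2^-24, half the gap between 1.0f and the
// next representable float, so a single 'tiny' added to 1.0f rounds away.
float big = 1.0f;
float tiny = (float)Math.Pow(2, -24);

// The same three numbers, summed in two different orders:
float c = (big + tiny) + tiny;  // each tiny is rounded away, result: 1
float d = big + (tiny + tiny);  // the tinies combine first, result: 1.0000001

Console.WriteLine(c);        // 1
Console.WriteLine(d);        // 1.0000001
Console.WriteLine(c == d);   // False
```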
The fact that the order of operations in the Lucene scorer is not consistent means that you can get some very subtle bugs. Reproducing this bug requires a very careful setup and is incredibly delicate; change pretty much anything and it goes away. And yet it tripped me up hard enough to cost me most of a day trying to figure out exactly what we did wrong.
Really annoying.
Comments
So, how did you fix the bug? Is there some way to make sure Lucene adds the values in a consistent order?
Svick,
The workaround was to change the query a bit, because you need to hit things exactly the wrong way to be affected by this. The problem is that this is deep inside Lucene, and changing it would almost certainly result in compatibility issues in other places that we don't want to touch.
That was unexpected. The problem reproduces only on .NET Core; .NET Framework gives the exact same numbers.
Gleb,
That was interesting. I reported it to Microsoft, and there is good coverage of it here: https://github.com/dotnet/runtime/issues/34783