Investigating a query sort issue in RavenDB
A user reported that a particular query returned the results in an unexpected order. The query in question looked something like the following:
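The original query isn't reproduced here, but a query of that shape, issued through the RavenDB .NET client, would look roughly like this. The index and field names below are made up for illustration; the important parts are the ordering by score() first, then by sales descending, and the include explanations() clause:

```csharp
using System;
using System.Linq;
using Raven.Client.Documents;

// Hypothetical index and field names; only the shape of the query matters here.
using var store = new DocumentStore
{
    Urls = new[] { "http://localhost:8080" },
    Database = "Shop"
};
store.Initialize();

using var session = store.OpenSession();

var results = session.Advanced
    .RawQuery<object>(@"
        from index 'Products/Search'
        where search(Name, $term)
        order by score(), Sales as long desc
        include explanations()")
    .AddParameter("term", "chocolate")
    .ToList();

Console.WriteLine($"Got {results.Count} results");
```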
Note that we first sort by score(), and then by the amount of sales. The problem was that documents that should have had the same score ended up sorted in different positions.
Running the query, we got results whose order didn't make sense. All the documents have the same data (so they should have the same score), and when sorting by descending sales, 2.4 million should rank higher than 62 thousand. What is going on?
We looked at the scores; here are the values for the part where we see the difference:
- 1.702953815460205
- 1.7029536962509155
Okay… that is interesting. You might notice that the query above has include explanations(), which will give you the details of why the data was sorted the way it was. The problem? The explanation values are identical on both documents. We managed to reduce the issue to a smaller data set (a few dozen documents, instead of tens of thousands), but the actual issue remained a mystery.
We had to dig into Lucene to figure out how the score is computed. In the land of indirection and virtual method calls, we ended up tracing the score computation for those two documents and figured out how Lucene computes the score.
Roughly speaking, Lucene sums the various sub-scores to get the final value (this is highly simplified). But I checked, and the data is the same for all of the documents. Why do we get different values? Let's look at things in more detail, shall we?
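This is not Lucene's actual code, but in spirit the relevant piece behaves like a compound scorer that sums the scores of its matching sub-scorers. A minimal sketch of that idea:

```csharp
using System.Collections.Generic;

// A highly simplified sketch: a compound scorer asks each of its matching
// sub-scorers for its score and sums them. The real Lucene code is spread
// across several classes and virtual calls; this only captures the
// "sum of sub-scores" shape described above.
interface IScorer
{
    float Score();
}

class SumScorer : IScorer
{
    private readonly List<IScorer> _matchingSubScorers;

    public SumScorer(List<IScorer> matchingSubScorers)
    {
        _matchingSubScorers = matchingSubScorers;
    }

    public float Score()
    {
        float sum = 0f;
        // The final score depends on the order in which the sub-scorers
        // are visited, which is exactly what ends up mattering below.
        foreach (var scorer in _matchingSubScorers)
            sum += scorer.Score();
        return sum;
    }
}
```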
Here is the deal: if we add all of the sub-scores together in a calculator, we get 1.702953756.
This is close to, but not quite, what we get from the score. That is probably because the calculator uses arbitrary-precision numbers, while we use floats. But the problem remains: all of the documents in the query have the exact same numbers, so why do we get different values?
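The individual sub-scores aren't reproduced here, but the float-versus-calculator gap is easy to demonstrate with any values. A minimal sketch (illustrative numbers, not the actual sub-scores), accumulating the same values in float and in double on .NET Core:

```csharp
using System;

// Illustrative values only: summing the same numbers as float and as double
// gives slightly different results, just like the score vs. the calculator above.
float floatSum = 0f;
double doubleSum = 0d;

for (int i = 0; i < 10; i++)
{
    float value = 0.1f;
    floatSum += value;   // accumulate in single precision
    doubleSum += value;  // accumulate in double precision
}

Console.WriteLine(floatSum);   // 1.0000001 (float accumulation has drifted)
Console.WriteLine(doubleSum);  // 1.0000000149011612 (the exact sum of ten 0.1f values)
```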
I then tried to figure out what was going on there. The way Lucene handles the values, each subsection of the scoring is computed on its own and then summed. That still didn't explain what was going on. Then I realized that Lucene uses a heap of mutable entries to store the scorers while it is scoring documents. Whenever a document is scored, the matching scorer is marked as a match and then pushed to the end of the heap. But the order of the items in the heap is not guaranteed.
Usually, this doesn't matter. But take the same set of values and sum them in two different orders, calling the results c and d. What do you think the values of c and d are?
- c = 1.4082127
- d = 1.4082128
Yes, the order of operations matters a lot for floating-point addition. This is expected behavior, given the way floating-point numbers are represented in memory, but it still manages to surprise you.
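To see how much the order matters, here is a minimal, self-contained sketch, with values chosen purely to make the effect obvious (they are not the actual scoring terms): the exact same three numbers, summed in two different orders, the way a reshuffled heap of scorers would sum them. The outputs in the comments are what .NET Core prints.

```csharp
using System;

// Illustrative values: 'tiny' is 2^-24, half the gap between 1.0f and the
// next representable float, so a single 'tiny' added to 1.0f rounds away.
float big = 1.0f;
float tiny = (float)Math.Pow(2, -24);

// The same three numbers, summed in two different orders:
float c = (big + tiny) + tiny;  // each tiny is rounded away, result: 1
float d = big + (tiny + tiny);  // the tinies combine first, result: 1.0000001

Console.WriteLine(c);        // 1
Console.WriteLine(d);        // 1.0000001
Console.WriteLine(c == d);   // False
```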
The fact that the order of operations in the Lucene scorer is not consistent means that you can get some very subtle bugs. Reproducing this bug requires a very careful setup and is incredibly delicate; change pretty much anything and it goes away. And yet it tripped me up hard enough to cost me most of a day trying to figure out exactly what we did wrong.
Really annoying.
Comments
So, how did you fix the bug? Is there some way to make sure Lucene adds the values in a consistent order?
Svick,
The workaround was to change the query a bit, because you need to hit things exactly the wrong way to be affected by this. The problem is that this is deep inside Lucene, and changing it would almost certainly result in compatibility issues in other places that we don't want to touch.
That was unexpected. The problem reproduces only on .NET Core; .NET Framework gives the exact same numbers.
Gleb,
That was interesting. I reported it to Microsoft, and there is good coverage of it here: https://github.com/dotnet/runtime/issues/34783