An interesting RavenDB bug
I got a very strange bug report recently.
The following index:
from movie in docs.Movies
from actor in movie.Actors
select new { Actor = actor }
Will produce multiple results from a single document, which poses a pretty big problem when you try to page through it. Imagine that each movie has 10 actors, and you are trying to page through this index for the first two movies featuring Charlie Chaplin. The first movie that matches Charlie Chaplin will have ten results in the index, so simple paging at the index level will give us the wrong results.
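To make the problem concrete, here is a minimal sketch (the in-memory entry list and LINQ paging are purely illustrative, not how RavenDB actually stores the index; assumes System.Linq):

    // Each movie fans out into one index entry per actor.
    var indexEntries = new[]
    {
        new { MovieId = "movies/1", Actor = "Charlie Chaplin" },
        new { MovieId = "movies/1", Actor = "Edna Purviance" },
        // ... eight more entries for movies/1, one per remaining actor ...
        new { MovieId = "movies/2", Actor = "Charlie Chaplin" },
    };

    // Naive paging over index entries: the "first two results" are the same movie twice.
    var naivePage = indexEntries.Take(2).ToList();

    // What the caller actually wants: the first two distinct movies.
    var distinctPage = indexEntries.Select(e => e.MovieId).Distinct().Take(2).ToList();

Skip/Take over the raw entries keeps landing inside the same document's fan-out, which is exactly what the query side has to compensate for.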
Here is my solution for that, which works, but makes me just a tad uneasy:
public IEnumerable<IndexQueryResult> Query(IndexQuery indexQuery)
{
    IndexSearcher indexSearcher;
    using (searcher.Use(out indexSearcher))
    {
        var previousDocuments = new HashSet<string>();
        var luceneQuery = GetLuceneQuery(indexQuery);
        var start = indexQuery.Start;
        var pageSize = indexQuery.PageSize;
        var skippedDocs = 0;
        var returnedResults = 0;
        do
        {
            if (skippedDocs > 0)
            {
                start = start + pageSize;
                // trying to guesstimate how many results we will need to read from the index
                // to get enough unique documents to match the page size
                pageSize = skippedDocs * indexQuery.PageSize;
                skippedDocs = 0;
            }
            var search = ExecuteQuery(indexSearcher, luceneQuery, start, pageSize, indexQuery.SortedFields);
            indexQuery.TotalSize.Value = search.totalHits;
            for (var i = start; i < search.totalHits && (i - start) < pageSize; i++)
            {
                var document = indexSearcher.Doc(search.scoreDocs[i].doc);
                if (IsDuplicateDocument(document, indexQuery.FieldsToFetch, previousDocuments))
                {
                    skippedDocs++;
                    continue;
                }
                returnedResults++;
                yield return RetrieveDocument(document, indexQuery.FieldsToFetch);
            }
        } while (skippedDocs > 0 && returnedResults < indexQuery.PageSize);
    }
}
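To get a feel for how the loop behaves, take a hypothetical query with Start = 0 and PageSize = 2 against the Chaplin example above: the first pass reads two index entries, returns one unique movie, and skips one duplicate; the next pass then starts at 2 and reads skippedDocs * PageSize = 2 more entries. Every time a pass still contains duplicates, the read window widens again, until two unique documents have been returned or the hits run out.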
Comments
Had the exact same Cartesian product type of problem recently with a Lucene search. My solution was almost exactly what you've got: launch a full-blown brute force attack on the index.
Did you profile memory usage after introducing this? Mine almost doubled.
I don't see how that guesstimate is supposed to be close to the actual value needed. Seems to me like you're querying for way too much data once you've skipped anything.
Thilak,
That is pretty much what has to happen; you are reading a lot more data.
configurator,
The guess is that most documents would have a similar number of results per document.
Since we already know that the current read contained duplicates, we want to read the next range. And we want to limit the number of index calls that we make.
If you can think of a more accurate formula, I would love to see it.
Ayende, the thing is, if I read this code correctly, skippedDocs is the number of extra copies we got.
Suppose the page size is 10 and we got: 1, 2, 3, 3, 4, 4, 5, 6, 7, 8, 9, 10. If there is usually a similar number of copies, another 10 results should be more than enough - but skippedDocs is 2, so we'd get 20 results.
Now what if we got 1, 1, 2, 2, 3, 3, 4, 4, 5, 5? Another 10 results should be exactly enough, but we're getting 50.
Seems to me like the heuristically-accurate function would be:
var wasteMultiplier = (returnedResults + skippedDocs) / (float)returnedResults;
pageSize = (int)(wasteMultiplier * (indexQuery.PageSize - returnedResults));
Of course, this is too heuristically accurate - you want to favour getting a few more results than needed over issuing extra queries. Seems to me like a better heuristic would be:
var wasteMultiplier = (returnedResults + skippedDocs) / (float)returnedResults;
pageSize = (int)(wasteMultiplier * (indexQuery.PageSize - returnedResults / X));
where X is some (float) constant that should be tweaked to get an optimal result. Probably somewhere between 1 and 2.
If using this heuristic, skippedDocs needs to be the running total, so it shouldn't be reset to 0 each time. Add a boolean to track whether a skip happened in the last loop iteration.
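As a rough sketch of how that could slot into the loop (the flag, the guard, and the X value here are my own illustration, assuming skippedDocs is now a running total):

    // Sketch only: assumes skippedDocs accumulates across iterations instead of being
    // reset, and that skippedInLastPass records whether the previous pass skipped anything.
    const float X = 1.5f; // tuning constant, probably somewhere between 1 and 2
    if (skippedInLastPass)
    {
        start = start + pageSize;
        var returned = Math.Max(returnedResults, 1); // guard against a pass that returned nothing
        var wasteMultiplier = (returned + skippedDocs) / (float)returned;
        pageSize = (int)(wasteMultiplier * (indexQuery.PageSize - returnedResults / X));
        skippedInLastPass = false;
    }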
I do not understand why the index would return 10 documents. Why is this a Cartesian product - is it a Lucene thing?
Ajai,
Because I am indexing 10 results per document, you can't do the paging inside Lucene.
Have you considered the formula often used when re-allocating memory (under C++)?
You start with, say, 10 MB, and if you find it's not enough you double it to 20 MB; hit the wall again and you double it to 40 MB, and so on.
The theory is that you do the fewest reallocations (by doubling the amount each time, a task that really does need a lot of memory reaches it very quickly), and the worst case is that you use a bit too much memory for an operation.
In your case, the worst case is that you fetch too many documents (oh well), but you can reduce the number of fetches by simply requesting double the number of documents of the previous call.
I hope that makes sense :)
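For instance, a standalone sketch of the doubling idea over a flat list of index entries (the helper name and plain string list are made up for illustration, not the actual RavenDB code):

    // Illustrative only: page over distinct entries, doubling the read size
    // whenever the previous read did not fill the page.
    static IEnumerable<string> PageDistinct(IReadOnlyList<string> entries, int start, int page)
    {
        var seen = new HashSet<string>();
        var fetchSize = page;
        while (seen.Count < page && start < entries.Count)
        {
            var end = Math.Min(start + fetchSize, entries.Count);
            for (var i = start; i < end && seen.Count < page; i++)
            {
                if (seen.Add(entries[i])) // only yield documents we have not returned yet
                    yield return entries[i];
            }
            start = end;
            fetchSize *= 2; // worst case we read roughly twice as much as needed
        }
    }

Because the read size doubles on every pass, even heavy fan-out costs only a logarithmic number of index calls, at the price of possibly reading somewhat more than necessary on the last pass.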