An interesting RavenDB bug
I got a very strange bug report recently.
The following index:
from movie in docs.Movies
from actor in movie.Actors
select new { Actor = actor }
Will produce multiple results from a single document, which poses a pretty big problem when you try to page through it. Imagine that each movie has 10 actors, and you are trying to page through this index for the first two movies featuring Charlie Chaplin. The first movie that matches Charlie Chaplin will have ten results in the index, so simple paging at the index level will give us the wrong results.
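To make the problem concrete, here is a minimal sketch (the in-memory entry list and LINQ paging are purely illustrative, not how RavenDB actually stores the index; assumes System.Linq):

    // Each movie fans out into one index entry per actor.
    var indexEntries = new[]
    {
        new { MovieId = "movies/1", Actor = "Charlie Chaplin" },
        new { MovieId = "movies/1", Actor = "Edna Purviance" },
        // ... eight more entries for movies/1, one per remaining actor ...
        new { MovieId = "movies/2", Actor = "Charlie Chaplin" },
    };

    // Naive paging over index entries: the "first two results" are the same movie twice.
    var naivePage = indexEntries.Take(2).ToList();

    // What the caller actually wants: the first two distinct movies.
    var distinctPage = indexEntries.Select(e => e.MovieId).Distinct().Take(2).ToList();

Skip/Take over the raw entries keeps landing inside the same document's fan-out, which is exactly what the query side has to compensate for.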
Here is my solution for that, which works, but makes me just a tad uneasy:
public IEnumerable<IndexQueryResult> Query(IndexQuery indexQuery)
{
    IndexSearcher indexSearcher;
    using (searcher.Use(out indexSearcher))
    {
        var previousDocuments = new HashSet<string>();
        var luceneQuery = GetLuceneQuery(indexQuery);
        var start = indexQuery.Start;
        var pageSize = indexQuery.PageSize;
        var skippedDocs = 0;
        var returnedResults = 0;
        do
        {
            if (skippedDocs > 0)
            {
                start = start + pageSize;
                // trying to guesstimate how many results we will need to read from the index
                // to get enough unique documents to match the page size
                pageSize = skippedDocs * indexQuery.PageSize;
                skippedDocs = 0;
            }
            var search = ExecuteQuery(indexSearcher, luceneQuery, start, pageSize, indexQuery.SortedFields);
            indexQuery.TotalSize.Value = search.totalHits;
            for (var i = start; i < search.totalHits && (i - start) < pageSize; i++)
            {
                var document = indexSearcher.Doc(search.scoreDocs[i].doc);
                if (IsDuplicateDocument(document, indexQuery.FieldsToFetch, previousDocuments))
                {
                    skippedDocs++;
                    continue;
                }
                returnedResults++;
                yield return RetrieveDocument(document, indexQuery.FieldsToFetch);
            }
        } while (skippedDocs > 0 && returnedResults < indexQuery.PageSize);
    }
}
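To get a feel for how the loop behaves, take a hypothetical query with Start = 0 and PageSize = 2 against the Chaplin example above: the first pass reads two index entries, returns one unique movie, and skips one duplicate; the next pass then starts at 2 and reads skippedDocs * PageSize = 2 more entries. Every time a pass still contains duplicates, the read window widens again, until two unique documents have been returned or the hits run out.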
Comments
Had the exact same Cartesian product type of problem recently with a Lucene search. My solution was almost exactly what you've got: launch a full-blown brute force attack on the index.
Did you profile memory usage after introducing this? Mine almost doubled.
I don't see how that guesstimate is supposed to be close to the actual value needed. Seems to me like you're querying for way too much data once you've skipped anything.
Thilak,
That is pretty much what has to happen; you are reading a lot more data.
configurator,
The guess is that most documents would have a similar number of results per document.
Since we already know that the current read contained duplicates, we want to read the next range. And we want to limit the number of index calls that we make.
If you can think of a more accurate formula, I would love to see it.
Ayende, the thing is, if I read this code correctly, skippedDocs is the number of extra copies we got.
Suppose the page size is 10 and we got: 1, 2, 3, 3, 4, 4, 5, 6, 7, 8, 9, 10. If there is usually a similar number of copies, another 10 results should be more than enough - but skippedDocs is 2, so we'd get 20 results.
Now what if we got 1, 1, 2, 2, 3, 3, 4, 4, 5, 5? Another 10 results should be exactly enough, but we're getting 50.
Seems to me like the heuristically-accurate function would be:
var wasteMultiplier = (returnedResults + skippedDocs) / (float)returnedResults;
pageSize = (int)(wasteMultiplier * (indexQuery.PageSize - returnedResults));
Of course, this is too heuristically accurate - you want to favour getting a few more results than needed over issuing extra queries. Seems to me like a better heuristic would be:
var wasteMultiplier = (returnedResults + skippedDocs) / (float)returnedResults;
pageSize = (int)(wasteMultiplier * (indexQuery.PageSize - returnedResults / X));
where X is some (float) constant that should be tweaked to get an optimal result. Probably somewhere between 1 and 2.
If using this heuristic, skippedDocs needs to be the running total, so it shouldn't be reset to 0 each time. Add a boolean to track whether a skip happened in the last loop iteration.
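As a rough sketch of how that could slot into the loop (the flag, the guard, and the X value here are my own illustration, assuming skippedDocs is now a running total):

    // Sketch only: assumes skippedDocs accumulates across iterations instead of being
    // reset, and that skippedInLastPass records whether the previous pass skipped anything.
    const float X = 1.5f; // tuning constant, probably somewhere between 1 and 2
    if (skippedInLastPass)
    {
        start = start + pageSize;
        var returned = Math.Max(returnedResults, 1); // guard against a pass that returned nothing
        var wasteMultiplier = (returned + skippedDocs) / (float)returned;
        pageSize = (int)(wasteMultiplier * (indexQuery.PageSize - returnedResults / X));
        skippedInLastPass = false;
    }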
I do not understand why the index would return 10 documents. Why is this a Cartesian product - is it a Lucene thing?
Ajai,
Because I am indexing 10 results per document, you can't do the paging inside Lucene.
Have you considered the formula often used when re-allocating memory (under C++)?
You start with, say, 10 MB, and if you find it's not enough you double it to 20 MB; hit the wall again and you double it to 40 MB, and so on.
The theory is that you do the fewest reallocations (by doubling the amount each time, a task that really does need a lot of memory reaches it very quickly), and the worst case is that you use a bit too much memory for an operation.
In your case, the worst case is that you fetch too many documents (oh well), but you can reduce the number of fetches by simply requesting double the number of documents of the previous call.
I hope that makes sense :)
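For instance, a standalone sketch of the doubling idea over a flat list of index entries (the helper name and plain string list are made up for illustration, not the actual RavenDB code):

    // Illustrative only: page over distinct entries, doubling the read size
    // whenever the previous read did not fill the page.
    static IEnumerable<string> PageDistinct(IReadOnlyList<string> entries, int start, int page)
    {
        var seen = new HashSet<string>();
        var fetchSize = page;
        while (seen.Count < page && start < entries.Count)
        {
            var end = Math.Min(start + fetchSize, entries.Count);
            for (var i = start; i < end && seen.Count < page; i++)
            {
                if (seen.Add(entries[i])) // only yield documents we have not returned yet
                    yield return entries[i];
            }
            start = end;
            fetchSize *= 2; // worst case we read roughly twice as much as needed
        }
    }

Because the read size doubles on every pass, even heavy fan-out costs only a logarithmic number of index calls, at the price of possibly reading somewhat more than necessary on the last pass.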