PostmortemAccidentally quadratic indexing output

time to read 3 min | 496 words

Tracking down a customer’s performance issue, we eventually tracked things down to a single document modification that would grind the entire server to a halt. The actual save was working fine, it was when indexing time came around that we saw the issues. The entire system would spike in terms of memory usage and disk I/O, but CPU utilization wasn’t too bad.

We finally tracked it down to a fairly big document. Let’s assume that we have the following document:

Note that this can be big. As in, multiple megabyte range in some cases, with thousands of reviews. The case we looked at, the document was over 5MB in size and had over 1,500 reviews.

That isn’t ideal, and RavenDB will issue an performance hint when dealing with such documents, but certainly workable.

The problem was with the index, which looked like this:

This index is also setup to store all the fields being indexed. Take a look at the index, and read it a few times. Can you see what the problem is?

This is a fanout index, which I’m not a fan of, but that is certainly something that we can live with. 1,500 results from a single index isn’t even in the top order of magnitude that we have seen. And yet this index will cause RavenDB to consume a lot of resources, even if we have just a single document to index.

What is going on here?

Here is the faulty issue:

image

Give it a moment to sink in, please.

We are indexing the entire document here, once for each of the reviews that you have in the index. When RavenDB encounters a complex value as part of the indexing process, it will index that as a JSON value. There are some scenarios that call for that, but in this case, what this meant is that we would, for each of the reviews in the document:

  • Serialize the entire document to JSON
  • Store that in the index

5MB times 1,500 reviews gives us a single document costing us nearly 8GB in storage space alone. And will allocate close to 100GB (!) of memory during its operation (won’t hold 100GB, just allocate it). Committing such an index to disk requires us to temporarily use about 22GB of RAM and force us to do a single durable write that exceed the 7GB mark. Then there is the background work to clean all of that.

The customer probably meant to index book_id, but got this by mistake, and then we ended up with extreme resource utilization every time that document was modified. Removing this line meant that indexing the document went from ~8GB to 2MB. That is three orders of magnitude difference.

We are going to be adding some additional performance hints to make it clear that something is wrong in such a scenario. We had a few notices, but it was hard to figure out exactly what was going on there.

More posts in "Postmortem" series:

  1. (27 Sep 2021) Partial RavenDB Cloud outage
  2. (23 Jul 2021) Accidentally quadratic indexing output
  3. (13 Jul 2021) The case of the pauses, the fault and the GC