Oren Eini, CEO of RavenDB (a NoSQL Open Source Document Database)

time to read 9 min | 1642 words

The scenario in question was performance degradation over time. The metric in question was the average request latency, and we could track a small but consistent rise in this number over the course of days and weeks. The load on the server remained pretty much constant, but the latency of the requests grew.

The problem was that this took time (many days or multiple weeks) for us to observe. But we had the charts to prove that this was pretty consistent. If the RavenDB service was restarted (we did not have to restart the machine), the situation would instantly fix itself and then slowly degrade again over time.

That the customer didn’t notice this is an interesting story in its own right. RavenDB will automatically prioritize the fastest node in the cluster to be the “customer-facing” one, which alleviated the issue to such an extent that the metrics the customer usually monitors remained fine. The RavenDB Cloud team looks at the entire system, so we started the investigation long before the problem warranted the users’ attention.

I hate these sorts of issues because they are really hard to figure out and subject to just about every caveat under the sun. In this case, we had essentially nothing to go on. The workload was pretty consistent, and I/O, memory, and CPU usage were all acceptable. There was no obvious starting point for the investigation.

Those are also big machines, with hundreds of GB of RAM and running heavy workloads. These machines have great disks and a lot of CPU power to spare. What is going on here?

After a long while, we got a good handle on what was actually going on. When RavenDB starts, it creates memory maps of the files it is working with. Over time, RavenDB will map, unmap, and remap those files as needed. A process that has been running for a long while, with many databases and indexes operating, will have accumulated a great deal of memory-mapping activity.

In Linux, you can inspect those details by running:


$ cat /proc/22003/smaps


600a33834000-600a3383b000 r--p 00000000 08:30 214585                     /data/ravendb/Raven.Server
Size:                 28 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                  28 kB
Pss:                  26 kB
Shared_Clean:          4 kB
Shared_Dirty:          0 kB
Private_Clean:        24 kB
Private_Dirty:         0 kB
Referenced:           28 kB
Anonymous:             0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:    0
VmFlags: rd mr mw me dw
600a3383b000-600a33847000 r-xp 00006000 08:30 214585                     /data/ravendb/Raven.Server
Size:                 48 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                  48 kB
Pss:                  46 kB
Shared_Clean:          4 kB
Shared_Dirty:          0 kB
Private_Clean:        44 kB
Private_Dirty:         0 kB
Referenced:           48 kB
Anonymous:             0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:    0
VmFlags: rd ex mr mw me dw

Here you can see the first page of entries from this file. Just starting up RavenDB (with no databases created) will generate close to 2,000 entries. The smaps virtual file can be invaluable for figuring out certain types of problems. In the snippet above, you can see that we have some executable memory ranges mapped, for example.

The problem is that over time the mappings become fragmented, and we may end up with an smaps file that contains tens of thousands (or even hundreds of thousands) of entries.
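To get a sense of that growth, you can count the mappings directly. Here is a minimal sketch (the PID is the example one from above; /proc/<pid>/maps lists one mapping per line, which makes it cheap to count):

import sys

# Count the memory mappings (VMAs) of a process. /proc/<pid>/maps lists one
# mapping per line, so counting lines gives the number of entries that smaps
# would have to describe in full detail.
def count_mappings(pid: int) -> int:
    with open(f"/proc/{pid}/maps") as f:
        return sum(1 for _ in f)

if __name__ == "__main__":
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else 22003  # example PID from above
    print(f"process {pid} has {count_mappings(pid):,} memory mappings")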

Running perf top on the system showed that the top three items hogging the most resources were all related to smaps accounting.

This file provides such useful information that we monitor it on a regular basis. It turns out that this can have… interesting effects. Consider that while we are scanning through all the memory mappings, the process may also need to change its memory mappings. That leads to contention on the kernel locks that protect the mappings, of course.

It’s expensive to generate the smaps file

Reading from /proc/[pid]/smaps is not a simple file read. It involves the kernel gathering detailed memory statistics (e.g., memory regions, page size, resident/anonymous/shared memory usage) for each virtual memory area (VMA) of the process. For large processes with many memory mappings, this can be computationally expensive as the kernel has to gather the required information every time /proc/[pid]/smaps is accessed.

When /proc/[pid]/smaps is read, the kernel needs to access memory-related structures. This may involve taking locks on certain parts of the process’s memory management system. If this is done too often or across many large processes, it could lead to contention or slow down the process itself, especially if other processes are accessing or modifying memory at the same time.
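You can get a rough feel for that cost by timing the reads yourself. This is just a sketch (the PID is the example from above, and the numbers will vary by system and by how many mappings the process has):

import time

# Average the time it takes to read /proc/<pid>/smaps. Every read forces the
# kernel to walk the process's VMAs and gather per-mapping statistics, so the
# cost grows with the number of mappings.
def average_smaps_read(pid: int, iterations: int = 20) -> float:
    start = time.perf_counter()
    for _ in range(iterations):
        with open(f"/proc/{pid}/smaps", "rb") as f:
            f.read()
    return (time.perf_counter() - start) / iterations

pid = 22003  # example PID from the snippet above
print(f"average smaps read time: {average_smaps_read(pid) * 1000:.2f} ms")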

If the number of memory mappings is high, and the interval at which we monitor is short… I hope you can see where this is going. We effectively spent so much time reading this file that we blocked other operations.

This wasn’t an issue when the process had just started, because the number of memory mappings was small, but as the system did more work and the number of memory mappings grew… we eventually started hitting contention.

The solution had three parts. We made sure that there is only ever a single thread reading the information from smaps (previously the read might have been triggered from multiple locations). We added throttling to ensure that we aren’t hammering the kernel with requests for this file too often (returning cached information when needed). And we switched from smaps to smaps_rollup, which performs much better since it produces only summary data.
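In broad strokes, the change looks something like the sketch below. This is not RavenDB’s actual code; the interval and names are made up, but it shows the shape of the fix: a single reader, a minimum interval between kernel reads, cached results in between, and smaps_rollup instead of smaps.

import threading
import time

# Sketch of a throttled, cached reader for /proc/<pid>/smaps_rollup.
# Not RavenDB's implementation; it only illustrates the approach: a single
# lock-protected reader, a minimum interval between actual kernel reads,
# and cached results for any caller that asks too soon.
class SmapsRollupReader:
    def __init__(self, pid: int | None = None, min_interval: float = 30.0):
        self._path = f"/proc/{pid or 'self'}/smaps_rollup"
        self._min_interval = min_interval   # seconds between actual kernel reads
        self._lock = threading.Lock()       # only one thread reads at a time
        self._cached = None
        self._last_read = 0.0

    def read(self) -> str:
        with self._lock:
            now = time.monotonic()
            if self._cached is None or now - self._last_read >= self._min_interval:
                with open(self._path) as f:
                    self._cached = f.read()
                self._last_read = now
            return self._cached

reader = SmapsRollupReader()
print(reader.read())   # first call hits the kernel
print(reader.read())   # calls within the interval return the cached text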

With those changes in place, we deployed to production and waited. The result was flat latency numbers and another item that the Cloud team could strike off the board successfully.

time to read 5 min | 970 words

A customer reported that on their system, they suffered from frequent cluster elections in some cases. That is usually an indication that system resources are being exhausted in some manner. From experience, it usually means that the I/O on the machine is capped (very common in the cloud) or that there is some network issue.

The customer was able to rule these issues out. The latency to the storage was typically under a millisecond, and the maximum latency never exceeded 5 ms. Network monitoring showed that everything was fine as well. CPU usage was hovering around 7%, and there was no obvious cause for the issue.

Looking at the logs, we saw very suspicious gaps in the servers’ activity, but with no good reason for them. Furthermore, the customer was able to narrow the issue down to a single scenario: resetting the indexes would cause the cluster state to become unstable. And it did so with very high frequency.

“That is flat out impossible,” I said. And I meant it. Indexing workload is something that we have a lot of experience managing, and in RavenDB 4.0 we made some major changes to our indexing structure to handle this scenario. In particular:

  • Indexing runs in dedicated threads.
  • Those threads are scoped to run outside certain cores, leaving the OS capacity to run other tasks.
  • Indexing self-monitors and knows when to wind down to avoid impacting system performance.
  • Indexing threads run with lower priority.
  • The cluster state, on the other hand, runs with high priority.

The entire thing didn’t make sense. However… the customer did a great job in setting up an environment where they could show us: click on the reset button, and the cluster becomes unstable. So it is impossible, but it happens.

We explored a lot of things around this issue. The machine is big and runs multiple NUMA nodes; maybe it was some interaction with that? It was highly unlikely, and it eventually didn’t pan out, but that is one example of the things we tried.

We set up a similar cluster on our end and gave it 10x the customer’s load, on a set of machines that had a fraction of the customer’s power. The cluster and the system state remained absolutely stable.

I officially declared that we were in a state of perplexation.

When we ran the customer’s own scenario on our system, we saw something, but nothing like what we saw on their end. One of the things we usually do when we investigate resource constraint issues is to give the machines under test a lot less capability. Less memory and slower disks, for example, make it much easier to surface problems. But the worse we made the situation for the test cluster, the better the results became.

We changed things up. We gave the cluster machines with 128 GB of RAM and fast disks and tried it again. The situation immediately reproduced.

Cue facepalm sound here.

Why would giving more resources to the system cause instability in the cluster? Note that the other metrics also suffered, which made absolutely no sense.

We started digging deeper and found the index in question.

It is about as simple an index as you can imagine and should cause absolutely no issues for RavenDB. So what was going on? We then looked at the documents…


I would expect the State field to be a simple enum property. Instead, it is an array that describes the state changes in the system, and the array holds complex objects. The array averages about 450 items, and I saw it hit a maximum of 13,000 items.
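For illustration only, a document with that shape might look roughly like the following. The collection, field names, and values here are invented; only the overall structure (a large State array holding complex objects) reflects what we actually saw:

# Hypothetical document shape, not the customer's actual data. The real
# documents had a State array averaging ~450 entries, up to 13,000.
order_document = {
    "Id": "orders/1-A",
    "State": [
        {"Status": "Created",  "At": "2024-11-02T08:13:00Z", "By": "users/31-A"},
        {"Status": "Approved", "At": "2024-11-02T09:40:00Z", "By": "users/12-A"},
        {"Status": "Shipped",  "At": "2024-11-03T14:02:00Z", "By": "users/77-A"},
        # ...hundreds more entries like these
    ],
}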

That helped clarify things. During indexing, we have to process the State property, and because it is an array, we index each of the elements inside it. That means that for each document, we’ll index between 400 and 13,000 items for the State field. What is more, each of those is a complex object, which RavenDB indexes as a JSON string, so effectively the indexing generates a lot of strings. These strings are going to be held in memory until the end of the indexing batch. So far, so good, but the problem in this case was that there were enough resources to build a very big batch of documents.

That means that we would generate over 64 million string objects in one of those batches (at roughly 450 State entries per document, that is on the order of 140,000 documents in a single batch).

Enter GC, stage left.

The GC will be invoked based on how many allocations you have (in this case, a lot) and how many live objects you have (in this case, also a lot, until the indexing batch completes). However, because the GC ran multiple times during the indexing batch, a significant number of objects were promoted to the next generation, and Gen1 and Gen2 collections are far more expensive.

Once we knew what the problem was, it was easy to find a fix. Don’t index the State field. Given that the values that were indexed were JSON strings, it is unlikely that the customer actually queried on them (later confirmed by talking to the customer).

On the RavenDB side, we added monitoring of the allocation rate and will now close the indexing batch early to avoid handing the GC too much work all at once.
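The general idea behind that mitigation can be sketched as follows. This is not RavenDB’s implementation; the budget value and the helper names are hypothetical and only illustrate the behavior:

# Sketch of closing an indexing batch early once an allocation budget is
# exceeded. Not RavenDB's code; the budget and helpers are made up.
ALLOCATION_BUDGET = 16 * 1024 * 1024  # e.g. stop after ~16 million allocated items

def index_batch(documents, index_document, allocation_cost):
    """Index documents until we run out of input or exceed the allocation budget."""
    allocated = 0
    processed = 0
    for doc in documents:
        index_document(doc)
        allocated += allocation_cost(doc)   # e.g. number of strings this doc produced
        processed += 1
        if allocated >= ALLOCATION_BUDGET:
            break   # close the batch early, handing the GC a smaller mess
    return processed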

The reason we failed to reproduce this on lower-end machines was simple: RavenDB was already using enough memory that we closed the batch early, before we could accumulate enough objects to make the GC really work hard. When running on a big machine, it had time to get the ball rolling and hand the whole big mess to the GC for cleanup.
