time to read 4 min | 763 words

We have been testing RavenDB in the harshest ways we can envision, anything from simulating hardware failures to corrupting network data to putting as much load as possible on the system. This is done as part of a long-running test suite that has been going for the last few months. We have been stepping up the kinds of things that we do in an attempt to identify the weak points in RavenDB.

One of the interesting failure modes that we need to handle is what happens when the amount of work asked of the system exceeds the resources available to it. At that point, what we want to do is start failing gracefully. What we don't want is to have the entire system grind to a halt or the server crash.

There are problems with the idea that we can detect when we're in low resource mode and react accordingly. To start with, it is really hard. If the user paid for a fast machine and we are running at 99% CPU, should we start rejecting requests? The users are actually getting their money's worth from the hardware, so it makes no sense to do that. Second, what is low resource mode, anyway? Running out of disk space is actually quite easy: we detect that, raise an error, and everything is fine. High CPU is not something that we want to react to; it might mean that we are handling a spike of traffic, or just making full use of the system.

Memory is another constrained resource, and here we run into our toughest problems. RavenDB uses memory-mapped files for a lot of its storage needs, which means that high memory usage is something that we want, because it means that we are actually using the memory of the machine. The OS can choose to evict such data from memory at any time very cheaply, so if there is true memory pressure, we don't need to worry, since there is a good degradation path for us.

The problem is that this isn't the only cause for high memory usage. In addition to the actual memory we are using (the working set), there is also the commit charge for the system. I'm probably going to have a separate post about the details of memory management from the OS point of view. The commit charge is how much memory the OS has promised to all the applications in the system. It is very common for applications to ask for a lot more memory than they actually need, which means that the OS will usually not allocate the memory immediately. Instead, it will just record the promise to provide it at a future date.

On Windows, the maximum commit charge is the size of the RAM plus the page file(s), and Windows will flat out refuse to commit memory beyond that limit. When you are working on a system that is heavily overburdened, it is very possible to hit that limit, which is when… interesting things will happen.

In particular, we need to consider what happens when committing memory fails while we need to grow a thread's stack. In this case, even though the size of the stack is reasonable, there is no way to get more memory for it, and we'll get a fatal stack overflow exception. It looks like this exact behavior is explicitly called out in the code. This means that under low memory conditions (which may mean low committed memory, not actually low physical memory), opening a new thread, which may need to allocate or expand its stack, is a very dangerous thing to do.

We have some code in RavenDB that spawns a new thread per connection for certain types of very long running server-to-server connections. Combine that with the fact that under such high load you'll typically see disconnections and recovery by establishing a new connection (requiring a new thread), and you can see the problem. Under such load we'll hit both conditions, low committed memory and the spawning of new threads, and then it is just a matter of chance whether it will be a regular (and handled) allocation that fails or the stack extension, resulting in a fatal error.

We are handling this by reusing the threads now, which seems to offer much greater stability in our test case.
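Here is a minimal sketch of the idea, in Java rather than RavenDB's actual C# code, with names and pool sizes that are purely illustrative: instead of calling new Thread(...) on every reconnect, long-running connection handlers are given to a pool that keeps previously created threads alive and hands them new work.

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: reuse existing worker threads for server-to-server
// connection loops instead of spawning a fresh thread (and a fresh stack that
// the OS must be able to commit) on every reconnect.
public class ConnectionWorkers {
    private final ThreadPoolExecutor pool = new ThreadPoolExecutor(
            4,                      // threads kept alive for reuse (illustrative)
            32,                     // hard cap on concurrent connection handlers (illustrative)
            60, TimeUnit.SECONDS,   // surplus threads retire after being idle this long
            new SynchronousQueue<>());

    public void handleConnection(Runnable connectionLoop) {
        // An idle worker picks this up if one exists; a new thread is created
        // only when no idle worker is available and the cap is not yet reached.
        pool.execute(connectionLoop);
    }
}
```

The point is that a reconnection storm no longer translates into a burst of new thread stacks that must be committed at the worst possible moment.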

time to read 2 min | 354 words

Sometimes you see the impossible. In one of our scenarios, we saw a cluster that had such a bad case of split brain that it came near to fracturing the very boundaries of space & time.

In a three-node cluster, we had one node that looked to be fine. It connected to all the other nodes and was the cluster leader. The other two nodes, however, were not in the cluster; in fact, they were showing signs that they had never been in the cluster.

What was really strange was that we took the other two machines down and the first node was still showing a successful cluster. We looked deeper and realized that it wasn't actually a healthy situation; in fact, this node was very rapidly switching between leader and follower mode.

It took a bit of time to figure out what was going on, but the root cause was DNS. We had the three nodes on separate DNS names (a.oren.development.run, b.oren.development.run, c.oren.development.run) which were set up to point to the three machines. However, we had previously used the same domain names to run a cluster on the first machine only. Because of the way DNS updates propagate, whenever the machine at a.oren.development.run tried to connect to b.oren.development.run, it would actually connect to itself.

At this point, A would tell B that it is the leader. But A is B, so A would respond by becoming a follower (because it was told it should, by itself). Because it became a follower, it would disconnect from itself. After a timeout, it would become leader again, and the cycle would continue.

Every time that the server would get up, it would whip itself down again. “I’m a leader”, “No, I’m a leader”, etc.

This is a fun thing to discover. We had to trace pretty deep to figure out that the problem was in the DNS cache (since the DNS itself was properly updated).

We fixed things so that we now recognize when we are talking to ourselves and raise a proper error.
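A minimal sketch of such a self-check, in Java with a hypothetical class name (not the actual RavenDB code): before treating a configured peer as remote, resolve its host name and compare the result to the addresses bound to the local machine.

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.util.Collections;

// Hypothetical sketch: detect when a "peer" address actually resolves to this
// machine, e.g. because of a stale DNS cache entry, and refuse to proceed.
public class SelfConnectionCheck {
    public static boolean resolvesToSelf(String peerHost) throws Exception {
        InetAddress resolved = InetAddress.getByName(peerHost);
        for (NetworkInterface nic : Collections.list(NetworkInterface.getNetworkInterfaces())) {
            for (InetAddress local : Collections.list(nic.getInetAddresses())) {
                if (local.equals(resolved)) {
                    return true; // the "remote" node is really ourselves
                }
            }
        }
        return false;
    }
}
```

With a check like this in the connection path, the node can fail with a clear "you are trying to connect to yourself" error instead of silently flipping between leader and follower.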

time to read 3 min | 599 words

Rowhammer is a type of attack that exploits the way DRAM is built. A specific write pattern to a specific memory cell can cause "leakage" to nearby memory cells, resulting in bit flips. The issue we were facing in production ended up being very similar.

The initial concern was that a database's on-disk size was very large, enough that we started a serious investigation into what exactly was taking up all that space. Our internal reporting said: nothing. And that was very strange. Looking at the actual file usage, we found a lot of transaction journals taking up a lot of space. By a lot I mean that the data file in question was 32 MB while the journals were taking up a total of over 15 GB. To be fair, some of them were kept around to be reused, but that was still bad.

It took a while to figure out what was going on. This particular Voron instance was used for a map/reduce index on a database that had high write throughput. Because of that, the indexing was always active. So far, so good; we have several other such instances, and they don't exhibit this behavior. What was different about this index is that, due to the shape of the index and the nature of the data, we would always end up modifying the same (small) set of data.

This index sums up a number of events and aggregates them based on when they happened. This particular system handles about a hundred updates a second on average and can peak at about five to seven times that. The index gives us answers such as "how many events of each type happened today" and things like that. This means that there is a lot of locality in the updates. And that was the key.

Even though this index (and the Voron storage that backed it) experienced a lot of writes, these writes almost always happened to the same set of data (basically updating the counters). That means that there was actually only a very small set of pages in the data file that were being modified. And that set off a chain of behaviors that resulted in a lot of disk space being used:

  • A lot of data is modified, meaning that we need to write a lot to the journal on each transaction.
  • Because the same data is constantly modified, the total amount of modified bytes is smaller than a certain threshold.
  • Writes are constant.
  • Flushing the data to disk is expensive, so we try to avoid it.
  • We can only delete journals after we have flushed the data.

Because we try to avoid flushing to disk if we can, we only do that when there is enough idle time or when enough data has been modified. In this case, there was no idle time, and the amount of data that was modified was too small to hit the limit.

The system would actually balance itself out eventually (which is why it stopped at around ~15 GB of journals). At some point we would either hit an idle spot or the buffer would hit the limit and we would flush to disk, which allowed us to free the journals, but that happened only after we had accumulated quite a few. The fix was to add a limit on how long we will delay flushing to disk in such a case, and it was quite trivial once we figured out what exactly all the different parts were doing.
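A minimal sketch of that kind of flush decision, in Java with illustrative names and thresholds rather than Voron's actual values: flush when enough bytes have been modified, when the system has been idle long enough, or, with the fix, when we have simply been putting the flush off for too long.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of a flush policy: the third condition is the fix, a cap
// on how long a flush (and therefore journal cleanup) can be deferred.
public class FlushPolicy {
    private static final long MODIFIED_BYTES_THRESHOLD = 64L * 1024 * 1024;  // illustrative
    private static final Duration IDLE_BEFORE_FLUSH = Duration.ofSeconds(5); // illustrative
    private static final Duration MAX_FLUSH_DELAY = Duration.ofMinutes(5);   // illustrative

    public boolean shouldFlush(long modifiedBytes, Instant lastWrite, Instant lastFlush) {
        Instant now = Instant.now();
        boolean enoughData = modifiedBytes >= MODIFIED_BYTES_THRESHOLD;
        boolean idle = Duration.between(lastWrite, now).compareTo(IDLE_BEFORE_FLUSH) >= 0;
        boolean deferredTooLong = Duration.between(lastFlush, now).compareTo(MAX_FLUSH_DELAY) >= 0;
        return enoughData || idle || deferredTooLong;
    }
}
```

Without that last condition, a workload of constant small updates to the same pages never satisfies either of the first two checks, and journal files pile up exactly as described above.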

time to read 3 min | 457 words

System configuration is important, and the more complex your software is, the more knobs you usually have to deal with. That is complex enough as it is, because sometimes these configuration values are interdependent. But it becomes a lot more interesting when we are talking about a distributed environment.

In particular, one of the oddest scenarios that we had to deal with in the production test run was when we got the different members of the cluster configured differently from each other, including operational details such as endpoints, security, and timeouts.

This can happen for real when you make a modification on a single server, because you are trying to fix something, and it works, and you forget to deploy it to all the others. Because people drop the ball, or because you have different people working on different things at the same time.

We classified such errors into three broad categories:

  • Local state which is fine to be different on different machines. For example, if each node has a different base directory or runs under a different user, we don't really care about that.
  • Distributed state which breaks horribly if misconfigured. For example, if we use the wrong certificate trust chains on different machines. This is something we don't worry about much, because things will break in a very visible fashion when this happens, which is quite obvious and allows quick resolution.
  • Distributed state which breaks horribly and silently down the line if misconfigured.

The last category was really hard to figure out and quite nasty. One such setting is the timeout for cluster consensus. On one of the nodes, this was set to 300 ms; on another, it was set to 1 minute. We derive a lot of behavior from this value. A server will send a heartbeat every 1/3 of this value, for example, and will consider a node down if it didn't get a heartbeat from it within this timeout.

This kind of issue meant that when the nodes were idle, one of them would ping the others every 20 seconds, while they expected a ping every 300 milliseconds. However, when they escalated things and checked explicitly with that server, it replied that everything was fine, leaving the whole cluster confused about what was going on.

To make things more interesting, if there is activity in the cluster, we don't wait for the timeout, so this issue shows up only during idle periods.

We tightened things so we now enforce the requirement that such values be the same across the cluster by explicitly validating them, which can save a lot of time down the road.
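A minimal sketch of that kind of validation, in Java with a hypothetical API (not RavenDB's actual code): when nodes connect, they exchange the configuration values that must be identical cluster-wide, such as the consensus timeout, and fail loudly on any mismatch.

```java
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch: compare the "must match" configuration values reported by
// a remote node against our own, and refuse the connection if they disagree.
public class ClusterConfigValidator {
    public static void validate(Map<String, String> localConfig,
                                Map<String, String> remoteConfig,
                                String remoteNode) {
        for (Map.Entry<String, String> entry : localConfig.entrySet()) {
            String remoteValue = remoteConfig.get(entry.getKey());
            if (!Objects.equals(entry.getValue(), remoteValue)) {
                throw new IllegalStateException(
                        "Configuration mismatch with " + remoteNode + ": '" + entry.getKey()
                        + "' is '" + entry.getValue() + "' locally but '" + remoteValue + "' remotely");
            }
        }
    }
}
```

Failing at connection time turns a silent, hard-to-diagnose behavioral drift into an immediate and obvious configuration error.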

time to read 2 min | 316 words

Not all of our testing happened in a production setting. One of our test clusters was simply running a pretty simple loop of writes, reads, and queries on all the nodes in the cluster while we intentionally destabilized the system.

After about a week of this, we learned that it worked: there were no memory leaks or increased resource usage. We also learned that the size of the data on disk was about three orders of magnitude larger than it should have been.

Investigating this, we discovered that the test process introduced conflicts because it repeatedly wrote the same set of documents to each of the nodes. We resolve such conflicts automatically, but we also keep the conflicted copies around so users can figure out what happened to their system. In this particular scenario, we had a lot of conflicted revisions, and it was initially hard to figure out what was taking up all that space.

In our production system, we also discovered that we log too much. One of the interesting pieces of feedback we were looking for in this production test run was to see what kind of information we could get from the logs and to make sure that the details there are actionable. Part of that was to see if we could troubleshoot something using just the logs, and to add missing details if there was stuff that we couldn't figure out from them.

We also discovered that under load, we would log a lot. In particular, we had log entries detailing every indexed document and replicated item. These are almost never useful, but they generated a lot of noise when we lowered the log settings, so that went away as well. We are very focused on log usability; it should be possible to understand what is going on and why without drowning in minutiae.
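In the spirit of that cleanup, here is a small, hedged illustration (using java.util.logging, not whatever RavenDB actually uses) of the common alternative to deleting such statements outright: gate per-item details behind a verbose level and a level check, so high-throughput paths don't pay for messages nobody will read.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch: per-document logging is gated behind a verbose level,
// so normal operation produces actionable logs instead of a flood of noise.
public class IndexingLog {
    private static final Logger log = Logger.getLogger(IndexingLog.class.getName());

    void onDocumentIndexed(String docId) {
        if (log.isLoggable(Level.FINEST)) {
            log.finest("Indexed document " + docId);
        }
    }
}
```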

time to read 4 min | 676 words

Before stamping RavenDB with the RTM marker, we decided that we wanted to push it to our production systems. That is something that we have been doing for quite a while, obviously, dogfooding our own infrastructure. But this time was different. While before we had a pretty simple deployment and a stable pace, this time we decided to mix things up.

In other words, we decided to go ahead with the IT version of the stooges for our production systems. In particular, that means this blog, the internal systems that run our business, all our websites, external services that are exposed to customers, etc. As I'm writing this, one of the nodes in our cluster has run out of disk space; it has been that way since last week. Another node has been torn down and rebuilt at least twice during this run.

We also did a few rounds of "if it compiles, it fits production". In other words, we basically read this guy's twitter stream and did what he said. This resulted in an infinite loop in production on two nodes, and that issue was handled by someone who didn't know what the problem was, wasn't part of the change that caused it, yet was able to figure it out and then work around it with no code changes.

We also had two different teams upgrade their (interdependent) systems at the same time, which included both upgrading the software and adding new features. I also had two guys with the ability to manage machines and a whole brigade of people who were uploading things to production. That meant that we had a distinct lack of knowledge across the board, so the people managing the machines weren't always aware of what the system was experiencing, and the people deploying software weren't aware of the actual state of the system. At some point, I'm pretty sure that we had two concurrent (and opposing) rolling upgrades to the database servers.

No, I didn’t spike my coffee with anything but extra sugar. This mess of a production deployment was quite carefully planned. I’ll admit that I wanted to do that a few months earlier, but it looks like my shipment of additional time was delayed in the mail, so we do what we can.

We need to support this software for a minimum of five years, likely longer, which means that we really need to see where all the potholes are and patch them as best we can. That means we need to test it in bad situations, and there is only so much that a chaos monkey can do. It isn't just a matter of seeing what happens when the network fails; that is easy enough to simulate and certainly something that we are thinking about. But being able to diagnose a live production system stuck in an infinite loop because of bad error handling, and recover from that? That is the kind of stuff I want to know we can do in order to properly support things in production.

And while we had a few glitches, for the most part I don't think any of them were really observable externally. The reason for that is the reliability mechanisms in RavenDB 4.0: for the most part, we need just a single server to remain functional, which meant that we could keep running without issue even when most of the cluster was flat out broken for an extended period of time.

We got a lot of really interesting results from this experience, and I'll be posting about some of them in the near future. I don't recommend doing this to any customer, but the problem is that we have seen systems that are managed about as poorly, and we want to be able to survive in such a (hostile) environment, as well as to support customers who have partial or even misleading ideas about what their own systems look like and how they behave.
