Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,161
Privacy Policy · Terms
filter by tags archive
time to read 2 min | 347 words

I am writing this answer before people had a chance to answer the actual challenge, so I hope people caught it. This was neither easy nor obvious to catch, because it was hiding with a pile of other stuff and the bug is a monster to figure out.

In case you need a reminder, here is the before & after code:

Look at line 18 in the second part. If we tried to allocate native memory and failed, we would try again, this this with the requested amount.

The logic here is that we typically want to request memory in power of 2 increments. So if asked for 17MB, we’ll allocate 32MB. This code is actually part of our memory allocator, which request memory from the operating system, so it is fine if we allocate more, we’ll just use that in a bit. However, if we don’t have enough memory to allocate 32MB, maybe we do have enough to allocate 17MB. And in many cases, we do, which allow the system to carry on operating.

Everyone is happy, right? Look at line 21 in the second code snippet. We set the allocated size to the size we wanted to allocate, not the actual size we allocated.

We allocated 17MB, we think we allocated 32MB, and now everything can happen.

This is a nasty thing to figure out. If you are lucky, this will generate an access violation when trying to get to that memory you think you own. If you are not lucky, this memory was actually allocated to your process, which means that you are now corrupting some totally random part of memory in funny ways. And that means that in some other time you’ll be start seeing funny behaviors and impossible results and tear your hair out trying to figure it out.

To make things worse, this is something that only happens when you run out of memory, so you are already suspicious about pretty much everything that is going on there. Nasty, nasty, nasty.

I might need a new category of bugs: “Stuff that makes you want to go ARGH!”

time to read 1 min | 72 words

We are working on improving the reliability of RavenDB under a host of scenarios This week it is low memory conditions. We made some fixes, and introduced a horrible bug.

Here is the code, can you see what the error is? Here are the first and second versions of the code. The second version is meant to be more robust to running under low memory conditions, but it is actually much worse.

time to read 3 min | 467 words

imageOne of the changes we made to RavenDB 4.0 as part of the production run feedback was to introduce the notion of a pool of threads. This is quite distinct from the notion of a thread pool, and it deserves it own explanation.

A thread pool is something that is very commonly used in server applications. Instead of spawning a thread per task (a really expensive process), the system keeps a pool of threads around and provide some way to queue tasks for them to do. Such tasks are typically expected to be short and transient. They should also not have any expectation about the state of the thread nor should they modify any thread state.

In .NET, all async work will typically go through the system thread pool. When you are processing a request in ASP.Net, you are running on a thread pool thread and it is in heavy use in any sort of server environment.

RavenDB is also making heavy use of the thread pool, to service requests and handle most operations. But it also has the need to process several long terms tasks (which can run for days or more). Because of that,  and because we need both fine grained control and the ability to inspect the state of the system easily, we typically spawn a new, dedicated, thread for such tasks. As it turns out, under high memory load, this is a dangerous thing to do. The thread might try to commit some stack space, but there is no memory in the system to do so, resulting in a fatal stack overflow.

I should note that the kind of tasks we use a dedicated thread for are pretty rare and long lived, they also do things like mutate the thread state (changing the priority, for example), for example.

Because of that, we can’t just use the thread pool, nor do we want a similar abstraction. Instead, we created a pool of threads. A job can request a thread to run on, and it will get its own thread to run and do with as it pleases. When it is done running, which can be in a minute or in a week’s time, it will return the thread to the pool, where it will remain until another job needs it.

In this way, under high memory usage, we’ll not be creating new threads all the time, and the threads’ stack are likely to be already committed and available to the process.

Update: To clear things up. Even if we do need to create a new thread, we now have control over that, in a single place. If there isn't enough memory available to actually use the new thread stack, we'll refuse to create it.

time to read 4 min | 763 words

imageWe have been testing RavenDB in the harshest possible ways we can envision. Anything from simulating hardware failures to corrupting the network data to putting as much load as possible on the system. This is done as part of a long running test suite that has been running for the last few months. We have been stepping up the kind of things that we are doing in attempt to identify the weak points in RavenDB.

One of the interesting failure modes that we need to handle is what happens when the amount of work that is asked from the system exceeds the amount of resources that are available to it. At this point, what we want to do is to start failing gracefully. What we don’t want to do is to have the entire system grind to a halt of the server crashing.

There are problems with the idea that we can detect when we re in low resource mode and react accordingly. To start with, it is really hard. If the user paid for a fast machine, and we are running at 99% CPU, should we start rejecting requests? The users are actually getting their money’s worth from the hardware, so it make no sense to do that. Second, what is low resource mode? No space in hard disk is quite easy, actually. We detect that and error and everything is fine. High CPU is not something that we want to react to, it might be that we are actively handling a spike of traffic, or just making full use of the system.

Memory is another constrained resource, and here we run into our toughest problems. RavenDB uses memory mapped files for a lot of its storage needs, which means that high memory usage is something that we want, because it means that we are actually using the memory of the machine. Now, the OS can choose to evict such data from memory at any time very cheaply, so if there is a true memory pressure, we don’t need to worry, since there is a good degradation path for us.

The problem is that this isn’t the only cause for high memory usage. In additional to the actual memory we are using (the working set) there is also the commit charge for the system. I’m probably going to have a separate post to talk about the details of memory management from the OS point of view. The commit charge is how much memory the OS promised all the applications in the system. It is very common for applications to ask for a lot more memory than they actually need, which mean that the OS will usually not actually allocate the memory immediately. Instead, it will just record the promise to give it at a future date.

On Windows, the maximum commit charge is the size of the RAM and the page file(s) and Windows will flat out refuse to commit memory beyond that limit. When you are working on a system that is heavily overburdened, it is very possible to hit that limit, which is when… interesting things will happen.

In particular, we need to consider the behavior of failure to commit memory when we need to increase the size of the thread stack. In this case, even though the size of the stack is reasonable, there is no way to get more memory for the stack, and we’ll get a a fatal Stack Overflow exception. It looks like this exact behavior is very explicitly called in the code. This means that under low memory conditions (which may be low committed memory, not real low memory) opening a new thread, which may need to allocate / expand its stack, is a very dangerous behavior.

We have some code in RavenDB that spawn a new thread per connection for certain types of very long running server to server connections. Combine that we the fact that under such high load you’ll typically see disconnection and recovery by establishing a new connection (requiring a new thread) and you can see the problem. Under such load we’ll hit both conditions. Low committed memory and spawning of new threads, and then it is just a game of whatever it will be regular (and handled) allocation that fails or if it would be the stack extension that would fail, resulting in fatal error.

We are handling this by reusing the threads now, which seems to offer much greater stability in our test case.

time to read 2 min | 354 words

imageSometimes you see the impossible. In one of our scenarios, we saw a cluster that had such a bad case of split brain that it came near to fracturing the very boundaries of space & time.

In a three node cluster, we have one node that looked to be fine. It connected to all the other nodes and was the cluster leader. The other two nodes, however, were not in the cluster and in fact, they were showing signs that they never were in the cluster.

What was really strange was that we took the other two machines down and the first node was still showing a successful cluster. We looked deeper and realized that it wasn’t actually a healthy situation, in fact, this node was very rapidly switching between leader and follower mode.

It took a bit of time to figure out what was going on, but the root cause was DNS. We had the three nodes on separate DNS (a.oren.development.run, b.oren.development.run, c.oren.development.run) and they were setup to point to the three machines. However, we have previously used the same domain names to run a cluster on the first machine only. Because of the way DNS updates, whenever the machine at a.oren.development.run would try to connect to b.oren.development.run it would actually connect to itself.

At this point, A would tell B that it is the leader. But A is B, so A would respond by becoming a follower (because it was told it should, by itself). Because it became a follower, it disconnected from itself. After a timeout, it would become leader again, and the cycle would continue.

Every time that the server would get up, it would whip itself down again. “I’m a leader”, “No, I’m a leader”, etc.

This is a fun thing to discover. We had to trace pretty deep to figure out that the problem was in the DNS cache (since the DNS itself was properly updated).

We fixed things so we now recognize if we are talking to ourselves and error properly.

time to read 2 min | 332 words

This blog post was a very interesting read. It talks about the ipify service and how it grew. It is an interesting post, but what really caught my eye was the performance details.

In particular, this talks about exceeding 30 billions of requests per month. The initial implementation used Node, and couldn’t get more than 30 req/second or so and the current version seems to be in Go and can handle about 2,000 requests a second.

I’m interested in performance so I decided to see why my results would be. I very quickly wrote the simplest possible implementation using Kestrel and threw that on a t2.nano in AWS. I’m pretty sure that this is the equivalent for the dyno that he is using. I then spun another t2.small instance and used that to bench the performance of the system. All I did was just run the code with release mode on the t2.nano, and here are the results:"

image

So we get 25,000 requests a second and some pretty awesome latencies under very high load on a 1 CPU / 512 MB machine. For that matter, here are the top results from midway through on the t2.nano machine:

image

I should say that the code I wrote was quick and dirty in terms of performance. It allocates several objects per request and likely can be improved several times over, but even so, these are nice numbers. Primarily because there is actually so little that needs to be done here. To be fair, a t2.nano machine is meant for burst traffic and is not likely to be able to sustain such load over time, but even when throttled by an order of magnitude, it will still be faster than the Go implementation Smile.

time to read 3 min | 599 words

imageRowhammer is a type of attack on the way DRAM is built. A specific write pattern to a specific memory cell can cause “leakage” to nearby memory cells, causing bit flips. The issue we were facing in production ended up being very similar.

The initial concern was that a database on disk size was very large. Enough that we started a serious investigating into what exactly is taking all this space. Our internal reporting said, nothing. And that was very strange. Looking at the actual file usage, we had a lot of transaction journals taking a lot of space there. By a lot I mean that the size of the data file in question was 32 MB and the journals were taking a total of over 15GB. To be fair, some of them were kept around to be reused, but that was bad.

It took a while to figure out what was going on. This particular Voron instance was used for a map/reduce index on a database that had high write throughput. Because if that, the indexing was always active. So far, so good, we have several other such instances, and they don’t exhibit this behavior. What was different about this index is that due to the shape of the index and the nature of the data, what ended up happening is that we’ll always modify the same (small) set of the data.

This index sums up a number of events and aggregate them based on when they happened. This particular system handle about a hundred updates a second on average, and can peak to about five to seven times that. The index gives us things such as “how many events of each type happened today” and things like that. This means that there is a lot of locality in the updates. And that was the key.

Even though this index (and the Voron storage that backed it) was experienced a lot of writes, these writes almost always happened to the same set of data (basically updating the counters). That means that there wasn’t actually just a very small set of pages in the data that were modified. And that set off a chain of behaviors that results in a lot of disk space being used.

  • A lot of data is modified, meaning that we need to write a lot to the journal on each transaction.
  • Because the same data is constant modified, the total amount of modified bytes is smaller than a certain threshold.
  • Writes are constants.
  • Flushing the data to disk is expensive, so we try to avoid it.
  • We can only delete journals after we flushed the data.

Because we try to avoid flushing to disk if we can, we only do that when there is enough idle time or when enough data has been modified. In this case, there was no idle time, and the amount of data that was modified was too small to hit the limit.

The system would actually balance itself out eventually (which is why it stopped at around ~15GB or so of journals). At some point we would either hit an idle spot or the buffer will hit the limit and we’ll flush to disk, which allow us to free the journals, but that happened only after we had quite a few. The fix was to add a limit to how long we’ll delay flushing to disk in such a case and was quite trivial once we figure out what exactly all the different parts were doing.

time to read 1 min | 116 words

The Inside RavenDB 4.0 book just got an update, you can download the new release in the link. It currently has 10 chapters and cover a lot of materials. New in this release is in depth coverage of indexing.

Note that some of the earlier details might have changed a bit (usually configuration name changes), I’m going to go over and fix them as part of the release process.

There are also the upcoming RavenDB Workshops in San Francisco and New York next month, if you want to learn about RavenDB directly from me Smile.

We’ll also have a booth at the Developer Week conference in early February in San Francisco.

time to read 5 min | 969 words

imageA customer asked for an interesting feature. Given two servers that need to replicate data between themselves, and the following network topology:

  • The Red DB is on the internal network, it is able to connect to the Gray DB.
  • The Gray DB is on the DMZ, it is not able to connect to the Red DB.

They are setup (with RavenDB) to do replication to one another. With RavenDB, this is done by each server opening a TCP connection to the other and sending all the updates this way.

Now, this is a simple example of a one way network topology, but there are many other cases where you might get into a situation where two nodes need to talk to each other, but only one node is able to connect to the other. However, once a TCP connection is established, communication is bidirectional.

The customer asked if we could add a configuration to support reversing the communication mode for replication. Instead of the source server initiating the connection, the destination server will do that, and then the source server will use the already established TCP connection henceforth.

This works, at least in theory, but there are many subtle issues that you’ll need to deal with:

  • This means that the source server (now confusingly the one that accepts requests) is not in control of sending the data. Conversely, it means that the destination side must always keep the connection open, retrying immediately if there was a failure and never getting a chance to actually idle. This is a minor concern.
  • Security is another problem. Replication is usually setup by the admin on the source server, but now we have to set it up on both ends, and make sure that the destination server has the ability to connect to the source server. That might carry with it more permissions that we want to grant to the destination (such as the ability to modify data, not just get it).
  • Configuration is now more complex, because replication has a bunch of options that needs to be set, and now we need to set these on the source server, then somehow have the destination server let the source know which replication configuration it is interested in. What happens if the configuration differs between the two nodes?
  • Failover in distributed system made of distributed systems is hard. So far we actually talked about nodes, but this isn’t actually the case, the Red and Gray DBS may be clusters in their own right, composed of multiple independent nodes each. When using replication in the usual manner, the source cluster will select a node to be in charge of the replication task, and this will replicate the data to a node on the other side. This can have multiple failure modes, a source node can be down, a destination node can be done, etc. That is all handled, but it will need to be handled anew for the other side.
  • Concurrency is yet another issue. Replication is now controlled by the source, so it can assume certain sequence of operations, if the destination can initiate the connection, it can initiate multiple connections (or different destination nodes will open connections to the same / different source nodes at the same time) resulting in a sequential code path suddenly needing to deal with concurrency, locking, etc.

In short, even though it looks like a simple feature, the amount of complexity it brings is quite high.

Luckily for us, we don’t need to do all that. If what we want is just to have the connection be initiated by the other side, that is quite easy. Set things up the other way, at the TCP level. We’ll call our solution Routy, because it’s funny.

First, you’ll need a Routy service at the destination node, this will just open an TCP connection to the source node. Because this is initiated by the destination, this works fine. This TCP connection is not sent directly to the DB on the source, instead, it is sent to the Routy service on the source that will accept the connection and keep it open.

At the source side, you’ll configure it to connect to the source-side service. At this point, the Routy service on the source has two TCP connections. One that came from the source and one that came from the remote Routy service on the destination. At this point, it will basically copy the data between the two sockets and we’ll have a connection to the other side. On the destination side, the Routy service will start getting data from the source, at which point it will initiate its own connection to the database on the destination, ending up with two connections of its own that it can then tie together.

From the point of view of the databases, the source server initiated the connection and the destination server accepted it, as usual. From the point of view of the network, this is a TCP connection that came from the destination to the source, and then a bunch of connections locally on either end.

You can write such a Routy service in a day or so, although the error handling is probably the trickiest one. However, you probably don’t need to do that. This is called TCP reverse tunneling, and you can just use SSH to do that. There are also many other tools that can do the same.

Oh, and you might want to talk to your network admin first, it is possible that this would be easier if they will just change the firewall settings. And if they don’t do that, remember that this is effectively doing the same, so there might be a good reason here.

time to read 3 min | 457 words

imageSystem configuration is important, and the more complex your software is, the more knobs you usually have deal with. That is complex enough as it is, because sometimes these configurations are inter dependent. But it become a lot more interesting when we are talking about a distributed environment.

In particular, one of the oddest scenarios that we had to deal with in the production test run was when we got the different members in the cluster to be configured differently from each other. Including operational details such as endpoints, security and timeouts.

This can happen for real when you make a modification on a single server, because you are trying to fix something, and it works, and you forget to deploy it to all the others. Because people drop the ball, or because you have different people working on different things at the same time.

We classified such errors into three broad categories:

  • Local state which is fine to be different on different machines. For example, if each node has a different base directory or run under a different user, we don’t really care for that.
  • Distributed state which breaks horribly if misconfigured. For example, if we use the wrong certificate trust chains on different machines. This is something we don’t really care about, because things will break in a very visible fashion when this happens, which is quite obvious and will allow quick resolution.
  • Distributed state which breaks horrible and silently down the line if misconfigured.

The last state was really hard to figure out and quite nasty. One such setting is the timeout for cluster consensus. In one of the nodes, this was set to 300 ms and on another, it was set to 1 minute. We derive a lot of behavior from this value. A server will heartbeat every 1/3 of this value, for example, and will consider a node down if it didn’t get a heartbeat from it within this timeout.

This kind of issue meant that when the nodes are idle, one of them would ping the others every 20 seconds, while they would expect a ping every 300 milliseconds. However, when they escalated things to check explicitly with the server, it replied that everything was fine, leading to the whole cluster being confused about what is going on.

To make things more interesting, if there is activity in the cluster, we don’t wait for the timeout, so this issue only shows up only on idle periods.

We tightened things so we enforce the requirement that such values to be the same across the cluster by explicitly validating this, which can save a lot of time down the road.

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 5 days from now
  3. What happens when a sparse file allocation fails? - 7 days from now
  4. NTFS has an emergency stash of disk space - 9 days from now
  5. Challenge: Giving file system developer ulcer - 12 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}