time to read 11 min | 2125 words

On Friday, Sep 24, 2021, our monitoring tool started emitting alerts. Our customers’ nodes were losing connectivity to one another. Not all of them, but a significant percentage. The issue impacted nodes in all availability zones, in all regions and across cloud providers. We run isolated clusters, and aside from a total breakdown of the Internet, there shouldn’t be a scenario where this happens. Looking into the details, it soon became clear that the issue wasn’t with the infrastructure or the network. Actual connectivity between the nodes was fine; they were able to reach one another, but they weren’t trusting one another. Following a checklist, we verified that this wasn’t an issue with a specific release or version. The issue was affecting (some) nodes across all our supported versions, including many that have been running in production for years without issue.

TLDR; If you are using RavenDB and you run into SSL / TLS connection issues such as:

System.Security.Authentication.AuthenticationException: The remote certificate is invalid because of errors in the certificate chain

You can resolve the issue by running the following commands on your instances:

rm /home/$RAVENDB_USER/.dotnet/corefx/cryptography/x509stores/ca/DAC9024F54D8F6DF94935FB1732638CA6AD77C13.pfx
rm /home/$RAVENDB_USER/.dotnet/corefx/cryptography/x509stores/ca/48504E974C0DAC5B5CD476C8202274B24C8C7172.pfx

RavenDB uses X509 certificates for authentication. One of the primary reasons it does so is to reduce the dependency on external systems (such as LDAP, for example). We were able to quickly figure out what the root cause was. Clients were not trusting the servers. Usually, for authentication, it is the other way around, but in this case, somehow, the client connection would detect that the remote server was using an invalid certificate.

We are using Let’s Encrypt certificates, like the vast majority of the web, and a quick look of our own showed us nothing wrong. Our browsers certainly thought that the certificates were valid, and locally everything worked. On the production systems (and that impacted a few dozen clusters), on the other hand, they were flat out rejected. That was not supposed to happen. The certificates were valid, and we would get a different error if they were expired. One of the good things about using X509 certificates is that they are easy to debug independently. Here is an example of what we saw:

ayende@oren-pc:~$ curl https://a.free.ayende.ravendb.cloud
curl: (60) SSL certificate problem: unable to get local issuer certificate
More details here: https://curl.haxx.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.

Digging a bit deeper, we can use OpenSSL to see the real details:

ayende@oren-pc:~$ openssl s_client -connect a.free.ayende.ravendb.cloud:443
CONNECTED(00000003)
depth=1 C = US, O = Let's Encrypt, CN = R3
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 CN = *.free.ayende.ravendb.cloud
verify return:1
---
Certificate chain
 0 s:CN = *.free.ayende.ravendb.cloud
   i:C = US, O = Let's Encrypt, CN = R3
 1 s:C = US, O = Let's Encrypt, CN = R3
   i:O = Digital Signature Trust Co., CN = DST Root CA X3
---
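It can also help to look at the same connection from .NET’s point of view, since that is what RavenDB’s clients use. The following is a minimal sketch (not part of the original post) that only observes the chain the runtime builds; the host name is the same example host used above:

using System;
using System.Linq;
using System.Net.Security;
using System.Net.Sockets;
using System.Security.Cryptography.X509Certificates;

class ChainCheck
{
    static void Main()
    {
        const string host = "a.free.ayende.ravendb.cloud"; // same example host as above

        using var client = new TcpClient(host, 443);
        using var ssl = new SslStream(client.GetStream(), leaveInnerStreamOpen: false,
            userCertificateValidationCallback: (sender, cert, chain, errors) =>
            {
                Console.WriteLine($"Policy errors: {errors}");
                foreach (X509ChainElement element in chain.ChainElements)
                {
                    var status = string.Join(", ",
                        element.ChainElementStatus.Select(s => s.Status));
                    Console.WriteLine($"  {element.Certificate.Subject}");
                    Console.WriteLine($"    issued by: {element.Certificate.Issuer}");
                    Console.WriteLine($"    status: {(status.Length == 0 ? "OK" : status)}");
                }
                return true; // observe only; do not fail the handshake here
            });
        ssl.AuthenticateAsClient(host);
    }
}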

So the issue is that the machine is now no longer trusting Let’s Encrypt? What the hell is going on? The start of the answer is here: the root certificate for Let’s Encrypt (DST Root CA X3) will expire at the end of the month. That was known well in advance, and there is a new root certificate in place that should be in pretty much all devices that you use (except very old Androids, which aren’t relevant here).

The problem is that, for some reason which (if I understand correctly) has to do with old versions of OpenSSL, the ca-certificates package, which is a common way to maintain the list of trusted root certificates, removed the soon-to-expire (but not yet actually expired) certificate from the list of trusted roots. That was pushed, as a security update, to many instances. It is at this point that the problems started:

  • Established connections were fine and worked just great, since they had already authenticated and kept working without issue.
  • New connections, on the other hand, would fail, because the trusted root was now untrusted.

As you can imagine, this caused somewhat of an uproar for our operations team.

The killer thing for us, I have to say, is that we had tested this scenario beforehand. There had been plenty of time to get ready, and our ops team had made sure that we provisioned the certificates with the new root. This should have been a non-event, given that we prepared specifically for it.

However… what happened was that on some portion of our machines, the certificate that was served was using the old chain, the one that is no longer valid. At this point we were looking at having to redeploy certificates to hundreds of clusters, so we split the team to work on things in parallel. While one part of the team prepped the emergency patch to all instances, the other half focused on how to properly generate the right certificate.

That is a bit complex. We intentionally do not keep the private keys of the clusters on our management plane; they are located solely on customer instances. So we first had to fetch them from the store, generate a new Let’s Encrypt certificate and then distribute it back to the right location. A tricky bit of work, but we are used to working at this scale.

Except… that didn’t work. To be rather more exact, the certificates that we got from Let’s Encrypt were always using the old root certificate, never the new one. There is a way to request a specific chain from Let’s Encrypt, but our client for interacting with Let’s Encrypt did not support it. It shouldn’t have mattered, however: all recent (6 months or so, at least) certificates from Let’s Encrypt are signed using the new chain, so how can this be?

Here is how you’ll typically see a discussion of cross-signed certificates:

[Image: diagram of a single certificate presented as signed by both chains]

Basically, you have a single certificate that is signed using both chains. Except, that isn’t how it works. A certificate can be signed by only a single issuer. The key here is that the issuer is not a certificate. The issuer is the public key + name of the certificate that provided the signature, and you can get the same public key + name signed by multiple parties. Which means, of course, that when you serve the certificate to a client, you typically need to tell it what certificate signed this one. I’ll have a more detailed post on that topic, but for now, assume that when the server replies to a client, it sends either:

  • Our Certificate + Issuer (green)
  • Our Certificate + Issuer (red)

Note that the client that receives the certificate is free to absolutely ignore any other certificates that the server is sending, but most of the time, it will use them to follow the chain of certificates and find a trusted root.
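To make that concrete, here is a small sketch (the file names are assumptions, not from the post) that loads the two cross-signed intermediates and shows that they carry the same subject and public key but a different issuer, which is all that cross-signing means:

using System;
using System.Security.Cryptography.X509Certificates;

class CrossSignDemo
{
    static void Main()
    {
        // Assumed file names: the same "R3" intermediate, once signed by each root.
        foreach (var path in new[] { "r3-signed-by-dst-root.pem", "r3-signed-by-isrg-root.pem" })
        {
            var cert = new X509Certificate2(path);
            Console.WriteLine(path);
            Console.WriteLine($"  Subject: {cert.Subject}"); // identical in both files
            Console.WriteLine($"  Issuer:  {cert.Issuer}");  // differs between the files
            Console.WriteLine($"  Key:     {Convert.ToHexString(cert.GetPublicKey())[..20]}..."); // identical
        }
    }
}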

In our case, we could see that we were serving the wrong certificate chain. We were serving the (valid) server certificate, and an intermediate that pointed to a removed root. That meant that the client would follow the chain we provided and then would find a missing root, leading to a trust violation and an aborted connection.

That, in a word, sucks. Because we couldn’t figure out what was going on. Given the scope of the problem and the number of impacted customers, we needed to figure out a way to resolve that as soon as possible.

Going back to the Let’s Encrypt certificates that we were generating, what we actually get back from Let’s Encrypt is a certificate bundle that contains:

  • Server Certificate (a.free.ayende.ravendb.cloud, for example)
  • Intermediate certificate R3 signed by DST Root CA X3 – that one is the problematic one
  • Intermediate certificate R3 signed by ISRG Root X1 – that is the one we want
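For illustration, here is a small sketch (not our production code; the fullchain.pem file name is an assumption) of loading such a bundle and keeping only the chain we want, dropping the intermediate that leads to DST Root CA X3:

using System;
using System.Linq;
using System.Security.Cryptography.X509Certificates;

class PickChain
{
    static void Main()
    {
        var bundle = new X509Certificate2Collection();
        bundle.ImportFromPemFile("fullchain.pem"); // leaf + both intermediates

        foreach (var cert in bundle)
            Console.WriteLine($"{cert.Subject} issued by {cert.Issuer}");

        // Keep the leaf and the intermediate that chains to ISRG Root X1,
        // dropping the one that chains to the soon-to-expire DST Root CA X3.
        var wanted = bundle.Cast<X509Certificate2>()
                           .Where(c => !c.Issuer.Contains("DST Root CA X3"))
                           .ToList();

        Console.WriteLine($"Kept {wanted.Count} certificates out of {bundle.Count}");
    }
}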

We verified that not only are we getting the right certificate, we have been getting the right certificate for quite some time now, including on all the impacted servers.

A note on the state of the system at that point in time. We have quite a few customers running on the cloud, and the problem only impacted a small percentage at the time. Given features such as connection pooling, actually creating new connections tends to be rare. But the longer we let it go, the higher the chance that a connection would break and we wouldn’t be able to renew it.

What is more, this impacts both server-to-server communication in the cluster and clients connecting to the server. That means that it also happened to customers in their applications (although that was pretty rare). The major impact that we could see is that the cluster connections broke, which meant that we effectively had a network split, even though the network itself was fine.

RavenDB is actually designed to operate in such an environment, and the nodes carried on with their tasks. Reads and writes to the database worked as usual, but cluster-wide tasks (such as deploying indexes) would fail.

So the sky wasn’t falling yet, but there were some dark clouds on the horizon. At this point, we had managed to figure out why the certificate wasn’t trusted anymore: the automatic update to the ca-certificates package. The test environment showed that reinstalling the removed certificate would restore proper behavior. After verifying that this indeed fixes the issue, we rolled it out across our fleet.

Problem solved? Not so fast. We plugged the hole in the ship, but in a few days we were going to be looking at the exact same issue, because the certificate would actually expire. We had to figure out what happened and why RavenDB wasn’t serving the right certificate chain in this context.

Cue putting on spelunking equipment and digging into pretty much every code base in sight. We scoured OpenSSL, the .NET framework and dug through documentation about the certificate infrastructure in Windows. We couldn’t figure out what was going on. This should work. In fact, all our tests showed us that it was working, except it didn’t.

At this point, we believe that we have figured it out, and it is a pretty insane reason. Four years ago, we ran into a serious problem inside of RavenDB when running in a secured environment: if the certificate chain wasn’t available on the host, RavenDB would have to download it so that it could serve the appropriate chain when doing the TLS handshake. To avoid this issue, RavenDB will build the certificate chain once on startup, and then store it in the CA X509Store for the RavenDB user. That allowed us to remove a few remote calls on connection startup and made us more resilient when we are facing network issues.
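Conceptually, that startup step looks something like the sketch below (this is an illustration of the approach, not RavenDB’s actual code). On Linux, the CurrentUser CA store is exactly the ~/.dotnet/corefx/cryptography/x509stores/ca/ directory mentioned earlier:

using System.Security.Cryptography.X509Certificates;

static class ChainCache
{
    // Build the chain once, then cache every non-leaf certificate in the
    // current user's CA store so later TLS handshakes never need to go remote.
    public static void CacheIntermediates(X509Certificate2 serverCertificate)
    {
        using var chain = new X509Chain();
        chain.ChainPolicy.RevocationMode = X509RevocationMode.NoCheck;
        chain.Build(serverCertificate); // may download missing intermediates, once

        using var store = new X509Store(StoreName.CertificateAuthority, StoreLocation.CurrentUser);
        store.Open(OpenFlags.ReadWrite);
        foreach (X509ChainElement element in chain.ChainElements)
        {
            if (element.Certificate.Thumbprint == serverCertificate.Thumbprint)
                continue; // skip the leaf, cache only the issuers
            store.Add(element.Certificate);
        }
    }
}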

The actual problem we were trying to solve was that occasionally, RavenDB would fail when the server that serves the missing intermediate certificate (a completely random one, as far as we are concerned) was unavailable. Registering the certificate ourselves in RavenDB’s user store was a way to avoid this issue entirely. It also falls in line with the It Just Works model that we have. The other alternative would be to ask users to set up the full certificate chains locally (for each platform and environment). We would rather handle that ourselves and not have another item on the checklist.

That said, this created an interesting issue. On Linux machines, there are now multiple certificate stores in use. There is the .NET X509 certificate store, which is stored at ~/.dotnet/corefx/cryptography/x509stores/ca/, and there is the system store, which is (usually) stored in /etc/ssl/certs.

The ca-certificates update, however, caused an interesting problem. The relevant certificate was removed from the system store, but it was still in the .NET store. If it had expired, as we expected it to, the situation would have been just fine. The stores would agree and serve the new chain without issue.

However, given that it hadn’t expired yet, we had a discrepancy. .NET will obviously favor its own X509 store if it can find the certificate there, so it kept serving the old chain. The solution is to manually clear the old certificate from that cache as well; you can do that by issuing the following commands:

rm /home/$RAVENDB_USER/.dotnet/corefx/cryptography/x509stores/ca/DAC9024F54D8F6DF94935FB1732638CA6AD77C13.pfx
rm /home/$RAVENDB_USER/.dotnet/corefx/cryptography/x509stores/ca/48504E974C0DAC5B5CD476C8202274B24C8C7172.pfx
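If you prefer to do the same cleanup from managed code (run it as the RavenDB user, so that CurrentUser maps to the same store), a sketch equivalent of those commands would be:

using System.Security.Cryptography.X509Certificates;

static class StaleCertCleanup
{
    public static void Run()
    {
        // The same thumbprints as in the rm commands above.
        string[] thumbprints =
        {
            "DAC9024F54D8F6DF94935FB1732638CA6AD77C13",
            "48504E974C0DAC5B5CD476C8202274B24C8C7172",
        };

        using var store = new X509Store(StoreName.CertificateAuthority, StoreLocation.CurrentUser);
        store.Open(OpenFlags.ReadWrite);
        foreach (var thumbprint in thumbprints)
        {
            var matches = store.Certificates.Find(
                X509FindType.FindByThumbprint, thumbprint, validOnly: false);
            foreach (var cert in matches)
                store.Remove(cert);
        }
    }
}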

If you are running on Windows, you can run the following commands to achieve the same goal:

.\PsExec.exe -u "nt authority\Local Service" powershell.exe
 Get-ChildItem cert:/CurrentUser/CA/48504E974C0DAC5B5CD476C8202274B24C8C7172 | Remove-Item 

That would remove the relevant chain from the store and make RavenDB use the right chain. There is no need to restart the service; it should automatically fix itself.

For the future, we are going to take a long look into our policies in this matter. Given the advent of Docker & Cloud services, we feel that the actual management of the entire environment should be left to the environment, not to the RavenDB process itself.

time to read 3 min | 496 words

Tracking down a customer’s performance issue, we eventually traced things down to a single document modification that would grind the entire server to a halt. The actual save was working fine; it was when indexing time came around that we saw the issues. The entire system would spike in terms of memory usage and disk I/O, but CPU utilization wasn’t too bad.

We finally tracked it down to a fairly big document. Let’s assume that we have the following document:

Note that this can be big. As in, multiple-megabyte range in some cases, with thousands of reviews. In the case we looked at, the document was over 5MB in size and had over 1,500 reviews.

That isn’t ideal, and RavenDB will issue a performance hint when dealing with such documents, but it is certainly workable.

The problem was with the index, which looked like this:
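(The index embed also did not survive; what follows is a hypothetical reconstruction in the usual C# index-definition style, consistent with the description below. The class and field names are assumptions.)

using System.Collections.Generic;
using System.Linq;
using Raven.Client.Documents.Indexes;

public class Review
{
    public string book_id { get; set; }
    public int rating { get; set; }
    public string text { get; set; }
}

public class ReviewerProfile
{
    public string name { get; set; }
    public List<Review> reviews { get; set; }
}

public class Reviews_ByBook : AbstractIndexCreationTask<ReviewerProfile>
{
    public Reviews_ByBook()
    {
        // One index entry per review in the document (a fanout index).
        Map = profiles =>
            from profile in profiles
            from review in profile.reviews
            select new
            {
                review.rating,
                review.text,
                book_id = profile
            };

        // Store every indexed field in the index itself.
        StoreAllFields(FieldStorage.Yes);
    }
}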

This index is also set up to store all the fields being indexed. Take a look at the index, and read it a few times. Can you see what the problem is?

This is a fanout index, which I’m not a fan of, but that is certainly something that we can live with. 1,500 results from a single document isn’t even in the top order of magnitude that we have seen. And yet this index will cause RavenDB to consume a lot of resources, even if we have just a single document to index.

What is going on here?

Here is the faulty issue:

[Image: the index definition, with the line that assigns the entire document to an index field highlighted]

Give it a moment to sink in, please.

We are indexing the entire document here, once for each of the reviews in the document. When RavenDB encounters a complex value as part of the indexing process, it will index it as a JSON value. There are some scenarios that call for that, but in this case, what it meant is that we would, for each of the reviews in the document:

  • Serialize the entire document to JSON
  • Store that in the index

5MB times 1,500 reviews gives us a single document costing us nearly 8GB in storage space alone. It will also allocate close to 100GB (!) of memory during its operation (it won’t hold 100GB, just allocate it). Committing such an index to disk requires us to temporarily use about 22GB of RAM and forces us to do a single durable write that exceeds the 7GB mark. Then there is the background work to clean all of that up.

The customer probably meant to index book_id, but got this by mistake, and then we ended up with extreme resource utilization every time that document was modified. Removing this line meant that indexing the document went from ~8GB to 2MB. That is more than three orders of magnitude of difference.

We are going to be adding some additional performance hints to make it clear that something is wrong in such a scenario. We had a few notices, but it was hard to figure out exactly what was going on there.

time to read 6 min | 1027 words

A RavenDB customer called us with an interesting issue. Every now and then, RavenDB would stop processing any and all requests. These pauses could last for as long as two to three minutes and occurred on a fairly random, if frequent, basis.

A team of anteaters was dispatched to look at the issue (best bug hunters by far), but we couldn’t figure out what was going on. During these pauses, there was absolutely no activity on the machine. There was hardly any CPU utilization, there was no network or high I/O load, and RavenDB was not responding to requests; it was also not doing anything else. The logs just… stopped for that duration. That was something super strange.

We have seen similar pauses in the past, I’ll admit. Around 2014 / 2015 we had a spate of very similar issues, with RavenDB pausing for a very long time. Those issues were all because of the GC. At the time, RavenDB would do a lot of allocations and it wasn’t uncommon to end up with the majority of the time spent on GC cleanup. The behavior at those times, however, was very different. We could see high CPU utilization and all metrics very clearly pointed out that the fault was the GC. In this case, absolutely nothing was going on.

Here is what such a pause looked like when we gathered the ETW metrics:

[Image: ETW trace captured during one of the pauses]

Curiouser and curiouser, as Alice said.

This was a big instance, with quite a bit of work going on, so we spent some time analyzing the process behavior. And absolutely nothing appeared to be wrong. We finally figured out that the root cause is the GC, as you can see here:

[Image: trace showing the pause time attributed to the GC]

The problem is that the GC is doing absolutely nothing here. For that matter, we spend an inordinate amount of time making sure that the GC won’t have much to do. I mentioned 2014/2015 earlier; as a direct result of those issues, we completely re-architected RavenDB. The database uses a lot less managed memory in general and is far faster. So what the hell is going on here? And why weren’t we able to see those metrics before? It took a lot of time to find this issue, and GC is one of the first things we check.

In order to explain the issue, I would like to refer you to the Book of the Runtime and the discussion of thread suspension. The .NET GC will eventually need to run a blocking collection; when that happens, it needs to ensure that the heap is in a known state. You can read the details in the book, but the short of it is that there are what is known as GC Safe Points. If the GC needs to run a blocking collection, all managed threads must be at a safe point. What happens if they aren’t, however? Well, the GC will let them run until they reach such a point. There is a whole machinery in place to make sure that this happens. I would also recommend reading the discussion here. And Konrad’s book is a great resource as well.

Coming back to the real issue, the GC cannot start until all the managed threads are at a safe point, so in order to suspend the threads, it will let them run to a safe point and suspend them there. What is a safe point? It is a bit complex, but the easiest way to think about it is that whenever there is a method call, the runtime ensures that the GC would have stable information. The distance between method calls is typically short, so that is great. The GC is not likely to wait long for a thread to come to a safe point. And if there are loops that may take a while, the JIT will do the right thing to ensure that we won’t wait too long.

In this scenario, however, that was very much not the case. What is going on?

[Image: thread stack showing a page fault that prevents the thread from reaching a GC safe point]

We got a page fault, which can happen anywhere, and until we return from the page fault, we cannot get to the GC Safe Point, so all the threads are suspended waiting for this page fault to complete.

And in this particular case, we had a single page fault, reading 16KB of data, that took close to two minutes to complete.

[Image: a single page fault read of 16KB taking close to two minutes]

So the actual fault is somewhere in storage, which is out of scope for RavenDB, but a single slow write had a cascading effect to pause the whole server. The investigation continues and I’ll probably have another post on the topic when I get the details.

For what it is worth, this is a “managed language” issue, but a similar scenario can happen when we are running in native code. A page fault while holding the malloc lock would soon have the same scenario (although I think that this would be easier to figure out).

I wanted to see if I could reproduce the same issue on my side, but ran into a problem. We don’t know what caused the slow I/O, and there is no easy way to simulate it on Windows. On the other hand, Linux has userfaultfd(), so I decided to use that.

The userfaultfd() doesn’t have a managed API, so I wrote something that should suffice for my (very limited) needs. With that, I can write the following code:
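(The original code embed is not shown here. The sketch below assumes a hypothetical UserFaultFd wrapper class that exposes the mmap’ed region and raw fault notifications, standing in for the small P/Invoke shim mentioned above, so it will not compile without writing that wrapper. It does show the shape of the repro and matches the output that follows.)

using System;
using System.Threading;

unsafe class Repro
{
    static void Main()
    {
        // Hypothetical wrapper around mmap + userfaultfd; faults are reported but
        // never resolved, so any touch of the page blocks in the kernel forever.
        using var uffd = new UserFaultFd(pages: 1);

        new Thread(() =>
        {
            ulong addr = uffd.ReadFaultAddress(); // blocks until a fault arrives
            Console.WriteLine($"Got page fault at {addr}");
            // Intentionally never resolve the fault (no UFFDIO_COPY / UFFDIO_ZEROPAGE).
        }).Start();

        new Thread(() =>
        {
            Console.WriteLine($"{(ulong)uffd.Address} about to access");
            byte value = *(byte*)uffd.Address; // page fault; this thread hangs here
            Console.WriteLine(value);          // never reached
        }).Start();

        Thread.Sleep(1000);
        Console.WriteLine("Calling GC...");
        GC.Collect(); // hangs: the faulting thread can never reach a GC safe point
    }
}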

If you’ll run this with: dotnet run -c release on a Linux system, you’ll get the following output:

139826051907584 about to access
Got page fault at 139826051907584
Calling GC...

And that would be… it. The system is hung. This confirms the theory, and it is quite an interesting use of Linux features to debug a problem that happened on Windows.
