The failure of a computer you didn't even know existed

time to read 2 min | 323 words

The title of this post is a reference to a quote by Leslie Lamport: “A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable”.

A few days ago, my blog was down. The website was up, but it was throwing errors about being unable to connect to the database. That is surprising, the database in question is running a on a triply redundant system and has survived quite a bit of abuse. It took some digging to figure out exactly what was going on, but the root cause was simple. Some server that I never even knew existed was down.

In particular the crl.identrust.com server was down. I’m pretty familiar with our internal architecture, and that server isn’t something that we rely on. Or at least so I thought. CRL stands for Certificate Revocation List. Let’s see where it came from, shall we. Here is the certificate for this blog:

image

This is signed by Let’s Encrypt, like over 50% of the entire internet. And the Let’s Encrypt certificate has this interesting tidbit in it:

image

Now, note that this CRL is only used for the case in which a revocation was issued for Let’s Encrypt itself. Which is probably a catastrophic event for the entire internet (remember > 50%).

When that server is down, the RavenDB client could not verify that the certificate chain was valid, so it failed the request. That was not expected and something that we are considering to disable by default. Certificate Revocation Lists aren’t really used that much today. It is more common to see OCSP (Online Certificate Status Protocol), and even that has issues.

I would appreciate any feedback you have on the matter.