With bugs, failures and errors: ever chugging forward
A customer called us about some pretty weird-looking numbers in their system:
You’ll note that the total number of entries in the index across all the nodes does not match. Notice that node C has 1 less entry than the rest of the system.
At the same time, all the indicators are green. As far as the administrator can tell, there is no issue, except for the number discrepancy. Why is it behaving in this manner?
Well, let’s zoom out a bit. What are we actually looking at here? We are looking at the state of a particular index in a single database within a cluster of machines. When examining the index, there is no apparent problem. Indexing is running properly, after all.
The actual problem was a replication issue, which prevented replication from proceeding to the third node. When looking at the index status, you can only see that the entry count is different.
When we zoom out and look at the state of the cluster, we can see this:
There are a few things that I want to point out in this scenario. The problem here is a pretty nasty one. All nodes are alive and well, they are communicating with each other, and any simple health check you run will give good results.
However, there is a problem that prevents replication from properly flowing to node C. The actual details aren’t relevant (a bug that we fixed, to tell the complete story). The most important aspect is how RavenDB behaves in such a scenario.
The cluster detected this as a problem, marked the node as problematic, and raised the appropriate alerts. As a result of this, clients would automatically be turned away from node C and use only the healthy nodes.
From the customer’s perspective, the issue was never user-visible since the cluster isolated the problematic node. I had a hand in the design of this, and I wrote some of the relevant code. And I’m still looking at these screenshots with a big sense of accomplishment.
This stuff isn’t easy or simple. But to an outside observer, the problem started from: why am I looking at funny numbers in the index state in the admin panel? And not at: why am I serving the wrong data to my users.
The design of RavenDB is inherently paranoid. We go to a lot of trouble to ensure that even if you run into problems, even if you encounter outright bugs (as in this case), the system as a whole would know how to deal with them and either recover or work around the issue.
As you can see, live in production, it actually works and does the Right Thing for you. Thus, I can end this post by saying that this behavior makes me truly happy.
Comments
So is the timeline 1) RavenDB "raised the appropriate alerts" 2) customer went in and saw "weird-looking numbers"? 3) shouldn't the bad node have been the first thing to see, not the numbers?
Hey, that's mine!
I am happy that the data didn't take the node down or anything like that, but it still was annoying that the indicator for the alert was transient in the face of a condition that makes the customer wonder "what is going on here and why isn't the system helping me dig deeper here."
Comment preview