Avoid rolling your own leader election algorithm

time to read 4 min | 774 words

Jeremy Miller has an interesting blog post about using advisory locks in Postgres to handle leader elections. This is a topic I spend a lot of time on, so I went over the post in detail. I don’t like this approach, because it has several subtle issues that are going to bite you down the road. All of them are relatively obscure, and all of them are going to happen in production in short order.

Go read the blog post, it explains the reasoning well. The core of the leader election is this:

The idea is that you have a process instance, that has a State() and a Start() methods. On multiple nodes, you are running this method, and it will coordinate using Postgres to ensure that there is only a single process that owns the lock at any given point in time. At least, that is the idea. In practice, there are issues.

Let’s assume that we are protecting a shared resource, such as a printer. We want to serialize access to the printer so two print jobs won’t get their pages mixed together. For simplicity, we’ll assume just two such nodes that compete on the lock.

On startup, one of the nodes will successfully get the lock, and the other will not, resulting in retries. So far, this is as expected.

I’m ignoring for now the lack of error handling, if we cannot start the connection, the whole thing is going to fail. This is sample code, so I’m pointing this out because the code must be resilient to such issues. We may bring up a node before the database is ready, and in this case, you’ll need to retry access the data.

A much more serious problem here is that we have a way for the process to signal that it is broken, but there is no way for the service to tell the process that it is no longer the leader. Let’s assume that a network issue has caused the connection to drop. The code, as written now, has no way of identifying this issue. It is actually worse than expected, because the connection isn’t actually being used. So even if the connection has dropped, the service is not aware of this. Even this, though, is something that can be fixed in a straightforward manner. You can add a cancellation token that the process will listen to.

You also need to keep verifying against the database server that you are still the owner of the lock and that the connection didn’t drop / fail and released it behind your back. And of course, there may be a delay between losing the lock and finding out about that.

That leads us to the most serious problem: Race conditions. In this case, even if the code handled all such scenarios nicely, we have to take into account the fact that we are dealing with separate resources here. In our example, we have Postgres for the leader election and the printer as the protected resource. The two nodes are competing on the lock, and then one of them starts printing. The lock is lost because of a network reset. At this point, Postgres frees the lock and the other node is able to lock it. It starts to run its own printing jobs.

Let’s say that the first node has a way to detect that it lost the lock. There is still the issue of how fast that can happen. It is very likely that at a certain point, you’ll have two nodes that believe that they are the leader. That is a Bad Thing.

A couple of years ago, GitHub was down for more than a day because of exactly this kind of a scenario. I analyzed the issue at the time in detail.

In this case, using the system above, you are pretty much guaranteed to have a messed up printing job, with pages from multiple jobs mixed together.

If you really care about consistency in the leader operations, you can’t just run things using a leader election. You have to run everything through the same mechanism. In GitHub’s example, they used Raft (a distributed consensus algorithm), but they used that to make decisions on a separate system, so there was a guarantee for inconsistency in that system.

In other words, you are either all in to distributed consensus or you should be out. Note that being out is fine, if you don’t care about short periods of multiple leaders. But if you need to ensure that this is the case, you cannot make it work without building it properly from the ground up.