Paxos enlightment

time to read 6 min | 1132 words

Paxos is an algorithm used to reach consensus among a group of machines, which is resilient to failures. For a long time, I had a really hard time understand Paxos. Or, to be rather more exact, I didn’t have an issue with Paxos per-se, I understood the protocol. What I had a trouble with is its application.

My problem always was that I couldn’t figure out what you do with it. That had to do with a basic problem on my part, I failed to understand how to go from a shared consensus on a numeric value to something that is actually useful. After a while, I had enough of feeling stupid, and I started reading all the material that I could on that, including going through the source codes of available Paxos implementations (such as libpaxos). The main problem was that I had a huge misconception in my head.

I kept thinking that Paxos is a consensus algorithm to arrive at a value in a distributed system. It isn’t, and because I kept thinking that it is, I had a really hard time understand what it does and how to apply it.

Leslie Lamport [pdf], the original author of the algorithm, describe the goal of Paxos as follows:

Assume a collection of processes that can propose values. A consensus algorithm ensures that a single one among the proposed values is chosen. If no value is proposed, then no value should be chosen. If a value has been chosen, then processes should be able to learn the chosen value.

This is from the Paxos Made Simple paper, I don’t know about you, but I have a hard time going from proposed value to something useful. What finally made everything click was the Paxos Made Code, which describe the implementation of libpaxos. While reading the paper, I had a light bulb moment.

Paxos is an algorithm used to ensure consist ordering semantics over a set of values in a cluster of machines. Why is this important? Because if you have consistent ordering over a set of values, and the values are events (or commands, or states), you can be sure that all machines in the cluster have either the same state or a previous version of the state.

Let us see how this can be useful. We have the following scenario:

  • Jane has no money in her back account.
  • Joe sends 100$ to Jane
  • When Jane is notified that it got the 100$, she send a 50$ check to the IRS
  • The IRS cash the check and spend it on something stupid.

image

Without a consistent ordering, each machine in the cluster may view the events in any order, which means that the following three timelines are allowed:

image

As you can imagine, Jane isn’t very happy about that overdraft fee she was just charged with. It is important to note that Paxos sole use here is to ensure that all machines will have a consistent view of events across all machines. That view might not be the same as the order those events showed up. It is possible to give that guarantee as well, on top of Paxos, but that isn’t the topic of this post.

Now that we all (hopefully) understand what Paxos is and what it is used for, let us talk about the algorithm itself.

Paxos has three actor types (Usually, the different actors are all part of the same system, either all together or two of the roles together), we will assume that we have 3 of each in our cluster and that we are interested in events ordered by sequential integer event ids with no gaps:

image

When you want to make a change in the system, you go to the proposer and tell it that you want to add event SendCheck { To= “IRS”, Amount = 50 } to the system.

The proposer then check what is the latest event id that it knows about, increment it, and then ask all the acceptors in the cluster to reserve that event id for its use. (Please note that I intentionally skip the details of how this is done, I am trying to get a high level description here, you can read the actual algorithm description for all the details).

There are several options here:

  • Another proposer is currently trying to reserve that event id.
  • Another proposer successfully claimed this event id.
  • Another proposer tried to claim this id and then crashed midway.
  • Etc… :-)

What Paxos ensures is that in the end, even in the presence of failures of network and machines, the proposer will be able to write the SendCheck { To= “IRS”, Amount = 50 } to an event id in such a way that no other machine will see another value in that location and that eventually all machines in the cluster will see that value in that location.

It is important to understand that even with Paxos, the following timelines are possible:

image

That is, due to some error, we may not be fully up to date on some machines (as seen on machine #2) or that we have missing events (as seen on machine #3).

What Paxos provides is that there wouldn’t be a scenario in which we have missing events and are not explicitly aware of that. At that point, we can defer processing of cashing the check until we know what event we are missing, explicitly decide to ignore the discontinuity in the time line or something else that fit the business needs.

In order to understand how this all works, I had to write my own implementation of Paxos in C#. There doesn’t seem to be anything like that available to the public, and I (at least) find the code I wrote much easier to understand than libpaxos’ C or Erlang implementations. You can find the implementation here: http://github.com/ayende/Paxos.Demo