In the field of computer science, the Paxos algorithm is notorious for how difficult it is to understand. I had to learn the Paxos algorithm in my distributed systems class. I even have "implemented" it by translating Leslie Lamport's TLA+ to Python. But I didn't understand it until much much later.

Now I have a better understanding of Paxos than I used to, I want to explain it to other people. Not because I'd like to help people, rather, I find that explaining things is a very good way to find blind spots in my own understanding.

So, where do we start? Personally, I dislike explanations that start with a step-by-step breakdown of the algorithm, followed by a proof of why those steps do what they claim to do. Instead, I much prefer to start with the problem the algorithm tries to solve, then iteratively come up with a solution together with the reader. So that's what I am going to do. And now you understand the title.

Small disclaimer: The glossaries used in this article is different from what is commonly used for Paxos. I just picked the ones that made the most sense for my narrative.

The problem

The distributed consensus problem is widely useful, so the reader probably doesn't need to be motivated. Here I will just simply state the problem.

There is a group of agents (let's call them CLIENTs\sc{CLIENT}s), who want to choose a number among their selections. Any number is fine, as long as everyone agrees on the same number.

Here, there are a few assumptions we will make to make this problem meaningful:

  • All the agents - including but not limited to the CLIENTs\sc{CLIENT}s, as we will add more types of agents later - are well-behaved. Meaning they all execute the prescribed algorithms faithfully, and don't maliciously try to trick other agents. (If you like jargons: Byzantine failures don't occur.)
  • Agents can talk to each other by sending each other messages, but the messages they send to each other could take arbitrarily long before reaching their destination, and might get lost (but never altered).

The agents could also "fail". However, failing is equivalent to all messages sent to/from that agent being lost forever. So whether we have this assumption or not won't change the algorithm we come up with.

Also, to not complicate things, we are only solving the "single-round" consensus problem in this article, meaning as the output of this algorithm, all of the CLIENTs\sc{CLIENT}s will get a single number which they agree on.

Solution searching adventure

Iteration 0

When trying to solve a complex problem such as this one, it's usually a good idea to start by simplifying the problem. As a start, let's just ignore the need to be reliable entirely.

If we throw reliability out the window, it should be easy to come up with a very simple solution: we add an agent (let's call it COORDINATOR\sc{COORDINATOR}). The CLIENTS\sc{CLIENTS} send whatever number they pick to the COORDINATOR\sc{COORDINATOR} in a PROPOSAL(clienti,x)\sc{PROPOSAL}(client_i, x) message, where xx is the number proposed by the ii-th CLIENT\sc{CLIENT}. The COORDINATOR\sc{COORDINATOR} picks an arbitrary proposal (say, xx'), and informs the other CLIENTs\sc{CLIENT}s about this decision. Specifically, the COORDINATOR\sc{COORDINATOR} will just reply with a CHOSEN(x)\sc{CHOSEN}(x') message to all the PROPOSAL()\sc{PROPOSAL}(\ldots) messages it has received and will receive.

If we assume no messages ever get lost, it is quite easy to see that every CLIENT\sc{CLIENT} will get a number. And because only one number is ever chosen, they will all get the same number.

It is also easy to see why this solution is impractical: it has a single point of failure. Once the singular COORDINATOR\sc{COORDINATOR} fails, no further progress can be made.

Iteration 1

To improve this almost looks easy at first glance: just add more COORDINATORs\sc{COORDINATOR}s!

Sure, more COORDINATORs\sc{COORDINATOR}s would remove the single point of failure. However, if there are more than one COORDINATORs\sc{COORDINATOR}s, they might individually make different decisions, which results in the CLIENTs\sc{CLIENT}s having disagreement.

What if we let the COORDINATORs\sc{COORDINATOR}s reach an agreement among themselves before responding? But wait, doesn't that sound familiar? Having a group of agents reaching an agreement, that's exactly what we added the COORDINATORs\sc{COORDINATOR}s to solve. We just made the problem cyclic.

Let's take a step back. Is there a way for the clients to reach an agreement without having the COORDINATORs\sc{COORDINATOR}s communicate with each other?

In other words, among the decisions of the COORDINATORs\sc{COORDINATOR}s, is there an deterministic algorithm to pick out a specific one that is robust against message losses?

This might sound hard, but it's actually quite simple: pick the decision that is backed by more than half of the COORDINATORs\sc{COORDINATOR}s.

There can't be two decisions both with more than half of the COORDINATORs\sc{COORDINATOR}s backing them; and if a decision doesn't have that many COORDINATORs\sc{COORDINATOR}s backing it, it won't appear to have more backing COORDINATORs\sc{COORDINATOR}s through message losses.

Since this approach resembles a majority vote, let's call COORDINATOR\sc{COORDINATOR} decisions VOTE(coordi,x)\sc{VOTE}(coord_i, x) from now on, where xx is the number picked by the ii-th COORDINATOR\sc{COORDINATOR}. Each COORDINATOR\sc{COORDINATOR} has a single vote, because each of them only makes a single decision.

Obviously, our solution cannot be infinitely reliable. If more than half of the COORDINATORs\sc{COORDINATOR}s went down, there will never be a majority reached. But this is already vastly better than our first solution, and the reliability scales with the number of COORDINATORs\sc{COORDINATOR}s. So we will call it good enough.

Sadly, this solution doesn't actually work: there might not be a majority at all! For example, it's possible that three of the proposals each get a third of the votes. We would have a stalemate in that case.

Iteration 2

Again, a solution seems straightforward: just try again in case of a stalemate.

But then again, things aren't that simple.

First of all, the COORDINATORs\sc{COORDINATOR}s need to be made aware of a retry. Otherwise, because each COORDINATOR\sc{COORDINATOR} only has one vote, they won't be able to vote again even if the CLIENTs\sc{CLIENT}s retry.

To do that, we attach an attempt id to all the messages sent. i.e. PROPOSAL(clienti,x)\sc{PROPOSAL}(client_i, x) becomes PROPOSAL(#attempt,clienti,x)\sc{PROPOSAL}(\#attempt, client_i, x), and so forth. Each time a CLIENT\sc{CLIENT} retries, it bumps #attempt\#attempt to the maximum #attempt\#attempt it knows of plus 1. And the COORDINATORs\sc{COORDINATOR}s should only responds to messages with the most recent #attempt\#attempt.

Hopefully the intent of the #attempt\#attempt number is clear. (Let me know if not.)

Are we good now? Unfortunately, no. Consider this scenario:

There were 2 clients. They proposed their numbers, the COORDINATOR\sc{COORDINATOR} voted on them and all agreed on a single number, x1x_1, all is good. But, all of the VOTE()\sc{VOTE}(\ldots) messages got lost on the way to client2client_2, while client1client_1 received all of the messages just fine. At this point, client1client_1 thought x1x_1 is the number, but client2client_2 went on to retry. The COORDINATORs\sc{COORDINATOR}s voted again, and got x2x_2. This time, all the messages sent to client1client_1 got lost.

And behold, we got the two clients to disagree.

There is an important insight to be had here. Whenever a COORDINATOR\sc{COORDINATOR}, say coordicoord_i, sends out a VOTE(,coordi,x)\sc{VOTE}(\ldots, coord_i, x), there is a chance that some CLIENT\sc{CLIENT} would adopt xx. If coordicoord_i ever sends out two votes with different xx, there is a chance that some of the CLIENTs\sc{CLIENT}s would disagree.

In other words, once a COORDINATOR\sc{COORDINATOR} has revealed its vote, it has to stick to it.

This seems to run contrary to our attempt: if the COORDINATORs\sc{COORDINATOR}s cannot change their votes, what's the point of retrying? A stalemate will be a stalemate forever.

Looks like we reached a dead end with this type of voting. It appears the problem stems from the fact that the COORDINATORs\sc{COORDINATOR}s have to commit to their votes.

So, what if we introduce a form of non-commitment voting?

Iteration 3

Let's explore this idea. Say, the COORDINATORs\sc{COORDINATOR}s could now send a TENTATIVEVOTE(#attempt,coordi,x)\sc{TENTATIVE}\sc{VOTE}(\#attempt, coord_i, x) message, to tentatively vote for xx.

Obviously, the CLIENTs\sc{CLIENT}s couldn't adopt xx right away. So what's this vote good for?

Ah, right, it could get us to a majority.

It is correct that tentative votes don't lead directly to an agreement among CLIENTs\sc{CLIENT}s, but it can show us when a majority has formed among the COORDINATORs\sc{COORDINATOR}s.

Once a CLIENT\sc{CLIENT} sees a majority tentative vote, it can then message the COORDINATORs\sc{COORDINATOR}s to ask for an actual vote. (Let's call this message PLEASEVOTE(#attempt,clienti)\sc{PLEASE}\sc{VOTE}(\#attempt, client_i)). Intuitively, the COORDINATORs\sc{COORDINATOR}s have to make the same vote in the actual vote as their tentative votes.

If all goes well, we would get a majority and an agreement. If there is no majority, the COORDINATORs\sc{COORDINATOR}s won't even start a vote, so they are free to change their mind. So the CLIENTs\sc{CLIENT}s could start another attempt which might have a different outcome.

What if things don't go well? What if the PLEASEVOTE\sc{PLEASE}\sc{VOTE} messages weren't received by some of the COORDINATORs\sc{COORDINATOR}s? In that case, some of the COORDINATORs\sc{COORDINATOR}s would have voted, and their decisions cannot be changed. That is to say, in all subsequent attempts, these COORDINATORs\sc{COORDINATOR}s will always vote for what they have voted for, whether it's a tentative vote, or the actual vote. But that doesn't create a problem for us. There was a majority in the tentative votes, and now we solidified part of the tentative votes. There is at least one way we can still reach a majority in the next round: everyone votes the same as they did in this round. And we can prove this inductively for all future rounds.

From this, we can have a rough image of how the algorithm functions: as attempts are being made, more and more COORDINATORs\sc{COORDINATOR}s start to make up their mind which number they will commit to, while making sure a majority could still be reached. Eventually, after all the COORDINATORs\sc{COORDINATOR}s have made up their minds, by induction there must be a majority among them. From that point on, they will just repeatedly broadcast this decision to the CLIENTs\sc{CLIENT}s, until all the CLIENTs\sc{CLIENT}s have got that message.

And Viola, we have a working algorithm.

The actual algorithm

Let's clean up our thoughts, and condense the description of our algorithm so it's easy to understand.

First, there is the #attempt\#attempt number that is attached to every message. This number is bumped every time a new attempt is made. If a CLIENT\sc{CLIENT} sees a message with a #attempt\#attempt bigger then its most recent #attempt\#attempt, it knows a new attempt has been initiated, so it would abort its current attempt and participate in the newer one. If a COORDINATOR\sc{COORDINATOR} sees a message with a #attempt\#attempt smaller than the biggest #attempt\#attempt it has ever seen, it would know that message is stale, so it will drop the message.

With that out of the way, let's describe what happens in an attempt.

Each attempt can be split into two phases:

  • Phase 1: The CLIENTs\sc{CLIENT}s each send its PROPOSAL()\sc{PROPOSAL}(\ldots) to the COORDINATORs\sc{COORDINATOR}s. The COORDINATORs\sc{COORDINATOR}s reply with a TENTATIVEVOTE(,x)\sc{TENTATIVE}\sc{VOTE}(\ldots, x). Each CLIENT\sc{CLIENT} waits for the tentative votes until they reach a majority. If a majority is not reached, retry.
  • Phase 2: If a majority is reached, the CLIENTs\sc{CLIENT}s send PLEASEVOTE\sc{PLEASE}\sc{VOTE}, and the COORDINATORs\sc{COORDINATOR}s actually vote. Their actual votes would be the same as their respective tentative votes. Each CLIENT\sc{CLIENT} waits for the votes until they reach a majority, and then adopt the majority number.

Back to Paxos

Our algorithm does look a bit different from the official Paxos. For one, the name of the agents are different. What we call CLIENTs\sc{CLIENT}s, Lamport calls PROPOSERs\sc{PROPOSER}s; and COORDINATORs\sc{COORDINATOR}s, ACCEPTORs\sc{ACCEPTOR}s.

Besides that, there are protocol differences too.

Firstly, the COORDINATORs\sc{COORDINATOR}s don't have to send the TENTATIVEVOTE()\sc{TENTATIVE}\sc{VOTE}(\ldots) to everyone, they just need to send it to the CLIENT\sc{CLIENT} they agree with. This way we won't have every CLIENTs\sc{CLIENT}s sending PLEASEVOTE\sc{PLEASE}\sc{VOTE} at the same time, that would be inefficient.

After, we notice that the proposed number is unnecessarily sent multiple times in phase 1 and phase 2. The phase 1 is used to reach a majority, the proposed number is not actually important in that phase. So we remove the xx from PROPOSAL\sc{PROPOSAL}; and in TENTATIVEVOTE\sc{TENTATIVE}\sc{VOTE}, instead voting for a number, they vote for a CLIENT\sc{CLIENT}, by sending the tentative vote only to that CLIENTS\sc{CLIENTS}. Finally, after a client received a majority of tentative votes, it sends a PLEASEVOTE(x)\sc{PLEASE}\sc{VOTE}(x), so all the COORDINATORs\sc{COORDINATOR}s got that message will vote xx. Of course, if a COORDINATOR\sc{COORDINATOR} has already voted in a previous round, it has to tell the CLIENT\sc{CLIENT}, so it could pick the already voted xx, otherwise its PLEASEVOTE(x)\sc{PLEASE}\sc{VOTE}(x) will be wasted, as the COORDINATORs\sc{COORDINATOR}s couldn't change their minds.

(The modified algorithm has slightly better property. In our algorithm, we just make sure a majority is still possible after each attempt; in Paxos, each round the COORDINATORs\sc{COORDINATOR}s that vote will all vote for the same number.)

With this little changes, we can map our algorithm back to Paxos:

Agents:

  • CLIENT\sc{CLIENT} => PROPOSER\sc{PROPOSER} (which makes proposals) and LEARNER\sc{LEARNER} (which adopts the resulting number)
  • COORDINATOR\sc{COORDINATOR} => ACCEPTOR\sc{ACCEPTOR}

Messages:

  • PROPOSAL(#attempt,clienti)\sc{PROPOSAL}(\#attempt, client_i) => PREPARE(#attempt,clienti)\sc{PREPARE}(\#attempt, client_i)
  • TENTATIVEVOTE(#attempt,coordi)\sc{TENTATIVE}\sc{VOTE}(\#attempt, coord_i) => PROMISE(#attempt,coordi)\sc{PROMISE}(\#attempt, coord_i)
  • PLEASEVOTE(#attempt,clienti,x)\sc{PLEASE}\sc{VOTE}(\#attempt, client_i, x) => ACCEPT(#attempt,clienti,x)\sc{ACCEPT}(\#attempt, client_i, x)
  • VOTE(#attempt,coordi,x)\sc{VOTE}(\#attempt, coord_i, x) => ACCEPTED(#attempt,coordi,x)\sc{ACCEPTED}(\#attempt, coord_i, x)