You’re Reading Part 2 of a 3-Part Series on Paxos Consensus Algorithms in Distributed Systems.


In part 1 of this series, we looked at why consensus is such a tricky problem in distributed systems and how Paxos provides a way out. Through Alice and Bob’s battle for a lock, we saw how Paxos uses majority agreement to make decisions that can’t be undone once chosen.


Even when nodes fail, recover, or rejoin, the system still converges safely on one value. That’s the magic of Paxos—it keeps things consistent in an inconsistent world. In Part 2, we’ll dive into the messier edge cases and see how Paxos still manages to hold things together.


How Paxos Handles Edge Cases


In Part 1, we saw Paxos work smoothly: Alice proposed a value, the nodes accepted it, and even when Bob joined later, the algorithm forced him to carry forward Alice’s decision. Real systems, however, aren’t always this tidy. Messages can get lost, nodes can crash, and multiple proposers might compete simultaneously.


Let’s walk through a few messy scenarios with our familiar friends, Alice and Bob.


Edge Case 1 – Lost Commit (Alice’s Proposal Stalls)


Alice once again proposes AliceLock with proposal number 1001.
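A quick aside on those numbers: proposal numbers must be unique and totally ordered across all proposers. This series doesn't pin down a scheme, but one common approach (sketched below purely as an illustration, not anything the protocol mandates) combines a round counter with a per-proposer id:

```python
def proposal_number(round_num: int, proposer_id: int, max_proposers: int = 10) -> int:
    """One common scheme for unique, increasing proposal numbers:
    a higher round always dominates, and two proposers can never
    collide because each owns a distinct id."""
    return round_num * max_proposers + proposer_id

alice_id, bob_id = 1, 2
print(proposal_number(100, alice_id))  # 1001: Alice's number in this walkthrough
print(proposal_number(200, bob_id))    # 2002: a later, higher number for Bob
```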





Several things can go wrong in this round. Let's unpack each failure:


- Node 4 never sends back its final commit message (network glitch): as long as a majority accepted, AliceLock is already chosen. Only the acknowledgment is missing; Alice may not learn the outcome, but the decision itself stands.

- Node 4 goes down after accepting but before replying: acceptors persist their accepted state before responding, so when Node 4 recovers it still holds (n=1001, AliceLock). Again, the decision survives.

- Alice (the proposer) disappears: the chosen value lives on the acceptors, not on Alice. Any future proposer that runs a prepare round against a majority will discover AliceLock and be forced to carry it forward.



Now Bob arrives with BobLock (n=2001). By this time, let's say nodes 2 and 3 are back online. When Bob runs his prepare phase, any majority he contacts overlaps with the majority that accepted AliceLock, so at least one promise reports that value, and Bob (and through him, nodes 2 and 3) eventually learns that Alice holds the lock.



Lesson: Even if a commit acknowledgment is lost, Paxos ensures safety: Bob cannot override AliceLock. But liveness suffers; Bob's own value, BobLock, never makes progress, since his round is forced to re-propose AliceLock.
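To make this concrete, here is a minimal Python sketch of the acceptor side (the class and method names are my own illustration, not a reference implementation). The detail doing the work in this edge case is that a promise carries back any value the acceptor has already accepted:

```python
class Acceptor:
    """A single Paxos acceptor. In a real system this state is written to
    stable storage before replying, which is why Node 4 still holds
    AliceLock after a crash."""

    def __init__(self):
        self.promised_n = 0        # highest proposal number we promised
        self.accepted_n = None     # number of the proposal we accepted, if any
        self.accepted_value = None

    def handle_prepare(self, n):
        # Phase 1: promise to ignore anything numbered below n, and report
        # any value we already accepted so the proposer must adopt it.
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted_n, self.accepted_value)
        return ("reject", self.promised_n, None)

    def handle_accept(self, n, value):
        # Phase 2: accept only if we haven't promised a higher number.
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n, self.accepted_value = n, value
            return "accepted"
        return "rejected"

node4 = Acceptor()
node4.handle_prepare(1001)
node4.handle_accept(1001, "AliceLock")   # Alice's value lands
print(node4.handle_prepare(2001))        # Bob's prepare...
# -> ('promise', 1001, 'AliceLock'): Bob now knows he must carry AliceLock
```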


Edge Case 2 – Dueling Proposers (Alice and Bob Propose Simultaneously)




Step 1 – First accepts

Say Node 1 accepts Alice's proposal first, while Node 5 accepts Bob's. Now both Alice and Bob have one vote each.


Step 2 – Other nodes respond differently

Say Node 2 sides with Alice while Node 4 sides with Bob.

So far: Alice holds Nodes 1 and 2, Bob holds Nodes 4 and 5, and Node 3 has not yet responded to either proposer.


Step 3 – Node 3 crashes


Before hearing from either proposer, Node 3 goes down.

This leaves a 2–2 split: Alice with Nodes 1 and 2, Bob with Nodes 4 and 5, and the tiebreaker offline.

Neither proposer can form a majority (3 out of 5) with Node 3 down.


Step 4 – Stalemate (temporary)

Both proposers are stuck: each holds two votes, and neither can complete its round without a third acceptor.

Lesson: Safety is preserved. No conflicting value has been committed, because neither proposer reached a quorum.
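The arithmetic behind the stalemate is plain majority counting. A quick sketch (the `majority` helper is just for illustration):

```python
def majority(cluster_size: int) -> int:
    """Smallest number of acceptors that forms a quorum."""
    return cluster_size // 2 + 1

votes = {"AliceLock": 2, "BobLock": 2}   # Node 3 is offline
quorum = majority(5)                     # 3 of 5

for value, count in votes.items():
    status = "chosen" if count >= quorum else "stalled"
    print(f"{value}: {count}/{quorum} -> {status}")
# AliceLock: 2/3 -> stalled
# BobLock: 2/3 -> stalled
```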


Step 5 – Retry with a higher number

Suppose Bob retries with a new proposal number n=2002.


So Bob learns from the promises: Nodes 4 and 5 already accepted BobLock under his earlier number (say n=2001), while Nodes 1 and 2 accepted AliceLock under a lower number.

By Paxos rules, he must carry forward the value of the highest-numbered accepted proposal, which here is BobLock.
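That "carry forward" rule fits in a few lines. Here is a hedged sketch of the proposer's value-selection step after Phase 1 (the `Promise` shape and `choose_value` name are my own, not part of any standard API):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Promise:
    accepted_n: Optional[int]     # number of the sender's accepted proposal, if any
    accepted_value: Optional[str]

def choose_value(promises: list[Promise], my_value: str) -> str:
    """Paxos rule: if any acceptor in the quorum already accepted a value,
    propose the value with the highest accepted proposal number;
    otherwise we are free to propose our own."""
    accepted = [p for p in promises if p.accepted_n is not None]
    if not accepted:
        return my_value
    return max(accepted, key=lambda p: p.accepted_n).accepted_value

# Bob's retry with n=2002: two nodes report BobLock (n=2001),
# two report AliceLock under a lower number (say n=1001).
promises = [Promise(2001, "BobLock"), Promise(2001, "BobLock"),
            Promise(1001, "AliceLock"), Promise(1001, "AliceLock")]
print(choose_value(promises, "BobLock"))  # -> BobLock
```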


Step 6 – Consensus reached

Bob sends accept requests with (n=2002, BobLock). The four live nodes accept, comfortably clearing the quorum of three.

Final decision: BobLock is chosen.


Edge Case 3 – Minority Partition (No Quorum)



Suppose a network partition occurs and only 2 out of 5 nodes are reachable (say nodes 4 and 5).


Alice proposes AliceLock (say n=4001) to node 4; at the same time, Bob proposes BobLock (n=4002) to node 5. Each proposer can reach at most 2 of the 5 acceptors, which is short of the quorum of 3.


Result: No value is chosen.

Lesson: Paxos prioritizes safety over availability. With fewer than a majority of nodes reachable, the system cannot make progress. This is why Paxos-based systems may stall under minority partitions; it is the price of never committing conflicting values.
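Here is an illustrative check of why both proposals stall (the reachability set stands in for real message sends; all names are hypothetical):

```python
def can_reach_quorum(reachable: set[int], cluster: set[int]) -> bool:
    """Phase 1 under a partition: promises can only come from reachable
    acceptors, but a quorum is a majority of the WHOLE cluster."""
    quorum = len(cluster) // 2 + 1
    return len(reachable) >= quorum   # assume every reachable node promises

cluster = {1, 2, 3, 4, 5}
print(can_reach_quorum({4, 5}, cluster))     # False: 2 < 3, both proposals stall
print(can_reach_quorum({1, 2, 3}, cluster))  # True: a majority side can proceed
```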


Edge Case 4 – Out-of-Order / Delayed Messages

Now consider message delays. Say Alice starts a round with n=5001, but her accept messages get stuck in the network. Meanwhile, Bob completes a full round with a higher number (say n=5002), so every node has promised 5002.

But later, the delayed accept message from Alice (n=5001, AliceLock) arrives at Node 1. Node 1 has already promised 5002, so it simply rejects the stale message.


Lesson: Paxos tolerates asynchronous, delayed, and reordered messages. Outdated proposals are ignored once a higher-numbered promise exists, preserving safety.
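A tiny trace of that rejection, reusing the scenario's numbers (the function is illustrative, and a full acceptor would also record the value when it does accept):

```python
# Node 1's durable state after promising Bob's higher-numbered round.
promised_n = 5002

def on_accept(n: int, value: str) -> str:
    """Reject any accept request numbered below our promise."""
    if n < promised_n:
        return f"rejected stale accept (n={n}, {value})"
    return f"accepted (n={n}, {value})"

print(on_accept(5001, "AliceLock"))  # rejected: a higher promise exists
print(on_accept(5002, "BobLock"))    # accepted
```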


Wrapping Up


So far, we've seen Paxos handle a range of real-world messiness:

- Lost commit acknowledgments and crashed proposers (Edge Case 1)
- Dueling proposers splitting the vote (Edge Case 2)
- Minority partitions that block quorum (Edge Case 3)
- Delayed and out-of-order messages (Edge Case 4)



Paxos guarantees one thing above all else: safety is never compromised. But this comes at the cost of liveness in certain situations: proposers can starve, partitions can halt progress, and competing proposers can livelock one another by repeatedly preempting each other's rounds.


In Part 3, we’ll explore how Raft (and Multi-Paxos) address these practical challenges, making leader-based consensus simpler and more efficient in real-world deployments.