Understanding Consensus Algorithms: Raft, Paxos, and Why They Matter
I first encountered consensus algorithms when a Redis cluster split-brain incident took down our job queue at 4 AM. Three Redis nodes each thought they were the primary, accepting writes independently. When the network partition healed, we had three conflicting versions of the same data. The recovery took 6 hours of manual intervention.
That incident taught me that distributed systems don't "just work." Behind every database cluster, every Kubernetes deployment, every distributed lock — there's a consensus algorithm making sure nodes agree on the state of the world. Understanding these algorithms won't make you implement your own (please don't), but it will make you a better operator and architect of the systems that use them.
This guide explains Raft, Paxos, and related algorithms in plain language, with practical implications for real-world systems.
Part 1: The Consensus Problem
Imagine you have three database servers. A client writes x = 5. How do you ensure all three servers agree that x = 5, even if one server crashes, the network drops packets between servers, or a server is slow?
This is the consensus problem: getting multiple nodes to agree on a single value, even in the presence of failures. It sounds simple. It is, in fact, one of the hardest problems in computer science.
Why It's Hard: The FLP Impossibility Result
In 1985, Fischer, Lynch, and Paterson proved that no deterministic algorithm can guarantee consensus in an asynchronous distributed system with even one faulty process. This is known as the FLP impossibility result, and it's arguably the most important theoretical result in distributed systems.
In practice, this means every consensus algorithm must make tradeoffs. Either it can get stuck (sacrifice liveness) or it can disagree (sacrifice safety). Practical algorithms like Raft choose to sacrifice liveness: they guarantee they'll never disagree, but they might temporarily be unable to make progress during a network partition.
The CAP Theorem Connection
Brewer's CAP theorem states that a distributed system can provide at most two of three guarantees: Consistency, Availability, and Partition tolerance. Since network partitions are inevitable, the real choice is between consistency (CP) and availability (AP).
| System Type | Choice | Consensus Algorithm | Example |
|---|---|---|---|
| CP | Consistency over availability | Raft, Paxos | etcd, ZooKeeper, CockroachDB |
| AP | Availability over consistency | Gossip, CRDT | Cassandra, DynamoDB, Riak |
Part 2: Paxos — The Original (and Incomprehensible)
Paxos was devised by Leslie Lamport in the late 1980s, described in a paper about a fictional Greek parliament (not published until 1998), and has confused distributed systems engineers ever since. Google's Chubby lock service and many internal systems at major companies use Paxos variants; Apache ZooKeeper uses the closely related ZAB protocol.
The Basic Idea
Paxos has three roles: Proposers, Acceptors, and Learners.
Phase 1: Prepare
1. The Proposer picks a proposal number N (unique and increasing).
2. The Proposer sends Prepare(N) to a majority of Acceptors.
3. Each Acceptor: if N is the highest number it has seen, it responds with Promise(N), along with any value it has already accepted; otherwise, it ignores the request.

Phase 2: Accept
4. If the Proposer receives a Promise from a majority: if any Acceptor reported a previously accepted value, the Proposer must use that value; otherwise, it can propose its own value V.
5. The Proposer sends Accept(N, V) to the Acceptors.
6. Each Acceptor accepts (N, V) if it hasn't promised a higher number; otherwise, it rejects.

Phase 3: Learn
7. Once a majority of Acceptors accept, the value is "chosen".
8. Learners are notified of the chosen value.
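The Acceptor's side of the two phases above can be sketched in a few lines of Python. This is an illustrative toy, not a usable implementation (the class and method names are mine, and a real Acceptor must persist its state to disk before replying):

```python
# Sketch of a single-decree Paxos Acceptor. Illustrative only: a real
# implementation must durably persist promised_n / accepted_* before replying.

class Acceptor:
    def __init__(self):
        self.promised_n = -1      # highest proposal number promised so far
        self.accepted_n = None    # proposal number of the accepted value, if any
        self.accepted_v = None    # the accepted value, if any

    def prepare(self, n):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised_n:
            self.promised_n = n
            # The reply carries any previously accepted (n, v), which forces
            # the proposer to re-propose that value instead of its own.
            return ("promise", self.accepted_n, self.accepted_v)
        return ("reject",)

    def accept(self, n, v):
        """Phase 2: accept (n, v) unless a higher number was promised."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n, self.accepted_v = n, v
            return ("accepted", n, v)
        return ("reject",)
```

A Proposer that later runs Prepare with a higher number will learn the accepted value from the Promise replies and must adopt it, which is how Paxos keeps a chosen value stable.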
Why Paxos Is Hard
The basic Paxos algorithm (Single-Decree Paxos) agrees on a single value. Real systems need to agree on a sequence of values (a log). Multi-Paxos extends the algorithm for this, but Lamport's paper doesn't fully specify multi-Paxos, leaving implementation details to the reader. Different implementations fill these gaps differently, leading to subtle incompatibilities.
As the Raft paper authors noted, "there are significant gaps between the description of the Paxos algorithm and the needs of a real-world system ... the final system will be based on an unproven protocol."
Part 3: Raft — Consensus for Humans
Raft was designed by Diego Ongaro and John Ousterhout specifically to be understandable. The Raft paper, In Search of an Understandable Consensus Algorithm (2014), is one of the most cited papers in distributed systems, and for good reason: it takes the same problem Paxos solves and makes it comprehensible.
The Key Insight: Leader-Based
Raft simplifies consensus by electing a single leader. All writes go through the leader. The leader replicates the write to followers. Once a majority acknowledges, the write is committed.
Raft has three states for each node:
1. Leader — handles all client requests, replicates log
2. Follower — passive, responds to leader
3. Candidate — trying to become leader
Normal operation:
Client → Leader → replicate to Followers → majority ACK → committed
Leader election:
1. Followers have a randomized election timeout (150-300ms)
2. If a follower doesn't hear from the leader within the timeout,
it becomes a Candidate
3. Candidate votes for itself and requests votes from other nodes
4. If it gets votes from a majority, it becomes the new Leader
5. The term number increases with each election
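The election steps above can be sketched as a small state machine. This is my own simplified sketch (the class and message shapes are invented, not taken from etcd or any real implementation; real Raft also checks log freshness before granting votes):

```python
import random

# Toy sketch of Raft's election-timeout and candidacy logic.
ELECTION_TIMEOUT_MS = (150, 300)   # randomized range from the text above

class RaftNode:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers            # ids of the other cluster members
        self.state = "follower"
        self.term = 0
        self.voted_for = None
        self.reset_election_timer()

    def reset_election_timer(self):
        # Randomization makes simultaneous candidacies (split votes) rare.
        self.timeout_ms = random.uniform(*ELECTION_TIMEOUT_MS)

    def on_election_timeout(self):
        """No heartbeat heard in time: become a candidate, start an election."""
        self.state = "candidate"
        self.term += 1                 # new, higher term
        self.voted_for = self.node_id  # vote for self
        return {"type": "RequestVote", "term": self.term,
                "candidate": self.node_id}

    def on_votes(self, votes_granted):
        """Become leader if a majority (counting our own vote) said yes."""
        cluster_size = len(self.peers) + 1
        if votes_granted + 1 > cluster_size // 2:
            self.state = "leader"
        return self.state
```

In a 5-node cluster, two granted votes plus the candidate's own vote form the 3-of-5 majority needed to win.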
Log Replication in Detail
```
Leader's log:     [1:x=1] [2:y=2] [3:z=3] [4:w=4]
Follower A's log: [1:x=1] [2:y=2] [3:z=3]          ← one behind
Follower B's log: [1:x=1] [2:y=2] [3:z=3] [4:w=4]  ← up to date
Follower C's log: [1:x=1] [2:y=2]                  ← two behind
Follower D's log: [1:x=1] [2:y=2] [3:z=3] [4:w=4]  ← up to date

Commit index: 4 (a majority has entry 4: Leader, B, D = 3/5)
```

When the Leader receives a new write x=5:
1. The Leader appends [5:x=5] to its log.
2. The Leader sends an AppendEntries RPC to all followers.
3. Followers append [5:x=5] and acknowledge.
4. Once 3/5 nodes have the entry, the Leader advances the commit index to 5.
5. The Leader responds to the client: "write committed".
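The commit rule in step 4 generalizes: an entry is committed once it is stored on a majority of nodes, leader included. A sketch of that calculation (my own helper, not from any real codebase; real Raft additionally requires the entry to belong to the leader's current term before committing):

```python
# The leader tracks, for each follower, the highest log index that follower
# has acknowledged (its "match index"). The commit index is the highest
# index stored on a majority of the cluster.

def commit_index(leader_log_len, follower_match_indexes):
    """Highest log index replicated on a majority (leader counts itself)."""
    indexes = sorted(follower_match_indexes + [leader_log_len], reverse=True)
    majority = len(indexes) // 2 + 1
    return indexes[majority - 1]    # the majority-th highest index
```

With the five-node example above, after B and D acknowledge entry 5 the leader computes `commit_index(5, [3, 5, 2, 5])` and advances the commit index to 5 even though A and C lag behind.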
What Happens When the Leader Dies
Timeline:

```
T=0:    Leader (Node A) is healthy, term 5
T=100:  Leader crashes (network failure, hardware failure, etc.)
T=250:  Follower C's election timer fires (randomized: 250ms)
T=250:  C becomes Candidate, increments term to 6
T=250:  C sends RequestVote to all nodes
T=260:  B and D vote for C (term 6 is higher than their current term 5)
T=260:  C receives majority votes (C + B + D = 3/5)
T=260:  C becomes Leader for term 6
T=270:  C starts sending heartbeats to all followers
T=300:  Normal operation resumes
```

Key insight: the voting round itself takes only ~10ms here (roughly one network round trip); most of the downtime comes from waiting for the randomized election timeout to fire. Clients may experience a brief timeout, but no committed data is lost.
Raft Interactive Visualization
The best way to understand Raft is through the interactive Raft visualization by The Secret Lives of Data. It walks through leader election, log replication, and network partitions with animations. I recommend spending 15 minutes with it.
Part 4: Raft vs Paxos
| Aspect | Raft | Paxos |
|---|---|---|
| Understandability | Designed for clarity | Notoriously difficult |
| Leader | Strong leader (all writes through leader) | No fixed leader in basic Paxos (Multi-Paxos adds one in practice) |
| Log | Log entries must be contiguous | Gaps allowed in log |
| Membership changes | Specified in the paper | Left as exercise to reader |
| Proven correct | Yes (TLA+ specification) | Yes |
| Production use | etcd, CockroachDB, Consul, TiKV | Chubby, Spanner (internally) |
| Performance | Comparable | Comparable |
| Implementations | Many open-source | Fewer complete implementations |
Part 5: Where You Encounter Consensus
You're already using consensus algorithms, even if you don't know it:
etcd and Kubernetes
etcd is a distributed key-value store that uses Raft. Kubernetes stores all its state in etcd — pod schedules, service configurations, secrets. When you kubectl apply a deployment, the change goes through etcd's Raft consensus before it takes effect.
This is why a Kubernetes control plane needs an odd number of etcd nodes (3 or 5, typically): Raft requires a majority to function. With 3 nodes, you can tolerate 1 failure. With 5, you can tolerate 2.
CockroachDB and TiDB
NewSQL databases like CockroachDB use Raft to replicate data across nodes. Each "range" of data has its own Raft group. When you execute a SQL transaction, Raft ensures it's consistently replicated before returning success. The CockroachDB architecture documentation explains this in detail.
Consul and Service Discovery
HashiCorp Consul uses Raft for its leader election and state replication. When a service registers, it's replicated via Raft to ensure all Consul servers agree on the service catalog.
ZooKeeper and Kafka
ZooKeeper uses a Paxos-like protocol called ZAB (ZooKeeper Atomic Broadcast). Apache Kafka used ZooKeeper for leader election and metadata management until KRaft (Kafka Raft), which replaced ZooKeeper with a built-in Raft implementation.
Part 6: Beyond Raft and Paxos
Byzantine Fault Tolerance (BFT)
Raft and Paxos assume nodes are honest — they might crash, but they won't lie. Byzantine fault tolerance handles the case where nodes can be malicious (sending incorrect data, lying about their state). This is what blockchain consensus algorithms solve.
PBFT (Practical Byzantine Fault Tolerance) by Castro and Liskov showed that BFT is practical for small cluster sizes. Blockchain systems use various BFT-derived algorithms.
Gossip Protocols
For systems that prioritize availability over consistency (AP systems), gossip protocols spread information epidemically. Each node periodically shares its state with a random peer, and eventually all nodes converge. Cassandra, DynamoDB, and Redis Cluster use gossip for membership and failure detection.
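The epidemic dynamic is easy to see in a toy simulation (entirely my own construction, not how Cassandra or any real system implements gossip): each round, every node that knows an update tells one random peer, so the informed set can at most double per round and full convergence takes on the order of log(n) rounds.

```python
import random

def gossip_rounds(n_nodes, seed=0):
    """Rounds until a rumor started at node 0 reaches all n_nodes."""
    rng = random.Random(seed)       # seeded for reproducibility
    informed = {0}                  # node 0 starts with the update
    rounds = 0
    while len(informed) < n_nodes:
        # Snapshot: nodes informed this round gossip starting next round.
        for _ in list(informed):
            informed.add(rng.randrange(n_nodes))   # tell one random peer
        rounds += 1
    return rounds

print(gossip_rounds(1000))   # converges in O(log n) rounds
```

Note the tradeoff this buys: no quorum, no leader, no blocking on a partition, but also no point in time at which all nodes are guaranteed to agree.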
CRDTs (Conflict-Free Replicated Data Types)
CRDTs are data structures that can be merged without coordination. They achieve eventual consistency without consensus. Examples include counters, sets, and text (used by collaborative editors). The CRDT.tech website maintains a comprehensive list.
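The classic introductory CRDT is the grow-only counter (G-Counter): each node increments only its own slot, and merging takes the per-slot maximum, which makes merge commutative, associative, and idempotent. A minimal sketch (my own code, not any particular library's API):

```python
# G-Counter: a grow-only distributed counter that needs no consensus.

class GCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}   # node_id -> increments originating at that node

    def increment(self, by=1):
        # A node only ever increments its own slot.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + by

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Per-slot max: safe to apply in any order, any number of times.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)
```

Two replicas can accept increments independently during a partition and reconcile later by merging in either order, with no coordination and no lost updates.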
Part 7: Practical Implications for Developers
Cluster Sizing
Consensus requires a majority (quorum):
- 3 nodes: quorum = 2, tolerates 1 failure
- 5 nodes: quorum = 3, tolerates 2 failures
- 7 nodes: quorum = 4, tolerates 3 failures
Why always odd numbers?
- 4 nodes: quorum = 3, tolerates 1 failure (same as 3 nodes!)
- 6 nodes: quorum = 4, tolerates 2 failures (same as 5 nodes!)
- Even numbers waste resources without improving fault tolerance
Recommendation:
- Development: 1 node (no consensus needed)
- Production: 3 nodes (good balance)
- Critical systems: 5 nodes (tolerates 2 simultaneous failures)
- More than 5: rarely needed, adds latency to writes
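The sizing arithmetic above reduces to two one-liners (helper names are my own):

```python
def quorum(n: int) -> int:
    """Smallest majority of an n-node cluster."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """How many nodes can fail while a quorum remains reachable."""
    return n - quorum(n)

# Even cluster sizes buy nothing: 4 nodes tolerate no more failures than 3.
for n in range(3, 8):
    print(f"{n} nodes: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```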
Performance Implications
Every write must be replicated to a majority before it's confirmed. The leader counts toward the quorum (the entry is already in its own log), so commit latency is an order statistic of the follower response times:

3-node cluster (quorum = 2): the leader needs 1 follower acknowledgment, so write latency ≈ the faster of the two followers' responses.

5-node cluster (quorum = 3): the leader needs 2 follower acknowledgments, so write latency ≈ the second-fastest of the four followers' responses, which is typically slightly higher.

This is why consensus-based databases have higher write latency than single-node databases. The tradeoff is durability.
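As a sketch (my own helper, not from any particular database): since the leader counts itself toward the quorum, commit latency is the (quorum − 1)-th fastest follower response.

```python
def commit_latency(follower_latencies_ms):
    """Commit latency for one write, given each follower's response time."""
    n = len(follower_latencies_ms) + 1   # cluster size, leader included
    acks_needed = (n // 2 + 1) - 1       # quorum minus the leader itself
    return sorted(follower_latencies_ms)[acks_needed - 1]

# 3-node cluster: the faster of the two followers decides
print(commit_latency([2.0, 9.0]))             # → 2.0
# 5-node cluster: the second-fastest of four followers decides
print(commit_latency([2.0, 3.0, 9.0, 40.0]))  # → 3.0
```

Note that one slow straggler doesn't hurt writes (it falls outside the quorum), but adding nodes raises the bar for how many followers must respond quickly.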
Part 8: My Opinionated Take
1. You should understand Raft, but never implement it. Implementing consensus correctly is extraordinarily difficult. Use etcd, Consul, or CockroachDB — systems with thousands of hours of testing and production hardening. The Jepsen testing project has found bugs in almost every distributed system it's tested, including those built by world-class teams.
2. Three nodes is enough for most systems. I see teams deploying 5-node etcd clusters "for safety" when they've never even had a single node failure. A three-node cluster tolerates one failure, which covers the vast majority of real-world scenarios. Five nodes add latency and operational complexity.
3. Network partitions are rarer than you think, but worse than you expect. Most outages are caused by software bugs, configuration errors, and human mistakes — not network partitions. But when partitions do happen, systems without proper consensus can corrupt data in ways that take weeks to repair.
4. Read the Raft paper. It's 18 pages and genuinely readable. Understanding Raft will change how you think about every distributed system you work with.
Action Plan
This Week
- Go through the Raft interactive visualization (15 min)
- Read the Raft paper, at least sections 1-5
- Check: how many nodes does your etcd/Consul/ZooKeeper cluster have?
This Month
- Simulate a node failure in your staging cluster and observe the leader election
- Monitor leader election frequency — frequent elections indicate instability
- Review your write latency — is consensus the bottleneck?
Sources
- Ongaro & Ousterhout: In Search of an Understandable Consensus Algorithm (Raft Paper)
- Lamport: Paxos Made Simple
- Fischer, Lynch, Paterson: FLP Impossibility
- The Secret Lives of Data: Raft Visualization
- CockroachDB: Replication Layer Architecture
- Jepsen: Distributed Systems Testing
- CRDT.tech: Conflict-Free Replicated Data Types
I'm Ismat, and I build BirJob — Azerbaijan's job aggregator scraping 80+ sources daily.
