Understanding Consensus Algorithms: Raft, Paxos, and Why They Matter
I first encountered consensus algorithms when a Redis cluster split-brain incident took down our job queue at 4 AM. Three Redis nodes each thought they were the primary, accepting writes independently. When the network partition healed, we had three conflicting versions of the same data. The recovery took 6 hours of manual intervention.
That incident taught me that distributed systems don't "just work." Behind every database cluster, every Kubernetes deployment, every distributed lock — there's a consensus algorithm making sure nodes agree on the state of the world. Understanding these algorithms won't make you implement your own (please don't), but it will make you a better operator and architect of the systems that use them.
This guide explains Raft, Paxos, and related algorithms in plain language, with practical implications for real-world systems.
Part 1: The Consensus Problem
Imagine you have three database servers. A client writes x = 5. How do you ensure all three servers agree that x = 5, even if one server crashes, the network drops packets between servers, or a server is slow?
This is the consensus problem: getting multiple nodes to agree on a single value, even in the presence of failures. It sounds simple. It is, in fact, one of the hardest problems in computer science.
Why It's Hard: The FLP Impossibility Result
In 1985, Fischer, Lynch, and Paterson proved that no deterministic algorithm can guarantee consensus in an asynchronous distributed system with even one faulty process. This is known as the FLP impossibility result, and it's arguably the most important theoretical result in distributed systems.
In practice, this means every consensus algorithm must make tradeoffs. Either it can get stuck (sacrifice liveness) or it can disagree (sacrifice safety). Practical algorithms like Raft choose to sacrifice liveness: they guarantee they'll never disagree, but they might temporarily be unable to make progress during a network partition.
The CAP Theorem Connection
Brewer's CAP theorem states that a distributed system can provide at most two of three guarantees: Consistency, Availability, and Partition tolerance. Since network partitions are inevitable, the real choice is between consistency (CP) and availability (AP).
| System Type | Choice | Consensus Algorithm | Example |
|---|---|---|---|
| CP | Consistency over availability | Raft, Paxos | etcd, ZooKeeper, CockroachDB |
| AP | Availability over consistency | Gossip, CRDT | Cassandra, DynamoDB, Riak |
Part 2: Paxos — The Original (and Incomprehensible)
Paxos was devised by Leslie Lamport in the late 1980s, described in a paper about a fictional Greek parliament (not published until 1998), and has confused distributed systems engineers ever since. Google's Chubby lock service and many internal systems at major companies use Paxos variants; Apache ZooKeeper uses the closely related ZAB protocol.
The Basic Idea
Paxos has three roles: Proposers, Acceptors, and Learners.
Phase 1: Prepare
1. The Proposer picks a proposal number N (unique and increasing).
2. The Proposer sends Prepare(N) to a majority of Acceptors.
3. Each Acceptor: if N is the highest number it has seen, it responds with Promise(N), along with any value it has already accepted; otherwise, it ignores the request.

Phase 2: Accept
4. If the Proposer receives a Promise from a majority: if any Acceptor reported a previously accepted value, the Proposer must use that value; otherwise, it can propose its own value V.
5. The Proposer sends Accept(N, V) to the Acceptors.
6. Each Acceptor accepts (N, V) if it hasn't promised a higher number; otherwise, it rejects.

Phase 3: Learn
7. Once a majority of Acceptors accept, the value is "chosen".
8. Learners are notified of the chosen value.
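The Acceptor's side of the two phases above can be sketched in a few lines of Python. This is an illustrative toy, not a usable implementation (the class and method names are mine, and a real Acceptor must persist its state to disk before replying):

```python
# Sketch of a single-decree Paxos Acceptor. Illustrative only: a real
# implementation must durably persist promised_n / accepted_* before replying.

class Acceptor:
    def __init__(self):
        self.promised_n = -1      # highest proposal number promised so far
        self.accepted_n = None    # proposal number of the accepted value, if any
        self.accepted_v = None    # the accepted value, if any

    def prepare(self, n):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised_n:
            self.promised_n = n
            # The reply carries any previously accepted (n, v), which forces
            # the proposer to re-propose that value instead of its own.
            return ("promise", self.accepted_n, self.accepted_v)
        return ("reject",)

    def accept(self, n, v):
        """Phase 2: accept (n, v) unless a higher number was promised."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n, self.accepted_v = n, v
            return ("accepted", n, v)
        return ("reject",)
```

A Proposer that later runs Prepare with a higher number will learn the accepted value from the Promise replies and must adopt it, which is how Paxos keeps a chosen value stable.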
Why Paxos Is Hard
The basic Paxos algorithm (Single-Decree Paxos) agrees on a single value. Real systems need to agree on a sequence of values (a log). Multi-Paxos extends the algorithm for this, but Lamport's paper doesn't fully specify multi-Paxos, leaving implementation details to the reader. Different implementations fill these gaps differently, leading to subtle incompatibilities.
As the Raft paper authors noted, "there are significant gaps between the description of the Paxos algorithm and the needs of a real-world system ... the final system will be based on an unproven protocol."
Part 3: Raft — Consensus for Humans
Raft was designed by Diego Ongaro and John Ousterhout specifically to be understandable. The Raft paper, In Search of an Understandable Consensus Algorithm (2014), is one of the most cited papers in distributed systems, and for good reason: it takes the same problem Paxos solves and makes it comprehensible.
The Key Insight: Leader-Based
Raft simplifies consensus by electing a single leader. All writes go through the leader. The leader replicates the write to followers. Once a majority acknowledges, the write is committed.
Raft has three states for each node:
1. Leader — handles all client requests, replicates log
2. Follower — passive, responds to leader
3. Candidate — trying to become leader
Normal operation:
Client → Leader → replicate to Followers → majority ACK → committed
Leader election:
1. Followers have a randomized election timeout (150-300ms)
2. If a follower doesn't hear from the leader within the timeout,
it becomes a Candidate
3. Candidate votes for itself and requests votes from other nodes
4. If it gets votes from a majority, it becomes the new Leader
5. The term number increases with each election
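The election steps above can be sketched as a small state machine. This is my own simplified sketch (the class and message shapes are invented, not taken from etcd or any real implementation; real Raft also checks log freshness before granting votes):

```python
import random

# Toy sketch of Raft's election-timeout and candidacy logic.
ELECTION_TIMEOUT_MS = (150, 300)   # randomized range from the text above

class RaftNode:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers            # ids of the other cluster members
        self.state = "follower"
        self.term = 0
        self.voted_for = None
        self.reset_election_timer()

    def reset_election_timer(self):
        # Randomization makes simultaneous candidacies (split votes) rare.
        self.timeout_ms = random.uniform(*ELECTION_TIMEOUT_MS)

    def on_election_timeout(self):
        """No heartbeat heard in time: become a candidate, start an election."""
        self.state = "candidate"
        self.term += 1                 # new, higher term
        self.voted_for = self.node_id  # vote for self
        return {"type": "RequestVote", "term": self.term,
                "candidate": self.node_id}

    def on_votes(self, votes_granted):
        """Become leader if a majority (counting our own vote) said yes."""
        cluster_size = len(self.peers) + 1
        if votes_granted + 1 > cluster_size // 2:
            self.state = "leader"
        return self.state
```

In a 5-node cluster, two granted votes plus the candidate's own vote form the 3-of-5 majority needed to win.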
Log Replication in Detail
```
Leader's log:     [1:x=1] [2:y=2] [3:z=3] [4:w=4]
Follower A's log: [1:x=1] [2:y=2] [3:z=3]          ← one behind
Follower B's log: [1:x=1] [2:y=2] [3:z=3] [4:w=4]  ← up to date
Follower C's log: [1:x=1] [2:y=2]                  ← two behind
Follower D's log: [1:x=1] [2:y=2] [3:z=3] [4:w=4]  ← up to date

Commit index: 4 (a majority has entry 4: Leader, B, D = 3/5)
```

When the Leader receives a new write x=5:
1. The Leader appends [5:x=5] to its log.
2. The Leader sends an AppendEntries RPC to all followers.
3. Followers append [5:x=5] and acknowledge.
4. Once 3/5 nodes have the entry, the Leader advances the commit index to 5.
5. The Leader responds to the client: "write committed".
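The commit rule in step 4 generalizes: an entry is committed once it is stored on a majority of nodes, leader included. A sketch of that calculation (my own helper, not from any real codebase; real Raft additionally requires the entry to belong to the leader's current term before committing):

```python
# The leader tracks, for each follower, the highest log index that follower
# has acknowledged (its "match index"). The commit index is the highest
# index stored on a majority of the cluster.

def commit_index(leader_log_len, follower_match_indexes):
    """Highest log index replicated on a majority (leader counts itself)."""
    indexes = sorted(follower_match_indexes + [leader_log_len], reverse=True)
    majority = len(indexes) // 2 + 1
    return indexes[majority - 1]    # the majority-th highest index
```

With the five-node example above, after B and D acknowledge entry 5 the leader computes `commit_index(5, [3, 5, 2, 5])` and advances the commit index to 5 even though A and C lag behind.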
What Happens When the Leader Dies
Timeline:

```
T=0:    Leader (Node A) is healthy, term 5
T=100:  Leader crashes (network failure, hardware failure, etc.)
T=250:  Follower C's election timer fires (randomized: 250ms)
T=250:  C becomes Candidate, increments term to 6
T=250:  C sends RequestVote to all nodes
T=260:  B and D vote for C (term 6 is higher than their current term 5)
T=260:  C receives majority votes (C + B + D = 3/5)
T=260:  C becomes Leader for term 6
T=270:  C starts sending heartbeats to all followers
T=300:  Normal operation resumes
```

Key insight: the voting round itself takes only ~10ms here (roughly one network round trip); most of the downtime comes from waiting for the randomized election timeout to fire. Clients may experience a brief timeout, but no committed data is lost.
Raft Interactive Visualization
The best way to understand Raft is through the interactive Raft visualization by The Secret Lives of Data. It walks through leader election, log replication, and network partitions with animations. I recommend spending 15 minutes with it.
Part 4: Raft vs Paxos
| Aspect | Raft | Paxos |
|---|---|---|
| Understandability | Designed for clarity | Notoriously difficult |
| Leader | Strong leader (all writes through leader) | No fixed leader in basic Paxos (Multi-Paxos adds one in practice) |
| Log | Log entries must be contiguous | Gaps allowed in log |
| Membership changes | Specified in the paper | Left as exercise to reader |
| Proven correct | Yes (TLA+ specification) | Yes |
| Production use | etcd, CockroachDB, Consul, TiKV | Chubby, Spanner (internally) |
| Performance | Comparable | Comparable |
| Implementations | Many open-source | Fewer complete implementations |
Part 5: Where You Encounter Consensus
You're already using consensus algorithms, even if you don't know it:
etcd and Kubernetes
etcd is a distributed key-value store that uses Raft. Kubernetes stores all its state in etcd — pod schedules, service configurations, secrets. When you kubectl apply a deployment, the change goes through etcd's Raft consensus before it takes effect.
This is why a Kubernetes control plane needs an odd number of etcd nodes (3 or 5, typically): Raft requires a majority to function. With 3 nodes, you can tolerate 1 failure. With 5, you can tolerate 2.
CockroachDB and TiDB
NewSQL databases like CockroachDB use Raft to replicate data across nodes. Each "range" of data has its own Raft group. When you execute a SQL transaction, Raft ensures it's consistently replicated before returning success. The CockroachDB architecture documentation explains this in detail.
Consul and Service Discovery
HashiCorp Consul uses Raft for its leader election and state replication. When a service registers, it's replicated via Raft to ensure all Consul servers agree on the service catalog.
ZooKeeper and Kafka
ZooKeeper uses a Paxos-like protocol called ZAB (ZooKeeper Atomic Broadcast). Apache Kafka used ZooKeeper for leader election and metadata management until KRaft (Kafka Raft), which replaced ZooKeeper with a built-in Raft implementation.
Part 6: Beyond Raft and Paxos
Byzantine Fault Tolerance (BFT)
Raft and Paxos assume nodes are honest — they might crash, but they won't lie. Byzantine fault tolerance handles the case where nodes can be malicious (sending incorrect data, lying about their state). This is what blockchain consensus algorithms solve.
PBFT (Practical Byzantine Fault Tolerance) by Castro and Liskov showed that BFT is practical for small cluster sizes. Blockchain systems use various BFT-derived algorithms.
Gossip Protocols
For systems that prioritize availability over consistency (AP systems), gossip protocols spread information epidemically. Each node periodically shares its state with a random peer, and eventually all nodes converge. Cassandra, DynamoDB, and Redis Cluster use gossip for membership and failure detection.
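The epidemic dynamic is easy to see in a toy simulation (entirely my own construction, not how Cassandra or any real system implements gossip): each round, every node that knows an update tells one random peer, so the informed set can at most double per round and full convergence takes on the order of log(n) rounds.

```python
import random

def gossip_rounds(n_nodes, seed=0):
    """Rounds until a rumor started at node 0 reaches all n_nodes."""
    rng = random.Random(seed)       # seeded for reproducibility
    informed = {0}                  # node 0 starts with the update
    rounds = 0
    while len(informed) < n_nodes:
        # Snapshot: nodes informed this round gossip starting next round.
        for _ in list(informed):
            informed.add(rng.randrange(n_nodes))   # tell one random peer
        rounds += 1
    return rounds

print(gossip_rounds(1000))   # converges in O(log n) rounds
```

Note the tradeoff this buys: no quorum, no leader, no blocking on a partition, but also no point in time at which all nodes are guaranteed to agree.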
CRDTs (Conflict-Free Replicated Data Types)
CRDTs are data structures that can be merged without coordination. They achieve eventual consistency without consensus. Examples include counters, sets, and text (used by collaborative editors). The CRDT.tech website maintains a comprehensive list.
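The classic introductory CRDT is the grow-only counter (G-Counter): each node increments only its own slot, and merging takes the per-slot maximum, which makes merge commutative, associative, and idempotent. A minimal sketch (my own code, not any particular library's API):

```python
# G-Counter: a grow-only distributed counter that needs no consensus.

class GCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}   # node_id -> increments originating at that node

    def increment(self, by=1):
        # A node only ever increments its own slot.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + by

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Per-slot max: safe to apply in any order, any number of times.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)
```

Two replicas can accept increments independently during a partition and reconcile later by merging in either order, with no coordination and no lost updates.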
Part 7: Practical Implications for Developers
Cluster Sizing
Consensus requires a majority (quorum):
- 3 nodes: quorum = 2, tolerates 1 failure
- 5 nodes: quorum = 3, tolerates 2 failures
- 7 nodes: quorum = 4, tolerates 3 failures
Why always odd numbers?
- 4 nodes: quorum = 3, tolerates 1 failure (same as 3 nodes!)
- 6 nodes: quorum = 4, tolerates 2 failures (same as 5 nodes!)
- Even numbers waste resources without improving fault tolerance
Recommendation:
- Development: 1 node (no consensus needed)
- Production: 3 nodes (good balance)
- Critical systems: 5 nodes (tolerates 2 simultaneous failures)
- More than 5: rarely needed, adds latency to writes
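The sizing arithmetic above reduces to two one-liners (helper names are my own):

```python
def quorum(n: int) -> int:
    """Smallest majority of an n-node cluster."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """How many nodes can fail while a quorum remains reachable."""
    return n - quorum(n)

# Even cluster sizes buy nothing: 4 nodes tolerate no more failures than 3.
for n in range(3, 8):
    print(f"{n} nodes: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```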
Performance Implications
Every write must be replicated to a majority before it's confirmed. The leader counts toward the quorum (the entry is already in its own log), so commit latency is an order statistic of the follower response times:

3-node cluster (quorum = 2): the leader needs 1 follower acknowledgment, so write latency ≈ the faster of the two followers' responses.

5-node cluster (quorum = 3): the leader needs 2 follower acknowledgments, so write latency ≈ the second-fastest of the four followers' responses, which is typically slightly higher.

This is why consensus-based databases have higher write latency than single-node databases. The tradeoff is durability.
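As a sketch (my own helper, not from any particular database): since the leader counts itself toward the quorum, commit latency is the (quorum − 1)-th fastest follower response.

```python
def commit_latency(follower_latencies_ms):
    """Commit latency for one write, given each follower's response time."""
    n = len(follower_latencies_ms) + 1   # cluster size, leader included
    acks_needed = (n // 2 + 1) - 1       # quorum minus the leader itself
    return sorted(follower_latencies_ms)[acks_needed - 1]

# 3-node cluster: the faster of the two followers decides
print(commit_latency([2.0, 9.0]))             # → 2.0
# 5-node cluster: the second-fastest of four followers decides
print(commit_latency([2.0, 3.0, 9.0, 40.0]))  # → 3.0
```

Note that one slow straggler doesn't hurt writes (it falls outside the quorum), but adding nodes raises the bar for how many followers must respond quickly.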
Part 8: My Opinionated Take
1. You should understand Raft, but never implement it. Implementing consensus correctly is extraordinarily difficult. Use etcd, Consul, or CockroachDB — systems with thousands of hours of testing and production hardening. The Jepsen testing project has found bugs in almost every distributed system it's tested, including those built by world-class teams.
2. Three nodes is enough for most systems. I see teams deploying 5-node etcd clusters "for safety" when they've never even had a single node failure. A three-node cluster tolerates one failure, which covers the vast majority of real-world scenarios. Five nodes add latency and operational complexity.
3. Network partitions are rarer than you think, but worse than you expect. Most outages are caused by software bugs, configuration errors, and human mistakes — not network partitions. But when partitions do happen, systems without proper consensus can corrupt data in ways that take weeks to repair.
4. Read the Raft paper. It's 18 pages and genuinely readable. Understanding Raft will change how you think about every distributed system you work with.
Action Plan
This Week
- Go through the Raft interactive visualization (15 min)
- Read the Raft paper, at least sections 1-5
- Check: how many nodes does your etcd/Consul/ZooKeeper cluster have?
This Month
- Simulate a node failure in your staging cluster and observe the leader election
- Monitor leader election frequency — frequent elections indicate instability
- Review your write latency — is consensus the bottleneck?
Sources
- Ongaro & Ousterhout: In Search of an Understandable Consensus Algorithm (Raft Paper)
- Lamport: Paxos Made Simple
- Fischer, Lynch, Paterson: FLP Impossibility
- The Secret Lives of Data: Raft Visualization
- CockroachDB: Replication Layer Architecture
- Jepsen: Distributed Systems Testing
- CRDT.tech: Conflict-Free Replicated Data Types
I'm Ismat, and I build BirJob — Azerbaijan's job aggregator scraping 80+ sources daily.
