Distributed Systems for Web Developers: What You Need to Know
A couple of years ago, I was debugging a checkout flow that worked perfectly in development but crumbled under real traffic. Orders were duplicated. Inventory went negative. The payment gateway timed out, but the order still went through. I sat in front of my screen at 2 AM, staring at logs from three different services, and realized: I was building a distributed system, and I had no idea what I was doing.
If you're a web developer in 2026, you're probably already building distributed systems whether you know it or not. The moment you add a background job queue, a separate auth service, a CDN, or even a managed database on a different server, you've left the comfortable world of single-process applications. And the rules are different here.
This guide is the crash course I wish I'd had before that 2 AM incident. It covers the core concepts, the traps, the patterns that actually work, and the ones that sound great in blog posts but fail in production.
What Makes a System "Distributed"?
A distributed system is any system where components located on networked computers communicate and coordinate their actions by passing messages. That's it. No fancy infrastructure required. If your web app talks to a database on a different machine, congratulations, you have a distributed system.
But the real definition, the one that matters in practice, comes from the Eight Fallacies of Distributed Computing, attributed to L Peter Deutsch (James Gosling added the eighth). The first seven were written down in 1994 and remain devastatingly accurate:
- The network is reliable
- Latency is zero
- Bandwidth is infinite
- The network is secure
- Topology doesn't change
- There is one administrator
- Transport cost is zero
- The network is homogeneous
Every production bug I've encountered in distributed systems traces back to violating one of these assumptions. The network will drop packets. Latency will spike. Servers will disappear without warning.
According to a 2025 Google SRE report, network-related failures account for roughly 35% of all incidents in cloud-native applications. Not code bugs. Not bad deployments. Just the network being the network.
The CAP Theorem: Your First Trade-off
Eric Brewer's CAP theorem states that a distributed data store can only provide two of these three guarantees simultaneously:
- Consistency: Every read receives the most recent write or an error.
- Availability: Every request receives a non-error response (without guarantee that it contains the most recent write).
- Partition Tolerance: The system continues to operate despite network partitions between nodes.
Since network partitions are inevitable in any real system, the practical choice is between CP (consistent but sometimes unavailable) and AP (always available but sometimes stale).
| System Type | CAP Choice | Example | Best For |
|---|---|---|---|
| Traditional RDBMS (single-node) | CA (no partition tolerance) | PostgreSQL single instance | Small-scale apps |
| Distributed RDBMS | CP | CockroachDB, Google Spanner | Financial transactions |
| NoSQL Document Store | AP | Cassandra, DynamoDB | High-availability reads |
| Distributed Cache | AP | Redis Cluster | Session storage, caching |
| Consensus-based Store | CP | etcd, ZooKeeper | Configuration, leader election |
Here's my take: most web developers overthink CAP. If you're running a standard SaaS application with a managed PostgreSQL database and a Redis cache, you don't need to make a CAP decision. Your database is CP by default. Your cache is AP by nature. The real question is: what happens when the cache disagrees with the database?
Consensus: How Nodes Agree on Anything
When you have multiple nodes that need to agree on something, such as "who is the current leader?" or "what is the current value of X?", you need a consensus algorithm. The two you'll encounter most are:
Raft
Raft was designed to be understandable. It elects a leader, and the leader handles all writes. If the leader fails, a new election happens. It's used by etcd (which powers Kubernetes), CockroachDB, and HashiCorp Consul. If you only learn one consensus algorithm, learn Raft.
Paxos
Paxos is the OG consensus algorithm, first described by Leslie Lamport in 1989 (though his paper wasn't published until 1998). It's notoriously difficult to understand and implement correctly. Google's Chubby lock service uses it. Unless you're building infrastructure at Google-scale, you probably don't need to implement Paxos directly.
For web developers, the practical takeaway is: don't implement consensus yourself. Use etcd, ZooKeeper, or Consul. These are battle-tested implementations that handle the edge cases you haven't thought of.
Patterns That Actually Work in Production
1. Idempotency
This is the single most important pattern in distributed systems. An operation is idempotent if performing it multiple times has the same effect as performing it once. Why does this matter? Because in a distributed system, you will retry requests. The network will drop acknowledgments. Your message queue will deliver messages twice.
Stripe's payment API is the gold standard here. Every API call accepts an Idempotency-Key header. If you send the same request twice with the same key, you get the same response without the payment being processed twice. According to Stripe's documentation, this pattern prevents roughly 0.1% of payments from being duplicated, which at their scale is millions of dollars.
Implementation is straightforward: store a hash of the request and its response. Before processing a new request, check if you've seen it before. If yes, return the cached response.
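The check-then-cache flow described above can be sketched in a few lines. This is a minimal in-memory version; the function names (`chargeCard`, `handlePayment`) are illustrative, and in production the map would live in a shared store like Redis with a TTL.

```javascript
// Minimal idempotency-key store: cache the response keyed by the
// client-supplied idempotency key and replay it on retries.
const responses = new Map();

let paymentsProcessed = 0;

// Hypothetical payment handler; stands in for the real side effect.
function chargeCard(amount) {
  paymentsProcessed += 1;
  return { status: "charged", amount };
}

function handlePayment(idempotencyKey, amount) {
  // Seen this key before? Return the cached response, do no work.
  if (responses.has(idempotencyKey)) {
    return responses.get(idempotencyKey);
  }
  const response = chargeCard(amount);
  responses.set(idempotencyKey, response);
  return response;
}

// A client retry with the same key charges the card exactly once.
const first = handlePayment("order-42", 100);
const replay = handlePayment("order-42", 100);
```

Note that the cached response is returned verbatim, so the retrying client can't tell the difference between the original call and the replay, which is exactly the point.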
2. Circuit Breakers
When a downstream service is failing, you don't want every request to wait for it to time out. A circuit breaker detects failures and "trips" after a threshold, immediately returning an error without attempting the call. After a cooldown period, it allows a few test requests through to see if the service has recovered.
Netflix popularized this pattern with Hystrix (now in maintenance mode). Modern alternatives include Resilience4j for Java and Polly for .NET. In Node.js, opossum is a solid choice.
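Before reaching for a library, it helps to see how small the underlying state machine is. Here's a toy sketch of the three states (closed, open, half-open); the threshold and cooldown values are arbitrary for the example, and a real library adds rolling windows, metrics, and fallbacks.

```javascript
// Toy circuit breaker: CLOSED (calls pass through), OPEN (calls fail
// fast), HALF_OPEN (one probe call decides whether to close again).
class CircuitBreaker {
  constructor(fn, { failureThreshold = 3, cooldownMs = 5000 } = {}) {
    this.fn = fn;
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.failures = 0;
    this.state = "CLOSED";
    this.openedAt = 0;
  }

  async call(...args) {
    if (this.state === "OPEN") {
      if (Date.now() - this.openedAt >= this.cooldownMs) {
        this.state = "HALF_OPEN"; // cooldown over: allow one probe call
      } else {
        throw new Error("circuit open: failing fast");
      }
    }
    try {
      const result = await this.fn(...args);
      this.failures = 0;
      this.state = "CLOSED"; // any success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.state === "HALF_OPEN" || this.failures >= this.failureThreshold) {
        this.state = "OPEN"; // trip: stop hammering the failing service
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}
```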
3. Saga Pattern
When a business transaction spans multiple services, you can't use a traditional database transaction. Instead, you use a saga: a sequence of local transactions where each step has a compensating action that undoes it if a later step fails.
For example, an e-commerce order might involve:
- Reserve inventory (compensate: release inventory)
- Charge payment (compensate: refund payment)
- Create shipment (compensate: cancel shipment)
If step 3 fails, you run the compensating actions for steps 2 and 1 in reverse order. This is called a choreography-based saga when each service listens for events, or an orchestration-based saga when a central coordinator manages the flow.
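An orchestration-based version of the flow above fits in one function: run each local transaction in order and, on failure, run the compensations for the completed steps newest-first. The step shape (`action`/`compensate`) is my own illustration, not any particular framework's API.

```javascript
// Minimal orchestration-based saga: execute steps in order; on failure,
// compensate everything that already succeeded, in reverse order.
async function runSaga(steps, context) {
  const completed = [];
  for (const step of steps) {
    try {
      await step.action(context);
      completed.push(step);
    } catch (err) {
      // Undo the completed steps, newest first.
      for (const done of completed.reverse()) {
        await done.compensate(context);
      }
      throw err; // the saga as a whole failed
    }
  }
}
```

With the e-commerce steps above, a failure at "create shipment" would run the refund and then the inventory release, leaving the system in a consistent state.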
4. Event Sourcing + CQRS
Instead of storing the current state of an entity, store a sequence of events that led to that state. This gives you a complete audit trail, enables temporal queries ("what was the state at time T?"), and makes it natural to build read-optimized projections (CQRS).
A 2025 Confluent survey found that 42% of organizations using event-driven architecture reported fewer data consistency issues compared to traditional CRUD architectures.
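The core idea is small enough to sketch: the source of truth is an append-only event log, and current state is just a fold (reduce) over it. The event names below are illustrative, not from any particular framework.

```javascript
// Event-sourcing sketch: append-only log, state derived by folding.
const events = [];

function append(event) {
  events.push(event);
}

// Rebuild a balance projection from the log. Passing an index answers
// "what was the state at time T?" style temporal queries.
function balanceAt(upTo = events.length) {
  return events.slice(0, upTo).reduce((balance, e) => {
    if (e.type === "Deposited") return balance + e.amount;
    if (e.type === "Withdrawn") return balance - e.amount;
    return balance;
  }, 0);
}

append({ type: "Deposited", amount: 100 });
append({ type: "Withdrawn", amount: 30 });
append({ type: "Deposited", amount: 5 });
```

A CQRS read model is the same fold run ahead of time into a denormalized store, so reads never replay the log.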
Anti-Patterns: What Looks Good but Fails
Two-Phase Commit Across Services
Two-phase commit (2PC) works great within a single database. Across services? It's a disaster. It requires all participants to be available for the commit to succeed, which means a single unavailable service blocks everyone. It's also slow, holding locks across the entire transaction duration. Use sagas instead.
Distributed Transactions with XA
XA transactions extend 2PC across different resource managers (databases, message queues). They have all the problems of 2PC plus additional complexity. Every major tech company that tried XA at scale has abandoned it. Don't start.
Synchronous Microservices Chains
Service A calls Service B, which calls Service C, which calls Service D. The latency is additive. The availability is multiplicative. If each service has 99.9% availability, a chain of four has 99.6% availability. That's the difference between 8.7 hours and 35 hours of downtime per year.
| Chain Length | Individual Availability | Chain Availability | Annual Downtime |
|---|---|---|---|
| 1 service | 99.9% | 99.9% | 8.7 hours |
| 3 services | 99.9% | 99.7% | 26.3 hours |
| 5 services | 99.9% | 99.5% | 43.8 hours |
| 10 services | 99.9% | 99.0% | 87.6 hours |
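The table follows directly from multiplying per-service availability: a synchronous chain of n services, each available with probability p, is available with probability p^n.

```javascript
// Availability of a synchronous chain of n services at availability p.
function chainAvailability(p, n) {
  return Math.pow(p, n);
}

// Expected downtime per (365-day) year for a given availability.
function annualDowntimeHours(availability) {
  return (1 - availability) * 24 * 365;
}
```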
Observability: You Can't Debug What You Can't See
In a monolith, a stack trace tells you everything. In a distributed system, a single user request might touch ten services. You need three pillars of observability:
Distributed Tracing
Tools like Jaeger and Zipkin propagate trace IDs across service boundaries, letting you see the full journey of a request. OpenTelemetry has emerged as the industry standard for instrumentation, supported by every major cloud provider and APM vendor.
Structured Logging
Every log line should include the trace ID, service name, and relevant business context. JSON-structured logs that flow into a centralized system (ELK stack, Datadog, Grafana Loki) let you correlate events across services.
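In practice that means every log call emits one JSON object, not a free-text string. A minimal logger might look like this; the field names are illustrative, not a specific vendor's schema.

```javascript
// Minimal structured logger: one JSON object per line, always carrying
// the trace ID, service name, and relevant business context.
function logEvent(traceId, service, message, context = {}) {
  const line = JSON.stringify({
    timestamp: new Date().toISOString(),
    traceId,
    service,
    message,
    ...context,
  });
  console.log(line);
  return line; // returned so callers/tests can inspect it
}
```

Because every line shares the `traceId` field, a query like `traceId:abc123` in your log system pulls up the request's history across every service it touched.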
Metrics
The RED method (Rate, Errors, Duration) gives you the essential metrics for every service. Prometheus with Grafana is the open-source standard. For managed solutions, Datadog and New Relic dominate the market, with Datadog holding approximately 28% market share in the APM space according to Gartner's 2025 APM report.
My Opinionated Take: Start Monolithic, Distribute Deliberately
Here's what I believe after years of building and breaking distributed systems: most web applications don't need to be distributed beyond a web server, a database, and a cache.
The industry has a fetish for microservices. We've been told that monoliths are legacy, that real engineers build distributed systems, that you need Kubernetes to deploy a blog. This is nonsense.
Amazon started as a monolith. So did Netflix, Shopify, and GitHub. They distributed their systems when they had specific scaling problems that couldn't be solved any other way. They didn't do it because a conference speaker told them to.
My rule of thumb: if your team has fewer than 20 engineers, you probably don't need microservices. If your database handles your current load with room to spare, you don't need sharding. If a single server can process your background jobs, you don't need a distributed task queue.
Distribute when you must, not when you can. Every network boundary you add is a source of latency, failure, and complexity. Make sure the trade-off is worth it.
Action Plan: Building Your Distributed Systems Knowledge
If you're a web developer who wants to understand distributed systems deeply, here's the learning path I recommend:
Week 1-2: Foundations
- Read Designing Data-Intensive Applications by Martin Kleppmann (chapters 1-5). This is the single best resource on the topic.
- Implement idempotency in a side project. Add an idempotency key to one of your API endpoints.
- Set up distributed tracing with OpenTelemetry in a simple two-service application.
Week 3-4: Patterns
- Implement a circuit breaker. Use a library first, then build one from scratch to understand the state machine.
- Build a simple saga with compensating transactions.
- Read about the Microservices patterns catalog by Chris Richardson.
Week 5-6: Failure Modes
- Run Chaos Monkey or a similar tool against a test environment.
- Simulate network partitions using `tc` (traffic control) on Linux.
- Practice debugging with distributed traces. Intentionally break things and find the root cause.
Week 7-8: Production Readiness
- Set up health checks, readiness probes, and graceful shutdown in your services.
- Implement retry with exponential backoff and jitter.
- Write a runbook for your most critical distributed workflow.
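For the backoff-and-jitter item, a common recipe is "full jitter": the wait before attempt n is a random value between zero and min(cap, base × 2^n). Here's a hedged sketch; the defaults are arbitrary and should be tuned for your workload.

```javascript
// Full-jitter backoff: random wait in [0, min(cap, base * 2^attempt)].
function backoffDelay(attempt, baseMs = 100, capMs = 10000) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * ceiling;
}

// Retry a flaky async operation with exponential backoff and jitter.
async function retry(fn, { attempts = 5, baseMs = 100, capMs = 10000 } = {}) {
  let lastErr;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      await new Promise((r) => setTimeout(r, backoffDelay(attempt, baseMs, capMs)));
    }
  }
  throw lastErr; // exhausted all attempts
}
```

The jitter matters: without it, clients that failed together retry together, hammering the recovering service in synchronized waves.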
Key Takeaways
- You're probably already building distributed systems. Acknowledge it and learn the rules.
- The CAP theorem is a useful mental model, but don't let it paralyze you. Most web apps work fine with a CP database and an AP cache.
- Idempotency is the single most important pattern. Implement it everywhere.
- Avoid synchronous chains of microservices. Use asynchronous messaging where possible.
- Observability is not optional. You cannot debug a distributed system with `console.log`.
- Start monolithic. Distribute deliberately, with specific reasons for each boundary.
Sources
- Fallacies of Distributed Computing - Wikipedia
- CAP Theorem - Wikipedia
- Raft Consensus Algorithm
- Stripe Idempotent Requests Documentation
- Designing Data-Intensive Applications - Martin Kleppmann
- OpenTelemetry
- Microservices Patterns - Chris Richardson
- Google Cloud SRE Blog
I'm Ismat, and I build BirJob — Azerbaijan's job aggregator scraping 80+ sources daily.
