Chaos Engineering: Breaking Things on Purpose to Build Resilience
At 2:47 AM on a Tuesday, our job aggregator's database connection pool was exhausted. Not because of traffic — because a single scraper hung for 8 minutes while holding a connection, triggering a cascade that starved every other service of database access. The site returned 500 errors for 23 minutes. We had retry logic, we had connection timeouts, we had circuit breakers — but none of them had ever been tested under this specific failure mode. After the post-mortem, I decided to start breaking things deliberately.
Chaos engineering is the discipline of experimenting on a production system to build confidence in its ability to withstand turbulent conditions. This guide covers the principles, practical implementation, and a realistic starting framework for teams of all sizes — not just Netflix-scale operations.
What Chaos Engineering Actually Is
The term was coined by the Netflix Chaos Engineering team in their Principles of Chaos Engineering manifesto. The definition is precise: "Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production."
Note what it's NOT:
| Chaos Engineering IS | Chaos Engineering IS NOT |
|---|---|
| Controlled experiments with hypotheses | Randomly breaking things to see what happens |
| Running in production (or production-like environments) | Only running in staging |
| Automated, repeatable experiments | Manual ad-hoc testing |
| Minimizing blast radius | Taking down the whole system |
| Verifying that resilience mechanisms work | Finding bugs (that's a side effect) |
According to Gremlin's 2024 State of Chaos Engineering report, 63% of organizations now practice some form of chaos engineering, up from 28% in 2019. Of those practicing it, 91% report discovering critical failures before they caused customer-facing incidents.
The Five Principles of Chaos Engineering
From the Principles of Chaos Engineering manifesto:
- Build a Hypothesis Around Steady State Behavior. Define what "normal" looks like with metrics: response time p99 < 200ms, error rate < 0.1%, throughput > 1000 req/s. Your experiment tests whether this steady state holds under turbulence.
- Vary Real-World Events. Inject failures that actually happen: server crashes, network partitions, disk full, dependency timeouts, DNS failures, clock skew, memory leaks, certificate expiration.
- Run Experiments in Production. Production has traffic patterns, data volumes, and configuration that staging doesn't. However, start in staging if you're new to chaos engineering — production experiments require mature monitoring and blast radius controls.
- Automate Experiments to Run Continuously. Manual experiments provide one-time confidence. Automated, scheduled experiments provide ongoing assurance. Run them weekly or after every deployment.
- Minimize Blast Radius. Start small. Affect 1% of traffic, not 100%. Kill one pod, not the whole deployment. If the experiment causes unexpected customer impact, stop immediately.
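Principles 1 and 5 combine naturally in code: define steady-state bounds up front, then poll metrics during the experiment and abort the moment any bound breaches. A minimal sketch; the metric names and thresholds here are illustrative, not tied to any particular monitoring tool:

```typescript
// Steady-state definition: each metric has a bound the experiment must not breach.
type Bound = { max?: number; min?: number };

const steadyState: Record<string, Bound> = {
  p99LatencyMs: { max: 300 },   // p99 must stay under 300ms
  errorRatePct: { max: 1 },     // error rate must stay under 1%
  throughputRps: { min: 500 },  // throughput must not collapse
};

// Returns the names of metrics currently outside their bounds.
function breaches(
  metrics: Record<string, number>,
  bounds: Record<string, Bound>,
): string[] {
  return Object.entries(bounds)
    .filter(([name, b]) => {
      const v = metrics[name];
      if (v === undefined) return true; // missing data is itself a breach
      return (b.max !== undefined && v > b.max) || (b.min !== undefined && v < b.min);
    })
    .map(([name]) => name);
}

// During the experiment: poll metrics, abort on any breach.
const healthy = breaches({ p99LatencyMs: 240, errorRatePct: 0.3, throughputRps: 900 }, steadyState);
console.log(healthy); // no breaches: steady state holds, experiment continues

const degraded = breaches({ p99LatencyMs: 850, errorRatePct: 4.2, throughputRps: 900 }, steadyState);
console.log(degraded); // latency and error rate breach: stop and roll back
```

Treating "metric missing" as a breach matters: losing observability mid-experiment is exactly when you should not keep injecting failure.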
The Chaos Experiment Framework
Every chaos experiment follows this structure:
```typescript
// Chaos Experiment Template
interface ChaosExperiment {
  // 1. Define steady state
  steadyState: {
    metric: string;       // "API response time p99"
    normalValue: string;  // "< 200ms"
    measurement: string;  // "Datadog APM"
  };

  // 2. Form hypothesis
  hypothesis: string;
  // "When we terminate 1 of 3 API pods, the load balancer will
  //  redistribute traffic and p99 will stay under 300ms"

  // 3. Design experiment
  experiment: {
    action: string;       // "Kill 1 API pod via kubectl delete"
    blastRadius: string;  // "33% of API capacity"
    duration: string;     // "5 minutes"
    rollback: string;     // "Pod auto-restarts via Deployment"
  };

  // 4. Run and observe
  observation: {
    metrics: string[];    // Watch: error rate, latency, throughput
    alerting: boolean;    // Will on-call be paged? Warn them.
    runbook: string;      // Steps if experiment causes a real outage
  };

  // 5. Analyze results
  result: 'hypothesis_confirmed' | 'hypothesis_disproved';
  findings: string;
  actionItems: string[];
}
```
Example: Database Connection Pool Exhaustion
```typescript
const dbPoolExperiment: ChaosExperiment = {
  steadyState: {
    metric: "Job search API response time p99",
    normalValue: "< 500ms",
    measurement: "Sentry Performance Monitoring",
  },

  hypothesis:
    "When the database connection pool is reduced from 20 to 5 connections, " +
    "the connection queue will handle the overflow and response times " +
    "will increase to < 1000ms but not cause errors",

  experiment: {
    action: "Set DATABASE_POOL_SIZE=5 (from 20) via environment variable update",
    blastRadius: "All API requests that hit the database",
    duration: "10 minutes during low-traffic hours (2 AM - 3 AM)",
    rollback: "Revert DATABASE_POOL_SIZE to 20, restart pods",
  },

  observation: {
    metrics: [
      "pg_stat_activity.count (active DB connections)",
      "API response time p50, p95, p99",
      "API error rate (5xx)",
      "Connection pool wait time",
    ],
    alerting: true, // On-call engineer aware and standing by
    runbook: "If error rate > 5% or p99 > 3000ms, immediately roll back",
  },

  // After running:
  result: 'hypothesis_disproved',
  findings:
    "With pool_size=5, the connection queue filled within 30 seconds. " +
    "After the queue timeout (30s), requests received 500 errors. " +
    "Error rate hit 12% within 2 minutes. Root cause: our ORM " +
    "holds connections during the entire request lifecycle, not " +
    "just during queries.",
  actionItems: [
    "Implement connection release after each query (not after request)",
    "Add connection pool monitoring to dashboard",
    "Set connection_timeout to 5s (currently 30s — too long)",
    "Add circuit breaker on database connections",
  ],
};
```
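One of those action items, a circuit breaker on database connections, can be sketched as a small state machine. This is an illustrative toy (thresholds and cooldown are made up), not the breaker we actually shipped:

```typescript
// Minimal circuit breaker: closed -> open after N consecutive failures,
// then half-open after a cooldown; one success closes it again.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(
    private failureThreshold = 5, // consecutive failures before opening
    private cooldownMs = 10_000,  // how long to stay open
    private now: () => number = Date.now, // injectable clock for testing
  ) {}

  state(): 'closed' | 'open' | 'half-open' {
    if (this.openedAt === null) return 'closed';
    return this.now() - this.openedAt >= this.cooldownMs ? 'half-open' : 'open';
  }

  // Wraps a call; fails fast while open instead of holding a connection.
  exec<T>(fn: () => T): T {
    if (this.state() === 'open') throw new Error('circuit open: failing fast');
    try {
      const result = fn();
      this.failures = 0;
      this.openedAt = null; // a success in half-open closes the circuit
      return result;
    } catch (err) {
      this.failures++;
      if (this.failures >= this.failureThreshold) this.openedAt = this.now();
      throw err;
    }
  }
}
```

The point of failing fast is visible in the experiment above: while the breaker is open, requests error immediately instead of queueing for 30 seconds on a connection that will never arrive.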
Chaos Experiments Catalog: Start Here
Ranked by difficulty and risk, from "safe for beginners" to "requires mature practice":
| Level | Experiment | What It Tests | Risk |
|---|---|---|---|
| 1 (Beginner) | Kill a single pod/container | Auto-restart, load balancing | Low |
| 1 | Add 200ms latency to an internal service | Timeout handling, retry logic | Low |
| 1 | Return 500 from a non-critical dependency | Graceful degradation, fallbacks | Low |
| 2 (Intermediate) | Fill disk to 95% | Disk monitoring, log rotation | Medium |
| 2 | DNS resolution failure for one dependency | DNS caching, circuit breakers | Medium |
| 2 | Clock skew (NTP drift) | Token expiry, caching, logs | Medium |
| 3 (Advanced) | Network partition between services | Split-brain handling, consensus | High |
| 3 | Database leader failover | Replica promotion, write availability | High |
| 3 | Availability zone failure simulation | Multi-AZ resilience, data replication | High |
| 4 (Expert) | Gradual memory leak injection | OOM handling, alerting timeliness | Very High |
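The Level 1 experiment "return 500 from a non-critical dependency" is testing graceful degradation: the user should get a degraded response, not an error page. A minimal sketch of that fallback pattern; `getRecommendations` is a hypothetical non-critical dependency, not a real service of ours:

```typescript
// Graceful degradation: if a non-critical dependency fails, serve a
// degraded result instead of propagating the error to the user.
let degradedCount = 0; // export to metrics so chaos runs are observable

function withFallback<T>(primary: () => T, fallback: T): T {
  try {
    return primary();
  } catch {
    degradedCount++; // the experiment asserts this counter rises, not the 5xx rate
    return fallback;
  }
}

// Hypothetical non-critical dependency: personalized job recommendations.
function getRecommendations(): string[] {
  throw new Error('503 from recommendations service'); // injected failure
}

const jobs = withFallback(getRecommendations, []); // degrade to an empty list
console.log(jobs, degradedCount); // degraded to [] and counted one degradation
```

During the experiment you watch two things: the fallback counter climbing (expected) and the user-facing error rate staying flat (the hypothesis).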
Implementing Level 1 Experiments
```shell
# Experiment: Kill a pod
# Hypothesis: Kubernetes will restart it within 30s, no user impact

# Using kubectl directly (jsonpath grabs the first matching pod;
# pick one at random yourself if you want true randomness).
# kubectl requires --force alongside --grace-period=0 for a hard kill.
kubectl delete pod \
  "$(kubectl get pods -l app=api -o jsonpath='{.items[0].metadata.name}')" \
  --grace-period=0 --force
```
```yaml
# Using Litmus Chaos (more structured)
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-kill-experiment
spec:
  appinfo:
    appns: production
    applabel: "app=api"
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            - name: CHAOS_INTERVAL
              value: '10'
            - name: FORCE
              value: 'false'
```
```shell
# Experiment: Add latency to a dependency
# Using toxiproxy (great for local/staging)

# Add a latency toxic on the database proxy
toxiproxy-cli toxic add \
  --type latency \
  --attribute latency=500 \
  --attribute jitter=200 \
  --upstream \
  postgres_proxy

# Observe: Do API timeouts trigger? Do retries work?
# Does the circuit breaker open after N failures?

# Remove the toxic
toxiproxy-cli toxic remove \
  --toxicName latency_upstream \
  postgres_proxy
```
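Whichever injector you use, the add-toxic / observe / remove-toxic cycle deserves a harness that guarantees rollback even when observation throws. A sketch with hypothetical inject/observe/remove callbacks standing in for the toxiproxy-cli calls:

```typescript
// Runs one experiment step, guaranteeing the failure injection is removed
// even if observation throws (e.g. a steady-state assertion fails).
function runWithRollback<T>(
  inject: () => void,  // e.g. shell out to `toxiproxy-cli toxic add ...`
  observe: () => T,    // watch metrics, assert on steady state
  remove: () => void,  // e.g. `toxiproxy-cli toxic remove ...`
): T {
  inject();
  try {
    return observe();
  } finally {
    remove(); // rollback always runs, success or failure
  }
}

// Example with stub callbacks recording the order of operations:
const log: string[] = [];
try {
  runWithRollback(
    () => { log.push('inject latency'); },
    () => { log.push('observe'); throw new Error('steady state breached'); },
    () => { log.push('remove latency'); },
  );
} catch {
  // the experiment failed, but the toxic was still removed
}
console.log(log); // inject latency, observe, remove latency: rollback ran anyway
```

A forgotten toxic is the classic way a staging chaos experiment turns into a mystery outage the next morning; the `finally` block is the cheapest insurance against it.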
Tools Comparison
| Tool | Type | Best For | Cost | Learning Curve |
|---|---|---|---|---|
| Gremlin | SaaS platform | Enterprise, managed chaos | $$$ | Low |
| Litmus Chaos | Open source (CNCF) | Kubernetes-native chaos | Free | Medium |
| Chaos Mesh | Open source (CNCF) | Kubernetes, rich UI | Free | Medium |
| Toxiproxy | Open source proxy | Network failure simulation | Free | Low |
| AWS Fault Injection Simulator | AWS service | AWS infrastructure | $$ | Low |
| Chaos Monkey | Open source | Random instance termination | Free | Medium |
My recommendation for small teams: Start with Toxiproxy for network fault simulation (it's lightweight and works anywhere) and manual kubectl commands for Kubernetes experiments. Graduate to Litmus Chaos or Chaos Mesh when you need automated, scheduled experiments.
Opinionated: The Uncomfortable Truths About Chaos Engineering
1. You're not ready for chaos engineering if you don't have basic monitoring. If you can't measure your steady state, you can't verify your hypothesis. Invest in observability first (error tracking, APM, logging), then start chaos experiments.
2. Start in staging. Really. The "run in production" principle is aspirational. When you're starting out, run experiments in staging until you trust your blast radius controls. The first experiment that goes wrong will teach you more than any article.
3. Game days > automated chaos. Before automating, run "game days" — scheduled sessions where the team runs experiments together, observes results, and discusses findings. The learning from game days is 10x what you get from automated experiments that run silently.
4. Most teams discover their monitoring is broken, not their systems. The most common finding from first chaos experiments isn't "our retry logic doesn't work" — it's "our monitoring didn't alert us when things went wrong." This is still extremely valuable.
5. Don't do chaos engineering to impress people. "We run chaos engineering" is a great conference talk. But if you're a 3-person startup, your time is better spent writing integration tests and setting up basic health checks. Chaos engineering's ROI is highest for systems with many components and high availability requirements.
Action Plan: Your First 30 Days of Chaos Engineering
Week 1: Preparation
- Verify your monitoring: Can you see error rates, latency, throughput in real time?
- Identify your system's critical path (the user-facing flow that matters most)
- Document your steady state metrics
- Get leadership buy-in: explain the "controlled experiment" framing
Week 2: First Experiment (Staging)
- Choose Level 1: kill a single pod or add latency to one dependency
- Write the experiment template (hypothesis, blast radius, rollback plan)
- Run the experiment during a game day with the team watching
- Document findings and action items
Week 3: Fix and Re-run
- Address the findings from Week 2 (likely: fix retries, add timeouts, improve monitoring)
- Re-run the same experiment to verify the fix
- Run a second experiment (different failure mode)
Week 4: Production (If Ready)
- Run your validated experiment in production during low-traffic hours
- Have the on-call engineer standing by with the rollback plan
- Start planning a monthly game day schedule
- Write up results for the broader team
Sources
- Principles of Chaos Engineering
- Gremlin — State of Chaos Engineering 2024
- Netflix Tech Blog — Chaos Engineering
- LitmusChaos — CNCF Project
- Chaos Mesh — CNCF Project
- Shopify Toxiproxy
- O'Reilly — Chaos Engineering (Book)
I'm Ismat, and I build BirJob — Azerbaijan's job aggregator scraping 80+ sources daily.
