Chaos Engineering: Breaking Things on Purpose to Build Resilience
At 2:47 AM on a Tuesday, our job aggregator's database connection pool was exhausted. Not because of traffic — because a single scraper hung for 8 minutes while holding a connection, triggering a cascade that starved every other service of database access. The site returned 500 errors for 23 minutes. We had retry logic, we had connection timeouts, we had circuit breakers — but none of them had ever been tested under this specific failure mode. After the post-mortem, I decided to start breaking things deliberately.
Chaos engineering is the discipline of experimenting on a production system to build confidence in its ability to withstand turbulent conditions. This guide covers the principles, practical implementation, and a realistic starting framework for teams of all sizes — not just Netflix-scale operations.
What Chaos Engineering Actually Is
The term was coined by the Netflix Chaos Engineering team in their Principles of Chaos Engineering manifesto. The definition is precise: "Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production."
Note what it's NOT:
| Chaos Engineering IS | Chaos Engineering IS NOT |
|---|---|
| Controlled experiments with hypotheses | Randomly breaking things to see what happens |
| Running in production (or production-like environments) | Only running in staging |
| Automated, repeatable experiments | Manual ad-hoc testing |
| Minimizing blast radius | Taking down the whole system |
| Verifying that resilience mechanisms work | Finding bugs (that's a side effect) |
According to Gremlin's 2024 State of Chaos Engineering report, 63% of organizations now practice some form of chaos engineering, up from 28% in 2019. Of those practicing it, 91% report discovering critical failures before they caused customer-facing incidents.
The Five Principles of Chaos Engineering
From the Principles of Chaos Engineering manifesto:
- Build a Hypothesis Around Steady State Behavior. Define what "normal" looks like with metrics: response time p99 < 200ms, error rate < 0.1%, throughput > 1000 req/s. Your experiment tests whether this steady state holds under turbulence.
- Vary Real-World Events. Inject failures that actually happen: server crashes, network partitions, disk full, dependency timeouts, DNS failures, clock skew, memory leaks, certificate expiration.
- Run Experiments in Production. Production has traffic patterns, data volumes, and configuration that staging doesn't. However, start in staging if you're new to chaos engineering — production experiments require mature monitoring and blast radius controls.
- Automate Experiments to Run Continuously. Manual experiments provide one-time confidence. Automated, scheduled experiments provide ongoing assurance. Run them weekly or after every deployment.
- Minimize Blast Radius. Start small. Affect 1% of traffic, not 100%. Kill one pod, not the whole deployment. If the experiment causes unexpected customer impact, stop immediately.
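Principles 1 and 5 combine naturally in code: define steady-state bounds up front, then poll metrics during the experiment and abort the moment any bound breaches. A minimal sketch; the metric names and thresholds here are illustrative, not tied to any particular monitoring tool:

```typescript
// Steady-state definition: each metric has a bound the experiment must not breach.
type Bound = { max?: number; min?: number };

const steadyState: Record<string, Bound> = {
  p99LatencyMs: { max: 300 },   // p99 must stay under 300ms
  errorRatePct: { max: 1 },     // error rate must stay under 1%
  throughputRps: { min: 500 },  // throughput must not collapse
};

// Returns the names of metrics currently outside their bounds.
function breaches(
  metrics: Record<string, number>,
  bounds: Record<string, Bound>,
): string[] {
  return Object.entries(bounds)
    .filter(([name, b]) => {
      const v = metrics[name];
      if (v === undefined) return true; // missing data is itself a breach
      return (b.max !== undefined && v > b.max) || (b.min !== undefined && v < b.min);
    })
    .map(([name]) => name);
}

// During the experiment: poll metrics, abort on any breach.
const healthy = breaches({ p99LatencyMs: 240, errorRatePct: 0.3, throughputRps: 900 }, steadyState);
console.log(healthy); // no breaches: steady state holds, experiment continues

const degraded = breaches({ p99LatencyMs: 850, errorRatePct: 4.2, throughputRps: 900 }, steadyState);
console.log(degraded); // latency and error rate breach: stop and roll back
```

Treating "metric missing" as a breach matters: losing observability mid-experiment is exactly when you should not keep injecting failure.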
The Chaos Experiment Framework
Every chaos experiment follows this structure:
```typescript
// Chaos Experiment Template
interface ChaosExperiment {
  // 1. Define steady state
  steadyState: {
    metric: string;       // "API response time p99"
    normalValue: string;  // "< 200ms"
    measurement: string;  // "Datadog APM"
  };

  // 2. Form hypothesis
  hypothesis: string;
  // "When we terminate 1 of 3 API pods, the load balancer will
  //  redistribute traffic and p99 will stay under 300ms"

  // 3. Design experiment
  experiment: {
    action: string;       // "Kill 1 API pod via kubectl delete"
    blastRadius: string;  // "33% of API capacity"
    duration: string;     // "5 minutes"
    rollback: string;     // "Pod auto-restarts via Deployment"
  };

  // 4. Run and observe
  observation: {
    metrics: string[];    // Watch: error rate, latency, throughput
    alerting: boolean;    // Will on-call be paged? Warn them.
    runbook: string;      // Steps if experiment causes a real outage
  };

  // 5. Analyze results
  result: 'hypothesis_confirmed' | 'hypothesis_disproved';
  findings: string;
  actionItems: string[];
}
```
Example: Database Connection Pool Exhaustion
```typescript
const dbPoolExperiment: ChaosExperiment = {
  steadyState: {
    metric: "Job search API response time p99",
    normalValue: "< 500ms",
    measurement: "Sentry Performance Monitoring",
  },

  hypothesis:
    "When the database connection pool is reduced from 20 to 5 connections, " +
    "the connection queue will handle the overflow and response times " +
    "will increase to < 1000ms but not cause errors",

  experiment: {
    action: "Set DATABASE_POOL_SIZE=5 (from 20) via environment variable update",
    blastRadius: "All API requests that hit the database",
    duration: "10 minutes during low-traffic hours (2 AM - 3 AM)",
    rollback: "Revert DATABASE_POOL_SIZE to 20, restart pods",
  },

  observation: {
    metrics: [
      "pg_stat_activity.count (active DB connections)",
      "API response time p50, p95, p99",
      "API error rate (5xx)",
      "Connection pool wait time",
    ],
    alerting: true, // On-call engineer aware and standing by
    runbook: "If error rate > 5% or p99 > 3000ms, immediately roll back",
  },

  // After running:
  result: 'hypothesis_disproved',
  findings:
    "With pool_size=5, the connection queue filled within 30 seconds. " +
    "After the queue timeout (30s), requests received 500 errors. " +
    "Error rate hit 12% within 2 minutes. Root cause: our ORM " +
    "holds connections during the entire request lifecycle, not " +
    "just during queries.",
  actionItems: [
    "Implement connection release after each query (not after request)",
    "Add connection pool monitoring to dashboard",
    "Set connection_timeout to 5s (currently 30s — too long)",
    "Add circuit breaker on database connections",
  ],
};
```
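One of those action items, a circuit breaker on database connections, can be sketched as a small state machine. This is an illustrative toy (thresholds and cooldown are made up), not the breaker we actually shipped:

```typescript
// Minimal circuit breaker: closed -> open after N consecutive failures,
// then half-open after a cooldown; one success closes it again.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(
    private failureThreshold = 5, // consecutive failures before opening
    private cooldownMs = 10_000,  // how long to stay open
    private now: () => number = Date.now, // injectable clock for testing
  ) {}

  state(): 'closed' | 'open' | 'half-open' {
    if (this.openedAt === null) return 'closed';
    return this.now() - this.openedAt >= this.cooldownMs ? 'half-open' : 'open';
  }

  // Wraps a call; fails fast while open instead of holding a connection.
  exec<T>(fn: () => T): T {
    if (this.state() === 'open') throw new Error('circuit open: failing fast');
    try {
      const result = fn();
      this.failures = 0;
      this.openedAt = null; // a success in half-open closes the circuit
      return result;
    } catch (err) {
      this.failures++;
      if (this.failures >= this.failureThreshold) this.openedAt = this.now();
      throw err;
    }
  }
}
```

The point of failing fast is visible in the experiment above: while the breaker is open, requests error immediately instead of queueing for 30 seconds on a connection that will never arrive.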
Chaos Experiments Catalog: Start Here
Ranked by difficulty and risk, from "safe for beginners" to "requires mature practice":
| Level | Experiment | What It Tests | Risk |
|---|---|---|---|
| 1 (Beginner) | Kill a single pod/container | Auto-restart, load balancing | Low |
| 1 | Add 200ms latency to an internal service | Timeout handling, retry logic | Low |
| 1 | Return 500 from a non-critical dependency | Graceful degradation, fallbacks | Low |
| 2 (Intermediate) | Fill disk to 95% | Disk monitoring, log rotation | Medium |
| 2 | DNS resolution failure for one dependency | DNS caching, circuit breakers | Medium |
| 2 | Clock skew (NTP drift) | Token expiry, caching, logs | Medium |
| 3 (Advanced) | Network partition between services | Split-brain handling, consensus | High |
| 3 | Database leader failover | Replica promotion, write availability | High |
| 3 | Availability zone failure simulation | Multi-AZ resilience, data replication | High |
| 4 (Expert) | Gradual memory leak injection | OOM handling, alerting timeliness | Very High |
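The Level 1 experiment "return 500 from a non-critical dependency" is testing graceful degradation: the user should get a degraded response, not an error page. A minimal sketch of that fallback pattern; `getRecommendations` is a hypothetical non-critical dependency, not a real service of ours:

```typescript
// Graceful degradation: if a non-critical dependency fails, serve a
// degraded result instead of propagating the error to the user.
let degradedCount = 0; // export to metrics so chaos runs are observable

function withFallback<T>(primary: () => T, fallback: T): T {
  try {
    return primary();
  } catch {
    degradedCount++; // the experiment asserts this counter rises, not the 5xx rate
    return fallback;
  }
}

// Hypothetical non-critical dependency: personalized job recommendations.
function getRecommendations(): string[] {
  throw new Error('503 from recommendations service'); // injected failure
}

const jobs = withFallback(getRecommendations, []); // degrade to an empty list
console.log(jobs, degradedCount); // degraded to [] and counted one degradation
```

During the experiment you watch two things: the fallback counter climbing (expected) and the user-facing error rate staying flat (the hypothesis).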
Implementing Level 1 Experiments
```shell
# Experiment: Kill a pod
# Hypothesis: Kubernetes will restart it within 30s, no user impact

# Using kubectl directly (jsonpath grabs the first matching pod;
# pick one at random yourself if you want true randomness).
# kubectl requires --force alongside --grace-period=0 for a hard kill.
kubectl delete pod \
  "$(kubectl get pods -l app=api -o jsonpath='{.items[0].metadata.name}')" \
  --grace-period=0 --force
```
```yaml
# Using Litmus Chaos (more structured)
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-kill-experiment
spec:
  appinfo:
    appns: production
    applabel: "app=api"
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            - name: CHAOS_INTERVAL
              value: '10'
            - name: FORCE
              value: 'false'
```
```shell
# Experiment: Add latency to a dependency
# Using toxiproxy (great for local/staging)

# Add a latency toxic on the database proxy
toxiproxy-cli toxic add \
  --type latency \
  --attribute latency=500 \
  --attribute jitter=200 \
  --upstream \
  postgres_proxy

# Observe: Do API timeouts trigger? Do retries work?
# Does the circuit breaker open after N failures?

# Remove the toxic
toxiproxy-cli toxic remove \
  --toxicName latency_upstream \
  postgres_proxy
```
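Whichever injector you use, the add-toxic / observe / remove-toxic cycle deserves a harness that guarantees rollback even when observation throws. A sketch with hypothetical inject/observe/remove callbacks standing in for the toxiproxy-cli calls:

```typescript
// Runs one experiment step, guaranteeing the failure injection is removed
// even if observation throws (e.g. a steady-state assertion fails).
function runWithRollback<T>(
  inject: () => void,  // e.g. shell out to `toxiproxy-cli toxic add ...`
  observe: () => T,    // watch metrics, assert on steady state
  remove: () => void,  // e.g. `toxiproxy-cli toxic remove ...`
): T {
  inject();
  try {
    return observe();
  } finally {
    remove(); // rollback always runs, success or failure
  }
}

// Example with stub callbacks recording the order of operations:
const log: string[] = [];
try {
  runWithRollback(
    () => { log.push('inject latency'); },
    () => { log.push('observe'); throw new Error('steady state breached'); },
    () => { log.push('remove latency'); },
  );
} catch {
  // the experiment failed, but the toxic was still removed
}
console.log(log); // inject latency, observe, remove latency: rollback ran anyway
```

A forgotten toxic is the classic way a staging chaos experiment turns into a mystery outage the next morning; the `finally` block is the cheapest insurance against it.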
Tools Comparison
| Tool | Type | Best For | Cost | Learning Curve |
|---|---|---|---|---|
| Gremlin | SaaS platform | Enterprise, managed chaos | $$$ | Low |
| Litmus Chaos | Open source (CNCF) | Kubernetes-native chaos | Free | Medium |
| Chaos Mesh | Open source (CNCF) | Kubernetes, rich UI | Free | Medium |
| Toxiproxy | Open source proxy | Network failure simulation | Free | Low |
| AWS Fault Injection Simulator | AWS service | AWS infrastructure | $$ | Low |
| Chaos Monkey | Open source | Random instance termination | Free | Medium |
My recommendation for small teams: Start with Toxiproxy for network fault simulation (it's lightweight and works anywhere) and manual kubectl commands for Kubernetes experiments. Graduate to Litmus Chaos or Chaos Mesh when you need automated, scheduled experiments.
Opinionated: The Uncomfortable Truths About Chaos Engineering
1. You're not ready for chaos engineering if you don't have basic monitoring. If you can't measure your steady state, you can't verify your hypothesis. Invest in observability first (error tracking, APM, logging), then start chaos experiments.
2. Start in staging. Really. The "run in production" principle is aspirational. When you're starting out, run experiments in staging until you trust your blast radius controls. The first experiment that goes wrong will teach you more than any article.
3. Game days > automated chaos. Before automating, run "game days" — scheduled sessions where the team runs experiments together, observes results, and discusses findings. The learning from game days is 10x what you get from automated experiments that run silently.
4. Most teams discover their monitoring is broken, not their systems. The most common finding from first chaos experiments isn't "our retry logic doesn't work" — it's "our monitoring didn't alert us when things went wrong." This is still extremely valuable.
5. Don't do chaos engineering to impress people. "We run chaos engineering" is a great conference talk. But if you're a 3-person startup, your time is better spent writing integration tests and setting up basic health checks. Chaos engineering's ROI is highest for systems with many components and high availability requirements.
Action Plan: Your First 30 Days of Chaos Engineering
Week 1: Preparation
- Verify your monitoring: Can you see error rates, latency, throughput in real time?
- Identify your system's critical path (the user-facing flow that matters most)
- Document your steady state metrics
- Get leadership buy-in: explain the "controlled experiment" framing
Week 2: First Experiment (Staging)
- Choose Level 1: kill a single pod or add latency to one dependency
- Write the experiment template (hypothesis, blast radius, rollback plan)
- Run the experiment during a game day with the team watching
- Document findings and action items
Week 3: Fix and Re-run
- Address the findings from Week 2 (likely: fix retries, add timeouts, improve monitoring)
- Re-run the same experiment to verify the fix
- Run a second experiment (different failure mode)
Week 4: Production (If Ready)
- Run your validated experiment in production during low-traffic hours
- Have the on-call engineer standing by with the rollback plan
- Start planning a monthly game day schedule
- Write up results for the broader team
Sources
- Principles of Chaos Engineering
- Gremlin — State of Chaos Engineering 2024
- Netflix Tech Blog — Chaos Engineering
- LitmusChaos — CNCF Project
- Chaos Mesh — CNCF Project
- Shopify Toxiproxy
- O'Reilly — Chaos Engineering (Book)
I'm Ismat, and I build BirJob — Azerbaijan's job aggregator scraping 80+ sources daily.
