Blue-Green vs Canary vs Rolling Deployments: A Practical Comparison
The first time I deployed BirJob to production, I did it the simplest way possible: SSH into the server, `git pull`, `npm run build`, `pm2 restart`. The site was down for 47 seconds while the build ran. No one noticed — we had maybe 10 users at the time.
Today, with thousands of daily users and scrapers running around the clock, 47 seconds of downtime is unacceptable. So I learned deployment strategies: blue-green, canary, rolling, and their variants. Each has real tradeoffs in complexity, cost, safety, and speed.
This guide explains each strategy with concrete examples, architecture diagrams in code, comparison tables, and practical recommendations. No hand-waving about "best practices" — just the real tradeoffs you'll face.
Part 1: The Problem We're Solving
Deploying a new version of software involves replacing running code with new code. The challenge is doing this without:
- Downtime: Users shouldn't see errors or blank pages during deployment
- Data loss: In-flight requests shouldn't be dropped
- Breaking changes reaching all users at once: If the new version has a bug, you want to catch it before 100% of traffic is affected
- Inability to roll back: If something goes wrong, you need a fast path back to the previous version
Different deployment strategies solve these problems with different tradeoffs. According to Google Cloud's deployment strategy documentation, the choice depends on your risk tolerance, infrastructure capabilities, and team maturity.
Part 2: Rolling Deployment
The simplest zero-downtime strategy. Replace instances one at a time (or in small batches).
How It Works
```
Initial state: 4 instances running v1
[v1] [v1] [v1] [v1]   ← all serving traffic

Step 1: Take instance 1 out of the load balancer, deploy v2
[v2*] [v1] [v1] [v1]  ← v2 starting up

Step 2: Health check passes, instance 1 rejoins with v2
[v2] [v1] [v1] [v1]   ← mix of v1 and v2

Step 3: Repeat for instance 2
[v2] [v2] [v1] [v1]   ← 50/50 split

Step 4: Repeat for instances 3 and 4
[v2] [v2] [v2] [v2]   ← all v2, deployment complete
```
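The batching logic above can be sketched in a few lines. This is a simulation only — the drain, deploy, and health-check calls are stand-ins for whatever your orchestrator actually does:

```javascript
// Simulates rolling-update batching: replace `maxUnavailable` instances
// per batch and snapshot the fleet after each batch. Purely illustrative —
// in practice the orchestrator (e.g. Kubernetes) drives this loop for you.
function rollingUpdatePlan(instanceCount, maxUnavailable) {
  const versions = Array(instanceCount).fill('v1');
  const states = [];
  for (let i = 0; i < instanceCount; i += maxUnavailable) {
    // In a real deploy: drain these instances, deploy v2, wait for
    // health checks to pass, then re-add them to the load balancer.
    for (let j = i; j < Math.min(i + maxUnavailable, instanceCount); j++) {
      versions[j] = 'v2';
    }
    states.push([...versions]);
  }
  return states;
}

// 4 instances, 1 at a time → the four intermediate states shown above
console.log(rollingUpdatePlan(4, 1));
```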
Kubernetes Rolling Update
```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: birjob-api
spec:
  replicas: 4
  selector:
    matchLabels:
      app: birjob-api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1 # at most 1 pod down at a time
      maxSurge: 1       # at most 1 extra pod during rollout
  template:
    metadata:
      labels:
        app: birjob-api
    spec:
      containers:
        - name: api
          image: birjob/api:v2.0.0
          readinessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 10
```
```bash
# Deploy:
kubectl apply -f deployment.yaml

# Watch the rollout:
kubectl rollout status deployment/birjob-api

# Roll back:
kubectl rollout undo deployment/birjob-api
```
Rolling Deployment Analysis
| Aspect | Rating | Notes |
|---|---|---|
| Zero downtime | Yes | If health checks are configured correctly |
| Rollback speed | Slow | Must re-roll through all instances |
| Infrastructure cost | Low | Only 1 extra instance during deploy |
| Traffic splitting | No control | Mix of v1 and v2 during rollout |
| Complexity | Low | Built into Kubernetes, ECS, etc. |
| Version coexistence | Yes | v1 and v2 serve traffic simultaneously |
When to use: Stateless services with backward-compatible changes. This is the default strategy for most Kubernetes deployments.
When to avoid: Database schema changes that break backward compatibility. Changes that require all instances to be on the same version.
Part 3: Blue-Green Deployment
Maintain two identical production environments. At any time, one ("blue") is live and the other ("green") is idle. Deploy to the idle environment, test it, then switch traffic.
How It Works
```
Initial state:
BLUE (LIVE):  [v1] [v1] [v1] [v1]     ← serving all traffic
GREEN (IDLE): [---] [---] [---] [---] ← empty/previous version

Step 1: Deploy v2 to GREEN
BLUE (LIVE):  [v1] [v1] [v1] [v1]     ← still serving traffic
GREEN:        [v2] [v2] [v2] [v2]     ← deployed, not serving

Step 2: Run smoke tests against GREEN
(verify health checks, run integration tests, check key flows)

Step 3: Switch the load balancer to GREEN
BLUE:         [v1] [v1] [v1] [v1]     ← no longer serving (but ready)
GREEN (LIVE): [v2] [v2] [v2] [v2]     ← now serving all traffic

Rollback: switch the load balancer back to BLUE (instant!)
BLUE (LIVE):  [v1] [v1] [v1] [v1]     ← serving again
GREEN:        [v2] [v2] [v2] [v2]     ← idle
```
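Steps 2 and 3 reduce to a gate: flip traffic only if every smoke check passes. A minimal sketch — `checkHealth` and `switchTraffic` are hypothetical hooks standing in for real HTTP checks and a load balancer API call:

```javascript
// Gate the blue→green cutover on smoke tests. If any check against GREEN
// fails, traffic never moves and BLUE keeps serving. Hook names are
// illustrative; the real switch could be an ALB listener update.
async function smokeTestThenSwitch(checkHealth, switchTraffic, paths) {
  for (const path of paths) {
    const ok = await checkHealth(path); // e.g. expect HTTP 200 from GREEN
    if (!ok) {
      console.error(`Smoke test failed on ${path} — staying on BLUE`);
      return false;
    }
  }
  await switchTraffic('green'); // instant cutover; rollback = switch back
  return true;
}
```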
AWS Implementation
```bash
# Using an AWS ALB with two target groups

# Step 1: Deploy v2 to the green service
aws ecs update-service \
  --cluster birjob \
  --service birjob-api-green \
  --task-definition birjob-api:v2

# Step 2: Wait for green to be healthy
aws ecs wait services-stable \
  --cluster birjob \
  --services birjob-api-green

# Step 3: Switch the ALB listener to the green target group
aws elbv2 modify-listener \
  --listener-arn $LISTENER_ARN \
  --default-actions Type=forward,TargetGroupArn=$GREEN_TG_ARN

# Rollback: switch back to blue
aws elbv2 modify-listener \
  --listener-arn $LISTENER_ARN \
  --default-actions Type=forward,TargetGroupArn=$BLUE_TG_ARN
```
Blue-Green Analysis
| Aspect | Rating | Notes |
|---|---|---|
| Zero downtime | Yes | Instant switch |
| Rollback speed | Instant | Just switch the load balancer back |
| Infrastructure cost | High (2x) | Must maintain two full environments |
| Traffic splitting | All-or-nothing | 100% blue or 100% green |
| Complexity | Medium | Need infrastructure for two environments |
| Pre-deployment testing | Excellent | Full testing before any user sees v2 |
When to use: Mission-critical applications where instant rollback is essential. Regulatory environments requiring pre-deployment validation. According to Martin Fowler's description, blue-green is ideal when the cost of downtime exceeds the cost of double infrastructure.
When to avoid: Cost-sensitive environments. Databases that can't be shared between two app versions.
Part 4: Canary Deployment
Named after the canary in a coal mine, this strategy gradually shifts traffic from the old version to the new version while monitoring for errors.
How It Works
```
Initial state: 100% traffic → v1
[v1] [v1] [v1] [v1]

Step 1: Deploy v2 canary (1 instance), route 5% of traffic
[v1] [v1] [v1] [v2]   ← 5% to v2
Monitor: error rates, latency, business metrics

Step 2: If metrics look good, increase to 25%
[v1] [v1] [v2] [v2]   ← 25% to v2
Monitor for 15-30 minutes

Step 3: Increase to 50%
[v1] [v2] [v2] [v2]   ← 50% to v2
Monitor for 15-30 minutes

Step 4: Full rollout (100%)
[v2] [v2] [v2] [v2]   ← 100% to v2

Rollback at any step: route all traffic back to v1
(only a fraction of users was ever affected by any bug)
```
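Under the hood, percentage-based routing is often a hash-and-bucket decision. A minimal sketch (names illustrative) that also keeps each user sticky to one version, so nobody flip-flops between v1 and v2 mid-session:

```javascript
// Deterministically bucket a user into 0-99 and compare against the
// canary weight. Same user → same bucket → same version for the whole
// rollout step. The hash is a toy; any stable hash works.
function routeVersion(userId, canaryPercent) {
  let hash = 0;
  for (const ch of String(userId)) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // unsigned 32-bit
  }
  return hash % 100 < canaryPercent ? 'v2-canary' : 'v1-stable';
}
```

In practice you would let the traffic layer (nginx, Istio, an ALB) do this, as in the Argo Rollouts configuration below, but the decision it makes is essentially this one.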
Kubernetes with Argo Rollouts
```yaml
# canary-rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: birjob-api
spec:
  replicas: 4
  selector:
    matchLabels:
      app: birjob-api
  # (pod template omitted for brevity)
  strategy:
    canary:
      canaryService: birjob-api-canary
      stableService: birjob-api-stable
      trafficRouting:
        nginx:
          stableIngress: birjob-api-ingress
      steps:
        - setWeight: 5
        - pause: { duration: 5m }
        - setWeight: 25
        - pause: { duration: 10m }
        - setWeight: 50
        - pause: { duration: 15m }
        - setWeight: 100
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 1
        args:
          - name: service-name
            value: birjob-api-canary
---
# Automated analysis: abort the rollout if the success rate drops below 99%
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 60s
      count: 5
      successCondition: result[0] >= 0.99
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              status=~"2.."
            }[5m])) /
            sum(rate(http_requests_total{
              service="{{args.service-name}}"
            }[5m]))
```
Canary Analysis
| Aspect | Rating | Notes |
|---|---|---|
| Zero downtime | Yes | Gradual shift |
| Rollback speed | Fast | Route traffic back to stable |
| Infrastructure cost | Medium | Small number of extra instances |
| Traffic control | Excellent | Precise percentage control |
| Blast radius | Minimal | Only canary % affected by bugs |
| Complexity | High | Requires traffic management + monitoring |
When to use: High-traffic applications where you need to validate changes with real traffic before full rollout. Netflix's engineering blog details their automated canary analysis system (Kayenta), which they use for every production deployment.
When to avoid: Low-traffic applications (not enough traffic to detect issues at 5%). Simple CRUD apps where blue-green is sufficient.
Part 5: Other Strategies
A/B Testing Deployments
Similar to canary but routes specific user segments to v2 (not random traffic). Useful when testing a new feature with a specific cohort.
```javascript
// Route by user attribute
if (user.country === 'AZ' && user.id % 10 === 0) {
  // Route to v2 (10% of Azerbaijani users)
  routeToV2(request);
} else {
  routeToV1(request);
}
```
Shadow/Dark Deployment
Route a copy of production traffic to v2, but don't return v2's responses to users. Compare v1 and v2 outputs to detect differences.
```yaml
# Shadow deployment with Istio traffic mirroring
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: birjob-api
spec:
  hosts:
    - birjob-api
  http:
    - route:
        - destination:
            host: birjob-api-v1
          weight: 100
      mirror:
        host: birjob-api-v2
      mirrorPercentage:
        value: 100.0
```
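The comparison step itself can be sketched as middleware: return v1's response to the user, diff v2's response out of band. Handler and reporter names are illustrative; with Istio, the duplication above happens at the mesh layer instead:

```javascript
// Return v1's response to the caller; replay the request against v2 in
// the background and report any divergence. The shadow path must never
// slow down or break the real response.
async function shadowCompare(request, handleV1, handleV2, reportDiff) {
  const primary = await handleV1(request); // the user sees only this
  Promise.resolve()
    .then(() => handleV2(request))
    .then((shadow) => {
      if (JSON.stringify(shadow) !== JSON.stringify(primary)) {
        reportDiff({ request, primary, shadow });
      }
    })
    .catch(() => {}); // shadow failures are a signal to log, never to surface
  return primary;
}
```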
Feature Flags (Complementary)
Feature flags aren't a deployment strategy per se, but they complement all strategies. Deploy code with features disabled, then enable them independently of deployment.
```javascript
// Feature flag approach
async function getJobListings(query, user) {
  if (featureFlags.isEnabled('new-search-algorithm', user)) {
    return newSearchAlgorithm(query); // new code, deployed but flagged off by default
  }
  return oldSearchAlgorithm(query); // existing code
}
```
Part 6: Comparison Table
| Factor | Rolling | Blue-Green | Canary | Shadow |
|---|---|---|---|---|
| Zero downtime | Yes | Yes | Yes | Yes |
| Rollback speed | Minutes | Seconds | Seconds | N/A |
| Blast radius | Increasing | All or none | Controlled % | Zero (no user impact) |
| Infra cost | +25% | +100% | +25-50% | +100% |
| Complexity | Low | Medium | High | High |
| Version mixing | During rollout | None | During rollout | Isolated |
| Pre-prod testing | No | Yes | Partial | Full |
| Required monitoring | Basic | Basic | Advanced | Advanced |
| Best for | Standard deploys | Critical systems | High-traffic apps | Risky changes |
Part 7: The Database Problem
The elephant in the room: all deployment strategies assume that both v1 and v2 can work with the same database. This is fine for application code changes, but breaks when you need database schema changes.
The Expand-Contract Pattern
```sql
-- Phase 1: Expand (backward compatible)
-- Deploy: add the new column, keep the old columns
ALTER TABLE jobs ADD COLUMN location_json JSONB;
-- v1 and v2 both work (v1 uses the old columns, v2 writes to both)

-- Phase 2: Migrate existing data
UPDATE jobs
SET location_json = json_build_object('city', city, 'country', country)
WHERE location_json IS NULL;

-- Phase 3: Contract (after all instances run v2)
-- Deploy v3 with the old column references removed from code, then:
ALTER TABLE jobs DROP COLUMN city, DROP COLUMN country;
```
This three-phase approach ensures that at every point, both the old and new code versions work with the database.
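During the expand phase, v2's write path has to keep the old columns populated so that v1 readers keep working. A sketch of that dual write — the table shape matches the migration above, but the query-builder style is illustrative (any Postgres client works the same way):

```javascript
// v2 write path during the expand phase: write BOTH the old columns
// (city, country) and the new JSONB column, so v1 and v2 instances can
// coexist mid-rollout. The contract phase later drops the old columns.
function buildJobInsert(job) {
  return {
    text: `INSERT INTO jobs (title, city, country, location_json)
           VALUES ($1, $2, $3, $4)`,
    values: [
      job.title,
      job.city,    // old column: keeps v1 readers working
      job.country, // old column
      JSON.stringify({ city: job.city, country: job.country }), // new column
    ],
  };
}
```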
Part 8: My Opinionated Take
1. Start with rolling deployments. They're built into every container orchestrator, they work for 90% of use cases, and they require zero additional tooling. Don't over-engineer your deployment strategy until you've felt the pain that more complex strategies solve.
2. Add canary when you have enough traffic to detect issues. If you're serving 10 requests per minute, a 5% canary sees 0.5 requests per minute. You can't detect an error rate increase from that. Canary deployments need traffic volume to be effective — typically 100+ requests per minute.
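The arithmetic behind this point, as a one-liner:

```javascript
// Expected errors per minute observed by the canary, given total traffic,
// the canary's traffic share, and the new version's error rate.
function canaryErrorsPerMinute(requestsPerMinute, canaryPercent, errorRate) {
  return requestsPerMinute * (canaryPercent / 100) * errorRate;
}

// 10 rpm, 5% canary, even a severe 10% error rate:
// 10 × 0.05 × 0.1 ≈ 0.05 errors/min — roughly one error every 20 minutes.
console.log(canaryErrorsPerMinute(10, 5, 0.1));
```

At that signal level, no automated analysis can distinguish a broken canary from noise within a reasonable pause window.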
3. Blue-green is underrated. The instant rollback alone is worth the infrastructure cost for critical systems. And with cloud auto-scaling, the "double cost" is only during the deployment window, not 24/7.
4. Feature flags are more important than deployment strategies. The safest deployment is one where new features are deployed behind flags and enabled gradually. You deploy code constantly (low risk) and enable features deliberately (controlled risk). LaunchDarkly and open-source alternatives like Unleash make this straightforward.
5. The database is always the hard part. Application deployment strategies are well-understood. Database migration strategies during deployment are where most teams struggle. Master the expand-contract pattern before worrying about canary analysis.
Action Plan
Week 1: Assess
- Document your current deployment process (how long, how risky, how manual)
- Identify your biggest deployment risks (downtime, data loss, user impact)
- Check: do you have health checks configured for all services?
- Measure: how long does a rollback take today?
Week 2: Implement
- If no zero-downtime: implement rolling deployments (Kubernetes default)
- Add readiness and liveness probes to all services
- Practice a rollback in staging — make sure it works
- Set up basic deployment monitoring (error rates, latency)
Month 2+: Evolve
- If high-traffic: evaluate canary deployments with Argo Rollouts or Flagger
- If critical system: implement blue-green for instant rollback
- Add feature flags for risky changes
- Set up automated rollback triggers based on error rate thresholds
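That last item can be as simple as a threshold check with debouncing. A sketch — the rollback hook is hypothetical (it might run `kubectl rollout undo` or flip an ALB listener back to blue):

```javascript
// Trigger a rollback only after the error rate breaches the threshold on
// several consecutive checks, so a single noisy sample doesn't revert a
// healthy deploy. `rollback` is a stand-in for your real rollback command.
function makeRollbackTrigger({ threshold, consecutiveBreaches, rollback }) {
  let breaches = 0;
  return function check(errorRate) {
    breaches = errorRate > threshold ? breaches + 1 : 0;
    if (breaches >= consecutiveBreaches) {
      rollback();
      breaches = 0; // reset so one incident triggers one rollback
      return true;
    }
    return false;
  };
}
```

Wire `check` to whatever publishes your error rate (a Prometheus query on a timer, a metrics webhook) and you have a crude version of what Argo Rollouts' analysis templates do declaratively.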
Sources
- Google Cloud: Application Deployment and Testing Strategies
- Martin Fowler: Blue-Green Deployment
- Netflix Engineering: Automated Canary Analysis with Kayenta
- Argo Rollouts Documentation
- Unleash: Open-Source Feature Flags
I'm Ismat, and I build BirJob — Azerbaijan's job aggregator scraping 80+ sources daily.
