Blue-Green vs Canary vs Rolling Deployments: A Practical Comparison
The first time I deployed BirJob to production, I did it the simplest way possible: SSH into the server, `git pull`, `npm run build`, `pm2 restart`. The site was down for 47 seconds while the build ran. No one noticed — we had maybe 10 users at the time.
Today, with thousands of daily users and scrapers running around the clock, 47 seconds of downtime is unacceptable. So I learned deployment strategies: blue-green, canary, rolling, and their variants. Each has real tradeoffs in complexity, cost, safety, and speed.
This guide explains each strategy with concrete examples, architecture diagrams in code, comparison tables, and practical recommendations. No hand-waving about "best practices" — just the real tradeoffs you'll face.
Part 1: The Problem We're Solving
Deploying a new version of software involves replacing running code with new code. The challenge is doing this without:
- Downtime: Users shouldn't see errors or blank pages during deployment
- Data loss: In-flight requests shouldn't be dropped
- Breaking changes reaching all users at once: If the new version has a bug, you want to catch it before 100% of traffic is affected
- Inability to roll back: If something goes wrong, you need a fast path back to the previous version
Different deployment strategies solve these problems with different tradeoffs. According to Google Cloud's deployment strategy documentation, the choice depends on your risk tolerance, infrastructure capabilities, and team maturity.
Part 2: Rolling Deployment
The simplest zero-downtime strategy. Replace instances one at a time (or in small batches).
How It Works
```
Initial state: 4 instances running v1
[v1] [v1] [v1] [v1]   ← all serving traffic

Step 1: Take instance 1 out of the load balancer, deploy v2
[v2*] [v1] [v1] [v1]  ← v2 starting up

Step 2: Health check passes, instance 1 rejoins with v2
[v2] [v1] [v1] [v1]   ← mix of v1 and v2

Step 3: Repeat for instance 2
[v2] [v2] [v1] [v1]   ← 50/50 split

Step 4: Repeat for instances 3 and 4
[v2] [v2] [v2] [v2]   ← all v2, deployment complete
```
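The batching logic above can be sketched in a few lines. This is a simulation only — the drain, deploy, and health-check calls are stand-ins for whatever your orchestrator actually does:

```javascript
// Simulates rolling-update batching: replace `maxUnavailable` instances
// per batch and snapshot the fleet after each batch. Purely illustrative —
// in practice the orchestrator (e.g. Kubernetes) drives this loop for you.
function rollingUpdatePlan(instanceCount, maxUnavailable) {
  const versions = Array(instanceCount).fill('v1');
  const states = [];
  for (let i = 0; i < instanceCount; i += maxUnavailable) {
    // In a real deploy: drain these instances, deploy v2, wait for
    // health checks to pass, then re-add them to the load balancer.
    for (let j = i; j < Math.min(i + maxUnavailable, instanceCount); j++) {
      versions[j] = 'v2';
    }
    states.push([...versions]);
  }
  return states;
}

// 4 instances, 1 at a time → the four intermediate states shown above
console.log(rollingUpdatePlan(4, 1));
```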
Kubernetes Rolling Update
```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: birjob-api
spec:
  replicas: 4
  selector:
    matchLabels:
      app: birjob-api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1 # at most 1 pod down at a time
      maxSurge: 1       # at most 1 extra pod during rollout
  template:
    metadata:
      labels:
        app: birjob-api
    spec:
      containers:
        - name: api
          image: birjob/api:v2.0.0
          readinessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 10
```
```bash
# Deploy:
kubectl apply -f deployment.yaml

# Watch the rollout:
kubectl rollout status deployment/birjob-api

# Roll back:
kubectl rollout undo deployment/birjob-api
```
Rolling Deployment Analysis
| Aspect | Rating | Notes |
|---|---|---|
| Zero downtime | Yes | If health checks are configured correctly |
| Rollback speed | Slow | Must re-roll through all instances |
| Infrastructure cost | Low | Only 1 extra instance during deploy |
| Traffic splitting | No control | Mix of v1 and v2 during rollout |
| Complexity | Low | Built into Kubernetes, ECS, etc. |
| Version coexistence | Yes | v1 and v2 serve traffic simultaneously |
When to use: Stateless services with backward-compatible changes. This is the default strategy for most Kubernetes deployments.
When to avoid: Database schema changes that break backward compatibility. Changes that require all instances to be on the same version.
Part 3: Blue-Green Deployment
Maintain two identical production environments. At any time, one ("blue") is live and the other ("green") is idle. Deploy to the idle environment, test it, then switch traffic.
How It Works
```
Initial state:
BLUE (LIVE):  [v1] [v1] [v1] [v1]     ← serving all traffic
GREEN (IDLE): [---] [---] [---] [---] ← empty/previous version

Step 1: Deploy v2 to GREEN
BLUE (LIVE):  [v1] [v1] [v1] [v1]     ← still serving traffic
GREEN:        [v2] [v2] [v2] [v2]     ← deployed, not serving

Step 2: Run smoke tests against GREEN
(verify health checks, run integration tests, check key flows)

Step 3: Switch the load balancer to GREEN
BLUE:         [v1] [v1] [v1] [v1]     ← no longer serving (but ready)
GREEN (LIVE): [v2] [v2] [v2] [v2]     ← now serving all traffic

Rollback: switch the load balancer back to BLUE (instant!)
BLUE (LIVE):  [v1] [v1] [v1] [v1]     ← serving again
GREEN:        [v2] [v2] [v2] [v2]     ← idle
```
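Steps 2 and 3 reduce to a gate: flip traffic only if every smoke check passes. A minimal sketch — `checkHealth` and `switchTraffic` are hypothetical hooks standing in for real HTTP checks and a load balancer API call:

```javascript
// Gate the blue→green cutover on smoke tests. If any check against GREEN
// fails, traffic never moves and BLUE keeps serving. Hook names are
// illustrative; the real switch could be an ALB listener update.
async function smokeTestThenSwitch(checkHealth, switchTraffic, paths) {
  for (const path of paths) {
    const ok = await checkHealth(path); // e.g. expect HTTP 200 from GREEN
    if (!ok) {
      console.error(`Smoke test failed on ${path} — staying on BLUE`);
      return false;
    }
  }
  await switchTraffic('green'); // instant cutover; rollback = switch back
  return true;
}
```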
AWS Implementation
```bash
# Using an AWS ALB with two target groups

# Step 1: Deploy v2 to the green service
aws ecs update-service \
  --cluster birjob \
  --service birjob-api-green \
  --task-definition birjob-api:v2

# Step 2: Wait for green to be healthy
aws ecs wait services-stable \
  --cluster birjob \
  --services birjob-api-green

# Step 3: Switch the ALB listener to the green target group
aws elbv2 modify-listener \
  --listener-arn $LISTENER_ARN \
  --default-actions Type=forward,TargetGroupArn=$GREEN_TG_ARN

# Rollback: switch back to blue
aws elbv2 modify-listener \
  --listener-arn $LISTENER_ARN \
  --default-actions Type=forward,TargetGroupArn=$BLUE_TG_ARN
```
Blue-Green Analysis
| Aspect | Rating | Notes |
|---|---|---|
| Zero downtime | Yes | Instant switch |
| Rollback speed | Instant | Just switch the load balancer back |
| Infrastructure cost | High (2x) | Must maintain two full environments |
| Traffic splitting | All-or-nothing | 100% blue or 100% green |
| Complexity | Medium | Need infrastructure for two environments |
| Pre-deployment testing | Excellent | Full testing before any user sees v2 |
When to use: Mission-critical applications where instant rollback is essential. Regulatory environments requiring pre-deployment validation. According to Martin Fowler's description, blue-green is ideal when the cost of downtime exceeds the cost of double infrastructure.
When to avoid: Cost-sensitive environments. Databases that can't be shared between two app versions.
Part 4: Canary Deployment
Named after the canary in a coal mine, this strategy gradually shifts traffic from the old version to the new version while monitoring for errors.
How It Works
```
Initial state: 100% traffic → v1
[v1] [v1] [v1] [v1]

Step 1: Deploy v2 canary (1 instance), route 5% of traffic
[v1] [v1] [v1] [v2]   ← 5% to v2
Monitor: error rates, latency, business metrics

Step 2: If metrics look good, increase to 25%
[v1] [v1] [v2] [v2]   ← 25% to v2
Monitor for 15-30 minutes

Step 3: Increase to 50%
[v1] [v2] [v2] [v2]   ← 50% to v2
Monitor for 15-30 minutes

Step 4: Full rollout (100%)
[v2] [v2] [v2] [v2]   ← 100% to v2

Rollback at any step: route all traffic back to v1
(only a fraction of users was ever affected by any bug)
```
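Under the hood, percentage-based routing is often a hash-and-bucket decision. A minimal sketch (names illustrative) that also keeps each user sticky to one version, so nobody flip-flops between v1 and v2 mid-session:

```javascript
// Deterministically bucket a user into 0-99 and compare against the
// canary weight. Same user → same bucket → same version for the whole
// rollout step. The hash is a toy; any stable hash works.
function routeVersion(userId, canaryPercent) {
  let hash = 0;
  for (const ch of String(userId)) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // unsigned 32-bit
  }
  return hash % 100 < canaryPercent ? 'v2-canary' : 'v1-stable';
}
```

In practice you would let the traffic layer (nginx, Istio, an ALB) do this, as in the Argo Rollouts configuration below, but the decision it makes is essentially this one.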
Kubernetes with Argo Rollouts
```yaml
# canary-rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: birjob-api
spec:
  replicas: 4
  selector:
    matchLabels:
      app: birjob-api
  # (pod template omitted for brevity)
  strategy:
    canary:
      canaryService: birjob-api-canary
      stableService: birjob-api-stable
      trafficRouting:
        nginx:
          stableIngress: birjob-api-ingress
      steps:
        - setWeight: 5
        - pause: { duration: 5m }
        - setWeight: 25
        - pause: { duration: 10m }
        - setWeight: 50
        - pause: { duration: 15m }
        - setWeight: 100
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 1
        args:
          - name: service-name
            value: birjob-api-canary
---
# Automated analysis: abort the rollout if the success rate drops below 99%
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 60s
      count: 5
      successCondition: result[0] >= 0.99
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              status=~"2.."
            }[5m])) /
            sum(rate(http_requests_total{
              service="{{args.service-name}}"
            }[5m]))
```
Canary Analysis
| Aspect | Rating | Notes |
|---|---|---|
| Zero downtime | Yes | Gradual shift |
| Rollback speed | Fast | Route traffic back to stable |
| Infrastructure cost | Medium | Small number of extra instances |
| Traffic control | Excellent | Precise percentage control |
| Blast radius | Minimal | Only canary % affected by bugs |
| Complexity | High | Requires traffic management + monitoring |
When to use: High-traffic applications where you need to validate changes with real traffic before full rollout. Netflix's engineering blog details their automated canary analysis system (Kayenta), which they use for every production deployment.
When to avoid: Low-traffic applications (not enough traffic to detect issues at 5%). Simple CRUD apps where blue-green is sufficient.
Part 5: Other Strategies
A/B Testing Deployments
Similar to canary but routes specific user segments to v2 (not random traffic). Useful when testing a new feature with a specific cohort.
```javascript
// Route by user attribute
if (user.country === 'AZ' && user.id % 10 === 0) {
  // Route to v2 (10% of Azerbaijani users)
  routeToV2(request);
} else {
  routeToV1(request);
}
```
Shadow/Dark Deployment
Route a copy of production traffic to v2, but don't return v2's responses to users. Compare v1 and v2 outputs to detect differences.
```yaml
# Shadow deployment with Istio traffic mirroring
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: birjob-api
spec:
  hosts:
    - birjob-api
  http:
    - route:
        - destination:
            host: birjob-api-v1
          weight: 100
      mirror:
        host: birjob-api-v2
      mirrorPercentage:
        value: 100.0
```
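The comparison step itself can be sketched as middleware: return v1's response to the user, diff v2's response out of band. Handler and reporter names are illustrative; with Istio, the duplication above happens at the mesh layer instead:

```javascript
// Return v1's response to the caller; replay the request against v2 in
// the background and report any divergence. The shadow path must never
// slow down or break the real response.
async function shadowCompare(request, handleV1, handleV2, reportDiff) {
  const primary = await handleV1(request); // the user sees only this
  Promise.resolve()
    .then(() => handleV2(request))
    .then((shadow) => {
      if (JSON.stringify(shadow) !== JSON.stringify(primary)) {
        reportDiff({ request, primary, shadow });
      }
    })
    .catch(() => {}); // shadow failures are a signal to log, never to surface
  return primary;
}
```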
Feature Flags (Complementary)
Feature flags aren't a deployment strategy per se, but they complement all strategies. Deploy code with features disabled, then enable them independently of deployment.
```javascript
// Feature flag approach
async function getJobListings(query, user) {
  if (featureFlags.isEnabled('new-search-algorithm', user)) {
    return newSearchAlgorithm(query); // new code, deployed but flagged off by default
  }
  return oldSearchAlgorithm(query); // existing code
}
```
Part 6: Comparison Table
| Factor | Rolling | Blue-Green | Canary | Shadow |
|---|---|---|---|---|
| Zero downtime | Yes | Yes | Yes | Yes |
| Rollback speed | Minutes | Seconds | Seconds | N/A |
| Blast radius | Increasing | All or none | Controlled % | Zero (no user impact) |
| Infra cost | +25% | +100% | +25-50% | +100% |
| Complexity | Low | Medium | High | High |
| Version mixing | During rollout | None | During rollout | Isolated |
| Pre-prod testing | No | Yes | Partial | Full |
| Required monitoring | Basic | Basic | Advanced | Advanced |
| Best for | Standard deploys | Critical systems | High-traffic apps | Risky changes |
Part 7: The Database Problem
The elephant in the room: all deployment strategies assume that both v1 and v2 can work with the same database. This is fine for application code changes, but breaks when you need database schema changes.
The Expand-Contract Pattern
```sql
-- Phase 1: Expand (backward compatible)
-- Deploy: add the new column, keep the old columns
ALTER TABLE jobs ADD COLUMN location_json JSONB;
-- v1 and v2 both work (v1 uses the old columns, v2 writes to both)

-- Phase 2: Migrate existing data
UPDATE jobs
SET location_json = json_build_object('city', city, 'country', country)
WHERE location_json IS NULL;

-- Phase 3: Contract (after all instances run v2)
-- Deploy v3 with the old column references removed from code, then:
ALTER TABLE jobs DROP COLUMN city, DROP COLUMN country;
```
This three-phase approach ensures that at every point, both the old and new code versions work with the database.
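During the expand phase, v2's write path has to keep the old columns populated so that v1 readers keep working. A sketch of that dual write — the table shape matches the migration above, but the query-builder style is illustrative (any Postgres client works the same way):

```javascript
// v2 write path during the expand phase: write BOTH the old columns
// (city, country) and the new JSONB column, so v1 and v2 instances can
// coexist mid-rollout. The contract phase later drops the old columns.
function buildJobInsert(job) {
  return {
    text: `INSERT INTO jobs (title, city, country, location_json)
           VALUES ($1, $2, $3, $4)`,
    values: [
      job.title,
      job.city,    // old column: keeps v1 readers working
      job.country, // old column
      JSON.stringify({ city: job.city, country: job.country }), // new column
    ],
  };
}
```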
Part 8: My Opinionated Take
1. Start with rolling deployments. They're built into every container orchestrator, they work for 90% of use cases, and they require zero additional tooling. Don't over-engineer your deployment strategy until you've felt the pain that more complex strategies solve.
2. Add canary when you have enough traffic to detect issues. If you're serving 10 requests per minute, a 5% canary sees 0.5 requests per minute. You can't detect an error rate increase from that. Canary deployments need traffic volume to be effective — typically 100+ requests per minute.
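The arithmetic behind this point, as a one-liner:

```javascript
// Expected errors per minute observed by the canary, given total traffic,
// the canary's traffic share, and the new version's error rate.
function canaryErrorsPerMinute(requestsPerMinute, canaryPercent, errorRate) {
  return requestsPerMinute * (canaryPercent / 100) * errorRate;
}

// 10 rpm, 5% canary, even a severe 10% error rate:
// 10 × 0.05 × 0.1 ≈ 0.05 errors/min — roughly one error every 20 minutes.
console.log(canaryErrorsPerMinute(10, 5, 0.1));
```

At that signal level, no automated analysis can distinguish a broken canary from noise within a reasonable pause window.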
3. Blue-green is underrated. The instant rollback alone is worth the infrastructure cost for critical systems. And with cloud auto-scaling, the "double cost" is only during the deployment window, not 24/7.
4. Feature flags are more important than deployment strategies. The safest deployment is one where new features are deployed behind flags and enabled gradually. You deploy code constantly (low risk) and enable features deliberately (controlled risk). LaunchDarkly and open-source alternatives like Unleash make this straightforward.
5. The database is always the hard part. Application deployment strategies are well-understood. Database migration strategies during deployment are where most teams struggle. Master the expand-contract pattern before worrying about canary analysis.
Action Plan
Week 1: Assess
- Document your current deployment process (how long, how risky, how manual)
- Identify your biggest deployment risks (downtime, data loss, user impact)
- Check: do you have health checks configured for all services?
- Measure: how long does a rollback take today?
Week 2: Implement
- If no zero-downtime: implement rolling deployments (Kubernetes default)
- Add readiness and liveness probes to all services
- Practice a rollback in staging — make sure it works
- Set up basic deployment monitoring (error rates, latency)
Month 2+: Evolve
- If high-traffic: evaluate canary deployments with Argo Rollouts or Flagger
- If critical system: implement blue-green for instant rollback
- Add feature flags for risky changes
- Set up automated rollback triggers based on error rate thresholds
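That last item can be as simple as a threshold check with debouncing. A sketch — the rollback hook is hypothetical (it might run `kubectl rollout undo` or flip an ALB listener back to blue):

```javascript
// Trigger a rollback only after the error rate breaches the threshold on
// several consecutive checks, so a single noisy sample doesn't revert a
// healthy deploy. `rollback` is a stand-in for your real rollback command.
function makeRollbackTrigger({ threshold, consecutiveBreaches, rollback }) {
  let breaches = 0;
  return function check(errorRate) {
    breaches = errorRate > threshold ? breaches + 1 : 0;
    if (breaches >= consecutiveBreaches) {
      rollback();
      breaches = 0; // reset so one incident triggers one rollback
      return true;
    }
    return false;
  };
}
```

Wire `check` to whatever publishes your error rate (a Prometheus query on a timer, a metrics webhook) and you have a crude version of what Argo Rollouts' analysis templates do declaratively.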
Sources
- Google Cloud: Application Deployment and Testing Strategies
- Martin Fowler: Blue-Green Deployment
- Netflix Engineering: Automated Canary Analysis with Kayenta
- Argo Rollouts Documentation
- Unleash: Open-Source Feature Flags
I'm Ismat, and I build BirJob — Azerbaijan's job aggregator scraping 80+ sources daily.
