Zero-Downtime Deployments: A Practical Guide
At 2:37 AM on a Tuesday, I pushed a database migration to BirJob's production server. The migration took 47 seconds. During those 47 seconds, approximately 200 users got a blank page. Not a 500 error, not a maintenance page — just nothing. I found out the next morning through analytics, not an alert. That was the moment I decided zero-downtime deployments were no longer a nice-to-have.
This guide is everything I have learned since then — from blue-green deployments to rolling updates, from database migration strategies to health check patterns. Whether you are running a single VPS or a Kubernetes cluster, you will find actionable techniques here to eliminate deployment downtime.
1. Why Zero Downtime Matters: The Numbers
Downtime costs money. Gartner estimates the average cost of IT downtime at $5,600 per minute for enterprise companies. Even for small companies, the cost is real: lost revenue, damaged trust, and frustrated users who may never come back.
But the real argument for zero-downtime deployments is not about preventing catastrophic outages. It is about enabling continuous delivery. When deployments are scary, teams deploy less often. When teams deploy less often, each deployment contains more changes. More changes mean more risk. More risk means deployments are even scarier. It is a vicious cycle.
Google's DORA research shows that elite engineering teams deploy multiple times per day with a change failure rate below 5%. They can do this because their deployment process is safe and automated. Zero-downtime deployments are not a luxury — they are a prerequisite for engineering excellence.
Let us break down the most common strategies and when to use each one.
2. Deployment Strategies Compared
| Strategy | Downtime | Rollback Speed | Resource Cost | Complexity |
|---|---|---|---|---|
| Recreate | Yes (seconds to minutes) | Slow (redeploy old version) | 1x | Low |
| Rolling Update | None | Medium (roll back gradually) | 1x + fraction | Medium |
| Blue-Green | None | Instant (switch traffic back) | 2x | Medium |
| Canary | None | Fast (route traffic away) | 1x + small % | High |
| A/B Testing | None | Fast | 1x + small % | High |
| Shadow/Dark | None | N/A (not serving users) | 2x | Very High |
3. Blue-Green Deployments: The Workhorse
Blue-green deployment is the simplest strategy that achieves true zero downtime. The concept is straightforward: you maintain two identical production environments, "blue" and "green." At any time, one is live (serving traffic) and the other is idle (ready for the next deployment).
How It Works
- Blue is live, serving all production traffic
- Deploy new version to Green
- Run smoke tests against Green
- Switch the load balancer/DNS from Blue to Green
- Green is now live; Blue becomes the idle environment
- If something goes wrong, switch back to Blue instantly
Implementation with Nginx
Here is a practical implementation using Nginx as a reverse proxy. Note that `set_by_lua_block` requires the Lua module (bundled with OpenResty), not stock Nginx:

```nginx
# /etc/nginx/conf.d/app.conf
upstream blue {
    server 127.0.0.1:3000;
}

upstream green {
    server 127.0.0.1:3001;
}

# Map the active environment name to the matching upstream.
# Fall back to blue if the file contains anything unexpected.
map $active_deployment $backend {
    default  blue;
    "blue"   blue;
    "green"  green;
}

server {
    listen 80;
    server_name app.example.com;

    # /etc/nginx/active_deployment contains a single word, "blue" or "green",
    # written by the deployment script.
    set_by_lua_block $active_deployment {
        local f = io.open("/etc/nginx/active_deployment", "r")
        if not f then return "blue" end  -- fall back if the file is missing
        local content = f:read("*a")
        f:close()
        return content:gsub("%s+", "")
    }

    location / {
        proxy_pass http://$backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```
Your deployment script then becomes:
```bash
#!/bin/bash
set -euo pipefail

CURRENT=$(cat /etc/nginx/active_deployment)
if [ "$CURRENT" = "blue" ]; then
    TARGET="green"
    PORT=3001
else
    TARGET="blue"
    PORT=3000
fi

# Deploy the new version to the idle environment
cd "/opt/app-$TARGET"
git pull origin main
npm ci --production
npm run build

# Restart the idle environment's process (adjust to your process manager)
systemctl restart "app-$TARGET"

# Health check: give the new version up to 30 seconds to come up,
# and abort the deployment if it never becomes healthy
HEALTHY=false
for i in {1..30}; do
    if curl -sf "http://localhost:$PORT/health" > /dev/null; then
        HEALTHY=true
        break
    fi
    sleep 1
done

if [ "$HEALTHY" != "true" ]; then
    echo "Health check failed for $TARGET; traffic stays on $CURRENT." >&2
    exit 1
fi

# Switch traffic
echo "$TARGET" > /etc/nginx/active_deployment
nginx -s reload

echo "Deployed to $TARGET. Rollback: echo $CURRENT > /etc/nginx/active_deployment && nginx -s reload"
```
The beauty of this approach is the rollback: it is literally one line. Write the old environment name to the file and reload Nginx. Total rollback time: under 1 second.
4. Rolling Updates: For Container Orchestration
If you are running Kubernetes, Docker Swarm, or ECS, rolling updates are your default zero-downtime strategy. The orchestrator replaces instances one at a time, ensuring that healthy instances are always serving traffic.
Kubernetes Rolling Update
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: birjob-web
spec:
  replicas: 4
  selector:
    matchLabels:
      app: birjob-web
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Create 1 new pod before killing old ones
      maxUnavailable: 0  # Never have fewer than 4 pods running
  template:
    metadata:
      labels:
        app: birjob-web
    spec:
      containers:
        - name: web
          image: birjob/web:v2.3.1
          readinessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 15
            periodSeconds: 10
```
The critical settings here are maxUnavailable: 0 (never reduce capacity below the desired count) and the readinessProbe (only send traffic to pods that are ready). Without the readiness probe, Kubernetes might route traffic to a pod that is still starting up, causing errors.
Graceful Shutdown
Rolling updates require your application to handle graceful shutdowns. When Kubernetes sends SIGTERM to your pod, your application should:
- Stop accepting new connections
- Finish processing in-flight requests (with a timeout)
- Close database connections and flush buffers
- Exit cleanly
```javascript
// Node.js graceful shutdown example
const server = app.listen(3000);

process.on('SIGTERM', () => {
  console.log('SIGTERM received. Starting graceful shutdown...');

  // Stop accepting new connections; the callback fires once
  // all in-flight requests have completed
  server.close(() => {
    console.log('All connections closed. Exiting.');
    process.exit(0);
  });

  // Force exit after 30 seconds; unref() so this timer alone
  // cannot keep the process alive
  setTimeout(() => {
    console.error('Forced shutdown after timeout.');
    process.exit(1);
  }, 30000).unref();
});
```
According to Kubernetes documentation, the default grace period is 30 seconds. If your application takes longer than that to drain connections, increase the terminationGracePeriodSeconds in your pod spec.
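For example, a pod that needs a full minute to drain can declare that in its spec (60 is an illustrative value):

```yaml
spec:
  terminationGracePeriodSeconds: 60  # default is 30
```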
5. Canary Deployments: For High-Stakes Changes
Canary deployments route a small percentage of traffic to the new version while the majority continues hitting the old version. If the canary shows problems (elevated error rates, increased latency), you roll back before most users are affected.
| Phase | Canary Traffic | Duration | Decision Criteria |
|---|---|---|---|
| 1. Initial | 1% | 15 minutes | No 5xx errors, p99 latency < 500ms |
| 2. Expand | 10% | 30 minutes | Error rate < 0.1%, no alerts triggered |
| 3. Grow | 50% | 1 hour | All metrics normal, no user complaints |
| 4. Complete | 100% | - | Full rollout, old version decommissioned |
Tools like Flagger (for Kubernetes) and Ambassador can automate canary analysis, automatically promoting or rolling back based on metrics.
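As a sketch, a Flagger `Canary` resource automating a phased rollout like the table above might look like this (weights and thresholds are illustrative; check Flagger's documentation for the full schema):

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: birjob-web
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: birjob-web
  service:
    port: 3000
  analysis:
    interval: 1m
    threshold: 5       # roll back after 5 failed checks
    maxWeight: 50      # hand off to a full rollout at 50%
    stepWeight: 10     # shift traffic in 10% increments
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99      # require a 99% success rate to keep promoting
        interval: 1m
```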
6. The Hardest Part: Database Migrations
Zero-downtime application deployments are relatively straightforward. Zero-downtime database migrations are where things get genuinely difficult. The core problem: during a rolling update, both the old and new versions of your application are running simultaneously, and both are talking to the same database. If the new version expects a column that does not exist yet, or the old version expects a column you just dropped, you have a problem.
The Expand-Contract Pattern
The solution is the expand-contract (also called parallel change) pattern. Every breaking schema change is split into three phases:
Phase 1: Expand — Add new columns/tables without removing old ones. Both versions work.
```sql
-- Migration: add the new "full_name" column
-- (the old "first_name" + "last_name" columns still exist)
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);

-- Backfill existing data
UPDATE users SET full_name = first_name || ' ' || last_name WHERE full_name IS NULL;
```
Phase 2: Migrate — Deploy new application code that writes to both old and new columns but reads from the new one.
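In application code, the migrate phase is a dual write: every update persists both representations so old and new instances stay consistent. A minimal sketch (`buildUserUpdate` and the column names are illustrative, not from BirJob's codebase):

```javascript
// Phase 2 dual-write: keep writing the legacy columns while also writing
// the new canonical column, so old and new app versions read consistent data.
function buildUserUpdate(firstName, lastName) {
  return {
    first_name: firstName,                  // still read by old app instances
    last_name: lastName,
    full_name: `${firstName} ${lastName}`,  // read by new app instances
  };
}

// Same data in both shapes: old code reads first_name/last_name,
// new code reads only full_name.
console.log(buildUserUpdate('Ada', 'Lovelace').full_name); // "Ada Lovelace"
```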
Phase 3: Contract — Once all application instances are on the new version and the old columns are no longer read, remove them.
```sql
-- Only run this AFTER confirming no application code reads first_name/last_name
ALTER TABLE users DROP COLUMN first_name;
ALTER TABLE users DROP COLUMN last_name;
```
Non-Locking Migrations in PostgreSQL
Standard ALTER TABLE operations in PostgreSQL can acquire exclusive locks that block all reads and writes. For large tables, this can mean minutes of downtime. Use these alternatives:
| Operation | Locking Version | Non-Locking Version |
|---|---|---|
| Add column with default | `ALTER TABLE ... ADD COLUMN x ... DEFAULT 0` (PG < 11 rewrites the table) | PG 11+ handles this without a rewrite; on older versions, add the column, then `SET DEFAULT` separately |
| Add index | `CREATE INDEX` | `CREATE INDEX CONCURRENTLY` |
| Add NOT NULL constraint | `ALTER TABLE ... ALTER COLUMN ... SET NOT NULL` | Add a `CHECK` constraint with `NOT VALID` first, then `VALIDATE` separately |
| Change column type | `ALTER TABLE ... ALTER COLUMN ... TYPE` | Add a new column, backfill, swap reads, drop the old column |
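For example, the NOT NULL row translates into two separate statements (table and constraint names are illustrative):

```sql
-- Step 1: add the constraint without validating existing rows (brief lock only)
ALTER TABLE users ADD CONSTRAINT users_email_not_null
  CHECK (email IS NOT NULL) NOT VALID;

-- Step 2: validate separately; this scans the table but does not block writes
ALTER TABLE users VALIDATE CONSTRAINT users_email_not_null;
```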
Tools like strong_migrations (Ruby) and squawk (a standalone Postgres migration linter) can automatically flag migration operations that will cause locking issues.
7. Health Checks: The Foundation of Everything
Every zero-downtime strategy depends on health checks. Without them, your load balancer, orchestrator, or deployment script has no way to know if the new version is actually working. I have seen teams implement perfect blue-green deployments that still caused downtime because their health check was just return 200 — it did not verify the application could actually serve requests.
Three Types of Health Checks
Liveness: "Is the process alive?" — Checks that the application has not crashed or deadlocked. If this fails, restart the instance.
Readiness: "Can this instance serve traffic?" — Checks that the application is fully initialized, database connections are established, and caches are warm. If this fails, stop routing traffic to this instance but do not restart it.
Startup: "Has initial startup completed?" — For applications with long startup times, prevents liveness checks from killing the pod before it is ready.
```javascript
// Express.js health check implementation
app.get('/health/live', (req, res) => {
  // If we can respond, we are alive
  res.status(200).json({ status: 'alive' });
});

app.get('/health/ready', async (req, res) => {
  try {
    // Check database
    await db.query('SELECT 1');
    // Check Redis
    await redis.ping();
    // Check external dependencies
    const cacheWarmed = await cache.isWarmed();
    if (!cacheWarmed) {
      return res.status(503).json({ status: 'warming_cache' });
    }
    res.status(200).json({ status: 'ready' });
  } catch (error) {
    res.status(503).json({ status: 'not_ready', error: error.message });
  }
});
```
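In Kubernetes, the startup check maps to a `startupProbe`, which suppresses the liveness and readiness probes until it first succeeds (values illustrative):

```yaml
startupProbe:
  httpGet:
    path: /health/ready
    port: 3000
  failureThreshold: 30  # allow up to 30 * 5s = 150s for startup
  periodSeconds: 5
```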
8. Feature Flags: Decouple Deployment from Release
The most powerful concept in zero-downtime engineering is separating deployment (putting code on servers) from release (exposing features to users). Feature flags let you deploy code that is dormant, enable it for specific users or a percentage of traffic, and disable it instantly if problems arise.
```javascript
// Feature flag usage
if (featureFlags.isEnabled('new-search-algorithm', { userId: user.id })) {
  results = await newSearchAlgorithm(query);
} else {
  results = await legacySearch(query);
}
```
Tools like LaunchDarkly, Unleash (open source), and even simple database-backed flags make this straightforward. According to LaunchDarkly's research, teams using feature flags deploy 200% more frequently with 60% fewer incidents.
9. CI/CD Pipeline for Zero Downtime
Here is a complete CI/CD pipeline that incorporates all the strategies we have discussed:
```yaml
# .github/workflows/deploy.yml
name: Deploy

on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm test
      - run: npm run lint

  build:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t registry.example.com/app:${{ github.sha }} .
      - run: docker push registry.example.com/app:${{ github.sha }}

  deploy-canary:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to 5% of traffic
        run: |
          kubectl set image deployment/app-canary \
            app=registry.example.com/app:${{ github.sha }}
          kubectl rollout status deployment/app-canary
      - name: Wait and check metrics
        run: |
          sleep 300  # 5 minutes
          ERROR_RATE=$(curl -s http://prometheus:9090/api/v1/query \
            --data-urlencode 'query=rate(http_errors_total{version="canary"}[5m])' \
            | jq -r '.data.result[0].value[1]')
          if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
            echo "Error rate too high: $ERROR_RATE"
            kubectl rollout undo deployment/app-canary
            exit 1
          fi

  deploy-full:
    needs: deploy-canary
    runs-on: ubuntu-latest
    steps:
      - name: Rolling update to all pods
        run: |
          kubectl set image deployment/app \
            app=registry.example.com/app:${{ github.sha }}
          kubectl rollout status deployment/app --timeout=600s
```
10. My Opinionated Take
After running deployments across various setups — from single VPS to Kubernetes clusters — here is what I believe:
Blue-green is underrated. The industry has a bias toward Kubernetes and canary deployments because they are more sophisticated. But for most applications, especially monoliths running on 1-3 servers, blue-green with Nginx is simpler, faster to set up, and provides instant rollback. Do not over-engineer your deployment strategy.
Database migrations are the real bottleneck. You can have the most sophisticated deployment pipeline in the world, but if your migrations take an exclusive lock on a 100-million-row table, nothing else matters. Invest disproportionately in migration safety.
Feature flags are the most important tool here. They change the psychology of deployments. When you can deploy code that is off by default and turn it on gradually, deployments stop being events and become routine. That psychological shift is worth more than any technical improvement.
Health checks should be treated as first-class features. I have seen teams spend weeks on deployment pipelines but write health checks in 5 minutes. Your health check is the only thing standing between a broken deployment and your users. It deserves careful thought and testing.
11. Action Plan: Going Zero-Downtime in 2 Weeks
Week 1: Foundation
- Implement proper health checks (liveness + readiness) in your application
- Add graceful shutdown handling for SIGTERM
- Review all pending database migrations — do any require table locks? Rewrite them using expand-contract
- Set up a staging environment that mirrors production
Week 2: Implementation
- Choose your strategy: blue-green for simple setups, rolling updates for Kubernetes
- Implement the deployment pipeline with automated health check verification
- Add rollback automation — one command, under 30 seconds
- Run a test deployment during business hours (yes, during business hours — that is the whole point)
- Set up monitoring alerts for error rate spikes post-deployment
Sources
- Gartner — Cost of IT Downtime
- Google DORA — State of DevOps Report
- Kubernetes — Pod Lifecycle Documentation
- Martin Fowler — Blue-Green Deployment
- LaunchDarkly — Feature Management Research
- Flagger — Progressive Delivery for Kubernetes
I'm Ismat, and I build BirJob — Azerbaijan's job aggregator scraping 80+ sources daily.
