Zero-Downtime Deployments: A Practical Guide
At 2:37 AM on a Tuesday, I pushed a database migration to BirJob's production server. The migration took 47 seconds. During those 47 seconds, approximately 200 users got a blank page. Not a 500 error, not a maintenance page — just nothing. I found out the next morning through analytics, not an alert. That was the moment I decided zero-downtime deployments were no longer a nice-to-have.
This guide is everything I have learned since then — from blue-green deployments to rolling updates, from database migration strategies to health check patterns. Whether you are running a single VPS or a Kubernetes cluster, you will find actionable techniques here to eliminate deployment downtime.
1. Why Zero Downtime Matters: The Numbers
Downtime costs money. Gartner estimates the average cost of IT downtime at $5,600 per minute for enterprise companies. Even for small companies, the cost is real: lost revenue, damaged trust, and frustrated users who may never come back.
But the real argument for zero-downtime deployments is not about preventing catastrophic outages. It is about enabling continuous delivery. When deployments are scary, teams deploy less often. When teams deploy less often, each deployment contains more changes. More changes mean more risk. More risk means deployments are even scarier. It is a vicious cycle.
Google's DORA research shows that elite engineering teams deploy multiple times per day with a change failure rate below 5%. They can do this because their deployment process is safe and automated. Zero-downtime deployments are not a luxury — they are a prerequisite for engineering excellence.
Let us break down the most common strategies and when to use each one.
2. Deployment Strategies Compared
| Strategy | Downtime | Rollback Speed | Resource Cost | Complexity |
|---|---|---|---|---|
| Recreate | Yes (seconds to minutes) | Slow (redeploy old version) | 1x | Low |
| Rolling Update | None | Medium (roll back gradually) | 1x + fraction | Medium |
| Blue-Green | None | Instant (switch traffic back) | 2x | Medium |
| Canary | None | Fast (route traffic away) | 1x + small % | High |
| A/B Testing | None | Fast | 1x + small % | High |
| Shadow/Dark | None | N/A (not serving users) | 2x | Very High |
3. Blue-Green Deployments: The Workhorse
Blue-green deployment is the simplest strategy that achieves true zero downtime. The concept is straightforward: you maintain two identical production environments, "blue" and "green." At any time, one is live (serving traffic) and the other is idle (ready for the next deployment).
How It Works
- Blue is live, serving all production traffic
- Deploy new version to Green
- Run smoke tests against Green
- Switch the load balancer/DNS from Blue to Green
- Green is now live; Blue becomes the idle environment
- If something goes wrong, switch back to Blue instantly
Implementation with Nginx
Here is a practical implementation using Nginx as a reverse proxy. Note that `set_by_lua_block` requires the Lua module (bundled with OpenResty), not stock Nginx:

```nginx
# /etc/nginx/conf.d/app.conf
upstream blue {
    server 127.0.0.1:3000;
}

upstream green {
    server 127.0.0.1:3001;
}

# Map the active environment name to the matching upstream.
# Fall back to blue if the file contains anything unexpected.
map $active_deployment $backend {
    default  blue;
    "blue"   blue;
    "green"  green;
}

server {
    listen 80;
    server_name app.example.com;

    # /etc/nginx/active_deployment contains a single word, "blue" or "green",
    # written by the deployment script.
    set_by_lua_block $active_deployment {
        local f = io.open("/etc/nginx/active_deployment", "r")
        if not f then return "blue" end  -- fall back if the file is missing
        local content = f:read("*a")
        f:close()
        return content:gsub("%s+", "")
    }

    location / {
        proxy_pass http://$backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```
Your deployment script then becomes:
```bash
#!/bin/bash
set -euo pipefail

CURRENT=$(cat /etc/nginx/active_deployment)
if [ "$CURRENT" = "blue" ]; then
    TARGET="green"
    PORT=3001
else
    TARGET="blue"
    PORT=3000
fi

# Deploy the new version to the idle environment
cd "/opt/app-$TARGET"
git pull origin main
npm ci --production
npm run build

# Restart the idle environment's process (adjust to your process manager)
systemctl restart "app-$TARGET"

# Health check: give the new version up to 30 seconds to come up,
# and abort the deployment if it never becomes healthy
HEALTHY=false
for i in {1..30}; do
    if curl -sf "http://localhost:$PORT/health" > /dev/null; then
        HEALTHY=true
        break
    fi
    sleep 1
done

if [ "$HEALTHY" != "true" ]; then
    echo "Health check failed for $TARGET; traffic stays on $CURRENT." >&2
    exit 1
fi

# Switch traffic
echo "$TARGET" > /etc/nginx/active_deployment
nginx -s reload

echo "Deployed to $TARGET. Rollback: echo $CURRENT > /etc/nginx/active_deployment && nginx -s reload"
```
The beauty of this approach is the rollback: it is literally one line. Write the old environment name to the file and reload Nginx. Total rollback time: under 1 second.
4. Rolling Updates: For Container Orchestration
If you are running Kubernetes, Docker Swarm, or ECS, rolling updates are your default zero-downtime strategy. The orchestrator replaces instances one at a time, ensuring that healthy instances are always serving traffic.
Kubernetes Rolling Update
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: birjob-web
spec:
  replicas: 4
  selector:
    matchLabels:
      app: birjob-web
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Create 1 new pod before killing old ones
      maxUnavailable: 0  # Never have fewer than 4 pods running
  template:
    metadata:
      labels:
        app: birjob-web
    spec:
      containers:
        - name: web
          image: birjob/web:v2.3.1
          readinessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 15
            periodSeconds: 10
```
The critical settings here are maxUnavailable: 0 (never reduce capacity below the desired count) and the readinessProbe (only send traffic to pods that are ready). Without the readiness probe, Kubernetes might route traffic to a pod that is still starting up, causing errors.
Graceful Shutdown
Rolling updates require your application to handle graceful shutdowns. When Kubernetes sends SIGTERM to your pod, your application should:
- Stop accepting new connections
- Finish processing in-flight requests (with a timeout)
- Close database connections and flush buffers
- Exit cleanly
```javascript
// Node.js graceful shutdown example
const server = app.listen(3000);

process.on('SIGTERM', () => {
  console.log('SIGTERM received. Starting graceful shutdown...');

  // Stop accepting new connections; the callback fires once
  // all in-flight requests have completed
  server.close(() => {
    console.log('All connections closed. Exiting.');
    process.exit(0);
  });

  // Force exit after 30 seconds; unref() so this timer alone
  // cannot keep the process alive
  setTimeout(() => {
    console.error('Forced shutdown after timeout.');
    process.exit(1);
  }, 30000).unref();
});
```
According to Kubernetes documentation, the default grace period is 30 seconds. If your application takes longer than that to drain connections, increase the terminationGracePeriodSeconds in your pod spec.
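For example, a pod that needs a full minute to drain can declare that in its spec (60 is an illustrative value):

```yaml
spec:
  terminationGracePeriodSeconds: 60  # default is 30
```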
5. Canary Deployments: For High-Stakes Changes
Canary deployments route a small percentage of traffic to the new version while the majority continues hitting the old version. If the canary shows problems (elevated error rates, increased latency), you roll back before most users are affected.
| Phase | Canary Traffic | Duration | Decision Criteria |
|---|---|---|---|
| 1. Initial | 1% | 15 minutes | No 5xx errors, p99 latency < 500ms |
| 2. Expand | 10% | 30 minutes | Error rate < 0.1%, no alerts triggered |
| 3. Grow | 50% | 1 hour | All metrics normal, no user complaints |
| 4. Complete | 100% | - | Full rollout, old version decommissioned |
Tools like Flagger (for Kubernetes) and Ambassador can automate canary analysis, automatically promoting or rolling back based on metrics.
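As a sketch, a Flagger `Canary` resource automating a phased rollout like the table above might look like this (weights and thresholds are illustrative; check Flagger's documentation for the full schema):

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: birjob-web
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: birjob-web
  service:
    port: 3000
  analysis:
    interval: 1m
    threshold: 5       # roll back after 5 failed checks
    maxWeight: 50      # hand off to a full rollout at 50%
    stepWeight: 10     # shift traffic in 10% increments
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99      # require a 99% success rate to keep promoting
        interval: 1m
```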
6. The Hardest Part: Database Migrations
Zero-downtime application deployments are relatively straightforward. Zero-downtime database migrations are where things get genuinely difficult. The core problem: during a rolling update, both the old and new versions of your application are running simultaneously, and both are talking to the same database. If the new version expects a column that does not exist yet, or the old version expects a column you just dropped, you have a problem.
The Expand-Contract Pattern
The solution is the expand-contract (also called parallel change) pattern. Every breaking schema change is split into three phases:
Phase 1: Expand — Add new columns/tables without removing old ones. Both versions work.
```sql
-- Migration: add the new "full_name" column
-- (the old "first_name" + "last_name" columns still exist)
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);

-- Backfill existing data
UPDATE users SET full_name = first_name || ' ' || last_name WHERE full_name IS NULL;
```
Phase 2: Migrate — Deploy new application code that writes to both old and new columns but reads from the new one.
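In application code, the migrate phase is a dual write: every update persists both representations so old and new instances stay consistent. A minimal sketch (`buildUserUpdate` and the column names are illustrative, not from BirJob's codebase):

```javascript
// Phase 2 dual-write: keep writing the legacy columns while also writing
// the new canonical column, so old and new app versions read consistent data.
function buildUserUpdate(firstName, lastName) {
  return {
    first_name: firstName,                  // still read by old app instances
    last_name: lastName,
    full_name: `${firstName} ${lastName}`,  // read by new app instances
  };
}

// Same data in both shapes: old code reads first_name/last_name,
// new code reads only full_name.
console.log(buildUserUpdate('Ada', 'Lovelace').full_name); // "Ada Lovelace"
```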
Phase 3: Contract — Once all application instances are on the new version and the old columns are no longer read, remove them.
```sql
-- Only run this AFTER confirming no application code reads first_name/last_name
ALTER TABLE users DROP COLUMN first_name;
ALTER TABLE users DROP COLUMN last_name;
```
Non-Locking Migrations in PostgreSQL
Standard ALTER TABLE operations in PostgreSQL can acquire exclusive locks that block all reads and writes. For large tables, this can mean minutes of downtime. Use these alternatives:
| Operation | Locking Version | Non-Locking Version |
|---|---|---|
| Add column with default | `ALTER TABLE ... ADD COLUMN x ... DEFAULT 0` (PG < 11 rewrites the table) | PG 11+ handles this without a rewrite; on older versions, add the column, then `SET DEFAULT` separately |
| Add index | `CREATE INDEX` | `CREATE INDEX CONCURRENTLY` |
| Add NOT NULL constraint | `ALTER TABLE ... ALTER COLUMN ... SET NOT NULL` | Add a `CHECK` constraint with `NOT VALID` first, then `VALIDATE` separately |
| Change column type | `ALTER TABLE ... ALTER COLUMN ... TYPE` | Add a new column, backfill, swap reads, drop the old column |
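For example, the NOT NULL row translates into two separate statements (table and constraint names are illustrative):

```sql
-- Step 1: add the constraint without validating existing rows (brief lock only)
ALTER TABLE users ADD CONSTRAINT users_email_not_null
  CHECK (email IS NOT NULL) NOT VALID;

-- Step 2: validate separately; this scans the table but does not block writes
ALTER TABLE users VALIDATE CONSTRAINT users_email_not_null;
```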
Tools like strong_migrations (Ruby) and squawk (a standalone Postgres migration linter) can automatically flag migration operations that will cause locking issues.
7. Health Checks: The Foundation of Everything
Every zero-downtime strategy depends on health checks. Without them, your load balancer, orchestrator, or deployment script has no way to know if the new version is actually working. I have seen teams implement perfect blue-green deployments that still caused downtime because their health check was just return 200 — it did not verify the application could actually serve requests.
Three Types of Health Checks
Liveness: "Is the process alive?" — Checks that the application has not crashed or deadlocked. If this fails, restart the instance.
Readiness: "Can this instance serve traffic?" — Checks that the application is fully initialized, database connections are established, and caches are warm. If this fails, stop routing traffic to this instance but do not restart it.
Startup: "Has initial startup completed?" — For applications with long startup times, prevents liveness checks from killing the pod before it is ready.
```javascript
// Express.js health check implementation
app.get('/health/live', (req, res) => {
  // If we can respond, we are alive
  res.status(200).json({ status: 'alive' });
});

app.get('/health/ready', async (req, res) => {
  try {
    // Check database
    await db.query('SELECT 1');
    // Check Redis
    await redis.ping();
    // Check external dependencies
    const cacheWarmed = await cache.isWarmed();
    if (!cacheWarmed) {
      return res.status(503).json({ status: 'warming_cache' });
    }
    res.status(200).json({ status: 'ready' });
  } catch (error) {
    res.status(503).json({ status: 'not_ready', error: error.message });
  }
});
```
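In Kubernetes, the startup check maps to a `startupProbe`, which suppresses the liveness and readiness probes until it first succeeds (values illustrative):

```yaml
startupProbe:
  httpGet:
    path: /health/ready
    port: 3000
  failureThreshold: 30  # allow up to 30 * 5s = 150s for startup
  periodSeconds: 5
```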
8. Feature Flags: Decouple Deployment from Release
The most powerful concept in zero-downtime engineering is separating deployment (putting code on servers) from release (exposing features to users). Feature flags let you deploy code that is dormant, enable it for specific users or a percentage of traffic, and disable it instantly if problems arise.
```javascript
// Feature flag usage
if (featureFlags.isEnabled('new-search-algorithm', { userId: user.id })) {
  results = await newSearchAlgorithm(query);
} else {
  results = await legacySearch(query);
}
```
Tools like LaunchDarkly, Unleash (open source), and even simple database-backed flags make this straightforward. According to LaunchDarkly's research, teams using feature flags deploy 200% more frequently with 60% fewer incidents.
9. CI/CD Pipeline for Zero Downtime
Here is a complete CI/CD pipeline that incorporates all the strategies we have discussed:
```yaml
# .github/workflows/deploy.yml
name: Deploy

on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm test
      - run: npm run lint

  build:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t registry.example.com/app:${{ github.sha }} .
      - run: docker push registry.example.com/app:${{ github.sha }}

  deploy-canary:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to 5% of traffic
        run: |
          kubectl set image deployment/app-canary \
            app=registry.example.com/app:${{ github.sha }}
          kubectl rollout status deployment/app-canary
      - name: Wait and check metrics
        run: |
          sleep 300  # 5 minutes
          ERROR_RATE=$(curl -s http://prometheus:9090/api/v1/query \
            --data-urlencode 'query=rate(http_errors_total{version="canary"}[5m])' \
            | jq -r '.data.result[0].value[1]')
          if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
            echo "Error rate too high: $ERROR_RATE"
            kubectl rollout undo deployment/app-canary
            exit 1
          fi

  deploy-full:
    needs: deploy-canary
    runs-on: ubuntu-latest
    steps:
      - name: Rolling update to all pods
        run: |
          kubectl set image deployment/app \
            app=registry.example.com/app:${{ github.sha }}
          kubectl rollout status deployment/app --timeout=600s
```
10. My Opinionated Take
After running deployments across various setups — from single VPS to Kubernetes clusters — here is what I believe:
Blue-green is underrated. The industry has a bias toward Kubernetes and canary deployments because they are more sophisticated. But for most applications, especially monoliths running on 1-3 servers, blue-green with Nginx is simpler, faster to set up, and provides instant rollback. Do not over-engineer your deployment strategy.
Database migrations are the real bottleneck. You can have the most sophisticated deployment pipeline in the world, but if your migrations take an exclusive lock on a 100-million-row table, nothing else matters. Invest disproportionately in migration safety.
Feature flags are the most important tool here. They change the psychology of deployments. When you can deploy code that is off by default and turn it on gradually, deployments stop being events and become routine. That psychological shift is worth more than any technical improvement.
Health checks should be treated as first-class features. I have seen teams spend weeks on deployment pipelines but write health checks in 5 minutes. Your health check is the only thing standing between a broken deployment and your users. It deserves careful thought and testing.
11. Action Plan: Going Zero-Downtime in 2 Weeks
Week 1: Foundation
- Implement proper health checks (liveness + readiness) in your application
- Add graceful shutdown handling for SIGTERM
- Review all pending database migrations — do any require table locks? Rewrite them using expand-contract
- Set up a staging environment that mirrors production
Week 2: Implementation
- Choose your strategy: blue-green for simple setups, rolling updates for Kubernetes
- Implement the deployment pipeline with automated health check verification
- Add rollback automation — one command, under 30 seconds
- Run a test deployment during business hours (yes, during business hours — that is the whole point)
- Set up monitoring alerts for error rate spikes post-deployment
Sources
- Gartner — Cost of IT Downtime
- Google DORA — State of DevOps Report
- Kubernetes — Pod Lifecycle Documentation
- Martin Fowler — Blue-Green Deployment
- LaunchDarkly — Feature Management Research
- Flagger — Progressive Delivery for Kubernetes
I'm Ismat, and I build BirJob — Azerbaijan's job aggregator scraping 80+ sources daily.
