The Complete Guide to Database Migrations Without Downtime
The worst outage I ever caused was a database migration. A simple ALTER TABLE to add a column to a 40-million-row table. I expected it to take a few seconds. It took 47 minutes. During those 47 minutes, the table was locked, every query that touched it timed out, and the API returned 500 errors to every user. The post-mortem was brutal.
Database migrations are the most dangerous routine operation in software engineering. They touch your most critical asset (your data), they're hard to reverse, and they're often performed under time pressure ("just deploy and run the migration, it'll be fine"). But with the right patterns and tools, you can run migrations on tables with billions of rows without your users noticing.
This guide covers everything: the theory behind zero-downtime migrations, specific strategies for each database engine, the tools that make it practical, and the hard-won lessons from running hundreds of migrations in production.
Why Migrations Cause Downtime
To understand how to avoid downtime, you need to understand why it happens in the first place. The root cause is almost always locks.
How Database Locks Work During Schema Changes
When you run ALTER TABLE in most databases, the engine needs to acquire a lock on the table. The type of lock depends on the operation:
| Operation | PostgreSQL | MySQL (InnoDB) | Lock Type |
|---|---|---|---|
| Add column (nullable, no default) | Instant (metadata only) | Instant (8.0.12+) | Metadata lock (brief) |
| Add column with default value | Instant (11+) | Instant (8.0.12+) | Metadata lock (brief) |
| Add NOT NULL column with default | Instant (11+) | Instant (8.0.12+) | Metadata lock (brief) |
| Drop column | Instant (metadata) | Table rebuild | Metadata / Full |
| Rename column | Instant | Instant (8.0+) | Metadata lock (brief) |
| Change column type | Table rewrite | Table rebuild | Full table lock |
| Add index | CONCURRENTLY option | Online DDL | Shared / None |
| Drop index | Instant | Instant | Metadata lock (brief) |
The critical insight: any operation that requires a table rewrite will lock the table for the duration of the rewrite. For small tables, this is milliseconds. For a 100GB table, it can be hours.
Even "instant" operations in PostgreSQL can cause problems due to metadata lock contention. If a long-running query holds a lock on the table, your ALTER TABLE will wait for that lock, and every subsequent query will queue behind your ALTER TABLE. This is the "lock queue stampede" problem, and it's caught more experienced engineers than I'd like to admit.
According to PostgreSQL documentation, even operations that are logically instant still need to acquire an ACCESS EXCLUSIVE lock briefly, which can block behind long-running transactions.
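Before running DDL, it's worth checking for exactly those long-running transactions. A minimal sketch against PostgreSQL's standard pg_stat_activity view (the one-minute threshold is an arbitrary starting point, not a recommendation):

```sql
-- Find transactions that have been open long enough to make an
-- ALTER TABLE queue -- and everything else queue behind it.
SELECT pid,
       state,
       now() - xact_start AS xact_age,
       left(query, 80)    AS query
FROM pg_stat_activity
WHERE xact_start < now() - interval '1 minute'
ORDER BY xact_start;
```

Kill the offenders (pg_terminate_backend) or wait for them before taking the lock.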
The Expand-Contract Pattern
The expand-contract pattern (also called "parallel change") is the foundation of zero-downtime migrations. The idea is simple: never make a breaking change in a single step. Instead, expand the schema to support both old and new formats, migrate the data, then contract the schema by removing the old format.
Example: Renaming a Column
You want to rename user_name to username. Here's the zero-downtime approach:
Step 1 - Expand: Add the new column username. Keep user_name.
Step 2 - Dual Write: Update your application to write to both columns. Add a database trigger to keep them in sync.
Step 3 - Backfill: Copy existing data from user_name to username in batches.
Step 4 - Switch Reads: Update your application to read from username instead of user_name.
Step 5 - Stop Dual Write: Update your application to only write to username.
Step 6 - Contract: Drop the user_name column.
Yes, this is six deployments for a column rename. That's the cost of zero downtime. Each step is independently safe and reversible.
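As a sketch of what the expand and dual-write steps might look like in PostgreSQL, using the column names from the example (treat this as illustrative, not a drop-in migration — a production sync trigger must decide which column is authoritative when both change):

```sql
-- Step 1 (expand): add the new column alongside the old one.
ALTER TABLE users ADD COLUMN username text;

-- Step 2 (dual write): a trigger fills in whichever column the
-- application didn't write. This naive version only copies missing
-- values; it does not resolve conflicting updates to both columns.
CREATE OR REPLACE FUNCTION sync_username() RETURNS trigger AS $$
BEGIN
    NEW.username  := COALESCE(NEW.username, NEW.user_name);
    NEW.user_name := COALESCE(NEW.user_name, NEW.username);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_sync_username
    BEFORE INSERT OR UPDATE ON users
    FOR EACH ROW EXECUTE FUNCTION sync_username();

-- Step 6 (contract), only after backfill and both application changes:
-- DROP TRIGGER users_sync_username ON users;
-- ALTER TABLE users DROP COLUMN user_name;
```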
PostgreSQL-Specific Strategies
Adding Indexes Concurrently
PostgreSQL's CREATE INDEX CONCURRENTLY builds the index without blocking writes. It takes longer than a regular CREATE INDEX (roughly 2-3x), but it doesn't lock the table.
Caveat: If a concurrent index build fails (which can happen due to deadlocks or constraint violations), it leaves an invalid index behind. Always check for invalid indexes after the operation via the pg_index catalog: SELECT indexrelid::regclass FROM pg_index WHERE NOT indisvalid.
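A sketch of the full flow (the index and table names are illustrative):

```sql
-- Builds without blocking writes; cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY idx_users_username ON users (username);

-- If the build failed, an invalid index was left behind. Find it:
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;

-- Drop the invalid index, then retry the concurrent build:
-- DROP INDEX CONCURRENTLY idx_users_username;
```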
Lock Timeout
Set a lock timeout before running migrations. If the migration can't acquire the necessary lock within the timeout, it fails fast instead of queuing and causing a stampede:
SET lock_timeout = '5s';
This is the single most important PostgreSQL setting for safe migrations. According to Braintree's engineering blog, they set a 2-second lock timeout for all production migrations.
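In practice that might look like the following (the DDL itself is a placeholder; SET LOCAL scopes the timeout to this transaction only):

```sql
BEGIN;
SET LOCAL lock_timeout = '5s';
-- If the ACCESS EXCLUSIVE lock can't be acquired within 5 seconds,
-- this statement errors and the transaction rolls back -- fail fast
-- and retry later instead of stalling the lock queue.
ALTER TABLE users ADD COLUMN username text;
COMMIT;
```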
Advisory Locks for Migration Coordination
When running migrations across multiple application instances, use PostgreSQL advisory locks to ensure only one instance runs the migration at a time. Most migration frameworks (like Prisma Migrate and Flyway) handle this automatically.
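If your framework doesn't handle this, the manual version is a session-level advisory lock. A sketch (the key 727 is an arbitrary constant — any bigint your application agrees on works):

```sql
SELECT pg_advisory_lock(727);    -- blocks until this session holds the lock
-- ... run the migration statements here ...
SELECT pg_advisory_unlock(727);  -- or let it release when the session ends
```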
Backfilling Large Tables
When you need to populate a new column on a large table, don't do UPDATE users SET new_col = compute(old_col). This rewrites every row in a single transaction, holding row locks for the entire duration, bloating the WAL, and potentially running the server out of disk space.
Instead, backfill in batches:
- Select a batch of rows by primary key range
- Update the batch in a small transaction
- Sleep briefly between batches to reduce load
- Repeat until all rows are updated
A batch size of 1,000-10,000 rows with a 100ms sleep between batches is a reasonable starting point. Monitor replication lag (if applicable) and adjust accordingly.
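The batch step above can be sketched in plain SQL, reusing the user_name → username example (assumes an integer primary key id; the loop and the sleep live in whatever script or scheduler drives this):

```sql
-- One batch: backfill up to 1,000 rows that haven't been copied yet.
-- Run repeatedly until it reports 0 rows updated.
WITH batch AS (
    SELECT id FROM users
    WHERE username IS NULL
    ORDER BY id
    LIMIT 1000
)
UPDATE users
SET username = user_name
FROM batch
WHERE users.id = batch.id;
```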
MySQL-Specific Strategies
Online DDL in MySQL 8.0+
MySQL has supported Online DDL since 5.6, and 8.0 extended it substantially (8.0.12 added ALGORITHM=INSTANT for many operations), allowing many schema changes without blocking reads or writes. The syntax is: ALTER TABLE t ADD COLUMN c INT, ALGORITHM=INPLACE, LOCK=NONE.
However, not all operations support online DDL. Changing a column type, for example, still requires a table copy. Always check the MySQL Online DDL operations reference before running a migration in production.
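One useful habit: request the algorithm explicitly, so MySQL errors out immediately rather than silently falling back to a blocking table copy (the table and column here are illustrative):

```sql
ALTER TABLE users
    ADD COLUMN referral_code VARCHAR(32),
    ALGORITHM=INSTANT;
-- If INSTANT isn't supported for this change, the statement fails
-- up front and nothing runs -- no surprise table rebuild.
```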
pt-online-schema-change (Percona)
pt-online-schema-change from Percona Toolkit is the industry standard for large MySQL migrations. It works by:
- Creating a new table with the desired schema
- Creating triggers on the original table to copy changes to the new table
- Copying rows in chunks from the original to the new table
- Swapping the tables with an atomic rename
GitHub developed a similar tool called gh-ost that uses binary log replication instead of triggers, avoiding trigger-related limitations. According to GitHub Engineering, gh-ost has been used to migrate tables with over a billion rows without downtime.
| Tool | Mechanism | Pros | Cons |
|---|---|---|---|
| pt-online-schema-change | Triggers | Battle-tested, widely used | Trigger overhead, FK limitations |
| gh-ost | Binlog replication | No triggers, pausable, testable | Requires binlog in ROW format |
| MySQL Online DDL | In-place (InnoDB) | No external tool needed | Not all operations supported |
| fb-mysql (Facebook) | Instant DDL extensions | Very fast | Facebook's fork, not mainstream |
Migration Frameworks Compared
| Framework | Language | Zero-Downtime Support | Notable Features |
|---|---|---|---|
| Flyway | Java (any DB) | Manual (SQL-based) | Version-based, SQL or Java migrations |
| Liquibase | Java (any DB) | Manual (changeset-based) | Rollback support, XML/YAML/JSON |
| Prisma Migrate | TypeScript/JS | Limited | Schema-first, auto-generated SQL |
| Alembic | Python | Manual | SQLAlchemy integration, auto-detect |
| golang-migrate | Go | Manual | Lightweight, CLI tool |
| dbmate | Go (any DB) | Manual | Framework-agnostic, simple |
| strong_migrations | Ruby | Enforced checks | Catches dangerous migrations |
My recommendation: if you're using Ruby on Rails, add strong_migrations immediately. It catches dangerous migration patterns and suggests safe alternatives. For other languages, I wish equivalent tools existed. The closest is Squawk, a linter for PostgreSQL migrations that works with any language.
The Migration Safety Checklist
Before running any migration in production, go through this checklist:
- Test on a copy of production data. Not a tiny test database. A full copy. The migration that takes 50ms on your dev database might take 45 minutes on production.
- Set a lock timeout. Always. 2-5 seconds is reasonable.
- Check for long-running queries. Kill or wait for them before starting.
- Have a rollback plan. For every migration, know exactly how to undo it.
- Monitor replication lag. If you have read replicas, ensure the migration doesn't cause them to fall behind.
- Run during low-traffic periods. Even with zero-downtime patterns, less traffic means less risk.
- Communicate with the team. Everyone should know a migration is happening.
- Have a DBA review it. If you don't have a DBA, the most database-experienced engineer on the team.
My Opinionated Take
Here are the hills I'll die on when it comes to database migrations:
1. Every migration should be backward-compatible. Your application should work correctly both before and after the migration runs. This means you can deploy the application first, then run the migration, or vice versa. If the migration breaks the currently running application, you've created a deployment coupling that will bite you.
2. Never delete data in a migration. Drop columns, sure. But never DELETE FROM in a migration. If you need to clean up data, do it in a separate, well-tested script with a backup plan.
3. Migration frameworks are necessary but not sufficient. Flyway, Alembic, Prisma Migrate — they handle ordering, versioning, and tracking. But they don't prevent you from writing a dangerous migration. You need linting (Squawk, strong_migrations) and code review on top.
4. The best migration tool is a boring migration. If your migration requires a complex multi-step process, consider whether the schema change is worth the complexity. Sometimes a slightly less elegant schema that can be reached with a simple migration is the better choice.
5. ORMs hide migration complexity. When Prisma or Django auto-generates a migration, look at the SQL it produces. Understand what locks it will acquire. Auto-generated migrations are a starting point, not a finished product.
Action Plan: Safe Migrations for Your Team
Immediate (This Week)
- Set lock_timeout to 5 seconds in your migration runner configuration
- Add a migration linter to your CI pipeline (Squawk for PostgreSQL, or equivalent)
- Document your team's migration review process
Short-Term (This Month)
- Create a staging environment with production-sized data for testing migrations
- Write a migration safety checklist specific to your stack
- Practice the expand-contract pattern on a non-critical table
- Set up monitoring for lock waits and replication lag
Medium-Term (This Quarter)
- Implement automated migration testing in your CI/CD pipeline
- Evaluate tools like gh-ost (MySQL) or pgroll (PostgreSQL) for large table migrations
- Build a runbook for emergency migration rollbacks
- Train the team on database locking behavior
Key Takeaways
- Database migrations cause downtime primarily through locking. Understand your database's locking behavior before writing migrations.
- The expand-contract pattern is the foundation of zero-downtime migrations. Never make a breaking schema change in a single step.
- Set a lock timeout. Always. This is the cheapest and most effective safety measure.
- Test migrations on production-sized data. Performance characteristics change dramatically with data volume.
- For large MySQL tables, use gh-ost or pt-online-schema-change. For PostgreSQL, use CREATE INDEX CONCURRENTLY and batch backfills.
- Add a migration linter to your CI pipeline to catch dangerous patterns automatically.
Sources
- PostgreSQL ALTER TABLE Documentation
- MySQL InnoDB Online DDL
- gh-ost - GitHub's Online Schema Migration Tool
- pt-online-schema-change - Percona Toolkit
- strong_migrations - Ruby Gem
- Squawk - PostgreSQL Migration Linter
- GitHub Engineering Blog
I'm Ismat, and I build BirJob — Azerbaijan's job aggregator scraping 80+ sources daily.
