The Complete Guide to Database Migrations Without Downtime
The worst outage I ever caused was a database migration. A simple ALTER TABLE to add a column to a 40-million-row table. I expected it to take a few seconds. It took 47 minutes. During those 47 minutes, the table was locked, every query that touched it timed out, and the API returned 500 errors to every user. The post-mortem was brutal.
Database migrations are the most dangerous routine operation in software engineering. They touch your most critical asset (your data), they're hard to reverse, and they're often performed under time pressure ("just deploy and run the migration, it'll be fine"). But with the right patterns and tools, you can run migrations on tables with billions of rows without your users noticing.
This guide covers everything: the theory behind zero-downtime migrations, specific strategies for each database engine, the tools that make it practical, and the hard-won lessons from running hundreds of migrations in production.
Why Migrations Cause Downtime
To understand how to avoid downtime, you need to understand why it happens in the first place. The root cause is almost always locks.
How Database Locks Work During Schema Changes
When you run ALTER TABLE in most databases, the engine needs to acquire a lock on the table. The type of lock depends on the operation:
| Operation | PostgreSQL | MySQL (InnoDB) | Lock Type |
|---|---|---|---|
| Add column (nullable, no default) | Instant (metadata only) | Instant (8.0.12+) | Metadata lock (brief) |
| Add column with default value | Instant (11+) | Instant (8.0.12+) | Metadata lock (brief) |
| Add NOT NULL column with default | Instant (11+) | Instant (8.0.12+) | Metadata lock (brief) |
| Drop column | Instant (metadata) | Table rebuild | Metadata / Full |
| Rename column | Instant | Instant (8.0+) | Metadata lock (brief) |
| Change column type | Table rewrite | Table rebuild | Full table lock |
| Add index | CONCURRENTLY option | Online DDL | Shared / None |
| Drop index | Instant | Instant | Metadata lock (brief) |
The critical insight: any operation that requires a table rewrite will lock the table for the duration of the rewrite. For small tables, this is milliseconds. For a 100GB table, it can be hours.
Even "instant" operations in PostgreSQL can cause problems due to metadata lock contention. If a long-running query holds a lock on the table, your ALTER TABLE will wait for that lock, and every subsequent query will queue behind your ALTER TABLE. This is the "lock queue stampede" problem, and it's caught more experienced engineers than I'd like to admit.
According to PostgreSQL documentation, even operations that are logically instant still need to acquire an ACCESS EXCLUSIVE lock briefly, which can block behind long-running transactions.
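Before running DDL, it's worth checking for exactly those long-running transactions. A minimal sketch against PostgreSQL's standard pg_stat_activity view (the one-minute threshold is an arbitrary starting point, not a recommendation):

```sql
-- Find transactions that have been open long enough to make an
-- ALTER TABLE queue -- and everything else queue behind it.
SELECT pid,
       state,
       now() - xact_start AS xact_age,
       left(query, 80)    AS query
FROM pg_stat_activity
WHERE xact_start < now() - interval '1 minute'
ORDER BY xact_start;
```

Kill the offenders (pg_terminate_backend) or wait for them before taking the lock.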
The Expand-Contract Pattern
The expand-contract pattern (also called "parallel change") is the foundation of zero-downtime migrations. The idea is simple: never make a breaking change in a single step. Instead, expand the schema to support both old and new formats, migrate the data, then contract the schema by removing the old format.
Example: Renaming a Column
You want to rename user_name to username. Here's the zero-downtime approach:
Step 1 - Expand: Add the new column username. Keep user_name.
Step 2 - Dual Write: Update your application to write to both columns. Add a database trigger to keep them in sync.
Step 3 - Backfill: Copy existing data from user_name to username in batches.
Step 4 - Switch Reads: Update your application to read from username instead of user_name.
Step 5 - Stop Dual Write: Update your application to only write to username.
Step 6 - Contract: Drop the user_name column.
Yes, this is six deployments for a column rename. That's the cost of zero downtime. Each step is independently safe and reversible.
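As a sketch of what the expand and dual-write steps might look like in PostgreSQL, using the column names from the example (treat this as illustrative, not a drop-in migration — a production sync trigger must decide which column is authoritative when both change):

```sql
-- Step 1 (expand): add the new column alongside the old one.
ALTER TABLE users ADD COLUMN username text;

-- Step 2 (dual write): a trigger fills in whichever column the
-- application didn't write. This naive version only copies missing
-- values; it does not resolve conflicting updates to both columns.
CREATE OR REPLACE FUNCTION sync_username() RETURNS trigger AS $$
BEGIN
    NEW.username  := COALESCE(NEW.username, NEW.user_name);
    NEW.user_name := COALESCE(NEW.user_name, NEW.username);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_sync_username
    BEFORE INSERT OR UPDATE ON users
    FOR EACH ROW EXECUTE FUNCTION sync_username();

-- Step 6 (contract), only after backfill and both application changes:
-- DROP TRIGGER users_sync_username ON users;
-- ALTER TABLE users DROP COLUMN user_name;
```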
PostgreSQL-Specific Strategies
Adding Indexes Concurrently
PostgreSQL's CREATE INDEX CONCURRENTLY builds the index without blocking writes. It takes longer than a regular CREATE INDEX (roughly 2-3x), but it doesn't lock the table.
Caveat: If a concurrent index build fails (which can happen due to deadlocks or constraint violations), it leaves an invalid index behind. Always check for invalid indexes after the operation via the pg_index catalog: SELECT indexrelid::regclass FROM pg_index WHERE NOT indisvalid.
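A sketch of the full flow (the index and table names are illustrative):

```sql
-- Builds without blocking writes; cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY idx_users_username ON users (username);

-- If the build failed, an invalid index was left behind. Find it:
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;

-- Drop the invalid index, then retry the concurrent build:
-- DROP INDEX CONCURRENTLY idx_users_username;
```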
Lock Timeout
Set a lock timeout before running migrations. If the migration can't acquire the necessary lock within the timeout, it fails fast instead of queuing and causing a stampede:
SET lock_timeout = '5s';
This is the single most important PostgreSQL setting for safe migrations. According to Braintree's engineering blog, they set a 2-second lock timeout for all production migrations.
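In practice that might look like the following (the DDL itself is a placeholder; SET LOCAL scopes the timeout to this transaction only):

```sql
BEGIN;
SET LOCAL lock_timeout = '5s';
-- If the ACCESS EXCLUSIVE lock can't be acquired within 5 seconds,
-- this statement errors and the transaction rolls back -- fail fast
-- and retry later instead of stalling the lock queue.
ALTER TABLE users ADD COLUMN username text;
COMMIT;
```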
Advisory Locks for Migration Coordination
When running migrations across multiple application instances, use PostgreSQL advisory locks to ensure only one instance runs the migration at a time. Most migration frameworks (like Prisma Migrate and Flyway) handle this automatically.
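If your framework doesn't handle this, the manual version is a session-level advisory lock. A sketch (the key 727 is an arbitrary constant — any bigint your application agrees on works):

```sql
SELECT pg_advisory_lock(727);    -- blocks until this session holds the lock
-- ... run the migration statements here ...
SELECT pg_advisory_unlock(727);  -- or let it release when the session ends
```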
Backfilling Large Tables
When you need to populate a new column on a large table, don't do UPDATE users SET new_col = compute(old_col). This rewrites every row in a single transaction, holding row locks for the entire duration, bloating the WAL, and potentially running the server out of disk space.
Instead, backfill in batches:
- Select a batch of rows by primary key range
- Update the batch in a small transaction
- Sleep briefly between batches to reduce load
- Repeat until all rows are updated
A batch size of 1,000-10,000 rows with a 100ms sleep between batches is a reasonable starting point. Monitor replication lag (if applicable) and adjust accordingly.
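The batch step above can be sketched in plain SQL, reusing the user_name → username example (assumes an integer primary key id; the loop and the sleep live in whatever script or scheduler drives this):

```sql
-- One batch: backfill up to 1,000 rows that haven't been copied yet.
-- Run repeatedly until it reports 0 rows updated.
WITH batch AS (
    SELECT id FROM users
    WHERE username IS NULL
    ORDER BY id
    LIMIT 1000
)
UPDATE users
SET username = user_name
FROM batch
WHERE users.id = batch.id;
```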
MySQL-Specific Strategies
Online DDL in MySQL 8.0+
MySQL has supported Online DDL since 5.6, and 8.0 extended it substantially (8.0.12 added ALGORITHM=INSTANT for many operations), allowing many schema changes without blocking reads or writes. The syntax is: ALTER TABLE t ADD COLUMN c INT, ALGORITHM=INPLACE, LOCK=NONE.
However, not all operations support online DDL. Changing a column type, for example, still requires a table copy. Always check the MySQL Online DDL operations reference before running a migration in production.
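One useful habit: request the algorithm explicitly, so MySQL errors out immediately rather than silently falling back to a blocking table copy (the table and column here are illustrative):

```sql
ALTER TABLE users
    ADD COLUMN referral_code VARCHAR(32),
    ALGORITHM=INSTANT;
-- If INSTANT isn't supported for this change, the statement fails
-- up front and nothing runs -- no surprise table rebuild.
```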
pt-online-schema-change (Percona)
pt-online-schema-change from Percona Toolkit is the industry standard for large MySQL migrations. It works by:
- Creating a new table with the desired schema
- Creating triggers on the original table to copy changes to the new table
- Copying rows in chunks from the original to the new table
- Swapping the tables with an atomic rename
GitHub developed a similar tool called gh-ost that uses binary log replication instead of triggers, avoiding trigger-related limitations. According to GitHub Engineering, gh-ost has been used to migrate tables with over a billion rows without downtime.
| Tool | Mechanism | Pros | Cons |
|---|---|---|---|
| pt-online-schema-change | Triggers | Battle-tested, widely used | Trigger overhead, FK limitations |
| gh-ost | Binlog replication | No triggers, pausable, testable | Requires binlog in ROW format |
| MySQL Online DDL | In-place (InnoDB) | No external tool needed | Not all operations supported |
| fb-mysql (Facebook) | Instant DDL extensions | Very fast | Facebook's fork, not mainstream |
Migration Frameworks Compared
| Framework | Language | Zero-Downtime Support | Notable Features |
|---|---|---|---|
| Flyway | Java (any DB) | Manual (SQL-based) | Version-based, SQL or Java migrations |
| Liquibase | Java (any DB) | Manual (changeset-based) | Rollback support, XML/YAML/JSON |
| Prisma Migrate | TypeScript/JS | Limited | Schema-first, auto-generated SQL |
| Alembic | Python | Manual | SQLAlchemy integration, auto-detect |
| golang-migrate | Go | Manual | Lightweight, CLI tool |
| dbmate | Go (any DB) | Manual | Framework-agnostic, simple |
| strong_migrations | Ruby | Enforced checks | Catches dangerous migrations |
My recommendation: if you're using Ruby on Rails, add strong_migrations immediately. It catches dangerous migration patterns and suggests safe alternatives. For other languages, I wish equivalent tools existed. The closest is Squawk, a linter for PostgreSQL migrations that works with any language.
The Migration Safety Checklist
Before running any migration in production, go through this checklist:
- Test on a copy of production data. Not a tiny test database. A full copy. The migration that takes 50ms on your dev database might take 45 minutes on production.
- Set a lock timeout. Always. 2-5 seconds is reasonable.
- Check for long-running queries. Kill or wait for them before starting.
- Have a rollback plan. For every migration, know exactly how to undo it.
- Monitor replication lag. If you have read replicas, ensure the migration doesn't cause them to fall behind.
- Run during low-traffic periods. Even with zero-downtime patterns, less traffic means less risk.
- Communicate with the team. Everyone should know a migration is happening.
- Have a DBA review it. If you don't have a DBA, the most database-experienced engineer on the team.
My Opinionated Take
Here are the hills I'll die on when it comes to database migrations:
1. Every migration should be backward-compatible. Your application should work correctly both before and after the migration runs. This means you can deploy the application first, then run the migration, or vice versa. If the migration breaks the currently running application, you've created a deployment coupling that will bite you.
2. Never delete data in a migration. Drop columns, sure. But never DELETE FROM in a migration. If you need to clean up data, do it in a separate, well-tested script with a backup plan.
3. Migration frameworks are necessary but not sufficient. Flyway, Alembic, Prisma Migrate — they handle ordering, versioning, and tracking. But they don't prevent you from writing a dangerous migration. You need linting (Squawk, strong_migrations) and code review on top.
4. The best migration tool is a boring migration. If your migration requires a complex multi-step process, consider whether the schema change is worth the complexity. Sometimes a slightly less elegant schema that can be reached with a simple migration is the better choice.
5. ORMs hide migration complexity. When Prisma or Django auto-generates a migration, look at the SQL it produces. Understand what locks it will acquire. Auto-generated migrations are a starting point, not a finished product.
Action Plan: Safe Migrations for Your Team
Immediate (This Week)
- Set lock_timeout to 5 seconds in your migration runner configuration
- Add a migration linter to your CI pipeline (Squawk for PostgreSQL, or equivalent)
- Document your team's migration review process
Short-Term (This Month)
- Create a staging environment with production-sized data for testing migrations
- Write a migration safety checklist specific to your stack
- Practice the expand-contract pattern on a non-critical table
- Set up monitoring for lock waits and replication lag
Medium-Term (This Quarter)
- Implement automated migration testing in your CI/CD pipeline
- Evaluate tools like gh-ost (MySQL) or pgroll (PostgreSQL) for large table migrations
- Build a runbook for emergency migration rollbacks
- Train the team on database locking behavior
Key Takeaways
- Database migrations cause downtime primarily through locking. Understand your database's locking behavior before writing migrations.
- The expand-contract pattern is the foundation of zero-downtime migrations. Never make a breaking schema change in a single step.
- Set a lock timeout. Always. This is the cheapest and most effective safety measure.
- Test migrations on production-sized data. Performance characteristics change dramatically with data volume.
- For large MySQL tables, use gh-ost or pt-online-schema-change. For PostgreSQL, use CREATE INDEX CONCURRENTLY and batch backfills.
- Add a migration linter to your CI pipeline to catch dangerous patterns automatically.
Sources
- PostgreSQL ALTER TABLE Documentation
- MySQL InnoDB Online DDL
- gh-ost - GitHub's Online Schema Migration Tool
- pt-online-schema-change - Percona Toolkit
- strong_migrations - Ruby Gem
- Squawk - PostgreSQL Migration Linter
- GitHub Engineering Blog
I'm Ismat, and I build BirJob — Azerbaijan's job aggregator scraping 80+ sources daily.
