Zero-Downtime Migrations at Scale
On a small table, every migration is safe. The table fits in memory, locks are held for milliseconds, and nobody notices. That experience is exactly what makes large-table migrations dangerous: the same one-line add_column or add_index that was instant in development can, on a multi-million-row production table under live traffic, take a heavy lock and stall every query behind it. The app does not crash — it just stops responding while the migration runs, which from a user’s perspective is the same thing. Zero-downtime migration is the discipline of changing the schema without ever taking a lock that blocks production for a meaningful length of time.
The core fact to internalize is that in Postgres, DDL takes locks, and the type of lock decides whether your users notice. An ACCESS EXCLUSIVE lock blocks reads and writes; if a migration grabs one on a hot table and then does real work while holding it, everything queues. The whole craft is avoiding that situation.
Install strong_migrations and stop guessing
You should not have to memorize which operations are safe — and the list shifts between Postgres versions, which makes memorization a losing game. The strong_migrations gem encodes the current knowledge and fails the migration in development or CI before a dangerous operation reaches production, with an explanation and a safe rewrite.
# Gemfile
gem "strong_migrations"
# A migration that looks innocent but is not, on a big table:
class AddNotNullToOrders < ActiveRecord::Migration[7.1]
def change
# strong_migrations will halt this: adding a NOT NULL constraint
# forces a full-table scan under ACCESS EXCLUSIVE.
change_column_null :orders, :region, false
end
end
The gem turns “senior engineer reviews every migration for lock safety” into a default that the whole team gets for free. Adopt it on day one; it is the single highest-leverage thing on this list.
Build indexes CONCURRENTLY, always
A plain CREATE INDEX locks the table against writes for the entire build — which on a large table can be minutes. Postgres offers CREATE INDEX CONCURRENTLY, which builds without blocking writes, at the cost of taking longer and not being runnable inside a transaction. In Rails:
class AddIndexToOrdersOnRegion < ActiveRecord::Migration[7.1]
disable_ddl_transaction!
def change
add_index :orders, :region, algorithm: :concurrently
end
end
disable_ddl_transaction! is mandatory because CONCURRENTLY cannot run inside the transaction Rails wraps migrations in. The trade-off: if a concurrent build fails partway, it leaves an INVALID index behind that you must drop and rebuild — so concurrent index migrations should be isolated, one per file, never bundled with other changes.
The expand / contract pattern for everything that’s really two changes
Most “scary” migrations are actually a single logical change that must be split across multiple deploys because the old code and the new code both run simultaneously during a rolling deploy. The pattern is expand, then contract: first make the schema accept both old and new shapes, deploy the code that uses the new shape, then remove the old shape once nothing references it.
Renaming a column is the canonical example, and the naive rename_column is unsafe precisely because of the deploy overlap: the instant you rename, old running processes still issuing SELECT old_name start erroring. The safe sequence is several deploys:
- Expand. Add the new column. Add application code that writes to both the old and new columns on every change (the dual-write).
- Backfill. Copy existing rows from old to new in batches (below).
- Migrate reads. Deploy code that reads from the new column. Both columns are still written.
- Contract. Once nothing reads or writes the old column, drop it.
The same shape applies to changing a column’s type, splitting a table, or moving a foreign key. It is more deploys than a single migration, but every step is individually safe and reversible, and the app stays up the entire time.
Backfills: never in the migration, always in batches
The most common way to turn a schema change into an outage is to backfill data inside the migration itself: execute "UPDATE orders SET region = ..." on ten million rows. That single statement takes one enormous transaction, holds locks, bloats the WAL, and blocks for the entire run. Backfills must happen outside the schema migration — in a separate, batched job that updates a few thousand rows at a time and commits between batches so locks are short-lived:
class BackfillOrderRegionJob
include Sidekiq::Job
def perform
Order.where(region: nil).in_batches(of: 5_000) do |batch|
batch.update_all("region = derive_region(shipping_zip)")
sleep 0.1 # breathe — let replicas catch up, ease IO pressure
end
end
end
The in_batches keyset iteration walks the table by primary key without loading it all into memory, and the small pause between batches keeps you from saturating IO or letting read replicas fall behind. Because the column was already added (expand) and the app dual-writes new rows, the backfill only has to catch the historical rows, and it can run for hours in the background without anyone noticing.
Adding constraints without the full-table-scan lock
Adding a NOT NULL or foreign-key constraint normally validates every existing row while holding a strong lock. Postgres lets you split validation into two steps: add the constraint as NOT VALID (cheap — only checks new rows, takes a brief lock), then VALIDATE CONSTRAINT in a separate step (scans existing rows but only takes a SHARE UPDATE EXCLUSIVE lock that does not block reads or writes).
# Deploy 1: add the constraint NOT VALID (brief lock).
add_check_constraint :orders, "region IS NOT NULL",
name: "orders_region_null", validate: false
# Deploy 2 (after backfill): validate without blocking traffic.
validate_check_constraint :orders, name: "orders_region_null"
This is the safe path to a NOT NULL column on a large table: backfill first, add the check constraint NOT VALID, validate it, and (on recent Postgres) the validated check constraint lets you set the real NOT NULL attribute cheaply. strong_migrations will walk you through exactly this sequence when it stops your naive version.
One more guardrail: lock_timeout
Even a “safe” migration can wait behind a long-running query to acquire its lock, and while it waits in the lock queue it blocks everything behind it. Set a short lock_timeout so a migration that cannot get its lock quickly fails fast and retries, rather than forming a pileup:
# config/initializers or in the migration
ActiveRecord::Base.connection.execute("SET lock_timeout = '5s'")
Failing fast is the correct behavior here: a migration that cannot acquire its lock in five seconds should get out of the way and try again, not hold the line and convert one slow query into a site-wide stall.
The takeaway
Zero-downtime migration is not a trick; it is a set of habits. Install strong_migrations so the dangerous operations stop themselves. Build indexes CONCURRENTLY. Split anything that changes existing data into expand / backfill / contract across multiple deploys, with the application dual-writing during the overlap. Keep backfills out of the schema migration and in batched background jobs. Add constraints NOT VALID first, validate second. And set a lock_timeout so a stuck migration fails fast instead of taking the site with it. None of these are exotic — they are the difference between a schema change nobody notices and a postmortem.
We do schema-change work on large Rails/Postgres systems — planning multi-deploy migrations, writing safe backfills, and untangling tables that have grown past the point where naive migrations are safe. If you have a column you have been afraid to change, that is a good conversation to have.