Planning and Executing Database Migrations#

Database migrations are the highest-risk routine operations most teams perform. A bad migration can cause downtime, data loss, or application errors that cascade across every service that touches the affected tables. This operational sequence walks through the assessment, planning, execution, and rollback of database migrations from simple column additions to full platform changes.

Phase 1 – Assessment#

Step 1: Classify the Migration#

Every migration falls into one of three categories, each with a different risk profile:

Schema-only changes modify the structure of existing tables without transforming data. Examples: adding a nullable column, creating an index, adding a constraint. These are the most common and usually the safest.

Data migrations transform or backfill existing data. Examples: populating a new column with computed values from other columns, reformatting phone numbers from national to international format, merging data from two tables into one. These involve reading and writing large volumes of rows and carry lock and performance risks.

Platform migrations change the database engine, major version, or hosting location. Examples: PostgreSQL 14 to 16, MySQL to PostgreSQL, self-hosted to RDS. These are the most complex and carry the highest risk.

Step 2: Impact Analysis#

Before writing any migration code, answer these questions:

  • Which applications read from or write to the affected tables? Check connection strings across all services, not just the one you own.
  • What is the data volume? A table with 1,000 rows migrates instantly; a table with 100 million rows can take hours and requires careful lock management (see the queries sketched after this list).
  • What is the acceptable downtime? Zero (online migration required), minutes (maintenance window), or hours (weekend migration)?
  • What is the current replication topology? Migrations run on the primary. Replicas receive changes through the WAL (PostgreSQL) or binlog (MySQL). Large migrations can cause replication lag.
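
The queries below are a minimal PostgreSQL sketch for answering these questions. The users table and connection details are the running example from this guide, and the application check only works if clients set application_name.

-- Approximate row count for the affected table
SELECT n_live_tup FROM pg_stat_user_tables WHERE relname = 'users';

-- Applications currently connected to this database
SELECT DISTINCT application_name, client_addr
FROM pg_stat_activity
WHERE datname = current_database();

-- Replication topology: replicas attached to this primary
SELECT client_addr, state, sync_state FROM pg_stat_replication;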

Step 3: Identify Breaking Changes#

Some operations cannot be done in a single step without breaking running applications:

  • Column rename: Old application code references the old name. New code references the new name. Both cannot work simultaneously unless you use a view or alias.
  • Column type change: Changing varchar(50) to integer requires data conversion and application code changes simultaneously.
  • Drop column: Any application code still referencing the column will throw errors.
  • Add NOT NULL without default: Existing rows have no value. INSERT statements without the column will fail.

These all require the expand-contract pattern described in Phase 2.

Step 4: Output#

Write a migration plan document that includes: migration type, affected tables, data volume, dependent applications, estimated execution time, acceptable downtime, rollback strategy, and the point of no return.

Phase 2 – Safe Schema Change Patterns#

Step 5: The Expand-Contract Pattern#

This is the core pattern for zero-downtime schema changes. It breaks a breaking change into four non-breaking steps.

Example: Renaming a column from user_name to username.

Step 1 – Expand: Add the new column alongside the old one.

-- Same statement in PostgreSQL and MySQL
ALTER TABLE users ADD COLUMN username VARCHAR(255);

Update the application to write to both columns:

# Application code during expand phase
def create_user(name):
    db.execute(
        "INSERT INTO users (user_name, username) VALUES (%s, %s)",
        (name, name)
    )

Step 2 – Migrate: Backfill the new column with data from the old column.

-- PostgreSQL: batched backfill
UPDATE users SET username = user_name
WHERE username IS NULL
AND id BETWEEN 1 AND 10000;

-- Repeat for next batch
UPDATE users SET username = user_name
WHERE username IS NULL
AND id BETWEEN 10001 AND 20000;

Step 3 – Contract: Switch reads to the new column. Stop writing to the old column. The application now uses only username.

# Application code during contract phase
def get_user(user_id):
    return db.execute("SELECT username FROM users WHERE id = %s", (user_id,))

Step 4 – Cleanup: After a verification period (at least one full release cycle), drop the old column.

ALTER TABLE users DROP COLUMN user_name;

Step 6: Safe Operations#

These operations are safe to run without the expand-contract pattern:

-- Add nullable column (metadata-only in PostgreSQL; typically instant in MySQL 8.0+)
ALTER TABLE users ADD COLUMN email VARCHAR(255);

-- Add column with default (instant in PostgreSQL 11+)
ALTER TABLE users ADD COLUMN active BOOLEAN DEFAULT true;

-- Create index concurrently (PostgreSQL; does not block reads or writes,
-- but cannot run inside a transaction block)
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);

-- In MySQL, use pt-online-schema-change for large tables
-- pt-online-schema-change --alter "ADD INDEX idx_email (email)" D=mydb,t=users --execute

Step 7: Unsafe Operations Requiring Expand-Contract#

| Operation | Risk | Approach |
| --- | --- | --- |
| Drop column | Application errors | Stop using it first, then drop after verification |
| Rename column | Application errors | Add new column, migrate, drop old |
| Change column type | Data loss, application errors | Add new column of new type, migrate, drop old |
| Add NOT NULL | INSERT failures | Add nullable, backfill, then add constraint |
| Drop table | Data loss | Stop all references, keep the table for a rollback period (see the sketch below) |
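
For the drop-table case, one hypothetical way to "keep the table for a rollback period" is to rename it rather than drop it, so any stray reference fails loudly while the data survives (the table name here is illustrative):

-- Park the table instead of dropping it outright
ALTER TABLE legacy_orders RENAME TO legacy_orders_to_drop;

-- Only after the rollback period has passed:
-- DROP TABLE legacy_orders_to_drop;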

Adding a NOT NULL constraint safely in PostgreSQL:

-- Step 1: Add column as nullable
ALTER TABLE orders ADD COLUMN status VARCHAR(20);

-- Step 2: Backfill (batched)
UPDATE orders SET status = 'pending' WHERE status IS NULL AND id < 10000;
-- ... continue batching ...

-- Step 3: Add NOT NULL constraint (scans the table under an exclusive lock)
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;

-- Alternative: add a CHECK constraint as NOT VALID first (instant),
-- then validate it separately (does not block writes):
ALTER TABLE orders ADD CONSTRAINT orders_status_not_null
  CHECK (status IS NOT NULL) NOT VALID;
ALTER TABLE orders VALIDATE CONSTRAINT orders_status_not_null;

-- In PostgreSQL 12+, SET NOT NULL can then skip its table scan because the
-- validated check constraint already proves the column is non-null; the
-- check constraint can be dropped afterwards.

Step 8: Verification#

After each expand-contract step, verify that the application works correctly. Run integration tests against the modified schema. Check error logs for SQL errors referencing the old or new column names. Confirm that both old and new application versions can operate simultaneously during the transition period.
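
For the user_name to username example, a quick data-level check after the migrate step might look like the following (PostgreSQL syntax; IS DISTINCT FROM treats two NULLs as equal):

-- Count rows where the backfilled column disagrees with the original
SELECT COUNT(*) AS mismatched_rows
FROM users
WHERE username IS DISTINCT FROM user_name;

A result of zero means the backfill and dual writes are keeping the columns in sync; anything else should be investigated before the contract step.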

Phase 3 – Execution#

Step 9: Run in Dev#

Execute the migration against the development database first. Verify application compatibility. This catches syntax errors and logical mistakes before they reach production.

# Using Flyway
flyway -url=jdbc:postgresql://dev-db:5432/myapp migrate

# Using raw SQL
psql -h dev-db -U myapp -d myapp -f migrations/V003__add_username_column.sql

Step 10: Run in Staging with Production-Like Data#

Staging should have a data volume representative of production. If production has 50 million rows in the users table, staging should have at least 5 million. This catches performance issues that only appear at scale.

# Time the migration
time psql -h staging-db -U myapp -d myapp -f migrations/V003__add_username_column.sql

Step 11: Estimate Production Execution Time#

Based on staging timing, extrapolate for production. If staging has 10% of production data and the migration took 30 seconds, plan for at least 5 minutes in production, and build in margin: execution rarely scales linearly because of index operations and I/O patterns.

For CREATE INDEX CONCURRENTLY on large tables, staging timing is especially important. A 100-million-row table index creation can take 30 minutes to several hours depending on hardware.
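
To ground the estimate, it also helps to check the table's actual size before scheduling a window. A PostgreSQL sketch, again using the users table:

-- Approximate row count and on-disk size (including indexes and TOAST)
SELECT pg_size_pretty(pg_total_relation_size('users')) AS total_size,
       pg_size_pretty(pg_indexes_size('users'))        AS index_size,
       (SELECT n_live_tup
        FROM pg_stat_user_tables
        WHERE relname = 'users')                        AS approx_rows;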

Step 12: Production Execution#

Schedule a maintenance window if needed. Communicate the timeline to stakeholders. Execute with monitoring active.

# Start a screen/tmux session in case of SSH disconnect
tmux new -s migration

# Connect to the production database (host name is illustrative)
psql -h prod-db -U myapp -d myapp

-- Inside psql: enable timing, then run the migration
\timing on
\i migrations/V003__add_username_column.sql

Step 13: Batched Backfills for Large Tables#

Never run UPDATE users SET username = user_name on a 100-million-row table in one transaction. It holds locks on every row it touches, bloats the table, floods the WAL, and a failure near the end rolls back hours of work.

-- PostgreSQL batched update with progress tracking
-- (requires PostgreSQL 11+ and must be run outside an explicit transaction
-- block so that COMMIT is allowed inside the DO block)
DO $$
DECLARE
  batch_size INTEGER := 5000;
  total_updated INTEGER := 0;
  rows_affected INTEGER;
BEGIN
  LOOP
    UPDATE users
    SET username = user_name
    WHERE id IN (
      SELECT id FROM users
      WHERE username IS NULL
      ORDER BY id
      LIMIT batch_size
    );

    GET DIAGNOSTICS rows_affected = ROW_COUNT;
    total_updated := total_updated + rows_affected;
    RAISE NOTICE 'Updated % rows (total: %)', rows_affected, total_updated;

    EXIT WHEN rows_affected = 0;

    -- Commit the batch, then pause briefly to let replicas catch up
    COMMIT;
    PERFORM pg_sleep(0.1);
  END LOOP;
END $$;

For MySQL:

-- MySQL batched update. REPEAT/UNTIL is only valid inside a stored program,
-- so wrap the loop in a throwaway procedure.
DELIMITER //
CREATE PROCEDURE backfill_username()
BEGIN
  DECLARE last_id BIGINT DEFAULT 0;
  DECLARE next_id BIGINT;
  REPEAT
    -- Walk the primary key in chunks of 5,000 ids
    SELECT MAX(id) INTO next_id
    FROM (SELECT id FROM users WHERE id > last_id ORDER BY id LIMIT 5000) AS batch;

    UPDATE users
    SET username = user_name
    WHERE username IS NULL AND id > last_id AND id <= next_id;

    SET last_id = next_id;

    -- Brief pause to let replicas catch up
    DO SLEEP(0.1);
  UNTIL next_id IS NULL END REPEAT;
END //
DELIMITER ;

CALL backfill_username();
DROP PROCEDURE backfill_username;

Step 14: Monitor During Migration#

Watch these metrics during execution:

-- PostgreSQL: check for lock waits
SELECT pid, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE state = 'active' AND wait_event IS NOT NULL;

-- PostgreSQL: check replication lag
SELECT client_addr, sent_lsn, write_lsn, flush_lsn, replay_lsn,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;

-- MySQL: check replication lag
SHOW SLAVE STATUS\G
-- Look at Seconds_Behind_Master
-- (in MySQL 8.0.22+, use SHOW REPLICA STATUS and Seconds_Behind_Source)

If replication lag exceeds your threshold, pause the backfill and wait for replicas to catch up.

Step 15: Verification#

After the migration completes, verify:

# Check for application errors
kubectl logs -l app=my-service --since=10m | grep -i error

# PostgreSQL: verify column exists and has data
psql -c "SELECT COUNT(*) FROM users WHERE username IS NOT NULL;"
psql -c "\d users"  -- show table structure

# Check query performance
psql -c "EXPLAIN ANALYZE SELECT * FROM users WHERE username = 'testuser';"

Phase 4 – Rollback Plan#

Step 16: Every Migration Needs a Rollback Script#

Write the rollback script before executing the migration. Store it alongside the migration file.

migrations/
  V003__add_username_column.sql
  V003__add_username_column_ROLLBACK.sql

Step 17: Rollback Strategies by Migration Type#

Expand-contract: Rollback is built into the pattern. To roll back the expand step, revert the application to only use the old column, then drop the new column. No data is lost because the old column was never modified.

Column addition: Drop the new column. Safe as long as no application code references it yet.

-- Rollback: V003__add_username_column_ROLLBACK.sql
ALTER TABLE users DROP COLUMN IF EXISTS username;

Data backfill: If you are transforming data in place (modifying the original column), you need to keep a backup. If you are writing to a new column (expand-contract), the old column is your backup.
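
A minimal sketch of that backup for an in-place transformation, assuming a hypothetical phone column being reformatted:

-- Snapshot the original values before transforming them in place
CREATE TABLE users_phone_backup AS
SELECT id, phone FROM users;

-- Rollback: restore the originals (PostgreSQL UPDATE ... FROM syntax)
UPDATE users u
SET phone = b.phone
FROM users_phone_backup b
WHERE u.id = b.id;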

Step 18: Platform Migration Rollback#

For platform migrations, keep the old database running and accessible until the cutover is validated. Set the old database to read-only (or continue replication in reverse if your tools support it). If the new platform has problems, switch the application connection string back.
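
In PostgreSQL, one way to freeze the old database during the validation window is to make new sessions read-only by default; this is a guard rail rather than hard enforcement, since a session can still override the setting:

-- On the old server: reject writes from new sessions
ALTER DATABASE myapp SET default_transaction_read_only = on;

-- Revert if you need to fall back to the old platform
ALTER DATABASE myapp SET default_transaction_read_only = off;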

Step 19: Identify the Point of No Return#

Every migration has a moment after which rollback is no longer possible:

  • Dropping the old column after expand-contract: the old data is gone
  • Deleting the old database after platform migration: the fallback is gone
  • Modifying data in place without a backup: the original values are gone

Document this clearly in the migration plan. Never cross the point of no return until you have verified the new state is correct and stable, ideally after at least 24 hours of production traffic.

Phase 5 – Platform Migrations#

Step 20: PostgreSQL Major Version Upgrade#

Option A – pg_upgrade (requires downtime):

# Stop the old cluster
pg_ctlcluster 14 main stop

# Run pg_upgrade (both clusters stopped; the new 16 cluster must already be initialized)
pg_upgrade \
  --old-datadir=/var/lib/postgresql/14/main \
  --new-datadir=/var/lib/postgresql/16/main \
  --old-bindir=/usr/lib/postgresql/14/bin \
  --new-bindir=/usr/lib/postgresql/16/bin \
  --link  # Uses hard links, much faster

# Start the new cluster
pg_ctlcluster 16 main start

# Run post-upgrade analysis
/usr/lib/postgresql/16/bin/vacuumdb --all --analyze-in-stages

The --link flag avoids copying data files, reducing downtime to minutes; without it, large databases can take hours. Note that in link mode the old cluster becomes unusable once the new cluster has been started, so take a backup before upgrading.

Option B – Logical replication (zero-downtime):

-- On the old (14) server: create publication
CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

-- On the new (16) server: create subscription
CREATE SUBSCRIPTION upgrade_sub
  CONNECTION 'host=old-server dbname=myapp'
  PUBLICATION upgrade_pub;

Logical replication copies rows, not DDL, so load the schema onto the new server first (for example with pg_dump --schema-only). Wait for the subscription to catch up (check pg_stat_subscription), sync sequence values (they are not replicated), then switch application traffic to the new server. This approach is more complex but reduces downtime to the brief cutover itself.
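
A minimal way to watch the catch-up, assuming the publication and subscription names above:

-- On the new (16) server: how far the subscription has received and applied
SELECT subname, received_lsn, latest_end_lsn, latest_end_time
FROM pg_stat_subscription;

-- On the old (14) server: the current WAL position to compare against
SELECT pg_current_wal_lsn();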

Step 21: Cross-Engine Migration#

Migrating from MySQL to PostgreSQL or vice versa requires data transformation. Use pgloader for MySQL-to-PostgreSQL:

pgloader mysql://user:pass@mysql-host/mydb postgresql://user:pass@pg-host/mydb

pgloader handles type mapping (MySQL TINYINT(1) to PostgreSQL BOOLEAN, DATETIME to TIMESTAMP), index recreation, and sequence setup.

Step 22: Verification for Platform Migrations#

# Compare row counts across all tables (n_live_tup is an estimate; run SELECT COUNT(*) on critical tables for exact counts)
psql -h old-server -c "SELECT schemaname, relname, n_live_tup FROM pg_stat_user_tables ORDER BY relname;"
psql -h new-server -c "SELECT schemaname, relname, n_live_tup FROM pg_stat_user_tables ORDER BY relname;"

# Spot-check data integrity
psql -h old-server -c "SELECT md5(string_agg(id::text || email, '')) FROM users ORDER BY id;"
psql -h new-server -c "SELECT md5(string_agg(id::text || email, '')) FROM users ORDER BY id;"

# Run application smoke tests against the new database
APP_DATABASE_URL=postgresql://new-server/myapp ./run-smoke-tests.sh

Row counts should match exactly. Checksums on critical tables should match. Smoke tests should pass. Only after all three verifications should you consider switching production traffic.

Summary#

Safe database migrations follow a predictable pattern: assess the migration type and risk, choose the appropriate change strategy (direct for safe operations, expand-contract for breaking changes), execute in dev and staging first, monitor during production execution, and always have a tested rollback plan. The expand-contract pattern is the single most important tool for zero-downtime migrations. When in doubt, take the slower path that preserves rollback capability.