Safe Rollbacks: Securing SQL Migrations Before They Torch Production



Hook: The Migration That Dropped a Column Faster Than You Could Say Rollback
A Rails migration removes users.phone_number after product deprecates SMS login. The migration runs fine in staging, but in production a legacy report still references the column. Queries start failing, dashboards go blank, and the on-call engineer attempts to rollback the migration. Unfortunately the down migration is empty. The only way back is restoring from backup and replaying hours of writes. The outage becomes a security incident when MFA fallback numbers vanish, locking users out.
Developers often treat migrations as simple schema changes, but in regulated systems they control access, retention, and auditing. This article covers how to build safe migrations: reversible scripts, shadow tables, feature flags, and automated tests that run before deploy. We will use examples in Rails, Django, and plain SQL.
The Problem Deep Dive
Common pain points:
- Irreversible migrations. Removing columns or data without backups.
- Mixed DDL and DML. Scripts mutate data while changing schema; rollbacks are complex.
- Missing gating. Migrations deploy automatically without validation.
- No drift detection. Manual hotfixes diverge schema between environments.
Example anti-pattern:
class DropPhoneNumber < ActiveRecord::Migration[7.0]
def change
remove_column :users, :phone_number, :string
end
end
change cannot derive a rollback; down is blank.
Technical Solutions
Quick Patch: Always Provide up/down
Write explicit up/down methods:
class DropPhoneNumber < ActiveRecord::Migration[7.0]
def up
add_column :users, :phone_number_backup, :string
execute "INSERT INTO users_backup(id, phone_number) SELECT id, phone_number FROM users"
remove_column :users, :phone_number
end
def down
add_column :users, :phone_number, :string
execute "UPDATE users SET phone_number = b.phone_number FROM users_backup b WHERE users.id = b.id"
drop_table :users_backup
end
end
Even if you never roll back, documenting the path is essential.
Durable Fix: Two-Phase Migrations with Feature Flags
- Add new column or table.
- Backfill data in batches (idempotent script).
- Switch application code via feature flag.
- Remove old column only after validation.
SQL example (Postgres):
ALTER TABLE users ADD COLUMN phone_numbers jsonb DEFAULT '[]'::jsonb;
-- backfill script in application or offline job
-- After feature flag enabled
ALTER TABLE users DROP COLUMN phone_number;
Use ALTER TABLE ... SET NOT NULL only after verifying no nulls remain.
Migration Testing
- Run migrations in CI against prod snapshots (masking sensitive data).
- Use tools like
strong_migrations(Rails) ordjango-migration-linterto warn about dangerous operations. - Write smoke tests verifying views, stored procedures, and downstream ETL still function.
Versioned Schema Contracts
- Maintain schema dumps (
structure.sql) per environment. - Compare dumps in CI to detect unauthorized changes.
- Use tools like
skeemaoratlasto manage schema diffs.
Observability
- Instrument migrations with duration, lock wait time, and rows affected metrics.
- Emit events when migrations run, including author and ticket ID.
Alprina Policies
Scan migration files for remove_column without backup. Flag change methods removing columns. Ensure down migrations exist and are tested.
Testing & Verification
- Run
rails db:migrateandrails db:rollback STEP=1in CI to ensure reversibility. - For Django,
python manage.py migrate --planand--databaseper environment. - Use
pg_dump --schema-onlybefore and after migration to compare. - Add integration tests hitting endpoints that depend on the modified schema.
Common Questions & Edge Cases
Do we have to keep backup tables forever? No. Keep them until confident nothing depends on old schema, then drop in later migration.
How to handle large tables? Use phased migrations (add column, backfill in batches, swap). Avoid locking statements like ALTER TABLE ... TYPE on large tables without downtime planning.
What about zero-downtime deployments? Coordinate app rollout with migrations. Use additive changes first, then deploy app, then clean up.
Can we use tools like Liquibase or Flyway? Yes. They provide checksums and history tables, making drift obvious.
How to handle data privacy? Maintain data retention policies in migrations. Do not delete regulated data without legal sign-off.
Conclusion
Safe migrations require discipline: explicit rollbacks, phased rollouts, and comprehensive testing. Treat schema changes as code with guardrails, and the next time you drop a column, you will know exactly how to recover if something goes sideways.