No downtime database migrations

A RailsConf talk by Kir Shatov (blog post - http://kirshatrov.com/2018/04/01/async-migrations/) got me thinking about database migrations. Shopify has written a custom method for running migrations asynchronously and separate from deploys. They try to make all migrations backward compatible so both old and new code can work (no renaming or deleting columns). GitLab has also done some work on this with a concept of standard migrations and post-deploy asynchronous migrations. I’m sure other companies have done some similar custom solutions internally to minimize downtime.

I think this illustrates that there ought to be a generic and builtin solution for no downtime migrations in Rails. In my mind there are at least two strategies that should be considered when attempting a solutions here - continuous deployment/Kubernetes and periodic/standard deployment. Shopify’s migration solution seems to cater well to continuous deployment/Kubernetes because a background job runs periodically to check for new migrations. GitLab’s caters more to standard deployments where we want to deploy some code with database changes and then immediately come back post-deploy and clean up the stuff we no longer need. I bet we can come up with a solution that addresses both situations. However, Kubernetes is the perfect driver for change here because the deployment strategy almost guarantees you will have different versions of code running together while a deploy is in progress.

So, two questions:

  1. Is this something that we should solve in Rails?

  2. What are some thoughts on how to solve this?

Migrations without downtime are mostly a logical problem. You need to write them in a way that allows schema or data changes and two versions of the code coexisting. Sometimes that means planning a schema change in a coordinated way that perhaps requires a couple or three deploys.

At the framework level you can provide utilities. For example, the ability to ignore columns is handy to plan column deletion by first shipping a version of the app that doesn’t use it. Anything that can help in that sense would be cool I think, but ultimately the need for planning seems unavoidable.

Perhaps Rails could throw a warning of some sort for non-safe migrations?

https://github.com/ankane/strong_migrations

Yes, definitely planning will always be a part of it. There are some things that we can do to help, though. For example, can add_column automatically avoid locking the table by adding a column, then adding a default value? Maybe that shouldn’t be the default, so instead maybe add a new method add_column_with_default that’s graceful.

Also, adding a concept of pre and post deploys or a method by which you can make a graceful change and then clean that up later.

These would be good places to start investigating. I also talked with someone here at the conference and his thought was that part of it may simply be having a conversation about this and then documenting some best practices.

Yeah, I agree that this is best attacked by documentation rather than coding it into Rails or using strong_migrations. Unfortunately even for that simple example, the best way to add columns is actually completely database-specific and version-specific.

For example, the below advice is the right thing to do on current versions of postgresql, but it is exactly the wrong thing to do on mysql and mariadb - current versions of those will add columns with defaults without blocking other transactions (they’ll rewrite the table, but they’ll buffer concurrent DML), whereas a separate update statement would block other transactions. (Older versions would block concurrent transactions, but doing an ADD COLUMN + an UPDATE is even worse as it would mean rewriting the table rows twice, doubling the blocking.) So it is not good advice for any version of mysql/mariadb (IMHO strong_migrations is only correct for postgresql).

It’s also a moving target. For example, the next version of postgresql is able to add columns with defaults without blocking as well (see https://www.depesz.com/2018/04/04/waiting-for-postgresql-11-fast-alter-table-add-column-with-a-non-null-default/).

Personally I don’t think there is any point baking this knowledge into the migration system; it’ll fall out of date too fast. All the database vendors are actively working on making the basic DDL operations safer and O(1); the remaining cases that you need to look out for are IMHO too dependent on your app and your deployment strategy.

This is because aside from the impact of the DDL statement itself, ignoring the ADD COLUMN case which is the simplest one, most of the problems are to do with explicit references plus the cached column lists in instances of the app already running.

If you’re using a flash reload strategy, the overlap in instances with the old and new column lists could be <10s; if you’re using a rolling deploy, it could be minutes; if you’re using canaries it could be hours. So depending on your strategy you might accept the risk, or you might need to do multi-stage releases.

So Xavier’s right, what is “safe” depends on your app and needs to be thought through at a higher level than the impact of running individual migration statements.

The ignored column mechanism Xavier mentioned can solve some of the problems (see also mariadb’s hidden column feature, and even the virtual column feature which can help with renames). The other half depends on the code that you have written - if your queries specifically reference a column that you want to rename or remove, you are going to have to do a multi-stage release if you don’t want to break it during your deploy window.

+1 to popularising the idea of post-deployment migrations though. We do this at our company (using a small patch to the migration code - I’ll post it if anyone wants it) and it in practice allows us to cover a lot of the major use cases such as adding a new column and starting to populate it in a new release + using an after-release migration to back-populate the data. (You can do the same thing with more individual merge-and-releases, but it’s more work.)

It would be a bit awkward to add to Rails core though, because it requires explicit support from your deployment tools. Some deployment techniques don’t even know for sure when the instances running the old versions of the app have gone away, so they can’t support this pattern.

Cheers,

Will

Excellent email Will, agree with all your points.

+1 to popularising the idea of post-deployment migrations though. We do this at our company (using a small patch to the migration code - I’ll post it if anyone wants it) and it in practice allows us to cover a lot of the major use cases such as adding a new column and starting to populate it in a new release + using an after-release migration to back-populate the data. (You can do the same thing with more individual merge-and-releases, but it’s more work.)

I’m interested in the patch you mentioned :slight_smile:

Hey Guillermo, please do share your patch I’m interested :)…We are also patching migrations :slight_smile:

I was exploring adding sprinkles of “no-dowtime” migration to Rails in the last few months, but I’m hitting the dead-end (for similar reasons that Will and Xavier mentioned)

Hi, sorry for the slow reply, I had to edit it down a bit to get rid of some boilerplate from our project.

Basically you just dump the following in config/initializers/after_release_migrations.rb, and run MODE=before bundle exec rake db:migrate before rolling out your new app version, and MODE=after bundle exec rake db:migrate after you’ve finished restarting instances. The latter will run any migrations whose name starts with AfterRelease, whereas the former will skip them.

(After-release migrations are only suitable for data back-population. Make sure you only do schema migrations in the regular migrations, because you’ll get surprising results later on if you change the schema again after restarting instances, as different processes may or may not have seen the extra schema change and so may behave differently to one another!)

Hope this is useful!

require ‘active_record/migration’

module ActiveRecord

class MigrationProxy

def after_release_only?

  name.starts_with?("AfterRelease")

end

end

module MigratorWithAfterReleaseCheck

def migrations(*args)

  case ENV["MODE"].try!(:downcase)

  when "after"

    super.select(&:after_release_only?)

  when "before"

    super.reject(&:after_release_only?)

  when "", nil

    super

  else

    raise "Unknown value '#{ENV["MODE"]}' for environment value MODE: must be unset, 'after' or 'before'"

  end

end

end

Migrator.singleton_class.prepend(MigratorWithAfterReleaseCheck)

end