Patterns for "data-only" Rails migrations

Since the topic name covers such broad functionality, I’ll add my little annoyance with migrations here as well. A lot of people say that data migrations are an anti-pattern when built on Rails’ existing database migration functionality.

There are a couple of gems that address this, but it would be nice if Rails offered official recommendations for data migrations, or maybe even dedicated migration functionality for data only.

I’ve worked with different teams that take completely different approaches to this problem, and it felt like we spent a lot of time deciding on the proper way to go about it.

6 Likes

Can you expand on this a bit, what is a data migration?

Here are some examples I can dig up at Discourse (there are tons more):

Do said people say that any migration containing DML is an anti-pattern and migrations shall only include DDL?

I think Rails should certainly document how to use the sharp knife that is disable_ddl_transaction!, you can search discourse for a few places we use it such as this: discourse/20180716140323_add_uniq_ip_or_user_id_topic_views.rb at a2d939608d5b3adc7f037dc0ca38e7d1f6b895f6 · discourse/discourse · GitHub

I also think Rails should establish a pattern for post migrations, something that allows us to drop columns with minimal downtime.

I am not sure though that there needs to be a whole new paradigm for data migrations (unless I am misunderstanding what people mean by data migrations)

2 Likes

I think part of the pushback against changing data in migrations comes from relying on existing Active Record model code to change data instead of writing raw SQL. The fun part is that breaking changes to that code don’t show up until somebody needs to run the migrations from scratch.

2 Likes

Yeah, we have very strong rules about that: never lean on model or app code during migrations, use raw SQL, and keep migrations stand-alone and stable over time.

1 Like

The “raw SQL” idea is intriguing and I’ve never heard it before. I like it a lot, as someone who has previously been in camp “no data migrations, they will have sneaky app-breaking consequences down the line.”

I’m moving this topic to its own thread where it can get proper attention.

I much prefer writing data munging scripts in Ruby though. Writing non-trivial UPDATEs in SQL is intimidating. Defining a new ActiveRecord class in the migration file helps expose the power of AR without something breaking down the line, but testing the logic in that class is tricky.

3 Likes

I personally make a practice of deleting my test database from time to time, and then running migrations up from the start. It tends to catch these sorts of things very quickly. Our CI also drops the test database before each run, and so “works on my Mac” isn’t a good excuse for anyone.

Walter

1 Like

Can you go more into this approach? I think I understand what you mean, and it sounds pretty cool, but I’m not 100% sure if the thing I’m imagining is the same as the thing you do.

Something like:

class Migration < ActiveRecord::Base
  self.table_name = 'schema_migrations'
end

Migration.last # should return the last entry in schema_migrations as a Migration object

Basically an AR class which only exists in the migration file. self.table_name= allows it to manage any existing table in the DB. Can also define multiple classes and set up associations.

1 Like

We mix data and sql statements in migrations all the time at Basecamp. This is what migrations were designed to do! They’re intended to be moments in time. Not to be replayable forever. The stable unit is schema.rb (or structure.sql).

3 Likes

Then I am sorry to say this, but my experience is the exact opposite of those intentions. schema.rb was never a stable unit for us, and during the lifetime of our applications we had several instances where we needed to regenerate it from scratch. We still run into minor commit conflicts with schema.rb.

So keeping migrations free from AR operations is a safe strategy in the long term. No one knows when Rails will switch the default id datatype again, or whether we will decide to switch to another DB. Having a clean migration path will help a lot in those cases, if they happen.

4 Likes

How do you manage data migrations that do not fit within those constraints?

One case I had recently:

I had a field with 2 possible values per user that, for various reasons, we had to transform, on a table that had around 3-4 rows per group of users.

That transformation was not trivial: not only did it need to go through some complicated business logic, it also needed to emit some notification events in the process, for things like traceability, etc…

In my experience, running those within schema migrations becomes problematic pretty quickly as your data grows, and in some cases it might even become a danger depending on your deploy strategy.

The solutions I’ve seen so far for these cases (at different companies) were just ad-hoc scripts that somebody ran, like bundle exec ruby some_script.rb on the production server, or in the best of cases a rake task.

Don’t get me wrong, I’m not saying this is wrong either. But I wonder what other tools we could have around processes like this that would make it easier… I’m also not sure it fits into Rails.

Some of the things I’m thinking of, which I believe better data migration tooling would provide:

  1. Specific Instrumentation: having things like DataDog or NewRelic just hook into Rails’ instrumentation support is amazing; it saves time and headaches and makes issues easier to solve. I miss having this without having to build it on my own.
  2. User input capabilities. I’ve found in more than one case that I wanted to do things like: if we reach this case, stop and ask some domain expert before continuing.
  3. Notifications. Summaries of the actions that have been taken, the time they took, or the human interventions.
  4. Versions. Kind of what you get with rails db:migrate:status, but for data migrations (there are some gems that do this already, I think).
  5. Good output by default. Knowing how long the process will take, or how much of the data I need to migrate has already been migrated.
  6. Testability: I’d love a super simple way to write specific fixtures that I could use to write tests for these cases (or to test them manually).
  7. Disposability: many of these are there to be run just once; after some time they can be disposed of, and that should also easily dispose of their tests and fixtures.
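To make a couple of those points concrete, here is a pure-Ruby sketch of what version tracking plus progress output could look like (DataMigrationRunner and everything in it is made up for illustration, not an existing gem API):

```ruby
# Everything here is hypothetical, pure Ruby with no Rails dependency.
class DataMigrationRunner
  def initialize
    @applied = {} # stand-in for a "data_migrations" versions table
  end

  # Run a named data migration at most once, reporting progress as it goes.
  def run(version, records)
    return "#{version}: already applied" if @applied.key?(version)

    records.each_with_index do |record, i|
      yield record
      puts "#{version}: #{i + 1}/#{records.size} migrated" if ((i + 1) % 100).zero?
    end
    @applied[version] = Time.now
    "#{version}: applied (#{records.size} records)"
  end

  # Kind of `rails db:migrate:status`, but for data migrations.
  def status
    @applied.keys.sort
  end
end
```

A runner like this would also be the natural place to hang instrumentation and notifications.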

Again, I’m sure that for many, many cases nothing will beat a well written SQL query. But for the more complicated cases, the ones in which you need to scale up your solution: how do others manage beyond schema migrations? Does this not exist in the Rails world because it is not such a common itch? Does it exist and I’m just not aware of it? Would somebody else like to join in if I decided to build it?

I’m also not sure that something like this should be supported by Rails but the topic was here and I couldn’t resist asking just in case.

PS: Thanks everybody for the May Of WTF!

I think operational process needs to be considered for complicated data changes. I don’t use Rails migrations to handle data changes that I’m not confident enough to run unattended.

Some of the requirements that I think are important for complex data migrations are:

  • Require code review and testing by another dev before going ahead
  • Ideally tested on a copy of production data
  • Be able to generate data reports before & after changing data
  • Be able to run idempotently
  • Be executed manually - they usually affect some customer, so shouldn’t be run while that customer is active
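As a minimal sketch of the “run idempotently” requirement, with plain hashes standing in for records (field names are hypothetical): only rows that still need the change are selected, so running it twice is harmless.

```ruby
# Hypothetical field names; the pattern is: select only rows that still
# need the change, so a re-run finds nothing left to do.
def migrate_plan_names!(records)
  needing_change = records.select { |r| r[:plan] == "legacy" }
  needing_change.each { |r| r[:plan] = "basic" }
  # Return a small report, which doubles as the before/after data report.
  { changed: needing_change.size, skipped: records.size - needing_change.size }
end
```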

Yup, I agree with the list of constraints you mentioned; they make a lot of sense.

But my point is that something like this doesn’t seem to exist, and that you are (as I am) building something custom whenever it is needed.

Do you feel that good tooling for data processing like this, which hooks into all the stuff Rails provides, could be nice to have?

I’m not sure this is still on topic with the rest of the post, and I wouldn’t want to deviate from the original topic. Should we continue talking about it in a different post?

I saw this in the GitLab migrations:

There are lots of examples there and even a guide online about all the special rules they have for migrations: Migration Style Guide | GitLab. We certainly do not follow all of the GitLab practices at Discourse. That said, I think it is a very interesting read.

Personally, I just prefer to write SQL; it has many very powerful constructs that give you extreme levels of control.

I think this can certainly work for most Rails applications out there; sadly, for large open source apps like Discourse / GitLab it is not really an option. Often people will hold off upgrading for a year for … reasons. We love that you can just run migrations on a year-old database and update to latest.

We stay safe by rerunning all our migrations every time we run our tests in CI. We can always go from 0 to a fully populated DB.

This is a tricky tricky problem that we hit regularly enough. We have also evolved a lot in our thinking over the years.

We have a system called OnceOff jobs that are guaranteed to fire once. We tend to place complicated stuff like that in this system.

These jobs fire post deploy, and only fire once. In general we tend to only put “best effort” kind of data migrations there. For example, the OnceOff I linked above could fail, and the end users can still recover if this happens by updating settings in the admin UI.
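The semantics can be sketched in a few lines of plain Ruby (this is an illustration of the “fire once, best effort” idea, not Discourse’s actual OnceOff implementation):

```ruby
# Hypothetical sketch: a job is only recorded as complete after its block
# succeeds, so a failed job stays eligible for a retry, and a completed
# one is never re-run.
class OnceOff
  def initialize
    @completed = {} # a real system would persist this in a table
  end

  def perform(name)
    return :skipped if @completed[name]
    yield
    @completed[name] = true # only reached if the block did not raise
    :performed
  end
end
```

Because failure leaves no record, a crashed job simply runs again on the next deploy, which fits the “best effort, user can recover” framing.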

The OnceOff system was built before we introduced post_migrate directory. Many things we put there in the past, we should probably move to post_migrate.

post_migrate is a common system both GitLab and Discourse use. Our deployment system works like so:

  1. Run migrations
  2. Deploy new code to all machines
  3. Run post migrations (if this fails deploy failed)

This allows us to do things like drop columns from a table safely, because we can blacklist them in Active Record prior to removal.

I mentioned this system to @dhh and he is very open to including this pattern in Rails, so we will look at making a PR to Rails in the next few months. Unless someone here beats us to it 🙂

Anyway … so we have the OnceOff system for “best effort” jobs, and we also have rake tasks that people run by hand for certain types of migrations (moving uploads from local to S3, for example).

The OnceOff system, though, is not a system I would recommend for Rails quite yet; there are just too many cracks. When a job fails … how do you remediate? For our own hosting it is simple enough, but for self-hosters … not so easy.

5 Likes

We also use migrations as throwaway, point-in-time code to transition our database structure or data from one format to another. The schema is the only source of truth (although it goes out of sync on occasion).

There are a few gotchas to using models within a migration. There are other gotchas related to dropping a column when the model still references it, and database locks can be a real killer as well. Only recently, we tried to drop a column; the operation itself took 0.002s or so, but it required a lock. We also had a backup running pg_dump at the same time, so the lock was blocking pretty much everything…

The strong_migrations gem does a great job in my opinion; it helped us introduce lock timeouts, for example, as well as providing guidelines on how to avoid removing columns while the model still expects them to be there.

I’m a bit unsure whether this gem is suitable for beginners, because it can make the whole process intimidating. At the same time, developers should perhaps be intimidated when facing a potential foot gun. I would say the gem provides great training wheels though, and perhaps this is something that should be included in Rails? or installed by default?

1 Like

There are lots of interesting ideas in the strong_migrations gem.

We took some of the ideas and implemented in Discourse. discourse/safe_migrate.rb at 6b92c78033a1a26eea56f0417b6811581fab7a38 · discourse/discourse · GitHub

I strongly recommend going the post_migrate route for column removal, though, versus the 2 deploys that the strong_migrations gem advocates to get a column dropped.

I think there are some concepts Rails could borrow, especially when/if Rails gets post migrations. For example, Rails could hint at using post migrations for column removal.

2 Likes

I’m not involved in any way with the gem, but I wouldn’t say it’s advocating that exactly. It’s more just guiding you on what to do, given Rails’ migration constraints. Given that there are no post-deploy migrations in Rails, this seems like a sensible thing to do, rather than reinventing post-deploy migrations as well?

Yes, I completely agree. Doing 2 deploys (and typically 2 commits, 2 PRs, 2 CI runs, etc.) is quite tedious for something as relatively straightforward as removing a column. Having both pre-deploy and post-deploy migrations would be really neat.

1 Like

The issues that I mentioned are more around operational process than tooling. I don’t run into them frequently enough to have developed a strong preference around what tooling should do.

I use the following pattern for data munging scripts which are intended to be run by an experienced developer from the console:

class SomeDataChanger
  def self.load
    # returns the set of records that need to be changed
  end 

  def self.changes
    # returns the set of records that need to be changed with changes applied to them
  end

  def self.patch
    # iterates through the set of changed records and saves them, returning a set of save results
  end 

  def self.run
    # calls .load and stores a summary before changes
    # calls .patch and stores save results
    # calls .load again to store a summary after changes
    # returns before / after summaries and save results
  end
end
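To show how the pieces fit together, here is a toy, pure-Ruby instance of the pattern above, with an in-memory “table” in place of Active Record (all names made up): .changes gives a dry-run preview, and .run yields the before/after summaries.

```ruby
# Toy data standing in for an AR relation; in real use .load would be a query.
class NameUpcaser
  RECORDS = [{ id: 1, name: "ann" }, { id: 2, name: "BOB" }]

  # The set of records that still need the change.
  def self.load
    RECORDS.select { |r| r[:name] != r[:name].upcase }
  end

  # A dry-run preview: the same records with the change applied, nothing saved.
  def self.changes
    load.map { |r| r.merge(name: r[:name].upcase) }
  end

  # Apply the change for real.
  def self.patch
    load.each { |r| r[:name] = r[:name].upcase }
  end

  # Before/after summary around the actual patch.
  def self.run
    before = load.size
    patch
    { before: before, after: load.size } # after should reach 0
  end
end
```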

I like how safe-pg-migrations alleviates a lot of issues with locks, timeouts, and defaults, with zero changes to your migration code.