Approaches to data migration on Rails applications

ohbarye · September 15, 2024, 3:03am

Hi,

As Rails developers, we often encounter situations where we need to modify data in production databases. This could be to correct erroneously created data or to populate data after schema changes etc.

Defining these operations as “data migrations,” I’d like to discuss the following points:

What approaches exist for data migration on Rails?
Is there any officially recommended approach?

What approaches exist for data migration on Rails?

Based on my experience and discussions with colleagues, I’ve identified several approaches:

(1) Performing data migrations alongside db:migrate

This involves writing data migration code in the up, down, or change methods of ActiveRecord::Migration subclasses in the db/migrate directory. I’ve heard that GitLab adopts this approach. This approach was also long documented in the Rails Guides.

(2) Preparing and executing separate scripts from db migration files

Often implemented as rake tasks, executed via bin/rails my_task. This clearly separates data migrations from schema migrations. I’m most familiar with this approach.

(3) Executing SQL directly without using Rails

Connecting to the database with a user having write permissions and executing SQL. While it’s the most primitive approach, it’s less common in Rails projects, possibly because it bypasses model validations and callbacks. I’ve seen it used in non-Rails projects like Go or Java.

(4) Using specialized gems

GitHub - ilyakatz/data-migrate: Migrate and update data alongside your database structure.
- Implement scripts for data migration. Run via data:migrate rake task. (not db:migrate)
GitHub - ka8725/migration_data: Migrate data along with schema migrations in Rails and keep them up to date.
- Run scripts that hooks db:migrate
GitHub - pboling/seed_migration: Seed Migration
- Almost same as migration_data. Run via seed:migrate rake task.
GitHub - Shopify/maintenance_tasks: A Rails engine for queueing and managing data migrations.

The maintenance_tasks gem stands out by providing a rich GUI and features like job pausing and resuming.

Any other common/significant approaches that I miss?

Is there any officially recommended approach?

For a long time, the Active Record Migration Guide demonstrated performing data migrations simultaneously with schema migrations. This was present until the guide refresh in June 2024.

# quoted from https://guides.rubyonrails.org/v7.2/active_record_migrations.html#migrations-and-seed-data
class AddInitialProducts < ActiveRecord::Migration[7.2]
  def up
    5.times do |i|
      Product.create(name: "Product ##{i}", description: "A product.")
    end
  end

  def down
    Product.delete_all
  end
end

However, the current Active Record Migration Guide (as of September 2024) advises separating schema migrations from data migrations.

In Rails, it is generally not advised to perform data migrations using migration files.

Additionally, in this discussion, it was mentioned that Basecamp places data migration scripts in script/migrate/*. This led to the addition of a script generator in Add script folder and generator by jeromedalbert · Pull Request #52335 · rails/rails · GitHub.

Given these points, it seems the current official recommendation might be:

Separate schema migrations from data migrations.
Implement Ruby scripts in script/ and execute them via ruby script/migrate/foo.rb or rails r script/migrate/foo.rb.
Or, recommend GitHub - Shopify/maintenance_tasks: A Rails engine for queueing and managing data migrations. except in certain cases.

The “in certain cases” refers to scenarios outlined in the maintenance_tasks README:

If your task shouldn’t run as an Active Job, it probably isn’t a good match for this gem. If your task doesn’t need to run in the background, consider a runner script instead. If your task doesn’t need to be interruptible, consider a normal Active Job.

Thoughts

I understand that the best approach may vary depending on system and team size, and operational constraints etc. My goal is to be able to make better decisions considering the recent situation and these factors.

Past discussions

I’ve reviewed past discussions on data migration in Rails Discussions, but they seem to be from the 2010s or earlier. I hope that it’s worth to bring up this discussion in 2024.

chris.covington · September 15, 2024, 11:21pm

It’s one of those questions that really depends on where the project is in terms of current user load and the size of the table being modified.

If you have large windows in the day, especially at night, where a few minutes of downtime is easily acceptable, then why bother with something more complex?

I think there are two points I would stop using rails migrations for data migration purposes.

The migration involves a table that is quite large in size and doing the modification in a single transaction would be prohibitive.
There is significant user activity all hours of the day.

If you don’t have both of those constraints, then migrations are generally fine. For example I have the following two migrations in my pipeline.

def change
  create_table :recordings do |t|
    t.references :recordable, polymorphic: true
    t.belongs_to :parent, foreign_key: { to_table: :recordings }, index: true
    t.belongs_to :creator, foreign_key: { to_table: :people }, index: true
    t.belongs_to :account
    t.integer :position
    t.string :state, index: true

    t.timestamps
  end
end

As the schema migration and the follow up:

def change
  Person.includes(:account, { client: :workouts }).each do |person|
    next if person.client.nil?
    next if person.account.nil?

    workouts = person.client.workouts.pluck(:id, :created_at, :updated_at).map do |(id, created_at, updated_at)|
      {
        recordable_type: "Workout",
        recordable_id: id,
        created_at: created_at,
        updated_at: updated_at,
        creator_id: person.id,
        account_id: person.account.id
      }
    end

    Recording.insert_all!(workouts, returning: false, record_timestamps: false)
  end
end

For my case the total payload here is about 25k Recordings being created in ~300 total inserts

And in my case I’m testing this out via the pgsync gem, bringing the recrods in question down to my dev to ensure the data migration actually follows through on real production data… Great part about pgsync is you don’t have to bring down everything, I can tell it to sync the last week as a simple test, or sync the whole db to ensure no surprises.

Of course If I had 10k+ users with 10s of millions of recordings across all timezones, I would not do it this way, but then I would have a devops team to take care of the details of deploying it with more care.

jeromedalbert · September 16, 2024, 12:08am

I think I agree with you both. I will say though that when I was a junior developer it felt unclear where to put such scripts, and even now as I am more experienced, it is still a judgment call as there are multiple options to choose from.

Maybe we could update and improve the Active Record Migrations — Ruby on Rails Guides mentioned by @ohbarye to alleviate that feeling of uncertainty, and make this extra clear.

Like saying that small / low-impact / low-risk migrations may be good enough directly in a migration as mentioned by @chris.covington although you are doing this at your own risk. Then the next level could be recommending separate Ruby scripts in a folder like scripts/migrate for low to medium complexity data migrations (and maybe high complexity if your infrastructure is able to run these without being interrupted by deploys). Then the next level could be making your own interruptible idempotent background jobs in a folder like app/jobs/maintenance, or recommending a gem like maintenance_tasks for a fully-fledged background processing solution.

I do think that the the docs currently recommending maintenance_tasks without mentioning other approaches is a bit too much as it would introduce more complexity straight away and people may not realize that using a separate script might be good enough for their needs. But maybe it’s just me, so I’m also curious what other people think!

Confused_Vorlon · September 16, 2024, 12:05pm

I prefer migrating within the migrations. That keeps my migration logic in one place.

Also - I generally prefer sql rather than using Rails models.

The reason for SQL is that my migration/refactoring frequently changes the models.

So, that by the time I actually run a given migration the models it relies on have changed out from underneath it. Writing in sql means that the migration can ‘stand alone’ and I don’t have to worry about making sure it runs before I make other changes.

Bonus Tip: Chat GPT is great at writing that sql…

chris.covington · September 16, 2024, 5:49pm

maintenance_tasks makes a lot of sense for data migrations that need to be handled at scale. Though personally I would not use it without there being instacart/shopify levels of data to migrate.

The reason I use db/migrate is I had a hard time understanding how it was any different than putting it in some other directory and running it as a rake task. db/migrate has the advantage of already being part of the deployment process.

I would point out that I don’t keep data migrations around, they go in db/migrate, get run, then get deleted on the follow up commit where I would then add a migration to drop columns or whatever schema cleanup was needed.

chris.covington · September 16, 2024, 5:50pm

Definitely don’t use Rails model functions that trigger callbacks and whatnot. But using low-level rails models is generally fine.

I mean you should be deleting data migrations after they are used, they are transient concepts.

hakunin · September 20, 2024, 4:03am

I think following these 2 points is critical, or you will eventually end up with a big mess.

Avoid models in migrations. Models will change, and your older migrations will break. Call execute with raw SQL instead.
If you really really want to use models, you can define narrow-purpose models inside your migrations. Here’s a guide on this practice.

chris.covington · September 21, 2024, 9:40pm

I don’t understand, why would you keep migrations around, or worse, why would you run them in the distant future?

Migrations should be deleted over time, they are migrations, once they run there’s no point in keeping them.

Not including Engine based migrations of course…

hakunin · September 22, 2024, 3:32am

Are you saying that because you squash migrations sometimes, it’s okay to make them brittle?

I’m not a fan of playing Russian roulette with my workday. Much nicer to squash migrations because I planned on doing it today, not because things broke today.

The 2 practices above are pretty simple, and make it so you’re never forced to deal with unplanned squashing (which takes some coordination).

chris.covington · September 23, 2024, 1:30pm

Squash them? I’m not even sure what that means.

I delete them after they are deployed to production. They serve no purpose after that.

hakunin · September 23, 2024, 3:11pm

You must be working alone?

If you delete migrations, how will your teammates get db changes without dropping their local databases? Or do you make them all stop what they’re doing, switch to main branch, and migrate every time anyone decides to deploy?

Squashing means turning many migrations into a single cumulative one.

chris.covington · September 23, 2024, 5:28pm

Exactly this, db:reset so your loading the schema + seeds. I have never used migrations to update a dev environment and neither did anyone I worked with. If I’m pulling in changes that involve schema updates then I always use db:reset.

I can see this useful for engines because they are handled kind of weird by rails generators, not app migrations though.

All that being said, if your team is unwilling to do db:resets, then of course you are stuck with all of these extra rules. I’m too much of a rule breaker to want more rules than necessary

hakunin · September 23, 2024, 6:52pm

It’s not that they’re unwilling, it’s that people have convenient data built up over time for local dev. Resetting would throw it away. You could have a shared data pack or seeds (which aren’t exactly meant for this), but that’s also a thing you have to maintain.

I appreciate your rule-breaking spirit, but you’re replacing one rule that can be broken, with another that cannot (so nobody else besides you gets to be a rule breaker).

jnicho02 · September 27, 2024, 8:51am

I’ve been using a process with Dev databases where seeds.rb creates static reference data via ActiveRecord statements, loads in Production-editable reference data from dump files included in git, then populates realistic examples using the same FactoryBot models that are used in testing.

A good test for this is onboarding a new Dev.

chris.covington · September 27, 2024, 5:43pm

I used to do something like that until I found pgsync from Instacart. If you’re on Postgres it’s a game changer. Basically you can create sync scripts that will pull some slice of production, which you can shape very well, and obfuscate/randomize for privacy purposes.

I could see a team syncing a slice of production on a daily update to a more accessible “dev” server that less privileged devs have access to. That way resets would follow up with a consistent clone and all devs work with real world data.

Topic		Replies	Views
Patterns for "data-only" Rails migrations A May Of WTFs migrations	21	6585	February 4, 2021
In the need for separating data migration from schema migration rubyonrails-core	6	692	April 15, 2015
Migrations Shortcomings rubyonrails-talk	11	109	October 15, 2006
Using Migrations to load user data? rubyonrails-talk	3	164	November 14, 2014
Where should data migrations go? rubyonrails-talk	4	120	November 27, 2010

Approaches to data migration on Rails applications

What approaches exist for data migration on Rails?

Is there any officially recommended approach?

Thoughts

Past discussions

Related topics

More Resources