Patterns for "data-only" Rails migrations

Since the topic name covers such broad functionality, I’ll add my little annoyance with migrations here as well. A lot of people say that data migrations are an anti-pattern when built on Rails’ existing database migration functionality.

There are a couple of gems that address this, but it would be nice if Rails offered official recommendations for data migrations, or maybe even dedicated migration functionality for data only.

I’ve worked with different teams that take completely different approaches to this problem, and it felt like we spent a lot of time deciding on the proper way to go about it.

6 Likes

Can you expand on this a bit, what is a data migration?

Here are some examples I can dig up at Discourse (there are tons more):

Do said people say that any migration containing DML is an anti-pattern and migrations shall only include DDL?

I think Rails should certainly document how to use the sharp knife that is disable_ddl_transaction!, you can search discourse for a few places we use it such as this: discourse/20180716140323_add_uniq_ip_or_user_id_topic_views.rb at a2d939608d5b3adc7f037dc0ca38e7d1f6b895f6 · discourse/discourse · GitHub

I also think Rails should establish a pattern for post migrations, something that allows us to drop columns with minimal downtime.

I am not sure though that there needs to be a whole new paradigm for data migrations (unless I am misunderstanding what people mean by data migrations)

2 Likes

I think part of the pushback against changing data in migrations comes from relying on existing Active Record model code to change data instead of writing raw SQL. The fun part is that breaking changes to that code don’t show up until somebody needs to run the migrations from scratch.

2 Likes

Yeah, we have very strong rules about that: never lean on model or app code during migrations, use raw SQL, and keep migrations stand-alone and stable over time.

1 Like

The “raw SQL” idea is intriguing and I’ve never heard it before. I like it a lot, as someone who has previously been in camp “no data migrations, they will have sneaky app-breaking consequences down the line.”

I’m moving this topic to its own thread where it can get proper attention.

I much prefer writing data munging scripts in Ruby though. Writing non-trivial UPDATEs in SQL is intimidating. Defining a new ActiveRecord class in the migration file helps expose the power of AR without something breaking down the line, but testing the logic in that class is tricky.

3 Likes

I personally make a practice of deleting my test database from time to time, and then running migrations up from the start. It tends to catch these sorts of things very quickly. Our CI also drops the test database before each run, and so “works on my Mac” isn’t a good excuse for anyone.

Walter

1 Like

Can you go more into this approach? I think I understand what you mean, and it sounds pretty cool, but I’m not 100% sure if the thing I’m imagining is the same as the thing you do.

Something like:

class Migration < ActiveRecord::Base
  self.table_name = 'schema_migrations'
end

Migration.last # should return the last entry in schema_migrations as a Migration object

Basically an AR class which only exists in the migration file. self.table_name= allows it to manage any existing table in the DB. Can also define multiple classes and set up associations.

1 Like

We mix data and sql statements in migrations all the time at Basecamp. This is what migrations were designed to do! They’re intended to be moments in time. Not to be replayable forever. The stable unit is schema.rb (or structure.sql).

3 Likes

Then I am sorry to say this, but my experience is the exact opposite of those intentions. schema.rb was never a stable unit for us, and during the lifetime of our applications we had several instances where we needed to regenerate it from scratch. We still run into minor commit conflicts with schema.rb.

So keeping migrations free from AR operations is a safe strategy in the long term. No one knows when Rails will switch the default id datatype again, or whether we will decide to switch to another DB. Having a clean migration path will help a lot in those cases, if they happen.

4 Likes

How do you manage data migrations that do not fit within those constraints?

One case I had recently:

I had a field with 2 possible values per user that, for various reasons, we had to transform, on a table that had around 3-4 rows per group of users.

That transformation was not trivial: not only did it need to go through some complicated business logic, it also needed to emit some notification events in the process, for things like traceability, etc…

In my experience, running those within schema migrations becomes problematic pretty quickly as your data grows, and in some cases it might even become a danger depending on your deploy strategy.

The solutions I’ve seen so far for these cases (at different companies) were just ad-hoc scripts that somebody ran, like bundle exec ruby some_script.rb on the production server, or in the best of cases a rake task.

Don’t get me wrong, I’m not saying this is wrong either. But I wonder what other tools we could have around processes like this that would make it easier… I’m also not sure it fits into Rails.

Some of the things I’m thinking of, which I believe better data migration tooling would provide:

  1. Specific Instrumentation: having things like DataDog or NewRelic just hook into Rails’ instrumentation support is amazing; it saves time and headaches and makes issues easier to solve. I miss having this without having to build it on my own.
  2. User input capabilities. I’ve found in more than one case that I wanted to do things like: if we reach this case, stop and ask some domain expert before continuing.
  3. Notifications. Summaries of the actions that have been taken, the time they took, or the human interventions.
  4. Versions. Kind of what you get with rails db:migrate:status, but for data migrations (there are some gems that do this already, I think).
  5. Good output by default. Knowing how long the process will take, or how much of the data I need to migrate has already been migrated.
  6. Testability: I’d love a super simple way to write specific fixtures that I could use to write tests for these cases (or to test them manually).
  7. Disposability: many of these are there to be run just once; after some time they can be disposed of, and that should also easily dispose of their tests and fixtures.
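To make a couple of those points concrete, here is a pure-Ruby sketch of what version tracking plus progress output could look like (DataMigrationRunner and everything in it is made up for illustration, not an existing gem API):

```ruby
# Everything here is hypothetical, pure Ruby with no Rails dependency.
class DataMigrationRunner
  def initialize
    @applied = {} # stand-in for a "data_migrations" versions table
  end

  # Run a named data migration at most once, reporting progress as it goes.
  def run(version, records)
    return "#{version}: already applied" if @applied.key?(version)

    records.each_with_index do |record, i|
      yield record
      puts "#{version}: #{i + 1}/#{records.size} migrated" if ((i + 1) % 100).zero?
    end
    @applied[version] = Time.now
    "#{version}: applied (#{records.size} records)"
  end

  # Kind of `rails db:migrate:status`, but for data migrations.
  def status
    @applied.keys.sort
  end
end
```

A runner like this would also be the natural place to hang instrumentation and notifications.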

Again, I’m sure that for many, many cases nothing will beat a well written SQL query. But for the more complicated cases, the ones in which you need to scale up your solution: how do others manage beyond schema migrations? Does this not exist in the Rails world because it is not such a common itch? Does it exist and I’m just not aware of it? Would somebody else like to join in if I decided to build it?

I’m also not sure that something like this should be supported by Rails but the topic was here and I couldn’t resist asking just in case.

PS: Thanks everybody for the May Of WTF!

I think operational process needs to be considered for complicated data changes. I don’t use Rails migrations to handle data changes that I’m not confident enough to run unattended.

Some of the requirements that I think are important for complex data migrations are:

  • Require code review and testing by another dev before going ahead
  • Ideally tested on a copy of production data
  • Be able to generate data reports before & after changing data
  • Be able to run idempotently
  • Be executed manually - they usually affect some customer, so shouldn’t be run while that customer is active
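As a minimal sketch of the “run idempotently” requirement, with plain hashes standing in for records (field names are hypothetical): only rows that still need the change are selected, so running it twice is harmless.

```ruby
# Hypothetical field names; the pattern is: select only rows that still
# need the change, so a re-run finds nothing left to do.
def migrate_plan_names!(records)
  needing_change = records.select { |r| r[:plan] == "legacy" }
  needing_change.each { |r| r[:plan] = "basic" }
  # Return a small report, which doubles as the before/after data report.
  { changed: needing_change.size, skipped: records.size - needing_change.size }
end
```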

Yup, I agree with the list of constraints you mentioned; they make a lot of sense.

But my point is that something like this doesn’t seem to exist, and that you are (as I am) building something custom whenever it is needed.

Do you feel that good tooling for data processing like this, which hooks into all the stuff Rails provides, could be nice to have?

I’m not sure this is still on topic with the rest of the post, and I wouldn’t want to deviate from the original topic. Should we continue talking about it in a different post?

I saw this in the GitLab migrations:

There are lots of examples there and even a guide online about all the special rules they have for migrations: Migration Style Guide | GitLab. We certainly do not follow all of the GitLab practices at Discourse. That said, I think it is a very interesting read.

Personally, I just prefer to write SQL; it has many very powerful constructs that give you extreme levels of control.

I think this can certainly work for most Rails applications out there; sadly, for large open source apps like Discourse / GitLab it is not really an option. Often people will hold off upgrading for a year for … reasons. We love that you can just run migrations on a year-old database and update to latest.

We stay safe by rerunning all our migrations every time we run our tests in CI. We can always go from 0 to a fully populated DB.

This is a tricky tricky problem that we hit regularly enough. We have also evolved a lot in our thinking over the years.

We have a system called OnceOff jobs that are guaranteed to fire once. We tend to place complicated stuff like that in this system.

These jobs fire post deploy, and only fire once. In general we tend to only put “best effort” kind of data migrations there. For example, the OnceOff I linked above could fail, and the end users can still recover if this happens by updating settings in the admin UI.
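The semantics can be sketched in a few lines of plain Ruby (this is an illustration of the “fire once, best effort” idea, not Discourse’s actual OnceOff implementation):

```ruby
# Hypothetical sketch: a job is only recorded as complete after its block
# succeeds, so a failed job stays eligible for a retry, and a completed
# one is never re-run.
class OnceOff
  def initialize
    @completed = {} # a real system would persist this in a table
  end

  def perform(name)
    return :skipped if @completed[name]
    yield
    @completed[name] = true # only reached if the block did not raise
    :performed
  end
end
```

Because failure leaves no record, a crashed job simply runs again on the next deploy, which fits the “best effort, user can recover” framing.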

The OnceOff system was built before we introduced post_migrate directory. Many things we put there in the past, we should probably move to post_migrate.

post_migrate is a common system both GitLab and Discourse use. Our deployment system works like so:

  1. Run migrations
  2. Deploy new code to all machines
  3. Run post migrations (if this fails deploy failed)

This allows us to do things like drop columns from a table safely, because we can blacklist them in Active Record prior to removal.

I mentioned this system to @dhh and he is very open to including this pattern in Rails, so we will look at making a PR to Rails in the next few months. Unless someone here beats us to it 🙂

Anyway … so we have the OnceOff system for “best effort” jobs, and we also have rake tasks that people run by hand for certain types of migrations (moving uploads from local to S3, for example).

The OnceOff system, though, is not a system I would recommend for Rails quite yet; there are just too many cracks. When a job fails … how do you remediate? For our own hosting it is simple enough, but for self-hosters … not so easy.

5 Likes

We also use migrations as throwaway, point-in-time code to transition our database structure or data from one format to another. The schema is the only source of truth (although it goes out of sync on occasion).

There are a few gotchas to using models within a migration. There are other gotchas related to dropping a column when the model still references it, and database locks can be a real killer as well. Only recently, we tried to drop a column; the operation itself took 0.002s or so, but it required a lock. We also had a backup running pg_dump at the same time, so the lock was blocking pretty much everything…

The strong_migrations gem does a great job in my opinion; it helped us introduce lock timeouts, for example, as well as providing guidelines on how to avoid removing columns while the model still expects them to be there.

I’m a bit unsure whether this gem is suitable for beginners, because it can make the whole process intimidating. At the same time, developers should perhaps be intimidated when facing a potential foot gun. I would say the gem provides great training wheels though, and perhaps this is something that should be included in Rails? or installed by default?

1 Like

There are lots of interesting ideas in the strong_migrations gem.

We took some of the ideas and implemented in Discourse. discourse/safe_migrate.rb at 6b92c78033a1a26eea56f0417b6811581fab7a38 · discourse/discourse · GitHub

I strongly recommend going the post_migrate route for column removal, though, versus the 2 deploys that the strong_migrations gem advocates to get a column dropped.

I think there are some concepts Rails could borrow, especially when/if Rails gets post migrations. For example, Rails could hint at using post migrations for column removal.

2 Likes

I’m not involved in any way with the gem, but I wouldn’t say it’s advocating that exactly. It’s more just guiding you on what to do, given Rails’ migration constraints. Given that there are no post-deploy migrations in Rails, this seems like a sensible thing to do, rather than reinventing post-deploy migrations as well?

Yes, I completely agree. Doing 2 deploys (and typically 2 commits, 2 PRs, 2 CI runs, etc.) is quite tedious for something as relatively straightforward as removing a column. Having both pre-deploy and post-deploy migrations would be really neat.

1 Like

The issues that I mentioned are more around operational process than tooling. I don’t run into them frequently enough to have developed a strong preference around what tooling should do.

I use the following pattern for data munging scripts which are intended to be run by an experienced developer from the console:

class SomeDataChanger
  def self.load
    # returns the set of records that need to be changed
  end 

  def self.changes
    # returns the set of records that need to be changed with changes applied to them
  end

  def self.patch
    # iterates through the set of changed records and saves them, returning a set of save results
  end 

  def self.run
    # calls .load and stores a summary before changes
    # calls .patch and stores save results
    # calls .load again to store a summary after changes
    # returns before / after summaries and save results
  end
end
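To show how the pieces fit together, here is a toy, pure-Ruby instance of the pattern above, with an in-memory “table” in place of Active Record (all names made up): .changes gives a dry-run preview, and .run yields the before/after summaries.

```ruby
# Toy data standing in for an AR relation; in real use .load would be a query.
class NameUpcaser
  RECORDS = [{ id: 1, name: "ann" }, { id: 2, name: "BOB" }]

  # The set of records that still need the change.
  def self.load
    RECORDS.select { |r| r[:name] != r[:name].upcase }
  end

  # A dry-run preview: the same records with the change applied, nothing saved.
  def self.changes
    load.map { |r| r.merge(name: r[:name].upcase) }
  end

  # Apply the change for real.
  def self.patch
    load.each { |r| r[:name] = r[:name].upcase }
  end

  # Before/after summary around the actual patch.
  def self.run
    before = load.size
    patch
    { before: before, after: load.size } # after should reach 0
  end
end
```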

I like how safe-pg-migrations alleviates a lot of issues with locks, timeouts, and defaults, with zero changes to your migration code.