Since topic name covers such a broad functionality, I’ll add my little annoyance with migrations here as well. A lot of people say that data migrations are an anti-pattern with existing database migrations functionality in rails.
While there are couple of gems that address that. It would be nice if Rails would offer a official recommendations for data migrations or maybe even a dedicated migration functionality for data only.
I’ve been working with different teams that have completely different approaches to solve this problem. And it felt, that we spend a lot of time deciding on proper way to go about it.
Think part of the pushback against changing data in migrations comes from relying on existing ActiveRecord code to change data instead of writing raw SQL. Fun part is, migration breaking changes to that code don’t show up until somebody needs to run migrations from scratch.
The “raw SQL” idea is intriguing and I’ve never heard it before. I like it a lot, as someone who has previously been in camp “no data migrations, they will have sneaky app-breaking consequences down the line.”
I’m moving this topic to its own thread where it can get proper attention.
I much prefer writing data munging scripts in Ruby though. Writing non-trivial UPDATEs in SQL is intimidating. Defining a new ActiveRecord class in the migration file helps expose the power of AR without something breaking down the line, but testing the logic in that class is tricky.
I personally make a practice of deleting my test database from time to time, and then running migrations up from the start. It tends to catch these sorts of things very quickly. Our CI also drops the test database before each run, and so “works on my Mac” isn’t a good excuse for anyone.
We mix data and sql statements in migrations all the time at Basecamp. This is what migrations were designed to do! They’re intended to be moments in time. Not to be replayable forever. The stable unit is schema.rb (or structure.sql).
Then I am sorry to say this but my experience is exact opposite of those intentions. schema.rb was never stable unit for us and during lifetime of our applications we had several instances when we needed to regenerate it from scratch. And we still run into minor conflicts in commits with schema.rb
So keeping migrations free from AR operations is a safe strategy in a long term overall.
No one knows when Rails will switch default id datatype again or if we will decide to switch to another DB. Having clean migrations path will help in those cases a lot if they happen.
How do you manage data migrations that do not fill on those constrains?
One case I had recently:
I had a field with 2 possible values per user that for different reasons we had to transform on a table that had around 3-4 rows per group of users.
That transformation was not trivial and not only needed to go through some complicated business logic, it also needed to emit some notification events in the process for things like trazability, etc…
In my experience runnning those within schema migrations becomes problematic pretty soon as your data grows and in some cases it might even become a danger depending on your deploy strategy.
The solutions I’ve seen so far for these cases (and on different companies) where just ad-hoc scripts that somebody ran like bundle exec ruby some_script.rb on the production server or in the best of the cases a rake task.
Don’t get me wrong, I’m not saying this is wrong either. But I wonder which other tools we could have around processes like this that helped us make it easier… I’m also not sure it fits into Rails.
Some of the things I’m talking to and that I think a better data migration tooling would provide:
Specific Instrumentation: Having things like DataDog or NewRelic just hooking into the instrumentation support for Rails is amazing, saves time and headaches and makes issues easier to solve. I miss having this without having to solve it on my own.
User input capabilities. I’ve found in more than one case that I wanted to do things like: If we reach this case I would like to be able to stop, and ask some domain expert before continuing.
Notifications. Summaries with the actions that have been done, the tame it has taken or the human interventions.
Versions. Kind of what you get with rails db:migrate:status but for data migrations (there are some gems that do this already I think.
Good output by default. Knowing how long the process will take or how much of the things I need to migrate have already been migrated.
Testability: I’d love to have a super simple way to write specific fixtures that I could use to write tests for these cases (or to test them manually)
Disposability: Many of those are just there to be run once and after some time they can be disposed and that should also easily dispose of their tests and fixtures.
Again, I’m sure that for many, many cases nothing will beat a well written SQL query. But for the more complicated cases, the ones in which you need to scale up your solution. How do others manage beyond schema migrations? Does this not exist on rails world because it is not sucha common itch? Does it exist but I’m not aware of it? Would somebody else like to join if I decided to build it?
I’m also not sure that something like this should be supported by Rails but the topic was here and I couldn’t resist asking just in case.
There are lots of examples there and even a guide online about all the special rules they have for migrations: Migration Style Guide | GitLab. We certainly do not practice all of the GitLab practices at Discourse. That said I think it is a very interesting read.
Personally, I just prefer to write SQL, it has many very powerful constructs you can use to allow you extreme levels of power
I think this can certainly work for most Rails applications out there, sadly for large open source apps like Discourse / GitLab this is not really an option. Often people will hold off upgrading for a year for … reasons. We love that you can just run migrations on a year old database and update to latest.
We stay safe by rerunning all our migrations every time we run our tests in CE. We can always go from 0 to a fully populated DB.
This is a tricky tricky problem that we hit regularly enough. We have also evolved a lot in our thinking over the years.
We have a system called OnceOff jobs that are guaranteed to fire once. We tend to place complicated stuff like that in this system.
These jobs fire post deploy, and only fire once. In general we tend to only put “best effort” kind of data migrations there. For example, the OnceOff I linked above could fail, and the end users can still recover if this happens by updating settings in the admin UI.
The OnceOff system was built before we introduced post_migrate directory. Many things we put there in the past, we should probably move to post_migrate.
post_migrate is a common system both GitLab and Discourse use. Our deployment system works like so:
Deploy new code to all machines
Run post migrations (if this fails deploy failed)
This allows us to do things like drop columns from a table safely, cause we can black list them in Active Record prior to removal.
I mentioned this system to @dhh and he is very open to including this pattern in Rails, so we will look at making a PR to Rails in the next few months. Unless someone here beats us to it
Anyway … so we have OnceOff system for “best effort jobs”, we also have rake tasks that people run by hand for certain types of migrations (move uploads from local to s3 for example)
The OnceOff system though is not a system I would recommend for Rails quite yet, there are just too many cracks. When a job fails … how do you remediate. For our own hosting it is simple enough, but self hosters … not so easy.
We also use migrations as throwaway, point-in-time code to transition our database structure or data from one format to another. The schema is the only source of truth (although it goes out of sync on occasion).
There are a few gotchas for using models within a migration. There are other gotachas related to dropping a column when the model still references it, database locks can be a real killer as well. Only recently, we tried to drop a column, the operation itself took 0.002s or something, but it required a lock. We also had a backup running pg_dump at the same time, so the lock was blocking pretty much everything…
The strong migrations gem does a great job in my opinion, and helped us introduce lock timeouts for example, as well as provide guidelines on how to avoid removing columns when the model still expects them to be there.
I’m a bit unsure whether this gem is suitable for beginners, because it can make the whole process intimidating. At the same time, developers should perhaps be intimidated when facing a potential foot gun. I would say the gem provides great training wheels though, and perhaps this is something that should be included in Rails? or installed by default?
I’m not involved in any way with the gem, but I wouldn’t say it’s advocating it exactly. It’s more just guiding you what to do, given the rails migration constraints. Given that there’s no post deploy migrations in rails, this seems like a sensible thing to do, rather than reinventing post deploy migrations as well?
Yes, I completely agree. Doing 2 deploys (and typically 2 commits, 2 PRs, 2 CI runs etc) is quite tedious for something relatively straightforward as removing a column. Having both pre-deploy and post-deploy migrations would be really neat.
The issues that I mentioned are more around operational process than tooling. I don’t run into them frequently enough to have developed a strong preference around what tooling should do.
I use the following pattern for data munging scripts which are intended to be run by an experienced developer from the console:
# returns the set of records that need to be changed
# returns the set of records that need to be changed with changes applied to them
# iterates through the set of changed records and save them, returning a set of save results
# calls .load and stores a summary before changes
# calls .patch and stores save results
# calls .load again to store a summary after changes
# returns before / after summaries and save results