We recently encountered a case where in_batches.destroy_all unnecessarily led to a slow query.
Use case: we have a table with millions of rows and an indexed created_at column. We want to remove records older than some threshold. This runs periodically, so most of the old rows are already gone, and each run removes at most a few hundred rows. But because in_batches orders by id, PostgreSQL incorrectly decided to iterate over the primary key index (essentially sorting first) and then recheck the condition (created_at < some value). This is basically a full index scan.
We expected it to use the index on created_at instead and then sort the results.
So, as you can see, the problem is the ordering. I assume most projects don't care in which order records are deleted, but they can suffer from similar problems. By avoiding the ordering, we can use simpler queries and get faster execution plans.
What do people think if we make in_batches.destroy_all ignore (by default or by some global/local configuration) the order?
We currently have a simple custom helper and a RuboCop rule to enforce its usage in the project.
class ApplicationRecord < ActiveRecord::Base
  def self.destroy_all_in_batches(of: 1000)
    relation = unscope(:order).limit(of)
    count = 0
    loop do
      records = relation.destroy_all
      break if records.empty?
      count += records.size
    end
    count
  end
end
@byroot as always, interested in your input about this and whether Shopify encountered similar problems.
What do people think if we make in_batches.destroy_all ignore (by default or by some global/local configuration) the order?
I’m not sure this is the best solution. In most cases you don’t care about the order, but one might, and it seems a bit too much of an ad hoc solution.
I’m at a conference, so not a whole lot of time to think about this issue, but intuitively, I don’t see why in_batches wouldn’t allow for different orders, even if it means always appending the PK as the last ordering element.
In your case I think your query could be ORDER BY created_at ASC, id ASC? Would that make Postgres do the right thing?
In your case I think your query could be ORDER BY created_at ASC, id ASC? Would that make Postgres do the right thing?
It would do the trick, but in_batches currently iterates only by the primary key and only allows ordering on that column (or columns).
But even if we allow specifying the order, we'd need to remember to use the right ordering every time, instead of just configuring somewhere that we don't care.
But yeah, I understand that this use case might not be well suited for (or worth including in) the framework and is better implemented as a custom solution. Just wanted to start a conversation to see if people have had similar issues.
If you have a created_at date which presumably doesn’t change, then you could just find the ID of the record that aligns with six months ago and delete anything smaller?
in_batches sorts on the primary key to have a stable sort order. Granted, it assumes a numeric auto-incrementing id. But for any other column, sorting can cause rows to be skipped if new values are inserted while iteration is occurring.
For example, if we’re sorting on name, and get to “Nate” and then “Jean” is inserted, Jean will be skipped.
If we were ordering by ID, and Nate is id 5 and Jean is id 10000000, we’ll get to Jean eventually.
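That skipping behavior can be sketched in plain Ruby against a toy in-memory "table" (no ActiveRecord; all rows, names, and ids here are made up):

```ruby
# Toy model of batched iteration ordered by a non-unique, non-monotonic
# column (name), resuming from the last value seen. A row inserted
# "behind" the cursor is never visited.
Row = Struct.new(:id, :name)

def each_batch_by_name(table, batch_size: 2)
  cursor = ""
  loop do
    batch = table.select { |r| r.name > cursor }
                 .sort_by(&:name)
                 .first(batch_size)
    break if batch.empty?
    batch.each { |r| yield r }
    cursor = batch.last.name
  end
end

table = [Row.new(1, "Alice"), Row.new(2, "Bob"), Row.new(5, "Nate")]
seen = []

each_batch_by_name(table) do |row|
  seen << row.name
  # Concurrent insert while the cursor is already past "J":
  table << Row.new(10_000_000, "Jean") if row.name == "Nate"
end

p seen # => ["Alice", "Bob", "Nate"] -- "Jean" is never visited
```

With an auto-incrementing PK cursor the late insert would still be reached eventually; with a cursor on name it falls behind the cursor and is silently skipped.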
Cursors are probably the only way to have a stable sort of a non-primary key in Postgres. I’m not sure about MySQL.
FWIW, UUIDv4 ids basically make in_batches unpredictable, for the same reason. Records are inserted all over the place.
You are absolutely right. But that’s not always a property you need, and in the given example, the query was created_at < ? so I assumed it wasn’t a concern here.
Yeah. FWIW, I’ve written code just like Dima’s before.
When you’ve got a specific set of records that you want to delete without doing too many in 1 transaction, you just do a loop with a limit until the delete query deletes nothing.
So perhaps delete_all_in_batches / destroy_all_in_batches is exactly what we need, but it just shouldn’t rely on in_batches?
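The loop pattern described above, sketched against a plain array standing in for the table (no ActiveRecord; data and helper name are hypothetical):

```ruby
# "Delete with a LIMIT until the delete removes nothing": no ordering,
# no large IN lists; each pass removes at most `of` matching rows.
def delete_matching_in_batches(table, of: 3, &predicate)
  total = 0
  loop do
    batch = table.select(&predicate).first(of) # stands in for LIMIT `of`
    break if batch.empty?
    batch.each { |row| table.delete(row) }
    total += batch.size
  end
  total
end

table = (1..10).to_a
deleted = delete_matching_in_batches(table, of: 3) { |n| n.odd? }

p deleted # => 5
p table   # => [2, 4, 6, 8, 10]
```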
One of my clients has a Postgres db with UUIDv4 PKs (switching to v7 soon but doesn’t help w/ existing records) and we have similar code to delete and batch-process accounts by created_at. We use the ID column too in some cases to ensure we don’t process records twice, because even though timestamps have millisecond precision some days have many thousands of new accounts.
e.g. Account.where('created_at > :last_created_at OR (created_at = :last_created_at AND id > :last_account_id)', ...)
In a lot of our cases deleting/processing just based on created_at would be sufficient though. If that was built in we’d most likely use it.
class ApplicationRecord < ActiveRecord::Base
  # Destroys records in batches.
  #
  # This is an alternative to `in_batches.destroy_all`. The problem with that method is
  # that Rails batches use ordering and large IN queries. When deleting, the ordering
  # almost always does not matter, but enforcing it can lead to bad execution plans
  # in PostgreSQL and slow queries.
  #
  # @param of [Integer] Batch size (default: 1000)
  #
  # @example
  #   User.where(has_widget: false).destroy_all_in_batches
  #
  # @return [Integer] The number of destroyed records
  #
  def self.destroy_all_in_batches(of: 1000)
    relation = unscope(:order).limit(of)
    count = 0
    loop do
      records = relation.destroy_all.select(&:destroyed?)
      break if records.empty?
      count += records.size
    end
    count
  end

  # Deletes records in batches.
  #
  # This is an alternative to `in_batches.delete_all`.
  # @see .destroy_all_in_batches
  #
  # @param of [Integer] Batch size (default: 1000)
  #
  # @return [Integer] The number of deleted records
  #
  def self.delete_all_in_batches(of: 1000)
    relation = unscope(:order).limit(of)
    count = 0
    loop do
      deleted_count = relation.delete_all
      break if deleted_count == 0
      count += deleted_count
    end
    count
  end
end
# frozen_string_literal: true

module RuboCop
  module Cop
    module Custom
      # Detects usages of `in_batches.destroy_all` or `in_batches.delete_all`.
      #
      # Example
      #   # bad
      #   User.where(has_widget: false).in_batches.destroy_all
      #   User.where(has_widget: false).in_batches.delete_all
      #   User.where(has_widget: false).in_batches(of: 1000).destroy_all
      #
      #   # good
      #   User.where(has_widget: false).destroy_all_in_batches
      #   User.where(has_widget: false).delete_all_in_batches
      #   User.where(has_widget: false).destroy_all_in_batches(of: 1000)
      #
      #   # good (custom configuration)
      #   User.where(has_widget: false).in_batches(start: id).destroy_all
      #
      class DestroyAllInBatches < Base
        extend AutoCorrector
        include RangeHelp

        MSG = 'Use `%{good_method}` instead of `in_batches.%{bad_method}`.'
        RESTRICT_ON_SEND = %i[destroy_all delete_all].freeze

        def_node_matcher :deleting_in_batches?, <<~PATTERN
          (send
            (send _ :in_batches (hash (pair (sym :of) $_)) ?)
            ${:destroy_all :delete_all})
        PATTERN

        def on_send(node)
          deleting_in_batches?(node) do |of_node, bad_method|
            of_node = of_node.first
            offense_range = range_between(node.receiver.loc.selector.begin_pos, node.loc.expression.end_pos)
            good_method =
              if of_node
                "#{bad_method}_in_batches(of: #{of_node.source})"
              else
                "#{bad_method}_in_batches"
              end

            message = format(MSG, good_method: good_method, bad_method: bad_method)
            add_offense(offense_range, message: message) do |corrector|
              corrector.replace(offense_range, good_method)
            end
          end
        end
      end
    end
  end
end
# frozen_string_literal: true

require 'rubocop'
require 'rubocop/rspec/support'
require './lib/custom_cops/destroy_all_in_batches'

RSpec.describe RuboCop::Cop::Custom::DestroyAllInBatches, :config do
  include RuboCop::RSpec::ExpectOffense

  it 'registers an offense and corrects when using `in_batches.destroy_all`' do
    expect_offense(<<~RUBY)
      User.in_batches.destroy_all
           ^^^^^^^^^^^^^^^^^^^^^^ Use `destroy_all_in_batches` instead of `in_batches.destroy_all`.
    RUBY

    expect_correction(<<~RUBY)
      User.destroy_all_in_batches
    RUBY
  end

  it 'registers an offense and corrects when using `in_batches.delete_all`' do
    expect_offense(<<~RUBY)
      User.in_batches.delete_all
           ^^^^^^^^^^^^^^^^^^^^^ Use `delete_all_in_batches` instead of `in_batches.delete_all`.
    RUBY

    expect_correction(<<~RUBY)
      User.delete_all_in_batches
    RUBY
  end

  it 'registers an offense and corrects when using `in_batches.destroy_all` with `:of`' do
    expect_offense(<<~RUBY)
      User.in_batches(of: 1000).destroy_all
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Use `destroy_all_in_batches(of: 1000)` instead of `in_batches.destroy_all`.
    RUBY

    expect_correction(<<~RUBY)
      User.destroy_all_in_batches(of: 1000)
    RUBY
  end

  it 'does not register an offense when using `in_batches.destroy_all` with extra configurations' do
    expect_no_offenses(<<~RUBY)
      User.in_batches(start: id, of: 1000).destroy_all
    RUBY
  end

  it 'does not register an offense when using `destroy_all_in_batches`' do
    expect_no_offenses(<<~RUBY)
      User.destroy_all_in_batches
    RUBY
  end
end
Just wondering: I see your cop still allows the use of regular delete_all/destroy_all, i.e., without any batching being performed. Are there any cases in which batching would not be desirable? I’ve been wondering about that ever since Rails introduced in_batches.
If I know there will be only a few records, I just use delete_all/destroy_all without complicating the code with batches. But, yeah, batching could be used everywhere instead - I can’t think of a case where it would be undesirable.
Yeah, true: if you know there are only a few records, batching is not necessary and might add (slight) overhead. However, often you don’t know - or the number of records can grow silently over time, and suddenly you find your deletes/destroys becoming unacceptably slow. I wonder whether it wouldn’t be a better default to always use batching for deletes/destroys, with the option to explicitly disable it if so desired.