[Feature Proposal] Run `in_batches.destroy_all`/`delete_all` without ordering

We recently encountered a case where in_batches.destroy_all unnecessarily led to a slow query.

Use case: we have a table with millions of rows and an indexed created_at column. We want to remove records older than some threshold. The job runs periodically, so most of the old rows have already been removed, and each run deletes at most a few hundred rows. But because in_batches orders by id, PostgreSQL incorrectly decided to iterate over the primary key index (essentially doing the sorting first) and then recheck the condition (created_at < some value). This amounts to a full index scan.

We expected it to use the index on created_at instead and then sort the results.
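For context, the cleanup job looks roughly like the sketch below (Event and the 30-day threshold are illustrative names, not the real ones):

Event.where("created_at < ?", 30.days.ago).in_batches(of: 1000).destroy_all

Each batch starts with a query roughly like SELECT "events"."id" FROM "events" WHERE created_at < '…' ORDER BY "events"."id" ASC LIMIT 1000, and it is that ORDER BY on the primary key that pushes PostgreSQL towards scanning the primary key index instead of using the created_at index.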

So, as you can see, the problem is with the ordering. I assume that most projects don’t care in which order records are deleted, but they can suffer from similar problems. By avoiding the ordering, we can use simpler queries and get faster execution plans.

What do people think about making in_batches.destroy_all ignore the order (by default or via some global/local configuration)?

We currently have a simple custom helper (below) and a RuboCop rule to enforce its usage in the project.

class ApplicationRecord < ActiveRecord::Base
  # Destroys every record in the current scope in batches of `of` records,
  # without any ORDER BY, and returns how many records were destroyed.
  def self.destroy_all_in_batches(of: 1000)
    relation = unscope(:order).limit(of)
    count = 0
    loop do
      records = relation.destroy_all
      break if records.empty?

      count += records.size
    end
    count
  end
end
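For illustration, since class methods are delegated to relations, the helper can be called on a scoped relation like this (the model and retention window are made-up names, not from our codebase):

Event.where("created_at < ?", 30.days.ago).destroy_all_in_batches(of: 500)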

@byroot, as always I’m interested in your input on this and whether Shopify has encountered similar problems.


Not an issue we encountered, I think.

> What do people think about making in_batches.destroy_all ignore the order (by default or via some global/local configuration)?

I’m not sure this is the best solution. In most cases you don’t care about the order, but one might, and it seems a bit too much of an ad hoc solution.

I’m at a conference, so not a whole lot of time to think about this issue, but intuitively, I don’t see why in_batches couldn’t allow for a different order, even if it means always appending the primary key as the last ordering element.

In your case I think your query could be ORDER BY created_at ASC, id ASC? Would that make Postgres do the right thing?

> In your case I think your query could be ORDER BY created_at ASC, id ASC? Would that make Postgres do the right thing?

It would do the trick, but in_batches currently iterates only by the primary key and only allows ordering on that column (or columns).

But even if we could specify the order, we would need to remember to use the right ordering every time, instead of just configuring somewhere that we don’t care.

But yeah, I understand that this use case might not be well suited for (or worth including in) the framework and is better implemented as a custom solution. I just wanted to start a conversation to see if people have had similar issues.

If you have a created_at date, which presumably doesn’t change, then you could just find the ID of the record that aligns with 6 months ago and delete anything smaller?

The column might be something like last_used_at, which is not static, so this will not work.

in_batches sorts on the primary key to have a stable sort order. Granted, it assumes a numeric auto-incrementing id. But for any other column, sorting can cause rows to be skipped if new values are inserted while iteration is occurring.

For example, if we’re sorting on name, and get to “Nate” and then “Jean” is inserted, Jean will be skipped.

If we were ordering by ID, and Nate is id 5 and Jean is id 10000000, we’ll get to Jean eventually.
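A rough sketch of that skipping scenario in relation terms (the User model and the manual cursor handling here are illustrative, not how in_batches is implemented internally):

# Batch 1: ordered by name, the last value seen is "Nate".
User.order(:name).limit(1000)
# A row named "Jean" is inserted while iteration is in progress.
# Batch 2: only looks past the cursor, so the new "Jean" row is never
# returned, because "Jean" < "Nate".
User.where("name > ?", "Nate").order(:name).limit(1000)
# With an auto-incrementing id, new rows always get ids above the cursor,
# so iterating by id still reaches them eventually.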

Cursors are probably the only way to get a stable sort on a non-primary-key column in Postgres. I’m not sure about MySQL.

FWIW, UUIDv4 ids basically make in_batches unpredictable, for the same reason. Records are inserted all over the place.

You are absolutely right. But that’s not always a property you need, and in the given example the query was created_at < ?, so I assumed it wasn’t a concern here.

Yeah. FWIW, I’ve written code just like Dima’s before.

When you’ve got a specific set of records that you want to delete without doing too many in one transaction, you just loop with a limit until the delete query deletes nothing.

So perhaps delete_all_in_batches / destroy_all_in_batches is exactly what we need, but it just shouldn’t rely on in_batches?
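A minimal sketch of that loop as a hypothetical delete_all_in_batches class method (the name and the pluck-then-delete shape are mine; plucking ids first avoids relying on delete_all supporting a limit on the adapter in use):

class ApplicationRecord < ActiveRecord::Base
  # Hypothetical sketch: deletes matching rows in chunks of `of`, skipping
  # callbacks, and loops until a pass finds nothing left to delete.
  def self.delete_all_in_batches(of: 1000)
    total = 0
    loop do
      ids = unscope(:order).limit(of).pluck(primary_key)
      break if ids.empty?

      total += where(primary_key => ids).delete_all
    end
    total
  end
end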


One of my clients has a Postgres DB with UUIDv4 PKs (switching to v7 soon, but that doesn’t help with existing records), and we have similar code to delete and batch-process accounts by created_at. We also use the ID column in some cases to ensure we don’t process records twice, because even though timestamps have millisecond precision, some days have many thousands of new accounts.

e.g. Account.where('created_at > :last_created_at OR (created_at = :last_created_at AND id > :last_account_id)', ...)
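Roughly, the iteration looks like the sketch below (the starting sentinels, batch size, and process_batch call are made up; the where clause is the one from the example):

# Keyset-style iteration over accounts by (created_at, id).
last_created_at = Time.at(0)
last_account_id = "00000000-0000-0000-0000-000000000000" # lowest possible UUID
loop do
  batch = Account
    .where("created_at > :last_created_at OR (created_at = :last_created_at AND id > :last_account_id)",
           last_created_at: last_created_at, last_account_id: last_account_id)
    .order(:created_at, :id)
    .limit(1000)
    .to_a
  break if batch.empty?

  process_batch(batch) # hypothetical processing step
  last_created_at, last_account_id = batch.last.created_at, batch.last.id
end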

In a lot of our cases deleting/processing just based on created_at would be sufficient though. If that was built in we’d most likely use it.