[Feature Proposal] Run `in_batches.destroy_all`/`delete_all` without ordering

We recently encountered a case where in_batches.destroy_all unnecessarily led to a slow query.

Use case: we have a table with millions of rows and an indexed created_at column. We want to remove records older than some threshold. The job runs periodically, so most of the old rows have already been removed, and each run deletes at most a few hundred rows. But because in_batches orders by id, PostgreSQL incorrectly decided to iterate over the primary key index (essentially doing the sorting first) and then recheck the condition (created_at < some value). This amounts to a full index scan.

We expected it to use the index on created_at instead and then sort the results.
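For context, the cleanup job looks roughly like the sketch below (Event and the 30-day threshold are illustrative names, not the real ones):

Event.where("created_at < ?", 30.days.ago).in_batches(of: 1000).destroy_all

Each batch starts with a query roughly like SELECT "events"."id" FROM "events" WHERE created_at < '…' ORDER BY "events"."id" ASC LIMIT 1000, and it is that ORDER BY on the primary key that pushes PostgreSQL towards scanning the primary key index instead of using the created_at index.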

So, as you can see, the problem is with the ordering. I assume that most projects don’t care in which order records are deleted, but they can suffer from similar problems. By avoiding the ordering, we can use simpler queries and get faster execution plans.

What do people think about making in_batches.destroy_all ignore the order (by default or via some global/local configuration)?

We currently have a simple custom helper (below) and a RuboCop rule to enforce its usage in the project.

class ApplicationRecord < ActiveRecord::Base
  # Destroys every record in the current scope in batches of `of` records,
  # without any ORDER BY, and returns how many records were destroyed.
  def self.destroy_all_in_batches(of: 1000)
    relation = unscope(:order).limit(of)
    count = 0
    loop do
      records = relation.destroy_all
      break if records.empty?

      count += records.size
    end
    count
  end
end
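For illustration, since class methods are delegated to relations, the helper can be called on a scoped relation like this (the model and retention window are made-up names, not from our codebase):

Event.where("created_at < ?", 30.days.ago).destroy_all_in_batches(of: 500)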

@byroot, as always I’m interested in your input on this and whether Shopify has encountered similar problems.


Not an issue we encountered, I think.

> What do people think about making in_batches.destroy_all ignore the order (by default or via some global/local configuration)?

I’m not sure this is the best solution. In most cases you don’t care about the order, but one might, and it seems a bit too much of an ad hoc solution.

I’m at a conference, so not a whole lot of time to think about this issue, but intuitively, I don’t see why in_batches couldn’t allow for a different order, even if it means always appending the primary key as the last ordering element.

In your case I think your query could be ORDER BY created_at ASC, id ASC? Would that make Postgres do the right thing?

> In your case I think your query could be ORDER BY created_at ASC, id ASC? Would that make Postgres do the right thing?

It would do the trick, but in_batches currently iterates only by the primary key and only allows ordering on that column (or columns).

But even if we could specify the order, we would need to remember to use the right ordering every time, instead of just configuring somewhere that we don’t care.

But yeah, I understand that this use case might not be well suited for (or worth including in) the framework and is better implemented as a custom solution. I just wanted to start a conversation to see if people have had similar issues.

If you have a created_at date, which presumably doesn’t change, then you could just find the ID of the record that aligns with 6 months ago and delete anything smaller?

The column might be something like last_used_at, which is not static, so this will not work.

in_batches sorts on the primary key to have a stable sort order. Granted, it assumes a numeric auto-incrementing id. But for any other column, sorting can cause rows to be skipped if new values are inserted while iteration is occurring.

For example, if we’re sorting on name, and get to “Nate” and then “Jean” is inserted, Jean will be skipped.

If we were ordering by ID, and Nate is id 5 and Jean is id 10000000, we’ll get to Jean eventually.
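A rough sketch of that skipping scenario in relation terms (the User model and the manual cursor handling here are illustrative, not how in_batches is implemented internally):

# Batch 1: ordered by name, the last value seen is "Nate".
User.order(:name).limit(1000)
# A row named "Jean" is inserted while iteration is in progress.
# Batch 2: only looks past the cursor, so the new "Jean" row is never
# returned, because "Jean" < "Nate".
User.where("name > ?", "Nate").order(:name).limit(1000)
# With an auto-incrementing id, new rows always get ids above the cursor,
# so iterating by id still reaches them eventually.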

Cursors are probably the only way to get a stable sort on a non-primary-key column in Postgres. I’m not sure about MySQL.

FWIW, UUIDv4 ids basically make in_batches unpredictable, for the same reason. Records are inserted all over the place.

You are absolutely right. But that’s not always a property you need, and in the given example the query was created_at < ?, so I assumed it wasn’t a concern here.

Yeah. FWIW, I’ve written code just like Dima’s before.

When you’ve got a specific set of records that you want to delete without doing too many in one transaction, you just loop with a limit until the delete query deletes nothing.

So perhaps delete_all_in_batches / destroy_all_in_batches is exactly what we need, but it just shouldn’t rely on in_batches?
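A minimal sketch of that loop as a hypothetical delete_all_in_batches class method (the name and the pluck-then-delete shape are mine; plucking ids first avoids relying on delete_all supporting a limit on the adapter in use):

class ApplicationRecord < ActiveRecord::Base
  # Hypothetical sketch: deletes matching rows in chunks of `of`, skipping
  # callbacks, and loops until a pass finds nothing left to delete.
  def self.delete_all_in_batches(of: 1000)
    total = 0
    loop do
      ids = unscope(:order).limit(of).pluck(primary_key)
      break if ids.empty?

      total += where(primary_key => ids).delete_all
    end
    total
  end
end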


One of my clients has a Postgres DB with UUIDv4 PKs (switching to v7 soon, but that doesn’t help with existing records), and we have similar code to delete and batch-process accounts by created_at. We also use the ID column in some cases to ensure we don’t process records twice, because even though timestamps have millisecond precision, some days have many thousands of new accounts.

e.g. Account.where('created_at > :last_created_at OR (created_at = :last_created_at AND id > :last_account_id)', ...)
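Roughly, the iteration looks like the sketch below (the starting sentinels, batch size, and process_batch call are made up; the where clause is the one from the example):

# Keyset-style iteration over accounts by (created_at, id).
last_created_at = Time.at(0)
last_account_id = "00000000-0000-0000-0000-000000000000" # lowest possible UUID
loop do
  batch = Account
    .where("created_at > :last_created_at OR (created_at = :last_created_at AND id > :last_account_id)",
           last_created_at: last_created_at, last_account_id: last_account_id)
    .order(:created_at, :id)
    .limit(1000)
    .to_a
  break if batch.empty?

  process_batch(batch) # hypothetical processing step
  last_created_at, last_account_id = batch.last.created_at, batch.last.id
end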

In a lot of our cases deleting/processing just based on created_at would be sufficient though. If that was built in we’d most likely use it.