ActiveRecord find_in_batches to take option[:order]

Maybe I am missing something obvious, but I do not understand why the
method find_in_batches cannot take the :order option.

I have created a ticket (http://is.gd/lUwX) with a patch that removes
the corresponding error.

Thanks.

If, for example, you order by ‘Name’, and then while batch processing
you insert a new record with the name “Angilly”, it is possible
(probable, actually) that your batch processing will miss records. IDs
are not very likely to change, and the auto-incrementing makes it
much harder to miss records. This is also the reason that it
requires integer IDs (and not UUIDs).
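That scenario can be sketched in plain Ruby (no database or Rails involved; the records and the mid-run insert below are fabricated for illustration):

```ruby
# Keyset batching ("WHERE key > last ORDER BY key LIMIT n"), simulated over
# an in-memory array. Ordering by a mutable, non-monotonic column like name
# misses a row inserted mid-run; ordering by an auto-incrementing id does not.
Record = Struct.new(:id, :name)

def seed
  [Record.new(1, "Baker"), Record.new(2, "Chang"),
   Record.new(3, "Davis"), Record.new(4, "Evans")]
end

def process_in_batches(rows, key, size = 2)
  seen = []
  last = nil
  loop do
    batch = rows.select { |r| last.nil? || r[key] > last }
                .min_by(size) { |r| r[key] }
    break if batch.empty?
    seen.concat(batch)
    last = batch.last[key]
    # Simulate a concurrent insert after the first batch: "Angilly" sorts
    # before the name cursor, but its id (5) is above the id cursor.
    rows << Record.new(5, "Angilly") if rows.size == 4
  end
  seen
end

by_name = process_in_batches(seed, :name)
by_id   = process_in_batches(seed, :id)

by_name.map(&:name)  # => ["Baker", "Chang", "Davis", "Evans"] -- Angilly missed
by_id.map(&:name)    # => ["Baker", "Chang", "Davis", "Evans", "Angilly"]
```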

Cheers,
Ryan

I created a patch to allow :order clauses on this topic
http://groups.google.com/group/rubyonrails-core/browse_thread/thread/66df9e4241fdc0a7

Ryan Angilly wrote:

If, for example, you order by ‘Name’, and then while batch
processing
you insert a new record with the name “Angilly”, it is possible
(probable actually) that your batch processing will miss records. ID’s
are not very likely to change, and the auto-incrementing means it’s
much more difficult to miss records. This is also the reason that it
requires integer ID’s (and not UUIDs).
I haven’t looked closely at the code, but wouldn’t it be possible to
just select all the IDs in one query with whatever order criteria were
desired, then use those IDs to select the full records in batches? It
would be something like how eager loading does multiple selects in some
cases.
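If I follow, the two passes might look like this in plain Ruby (the hash stands in for a table and each_slice for the batched IN query; nothing here is actual Rails API):

```ruby
# Pass 1 selects only ids in the desired order; pass 2 hydrates full
# records one batch at a time, preserving that order. All data made up.
Record = Struct.new(:id, :name)

table = {
  1 => Record.new(1, "Chang"),
  2 => Record.new(2, "Baker"),
  3 => Record.new(3, "Evans"),
  4 => Record.new(4, "Davis"),
}

# Pass 1: a single query for the ordered ids only.
ordered_ids = table.values.sort_by(&:name).map(&:id)  # => [2, 1, 4, 3]

# Pass 2: load full records batch by batch, keeping that order.
processed = []
ordered_ids.each_slice(2) do |slice|
  batch = slice.map { |id| table.fetch(id) }  # "SELECT * ... WHERE id IN (?)"
  processed.concat(batch)
end

processed.map(&:name)  # => ["Baker", "Chang", "Davis", "Evans"]
```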

Jack

Yeah, that’s possible. But then you could run into the same memory bloat problem: just a million records with UUID pks gives you over 35 MB of data to store in memory (not including any overhead from Ruby’s Array and String classes). Maybe that’s enough of an edge case that it’s worth ignoring for now, but then I still come back to the original question: can you think of a case where you really need to order it?

If you need to process a million records, you may want to process them
in a specific order, like most recently updated first, since those may
be the most important and it may take a while to finish the batch.
Processing them in primary-key order may well mean that the most
important records are processed last.

The patch I submitted does just what Jack suggests. While you will end
up with about 35 MB for a million records, that memory use stays
constant and will not bloat as the batch is processed, even if you load
associations. If 35 MB is too much for an environment with limited
resources, you could break the id query up into smaller batches using
:limit and :offset, like any other ActiveRecord find.
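For example, the id prefetch itself could be chunked (a sketch with fabricated ids; the SQL in the comment only shows the shape of the query, not the patch's actual code):

```ruby
# Walk an ordered id column in fixed-size windows via :limit/:offset,
# so no more than `limit` ids are fetched per query.
all_ids = (1..10).to_a  # stands in for the ordered id column

limit  = 4
offset = 0
chunks = []
loop do
  chunk = all_ids[offset, limit] || []  # "SELECT id ... LIMIT ? OFFSET ?"
  break if chunk.empty?
  chunks << chunk
  offset += limit
end

chunks  # => [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]
```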

Ordering is forced to the “id” column because:

  1. MySQL iterates over huge datasets much more efficiently with a “WHERE id > ?” clause than with LIMIT and OFFSET;
  2. you don’t need ordering – every single record gets yielded.
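The contrast in point 1 can be sketched like so (simulated over a sorted in-memory array; the SQL strings in the comments are illustrative only):

```ruby
# With OFFSET, the database must walk past all skipped rows on every batch;
# with "WHERE id > ?" it seeks straight to the cursor. Both visit the same
# rows, which is what makes keyset pagination a safe drop-in here.
ids = (1..9).to_a
batch_size = 3

# OFFSET style, e.g. "SELECT * FROM users ORDER BY id LIMIT 3 OFFSET 6"
offset_batches = ids.each_slice(batch_size).to_a

# Keyset style, e.g. "SELECT * FROM users WHERE id > 6 ORDER BY id LIMIT 3"
keyset_batches = []
last_id = 0
loop do
  batch = ids.select { |id| id > last_id }.first(batch_size)
  break if batch.empty?
  keyset_batches << batch
  last_id = batch.last
end

keyset_batches == offset_batches  # => true: same rows, cheaper seeks
```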

I don't think the implementation of core functionality should be based
on how MySQL handles something.

:limit, :offset, and :order are optional parameters that I'm exposing
in case the developer needs them. If you want to iterate over the
entire data set, you can certainly do so, and these options will not
be sent to the database. In fact, in some circumstances the query
will end up more efficient: if you construct a horrendously complex
query that the optimizer doesn't come up with a good plan for, the
code in my patch will execute the horrible query once and afterward do
a simple "WHERE id IN (?, ?, ...)" query for every batch. Using "id