ActiveRecord find_in_batches to take option[:order]

Maybe I am missing something obvious, but I do not understand why find_in_batches cannot take option[:order].

I have created ticket #2135, "ActiveRecord::find_in_batches should take option[:order]", with a patch that removes the error that is currently raised.

Thanks.

If, for example, you order by ‘Name’, and then while batch processing you insert a new record with the name “Angilly”, it is possible (probable, actually) that your batch processing will miss records. IDs are not very likely to change, and the auto-incrementing means it’s much more difficult to miss records. This is also the reason that it requires integer IDs (and not UUIDs).
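
To make that concrete, here is a rough sketch of the kind of loop an :order option would imply, keyed off the ordering column instead of the id (Person, its name column, and do_something_with are made-up placeholders):

    # Hand-rolled batching keyed off a mutable, non-monotonic column.
    # Don't do this; it's here to show where records fall through.
    last_name = ""
    loop do
      batch = Person.all(:conditions => ["name > ?", last_name],
                         :order      => "name ASC",
                         :limit      => 1000)
      break if batch.empty?
      batch.each { |person| do_something_with(person) } # placeholder
      last_name = batch.last.name
      # If a row named "Angilly" is inserted once last_name has moved past
      # the B's, "name > ?" never matches it again and it is silently skipped.
      # Batching keyed off the auto-incrementing id doesn't have that hole,
      # because a new row always gets an id greater than any cursor value.
    end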

Cheers, Ryan

I created a patch to allow :order clauses in this topic: http://groups.google.com/group/rubyonrails-core/browse_thread/thread/66df9e4241fdc0a7

Ryan Angilly wrote:

If, for example, you order by ‘Name’, and then while batch processing you insert a new record with the name “Angilly”, it is possible (probable, actually) that your batch processing will miss records. IDs are not very likely to change, and the auto-incrementing means it’s much more difficult to miss records. This is also the reason that it requires integer IDs (and not UUIDs).

I haven’t looked closely at the code, but wouldn’t it be possible to just select all the IDs in one query with whatever order criteria were desired, then use those IDs to select the full records in batches? It would be something like how eager loading does multiple selects in some cases.

Jack

Yeah, that’s possible. But then you could run into the same memory bloat problem. Just a million records with UUID PKs gives you over 35 megabytes of data to store in memory (not including any of the overhead due to Ruby’s Array and String classes). Maybe that’s enough of an edge case that it’s worth ignoring for now, but then I still come back to the original question: can you think of a case where you really need to order it?
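
For the record, that figure is just back-of-the-envelope arithmetic, assuming 36-character UUID strings (hyphens included):

    uuid_bytes = 36                    # "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
    records    = 1_000_000
    raw_bytes  = uuid_bytes * records  # => 36_000_000
    puts raw_bytes / 1024.0 / 1024.0   # => ~34.3, i.e. 35+ MB once Ruby's
                                       #    per-object overhead is added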

If you need to process a million records, you may want to process them in a specific order, such as most recently updated first, since those may be the most important and it may take a while to finish the batch. Processing them in primary key order may well mean that the most important records are processed last.

The patch I submitted does just what Jack suggests. While you will end up with 35 MB for a million records, it will be a constant amount of memory and will not bloat as the batch is processed, even if you load associations. If 35 MB is too much for an environment with limited resources, you could break it up into smaller batches using :limit and :offset, like any other ActiveRecord find.
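
In other words, the approach boils down to something like this (a rough sketch rather than the patch itself; Person, the :order value, and do_something_with are placeholders):

    # One up-front query for just the ordered ids (a constant ~35 MB for a
    # million UUID keys), then fixed-size batches of full records keyed off them.
    ids = Person.all(:select => "id", :order => "updated_at DESC").map(&:id)

    ids.each_slice(1000) do |slice|
      records = Person.all(:conditions => ["id IN (?)", slice])
      by_id   = records.index_by(&:id)
      # IN (...) doesn't guarantee order, so re-impose the requested one per batch
      slice.each { |id| do_something_with(by_id[id]) } # placeholder
    end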

Ordering is forced to the “id” column because:

  1. MySQL optimizes iterating over huge datasets much better with a “WHERE id > ?” clause than with LIMIT and OFFSET (see the sketch after this list);
  2. you don’t need ordering – every single record gets yielded.
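
Here is the sketch point 1 refers to; people/Person, the cursor values, and the batch size of 1000 are illustrative, and the SQL comments show roughly the shape of the generated queries rather than their exact text:

    last_id, offset = 10_000, 10_000   # illustrative cursor values

    # Keyset paging (what find_in_batches generates): each batch seeks on the
    # primary-key index, so batch N costs about the same as batch 1.
    Person.all(:conditions => ["id > ?", last_id], :order => "id ASC", :limit => 1000)
    #   SELECT * FROM people WHERE id > 10000 ORDER BY id ASC LIMIT 1000

    # LIMIT/OFFSET paging: the server still produces and discards the first
    # `offset` rows on every batch, so later batches get progressively slower.
    Person.all(:order => "id ASC", :limit => 1000, :offset => offset)
    #   SELECT * FROM people ORDER BY id ASC LIMIT 1000 OFFSET 10000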

I don't think the implementation of core functionality should be based on how MySQL handles something.

:limit, :offset, and :order are optional parameters that I'm exposing in case the developer needs them. If you want to iterate over the entire data set you can certainly do so, and these options will not be sent to the database. In fact, in some circumstances, the query will end up more efficient. If you construct a horrendously complex query that the optimizer doesn't come up with a good plan for, the code in my patch will execute the horrible query once and afterward do a simple "WHERE id IN (?, ?, ...)" query for every batch. Using "id