True ActiveRecord result set iteration

Hello,

For an internal project we were looking for a solution that extends ActiveRecord with an iterator function. Using find(:all) is no fun when you have to process some 10,000 records.

Until now there have been two ways of dealing with that scenario:

- write your logic a second time (e.g. in a stored procedure)
- bet on AR plugins that work around the limitation by fetching IDs and processing them in groups, or by using :offset and :limit (the latter is sketched right below)
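
For reference, the :offset/:limit flavour of that workaround boils down to something like this. This is a minimal sketch, not our plugin; the loop body is a placeholder for the real per-record work:

# Fetch the rows in fixed-size slices so only one slice is in memory at a time.
# Note that every slice is a separate SELECT, so rows can shift between batches
# unless you order by a stable column and run inside a suitable transaction.
batch_size = 1000
offset = 0
loop do
  batch = MyTable.find(:all, :order => "id", :limit => batch_size, :offset => offset)
  break if batch.empty?
  batch.each { |record| puts record.id }   # real per-record work goes here
  offset += batch_size
end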

Rewriting logic is something we wanted to avoid, and the plugins don't fully respect transactional contexts. So we started to implement our own true iterator support for AR. Our project runs on Oracle, but in the meantime we have added support for MySQL; other adapters can probably be extended easily, too. We also tried JRuby 1.1.x, which is sometimes faster than Ruby 1.8.6, but a patch is needed to bring the Java part of the connection adapter into shape for result set iteration.

Okay, you're about to ask: how does it work? Here we go:

MyTable.find_each_by_sql("some SQL here") { |my_table| ... }

MyTable.find_each(:all, :conditions => ..., :include => ...) { |my_table| ... }

Attached you'll find the magic code, which can be used as a plugin for Rails. When testing, please keep in mind that only Oracle and MySQL are fully supported. JDBC will take lots of RAM for large result sets until you have patched the JdbcConnectionAdapter.

Some figures with JRuby: I've tested the code with an export of ~80,000 customer data records. Originally I couldn't run the export with less than 2 GB of heap space (JRuby 1.1.4, without extensive garbage collection). After patching the connection adapter, it works with less than 128 MB of heap space (JRuby 1.1).
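
Just to illustrate the shape of such an export when driven by find_each (the model and column names below are invented, not our actual code):

require 'csv'

# Stream one customer at a time into a CSV file instead of loading all
# ~80,000 records up front.
CSV.open("customers.csv", "w") do |csv|
  Customer.find_each(:all, :conditions => ["exported = ?", false]) do |customer|
    csv << [customer.id, customer.name, customer.email]
  end
end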

I'd be happy if our idea were picked up and these iterator methods were integrated into AR. I've seen lots of people asking for exactly this behavior. It's possible to implement, it's easy to implement, and IMHO it doesn't break the AR metaphor.

If you like the idea and want to send feedback, please CC me. I'm not subscribed to the list.

Regards, Andreas

active_record_iterator.tgz (6.98 KB)

Any chance of having these work by scope/proxy, instead? For instance:

MyTable.all.each do |record|
  ...
end

Or, something like this, with named scopes:

User.registered.customers.each do |user|
  ...
end

I think this already works, actually; it just implicitly converts the AssociationProxy to an array. In other words, it slurps the entire result set. It would be nice if this behavior could be made to transparently (or optionally) do what you've implemented, rather than inventing a whole new interface. For example, find_each doesn't help when you want to use other things from Enumerable, like select/reject, inject, and so on.
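
One (hypothetical) way to get the rest of Enumerable on top of a streaming finder would be a thin wrapper that includes Enumerable and implements #each via the proposed find_each. StreamingScope below is a made-up name, not an existing or proposed API:

class StreamingScope
  include Enumerable

  def initialize(model, options = {})
    @model, @options = model, options
  end

  # Enumerable only needs #each; select/reject/inject/any? etc. come along
  # for free, although methods that return an Array still build that Array.
  def each(&block)
    @model.find_each(:all, @options, &block)
  end
end

# e.g. sum a column without holding all User objects in memory
# (logins is an invented column):
# StreamingScope.new(User, :conditions => {:registered => true}).inject(0) { |sum, u| sum + u.logins }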

Hey Andreas,

I just quickly went through your changes and it looks like a good start. A few things:

- Could you please open a ticket at http://rails.lighthouseapp.com/projects/8994-ruby-on-rails/overview and upload a .diff file with the patch written for Rails? Check http://rails.lighthouseapp.com/projects/8994-ruby-on-rails/overview for more help.
- The patch is missing tests.
- Those JDBC patches don't really belong in Rails :slight_smile:

Also, it'll really help if you could join the mailing list while we talk about your patch. Most people are likely to forget to CC you. But if you don't want to join our lil group, that's not a big deal :slight_smile:

Thanks for working on this.

David,

I don't think it would be possible to support the Enumerable interface in this extension, because it would negate the memory management benefits that are the reason for using such a technique. Looking at the code, it looks like Andreas instantiates one Ruby object at a time from the result set, then yields that object to the block for work to be done on it and frees the object after the yield returns (it's a little fancier than this with the Oracle extension, but in the MySQL version that's about it). If you were to allow for things like select/reject, inject, etc., then you're back in a situation where some potentially large subset (or all) of those Ruby objects are going to be held in memory.
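
Concretely, on the classic mysql gem that row-at-a-time pattern looks roughly like the following. This is a sketch of the technique, not Andreas's actual code; in particular, reaching the driver via raw_connection and reusing AR's instantiate are assumptions here:

class ActiveRecord::Base
  def self.find_each_by_sql(sql)
    mysql = connection.raw_connection      # the underlying Mysql object
    mysql.query_with_result = false        # don't buffer the result client-side
    mysql.query(sql)
    result = mysql.use_result              # stream rows from the server
    begin
      result.each_hash do |row|
        yield instantiate(row)             # build one AR object, hand it to the block;
      end                                  # it can be GC'd once the block returns
    ensure
      # no other queries may run on this connection until the result is freed
      result.free
      mysql.query_with_result = true
    end
  end
end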

As for named scopes, like you said, I think that just works. The named scopes just insert their additional magic into the final query that gets made (adding extra conditions, joins, etc.).
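
For context, the declarations behind an example like David's would be plain Rails 2.1 named_scope definitions (the conditions here are invented):

class User < ActiveRecord::Base
  named_scope :registered, :conditions => {:registered => true}
  named_scope :customers,  :conditions => {:role => "customer"}
end

# User.registered.customers builds a single query with both sets of conditions merged in.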

Cheers, Andrew

The one problem I see with this approach is that it does not allow for the eager-loading of polymorphic associations. In fact, as the code stands now I couldn't get vanilla :include options to work, only :joins (but I think that's just a little bug in the parsing of the options hash).

The problem is, when Rails uses the eager-include approach rather than the join approach, it looks through the entire result set to see which associations need to be queried and then grabs those associations in separate queries. This is relatively efficient because the object graph is built out for all objects at once; my guess is that doing the same thing one object at a time would be very difficult to do efficiently.
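
For anyone who hasn't looked at that code path, the eager-include strategy boils down to queries along these lines (simplified; Post and Comment are just example models):

Post.find(:all, :include => :comments)
# => SELECT * FROM posts
# => SELECT * FROM comments WHERE comments.post_id IN (1, 2, 3, ...)
# The second query needs the IDs of all rows from the first, which is why it
# is hard to reconcile with strict one-row-at-a-time streaming.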

As an alternative approach, the project I work on uses a home-grown version of what Andreas mentions as one of the two options: we fetch IDs and process them in groups (sketched below). The one thing I really like about that approach is that all the same finder options still work as expected, with no additional programming required. IMO, losing eager includes of polymorphic associations is a non-starter.
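
A rough sketch of that approach, with made-up model and method names (our real helpers are a bit more involved):

require 'enumerator'   # for each_slice on Ruby 1.8

# First fetch only the IDs, then load and process the records in groups.
ids = Order.find(:all, :select => "id",
                 :conditions => ["created_at < ?", 1.year.ago]).map { |o| o.id }

ids.each_slice(500) do |slice|
  # Each group is an ordinary finder call, so :include (even for polymorphic
  # associations) keeps working exactly as usual.
  Order.find(:all, :conditions => ["id IN (?)", slice], :include => :line_items).each do |order|
    puts order.id   # real per-record work goes here
  end
end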

just my 2 cents :slight_smile:

-ac

> David,
>
> I don't think it would be possible to support the Enumerable interface in this extension, because it would negate the memory management benefits that are the reason for using such a technique. Looking at the code, it looks like Andreas instantiates one Ruby object at a time from the result set, then yields that object to the block for work to be done on it and frees the object after the yield returns (it's a little fancier than this with the Oracle extension, but in the MySQL version that's about it). If you were to allow for things like select/reject, inject, etc., then you're back in a situation where some potentially large subset (or all) of those Ruby objects are going to be held in memory.

I don't think that's necessarily true.

Consider, for example:

(1..10000000).select {|x| x == 100000}

This range contains ten million elements, but executing that statement doesn't require having all of them in memory at any given time (keep an eye on Ruby's memory consumption while this runs).

On the other hand, this would:

(1..10000000).to_a.select {|x| x == 100000}

Of course, if the result set you are returning is large, e.g.

(1..10000000).select {|x| x % 2 == 0 }

then that will take a big chunk of memory, but that seems fair enough.

Fred

I understand – it would be a case where you do not want to deliberately cache the resultset, and so, it would not be quite like an association proxy.

But of course, #each is all that's needed for things like select/reject/inject. And inject, in particular, should be safe; select/reject, any?, all?, and similar methods will still benefit from not holding all items in memory, even if all items have to be iterated over anyway.

In fact, it’s my understanding that this is the main reason for having an enumerable interface – when using such an interface, you’re not required to know or care whether it’s a native array, the result of slurping the entire table, one select call per iteration, or anything in between.

Of course, there are also things like sort, uniq, and map that would ruin your whole day; though map, at least, could be implemented in a lazier way for this purpose. A sketch follows.
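
For example, a lazier map over a streaming source could be as small as this (LazyMap is a made-up name, written for Ruby 1.8, which has no Enumerable::Lazy):

class LazyMap
  include Enumerable

  def initialize(source, &transform)
    @source, @transform = source, transform
  end

  # Instead of returning an Array, transform each item as it is yielded.
  def each
    @source.each { |item| yield @transform.call(item) }
  end
end

# e.g. stream just the names, never holding all the User objects at once
# (user_stream stands for any object with a streaming #each):
# LazyMap.new(user_stream) { |u| u.name }.each { |name| puts name }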

> Hey Andreas,
>
> I just quickly went through your changes and it looks like a good start. A few things:
>
> - Could you please open a ticket at http://rails.lighthouseapp.com/projects/8994-ruby-on-rails/overview and upload a .diff file with the patch written for Rails? Check http://rails.lighthouseapp.com/projects/8994-ruby-on-rails/overview for more help.
> - The patch is missing tests.
> - Those JDBC patches don't really belong in Rails :slight_smile:

Well, it's just that people would wonder about the JDBC behavior. You risk an OOM-killed process if you work without the patches on large sets of data. If the adapter maintainers picked up the changes, the adapter would work out of the box.

Reading the other comments about the incompleteness of the #find_each part, I wonder whether I should focus on the first part for now. #find_each_by_sql is pretty straightforward. The other part can probably be improved; let me see.

In general I see the main benefit in the greatly reduced memory requirement when working with large result sets. It is still a single transaction, which is what I missed in the other plugins that work around the memory limitation with multiple selects.
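
For illustration (the names below are placeholders): the whole iteration runs as a single SELECT inside one transaction, rather than the many separate SELECTs a batching plugin would issue.

last_export_at = 1.day.ago   # placeholder cut-off

Customer.transaction do
  Customer.find_each(:all, :conditions => ["updated_at > ?", last_export_at]) do |customer|
    puts customer.id   # the real export step goes here
  end
end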

> Also, it'll really help if you could join the mailing list while we talk about your patch. Most people are likely to forget to CC you. But if you don't want to join our lil group, that's not a big deal :slight_smile:

You're right. :wink:

I hope to find some more time over the Xmas holidays to put more effort into the current implementation. I'll certainly get back to you then.