Using Active Record to process large tables.

Hi

I am trying to use ActiveRecord to process very large tables, currently
with about 600,000 rows but I expect this to grow to well over a
million. This processing is either batch processing that needs to make
use of the business logic in the Active Record model for that table or
it occasionally occurs in a migration. It seems there are two options
here: do a Model.find(:all) to get the entire result set and then
process it, or paginate through the result set, processing chunks at a
time.

Initially I tried doing a Model.find(:all) and then processing the
results, and I quickly came up against memory and performance issues
and MySQL dropping connections. A look at the ActiveRecord code showed
that find was reading the entire result set into memory and then
returning an array containing an ActiveRecord object for each row in
the result. How deep does this assumption run? Is it possible to modify
find to allow single-row instantiation, so that only one AR object (one
row) exists in memory at a time? For example (in pseudocode):

<code>
def find(sql)
  # assuming the results from select are able to be streamed
  connection.select(sql).each do |result|
    yield instantiate(result)
  end
end
</code>

Of course an API would need to be worked out; that is not what I am
suggesting here. I would just like to know whether this is possible, or
whether the assumption that the entire result set fits into memory runs
throughout the entire stack, which would make supporting this idiom
very difficult.

The other option, paginating through the result set, suffered severe
performance degradation because MySQL is extremely slow with large
values of :offset. For example, processing the table in batches of 100,
a batch at the beginning of the table would take about 0.01s to
complete, while towards the end of the table batches were taking around
3 seconds. The per-batch time seems to degrade linearly, so processing
the whole table takes about 2.5 hours; if the initial performance were
maintained it would be closer to 1-2 minutes. So paginating through a
large table is not a good solution either; however, it is the only one
of the two that actually works.
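
Roughly, the batching loop looked like this (a sketch; batch_size and
the per-record process method are placeholders for the real business
logic, not actual API):

<code>
batch_size = 100
offset = 0
loop do
  # fetch the next chunk; this is where large :offset values hurt
  batch = Model.find(:all, :limit => batch_size, :offset => offset)
  break if batch.empty?
  batch.each { |record| record.process }  # placeholder for model logic
  offset += batch_size
end
</code>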

Does anyone have any experience doing something similar? I would be
willing to investigate any changes required to make this work better,
but some pointers in the right direction or a deeper explanation of the
issues would be useful.

Thanks
Sean Geoghegan

I accomplished this by adding an 'each' class method to AR::Base and
hacking it to instantiate only one object, replacing that object's
attributes on each iteration.

Customers.each do |customer|
  # do stuff
end
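
The hack was roughly along these lines (a sketch from memory, not the
exact code; note it only avoids allocating an object per row, it does
nothing about buffering below ActiveRecord):

class ActiveRecord::Base
  # Reuse a single AR instance, swapping its attributes in place on
  # each iteration instead of instantiating one object per row.
  # select_all still buffers the rows in the adapter, though.
  def self.each
    record = nil
    connection.select_all("SELECT * FROM #{table_name}").each do |row|
      if record
        record.instance_variable_set(:@attributes, row)
      else
        record = instantiate(row)
      end
      yield record
    end
  end
end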

I never got around to submitting a patch or creating a plugin, and I've
since switched back to Perl and unbuffered queries for processing large
datasets; the whole Ruby, Rails, MySQL bindings suite isn't well suited
for it if performance is a concern.

Row-by-row can be slow, too. The best solution is to grab rows a hundred or so at a time (depending on the available memory): use the limit and offset features to grab the first 100 rows, process them, grab the next 100 rows, process those, and so on.

You can even use paginators for this, strange as it may sound. :-)

-M

Check out Starfish.
http://www.rufy.com/starfish/doc/

Whenever you are doing extremely large set operations, ANY ORM is going
to be slow. And row by row is always orders of magnitude slower than
whole-set operations in SQL. Might this not be a time to optimize your
model with a tiny bit of direct SQL?

I've seen improvements of multiple orders of magnitude by switching
processes from stored procedures to intelligent queries that do
everything in one step, and I'd assume the same thing might play out
here.
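
For example, something along these lines replaces a million row-by-row
saves with a single statement (the model, columns and condition are
invented for illustration):

# Hypothetical whole-set operation: one UPDATE pushed down to SQL
# instead of loading, changing and saving each record.
Customer.update_all("status = 'archived'",
                    "last_order_at < '2006-01-01'")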

Thanks for the reply.

I did try paginating using limit and offset; however, as I mentioned in
my original post, query performance for large values of offset degrades
quite significantly, to the point of increasing the overall execution
time by an order of magnitude.

So no matter how many rows you fetch per query, using offset doesn't
scale to large tables.

Hi Tim,

Thanks for your reply.

Using direct SQL is possible if the operation you wish to perform is
supported by the SQL dialect you are using, in which case performance
is excellent. However, some operations are not possible in SQL alone
and require more complicated logic. I agree that performance would be
slow using a row-by-row solution, but I don't think it has to be as bad
as it is in ActiveRecord.

If it were possible to remove the buffering and, in a sense, stream
through the results, performance should be much better than it is now.

Sean

Hi,

That sounds similar to the technique I was considering. Did you find
much of a performance improvement here? What I am concerned about is
whether the buffering exists throughout the whole ActiveRecord >
ConnectionAdapter > DB driver stack, in which case solving just the
object creation in ActiveRecord::Base won't help much.

For example, look at
http://dev.rubyonrails.org/browser/trunk/activerecord/lib/active_record/connection_adapters/mysql_adapter.rb#L370
to see that the MySQL connection adapter seems to convert the entire
result set to an array of hashes before returning from the select
method. This is never going to work for large result sets.

I've looked at the MySQL/Ruby bindings and they don't seem to do this
buffering; I'm not even sure why ActiveRecord chooses to do it at the
connection adapter level. Could one of the core developers who knows
about this chime in?
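
For comparison, streaming straight through the MySQL/Ruby bindings with
an unbuffered query looks something like this (a sketch; the connection
details and table are placeholders):

require 'mysql'

# An unbuffered ("use result") query pulls rows from the server one
# at a time instead of buffering the whole result set client-side.
conn = Mysql.real_connect('localhost', 'user', 'password', 'mydb')
conn.query_with_result = false
conn.query('SELECT * FROM customers')
result = conn.use_result
result.each_hash do |row|
  # process one row at a time; row is a Hash of column => value
end
result.free
conn.close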

Thanks

I did something similar recently. My solution might not work for you,
but here it is:

class User < ActiveRecord::Base # There are lots of these
end

# TaskTracker consists of nothing more than an id, a user_id,
# and a completed flag (which is indexed)
class TaskTracker < ActiveRecord::Base
  belongs_to :user

  def perform_task
    # ... do expensive operation on self.user ...
    self.completed = true
    self.save
  end
end

I create a TaskTracker for every User. (There are various ways to do
this; I did it with a 'select into' directly from SQL.)
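
An INSERT ... SELECT works much the same way; the column names here are
assumed from the description above:

# Hypothetical seeding statement: one task_trackers row per user,
# created in a single SQL statement rather than one save per user.
ActiveRecord::Base.connection.execute(
  "INSERT INTO task_trackers (user_id, completed) SELECT id, 0 FROM users"
)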

Now, I can do this:

TaskTracker.find(:all, :conditions => 'completed = false',
                 :limit => 100).each do |t|
  t.perform_task
end

This seems to be pretty quick, perhaps because I don't need to use
offset, and it has the advantage that I can stick the processing in a
crontab if I want.
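
To work through the whole table, the query can simply be repeated until
it returns nothing; roughly like this (a sketch of the loop, not the
exact code I run):

# Repeat the indexed, offset-free query until no incomplete
# trackers remain.
loop do
  batch = TaskTracker.find(:all, :conditions => 'completed = false',
                           :limit => 100)
  break if batch.empty?
  batch.each { |t| t.perform_task }
end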

-c

seangeo wrote:

The other option, paginating through the result set, suffered severe
performance degradation because MySQL is extremely slow with large
values of :offset. For example, processing the table in batches of 100,
a batch at the beginning of the table would take about 0.01s to
complete, while towards the end of the table batches were taking around
3 seconds. The per-batch time seems to degrade linearly, so processing
the whole table takes about 2.5 hours; if the initial performance were
maintained it would be closer to 1-2 minutes. So paginating through a
large table is not a good solution either; however, it is the only one
of the two that actually works.

Did you try indexing the columns you are filtering by? I am by no means
a MySQL guru, but I see no reason why LIMIT x,y should incur any
performance penalty as the offset increases, unless the engine actually
has to skip over all previous :offset results.

--
Istvan Hoka

Did you omit transaction handling on purpose? If one of the calls to
t.perform_task raises an exception for any reason, this is likely to
leave your data in an inconsistent state.

You might think that there's nothing in perform_task that might raise an
exception. Think again. As you indicate, perform_task is an expensive
operation. I take this to mean that it takes a noticeable amount of
time. Now, when you fetch 100 objects from the database, the last one
will have reached a considerable age when you come to it. That object,
or any other, may have been changed by a user or some automated process
when you finally come to process it. Depending on whether you're using
optimistic locking (lock_version column), you either get a
StaleObjectError or you corrupt your database by overwriting a record
with obsolete data.

In effect, there are two cases you need to guard against. The first is
that your process may cause an error all by itself. That's where
transactions come in to ensure consistency:

TaskTracker.transaction do
  TaskTracker.find(:all, :conditions => 'completed = false',
                   :limit => 100).each do |t|
    t.perform_task
  end
end

The second case is that your process is vulnerable to concurrent
changes to the data. When you're sure that this only happens very
rarely, it may be sufficient to detect these changes using optimistic
locking. Then, when something does go wrong, rely on the transaction to
roll back your earlier changes and restart the process in the hope that
everything goes well the next time. When conflicts are rather likely,
consider using pessimistic locking. It's not yet available in a
released version of Rails/ActiveRecord, but will apparently be in 1.2.
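
With optimistic locking in place, the retry pattern might look like
this (a sketch; the unbounded retry is for illustration only):

# A lock_version column makes save raise StaleObjectError on
# concurrent changes; the transaction rolls the batch back and
# we start over.
begin
  TaskTracker.transaction do
    TaskTracker.find(:all, :conditions => 'completed = false',
                     :limit => 100).each do |t|
      t.perform_task
    end
  end
rescue ActiveRecord::StaleObjectError
  retry # in real code, bound the number of retries
end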

Michael