Using Active Record to process large tables.

Hi

I am trying to use ActiveRecord to process very large tables, currently with about 600,000 rows, but I expect this to grow to well over a million. This is either batch processing that needs to make use of the business logic in the Active Record model for that table, or processing that occasionally occurs in a migration. It seems there are two options here: do a Model.find(:all) to get the entire result set and then process it, or paginate through the result set, processing chunks at a time.

Initially I tried doing a Model.find(:all) and then processing the results, but I quickly came up against memory and performance issues, and MySQL dropping connections. A look at the ActiveRecord code showed that find reads the entire result set into memory and then returns an array containing an ActiveRecord object for each row in the result. How deep does this assumption run? Is it possible to modify find to allow single-row instantiation, so that only one AR object (one row) exists in memory at a time, e.g. (in pseudocode):

<code>
def find(sql)
  # assuming the results from select are able to be streamed
  connection.select(sql).each do |result|
    yield instantiate(result)
  end
end
</code>

Of course an API would need to be worked out; that is not what I am suggesting here. I would just like to know whether this is possible, or whether the assumption that the entire result set fits into memory runs through the entire stack, which would make supporting this idiom very difficult.
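For the sake of illustration, the calling side of what I have in mind might look something like this (find_each_row is a made-up name, not an existing ActiveRecord method, and do_business_logic stands in for whatever the model actually does):

# Hypothetical streaming API -- only meant to show the idiom
# from the caller's side.
Model.find_each_row(:conditions => 'processed = 0') do |record|
  record.do_business_logic   # only one AR object is live at a time
end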

The other option, paginating, suffered from severe performance degradation because MySQL performs extremely poorly with large values of :offset. For example, processing the table in batches of 100, a batch at the beginning of the table would take about 0.01s to complete, while batches towards the end of the table were taking around 3 seconds. The performance appears to degrade linearly, so that processing the whole table takes about 2.5 hours; if the initial performance were maintained it would be closer to 1-2 minutes. So paginating through a large table is not a good solution either; however, it is the only one of the two that actually works.

Does anyone have any experience doing something similar? I would be willing to investigate any changes required to make this work better, but some pointers in the right direction or deeper explanation of the issues would be useful.

Thanks, Sean Geoghegan

I accomplished this by adding an 'each' class method to AR::Base and hacking it to instantiate only one object, replacing that object's attributes on each iteration.

Customers.each do |customer|
  # do stuff
end
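I don't have the patch handy any more, but the idea was roughly this (a from-memory sketch against the Rails 1.x internals, not the exact code; note that select_all still buffers the raw row hashes):

class ActiveRecord::Base
  # Fetch rows as plain hashes and reuse a single AR instance,
  # swapping its @attributes hash on every iteration, so only one
  # heavyweight AR object is ever created.
  def self.each
    record = nil
    connection.select_all("SELECT * FROM #{table_name}").each do |row|
      if record.nil?
        record = instantiate(row)
      else
        record.instance_variable_set(:@attributes, row)
      end
      yield record
    end
  end
end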

I never got around to submitting a patch or creating a plugin; I've since switched back to Perl and unbuffered queries for processing large datasets. The whole Ruby/Rails/MySQL bindings suite isn't well suited to this if performance is a concern.

Row-by-row can be slow, too. The best solution is to grab rows a hundred or so at a time (depending on the available memory). Use the limit and offset features to grab the first 100 rows, process them, grab the next 100 rows, process them, and so on.
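For example, a minimal version of that loop (Model and process stand in for your own class and business logic; the :order clause matters so the batches don't shift underneath you):

batch_size = 100
offset = 0
loop do
  batch = Model.find(:all, :order => 'id',
                     :limit => batch_size, :offset => offset)
  break if batch.empty?
  batch.each { |record| record.process }
  offset += batch_size
end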

You can even use paginators for this, strange as it may sound. :)

-M

Check out Starfish. http://www.rufy.com/starfish/doc/

Whenever you are doing extremely large set operations, ANY ORM is going to be slow. And row-by-row is always orders of magnitude slower than whole-set operations in SQL. Might this not be a time to optimize your model with a tiny bit of direct SQL?

I've literally seen improvements of multiple orders of magnitude by switching processes from stored procedures to intelligent queries that can do everything in one step, and I'd assume the same thing might play out here.
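For example (the table and column names are made up), a single set-based statement pushed straight through the connection replaces hundreds of thousands of object instantiations and saves:

# One whole-set UPDATE instead of loading and saving each row through AR.
ActiveRecord::Base.connection.execute(<<-SQL)
  UPDATE orders
     SET status = 'archived'
   WHERE created_at < '2006-01-01'
SQL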

Thanks for the reply.

I did try paginating using limit and offset; however, as I mentioned in my original post, query performance for large values of offset degrades quite significantly, to the point of increasing the overall execution time by an order of magnitude.

So no matter how many rows you fetch per query, using offset doesn't scale to large tables.

Hi Tim,

Thanks for your reply.

Using direct SQL is possible if the operation you wish to perform is supported by the SQL dialect you are using, in which case performance is excellent. However, some operations are not possible in SQL alone and require more complicated logic. I agree that performance would be slow using a row-by-row solution, but I don't think it has to be as bad as it is in ActiveRecord.

If it were possible to remove the buffering and, in a sense, stream through the results, performance should be much better than it is now.

Sean

Hi,

That sounds similar to the technique I was considering. Did you find much of a performance improvement? What I am concerned about is whether the buffering exists throughout the whole ActiveRecord > ConnectionAdapter > DB driver stack, in which case just solving the object creation in ActiveRecord::Base won't help much.

For example, look at http://dev.rubyonrails.org/browser/trunk/activerecord/lib/active_record/connection_adapters/mysql_adapter.rb#L370 to see that the MySQL connection adapter seems to convert the entire result set to an array of hashes before returning from the select method. This is never going to work for large result sets.
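In rough paraphrase (this isn't the actual source, just the shape of it), that select does something like:

# Paraphrased: every row is copied into a Ruby hash and accumulated
# in an array before anything is handed back to ActiveRecord.
def select(sql, name = nil)
  result = execute(sql, name)
  rows = []
  result.each_hash { |row| rows << row }
  result.free
  rows
end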

I've looked at the MySQL/Ruby bindings and they don't seem to do this buffering themselves; I'm not even sure why ActiveRecord chooses to do it at the connection adapter level. Could one of the core developers who knows about this chime in?
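For reference, the bindings will stream if you ask them to; something along these lines works against the raw mysql library (connection details obviously made up):

require 'mysql'

db = Mysql.real_connect('localhost', 'user', 'password', 'mydb')
db.query_with_result = false       # don't buffer the result client-side
db.query('SELECT * FROM customers')
result = db.use_result             # rows are pulled from the server lazily
while row = result.fetch_hash
  # process one row hash at a time
end
result.free
db.close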

Thanks

I did something similar recently. My solution might not work for you, but here it is:

class User < ActiveRecord::Base
  # There are lots of these
end

class TaskTracker < ActiveRecord::Base
  # This consists of nothing more than an id, a user_id, and a completed flag (which is indexed)
  belongs_to :user

  def perform_task
    # ... do expensive operation on self.user ...
    self.completed = true
    self.save
  end
end

I create a TaskTracker for every User. (There are various ways to do this; I did it with a 'select into' directly in SQL.)
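i.e. something along these lines (a sketch rather than my exact statement; the column names assume the obvious schema):

# Seed one TaskTracker per user in a single set-based statement.
ActiveRecord::Base.connection.execute(<<-SQL)
  INSERT INTO task_trackers (user_id, completed)
  SELECT id, 0 FROM users
SQL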

Now, I can do this:

TaskTracker.find(:all, :conditions => 'completed = false', :limit => 100).each do |t|
  t.perform_task
end

This seems to be pretty quick, perhaps because I don't need to use offset, and it has the advantage that I can stick the processing in a crontab if I want.
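The cron side is just a small runner script, something like this (the path and schedule are only examples):

# lib/process_tasks.rb -- kicked off from cron via script/runner, e.g.
#   */10 * * * * cd /path/to/app && ./script/runner 'require "process_tasks"'
TaskTracker.find(:all, :conditions => 'completed = false', :limit => 100).each do |t|
  t.perform_task
end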

-c

seangeo wrote:

The other option of paginating suffered from severe degradation of performance due to MySQL performance being extremely poor with large values of :offset. For example, processing the table in batches of 100, at the beginning of the table a batch would take about 0.01s to complete, towards the end of the table batches were taking around 3 seconds to complete. The performance seems to degrade linearly so that processing the table takes about 2.5 hours; if initial performance was maintained it would be closer to 1-2 minutes. So paginating through a large table is not a good solution either, however it is the only one of the two that actually works.

Did you try indexing the columns you are filtering by? I am by no means a MySQL guru, but I see no reason why LIMIT x,y should incur any performance penalty as the offset increases, unless the engine actually has to skip over all previous :offset results.

-- Istvan Hoka

Did you omit transaction handling on purpose? If one of the calls to t.perform_task raises an exception for any reason, this is likely to leave your data in an inconsistent state.

You might think that there's nothing in perform_task that might raise an exception. Think again. As you indicate, perform_task is an expensive operation. I take this to mean that it takes a noticeable amount of time. Now, when you fetch 100 objects from the database, the last one will have reached a considerable age when you come to it. That object, or any other, may have been changed by a user or some automated process when you finally come to process it. Depending on whether you're using optimistic locking (lock_version column), you either get a StaleObjectError or you corrupt your database by overwriting a record with obsolete data.

In effect, there are two cases you need to guard against. The first is that your process may cause an error all by itself. That's where transactions come in to ensure consistency:

TaskTracker.transaction do
  TaskTracker.find(:all, :conditions => 'completed = false', :limit => 100).each do |t|
    t.perform_task
  end
end

The second case is that your process is vulnerable to concurrent changes to the data. When you're sure this only happens very rarely, it may be sufficient to detect those changes using optimistic locking. Then, when something does go wrong, rely on the transaction to roll back your earlier changes and restart the process in the hope that everything goes well the next time. When conflicts are rather likely, consider using pessimistic locking. It's not yet available in a released version of Rails/ActiveRecord, but will apparently be in 1.2.
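With optimistic locking, the retry could look something like this (just a sketch; the retry limit is arbitrary and it assumes task_trackers has a lock_version column):

attempts = 0
begin
  TaskTracker.transaction do
    TaskTracker.find(:all, :conditions => 'completed = false', :limit => 100).each do |t|
      t.perform_task
    end
  end
rescue ActiveRecord::StaleObjectError
  attempts += 1
  retry if attempts < 3   # give up after a few conflicting runs
end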

Michael