Slow eager loading (& fix)

Executive Summary:

The main issue you are running into is that Rails' SQL queries for multiple included has_many associations return the cartesian product of the has_many associations. The best way to handle this is to send two or three separate SQL statements: one statement for each association, with the results combined afterwards. The most efficient way is probably n+1 queries, where n is the number of has_many associations: one query to get the information on the main object, and one query for each has_many association that only includes the association information and the main object's id (in order to associate it). That would cut the number of rows returned for the queries you mention from 12,000 to 231 and from 27,000 to 346. It's more complex than the current implementation, but it will perform much better. I'm not volunteering to implement it, though. :)
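
To make the arithmetic concrete, here is a minimal sketch in plain Ruby (no database involved). The message counts are assumptions: they were chosen because they reproduce the row counts quoted above, but the actual data behind those figures is not given in the thread.

```ruby
# Rows returned by a single JOIN across two has_many associations:
# every incoming row is paired with every outgoing row. An empty
# association still yields one row under a LEFT OUTER JOIN.
def joined_row_count(incoming, outgoing)
  [incoming, 1].max * [outgoing, 1].max
end

# Rows returned by n+1 separate queries: one row for the main object,
# plus one row per associated record.
def split_row_count(incoming, outgoing)
  1 + incoming + outgoing
end

# Hypothetical counts that are consistent with the figures in the post.
joined_row_count(150, 80)   # => 12000
split_row_count(150, 80)    # => 231
joined_row_count(225, 120)  # => 27000
split_row_count(225, 120)   # => 346
```

The joined count grows with the product of the association sizes, while the split count grows with their sum, which is why the gap widens so quickly.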

As a workaround, how about:

question = Question.find_by_id(big_question.id, :include => :incoming_messages)
question.instance_variable_set('@outgoing_messages', Question.find_by_id(big_question.id, :include => :outgoing_messages).outgoing_messages)

Also, note that for a single object, you are probably better off using lazy loading has_many associations (eager loading belongs_to associations is fine). Eager loading has_many associations should only be done if you are getting multiple objects at once (i.e. find :all).

Jeremy

Hey,

2 comments from me:

1 - using :include one-level-deep when you are fetching *one*
toplevel object *and* you are not issuing :conditions on
the :included tables is *always* (well, I've never found an
exception) slower.

x = Foo.find(4678, :include => [:incoming_messages, :outgoing_messages])

is slower than:

x = Foo.find(4678)
x.incoming_messages
x.outgoing_messages

If your reaction is to say "but it's always faster with eager
loading" then I urge you to *measure* it and get back to me if you
find that my measurements are wrong.
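
One way to take that measurement is sketched below. The helper name average_seconds is hypothetical and not part of Rails; it only uses Ruby's monotonic clock. The commented usage assumes the Foo model and id from the example above and would need to be run in a console against real data.

```ruby
# Run the block `runs` times and return the average wall-clock time
# per run, in seconds, using the monotonic clock.
def average_seconds(runs)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  runs.times { yield }
  (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) / runs
end

# Hypothetical usage (requires the application's models, so it is left
# as a comment here):
#
#   eager = average_seconds(50) do
#     Foo.find(4678, :include => [:incoming_messages, :outgoing_messages])
#   end
#
#   lazy = average_seconds(50) do
#     x = Foo.find(4678)
#     x.incoming_messages.each { }   # force the association to load
#     x.outgoing_messages.each { }
#   end
```

Repeated runs can be skewed by warm database caches, so it is worth comparing both orders before trusting the numbers.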

2 - I've got a plugin that improves your situation where you are
fetching multiple toplevel objects *and* you don't have
any :conditions that relate to the associations you're pulling in.

foos = Foo.find(:all, :hydrate =>
[:incoming_messages, :outgoing_messages])

It will split that find() into 3 queries, wiring up the relation
targets. I've chosen the explicit :hydrate option to give the user
more control over the strategy for pulling in associations and it
works just fine with Rick O's scope plugin too.
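
The plugin's code is not shown in this thread, so the following is only a sketch of the general idea of "wiring up" results from separate queries, written in plain Ruby. FooRecord, MessageRecord and wire_up are hypothetical names, and the arrays stand in for the rows the three queries would return.

```ruby
# Hypothetical in-memory stand-ins for database rows.
FooRecord     = Struct.new(:id, :incoming_messages, :outgoing_messages)
MessageRecord = Struct.new(:id, :foo_id, :body)

# Attach child rows to their parents by foreign key. Grouping into a
# hash first means each parent is matched with a single lookup.
def wire_up(parents, children, setter)
  grouped = children.group_by(&:foo_id)
  parents.each do |parent|
    parent.public_send(setter, grouped.fetch(parent.id, []))
  end
end

foos     = [FooRecord.new(1), FooRecord.new(2)]            # query 1: parents
incoming = [MessageRecord.new(10, 1, "hi"),
            MessageRecord.new(11, 1, "hello")]             # query 2
outgoing = [MessageRecord.new(20, 2, "bye")]               # query 3

wire_up(foos, incoming, :incoming_messages=)
wire_up(foos, outgoing, :outgoing_messages=)
```

Parents with no matching children end up with an empty array rather than nil, which mirrors how an empty has_many association behaves.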

I've been sitting on the plugin since railsconf *last* year and never
released it because I had so many doubts about whether the strategy
was solid enough. I've used it enough times that I'm confident it
works. I'll be releasing it for public consumption in the next
couple of weeks.

Trev

Hey,

2 comments from me:

1 - using :include one-level-deep when you are fetching *one* toplevel object *and* you are not issuing :conditions on the :included tables is *always* (well, I've never found an exception) slower.

x = Foo.find(4678, :include => [:incoming_messages, :outgoing_messages])

is slower than:

x = Foo.find(4678)
x.incoming_messages
x.outgoing_messages

If your reaction is to say "but it's always faster with eager loading" then I urge you to *measure* it and get back to me if you find that my measurements are wrong.

I agree. Eager including is only really beneficial if you're doing lots of queries looping through the messages:

# or use paginate in the lovely will_paginate plugin
@incoming_messages = @foo.incoming_messages.find(:all, :include => :author)

Doing your eager include here should be faster than doing a query on each row for each message author.
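
The difference in query counts can be illustrated without a database. This is a sketch under stated assumptions: CountingAuthorStore is a hypothetical stand-in that just counts calls, and the batched lookup models the idea of fetching all authors at once rather than the exact JOIN that :include issues.

```ruby
# Hypothetical stand-in for a database that counts the queries it receives.
class CountingAuthorStore
  attr_reader :query_count

  def initialize(authors_by_id)
    @authors_by_id = authors_by_id
    @query_count = 0
  end

  # One query per call, as lazy loading issues for each message's author.
  def find(id)
    @query_count += 1
    @authors_by_id[id]
  end

  # One query for the whole batch of ids.
  def find_all(ids)
    @query_count += 1
    @authors_by_id.select { |id, _name| ids.include?(id) }
  end
end

AUTHORS    = { 1 => "Ann", 2 => "Bob", 3 => "Cy" }.freeze
author_ids = [1, 2, 1, 3, 2]   # author_id of each of five messages

# Lazy: one lookup per message.
lazy = CountingAuthorStore.new(AUTHORS)
lazy_names = author_ids.map { |id| lazy.find(id) }

# Batched: one lookup for all distinct authors, then in-memory matching.
eager = CountingAuthorStore.new(AUTHORS)
authors = eager.find_all(author_ids.uniq)
eager_names = author_ids.map { |id| authors[id] }
```

Both approaches produce the same names, but the lazy version issues one query per message while the batched version issues one in total.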

Personally I've stopped using all eager includes in favor of the new ActiveRecord connection caching and my own active_record_context plugin (currently in use in Lighthouse):

http://activereload.net/2007/5/23/spend-less-time-in-the-database-and-more-time-outdoors

2 - I've got a plugin that improves your situation where you are fetching multiple toplevel objects *and* you don't have any :conditions that relate to the associations you're pulling in.

foos = Foo.find(:all, :hydrate => [:incoming_messages, :outgoing_messages])

It will split that find() into 3 queries, wiring up the relation targets. I've chosen the explicit :hydrate option to give the user more control over the strategy for pulling in associations and it works just fine with Rick O's scope plugin too.

I've been sitting on the plugin since railsconf *last* year and never released it because I had so many doubts about whether the strategy was solid enough. I've used it enough times that I'm confident it works. I'll be releasing it for public consumption in the next couple of weeks.

Oh, I think coda hale had some similar plugin too. You guys should totally team up!

Hey,

2 comments from me:

1 - using :include one-level-deep when you are fetching *one* toplevel object *and* you are not issuing :conditions on the :included tables is *always* (well, I've never found an exception) slower.

x = Foo.find(4678, :include => [:incoming_messages, :outgoing_messages])

is slower than:

x = Foo.find(4678)
x.incoming_messages
x.outgoing_messages

If your reaction is to say "but it's always faster with eager loading" then I urge you to *measure* it and get back to me if you find that my measurements are wrong.

Ah yes, you are right. It's not a huge difference but it's definitely there.

2 - I've got a plugin that improves your situation where you are fetching multiple toplevel objects *and* you don't have any :conditions that relate to the associations you're pulling in.

foos = Foo.find(:all, :hydrate => [:incoming_messages, :outgoing_messages])

It will split that find() into 3 queries, wiring up the relation targets. I've chosen the explicit :hydrate option to give the user more control over the strategy for pulling in associations and it works just fine with Rick O's scope plugin too.

I've been sitting on the plugin since railsconf *last* year and never released it because I had so many doubts about whether the strategy was solid enough. I've used it enough times that I'm confident it works. I'll be releasing it for public consumption in the next couple of weeks.

Very interesting, I look forward to seeing it!

Fred

The main issue you are running into is that Rails' SQL queries for multiple included has_many associations return the cartesian product of the has_many associations. The best way to handle this is to send two or three separate SQL statements: one statement for each association, with the results combined afterwards. The most efficient way is probably n+1 queries, where n is the number of has_many associations: one query to get the information on the main object, and one query for each has_many association that only includes the association information and the main object's id (in order to associate it). That would cut the number of rows returned for the queries you mention from 12,000 to 231 and from 27,000 to 346. It's more complex than the current implementation, but it will perform much better. I'm not volunteering to implement it, though. :)

That would be the ideal

As a workaround, how about:

question = Question.find_by_id(big_question.id, :include => :incoming_messages)
question.instance_variable_set('@outgoing_messages', Question.find_by_id(big_question.id, :include => :outgoing_messages).outgoing_messages)

A bit fiddly for me. For now I'm going to junk one of the eager loads (I do have a frequent use case where I display a list of questions). Forgoing some eager loading seems to be the way forward for now.

Also, note that for a single object, you are probably better off using lazy loading has_many associations (eager loading belongs_to associations is fine). Eager loading has_many associations should only be done if you are getting multiple objects at once (i.e. find :all).

Yes, I'm coming round to that view

Very awesome! I'll definitely be having a look at that one!

Fred

While we’re on the topic, another place where eager loading falls down is on tables with large fields that aren’t needed. Sometimes you need eager loading, but one of the tables has some large fields you don’t need. Normally you could :select only what you needed, but of course that doesn’t work with eager loading where the :select attribute is ignored.

I’ve run into a few other cases where I needed eager loading but for one reason or another it resulted in poor performance or had other limitations. The more I think about it, the more I think it would be really cool to be able to “manually” eager load with :select and :joins options (or possibly even find_by_sql).

The ability to instantiate multiple model types from a single query in a more flexible way than vanilla eager-loading would be quite useful. The tricky part would be coming up with a clean interface. Anyway, just food for thought.
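
As a rough illustration of what "manual" eager loading might involve, here is a plain-Ruby sketch that builds two kinds of objects from flat joined rows. Everything in it is hypothetical: the QuestionNode and MessageNode structs, the build_graph helper and the q_/m_ column prefixes are illustrative conventions, not an existing Rails interface. The rows stand in for the result of a hand-written query that selects only the columns needed.

```ruby
# Hypothetical lightweight models holding only the selected columns.
QuestionNode = Struct.new(:id, :title, :messages)
MessageNode  = Struct.new(:id, :subject)

# Build parent objects with their children from flat joined rows.
# Each row carries parent columns (q_*) and child columns (m_*).
def build_graph(rows)
  questions = {}
  rows.each do |row|
    question = questions[row["q_id"]] ||=
      QuestionNode.new(row["q_id"], row["q_title"], [])
    next if row["m_id"].nil?   # LEFT JOIN row for a parent with no children
    question.messages << MessageNode.new(row["m_id"], row["m_subject"])
  end
  questions.values
end

rows = [
  { "q_id" => 1, "q_title" => "First",  "m_id" => 10,  "m_subject" => "Re: First" },
  { "q_id" => 1, "q_title" => "First",  "m_id" => 11,  "m_subject" => "Fwd: First" },
  { "q_id" => 2, "q_title" => "Second", "m_id" => nil, "m_subject" => nil }
]

graph = build_graph(rows)
```

The hard part the post alludes to is not this loop but deciding how a caller would declare which columns map to which model.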


Yeah, we've discussed this in the past and all we're waiting on is someone to come up with a nice interface and a wicked patch. Being able to construct a graph of objects from the results of a SQL query would be a nice feature for some people; we just haven't found someone who needs it badly enough to warrant spending time on investigating a solution :)

I'm interested in tackling this one. I've been doing some pretty heavy stuff with dynamically built finds, so I have a lot of ideas.

I did a writeup at http://darwinweb.net/article/Free_Form_Manual_Eager_Loading to solicit some input. I need to fix my commenting system because it's kind of confusing. For anyone commenting: make sure that you submit twice, as the first time is only previewing your comment.

Trevor Squires wrote:

1 - using :include one-level-deep when you are fetching *one* toplevel object *and* you are not issuing :conditions on the :included tables is *always* (well, I've never found an exception) slower.

x = Foo.find(4678, :include => [:incoming_messages, :outgoing_messages])

is slower than:

x = Foo.find(4678)
x.incoming_messages
x.outgoing_messages

If your reaction is to say "but it's always faster with eager loading" then I urge you to *measure* it and get back to me if you find that my measurements are wrong.

I'm surprised the effect of the extra data and data processing that comes with use of :include so clearly trumps the chained latency of the extra database calls.

Do you think you would get the same result when:
   1. The database is on another server,
   2. :select is used to restrict the size of base model data returned, and
   3. Hashes are used to speed matching in the construction of the object
      hierarchy, like in Fred's patch?

Jeremy's n+1 solution obviously becomes more attractive as the include chain gets longer and the records larger. But there would still be a switchover point that's a function of database comms latency. Such a solution would really be aided by sending the database commands in parallel.
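
The switchover point can be sketched with a deliberately simplified cost model. This is an assumption for illustration, not a measurement: total time is taken to be round trips times latency plus rows times a fixed per-row cost, and the per-row cost used below is a made-up figure. The row counts are the ones quoted earlier in the thread.

```ruby
# Simplified cost model (an assumption): each round trip costs `latency`,
# and each returned row costs `per_row`, in the same time unit.
def query_cost(round_trips, latency, rows, per_row)
  round_trips * latency + rows * per_row
end

# Latency at which one big JOIN costs the same as the sequential split
# queries, where the split needs `extra_round_trips` more round trips.
def break_even_latency(joined_rows, split_rows, extra_round_trips, per_row)
  per_row * (joined_rows - split_rows) / extra_round_trips.to_f
end

per_row = 0.01  # hypothetical cost per row, in milliseconds

# 12,000 joined rows versus 231 rows over three queries (two extra trips).
switchover = break_even_latency(12_000, 231, 2, per_row)

join_cost  = query_cost(1, 1.0, 12_000, per_row)  # at 1 ms latency
split_cost = query_cost(3, 1.0, 231, per_row)     # at 1 ms latency
```

Under this model the split queries win whenever latency is below the switchover value. If the queries were sent in parallel, the split would pay roughly one round trip instead of three, so in this model it would win at any latency.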

Unfortunately eager loading ignores the select option.

Gabe da Silveira wrote: