DATABASE Query performance

Hi Everybody

Suppose We have two models named Book and Author. Make a suggestion on how we will improve the database query performance on the following code.

books = Book.find(:all) books.each do | book | puts book.author end

Thanks in advance

Use eager-loading:

books = Book.find(:all, :include => :author)

You can use :select and :joins for the best performance, this is what I use currently.

:include is much less efficient but it is the Rails way of doing things. It used to create a JOIN but in a very ugly way.

Since Rails 2.x the :include code has been changed as it now makes 1 query on each table. It is still less efficient than custom :joins, but it is easier to use its derived objects.

Not for every scenario. For has_many associations, the Rails 2.x :include code is possibly better than a custom join when dealing with large datasets, since a custom join would give you a cartesian product while :include will only fetch the necessary records, at the expense of course, of additional SQL statements. This specially holds true if your webhost and db resides on different servers. If the data is large enough, it may merit to add indices to your reference fields (or foreign keys):

http://erolfornoles.blogspot.com/2008/08/optimize-eager-loading-in-rails-21.html

For 1-to-1 associations like belongs_to, Simpler Piggy-Backing makes for a good alternative to custom joins:

http://railsexpress.de/blog/articles/2006/05/29/simpler-piggy-backing

Not for every scenario. For has_many associations, the Rails 2.x :include code is possibly better than a custom join

Have you tested? because at the time of Rails 1.2.6 where the :include created a gigantic JOINs, my custom :joins was at least x3 times faster! mostly because I only pick the fields I really need and because AR doesn't create full objects with their references and all the stuff.

I tried finding some info on MySQL forums and websites about 1 query with multiple JOIN versus multiple queries, but it is not clear which is more efficient.

Things get really complicated when I read: "How adding another table to JOIN can improve performance ?"

I wrote the :include code in 2.1 because of performance problems I had with :include in 1.2.x. :including multiple has_manys from the same object very quickly leads to the database generating a very large number of rows (far greater than the number of actual objects you want) which doesn't help the database or ActiveRecord (who has to wade through them and discard portions of rows it has already seen.

Fred

Have you tested? because at the time of Rails 1.2.6 where the :include created a gigantic JOINs, my custom :joins was at least x3 times faster! mostly because I only pick the fields I really need and because AR doesn't create full objects with their references and all the stuff.

Yeah, in my case the Rails 2.1 :include was significantly faster than

class Depot   has_many :trucks   has_many :containers   has_many :cranes end

Of course, we may be talking about apples and bananas here, since my example involves has_many associations while the scenario I could think of given your description is a belongs_to or has_one.

I agree that belongs_to or has_one associations are better off piggy- backed via a :join and :select, but using them eventually leads into a lot of repetitions in our code. That's why I prefer using the Simpler Piggy-Backing plugin:

http://railsexpress.de/blog/articles/2006/05/29/simpler-piggy-backing

The author even claims: "In fact, using the extension is slightly faster than coding the select and joins manually (1%-3%)."

I tried finding some info on MySQL forums and websites about 1 query with multiple JOIN versus multiple queries, but it is not clear which is more efficient.

The Depot example I mentioned earlier is better composed using atleast 2 SQL statements, as against a single one. Not only would the single SQL approach (using joins) produce a cartesian product, it is logically incorrect since trucks, cranes and containers are independent of each other.

Then again, it all boils down to AR being ridiculously slow in composing AR objects from a resultset.

Things get really complicated when I read: "How adding another table to JOIN can improve performance ?"

http://www.mysqlperformanceblog.com/2008/08/01/how-adding-another-tab

There are a lot of quirks to doing optimizations in MySQL. The "one index per one query" is one, which is why I usually shy away from MySQL when doing data warehousing. I tried using it once to do CDR (call detail record) reporting, but the performance was such a nightmare that I had to migrate it to Postgre.

Yeah, in my case the Rails 2.1 :include was significantly faster than a :join for multiple has_many relationships, like:

Interesting.

What is the size of the DB on which your performed the tests? Mine were done on tables that have 100.000 and 1.000.000.000 rows. The :joins gave nice results that were around 0.1s on my good old PowerPC dev machine. The old :include method (rails 1.2.x) took almost 1s to perform the same query.

I wrote the :include code in 2.1 because of performance problems I had with :include in 1.2.x.

Hi Fred, yeah the :include in 1.2.x was pretty lame. Do you have some test results of the new :include versus custom :joins?

Those are almost by definition hard to come by: custom stuff is likely to be specific to one app (and just some custom joins does not accomplish the primary goal of :include (ie setup your associations).

Fred

What is the size of the DB on which your performed the tests? Mine were done on tables that have 100.000 and 1.000.000.000 rows. The :joins gave nice results that were around 0.1s on my good old PowerPC dev machine. The old :include method (rails 1.2.x) took almost 1s to perform the same query.

Are the seconds you mentioned DB fetch times? Or do they include AR times? And what kind of associations are involved?

The Depot model I mentioned is not that large, yet. A couple of dozen depots, which has a hundred to a few thousand trucks and containers each, and a couple of dozen cranes each. If you were going to use a join on them, the yield will be a cartesian product of:

20 depots x 1000 trucks x 1000 containers x 10 cranes = 200,000,000 rows on the average, which I'm pretty sure is going to take AR longer to compose.

I agree. has_one and belongs_to associations can be piggy-backed via :join and :select, but has_many associations cannot.

These were "completed" times.

And in what type of association are they in?

> If you were going to use a > join on them, the yield will be a cartesian product of:

> 20 depots x 1000 trucks x 1000 containers x 10 cranes = 200,000,000 > rows on the average, which I'm pretty sure is going to take AR longer > to compose.

Your figures are misleading. You never display all the records at once, you always pass a :conditions option to the query, don't you? Even your :include does, so what are you talking about?

Nope, those figures aren't misleading. Those are actual records fetched for a report, if I use traditional joins. Of course it won't be displayed on-screen, but AR has to compose the corresponding AR objects given that many records. The conditions aren't present since I have a rake task scheduled via cron that handles archiving of data, which means trucks, containers, and cranes for the past month aren't included in the resultset. It's more like an incoming/outgoing report that is generated weekly and monthly, used for billing and accounting purposes.

What the new :include does is to break those into manageable SQL.

20 depots x 1000 trucks = 20,000, assuming again, an average of 1000 trucks per depot 20 depots x 1000 containers = 20,000 20 depots x 10 cranes = 200

All in all, AR will only need to compose from 40,200 results on the average.

This is a sample of the report which needs to be generated:

- Depot -   - List of Trucks -   - List of Containers -   - List of Cranes - - Depot -   - List of Trucks -   - List of Containers -   - List of Cranes - ...

If you can give me a :select :join construct (without resorting to multiple SQL's) which could efficiently generate that kind of report, I'd be more than happy to use that.

Woops, I made a typo, instead of [,] one should use { => }.

I will definitely update my code from using :joins to the new :include.

Just like what Fred said, it's a case-to-case basis. But, uhmmm, it's your call I guess.

However I am having some problems. Using the following find: --- @order = Order.find(:all, :include => [:items, :products], :conditions => ['orders.id = ?', session[:order_id]]) ---

I get the following SQL: --- SELECT * FROM `orders` WHERE (orders.id = 15) SELECT `items`.* FROM `items` WHERE (`items`.order_id IN (15)) CACHE SELECT `items`.* FROM `items` WHERE (`items`.order_id IN (15)) SHOW FIELDS FROM `products` SELECT * FROM `products` WHERE (`products`.`id` IN (1,2)) SELECT * FROM `products` WHERE (`products`.`id` = 2) SELECT * FROM `products` WHERE (`products`.`id` = 1) ---

Why is Rails making unnecessary duplicate queries? --

Try Order.find(session[:order_id], :include => [:items, :products])

There's no way I know of to eliminate SHOW FIELDS FROM ..., unless AR gets updated.

Plus, there's a caveat here. Issuing:

Order.find(:all, include => [:items, :products], :conditions => "items.disabled IS NULL")

Or referencing the any of the included tables in :conditions will force AR to revert to the old eager-loading scheme.

Or referencing the any of the included tables in :conditions will force AR to revert to the old eager-loading scheme.

No, it depends how you build the :include. I have yet to found how it actually works. Sometimes I get the old way, sometimes I get the multi-query, it really depends on how you pass the hash.

There's no way I know of to eliminate SHOW FIELDS FROM ...

You cannot, and you shouldn't. This is how AR knows which attributes a model has. I already tried in the past to by-pass AR, and I got more problems than anything else, and the speed improvement was not that great. Anyway these queries are fast and they are cached.

Using the following query:

Or referencing the any of the included tables in :conditions will force AR to revert to the old eager-loading scheme.

No, it depends how you build the :include. I have yet to found how it actually works. Sometimes I get the old way, sometimes I get the multi-query, it really depends on how you pass the hash.

Whenever AR thinks you are referencing a column from a table other
that the base table it reverts to the old style :include (that's not
to say it gets it right all the time)

There's no way I know of to eliminate SHOW FIELDS FROM ...

You cannot, and you shouldn't. This is how AR knows which attributes a model has. I already tried in the past to by-pass AR, and I got more problems than anything else, and the speed improvement was not that great. Anyway these queries are fast and they are cached.

Yup, it;s only in dev mode that they are repeated anyway.

Actually, yes, it depends on whether you reference non-main - or therefore included - tables in :conditions or :order.

Taken directly from http://api.rubyonrails.com/classes/ActiveRecord/Associations/ClassMethods.html:

Since only one table is loaded at a time, conditions or orders cannot reference tables other than the main one. If this is the case Active Record falls back to the previously used LEFT OUTER JOIN based strategy. For example

Post.find(:all, :include => [ :author, :comments ], :conditions => ['comments.approved = ?', true])

Using the following query: --- @order = Order.find(:all, :include => [ :products, :items],       :conditions => ['orders.id = ?', session[:order_id]]) ---

I get the correct multi SQL queries. But I cannot understand why
when I try to do: --- <%- for item in @order.items -%> <%= debug item %> <%- end -%> ---

I get the following error message: --- undefined method `items' for #<Array:0x2e9c224> ---

because @order is an array of orders. The items method exists only on
the elements of that array.

Fred