DATABASE Query performance

Hi Everybody

Suppose we have two models named Book and Author. Can anyone suggest
how to improve the database query performance of the following code?

books = Book.find(:all)
books.each do |book|
puts book.author
end

Thanks in advance

Use eager-loading:

books = Book.find(:all, :include => :author)
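To see why this helps, here is a plain-Ruby sketch (no database; the
books and query counts are stand-ins) contrasting the number of queries
each approach would issue:

```ruby
# Stand-in data: 3 books, each referencing an author by id.
books = [{:title => "A", :author_id => 1},
         {:title => "B", :author_id => 2},
         {:title => "C", :author_id => 1}]

queries = 0

# Without :include (the original code): 1 query for the books,
# then 1 query per book to fetch its author -- the "N+1" pattern.
queries += 1                      # Book.find(:all)
books.each { queries += 1 }       # book.author fires a query each time
n_plus_one = queries              # 1 + 3 = 4 queries

# With :include => :author: 1 query for the books, then 1 query
# fetching all referenced authors with IN (...).
queries = 1                       # Book.find(:all, :include => :author)
queries += 1                      # SELECT * FROM authors WHERE id IN (1,2)
eager = queries                   # 2 queries total

puts n_plus_one  # => 4
puts eager       # => 2
```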

You can use :select and :joins for the best performance; this is what I
currently use.

:include is much less efficient, but it is the Rails way of doing
things. It used to create a JOIN, but in a very ugly way.
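For reference, a hand-rolled :select/:joins finder in the old syntax
looks roughly like the commented call below (the column names are
assumptions for the example, not from the thread); building the
equivalent SQL by hand shows what it sends to the database:

```ruby
# Roughly what a custom :select/:joins finder sends to the database:
#
#   Book.find(:all,
#     :select => "books.title, authors.name AS author_name",
#     :joins  => "INNER JOIN authors ON authors.id = books.author_id")
#
select_sql = "books.title, authors.name AS author_name"
join_sql   = "INNER JOIN authors ON authors.id = books.author_id"
query = "SELECT #{select_sql} FROM books #{join_sql}"
puts query
```

Only the listed columns come back, and no Author objects are built,
which is where the speed-up over :include comes from.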

Since Rails 2.x the :include code has changed: it now makes one query
per table. It is still less efficient than a custom :joins, but the
objects it builds are easier to work with.

Not for every scenario. For has_many associations, the Rails
2.x :include code is possibly better than a custom join when dealing
with large datasets, since a custom join would give you a cartesian
product while :include will only fetch the necessary records, at the
expense, of course, of additional SQL statements. This especially holds
true if your web host and DB reside on different servers. If the data
is large enough, it may be worth adding indices to your reference fields
(or foreign keys):

http://erolfornoles.blogspot.com/2008/08/optimize-eager-loading-in-rails-21.html
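A sketch of what indexing those foreign keys could look like as an
old-style migration (the table and column names here are assumptions
for the example, not from the thread):

```ruby
# Hypothetical migration adding an index on the foreign key that the
# eager-loading IN (...) query filters on.
class AddAuthorIdIndexToBooks < ActiveRecord::Migration
  def self.up
    add_index :books, :author_id
  end

  def self.down
    remove_index :books, :author_id
  end
end
```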

For 1-to-1 associations like belongs_to, Simpler Piggy-Backing makes
for a good alternative to custom joins:

http://railsexpress.de/blog/articles/2006/05/29/simpler-piggy-backing

> Not for every scenario. For has_many associations, the Rails
> 2.x :include code is possibly better than a custom join

Have you tested? Because back in Rails 1.2.6, when :include created
gigantic JOINs, my custom :joins was at least 3x faster, mostly because
I only pick the fields I really need and because AR doesn't create full
objects with their references and all that stuff.

I tried finding some info on MySQL forums and websites about one query
with multiple JOINs versus multiple queries, but it is not clear which
is more efficient.

Things get really complicated when I read: "How adding another table to
JOIN can improve performance?"

http://www.mysqlperformanceblog.com/2008/08/01/how-adding-another-table-to-join-can-improve-performance/

I wrote the :include code in 2.1 because of performance problems I had
with :include in 1.2.x. :include-ing multiple has_manys from the same
object very quickly leads to the database generating a very large number
of rows (far greater than the number of actual objects you want), which
doesn't help the database or ActiveRecord, which has to wade through
them and discard the portions of rows it has already seen.

Fred
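Fred's point about row blow-up can be seen with plain Ruby (hypothetical
data, no database): joining two has_many collections of the same parent
returns every (truck, container) pair as a separate row:

```ruby
# One depot with 3 trucks and 2 containers: a single join across both
# associations hands back the cross product of the two sets.
trucks     = ["truck-1", "truck-2", "truck-3"]
containers = ["container-1", "container-2"]

joined_rows = trucks.product(containers)
puts joined_rows.length   # => 6 rows for only 5 child objects

# Two separate queries (the 2.1 :include strategy) return just:
puts trucks.length + containers.length   # => 5 rows
```

With more associations or more children per depot, the product grows
multiplicatively while the per-table queries grow additively.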

> Have you tested? Because back in Rails 1.2.6, when :include created
> gigantic JOINs, my custom :joins was at least 3x faster, mostly because
> I only pick the fields I really need and because AR doesn't create full
> objects with their references and all that stuff.

Yeah, in my case the Rails 2.1 :include was significantly faster than
a :join for multiple has_many relationships, like:

class Depot
  has_many :trucks
  has_many :containers
  has_many :cranes
end

Of course, we may be talking about apples and bananas here, since my
example involves has_many associations while the scenario I could
think of given your description is a belongs_to or has_one.

I agree that belongs_to or has_one associations are better off piggy-
backed via :joins and :select, but using them eventually leads to a
lot of repetition in our code. That's why I prefer using the Simpler
Piggy-Backing plugin:

http://railsexpress.de/blog/articles/2006/05/29/simpler-piggy-backing

The author even claims: "In fact, using the extension is slightly
faster than coding the select and joins manually (1%-3%)."

> I tried finding some info on MySQL forums and websites about one query
> with multiple JOINs versus multiple queries, but it is not clear which
> is more efficient.

The Depot example I mentioned earlier is better composed using at least
2 SQL statements, as opposed to a single one. Not only would the
single-SQL approach (using joins) produce a cartesian product, it is
logically incorrect since trucks, cranes and containers are
independent of each other.

Then again, it all boils down to AR being ridiculously slow in
composing AR objects from a resultset.

> Things get really complicated when I read: "How adding another table
> to JOIN can improve performance?"
>
> http://www.mysqlperformanceblog.com/2008/08/01/how-adding-another-table-to-join-can-improve-performance/

There are a lot of quirks to doing optimizations in MySQL. The "one
index per query" rule is one, which is why I usually shy away from
MySQL when doing data warehousing. I tried using it once to do CDR
(call detail record) reporting, but the performance was such a
nightmare that I had to migrate it to PostgreSQL.

> Yeah, in my case the Rails 2.1 :include was significantly faster than
> a :join for multiple has_many relationships, like:

Interesting.

What is the size of the DB on which you performed the tests? Mine were
done on tables that have 100,000 and 1,000,000,000 rows. The :joins gave
nice results, around 0.1s on my good old PowerPC dev machine. The old
:include method (Rails 1.2.x) took almost 1s to perform the same query.

> I wrote the :include code in 2.1 because of performance problems I had
> with :include in 1.2.x.

Hi Fred, yeah the :include in 1.2.x was pretty lame. Do you have some
test results of the new :include versus custom :joins?

Those are almost by definition hard to come by: custom stuff is likely
to be specific to one app, and plain custom joins don't accomplish the
primary goal of :include (i.e. setting up your associations).

Fred

> What is the size of the DB on which you performed the tests? Mine were
> done on tables that have 100,000 and 1,000,000,000 rows. The :joins gave
> nice results, around 0.1s on my good old PowerPC dev machine. The old
> :include method (Rails 1.2.x) took almost 1s to perform the same query.

Are the seconds you mentioned DB fetch times? Or do they include AR
times? And what kind of associations are involved?

The Depot model I mentioned is not that large, yet. A couple of dozen
depots, each with a hundred to a few thousand trucks and containers
and a couple of dozen cranes. If you were to use a join on them, the
yield would be a cartesian product of:

20 depots x 1000 trucks x 1000 containers x 10 cranes = 200,000,000
rows on average, which I'm pretty sure is going to take AR longer
to compose.

I agree. has_one and belongs_to associations can be piggy-backed
via :join and :select, but has_many associations cannot.

These were "completed" times.

And what type of association are they in?

> If you were going to use a
> join on them, the yield will be a cartesian product of:

> 20 depots x 1000 trucks x 1000 containers x 10 cranes = 200,000,000
> rows on the average, which I'm pretty sure is going to take AR longer
> to compose.

Your figures are misleading. You never display all the records at once,
you always pass a :conditions option to the query, don't you? Even your
:include does, so what are you talking about?

Nope, those figures aren't misleading. Those are actual records
fetched for a report, if I use traditional joins. Of course it won't
be displayed on-screen, but AR has to compose the corresponding AR
objects given that many records. The conditions aren't present since I
have a rake task scheduled via cron that handles archiving of data,
which means trucks, containers, and cranes for the past month aren't
included in the resultset. It's more like an incoming/outgoing report
that is generated weekly and monthly, used for billing and accounting
purposes.

What the new :include does is break those into manageable SQL statements:

20 depots x 1000 trucks = 20,000, assuming again, an average of 1000
trucks per depot
20 depots x 1000 containers = 20,000
20 depots x 10 cranes = 200

All in all, AR will only need to compose from 40,200 results on the
average.
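The arithmetic above, checked in plain Ruby (the per-depot counts are
the thread's assumed averages):

```ruby
depots, trucks, containers, cranes = 20, 1000, 1000, 10

# Single join across all three has_manys: full cartesian product.
join_rows = depots * trucks * containers * cranes
puts join_rows     # => 200000000

# 2.1-style :include: one query per association, rows simply add up.
include_rows = depots * trucks + depots * containers + depots * cranes
puts include_rows  # => 40200
```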

This is a sample of the report which needs to be generated:

- Depot -
  - List of Trucks -
  - List of Containers -
  - List of Cranes -
- Depot -
  - List of Trucks -
  - List of Containers -
  - List of Cranes -
...

If you can give me a :select/:joins construct (without resorting to
multiple SQL statements) which could efficiently generate that kind of
report, I'd be more than happy to use it.

Whoops, I made a typo: instead of [,] one should use { => }.

I will definitely update my code from using :joins to the new :include.

Just like what Fred said, it's a case-to-case basis. But, uhmmm, it's
your call I guess.

However I am having some problems. Using the following find:
---
@order = Order.find(:all, :include => [:items, :products],
:conditions => ['orders.id = ?', session[:order_id]])
---

I get the following SQL:
---
SELECT * FROM `orders` WHERE (orders.id = 15)
SELECT `items`.* FROM `items` WHERE (`items`.order_id IN (15))
CACHE SELECT `items`.* FROM `items` WHERE (`items`.order_id IN (15))
SHOW FIELDS FROM `products`
SELECT * FROM `products` WHERE (`products`.`id` IN (1,2))
SELECT * FROM `products` WHERE (`products`.`id` = 2)
SELECT * FROM `products` WHERE (`products`.`id` = 1)
---

Why is Rails making unnecessary duplicate queries?
--

Try Order.find(session[:order_id], :include => [:items, :products])

There's no way I know of to eliminate SHOW FIELDS FROM ..., unless AR
gets updated.

Plus, there's a caveat here. Issuing:

Order.find(:all, :include => [:items, :products], :conditions =>
"items.disabled IS NULL")

or referencing any of the included tables in :conditions will force AR
to revert to the old eager-loading scheme.

> Or referencing any of the included tables in :conditions will
> force AR to revert to the old eager-loading scheme.

No, it depends on how you build the :include. I have yet to find out
how it actually works. Sometimes I get the old way, sometimes I get
the multi-query; it really depends on how you pass the hash.

> There's no way I know of to eliminate SHOW FIELDS FROM ...

You cannot, and you shouldn't. This is how AR knows which attributes a
model has. I already tried in the past to bypass AR, and I got more
problems than anything else, and the speed improvement was not that
great. Anyway, these queries are fast and they are cached.


>> Or referencing any of the included tables in :conditions will
>> force AR to revert to the old eager-loading scheme.
>
> No, it depends on how you build the :include. I have yet to find out
> how it actually works. Sometimes I get the old way, sometimes I get
> the multi-query; it really depends on how you pass the hash.

Whenever AR thinks you are referencing a column from a table other
than the base table, it reverts to the old-style :include (that's not
to say it gets it right all the time).

>> There's no way I know of to eliminate SHOW FIELDS FROM ...
>
> You cannot, and you shouldn't. This is how AR knows which attributes a
> model has. I already tried in the past to bypass AR, and I got more
> problems than anything else, and the speed improvement was not that
> great. Anyway, these queries are fast and they are cached.

Yup, it's only in dev mode that they are repeated anyway.

Actually, yes, it depends on whether you reference non-main (i.e.
included) tables in :conditions or :order.

Taken directly from http://api.rubyonrails.com/classes/ActiveRecord/Associations/ClassMethods.html:

Since only one table is loaded at a time, conditions or orders cannot
reference tables other than the main one. If this is the case Active
Record falls back to the previously used LEFT OUTER JOIN based
strategy. For example

Post.find(:all, :include => [ :author, :comments ], :conditions =>
['comments.approved = ?', true])

Using the following query:
---
@order = Order.find(:all, :include => [ :products, :items],
      :conditions => ['orders.id = ?', session[:order_id]])
---

I get the correct multiple SQL queries. But I cannot understand why,
when I try to do:
---
<%- for item in @order.items -%>
<%= debug item %>
<%- end -%>
---

I get the following error message:
---
undefined method `items' for #<Array:0x2e9c224>
---

Because @order is an array of orders. The items method exists only on
the elements of that array.
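A plain-Ruby stand-in for what is happening (no Rails needed):
find(:all, ...) always hands back an Array, and Array has no items
method:

```ruby
# Stand-in: what find(:all, ...) returns even when one row matches.
order  = Struct.new(:items).new(["item-1", "item-2"])
result = [order]                    # always wrapped in an Array

puts result.respond_to?(:items)     # => false -- the reported error
puts result.first.items.length      # => 2 -- call it on an element
```

Hence the suggestion earlier in the thread: Order.find(session[:order_id],
:include => ...) returns a single Order, which does respond to items.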

Fred