my old message to this list got removed. please tell me if I do something wrong? - ActiveRecord include behaviour

Hi,

I noticed ActiveRecord uses some join queries to perform :include
stuff.
While this certainly fixes the N+1 problem, I'm wondering about the
amount of sparse data this generates.

Example:
Task has many TodoItems
Both Task and TodoItem only have a name to keep things simple.

Now, when I do :
Task.find(:all, :include => :todo_items)

it executes:
SELECT tasks."id" AS t0_r0, tasks."name" AS t0_r1, todo_items."id" AS
t1_r0, todo_items."name" AS t1_r1, todo_items."task_id" AS t1_r2 FROM
tasks LEFT OUTER JOIN todo_items ON todo_items.task_id = tasks.id

which will transfer back:
1|task 1|1|item 1|1
1|task 1|2|item 2|1
1|task 1|3|item 4|1
1|task 1|4|item 3|1
2|task 2|5|item 1|2
2|task 2|6|item 2|2
2|task 2|7|item 3|2
2|task 2|8|item 4|2
3|task 3|9|item 1|3
3|task 3|10|item 2|3
3|task 3|11|item 3|3
3|task 3|12|item 4|3

As you can see, all fields for tasks are repeated for every todo item.
While in this small example this wouldn't matter at all, my guess is
that it _will_ matter more when there are more fields to a task, more
items, or deeper nested joins. If I want to join data from say, 5
tables, all task data will get repeated a lot of times. While using
mysql through a socket interface, I think it's neglectable (since
queries that large will be quite slow anyway), but when connection to
an external sql server, all this sparse data has to get transferred
over the network.

Of course it's still a lot faster than not using the include, since
looping over tasks would then generate:
SELECT * FROM tasks
SELECT * FROM todo_items WHERE (todo_items.task_id = 1)
SELECT * FROM todo_items WHERE (todo_items.task_id = 2)
SELECT * FROM todo_items WHERE (todo_items.task_id = 3)

(n+1) queries which is worse.

But how about this:
tasks=Task.find(:all)
todos = TodoItem.find(:all, :conditions => { :task_id => tasks.map {|
x> x.id} })

what will generate:
SELECT * FROM tasks
SELECT * FROM todo_items WHERE (todo_items."task_id" IN (1,2,3))

which is 1 query per table.
of course after this you'd have to find a way to 'link' the retrieved
todos to the corresponding tasks, but I guess activerecord would be
able to handle this (since it does it for the current, left-outer-join
behaviour as well).

Now, I'm no database expert so I might be overlooking something.
Also I haven't really done any benchmarks.
But like I said, it looks to me as if the current ActiveRecord
behaviour is quite sparse for transfering/parsing the data as soon as
you 3 or 4 tables, with long rows, and many "children".
Am I overlooking something?

Any thoughts on this matter would be nice.

Thanks,
Mathijs

:include can be a performance issue in some cases and not in others. In the case you noted, it’s probably fine.
However, in a case where you do this:

p = Post.find(1, :include=>[:comments]

That could be bad. A 6000 word post with 100 comments would bring back a LOT of data. In this case, two separate queries would be better.

p = Post.find(1)
comments = p.comments

However, have you timed it under load? You really don’t know how bad something will be until you test it. Don’t optimize prematurely, as they say. You have ActiveRecord there to help you get something done fast. Sometimes it’s fast and efficient, and sometimes not. At that point you can look at using some plugins that fix this, or use find_by_sql, or look at some of the advanced :join and :select options for finders in the API.