Is this ActiveRecord behaviour considered correct? (:include clash with :conditions)

When you compare the tags for the article on these pages:

http://blog.hasmanythrough.com/tags/tests to
http://blog.hasmanythrough.com/2007/5/2/getting-arbitrary-with-assert_difference

you see that the tag "edge" is missing on the first of these pages. Like probably every other Rails blog Mephisto does a variation of this:

def article
   has_many :tags, :through => :taggings
end

For a usual list of articles tagged "foobar" Mephisto tries to: find all articles that are tagged 'foobar' and eagerly preload all tags for each article. But what it acutally does is: find all articles that are tagged 'foobar' and eagerly preload the tag 'foobar'.

Mephisto does this by something like this:

find(:all, :include => :tags, :conditions => ['tags.name IN (?)', tag_names])

Is this ActiveRecord behaviour considered correct?

I.e. should this test pass or fail? http://pastie.caboo.se/59528 (Mephisto assumes it passes.)

If it is considered the right behaviour, is there any reasonable way to get around this and perform the required query ("find all articles that are tagged 'foobar' and eagerly preload all tags for each article") without constructing monstrous JOINs by hand? (e.g. see http://pastie.caboo.se/59436)

Thanks!

It's correct because there's no other reasonable way for it to
behave. "Fixing" this would add a TON of complexity to ActiveRecord
and would result in a greater number of things breaking in the
opposite fashion.

The answer is yes, you have to write your own JOINs, but they aren't
exactly monstruous. If you have the id of the tag it's a single join
like this:

"INNER JOIN taggings ON taggings.article_id = articles.id AND
taggings.tag_id = #{id}"

Then you drop the condition. Of course if you don't have the id you
can do a double join instead of the AND clause.

Also, very important, this won't currently work if you add a :limit
clause because the "eager loading of associations with limit" query
that is run to fetch the IDs will drop the JOINs and then the final
query will use a list of IDs that doesn't meet the criteria, therefore
resulting in fewer than your requested number of results.

This can be fixed with a simple patch to AR. See my recent writeup on
my new search model plugin for some notes on this issue:

http://darwinweb.net/article/Implementing_Advanced_Search_In_Rails_Using_Search_Models

I can provide you with the actual monkey patch file to drop into lib/
if you need it.

Thanks for answering.

It's correct because there's no other reasonable way for it to
behave. "Fixing" this would add a TON of complexity to ActiveRecord
and would result in a greater number of things breaking in the
opposite fashion.

Would it? The more I'm looking into this the more I'm wondering what ActiveRecord does here.

In the logs I'm seeing something like this (simplyfied):

Article Load IDs For Limited Eager Loading (0.004604)
SELECT DISTINCT contents.id FROM contents
LEFT OUTER JOIN taggings ON (taggings.taggable_id = contents.id AND taggings.taggable_type = 'Content')
LEFT OUTER JOIN tags ON tags.id = taggings.tag_id
WHERE tags.name IN ('tag')) AND ( (contents.`type` = 'Article' ) )
ORDER BY contents.published_at DESC LIMIT 15

Article Load Including Associations (0.004464)
SELECT * FROM contents
LEFT OUTER JOIN taggings ON (taggings.taggable_id = contents.id AND taggings.taggable_type = 'Content')
LEFT OUTER JOIN tags ON tags.id = taggings.tag_id
WHERE tags.name IN ('tag')) AND contents.id IN ('972', '971', '448', '181')
ORDER BY contents.published_at DESC

That is, these queries are practically identical besides that the first query fetches the ids that are used in the second query.

The :condition => "tags.name in (...)" part is used in both queries. Why? This doesn't seem to make sense to me. I guess this is some kind of edge case and the same behaviour would make sense in other cases that are treated the same by ActiveRecord?

The answer is yes, you have to write your own JOINs, but they aren't
exactly monstruous.

Sorry for the wording :slight_smile:

I just ment the amount of characters, like here: http://pastie.caboo.se/59436

If you have the id of the tag it's a single join
like this:

"INNER JOIN taggings ON taggings.article_id = articles.id AND
taggings.tag_id = #{id}"

Then you drop the condition.

Yes, similar to what ActiveRecord does internally, too. But ActiveRecord doesn't "drop the condition". Shouldn't it? (If it would my problem with Mephisto's tagging pages was solved.)

Also, very important, this won't currently work if you add a :limit
clause because the "eager loading of associations with limit" query
that is run to fetch the IDs will drop the JOINs and then the final
query will use a list of IDs that doesn't meet the criteria, therefore
resulting in fewer than your requested number of results.

As far as I can see from the log (see above) no JOINs are dropped (in this case)? Also, the :limit is only applied to the first query ("Article Load IDs For Limited Eager Loading"), so this theoretically should work (in this case)?

Would it? The more I'm looking into this the more I'm wondering what
ActiveRecord does here.

In the logs I'm seeing something like this (simplyfied):

Article Load IDs For Limited Eager Loading (0.004604)
SELECT DISTINCT contents.id FROM contents
LEFT OUTER JOIN taggings ON (taggings.taggable_id = contents.id AND
taggings.taggable_type = 'Content')
LEFT OUTER JOIN tags ON tags.id = taggings.tag_id
WHERE tags.name IN ('tag')) AND ( (contents.`type` = 'Article' ) )
ORDER BY contents.published_at DESC LIMIT 15

Article Load Including Associations (0.004464)
SELECT * FROM contents
LEFT OUTER JOIN taggings ON (taggings.taggable_id = contents.id AND
taggings.taggable_type = 'Content')
LEFT OUTER JOIN tags ON tags.id = taggings.tag_id
WHERE tags.name IN ('tag')) AND contents.id IN ('972', '971', '448',
'181')
ORDER BY contents.published_at DESC

That is, these queries are practically identical besides that the
first query fetches the ids that are used in the second query.

The :condition => "tags.name in (...)" part is used in both queries.
Why? This doesn't seem to make sense to me. I guess this is some kind
of edge case and the same behaviour would make sense in other cases
that are treated the same by ActiveRecord?

First you have to understand why there are two queries. It only
occurs when you use eager loading along with a :limit option. A LIMIT
clause only restricts the total rows, which is incompatible with eager
loading of associations that have many children since many rows will
be returned for each of the base items. Therefore, AR first fetches
just the DISTINCT base ids so it gets the right number of items.

So, yes, you are talking about an edge case. If you remove the :limit
or the :include, there is only one query. I can see your point that
the conditions are redundant on the second query, but on the same
token, changing this behaviour of ActiveRecord doesn't solve the
problem because it still exists if you don't have a :limit.

The core problem is that "tags.name IN ('tag')" affects the eager
loaded table. That's why you have to do an explicit join and alias
the table AS something_else.

> "INNER JOIN taggings ON taggings.article_id = articles.id AND
> taggings.tag_id = #{id}"

> Then you drop the condition.

Yes, similar to what ActiveRecord does internally, too. But
ActiveRecord doesn't "drop the condition". Shouldn't it? (If it would
my problem with Mephisto's tagging pages was solved.)

What I mean is the INNER JOIN will automatically drop any rows that
don't match the ON conditions. You don't need the condition you
specified, because this INNER JOIN already implies it. Think of this
JOIN as the equivalent of the :conditions you want. It serves the
purpose you are looking for exactly because it only allows articles
that are tagged with a specific tag, but it does not impost a
condition on the eager loaded tables.

> Also, very important, this won't currently work if you add a :limit
> clause because the "eager loading of associations with limit" query
> that is run to fetch the IDs will drop the JOINs and then the final
> query will use a list of IDs that doesn't meet the criteria, therefore
> resulting in fewer than your requested number of results.

As far as I can see from the log (see above) no JOINs are dropped (in
this case)? Also, the :limit is only applied to the first query
("Article Load IDs For Limited Eager Loading"), so this theoretically
should work (in this case)?

I'm talking about :joins you specify, not the ones created
automatically by eager loading.

Thanks for taking the time to enlighten me about this stuff.

I still do not understand _why_ ActiveRecord is designed this way in
the first place. I see it as a limitation of the design that one's
not able to specify the :condition in a way so that it doesn't effect
the :include'd objects.

Though after I've experimented with ActiveRecord's guts and unit
tests (to better understand _what's_ going on here) I do understand
that it'd probably be quite a major task to implement any changes in
a coherent way.

So I'll carry this again over to the Mephisto mailing list and
propose to change Mephisto.

Thanks again!

At the root of the issue is that ORM is inherently flawed because SQL
is a much more powerful language for manipulating data sets.
ActiveRecord has a good philosophy which is to make the easy stuff
easy, but expose SQL for the hard stuff.

What's going on here is you are specifying the condition directly
against the eager loaded table. It's nonsensical for it NOT to apply
to the very table which you specified. Your expected behaviour would
make it impossible to specify conditions on the loaded associations.
If you are trying to restrict base rows then you need to join the tags
table specifically for that purpose. It just so happens that the
INNER JOIN takes care of it for you in one swoop without the need for
an explicit WHERE clause. But you could just as easily do a LEFT JOIN
with a table alias and use that in your conditions.

So why doesn't AR implement an easy declarative syntax for this?
Well, there are millions of possible applications of joins and
conditions. Eager loading is one such application--a critical one
that requires a complex implementation so that you can get fully AR
models back. However what you are trying to do is just a simple
combination of a join and a condition (or even just a single join!).
Your particular problem is too specific to merit the unique syntax
that it would require--it would be just one more thing to remember
when people already understand joins and conditions. It might be
possible to come up with a broader syntax that could solve this
problem, but it would have to be pretty ingenious to be both broad
enough to merit its own syntax and clear enough to be more intuitive
than just specifying the JOINs explicitly.