Is this ActiveRecord behaviour considered correct? (:include clash with :conditions)

When you compare the tags for the article on these pages:

http://blog.hasmanythrough.com/tags/tests to has_many :through - Getting arbitrary with assert_difference

you see that the tag "edge" is missing on the first of these pages. Like probably every other Rails blog Mephisto does a variation of this:

def article    has_many :tags, :through => :taggings end

For a usual list of articles tagged "foobar" Mephisto tries to: find all articles that are tagged 'foobar' and eagerly preload all tags for each article. But what it acutally does is: find all articles that are tagged 'foobar' and eagerly preload the tag 'foobar'.

Mephisto does this by something like this:

find(:all, :include => :tags, :conditions => ['tags.name IN (?)', tag_names])

Is this ActiveRecord behaviour considered correct?

I.e. should this test pass or fail? Parked at Loopia (Mephisto assumes it passes.)

If it is considered the right behaviour, is there any reasonable way to get around this and perform the required query ("find all articles that are tagged 'foobar' and eagerly preload all tags for each article") without constructing monstrous JOINs by hand? (e.g. see Parked at Loopia)

Thanks!

It's correct because there's no other reasonable way for it to behave. "Fixing" this would add a TON of complexity to ActiveRecord and would result in a greater number of things breaking in the opposite fashion.

The answer is yes, you have to write your own JOINs, but they aren't exactly monstruous. If you have the id of the tag it's a single join like this:

"INNER JOIN taggings ON taggings.article_id = articles.id AND taggings.tag_id = #{id}"

Then you drop the condition. Of course if you don't have the id you can do a double join instead of the AND clause.

Also, very important, this won't currently work if you add a :limit clause because the "eager loading of associations with limit" query that is run to fetch the IDs will drop the JOINs and then the final query will use a list of IDs that doesn't meet the criteria, therefore resulting in fewer than your requested number of results.

This can be fixed with a simple patch to AR. See my recent writeup on my new search model plugin for some notes on this issue:

http://darwinweb.net/article/Implementing_Advanced_Search_In_Rails_Using_Search_Models

I can provide you with the actual monkey patch file to drop into lib/ if you need it.

Thanks for answering.

It's correct because there's no other reasonable way for it to behave. "Fixing" this would add a TON of complexity to ActiveRecord and would result in a greater number of things breaking in the opposite fashion.

Would it? The more I'm looking into this the more I'm wondering what ActiveRecord does here.

In the logs I'm seeing something like this (simplyfied):

Article Load IDs For Limited Eager Loading (0.004604) SELECT DISTINCT contents.id FROM contents LEFT OUTER JOIN taggings ON (taggings.taggable_id = contents.id AND taggings.taggable_type = 'Content') LEFT OUTER JOIN tags ON tags.id = taggings.tag_id WHERE tags.name IN ('tag')) AND ( (contents.`type` = 'Article' ) ) ORDER BY contents.published_at DESC LIMIT 15

Article Load Including Associations (0.004464) SELECT * FROM contents LEFT OUTER JOIN taggings ON (taggings.taggable_id = contents.id AND taggings.taggable_type = 'Content') LEFT OUTER JOIN tags ON tags.id = taggings.tag_id WHERE tags.name IN ('tag')) AND contents.id IN ('972', '971', '448', '181') ORDER BY contents.published_at DESC

That is, these queries are practically identical besides that the first query fetches the ids that are used in the second query.

The :condition => "tags.name in (...)" part is used in both queries. Why? This doesn't seem to make sense to me. I guess this is some kind of edge case and the same behaviour would make sense in other cases that are treated the same by ActiveRecord?

The answer is yes, you have to write your own JOINs, but they aren't exactly monstruous.

Sorry for the wording :slight_smile:

I just ment the amount of characters, like here: Parked at Loopia

If you have the id of the tag it's a single join like this:

"INNER JOIN taggings ON taggings.article_id = articles.id AND taggings.tag_id = #{id}"

Then you drop the condition.

Yes, similar to what ActiveRecord does internally, too. But ActiveRecord doesn't "drop the condition". Shouldn't it? (If it would my problem with Mephisto's tagging pages was solved.)

Also, very important, this won't currently work if you add a :limit clause because the "eager loading of associations with limit" query that is run to fetch the IDs will drop the JOINs and then the final query will use a list of IDs that doesn't meet the criteria, therefore resulting in fewer than your requested number of results.

As far as I can see from the log (see above) no JOINs are dropped (in this case)? Also, the :limit is only applied to the first query ("Article Load IDs For Limited Eager Loading"), so this theoretically should work (in this case)?

Would it? The more I'm looking into this the more I'm wondering what
ActiveRecord does here.

In the logs I'm seeing something like this (simplyfied):

Article Load IDs For Limited Eager Loading (0.004604) SELECT DISTINCT contents.id FROM contents LEFT OUTER JOIN taggings ON (taggings.taggable_id = contents.id AND
taggings.taggable_type = 'Content') LEFT OUTER JOIN tags ON tags.id = taggings.tag_id WHERE tags.name IN ('tag')) AND ( (contents.`type` = 'Article' ) ) ORDER BY contents.published_at DESC LIMIT 15

Article Load Including Associations (0.004464) SELECT * FROM contents LEFT OUTER JOIN taggings ON (taggings.taggable_id = contents.id AND
taggings.taggable_type = 'Content') LEFT OUTER JOIN tags ON tags.id = taggings.tag_id WHERE tags.name IN ('tag')) AND contents.id IN ('972', '971', '448',
'181') ORDER BY contents.published_at DESC

That is, these queries are practically identical besides that the
first query fetches the ids that are used in the second query.

The :condition => "tags.name in (...)" part is used in both queries.
Why? This doesn't seem to make sense to me. I guess this is some kind
of edge case and the same behaviour would make sense in other cases
that are treated the same by ActiveRecord?

First you have to understand why there are two queries. It only occurs when you use eager loading along with a :limit option. A LIMIT clause only restricts the total rows, which is incompatible with eager loading of associations that have many children since many rows will be returned for each of the base items. Therefore, AR first fetches just the DISTINCT base ids so it gets the right number of items.

So, yes, you are talking about an edge case. If you remove the :limit or the :include, there is only one query. I can see your point that the conditions are redundant on the second query, but on the same token, changing this behaviour of ActiveRecord doesn't solve the problem because it still exists if you don't have a :limit.

The core problem is that "tags.name IN ('tag')" affects the eager loaded table. That's why you have to do an explicit join and alias the table AS something_else.

> "INNER JOIN taggings ON taggings.article_id = articles.id AND > taggings.tag_id = #{id}"

> Then you drop the condition.

Yes, similar to what ActiveRecord does internally, too. But
ActiveRecord doesn't "drop the condition". Shouldn't it? (If it would
my problem with Mephisto's tagging pages was solved.)

What I mean is the INNER JOIN will automatically drop any rows that don't match the ON conditions. You don't need the condition you specified, because this INNER JOIN already implies it. Think of this JOIN as the equivalent of the :conditions you want. It serves the purpose you are looking for exactly because it only allows articles that are tagged with a specific tag, but it does not impost a condition on the eager loaded tables.

> Also, very important, this won't currently work if you add a :limit > clause because the "eager loading of associations with limit" query > that is run to fetch the IDs will drop the JOINs and then the final > query will use a list of IDs that doesn't meet the criteria, therefore > resulting in fewer than your requested number of results.

As far as I can see from the log (see above) no JOINs are dropped (in
this case)? Also, the :limit is only applied to the first query
("Article Load IDs For Limited Eager Loading"), so this theoretically
should work (in this case)?

I'm talking about :joins you specify, not the ones created automatically by eager loading.

Thanks for taking the time to enlighten me about this stuff.

I still do not understand _why_ ActiveRecord is designed this way in
the first place. I see it as a limitation of the design that one's
not able to specify the :condition in a way so that it doesn't effect
the :include'd objects.

Though after I've experimented with ActiveRecord's guts and unit
tests (to better understand _what's_ going on here) I do understand
that it'd probably be quite a major task to implement any changes in
a coherent way.

So I'll carry this again over to the Mephisto mailing list and
propose to change Mephisto.

Thanks again!

At the root of the issue is that ORM is inherently flawed because SQL is a much more powerful language for manipulating data sets. ActiveRecord has a good philosophy which is to make the easy stuff easy, but expose SQL for the hard stuff.

What's going on here is you are specifying the condition directly against the eager loaded table. It's nonsensical for it NOT to apply to the very table which you specified. Your expected behaviour would make it impossible to specify conditions on the loaded associations. If you are trying to restrict base rows then you need to join the tags table specifically for that purpose. It just so happens that the INNER JOIN takes care of it for you in one swoop without the need for an explicit WHERE clause. But you could just as easily do a LEFT JOIN with a table alias and use that in your conditions.

So why doesn't AR implement an easy declarative syntax for this? Well, there are millions of possible applications of joins and conditions. Eager loading is one such application--a critical one that requires a complex implementation so that you can get fully AR models back. However what you are trying to do is just a simple combination of a join and a condition (or even just a single join!). Your particular problem is too specific to merit the unique syntax that it would require--it would be just one more thing to remember when people already understand joins and conditions. It might be possible to come up with a broader syntax that could solve this problem, but it would have to be pretty ingenious to be both broad enough to merit its own syntax and clear enough to be more intuitive than just specifying the JOINs explicitly.