Limited Eager Loading Optimization

I have a patch which I believe will be pretty controversial, but I think is a improvement overall. I’d like some feedback from everyone using eager loading. For me this patch is absolutely critical for development on my G4 where MySQL performance is pretty poor.

http://dev.rubyonrails.org/ticket/9497

This patch is providing my development code with a 300x speed increase…

I know it does remove some functionality, but I think the missing functionality was limited and occasionally error-prone anyway. My opinion is that explicit :joins provide much more transparency and utility anyway, so I think this optimization is worth it in the long run (especially before Rails 2.0 hits).

Please post your counter examples or what this breaks in your test suite (rather than the contrived tests in the official test suite).

The tests aren't particularly contrived in this case, I've written
similar stuff in the past. However the large performance boost you're
seeing is probably worth investigating whether we can apply those
joins a little more selectively, or let users opt out of them.

My understanding is that the conditions / joins *have* to be applied
in order for limit / offset to work with an :order or :conditions when
using eager loading. If you apply the limit / offset to the final
query, it's not limiting correctly, and if you leave the order or
conditions off that query your paginated result set will behave
erratically, with varying numbers per page, or repeated entities.

For an example:

@tickets_from_priority_customers = Ticket.find(:all,
:conditions=>["tickets.open = ? and customers.priority = ?", true,
true], :include=>[:customers])

Now if you start paginating that list, your stuff will cause bugs?

The tests aren’t particularly contrived in this case, I’ve written
similar stuff in the past. However the large performance boost you’re

seeing is probably worth investigating whether we can apply those
joins a little more selectively, or let users opt out of them.

Perhaps a poor choice of words to say contrived… there is definitely use for :conditions on the eager-loaded table. My assertion is more along the lines that the utility of those :conditions is fairly limited in scope and easily replaced by explicit :joins which can be tailored to a wider variety of effects.

The solution I’m proposing is the simplest in that it doesn’t add any additional complexity. If we want something more robust we can add code to explicitly check the table names in :conditions/:order (this would be ideal because it could add only the necessary eager joins). The only downside is there would be some complexity added to the construct_finder_sql_for_association_limiting method.

Offering an opt-out option is another possibility (in fact you can opt out implicitly by NOT referring to another table in the :conditions/:order, but it could be more fine-grained), but I don’t see how to make the interface make sense. It’s a pretty specific case, and a specific option to control limited eager loading would seem out of place with the general options available to :find now. That’s the reason I proposed what I did, because I think it would make the most sense to those who are unaware of the internals.

My understanding is that the conditions / joins have to be applied

in order for limit / offset to work with an :order or :conditions when
using eager loading. If you apply the limit / offset to the final
query, it’s not limiting correctly, and if you leave the order or

conditions off that query your paginated result set will behave
erratically, with varying numbers per page, or repeated entities.

Currently the functions that check this (include_eager_conditions? and include_eager_order?) aren’t smart enough to actually figure out when the joins HAVE to be applied. They simply check that another table is referred to… they don’t check whether that table is an eager-loaded table or one from the explicit :joins or whether it’s even included in the query in which case it causes a database error.

In fact I posted another patch just prior (8496) that fixes an issue with INNER JOINs that results in breakage. I didn’t comment on that one here since it should be a much more obvious patch to apply.

For an example:

@tickets_from_priority_customers = Ticket.find(:all,
:conditions=>["
tickets.open = ? and customers.priority = ?", true,
true], :include=>[:customers])

Now if you start paginating that list, your stuff will cause bugs?

Correct. The basic gist of my argument is that this type of query is limited because it only lets you operate against the generic LEFT OUTER JOIN. It’s slightly opaque because you have to know that :include is doing a LEFT OUTER JOIN customers ON customers.ticket_id = tickets.id. So the condition you give selects tickets that have at least 1 customer with priority=1. In this case that makes sense, but I don’t think it’s so intuitive the other way around. For instance:

@tickets_from_priority_customers = Ticket.find(:all,
:conditions=>[“tickets.open = ? and customers.priority != ?”, true,
true], :include=>[:customers])

In that case the developer might think they are querying for all tickets that aren’t associated with a priority customer, when in fact they are querying for tickets that have at least one non-priority customer. Granted, they can get what they want by using the complement of the original query (in this case false), but my point here is that the user requires a knowledge of what :include is doing behind the scenes, and the resulting LEFT JOIN is not particularly flexible. In addition, the :conditions end up restricting which associated rows are fetched, with further limits the general utility. If I want a list of all tickets with at least one priority customer associated, it doesn’t mean I only want to see the priority customers in the association. Typically I would want the whole association, and therefore I would be better off doing:

@tickets_from_priority_customers = Ticket.find(:all,
:conditions=>[“tickets.open = ?”,
true], :include=>[:customers], :joins => “INNER JOIN customers AS c2 ON tickets.id = c2.ticket_id AND c2.priority=1”)

or even

@tickets_from_priority_custome
rs = Ticket.find(:all,
:conditions=>[“tickets.open = ? AND c2.id IS NOT NULL”,
true], :include=>[:customers], :joins => “LEFT JOIN customers AS c2 ON
tickets.id
= c2.ticket_id AND c2.priority=1”)

I’m not totally wedded to this idea, but all else being equal I would prefer not to have the eager-loading JOINs added in the pre-query. Building a smarter method that checks for specific table names and does a finer-grained analysis would certainly make the whole thing more robust… that wasn’t my first choice because it’s more work and it adds complexity that I wasn’t sure you would want. If the core team thinks that’s a better way to go, I’m happy to work on a patch though…

Okay, I put together a quick patch that collects the referenced table names and only joins ones that are needed on the pre-query. Both approaches have their pros and cons. Check 'em both out and let me know what you think.

http://dev.rubyonrails.org/ticket/9497

Sven Fuchs pointed me to this thread:


http://www.ruby-forum.com
/topic/108163

Which illustrates a bug in Mephisto due to confusion over the semantics.

In particular, Sven says:

(Of course I’m not after particulary blaming somebody personally,
but …) As far as I can see, this is an example where even a member
of the Rails core team mistakenly assumed that in:

Article.find(:all, :include => :tags, :conditions => [’
tags.name IN
(?)’, tag_names])

… this would return articles that have tags with certain names
associated (i.e. condition applied), but all of them pre-populated
with all tags (i.e. condition not applied).

In fact this bug is still present in Mephisto (or at least the version Josh Susser is running).

I think the thread illustrates a bit more impartially how this functionality can be confusing and limiting. Sven doesn’t get why AR works this way, while I argue that it’s the only reasonable for it to work. The result that it provides is not intuitive, so the user has to fully understand the behind-the-scenes JOINs anyway. Why not force them to write out the JOINs for their conditions manually? It still requires the same level of understanding, but at least it’s more transparent.

> ... this would return articles that have tags with
certain names
> associated (i.e. condition applied), but all of them
pre-populated
> with *all* tags (i.e. condition not applied).

Which makes sense to me. Because the find was called on Article model.

I think the thread illustrates a bit more impartially how this functionality
can be confusing and limiting. Sven doesn't get why AR works this way, while
I argue that it's the only reasonable for it to work. The result that it
provides is not intuitive, so the user has to fully understand the
behind-the-scenes JOINs anyway. Why not force them to write out the JOINs
for their conditions manually? It still requires the same level of
understanding, but at least it's more transparent.

I think I see where you're coming from for the :conditions on
:include. In fact I've worked around this exact issue with something
criminal like:

has_many :permissions
has_many :hax_permissions, :class_name=>"Permission"

Whatever.find(:all, :include=>:hax_permissions,
:conditions=>"permissions.group_id IN (?" ...

Otherwise :permissions isn't a full collection.

If you have time / ability to jump into #rails-contrib, I think this
will go a bunch quicker :slight_smile:

Right, and the implementation of this would require creating a secondary join on which to enforce the condition anyway… so why not just cut out the middle man and require the developer to manually add joins when they need any variation of this behavior (many–dare I say most–of which aren’t possible without manual joins anyway).

But I’m curious if you’ve looked at the two possible patches I’ve proposed and which (if either) you would support?

But I'm curious if you've looked at the two possible patches I've proposed
and which (if either) you would support?

Just went through your patches and relevant code in AR. And I'd
personally go with your 1st patch. I think the current behavior could
be very misleading to people who are not aware of AR internals and the
results might not be what they expected.

Also, I guess allowing conditions on eager loaded association can
easily make people write bad code. It's very likely that those
conditions deserves a separate association, at least in many cases.
For other cases, explicit join should be justified.

E.g. in Koz' example :

@tickets_from_priority_customers = Ticket.find(:all,
:conditions=>["tickets.open = ? and customers.priority = ?", true,
true], :include=>[:customers])

I guess it'd make more sense if Ticket model looks has :

has_many :customers
has_many :priority_customers, :class_name => "Customer", :conditions
=> "customers.priority = 1"

And then the query should like like :

@tickets_from_priority_customers = Ticket.find(:all,
:conditions=>["tickets.open = ?", true],
:include=>[:priority_customers])

That'd be cleaner and more efficient.

Having said that, in any case, the patch 2 is a must if most of the
people don't agree with patch 1.

So my votes are +2/+1 for your 2 patches, in order you uploaded them.

I have written up a lengthy explanation of my patch. I would
appreciate any and all reviewers to read it and give their opinion.
This will take a big mobilization to push through, but I think it is
well justified. Please read my opinion at:

http://darwinweb.net/article/Optimizing_And_Simplifying_Limited_Eager_Loading_In_Activerecord

You know, your article leaves me really confused.

You say (at the very end): “Now, just to be perfectly clear, both of these patches affect only the pre-query.”

Just before that you say: “I’ve also written a non-destructive patch which solves my immediate performance concerns by collecting the names of tables that are referenced in :conditions and then only eagerly joins the necessary tables (in the pre-query).”

Forgive me if I’m getting the wrong end of the stick here but doesn’t AR already do that?

Album.find(:all, :limit => 10, :include => :songs)

no joins in prequery, no conditions in prequery

Album.find(:all, :limit => 10, :include => :songs, :conditions => ‘albums.active = 1’)

no joins in prequery, conditions in prequery

Album.find(:all, :limit => 10, :include => :songs, :conditions => ‘songs.active = 1’)

join in prequery, conditions in prequery

What am I missing? Or to put it another way, why is your “non-destructive-patch” required in light of current behavior?

Regards,

Trevor

Album.find(:all, :limit => 10, :include => :songs, :conditions =>
'songs.active = 1')
# join in prequery, conditions in prequery

Check the prequery for :

Album.find(:all, :limit => 10, :include => [:songs, :artists],
:conditions => 'songs.active = 1')

Thanks for the reply, I believe I understand now.

So the issue is that once it decides that *any* of the :include
tables are referenced in :conditions it joins *all* of them - even
though it's possible that some of the includes are not referenced.

Personally I've not been bitten by performance issues with the
prequery but if your non-destructive-patch improves that situation
I'd say it was a good idea.

Is your patch intelligent enough to cope with this?:

Album.find(:all, :limit => 10, :include => [{:songs
=> :uploads}, :artists],
   :conditions => 'uploads.errored = 0')

where the prequery joins songs and uploads but not artists?

I bring it up because that strikes me as a reason why it just does
all of the joins - bit of a pain to tease apart exactly which joins
are absolutely required (not impossible - just a pain).

Regards,
Trevor

Thanks for the reply, I believe I understand now.

So the issue is that once it decides that any of the :include
tables are referenced in :conditions it joins all of them - even
though it’s possible that some of the includes are not referenced.

Personally I’ve not been bitten by performance issues with the
prequery but if your non-destructive-patch improves that situation
I’d say it was a good idea.

But what do you think of the destructive patch? In your example statement below I don’t like the fact that the conditions string is used both to A) find albums with at least one errored song upload and B) fetch only the errored uploads.

If you just wanted A then you have to do an explicit join. If you wanted B you would be better off defining a custom association with that condition directly attached. With the existing code A&B are conflated, which I think greatly diminishes the utility of adding eager joins to the pre-query.

Is your patch intelligent enough to cope with this?:

Album.find(:all, :limit => 10, :include => [{:songs

=> :uploads}, :artists],
:conditions => ‘uploads.errored = 0’)

Yes, it is just comparing the join string when the query is compiles, doesn’t matter what kind of nested includes you use.

where the prequery joins songs and uploads but not artists?

I bring it up because that strikes me as a reason why it just does

all of the joins - bit of a pain to tease apart exactly which joins
are absolutely required (not impossible - just a pain).

Check the patch, it’s actually quite simple. The include_eager_order? and include_eager_conditions? methods already parse their respective strings for table names, so I just refactored those methods to capture the table names in arrays.

Then I add one reject method in the method chain that adds the joins. Essentially there is only half a line of new code required to do this. But again, this patch is my second choice. I’m trying to drum up support for the first patch.

Then I add one reject method in the method chain that adds the joins.
Essentially there is only half a line of new code required to do this. But
again, this patch is my second choice. I'm trying to drum up support for
the first patch.

Gabe, why don't you submit a separate ticket with your second patch (
as the current ticket has a lot of negative noise and could be
misleading ) and try get it in first. And then continue your main
battle for the first patch (the original one).

Just a suggestion.

Gabe,

This is a lengthy response, sorry to those not interested :slight_smile:

I guess the issue I have with your destructive patch is that it hobbles existing behavior which (for me) returns a predictable dataset.

I’ve always known that :conditions which relate to :included associations can cause only part of the association’s data to be loaded (I think that’s been referred to as the mephisto bug). When I’ve used it in the past it’s been because that’s exactly what I want.

Setting aside the performance issues (which you address in your non-destructive patch), the problem as I see it is that there is no way to tell find() that you want certain conditions applied in the prequery and a different set of conditions applied in the data-loading query (which I’ll call ‘realquery’ from now on).

For the times when I have wanted all of the association’s data yet wanted to limit the top-level selection based on association conditions I’ve done my own prequery (with a limit) to grab the id values I want to present. I wrote a plugin that makes these prequeries a little easier to generate - http://svn.protocool.com/public/plugins/values_of

give me a list of the first 10 live album ids that have errored songs

ids = Album.values_of(“distinct albums.id”,

:joins => ‘inner join songs on songs.album_id = albums.id’,

:conditions => ‘albums.deleted_at is null and songs.errored = 1’, :limit => 10) {|string| string.to_i}

present those albums and all of their songs/artists

Album.find(ids, :include => [:songs, :artists])

That allows me to have 2 sets of :conditions - one for the prequery and one for the realquery - and guarantees that my constraints are always true for my returned data.

So… what I’m thinking is this: what about adding a new option to find() that allows you to specify ‘prequery-only’ conditions? For example, something like a :limit_conditions option that utilized your non-destructive patch?

  • The :limit_conditions option would only be valid in conjunction with :limit.

  • Both :limit_conditions and :conditions options would be applicable to the prequery.

  • Your non-destructive patch would determine the minimum joins required in the prequery based on :limit_conditions and :conditions

  • The :conditions would only be applied to the realquery

Those rules for which conditions apply and when are to make sure that the constraints you specify in your find() is guaranteed ‘true’ for all of the data you get back:

live albums with errored songs, and only those errored songs

as in, current behaviour with your non-destructive patch

Album.find(:all, :limit => 10, :include => [:songs, :artists],

:conditions => ‘albums.deleted_at is null and songs.errored = 1’)

prequery joins songs and applies condition ‘albums.deleted_at is null and songs.errored = 1’

realquery joins songs, artistst and applies condition ‘albums.deleted_at is null and songs.errored = 1’

live albums with errored songs, but I want all songs

Album.find(:all, :limit => 10, :include => [:songs, :artists],

:limit_conditions => ‘albums.deleted_at is null and songs.errored = 1’)

prequery joins songs and applies condition ‘albums.deleted_at is null and songs.errored = 1’

realquery joins songs, artists but applies no conditions

Now, this last example is probably where users might get confused about the interaction between :limit_conditions and :conditions - but the interaction is required in order to guarantee that you get back exactly what you specified in your single call to find().

live albums with errored songs and artists named dave, all songs but the artists relation only has artists called ‘Dave’

Album.find(:all, :limit => 10, :include => [:songs, :artists],

:limit_conditions => ‘albums.deleted_at is null and songs.errored = 1’,

:conditions => [‘artists.name = ?’, “Dave”)

prequery joins songs,artists applies condition ‘albums.deleted_at is null and songs.errored = 1 and artists.name = ‘Dave’’

realquery joins songs,artists and applies condition ‘artists.name = ‘Dave’’

It’s not ideal, and probably a little difficult for newbies to understand, but at least it’s consistent.

Ultimately I have to admit that when performance is a concern or the relations are hairy I will probably continue to reach for values_of() and another (soon to be released) plugin that hydrates associations with subsequent queries rather than with big-honking-joins.

Sorry for rambling, I hope I’ve not strayed too far from the original reasons you started talking about your problems…

Trev

I have done so, to hopefully generate some more discussion on the original patch. Although I have your support, and Koz and Josh expressed support with some reservation on irc, I haven’t gotten any feedback from anyone else about the original patch (Assaf linked to my blog post at least).

I’m starting to fear that the subtlety of the issue will prevent the necessary quorum of core members giving the issue careful consideration, which would be a shame, but if nothing happens in the next week or two I’ll probably just try to get the second patch through to solve my immediate need.

I guess the issue I have with your destructive patch is that it hobbles existing behavior which (for me) returns a predictable dataset.

Indeed it should be predictable to anyone who has a working knowledge of SQL databases. The reason I don’t see it as hobbling is because it seems that the cases where you want that behavior are far outnumbered by the cases where you don’t.

I’ve always known that :conditions which relate to :included associations can cause only part of the association’s data to be loaded (I think that’s been referred to as the mephisto bug). When I’ve used it in the past it’s been because that’s exactly what I want.

And if you do understand what’s going on, how hard is it to write the JOIN you want explicitly?

Setting aside the performance issues (which you address in your non-destructive patch), the problem as I see it is that there is no way to tell find() that you want certain conditions applied in the prequery and a different set of conditions applied in the data-loading query (which I’ll call ‘realquery’ from now on).

For the times when I have wanted all of the association’s data yet wanted to limit the top-level selection based on association conditions I’ve done my own prequery (with a limit) to grab the id values I want to present. I wrote a plugin that makes these prequeries a little easier to generate - http://svn.protocool.com/public/plugins/values_of

That’s a great idea… I’ll probably make use of that plugin myself.

That allows me to have 2 sets of :conditions - one for the prequery and one for the realquery - and guarantees that my constraints are always true for my returned data.

So… what I’m thinking is this: what about adding a new option to find() that allows you to specify ‘prequery-only’ conditions? For example, something like a :limit_conditions option that utilized your non-destructive patch?

  • The :limit_conditions option would only be valid in conjunction with :limit.
  • Both :limit_conditions and :conditions options would be applicable to the prequery.
  • Your non-destructive patch would determine the minimum joins required in the prequery based on :limit_conditions and :conditions
  • The :conditions would only be applied to the realquery

I like this idea. Would you be willing to write a patch to do this? If so I’d give it a +1, although my gut instinct is that Koz would -1 since he’s expressed reservations about limited eager loading in general.

In any case, my agenda right now is to try to push through the first patch. I don’t have any need for this specific behavior, though if the first patch doesn’t go through, I might find some time down the line to implement something like this.

Ultimately I have to admit that when performance is a concern or the relations are hairy I will probably continue to reach for values_of() and another (soon to be released) plugin that hydrates associations with subsequent queries rather than with big-honking-joins.

What is this other plugin? I am mulling over the idea of how to make eager loading more flexible. I have a couple different approaches in mind. One would be to extend :include to somehow indicate how the associations should be split into different queries. The other idea I had is some way of hydrating from more arbitrary queries like explicit :select/:joins or even find_by_sql. Are you working on this? If so I’d be interested in seeing your approach and doing some collaboration… hit me up off-list.