Included associations, limiting and subselects

When to-many associations are eagerly loaded, a "base" object and its associated objects can be represented by an unpredictable number of rows in the result set of the query. As a consequence, retrieving a certain range of objects including associations cannot be achieved by using OFFSET and LIMIT as they refer to rows in the result set, not objects.

In order to enable :offset and :limit in cases like this, the current code splits the task of retrieving objects. In a first query it retrieves a list of ids of objects, properly offset and limited. Then, in a second query, these ids are used in a condition to restrict the returned rows from which objects are instantiated. The code looks like this (slightly reformatted).

  def add_limited_ids_condition!(sql, options, join_dependency)     unless (id_list = select_limited_ids_list(options,         join_dependency)).empty?       sql << "#{condition_word(sql)} #{table_name}.#{primary_key} IN (#{id_list}) "     else       throw :invalid_query       end   end

  def select_limited_ids_list(options, join_dependency)     connection.select_all(       construct_finder_sql_for_association_limiting(options,         join_dependency),       "#{name} Load IDs For Limited Eager Loading"     ).collect { |row| connection.quote(row[primary_key]) }.join(", ")   end

The advantage of this approach is that it works even on less capable DBMS, specifically those that don't support subselects. The disadvantage is that on DBMS that do support subselects it incurs an unnecessary roundtrip to the database and robs the optimizer of a chance to do its job.

A solution is rather simple, or so it seems to me. Apart from naming issues, it would look like this, for databases that support subselects

  def select_limited_ids_list(options, join_dependency)     construct_finder_sql_for_association_limiting(options,       join_dependency)   end

Instead of eagerly executing a query for the ids, the requisite SQL is simply inserted in the main query to be executed as a subselect. I've applied this change and except for a test that explicitly checks select_limited_ids_list all test still pass on PostgreSQL.

Obviously, this change should not be applied unconditionally. Rather, I think it should be conditional on the capabilities of the used DBMS. Information, which in turn could be supplied by the respective adapters.

Michael

Imong others I heard that MySQL now does support subselects but it does not optimize anything inside a subselect (up to the point of it not using indexes and such).

Maybe AbstractAdapter#supports_subselects?

Maybe AbstractAdapter#supports_subselects?

In the 'long run' it'd be nice if all the conditional stuff for query generation was removed from AR::Base and moved into the adapters themselves. With custom databases using polymorphism to handle the quirks and supported / unsupported behaviour.

However in the meantime, I think a patch to make use a real subselect would be good, but probably not a 1.2 thing.

> Maybe AbstractAdapter#supports_subselects?

In the 'long run' it'd be nice if all the conditional stuff for query generation was removed from AR::Base and moved into the adapters themselves. With custom databases using polymorphism to handle the quirks and supported / unsupported behaviour.

Agreed.

However in the meantime, I think a patch to make use a real subselect would be good, but probably not a 1.2 thing.

I've timed (script/performance/benchmarker) find's with and without subselect. To my very big surprise, there was only a small difference and if anything, using two queries (i.e. the current state of things) was slightly faster than using a subselect. I may well have made a measuring error, though.

I've tested on PostgreSQL with 100_000 "base" objects with 10_000 objects "belonged to" and 400_000 has many objects. Generally, I've used a low LIMIT up to 100 and OFFSETs from 0 to 9_000.

The showstopper, however, was when I noticed that at least on PostgreSQL, columns referred to in the ORDER BY clause have to be contained in the SELECT clause. As a result, the subselect returns not just id and the IN check in the surrounding statement becomes invalid. Wrapping the subselect in another select that maps to the id only would be possible, but I didn't bother to try.

I hope someone else will independently check this behavior, to confirm that the current implementation without subselect does not hurt performance.

Michael

Ultimately, trying to become more clever to deal with query conditions can become self defeating. I ended up solving my issues with having to deal with limited included query results by allowing a finder_sql option to be used along with a column mapping to provide for an optimized query that allows me to limit the result sets however I like. Check out my plugin that extends the eager loading code here: http://kellogg-assoc.com/articles/2006/11/05/eager-finder-sql.

Another useful addition would be able to use the results of a MySQL-like GROUP_CONCAT. I've done it on a case-by-case basis in the past, but haven't yet tried to do something that is more integrated.

Gregg