Included associations, limiting and subselects

When to-many associations are eagerly loaded, a "base" object and its
associated objects can be represented by an unpredictable number of
rows in the result set of the query. As a consequence, retrieving a
certain range of objects including associations cannot be achieved by
using OFFSET and LIMIT as they refer to rows in the result set, not
objects.

In order to enable :offset and :limit in cases like this, the current
code splits the task of retrieving objects. In a first query it
retrieves a list of ids of objects, properly offset and limited. Then,
in a second query, these ids are used in a condition to restrict the
returned rows from which objects are instantiated. The code looks like
this (slightly reformatted).

  def add_limited_ids_condition!(sql, options, join_dependency)
    unless (id_list = select_limited_ids_list(options,
        join_dependency)).empty?
      sql << "#{condition_word(sql)} #{table_name}.#{primary_key} IN
(#{id_list}) "
    else
      throw :invalid_query
      end
  end

  def select_limited_ids_list(options, join_dependency)
    connection.select_all(
      construct_finder_sql_for_association_limiting(options,
        join_dependency),
      "#{name} Load IDs For Limited Eager Loading"
    ).collect { |row| connection.quote(row[primary_key]) }.join(", ")
  end

The advantage of this approach is that it works even on less capable
DBMS, specifically those that don't support subselects. The
disadvantage is that on DBMS that do support subselects it incurs an
unnecessary roundtrip to the database and robs the optimizer of a
chance to do its job.

A solution is rather simple, or so it seems to me. Apart from naming
issues, it would look like this, for databases that support subselects

  def select_limited_ids_list(options, join_dependency)
    construct_finder_sql_for_association_limiting(options,
      join_dependency)
  end

Instead of eagerly executing a query for the ids, the requisite SQL is
simply inserted in the main query to be executed as a subselect. I've
applied this change and except for a test that explicitly checks
select_limited_ids_list all test still pass on PostgreSQL.

Obviously, this change should not be applied unconditionally. Rather, I
think it should be conditional on the capabilities of the used DBMS.
Information, which in turn could be supplied by the respective
adapters.

Michael

Imong others I heard that MySQL now does support subselects but it does not optimize anything inside a subselect (up to the point of it not using
indexes and such).

Maybe AbstractAdapter#supports_subselects?

Maybe AbstractAdapter#supports_subselects?

In the 'long run' it'd be nice if all the conditional stuff for query
generation was removed from AR::Base and moved into the adapters
themselves. With custom databases using polymorphism to handle the
quirks and supported / unsupported behaviour.

However in the meantime, I think a patch to make use a real subselect
would be good, but probably not a 1.2 thing.

> Maybe AbstractAdapter#supports_subselects?

In the 'long run' it'd be nice if all the conditional stuff for query
generation was removed from AR::Base and moved into the adapters
themselves. With custom databases using polymorphism to handle the
quirks and supported / unsupported behaviour.

Agreed.

However in the meantime, I think a patch to make use a real subselect
would be good, but probably not a 1.2 thing.

I've timed (script/performance/benchmarker) find's with and without
subselect. To my very big surprise, there was only a small difference
and if anything, using two queries (i.e. the current state of things)
was slightly faster than using a subselect. I may well have made a
measuring error, though.

I've tested on PostgreSQL with 100_000 "base" objects with 10_000
objects "belonged to" and 400_000 has many objects. Generally, I've
used a low LIMIT up to 100 and OFFSETs from 0 to 9_000.

The showstopper, however, was when I noticed that at least on
PostgreSQL, columns referred to in the ORDER BY clause have to be
contained in the SELECT clause. As a result, the subselect returns not
just id and the IN check in the surrounding statement becomes invalid.
Wrapping the subselect in another select that maps to the id only would
be possible, but I didn't bother to try.

I hope someone else will independently check this behavior, to confirm
that the current implementation without subselect does not hurt
performance.

Michael

Ultimately, trying to become more clever to deal with query conditions
can become self defeating. I ended up solving my issues with having to
deal with limited included query results by allowing a finder_sql
option to be used along with a column mapping to provide for an
optimized query that allows me to limit the result sets however I like.
Check out my plugin that extends the eager loading code here:
http://kellogg-assoc.com/articles/2006/11/05/eager-finder-sql.

Another useful addition would be able to use the results of a
MySQL-like GROUP_CONCAT. I've done it on a case-by-case basis in the
past, but haven't yet tried to do something that is more integrated.

Gregg