[Proposal] Use any() for Postgres array inclusion instead of IN

Currently, checking for array inclusion in Rails generates an IN list:

User.where(id: [1,2,3]).to_a
SELECT "users".* FROM "users" WHERE "users"."id" IN (1, 2, 3)

This is a problem because pg_stat_statements does not normalize the variable number of values that can be passed through IN lists: if the array you’re matching reaches a size of 100, as many as 100 duplicate queries can be tracked by pg_stat_statements. The result is twofold:

  • developers aren’t able to monitor pg_stat_statements to get an accurate view of their query performance because queries are split up into hundreds of duplicates
  • production applications generate so many duplicate queries that the historical data is pruned from pg_stat_statements, which makes historical analysis more difficult and is unnecessary work for the database

There have long been discussions on the Postgres mailing list to change this behavior: the current thread proposes grouping queries into buckets by the number of IN list values: 0-9, 10-99, 100-999, etc. But due to the complex implementation and lack of clarity on what should be the default behavior, it’s unlikely to be committed anytime soon.

Rails can resolve this issue in the meantime by constructing queries using any() instead:

sql = <<~SQL
  SELECT * FROM users WHERE id = any($1::bigint[])
SQL
ids = [1, 2, 3]
binds = [
  ActiveModel::Attribute.from_database('ids', ids, ActiveRecord::Type.lookup(:text, array: true)),
]
User.connection.select_rows(sql, '', binds)
SELECT * FROM users WHERE id = any($1::bigint[])
[["ids", "{1,2,3}"]]

I would like to open a PR to make this the default behavior for Postgres. Thoughts on possible roadblocks to getting such a PR merged?

1 Like

I’d find this useful and would be happy to help review a PR. Feel free to tag me in (including into a WIP / draft PR).

The monitoring benefits are nice, but you would need to benchmark the performance impact of the change, as well as thinking through any other potential downsides. It might be the case that you need to add an option on the PostgreSQLAdapter to let users opt in to this behaviour.

1 Like

Hi Alex, thanks for the offer to help review.

Here’s a first pass implementation: any() first pass · pganalyze/rails@34e9eb2 · GitHub

I’d appreciate thoughts on the general approach. Todo:

  • Figure out how to determine which database adapter is in use at runtime in ArrayHandler to switch back to HomogeneousIn for non-Postgres databases. Or should this code be moved to connection_adapters/postgresql? In that case, would ArrayHandler call if defined?(HomogeneousAny) to indirectly check if the Postgres adapter is being used?
  • Figure out why the array is still a literal in the query instead of a bind param. Note, HomogeneousIn also has code to make use of bind params but that code doesn’t seem to actually generate bind params either.
1 Like

I would expect this to live in Arel::Visitors::PostgreSQL#visit_Arel_Nodes_HomogeneousIn – it’s more of a dialect-specific way of spelling the same concept (visitor), rather than a new distinct concept (node)

1 Like

Okay, I’ve updated the code as you suggested: Use any() for Postgres array inclusion instead of IN · pganalyze/rails@dc4a29e · GitHub

I’ll write a benchmark script based on activerecord/examples/performance.rb. Let me know if there’s a better starting point.

Note that the unit tests generate SQL with a bind param, but actually performing a query doesn’t use a bind param:

User.where(id: [1,2]).count
SELECT COUNT(*) FROM "users" WHERE "users"."id" = any('{1,2}')

Any idea why that is?

Ah, that’s obvious in retrospect: if the database is configured with prepared_statements: false, Rails doesn’t use bind params. That explains why the code has to support both scenarios.

I’ve opened a PR for this: Reduce pg_stat_statements churn using `= any()` instead of `IN ()` by seanlinsley · Pull Request #49388 · rails/rails · GitHub

1 Like