[Feature Proposal] ActiveRecord Connection Lifetime (Age)

Summary: I would like to contribute connection lifetime for ActiveRecord, making sure that connections only live for less than a maximum amount of time. This is useful for a number of reasons, in our case when scaling database connections with a load balancer, or in low resilience network environments. I don’t believe it would add excessive complexity, and complements idle timeouts nicely.

Background

We run a big Rails monolith at Buildkite with a large primary Postgres database. We need many Puma and Sidekiq instances, and each of those needs connections to the database. Postgres connections are not cheap, however, each connection consumes a significant amount of memory. If we used ActiveRecord pools with direct connections it would exhaust our database’s memory very quickly. So we use PgBouncer as a proxy between rails and our database with connection pooling at a transaction level so that fewer active connections are required for regular operations. I believe this pattern is fairly common.

We previously ran PgBouncer on the hosts which are also running our Rails processes. This affinity meant that reconfiguring or upgrading PgBouncer could be done in a host-based rolling manner without causing interruption to the Rails processes. But we’ve reached the limit of this approach and need to pool across multiple hosts to avoid connection exhaustion, and so need to move PgBouncer into its own tier. To retain the ability to reconfigure or redeploy PgBouncer we need the ability to cleanly drain Rails connections from PgBouncer. We could do this by restarting all Rails processes, but that becomes impractical.

By introducing a connection lifetime to ActiveRecord connections we can be sure that if we deregister a PgBouncer instance and then wait one lifetime plus one maximum query time, plus a bit, the instance should be drained of all connections and can safely be stopped. This means we can performa a rolling deployment at the PgBouncer tier with zero downtime, and without restarting all Rails processes.

Note that the existing idle timeouts don’t quite serve this purpose because busy Rails processes may keep reusing connections so they never reach such a timeout.

“Connection Lifetime” seems to be a known concept in a few other connection pooling implementations, especially in .NET:

Workaround

We’re currently using a concern to implement connection lifetime, coupled with an equivalently-configured idle timeout to make sure our connections live no longer than ~120 seconds (checked out just before 60 second lifetime + 60 second request timeout, or check in just before 60 second lifetime + 60 second idle timeout):

# Connection lifetime for ActiveRecord
# 
# Make sure that connections to the database can only live for a certain number
# of seconds. Once lifetime is reached, the underlying connection will be
# reconnected. This is enforced when checking out a connection for use from the
# pool. Use in combination with idle_timeout to enforce connection lifetime on
# idle connections as well.
#
# This allows graceful failover of database connections when the connection
# targets underlying a database address might change over time. For example,
# when running a pgbouncer in front of postgres this allows pgbouncer itself to
# be scaled or updated without downtime by using a load balancer with a target
# deregistration connection draining timeout slightly longer than the
# connection lifetime. The difference should account for the maximum time a
# connection might be checked out, i.e. the maximum time of a web request.
#
# A potential future improvement would be to give opportunities for
# reconnection between transactions for transaction-pooled connections.
#
module ActiveRecord::ConnectionAdapters::Lifetime
  def initialize(...)
    super

    # Keep track of when this connection was opened.
    @connected_at = Concurrent.monotonic_time

    # And see if there is a configured time to live.
    @lifetime = self.class.type_cast_config_to_integer(
      @config.fetch(:lifetime, nil)
    )
  end

  attr_reader :connected_at, :lifetime

  def reconnect!(...)
    super

    # Make sure reconnect! resets the connection time.
    @connected_at = Concurrent.monotonic_time
  end

  def age
    Concurrent.monotonic_time - @connected_at
  end

  def old?
    lifetime && age > lifetime
  end

  def verify!(...)
    super

    # During checkout, reconnect if the connection is too old.
    reconnect! if old?
  end
end

Proposal

I would love to contribute this feature upstream to rails. Supporting connection lifetime directly across pools, connections and the reaper in roughly the same places as idle timeout could be much more elegant than this workaround. Discarding the connection would probably be preferable to reconnection, too, but was too complex for this workaround.

Would there be appetite for such a contribution?

6 Likes

@sj26 we are running into this issue as well. Do you have an example of how you apply the this ConnectionAdapter?

1 Like

Am also in need of this feature, and for the exact same reasons as stated by OP.

I think this is very import for large rails application, for now, we have to use pgbouncer to low down the connection pool issue.

Interesting. Well, I guess this hasn’t gotten a lot of traction in the last 2 years. However, it’s an interesting topic to me. In case your team or another team is pursuing this, here are some additional questions:

  • Are there other infrastructure alternatives like haproxy, that have a native connection lifetime/max duration concept? The RFC may be strengthened by listing alternatives, and then comparing them to your proposed solution.
  • Could this be implemented as an extension to active record, rather than in core active record, since it might not be useful generally? Most of the time I think the connections are either actively in use, or the active record connection pool client connections are open, but in an idle state, and the idle_timeout should reap them when that time is reached (300 seconds by default).
  • Is this unlikely to gain core acceptance, because it’s limited to PostgreSQL/pgbouncer only?
  • Would the lifetime/duration work the same way when directly connected to the database, as opposed to when connected through pgbouncer?