Announcing sidekiq-iteration - a gem that makes your long-running sidekiq jobs interruptible and resumable by design

Hello everyone :wave:

I am publishing a new gem, sidekiq-iteration (GitHub: fatkodima/sidekiq-iteration), which makes your long-running Sidekiq jobs interruptible and resumable by design. For those familiar with job-iteration from Shopify (GitHub: Shopify/job-iteration), this is an adaptation of that gem for use with raw Sidekiq (no ActiveJob).

Motivation

Imagine the following job:

class NotifyUsersJob
  include Sidekiq::Job

  def perform
    User.find_each do |user|
      user.notify_about_something
    end
  end
end

The job would run fairly quickly when you only have a hundred User records. But as the number of records grows, it will take longer for a job to iterate over all Users. Eventually, there will be millions of records to iterate and the job will end up taking hours or even days.

With frequent deploys and worker restarts, such a job will either be lost or restarted from the beginning. On every restart, some records (especially those at the beginning of the relation) will be processed more than once.

Solution

sidekiq-iteration helps to make this job interruptible and resumable. It will look like this:

class NotifyUsersJob
  include Sidekiq::Job
  include SidekiqIteration::Iteration

  def build_enumerator(cursor:)
    active_record_records_enumerator(User.all, cursor: cursor)
  end

  def each_iteration(user)
    user.notify_about_something
  end
end

each_iteration will be called for each User record in the User.all relation. The relation will be ordered by primary key, exactly as find_each does. Iteration hooks into Sidekiq out of the box to support graceful interruption. No extra configuration is required.
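
To make the idea of a cursor more concrete, here is a toy model in plain Ruby with no Sidekiq or Active Record involved. It is only a sketch of the general technique, not the gem's actual implementation: Record, records_enumerator, run_job and the stop_after parameter are all hypothetical names made up for this illustration, and stop_after merely simulates a shutdown signal.

```ruby
# Toy model of cursor-based resumption. NOT the gem's real code;
# every name here is hypothetical and exists only for illustration.
Record = Struct.new(:id, :name)

RECORDS = (1..10).map { |i| Record.new(i, "user#{i}") }.freeze

# Yields [record, cursor] pairs for records whose id is greater than
# the given cursor, ordered by id (like the primary-key ordering above).
def records_enumerator(records, cursor:)
  Enumerator.new do |yielder|
    records.sort_by(&:id).each do |record|
      next if cursor && record.id <= cursor
      yielder.yield(record, record.id)
    end
  end
end

# Processes records one by one. After `stop_after` records it pretends
# a shutdown was requested and returns the last cursor, so that a later
# run can continue from that point instead of starting over.
def run_job(records, cursor:, stop_after:, processed:)
  count = 0
  records_enumerator(records, cursor: cursor).each do |record, new_cursor|
    processed << record.id # stands in for each_iteration(record)
    cursor = new_cursor
    count += 1
    return [:interrupted, cursor] if count >= stop_after
  end
  [:finished, cursor]
end

processed = []

# First run is "interrupted" after 4 records.
status, cursor = run_job(RECORDS, cursor: nil, stop_after: 4, processed: processed)

# Second run resumes from the saved cursor and finishes the rest.
final_status, final_cursor = run_job(RECORDS, cursor: cursor, stop_after: 100, processed: processed)
```

In this model no record is processed twice, because the second run skips everything up to and including the saved cursor.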

See the gem documentation for more details and examples of usage.

Nice!

IMO, the “makes your jobs interruptible and resumable” part should be integrated into ActiveJob.

Does it use PostgreSQL cursors to avoid repeating the query on the database? Or does it repeat the (possibly complex) query on every iteration using “ID greater than”?

It uses the same approach as in_batches (basically the “ID greater than” that you mentioned). Its performance is pretty decent, and it was recently improved quite a bit for whole-table batching in rails/rails Pull Request #45414, “Optimize Active Record batching for whole table iterations”.
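
For anyone unfamiliar with the “ID greater than” approach, a rough sketch in plain Ruby follows. It works on an in-memory array rather than a real database, and fetch_batch and each_batch are hypothetical helpers written for this post, not Rails or gem APIs. Each call to fetch_batch stands in for one query of the form shown in the comment.

```ruby
# Toy table with gaps in the ids, as in a real table after deletions.
ROWS = [1, 2, 5, 8, 9, 12, 15].map { |id| { id: id } }.freeze

# Hypothetical stand-in for a query like:
#   SELECT * FROM users WHERE id > :last_id ORDER BY id LIMIT :limit
def fetch_batch(rows, last_id:, limit:)
  rows.select { |row| row[:id] > last_id }
      .sort_by { |row| row[:id] }
      .first(limit)
end

# Repeats the query, each time starting after the last id seen,
# until a query returns no rows. Assumes ids are positive integers.
def each_batch(rows, batch_size:)
  last_id = 0
  loop do
    batch = fetch_batch(rows, last_id: last_id, limit: batch_size)
    break if batch.empty?

    yield batch
    last_id = batch.last[:id]
  end
end

batches = []
each_batch(ROWS, batch_size: 3) { |batch| batches << batch.map { |row| row[:id] } }
# batches is now [[1, 2, 5], [8, 9, 12], [15]]
```

The query is indeed repeated for every batch, but each one only needs the last id seen, which is also what makes it easy to resume after an interruption.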