Do you want to use threads specifically, or do you just want to do the updates concurrently? In a Rails app with background jobs set up, you can update concurrently by queuing a job for each change you want, then control the concurrency by configuring the number of job workers (jobs that run at once).
In Sidekiq you would set concurrency in sidekiq.yml to some value and then use perform_bulk:
class UpdateJob
  include Sidekiq::Job

  def perform(model_id, thing_to_change)
    Model.find(model_id).update!(attribute: thing_to_change) # or whatever
  end
end
# then in console or a rake task
ids = Model.pluck(:id)
# perform_bulk expects an array of argument arrays, so pair each id with the
# value to set (new_value is a placeholder for whatever you're changing)
array_of_args = ids.map { |id| [id, new_value] }
UpdateJob.perform_bulk(array_of_args, batch_size: 100) # Sidekiq says don't set this above 1,000
Things to be aware of:
Postgres only has so many connections, so you likely can't run 10,000 concurrent updates against a Postgres database.
When using threads, Active Record's connection pool must be sized properly. See the Advanced Options page of the sidekiq/sidekiq wiki on GitHub, which covers Sidekiq specifically but is generally useful for any threaded solution.
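For example, if Sidekiq's concurrency is 10, a database.yml sketch might look like this (the pool should be at least as large as the thread count; the ENV variable name here is just a common convention):

# config/database.yml
production:
  adapter: postgresql
  # each Sidekiq worker thread needs to be able to check out a connection
  pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 10 } %>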
Thanks for the approach, but I want to do the updates via threads. Queuing will result in too many jobs in my Sidekiq queue, as there are a lot of records.
You don't need to have one job per record. find_in_batches is useful for creating a block of records to hand off to a job. Since you really just need to hand the ids to the job, I usually combine it with select('id') so I'm only grabbing that column; the job itself loads the full records for processing. Something like:
MyModel.select('id').find_in_batches(batch_size: 1_000) do |batch|
  MyJob.perform_later(batch.pluck(:id))
end
The job then looks like:
class MyJob < ApplicationJob
  def perform(ids)
    MyModel.where(id: ids).each do |obj|
      # Do something with obj
      obj.save!
    end
  end
end
You may wish to surround the loop in your job with a transaction to avoid the overhead of a transaction per record. So:
class MyJob < ApplicationJob
  def perform(ids)
    MyModel.transaction do
      MyModel.where(id: ids).each do |obj|
        # Do something with obj
        obj.save!
      end
    end
  end
end
This will cause locks to be held longer on the records in that batch, so if that's an issue you can make the batch size smaller so each job runs more quickly.
If the job route is not what you want and you're looking for more of a one-off script, you certainly can use threads. The two main things are:
Make sure your database configuration has enough connections in the connection pool to match your thread count. This means both setting the pool value in your database.yml correctly and configuring your database's max connection count appropriately.
Use a thread pool. You still don't want one thread per record, as that would cause too much contention. The concurrent-ruby gem provides both a thread pool implementation and thread-safe primitives for enqueuing records to be processed by the pool; a sketch follows this list.
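A minimal sketch of that approach, assuming a Model class and a hypothetical compute_value helper standing in for whatever each record needs:

require "concurrent" # concurrent-ruby gem

pool = Concurrent::FixedThreadPool.new(8) # keep this at or below your AR pool size

Model.select('id').find_in_batches(batch_size: 1_000) do |batch|
  ids = batch.map(&:id)
  pool.post do
    # Wrapping in the Rails executor returns connections to the pool per task
    Rails.application.executor.wrap do
      Model.where(id: ids).each do |obj|
        obj.update!(attribute: compute_value(obj)) # compute_value is a placeholder
      end
    end
  end
end

pool.shutdown
pool.wait_for_termination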
In the end, if you go this route, at some point you're just building your own job system instead of using the one Rails provides out of the box. In fact, if your concern is infrastructure (setting up the Sidekiq gem, setting up Redis, etc.), you could just use the Async adapter for Active Job, since it's included in Rails. It essentially wraps concurrent-ruby, giving you the Active Job environment without any infrastructure, which makes it ideal for one-off tasks. If this isn't a one-off but something you'll do often, I still recommend a real job server with persistent storage of the jobs so they can be managed, retried on failure, and so on.
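For reference, switching to that adapter is a single setting, and it can also be scoped to one job class:

# config/application.rb
config.active_job.queue_adapter = :async

# or on a single job
class MyJob < ApplicationJob
  self.queue_adapter = :async
end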
I am using CRuby and I want to avoid the GIL.
Regardless of whether you use a job server or plain threads in a thread pool, there will be a sweet spot for concurrency in your setup. As you add threads you gain concurrency and therefore process more quickly; note that even under CRuby's GIL (GVL), threads help here because the lock is released during blocking database I/O. But with too many threads, locking comes more into play (not just the GIL but also locks in the database), so at some point additional threads stop helping and may start hurting. You just need to find that sweet spot for your environment.
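One way to find that sweet spot is to time the same workload at a few thread counts; run_with_pool here is a hypothetical wrapper around your update script that takes a thread count:

require "benchmark"

[2, 4, 8, 16].each do |threads|
  elapsed = Benchmark.realtime { run_with_pool(threads) } # hypothetical helper
  puts "#{threads} threads: #{elapsed.round(2)}s"
end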
Thanks for this answer. I actually did use thread pools, but the execution time was similar to that of serial execution. Can you tell me where I might have gone wrong?
def script
  start_t = Time.zone.now
  thread_pool = Concurrent::ThreadPoolExecutor.new(
    min_threads: 4,
    max_threads: 8,
  )
  Model.find_in_batches(batch_size: 100).each do |batch|
    thread_pool.post do
      batch.each do |obj|
        obj.attr = # assign something
        obj.save!
      rescue => e
        # Log Error
      end
    end
  end
  thread_pool.shutdown
  thread_pool.wait_for_termination
  Rails.logger.info("Completed: Time Taken: #{Time.zone.now - start_t}")
end
I wonder if the work posted to the pool needs to be wrapped in Rails.application.executor.wrap? I'll be honest, I only have a fuzzy understanding of those aspects, but more info can be found in the Threading and Code Execution in Rails guide.
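If so, the wrap would go inside each posted task, something like this (a sketch against the script above; new_value is still a placeholder for the real assignment):

thread_pool.post do
  # The executor handles framework state per unit of work, including
  # returning database connections to the pool when the block finishes
  Rails.application.executor.wrap do
    batch.each do |obj|
      obj.attr = new_value # placeholder, as in the original script
      obj.save!
    rescue => e
      # Log Error
    end
  end
end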
Also, when processing each batch, consider creating a transaction so each save isn't paying the overhead of its own transaction, i.e.:
Model.transaction do
  batch.each do |obj|
    # ..
  end
end
It's not really related to threading, but it's a way to speed things up.
Can you elaborate a bit on what’s happening with the “assign something” line? Is it a fixed value? Is it calculated? Do you have callbacks on this model for updates?