Do you want to use threads specifically, or do you just want to do the updates concurrently? In a Rails app with background jobs set up, you can update concurrently by queuing a job for each change you want, then control the concurrency by configuring the number of job workers (jobs that run at once).
In Sidekiq you would set concurrency in sidekiq.yml to some value and then use perform_bulk:
class UpdateJob
  include Sidekiq::Job

  def perform(model_id, thing_to_change)
    Model.find(model_id).update!(attribute: thing_to_change) # or whatever
  end
end
# then in console or a rake task
ids = Model.pluck(:id)
# perform_bulk expects an array of argument arrays, so pair each id with the
# value to set (new_value is a placeholder for whatever you're changing)
array_of_args = ids.map { |id| [id, new_value] }
UpdateJob.perform_bulk(array_of_args, batch_size: 100) # Sidekiq says don't set this above 1,000
Things to be aware of:
Postgres only has so many connections, so you likely can't run 10,000 concurrent updates against a Postgres database.
When using threads, Active Record's connection pool must be sized properly. See the Advanced Options page of the sidekiq/sidekiq wiki on GitHub, which covers Sidekiq specifically but is generally useful for any threaded solution.
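For example, if Sidekiq's concurrency is 10, a database.yml sketch might look like this (the pool should be at least as large as the thread count; the ENV variable name here is just a common convention):

# config/database.yml
production:
  adapter: postgresql
  # each Sidekiq worker thread needs to be able to check out a connection
  pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 10 } %>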
Thanks for the approach, but I want to do the updates via threads. Queuing will result in too many jobs in my Sidekiq queue, as there are a lot of records.
You don't need to have one job per record. find_in_batches is useful for creating a block of records to hand off to a job. Since you really just need to hand the ids to the job, I usually combine it with select('id') so I'm only grabbing that column; the job itself loads the full records for processing. Something like:
MyModel.select('id').find_in_batches(batch_size: 1_000) do |batch|
  MyJob.perform_later(batch.pluck(:id))
end
The job then looks like:
class MyJob < ApplicationJob
  def perform(ids)
    MyModel.where(id: ids).each do |obj|
      # Do something with obj
      obj.save!
    end
  end
end
You may wish to surround the loop in your job with a transaction to avoid the overhead of a transaction per record. So:
class MyJob < ApplicationJob
  def perform(ids)
    MyModel.transaction do
      MyModel.where(id: ids).each do |obj|
        # Do something with obj
        obj.save!
      end
    end
  end
end
This will cause locks to be held longer on the records in that batch, so if that's an issue you can make the batch size smaller so each job runs more quickly.
If the job route is not what you want and you're looking for more of a one-off script, you certainly can use threads. The two main things are:
Make sure your database configuration has enough connections in the connection pool to match your thread count. This means both setting the pool value in your database.yml correctly and configuring your database's max connection count appropriately.
Use a thread pool. You still don't want one thread per record, as that would cause too much contention. The concurrent-ruby gem provides both a thread pool implementation and thread-safe primitives for enqueuing records to be processed by the pool; a sketch follows this list.
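A minimal sketch of that approach, assuming a Model class and a hypothetical compute_value helper standing in for whatever each record needs:

require "concurrent" # concurrent-ruby gem

pool = Concurrent::FixedThreadPool.new(8) # keep this at or below your AR pool size

Model.select('id').find_in_batches(batch_size: 1_000) do |batch|
  ids = batch.map(&:id)
  pool.post do
    # Wrapping in the Rails executor returns connections to the pool per task
    Rails.application.executor.wrap do
      Model.where(id: ids).each do |obj|
        obj.update!(attribute: compute_value(obj)) # compute_value is a placeholder
      end
    end
  end
end

pool.shutdown
pool.wait_for_termination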
In the end, if you go this route, at some point you're just building your own job system instead of using the one Rails provides out of the box. In fact, if your concern is infrastructure (setting up the Sidekiq gem, setting up Redis, etc.), you could just use the Async adapter for Active Job, since it's included in Rails. It essentially wraps concurrent-ruby, giving you the Active Job environment without any infrastructure, which makes it ideal for one-off tasks. If this isn't a one-off but something you'll do often, I still recommend a real job server with persistent storage of the jobs so they can be managed, retried on failure, and so on.
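For reference, switching to that adapter is a single setting, and it can also be scoped to one job class:

# config/application.rb
config.active_job.queue_adapter = :async

# or on a single job
class MyJob < ApplicationJob
  self.queue_adapter = :async
end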
I am using CRuby and I want to avoid the GIL.
Regardless of whether you use a job server or plain threads in a thread pool, there will be a sweet spot for concurrency in your setup. As you add threads you gain concurrency and therefore process more quickly; note that even under CRuby's GIL (GVL), threads help here because the lock is released during blocking database I/O. But with too many threads, locking comes more into play (not just the GIL but also locks in the database), so at some point additional threads stop helping and may start hurting. You just need to find that sweet spot for your environment.
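One way to find that sweet spot is to time the same workload at a few thread counts; run_with_pool here is a hypothetical wrapper around your update script that takes a thread count:

require "benchmark"

[2, 4, 8, 16].each do |threads|
  elapsed = Benchmark.realtime { run_with_pool(threads) } # hypothetical helper
  puts "#{threads} threads: #{elapsed.round(2)}s"
end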
Thanks for this answer. I actually did use thread pools, but the execution time was similar to that of serial execution. Can you tell me where I might have gone wrong?
def script
  start_t = Time.zone.now
  thread_pool = Concurrent::ThreadPoolExecutor.new(
    min_threads: 4,
    max_threads: 8,
  )
  Model.find_in_batches(batch_size: 100).each do |batch|
    thread_pool.post do
      batch.each do |obj|
        obj.attr = # assign something
        obj.save!
      rescue => e
        # Log Error
      end
    end
  end
  thread_pool.shutdown
  thread_pool.wait_for_termination
  Rails.logger.info("Completed: Time Taken: #{Time.zone.now - start_t}")
end
I wonder if the work posted to the pool needs to be wrapped in Rails.application.executor.wrap? I'll be honest, I only have a fuzzy understanding of those aspects, but more info can be found in the Threading and Code Execution in Rails guide.
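If so, the wrap would go inside each posted task, something like this (a sketch against the script above; new_value is still a placeholder for the real assignment):

thread_pool.post do
  # The executor handles framework state per unit of work, including
  # returning database connections to the pool when the block finishes
  Rails.application.executor.wrap do
    batch.each do |obj|
      obj.attr = new_value # placeholder, as in the original script
      obj.save!
    rescue => e
      # Log Error
    end
  end
end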
Also, when processing each batch, consider creating a transaction so each save isn't paying the overhead of its own transaction, i.e.:
Model.transaction do
  batch.each do |obj|
    # ..
  end
end
It's not really related to threading, but it's a way to speed things up.
Can you elaborate a bit on what’s happening with the “assign something” line? Is it a fixed value? Is it calculated? Do you have callbacks on this model for updates?