Retry logic for killed Resque Workers


I’ve opened an issue, “Killed Resque jobs cannot be retried using ActiveJob” (rails/rails issue #49734), and a PR to fix it, “Catch and handle Resque::DirtyExit exceptions in ActiveJob” (rails/rails pull request #49735, by geoffyoungs).

We run ActiveJob with Resque in production on k8s and find, especially after deploys, that the code that reaps old workers kills them too quickly for them to shut down gracefully.

The failed jobs show up in the Resque failure queue (visible in the resque-web interface), but because the exception is raised outside the bounds of the job (in a different worker that is still active), the existing ActiveJob exception handler never sees it.
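
To make the mechanism concrete, here is a minimal stand-alone sketch using only the Ruby standard library. It is not Resque's or ActiveJob's code, and `run_and_kill_child` is a hypothetical helper invented for illustration. It assumes a platform where `fork` is available (Linux/macOS) and that the reaper effectively sends SIGKILL. The point it tries to show: when the process running the job is killed outright, the `rescue`/`ensure` blocks around the job never run, and only another process can observe the failure.

```ruby
# Illustration only: plain Ruby, no Resque and no Redis.
# All names here are hypothetical, not part of any library.

# Forks a child that pretends to run a job, kills it with SIGKILL,
# and returns what the parent process is able to observe afterwards.
def run_and_kill_child
  reader, writer = IO.pipe

  pid = fork do
    reader.close
    begin
      writer.puts "started"   # tell the parent the "job" is running
      writer.flush
      sleep 30                # stand-in for a long-running job
    rescue Exception => e     # stand-in for a job-level retry handler
      writer.puts "rescued #{e.class}"
    ensure
      writer.puts "ensure ran"
    end
  end

  writer.close
  reader.gets                 # block until the child is inside the "job"
  Process.kill("KILL", pid)   # assumed behaviour of an aggressive reaper
  _, status = Process.wait2(pid)

  child_output = reader.read  # whatever the child's handlers managed to write
  reader.close

  {
    signaled: status.signaled?,  # true if the child died from a signal
    signal: status.termsig,      # the signal number that killed it
    child_output: child_output   # stays empty if no handler ran in the child
  }
end

puts run_and_kill_child.inspect
```

In this sketch the parent should see that the child was terminated by a signal and that neither the `rescue` nor the `ensure` branch wrote anything. That is roughly the situation described above: any handling has to happen in whichever process notices the dead worker, not in the job itself.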

I’ve added a Minitest test (against rails/rails) that demonstrates the issue and shows how the PR fixes it, but I’m not sure what the best approach to testing this in Rails is, since demonstrating the issue requires multiple processes and spinning up redis-server.

Is this issue too niche to be in scope for the ActiveJob adapter?

If it is in scope, any thoughts on my proposed fix or on how to test it?