destroy callbacks race condition

I’ve been looking into consistency problem with association counter caches where the counter cache value in the database is not consistent with the actual number of records in the association. What I’ve found is that it is from a concurrency issue where two process try to destroy the same record at the same time. Here is the pseudo SQL that is sent to the database when two process are deleting at the same time:

process_1 -> SELECT * FROM table WHERE id = 1

process_2 -> SELECT * FROM table WHERE id = 1

process_1 -> BEGIN

process_2 -> BEGIN

process_1 -> UPDATE parent_table SET counter_cache = COALESCE(counter_cache, 0) - 1 WHERE id = 1

process_1 -> DELETE FROM table WHERE id = 1

process_1 -> COMMIT

process_2 -> UPDATE parent_table SET counter_cache = COALESCE(counter_cache, 0) - 1 WHERE id = 1

process_2 -> DELETE FROM table WHERE id = 1

process_2 -> COMMIT

What happens is process_1 updates the counter cache and deletes the record. Process_2 simply updates the counter cache because the record is already deleted by the time it tries to delete it.

This is a pretty complicated issue and it touches more than just this one test case. The problem being that all the before and after destroy callbacks will be called regardless of if the record is actually destroyed. In the particular case of the counter caches, I think it could be fixed by moving the callback from a before_destroy to an after_destroy and adding a check in ActiveRecord to only call after destroy callbacks if a row was actually removed from the table.

In general I think it would be correct to make this general behavior so that after_destroy callbacks are not called if no record was deleted. However, that could affect quite a few things inside application code which could potentially leave objects in an inconsistent state because an expected callback was not called. I think the pending upgrade to Rails 4.0 might be a good time to introduce such behavior since it’s a major upgrade and as such people should not be expecting applications to work 100% without some alterations. This does not touch on the issue of before_destroy callbacks which would not be able to check the status of the delete operation. This could be handled with documentation stating that this is a known issue.

Another solution that would have less effect on current applications (but also leave them more vulnerable to being in an inconsistent state) would be to provide some sort of flag within the record that after_destroy callbacks could check if they are persisting data or interacting with external systems. Something like “row_deleted?” so that callbacks could be defined as:

after_destroy :my_callback, :if => :row_deleted?

Thoughts?

Can you write a public rails app that reproduces this issue? This behavior would be undesired and therefore a bug. If we can reproduce and attach that to an issue it could help the discussion.

I’ve been looking into consistency problem with association counter caches where the counter cache value in the database is not consistent with the actual number of records in the association. What I’ve found is that it is from a concurrency issue where two process try to destroy the same record at the same time. Here is the pseudo SQL that is sent to the database when two process are deleting at the same time:

process_1 -> SELECT * FROM table WHERE id = 1

process_2 -> SELECT * FROM table WHERE id = 1

process_1 -> BEGIN

process_2 -> BEGIN

process_1 -> UPDATE parent_table SET counter_cache = COALESCE(counter_cache, 0) - 1 WHERE id = 1

process_1 -> DELETE FROM table WHERE id = 1

process_1 -> COMMIT

process_2 -> UPDATE parent_table SET counter_cache = COALESCE(counter_cache, 0) - 1 WHERE id = 1

process_2 -> DELETE FROM table WHERE id = 1

process_2 -> COMMIT

What happens is process_1 updates the counter cache and deletes the record. Process_2 simply updates the counter cache because the record is already deleted by the time it tries to delete it.

This is a pretty complicated issue and it touches more than just this one test case. The problem being that all the before and after destroy callbacks will be called regardless of if the record is actually destroyed. In the particular case of the counter caches, I think it could be fixed by moving the callback from a before_destroy to an after_destroy and adding a check in ActiveRecord to only call after destroy callbacks if a row was actually removed from the table.

In general I think it would be correct to make this general behavior so that after_destroy callbacks are not called if no record was deleted. However, that could affect quite a few things inside application code which could potentially leave objects in an inconsistent state because an expected callback was not called. I think the pending upgrade to Rails 4.0 might be a good time to introduce such behavior since it’s a major upgrade and as such people should not be expecting applications to work 100% without some alterations. This does not touch on the issue of before_destroy callbacks which would not be able to check the status of the delete operation. This could be handled with documentation stating that this is a known issue.

Generally speaking that sounds like a sane change, but definitely something that we should experiment with a lot before making the changes.

As for your particular issue, the simplest way to avoid them is to fetch the records with :lock=>true so the database does a SELECT … FOR UPDATE, relational databases are pretty good at preventing this kind of thing, best use them if you can.

I’ve put a stand alone script here that reproduces the issue:

https://gist.github.com/3885509

I think adding a pessimistic lock to the destroy method will work. I opened this pull request that locks the record in the database before destroying it. If the record no longer exists, the callbacks are not called.

https://github.com/rails/rails/pull/7965

This won’t work on databases that don’t support row locking, but you’re using such a database you’d likely have other issues in a high concurrency situation like is needed to produce this issue.

I don’t think it’s reasonable to force pessimistic locks on every single destroy call, my point is more that in your case it’s a work around.

I don’t think it’s really that big of a change. The destroy action happens inside of a database transactions and the delete SQL will lock the row being deleted anyway. This would just move the lock up to before the before_destroy callbacks. This could potentially increase the length of time a row is locked if it has significant before_destroy callbacks and you could in theory cause a deadlock. Of course you have the same risks now if you have any after_destroy callbacks.

I have confirmed this behavior with MySQL using InnoDB tables. If I add an after_destroy callback that sleeps for five seconds the row is locked until the transaction is committed.

It does force another SELECT statement on destroy of all objects. This could be optimized if there’s a good way to check for the existence of callbacks since it is only needed if there are callbacks. The bigger change would be that callbacks would not necessarily be executed.

I do think that we need a general fix at least for data maintained by ActiveRecord. In the case of the counter cache, this is an ActiveRecord feature that should just work. Developers should not have to manually lock records in order to maintain their data integrity. This is something the ORM layer should handle for them for a feature provided by the ORM. If the mechanism is not automatic, it would be nice if it were exposed to be available to application developers to hook into.

To stop messing around with locks, and callbacks if we change the queries to this:

UPDATE parent_table SET counter_cache = COALESCE(counter_cache, 0) - (SELECT COUNT(*) FROM table WHERE id = 1) WHERE id = 1

DELETE FROM table WHERE id = 1

This one would not decrement, if record was already deleted…

I know this is a bit overhead on the database, but is far more trivial than change callbacks ordering…

That wouldn’t fix the problem because in the case of two processes destroying a record at the same time they can both be in the before callback at the same time. Even after the DELETE SQL is called by one process, it won’t be visible to the other process until the transaction is committed.