How to invoke cache sweeper from background jobs / models?

Hello list!

I need to expire fragment caches from a background job. The usual way to expire caches is to create a cache sweeper and put the observer hooks into the controller. That is fine as long as the database is only modified through controller actions.

But in this case I have a background job importing data, and that job needs to invalidate fragment caches for the records it touches. The most elegant way would be to be able to install the sweeping observer while the background job is running, so that it expires the caches of all touched objects.

Thinking further: sometimes I start manual imports from the Rails console by invoking MyModel.import_all! on the model that imports the data. How would caches be expired in that case? Clearly, the MVC way cannot work here.

Time to break the rules. So, what would be the best approach to handle this? I see three ways, each with different downsides, but only one that actually solves the problem:

1. Expire fragments through observers the MVC way
   Downside: Background importers won't purge caches because only
   controllers install sweepers/observers.
   Conclusion: This is not an option.

2. Expire fragments through observers installed in controllers and
   background jobs
   Downside: No sweeping when running imports from the console.
   Conclusion: Probably the cleanest solution, but not a real option.

3. Expire fragments directly from the models (see the sketch after this
   list)
   Downside: Not the proposed MVC way; no support in the Rails framework.
   Conclusion: The only solution that works in a DRY way for me.
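
A minimal sketch of what option 3 could look like, assuming a hypothetical Product model and made-up fragment keys; instantiating a bare controller to reach expire_fragment is a common workaround, not an official API:

    # Hypothetical sketch of option 3: expire fragment caches from the model
    # itself, so every write path (controller, background job, console) busts them.
    class Product < ActiveRecord::Base
      after_commit :expire_fragments

      private

      def expire_fragments
        # expire_fragment normally lives in controllers; a bare controller
        # instance still has access to the configured cache store.
        ActionController::Base.new.expire_fragment("products/#{id}")
        ActionController::Base.new.expire_fragment("products/index")
      end
    end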

Looking at these points, I question the whole design behind sweepers and their MVC voodoo. Forcing cache sweeping into controllers only looks like a flawed design to me, and I don't see the reason behind it. In the end, it's the data that makes up the views. Why should a controller sweep its caches? Is it for performance reasons? It clearly cannot be about each controller clearing only its own views, because other controllers may render results from the same models.

I often end up purging the caches of all controllers from within one observer. While that is DRY, it highlights the misconception of making sweepers available to controllers only.

MVC doesn't mean that all your logic has to be in a model, view, or controller. It sounds like you just need a class to do your import work, which can be called from a controller, background job, script, migration, etc.
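
A minimal sketch of such an import class; the ProductImporter name, the Product model and its columns are made up for illustration:

    # A plain Ruby class that owns the import logic and can be called the same
    # way from a controller action, a background job, a rake task, or the console.
    class ProductImporter
      def initialize(rows)
        @rows = rows                              # e.g. parsed CSV rows as hashes
      end

      def import_all!
        @rows.each do |attrs|
          product = Product.find_or_initialize_by(sku: attrs[:sku])
          product.update!(attrs)
        end
      end
    end

    # Usage looks identical everywhere:
    #   ProductImporter.new(rows).import_all!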

A couple of responses. First, in Rails 4 this was changed. It now incorporates generational caching. The cache key has a digest which is a hash of the underlying template content so that any changes in content will bust the cache automatically. Observers have actually been removed from Rails 4 although you can still get the functionality back by using a gem.
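
For context, a minimal example of that key-based approach in a view; the model and template names are arbitrary:

    <%# app/views/products/show.html.erb %>
    <%# The generated cache key combines the template digest with the record's %>
    <%# cache_key (model name, id and updated_at), so editing either the %>
    <%# template or the record expires the fragment automatically. %>
    <% cache product do %>
      <h1><%= product.name %></h1>
      <p><%= product.description %></p>
    <% end %>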

Second, the architecture did make sense initially. In your typical Rails app, you don't want your database updated without going through the controller. Even interfaces from other applications are, ideally, processed through an API as JSON or XML. Caching is the least of the issues; you want to ensure that all the constraints and security established in the controller and model are applied. Fragment caching is really a view process and, in MVC, you don't want to manage view processes from the model; the controller is designed to be the one that passes messages from one to the other. Therefore, you generally used to have the controller send a message to the view when there was a change in the underlying model that affects it. By the way, there are things other than data that can change a view fragment. In particular, a change in image file references that get incorporated into the view comes to mind.

There are always situations that won't fit into this, although they are usually the exception and not the rule. You could have a resource that you maintain only for information and that gets updated on some periodic basis by a batch import. In that case, you are correct: you have to define a method within the model (which is most common), which may cross the boundaries a bit.

mike2r <mruch@kalanitech.com> wrote:

A couple of responses. First, in Rails 4 this was changed. It now incorporates generational caching. The cache key has a digest which is a hash of the underlying template content so that any changes in content will bust the cache automatically. Observers have actually been removed from Rails 4 although you can still get the functionality back by using a gem.

Actually, that is what I've switched to now. At first, my concern was about piling up a lot of stale and unused content in the cache. But I fixed that by switching over to memcached as the cache store, because it can be limited in size and has an LRU replacement policy.
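
For reference, a minimal configuration along those lines; the host, port, memory cap and environment file are assumptions:

    # config/environments/production.rb (assumed placement) -- point the Rails
    # cache store at memcached; Rails 4 uses the dalli client for this store.
    Rails.application.configure do
      config.cache_store = :mem_cache_store, "localhost:11211"
    end

    # memcached itself runs with a hard memory cap (256 MB here as an example);
    # once full, it evicts least-recently-used entries instead of growing:
    #
    #   memcached -m 256 -p 11211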

To make use of this with fragment caching, I've started to change the cache keys so that they always include the object's change time, plus any additional metadata needed to detect invalidations - such as change times of associated objects or ids of parent objects. Used together with Russian doll caching, this works very well now - and after giving up on file-based caching, performance improved a lot without having to bother about infinite cache growth.
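
A minimal sketch of that kind of composite key; the Article/Category/Comment models and the fragment_cache_key helper are made up for illustration:

    # Hypothetical model: build a composite fragment cache key that changes
    # whenever the record itself or relevant associated records change.
    class Article < ActiveRecord::Base
      belongs_to :category
      has_many :comments

      # Used in the view as:  <% cache article.fragment_cache_key do %> ... <% end %>
      def fragment_cache_key
        [
          self,                          # expands to model name, id and updated_at
          category.try(:updated_at),     # bust the fragment when the parent changes
          comments.maximum(:updated_at)  # ...or when any nested comment changes
        ]
      end
    end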

Second, the architecture did make sense initially. In your typical Rails app, you don't want your database updated without going through the controller. Even interfaces from other applications are, ideally, processed through an API as JSON or XML. Caching is the least of the issues; you want to ensure that all the constraints and security established in the controller and model are applied. Fragment caching is really a view process and, in MVC, you don't want to manage view processes from the model; the controller is designed to be the one that passes messages from one to the other. Therefore, you generally used to have the controller send a message to the view when there was a change in the underlying model that affects it. By the way, there are things other than data that can change a view fragment. In particular, a change in image file references that get incorporated into the view comes to mind.

That is, of course, true. Actually, I never really want to deal with view components from the model. That's the point of MVC - and even if it looks like it complicates things at first glance (especially to beginners), it always results in a much cleaner design (at least if you don't try to work around it), and thus in fewer bugs and easier-to-read code.

But:

There are always situations that won't fit into this, although they are usually the exception and not the rule. You could have a resource that you maintain only for information and that gets updated on some periodic basis by a batch import. In that case, you are correct: you have to define a method within the model (which is most common), which may cross the boundaries a bit.

There are situations which do not go through the controller - and cannot. The controller is reached through the action dispatcher, and a background worker won't go that route; it more or less becomes a controller itself, doing the job on its own. So it appears to me that Rails is missing some glue between the actual dispatching and the persistence layer - something that should not be ActionController as the sole component.

I've seen some people creating extensions to Rails that fill this gap (mostly based on events or notifications/subscriptions; an example is wisper [1]). They actually look clean and like good ideas, but they are still cumbersome to integrate into Rails, probably just because Rails lacks native, well-integrated support for this. The main idea behind this concept is to insert a service layer between controllers and models. I like that idea because I could use the same service layer from background jobs. But I don't think it would have solved the specific problem I had with invalidating caches. After some testing I found that programmatically invalidating caches through such logic is incredibly slow in Rails (defeating the purpose of using caches) when you do huge batched updates to your data. Thus, I decided to go the memcached route, which does all the heavy lifting for me.

[1]: https://github.com/krisleech/wisper - A micro library providing Ruby objects with Publish-Subscribe capabilities
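
As a rough illustration of that publish-subscribe idea, a sketch based on wisper's documented publisher/listener API; the ImportService, CacheInvalidator, Product model and event name are all made up:

    require 'wisper'

    # A service object that performs the import and broadcasts domain events,
    # without knowing anything about caches or views.
    class ImportService
      include Wisper::Publisher

      def run(rows)
        rows.each do |attrs|
          product = Product.find_or_initialize_by(sku: attrs[:sku])
          product.update!(attrs)
          broadcast(:product_imported, product.id)
        end
      end
    end

    # A listener that reacts to the events, e.g. by expiring cached fragments.
    class CacheInvalidator
      def product_imported(product_id)
        ActionController::Base.new.expire_fragment("products/#{product_id}")
      end
    end

    # Usage (rows being whatever the import source yields):
    #   importer = ImportService.new
    #   importer.subscribe(CacheInvalidator.new)
    #   importer.run(rows)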

Josh Jordan <josh.jordan@gmail.com> wrote:

MVC doesn't mean that all your logic has to be in a model, view, or controller. It sounds like you just need a class to do your import work, which can be called from a controller, background job, script, migration, etc.

It is mostly done that way, though that class shares a module with the model (the one that triggers copying a single import item from the preprocessed data table to my model's table). It's cleanly split up into modules concerning import payloads, import sources, data mapping, etc.

The batch operations are parallelized with celluloid-pmap because, from time to time, we may have tens of thousands of records to compare with existing records and to import, which involves downloading additional data like images or attachments. Import workers can run in parallel by using appropriate locks on the database - implicitly by using pmap, and explicitly by simply starting multiple workers in parallel. If the database connection pool is handled correctly from within pmap, there's no problem keeping the database busy without running out of connections. Occasionally there are lock timeouts, but those are now handled gracefully by retrying the working set until it has been processed completely, without creating infinite loops.
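
A rough sketch (not the actual import code) of that pattern, assuming the celluloid-pmap gem provides Enumerable#pmap and that the per-record import! method and the lock-timeout error class fit your adapter:

    require 'celluloid/pmap'   # celluloid-pmap gem: adds Enumerable#pmap

    def import_batch(records, retries_left = 3)
      failed = records.pmap do |record|
        # Check a connection out of the pool per thread so pmap workers
        # don't starve or leak ActiveRecord connections.
        ActiveRecord::Base.connection_pool.with_connection do
          begin
            record.import!                     # hypothetical per-record import
            nil
          rescue ActiveRecord::StatementInvalid
            record                             # e.g. lock wait timeout: keep for retry
          end
        end
      end.compact

      # Retry the remaining working set a bounded number of times,
      # so a persistent failure cannot turn into an infinite loop.
      import_batch(failed, retries_left - 1) if failed.any? && retries_left > 0
    end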

It works pretty well now, is mostly bullet-proof, and I'm happy with the performance. Exceptions are caught and handled by feeding them back to the import data table so they can be reviewed manually - and either retried after applying a fix or submitted to a web form for manual handling. Everything is wrapped in transactions so we don't leave half-done operations behind.
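
A rough sketch of that error-handling pattern; the import item model, its state and last_error columns, and the import! method are invented for illustration:

    # Each row of the preprocessed import table is handled in its own
    # transaction; failures are recorded on the row for manual review.
    def process_item(item)
      ActiveRecord::Base.transaction do
        item.import!                               # hypothetical per-item import
        item.update!(state: "imported")
      end
    rescue => e
      # The transaction has rolled back; keep the error with the row so it can
      # be reviewed, retried after a fix, or handed off to the manual web form.
      item.update!(state: "failed", last_error: e.message)
    end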

Still, such a class cannot expire caches in the way Rails designed cache expiration. As stated in the other post, cache expiration is now handled outside of Rails sweepers: by using proper cache keys and view digests, and by limiting the cache size with an LRU replacement policy (which memcached does well, which is why I switched to it).

The web views now show data almost instantly on almost every request, where we previously had load times of up to 20 seconds. The import process got a speedup of a factor of 50 or so - I didn't measure it; it's just much faster now. Generated pages are additionally cached by a Varnish frontend cache. Invalidating that properly, especially when using HTML5 manifests, can get tricky once in a while, but I'm working on that. Maybe we could play with its edge-side includes a little to parallelize page generation and break it into individual fragments, but currently the page design doesn't lend itself to making effective use of that.