How to invoke cache sweeper from background jobs / models?

Hello list!

I need to expire fragment caches from a background job. The usual way to
expire caches is to create a cache sweeper and put the observer hooks into
the controller. That is fine as long as the database is only modified
through controller actions.

But in this case I have a background job importing data, and that job needs
to invalidate the fragment caches of every record it touches. The most
elegant way would be to be able to install the sweeping observer while the
background job is running so that it expires the caches of all touched
objects.

Thinking further: sometimes I start manual imports from the Rails console by
invoking MyModel.import_all! on the model that is going to import the data.
How would caches be expired in that case? Clearly, the MVC way cannot work
here.

Time to break the rules. So, what would be the best approach to handle this?
I see three ways, each with its own downsides, but only one that actually
solves the problem:

1. Expire fragments through observers the MVC way
   Downside: Background importers won't purge caches because only
   controllers install sweepers/observers.
   Conclusion: This is not an option.

2. Expire fragments through observers installed in controllers and
   background jobs
   Downside: No sweeping when running imports from the console.
   Conclusion: Probably the cleanest solution but not a real option.

3. Expire fragments directly from the models
   Downside: Not the proposed MVC way, no support in Rails framework.
   Conclusion: The only solution that works in a DRY way for me (see the
   sketch right below this list).
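
As a rough illustration of option 3: a model callback that reaches
expire_fragment through a throwaway controller instance. The model and the
cache key are made up and would have to match whatever the views actually
cache under:

    class Product < ActiveRecord::Base
      after_commit :expire_fragments

      private

      # Works the same no matter whether the change comes in through a
      # request, a background job, or the console.
      def expire_fragments
        ActionController::Base.new.expire_fragment("products/#{id}")
      end
    end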

Looking at these points, I question the whole design behind sweepers and
their MVC voodoo. Restricting cache sweeping to controllers feels like a
failed design, and I don't see the rationale for it. In the end, it's the
data that makes up the views. Why should a controller sweep its caches? Is
it for performance reasons? It can't be about a controller clearing only its
own views, because other controllers may render results from the same models.

I often end up purging the caches of all controllers within the observer.
While that is DRY, it shows the misconception in making sweepers available
to controllers only.

MVC doesn't mean that all your logic has to live in a model, view, or
controller. It sounds like you just need a class to do your import work,
which can be called from a controller, background job, script, migration,
etc.
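
Something roughly like this (names made up) - the point is that nothing in
it depends on a controller:

    class ProductImport
      def self.run!(source)
        new(source).run!
      end

      def initialize(source)
        @source = source
      end

      # Parse the source, create/update records, take care of caches -
      # callable from a controller action, a background job, the console,
      # a rake task or a migration alike.
      def run!
        # ...
      end
    end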

A couple of responses. First, in Rails 4 this was changed. It now incorporates generational caching. The cache key has a digest which is a hash of the underlying template content so that any changes in content will bust the cache automatically. Observers have actually been removed from Rails 4 although you can still get the functionality back by using a gem.

Second, the architecture did make sense initially. In your typical Rails
app, you don't want your database updated without going through the
controller. Even interfaces from other applications are, ideally, processed
through an API as JSON or XML. Caching is the least of the issues; you want
to ensure that all the constraints and security established in the
controller and model are applied. Fragment caching is actually a view
process and, in MVS, you don't want to manage view processes from the model;
the controller is designed to be the one that passes messages from one to
the other. Therefore, you generally used to have the controller send a
message to the view when there is a change in the underlying model that
affects it. By the way, there are things other than data that can change a
view fragment. In particular, a change in image file references that get
incorporated into the view comes to mind.

There are always situations that won't fit into this, although they are
usually the exception and not the rule. You could have a resource that you
maintain only for information and that gets updated on some periodic basis
by a batch import. In that case, you are correct: you have to define a
method within the model (which is most common), which may cross the
boundaries a bit.

Sorry, that should be MVC, not MVS.

mike2r <mruch@kalanitech.com> wrote:

A couple of responses. First, in Rails 4 this was changed. It now
incorporates generational caching. The cache key has a digest which is a
hash of the underlying template content so that any changes in content will
bust the cache automatically. Observers have actually been removed from
Rails 4 although you can still get the functionality back by using a gem.

Actually, that is what I've switched to now. At first, my concern was with
piling up a lot of stale and unused content in the cache. But I fixed that
by switching over to memcached as the cache store because it can be limited
in size and has an LRU replacement policy.
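
For reference, the switch itself is just the cache store setting (the host
is a placeholder, and Rails 4's :mem_cache_store uses the dalli gem); the
size cap and LRU eviction are properties of the memcached server itself,
e.g. one started with memcached -m 512:

    # config/environments/production.rb
    config.cache_store = :mem_cache_store, "localhost:11211"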

To make use of this with fragment caching, I've started to change the cache
keys so that they always include the object's change time plus the
additional metadata needed to detect cache invalidations - like the change
times of associated objects or the ids of parent objects. Used together with
Russian doll caching, this works very well now - and after giving up on
file-based caching, performance improved a lot without having to worry about
infinite cache growth.
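
To give an idea of what these keys look like, a sketch on a hypothetical
model: any change to the record, its parent reference, or its associated
images yields a new key, so stale fragments are simply never read again and
eventually fall out of memcached's LRU:

    class Product < ActiveRecord::Base
      belongs_to :category
      has_many :images

      # Picked up by the view's cache helper (<% cache product do %> ... <% end %>);
      # Rails 4 additionally mixes a digest of the template into the final key.
      def cache_key
        [
          super,                              # e.g. "products/42-20140606120000"
          category_id,                        # id of the parent object
          images.maximum(:updated_at).to_i    # change time of associated records
        ].join("/")
      end
    end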

Second, the architecture did make sense initially. In your typical Rails
app, you don't want your database updated without going through the
controller. Even interfaces from other applications are, ideally, processed
through an API as JSON or XML. Caching is the least of the issues; you want
to ensure that all the constraints and security established in the
controller and model are applied. Fragment caching is actually a view
process and, in MVS, you don't want to manage view processes from the model;
the controller is designed to be the one that passes messages from one to
the other. Therefore, you generally used to have the controller send a
message to the view when there is a change in the underlying model that
affects it. By the way, there are things other than data that can change a
view fragment. In particular, a change in image file references that get
incorporated into the view comes to mind.

That is, of course, true. Actually, I never really want to deal with view
components from the model. That's the point of MVC - and even if it looks
like it complicates things at first glance (especially to beginners), it
always results in a much cleaner design (at least if you don't try to
circumvent it), and thus in fewer bugs and code that is easier to read.

But:

There are always situations that won't fit into this, although they are
usually the exception and not the rule. You could have a resource that you
maintain only for information and that gets updated on some periodic basis
by a batch import. In that case, you are correct: you have to define a
method within the model (which is most common), which may cross the
boundaries a bit.

There are situations which do not go through a controller - and cannot. The
controller is reached through the action dispatcher, and a background worker
won't take that route; it more or less becomes a controller itself that does
the job. So it appears to me that Rails is missing some glue between the
actual dispatching and the persistence layer - a role that ActionController
should not fill alone.

I've seen some people creating extensions to Rails that fill this gap
(mostly based on events or notifications/subscriptions; an example is wisper
[1]). These actually look clean and like good ideas, but they are still
cumbersome to integrate into Rails, probably just because Rails lacks
native, well-integrated support for this. The main idea behind the concept
is to insert a service layer between controllers and models. I like that
idea because I could use the same service layer from background jobs. But I
don't think it would have solved the specific problem I had with
invalidating caches. And after some testing I found that programmatically
invalidating caches through such logic is incredibly slow in Rails
(defeating the purpose of using caches) when you do huge batched updates to
your data. Thus, I decided to go the memcached route, which can do all the
heavy lifting for me.

[1]: https://github.com/krisleech/wisper
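
To illustrate the idea, this is roughly how such an event-based service
object looks with wisper (the class, the listener and the event names are
made up):

    class ImportProduct
      include Wisper::Publisher

      def call(row)
        product = Product.create!(row)
        broadcast(:product_imported, product.id)
      end
    end

    # A listener - for example one that expires caches - subscribes to the events:
    importer = ImportProduct.new
    importer.subscribe(CacheExpirationListener.new)   # responds to #product_imported
    importer.call(row)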

Josh Jordan <josh.jordan@gmail.com> wrote:

MVC doesn't mean that all your logic has to live in a model, view, or
controller. It sounds like you just need a class to do your import work,
which can be called from a controller, background job, script, migration,
etc.

It is mostly done that way, though that class shares a module with the model
(the one that triggers copying a single import item from the preprocessed
data table to my model's table). It's cleanly split up into modules
concerning import payloads, import sources, data mapping, etc.

The batch operations are parallelized with celluloid-pmap because from time
to time we may have tens of thousands of records to compare against existing
records and import, which involves downloading additional data like images
or attachments. Import workers can run in parallel by taking appropriate
locks on the database - implicitly by using pmap, and explicitly by simply
starting multiple workers side by side. When the database connection pool is
handled correctly from within pmap, there is no problem keeping the database
busy without running out of connections. Occasionally there are lock
timeouts, but those are now handled gracefully by retrying the working set
until it has been processed completely, without creating infinite loops.
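
Boiled down, the pattern looks something like this - celluloid-pmap adds
Enumerable#pmap, each parallel block checks a connection out of the pool,
and lock wait timeouts (which surface as ActiveRecord::StatementInvalid on
MySQL) push the row onto a retry queue; the worker and the queue are
hypothetical:

    require 'celluloid/pmap'

    retry_queue = Queue.new

    pending_rows.pmap do |row|
      ActiveRecord::Base.connection_pool.with_connection do
        begin
          ImportWorker.import!(row)
        rescue ActiveRecord::StatementInvalid
          retry_queue << row   # e.g. a lock wait timeout; retried later instead of looping
        end
      end
    end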

It works pretty well now, is mostly bullet-proof, and I'm confident about
the performance. Exceptions are caught and handled by feeding them back into
the import data table so they can be reviewed manually - and either retried
after applying a fix or submitted to a web form for manual handling.
Everything is wrapped in transactions so we don't leave half-done operations
behind.
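
Condensed, the per-record handling is roughly this (model and column names
are hypothetical):

    ImportItem.pending.find_each do |item|
      begin
        # all-or-nothing per item; a failure rolls this item's changes back
        ActiveRecord::Base.transaction { item.import! }
      rescue StandardError => e
        # feed the error back to the import data table for manual review / retry
        item.update!(status: "failed", last_error: e.message)
      end
    end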

Still, such a class cannot expire caches the way Rails designed its caching
around controllers. As stated in the other post, cache expiration is now
handled outside of Rails by using proper cache keys and view digests,
combined with an LRU replacement policy and a limited cache size (which
memcached handles perfectly, hence the switch).

The web views now show data almost instantly on nearly every request, where
we had up to 20 seconds before. The import process got a speedup of roughly
factor 50 - I didn't measure it exactly, it's just much faster now.
Generated pages are then additionally cached by a Varnish frontend cache.
Invalidating that properly, especially when using HTML5 manifests, can get
tricky once in a while, but I'm working on that. Maybe we could play with
its edge-side includes a little to parallelize page generation and break it
into individual fragments, but currently the page design does not allow us
to make effective use of that.