Explaining Solid Cache, Rails' new cache store

On the first day of Rails World, Donal McBreen from 37signals introduced a new Rails gem called Solid Cache, which allows Rails to use a database as its cache store. While that might sound counterintuitive (isn’t something like Redis or Memcached better and faster?), there’s reasoning behind their choice.

Video: https://www.youtube.com/watch?v=wYeVne3aRow

37signals blog: 37signals Dev - Solid Cache

What problem is this gem trying to solve?

During the talk, Donal McBreen made it clear that 37signals makes heavy use of Russian Doll Caching to speed up HTML generation and avoid unnecessary database queries. He also mentioned that requests that hit the cache were fast, but the ones that missed were up to 50 times slower.

So, they were looking for ways to make those misses rarer.
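
For readers who have not used Russian Doll Caching, here is a minimal pure-Ruby sketch of the idea. It is not Rails code: the `Post` and `Comment` structs, the `version` field (standing in for `updated_at`), and the hash-backed `fetch` helper are all hypothetical stand-ins for the view fragment cache. It only illustrates why a warm cache saves so much rendering work.

```ruby
# Toy illustration of nested ("Russian doll") fragment caching.
# Everything here is a hypothetical stand-in, not the Rails API.
CACHE   = {}            # stands in for the cache store
RENDERS = Hash.new(0)   # counts how often each fragment is really rendered

# Return the cached value for key, or compute and store it.
def fetch(key)
  CACHE.fetch(key) { CACHE[key] = yield }
end

Comment = Struct.new(:id, :body, :version)
Post    = Struct.new(:id, :title, :version, :comments)

# Inner fragment: keyed on the comment's id and version.
def render_comment(comment)
  fetch("comment/#{comment.id}-#{comment.version}") do
    RENDERS[:comment] += 1
    "<li>#{comment.body}</li>"
  end
end

# Outer fragment: keyed on the post's id and version, wraps the inner ones.
def render_post(post)
  fetch("post/#{post.id}-#{post.version}") do
    RENDERS[:post] += 1
    items = post.comments.map { |c| render_comment(c) }.join
    "<h1>#{post.title}</h1><ul>#{items}</ul>"
  end
end

post = Post.new(1, "Hello", 1, [Comment.new(1, "first", 1), Comment.new(2, "second", 1)])
render_post(post) # cold cache: renders the post and both comments
render_post(post) # warm cache: served entirely from the outer fragment

# Editing one comment bumps its version and "touches" the parent post.
post.comments[0].body = "edited"
post.comments[0].version += 1
post.version += 1
html = render_post(post) # re-renders the post and only the edited comment

puts "post renders: #{RENDERS[:post]}, comment renders: #{RENDERS[:comment]}"
```

The unchanged comment is reused from the cache on the last call, which is the whole point: the more fragments stay cached, the less work a change causes.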

What was their hypothesis?

Increasing the cache size on their current machines was not economically attractive because RAM is expensive. So, they decided to test the following:

“NVMe drives are very, very fast and are much cheaper than RAM. What if we used a cache store that keeps hot keys in memory and the rest on the NVMe? Well, databases can already do that, so why don’t we use a database as a cache store?”

What assumptions does this gem make about your infrastructure?

Over the past year, DHH has posted on Twitter and his personal blog about why they were leaving the cloud, the savings it generates, and the hardware they have purchased. This is important to keep in mind because this gem was created and tested in an environment that has:

  1. Database servers with NVMe drives attached directly to them;
  2. People who can hand-tune the database server (MySQL) for maximum performance by disabling ACID guarantees, logging, etc.

What was the result?

As explained in both the talk and the README of the gem, they saw that reads and writes were 25% to 50% slower than on Redis, but that only meant going from 0.8 ms to 1.2 ms, which is not a significant share of the overall request time.

On the other hand, they were able to massively increase their cache sizes, going from keeping keys for a few hours to keeping them for a couple of months.
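
A back-of-the-envelope model makes the trade-off easier to see. In the sketch below, the 0.8 ms and 1.2 ms cache read times and the 50x miss penalty come from the talk; the 20 ms base request time, the single cache read per request, and both hit rates are made-up numbers chosen only for illustration.

```ruby
# Expected request time (ms) for a given cache hit rate.
# base_ms and the hit rates used below are hypothetical; only the cache read
# times (0.8 ms vs 1.2 ms) and the 50x miss penalty come from the talk.
def expected_request_ms(hit_rate:, cache_read_ms:, base_ms: 20.0, miss_penalty: 50)
  hit_ms  = base_ms + cache_read_ms                 # request served from cache
  miss_ms = base_ms * miss_penalty + cache_read_ms  # request that has to rebuild everything
  hit_rate * hit_ms + (1 - hit_rate) * miss_ms
end

# Smaller, faster cache with a lower (hypothetical) hit rate.
redis_like = expected_request_ms(hit_rate: 0.90, cache_read_ms: 0.8)
# Larger, slower cache with a higher (hypothetical) hit rate.
solid_like = expected_request_ms(hit_rate: 0.95, cache_read_ms: 1.2)

puts format("Redis-like: %.1f ms, Solid Cache-like: %.1f ms", redis_like, solid_like)
```

With these assumed numbers, the extra 0.4 ms per read is dwarfed by the time saved from fewer misses. Whether that holds for a real app depends entirely on how much the hit rate actually improves.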

What are other caveats?

  1. In order to keep the performance impact at only 25%-50%, they were forced to implement only FIFO for key expiration, not LFU or LRU, which are what Redis uses. This means that to get equivalent cache hit rates, they needed a much larger cache.
  2. They are not using their main database for caching; they are running multiple sharded databases dedicated only to caching.
  3. The gem uses a routine to expire keys after a specific amount of time, which you have to set manually, or after a certain number of keys. It cannot, however, expire keys based on the size used on disk.
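
To see why FIFO needs more room than LRU for the same hit rate, here is a toy simulation. It is not how Solid Cache or Redis are implemented; `simulate_hits` and the workload (one hot key read between a stream of one-off keys) are hypothetical and deliberately simple.

```ruby
# Simulates a fixed-capacity cache and counts hits for a sequence of keys.
# :fifo evicts the oldest inserted key; :lru evicts the least recently read key.
def simulate_hits(keys, capacity:, policy:)
  store = {} # Ruby hashes keep insertion order, so the first key is the oldest
  hits = 0
  keys.each do |key|
    if store.key?(key)
      hits += 1
      if policy == :lru
        store.delete(key)
        store[key] = true # re-insert to mark the key as most recently used
      end
    else
      store.delete(store.keys.first) if store.size >= capacity
      store[key] = true
    end
  end
  hits
end

# Hypothetical workload: a hot key read between 50 keys that are never read again.
workload = (1..50).flat_map { |i| ["hot", "cold-#{i}"] }

fifo_small = simulate_hits(workload, capacity: 3,  policy: :fifo)
lru_small  = simulate_hits(workload, capacity: 3,  policy: :lru)
fifo_large = simulate_hits(workload, capacity: 30, policy: :fifo)

puts "FIFO, 3 slots:  #{fifo_small} hits"
puts "LRU, 3 slots:   #{lru_small} hits"
puts "FIFO, 30 slots: #{fifo_large} hits"
```

Under FIFO the hot key is regularly pushed out by keys nobody will ask for again, while LRU keeps it because it was just read. Giving FIFO a much larger cache closes most of the gap, which matches the trade-off described in the first caveat.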

Should you be using this gem?

If your database is under light load (e.g., a small app), sure, give it a try. There’s a good chance that your cache is small enough to stay in the database’s in-memory cache, so the disk speed won’t have much of an impact, and it’s one less service to worry about failing or going into read-only mode for scheduled maintenance.
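
Trying it out is mostly a configuration change once the gem is installed and its migrations have run. The fragment below is only a sketch based on our reading of the gem's README: `:solid_cache_store`, `max_age`, and `max_entries` are the names we understand it to use, the values are hypothetical, and where these options live may differ between gem versions, so check the README for the version you install.

```ruby
# config/environments/production.rb
# Sketch only: option names follow our reading of the gem's README and
# may differ between versions. The values are hypothetical.
Rails.application.configure do
  config.cache_store = :solid_cache_store,
    max_age: 1.week.to_i,    # expire entries older than this (seconds)
    max_entries: 1_000_000   # expire the oldest entries beyond this count
end
```

These two limits are the manual expiry settings mentioned in the caveats, so it is worth picking values that keep the cache table within the disk space you can spare.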

If your database setup is similar to 37signals’ (directly attached NVMe drives, hand-tuned by someone who knows what they are doing), you can test this with some confidence that it will work. Just remember this was probably tested mostly with HTML fragments, not complex objects.

If your database is managed by a third party (AWS, GCP, etc.), you will have to test this carefully. If your primary database has enough capacity to handle the extra load, then it might be worth considering. If not, managed database services tend to be more expensive than managed Redis services, so the economics might favor a larger Redis instance over a second database instance.

In our case, our database is managed by Google, but our Redis is hosted on a normal Compute Engine instance, so it’s cheaper to increase the Redis instance size. On the other hand, we don’t do much fragment caching (we never had much luck with it), so our cache is small and might fit in the primary DB. Which means we will have to try it and see what happens.