Reducing downtime

We have our RoR apps on cloud servers that can be blown away and rebuilt with the aid of Chef. The only volatile data on the server is the database, which we back up periodically and pull off to our backup server.

If we back up the database once a day, then should things go wrong and we restore the database from backup, we will be, on average, 12 hours out of date. At worst (if the system crashes just before the backup is made) we could be 24 hours out of date.

To reduce the time we lose when we restore a backup we could of course just take more frequent backups. But this would start to become a performance issue, not to mention pulling the backups off the server will start to eat into our bandwidth.

The final additional problem is that we are running >25 different applications, each with their own database (some small, some huge). We tried replication, but the server we seemed to need to replicate to was more of a monster than we could afford, and we can’t afford to run >25 duplicate servers either.

Has anyone tackled this problem?

I was imagining an extension to ActiveRecord that created a journal / log file of changes such as:

UPDATE Event#1234 start_time="2013-07-03"

UPDATE Event#3453 status="completed", finished_at="2013-07-03 12:15:00"

DELETE Ticket#2222

CREATE Ticket#3453 status="opened", owner="fred"

These could be replayed, once the backup had been restored, to catch up with the changes that happened since the backup was made.
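
To make that concrete, here is a minimal sketch of how such a journal might be written from ActiveRecord callbacks. Everything here (the ChangeJournal module name, the journal path, the JSON-lines format) is hypothetical and untested, just an illustration of the idea:

```ruby
# Hypothetical sketch: append one JSON line per change to a journal file.
# The ChangeJournal name, file path and format are made up for illustration.
require "json"

module ChangeJournal
  JOURNAL_PATH = Rails.root.join("log", "changes.journal")

  def self.included(model_class)
    # Callback blocks run in the context of the record being saved,
    # so `self`, `attributes` and `changes` all refer to that record.
    model_class.after_create  { ChangeJournal.write("CREATE", self, attributes) }
    # Inside after_update, `changes` still holds {attr => [old, new]} on
    # Rails of this era; newer Rails would use `saved_changes` instead.
    model_class.after_update  { ChangeJournal.write("UPDATE", self, changes) }
    model_class.after_destroy { ChangeJournal.write("DELETE", self, {}) }
  end

  def self.write(action, record, changes)
    entry = {
      at:      Time.now.utc,
      action:  action,
      model:   record.class.name,
      id:      record.id,
      changes: changes
    }
    # One line appended per change; real use would need locking, and the
    # journal would have to be rotated whenever a fresh backup is taken.
    File.open(JOURNAL_PATH, "a") { |f| f.puts(entry.to_json) }
  end
end

# Mixed into each model you want journalled:
# class Ticket < ActiveRecord::Base
#   include ChangeJournal
# end
```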

I am assuming that any idea that I have someone else has already had and probably implemented (unless it’s completely insane - in which case some indication as to why it won’t work would be nice).

Any thoughts?

Peter Hickman wrote in post #1114323:

To reduce the time we lose when we restore a backup we could of course just take more frequent backups. But this would start to become a performance issue, not to mention pulling the backups off the server will start to eat into our bandwidth.

Any thoughts?

A robust solution is to use database replication and then run your backups off the replica. This way you have at least two servers that are kept synchronized constantly. Backups of the replica will also not impact the performance of the application accessing the source database.

http://dev.mysql.com/doc/refman/5.0/en/replication.html

We tried that (we use PostgreSQL and Londiste), but when it decided to break it stayed broken. I spent weeks trying to get things to work again once it failed. That was with only 2 databases; what would happen with the >25 databases that we have?

Unless someone dumps a large amount of cash in my lap I have to improve the
situation we are in with what we have to hand rather than throwing money
around.

You really only have two options here, if you ask me: learn to do live snapshots so you can roll back, or learn to deploy early and often with small changes that will not drastically break anything and can be easily rolled back if something bad happens. The latter is what some companies do; the former is a complicated mixture of how you want to go about it and whether you are doing it right.

Sorry to hear you spent some fruitless time there. PostgreSQL supports
various kinds of replication - both internal streaming and Londiste work
very well for many people, so those technology choices are good ones.

Probably trite to say time = money; I do recognise that the functional
relationship between time and money differs for everybody. In general the
open source ecosystem does have a place for companies that specialise in
building, maintaining and supporting system software to assist in cases
like this.

Deploying is not the problem. It's things like the file system becoming corrupted, which can happen at any time. In fact this is what prompted this post in the first place: the file system on a VM became corrupt and we had to wipe and rebuild.

"Live with it" is not acceptable either. There is something better than the
current situation. Even if it is not perfect it can be better.

Logging the changes from ActiveRecord by using the before_* and after_* hooks might provide an ad hoc transaction journal that could be replayed against a backup to bring things up to date, at least for databases that experience a low level of activity.
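
As a companion sketch, the replay side might look something like this - again purely hypothetical, assuming the JSON-lines journal format from the earlier post, and glossing over ordering, callbacks and mass-assignment protection:

```ruby
# Hypothetical replay script (e.g. a rake task run after restoring the backup):
# reads the journal and re-applies each change recorded since the dump was taken.
require "json"

File.foreach(Rails.root.join("log", "changes.journal")) do |line|
  entry = JSON.parse(line)
  model = entry["model"].constantize

  case entry["action"]
  when "CREATE"
    # entry["changes"] holds the full attributes hash captured at create time
    model.create!(entry["changes"])
  when "UPDATE"
    record = model.find_by_id(entry["id"])
    next unless record
    # the journal stores [old, new] pairs; we only want the new values
    new_values = Hash[entry["changes"].map { |attr, (_old, new)| [attr, new] }]
    record.update_attributes!(new_values)
  when "DELETE"
    # class-level delete skips callbacks, which is what we want on replay
    model.delete(entry["id"])
  end
end
```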

Sorry to hear you spent some fruitless time there. PostgreSQL supports
various kinds of replication - both internal streaming and Londiste work
very well for many people, so those technology choices are good ones.

When it was working (and it saved us on one occasion) it worked like a charm. But then one database started to backlog and nothing I could do would fix it, including removing and reinstalling Londiste. Given that this happened when we had only 2 databases, I wasn't prepared to commit all our other databases to it. A false sense of security is not security at all.

Probably trite to say time = money; I do recognise that the functional
relationship between time and money differs for everybody. In general the
open source ecosystem does have a place for companies that specialise in
building, maintaining and supporting system software to assist in cases
like this.

Well, I have no money - certainly not the sort of money that would be needed to pay anyone else to solve this problem for me. Time I do have, so I guess that I will have to see what sort of solution I can come up with.

I've never worked with PostgreSQL in production, but 'Warm Standby'
(http://wiki.postgresql.org/wiki/Warm_Standby) sounds appropriate.
Did you try that and have issues?

And it appears to me that Warm Standby is basically copying over
binary transaction files, which presumably you could store on some
other system (or even e.g. S3) without actually running an alternate
DB server.
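
For what it's worth, that is roughly how it works: continuous WAL archiving is switched on in postgresql.conf, and the archive_command can be anything that ships the segment files off the box. The local path below is just a placeholder for whatever pushes the files to a backup server or S3:

```
# postgresql.conf (PostgreSQL 9.x era) - continuous WAL archiving
wal_level = archive                 # write enough WAL detail to archive from
archive_mode = on
# %p is the full path of the finished WAL segment, %f its file name;
# replace the cp with whatever ships the file to the backup server or S3
archive_command = 'cp %p /mnt/backup/wal_archive/%f'
```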

Just thinking out loud :)

Scaling on a budget - so much fun, eh? Good luck!

Those communities would welcome your contributions via bug reports
and/or documentation improvements.

Thanks very much.