Reducing downtime

We have our RoR apps on cloud servers that can be blown away and rebuilt with the aid of Chef. The only volatile data on the server is the database, which we back up periodically and pull off to our backup server.

If we back up the database once a day, then should things go wrong and we restore the database from backup, we will be, on average, 12 hours out of date. At worst (if the system crashes just before the backup is made) we could be 24 hours out of date.

To reduce the time we lose when we restore a backup we could of course just take more frequent backups. But this would start to become a performance issue, not to mention pulling the backups off the server will start to eat into our bandwidth.

The final additional problem is that we are running >25 different applications, each with their own database (some small, some huge). We tried replication, but the server we seemed to need to replicate to was more of a monster than we could afford, and we can’t afford to run >25 duplicate servers either.

Has anyone tackled this problem?

I was imagining an extension to ActiveRecord that created a journal / log file of changes such as:

UPDATE Event#1234 start_time="2013-07-03"

UPDATE Event#3453 status="completed", finished_at="2013-07-03 12:15:00"

DELETE Ticket#2222

CREATE Ticket#3453 status="opened", owner="fred"

These could be replayed, once the backup had been restored, to catch up with the changes that happened since the backup was made.
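
To make that concrete, here is a minimal sketch of how such a journal might be written from ActiveRecord callbacks. Everything here (the ChangeJournal module name, the journal path, the JSON-lines format) is hypothetical and untested, just an illustration of the idea:

```ruby
# Hypothetical sketch: append one JSON line per change to a journal file.
# The ChangeJournal name, file path and format are made up for illustration.
require "json"

module ChangeJournal
  JOURNAL_PATH = Rails.root.join("log", "changes.journal")

  def self.included(model_class)
    # Callback blocks run in the context of the record being saved,
    # so `self`, `attributes` and `changes` all refer to that record.
    model_class.after_create  { ChangeJournal.write("CREATE", self, attributes) }
    # Inside after_update, `changes` still holds {attr => [old, new]} on
    # Rails of this era; newer Rails would use `saved_changes` instead.
    model_class.after_update  { ChangeJournal.write("UPDATE", self, changes) }
    model_class.after_destroy { ChangeJournal.write("DELETE", self, {}) }
  end

  def self.write(action, record, changes)
    entry = {
      at:      Time.now.utc,
      action:  action,
      model:   record.class.name,
      id:      record.id,
      changes: changes
    }
    # One line appended per change; real use would need locking, and the
    # journal would have to be rotated whenever a fresh backup is taken.
    File.open(JOURNAL_PATH, "a") { |f| f.puts(entry.to_json) }
  end
end

# Mixed into each model you want journalled:
# class Ticket < ActiveRecord::Base
#   include ChangeJournal
# end
```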

I am assuming that any idea that I have someone else has already had and probably implemented (unless it’s completely insane - in which case some indication as to why it won’t work would be nice).

Any thoughts?

Peter Hickman wrote in post #1114323:

To reduce the time we lose when we restore a backup we could of course just take more frequent backups. But this would start to become a performance issue, not to mention pulling the backups off the server will start to eat into our bandwidth.

Any thoughts?

A robust solution is to use database replication and then run your backups off the replica. This way you have at least two servers that are kept synchronized constantly. Backups of the replica will also not impact the performance of the application accessing the source database.

http://dev.mysql.com/doc/refman/5.0/en/replication.html

We tried that (we use PostgreSQL and Londiste), but when it decided to break it stayed broken. I spent weeks trying to get things to work again once it failed. That was with only 2 databases; what would happen with the >25 databases that we have?

Unless someone dumps a large amount of cash in my lap I have to improve the
situation we are in with what we have to hand rather than throwing money
around.

You really only have two options here, if you ask me: learn to do live snapshots so you can roll back, or learn to deploy early and often with small changes that will not drastically break anything and can be easily rolled back if something bad happens. The latter is what some companies do; the former is a complicated mixture of how you want to go about it and whether you are doing it right.

Sorry to hear you spent some fruitless time there. PostgreSQL supports
various kinds of replication - both internal streaming and Londiste work
very well for many people, so those technology choices are good ones.

Probably trite to say time = money; I do recognise that the functional
relationship between time and money differs for everybody. In general the
open source ecosystem does have a place for companies that specialise in
building, maintaining and supporting system software to assist in cases
like this.

Deploying is not the problem. It's things like the file system becoming corrupted, which can happen at any time. In fact this is what prompted this post in the first place: the file system on a VM became corrupt and we had to wipe and rebuild.

"Live with it" is not acceptable either. There is something better than the
current situation. Even if it is not perfect it can be better.

Logging the changes from ActiveRecord by using the before_* and after_* hooks might provide an ad hoc transaction journal that could be replayed against a backup to bring things up to date, at least for databases that experience a low level of activity.
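
As a companion sketch, the replay side might look something like this - again purely hypothetical, assuming the JSON-lines journal format from the earlier post, and glossing over ordering, callbacks and mass-assignment protection:

```ruby
# Hypothetical replay script (e.g. a rake task run after restoring the backup):
# reads the journal and re-applies each change recorded since the dump was taken.
require "json"

File.foreach(Rails.root.join("log", "changes.journal")) do |line|
  entry = JSON.parse(line)
  model = entry["model"].constantize

  case entry["action"]
  when "CREATE"
    # entry["changes"] holds the full attributes hash captured at create time
    model.create!(entry["changes"])
  when "UPDATE"
    record = model.find_by_id(entry["id"])
    next unless record
    # the journal stores [old, new] pairs; we only want the new values
    new_values = Hash[entry["changes"].map { |attr, (_old, new)| [attr, new] }]
    record.update_attributes!(new_values)
  when "DELETE"
    # class-level delete skips callbacks, which is what we want on replay
    model.delete(entry["id"])
  end
end
```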

Sorry to hear you spent some fruitless time there. PostgreSQL supports
various kinds of replication - both internal streaming and Londiste work
very well for many people, so those technology choices are good ones.

When it was working (and it saved us on one occasion) it worked like a charm. But then one database started to backlog and nothing I could do would fix it, including removing and reinstalling Londiste. Given that this happened when we had only 2 databases, I wasn't prepared to commit all our other databases to it. A false sense of security is not security at all.

Probably trite to say time = money; I do recognise that the functional
relationship between time and money differs for everybody. In general the
open source ecosystem does have a place for companies that specialise in
building, maintaining and supporting system software to assist in cases
like this.

Well, I have no money - certainly not the sort of money that would be needed to pay anyone else to solve this problem for me. Time I do have, so I guess that I will have to see what sort of solution I can come up with.

I've never worked with PostgreSQL in production, but 'Warm Standby'
(http://wiki.postgresql.org/wiki/Warm_Standby) sounds appropriate.
Did you try that and have issues?

And it appears to me that Warm Standby is basically copying over
binary transaction files, which presumably you could store on some
other system (or even e.g. S3) without actually running an alternate
DB server.
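
For what it's worth, that is roughly how it works: continuous WAL archiving is switched on in postgresql.conf, and the archive_command can be anything that ships the segment files off the box. The local path below is just a placeholder for whatever pushes the files to a backup server or S3:

```
# postgresql.conf (PostgreSQL 9.x era) - continuous WAL archiving
wal_level = archive                 # write enough WAL detail to archive from
archive_mode = on
# %p is the full path of the finished WAL segment, %f its file name;
# replace the cp with whatever ships the file to the backup server or S3
archive_command = 'cp %p /mnt/backup/wal_archive/%f'
```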

Just thinking out loud :)

Scaling on a budget - so much fun, eh? Good luck!

Those communities would welcome your contributions via bug reports
and/or documentation improvements.

Thanks very much.