Multiple Database Connections

Hi all,

We currently have a potentially large-scale Rails project. There is one table (call it 'updates') that will potentially have somewhere in the region of 2,500 records created every minute (on the order of 200 million records every six months). The entries in this table belong to different accounts (around 50 accounts).

So a possible solution is to split the customer accounts up and give each one its own database (each with its own 'updates' table). This would keep the number of records in any one table down. It would mean hijacking the Rails ActiveRecord::Base connection mechanism, overriding the database connector and redirecting the connection to the appropriate database depending on which account the request comes from.
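A minimal sketch of what I mean, assuming config/database.yml grows one entry per account (the account keys and the helper name here are hypothetical):

    # Hypothetical helper: point ActiveRecord::Base at the database belonging
    # to a given account, assuming database.yml has entries like "account_a".
    def switch_to_account_database(account_key)
      config = ActiveRecord::Base.configurations[account_key.to_s]
      raise "No database configured for #{account_key}" unless config
      ActiveRecord::Base.establish_connection(config)
    end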

My worry is in caching/pooling connections. If Account A calls action "get_updates", Rails connects to database "A"; then Account B calls action "get_updates" and Rails connects to database "B"... THEN Account A comes back and calls "get_updates": the first connection to database "A" will presumably have been lost. Is this correct? If so, is there a way around it? The problem would be that many, many connections would be opened once and never reused, which is a potential nightmare.

Has anyone else had any experience with this multiple-databases approach? The alternative is to create a new 'updates' table in the single database for each customer, calling them something like 'updates_a', 'updates_b', etc. However, with this approach it looks likely that the database will grow to be very large, and scalability may become a problem.
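For completeness, the per-customer-table variant would look roughly like this (the suffix naming scheme is just an assumption, and note that set_table_name changes the table for the whole process, not just the current request):

    class Update < ActiveRecord::Base
      # Point this model at a per-customer table such as "updates_a".
      def self.use_account_table!(account_suffix)
        set_table_name("updates_#{account_suffix}")
      end
    end

    # e.g. Update.use_account_table!('a')
    #      Update.find(:all, :conditions => ["created_at > ?", 1.day.ago])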

Thanks for any thoughts on this.

Just realised that Rails does not perform connection pooling: so the pooling question is of course rubbish.

But I would still like to hear anybody else's thoughts on the multiple-databases approach.

Cheers

Maybe this plugin would help.

Daniel

Thanks,

It is a good concept, but I don't really need to change the database per model; I can just put a before_filter on ApplicationController, because all of the models will point at one particular DB for a given request.
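Something along these lines is what I have in mind (a sketch only; current_account_key is a hypothetical method, and switch_to_account_database is the helper sketched earlier):

    class ApplicationController < ActionController::Base
      before_filter :connect_to_account_database

      private

      # Work out which account the request belongs to (e.g. from the subdomain
      # or the logged-in user) and point ActiveRecord::Base at its database.
      def connect_to_account_database
        switch_to_account_database(current_account_key)
      end
    end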

Has no one had any experience of using multiple/per-customer databases in a single Rails app? Is it to be discouraged?

Thanks in advance for anybody's replies :)

Have you actually /tried/ this and found it to be a problem? You could spend a month working on all this multiple-connection stuff only to find out the bottleneck is somewhere else. Create a test script that submits similarly sized records at 2,500 per minute and see if it works. If you use script/runner you can even run it inside the Rails framework with an ActiveRecord model.
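Something as simple as this, run through script/runner, would do it (a sketch; the Update model and its columns are assumptions):

    # load_test.rb -- run with: script/runner load_test.rb
    require 'benchmark'

    seconds = Benchmark.realtime do
      2500.times do |i|
        Update.create!(
          :account_id => (i % 50) + 1,   # spread the rows across ~50 accounts
          :body       => "test payload #{i}"
        )
      end
    end
    puts "2500 inserts took #{seconds} seconds"

Run that repeatedly for a while and you'll know whether the insert rate is really the bottleneck.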

If you split each customer onto a different database, are the databases going to be on different machines? If not, it seems like you would run into the same problems, perhaps even sooner, due to memory, disk and network congestion. If the customer databases would be on different machines, why not run a separate Rails installation for each customer? Then you just set the db config for each Rails installation, rather than doing all this multiple-connection hacking.

~Rusty

Rusty,

Unfortunately we can't deploy a new Rails application for each customer; if we could, that would be great, but we can't.

Inserts are not the problem: we have run spike tests, and a pack of Mongrels plus MySQL can easily handle 2,500 inserts a minute. The problem is the retrieval of data.

Potentially, after a month, the Updates table could have 3 million records. Each record belongs to a tenant, and a search for all the Updates belonging to a particular tenant against a 3-million-row table takes 13 seconds (this is with 50 tenants). Using the multiple-databases approach, the same select of a tenant's Updates takes about 0.3 seconds. After 6 months you are looking at holding 20 million records, with an average search just to get a list of Updates taking over a minute and a half (and that is with some careful indexing). Obviously, partitioning the data is one way to reduce this lookup time; in Rails, partitioning would mean splitting the Updates table up either by tenant or by date. But from a scalability point of view, is that desirable? Is it not better to tackle the multiple-databases approach as a means of improving search times and creating an architecture that scales 'automatically'?
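For reference, the select in question is essentially "all of one tenant's Updates for a date range", backed by a composite index, along these lines (the exact column names here are illustrative):

    # The query is roughly:
    #   SELECT * FROM updates WHERE tenant_id = ? AND created_at > ?
    # and the 'careful indexing' is a composite index added in a migration:
    class AddTenantIndexToUpdates < ActiveRecord::Migration
      def self.up
        add_index :updates, [:tenant_id, :created_at],
                  :name => 'idx_updates_tenant_created'
      end

      def self.down
        remove_index :updates, :name => 'idx_updates_tenant_created'
      end
    end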

Thanks in advance for anyone's thoughts

Potentially, after a month, the Updates table could have 3 million records. Each record belongs to a tenant, and a search for all the Updates belonging to a particular tenant against a 3-million-row table takes 13 seconds (this is with 50 tenants). Using the multiple-databases approach, the same select of a tenant's Updates takes about 0.3 seconds.

I would be curious to see how PostgreSQL 8.2 handles that dataset.

After 6 months you are looking at holding 20 million records, with an average search just to get a list of Updates taking over a minute and a half (and that is with some careful indexing). Obviously, partitioning the data is one way to reduce this lookup time; in Rails, partitioning would mean splitting the Updates table up either by tenant or by date. But from a scalability point of view, is that desirable? Is it not better to tackle the multiple-databases approach as a means of improving search times and creating an architecture that scales 'automatically'?

If retrieval is the issue, I would side with splitting the table rather than the entire database. What is gained from multiple databases over splitting the table? It seems like additional overhead and complexity with performance similar to splitting the table.

What do you mean by automatically? How is either method more 'automatic' than the other?

~Rusty

What do you mean by automatically? How is either method more 'automatic' than the other?

By 'automatic' I am referring to the differences in architecture. With a single-database approach, as new tenants are added the system does not scale out to accommodate the new data: the recordsets in the schema simply grow. With a multiple-database approach, the system scales out horizontally to cope with each new tenant and its data. That horizontal scale-out is a side effect, as it were, of the multi-database architecture: it is achieved 'automatically', or rather *inherently*. There is no inherent or automatic scale-out with the single-database approach.

Out of interest, are there any performance issues with having a large number of tables in a MySQL database? Large being in the hundreds to thousands range.

Matt

matt wrote the following on 12.08.2007 21:11:

Out of interest, are there any performance issues with having a large number of tables in a MySQL database? Large being in the hundreds to thousands range.

The only one I remember from experience is slowness of the command-line client at startup unless you tell it not to autocomplete table and column names (by default it lists all tables and columns on startup to provide autocompletion; the -A / --no-auto-rehash option turns this off). The database itself doesn't seem affected (this was on a mix of MyISAM and InnoDB tables; most tables were MyISAM).

Lionel

Actually, your better approach may be to use Oracle 10g. It can do this breakdown behind the scenes, running a single table across a 'rack' of machines. Then Rails and your code do not have to worry about it.

You are making assumptions which are most likely not correct. Without knowing the data structure, usage patterns and lifespan, it's difficult to give any specific advice, but multiple databases is most likely not the best solution. Table partitioning comes to mind as a possibility. A lot depends on the overall usage characteristics. It also sounds like the data might have a relatively short lifespan? I would also suggest taking a look at PostgreSQL instead of MySQL for something like this; MVCC architectures often do better under heavy insert/update activity.

Chris

Hi Chris,

Thanks for your response. Do you have any specific reasons why multiple databases may not be the best approach?

Cheers!

The main reason is that it's usually just not necessary. The other reason is that it's a management nightmare to partition data like that as you scale up. The amount of data is usually not a major issue, it's the usage patterns. If you do need to partition for whatever reason, first look at the partitioning features of the database.

Also, the one thing you are leaving out that makes it difficult to give any specifics is the overall usage patterns. What kind of selects will you be doing on the data?

Chris

Chris,

It would normally be doing a select to get the list of Updates for a tenant for the past day (these would potentially be run every minute for each tenant user logged in). Then, when reporting is required, we would be doing a select of all of a tenant's records.

In terms of usage patterns, for potentially 100-500 concurrent users (100 typical case, 500 worst case), each user would be doing a (SELECT ... WHERE tenant_id = ? AND date = ?) every minute or so.
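In ActiveRecord terms, each logged-in user is effectively polling with something like this (the Update model and column names are illustrative):

    # Hypothetical polling query: one tenant's updates from the last day.
    Update.find(:all,
      :conditions => ["tenant_id = ? AND created_at > ?", current_tenant_id, 1.day.ago],
      :order      => "created_at DESC")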

Does that clarify things a bit?

Thanks for your response

Matt

snacktime wrote:

Hi Chris,

Thanks for your response. Do you have any specific reasons why multiple databases may not be the best approach?

The main reason is that it's usually just not necessary. The other reason is that it's a management nightmare to partition data like that as you scale up. The amount of data is usually not a major issue, it's the usage patterns. If you do need to partition for whatever reason, first look at the partitioning features of the database.

Also, the one thing you are leaving out that makes it difficult to give any specifics is the overall usage patterns. What kind of selects will you be doing on the data?

It may or may not be necessary for the original poster's situation, but managing connections to multiple databases is something that very large Web applications need to do in order to scale the DB backend.

Current best practices for scaling the DB portion of a Web app include creating multiple read-only slaves through replication so the "master" database(s) only has to deal with the writes. This requires the app to be aware of "read-only" database connections and "write-only" database connections.
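In ActiveRecord that usually boils down to giving the models a second connection for reads, along these lines (a sketch, assuming a "read_slave" entry in database.yml; the class names are illustrative):

    # Abstract base class whose subclasses read from a replica.
    class ReadOnlyBase < ActiveRecord::Base
      self.abstract_class = true
      establish_connection :read_slave
    end

    # Reads go through the slave-backed model...
    class UpdateReader < ReadOnlyBase
      set_table_name "updates"
    end

    # ...while writes keep going through the normal Update model on the master.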

Another best practice that requires being able to manage multiple database connections is "data federation", where a very large "logical" dataset is split among multiple databases and servers.
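Federation in practice just means routing each logical key to one of N databases, e.g. (a toy sketch; the shard naming is an assumption):

    NUM_SHARDS = 4  # assumption

    # Map a tenant id to one of the shard entries (shard_0 .. shard_3)
    # assumed to exist in database.yml.
    def shard_config_for(tenant_id)
      ActiveRecord::Base.configurations["shard_#{tenant_id.to_i % NUM_SHARDS}"]
    end

    # e.g. Update.establish_connection(shard_config_for(current_tenant_id))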

Sites like eBay, Flickr and LiveJournal use both of these techniques to scale their database backends.

In that case I'd probably take a look at table partitioning by date range. The implementation depends on the database; I'm most familiar with PostgreSQL and not sure whether it works like this in MySQL.

The basic idea is that you create tables named by date range, say one per month. Then you use something like a rule to rewrite queries to the appropriate table. A good rule of thumb for partition size is whatever is small enough that you can keep the indexes in memory. It will take some time to create the rules, auto-create the tables and so on, but it's much less work in the long run than multiple databases, and it's transparent if done correctly.
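In PostgreSQL (8.x-style inheritance partitioning) the idea looks roughly like this; table and column names are illustrative:

    # Illustrative migration: a child table for one month, inheriting from
    # updates, plus a rule redirecting matching inserts into it.
    class CreateUpdatesPartitionFor200708 < ActiveRecord::Migration
      def self.up
        execute <<-SQL
          CREATE TABLE updates_2007_08 (
            CHECK (created_at >= DATE '2007-08-01' AND created_at < DATE '2007-09-01')
          ) INHERITS (updates);
        SQL
        execute <<-SQL
          CREATE INDEX idx_updates_2007_08_tenant
            ON updates_2007_08 (tenant_id, created_at);
        SQL
        execute <<-SQL
          CREATE RULE updates_insert_2007_08 AS
            ON INSERT TO updates
            WHERE (created_at >= DATE '2007-08-01' AND created_at < DATE '2007-09-01')
            DO INSTEAD INSERT INTO updates_2007_08 VALUES (NEW.*);
        SQL
      end

      def self.down
        execute "DROP TABLE updates_2007_08 CASCADE;"
      end
    end

With constraint_exclusion turned on, selects constrained by created_at only touch the relevant child tables.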

That's just a general idea, though; there are several variations on it, and they will all be database specific. Where you have complicated selects that need to hit the database, partitioning can work great. But since your selects are fairly simple, you might also look at some creative caching and not worry about partitioning at all for now. For example, on each insert/update do a memcache set, and also increment a counter key. Then on select, get the counter key (which holds the number of entries for the date range) and do a multi-get to grab all the data. Your keys would use a naming scheme of some sort so you know which ones to fetch.
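A rough sketch of that scheme with the memcache-client gem (the key naming and the Update columns are assumptions):

    require 'memcache'

    CACHE = MemCache.new('localhost:11211')

    # On insert: store the record under a numbered key and bump a per-tenant,
    # per-day counter.
    def cache_update(update)
      day     = update.created_at.strftime('%Y%m%d')
      counter = "updates:#{update.tenant_id}:#{day}:count"
      CACHE.add(counter, '0', 0, true)        # raw value so incr works
      n = CACHE.incr(counter)
      CACHE.set("updates:#{update.tenant_id}:#{day}:#{n}", update.attributes)
    end

    # On read: fetch the counter, then multi-get all the numbered keys.
    def cached_updates(tenant_id, day)
      count = CACHE.get("updates:#{tenant_id}:#{day}:count", true).to_i
      return [] if count.zero?
      keys = (1..count).map { |n| "updates:#{tenant_id}:#{day}:#{n}" }
      CACHE.get_multi(*keys).values
    end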

In any case, I think you will find a more elegant solution than using multiple databases. I've gone that route before and regretted it.

Chris