Multiple Database Connections

Hi all,

We currently have a potentially large scale rails project. There is
one table (call it 'updates') that will potentially have somewhere in
the region of 2500 records created every minute (ie 200 million
records every 6 months). The entries in this table belong to
different accounts (about 50 accounts).

So a possible solution is to split each customer account up and give
them their own 'updates' table. This would reduce the number of
records being created in different tables. This would mean hijacking
the rails ActiveRecord::Base connection mechanism, overriding the
database connector, and redirecting the connection to the appropriate
database, depending upon the account the request is coming from.

My worry is in caching/pooling connections. If Account A calls action
"get_updates", Rails connects to database "A", then Account B calls
action "get_updates", Rails connects to database "B"... THEN Account A
comes back and calls action "get_updates": the first connection to
database "A" will presumabely have been lost. Is this correct? If
so, is there a way around this? The problem would be that many many
connections would be opened once, and never reused which is a
potential nightmare.

Has anyone else had any experience in this multiple databases
approach? The alternative is to create a new "updates" table in the
database for each customer, call them something like "updates_a",
"updates_b" etc.. However doing this approach it looks likely that
the database will grow to be very large, and scalability may become a
problem?

Thanks for any thoughts on this.

Just realised that Rails does not perform connection pooling: so the
pooling question is of course rubbish.

But would still like to hear anybody else's thoughts on the multiple
databases approach?

Cheers

Maybe this plugin would help.

http://www.elctech.com/2007/3/8/using-and-testing-rails-with-multiple-databases

Daniel

Thanks,

It is a good concept, but really I don't need to change the databse
for all my models in each model, I can just put a before_filter on the
ApplicationController, because all models will point at a certain DB.

Has no one had any experience of using Multiple/Customer Databases in
a single rails app? Is it to be discouraged?

Thanks in advance for anybody's replies :slight_smile:

Have you actually /tried/ this and found it to be a problem? You
could spend a month working on all this multiple connection stuff,
only to find out the bottleneck is somewhere else. Create a test
script that submits similarly sized records, 2500 per minute and see
if it works. If you use script/runner you can even run it in the
rails framework with an activerecord model.

If you split each customer onto a different database, are the
databases going to be on different machines? If not, it seems like
you would run into the same problems and perhaps even faster, due to
memory, disk and network congestion. If the customer databases would
be on different machines, why not run a separate rails installation
for each customer? Then you just set the db config for each rails
installation, rather than all this multiple connection hacking.

~Rusty

Rusty,

Unfortunately we can't deploy a new rails application for each
customer; if we could that would be great but we can't.

Inserts are not the problem: we have run spike tests and a pack of
mongrels and MySQL can easily handle 2500 inserts a minute. The
problem is in retrieval of data.

Potentially, after a month, the Updates table could have 3 million
records. Each record belongs to a tenant, and doing a search for all
the Updates for a particular tenant on a 3 million big recordset takes
13 seconds. This is using 50 tenants. Using the multiple databases
approach, the same select for a tenant's Updates takes about 0.3
seconds. After 6 months, you are looking at holding 20 million
records, with average searches just to get a list of Updates taking
over a minute and a half (that's with some careful indexing too). Now
obviously partitioning data is one way to reduce this lookup time: in
rails partitioning data would mean splitting up the Updates table
either into tenants or by date...but from a scalability point of view
is this desirable? Is it not better to tackle the multiple databases
approach as a means of improving search times and creating an
architecture that scales 'automatically'?

Thanks in advance for anyone's thoughts

Potentially, after a month, the Updates table could have 3 million
records. Each record belongs to a tenant, and doing a search for all
the Updates for a particular tenant on a 3 million big recordset takes
13 seconds. This is using 50 tenants. Using the multiple databases
approach, the same select for a tenant's Updates takes about 0.3
seconds.

I would be curious to see how PostgreSQL 8.2 handles that dataset.

After 6 months, you are looking at holding 20 million
records, with average searches just to get a list of Updates taking
over a minute and a half (that's with some careful indexing too). Now
obviously partitioning data is one way to reduce this lookup time: in
rails partitioning data would mean splitting up the Updates table
either into tenants or by date...but from a scalability point of view
is this desirable? Is it not better to tackle the multiple databases
approach as a means of improving search times and creating an
architecture that scales 'automatically'?

If retrieval is the issue, I would side with splitting the table
rather than the entire database. What is gained from multiple
databases over splitting the table? It seems like additional overhead
and complexity with performance similar to splitting the table.

What do you mean by automatically? How is either method more
'automatic' than the other?

~Rusty

What do you mean by automatically? How is either method more
'automatic' than the other?

By automatic, I refer to the differences in architecture. With a
single database approach, as new tenants are added the system does not
scale out to accommodate this new data: recordsets in the schema
grow. With a multiple database approach, the system scales out
horizontally to cope with the new tenant and new data. This is a
'side-effect' as it were of using a multi-database architecture: that
horizontal scale out is 'automatically' or rather *inherently*
achieved. There is no *inherent* or 'automatic' scale out with the
single database approach.

Out of interest, are there any performance issues with having a large
number of tables in a MySQL database? Large being in the hundreds to
thousands range.

Matt

matt wrote the following on 12.08.2007 21:11 :

What do you mean by automatically? How is either method more
'automatic' than the other?
    
By automatic, I refer to the differences in architecture. With a
single database approach, as new tenants are added the system does not
scale out to accommodate this new data: recordsets in the schema
grow. With a multiple database approach, the system scales out
horizontally to cope with the new tenant and new data. This is a
'side-effect' as it were of using a multi-database architecture: that
horizontal scale out is 'automatically' or rather *inherently*
achieved. There is no *inherent* or 'automatic' scale out with the
single database approach.

Out of interest, are there any performance issues with having a large
number of tables in a MySQL database? Large being in the hundreds to
thousands range.
  
The only one I remember from experience is slowness of the command line
client startup unless you tell it not to autocomplete table and column
names (by default it lists all tables and column on startup to provide
autocompletion).
The database itself doesn't seem affected (this was on a mix of MyIsam
and InnoDB tables, most tables where MyIsam).

Lionel

Actually your better approach may be to use oracle 10g
It can do this "break" down behind the schenes, running
a single table across a "rack" of machines.
Then ror and your code does not have to worry about it.

You are making assumptions which are most likely not correct. Without
knowing the data structure, usage patterns, and lifespan it's
difficult to give any specific advice, but multiple databases is most
likely not the best solution. Table partitioning comes to mind as a
possible solution. A lot depends on the overall usage
characteristics. It also sounds like the data might have a relatively
short lifespan? I would also suggest taking a look at postgresql
instead of mysql for something like this. Mvcc architectures often do
better under heavy insert/update activity.

Chris

Hi Chris,

Thanks for your response. Do you have any specific reasons why
multiple databases may not be the best approach?

Cheers!

The main reason is that it's usually just not necessary. The other
reason is that it's a management nightmare to partition data like that
as you scale up. The amount of data is usually not a major issue,
it's the usage patterns. If you do need to partition for whatever
reason, first look at the partitioning features of the database.

Also, the one thing you are leaving out that makes it difficult to
give any specifics is the overall usage patterns. What kind of
selects will you be doing on the data?

Chris

Chris

It would normally be doing a select to get a list of Updates by
tenant, for the past day (these would potentially be done every minute
for each tenant user logged in). Then, when reporting was required,
there would be doing a select all per tenant.

Usage patterns would mean that for potentially 100 - 500 concurrent
users (100 typical case, 500 worst case) each user would be doing a
( Select where tenant_id="" and date="" ) every minute or so.

Does that clarify things a bit?

Thanks for your response

Matt

snacktime wrote:

Hi Chris,

Thanks for your response. Do you have any specific reasons why
multiple databases may not be the best approach?

The main reason is that it's usually just not necessary. The other
reason is that it's a management nightmare to partition data like that
as you scale up. The amount of data is usually not a major issue,
it's the usage patterns. If you do need to partition for whatever
reason, first look at the partitioning features of the database.

Also, the one thing you are leaving out that makes it difficult to
give any specifics is the overall usage patterns. What kind of
selects will you be doing on the data?

It may or may not be necessary for the original poster's situation but managing connections to multiple databases is something that very large Web applications need to do in order to scale the DB backend.

Current best practices for scaling the DB portion of a Web app include creating multiple read-only slaves through replication so the "master" database(s) only has to deal with the writes. This requires the app to be aware of "read-only" database connections and "write-only" database connections.

Another best practice that requires being able to manage multiple database connection is "data federation" where a very large "logical" dataset is split among multiple databases and servers.

Sites like eBay, Flickr and LiveJournal use both of these techniques to scale their database backends.

In that case I'd probably take a look at table partitioning by date
range. Implementation depends on the database, I'm most familiar with
postgresql not sure if it works like this with mysql.

The basic idea is that you create tables named by a date range, say by
month. Then you use something like a rule to rewrite the queries to
the appropriate table. Probably a good rule of thumb for partitioning
is whatever size is small enough so you can keep the indexes in
memory. It's going to take some time to create the rules and auto
create the tables, etc.., but it's much less work in the long run then
multiple databases and it's transparent if done correctly.

That's just a general idea though, there are several variations on
that, and they will all be database specific. Where you have
complicated selects that need to hit the database, then partitioning
can work great. But you might also look at some creative ways of
caching since your selects are fairly simple and not even worry about
partitioning for now. For example, on each insert/update do a
memcache set, and have another key that you increment as a counter.
Then on select get the counter key which contains the number of
entries for the date range, and then a multi get to grab all the data.
Your keys would use a naming scheme of some type so you know which
ones to get.

In any case I think you will find a more elegant solution then using
multiple databases. I've gone that route before and regretted it.

Chris