Here's an outline for a solution for supporting multiple connections
to duplicate databases. I'm not an expert in production environments,
so whilst the tutorial works, I'm not sure how it will play out in
"the real world".
I've pinged Alex @ twitter so hopefully he'll have some feedback too.
I like the idea behind this a lot. Is there any reason this couldn't
be implemented without requiring changes to the way models are
called? Right now the magic_multi_connections has you do
conn::Person.find(:all). Couldn't you just use your random clone
selection during ActiveRecord's 'execute' method?
I haven't dug too deep in your code but I'm curious if there's a
reason this would be tough.
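To make the suggestion concrete, here is a toy sketch (not the real ActiveRecord internals, and `FakeConn` is invented) of routing at the `execute` level: SELECT statements go to a random clone, everything else to the master.

```ruby
# Stand-in for a real adapter connection; it just labels the SQL it runs.
FakeConn = Struct.new(:name) do
  def execute(sql)
    "#{name}: #{sql}"
  end
end

class ConnectionRouter
  def initialize(master, clones)
    @master = master
    @clones = clones
  end

  # The idea from the thread: hook the single choke point every query
  # passes through, inspect the SQL, and delegate to a connection.
  def execute(sql)
    conn = sql =~ /\A\s*select/i ? @clones.sample : @master
    conn.execute(sql)
  end
end

router = ConnectionRouter.new(
  FakeConn.new("master"),
  [FakeConn.new("clone1"), FakeConn.new("clone2")]
)
router.execute("UPDATE people SET name = 'x'")  # => "master: UPDATE people SET name = 'x'"
router.execute("SELECT * FROM people")          # served by clone1 or clone2
```

This keeps `conn::Person.find(:all)` out of application code entirely, at the cost of not letting the caller choose a connection, which is exactly the objection raised next.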
I’m lazy - here’s a cut-and-paste of an answer from the comments on the article:
"that would be great if you didn’t care which requests went to which
connection. I think some apps might want write requests to go to the
master DB, and read requests (select) to be distributed to the slaves.
“Though perhaps we could use your idea and pass a :connection option in the crud methods; or perhaps :connection_group.”
I might investigate implementing these, just for laughs and giggles.
I'm not really feeling it. Connecting to multiple database instances
to provide greater database scaling should be transparent. It seems to
me that *that* was Alex's biggest gripe. I'd rather see the
transparency either a.) at the adapter level or b.) through the use of
load balancing (load balancing is great for non-HTTP traffic).
If you are trying, on the other hand, to connect different models to
different databases, or even the same model to different databases at
different times, then there are already patterns to do that (which you
know.)
Perhaps the following database.yml syntax cleanly describes the
desired usage of a hidden connection pool with different CRUD
operations for an assumed one master-many slaves setup (one master
only being a subset of this).
production:
  adapter: mysql
  ...
  read_only:
    clone1:
      ...
    clone2:
      ...
We'd then use cloneN for select calls when read_only is specified.
If people think this is nice syntax for specifying multiple
connections, I'll go and implement it.
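One possible way to consume that syntax (the clone hosts below are invented for illustration): everything under read_only becomes a candidate spec for SELECTs, and anything else keeps using the master spec.

```ruby
require 'yaml'

# Parse a database.yml fragment in the proposed syntax.
config = YAML.load(<<~YML)
  production:
    adapter: mysql
    database: app
    read_only:
      clone1:
        host: 10.0.0.1
      clone2:
        host: 10.0.0.2
YML

prod   = config["production"]
clones = (prod["read_only"] || {}).values

# For a select call, merge a random clone's settings over the master's;
# with no read_only section we simply fall back to the master spec.
read_spec = clones.empty? ? prod : prod.merge(clones.sample)
```

Everything the clone doesn't override (adapter, database name, credentials) is inherited from the production entry, which matches the "..." shorthand in the example.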
production:
  adapter: mysql
  ...
  read_only:
    clone1:
      ...
    clone2:
      ...
This is nice. I'm also wondering... how hard would it be to do the connection pool trick for multi-master?
If that makes any sense from a load-balancing POV (I think it does for highly write-intensive apps) and it is manageable in the code, then I'd like to be able to specify the role (write/read/both) for each node.
E.g.
production:
  adapter: mysql
  cluster:
    node1:
      host: X.Y.Z.1
      role: write
    node2:
      host: X.Y.Z.2
      role: read
    node3:
      host: X.Y.Z.3
      role: read
Just to rebut my own idea - I don't know all the permutations that
uber-production cluster setups use, nor do I know people who use
them. So the chance of me making an arse of this multiple-connections
business is high.
Nonetheless, it occurred to me that some slaves might be read-write,
so we'd need some syntax for that too.
There is also the case where you have federated data, so you'd have:
db1: users 1 - 100k
db2: users 101k - 200k
db3: users 201k - 300k
etc
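That federated layout could be expressed as a simple range-to-database map (the connection names below are hypothetical):

```ruby
# Hypothetical shard map: each range of user ids lives on its own database.
SHARDS = {
  (1..100_000)       => :db1,
  (100_001..200_000) => :db2,
  (200_001..300_000) => :db3,
}.freeze

# Return the connection name responsible for a given user id.
def shard_for(user_id)
  SHARDS.each { |range, db| return db if range.cover?(user_id) }
  raise ArgumentError, "no shard for user #{user_id}"
end

shard_for(150_000)  # => :db2
```

The hard part isn't the lookup, of course; it's that every query touching users has to be routed through something like this, and cross-shard joins stop working.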
I don't think you can possibly handle all the different cluster setups
out there. As long as the developer can manually pick the
connection in the code if need be, it will be clean for those with
simple setups (read-write to master, read-only slaves), and uglier but
still manageable for the massive, extremely complicated setups.
Yeah, I think any syntax should be fine for implementation - it'll get
mapped into two lists, read and write, so the read-write connections
will appear in both.
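That mapping is small enough to sketch. The node data echoes the cluster example above, except node3 is given the hypothetical role "both" so a node lands in both lists:

```ruby
# Parsed node specs, as they might come out of the cluster syntax.
nodes = {
  "node1" => { "host" => "X.Y.Z.1", "role" => "write" },
  "node2" => { "host" => "X.Y.Z.2", "role" => "read"  },
  "node3" => { "host" => "X.Y.Z.3", "role" => "both"  },
}

# Whatever the YAML looks like, it collapses to two lists.
reads  = nodes.select { |_, n| %w[read both].include?(n["role"]) }.keys
writes = nodes.select { |_, n| %w[write both].include?(n["role"]) }.keys
# reads  => ["node2", "node3"]
# writes => ["node1", "node3"]
```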
Thought: given that concurrency in Rails is obtained with additional
processes, each Rails instance/Mongrel probably doesn't need a
connection to all DBs. But might be over-assumptive.
Inserts and updates in Rails are followed by fetches to retrieve ids.
So having write-only databases might get ugly.
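A toy illustration of why: MySQL's LAST_INSERT_ID() is per-connection, so the follow-up id fetch must hit the same connection that performed the insert. The `FakeNode` class below is invented to mimic that behaviour:

```ruby
# Mimics a database node where the last generated id is only visible
# on the connection that ran the INSERT.
class FakeNode
  attr_reader :last_insert_id

  def insert(id)
    @last_insert_id = id  # only this connection learns the new id
  end
end

writer = FakeNode.new
reader = FakeNode.new

writer.insert(42)
writer.last_insert_id  # => 42
reader.last_insert_id  # => nil; another node knows nothing about it
```

So a write has to be followed by at least one read on the writing connection, which is why a strict write-only role doesn't fit Rails' insert-then-fetch pattern.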
Thought: given that concurrency in Rails is obtained with additional
processes, each Rails instance/Mongrel probably doesn't need a
connection to all DBs. But might be over-assumptive.
Might that be solved by deploying a different database.yml for different processes/machines?
Inserts and updates in Rails are followed by fetches to retrieve ids.
So having write-only databases might get ugly.
There is also the case where you have federated data, so you'd have:
db1: users 1 - 100k
db2: users 101k - 200k
db3: users 201k - 300k
etc
Hmm, federated data will need special treatment if cross-node relations are allowed; I may be naive, but it seems things could get tricky quickly.
I do absolutely agree, though, with what you say below: it should at least be possible for developers to address a specific connection.
I don't think you can possibly handle all the different cluster setups
out there. As long as the developer can manually pick the
connection in the code if need be, it will be clean for those with
simple setups (read-write to master, read-only slaves), and uglier but
still manageable for the massive, extremely complicated setups.
Yes, it should be trivial for simple setups and doable for data-partitioned clusters.
I really enjoy seeing you all getting your geek on, but isn't it a lot easier to just use a MySQL cluster?
“Standard MySQL clients: These are no different for MySQL Cluster than they are for standard (non-Cluster) MySQL. In other words, MySQL Cluster can be accessed from existing MySQL applications written in PHP, Perl, C, C++, Java, Python, Ruby, and so on.”
Ignoring Manfred’s superb suggestion for a moment to continue to exercise geekiness…
Rob Sanheim and I chatted and resolved that an initial solution would be restricted to read-write databases only, with no read-only slaves, to guarantee access to recently inserted/updated data. Perhaps this isn't necessary, but it seemed a decent initial restriction.
Back to Manfred's seemingly accurate dismissal of the whole idea of multiple connections for the purpose of master-slave databases.
So if twitter’s DB connection issues can be solved with MySQL Clusters, what is their DB problem then?
Quite true. If you are using strictly MySQL I agree.
That said, however, I'm liking Nic's ideas as they are trying to solve
the issue of connecting to any combination of Oracle / Postgres /
MySQL / ... databases very nicely.
I think the "best" approach might be to simply have a hook into the
connection selection method using a block that is yielded certain
"interesting" information about the pending operation and is expected
to return the name of the connection to use for the operation. The
ability to specify a global default and override it on specific models
would be nice, too. In general, I think that for anyone who /needs/
this type of behavior, such a setup would not be considered too
complicated; and it means you don't have to decide on a least common
denominator for the various situations that one might run in to. What
it provides is a central place to configure the behavior, as well as a
convention.
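A minimal sketch of that hook (all names here are invented, not an existing API): the app registers a block that is yielded facts about the pending operation and returns the name of the connection to use.

```ruby
# Central registry for the user-supplied connection-selection block.
class ConnectionSelector
  def self.hook(&block)
    @hook = block
  end

  # Yield the "interesting" facts about the pending operation to the
  # hook; fall back to a default connection if none is registered.
  def self.connection_for(info)
    @hook ? @hook.call(info) : :default
  end
end

# A global default: spread reads over the slaves, send writes to master.
ConnectionSelector.hook do |info|
  info[:operation] == :select ? [:slave1, :slave2].sample : :master
end

ConnectionSelector.connection_for(model: "Person", operation: :update)  # => :master
```

Per-model overrides would then just be a second, more specific registry consulted before the global block; the point is that all the routing policy lives in one configurable place.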
One point to make about MySQL Clusters is that they may not be the
silver bullet people think they are: they use the NDB storage
engine, which is in-memory (as opposed to MyISAM and InnoDB), so each
cluster node needs enough memory to hold the entire database in main
memory. That gets more expensive the bigger your database grows.