This topic is kind of fortuitous for me; I've actually been looking
for an answer to something like this, through the archives of this
group and with Google. I understand what you're saying about
optimizing prematurely, but I have been that lucky before, and
contrary to the old saw, it was not a nice problem to have, at least
for those who had to deal with it. I worked on a site where nobody had
really worried about scalability, and it came back to bite us when the
site's traffic increased a thousandfold over a very short period of
time. So I think there's a balance to be struck there. One thing I
learned from that experience is that some problems are pretty easy to
deal with after the fact: a poor algorithm can usually just be
rewritten, and caching can be added where it was missing, for
instance. But once you have actual data, problems with the db
architecture are hard to fix cleanly and quickly.
I'm wondering about a situation in which you're offering an
application to the public, letting anyone sign up to use it, with what
is conceptually their own instance of it. Take something like Blogger,
just as an example. It's pretty easy to map out a simple data design
for an individual blog... if you need to host tens of thousands of
them it gets trickier. Let's say that blogs have posts, and posts have
comments. One way to handle this is to stick all of the comments into
one large table with a foreign key that references posts (and do the
same for posts, relative to blogs). If you tune the db well, and have
enough RAM in your machine(s) to hold the entire dataset, this is
likely to scale pretty well for quite a while, but I think at some
point you may hit a wall with this approach. After all, if the average
post has 100 comments, and the average blog has 1,000 posts, and there
are 10,000 blogs, you're talking about 1,000,000,000 comment records
(which, of course, would be a nice problem to have). Beyond the
scaling question, the design is also less clean than the original "one
blog" problem: instead of just having posts, all of which belong to
the one blog in question, you now have to worry about which blog a
post belongs to, which is not as parsimonious as one (or at least I)
would ideally like.
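Just to make the one-big-table version concrete, here's roughly what I
have in mind in ActiveRecord terms (the model and table names are my
own illustration, not from any real app):

  class Blog < ActiveRecord::Base
    has_many :posts
  end

  class Post < ActiveRecord::Base
    belongs_to :blog    # posts.blog_id foreign key
    has_many :comments
  end

  class Comment < ActiveRecord::Base
    belongs_to :post    # comments.post_id foreign key
  end

With this design every comment on the site, for every blog, lands in
the single comments table, which is exactly where the billion-row
figure above comes from.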
If you're writing this from scratch, without using a framework, there
are a lot of ways to deal with this (including the one-big-table
approach), and there are trade-offs in each case (at least in each
case worth considering). If you're willing to give up portability
across databases you could, for instance, use a namespace mechanism
like Postgres's schemas, creating a new schema for each instance of a
blog (a rough sketch follows below). This has the virtue of letting
you embed the instances in an overarching schema that handles
information that is common across all blogs; for instance, commenters
might be registered under one name across all of the blogs. You could
instead create a new set of tables for each registered blog by munging
some sort of prefix onto each table name, though that would be really
ugly. You could go really far and virtualise each instance of the app,
though that would create a headache dealing with data that crossed
instances.
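Here's the sort of thing I mean by the schema approach, as a rough,
hypothetical sketch (all of the names are invented); shared data such
as commenter accounts would stay in the default public schema:

  # Carve out a Postgres schema per blog; assumes a shared
  # public.commenters table already exists.
  def create_blog_schema(name)
    conn = ActiveRecord::Base.connection
    conn.execute("CREATE SCHEMA #{name}")
    conn.execute(<<-SQL)
      CREATE TABLE #{name}.posts (
        id    serial PRIMARY KEY,
        title text,
        body  text
      )
    SQL
    conn.execute(<<-SQL)
      CREATE TABLE #{name}.comments (
        id           serial PRIMARY KEY,
        post_id      integer REFERENCES #{name}.posts (id),
        commenter_id integer REFERENCES public.commenters (id),
        body         text
      )
    SQL
  end

  # Per request you'd then point Postgres at the right instance:
  #   conn.execute("SET search_path TO #{name}, public")

Obviously you'd want to validate the name before interpolating it into
SQL like that; this is just to show the shape of the idea.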
So I guess what I'm asking is whether or not Rails has any canonical
way of dealing with this beyond the one-big-table approach. Blogger is
just an example; I'm more interested in the general answer than in
specific details about how a blogging site should/would work in Rails,
as I have no intention of trying to write a new Blogger. I'm pretty
new to Rails, and I figured that if I kept digging I'd find the answer
myself, but since it came up on this list I thought I'd get the
question into this thread.
Thanks