Marnen Laibow-Koser wrote:
Mike P. wrote:
Marnen Laibow-Koser wrote:
Mike P. wrote:
Hello,
I've decided to use multiple tables for an entity (e.g. "todo_items1,"
"todo_items2," "todo_items3," etc.), instead of just one main table
which could end up having a lot of rows (e.g. just "todo_items"). I'm
doing this to try to avoid a potential future performance drop that
could come with having too many rows in one table.
Robert already said it, and I agree: DO NOT DO THIS. Any decent DB
system is designed to handle tables in the millions of rows without
breathing hard. Just index your tables properly and make sure your
queries are efficient.
If you have to use your DB's sharding features at some point, go for it.
But don't even think of doing this in the app layer.
Best,
--
Marnen Laibow-Koser
http://www.marnen.org
marnen@marnen.org
Hi Marnen,
Thanks again for your response. I was thinking about it last night, and
did some more research and a few calculations, and I'm inclined to agree
with you and stick with the single table model. It will make things
easier and the database should be able to handle it for a while.
However, I'm going to reply below as if I'm sticking with the
multi-table model, mostly to see how I really feel about it by typing
some of my thoughts (it sounds kind of weird, but hopefully it makes
sense).
Honestly, I still think that splitting the main table into smaller ones
will have future benefits. Even though there's a bit of extra work up
front, I'd rather get this small thing over with now than let it turn
into a bigger problem later.
I think if people could just get over the "don't optimize too early"
mantra, and realize that this can't possibly be the best move for everyone,
a lot of future stress could be avoided, for both the business owner and
the customer.
Personally, I'd probably change that saying to read "don't optimize *too
much* too early." As they say, an "ounce of prevention is worth a pound
of cure," and that makes sense in the database optimization world too.
This multi-table thing isn't meant to eliminate the need for further
optimization entirely, by the way, and I'm aware that I shouldn't go
overboard with too much upfront optimization. I know that further
performance tweaks, most likely at the database layer, will probably be
needed. But it's *because* I know this that I was thinking about an
easy change I could make now that would keep performance up for longer.
Anyway...
Hi Marnen,
Thank you also for your post and advice. It actually made me think more
about this. So you're saying that even if I had millions of records in
a table, and was noticing a decrease in performance, I shouldn't split
up the tables this way?
Correct, unless there is no better way to improve things (which will
almost certainly not be the case).
Would sharding be the only option?
Depends on the DB.
It's PostgreSQL running on a single database server, in case you were
wondering.
One of the things that I keep coming back to when I think about the
single-table option is that one of the fields will be rather large, like
a blog-post in size. And I plan on indexing that to make it searchable.
Also, there are a lot of insertions and deletions. Does it still make
sense to include all of that in a single table? Or does that change any
of your advice?
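To be concrete, by "indexing that" I mean something like a PostgreSQL
full-text index. Just a sketch -- the "body" column name is made up:

class AddFullTextIndexToTodoItems < ActiveRecord::Migration
  def self.up
    # Expression-based GIN index for full-text search on the large,
    # blog-post-sized column (hypothetically named "body").
    execute <<-SQL
      CREATE INDEX index_todo_items_on_body_fts
        ON todo_items USING gin(to_tsvector('english', body))
    SQL
  end

  def self.down
    execute "DROP INDEX index_todo_items_on_body_fts"
  end
end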
That does not change any of my advice -- right up until the point where
you have actual, measurable problems. And by then, your data model may
be different anyway. Don't prematurely optimize, because you may be
optimizing the wrong things.
I see your point here. I don't think the model for this table will
change much though (I know, I know), but I do see your point. However,
from the other side, even if the model does change, I could always just
update all of those tables with the migration(s), which I'd still have
to do with the one-big-table model.
I agree that this is an application layer split, sort of; but I don't
think that this particular split is a bad thing.
On what basis do you think that? (I think you're wrong, but I want to
know your reasoning.)
Well, pretty much because it's not much different than accessing the
rest of the tables in the database.
When I want to look up a user, I go to the User table; I may have
another table for user preferences which I'd also have to access.
Looking at todo items for a user involves looking at a single table in
both cases; the main difference between the multiple-table and
single-table approaches is that there's an extra bit of indirection.
It could be as simple as:
1) Looking at the "which_todo_table" column (which would be a number,
I'm thinking)
2) Appending that number to the end of a string (so, "todo_items" << "3"
becomes "todo_items3")
So, there's an extra column in the User table that tells which table to
look at to get that user's todo items, and the app layer picks that
table when it needs to.
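In Rails terms, a rough sketch of that indirection (the column and
method names are just examples, nothing final):

class User < ActiveRecord::Base
  # "which_todo_table" is the hypothetical integer column from step 1.
  def todo_table_name
    "todo_items#{which_todo_table}"
  end
end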
I agree that the database can handle this sort of split more efficiently
than the application layer, especially when it comes to things like
sharding or partitioning; but I'm looking at the multi-table model as a
sort of "soft-partitioning." The data gets "split" up in the database
(via tables), and each index stays at roughly a fifth of the size it
would reach with a single table (assuming five tables). Similar to
partitioning, but without modifying the database in a way that Rails
doesn't know about.
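For comparison, the "hard" version of this in PostgreSQL -- the kind of
DB-level change I'm trying to put off -- would be inheritance-based
partitioning, which (as far as I understand it) looks roughly like
this, with made-up names and ranges:

class PartitionTodoItems < ActiveRecord::Migration
  def self.up
    # One child table per partition; the planner can skip partitions
    # whose CHECK constraint rules them out (constraint exclusion).
    # Routing INSERTs to the right child also needs a trigger or rule
    # on the parent table -- an extra moving part.
    execute <<-SQL
      CREATE TABLE todo_items_p1 (CHECK (user_id BETWEEN 1 AND 100000))
        INHERITS (todo_items);
    SQL
  end

  def self.down
    execute "DROP TABLE todo_items_p1"
  end
end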
So, the difference is that I'd be using Ruby/Rails to "decide" which
table to look at, instead of having a pre-set constant that's stored by
Rails and used whenever a model is accessed.
A benefit of this is that everything's at the application layer. I can
see exactly what's going on, and if I move servers or something, I won't
have to deal with the potentially complicated, database-specific
sharding stuff that can only be seen by investigating the database
layer.
It's just another table. That's how I'm looking at it.
If I did this in pure Ruby, for example, I'm guessing that the effect
would be minimal (just one class/method that looks up the correct table,
and returns the object for that table).
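Something like this is what I have in mind -- completely untested, and
all the names are made up:

# One factory method that hands back an ActiveRecord class bound to
# the right physical table for a given user.
class TodoTable
  def self.for(user)
    @classes ||= {}
    @classes[user.which_todo_table] ||= Class.new(ActiveRecord::Base) do
      set_table_name "todo_items#{user.which_todo_table}"
    end
  end
end

# Usage, with a Rails 2.3-style finder:
#   TodoTable.for(current_user).find(:all,
#     :conditions => { :user_id => current_user.id })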
But it is still inappropriate, because this can be handled more
efficiently on the DB side than on the app side.
I agree, but it's also more complicated to set up the database to
manage all of that stuff. Handling things on the DB side adds
another "moving part," so to speak. It will have to be managed, tested,
and revisited when changes are being made.
Don't get me wrong, I'm not opposed to DB-layer tweaks and
modifications, but if I can avoid or delay complicated DB-layer changes
by doing some work upfront -- especially in this case, where the
modification isn't database-level -- then I'll at least look into doing
so.
The other main benefit is that DB performance will be much more
resilient to surges of data. I'll have more time to implement those
DB-level changes, if/when needed, and the decrease in performance will
be much more gradual.
Considering that making a change such as this has such a
(theoretically) small impact on the app-layer codebase, has no effect on
the database itself, and allows the database/app to perform better for
longer - thereby delaying the slowdowns and the need for potentially
complicated, not-easily-portable changes - isn't it worth looking into?
You must have seen people who are noticing their database performance
dropping, and who are scrambling to fix their issues. They pop up in
forums quite often, no? And how do they fix it? With more servers and/or
complicated procedures; with data migrations, extra layers, and then
downtime. All the while, they're feeling pressure because the
performance lag is noticeable, and getting more noticeable as time
passes.
It happens so often, everyone knows that it's going to happen, and yet
trying to optimize a bit upfront is discouraged.
And, of course, in every DB performance upgrade push, there's another,
even more important part of the story here: the users.
For however long it takes to get these changes implemented, and
sometimes from even before the changes start, the users are noticing
these slowdowns, and possibly even time-outs. They're seeing notices on
the site about how the company is working to make the site faster, blah,
blah, blah. They don't look at it as "oh, yeah, they followed the
hard-and-fast rule about not optimizing too early, there's nothing they
could've done to prevent this, so it's okay, this is just what happens."
Instead, it looks like they didn't plan this properly, or that they
weren't prepared.
So then, my question is: if people can expect these performance issues
to happen as a site grows, why is it so frowned upon to do a little bit
of upfront work to delay that from happening -- and, once it does start
to happen, to have better performance during that period?
What's (potentially) making the
app layer bloated is all of the unused 'has_many' associations, since I'm
trying to stick with Rails. I really do want to be able to take
advantage of the nice Rails methods and functionality... I'm just trying
to avoid having a bunch of unused associations for each user object.
Then don't try to use a silly schema like the one you proposed. Perhaps
you will ultimately need something like MySQL's MERGE table, but you
don't need to worry about that yet.
Not yet, true. But it'll come. So why not do something about it? In that
case, do you have any other recommendations for something small that can
be done upfront to keep performance up in the future? I've already got
the indexes covered... anything else? Or should I seriously just do
nothing but the indexing?
Is there maybe a way to unload a 'has_many' association for an object?
Or perhaps call 'has_many' inside of a user-specific method or
something? Like
if @current_user.todo_table_number == 1
  has_many :todo_items1
elsif @current_user.todo_table_number == 2
  has_many :todo_items2
elsif ...
end
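Or, since has_many is a class-level macro and can't see @current_user
anyway, maybe a plain instance method could stand in for the
association? Again just an untested sketch, reusing the TodoTable.for
idea from above:

class User < ActiveRecord::Base
  # Reader method instead of has_many; sidesteps associations entirely.
  def todo_items
    TodoTable.for(self).find(:all, :conditions => { :user_id => id })
  end
end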
Don't bother.
Back to my question: why not? It's just a small preventative measure.
Mind you, if I was attempting to do some multi-server sharding right
now, then you'd be totally right.
Anyway, the reason I'm trying to stay away from the database-layer
features is that they seem to be a bit 'much' right now,
Right! They are a bit much right now -- because they are premature
optimization. When you need them, they will be the right tools.
But not all premature optimizations are the same. Would you tell
someone not to worry about indexing until they have the metrics to
justify it (i.e. a few hundred or a few thousand users/rows)? The
database will work fine without indexes until it reaches a few hundred
or a few thousand rows, yet people are told to take care of indexing
upfront.
This isn't a super-crazy thing I'm trying to do here, just some
optimization that would help at the million+ mark, instead of at the
thousand+ mark.
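The upfront indexing that everyone does endorse is, after all, just a
one-line migration. Table and column names here are only examples:

class AddUserIdIndexToTodoItems < ActiveRecord::Migration
  def self.up
    add_index :todo_items, :user_id  # index the lookup column upfront
  end

  def self.down
    remove_index :todo_items, :user_id
  end
end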
and also can
apparently cause problems. For example, even adding foreign keys at the
database layer can evidently cause issues with testing, and sometimes
they get lost in db:push/db:pull calls.
You apparently are passing on FUD you've heard somewhere. I assure you
-- from experience -- that this is completely untrue.
Well, it's not just hearsay, if that's what you're saying. I could have
been reading something that doesn't apply to Rails 2.3.5 or Rails 3, but
one of the reports was from someone using a specific host who found,
after doing a db:push or db:pull, that the foreign keys he had set up
were gone. The issue was due to a plugin that didn't support foreign
keys.
As for the testing issue, it's something to do with the fixtures not
being loaded in the proper order, and as a result, the test database
wasn't getting set up properly. That looked like it needed a patch or
something to fix it, so maybe it wasn't a super big deal.
Besides that, as far as I know (and I could be wrong) Rails doesn't
support foreign keys out of the box. That's why plugins like Foreigner
exist. I don't have the links for these offhand, but if you'd like them,
I can go back and find them. I read about both of these within the past
week.
But this is beside the point...
Okay, how about this. Let's say that, instead of my original post, I
had posted saying that I have a large table with 7 million+ rows that
keeps growing, and that I had already split it up in the manner
described in the original post. Also, let's say that I really wanted to
stick with the Rails way and was opposed to sharding. What would your
advice be then?
I would advise you to get over your anti-sharding bias and do things as
I outlined above.
Well, it's not so much an anti-sharding bias as an attempt to keep the
app/database cleaner for longer, to keep the app/database performing
better for longer, and to avoid upsetting customers with decreased
performance by giving myself some additional time to make any required
drastic changes like sharding.
Sharding and other complex/time-consuming optimizations should indeed be
left until the metrics show that it's going to be needed soon. I agree
with that part. But sharding is also a few orders of magnitude more
complex than what I'm trying to do.
I certainly don't agree that every optimization should be held off
until the data demands it, especially when we all know what happens
when database performance goes down.
Waiting until a known potential problem (i.e. large-table performance
drops) becomes an actual problem has consequences. I'm just trying to delay
it. Splitting up the tables will work, and it's not that complicated and
keeps the supporting code at the application level.
So, any best practices or tips on how to support/implement this
multiple-table model would be great, and super appreciated.
Thanks again for your help,
Mike
Best,
--
Marnen Laibow-Koser
http://www.marnen.org
marnen@marnen.org
Thanks again,
Mike