Marnen Laibow-Koser wrote:
Mike P. wrote:
Marnen Laibow-Koser wrote:
Mike P. wrote:
Hello,
I've decided to use multiple tables for an entity (e.g. "todo_items1,"
"todo_items2," "todo_items3," etc.), instead of just one main table
which could end up having a lot of rows (e.g. just "todo_items"). I'm
doing this to try to avoid a potential future performance drop that
could come with having too many rows in one table.
Robert already said it, and I agree: DO NOT DO THIS. Any decent DB
system is designed to handle tables in the millions of rows without
breathing hard. Just index your tables properly and make sure your
queries are efficient.
If you have to use your DB's sharding features at some point, go for it.
But don't even think of doing this in the app layer.
Best,
--
Marnen Laibow-Koser
http://www.marnen.org
marnen@marnen.org
Hi Marnen,
Thanks again for your response. I was thinking about it last night, and
did some more research and a few calculations, and I'm inclined to agree
with you and stick with the single table model. It will make things
easier and the database should be able to handle it for a while.
However, I'm going to reply below as if I'm sticking with the
multi-table model, mostly to see how I really feel about it by typing
some of my thoughts (it sounds kind of weird, but hopefully it makes
sense).
Honestly, I still think that splitting the main table into smaller ones
will have future benefits. Even though there's a bit of extra work up
front, I'd rather get this small thing over with now than let it turn
into a bigger problem later.
I think if people could just get over the "don't optimize too early"
mantra, and realize that this can't possibly be the best move for everyone,
a lot of future stress could be avoided, for both the business owner and
the customer.
Personally, I'd probably change that saying to read "don't optimize *too
much* too early." As they say, an "ounce of prevention is worth a pound
of cure," and that makes sense in the database optimization world too.
This multi-table thing isn't meant to eliminate the need for further
optimization entirely, by the way, and I'm aware that I shouldn't go
overboard with too much upfront optimization. I know that further
performance tweaks, most likely at the database layer, will probably be
needed. But it's *because* I know this that I was thinking about an
easy change I could make now that would keep performance up for longer.
Anyway...
Hi Marnen,
Thank you also for your post and advice. It actually made me think more
about this. So you're saying that even if I had millions of records in
a table, and was noticing a decrease in performance, I shouldn't split
up the tables this way?
Correct, unless there is no better way to improve things (which will
almost certainly not be the case).
Would sharding be the only option?
Depends on the DB.
It's PostgreSQL running on a single database server, in case you were
wondering.
One of the things that I keep coming back to when I think about the
single-table option is that one of the fields will be rather large, like
a blog-post in size. And I plan on indexing that to make it searchable.
Also, there are a lot of insertions and deletions. Does it still make
sense to include all of that in a single table? Or does that change any
of your advice?
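To be concrete, by "indexing that" I mean something like a PostgreSQL
full-text index. Just a sketch -- the "body" column name is made up:

class AddFullTextIndexToTodoItems < ActiveRecord::Migration
  def self.up
    # Expression-based GIN index for full-text search on the large,
    # blog-post-sized column (hypothetically named "body").
    execute <<-SQL
      CREATE INDEX index_todo_items_on_body_fts
        ON todo_items USING gin(to_tsvector('english', body))
    SQL
  end

  def self.down
    execute "DROP INDEX index_todo_items_on_body_fts"
  end
end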
That does not change any of my advice -- right up until the point where
you have actual, measurable problems. And by then, your data model may
be different anyway. Don't prematurely optimize, because you may be
optimizing the wrong things.
I see your point here. I don't think the model for this table will
change much though (I know, I know), but I do see your point. However,
from the other side, even if the model does change, I could always just
update all of those tables with the migration(s), which I'd still have
to do with the one-big-table model.
I agree that this is an application layer split, sort of; but I don't
think that this particular split is a bad thing.
On what basis do you think that? (I think you're wrong, but I want to
know your reasoning.)
Well, pretty much because it's not much different than accessing the
rest of the tables in the database.
When I want to look up a user, I go to the User table; I may have
another table for user preferences which I'd also have to access.
Looking at todo items for a user involves looking at a single table in
both cases; the main difference between the multiple-table and
single-table approaches is that there's an extra bit of indirection.
It could be as simple as:
1) Looking at the "which_todo_table" column (which would be a number,
I'm thinking)
2) Appending that number to the end of a string (so, "todo_items" << "3"
becomes "todo_items3")
So, there's an extra column in the User table that tells which table to
look at to get that user's todo items, and the app layer picks that
table when it needs to.
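In Rails terms, a rough sketch of that indirection (the column and
method names are just examples, nothing final):

class User < ActiveRecord::Base
  # "which_todo_table" is the hypothetical integer column from step 1.
  def todo_table_name
    "todo_items#{which_todo_table}"
  end
end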
I agree that the database can handle this sort of split more efficiently
than the application layer, especially when it comes to things like
sharding or partitioning; but I'm looking at the multi-table model as a
sort of "soft-partitioning." The data gets "split" up in the database
(via tables), and each index stays at roughly a fifth of the size it
would reach with a single table (assuming five tables). Similar to
partitioning, but without modifying the database in a way that Rails
doesn't know about.
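For comparison, the "hard" version of this in PostgreSQL -- the kind of
DB-level change I'm trying to put off -- would be inheritance-based
partitioning, which (as far as I understand it) looks roughly like
this, with made-up names and ranges:

class PartitionTodoItems < ActiveRecord::Migration
  def self.up
    # One child table per partition; the planner can skip partitions
    # whose CHECK constraint rules them out (constraint exclusion).
    # Routing INSERTs to the right child also needs a trigger or rule
    # on the parent table -- an extra moving part.
    execute <<-SQL
      CREATE TABLE todo_items_p1 (CHECK (user_id BETWEEN 1 AND 100000))
        INHERITS (todo_items);
    SQL
  end

  def self.down
    execute "DROP TABLE todo_items_p1"
  end
end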
So, the difference is that I'd be using Ruby/Rails to "decide" which
table to look at, instead of having a pre-set constant that's stored by
Rails and used whenever a model is accessed.
A benefit of this is that everything's at the application layer. I can
see exactly what's going on, and if I move servers or something, I won't
have to deal with the potentially complicated, database-specific
sharding stuff that can only be seen by investigating the database
layer.
It's just another table. That's how I'm looking at it.
If I did this in pure Ruby, for example, I'm guessing that the effect
would be minimal (just one class/method that looks up the correct table,
and returns the object for that table).
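Something like this is what I have in mind -- completely untested, and
all the names are made up:

# One factory method that hands back an ActiveRecord class bound to
# the right physical table for a given user.
class TodoTable
  def self.for(user)
    @classes ||= {}
    @classes[user.which_todo_table] ||= Class.new(ActiveRecord::Base) do
      set_table_name "todo_items#{user.which_todo_table}"
    end
  end
end

# Usage, with a Rails 2.3-style finder:
#   TodoTable.for(current_user).find(:all,
#     :conditions => { :user_id => current_user.id })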
But it is still inappropriate, because this can be handled more
efficiently on the DB side than on the app side.
I agree, but it's also more complicated to set up the database to
manage all of that stuff. Handling things on the DB side adds
another "moving part," so to speak. It will have to be managed, tested,
and revisited when changes are being made.
Don't get me wrong, I'm not opposed to DB-layer tweaks and
modifications, but if I can avoid or delay complicated DB-layer changes
by doing some work upfront -- especially in this case, where the
modification isn't database-level -- then I'll at least look into doing
so.
The other main benefit is that DB performance will be much more
resilient to surges of data. I'll have more time to implement those
DB-level changes, if/when needed, and the decrease in performance will
be much more gradual.
Considering that making a change such as this has such a
(theoretically) small impact on the app-layer codebase, has no effect on
the database itself, and allows the database/app to perform better for
longer - thereby delaying the slowdowns and the need for potentially
complicated, not-easily-portable changes - isn't it worth looking into?
You must have seen people who are noticing their database performance
dropping, and who are scrambling to fix their issues. They pop up in
forums quite often, no? And how do they fix it? With more servers and/or
complicated procedures; with data migrations, extra layers, and then
downtime. All the while, they're feeling pressure because the
performance lag is noticeable, and getting more noticeable as time
passes.
It happens so often, everyone knows that it's going to happen, and yet
trying to optimize a bit upfront is discouraged.
And, of course, in every DB performance upgrade push, there's another,
even more important part of the story here: the users.
For however long it takes to get these changes implemented, and
sometimes from even before the changes start, the users are noticing
these slowdowns, and possibly even time-outs. They're seeing notices on
the site about how the company is working to make the site faster, blah,
blah, blah. They don't look at it as "oh, yeah, they followed the
hard-and-fast rule about not optimizing too early, there's nothing they
could've done to prevent this, so it's okay, this is just what happens."
Instead, it looks like they didn't plan this properly, or that they
weren't prepared.
So then, my question is: if people can expect these performance issues
to happen as a site grows, why is it so frowned upon to do a little bit
of upfront work to delay that from happening -- and, once it does start
to happen, to have better performance during that period?
What's (potentially) making the
app layer bloated is all of the unused 'has_many' associations, since I'm
trying to stick with Rails. I really do want to be able to take
advantage of the nice Rails methods and functionality... I'm just trying
to avoid having a bunch of unused associations for each user object.
Then don't try to use a silly schema like the one you proposed. Perhaps
you will ultimately need something like MySQL's MERGE table, but you
don't need to worry about that yet.
Not yet, true. But it'll come. So why not do something about it? In that
case, do you have any other recommendations for something small that can
be done upfront to keep performance up in the future? I've already got
the indexes covered... anything else? Or should I seriously just do
nothing but the indexing?
Is there maybe a way to unload a 'has_many' association for an object?
Or perhaps call 'has_many' inside of a user-specific method or
something? Like
if @current_user.todo_table_number == 1
  has_many :todo_items1
elsif @current_user.todo_table_number == 2
  has_many :todo_items2
elsif ...
end
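Or, since has_many is a class-level macro and can't see @current_user
anyway, maybe a plain instance method could stand in for the
association? Again just an untested sketch, reusing the TodoTable.for
idea from above:

class User < ActiveRecord::Base
  # Reader method instead of has_many; sidesteps associations entirely.
  def todo_items
    TodoTable.for(self).find(:all, :conditions => { :user_id => id })
  end
end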
Don't bother.
Back to my question: why not? It's just a small preventative measure.
Mind you, if I was attempting to do some multi-server sharding right
now, then you'd be totally right.
Anyway, the reason I'm trying to stay away from the database-layer
features is that they seem to be a bit 'much' right now,
Right! They are a bit much right now -- because they are premature
optimization. When you need them, they will be the right tools.
But not all premature optimizations are the same. Would you tell
someone not to worry about indexing until they have the metrics to
justify it (i.e. a few hundred or a few thousand users/rows)? The
database will work fine without indexes until it reaches a few hundred
or a few thousand rows, yet people are told to take care of indexing
upfront.
This isn't a super-crazy thing I'm trying to do here, just some
optimization that would help at the million+ mark, instead of at the
thousand+ mark.
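The upfront indexing that everyone does endorse is, after all, just a
one-line migration. Table and column names here are only examples:

class AddUserIdIndexToTodoItems < ActiveRecord::Migration
  def self.up
    add_index :todo_items, :user_id  # index the lookup column upfront
  end

  def self.down
    remove_index :todo_items, :user_id
  end
end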
and also can
apparently cause problems. For example, even adding foreign keys at the
database layer can evidently cause issues with testing, and sometimes
they get lost in db:push/db:pull calls.
You apparently are passing on FUD you've heard somewhere. I assure you
-- from experience -- that this is completely untrue.
Well, it's not just hearsay, if that's what you're saying. I could have
been reading something that doesn't apply to Rails 2.3.5 or Rails 3, but
one of the reports was from someone using a specific host who found,
after doing a db:push or db:pull, that the foreign keys he had set up
were gone. The issue was due to a plugin that didn't support foreign
keys.
As for the testing issue, it's something to do with the fixtures not
being loaded in the proper order, and as a result, the test database
wasn't getting set up properly. That looked like it needed a patch or
something to fix it, so maybe it wasn't a super big deal.
Besides that, as far as I know (and I could be wrong) Rails doesn't
support foreign keys out of the box. That's why plugins like Foreigner
exist. I don't have the links for these offhand, but if you'd like them,
I can go back and find them. I read about both of these within the past
week.
But this is beside the point...
Okay, how about this. Let's say that, instead of my original post, I
had posted saying that I have a large table with 7 million+ rows that
keeps growing, and that I had already split it up in the manner
described in the original post. Also, let's say that I really wanted to
stick with the Rails way and was opposed to sharding. What would your
advice be then?
I would advise you to get over your anti-sharding bias and do things as
I outlined above.
Well, it's not so much an anti-sharding bias as an attempt to keep the
app/database cleaner for longer, to keep the app/database performing
better for longer, and to avoid upsetting customers with decreased
performance by giving myself some additional time to make any required
drastic changes like sharding.
Sharding and other complex/time-consuming optimizations should indeed be
left until the metrics show that it's going to be needed soon. I agree
with that part. But sharding is also a few orders of magnitude more
complex than what I'm trying to do.
I certainly don't agree that every optimization should be held off
until the data demands it, especially when we all know what happens
when database performance goes down.
Waiting until a known potential problem (i.e. large-table performance
drops) becomes an actual problem has consequences. I'm just trying to delay
it. Splitting up the tables will work, and it's not that complicated and
keeps the supporting code at the application level.
So, any best practices or tips on how to support/implement this
multiple-table model would be great, and super appreciated.
Thanks again for your help,
Mike
Best,
--
Marnen Laibow-Koser
http://www.marnen.org
marnen@marnen.org
Thanks again,
Mike