One huge table or separate tables? design issue

Yongbiao_Long · October 12, 2014, 1:32am

Hi, everyone,

I’ve learned Rails(Web development) on my own for two months, and came to a situation like this:

There are two tables A and B in my database. A 'has_many B. For every user input, There is an entry in table A. At the same time, I need to add a huge number(say 100,000) of entries in B. As table A grows, table B seems to grow uncontrollably. There must be performance issues in later data manipulation operations, or even 'overflow problems. I use Mysql.

My initial thought was to split table B, for example, for every entry in table A, create a new table (on the fly) for it. The questions are:

Is my thought right?

If yes, then how to design the models and migrations to represent the relationship correctly?

If not, what are good practices to solve this kind of problem?

I’d be very grateful if you could only give some hints, doc or good articles about it.

Vivek_Sampara · October 12, 2014, 9:05am

Hi,

Creating multiple tables is not the key to solve this problem. If your rows are higher than 10 million records, then you have to make sure you archive the older ( Unused ) rows or solve it through MASTER - SLAVE db approach.

Ideally when you’re worry about performance issues with databases , it depends on the READ/WRITE speed expected from the db. And how often is the data being stored READ ?

Cheers

Vivek

Colin_Law · October 12, 2014, 9:17am

Are you sure your basic design is appropriate? Having 100,000 associated records per user seems very high. Are you able to explain what is in those records in case an alternative can be suggested?

Colin

Frederick_Cheung · October 12, 2014, 9:47am

Hi, everyone,

I've learned Rails(Web development) on my own for two months, and came to a situation like this:

There are two tables A and B in my database. A 'has_many` B. For every user input, There is an entry in table A. At the same time, I need to add a huge number(say 100,000) of entries in B. As table A grows, table B seems to grow uncontrollably. There must be performance issues in later data manipulation operations, or even 'overflow` problems. I use Mysql.

My initial thought was to split table B, for example, for every entry in table A, create a new table (on the fly) for it. The questions are:

Is my thought right? If yes, then how to design the models and migrations to represent the relationship correctly? If not, what are good practices to solve this kind of problem?

I've had hundreds of millions of rows in one table without problems before. The important think is not just the number of rows but your access patterns. For example I currently do something vaguely similar with some mongo data: we create one collection per day (and drop the collection from a few days ago) because dropping a collection is very fast compared to deleting several millions documents from a larger collection.

Another real world example of this is new relic: at least at one point they created one table per customer per hour ( New Relic Architecture - Collecting 20+ Billion Metrics a Day - High Scalability -). There's is a somewhat specialised case: high insert performance and specialised read/aggregate and delete patterns pushed them that way.

This is an uncommon pattern. While it allows a form of sharding and avoids the need for blocking migrations on very large tables it will make everyday life more complicated. As far as I'm aware you'll be pretty much on your own if you go down the path.

Odds are you don't need this. If you do, think really carefully about your access patterns first

Fred

Yongbiao_Long · October 12, 2014, 10:34am

Thank you Colin! I'm implementing a railway ticket-booking system just for learn purpose, not for production. Now I try to explain it: Suppose a train has 20 stations along the way. It has 1000 seats to sell. Table A store one train's information in a row. I want table B to store every seat's information. I design table B as follow

train_id: start_city_id: end_city_id: seat_id: availability: boolean type date:

For one train, the number of start_city_id and end_city_id's combinations is 19+18+...+1=190, the number of possible date is 20(people can book tickets in 20 days), and possible seat_id's number is 1000. The total number of rows for one train is 190*1000*20=3,800,000. So huge..

Actually I have more tables in database which are omitted as they are unrelated.

Maybe I should compute the state of any seat every time the system issues a seat? (Some words on the state of a seat: A train goes by city L, M, N, O and there is only one seat left. If a person has booked M->N's seat, then L->O is not available, but L->M, N->O is.)

Colin_Law · October 12, 2014, 10:54am

Are you sure your basic design is appropriate? Having 100,000 associated records per user seems very high. Are you able to explain what is in those records in case an alternative can be suggested?

Thank you Colin! I'm implementing a railway ticket-booking system just for learn purpose, not for production. Now I try to explain it: Suppose a train has 20 stations along the way. It has 1000 seats to sell. Table A store one train's information in a row. I want table B to store every seat's information. I design table B as follow

train_id: start_city_id: end_city_id: seat_id: availability: boolean type date:

For one train, the number of start_city_id and end_city_id's combinations is 19+18+...+1=190, the number of possible date is 20(people can book tickets in 20 days), and possible seat_id's number is 1000. The total number of rows for one train is 190*1000*20=3,800,000. So huge..

The first point is that you should only add rows as seats are booked, use that fact that there is no row to indicate that there is no booking. Secondly, what is the date? Is that the date of the journey or the date of the booking? I don't understand why you have multiplied the number of trains by 20 days. If it is the date of the journey then consider having a Journeys table which is a particular train on a particular day. Thirdly I don't understand why you have 20 factorial for the city combinations. Is it not true that if a seat is booked from station 1 to station 3 then that implies a booking 1 to 2 and 2 to 3? In which case you do not separate rows for bookings 1 to 2, 2 to 3, and 1 to 3.

Colin_Law · October 12, 2014, 11:12am

In fact thinking further, I would probably turn the problem around and have a bookings table, specifying the journey, the stations, and the seat (or seats).

Colin

Jim_Ruther_Nill · October 12, 2014, 11:18am

you can also run a task that archives past dates to a duplicate table since you’ll rarely need them. maybe you can keep 1 month data in the table that you’re

writing to. this way, you won’t have to worry about the exponential growth of

the table you’re always accessing.

Yongbiao_Long · October 12, 2014, 11:28am

The first point is that you should only add rows as seats are booked, use that fact that there is no row to indicate that there is no booking.

Secondly, what is the date? Is that the date of the journey or the date of the booking? I don't understand why you have multiplied the number of trains by 20 days. If it is the date of the journey then consider having a Journeys table which is a particular train on a particular day.

It's the date of the journey. I once thought about this method, the problem is I don't know how to implement it in rails. In Rails tutorials and guidelines, there is a model file in app/model directory for every table. If I need to create new journey tables, what's the model files and migration files look like?(Or I don't need them any more?)

Thirdly I don't understand why you have 20 factorial for the city combinations. Is it not true that if a seat is booked from station 1 to station 3 then that implies a booking 1 to 2 and 2 to 3? In which case you do not separate rows for bookings 1 to 2, 2 to 3, and 1 to 3.

According to your first and third point, It's because my design is different. In my design one booking might indicates more than one row. A row just means the state of a seat. So, if a seat is booked from station 1 to station 3, then rows representing 1->3, 1->2, 2->3, 1->4, 1->5.... all need to change. I do it in this way because I have another table representing the number of left seats, and I think it's easier to update this table in my way(in this case, the number of leaf seats from 1->3, 1->2, 2->3.... decreases by one).

Thank you for your suggestion! I'll dig it deeper and reveal my concerns.

Yongbiao_Long · October 12, 2014, 11:35am

Yeah, I've also plan to do this kind of thing. Thank you anyway;-)

Yongbiao_Long · October 12, 2014, 11:40am

This computation process seems the same for both our table designs... I'll use your approach, having a bookings table, specifying the journey, the stations, and the seat (or seats).

Colin_Law · October 12, 2014, 12:32pm

In that case you have redundant data in your database (as 1->3 implies 1->2 and 1->3), which is almost always not the best way to do it.

Colin

Yongbiao_Long · October 15, 2014, 2:44pm

I feel sorry to bother you all again here, but I still have a problem: How to deal with the daily created tables in rails? I googled around and found nothing.

The daily created tables are named on the dates, their names are all different. I know how to create tables in a periodically executed rake task. However, I surely cannot define that many(infinite) model classes to manipulate these tables. This question might seem too naive for you, as I asked it in previous posts but got no answers. Hope somebody could give some hints. Thanks again!

Scott_Ribe · October 15, 2014, 2:58pm

Don't do that. It's an unnormalized, broken, bordering-on-insane design. The date is an attribute of some entity that belongs in its own table, to which your other data is related. Figure out that entity, design it, set up the relationship appropriately between the two.

Colin_Law · October 15, 2014, 3:00pm

I thought you decided you were not going to have those tables, that you would have a bookings table where each row corresponded to a booking and recorded the journey, start and finish station and seat information.

Colin

Jason_FB · November 4, 2014, 6:55pm

In the Rails world, what Scott wrote is basically right (bordering on insane). ActiveRecord really won't place nice with that kind of a design.

Outside of the Rails world, "Big Data" and Reporting people do things like that (date stamping table names) for aggregate reporting. There are times when those kinds of designs aren't such a bad idea (like you have a massive amount of data and you want to build a reporting engine). But if that's your case, Rails isn't the right tool for you.

You could still write a Rails app, but then use some kind of reporting tool to extract data from your Rails database and copy it into another reporting database where you will do your reporting operations (and probably use Crystal Reports or something to deal with that data).

Topic		Replies	Views
Best way to handle multiple tables to replace one big table? rubyonrails-talk	15	1270	April 30, 2010
Database/program design question rubyonrails-talk	6	131	January 9, 2008
ActiveRecord model with multiple table? rubyonrails-talk	3	123	May 18, 2008
Same table, multiple DBs? rubyonrails-talk	3	117	June 1, 2009
Multiple Database Connections rubyonrails-talk	15	491	August 13, 2007

One huge table or separate tables? design issue

Related topics

More Resources