One huge table or separate tables? design issue

Hi, everyone,

I’ve learned Rails(Web development) on my own for two months, and came to a situation like this:

There are two tables A and B in my database. A 'has_many B. For every user input, There is an entry in table A. At the same time, I need to add a huge number(say 100,000) of entries in B. As table A grows, table B seems to grow uncontrollably. There must be performance issues in later data manipulation operations, or even 'overflow problems. I use Mysql.

My initial thought was to split table B, for example, for every entry in table A, create a new table (on the fly) for it. The questions are:

Is my thought right?

If yes, then how to design the models and migrations to represent the relationship correctly?

If not, what are good practices to solve this kind of problem?

I’d be very grateful if you could only give some hints, doc or good articles about it.

Hi,

Creating multiple tables is not the key to solve this problem. If your rows are higher than 10 million records, then you have to make sure you archive the older ( Unused ) rows or solve it through MASTER - SLAVE db approach.

Ideally when you’re worry about performance issues with databases , it depends on the READ/WRITE speed expected from the db. And how often is the data being stored READ ?

Cheers

Vivek

Are you sure your basic design is appropriate? Having 100,000 associated records per user seems very high. Are you able to explain what is in those records in case an alternative can be suggested?

Colin

Hi, everyone,

I've learned Rails(Web development) on my own for two months, and came to a situation like this:

There are two tables A and B in my database. A 'has_many` B. For every user input, There is an entry in table A. At the same time, I need to add a huge number(say 100,000) of entries in B. As table A grows, table B seems to grow uncontrollably. There must be performance issues in later data manipulation operations, or even 'overflow` problems. I use Mysql.

My initial thought was to split table B, for example, for every entry in table A, create a new table (on the fly) for it. The questions are:

Is my thought right? If yes, then how to design the models and migrations to represent the relationship correctly? If not, what are good practices to solve this kind of problem?

I've had hundreds of millions of rows in one table without problems before. The important think is not just the number of rows but your access patterns. For example I currently do something vaguely similar with some mongo data: we create one collection per day (and drop the collection from a few days ago) because dropping a collection is very fast compared to deleting several millions documents from a larger collection.

Another real world example of this is new relic: at least at one point they created one table per customer per hour ( New Relic Architecture - Collecting 20+ Billion Metrics a Day - High Scalability -). There's is a somewhat specialised case: high insert performance and specialised read/aggregate and delete patterns pushed them that way.

This is an uncommon pattern. While it allows a form of sharding and avoids the need for blocking migrations on very large tables it will make everyday life more complicated. As far as I'm aware you'll be pretty much on your own if you go down the path.

Odds are you don't need this. If you do, think really carefully about your access patterns first

Fred

Thank you Colin! I'm implementing a railway ticket-booking system just for learn purpose, not for production. Now I try to explain it: Suppose a train has 20 stations along the way. It has 1000 seats to sell. Table A store one train's information in a row. I want table B to store every seat's information. I design table B as follow

train_id: start_city_id: end_city_id: seat_id: availability: boolean type date:

For one train, the number of start_city_id and end_city_id's combinations is 19+18+...+1=190, the number of possible date is 20(people can book tickets in 20 days), and possible seat_id's number is 1000. The total number of rows for one train is 190*1000*20=3,800,000. So huge..

Actually I have more tables in database which are omitted as they are unrelated.

Maybe I should compute the state of any seat every time the system issues a seat? (Some words on the state of a seat: A train goes by city L, M, N, O and there is only one seat left. If a person has booked M->N's seat, then L->O is not available, but L->M, N->O is.)

Are you sure your basic design is appropriate? Having 100,000 associated records per user seems very high. Are you able to explain what is in those records in case an alternative can be suggested?

Thank you Colin! I'm implementing a railway ticket-booking system just for learn purpose, not for production. Now I try to explain it: Suppose a train has 20 stations along the way. It has 1000 seats to sell. Table A store one train's information in a row. I want table B to store every seat's information. I design table B as follow

train_id: start_city_id: end_city_id: seat_id: availability: boolean type date:

For one train, the number of start_city_id and end_city_id's combinations is 19+18+...+1=190, the number of possible date is 20(people can book tickets in 20 days), and possible seat_id's number is 1000. The total number of rows for one train is 190*1000*20=3,800,000. So huge..

The first point is that you should only add rows as seats are booked, use that fact that there is no row to indicate that there is no booking. Secondly, what is the date? Is that the date of the journey or the date of the booking? I don't understand why you have multiplied the number of trains by 20 days. If it is the date of the journey then consider having a Journeys table which is a particular train on a particular day. Thirdly I don't understand why you have 20 factorial for the city combinations. Is it not true that if a seat is booked from station 1 to station 3 then that implies a booking 1 to 2 and 2 to 3? In which case you do not separate rows for bookings 1 to 2, 2 to 3, and 1 to 3.

In fact thinking further, I would probably turn the problem around and have a bookings table, specifying the journey, the stations, and the seat (or seats).

Colin

you can also run a task that archives past dates to a duplicate table since you’ll rarely need them. maybe you can keep 1 month data in the table that you’re

writing to. this way, you won’t have to worry about the exponential growth of

the table you’re always accessing.

The first point is that you should only add rows as seats are booked, use that fact that there is no row to indicate that there is no booking.

Secondly, what is the date? Is that the date of the journey or the date of the booking? I don't understand why you have multiplied the number of trains by 20 days. If it is the date of the journey then consider having a Journeys table which is a particular train on a particular day.

It's the date of the journey. I once thought about this method, the problem is I don't know how to implement it in rails. In Rails tutorials and guidelines, there is a model file in app/model directory for every table. If I need to create new journey tables, what's the model files and migration files look like?(Or I don't need them any more?)

Thirdly I don't understand why you have 20 factorial for the city combinations. Is it not true that if a seat is booked from station 1 to station 3 then that implies a booking 1 to 2 and 2 to 3? In which case you do not separate rows for bookings 1 to 2, 2 to 3, and 1 to 3.

According to your first and third point, It's because my design is different. In my design one booking might indicates more than one row. A row just means the state of a seat. So, if a seat is booked from station 1 to station 3, then rows representing 1->3, 1->2, 2->3, 1->4, 1->5.... all need to change. I do it in this way because I have another table representing the number of left seats, and I think it's easier to update this table in my way(in this case, the number of leaf seats from 1->3, 1->2, 2->3.... decreases by one).

Thank you for your suggestion! I'll dig it deeper and reveal my concerns.

Yeah, I've also plan to do this kind of thing. Thank you anyway;-)

This computation process seems the same for both our table designs... I'll use your approach, having a bookings table, specifying the journey, the stations, and the seat (or seats).

In that case you have redundant data in your database (as 1->3 implies 1->2 and 1->3), which is almost always not the best way to do it.

Colin

I feel sorry to bother you all again here, but I still have a problem: How to deal with the daily created tables in rails? I googled around and found nothing.

The daily created tables are named on the dates, their names are all different. I know how to create tables in a periodically executed rake task. However, I surely cannot define that many(infinite) model classes to manipulate these tables. This question might seem too naive for you, as I asked it in previous posts but got no answers. Hope somebody could give some hints. Thanks again!

Don't do that. It's an unnormalized, broken, bordering-on-insane design. The date is an attribute of some entity that belongs in its own table, to which your other data is related. Figure out that entity, design it, set up the relationship appropriately between the two.

I thought you decided you were not going to have those tables, that you would have a bookings table where each row corresponded to a booking and recorded the journey, start and finish station and seat information.

Colin

In the Rails world, what Scott wrote is basically right (bordering on insane). ActiveRecord really won't place nice with that kind of a design.

Outside of the Rails world, "Big Data" and Reporting people do things like that (date stamping table names) for aggregate reporting. There are times when those kinds of designs aren't such a bad idea (like you have a massive amount of data and you want to build a reporting engine). But if that's your case, Rails isn't the right tool for you.

You could still write a Rails app, but then use some kind of reporting tool to extract data from your Rails database and copy it into another reporting database where you will do your reporting operations (and probably use Crystal Reports or something to deal with that data).