I’ve learned Rails(Web development) on my own for two months, and came to a situation like this:
There are two tables A and B in my database. A 'has_many B. For every user input, There is an entry in table A. At the same time, I need to add a huge number(say 100,000) of entries in B. As table A grows, table B seems to grow uncontrollably. There must be performance issues in later data manipulation operations, or even 'overflow problems. I use Mysql.
My initial thought was to split table B, for example, for every entry in table A, create a new table (on the fly) for it. The questions are:
Is my thought right?
If yes, then how to design the models and migrations to represent the relationship correctly?
If not, what are good practices to solve this kind of problem?
I’d be very grateful if you could only give some hints, doc or good articles about it.
Creating multiple tables is not the key to solve this problem. If your rows are higher than 10 million records, then you have to make sure you archive the older ( Unused ) rows or solve it through MASTER - SLAVE db approach.
Ideally when you’re worry about performance issues with databases , it depends on the READ/WRITE speed expected from the db. And how often is the data being stored READ ?
Are you sure your basic design is appropriate? Having 100,000
associated records per user seems very high. Are you able to explain
what is in those records in case an alternative can be suggested?
I've learned Rails(Web development) on my own for two months, and came to a situation like this:
There are two tables A and B in my database. A 'has_many` B. For every user input, There is an entry in table A. At the same time, I need to add a huge number(say 100,000) of entries in B. As table A grows, table B seems to grow uncontrollably. There must be performance issues in later data manipulation operations, or even 'overflow` problems. I use Mysql.
My initial thought was to split table B, for example, for every entry in table A, create a new table (on the fly) for it. The questions are:
Is my thought right?
If yes, then how to design the models and migrations to represent the relationship correctly?
If not, what are good practices to solve this kind of problem?
I've had hundreds of millions of rows in one table without problems before. The important think is not just the number of rows but your access patterns. For example I currently do something vaguely similar with some mongo data: we create one collection per day (and drop the collection from a few days ago) because dropping a collection is very fast compared to deleting several millions documents from a larger collection.
Another real world example of this is new relic: at least at one point they created one table per customer per hour ( New Relic Architecture - Collecting 20+ Billion Metrics a Day - High Scalability -). There's is a somewhat specialised case: high insert performance and specialised read/aggregate and delete patterns pushed them that way.
This is an uncommon pattern. While it allows a form of sharding and avoids the need for blocking migrations on very large tables it will make everyday life more complicated. As far as I'm aware you'll be pretty much on your own if you go down the path.
Odds are you don't need this. If you do, think really carefully about your access patterns first
Thank you Colin! I'm implementing a railway ticket-booking system just
for learn purpose, not for production.
Now I try to explain it:
Suppose a train has 20 stations along the way. It has 1000 seats to
sell. Table A store one train's information in a row. I want table B
to store every seat's information. I design table B as follow
train_id:
start_city_id:
end_city_id:
seat_id:
availability: boolean type
date:
For one train, the number of start_city_id and end_city_id's
combinations is 19+18+...+1=190, the number of possible date is
20(people can book tickets in 20 days), and possible seat_id's number
is 1000. The total number of rows for one train is
190*1000*20=3,800,000. So huge..
Actually I have more tables in database which are omitted as they are unrelated.
Maybe I should compute the state of any seat every time the system
issues a seat? (Some words on the state of a seat: A train goes by
city L, M, N, O and there is only one seat left. If a person has
booked M->N's seat, then L->O is not available, but L->M, N->O is.)
Are you sure your basic design is appropriate? Having 100,000
associated records per user seems very high. Are you able to explain
what is in those records in case an alternative can be suggested?
Thank you Colin! I'm implementing a railway ticket-booking system just
for learn purpose, not for production.
Now I try to explain it:
Suppose a train has 20 stations along the way. It has 1000 seats to
sell. Table A store one train's information in a row. I want table B
to store every seat's information. I design table B as follow
train_id:
start_city_id:
end_city_id:
seat_id:
availability: boolean type
date:
For one train, the number of start_city_id and end_city_id's
combinations is 19+18+...+1=190, the number of possible date is
20(people can book tickets in 20 days), and possible seat_id's number
is 1000. The total number of rows for one train is
190*1000*20=3,800,000. So huge..
The first point is that you should only add rows as seats are booked,
use that fact that there is no row to indicate that there is no
booking.
Secondly, what is the date? Is that the date of the journey or the
date of the booking? I don't understand why you have multiplied the
number of trains by 20 days. If it is the date of the journey then
consider having a Journeys table which is a particular train on a
particular day.
Thirdly I don't understand why you have 20 factorial for the city
combinations. Is it not true that if a seat is booked from station 1
to station 3 then that implies a booking 1 to 2 and 2 to 3? In which
case you do not separate rows for bookings 1 to 2, 2 to 3, and 1 to 3.
In fact thinking further, I would probably turn the problem around and
have a bookings table, specifying the journey, the stations, and the
seat (or seats).
you can also run a task that archives past dates to a duplicate table since you’ll
rarely need them. maybe you can keep 1 month data in the table that you’re
writing to. this way, you won’t have to worry about the exponential growth of
The first point is that you should only add rows as seats are booked,
use that fact that there is no row to indicate that there is no
booking.
Secondly, what is the date? Is that the date of the journey or the
date of the booking? I don't understand why you have multiplied the
number of trains by 20 days. If it is the date of the journey then
consider having a Journeys table which is a particular train on a
particular day.
It's the date of the journey. I once thought about this method, the
problem is I don't know how to implement it in rails. In Rails
tutorials and guidelines, there is a model file in app/model directory
for every table. If I need to create new journey tables, what's the
model files and migration files look like?(Or I don't need them any
more?)
Thirdly I don't understand why you have 20 factorial for the city
combinations. Is it not true that if a seat is booked from station 1
to station 3 then that implies a booking 1 to 2 and 2 to 3? In which
case you do not separate rows for bookings 1 to 2, 2 to 3, and 1 to 3.
According to your first and third point, It's because my design is
different. In my design one booking might indicates more than one row.
A row just means the state of a seat. So, if a seat is booked from
station 1 to station 3, then rows representing 1->3, 1->2, 2->3, 1->4,
1->5.... all need to change. I do it in this way because I have
another table representing the number of left seats, and I think it's
easier to update this table in my way(in this case, the number of leaf
seats from 1->3, 1->2, 2->3.... decreases by one).
Thank you for your suggestion! I'll dig it deeper and reveal my concerns.
This computation process seems the same for both our table designs...
I'll use your approach, having a bookings table, specifying the
journey, the stations, and the seat (or seats).
I feel sorry to bother you all again here, but I still have a problem:
How to deal with the daily created tables in rails? I googled around
and found nothing.
The daily created tables are named on the dates, their names are all
different. I know how to create tables in a periodically executed rake
task. However, I surely cannot define that many(infinite) model
classes to manipulate these tables. This question might seem too naive
for you, as I asked it in previous posts but got no answers. Hope
somebody could give some hints. Thanks again!
Don't do that. It's an unnormalized, broken, bordering-on-insane design. The date is an attribute of some entity that belongs in its own table, to which your other data is related. Figure out that entity, design it, set up the relationship appropriately between the two.
I thought you decided you were not going to have those tables, that
you would have a bookings table where each row corresponded to a
booking and recorded the journey, start and finish station and seat
information.
In the Rails world, what Scott wrote is basically right (bordering on insane). ActiveRecord really won't place nice with that kind of a design.
Outside of the Rails world, "Big Data" and Reporting people do things like that (date stamping table names) for aggregate reporting. There are times when those kinds of designs aren't such a bad idea (like you have a massive amount of data and you want to build a reporting engine). But if that's your case, Rails isn't the right tool for you.
You could still write a Rails app, but then use some kind of reporting tool to extract data from your Rails database and copy it into another reporting database where you will do your reporting operations (and probably use Crystal Reports or something to deal with that data).