I have a legacy logging app that needs to be modernized. At the moment the plan is to store the data in 5 tables; one table will contain more than 200 million rows and 2 others about 60 million rows each. MySQL will be used, unless someone sees a major advantage in using something else (at the moment the data is stored in a raw positioned file format, which I'm not keeping).
For performance reasons I'm thinking about storing the data for each entity in separate tables. So instead of a 'logdata' table with >200 million rows, I'd create 'logdata_001', 'logdata_002',... tables where the row count of even the biggest entity's logdata table stays below 3 million.
Rails seems to have no problem with me setting the table name (set_table_name) on each incoming request before accessing the database.
I have this gut feeling though I’m overlooking something nasty.
Anyone have any comments/suggestions ? Am I about to create my own personal maintenance nightmare ?
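A minimal sketch of the per-request switch being described, assuming a LogData model and an entity number arriving in the request parameters (both names are illustrative, not from the actual app):

class LogDataController < ApplicationController
  before_filter :point_model_at_entity_table

  private

  # Repoint the class at the right per-entity table before any query runs.
  def point_model_at_entity_table
    suffix = format('%03d', params[:entity_id].to_i)   # '001', '002', ...
    LogData.set_table_name "logdata_#{suffix}"
  end
end

Because table_name is class-level state, whatever was set last wins for the whole process, which is the kind of cross-request confusion discussed below.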
I'm doing a similar thing [setting information about the database on-the-fly] but I'm having to change a model's table_name_prefix. The one concern I have [though it hasn't come up so far] is the possibility of two requests being made at the same or close to the same time and one mistakenly using the other's setup. I'm thinking of using transactions, perhaps. I dunno, but I too am interested in whether anyone else has done on-the-fly type database configuring.
Why not just have a table with >200m rows? As far as I'm aware, there's no real performance advantage in splitting it up. The point of a database is that it can just eat up data without causing any performance issues.
Although I can't imagine what on earth you'd want 200m rows for! Modeling the cells of a small rodent, maybe?
I'd go with Postgres unless you have some really solid reasons for staying with MySQL.
I suggest you set aside a couple of days to configure both DBs (Postgres ships with very modest defaults -- performance will suffer if you don't give it a bit more shared and working memory, and do turn on autovacuum), load the data, run VACUUM ANALYZE, add some indexes, and then try to run a few queries.
Try some fancy subqueries, for instance. IIRC, MySQL performance suffers badly from not being able to use an index more than once per query, or something along those lines.
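To make that concrete, these are the kinds of postgresql.conf settings being hinted at; the parameter names are real PostgreSQL settings, but the values are placeholders to be sized against your RAM, not recommendations (and releases before 8.2 take raw numbers rather than units for the memory settings):

shared_buffers = 256MB        # default is tiny; give the buffer cache real memory
work_mem = 32MB               # per-sort/per-hash memory; helps big ORDER BY and GROUP BY
maintenance_work_mem = 256MB  # speeds up VACUUM and CREATE INDEX on large tables
autovacuum = on               # keep dead-row cleanup and planner statistics automatic

After the bulk load, a VACUUM ANALYZE from psql refreshes the planner statistics before you start adding indexes and timing queries.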
Keep the data in rows, that's what RDBMS are made for.
Just make sure you add indexes on the FKs. Periodically clustering on
that index could be beneficial too, if you need to do a lot of seq
scans on single entities.
There's no benefit to splitting the data across multiple tables and it
makes your app more complex. And don't forget about what happens when
you're asked to do something that spans all those entities, for
whatever reason.
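As a concrete illustration of the indexing advice, here is a minimal Rails migration; the logdata table and entity_id column are assumed names, not taken from the original app:

class AddEntityIndexToLogdata < ActiveRecord::Migration
  def self.up
    # Index the foreign key so per-entity lookups don't have to scan 200m rows.
    add_index :logdata, :entity_id
    # On PostgreSQL you could also periodically CLUSTER the table on this
    # index; the exact CLUSTER syntax varies between server versions, so
    # check the docs for yours before scripting it.
  end

  def self.down
    remove_index :logdata, :entity_id
  end
end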
FWIW, there can be significant performance gains using partitioned
tables, assuming you have something to key on, such as a year or
something which can be logically segmented.
As for having 200m rows, it is common in data warehouses when you are
dealing with historical data over a 5 year or greater time frame.
Hopefully I'll get the opportunity to speak about it at RailsConf
2007, and if so come up and say hi.
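For what it's worth, here is a hedged sketch of what year-keyed partitioning can look like on PostgreSQL 8.x (table inheritance plus CHECK constraints); the table and column names are invented for illustration, and MySQL 5.1's PARTITION BY RANGE is the rough equivalent on that side:

class PartitionLogdataByYear < ActiveRecord::Migration
  def self.up
    # Parent table that queries are written against.
    create_table :logdata do |t|
      t.column :entity_id, :integer
      t.column :logged_at, :datetime
      t.column :message,   :text
    end

    # One child table per year; with constraint_exclusion enabled, queries
    # that filter on logged_at only touch the relevant children.
    execute <<-SQL
      CREATE TABLE logdata_2006 (
        CHECK (logged_at >= '2006-01-01' AND logged_at < '2007-01-01')
      ) INHERITS (logdata);
    SQL
  end

  def self.down
    execute "DROP TABLE logdata_2006;"
    drop_table :logdata
  end
end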
How are you all setting table_name or table_name_prefix dynamically? I
initially thought I could just set a session variable (since I want
this value to persist for a given user across multiple requests until
it is reset by the user) and then use it in my model:
class TestTable < ActiveRecord::Base
  self.table_name_prefix = session[:my_little_variable] + '_'
end
But this doesn't work, since apparently session data is not directly
available in the models. Not every model I'm working with needs a table
prefix, so I don't want to set this globally. How do I take a variable
selected by the user and cleanly use this as a table name prefix for
the models that need it?
What's the criterion for which models need a new table name/prefix? If it centers around another model then you can put a table_prefix attribute on that model and call it thusly:
def self.set_tables
  OtherModel.set_table_name "#{table_prefix}_othermodels"
end
Don't use table_name_prefix as it sets ActiveRecord::Base itself and not the specific model as you might think. Lots of fun confusion on that for me. Also, you can't [or I couldn't figure out how to] set the table_name on the model itself. For me, the models that needed new table names [databases actually] were all related to what site was being used, so my Site model has the set_tables method. Call that method from a filter on the appropriate controller methods and voila! On-the-fly table name changes. If you have a relationship between the models, you might also need to tweak your associations code like this:
class Whatever < ActiveRecord::Base
  has_many :othermodels, :finder_sql => 'SELECT #{table_name}_othermodels.* FROM #{table_name}_othermodels WHERE #{table_name}_othermodels.whatever_id = #{id}'
end
I hope that's not too obtuse. It's just that my code uses this functionality to change databases and I don't want to paste it and have you thinking that it's merely for changing table names. However, if anyone is wondering... If you do need to change the database, you don't have to establish a new connection but can use Model.set_table_name "#{db_name}.table_name". You can use the same trick in the association to select across two databases.
God, I hope that wasn’t too obscure. Apologies in advance.
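For anyone wondering what that cross-database trick looks like, here is a tiny sketch; it assumes both schemas live on the same MySQL server, the connection's user can read both, and 'archive' / 'logdata' are made-up names:

class ArchivedLogData < ActiveRecord::Base
  # A database-qualified table name reuses the existing connection;
  # no second establish_connection is needed.
  set_table_name "archive.logdata"
end

ArchivedLogData.find(:first)   # ordinary finders now read from the other database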
Hmm, I posted a reply using the Google Groups interface and it’s not
appearing (even though my first reply to this post came through just
fine). Lots of posts elsewhere about this annoying Google Groups issue.
So at the risk of repeating myself, I’ll try sending an email reply to
the group to see if it goes through or disappears into the ether.
Anyway,
first off, thanks for your reply. My particular situation is that I
have separate tables for individual US states with the two letter state
abbreviation as table prefixes (
e.g. ‘ny_things’, ‘ca_things’). For a variety of reasons these tables
can’t be merged into a single table. Rather than define (essentially
the same) models for every state, I’d rather just have the user select
a state, which will then set the state abbreviation in the user's session
data. From then on, all tables needing a prefix will be dynamically set
according to the session data. Some tables do not need a prefix at all,
so yes, I ran into the same problem trying to set ‘table_name_prefix’
on an individual model and found it was being prefixed to all table
names.
So for my first stab at a solution, I'm using a before_filter in my
controllers that calls a private method in application.rb (so it can be
shared between several controllers), which in turn calls the 'table_name'
methods for those models needing this:
class ApplicationController < ActionController::Base
Before this method is called, I call another before_filter method that
makes sure the session variable is set and valid. If not, it
redirects the user to select an appropriate value. This is working
perfectly for now. I’m using a regular expression to ensure that the
state abbreviation is in fact two and only two lower case letters
before it is set in the session, so I think I’m fairly safe with this
approach.
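The snippet above only shows the class line, so here is a hedged reconstruction of the two filters being described; the model name, method names, and session key are guesses for illustration, not the original code:

class ApplicationController < ActionController::Base
  private

  # Runs first (as a before_filter): make sure the session holds a valid
  # two-letter lower-case state abbreviation, otherwise send the user off
  # to pick one. The controller/action names here are placeholders.
  def require_valid_state
    unless session[:state] =~ /\A[a-z]{2}\z/
      redirect_to :controller => 'states', :action => 'select'
      return false
    end
  end

  # Runs second: point each prefixed model at its per-state table.
  def set_state_tables
    Thing.set_table_name "#{session[:state]}_things"
    # ...repeat for the other models that need the prefix...
  end
end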
As far as setting the table name within a model, I’ve always been able to just do:
class ExampleModel < ActiveRecord::Base
  self.table_name = "some_crazy_table_name_not_conforming_to_conventions"
end
I've come to the same result as you with setting the table_name via a filter, but I still go through the Site model to change the child table names because, in my case, the child table names all use an attribute of the site instance [Site#db_name, which in turn comes from Site#host].
As far as the self.table_name methodology goes, I had a lot of trouble with that and finally gave up: it wants to set the name when the model loads, and since I'm actually calling an attribute from another model to get the string, the whole thing would collapse on me. I had the same problem with custom finders until I came across a post somewhere that talked about using single rather than double quotes, so the interpolation is deferred until the finder actually runs. I really should have mentioned that earlier too.