One model, multiple tables vs. confusion about SQL indexes

Hi all,

This might be more of an SQL question, but I figured I give it a shot
here. One of my models has ~2 million records in it but only 200 are
frequently updated. I can also reliably predict which 200 these are
going to be. The issue is that even though 200 records are being
updated, the database indexes are being adjusted for all 2 million
records. Not very efficient. I'm considering two options to speed
things up:

(1) Split my table into two -- one for frequently updated records, one
for ones updated not so often. Problem with this is that I'd either
need my model to keep switching back and forth between tables or need
to use two models instead of one, leading to a lot of repetition.

(2) Add a frequently_updated column to my model and index based on
this. Problem is I'm not sure how indexes work. Let's say the index
has two branches, one for frequently_updated == 1 and one for
frequently_updated == 0. When I update or add a record, is the index
only updated for one of these branches or is the entire thing updated?

Anyone have experience dealing with this sort of thing before?

Oh, I'm using MySQL 5.0 and the table is InnoDB (although I can switch
to MyISAM if that works better).


-- Andrew

Andrew -- good question. It's one of those question where the answer
is "it depends". Databases can (should) do a lot of caching, and so
on, so depending on how frequently you request a row or rows from this
table, and how much memory you have, you may get dramatically
different results. Even though 2M rows sounds like a lot, it's not
particularly large by today's standards. I get concerned when I have
in the 100M to 1B range. You're right that lots of updates on a very
small percentage of rows will tend to decrease the efficiency of the
index (typically implemented as hash trees, but which get unbalanced
after a while). I am not a MySQL expert, but in PostgreSQL and all
other DBs I have worked with, there's a process to run that can update
the stats the database uses to build it's plan, and also even to
completely rebuild data structures on disk. There should also be some
commands to analyze the query plan (e.g. to ensure that the database
decides correctly to use your index instead of doing a full table

I don't think there should be any need to increase the complexity of
your application by creating a separate table to hold the less
frequently updated rows. It's not a bad idea to create an additional
"frequently_updated" column, at which point you can either a) create
an index on this column and another on the main key column you're
looking up, or b) create a compound index on the two columns at once
-- a good database would probably do better with choice a. Some
databases also let you create a conditional index, where only rows
meeting a certain condition are indexed -- this would be ideal for
your case. In any situation, updating the index shouldn't be a
terribly expensive operation, as indexes are usually implemented as
linked lists -- the index is only updated if the value of the index
key changes, and when it does, only a few index entries are updated:
the records that point to the index entry, and which the index entry
points to (both before and after the update) and the index record
itself (so just five rows). So if you can reliably predict which rows
change frequently, and those rows are pretty much the same from day to
day, the additional column is probably not worth it; if the rows that
change are volatile (go in and out of the "change_frequently" state a
lot), then the additional column might help.

Sorry for running on. Bottom line: measure, measure and measure again
before doing anything. Simplicity is a wonderful thing, and adding
complexity is almost always the wrong way of solving problems. All
good databases are almost always capable of handling queries like this
very well; in my career with databases, I have almost always found
that the cause of poor performance was the code that called the query,
not the query itself. In the few cases where there were performance
issues traced to the database, 90% were just a missing index, and 9%
were cases where the database was not using the index when it should
have (perhaps a configuration issue). 1% required some special
handling such as partitioning tables, or adding columns.

Hope this helps!