One model, multiple tables vs. confusion about SQL indexes

Hi all,

This might be more of an SQL question, but I figured I'd give it a shot here. One of my models has ~2 million records in it, but only 200 are frequently updated. I can also reliably predict which 200 these are going to be. The issue is that even though only 200 records are being updated, the database indexes are being adjusted for all 2 million records. Not very efficient. I'm considering two options to speed things up:

(1) Split my table into two -- one for frequently updated records, one for records updated less often. The problem with this is that I'd either need my model to keep switching back and forth between tables, or I'd need two models instead of one, which leads to a lot of repetition.

(2) Add a frequently_updated column to my model and index based on this. Problem is I'm not sure how indexes work. Let's say the index has two branches, one for frequently_updated == 1 and one for frequently_updated == 0. When I update or add a record, is the index only updated for one of these branches or is the entire thing updated?
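Concretely, what I have in mind for (2) is something like this (my_records is just a placeholder for the real table name):

    -- add the flag column plus an index on it
    ALTER TABLE my_records
      ADD COLUMN frequently_updated TINYINT(1) NOT NULL DEFAULT 0,
      ADD INDEX idx_frequently_updated (frequently_updated);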

Anyone have experience dealing with this sort of thing before?

Oh, I'm using MySQL 5.0 and the table is InnoDB (although I can switch to MyISAM if that works better).

Thanks!

-- Andrew

Andrew -- good question. It's one of those questions where the answer is "it depends". Databases do (or at least should do) a lot of caching, so depending on how often you request rows from this table and how much memory you have, you may see dramatically different results. Even though 2M rows sounds like a lot, it's not particularly large by today's standards; I start to get concerned somewhere in the 100M to 1B range. You're right that lots of updates concentrated on a very small percentage of rows will tend to decrease the efficiency of the index (typically implemented as a B-tree, which can become fragmented after a while). I am not a MySQL expert, but in PostgreSQL and every other DB I have worked with, there's a process you can run to update the statistics the database uses to build its query plan, and even one to completely rebuild the data structures on disk. There should also be a command to show the query plan (e.g. to make sure the database correctly decides to use your index instead of doing a full table scan).
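In MySQL I believe the rough equivalents are the following -- treat this as a sketch rather than gospel, and my_records is just a stand-in for your actual table:

    -- refresh the statistics the optimizer uses to build its plans
    ANALYZE TABLE my_records;

    -- rebuild the table and its indexes on disk (for InnoDB this recreates the table)
    OPTIMIZE TABLE my_records;

    -- show the query plan for a typical lookup and check which index it uses
    EXPLAIN SELECT * FROM my_records WHERE frequently_updated = 1;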

I don't think there should be any need to increase the complexity of your application by creating a separate table to hold the less frequently updated rows. It's not a bad idea to create an additional "frequently_updated" column, at which point you can either a) create an index on this column and another on the main key column you're looking rows up by, or b) create a compound index on the two columns at once -- a good database would probably do better with choice a. Some databases also let you create a conditional index, where only rows meeting a certain condition are indexed -- that would be ideal for your case (see the sketch below).

In any case, updating the index shouldn't be a terribly expensive operation. Indexes are usually implemented as B-trees, and an index is only touched when the value of a column it covers changes; even then, only the few entries around the changed value are updated, not the entire structure. So if you can reliably predict which rows change frequently, and those rows stay pretty much the same from day to day, the additional column is probably not worth it; if the rows that change are volatile (they go in and out of the "frequently_updated" state a lot), then the additional column might help.
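To make those options concrete, they'd look roughly like this -- index and column names other than frequently_updated are invented, and note that MySQL (at least 5.0) has no conditional indexes, so the last statement is PostgreSQL syntax:

    -- option a) keep your index on frequently_updated and add a separate one
    --           on whatever key you look rows up by
    CREATE INDEX idx_lookup_key ON my_records (lookup_key);

    -- option b) a single compound index covering both columns
    CREATE INDEX idx_freq_lookup ON my_records (frequently_updated, lookup_key);

    -- conditional (partial) index, PostgreSQL only: just the hot rows are indexed
    CREATE INDEX idx_hot_rows ON my_records (lookup_key) WHERE frequently_updated = 1;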

Sorry for running on. Bottom line: measure, measure, and measure again before doing anything. Simplicity is a wonderful thing, and adding complexity is almost always the wrong way to solve a problem. Good databases are perfectly capable of handling queries like this; in my career with databases, I have almost always found that the cause of poor performance was the code calling the query, not the query itself. Of the few cases that were traced to the database, 90% were just a missing index, 9% were the database not using an index when it should have (perhaps a configuration or statistics issue), and 1% required special handling such as partitioning tables or adding columns.
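If you do hit that 9% case -- EXPLAIN shows the optimizer ignoring an index it should be using -- MySQL does let you nudge it with an index hint. Something like the following, with made-up index and column names; I'd only reach for this after refreshing the statistics as above:

    -- last resort: force a specific index when the optimizer won't pick it on its own
    SELECT *
      FROM my_records FORCE INDEX (idx_freq_lookup)
     WHERE frequently_updated = 1 AND lookup_key = 12345;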

Hope this helps!

Tom