Speeding up ActiveRecord Querying on very large joins?

Maybe not directly relevant, but have you considered that you may have normalized your database into nonfunctionality? From the queries you've shown, most of the tables have one data field(?!), a date, and a foreign key field for the team. This would be an appropriate structure for describing a system where each type of statistic might be generated at different times, but it seems completely inappropriate for your application where all the stats are (obviously) generated simultaneously.

Why did you decide to go this way, vs a single 'stats' table for everything?
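
To make the contrast concrete, here is roughly the shape I'm inferring from your queries versus what I'd expect; the table and column names below are invented for illustration:

    # The shape your queries suggest: one tiny table per statistic (hypothetical names)
    create_table :rushing_yards_stats do |t|
      t.integer :team_id
      t.date    :week_ending
      t.integer :rushing_yards
    end
    # ...and another near-identical table for every other category.

    # The shape I'd expect: one row per team per week, one column per statistic
    create_table :team_stats do |t|
      t.integer :team_id
      t.date    :week_ending
      t.integer :rushing_yards
      t.integer :passing_yards
      # ...remaining statistic columns
    end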

--Matt Jones

Matt Jones wrote:

> Maybe not directly relevant, but have you considered that you may have normalized your database into nonfunctionality? From the queries you've shown, most of the tables have one data field(?!), a date, and a foreign key field for the team. This would be an appropriate structure for describing a system where each type of statistic might be generated at different times, but it seems completely inappropriate for your application where all the stats are (obviously) generated simultaneously.
>
> Why did you decide to go this way, vs a single 'stats' table for everything?
>
> --Matt Jones

Hi Matt,

There are approx. 37 different categories of statistics for NCAA football. Each category houses up to 14 columns. I would never attempt to build one table that housed 518 fields.

Not all of the stats are generated at the same time; they are compiled in separate weeks. Each statistics table holds 120 rows of data per week, tagged with the week in which they were compiled. So if there are, say, 17 weeks (counting bowls) in a college football season, there will be 17 weeks' worth of data (17 x 120 rows).
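
For reference, each category works out to roughly this kind of model (the names below are placeholders rather than my actual schema):

    class RushingOffense < ActiveRecord::Base
      belongs_to :team

      # 120 rows per week, each tagged with the week it was compiled for
      named_scope :for_week, lambda { |week| { :conditions => { :week => week } } }
    end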

Yes, I could simply place all of that into one giant table (a single stats table with 518 columns containing 2,040 rows per year), but that's not going to happen. It's inefficient and difficult to manage. If the table gets corrupted, which I think is quite likely in that kind of scenario, it kills my entire site at once. You also have to account for 3 x /tmp space for the largest table in your database. Given the nature of this, and the fact that these stats will not be deleted for years, that can hit a server pretty hard.

My tables are normalized properly.

Thanks.

> Matt Jones wrote:
>> Maybe not directly relevant, but have you considered that you may have normalized your database into nonfunctionality? From the queries you've shown, most of the tables have one data field(?!), a date, and a foreign key field for the team. This would be an appropriate structure for describing a system where each type of statistic might be generated at different times, but it seems completely inappropriate for your application where all the stats are (obviously) generated simultaneously.
>>
>> Why did you decide to go this way, vs a single 'stats' table for everything?
>>
>> --Matt Jones

> Hi Matt,
>
> There are approx. 37 different categories of statistics for NCAA football. Each category houses up to 14 columns. I would never attempt to build one table that housed 518 fields.

Why not? Even MySQL's fairly restrictive limits are more than 5 times that (see http://dev.mysql.com/doc/refman/5.0/en/column-count-limit.html). The general argument against it is that tables with that many columns aren't correctly separating concerns (see this post: http://forums.digitalpoint.com/showpost.php?s=c6f4b98064317f2ad75a532f0e1268ee&p=8255322&postcount=11 for a prime example of doing it wrong).

However, in this case, a single event (a game) generates all 518 values.
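
With a single wide table, pulling every stat for a given week becomes one query with no joins at all; the names here are only illustrative:

    class TeamStat < ActiveRecord::Base
      belongs_to :team
    end

    # Every one of the 518 values for every team in week 7, no joins (hypothetical schema)
    stats = TeamStat.find(:all, :conditions => { :week => 7 })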

> Not all of the stats are generated at the same time; they are compiled in separate weeks. Each statistics table holds 120 rows of data per week, tagged with the week in which they were compiled. So if there are, say, 17 weeks (counting bowls) in a college football season, there will be 17 weeks' worth of data (17 x 120 rows).

Nothing controversial there.

> Yes, I could simply place all of that into one giant table (a single stats table with 518 columns containing 2,040 rows per year), but that's not going to happen. It's inefficient and difficult to manage. If the table gets corrupted, which I think is quite likely in that kind of scenario, it kills my entire site at once. You also have to account for 3 x /tmp space for the largest table in your database. Given the nature of this, and the fact that these stats will not be deleted for years, that can hit a server pretty hard.

Some quick math:

518 columns x 8 bytes per column (could be less, but assume there's some overhead) -> 4144 bytes per row. MySQL's limit is just shy of 65k, so no issue there.

2040 rows x 4144 bytes -> roughly 8.1 MB.
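
If you want to double-check that in irb:

    bytes_per_row = 518 * 8          # => 4144
    rows_per_year = 17 * 120         # => 2040
    rows_per_year * bytes_per_row    # => 8453760 bytes
    8453760 / 1024.0 / 1024.0        # => ~8.1 MB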

Given that there are live applications that manage hundreds if not thousands of GBs of data in MySQL, I think this is an exceptionally premature optimization, unless you plan on running the server unaltered for the next 500 or so years. 3x /tmp space is similarly beneath consideration, unless you're serving from an embedded microcontroller or something.

Worrying about "table corruption" makes even less sense - if ANY data gets corrupted, it's time to restore from backup.

Finally, I'll point out that while a combined row is way under the limit, the mondo-JOIN query you're headed for is getting close to another limit: MySQL's maximum of 61 tables in a single join.
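
To illustrate: with one table per category, a page that needs every category ends up doing something like this (the association names are invented), and each category adds one more table toward MySQL's 61-table ceiling:

    # One entry per stat category -- 37 of these, plus the teams table itself
    Team.find(:all,
      :include    => [:rushing_offenses, :passing_offenses, :total_defenses],  # ...and 34 more
      :conditions => ["rushing_offenses.week = ?", 7])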

--Matt Jones

Hi Matt,

Again, you bring up some solid points; some I agree with, others I don't. However, at this stage in my app (at least for this season) I can't alter or test any of what you're describing.

I will definitely test a single table in development as the season progresses and see how much more optimized it is. I'm just skeptical, but I'm sure I'm not the only one who feels that way about a table this large.

I know it's not the biggest table ever, but a table that size still has me concerned. However, you could be right! Maybe it's the real way to go. I don't know at this point.

I just wanted to let you know that I do hear what you are saying, and that if you're right about all of this, it would help me with each new season.

So, my thanks to you for the input and feedback, and most of all - for following up on your comments with some solid rebuttals!

Take care mate.

Alpha Blue wrote:

> Hi Matt,
>
> Again, you bring up some solid points; some I agree with, others I don't. However, at this stage in my app (at least for this season) I can't alter or test any of what you're describing.

Poppycock. You might not be willing to alter your production DB this season, but you *can* (and probably should) set up a testing environment to see what you can do.

> I will definitely test a single table in development as the season progresses and see how much more optimized it is. I'm just skeptical, but I'm sure I'm not the only one who feels that way about a table this large.

I think a 518-field table sounds odd, but perhaps right for what you're doing. However, a large number of single-column tables sounds far odder.

> I know it's not the biggest table ever, but a table that size still has me concerned.

Why?

> However, you could be right! Maybe it's the real way to go. I don't know at this point.
>
> I just wanted to let you know that I do hear what you are saying, and that if you're right about all of this, it would help me with each new season.
>
> So, my thanks to you for the input and feedback, and most of all - for following up on your comments with some solid rebuttals!
>
> Take care mate.

Best,

Alpha Blue wrote: [...]

> I mentioned in paragraph two that I will be setting up a test environment for it.

OK; I missed that.

> But, again, I'm not going to alter my production DB at this time. The season is just a couple of weeks away, and that wouldn't allow enough time to implement, test, and then complete the change.

Sure it would. This is not a big change from the point of view of your app.

> I won't risk any issues.

Why?

> I talked about table corruption previously. Matt simply said that you can restore from backup. If I had 50,000 subscribers all expecting a service, and a table this large failed, taking basically all of the data the service provides with it, there would be a critical outage. Table corruption has a larger chance of occurring on larger tables.

Um, what? Table corruption is simply not an issue in any DB environment I have ever worked with. And I'm not sure why you say it has a greater chance of occurring on larger tables. If it's simply that there's more data, well, the same data spread across many small tables is still just as much data.

Basically, if you can't trust your DB server not to corrupt your tables, then don't use that DB server. It's that simple.

> I like to ensure whatever I provide works 100%. My current setup is much improved (query times are really good right now) and, with caching, will only get better.
>
> What I'm doing right now is not wrong; it can just be made more efficient later on (as Matt pointed out).

Perhaps. I haven't examined your queries in detail.

> At this point in time, we can agree to disagree on this topic.

But we shouldn't. That's a sign that more discussion is necessary, and should not simply be ignored. It's lazy, and not in the Larry Wall sense. With what you're doing, I doubt that you can afford that kind of laziness.

> Thanks.

Best,

Alpha Blue wrote: [...]

> So, when I say we can agree to disagree - I'm referring to the "timing" of the optimizations. I believe they should occur "after this season", with proper testing, and implementation.

I don't think there's necessarily a disagreement there -- I agree that you should be careful about what you do to your production DB -- although I wouldn't be surprised if you have a DB meltdown in mid-season since your current schema is apparently so weird.

(And BTW, you might want to review the usage of quotation marks. :) )

> I'm very systematic about how I work on things.

Good.

Best,

Hi Marnen,

Right now, the maximum time I'm seeing from the query routines, which now have proper indexes and are broken down into manageable pieces, runs anywhere from:

18ms - 24ms

I don't think that would create a DB meltdown, considering these numbers are without caching enabled. I would imagine that once I have caching turned on, the number of hits the database receives will be greatly reduced.

What I've decided to do (after talking to Slicehost staff) is to prepare for a larger slice upgrade (which would take approx. 30 minutes or less) in the wee hours of the morning if I get hit by too much load. I'm also seriously considering running a cache routine on all matchups each week so that they are cached up front (this was a pretty good suggestion that was given).
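
The weekly cache idea would look something along these lines (just a sketch; the cache key format and the helper method are placeholders):

    # Warm the cache once per week so matchup pages don't hit the stats queries cold
    def matchup_stats(home_team, away_team, week)
      cache_key = "matchup/#{week}/#{home_team.id}/#{away_team.id}"
      Rails.cache.fetch(cache_key, :expires_in => 7.days) do
        build_matchup_stats(home_team, away_team, week)  # placeholder for the real query code
      end
    end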

I think given this scenario I should be good to go for now. However, I'm already looking at how to implement the larger table and test it against a double year data load (just for simulation purposes).

My problem before was that my tables were not properly indexed and I had two queries that used 13-table joins each. It was just not efficient. The largest join I have now is 7 tables on indexed columns. Those queries are only taking 3ms apiece.
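
For anyone following along, the indexing change amounted to something like the following pattern (table and column names are examples, not my exact schema):

    # Composite indexes so each per-category lookup resolves on (team, week)
    add_index :rushing_offenses, [:team_id, :week]
    add_index :passing_offenses, [:team_id, :week]
    # ...and the same for the other category tables used in the joins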

Alpha Blue wrote: [...]

> The largest join I have now is 7 tables on indexed columns. Those queries are only taking 3ms apiece.

OK, that sounds *way* better.

Best,