Speeding up ActiveRecord Querying on very large joins?

Maybe not directly relevant, but have you considered that you may have
normalized your database into nonfunctionality? From the queries
you've shown, most of the tables have one data field(?!), a date, and
a foreign key field for the team. This would be an appropriate
structure for describing a system where each type of statistic might
be generated at different times, but it seems completely inappropriate
for your application where all the stats are (obviously) generated
simultaneously.

Why did you decide to go this way, vs a single 'stats' table for
everything?
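
For concreteness, here's roughly the difference I'm getting at - the
table and column names are guesses on my part, based on your queries:

  # what your queries suggest you have now - one narrow table per stat:
  class CreateRushingOffenses < ActiveRecord::Migration
    def self.up
      create_table :rushing_offenses do |t|
        t.integer :team_id
        t.date    :compiled_on
        t.float   :yards_per_game
      end
    end
  end
  # ...repeated, with minor variations, for every other statistic

  # versus one row per team per period, with a column per statistic:
  class CreateStats < ActiveRecord::Migration
    def self.up
      create_table :stats do |t|
        t.integer :team_id
        t.date    :compiled_on
        t.float   :rushing_yards_per_game
        t.float   :passing_yards_per_game
        # ...and so on
      end
    end
  end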

--Matt Jones

Hi Matt,

There are approx. 37 different categories of statistics for NCAA
football. Each category houses up to 14 columns. I would never attempt
to build one table that housed 518 fields.

All of the stats are not generated simultaneously. They are built into
separate weeks. Each statistics table houses 120 rows of data paired to
the given week it was compiled. So, if there's say 17 weeks (counting
bowls) for college football, then there will be 17 weeks worth of data
(17 x 120 rows).
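
To give you an idea of the shape, one of my category tables looks
roughly like this (I've changed the column names for the example):

  class CreatePassingOffenses < ActiveRecord::Migration
    def self.up
      create_table :passing_offenses do |t|
        t.integer :team_id
        t.integer :week
        t.integer :attempts, :completions, :touchdowns, :interceptions
        t.float   :yards_per_game, :efficiency_rating
        # ...up to 14 stat columns, depending on the category
      end
    end
  end
  # 37 categories x up to 14 columns each is where the 518 figure comes
  # from, and 120 teams x 17 weeks is the 2040 rows per table per season.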

Yes, I could simply place all of that into one giant table (1 stats
table with 518 columns that contains 2040 rows per year). That's not
going to happen. It's inefficient and difficult to manage. If the
table gets corrupted, which is very likely given that type of scenario,
it kills my entire site at once. You also have to account for 3 x /tmp
space for the largest table you have in your database. Given the nature
of this, and the fact that these stats will not be deleted for years,
that can hit a server pretty hard.

My tables are normalized properly.

Thanks.

Alpha Blue wrote:

> Hi Matt,
>
> There are approx. 37 different categories of statistics for NCAA
> football. Each category houses up to 14 columns. I would never attempt
> to build one table that housed 518 fields.

Why not? Even MySQL's fairly restrictive limits are more than 5 times
that (see http://dev.mysql.com/doc/refman/5.0/en/column-count-limit.html).
The general argument against that is that tables with that many columns
aren't correctly separating concerns. (See this post:
http://forums.digitalpoint.com/showpost.php?s=c6f4b98064317f2ad75a532f0e1268ee&p=8255322&postcount=11
for a prime example of doing it wrong.)

However, in this case, a single event (a game) generates all 518
values.

> All of the stats are not generated simultaneously. They are built into
> separate weeks. Each statistics table houses 120 rows of data paired to
> the given week it was compiled. So, if there's say 17 weeks (counting
> bowls) for college football, then there will be 17 weeks worth of data
> (17 x 120 rows).

Nothing controversial there.

> Yes, I could simply place all of that into one giant table (1 stats
> table with 518 columns that contains 2040 rows per year). That's not
> going to happen. It's inefficient and difficult to manage. If the
> table gets corrupted, which is very likely given that type of scenario,
> it kills my entire site at once. You also have to account for 3 x /tmp
> space for the largest table you have in your database. Given the nature
> of this, and the fact that these stats will not be deleted for years,
> that can hit a server pretty hard.

Some quick math:

518 columns x 8 bytes per column (could be less, but assume there's
some overhead) -> 4144 bytes per row.
MySQL's limit is just shy of 65k, so no issue there.

2040 rows x 4144 bytes -> roughly 8.1 MB.
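
Or in irb, if anyone wants to check my arithmetic:

  >> 518 * 8          # bytes per row, being generous
  => 4144
  >> 2040 * 4144      # bytes per season for the combined table
  => 8453760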

Given that there are live applications that manage hundreds if not
thousands of GBs of data in MySQL, I think this is an exceptionally
premature optimization, unless you plan on running the server
unaltered for the next 500 or so years. 3x /tmp space is similarly
beneath consideration, unless you're serving from an embedded
microcontroller or something.

Worrying about "table corruption" makes even less sense - if ANY data
gets corrupted, it's time to restore from backup.

Finally, I'll point out that while a combined row is way under the
size limit, the mondo-JOIN query you're headed for is getting close to
another limit - the maximum of 61 tables allowed in a single join.
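
In ActiveRecord terms, the moment you try to pull a full set of
categories in one query - something like this, with association names
I've made up - the generated SQL is already a good chunk of the way
there:

  # hypothetical association names, one per stats category:
  Team.find(:all,
            :joins => [:passing_offense, :rushing_offense, :total_offense,
                       :passing_defense, :rushing_defense, :total_defense])
  # ...extend that list to all 37 categories and you're most of the way
  # to MySQL's 61-table limit before anything else joins in.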

--Matt Jones

Hi Matt,

Again, you bring up some solid points, and some I agree with and others
I don't. However, at this stage in my app (at least for this season) I
can't alter or test any of what you speak about.

I will definitely test a single table in development as the season
progresses and see how much more optimized it is. I'm just skeptical
but I'm sure I'm not the only one that is this way when it comes to a
table this large.

I know it's not the biggest table ever, but a table that size still has
me concerned. However, you could be right! Maybe it's the real way to
go. I don't know at this point.

I just wanted to let you know that I do hear what you are saying and
that if you are correct in everything you say, it would help me for each
new season.

So, my thanks to you for the input and feedback, and most of all - for
following up on your comments with some solid rebuttals!

Take care mate.

Alpha Blue wrote:

> Hi Matt,
>
> Again, you bring up some solid points, and some I agree with and others
> I don't. However, at this stage in my app (at least for this season) I
> can't alter or test any of what you speak about.

Poppycock. You might not be willing to alter your production DB this
season, but you *can* (and probably should) set up a testing environment
to see what you can do.

> I will definitely test a single table in development as the season
> progresses and see how much more optimized it is. I'm just skeptical
> but I'm sure I'm not the only one that is this way when it comes to a
> table this large.

I think a 518-field table sounds odd, but perhaps right for what you're
doing. However, a large number of single-column tables sounds far
odder.

> I know it's not the biggest table ever, but a table that size still has
> me concerned.

Why?

> However, you could be right! Maybe it's the real way to
> go. I don't know at this point.

> I just wanted to let you know that I do hear what you are saying and
> that if you are correct in everything you say, it would help me for each
> new season.
>
> So, my thanks to you for the input and feedback, and most of all - for
> following up on your comments with some solid rebuttals!
>
> Take care mate.

Best,

Alpha Blue wrote:
[...]

> I mentioned in paragraph two that I will be setting up a test
> environment for it.

OK; I missed that.

> But, again, I'm not going to alter my production DB
> at this time. The season is just a couple weeks away and it wouldn't
> allow for enough time to implement, test, and then complete.

Sure it would. This is not a big change from the point of view of your
app.

> I won't
> risk any issues.

Why?

> I talked about table corruption previously. Matt simply said, well, you
> can restore from backup. If I had 50,000 subscribers all expecting a
> service, and a table this large (one that would basically represent all
> of the data provided in the service) failed, there would be a critical
> outage. Table corruption has a larger chance of occurring on larger
> tables.

Um, what? Table corruption is simply not an issue in any DB environment
I have ever worked with. And I'm not sure why you say it has a greater
chance of occurring on larger tables. If it's simply that there's more
data, well, that's true of many small tables too.

Basically, if you can't trust your DB server not to corrupt your tables,
then don't use that DB server. It's that simple.

> I like to ensure whatever I provide works 100%. My current situation is
> a lot more improved (query times are really good right now) and with
> caching, will only get better.
>
> What I'm doing right now is not wrong, it can just be more efficient
> later on (as Matt pointed out).

Perhaps. I haven't examined your queries in detail.

> At this point in time, we can agree to disagree on this topic.

But we shouldn't. That's a sign that more discussion is necessary, and
should not simply be ignored. It's lazy, and not in the Larry Wall
sense. With what you're doing, I doubt that you can afford that kind of
laziness.

> Thanks.

Best,

Alpha Blue wrote:
[...]

> So, when I say we can agree to disagree - I'm referring to the "timing"
> of the optimizations. I believe they should occur "after this season",
> with proper testing, and implementation.

I don't think there's necessarily a disagreement there -- I agree that
you should be careful about what you do to your production DB --
although I wouldn't be surprised if you have a DB meltdown in mid-season
since your current schema is apparently so weird.

(And BTW, you might want to review the usage of quotation marks. :) )

> I'm very systematic about how
> I work on things.

Good.

Best,

Hi Marnen,

Right now, the query routines, which have proper indexes and are broken
down into manageable pieces, run anywhere from:

18ms - 24ms

I don't think that would create a DB meltdown, considering these numbers
are without caching enabled. I would imagine that once I have caching
turned on, the number of hits the database receives will be greatly
reduced.

What I've decided to do (after talking to Slicehost staff) is to prepare
for a larger slice upgrade (which would take approx. 30 minutes or less)
in the wee hours of the morning if I get hit by too much load. I'm also
seriously considering running a cache routine on all matchups each week
so that they are cached ahead of time (a pretty good suggestion that was
made earlier).
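
Something along these lines is what I have in mind for the weekly cache
run - just a sketch at this point, and the model and method names are
placeholders:

  # lib/tasks/matchup_cache.rake (sketch only)
  namespace :stats do
    desc "Pre-build the cached data for every matchup in the given week"
    task :warm_matchups => :environment do
      week = Week.find_by_number(ENV['WEEK'])   # placeholder lookup
      Matchup.find(:all, :conditions => {:week_id => week.id}).each do |m|
        # do the expensive comparison once, ahead of traffic
        Rails.cache.write("matchup/#{week.id}/#{m.id}", m.comparison_data)
      end
    end
  end

I'd just kick that off from cron once the week's stats are loaded.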

I think given this scenario I should be good to go for now. However,
I'm already looking at how to implement the larger table and test it
against a double year data load (just for simulation purposes).

My problem before was that my tables were not properly indexed and I had
two queries that used 13 table joins each. It was just not efficient.
The largest join I have now is 7 tables on indexed results. Those
queries are only taking 3ms apiece.
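
For reference, the indexing I'm talking about is just composite indexes
on the columns every query filters by - something along these lines (the
real table names differ):

  class AddStatIndexes < ActiveRecord::Migration
    def self.up
      # every lookup is by team and week, so index the pair together
      add_index :passing_offenses, [:team_id, :week]
      add_index :rushing_offenses, [:team_id, :week]
      # ...and the same for the other tables involved in the joins
    end

    def self.down
      remove_index :passing_offenses, :column => [:team_id, :week]
      remove_index :rushing_offenses, :column => [:team_id, :week]
    end
  end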

Alpha Blue wrote:
[...]

> The largest join I have now is 7 tables on indexed results. Those
> queries are only taking 3ms apiece.

OK, that sounds *way* better.

Best,