Thoughts on tracking down cause of huge spikes

For the past week I have been experiencing extremely high loads at
seemingly random times.

Load goes through the roof, causing the Mongrels to be restarted by Monit
after the load stays above 10 for more than 5 minutes.

Neither production.log, the mongrel.logs, nor the Apache access log
helps me much in tracking down what call is being made to the
application that could cause such a freak-out.

The most recent episode showed a huge spike in I/O wait on the box.

The database sits on a separate box, and its load isn't running high at
any point (nothing above 1).

The box serves nothing except the site; it's content only, no logins
or anything fancy like that.

The strange thing is that the first time this occurred, no code in the
app had changed. The only change on the server was a kernel upgrade
for a security patch (RHEL4) 3 days beforehand.

I am extremely stumped and just looking for some other ideas on how I
could keep track of what's going on and see if there is one URL that
is causing these problems.

Thanks,
Paul

This sounds like the box may be swapping.

This could cause high load *and* high iowait.
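One quick way to confirm from a shell (standard procps tools, nothing specific to your app) is to watch the swap and iowait columns and see which processes hold the most resident memory:

```shell
# si/so = pages swapped in/out per second, wa = % CPU time in iowait
vmstat 1 2

# Current RAM and swap usage in megabytes
free -m

# Top resident-memory consumers (likely the Mongrels and Apache children)
ps -eo rss,pid,args --sort=-rss | head -15
```

If si/so are nonzero while wa is high, the iowait is the swap device, not your data disks.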

Tom,

Thanks for the reply; I should've included more details. You're right,
the box was swapping like crazy. My 2 gigs of RAM were filled, and 2
gigs of swap filled up during this last hiccup.

Load went up over 15, and all the Mongrels ended up being shut down and
restarted by Monit. Once everything got shut down, memory returned to
more normal usage. Monit will restart Mongrels on an individual basis
if memory gets out of hand. RMagick is no longer enabled on this app.
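For context, that per-Mongrel memory rule is the usual monit process check, something like this sketch (the pid file path, port, and 110 MB limit are hypothetical, not the actual config):

```
# monitrc sketch -- paths, port, and limits are hypothetical
check process mongrel_9000 with pidfile /var/run/mongrel.9000.pid
    start program = "/usr/bin/mongrel_rails start -e production -p 9000 -d -P /var/run/mongrel.9000.pid"
    stop program  = "/usr/bin/mongrel_rails stop -P /var/run/mongrel.9000.pid"
    if totalmem > 110.0 MB for 2 cycles then restart
```

One check block per Mongrel port, so a single leaking process gets bounced without taking the whole cluster down.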

What I believe caused the swapping was Apache starting up more workers
to handle incoming requests as all available workers got used up
sitting around waiting for responses from the proxy_balanced Mongrels.

The box has 2 gigs of memory; on a good day, the 6 Mongrels for the
public app and the 2 for the admin backend use about 1500 MB of that.
All run in production mode.

I've had Apache's LogLevel set to debug, and the only odd thing that
has started showing up is entries like the following. They don't
always correlate exactly with the times of these spikes.

[Wed Aug 22 00:06:51 2007] [debug] proxy_util.c(1625): proxy: grabbed
scoreboard slot 7 in child 28696 for worker http://127.0.0.1:9000
[Wed Aug 22 00:06:51 2007] [debug] proxy_util.c(1708): proxy:
initialized worker 7 in child 28696 for (127.0.0.1) min=0 max=25
smax=25
[Wed Aug 22 00:06:51 2007] [debug] proxy_util.c(1625): proxy: grabbed
scoreboard slot 9 in child 28696 for worker http://127.0.0.1:9001
[Wed Aug 22 00:06:51 2007] [debug] proxy_util.c(1708): proxy:
initialized worker 9 in child 28696 for (127.0.0.1) min=0 max=25
smax=25
[Wed Aug 22 00:06:51 2007] [debug] proxy_util.c(1625): proxy: grabbed
scoreboard slot 11 in child 28696 for worker http://127.0.0.1:9002

What is baffling is why, all of a sudden, the app runs so slowly that
it essentially dies. Looking at the Munin graphs, it's easy to
pinpoint in the CPU & I/O usage the night the problem first appeared.
Again, the only change that has come to light is that 3 nights earlier
a kernel update was made.

Could Ruby/Mongrel require a recompile? Could there have been
something in that kernel upgrade? (I checked the changelog and nothing
related to I/O came up.) This setup has been running for months
without this kind of problem, so I'm really grasping at straws.
Traffic hasn't increased all that much; it's been rather steady.

I have identified some slower-generating pages that could use a good
rewrite (paginate is the likely culprit), but they don't get generated
over and over again since they use cache_pages.

In my perfect world there is one page request that is just destroying
the app but I can't find anything in the logs about it.
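If the logs ever do cooperate, here's a rough sketch for summing per-URL time out of a Rails 1.x production.log. It assumes the stock "Completed in 1.23456 ... [http://host/path]" line format; slowest_requests is my own name for illustration, not a Rails API:

```ruby
# Sketch only: tally cumulative request time per URL from production.log,
# assuming the default Rails 1.x "Completed in ... [http://...]" log lines.
def slowest_requests(path, top = 10)
  totals = Hash.new { |h, k| h[k] = [0.0, 0] }
  File.foreach(path) do |line|
    next unless line =~ /Completed in ([\d.]+).*\[(http[^\]]+)\]/
    secs, url = $1.to_f, $2
    totals[url][0] += secs   # cumulative seconds spent serving this URL
    totals[url][1] += 1      # hit count
  end
  totals.sort_by { |_url, (sum, _hits)| -sum }.first(top)
end

# Print the most expensive URLs when run against a real log:
if __FILE__ == $PROGRAM_NAME && File.exist?("production.log")
  slowest_requests("production.log").each do |url, (sum, hits)|
    printf "%8.2fs total  %5d hits  %s\n", sum, hits, url
  end
end
```

Sorting by cumulative time rather than per-request time matters here: a moderately slow page hit constantly can hurt more than one pathological request.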

-Paul

Full setup details

Web server:
2x 2.4 GHz Xeon
2 gigs RAM
RAID 10, 10K SCSI disks, 200 gigs free

Database (load has yet to peak over 1 since being optimized):
1x 2.8 GHz P4
1 gig of RAM
RAID 1

I've never liked dynamic server configs because they're dynamic. :slight_smile:

Set the min and max works in Apache to be the same number, a number that keep the machine out of swap.

Having requests queue is far better than swapping! :slight_smile:

Does your box have cron events? Might these episodes be triggered by those?

Yikes! Sorry for the illiterate response. No more posting before
caffeine!

I've never liked dynamic server configs because they're dynamic. :slight_smile:

Set the min and max workers in Apache to be the same number, a number
that keeps the machine out of swap.

Queueing requests is far better than swapping! :slight_smile:

Does your box have cron events? Might these episodes be triggered by
those?
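As a sketch of that advice for the prefork MPM (the numbers are illustrative; pick them so worker count times per-child footprint, plus the roughly 1.5 GB the Mongrels use, stays under the 2 GB of RAM):

```apache
# httpd.conf, prefork MPM -- illustrative values only
<IfModule prefork.c>
    StartServers        20
    MinSpareServers     20
    MaxSpareServers     20
    ServerLimit         20
    MaxClients          20
</IfModule>
```

With StartServers, the spare-server bounds, and MaxClients all equal, the pool is pinned at a fixed size: requests beyond that queue in the listen backlog instead of spawning new children into swap.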

Haha, luckily I used the last of my Starbucks gift card this morning.

I'm definitely going to set the min/max to the same value. I had also
overlooked a large spike on the DB box that is definitely in line with
the problem on the web server.

Luckily MySQL slow query logging was on, so I have something to look at.
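For anyone following along, enabling it looks roughly like this in a MySQL 4.x/5.0-era my.cnf (the 2-second threshold is just an example), and mysqldumpslow can summarize the resulting log:

```
# my.cnf -- example settings for MySQL 4.x/5.0
[mysqld]
log-slow-queries = /var/log/mysql/slow.log
long_query_time  = 2
```

Then `mysqldumpslow -s t /var/log/mysql/slow.log` groups similar queries and sorts them by time, which is a quick way to spot the one query behind a spike.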

And there's only one cron job, set for one minute after midnight. No load problems then.

Thanks for the advice. Hopefully this gets squashed soon.