Thoughts on tracking down cause of huge spikes

For the past week I have been experiencing extremely high loads at seemingly random times.

Load goes through the roof, causing Mongrels to be restarted by Monit after load stays above 10 for more than 5 minutes.
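
(For context, the restart trigger is a Monit rule roughly of this shape — the hostname, threshold, and script path below are placeholders rather than my actual config:)

    # placeholder hostname/script; roughly the shape of the load-triggered rule in monitrc
    check system web01
        if loadavg (5min) > 10 for 3 cycles then exec "/usr/local/sbin/restart_mongrels.sh"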

The production.log, the mongrel.logs, and the Apache access log don't help me much in tracking down what call is being made to the application that could cause such a freakout.

The most recent showed a huge spike in the I/O wait on the box.
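
If it helps anyone watching for the same thing, catching the next spike in the act with vmstat/sar should show whether it's swap churn behind the iowait (sar needs the sysstat package):

    vmstat 5        # watch the si/so (swap in/out) and wa (iowait) columns
    sar -u          # CPU history, including %iowait
    sar -W          # swap-in / swap-out rates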

The database sits on a separate box and the load isn't running high at any point (nothing above 1).

The box serves nothing except the site; it's content only, no logins or anything fancy like that.

The strange thing is that the first time this occurred, no code in the app had been changed. The only change on the server was a kernel upgrade for a security patch (RHEL4) 3 days beforehand.

I'm extremely stumped and just looking for other ideas on how I could keep track of what's going on and see whether there is one URL that is causing these problems.
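
The best idea I've had so far is to pull the slowest requests straight out of production.log; assuming the stock Rails "Completed in ..." lines, something like:

    # prints the 20 slowest requests with their URLs (time is field 3, URL is the last field)
    grep 'Completed in' production.log | awk '{print $3, $NF}' | sort -rn | head -20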

Thanks, Paul

This sounds like the box may be swapping.

This could cause high load *and* high iowait.

Tom,

Thanks for the reply, I should've included more details. You're right, the box was swapping like crazy. My 2 gigs of RAM were filled and the 2 gigs of swap filled up during this last hiccup.

Load went up over 15, and all the mongrels ended up being shut down and restarted by monit. Once everything was shut down, memory returned to more normal usage. Monit will restart Mongrels on an individual basis if memory gets out of hand. RMagick is no longer enabled on this app.
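
(Those per-Mongrel checks are the usual Monit recipe, roughly as below — the pid-file path, port, and memory threshold are placeholders, not my exact config:)

    check process mongrel_9000 with pidfile /var/www/app/shared/pids/mongrel.9000.pid
        start program = "/usr/bin/mongrel_rails cluster::start -C /var/www/app/current/config/mongrel_cluster.yml --only 9000"
        stop program  = "/usr/bin/mongrel_rails cluster::stop -C /var/www/app/current/config/mongrel_cluster.yml --only 9000"
        if totalmem > 150.0 MB for 5 cycles then restart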

What I believe caused the swapping was Apache starting up more workers to handle incoming requests, since all the available workers were tied up waiting for responses from the proxy_balanced Mongrels.

The box has 2 gigs of memory; on a good day the 6 mongrels for the public app and the 2 for the admin backend use about 1500 MB of that. All run in production mode.
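
Back-of-the-envelope, that doesn't leave Apache much headroom before swap (the per-child size here is a guess — check ps on the box):

    2048 MB RAM - ~1500 MB for the 8 mongrels   = ~500 MB left over
    ~500 MB / ~15 MB per Apache child           = roughly 30-35 children before swapping
    (a MaxClients at the usual 150-256 default sails straight past that)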

I've had my LogLevel for Apache set to debug, and the only odd thing that has started showing up is entries like the following. They don't always correlate exactly with the times of these spikes.

    [Wed Aug 22 00:06:51 2007] [debug] proxy_util.c(1625): proxy: grabbed scoreboard slot 7 in child 28696 for worker http://127.0.0.1:9000
    [Wed Aug 22 00:06:51 2007] [debug] proxy_util.c(1708): proxy: initialized worker 7 in child 28696 for (127.0.0.1) min=0 max=25 smax=25
    [Wed Aug 22 00:06:51 2007] [debug] proxy_util.c(1625): proxy: grabbed scoreboard slot 9 in child 28696 for worker http://127.0.0.1:9001
    [Wed Aug 22 00:06:51 2007] [debug] proxy_util.c(1708): proxy: initialized worker 9 in child 28696 for (127.0.0.1) min=0 max=25 smax=25
    [Wed Aug 22 00:06:51 2007] [debug] proxy_util.c(1625): proxy: grabbed scoreboard slot 11 in child 28696 for worker http://127.0.0.1:9002
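
(For reference, the proxy side is a plain mod_proxy_balancer setup along these lines, with the Mongrels on 9000-9002 — this is a sketch, not my exact vhost, and the min/max/smax values in the debug output are just the per-worker connection-pool settings being logged:)

    <Proxy balancer://mongrel_cluster>
        BalancerMember http://127.0.0.1:9000
        BalancerMember http://127.0.0.1:9001
        BalancerMember http://127.0.0.1:9002
    </Proxy>
    ProxyPass / balancer://mongrel_cluster/
    ProxyPassReverse / balancer://mongrel_cluster/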

What is baffling is why all of a sudden the app runs so slowly that it essentially dies. Looking at the Munin graphs, it's easy to pinpoint in the CPU & I/O usage the night the problem first appeared. Again, the only change that has come to light is the kernel update made 3 nights earlier.

Could Ruby/Mongrel require a recompile? Could there have been something in that kernel upgrade (although I checked the changelog and nothing related to I/O came up)? This setup has been running for months without this kind of problem, so I'm really grasping at straws. Traffic hasn't increased all that much; it's been rather steady.
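
(For anyone wanting to check the same thing, reviewing the update on RHEL is just rpm:)

    rpm -q kernel                       # installed kernel packages
    rpm -q --changelog kernel | less    # what the update actually changed
    uname -r                            # which kernel is currently booted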

I have identified some slower-generating pages that could use a good rewrite (paginate is the likely culprit), but they don't get generated over and over again since they use cache_pages.
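
(The caching itself is standard Rails page caching — the controller and actions below are made up for illustration; the rendered page lands in public/ so Apache serves repeat hits without touching Mongrel:)

    class ArticlesController < ApplicationController
      caches_page :index, :show
    end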

In my perfect world there is one page request that is just destroying the app, but I can't find anything in the logs about it.

-Paul

Full setup details:

Web server: 2x 2.4 GHz Xeon, 2 gigs RAM, RAID 10 on 10K SCSI disks, 200 gigs free

Database (load has yet to really peak over 1 since being optimized): 1x 2.8 GHz P4, 1 gig of RAM, RAID 1

I've never liked dynamic server configs because they're dynamic. :slight_smile:

Set the min and max works in Apache to be the same number, a number that keep the machine out of swap.

Having requests queue is far better than swapping! :slight_smile:

Does your box have cron events? Might these episodes be triggered by those?

Yikes! Sorry for the illiterate response. No more posting before caffeine!

I've never liked dynamic server configs because they're dynamic. :slight_smile:

Set the min and max workers in Apache to be the same number, a number that keeps the machine out of swap.

Queueing requests is far better than swapping! :slight_smile:
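
Something like this for the prefork MPM — the numbers are placeholders; size them so MaxClients times your per-child footprint fits in RAM alongside the Mongrels (on the worker MPM the analogous MinSpareThreads/MaxSpareThreads/MaxClients knobs apply):

    <IfModule prefork.c>
        StartServers        30
        MinSpareServers     30
        MaxSpareServers     30
        ServerLimit         30
        MaxClients          30
        MaxRequestsPerChild 4000
    </IfModule>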

Does your box have cron events? Might these episodes be triggered by those?

Haha, luckily I used the last of my Starbucks gift card this morning.

I'm definitely going to set the min/max to the same. I had also overlooked a large spike on the DB box that is definitely in line with the problem on the web server.

Luckily MySQL slow query logging was on, so I have something to look at.
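
First pass will probably just be mysqldumpslow over that log (the path below is a guess — whatever log-slow-queries in my.cnf points at):

    mysqldumpslow -s t /var/log/mysql/mysql-slow.log | head -40   # slowest queries first
    # relevant my.cnf settings (MySQL 4.1/5.0 era):
    #   log-slow-queries = /var/log/mysql/mysql-slow.log
    #   long_query_time  = 2
    #   log-queries-not-using-indexes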

And there's only one cron job, set for one minute after midnight. No load problems then.

Thanks for the advice. Hopefully this gets squashed soon.