Tom,
Thanks for the reply; I should've included more details. You're right,
the box was swapping like crazy: my 2 GB of RAM filled up, and all 2 GB
of swap filled during this last hiccup.
Load went up over 15, and all the Mongrels ended up being shut down and
restarted by monit. Once everything was shut down, memory returned to
more normal usage. Monit will restart Mongrels individually if their
memory use gets out of hand. RMagick is no longer enabled on this app.
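For reference, the per-Mongrel restart rule is a monit stanza along these lines; the paths, port, cycle count, and the 110 MB threshold below are illustrative, not my exact values:

```
check process mongrel_9000 with pidfile /var/www/app/tmp/pids/mongrel.9000.pid
  start program = "/usr/bin/mongrel_rails start -d -e production -p 9000 -P /var/www/app/tmp/pids/mongrel.9000.pid -c /var/www/app"
  stop program  = "/usr/bin/mongrel_rails stop -P /var/www/app/tmp/pids/mongrel.9000.pid"
  # restart this one Mongrel if it stays over the memory threshold
  if totalmem > 110.0 MB for 3 cycles then restart
```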
What I believe caused the swapping is Apache spawning more workers to
handle incoming requests, since all the available workers were tied up
waiting for responses from the proxy-balanced Mongrels.
On a good day the 6 Mongrels for the public app and the 2 for the admin
backend use about 1500 MB of the box's 2 GB. All run in production
mode.
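If that theory holds, capping the prefork MPM so Apache can't fork itself into swap might help. A sketch with illustrative numbers (assuming the prefork MPM and roughly 500 MB left free for Apache children; the exact MaxClients value would need tuning against the per-child footprint):

```apache
# Illustrative values only. The idea: MaxClients * per-child memory
# must fit in what the Mongrels leave free, so requests stalled on
# the proxy queue up instead of spawning the box into swap.
<IfModule prefork.c>
    StartServers          5
    MinSpareServers       5
    MaxSpareServers      10
    MaxClients           40
    MaxRequestsPerChild 1000
</IfModule>
```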
I've had Apache's LogLevel set to debug, and the only odd thing that
has started showing up is entries like the following. They don't always
correlate exactly with the times of these spikes.
[Wed Aug 22 00:06:51 2007] [debug] proxy_util.c(1625): proxy: grabbed
scoreboard slot 7 in child 28696 for worker http://127.0.0.1:9000
[Wed Aug 22 00:06:51 2007] [debug] proxy_util.c(1708): proxy:
initialized worker 7 in child 28696 for (127.0.0.1) min=0 max=25
smax=25
[Wed Aug 22 00:06:51 2007] [debug] proxy_util.c(1625): proxy: grabbed
scoreboard slot 9 in child 28696 for worker http://127.0.0.1:9001
[Wed Aug 22 00:06:51 2007] [debug] proxy_util.c(1708): proxy:
initialized worker 9 in child 28696 for (127.0.0.1) min=0 max=25
smax=25
[Wed Aug 22 00:06:51 2007] [debug] proxy_util.c(1625): proxy: grabbed
scoreboard slot 11 in child 28696 for worker http://127.0.0.1:9002
What is baffling is why, all of a sudden, the app runs so slowly that
it essentially dies. Looking at the Munin graphs, it's easy to pinpoint
in the CPU & I/O usage the night the problem first appeared. Again, the
only change that has come to light is a kernel update made 3 nights
earlier.
Could Ruby/Mongrel require a recompile? Could there have been something
in that kernel upgrade? (I checked the changelog and nothing related to
I/O came up.) This setup has been running for months without this kind
of problem, so I'm really grasping at straws. Traffic hasn't increased
all that much; it's been rather steady.
I have identified some slower-generating pages that could use a good
rewrite (paginate is the likely culprit), but they don't get generated
over and over again since they use caches_page.
In my perfect world there is one page request that is just destroying
the app, but I can't find anything in the logs about it.
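One rough way to hunt for such a request (assuming the stock Rails log format, where each request logs a "Completed in N.NNNNN ..." line with the URL at the end, and the usual log/production.log path):

```shell
# List the 20 slowest requests by render time. Field 3 of each
# "Completed in" line is the elapsed seconds, so a reverse numeric
# sort on it surfaces the worst offenders and their URLs.
grep 'Completed in' log/production.log | sort -rn -k3 | head -20
```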
-Paul
Full setup details:

Web server:
  2x 2.4 GHz Xeon
  2 GB RAM
  RAID 10, 10K SCSI disks, 200 GB free

Database (load has yet to really peak over 1 since it was optimized):
  1x 2.8 GHz P4
  1 GB RAM
  RAID 1