one mongrel with *lots* of close_wait tcp connections

* cross posted to the mongrel mailing list*

Hi, I'm running into a strange issue where one mongrel will sometimes
develop hundreds of CLOSE_WAIT TCP connections, mostly to apache (I
think; see the sample lsof output below). I haven't had a chance to get
the mongrel with this behavior into USR1 debug mode yet, so I wrote a
little loop (below) that should catch it next time. This issue occurs a
couple of times a day on average, at seemingly random times. The problem
goes away within a minute or two, probably after a restart of the
mongrel.

I'm probably doing something crazy to cause this behavior, but I'm
having trouble figuring out exactly what the problem is. It probably has
to do with the fact that my mongrels get files off of Amazon S3 for some
requests: we do HTTPClient.get(url) for some S3 URLs. I'm setting up
dnsmasq now, by the way, but it's not up yet.

My next steps are to get the mongrel into USR1 debug mode to see which
actions are causing the problem, and to install dnsmasq and cacti. I
think I have a good guess which action is responsible (probably the one
that gets the files from S3), but I'll make sure.

If you have any thoughts or other ideas, please let me know. Thanks a
ton for your help!

Some sample output from lsof:

lsof -i -P | grep CLOSE_ | grep mongrel

CLOSE_WAIT --mysite
mongrel_r 831 root 6u IPv4 95162945 TCP localhost.localdomain:8011->localhost.localdomain:59311 (CLOSE_WAIT)
mongrel_r 831 root 9u IPv4 95161753 TCP mysite.com:49269->xxx-xxx-xxx-xxx.amazon.com:80 (CLOSE_WAIT)
mongrel_r 831 root 11u IPv4 95162093 TCP mysite.com:49339->xxx-xxx-xxx-xxx.amazon.com:80 (CLOSE_WAIT)
mongrel_r 831 root 14u IPv4 95162202 TCP mysite.com:49373->xxx-xxx-xxx-xxx.amazon.com:80 (CLOSE_WAIT)
mongrel_r 831 root 15u IPv4 95162229 TCP mysite.com:49380->xxx-xxx-xxx-xxx.amazon.com:80 (CLOSE_WAIT)
mongrel_r 831 root 16u IPv4 95162319 TCP mysite.com:49399->xxx-xxx-xxx-xxx.amazon.com:80 (CLOSE_WAIT)
mongrel_r 831 root 17u IPv4 95162477 TCP mysite.com:49436->xxx-xxx-xxx-xxx.amazon.com:80 (CLOSE_WAIT)
mongrel_r 831 root 19u IPv4 95163082 TCP localhost.localdomain:8011->localhost.localdomain:59348 (CLOSE_WAIT)
mongrel_r 831 root 20u IPv4 95163221 TCP localhost.localdomain:8011->localhost.localdomain:59387 (CLOSE_WAIT)
mongrel_r 831 root 21u IPv4 95163360 TCP localhost.localdomain:8011->localhost.localdomain:59426 (CLOSE_WAIT)
mongrel_r 831 root 22u IPv4 95161592 TCP mysite.com:49227->xxx-xxx-xxx-xxx.amazon.com:80 (CLOSE_WAIT)
mongrel_r 831 root 23u IPv4 95163507 TCP localhost.localdomain:8011->localhost.localdomain:59463 (CLOSE_WAIT)
mongrel_r 831 root 24u IPv4 95163675 TCP localhost.localdomain:8011->localhost.localdomain:59495 (CLOSE_WAIT)
mongrel_r 831 root 25u IPv4 95164041 TCP localhost.localdomain:8011->localhost.localdomain:59586 (CLOSE_WAIT)
mongrel_r 831 root 26u IPv4 95164181 TCP localhost.localdomain:8011->localhost.localdomain:59618 (CLOSE_WAIT)
mongrel_r 831 root 27u IPv4 95164293 TCP localhost.localdomain:8011->localhost.localdomain:59641 (CLOSE_WAIT)
mongrel_r 831 root 28u IPv4 95164441 TCP localhost.localdomain:8011->localhost.localdomain:59670 (CLOSE_WAIT)
mongrel_r 831 root 29u IPv4 95164607 TCP localhost.localdomain:8011->localhost.localdomain:59705 (CLOSE_WAIT)
mongrel_r 831 root 30u IPv4 95164748 TCP localhost.localdomain:8011->localhost.localdomain:59746 (CLOSE_WAIT)
mongrel_r 831 root 31u IPv4 95164895 TCP localhost.localdomain:8011->localhost.localdomain:59786 (CLOSE_WAIT)
mongrel_r 831 root 32u IPv4 95165064 TCP localhost.localdomain:8011->localhost.localdomain:59830 (CLOSE_WAIT)

etc. This goes on for 700 lines: the mongrel on port 8011 has roughly
700 CLOSE_WAIT TCP connections to ports in the 30-60k range (to apache,
I believe). All of these CLOSE_WAITs are for the mongrel on port 8011 in
this case. Also, any ideas what's going on with the CLOSE_WAIT
connections to Amazon S3?

lsof -i -P | grep CLOSE_ | grep mongrel | wc -l
703
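
To check my "mostly to apache" guess instead of eyeballing 700 lines, a
breakdown by remote host would help. Here is a rough sketch of how I
might tally it. The helper name close_wait_by_remote is made up for this
post, and it assumes the NAME column of `lsof -i -P` always looks like
"local:port->remote:port", as it does in the sample above.

```ruby
# Tally CLOSE_WAIT sockets from `lsof -i -P` output by remote host.
# Hypothetical helper, not part of Mongrel or lsof.
def close_wait_by_remote(lsof_output)
  counts = Hash.new(0)
  lsof_output.each_line do |line|
    # Only lines that lsof marks as CLOSE_WAIT are of interest.
    next unless line.include?("(CLOSE_WAIT)")
    # Assumed NAME column shape: "local:port->remote:port".
    match = line.match(/(\S+):(\d+)->(\S+):(\d+)/)
    next unless match
    counts[match[3]] += 1
  end
  counts
end

# Possible usage on the box itself:
#   close_wait_by_remote(`lsof -i -P | grep mongrel`).each do |host, n|
#     puts "#{n} #{host}"
#   end
```

If most of the count lands on localhost.localdomain, that would support
the apache theory; anything under the amazon.com hosts would be the S3
fetches.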

netstat | grep 56586 # an example port
tcp 1 0 localhost.localdomain:8011 localhost.localdomain:56586 CLOSE_WAIT
tcp 0 0 localhost.localdomain:56586 localhost.localdomain:8011 FIN_WAIT2
getnameinfo failed
getnameinfo failed

# Background loop to turn on debug mode during the CLOSE_WAIT period.
# Note: killall signals every mongrel_rails process, not just the bad one.
  def debug_mongrel_loop
    # Poll once a minute until more than 100 CLOSE_WAIT connections show up.
    sleep(60) until `lsof -i -P | grep CLOSE_WAIT | grep mongrel | wc -l`.to_i > 100
    `killall -USR1 mongrel_rails`
    AdminMailer.deliver_mongrel_debug_mode_turned_on # optional email alert

    # Sleep 2 minutes, and then undo the debug mode.
    sleep(120)
    `killall -USR1 mongrel_rails`
  end
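
Since killall hits every mongrel_rails process, a narrower variant might
pick out only the PID that actually has the CLOSE_WAIT pile-up and send
USR1 to that one. This is only a sketch: both method names are made up
for this post, and it assumes the PID is the second column of
`lsof -i -P` output and that the command column starts with "mongrel",
as in the sample above.

```ruby
# Return the PIDs of mongrel processes whose CLOSE_WAIT count in
# `lsof -i -P` output exceeds the threshold. Hypothetical helper.
def pids_over_close_wait_threshold(lsof_output, threshold = 100)
  counts = Hash.new(0)
  lsof_output.each_line do |line|
    next unless line.include?("(CLOSE_WAIT)")
    # Assumed column order: COMMAND PID USER FD TYPE DEVICE ... NAME
    command, pid = line.split
    next unless command.to_s.start_with?("mongrel") && pid =~ /\A\d+\z/
    counts[pid.to_i] += 1
  end
  counts.select { |_pid, count| count > threshold }.map { |pid, _count| pid }.sort
end

# Send USR1 to just those processes (same signal the loop sends via killall).
def toggle_debug(pids)
  pids.each { |pid| Process.kill("USR1", pid) }
end

# Possible usage inside the loop, replacing the killall calls:
#   bad = pids_over_close_wait_threshold(`lsof -i -P`)
#   toggle_debug(bad)
#   sleep(120)
#   toggle_debug(bad)
```

The same PID list is reused for the second call so that debug mode is
switched back off on exactly the processes it was switched on for.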