load balancer performance problems

Are you using sendfile? Do you have the sendfile gem installed?

Do both nginx and pound slow down?

Are you serving static content through some nice rewrite rules (see Ezra’s nginx setup, or the mongrel docs for pound)?

I thought sendfile wasn’t needed anymore with Mongrel? I thought Zed posted that in one of the more recent releases…

I’m having the same issue as Will, and I don’t have sendfile installed (because I removed it). The app will serve thousands of requests at 0.2 - 0.8 seconds each, and then the render times will just shoot up to 4-8 seconds for no apparent reason… The database isn’t the issue (at least not according to the “DB” vs “Render” times in the Rails log).

ed

Are the two of you who are seeing this problem running in production mode? Since you say this also happens with pound, it might be a mongrel or Rails issue: pound proxies everything, so mongrel ends up serving static files too, which it shouldn’t be doing in a production environment. With nginx, are you using a good config file[1] that does the correct rewrites to make sure nginx serves all static and Rails page-cached files?

Also, what kind of server environment are you running on? Does the site sit idle for a while before this happens? Maybe it's being swapped out to disk and then needs to be swapped back in? If you can provide more details, I'm sure we can help you figure out what it is.
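One rough way to test the swapping theory on a Linux box is to compare the kernel's swap counters before and after a slow spell. The sketch below assumes Linux and the `pswpin` / `pswpout` fields of `/proc/vmstat`; the function names are made up for illustration.

```python
def read_swap_counters(vmstat_text):
    """Return (pages swapped in, pages swapped out) from /proc/vmstat text."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters.get("pswpin", 0), counters.get("pswpout", 0)


def swap_delta(before_text, after_text):
    """Return how many pages were swapped in and out between two snapshots."""
    in_before, out_before = read_swap_counters(before_text)
    in_after, out_after = read_swap_counters(after_text)
    return in_after - in_before, out_after - out_before


# Usage idea (Linux only): read /proc/vmstat, wait until a slow request
# shows up in the Rails log, read it again, and pass both texts to swap_delta().
```

If the swapped-in count grows during the slow periods, memory pressure is a likely suspect; if both deltas stay at zero, swapping can probably be ruled out.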

-Ezra

[1] http://brainspl.at/nginx.conf.txt

I should have given more information:

In our production env:
mongrel (0.3.13.4)
mongrel_cluster (0.2.0)
F5 load balancer (balancing 10 mongrel processes)

The ‘site’ is more of an XML service (though it doesn’t use AWS). It answers HTTP requests exclusively with XML: no sessions, no rhtml, etc. We’re currently getting only around 120 requests per minute, but the site is hardly ever idle.

Most of the requests get served in <= 0.5s, but occasionally the render time jumps way up (> 4s) for no visible reason. The Rails logs report quick DB times; it’s just the render times that are high. This is interesting because the DB query is quite large (6 joins, with some tables having almost 1 million rows) but the result set is very small (maybe 5 rows).
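To see whether the slow renders cluster in time or hit particular requests, it can help to pull them out of the production log. This is only a sketch: it assumes the log's summary lines look like `Completed in 4.50000 (0 reqs/sec) | Rendering: 4.20000 (93%) | DB: 0.10000 (2%) | 200 OK [...]`, which may differ between Rails versions, and `slow_renders` is a hypothetical helper name.

```python
import re

# Assumed shape of the Rails summary line; adjust if your log differs.
COMPLETED = re.compile(
    r"Completed in (?P<total>\d+\.\d+)"
    r".*?Rendering: (?P<render>\d+\.\d+)"
    r".*?DB: (?P<db>\d+\.\d+)"
)


def slow_renders(lines, threshold=4.0):
    """Return (line number, total, render, db) for requests whose
    render time is at or above the threshold, in seconds."""
    slow = []
    for number, line in enumerate(lines, start=1):
        match = COMPLETED.search(line)
        if match is None:
            continue  # not a request summary line
        render = float(match.group("render"))
        if render >= threshold:
            slow.append((
                number,
                float(match.group("total")),
                render,
                float(match.group("db")),
            ))
    return slow


# Usage idea: slow_renders(open("log/production.log"))
```

The line numbers make it easy to jump back into the log and look at what else was happening around each slow request.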

I’m using the to_xml method, not rxml templates.

ed

Will, you might like to read this:

http://www.mail-archive.com/mongrel-users@rubyforge.org/msg01593.html

Although I don’t quite grok it, it seems relevant,

Vish

Have you looked at networking statistics to see if you're getting high retransmits and such?

Perhaps there's a bad piece of networking gear somewhere?
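For a quick number on retransmits on a Linux host, the TCP counters in `/proc/net/snmp` can be turned into a ratio. This is a sketch that assumes Linux and the usual two-line `Tcp:` header/value layout of that file; the function names are illustrative.

```python
def tcp_counters(snmp_text):
    """Parse the Tcp header and value lines of /proc/net/snmp into a dict."""
    rows = [
        line.split()[1:]
        for line in snmp_text.splitlines()
        if line.startswith("Tcp:")
    ]
    if len(rows) < 2:
        raise ValueError("no Tcp header/value pair found")
    names, values = rows[0], rows[1]
    return dict(zip(names, (int(value) for value in values)))


def retransmit_ratio(snmp_text):
    """Return retransmitted segments as a fraction of all segments sent."""
    counters = tcp_counters(snmp_text)
    sent = counters["OutSegs"]
    return counters["RetransSegs"] / sent if sent else 0.0


# Usage idea (Linux only): retransmit_ratio(open("/proc/net/snmp").read())
```

The counters are cumulative since boot, so comparing two readings taken a few minutes apart says more about the current state than a single reading does.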

You need to break down the performance of each component using a series of small experiments. Start from one end of the chain and run a series of tests on it going back through, then test each piece of the chain as it's connected to the next piece. Tedious but it'll force you to examine each part and will help find the problem.
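As a starting point for those experiments, something like the following could time the same request against each link in the chain (one mongrel directly, then through the balancer) and compare the spread. It is a minimal sketch using only the Python standard library; the helper names and any URLs you pass in are placeholders, not anything from this setup.

```python
import statistics
import time
import urllib.request


def fetch(url, timeout=30):
    """Fetch a URL and read the whole body, discarding it."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()


def time_calls(call, count):
    """Run call() count times and return each wall-clock duration in seconds."""
    timings = []
    for _ in range(count):
        start = time.perf_counter()
        call()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Return the min, median, and max of a non-empty list of timings."""
    return {
        "min": min(timings),
        "median": statistics.median(timings),
        "max": max(timings),
    }


# Usage idea, with placeholder URLs for one backend and for the balancer:
#   for url in (backend_url, balancer_url):
#       print(url, summarize(time_calls(lambda: fetch(url), 200)))
```

If the max only blows up when going through the balancer, the problem is in front of mongrel; if it also blows up when hitting a single mongrel directly, the network gear is probably not to blame.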

You should also check out mtr (Matt's traceroute) and run traceroutes to and from various locations. You might have something misconfigured that's dropping packets along the way.

Last thing: go grab Wireshark/Ethereal and a laptop, hook it up between different components, and capture chunks of TCP traffic between them. It's got a nice graphical front end, is really easy to use, and will even decode the TCP streams so you can see what's going on. The main things you're looking for are really insane timings (look at the timestamps on the left and find big differences) and bad packets (they show up in red/brown).
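If the capture is too long to eyeball, the timestamp check can be scripted. The sketch below assumes you have exported one epoch timestamp per packet (for example with something like `tshark -T fields -e frame.time_epoch`) and parsed them into a list of floats; `large_gaps` is a made-up helper name.

```python
def large_gaps(timestamps, threshold=1.0):
    """Return (packet index, gap in seconds) for every packet that arrived
    at least `threshold` seconds after the packet before it."""
    gaps = []
    for index in range(1, len(timestamps)):
        delta = timestamps[index] - timestamps[index - 1]
        if delta >= threshold:
            gaps.append((index, delta))
    return gaps


# Usage idea: large_gaps([float(line) for line in open("times.txt")])
```

Each index it reports points at a packet worth inspecting in the GUI, since whatever stalled happened just before it.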

Otherwise, if you don't have the tools or expertise to figure it out, I'd recommend hiring someone who's qualified/certified in your router equipment to come and look at it. Someone good could probably know right away what is wrong and it'll be loads cheaper than you doing it for the next month.