Unexpected Server Load Issue

Today I ran into an interesting server problem. We recently had a huge traffic spike and the servers didn't respond the way we expected. We thought part of the issue could be a cold Amazon ELB, but that didn't explain the load spike on the servers behind the ELB. We also poked around the machines, and while resource usage was elevated, nothing was at a critical level. Memory usage stayed fairly flat at around 40% and CPU was under 25%.

Eventually, poking around the server revealed that the current fcgid process count, 4, was far below our expected count of 20. We use Apache 2.2 along with mod_fcgid to serve our Perl-based site, and we had both DefaultMinClassProcessCount and DefaultMaxClassProcessCount set to 20. That left two questions: first, was the process count low during the spike, and second, what was resetting the processes?
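For reference, the relevant slice of our mod_fcgid configuration looked roughly like this (the surrounding context is illustrative; only the two directives and their values come from our actual setup):

```apache
<IfModule mod_fcgid.c>
  # Shrink no lower than 20 idle processes per class...
  DefaultMinClassProcessCount 20
  # ...and never grow beyond 20 either.
  DefaultMaxClassProcessCount 20
</IfModule>
```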

The first question was fairly simple to answer. We use New Relic to monitor the servers. Using the server's "Processes" tab and selecting the "perl (www-data)" process, I was able to see the process count over the past 24 hours. It turns out we were running at 4 processes before the spike, and over the next hour the count finally climbed to 20. I also noticed that it dropped again this morning.

Some further research into the meaning of DefaultMinClassProcessCount also cleared up some confusion. The setting is the minimum number of processes mod_fcgid will shrink down to when processes sit idle; it is not a number of processes spawned up front. If we never serve enough requests to require more than a few processes, the process count will stay low.
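A quick way to spot-check the live count directly on a box (the process name and user here match what New Relic showed for us; adjust both to your setup):

```shell
# Count perl processes owned by www-data (the fcgid-spawned workers).
# Process name and user are assumptions based on our New Relic view.
ps -u www-data -o comm= | grep -c '^perl'
```

Comparing this number against DefaultMinClassProcessCount makes the idle-shrink behavior easy to see.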

We were able to test this by using Apache Bench (ab) to make a batch of concurrent requests to the server. Note that ab requires a target URL; here we point it at the local vhost:

/usr/sbin/ab -n 1000 -c 20 http://localhost/ > /dev/null 2>&1

As expected, the number of processes grew to about 19 or 20 and stayed there after ab finished running. Yay!

I also noticed that the time the count shrank lined up with when logrotate runs. That helped me form a hypothesis: logrotate was resetting the process count each day (its postrotate script reloads Apache), and our normal traffic wasn't heavy enough to climb back to the full 20 processes. We forced a logrotate to check:

logrotate -f /etc/logrotate.d/apache2

The number of processes on the server went down. Proof of concept!
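To line the shrink time up with the rotation time, you can also check logrotate's state file, which records when each log was last rotated (the path below is the Debian/Ubuntu default; other distros use /var/lib/logrotate.status):

```shell
# Show the last rotation timestamps for the Apache logs.
# State file path is the Debian/Ubuntu default; adjust per distro.
grep apache2 /var/lib/logrotate/status
```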

Then we decided to add the ab call to the postrotate script in our Apache logrotate config, which now looks like this:

postrotate
  if [ -f "`. /etc/apache2/envvars ; echo ${APACHE_PID_FILE:-/var/run/apache2.pid}`" ]; then
    /etc/init.d/apache2 reload > /dev/null 2>&1
    /usr/sbin/ab -H "Host:URL" -n 1000 -c 20 http://localhost/ > /dev/null 2>&1
  fi
endscript

After the logrotate, the processes were all there as expected.

What I think happened is that the servers were running with only a handful of fcgid processes when the sudden influx of traffic hit. At least two of the servers were removed from the ELB during this period, which increased load on the remaining servers and caused a cascade. The workaround is to drive enough traffic at the servers after an Apache start/restart/reload to ensure the number of processes we expect is actually there. We're not convinced that adding this directive to logrotate is the best way to handle it; it might make more sense to add the spawn step to the apache2 init.d script instead of logrotate, so it runs regardless of why Apache restarts. But it's progress.
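If we do move it into init.d, the warm-up step could be factored into a small helper like the sketch below. The function name, host header, URL, and counts are all assumptions; -c should match your DefaultMaxClassProcessCount:

```shell
# Hypothetical warm-up helper to call after any Apache start/restart/
# reload so mod_fcgid pre-spawns its full process class before real
# traffic arrives. Host header and URL are placeholders.
warm_fcgid() {
  /usr/sbin/ab -H "Host: www.example.com" -n 1000 -c "${1:-20}" http://127.0.0.1/ > /dev/null 2>&1
}
```

Calling warm_fcgid 20 at the end of the init script's start/reload cases would cover every restart path, not just the nightly logrotate.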