New IPB Behavior Causes High Load

Well, it's more complicated than that, but I can't come up with a good title.

So. This past Sunday, our forum at http://asoiaf.westeros.org/ received unprecedented levels of traffic thanks to the latest episode of the TV show based on the novels our site is about. Big, big load. And for the most part, our setup -- lighttpd+php-fpm, with Varnish in front for guests -- handled it okay. Note that the server also hosts a popular MediaWiki install, which draws more visitors than the forum does.

But after Sunday... things got weird. I'm going to quote our valiant server host and admin:

The primary issue at present is that PHP-FPM is bumping up against memory limits; it's demonstrably the issue, as it's logging the fact. If I turn up PHP-FPM's memory allocation, MySQL starts swapping things out and causing queries to run really slowly, which causes PHP-FPM to take forever to resolve searches and thus causes PHP-FPM to run out of available helper processes. Worse, in some cases PHP-FPM seems not to be recovering. If I dump PHP to caching to disk, we don't have the helper process starvation issue but the site runs reaaaaaaaally sloooooooow.

The obvious question at this point is why PHP-FPM has suddenly become memory-bound, when it had not been for weeks beforehand; that's the root issue from which all others have been springing. (Save the stale Varnish cache, which was just a ruleset we'd never triggered before.) Something has clearly changed in the way Invision is doing things — and it's definitely the board, not the wiki — but after several days of trying to track down what has changed in how Invision's doing things — including literally rolling back every memory file to Sunday's copy to see if that improved things — I'm ready to just take the easy way out and buy some breathing room by throwing memory at PHP-FPM.
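To make the memory trade-off in that quote concrete, here is a rough back-of-the-envelope sketch. Every figure in it (the MySQL share, the OS reserve, the per-worker footprint) is a made-up placeholder rather than something measured on our server, and `max_php_workers` is just an illustrative helper name. The point is only that PHP-FPM and MySQL share one fixed pool of RAM, so raising one squeezes the other.

```python
def max_php_workers(total_ram_mb, mysql_mb, os_reserve_mb, per_worker_mb):
    """Estimate how many PHP-FPM workers fit in RAM before the box starts swapping.

    All inputs are in megabytes. This is a simple budget, not a measurement.
    """
    available = total_ram_mb - mysql_mb - os_reserve_mb
    if available <= 0:
        return 0
    return available // per_worker_mb


# Hypothetical numbers: 768 MB for MySQL, 256 MB for the OS and other
# services, 64 MB per PHP-FPM worker.
for ram_mb in (2048, 4096):
    workers = max_php_workers(ram_mb, mysql_mb=768, os_reserve_mb=256, per_worker_mb=64)
    print(f"{ram_mb} MB RAM -> room for about {workers} PHP-FPM workers")
```

With these placeholder values, going from 2GB to 4GB triples the worker headroom (16 to 48) because the fixed MySQL and OS share stays the same. Real per-worker memory would have to be read off the running processes.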

So... any ideas as to what might have happened to start leading to this memory-chewing problem? I've done no installs of mods or upgrades in the intervening time, so it's not that.

The only thing I can think of is that the database has simply gotten large enough that it takes too long for IPB to deal with it, which leads to a cascade effect as PHP processes pile up waiting for MySQL queries to finish. I can certainly prune the forum significantly -- it's long overdue -- but if there are other ideas, I'd be interested in hearing them.
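The pile-up described above can be put in rough numbers with Little's law: the average number of busy workers equals the request arrival rate times the average time each request takes. The sketch below is only an illustration; the request rate and query times are invented, and `workers_needed` is a hypothetical helper, not anything from IPB or PHP-FPM.

```python
import math


def workers_needed(requests_per_sec, ms_per_request):
    """Little's law: average busy workers = arrival rate * time per request."""
    return math.ceil(requests_per_sec * ms_per_request / 1000)


# Hypothetical load: 20 PHP requests per second.
fast = workers_needed(20, 200)    # MySQL healthy, 200 ms per request
slow = workers_needed(20, 2000)   # MySQL swapping, 2 s per request
print(f"healthy: {fast} workers busy, degraded: {slow} workers busy")
```

At the same traffic level, a tenfold slowdown in queries means ten times as many PHP processes held open at once, which is how slow MySQL alone could exhaust the worker pool and the memory behind it.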

The obvious fix right now is to get more memory installed (from 2GB to 4GB) but it bothers me that we don't really know why this is suddenly necessary.
I'm afraid I'm not sure what more information might be useful! I guess the thing is that it has been enough for weeks, and suddenly... it isn't.

That doesn't sound like normal behavior. Do we blame gremlins? We hadn't changed anything about the site configuration from one day to the next, so far as we know. Even yesterday, our slowest day, when traffic simply wasn't all that high, we had to kick PHP-FPM every once in a while because of whatever it is the forum is doing.

That said, we're planning to upgrade the RAM. Just wanted to see if there's any idea for what could be causing the memory death-spiral.

As a random stab in the dark, here's some of the configuration output from APC. I'm afraid I'm not privy to the php-fpm conf file, so I'm not sure what's set there:

Oh, I forgot to update... but I'll say you were looking in the right direction, Luis.

We decided to move to a brand new server (a much newer, more powerful one with 8GB of RAM)... and in the course of the move, our admin discovered that the disk controller was shot. It hadn't failed completely, but under heavy load it would start repeating paging actions, leading to a massive slowdown of MySQL, which would then cascade through to PHP-FPM. The piled-up processes would eat the memory, which would start off-loading to an already crippled disk...

So, that was what happened. That Sunday evening was the highest level of traffic we'd ever had, and it seems that it was too much for the failing hardware.

So, all solved.

