High webserver load (hitting 100)

Gabriel Torres · February 16, 2015

I am ignoring the flame war above. I am seasoned enough to read everything and make my own mind. Don't worry about it.

I am also moving from Xcache to Memcached, I opened a new thread with my current questions about it: http://community.invisionpower.com/topic/407306-questions-on-memcached/

@RevengeFNF we already use Percona (5.6.22 at this moment) with XtraDB. Should I upgrade to MariaDB?

Thanks for pointing out that I am running an outdated version of mysqltuner.

Rhett · February 16, 2015

Please stop guys, both of you are trying to help here, however you are both assuming and guessing on a lot of items here, the only way to really know what's going on here is for a systems admin to look at the server directly.

We all appreciate that you guys are trying to help here, however jabbing at each other isn't helping anyone.

Rhett · February 16, 2015

Gabrial, the differences between Mariadb and Percona are not going to be enough to warrant the trouble changing imo, I use both, and they are both performing great.

Makoto · February 16, 2015

Please stop guys, both of you are trying to help here, however you are both assuming and guessing on a lot of items here, the only way to really know what's going on here is for a systems admin to look at the server directly.

I am not wildly guessing at anything, nor am I telling him to wildly increase values without knowing what the consequences of those changes will be. This is why I also offered to look into the issue myself.

You are plenty experienced yourself, I find it hard to believe you don't share some amount of sentiment with my frustration, even if you have far more patience than I do and professionality not to openly admit or display it. (I'm not going to continue this here, I'll drop it completely after this and I really do apologize for the drama at least.)

RevengeFNF we already use Percona (5.6.22 at this moment) with XtraDB. Should I upgrade to MariaDB?

The two are pretty comparable to one another. It's mostly going to boil down to personal preference.

Gabriel Torres · February 16, 2015

Thanks @Rhett and @Kirito for the feedback.

Gabriel Torres · February 20, 2015

Since my last post I upgraded to PHP 5.6.5 and replaced Xcache with memcached for var cache. Everything was working fine for two days... Until now... See attached screenshot. What is strange is that it is right now 10:20 pm in Brazil, definitely not caused by normal traffic... Running some tests here and will keep you posted...

RevengeFNF · February 20, 2015

If you are trying everything and you cannot resolve the problem(im guessing you already tried the new configurations for the php-fpm), i would try changing to Apache.

One thing you know, that situation cannot continue.

Gabriel Torres · February 20, 2015

RevengeFNF

I haven't had tried changing php-fpm.conf as everything was working fine after I upgraded PHP.

I am playing with php-fpm configuration as I write this. Load has dropped to 23 at this time. listen.backlog was not configured and therefore at its default of 128. Changed to -1 and seems to have helped.

Makoto · February 20, 2015

It's starting to really sound like this could be some kind of network (DDoS) attack.

Can you access the PHP-FPM status page and see how many active processes are spawned?

Gabriel Torres · February 20, 2015

@Kirito that is my best bet at the moment. Will enable the php-fpm status page and will let you know.

RevengeFNF · February 20, 2015

@Gabriel Torres run this command:

netstat -ntu | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -n

Gabriel Torres · February 20, 2015

@RevengeFNF thanks. I don't see any particular IP doing a lot of requests. However, the tip Kirito gave me regarding enabling php-fpm status was the best so far, as we can see "inside" php-fpm to see exactly what is going on... It should help me out. Thanks.

Makoto · February 20, 2015

Append "?full" to the end of the status URL (i.e. /status?full) as well, this will show you what the request duration, URI, memory footprint and etc. is for every individual request.

This will give you a lot more to go on if there's a runaway script or anything of the sort causing your server to choke.

Is PHP-FPM the only process on your server behaving erratically?

Gabriel Torres · February 20, 2015

Kirito, I sent you a PM with the link to the php-fpm status on our server, so you can see it by yourself.

And yes, as you can see from the top screenshot, php-fpm is the only process reaching 100% CPU load.

Gabriel Torres · February 20, 2015

There is something wrong... Look now... What is strange is there is no php-fpm process reaching 100% load anymore...

Edit: this particular scenario was caused by the server reaching its pm.max_children limit. I increased it. Let's see what happens.

Makoto · February 20, 2015

This is why I asked if PHP-FPM was the only process acting erratically.

Have you tested the HDD on your server to make sure it's operating normally? Use the smartmontools to check the status of your servers drives if you have't already, if anything just to rule it out.

smartctl -t long /dev/<device>

Do the server loads remain high even after stopping PHP-FPM?

Nothing else has been acting up on the server? No crashes or anything of the sort?

Gabriel Torres · February 20, 2015

No, if I stop php-fpm, load drops steadly.

My best guess at the moment is that we were receiving a lot of bot/crawler requests at the same time, maxing out the number of process available, and acting as a DDoS attack. We had this problem before, and what I did was to blacklist offending crawlers, such as Baidu. I will add something to our robots.txt to slow down Google to see if it helps.

I had pm.max_children at 120 before, increased to 256 and it was maxed out at the last screenshot (load at 200). Increased again to 512 and load has decreased to 38 at the moment. Still very high.

Makoto · February 20, 2015

Possibly.

FWIW I don't think setting listen.backlog to -1 actually works. It's documented that it removes the listen queue length, but in my experience it actually doesn't. On the page you linked me in a PM, it still says your listen queue length is 128.

I have this increased to 512 on my forum but ideally you shouldn't be hitting the backlog anyways. If your backlog queue maxes out PHP-FPM should just start refusing connections.

I don't believe Google should be causing you problems. Baidu is extremely spammy, so I can understand it causing issues, but not Google.

You shouldn't be needing to double and re-double pm.max_children like that though. It really doesn't seem like this could be caused by legitimate traffic.

Gabriel Torres · February 20, 2015

Kirito,

I believe that I found the culprit, a bot called Boardreader. After blocking their IP ranges, load dropped to 7, and it is 9 right now. As mentioned, I saw this happening before, but since I had already banned Baidu from here, I was pretty sure that it might be a configuration issue at our end.

Anyway, it was a good experience, as you helped me a lot, and your tutorial was very useful.

You seem to be right about listen.backlog, as I saw the same thing there.

I will keep monitoring for the next few days and will post a follow-up here.

Thanks,

Gabriel.

Gabriel Torres · February 25, 2015

@Kirito

Just a follow-up: the listen.backlog directive is limited by the operating system's net.core.somaxconn directive, which defaults to 128. If you want to change that, you have to add the following to /etc/sysctl.conf:

net.core.somaxconn = 1024

For this to take effect, you must run:

sysctl -p

More information: http://tweaked.io/guide/kernel/

Grumpy · March 5, 2015

I'm totally late to the party, but I wanted to add an explanation piece for the future.

Increasing number of children for php-fpm will increase performance as it means there are more workers of the task. However, increasing beyond your system capabilities put you in an unstable situations and can cause a "freeze" in the system making ultimate performance a lot worse than it should be. The freezing effect occurs because every process attempts to retrieve a resource but is interrupted by another getting the same resource. So rather than increasing performance by having more workers, you introduce traffic by adding more workers. The solution is to find and resolve the bottleneck that's causing the backlog, not add more workers. Adding workers will only be significantly beneficial if you run into a bottleneck of lacking workers. For this reason, you should never allow load to go up as high as 200 (active process count) by disallowing so many to be spawned to begin with.

I had pm.max_children at 120 before, increased to 256 and it was maxed out at the last screenshot (load at 200). Increased again to 512 and load has decreased to 38 at the moment. Still very high.

I believe that event only helped you because by the act of restarting php-fpm to change the max number of children, the status of the bottlenecked resource was changed because killing all the children makes them let go of that resource.

Determining what the maximum children is good is non-trivial and requires testing. Just outright raising is a very risky move and I would strongly suggest never doing that again because I value stability of the server over squeezing a few more drops out of it. If you're at the point where squeezing few more percentage out of the machine is important, you probably should be upgrading or adding more hardware rather than introducing risk of overload to your system.

Personally, in a dedicated machine for php-fpm only (nothing else), I would suggest thread (I want to ignore thread vs true core argument for now... and refer to thread) count * 2 as the max children. If there are other services, they should be reduced according to how much of a chunk each other process is going to take. What it should be... I don't think I can summarize in one post nor would I attempt to. But the basic logic is that a single thread can only work on one thing at a time. So, you get n = thread count. But that worker might be waiting on something, like disk for example, then it should work something else that's not waiting. So, I get n = thread count * 2. If they're both waiting, why not add a third? Because if they are both waiting, they're probably waiting on the same thing since you probably hit some sort of bottleneck that causes a queue of waiting. The 3rd will probably want the same thing as well, it's the same process after all, and adding a 3rd worker waiting on the same thing adds no value.

Gabriel Torres · March 5, 2015

Thanks @Grumpy

In my case, it seems it was indeed a batch of bad bots crawling ou forum at the same time. Blocked their IPs and haven't had issues since. Thanks!

High webserver load (hitting 100)

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Archived

Recently Browsing 0 members

Upcoming Events

Trending Content