High webserver load (hitting 100)

February 16, 2015

Author

I am ignoring the flame war above. I am seasoned enough to read everything and make up my own mind. Don't worry about it.

I am also moving from Xcache to Memcached, and I opened a new thread with my current questions about it: http://community.invisionpower.com/topic/407306-questions-on-memcached/

@RevengeFNF we already use Percona (5.6.22 at this moment) with XtraDB. Should I upgrade to MariaDB?

Thanks for pointing out that I am running an outdated version of mysqltuner.

February 16, 2015

Please stop, guys. Both of you are trying to help here, but you are both assuming and guessing on a lot of items. The only way to really know what's going on is for a systems admin to look at the server directly.

We all appreciate that you guys are trying to help, but jabbing at each other isn't helping anyone.

February 16, 2015

Gabriel, the differences between MariaDB and Percona are not going to be enough to warrant the trouble of changing, IMO. I use both, and both are performing great.

February 16, 2015

Please stop, guys. Both of you are trying to help here, but you are both assuming and guessing on a lot of items. The only way to really know what's going on is for a systems admin to look at the server directly.

I am not wildly guessing at anything, nor am I telling him to wildly increase values without knowing what the consequences of those changes will be. This is why I also offered to look into the issue myself.

You are plenty experienced yourself. I find it hard to believe you don't share some of my frustration, even if you have far more patience than I do and the professionalism not to openly admit or display it. (I'm not going to continue this here. I'll drop it completely after this, and I really do apologize for the drama.)

RevengeFNF we already use Percona (5.6.22 at this moment) with XtraDB. Should I upgrade to MariaDB?

The two are pretty comparable to one another. It's mostly going to boil down to personal preference.

February 16, 2015

Author

Thanks @Rhett and @Kirito for the feedback.

February 20, 2015

Author

Since my last post I upgraded to PHP 5.6.5 and replaced Xcache with memcached for var cache. Everything was working fine for two days... Until now... See attached screenshot. What is strange is that it is 10:20 pm in Brazil right now, so this is definitely not caused by normal traffic... Running some tests here and will keep you posted...

February 20, 2015

If you have tried everything and cannot resolve the problem (I'm guessing you already tried the new configurations for php-fpm), I would try changing to Apache.

One thing is certain: that situation cannot continue.

February 20, 2015

Author

RevengeFNF

I hadn't tried changing php-fpm.conf, as everything was working fine after I upgraded PHP.

I am playing with the php-fpm configuration as I write this. Load has dropped to 23 at this time. listen.backlog was not configured and was therefore at its default of 128. Changing it to -1 seems to have helped.

February 20, 2015

It's starting to really sound like this could be some kind of network (DDoS) attack.

Can you access the PHP-FPM status page and see how many active processes are spawned?

February 20, 2015

Author

@Kirito that is my best bet at the moment. Will enable the php-fpm status page and will let you know.

February 20, 2015

@Gabriel Torres run this command:

netstat -ntu | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -n
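For anyone who would rather do the same tally in a script, here is a minimal Python sketch of what that pipeline computes: connections counted per remote address. The function name and the sample text are made up for illustration (the sample uses documentation-only IP ranges), and unlike the shell version it lists the busiest address first.

```python
from collections import Counter


def connections_per_ip(netstat_output: str) -> list[tuple[str, int]]:
    """Count connections per remote address in `netstat -ntu` style text."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows start with the protocol name and carry the foreign
        # address in the fifth column; header rows are skipped.
        if len(fields) < 5 or not fields[0].startswith(("tcp", "udp")):
            continue
        # Drop the port by splitting on the last colon only.
        ip = fields[4].rsplit(":", 1)[0]
        counts[ip] += 1
    # Busiest remote address first.
    return counts.most_common()


# Hypothetical sample output, not taken from the server in this thread.
SAMPLE = """\
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 203.0.113.10:80         198.51.100.7:51234      ESTABLISHED
tcp        0      0 203.0.113.10:80         198.51.100.7:51240      TIME_WAIT
tcp        0      0 203.0.113.10:80         192.0.2.44:40022        ESTABLISHED
"""

for ip, count in connections_per_ip(SAMPLE):
    print(count, ip)
```

A single address with a count far above the rest would point to one abusive client. A flat distribution, as reported in the next post, does not rule out a crawler that spreads its requests over many IPs.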

February 20, 2015

Author

@RevengeFNF thanks. I don't see any particular IP doing a lot of requests. However, the tip Kirito gave me regarding enabling php-fpm status was the best so far, as we can see "inside" php-fpm to see exactly what is going on... It should help me out. Thanks.

February 20, 2015

Append "?full" to the end of the status URL (i.e. /status?full) as well. This will show you the request duration, URI, memory footprint, etc. for every individual request.

This will give you a lot more to go on if there's a runaway script or anything of the sort causing your server to choke.

Is PHP-FPM the only process on your server behaving erratically?
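To make the "?full" tip concrete, here is a rough Python sketch that picks out the longest-running requests from the status page when it is requested as JSON (as far as I know, /status?json&full). The field names used here ("processes", "state", "request duration", "request uri") are what I understand the JSON output to contain, so treat them as assumptions and check them against your own status page. The sample data is invented.

```python
import json


def slowest_requests(status_json: str, limit: int = 5) -> list[dict]:
    """Return the non-idle workers with the longest request duration."""
    status = json.loads(status_json)
    # Assumed layout: a top-level "processes" list, one dict per worker.
    busy = [p for p in status.get("processes", []) if p.get("state") != "Idle"]
    busy.sort(key=lambda p: p.get("request duration", 0), reverse=True)
    return busy[:limit]


# Hypothetical status payload, not taken from the server in this thread.
SAMPLE_STATUS = json.dumps({
    "pool": "www",
    "active processes": 2,
    "processes": [
        {"pid": 101, "state": "Running", "request duration": 1200,
         "request uri": "/index.php?app=core"},
        {"pid": 102, "state": "Idle", "request duration": 300,
         "request uri": "/index.php"},
        {"pid": 103, "state": "Running", "request duration": 98000,
         "request uri": "/index.php?app=forums"},
    ],
})

for proc in slowest_requests(SAMPLE_STATUS):
    print(proc["pid"], proc["request duration"], proc["request uri"])
```

If the same URI keeps showing up at the top of that list, that is a good candidate for the runaway script.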

February 20, 2015

Author

Kirito, I sent you a PM with the link to the php-fpm status on our server, so you can see it for yourself.

And yes, as you can see from the top screenshot, php-fpm is the only process reaching 100% CPU load.

February 20, 2015

Author

There is something wrong... Look now... What is strange is there is no php-fpm process reaching 100% load anymore...

Edit: this particular scenario was caused by the server reaching its pm.max_children limit. I increased it. Let's see what happens.

February 20, 2015

This is why I asked if PHP-FPM was the only process acting erratically.

Have you tested the HDD on your server to make sure it's operating normally? Use smartmontools to check the status of your server's drives if you haven't already, if only to rule it out.

smartctl -t long /dev/<device>

Do the server loads remain high even after stopping PHP-FPM?

Nothing else has been acting up on the server? No crashes or anything of the sort?

February 20, 2015

Author

No, if I stop php-fpm, load drops steadily.

My best guess at the moment is that we were receiving a lot of bot/crawler requests at the same time, maxing out the number of processes available, and acting as a DDoS attack. We had this problem before, and what I did was to blacklist offending crawlers, such as Baidu. I will add something to our robots.txt to slow down Google to see if it helps.

I had pm.max_children at 120 before, increased to 256 and it was maxed out at the last screenshot (load at 200). Increased again to 512 and load has decreased to 38 at the moment. Still very high.

February 20, 2015

Possibly.

FWIW I don't think setting listen.backlog to -1 actually works. It's documented that it removes the listen queue length, but in my experience it actually doesn't. On the page you linked me in a PM, it still says your listen queue length is 128.

I have this increased to 512 on my forum, but ideally you shouldn't be hitting the backlog anyway. If your backlog queue maxes out, PHP-FPM should just start refusing connections.

I don't believe Google should be causing you problems. Baidu is extremely spammy, so I can understand it causing issues, but not Google.

You shouldn't be needing to double and re-double pm.max_children like that though. It really doesn't seem like this could be caused by legitimate traffic.

February 20, 2015

Author

Kirito,

I believe that I found the culprit, a bot called Boardreader. After blocking their IP ranges, load dropped to 7, and it is 9 right now. As mentioned, I saw this happening before, but since I had already banned Baidu from here, I was pretty sure that it might be a configuration issue at our end.

Anyway, it was a good experience, as you helped me a lot, and your tutorial was very useful.

You seem to be right about listen.backlog, as I saw the same thing there.

I will keep monitoring for the next few days and will post a follow-up here.

Thanks,

Gabriel.

February 25, 2015

Author

@Kirito

Just a follow-up: the listen.backlog directive is limited by the operating system's net.core.somaxconn directive, which defaults to 128. If you want to change that, you have to add the following to /etc/sysctl.conf:

net.core.somaxconn = 1024

For this to take effect, you must run:

sysctl -p

More information: http://tweaked.io/guide/kernel/
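To illustrate why the status page kept showing a queue length of 128, here is a small Python sketch of the capping rule as I understand it on Linux: the kernel limits whatever backlog the application asks for to net.core.somaxconn, and a negative value such as -1 ends up at that same limit. The function names are my own, and the /proc path is the usual Linux location for this setting.

```python
from pathlib import Path

# Usual Linux location of the net.core.somaxconn value.
SOMAXCONN_PATH = Path("/proc/sys/net/core/somaxconn")


def effective_backlog(requested: int, somaxconn: int) -> int:
    """Backlog the kernel would actually use for a requested listen backlog."""
    # Assumption: values above the limit, and negative values like -1,
    # are both capped to somaxconn.
    if requested < 0 or requested > somaxconn:
        return somaxconn
    return requested


def read_somaxconn(path: Path = SOMAXCONN_PATH) -> int:
    """Read the current limit from /proc (Linux only)."""
    return int(path.read_text().strip())


print(effective_backlog(-1, 128))    # listen.backlog = -1 with the default limit
print(effective_backlog(512, 128))   # a larger request, still capped
print(effective_backlog(512, 1024))  # after raising net.core.somaxconn
```

So with the default limit, both -1 and 512 behave like 128. Only after raising net.core.somaxconn does a larger listen.backlog make a difference.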

March 5, 2015

I'm totally late to the party, but I wanted to add an explanation piece for the future.

Increasing the number of children for php-fpm will increase performance, as it means there are more workers for the task. However, increasing it beyond your system's capabilities puts you in an unstable situation and can cause a "freeze" in the system, making overall performance a lot worse than it should be. The freezing effect occurs because every process attempts to retrieve a resource but is interrupted by another process going after the same resource. So rather than increasing performance by having more workers, you create a traffic jam by adding more workers. The solution is to find and resolve the bottleneck that's causing the backlog, not to add more workers. Adding workers will only be significantly beneficial if the bottleneck is a lack of workers. For this reason, you should never allow load to go as high as 200 (active process count); prevent that many processes from being spawned to begin with.

I had pm.max_children at 120 before, increased to 256 and it was maxed out at the last screenshot (load at 200). Increased again to 512 and load has decreased to 38 at the moment. Still very high.

I believe that change only helped you because restarting php-fpm to change the maximum number of children also changed the status of the bottlenecked resource: killing all the children makes them let go of that resource.

Determining a good value for the maximum number of children is non-trivial and requires testing. Just outright raising it is a very risky move, and I would strongly suggest never doing that again, because I value the stability of the server over squeezing a few more drops out of it. If you're at the point where squeezing a few more percent out of the machine is important, you should probably be upgrading or adding more hardware rather than introducing the risk of overloading your system.

Personally, on a machine dedicated to php-fpm only (nothing else), I would suggest thread count * 2 as the max children. (I want to ignore the thread vs. true core argument for now and just refer to threads.) If there are other services, the number should be reduced according to how big a chunk each of those processes is going to take. What it should be exactly, I don't think I can summarize in one post, nor would I attempt to. But the basic logic is that a single thread can only work on one thing at a time. So you get n = thread count. But that worker might be waiting on something, like disk for example, in which case the thread should work on something else that's not waiting. So I get n = thread count * 2. If they're both waiting, why not add a third? Because if they are both waiting, they're probably waiting on the same thing, since you have probably hit some sort of bottleneck that causes a queue of waiting. The third will probably want the same thing as well (it's the same process, after all), and adding a third worker waiting on the same thing adds no value.
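As a rough illustration of that rule of thumb, here is a short Python sketch. The function name and the php_share parameter (the fraction of the machine you are willing to give php-fpm when other services run alongside it) are my own inventions for the example. It gives a starting point for testing, not a tuned value.

```python
import os


def suggested_max_children(threads: int, php_share: float = 1.0) -> int:
    """Starting point for pm.max_children: thread count * 2, scaled down
    by the share of the machine left for php-fpm."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    if not 0 < php_share <= 1:
        raise ValueError("php_share must be in (0, 1]")
    # Never suggest fewer than one worker.
    return max(1, int(threads * 2 * php_share))


# Dedicated php-fpm box with 8 hardware threads.
print(suggested_max_children(8))
# Same box, but roughly half of it goes to the database and web server.
print(suggested_max_children(8, php_share=0.5))
# Using the thread count of the machine this runs on.
print(suggested_max_children(os.cpu_count() or 1))
```

Compared with the 256 and 512 tried earlier in the thread, this lands much lower, which is the point: the cap keeps the number of runnable processes close to what the hardware can actually serve.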

March 5, 2015

Author

Thanks @Grumpy

In my case, it seems it was indeed a batch of bad bots crawling our forum at the same time. Blocked their IPs and haven't had issues since. Thanks!
