Jump to content

High loads


Recommended Posts

Posted

Hi,

We've seen an unexplained increase of our server load out of the blue.

Our forum is targeted to a specific country; right now it is 3:43 in the morning and I am seeing loads above 10. Usually, at this time, load would stay at 3, tops.

With top, I can see a few processes that are using 100% CPU. However, I don't know how to investigate exactly what they are doing, i.e. how to find what exactly is shooting load through the roof. Please see attached screenshot.

Our webserver is a dedicated server (MySQL is running on another dedicated server) with 32 GiB RAM, SSDs in RAID0, and for CPU is has 2x Xeon E5-2430, for a total of 24 processing cores. The connection to the MySQL server is done through a private dedicated Gigabit connection.

Running nginx with php-fpm.

The problem is that yesterday we saw during business hours load peaking over 50...

Thanks,

Gabriel

post-35164-0-36832400-1405666128_thumb.g

Posted

Have you tried turning it off & on again?

Seriously, php-fpm to be specific. Does it at least help with the problem temporarily? If yes, have you tried lowering php-fpm's max serve count? Like serve 1000 times and restart child.

Not related to the issue, but...

> SSDs in RAID0

ಠ_ಠ

Posted

Yes, I restarted php-fpm at least three times today, at no avail.

I found this at the php-fpm.log:
[18-Jul-2014 02:29:30] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 22 idle, and 36 total children
Current php-fpm config:
pm = dynamic
pm.max_children = 2048
pm.start_servers = 128
pm.max_spare_servers = 256

NOTE: start_server was at 5 and max_spare_servers was at 50, I increased after noticing the above message. The screenshot was taken after changing these parameters (approximately after 1:30 H).
I enabled pm.status_path = /status, however when I try opening myforum.com/status, it gives a 404 error.
Thanks for your prompt reply.
Posted

PS: Just checked the log file again and even with the highly increased numbers I still get:

[18-Jul-2014 03:16:17] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 127 idle, and 135 total children

[18-Jul-2014 03:16:18] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 16 children, there are 127 idle, and 136 total children

[18-Jul-2014 03:17:14] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 125 idle, and 152 total children

I have lots of idle children yet it complais that the [pool www] is busy...

Posted

Your max child count is too high. In fact, it's FAR too high. 2048 children is completely unreasonable. Roughly speaking, more than 5x of your core count is likely no longer helpful (lol graph). By calc, >500 is dangerous for you. If you find yourself having to raise further, your problem is else where. Lower the child count, the more stable your server gets (from server p.o.v.). Higher the child count, higher performance you'll get. But once you go past the point your hardware can handle, you can't get any more performance, you're only increasing probability of running into server doing crazy things. I would suggest you revert back to 5/50. Even a static of 50 would be fine too. There's really no need for dynamic for you since you have so much ram to spare.

It looks to me like your PHP processes have issues ending. It's running forever, or simply too long. This is why you need more and more children. Processes should end quickly and then get on with a new one. You need to diagnose your php script... something that's preventing php from ending. Unfortunately, it's not something I can realistically help over a forum. Only easy method of diagnosis I can suggest is New Relic Trial option... Rest gets too technical.

More immediately, you should change your timeouts. Go to your php.ini and lower your max execution time. A lot. Down to like 3 seconds. At least while you're having this problem. You have no page that should take more than 3 seconds to process, right? So that whatever is the source of this problem is cut down to 3 seconds. If that helps, I bought you time. Time to look for your php problem that's doing this.

Some additional look out points:

  • Check your network connections. See if you have higher than normal.
  • Check your access logs. See if you have increase in weird queries. This is lot of scouring... not very efficient in terms of man-hour to effect. But regardless a meaningful one. Maybe http ddos?
  • Check your mysql connections. See if higher than normal.
  • Check your memcache connections, if applicable.
  • Check to see if your PHP is making any external connections. Like someone else's API, namely things like facebook, twitter, anywhere you pull/push data from/to, etc. See if the connection / response time to those stuff has changed.
  • Check to see if you have any new hooks/plugins/etc. Try disabling. Even if not new, try disabling all non-essentials anyway.

P.S. You didn't answer my last questions.

Posted

Thanks a lot for all the information. I will have to read it slowly to check all points you raised.

As for your questions, you asked this:

Seriously, php-fpm to be specific. Does it at least help with the problem temporarily? If yes, have you tried lowering php-fpm's max serve count? Like serve 1000 times and restart child.

And I answered this:

Yes, I restarted php-fpm at least three times today, at no avail.

So I don't know which questions I didn't answer.

Posted

How many connections do you have when the load it high?

What kind of connections? MySQL connections?

From mysqltuner.pl:

Highest usage of available connections: 54% (1623/3000)

Right now it is 11:00 PM on a friday night in Brazil, currently around 13 connections. Of course, this is not our peak time.

Thanks.

Posted

I was referring to web connections, it could be as simple as an attack on your server or site, could be bots, could be a small ddos.

run this from ssh

ps aux | grep httpd | wc -l
Posted

Based on your advice, changed to:

pm = static
pm.max_children = 120 (5x 24 cores, as suggested)
max_execution_time for some reason was incorrectly set to 300. Changed that to 3 as suggested. Load didn't drop.
Experimented disabling all third-party hooks and applications. Load dropped from 7~8 to 4~5:
 load average: 5.34, 5.79, 7.25
As for your other questions:
  • Check your network connections. See if you have higher than normal.

A: I'd say it is normal for the week. See our MRTG graph below.

post-35164-0-72556100-1405738162_thumb.p

  • Check your access logs. See if you have increase in weird queries. This is lot of scouring... not very efficient in terms of man-hour to effect. But regardless a meaningful one. Maybe http ddos?

A: I will have to investigate this further.

  • Check your mysql connections. See if higher than normal.

A: Right now, it is not higher than normal.

  • Check your memcache connections, if applicable.

A: No memcache here.

  • Check to see if your PHP is making any external connections. Like someone else's API, namely things like facebook, twitter, anywhere you pull/push data from/to, etc. See if the connection / response time to those stuff has changed.

A: I will have to investigate this further.

  • Check to see if you have any new hooks/plugins/etc. Try disabling. Even if not new, try disabling all non-essentials anyway.
A: Answered above.
I will have to check the load during our peak time and let you know. Meanwhile, I will check the other points you mentioned. As a reference, I am attaching some MRTG graphs.
This is our yearly graph. We moved from vBulletin 3.8.7 to IPB in January 2014. It is very clear that IPB uses much more CPU resources:
post-35164-0-98030700-1405738401_thumb.p
This is the graph for the past 30 days:
post-35164-0-97670700-1405738472_thumb.p
And this is the graph for the past 7 days. Here you can see how our load increased from wednesday on:
post-35164-0-57358200-1405738502_thumb.p
Thank you all for all the tips and help.
Posted

Rhett, we use nginx, not Apache. I don't know if running ps ax | grep nginx | wc -l represents the same (it returns 26, BTW).

I am highly inclined to believe this is being caused by bots. I will investigate this further analyzing the logs. I will keep you posted.

Posted

Yep. It seems this is being caused by Mr. Google Bot, Mr. Bing and Mr. Baidu crawling our forums at the same time. I can't confirm this 100% as on Google Webmaster Tools the latest date available for crawl analysis is of 07/13, so I will have to wait about a week to confirm this. What I am seeing on my logs is anywhere from 5-10 hits per second by Google, 3-5 hits per second by Bing, and 5-6 hits per second by Baidu. F--- Baidu!

I believe that Baidu is the cause of the issues we are facing.

http://community.invisionpower.com/topic/352416-robotstxt-and-bad-robots/

I am adding them to our firewall and let you know the results... I should have suspected Baidu... The worst search engine ever...

Posted

If you suspect certain IP's are abusing your connection, try running

netstat -ntu | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -n | tail

This will list number of connections made per IP, in order. It will only include the highest 10. If you want more, add -n 50 or something at the end.

netstat -an | grep :80 | wc -l

is what I meant by network connections. Not bandwidth. Typically, abuses coming through http will be low on bandwidth, so it doesn't reflect very well on bandwidth graphs.

If you want to check who's hitting your pages the most via log instead:

tail /var/log/nginx/access.log -n 100000 | awk '{print $1}' | cut -d: -f1 | sort | uniq -c | sort -n | tail -n 50
tail /var/log/nginx/error.log -n 100000 | awk '{print $14}' | cut -d: -f1 | sort | uniq -c | sort -n | tail -n 50

First for access log and 2nd for error log. I should have the #14th column right... unless you have different version/non-default error output. Though, because error log may be more dynamic, it might not always be the 14th. This code will look at your 50 highest visitors based on last 100k sample. Check their IP. See who they are. You can increase the first to higher sample size, but something too big takes too long to process.

Though, it doesn't answer the cause of a problem,

> Experimented disabling all third-party hooks and applications. Load dropped from 7~8 to 4~5:

shows a signficant drop in load by disabling hooks. I would consider removing some hooks that are too heavy or reprogram them. A hook that eats as much as IPB itself sounds something poorly programmed. Like load of ~1 is likely eaten by the nginx itself from connections (can't help it, people don't download instantly). So, interpretively, we could say load went from 6~7 down to 3~4 which is almost halving the load. Hooks taking 50% of the processing power certainly isn't welcoming in my opinion.

> So I don't know which questions I didn't answer.

Well. There were 2 questions in that quote. 1 if it helps temporarily, not over long time. If you answered in terms of temporarily, my mistake.

To the second question of what your pm.max_requests is unanswered. If unset, I suggest you set it now to like a 1000. Unset is very bad, because default is 0, which is unlimited. It's problematic because it doesn't handle possible leaks at all.

============

Do note that the very act of setting php-fpm to a higher static number will increase your load #. Because there really is more things. Like your past normal low might be 3 before, but it may now become 4 with 100 php-fpm count. Though, that doesn't really mean anything, just be sure to to misinterpret relative change in the load value.

Posted

Well, they do... i tried this morning to block Yandex and Baidu through robots and the number of guests decreased from 100 to 40 within 15 minutes.

I was amazed.

Posted

Hello,

I added the Baidu IP range to our Firewall and the problem is gone! Load dropped to normal levels now (around 4 right now). On the MRTG graphs the drop in server load is impressive.

On the graph below you can see the server load while Baidu was crawling our forums between sunday and friday. Friday night I added the rule to our firewall, see now the server load for this saturday:

post-35164-0-46776700-1405809003_thumb.p

Thank you all for your help!

Grumpy: I will read the other recommendations you gave and try them out, and I will post here the results.

I will keep an eye today on the access log to see if I catch any other bad bot.

I also updated the "Spider List" from IPB so now my install recognizes correctly Baidu, Yandex, Trendiction and some other bots that crawls our forums.

Gabriel.

Posted

Grumpy, thanks for all your advice. I believe the problem is now solved. It seems everything was being caused by Baidu. Load is back to normal. I will keep an eye on this during this week and will post here the final conclusion.

As for pm.max_requests it was blank, leaving it at its default of 0. I've changed it to 1000 as suggested, thanks.

Also thanks for the command-line commands, they are very handy and I will bookmark this page for future reference.

Posted

Back to the "drawing board"...

Load skyrocket again today...

Load is at 20~25... Seeing around 1980 connections

Last night load was around 80...

My best bet at the moment is Bing, as it is massively crawling our forum.

What is really strange is that this increase in load is not accompanied by any increase in traffic (bandwidth), quite the contrary. See the chart below: the top one is the CPU load and the bottom one is the network traffic.

post-35164-0-53981900-1406032269_thumb.g

I also need advice regarding RAM, we have 32 GiB, I am thinking of adding 64 GiB total to see if it helps... How to know the ideal amount of RAM? I don't see any HDD swap happening right now.

Also, I see this on my php-fpm.log (I don't see the errors I reported before anymore):

[22-Jul-2014 06:53:00] NOTICE: [pool www] child 22699 started
[22-Jul-2014 06:53:07] NOTICE: [pool www] child 10698 exited with code 0 after 14417.150574 seconds from start
[22-Jul-2014 06:53:07] NOTICE: [pool www] child 22704 started
[22-Jul-2014 06:54:19] NOTICE: [pool www] child 10781 exited with code 0 after 14368.727134 seconds from start
[22-Jul-2014 06:54:19] NOTICE: [pool www] child 22761 started
[22-Jul-2014 06:54:23] NOTICE: [pool www] child 10699 exited with code 0 after 14492.626564 seconds from start
[22-Jul-2014 06:54:23] NOTICE: [pool www] child 22765 started
[22-Jul-2014 06:54:30] NOTICE: [pool www] child 10501 exited with code 0 after 14713.425983 seconds from start
[22-Jul-2014 06:54:30] NOTICE: [pool www] child 22772 started
[22-Jul-2014 06:54:37] NOTICE: [pool www] child 10435 exited with code 0 after 14802.885335 seconds from start
[22-Jul-2014 06:54:37] NOTICE: [pool www] child 22775 started
[22-Jul-2014 06:55:50] NOTICE: [pool www] child 10675 exited with code 0 after 14607.918562 seconds from start
[22-Jul-2014 06:55:50] NOTICE: [pool www] child 22854 started
[22-Jul-2014 06:56:36] NOTICE: [pool www] child 10569 exited with code 0 after 14783.707863 seconds from start
[22-Jul-2014 06:56:36] NOTICE: [pool www] child 22897 started
Posted

If you don't see any swap, you don't need more ram. If you have free ram (even with cache + buffer included), adding ram does nothing.

--------------------------------

We don't know if your new issue is completely separate or same. So, I suggest you start from the beginning.

1. Check what is causing the load. See `top`, order by processes since that seems to be the one being consumed. The graph you post shows a nice contrast from before and now, but it lacks far too much information. top command is what you want as a overview at any time. If you find the live top annoying to copy/paste, type `top -n 2` then it'll show the 2nd iteration and stop. First iteration usually isn't right "now". Or you can just press q at any time you want it to stop.

2. If php-fpm, run the commands I gave you in the last post to see exactly who is hammering your site rather than guessing. Then you proceed to solutions.

The notice you are getting in php-fpm.log is normal. It's a result of setting pm.max_requests. Assuming you set it to a 1000, it shows that your child exits after ~14000 seconds. Which roughly means one child serves one request in 14 seconds on average.

Posted

Thanks. Traffic seems to be normal again right now. Running the commands you gave me yesterday when load was ~20, the top IPs were all from Bing, so that is why I was suspecting on Bing. I added a crawl-delay for Bing on my robots.txt. I will keep an eye on this and if the server load jumps above 5 again I will perform a more in-depth diagnostics as suggested. BTW. When the load is high, the top command shows php-fpm as the one causing all the load.

Archived

This topic is now archived and is closed to further replies.

  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...