why is it so difficult to troubleshoot root causes of high server load?

Posted February 15, 2022

As the topic title asks, why is it so difficult to troubleshoot root causes of high server load?

My server load normally sits around 15, then spikes all the way up to 50, 60, or even 80.

Then it goes back down after a while.

However, it's really difficult to troubleshoot.

There must be a better way to isolate the issue? Any tips?

Thank you

February 15, 2022

How are you troubleshooting today? What's the way you're looking to improve?

February 15, 2022

Author

How are you troubleshooting today? What's the way you're looking to improve?

using TOP

February 15, 2022

Community Expert

Top is a good start. What is the biggest offender? (I’m guessing it’s either Apache or MySQL.)

Is it CPU or memory constrained?

Have you looked at the server connections? (netstat -plant)

Have you looked at the traffic logs and tried correlating traffic with load?
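To put rough numbers on those questions, here is a minimal sketch that tallies task states and TCP connection states. On Linux the load average counts tasks that are runnable (R) and tasks in uninterruptible sleep (D, usually waiting on disk), so a pile of D-state tasks points at I/O rather than CPU. The function names are made up for illustration, and the sketch assumes the procps `ps` and net-tools `netstat` found on most Linux servers.

```shell
#!/bin/sh
# Tally task states. Input: one `ps` state string per line (e.g. "R", "Ss", "D<").
# Only the first character is the state itself; the rest are modifier flags.
count_states() {
    awk 'NF { n[substr($1, 1, 1)]++ } END { for (s in n) print s, n[s] }' | sort
}

# Tally TCP connections by state. Input: `netstat -ant` output, whose first
# two lines are headers and whose sixth column is the connection state.
count_conn_states() {
    awk 'NR > 2 && NF >= 6 { n[$6]++ } END { for (s in n) print n[s], s }' | sort -rn
}

# Take one snapshot of both. Call this while the load is high.
snapshot() {
    ps -eo stat= | count_states
    netstat -ant | count_conn_states
}
```

Running `snapshot` once during a quiet period and once during a spike gives two sets of counts to compare.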

February 16, 2022

There is also htop, which is easier to read, has a better interface, and does everything top does; iotop, which shows input/output on a per-process basis; and mytop for MySQL.

As asked above, how are you using top? There are specific command-line options for filtering things.

(I’m guessing it’s either Apache or Mysql)

Not NGINX ? 🤣

There is also a good read here over at cPanel forums with a bash script

Edited February 16, 2022 by Muddy Boots

February 16, 2022

Author

Hi,

Thank you, both of you. I also have htop.

I have a nixstat subscription, and I went through every single metric to see if anything jumped out at me. I found one really weird thing.

Every 5 days there is a huge spike in I/O reads that lasts 3 days, even though writes and TPS don't change. I'm not sure this is correlated with the high load, as the patterns don't match exactly, but it seems suspicious.

Edited February 16, 2022 by SJ77

February 16, 2022

The fact that it lasts 3 days gives you plenty of time to track it down.

Is anything disk-heavy running, like backups?

February 16, 2022

Community Expert

This almost looks like some sort of automated activity. It seems to be spaced out like clockwork.

- Is this a dedicated or shared server? (If it’s a VPS, are we sure there is not something stealing CPU and causing a lack of resources?)

- Is there anything else that runs on the server outside of IPB?

- When these spikes happen… we can absolutely see the disk IO increase but this does not describe what’s causing it. What processes are consuming the most disk/cpu/ram?

February 16, 2022

Author

I agree, 3 days will give me time to track it down. I don't think it's backup-related, as I have daily backups.

This almost looks like some sort of automated activity. It seems to be spaced out like clockwork.

- Is this a dedicated or shared server? (If it’s a VPS, are we sure there is not something stealing CPU and causing a lack of resources?)

- Is there anything else that runs on the server outside of IPB?

- When these spikes happen… we can absolutely see the disk IO increase but this does not describe what’s causing it. What processes are consuming the most disk/cpu/ram?

Dedicated.
Nothing but IPB runs on this machine.
Unfortunately, I noticed the spikes after they had stopped, so we will have to wait until the next one begins, and then I will dig into all the running processes. I will hang tight for about 5 days.

Edited February 16, 2022 by SJ77

February 17, 2022

You should still be able to look at the logs, since you know when the spikes occurred. Try the sar command.

What's the memory usage like when this happens?
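As a rough sketch of what that could look like: sar can replay a past day from the sysstat archive files, limited to a time window. The path layout below is an assumption (RHEL/CentOS style, since yum is in use here), the function names are made up, and the day and times in the usage comment are placeholders, not values from this server.

```shell
#!/bin/sh
# Path of the sysstat archive for a given day of the month.
# ASSUMPTION: RHEL/CentOS layout (/var/log/sa/saDD). Debian-family systems
# typically keep these files under /var/log/sysstat instead.
sa_file() {
    printf '/var/log/sa/sa%02d\n' "$1"   # pass the day without a leading zero
}

# Replay load, memory, swapping and disk activity for one day and time window.
# Usage (placeholder values): sar_window 12 02:00:00 06:00:00
sar_window() {
    f=$(sa_file "$1")
    sar -q -f "$f" -s "$2" -e "$3"   # run queue length and load averages
    sar -r -f "$f" -s "$2" -e "$3"   # memory utilisation
    sar -W -f "$f" -s "$2" -e "$3"   # pages swapped in/out per second
    sar -d -f "$f" -s "$2" -e "$3"   # per-device I/O
}
```

This only works if the sysstat collector was already running and the archive for that day has not been rotated away yet.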

February 27, 2022

Author

OK, I am back in the thick of my high server load cycle, which seems to happen every 5 days, as indicated by the high disk read patterns.

I used iotop to isolate these thread IDs as being the issue. How can I turn this into actual information?

I want to know exactly what these processes are doing so I can address them accordingly.

Having an arbitrary TID doesn't actually tell me much. Is there some way to investigate further from here?

Thank you in advance.

February 27, 2022

Community Expert

You keep focusing on the activities that show disk reads/writes. You’re not showing anything that indicates how many connections are established, what memory is being consumed, what CPU is being consumed, etc.

If the system is out of memory and is swapping it would make 100 percent sense that the disk is thrashing.
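A quick way to test the swapping theory, as a minimal sketch: read how much swap is in use from /proc/meminfo, then watch whether pages are actively moving in and out. The function names are made up for illustration, and the optional file argument exists only so the parser can be tried against a saved copy of meminfo.

```shell
#!/bin/sh
# Print swap currently in use, in kB, from a meminfo-format file.
swap_used_kb() {
    awk '/^SwapTotal:/ { t = $2 } /^SwapFree:/ { f = $2 } END { print t - f }' \
        "${1:-/proc/meminfo}"
}

# Sample swap activity: five one-second samples from vmstat.
# Sustained non-zero values in the si/so columns mean the box is swapping.
swap_activity() {
    vmstat 1 5
}
```

Swap that is in use but with si/so at zero is mostly harmless; it is the ongoing traffic during a spike that would explain thrashing.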

February 27, 2022

I used iotop to isolate these thread IDs as being the issue. How can I turn this into actual information?

ps aux

That will list the process IDs along with the command associated with each one.
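Since iotop lists thread IDs by default and ps aux lists process IDs, a small sketch like the one below can bridge the two by reading the Tgid field from /proc. The function names are made up for illustration, and the second argument to tid_to_pid is only there so it can be tested against a fake directory.

```shell
#!/bin/sh
# Resolve a thread ID (TID) to its owning process ID by reading the Tgid
# field of /proc/<tid>/status. The optional second argument overrides the
# proc root and exists only so the function can be run against a fixture.
tid_to_pid() {
    awk '/^Tgid:/ { print $2 }' "${2:-/proc}/$1/status"
}

# Show the owner, parent, elapsed run time and full command line of the
# process that a TID belongs to.
describe_tid() {
    ps -o pid,ppid,user,etime,args -p "$(tid_to_pid "$1")"
}
```

Alternatively, running iotop with -P makes it show processes instead of individual threads, which avoids the lookup entirely.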

February 27, 2022

Author

ps aux
That will list the process IDs along with the command associated with each one.

But I already have the command showing in iotop. Can I find anything more specific than "nginx worker process"?

You keep focusing on the activities that show disk reads/writes. You’re not showing anything that indicates how many connections are established, what memory is being consumed, what CPU is being consumed, etc.

If the system is out of memory and is swapping it would make 100 percent sense that the disk is thrashing.

Because I have high load that correlates with this strange pattern of high disk read I/O.

It's good for 5 days, then bad for 3 days, then it repeats.

I am trying to find out what is driving this bizarre pattern, because then I think I can stop the high load.

February 27, 2022

But I already have the command showing in iotop. Can I find anything more specific than "nginx worker process"?

Try ps aux

You could also run glances over SSH:

glances

If it's not installed:

yum install glances

February 28, 2022

Also try this.

Put the process ID number in this command and it will return the name of the process:

ps -p PIDNAME -o comm=

example:
ps -p 1572 -o comm=

February 28, 2022

Author

Also try this.

Put the process ID number in this command and it will return the name of the process:
ps -p PIDNAME -o comm=

example:
ps -p 1572 -o comm=

Shows nginx worker process. That could be many things unfortunately.

February 28, 2022

Shows nginx worker process. That could be many things unfortunately.

That's strange. Are you logged in as root over SSH?

Try

ps ax|egrep "^ PIDNAME"

Example:

ps ax|egrep "^ 1572"

February 28, 2022

Author

That's strange. Are you logged in as root over SSH?

Try
ps ax|egrep "^ PIDNAME"

Example:

ps ax|egrep "^ 1572"

It returns this

Which of course doesn't mean anything to me.

I am trying to translate this into exactly what is going on, so that I can take action. Surely there is a way to see what exactly this process is actually doing
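One possible way to get closer to that, as a minimal sketch: /proc exposes per-process I/O counters and the list of open file descriptors, so it is possible to measure how much a given worker reads from disk over an interval and see which files it has open. The function names are made up for illustration, this needs root (or the same user as the process), and the second argument to read_bytes is only there so it can be tested against a fake directory.

```shell
#!/bin/sh
# Bytes this process has caused to be fetched from storage, taken from the
# read_bytes field of /proc/<pid>/io. The optional second argument overrides
# the proc root and exists only so the function can be run against a fixture.
read_bytes() {
    awk '/^read_bytes:/ { print $2 }' "${2:-/proc}/$1/io"
}

# Bytes read from disk by PID over SECS seconds (default 10).
read_delta() {
    before=$(read_bytes "$1")
    sleep "${2:-10}"
    after=$(read_bytes "$1")
    echo $((after - before))
}

# Files and sockets the process currently has open.
open_files() {
    ls -l "/proc/$1/fd"
}
```

If a worker shows a large delta, the open file list usually hints at what it is serving. For a live view of the system calls themselves, something like `strace -p PID -e trace=openat,read,pread64` can be attached for a short time, though it slows the process down while attached.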

February 28, 2022

@SJ77 What's the full output of

ps -aux |grep nginx

February 28, 2022

Author

@SJ77 What's the full output of
ps -aux |grep nginx

February 28, 2022

This will give you details of the worker process's memory usage:

pmap -x PIDNAME

For your last one, the PID would be 8789, if that's still high at 17%.

What OpenSSL version are you on?

openssl version

February 28, 2022

Do you have proxy_buffering set to on? Are you using mod_security?
