High Load: Diagnostics & Troubleshooting

What to do when the CPU is running hot

Have you ever noticed your Linux server slowing down, becoming unresponsive, or experiencing sluggish performance? One common culprit behind such issues is high load. In this article, we'll explore what high load means, why it happens, how to diagnose it, and how to troubleshoot it effectively.

What is High Load?

High load refers to a condition where the system's resources, such as CPU, memory, and disk I/O, are heavily utilized or overburdened. When the load average on a server spikes, it indicates that the demand for system resources exceeds its capacity to handle the workload efficiently. As a result, processes take longer to complete, response times increase, and the overall performance suffers.
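
The quickest way to read the load average is the uptime command. The three values are averages over the last 1, 5, and 15 minutes; as a rule of thumb, sustained values above the number of CPU cores mean work is queuing up faster than the system can process it. Illustrative output:

$ uptime
 09:35:50 up 3 days,  1:22,  3 users,  load average: 4.21, 3.92, 3.89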

Why Does High Load Happen?

High load can occur due to various reasons, including:

  • CPU Intensive Tasks: If your server is running resource-intensive applications or services that require significant computational power, the CPU may struggle to keep up with the workload, leading to high load.
  • Insufficient Memory: When the available memory is insufficient for the running processes, the system resorts to excessive swapping, where it moves data between RAM and disk frequently, degrading performance.
  • Disk I/O Bottlenecks: If the disk I/O operations become a bottleneck, for example, due to heavy database activity or high file read/write operations, the system can experience high load as it waits for disk operations to complete.
  • Network Bottlenecks: If the server is overwhelmed with incoming network requests or experiencing network congestion, it can contribute to high load as the system struggles to process and respond to the requests in a timely manner.

Diagnosing High Load

To diagnose high load, you can employ several Linux commands and tools to gather information about the system's current state. Here are some useful ones:

  • top: Displays real-time information about processes, CPU usage, memory usage, and load average. By monitoring the load average values, you can determine if the system is under high load.
  • htop: An interactive process viewer that provides a more user-friendly and detailed overview of system resources and processes.
  • sar: Collects, reports, and analyzes system activity, including CPU, memory, and disk statistics, allowing you to identify trends and patterns related to high load.
  • iostat: Provides information about disk I/O utilization, including average wait time, disk throughput, and I/O request rates. It helps identify if disk I/O is a bottleneck causing high load.
  • netstat: Displays active network connections and network interface statistics, enabling you to identify potential network-related issues contributing to high load.
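
Typical invocations of these tools look like this (sar and iostat ship with the sysstat package; htop usually needs to be installed separately):

top              # interactive overview; press 1 to show per-core CPU usage
htop             # friendlier interactive process viewer
sar -u 1 5       # CPU utilization, five samples at one-second intervals
iostat -x 1 3    # extended per-device I/O statistics, three samples
netstat -s       # protocol-level network statistics ('ss -s' is the modern alternative)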

Here's an example of how the top command can help you diagnose high load. If you run top, the output may look like this:

top - 09:35:50 up 3 days,  1:22,  3 users,  load average: 4.21, 3.92, 3.89
Tasks: 251 total,   1 running, 250 sleeping,   0 stopped,   0 zombie
%Cpu(s): 32.1 us,  2.3 sy,  0.0 ni, 62.9 id,  1.6 wa,  0.0 hi,  1.1 si,  0.0 st
MiB Mem :  15932.6 total,   3569.2 free,   5107.2 used,   7256.3 buff/cache
MiB Swap:   2048.0 total,   2048.0 free,      0.0 used.  10098.6 avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                     
  1576 root      20   0 1956336 260088  35772 R  35.1   1.6   0:12.45 java                                                                        
  2323 mysql     20   0 19.868g 6.036g  11196 S  95.2  37.7 101:56.24 mysqld                                                                      
  2746 www-data  20   0  568600  87708  20740 S   3.6   0.5   0:05.27 apache2                                                                     
   924 root      20   0  143132  65992  44684 S   1.7   0.4   0:09.11 systemd-journal                                                             

In this example, the mysqld process (PID 2323) is consuming a significant amount of CPU resources (%CPU 95.2). To determine whether it is the reason for the high load, follow these steps:

  1. Check Load Average: Look at the load average values (load average: 4.21, 3.92, 3.89) at the top of the output. If they are consistently high (e.g., above the number of CPU cores), the system is under high load; a quick way to run this check is shown after these steps.

  2. Identify High CPU Usage: Examine the %CPU column to find processes consuming a large percentage of CPU resources. In our example, the mysqld process stands out with a %CPU value of 95.2.

  3. Consider Process Characteristics: Investigate the process associated with high CPU usage (mysqld in this case). Determine if it is a critical service or application that could potentially contribute to high load. In this scenario, the high CPU usage by mysqld might be a cause for concern if it's impacting the server's performance.

  4. Analyze Memory Usage: Although not directly related to identifying the culprit, monitoring memory usage (MiB Mem) can help understand if memory constraints are exacerbating the high load situation. If the system has insufficient memory, it can lead to increased swapping and further impact performance.
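
A quick way to perform the check from step 1 is to compare uptime against the core count reported by nproc (a 4-core machine is assumed here for illustration):

$ nproc
4
$ uptime
 09:35:50 up 3 days,  1:22,  3 users,  load average: 4.21, 3.92, 3.89
# a load average around 4.2 on 4 cores means the system is at or slightly over capacity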

Based on the example, the mysqld process with its high CPU usage could be a primary factor contributing to the high load on the server. Further investigation, such as analyzing MySQL performance metrics and tuning the database configuration, might be necessary to alleviate the high load caused by the mysqld process.
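
As a first step, you can ask MySQL what it is currently working on. Assuming you have administrative credentials, SHOW FULL PROCESSLIST reveals long-running or stuck queries (a quick sketch, not a full tuning guide):

$ mysql -u root -p -e 'SHOW FULL PROCESSLIST;'
# look for entries with a large Time value or states like 'Copying to tmp table';
# running EXPLAIN on a suspicious query then shows whether an index is missing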

Troubleshooting High Load

Once you've diagnosed high load, it's time to take action and troubleshoot the problem. Here are some steps you can follow:

  1. Identify the Resource Bottleneck: Analyze the output of commands like top, htop, sar, and iostat to determine which system resource is causing the high load, such as CPU, memory, disk I/O, or network.
  2. Investigate Processes: Identify any specific processes or applications consuming excessive resources using the above commands. Pay attention to processes with high CPU utilization, high memory consumption, or excessive disk I/O.
  3. Optimize Resource Usage: Consider optimizing your software configuration, application code, or system settings to reduce resource consumption. For instance, fine-tune application parameters, enable caching mechanisms, or limit concurrent connections to alleviate the load.
  4. Add Hardware Resources: If your server consistently experiences high load due to insufficient resources, you might need to consider upgrading your hardware. Adding more RAM, upgrading the CPU, or employing faster storage devices can significantly improve system performance.
  5. Scale Out or Load Balance: For scenarios where high load is caused by excessive incoming network requests, you can distribute the workload across multiple servers using load balancing techniques or scale out your infrastructure by adding more servers to handle the increased demand.
  6. Monitor System Performance: Implement a comprehensive monitoring system that continuously tracks system metrics, alerts you when load thresholds are breached, and provides insights into performance patterns. This allows you to proactively identify and address high load situations.
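
For step 3, two low-risk ways to rein in a runaway process are lowering its scheduling and I/O priorities, or capping its CPU share via systemd. A sketch using PID 2323 from the earlier example (myapp.service is a hypothetical unit name):

renice 10 -p 2323                                   # lower CPU priority (higher nice value = lower priority)
ionice -c 3 -p 2323                                 # move the process to the idle I/O scheduling class
systemctl set-property myapp.service CPUQuota=50%   # cap a systemd service at half a CPU core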

High I/O Wait and its Relationship to High Load

When dealing with high load on a Linux server, another crucial factor to consider is high I/O wait. I/O wait refers to the time that processes spend waiting for input/output (I/O) operations to complete, such as reading from or writing to disks, network interfaces, or other I/O devices. High I/O wait can significantly impact system performance and contribute to overall high load conditions.

Here's how high I/O wait relates to high load:

  1. Disk I/O Bottlenecks: If the server experiences high I/O wait, it suggests that the disk I/O subsystem is struggling to keep up with the workload. Disk I/O bottlenecks can occur due to various reasons, such as slow or overloaded storage devices, inefficient disk access patterns, or heavy concurrent disk operations. As a result, processes waiting for disk I/O completion can pile up, leading to high load.

  2. Reduced Process Efficiency: When processes are blocked by high I/O wait, their execution is delayed, and they cannot proceed until the required I/O operations are completed. This delay in process execution can reduce overall system efficiency, as CPU resources remain idle while waiting for I/O, contributing to higher load averages.

  3. Impaired Responsiveness: High I/O wait can manifest as unresponsiveness or slow response times for user interactions or services. For example, a web server might take longer to process requests, leading to increased latency and degraded user experience. Such sluggishness can be attributed to the high wait times for I/O operations, resulting in high load and a perception of unresponsiveness.

Here's an example using the top command that highlights high I/O wait as the cause of high load:

top - 09:35:50 up 3 days,  1:22,  3 users,  load average: 8.67, 7.91, 6.84
Tasks: 251 total,   1 running, 250 sleeping,   0 stopped,   0 zombie
%Cpu(s): 12.2 us,  5.3 sy,  0.0 ni, 32.4 id, 50.1 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  15932.6 total,   3569.2 free,   5107.2 used,   7256.3 buff/cache
MiB Swap:   2048.0 total,   2048.0 free,      0.0 used.  10098.6 avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                     
  1576 root      20   0 1956336 260088  35772 R  25.6   1.6   0:12.45 java                                                                        
  2323 mysql     20   0 19.868g 6.036g  11196 S   7.4  37.7 101:56.24 mysqld                                                                      
  2746 www-data  20   0  568600  87708  20740 S   3.6   0.5   0:05.27 apache2                                                                     
   924 root      20   0  143132  65992  44684 S  12.8   0.4   0:09.11 systemd-journal                                                             

In this example, we have a high I/O wait percentage (wa) in the %Cpu(s) section, indicating potential disk I/O bottlenecks. Here's what you should look for and what it means:

  1. Check Load Average: The load average values (load average: 8.67, 7.91, 6.84) at the top of the output indicate a high load on the server.

  2. Examine CPU Usage: In the %Cpu(s) section, focus on the wa (I/O wait) value, which is 50.1 in this example. This value represents the percentage of time the CPU is waiting for I/O operations to complete. A high wa value suggests that the system is experiencing significant I/O wait.

  3. Interpret I/O Wait: A high I/O wait percentage implies that processes are spending a considerable amount of time waiting for I/O operations to finish. It indicates potential disk I/O bottlenecks, where the disk subsystem is struggling to keep up with the workload, causing processes to queue up and contributing to high load.
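
To confirm that a particular disk is the bottleneck, iostat's extended statistics are helpful:

$ iostat -x 1 3
# watch r/s, w/s, the await columns (r_await/w_await on newer sysstat versions), and %util;
# a device with %util near 100 and rising await values has no idle capacity left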

In this scenario, the high I/O wait percentage suggests that the server's performance is being impacted by disk I/O bottlenecks. Let's see if we can find out who the culprit is.

Here's the output of the iotop command (it typically requires root privileges):

Total DISK READ: 25.60 MiB/s | Total DISK WRITE: 120.40 MiB/s
  TID  PRIO  USER       DISK READ    DISK WRITE   SWAPIN    IO>     COMMAND
 1576  be/4  root     10.00 MiB/s   80.00 MiB/s   0.00 %   25.6 %   java
  345  be/3  www-data   0.00 B/s    10.50 MiB/s   0.00 %    5.7 %   apache2
 2323  be/4  mysql      0.00 B/s     0.00 B/s     0.00 %    0.2 %   mysqld

In this example, the java process (PID 1576) shows a high disk write rate (DISK WRITE: 80.00 MiB/s), indicating heavy I/O activity, and spends a significant share of its time waiting on I/O (IO> 25.6 %).

To further diagnose and troubleshoot the high I/O wait caused by the java process, consider the following steps:

  1. Analyze Java Application: Investigate the specific Java application running under the java process (PID 1576). Review the application's code, database interactions, or any file-intensive operations that could contribute to the high disk write activity (the snippet after these steps shows one way to see which files it is writing).

  2. Optimize I/O Operations: Identify areas within the Java application that could be optimized to minimize unnecessary or redundant disk write operations. This may involve implementing caching mechanisms, improving data access patterns, or optimizing file I/O strategies.

  3. Monitor System Resources: Continuously monitor system resource utilization, including CPU, memory, and disk I/O metrics, to identify potential bottlenecks and resource constraints. CleverUptime can help track resource usage over time.

  4. Consider Hardware and Configuration: Evaluate the server's hardware configuration and ensure that it meets the demands of the Java application. If necessary, consider upgrading disk drives or adopting RAID configurations to improve I/O performance. Additionally, review the operating system and Java runtime configurations to ensure they are optimized for performance.
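
For step 1, lsof and strace can show what the java process is actually writing. A sketch using PID 1576 from the example (note that strace slows the traced process down, so use it briefly and with care on production systems):

lsof -p 1576                                  # list the files the process currently has open
strace -f -e trace=write,pwrite64 -p 1576     # trace write syscalls across all threads; stop with Ctrl+C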

By addressing high I/O wait and optimizing disk I/O performance, you can alleviate the impact on system load, improve overall responsiveness, and enhance the server's ability to handle demanding workloads. Regular monitoring and tuning of the I/O subsystem are essential to maintain optimal performance and mitigate the effects of high I/O wait on system load.

Swapping and its Relationship to High Load

When discussing high load on a Linux server, it's essential to understand the concept of swapping and its relationship to system performance. Swapping is the process of moving inactive or less frequently accessed memory pages from the RAM to the disk, thereby freeing up physical memory for more immediate use. While swapping can be a useful mechanism to manage memory usage, it can also contribute to high load in certain scenarios.

When a server's physical memory (RAM) becomes insufficient to hold all the active processes and data, the Linux kernel resorts to swapping as a means of maintaining system stability. However, swapping involves reading and writing data between the disk and RAM, which is significantly slower than accessing data solely from the RAM. As a result, excessive swapping can lead to performance degradation and contribute to high load conditions.
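
You can check how much swap is configured and currently in use with free and swapon:

$ free -h           # the 'Swap:' row shows total, used, and free swap space
$ swapon --show     # lists each swap device or file and its usage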

Here's how swapping relates to high load:

  1. Increased Disk I/O: Swapping involves frequent read and write operations to the disk, which can introduce significant disk I/O activity. If the disk I/O subsystem becomes a bottleneck, it can slow down overall system performance and contribute to high load.

  2. Sluggish Response Times: When a server experiences high load due to CPU-intensive tasks or insufficient physical memory, processes may spend more time waiting for memory pages to be swapped in and out of disk. This can lead to increased response times and an overall sense of sluggishness in the system.

  3. Thrashing: In cases where the system is overwhelmed by excessive swapping, a phenomenon known as thrashing can occur. Thrashing happens when the system spends a significant amount of time swapping pages in and out of disk without making meaningful progress in executing tasks. This vicious cycle of swapping and reduced productivity can result in a severe performance degradation and an extremely high load on the system.
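
Live swap traffic is easiest to spot with vmstat; the si and so columns report how much data is being swapped in and out per second:

$ vmstat 1 5
# si/so show KiB swapped in/out per second; sustained non-zero values mean the
# system is actively swapping, and persistently high values point to thrashing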

Here's an example of what swapping looks like in top:

top - 09:45:20 up 2 days,  3:18,  3 users,  load average: 8.92, 9.45, 10.08
Tasks: 254 total,   3 running, 251 sleeping,   0 stopped,   0 zombie
%Cpu(s): 65.3 us,  10.5 sy,  0.0 ni,  3.5 id, 20.1 wa,  0.0 hi,  0.6 si,  0.0 st
MiB Mem :   8192.0 total,    256.2 free,   7935.8 used,     59.9 buff/cache
MiB Swap:  16384.0 total,    158.7 free,  16225.3 used.     70.2 avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  1586 root      20   0 12.5g   7.0g    30364 R  95.6  86.8   0:12.45 java
  2323 mysql     20   0 17.685g 0.9g    11996 S   3.2  11.3 104:56.24 mysqld
   924 root      20   0  143132  65992  44684 S   1.2   0.8   0:09.11 systemd-journal

In this example, java is holding most of the physical memory (7.0g in the RES column), while the large gap between mysqld's virtual size (17.685g VIRT) and its resident size (0.9g RES) shows that much of its data has been pushed out to swap. Here's what to observe:

  1. Check Load Average: The load average values (load average: 8.92, 9.45, 10.08) at the top of the output indicate a significantly higher load on the server.

  2. Examine Memory Usage: In the MiB Mem section, the free value (256.2 MiB) represents the available physical memory, which is relatively low.

  3. Inspect Swap Usage: In the MiB Swap section, note the used value (16225.3 MiB), which indicates the amount of swap space being utilized. In this example, the swap space usage is high.

  4. Identify Memory-Intensive Processes: In the process list, java is consuming the bulk of physical memory (7.0g RES, 86.8 %MEM), and mysqld's small resident size relative to its 17.685g virtual size indicates that its pages have been swapped out. This memory pressure is what drives the swapping; the snippet below shows how to measure per-process swap usage directly.
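
top does not show per-process swap usage directly, but the kernel exposes it in /proc/<pid>/status. A small sketch (bash) that lists the biggest swap consumers:

# VmSwap is reported in KiB; print each process name with its swap usage, largest first
for f in /proc/[0-9]*/status; do
  awk '/^Name|^VmSwap/ {printf "%s ", $2} END {print ""}' "$f"
done | sort -k2 -nr | head

If the kernel swaps more aggressively than your workload tolerates, the vm.swappiness sysctl (default 60) can be lowered to make it less eager to swap.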

It's important to strike a balance between physical memory capacity, system workload, and swapping behavior to ensure optimal performance and minimize the impact of swapping on high load situations. Regular monitoring and proactive management of memory resources can help mitigate the detrimental effects of excessive swapping.

Applications that can Cause High Load

While high load can be a result of various factors, certain applications are known to consume substantial resources, potentially leading to high load. Here are a few examples:

  • Apache: A popular web server that can generate high load when serving numerous concurrent requests or running inefficient web applications.
  • MariaDB: A relational database management system that, under heavy load or with poorly optimized queries, can cause significant CPU and disk I/O utilization.
  • Elasticsearch: A distributed search and analytics engine that can generate high load when indexing large volumes of data or executing complex search queries.
  • Samba: A service that provides file sharing and printing capabilities to Windows clients. It can induce high load when handling numerous concurrent file operations or experiencing network bottlenecks.
  • Nginx: A lightweight and high-performance web server that can generate high load when serving a large number of concurrent connections or handling resource-intensive tasks like SSL encryption.
  • MySQL: A popular open-source relational database management system that, under heavy load or with poorly optimized queries, can lead to high CPU usage and disk I/O bottlenecks.
  • PostgreSQL: An advanced object-relational database system that, when dealing with large databases or executing complex queries, can consume significant system resources and contribute to high load.
  • Redis: An in-memory data structure store that can generate high load when handling a large number of read and write operations or executing computationally intensive commands.
  • MongoDB: A NoSQL document-oriented database that, when handling heavy write operations or executing complex queries, can impact CPU usage and disk I/O performance.
  • Memcached: A distributed memory caching system that can cause high load when serving a large number of cache requests or experiencing excessive cache misses.
  • Jenkins: An open-source automation server used for continuous integration and continuous delivery. When executing a large number of build jobs simultaneously, Jenkins can impose high load on the system.
  • Hadoop: A distributed processing framework used for big data processing and analytics. When processing large data sets or running resource-intensive MapReduce jobs, Hadoop can cause high CPU and disk I/O load.
  • Docker: A containerization platform that, when running numerous containers or resource-intensive applications within containers, can lead to high load by increasing CPU, memory, and disk usage.

These applications represent just a fraction of the vast ecosystem of software that can contribute to high load on Linux servers. Understanding their resource requirements and performance characteristics can help you diagnose and troubleshoot high load situations more effectively.

Conclusion

Dealing with high load on a Linux server can be challenging, but armed with the right knowledge and diagnostic tools, you can effectively diagnose the issue and take appropriate actions to optimize your system's performance. Remember, monitoring and proactive maintenance are key to ensuring your server runs smoothly even under heavy workloads.

Except where otherwise noted, content on this site is licensed under a CC BY-SA 4.0 license.