
Linux Performance Tools at a Glance


There are a number of reasons to collect and analyze performance data for the workloads that clients are running on Linux (e.g., planning, information, optimization, resolving issues). When optimizing a system, it's natural to collect performance data and compare it before and after making changes. Performance issues, however, come up unplanned, and you can't start gathering data right before one occurs. So what do you compare your data to?

Most people begin collecting data when they already have a performance issue and try to figure out what the problem is by looking at that data alone. That makes it hard to find the cause. To accurately solve a performance issue, you need something to compare against.

Performance analysis is done by comparing data of a system in a good versus bad state. This helps to find out what has changed, and the cause of the change.

Monitor Regularly

To prepare for future performance issues, set up regular monitoring and keep the data at least as far back as the last major change, so you always have an example of what a good case looks like.

For analyzing changes over time, tracking degrading problems or finding out when a problem first occurred, you need historical data (e.g., daily for the past week, weekly for the last month or monthly for the last year).

The SYSSTAT utilities package includes the most important tool for this kind of monitoring, sar/sadc, which can be used both for ad hoc data gathering and for regular monitoring. For starters, you can collect performance data in 10-minute intervals.

An example of using sadc is:

/usr/lib64/sa/sadc 600 144 outfile
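
Here, sadc samples every 600 seconds (10 minutes) and writes 144 samples, i.e., 24 hours of data, to outfile. The binary file can later be displayed with sar. As a minimal sketch, reusing the outfile name from above, this shows the CPU utilization recorded in it:

sar -u -f outfile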

The sampling rate—the rate at which data points are collected—needs to be high enough to spot a problem. For further analysis of a performance issue, it’s a good idea to collect data with a higher resolution once you know where to look.

Performance monitoring itself consumes system resources, so it can impact the system if your sampling rate is high. You don't want to gather too many samples in short intervals, because that costs processor time and you need to store the data. On the other hand, gathering data over an interval averages it, which flattens peaks. Your sampling rate defines your resolution in time: events that occur on a time scale shorter than your sampling interval can't be seen. Set the sampling rate high enough to spot the peaks you care about.
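
When you already know roughly where to look, it can make sense to run a second, short collection at a finer resolution. This is only an illustration using the sadc syntax from above; the 10-second interval, the one-hour duration (360 samples) and the file name are example choices, not recommendations from the original text:

/usr/lib64/sa/sadc 10 360 outfile_highres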

Create a description of your system and workload to determine if there's a problem. If you know what your system is supposed to look like and accomplish, you can identify deviations. To find out what's going wrong, find out what has changed by comparing actual performance data to historical performance data. That requires having a data set of a healthy system. With this, you can quantify the problem.

It's best to be trained in gathering, processing and evaluating the data. So it's a good idea to use the performance tool set you chose to process the data once in a while, even if there's no problem. If there's a problem, time is precious and things are urgent. Work with the tools on the system before problems occur to learn how they work and what a normal state on your system looks like.

Find the Problem

Before you can start analyzing a performance issue, make a clear statement of what the problem is. First, describe your performance issue properly. What indicators show the issue? What range is considered good and what is considered bad? State clearly how the behavior deviates from expected or historical behavior.

Sometimes problems occur at specific times of day, or they're triggered by certain events. That means you need to gather data during a specific timeframe. Some problems appear only for a very short time, so you need a sampling rate that's high enough to see them. It's important to capture the bad state in as short a time frame as possible.

It's not enough to send in monitoring data that contains the performance issue somewhere. Solving performance issues is like finding a needle in a haystack when you don't even know what the needle looks like. There's a trade-off: the less data you have, the easier it is to analyze, but more data is more likely to contain a shot (i.e., data of an occurrence) of the problem.

After the first analysis you might need to gather data a second time. In time you will gain experience that helps to gather better data. Have data for good cases at hand for comparison.

Analyze and Solve

The analysis starts on the data provided. Which tools are best to use depends on the problem. The first analysis leads to a better understanding of the problem, so we know better what to look for. That usually doesn't solve the problem but leads to a second round of gathering data.

Once the problem is solved, it's important to understand it so we can improve our preparation. That can mean preventing the issue from occurring again by improving the system, monitoring so we can react early if it starts to come back, or simply having better data available if it comes up again.

In performance analysis you work forward through the problem. You start by asking questions and forming hypotheses. These hypotheses can be verified or falsified. Falsifying is usually a lot stronger because it can rule out a hypothesis completely.

Answer the questions one at a time and try to narrow down the problem. What data do you need to answer that question? Believe in your data and your conclusions. Once a question is answered move on to the next. There's no reason to gather new data to answer the same question. There are only two reasons to gather new data: you have a new question or you discover you made a mistake.

A multi-staged approach, narrowing down the part of the system where the root cause lies, saves a lot of work and a lot of time. First use a general tool to isolate the area of the issue. Find the deviations from the good case in your data. Create theories as to how the observed data is produced. Verify or falsify your theories, starting with the easy ones. Falsifying is usually stronger. Remember: Ravens are black. One more black raven proves nothing, but one white raven shatters the rule.

Once you've learned more about the issue and know where to look closer, gather new data in that range or use another tool to gather more detail. Apply what you've just learned and start over.

Identify the most likely problem area and focus on that. If you can rule out one area, take it off the list and focus on the most likely of the remaining areas.

Once you have an overview of the system performance and have identified the area of the problem, you need to look deeper. There are a lot of tools to get more data, and every tool has specific things it can analyze. You don't need to know them all in advance; it's fine to look at a tool when you need it. Typical areas and tools are (an iostat example follows the list):

  • I/O: iostat, DASD statistics/SCSI statistics, multipath
  • Network: lsqeth, ethtool, netstat
  • Memory: top, slabtop, smem, meminfo
  • Crypto: icastats, lscrypt
  • CPU: top, hyptop, pidstat, ps
  • Profiling: strace, ltrace, oprofile, perf
  • User space: top, ps, perf
  • Java: Java Garbage Collection and Memory visualizer (verbosegc)
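
For example, if the I/O area looks suspicious, iostat from the SYSSTAT package can show per-device utilization and throughput. This is only a sketch; the 5-second interval and the count of three reports are arbitrary choices, not values from the original text:

iostat -xk 5 3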

There’s an IBM page with hints and tips for Linux on z Systems. This includes results from performance measurements and recommended settings for performance tuning.

System Optimization

A system that has all resources fully used doesn't necessarily have a performance problem; it's probably working as designed. When we're talking about performance issues, some resources are overused while others are underutilized. When we identify bottlenecks and get rid of them, we can make better use of resources and push the system closer to 100 percent utilization.

The first step is to determine if your system is CPU bound or memory bound. CPU bound means your system is limited by the available CPU resources; adding more CPU resources gives you more performance. If it doesn't, you have a scaling issue with your software. Memory bound means adding physical memory improves performance. Adding memory to a CPU bound system doesn't change anything, just as adding CPU resources to a memory bound system doesn't improve performance.

In most cases you can spot CPU bound systems by 100 percent CPU usage with plenty of memory free. Memory bound systems are sometimes harder to spot, because a lack of memory doesn't necessarily leave lots of CPU resources idle: in most cases considerable swap activity is going on, which itself uses a considerable part of the CPU resources. This can go as far as most CPU resources being spent on the memory subsystem. Adding CPU resources can then improve performance to some extent, but it doesn't solve the underlying lack of memory.
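
A quick way to check which case you're in is vmstat, which shows CPU usage and swap activity side by side. This is a sketch, not a command from the original text; it prints a report every 5 seconds. Sustained non-zero values in the si/so (swap in/out) columns point toward a memory bound system, while high us/sy with idle near zero and no swapping points toward a CPU bound system:

vmstat 5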

What follows are the names of some command-line tools, each with an example of how to use it.

top: This is a very common tool in Linux systems. It can be used to get a quick overview of what processes are running on the system and how much CPU and memory is consumed. It's a very comprehensive and easy tool to use and shows system performance in real time.
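
top can also be run non-interactively, which is handy if you want to save a snapshot for later comparison. This is a sketch; the output file name is just an example:

top -b -n 1 > top_snapshot.txt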

ps: With ps, you can quickly find out the process ID of a process and kill the process. But using the right options can give you very detailed information about a process (e.g., for hanging processes, the system call that the process is hanging in). The complete command with options is:

ps -eo pid,user,pcpu,comm,wchan

It's good to know some of the options, and keep in mind that ps has a lot more options you can look up in the man page when you need them.
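
To find the biggest CPU consumers quickly, ps can also sort its output. This is a sketch; the selection of columns mirrors the example above:

ps -eo pid,user,pcpu,comm --sort=-pcpu | head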

perf: With perf stat you can collect basic performance counter statistics for a command, such as how long it ran and how many instructions and cycles it used. The complete command with options is:

perf stat ls -R /etc > /dev/null

Using perf you can take a deeper look into the performance of applications (e.g., it can do profiling):

perf record ls -R /etc > /dev/null
perf report
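
If you also want to see how the hot functions are reached, perf can record call-graph information as well. This is a sketch following the same pattern as the example above; -g asks perf to capture call chains:

perf record -g ls -R /etc > /dev/null
perf report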

Profiling can help to find hot spots in your application. Passages in the code where a significant part of the processor time is spent, like a loop that's executed many times, are the parts worth optimizing.

A code part that accounts for 1 percent of the total execution time doesn’t have the potential of improving the performance by more than 1 percent because, even if you eliminate that code, it saves only 1 percent of the total execution time.

Code parts that account for a significant amount of execution time have the best potential for improvement. If you can improve a code part that accounts for 30 percent of the total execution time by 10 percent, the total improvement will be 3 percent.
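
In other words, the overall saving is roughly the fraction of time spent in that code part multiplied by its relative improvement. For the example above: 0.30 × 0.10 = 0.03, i.e., 3 percent of the total execution time.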

Be Prepared

Three things to do are: prepare, prepare and prepare. Understanding your system and your workload will help you quickly identify the area of a problem, or understand why a deviation might not indicate a problem at all because it's part of the specific nature of your workload.

Find your own methodology and tools. Though a lot of shops do use the same set of tools, the methodology is specific to the system and the workload. In case a problem comes up you need to be able to act quickly.

To pin down performance problems you need historical data. It will help you make a clear statement of what the problem is and quantify it.
