How To Recognize SAN Performance Problems
When do you have an I/O performance problem in a SAN environment and how do you figure it out?
By Jose Ortega, 04/01/2019
Note: Although the term SAN sometimes refers only to the switching fabric, in the context of this article, I use the term SAN to include the storage server.
In any discussion about performance and troubleshooting, it’s important to talk about expectations. Imagine, for example, that I have the server with the fastest processor, the fastest storage server and the fastest FC adapter; in other words, the fastest equipment on the market. In spite of this advantage, I don’t have the performance that I wish I had. This could be happening for two reasons. The first is that my expectations are too high: nothing is wrong with my equipment, and I’m simply hitting the limits of my system. I have to be able to recognize when my problem is an expectation problem, and the only solution in that case may be to wait until technology improves and then buy better equipment. This doesn’t happen often, though. The second and more common reason is that we have a performance bottleneck in our environment, and we have to identify and fix it. I’d say that’s the main duty of the systems administrator.
Bottom line: it’s important to determine whether we have a real performance issue or an expectation problem. But how do we identify real I/O bottlenecks? To answer that question, I want to briefly discuss problem determination.
The Problem Determination
Problem determination starts before the problem happens. If I’m asking questions or gathering information when the problem happens, I’m already losing time and money. For that reason, I need to have some information about my SAN environment before problems happen. For example, it’s in my best interest to know what my SAN network looks like. I don’t need a full schematic of my SAN network because it could be overly complex, but at least I’d like to know which storage server and FC switch ports are in use.
In addition, I need to understand what my systems are doing before problems occur. As a result, I need reference points, or a performance baseline; this information tells me what my normal performance is. Then I can compare my current performance numbers against the baseline, and this will tell me whether I’m having a problem in my system. We have many tools to gather performance data, such as vmstat, iostat, filemon and nmon, but their output won’t make much sense if we don’t have anything to compare it with.
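The baseline comparison can be sketched in a few lines of Python. This is only an illustration, assuming the numbers were already collected from a tool such as iostat and stored in dictionaries; the metric names, the sample values and the 50% tolerance are hypothetical and should be replaced with whatever fits your environment.

```python
def deviations(baseline, current, tolerance=0.5):
    """Return the metrics whose current value exceeds the baseline
    by more than `tolerance` (a fraction, 0.5 means 50%)."""
    flagged = {}
    for name, base in baseline.items():
        # Skip metrics that are missing now or have no usable baseline.
        if name not in current or base <= 0:
            continue
        change = (current[name] - base) / base
        if change > tolerance:
            flagged[name] = change
    return flagged


# Hypothetical sample values, not real measurements.
baseline = {"read_avgserv_ms": 2.0, "write_avgserv_ms": 0.5, "tps": 1000.0}
current = {"read_avgserv_ms": 6.0, "write_avgserv_ms": 0.6, "tps": 1050.0}

result = deviations(baseline, current)
print(result)
```

With these sample values only the read service time is flagged, because it tripled while the write service time and the load stayed close to the baseline.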
In addition, from my experience it’s beneficial to gather a performance baseline by stressing my LUN devices with workload generator tools and then keeping these results for later comparison. I’ll explain in more detail different ways to stress the LUNs in a future article.
When do you have a SAN performance problem?
The strongest indication of a SAN performance problem comes from the average service time (latency) numbers and, after that, the queue statistics for the hdisk devices. In this part, I want to discuss starting numbers, or rules of thumb. Regardless of the storage technology in use, we’re considered to have a SAN performance problem if we get numbers greater than these:
| Latency for read operation (read average service time) is larger than 15 ms | Latency for write operation (write average service time) is larger than 3 ms | High numbers on queue wait |
|---|---|---|
| This might indicate that your bottleneck is in a lower layer, which can be the HBA, SAN, or even in the storage. Also, check whether the same problem occurs with other disks of the same VG. | Writes that average significantly and consistently higher service times indicate that write cache is full, and there is a bottleneck in the disk. | Check whether average wait queue size (avgwqsz) is larger than average queue size (avgsqsz). Compare with other disks in the storage. |
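As a rough sketch, the three rules of thumb above can be written as a single check. The function name and its parameters are hypothetical; the values are assumed to have been read manually from the avgserv, avgwqsz and avgsqsz fields that iostat reports for an hdisk.

```python
READ_LIMIT_MS = 15.0   # rule-of-thumb limit for read average service time
WRITE_LIMIT_MS = 3.0   # rule-of-thumb limit for write average service time


def rule_of_thumb_warnings(read_avgserv_ms, write_avgserv_ms, avgwqsz, avgsqsz):
    """Return a list of warnings for one hdisk, based on the rules of thumb."""
    warnings = []
    if read_avgserv_ms > READ_LIMIT_MS:
        warnings.append("read latency above 15 ms")
    if write_avgserv_ms > WRITE_LIMIT_MS:
        warnings.append("write latency above 3 ms")
    if avgwqsz > avgsqsz:
        warnings.append("wait queue larger than service queue")
    return warnings


# Hypothetical values for one disk: slow reads, healthy writes and queues.
print(rule_of_thumb_warnings(20.0, 1.0, 0.0, 2.0))
```

An empty list only means that none of the starting numbers was exceeded; it does not replace the comparison against your own baseline.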
These numbers are suggested by IBM based on field experience. Nevertheless, it’s also important to compare your current numbers with your baseline, because you may have better numbers than these and still have a performance problem. To illustrate, imagine you’re using a V9000 with flash disk technology. The average response time for read or write operations on this type of disk is between 0.1 ms and 0.5 ms; a 1 ms response time could be considered high, but it’s still OK. However, if you were getting 5 ms latency with flash disks, that would indicate that you might have a bottleneck. Consider the following response times for flash disks:
- For small I/O size workloads (8 KB - 32 KB), response time should stay under 1 millisecond
- For large I/O size workloads (64 KB - 128 KB), response time should stay under 3 milliseconds
Source: IBM FlashSystem V9000 AC3 and AE3 Performance Redbook
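The flash guidance above could be encoded as a small lookup, sketched here in Python. The function name is hypothetical, and because the guidance says nothing about I/O sizes outside the two ranges, the sketch returns None for them instead of guessing.

```python
def flash_latency_target_ms(io_size_kb):
    """Return the suggested response-time ceiling in ms for a flash disk,
    or None when the I/O size falls outside the two documented ranges."""
    if 8 <= io_size_kb <= 32:
        return 1.0   # small I/O: stay under 1 ms
    if 64 <= io_size_kb <= 128:
        return 3.0   # large I/O: stay under 3 ms
    return None      # not covered by the guidance


for size in (16, 48, 128):
    print(size, flash_latency_target_ms(size))
```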
In general, you need to interpret the average service times as response times because they include potential queuing at various storage subsystem components. Also keep in mind that if your workload is sequential, you need to check bandwidth; in this case, latency might not be an issue, since latency increases with larger I/O sizes. Finally, make sure your queue_depth configuration is correct. Now, let’s see how to monitor the latency from the LUN devices.
How do you figure out that you have a performance problem?
The average service time metric provides the most direct measurement of the health of the SAN and storage subsystem. To identify performance problems, compare the current avgserv numbers for read operations (in blue) and write operations (in orange) with your performance baseline numbers. In this example, I have the iostat -D output from my baseline:
If you see a large increase in the avgserv values while the I/O load numbers, tps (IOPS) or bps (bytes per second), stay similar to those in your baseline, this indicates that you might have a bottleneck in a layer below the operating system, which can be the HBA, the SAN fabric, or the storage. I’d suggest getting the overall response time and load numbers from the different LPARs, hardware and VIOS that use the same storage server to confirm whether the storage server is causing the problem. Monitoring tools such as IBM ITM or lpar2rrd help with this analysis; your knowledge of your SAN environment helps here too.
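The "latency went up but the load did not" pattern can be expressed as a simple test. This is a sketch under stated assumptions: the function name, the factor of 2 for "large increase" and the 20% band for "similar load" are hypothetical choices, not IBM guidance, and the inputs are assumed to be avgserv and tps values taken from the baseline and from the current iostat output.

```python
def lower_layer_suspected(base_avgserv_ms, cur_avgserv_ms, base_tps, cur_tps,
                          latency_factor=2.0, load_tolerance=0.2):
    """Return True when service time grew a lot while the load stayed similar."""
    # "Large increase": current latency is at least latency_factor times baseline.
    latency_jumped = cur_avgserv_ms >= base_avgserv_ms * latency_factor
    # "Similar load": current tps is within load_tolerance of the baseline tps.
    load_similar = abs(cur_tps - base_tps) <= base_tps * load_tolerance
    return latency_jumped and load_similar


# Hypothetical numbers: latency quadrupled, load almost unchanged.
print(lower_layer_suspected(2.0, 8.0, 1000.0, 1050.0))
# Hypothetical numbers: latency quadrupled, but the load tripled as well.
print(lower_layer_suspected(2.0, 8.0, 1000.0, 3000.0))
```

In the second case the extra latency may simply come from the extra load, so the check does not point at the HBA, the fabric or the storage.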
Consider these root causes for common SAN performance problems:
- Disk array bottleneck: The most common type of performance problem is a disk array bottleneck. Similar to other types of I/O performance problems, this usually manifests itself in high disk response time on the host.
- Hardware connectivity: Infrequent connectivity issues occur because of broken or damaged components in the I/O path, causing transactions to take longer than normal or, in certain cases, to time out.
- Storage port bottleneck: This does not occur often, but storage ports are a component that is typically oversubscribed.
- HBA bottleneck: This shows up as an increase in response time without a corresponding increase in the number of IOPS (disk Reads/Sec).
There are other situations we might face. However, the most common problems are related to bottlenecks in the storage server, followed by SAN equipment and HBA bottlenecks. Finally, avoid blaming the SAN or storage without examination, because AIX has many layers, such as LVM, queues and VMM, and all of them might have a performance impact.
In short, one of the best ways to identify SAN performance problems from AIX is by checking the average response time (avgserv) from iostat. These values must be compared against your performance baseline, which might be historical information or stress test results. If this information is not available, the rules of thumb for response time can help you: the average response time should be under 15 ms for a read operation and under 3 ms for a write operation. For flash disks, latency should be under 1 ms for small I/O and under 3 ms for large I/O. Finally, it’s important to avoid blaming the SAN environment without first checking, because AIX has many layers that can have a performance impact.
References
AIX 7.2 Performance Management Guide
IBM System Storage DS8000 Performance Monitoring and Tuning, SG24-8318-00
IBM FlashSystem V9000 AC3 and AE3 Performance – RedPaper
Best Practices Guide for Databases on IBM FlashSystem
IBM Power Systems Performance Guide: Implementing and Optimizing Redbook
Jose Ortega is an IBM Power Systems and database consultant. He's been working on the platform since 2005.