Stress Tests Validate Systems and Strain Analysts

Working with a client insurance company, I have the dubious distinction and privilege of being an early pioneer in the discipline of stress testing computer systems. An IBM Consulting Systems Engineer named Don Lund took me under his wing, passed his experience on to me and, with that discerning guidance, mostly kept me out of trouble. With his help, a couple of my client’s systems programmers and I used an IBM Field Developed Program (FDP) called CICS Volume Test to validate a new CICS release (CICS/VS V1.3). Volume Test used the MVS Spool to record transactions for playback, which let us run real online activity in test systems.

Foundational Principles for Stress Testing

Before we started stress testing, some foundational principles were needed.
 
Scheduling Issues 
 
Because of cost, space and hardware limitations, establishing a dedicated stress testing data center wasn’t viable, which meant testing had to use existing facilities. Because of its scope and duration, a stress test couldn’t run concurrently with normal systems, which created significant scheduling issues.
 
Each test started with a full production system shutdown, followed by a full stress test system startup. The entire process had to be reversed when a test was over. If a test ran long, it had to be terminated in time to get the production system up as scheduled.
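The deadline pressure described above comes down to simple arithmetic: work backward from the time production is due and subtract everything that has to happen after the test stops. The following is a minimal sketch of that calculation. The function name, the date and every duration are hypothetical values chosen for illustration, not figures from the tests described here.

```python
from datetime import datetime, timedelta


def latest_test_cutoff(production_due, test_shutdown, restore,
                       production_startup,
                       safety_margin=timedelta(0)):
    """Return the latest moment the stress test may still be running.

    Everything that must happen after the test stops (shutting down the
    test system, restoring disks, starting production) is subtracted
    from the time production is due back online.
    """
    wind_down = test_shutdown + restore + production_startup + safety_margin
    return production_due - wind_down


# Hypothetical example: production is due at 6:00 a.m. after a long weekend.
cutoff = latest_test_cutoff(
    production_due=datetime(1976, 7, 6, 6, 0),
    test_shutdown=timedelta(minutes=30),
    restore=timedelta(hours=3),
    production_startup=timedelta(hours=1),
    safety_margin=timedelta(minutes=30),
)
print(cutoff)  # 1976-07-06 01:00:00
```

With these assumed durations, five hours of wind-down work means the test has to stop by 1:00 a.m., no matter how promising the run looks.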
 
I also came to dread three-day holiday weekends, because these lengthier-than-normal, non-business time periods were the only option for extensive tests. Sometimes we had to break a stress test into parts, meaning multiple weekends.
 
Smaller volume tests could sometimes run overnight, when online systems were down. That meant getting home when most people were getting up, which was even worse if there were daytime meetings or other activities to attend.
 
Stress Test Justification 
 
Stress testing was a priority with this client because of the great difficulty they had with CICS upgrades to new releases, which brought both availability and performance issues. This was the mid-1970s, when CICS/VS V1.3 was leading-edge technology. CICS upgrades exposed defects in both the product and user-written Assembler programs, which caused many system and user outages.
 
Adding desperately needed function to a relatively immature product had a deleterious impact on both CICS and associated application programs, resulting in serious performance degradation, which in turn increased response time and significantly decreased employee productivity. Further compounding matters, this degradation could deplete CICS of virtual storage and create system stresses that almost always led to system failure and an outage.
 
Despite all the issues, CICS was far and away this client’s best option for real-time, interactive processing, which in turn was a key component in their business strategy to increase market share.

Initial Stress Testing

Given these constraints and considerations, our stress test objectives were:

  • Identify and resolve errors prior to moving new system software (OS, file/database, transaction manager, etc.) to production
  • Collect system and transaction performance data, including response time, processor and disk utilization, and other metrics
  • Establish performance objectives consistent with end-user agreements
  • Tune the system during stress tests before moving to production
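The second and third objectives fit together: metrics collected during a test only help if they are compared against agreed targets. Here is a minimal sketch of such a comparison. The metric names, limits and measured values are all hypothetical, and nothing here reflects the monitors or tools actually used in these tests.

```python
from dataclasses import dataclass


@dataclass
class Objective:
    metric: str                     # hypothetical metric name
    limit: float                    # agreed target value
    higher_is_better: bool = False  # True for throughput-style metrics


def evaluate(objectives, measured):
    """Return the names of metrics that miss their objective.

    A metric that was never measured is reported as a miss, because an
    unverified objective cannot be signed off.
    """
    misses = []
    for obj in objectives:
        value = measured.get(obj.metric)
        if value is None:
            misses.append(obj.metric)
            continue
        if obj.higher_is_better:
            ok = value >= obj.limit
        else:
            ok = value <= obj.limit
        if not ok:
            misses.append(obj.metric)
    return misses


# Hypothetical objectives and one hypothetical test run.
objectives = [
    Objective("response_time_sec", 2.0),
    Objective("cpu_util_pct", 85.0),
    Objective("transactions_per_sec", 30.0, higher_is_better=True),
]
measured = {
    "response_time_sec": 1.4,
    "cpu_util_pct": 91.0,
    "transactions_per_sec": 32.0,
}
misses = evaluate(objectives, measured)
print(misses)  # ['cpu_util_pct']
```

A non-empty result is the signal to keep tuning before the software moves to production.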

My fellow systems programmers were almost as inexperienced as I, and just as eager. An FDP was not an official Program Product, so support and documentation were limited, but we completed the installation and tested on a very small scale using some sample transactions and test files. We discovered some bugs: a few were related to Volume Test, but many more were transaction-based. This pattern would continue with production testing and would really complicate things, but overall we felt ready for the real thing.
 
The initial stress tests we ran in a test system were primitive, but once we finished they provided substantial insights, including how to build a production stress test, as follows:

  1. Make backups of all system- and CICS-related disk drives on the day we capture transactions, just prior to online systems availability
  2. Capture several hours of production transactions right after CICS startup using the Volume Test utility
  3. Run Volume Test utilities to reformat captured transactions into the format needed for a stress test
  4. Start Volume Test to verify the newly captured transactions initialize correctly, then shut it down
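One way to picture the capture-and-reformat steps above is as turning timestamped input into a replay schedule. The sketch below is not Volume Test's actual record format or logic, which this article doesn't describe. The tuple layout, the `rate_factor` parameter and the transaction IDs are all assumptions made for illustration.

```python
def build_replay_schedule(captured, rate_factor=1.0):
    """Convert captured (timestamp_seconds, transaction_id) pairs into
    (offset_seconds, transaction_id) pairs relative to the first capture.

    rate_factor > 1 compresses the gaps between transactions (a heavier
    load); rate_factor < 1 stretches them (a lighter load).
    """
    if rate_factor <= 0:
        raise ValueError("rate_factor must be positive")
    if not captured:
        return []
    # Sort by timestamp only, so equal timestamps keep their captured order.
    ordered = sorted(captured, key=lambda record: record[0])
    start = ordered[0][0]
    return [((ts - start) / rate_factor, txn) for ts, txn in ordered]


# Hypothetical capture: three transactions starting at 8:00 a.m.
# (28,800 seconds after midnight), replayed at twice the captured rate.
captured = [(28800.0, "INQ1"), (28802.0, "UPD1"), (28806.0, "INQ2")]
schedule = build_replay_schedule(captured, rate_factor=2.0)
print(schedule)  # [(0.0, 'INQ1'), (1.0, 'UPD1'), (3.0, 'INQ2')]
```

Keeping the schedule relative to the first transaction is what lets the same captured workload be replayed on a different day, at a different pace.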

Then, on production stress test day, we continued with these steps:

  1. Make backups of all system- and CICS-related disk drives prior to online systems availability. These copies are used to re-establish the current production system when the test is over.
  2. Rather than a normal startup, re-initialize with the new software (CICS, Db2 and so forth)
  3. Restore the capture-day backups (Step 1 of the capture procedure above) to their assigned disk drives, because everything must be exactly the same as when transactions were captured. If not, synchronization errors will occur because of data inconsistencies.
  4. Start performance monitors and other tracking components
  5. Start Volume Test. Once the test is running, it can be paused or restarted, and various test-related functions can be performed. The tests I participated in usually involved three to five people monitoring not only the test itself, but also capacity and performance metrics, program or system errors, transaction failures, disk and processor utilization, queue lengths, etc. Constant communication between participants was vital, and the transaction rate was lowered, raised or even stopped while a parameter problem was resolved. IBM assisted with CICS fixes, and client staff resolved client defects.
  6. When complete, terminate Volume Test and CICS
  7. Quiesce all work on the processor, then restore the data backed up in Step 1
  8. Recycle entire system, bring up normal production
  9. Analyze results, fix problems, tune system, finalize production preparations 
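The test-day sequence above has one non-negotiable property: production must come back even when the test itself fails. The sketch below shows that pattern with `try`/`finally`. The step names mirror the list, but the function, the no-op actions and the simulated failure are hypothetical stand-ins, not the procedures or tooling we actually used.

```python
def run_test_window(steps, cleanup, log):
    """Run the test steps in order; always run the cleanup steps afterward.

    Returns True if every test step completed, False if one failed.
    The cleanup steps (restoring production) run in both cases.
    """
    try:
        for name, action in steps:
            log.append(f"start {name}")
            action()
            log.append(f"done {name}")
        return True
    except Exception as exc:
        log.append(f"failed: {exc}")
        return False
    finally:
        for name, action in cleanup:
            log.append(f"cleanup {name}")
            action()


def failing_replay():
    # Simulated failure standing in for a real test problem.
    raise RuntimeError("script initialization error")


log = []
ok = run_test_window(
    steps=[
        ("back up production volumes", lambda: None),
        ("initialize new software", lambda: None),
        ("restore capture-day volumes", lambda: None),
        ("run replay", failing_replay),
    ],
    cleanup=[
        ("terminate test system", lambda: None),
        ("restore production volumes", lambda: None),
        ("bring up production", lambda: None),
    ],
    log=log,
)
print(ok)       # False
print(log[-1])  # cleanup bring up production
```

Even though the replay step fails in this example, the log still ends with production being brought up, which is the behavior the real procedure had to guarantee by hand.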

Things Change 

Volume Test was quite limited, but it served as the entrée into improved stress testing, and it enabled us to do a very successful, almost error-free CICS upgrade. Within a couple of years, IBM produced a superior product called Teleprocessing Network Simulator (TPNS). TPNS addressed a gaping Volume Test problem, the lack of network measurements (response time, link utilization, etc.), by adding network interaction to a stress test. Additionally, it was a Program Product, with full support and ongoing enhancements.
 
TPNS embraced the network aspect of transaction processing by adding a component loaded into a network communication controller such as the 3705, or later, the 3725 controller. Significant measurements and test control parameters were added as well as enhanced scripts and utilities to produce those features from captured transactions. My client quickly switched to this product and used it extensively to validate not only CICS upgrades, but all software related to interactive, real-time processing.

The Stress Is Real

Stress tests don’t just stress online systems, workloads, websites or disaster recovery sites. They stress the people who run them. Stress tests often go awry because their purpose is to discover components that break, abort with errors or perform poorly, usually against fixed deadlines. Four times in my career, I’ve worked 48+ hour shifts, and three of them were stress tests. The toughest ones involved a problem with Volume Test or TPNS, because troubleshooting those products was hard. One involved a file type called Partitioned Data Sets (PDS), where TPNS stored its scripts. The scripts kept failing during TPNS initialization with an error that was both baffling and undocumented, and we tried over and over without success. We called the IBM System Center and IBM Support Center, while watching clock hands turn as precious minutes elapsed, and we lost the entire weekend. It turned out the PDS was losing key file information because of when its characteristics were updated. Go figure! I learned Murphy’s extended law during that stress test: “If anything can go wrong, it will, and if nothing can go wrong, something will.”
 
Yet the next weekend, armed with a solution, we had a series of good runs, gave the whole system a good shakedown, verified results and essentially replaced all system software related to online processing. The performance profile we created showed we should be able to run 30 transactions per second, and an abnormal peak (probably catch-up work) validated the numbers. The stress testing infrastructure that had been put in place years earlier more than paid for itself, ushering in a highly available, lightning-fast online system running nine parallel copies of CICS TS.