Hardware Testing: Testing the Tested


I’m not an expert on computer hardware; my specialty is systems and application software, IT management, and capacity planning. When I was with IBM, a different division, Field Engineering, dealt with hardware. Still, I had to have a working knowledge of it. I remember big-boxed machines with a fraction of a PC’s processor speed giving off so much heat that they required water cooling. Running the bulky copper cables that connected peripheral devices (e.g., disk, tape, communications and storage controllers), which is the reason computer rooms had raised floors, was a fine art. Hardware skills were IBM’s forte.

There’s no doubt that today’s technology surpasses that of the water-cooled days in reliability, quality and installability. But one thing remains: The experts in hardware installation and testing are still the vendors, and the client is the key player because it’s the client’s business processes that use and depend on the hardware. It doesn’t matter whether the platform is a z/OS mainframe, a Windows PC or Mac, UNIX/AIX Power systems or other technology. When hardware is installed, the vendor has already tested it extensively. But things break, and it’s the client who drives resolution, makes the decisions, augments onsite testing and ensures success.

Client Involvement in Hardware Testing

There are numerous tasks a client performs in hardware testing. Activities include:

  • Performing integration testing, which varies with component and number of devices. When it’s a new processor, every piece of hardware is involved. A full IPL is performed, all subsystems are brought up, and the system is exercised to verify that all systems work. For a storage controller, console Display commands are issued to verify attached disk and/or tape drives are visible. A test job may be run against data on some drives. For a disk or tape drive, a test file is allocated and populated, then the data is accessed and processed. In general, a test job or process is run against new or repaired devices to verify functionality.
  • Scheduling an outage
  • Coordinating with vendors to determine problem specifics
  • Running jobs to collect diagnostic information and determine whether the problem is fixed
  • Driving the problem when there’s finger-pointing or delays
  • Documenting and managing the situation
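
The disk-drive check described in the first item above (allocate and populate a test file, then access the data) can be sketched in a few lines. This is only a minimal illustration in Python, assuming the new or repaired device is reachable as a mounted directory. The function name verify_device_path, the file name hwtest.tmp and the sizes are hypothetical, and a read served from the operating system’s cache won’t necessarily exercise the physical device.

```python
import hashlib
import os
import tempfile


def verify_device_path(directory, size_bytes=1024 * 1024, chunk=64 * 1024):
    """Write a test file under `directory`, read it back and compare digests.

    `directory` is assumed to live on the device being tested.
    """
    path = os.path.join(directory, "hwtest.tmp")  # hypothetical file name
    written = hashlib.sha256()
    try:
        # Allocate and populate the test file with random data.
        with open(path, "wb") as f:
            remaining = size_bytes
            while remaining > 0:
                block = os.urandom(min(chunk, remaining))
                f.write(block)
                written.update(block)
                remaining -= len(block)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to the device
        # Access the data again and compute a digest of what comes back.
        read_back = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk), b""):
                read_back.update(block)
        return written.hexdigest() == read_back.hexdigest()
    finally:
        # Always clean up the test file, even if an I/O error was raised.
        if os.path.exists(path):
            os.remove(path)


if __name__ == "__main__":
    # A temporary directory stands in for the device under test.
    with tempfile.TemporaryDirectory() as d:
        print("data verified" if verify_device_path(d) else "data mismatch")
```

An I/O error raised during the write or read is itself a useful result; it should be recorded along with the device and time, not suppressed.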

Hardware Error Manifestation

Initial hardware problem isolation determines where an error occurs. Hardware errors can reveal themselves ambiguously via an abnormal termination, which can also result from system software or application program bugs, garbage data, etc. Only deeper investigation will identify the cause. Manifestations that could indicate hardware errors include:

  • System or application abends
  • System, subsystem or application hangs. A hang is a program or job that stops executing. Record enqueues or deadlocks are one cause, but an I/O request that never completes could be due to a hardware error. Other causes also exist.
  • System, subsystem or application loops, where code is executed over and over without completing. Loops are usually due to bad program logic, but a hardware error can also cause them.
  • I/O errors, which are usually but not always due to hardware errors. The error isn’t necessarily on a disk drive; it can also be in a storage controller, cable connection, bus or another component in the data path.
  • System wait states, which are similar to hangs but are usually due to a process not completing. Reasons may be that devices are disabled, inaccessible or inoperable, or have a microcode or logic defect.
  • Storage overlays, which usually occur when a program loses addressability, although a hardware error can also cause them; hardware addresses stored in control blocks can be corrupted. These errors are extraordinarily difficult to diagnose.
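
The difference between an I/O request that fails and one that never completes can be illustrated with a simple deadline. The following is a rough sketch in Python, not a platform facility: the function name run_with_deadline and the result labels are hypothetical, and the timeout is an assumption that would have to be tuned per device. It only separates the symptoms; it can’t say whether hardware, software or data is the cause.

```python
import concurrent.futures
import time


def run_with_deadline(operation, timeout_seconds):
    """Run `operation` in a worker thread and classify the outcome.

    Returns "completed", "failed" (an I/O error was raised) or
    "hung" (no result arrived before the deadline).
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(operation)
    try:
        future.result(timeout=timeout_seconds)
        return "completed"
    except concurrent.futures.TimeoutError:
        # The request is still outstanding: the symptom of a hang.
        return "hung"
    except OSError:
        # The request came back with an error: the symptom of an I/O error.
        return "failed"
    finally:
        # Don't block on a request that may never finish.
        executor.shutdown(wait=False)


if __name__ == "__main__":
    # time.sleep stands in for a read that is slow to return.
    print(run_with_deadline(lambda: time.sleep(0.5), timeout_seconds=0.05))
    print(run_with_deadline(lambda: None, timeout_seconds=1.0))
```

Note that the worker thread isn’t cancelled when the deadline passes; the sketch reports the symptom and leaves the outstanding request alone.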

Hardware Diagnostics to Identify the Problem

Something that wasn’t available in my early career but is now ubiquitous on mainframes, PCs and midrange systems is a plethora of diagnostic and directional tools that help identify problems and provide guidance on resolution. Functions exist today on all platforms to detect a hardware problem as it’s developing (e.g., an increasing frequency of temporary errors), and to provide notification so the defect can be fixed before a failure occurs. Helpful diagnostics include:

  • System, application, slip trap, standalone and a variety of other dumps, which provide a lot of information, such as:
    • The instruction being executed when the error occurred
    • The contents of different main storage areas being referenced by the instruction
    • Hardware addresses of peripheral devices (channel and device address)
    • Contents of other storage relevant to the error
    A dump displays this information in hexadecimal or other formats, and “programmed” dumps such as slip traps can zero in on specific, relevant information. Ever see all that gobbledygook on the blue screen when your laptop fails? To a trained person it’s pure gold.
  • Operator console Display or Modify commands showing status or other relevant information regarding attached devices for hangs or loops where the system isn’t down. Messages issued from programs or processes can also be extremely useful.
  • Traces, a facility where system software logic is used to record the sequence of events as a program or process executes. This entails a lot of overhead but when debugging an error, traces show the sequence of events that precede the hardware error, which is immensely valuable.
  • Custom-made devices designed to test connectivity, electrical flow, transistor impedance, magnetic strength, bearing resistance and many other hardware characteristics, as well as to read and copy disk contents or to recover data from tape, CD, disk, etc., despite scratches or tears.
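
The idea of catching a developing problem through an increasing frequency of temporary errors can be shown with a sliding-window counter. This is a simplified sketch in Python, not how any particular platform implements it. The class name TemporaryErrorMonitor, the one-hour window and the threshold of five are all assumptions made for illustration.

```python
from collections import deque


class TemporaryErrorMonitor:
    """Flag a device when temporary errors within a time window reach a threshold."""

    def __init__(self, window_seconds=3600.0, threshold=5):
        self.window_seconds = window_seconds  # assumed window length
        self.threshold = threshold            # assumed notification threshold
        self._timestamps = deque()

    def record(self, timestamp):
        """Record one temporary error; return True if notification is warranted."""
        self._timestamps.append(timestamp)
        # Discard errors that have aged out of the window.
        cutoff = timestamp - self.window_seconds
        while self._timestamps and self._timestamps[0] < cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold


if __name__ == "__main__":
    monitor = TemporaryErrorMonitor(window_seconds=60, threshold=3)
    # Timestamps are in seconds; the errors arrive closer together over time.
    for t in (0, 100, 110, 120):
        print(t, monitor.record(t))
```

In the usage example, the isolated error at 0 seconds ages out of the window, and only the third error in quick succession (at 120 seconds) triggers the notification.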

Hardware Components That May Fail

The more common hardware components involved in a typical IT enterprise include:

  • Processor chips and microcode
  • Memory
  • Power supply and surge suppressors
  • Coupling Facility (Sysplex)
  • Disk, Thumb Drives
  • Tape, CD and DVD Drives
  • Channels
  • Storage Controllers
  • Communication and Network Controllers, Routers
  • Fan/Coolant System
  • Consoles
  • Network Firewalls, Switches
  • Cabling

Working Together

A processing complex is such a blend of hardware, system software and application software that it’s essentially impossible to determine where the role of one ends and another starts. It really doesn’t matter, because the IT department has to be involved in every aspect of testing in its computer complex. Vendors may have limited or no involvement in applications, but they have primary involvement in systems software, and even more of the responsibility in hardware validation. In all cases, it’s IT that must coordinate, manage and operate the system. IT staff must be conversant in all aspects of the operation and must make the operative decisions, and the ultimate responsibility lies with them.

Jim Schesvold can be reached at jschesvold@mainframehelp.com.


