When the Phone Rings in the Middle of the Night
In IT support, mistakes or slow response can be very costly. But resolving these situations can be rewarding and produce opportunities, self-confidence and job satisfaction.
By Jim Schesvold
08/16/2019
A consulting systems engineer (Don) and I combined skills to create this product while maintaining our daily client responsibilities. One of our customers badly wanted this offering and provided its impetus, giving Don justification to procure computer time at the Minneapolis IBM Regional Office. We were converting an IMS product to CICS; Don was project manager and technical advisor, while I wrote PL/1 code, created screens via CICS Basic Mapping Support (BMS), and coded CICS/VS macros, all after normal hours. Another client graciously provided computer facilities with the necessary products and punched card equipment.
We could only use the Minneapolis computer from Friday evening through Sunday, and we had a three-month window to complete the work, so time was precious. The disks were removable platters that had to be swapped, all punched card programming objects had to be read in, and systems software like CICS/VS and compilers had to be brought up. The easiest way to configure the system was to perform an initial program load (IPL), also known as a boot, respond to prompts, and maybe override an error message or two, all of which made startup a somewhat lengthy process.
We were on our own; there were no onsite operators, no systems programmers, and no customer engineers, which left the two of us to deal with any problem. Don was something of a genius, which significantly mitigated the lack of support, and I contributed in such devious ways as cracking the lock to the circuit-breaker room when a breaker needed to be reset. Ingenuity saved us, and after a couple of weekends of compiling programs, installing programs like CICS/VS and IMS, assembling BMS maps, and testing hardware connections to color TVs via an IBM Series/1, we were ready.
We finally had a running deliverable, a great accomplishment, but program bugs had to be fixed, screens corrected, usability improved, and thorough testing completed. The product release date was looming, and I put in my first 48-plus-hour shift. The product was successfully completed by Sunday night. It filled a niche some key clients desperately wanted and paved the way for the new discipline of business graphics, which revolutionized planning, projection, and forecasting through visual, colorful illustration tools that could quickly generate charts, graphs, and displays.
Long, Unpredictable and Pressure-Packed Hours
That two-plus-day effort was one of three I put in over the years, and each one was a high-pressure situation, demanding determination, concentration, creativity and a fierce desire to succeed. The reason there weren’t more is that body and mind have limits and the circumstances were extraordinary. But IT professionals invariably have stretches when the hours are long or the situation is unpredictable: over holidays, at weird times, on split shifts or in any mixture that can be dreamed up. My second 48-plus-hour shift was one of those, and the most mind-numbing of the trio.
One of my clients performed stress tests prior to major software or hardware changes, a proven technique for finding and resolving problems and defects an upgrade might introduce. Stress test preparation was daunting, starting with collection and storage of a day’s worth of end user terminal input. This captured input was processed by the IBM Teleprocessing Network Simulator (TPNS) preprocessor, which produced TPNS scripts that emulated actual network traffic. Additionally, all disk drives were backed up, and TPNS startup and shutdown were tested, along with dry runs and other tasks like fallback.
Stress test preparation was more work than the stress test itself, and because of the preparatory effort, we had to shut down all other work. Data capture went smoothly and an early shutdown was performed, so we were good to go. Generating TPNS scripts was the highest priority, so once the computer was recycled, the job was submitted, and the disaster began. A flood of syntax errors was generated, and we set to work. Thankfully, some errors were repetitive, and once one was fixed, the fix could be propagated to resolve many others. But some weren’t so simple and took a lot of effort, taking us through Saturday and well into Sunday.
Working through those irascible errors was mentally exhausting, but it got worse: when we tried to store the scripts, there was insufficient space. Changing space allocations didn’t help, and after several fruitless hours, we eliminated that possibility. The target was a Partitioned Data Set, which is an esoteric, obscure format, so we enlisted help from a storage specialist, who was equally baffled. The specialist finally figured it out after numerous “shots in the dark,” and we barely made Monday morning startup. The thing I remember most was how my mind went numb at around 40 hours; it was so hard to think.
Emergencies Are the Worst
By far, most on-call situations occur when something breaks or fails, and oftentimes I’ve been sound asleep when the phone rings. My third 48-plus-hour shift occurred that way, working with the same client, again over a weekend. In this case, however, the failure’s timing was fortuitous, because it involved CICS online business systems, and the Friday night failure occurred during normal system shutdown, giving us the whole weekend to work on the problem without impacting online system availability. I didn’t get involved until early Saturday morning.
A CICS failure is not usually difficult to recover from, because CICS logs all record changes that occur during online processing; in this case, those changes were written to magnetic tape. CICS can back out “in flight” tasks (transactions that have started but not completed), restoring records to their original state; after a CICS crash, this process is called Emergency Restart. But this time, the log tape was improperly closed and unusable. Backout failed, and that left many customer accounts incorrect. It was disastrous, because any manual method of recovery was prohibitively difficult or impossible.
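The backout idea described above can be illustrated with a minimal sketch. This is not how CICS is implemented; it is a simplified model that assumes each logged update carries a "before image" of the record, and every name in it (`LogRecord`, `back_out_in_flight`, the sample accounts) is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogRecord:
    """One entry in a simplified, hypothetical change log."""
    task_id: str
    kind: str                      # "update" or "commit"
    key: Optional[str] = None      # record that was changed
    before: Optional[str] = None   # value before the change; None means it did not exist


def back_out_in_flight(records, log):
    """Undo updates made by tasks that never logged a commit."""
    committed = {entry.task_id for entry in log if entry.kind == "commit"}
    # Walk the log backward so the oldest before-image is applied last.
    for entry in reversed(log):
        if entry.kind != "update" or entry.task_id in committed:
            continue
        if entry.before is None:
            records.pop(entry.key, None)
        else:
            records[entry.key] = entry.before
    return records


# State of the files at the moment of the failure.
accounts = {"A": "150", "B": "75"}
log = [
    LogRecord("T1", "update", "A", before="100"),
    LogRecord("T1", "commit"),
    LogRecord("T2", "update", "B", before="50"),   # T2 never committed
    LogRecord("T2", "update", "B", before="60"),
]
restored = back_out_in_flight(accounts, log)
```

In this model, the committed task's change to account A survives and the in-flight task's changes to account B are undone. The sketch also shows why an unreadable log is so damaging: without the before images, there is nothing to apply.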
After a cursory evaluation and error information collection, my client contacted the IBM Support Center, which triggered a call for me to get onsite ASAP. I was only minutes away, so I was onsite quickly and we began poring over error documentation, trying to determine the cause of the failure. It was very time consuming and mostly fruitless, and we bounced ideas off each other looking for inspiration. Just like the stress testing scenario, it was intuitive trial and error.
We tried shutdown after shutdown, tracking them via CICS shutdown messages. Finally, early Monday morning, as we faced the terrifying idea of extended unavailability, a Support Center guru noticed a tape drive error at shutdown. That led to the discovery that the tape was never properly closed, which distorted its end-of-file information, a one-in-a-million possibility. It was possible to correct this information manually via a utility program that updated the end of the tape. CICS Emergency Restart then restored all online files to a consistent state.
On-Call Situations Come With the Territory
On-call situations, with their unpredictable work shifts, are standard operating procedure for IT professionals. They’re mostly infrequent, but sometimes commonplace. The most frequent overnight calls my staff took were from a large paper mill conglomerate we supported. Late night calls were normal, although the odds lessened on weekends. The operators were volatile, and our job was as much babysitting as troubleshooting, but we took every call. An outage might shut a mill down, costing millions of dollars, so even frivolous calls were handled.
My three 48-plus-hour shifts are extreme examples of on-call situations, but in IT support, expect some tension-loaded situations. IT systems are often 24-7, which is why my team subscribed to Murphy’s Law: “If something can go wrong, it will, and if nothing can go wrong, it will.” Things go wrong any time of any day, usually at the worst time. Problems and outages are usually expensive, and an ominous aspect of IT support is that mistakes or slow response can be very costly; the consequences for the technician can be severe. Conversely, resolving these situations can be rewarding and produce opportunities, self-confidence and job satisfaction.
Jim Schesvold can be reached at firstname.lastname@example.org.