How Core Contained Checkstops Increase Reliability of the Power Systems Platform
The feature makes servers more reliable and resilient in the face of increasingly complex CPU designs.
By Dave Stanton
01/02/2020
Q: What’s a core contained checkstop?
IBM Power Systems servers rarely have outages, but when they do, a common question IBM support receives from our clients is “Why did this happen?”
The core checkstop failure containment designed into the Power Systems platform contributes to making our clients’ overall IT infrastructure more reliable and resilient in the face of today’s increasingly complex CPU designs. Before delving into details around the core checkstop support on the IBM Power System E950 and E980 servers, I’d like to explain how a server can fail and how the platform is designed to contain those failures.
The functionality a modern server provides is a combination of hardware and firmware. In an ideal world, we would all like it to be 100% reliable. However, over time, failures are inevitable. This is where failure containment becomes increasingly important: it limits how much of a server is affected when any single component fails. Core contained checkstops are just one of the failure containment mechanisms built into Power Systems that can help isolate the failure of a CPU core from the rest of the system and reduce the scope of the outage. Read on for more information on how a core checkstop works.
What happens when a core checkstop occurs?
A CPU core checkstop occurs when internal logic checking built into the CPU core detects that something internal to that logic unit is no longer behaving properly. When that occurs, the CPU core stops running CPU instructions and notifies the rest of the system that something went wrong. The POWER hypervisor receives that notification and then examines the system to locate the CPU core that encountered a problem.
The POWER hypervisor then determines what type of code was executing on the threads of the core when it failed. If all of the threads were in a power saving state or executing OS or application code, the failure can be contained to just the partition that was running on the CPU core, and that partition is halted. If spare or unlicensed cores are present in the system, or if the minimum CPU capacity of the partition configuration allows the current CPU capacity to be reduced, the partition is restarted to get the client workload running again. The checkstopped core is then isolated from the rest of the system, and an error log is generated calling for replacement of the processor chip.
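The restart condition described above can be illustrated with a small model. This is only a sketch of the rule as stated in this article, not hypervisor code: the Partition type, its field names, and the can_restart_after_core_loss function are hypothetical names chosen for illustration, and the model assumes exactly one core is lost.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    """Hypothetical model of a partition's CPU configuration."""
    name: str
    current_cores: int   # cores assigned before the checkstop
    minimum_cores: int   # minimum CPU capacity in the partition configuration


def can_restart_after_core_loss(partition: Partition,
                                spare_or_unlicensed_cores: int) -> bool:
    """Return True if the halted partition can be restarted after losing one core.

    Mirrors the two conditions in the text: either a spare or unlicensed
    core is available to take the place of the failed core, or the
    partition's minimum capacity allows it to run with one core fewer.
    """
    if spare_or_unlicensed_cores > 0:
        return True
    return partition.current_cores - 1 >= partition.minimum_cores
```

For example, a partition with four cores and a minimum of two can be restarted on three cores even with no spares, while a partition already running at its minimum needs a spare or unlicensed core.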
Can all core checkstops be isolated to keep the server running?
It’s important to note that containing the failure of a checkstopped core to just the partition that was running on the core isn’t possible in all cases. If the core was actively running hypervisor code on any thread, then recovery isn’t possible. In this scenario, the server will automatically reboot with the failed CPU core removed from the configuration and restart any partitions that have been marked as auto start.
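The containment decision can be summarized as a simple classification over what each thread of the failed core was doing. The following is a minimal sketch of that rule as described in this article. The ThreadState and Outcome enumerations and the classify_checkstop function are hypothetical names for illustration and do not correspond to any real PowerVM interface.

```python
from enum import Enum
from typing import Iterable


class ThreadState(Enum):
    """What a hardware thread was doing when the core checkstopped."""
    POWER_SAVING = "power_saving"
    OS = "os"
    APPLICATION = "application"
    HYPERVISOR = "hypervisor"


class Outcome(Enum):
    CONTAINED_TO_PARTITION = "contained_to_partition"
    SERVER_REBOOT = "server_reboot"


def classify_checkstop(thread_states: Iterable[ThreadState]) -> Outcome:
    """Classify a core checkstop from the states of the core's threads.

    If any thread was running hypervisor code, the failure cannot be
    contained and the server reboots without the failed core. Otherwise
    the failure is contained to the partition that was on the core.
    """
    if any(state is ThreadState.HYPERVISOR for state in thread_states):
        return Outcome.SERVER_REBOOT
    return Outcome.CONTAINED_TO_PARTITION
```

A single thread in hypervisor code is enough to force the reboot path, which is why the check uses any() rather than requiring every thread to be in that state.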
How do core contained checkstops reduce server outages?
Core contained checkstops are just one of many features that contribute to the industry-leading reliability of the Power Systems platform. Many problems that can develop within a CPU core are automatically handled by recovery logic with no disruption to the server. If those problems continue to occur over time, the CPU core is predictively deconfigured long before it encounters a problem that cannot be recovered from. Core contained checkstops, while not 100% guaranteed to be recoverable, add a final line of defense against CPU core failures causing a server outage and further increase reliability.
Dave Stanton is a senior technical staff member in PowerVM Architecture and Development.