IBM XL C/C++ Compiler Maximizes zEC12’s Transactional Execution Capabilities
Editor’s note: This is the first of two articles on the hardware transactional memory features of the IBM zEnterprise EC12.
As gains in single-thread performance continue to slow, hardware designs have shifted toward higher thread counts (e.g., simultaneous multithreading and multi-core). This shift in the hardware paradigm has brought significant renewed focus on software parallelism and concurrency.
More specifically, as hardware thread counts grow, scalability and fine-grained parallelism play an increasingly critical role in ensuring that software can exploit the new scale-out compute bandwidth. Several common practices in modern software design stand in the way.
For instance, modern software commonly guards a shared data structure with a single monolithic lock or mutex, a practice known as coarse-grained locking. When the critical section that accesses the shared data structure becomes contended, such a design typically fails to scale, because holding a lock across a lengthy critical section serializes access to the data structure. Such locking is usually overly pessimistic: threads serialize even though they typically operate on disjoint elements of the data structure.
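To make the coarse-grained pattern concrete, here is a minimal sketch in C using POSIX mutexes. The table layout and the names `coarse_table`, `coarse_add`, and `coarse_get` are hypothetical and chosen only for illustration. Every update takes the same lock, so two threads touching different buckets still wait on each other.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_BUCKETS 64

/* Hypothetical shared table: one mutex guards every bucket. */
typedef struct {
    pthread_mutex_t lock;
    long counts[NUM_BUCKETS];
} coarse_table;

void coarse_init(coarse_table *t) {
    assert(t != NULL);
    pthread_mutex_init(&t->lock, NULL);
    for (size_t i = 0; i < NUM_BUCKETS; i++) {
        t->counts[i] = 0;
    }
}

/* Adds delta to the bucket selected by key. The whole table is locked,
 * even though only one bucket is touched. */
void coarse_add(coarse_table *t, unsigned key, long delta) {
    assert(t != NULL);
    pthread_mutex_lock(&t->lock);
    t->counts[key % NUM_BUCKETS] += delta;
    pthread_mutex_unlock(&t->lock);
}

/* Reads one bucket, again under the single table-wide lock. */
long coarse_get(coarse_table *t, unsigned key) {
    assert(t != NULL);
    pthread_mutex_lock(&t->lock);
    long value = t->counts[key % NUM_BUCKETS];
    pthread_mutex_unlock(&t->lock);
    return value;
}
```

A thread calling `coarse_add(&t, 3, 5)` and another calling `coarse_add(&t, 4, 1)` never touch the same element, yet they serialize on `t->lock`.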
Fine-grained locking can be used instead, but it has drawbacks of its own. It usually requires non-trivial effort and can demand technical depth beyond that of most application developers. In addition, the performance improvement from fine-grained locking is bounded by the cost of the lock operation itself. Even when finer-grained locks are implemented correctly, the overhead of acquiring and holding a lock can quickly dominate the concurrent operation on the data structure, limiting how much of the software thread’s time goes to real work.
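A fine-grained variant might look like the following sketch, again in C with POSIX mutexes and hypothetical names (`fine_table`, `fine_add`, `fine_move`). It assumes one lock per bucket. The `fine_move` function hints at the extra effort involved: any operation that spans two buckets has to acquire both locks in a consistent order, or two threads moving in opposite directions can deadlock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_BUCKETS 64

/* Hypothetical bucket with its own lock. */
typedef struct {
    pthread_mutex_t lock;
    long count;
} bucket;

typedef struct {
    bucket buckets[NUM_BUCKETS];
} fine_table;

void fine_init(fine_table *t) {
    assert(t != NULL);
    for (size_t i = 0; i < NUM_BUCKETS; i++) {
        pthread_mutex_init(&t->buckets[i].lock, NULL);
        t->buckets[i].count = 0;
    }
}

/* Single-bucket update: only that bucket's lock is taken. */
void fine_add(fine_table *t, unsigned key, long delta) {
    bucket *b = &t->buckets[key % NUM_BUCKETS];
    pthread_mutex_lock(&b->lock);
    b->count += delta;
    pthread_mutex_unlock(&b->lock);
}

/* Two-bucket update: locks are always taken in ascending index order
 * so that concurrent moves in opposite directions cannot deadlock. */
void fine_move(fine_table *t, unsigned from, unsigned to, long amount) {
    size_t src = from % NUM_BUCKETS;
    size_t dst = to % NUM_BUCKETS;
    if (src == dst) {
        return; /* same bucket: the net change is zero */
    }
    size_t first = src < dst ? src : dst;
    size_t second = src < dst ? dst : src;
    pthread_mutex_lock(&t->buckets[first].lock);
    pthread_mutex_lock(&t->buckets[second].lock);
    t->buckets[src].count -= amount;
    t->buckets[dst].count += amount;
    pthread_mutex_unlock(&t->buckets[second].lock);
    pthread_mutex_unlock(&t->buckets[first].lock);
}

long fine_get(fine_table *t, unsigned key) {
    bucket *b = &t->buckets[key % NUM_BUCKETS];
    pthread_mutex_lock(&b->lock);
    long value = b->count;
    pthread_mutex_unlock(&b->lock);
    return value;
}
```

Note that even the simplest operation, `fine_add`, still pays for a lock and an unlock around a single addition, which is the overhead described above.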
Hardware Transactional Memory
IBM’s zEnterprise EC12 (zEC12) is the first general-purpose IBM server to incorporate transactional memory, a technology first used commercially to help make the IBM Blue Gene/Q-based Sequoia system at Lawrence Livermore National Laboratory the fastest supercomputer in the world. In zEC12, IBM adapted this technology to help software better support concurrent operations on a shared set of data, such as a financial institution processing transactions against the same set of accounts.
The zEC12’s Transactional Execution (TX) facility is an architectural framework that allows a block of code, called a transaction, to execute in an interlocked manner without locks. In this context, a transaction is a segment of code that appears to other CPUs to execute atomically: other processors in the system see either all or none of the storage updates made by the transaction.
Transactions are delimited by TBEGIN and TEND instructions. The hardware detects a storage conflict if another CPU updates storage used by the transaction, as shown in Figure 1. The conflict causes the transaction to abort, rolling back hardware state (general-purpose registers and storage) to what it was at the TBEGIN instruction. Program execution also resumes at the instruction immediately following the TBEGIN, with transactional execution now disabled. A transaction-failure condition code is set so that program flow can be diverted to a transaction failure handler. The handler can choose to retry the transaction or fall back to traditional coarse-grained locking to guarantee forward progress of the program.
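The retry-then-fall-back flow described above can be sketched in portable C. This is an illustration of the control flow only, not the hardware interface: `tx_begin` and `tx_end` are hypothetical stand-ins for TBEGIN and TEND that merely simulate aborts through the `simulated_aborts_left` counter, and they provide no atomicity or rollback. `MAX_RETRIES`, `fallback_held`, and `transfer` are likewise assumed names. The sketch also assumes one common design choice, which is that a transaction checks the fallback lock before updating anything, so that it does not run alongside a thread that has taken the locked path.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_RETRIES 3

/* HYPOTHETICAL stand-ins for TBEGIN/TEND. They only simulate outcomes
 * so the control flow can run anywhere; no atomicity is provided. */
int simulated_aborts_left = 0;

/* Returns true if the (simulated) transaction started, false if it
 * aborted and the failure handler should run. */
bool tx_begin(void) {
    if (simulated_aborts_left > 0) {
        simulated_aborts_left--;
        return false;
    }
    return true;
}

void tx_end(void) { /* real hardware would commit the stores here */ }

/* Coarse-grained fallback lock, implemented as a simple spinlock. */
static atomic_bool fallback_held = false;

static void fallback_acquire(void) {
    bool expected = false;
    while (!atomic_compare_exchange_weak(&fallback_held, &expected, true)) {
        expected = false; /* reset after a failed exchange and spin */
    }
}

static void fallback_release(void) {
    atomic_store(&fallback_held, false);
}

typedef enum { PATH_TRANSACTION, PATH_FALLBACK_LOCK } tx_path;

/* Moves amount between two accounts, reporting which path was used. */
tx_path transfer(long *from, long *to, long amount) {
    for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
        if (!tx_begin()) {
            continue; /* failure handler: retry the transaction */
        }
        if (atomic_load(&fallback_held)) {
            tx_end(); /* lock holder is active; end without updating */
            continue;
        }
        *from -= amount;
        *to += amount;
        tx_end();
        return PATH_TRANSACTION;
    }
    /* Retries exhausted: take the lock to guarantee forward progress. */
    fallback_acquire();
    *from -= amount;
    *to += amount;
    fallback_release();
    return PATH_FALLBACK_LOCK;
}
```

With `simulated_aborts_left` set to 0 the update completes on the first transactional attempt. With it set to `MAX_RETRIES`, every attempt fails and the update is performed under the fallback lock instead. On real hardware the stand-ins would be replaced by whatever transactional built-ins the compiler provides.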