Your Data Center Can Make or Break AI Efforts
IBMer Ed Batewell explains why infrastructure is critical to AI success.
By Jim Utsler
07/11/2019
To the relief of organizations in many different industries, artificial intelligence (AI) now has more than a simple toehold in typical business operations. To some, it’s quickly becoming an indispensable and inextricable part of everyday processes, and they’re now actively building models that will create insight where none existed before.
As they’ve learned, however, pursuing the AI goal requires more than data, data scientists, machine learning and useable inferences. Indeed, it also demands changes—subtle or not—to their data centers, including servers, storage and networking. Without these necessary alterations, they may end up consuming more floor space, energy and system-maintenance time than what should be required.
“Any newly introduced technology is inevitably going to have to co-exist with whatever you already have on the floor,” says Ed Batewell, IBM executive client technical architect. “As a result, it’s important for IT decision makers to validate that the technology that’s being deployed supports what they already have and can be integrated to add value to their existing investment in IT infrastructure.”
Saving Money and Floor Space
AI in particular is likely going to necessitate some data center rethinking. For example, it’s crucial that older in-house data centers are evaluated or updated to meet the various power demands of new AI technology.
One way to help address this is to find a scalable AI-specific system, such as the IBM Power Systems* AC922. It pairs POWER9* CPUs and NVIDIA Tesla V100 with NVLink GPUs, and includes a number of next-generation I/O architectures, including PCIe Gen4, CAPI 2.0, OpenCAPI and NVLink. These interconnects provide 2x to 5.6x more bandwidth for data-intensive workloads than the PCIe Gen3 bus found in x86 servers.
“We have fatter plumbing between our CPUs and GPUs and fatter pipes to memory, which helps reduce bottlenecks. This technology, called NVLink, allows the system to better work in concert, as a coherent whole,” Batewell notes. “Having NVLink between CPU and GPU is unique to the Power* architecture, which is designed specifically for the AI era of computing. It enables much larger data sets in working memory and can significantly reduce training times for large deep learning models, enabling training times that take only hours rather than weeks. A faster training rate allows faster iterations, to deploy more accurate models that are trained and retrained on more recent data, and ultimately better results.” (See Figure 1 to learn more about the AI training process.)
The POWER9 CPUs found in the AC922 support up to 5.6x more I/O and 2x more threads than their x86 contemporaries. And within the AC922 server, POWER9 CPU configurations range from 16 to 44 cores.
This type of core configuration offers many benefits in addition to those directly related to AI. For example, it allows users to scale more easily within a defined footprint, thereby reducing the amount of required data center space. By contrast, x86 AI systems often require more servers to do the same amount of work, which increases floor space, energy consumption and acquisition costs.
“You have a much higher density of energy per server when you pull in accelerators like GPUs, which makes data center efficiency a high priority,” Batewell says. “Ideally, you’re able to support more wattage per floor tile and, subsequently, more cooling per data center square foot. So having higher-performance nodes in the same footprint consumes less floor tile space and energy, and reduces the amount of money you’re spending on cooling. These factors might be easily overlooked, but they add up, especially at scale. An intelligent, thoughtful approach to reducing infrastructure costs can make all the difference.”
Speed Is of the Essence
Data storage is another crucial aspect to be considered when it comes to adopting AI, as is networking. Part of this—as with server density—involves scalability. AI, after all, requires massive amounts of data if it’s going to be effective, and traditional network-attached storage architectures may not be up to the task of supporting it.
The goal, then, is to dramatically reduce the complexity and time required to plan growth and broader adoption, while also building storage systems that integrate into existing organizational infrastructures. For example, real-time AI data interaction may require flash storage while typical workloads can remain on disk, resulting in the need for hybrid storage.
“You want to have the flexibility baked into the data center environment that accounts for your storage strategy and the actual data architecture from an availability perspective,” Batewell remarks. “AI is dependent on data, so it starts with a sound, flexible data architecture.”
Speed is of the essence in nearly every AI application, and a system such as IBM Elastic Storage Server (ESS), with IBM Spectrum* Scale, can more than meet required data objectives. A high-performance software-defined storage system running on Power Systems can scale out to handle petabytes or exabytes of data with I/O-intensive storage servers.
It supports 10GbE, 40GbE or 100GbE ethernet and Enhanced Data Rate (EDR) or Fourteen Data Rate (FDR) InfiniBand. Sustained streaming performance of data can reach 40 GB/s in each building block and grow as more blocks are added to a configuration, creating a highly scalable storage environment. And by using ESS to consolidate storage requirements across an organization, users can reduce inefficiencies, lower acquisition costs and support demanding AI workloads.
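As a rough illustration of the building-block scaling described above, the sketch below estimates aggregate streaming throughput and the time needed to stream a data set once. It assumes throughput grows linearly with the number of blocks at the 40 GB/s figure cited; linear scaling, the function names and the 1 PB example size are assumptions made for illustration, not measured ESS behavior.

```python
GB_PER_PB = 1_000_000  # decimal units: 1 PB = 1,000,000 GB


def aggregate_throughput_gb_s(blocks, per_block_gb_s=40.0):
    """Estimated sustained throughput, assuming linear scaling per block."""
    if blocks < 1:
        raise ValueError("at least one building block is required")
    return blocks * per_block_gb_s


def hours_to_stream(dataset_pb, blocks, per_block_gb_s=40.0):
    """Hours to stream the whole data set once at the estimated throughput."""
    seconds = dataset_pb * GB_PER_PB / aggregate_throughput_gb_s(
        blocks, per_block_gb_s
    )
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical 1 PB data set, streamed from 1, 2 and 4 building blocks.
    for blocks in (1, 2, 4):
        print(blocks, aggregate_throughput_gb_s(blocks),
              round(hours_to_stream(1, blocks), 2))
```

Real configurations rarely scale perfectly, so an estimate like this is best treated as an upper bound when planning capacity.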
The requirements for network-subsystem architectures, vendors and components depend upon organizational preferences and skills. It’s key to note, however, that InfiniBand and high-speed ethernet are necessities, as is a network topology that allows both server-to-storage and server-to-server traffic. Adopting a topology that extends to an InfiniBand island structure allows the AI training environment to scale for large clusters.
High bandwidth and low latency between storage and compute nodes are vital, and sufficient bandwidth between the nodes has to be considered for the data intake and transformation phase of the workflow. Performance is critical when it comes to training models to make sure sufficient data is delivered to the systems to keep the GPUs running at capacity. As a result, a high-speed network subsystem is needed for the training cluster.
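One way to reason about keeping GPUs running at capacity is a back-of-the-envelope estimate of the data rate a training cluster consumes, compared with what a network link can deliver. The sketch below is only that: the node count, GPU count, sample rate, sample size and the 80 percent link-efficiency factor are all hypothetical inputs chosen for illustration, not figures from the article or from any vendor.

```python
import math


def required_gb_s(nodes, gpus_per_node, samples_per_s_per_gpu, bytes_per_sample):
    """Data rate (GB/s) the cluster must receive to keep every GPU fed."""
    total_bytes_per_s = (
        nodes * gpus_per_node * samples_per_s_per_gpu * bytes_per_sample
    )
    return total_bytes_per_s / 1e9


def link_capacity_gb_s(link_gbit_s, efficiency=0.8):
    """Usable GB/s of one link; efficiency is an assumed overhead factor."""
    return link_gbit_s / 8 * efficiency


def links_needed(required, link_gbit_s, efficiency=0.8):
    """Minimum number of links of the given speed to carry the required rate."""
    return math.ceil(required / link_capacity_gb_s(link_gbit_s, efficiency))


if __name__ == "__main__":
    # Hypothetical cluster: 8 nodes x 4 GPUs, 2,000 samples/s per GPU,
    # 150 KB per sample.
    need = required_gb_s(8, 4, 2000, 150_000)
    for speed in (40, 100):  # link speeds in Gb/s
        print(speed, round(need, 2), links_needed(need, speed))
```

If the required rate exceeds what the storage network can sustain, the GPUs sit idle waiting for data, which is the situation a high-speed network subsystem is meant to avoid.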
The golden egg of AI is accurate and quick data interpretation and response. So when looking at AI, many organizations focus primarily on use cases without necessarily considering how AI is going to impact their data centers, which obviously shouldn’t be overlooked. Appropriate data center decisions are likely to result in less power consumption, easier systems management and AI-necessitated speed improvements.
“You have to look at the entire compute environment you’re choosing. The servers you’re choosing. The storage you’re choosing. The networking technologies you’re choosing. The orchestration tools you’re using, assigning jobs to the architecture that’s best fit for the purpose,” Batewell says. “The ‘machine’ is at the heart of ‘machine learning,’ so your infrastructure matters a great deal because it could be a limitation or it could be your salvation.”
Jim Utsler, IBM Systems magazine senior writer, has been writing for IBM since the mid-1990s.