
Extract Business Value From Data Lakes

Illustration by Daniel Hertzberg

Enterprises have access to an unprecedented volume and variety of data. Big data dominates the business headlines, if not the thinking of executives around the globe. Data alone isn’t enough, however. The strategic advantage lies not in the data itself but in the ability to draw actionable insights from it in a timely fashion.

That task was difficult enough back when the bulk of an organization’s data was maintained in the rows and columns of a relational database. Today, the data pool has broadened and deepened. It now includes unstructured data such as documents, texts, emails and social-media posts. The most effective tool for managing and learning from this data is a data lake, an updated approach to storing information that has the potential to enhance context and help create business value.

Whether a business realizes that potential depends upon how effectively it can execute a cognitive strategy. That, in turn, depends upon the effectiveness of its data lake. The IBM Power Systems* LC921 and LC922 servers are designed for data lake applications. They combine storage-rich hardware with fast I/O and a processor that’s purpose-built for machine learning and deep learning. Using this targeted platform, organizations can present customers with the right offer at the right time, develop new products and services more rapidly, and substantially drive business value.

What Is a Data Lake?

Relational databases store data in a specific structure designed to be queried in ways that highlight the relationships among fields. A retailer might use a relational database to monitor revenues on a store-by-store basis, or analyze sales figures from the previous year to determine what to order for the back-to-school season.
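As a minimal sketch of that structured approach, the example below builds a toy store_sales table and runs the kind of store-by-store revenue query described above. The schema, store IDs and figures are invented for illustration; a real retailer’s database would be far larger and more normalized.

```python
import sqlite3

# Toy relational table: every column has a fixed, declared type.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE store_sales (store_id TEXT, sale_date TEXT, revenue REAL)"
)
conn.executemany(
    "INSERT INTO store_sales VALUES (?, ?, ?)",
    [("NYC-01", "2018-08-15", 1250.00),
     ("NYC-01", "2018-08-16", 980.50),
     ("BOS-02", "2018-08-15", 640.25)],
)

# Store-by-store revenue: the kind of structured query relational
# databases are built for.
for store, total in conn.execute(
    "SELECT store_id, SUM(revenue) FROM store_sales GROUP BY store_id"
):
    print(store, total)
```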

With the rise of social media, that same company might want to analyze Instagram posts and Twitter hashtags to better understand its customers’ buying habits. The problem is that these types of unstructured data don’t fit into a standard relational database. Extracting specific elements of this data, such as Twitter hashtags, into relational columns may yield some information, but doing so substantially reduces the richness of the original format. Worse, it removes flexibility for future analysis, taking time and costing money while limiting potential future benefit. Concerns about these issues have led to the emergence of data lakes.

A data lake is a data repository that stores raw data of multiple types, side by side, each in its native format. Data lakes enable organizations to leverage synergies among different types of data. “The flexibility of data lakes and modern databases gives us the ability to augment traditional relational analytics with things like text analytics that can provide context and more meaning for the data,” says Linton Ward, Distinguished Engineer for OpenPOWER Solutions. “I can store data that I may not know the immediate use of, so I can do exploration in the future.”
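A minimal sketch of that “raw data, side by side, in its native format” idea follows. The directory layout and file contents are invented for illustration, and a local directory stands in for what would typically be HDFS or object storage in production.

```python
from pathlib import Path
import json

# A local directory stands in for the lake's storage layer.
lake = Path("data-lake/raw")
(lake / "sales").mkdir(parents=True, exist_ok=True)
(lake / "social").mkdir(parents=True, exist_ok=True)

# A structured CSV export lands next to an unstructured social post.
# Neither is cleansed or reshaped on the way in: each keeps its
# native format for whatever exploration comes later.
(lake / "sales" / "2018-08-15.csv").write_text(
    "store_id,revenue\nNYC-01,1250.00\n"
)
(lake / "social" / "tweet-001.json").write_text(json.dumps({
    "text": "Back-to-school haul! #backpacks #deals",
    "hashtags": ["backpacks", "deals"],
    "created_at": "2018-08-15T14:02:00Z",
}))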

Data lakes are different from data warehouses, which are databases designed for structured data. The data that goes into a data warehouse must be cleansed, formatted and managed. Data lakes ingest data in its raw form.
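The practical consequence is schema-on-read: structure is applied when the data is queried rather than when it is ingested, whereas a warehouse conforms records before loading them. The sketch below, continuing the toy lake above with an invented helper function, parses the raw JSON posts only at query time.

```python
import json
from pathlib import Path

# Schema-on-read over the toy lake built above: the raw JSON is
# interpreted only when a question is asked of it.
def hashtags_mentioning(lake_dir, term):
    for post in Path(lake_dir, "social").glob("*.json"):
        record = json.loads(post.read_text())       # raw, as ingested
        if term in record.get("text", "").lower():  # structure on demand
            yield from record.get("hashtags", [])

print(list(hashtags_mentioning("data-lake/raw", "school")))
```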

It’s important to note that one doesn’t replace the other. Rather, data lakes are increasingly used to enhance data warehouses. “The data lake is kind of that central place—I call it the data plain—that augments the traditional data warehouse,” says Ward. “It may also be augmented by specialized data stores, but it’s conceptually the central place of the data plain, serving the analytics that support the digital transformation of our clients.”

Infrastructure Matters

Because of their relative simplicity compared to data warehouses, data lakes have always been considered lower-cost solutions that can be implemented with commodity hardware infrastructure. In the era of big data, however, those assumptions no longer hold. Hadoop, the open-source distributed computing framework used to manage big data workloads, is growing increasingly sophisticated. Driven by the open-source community, including IBM and its partner Hortonworks, Hadoop has the ability to run processes in different data and execution environments, depending on the demands of the workload. Old-school one-size-fits-all data lakes are no longer appropriate for the task at hand. Servers and storage need to be flexible enough to be optimized for the job.

“All nodes in a data lake are not necessarily performing the same role, so being able to mix and match, and scale up portions of your data lake to meet the needs of your workload is a big benefit,” says Steve Roberts, offering manager for Big Data on IBM Power Systems. “You might use high-speed nodes for data ingest, for example. With real-time analytics becoming a larger and larger part of every use case, being able to choose where you run your machine learning, deep learning workloads is important.”
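One concrete way to steer work at particular node types is YARN node labels, which Spark can target when scheduling executors. The sketch below is illustrative only: it assumes a YARN cluster whose administrators have already defined a “gpu” label on ML-capable nodes, and the label name and HDFS path are invented for this example.

```python
from pyspark.sql import SparkSession

# Hedged sketch: place executors on a labeled subset of a mixed
# data lake cluster. Assumes YARN node labels are configured and
# that a "gpu" label exists on the deep learning nodes.
spark = (
    SparkSession.builder
    .appName("lake-ml-placement")
    .master("yarn")
    # Run executors only on nodes carrying the "gpu" label.
    .config("spark.yarn.executor.nodeLabelExpression", "gpu")
    .getOrCreate()
)

# Schema-on-read over raw JSON landed by high-speed ingest nodes.
posts = spark.read.json("hdfs:///data-lake/raw/social/")
posts.groupBy("created_at").count().show()
```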

A collection of x86 boxes cannot perform to the level needed, which increasingly presents problems for enterprises. “CXOs trying to figure out their cognitive data flows and strategies face a big challenge, which is commodity hardware. We are talking about a general piece of hardware,” says Dylan Boday, offering manager for cognitive infrastructure. “They are told to use it in every aspect of their cognitive journey, and it just doesn’t work.”

To support data lakes in the modern cognitive environment, IBM released the LC921 and LC922. Both feature the IBM POWER9* processor, an open hardware platform designed specifically for the cognitive computing era (see “By the Numbers”). In a side-by-side comparison with commodity servers, the Power Systems LC922 server delivered 2x the price-performance. Clients can configure the boxes with up to 40 TB or 120 TB of storage, respectively, and up to 2 TB of RAM.


In these applications, fast I/O rates are essential. Ward points to the example of a telecom client that imports more than 1 billion records daily into its data lake for correlation with analytics. To tackle high-volume I/O, the POWER9 servers are available with PCIe Gen 4 interconnects. Across 48 lanes, that adds up to 192 GB/s of duplex bandwidth. Clients who choose 25G links over 48 lanes can achieve 300 GB/s of duplex bandwidth.
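A back-of-the-envelope check shows where those figures come from, assuming PCIe Gen 4’s 16 GT/s per-lane signaling with 128b/130b encoding and a raw 25 Gbit/s per 25G link:

```python
# PCIe Gen 4 signals at 16 GT/s per lane; 128b/130b encoding leaves
# roughly 2 GB/s of usable bandwidth per lane, per direction.
pcie4_gb_per_lane = 16 * (128 / 130) / 8    # ~1.97 GB/s one way
lanes = 48
print(2 * lanes * pcie4_gb_per_lane)        # ~189 GB/s duplex (~192 quoted, using raw 2 GB/s)

# A 25G link carries 25 Gbit/s = 3.125 GB/s per lane, per direction.
link25_gb_per_lane = 25 / 8
print(2 * lanes * link25_gb_per_lane)       # 300 GB/s duplex
```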

The LC line also offers another key feature for cognitive computing: the Linux* OS, which is a requirement for running Hadoop. The release is part of IBM’s strategy of focusing on open-source efforts that particularly benefit from the attributes of the POWER9 processor and its advanced I/O structure. “As we see this emerging trend toward very large data lakes and data-rich workloads, we’re very excited about positioning the Power Systems platform into this Linux market to really deliver the value that the end user has been wanting,” says Boday. “The LC line is a great resource for them to turn to when they’re no longer getting the results in areas that they have been challenged in over the last couple of years.”

IBM’s commitment to the open-source movement includes the establishment of the OpenPOWER Foundation, a 6-year-old program that has fostered an open ecosystem around Power Systems technology. “This is a true shift for IBM in that we are leveraging what I call an innovation ecosystem to bring new capabilities more quickly to market,” says Ward. “Our ecosystem enables people to drive their digital transformation and disrupt markets, if they so choose.”

One Plus One Equals Three

Success in the cognitive era begins with the right platform. IBM offers the AC922, which is purpose-built for machine learning and deep learning. It delivers 4x the performance in these applications compared to commodity hardware (see “Benchmark Reference”). The LC922, in comparison, is optimized for data lakes. “When you marry these two things together and leverage PCIe Gen 4 to scale efficiently, you get that synergy where you now have two servers that are absolutely built and optimized towards that cognitive journey. In the Power Systems world, one plus one no longer equals two, it equals three,” Boday says.


Benchmark Reference
1. Results of 3.7x are based on IBM internal measurements running 1,000 iterations of an Enlarged GoogLeNet model (mini-batch size=5) on an Enlarged ImageNet dataset (2560x2560). Hardware: Power AC922; 40 cores (2 x 20c chips); POWER9 with NVLink 2.0; 2.25 GHz; 1,024 GB memory; 4x Tesla V100 GPU; Red Hat Enterprise Linux 7.4 for Power Little Endian (POWER9) with CUDA 9.1/cuDNN 7. Competitive stack: 2x Intel Xeon E5-2640 v4; 20 cores (2 x 10c chips)/40 threads; 2.4 GHz; 1,024 GB memory; 4x Tesla V100 GPU; Ubuntu 16.04 with CUDA 9.0/cuDNN 7. Software: Chainer v3/LMS/Out of Core with patches found at bit.ly/2J4Izg9.
2. Results of 3.8x are based on IBM internal measurements running 1,000 iterations of an Enlarged GoogLeNet model (mini-batch size=5) on an Enlarged ImageNet dataset (2240x2240). Hardware: Power AC922; 40 cores (2 x 20c chips); POWER9 with NVLink 2.0; 2.25 GHz; 1,024 GB memory; 4x Tesla V100 GPU; Red Hat Enterprise Linux 7.4 for Power Little Endian (POWER9) with CUDA 9.1/cuDNN 7. Competitive stack: 2x Intel Xeon E5-2640 v4; 20 cores (2 x 10c chips)/40 threads; 2.4 GHz; 1,024 GB memory; 4x Tesla V100 GPU; Ubuntu 16.04 with CUDA 9.0/cuDNN 7. Software: IBM Caffe with LMS source code (bit.ly/2KQ6ux9).

Kristin Lewotsky is a freelance technology writer based in Amherst, N.H.

