vStorm Enterprise Allows Unstructured Data to be Analyzed on the Mainframe

IBM, along with business partner Veristorm, announced Hadoop for System z at IBM’s Mainframe50 event on April 8. vStorm Enterprise couples zDoop, a fully supported implementation of the open-source Hadoop project from the Apache Software Foundation, with data connector technology, making z/OS data available for processing using the Hadoop big data paradigm without that data having to leave the mainframe.

It is currently the only commercially supported solution on the market that allows mainframe data to be analyzed with Hadoop while preserving the mainframe model of security, ease, agility and efficiency. Because the entire solution runs in Linux on System z, it can be deployed to low-cost, dedicated mainframe Linux processors, IFLs. The vStorm Enterprise offering can be used to build out a highly scalable private cloud for big data analysis.

Hadoop is a Fit for the Mainframe

Hadoop is a highly parallelized computing environment coupled with a distributed, clustered file system, and it is generally viewed as a very efficient environment for analyzing unstructured data. It was designed specifically for rapid processing across very large data sets and is ideally suited for discovery analysis.
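Hadoop's classic processing model, MapReduce, can be illustrated with a minimal single-process sketch. This is not the Hadoop API, and the log lines and their format are hypothetical; the sketch only shows the map, shuffle and reduce steps that a cluster would run in parallel across many nodes.

```python
from collections import defaultdict

# Hypothetical log lines; the format is an assumption for illustration only.
LOG_LINES = [
    "2014-04-08 ERROR payment timeout",
    "2014-04-08 INFO login ok",
    "2014-04-08 ERROR payment declined",
    "2014-04-09 WARN disk usage high",
]


def map_phase(line):
    """Emit a (severity, 1) pair for one log line."""
    _date, severity, *_rest = line.split()
    yield severity, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence on an in-memory list."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(run_job(LOG_LINES))
```

On a real cluster, each map task would read a block of a file from the distributed file system, and the grouping step would move data between nodes rather than through an in-memory dictionary.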

Anywhere from 60 to 80 percent of all the world’s transactional data is said to reside on the mainframe. Although much of that data is stored in relational databases such as DB2 for z/OS, businesses can also analyze log data and files to try to identify patterns that produce useful insights. Combining these mainframe-based data sources in a Hadoop cluster for analysis is a very attractive option for improving results and resolving issues a company may be having.

Because Hadoop is designed to handle very large data sets, one critical consideration for any Hadoop project is how the data gets ingested into the Hadoop Distributed File System (HDFS). Current techniques that move mainframe data off-platform for analysis typically involve cumbersome, heavyweight federation engines that require a fair amount of skill to set up and that put the data at risk by exposing it outside the mainframe security zone.

To simplify and streamline data ingestion for mainframe users, vStorm Enterprise includes vStorm Connect—a set of native data collectors that uses a graphical interface to facilitate the movement of z/OS data to the zDoop HDFS. Out of the box, vStorm Connect provides the ability to ingest a wide variety of z/OS data sources into the HDFS without forcing that data to leave the mainframe.

Critical Needs Are Met

Running Hadoop on System z with vStorm Enterprise addresses some of the most critical needs of mainframe clients, including:

  • A secure pipe for data. vStorm Enterprise is fully integrated with the z/OS security manager, Resource Access Control Facility (RACF), meaning that users of the product must log in with valid RACF credentials, and will only be able to access z/OS data to which they are already authorized via z/OS security. Having the Hadoop ecosystem remain on System z maintains mainframe security over the data and simplifies compliance with enterprise data governance controls.
  • Easy-to-use ingestion engine. The inclusion of vStorm Connect alongside the Hadoop distribution enables the efficient, quick and easy ingestion of z/OS data into Hadoop.
  • Templates for agile deployment. vStorm Enterprise executes within a fully virtualized environment managed by one of the industry’s most efficient Linux hypervisors. Better still, it offers—out of the box—templates so that new virtual servers can be added to the Hadoop cluster with just a few basic commands (or via automation software).
  • Mainframe efficiencies. When processors are more efficient, fewer of them are required. This can lead to lower licensing costs and more efficient energy and space utilization as well as lower management and maintenance costs.
  • Built-in visualization tools. These allow the data in HDFS and Apache Hive to be graphed and displayed in a wide variety of formats.

Testing Proves Effective

IBM’s Poughkeepsie, N.Y., lab assisted Veristorm with vStorm Enterprise testing to demonstrate proof of the mainframe’s efficient, linear scalability for Hadoop workloads. All testing was done with source data on z/OS and the target HDFS on Linux for System z on the same physical machine, a zEnterprise EC12. Hadoop nodes were configured as System z Linux virtual machines on top of the z/VM hypervisor, and the entire Linux environment was built on dedicated IFL processors. This effort showed that Hadoop does scale nicely on the mainframe; all tests met or exceeded expectations.

To generate a more meaningful measure, Veristorm wanted to exercise vStorm Enterprise on a workload that mimicked the type of real-world problems that our clients have been sharing with us. Their performance test team configured a 2-IFL system, pulled 2 billion stock trading records from the New York Stock Exchange stored in a DB2 for z/OS database, and ran an Apache Pig job to analyze this trading data for relevant metrics. This test was designed to measure how long it took to extract the data from DB2 for z/OS, stream it to Linux on System z for ingestion into the zDoop HDFS and perform the analysis.

The 2 billion record database was extracted, streamed to the HDFS, and analyzed in an end-to-end process that lasted 2 hours, using the power of 2 IFLs. Thus, the “2-2-2 benchmark” was born. Clients say 2-2-2 represents a very effective metric for the kinds of mainframe data that they want to analyze with Hadoop.
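As a rough back-of-the-envelope reading of the 2-2-2 figures, the sketch below converts them into an average end-to-end rate. The only inputs are the three numbers quoted above; the per-IFL figure assumes the work divides evenly across both processors, which the test description does not state.

```python
# Figures quoted in the 2-2-2 benchmark description.
RECORDS = 2_000_000_000  # stock trading records
HOURS = 2                # end-to-end elapsed time
IFLS = 2                 # Linux processors used


def throughput(records, hours, ifls):
    """Return (records per second overall, records per second per IFL)."""
    seconds = hours * 3600
    per_second = records / seconds
    # Assumption: work is spread evenly across the IFLs.
    return per_second, per_second / ifls


per_second, per_ifl = throughput(RECORDS, HOURS, IFLS)
print(f"{per_second:,.0f} records/s overall, {per_ifl:,.0f} records/s per IFL")
```

Because the 2 hours cover extraction, streaming and analysis together, these are averages over the whole pipeline and say nothing about the speed of any single stage.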

Get Started with zDoop

By integrating the best of traditional mainframe processing with emerging technologies, such as IBM DB2 Analytics Accelerator Netezza-based appliances and Hadoop, clients can establish a secure zone in which sensitive mainframe data is analyzed while simultaneously embracing insights from data that originates outside that zone. This hybrid approach, which extends the System z ecosystem with non-System z technology, enables all relevant data to become part of every operation and analysis within the context of a cohesive governance framework.

To learn more about Hadoop on System z, read the IBM and Veristorm whitepaper “The Elephant on the Mainframe” or visit IBM’s zDoop page.

Paul DiMarzio is a mainframe strategist with nearly 30 years of experience with IBM focused on bringing new and emerging technologies to the mainframe.
