Best Practices in Storage for AI and Big Data
IBM VP Eric Herzog explains how asking the right questions can help you develop a storage solution that meets your AI needs.
By Eric Herzog | 04/09/2020
It’s common knowledge that the accuracy and efficacy of AI depend on the data used to develop, train and test the models. While more data can sometimes translate into better AI outcomes, that isn’t always the case. Data sets for AI require proper labeling to guide the training process, as well as sufficient variety to eliminate naive bias. Having the right data is one of the most crucial components of AI success. What’s surprising, however, is that the underlying storage that holds and manages all of that precious data doesn’t get the same scrutiny. Choosing the right storage is critical to getting the most out of your AI workloads. Here are four strategies for selecting storage for AI.
The AI Data Pipeline
Developing AI isn’t a single workload or application. It requires collecting data from multiple sources, organizing the data to make it useful, analyzing the data with a variety of frameworks and then delivering the model to be used across the organization. The majority of the time spent developing AI goes to collecting and managing the data, not to GPU-accelerated training. Notice that the pipeline has multiple steps, with different data tools used at different times.
Storage for AI should support the entire data pipeline. A single data repository eliminates the overhead of copying data from one system to another, encourages good labeling and organization practices, simplifies team collaboration and reduces the total cost of data.
Keeping track of data usage across multiple projects and teams that use different applications or frameworks is a challenge. Modern storage systems can simplify tracking and reporting on data use by leveraging metadata. Metadata is data that describes other data, capturing attributes such as when a piece of data was last modified and by whom.
The metadata can be used to track data origin, add labels and even tag data used for different AI models. Emerging data governance and metadata management tools can automate this tagging and indexing through APIs that span different types of storage.
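The idea of tracking and tagging data sets through metadata can be sketched in a few lines of Python. This is an illustrative, generic catalog, not the API of any particular governance product; the record fields and tag names are assumptions.

```python
import os
import time
from pathlib import Path

def catalog_entry(path, tags=None):
    """Build a metadata record for one file: system attributes
    (size, last-modified date) plus user-supplied tags, such as
    the project or AI model the data was used for."""
    st = os.stat(path)
    return {
        "path": str(path),
        "size_bytes": st.st_size,
        "last_modified": time.strftime("%Y-%m-%d", time.gmtime(st.st_mtime)),
        "tags": sorted(tags or []),
    }

def find_by_tag(catalog, tag):
    """Return the paths of every record carrying the given tag,
    e.g. all files that fed a particular model's training run."""
    return [entry["path"] for entry in catalog if tag in entry["tags"]]
```

A real metadata management layer would populate such records automatically on ingest and expose the queries through an API, but the shape of the problem, attributes plus searchable tags, is the same.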
Storage for AI always seems to be growing. Once data is collected and organized, it’s easier to keep than to recreate. New projects can be built upon existing data sets, and training and validating new models on old data is typical. However, keeping a large and growing data set on fast primary storage busts budgets. Automate data tiering rather than archiving: tiering moves cold data to less expensive media while keeping it available to the data scientists.
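An age-based tiering policy can be sketched as a short script. Real tiering happens transparently inside the storage system; this minimal example, with assumed directory names and a 90-day threshold, just illustrates the policy: data older than the threshold moves to a capacity tier but remains accessible at its new path rather than disappearing into an archive.

```python
import shutil
import time
from pathlib import Path

# Hypothetical tier locations and policy threshold.
FAST_TIER = Path("fast_tier")
CAPACITY_TIER = Path("capacity_tier")
AGE_THRESHOLD_DAYS = 90

def tier_cold_files(fast=FAST_TIER, capacity=CAPACITY_TIER,
                    max_age_days=AGE_THRESHOLD_DAYS):
    """Move files not modified within max_age_days from the fast
    tier to the capacity tier. Returns the names of moved files."""
    cutoff = time.time() - max_age_days * 86400
    capacity.mkdir(parents=True, exist_ok=True)
    moved = []
    # sorted() materializes the listing so moving files does not
    # disturb the directory scan.
    for f in sorted(fast.iterdir()):
        if f.is_file() and f.stat().st_mtime < cutoff:
            shutil.move(str(f), str(capacity / f.name))
            moved.append(f.name)
    return moved
```

Run on a schedule (for example, nightly via cron), this keeps the fast tier reserved for active data while cold data stays online on cheaper storage.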
AI projects are scripted and incorporate different libraries and frameworks. Good AI development practice uses containers as the development and deployment standard. Containers not only provide some version control; they can also be deployed as sets of services that work together. Containers also make it relatively easy to package AI applications, whether for ingest or training, and move them to the public cloud or to edge networks.
Storage for AI should support the evolving Container Storage Interface (CSI) standard, which Red Hat OpenShift and others support. The standard enables self-service deployment, snapshot management and backup that integrate with Kubernetes.
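With a CSI driver installed, containerized AI workloads request storage declaratively. A minimal sketch of a Kubernetes PersistentVolumeClaim follows; the claim name and the `csi-shared-storage` class name are placeholders, so substitute the StorageClass your CSI driver actually registers.

```yaml
# PersistentVolumeClaim against a CSI-backed StorageClass.
# Kubernetes asks the CSI driver to provision the volume on demand.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data        # placeholder name
spec:
  accessModes:
    - ReadWriteMany          # shareable across training pods
  storageClassName: csi-shared-storage   # placeholder class
  resources:
    requests:
      storage: 500Gi
```

Pods then mount the claim by name, so the same data set can back ingest, training and inference containers without manual provisioning.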
Ask the Right Questions
Understanding your applications and workloads, while factoring in growth, is critical for meeting your AI storage needs. Working with, and asking questions of, the data science team will guide you to storage strategies that accelerate AI adoption and improve flexibility for future AI deployments.
Eric Herzog is the IBM Storage Division CMO and vice president of global storage channels.