Use Machine Learning to Make Storage Smarter
With the goal of helping us learn more about the universe, IBM and the Netherlands Institute for Radio Astronomy (ASTRON) are collaborating on the IT roadmap behind the Square Kilometre Array (SKA), which will be the largest radio telescope in the world (bit.ly/2iywTpL).
As part of this effort, called the DOME project, IBM developed a pizza-box-sized data center that uses a fraction of the energy required by a typical data center to lower SKA computing costs (bit.ly/2jWgHut). Now, IBM is working on the best way to access and analyze the petabytes of data the SKA will gather daily.
“If you put everything on flash drives, you’re going to quickly eat up your IT budget. You must be smart about what you store long term versus short term, and the storage medium.”
—Giovanni Cherubini, data storage scientist, IBM
IBM Systems Magazine sat down with Vinodh Venkatesan and Giovanni Cherubini, IBM data storage scientists, who explain why the best way likely involves cognitive storage.
Inspired by the human brain, cognitive storage combines data popularity with data value. This means users can teach storage systems what to remember and what to forget, which can significantly lower storage costs, whether cognitive storage is used to scan the universe or track customer behavior.
IBM Systems Magazine (ISM): How is the DOME project related to cognitive storage?
Vinodh Venkatesan (VV): Our DOME project partner, ASTRON, is part of a consortium that’s designing and building the SKA. When it’s completed by 2024, one of the main challenges will be the amount of data ASTRON is planning to collect—close to a petabyte a day. At these data ingestion rates, the cost of the storage system goes through the roof. Within this data, astronomers may find answers to some fundamental questions about the universe, but that won’t be possible if the system is too expensive to operate.
Some parts of the data collected are more valuable than others. This got our IBM Research team thinking about how our brains work. We collect a lot of information every day with our eyes, ears and so on, but we don’t remember everything. We remember only what seems important or relevant, and forget things that aren’t. Our brain automatically does this classification.
We thought, “Why not apply these principles in a large-scale data storage system? Why not, as data comes into the system, analyze it, decide whether it’s likely important and then choose what type of medium to store it on—whether it’s flash, hard drives or tape—and the type of redundancies involved, such as how many backups you need?” Then, system performance and reliability are optimized to take data value into account.
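The decision Venkatesan describes—analyze incoming data, estimate its importance, then pick a medium and a redundancy level—can be sketched as a simple placement policy. This is a minimal illustration, not IBM’s actual system; the tier names, value thresholds and replica counts are assumptions made for the example.

```python
# Illustrative sketch of value-aware data placement.
# Thresholds and replica counts are assumed for demonstration only.
from dataclasses import dataclass

@dataclass
class Placement:
    medium: str      # "flash", "disk" or "tape"
    replicas: int    # how many redundant copies to keep

def place(value: float) -> Placement:
    """Map an estimated data value in [0, 1] to a tier and redundancy level."""
    if value >= 0.8:
        return Placement(medium="flash", replicas=3)  # high value: fast medium, extra backups
    if value >= 0.4:
        return Placement(medium="disk", replicas=2)   # medium value: standard tier
    return Placement(medium="tape", replicas=1)       # low value: cheap archival medium

print(place(0.9))  # high-value data lands on flash with three copies
print(place(0.1))  # low-value data lands on tape with one copy
```

In a real system the value estimate would come from a trained classifier rather than being supplied by hand, but the placement step itself stays this simple: value in, tier and redundancy out.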
ISM: How does this differ from current storage systems?
Giovanni Cherubini (GC): Current systems already do some of this optimization, with respect to data popularity; they keep track of how frequently each file or piece of data is accessed. They might move frequently accessed data—hot data—to faster devices such as flash, or they might move data to other storage media as it becomes colder. We’re looking at adding this new notion of data value because, although some correlation might exist between the value and popularity of the same piece of data, not all data that’s popular is valuable and vice versa.
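The popularity-based optimization Cherubini describes—tracking access frequency and migrating hot data to flash as cold data drifts toward slower media—can be sketched in a few lines. This is an illustrative toy, with an assumed access threshold and tier names; production systems use far more sophisticated heuristics.

```python
# Illustrative sketch of popularity-based (hot/cold) tiering.
# HOT_THRESHOLD and the two-tier layout are assumptions for the example.
from collections import Counter

accesses = Counter()   # access count per file in the current window
tier = {}              # current tier assignment per file

HOT_THRESHOLD = 100    # accesses per window that mark a file as "hot" (assumed)

def record_access(path: str) -> None:
    accesses[path] += 1

def migrate() -> None:
    """Periodically move files between tiers based on recent popularity."""
    for path, count in accesses.items():
        tier[path] = "flash" if count >= HOT_THRESHOLD else "disk"
    accesses.clear()   # start a fresh observation window

for _ in range(150):
    record_access("/data/hot.fits")
record_access("/data/cold.fits")
migrate()
print(tier["/data/hot.fits"])   # "flash"
print(tier["/data/cold.fits"])  # "disk"
```

Cherubini’s point is that this signal alone is incomplete: a rarely accessed file can still be valuable, so value and popularity need to be weighed together.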
Therefore, we can add what we call the computing/analytics units and selector, which include a learning system that users can train to teach the storage system what’s important and what’s not.
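The “teach the system” idea amounts to supervised learning over file metadata: users label a handful of files as important or not, and a classifier generalizes from there. The sketch below uses a trivial nearest-centroid classifier with assumed metadata features; it stands in for whatever learning system the actual computing/analytics units would use.

```python
# Illustrative sketch: users label examples, a nearest-centroid classifier
# learns to flag important data. Features and labels are assumptions.

def centroid(rows):
    """Component-wise mean of a list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def train(labeled):
    """labeled: list of (features, is_important) pairs supplied by users."""
    pos = [f for f, y in labeled if y]
    neg = [f for f, y in labeled if not y]
    return centroid(pos), centroid(neg)

def predict(model, features):
    """Classify by whichever centroid is closer (squared Euclidean distance)."""
    pos_c, neg_c = model
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return dist(features, pos_c) < dist(features, neg_c)

# Assumed metadata features: [size in GB, owner priority, days since creation]
examples = [
    ([5.0, 0.9, 1.0], True),
    ([4.0, 0.8, 2.0], True),
    ([0.1, 0.1, 300.0], False),
    ([0.2, 0.2, 500.0], False),
]
model = train(examples)
print(predict(model, [4.5, 0.85, 3.0]))  # True: resembles the labeled important files
```

The prediction would then feed the placement decision described earlier: data classified as important goes to faster media with more redundancy.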