
Unstructured Data: Why Make It Harder Than It Needs to Be?

April 2, 2014

This blog entry was written by John Easton, IBM Distinguished Engineer, Advanced Analytics Infrastructures.

You'll all doubtless have seen the statistics: the volume of data we're producing as a species is growing at some (insert superlative here) rate, and some (insert large percentage here) of this data is going to be unstructured. This is usually followed by statements to the effect that if we can harness this data in some way, the world will be a better place. I'll leave it to you to determine just what “better” might actually mean in practice. Such is the big data mantra, yet many organizations seem to struggle to take their first steps into this grim and scary unstructured world. But why?

Let's start by shooting some holes in an accepted belief. People will frequently tell you that an audio or video file is unstructured; however, this is not strictly true. Something like an MP3 file implicitly has to have structure, in the form of headers, ID3 tags and the like, so that an MP3 player can do something useful with it. The “unstructuredness” (if indeed such a word exists) comes in because the audio content of the MP3 file is not defined by this structure. The only way to find out what the file contains is to play the content; only then can you analyze what that content is telling you. If you are not using a toolkit that provides this capability, you'll need to write a program to process the binary data yourself. And let's be honest: if you've not done this before, that is hard.
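To make the distinction concrete, here is a minimal Python sketch (an illustration, not production code) that reads the fixed-format ID3v1 tag that many MP3 files carry in their final 128 bytes. The tag layout is rigidly structured: a “TAG” marker followed by fixed-width title, artist, album and year fields. The audio frames making up the rest of the file are the part no such schema describes.

```python
def read_id3v1(path):
    """Read the fixed 128-byte ID3v1 tag, if present, from the end of an MP3 file."""
    with open(path, "rb") as f:
        f.seek(-128, 2)           # the ID3v1 tag occupies the final 128 bytes
        tag = f.read(128)
    if tag[:3] != b"TAG":         # no ID3v1 tag (the file may use ID3v2 instead)
        return None

    def field(start, length):
        # Fields are fixed-width, padded with nulls or spaces
        return tag[start:start + length].split(b"\x00")[0].decode("latin-1").strip()

    return {
        "title":  field(3, 30),
        "artist": field(33, 30),
        "album":  field(63, 30),
        "year":   field(93, 4),
    }
```

The metadata falls out in a dozen lines; the audio signal itself still has to be decoded and analyzed before it tells you anything, and that is where the real work, and the real difficulty, lies.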

So how else might you start taking advantage of this mountain of unstructured data, in a way that's easier to get going with? How about logfiles? These can be a source of great insight, and because they are typically stored as plain text, they are much easier to work with than binary data. But just because it's easier doesn't mean there isn't real business value to be gained.
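As an illustration of just how approachable this is, here is a short sketch of pulling structured fields out of plain-text logs. The log format below is invented for the example; real network equipment logs vary, but the approach is the same.

```python
import re

# Hypothetical log format, invented for illustration:
# 2014-04-02 10:15:32 WARN hw=RT7000 fw=2.1.4 latency_ms=340
LINE = re.compile(
    r"(?P<date>\S+) (?P<time>\S+) (?P<level>\w+) "
    r"hw=(?P<hardware>\S+) fw=(?P<firmware>\S+) latency_ms=(?P<latency>\d+)"
)

def parse(line):
    """Turn one plain-text log line into a dict of named fields."""
    m = LINE.match(line)
    return m.groupdict() if m else None

print(parse("2014-04-02 10:15:32 WARN hw=RT7000 fw=2.1.4 latency_ms=340"))
# {'date': '2014-04-02', 'time': '10:15:32', 'level': 'WARN',
#  'hardware': 'RT7000', 'firmware': '2.1.4', 'latency': '340'}
```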

Consider a telecommunications company IBM has worked with. Using analytics on logfile data, the company has identified particular combinations of communications hardware and firmware that give rise to poor performance, and it has used these insights to proactively fix its customers' systems. Doing this before many end users have even realized they had a problem has resulted in higher customer satisfaction and hence less churn in the company's customer base. The provider has also been able to use network logfile data to identify individuals performing illegal activities on its network, and to work with the relevant law enforcement authorities to deal with those individuals appropriately.

In both cases, what the organization does is build a model of what “normal” behavior is: what are the correct ranges for performance? What does a legitimate user look like? Once these normal behaviors are understood, graphically displaying all the data allows the outliers, those that are “not normal”, to be found relatively easily, and with them real business value.
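Continuing the sketch above (again with invented field names and an assumed log format), a first cut at such a model can be as simple as estimating a normal range per hardware/firmware combination and flagging whatever falls well outside it:

```python
from collections import defaultdict
from statistics import mean, stdev

def find_outliers(records, k=3.0):
    """Flag records whose latency falls more than k standard deviations
    from the mean for their hardware/firmware combination."""
    groups = defaultdict(list)
    for r in records:                     # records as produced by parse() above
        groups[(r["hardware"], r["firmware"])].append(float(r["latency"]))

    outliers = []
    for combo, values in groups.items():
        if len(values) < 2:
            continue                      # too little data to model "normal"
        mu, sigma = mean(values), stdev(values)
        outliers.extend((combo, v) for v in values
                        if sigma and abs(v - mu) > k * sigma)
    return outliers
```

In practice you would plot the distributions rather than simply threshold them, as suggested above, but even this crude model separates the “not normal” from the noise.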

Successful analytics projects need to deliver tangible business benefits. In my experience, many organizations try to take on the big challenges too early. Starting small and starting easy isn't an admission of failure; rather, it opens the business's eyes to what might be possible. So, for your first foray into the world of unstructured data, why not think about using all those logfiles you've been storing away for a rainy day?

More on IBM Power Systems

For more information on how to get started with unstructured data, download the solution brief.  For updates on Power Systems for analytics, please follow our venues on Facebook, LinkedIn and Twitter.  And, for the latest on how Power Systems servers are constantly evolving to help you break through the physical and virtual boundaries of data, be sure to register for our upcoming webcast Open Innovation to Put Data to Work on April 28. 

John Easton is an IBM Distinguished Engineer, who is internationally known for his work helping commercial clients exploit large-scale distributed computing infrastructures, particularly those utilizing new and emerging technologies. He is currently leading work on next-generation systems infrastructures to support big data and complex analytical workloads. He has worked with clients in a range of industries with a particular focus on banks and financial markets firms. Over his time at IBM, John has led initiatives around hybrid systems, computational acceleration, cloud and grid computing, energy efficiency and mission-critical systems. He is a member of the IBM Academy of Technology and a Fellow of both the Institute for Engineering and Technology and the British Computer Society.
