
Disposable Data Can Pose Legal Liabilities and Security Risks

If data outlives its usefulness, deleting it can reduce risks.


In this digital age, more data is being created than ever before. Recent estimates from IDC place the number at 175 zettabytes (ZB) in 2025. This reflects a 61% compound annual growth rate and is already 9% higher than IDC’s estimates from 2018. Today, it’s pegged at 33 ZB, where a zettabyte is one sextillion (10^21) bytes (bit.ly/2TN2hjj).

This vast amount of data is making it harder for organizations to gain the insights they need to succeed. Data is not only massive, but complex, crossing many formats and platforms. It can be structured, unstructured, distributed, on-premises, mobile, stored on IBM Z*, stored in the cloud and more.

Data Comes With Risks

Many data analysts believe that all data contributes to valuable insights. While this is true, data also comes with costs and risks. What follows is a brief discussion of the problems we encountered while analyzing data across a sample of internal IBM z/OS* mainframe LPARs over a period of four years.

Like any large corporation, IBM has a comprehensive data-retention policy. It divides data into three categories: essential, reference and disposable. Essential data must satisfy not only IBM’s business operations, but also any corporate legal requirements. This could be anything from contract documents to data held to meet U.S. Internal Revenue Service regulations, ongoing litigation or the laws of countries where IBM does business. Document retention orders (DROs) are issued by IBM Legal to suspend the destruction of information and documents. Essential data must be retrievable at any time within the mandated retention period.

Reference information is whatever employees or departments define as necessary to satisfy business obligations. In practice, it means anything an employee finds useful. Ideally, any data that hasn’t been used in over two years is reviewed regularly to confirm a continuing business need. Any data that isn’t categorized as essential or reference is considered disposable. Application owners must be aware of DROs and consult essential information categories before moving forward with disposal.
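The three categories can be sketched as a small decision rule. This is only an illustration of the policy as described here, not IBM’s actual tooling: the `DataSet` fields, the flag names and the 730-day approximation of "two years" are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Assumption: "over two years" approximated as 730 days.
REVIEW_AFTER_DAYS = 2 * 365


@dataclass
class DataSet:
    name: str
    under_dro: bool             # hypothetical flag: covered by a document retention order
    legally_required: bool      # hypothetical flag: held for legal or regulatory reasons
    claimed_as_reference: bool  # hypothetical flag: an owner says it is still useful
    last_used: Optional[date]   # None if no usage date was recorded


def categorize(ds: DataSet) -> str:
    """Return 'essential', 'reference' or 'disposable'."""
    # Legal holds and legal requirements always win.
    if ds.under_dro or ds.legally_required:
        return "essential"
    if ds.claimed_as_reference:
        return "reference"
    # Anything that is neither essential nor reference is disposable.
    return "disposable"


def needs_review(ds: DataSet, today: date) -> bool:
    """Reference data unused for over two years should be reviewed."""
    if categorize(ds) != "reference":
        return False
    if ds.last_used is None:
        return True
    return (today - ds.last_used).days > REVIEW_AFTER_DAYS
```

Checking the DRO and legal flags first mirrors the requirement that owners consult the essential categories before any disposal decision is made.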

Identifying Disposable Data

We applied these three categories to a sample of 40 z/OS LPARs, of which 75% ran IBM internal systems and 25% ran workloads for commercial IBM clients in the insurance and finance industries. A toolbox of REXX (Restructured Extended Executor) and SAS (Statistical Analysis System) programs was written to cull through data from IDCAMS DCOLLECT and various tape management systems. The primary focus was on the last date that a data set had been opened for read or write activity.
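To illustrate the kind of last-reference analysis described above, here is a minimal sketch in Python rather than the REXX and SAS actually used. It assumes the collected records have already been converted to CSV with hypothetical columns `dsname`, `last_ref` (ISO date) and `bytes`; real DCOLLECT output is a binary record format and would need to be decoded first. The two-year threshold is also an assumption, borrowed from the review guideline in the retention policy.

```python
import csv
import io
from datetime import date


def idle_years(last_ref: date, today: date) -> float:
    """Approximate number of years since the data set was last opened."""
    return (today - last_ref).days / 365.25


def stale_share(csv_text: str, today: date, threshold_years: float = 2.0) -> float:
    """Fraction of total bytes held in data sets idle longer than the threshold."""
    total = 0
    stale = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        size = int(row["bytes"])
        last_ref = date.fromisoformat(row["last_ref"])
        total += size
        if idle_years(last_ref, today) > threshold_years:
            stale += size
    # Guard against an empty inventory.
    return stale / total if total else 0.0
```

A long idle period only marks a data set as a candidate. Whether it is actually disposable still depends on the essential and reference checks.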

There were predictable as well as surprising findings. First, it’s imperative that data on disk subsystems be catalogued, and if data is being shared across LPARs, the catalogs must also be shared. Applications use the catalog to manage disk data, and anything not in the catalog is usually considered suspect. DCOLLECT ignores the catalog and polls the volumes directly, so it also reports data sets the catalog doesn’t know about. On average, systems that allowed uncatalogued data to accumulate had 15% uncatalogued data. Without exception, this data was categorized as disposable, and some of it hadn’t been opened since the early 1990s.
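The uncatalogued check boils down to comparing what is physically on the volumes with what the catalog lists. The sketch below shows that comparison under stated assumptions: both inputs are hypothetical, already-extracted structures (a name-to-size mapping from the volume scan and a set of catalogued names), and the share is measured in bytes, which the analysis above does not specify.

```python
from typing import Dict, List, Set, Tuple


def uncatalogued(on_volume: Dict[str, int],
                 catalogued: Set[str]) -> Tuple[List[str], float]:
    """Return uncatalogued data set names and their share of total bytes.

    on_volume maps data set name to size in bytes (from a direct volume scan);
    catalogued holds the names the catalog knows about.
    """
    missing = sorted(name for name in on_volume if name not in catalogued)
    total = sum(on_volume.values())
    missing_bytes = sum(on_volume[name] for name in missing)
    share = missing_bytes / total if total else 0.0
    return missing, share
```

For example, `uncatalogued({"PROD.PAYROLL": 850, "OLD.TEST.DATA": 150}, {"PROD.PAYROLL"})` flags `OLD.TEST.DATA` as the only uncatalogued data set. The data set names here are made up for the example.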

Second, an average of 1.6% of data was found to be both uncatalogued and lacking a security definition. When there’s no definition, security products will not allow access. However, a clever actor can use that knowledge to gain ownership. This was a surprising find that was immediately reported to data security.

The most valuable discovery was that 40% of the 4 PB of tape data fell into the category of disposable. Disposable data on disk ranged from 12% to 25%. Determining how to dispose of this data was prioritized.

The Data Growth Paradox

From these results, we see that data can pose a legal liability. The larger the data lake, the greater the loss during a data breach. Data can also be used against a company in a legal action. If it outlives its usefulness, and no legal reason exists to keep it, deleting it can reduce risks. We also see that risk and cost are a function of the size of the data lake.

This brings us to the paradox companies can face when creating a data-retention policy. In general, the average data owner doesn’t know how to efficiently manage the retention of their data. For example, they may be able to identify tens of thousands of tape data sets to eliminate but are only able to accomplish this by recalling and deleting each data set one at a time. The system caretakers have the skills to automate the deletion process but cannot take on the risk of deleting client data without permission. Getting this permission takes immense time and effort. Data continues to grow at alarming rates and any hope of taming this growth requires more aggressive procedures.
