
Eliminate Repeated Data With Deduplication

Deduplication can dramatically decrease storage requirements.


Confused about compression versus deduplication? Let’s take a closer look at what both these terms mean, and the differences between them.

What Is Compression?

Simply put, compression is taking an item and manipulating it so that it takes up less space than it originally did. Think of a propane tank: The gas in the tank has been squeezed into a much smaller space than it would occupy as a free gas. For those of us in the data processing industry, the concept is much the same. We deal with masses of data, and that data takes up space on limited storage resources such as SSD, rotating disk and tape (physical and virtual). It would be more cost-effective and efficient to manipulate that data so it uses less storage space. By compressing the data, fewer bits are needed to store the same information.

Data compression is implemented in many different ways. Generally, data compression products are geared to the type of data being compressed. For example, to compress audio files (music) you might use MP3, AAC, FLAC or many others. These compression formats can be broadly classified as lossy or lossless. Lossy compression discards some information, so when the file is uncompressed it isn't an exact duplicate of the original. Lossless compression returns a file to its original state when uncompressed. As you might expect, lossy compression generally produces smaller files than lossless compression routines.

When dealing with business data, lossy compression isn't an option. Products such as PKZIP and Gzip, available for x86-based systems, provide lossless compression for data files. Systems that don't use the x86 architecture may have a proprietary compression scheme or can send the data to an x86 subsystem for compression.
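
To make "lossless" concrete, here's a minimal sketch in Python using the built-in zlib module as a stand-in for tools like Gzip (the module choice and the sample data are my own assumptions, not tied to any particular product): the decompressed output must match the original byte for byte.

```python
import zlib

original = b"Business records must survive compression unchanged. " * 100

compressed = zlib.compress(original, 9)   # squeeze the bytes down
restored = zlib.decompress(compressed)    # expand them back

assert restored == original               # lossless: an exact round trip
print(f"original: {len(original)} bytes, compressed: {len(compressed)} bytes")
```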

So how much compression can you expect? The input data determines how much compression you'll see. Typical lossless compression averages around 2:1 across many different data types, and some tape drives claim a 2.5:1 compression ratio. Generally, a file with lots of repeating data (like long runs of space characters) will compress better than a file of random data. This means text files will compress much better than files containing audio or video information.
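
Here's a small, illustrative experiment (not a benchmark) using the same zlib module: a highly repetitive input compresses far better than 2:1, while random bytes barely shrink at all. The input sizes are arbitrary assumptions chosen only to show the contrast.

```python
import os
import zlib

repetitive = b"     " * 20_000      # lots of repeated space characters
random_data = os.urandom(100_000)   # essentially incompressible

for label, data in (("repetitive", repetitive), ("random", random_data)):
    ratio = len(data) / len(zlib.compress(data, 9))
    print(f"{label}: about {ratio:.1f}:1")
```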

What Is Deduplication?

Deduplication is the process of finding duplicate and/or repeated blocks of information and storing only one copy of each block. Sound a bit like compression? It is, in the sense that it's a mechanism to reduce the amount of storage required for your data. However, the key difference is scope: Most compression systems work on a file, a set of files or possibly a tape at a given time. Most deduplication systems work over an entire storage environment for an extended period of time.

Deduplication systems are implemented in many different ways, but the basic concept remains the same: the system reviews the blocks of data in its environment. For each block, it calculates a hash value (i.e., a really large number). Because of the way the hash is calculated, that value should be unique for each unique block of data. The deduplication system then looks in its hash table to see if it has seen this value before. If the value is found, the data block isn't stored again; it's replaced by a pointer to the previously seen data block. If the hash value isn't found, the data block is stored and the hash value is added to the hash table. The file itself becomes just a string of hash values, and it can be rebuilt by looking up the blocks those hash values refer to.
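
The following is a simplified sketch of that idea, not any vendor's actual implementation: it splits data into fixed 4 KB blocks, hashes each block with SHA-256 (both the block size and the algorithm are assumptions made for illustration), stores each unique block once, and represents the file as a list of hash values.

```python
import hashlib

BLOCK_SIZE = 4096          # assumed fixed block size for illustration
block_store = {}           # hash value -> data block (the "hash table")

def dedup_store(data: bytes) -> list[str]:
    """Return the file as a sequence of block hashes, storing only new blocks."""
    recipe = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in block_store:      # unseen block: keep the data
            block_store[digest] = block
        recipe.append(digest)              # seen or not, record the pointer
    return recipe

def rebuild(recipe: list[str]) -> bytes:
    """Reassemble the original data by looking up each hash value."""
    return b"".join(block_store[digest] for digest in recipe)

data = b"A" * 10_000 + b"B" * 10_000 + b"A" * 10_000   # lots of repeated content
recipe = dedup_store(data)
assert rebuild(recipe) == data
print(f"blocks referenced: {len(recipe)}, unique blocks stored: {len(block_store)}")
```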

There are as many different implementations of deduplication as there are vendors of deduplication systems. What block sizes are used for the hash calculation? Are blocks merged to create larger blocks? Which hash algorithm is used? Are multiple hash algorithms used to prevent hash collisions? (A hash collision occurs when two different data blocks produce the same hash value. Depending on the hash algorithm, this is unlikely, but it's still a mathematical possibility.) How is the hash table indexed? These implementation choices affect both the performance of the system and the amount of deduplication (compression) you'll see.

Reaping the Benefits

So how much deduplication can you expect? The pattern of the data and the implementation of the deduplication system will determine the deduplication ratio. Because deduplication works by only storing a block of data once even if it’s seen many times, backup and archive applications tend to see the best deduplication ratios. 

The first time you back up your environment to a deduplication system, the deduplication ratio will be small. As you continue to back up or archive your environment, the ratio should improve as the same files are seen again and again. My experience has shown anywhere from 4:1 to 20:1 deduplication ratios in real life. I've even seen some vendors claim up to 200:1, but that's with very specific data sets. Even at 4:1, that's twice the storage savings of simple 2:1 compression.
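
As a back-of-the-envelope illustration of why the ratio improves (the backup size and the 5 percent weekly change rate are purely assumed numbers), consider repeated full backups where only a small fraction of blocks is new each week:

```python
BLOCKS_PER_BACKUP = 1_000_000   # assumed size of one full backup, in blocks
CHANGE_RATE = 0.05              # assume 5% of blocks change between backups

logical = stored = BLOCKS_PER_BACKUP                 # week 1: every block is new
for week in range(2, 11):                            # nine more weekly full backups
    logical += BLOCKS_PER_BACKUP                     # what the backups add up to
    stored += int(BLOCKS_PER_BACKUP * CHANGE_RATE)   # only new blocks land on disk
    print(f"week {week}: deduplication ratio {logical / stored:.1f}:1")
```

With these assumed numbers, the ratio climbs from 1:1 on the first backup to roughly 7:1 by week 10, in line with the 4:1 to 20:1 range above.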

So if you have deduplication, you don't need compression anymore? Not exactly. Again, depending on the vendor's implementation, the stored blocks of data can also be compressed for even more space savings. Employing both compression and deduplication gives you the best of both worlds.

Besides storage space reduction, what other benefits does deduplication have? If you’re copying your data over a network to another location, most deduplication systems that have a remote replication feature will only send new blocks of data rather than a whole file or system. This can dramatically reduce bandwidth requirements.
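
Here's a rough sketch of that replication idea, assuming both sites keep a block store like the one sketched earlier (the function and data are hypothetical, not any product's API): the source only ships blocks whose hashes the remote site doesn't already have.

```python
import hashlib

def replicate(local_store: dict[str, bytes], remote_store: dict[str, bytes]) -> int:
    """Copy only the blocks the remote site lacks; return the number of bytes sent."""
    bytes_sent = 0
    for digest, block in local_store.items():
        if digest not in remote_store:   # the remote already has everything else
            remote_store[digest] = block
            bytes_sent += len(block)
    return bytes_sent

# Tiny usage example: only the block the remote has never seen is shipped.
local = {hashlib.sha256(b).hexdigest(): b for b in (b"old block", b"new block")}
remote = {hashlib.sha256(b"old block").hexdigest(): b"old block"}
print(f"bytes sent: {replicate(local, remote)}")
```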

Finally, for secure environments, the data blocks can be encrypted. Encryption can be processor-intensive, but on a deduplication system only the unique blocks that are actually stored need to be encrypted, which saves processing overhead.

Deciding on Deduplication

Unlike compression, which can be used as needed on a single file or a selection of files, deduplication requires a commitment to using it across an entire storage environment. It also requires special hardware and/or software to perform the deduplication and store the results. 

Deduplication can dramatically reduce storage requirements for some workloads, but it requires a commitment and specialized tools. It's not something a site can simply put together on its own.
