There are several different ways that data deduplication works, and the type used will depend on the data being processed as well as the storage space available. While some data deduplication techniques might be used alone, others are used in conjunction with one another. Some may remove too much data, while others leave too much behind. This is why some companies choose to use a combination of methods. The three main types of data deduplication are hash-based deduplication, chunking, and primary and secondary storage deduplication.
Hash-Based Data Deduplication
In hash-based data deduplication, data is processed in chunks, and each chunk is assigned a hash value. These hash values act as fingerprints telling the system that the data has already been processed. As new data comes in, the system computes its hash values and looks for duplicates. Any time a duplicate hash value is found, the corresponding chunk is not saved again. However, this can sometimes result in false positives: data might be lost because the system believes it is a duplicate when it is not. That is why hash-based deduplication is usually used in conjunction with another type of deduplication. In this event, all positives might be written to a secondary storage source to be rechecked using a different process.
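The idea above can be sketched in a few lines of Python. This is a minimal illustration, not any particular product's implementation: chunks are keyed by their SHA-256 digest, a repeated digest is treated as "already stored", and an ordered "recipe" of digests is kept so the original stream can be rebuilt.

```python
import hashlib

def deduplicate(chunks):
    """Store each chunk only once, keyed by its SHA-256 digest, and
    return the unique-chunk store plus the ordered digest recipe
    needed to rebuild the original stream."""
    store = {}
    recipe = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:  # a repeated digest means the chunk is already stored
            store[digest] = chunk
        recipe.append(digest)
    return store, recipe

def rebuild(store, recipe):
    """Reassemble the original stream from the digest recipe."""
    return b"".join(store[d] for d in recipe)

# Five incoming chunks, only three of which are distinct.
chunks = [b"alpha", b"beta", b"alpha", b"gamma", b"beta"]
store, recipe = deduplicate(chunks)
```

The false-positive risk mentioned above corresponds to a hash collision: two different chunks producing the same digest. Systems either rely on a strong cryptographic hash making this vanishingly unlikely, or byte-compare the chunks on every hash hit before discarding anything.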
Chunking
Chunking data is just what it sounds like. Data is divided along chunk boundaries and then compared to other chunks of data. When duplicate chunks are found, they are eliminated. There are two different types of chunking. In the first, the stream of data is reviewed in static, fixed-size chunks that are only visited once. In more advanced chunking, the data is reviewed through a sliding window, so boundaries shift with the content and the same data can be compared multiple times. Even so, duplicates might be missed, which is why chunking is sometimes used in conjunction with hash-based deduplication. That way, the data is compressed as much as possible while key pieces of data won't be lost.
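The sliding-window variant can be sketched as follows. This is an illustrative toy, assuming a simple polynomial rolling hash; real systems typically use Rabin fingerprints, but the boundary-selection idea is the same: a chunk boundary is declared wherever the hash of the last few bytes meets a condition, so boundaries depend on content rather than position, and an insertion only disturbs the chunks near the edit. All parameter values here are made-up demo numbers.

```python
import random

WINDOW = 16          # bytes in the sliding window
DIVISOR = 64         # expected average chunk size, in bytes (tiny, for demo)
MIN_CHUNK = 32       # avoid pathologically small chunks
MAX_CHUNK = 256      # force a cut so chunks stay bounded
BASE = 257
MOD = 1_000_000_007

def chunk_spans(data: bytes):
    """Return (start, end) spans with content-defined boundaries."""
    spans = []
    start = 0
    h = 0
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    for i, b in enumerate(data):
        if i - start >= WINDOW:
            # Slide the window: drop the contribution of the oldest byte.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + b) % MOD
        length = i - start + 1
        # Cut when the window hash hits the target (and the chunk is not
        # too small), or when the chunk reaches its maximum size.
        if (h % DIVISOR == 0 and length >= MIN_CHUNK) or length >= MAX_CHUNK:
            spans.append((start, i + 1))
            start, h = i + 1, 0
    if start < len(data):
        spans.append((start, len(data)))
    return spans

random.seed(0)
data = bytes(random.randrange(256) for _ in range(2000))
spans = chunk_spans(data)
```

Static chunking, by contrast, would simply cut every N bytes; it is cheaper but a single inserted byte shifts every later boundary, so previously identical chunks no longer line up.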
Primary and Secondary Storage Data Deduplication
This process works more like a traditional backup. There is a primary storage system that holds all the original data, while the secondary storage systems hold only secondary or duplicate data. The goal is to optimize the performance of the primary storage system while keeping costs down. Problems occur when the system needs to be recovered, as the data must be sifted through to find the right copies to reload into the system.
A good data deduplication system will often use a combination of all three of these processes. It might use the chunking method first to find the most obvious duplicates. Hash-based deduplication might then be used to check for duplicates more thoroughly. To avoid data loss, a secondary storage system may be used after both of these processes have run. This ensures that space is optimized without the loss of important data.
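A hypothetical sketch of that combined pipeline: fixed-size chunking first, then hash-based deduplication, with a byte comparison on every hash hit so a collision (a false positive) diverts the chunk to a secondary "quarantine" store for rechecking instead of being silently discarded. The names and sizes here are illustrative assumptions, not a real product's design.

```python
import hashlib

def dedupe_pipeline(data: bytes, chunk_size: int = 64):
    """Chunk, hash, and deduplicate; quarantine suspected false positives."""
    primary = {}     # digest -> the single stored copy of a chunk
    quarantine = []  # chunks whose digest matched but whose bytes differed
    recipe = []      # ordered digests needed to rebuild the stream
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest in primary and primary[digest] != chunk:
            # Hash hit but bytes differ: park it for a different recheck
            # process rather than lose data to a false positive.
            quarantine.append(chunk)
        else:
            primary.setdefault(digest, chunk)
        recipe.append(digest)
    return primary, quarantine, recipe

# Five 64-byte chunks, only two of which are distinct.
data = b"x" * 256 + b"y" * 64
primary, quarantine, recipe = dedupe_pipeline(data)
```

With SHA-256 the quarantine should stay empty in practice; the point of the sketch is where the safety check sits in the pipeline, not the likelihood of a collision.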