Does Data Deduplication
BY PIERRE DORION

DATA DEDUPLICATION has been a hot topic and a fairly common practice in disk-based backups and archives. Users' initial wariness seems to have given way to adoption, and a deeper focus on the technology has opened up more ways to leverage the benefits of deduplication. The next frontier for deduplication is in the realm of primary storage.


Understanding Primary Storage

Primary storage consists of disk drives (or flash drives) on a centralized storage-area network (SAN) or network-attached storage (NAS) array, where the data used to conduct business on a daily basis is stored. This includes structured data such as databases, as well as unstructured data such as email data, file server data, and most file-type application data. It's important to understand this difference because not all data is suitable for primary storage deduplication.


Data Deduplication Defined

There are two main types of data deduplication: inline and post-process. Inline deduplication identifies duplicate blocks as they're written to disk; post-process deduplication deduplicates data after it has been written to disk. Inline deduplication is considered more efficient in terms of overall storage requirements because non-unique or duplicate blocks are eliminated before they're written to disk. Because duplicate blocks are eliminated up front, you don't need to allocate enough storage to hold the entire data set for later deduplication. However, inline deduplication requires more processing power because it happens "on the fly"; this can potentially affect storage performance, which is a very important consideration when implementing deduplication on primary storage. Post-process deduplication, on the other hand, doesn't have an immediate impact on storage performance because it can be scheduled to take place after the data is written. However, unlike inline dedupe, post-process deduplication requires the allocation of sufficient storage to hold an entire data set before it's reduced via deduplication.
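To make the inline write path concrete, here is a minimal sketch in Python (not from the article; the fixed 4 KB block size, SHA-256 fingerprints, and in-memory index are illustrative assumptions). Each incoming block is fingerprinted and checked against the index before anything is written, so duplicate blocks never consume new storage:

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; real arrays may use variable-length chunks

class InlineDedupStore:
    """Toy write path: only unique blocks are stored; duplicates become references."""

    def __init__(self):
        self.blocks = {}    # fingerprint -> block data (stands in for disk storage)
        self.file_map = {}  # file name -> ordered list of block fingerprints

    def write(self, name, data):
        refs = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            fp = hashlib.sha256(block).hexdigest()
            # Inline step: look up the fingerprint *before* writing.
            # Only previously unseen blocks consume new storage.
            if fp not in self.blocks:
                self.blocks[fp] = block
            refs.append(fp)
        self.file_map[name] = refs

    def read(self, name):
        return b"".join(self.blocks[fp] for fp in self.file_map[name])


store = InlineDedupStore()
store.write("a.bin", b"hello world" * 1000)
store.write("b.bin", b"hello world" * 1000)  # duplicate content: no new blocks stored
print(len(store.blocks), "unique blocks stored")
```

In this toy model the duplicate file b.bin adds no new blocks; a production array would keep the fingerprint index on persistent media and handle reference counting and hash collisions, which is where the extra inline processing cost comes from.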


Determining Best Fit Data

How do you determine which primary data is a good fit for deduplication? This is where the difference between structured and unstructured data comes into play. A database can be a significantly large file, subject to frequent and random reads and writes. For that reason, the majority of this data can be considered active, which means any processing overhead associated with deduplication could significantly impact I/O performance. In comparison, if we examine data on a file server, we quickly see that only a small portion of files are written to more than once, and usually only for a short period of time after they were created. That means a very large portion of unstructured data is rarely accessed, making it a prime candidate for deduplication.


This allows rules to be set to deduplicate data based on a "last access" time stamp. Shared storage for virtual servers or desktop environments also presents good opportunities for deduplication because many operating system files aren't unique.
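As a rough illustration of such a rule (my own sketch, not from the article; the 30-day threshold and the /srv/fileshare path are placeholders), the following walks a file share and flags files whose last-access time stamp is older than a configurable cutoff. File systems mounted with access-time updates disabled would need a different marker:

```python
import os
import time

AGE_DAYS = 30  # illustrative threshold; a real policy would be tuned per data set

def dedup_candidates(root, age_days=AGE_DAYS):
    """Yield files whose last-access time is older than the threshold."""
    cutoff = time.time() - age_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if st.st_atime < cutoff:  # not accessed within the window
                yield path

for path in dedup_candidates("/srv/fileshare"):
    print(path)
```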


Other data selection criteria include format and data retention. Encrypted data, and some imaging or streaming video files, tend to yield poor deduplication results because of their random nature. In addition, data must reside in storage for some time to generate enough duplicate blocks to make deduplication worth the effort. Transient data that's only staged to primary storage for a short period, such as message queuing data or temporary log files, should be excluded. And while archived data yields the best deduplication ratios, that type of data isn't suitable for our primary storage discussion.
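One way to sanity-check those criteria, sketched below under the same assumptions as before (4 KB fixed blocks, SHA-256 fingerprints, a placeholder file path), is to measure how many repeated blocks a file actually contains before enrolling it; encrypted or already-compressed content will score close to zero, while OS images and long-lived documents tend to score much higher:

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed-size blocks, as in the earlier sketch

def duplicate_block_ratio(path):
    """Return the fraction of a file's blocks that repeat an earlier block."""
    seen = set()
    total = 0
    duplicates = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            total += 1
            fp = hashlib.sha256(block).hexdigest()
            if fp in seen:
                duplicates += 1
            else:
                seen.add(fp)
    return duplicates / total if total else 0.0

# Placeholder path: random (encrypted/compressed) data scores near zero,
# while VM templates or stable documents score much higher.
print(duplicate_block_ratio("/srv/fileshare/vm-template.img"))
```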


Inline vs. Post-Processing

Let's say you've excluded encrypted data, streaming video and transient data, and you've established rules to determine "last access" and retention. You've identified primary data storage that's a good fit for deduplication. This is when you'll have to choose between inline and post-process deduplication. The ability to deduplicate files once they've been inactive, or not accessed for some time, favors post-process deduplication over inline, because only selected data is processed, at a later time, based on specific criteria and after it has been written to disk. Remember,







