Guest Column | July 7, 2009

The Cost of Free Deduplication


Written by: Janae Lee, SVP of marketing at Quantum

Deduplication is suddenly everywhere. From its start in specialized appliances, it has spread to ISV agents, media servers, and hardware, even into hypervisors like VMware.

An interesting new option allows customers to simply add deduplication as a feature to their backup software. Some backup vendors have suggested that this approach is the best alternative for all customers, all of the time. No need to consult with value-added partners; just install the ‘free’ feature. The logic seems irrefutable. This option offers customers 100% integrated backup management in software they already own and administer. It’s a nonbillable feature, so it’s “free.” And since the backup vendors themselves are admitting their historical flaws and offering to fix them, why shouldn’t this be the universal option for all new disk-based functionality, including deduplication?

This sounds like a great idea: simple, nondisruptive, and low cost. And for small customers or those with lower data growth, this may be true. But for many customers, the answer is more complex.

The backup products customers use today were architected for the lower volumes of data in an earlier age of data protection — before backup to disk. Backup software was initially designed with the assumption that backups were being directed to tape devices that had to be kept streaming at all costs. This approach drove the need for complex and costly processes that users still live with today, including the need to scan each file system nightly for changed data, or to operate a full plus incremental backup model in order to fit larger data volumes into a disappearing backup window. Backup vendors have made continuous improvements. Despite some historical hiccups — particularly those infamous releases when each vendor made iterative improvements to index scalability — most backup software can now scale to address millions of files. And enterprise-class backup software products now offer disk-centric features like embedded management of multiple copies of data, either disk-to-disk, or disk and tape.

Why shouldn’t deduplication be added — universally — in this same model? If it gives customers the benefits of deduplication efficiency, in combination with full data management integration, why shouldn’t every customer adopt it for every class of data?

The answer: because traditional backup software wasn’t designed for this class of problem.

Deduplication reduces data volume requirements. It is a form of data search and index: software algorithms that can rapidly find each unique string of data and link it to many files. The value of deduplication is directly related to the software’s efficiency at finding the maximum number of identical strings, and how broadly it can then index them. The “secret sauce” in each deduplication product is how granularly and efficiently it finds the strings, and how scalably it can index them. As these strings are generally much smaller than a typical file, the scalability requirements can be orders of magnitude higher than for traditional backup. Deduplication engines were invented to manage a vastly larger number of objects, and scalability is a core value of the technology.
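To make the "search and index" idea concrete, here is a minimal sketch in Python. It is an illustration only, not how any vendor's product works: the class name `DedupStore`, the fixed-size chunking, the tiny 4-byte chunk size, and the use of SHA-256 as the chunk fingerprint are all assumptions chosen for brevity. Commercial engines typically use more sophisticated chunking and indexing.

```python
import hashlib


class DedupStore:
    """Hypothetical toy store: each unique chunk is kept once and
    every file is recorded as an ordered list of chunk fingerprints."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size  # assumed value, tiny for illustration
        self.chunks = {}              # fingerprint -> chunk bytes (the index)
        self.recipes = {}             # file name -> list of fingerprints

    def put(self, name, data):
        fingerprints = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            fp = hashlib.sha256(chunk).hexdigest()
            # Store the chunk only if this fingerprint is new.
            self.chunks.setdefault(fp, chunk)
            fingerprints.append(fp)
        self.recipes[name] = fingerprints

    def get(self, name):
        # Rebuild the file by following its list of fingerprints.
        return b"".join(self.chunks[fp] for fp in self.recipes[name])

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())


store = DedupStore(chunk_size=4)
store.put("monday.bak", b"AAAABBBBCCCC")
store.put("tuesday.bak", b"AAAABBBBDDDD")

logical = len(b"AAAABBBBCCCC") + len(b"AAAABBBBDDDD")
print("files:", len(store.recipes))
print("index entries:", len(store.chunks))
print("logical bytes:", logical, "stored bytes:", store.stored_bytes())
```

Even in this toy case the index holds more entries than there are files, and the gap widens as chunks get smaller relative to file size. That is the scalability pressure described above: the index, not the file catalog, becomes the object count that has to scale.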

Deduplication is not a simple technology. Major target deduplication vendors like Data Domain and Quantum have been working for years to optimize their products, each of which is now in at least its third generation. Even enterprise backup vendors like Symantec and EMC, who have added acquired technology to their portfolios, target the use of these features to markets where management integration is a core value, and where data scalability and performance per server is generally bounded. These products are best matched to data protection of many independent remote branches, or multiple virtual machines with primarily unstructured files.

There is a tradeoff between the benefit of integrated backup management vs. the value of scalability, which is why universal adoption of deduplication as a feature of traditional backup is not wise, or likely. For a small business or departmental customer for whom data growth is not an issue, adding deduplication as a feature to his current backup model may make perfect sense. For a growing midrange or enterprise customer, it makes no sense at all.

A second tradeoff is disruption. Customers were initially slow to adopt source-based deduplication, which used specialized deduplicating backup agents before sending data to a backup server. This approach requires customers to swap their backup software, introducing new skills, processes, and infrastructure requirements, which is a major disruption. On the other hand, with a target-based approach, deduplication occurs as a transparent process on the target system by processing the incoming stream during ingest. As a result, target-based deduplication has enjoyed much wider adoption. With the addition of deduplication-as-a-feature to traditional backup software, hasn’t this inhibitor gone away? Maybe. A VAR should expect his vendor to mask deduplication features behind existing administration and management interfaces to maximize current skills and ease adoption. However, issues still exist in deployment. Although the same brand may be printed on the CD, the software still needs to be changed, and load balancing on clients and media servers, as well as the hardware to operate it, will be impacted. This needs to be pushed through the traditional planning, change and operational processes. That reality again points to the likelihood that backup feature-based deduplication is a better fit for smaller companies with limited numbers of systems to update, and fewer complex environments to plan.

The topic of infrastructure change leads to the topic of cost. One of the perceived advantages of deduplication as a software feature is its low price. This is particularly true if it’s “free.” Unfortunately, to quote numerous smart people (from Heinlein to Friedman), “there’s no such thing as a free lunch.” As a search and index technology, deduplication creates new work for the customer’s backup infrastructure. Whether this added work is done by an agent on the production server or the media server, or by software on the core index server, it puts new weight on the infrastructure. This inevitably results in the need to add resources: bigger clients or bigger servers, and possibly horizontal scaling of clients, media servers or backup servers.

Customers will continue to need help from VARs to recommend the appropriate deduplication solution. Deployment of ‘free’ deduplication features on traditional backup software will not be universal; these features will fit primarily in low-data environments (e.g., small businesses) where some deduplication is “good enough,” particularly if it is embedded in a well-known, simple-to-use management stack. But where there’s a massive deployment of endpoints needing deduplication, the likely solution is one optimized for this use case, such as a client-deduplication solution, which will drive a large ROI. Where there are larger data volumes and growth, the likely solution is a target appliance, with its nondisruptive approach and scalability to many copies of data, again driving a large ROI.

If deduplication can drive a huge ROI, there’s no need for it to be free; it will pay for itself through the business case it delivers.

Janae Lee is senior vice president of marketing at Quantum. Janae has more than 30 years of experience in the storage market, including nine years of focus on deduplication.