Microsoft Research takes new approach to data compression, but so far it works only on Azure.

Kevin Fogarty, Technology Writer

September 10, 2012

4 Min Read

One of the big disadvantages of big data is that it requires big storage--potentially hundreds of petabytes, exabytes, or even zettabytes of storage.

Microsoft Research published a paper this week describing a more efficient way of cramming the 4 trillion objects stored in its cloud-based Windows Azure Storage into slightly less Windows Azure Storage.

The reason for doing this is simple enough, as stated in Microsoft's release: "Storing massive amounts of information in the cloud comes with costs, however--primarily the cost of storing all that digital data."


Those costs include the disk arrays; expansion disks; replacement arrays; extra bills for support and repair; additional climate-controlled data center space to house all the extra disks; real estate for the extra data center space; salaries and benefits for skilled technicians to hook up, manage, and expand all that storage; bandwidth to make it available to customers; programmers to write the software to make cloud-based disks useful--and of course, PR and marketing staffs to spread the news about all that storage space.

The costs add up quickly and multiply in line with the number of required disks. That means even a trivial reduction in the space required to store a specific object--whether that reduction comes from better compression, more consistent lifecycle management, or accidental but frequent deletions--can dramatically reduce costs.

Save Space by Deleting Stuff

With the goal of taking a big bite out of storage costs, Microsoft's team--culled from the Windows Azure division and Microsoft Research--built compression software that takes a different approach to storage management. So far, no other vendors have joined Microsoft in promoting deletion as an approach to mass storage. But that could change as the technology emerges from the research-and-development phase and develops a more practical track record.

In a commercial cloud environment, Microsoft's Douglas Gantenbein points out, the space required to store a single file isn't equal to the amount of disk space in which that file will fit. Each file must be stored at least three times on different disk arrays or different servers in order to safeguard it against crashes or other disasters.

Traditional lossless data compression algorithms go through a file and take out sequences of bits that are statistically redundant, keeping a record of what they eliminated and where it was so everything can be put back later, when the file is uncompressed.
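
For readers who want to see the lossless idea in code, here is a minimal sketch in Python. It illustrates the general principle only--it is not Azure's software--and the function names are invented for the example: repeated bytes collapse into value-and-count pairs, and decoding expands them back so the output matches the input exactly.

# A toy run-length encoder: a minimal, illustrative example of lossless
# compression, not anything Azure actually ships. Repeated bytes collapse
# into (value, count) pairs, and decoding restores the input exactly.

def rle_encode(data: bytes) -> list:
    encoded = []
    for byte in data:
        if encoded and encoded[-1][0] == byte:
            encoded[-1] = (byte, encoded[-1][1] + 1)
        else:
            encoded.append((byte, 1))
    return encoded

def rle_decode(encoded: list) -> bytes:
    return b"".join(bytes([value]) * count for value, count in encoded)

original = b"aaaaabbbcccccccd"
assert rle_decode(rle_encode(original)) == original   # bit-for-bit identical

Real lossless codecs use far more sophisticated models than this, but the contract is the same: decompression reproduces the original exactly.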

"Lossy" data compression methods, such as those used in MP3 files, eliminate levels of detail in order to reduce the amount of data to be stored.

The new storage approach for Windows Azure--called "lazy erasure coding"--is similar to lossless compression in that data is removed, but a shortened, coded version is kept that allows the original to be rebuilt later on. When a chunk of data is processed, it is split into two groups: data segments to be stored, and parity segments the software can use to rebuild any piece that turns out to be corrupted or missing. All the data segments and parity segments are then distributed to different physical locations, so the loss of one won't mean the loss of all, and the original three copies are deleted. The result is a set of chunks that can be reconstituted bit for bit, but that occupy roughly half the space the three full copies did.
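
To make those mechanics concrete, the sketch below uses a single XOR parity fragment, a deliberately simplified stand-in for the Reed-Solomon-style codes Azure actually relies on; the function and fragment names are invented for the example.

# Toy erasure coding with a single XOR parity fragment. This is a much
# simpler stand-in for the codes Azure actually uses, but the mechanics
# are the same: split a chunk into data fragments plus parity, scatter
# them, and rebuild any one lost fragment from the survivors.

def split(chunk: bytes, k: int) -> list:
    size = -(-len(chunk) // k)                      # ceiling division
    return [chunk[i * size:(i + 1) * size].ljust(size, b"\0") for i in range(k)]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(chunk: bytes, k: int = 4) -> list:
    frags = split(chunk, k)
    parity = frags[0]
    for f in frags[1:]:
        parity = xor(parity, f)
    return frags + [parity]                         # k data fragments + 1 parity

def rebuild(frags: list) -> list:
    missing = frags.index(None)                     # assume exactly one fragment lost
    survivors = [f for f in frags if f is not None]
    recovered = survivors[0]
    for f in survivors[1:]:
        recovered = xor(recovered, f)
    frags[missing] = recovered
    return frags

fragments = encode(b"hello azure storage", k=4)
lost = fragments[2]
fragments[2] = None                                 # simulate a failed disk or server
assert rebuild(fragments)[2] == lost                # recovered bit for bit

With four data fragments and one parity fragment, this toy scheme stores 1.25 times the raw data but survives only a single failure; Azure's production codes carry several parity fragments so they can ride out multiple simultaneous disk or server losses.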

The method is similar to Reed-Solomon coding, a technique invented in 1960 and used in the U.S. space program and in error correction for compact discs.

Looking for even more savings, the Microsoft group made a bet on the reliability of its hardware, cutting the number of fragments that must be read to reconstruct a missing piece of data. That makes reconstruction faster and reduces the overall space required. The effort, which replaced Reed-Solomon codes with Local Reconstruction Codes, lowered the storage overhead even further, from 1.5 times the size of the raw data (half the space taken by three full replicas) to 1.29 times (well under half).
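
The arithmetic behind those figures is simple enough to show directly. The fragment counts in the sketch below are assumptions, chosen only to reproduce the published ratios; they are not confirmed Windows Azure Storage parameters.

# Storage overhead = (data fragments + parity fragments) / data fragments.
# The fragment counts below are illustrative guesses that reproduce the
# ratios cited in the article, not Microsoft's confirmed configuration.

def overhead(data_fragments: int, parity_fragments: int) -> float:
    return (data_fragments + parity_fragments) / data_fragments

print(overhead(1, 2))    # 3.0   three full replicas of every chunk
print(overhead(6, 3))    # 1.5   a Reed-Solomon-style layout
print(overhead(14, 4))   # ~1.29 more data fragments per parity fragment

Spread across 4 trillion objects, trimming the multiplier from 1.5 to 1.29 means buying roughly 14% fewer raw disks for the same customer data.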

Microsoft researchers described the technique in greater detail in a presentation at the 2012 Usenix Annual Technical Conference in June; it's also laid out in a downloadable white paper.

The new technique--in short, a modification of Reed-Solomon coding that uses the Microsoft-invented Local Reconstruction Codes to increase the compression even further--was designed for Azure but could find its way into other Microsoft products as well. It would be particularly appropriate for "flash appliances" that use several flash-memory drives as part of a single storage unit, or possibly for solid-state drives used in laptops and other portable devices, for which weight and battery power are greater concerns than disk space.


About the Author(s)

Kevin Fogarty

Technology Writer

Kevin Fogarty is a freelance writer covering networking, security, virtualization, cloud computing, big data and IT innovation. His byline has appeared in The New York Times, The Boston Globe, CNN.com, CIO, Computerworld, Network World and other leading IT publications.
