Big Data: No Hoarding Allowed - InformationWeek



Big Data: No Hoarding Allowed

The best insights come from data you've just collected, not the musty bits you've saved for years, argues SumAll's CEO.


The save-everything mantra chanted by many big data proponents is a waste of money and resources, as organizations will gain little, if any, actionable insights from massive stockpiles of archived data. Rather, the real big data payback comes from near-real-time analysis of information as it's collected.

So says Dane Atkinson, CEO of SumAll, a three-year-old data analytics startup based in New York City. SumAll's platform takes in data from a variety of sources, including social media, email, and e-commerce, and allows companies to analyze the information right away.

Given the real-time nature of SumAll's business, perhaps it's no surprise that its CEO would preach the benefits of fast-acting data analysis. Then again, Atkinson isn't the only big data player to point out the shortcomings of information hoarding.

In a phone interview with InformationWeek, Atkinson noted that companies often warehouse big data at great expense, even when they're not sure what insights they'll gain from it. And if they don't know which questions to ask of it today, they're hopeful the astute queries will come months, or even years, down the road.


"That's the theory. That's exactly it: 'We don't know smart questions to ask now, so we're going to keep it all so that we can ask them later,'" said Atkinson, distilling the common rationale behind data hoarding, which he considers an expensive process with a dubious ROI.

"It costs a lot of money," he said. "It costs us millions of dollars a year to store our customers' data."

But despite the expense, the popular trend is to save it all.

"It's not even a question. Every company, every Internet company, tries to store all the data they possibly can," he claimed. "They believe in this theory of big data, that it'll someday be valuable."

(Source: W.Rebel)

Atkinson wasn't suggesting that companies stop storing data altogether, but rather that they do so more efficiently and with a clearly defined strategy.

"We would highly discourage storing it in a fashion that's sort of the definition of big data -- where you have it in some SSD environment on Amazon, or on a rack of servers that are costing you a fortune -- because you're not getting value out of it," he said. "You're not asking questions because it's just too big."

Still, companies often become data hoarders.

"They're living in the hoarder's environment," said Atkinson. "They're taking in all the data and putting it into a repository."

One alternative: Rather than saving every bit, companies should determine the questions they want to ask of their data, and then store the indexes they really need, a move that "will take your data down by many factors," he claimed.

Take a retail business, for instance.

"You may not need to have every second's worth of transactional history over the last four years, but it's probably pretty handy to know how [each] day went," said Atkinson. "So rolling up those 60 minutes into an hour metric [will] give your team really good guidance on the trends and patterns they want to see."

Rather than storing, say, the 2 billion transactions your business did in the past two years, save an index that tallies the hourly transaction totals during that period, he added.

This approach can greatly reduce the size of your data hoard -- "gigabytes versus terabytes," claimed Atkinson.
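The roll-up Atkinson describes can be sketched in a few lines of Python. This is only an illustration, not SumAll's method: the `(timestamp, amount_cents)` record shape, the `hourly_rollup` name, and the sample values are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime


def hourly_rollup(transactions):
    """Collapse (timestamp, amount_cents) records into per-hour summaries.

    The record shape is a hypothetical one chosen for this sketch.
    Amounts are integer cents so the totals stay exact.
    """
    buckets = defaultdict(lambda: {"count": 0, "total_cents": 0})
    for timestamp, amount_cents in transactions:
        # Truncate the timestamp to the start of its hour.
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        buckets[hour]["count"] += 1
        buckets[hour]["total_cents"] += amount_cents
    return dict(buckets)


# Made-up sample data: two sales in the 9 AM hour, one in the 10 AM hour.
sample = [
    (datetime(2014, 7, 7, 9, 5, 12), 1999),
    (datetime(2014, 7, 7, 9, 48, 3), 500),
    (datetime(2014, 7, 7, 10, 1, 59), 4250),
]

rollup = hourly_rollup(sample)
for hour in sorted(rollup):
    print(hour.isoformat(), rollup[hour])
```

Two years of hourly buckets comes to at most about 17,520 rows (2 x 365 x 24), against the 2 billion raw transactions in the example above. The trade-off is that any question needing finer detail than an hour can no longer be answered from the index alone.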

Again, however, he finds few businesses are slimming their data stockpiles.

"It's only the really smart companies that have started to pare that down," said Atkinson. "They may have the hoarder's closet somewhere, but they've also made a new [data] store that's much more efficient, that tries to answer smart questions and not just grab hold of everything."


Jeff Bertolucci is a technology journalist in Los Angeles who writes mostly for Kiplinger's Personal Finance, The Saturday Evening Post, and InformationWeek.

User Rank: Apprentice
7/8/2014 | 5:42:52 AM
Not Applicable to everything
Although a very interesting read, this is not always applicable to everything, such as sensitive data sets (e.g., healthcare and the NHS, where I work in the UK).

I tend to keep everything so that we can do year-by-year comparisons on the growth of a subject, and more. For example, two years ago there was no "SOP" in place and now there is; this has been the change in the data.

User Rank: Ninja
7/7/2014 | 11:19:21 PM
Re: Theory vs. reality
Playing it safe (storing everything) is okay up to a certain point; it keeps one risk-free. However, that option will disappear, since the amount of data being generated is increasing and will require ever more storage. Humans are, by nature, hoarders. At the same time, it is tough to summon the courage to hit 'Delete' and confirm 'Yes'.
Doug Henschen
User Rank: Moderator
7/7/2014 | 3:35:23 PM
Re: Theory vs. reality
Another opinion on big data from a self-interested vendor. Atkinson's "cost millions to data warehouse" perspective is a little dated. And the example he offers, tied to structured transactional data, is also not a very "big data" frame of reference.

The point of aggregating to the hour instead of the second is simple enough -- conventional wisdom, really. But this seems like a very conventional frame of reference focused on developing analytics based on recency, frequency, and monetary value. What about variable data types like clickstreams, log files, or social data? That's when data gets really big. It's not just a matter of collecting more of the same old data. 
Thomas Claburn
User Rank: Author
7/7/2014 | 1:55:27 PM
Re: Theory vs. reality
If only someone could convince the NSA of the merits of not hoarding data.
Lorna Garey
User Rank: Author
7/7/2014 | 1:47:24 PM
Theory vs. reality
It's all great in theory. However, to save selectively requires effort and will -- data classification programs, someone to decide to delete X set and take the fall if it's needed someday, etc. Meanwhile, storage is cheap and getting cheaper.