Is Your Company Running A Data Dump? - InformationWeek



Hoarding useless data makes analytics harder. Companies like Paxata say their brand of analytics lets non-data experts turn data landfills into useful info.

Companies of all sorts are now in the garbage business. Without even thinking about it, they collect so much data that they end up with data garbage dumps, filled with bad data.

The big difference between data dumps and real landfills is the smell; bad data doesn't have an odor. That's probably why companies keep collecting data they don't need. Keeping data is also cheap, and it has only gotten cheaper in the last few years. That just makes sifting through the data harder to do.

"There's so much data from different places and in different formats. It's very difficult to treat that data," says Jon Oltsik, an analyst at Enterprise Strategy Group in Milford, Mass.

[What does "real time" mean, anyway? Read Real-Time Analytics: Ready For Its Close-Up?]

The rise of post-relational database tools such as Hadoop, MongoDB, and Cassandra has lowered data storage costs, says Nenshad D. Bardoliwalla, cofounder and vice president of product at Paxata, a startup that uses machine learning and analytics to automate and accelerate the data preparation part of big data. No longer do companies need to think about what they're storing.

"Companies have flipped their mentality to just store it all, rather than just the data they really want," he says.

Bardoliwalla was at Hyperion in an earlier era of data warehousing, and others involved in founding Paxata were at SAP, Tibco, and Guidewire.

Paxata's founders think they've used analytics to help turn big data landfills into compost. They argue the problem companies face is in preparing data, which is time consuming and costly. Bardoliwalla says that data preparation either takes place through arduous hand coding, with specialists using tools like Informatica and Trillium, or trying to scrub data in Excel.

Image courtesy of St. Louis County.

Paxata applies analytic techniques to data sources to see whether Michael Fitzgerald, Mike Fitzgerald, and M Fitzgerald in different databases might all be the same person, for instance. Its software figures out the answer on its own, so a user does not have to inspect the records manually. For very large data sets, that promises huge time savings.
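The article doesn't describe Paxata's actual algorithms, but the kind of name matching it alludes to can be illustrated with a minimal sketch. The function below is a hypothetical, simplified stand-in: it combines crude string similarity (Python's standard-library `difflib`) with an initial-plus-surname check to flag likely duplicates. The names, threshold, and matching rules are all illustrative assumptions, not Paxata's method.

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lowercase and strip punctuation so name variants compare cleanly."""
    return "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace()).strip()

def likely_same_person(a: str, b: str, threshold: float = 0.6) -> bool:
    """Crude fuzzy match: high string similarity, or a matching first
    initial with the same surname, counts as a probable duplicate."""
    a, b = normalize(a), normalize(b)
    if SequenceMatcher(None, a, b).ratio() >= threshold:
        return True
    pa, pb = a.split(), b.split()
    if pa and pb and pa[-1] == pb[-1]:   # same surname
        return pa[0][0] == pb[0][0]      # "M" matches "Michael"
    return False

records = ["Michael Fitzgerald", "Mike Fitzgerald", "M Fitzgerald", "Jon Oltsik"]
dupes = [r for r in records[1:] if likely_same_person(records[0], r)]
print(dupes)  # the two Fitzgerald variants are flagged; Jon Oltsik is not
```

Real entity-resolution systems go much further, scoring many fields (email, address, phone) and learning match rules from labeled data, which is where the machine-learning claim comes in.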

"The value there is exactly as they say," Oltsik says. He has no ties to Paxata and has not looked at its product.

Paxata's target user is someone like the company's vice president of marketing, an experienced user of Excel, but not a "super jock." She needs information from disparate sources, and needs to know things such as whether a sales lead is a duplicate, and if information about it is correct. Providing that context to data sets is one of the things that costs analysts precious time.

The rule of thumb is that data preparation takes up 80% to 90% of the time people spend on data, leaving a small fraction for actual analysis. "People pour things into the data landfill. They don't even know it's there," Bardoliwalla says. "There's a huge discoverability problem that needs intelligent algorithmic techniques and visualization techniques to allow computers to do the heavy lifting."

Bardoliwalla wants to flip the ratio of time that analysts spend on data, so they can spend 80% of their time analyzing data sets. There is value in data, but getting to the value might be more expensive than the data is worth, like ore buried too deeply in a mine.

Paxata says it has about a dozen customers, including data storage firm Box, Dannon, the American unit of French yogurt maker Groupe Danone, and the big Swiss financial firm UBS. It is also not alone in the market: just today I received an email for a pre-briefing on a similar product from another data company.

Perhaps some day soon companies will spend their time making hay from their data.


Michael Fitzgerald writes about the power of ideas and the people who bring them to bear on business, technology and culture. View Full Bio

User Rank: Ninja
2/12/2014 | 8:10:46 AM
Re: E-discovery
There's big money in e-discovery. At least for the lawyers and vendors making e-discovery tools. The enterprise being targeted is the loser because of the time, effort, and cost of digging through all of that stuff.

I suspect that many companies are keeping too much data because they don't have a strategy well enough defined to outline a use or purpose for that data. Then, since storage is relatively cheap, they keep everything.

The flip side of that coin is the companies run by executives who've been bitten before by lawsuits and keep everything for CYA purposes.

Either way, it's to the company's detriment to keep absolutely everything. Decide what you need and keep that. Much better and more effective than keeping everything and eventually (or not) deciding what you need.
Lorna Garey,
User Rank: Author
2/11/2014 | 1:45:54 PM
We hear all the time about companies spending millions on e-discovery requests, and about lawyers coming up with a 'smoking gun' from some obscure data source that no one thought to delete. To wit: Chris Christie, as New Jersey digs for Bridgegate evidence.

Do you think the 'keep everything forever' mindset is going to play into this, making money for e-discovery software firms and consultancies and teams of lawyers?
Michael Fitzgerald,
User Rank: Moderator
2/11/2014 | 1:17:59 PM
Re: Data dump by another name
I had the same thought about creating a smell for bad data. We all know data decays over time. It would be fun, and telling, to have data records take on a different hue as they aged, perhaps. You could then apply a little data air freshener. Or put it in a data coffin...
User Rank: Author
2/11/2014 | 10:45:59 AM
Re: Data dump by another name
"Data lake" doesn't do the practice justice. Lakefront property fetches a premium. No one's looking to drain lakes (for the most part) or reduce their size. For those subjected to driving through Staten Island, think Arthur Kill. Local residents couldn't close up that dump fast enough. Perhaps if rotting data smelled (a perverse market opportunity here?), companies wouldn't hoard so much of it.  
User Rank: Author
2/11/2014 | 10:27:54 AM
Data dump by another name
EMC likes to use the term "data lake" to describe the vast amount of data customers are grappling with. That sounds more pleasant -- but at some companies, data dump must certainly be more accurate.