It's the Data, Stupid

Commentary by Meta S. Brown, 2/9/2015 02:00 PM

When it comes to acquiring the data that will feed your analytics initiative, "free" isn't always the best approach.

After a recent talk, I was bombarded by questions. What programming language did I use for this? What tool did I like best for that? In each response, I reminded my audience to focus on business problems first and tools last.

Technology dazzles. It’s easy to equate great analysis with the best algorithms, software, and coding. But getting the best information from analytics is really about the quality and relevance of your data.

Somebody mentioned scraping a social network site for data. “You should not be doing that,” I said. Someone else chimed in, telling her to use the social network’s application programming interface (API) instead. “No,” I said, “that API is not intended to support analytics.” The way to get appropriate data for the intended use, I explained, was to buy it from one of the vendors licensed to provide that data for analysis.

Everyone stared at me in horrified silence. Buying data was unthinkable to them. Yet obtaining that particular data in any other way would likely lead to biased results.

The fundamental assumption of all data analysis is that the data you use is representative of the things you want to know about. The data you use is more important to your results than any other part of the process. You must use the source that is most relevant, not what’s free, convenient, or cool.

What can go wrong when the data you use isn’t truly representative for your application? Everything.

  • Founders of one technology startup were not acquiring many paying customers. Their market research had consisted of a survey of personal contacts, a very biased sample. Had these founders surveyed a representative sample of their target market instead, they could have learned, before investing time and money in development, that few people were prepared to pay for their product.
  • Google Flu Trends, an ongoing collaboration between Google and the Centers for Disease Control and Prevention, aims to detect influenza outbreaks and assess their magnitude as they develop. Successes of the program have received significant news coverage. But, as Nature News has reported, Google Flu Trends dramatically overestimated flu cases in one year and underestimated them in another. Google’s data resources are vast, but still only loosely relevant for this purpose.
  • Today’s political surveys typically poll around 2,000 people each, which may not seem like a lot when over 100 million votes will be cast in an election. Wouldn’t more be better? Not necessarily, since larger sample sizes come with greater challenges for ensuring that data is properly collected and analyzed. One 1936 survey by Literary Digest gathered data from over 2 million respondents, yet incorrectly predicted the winner of that year’s presidential election. Gallup’s much smaller, but carefully conducted, poll got it right.
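
The Literary Digest episode is easy to reproduce in miniature. Below is a minimal simulation sketch in Python, with made-up numbers rather than data from the actual 1936 poll: it draws a huge sample from a subgroup that leans one way and a small random sample from the whole population, then compares the two estimates.

    import random

    random.seed(42)

    # Hypothetical electorate: 55% support candidate A overall, but the
    # subgroup a biased frame reaches (think magazine subscribers and
    # telephone owners in 1936) supports A only 40% of the time.
    POPULATION_SUPPORT = 0.55
    BIASED_FRAME_SUPPORT = 0.40

    def poll(support_rate, n):
        """Simulate n yes/no responses; return the observed support rate."""
        hits = sum(1 for _ in range(n) if random.random() < support_rate)
        return hits / n

    big_biased = poll(BIASED_FRAME_SUPPORT, 2_000_000)  # huge, but biased
    small_random = poll(POPULATION_SUPPORT, 2_000)      # small, but random

    print(f"true support:            {POPULATION_SUPPORT:.1%}")
    print(f"2,000,000 biased polls:  {big_biased:.1%}")
    print(f"2,000 random polls:      {small_random:.1%}")

The giant sample estimates its own, wrong, population with great precision; the small random sample lands within a point or two of the truth. No sample size can fix a sampling frame that doesn’t represent the population.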

What can you do to get the most relevant, high-quality data for any project?

  • Begin with a clear understanding of what you need to measure. Does that data exist? If not, can you change your data collection practices or conduct a test (experiment) to create sample data?
  • Look for documentation. What’s the source of the data? What does each field mean? How was the data collected? How is it managed and protected from tampering?
  • Perform your own data quality checks. Is the data you see consistent with what the documentation suggests? Are there many missing cases?
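
As a starting point for that last item, here is a minimal sketch of such checks in Python with pandas; the file name, column names, and documented ranges are hypothetical placeholders for your own.

    import pandas as pd

    # Hypothetical file and columns -- substitute your own.
    df = pd.read_csv("customer_survey.csv")

    # How many missing cases, field by field?
    print(df.isna().sum())

    # Do values fall in the range the documentation promises?
    # (Here, assume the docs say age runs from 18 to 120.)
    bad_age = df[(df["age"] < 18) | (df["age"] > 120)]
    print(f"{len(bad_age)} rows with out-of-range ages")

    # Dates that fail to parse often signal collection problems.
    dates = pd.to_datetime(df["response_date"], errors="coerce")
    print(f"{dates.isna().sum()} unparseable dates")
    print(f"responses span {dates.min()} to {dates.max()}")

    # Exact duplicates can indicate a bad merge or double entry.
    print(f"{df.duplicated().sum()} exact duplicate rows")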

Some of your best data sources may be at risk. For example, if you use neighborhood demographics, the original source of your data is a government statistical agency, the United States Census Bureau, even though you may be getting that data through a vendor or nonprofit organization. The Consumer Price Index (CPI), employment figures, and a host of other data used by business come from government statistical agencies. Yet these agencies are threatened by budget cuts and political challenges.

Don’t be fooled by open data initiatives; these only require that agencies share the data they have. This is not the same as ensuring that useful data will be actively collected, so protect your data sources. Contact your representatives to let them know how important government statistical data is to your business.

Relevant, high-quality data is the most valuable resource for data analysis. Focus on that, and everything else will be easier.

What are you doing to get the best data you can? Please share!

 
