
News | Data Management // Software Platforms | 9/10/2014 03:42 PM

Big Data: How To Pick Your Platform

Hadoop? A high-scale relational database? NoSQL? Event-processing technology? One size doesn't fit all. Here's how to decide.


Feel stuck in neutral? Don't worry. Big data success stories tend to start slowly, for two reasons.

First, there's the drag exerted by relational database administrators who badly want to stick to what they know. Second, big data problems have as much to do with changing how you query and process data as with handling the oft-cited "three V's" -- volume, variety, and velocity. The good news: Once you pick up some steam, big data opens the door to business possibilities you hadn't even considered, and the effort starts to generate its own momentum.

Here's how to get unstuck: Consider the problems you're trying to solve with relational databases and whether other technologies might be more appropriate from a feature perspective. Tackle the limits around the "three V's." And start exploring comprehensive data platforms that can take you beyond simply knowing what a customer is doing to understanding why.

Predict the click
A typical first foray into big data involves analyzing massive amounts of log or event data to identify causal patterns -- commonly called clickstream analysis. What are the top three things mobile users do immediately before they uninstall your app? Can IT identify suspicious behaviors in server logs before someone steals data? How do you detect changes in sensor output significant enough to trigger dispatching a technician?
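
To make the first of those questions concrete: if every row in the log is a single event, answering it comes down to grouping events by user, ordering them in time, and counting what happens immediately before the event you care about. Here is a minimal sketch in Python, with made-up event names and an in-memory log standing in for a real analytics export (nothing in it comes from the story itself):

```python
from collections import Counter, defaultdict

# Hypothetical flat event log: one row per event, as it might come out of an
# app-analytics export. Event names are invented for illustration.
events = [
    {"user": "u1", "ts": 1, "action": "open_app"},
    {"user": "u1", "ts": 2, "action": "push_notification"},
    {"user": "u1", "ts": 3, "action": "uninstall"},
    {"user": "u2", "ts": 1, "action": "open_app"},
    {"user": "u2", "ts": 2, "action": "crash"},
    {"user": "u2", "ts": 3, "action": "uninstall"},
    {"user": "u3", "ts": 1, "action": "crash"},
    {"user": "u3", "ts": 2, "action": "uninstall"},
]

# Group each user's events, then walk them in time order and count the action
# that directly precedes each uninstall.
by_user = defaultdict(list)
for e in events:
    by_user[e["user"]].append(e)

preceding = Counter()
for user_events in by_user.values():
    user_events.sort(key=lambda e: e["ts"])
    for prev, curr in zip(user_events, user_events[1:]):
        if curr["action"] == "uninstall":
            preceding[prev["action"]] += 1

# The three actions most often seen right before an uninstall.
print(preceding.most_common(3))
```

At production volumes this counting would run in Hadoop or a warehouse rather than in a single Python process, but the shape of the work is the same.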

We've used relational databases to tackle these questions for many years, both directly and through enterprise data warehouses. GE Power & Water, for example, has monitored industrial turbines and used that data to predict maintenance needs for a decade.

However, a conventional data warehouse gets you only so far when you rack up 100 million hours of operating and maintenance data across 1,700 turbines, and it won't let you mash all that up with external data, such as weather information, to predict failures. Jim Fowler, now CIO of GE Capital, discussed this at the 2014 InformationWeek Conference while he was still Power & Water CIO. He said investments in new platforms, such as Hadoop and NoSQL databases, to crunch external sources alongside the terabyte of data each sensor-equipped turbine spins off per day should net $66 billion in savings over the next 15 years.

"We've seen the cartel of database vendors broken up, and some great new entrants give us new capabilities that we've never had before at a cost that we've never seen," Fowler said, specifically calling out MongoDB, Talend, and Pivotal, in which GE has invested.

Those savings and that cartel breakup are key, as we'll discuss.

Volume isn't the only challenge, though. Relational databases also make it hard to find out "why," because of the sheer amount of work it takes to formulate, ask, and answer those questions.

For example, to create useful queries about website clickstreams and user application activity logs, you need to "sessionize" the data -- that is, take data in which every row is an event and group together all events from a single "session," so you can ask what happened prior to a particular type of event, such as your mobile app getting uninstalled or a turbine going offline. HP Vertica and Hadoop have offered sessionization features for several years; ParAccel (which underpins Redshift from Amazon Web Services) introduced it last year; and as of earlier this year, Oracle 12c is on board.
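
Those products implement sessionization as built-in functions in their own query engines; to show what the operation actually does, here is a minimal Python sketch that assigns session IDs using a 30-minute inactivity gap (a common convention, not a threshold the story specifies):

```python
from datetime import datetime, timedelta

# Hypothetical clickstream rows: one event per row, keyed by user and timestamp.
rows = [
    ("u1", "2014-09-10 09:00:00", "view_page"),
    ("u1", "2014-09-10 09:05:00", "add_to_cart"),
    ("u1", "2014-09-10 11:00:00", "view_page"),   # long gap, so a new session
    ("u2", "2014-09-10 09:01:00", "view_page"),
]

GAP = timedelta(minutes=30)  # assumed inactivity threshold; tune to your data

def sessionize(rows):
    """Assign a session ID to each event: a new session starts whenever the
    user changes or the same user has been idle longer than GAP."""
    parsed = sorted(
        ((user, datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), action)
         for user, ts, action in rows),
        key=lambda r: (r[0], r[1]),
    )
    sessions = []
    session_id = 0
    last_user, last_ts = None, None
    for user, ts, action in parsed:
        if user != last_user or ts - last_ts > GAP:
            session_id += 1
        sessions.append((user, session_id, ts, action))
        last_user, last_ts = user, ts
    return sessions

for row in sessionize(rows):
    print(row)
```

Once every event carries a session ID, "what happened just before the uninstall?" or "what preceded the turbine going offline?" becomes an ordinary group-and-sort query instead of a bespoke programming exercise.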

To read the rest of this story, download the entire InformationWeek Tech Digest issue, distributed in an all-digital format (registration required).
Joe Emison is a serial technical cofounder, most recently with BuildFax, the nation's premier aggregator and supplier of property condition information to insurers, appraisers, and real estate agents. After BuildFax was acquired by DMGT, Joe worked with DMGT's portfolio ...

Comments
pfretty | 9/12/2014 11:58:49 AM
Great post
I think the platform and software selected really need to match up with what the organization is hoping to accomplish, as well as with its internal capabilities. Whether Hadoop or any of the other technologies is a good fit depends on the organization's data maturity.

Peter Fretty