Hadoop 2.0: New Big Data Possibilities - InformationWeek

11:28 AM
Doug Henschen

Hadoop 2.0: New Big Data Possibilities

Hadoop 2.0 will move beyond batch processing to support interactive, online and streaming applications. But don't let warnings about YARN tie you up in knots.

Hadoop 2.0 will be announced within a matter of days, and a new YARN framework component at its core promises to "take Hadoop beyond MapReduce," according to Arun Murthy, chairman of the Apache committee overseeing the release. Moving beyond slow, iterative MapReduce processing is obviously a good thing, but just what are the new possibilities?

Better SQL querying, graph analysis and stream processing are all on the short list, according to Murthy, a Yahoo veteran who co-founded Hortonworks. He describes YARN (short for Yet Another Resource Negotiator) as a kind of large-scale, distributed operating system for big data applications. As is typical with operating systems, there's some question about what will and what won't work with YARN, but more on that later.

Most Hadoop adopters are treating the platform as a data lake or ocean for all company information, says Murthy, but they want to be able to use the information in multiple ways roughly falling into four categories: batch, interactive, online and streaming.

"As you look through the entire life cycle of that data, and as data is coming in, you want to process it quickly and efficiently and tackle whatever application you have in mind," Murthy says.

SQL is the obvious example of human-interactive querying, and that could be handled through Hive. HBase, the Hadoop NoSQL database, is an online processing option. Storm (open-sourced by Twitter) is a stream-processing option. Apache Giraph is an option for graph analysis. Spark is an option for high-speed, in-memory analytics on top of Hadoop. MPI (Message Passing Interface) is a parallel-programming standard used in modeling applications such as assessing risk, optimizing pricing and other advanced analytics.

And then there are what Murthy calls the "great big honking batch jobs across six, nine or 12 months of data where you're processing hundreds of terabytes or even petabytes of data." That's where MapReduce comes in.

"All of these things have been refactored to work on top of YARN," says Murthy.

Of course Hive, HBase and other options have been available alongside MapReduce for some time, but before Hadoop 2.0, the system was designed to be a single-application system, setting up competition for resources. Run a complex Hive query or one of those great big honking MapReduce jobs and you're likely to lock up resources and prevent any other application from running with anything like predictable performance.

YARN's job is to allocate resources across all the applications running on top of Hadoop to enable them to run simultaneously and with consistent levels of service to end users. This extends to supporting internal or external service-level agreements, quality-of-service standards and administrative control, according to Murthy.
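
As a rough illustration of that kind of allocation, here is a minimal Python sketch that gives each class of application a fixed percentage of cluster memory and admits a request only if it fits within its class's slice. It is a toy model under stated assumptions, not YARN's scheduler or any Hadoop API: the function names `allocate` and `can_run`, the four class names, the shares and the 1,000 GB cluster size are all made up for the example.

```python
# Toy model only: this is not YARN's scheduler or any Hadoop API.
# The class names, shares and cluster size are hypothetical.

def allocate(total_memory_gb, shares):
    """Split cluster memory among application classes by percentage share."""
    if sum(shares.values()) != 100:
        raise ValueError("shares must sum to 100 percent")
    return {name: total_memory_gb * pct // 100 for name, pct in shares.items()}


def can_run(request_gb, app_class, allocation, in_use):
    """Admit a request only if it fits inside its own class's slice."""
    return in_use.get(app_class, 0) + request_gb <= allocation[app_class]


shares = {"batch": 40, "interactive": 30, "online": 20, "streaming": 10}
allocation = allocate(1000, shares)

# A batch class that has used up its slice cannot take more,
# but an interactive query still has room to run.
busy = {"batch": 400}
print(can_run(50, "batch", allocation, busy))        # False
print(can_run(50, "interactive", allocation, busy))  # True
```

The point of the sketch is only that a big batch job saturating its own share no longer starves an interactive query. Real schedulers are more elaborate, for example letting one class borrow capacity that another has left idle, and the sketch ignores that.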

"Instead of having [simplistic] queues for each of your classes of applications, you can decide how much resource you want to give to which class of application," he explains.

The only caveat with YARN is that it's part of the Apache Hadoop framework and is, therefore, designed to allocate resources to Apache Hadoop components. Where does that leave Cloudera Impala, Pivotal HAWQ and the many other SQL-on-Hadoop developments that may or may not become part of Hadoop? In the case of Impala, for example, the core query engine is shipped under the Apache license, but Cloudera's Enterprise Real-Time Query (RTQ) management console for Impala is a commercial, subscription-based tool.

"It's absolutely conceivable that something like Impala or Pivotal HAWQ could come into the YARN resource-management framework," Murthy says promisingly, but then he adds the caveat.

Speaking as an executive of Hortonworks -- a company that adheres strictly to open-source code and that competes with Cloudera, Pivotal and others adding commercial components to Hadoop -- Murthy warns that "with a bolt-on system like an Impala or a HAWQ, you reinvent everything built into YARN."

With YARN inside Hadoop and a separate management system outside of the platform, the question becomes, which system will control the resources and will these services be duplicative?

D. Henschen
5/22/2013 | 2:17:36 PM
re: Hadoop 2.0: New Big Data Possibilities
I had questions about the timing of Hadoop 2.0 and YARN, but the response from Arun Murthy came in too late for publication. It's the beta version that will be announced within a matter of days. When will it reach GA? The short answer is the second half of 2013 and into 2014, but here's Murthy's statement with more detail:

"Apache Hadoop 2.0 and YARN have been under development for 2.5 to 3 years, will be reaching final Beta shortly, with a push to final stable release within the Apache community a matter of weeks after that. At that point, MapReduce (batch data processing) and Apache Tez (interactive data processing) will be two application types that are fully tested to run on YARN. Community projects such as S4, Storm, Giraph, OpenMPI and other open source projects have been doing work to be first-class YARN applications as well, so they will now have a stable platform release to test against and finish their efforts. Commercial vendors and startups have also been doing work around YARN. For example, Continuuity is a startup that created an open source framework called Weave that makes it easy to create YARN applications.

Bottom-line: the next wave of innovation on top of YARN has been underway for a while. How long will it take for the market to adopt Hadoop 2.0? ... Initial uptake of Hadoop 2.0 based solutions with YARN will start in the second half of 2013 with broader market adoption happening throughout 2014."