Pentaho Preps Data On Hadoop, Analyzes On MongoDB - InformationWeek



Pentaho Preps Data On Hadoop, Analyzes On MongoDB

Pentaho 5.1 adds YARN support for predictive-analysis data prep and transforms JSON for analysis on MongoDB.


Open-source analytics and data-integration software provider Pentaho burnished its big-data credentials with newly released support for Hadoop's YARN management layer and the popular MongoDB NoSQL database.

The new capabilities, which arrived in late June with Pentaho 5.1, enhance the company's already strong presence in the world of big data by bolstering data preparation for predictive analysis.

"The big 'aha' in getting a return on analyzing a petabyte of information is being able to predict what the customer is going to do next, whether that's buy something, commit fraud, or churn," said Quentin Gallivan, Pentaho's CEO, in a phone interview with InformationWeek. "Our vision is to befriend the data scientist by building a studio where they can orchestrate and profile their data and then use their tool of choice for prediction."


Support for YARN, the management layer introduced in Hadoop 2.0, is crucial to that vision because it enables Pentaho's analytics studio to operate directly on top of all the data stored in Hadoop while also taking advantage of its distributed processing power. The studio supports data orchestration, data cleansing, and data profiling. With the Data Science Pack included in Pentaho 5.1, that functionality is integrated with Pentaho's Weka data-mining tool and with the popular open-source R statistical language, including support for parallelized processing.

"In predictive analytics, 80% of the effort is getting to clean, structured data that's ready to analyze, so we've done the work to do the data transformation, enrichment, and profiling needed to turn a petabyte of unstructured data on Hadoop into data that's ready for analysis," Gallivan explained.

The Data Science Pack included with 5.1 allows R scripts as well as Weka scoring and forecasting models to be run on Pentaho Data Integration. Future releases will add data-prep support for tools including SAS, Matlab, and Mahout, said Gallivan.

Pentaho 5.1 also adds support for MongoDB, which has become "a killer, next-generation application database," said Gallivan. Pentaho is running its business intelligence, data-visualization, and OLAP tools on MongoDB's JSON data format.

"We transform the JSON to run effectively in an MDX [OLAP] environment," Gallivan explained. "MongoDB users want the richness of a data-discovery environment with data visualization against native JSON."

Pentaho's open-source software is used by more than 20,000 organizations. Among these, more than 1,500 customers pay for enterprise software and support, and at least 250 have successful big-data deployments, according to Gallivan. As Gallivan detailed in a recent interview, most of those customers fall into one of five deployment scenarios: 360-degree customer view, Internet of Things, data warehouse optimization, big-data refinery, or data security.


Doug Henschen is Executive Editor of InformationWeek, where he covers the intersection of enterprise applications with information management, business intelligence, big data and analytics. He previously served as editor in chief of Intelligent Enterprise, editor in chief of ...

Lorna Garey,
User Rank: Author
7/8/2014 | 3:48:24 PM
Re: Data integration vendors are hot to get in on big data
How will Pentaho monetize this? The number of customers paying for enterprise support doesn't seem all that high.
D. Henschen,
User Rank: Author
7/8/2014 | 3:29:39 PM
Re: Pentaho system, ungainly or powerful?
Sorry, but I guess the headline is potentially misleading. Data prep on Hadoop is in service of predictive analysis (done with tools such as Pentaho Weka, R, or, soon according to Pentaho, SAS or Matlab). The support for MongoDB is a separate thing, only for BI/data-visualization style analysis (not predictive work) on the data managed by MongoDB. The two are not connected other than the fact that they are both capabilities introduced in Pentaho 5.1.
Charlie Babcock,
User Rank: Author
7/8/2014 | 3:24:06 PM
Pentaho system, ungainly or powerful?
To "befriend the data scientist" is no easy task. It's all too easy to be a friend to few, stranger to many. The combination of Hadoop with YARN on top for data prep, with the results plugged into MongoDB, sounds like a powerful system -- as long as the movement between the two of them is smooth.
D. Henschen,
User Rank: Author
7/8/2014 | 1:13:09 PM
Data integration vendors are hot to get in on big data
When Hadoop first emerged, we all heard it would displace ETL. That's at least partially true, for some transformation processing, but now data-integration vendors -- like Informatica, Paxata, and now Pentaho -- are saying their stuff is needed for all sorts of data prep and processing ahead of big-data analysis. It's another case of offering an alternative to clunky MapReduce processing, but I haven't talked to enough customers who have validated how useful these tools can be in big-data-analysis scenarios.

The "80% of the work" line above seems like a relic of relational data warehousing approaches, but I need to hear from more practitioners -- yes, this is a naked plea for comments from practitioners -- before passing this off as an overstatement or marketing ploy.