Plenty of Hadoop vendors and hangers-on are promising SQL-on-Hadoop capabilities, but in the process they're buying into the old, inflexible model-before-querying approach to data analysis.
Hadoop software distributor MapR on Tuesday announced it will start shipping Apache Drill software that it says delivers a more flexible, big-data-savvy data-exploration approach.
MapR says that unlike Apache Hive and Cloudera's Impala option for SQL analysis on Hadoop, Drill, which is based on Google Dremel, does not require IT people to anticipate queries and set up data models in advance. Instead, Drill is designed for data exploration first. Compatible data sources include Hadoop sources such as HDFS, Hive, and HBase tables; NoSQL data from sources such as MongoDB and REST APIs; and self-describing data such as Avro, Parquet, and JSON files with nested structures.
"The model-first approach is the antithesis of the approach of exploring what big data is trying to tell you," said Jack Norris, MapR's chief marketing officer, in a phone interview with InformationWeek. "Drill allows schema discovery on the fly, support for modern data structures, and support for ANSI SQL."
Drill's approach is more flexible than that of Hive or Impala, said Norris, because data analysts can explore the data before they set up fixed schemas, ETL processes, or hardened production queries. Instead of fixing on a schema before the query engine can touch the data, Drill lets users explore first, and the engine automatically discovers source schemas and adjusts query plans accordingly as SQL queries are applied.
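To make the idea of schema discovery concrete, here is a minimal Python sketch. It is not Drill's implementation or API; it only illustrates the general technique of reading self-describing JSON records and deriving field paths and types from the data itself, rather than from a schema declared in advance. The function name `discover_schema`, the `[]` marker for repeated fields, and the sample records are all assumptions made for illustration.

```python
import json


def discover_schema(records):
    """Merge the field paths and value types observed across all records."""
    schema = {}

    def walk(value, path):
        if isinstance(value, dict):
            # Nested structure: extend the dotted path with each key.
            for key, child in value.items():
                walk(child, f"{path}.{key}" if path else key)
        elif isinstance(value, list):
            # Repeated field: mark the path with [] and inspect each element.
            for item in value:
                walk(item, path + "[]")
        else:
            # Leaf value: record the type name seen at this path.
            schema.setdefault(path, set()).add(type(value).__name__)

    for record in records:
        walk(record, "")
    return schema


# Hypothetical sample data; note that the two records do not share a layout.
raw_lines = [
    '{"user": {"id": 1, "name": "ann"}, "tags": ["a", "b"]}',
    '{"user": {"id": 2}, "clicks": 7}',
]
records = [json.loads(line) for line in raw_lines]
schema = discover_schema(records)

for path in sorted(schema):
    print(path, sorted(schema[path]))
```

The point of the sketch is that nothing about the record layout is declared up front: a field that appears only in later records, such as `clicks` here, simply shows up in the discovered schema when it is first read.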
In addition to providing an SQL query interface, Drill exposes an ODBC connector through which data sources can be explored with simple desktop tools, like Microsoft Excel or Tableau Software, or through more sophisticated business intelligence suites. Though it's currently in a 0.5 (pre-production-ready) beta release, Drill supports 15 of the 22 SQL queries used in the TPC-H performance benchmark, whereas Cloudera Impala supports only two of those queries, according to MapR executives.
Though Drill is described by MapR as an open community, MapR is its chief advocate, and it is the only Hadoop vendor distributing the software. Cloudera, the leading Hadoop distributor by customer numbers, is pushing Impala, while Hortonworks is advancing the capabilities of Apache Hive, the most popular SQL-on-Hadoop tool available.
Currently in early beta, Drill is far from ready for production use, and MapR's announcement offered few beta customer references. Instead, partners and analysts offered their opinions on MapR's news.
"Apache Drill's ability to provide access to data in Hadoop without the need for centralized schemas and also NoSQL datasets with complex data structures including nested and repeated fields differentiates it from traditional approaches to SQL-on-Hadoop," stated Matt Aslett, research director, data platforms and analytics, 451 Research, in MapR's press release.
Doug Henschen is Executive Editor of InformationWeek, where he covers the intersection of enterprise applications with information management, business intelligence, big data and analytics. He previously served as editor in chief of Intelligent Enterprise, editor in chief of ...