Hadoop Crunches Web-Sized Data - InformationWeek



Hadoop Crunches Web-Sized Data

To parse large volumes of data from Web servers, Yahoo and others turn to Hadoop's open source cloud-based analysis system.

As the World Wide Web has exploded into millions of sites and billions of documents, the search engines that purport to know about everything on the Web have faced a gargantuan task. Sure, more spiders can be activated to crawl the Web and collect information. But what system can analyze all the data before the information is out of date?

The answer is a cluster-based analysis system, sometimes referred to loosely as a cloud database system. At the Cloud Computing Conference and Expo Nov. 3 in Santa Clara, Calif., representatives of Yahoo explained how they use Hadoop open source software, from the Apache Software Foundation, to analyze the Web.

Hadoop is a system that can be applied to "big data," or masses of data collected from the Web, such as the crawls that lead to the search indexes. Eric Baldeschwieler, VP of Hadoop software development, leads the largest known active Hadoop development team and said Yahoo is the world's largest Hadoop user. It uses Hadoop on clusters of 4,000 computers to analyze up to 92 petabytes of data stored on disks.

Hadoop builds Yahoo's indexes of the Web that power the Yahoo search engine. Its Web mapping system "runs in 73 hours, taking as input, data from all the Web pages in the world," he said. Yahoo's digest of Web pages consists of 300 terabytes of data. Hadoop analysis also tells Yahoo's ad system which ads to serve to visitors, based on profiles built from the searches they have conducted on the site.

Yahoo runs Hadoop on a total of 25,000 servers across the company, he said. Yahoo distributes its tested, production version of Hadoop for free, Baldeschwieler said.

Another speaker at the conference was Christophe Bisciglia, a former Google engineer and now part of the founding team at Cloudera, a firm that is producing a supported enterprise distribution of Hadoop. "Cloudera is to Hadoop as Red Hat is to Linux," he said.

Bisciglia described Hadoop as "a batch data processing system" for use on clusters of commodity hardware. Unlike a relational database, "in Hadoop there is no structure (to the data). You can dump incredibly large amounts of data into a Hadoop cluster and figure out what to do with it later."
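
To make the "dump it now, figure it out later" idea concrete, here is a minimal single-machine sketch in plain Python. It is not Hadoop code and does not use any Hadoop API; the sample log lines, the field layout, and the names map_line and run_batch are all assumptions made up for illustration. Raw lines are kept exactly as they arrived, and structure is imposed only when a batch job reads them.

```python
from collections import Counter

# Hypothetical raw records, stored as-is with no schema declared up front.
RAW_LINES = [
    "2008-11-03 GET /index.html 200",
    "2008-11-03 GET /search?q=hadoop 200",
    "malformed line",
    "2008-11-04 GET /index.html 404",
]


def map_line(line):
    """Impose structure at read time: yield (path, 1) for well-formed lines."""
    parts = line.split()
    if len(parts) == 4:  # assumed layout: date, method, path, status
        _date, _method, path, _status = parts
        yield path.split("?")[0], 1  # drop any query string
    # Lines that do not fit the assumed layout are skipped, not rejected at load.


def run_batch(lines):
    """Batch job: map every stored line, then reduce by summing counts per key."""
    counts = Counter()
    for line in lines:
        for key, value in map_line(line):
            counts[key] += value
    return dict(counts)


if __name__ == "__main__":
    print(run_batch(RAW_LINES))
```

A different question asked later (say, counts per status code) would need only a different map function; the stored lines would not change. On a real cluster the same map and sum steps would be spread across many machines, which this sketch does not attempt.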
