Google Comes Knocking In Search Of Hidden Data - InformationWeek

4/14/2008
04:21 PM

Google Comes Knocking In Search Of Hidden Data

By crawling using HTML forms (and abiding by robots.txt), Google claims it leads search engine users to documents that otherwise would not be easily found -- but privacy concerns remain.

Google on Friday said that it has been testing ways to index data that is normally hidden from search engine crawlers, a change that should improve the breadth of information available through Google.

The so-called "hidden Web" that Google has begun indexing refers to data beyond static Web pages, such as pages generated dynamically from a database in response to input submitted through a Web form.

"This experiment is part of Google's broader effort to increase its coverage of the Web," Google engineers Jayant Madhavan and Alon Halevy said in a blog post. "In fact, HTML forms have long been thought to be the gateway to large volumes of data beyond the normal scope of search engines. The terms Deep Web, Hidden Web, or Invisible Web have been used collectively to refer to such content that has so far been invisible to search engine users. By crawling using HTML forms (and abiding by robots.txt), we are able to lead search engine users to documents that would otherwise not be easily found in search engines, and provide Webmasters and users alike with a better and more comprehensive search experience."

Robots.txt is a file Web publishers place on their servers that specifies what data can or can't be accessed by crawling programs, should those programs choose to abide by its rules.
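
For readers who want to see how a well-behaved crawler might consult such a file, here is a minimal sketch using Python's standard `urllib.robotparser` module. The robots.txt content, the `ExampleBot` user-agent name, the `example.com` URLs, and the `is_allowed` helper are all hypothetical and chosen only for illustration; this is not a description of how Googlebot itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: every crawler is asked to stay out of /search.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in-memory rules, no network access
    return parser.can_fetch(user_agent, url)


# The search results page is off limits; an ordinary page is not.
blocked = is_allowed(ROBOTS_TXT, "ExampleBot", "http://example.com/search?q=cars")
permitted = is_allowed(ROBOTS_TXT, "ExampleBot", "http://example.com/about.html")
```

Nothing in the file enforces these rules. The check only matters if the crawler runs it before every request and honors the answer.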

In their post, Madhavan and Halevy twice mention that Google follows robots.txt rules, perhaps to allay fears that Google's more curious crawler will expose sensitive data. Google's wariness of being seen as an invader of privacy is underscored by the fact that its two engineers characterize the Google crawler as "the ever-friendly Googlebot."

"Needless to say, this experiment follows good Internet citizenry practices," Madhavan and Halevy said in their post. "Only a small number of particularly useful sites receive this treatment, and our crawl agent, the ever-friendly Googlebot, always adheres to robots.txt, nofollow, and noindex directives. That means that if a search form is forbidden in robots.txt, we won't crawl any of the URLs that a form would generate. Similarly, we only retrieve GET forms and avoid forms that require any kind of user information."
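
The behavior the engineers describe can be illustrated roughly as follows. This is a sketch under stated assumptions, not Google's actual method: the `form_urls` and `crawlable` helpers, the form fields, the candidate values, the `ExampleBot` user-agent name, and the `example.com` URLs are all hypothetical. It builds the URLs a GET form would generate, skips any form that is not a GET form, and drops URLs that robots.txt forbids.

```python
from itertools import product
from urllib.parse import urlencode, urljoin
from urllib.robotparser import RobotFileParser


def form_urls(page_url, action, method, field_values):
    """Build the URLs a GET form could produce for the candidate field values."""
    if method.strip().lower() != "get":
        return []  # POST and other methods are skipped entirely
    target = urljoin(page_url, action)  # resolve the form's action against the page
    names = list(field_values)
    return [
        target + "?" + urlencode(list(zip(names, combo)))
        for combo in product(*(field_values[name] for name in names))
    ]


def crawlable(urls, robots_txt, user_agent):
    """Keep only the URLs that the given robots.txt rules allow user_agent to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical form on http://example.com/cars/ with two fields.
urls = form_urls(
    "http://example.com/cars/", "search", "GET",
    {"make": ["ford", "honda"], "year": ["2007"]},
)

# If the publisher forbids the search form's path, none of the URLs survive.
forbidden = crawlable(urls, "User-agent: *\nDisallow: /cars/search\n", "ExampleBot")
```

Choosing which values to try in each field is the hard part of the problem, and the sketch leaves it out by taking the candidate values as input.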

Given that Google has been, and continues to be, accused of disregarding privacy concerns -- a charge it has rebutted and continues to rebut -- such prudence is quite understandable.

In a 2001 paper, Michael K. Bergman, CTO of BrightPlanet, estimated that the hidden Web was 400 to 550 times larger than the exposed Web. Though it's not immediately clear whether this ratio still holds after seven years, Google's decision to explore the hidden Web more thoroughly should make its massive index even more useful, and perhaps even more controversial.

Indeed, not everyone has been won over. In a blog post, Robin Schuil, a software developer at eBay, criticized Google's approach for placing an extra burden on sites.

He said it's "really awfully close to what some of the search engine spammers do: targeted scraping of Web sites."
