Google Makes Scanned Documents Searchable - InformationWeek


News
10/31/2008
02:20 PM

Google Makes Scanned Documents Searchable

Using optical character-recognition technology, Google will make the converted text of scanned PDFs available on its search results pages via the "View as HTML" link.

Google on Thursday said that it has begun turning electronic copies of printed documents -- PDF files generated from scanned paper -- back into digital text using optical character-recognition (OCR) technology.

"In the past, scanned documents were rarely included in search results as we couldn't be sure of their content," said Google product manager Evin Levey in a blog post. "We had occasional clues from references to the document -- so you might get a search result with a title but no snippet highlighting your query. Today, that changes. We are now able to perform OCR on any scanned documents that we find stored in Adobe's PDF format."

Google is making the converted text of scanned PDFs available on its search results pages via the "View as HTML" link. As an example, this scan of a Consumer Product Safety Commission (CPSC) document about aluminum wiring repair from 2004 is viewable as HTML.

The same search, "repairing aluminum wiring," on Yahoo Search also returned the CPSC PDF as the top result, but Yahoo's "View as HTML" link showed only blank pages. Microsoft's Live Search and Ask.com also returned the CPSC PDF as the top result, but neither offered a "View as HTML" link.

By turning images of text into text, Google expands its already massive index. As Levey points out, Google's OCR system converts pictures into thousands of words.

"This is a small but important step forward in our mission of making all the world's information accessible and useful," said Levey.

Google's approach doesn't obviate the need to consult the scanned file, however, if it contains images or diagrams. While Google appears to do a good job of converting text, its HTML conversions omit graphics. Perhaps in time its engineers will be able to isolate graphic elements in scanned PDFs and insert them into its HTML conversions.

One unfortunate consequence of this is that personal information such as Social Security numbers, which might have gone unnoticed in scans of court documents, may now be discoverable through a Google search. Public.Resource.org, a project that aims to make public government documents publicly accessible, recently found about 1,700 documents with Social Security numbers or alien identification numbers in a corpus of 2.5 million court documents that go back decades.

But that's the sort of problem that crops up when you make all the world's information accessible.
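
To illustrate how Social Security numbers become easy to find once scanned pages are converted to text, here is a minimal Python sketch that flags SSN-like strings in OCR output. It is not the method used by Public.Resource.org or Google; the function name `find_ssn_candidates`, the pattern, and the sample text are assumptions made for illustration. The filter relies only on the publicly documented rule that area numbers 000, 666, and 900-999, group 00, and serial 0000 are never issued.

```python
import re

# Assumed pattern: nine digits grouped 3-2-4, separated by hyphens or spaces.
SSN_PATTERN = re.compile(r"\b(\d{3})[- ](\d{2})[- ](\d{4})\b")


def find_ssn_candidates(text):
    """Return SSN-like strings in text, skipping ranges that are never issued."""
    candidates = []
    for match in SSN_PATTERN.finditer(text):
        area, group, serial = match.groups()
        # Area numbers 000, 666, and 900-999 are never assigned.
        if area in ("000", "666") or area.startswith("9"):
            continue
        # Group 00 and serial 0000 are never assigned either.
        if group == "00" or serial == "0000":
            continue
        candidates.append(match.group(0))
    return candidates


# Hypothetical OCR output from a scanned court filing.
sample = "Defendant SSN 123-45-6789, case no. 000-12-3456, phone 555-0100"
print(find_ssn_candidates(sample))
```

A pattern match like this only produces candidates. Any real redaction effort would still need human review, since other nine-digit identifiers can share the same shape.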
