Google Comes Knocking In Search Of Hidden Data
April 14, 2008 – 2:21 PM
Google on Friday said that it has been testing ways to index data that is normally hidden from search engine crawlers, a change that should improve the breadth of information available through Google.
The so-called “hidden Web” that Google has begun indexing refers to data beyond static Web pages, such as pages generated dynamically from a database in response to input submitted through a Web form.
“This experiment is part of Google’s broader effort to increase its coverage of the Web,” said Google engineers Jayant Madhavan and Alon Halevy in a blog post. “In fact, HTML forms have long been thought to be the gateway to large volumes of data beyond the normal scope of search engines. The terms Deep Web, Hidden Web, or Invisible Web have been used collectively to refer to such content that has so far been invisible to search engine users. By crawling using HTML forms (and abiding by robots.txt), we are able to lead search engine users to documents that would otherwise not be easily found in search engines, and provide Webmasters and users alike with a better and more comprehensive search experience.”
Robots.txt is a file Web publishers place on their servers that specifies what data can or can’t be accessed by crawling programs, should those programs choose to abide by its rules.
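As a rough illustration of how a well-behaved crawler might consult such a file, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt contents, the `ExampleBot` agent name, and the example.com URLs are all made up for illustration; nothing here reflects how Googlebot itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler is asked to stay out of
# the site's search results and its /private/ directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A static page is not covered by any Disallow rule.
allowed = parser.can_fetch("ExampleBot", "http://example.com/about.html")

# A URL generated by the search form falls under "Disallow: /search".
blocked = parser.can_fetch("ExampleBot", "http://example.com/search?q=books")

print(allowed, blocked)
```

The check is purely advisory: the file only has an effect if the crawler runs a test like this before each request and honors the answer.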
In their post, Madhavan and Halevy twice mention that Google follows robots.txt rules, perhaps to allay fears that Google’s more curious crawler will expose sensitive data. Google’s wariness of being seen as an invader of privacy is underscored by the fact that its two engineers characterize the Google crawler as “the ever-friendly Googlebot.”
“Needless to say, this experiment follows good Internet citizenry practices,” said Madhavan and Halevy in their post. “Only a small number of particularly useful sites receive this treatment, and our crawl agent, the ever-friendly Googlebot, always adheres to robots.txt, nofollow, and noindex directives. That means that if a search form is forbidden in robots.txt, we won’t crawl any of the URLs that a form would generate. Similarly, we only retrieve GET forms and avoid forms that require any kind of user information.”
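The rules the engineers describe (GET forms only, no forms that ask for user information, nothing that robots.txt forbids) can be sketched in a few lines of Python. This is a hypothetical illustration, not Google's crawler: the `Form` class, the `candidate_urls` helper, the `ExampleBot` agent name, and the sample query values are assumptions, and the blog post does not say how Google chooses the values it submits.

```python
from dataclasses import dataclass
from urllib.parse import urlencode
from urllib.robotparser import RobotFileParser


@dataclass
class Form:
    """Hypothetical description of an HTML form with a single text field."""
    action: str                    # URL the form submits to
    method: str                    # "get" or "post"
    field: str                     # name of the input field
    needs_user_info: bool = False  # e.g. a login or personal-data form


def candidate_urls(form, values, robots, agent="ExampleBot"):
    """Return the URLs a polite crawler could request for this form."""
    # Skip anything that is not a GET form or that asks for user data.
    if form.method.upper() != "GET" or form.needs_user_info:
        return []
    # A GET submission is just the action URL plus a query string.
    urls = [f"{form.action}?{urlencode({form.field: v})}" for v in values]
    # Drop every generated URL that robots.txt forbids.
    return [u for u in urls if robots.can_fetch(agent, u)]


robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])

catalog = Form("http://example.com/catalog", "get", "q")
login = Form("http://example.com/login", "post", "user", needs_user_info=True)
hidden = Form("http://example.com/private/find", "get", "q")

catalog_urls = candidate_urls(catalog, ["used cars", "maps"], robots)
print(catalog_urls)
print(candidate_urls(login, ["alice"], robots))
print(candidate_urls(hidden, ["maps"], robots))
```

Only the catalog form yields URLs in this sketch; the login form is skipped because it uses POST and asks for user information, and the third form is skipped because its URLs sit under a disallowed path.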
Given that Google has been, and continues to be, accused of disregarding privacy concerns (a charge it has rebutted and continues to rebut), such prudence is quite understandable.
In a 2001 paper, Michael K. Bergman, CTO of BrightPlanet, estimated that the hidden Web was 400 to 550 times larger than the exposed Web. Though it’s not immediately clear whether this ratio still holds after seven years, Google’s decision to explore the hidden Web more thoroughly should make its massive index even more useful, and perhaps even more controversial.
Indeed, not everyone has been won over. In a blog post, Robin Schuil, a software developer at eBay, criticized Google’s approach for placing an extra burden on sites.
“[I]t is really awfully close to what some of the search engine spammers do: targeted scraping of Web sites,” he said.
Source: Information Week