Google Calls on Government to Make Itself Available to Search
BY Nancy Scola | Wednesday, June 24 2009
One of the participants in the White House's ongoing Open Government Initiative process is a little company by the name of Google, and it has some ideas to share with the executive branch on how government information can make itself more searchable and thus more accessible to the public. In comments submitted by Google Managing Policy Counsel Pablo Chavez (via the fusty old-fashioned Federal Register channel, rather than the blog/wiki-enabled online OGI process), Google makes the case that consumers and citizens are very often going to use Google as an interface onto government information, rather than any one .gov website or data tool. Government data that is hidden from Google and other search engines is effectively hidden from many of the people whom it might benefit and inform. From Google's submitted comments:
If a citizen is looking for specific information on the safety of organic food, it is more likely that a user will type in their query into a search box (e.g. 'organic food safety') than navigate directly to a relevant federal government website like that of the U.S. Department of Agriculture.
Citizens assume that search queries that are returned by search engines are complete. Therefore, it is critical that agency websites to be easily indexed and crawlable by search engines. If government websites do not allow search engines to crawl, or certain documents on websites are hidden by robots.txt files or behind databases, the results available to citizens are incomplete and not as helpful as they otherwise could be.
Google has two recommendations in particular on how government can unlock data to search. The first is to adhere to the XML Sitemap protocol. Whereas HTML sitemaps help users navigate sites, XML sitemaps can provide computers with metadata that helps make sense of a site's content and structure. For example, an XML sitemap can help Google's spidering agents understand which documents on a website are worth dedicating more attention to, and when the various sections of a site were last updated. Google notes that several states, such as Virginia, California, Utah, Michigan, and Arizona, have adopted XML-based sitemaps to unlock the impact of their data and other resources. In Google's submitted comments, Chavez reports that the introduction of an XML sitemap to one resource section on the website of the Department of Energy's Office of Scientific and Technical Information increased downloads of full-text documents by some 400%. "We should go back to the basics," writes Chavez, "and make the content that already exists accessible to citizens online."
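To make the idea concrete, here is a minimal sketch of what generating an XML sitemap might look like, using only Python's standard library. The agency URLs, dates, and priority values are hypothetical and chosen purely for illustration; the element names (urlset, url, loc, lastmod, priority) and the namespace are those defined by the Sitemap protocol at sitemaps.org.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemap protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return sitemap XML as a string.

    `pages` is an iterable of dicts with a required 'loc' key and
    optional 'lastmod' (W3C date string) and 'priority' ("0.0" to "1.0").
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page["loc"]
        # lastmod tells crawlers when this page last changed.
        if "lastmod" in page:
            ET.SubElement(url, "lastmod").text = page["lastmod"]
        # priority hints at which pages matter most within this site.
        if "priority" in page:
            ET.SubElement(url, "priority").text = page["priority"]
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages on a hypothetical agency site.
pages = [
    {
        "loc": "https://agency.example.gov/reports/organic-food-safety.html",
        "lastmod": "2009-06-01",
        "priority": "0.9",
    },
    {
        "loc": "https://agency.example.gov/about.html",
        "lastmod": "2008-11-15",
        "priority": "0.3",
    },
]

sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```

The resulting file is typically saved at the site root and referenced from robots.txt or submitted to the search engines directly; how a given agency would wire that up is outside the scope of this sketch.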
The second recommendation that Google has for government is that agencies re-evaluate the text files residing on their servers that tell Google and other search engines which information they'd like to have included in search results and what they'd rather have ignored. Many agencies, writes Chavez, write their robots.txt files to err on the side of excluding information, when the presumption for government data should favor inclusion and disclosure. The company reports that it has been working with state-level agencies, such as the Florida Department of Education, to evaluate when the "no follow" and "no index" directions in robots.txt files are unnecessary and should be discarded. "This has resulted in tens of thousands of new pages containing deep content that are now discoverable to citizens through search," writes Chavez.
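For a sense of how such a review might be carried out, the sketch below uses Python's standard urllib.robotparser module to check which URLs a crawler would be allowed to fetch under two different robots.txt files. Both files, the agency hostname, and the helper name crawlable are made up for illustration, and the sketch covers only Disallow rules, not every directive an agency might be using.

```python
from urllib.robotparser import RobotFileParser


def crawlable(robots_txt, urls, user_agent="*"):
    """Map each URL to True/False: may `user_agent` fetch it under these rules?"""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {url: parser.can_fetch(user_agent, url) for url in urls}


# Hypothetical robots.txt that errs on the side of exclusion.
RESTRICTIVE = """\
User-agent: *
Disallow: /reports/
Disallow: /data/
"""

# Hypothetical revision that blocks only a private admin area.
PERMISSIVE = """\
User-agent: *
Disallow: /admin/
"""

# Hypothetical URLs on a hypothetical agency site.
urls = [
    "https://agency.example.gov/reports/2009/summary.html",
    "https://agency.example.gov/data/schools.csv",
    "https://agency.example.gov/admin/login",
]

before = crawlable(RESTRICTIVE, urls)
after = crawlable(PERMISSIVE, urls)

for url in urls:
    print(f"{url}\n  before: {before[url]}  after: {after[url]}")
```

Running a comparison like this over a list of an agency's public documents gives a rough picture of how much content a stricter file is keeping out of search results.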
In addition to the utility and import of opening up a wider range of government data, having a deeper well of government data to draw from may be an advantage to Google as it seeks to stay one step ahead of search engines like Microsoft's Bing and Wolfram Alpha that take a less linear approach to retrieving and delivering information. Ola Rosling, the lead on the Google Public Data project we wrote about in this space on Monday, described in an email the company's interest in working with Data.gov and other first-party sources of government information: "We are open to collaborating with a broad range of public data providers, including the White House, to promote our broader goal to make public data accessible. We are also," he adds, "actively seeking contacts with new organizations that produce public data." To that latter end, Google has begun reaching out to public-data providers to ask them to fill out a detailed survey regarding what data they're currently publishing or are interested in publishing, on everything from trade figures to agricultural numbers to crime statistics.