
Code Warriors Debate Whitehouse.gov Robot Commands

BY Sarah Granger | Thursday, January 22, 2009

As the tech community pored over the new whitehouse.gov site, one of the first subterranean changes noted involved a file most people would never notice: robots.txt. The file serves as a notice to search robots, telling them which parts of a site they should or shouldn't survey. The new version contains just two lines, while the former whitehouse.gov robots.txt had grown to nearly 2,400 Disallow lines by the end of the Bush administration. The contrast sparked excitement and controversy over what the change means for government transparency.

The text from the new robots.txt file:

User-agent: *
Disallow: /includes/

A sampling from near the end of the previous file:

Disallow: /president/text
Disallow: /president/waronterror/iraq200404/text
Disallow: /president/waronterror/photoessay/text
Disallow: /president/winterwonderland/iraq
Disallow: /president/winterwonderland/text
Disallow: /president/world-leaders/iraq
Disallow: /president/world-leaders/text
Disallow: /president/worldunites/iraq
Disallow: /president/worldunites/text
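
For readers who don't speak robots.txt, here is a minimal sketch, using Python's standard urllib.robotparser module, of how a well-behaved crawler would read the two files. The URLs are illustrative, assembled from the paths quoted above, and the code is only an approximation of what a real crawler does.

from urllib.robotparser import RobotFileParser

# The new two-line whitehouse.gov file quoted above.
new_rules = RobotFileParser()
new_rules.parse([
    "User-agent: *",
    "Disallow: /includes/",
])

# Two rules sampled from the Bush-era file quoted above.
old_rules = RobotFileParser()
old_rules.parse([
    "User-agent: *",
    "Disallow: /president/text",
    "Disallow: /president/winterwonderland/iraq",
])

# Under the new file, only /includes/ is off limits to polite robots.
print(new_rules.can_fetch("*", "http://www.whitehouse.gov/president/text"))       # True
print(new_rules.can_fetch("*", "http://www.whitehouse.gov/includes/header.inc"))  # False

# Under the old rules, the same page was excluded from indexing.
print(old_rules.can_fetch("*", "http://www.whitehouse.gov/president/text"))       # False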

Cory Doctorow, editor of Boing Boing and former outreach director for the Electronic Frontier Foundation, was one of the first to report the finding, posting just the facts and leaving a string of commenters asking for explanations.

Proponents of the view that the move to the vastly smaller file was a statement about transparency were ecstatic. According to Patrick Thibodeau of Computerworld, New York blogger Jason Kottke "thinks that by eliminating the Bush disallow list on its first day in office, the Obama administration was sending out a symbolic message." Kottke, in his post on Tuesday, alluded to the "huge change in the executive branch of the US government." In an e-mail to Thibodeau, Kottke wrote: "One of Obama's big talking points during the campaign and transition was a desire for a more transparent government, and the spare robots.txt file is a symbol of that desire."

Presenting an alternate view, Declan McCullagh of CNET News pointed out that the Bush-era whitehouse.gov robots.txt mostly followed the letter of coder law in terms of what to disallow, with the exception of a few incidents that were corrected. McCullagh suggested that the new robots.txt file may actually be too short: "It doesn't currently block search pages, meaning they'll show up on search engines--something that most site operators don't want and which runs afoul of Google's Webmaster guidelines."
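
If the new site's developers did want to follow that guideline, the fix would be one more Disallow line pointing at whatever path serves search results. It might look something like the lines below, though the /search/ path is purely hypothetical, since the article doesn't say how whitehouse.gov structures its search URLs:

User-agent: *
Disallow: /includes/
Disallow: /search/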

While most of the technical experts weighing in expect the robots.txt file to grow, they describe that growth as the normal process any website undergoes over time. Andy John, a search developer for DeepDyve, puts it like this: "robots.txt is just a request. Robots can do whatever they like anyway." He went on to describe what that means: "For example, there is a program 'wget' (web get). You give it a URL, it downloads it and saves the file... You can tell it to download an entire site. It honors robots.txt by default. But by just adding these parameters you can tell it to ignore robots.txt and get everything: wget -erobots=off http://www.whitehouse.gov"
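
John's point is easy to see in code. Below is a minimal sketch, not drawn from the article, of a fetch routine in which the robots.txt check is just an optional step the crawler can skip, much like wget's robots=off switch:

from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

def fetch(url, respect_robots=True):
    """Download a page, optionally consulting the site's robots.txt first."""
    if respect_robots:
        parts = urlsplit(url)
        robots_url = urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
        rules = RobotFileParser(robots_url)
        rules.read()  # fetch and parse the site's robots.txt
        if not rules.can_fetch("*", url):
            return None  # a polite crawler stops here...
    # ...an impolite one just passes respect_robots=False, and the server
    # cannot tell the difference.
    return urlopen(url).read()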

As to why those who developed the new whitehouse.gov site would want to code it this way, Jaelithe Judy, a search engine optimization specialist and political blogger, says: "Google does generally encourage webmasters to use disallows to keep from having their search pages spidered; this is to help keep a Google search from returning a whole page of search results from other sites' internal search engines, instead of relevant original content. However, in some cases a search result from a site is a meaningful result. For instance, when you are searching for 'DVD recorders' and the Amazon search page for 'DVD recorders on Amazon' pops up, that might actually be useful to most users."

She added that "Google is still trying to work out how to sort annoying search-generated page results from the useful ones. The Whitehouse.gov ones may lean toward being useful. For instance, if you are a middle school student doing a report on the First Ladies, and you get a Whitehouse.gov search page for First Ladies, that has all sorts of different links to different sorts of information, that might actually be useful."

The bottom line about robots.txt? John says, "It's really more of a serving suggestion."
