
Wild Horses: Data.gov Proves Good Stats are Hard to Wrangle

BY Nick Judd | Thursday, January 28 2010

Wrangling good data is like wrangling horses: It's hard, and technology can only make it so much easier.

Rollin' rollin' rollin', keep them data rollin': A herd of federal agency data was taken in from the pasture on Jan. 22. // Photo: Bureau of Land Management

Not to knock the plight of the wild North American horse, but it isn't clear to me how population counts of wild burros and mustangs are the most important data the Department of the Interior has to offer for its eager public.

Along with every other federal agency, Interior had until Jan. 22 to respond to a Dec. 8 directive from Office of Management and Budget Director Peter Orszag by posting, on the Obama administration's Data.gov open government data repository, three "high-value data sets." Its response was a list of volunteer opportunities from serve.gov; a list of government recreation facilities; three data sets concerning wildland fires; and an elaboration on the United States' dwindling stock of wild mustangs.

So I asked Interior: What makes the wild American donkey so important?

"One of the mandates under the open government directive was that the data being published was central to the agency mission and that was the case with these projects," Kendra Barkoff, a Department of Interior spokeswoman, wrote in reply.

Well, fair enough. But the Department of the Interior also handles the affairs of Native Americans, a population dealing with high unemployment, poor infrastructure in many places, and little to no attention from the general public. Secretary of the Interior Ken Salazar has taken heat on the mustang issue in the past, but it still struck me as odd that mustangs would be the subject of the data his department chose to make public via Data.gov.

A small handful of researchers I spoke to, after poring over the statistics and research on Data.gov relevant to their areas of interest, say that much of it is material they've either seen before or don't find especially useful. The Sunlight Foundation's Bill Allison has already opined on this subject: the OMB directive mandated that federal agencies post only "data sets not previously available online or in a downloadable format."

UPDATE: On Thursday, the Washington Post came out with similar findings.

"I think that in some cases it may be true that a version or some of the information that was submitted as of Friday deadline had existed in one form or another on a government website," said an OMB spokesman, Tom Gavin. "But what we have found is in that many of those cases the data was not available in a machine readable format."

In other cases, it was available but not free, he said. Part of the point is that the data is now all in one place, and the process of aggregating that data is just beginning.

The problem may not be bureaucratic reticence, but simply that the agencies have only so much good data in the first place. Standards are getting better, researchers tell me, but right now, good government data is hard to find because there often isn't a whole lot of it, not because the government is keeping it to itself.

Open Voting

Data.gov allows users to vote on which datasets are the best. Here are some of the top contenders as of Wednesday afternoon, by number of votes:

"High value" itself may seem to be a subjective test, but, in Orszag's Dec. 8 directive, the OMB director offered a definition: Raw data that can be used to increase agency accountability and responsiveness; improve public knowledge of the agency and its operations; further the core mission of the agency; create economic opportunity; or respond to need and demand as identified through public consultation.

More than the "high-value" test, the issue of redundancy seems to be a pressing one. Data.gov seems to be better at fulfilling the need for a central clearing-house of information than at serving as a catalyst for the release of new government data, as Orszag's three-new-datasets proviso implied it was meant to be.

"Certainly I love datasets more than the average person," said Ashley Nellis, a research analyst for the Sentencing Project who has scoped out the criminal justice data available on Data.gov, in a Tuesday phone interview. "But those datasets, there wasn't anything new there that I could see."

Nellis researches racial disparity in the criminal justice system; she is compiling state-level data that the Sentencing Project is gathering on its own, for instance, on the greater propensity for black people than for white people to receive life sentences. For her, the Bureau of Justice Statistics has such a great website that she has no need of its data in a second place.

A spokeswoman for the Justice Department was not immediately available for comment early Wednesday evening.

Similarly, the raw results of a Federal Voting Assistance Program survey of overseas voters after the 2004 presidential election were among the data released by the Department of Defense. But FVAP released that data online last summer, Claire Smith, research director of the Overseas Vote Foundation, told me Wednesday in a phone interview. (She's eagerly awaiting the results of the 2008 survey, she says, which she expects soon.)

When new appointee Bob Carey took over the Federal Voting Assistance Program in summer 2009, Smith says, she immediately noticed the program become more open. So perhaps it's fair to say that the data did come out as a result of the Obama administration's professed open-government ethos — just not this particular initiative.

Then what is left to disclose? I asked Smith if there was much else the government might have on this topic that researchers would want.

"I don't think there is any," she said, later adding, "we don't even know how many Americans live abroad."

The Census Bureau hasn't even tried to make an estimate since 2000, Smith said. (I imagine, though, that there are closely held figures on the topic somewhere in the country's intelligence community.)

No one at the Department of Defense press shop was immediately available to comment, but it's important to note that this wasn't the only dataset DoD posted. Its log of received Freedom of Information Act requests, which the department also designated a "high-value" dataset, has garnered attention, judging by the number of votes cast for it on Data.gov. DoD had posted ten datasets in total as of this writing.

The Department of the Interior's data caught my eye because it is the federal department tasked with keeping track of Native American affairs. If ever there was an underappreciated area of research, this would be it — but when I went looking, I found a census of feral mustangs, and no information on the unemployment rate on reservations.

Similarly, data on prisons in Indian country were available, but as of this writing the most current set on Data.gov is from 2001.

"[There's a] distinct difference between having the data sets available," said Peter Morris, director of strategy and partnership for the National Congress of American Indians, "and having the kind of data you need."

Specifically, he was talking about the Treasury's Recovery Act data, another popular dataset on Data.gov. The data is there on investments made through the Recovery Act, including through the Community Development Financial Institutions Program. That program, Morris says, has facilitated investment in Native American communities, some of which are in grave need of infrastructure like better roads and the jobs that would be filled by people building them. But the data isn't structured such that Morris can sort out investments made in Native American communities from those that aren't, he said.

Morris heaped praise on Interior Secretary Ken Salazar for his willingness to pay attention to Native American affairs, and said he feels that Salazar, and by extension the entire department, understands the need for better data.

It's a refrain Nellis repeated: It's very slowly getting easier to get good data, and this is a focus of the Obama administration. But it's been a process for at least the last eight years.

Interior and the Bureau of Labor Statistics keep separate unemployment data, for example, and Interior uses different standards than Labor. This means comparing unemployment rates of people living on Native American land and in the surrounding states would be like comparing, well, horses and burros: The two are similar, but just not the same. As a result, says Morris, Alaska, Arizona and Minnesota (all of which have sizeable populations on Native American land with higher unemployment rates than the states themselves) were classified as having an unemployment rate under the threshold required to gain an extension of unemployment benefits that was granted last year.

The Obama administration appears to be emphasizing Data.gov even as it pursues more arduous but arguably more relevant aspects of institutional change behind closed doors, and it is by all accounts moving in that direction. The same memo that established the Data.gov dump deadline also required each agency to designate a senior official responsible for the "quality and objectivity" of records that track federal spending and, to the delight of data nerds worldwide, established that federal data should be as granular as possible. I heard from several researchers that change in this arena, while slow, was coming.

Gavin, the OMB spokesman, says a list of the agency officials responsible for data quality and a list of people on a related inter-agency working group are both supposed to be released soon.

While Data.gov is flashy, easy to explain and even starting an open-government competition of sorts with the United Kingdom, submitting data to the website is really the least difficult of the commitments the administration is now expected to keep.

The more abstract task of establishing standards for data, and actually collecting and entering it, is like wrangling a mustang: It's quite hard, and technology can only make it so much easier.

I asked Morris, of the National Congress of American Indians: Is Data.gov, at least for now, susceptible to garbage in, garbage out?

"That theme," he said, "comes through in the data you see here."
