Wild Horses: Data.gov Proves Good Stats are Hard to Wrangle
BY Nick Judd | Thursday, January 28 2010
Not to knock the plight of the wild North American horse, but it isn't clear to me how population counts of wild burros and mustangs are the most important data the Department of the Interior has to offer its eager public.
Along with every other federal agency, Interior had until Jan. 22 to respond to a Dec. 8 directive from Office of Management and Budget Director Peter Orszag by posting, on the Obama administration's Data.gov open government data repository, three "high-value data sets." Their response was a list of volunteer opportunities from serve.gov; a list of government recreation facilities; three data sets concerning wildland fires; and an elaboration on the United States' dwindling stock of wild mustangs.
So I asked Interior: What makes the wild American donkey so important?
"One of the mandates under the open government directive was that the data being published was central to the agency mission and that was the case with these projects," Kendra Barkoff, a Department of Interior spokeswoman, wrote in reply.
Well, fair enough. But the Department of the Interior also handles Native American affairs, which concern a population dealing with high unemployment, poor infrastructure in many places, and little to no attention from the general public. It struck me as odd that mustangs, even though Secretary of the Interior Ken Salazar has taken heat on the issue in the past, would be the subject of the data Salazar's department chose to make public via Data.gov.
After poring over the statistics and research on Data.gov that are relevant to their areas of interest, a small handful of researchers I spoke to say that much of it is stuff they've either seen before or don't find especially useful. The Sunlight Foundation's Bill Allison has already opined on this subject: the OMB directive mandated that federal agencies post only "data sets not previously available online or in a downloadable format."
UPDATE: On Thursday, the Washington Post came out with similar findings.
"I think that in some cases it may be true that a version or some of the information that was submitted as of Friday deadline had existed in one form or another on a government website," said an OMB spokesman, Tom Gavin. "But what we have found is in that many of those cases the data was not available in a machine readable format."
In other cases, it was available but not free, he said. Part of the point is that the data is now all in one place, and the process of aggregating that data is just beginning.
The problem may not be bureaucratic reticence, but simply that the agencies have only so much good data in the first place. Standards are getting better, researchers tell me, but right now, good government data is hard to find because there often isn't a whole lot of it, not because the government is keeping it to itself.
Data.gov allows users to vote on which datasets are the best. Here are some of the top contenders as of Wednesday afternoon, by number of votes:
- FEMA Disaster Declarations Summary (Five stars, 10 votes)
- Aviation Accident Statistics: Air Carrier Occurrences Involving Illegal Acts (Sabotage, Suicide, or Terrorism), 1989 - 2008 (Five stars, 10 votes)
- U.S. Overseas Loans and Grants (Greenbook) (Five stars, 8 votes)
- History of Economic Forecasts (Five stars, 7 votes)
- FEMA Public Assistance Funded Projects Summary (Five stars, 6 votes)
- New Car Assessment Program (NCAP) - 5 Star Safety Ratings (Five stars, 5 votes)
- Bibliographical Metadata of the Foreign Relations of the United States Series (Five stars, 5 votes)
- Trade Capacity Building (Five stars, 5 votes)
"High value" itself may seem to be a subjective test, but, in Orszag's Dec. 8 directive, the OMB director offered a definition: Raw data that can be used to increase agency accountability and responsiveness; improve public knowledge of the agency and its operations; further the core mission of the agency; create economic opportunity; or respond to need and demand as identified through public consultation.
More than the "high-value" test, the issue of redundancy seems to be a pressing one. Data.gov seems to be better at fulfilling the need for a central clearing-house of information than at serving as a catalyst for the release of new government data, as Orszag's three-new-datasets proviso implied it was meant to be.
"Certainly I love datasets more than the average person," said Ashley Nellis, a research analyst for the Sentencing Project who has scoped out the criminal justice data available on Data.gov, in a Tuesday phone interview. "But those datasets, there wasn't anything new there that I could see."
For Nellis, who researches racial disparity in the criminal justice system, the Bureau of Justice Statistics has such a great website that she has no need of its data in a second place. (She is, for instance, compiling data that the Sentencing Project is gathering on its own from the states on the greater propensity for black people, compared with white people, to have life sentences.)
A spokeswoman for the Justice Department was not immediately available for comment early Wednesday evening.
Similarly, the raw results of a Federal Voting Assistance Program survey of overseas voters after the 2004 Presidential elections were among the data released by the Department of Defense. But FVAP released that data online last summer, Claire Smith, research director of the Overseas Vote Foundation, told me Wednesday in a phone interview. (She's eagerly awaiting the results of the 2008 survey, she says, which she expects soon.)
When new appointee Bob Carey took over the Federal Voting Assistance Program in summer 2009, Smith says, she immediately noticed the program become more open. So perhaps it's fair to say that the data did come out as a result of the Obama administration's professed open-government ethos — just not this particular initiative.
Then what is left to disclose? I asked Smith if there was much else the government might have on this topic that researchers would want.
"I don't think there is any," she said, later adding, "we don't even know how many Americans live abroad."
The Census Bureau hasn't even tried to make an estimate since 2000, Smith said. (Although I imagine there are closely-held figures on the topic somewhere in the country's intelligence community.)
No one at the Department of Defense press shop was immediately available to comment, but it's important to note that this wasn't the only dataset DoD posted. Its log of received Freedom of Information Act requests, which the department also indicated was a "high-value" dataset, has garnered attention, judging by the number of people who have voted for it on Data.gov as a useful set of information. DoD had posted ten datasets in total as of this writing.
The Department of the Interior's data caught my eye because it is the federal department tasked with keeping track of Native American affairs. If ever there was an underappreciated area of research, this would be it — but when I went looking, I found a census of feral mustangs, and no information on the unemployment rate on reservations.
Similarly, there are data available on prisons in Indian country, but the most current set on Data.gov, as of this writing, is from 2001.
"[There's a] distinct difference between having the data sets available," said Peter Morris, director of strategy and partnership for the National Congress of American Indians, "and having the kind of data you need."
Specifically, he was talking about the Treasury's Recovery Act data, another popular dataset on Data.gov. The data is there on investments made through the Recovery Act, including through the Community Development Financial Institutions Program. That program, Morris says, has facilitated investment in Native American communities, some of which are in grave need of infrastructure like better roads and the jobs that would be filled by people building them. But the data isn't structured such that Morris can sort out investments made in Native American communities from those that aren't, he said.
Morris heaped praise on Interior Secretary Ken Salazar for his willingness to pay attention to Native American affairs, and said he feels that Salazar, and by extension the entire department, understands the need for better data.
It's a refrain Nellis repeated: It's very slowly getting easier to get good data, and this is a focus of the Obama administration. But it's been a process for at least the last eight years.
Interior and the Bureau of Labor Statistics keep separate unemployment data, for example, and Interior uses different standards than Labor. This means comparing unemployment rates of people living on Native American land and in the surrounding states would be like comparing, well, horses and burros: The two are similar, but just not the same. As a result, says Morris, Alaska, Arizona and Minnesota (all of which have sizeable populations on Native American land with higher unemployment rates than the states themselves) were classified as having an unemployment rate under the threshold required to gain an extension of unemployment benefits that was granted last year.
The Obama administration appears to be placing an emphasis on Data.gov even as it pursues more arduous but arguably more relevant aspects of institutional change behind closed doors, and it is by all accounts moving in that direction. The same memo that established the Data.gov dump deadline also required each agency to designate a senior official responsible for the "quality and objectivity" of records that track federal spending and, to the delight of data nerds worldwide, established that federal data should be as granular as possible. I heard from several researchers that change in this arena, while slow, was coming.
Gavin, the OMB spokesman, says a list of the agency officials responsible for data quality and a list of people on a related inter-agency working group are both supposed to be released soon.
While Data.gov is flashy, easy to explain and even starting an open-government competition of sorts with the United Kingdom, submitting data to the website is really the least difficult of the commitments the administration is now expected to keep.
The more abstract task of establishing standards for data, and actually collecting and entering it, is like wrangling a mustang: It's quite hard, and technology can only make it so much easier.
I asked Morris, of the National Congress of American Indians: Is Data.gov, at least for now, susceptible to garbage in, garbage out?
"That theme," he said, "comes through in the data you see here."