One Researcher-Entrepreneur on "Banning" Wikileaks
BY Nancy Scola | Friday, December 10, 2010
Stuart Shulman is both an assistant professor in UMass Amherst's political science department and the CEO of a company called Texifter, which builds text-analysis software tools and counts amongst its clients the FCC and USDA. In an email conversation earlier this week, Shulman said that, having been asked to host Wikileaks' leaked data, his company is considering a ban on hosting Wikileaks "mega archives" for use with its text tools. I asked Shulman to explain why, and his response is after the jump:
This summer, we downloaded the original leaks from Afghanistan to determine if the Wiki-leakers in fact could have redacted the most sensitive information, say about informants and sources, in a timely as well as effective manner. Our experiments demonstrated that reasonable due diligence combined with off-the-shelf open source software could have produced a dataset with many fewer problematic named entities. For example, using advanced search features in DiscoverText [a Texifter software tool], we can easily find the latitude and longitude of every report in the so-called Afghan War Diaries where sources or informants are mentioned.
We do not like the idea of shying away from controversial data, but there are troubling implications to allowing this data to be processed using our text analytic software. We have chosen not to make the data available as a sample dataset to our users, but that does not prevent someone from uploading it and sharing it with hundreds or thousands of others. As a result, we are considering a ban on Wikileaks' mega-archives, much to the chagrin of some transparency advocates and to the general relief of others. On the other hand, this information is already available and could benefit from the type of advanced analytic tools DiscoverText provides.
Our software company is developing technology that will be helpful in the orderly redaction and declassification of the 420 million paper documents that the U.S. government is already required to release under the law. The practice of leaking relatively small sets of illegally acquired documents onto hundreds of mirror sites is, in our view, a form of information anarchy. A better approach is to support efforts to move massive numbers of documents due to be declassified into the public domain, with necessary, systematic precautions taken to ensure that transparency does not come at the expense of individuals on the ground who risk their safety to bring information to light. We support building and using the best language technologies to make sure that democratic, rights-based information practices win out over risky and destabilizing leaks.
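To give a flavor of the kind of search Shulman describes — flagging every report that mentions a source or informant and pulling out its coordinates — here is a minimal sketch in Python. The record format and field names below are invented for illustration; the real Afghan War Diary data and DiscoverText's actual search features are more elaborate than this.

```python
import re

# Hypothetical, simplified records standing in for leaked report rows.
# The real dataset's fields differ; this only illustrates the
# keyword-plus-coordinates search described above.
reports = [
    {"text": "Meeting with local informant regarding convoy route.",
     "lat": 34.52, "lon": 69.17},
    {"text": "Routine patrol, no contact.", "lat": 31.61, "lon": 65.70},
    {"text": "Source reports weapons cache near the bridge.",
     "lat": 34.34, "lon": 62.20},
]

# Match "source(s)" or "informant(s)" regardless of case.
SENSITIVE = re.compile(r"\b(informant|source)s?\b", re.IGNORECASE)

def flag_sensitive(records):
    """Return (lat, lon) for every record whose text mentions a source or informant."""
    return [(r["lat"], r["lon"]) for r in records if SENSITIVE.search(r["text"])]

print(flag_sensitive(reports))  # two of the three sample records match
```

Even this crude keyword pass illustrates Shulman's point: off-the-shelf tooling can surface the most problematic records quickly, which is exactly why the same tooling cuts both ways.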
It's worth pointing out that, as Shulman says in his email, Texifter is in the business of supporting the analysis and potential redaction of large caches of public data.
But his response points out that this whole Wikileaks episode has pushed many people into making decisions that are nuanced, complicated, and in some cases entirely new. That State Department cable on the reaction inside China to Secretary Clinton's "Internet freedom speech" we discussed earlier this week, for example, contained unredacted names, including that of what seemed to be a U.S. source with intimate knowledge of the Chinese bureaucracy. It's public information, in the sense that it's on the Internet. But handling that information still requires some deep thinking about what's right and proper. It seems to make sense for people who are interested in forming strong, public opinions about Wikileaks to spend at least a little time diving into the substance of what these cables are, and what they reveal about people, places, and events. An easy way to do that is over on Cable Search.
More on this point: On Twitter, Tom Watson suggests typing a phrase like "NGO" into a search on the Wikileaked documents to get a sense of the "collateral damage" from the leaks.