#FlashHacks: Crowdscraping Corporate Data to Understand "The Man"
BY Jessica McKenzie | Tuesday, July 8 2014
You probably work for “The Man.” If not you, then someone close to you does, and even if you have no friends or family, your government is almost certainly doing business with him. Wouldn't it be nice to know a bit more about the so-called “Man”? Thanks to the massive open data project OpenCorporates, you now can, and they are intensifying their data opening efforts with #FlashHacks, a crowdscraping campaign launched today. The campaign goal is to release 10 million data points on the companies you work for, work with, buy from, sell to, and deal with in tangible and intangible ways every day, and all in just 10 days.
OpenCorporates launched three years ago and has since become the world's largest open database of companies, with information on more than 75 million companies, all of it sourced directly from governments and completely verifiable. Every week they add roughly a million more pieces of information; during #FlashHacks they want to sharply increase that rate.
Hera Hussain, OpenCorporates community and communications manager, tells techPresident that although governments are making open data pledges and signing partnerships left and right, actually making data open is still quite slow. This is, Hussain says, because governments “don't want to, can't be bothered, or don't have the resources.”
Open data is about more than just making information public; it's about making information freely and openly searchable, and machine-readable. That means PDFs need not apply.
And yet governments continue to release information in the form of scanned PDF documents. To get at the information, developers write a kind of bot called a scraper, which crawls through websites or documents and extracts data.
That is what OpenCorporates has called upon the crowd to help them do: scrape websites for information.
Each dataset has particular structural quirks that the bot must be taught to recognize so that it can accurately sort information. The complexity of the bot depends on the complexity of the documents it is meant to scrape, so writing a bot can take anywhere from 20 minutes to four days.
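For readers who have never seen one, a scraper at the simple end of that range might look something like the sketch below. It is only an illustration, not OpenCorporates' own tooling: the sample page, the company names, and the `TableScraper` and `scrape_companies` names are all made up for this example, and it assumes the register is published as a plain HTML table. It uses nothing but Python's standard library. The "quirks" it has been taught are small ones: stray whitespace inside cells, and rows that do not match the header.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # rows that have been fully read
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse line breaks and repeated spaces inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_companies(html):
    """Return one dict per data row, keyed by the header row's labels."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    # Skip rows whose cell count does not match the header.
    return [dict(zip(header, row)) for row in body if len(row) == len(header)]


# Hypothetical sample page; a real scraper would download this from a register.
SAMPLE_PAGE = """
<table>
  <tr><th>Company Name</th><th>Registration No.</th><th>Status</th></tr>
  <tr><td>Example Holdings Ltd</td><td> 0001234 </td><td>Active</td></tr>
  <tr><td>Sample Trading
      Co</td><td>0005678</td><td>Dissolved</td></tr>
</table>
"""

records = scrape_companies(SAMPLE_PAGE)
for record in records:
    print(record)
```

A real government register would likely need more than this, such as following pagination or handling a different layout on every page, which is where the four-day end of the estimate comes in.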
Although OpenCorporates already has support from the tight-knit open data community, Hussain says they hope people share this campaign with their other networks. This is the perfect project for a developer who may not work with open data every day, but who has 20 minutes or an hour to spare to help change the world.
Information from OpenCorporates has already been used by Global Witness in their research on anonymous companies and by Open Oil in their work to uncover the sprawling, multilayer structure of the oil multinational BP.
#FlashHacks is timed to end with the Open Knowledge Festival in Berlin next week, where Open Oil will present some of their findings.
Perhaps the greatest thing about #FlashHacks is that all of the datasets (take a look at some of the suggested missions here, rated by difficulty level) have been requested by NGOs.
Hussain said it was all a bit hush-hush before the campaign launched, and that they only reached out to NGOs they have a history with, but tomorrow OpenCorporates will put up a Google Doc where any NGO can suggest datasets that would be of use to their organization.
Personal Democracy Media is grateful to the Omidyar Network and the UN Foundation for their generous support of techPresident's WeGov section.