Breathing New Life into Data with the "Scrapeathon"
By Rebecca Chao | Monday, January 6, 2014
At the heart of most civic-oriented hackathons, those roughly 24-hour gatherings to code and create innovative apps for the public good, is data. But many hackathons suffer from a lack of quality data, or of knowledge about where to find it, a problem that Benjamin Gans says he and his team at Data Publica, a for-profit data-crunching company, noticed after attending and hosting a number of their own hackathons. They have coined the term "scrapathon," or scrapeathon, to describe the new data-scraping events they have begun hosting to give data a new and more purposeful life.
Similar to a hackathon, a scrapeathon takes place over a short period of time and offers developers and coders an opportunity to work together to find and scrape data off websites, either using pre-existing web crawlers and scrapers or by developing their own software. Data is often “stuck” on a site because of its format (PDFs, JPEGs, charts and graphs, for example) or behind authentication requirements or firewalls that block software from automatically plucking information off the site.
Data Publica's first scrapeathon, held over half a day on June 12 of last year in Paris, attracted 60 attendees. “It was really a kind of experiment, this event,” Gans tells techPresident. He manages marketing and communications at Data Publica. “We just wanted to see whether there would be an audience and we were curious about what kind of data they could scrape and cull. We didn't know exactly who was coming and if they could cull any data.”
To make sure that everyone could begin the scrapeathon with a basic grounding in data scraping, the event kicked off with a two-hour tutorial. Those with little to no coding skills received training on how to use Outwit, a cheap and easy-to-use web scraping program. Others received tips on how to build their own scraping tools in Python or Java. Attendees then separated into seven teams that scraped websites of varying levels of difficulty, from simply aggregating existing data on a site into a more user-friendly or analyzable format, to more complex tasks such as tackling websites that block automatic scraping. Another issue addressed during the scrapeathon was developing ways to keep the data “fresh,” or up to date. Gans explains that maintaining the effectiveness of a web app created during a hackathon requires making sure the data behind it doesn’t expire.
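The simpler end of that spectrum, turning a page's HTML into analyzable rows, can be sketched in a few lines of Python using only the standard library. This is a hypothetical illustration, not one of the scrapeathon teams' actual scripts, and the sample HTML stands in for a page fetched from a real site:

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collect the cell text of every <tr> on a page into rows of strings."""
    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows
        self._row = None      # row currently being built
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())

# A stand-in for HTML downloaded from a real site:
page = ("<table><tr><th>City</th><th>Rainfall</th></tr>"
        "<tr><td>Jaipur</td><td>650</td></tr></table>")
scraper = TableScraper()
scraper.feed(page)
print(scraper.rows)  # [['City', 'Rainfall'], ['Jaipur', '650']]
```

Real-world scrapers built on libraries such as BeautifulSoup or Scrapy follow the same basic pattern: parse the markup, pull out the cells, and emit structured records.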
One project involved scraping the Wikipedia pages of French economists to uncover their links to private organizations and banks. Another, which is still ongoing because it is very complex, involves scraping legislative material from the Paris city council websites.
While Data Publica is not currently exporting the scrapeathon on a large scale, it did help organize one in Chile only a few weeks after the Paris event. On June 29, the tech and innovation research institute INRIA held its first scrapeathon in Santiago. More journalists attended in Santiago than at the Paris event, says Gans, which is a good thing. He says Data Publica hopes that in the future the scrapeathon will attract people from a diverse array of professions, which is key to making these events more effective. “You need statisticians and developers to know where data is available but also in a scrapeathon, you need journalists to give purpose to the data that is extracted,” says Gans. Next month, he says, they will host another scrapeathon involving players from a number of different professions, such as graphic designers and those working in data visualization.
Before there was the Scrapeathon
Even if Data Publica’s scrapeathon is the first of its kind (though it appears that in Bangalore, a group of developers got together in July 2012 for an informal 'scrapathon' to collect rainfall data for Rajasthan), the concept behind it is not new. There have been a number of data-scraping initiatives, ranging from very targeted projects to ones more massive in scale. As techPresident previously wrote, Argentine programmer Manuel Aristarán created Tabula, a data scraper for PDFs that has been used primarily by newsrooms.
James Turk, at the Sunlight Foundation, leads the Open States Project, a large-scale data-scraping effort that culls state legislative data on a daily basis. The Open States Project gathers information from all 50 states, Washington, D.C. and Puerto Rico, and publishes the data on its website, in a mobile phone app and through an API, with bulk downloads of all its data sets. The site allows users to look up a bill they are interested in, trace its history and their representatives’ stances on it, receive alerts on bill updates and contact their lawmakers. They can also search for bills by subject matter and find out who their legislators are by plugging in their addresses. The API has been used by a number of news publications, such as National Public Radio for its State Impact project, the Chicago Tribune for analyzing Illinois pension codes, and the New York Times for visualizing gun control bills.
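As a rough illustration of the kind of lookup the site and API make possible, here is a toy Python sketch run over locally stored bill records. The record shape and field names are simplified assumptions for the example, not Open States' actual schema:

```python
# Hypothetical, simplified records of the kind a legislative API might
# return; these field names are assumptions, not Open States' real schema.
bills = [
    {"id": "HB 1", "state": "il", "subjects": ["Pensions"], "sponsor": "Rep. A"},
    {"id": "SB 2", "state": "il", "subjects": ["Gun Control"], "sponsor": "Sen. B"},
    {"id": "HB 3", "state": "pa", "subjects": ["Gun Control"], "sponsor": "Rep. C"},
]

def bills_by_subject(bills, state, subject):
    """Filter a bulk download to one state and subject, as the site search does."""
    return [b["id"] for b in bills
            if b["state"] == state and subject in b["subjects"]]

print(bills_by_subject(bills, "il", "Gun Control"))  # ['SB 2']
```

A newsroom analysis like the Tribune's pension-code work amounts to the same operation at scale: pull the bulk data once, then filter and aggregate it locally.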
Open States also put its own data to work, using it to grade each state on the quality of its open data. The team created an A-to-F report card in March 2013, evaluating each state on ease of access to data, whether the data was already available in a machine-readable format and the completeness of data sets, among other criteria. After receiving less-than-satisfactory scores, several states reached out to Open States to make clarifications or note improvements to their websites. Most recently, in November, Pennsylvania gave its website a facelift, improving the timeliness of its data, as well as its ease of access and machine readability.
While some developers, like Aristarán, have had issues with governments making it more difficult to scrape data, adding CAPTCHAs or other authentication systems, the Open States Project did not run into any such obstacles. “We have surprisingly had no real resistance,” Turk writes in an e-mail to techPresident. “The data is essentially in the public domain and states had no issue with us aggregating it.” In fact, legislators actually use the website and phone app to make their own work more efficient and to keep track of legislation. “It's often the best or only mobile interface to the state legislative data, as most states don't have good mobile sites,” says Turk.
Even so, litigation over web scraping has increased over the last several years. One prominent example of a scraping project gone sour is Pete Warden’s Facebook project, which he says nearly got him “sued into oblivion” by Facebook. In 2010, Warden built a web crawler for only $100 that allowed him to collect a massive amount of information from publicly available Facebook pages. Within a few hours, his application had crawled through 500 million pages and collected information on about 220 million Facebook users. After anonymizing the data, he created an interactive graphic showing Facebook users’ relationships with one another across different cities, states and countries. After a series of negotiations, which required that he destroy his database, Warden was spared from going to court and potentially facing bankruptcy from the cost of doing so.
The likelihood of running into trouble with publicly available information on government websites is much slimmer, though nonprofits and private companies could still take action. Data Publica is aware of the legal issues surrounding web scraping and invited a lawyer to discuss the legal ramifications of scraping data during its first scrapeathon. Data Publica has also published a white paper providing a safe list of all the publicly available data offered by governments, transnational organizations, trade associations and some private companies that have embraced open data.
Disclosure: TechPresident's Andrew Rasiej and Micah Sifry are senior advisers to the Sunlight Foundation.
Personal Democracy Media is grateful to the Omidyar Network and the UN Foundation for their generous support of techPresident's WeGov section.