Web scraping serves as a helpful last-resort reporting tool for data journalists, but it comes with its fair share of ethical and technical concerns. At this year’s NICAR conference, Martin Burch of The Wall Street Journal, Ricardo Brom of La Nación, Amanda Hickman of BuzzFeed’s Open Lab for Journalism, Technology, and the Arts, and David Eads of NPR Visuals discussed what to consider when building a web scraper.
Burch explained that before scraping, it’s important to contact the organization you want to draw from to determine whether you can get the data through a records request, an application programming interface (API), or another method. If scraping is necessary, consider how you plan to maintain and verify the data once your scraper is built.
The panelists agreed that scraped data should be taken with a grain of salt. If scraping is necessary, it usually means that the agency or institution keeping the records is having trouble organizing its data in the first place.
“Data is incredibly useful, but it’s not magically more definitive because it’s data,” Hickman said. “It’s still made by people.”
The way people — both the ones creating and consuming the data — handle what’s being scraped was the foundation for most of the ethical considerations raised.
Here are three ethical frameworks for scraping proposed by the panel:
The panelists said using these techniques helps ensure journalists are creating useful data sets and encourages transparency among sources.
They also proposed several tools and tricks for making sure that your web scraper is as efficient and effective as possible. To avoid overloading the source, try to:
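As a rough illustration of not overloading a source, here is a minimal Python sketch that pauses between requests and checks a site's robots.txt rules before fetching a page. The names `Throttle` and `allowed_by_robots`, the two-second delay, and the `newsroom-bot` user agent are hypothetical choices for this example and do not come from the panel.

```python
import time
import urllib.robotparser


class Throttle:
    """Enforce a minimum pause between consecutive requests to one site."""

    def __init__(self, min_delay_seconds, clock=time.monotonic, sleep=time.sleep):
        # clock and sleep are injectable so the class can be tested without waiting.
        self.min_delay = min_delay_seconds
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to keep requests min_delay apart.

        Returns the number of seconds slept.
        """
        slept = 0.0
        if self._last is not None:
            remaining = self.min_delay - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical robots.txt content; a real scraper would download the site's own file.
    robots = "User-agent: *\nDisallow: /private/\n"
    throttle = Throttle(min_delay_seconds=2.0)  # assumed delay, tune per site
    for url in ["https://example.org/records/1", "https://example.org/private/1"]:
        if allowed_by_robots(robots, "newsroom-bot", url):
            throttle.wait()
            print("would fetch", url)  # the actual HTTP request would go here
        else:
            print("skipping", url)
```

Calling `throttle.wait()` before every request keeps the scraper from hitting the server faster than the chosen delay. The right delay depends on the site, and it may be worth asking the organization what load it can tolerate.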
While these tactics can help troubleshoot the scraping process, the panelists reiterated that scraping is a reporting tool rather than a means of collecting foolproof data. Eads said he uses scraped data for gathering story ideas, getting a rough sense of bulk data, and investigating the underlying data system of an organization. By following ethical guidelines and technical best practices, journalists can use web scrapers for much more than simply bulk data collection.
Riley Beggin is a journalism graduate student at the University of Missouri and a volunteer at IRE. You can find her on Twitter @rbeggin or email her at firstname.lastname@example.org.