Web scraping serves as a helpful last-resort reporting tool for data journalists, but it comes with its fair share of ethical and technical concerns. At this year’s NICAR conference, Martin Burch of The Wall Street Journal, Ricardo Brom of La Nación, Amanda Hickman of BuzzFeed’s Open Lab for Journalism, Technology, and the Arts, and David Eads of NPR Visuals discussed what to consider when building a web scraper.
Burch explained that before scraping, it’s important to contact the organization you want to draw from to determine whether you can get the data through a records request, an application programming interface (API), or another method. If scraping is necessary, consider how you plan to maintain and verify the data once your scraper is built.
The panelists agreed that scraped data should be taken with a grain of salt. If scraping is necessary, it usually means that the agency or institution keeping the records is having trouble organizing its data in the first place.
“Data is incredibly useful, but it’s not magically more definitive because it’s data,” Hickman said. “It’s still made by people.”
How people handle what’s being scraped, both those creating the data and those consuming it, was the foundation for most of the ethical considerations raised.
Here are three ethical frameworks for scraping proposed by the panel:
- Do no harm: Don’t overload the site’s server, and respect the fair use test of copyright.
- Snapshot disappearing information: Use data from sites that present the most recent information, and be a watchdog for summary reports that don’t match the data they rely on.
- Use your scraping code as an extension of yourself as a reporter: If you can read something, your bot can too.
The panelists said using these techniques helps ensure journalists are creating useful data sets and encourages transparency among sources.
They also proposed several tools and tricks for making sure that your web scraper is as efficient and effective as possible. To avoid overloading the source, try to:
- Cache the data you’ve already downloaded. If you have to start over again for whatever reason, you won’t have to re-download it.
- Control your request rate. Use a sleep() call to suspend execution of the scraper for a given number of seconds between requests.
- Keep tabs on any changes by using HEAD requests, which return a page’s headers without downloading its body.
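The three tactics above can be sketched in Python roughly as follows. This is a minimal illustration using only the standard library, not code presented by the panel. The cache directory name, the two-second delay, and the function names (`fetch_cached`, `has_changed`, and so on) are all assumptions made for the example, and the change check assumes the server sends a `Last-Modified` header, which not every site does.

```python
import hashlib
import time
import urllib.request
from pathlib import Path

CACHE_DIR = Path("scrape_cache")  # hypothetical cache location
DELAY_SECONDS = 2.0               # assumed polite pause between downloads


def cache_path(url, cache_dir=CACHE_DIR):
    """Map a URL to a stable file name inside the cache directory."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return Path(cache_dir) / f"{digest}.html"


def http_get(url):
    """Download the full body of a page."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def head_last_modified(url):
    """Send a HEAD request and return the Last-Modified header, if any."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.headers.get("Last-Modified")


def fetch_cached(url, cache_dir=CACHE_DIR, download=http_get,
                 delay=DELAY_SECONDS, sleep=time.sleep):
    """Return the page body, touching the network only on a cache miss."""
    path = cache_path(url, cache_dir)
    if path.exists():
        # Already downloaded: reuse the saved copy, no request, no pause.
        return path.read_bytes()
    body = download(url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(body)
    sleep(delay)  # pause after each real download to limit the request rate
    return body


def has_changed(url, previous_last_modified, head=head_last_modified):
    """Compare the current Last-Modified header with a saved value.

    A missing header is treated as "changed", since nothing can be ruled out.
    """
    current = head(url)
    return current is None or current != previous_last_modified
```

If the scraper is restarted, `fetch_cached` reads saved pages from disk instead of requesting them again, so the source site only sees one download per URL. The `download`, `sleep`, and `head` parameters exist so the functions can be exercised without making real requests.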
While these tactics can help troubleshoot the scraping process, the panelists reiterated that scraping is a reporting tool rather than a means of collecting foolproof data. Eads said he uses scraped data for gathering story ideas, getting a rough sense of bulk data, and investigating the underlying data system of an organization. By following ethical guidelines and technical best practices, journalists can use web scrapers for much more than simply bulk data collection.
Riley Beggin is a journalism graduate student at the University of Missouri and a volunteer at IRE. You can find her on Twitter @rbeggin.