Parsing prickly PDFs

  • Event: 2016 CAR Conference
  • Speakers: Jacob Fenton of Public Accountability Initiative; Jeremy Singer-Vine of BuzzFeed News
  • Date/Time: Saturday, Mar. 12 at 2:15pm
  • Location: Denver III & IV

Sometimes, life gives you ugly PDFs. In this session, we'll introduce you to a range of tools for pulling structured data out of journalists' most-hated file format. We'll cover point-and-click software, command-line utilities, and libraries for writing custom PDF parsers. (Most of the tools require no programming experience.)

Speaker Bios

  • Jacob Fenton is the lead developer of The Investigative Reporting Workshop's Public Accountability Project. He previously worked as Editorial Engineer at The Sunlight Foundation, as Director of Computer-Assisted Reporting at IRW, and as a reporter and editor for newspapers in Pennsylvania and California. In 2015/16 he was a JSK Fellow at Stanford. He's based in Portland, Oregon.

  • Jeremy Singer-Vine is the data editor at BuzzFeed News. He also publishes Data Is Plural, a weekly newsletter of useful/curious datasets. Website: jsvine.com

Related Tipsheets

  • Parsing prickly PDFs repo
    Sometimes, life gives you ugly PDFs. In this repo, we'll introduce you to a range of tools for pulling structured data out of journalists' most-hated file format. We'll cover point-and-click software, command-line utilities, and libraries for writing custom PDF parsers. https://github.com/jsfenfen/parsing-prickly-pdfs