IRE boot camp attendee shares Pulitzer Prize for National Reporting

InsideClimate News became the third, and smallest, web-based organization to win a Pulitzer Prize, placing first on Monday in National Reporting for "The Dilbit Disaster: Inside the Biggest Oil Spill You've Never Heard Of." Months ago, reporter Lisa Song brought a database of pipeline spills to an IRE/NICAR boot camp and began learning to work with the data. She later wrote a piece for Uplink, IRE's journal of computer-assisted reporting, about her reporting. It's republished below:

First Venture: Probing pipeline leak detection

By Lisa Song, InsideClimate News

I became interested in pipeline data after reporting on the Keystone XL oil pipeline. There was (and still is) a lot of debate about the pipeline's projected spill rate and safety. TransCanada, the Canadian company behind the project, already has one U.S. pipeline, which leaked 14 times within its first year of operation. I didn't know if that was unusual, so I wanted to compare TransCanada's record to the leak rates from other companies.

That story eventually proved too much to tackle, but it led me to another story about leak detection. As it turns out, the leak detection technology installed on the nation's pipelines detected just 5 percent of all oil spills from the past 10 years.

There are two sources for the data I needed: the pipeline industry and the federal government. The industry database was private and proprietary. But the Pipeline and Hazardous Materials Safety Administration (PHMSA)—the agency that regulates interstate pipelines—keeps its data on a public website. The database is posted as Microsoft Excel spreadsheets and regularly updated as new leaks are recorded. The database is quite detailed, with information on the name of the pipeline operator, the leak location, spill size, cost of cleanup and environmental damage, plus a list of technical specifications on the cause of the failure, the age of the section that failed and its maintenance history. The database is also fairly clean and includes instructions on how to choose the right delimiters for data import.

That's the good news. The downside is that the records are split into four spreadsheets: pipeline spills 2010-2012; 2002-2009; 1986-2002 and pre-1986. All the spills are self-reported by the responsible parties, and every time PHMSA updates the incident reporting form to ask for new or different information, the agency has to start a new file.

I brought the database to the March 2012 IRE and NICAR boot camp in Columbia, Mo., and spent the open lab hours trying to append the spreadsheets. I didn't get far before stopping: some of the spreadsheets had hundreds of fields, and there were many that either didn't match up or described technical details I didn't understand. I wasn't sure which fields were important, and I didn't want to waste time with data I would never actually use. I decided that I would run separate queries on the individual spreadsheets and join them later if needed.
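One way to size up a mismatch like this before appending anything is to compare the header rows of the files. The sketch below is only an illustration of that idea, not a step taken for the story: the field names are invented, and the real PHMSA files have hundreds of columns with different names.

```python
# Hypothetical header rows for two of the files. These field names are
# invented for illustration and are NOT the real PHMSA column names.
headers = {
    "2002-2009": ["OPERATOR_NAME", "INCIDENT_DATE", "BARRELS_LOST", "CAUSE"],
    "2010-2012": ["OPERATOR_NAME", "INCIDENT_DATE", "BARRELS_RELEASED",
                  "CAUSE", "PIPE_AGE"],
}

# Fields present in every file can be appended without any renaming.
shared = set.intersection(*(set(cols) for cols in headers.values()))

# Fields found in only some files need a decision: rename, map or drop.
unmatched = {name: sorted(set(cols) - shared) for name, cols in headers.items()}

print("shared fields:", sorted(shared))
for name, cols in unmatched.items():
    print(name, "fields with no match:", cols)
```

A list like this makes it easier to decide which fields are worth reconciling and which can be ignored, which is the judgment call described above.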

There was a bigger problem. I soon realized it would take much longer than I'd thought to compare leak rates across different companies. There were dozens -- if not hundreds -- of pipeline operators and subsidiaries, and pipelines often switched operators over time, or merged and assumed joint ownership. Plus, some of the crucial information (like the name of the pipeline or line segment where a leak had occurred) was missing. I called PHMSA and asked if the missing info was available in some other form. It wasn't. If I wanted to fill in the gaps, I'd have to call the companies individually.

By this time, months after the boot camp, I was getting frustrated, and afraid of losing all those boot camp skills. I decided to run some queries using the MySQL database manager with the Navicat interface just for fun. InsideClimate News was reporting extensively on the aftermath of the July 2010 Kalamazoo River oil spill, caused by a ruptured pipeline that spewed more than a million gallons of tar sands oil in 17 hours. That's how long it took the company to realize it had a spill, so it made me curious about leak detection.

Pipeline companies use a variety of ways to look for leaks. They conduct regular inspections, and members of the public can call an emergency number to report a spill. But many pipelines are hundreds of miles long, so the only method that works 24/7 along the entire length of the line is remote sensing technology. These sensors measure pressure and flow rates and alert the pipeline control center when they sense something that could be a leak.

I found a field in the 2010-2012 database that described how each leak was detected. Operators could choose from a number of categories, including company employee on the scene, member of the public, aerial patrols or their remote leak detection systems. I ran a GROUP BY and COUNT(*) query on the leak identification field. I also needed a WHERE line, because the database contains info on all hazardous liquid spills (e.g., crude oil, gasoline, liquid carbon dioxide), and I wanted to filter for just the crude oil data.
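A query of that shape might look like the sketch below. It is not the SQL used for the story: it runs on Python's built-in sqlite3 module rather than MySQL and Navicat, and the table name (spills), the column names (commodity, detected_by) and the handful of rows are all invented for illustration. The real PHMSA field names and values differ.

```python
import sqlite3

# Toy in-memory table. Names and rows are invented, NOT PHMSA data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spills (commodity TEXT, detected_by TEXT)")
conn.executemany(
    "INSERT INTO spills VALUES (?, ?)",
    [
        ("CRUDE OIL", "COMPANY EMPLOYEE"),
        ("CRUDE OIL", "COMPANY EMPLOYEE"),
        ("CRUDE OIL", "PUBLIC"),
        ("CRUDE OIL", "REMOTE DETECTION SYSTEM"),
        ("CRUDE OIL", None),      # detection field left blank
        ("GASOLINE", "PUBLIC"),   # filtered out by the WHERE line
    ],
)

# WHERE keeps only crude oil; GROUP BY and COUNT(*) tally each
# detection method, with blank entries grouped together as NULL.
QUERY = """
    SELECT detected_by, COUNT(*) AS n
    FROM spills
    WHERE commodity = 'CRUDE OIL'
    GROUP BY detected_by
    ORDER BY n DESC
"""
counts = dict(conn.execute(QUERY).fetchall())
for method, n in counts.items():
    print(method, n)
```

Blank entries come back as a single NULL group (None in Python), so they are easy to spot and count separately.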

About a third of the entries came up as nulls, but that was OK, because they were for the small spills (less than 5 barrels, or 210 gallons) that required only partial reporting.

Of the remaining 202 leaks, fewer than 10 percent were discovered by remote leak detection technology. That was much lower than I'd expected. I ran the same query on the 2002-2009 spills and got similar results. That's when I knew I had a story. I decided not to analyze any spills from before 2002, because leak detection systems are constantly evolving and I didn't want the results skewed by outdated technology.

I sent my results, plus the SQL, to PHMSA for verification. In the meantime, I started interviewing pipeline experts to learn about leak detection technology. I found that it's hard for remote sensors to detect small leaks, and even when the technology works well, there's a lot of room for human error.

PHMSA responded with a meticulous fact check. I'd made a couple of minor mistakes. For example, I downloaded the 2010-2012 Excel file in March, which included pipeline leaks through Feb. 2012. By the time I ran the analysis, it was August, so PHMSA suggested I download the updated file to add the spills from March through July. That increased the total number of spills but it had little effect on the breakdown of how leaks were detected.

When I combined the results and calculated the percentages for oil spills from 2002 through July 2012, I found that remote sensors detected only 5 percent of all spills. The general public detected four times as many leaks as the remote sensors, and most of the spills (62 percent) were found by company employees at the scenes of the accidents.
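The arithmetic behind a combined figure like this can be sketched as follows. The counts are invented for illustration and are not the PHMSA numbers, and the category labels are hypothetical. The point of the sketch is that the tallies from the two files are added first and the percentages are computed from the combined totals, rather than by averaging each file's percentages.

```python
from collections import Counter

# Invented tallies per detection method for each file. NOT the real figures.
counts_2002_2009 = Counter({
    "COMPANY EMPLOYEE": 60,
    "PUBLIC": 22,
    "REMOTE DETECTION SYSTEM": 5,
    "OTHER": 13,
})
counts_2010_2012 = Counter({
    "COMPANY EMPLOYEE": 30,
    "PUBLIC": 8,
    "REMOTE DETECTION SYSTEM": 3,
    "OTHER": 9,
})

# Add the raw counts, then divide by the combined total.
combined = counts_2002_2009 + counts_2010_2012
total = sum(combined.values())
shares = {method: round(100 * n / total, 1) for method, n in combined.items()}

for method, pct in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{method}: {pct} percent")
```

Working from combined counts matters when the files cover different numbers of spills, because a simple average of two percentages would give the smaller file too much weight.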

With these numbers in hand, I called the Association of Oil Pipe Lines, an industry group that represents pipeline companies. They told me the technology works better for larger spills, the ones with the greatest effect on people, property and the environment.

So I took the SQL and added an extra parameter on the WHERE line for all spills larger than 1,000 barrels (42,000 gallons). I chose that number because PHMSA considers all leaks larger than 50 barrels to be "significant," and I wanted to go far above and beyond that standard.
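Adding a size condition to the WHERE line might look like the sketch below. As before, this is an illustration under assumptions rather than the query used for the story: it uses Python's built-in sqlite3 module instead of MySQL, and the table, the column names (commodity, detected_by, barrels) and the rows are invented. The only figure taken from the text is the conversion of 42 gallons per barrel.

```python
import sqlite3

GALLONS_PER_BARREL = 42  # so 1,000 barrels is 42,000 gallons

# Toy in-memory table. Names and rows are invented, NOT PHMSA data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE spills (commodity TEXT, detected_by TEXT, barrels REAL)"
)
conn.executemany(
    "INSERT INTO spills VALUES (?, ?, ?)",
    [
        ("CRUDE OIL", "REMOTE DETECTION SYSTEM", 20000.0),
        ("CRUDE OIL", "PUBLIC", 1500.0),
        ("CRUDE OIL", "COMPANY EMPLOYEE", 1200.0),
        ("CRUDE OIL", "COMPANY EMPLOYEE", 3.0),  # too small
        ("CRUDE OIL", "PUBLIC", 1000.0),         # not strictly larger
        ("GASOLINE", "PUBLIC", 5000.0),          # wrong commodity
    ],
)

# Same GROUP BY as before, with one extra condition on spill size.
QUERY = """
    SELECT detected_by, COUNT(*) AS n
    FROM spills
    WHERE commodity = 'CRUDE OIL'
      AND barrels > ?
    GROUP BY detected_by
"""
threshold_barrels = 1000
large = dict(conn.execute(QUERY, (threshold_barrels,)).fetchall())

print("threshold in gallons:", threshold_barrels * GALLONS_PER_BARREL)
print(large)
```

Passing the threshold as a parameter makes it easy to rerun the same query with a different cutoff, for example to look only at small spills.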

There were 71 spills of that magnitude between 2002 and July 2012. Twenty percent were detected by technology, a big improvement over the results for all spills, but the general public still found 17 percent of the large spills.

For added context, I ran the queries again, this time limiting my search to small spills. As it turns out, 76 percent of the spills from that time period were less than 30 barrels (1,260 gallons), and that helped explain why so few were detected by the remote sensors.

This was definitely a data-driven story, and great for a first venture after boot camp. The analysis was straightforward because the database was publicly available and the fields I used were all clean. It gave me a chance to get comfortable with MySQL and Navicat (I couldn't use Microsoft Access because my work computer is a MacBook). And it inspired me to write additional stories on pipeline safety that are currently in progress. Someday I'd love to get my hands on that secret industry database. I hear it's more detailed, and probably has fewer blanks. Until that miracle occurs, I'm quite happy with the PHMSA data, and I'm sure it will lead to more ideas.

Lisa Song is a reporter for InsideClimate News: lisa.song@insideclimatenews.org