Scrutinizing what you scrape: How The New York Times investigated arbitration

A graphic from The New York Times' arbitration series

Court records have long been a vital tool for journalists looking to hold powerful corporations accountable. But what happens when disputes between companies and consumers move out of open court and into private meeting rooms? What happens when class-action lawsuits – and the wealth of human sources and records that go with them – start to disappear?

Journalists at The New York Times found themselves wrestling with some of these questions as they reported out a three-part series on the rise of arbitration clauses.

Arbitration clauses are tucked into everything from terms of service to employment agreements. They’re used by giants like AT&T, Starbucks and Netflix, as well as smaller companies like Ashley Madison, the adultery dating website. The clauses funnel consumers’ grievances into private hearings where, instead of a judge and jury, people make their case to a corporate lawyer or professional arbitrator. The proceedings play out behind closed doors; there are no appeals, few rules and little oversight.

By interviewing scores of lawyers, judges and plaintiffs, Times reporters Jessica Silver-Greenberg and Michael Corkery were able to piece together a picture of what people can expect if they’ve signed an arbitration clause: Companies can compel arbitration according to religious texts, whether it’s the Bible or the tenets of Scientology. Evidence can be suppressed. Witnesses can be influenced. Class-action lawsuits can be explicitly forbidden.

But how could the reporters move beyond anecdotes? Their sources portrayed arbitration as fundamentally reshaping the American justice system — but how do you quantify that?

Enter Robert Gebeloff, a database projects editor for the Times. He came to the paper in 2008 from the Star-Ledger, and by his own estimation he’s been specializing in data journalism for about 20 years. At the Times he works in partnership with other reporters, bringing his skills to bear on stories that need data analysis.

“Traditionally, what data journalism has meant is getting a data set from a government agency and analyzing it,” Gebeloff said.

But that’s changing.

Nowadays, he said, governments don’t have the data reporters want, or they won’t release it, or it’s formatted inconveniently. So Gebeloff’s job is increasingly focused on designing ways to scrape the data he wants from official sources. That’s what he set out to do for the arbitration story.

Spotting trends

Even though arbitration hearings are opaque by design, the Times found a few ways to glean enough data to discern some trends. Silver-Greenberg knew California law requires any arbitration company operating in the state to open its entire docket to the public, Gebeloff said, creating a window into every case arbitrated across the country. But the data was messy.

Some companies didn’t follow California’s law, and the state didn’t enforce any sort of uniform reporting standards. Even within a single company, different arbitrators disclosed different information in different formats, Gebeloff said. It wasn’t unusual for arbitrators to leave key sections blank.

“The only thing we’d know was a case had been held involving a certain company,” he said. “And that would pretty much be it.”

Nonetheless, Gebeloff began scraping, parsing and pulling the dockets. Standardization was the name of the game; he came up with 25,000 arbitration files, and he created some basic rules for appending the data fields to one another. That made it easier to stack the data into clear categories — the companies, the judges, the outcomes — and it also demonstrated which questions they could answer with their limited dataset. One of the most important trends revealed in the data was how often the same arbitrators handled cases for a single company, creating the appearance of clientelism.
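To make that standardization step concrete, here is a minimal Python sketch. It is not the Times' code: the field names, aliases and sample records are hypothetical, invented for illustration. It maps inconsistently labeled fields onto one schema, leaves blanks as missing values, and then counts how often the same arbitrator turns up with the same company.

```python
from collections import Counter

# Hypothetical aliases: different providers may label the same field differently.
FIELD_ALIASES = {
    "company": ("company", "business_name", "respondent"),
    "arbitrator": ("arbitrator", "arbitrator_name", "neutral"),
    "outcome": ("outcome", "disposition", "result"),
}


def standardize(record):
    """Map one raw record onto the common schema; blank or missing fields become None."""
    clean = {}
    for field, aliases in FIELD_ALIASES.items():
        value = None
        for alias in aliases:
            raw = record.get(alias)
            if raw is not None and str(raw).strip():
                # Collapse stray whitespace and ignore case so variants match.
                value = " ".join(str(raw).split()).upper()
                break
        clean[field] = value
    return clean


def repeat_pairings(records, minimum=2):
    """Count company/arbitrator pairs that appear at least `minimum` times."""
    pairs = Counter(
        (r["company"], r["arbitrator"])
        for r in records
        if r["company"] and r["arbitrator"]
    )
    return {pair: n for pair, n in pairs.items() if n >= minimum}


# Made-up sample rows in three different source layouts.
raw_rows = [
    {"business_name": "Acme Wireless ", "neutral": "J. Doe", "result": "Award"},
    {"company": "ACME  WIRELESS", "arbitrator": "j. doe", "outcome": ""},
    {"respondent": "Acme Wireless", "arbitrator_name": "R. Roe"},
]
rows = [standardize(r) for r in raw_rows]
repeats = repeat_pairings(rows)
```

Keeping blanks as explicit missing values, rather than guessing at them, is what shows which questions a patchy dataset can and cannot answer.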

The Times team wanted to be careful about drawing conclusions from a database pieced together from such sloppy sources, though. Even when a case file detailed how much a customer was awarded, many didn’t list how much the customer had sought. For example, Gebeloff said, “did they win $5 on a $10 claim, or did they win $5 on a $50,000 claim?”

So Gebeloff devised a system for classifying wins and losses: He’d count a case as a win any time the consumer was awarded at least $1. “On one hand you say, OK, so it’s possible some of these people we’re counting as winners maybe don’t feel like winners,” he said. “But on the other hand, it’s good because it’s a conservative measure.”
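A rule like that is simple enough to write down. The Python sketch below is one way it might look, assuming award amounts arrive as text such as "$1,250.00". The function names and the "unknown" label for blank or unreadable awards are assumptions made here; the article does not say how such cases were handled.

```python
from decimal import Decimal, InvalidOperation

WIN_THRESHOLD = Decimal("1")  # the $1 cutoff described above


def parse_amount(text):
    """Turn a string like '$1,250.00' into a Decimal; return None if blank or unparseable."""
    if text is None:
        return None
    cleaned = str(text).replace("$", "").replace(",", "").strip()
    if not cleaned:
        return None
    try:
        amount = Decimal(cleaned)
    except InvalidOperation:
        return None
    # Reject values such as 'NaN' or 'Infinity' that Decimal accepts but that are not amounts.
    return amount if amount.is_finite() else None


def classify(award_text):
    """Label a case 'win', 'loss' or 'unknown' from the consumer's award alone."""
    amount = parse_amount(award_text)
    if amount is None:
        return "unknown"  # assumption: blank awards are set aside, not counted as losses
    return "win" if amount >= WIN_THRESHOLD else "loss"
```

Because the rule looks only at the award, it never needs the amount the consumer originally sought, which is the field many files left out.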

Even with that generous accounting, the percentage of people who win cases against companies through arbitration is low, he said. But the Times team was careful in phrasing their findings:

“The Times found that between 2010 and 2014, only 505 consumers went to arbitration over a dispute of $2,500 or less.

Verizon, which has more than 125 million subscribers, faced 65 consumer arbitrations in those five years, the data shows. Time Warner Cable, which has 15 million customers, faced seven.”

A screenshot showing an internal website built for The New York Times' arbitration reporting. (Courtesy of Robert Gebeloff)

The Times’ data couldn’t reveal arbitration’s inequities, per se; Silver-Greenberg and Corkery’s shoe-leather reporting had that covered. Rather, the data was better suited to demonstrating absences, Gebeloff said. “The bigger points we make with this data are simply counting.”
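For a sense of scale, the Verizon and Time Warner Cable figures quoted above can be turned into rates with a few lines of Python. This is back-of-the-envelope arithmetic added here, not a calculation the Times published, and it treats "more than 125 million" as exactly 125 million.

```python
def per_million(cases, customers):
    """Arbitrations per one million customers over the whole period."""
    return cases / customers * 1_000_000


# Figures quoted from the Times. Verizon's subscriber count is a lower bound,
# so its true rate would be slightly lower than the one computed here.
verizon_rate = per_million(65, 125_000_000)
time_warner_rate = per_million(7, 15_000_000)

print(f"Verizon: {verizon_rate:.2f} per million customers")
print(f"Time Warner Cable: {time_warner_rate:.2f} per million customers")
```

By this rough measure, each company saw about one consumer arbitration for every two million customers over the five years.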

It wouldn’t have made sense to try to squeeze from the data any conclusions about arbitration outcomes, he said. “If these people had, in theory, been able to go to court, would they have had better outcomes? I mean, that’s just unknowable.”

“You could never build a dataset that says, OK, this is what happened in arbitration and this is what happened in comparable court cases. There’s just no way of knowing that. And so people are left to theorize and argue.”

Working with the data

As Gebeloff was beginning to pull together the data on arbitration, Silver-Greenberg and Corkery’s reporting was beginning to suggest that arbitration’s deepest effect was to make companies impervious to class-action lawsuits.

To test that theory, Gebeloff turned to Westlaw, Thomson Reuters’ proprietary database of court cases. Gebeloff pulled the dockets for every case between 2010 and 2014 where a company faced a class-action lawsuit. He downloaded each docket as a paper-formatted report and, using a program he wrote, extracted all the information he could: case numbers, judges, plaintiffs, defendants, court names, each individual motion in the case. Gebeloff turned them all into rows and columns.
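The article does not describe Gebeloff's program, but the general technique of turning a text report into rows and columns can be sketched in Python. Everything specific below is an assumption: the report layout, the field labels and the sample docket are invented, and a real Westlaw export will look different.

```python
import csv
import io
import re

# Hypothetical layout of a text docket report; real exports will differ.
SAMPLE_DOCKET = """\
Case Number: 1:12-cv-00001
Court: U.S. District Court, Example District
Judge: Hon. Jane Example
Plaintiff: John Sample
Defendant: Example Corp.
03/01/2012  1  COMPLAINT filed by John Sample
04/15/2012  7  MOTION to Compel Arbitration by Example Corp.
"""

HEADER_RE = re.compile(r"^(Case Number|Court|Judge|Plaintiff|Defendant):\s*(.+)$")
ENTRY_RE = re.compile(r"^(\d{2}/\d{2}/\d{4})\s+(\d+)\s+(.+)$")


def parse_docket(text):
    """Return (case_fields, entries) extracted from one docket report."""
    case, entries = {}, []
    for line in text.splitlines():
        line = line.strip()
        header = HEADER_RE.match(line)
        if header:
            key = header.group(1).lower().replace(" ", "_")
            case[key] = header.group(2).strip()
            continue
        entry = ENTRY_RE.match(line)
        if entry:
            date, number, description = entry.groups()
            entries.append({"date": date, "entry_no": int(number), "text": description})
    return case, entries


def to_rows(case, entries):
    """Flatten to one row per docket entry, repeating the case-level fields."""
    return [{**case, **entry} for entry in entries]


case, entries = parse_docket(SAMPLE_DOCKET)
rows = to_rows(case, entries)

# Write the rows as CSV text (in memory here; a file on disk works the same way).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
```

One row per docket entry keeps every motion searchable while still carrying the judge, court and parties on each line.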

At first he only came up with a few hundred cases. Then he tried searching the database a few different ways. Then he doubled his timeframe.

After discarding about 500 bad hits, he came up with about 1,700 cases where a company tried to dissolve a class-action lawsuit by invoking arbitration. Gebeloff pored over them all.

This data set wasn’t perfect either, he said, but it was more reliable than the first. He was looking for how often a company successfully used a “motion to compel arbitration,” but it wasn’t always a clear yes-or-no answer, he said. Sometimes an arbitration clause only covered a few people in the class-action. Sometimes a case was settled before a judge ruled on it.
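One way to account for that messiness is to give the in-between outcomes their own labels instead of forcing a yes or no. The Python sketch below illustrates the idea with simple keyword rules over docket entry text. The rules and labels are assumptions made for this example, not the Times' method, and anything the rules cannot settle is marked "unclear" for a person to review.

```python
def classify_motion(entry_texts):
    """Classify how a motion to compel arbitration ended, from docket entries in order."""
    motion_filed = False
    for text in entry_texts:
        lowered = text.lower()
        if "motion to compel arbitration" in lowered:
            motion_filed = True
            if lowered.startswith("order"):
                # Check the mixed ruling first, since it also contains "granting".
                if "in part" in lowered:
                    return "granted_in_part"
                if "granting" in lowered or "granted" in lowered:
                    return "granted"
                if "denying" in lowered or "denied" in lowered:
                    return "denied"
        elif motion_filed and ("settlement" in lowered or "settled" in lowered):
            return "settled_before_ruling"
    return "unclear" if motion_filed else "no_motion"


# Made-up docket entries for a single case.
example = [
    "COMPLAINT filed by John Sample",
    "MOTION to Compel Arbitration by Example Corp.",
    "ORDER granting in part and denying in part Motion to Compel Arbitration",
]
outcome = classify_motion(example)
```

The "unclear" bucket is the useful part: it measures how much of the dataset still needs a human reader.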

“Reality’s often messy, and so we had to account for the messiness of reality,” he said. “But ultimately, the benefit of going through this was now we had a database that nobody else has ever had.”

Building that database drew on skills he’s honed for decades, but it also required reaching out to others who’ve attempted to wrangle similar data.

“The most important thing in doing this type of work is to know how to learn... and to know what’s possible,” he said. “Six months ago, I wouldn’t have known everything I know about getting things out of Westlaw or turning it into paper records or turning the paper record into a database, but I knew it was possible to do, and then every step of the way, I had enough background to learn what I needed to know how to do it.”

Adam Aton is a graduate student at the Missouri School of Journalism and a student employee at IRE.
