We knew early in our investigation of Long Island police misconduct that police officers had committed dozens of disturbing offenses, ranging from cops who shot unarmed people to those who lied to frame the innocent. We also knew that New York state has some of the weakest oversight in the country.

What we didn’t know was whether anyone had ever tried to change that. We suspected that the legislature, which reaps millions in contributions from law enforcement unions, hadn’t passed a measure to rein in cops in years. But we needed to know for sure, and missing even one bill could change the story drastically.

At Newsday we like to take a data-driven approach to such questions. But it can be particularly difficult to quantify inaction, and such problems are notoriously time-consuming. For another part of our story that involved evaluating responses from county officials, I wound up reading more than 7,000 pages of committee minutes — just to prove that the topic had never come up.

In this case, I decided I needed a database showing every bill proposed in the state legislature that would’ve added oversight of cops. But no such database existed, and building one without leaving room for error meant reviewing each police-related bill by hand, a potentially monstrous task.

Luckily, I’d been playing with Overview, a Knight Foundation-funded Associated Press project that highlights patterns within piles of documents. Overview simplified my task greatly — letting me do days’ worth of work in a few hours.

But first I needed data on bills. I got it from the New York State Senate’s new, very cool OpenLegislation API. Using the API, anyone can download bulk data from the Senate’s database of bills by visiting a simple URL with either a web browser or (the Nerd’s Way) a command line tool like cURL or wget. I conferred with the OpenSenate staff to confirm that the API has reliable data on bills filed in both branches of the legislature — the Assembly as well as the Senate — and also spot-checked the data against what was available on other official websites.

Then I tried different options in the API’s search instructions until I had data on roughly 1,700 bills proposed since 2009 that mentioned “police.” The final URL looked like this: http://open.nysenate.gov/legislation/2.0/search.json?term=summary:police&pageSize=200&pageIdx=1

By reading the API instructions and adjusting the words that come after “term=”, it’s possible to search for bills based on almost anything. I also had to cycle through nine pages of results (“pageIdx”) to download the data in 200-bill chunks (“pageSize”).
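For readers who would rather script the paging than edit the URL by hand, here is a minimal Python sketch. It is not the code used for this story. It only builds the nine URLs, assuming pages are numbered 1 through 9 as the example URL suggests; the function name `page_urls` is made up for illustration, and the endpoint is the one quoted above, which may have changed since.

```python
from urllib.parse import urlencode

# Endpoint as quoted in the article; it may no longer be live.
BASE_URL = "http://open.nysenate.gov/legislation/2.0/search.json"


def page_urls(term, page_size, page_count):
    """Return one search URL per results page, numbered from 1 (assumed)."""
    urls = []
    for page_idx in range(1, page_count + 1):
        # safe=":" keeps the colon in "summary:police" unescaped,
        # matching the URL shown in the article.
        query = urlencode(
            {"term": term, "pageSize": page_size, "pageIdx": page_idx},
            safe=":",
        )
        urls.append(f"{BASE_URL}?{query}")
    return urls


urls = page_urls("summary:police", 200, 9)
for url in urls:
    print(url)
```

Each printed URL could then be fetched with cURL, wget or a Python HTTP library and saved as its own JSON file.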

The OpenSenate data describes bills in incredible detail. It includes each bill’s full text, title, summary, ID number, year proposed, sponsors, previous versions of the bill, the section of law that bill would change and procedural information on what had happened to the bill so far — which committees it had been in, whether it had come to a floor vote, etc.

The API provides data in a JSON format. Web programmers use JSON often because it provides a simple way to pass around some types of complicated data that wouldn’t fit well in a spreadsheet. But JSON files don’t play nicely with Microsoft Excel or Access, and I ultimately wanted to upload the data into Overview, which meant getting it into a spreadsheet-style CSV file.

I uploaded the JSON data into MongoDB, an open-source document database manager that reads JSON directly, and explored the data there. Once I had a good grasp of it, I exported the handful of fields that I needed — the bill’s ID, its URL on the OpenSenate site, its text, title, summary, year and sponsor’s name — as a CSV. Then I uploaded the data into Overview.

(There are probably easier ways to convert a JSON file to a CSV, and using Overview is usually not this difficult. If you’re looking to analyze regular paper documents, Overview can import PDF files directly, or, perhaps the easiest option, it can suck in entire projects straight from IRE’s DocumentCloud service.)
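One of those easier routes might be a few lines of Python using only the standard library. The sketch below is an illustration, not the MongoDB process described above. The field names in `FIELDS` are hypothetical stand-ins for whatever keys the API actually returns, and the sample record is invented.

```python
import csv
import io
import json

# Hypothetical field names; the real API's keys may differ.
FIELDS = ["id", "url", "title", "summary", "year", "sponsor", "text"]


def bills_to_csv(bills, fields=FIELDS):
    """Return CSV text for a list of bill dicts, keeping only `fields`."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for bill in bills:
        # Missing fields become empty cells; extra fields are dropped.
        writer.writerow({name: bill.get(name, "") for name in fields})
    return buffer.getvalue()


# Invented sample record, for illustration only.
sample_json = (
    '[{"id": "EXAMPLE-1", "title": "Example bill", '
    '"year": 2011, "committee": "Codes"}]'
)
csv_text = bills_to_csv(json.loads(sample_json))
print(csv_text)
```

The `csv` module handles quoting, so bill text containing commas or line breaks still lands in a single cell.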

Almost instantly, Overview scanned the full text of all 1,700 bills and created a visualization that split the bills into dozens of groups based on the most distinctive words that appeared in each bill. This gave me an easy way to skim through the bills in each group by title.
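To give a feel for what "distinctive words" can mean, here is a small Python sketch of TF-IDF, a common way to score words that are frequent in one document but rare across the rest. This is an assumption for illustration; it is not Overview's code, and the article does not describe Overview's internals. The three sample texts are invented.

```python
import math
import re
from collections import Counter


def distinctive_words(docs, top_n=3):
    """Return the top_n highest-scoring TF-IDF words for each document."""
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in docs]

    # Count how many documents each word appears in.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores = {
            word: (count / len(tokens)) * math.log(len(docs) / doc_freq[word])
            for word, count in counts.items()
        }
        # Highest score first; ties broken alphabetically.
        ranked = sorted(scores, key=lambda word: (-scores[word], word))
        results.append(ranked[:top_n])
    return results


# Invented sample texts, for illustration only.
docs = [
    "police pension benefits for retired police officers",
    "police pension fund contribution rates",
    "independent review of police shootings",
]
top_words = distinctive_words(docs)
print(top_words)
```

A word that appears in every document, like "police" here, scores zero, which is why a search term shared by all 1,700 bills would not dominate the grouping under this kind of scoring.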

Whenever I saw an interesting bill, Overview let me click on its title to see the full text and a link to the bill’s page on OpenSenate. When appropriate, I could then quickly tag that bill as “adds oversight,” “removes oversight” or “needs review.” Far more often, the entire group wasn’t what I was looking for. For example, many bills related to police pensions, which might be interesting, but not for this story. Overview especially shone in those cases, letting me review dozens of similar bills at once and quickly label them all “not of interest” with a single button.

As I tagged bills, Overview displayed my progress visually, using different colors to show me what percentage of each group I’d tagged.

Within a few hours, I’d reviewed all 1,700 bills — giving me the kind of thorough hand-check that I assumed would take days.

When I was done, I exported the tags from Overview as a CSV and opened them in a Microsoft Excel spreadsheet. I then used Excel’s VLOOKUP function to pull in some extra details about the bills from the original state dataset that hadn’t been included in Overview’s export.
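The same lookup can be done outside Excel. The Python sketch below mimics an exact-match VLOOKUP by indexing the detail rows on a shared key and copying the first match onto each tagged row. The column names (`id`, `tag`, `committee`) and the sample rows are hypothetical and only stand in for the real exports.

```python
def lookup_join(tagged_rows, detail_rows, key="id"):
    """Attach detail columns to each tagged row by matching on `key`."""
    details_by_key = {}
    for row in detail_rows:
        # Like an exact-match VLOOKUP, keep the first matching row.
        details_by_key.setdefault(row[key], row)

    joined = []
    for row in tagged_rows:
        merged = dict(details_by_key.get(row[key], {}))
        merged.update(row)  # tagged values win if column names collide
        joined.append(merged)
    return joined


# Invented sample rows, for illustration only.
tagged = [
    {"id": "EXAMPLE-1", "tag": "adds oversight"},
    {"id": "EXAMPLE-2", "tag": "not of interest"},
]
details = [
    {"id": "EXAMPLE-1", "committee": "Codes"},
    {"id": "EXAMPLE-1", "committee": "Rules"},
]
joined = lookup_join(tagged, details)
print(joined)
```

Rows with no match in the detail data pass through unchanged, so they are easy to spot and check by hand.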

I found roughly 80 bills that had attempted to increase oversight, which represented more than 50 unique attempts to stiffen the law. All failed, including some that seemed designed to address the precise issues we’d uncovered.

For example, one bill we mentioned in the story would’ve called in state troopers to investigate all police shootings. But it went nowhere, leaving local Suffolk County police to investigate the 2011 shooting of an unarmed cab driver by a Nassau County officer. That night the Suffolk police arrested the cab driver as the aggressor. But subsequent investigations in both counties, which stayed secret until we revealed them, found that the shooting had been unwarranted and that the officer had been drinking all night — the type of thing that state investigators might’ve noticed immediately.

I had our examples of inaction. But I wasn’t done. As we settled on the final wording we wanted to use in our story, I checked and rechecked our analysis. I ran keyword searches for bills relating to misconduct, both in our OpenSenate data and in the New York State Legislative Retrieval System (“LRS” to state capitol reporters). I also downloaded from LRS a list of every bill that had passed both the Senate and Assembly in that period and hand-checked those again, confirming that none were related to police oversight.

I didn’t take these steps because I distrusted Overview or the process outlined above. I always double-check every analysis from scratch, with a very different technique whenever possible, after we’ve settled on the language we want to use in print.

Finally, a week before we published our story, three reporters on our investigative team surveyed the entire Long Island legislative delegation and the leaders of both the Senate and the Assembly. We presented our findings, including the failure of oversight by lawmakers. No one disputed it.

The story, written by me and Sandra Peddie, had the following in the sixth paragraph: “Despite a series of high-profile misconduct cases in recent years, Long Island’s police face some of the weakest oversight in the country. New York is one of only six states that does not license police, and state lawmakers have ignored opportunities to pass tougher oversight laws.”

With many data stories, language like “at least” is enough to prove your point. But to show a lack of oversight, we needed to be definitive. This method let us say “never.” It was worth the work.


Adam Playford is a reporter and programmer on Newsday’s Investigations Team. He can be reached at adam.playford@newsday.com or @adamplayford.