Data science, meet campaign finance - Investigative Reporters & Editors

If you ever get the urge to feel a chill run down your spine, particularly if you're interested in political journalism, give Sasha Issenberg's new book The Victory Lab a good, close read.

Here's the headline: When it comes to using data to understand politics, journalists are playing checkers while political consultants are playing chess. Just listen to the debate that has surfaced in recent weeks around The New York Times' polling specialist, Nate Silver. The venerable Fourth Estate, whose job it is to hold the political system accountable, often lacks the skills to understand, let alone apply, many of the data-driven techniques that nowadays drive political campaigns.

Hence the motivation for the Prospect challenge we launched on Kaggle last month. In collaboration with Investigative Reporters and Editors, Inc., our data journalism team at the Center for Investigative Reporting launched the contest with a simple premise: How would world-class data scientists approach a common political dataset – campaign finance records – differently than journalists who have been working with it for years? And what could journalists learn as a result?

The submissions were fascinating and enlightening. Journalists are used to looking at campaign finance data with a particular perspective: Seeing which candidate is raising the most, from whom, and how that money is later being spent. But Kagglers came back with more than a dozen novel applications of the data that could help reporters spot anomalies, find hidden influence and add rich metadata that could open up new reporting possibilities.

Here's a rundown of the highlights:

Measuring unusual donations from political committees

The winner of the contest, chosen by our panel of judges, presented a simple methodology for detecting when political committees make unusual donations to candidates or causes. Journalists who cover campaigns often flag these kinds of contributions based on their experience, but this methodology would allow a broader, automated look at strange donations that might otherwise fall through the cracks. It's also elegant in its simplicity. Many data journalists could implement it today.
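The winning entry's exact method is not reproduced here. As one illustration of how an automated check of this kind might work, the sketch below flags a donation when it sits far from the same committee's typical gift, using a modified z-score built on the median and the median absolute deviation. The record layout, the function name flag_unusual, the 3.5 cutoff and the sample figures are all assumptions made for the example, not details from the contest.

```python
from statistics import median

def flag_unusual(donations, threshold=3.5, min_history=5):
    """Flag donations far from the giving committee's typical amount.

    donations: list of (committee, recipient, amount) tuples (assumed layout).
    Uses a modified z-score: 0.6745 * (amount - median) / MAD.
    """
    by_committee = {}
    for committee, _recipient, amount in donations:
        by_committee.setdefault(committee, []).append(amount)

    flagged = []
    for committee, recipient, amount in donations:
        amounts = by_committee[committee]
        if len(amounts) < min_history:
            continue  # too little history to say what is "usual"
        med = median(amounts)
        mad = median(abs(a - med) for a in amounts)
        if mad == 0:
            continue  # every gift is nearly identical; the score is undefined
        score = 0.6745 * (amount - med) / mad
        if abs(score) > threshold:
            flagged.append((committee, recipient, amount, round(score, 1)))
    return flagged

# Invented toy records, for illustration only.
sample = [
    ("PAC A", "Candidate U", 1000),
    ("PAC A", "Candidate V", 1200),
    ("PAC A", "Candidate W", 900),
    ("PAC A", "Candidate X", 1100),
    ("PAC A", "Candidate Y", 1000),
    ("PAC A", "Candidate Z", 50000),
    ("PAC B", "Candidate U", 250),   # PAC B has too little history to score
]

flagged = flag_unusual(sample)
for row in flagged:
    print(row)
```

A median-based score is used here because a single very large gift would inflate an ordinary mean and standard deviation and could hide itself. A flag is a prompt for reporting, not a finding.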

Donation concentration as a measure of influence

It stands to reason that if a political committee accepts the majority of its money from a single donor, it will be (or at least seem) more beholden to that donor's influence. That's the simple assumption that underlies almost the entire modern regime of campaign finance regulations. However, many of the rules designed to limit influence on candidates do not apply to political action committees and so-called Super PACs, which drive an increasing majority of political spending today.

This entry uses techniques from social network analysis to (among other things) reveal committees that are funded by only a small number of donors. As the proposal's author notes, modified applications of this approach could be used to systematically reveal astroturf organizations that attempt to obscure interest group influence on particular issues or campaigns.
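The entry itself relies on social network analysis, which is not reproduced here. A much simpler stand-in for the same question is sketched below: for each committee, compute the share of money coming from its largest donor and a Herfindahl-Hirschman index (the sum of squared donor shares, which approaches 1 as funding concentrates in one donor). The function name, the tuple layout and the sample amounts are assumptions made for the example.

```python
from collections import defaultdict

def concentration(contributions):
    """Summarize how concentrated each committee's funding is.

    contributions: iterable of (committee, donor, amount) tuples (assumed layout).
    Returns {committee: {"donors", "top_share", "hhi"}}.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for committee, donor, amount in contributions:
        totals[committee][donor] += amount

    result = {}
    for committee, donors in totals.items():
        total = sum(donors.values())
        if total <= 0:
            continue  # skip committees with no net receipts
        shares = [amt / total for amt in donors.values()]
        result[committee] = {
            "donors": len(donors),
            "top_share": max(shares),            # largest donor's share
            "hhi": sum(s * s for s in shares),   # 1.0 means a single donor
        }
    return result

# Invented toy records, for illustration only.
sample = [
    ("Committee X", "Donor A", 900.0),
    ("Committee X", "Donor B", 100.0),
    ("Committee Y", "Donor C", 250.0),
    ("Committee Y", "Donor D", 250.0),
    ("Committee Y", "Donor E", 250.0),
    ("Committee Y", "Donor F", 250.0),
]

result = concentration(sample)
for name, stats in sorted(result.items(), key=lambda kv: -kv[1]["hhi"]):
    print(name, stats)
```

Sorting committees by either number gives a quick list of groups that lean on one or two funders. In real filings, donor names would need to be standardized first, or the same donor will be counted as several.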

Uncovering legal (and illegal) coordination between PACs and campaigns

One of the few restrictions placed on Super PACs and other independent groups is that they cannot legally coordinate strategy with the candidates they support. This proposal suggests a model through which reporters could use correlation and regression techniques to find situations where that rule is being broken, as well as to identify more general situations in which committees coordinate their spending.

Even beyond its potential to identify illegal coordination, which is often a grey area in campaign politics, the measure could be used to show candidates or issue committees that otherwise act in concert: when multiple committees coordinate their strategy around a single ballot measure, for instance, or when interest groups coordinate with state political parties to advance an agenda. An approach like this could help discover new political coalitions before they are widely publicized.
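To make the correlation idea concrete, here is a minimal sketch that computes the Pearson correlation between committees' weekly spending totals and lists the pairs that move together. It is not the proposal's model. The weekly bucketing, the 0.8 cutoff, the function names and the sample figures are all assumptions made for the example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no co-movement signal
    return cov / sqrt(var_x * var_y)

def correlated_pairs(weekly_spending, min_r=0.8):
    """Return (committee_a, committee_b, r) for pairs with r >= min_r."""
    names = sorted(weekly_spending)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(weekly_spending[a], weekly_spending[b])
            if r >= min_r:
                pairs.append((a, b, round(r, 2)))
    return pairs

# Invented weekly totals (in thousands of dollars), for illustration only.
weekly_spending = {
    "Campaign A": [10, 20, 15, 40, 80],
    "Super PAC B": [12, 22, 14, 45, 75],
    "Committee C": [50, 40, 45, 20, 10],
}

pairs = correlated_pairs(weekly_spending)
print(pairs)
```

High correlation alone does not show coordination. Most committees spend more as Election Day approaches, so a real analysis would need to account for that shared calendar, for example by comparing spending within the same media market or by removing the common trend first.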

Annotating and analyzing campaign data using Wikipedia

Wikipedia can be a valuable research tool for journalists, but this proposal suggests a way to use its structured data to enrich and analyze records on powerful campaign contributors. Behind the scenes, Wikipedia maintains a rich network graph of people, places, organizations and topics. If we were able to link, say, prolific Texas donor Bob Perry with his Wikipedia page, we would know he was associated with Baylor University, has a wife named Doylene, and is a member of the Council for National Policy, which in turn connects him to a number of other conservative donors and power brokers.

Accurately linking powerful donors to their Wikipedia pages presents a significant data-cleaning challenge, and there can be problems with the accuracy of some Wikipedia data, but the potential payoff in terms of new data and connections to analyze could be huge.
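The linking step is where most of the cleaning effort would go. The sketch below shows one rough way to approach it, assuming a list of candidate page titles has already been gathered by some other means: normalize the donor name as it appears in a filing, then pick the closest title by string similarity using Python's difflib. The titles, the 0.85 cutoff and the function names are made up for the example, and no Wikipedia API is called.

```python
import re
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, turn 'LAST, FIRST' into 'first last', drop punctuation."""
    if "," in name:
        last, first = name.split(",", 1)
        name = f"{first} {last}"
    name = re.sub(r"[^a-z\s]", " ", name.lower())
    return " ".join(name.split())

def best_match(donor_name, page_titles, min_ratio=0.85):
    """Return (title, ratio) for the closest title, or (None, ratio)."""
    target = normalize(donor_name)
    best_title, best_ratio = None, 0.0
    for title in page_titles:
        # Ignore a trailing disambiguation suffix such as " (homebuilder)".
        base = re.sub(r"\s*\(.*\)$", "", title)
        ratio = SequenceMatcher(None, target, normalize(base)).ratio()
        if ratio > best_ratio:
            best_title, best_ratio = title, ratio
    if best_ratio >= min_ratio:
        return best_title, best_ratio
    return None, best_ratio

# Hypothetical page titles, for illustration only.
titles = ["Bob Perry (homebuilder)", "Rick Perry", "Baylor University"]

match = best_match("PERRY, BOB", titles)
no_match = best_match("SMITH, JANE", titles)
print(match)
print(no_match)
```

Name similarity alone will confuse different people who share a name, so any match should be confirmed against a second field, such as city, state or employer, before it is used in a story.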

A natural language processing approach to donor analysis

Several proposals raised the novel idea of applying natural language processing techniques to donor occupation/employer fields and committee names in order to find interesting trends. This is a step further than reporters typically go, especially with occupation and employer data, which are often disregarded because they can be incomplete and difficult to standardize.

One proposal in particular suggests a method for coupling NLP with decision trees to figure out the types of occupations and employers that are associated with supporting a candidate – then applying similar analysis to the words that characterize the bills that candidate supports once elected. Both sides of that approach could uncover trends showing how industry interests align with votes from the national to the local level.
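As a rough illustration of the text-analysis half of that idea, the sketch below splits free-text occupation fields into words and ranks the words that are most distinctive of one candidate's donors, using a smoothed log-odds score. It deliberately leaves out the decision-tree step the proposal describes. The record layout, the function names and the sample records are assumptions made for the example.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Split a free-text field into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def distinctive_terms(records, candidate, top_n=3):
    """Rank words over-represented among one candidate's donors.

    records: iterable of (candidate, occupation_text) pairs (assumed layout).
    Score: log of the add-one-smoothed word rate among this candidate's
    donors divided by the rate among everyone else's.
    """
    inside, outside = Counter(), Counter()
    for cand, occupation in records:
        (inside if cand == candidate else outside).update(tokens(occupation))

    vocab = set(inside) | set(outside)
    n_in, n_out, v = sum(inside.values()), sum(outside.values()), len(vocab)
    scores = {}
    for word in vocab:
        p_in = (inside[word] + 1) / (n_in + v)
        p_out = (outside[word] + 1) / (n_out + v)
        scores[word] = math.log(p_in / p_out)
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]

# Invented toy records, for illustration only.
records = [
    ("Candidate A", "Oil & Gas Executive"),
    ("Candidate A", "Petroleum Engineer"),
    ("Candidate A", "oil field services"),
    ("Candidate B", "Teacher"),
    ("Candidate B", "Public school teacher"),
    ("Candidate B", "Retired teacher"),
]

top_a = distinctive_terms(records, "Candidate A", top_n=1)
top_b = distinctive_terms(records, "Candidate B", top_n=1)
print(top_a, top_b)
```

Working at the word level sidesteps some of the standardization problem, since "Oil & Gas Executive" and "oil field services" share a token even though the full strings never match.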

New visualization techniques

Data visualization has been a bright spot of newsroom innovation over the last few years, particularly as it relates to politics and campaigns. Still, a couple of competitors offered up ideas that journalists haven't yet tried. One is a set of word clouds based on committee names and their donation recipients. The other is an exploratory tool that uses streamgraphs to help journalists and the public dig into donation trends.

How much does campaign cash matter?

Nate Silver aside, most journalists aren't in the business of predicting elections. But one question many of us have been curious about for years is a pretty simple one: How much does money actually matter in winning campaigns?

The assumption, of course, is: a lot. But a couple of competitors offered up the idea of using fund-raising data as features in a model that predicts election winners. Even if money doesn't predict winners on its own, looking at it in context with other features could still show useful information about its effects.
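For readers who want to see what "fund-raising data as features" could look like in practice, here is a minimal sketch: a small logistic regression, fitted by plain gradient descent, that estimates the chance of winning from a candidate's share of money raised and an incumbency flag. The choice of model, the two features, the learning rate and every number in the sample are assumptions made for the example. None of it comes from the contest entries or from real races.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit logistic regression by batch gradient descent.

    rows: list of feature tuples; labels: list of 0/1 outcomes.
    Returns weights with the bias term stored last.
    """
    n_features = len(rows[0])
    weights = [0.0] * (n_features + 1)
    for _ in range(epochs):
        grads = [0.0] * (n_features + 1)
        for x, y in zip(rows, labels):
            p = predict(weights, x)
            err = p - y
            for j, xi in enumerate(x):
                grads[j] += err * xi
            grads[-1] += err
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grads)]
    return weights

def predict(weights, x):
    """Estimated probability of winning for feature tuple x."""
    z = sum(w * xi for w, xi in zip(weights, x)) + weights[-1]
    return sigmoid(z)

# Invented toy races: (share of money raised, incumbent flag) -> won?
rows = [
    (0.80, 1), (0.70, 0), (0.60, 1), (0.45, 1),   # winners
    (0.75, 0), (0.55, 0), (0.40, 0), (0.30, 0), (0.20, 1),  # losers
]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0]

weights = fit_logistic(rows, labels)
print("challenger with 90% of the money:", round(predict(weights, (0.9, 0)), 2))
print("challenger with 10% of the money:", round(predict(weights, (0.1, 0)), 2))
```

Including incumbency alongside money is the point of the exercise: the fitted weight on fund-raising share describes its association with winning after the other feature is accounted for. It does not show that money causes wins, since strong candidates tend to attract both donors and votes.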

Thanks to all who participated!

Finally, a word of thanks to our competitors. Our goal in running this contest was to showcase new and sophisticated approaches to analyzing campaign finance data that journalists wouldn't think to apply. It was about opening up our imaginations and showing what was possible. And thanks to your work, this project was a fantastic success.

Soon we'll be reaching out to some competitors individually to ask their advice about implementing some of these approaches, as well as inviting them to participate in the industry's major data journalism conference, known as NICAR, next spring.

If anyone is interested in doing further work in this area, feel free to drop me a note at cdavis@cironline.org. CIR and IRE are leaders in this area, but we also have many friends across the country and around the world who would deeply appreciate your expertise.

And last but not least, another thanks to the team at Kaggle, who worked hard to support this contest and make it a success.