Student-built database shows lawmakers skirting travel rules
By Kathy Best, Howard Center for Investigative Journalism
This is a story of AI-assisted redemption.
In 1999, two enterprising Congressional Quarterly reporters wanted to show the link between congressional trips and the private interests that financed them. They used a portable scanner to make copies of travel reports and spent weeks inputting information by hand into a database. The end result was a story, featured in the February 2000 IRE Journal, that looked at only a sliver of congressional travel.
“After doing a lot of data entry work for that project, I basically swore that if I had another crack at it, I’d do the whole House and do it better, with better tools,” said Derek Willis, now a data lecturer at the Philip Merrill College of Journalism at the University of Maryland. “So I definitely was looking for a way to use AI to help parse records, and I definitely wanted to avenge younger me, who spent a lot of time typing in information from those PDFs.”
He got that chance when he stepped in to fill a leave for data editor Sean Mussenden at the Howard Center for Investigative Journalism. Working with data journalism students at UMD and Boston University, Willis and his faculty colleagues led more than 50 students in an analysis of more than 17,000 privately sponsored trips by members of the House of Representatives and their staff from 2012 through 2023.
To do the analysis, the Howard Center obtained travel disclosure filings and metadata from the House dating back to 2017. Because the House Clerk, the official source of travel filings, by law only maintains six years’ worth of records, the Howard Center also used records collected by ProPublica and sites like archive.org.
The House disclosure filings provided basic information about each trip: the traveler, sponsor, destinations and dates. However, that information wasn’t standardized and excluded crucial details, such as how much a trip cost, whether a traveler brought along a family member and the full itinerary. Those details had to be gleaned from PDF filings.
Using Amazon Web Services’ Textract service, the Howard Center team performed optical character recognition (OCR) on nearly 50,000 pages of documents to extract accurate text representations. That, along with the filings, allowed students to construct and clean a standardized database of privately sponsored trips. It also enabled them to search the full text of the filings. The Howard Center also purchased travel records from LegiStorm, a private company that provides legislative data, to fill in gaps and as a reference point.
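For readers curious what the OCR step yields, here is a minimal Python sketch of turning a Textract-style response into searchable page text. It is not the Howard Center's code. It assumes only the documented shape of a Textract text-detection response (a "Blocks" list whose LINE entries carry "Text" and "Page"), and the function names and sample text are invented for illustration.

```python
from collections import defaultdict


def page_text_from_blocks(response):
    """Group LINE blocks by page number and join each page into plain text."""
    pages = defaultdict(list)
    for block in response.get("Blocks", []):
        if block.get("BlockType") == "LINE":
            # Single-page responses may omit "Page"; treat those as page 1.
            pages[block.get("Page", 1)].append(block.get("Text", ""))
    return {page: "\n".join(lines) for page, lines in sorted(pages.items())}


def find_pages(page_text, phrase):
    """Return the page numbers whose text contains the phrase, ignoring case."""
    needle = phrase.lower()
    return [page for page, text in page_text.items() if needle in text.lower()]


# Invented sample in the Textract response shape (not real filing data).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Page": 1},
        {"BlockType": "LINE", "Page": 1, "Text": "Traveler: Example Member"},
        {"BlockType": "LINE", "Page": 1, "Text": "Total transportation expenses: $1,234.00"},
        {"BlockType": "WORD", "Page": 1, "Text": "Traveler:"},
        {"BlockType": "LINE", "Page": 2, "Text": "Accompanying family member: Spouse"},
    ]
}

pages = page_text_from_blocks(sample_response)
print(find_pages(pages, "family member"))
```

Keeping text grouped by page lets a reporter jump from a search hit back to the matching page of the original PDF.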
Twenty-one student journalists at UMD spent weeks standardizing records and accounting for amended filings. Their analysis of the data found that nonprofit organizations with deep ties to lobbyists had emerged as leading sponsors of travel, including AIPAC, the Consumer Technology Association and the sugar industry.
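The article does not describe how the students reconciled amended filings, so the following Python sketch shows just one plausible approach: normalize sponsor names, then keep only the most recently filed record for each trip. The field names (traveler, depart, sponsor, filed_on, cost) and every record in it are hypothetical.

```python
import re
from datetime import date


def normalize_sponsor(name):
    """Strip punctuation, collapse whitespace and uppercase a sponsor name."""
    cleaned = re.sub(r"[^\w\s]", "", name)
    return re.sub(r"\s+", " ", cleaned).strip().upper()


def latest_filings(filings):
    """Keep one record per trip: the one with the most recent filing date.

    A trip is identified here by traveler, departure date and normalized
    sponsor name. That key is an assumption, not the project's actual rule.
    """
    latest = {}
    for filing in filings:
        key = (filing["traveler"], filing["depart"], normalize_sponsor(filing["sponsor"]))
        if key not in latest or filing["filed_on"] > latest[key]["filed_on"]:
            latest[key] = filing
    return list(latest.values())


# Invented records: the second entry amends the first with a corrected cost.
filings = [
    {"traveler": "Example Member", "depart": date(2019, 3, 1),
     "sponsor": "The  Example Institute, Inc.", "filed_on": date(2019, 3, 20), "cost": 1000},
    {"traveler": "Example Member", "depart": date(2019, 3, 1),
     "sponsor": "the example institute inc", "filed_on": date(2019, 4, 2), "cost": 1250},
    {"traveler": "Example Staffer", "depart": date(2019, 5, 10),
     "sponsor": "Sample Association", "filed_on": date(2019, 5, 30), "cost": 800},
]

deduped = latest_filings(filings)
print(len(deduped), sum(f["cost"] for f in deduped))
```

Without a step like this, an amended filing would be counted as a second trip and inflate both trip counts and cost totals.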
One nonprofit in particular, The Congressional Institute, was responsible for a quarter of all trips. The reporters found that at least 75% of the institute’s board members were registered lobbyists. Moreover, the institute was bankrolling the trips with nearly $3 million in annual membership dues from private interest groups such as the Business Roundtable and American Hospital Association. But because the nonprofit institute was not registered to lobby itself, it could pay for multi-day trips to luxury hotels and resorts along the mid-Atlantic coast, where its guests could rub elbows with private-sector institute members who pay as much as $27,500 annually for access to the invite-only retreats.
Nearly 30 Boston University data and investigative journalism students and their professors focused on the 24 most “frequent flyers” in the trip disclosures — those House members or their staff who traveled on the private dime most often. To identify trips where travelers brought along guests, the reporters manually entered data from the disclosure filings.
They found that, altogether, private sponsors paid $4.3 million for the House members who traveled most frequently, including $1.3 million on family members.
Almost half of the trips (nearly 44%) included a spouse, sister, daughter-in-law, child or grandchild, which is legal under House ethics rules. The cost for those relatives sometimes exceeded $10,000 for a single trip and should have been reported as income by the lawmakers, tax experts told the student journalists.
Beyond the published stories, 12 UMD data journalism students — led by undergraduate student Apurva Mahajan — made the House travel data directly accessible to the public through a custom web application to allow users to look up organizations and lawmakers of interest, discover their own connections and review primary source documents. Students initially populated the web application database with records from the House, and will update it with additional records from the Senate this year. The application is still in active development, with students working to add new AI-backed features to enable summarization and entity extraction.
In addition to Willis, the project was led by UMD Professor Deb Nelson, Mussenden and Howard Center Director Kathy Best. At BU, student journalists worked with professors Maggie Mulvihill and Shannon Dooling.