By Natalia Alamdari, IRE & NICAR
What’s the project?
Not all data stories have to be serious. Here’s an example: Long Islanders don’t rely on public transportation as much as their neighbors in the city, so Tim Healy and his team at Newsday decided to look at which cars Long Islanders drive, and why. In the end, Newsday was able to give readers a snapshot of which cars are most popular on the island, along with an interactive database where readers can look up the most popular cars in their ZIP code.
How’d they do it?
First, Healy needed vehicle registration data for the two New York counties Newsday covers: Nassau and Suffolk. The state Department of Motor Vehicles makes this available online in .csv format. But once he got the data, Healy realized the state included car makes, but not models.
Healy put out a request for help on the NICAR-L listserv, and a member pointed him to the National Highway Traffic Safety Administration’s vehicle listing API. The API could take the state-provided vehicle identification numbers (VINs) and return the missing model data Healy needed.
Manually running nearly 2 million VINs to find models would be impossible.
Healy and Will Welch, a data developer at Newsday, decided the most efficient strategy would be to send requests in batches. To run the VINs through the API in bulk, Welch wrote a Node.js script that saved the results to his desktop. The script took about four hours per county to return results.
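The batching idea can be sketched as follows. This is a minimal illustration in Python, not Welch’s Node.js script. It assumes NHTSA’s vPIC `DecodeVINValuesBatch` endpoint, which takes a semicolon-separated list of VINs in a POST form field; the article does not say which endpoint Newsday called. The batch size of 50 is an assumption to check against the vPIC documentation, and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the vPIC batch-decode endpoint. The article does not name the endpoint used.
VPIC_BATCH_URL = "https://vpic.nhtsa.dot.gov/api/vehicles/DecodeVINValuesBatch/"
# Assumption: per-request cap on VINs; confirm against the vPIC documentation.
BATCH_SIZE = 50


def chunked(items, size):
    """Yield consecutive slices of `items`, each holding at most `size` entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_payload(vins):
    """Encode one batch as form data: format=json plus VINs joined by semicolons."""
    fields = {"format": "json", "data": ";".join(vins)}
    return urllib.parse.urlencode(fields).encode("ascii")


def decode_batch(vins, timeout=60.0):
    """POST one batch and return the list under the response's "Results" key."""
    request = urllib.request.Request(
        VPIC_BATCH_URL, data=build_payload(vins), method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body.get("Results", [])


def decode_all(vins):
    """Decode every VIN, one batch at a time, and collect the rows in order."""
    rows = []
    for batch in chunked(vins, BATCH_SIZE):
        rows.extend(decode_batch(batch))
    return rows
```

Sending 50 VINs per request turns roughly 2 million lookups into about 40,000 requests, which is why a run measured in hours becomes feasible.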
Cleaning the data
Healy manually cleaned the data for each county, checking ZIP codes and standardizing variations in car model names. He also removed non-passenger vehicles such as school buses, fire trucks and hearses from the list. For a smaller data set, he suggests using OpenRefine, a free and open-source tool for working with messy data.
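Healy did this cleaning by hand, but the same kinds of rules could be captured in a short script. The sketch below is only an illustration: the field names (`zip`, `make`, `model`, `body_type`), the alias table and the non-passenger labels are all hypothetical, not taken from the state’s data.

```python
import re

# Hypothetical alias table: spelling variants mapped to one standard model name.
MODEL_ALIASES = {"F150": "F-150", "F 150": "F-150", "CRV": "CR-V"}
# Hypothetical labels for vehicles to drop; the real data may mark these differently.
NON_PASSENGER = {"SCHOOL BUS", "FIRE TRUCK", "HEARSE"}


def clean_zip(raw):
    """Return the first five digits of a ZIP code, or None if fewer than five exist."""
    digits = re.sub(r"\D", "", raw)
    return digits[:5] if len(digits) >= 5 else None


def clean_model(raw):
    """Uppercase, collapse whitespace, then apply the alias table."""
    name = " ".join(raw.upper().split())
    return MODEL_ALIASES.get(name, name)


def clean_records(records):
    """Drop non-passenger vehicles and rows with unusable ZIPs; standardize the rest."""
    cleaned = []
    for rec in records:
        if rec["body_type"].strip().upper() in NON_PASSENGER:
            continue
        zip_code = clean_zip(rec["zip"])
        if zip_code is None:
            continue
        cleaned.append({
            "zip": zip_code,
            "make": rec["make"].strip().upper(),
            "model": clean_model(rec["model"]),
        })
    return cleaned
```

Keeping ZIP codes as strings rather than numbers avoids silently dropping leading zeros.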
Healy needed an organized way to present 1.8 million records to online audiences. He ended up creating two sortable tables. One ranked cars across the island by popularity, covering 1,466 different makes and models.
The other was even more detailed, listing makes and models by ZIP code, for a total of 97,096 rows. The story also included interactive charts that broke down the top 10 car colors and the age of vehicles on the island.
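Both tables boil down to counting records by a key. A minimal sketch of that aggregation follows, assuming each cleaned record is a dictionary with hypothetical `zip`, `make` and `model` fields; the article does not describe how Newsday built its tables.

```python
from collections import Counter


def rank_models(records):
    """Count registrations per (make, model), most popular first."""
    counts = Counter((r["make"], r["model"]) for r in records)
    return counts.most_common()


def rank_models_by_zip(records):
    """Count registrations per (zip, make, model), grouped by ZIP, then by popularity."""
    counts = Counter((r["zip"], r["make"], r["model"]) for r in records)
    # Sort by ZIP, then by descending count, then by make and model to break ties.
    return sorted(counts.items(), key=lambda kv: (kv[0][0], -kv[1], kv[0][1:]))


records = [
    {"zip": "11801", "make": "HONDA", "model": "CIVIC"},
    {"zip": "11801", "make": "HONDA", "model": "CIVIC"},
    {"zip": "11801", "make": "TOYOTA", "model": "CAMRY"},
    {"zip": "11701", "make": "TOYOTA", "model": "CAMRY"},
    {"zip": "11701", "make": "TOYOTA", "model": "CAMRY"},
    {"zip": "11701", "make": "TOYOTA", "model": "CAMRY"},
]

island_table = rank_models(records)
zip_table = rank_models_by_zip(records)
```

The island-wide table has one row per make and model, while the ZIP table has one row per make and model within each ZIP code, which is why it grows so much larger.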
Tips from Tim Healy
- Know exactly what pieces of data you’re looking for and pay attention to the data you receive. It took some time for Healy to realize car model wasn’t included in the state’s data.
- When running scrapers, try adding code that records your results in a text file as you go. This way, you won’t have to run your code multiple times.
- Have a clear idea of what you want for an end product. Healy knew he wanted to keep the final presentation simple, but a more robust version could be possible, he said.
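The second tip, recording results in a text file, can be sketched as below. This is one possible approach, not the code Newsday used: each result is appended to a file as one JSON object per line, and a restarted run skips any VIN already on disk. The function names are hypothetical, and the sketch assumes each result row carries its VIN under a `VIN` key.

```python
import json
from pathlib import Path


def load_done(path):
    """Return the set of VINs already saved in the results file, if it exists."""
    if not path.exists():
        return set()
    with path.open(encoding="utf-8") as handle:
        # Assumption: every saved row is a JSON object with a "VIN" key.
        return {json.loads(line)["VIN"] for line in handle if line.strip()}


def append_results(path, rows):
    """Append each result row to the file as one JSON object per line."""
    with path.open("a", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")


def remaining(vins, done):
    """Return the VINs that still need to be looked up, in their original order."""
    return [vin for vin in vins if vin not in done]
```

A run would call `load_done` once at startup, pass the output of `remaining` to the lookup step, and call `append_results` after each batch, so a crash costs at most one batch of work.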