By Tim Sandoval
Reporters who've used Microsoft Excel for a story know that analysis doesn't begin until the data has been cleaned. Spreadsheets provided by governments or other entities are often not formatted the way reporters would like. Some agencies format the data in confusing ways, or do problematic things like spell words incorrectly in some cells. These problems make analysis difficult and can even lead to important data being left out.
In the session “CAR wash: Clean that dirty data,” Alex Richards, a reporter on the investigative team of The Chicago Tribune, taught attendees how to use simple tools to identify dirty data and clean it up.
Richards gave a quick overview of OpenRefine, an open-source tool that can help to clean data and transform it into various formats. OpenRefine (formally Google Refine) allows reporters spot misspelled or misformmated data.
Session attendees used OpenRefine to find examples of where some department names in a dataset had different spellings, letter case, or additional spaces and characters. Using OpenRefine, after a few simple clicks, attendees were able to ensure that the department names were spelled the same way or did not have problematic characters.
Richards also went over some other simple data-cleaning methods in Excel. For instance, Richards showed attendees how to capitalize all the letters in the cells of each column to make them uniform.
Richards said there are many ways one can go about cleaning data. Essentially, the method a reporter uses comes down to his or her comfort level with the software.