There is a saying about software engineering that could easily be applied to formatting data. The truism goes something like: “It’s like looking for black cats in a dark room that has no cats in it. And then someone yells, ‘I got one!’”
Well, Joe Kokenge of ProPublica is practicing animal control.
His presentation on integrity checks and simple data cleaning was peppered with useful bits of knowledge from his experience. “I wanted to put together a list of things that everyone can do, but there is no 10 bulletproof things to do to make sure your data is clean,” he said. “It’s really about applying common sense: what do we have, how much do we have of it and what’s the context of what we’re looking at.”
Kokenge applies what he calls a smell test to his data when sorting through it and looking for oddities, interlopers and truant information that might not fit his expectations.
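Kokenge did not share code in the remarks quoted here, so what follows is only a rough sketch of what a first-pass smell test might look like in Python. It answers the “what do we have, how much do we have of it” questions for a small table: how many rows there are, how many blanks each column holds, and which values show up most often. The function name `smell_test` and the sample data are hypothetical, made up for illustration.

```python
import csv
import io
from collections import Counter


def smell_test(rows):
    """Summarize a list of dict rows: row count, plus blanks and distinct values per column."""
    report = {"row_count": len(rows), "columns": {}}
    columns = list(rows[0].keys()) if rows else []
    for col in columns:
        # Treat missing cells as empty and trim stray whitespace before comparing.
        values = [(row.get(col) or "").strip() for row in rows]
        non_blank = [v for v in values if v]
        report["columns"][col] = {
            "blank": len(values) - len(non_blank),
            "distinct": len(set(non_blank)),
            "most_common": Counter(non_blank).most_common(3),
        }
    return report


# Hypothetical sample data with a trailing space, a blank state,
# an inconsistent state spelling and a missing amount.
SAMPLE = """name,state,amount
Acme,OH,100
Acme ,OH,250
Bolt,,75
Bolt,Ohio,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
report = smell_test(rows)

print("rows:", report["row_count"])
for col, stats in report["columns"].items():
    print(col, stats)
```

A summary like this will not say what is wrong, only where to look: a state column with both “OH” and “Ohio” in it, or a blank where a dollar amount should be, is the kind of oddity that deserves a second glance.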
Which is always a challenge.
“I mean, I’m freakishly paranoid and ridiculously organized,” he said, adding that the real trick is discovering the limits of your knowledge. “The things you don’t know that you don’t know … That’s the kind of stuff that will keep you up at night.”
Listening to Kokenge, it seems cleaning data is more of an art form than a process, however rote and utilitarian it may be. Extreme sensitivity to irregularities, combined with an almost Cartesian level of doubt, will serve the cleaner of dirty data well.
And really, the time spent being paranoid about your data at the outset will save you in the long run.
“It has to be done,” Kokenge said. “If you don’t do this, and a month, two months, a year later and you find you didn’t do this and something is off …” His thought trailed off, but we all knew what he was saying.