Tags : data cleaning

Integrity checks and simple data cleaning – the art of doubt

There is a saying about software engineering that could easily be applied to formatting data. The truism goes something like, “It’s like looking for black cats in a dark room that has no cats in it. And then someone yells, ‘I got one!’”

Well, Joe Kokenge of ProPublica is practicing animal control.

His presentation on integrity checks and simple data cleaning was peppered with useful bits of knowledge from his experience. “I wanted to put together a list of things that everyone can do, but there is no 10 bullet proof things to do to make sure your data ...

Read more ...

Centers for Medicare and Medicaid Services data reveals fraudulent offices

Our newspaper’s analysis of Centers for Medicare and Medicaid Services (CMS) data revealed that 131 providers in the Atlanta metropolitan area claimed a UPS Store mailbox as their medical office.

It turns out Atlanta medical providers were not conducting medical procedures in mailboxes. Most of these providers filled out the federal paperwork incorrectly. But dozens of others committed fraud by using the UPS Store mailboxes as purported real offices. With a sham provider number and a UPS Store address, they could also provide what looked like a real physician’s approval for unnecessary or non-existent medical services and equipment ...

Read more ...

SBA disaster loan data updated in NICAR Database Library

In the wake of a disaster, individuals and business owners are often left with severely damaged property. Many turn for help to the Small Business Administration, which approves low-interest loans to help rebuild. For declared disasters in 2011 alone, the Small Business Administration approved over $1 billion in loans.

NICAR has updated the SBA database of these loans, which is now current through Sept. 2012. 

WHAT'S IN IT?
Disaster loans through the SBA are one of the primary forms of federal assistance for individuals and non-farm, private-sector businesses who have suffered losses. The data have information on the borrower ...

Read more ...

HMDA data updated in the Database Library

The Home Mortgage Disclosure Act (HMDA) data have just been updated in the NICAR Database Library -- and we'll help you turn it into a story.


WHAT'S IN IT?

This Act requires all banks, savings and loans, savings banks and credit unions with assets of more than $33 million and offices in metropolitan areas to report mortgage applications. Each loan record contains demographic information about loan applicants, including race, gender and income; the purpose of the loan (e.g. home purchase or improvement); whether the buyer intends to live in the home; the type of loan (e.g. conventional ...

Read more ...

Behind The Story: Analyzing and mapping salary data for small-town mayors

In August, reporter Kate Martin of the Skagit Valley Herald analyzed salary data for mayors across Washington state and ended up with a story about mayors from small towns in her coverage area -- Mount Vernon and Anacortes -- who had salaries on par with mayors from cities several times larger. In reporting the story, Martin first had to gather the data and then reconcile it with the realities of small-town civic duties.

The idea for the story arose through her typical reporting practices: each year, she requests salary data for all of the agencies that the Skagit Valley Herald covers.

“I ...

Read more ...

OSHA Workplace Safety data updated at NICAR Data Library

The Workplace Safety database from the Occupational Safety and Health Administration (OSHA) has just been updated in the NICAR Database Library.

WHAT’S IN IT?

This ten-table database holds information on workplace inspections performed by both federal and state OSHA offices in all states and U.S. territories, from 1972 to Oct. 2011 – just under 4 million records.

OSHA classifies businesses by their location, name and North American Industry Classification System (NAICS), making it possible to analyze inspections, violations and accidents involving a certain occupation or those in a given region or city. The data also include details on the ...

Read more ...

From where? Validating data in the real world

By Anna Boiko-Weyrauch
@AnnaBoikoW

To understand your data, let’s go back to grade-school science class. Remember when you learned about the forest, and all the animals that call it home? The forest is a dynamic ecosystem. Your data is like a chimpanzee; it plays a role in the forest ecosystem. Over time, the changes in the environment will affect your data/chimp.

In the session, “OK, but where did that data come from? Data validation in the digital age,” Managing Director at the Institute for Analytic Journalism J.T. Johnson said journalists need to remember that their data had ...

Read more ...

Fighting for open records in Spain

By Hilary Niles
@nilesmedia

Spain is an “information black hole,” journalist Mar Cabra said during the session “Against All -Spanish- Odds.” She and software developer David Cabo are taking suggestions on how to fix that.

Among the European countries with a population of more than 1 million, Cabra said, Spain is the only one not to have freedom of information laws. On the technical side, David Cabo described what this looks like for people working with data (if they can get it):

  1. Administrations love PDF files and generally refuse to hand over raw data, text or Excel files
  2. There is little consistency ...

Read more ...

My Favorite Access SQL trick: Using the MID function to rearrange dates

I am constantly getting data from all kinds of public agencies that provide the most important field – the date of birth – in different formats. An example is city employee data. I request the name, DOB, salary and job title from San Antonio and other nearby cities to use in determining whether any of those workers have a criminal record.

Most of the time, the agencies provide the birth date like this in text format: mm/dd/yyyy. I prefer to work with the date like this in text format: yyyymmdd. When I want to find out how many employees were ...

Read more ...
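
The excerpt above is cut off before it shows the author's actual query, so the following is only a sketch of the general idea. In Access, Mid(text, start, length) counts positions from 1, so for a text field in mm/dd/yyyy form an expression along the lines of Mid([DOB],7,4) & Mid([DOB],1,2) & Mid([DOB],4,2) would yield yyyymmdd (the field name DOB is assumed here for illustration). The same rearrangement in Python, with hypothetical helper names mid and to_yyyymmdd:

```python
def mid(text: str, start: int, length: int) -> str:
    """Mimic Access's Mid(): start is 1-based, length is a character count."""
    return text[start - 1 : start - 1 + length]


def to_yyyymmdd(dob: str) -> str:
    """Rearrange a zero-padded 'mm/dd/yyyy' string into 'yyyymmdd'.

    Assumes the input is exactly 10 characters with two-digit month and day;
    anything else is rejected rather than silently mangled.
    """
    if len(dob) != 10 or dob[2] != "/" or dob[5] != "/":
        raise ValueError(f"expected mm/dd/yyyy, got {dob!r}")
    year = mid(dob, 7, 4)   # characters 7-10
    month = mid(dob, 1, 2)  # characters 1-2
    day = mid(dob, 4, 2)    # characters 4-5
    return year + month + day


print(to_yyyymmdd("03/15/1975"))  # prints 19750315
```

Because yyyymmdd text sorts in chronological order, the converted values can be compared or sorted as plain strings. The fixed positions only work when the month and day are zero-padded, which is why the sketch checks the length and slash positions first.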

Regex: Search and replace on steroids

Of all the tools and techniques that have been brought to bear on the pursuit of computer-assisted reporting, few have been more useful, I've found, than regular expressions. Regexes are among the tools I find myself using on an almost daily basis. Although there is a learning curve to be sure, I believe it is worth it. Once you learn the basics of regular expressions, you'll view the hours spent constructing elaborate string functions or embedded if/then statements as quite literally time wasted. So, what are regular expressions? Put simply, it's a specialized language for searching ...

Read more ...
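
As a small illustration of the kind of search and replace the excerpt describes (the full post is not reproduced here, so this is not the author's own example), the sketch below uses Python's standard re module to find dates written as m/d/yyyy anywhere in a line of text and rewrite them as yyyymmdd. The pattern, the sample text and the function name normalize_dates are all assumptions made for the example.

```python
import re

# One or two digits, a slash, one or two digits, a slash, four digits.
# The parentheses capture month, day and year so they can be reordered.
DATE_PATTERN = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def normalize_dates(text: str) -> str:
    """Rewrite every m/d/yyyy date found in text as yyyymmdd."""

    def _rewrite(match: re.Match) -> str:
        month, day, year = match.groups()
        # Zero-pad month and day so every result is eight characters long.
        return f"{year}{int(month):02d}{int(day):02d}"

    return DATE_PATTERN.sub(_rewrite, text)


print(normalize_dates("Hired 3/5/1998, born 11/22/1970"))
# prints: Hired 19980305, born 19701122
```

A single pattern handles both padded and unpadded months and days, which would otherwise take several nested string functions or if/then branches. The pattern does not check that a match is a real calendar date (13/45/2001 would still be rewritten), so a validation pass would still be needed on real data.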