Operation Clean Data

Editing Out Inaccuracies

Even better than cleaning dirty data is making sure it can't be soiled in the first place. Organizations heavily reliant on accurate information, such as the US Census Bureau, are leading the charge when it comes to building real-time validation into data as it is generated. The bureau undertakes hundreds of surveys a year into demographics, the economy, trade data and much else. And needless to say, clean data is imperative.

To facilitate its work, the bureau builds feedback and validation loops into each survey and questionnaire to make sure that human-generated information is as accurate and reasonable as possible, says Richard Swartz, associate director for IT and CIO at the Census Bureau. When completed questionnaires are returned from businesses and individuals and scanned into the bureau's computers, checks called "edits" test the responses to make sure they are complete and reasonable, Swartz explains. Are the required fields complete? If not, how should nonresponses be dealt with? Should records be ignored, or should responses be "created" by estimating or putting in an average value, so as to avoid throwing out a whole record just because of one odd or missing data item? Are responses reasonable? Can a 96-year-old describe herself as unemployed, and is that 80-year-old man really the father of a new baby? Is data consistent? Could a company with three people on the payroll really have a salary bill of more than $1 million?
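The kinds of edits Swartz describes (completeness, consistency, and filling in a missing value with an average) can be sketched in a few lines of Python. This is only an illustration, not the bureau's system: the `BusinessResponse` record, the field names, and the $300,000 pay-per-employee threshold are all assumptions made for the example.

```python
from dataclasses import dataclass, replace
from typing import List, Optional

# Hypothetical record layout; the real questionnaires are not described in the article.
@dataclass(frozen=True)
class BusinessResponse:
    employees: Optional[int]
    annual_payroll: Optional[float]

REQUIRED_FIELDS = ("employees", "annual_payroll")
MAX_PAY_PER_EMPLOYEE = 300_000  # assumed plausibility threshold, in dollars

def run_edits(resp: BusinessResponse) -> List[str]:
    """Return a list of flags; an empty list means the response passed."""
    flags = []
    # Completeness edit: are the required fields filled in?
    for name in REQUIRED_FIELDS:
        if getattr(resp, name) is None:
            flags.append(f"missing:{name}")
    # Consistency edit: does the payroll make sense for the headcount?
    if resp.employees is not None and resp.annual_payroll is not None:
        if resp.employees <= 0:
            if resp.annual_payroll > 0:
                flags.append("inconsistent:payroll_without_employees")
        elif resp.annual_payroll / resp.employees > MAX_PAY_PER_EMPLOYEE:
            flags.append("implausible:payroll_per_employee")
    return flags

def impute_mean_payroll(records: List[BusinessResponse]) -> List[BusinessResponse]:
    """Fill missing payroll values with the mean of the reported ones,
    so a record is not discarded for a single missing item."""
    known = [r.annual_payroll for r in records if r.annual_payroll is not None]
    if not known:
        return list(records)
    mean = sum(known) / len(known)
    return [r if r.annual_payroll is not None else replace(r, annual_payroll=mean)
            for r in records]
```

With these assumed rules, a three-person company reporting a $1.2 million payroll would be flagged for review rather than silently accepted or rejected.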

At the US Centers for Disease Control and Prevention, such real-time data validation underpins data gathering, according to CIO Jim Seligman. When laptop-wielding field workers quiz 40,000 US households a year for the "National Health Interview Survey", automatic edits make sure that responses are as complete as possible while the survey is taking place. Some edits are "skip patterns", designed to prevent irrelevant questions from being asked in the first place. If the respondent is male, for example, he won't get the question about mammograms. Other edits are consistency checks: respondents are asked their age, but also their date of birth, and the two are compared.
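As a rough sketch of the two kinds of edit Seligman mentions, the Python below shows a skip pattern and an age versus date-of-birth consistency check. The question key, the answer dictionary, and the function names are hypothetical; the article does not describe how the survey software is actually built.

```python
from datetime import date

def should_ask(question: str, answers: dict) -> bool:
    """Skip pattern: decide whether a question applies to this respondent."""
    if question == "mammogram" and answers.get("sex") == "male":
        return False
    return True

def age_on(dob: date, on: date) -> int:
    """Age in completed years on a given day."""
    years = on.year - dob.year
    # Subtract one if the birthday has not yet occurred this year.
    if (on.month, on.day) < (dob.month, dob.day):
        years -= 1
    return years

def age_matches_dob(reported_age: int, dob: date, interview_date: date) -> bool:
    """Consistency check: does the stated age agree with the date of birth?"""
    return reported_age == age_on(dob, interview_date)
```

Because the check runs while the interview is under way, a mismatch can prompt the field worker to ask again on the spot instead of leaving the conflict to be resolved later.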

It may sound trivial, but from such small foundations, clean data is built. "Any time a human being has something to do with entering data, there's the potential for error - whether it's misreading something, misinterpreting something or miskeying something," Seligman says. And very often, it takes humans working with machines to clean up the mess.

Four Ways to Make Your Data Sparkle

  • Prioritize the task. Cleaning data can be costly and time-consuming, so your first step should be figuring out which data is mission-critical and which isn't. For some companies, it's not worth cleaning data errors like sloppy punctuation when they don't get in the way of business objectives.

  • Involve the data owners. Ask the business units that own the data for help defining precise rules for what constitutes dirty data. That includes figuring out in advance whether 98 percent clean is good enough, or whether 100 percent is required or affordable.

  • Keep future data clean. Put processes and technologies in place that check every zip code and every area code.

  • Align your staff with business. Make sure you have IT people working on the ground with business units to make necessary changes in the data and relabel wrongly tagged inventory.
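
For the "keep future data clean" tip, one simple starting point is a format check applied at the point of entry. The sketch below assumes US-style data: a five-digit ZIP code with an optional four-digit extension, and a three-digit North American area code whose first digit is 2 through 9. The function names are illustrative, and the checks confirm only that a value is well formed, not that the code is actually in use.

```python
import re

# Five digits, optionally followed by a hyphen and four more (ZIP+4).
ZIP_RE = re.compile(r"[0-9]{5}(-[0-9]{4})?")
# Three digits; North American area codes do not start with 0 or 1.
AREA_CODE_RE = re.compile(r"[2-9][0-9]{2}")

def is_valid_zip_format(value: str) -> bool:
    """True if the value looks like a US ZIP or ZIP+4 code."""
    return ZIP_RE.fullmatch(value.strip()) is not None

def is_valid_area_code_format(value: str) -> bool:
    """True if the value looks like a North American area code."""
    return AREA_CODE_RE.fullmatch(value.strip()) is not None
```

A stricter process would also look each value up against a current reference list, since a well-formed code can still be one that was never assigned.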