Operation Clean Data

Know Your Enemy

When Britain's defence department began its data-cleaning project in early 2000, it faced a huge task, says Lieutenant Colonel Andrew Law, head of The Cleansing Project. (It just so happens that the acronym TCP is also a well-known British brand of antiseptic.) The department's IT team was using three main systems to sort through 1.7 million records, each with literally hundreds of attributes. Each record referred to an item that troops might require, and many of these items were to be dispatched from the ministry's widely dispersed warehouses in Bicester, England, and other locations. (The Bicester warehouses are far apart because they were built in 1942 to make it hard for German bombers to deliver a knockout punch.)

Law's mission was to review all the data, but he had to concentrate his team's energies on cleaning six critical data fields: the NATO item identifier, the NATO supply classification, the unit of issue, the supplier code, the packaging code and the hazard code. These six fields were chosen based on which ones would have the biggest impact on the supply chain if they were wrong.

"The first step was to identify homonyms and synonyms," says Paul Nettle, manager of data cleaning for TCP. Homonyms, he explains, are two or more different items with the same identifier, such as rations and radio valves. Synonyms are the same items with more than one identifier - the same radio valve kept in two places in a warehouse under two different numbers, for example.

"Synonyms are merely inefficient," Nettle observes. Overstocking and overbuying result from such data mistakes, rather than troops being shipped the wrong gear.
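
Nettle's two definitions translate naturally into a pair of grouping checks. What follows is only a minimal sketch, not the TCP team's actual tooling: the record layout, the identifiers and the function names are all hypothetical, and it assumes item descriptions have already been normalised so that identical items carry identical text.

```python
from collections import defaultdict

# Hypothetical records: (identifier, item description). Values are illustrative only.
records = [
    ("1001", "radio valve"),
    ("1001", "rations"),       # same identifier, different item: a homonym
    ("2002", "radio valve"),   # same item, second identifier: a synonym
    ("3003", "aircraft oil"),
]

def find_homonyms(records):
    """Return identifiers that refer to more than one distinct item."""
    items_by_id = defaultdict(set)
    for identifier, description in records:
        items_by_id[identifier].add(description)
    return {i: sorted(d) for i, d in items_by_id.items() if len(d) > 1}

def find_synonyms(records):
    """Return items that are held under more than one identifier."""
    ids_by_item = defaultdict(set)
    for identifier, description in records:
        ids_by_item[description].add(identifier)
    return {d: sorted(i) for d, i in ids_by_item.items() if len(i) > 1}

homonyms = find_homonyms(records)
synonyms = find_synonyms(records)
```

On the sample data, the homonym check flags identifier 1001 (shared by the radio valve and the rations), while the synonym check flags the radio valve held under both 1001 and 2002.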

Next, the IT team employed data-profiling software to crawl through the data, checking it for valid NATO numbers. The troubling finding: 119,000 numbers (about one in 10) weren't valid. The radio valve, it turned out, had a valid NATO part number, but the rations came from a satellite system where non-standard rules had been used. Every one of the invalid numbers had to be sent to a NATO office in Glasgow for codification, and then corrected in each system in which it occurred. Nettle and his team also discovered they had quite a bit of relabelling to do at the depot, since much of the inventory sitting on the shelves was now incorrectly labelled.
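
The article does not say which rules the profiling software applied. As a rough illustration, a first-pass check might simply test whether each number has the right shape. The sketch below assumes the standard 13-digit NATO Stock Number layout (a four-digit supply class followed by a nine-digit item identification number, with optional hyphens); the sample numbers and function names are invented for illustration. A format check of this kind cannot confirm that a number was ever actually codified, only that it looks plausible.

```python
import re

# Assumed layout: 4-digit supply class, then 2 + 3 + 4 digits, hyphens optional.
NSN_PATTERN = re.compile(r"\d{4}-?\d{2}-?\d{3}-?\d{4}")

def has_valid_format(number: str) -> bool:
    """True if the string has the shape of a 13-digit NATO Stock Number."""
    return NSN_PATTERN.fullmatch(number.strip()) is not None

def split_valid_invalid(numbers):
    """Partition numbers into (well-formed, malformed) lists, keeping order."""
    valid, invalid = [], []
    for number in numbers:
        (valid if has_valid_format(number) else invalid).append(number)
    return valid, invalid

# Invented sample values; the last two stand in for locally assigned codes.
sample = ["5960-99-123-4567", "5960991234567", "RAT-0042", "5960-99-123"]
valid, invalid = split_valid_invalid(sample)
```

Anything landing in the malformed list would, in the project's terms, be a candidate for referral to the codification office.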

The next step was "fuzzy matching", using software to look for duplicates and errors introduced by keyboard entry. "The ability to ignore [minor mistakes in] punctuation and figure out when a 3 had been erroneously substituted for an 8 was important when dealing," Nettle says. Such numerical errors, after all, could change the entire meaning of the text, while punctuation mistakes merely provided Nettle's team with much-needed amusement.
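
The two behaviours Nettle describes, ignoring punctuation and catching a 3 typed for an 8, can be approximated in a few lines. This is a simplified sketch under stated assumptions rather than a description of the software the team used: the function names are hypothetical, only the 3/8 confusion mentioned in the article is modelled, and a single substituted character is the only keying error tolerated.

```python
import string

# Character pairs treated as likely keying errors. Only 3/8 comes from the
# article; the set could be extended with other look-alike pairs.
CONFUSABLE = {frozenset("38")}

# Translation table that deletes punctuation and whitespace.
_STRIP = str.maketrans("", "", string.punctuation + string.whitespace)

def normalise(text: str) -> str:
    """Drop punctuation and spacing, and ignore case."""
    return text.translate(_STRIP).upper()

def likely_same(a: str, b: str) -> bool:
    """True if two entries match after normalising, allowing at most one
    substitution between confusable characters."""
    a, b = normalise(a), normalise(b)
    if a == b:
        return True
    if len(a) != len(b):
        return False
    diffs = [(x, y) for x, y in zip(a, b) if x != y]
    return len(diffs) == 1 and frozenset(diffs[0]) in CONFUSABLE
```

With these rules, "VALVE, RADIO 13" and "valve radio 18" are reported as a probable duplicate, whereas "VALVE-13" and "VALVE-14" are not, because 3 and 4 are not in the confusable set. Pairs flagged this way would still need a human decision about which entry is the correct one.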

By August 2001, they had completed the relatively easy (if time-consuming) task of examining item identifiers to see, for instance, if an item held a valid NATO number. Now they had to find a way to correct the other data fields. Here, the challenge was more difficult. For things such as unit-of-issue labels, packaging codes and supplier details, hard and fast rules to tell clean data from dirty data didn't exist. Take supplies of aircraft oil: A military unit in the Gulf might order 250 litres of oil, expecting 250 one-litre cans - only to receive 250 separate 250-litre drums of the stuff. The reason? On the Royal Air Force system responsible for ordering the oil, 250-litre drums, not one-litre cans, were the unit of issue. Neither label was technically an error, but clearly, such inconsistencies could quickly cripple a supply chain. To make sure such a disaster would never occur, the TCP team turned to a data-profiling tool, which highlighted errors and inconsistencies in the various codes. The software provides easy-to-understand, computer-generated diagrams to spot unusual data formats that could be erroneous.
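
The aircraft-oil case shows why no single-record rule can catch this kind of problem: each label is valid on its own, and the trouble only appears when systems are compared. One way such a comparison might look is sketched below. The system names, item identifiers and unit labels are hypothetical, and the approach is a simple cross-system check, not a reconstruction of the profiling tool itself.

```python
from collections import defaultdict

# Hypothetical extracts: (system name, item identifier, unit of issue).
rows = [
    ("other_system", "OIL-1", "1 L can"),
    ("raf", "OIL-1", "250 L drum"),
    ("other_system", "VALVE-1", "each"),
    ("raf", "VALVE-1", "each"),
]

def unit_conflicts(rows):
    """Return items whose unit of issue differs between systems,
    mapped to the unit each system records."""
    units = defaultdict(dict)
    for system, item, unit in rows:
        units[item][system] = unit
    return {
        item: by_system
        for item, by_system in units.items()
        if len(set(by_system.values())) > 1
    }

conflicts = unit_conflicts(rows)
```

Here only the oil is flagged, since the two systems record it in different units, while the valve passes because both agree. Deciding which unit should win remains a judgment call for the people who know the supply chain.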

As Law points out, however, technology only goes so far. The 12-man TCP project succeeded in large part because team members at headquarters worked closely with members in the British Army, Royal Navy and Royal Air Force, who made sure flawed data actually did get cleaned and who organized activities such as relabelling inventory on the shelf. So far, Nettle says, the cleansing project has cost £6 million over four years, and has saved the Ministry of Defence £50 million.
