News | 8/25/2004 12:58 PM

A Question of Quality

Data quality is today's obsession. How good does the data really have to be?

Everyone ought to go to the Wilshire Meta-data Conferences — not just for the excellent sessions, but also for the quick ideas and little stories that get tossed out at you so fast that you need a week or two to digest them.

The big thing right now is DQ (data quality, not Dairy Queen, in case you've been on Mars). Being a cynic, I think that its new popularity is because, thanks to Sarbanes-Oxley and more audits, management is worried about jail time.

The little toss-off I'm currently playing with came from Graeme Simsion at a panel discussion at this year's Meta-Data Conference (May 2-6, 2004). I assume you know the Zachman Framework, which can serve as a general-purpose mental tool for framing a problem, not just as an IT architecture artifact. In 25 words or less: you build an array with the English interrogatives (who, what, where, when, why, and how) on one axis and the parts of your project on the other axis.

What Graeme pointed out is that other languages have different interrogatives that can go into this array. In particular, the French also ask about quantity and cost (how many/much?) and quality (how good?). Now, I have a quality dimension at the level of my metadata design that needs to be part of the data dictionary. I would also add "how trusted?" to the list. (Maybe it's an interrogative in Klingon?)
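
To make the extended array a little more tangible, here is one way it could be kept in a repository table. This is only a sketch; the framework is a thinking tool rather than a schema, and every name below is my own invention, not anything from Zachman or Simsion.

    CREATE TABLE Framework_Cells
    (interrogative VARCHAR(15) NOT NULL
        CHECK (interrogative IN ('who', 'what', 'where', 'when', 'why', 'how',
                                 'how many/much', 'how good', 'how trusted')),
     project_part VARCHAR(50) NOT NULL,   -- the other axis of the array
     cell_answer VARCHAR(500),            -- what goes in that cell of the array
     PRIMARY KEY (interrogative, project_part));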

Return to Sender

Let me make this more concrete with the classic source of bad data — mailing addresses. When I'm setting up the data dictionary for the addresses, if I use Graeme's model as part of the high-level design process, then I have a cell for quality in the array to fill out. It has two parts: How good is the data I get? How good do I need it to be?

Because it's now at the top level, I have to have mechanisms all the way down to the implementation level to enforce it. Quality is getting built into the system and not stuck on later.
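
As a minimal sketch of what that could look like (the table and column names here are mine, not a real schema), the quality cell becomes two columns in the dictionary, and the enforcement shows up as ordinary constraints on the address table itself:

    CREATE TABLE Address_Dictionary
    (column_name VARCHAR(50) NOT NULL PRIMARY KEY,
     quality_received VARCHAR(200),    -- how good is the data I get?
     quality_required VARCHAR(200));   -- how good do I need it to be?

    CREATE TABLE Addresses
    (address_id INTEGER NOT NULL PRIMARY KEY,
     street_addr VARCHAR(100) NOT NULL,
     city_name VARCHAR(50) NOT NULL,
     state_code CHAR(2) NOT NULL,
     zip_code CHAR(5) NOT NULL
        CHECK (zip_code SIMILAR TO '[0-9]{5}'));  -- SQL:1999 predicate; syntax varies by product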

If I'm doing bulk mailings, then perhaps I can live with a 2 to 3 percent bad-address rate and slower delivery. The cost of cleaning the data is greater than the return I would get from doing it. But if I'm sending out lab results to doctors, I need the data to be 100 percent accurate and fast so I don't kill people. In this example, a larger quantity lets me get by with lower quality.
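
A quick way to see which side of that line you are on is to measure the actual return rate. Assuming hypothetical Mailings and Returned_Mail tables, with one row per piece mailed and at most one row per piece returned, something like this would do:

    SELECT 100.0 * COUNT(R.piece_id) / COUNT(*) AS pct_bad_addresses
      FROM Mailings AS M
           LEFT OUTER JOIN Returned_Mail AS R
           ON R.piece_id = M.piece_id;

Compare that number against the tolerance written into the quality cell: 2 to 3 percent for the bulk mailing, effectively zero for the lab results.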

But some data "spoils" over time. People move, change jobs, change income, change health, and so forth over periods of time. But what is the speed of decay? Are we talking about milk or pickles in the data refrigerator?

The idea of spoilage gives me a measure of quality as it relates to time. If my mailing list is very old, I expect a lot more returns than from a mailing list I compiled last week. If I track the spoilage so that I know roughly x percent of the addresses go bad in k weeks, I have a way to schedule a data cleanup.
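
Estimating that decay curve does not have to be fancy. Assuming a hypothetical Mailing_History table with one row per piece mailed, the address's age in weeks at mailing time, and a returned flag, a simple grouping gives the spoilage rate by age:

    SELECT weeks_old,
           100.0 * SUM(CASE WHEN returned_flag = 'Y' THEN 1 ELSE 0 END)
                 / COUNT(*) AS pct_returned
      FROM Mailing_History
     GROUP BY weeks_old
     ORDER BY weeks_old;

Once pct_returned crosses whatever tolerance you have set, you have your cleanup interval.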

The trust dimension is harder to define. The source and age of the data are obvious factors. For example, people lie about their age and their income. If I have an outside verification from a trusted source, like a birth certificate or a tax return, then I'm more confident about my data. If I'm doing a magazine survey, I can stand a little fibbing in my data and adjust for it in the aggregate. If I'm doing a security check, then there's no such thing as a "little white lie"; it's perjury and has penalties.
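
One crude way to pin the trust dimension down, purely as a sketch with invented codes and names, is a lookup table that scores each data source by how it was verified:

    CREATE TABLE Source_Trust
    (source_code CHAR(10) NOT NULL PRIMARY KEY,
     trust_level INTEGER NOT NULL
        CHECK (trust_level BETWEEN 1 AND 5),
     verification_doc VARCHAR(100));

    INSERT INTO Source_Trust VALUES ('VITALREC',  5, 'birth certificate');
    INSERT INTO Source_Trust VALUES ('TAXRETURN', 4, 'tax return');
    INSERT INTO Source_Trust VALUES ('MAGSURVEY', 2, 'self-reported survey answer');

The security check would simply refuse anything below the top levels, while the magazine survey lives happily at level 2.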

Work in Progress

I'm still playing with these ideas. My first impulse is to try to come up with some kind of statistical measurement for each data element that will look impressive because it has a lot of decimal places. Of course, this quickly leads to a DQ audit system that's more complicated than the underlying database.

What if we started with a "traffic light" code or five-point scale in the data dictionary to stand for quality? Nothing fancy, just an estimate of the data's quality. Then add another scale for spoilage rate expressed on another simple scale — verify once a decade, once a year, monthly, weekly, or whatever. What do you think?

Joe Celko is an independent consultant in Austin, Texas, and the author of Joe Celko's Trees and Hierarchies in SQL for Smarties (Morgan Kaufmann, 2004) and Joe Celko's SQL for Smarties: Advanced SQL Programming (Morgan Kaufmann).
