CSI: Data Warehouse - InformationWeek

InformationWeek is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them.Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.

IoT
IoT
Software // Information Management
News
11/24/2004
12:43 PM
50%
50%

CSI: Data Warehouse

Armed with a knowledge of patterns, you can root out bad data.

My wife and I love all the CSI police procedural dramas that are so popular now. The crime lab crew gets to the crime site knowing nothing about the situation and finds the bad guy in 60 minutes. How about a show called CSI: Data Warehouse on Tech TV?

Imagine that you walk into a client who has a large amount of data, and he wants to know if his data is real or fake. You don't know anything about his data, or even his industry. It turns out that data qua data actually has some patterns that are fairly easy to find in a modern database. Let me give a quick overview, without much mathematics, of some of the easy ones.

Seeing the Pattern

In the old days, before we had pocket calculators and cheap computers, engineers solved equations with slide rulers and books of mathematical tables. A slide ruler is accurate to three decimal places and most of the books were accurate to five or six places at least. (If you don't know what a slide ruler is, then Google it.)

In 1938, Dr. Frank Benford was a physicist at General Electric who noticed that pages in the logarithm tables that corresponded to numbers starting with the numeral one were getting worn out faster than the other pages. Most of us would have stopped at that point and not thought about it very much.

Dr. Benford could see no immediate reason that physicists and engineers would prefer logarithms starting with one. He started data mining unrelated sets of numbers. I mean really unrelated — geography, census data, baseball statistics, numbers in magazine articles, and just about anything he could find. After looking at 20,229 sets of numbers, he found that they all followed the same pattern he had seen in the logarithm tables.

The usual guess would be that all digits, one through nine, would be equally likely to pop up at the start of a string of digits. Nope, not true; Benford's Law says that it can be approximated by the formula P(d is first digit) = LOG10(1.0 + 1.0/d). The pattern is 30.1 percent for one, 17.6 percent for two, and down to 4.6 percent for nine. You can get some confirmation of this in "The First-Digit Phenomenon" by T. P. Hill (American Scientist, July-August 1998). Benford's Law gets better as the sample gets larger and more varied.

What makes Benford's Law useful to a data miner is that you don't have to understand the data. If the data drifts from the pattern, you know to look for a systematic bias or faked data. Like any statistic, it isn't a certainty, but it's a good place to start. In fact, there are fraud detection packages based on Benford's Law that look at patterns in expense reports and other financial data.

A Run of Luck

Another data mining trick is looking at runs. If you toss a fair coin 200 times, you're almost certain to encounter a run of six heads or six tails. But when people fake data, they don't like to repeat long runs of one value. This idea can be extended to multiple values and patterns of runs, but the coin toss is the simplest case and very easy to test.

We can generalize the idea of patterns over time. Given a jar of (n) numbered marbles, draw one out of the jar at random and put it back. How many draws do you have to make to have a better than 50 percent chance of getting a previous marble that was drawn (p) draws ago? You can make this scenario into customers who come back to the store, trucks that pull into the motor pool, or whatever. Most people guess (n/2), but it's much lower. The formula is a bit messy, but here's a quick table.

n

p

10

4

100

12

1000

37

10,000

118

100,000

372

In the real world, breaks in the data patterns are often the result of something systematic and not criminal. For example, the digit nine might show up a lot in an expense report sample because there's a $100.00 limit on something and people are padding their reports.

But the digit nine might also show up too often because the data warehouse handles missing data by filling codes that start with nines. Now we have an indication that we have a lot of missing data. We might want to look at this data and see if we need more detailed codes for kinds of missing data, if we need to improve data capture, or if we have to adjust the certainty of predictive reports to allow for this fact.

Joe Celko is an independent consultant in Austin, Texas and the author of Joe Celko's Trees and Hierarchies in SQL for Smarties (Morgan Kaufmann, 2004) and Joe Celko's SQL for Smarties: Advanced SQL Programming (Morgan Kaufmann, 1999).

We welcome your comments on this topic on our social media channels, or [contact us directly] with questions about the site.
Comment  | 
Print  | 
More Insights
Commentary
CIOs Face Decisions on Remote Work for Post-Pandemic Future
Joao-Pierre S. Ruth, Senior Writer,  2/19/2021
Slideshows
11 Ways DevOps Is Evolving
Lisa Morgan, Freelance Writer,  2/18/2021
News
CRM Trends 2021: How the Pandemic Altered Customer Behavior Forever
Jessica Davis, Senior Editor, Enterprise Apps,  2/18/2021
White Papers
Register for InformationWeek Newsletters
Video
Current Issue
2021 Top Enterprise IT Trends
We've identified the key trends that are poised to impact the IT landscape in 2021. Find out why they're important and how they will affect you.
Slideshows
Flash Poll