The Data Quality Audit

An audit can help you discover the source of data quality problems. Hint: It might be the warehouse.

Data quality problems are often widespread and originate in your source systems, their applications, and operational processes. And, sadly, many are the direct result of inadequate warehouse management.

To combat these problems, data warehouse architects must understand their source data. That understanding can come from data profiling. But profiling techniques, important as they are, remain exploratory: they leave it to the analyst to determine how the data profile fits the business requirements. That's where a business rules-based audit can be useful, if not critical.

Understanding the Source

Within your source systems are two principal areas of concern:

1. Insufficient process controls to ensure data quality and consistency. The ability to change key values, such as account numbers, epitomizes this lack of process control and has a direct impact on the historical integrity of the warehouse. When primary keys can change, referential integrity is compromised, and keeping the warehouse intact requires extremely complex ETL processing whenever these changes occur at the source level (a common surrogate-key mitigation is sketched after this list).

2. Limited or no basic edits. Source applications have been known to provide few or no edit checks to ensure that entered data meets even the most basic standards. Without measures to monitor and control these fundamental standards at the source systems, the warehouse team will struggle against a constant flow of poor-quality data. And although user-defined fields add flexibility to an application, they have painful consequences for the statistical analysis and reporting required of warehouses. User-defined fields are essentially free-form: they can be anything, contain anything, and change anytime the user sees fit, all of which makes it difficult to derive any value from them.
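To make the key-change problem concrete, here is a minimal sketch (not from the article; the names and the rename-event mechanism are illustrative assumptions) of the surrogate-key mitigation: the warehouse mints its own stable key, so a source-side account-number change becomes a mapping update rather than a break in history.

# Minimal sketch of surrogate-key handling for mutable natural keys.
# Assumes the source system can report renames; names are hypothetical.

key_map: dict[str, int] = {}  # natural key (account number) -> surrogate key
_next_sk = 1

def surrogate_key(account_number: str) -> int:
    """Return a stable surrogate key for an account, minting one on first sight."""
    global _next_sk
    if account_number not in key_map:
        key_map[account_number] = _next_sk
        _next_sk += 1
    return key_map[account_number]

def apply_rename(old_number: str, new_number: str) -> None:
    """Re-point the mapping when the source changes an account number,
    so facts already loaded under the surrogate key stay linked."""
    if old_number in key_map:
        key_map[new_number] = key_map.pop(old_number)
    else:
        surrogate_key(new_number)  # rename of an account the warehouse never saw

Because fact rows are loaded with the surrogate key rather than the raw account number, the rename never touches warehoused history.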

The warehouse, unfortunately, adds to the data quality problems, which can typically be attributed to two factors:

1. Inadequate source data analysis. Guidelines for sourcing data into the warehouse must be established to avoid sourcing data simply because it is part of an existing record. That practice leads to a high percentage of warehoused data of questionable value.

2. Limited warehouse management. A fundamental function of the warehouse team is to actively monitor and tune warehoused data in pursuit of quality assurance and data relevance, specifically:

  • Warehoused data should be associated with a business requirement. If it isn't, the data is often neglected: neither purged as irrelevant nor scrubbed for quality.
  • Reference table maintenance is another common source of warehouse-centric data quality problems. Reference tables are often out of sync with their applications, or rendered moot entirely when ETL processes allow invalid codes to load, as the sketch below illustrates.
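As a concrete illustration of the reference-table point, here is a minimal sketch (the code set and field name are hypothetical) of an ETL guard that refuses to load rows whose codes are absent from the reference table:

# Minimal sketch of an ETL reference-code guard; names are illustrative.

VALID_ACCOUNT_TYPES = {"LOAN", "DEPOSIT", "CREDIT"}  # assumed reference-table contents

def check_reference_codes(rows):
    """Split incoming rows into loadable rows and rejects carrying invalid codes."""
    loadable, rejects = [], []
    for row in rows:
        if row.get("Account_Type") in VALID_ACCOUNT_TYPES:
            loadable.append(row)
        else:
            rejects.append(row)  # route to a data quality queue, not the warehouse
    return loadable, rejects

Rejected rows go to a quality queue for investigation instead of silently widening the gap between the reference table and the data it is supposed to describe.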

With source applications providing an unfettered environment for end users, and with incomplete transformation, cleansing, and warehouse maintenance processes, the warehoused data is guaranteed to be less than satisfactory.

So, how bad is your warehoused data? A data quality audit is designed and implemented to answer that question. A data quality effort has three high-level objectives:

  • Produce statistically valid data quality measurements of the data warehouse
  • Investigate, identify, and document leading data quality causes
  • Create an assessment and recommendation report

This audit approach is straightforward. It is designed to be implemented by any warehouse team, on demand.

The Data Quality Audit Approach

The data quality audit is a business rules-based approach that incorporates standard deviation to identify variability in sample test results. You examine the likelihood, at confidence levels of 95 to 99 percent, of invalid data values occurring in columns or fields after business rules are applied against sample test data. A business rule may be a simple check, such as "Account_Aggregate_Balance must be greater than zero," but it may also be a compound rule that assures a value's accuracy. For example:

IF "Account_Aggregate_Balance" is > 0
AND "Account_Type" = "LOAN"
AND "Account_Open" is > 30 days
THEN "Account_Interest_Fee" MUST BE > 0
