Data Quality - Dynamic Business Validation

Overview

In business intelligence and data warehousing, Data Quality has come to mean many different things, and it is one of those buzzwords that everyone likes to hear.  It implies that you understand the data in your intelligence systems, and that the answers you are getting out are trustworthy and reliable.  All of that is true, but you need to identify, monitor, remediate, and communicate the inevitable issues with your data if you are going to move beyond implied quality to a truly trusted source.  Since the enterprise data warehouse comprises information from numerous source systems throughout an organization, dirty and inconsistent data is inevitable.  This article provides a framework for that monitoring process that is flexible, extensible, and, most importantly, automatic.

A key component of a successful business validation implementation is a partnership with the business so that all the rules are understood.  As long as the customer can identify legitimate values for fields, the query to do the verification can be easily inserted into the process.  As with most data warehouse load processes, things are done in a logical order:

  1. Dimension tables are loaded to establish any new and changing key values
  2. Business validation is executed against the dimension tables to identify any inconsistencies, missing data, or other issues (see the orchestration sketch after this list).
  3. Fact tables are loaded
  4. Business validation is executed against the fact tables.  This can be done in parallel for each table that was loaded.
  5. Automated post-load reporting.  These can be business reports based on the data just loaded, as well as listings of any business validations that were triggered due to the introduction of new data.
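A minimal orchestration sketch of that sequence, using the bus_val_execute procedure described later in this article; the load procedure names (load_dim_customer, load_fact_product_delivery) are hypothetical:

-- Steps 1-2: load dimensions, then validate them (table_type 'D')
EXEC load_dim_customer @job_date;
EXEC bus_val_execute 'dim_customer', @audit_key, @job_date, 'D', '1';

-- Steps 3-4: load facts, then validate the rows just loaded (table_type 'F')
EXEC load_fact_product_delivery @job_date;
EXEC bus_val_execute 'product_delivery', @audit_key, @job_date, 'F', '1';

-- Step 5: post-load reports read from the bus_val_result table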

The Problem

Your company has a key performance indicator (KPI) that is driven by a date that defines the number of days it took to deliver a product.  Imagine what that measurement would look like if someone keys in a valid date, but it is next year.  Or transposes the month and keys in 10 instead of 01.  The ETL process has a hard time transforming that data: it is valid, and it is difficult at load time to determine the intended meaning.  Most often this type of data is loaded, and thus reported on.  Something that simple, if not identified, can significantly undermine the confidence in a key metric, and thus undermine the confidence in an entire source; data quality (implied and real) suffers.

The Infrastructure

The bus_val_message table contains the queries that will be looped through by the stored procedure.

Column          Description
Query_id        Unique identifier for the validation query
Validation_msg  The message you want to see on the report when this validation is tripped
Message_cat     Logical grouping of the message
Severity        Group into Critical/Major/Minor/Informational
Table_name      The source table being validated by this query
Zero_rows_ok    True/False; some validations should find data, others should not
Query           The actual query to be run to validate the data that has been loaded
Column_name     The column in the source table being validated
Table_type      Fact or Dimension
Group_no        Allows the stored procedure to be called for everything in group_no 1, etc.
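A minimal DDL sketch for this table, assuming SQL Server; the exact data types and lengths are assumptions, not prescriptions:

CREATE TABLE bus_val_message (
    query_id        int          NOT NULL PRIMARY KEY,
    validation_msg  varchar(255) NOT NULL,
    message_cat     varchar(50)  NULL,
    severity        varchar(15)  NULL,      -- Critical/Major/Minor/Informational
    table_name      varchar(128) NOT NULL,
    zero_rows_ok    char(1)      NOT NULL,  -- 'T'/'F'
    query           varchar(max) NOT NULL,  -- the stored validation query
    column_name     varchar(128) NULL,
    table_type      char(1)      NOT NULL,  -- 'F' = fact, 'D' = dimension
    group_no        int          NOT NULL   -- 0 = do not execute
);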


The bus_val_result table stores a row for each validation query that is tripped by the stored procedure.

Column          Description
Job_dt          Job date that this result row ties to
Validation_msg  The message you want to see on the report when this validation is tripped
Message_cat     Logical grouping of the message
Severity        Group into Critical/Major/Minor/Informational
Table_name      The source table being validated by this query
Row_cnt         Count result from the query
Query_id        The query that this result row relates to
Customer_key    Identifier for the customer that this relates to; allows reporting by customer
Audit_key       Identifier for the process that inserted this row
Rec_create_dt   Date and time this record was inserted
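A matching DDL sketch for the results table, under the same assumptions:

CREATE TABLE bus_val_result (
    job_dt          datetime     NOT NULL,
    validation_msg  varchar(255) NOT NULL,
    message_cat     varchar(50)  NULL,
    severity        varchar(15)  NULL,
    table_name      varchar(128) NOT NULL,
    row_cnt         int          NOT NULL,
    query_id        int          NOT NULL,  -- ties back to bus_val_message
    customer_key    int          NULL,      -- populated when the validation is customer-specific
    audit_key       int          NOT NULL,
    rec_create_dt   datetime     NOT NULL DEFAULT GETDATE()
);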


The stored procedure call would look similar to this:

EXEC bus_val_execute 'product_delivery', <audit_key>, <job_date>, 'F', '1'

Parameters:

    table_name – the table that was just loaded and is to be validated
    audit_key  – the audit_key of the package that is executing
    job_date   – the date and time that the package started
    table_type – the type of table
                 F = Fact (used to append the correct audit information)
                 D = Dimension (look at the whole dimension)

    group_no   – the query group number to be executed during this run
                 0 = do not execute this query
                 1 = queries in group 1 will be processed together
                 2 = queries in group 2 will be processed together
                 3 = queries in group 3 will be processed together

The first thing the stored procedure does is collect all of the available queries to be executed for a given table/table_type/group_no into a temporary table.
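A minimal sketch of that collection step, assuming SQL Server and the bus_val_message layout above (the temporary table name is assumed):

SELECT query_id, validation_msg, message_cat, severity,
       table_name, zero_rows_ok, query
INTO   #queries_to_run                 -- temporary table holding this run's queries
FROM   bus_val_message
WHERE  table_name = @table_name
  AND  table_type = @table_type
  AND  group_no   = @group_no;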

The next thing the stored procedure does is determine, from the audit data stored at load time, which audit_key affected the data for the job date being processed.  (That audit_key is then added to the WHERE clause of the stored query prior to execution.)
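A sketch of that lookup and rewrite, assuming a load-audit table named audit_log (the table name and columns are hypothetical), and assuming the stored queries alias the source table as "a", as in the example below:

DECLARE @audit_key int, @sql varchar(max);

SELECT @audit_key = audit_key
FROM   audit_log
WHERE  table_name = @table_name
  AND  job_dt     = @job_date;

-- Append the audit filter to the stored query, then run it dynamically
SET @sql = @query + ' AND a.audit_key = ' + CAST(@audit_key AS varchar(20));
EXEC (@sql);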

The Solution

Dynamic business validation is built upon a flexible framework: the validation queries are stored in the database, selected at execution time by a stored procedure, and any validation issues are stored as messages in the database, where they can then be reported on.  Assume in our example that the delivery date cannot be in the future and must have happened within the last 60 days.  The query that is stored would look similar to this:

SELECT 230001 AS query_id,
       COUNT(*) AS validation_count
  FROM product_delivery a
 WHERE (a.delivery_dt > GETDATE()        -- delivered in the future
    OR  a.delivery_dt < GETDATE() - 60)  -- or more than 60 days ago

The stored procedure automatically appends the audit and job date information to the query in order to identify the rows loaded today (load process auditing should be a standard practice when loading data warehouse data).
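After that rewrite, the query actually executed would look similar to this (the audit_key value 18734 is illustrative only):

SELECT 230001 AS query_id,
       COUNT(*) AS validation_count
  FROM product_delivery a
 WHERE (a.delivery_dt > GETDATE()
    OR  a.delivery_dt < GETDATE() - 60)
   AND a.audit_key = 18734   -- appended by the stored procedure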

This query, if it finds any data, stores the results in the business validation results table.  In our example, we have received a date into the warehouse that is a year in the future; although it is a valid date, it clearly violates the business validation rule, and it would be trapped, reported on, and corrected in the source systems responsible for the initial capture of this information.
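A sketch of how a tripped validation might land in the results table, assuming the procedure's loop has captured the query's output in variables (@query_id, @validation_count are assumptions):

IF @validation_count > 0
    INSERT INTO bus_val_result
           (job_dt, validation_msg, message_cat, severity, table_name,
            row_cnt, query_id, customer_key, audit_key, rec_create_dt)
    SELECT @job_date, m.validation_msg, m.message_cat, m.severity, m.table_name,
           @validation_count, m.query_id, NULL, @audit_key, GETDATE()
    FROM   bus_val_message m
    WHERE  m.query_id = @query_id;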

This achieves a few things:

  1. Before a report with an invalid KPI is produced, the issue is identified, reported on, and automatically communicated to the proper people in the organization.
  2. Before business decisions are made, the business rules are automatically applied to the data.
  3. Data Quality is improved, and trust in all of the underlying data is improved.  If there is trust that the data is being validated, the trust in the data will increase.
  4. Opportunities to build better edits in source systems are created.  Clearly, if a date in the future is not valid, it should never have made it into the originating systems.
  5. A repository of business validation issues is created.
