Overview
In business intelligence and data warehousing, Data Quality has come to mean many different things, and it is one of those buzzwords everyone likes to hear. It implies that you understand the data in your intelligence systems and that the answers you are getting out are trustworthy and reliable. All of that is true, but you need to identify, monitor, remediate, and communicate the inevitable issues with your data if you are going to move beyond implied quality to a truly trusted source. Because the enterprise data warehouse combines information from numerous source systems throughout an organization, dirty and inconsistent data is inevitable. This article provides a framework for that monitoring process that is flexible, extensible, and, most importantly, automatic.
A key component of a successful business validation implementation is a partnership with the business so that all of the rules are understood. As long as the customer can identify legitimate values for fields, the query to perform the verification can easily be inserted into the process. As with most data warehouse load processes, things are done in a logical order (a sketch of how the corresponding validation calls might be sequenced follows this list):
- Dimension tables are loaded to establish any new and changing key values
- Business validation is executed against the dimension tables to identify any inconsistencies, missing data, or other issues.
- Fact tables are loaded
- Business validation is executed against the fact tables. This can be done in parallel for each table that was loaded.
- Automated post-load reporting. These can be business reports based on the data just loaded, as well as listings of any business validations that were triggered due to the introduction of new data.
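To make that ordering concrete, here is a rough sketch of how calls to the bus_val_execute procedure (described under The Infrastructure below) might be sequenced in a nightly load. The dimension table names are hypothetical examples; product_delivery is the fact table used later in this article.

-- after the dimension loads complete, validate each dimension (whole-table check)
Exec bus_val_execute 'dim_customer', <audit_key>, <job_date>, 'D', '1'
Exec bus_val_execute 'dim_product', <audit_key>, <job_date>, 'D', '1'
-- after the fact loads complete, validate each fact table (these can run in parallel)
Exec bus_val_execute 'product_delivery', <audit_key>, <job_date>, 'F', '1'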
The Problem
Your company has a key performance indicator (KPI) that is driven by a date defining the number of days it took to deliver a product. Imagine what that measurement would look like if someone keys in a date that is valid, but is in the next year. Or transposes the month and keys in 10 instead of 01. The ETL process has a hard time transforming that data: it is valid, and at load time it is difficult to determine the intended meaning. Most often this type of data is loaded, and thus reported on. Something that simple, if not identified, can significantly undermine confidence in a key metric, and thus undermine confidence in an entire source; data quality (implied and real) suffers.
The Infrastructure
The bus_val_message table contains the queries that will be looped through by the stored procedure.
| Column | Description |
| --- | --- |
| Query_id | Unique identifier for the validation query |
| Validation_msg | The message you want to see on the report when this validation is tripped |
| Message_cat | Logical grouping of the message |
| Severity | Grouped into Critical/Major/Minor/Informational |
| Table_name | The source table being validated by this query |
| Zero_rows_ok | True/False; some validations should find data, others should not |
| Query | The actual query to be run to validate the data that has been loaded |
| Column_name | The column in the source table being validated |
| Table_type | Fact or Dimension |
| Group_no | Allows the stored procedure to be called for everything in group_no 1, etc. |
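As a rough illustration, a minimal definition of this table might look like the following. The data types and lengths are assumptions for a SQL Server environment, not part of the original design.

CREATE TABLE bus_val_message (
    query_id        int          NOT NULL PRIMARY KEY, -- unique identifier for the validation query
    validation_msg  varchar(255) NOT NULL,             -- message shown on the report when tripped
    message_cat     varchar(50)  NULL,                 -- logical grouping of the message
    severity        varchar(20)  NULL,                 -- Critical/Major/Minor/Informational
    table_name      varchar(128) NOT NULL,             -- source table being validated
    zero_rows_ok    bit          NOT NULL,             -- should the query return rows or not
    query           varchar(max) NOT NULL,             -- the validation query to execute
    column_name     varchar(128) NULL,                 -- column being validated
    table_type      char(1)      NOT NULL,             -- F = Fact, D = Dimension
    group_no        int          NOT NULL              -- execution group
)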
The bus_val_result table stores the outcome of each validation query that the stored procedure runs.
| Column | Description |
| --- | --- |
| Job_dt | Job date that this result row ties to |
| Validation_msg | The message you want to see on the report when this validation is tripped |
| Message_cat | Logical grouping of the message |
| Severity | Grouped into Critical/Major/Minor/Informational |
| Table_name | The source table being validated by this query |
| Row_cnt | Count result from the query |
| Query_id | The query this result row relates to |
| Customer_key | Identifier for the customer that this relates to; allows reporting by customer |
| Audit_key | Identifier for the process that inserted this row |
| Rec_create_dt | Date and time this record was inserted |
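A matching sketch for the results table, again with assumed SQL Server data types:

CREATE TABLE bus_val_result (
    job_dt          datetime     NOT NULL,             -- job date this result row ties to
    validation_msg  varchar(255) NOT NULL,             -- message shown on the report
    message_cat     varchar(50)  NULL,                 -- logical grouping of the message
    severity        varchar(20)  NULL,                 -- Critical/Major/Minor/Informational
    table_name      varchar(128) NOT NULL,             -- source table that was validated
    row_cnt         int          NOT NULL,             -- count returned by the validation query
    query_id        int          NOT NULL,             -- ties back to bus_val_message.query_id
    customer_key    int          NULL,                 -- allows reporting by customer
    audit_key       int          NOT NULL,             -- process that inserted this row
    rec_create_dt   datetime     NOT NULL DEFAULT getdate() -- when this record was inserted
)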
The stored procedure call would look similar to this:
Exec bus_val_execute 'product_delivery', <audit_key>, <job_date>, 'F', '1'
Parameters:
- table_name – the name of the table to be validated
- audit_key – the audit_key of the package that is executing
- job_date – the date and time that the package started
- table_type – the type of table:
  - F = Fact (used to append the correct audit information)
  - D = Dimension (look at the whole dimension)
- group_no – the query group number to be executed during this run:
  - 0 = do not execute this query
  - 1 = queries in group 1 will be processed together
  - 2 = queries in group 2 will be processed together
  - 3 = queries in group 3 will be processed together
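Taken together, the procedure's signature might look something like the sketch below. This is an assumption based on the parameters above rather than the original implementation; the body is outlined in the next two paragraphs.

CREATE PROCEDURE bus_val_execute
    @table_name varchar(128),  -- table to validate
    @audit_key  int,           -- audit_key of the executing package
    @job_date   datetime,      -- date and time the package started
    @table_type char(1),       -- 'F' = Fact, 'D' = Dimension
    @group_no   int            -- which query group to run (0 = skip)
AS
BEGIN
    -- gather the matching queries, resolve the audit_key for the job date,
    -- run each query, and write any results to bus_val_result
END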
The first thing the stored procedure does is collect all of the available queries to be executed for a given table/table_type/group_no into a temporary table.
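That first step might look roughly like this, assuming the bus_val_message structure sketched earlier:

SELECT query_id, validation_msg, message_cat, severity,
       table_name, zero_rows_ok, query, column_name
INTO   #queries_to_run
FROM   bus_val_message
WHERE  table_name = @table_name
  AND  table_type = @table_type
  AND  group_no   = @group_no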
Next, using the audit data stored at load time, the stored procedure determines which audit_key affected the data for the job date being processed. That audit_key is then added to the WHERE clause of the stored query prior to execution.
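A simplified sketch of that step follows. The audit_log table, its columns, and the @stored_query variable (the query text for the rule currently being processed from the temporary table) are hypothetical names used only for illustration.

-- determine which audit_key loaded the data for the job date being processed
DECLARE @load_audit_key int
SELECT  @load_audit_key = audit_key
FROM    audit_log                    -- hypothetical load-audit table
WHERE   job_dt     = @job_date
  AND   table_name = @table_name

-- append the audit filter to the stored query, then execute it
DECLARE @sql nvarchar(max)
SET @sql = @stored_query
         + ' AND a.audit_key = ' + cast(@load_audit_key as varchar(20))
Exec sp_executesql @sql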
The Solution
Dynamic business validation is built upon a flexible framework: the validation queries are stored in the database and selected at execution time by a stored procedure, and any validation issues are stored as messages in the database, where they can then be reported on. Assume in our example that the delivery date cannot be in the future and must have occurred within the last 60 days. The query that is stored would look similar to this:
SELECT 230001 as query_id,
count(*) as validation_count
FROM product_delivery a
WHERE (delivery_dt > getdate()
OR delivery_dt < (getdate() - 60))
The stored procedure automatically appends the audit and job-date information to the query in order to identify the rows loaded today (load-process auditing should be a standard practice when loading data warehouse data).
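In other words, the query that actually runs for this example might end up looking something like the following, assuming the fact table carries an audit_key column populated at load time (the audit_key value shown is purely illustrative):

SELECT 230001 as query_id,
count(*) as validation_count
FROM product_delivery a
WHERE (delivery_dt > getdate()
OR delivery_dt < (getdate() - 60))
AND a.audit_key = 1876  -- appended: the audit_key that loaded today's rows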
If this query finds any data, the stored procedure stores the result in the business validation results table. In our example, the warehouse has received a date that is a year in the future; although it is a valid date, it clearly violates the business validation rule, and it would be trapped, reported on, and corrected in the source systems responsible for the initial capture of this information.
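Conceptually, the write into the results table might look like the sketch below; the real procedure would take most of these values from the bus_val_message row and its own parameters, and @validation_count stands in for the count returned by the validation query.

INSERT INTO bus_val_result
    (job_dt, validation_msg, message_cat, severity, table_name,
     row_cnt, query_id, customer_key, audit_key, rec_create_dt)
SELECT @job_date,
       m.validation_msg,
       m.message_cat,
       m.severity,
       m.table_name,
       @validation_count,  -- count returned by the validation query
       m.query_id,
       NULL,               -- customer_key, when the rule is customer-specific
       @audit_key,
       getdate()
FROM bus_val_message m
WHERE m.query_id = 230001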
This achieves a few things:
- Before a report with an invalid KPI is produced, the issue is identified, reported on, and automatically communicated to the proper people in the organization.
- Before business decisions are made, the business rules are automatically applied to the data.
- Data quality is improved, and trust in all of the underlying data is improved. When users know the data is being validated, their trust in the data increases.
- Opportunities to build better edits in source systems are created. Clearly, if a date in the future is not valid, it should never have made it into the originating systems.
- A repository of business validation issues is created.