T-SQL Tuesday #005: Cleanup Your Notification Emails With HTML Tables

Introduction

I must admit this post is more along the lines of the “snack” series that (Andy Leonoard|Blog|Twitter) posts, but short-and to the point has some value!  I am writing this one to weigh in on this month’s #TSQL2sday topic about reporting.  So to blatantly steal the “snack” term from Andy, here it is:

TSQL Snack:

As a result of your load process, you send an email to your users to notify them that new data is available….right?  Well you should.  You can send a simple text email saying things are done, which certainly works, or, you can add some value and interest by providing some counts of what was loaded.  (You are auditing your loads…. that is another topic)  Your audit table already has everything you need, and a simple HTML table will get you professional looking output.

DECLARE @tableHTML  NVARCHAR(MAX) ;

SET @tableHTML =

    N'<H1>Warehouse Load Completion</H1>' +

    N'<table border="1">' +

    N'<tr><th>End Date</th><th>Table</th>' +

    N'<th>Rows Loaded</th>' +

    CAST ( ( SELECT td = CONVERT(VARCHAR,exec_start_dt, 110),       '',

                    td = tbl_name, '',

                    td = ISNULL(final_row_cnt - initial_row_cnt, 0)

              FROM dw_dbo.ld_audit

              WHERE audit_key >= (SELECT MAX(audit_key)

                                  FROM dw_dbo.ld_audit

                                  WHERE tbl_name = 'Master')

              AND tbl_name NOT IN ('Step','Datamart')

              ORDER BY (final_row_cnt - initial_row_cnt) DESC, audit_key

              FOR XML PATH('tr'), TYPE

    ) AS NVARCHAR(MAX) ) +

    N'</table>' ;

EXEC msdb.dbo.sp_send_dbmail @profile_name = 'General Notifications',

                             @recipients = 'Group-Distribution-List@your-domain.com; individual@your-domain.com; oncall-email@your-domain.com;',

                             @subject = 'Data Warehouse Load Complete',

                             @body = @tableHTML,

              @body_format = 'HTML';
				

  This example reads all of the audit rows, calculates the number of rows loaded for everything that was loaded since the start of the last “master” (which is my driver) audit row.  The output is put into a HTML formatted table (does not matter now many rows come back), gets a header-title, and column titles, and is sent to a set of distribution lists and email names via dbmail. What you will get is something like this (obscured my production names):

Conclusion

Communication is a good thing!  You should be letting your users know when loads are complete, what kind of data has been loaded, and what to expect.  Since SQL Server 2005/2008 make the HTML tables so simple, there is no reason that those notifications should look “simple”.

Because it’s TSQL2sDay

I know this is not what my blog is about… but a poll seems like the thing to do for today’s topic.

 

Data Quality -Dynamic Business Validation

Overview

As far as business intelligence, and data warehousing is concerned, Data Quality has come to mean many different things, and is one of those buzz words that everyone likes to hear.  It implies that you understand the data in your intelligence systems, and implies that the answers you are getting out are trustworthy and reliable, all of that is true, but you need to identify, monitor, remediate, and communicate the inevitable issues with your data if you are going to move beyond implied quality to a truly trusted source.  Since the enterprise data warehouse comprises information from numerous sources systems throughout an organization dirty and inconsistent data is inevitable.  This article is intended to provide a framework for that monitoring process that is flexible, extensible, and most importantly automatic.

A key component of a successful business validation implementation is a partnership with business so that all the rules are understood.  As long as the customer can identify legitimate values for fields, the query to do the verification can be easily inserted into the process.  As with most  data warehouse load processes, things are done in a logical order:

  1. Dimension tables are loaded to establish any new and changing key values
  2. Business validation is executed against the dimension tables to identify any inconstancies, missing data, or other issues.
  3. Fact tables are loaded
  4. Business validation is executed against the fact tables.  This can be done in parallel for each table that was loaded.
  5. Automated post-load reporting.  These can be business reports based on the data just loaded, as well as listings of any business validations that were triggered due to the introduction of new data.

The Problem

Your company has a key performance indicator (KPI) that is driven by a date that defines the number of days it took to deliver a product.  Imagine what that measurement would look like if someone keys in a valid date, but is next year?  Or transposes the month and keys in 10 instead of 01?  The ETL process has a hard time transforming that data, it is valid, and difficult at load time to determine the intended meaning.  Most often this type of data is loaded, and thus reported on.  Something that simple, if not identified, can significantly undermine the confidence in a key metric, and thus undermine the confidence in an entire source (implied and real) data quality suffers.

The Infrastructure

The bus_val_message table contains the queries that will be looped through by the stored procedure.

Column Description
Query_id Unique identifier for the validation query
Validation_msg The message you want to see on the report when this validation is tripped
Message_cat Logical grouping of the message
Severity Group into Critical/Major/Minor/Informational
Table_name The source table being validated by this query
Zero_rows_ok True/False some validations should find data, others should not.
Query The actual query to be run to validate the data that has been loaded
Column_name The column in the source table being validated
Table_type Fact or Dimension
Group_no Allow the stored procedure to be called for everything in group_no 1 etc.

 

The bus_val_result table contains the queries that will be looped through by the stored procedure.

Column Description
Job_dt Job date that this result row ties to
Validation_msg The message you want to see on the report when this validation is tripped
Message_cat Logical grouping of the message
Severity Group into Critical/Major/Minor/Informational
Table_name The source table being validated by this query
Row Cnt Count result from the query
Query_id Which query does this result row relate to
Customer_key Identifier for the customer that this relates to. Allows reporting by customer
Audit_key Identifier for the process that inserted this row
Rec_create_dt Date and time this record was insert

 

The stored procedure call would look similar to this:

Exec bus_val_execute ‘product_delivery’, <audit_key>, <job_date>, ‘F’, ‘1’

Parameters:

           table_name – the table name
           audit_key  – the audit_key of the package that is executing
           job_date   – the date and time that the package started
           table_type – the type of table
                        F = Fact  (used to append the correct audit information)
                        D = Dimension (Look at the whole dimension)

           group_no – the query group number to be executed during this run
                        0 = do not execute this query
                        1 = queries in group 1 will be processed together
                        2 = queries in group 2 will be processed together
                        3 = queries in group 3 will be processed together

The first thing the stored procedure does is collect all of the available queries to be executed for a given table/table_type/group_no into a temporary table.

The next thing the stored procedure does is (from the audit data stored during load-time) determine which audit_key effected the data for the job-date being processed.  (That audit_key is then added to the where clause of the stored query prior to execution)

The solution

Dynamic business validation is built upon a flexible framework in which the validation queries are stored in the database, selected at execution time by a stored procedure, validation issues are stored as messages in the database that are then reported on.  Assume in our example, that the delivery date cannot be in the future, and must have happened within the last 60 days the query that is stored would look similar to this:

SELECT 230001 as query_id,
 count(*)as validation_count
  FROM product_delivery a
WHERE delivery_dt <= getdate()
  AND delivery_dt >= (getdate() – 60)

Automatically appended by the stored procedure to the query is the audit, and job date information to identify the rows loaded today (load process auditing should be a standard practice in loading datawarehouse data)

This query, if it finds any data, stores the results in the business validation results table.  In our example, we have received a date into the warehouse that is a year in the future, although a valid date, clearly violates the business validation rule, and would be trapped, reported on, and corrected in the source systems responsible for the initial capture of this information.

This achieves a few things

  1. Before a report with an invalid KPI is produced, the issue is identified, reported on, and automatically communicated to the proper people in the organization.
  2. Before business decisions are made, the business rules are automatically applied to the data.
  3. Data Quality is improved, and the trust in all of the underlying data is improved.  If there is a trust that the data is being validated, the trust in the data will increase.
  4. Opportunities to build better edits in source systems is created.  Clearly if a date in the future is not valid, is should never have made it into the originating systems.
  5. Provides a repository of business validation issues.

Database Driven Aggregations

Introduction
Have you ever had a series of aggregate (tsql) queries to run?  Do you want to be sure they all run in the right order?  Do you want to make the changes “dynamic” (no code changes required)?  Certainly you can build an SSIS package that calls sql with the control-flow and control dependencies, however, each new query requires changes to that package.  This solution has a single proc-call (at the right point in the control flow), then hands control to the database.  This allow you to add new queries, change the order, disable, enable new ones, etc.

Setup
A few things to get setup before we start:
1) Table Creation for aggregation process — Optional table creation for logging
2) Proc Creation – This proc will require some modification for your specific environment.  There is logging of the process that takes place (in the optional table script), and some database links that you will need to adjust according to your specific needs.

Usage
Now for the easy part!  Insert into the aggregate driver a row for each query (primary) query you want to run, there may be sub-parts to the query, that is handled in the next table.  This table is the “driver” of the process, has the descriptive info, and allow for “active/inactive” setting.
Aggregate_Query_ID - Something that makes sense to you (I like logical operations together in groups of 1000)
Aggregate_group_ID – What you call the proc with (everything that is active for this group will run – in the order they are in the table)
Aggregate_freq_id – Future enhancement for Weekly/Monthly executions
Aggregate_Desc – What is this query, just a description
dim_audit_pkg_name – audit, what process is this query tied to (optional)
Source_Table_Name – what is the primary table for this query (descriptive, optional)
Source_Column_Name – what is the primary column for this query (descriptive, optional)
active_ind – Bit 0=off 1=on
aggregate_server – target server where the output of process will go
aggregate_database – target database where the output of process will go
aggregate_owner – target schema where the output of process will go
aggregate_table_name – target table where the output of process will go

Next, insert into the aggregate query table.  This is where the actual work goes.
aggregate_query_id – ties this group of queries to the aggregate driver table
aggregate_sequence – for queries that require more than 1 step, this is the order those queries will execute
query_desc – description (optional)
query – tsql valid statements to be executed

exec dw_dbo.aggreage_master_package 3  –this would execute the proc running all queries defined in aggregate_group_id 3 (this is the call in the SSIS package)  From here on out, any addition, change, inactivation to queries running in group 3 will be done in the database.

Example
You have 2 different processes, each process has multiple queries, and each query has multiple parts.

agg_qry_id agg_grp_id agg_freq_id aggregate_desc dim_audit_pkg_nm source_tbl_name   active_ind agg_server agg_database agg_owner agg_tbl_nm
1005 1 1 Example aggregate 1 example_pkg orders   0 maggie dm_reporting dw_dbo dw_agg_order
1006 1 1 Example aggregate 2 example_pkg orders   1 maggie dm_reporting dw_dbo dw_agg_order
1050 1 1 Example aggregate 3 example_pkg orders   1 maggie dm_reporting dw_dbo dw_agg_order
2005 2 1 Example aggregate 4 example_pkg customers   1 maggie dm_reporting dw_dbo dw_agg_customer
2006 2 1 Example aggregate 5 example_pkg customers   1 maggie dm_reporting dw_dbo dw_agg_customer
2007 2 1 Example aggregate 6 example_pkg customers   1 maggie dm_reporting dw_dbo dw_agg_customer

 

You see that when we call the proc with group_id “1″, there are 3 queries, the first of which will be skipped because active_ind is set to false.  When we call the proc for group 2 all group 2 queries will run.

Taking a closer look at the first query that will execute under group 1 (which is actually the second query 1006):

aggregate_query_id aggregate_sequence query_desc query
1006 1 Delete data from staging table on server Bart (stg_dw_agg_order) DELETE FROM bart.dw_stg_db.dw_dbo.stg_dw_agg_order
1006 2 Insert aggregated year 2010 data into staging table INSERT into osprey.dw_work.dw_dbo.stg_dw_agg_order (order_year, customer_id, state, annual_sales) select year, customer_id, state, sum(annual_sales) where year = 2010 from dw_dbo.orders group by year, customer_id, state
1006 3 Insert aggregated search year 2010 data into dw_agg_order BEGIN TRAN; DELETE FROM dm_reporting.dw_dbo.dw_agg_order WHERE year = 2010; insert into dm_reporting.dw_dbo.dw_agg_order with (tablock) (year, cusomter_id, state, annual_sales from bart.dw_staging.dw_dbo.stg_dw_agg_order; COMMIT;

 

There are three parts for this query
1) clears the data from the staging table on the staging server
2) run the query to collect the 2010 data into the staging server table
3) remove the existing aggregate rows for the 2010 customers, and then replace them with the new data (this is done in a transaction, with tablock on purpose, if someone is querying the table this query will wait to obtain the lock, if this query is running, there are no other queries allowed, they will block and wait).  You don’t want someone getting a report of 2010 with no data…
 Once query 1006 is done…(meaning all 3 parts) the next query 1050 (which can have multiple parts as well) is then started.  Separating things into logical groups will allow you to have multiple calls to the proc run from SSIS at the same time, achieving some parallelism.  Once a year is done, you can simply turn it off, and add a new row for the next year, this allow for a re-run of historical aggregations (should there be a data issue that requires updates/changes to the aggregation, or simply an additional field in the aggregations) by setting the active indicator back on.

Follow

Get every new post delivered to your Inbox.

Join 373 other followers