Want To Upgrade?

Introduction

I have seen several posts lately asking about ROI/Justifications etc about upgrades. There are LOTS of posts on this topic, and some really great documentation from Microsoft that covers this topic at great length. This post does not really add anything to those, other than distilling some of the key points from the Microsoft documentation that I used when discussing the upgrade, benefits, and expectations with my business leaders.

Summary

With the 2008 release, SQL Server makes a major advance in scalability for data warehousing. It meets the data warehouse needs of the largest enterprises more easily than ever. SQL Server 2008 provides a range of integrated products that enable you to build your data warehouse, and query and analyze its data. These include the SQL Server relational database system, Analysis Services, Integration Services, and Reporting Services. This paper introduces the new performance and manageability features for data warehousing across all these components. All these features contribute to improved scalability.

Upgrade Introduction

Microsoft® SQL Server™ 2008 provides a comprehensive data warehouse platform. It enables you to build and manage your data warehouse, and deliver insight to your users, with a single, integrated product suite. It scales to meet the needs of the largest enterprises, in a way that empowers both your end users and your IT staff.

The number one focus of development in the SQL Server 2008 release was to improve scalability across the entire product suite to comfortably meet the needs of large enterprises. Here, we’ll introduce the features and enhancements we’ve added to improve your data warehouse experience. Build. Manage. Deliver. SQL Server 2008 lets you do it all, with ease.

Map of New Data Warehousing Features

The following table shows the new scalability features in SQL Server 2008, and where they help with the activities that surround your data warehouse (DW).

 
 

   Build  Manage  Deliver Insight
SQL Server Relational DBMS  MERGE statement

Change data capture (CDC)

Minimally logged INSERT 

Backup compression  Star join performance

Faster parallel query on partitioned tables

GROUPING SETS 

Resource governor 
Data compression

Partition-aligned indexed views

Integration Services   Lookup performance

Pipeline performance 

     
Analysis Services     Backup  MDX Query Performance: Block Computation

Query and Writeback Performance 

Scalable Shared Database 
Reporting Services      Reporting scalability

Server scalability

Data Compression

The new data compression feature in SQL Server 2008 reduces the size of tables, indexes or a subset of their partitions by storing fixed-length data types in variable length storage format and by reducing the redundant data. The space savings achieved depends on the schema and the data distribution. Based on our testing with various data warehouse databases, we have seen a reduction in the size of real user databases up to 87% (a 7 to 1 compression ratio) but more commonly you should expect a reduction in the range of 50-70% (a compression ratio between roughly 2 to 1 and 3 to 1).

Backup Compression

Backup compression helps you to save in multiple ways.

By reducing the size of your SQL backups, you save significantly on disk media for your SQL backups. While all compression results depend on the nature of the data being compressed, results of 50% are not uncommon, and greater compression is possible. This enables you to use less storage for keeping your backups online, or to keep more cycles of backups online using the same storage.

Backup compression also saves you time. Traditional SQL backups are almost entirely limited by I/O performance. By reducing the I/O load of the backup process, we actually speed up both backups and restores.

Resource Governor

The new Resource Governor in SQL Server 2008 enables you to control the amount of CPU and memory resources allocated to different parts of your relational database workload. It can be used to prevent runaway queries (that deny resources to others) and to reserve resources for important parts of your workload. SQL Server 2005 resource allocation policies treat all workloads equally, and allocate shared resources (for example, CPU bandwidth, and memory) as they are requested. This sometimes causes a disproportionate distribution of resources, which in turn results in uneven performance or unexpected slowdowns.

Star Join

With dimensionally modeled data warehouses, a big part of your workload typically consists of what are known as star join queries. These queries follow a common pattern that joins the fact table with one or several dimension tables. In addition, star join queries usually express filter conditions against the non-key columns of the dimension tables and perform an aggregation (typically SUM) on a column of the fact table (called a measure column). With SQL Server 2008, you will experience significant performance improvements for many star join queries that process a significant fraction of fact table rows.

Partitioned Table Parallelism

Wouldn’t you like to get the most power you can out of the hardware you own? The partitioned table parallelism (PTP) feature in SQL Server 2008 helps you do that. Data warehouse applications typically collect large amounts of historical data in fact tables, which are often partitioned by date. In SQL Server 2005, queries that touch more than one partition use one thread (and thus one processor core) per partition. This sometimes limits the performance of queries that involve partitioned tables, especially when running on parallel shared memory multiprocessor (SMP) computers with many processor cores. Partitioned table parallelism improves the performance of parallel query plans against partitioned tables by better utilizing the processing power of the existing hardware, regardless of how many partitions a query touches. The feature works by default without the need for manual tuning or configuration.

Partition-Aligned Indexed Views

Partition-aligned indexed views enable you to create and manage summary aggregates in your relational data warehouse more efficiently, and use them in scenarios where you couldn’t effectively use them before, improving query performance. In a typical scenario, you have a fact table that is partitioned by date. Indexed views (summary aggregates) are defined on this table to help speed up queries. When you switch in a new table partition, the matching partitions of the partition-aligned indexed views defined on the partitioned table switch too, and do so automatically.

This is a significant improvement over SQL Server 2005

GROUPING SETS

GROUPING SETS allow you to write one query that produces multiple groupings and returns a single result set.

MERGE

The MERGE statement allows you to perform multiple Database Manipulation Language (DML) operations (INSERT, UPDATE, and DELETE) on a table or view in a single Transact-SQL statement.

SSIS – Load Audit

Introduction

There are many reasons to put some thought into an audit process for your load jobs.

  1. Tie the insert/update of a row to a specific SSIS package.
  2. Provide consistent date-stamp times for a given load (I will explain this in a moment).
  3. Ensure all the data staged actually made it to the target.
  4. Monitor load times, and easily find packages (or tables) that need tuning.
  5. Provide input to load jobs for selecting data from the source (you know the last time you ran successfully, select data that changed since then in the source… that is a topic for another post)
  6. Provide SLA reporting information on load-times.
  7. Probably a few other things that you will discover as you implement this.

If well thought out up-front, the additional major benefit of a good audit process is a cookie cutter approach to building load jobs.  They ALL do exactly the same thing, in exactly the same way using the same variables, and setup.  (The details of this is for another post, but I will cover the generalities here)

The Setup 

CREATE TABLE [dw_dbo].[ld_control_audit]
   (
       [audit_key]                 [INT] IDENTITY(1, 1) NOT NULL,
       [parent_audit_key]          [INT] NOT NULL,
       [tbl_name]                  [VARCHAR](50) NULL,
       [pkg_name]                  [VARCHAR](50) NULL,
       [exec_start_dt]             [DATETIME] NULL,
       [exec_stop_dt]              [DATETIME] NULL,
       [extract_row_cnt]           [BIGINT] NULL,
       [staged_row_cnt]            [BIGINT] NULL,
       [insert_row_cnt]            [BIGINT] NULL,
       [update_row_cnt]            [BIGINT] NULL,
       [error_row_cnt]             [BIGINT] NULL,
       [tbl_initial_row_cnt]       [BIGINT] NULL,
       [tbl_final_row_cnt]         [BIGINT] NULL,
       [tbl_max_surrogate_key]     [BIGINT] NULL,
       [successful_processing_ind] [CHAR](1) NULL
   )

Once you have the table, the view (in the default dbo schema) also calculates the execution time and executes with a NOLOCK, you want to be sure there is no locking by your monitoring processes on this table.

CREATE VIEW [dbo].[ld_control_audit] AS
   SELECT audit_key,
          parent_audit_key,
          tbl_name,
          pkg_name,
          exec_start_dt,
          exec_stop_dt,
          DATEDIFF(MINUTE, exec_start_dt, exec_stop_dt) AS runtime_minutes,
          staged_row_cnt,
          extract_row_cnt,
          insert_row_cnt,
          update_row_cnt,
          error_row_cnt,
          tbl_initial_row_cnt,
          tbl_final_row_cnt,
          tbl_max_surrogate_key,
          successful_processing_ind
   FROM   dw_dbo.dimaudit AS dimaudit_1 WITH (nolock)

Usage

Now that you have the basic table and view, the first thing your load will do is insert a row for the “Master Package” , which will create a new with a -9999 parent_audit_key to tie all of your “sub” executions to, and creates the exec_start_dt of this particular execution.

You now have an option…. you can select back the identity column you created, and pass it into each of your packages that you execute, or let each package lookup to get the audit-key that they will use as the parent_audit_key.  I DO NOT pass it in, but rather lookup at the top of each sub-package.  Reason 1, I execute all the sub-packages as out-of-process, to obtain a separate memory space for each of the executions, as I am running MANY processes in parallel.  Reason 2, restartability, if each package functions as an independent unit of work (including looking up the parent_audit_key) if a package fails for any reason, it can be re-run by itself without having to invoke the master-package, or manually adjusting variables etc.  If you are using  checkpoint files for restart this still plays well with that as the existing variable states (including audit_key assigned to the package, as well as parent_audit_key, record counts etc) are preserved for restart.

 

 If you have been reading any of my other posts, by now you have figured out I am “stuck” on naming conventions and consistency, but if you open ANY load package, they all look the same, so lets walk through the parts.

  1. Selecting the parent audit key from the audit table (max key for “master” for successful_processing_ind is null)
  2. Insert a new row into the audit table (VARIABLES for everything in the insert)…This is cut-paste for any package.
    There is a second sql statement that then selects the max audit-key for the @tablename that we just inserted a row for (this is what we will stamp on ALL rows affected by this load)
  3. This will select the package start-time from the audit table row.  This will be used for the rec-mod/rec-create dates in ALL rows affected by this load.  We are effectively going to tie all of the data in the warehouse back to a specific audit row, and a specific execution time.  If data needs to be backed out due to source issues or otherwise, all rows affected by the run, all have the same time.
  4. Using a scripting task, (same result as using an expression build in bi-express product if you have that).  This task sets SQL statement variables that will be executed before (and after) the load.
  5. Executes the count sql statement just built by the scripting task by using the table-name variables in the package.
  6. Executes the select max surrogate key sql stmt built by the scripting task using the table-name (or if this is a fact table, we select the max audit-key)… this is to capture with absolute certainty the starting point of the load
  7. This is the ONLY unique thing in this package.  EVERYTHING else is driven by the variables and dynamic SQL.
  8. Executes the count sql statement just built by the scripting task by using the table-name variables in the package.
  9. Update stats (this is for dimension loads only) the fact table stats/index’s are maintained by a nightly script, that is too large to run at load-time
  10. Update the audit row, set the end-time, and success-ind

Conclusion

There are many reasons for consistency… audit and maintainability is a major benefit.  The audit will produce a row in an audit table that you can tie ANY row in the warehouse (dimension or fact) back to the specific execution of a package.  The audit table will give you execution times for packages (which is very useful to monitor ongoing load-times and bottlenecks).   It may seem like a lot of work up-front (it really is not), and since all of the infrastructure is driven by dynamic sql, and variables, there is nothing (other than changing the values in variables) that needs to be created to maintain a consistent audit process.

Database Driven Aggregations

Introduction
Have you ever had a series of aggregate (tsql) queries to run?  Do you want to be sure they all run in the right order?  Do you want to make the changes “dynamic” (no code changes required)?  Certainly you can build an SSIS package that calls sql with the control-flow and control dependencies, however, each new query requires changes to that package.  This solution has a single proc-call (at the right point in the control flow), then hands control to the database.  This allow you to add new queries, change the order, disable, enable new ones, etc.

Setup
A few things to get setup before we start:
1) Table Creation for aggregation process — Optional table creation for logging
2) Proc Creation – This proc will require some modification for your specific environment.  There is logging of the process that takes place (in the optional table script), and some database links that you will need to adjust according to your specific needs.

Usage
Now for the easy part!  Insert into the aggregate driver a row for each query (primary) query you want to run, there may be sub-parts to the query, that is handled in the next table.  This table is the “driver” of the process, has the descriptive info, and allow for “active/inactive” setting.
Aggregate_Query_ID - Something that makes sense to you (I like logical operations together in groups of 1000)
Aggregate_group_ID – What you call the proc with (everything that is active for this group will run – in the order they are in the table)
Aggregate_freq_id – Future enhancement for Weekly/Monthly executions
Aggregate_Desc – What is this query, just a description
dim_audit_pkg_name – audit, what process is this query tied to (optional)
Source_Table_Name – what is the primary table for this query (descriptive, optional)
Source_Column_Name – what is the primary column for this query (descriptive, optional)
active_ind – Bit 0=off 1=on
aggregate_server – target server where the output of process will go
aggregate_database – target database where the output of process will go
aggregate_owner – target schema where the output of process will go
aggregate_table_name – target table where the output of process will go

Next, insert into the aggregate query table.  This is where the actual work goes.
aggregate_query_id – ties this group of queries to the aggregate driver table
aggregate_sequence – for queries that require more than 1 step, this is the order those queries will execute
query_desc – description (optional)
query – tsql valid statements to be executed

exec dw_dbo.aggreage_master_package 3  –this would execute the proc running all queries defined in aggregate_group_id 3 (this is the call in the SSIS package)  From here on out, any addition, change, inactivation to queries running in group 3 will be done in the database.

Example
You have 2 different processes, each process has multiple queries, and each query has multiple parts.

agg_qry_id agg_grp_id agg_freq_id aggregate_desc dim_audit_pkg_nm source_tbl_name   active_ind agg_server agg_database agg_owner agg_tbl_nm
1005 1 1 Example aggregate 1 example_pkg orders   0 maggie dm_reporting dw_dbo dw_agg_order
1006 1 1 Example aggregate 2 example_pkg orders   1 maggie dm_reporting dw_dbo dw_agg_order
1050 1 1 Example aggregate 3 example_pkg orders   1 maggie dm_reporting dw_dbo dw_agg_order
2005 2 1 Example aggregate 4 example_pkg customers   1 maggie dm_reporting dw_dbo dw_agg_customer
2006 2 1 Example aggregate 5 example_pkg customers   1 maggie dm_reporting dw_dbo dw_agg_customer
2007 2 1 Example aggregate 6 example_pkg customers   1 maggie dm_reporting dw_dbo dw_agg_customer

 

You see that when we call the proc with group_id “1″, there are 3 queries, the first of which will be skipped because active_ind is set to false.  When we call the proc for group 2 all group 2 queries will run.

Taking a closer look at the first query that will execute under group 1 (which is actually the second query 1006):

aggregate_query_id aggregate_sequence query_desc query
1006 1 Delete data from staging table on server Bart (stg_dw_agg_order) DELETE FROM bart.dw_stg_db.dw_dbo.stg_dw_agg_order
1006 2 Insert aggregated year 2010 data into staging table INSERT into osprey.dw_work.dw_dbo.stg_dw_agg_order (order_year, customer_id, state, annual_sales) select year, customer_id, state, sum(annual_sales) where year = 2010 from dw_dbo.orders group by year, customer_id, state
1006 3 Insert aggregated search year 2010 data into dw_agg_order BEGIN TRAN; DELETE FROM dm_reporting.dw_dbo.dw_agg_order WHERE year = 2010; insert into dm_reporting.dw_dbo.dw_agg_order with (tablock) (year, cusomter_id, state, annual_sales from bart.dw_staging.dw_dbo.stg_dw_agg_order; COMMIT;

 

There are three parts for this query
1) clears the data from the staging table on the staging server
2) run the query to collect the 2010 data into the staging server table
3) remove the existing aggregate rows for the 2010 customers, and then replace them with the new data (this is done in a transaction, with tablock on purpose, if someone is querying the table this query will wait to obtain the lock, if this query is running, there are no other queries allowed, they will block and wait).  You don’t want someone getting a report of 2010 with no data…
 Once query 1006 is done…(meaning all 3 parts) the next query 1050 (which can have multiple parts as well) is then started.  Separating things into logical groups will allow you to have multiple calls to the proc run from SSIS at the same time, achieving some parallelism.  Once a year is done, you can simply turn it off, and add a new row for the next year, this allow for a re-run of historical aggregations (should there be a data issue that requires updates/changes to the aggregation, or simply an additional field in the aggregations) by setting the active indicator back on.

Follow

Get every new post delivered to your Inbox.