What does that checkbox do? (Redshift Encryption)

**EDIT**     I don’t normally go back and edit things in a prior post… but… in this case the additional link to Werner Vogels blog  that came out after I wrote this is worthwhile to the topic at hand.           **EDIT**

As I alluded to in the prior post, I have been busy…   I have actually been busy writing, and now that all of the first-draft copies have been delivered to the publisher and I am in the mode of writing things.   I figured I would grab a small topic for a short post.  We are not far enough along in the publishing process for me to make formal book announcements yet, but it will become clear relatively soon.

OK, so you are creating a Redshift cluster and you want to enable encryption… not a bad Idea, but what does that mean in Redshift you ask? Not a bad question, as the documentation is relatively sparse on what is going on in this part of the engine.  Normally, when you enable encryption on something, the first thing you need to do is provide some sort of strong key… so here you have a checkbox?… well yes, that is in-fact what you have.  It is really not a bad thing, as the key would have to be stored someplace at Amazon anyway (more on that in a second).  Basically, your Root-key (private-key) is generated for you, so it is likely something better than you are going to type in anyway.  That generated key is then stored on your Amazon Control Plane network (as are your other AMI credentials, keys, SSH Keys etc.).  That control plane is on the other side of a firewall from your physical instance of whatever you are running (in this case Redshift).  That authentication through the firewall is done at startup of your instance, which is authenticated before that key is provided as a valid key.  That key is then kept in memory (never written to disk) on the cluster that is running Redshift, which will allow for the decryption of your data.  The data is encrypted (using hardware accelerated AES 256 encryption)  before it is put back on disk, so any data at rest will be encrypted (much like Microsoft transparent data encryption TDE).  Thus, if one of those drives, backups or other data were to somehow become compromised, there is nothing anyone could do with it, as they do not have the root key, which can only be obtained by your valid Redshift cluster through the firewall to the control plane network.

So, the reason (either way) you would have to store the key at Amazon is quite simply, otherwise you would have to be physically involved in the starting of the cluster.  After maintenance, after a reboot, after a crash… no matter the reason… the cluster cannot start without first obtaining the key.  If that key is not stored in a way that allows for automatic authentication, and access to that key, quite simply, the cluster could not start.  If that were the case, it would require you to access a console, command line, or other interface to provide the key externally for each start-up of the cluster.

T-SQL Tuesday #006: Configure Filestream Storage for BLOBS

Introduction

This is my post for the monthly TSQL Tuesday Blogging event.  With the release of SQL Server 2008, among the many new features was one that at first I took little notice of, FILESTREAM storage options. It did not take long to realize that this new feature has some powerful storage capabilities. You can store, (very simply at that) all of that unstructured data (Images, video, word docs, PDF’s… you name it). This post will give you an overview of how to enable the features necessary to be able to take advantage of this storage option under SQL Server 2008. (Note, to enable this option you need admin rights server or sys admin rights) Also, since this is using NTFS the filegroup that you configure to use filestream storage cannot be a compressed (at the windows OS level) volume. (Right Click the C:\data folder -> Properties -> Advanced Attributes -> Uncheck “Compress contents to save disk space”). PLEASE don’t put this directory on your OS drive… this should go onto a data drive. I configured mine to use a directory on the same drive that I have system databases, and the SSAS data files. Space needs will obviously be determined by what you store, but it should be treated like other data files, and NOT be located on your OS drive.

For the sake of “completeness” you can also enable by using management studio (or at install for that matter), I am describing the 2 options I use most often… tsql, and configuration manager.

TSQL

For those of you who don’t like to script things, or don’t like the ease of clicking check-box’s. (actually any of these really are easy)…. File_stream_access_level is an advanced option, so first you need to show advanced, then you can turn it on.

USE master
Go

EXEC sp_configure ‘show advanced options’
GO

EXEC sp_configure filestream_access_level, 1
GO

RECONFIGURE WITH OVERRIDE
GO

 

Filestream access levels:

  • Disabled = 0 – FILESTREAM support for the instance is Disabled
  • Enabled = 1 – FILESTREAM for Transact-SQL Access is Enabled
  • Enabled = 2 — FILESTREAM for Transact-SQL and Windows (32) streaming access is Enabled

Configuration Manager

I like this option the best… seems to me “configuration manager” is the logical place to manage the configuration of the database service, so (for me at least) this is the first place I go look. And the point-click thing is also not a negative in my mind… save me a few trips to MSDN to find the options.

Pick the instance (in this case default), then right-click select and select properties

Enable the check-boxes:

  • The first check-box turns it on…
  • The second is the name of the windows share name if you enable file-io streaming access (I left it at the default…. Please name this something more intelligent, you don’t want “mssqlserver” on each of your servers, you will get things confused if you have more than 1 server).
  • The third is if you are going to allow remote clients to access the file-stream data.
  • Service needs a restart

You have it turned on, take it for a drive ….

Almost there!

Now that the instance is configured to allow for the storage, you need to create a file-group to store the file-stream data in the database you wish to use.

ALTER DATABASE demo
ADD FILEGROUP filestream_data CONTAINS FILESTREAM
GO

 

Now that you have a filegroup, add a file. The directory cannot exist, it will be built/managed by the instance. In My case I had built J:\SQLData\demo\ before I executed this command.

ALTER DATABASE demo
ADD FILE
(
NAME= ‘demo_filestream_data_1′,
FILENAME = ‘J:\SQLData\demo\filestream’
)
TO FILEGROUP filestream_data
GO

 

Now you need a table to hold the pointers to the filestream data being stored on the server (note the filestream syntax)
I am using a guid for the ID, you can use a identity if you like.

CREATE TABLE dbo.BLOB_EXAMPLE (
I
D UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE,
BLOB VARBINARY(MAX) FILESTREAM NULL
)

 

Now you insert into the table, I am using some text… you can put whatever you want into the varbinary…. Files images etc

DECLARE @ID UNIQUEIDENTIFIER

SET @ID = NEWID()

 INSERT INTO dbo.BLOB_EXAMPLE
(ID, BLOB)
VALUES
(@ID, CAST(‘tsql Tuesday BLOB example’ AS VARBINARY(MAX)))

 SELECT ID, BLOB
FROM dbo.BLOB_EXAMPLE
WHERE ID = @ID

The ID: BF9A3D53-CC91-410E-9606-E2A43E36312A

 SELECT BLOB.PathName()
F
ROM dbo.BLOB_EXAMPLE
WHERE ID = @ID

 

\\SERVER_NAME\MSSQLSERVER\v1\Demo\dbo\BLOB_EXAMPLE\BLOB\BF9A3D53-CC91-410E-9606-E2A43E36312A

There are several parts to the path you get back

  • “SERVER_NAME” (where is this path…. UNC PATH)
  • “MSSQLSERVER” (the share-name you created near the top of this post)
  • “\v1\Demo\BLOB_EXAMPLE\BLOB” (database, table, field info automatically created for you)
  • “BF9A3D53-CC91-410E-9606-E2A43E36312A” the ID that you can use to reference the BLOB to retrieve the data.

Conclusion

I hope this post is a starting point for you to setup some great SQL Server 2008 functionality. It really is easy, and will allow storage of what is a growing, and ever present world of unstructured data in your databases.

Data Warehouse – AutoShrink how-to

Introduction

Space management is always a challenge in any system, however, in systems that are designed to grow-grow and grow, it seems to be a particularly big challenge as to how best manage space.  This post will give you an overview how to set the auto-shrink correctly to allow for even distribution of data within the existing data files.

How To

One of the best part of this set-it and forget-it feature, is how simple it is to set.  All you need to do is set the option to “true”… and you are good to go.  Assuming you have setup all your data files to have some reasonable minimum sizes, you will automatically manage space.

This feature was at one point on the list of features to be depreciated by Microsoft, however, has recently been added back to the list of supported items and will be enhanced in significant ways for use with compression, great things to come!

Conclusion

If you have gotten this far, and have not realized the date of this post, you really are gullible.  Enjoy April fools day… !

Help — My disk performance IV

Introduction
This is the fourth post in a series on disk performance
Help — My disk performance       (RAID how/why what to put where)
Help — My disk performance II  (Physical connections to the SAN)
Help– My disk performance III  (Partitioning and compression)

Now that we have partitioning covered, filegroups should actually be relatively simple because it just becomes a physical discussion about the logical splits we just arrived at.  Just to restate the obvious, PLEASE come up with a naming convention for your filegroups, file-names, and directory structures. 10 minutes (literally) of thought about this up front will make monitoring, scripting, moving, etc MUCH easier later.

Data Files
The Windows OS operates on file-handles, and depending on the number of processors and read/write to more than 1 file, you will get many opinions on this one, and there are some “it depends” sort of answers…. what I have found through testing with 8 processors and the disk configuration discussed to this point in this blog, 1 file per processor for data files +1 file. (Log files in a moment)  More than 8 files in the filegroup seemed to add little measurable value, the 9th file is my insurance policy for unexpected growth.  Understanding how Sql Server writes to the files is also important, the engine will do it’s best to keep the amount of data equal in all the files in a filegroup.  So having a file-group with 7 files at 75% full, if you add a 8th file all the new data will go to the 8th file till it is 75% full, effectively removing the benefit of multiple files and creating a hotspot on that file.  (I will have more specifics, DMV’s, etc, for monitoring in a later post).Some basic rules:

  1. Pre-allocate space to all 8 datafiles
  2. Do not allow autogrowth on the 8 main files in the filegroup.
  3. Set the initial size small on the 9th file, and allow autogrowth to some number your’re comfortable with (I allow the ‘extra’ file to have growth to the same size as the primary files) I don’t like the autogrowth, but I like things failing with space issues less.
  4. Monitor the extra file for growth…if it is growing, you messed up your space projections, and you need to add space to the other 8 files.
  5. It is best to get the calculations correct, and preallocate the 8 files, this will prevent physical disk fragmentation as the files are added to.
  6. Monitor disk fragmentation on the server.

 Primary – personally, nothing against primary, but I put nothing into primary, and monitor what is in that space.  IF something went to primary, someone (probably me) built something without thinking all the way through.  I would allow for 8 data files preallocated to a reasonable size relative to your other dataspaces.  IF something goes into primary, you don’t want things failing or performing poorly, you can move things around in the background if something that goes there needs to be kept.

Expanding upon a diagram we looked at earlier, you can see the number of filegroups, and their relationship to the physical disks

Beyond the annual partitions, you see a filegroup for “Dim Data” (dimension data), “Dim Idx” (dimension indexs) “Data 1/2/3″ (generic data spaces)  You are going to have some data that does not belong to dimensions, and is not partitioned.  Hopefully all of the large stuff is partitioned, but, you need to have a non-partitioned home for some large tables as well.  If you KNOW two of these large tables are always used together, put them in data spaces that are on different drives. “Index” (non-partitioned indexs).  Depending on the quantity of non-partitioned tables, you may want to consider an 2″ space to allow for separation of the index IO from the table IO. 

Log files
Basically the same principles apply to log files (remember, they are going to their own drive).  Through my testing, due to the high level of activity,  logging can benefit from additional files (3 per cpu).  To avoid the over head associated with growing log files set them to their maximum size and do not allow for autogrowth.  Because there is no autogrowth on these files, you need to ensure you have sufficient log space available to manage data loading and queries, as well as monitoring in place to detect logs filling.  In addition to these file management recomendations, logging for a Data Warehouse should be set to simple mode. (Backup and recovery design for a Data Warehouse is another topic).

Naming Conventions
Each data drive should have a sqldata directory followed by a directory for each database (please don’t be one of those people that drop a few hundred files from all the databases on a server in one directory).  Beyond that, each filegroup should get a name.

J:\                                                                      <—- Data Drive
      SQLDATA                                                  <—- SQL Data directory (this will matter when you have ssas on the same drive as data) or sqllog
            Example_DB                                      <—- Each database has a directory
                  Example_db_data_2005_1  <—- Name each file – dbname + file type (data/log) + year + number
                  Example_db_data_2005_2
                  Example_db_data_2005_3
                                ……
                  Example_db_data_2005_7
                  Example_db_data_2005_8
                  Example_db_data_2006_1
                             …….

 As you can well imagine, there are going to be quite a number of files, so maintaining order within the naming will give you the ability to script, and otherwise programatically monitor things.  This will also provide a very clear delineation of which data resides in which file.  When it comes time to move the oldest data to a different drive, it is as simple as issuing the command in the database to provide the new location of the file, shut-down the database, physically copy the file to the new location, restart the database.  This will become important once your primary “fast” drives start to fill, and the old (least used) data can be migrated to larger (slower), less expensive LUNS.

Follow

Get every new post delivered to your Inbox.

Join 373 other followers