Archive for the ‘SQL Server’ Category

Inconsistent Date Literal Parameterization Behaviour

November 12, 2009 Leave a comment

I have mentioned query parameterization before and the process by which SQL extracts literal values from a query and re-writes the query in effect to use parameters to get a query plan cache hit, which negates the need to recompile the plan costing both time and CPU. There are a lot of good articles and book chapters that cover the topic.

What has confused me for a while is witnessing date literals within a query being parameterized on one query and not on another, even though both databases have parameterization set to ‘Simple’ mode. When a date literal is not parameterized the chances of getting a query plan cache hit is obviously very low which has performance impacts. The problem to date has been that I had been unable to ascertain the commonality and requirements that allowed the query to get parameterized and when it just kept the literal. I was paying too much attention to the query and as it turns out not enough to the table.

Well, after a number of hours getting to know the brick wall very well I finally tracked it down, and oddly it had nothing to do with the query I was submitting, but I could reproduce reliably by using an unrelated non-clustered index, which is confusing to say the least and I can not yet think of any reason why, is it a bug or just ‘Weird and Odd’.

The following steps reproduce the issue in both SQL Server 2005 and 2008.

Create a database, default settings, nothing specific, then the second step is to create our test table, a simple structure is suffice.

CREATE TABLE [dbo].[paramtest](  [id] [int] IDENTITY(1,1) NOT NULL, [somedate] [datetime] NOT NULL, [somefield] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
 [id] ASC

We need some data to work against just to make sure we are selecting results, and we can insert these trivially as follows: 

insert into [paramtest] values (getdate(),'a')
go 10000

So we now have 10k rows within the table, and a clustered primary index on the identity column.

The test starts with freeing up the procedure cache, and then running the select statement, the datetime I used was roughly in the middle of the range of values I had inserted but is not a deciding factor in the query plan results.

dbcc freeproccache select * from paramtest where somedate > '2009-11-12 21:14:50.000'

Using a standard query plan cache extraction query the specific line of the xml plan we are interested in is the SQL Statement.

<StmtSimple StatementText="(@1 varchar(8000))SELECT * FROM [paramtest] WHERE [somedate]>@1" StatementId="1" StatementCompId="1" StatementType="SELECT" StatementSubTreeCost="0.0379857" StatementEstRows="3831.48" StatementOptmLevel="TRIVIAL">

From it you can see the literal varchar value was extracted as @1 with a type of varchar(8000) and the query altered to use this parameter – this is exactly the behaviour we would expect from parameter sniffing.

Next step is to create a non-clustered index on the varchar ‘somefield’ – completely unrelated to the date literal being used, and should have no impact on the query at all.

CREATE NONCLUSTERED INDEX [ix_test] ON [dbo].[paramtest] ([somefield] ASC

Free the procedure cache up again and rerun the query

dbcc freeproccache select * from paramtest where somedate > '2009-11-12 21:14:50.000'

Extract the query plan again from the cache, but this time it is noticable different, the parameterisation has not occurred. The literal has

<StmtSimple StatementText="select * from paramtest  where somedate > '2009-11-12 21:14:50.000'" StatementId="1" StatementCompId="1" StatementType="SELECT" />

 To revert to the old plan, drop the index and clear the cache again, then run the query once more.

DROP INDEX [ix_test] ON [dbo].[paramtest] WITH (ONLINE = OFF)

Then clear the cache again and run the query

select * from paramtest where somedate > '2009-11-12 21:14:50.000'

And we are back to being parameterized.

So the application of a single non-clustered index on a separate field to the one being queried is preventing the simple parameterization mode from parameter sniffing the date literal – this makes absolutely no sense, and you can play around with it a lot more knowing what it causing the effect on the query plan. Even placing the additional non-clustered index on the identity field, which already has a clustered index results in the parameterization failing. If this behaviour is be design, then it makes for an interesting design or limitation on the parameterization.

As soon as the database is in ‘Forced’ parameterization mode, the literal was converted each time, so this looks specific to simple mode, but is not explainable, just demonstratable.

Can a Covering NC Index be Tipped?

November 7, 2009 Leave a comment

Non-clustered indexes normally have a ‘tipping point’, which is the point at which the query engine decides to change strategies from seeking the index with a nested loop operator back to a seek on the underlying table or choosing to just scan the underlying table and ignore the index. Kimberley Tripp wrote a great article about ‘The Tipping Point‘ , and the guidance is at about the 25-33% the query engine will change strategies.

If the non-clustered index is a covering index (it contains all the fields within the query) the query engine does not take the same decision – it makes sense that if any change in strategy occurs, it would have to be at a far higher figure, and as we are about to see, it will not take that decision and tip.

To test what strategy the engine would use I created a test situation of 2 separate tables, with different page counts, due to the padding column forcing the second table to use far more pages (5953 pages vs 9233)

CREATE TABLE [dbo].[tblIxTest1]( [PersonnelID] [int] IDENTITY(1,1) NOT NULL, [FirstName] [char](30) NULL, [LastName] [char](30) NULL,
   [Department] [char](30) NULL, [SomePadding] [char](10) NULL


CREATE TABLE [dbo].[tblIxTest2]( [PersonnelID] [int] IDENTITY(1,1) NOT NULL, [FirstName] [char](30) NULL, [LastName] [char](30) NULL,
   [Department] [char](30) NULL, [SomePadding] [char](1000) NULL

Next step was to insert some data, I needed random data to be able to ensure the index was not unbalanced in some way, so I broke out my useful little random string generation function. I should mention how to create this, a SQL function will not directly support the inclusion of a Rand() call within them, any attempt to do this results in the error:

Msg 443, Level 16, State 1, Procedure test, Line 13
Invalid use of a side-effecting operator 'rand' within a function.

However, there is nothing stopping a view from using this, and the function from using the view to get around the limitation: 

Create View [dbo].[RandomHelper] as Select Rand() as r

And then the function can be generated to use this, it is not necessarily the most efficient random string generation function, but it works nicely.

CREATE FUNCTION [dbo].[RandomString] (@Length int) RETURNS varchar(100)
  DECLARE @Result Varchar(100)
  SET @Result = ''
  DECLARE @Counter int
  SET @Counter = 0
  WHILE @Counter <= @Length
     SET @Result = @Result + Char(Ceiling((select R from randomhelper) * 26) + 64)       
     SET @Counter = @Counter + 1   END

This now allows me to generate random data and insert it into the tables to get a nice data distribution, and this was run for both of the tables.

insert into tblIxTest1 values (dbo.RandomString(20),dbo.RandomString(20),dbo.RandomString(20),'')
go 1000000

Two NC indexes are now needed, one for each table and both are identical and cover just the FirstName and PersonnelID fields within the table.

CREATE NONCLUSTERED INDEX [IX_Test1] ON [dbo].[tblIxTest1] ( [FirstName] ASC, [PersonnelID] ASC
CREATE NONCLUSTERED INDEX [IX_Test2] ON [dbo].[tblIxTest2] ( [FirstName] ASC, [PersonnelID] ASC

The setup is complete and it is pretty easy to now show the NC covering index is not going to tip, the most extreme where clause is where I am allowing every record to be returned:

select personnelid , firstname from tblixtest1 where firstname >= 'a' and firstname <= 'zzzzzzzzzzzzzzzzzzzzz'

This still produces a query plan with a seek strategy, regardless of which of my two tables it was executed on:

select personnelid , firstname from tblixtest1  where firstname >= 'a' and firstname <= 'zzzzzzzzzzzzzzzzzzzzz'  
    |--Index Seek(OBJECT:([FilteredIndexTest].[dbo].[tblIxTest1].[IX_Test1]), SEEK:([FilteredIndexTest].[dbo].[tblIxTest1].[FirstName] >= [@1] AND [FilteredIndexTest].[dbo].[tblIxTest1].[FirstName] <= [@2]) ORDERED FORWARD)

If we just select the entire table, unsurprisingly at that point it chooses to perform an index scan.

select personnelid , firstname from tblixtest1

Results in the following plan:

select personnelid , firstname from tblixtest1   |--Index Scan(OBJECT:([FilteredIndexTest].[dbo].[tblIxTest1].[IX_Test1])) 

The row counts on both queries were identical at 1 million. Slightly more interesting is that if I use a Like clause instead of a direct string evaluation, the behaviour alters slightly when selecting all the values:

select personnelid , firstname from tblixtest1 where firstname like '[a-z]%'

Gives the query plan:

select personnelid , firstname from tblixtest1  where firstname like '[a-z]%'  
   |--Index Scan(OBJECT:([FilteredIndexTest].[dbo].[tblIxTest1].[IX_Test1]),  WHERE:([FilteredIndexTest].[dbo].[tblIxTest1].[FirstName] like '[a-z]%'))

So the query engine is potentially making an optimisation that it knows the like clause covers 100% and adopts an automatic scan, but it is not really very clear why it has this optimisation path. If the like clause changes to [a-y] then it reverts back to a seek, so it looks specific to covering all the values within the like statement. If a between statement is used, it remains a seek regardless.

So the result is that a Non-clustered covering index is very unlikely to tip, you either have to not give it a where clause, or use a like statement across all the values available, it will steadfastly refuse to seek and choose the scan.


Well the I/O cost of the operation remains the same, it has to read every page in the table and it considered the cost of traversing the B-Tree negligible, so the difference between seek and scan is not very great. Running the seek based query and scan based query in the same batch the relative percentages are 48% vs 52% – that is the scan scoring 52% even though they read the same number of rows.

Outputting the IO statistics when they are run side by side shows the same number of pages being read, but the seek is clearly being favoured and is slightly faster as far as SQL is concerned – it is quite weird to consider a seek of an entire index is more efficient than a scan of the index.

(1000000 row(s) affected)
Table 'tblIxTest1'. Scan count 1, logical reads 5995, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(1000000 row(s) affected)
Table 'tblIxTest1'. Scan count 1, logical reads 5995, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

So if you come across a covering index in a query plan that is scanning, it would be worth investigating as to whether it is intended. The chances are more likely the index field order is not supporting the predicates being used, than engine has chosen to tip the index like it would for the non-covering non-clustered indexes.

When is Bulk Logged Mode Not What it Says?

November 6, 2009 Leave a comment

Somewhat of a ‘doh’ moment when dealing with a SQL server in bulk logged mode, the transaction log was behaving very weirdly and growing very large for smaller index operations. When performing an additional operation, the actual percentage of log space used went down, and it was a repeatable test. At no point would you expect an operation to actually result in the log space showing as less space used. So something was up with the scenario, and I was asked to check it.

In bulk logged mode index creations are classed as a minimally logged operation, whether online or offline, and a drop index is a bit of a mixture, the index page deallocation is fully logged, but the heap rebuild is listed as a minimally logged operation. There was no obvious reason why the database was behaving as it was.

The steps to recreate this situation on your own computrer / server are rather simple, create a database for testing purposes and preallocated the log space as 200Meg. The database was set to bulk logged mode. I then created a basic table:

create table [tblOnlineIndexTest]
  id int,
  padding char(500)

The table does not have a primary key so a uniquifier will be created, which we are not worried about. To create a number of rows, I simply ran the following SQL to generate some data, just over 60 meg of it.

insert into tblOnlineIndexTest (id, padding) values (1, REPLICATE('a',100))
go 100000

To check the log space, I used the DBCC command:

dbcc sqlperf(logspace)

26.28% is used. The next step was the creation of a clustered index on the table:

create clustered index [PK_tblOnlineIndexTest] on [tblOnlineIndexTest] (id)

The log was checked again and the space used was 21.34%. I twigged the underlying issue at this point, but continued to test a bit more to be sure.

Next I dropped the index and the log space jumped up to 53.27% of the total log space, and the log had grown in size, even though it was not at 100% of the space. The log has to reserve undo space in advance of the transaction to make sure it can always record an undo, so that a rollback does not get stopped by the transaction log being full / having no space left.

Last step was to recreate the same index and this time the log space dropped to 2.23% but the actual log file has grown by 66 Meg at this point. So the log got larger, but the actual amount of data in it was smaller.

The problem was a simple one, the database was not actually in bulk logged mode, just like a database set as ‘full’ logging, until the first full database backup is taken, the database is still in the simple logging mode. A transaction log has to be replayed against a starting position of a full backup, so the lack of a full backup on the database automatically prevents the bulk logging mode from actually being active. 

When I was presented the initial problem to look at, I did not asked whether the initial database backup had been taken – I assumed it had been which was clearly a bad assumption. Whilst the SQL Server identified itself in bulk logged mode, the actual reality was that it was in simple mode so the transaction log will be automatically clearing down committed transactions from the log and reusing the space.  Just because the database says it is in bulk logged or full modes, does not mean it actually is yet.

Simple vs Forced – Query Parameterization

October 31, 2009 Leave a comment

On the drive back from a relaxing week off I had decided to write about Query Parameterization – this is the process where the query optimizer works on ad-hoc queries and chooses to take literal values used within a query predicate, and turn it into a parameter on the fly.

This optimised query is then the one used to check the query cache, and not the original query string, so it can result in a far higher percentage of query plan cache hits than would otherwise occur on ad-hoc queries.

 There are two modes to this feature, and neither of them is ‘off’, you automatically have ‘simple’ parameterization turned on, and can increase this to ‘forced’ mode for your database if you desire. The difference in the two modes is that the simple mode will only deal with relatively simple literal values within the predicates, and will not consider every literal a candidate for parameterization.

So, overall it sounds a good thing but there are some known problems to parameterization, or parameter sniffing as it is oft called – the query plan generated for a given value of the parameter is not necessarily the optimal plan for another. Consider a search on a personnel database where the EmployeeAge is between 95 and 99 – Assuming an index is in place, you can already expect the data quantities to be low to nill, so the index will not tip and an index seek is the expected behaviour.

If the next time the query is executed the value is from 20 to 60, the number of people matching that value would be very high, yet the query plan in cache has already chosen the method by which the query will be executed, and it is no longer the most optimal.

From a user’s perspective this will lead what appears to be random / non-deterministic behaviour – sometimes the query will be very fast and other times it will be slow. There will not necessarily be a pattern to the behaviour because the query might be removed from cache when it is under memory pressure, or for a wide variety of other reasons the plan might be invalidated, the simplest being a data statistics invalidation.

So with the knowledge that parameter sniffing can cause issues, when should you consider investing the time into testing this setting? The time and effort to test could be considerable so I have written how I would come to that decision.

 The first thing I check is the CPU utilisation on the SQL Server, in theory there should be some kind of issue forcing this to be considered and one of the issues that can occur from insufficient parameterization is an increase in the CPU utilisation due to a high query compilation rate. There are a lot of other reasons that can cause a high CPU such as incorrect indexing leading to table scans etc, but for the purpose of this explanation, we can assume that has already been checked and was not enough.

The second performance counters I would check are:

SQLServer : SQL Statistics: Auto-Param Attempts/Sec
SQLServer : SQL Statistics: Batch Request / Sec

I could also check the plan cache hit ratio, or the SQL Compilations / sec figures, but having the Auto-Params and Batch figure per sec allows you to do a rough calculation on the ratio of incoming batches compared to the number of queries that auto-paramerterization is being attempted on (successfully or not). The higher the ratio, then the more ad-hoc queries are already being affected by the ‘simple’ mode parameterization.

If I see a high ratio, the next check would be on the plan cache statistics, primarily:

SQL Server : Plan Cache : Cache Object Counts for SQL Plans
SQL Server : Plan Cache : Cache Pages for SQL Plans

This is to try get a judge of how many plans are sitting in the cache / whether the cache is under memory pressure. If you have a high number of SQL plan objects / cached pages then you can calculate whether you are under memory pressure. You can also look at the buffer manager performance counter for the page life expectancy value, but this could be a bit misleading since data pages will also be affecting this.

At this point I would start pulling the query plan cache, and checking the query text for a sample of plans, to check whether there was a high number of literals remaining within the submitted query, since these are the ones that the simple parameterisation failed to convert to parameters. When pulling the query plan cache, sorting them in SQL text order, allows you to easily spot when a near identical queries except for unconverted literals have been submitted but resulted in a different query plan.

There is no prescriptive guidance on how many of these near identical queries it takes before you consider it, but clearly a single duplicate is not an issue, whilst thousands of them could be quite an issue and be resulting in a high number of unnecessary SQL plan compilations, which is increasing the CPU utilisation of the server.

As you can tell, this setting is certainly not one that should be changed without some considerable forethought and then thorough testing to ensure the situation you have got is actually worth the switch, and preferably some very good instrumented testing in a realistic test environment to ensure the benefits you are getting from the increased level of parameterisation, are not being lost by occurences of non-optimal plan cache hits.

The Sequence of an Index Uniquifier

October 20, 2009 Leave a comment

During a training session today, I was asked about the structure of the Uniquifier, and whether it was a straight identity column. Off the top of my head I couldn’t remember the exact structure, I considered it a 4 byte int, but was not sure whether it acted as a pure identity value or acted in a more complex manner when incrementing, so decided to investigate it tonight.

To start from the beginning, an index uniquifier is the term given to the field that is automatically generated by SQL Server when you create a clustered index, but the index key is not specified as unique. Since each record in the table has to be uniquely identifiable, SQL will automatically assigned a 4 byte field to the row to make it unique, commonly called the ‘Uniquifier’. At this point I am sure English scholars will be frowning, pondering on the nature of the word and whether it qualifies as English; however that the term used so we will run with it.

It is actually quite easy to see this field in action, let’s create a simple table:

CREATE TABLE dbo.unique_test  (  
firstname char(20) NOT NULL,  
surname char(20) NOT NULL  
CREATE CLUSTERED INDEX [ix_test] ON [dbo].[unique_test]
[firstname] ASC

The clustered index is not unique, by design, so let’s start adding duplicate rows to see the effect:

insert into unique_test values ('John', 'Smith')
go 10

The table now contains 10 rows, each with the same details. This does not cause any undue concern, because each row is actually still unique – the way to show this is using the DBCC INC and DBCC Page commands, I’ve cut the output down since it is so wide.

dbcc ind ('testdb','unique_test',1)
PageFID PagePID     IAMFID IAMPID      PageType
------- ----------- ------ ----------- --------
1       41          NULL   NULL        10      
1       174         1      41          1       

The output shows a data page numbered 174 for my example and the IAM page with an ID of 41. We can crack open the page and view the contents very easily using DBCC Page.

dbcc dbcc traceon(3604)
dbcc page (idtest,1,174,3)

The output is quite large, but in essence, the first record is stored with the following details: 

UNIQUIFIER = [NULL]                 
Slot 0 Column 1 Offset 0x4 Length 20
firstname = John                    
Slot 0 Column 2 Offset 0x18 Length 20

 The second record:

Slot 1 Column 0 Offset 0x33 Length 4
UNIQUIFIER = 1                      
Slot 1 Column 1 Offset 0x4 Length 20
firstname = John                    
Slot 1 Column 2 Offset 0x18 Length 20
surname = Smith  

 The third record:

Slot 2 Column 0 Offset 0x33 Length 4
UNIQUIFIER = 2                      
Slot 2 Column 1 Offset 0x4 Length 20
firstname = John                    
Slot 2 Column 2 Offset 0x18 Length 20
surname = Smith                     

And so forth. The first record’s uniquifier is visible and clearly named within the data page, but set to null. The second copy of the same value receives the uniquifier of one, the third copy receives a 2 etc.  This count is maintained separately for each duplication, so the insert of a new name multiple times will also receive its own counter, beginning at null and working upwards, 1,2,3 etc. So just because the uniquifier is 4 bytes, this does not limit the total number of rows in the table to ~2.1 billion, but does logically limit the total number of duplicates to 2.1 billion. I must confess to not having tested that limit, generating 2.1 billion rows of duplicate data is not trivial and a scrapbook calculation predicts 435 hours of processing on a virtual pc. I suspect the error message it raises when it hits the limit would be interesting.

If we remove all the rows from the table and then add 10 more does the uniquifier reset? Easy to test but the short answer was no, the uniquifier continued to rise, 10 thru 19.

I was a bit suspicious of this since any requirement for the uniquifier to rise / remember what existed before requires it to be stored somewhere – it has to survive a crash after all, but there is no apparent place the current count is stored. If there was, you wouldn’t be storing just 1 value, you would be forced to store a value for each record key that had duplicates. This could run into thousands of separate counters being maintained per clustered key so it just doesn’t make sense that it is stored, it would be a very noticable overhead.

When checking the DBCC Ind for the empty table it insisted it still had a data page, but the only contents of the data page was a single ghost record – a row that has been marked as deleted. The ghost record was the for the ‘John Smith’ with the highest uniquifier before, was this coincidence? The other ghost records had not hung around, so why did this one persist?

I dropped and recreated the table again, inserted 10 rows and then deleted them. Checking DBCC Ind the table still showed a HoBT IAM allocation page for the table and a data page, the data page contained a single ghost record, the one with a Uniquifier of 9 – the highest given out when 10 duplicates were added. Even waiting some considerable time the ghost record was not cleaned up, so it appears that it will not delete it.

If I added another duplicate row, it picked up the next number in the sequence (10) and shortly after the ghost record was removed from the page. Very convenient and not a coincidence at all – the memory of the last uniquifier given out  persists as a ghost record, even if all the duplicates for the table have been removed.  What seems strange is this ghost record hanging about, persisting an entire record, to keep the duplicate count for that key, when no instances of it remain on the table.

It can not possibly do this for every key since the space overhead would become very noticable again, so how does it choose what to persist, the last entry? unfortunately it doesn’t appear that simple at all, after a number of tests it appeared to only be interested in keeping the ghost entry for the record that had the highest key value, so alphabetically, the one closed to ‘Z’ for my example.

Conclusion? On the evidence, whilst the other ghost records still persist for a short time, even deleting and then adding more duplicates can see the number continue from where it left of, but given a short time for the ghost records to be removed the uniquifier will restart the sequence back at Null,1,2 etc. Except in the case of the highest entry from the index perspective, that ghost record decides to stick around until there is another entry using the same key, continuing the sequence, at which point the ghost record then finally disappears.

I can not think of any sensible reason why it would do this, can you?

Overall, the uniquifier is a cost overhead of not having a unique index, and at a cost of 4 bytes, an int identity column makes a lot of sense – for all purposes it acts the same and serves the same purpose but in a far more visible manner – so it really does not make much sense to rely on the uniquifier provided for you, take control and create your own.

Guidance on How to Layout a Partitioned Table across Filegroups

October 15, 2009 4 comments

SQL Server’s Table Partitioning was one of my favourite features within the 2005 release. It really brought SQL into the mainstream when it came to holding very large data quantities and allowed us to talk with confidence about large tables containing ‘Billions’ of rows and not be limited to the ‘Millions’. From extensive use and a couple of SQL Labs specifically surrounding the partitioning feature there are some rules and best practises I use to try maximise the benefit / flexibility of the partitioned table, without causing uneccesary drops in performance or increases in the disk space requirements.

Many examples of table partitions focus heavily on using date range based partition functions and schema, with a common layout mechanism of stripping the weeks / months across a number of file groups. The file groups are re-used and in the example picture, you can see 2 years worth of data stripped across 12 filegroups. InitialLayout

This is pretty common and has an alluring charm of simplicity, but it is going to hurt when you start rolling the window. The underlying problem is that there is no gap.

For a good number of systems it would be unacceptable to remove a month of data unless you had already inserted the new month sucessfully. So the system starts to roll a new month of data in, and is required to use the same filegroups, the layout transforms into the following:


The file group has to expand by 50% to accomodate the new data, before the old data can be removed – and once the old data is removed the filegroups now look like:


So the 50% of space is now wasted unless you use a shrink, which is probably the worst thing you can do to your filegroup and data files at that point in time. Shrink can fragment the data to the extreme and is to be avoided at all costs. Which means you will have to maintain a 50% space penalty for the layout on every filegroup. That might not sound a lot, but on a large database in an enterprise with mirrored SAN’s,that additional 50% is going to cost a substancial amount.

There are also other issues, SQL allows you to backup at a filegroup level and since the bulk of the data is historic and will not alter, you are forced to re-back up historic data (Jan 08) when you backup the recently inserted Jan 09 data. So there is an impact on backup space, backup times and restore times.

The simplicity of the initial layout makes it seem like a good idea, but the side-effects are not pleasant. You can alter the layout and choose to have 6 Filegroups, each storing 4 months of data, and then the expansion is only from 4 to 5, so a 25% overhead. It is better, but still is a cost overhead, The ultimate extreme is to then place it all in one filegroup, but there are a number of difficulties and contention points with that.

A better approach is to use a different file group per month, but then also create an additional spare filegroup, so that no filegroup is forced to expand, as shown:BetterLayout

The difference here is that we have one free filegroup that is not a part of the current partition scheme / function definition, but will be allocated as the ‘Next Used’ filegroup for the partition scheme, so that when we split the partition function that filegroup is brought into the scheme and used. The old data that is going to be removed, will provide an empty filegroup that will be used for the next month’s import. In essence the head of the partition function is chasing the tail and will never catch up.

The expansion of size for this is 1/n where n is the number of partitions to be stored, so for our 24 month example 1/24th expansion – a considerable saving. Even better is that you can choose to utilise file group backups for the older static months of data.

This though is not perfect for a couple of reasons, primarily data loading and indexing. To make the ‘scene’ better there should be a further filegroup dedicated to the data staging / loading that is going to occur to bring the data into the partitioned table.WithStaging

The purpose of this is twofold:

  • The disks used for loading are seperate from the disks used for the actual data so that we can maintain Quality of Service on the main table.
  • The data wants to be loaded as a heap for speed, but is likely to be sorted prior to insertion, an in place sort would expand one of the main file groups which is to be avoided.

By loading into a staging filegroup, you give yourself the opportunity to then use a clustered index to sort the data and move it to the appropriate file group, prior to being constrained and switched in to the main partitioned table. If you had loaded the data into FG25 and then tried to apply a clustered index, it would of doubled the size of the filegroup again as it needs to write all the new rows out and commit them, before the heap could be deleted. That would put you back at square one wasting 50% of the disk space.

The staging filegroup does cost us another file group worth of space, so the 24 initial filegroups has grown to 26, which is still a smaller expansion than the potential 50%.

So some simple guidlines are:

  • For a partition function that is going to have N partitions, create N+1 Filegroups. 
  • ETL / Stage your data into a dedicated staging file group
  • Move your data using a clustered index creation to the ‘spare’ filegroup you always have.
  • Switch new data in, then switch old out creating the new ‘spare’

It is not entirely flawless however – there is a problem in using such a design if the quantity of data per partition is going to vary by a large amount; you then have to provision each filegroup to have enough space to cope with the fluctations, which in itself can result in wasted space and starts chipping away at the gains made.

It works best as a technique on datasets that remain relatively consistent in terms of the number of rows per partition and as the number of partitions goes up, the savings increase.

What is the SQL Server 2008 DateTimeOffset Internal Structure?

October 13, 2009 Leave a comment

After decoding the DateTime2 internal structure I thought I would take a quick look at the DateTimeOffset structure, since it should not present too many difficulties and post it up quickly, but was surprised at the initial result of a test. It follows the same basic premise that the time portion is followed by the date portion and the time / date is based on a day count from the epoch time of 0001/01/01 and the time the number of intervals since midnight, where the interval is defined by the accuracy.

I was expecting the datetime offset value itself to occupy the additional 2 bytes quoted, and not affect the other values. Bad assumption, as soon as I cracked open a few examples I could immediately see that setting the offset also alters the underlying time / date component values as well.

Using the same comparison methods as before the underlying time value is clearly being adjusted:

'0001/01/01 00:00:00 -0:00' => 0x0700000000000000000000
'0001/01/01 00:00:00 -12:00' => 0x0700E034956400000030FD

So, even though an offset is being stored, the underlying time is also being altered to match, and the internal storage is using UTC as the reference point. This makes sense as the most valid reference that you could use.

'0001-01-01 12:00:00.0000000 +00:00' => 0x0700E03495640000000000
'0001-01-01 00:00:00.0000000 -12:00' => 0x0700E034956400000030FD

The same time / date is generated for the two values, but the last two bytes hold the offset and have stored the offset used when the date was initially stored. The underlying storage of the time though is clearly identical in both,  so UTC is the common ground they each get stored again.

The final 2 bytes for the offset are pretty easy to decode, since the pattern is the same as before with a slight twist. The offset time records the number of minutes for the offset in hex, with the first byte of the two being the least significant as before with the time, so you end up reading the two bytes left-to-right and then decode that byte right-to-left.

The twist is that for positive offsets, the value increments 1,2,3, in hex as appropriate, but for negative values, it starts decrementing by considering -1 equal to ‘FFFF’, I’ve split the hex output into the individual components by adding some spaces to make it easier to read. (Accuracy Code, Time Value, Date Value, Offset Used)

'2001-01-01 12:00:00.0000000 +00:01' => 0x07   009A717164   75250B  0100
'2001-01-01 12:00:00.0000000 +00:00' => 0x07   00E0349564   75250B  0000
'2001-01-01 12:00:00.0000000 -00:01' => 0x07   0026F8B864   75250B  FFFF

Since the offsets supported at only +14 hours to -14 hours, there is no risk of the two ranges overlapping. When I think about this a bit more, it is acting as a signed number, -1 being 11111111111 etc. So the 2 bytes at the end is a signed int of the number of minutes offset.

There are a number of time zones in the world that do not occur at exact hourly intervals from UTC, some are on the half hour mark such as Caracas (-4:30) or Delhi (+5:30) to name a few, whilst Kathmandu (+5:45) is even more specific.  In theory the format allows offsets specified to even greater levels of distinction, although I am not sure as to why you would wish to use it. Do you really want an offset of +3:17 minutes? That is potentially scary to consider that as a valid input to the value.

That has made me wonder as to why the accuracy was set so high, when in reality 15 minute intervals would of been sufficient, with a -14 to +14 range, that is 113 different values inclusively, which could be accomodated within a single byte.

So why spend 2 bytes on the format, when 1 was enough? Was it to just be compatible to the ISO format in some way that required it? Not sure.

What is the SQL Server 2008 DateTime2 Internal Structure?

October 11, 2009 2 comments

SQL Server has a number of new date time formats, but the one I am most interested in is DateTime2. The internal format of the SQL DateTime is commonly mistaken as 2×4 byte integers, with the latter integer being milliseconds since midnight. It is in fact the number of 1/300ths of a second since midnight which is why the accuracy of the DateTime within SQL Server has historically been 3.33ms. (If you really want to see it, crack it open by converting it to a binary, adding 1 and re-converting, you add 3.33ms, not 1 ms.)

So DateTime2 must use a different format, and as a weekend exercise that had no purpose than understanding the internals I thought I’d take a look. I have not seen the information in the BoL or posted as yet, so might be of use.  I am starting with the DateTime2(7) and looking at the maximum accuracy structure. The code used to crack it open each time is basically as follows:

declare @dt datetime2(7)
set @dt = '2000/01/01 00:00:00'
declare @bin varbinary(max)
set @bin = CONVERT(varbinary(max), @dt)

To make my life easier, SQL conveniently outputs all the values as hexi-decimal numbers. The results are not what you would expect.


The date which traditionally occupied the first 4 bytes, clearly is occupying the last few bytes. So the format is not going to be obvious or simple. Interestingly the returned result is 9 bytes, but the length is quoted as 8. It is returning 8 when checked using the length, that first byte is somewhat odd to make an appearance.  It’s also suspiciously the accuracy value, and with a few tests using a change of accuracy, it show that value changes. So the first pseudo-byte is the accuracy indicator.

To start figuring out some more, let’s take the time back to the beginning point, which in this case is not 1900/01/01 but 0001/01/01 which when converted gives us:

'0001/01/01 00:00:00' => 0x070000000000000000

Start incrementing the day portion and there is an obvious pattern, the 6th byte changes.

'0001/01/02 00:00:00' => 0x070000000000010000
'0001/01/03 00:00:00' => 0x070000000000020000
'0001/01/31 00:00:00' => 0x0700000000001E0000

As you try the 2nd month, to check where the month is, the same byte alters, so it represents days, not specific date parts. Is it the number of days since the beginning of the year? No.

'0001/02/01 00:00:00' => 0x0700000000001F0000

If it was, there would be an issue since 1 byte does not represent enough values, as we can see, FF occurs on the 13th of September, and then it rolls over and puts a 1 in the 7th Byte position.

'0001/09/13 00:00:00' => 0x070000000000FF0000
'0001/09/14 00:00:00' => 0x070000000000000100
'0001/09/15 00:00:00' => 0x070000000000010100

It rolls over, then carries on as before. This immediately suggests the next test, to roll over the year, and the pattern continues.

'0001/12/31 00:00:00' => 0x0700000000006C0100  
'0001/12/31 00:00:00' => 0x0700000000006D0100

So the format is just counting, we see it in the example as hex, but it is a straight number count going on but the hex values are left-to-right. Only 2 bytes are used so far, which do not represent enough day combinations, add the third byte in by going past 180 years:

'0180/06/06 00:00:00' => 0x070000000000FFFF00  
'0180/06/07 00:00:00' => 0x070000000000000001

So the final byte is then increased, so the number of combinations becomes 16777215 – that seems a lot better and certainly going to cover the range required.

'2001/01/01 00:00:00' => 0x07000000000075250B

So that is the final 3 bytes decoded, a simple pattern – and provides the template of how the time is also stored.

'0001/01/01 00:00:00.0000000' => 0x070000000000000000
'0001/01/01 00:00:00.0000001' => 0x070100000000000000
'0001/01/01 00:00:00.0000255' => 0x07FF00000000000000
'0001/01/01 00:00:00.0065535' => 0x07FFFF000000000000
'0001/01/01 00:00:00.0065536' => 0x070000010000000000

So to check whether the format is the same,

'0001/01/01 00:00:00.9999999' => 0x077F96980000000000

Decode that again and it all matches:

select (152 * 256 * 256) + (150 * 256) + 127

When we click over into 1 second exactly, we increment the first byte by 1, so the time portion is still represented in 100ns intervals, with the normal system of each byte counting up 1 every time the previous byte rolls over. As we get to the limit of the 3 bytes, it rolls into the 4th and then the 5th.

'0001/01/01 00:00:01.0000000' => 0x078096980000000000

So the internal format of the DateTime2(7) is decoded, not difficult but it is an interesting choice – it is now a straight binary number, with the Least Significant Byte being on the Left, the Most Significant being on the right (for each section.) Within the byte however, to convert it you must still read it right-to-left.

The first 5 bytes are recording how many time units intervals have passed since midnight, and the last 3 bytes recording how many days have passed since 0001/01/01.

The time unit intervals are dictated by the accuracy of the number, 100ns for DateTime2(7), and 1 Micro second intervals for a DateTime2(6) etc.  The way in which you interpret it does not change, but the units you are multiplying the time portion by, alters based on the accuracy.

You could construct 2 dates that are identical at a binary level, but due to the field meta-data on accuracy, they do not represent the same date time.

declare @dt1 dt1 datetime2(6)
set @dt1 = '0001/01/01 00:00:00.000001'
declare @dt2 datetime2(7)
set @dt2 = '0001/01/01 00:00:00.0000001'


 And that is perhaps why on output they automatically have prefixed the binary value with the datetime accuracy, so that they are not entirely identical? I’m not sure but would be interested to find out.

Finding Next Identity Value, and a Wild Goose Chase.

October 9, 2009 3 comments

A question asked on stack overflow was to find the next identity value that would occur on a table, without being required to add a record to work it out. The problem lies in that if the highest row is deleted, the number is not reused so any answers using the existing rows can be incorrect.

Logically the value must be stored, so first place I checked was the DMV’s. That stores the currently used value, but does not store the next value.

The wild goose chase started there…

  • I used the dedicated admin console to pull all the system tables, expecting it to be in sys.syshobtcolumns, no joy. Dumped the whole system table contents before and after an insert looking for the difference and didn’t spot it.
  • Took a dump of every page in the file before and after inserting a new row and having the textural dumps compared in a text comparison application, still no joy.
  • Started dumping out log records using the following script and pulled the identity calls:
select *
from ::fn_dblog(null, null)
where operation = 'LOP_IDENT_NEWVAL'
  • After a couple of hours running around the pages trying to find it, I realised I should of stuck with the DMV and I really went the wrong way around it.

 The DMV has the right answer it seems, but as two fields you have to combine to get the answer.

create table foo (MyID int identity not null, MyField char(10))
insert into foo values ('test')
go 10

-- Inserted 10 rows
select Convert(varchar(8),increment_value) as IncrementValue,
   Convert(varchar(8),last_value) as LastValue
from sys.identity_columns where name ='myid'

-- insert another row
insert into foo values ('test')

-- check the values again
select Convert(varchar(8),increment_value) as IncrementValue,
   Convert(varchar(8),last_value) as LastValue
from sys.identity_columns where name ='myid'

-- delete the rows
delete from foo

-- check the DMV again
select Convert(varchar(8),increment_value) as IncrementValue,
   Convert(varchar(8),last_value) as LastValue
from sys.identity_columns where name ='myid'

-- value is currently 11 and increment is 1, so the next insert gets 12
insert into foo values ('test')
select * from foo

MyID        MyField
----------- ----------
12          test      

(1 row(s) affected)

So adding the increment to the last value will predict the next value correctly, assuming someone else does not grab it in the mean time which is why it is not a good idea to use in code, but if you need to investigate a table and want to know what it thinks is next, without actually inserting a row and affecting the table, then it is useful.

All that and the easy way is then:

select ident_current(‘foo’) + ident_incr(‘foo’)

Ah well, was fun investigating it – but what a wild goose chase to find it was an easy answer.

Do Filtered Index Have Different Tipping Points?

October 4, 2009 Leave a comment

I’ve been taking a look at filtered indexes and they do initially look particularly useful, but I was wandering what effect the filtered index has on the tipping point. The tipping point was described by Kimberley L. Tripp in an excellent post – and describes the point at which SQL opts to change the query plan strategy from a NC index seek to a scan of some kind (Table / Clustered Index)

So I was wondering whether Filtered indexes chose a different strategy, or would they ‘tip’ from a targeted index to a table scan in the same way.

First off was to create a dummy table:

create tabletblPersonnel ( 
PersonnelID Int Identity, 
FirstName Char(30), 
LastName Char(30),
Department Char(30), 
SomePadding Char(250)

This was deliberately created as a heap, since data had to be generated and loaded:

insert into tblPersonnel(firstname,lastname,Department) values( dbo.randomstring(20), dbo.randomstring(20), dbo.randomstring(20)); go 1000000

The dbo.randomstring is a user created function I added to generate a random string [A-Z] the length of the function parameter, so the insert statement is inserting random strings of length 20 into the first name, last name and department fields.

Once the data was loaded, a clustered index was applied.

alter table dbo.tblPersonnel add constraint
PK_tblPersonnel primary key clustered(PersonnelID) with( statistics_norecompute = off, ignore_dup_key = off, allow_row_locks = on, allow_page_locks = on) on [PRIMARY]

A quick check of the table:

select COUNT(*) from tblpersonnel
(1 row(s) affected)

To get a rough idea of the tipping point we need to know the number of pages at the leaf level of the table. 

select * from sys.objects where name ='tblpersonnel'
-- grab the object id and query the physical stats.
select * from sys.dm_db_index_physical_stats(DB_ID('FilteredIndexTest'), 1117247035 , 1,0,'DETAILED')
-- 45455 pages shown in the table at leaf level 0 i.e. the clustered index

So create an NC index on the first name, last name.

create nonclustered index[IX_Normal] ON [dbo].[tblPersonnel] (  [FirstName] asc,  [LastName] asc
) with( pad_index = off ,statistics_norecompute = off, ignore_dup_key = off, allow_row_locks = on, allow_page_locks = on) on [PRIMARY]

For reference, we will also check how many pages the NC index has:

select * from sys.dm_db_index_physical_stats(DB_ID('FilteredIndexTest'), 1117247035 , 2,0,'DETAILED')
-- Level 0 of the index shows 8696 pages

The tipping point is expected between 25% to 33% of the number of pages when expressed as rows, so something between 11363 to 15000 rows is where to expect it. Since the data was random when inserted, a bit of human binary chop is needed to find the tipping point. After some experimentation – the dividing point was between the firstname like ‘a[a-h]%’ and ‘a[a-i]%’

select * from tblpersonnel
where firstname like 'a[a-h]%'
-- 11898 rows - seeks
select * from tblpersonnel
where firstname like 'a[a-i]%'
-- 13299 rows - scans

So the tipping point showed up within the expected range, so next I created the filtered index based on FirstName ‘>=a’ and ‘<= ak’ since I can not use a like clause in the Filtered Index where statement.

create nonclustered index [IX_Filter] ON[dbo].[tblPersonnel] (  [FirstName] asc,  [LastName] asc
where[FirstName] >= 'a'and firstname <= 'ak'
) with( pad_index = off ,statistics_norecompute = off, ignore_dup_key = off, allow_row_locks = on, allow_page_locks = on) on [PRIMARY]

The filtered index is in play, so I disabled the non-filtered index and re-ran the same queries.

select * from tblpersonnel
where firstname like 'a[a-h]%'
-- 11898 rows - seeks
select * from tblpersonnel
where firstname like 'a[a-i]%'
-- 13299 rows - scans

Conclusion: The tipping point did not alter based on the filtered index – it still tips based on the ratio of rows to total pages in the leaf level of the clustered index.