Posts Tagged ‘SQL Server 2008’

Prodata SQL Academy Events

December 2, 2009 Leave a comment

If you haven’t seen them advertised, Bob Duffy from Prodata is running a series of SQL Academy half day training session in Dublin, hosted at the Microsoft Auditorium in their offices in Leopardstown – the events are level 300 which suits the half day slot allocated for the sessions – yesterday’s was about performance tuning an optimisation so myself and a colleague took a short flight over and enjoyed the excellent Irish hospitality. The talk was recorded so there will no doubt be a webcast published at some point published by Technet in Ireland. The talk primarily went through using perfmon counters and wait states – and the available tools that can make this a lot easier by wrapping up and correlating results from different logging mechanisms.

I would recommend keeping an eye out for the cast when it appears, since troubleshooting a production environment is all about using non-intrusive means to understand what is crippling the systems – memory, cpu, IO etc. If you are not practised at this form of troubleshooting it is very difficult to know which performance counters and wait states to observe amongst the thousands that exist – as well as which DMV’s can give you the critical information to diagnose the problems. (It was quite interesting that the demonstration performance issue he was looking at was fundamentally a combination of a missing index but more critically was a lack of query parameterisation since it was in simple mode. The counters used to diagnose this problem, and the symptoms that you might encounter I have previously written about.)

The wait-state side of the talk was very interesting, I often use a combination of DMV’s and perfmon in the field to diagnose, but have only used a certain amount of the wait-state information and do not delve into it as deeply – I will definitely be adding a few more wait states to the list for the future.

The next event is on February 16th and covers SQL Analysis Services – registration is already open.

Inconsistent Date Literal Parameterization Behaviour

November 12, 2009 Leave a comment

I have mentioned query parameterization before and the process by which SQL extracts literal values from a query and re-writes the query in effect to use parameters to get a query plan cache hit, which negates the need to recompile the plan costing both time and CPU. There are a lot of good articles and book chapters that cover the topic.

What has confused me for a while is witnessing date literals within a query being parameterized on one query and not on another, even though both databases have parameterization set to ‘Simple’ mode. When a date literal is not parameterized the chances of getting a query plan cache hit is obviously very low which has performance impacts. The problem to date has been that I had been unable to ascertain the commonality and requirements that allowed the query to get parameterized and when it just kept the literal. I was paying too much attention to the query and as it turns out not enough to the table.

Well, after a number of hours getting to know the brick wall very well I finally tracked it down, and oddly it had nothing to do with the query I was submitting, but I could reproduce reliably by using an unrelated non-clustered index, which is confusing to say the least and I can not yet think of any reason why, is it a bug or just ‘Weird and Odd’.

The following steps reproduce the issue in both SQL Server 2005 and 2008.

Create a database, default settings, nothing specific, then the second step is to create our test table, a simple structure is suffice.

CREATE TABLE [dbo].[paramtest](  [id] [int] IDENTITY(1,1) NOT NULL, [somedate] [datetime] NOT NULL, [somefield] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
 [id] ASC

We need some data to work against just to make sure we are selecting results, and we can insert these trivially as follows: 

insert into [paramtest] values (getdate(),'a')
go 10000

So we now have 10k rows within the table, and a clustered primary index on the identity column.

The test starts with freeing up the procedure cache, and then running the select statement, the datetime I used was roughly in the middle of the range of values I had inserted but is not a deciding factor in the query plan results.

dbcc freeproccache select * from paramtest where somedate > '2009-11-12 21:14:50.000'

Using a standard query plan cache extraction query the specific line of the xml plan we are interested in is the SQL Statement.

<StmtSimple StatementText="(@1 varchar(8000))SELECT * FROM [paramtest] WHERE [somedate]>@1" StatementId="1" StatementCompId="1" StatementType="SELECT" StatementSubTreeCost="0.0379857" StatementEstRows="3831.48" StatementOptmLevel="TRIVIAL">

From it you can see the literal varchar value was extracted as @1 with a type of varchar(8000) and the query altered to use this parameter – this is exactly the behaviour we would expect from parameter sniffing.

Next step is to create a non-clustered index on the varchar ‘somefield’ – completely unrelated to the date literal being used, and should have no impact on the query at all.

CREATE NONCLUSTERED INDEX [ix_test] ON [dbo].[paramtest] ([somefield] ASC

Free the procedure cache up again and rerun the query

dbcc freeproccache select * from paramtest where somedate > '2009-11-12 21:14:50.000'

Extract the query plan again from the cache, but this time it is noticable different, the parameterisation has not occurred. The literal has

<StmtSimple StatementText="select * from paramtest  where somedate > '2009-11-12 21:14:50.000'" StatementId="1" StatementCompId="1" StatementType="SELECT" />

 To revert to the old plan, drop the index and clear the cache again, then run the query once more.

DROP INDEX [ix_test] ON [dbo].[paramtest] WITH (ONLINE = OFF)

Then clear the cache again and run the query

select * from paramtest where somedate > '2009-11-12 21:14:50.000'

And we are back to being parameterized.

So the application of a single non-clustered index on a separate field to the one being queried is preventing the simple parameterization mode from parameter sniffing the date literal – this makes absolutely no sense, and you can play around with it a lot more knowing what it causing the effect on the query plan. Even placing the additional non-clustered index on the identity field, which already has a clustered index results in the parameterization failing. If this behaviour is be design, then it makes for an interesting design or limitation on the parameterization.

As soon as the database is in ‘Forced’ parameterization mode, the literal was converted each time, so this looks specific to simple mode, but is not explainable, just demonstratable.

Can a Covering NC Index be Tipped?

November 7, 2009 Leave a comment

Non-clustered indexes normally have a ‘tipping point’, which is the point at which the query engine decides to change strategies from seeking the index with a nested loop operator back to a seek on the underlying table or choosing to just scan the underlying table and ignore the index. Kimberley Tripp wrote a great article about ‘The Tipping Point‘ , and the guidance is at about the 25-33% the query engine will change strategies.

If the non-clustered index is a covering index (it contains all the fields within the query) the query engine does not take the same decision – it makes sense that if any change in strategy occurs, it would have to be at a far higher figure, and as we are about to see, it will not take that decision and tip.

To test what strategy the engine would use I created a test situation of 2 separate tables, with different page counts, due to the padding column forcing the second table to use far more pages (5953 pages vs 9233)

CREATE TABLE [dbo].[tblIxTest1]( [PersonnelID] [int] IDENTITY(1,1) NOT NULL, [FirstName] [char](30) NULL, [LastName] [char](30) NULL,
   [Department] [char](30) NULL, [SomePadding] [char](10) NULL


CREATE TABLE [dbo].[tblIxTest2]( [PersonnelID] [int] IDENTITY(1,1) NOT NULL, [FirstName] [char](30) NULL, [LastName] [char](30) NULL,
   [Department] [char](30) NULL, [SomePadding] [char](1000) NULL

Next step was to insert some data, I needed random data to be able to ensure the index was not unbalanced in some way, so I broke out my useful little random string generation function. I should mention how to create this, a SQL function will not directly support the inclusion of a Rand() call within them, any attempt to do this results in the error:

Msg 443, Level 16, State 1, Procedure test, Line 13
Invalid use of a side-effecting operator 'rand' within a function.

However, there is nothing stopping a view from using this, and the function from using the view to get around the limitation: 

Create View [dbo].[RandomHelper] as Select Rand() as r

And then the function can be generated to use this, it is not necessarily the most efficient random string generation function, but it works nicely.

CREATE FUNCTION [dbo].[RandomString] (@Length int) RETURNS varchar(100)
  DECLARE @Result Varchar(100)
  SET @Result = ''
  DECLARE @Counter int
  SET @Counter = 0
  WHILE @Counter <= @Length
     SET @Result = @Result + Char(Ceiling((select R from randomhelper) * 26) + 64)       
     SET @Counter = @Counter + 1   END

This now allows me to generate random data and insert it into the tables to get a nice data distribution, and this was run for both of the tables.

insert into tblIxTest1 values (dbo.RandomString(20),dbo.RandomString(20),dbo.RandomString(20),'')
go 1000000

Two NC indexes are now needed, one for each table and both are identical and cover just the FirstName and PersonnelID fields within the table.

CREATE NONCLUSTERED INDEX [IX_Test1] ON [dbo].[tblIxTest1] ( [FirstName] ASC, [PersonnelID] ASC
CREATE NONCLUSTERED INDEX [IX_Test2] ON [dbo].[tblIxTest2] ( [FirstName] ASC, [PersonnelID] ASC

The setup is complete and it is pretty easy to now show the NC covering index is not going to tip, the most extreme where clause is where I am allowing every record to be returned:

select personnelid , firstname from tblixtest1 where firstname >= 'a' and firstname <= 'zzzzzzzzzzzzzzzzzzzzz'

This still produces a query plan with a seek strategy, regardless of which of my two tables it was executed on:

select personnelid , firstname from tblixtest1  where firstname >= 'a' and firstname <= 'zzzzzzzzzzzzzzzzzzzzz'  
    |--Index Seek(OBJECT:([FilteredIndexTest].[dbo].[tblIxTest1].[IX_Test1]), SEEK:([FilteredIndexTest].[dbo].[tblIxTest1].[FirstName] >= [@1] AND [FilteredIndexTest].[dbo].[tblIxTest1].[FirstName] <= [@2]) ORDERED FORWARD)

If we just select the entire table, unsurprisingly at that point it chooses to perform an index scan.

select personnelid , firstname from tblixtest1

Results in the following plan:

select personnelid , firstname from tblixtest1   |--Index Scan(OBJECT:([FilteredIndexTest].[dbo].[tblIxTest1].[IX_Test1])) 

The row counts on both queries were identical at 1 million. Slightly more interesting is that if I use a Like clause instead of a direct string evaluation, the behaviour alters slightly when selecting all the values:

select personnelid , firstname from tblixtest1 where firstname like '[a-z]%'

Gives the query plan:

select personnelid , firstname from tblixtest1  where firstname like '[a-z]%'  
   |--Index Scan(OBJECT:([FilteredIndexTest].[dbo].[tblIxTest1].[IX_Test1]),  WHERE:([FilteredIndexTest].[dbo].[tblIxTest1].[FirstName] like '[a-z]%'))

So the query engine is potentially making an optimisation that it knows the like clause covers 100% and adopts an automatic scan, but it is not really very clear why it has this optimisation path. If the like clause changes to [a-y] then it reverts back to a seek, so it looks specific to covering all the values within the like statement. If a between statement is used, it remains a seek regardless.

So the result is that a Non-clustered covering index is very unlikely to tip, you either have to not give it a where clause, or use a like statement across all the values available, it will steadfastly refuse to seek and choose the scan.


Well the I/O cost of the operation remains the same, it has to read every page in the table and it considered the cost of traversing the B-Tree negligible, so the difference between seek and scan is not very great. Running the seek based query and scan based query in the same batch the relative percentages are 48% vs 52% – that is the scan scoring 52% even though they read the same number of rows.

Outputting the IO statistics when they are run side by side shows the same number of pages being read, but the seek is clearly being favoured and is slightly faster as far as SQL is concerned – it is quite weird to consider a seek of an entire index is more efficient than a scan of the index.

(1000000 row(s) affected)
Table 'tblIxTest1'. Scan count 1, logical reads 5995, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(1000000 row(s) affected)
Table 'tblIxTest1'. Scan count 1, logical reads 5995, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

So if you come across a covering index in a query plan that is scanning, it would be worth investigating as to whether it is intended. The chances are more likely the index field order is not supporting the predicates being used, than engine has chosen to tip the index like it would for the non-covering non-clustered indexes.

When is Bulk Logged Mode Not What it Says?

November 6, 2009 Leave a comment

Somewhat of a ‘doh’ moment when dealing with a SQL server in bulk logged mode, the transaction log was behaving very weirdly and growing very large for smaller index operations. When performing an additional operation, the actual percentage of log space used went down, and it was a repeatable test. At no point would you expect an operation to actually result in the log space showing as less space used. So something was up with the scenario, and I was asked to check it.

In bulk logged mode index creations are classed as a minimally logged operation, whether online or offline, and a drop index is a bit of a mixture, the index page deallocation is fully logged, but the heap rebuild is listed as a minimally logged operation. There was no obvious reason why the database was behaving as it was.

The steps to recreate this situation on your own computrer / server are rather simple, create a database for testing purposes and preallocated the log space as 200Meg. The database was set to bulk logged mode. I then created a basic table:

create table [tblOnlineIndexTest]
  id int,
  padding char(500)

The table does not have a primary key so a uniquifier will be created, which we are not worried about. To create a number of rows, I simply ran the following SQL to generate some data, just over 60 meg of it.

insert into tblOnlineIndexTest (id, padding) values (1, REPLICATE('a',100))
go 100000

To check the log space, I used the DBCC command:

dbcc sqlperf(logspace)

26.28% is used. The next step was the creation of a clustered index on the table:

create clustered index [PK_tblOnlineIndexTest] on [tblOnlineIndexTest] (id)

The log was checked again and the space used was 21.34%. I twigged the underlying issue at this point, but continued to test a bit more to be sure.

Next I dropped the index and the log space jumped up to 53.27% of the total log space, and the log had grown in size, even though it was not at 100% of the space. The log has to reserve undo space in advance of the transaction to make sure it can always record an undo, so that a rollback does not get stopped by the transaction log being full / having no space left.

Last step was to recreate the same index and this time the log space dropped to 2.23% but the actual log file has grown by 66 Meg at this point. So the log got larger, but the actual amount of data in it was smaller.

The problem was a simple one, the database was not actually in bulk logged mode, just like a database set as ‘full’ logging, until the first full database backup is taken, the database is still in the simple logging mode. A transaction log has to be replayed against a starting position of a full backup, so the lack of a full backup on the database automatically prevents the bulk logging mode from actually being active. 

When I was presented the initial problem to look at, I did not asked whether the initial database backup had been taken – I assumed it had been which was clearly a bad assumption. Whilst the SQL Server identified itself in bulk logged mode, the actual reality was that it was in simple mode so the transaction log will be automatically clearing down committed transactions from the log and reusing the space.  Just because the database says it is in bulk logged or full modes, does not mean it actually is yet.

Guidance on How to Layout a Partitioned Table across Filegroups

October 15, 2009 4 comments

SQL Server’s Table Partitioning was one of my favourite features within the 2005 release. It really brought SQL into the mainstream when it came to holding very large data quantities and allowed us to talk with confidence about large tables containing ‘Billions’ of rows and not be limited to the ‘Millions’. From extensive use and a couple of SQL Labs specifically surrounding the partitioning feature there are some rules and best practises I use to try maximise the benefit / flexibility of the partitioned table, without causing uneccesary drops in performance or increases in the disk space requirements.

Many examples of table partitions focus heavily on using date range based partition functions and schema, with a common layout mechanism of stripping the weeks / months across a number of file groups. The file groups are re-used and in the example picture, you can see 2 years worth of data stripped across 12 filegroups. InitialLayout

This is pretty common and has an alluring charm of simplicity, but it is going to hurt when you start rolling the window. The underlying problem is that there is no gap.

For a good number of systems it would be unacceptable to remove a month of data unless you had already inserted the new month sucessfully. So the system starts to roll a new month of data in, and is required to use the same filegroups, the layout transforms into the following:


The file group has to expand by 50% to accomodate the new data, before the old data can be removed – and once the old data is removed the filegroups now look like:


So the 50% of space is now wasted unless you use a shrink, which is probably the worst thing you can do to your filegroup and data files at that point in time. Shrink can fragment the data to the extreme and is to be avoided at all costs. Which means you will have to maintain a 50% space penalty for the layout on every filegroup. That might not sound a lot, but on a large database in an enterprise with mirrored SAN’s,that additional 50% is going to cost a substancial amount.

There are also other issues, SQL allows you to backup at a filegroup level and since the bulk of the data is historic and will not alter, you are forced to re-back up historic data (Jan 08) when you backup the recently inserted Jan 09 data. So there is an impact on backup space, backup times and restore times.

The simplicity of the initial layout makes it seem like a good idea, but the side-effects are not pleasant. You can alter the layout and choose to have 6 Filegroups, each storing 4 months of data, and then the expansion is only from 4 to 5, so a 25% overhead. It is better, but still is a cost overhead, The ultimate extreme is to then place it all in one filegroup, but there are a number of difficulties and contention points with that.

A better approach is to use a different file group per month, but then also create an additional spare filegroup, so that no filegroup is forced to expand, as shown:BetterLayout

The difference here is that we have one free filegroup that is not a part of the current partition scheme / function definition, but will be allocated as the ‘Next Used’ filegroup for the partition scheme, so that when we split the partition function that filegroup is brought into the scheme and used. The old data that is going to be removed, will provide an empty filegroup that will be used for the next month’s import. In essence the head of the partition function is chasing the tail and will never catch up.

The expansion of size for this is 1/n where n is the number of partitions to be stored, so for our 24 month example 1/24th expansion – a considerable saving. Even better is that you can choose to utilise file group backups for the older static months of data.

This though is not perfect for a couple of reasons, primarily data loading and indexing. To make the ‘scene’ better there should be a further filegroup dedicated to the data staging / loading that is going to occur to bring the data into the partitioned table.WithStaging

The purpose of this is twofold:

  • The disks used for loading are seperate from the disks used for the actual data so that we can maintain Quality of Service on the main table.
  • The data wants to be loaded as a heap for speed, but is likely to be sorted prior to insertion, an in place sort would expand one of the main file groups which is to be avoided.

By loading into a staging filegroup, you give yourself the opportunity to then use a clustered index to sort the data and move it to the appropriate file group, prior to being constrained and switched in to the main partitioned table. If you had loaded the data into FG25 and then tried to apply a clustered index, it would of doubled the size of the filegroup again as it needs to write all the new rows out and commit them, before the heap could be deleted. That would put you back at square one wasting 50% of the disk space.

The staging filegroup does cost us another file group worth of space, so the 24 initial filegroups has grown to 26, which is still a smaller expansion than the potential 50%.

So some simple guidlines are:

  • For a partition function that is going to have N partitions, create N+1 Filegroups. 
  • ETL / Stage your data into a dedicated staging file group
  • Move your data using a clustered index creation to the ‘spare’ filegroup you always have.
  • Switch new data in, then switch old out creating the new ‘spare’

It is not entirely flawless however – there is a problem in using such a design if the quantity of data per partition is going to vary by a large amount; you then have to provision each filegroup to have enough space to cope with the fluctations, which in itself can result in wasted space and starts chipping away at the gains made.

It works best as a technique on datasets that remain relatively consistent in terms of the number of rows per partition and as the number of partitions goes up, the savings increase.

What is the SQL Server 2008 DateTimeOffset Internal Structure?

October 13, 2009 Leave a comment

After decoding the DateTime2 internal structure I thought I would take a quick look at the DateTimeOffset structure, since it should not present too many difficulties and post it up quickly, but was surprised at the initial result of a test. It follows the same basic premise that the time portion is followed by the date portion and the time / date is based on a day count from the epoch time of 0001/01/01 and the time the number of intervals since midnight, where the interval is defined by the accuracy.

I was expecting the datetime offset value itself to occupy the additional 2 bytes quoted, and not affect the other values. Bad assumption, as soon as I cracked open a few examples I could immediately see that setting the offset also alters the underlying time / date component values as well.

Using the same comparison methods as before the underlying time value is clearly being adjusted:

'0001/01/01 00:00:00 -0:00' => 0x0700000000000000000000
'0001/01/01 00:00:00 -12:00' => 0x0700E034956400000030FD

So, even though an offset is being stored, the underlying time is also being altered to match, and the internal storage is using UTC as the reference point. This makes sense as the most valid reference that you could use.

'0001-01-01 12:00:00.0000000 +00:00' => 0x0700E03495640000000000
'0001-01-01 00:00:00.0000000 -12:00' => 0x0700E034956400000030FD

The same time / date is generated for the two values, but the last two bytes hold the offset and have stored the offset used when the date was initially stored. The underlying storage of the time though is clearly identical in both,  so UTC is the common ground they each get stored again.

The final 2 bytes for the offset are pretty easy to decode, since the pattern is the same as before with a slight twist. The offset time records the number of minutes for the offset in hex, with the first byte of the two being the least significant as before with the time, so you end up reading the two bytes left-to-right and then decode that byte right-to-left.

The twist is that for positive offsets, the value increments 1,2,3, in hex as appropriate, but for negative values, it starts decrementing by considering -1 equal to ‘FFFF’, I’ve split the hex output into the individual components by adding some spaces to make it easier to read. (Accuracy Code, Time Value, Date Value, Offset Used)

'2001-01-01 12:00:00.0000000 +00:01' => 0x07   009A717164   75250B  0100
'2001-01-01 12:00:00.0000000 +00:00' => 0x07   00E0349564   75250B  0000
'2001-01-01 12:00:00.0000000 -00:01' => 0x07   0026F8B864   75250B  FFFF

Since the offsets supported at only +14 hours to -14 hours, there is no risk of the two ranges overlapping. When I think about this a bit more, it is acting as a signed number, -1 being 11111111111 etc. So the 2 bytes at the end is a signed int of the number of minutes offset.

There are a number of time zones in the world that do not occur at exact hourly intervals from UTC, some are on the half hour mark such as Caracas (-4:30) or Delhi (+5:30) to name a few, whilst Kathmandu (+5:45) is even more specific.  In theory the format allows offsets specified to even greater levels of distinction, although I am not sure as to why you would wish to use it. Do you really want an offset of +3:17 minutes? That is potentially scary to consider that as a valid input to the value.

That has made me wonder as to why the accuracy was set so high, when in reality 15 minute intervals would of been sufficient, with a -14 to +14 range, that is 113 different values inclusively, which could be accomodated within a single byte.

So why spend 2 bytes on the format, when 1 was enough? Was it to just be compatible to the ISO format in some way that required it? Not sure.

What is the SQL Server 2008 DateTime2 Internal Structure?

October 11, 2009 2 comments

SQL Server has a number of new date time formats, but the one I am most interested in is DateTime2. The internal format of the SQL DateTime is commonly mistaken as 2×4 byte integers, with the latter integer being milliseconds since midnight. It is in fact the number of 1/300ths of a second since midnight which is why the accuracy of the DateTime within SQL Server has historically been 3.33ms. (If you really want to see it, crack it open by converting it to a binary, adding 1 and re-converting, you add 3.33ms, not 1 ms.)

So DateTime2 must use a different format, and as a weekend exercise that had no purpose than understanding the internals I thought I’d take a look. I have not seen the information in the BoL or posted as yet, so might be of use.  I am starting with the DateTime2(7) and looking at the maximum accuracy structure. The code used to crack it open each time is basically as follows:

declare @dt datetime2(7)
set @dt = '2000/01/01 00:00:00'
declare @bin varbinary(max)
set @bin = CONVERT(varbinary(max), @dt)

To make my life easier, SQL conveniently outputs all the values as hexi-decimal numbers. The results are not what you would expect.


The date which traditionally occupied the first 4 bytes, clearly is occupying the last few bytes. So the format is not going to be obvious or simple. Interestingly the returned result is 9 bytes, but the length is quoted as 8. It is returning 8 when checked using the length, that first byte is somewhat odd to make an appearance.  It’s also suspiciously the accuracy value, and with a few tests using a change of accuracy, it show that value changes. So the first pseudo-byte is the accuracy indicator.

To start figuring out some more, let’s take the time back to the beginning point, which in this case is not 1900/01/01 but 0001/01/01 which when converted gives us:

'0001/01/01 00:00:00' => 0x070000000000000000

Start incrementing the day portion and there is an obvious pattern, the 6th byte changes.

'0001/01/02 00:00:00' => 0x070000000000010000
'0001/01/03 00:00:00' => 0x070000000000020000
'0001/01/31 00:00:00' => 0x0700000000001E0000

As you try the 2nd month, to check where the month is, the same byte alters, so it represents days, not specific date parts. Is it the number of days since the beginning of the year? No.

'0001/02/01 00:00:00' => 0x0700000000001F0000

If it was, there would be an issue since 1 byte does not represent enough values, as we can see, FF occurs on the 13th of September, and then it rolls over and puts a 1 in the 7th Byte position.

'0001/09/13 00:00:00' => 0x070000000000FF0000
'0001/09/14 00:00:00' => 0x070000000000000100
'0001/09/15 00:00:00' => 0x070000000000010100

It rolls over, then carries on as before. This immediately suggests the next test, to roll over the year, and the pattern continues.

'0001/12/31 00:00:00' => 0x0700000000006C0100  
'0001/12/31 00:00:00' => 0x0700000000006D0100

So the format is just counting, we see it in the example as hex, but it is a straight number count going on but the hex values are left-to-right. Only 2 bytes are used so far, which do not represent enough day combinations, add the third byte in by going past 180 years:

'0180/06/06 00:00:00' => 0x070000000000FFFF00  
'0180/06/07 00:00:00' => 0x070000000000000001

So the final byte is then increased, so the number of combinations becomes 16777215 – that seems a lot better and certainly going to cover the range required.

'2001/01/01 00:00:00' => 0x07000000000075250B

So that is the final 3 bytes decoded, a simple pattern – and provides the template of how the time is also stored.

'0001/01/01 00:00:00.0000000' => 0x070000000000000000
'0001/01/01 00:00:00.0000001' => 0x070100000000000000
'0001/01/01 00:00:00.0000255' => 0x07FF00000000000000
'0001/01/01 00:00:00.0065535' => 0x07FFFF000000000000
'0001/01/01 00:00:00.0065536' => 0x070000010000000000

So to check whether the format is the same,

'0001/01/01 00:00:00.9999999' => 0x077F96980000000000

Decode that again and it all matches:

select (152 * 256 * 256) + (150 * 256) + 127

When we click over into 1 second exactly, we increment the first byte by 1, so the time portion is still represented in 100ns intervals, with the normal system of each byte counting up 1 every time the previous byte rolls over. As we get to the limit of the 3 bytes, it rolls into the 4th and then the 5th.

'0001/01/01 00:00:01.0000000' => 0x078096980000000000

So the internal format of the DateTime2(7) is decoded, not difficult but it is an interesting choice – it is now a straight binary number, with the Least Significant Byte being on the Left, the Most Significant being on the right (for each section.) Within the byte however, to convert it you must still read it right-to-left.

The first 5 bytes are recording how many time units intervals have passed since midnight, and the last 3 bytes recording how many days have passed since 0001/01/01.

The time unit intervals are dictated by the accuracy of the number, 100ns for DateTime2(7), and 1 Micro second intervals for a DateTime2(6) etc.  The way in which you interpret it does not change, but the units you are multiplying the time portion by, alters based on the accuracy.

You could construct 2 dates that are identical at a binary level, but due to the field meta-data on accuracy, they do not represent the same date time.

declare @dt1 dt1 datetime2(6)
set @dt1 = '0001/01/01 00:00:00.000001'
declare @dt2 datetime2(7)
set @dt2 = '0001/01/01 00:00:00.0000001'


 And that is perhaps why on output they automatically have prefixed the binary value with the datetime accuracy, so that they are not entirely identical? I’m not sure but would be interested to find out.

Finding Next Identity Value, and a Wild Goose Chase.

October 9, 2009 3 comments

A question asked on stack overflow was to find the next identity value that would occur on a table, without being required to add a record to work it out. The problem lies in that if the highest row is deleted, the number is not reused so any answers using the existing rows can be incorrect.

Logically the value must be stored, so first place I checked was the DMV’s. That stores the currently used value, but does not store the next value.

The wild goose chase started there…

  • I used the dedicated admin console to pull all the system tables, expecting it to be in sys.syshobtcolumns, no joy. Dumped the whole system table contents before and after an insert looking for the difference and didn’t spot it.
  • Took a dump of every page in the file before and after inserting a new row and having the textural dumps compared in a text comparison application, still no joy.
  • Started dumping out log records using the following script and pulled the identity calls:
select *
from ::fn_dblog(null, null)
where operation = 'LOP_IDENT_NEWVAL'
  • After a couple of hours running around the pages trying to find it, I realised I should of stuck with the DMV and I really went the wrong way around it.

 The DMV has the right answer it seems, but as two fields you have to combine to get the answer.

create table foo (MyID int identity not null, MyField char(10))
insert into foo values ('test')
go 10

-- Inserted 10 rows
select Convert(varchar(8),increment_value) as IncrementValue,
   Convert(varchar(8),last_value) as LastValue
from sys.identity_columns where name ='myid'

-- insert another row
insert into foo values ('test')

-- check the values again
select Convert(varchar(8),increment_value) as IncrementValue,
   Convert(varchar(8),last_value) as LastValue
from sys.identity_columns where name ='myid'

-- delete the rows
delete from foo

-- check the DMV again
select Convert(varchar(8),increment_value) as IncrementValue,
   Convert(varchar(8),last_value) as LastValue
from sys.identity_columns where name ='myid'

-- value is currently 11 and increment is 1, so the next insert gets 12
insert into foo values ('test')
select * from foo

MyID        MyField
----------- ----------
12          test      

(1 row(s) affected)

So adding the increment to the last value will predict the next value correctly, assuming someone else does not grab it in the mean time which is why it is not a good idea to use in code, but if you need to investigate a table and want to know what it thinks is next, without actually inserting a row and affecting the table, then it is useful.

All that and the easy way is then:

select ident_current(‘foo’) + ident_incr(‘foo’)

Ah well, was fun investigating it – but what a wild goose chase to find it was an easy answer.

Do Filtered Index Have Different Tipping Points?

October 4, 2009 Leave a comment

I’ve been taking a look at filtered indexes and they do initially look particularly useful, but I was wandering what effect the filtered index has on the tipping point. The tipping point was described by Kimberley L. Tripp in an excellent post – and describes the point at which SQL opts to change the query plan strategy from a NC index seek to a scan of some kind (Table / Clustered Index)

So I was wondering whether Filtered indexes chose a different strategy, or would they ‘tip’ from a targeted index to a table scan in the same way.

First off was to create a dummy table:

create tabletblPersonnel ( 
PersonnelID Int Identity, 
FirstName Char(30), 
LastName Char(30),
Department Char(30), 
SomePadding Char(250)

This was deliberately created as a heap, since data had to be generated and loaded:

insert into tblPersonnel(firstname,lastname,Department) values( dbo.randomstring(20), dbo.randomstring(20), dbo.randomstring(20)); go 1000000

The dbo.randomstring is a user created function I added to generate a random string [A-Z] the length of the function parameter, so the insert statement is inserting random strings of length 20 into the first name, last name and department fields.

Once the data was loaded, a clustered index was applied.

alter table dbo.tblPersonnel add constraint
PK_tblPersonnel primary key clustered(PersonnelID) with( statistics_norecompute = off, ignore_dup_key = off, allow_row_locks = on, allow_page_locks = on) on [PRIMARY]

A quick check of the table:

select COUNT(*) from tblpersonnel
(1 row(s) affected)

To get a rough idea of the tipping point we need to know the number of pages at the leaf level of the table. 

select * from sys.objects where name ='tblpersonnel'
-- grab the object id and query the physical stats.
select * from sys.dm_db_index_physical_stats(DB_ID('FilteredIndexTest'), 1117247035 , 1,0,'DETAILED')
-- 45455 pages shown in the table at leaf level 0 i.e. the clustered index

So create an NC index on the first name, last name.

create nonclustered index[IX_Normal] ON [dbo].[tblPersonnel] (  [FirstName] asc,  [LastName] asc
) with( pad_index = off ,statistics_norecompute = off, ignore_dup_key = off, allow_row_locks = on, allow_page_locks = on) on [PRIMARY]

For reference, we will also check how many pages the NC index has:

select * from sys.dm_db_index_physical_stats(DB_ID('FilteredIndexTest'), 1117247035 , 2,0,'DETAILED')
-- Level 0 of the index shows 8696 pages

The tipping point is expected between 25% to 33% of the number of pages when expressed as rows, so something between 11363 to 15000 rows is where to expect it. Since the data was random when inserted, a bit of human binary chop is needed to find the tipping point. After some experimentation – the dividing point was between the firstname like ‘a[a-h]%’ and ‘a[a-i]%’

select * from tblpersonnel
where firstname like 'a[a-h]%'
-- 11898 rows - seeks
select * from tblpersonnel
where firstname like 'a[a-i]%'
-- 13299 rows - scans

So the tipping point showed up within the expected range, so next I created the filtered index based on FirstName ‘>=a’ and ‘<= ak’ since I can not use a like clause in the Filtered Index where statement.

create nonclustered index [IX_Filter] ON[dbo].[tblPersonnel] (  [FirstName] asc,  [LastName] asc
where[FirstName] >= 'a'and firstname <= 'ak'
) with( pad_index = off ,statistics_norecompute = off, ignore_dup_key = off, allow_row_locks = on, allow_page_locks = on) on [PRIMARY]

The filtered index is in play, so I disabled the non-filtered index and re-ran the same queries.

select * from tblpersonnel
where firstname like 'a[a-h]%'
-- 11898 rows - seeks
select * from tblpersonnel
where firstname like 'a[a-i]%'
-- 13299 rows - scans

Conclusion: The tipping point did not alter based on the filtered index – it still tips based on the ratio of rows to total pages in the leaf level of the clustered index.

How to Remember the Next Used Filegroup in a Partition Scheme

September 30, 2009 6 comments

Within SQL, the partitioned table feature provides an excellent way to store and roll forward data windows across large datasets without incurring huge loading and archiving penalties to the main table.

The process of loading, manipulating and decommissioning data from a partitioned table will need posts of their own – due to the depth of the information available and required as well as the detail needed to understand some of the best practises I have learnt from the field.

This entry is specifically relating to the ‘Next Used’ aspect of dealing with a partitioned table – for some time I wanted to know how could I tell what partition had been set as next used? There seems to be no record of the value – the BoL lists the Next Used facility as:

Specify the filegroup to be marked by the partition scheme as NEXT USED.

This is true in a simplistic sense, but the marking can not be at the filegroup level since the relation from partition schemes to filegroups can be many-to-many, so no single marking on a file group could suffice. I had never been able to figure out where to find that marking, and find a way to read what had been set.

I should add, that you really should not set the next used in one location of the code / stored proc and then perform the split of the partition function in another, it would be far safer to do them together, so the need to actually find it out is really borne out of investigative necessity in trying to understand for a given partition scheme that implements a rolling window that started to go bad, where does it think it should be splitting the partition to?

So the problem remained; it was given to Paul S. Randal  to figure out how we could see / infer this information outside of the dedicated admin console. He figured out the starting point in terms of which value in which DMV to start the solution with, and I ran with it from there to create a relatively easy way to get to it. So credit to Paul for finding out where to start on the issue.

So down to the scripts and some test cases:

There are 2 types of partition scheme to test:

  • Partition Function as Left.
  • Partition Function as Right.

When defining the scheme initially, you can also define it in two ways:

  • Partition scheme defined with the next used pre-set value.
  • Partition scheme defined with no pre-set next used value.

So there are 4 combinations to test initially, since we do not need data for this or are worried about the query plan, the main work is pure schema creation and checking.

First script is to just create a test database, nothing significant:

NAME = N'NextUsedTest', 
FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL
\DATA\NextUsedTest.mdf' , 
SIZE = 3072KB , 
NAME = N'PartFile1', 
FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL
\DATA\PartFile1.ndf' , 
SIZE = 3072KB , 
NAME = N'PartFile2', 
FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL
\DATA\PartFile2.ndf' , 
SIZE = 3072KB , 
NAME = N'PartFile3', 
FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL
\DATA\PartFile3.ndf' , 
SIZE = 3072KB , 
FILEGROUP[PartFG4] ( NAME = N'PartFile4', 
FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL
\DATA\PartFile4.ndf' , 
SIZE = 3072KB , 
NAME = N'PartFile5', 
FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL
\DATA\PartFile5.ndf' , 
SIZE = 3072KB , 
( NAME = N'NextUsedTest_log', 
FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL
SIZE = 1024KB , 
) GO

Now the database is in place, lets create the partition functions /schemes:

AS RANGE LEFT FOR VALUES (200801, 200802, 200803, 200804)

AS PARTITION[pfLeftNoSpare] 
TO([PartFG1], [PartFG2], [PartFG3], [PartFG4], [Primary])

AS RANGE LEFT FOR VALUES (200801, 200802, 200803, 200804)

AS PARTITION [pfLeftWithNextUsedSet] 
TO([PartFG1], [PartFG2], [PartFG3], [PartFG4], [Primary], [PartFG5])

AS RANGE RIGHT FOR VALUES (200801, 200802, 200803, 200804)

AS PARTITION[pfRightNoSpare] 
TO([Primary], [PartFG1], [PartFG2], [PartFG3], [PartFG4])

CREATE PARTITION FUNCTION [pfRightWithSpareNextUsedSet](int) 
AS RANGE RIGHT FOR VALUES (200801, 200802, 200803, 200804)

CREATE PARTITION SCHEME [psRightWithSpareNextUsedSet] 
AS PARTITION[pfRightWithSpareNextUsedSet] 
TO([Primary],[PartFG1], [PartFG2], [PartFG3], [PartFG4],  [PartFG5])

Four partition functions and schemes, for the left schemes the next used file group appears on the far right when the next used it pre-set. When you use this mechanisms to preset, it issues a message for the operation:

Partition scheme 'psLeftWithNextUsedSet' has been created successfully. 
'PartFG5' is marked as the next used filegroup 
in partition scheme 'psLeftWithNextUsedSet'. 
Partition scheme 'psRightWithSpareNextUsedSet' has been created successfully. 
'PartFG5' is marked as the next used filegroup 
in partition scheme 'psRightWithSpareNextUsedSet'.

The trick to finding the next used is to look for the mismatching record – if you join the partition function values to the partition scheme there are scheme entries that do not have a corresponding partition function value, since it has not been set by performing the split.

Lets start with the Left based partition and compare the two.

 select FG.Nameas FileGroupName
    , dds.destination_id
    , dds.data_space_id
    , prv.value, ps.Name
 from sys.partition_schemes PS
 inner join sys.destination_data_spaces as DDS 
    on DDS.partition_scheme_id = PS.data_space_id
 inner join sys.filegroups as FG 
    on FG.data_space_id = DDS.data_space_ID 
 left join sys.partition_range_values as PRV 
    on PRV.Boundary_ID = DDS.destination_id and prv.function_id=ps.function_id 
 where = 'psLeftNoSpare'

The output is:

FileGroupName destination_id data_space_id value Name
PartFG1 1 2 200801 psLeftNoSpare
PartFG2 2 3 200802 psLeftNoSpare
PartFG3 3 4 200803 psLeftNoSpare
PartFG4 4 5 200804 psLeftNoSpare
PRIMARY 5 1 NULL psLeftNoSpare

And check the other Left defined partition:

 select FG.Name asFileGroupName
    , dds.destination_id
    , dds.data_space_id
    , prv.value, ps.Name  
from sys.partition_schemes PS
 inner join sys.destination_data_spaces as DDS 
    on DDS.partition_scheme_id = PS.data_space_id
 inner join sys.filegroups as FG 
    on FG.data_space_id = DDS.data_space_ID 
 left join sys.partition_range_values as PRV 
    on PRV.Boundary_ID = DDS.destination_id and prv.function_id=ps.function_id 
 where = 'psLeftWithNextUsedSet'


FileGroupName destination_id data_space_id value Name
PartFG1 1 2 200801 psLeftNoSpare
PartFG2 2 3 200802 psLeftNoSpare
PartFG3 3 4 200803 psLeftNoSpare
PartFG4 4 5 200804 psLeftNoSpare
PRIMARY 5 1 NULL psLeftNoSpare
PartFG5 6 6 NULL psLeftNoSpare

And the difference appears, the next used appears as the highest destination ID, but only when there are two data spaces assigned to the partition that do not have a value. The ‘primary’ entry that shows up as null is due to the partition scheme running from -ve infinity to +ve infinity, and whilst 4 lines are drawn on the number line, that divides the number line into 5 sections.

Running the same statements for the partitions declared using the right shows similar results, but the Primary entry is given destination 5 whilst the spare shows on 6 again.

The query is slightly awkward in that it must pick up the second entry of the list if it exists, using ‘orderings and tops’ will not pull the result we need.

select FileGroupName, Destination_ID, Data_Space_ID, Name  from
  select  FG.Name as FileGroupName
   , dds.destination_id
   , dds.data_space_id
   , prv.value
   , ps.Name
   , RANK() OVER (PARTITION BY order by dds.destination_Id) as dest_rank
  from sys.partition_schemes PS
  inner join sys.destination_data_spaces as DDS 
    on DDS.partition_scheme_id = PS.data_space_id
  inner join sys.filegroups as FG 
    on FG.data_space_id = DDS.data_space_ID 
  left join sys.partition_range_values as PRV 
    on PRV.Boundary_ID = DDS.destination_id and prv.function_id=ps.function_id 
  where prv.Value is null
 ) as a
 where dest_rank = 2


FileGroupName destination_id data_space_id Name
PartFG5 6 6 psLeftWithNextUsedSet
PartFG5 6 6 psRightWithSpareNextUsedSet

To test whether it picks up setting the next used, let’s set it on the partitions that did not previously have it.


And re-run the query

FileGroupName destination_id data_space_id Name
PartFG5 6 6 psLeftNoSpare
PartFG5 6 6 psLeftWithNextUsedSet
PartFG5 6 6 psRightNoSpare
PartFG5 6 6 psRightWithSpareNextUsedSet

To make it re-usable, I switched the query into a view

create view NextUseFileGroups
select FileGroupName, Destination_ID, Data_Space_ID, Name
   select  FG.Name as FileGroupName
    , dds.destination_id
    , dds.data_space_id, prv.value, ps.Name,
   RANK() OVER (PARTITION BY order by dds.destination_Id) as dest_rank
   from sys.partition_schemes PS
   inner join sys.destination_data_spaces as DDS
     on DDS.partition_scheme_id = PS.data_space_id
   inner join sys.filegroups as FG
     on FG.data_space_id = DDS.data_space_ID 
   left join sys.partition_range_values as PRV
     on PRV.Boundary_ID = DDS.destination_id and prv.function_id=ps.function_id 
   where prv.Value is null
 ) as a
 where dest_rank = 2

And a final check with removing the setting – you can blank a set next used value by specifying no value in the statement.


Select from the view and the two file partition schemes / file groups no longer show up in the list as intended.

So finding out the ‘next used’ setting is possible, although there really is no need in normal operation of the partition window to have to find out, but as an investigative tool it could be useful.

The scripts were tested on both 2005 and 2008, so are good for both in terms of testing, or using the view.