Archive
The Sequence of an Index Uniquifier
During a training session today, I was asked about the structure of the Uniquifier, and whether it was a straight identity column. Off the top of my head I could not remember the exact structure; I knew it to be a 4 byte int, but was not sure whether it acted as a pure identity value or incremented in a more complex manner, so I decided to investigate it tonight.
To start from the beginning, an index uniquifier is the term given to the field that is automatically generated by SQL Server when you create a clustered index but do not specify the index key as unique. Since each record in the table has to be uniquely identifiable, SQL will automatically assign a 4 byte field to the row to make it unique, commonly called the 'Uniquifier'. At this point I am sure English scholars will be frowning, pondering on the nature of the word and whether it qualifies as English; however, that is the term used, so we will run with it.
It is actually quite easy to see this field in action, let’s create a simple table:
CREATE TABLE dbo.unique_test
(
    firstname char(20) NOT NULL,
    surname   char(20) NOT NULL
) ON [PRIMARY]
GO
CREATE CLUSTERED INDEX [ix_test] ON [dbo].[unique_test]
(
    [firstname] ASC
) ON [PRIMARY]
The clustered index is not unique, by design, so let’s start adding duplicate rows to see the effect:
insert into unique_test values ('John', 'Smith')
go 10
The table now contains 10 rows, each with the same details. This does not cause any undue concern, because each row is actually still unique – the way to show this is using the DBCC IND and DBCC PAGE commands. I’ve cut the output down since it is so wide.
dbcc ind ('testdb','unique_test',1)
PageFID PagePID IAMFID IAMPID PageType
------- ----------- ------ ----------- --------
1 41 NULL NULL 10
1 174 1 41 1
The output shows a data page numbered 174 for my example and the IAM page with an ID of 41. We can crack open the page and view the contents very easily using DBCC Page.
dbcc traceon(3604)
dbcc page ('testdb', 1, 174, 3)
The output is quite large, but in essence, the first record is stored with the following details:
UNIQUIFIER = [NULL]
Slot 0 Column 1 Offset 0x4 Length 20
firstname = John
Slot 0 Column 2 Offset 0x18 Length 20
surname = Smith
The second record:
Slot 1 Column 0 Offset 0x33 Length 4
UNIQUIFIER = 1
Slot 1 Column 1 Offset 0x4 Length 20
firstname = John
Slot 1 Column 2 Offset 0x18 Length 20
surname = Smith
The third record:
Slot 2 Column 0 Offset 0x33 Length 4
UNIQUIFIER = 2
Slot 2 Column 1 Offset 0x4 Length 20
firstname = John
Slot 2 Column 2 Offset 0x18 Length 20
surname = Smith
And so forth. The first record’s uniquifier is visible and clearly named within the data page, but set to null. The second copy of the same value receives a uniquifier of 1, the third copy receives a 2, and so on. This count is maintained separately for each duplicated key value, so inserting a different name multiple times will also receive its own counter, beginning at null and working upwards 1, 2, 3 etc. So although the uniquifier is 4 bytes, this does not limit the total number of rows in the table to ~2.1 billion, but it does logically limit the total number of duplicates per key to 2.1 billion. I must confess to not having tested that limit; generating 2.1 billion rows of duplicate data is not trivial, and a scrapbook calculation predicts 435 hours of processing on a virtual PC. I suspect the error message it raises when it hits the limit would be interesting.
If we remove all the rows from the table and then add 10 more, does the uniquifier reset? Easy to test, but the short answer was no – the uniquifier continued to rise, 10 through 19.
I was a bit suspicious of this, since any requirement for the uniquifier to rise / remember what existed before requires it to be stored somewhere – it has to survive a crash after all – yet there is no apparent place the current count is stored. If there was, you would not be storing just one value; you would be forced to store a value for each record key that had duplicates. That could run into thousands of separate counters being maintained per clustered key, so it just does not make sense that it is stored – it would be a very noticeable overhead.
When checking DBCC IND for the empty table it insisted it still had a data page, but the only contents of the data page was a single ghost record – a row that has been marked as deleted. The ghost record was for the 'John Smith' with the highest uniquifier from before – was this a coincidence? The other ghost records had not hung around, so why did this one persist?
I dropped and recreated the table, inserted 10 rows and then deleted them. Checking DBCC IND, the table still showed a HoBT IAM allocation page and a data page; the data page contained a single ghost record, the one with a uniquifier of 9 – the highest given out when the 10 duplicates were added. Even after waiting a considerable time the ghost record was not cleaned up, so it appears it will not be removed.
If I added another duplicate row, it picked up the next number in the sequence (10), and shortly after the ghost record was removed from the page. Very convenient, and not a coincidence at all – the memory of the last uniquifier given out persists as a ghost record, even when all the duplicates for the table have been removed. What seems strange is this ghost record hanging about, persisting an entire record just to keep the duplicate count for that key, when no instances of it remain in the table.
It cannot possibly do this for every key, since the space overhead would become very noticeable again, so how does it choose what to persist – the last entry? Unfortunately it does not appear that simple at all; after a number of tests it appeared to only be interested in keeping the ghost entry for the record that had the highest key value – so alphabetically, the one closest to 'Z' in my example.
Conclusion? On the evidence, whilst the other ghost records still persist for a short time, deleting and then adding more duplicates can see the number continue from where it left off; but given a short time for the ghost records to be removed, the uniquifier will restart the sequence back at null, 1, 2 etc. Except in the case of the highest entry from the index perspective – that ghost record decides to stick around until there is another entry using the same key, continuing the sequence, at which point the ghost record finally disappears.
I can not think of any sensible reason why it would do this, can you?
Overall, the uniquifier is the cost overhead of not having a unique index, and at 4 bytes, an int identity column makes a lot of sense – it acts the same way and serves the same purpose, but in a far more visible manner. So it really does not make much sense to rely on the uniquifier provided for you; take control and create your own.
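For completeness, a minimal sketch of that alternative follows (the table, column and index names here are hypothetical, not from the original test): make the uniquifying column an explicit int identity and include it in the clustered key, so the index can be declared unique while firstname remains the leading column.

CREATE TABLE dbo.unique_test2
(
    row_id    int IDENTITY(1,1) NOT NULL,
    firstname char(20) NOT NULL,
    surname   char(20) NOT NULL
) ON [PRIMARY]
GO
-- The identity column plays the role the uniquifier played, but is visible and queryable.
CREATE UNIQUE CLUSTERED INDEX [ix_test2] ON [dbo].[unique_test2]
(
    [firstname] ASC,
    [row_id] ASC
) ON [PRIMARY]
GO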
Guidance on How to Layout a Partitioned Table across Filegroups
SQL Server’s Table Partitioning was one of my favourite features within the 2005 release. It really brought SQL into the mainstream when it came to holding very large data quantities, and allowed us to talk with confidence about large tables containing ‘billions’ of rows rather than being limited to ‘millions’. From extensive use, and a couple of SQL Labs specifically surrounding the partitioning feature, there are some rules and best practices I use to try to maximise the benefit / flexibility of the partitioned table without causing unnecessary drops in performance or increases in the disk space requirements.
Many examples of table partitioning focus heavily on date range based partition functions and schemes, with a common layout mechanism of striping the weeks / months across a number of filegroups. The filegroups are re-used, and in the example picture you can see 2 years worth of data striped across 12 filegroups.
This is pretty common and has an alluring charm of simplicity, but it is going to hurt when you start rolling the window. The underlying problem is that there is no gap.
For a good number of systems it would be unacceptable to remove a month of data unless you had already inserted the new month successfully. So the system starts to roll a new month of data in, and is required to use the same filegroups; the layout transforms into the following:

The filegroup has to expand by 50% to accommodate the new data before the old data can be removed – and once the old data is removed the filegroups now look like:

So that 50% of space is now wasted unless you use a shrink, which is probably the worst thing you can do to your filegroup and data files at that point in time. Shrink can fragment the data to the extreme and is to be avoided at all costs. Which means you will have to maintain a 50% space penalty for the layout on every filegroup. That might not sound a lot, but on a large database in an enterprise with mirrored SANs, that additional 50% is going to cost a substantial amount.
There are also other issues: SQL allows you to back up at a filegroup level, and since the bulk of the data is historic and will not alter, you are forced to re-back up historic data (Jan 08) when you back up the recently inserted Jan 09 data. So there is an impact on backup space, backup times and restore times.
The simplicity of the initial layout makes it seem like a good idea, but the side-effects are not pleasant. You can alter the layout and choose to have 6 filegroups, each storing 4 months of data, and then the expansion is only from 4 to 5 months – a 25% overhead. That is better, but still a cost overhead. The ultimate extreme is to place it all in one filegroup, but there are a number of difficulties and contention points with that.
A better approach is to use a different file group per month, but then also create an additional spare filegroup, so that no filegroup is forced to expand, as shown:
The difference here is that we have one free filegroup that is not a part of the current partition scheme / function definition, but will be allocated as the ‘Next Used’ filegroup for the partition scheme, so that when we split the partition function that filegroup is brought into the scheme and used. The old data that is going to be removed, will provide an empty filegroup that will be used for the next month’s import. In essence the head of the partition function is chasing the tail and will never catch up.
The expansion of size for this is 1/n where n is the number of partitions to be stored, so for our 24 month example 1/24th expansion – a considerable saving. Even better is that you can choose to utilise file group backups for the older static months of data.
This though is not perfect for a couple of reasons, primarily data loading and indexing. To make the ‘scene’ better there should be a further filegroup dedicated to the data staging / loading that is going to occur to bring the data into the partitioned table.
The purpose of this is twofold:
- The disks used for loading are separate from the disks used for the actual data, so that we can maintain Quality of Service on the main table.
- The data wants to be loaded as a heap for speed, but is likely to need sorting prior to insertion; an in-place sort would expand one of the main filegroups, which is to be avoided.
By loading into a staging filegroup, you give yourself the opportunity to then use a clustered index to sort the data and move it to the appropriate filegroup, prior to it being constrained and switched in to the main partitioned table. If you had loaded the data into FG25 and then tried to apply a clustered index, it would have doubled the size of the filegroup again, as it needs to write all the new rows out and commit them before the heap can be deleted. That would put you back at square one, wasting 50% of the disk space.
The staging filegroup does cost us another file group worth of space, so the 24 initial filegroups has grown to 26, which is still a smaller expansion than the potential 50%.
So some simple guidelines are:
- For a partition function that is going to have N partitions, create N+1 Filegroups.
- ETL / Stage your data into a dedicated staging file group
- Move your data using a clustered index creation to the ‘spare’ filegroup you always have.
- Switch new data in, then switch old out creating the new ‘spare’
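As a rough illustration of that cycle, the monthly roll looks something like the sketch below – the object names (pfMonthly, psMonthly, PartitionedTable, StagingTable, ArchiveTable, FG_Spare) and boundary values are hypothetical, and it assumes a RANGE RIGHT function keyed on an int month value:

-- 1. Point the scheme at the spare filegroup and split in the new month.
ALTER PARTITION SCHEME psMonthly NEXT USED [FG_Spare]
ALTER PARTITION FUNCTION pfMonthly() SPLIT RANGE (200901)

-- 2. Switch the staged, sorted and constrained data into the new partition.
ALTER TABLE StagingTable SWITCH TO PartitionedTable PARTITION $PARTITION.pfMonthly(200901)

-- 3. Switch the oldest partition out to an archive table and merge the boundary;
--    the emptied filegroup becomes the new spare for next month.
ALTER TABLE PartitionedTable SWITCH PARTITION $PARTITION.pfMonthly(200701) TO ArchiveTable
ALTER PARTITION FUNCTION pfMonthly() MERGE RANGE (200701)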
It is not entirely flawless, however – there is a problem in using such a design if the quantity of data per partition varies by a large amount; you then have to provision each filegroup with enough space to cope with the fluctuations, which in itself can result in wasted space and starts chipping away at the gains made.
It works best as a technique on datasets that remain relatively consistent in terms of the number of rows per partition and as the number of partitions goes up, the savings increase.
What is the SQL Server 2008 DateTimeOffset Internal Structure?
After decoding the DateTime2 internal structure I thought I would take a quick look at the DateTimeOffset structure, since it should not present too many difficulties, and post it up quickly – but I was surprised at the initial result of a test. It follows the same basic premise: the time portion is followed by the date portion, the date is a day count from the epoch of 0001/01/01, and the time is the number of intervals since midnight, where the interval is defined by the accuracy.
I was expecting the offset value itself to occupy the additional 2 bytes quoted, and not to affect the other values. Bad assumption – as soon as I cracked open a few examples I could immediately see that setting the offset also alters the underlying time / date component values as well.
Using the same comparison methods as before the underlying time value is clearly being adjusted:
'0001/01/01 00:00:00 -0:00'  => 0x0700000000000000000000
'0001/01/01 00:00:00 -12:00' => 0x0700E034956400000030FD
So, even though an offset is being stored, the underlying time is also being altered to match, and the internal storage is using UTC as the reference point. This makes sense as the most valid reference that you could use.
'0001-01-01 12:00:00.0000000 +00:00' => 0x0700E03495640000000000
'0001-01-01 00:00:00.0000000 -12:00' => 0x0700E034956400000030FD
The same time / date value is generated for the two entries, but the last two bytes hold the offset that was used when the date was initially stored. The underlying storage of the time is clearly identical in both, so UTC is the common ground in which each gets stored.
The final 2 bytes for the offset are pretty easy to decode, since the pattern is the same as before with a slight twist. The offset records the number of minutes of offset in hex, with the first byte of the two being the least significant, just as with the time – so the bytes read left-to-right (least significant first), while the digits within each byte are read as normal.
The twist is that for positive offsets the value increments 1, 2, 3 in hex as appropriate, but for negative values it starts decrementing, with -1 equal to 'FFFF'. I've split the hex output into the individual components by adding some spaces to make it easier to read (accuracy code, time value, date value, offset used).
'2001-01-01 12:00:00.0000000 +00:01' => 0x07 009A717164 75250B 0100
'2001-01-01 12:00:00.0000000 +00:00' => 0x07 00E0349564 75250B 0000
'2001-01-01 12:00:00.0000000 -00:01' => 0x07 0026F8B864 75250B FFFF
Since the supported offsets run only from +14 hours to -14 hours, there is no risk of the two ranges overlapping. Thinking about it a bit more, it is acting as a signed number, -1 being all bits set, and so on. So the 2 bytes at the end are a signed 16-bit integer holding the number of minutes of offset.
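To see that in practice, the rough sketch below pulls the last two bytes back out and re-interprets them as a signed smallint of minutes – the byte swap is needed because they are stored least significant byte first (the variable names and the -05:30 example are my own, not from the original tests):

declare @dto datetimeoffset(7)
set @dto = '2001-01-01 12:00:00.0000000 -05:30'

declare @bin varbinary(12)
set @bin = convert(varbinary(12), @dto)

declare @len int
set @len = datalength(@bin)

-- Reverse the final two bytes (LSB first on disk) and cast to a signed smallint.
select convert(smallint, substring(@bin, @len, 1) + substring(@bin, @len - 1, 1)) as offset_minutes
-- Expect -330 for the -05:30 offset.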
There are a number of time zones in the world that do not sit at exact hourly intervals from UTC; some are on the half hour mark, such as Caracas (-4:30) or Delhi (+5:30) to name a few, whilst Kathmandu (+5:45) is even more specific. In theory the format allows offsets specified to even greater levels of precision, although I am not sure why you would wish to use that. Do you really want an offset of +3 hours 17 minutes? It is slightly scary to consider that as a valid input to the value.
That has made me wonder why the accuracy was set so high, when in reality 15 minute intervals would have been sufficient; with a -14 to +14 range that is 113 different values inclusively, which could be accommodated within a single byte.
So why spend 2 bytes on the format, when 1 was enough? Was it just to be compatible with the ISO format in some way that required it? Not sure.
What is the SQL Server 2008 DateTime2 Internal Structure?
SQL Server 2008 has a number of new date time formats, but the one I am most interested in is DateTime2. The internal format of the classic SQL DateTime is commonly mistaken as 2×4 byte integers, with the latter integer being milliseconds since midnight. It is in fact the number of 1/300ths of a second since midnight, which is why the accuracy of DateTime within SQL Server has historically been 3.33ms. (If you really want to see it, crack it open by converting it to binary, adding 1 and re-converting – you add 3.33ms, not 1ms.)
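If you want to try that for yourself, a quick sketch of the test is below (the variable names are mine): bump the raw binary form by one and the time moves on by a single 1/300th of a second tick, which displays as 3ms.

declare @d datetime
set @d = '2000-01-01 00:00:00.000'

-- Add one to the raw 8 byte value and convert back; the last 4 bytes are
-- 1/300th of a second ticks since midnight, so +1 tick shows as .003.
select convert(datetime, convert(binary(8), convert(bigint, convert(binary(8), @d)) + 1)) as plus_one_tick
-- 2000-01-01 00:00:00.003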
So DateTime2 must use a different format, and as a weekend exercise with no purpose other than understanding the internals, I thought I'd take a look. I have not seen the information in the BoL or posted elsewhere as yet, so it might be of use. I am starting with DateTime2(7), the maximum accuracy structure. The code used to crack it open each time is basically as follows:
declare @dt datetime2(7)
set @dt = '2000/01/01 00:00:00'
declare @bin varbinary(max)
set @bin = CONVERT(varbinary(max), @dt)
select @bin
To make my life easier, SQL conveniently outputs the values as hexadecimal. The results are not what you would expect.
0x07000000000007240B
The date, which traditionally occupied the first 4 bytes, is clearly occupying the last few bytes, so the format is not going to be obvious or simple. Interestingly, the returned result is 9 bytes, yet the length of the type is quoted as 8 and checking the length also returns 8 – so that first byte is somewhat odd to make an appearance. It is also suspiciously the accuracy value, and a few tests with a change of accuracy show that the value changes with it. So the first pseudo-byte is the accuracy indicator.
To start figuring out some more, let’s take the time back to the beginning point, which in this case is not 1900/01/01 but 0001/01/01 which when converted gives us:
'0001/01/01 00:00:00' => 0x070000000000000000
Start incrementing the day portion and there is an obvious pattern: the 6th byte changes.
'0001/01/02 00:00:00' => 0x070000000000010000
'0001/01/03 00:00:00' => 0x070000000000020000
'0001/01/31 00:00:00' => 0x0700000000001E0000
As you try the 2nd month, to check where the month is, the same byte alters, so it represents days, not specific date parts. Is it the number of days since the beginning of the year? No.
'0001/02/01 00:00:00' => 0x0700000000001F0000
If it was, there would be an issue, since 1 byte does not represent enough values. As we can see, FF occurs on the 13th of September, and then it rolls over and puts a 1 in the 7th byte position.
'0001/09/13 00:00:00' => 0x070000000000FF0000
'0001/09/14 00:00:00' => 0x070000000000000100
'0001/09/15 00:00:00' => 0x070000000000010100
It rolls over, then carries on as before. This immediately suggests the next test, to roll over the year, and the pattern continues.
'0001/12/31 00:00:00' => 0x0700000000006C0100
'0002/01/01 00:00:00' => 0x0700000000006D0100
So the format is just counting; we see it in the example as hex, but it is a straight number count with the bytes ordered left-to-right (least significant first). Only 2 bytes are used so far, which do not represent enough day combinations, so add the third byte in by going past 180 years:
'0180/06/06 00:00:00' => 0x070000000000FFFF00
'0180/06/07 00:00:00' => 0x070000000000000001
So the final byte is then increased, giving 16,777,216 possible day values – that seems a lot better and is certainly going to cover the range required.
'2001/01/01 00:00:00' => 0x07000000000075250B
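As a quick sanity check of that day count (my own arithmetic): 0x75250B read with the least significant byte first is 0x0B2575, and that should match the number of days from 0001/01/01 to 2001/01/01.

select (11 * 256 * 256) + (37 * 256) + 117   -- 0x0B, 0x25, 0x75

-----------
730485

select datediff(day, convert(datetime2, '0001-01-01'), convert(datetime2, '2001-01-01'))

-----------
730485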
So that is the final 3 bytes decoded, a simple pattern – and provides the template of how the time is also stored.
'0001/01/01 00:00:00.0000000' => 0x070000000000000000
'0001/01/01 00:00:00.0000001' => 0x070100000000000000
'0001/01/01 00:00:00.0000255' => 0x07FF00000000000000
'0001/01/01 00:00:00.0065535' => 0x07FFFF000000000000
'0001/01/01 00:00:00.0065536' => 0x070000010000000000
So to check whether the format is the same:
'0001/01/01 00:00:00.9999999' => 0x077F96980000000000
Decode that again and it all matches:
select (152 * 256 * 256) + (150 * 256) + 127
-----------
9999999
When we tick over into 1 second exactly, the first byte increments by 1, so the time portion is still represented in 100ns intervals, with the normal system of each byte counting up 1 every time the previous byte rolls over. As we get to the limit of the 3 bytes, it rolls into the 4th and then the 5th.
'0001/01/01 00:00:01.0000000' => 0x078096980000000000
So the internal format of DateTime2(7) is decoded – not difficult, but it is an interesting choice. It is now a straight binary number, with the least significant byte on the left and the most significant on the right (for each section). Within each byte, however, you must still read the hex digits right-to-left to convert it.
The first 5 bytes are recording how many time units intervals have passed since midnight, and the last 3 bytes recording how many days have passed since 0001/01/01.
The time unit intervals are dictated by the accuracy: 100ns for DateTime2(7), 1 microsecond intervals for DateTime2(6), and so on. The way in which you interpret it does not change, but the unit you multiply the time portion by alters based on the accuracy.
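Putting the whole layout together, a rough decoding sketch for a DateTime2(7) value is below – it assumes the structure described above (byte 1 accuracy, bytes 2-6 the interval count least significant byte first, bytes 7-9 the day count), and the variable names are my own:

declare @dt datetime2(7)
set @dt = '2001-01-01 12:00:00.0000000'

declare @bin varbinary(max)
set @bin = convert(varbinary(max), @dt)

-- Bytes 2-6: 100ns intervals since midnight, least significant byte first.
declare @intervals bigint
set @intervals =
      convert(bigint, convert(tinyint, substring(@bin, 2, 1)))
    + convert(bigint, convert(tinyint, substring(@bin, 3, 1))) * 256
    + convert(bigint, convert(tinyint, substring(@bin, 4, 1))) * 256 * 256
    + convert(bigint, convert(tinyint, substring(@bin, 5, 1))) * 256 * 256 * 256
    + convert(bigint, convert(tinyint, substring(@bin, 6, 1))) * 256 * 256 * 256 * 256

-- Bytes 7-9: days since 0001-01-01, least significant byte first.
declare @days int
set @days =
      convert(int, convert(tinyint, substring(@bin, 7, 1)))
    + convert(int, convert(tinyint, substring(@bin, 8, 1))) * 256
    + convert(int, convert(tinyint, substring(@bin, 9, 1))) * 256 * 256

select @days as days_since_0001,
       @intervals as intervals_since_midnight,
       dateadd(second, convert(int, @intervals / 10000000),
               dateadd(day, @days, convert(datetime2(7), '0001-01-01'))) as decoded_to_the_second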
You could construct 2 dates that are identical at a binary level, but due to the field meta-data on accuracy, they do not represent the same date time.
declare @dt1 datetime2(6)
set @dt1 = '0001/01/01 00:00:00.000001'
declare @dt2 datetime2(7)
set @dt2 = '0001/01/01 00:00:00.0000001'
select CONVERT(varbinary(max), @dt1), CONVERT(varbinary(max), @dt2)

0x060100000000000000
0x070100000000000000
And that is perhaps why the output automatically prefixes the binary value with the accuracy, so that the two are not entirely identical? I'm not sure, but would be interested to find out.
Finding Next Identity Value, and a Wild Goose Chase.
A question asked on Stack Overflow was how to find the next identity value that would occur on a table, without being required to add a record to work it out. The problem lies in that if the highest row is deleted, the number is not reused, so any answer using the existing rows can be incorrect.
Logically the value must be stored, so the first place I checked was the DMVs. They store the currently used value, but not the next value.
The wild goose chase started there…
- I used the dedicated admin console to pull all the system tables, expecting it to be in sys.syshobtcolumns, no joy. Dumped the whole system table contents before and after an insert looking for the difference and didn’t spot it.
- Took a dump of every page in the file before and after inserting a new row and had the textual dumps compared in a text comparison application – still no joy.
- Started dumping out log records using the following script and pulled the identity calls:
select * from ::fn_dblog(null, null) where operation = 'LOP_IDENT_NEWVAL'
- After a couple of hours running around the pages trying to find it, I realised I should have stuck with the DMV – I really went the wrong way around it.
The DMV has the right answer, it seems, but as two fields that you have to combine to get it.
create table foo (MyID int identity not null, MyField char(10))

insert into foo values ('test')
go 10
-- Inserted 10 rows

select Convert(varchar(8), increment_value) as IncrementValue,
       Convert(varchar(8), last_value) as LastValue
from sys.identity_columns where name = 'myid'

-- insert another row
insert into foo values ('test')

-- check the values again
select Convert(varchar(8), increment_value) as IncrementValue,
       Convert(varchar(8), last_value) as LastValue
from sys.identity_columns where name = 'myid'

-- delete the rows
delete from foo

-- check the DMV again
select Convert(varchar(8), increment_value) as IncrementValue,
       Convert(varchar(8), last_value) as LastValue
from sys.identity_columns where name = 'myid'

-- value is currently 11 and increment is 1, so the next insert gets 12
insert into foo values ('test')
select * from foo

Result:
MyID        MyField
----------- ----------
12          test

(1 row(s) affected)
So adding the increment to the last value will predict the next value correctly – assuming someone else does not grab it in the meantime, which is why it is not a good idea to use in code. But if you need to investigate a table and want to know what it thinks is next, without actually inserting a row and affecting the table, then it is useful.
All that and the easy way is then:
select ident_current('foo') + ident_incr('foo')
Ah well, it was fun investigating – but what a wild goose chase to find there was such an easy answer.
Do Filtered Indexes Have Different Tipping Points?
I’ve been taking a look at filtered indexes and they do initially look particularly useful, but I was wondering what effect a filtered index has on the tipping point. The tipping point was described by Kimberly L. Tripp in an excellent post – it is the point at which SQL opts to change the query plan strategy from an NC index seek to a scan of some kind (table / clustered index).
So I was wondering whether Filtered indexes chose a different strategy, or would they ‘tip’ from a targeted index to a table scan in the same way.
First off was to create a dummy table:
create table tblPersonnel
(
    PersonnelID Int Identity,
    FirstName Char(30),
    LastName Char(30),
    Department Char(30),
    SomePadding Char(250)
)
This was deliberately created as a heap, since data had to be generated and loaded:
insert into tblPersonnel (firstname, lastname, Department)
values (dbo.randomstring(20), dbo.randomstring(20), dbo.randomstring(20));
go 1000000
The dbo.randomstring is a user-created function I added to generate a random string [A-Z] of the length given by the function parameter, so the insert statement is inserting random strings of length 20 into the first name, last name and department fields.
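The original function was not included in the post, but for anyone wanting to reproduce the test, something along these lines would do – this is my own guess at an implementation, not the original. UDFs cannot call NEWID() directly, so a helper view supplies the randomness:

create view dbo.vRandomGuid
as
select newid() as guid_value
GO

create function dbo.randomstring (@length int)
returns varchar(100)
as
begin
    declare @result varchar(100)
    set @result = ''
    while len(@result) < @length
    begin
        -- Append one random character in the range A-Z per iteration.
        select @result = @result + char(65 + (abs(checksum(guid_value)) % 26))
        from dbo.vRandomGuid
    end
    return @result
end
GO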
Once the data was loaded, a clustered index was applied.
alter table dbo.tblPersonnel
add constraint PK_tblPersonnel primary key clustered (PersonnelID)
with (statistics_norecompute = off, ignore_dup_key = off,
      allow_row_locks = on, allow_page_locks = on) on [PRIMARY]
GO
A quick check of the table:
select COUNT(*) from tblpersonnel

-----------
1000000

(1 row(s) affected)
To get a rough idea of the tipping point we need to know the number of pages at the leaf level of the table.
select * from sys.objects where name = 'tblpersonnel'
-- grab the object id and query the physical stats.
select * from sys.dm_db_index_physical_stats(DB_ID('FilteredIndexTest'), 1117247035, 1, 0, 'DETAILED')
-- 45455 pages shown in the table at leaf level 0 i.e. the clustered index
So create an NC index on the first name, last name.
create nonclustered index [IX_Normal] ON [dbo].[tblPersonnel]
(
    [FirstName] asc,
    [LastName] asc
)
with (pad_index = off, statistics_norecompute = off, ignore_dup_key = off,
      allow_row_locks = on, allow_page_locks = on) on [PRIMARY]
GO
For reference, we will also check how many pages the NC index has:
select * from sys.dm_db_index_physical_stats(DB_ID('FilteredIndexTest'), 1117247035, 2, 0, 'DETAILED')
-- Level 0 of the index shows 8696 pages
The tipping point is expected between 25% and 33% of the number of pages when expressed as rows, so somewhere between 11,363 and 15,000 rows is where to expect it. Since the data was random when inserted, a bit of human binary chop was needed to find the tipping point. After some experimentation, the dividing point was between firstname like 'a[a-h]%' and 'a[a-i]%'.
select * from tblpersonnel where firstname like 'a[a-h]%'
-- 11898 rows - seeks
select * from tblpersonnel where firstname like 'a[a-i]%'
-- 13299 rows - scans
So the tipping point showed up within the expected range. Next I created the filtered index based on FirstName >= 'a' and <= 'ak', since a LIKE clause cannot be used in the filtered index WHERE statement.
create nonclustered index [IX_Filter] ON [dbo].[tblPersonnel]
(
    [FirstName] asc,
    [LastName] asc
)
where [FirstName] >= 'a' and [FirstName] <= 'ak'
with (pad_index = off, statistics_norecompute = off, ignore_dup_key = off,
      allow_row_locks = on, allow_page_locks = on) on [PRIMARY]
GO
The filtered index is in play, so I disabled the non-filtered index and re-ran the same queries.
select * from tblpersonnel where firstname like 'a[a-h]%'
-- 11898 rows - seeks
select * from tblpersonnel where firstname like 'a[a-i]%'
-- 13299 rows - scans
Conclusion: The tipping point did not alter based on the filtered index – it still tips based on the ratio of rows to total pages in the leaf level of the clustered index.
How to Remember the Next Used Filegroup in a Partition Scheme
Within SQL, the partitioned table feature provides an excellent way to store and roll forward data windows across large datasets without incurring huge loading and archiving penalties to the main table.
The processes of loading, manipulating and decommissioning data from a partitioned table will need posts of their own, due to the depth of information required and the detail needed to understand some of the best practices I have learnt from the field.
This entry relates specifically to the ‘Next Used’ aspect of dealing with a partitioned table – for some time I wanted to know how I could tell which filegroup had been set as next used. There seems to be no record of the value – the BoL describes the Next Used facility as:
Specify the filegroup to be marked by the partition scheme as NEXT USED.
This is true in a simplistic sense, but the marking cannot be at the filegroup level, since the relation from partition schemes to filegroups can be many-to-many, so no single marking on a filegroup could suffice. I had never been able to figure out where to find that marking, or a way to read what had been set.
I should add that you really should not set the next used in one location of the code / stored proc and then perform the split of the partition function in another – it is far safer to do them together. So the need to find it out is really borne of investigative necessity: for a given partition scheme implementing a rolling window that has started to go bad, where does it think it should be splitting the partition to?
So the problem remained; it was given to Paul S. Randal to figure out how we could see / infer this information outside of the dedicated admin console. He figured out the starting point, in terms of which value in which DMV to begin the solution with, and I ran with it from there to create a relatively easy way to get to it. So credit to Paul for finding out where to start on the issue.
So down to the scripts and some test cases:
There are 2 types of partition scheme to test:
- Partition Function as Left.
- Partition Function as Right.
When defining the scheme initially, you can also define it in two ways:
- Partition scheme defined with the next used pre-set value.
- Partition scheme defined with no pre-set next used value.
So there are 4 combinations to test initially. Since we do not need data for this, and are not worried about the query plan, the main work is pure schema creation and checking.
First script is to just create a test database, nothing significant:
CREATE DATABASE [NextUsedTest] ON PRIMARY
( NAME = N'NextUsedTest', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\NextUsedTest.mdf',
  SIZE = 3072KB, FILEGROWTH = 1024KB ),
FILEGROUP [PartFG1]
( NAME = N'PartFile1', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\PartFile1.ndf',
  SIZE = 3072KB, FILEGROWTH = 1024KB ),
FILEGROUP [PartFG2]
( NAME = N'PartFile2', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\PartFile2.ndf',
  SIZE = 3072KB, FILEGROWTH = 1024KB ),
FILEGROUP [PartFG3]
( NAME = N'PartFile3', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\PartFile3.ndf',
  SIZE = 3072KB, FILEGROWTH = 1024KB ),
FILEGROUP [PartFG4]
( NAME = N'PartFile4', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\PartFile4.ndf',
  SIZE = 3072KB, FILEGROWTH = 1024KB ),
FILEGROUP [PartFG5]
( NAME = N'PartFile5', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\PartFile5.ndf',
  SIZE = 3072KB, FILEGROWTH = 1024KB )
LOG ON
( NAME = N'NextUsedTest_log', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\NextUsedTest_log.ldf',
  SIZE = 1024KB, FILEGROWTH = 10% )
GO
Now the database is in place, let's create the partition functions / schemes:
CREATE PARTITION FUNCTION [pfLeftNoSpare](int)
AS RANGE LEFT FOR VALUES (200801, 200802, 200803, 200804)

CREATE PARTITION SCHEME [psLeftNoSpare]
AS PARTITION [pfLeftNoSpare]
TO ([PartFG1], [PartFG2], [PartFG3], [PartFG4], [Primary])

CREATE PARTITION FUNCTION [pfLeftWithNextUsedSet](int)
AS RANGE LEFT FOR VALUES (200801, 200802, 200803, 200804)

CREATE PARTITION SCHEME [psLeftWithNextUsedSet]
AS PARTITION [pfLeftWithNextUsedSet]
TO ([PartFG1], [PartFG2], [PartFG3], [PartFG4], [Primary], [PartFG5])

CREATE PARTITION FUNCTION [pfRightNoSpare](int)
AS RANGE RIGHT FOR VALUES (200801, 200802, 200803, 200804)

CREATE PARTITION SCHEME [psRightNoSpare]
AS PARTITION [pfRightNoSpare]
TO ([Primary], [PartFG1], [PartFG2], [PartFG3], [PartFG4])

CREATE PARTITION FUNCTION [pfRightWithSpareNextUsedSet](int)
AS RANGE RIGHT FOR VALUES (200801, 200802, 200803, 200804)

CREATE PARTITION SCHEME [psRightWithSpareNextUsedSet]
AS PARTITION [pfRightWithSpareNextUsedSet]
TO ([Primary], [PartFG1], [PartFG2], [PartFG3], [PartFG4], [PartFG5])
GO
Four partition functions and schemes; for the left-based schemes the next used filegroup appears on the far right when it is pre-set. When you use this mechanism to pre-set it, a message is issued for the operation:
Partition scheme 'psLeftWithNextUsedSet' has been created successfully. 'PartFG5' is marked as the next used filegroup in partition scheme 'psLeftWithNextUsedSet'.
Partition scheme 'psRightWithSpareNextUsedSet' has been created successfully. 'PartFG5' is marked as the next used filegroup in partition scheme 'psRightWithSpareNextUsedSet'.
The trick to finding the next used is to look for the mismatching record – if you join the partition function values to the partition scheme, there are scheme entries that do not have a corresponding partition function value, since they have not yet been brought in by performing a split.
Let's start with the left-based partition schemes and compare the two.
select FG.Name as FileGroupName,
       dds.destination_id,
       dds.data_space_id,
       prv.value,
       ps.Name
from sys.partition_schemes PS
inner join sys.destination_data_spaces as DDS
    on DDS.partition_scheme_id = PS.data_space_id
inner join sys.filegroups as FG
    on FG.data_space_id = DDS.data_space_ID
left join sys.partition_range_values as PRV
    on PRV.Boundary_ID = DDS.destination_id
   and prv.function_id = ps.function_id
where PS.name = 'psLeftNoSpare'
The output is:
| FileGroupName | destination_id | data_space_id | value | Name |
|---|---|---|---|---|
| PartFG1 | 1 | 2 | 200801 | psLeftNoSpare |
| PartFG2 | 2 | 3 | 200802 | psLeftNoSpare |
| PartFG3 | 3 | 4 | 200803 | psLeftNoSpare |
| PartFG4 | 4 | 5 | 200804 | psLeftNoSpare |
| PRIMARY | 5 | 1 | NULL | psLeftNoSpare |
And check the other Left defined partition:
select FG.Name as FileGroupName,
       dds.destination_id,
       dds.data_space_id,
       prv.value,
       ps.Name
from sys.partition_schemes PS
inner join sys.destination_data_spaces as DDS
    on DDS.partition_scheme_id = PS.data_space_id
inner join sys.filegroups as FG
    on FG.data_space_id = DDS.data_space_ID
left join sys.partition_range_values as PRV
    on PRV.Boundary_ID = DDS.destination_id
   and prv.function_id = ps.function_id
where PS.name = 'psLeftWithNextUsedSet'
Results:
| FileGroupName | destination_id | data_space_id | value | Name |
|---|---|---|---|---|
| PartFG1 | 1 | 2 | 200801 | psLeftWithNextUsedSet |
| PartFG2 | 2 | 3 | 200802 | psLeftWithNextUsedSet |
| PartFG3 | 3 | 4 | 200803 | psLeftWithNextUsedSet |
| PartFG4 | 4 | 5 | 200804 | psLeftWithNextUsedSet |
| PRIMARY | 5 | 1 | NULL | psLeftWithNextUsedSet |
| PartFG5 | 6 | 6 | NULL | psLeftWithNextUsedSet |
And the difference appears: the next used filegroup shows up with the highest destination ID, but only when there are two data spaces assigned to the partition scheme that do not have a value. The 'PRIMARY' entry that shows up with a null value is due to the partition scheme running from -ve infinity to +ve infinity; whilst 4 boundaries are drawn on the number line, that divides the number line into 5 sections.
Running the same statements for the partitions declared using the right shows similar results, but the Primary entry is given destination 5 whilst the spare shows on 6 again.
The query is slightly awkward in that it must pick up the second entry of the list if it exists; simple ordering and TOP will not pull the result we need.
select FileGroupName, Destination_ID, Data_Space_ID, Name
from
(
    select FG.Name as FileGroupName,
           dds.destination_id,
           dds.data_space_id,
           prv.value,
           ps.Name,
           RANK() OVER (PARTITION BY ps.name order by dds.destination_Id) as dest_rank
    from sys.partition_schemes PS
    inner join sys.destination_data_spaces as DDS
        on DDS.partition_scheme_id = PS.data_space_id
    inner join sys.filegroups as FG
        on FG.data_space_id = DDS.data_space_ID
    left join sys.partition_range_values as PRV
        on PRV.Boundary_ID = DDS.destination_id
       and prv.function_id = ps.function_id
    where prv.Value is null
) as a
where dest_rank = 2
Results:
| FileGroupName | destination_id | data_space_id | Name |
|---|---|---|---|
| PartFG5 | 6 | 6 | psLeftWithNextUsedSet |
| PartFG5 | 6 | 6 | psRightWithSpareNextUsedSet |
To test whether it picks up setting the next used, let’s set it on the partitions that did not previously have it.
ALTER PARTITION SCHEME psLeftNoSpare NEXT USED [PartFG5]
ALTER PARTITION SCHEME psRightNoSpare NEXT USED [PartFG5]
And re-run the query
| FileGroupName | destination_id | data_space_id | Name |
|---|---|---|---|
| PartFG5 | 6 | 6 | psLeftNoSpare |
| PartFG5 | 6 | 6 | psLeftWithNextUsedSet |
| PartFG5 | 6 | 6 | psRightNoSpare |
| PartFG5 | 6 | 6 | psRightWithSpareNextUsedSet |
To make it re-usable, I switched the query into a view
create view NextUseFileGroups
as
select FileGroupName, Destination_ID, Data_Space_ID, Name
from
(
    select FG.Name as FileGroupName,
           dds.destination_id,
           dds.data_space_id,
           prv.value,
           ps.Name,
           RANK() OVER (PARTITION BY ps.name order by dds.destination_Id) as dest_rank
    from sys.partition_schemes PS
    inner join sys.destination_data_spaces as DDS
        on DDS.partition_scheme_id = PS.data_space_id
    inner join sys.filegroups as FG
        on FG.data_space_id = DDS.data_space_ID
    left join sys.partition_range_values as PRV
        on PRV.Boundary_ID = DDS.destination_id
       and prv.function_id = ps.function_id
    where prv.Value is null
) as a
where dest_rank = 2
And a final check with removing the setting – you can blank a set next used value by specifying no value in the statement.
ALTER PARTITION SCHEME psLeftNoSpare NEXT USED
ALTER PARTITION SCHEME psRightNoSpare NEXT USED
Select from the view and the two partition schemes / filegroups no longer show up in the list, as intended.
So finding out the ‘next used’ setting is possible; there really is no need to do so in normal operation of the partition window, but as an investigative tool it could be useful.
The scripts were tested on both 2005 and 2008, so the queries and the view are good for either version.
When is MaxDop not MaxDop?
MaxDop is in some sense a bit of a misnomer, in that you would think ‘Max Degree of Parallelism’, set by the system administrator, would be the last and final word on the matter: that is your maximum, and there are no choices to be made.
However, whilst at the SQL Immersion event in Dublin hosted by Prodata, I made an offhand comment about increasing the thread count beyond the MaxDop setting whilst creating an online index on an OLTP system that had MaxDop 1 set.
That gained me some quizzical looks, in that most assume MaxDop is set and has to be adhered to, so surely what I was indicating was not possible? Well – yes it is, and there is even a KB article that relates to it and the differences between SQL Server 2000 / 2005 and 2008. http://support.microsoft.com/default.aspx/kb/329204
I should also mention that the BoL is less than precise about the situation, against SQL 2008 the BoL states for ‘Max Degree of Parallelism Option’:
Set max degree of parallelism to 1 to suppress parallel plan generation.
And against the ‘Degree of Parallelism Page’ :
For example, you can use the MAXDOP option to control, by extending or reducing, the number of processors dedicated to an online index operation. In this way, you can balance the resources used by an index operation with those of the concurrent users.
So that’s entirely clear…
You can infer (aside from the KB) that something is not what it seems when it states ‘by extending or reducing’ – how can you extend it? That would not make logical sense if it was a hard limit.
So we have a slightly bizarre situation in which you can override the server level setting with your own value. Initial thought is, this is a bit dangerous isn’t it?
Would you want the average user or developer to start adding the appropriate query hint and overriding you? They would get more CPU, so they wouldn’t hesitate (though better performance is not guaranteed, since they would all be doing it).
To perform a test, I used a large partitioned table, since the parallelism that occurs when querying a partitioned table is far easier to predict and to engineer specific scenarios against. The table had ~688 million rows in it and the selection was asking for a simple row count from 2 of the available partitions. The server used was an 8-core.
Select Count(*) From LargePartitionedTable Where MyPartitionColumn in (200801, 200901)
The baseline case was setting the server to a maxdop of 1
sp_configure 'max degree of parallelism', 1
go
reconfigure
go
The query plan is not particularly complex – the key part is the clustered index scan.


The second baseline case, using MaxDop 0, was then set up and the same query run.
sp_configure 'max degree of parallelism', 0
go
reconfigure
go


As expected, the query ran up 8 threads (8 core machine), but only 2 of the threads processed any work, since the SQL 2005 threading model assigns 1 thread per partition when more than 1 partition is requested.
So the base scenarios are in place, now the query is altered slightly to override the MaxDop upwards.
sp_configure 'max degree of parallelism', 1
go
reconfigure
go
Select Count(*) From tblODS Where TimelineID in (200801, 200802) Option (maxdop 4)
If the override works, the thread count will go up from 1, and as per the KB article it does.


So the MaxDop set by an administrator does not need to be obeyed – but does this apply to everyone? That is really the key issue, and unfortunately it does not fall the way you would wish it to. Placing the SQL instance into Mixed Authentication mode, I created a user account and gave it only the permission to select from the table and the showplan permission – nothing else.
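For reference, the shape of that test was roughly as follows – the login name and password are hypothetical, and I have used EXECUTE AS here for brevity rather than logging in under Mixed Authentication as in the original test:

create login MaxDopTest with password = 'Str0ng!Passw0rd'
create user MaxDopTest for login MaxDopTest
grant select on dbo.tblODS to MaxDopTest
grant showplan to MaxDopTest

-- Run the query in the context of the low-privilege user.
execute as user = 'MaxDopTest'
Select Count(*) From tblODS Where TimelineID in (200801, 200802) Option (maxdop 4)
revert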
Result? The query went parallel again, indicating there is no special permission required for a user to override the setting.
The facility to override it is useful – on an OLTP system where you want to increase indexing speed by using more threads it is essential – but for it not to require a granted permission for the service accounts / jobs performing those tasks seems slightly bizarre.
These tests were performed using SQL 2005, so I need to test SQL 2008 to see if the lack of controls continue to exist.