Posts Tagged ‘Internals’

The Sequence of an Index Uniquifier

October 20, 2009 Leave a comment

During a training session today, I was asked about the structure of the Uniquifier, and whether it was a straight identity column. Off the top of my head I couldn't remember the exact structure; I considered it a 4 byte int, but was not sure whether it acted as a pure identity value or incremented in a more complex manner, so I decided to investigate it tonight.

To start from the beginning, an index uniquifier is the term given to the field that is automatically generated by SQL Server when you create a clustered index whose index key is not specified as unique. Since each record in the table has to be uniquely identifiable, SQL Server will automatically assign a 4 byte field to the row to make it unique, commonly called the 'Uniquifier'. At this point I am sure English scholars will be frowning, pondering whether the word qualifies as English; however, that is the term used, so we will run with it.

It is actually quite easy to see this field in action, let’s create a simple table:

CREATE TABLE dbo.unique_test
(
    firstname char(20) NOT NULL,
    surname char(20) NOT NULL
)
GO
CREATE CLUSTERED INDEX [ix_test] ON [dbo].[unique_test]
(
    [firstname] ASC
)

The clustered index is not unique, by design, so let’s start adding duplicate rows to see the effect:

insert into unique_test values ('John', 'Smith')
go 10

The table now contains 10 rows, each with the same details. This does not cause any undue concern, because each row is actually still unique – the way to show this is using the DBCC IND and DBCC PAGE commands. I've cut the output down since it is so wide.

dbcc ind ('testdb','unique_test',1)
PageFID PagePID     IAMFID IAMPID      PageType
------- ----------- ------ ----------- --------
1       41          NULL   NULL        10      
1       174         1      41          1       

The output shows a data page numbered 174 for my example and the IAM page with an ID of 41. We can crack open the page and view the contents very easily using DBCC Page.

dbcc traceon(3604)
dbcc page ('testdb',1,174,3)

The output is quite large, but in essence, the first record is stored with the following details: 

Slot 0 Column 0 Offset 0x0 Length 0
UNIQUIFIER = [NULL]
Slot 0 Column 1 Offset 0x4 Length 20
firstname = John
Slot 0 Column 2 Offset 0x18 Length 20
surname = Smith

The second record:

Slot 1 Column 0 Offset 0x33 Length 4
UNIQUIFIER = 1                      
Slot 1 Column 1 Offset 0x4 Length 20
firstname = John                    
Slot 1 Column 2 Offset 0x18 Length 20
surname = Smith  

The third record:

Slot 2 Column 0 Offset 0x33 Length 4
UNIQUIFIER = 2                      
Slot 2 Column 1 Offset 0x4 Length 20
firstname = John                    
Slot 2 Column 2 Offset 0x18 Length 20
surname = Smith                     

And so forth. The first record's uniquifier is visible and clearly named within the data page, but set to NULL. The second copy of the same value receives a uniquifier of 1, the third copy a 2, etc. This count is maintained separately for each duplicated key, so inserting a different name multiple times will also receive its own counter, beginning at NULL and working upwards, 1, 2, 3 etc. So although the uniquifier is 4 bytes, this does not limit the total number of rows in the table to ~2.1 billion; it logically limits the number of duplicates of any single key value to ~2.1 billion. I must confess to not having tested that limit – generating 2.1 billion rows of duplicate data is not trivial, and a scrapbook calculation predicts 435 hours of processing on a virtual PC. I suspect the error message it raises when it hits the limit would be interesting.
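
Conceptually, the behaviour observed so far is like a per-key duplicate counter. The sketch below is only a simulation of the numbering pattern seen in the DBCC PAGE output, written in Python for convenience; the class and method names are mine, and SQL Server does not literally implement it this way (as the rest of this post explores).

```python
# Simulate the observed uniquifier numbering: the first row for a key gets
# NULL (stored as zero bytes on the page), and each subsequent duplicate
# gets 1, 2, 3... maintained independently per clustered key value.
from collections import defaultdict

class UniquifierSimulator:
    def __init__(self):
        self._counts = defaultdict(int)  # hypothetical per-key counter

    def next_uniquifier(self, key):
        n = self._counts[key]
        self._counts[key] += 1
        return None if n == 0 else n  # None models the NULL first entry

sim = UniquifierSimulator()
print([sim.next_uniquifier(("John", "Smith")) for _ in range(4)])
# first 'John Smith' is NULL, then 1, 2, 3
print(sim.next_uniquifier(("Jane", "Doe")))  # a new key starts again at NULL
```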

If we remove all the rows from the table and then add 10 more, does the uniquifier reset? Easy to test, but the short answer was no – the uniquifier continued to rise, 10 through 19.

I was a bit suspicious of this, since any requirement for the uniquifier to keep rising – to remember what existed before – requires it to be stored somewhere; it has to survive a crash, after all. But there is no apparent place where the current count is stored. If there were, you would not be storing just one value: you would be forced to store a value for each record key that had duplicates. This could run into thousands of separate counters being maintained per clustered key, so it just doesn't make sense for it to be stored – it would be a very noticeable overhead.

When checking DBCC IND for the empty table, it insisted the table still had a data page, but the only contents of that page was a single ghost record – a row that has been marked as deleted. The ghost record was for the 'John Smith' with the highest uniquifier from before – was this a coincidence? The other ghost records had not hung around, so why did this one persist?

I dropped and recreated the table again, inserted 10 rows and then deleted them. Checking DBCC IND, the table still showed a HoBT IAM allocation page and a data page, and the data page contained a single ghost record – the one with a uniquifier of 9, the highest given out when 10 duplicates were added. Even after waiting some considerable time the ghost record was not cleaned up, so it appears it will not be deleted.

If I added another duplicate row, it picked up the next number in the sequence (10), and shortly afterwards the ghost record was removed from the page. Very convenient, and not a coincidence at all – the memory of the last uniquifier given out persists as a ghost record, even when all the duplicates for that key have been removed. What seems strange is this ghost record hanging about, persisting an entire record just to keep the duplicate count for that key, when no instances of it remain in the table.

It cannot possibly do this for every key, since the space overhead would again become very noticeable. So how does it choose what to persist – the last entry? Unfortunately it does not appear to be that simple at all; after a number of tests it appeared only to keep the ghost entry for the record with the highest key value – alphabetically, the one closest to 'Z' in my example.

Conclusion? On the evidence: whilst the other ghost records do persist for a short time, and deleting and then adding more duplicates within that window sees the numbering continue from where it left off, once the ghost records have been removed the uniquifier restarts the sequence at NULL, 1, 2 etc. Except in the case of the highest entry from the index perspective – that ghost record sticks around until there is another entry using the same key, continuing the sequence, at which point the ghost record finally disappears.

I cannot think of any sensible reason why it would do this – can you?

Overall, the uniquifier is a cost overhead of not having a unique index, and at a cost of 4 bytes an int identity column makes a lot of sense: for all purposes it acts the same and serves the same purpose, but in a far more visible manner. So it really does not make much sense to rely on the uniquifier provided for you – take control and create your own.

What is the SQL Server 2008 DateTimeOffset Internal Structure?

October 13, 2009 Leave a comment

After decoding the DateTime2 internal structure I thought I would take a quick look at the DateTimeOffset structure, since it should not present too many difficulties, and post it up quickly – but I was surprised at the initial result of a test. It follows the same basic premise: the time portion is followed by the date portion; the date is a day count from the epoch of 0001/01/01, and the time is the number of intervals since midnight, where the interval is defined by the accuracy.

I was expecting the offset value itself to occupy the additional 2 bytes quoted, and not affect the other values. Bad assumption – as soon as I cracked open a few examples I could immediately see that setting the offset also alters the underlying time / date component values.

Using the same comparison methods as before the underlying time value is clearly being adjusted:

'0001/01/01 00:00:00 -0:00' => 0x0700000000000000000000
'0001/01/01 00:00:00 -12:00' => 0x0700E034956400000030FD

So, even though an offset is being stored, the underlying time is also being altered to match, and the internal storage is using UTC as the reference point. This makes sense as the most valid reference that you could use.

'0001-01-01 12:00:00.0000000 +00:00' => 0x0700E03495640000000000
'0001-01-01 00:00:00.0000000 -12:00' => 0x0700E034956400000030FD

The same time / date is generated for the two values, but the last two bytes hold the offset that was used when the value was initially stored. The underlying storage of the time is clearly identical in both, so UTC is the common ground in which they each get stored.
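
We can check this outside SQL Server: strip the accuracy byte, and the 5 time bytes of the two values above are identical, decoding to 12 hours' worth of 100ns intervals. A sketch in Python, which I am using here purely as a calculator; the byte layout is the one worked out in this post:

```python
# Both DateTimeOffset values store the same UTC instant; only the trailing
# two offset bytes differ. The time bytes are a little-endian count of
# 100ns intervals since midnight.
def time_ticks(hex_value):
    raw = bytes.fromhex(hex_value)
    # byte 0 = accuracy, bytes 1-5 = time, bytes 6-8 = date, bytes 9-10 = offset
    return int.from_bytes(raw[1:6], 'little')

utc_noon = time_ticks('0700E03495640000000000')   # '+00:00', 12:00:00
minus_12 = time_ticks('0700E034956400000030FD')   # '-12:00', 00:00:00 local
print(utc_noon == minus_12)            # True: identical underlying time
print(utc_noon / 10_000_000 / 3600)    # 12.0 hours since midnight, in UTC
```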

The final 2 bytes for the offset are pretty easy to decode, since the pattern is the same as before with a slight twist. The offset records the number of minutes of offset, with the first byte of the two being the least significant, as with the time – so you swap the two bytes around and then read the result as a normal hex number.

The twist is that for positive offsets the value increments 1, 2, 3 in hex as appropriate, but for negative values it starts decrementing, with -1 equal to 'FFFF'. I've split the hex output into its individual components by adding some spaces to make it easier to read. (Accuracy Code, Time Value, Date Value, Offset Used)

'2001-01-01 12:00:00.0000000 +00:01' => 0x07   009A717164   75250B  0100
'2001-01-01 12:00:00.0000000 +00:00' => 0x07   00E0349564   75250B  0000
'2001-01-01 12:00:00.0000000 -00:01' => 0x07   0026F8B864   75250B  FFFF

Since the offsets supported are only +14 hours to -14 hours, there is no risk of the two ranges overlapping. Thinking about this a bit more, it is acting as a signed number, -1 being 1111111111111111 in binary, etc. So the 2 bytes at the end are a signed int of the number of minutes of offset.
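
Decoding the trailing two bytes as a little-endian signed 16-bit integer confirms this against the examples above (again, Python just as a calculator):

```python
# The last two bytes of a DateTimeOffset value hold the offset in minutes,
# little-endian, two's complement signed.
def offset_minutes(hex_value):
    raw = bytes.fromhex(hex_value)
    return int.from_bytes(raw[-2:], 'little', signed=True)

print(offset_minutes('0700E034956400000030FD'))  # -720  => -12:00
print(offset_minutes('07009A71716475250B0100'))  # 1     => +00:01
print(offset_minutes('070026F8B86475250BFFFF'))  # -1    => -00:01
```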

There are a number of time zones in the world that do not occur at exact hourly intervals from UTC; some are on the half hour, such as Caracas (-4:30) or Delhi (+5:30) to name a few, whilst Kathmandu (+5:45) is even more specific. In theory the format allows offsets specified to even greater levels of granularity, although I am not sure why you would wish to use it. Do you really want an offset of +3:17? It is potentially scary to consider that as valid input.

That has made me wonder why the accuracy was set so high, when in reality 15 minute intervals would have been sufficient: with a -14 to +14 range, that is 113 different values inclusive, which could be accommodated within a single byte.

So why spend 2 bytes on the format, when 1 was enough? Was it just to be compatible with the ISO format in some way that required it? Not sure.

What is the SQL Server 2008 DateTime2 Internal Structure?

October 11, 2009 2 comments

SQL Server has a number of new date time formats, but the one I am most interested in is DateTime2. The internal format of the SQL DateTime is commonly mistaken as 2×4 byte integers, with the latter integer being milliseconds since midnight. It is in fact the number of 1/300ths of a second since midnight, which is why the accuracy of the DateTime within SQL Server has historically been 3.33ms. (If you really want to see it, crack it open by converting it to binary, adding 1 and re-converting: you add 3.33ms, not 1ms.)
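
That 1/300th of a second tick is also where the familiar .000 / .003 / .007 millisecond rounding of the classic DateTime comes from; a quick check of the arithmetic (Python as a calculator):

```python
# Classic DateTime stores the time as 1/300ths of a second since midnight,
# so one tick is 3.33ms and millisecond values land on the 0 / 3 / 7 pattern.
TICKS_PER_SECOND = 300
for tick in range(4):
    ms = tick * 1000 / TICKS_PER_SECOND
    print(tick, round(ms))   # 0->0ms, 1->3ms, 2->7ms, 3->10ms
```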

So DateTime2 must use a different format, and as a weekend exercise with no purpose other than understanding the internals, I thought I'd take a look. I have not seen the information in BoL or posted elsewhere as yet, so it might be of use. I am starting with DateTime2(7), the maximum accuracy structure. The code used to crack it open each time is basically as follows:

declare @dt datetime2(7)
set @dt = '2000/01/01 00:00:00'
declare @bin varbinary(max)
set @bin = CONVERT(varbinary(max), @dt)
select @bin

To make my life easier, SQL conveniently outputs the values as hexadecimal numbers. The results are not what you would expect:

'2000/01/01 00:00:00' => 0x07000000000007240B

The date, which traditionally occupied the first 4 bytes, is clearly occupying the last few bytes, so the format is not going to be obvious or simple. Interestingly, the returned result is 9 bytes, but the length is quoted as 8 – it returns 8 when checked using the length, so that first byte is somewhat odd to make an appearance. It is also suspiciously the accuracy value, and a few tests with a change of accuracy show the value changes with it. So the first pseudo-byte is the accuracy indicator.

To start figuring out some more, let’s take the time back to the beginning point, which in this case is not 1900/01/01 but 0001/01/01 which when converted gives us:

'0001/01/01 00:00:00' => 0x070000000000000000

Start incrementing the day portion and there is an obvious pattern, the 6th byte changes.

'0001/01/02 00:00:00' => 0x070000000000010000
'0001/01/03 00:00:00' => 0x070000000000020000
'0001/01/31 00:00:00' => 0x0700000000001E0000

Try the 2nd month, to check where the month is stored: the same byte alters, so it represents days, not specific date parts. Is it the number of days since the beginning of the year? No.

'0001/02/01 00:00:00' => 0x0700000000001F0000

If it were, there would be an issue, since 1 byte does not represent enough values. As we can see, FF occurs on the 13th of September, and then it rolls over and puts a 1 in the 7th byte position.

'0001/09/13 00:00:00' => 0x070000000000FF0000
'0001/09/14 00:00:00' => 0x070000000000000100
'0001/09/15 00:00:00' => 0x070000000000010100

It rolls over, then carries on as before. This immediately suggests the next test, to roll over the year, and the pattern continues.

'0001/12/31 00:00:00' => 0x0700000000006C0100  
'0002/01/01 00:00:00' => 0x0700000000006D0100

So the format is just counting; we see it in the example as hex, but it is a straight number count with the bytes running left-to-right from least significant to most significant. Only 2 bytes are used so far, which do not represent enough day combinations, so bring the third byte in by going past 180 years:

'0180/06/06 00:00:00' => 0x070000000000FFFF00  
'0180/06/07 00:00:00' => 0x070000000000000001

So the final byte is then increased, and the number of combinations becomes 16,777,215 – that seems a lot better and is certainly going to cover the range required.
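
The day portion can therefore be decoded directly: read the last 3 bytes as a little-endian integer and add that many days to 0001/01/01. A quick verification against the examples above (Python as a calculator; conveniently, Python's proleptic calendar also starts at 0001/01/01):

```python
# Decode the 3-byte day count at the end of a datetime2(7) value.
from datetime import date, timedelta

def decode_days(hex_value):
    raw = bytes.fromhex(hex_value)
    days = int.from_bytes(raw[-3:], 'little')  # last 3 bytes, little-endian
    return date(1, 1, 1) + timedelta(days=days)

print(decode_days('070000000000000100'))  # 0001-09-14, the roll-over past 0xFF
print(decode_days('07000000000075250B'))  # 2001-01-01
```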

'2001/01/01 00:00:00' => 0x07000000000075250B

So that is the final 3 bytes decoded, a simple pattern – and provides the template of how the time is also stored.

'0001/01/01 00:00:00.0000000' => 0x070000000000000000
'0001/01/01 00:00:00.0000001' => 0x070100000000000000
'0001/01/01 00:00:00.0000255' => 0x07FF00000000000000
'0001/01/01 00:00:00.0065535' => 0x07FFFF000000000000
'0001/01/01 00:00:00.0065536' => 0x070000010000000000

So, to check whether the format holds right up to the limit of a second:

'0001/01/01 00:00:00.9999999' => 0x077F96980000000000

Decode that again and it all matches:

select (152 * 256 * 256) + (150 * 256) + 127

When we tick over into 1 second exactly, the first byte increments by 1 (0x7F to 0x80), so the time portion is still represented in 100ns intervals, with the usual system of each byte counting up 1 every time the previous byte rolls over. As we reach the limit of the first 3 bytes, it rolls into the 4th and then the 5th.

'0001/01/01 00:00:01.0000000' => 0x078096980000000000

So the internal format of DateTime2(7) is decoded. Not difficult, but it is an interesting choice: it is now a straight binary number, with the least significant byte on the left and the most significant on the right (for each section). To convert a section you reverse its bytes and then read the result as a normal hex number.

The first 5 bytes are recording how many time units intervals have passed since midnight, and the last 3 bytes recording how many days have passed since 0001/01/01.

The time unit intervals are dictated by the accuracy of the type: 100ns for DateTime2(7), 1 microsecond for DateTime2(6), etc. The way in which you interpret it does not change, but the unit you multiply the time portion by alters based on the accuracy.
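
Putting the pieces together, here is a small decoder for the whole value; the accuracy byte tells you the multiplier to get from stored intervals to 100ns ticks. This is a sketch based only on the observations in this post, handling the 8-byte layout of datetime2(5) to (7) shown here:

```python
# Decode a datetime2 varbinary dump: accuracy byte, little-endian time
# intervals, then a 3-byte little-endian day count from 0001/01/01.
from datetime import date, timedelta

def decode_datetime2(hex_value):
    raw = bytes.fromhex(hex_value)
    accuracy = raw[0]                               # first pseudo-byte
    days = int.from_bytes(raw[-3:], 'little')       # last 3 bytes: day count
    intervals = int.from_bytes(raw[1:-3], 'little') # middle bytes: time units
    ticks_100ns = intervals * 10 ** (7 - accuracy)  # scale to 100ns ticks
    return date(1, 1, 1) + timedelta(days=days), ticks_100ns

print(decode_datetime2('077F96980000000000'))  # 0001-01-01, 9,999,999 ticks
print(decode_datetime2('07000000000075250B'))  # 2001-01-01, midnight
```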

You could construct 2 dates that are identical at a binary level, but due to the field meta-data on accuracy, they do not represent the same date time.

declare @dt1 datetime2(6)
set @dt1 = '0001/01/01 00:00:00.000001'
declare @dt2 datetime2(7)
set @dt2 = '0001/01/01 00:00:00.0000001'


And that is perhaps why, on output, the binary value is automatically prefixed with the datetime accuracy – so that they are not entirely identical? I am not sure, but would be interested to find out.