The Sequence of an Index Uniquifier
During a training session today, I was asked about the structure of the Uniquifier, and whether it was a straight identity column. Off the top of my head I couldn’t remember the exact structure, I considered it a 4 byte int, but was not sure whether it acted as a pure identity value or acted in a more complex manner when incrementing, so decided to investigate it tonight.
To start from the beginning, an index uniquifier is the term given to the field that is automatically generated by SQL Server when you create a clustered index, but the index key is not specified as unique. Since each record in the table has to be uniquely identifiable, SQL will automatically assigned a 4 byte field to the row to make it unique, commonly called the ‘Uniquifier’. At this point I am sure English scholars will be frowning, pondering on the nature of the word and whether it qualifies as English; however that the term used so we will run with it.
It is actually quite easy to see this field in action, let’s create a simple table:
CREATE TABLE dbo.unique_test ( firstname char(20) NOT NULL, surname char(20) NOT NULL ) ON [PRIMARY] GO CREATE CLUSTERED INDEX [ix_test] ON [dbo].[unique_test] ( [firstname] ASC ) ON [PRIMARY]
The clustered index is not unique, by design, so let’s start adding duplicate rows to see the effect:
insert into unique_test values ('John', 'Smith') go 10
The table now contains 10 rows, each with the same details. This does not cause any undue concern, because each row is actually still unique – the way to show this is using the DBCC INC and DBCC Page commands, I’ve cut the output down since it is so wide.
dbcc ind ('testdb','unique_test',1) PageFID PagePID IAMFID IAMPID PageType ------- ----------- ------ ----------- -------- 1 41 NULL NULL 10 1 174 1 41 1
The output shows a data page numbered 174 for my example and the IAM page with an ID of 41. We can crack open the page and view the contents very easily using DBCC Page.
dbcc dbcc traceon(3604) dbcc page (idtest,1,174,3)
The output is quite large, but in essence, the first record is stored with the following details:
UNIQUIFIER = [NULL] Slot 0 Column 1 Offset 0x4 Length 20 firstname = John Slot 0 Column 2 Offset 0x18 Length 20
The second record:
Slot 1 Column 0 Offset 0x33 Length 4 UNIQUIFIER = 1 Slot 1 Column 1 Offset 0x4 Length 20 firstname = John Slot 1 Column 2 Offset 0x18 Length 20 surname = Smith
The third record:
Slot 2 Column 0 Offset 0x33 Length 4 UNIQUIFIER = 2 Slot 2 Column 1 Offset 0x4 Length 20 firstname = John Slot 2 Column 2 Offset 0x18 Length 20 surname = Smith
And so forth. The first record’s uniquifier is visible and clearly named within the data page, but set to null. The second copy of the same value receives the uniquifier of one, the third copy receives a 2 etc. This count is maintained separately for each duplication, so the insert of a new name multiple times will also receive its own counter, beginning at null and working upwards, 1,2,3 etc. So just because the uniquifier is 4 bytes, this does not limit the total number of rows in the table to ~2.1 billion, but does logically limit the total number of duplicates to 2.1 billion. I must confess to not having tested that limit, generating 2.1 billion rows of duplicate data is not trivial and a scrapbook calculation predicts 435 hours of processing on a virtual pc. I suspect the error message it raises when it hits the limit would be interesting.
If we remove all the rows from the table and then add 10 more does the uniquifier reset? Easy to test but the short answer was no, the uniquifier continued to rise, 10 thru 19.
I was a bit suspicious of this since any requirement for the uniquifier to rise / remember what existed before requires it to be stored somewhere – it has to survive a crash after all, but there is no apparent place the current count is stored. If there was, you wouldn’t be storing just 1 value, you would be forced to store a value for each record key that had duplicates. This could run into thousands of separate counters being maintained per clustered key so it just doesn’t make sense that it is stored, it would be a very noticable overhead.
When checking the DBCC Ind for the empty table it insisted it still had a data page, but the only contents of the data page was a single ghost record – a row that has been marked as deleted. The ghost record was the for the ‘John Smith’ with the highest uniquifier before, was this coincidence? The other ghost records had not hung around, so why did this one persist?
I dropped and recreated the table again, inserted 10 rows and then deleted them. Checking DBCC Ind the table still showed a HoBT IAM allocation page for the table and a data page, the data page contained a single ghost record, the one with a Uniquifier of 9 – the highest given out when 10 duplicates were added. Even waiting some considerable time the ghost record was not cleaned up, so it appears that it will not delete it.
If I added another duplicate row, it picked up the next number in the sequence (10) and shortly after the ghost record was removed from the page. Very convenient and not a coincidence at all – the memory of the last uniquifier given out persists as a ghost record, even if all the duplicates for the table have been removed. What seems strange is this ghost record hanging about, persisting an entire record, to keep the duplicate count for that key, when no instances of it remain on the table.
It can not possibly do this for every key since the space overhead would become very noticable again, so how does it choose what to persist, the last entry? unfortunately it doesn’t appear that simple at all, after a number of tests it appeared to only be interested in keeping the ghost entry for the record that had the highest key value, so alphabetically, the one closed to ‘Z’ for my example.
Conclusion? On the evidence, whilst the other ghost records still persist for a short time, even deleting and then adding more duplicates can see the number continue from where it left of, but given a short time for the ghost records to be removed the uniquifier will restart the sequence back at Null,1,2 etc. Except in the case of the highest entry from the index perspective, that ghost record decides to stick around until there is another entry using the same key, continuing the sequence, at which point the ghost record then finally disappears.
I can not think of any sensible reason why it would do this, can you?
Overall, the uniquifier is a cost overhead of not having a unique index, and at a cost of 4 bytes, an int identity column makes a lot of sense – for all purposes it acts the same and serves the same purpose but in a far more visible manner – so it really does not make much sense to rely on the uniquifier provided for you, take control and create your own.