It is often said that a primary key should be immutable, and this advice is echoed on a multitude of sites some of which strengthen it to a ‘law’ – but we know with databases that absolutes are rare and it is very difficult to be 100% prescriptive. There is then no mention of the clustering key being immutable alongside it, which strikes me as strange since it is just as important.
What happens within SQL Server to the row if you change the clustered key?
- If you change the clustered key value the row must be physically moved and can result in page splits / fragmentation.
- A change of the clustered key requires all the non-clustered indexes to be updated to reflect the clustering key choice.
And if you change the primary key?
- A change of the primary key, has to be reflected in each of the other tables that use the key as a linking mechanism.
- The primary key must still uniquely identify the row within the table.
Clearly different issues, but why does the primary key immutability get so much attention and not the clustering key? The default behaviour of SQL Server is that the primary key becomes the clustering key, so in essence all 4 points get applied, but you can choose a different primary key to the clustering key.
What sort of expense are we talking about if we allow the clustering key to not be immutable and started altering a row’s clustered key value? To get a better understanding of what is going on under the hood, I’ll construct an example and check the transaction log.
Creating a simple table in SQL and inserting a few rows to set up the test is pretty easy:
CREATE TABLE [dbo].[MyTable]( [MyID] [int] IDENTITY(1,1) NOT NULL, [FirstName] [varchar](20) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL, [SecondName] [varchar](30) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL, CONSTRAINT [PK_Table_1] PRIMARY KEY CLUSTERED ( [FirstName] ASC )WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY] ) ON [PRIMARY]
Insert a few rows and then issue a simple update statement.
update mytable set firstname = 'TestFirstName', Secondname = 'TestSecondName' where MyID = 2
Inspect the transaction log and it is noticable that the log does not contain a LOP_MODIFY_ROW or LOW_MODIFY_COLUMNS within it, but contains a LOP_DELETE_ROWS and a LOP_INSERT_ROWS. Instead of just modifying the data, SQL has removed the row and reinserted it. A few other items appear with the transaction, LOP_BEGIN_XACT and LOP_COMMIT_XACT as you would expect to start and commit the transaction. There is also a LOP_SET_BITS on the LCX_PFS which is not surprising to see either, since we have potentially affected the free space level of the page the data was inserted into.
That maps to exactly what we expect from a high level logical perspective – the row has to be moved and there is no LOP_MOVE_ROW operation. This results in the row being placed into the transaction log twice, as a before and after.
What happens if we perform an update that does not include the clustered key?
update mytable set Secondname = 'AnotherTestSecondName' where MyID = 3
This time the log only includes 2 entries, the LOP_BEGIN_XACT / LOP_COMMIT_XACT and a single LOP_MODIFY_ROW which is more as you would expect.
Size wise, the transaction log entry length for the first alteration was 96 + 148 + 56 + 160 + 52 = 512 bytes. For the second entry it was only 96 + 144 + 52 = 292. So the alteration used more log space and due to write ahead logging it must be committed to the disk, but the actual difference for a single row does not look too significant.
Well, whilst it does not look significant, you have to remember that the row being modified was very small. As previous examples have shown the LOP_DELETE_ROWS and LOP_INSERT_ROWS include the entire contents of the row being removed / added, so with a larger row the entire contents of the row would be added to the log twice, compared to the simple modification. That would start to get expensive.
So altering the clustering key is clearly expensive for the transaction log in comparison to a normal update, and this example did not have additional non-clustered indexes added to the table, which would also then require even more entries to deal with the removal and re-insertion of the non-clustered index values.
Given a choice I would make both immutable; the primary key shouldn’t be the only one to get special treatment and be designed to be immutable, make the clustering key immutable as well.
I attended the SQL Immersion event last year in Dublin and can honestly say that it was the best training course I have ever attended. The level of detail is phenomenal and the interaction with Paul and Kim is superb. I can not recommend the course heavily enough and anyone who is serious about SQL should make the effort to attend one of these, I would even go as far as to say fund it yourself if you have to. Prodata have not only managed to get Paul and Kim back to run it again, but have also got 2 additional master class courses being scheduled as well. I have a feeling these must be closer to including more information / material from the SQL MCM course, which would be superb, but I will have to check whether the bank balance can handle doing them.
The Immersion event is split again into two tracks, DBA and Developer although both sides of that fence benefit from having a good in-depth understanding of the other, so I would ignore the distinction and go for the full ‘Immersion’, there is no reason that a DBA shouldn’t understand indexes and index tuning in depth, or that a developer shouldn’t have a good understanding of the transaction log and internal structure within SQL. I did the full course before and spent most evenings doing even more stuff and using the day’s material to find out new things, many of which have become topics that I have written about.
Early registration to the courses attracts a 15% discount, but using the promotion code SQLH you will get a 20% discount instead. On the full Immersion course that is a further ~100 euros off the price, which can’t be bad.
The Immersion event is running from the 28th June to 1st July, and registration is here.
The two additional master classes are being run the week after, and these are advertised as being material that is not on the Immersion course, but as mentioned – I’m not entirely sure what that is, and given how much detail is on the immersion course, that is going to have to be some very deep internals stuff.
Time flies when you’re busy, and it has been far too long since I last posted. To business however, and I noticed a problem in query parameterization the other day which does not make much sense at first glance. To demonstrate, I will use the AdventureWorks example database and use a couple of simple queries. As with all parameterization issues, you need to make sure that you know which mode the database is within, so I’ll begin by setting it to Simple Mode
ALTER DATABASE AdventureWorks SET PARAMETERIZATION SIMPLE
And then run two queries, separately so that they are not considered a single batch.
Select * from HumanResources.Employee where loginid like '%a%' Select * from HumanResources.Employee where loginid like '%b%'
Under simple parameterization it should not be too surprising to see that when the query cache is inspected, the queries have not been parameterized, and 2 entries exist within the cache. So what happens when the mode is changed to Forced?
ALTER DATABASE AdventureWorks SET PARAMETERIZATION FORCED
Clear down the query cache and try the two queries again, in the hope of a plan cache hit – and it hasn’t changed. Two query plans still show in the cache and there was no parameterization. Perhaps it is the existence of the 2 wildcard characters? no, altering the wild-cards makes no difference, removing them entirely still results in the query plan generating a separate plan cache entry.
Parameterization is not limited to dates and numbers, it will work on strings without any problem, but clearly the introduction of the like clause prevents the cache hit. This behaviour is on both SQL Server 2005 and 2008 – which is a bit annoying.
So how can we get around this problem?
Well bizarrely by just using a different syntax to mean the same thing. PatIndex works just like the like clause and takes a wildcard, but returns the position. In a like clause we are just interested in a match – whilst the pat index gives us a character position. If the pattern is not found it returns zero, so the simple replacement is to using patindex and look for any values greater than zero.
Select * from HumanResources.Employee where patindex('%a%',loginid) > 0 Select * from HumanResources.Employee where patindex('%b%',loginid) > 0
In simple mode this still produces 2 cache hits, but in forced mode you get a plan cache hit finally!
If solving it was only that simple… by using PatIndex the query where clause has become non-sargable – which makes no difference if you have a wild card either side of your expression but if you only had a trailing wildcard then this would produce a very bad performance hit. The cost of the extra query plan in memory is unlikely to be more than the use of scans to resolve the query, so faced with a few additional query plans in memory using wildcards, you might be best to leave them there.