Archive
Decoding a Simple Update Statement Within the Transaction Log
After decoding the delete statement within the transaction log, I thought I would tackle another statement – an update statement, although I am attempting the simple version of the update which is only to update a single value within the fixed width fields section. Again to assist in decoding the log entry, I have used the AdventureWorks database and know the values of the data before and after the modification, so it is easier to decode how the log entry is constructed.
As before, the database has been backed up whilst in fully logged mode and the transaction log backed up to keep it as small as possible. Once the database was ready I issued a simple statement against it:
update HumanResources.Employee set MaritalStatus = 'M' where employeeid = 6
The employee with the ID of 6 has changed marital status, congratulations employee id 6.
The first confusion is that two LOP_MODIFY_ROW entries show up against the LCX_CLUSTERED context for the HumanResources.Employee.PK_Employee_EmployeeID allocation unit, even though only one modification has been made.
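For reference, a query along these lines against fn_dblog will surface the entries being discussed – a sketch rather than the exact query used for the post, using the column names as the post refers to them:
Select [Current LSN], Operation, Context, [Offset in Row],
       [RowLog Contents 0], [RowLog Contents 1]
from ::fn_dblog(null,null)
where AllocUnitName = 'HumanResources.Employee.PK_Employee_EmployeeID'
  and Operation = 'LOP_MODIFY_ROW'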
What is the second modification?
Against each LOP_MODIFY_ROW there is an [Offset in Row] column, which indicates where within the row the modification has been made. The first modification is recorded at byte 24, the second is at byte 57. Instead of providing a field number / name, the alteration is being recorded at a byte level, so to figure out which columns have changed, we must map out the structure in terms of byte positions for the row.
Offset | Name | Size in Bytes | Value |
---|---|---|---|
0 | Status Bits | 2 | |
2 | Column Number Offset | 2 | |
4 | EmployeeID | 4 | 6 |
8 | ContactID | 4 | 1028 |
12 | ManagerID | 4 | 109 |
16 | BirthDate | 8 | 1965-04-19 00:00:00.000 |
24 | MaritalStatus | 2 | S |
26 | Gender | 2 | M |
28 | HireDate | 8 | 1998-01-20 00:00:00.000 |
36 | SalariedFlag | 1 | 1 |
37 | VacationHours | 2 | 40 |
39 | SickLeaveHours | 2 | 40 |
41 | CurrentFlag | 0 (bit, packed into the SalariedFlag byte) | 1 |
41 | rowguid | 16 | E87029AA-2CBA-4C03-B948-D83AF0313E28 |
57 | ModifiedDate | 8 | 2010-02-20 18:13:03.710 |
65 | Column Count | 2 | |
67 | Nullability Bitmap | 2 | |
69 | Variable Column Count | 2 | 3 |
71 | Variable Offsets | 6 | |
77 | NationalIDNumber | 16 | 24756624 |
93 | LoginID | 44 | adventure-works\david0 |
137 | Title | 34 | Marketing Manager |
As the earlier decode showed, the two byte nullability bitmap is still present even though no values in this row are null – it covers every column in the table regardless. This mapping is specific to this row, but the offsets for the fixed portion of the row would be valid for all the records; it is only the variable width columns that would change per data row.
The two modifications are at bytes 24 and 57, which clearly map to the Marital Status and the Modified Date.
Why is the modified date being changed? The AdventureWorks employee table has an update trigger on it, which sets the modified date column to the current date when any change is made to the row.
For the Marital Status transaction log entry, there are two values we logically need to know – what it was before the change, and what it is after the change.
The Row Log Contents 0 and Row Log Contents 1 fields clearly provide that information:
Row Log Contents 0 : 0x53
Row Log Contents 1 : 0x4D

select convert(nchar,0x53) -- = S
select convert(nchar,0x4D) -- = M
So they are clearly the before and after results.
Checking the transaction log entry generated by the trigger, the row log contents are dates in hex format, which are not too difficult to deal with.
Row Log Contents 0 contains the previous modified date, although the SQL Management Studio results show the hex has been shortened somewhat from the normal 8 bytes for a date. Padding the extra zeros back in gives us:
0x 00 00 00 00 35 95 00 00 = 0x 00 00 95 35 00 00 00 00 = 2004-07-31 00:00:00.000
The same rules apply to the RowLog Contents 1 which has also been cut short.
0x A5 F1 D6 00 24 9D 00 00 = 00 00 9D 24 00 D6 F1 A5 = 2010-02-21 13:02:35.217
Which is unsurprisingly today’s date and when I made the modification.
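The same convert trick used in the earlier delete decode can be used to double check both byte-swapped values – a quick sanity check of my own rather than anything from the original post:
select convert(datetime, 0x0000953500000000) -- 2004-07-31 00:00:00.000
select convert(datetime, 0x00009D2400D6F1A5) -- 2010-02-21 13:02:35.217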
The immediate question is: did the log contain two entries because there were two different causes of the change, or does it record a transaction log row per value modified? Since the Offset in Row value is within the transaction log row, you could guess at one entry per change.
To check, I restored the database to the starting position and issued a statement with two modifications:
update HumanResources.Employee set MaritalStatus = 'M', VacationHours = 60 where employeeid = 6
The transaction log alters significantly: the modified date LOP_MODIFY_ROW entry with an offset of 57 still gets added, but the alteration of the two values does not produce 2 x LOP_MODIFY_ROW, just a single LOP_MODIFY_COLUMNS – this appears to have a more complex format that is going to be tougher to crack, so that will have to wait until another day.
How Do You Decode A Simple Entry in the Transaction Log? (Part 2)
In the previous post we pulled out the raw information relating to a transaction log entry for a deletion of a single row. In this post we will decode the binary section of the log entry for the deletion performed.
A reminder of the Row Log Contents for the Clustered Index log entry that we need to decode.
0x3000410001000000B90400001000000000000000406700004D004D0000000000CB8900004A15001E004AD0E1AA37C27449B4D593524773771800000000359500001000000003005D008500BD003100340034003100370038003000370061006400760065006E0074007500720065002D0077006F0072006B0073005C006700750079003100500072006F00640075006300740069006F006E00200054006500630068006E0069006300690061006E0020002D0020005700430036003000
So we have 189 bytes of information relating to the deletion; the question is – what is it? The most obvious answer has to be the record that has been deleted – lots of fixed log information has already been exposed as named fields, so this generic variable length value must be the row that we deleted.
How to decode it though? Applying the structure of a row is a good place to start; the format mentioned before is detailed very nicely in Kalen Delaney's SQL Server Internals book, so let's use that as a guide.
- 2 Bytes : Status Bits
- 2 Bytes : Offset to find number of columns
- X Bytes : Fixed Length Columns
- 2 Bytes : Total Number of Columns in the data row
- 1 Bit per column, rounded up : Nullability Bitmap
- 2 Bytes : Number of Variable Length Columns within the data row
- 2 Bytes per variable length column : Row Offset marking the end of each variable length column
- X Bytes : Variable Length Columns
Ok, let’s start assigning the values from the log hex dump into the fields.
Status Bits (2 Bytes) : 3000
Offset to Find Number of Columns (2 Bytes) : 4100
Because of the endian ordering in effect, the offset should be read as 0x0041 – 65 in decimal. So there must be 61 bytes of fixed width data, since the status bits and this offset have already used the first 4 bytes.
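Once the bytes have been reversed by hand, SQL itself can do the conversion to decimal – a small check of my own, not from the original post:
select convert(int, 0x0041) -- = 65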
Let’s remind ourselves of the table structure:
CREATE TABLE [HumanResources].[Employee](
    [EmployeeID] [int] IDENTITY(1,1) NOT NULL,
    [NationalIDNumber] [nvarchar](15) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
    [ContactID] [int] NOT NULL,
    [LoginID] [nvarchar](256) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
    [ManagerID] [int] NULL,
    [Title] [nvarchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
    [BirthDate] [datetime] NOT NULL,
    [MaritalStatus] [nchar](1) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
    [Gender] [nchar](1) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
    [HireDate] [datetime] NOT NULL,
    [SalariedFlag] [dbo].[Flag] NOT NULL CONSTRAINT [DF_Employee_SalariedFlag] DEFAULT ((1)),
    [VacationHours] [smallint] NOT NULL CONSTRAINT [DF_Employee_VacationHours] DEFAULT ((0)),
    [SickLeaveHours] [smallint] NOT NULL CONSTRAINT [DF_Employee_SickLeaveHours] DEFAULT ((0)),
    [CurrentFlag] [dbo].[Flag] NOT NULL CONSTRAINT [DF_Employee_CurrentFlag] DEFAULT ((1)),
    [rowguid] [uniqueidentifier] ROWGUIDCOL NOT NULL CONSTRAINT [DF_Employee_rowguid] DEFAULT (newid()),
    [ModifiedDate] [datetime] NOT NULL CONSTRAINT [DF_Employee_ModifiedDate] DEFAULT (getdate()),
    CONSTRAINT [PK_Employee_EmployeeID] PRIMARY KEY CLUSTERED
The fixed columns are as follows:
[EmployeeID] [int] = 4
[ContactID] [int] = 4
[ManagerID] [int] = 4
[BirthDate] [datetime] = 8
[MaritalStatus] [nchar](1) = 2
[Gender] [nchar](1) = 2
[HireDate] [datetime] = 8
[SalariedFlag] [dbo].[Flag] = 1
[VacationHours] [smallint] = 2
[SickLeaveHours] [smallint] = 2
[CurrentFlag] [dbo].[Flag] = 0 - second bit.
[rowguid] [uniqueidentifier] = 16
[ModifiedDate] [datetime] = 8
Adventure Works uses a user defined type of Flag, which is just declared as a bit. So that's 61 bytes of fixed width storage – which is the value we expected.
We need to know the column order; we could assume it is the same as the table declaration, but we should double check the actual column ordering to be certain.
select colorder, syscolumns.name
from syscolumns
    inner join systypes on syscolumns.xusertype = systypes.xusertype
where id = object_id('HumanResources.Employee') and variable = 0
order by colorder

colorder name
-------- --------------
1        EmployeeID
3        ContactID
5        ManagerID
7        BirthDate
8        MaritalStatus
9        Gender
10       HireDate
11       SalariedFlag
12       VacationHours
13       SickLeaveHours
14       CurrentFlag
15       rowguid
16       ModifiedDate
So the order looks good, let’s start assigning values:
[EmployeeID] [int] = 01000000 = 00000001 = 1
[ContactID] [int] = B9040000 = 000004B9 = 1209
[ManagerID] [int] = 10000000 = 00000010 = 16
[BirthDate] [datetime] = 0000000040670000 = 0000674000000000 = '1972-05-15 00:00:00.000' - (Use Select convert(datetime, 0x0000674000000000) to check)
[MaritalStatus] [nchar](1) = 4D00 = 004D = 77 = Unicode for M (Use Select nchar(77) to check)
[Gender] [nchar](1) = 4D00 = 004D = 77 = M
[HireDate] [datetime] = 00000000CB890000 = 000089CB00000000 = '1996-07-31 00:00:00.000'
[SalariedFlag] [dbo].[Flag] = 4A = 01001010 in binary, which will need further work.
[VacationHours] [smallint] = 1500 = 0015 = 21
[SickLeaveHours] [smallint] = 1E00 = 001E = 30
[CurrentFlag] [dbo].[Flag] = contained within the first flag byte
[rowguid] [uniqueidentifier] = 4AD0E1AA37C27449B4D5935247737718
[ModifiedDate] [datetime] = 0000000035950000 = 0000953500000000 = '2004-07-31 00:00:00.000'

Number of Columns : 1000 = 0010 = 16 columns - which is correct.
Nullability Bitmap : 1 bit per column, so 16/8 = 2 bytes : 0000
Number of Variable Length Columns : 0300 = 0003 = 3 variable length columns
Var Column End 1 : 5D00 = 005D = 93
Var Column End 2 : 8500 = 0085 = 133
Var Column End 3 : BD00 = 00BD = 189
The number of columns was at offset 65 and used 2 bytes, the nullability bitmap used 2, the number of variable length columns used 2, and the three variable column end offsets used a further 2 each, so the variable length data starts at byte 65 + 12 = 77. The first variable width column ends at 93, so it is using 16 bytes of space.
I should mention why the nullability bitmap was 2 bytes: there is a bit for every column in the table, not just a bit per nullable column – it is an all or nothing structure – so 16 columns rounds up to 2 bytes, and no bits are set here since every value in this row is populated.
Decoding the first variable width column (the NationalIDNumber):
Raw Values    : 3100 3400 3400 3100 3700 3800 3000 3700
Reverse Bytes : 0031 0034 0034 0031 0037 0038 0030 0037
Result        : 14417807
(You can use Select NChar(0x0031) etc. to check each value.)
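Since nchar data is stored low byte first, the whole chunk can also be converted in one go rather than reversing each pair by hand – a convenience check of my own, not part of the original decode:
select convert(nvarchar(8), 0x31003400340031003700380030003700) -- = 14417807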
Next up is 93 to 133, a gap of 40 bytes.
Raw Values    : 6100 6400 7600 6500 6E00 7400 7500 7200 6500 2D00 7700 6F00 7200 6B00 7300 5C00 6700 7500 7900 3100
Reverse Bytes : 0061 0064 0076 0065 006E 0074 0075 0072 0065 002D 0077 006F 0072 006B 0073 005C 0067 0075 0079 0031
Output        : adventure-works\guy1
The final column should be 189 – 133 = 56 bytes long, which it is; decoding it in the same way gives the following.
Raw Values    : 5000 7200 6F00 6400 7500 6300 7400 6900 6F00 6E00 2000 5400 6500 6300 6800 6E00 6900 6300 6900 6100 6E00 2000 2D00 2000 5700 4300 3600 3000
Reverse Bytes : 0050 0072 006F 0064 0075 0063 0074 0069 006F 006E 0020 0054 0065 0063 0068 006E 0069 0063 0069 0061 006E 0020 002D 0020 0057 0043 0036 0030
Result        : Production Technician - WC60
That is the row contents decoded, and the deleted row’s original values are now exposed.
As you can see from the complexity of decoding one part of a trivial delete operation within the log, this is not something that you would wish to rely on or use as an audit trail. But hopefully it has provided some insight and added to the minimal documentation on how to approach the decoding process. I will try to decode more of the operations, as insert and update shouldn't be too hard to manage, but success is not guaranteed.
How Do You Decode A Simple Entry in the Transaction Log? (Part 1)
There is always a lot of interest in reading log records within SQL, although when you start getting into it you soon realise that it is an undocumented quagmire and quickly decide that there are better things you can do with your time.
It has taken a non-trivial amount of time to decode and has created a long documented process, so it will be split into two parts. I should caveat this undertaking with two clear statements first:
- There is no real reason you would want to go poking around in the log except for 'interest value' – if you need to actually read log files properly then, as you are about to see, commercial applications will look very cheap in comparison to the time you would spend trying to do this. Even better – make sure you have proper backups so that you never need to try to retrieve rows from the log file!
- There is basically very little external documentation to guide you in reading an entry. I am learning it by doing it, using what little documentation there is, along with some logical deductions and a good understanding of internals from training – so this is designed to act as an example of figuring a log entry out, but also to demonstrate that it is not trivial and not something you will want to do.
That said, let’s start by trying to work out a single log record.
Using the AdventureWorks example database, I have set it into fully logged mode and taken a backup to ensure that the transaction log is being used correctly. I've removed the foreign key constraints and triggers on the HumanResources.Employee table, since I want to be able to delete a row and not trip over the example FKs and the trigger that prevents deletion.
To get the log values out, I am using the following snippet of SQL:
Select * from ::fn_dblog(null,null)
On initial inspection the log has only a few rows, altering the Diff Map, File Header etc. So I issued the command to delete a single row from the table.
Delete from HumanResources.Employee where EmployeeId = 1
And then selected the contents of the log again; I have copied some of the fields from the output:
Current LSN | Operation | Context | Transaction ID | Log Record Length | AllocUnitName |
---|---|---|---|---|---|
000001bd:000001d8:0002 | LOP_BEGIN_XACT | LCX_NULL | 0000:0000da58 | 96 | NULL |
000001bd:000001d8:0003 | LOP_DELETE_ROWS | LCX_MARK_AS_GHOST | 0000:0000da58 | 148 | HumanResources.Employee.AK_Employee_LoginID |
000001bd:000001d8:0004 | LOP_SET_BITS | LCX_DIFF_MAP | 0000:00000000 | 56 | Unknown Alloc Unit |
000001bd:000001d8:0005 | LOP_DELETE_ROWS | LCX_MARK_AS_GHOST | 0000:0000da58 | 124 | HumanResources.Employee.AK_Employee_NationalIDNumber |
000001bd:000001d8:0006 | LOP_SET_BITS | LCX_DIFF_MAP | 0000:00000000 | 56 | Unknown Alloc Unit |
000001bd:000001d8:0007 | LOP_DELETE_ROWS | LCX_MARK_AS_GHOST | 0000:0000da58 | 120 | HumanResources.Employee.AK_Employee_rowguid |
000001bd:000001d8:0008 | LOP_DELETE_ROWS | LCX_MARK_AS_GHOST | 0000:0000da58 | 108 | HumanResources.Employee.IX_Employee_ManagerID |
000001bd:000001d8:0009 | LOP_SET_BITS | LCX_DIFF_MAP | 0000:00000000 | 56 | Unknown Alloc Unit |
000001bd:000001d8:000a | LOP_DELETE_ROWS | LCX_MARK_AS_GHOST | 0000:0000da58 | 288 | HumanResources.Employee.PK_Employee_EmployeeID |
000001bd:000001d8:000b | LOP_SET_BITS | LCX_PFS | 0000:00000000 | 56 | HumanResources.Employee.PK_Employee_EmployeeID |
000001bd:000001d8:000c | LOP_COMMIT_XACT | LCX_NULL | 0000:0000da58 | 52 | NULL |
That’s a lot of entries for a single delete – which is explained when you check the AllocUnitName, the delete has to also delete the entries within the indexes and the Adventure works table I am working against does indeed have 5 indexes, 4 Non-clustered and 1 Clustered. So that is making a lot of sense, and the whole operation is surrounded by a LOP_BEGIN_XACT and LOP_COMMIT_XACT, and we know from normal SQL terminology with SET XACT ABORT ON / OFF that it is about transactional scope and whether a whole transaction rolls back if a single item within the statement.
Let’s concentrate on the record deletion for the clustered index, with the larger log length of 288. That is a long record for what is in theory marking the row as a ghost, which suggests the row is within the log record. This is also backed up by the differing lengths for the other ghost record marks, which differ in size, just as the index row size does.
What are the interesting fields available to us from the Log record that we can pick up on straight away:
- Transaction ID : All the deletions belong to the same transaction ID, so we know they are related / part of the same transactional operation.
- Previous LSN : Each subsequent deletion in the list, shows the previous one as the previous LSN, so we can see the chain of log records.
- Page ID : We can tell which page was altered.
- Slot ID : Which slot within the page was altered.
- SPID : The LOP_BEGIN_XACT row shows the SPID that issued the command
- BeginTime : The LOP_BEGIN_XACT row shows the DateTime when the command was issued.
- Transaction Name : The LOP_BEGIN_XACT row shows the type of transaction; for this one it shows DELETE.
- EndTime : The LOP_COMMIT_XACT row shows the end time of the transaction.
Interestingly, there is no indication as to who issued the command; there is a Transaction SID – which might turn out to be the account SID, but that will be a tangential investigation.
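For completeness, pulling those fields out of fn_dblog for this transaction would look something like the sketch below – my own query, using the column names fn_dblog exposes; if the Transaction SID does turn out to be the login SID, SUSER_SNAME should translate it:
Select [Current LSN], Operation, [Transaction ID], [Previous LSN],
       [Page ID], [Slot ID], SPID, [Begin Time], [End Time],
       [Transaction Name], SUSER_SNAME([Transaction SID]) as IssuedBy
from ::fn_dblog(null,null)
where [Transaction ID] = '0000:0000da58'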
The final column to have a look at is the RowLog Contents; for the clustered index row deletion, RowLog Contents 0 has the larger set of data to work on.
0x3000410001000000B90400001000000000000000406700004D004D0000000000CB8900004A15001E004AD0E1AA37C27449B4D593524773771800000000359500001000000003005D008500BD003100340034003100370038003000370061006400760065006E0074007500720065002D0077006F0072006B0073005C006700750079003100500072006F00640075006300740069006F006E00200054006500630068006E0069006300690061006E0020002D0020005700430036003000
If that does not make you go running for the hills, then you have a chance of decoding it.
That's only 189 bytes, but this is the row log contents, not the actual log entry itself, so the fixed log fields are not within it; let's work on it. The structure of a row is a good place to start, since we expect the structure of the row to be present within the log. The format is very nicely detailed in Kalen Delaney's SQL Server Internals book, so let's use that.
- 2 Bytes : Status Bits
- 2 Bytes : Offset to find number of columns
- X Bytes : Fixed Length Columns
- 2 Bytes : Total Number of Columns in the data row
- 1 Bit per column, rounded up : Nullability Bitmap
- 2 Bytes : Number of Variable Length Columns within the data row
- 2 Bytes per variable length column : Row Offset marking the end of each variable length column
- X Bytes : Variable Length Columns
We can use this in Part 2 to decode the binary contents of the log record.
Why Are Date Tricks in SQL 2005 a Problem in Waiting?
One of the long time annoyances about the Date functions in SQL is that a number of them do not behave as a developer would wish; the prime example I'm going to focus on is the DateDiff function. The way it operates is very contrary to what a developer might expect or want from it – the function counts the number of boundaries that are crossed for the specific unit selected, not the number of whole units between the dates. As an example:
declare @date1 datetime
declare @date2 datetime
set @date1 = '20100101 23:00:00'
set @date2 = '20100102 01:00:00'
select datediff(d, @date1, @date2)
And the result is 1: since the unit selected was days, the boundary line is predictably at midnight, so even though the time span is only 2 hours it counts as 1 day – that is not intuitive. Now, this is all documented so we cannot complain or grumble. If you wanted to know whether a full day had passed, you would use hourly units instead and make sure you had the logic to deal with this.
All of this leaves you with a pretty poor resolution however; you can get the difference in hours, but the remaining minutes and seconds are not available directly – so you have to datediff on those units as well and do some maths. It really makes for a ham-fisted way of getting a duration.
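As a rough illustration of that 'datediff and do some maths' approach (my own sketch, not code from the post), the whole duration can be taken in seconds and then carved up:
declare @date1 datetime, @date2 datetime
set @date1 = '20100101 12:00:00'
set @date2 = '20100102 13:15:35'

-- total seconds between the two dates (fine for spans well under datediff's int limit)
declare @seconds int
set @seconds = datediff(ss, @date1, @date2)

select @seconds / 86400            as Days,
       (@seconds % 86400) / 3600   as Hours,
       (@seconds % 3600) / 60      as Minutes,
       @seconds % 60               as Seconds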
So people work around the problem by converting the date to the numeric equivalent and manipulating that directly.
declare @date1 datetime
set @date1 = '20100101 12:15:30'
select convert(float,@date1)

40177.5107638889
The decimal part represents how far through the day the time is, which is not really how the underlying binary storage works – that uses a number representing the number of 1/300ths of a second since the day started.
This format is very forgiving though; if you wanted to add a day, instead of using DateAdd you could just add 1 to the number – very convenient.
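For example (a quick sketch of the trick being described, not code from the post):
declare @date1 datetime
set @date1 = '20100101 12:15:30'
select convert(datetime, convert(float,@date1) + 1) -- 2010-01-02 12:15:30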
It does however make it easier to create a pseudo-timespan by deducting one date's numeric representation from another's, although the code is somewhat long-winded. As a side note, make sure you convert to float and not real – real does not have sufficient accuracy for this to work.
declare @date1 datetime
declare @date2 datetime
set @date1 = '20100101 12:00:00'
set @date2 = '20100102 13:15:35'

declare @result float
set @result = convert(float,@date2) - convert(float,@date1)

declare @DurationDays float
declare @DurationTime float
declare @DurationHours float
declare @DurationMinutes float
declare @DurationSeconds float

set @DurationDays = floor(@result)
set @DurationTime = (@result - floor(@result))
set @DurationTime = @DurationTime * 86400
set @DurationHours = floor(@DurationTime / 3600)
set @DurationTime = @DurationTime - @DurationHours * 3600
set @DurationMinutes = floor(@DurationTime/60)
set @DurationTime = @DurationTime - @DurationMinutes * 60
set @DurationSeconds = @DurationTime

select @DurationDays as Days, @DurationHours as Hours, @DurationMinutes as Minutes, @DurationSeconds as Seconds

Days              Hours             Minutes           Seconds
----------------- ----------------- ----------------- -----------------
1                 1                 15                35.0000002188608
A bit of a hack – and was it really any shorter or better? Debatable. Whilst it can get time span information out, when used within SQL 2008 with the new datetime2 types the wheels fall off:
declare @date1 datetime2(7)
set @date1 = '20100101 12:00:00'
select convert(float,@date1)

Msg 529, Level 16, State 2, Line 3
Explicit conversion from data type datetime2 to float is not allowed.
And that is where the problem comes in – the new datetime2 types will not allow the date to be converted to a number, and a number of these developer tricks no longer work.
Most if not all of the tricks can be re-written to use multiple date functions with some mathematical logic – and it can be done without the calling code ever knowing, if the trick had been embedded within a function / stored procedure. The less transparent move to the datetime2 data types is where developers have embedded some of the tricks directly into ad-hoc SQL; that will fail if the type is altered. In an ideal world the code would never contain these techniques of course, but we do not all live in that nirvana.
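For instance, the 'add 1 to the float' trick shown earlier has a straightforward rewrite that survives the move to datetime2 – my sketch, not from the post:
declare @date1 datetime2(7)
set @date1 = '20100101 12:15:30'
select dateadd(day, 1, @date1) -- works for datetime2, unlike convert(float, ...) + 1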
So on the one hand datetime2 gives great accuracy and can reduce storage, but on the other hand the tricks used in the past to deal with the inadequacies of the built-in date functions no longer work.
What I would really like is a revamp of the Date functions and the introduction of a time span type – could be a long wait.
Case Sensitivity in the Query Plan Cache
A surprising facet of the query plan cache is that it matches ad-hoc queries not only on their text, but the case of the query text must exactly match as well. This is documented quite clearly on MSDN although it is a bit of a surprising behaviour. It also does not change based on whether the collation of the server is case-sensitive or not.
The documentation gives a statement on case-sensitivity in the plan cache, but no mention of whether the behaviour changes under ‘Forced’ parameterization, which asks SQL to be more aggressive in extracting query literals and generating a query cache hit – so I decided to have a look and see how it acted in ‘Forced’ vs ‘Simple’.
Whether the database was in ‘Simple’ or ‘Forced’ the behaviour did not change – but it turns out that it is not case-sensitive on keywords, just on the object names.
To show the case sensitivity behaviour I have used the AdventureWorks sample database as a testing ground. Prior to each test I cleared the procedure cache using DBCC FreeProcCache.
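The post does not show the exact query used to inspect the cache; something along these lines against the plan cache DMVs would produce the sql_statement / execution_count output shown below (an assumption on my part):
SELECT st.text AS sql_statement, cp.usecounts AS execution_count
FROM sys.dm_exec_cached_plans cp
    CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) st
WHERE cp.cacheobjtype = 'Compiled Plan'
  AND st.text NOT LIKE '%dm_exec_cached_plans%'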
I then issued two very simple queries:
SELECT * from humanresources.employee
Select * from humanresources.employee
When the query cache is inspected, there are two entries – it remains case-sensitive.
sql_statement                                 execution_count
--------------------------------------------- --------------------
SELECT * from humanresources.employee         1
Select * from humanresources.employee         1
The query is so simple and has no parameters, so I suspect the simple / forced parameterization routines do not activate.
If we add a parameter to the query, then the parameterization activates and gets to work on the query string prior to the cache lookup. Both simple and forced are able to cope with such a simple query so both performed the parameterization.
SELECT * FROM humanresources.employee WHERE employeeID = 1
Select * From humanresources.employee Where employeeID = 1
Inspect the query plan cache when running in forced:
sql_statement                                                           execution_count
----------------------------------------------------------------------- --------------------
(@0 int)select * from humanresources . employee where employeeID = @0   2
Inspect the query plan cache when running in simple:
sql_statement                                                                 execution_count
----------------------------------------------------------------------------- --------------------
(@1 tinyint)SELECT * FROM [humanresources].[employee] WHERE [employeeID]=@1   2
The results show a plan cache hit, but more importantly show up a rather obvious difference in the parameterization routines for each mode:
- Simple changes keywords to upper case, Forced changes them to lowercase.
- Simple places square brackets around the objects, forced does not.
- Simple chooses to replace the literal with a tinyint, Forced uses an Int.
- Simple starts the parameters at @1, forced starts at @0
The differences can be filed under bizarre, strange and just inconsistent, although they do both get the job done, which counts at the end of the day.
What is then disappointing is that the same is not true for the tables and fields named in the query. Changing the case of one of the objects prevents the caching again.
select * from humanresources.employee where EmployeeID = 1
Select * From humanresources.employee where employeeID = 1
Inspect the query cache (this one from forced mode):
sql_statement                                                           execution_count
----------------------------------------------------------------------- --------------------
(@0 int)select * from humanresources . employee where EmployeeID = @0   1
(@0 int)select * from humanresources . employee where employeeID = @0   1
So we are back to the situation of no cache hit.
It seems very strange that the parameterization only ensures the casing of the keywords is consistent to give it a better chance of a query plan cache hit – on a case-insensitive server that is a valid optimization to try to increase the chances of a plan cache hit.
The converse, you would think, is that it would be an inherently risky optimization on a case-sensitive database – but in fact it is an optimization that would never be needed or made. If anything, a case-sensitive database server will have a better chance of making a query plan cache hit, since all the table names and field names have to exactly match the stored object names, and so the queries have a greater chance of matching each other.
It could clearly do more to try to give a match, but I suspect the complications and edge cases, such as database / server collation mismatches, account for why this is harder to improve than it might at first seem.
Dynamic Partitioning : What objects are using a Partition Schema? (SQL Tuesday #002)
As part of the look at dynamic partitioning, one of the first problems I have come across is finding what objects are currently placing data within a partition schema – these can be tables, indexes on a table, or indexes on a view (which can also be partitioned).
This has tied in nicely with Adam Machanic’s SQL Tuesday in which we were to describe a confusing situation and possible solution.
Certainly it has been a bit confusing to get to the bottom of what should be a relatively trivial question – how do I programmatically determine what objects are using my partition schema?
If I am going to auto-balance the partitions then I have to know what objects are using the partition schema and will be affected by any balancing – we cannot assume it is a single object, since we know both the table and its non-clustered indexes will often be aligned on the same partition schema, as could other tables / objects.
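For context, the examples below refer to a partition function and scheme named pfBalanceLeft / psBalanceLeft; the post does not show how they were created, but a setup along these lines (boundary values and filegroup placement invented purely for illustration) would match the outputs that follow:
CREATE PARTITION FUNCTION pfBalanceLeft (int)
AS RANGE LEFT FOR VALUES (10, 20, 30, 40, 50, 60, 70, 80, 90) -- 9 boundaries = 10 partitions

CREATE PARTITION SCHEME psBalanceLeft
AS PARTITION pfBalanceLeft ALL TO ([PRIMARY])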
So the first place I chose to check was the system views for the partition objects, sys.partition_functions and sys.partition_schemes – with the partition schemes being where you would expect to start.
SELECT * FROM sys.partition_schemes
name | data_space_id | type | type_desc | is_default | function_id |
---|---|---|---|---|---|
psBalanceLeft | 65601 | PS | PARTITION_SCHEME | 0 | 65536 |
Unfortunately the results from the partition schemes view are spectacularly unhelpful; aside from inheriting a number of columns from the data spaces system view, it only adds function_id – the ID of the partition function used in the scheme. It at least has the name of the partition scheme, so that is definitely going to be used at a later point.
The immediately useful looking value is the function_id linking the scheme to the partition function, so I had a look inside the partition functions view to remember what it holds.
SELECT * FROM sys.partition_functions
name | function_id | type | type_desc | fanout | boundary_values_on_right |
---|---|---|---|---|---|
pfBalanceLeft | 65536 | R | RANGE | 10 | 0 |
The output does not particularly lead anywhere useful – the function most certainly is not going to tell me which objects are assigned to it, since tables / indexes get assigned directly to the partition scheme, so this looks like a dead end. The only other option is to go to the data spaces system view:
SELECT ps.name, ds.*
FROM sys.partition_schemes ps
    INNER JOIN sys.data_spaces ds on ds.data_space_id = ps.data_space_id
name | name | data_space_id | type | type_desc | is_default |
---|---|---|---|---|---|
psBalanceLeft | psBalanceLeft | 65601 | PS | PARTITION_SCHEME | 0 |
Not a stellar move – there are no obvious leads here.
So I can obtain the relation between the partition scheme and the storage, but that is it so far. Given those two dead ends I next considered the problem from the opposite direction – sys.partitions claims to contain a row for every partition of each table and index in the database – which should provide another starting point.
SELECT * FROM sys.partitions
Some of the output was as follows:
partition_id | object_id | index_id | partition_number | hobt_id | rows |
---|---|---|---|---|---|
72057594040549376 | 53575229 | 2 | 1 | 72057594040549376 | 10 |
72057594040614912 | 53575229 | 2 | 2 | 72057594040614912 | 10 |
72057594040680448 | 53575229 | 2 | 3 | 72057594040680448 | 10 |
72057594040745984 | 53575229 | 2 | 4 | 72057594040745984 | 10 |
72057594040811520 | 53575229 | 2 | 5 | 72057594040811520 | 10 |
72057594040877056 | 53575229 | 2 | 6 | 72057594040877056 | 10 |
72057594040942592 | 53575229 | 2 | 7 | 72057594040942592 | 10 |
72057594041008128 | 53575229 | 2 | 8 | 72057594041008128 | 10 |
72057594041073664 | 53575229 | 2 | 9 | 72057594041073664 | 10 |
72057594041139200 | 53575229 | 2 | 10 | 72057594041139200 | 136 |
72057594041204736 | 149575571 | 1 | 1 | 72057594041204736 | 10 |
72057594041270272 | 149575571 | 1 | 2 | 72057594041270272 | 10 |
72057594041335808 | 149575571 | 1 | 3 | 72057594041335808 | 10 |
72057594041401344 | 149575571 | 1 | 4 | 72057594041401344 | 10 |
72057594041466880 | 149575571 | 1 | 5 | 72057594041466880 | 10 |
72057594041532416 | 149575571 | 1 | 6 | 72057594041532416 | 10 |
72057594041597952 | 149575571 | 1 | 7 | 72057594041597952 | 10 |
72057594041663488 | 149575571 | 1 | 8 | 72057594041663488 | 10 |
72057594041729024 | 149575571 | 1 | 9 | 72057594041729024 | 10 |
72057594041794560 | 149575571 | 1 | 10 | 72057594041794560 | 136 |
This definitely has my partition schema in there somewhere since I know I have 10 partitions and have set the row quantities up to be 10 rows for the first 9 partitions and 136 rows in the tenth, it is pretty visible.
I’ve also got an indexed view on the same table which explains the duplicate set of values, and a NC index on the table which explained the triplicate set of values I’ve not pasted in. This is in essence what I am after though; finding out which object_id’s reference the partition schema.
A couple of immediate dead ends have also appeared:
- The partition_id looks useful but is the unique id of the partition record not an ID relating to the partition schema.
- The hobt_id is the heap or b-tree pointer for that specific partition so is not going to be able to help us since there are 10 hobt_id’s per object on the schema, all different.
It does however provide the object_id, which we know we can translate into a name very easily, and a partition_number column which only ever exceeds 1 on a partitioned table. So a bit of a throw-away style query to select just those with a partition_number of 2 (to make sure we only select partitioned objects) gave me the following:
SELECT o.Name, s.*
FROM sys.partitions s
    INNER JOIN sys.objects o on s.object_id = o.object_id
WHERE partition_number = 2
Name | partition_id | object_id | index_id | partition_number |
---|---|---|---|---|
tblTestBalance | 72057594039304192 | 53575229 | 1 | 2 |
foo | 72057594039959552 | 85575343 | 0 | 2 |
tblTestBalance | 72057594040614912 | 53575229 | 2 | 2 |
vwTestView | 72057594041270272 | 149575571 | 1 | 2 |
So now I can see the partitioned objects from the other direction, but I have found no relation between the objects identified as partitioned and the partition schemas discovered earlier.
There is also an index_id being shown, documented as 'the index within the object to which this partition belongs', but this is not an object_id for an index – it is the index_id you would normally use within DBCC commands, 0 for a heap, 1 for the clustered index etc., so there must be a relation to the sys.indexes view. Which, when I thought about it, made complete sense – the sys.indexes view is really badly named, since it does not hold a row per index, but a row per heap or index.
Not the best name in the world for it, but let's move on – given we have both the object_id and the index_id we can join on both / filter in the where clause.
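To see that 'row per heap or index' behaviour for the test table used here (tblTestBalance from the output above), a quick look at sys.indexes is enough – a side check of my own rather than part of the main chain of queries:
SELECT index_id, name, type, type_desc
FROM sys.indexes
WHERE object_id = OBJECT_ID('tblTestBalance')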
SELECT O.Name, p.*
FROM sys.partitions p
    INNER JOIN sys.objects O on p.object_id = o.object_id
    INNER JOIN sys.indexes I on O.object_id = I.object_id and p.index_id = I.index_id
WHERE partition_number = 2
Same output as before, since I have selected no fields from sys.indexes yet. Checking the list of columns it offers, an immediate candidate jumped out: data_space_id – I already had an odd-looking data_space_id earlier, so can the index link to it successfully?
SELECT O.Name, ds.name
FROM sys.partitions p
    INNER JOIN sys.objects O on p.object_id = o.object_id
    INNER JOIN sys.indexes I on O.object_id = I.object_id and P.index_id = I.index_id
    INNER JOIN sys.data_spaces ds on i.data_space_id = ds.data_space_id
WHERE partition_number = 2
Which gave the following results:
name | name |
---|---|
tblTestBalance | psBalanceLeft |
foo | psBalanceLeft |
tblTestBalance | psBalanceLeft |
vwTestView | psBalanceLeft |
I then rejoined in the partition schemes using the same join I did earlier:
SELECT O.Name, ds.name
FROM sys.partitions p
    INNER JOIN sys.objects O on p.object_id = o.object_id
    INNER JOIN sys.indexes I on O.object_id = I.object_id and P.index_id = I.index_id
    INNER JOIN sys.data_spaces ds on i.data_space_id = ds.data_space_id
    INNER JOIN sys.partition_schemes ps on ds.data_space_id = ps.data_space_id
Running this gave 40 rows, 4 objects x 10 partitions, so I had filtered down to the partitioned objects but was returning too many rows – I could have used a group by clause, but it seemed simpler to just select a single partition number, and since partition number 1 will always exist, that was the simplest to use.
I am now down all the way to the partition scheme, so it is time to select some of the more interesting columns found along the way that are applicable to the dynamic partitioning problem I am looking at – the main ones being the object name and type.
The final solution I’ve arrived at to get to the tables / indexes and indexed views using a partition scheme is:
SELECT O.Name as TableName, I.Name as IndexName, I.Type, I.type_desc as IndexType, ps.name as PartitionSchema
FROM sys.objects O
    INNER JOIN sys.partitions p on P.object_id = O.object_id
    INNER JOIN sys.indexes i on p.object_id = i.object_id and p.index_id = i.index_id
    INNER JOIN sys.data_spaces ds on i.data_space_id = ds.data_space_id
    INNER JOIN sys.partition_schemes ps on ds.data_space_id = ps.data_space_id
WHERE p.partition_number = 1
TableName | IndexName | Type_ID | Type_Desc | PartitionSchema |
---|---|---|---|---|
tblTestBalance | PK_tblTestBalance | 1 | CLUSTERED | psBalanceLeft |
foo | NULL | 0 | HEAP | psBalanceLeft |
tblTestBalance | testncindex | 2 | NONCLUSTERED | psBalanceLeft |
vwTestView | IxViewTest | 1 | CLUSTERED | psBalanceLeft |
This can be further filtered to an individual partition scheme based on the name trivially, but the output is giving us the information I am after – a list of what objects and type of object is allocated to the partition schemas.
There should be an easier way to get to this information than joining 5 system views, but that seems to be the only way I could manage to solve the problem.
Dynamic Partitioning : Wishlist
Whilst I consider dynamic partitioning something that really doesn’t serve a valid purpose that I can find yet, I decided to use it as an exercise to program a basic form of it within T-SQL over the coming weeks.
Given a blank piece of paper and some realism, what are the aims for the design and T-SQL:
- Batch based rebalancing – real-time is not realistic so let’s start with an overnight job.
- Choice to Balance by different metrics (Rows vs Physical Storage)
- Balance across a setup-defined fixed number of partitions – so that they do not run out.
- Ability to migrate Filegroups into and out of the Partition Scheme – e.g. schedule them for removal over the coming nights.
- Ability to limit the processing to a window – this is not easy, but a log of earlier migrations would offer guidance on how much processing could be done within an allotted time span.
- Ability to specify the balancing as an online operation – partitioning being Enterprise only, we can automatically rely on online index rebuilds being available.
That’s not a bad start although I bet it is harder than it sounds.
Let’s just consider the ‘balancing act’ itself regardless of the options. A partition schema is not a database table – which automatically complicates matters since multiple tables and indexes can use the same partition schema. This means that any change to a partition scheme / function will directly affect more than 1 table / index. Any calculations for the number of rows / size of data will have to take all the tables and indexes into account.
It might seem unusual to place more than one table on a partition schema, but it really isn’t. You would commonly place any NC indexes also on the same partition schema to keep them ‘aligned’, so having multiple tables for the same ‘alignment’ purpose shouldn’t seem weird. If you consider the multi-tenancy usage of the partitioned table, then you can see why you could have dozens of tables all on the same partition schema.
These requirements are the starting point for the T-SQL and as I come across issues I will write them up.
Is Dynamic Partitioning in SQL Server Possible?
I often see people asking whether dynamic table partitioning exists in SQL Server, or they provide a scenario that would effectively be asking the same question. So let’s get the easy answer out now – straight out of the box SQL Server has no dynamic partitioning.
To be fair, straight out of the box there is no tooling surrounding partitioning either, except for a handful of DMV's – if you want to automate a rolling window, then you need to program that yourself. SQL Server 2008 added a few bits; it struck me that if you need to use a wizard to turn an existing table into a partitioned table then you're not really planning ahead.
So if it is possible to automate a rolling window system, surely it is possible to automate some kind of dynamic partitioning?
Well, that depends on what the definition of ‘dynamic partitioning’ is when it comes to SQL, which would normally be defined by the person who needs the feature to solve their specific issue. Before I start writing up a wish list of options and features to guide me hacking some SQL together to solve the problem – you have to ask; do you really need dynamic partitioning?
Table Partitioning by its nature suits larger volumes of data in a rolling window, where we migrate older data out and bring in new values. However, partitioning has been used for a variety of purposes that it possibly was not considered for originally, such as:
- Performance gain through Partition Elimination
- Multi-Tenancy databases, placing each client in a separate partition
Bizarrely each of those reasons has a counter argument:
- Partition elimination only benefits queries that include the partition key in the where clause, otherwise it is detrimental to the query since it requires every partition is examined.
- Aside from the limit of 1,000 partitions, and therefore 1,000 customers, security is easier to compromise, upgrades per customer are not possible, and the whole backup / restore strategy for individual customers gets very complex, since you do not wish to restore the whole table but a single partition.
Back to the question, do we really need dynamic partitioning?
The complexity and scale of most partitioned tables indicates that they should not occur by 'accident', and retro-fitting a partitioned table indicates a lack of data modelling / capacity planning. The 'alternative' reasons for partitioning are amongst some of the drivers for the dynamic partitioning request.
To make best use of the partitioned table feature requires planning and design, in which case it does not need to be ‘dynamic’.
That all being said, in the coming posts I am going to write-up my wish list of features to start building a basic dynamic partitioning system and then make it more complex over time – it makes for a fun exercise.
If you have any thoughts on features you would want to see in it, just add them in a comment.
How Can You Tell if a Database is in Pseudo Full/Bulk Logged Mode?
I was asked on Friday, "how do you tell if a database's logging mode is reporting bulk or full, but it is still in simple?" – as mentioned before, a database is not really in full / bulk logged mode unless a full backup has been taken. Until that time the database is still running in a simple mode, sometimes referred to as pseudo-simple. It is not easy to spot, because the properties of the database will report full / bulk as appropriate and give no indication that it is not actually logging in the way it says.
The existence of a backup of the database is not a reliable enough mechanism for this, since the database can be backed up and then moved out of full / bulk logged mode into simple and back again. This breaks the backup and transaction log chain, but the database is still reporting full – to make it worse there is a backup record showing on the history, giving it an air of legitimacy.
The backup records can be accessed from sys.sysdatabases and msdb.dbo.backupset; MSDN even has an example script showing how to see when a database was last backed up and by whom.
SELECT T1.Name as DatabaseName,
       COALESCE(Convert(varchar(12), MAX(T2.backup_finish_date), 101),'Not Yet Taken') as LastBackUpTaken,
       COALESCE(Convert(varchar(12), MAX(T2.user_name), 101),'NA') as UserName
FROM sys.sysdatabases T1
    LEFT OUTER JOIN msdb.dbo.backupset T2 ON T2.database_name = T1.name
GROUP BY T1.Name
ORDER BY T1.Name
To play around with the scripts you probably want a test database:
CREATE DATABASE [LogModeTest] ON PRIMARY
( NAME = N'LogModeTest', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\LogModeTest.mdf' , SIZE = 3072KB , MAXSIZE = UNLIMITED, FILEGROWTH = 1024KB )
LOG ON
( NAME = N'LogModeTest_log', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\LogModeTest_log.ldf' , SIZE = 1024KB , MAXSIZE = 2048GB , FILEGROWTH = 10%)
COLLATE Latin1_General_CI_AI
With a minor alteration to the MSDN script you can get the backup history for this database:
SELECT T1.Name as DatabaseName,
       COALESCE(Convert(varchar(12), MAX(T2.backup_finish_date), 101),'Not Yet Taken') as LastBackUpTaken
FROM sys.sysdatabases T1
    LEFT OUTER JOIN msdb.dbo.backupset T2 ON T2.database_name = T1.name
WHERE T1.Name = 'LogModeTest'
GROUP BY T1.Name
The results show the database is not yet backed up:
DatabaseName                  LastBackUpTaken
----------------------------- ---------------
LogModeTest                   Not Yet Taken
That is easy to fix, so let’s take a backup of the database and recheck the last backup value.
BACKUP DATABASE [LogModeTest]
TO DISK = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Backup\LogModeTest.bak'
WITH NOFORMAT, NOINIT, NAME = N'LogModeTest-Full Database Backup', SKIP, NOREWIND, NOUNLOAD, STATS = 10

DatabaseName                  LastBackUpTaken
----------------------------- ---------------
LogModeTest                   12/13/2009
As expected the date of the backup is now set. If we alter the logging mode of the database to simple we will break the transaction log chain. To demonstrate the backup information being an unreliable source, let’s change to simple, create a table and then return to the fully logged mode.
ALTER DATABASE [LogModeTest] SET RECOVERY SIMPLE WITH NO_WAIT
CREATE TABLE foo(id int identity)
ALTER DATABASE [LogModeTest] SET RECOVERY FULL WITH NO_WAIT
If we now attempt to backup the transaction log, SQL is going to throw an error.
BACKUP LOG [LogModeTest]
TO DISK = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Backup\LogModeTest.bak'
WITH NOFORMAT, NOINIT, NAME = N'LogModeTest-Transaction Log Backup', SKIP, NOREWIND, NOUNLOAD, STATS = 10

Msg 4214, Level 16, State 1, Line 1
BACKUP LOG cannot be performed because there is no current database backup.
Msg 3013, Level 16, State 1, Line 1
BACKUP LOG is terminating abnormally.
And if we check the database backup history using the MSDN script:
DatabaseName                  LastBackUpTaken
----------------------------- ---------------
LogModeTest                   12/13/2009
So the backup history continues to show a date of the last full backup even though the transaction log chain is now broken. SQL certainly knows the database has not had a full backup since swapping into fully logged mode, so any transaction log backup is invalid, thus the error.
There is an easier way to find out that you are in pseudo-simple mode, without trying to perform a transaction log backup:
SELECT name, COALESCE(Convert(varchar(30),last_log_backup_lsn), 'No Full Backup Taken') as BackupLSN
FROM sys.databases
    INNER JOIN sys.database_recovery_status on sys.databases.database_id = sys.database_recovery_status.database_id
Run this against your server and it lists the databases that have had a backup taken (shown by the existence of a backup LSN) and those that have not had a full backup that could be used in recovery. If we then backup the database and recheck the values, the test database now records an LSN, showing it is out of pseudo-simple and into the full / bulk logged modes.
So that indicates whether we are in pseudo-simple or not, but it does not link back to the properties of the database to check what the actual database logging mode is – you are primarily only interested in databases that are not in simple mode in the first place, but are running in pseudo-simple due to the lack of a relevant full database backup. We can alter the query to handle this specific situation, and the result is:
SELECT name, recovery_model_desc, COALESCE(Convert(varchar(30),last_log_backup_lsn), 'No Full Backup Taken') as BackupLSN
FROM sys.databases
    INNER JOIN sys.database_recovery_status on sys.databases.database_id = sys.database_recovery_status.database_id
WHERE sys.databases.recovery_model <> 3 AND last_log_backup_lsn is null
If you run that query against your database server and get any results, then you have databases that are not running in the recovery mode they are indicating / that you think they are – which would generally not be a good thing.
Rolling a Partition Forward – Part 2
The first part of this topic provided a mini-guide to loading data into a partitioned table and a few helpful DMV based statements that can help you automate the process. The unloading of the data should in theory be easier, but to do this in an automated fashion you are more reliant on the DMVs and system views to get to the right information.
The steps to unload a partition of data are:
- Discover which file group the oldest partition is on.
- Create a staging table on the same filegroup with an identical schema and indexes
- Switch the data out to the staging table
- Merge the partition function
- Archive / Drop the data as appropriate.
As in part 1, there are 3 sections of the process which are not so common, whilst the creation of a table and the archive / drop of the old data at the end is standard T-SQL that you will be using regularly.
Discover which Filegroup the Oldest Partition is On
When checking for the oldest partition, I have assumed that the basis of the rolling window is that the highest boundary is the most recent data, whilst the lowest boundary is the oldest – in essence time is moving forward and the partition key ascends, not descends. The oldest boundary will therefore be boundary 1; how do you get the name of the filegroup this partition is on? A somewhat complex use of a set of DMV's.
SELECT sys.filegroups.Name as FileGroupName
FROM sys.partition_schemes
    INNER JOIN sys.destination_data_spaces ON sys.destination_data_spaces.partition_scheme_id = sys.partition_schemes.data_space_id
    INNER JOIN sys.filegroups ON sys.filegroups.data_space_id = sys.destination_data_spaces.data_space_ID
    INNER JOIN sys.partition_range_values ON sys.partition_range_values.Boundary_ID = sys.destination_data_spaces.destination_id
        AND sys.partition_range_values.function_id = sys.partition_schemes.function_id
WHERE sys.partition_schemes.name = 'YourPartitionScheme'
  and sys.partition_range_values.boundary_id = 1
This will return the name of the filegroup, which allows you to create the staging table for the partition switch out on the correct filegroup.
Whilst the data space ID’s do alter in sequence depending on the partition function being a left or right based partition, the boundary ID remains consistent, which is why it is used to discover the oldest and not the destination_id / data_space_id.
Switch the Data Out to the Staging Table
Switching the data out is not complex; it is in essence just the reverse of the syntax for switching a partition in. Under the hood you are redirecting IAM pointers, so the switch is considered a meta-data command and is exceptionally fast.
ALTER TABLE YourPartitionedTable SWITCH PARTITION 1 TO YourStagingTable
The partition number used is in effect the boundary id, and the oldest boundary is partition 1 in the rolling window.
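Before and after the switch it is easy to sanity check which partition holds what; a quick look at sys.partitions (my own sketch, using the placeholder table name from above) shows the row counts per partition:
SELECT partition_number, rows
FROM sys.partitions
WHERE object_id = OBJECT_ID('YourPartitionedTable')
  AND index_id IN (0, 1) -- heap or clustered index only
ORDER BY partition_number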
Merge the Partition Function
The last complex stage is the merging of the partition function; the command explicitly needs the boundary value from the partition function that represents the partition. If you were doing this by hand you would know it, but automating the process requires discovering this information from the DMV's again.
SELECT value
FROM sys.partition_range_values
    INNER JOIN sys.partition_functions ON sys.partition_functions.function_id = sys.partition_range_values.function_id
WHERE name = 'YourPartitionFunctionName'
  AND boundary_id = 1
Again, we are using the boundary value of 1 to extract only the oldest partition function value, but this can then be used in a partition function merge command.
ALTER PARTITION FUNCTION YourPartitionFunctionName() MERGE RANGE (YourBoundaryValue)
Conclusion
Using the DMV’s and appropriate stored procedures, the rolling window can be automated and does not require hand-crufted SQL to work – just use of the DMV’s to get the key values you need to be able to construct the harder parts of the process.
If you are following the guide on partition layout I wrote before, then the filegroup you have just removed the data from becomes the next spare filegroup to be used to house the next time data is imported. If you store this within the database, the next load will be able to automatically know where to place the data and set the next used filegroup value to, closing the loop so to speak.
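For completeness, that 'closing the loop' step is the standard pair of commands used when the next window is loaded – sketched here with the same placeholder names as above rather than anything specific from the post:
-- point the scheme at the filegroup that has just been emptied
ALTER PARTITION SCHEME YourPartitionScheme NEXT USED [YourSpareFilegroup]

-- then split the function to create the new, most recent partition on it
ALTER PARTITION FUNCTION YourPartitionFunctionName() SPLIT RANGE (YourNewBoundaryValue)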