Another one of those features in Oracle that we do not natively have in SQL Server is interval partitioning, where you can automatically have partitions generated based on an interval automatically generated for you. The interval partitioning is yet again, another form of dynamic partitioning, so the thought was, could this be achieved within SQL Server?
My initial thought would be to use an instead of trigger, which would intercept the incoming values and take action on them if appropriate to extend the partition function. An initial look into the documentation suggests it will not be trivial – the BoL states:
"Additionally, the following Transact-SQL statements are not allowed inside the body of a DML trigger when it is used against the table or view that is the target of the triggering action. CREATE INDEX, ALTER INDEX, DROP INDEX, DBCC REINDEX, ALTER PARTITION FUNCTION, DROP TABLE..."
The issue there being the alter partition function – any response to an incoming piece of data that is not properly covered by the existing partition mechanism will need to alter the partition information. Of course there are still ways around such restrictions but when experimenting, it seems the BoL is not correct to list the ‘ALTER PARTITION FUNCTION’ command in the restrictions.
There are a few caveats to the following snippets of code, I am not attempting to deal with the rolling window affect, or complex storage pattern of a partitioned table of which I have written about in other posts. This is purely designed to demonstrate that it could be achieved, not to provide a total solution to the problem. The partitions created on the fly will all go to the Primary filegroup as well etc.
So start with a very basic partition function and scheme:
CREATE PARTITION FUNCTION pfTest (datetime) AS RANGE LEFT FOR VALUES ('20100104' , '20100111' , '20100118', '20100125') CREATE PARTITION SCHEME psTest AS PARTITION pfTest ALL TO ([PRIMARY])
And generate a table on the partition scheme:
CREATE TABLE IntervalTest ( MyID int identity(1,1) not null, MyField Varchar(200), MyDate datetime ) ON psTest(MyDate)
The next step is an ‘Instead of’ trigger, which has to intercept the incoming data from the inserted table, and extend the partition function if required:
CREATE TRIGGER tr_IntervalTest ON IntervalTest INSTEAD OF INSERT AS BEGIN -- get the current maximum partition value DECLARE @max_part_dt datetime; DECLARE @max_inserted_dt datetime; DECLARE @weeks_to_add int; SET @max_inserted_dt = (SELECT max(MyDate) FROM inserted); SET @max_part_dt = ( SELECT max(convert(datetime,value)) from sys.partition_functions f inner join sys.partition_range_values rv on f.function_id = rv.function_id where name = 'pfTest'); IF (@max_inserted_dt > dateadd(D,7,@max_part_dt)) BEGIN -- need to potentially add multiple partition splits, it depends on -- how many weeks in advance the new data is. -- get a whole number of the weeks to add to ensure that we cover -- the new data SET @weeks_to_add = ceiling(datediff(D,@max_part_dt, @max_inserted_dt) / 7.0) -- loop around splitting the partition function with the new weekly values -- that need to be covered WHILE @weeks_to_add > 0 BEGIN -- increase the maximum partition date by 7 days and split the function SET @max_part_dt = dateadd(D,7,@max_part_dt); ALTER PARTITION SCHEME psTest NEXT USED [Primary]; ALTER PARTITION FUNCTION pfTest() SPLIT RANGE (@max_part_dt); SET @weeks_to_add = @weeks_to_add - 1 END; END; -- finally insert the values INSERT INTO IntervalTest (MyField, MyDate) SELECT MyField, MyDate FROM inserted; END
The code is pretty self-explanatory, but I would point out that it is only covering an insert, not an update – and this is not production code but an experiment to see if it could be done (contrary to the BoL). To ‘productionize’ this would require significant work on exception handling, performance tuning, handling multiple filegroup partitioning, the list goes on, but all of it achievable.
A little test to insert a couple of values set ahead of the partition
insert into IntervalTest (MyField, MyDate) select 'b', '20100505' union select 'c', '20100606'
And a check of the partition function values now show 23 values in the partition function, instead of the original 4 as follows:
2010-01-04 00:00:00.000, 2010-01-11 00:00:00.000, 2010-01-18 00:00:00.000, 2010-01-25 00:00:00.000, 2010-02-01 00:00:00.000, 2010-02-08 00:00:00.000, 2010-02-15 00:00:00.000, 2010-02-22 00:00:00.000, 2010-03-01 00:00:00.000, 2010-03-08 00:00:00.000, 2010-03-15 00:00:00.000, 2010-03-22 00:00:00.000, 2010-03-29 00:00:00.000, 2010-04-05 00:00:00.000, 2010-04-12 00:00:00.000, 2010-04-19 00:00:00.000, 2010-04-26 00:00:00.000, 2010-05-03 00:00:00.000, 2010-05-10 00:00:00.000, 2010-05-17 00:00:00.000, 2010-05-24 00:00:00.000, 2010-05-31 00:00:00.000, 2010-06-07 00:00:00.000
It has clearly created the partitions to cover the data being inserted and then performed that insertion.
So it can be done, but the constant cost of intercepting every insertion and update to provide this kind of dynamic partition is really not ideal, whether it could be made sufficiently efficient to work at the sort of scale that partitioning tends to be used at is debatable. I have it feeling that it would struggle - I would need to be lent a suitably sized server and SAN to test that one and see whether it could be made efficient enough.
It’s been a while since the last post, primarily due to changing jobs and now spending most of my time on Oracle – although it is always good to see the other side of the coin and see what it has to offer, but I won’t be abandoning SQL Server, that is for certain.
One of the more interesting features to me in Oracle is hash partitioning – the ability to create a partition across a defined number of partitions, and then arbitrarily decide which partition the data will go in based on a hashing function. Why would that be handy? SQL Server partitioning is in effect a range partition, in which you define the dividing points on the number line / alphabet – which suits partitions based on a defined number range or date range, but does not suit partitioning of other types such as a GUID.
The merits of such a partition could be debated, since with a decent index in place the benefits of the partition elimination within the query plan can be limited. Regardless of those merits (and I am pretty sure it is not going to be performant at scale, however could SQL Server implement Hash Partitioning? On a side note, this could be considered semi-dynamic partitioning in that the partition is able to cope with additional data outside of the expected range, due to the hash function.
I’ve seen a few articles try and perform hash partitioning by pre-processing the insert statement, prior to insertion into the database, but what about something a bit more native?
To start with, we need to create a partition function and partition schema to support this endeavour, both are pretty easy to construct.
CREATE PARTITION FUNCTION [myPartitionFunction] (int) AS RANGE LEFT FOR VALUES (100,200,300,400,500,600,700,800,900) CREATE PARTITION SCHEME [myPartitionScheme] AS PARTITION [myPartitionFunction] ALL TO ([FG1])
I’ve set up the partition scheme to assign all of the partitions to FG1, just for convenience, it could easily be set to multiple filegroups, and instead of 9 partitions, this could be constructed with 999 partitions.
There are a variety of hashing algorithms and functions, but given the range covered by the partition function, I have chosen to use a very simple modulo on the converted binary of the unique identifier. The only trick here is that we must create the function with schema binding, otherwise SQL will refuse to use the function later on when we persist the column and partition on it.
CREATE FUNCTION GuidHash (@guid_value uniqueidentifier) RETURNS int WITH SCHEMABINDING AS BEGIN RETURN abs(convert(bigint,convert(varbinary,@guid_value))) % 999 END
That is a pretty simple hashing function, but the point is to demonstrate is can be done, not to implement the best hashing algorithm that will give the most even distribution etc. The next step is to create the table, with the persisted column defined using the GuidHash function. If the function is not schema bound, you get an error thrown at this stage.
CREATE TABLE MyTable( MyID UniqueIdentifier not null, SomeField Char(200), PartitionID as dbo.GuidHash(MyId) PERSISTED ) ON myPartitionScheme(PartitionID)
The surprise here is that is accepts the table creation definition - since when would you expect a partitioned table’s column to be a computed column?
Time to put an index on the table, given the data is indexed off a unique identifier, it would not be unusual to place a non-clustered index on the table and to use index-alignment, e.g. place it on the same partitioning scheme.
CREATE NONCLUSTERED INDEX [ix_id] ON [dbo].[MyTable] ( [MyID] ASC, [PartitionID] ASC ) ON [myPartitionScheme]([PartitionID])
Populate the table with some test data:
DECLARE @guid uniqueidentifier SET @guid = newid() INSERT INTO mytable (myid, somefield) VALUES (@guid, 'some text') go 10000
So what happens when we select a single row from our data, for convienience I looked up a value in the table and grabbed the GUID – comparing the two queries side by side,
SELECT * FROM mytable WHERE myid = 'D41CA3AC-06D1-4ACC-ABCA-E67A18245596' SELECT * FROM mytable WHERE (partitionid = dbo.guidhash ('D41CA3AC-06D1-4ACC-ABCA-E67A18245596') and myid = 'D41CA3AC-06D1-4ACC-ABCA-E67A18245596')
The comparison is interesting, in percentage terms, it was 85% to 15% for the batch, the IO Statistics reads:
First Query : Scan count 10, Logical Reads 21 Second Query : Scan count 1, Logical Reads 3
So the hashing has clearly made the query faster – but that just means it was faster than the query that didn’t use the partition key which shouldn’t be too surprising – partition elimination vs checking every partition should win, so all it shows is that partition elimination is occurring. So how does it stack up against a normal table, e.g. have we gained anything? To test we need to put 10k rows into the same table, minus the computed column, index it and perform the same kind of select – all easy stuff so I will not write the code here, the results of a select from a normal table?
Normal Table Query : Scan Count 1, Logical Reads 3
And when run side by side, the SSMS window reports a 50% split of work between the two queries within the batch – which is not surprising given the IO costs were listed as the same – so where is the catch? There is no such thing as a free lunch, and the additional cost here is the CPU to generate the PartitionID value for the hashed GUID, but as a technique to partition based on a GUID, it has some merits.
One final thing that I did notice is that under Simple Parameterization, the GuidHash based query does not parameterize, which would start having detrimental effects on the query cache, once the database was placed under forced parameterization, then the query did parameterize appropriately – so you either want forced parameterization on or to use stored procedures – I would vote for the later there.
As a technique is has some merits, but you have to remember to manually include the PartitionID column within each query and run it through the hashing function – which is not ideal, but manageable.
As part of the look at dynamic partitioning one of the first problems I have come across is finding what objects are currently placing data within a partition schema, this can be both tables as well as indexes for a table or indexes for a view (which can also be partitioned).
This has tied in nicely with Adam Machanic’s SQL Tuesday in which we were to describe a confusing situation and possible solution.
Certainly it has been a bit confusing to get to the bottom of what should be a relatively trivial question – programmatically determine what objects are using my partition schema?
If I am going to auto-balance the partitions then I have to know what objects are using the partition schema and will be affected by any balancing – we can not assume it is a single object, since we know both the table and non-clustered indexes will often be aligned on the same partition schema, as could other tables / objects.
So the first place I chose to check was the system views for the partition objects, sys.partition_functions and sys.partition_schemes – with the partition schemes being where you would expect to start.
SELECT * FROM sys.partition_schemes
Unfortunately the results from partition schemes view is spectacularly unhelpful, aside from inheriting a number of columns from the data spaces system view, it only adds function_id – being the ID of the partition function used in the schema. It at least has the name of the partition scheme, so that definitely is going to have to be used at a later point.
The immediately useful looking value is the function_id linking the scheme to the partition function, so I had a look inside the partition functions view remember what it has.
SELECT * FROM sys.partition_functions
The output does not particularily lead anywhere useful – the function most certainly is not going to tell me which objects are assigned to it, since the tables / indexes get directly assigned to the partition scheme, this looks a dead end. So the only other option is to go to the data spaces system view:
SELECT ps.name, ds.* FROM sys.partition_schemes ps INNER JOIN sys.data_spaces ds on ps.data_space_id = ps.data_space_id
Not a stellar move – there are no obvious leads here.
So I can obtain the relation between the partition scheme and the storage but that it is so far. Given those two dead ends I next considered the problem from the opposite direction – sys.partitions claims to contain a row for each partitioned tables and index in the database – which should provide another starting point.
SELECT * FROM sys.partitions
Some of the output was as follows:
This definitely has my partition schema in there somewhere since I know I have 10 partitions and have set the row quantities up to be 10 rows for the first 9 partitions and 136 rows in the tenth, it is pretty visible.
I’ve also got an indexed view on the same table which explains the duplicate set of values, and a NC index on the table which explained the triplicate set of values I’ve not pasted in. This is in essence what I am after though; finding out which object_id’s reference the partition schema.
A couple of immediate dead ends have also appeared:
- The partition_id looks useful but is the unique id of the partition record not an ID relating to the partition schema.
- The hobt_id is the heap or b-tree pointer for that specific partition so is not going to be able to help us since there are 10 hobt_id’s per object on the schema, all different.
It does however provide the object_id which we know we can translate into a name very easily and a partition_number column which only ever exceeds 1 on a partitioned table. So with a bit of a throw-away style query to just select those with a partition_number of 2 to make sure we only select partitioned objects gave me the following:
SELECT o.Name, s.* FROM sys.partitions s INNER JOIN sys.objects o on s.object_id = o.object_id WHERE partition_number = 2
So now I can see the partitioned objects from the other direction, but I have found no relation between the objects identified as partitioned and the partition schemas discovered earlier.
There is also an index_id being shown and documented as ‘the index within the object to which this partition belongs’, but this is not an object_id for an index, but the index_id you would normally use within a dbcc in command, 0 for heap, 1 for clustered index etc, so there must be a relation to the sys.indexes view – which when I thought about it made complete sense – the sys.indexes table is really badly named, since it is not a row per index, but a row per heap or index.
Not the best name in the world for it, but let’s move on – given we have both the object ID and the index_id we can join on both / filter in the where clause.
SELECT O.Name, p.* FROM sys.partitions p INNER JOIN sys.objects O on p.object_id = o.object_id INNER JOIN sys.indexes I on O.object_id = I.object_id and p.index_id = I.index_id WHERE partition_number = 2
Same output as before, since I have selected no fields from sys.indexes yet, checking the list an immediate candidate jumped out, data_space_id – I already had an odd-looking data_space_id earlier, so can the index link to it successfully?
SELECT O.Name, ds.name FROM sys.partitions p INNER JOIN sys.objects O on p.object_id = o.object_id INNER JOIN sys.indexes I on O.object_id = I.object_id and P.index_id = I.index_id INNER JOIN sys.data_spaces ds on i.data_space_id = ds.data_space_id WHERE partition_number = 2
Which gave the following results:
I then rejoined in the partition schemes using the same join I did earlier:
SELECT O.Name, ds.name FROM sys.partitions p INNER JOIN sys.objects O on p.object_id = o.object_id INNER JOIN sys.indexes I on O.object_id = I.object_id and P.index_id = I.index_id INNER JOIN sys.data_spaces ds on i.data_space_id = ds.data_space_id INNER JOIN sys.partition_schemes ps on ds.data_space_id = ps.data_space_id
Running this gave 40 rows, 4 objects x 10 partitions, so I was filtered down to the partitioned objects but I returned too many rows – I could have used a group by clause but it seemed simpler to just select a single partition number, and since I know partition number 1 will always exist, that was the simplest to use.
I am now down all the way to the partition scheme, time to select some of the more interesting columns I found along the way, that are applicable to the dynamic partitioning problem I am looking at, the main one being the object name and type.
The final solution I’ve arrived at to get to the tables / indexes and indexed views using a partition scheme is:
SELECT O.Name as TableName, I.Name as IndexName, I.Type, I.type_desc as IndexType, ps.name as PartitionSchema FROM sys.objects O INNER JOIN sys.partitions p on P.object_id = O.object_id INNER JOIN sys.indexes i on p.object_id = i.object_id and p.index_id = i.index_id INNER JOIN sys.data_spaces ds on i.data_space_id = ds.data_space_id INNER JOIN sys.partition_schemes ps on ds.data_space_id = ps.data_space_id WHERE p.partition_number = 1
This can be further filtered to an individual partition scheme based on the name trivially, but the output is giving us the information I am after – a list of what objects and type of object is allocated to the partition schemas.
There should be an easier way to get to this information than joining 5 system views, but that seems to be the only way I could manage to solve the problem.
Whilst I consider dynamic partitioning something that really doesn’t serve a valid purpose that I can find yet, I decided to use it as an exercise to program a basic form of it within T-SQL over the coming weeks.
Given a blank piece of paper and some realism, what are the aims for the design and T-SQL:
- Batch based rebalancing – real-time is not realistic so let’s start with an overnight job.
- Choice to Balance by different metrics (Rows vs Physical Storage)
- Balance across a setup-defined fixed number of partitions – so that they do not run out.
- Ability to migrate Filegroups into and out of the Partition Scheme – e.g. schedule them for removal over the coming nights.
- Ability to limit the processing to a window – this is not easy, but a log of earlier migrations would offer guidance on how much processing could be done within an allotted time span.
- Ability to choose the specify the balancing as an online operation – partitioning being enterprise only we can automatically rely on online index rebuilds being available.
That’s not a bad start although I bet it is harder than it sounds.
Let’s just consider the ‘balancing act’ itself regardless of the options. A partition schema is not a database table – which automatically complicates matters since multiple tables and indexes can use the same partition schema. This means that any change to a partition scheme / function will directly affect more than 1 table / index. Any calculations for the number of rows / size of data will have to take all the tables and indexes into account.
It might seem unusual to place more than one table on a partition schema, but it really isn’t. You would commonly place any NC indexes also on the same partition schema to keep them ‘aligned’, so having multiple tables for the same ‘alignment’ purpose shouldn’t seem weird. If you consider the multi-tenancy usage of the partitioned table, then you can see why you could have dozens of tables all on the same partition schema.
These requirements are the starting point for the T-SQL and as I come across issues I will write them up.
I often see people asking whether dynamic table partitioning exists in SQL Server, or they provide a scenario that would effectively be asking the same question. So let’s get the easy answer out now – straight out of the box SQL Server has no dynamic partitioning.
To be fair, straight out of the box there is no tooling surrounding partitioning either except for a handful of DMV’s – if you want to automate a rolling window, then you need to program that yourself. SQL Server 2008 added a few bits; it struck me that if you need to use a wizard to turn an existing table into a partitioned table then your not really planning ahead.
So if it is possible to automate a rolling window system, surely it is possible to automate some kind of dynamic partitioning?
Well, that depends on what the definition of ‘dynamic partitioning’ is when it comes to SQL, which would normally be defined by the person who needs the feature to solve their specific issue. Before I start writing up a wish list of options and features to guide me hacking some SQL together to solve the problem – you have to ask; do you really need dynamic partitioning?
Table Partitioning by its nature suits larger volumes of data in a rolling window, where we migrate older data out and bring in new values. However, partitioning has been used for a variety of purposes that it possibly was not considered for originally such as:
- Performance gain through Partition Elimination
- Multi-Tenancy databases, placing each client in a separate partition
Bizarrely each of those reasons has a counter argument:
- Partition elimination only benefits queries that include the partition key in the where clause, otherwise it is detrimental to the query since it requires every partition is examined.
- Aside from the limit of 1000 partitions therefore 1000 customers, security is easier to compromise, upgrades per customer are not possible and the whole backup restore strategy for individual customers get’s very complex since you do not wish to restore the whole table but a single partition.
Back to the question, do we really need dynamic partitioning?
The complexity and scale of most partitioned tables indicates that they should not occur by ‘accident’, and retro-fitting a partitioned table indicates a lack of data modelling / capacity planning. The ‘alternative’ reasons for partitioning, are amongst some of the drivers for the dynamic partitioning request.
To make best use of the partitioned table feature requires planning and design, in which case it does not need to be ‘dynamic’.
That all being said, in the coming posts I am going to write-up my wish list of features to start building a basic dynamic partitioning system and then make it more complex over time – it makes for a fun exercise.
If you have any thoughts on features you would want to see in it, just add them in a comment.
The first part of this topic provided a mini-guide to loading data into a partitioned table and a few helpful DMV based statements that can help you automate the process. The unloading of the data should in theory be easier, but to do this in an automated fashion you are more reliant on the DMVs and system views to get to the right information.
The steps to unload a partition of data are:
- Discover which file group the oldest partition is on.
- Create a staging table on the same filegroup with an identical schema and indexes
- Switch the data out to the staging table
- Merge the partition function
- Archive / Drop the data as appropriate.
As in part 1, there are 3 sections of the process which are not so common, whilst the creation of a table and the archive / drop of the old data at the end is standard T-SQL that you will be using regularly.
Discover which Filegroup the Oldest Partition is On
When checking for the oldest filegroup, I have assumed that the basis of the rolling window is that the highest boundary is the most recent data, whilst the lowest boundary is the oldest – in essence time is moving forward and the partition key ascends, not descends. The oldest boundary will therefore be boundary 1, how do you get the Filegroup name of the filegroup this partition is on? A somewhat complex use of a set of DMV’s.
SELECT sys.filegroups.Name as FileGroupName FROM sys.partition_schemes INNER JOIN sys.destination_data_spaces ON sys.destination_data_spaces.partition_scheme_id = sys.partition_schemes.data_space_id INNER JOIN sys.filegroups ON sys.filegroups.data_space_id = sys.destination_data_spaces.data_space_ID INNER JOIN sys.partition_range_values ON sys.partition_range_values.Boundary_ID = sys.destination_data_spaces.destination_id AND sys.partition_range_values.function_id = sys.partition_schemes.function_id WHERE sys.partition_schemes.name = 'YourPartitionScheme' and sys.partition_range_values.boundary_id = 1
This will return the name of file group, which allows you to create the staging table for the partition switch out on the correct filegroup.
Whilst the data space ID’s do alter in sequence depending on the partition function being a left or right based partition, the boundary ID remains consistent, which is why it is used to discover the oldest and not the destination_id / data_space_id.
Switch the Data Out to the Staging Table
Switching the data out is not complex, it just is the reverse syntax of switching the partition in essence. Under the hood you are redirecting IAM pointers, so the switch is considered a meta-data command and exceptionally fast.
ALTER TABLE YourPartitionedTable SWITCH PARTITION 1 TO YourStagingTable
The partition number used is in effect the boundary id, and the oldest boundary is for partition 1 the rolling window.
Merge the Partition Function
The last complex stage is the merging of the partition function, the command explicitly needs the value from the partition function that represents the partition. If you were doing this by hand you would know it, but to automate the process requires the discovery of this information from the DMV’s again.
SELECT value FROM sys.partition_range_values INNER JOIN sys.partition_functions ON sys.partition_functions.function_id = sys.partition_range_values.function_id WHERE name = 'YourPartitionFunctionName' AND boundary_id = 1
Again, we are using the boundary value of 1 to extract only the oldest partition function value, but this can then be used in a partition function merge command.
ALTER PARTITION FUNCTION YourPartitionFunctionName() MERGE RANGE (YourBoundaryValue)
Using the DMV’s and appropriate stored procedures, the rolling window can be automated and does not require hand-crufted SQL to work – just use of the DMV’s to get the key values you need to be able to construct the harder parts of the process.
If you are following the guide on partition layout I wrote before, then the filegroup you have just removed the data from becomes the next spare filegroup to be used to house the next time data is imported. If you store this within the database, the next load will be able to automatically know where to place the data and set the next used filegroup value to, closing the loop so to speak.
I have covered how to layout a partitioned table across filegroups previously, but have not gone through the steps of rolling a partitioned window – it sounds a simple process but with all the file group and pre-requisites for it to run smoothly anyone starting with partitioned tables could probably use a little guide. As you are about to see the process is quite intricate so I will go through the load process on this post and the unload on the next.
Because no one case fits all, I have made some assumptions / limitations to provide a guide, specifically:
- The main partitioned table has a clustered index.
- The layout is following the mechanism of keeping a staging filegroup and spare filegroup as detailed in the layout post.
- The rollout process intends to remove the oldest data / partition.
- The process is designed for large loads, not single inserts.
So let’ s see what it takes to prepare and get data into a partitioned table:
- Create a staging table on your dedicated ETL filegroup, of an identical column schema to your partitioned table.
- Load the data into the staging table.
- Move the staging table to the spare filegroup, using a clustered index creation. (The need for the spare was covered in the layout post)
- Add any additional Non-Clustered indexes required to match the partitioned table indexes.
- Constrain the data so that it is considered trusted – the constraint must ensure all values are within the partition boundary you intend to place it within.
- Set the Partition Schema Next Used Filegroup
- Split the Partition Function
- Switch the staging table into the main partitioned table
That was all just to bulk load data into a partitioned table – a long list and plenty of opportunity for it to go wrong, but most of these steps use T-SQL that you will be very familiar with - it is only the last 3 items that use less common SQL and are harder to automate, since there is no built-in tools to do the work for you.
Setting the Next Used Filegroup
The intention when setting the filegroup is to declare where the partition should locate data for the partition function split that is about to occur. Whilst you can discover what the previous setting might be, it is not advisable to rely on it but set it every time, just before performing a partition function split. The syntax for the command is:
ALTER PARTITION SCHEME YourPartitionSchemeName NEXT USED [YourSpareFG]
Splitting the Partition Function
This splitting of the partition function is in effect the creation of an extra dividing section on the number line / date line representing the partitioned table. If you split a partition that already has data the operation will be quite expensive since can be forced to move data between filegroups, so it is common in a rolling window scenario that you split to handle the incoming data, which is always in advance of your existing data, e.g. If you are storing sales data partitioned by the month/year of the sales date, and currently only hold up until November, you would not insert any data for December until the partition for December had been created.
The syntax forward:
ALTER PARTITION FUNCTION YourPartitionFunctionName() SPLIT RANGE (YourBoundaryValue)
But when importing new data in an automated fashion, you might not know whether the new partition split has already been performed or not, so how can you check whether the new boundary value is already created in the partition function? DMV’s can provide the answer:
SELECT count(value) as ValueExists FROM sys.partition_range_values INNER JOIN sys.PARTITION_FUNCTIONS ON sys.PARTITION_FUNCTIONS.function_id = sys.partition_range_values.function_id WHERE name = 'YourPartitionFunctionName' AND value = YourBoundaryValue
A returned value of 0 would indicate it did not exist, whilst a 1 would indicate a boundary value had already been created.
Switching the Staging Table In
Switching the staging table into the newly created partition looks relatively easy but needs the partition number:
ALTER TABLE yourStagingTable SWITCH TO YourPartitionedTable PARTITION PartitionNumber
Where do you get the partition number from? The partition number is the boundary ID, and is numbered starting at 1 from the furthers left partition sequentially upwards. If you know the boundary value you have set for the partition, you can get the boundary id using the DMV’s again
SELECT boundary_id FROM sys.partition_range_values INNER JOIN sys.partition_functions ON sys.partition_functions.function_id = sys.partition_range_values.function_id WHERE name = 'YourPartitionFunctionName' AND value= YourBoundaryValue
These additional DMVs allow you to get access to the data you need to automate the process in stored procedures, finding out the boundary IDs in one step, to be used in the next step etc.
These are the trickier parts of the process to automate that need the help of the DMVs. In the next post I will go through the unloading of the old data.
SQL Server’s Table Partitioning was one of my favourite features within the 2005 release. It really brought SQL into the mainstream when it came to holding very large data quantities and allowed us to talk with confidence about large tables containing ‘Billions’ of rows and not be limited to the ‘Millions’. From extensive use and a couple of SQL Labs specifically surrounding the partitioning feature there are some rules and best practises I use to try maximise the benefit / flexibility of the partitioned table, without causing uneccesary drops in performance or increases in the disk space requirements.
Many examples of table partitions focus heavily on using date range based partition functions and schema, with a common layout mechanism of stripping the weeks / months across a number of file groups. The file groups are re-used and in the example picture, you can see 2 years worth of data stripped across 12 filegroups.
This is pretty common and has an alluring charm of simplicity, but it is going to hurt when you start rolling the window. The underlying problem is that there is no gap.
For a good number of systems it would be unacceptable to remove a month of data unless you had already inserted the new month sucessfully. So the system starts to roll a new month of data in, and is required to use the same filegroups, the layout transforms into the following:
The file group has to expand by 50% to accomodate the new data, before the old data can be removed – and once the old data is removed the filegroups now look like:
So the 50% of space is now wasted unless you use a shrink, which is probably the worst thing you can do to your filegroup and data files at that point in time. Shrink can fragment the data to the extreme and is to be avoided at all costs. Which means you will have to maintain a 50% space penalty for the layout on every filegroup. That might not sound a lot, but on a large database in an enterprise with mirrored SAN’s,that additional 50% is going to cost a substancial amount.
There are also other issues, SQL allows you to backup at a filegroup level and since the bulk of the data is historic and will not alter, you are forced to re-back up historic data (Jan 08) when you backup the recently inserted Jan 09 data. So there is an impact on backup space, backup times and restore times.
The simplicity of the initial layout makes it seem like a good idea, but the side-effects are not pleasant. You can alter the layout and choose to have 6 Filegroups, each storing 4 months of data, and then the expansion is only from 4 to 5, so a 25% overhead. It is better, but still is a cost overhead, The ultimate extreme is to then place it all in one filegroup, but there are a number of difficulties and contention points with that.
A better approach is to use a different file group per month, but then also create an additional spare filegroup, so that no filegroup is forced to expand, as shown:
The difference here is that we have one free filegroup that is not a part of the current partition scheme / function definition, but will be allocated as the ‘Next Used’ filegroup for the partition scheme, so that when we split the partition function that filegroup is brought into the scheme and used. The old data that is going to be removed, will provide an empty filegroup that will be used for the next month’s import. In essence the head of the partition function is chasing the tail and will never catch up.
The expansion of size for this is 1/n where n is the number of partitions to be stored, so for our 24 month example 1/24th expansion – a considerable saving. Even better is that you can choose to utilise file group backups for the older static months of data.
This though is not perfect for a couple of reasons, primarily data loading and indexing. To make the ‘scene’ better there should be a further filegroup dedicated to the data staging / loading that is going to occur to bring the data into the partitioned table.
The purpose of this is twofold:
- The disks used for loading are seperate from the disks used for the actual data so that we can maintain Quality of Service on the main table.
- The data wants to be loaded as a heap for speed, but is likely to be sorted prior to insertion, an in place sort would expand one of the main file groups which is to be avoided.
By loading into a staging filegroup, you give yourself the opportunity to then use a clustered index to sort the data and move it to the appropriate file group, prior to being constrained and switched in to the main partitioned table. If you had loaded the data into FG25 and then tried to apply a clustered index, it would of doubled the size of the filegroup again as it needs to write all the new rows out and commit them, before the heap could be deleted. That would put you back at square one wasting 50% of the disk space.
The staging filegroup does cost us another file group worth of space, so the 24 initial filegroups has grown to 26, which is still a smaller expansion than the potential 50%.
So some simple guidlines are:
- For a partition function that is going to have N partitions, create N+1 Filegroups.
- ETL / Stage your data into a dedicated staging file group
- Move your data using a clustered index creation to the ‘spare’ filegroup you always have.
- Switch new data in, then switch old out creating the new ‘spare’
It is not entirely flawless however – there is a problem in using such a design if the quantity of data per partition is going to vary by a large amount; you then have to provision each filegroup to have enough space to cope with the fluctations, which in itself can result in wasted space and starts chipping away at the gains made.
It works best as a technique on datasets that remain relatively consistent in terms of the number of rows per partition and as the number of partitions goes up, the savings increase.
Within SQL, the partitioned table feature provides an excellent way to store and roll forward data windows across large datasets without incurring huge loading and archiving penalties to the main table.
The process of loading, manipulating and decommissioning data from a partitioned table will need posts of their own – due to the depth of the information available and required as well as the detail needed to understand some of the best practises I have learnt from the field.
This entry is specifically relating to the ‘Next Used’ aspect of dealing with a partitioned table – for some time I wanted to know how could I tell what partition had been set as next used? There seems to be no record of the value – the BoL lists the Next Used facility as:
Specify the filegroup to be marked by the partition scheme as NEXT USED.
This is true in a simplistic sense, but the marking can not be at the filegroup level since the relation from partition schemes to filegroups can be many-to-many, so no single marking on a file group could suffice. I had never been able to figure out where to find that marking, and find a way to read what had been set.
I should add, that you really should not set the next used in one location of the code / stored proc and then perform the split of the partition function in another, it would be far safer to do them together, so the need to actually find it out is really borne out of investigative necessity in trying to understand for a given partition scheme that implements a rolling window that started to go bad, where does it think it should be splitting the partition to?
So the problem remained; it was given to Paul S. Randal to figure out how we could see / infer this information outside of the dedicated admin console. He figured out the starting point in terms of which value in which DMV to start the solution with, and I ran with it from there to create a relatively easy way to get to it. So credit to Paul for finding out where to start on the issue.
So down to the scripts and some test cases:
There are 2 types of partition scheme to test:
- Partition Function as Left.
- Partition Function as Right.
When defining the scheme initially, you can also define it in two ways:
- Partition scheme defined with the next used pre-set value.
- Partition scheme defined with no pre-set next used value.
So there are 4 combinations to test initially, since we do not need data for this or are worried about the query plan, the main work is pure schema creation and checking.
First script is to just create a test database, nothing significant:
CREATE DATABASE[NextUsedTest] ON PRIMARY ( NAME = N'NextUsedTest', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL \DATA\NextUsedTest.mdf' , SIZE = 3072KB , FILEGROWTH = 1024KB ), FILEGROUP[PartFG1] ( NAME = N'PartFile1', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL \DATA\PartFile1.ndf' , SIZE = 3072KB , FILEGROWTH = 1024KB ), FILEGROUP[PartFG2] ( NAME = N'PartFile2', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL \DATA\PartFile2.ndf' , SIZE = 3072KB , FILEGROWTH = 1024KB ), FILEGROUP[PartFG3] ( NAME = N'PartFile3', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL \DATA\PartFile3.ndf' , SIZE = 3072KB , FILEGROWTH = 1024KB ), FILEGROUP[PartFG4] ( NAME = N'PartFile4', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL \DATA\PartFile4.ndf' , SIZE = 3072KB , FILEGROWTH = 1024KB ), FILEGROUP[PartFG5] ( NAME = N'PartFile5', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL \DATA\PartFile5.ndf' , SIZE = 3072KB , FILEGROWTH = 1024KB ) LOG ON ( NAME = N'NextUsedTest_log', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL \DATA\NextUsedTest_log.ldf', SIZE = 1024KB , FILEGROWTH = 10% ) GO
Now the database is in place, lets create the partition functions /schemes:
CREATE PARTITION FUNCTION [pfLeftNoSpare](int) AS RANGE LEFT FOR VALUES (200801, 200802, 200803, 200804) CREATE PARTITION SCHEME [psLeftNoSpare] AS PARTITION[pfLeftNoSpare] TO([PartFG1], [PartFG2], [PartFG3], [PartFG4], [Primary]) CREATE PARTITION FUNCTION [pfLeftWithNextUsedSet](int) AS RANGE LEFT FOR VALUES (200801, 200802, 200803, 200804) CREATE PARTITION SCHEME [psLeftWithNextUsedSet] AS PARTITION [pfLeftWithNextUsedSet] TO([PartFG1], [PartFG2], [PartFG3], [PartFG4], [Primary], [PartFG5]) CREATE PARTITION FUNCTION [pfRightNoSpare](int) AS RANGE RIGHT FOR VALUES (200801, 200802, 200803, 200804) CREATE PARTITION SCHEME [psRightNoSpare] AS PARTITION[pfRightNoSpare] TO([Primary], [PartFG1], [PartFG2], [PartFG3], [PartFG4]) CREATE PARTITION FUNCTION [pfRightWithSpareNextUsedSet](int) AS RANGE RIGHT FOR VALUES (200801, 200802, 200803, 200804) CREATE PARTITION SCHEME [psRightWithSpareNextUsedSet] AS PARTITION[pfRightWithSpareNextUsedSet] TO([Primary],[PartFG1], [PartFG2], [PartFG3], [PartFG4], [PartFG5]) GO
Four partition functions and schemes, for the left schemes the next used file group appears on the far right when the next used it pre-set. When you use this mechanisms to preset, it issues a message for the operation:
Partition scheme 'psLeftWithNextUsedSet' has been created successfully. 'PartFG5' is marked as the next used filegroup in partition scheme 'psLeftWithNextUsedSet'. Partition scheme 'psRightWithSpareNextUsedSet' has been created successfully. 'PartFG5' is marked as the next used filegroup in partition scheme 'psRightWithSpareNextUsedSet'.
The trick to finding the next used is to look for the mismatching record – if you join the partition function values to the partition scheme there are scheme entries that do not have a corresponding partition function value, since it has not been set by performing the split.
Lets start with the Left based partition and compare the two.
select FG.Nameas FileGroupName , dds.destination_id , dds.data_space_id , prv.value, ps.Name from sys.partition_schemes PS inner join sys.destination_data_spaces as DDS on DDS.partition_scheme_id = PS.data_space_id inner join sys.filegroups as FG on FG.data_space_id = DDS.data_space_ID left join sys.partition_range_values as PRV on PRV.Boundary_ID = DDS.destination_id and prv.function_id=ps.function_id where PS.name = 'psLeftNoSpare'
The output is:
And check the other Left defined partition:
select FG.Name asFileGroupName , dds.destination_id , dds.data_space_id , prv.value, ps.Name from sys.partition_schemes PS inner join sys.destination_data_spaces as DDS on DDS.partition_scheme_id = PS.data_space_id inner join sys.filegroups as FG on FG.data_space_id = DDS.data_space_ID left join sys.partition_range_values as PRV on PRV.Boundary_ID = DDS.destination_id and prv.function_id=ps.function_id where PS.name = 'psLeftWithNextUsedSet'
And the difference appears, the next used appears as the highest destination ID, but only when there are two data spaces assigned to the partition that do not have a value. The ‘primary’ entry that shows up as null is due to the partition scheme running from -ve infinity to +ve infinity, and whilst 4 lines are drawn on the number line, that divides the number line into 5 sections.
Running the same statements for the partitions declared using the right shows similar results, but the Primary entry is given destination 5 whilst the spare shows on 6 again.
The query is slightly awkward in that it must pick up the second entry of the list if it exists, using ‘orderings and tops’ will not pull the result we need.
select FileGroupName, Destination_ID, Data_Space_ID, Name from ( select FG.Name as FileGroupName , dds.destination_id , dds.data_space_id , prv.value , ps.Name , RANK() OVER (PARTITION BY ps.name order by dds.destination_Id) as dest_rank from sys.partition_schemes PS inner join sys.destination_data_spaces as DDS on DDS.partition_scheme_id = PS.data_space_id inner join sys.filegroups as FG on FG.data_space_id = DDS.data_space_ID left join sys.partition_range_values as PRV on PRV.Boundary_ID = DDS.destination_id and prv.function_id=ps.function_id where prv.Value is null ) as a where dest_rank = 2
To test whether it picks up setting the next used, let’s set it on the partitions that did not previously have it.
ALTER PARTITION SCHEME psLeftNoSpare NEXT USED [PartFG5] ALTER PARTITION SCHEME psRightNoSpare NEXT USED [PartFG5]
And re-run the query
To make it re-usable, I switched the query into a view
create view NextUseFileGroups as select FileGroupName, Destination_ID, Data_Space_ID, Name from ( select FG.Name as FileGroupName
, dds.data_space_id, prv.value, ps.Name, RANK() OVER (PARTITION BY ps.name order by dds.destination_Id) as dest_rank from sys.partition_schemes PS inner join sys.destination_data_spaces as DDS
on DDS.partition_scheme_id = PS.data_space_id inner join sys.filegroups as FG
on FG.data_space_id = DDS.data_space_ID left join sys.partition_range_values as PRV
on PRV.Boundary_ID = DDS.destination_id and prv.function_id=ps.function_id where prv.Value is null ) as a where dest_rank = 2
And a final check with removing the setting – you can blank a set next used value by specifying no value in the statement.
ALTER PARTITION SCHEME psLeftNoSpare NEXT USED ALTER PARTITION SCHEME psRightNoSpare NEXT USED
Select from the view and the two file partition schemes / file groups no longer show up in the list as intended.
So finding out the ‘next used’ setting is possible, although there really is no need in normal operation of the partition window to have to find out, but as an investigative tool it could be useful.
The scripts were tested on both 2005 and 2008, so are good for both in terms of testing, or using the view.