Archive for May, 2010

SQL Server Hash Partitioning

May 31, 2010 10 comments

It’s been a while since the last post, primarily due to changing jobs and now spending most of my time on Oracle – although it is always good to see the other side of the coin and see what it has to offer, but I won’t be abandoning SQL Server, that is for certain.

One of the more interesting features to me in Oracle is hash partitioning – the ability to create a partition across a defined number of partitions, and then arbitrarily decide which partition the data will go in based on a hashing function. Why would that be handy? SQL Server partitioning is in effect a range partition, in which you define the dividing points on the number line / alphabet – which suits partitions based on a defined number range or date range, but does not suit partitioning of other types such as a GUID.

The merits of such a partition could be debated, since with a decent index in place the benefits of the partition elimination within the query plan can be limited. Regardless of those merits (and I am pretty sure it is not going to be performant at scale, however could SQL Server implement Hash Partitioning? On a side note, this could be considered semi-dynamic partitioning in that the partition is able to cope with additional data outside of the expected range, due to the hash function.

I’ve seen a few articles try and perform hash partitioning by pre-processing the insert statement, prior to insertion into the database, but what about something a bit more native?

To start with, we need to create a partition function and partition schema to support this endeavour, both are pretty easy to construct.

CREATE PARTITION FUNCTION [myPartitionFunction] (int)
AS RANGE LEFT FOR VALUES (100,200,300,400,500,600,700,800,900)
PARTITION [myPartitionFunction] ALL TO ([FG1])

I’ve set up the partition scheme to assign all of the partitions to FG1, just for convenience, it could easily be set to multiple filegroups, and instead of 9 partitions, this could be constructed with 999 partitions.

There are a variety of hashing algorithms and functions, but given the range covered by the partition function, I have chosen to use a very simple modulo on the converted binary of the unique identifier. The only trick here is that we must create the function with schema binding, otherwise SQL will refuse to use the function later on when we persist the column and partition on it.

CREATE FUNCTION GuidHash (@guid_value uniqueidentifier) RETURNS int
 RETURN abs(convert(bigint,convert(varbinary,@guid_value))) % 999

That is a pretty simple hashing function, but the point is to demonstrate is can be done, not to implement the best hashing algorithm that will give the most even distribution etc. The next step is to create the table, with the persisted column defined using the GuidHash function. If the function is not schema bound, you get an error thrown at this stage.

CREATE TABLE MyTable(  MyID UniqueIdentifier not null,  
SomeField Char(200), 
PartitionID as dbo.GuidHash(MyId) PERSISTED
ON myPartitionScheme(PartitionID)

The surprise here is that is accepts the table creation definition – since when would you expect a partitioned table’s column to be a computed column?

Time to put an index on the table, given the data is indexed off a unique identifier, it would not be unusual to place a non-clustered index on the table and to use index-alignment, e.g. place it on the same partitioning scheme.

CREATE NONCLUSTERED INDEX [ix_id] ON [dbo].[MyTable] (  
[MyID] ASC,  
[PartitionID] ASC
) ON [myPartitionScheme]([PartitionID])

Populate the table with some test data:

DECLARE @guid uniqueidentifier
SET @guid = newid()
INSERT INTO mytable (myid, somefield) VALUES (@guid, 'some text')
go 10000

So what happens when we select a single row from our data, for convienience I looked up a value in the table and grabbed the GUID – comparing the two queries side by side,

SELECT * FROM mytable WHERE myid = 'D41CA3AC-06D1-4ACC-ABCA-E67A18245596' 
SELECT * FROM mytable WHERE (partitionid = dbo.guidhash ('D41CA3AC-06D1-4ACC-ABCA-E67A18245596') 
and myid = 'D41CA3AC-06D1-4ACC-ABCA-E67A18245596')

The comparison is interesting, in percentage terms, it was 85% to 15% for the batch, the IO Statistics reads:

First Query : Scan count 10, Logical Reads 21
Second Query : Scan count 1, Logical Reads 3

So the hashing has clearly made the query faster – but that just means it was faster than the query that didn’t use the partition key which shouldn’t be too surprising – partition elimination vs checking every partition should win, so all it shows is that partition elimination is occurring. So how does it stack up against a normal table, e.g. have we gained anything? To test we need to put 10k rows into the same table, minus the computed column, index it and perform the same kind of select – all easy stuff so I will not write the code here, the results of a select from a normal table?

Normal Table Query : Scan Count 1, Logical Reads 3

And when run side by side, the SSMS window reports a 50% split of work between the two queries within the batch – which is not surprising given the IO costs were listed as the same – so where is the catch? There is no such thing as a free lunch, and the additional cost here is the CPU to generate the PartitionID value for the hashed GUID, but as a technique to partition based on a GUID, it has some merits.

One final thing that I did notice is that under Simple Parameterization, the GuidHash based query does not parameterize, which would start having detrimental effects on the query cache, once the database was placed under forced parameterization, then the query did parameterize appropriately – so you either want forced parameterization on or to use stored procedures – I would vote for the later there.

As a technique is has some merits, but you have to remember to manually include the PartitionID column within each query and run it through the hashing function – which is not ideal, but manageable.