Archive for the ‘SQL Server’ Category

Dynamic Partitioning : Wishlist

January 11, 2010 1 comment

Whilst I have yet to find a valid purpose for dynamic partitioning, I decided to use it as an exercise and program a basic form of it in T-SQL over the coming weeks.

Given a blank piece of paper and some realism, what are the aims for the design and T-SQL:

  • Batch based rebalancing – real-time is not realistic so let’s start with an overnight job.
  • Choice to Balance by different metrics (Rows vs Physical Storage)
  • Balance across a setup-defined fixed number of partitions – so that they do not run out.
  • Ability to migrate Filegroups into and out of the Partition Scheme – e.g. schedule them for removal over the coming nights.
  • Ability to limit the processing to a window – this is not easy, but a log of earlier migrations would offer guidance on how much processing could be done within an allotted time span.
  • Ability to specify the balancing as an online operation – partitioning being Enterprise-only, we can rely on online index rebuilds being available.

That’s not a bad start although I bet it is harder than it sounds.

Let’s just consider the ‘balancing act’ itself, regardless of the options. A partition scheme is not a database table – which automatically complicates matters, since multiple tables and indexes can use the same partition scheme. This means that any change to a partition scheme / function will directly affect more than one table / index. Any calculation of the number of rows / size of data will have to take all of those tables and indexes into account.

It might seem unusual to place more than one table on a partition scheme, but it really isn’t. You would commonly place any non-clustered indexes on the same partition scheme to keep them ‘aligned’, so having multiple tables there for the same ‘alignment’ purpose shouldn’t seem weird. If you consider the multi-tenancy usage of a partitioned table, then you can see why you could have dozens of tables all on the same partition scheme.
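
To get a feel for what that balancing calculation involves, here is a minimal sketch that sums the row counts per partition across every table placed on a scheme – the scheme name is a placeholder:

SELECT p.partition_number, SUM(p.rows) AS TotalRows
FROM sys.partitions p
INNER JOIN sys.indexes i ON i.object_id = p.object_id AND i.index_id = p.index_id
INNER JOIN sys.partition_schemes ps ON ps.data_space_id = i.data_space_id
WHERE ps.name = 'YourPartitionScheme'
AND i.index_id IN (0, 1) -- heaps / clustered indexes only, so NC indexes are not double counted
GROUP BY p.partition_number
ORDER BY p.partition_number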

These requirements are the starting point for the T-SQL and as I come across issues I will write them up.

Is Dynamic Partitioning in SQL Server Possible?

January 5, 2010 Leave a comment

I often see people asking whether dynamic table partitioning exists in SQL Server, or they provide a scenario that would effectively be asking the same question. So let’s get the easy answer out now – straight out of the box SQL Server has no dynamic partitioning.

To be fair, straight out of the box there is no tooling surrounding partitioning either, except for a handful of DMVs – if you want to automate a rolling window, you need to program it yourself. SQL Server 2008 added a few bits, but it struck me that if you need a wizard to turn an existing table into a partitioned table then you’re not really planning ahead.

So if it is possible to automate a rolling window system, surely it is possible to automate some kind of dynamic partitioning?

Well, that depends on what the definition of ‘dynamic partitioning’ is when it comes to SQL, which would normally be set by the person who needs the feature to solve their specific issue. Before I start writing up a wish list of options and features to guide me in hacking some SQL together to solve the problem, you have to ask: do you really need dynamic partitioning?

Table partitioning by its nature suits larger volumes of data in a rolling window, where we migrate older data out and bring in new values. However, partitioning has been used for a variety of purposes that it possibly was not originally intended for, such as:

  • Performance gain through Partition Elimination
  • Multi-Tenancy databases, placing each client in a separate partition

Bizarrely each of those reasons has a counter argument:

  • Partition elimination only benefits queries that include the partition key in the WHERE clause; without it, partitioning is detrimental to the query since every partition has to be examined (see the example after this list).
  • Aside from the limit of 1,000 partitions, and therefore 1,000 customers, security is easier to compromise, per-customer upgrades are not possible, and the backup / restore strategy for individual customers gets very complex, since you do not wish to restore the whole table but a single partition.
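
As a quick illustration of the elimination point, using a hypothetical dbo.Sales table partitioned on OrderDate:

-- Partition key in the predicate: the optimiser can eliminate partitions.
SELECT COUNT(*) FROM dbo.Sales WHERE OrderDate >= '20091201' AND OrderDate < '20100101'

-- No partition key in the predicate: every partition has to be examined.
SELECT COUNT(*) FROM dbo.Sales WHERE CustomerID = 42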

Back to the question, do we really need dynamic partitioning?

The complexity and scale of most partitioned tables indicate that they should not occur by ‘accident’, and retro-fitting a partitioned table indicates a lack of data modelling / capacity planning. The ‘alternative’ reasons for partitioning are amongst the drivers for the dynamic partitioning request.

To make best use of the partitioned table feature requires planning and design, in which case it does not need to be ‘dynamic’.

That all being said, in the coming posts I am going to write up my wish list of features to start building a basic dynamic partitioning system and then make it more complex over time – it makes for a fun exercise.

If you have any thoughts on features you would want to see in it, just add them in a comment.

How Can You Tell if a Database is in Pseudo Full/Bulk Logged Mode?

December 13, 2009 Leave a comment

I was asked on Friday, “how do you tell if a database is reporting its logging mode as full or bulk logged, but is actually still in simple?” – as mentioned before, a database is not really in full / bulk logged mode until a full backup has been taken. Until that time the database is still running in a simple mode, sometimes referred to as pseudo-simple. It is not easy to spot, because the properties of the database will report full / bulk as appropriate and give no indication that it is not actually logging in the way it says.

The existence of a backup of the database is not a reliable enough mechanism for this, since the database can be backed up and then moved out of full / bulk logged mode into simple and back again. This breaks the backup and transaction log chain, but the database is still reporting full – to make it worse there is a backup record showing on the history, giving it an air of legitimacy.

The backup records can be accessed from sys.sysdatabases and msdb.dbo.backupset; MSDN even has an example script showing how to see when a database was last backed up and by whom.

SELECT
    T1.Name as DatabaseName,
    COALESCE(Convert(varchar(12), MAX(T2.backup_finish_date), 101), 'Not Yet Taken') as LastBackUpTaken,
    COALESCE(Convert(varchar(12), MAX(T2.user_name), 101), 'NA') as UserName
FROM sys.sysdatabases T1
LEFT OUTER JOIN msdb.dbo.backupset T2 ON T2.database_name = T1.name
GROUP BY T1.Name
ORDER BY T1.Name

To play around with the scripts you probably want a test database:

CREATE DATABASE [LogModeTest] ON  PRIMARY
( NAME = N'LogModeTest', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\LogModeTest.mdf' , SIZE = 3072KB , MAXSIZE = UNLIMITED, FILEGROWTH = 1024KB )
 LOG ON
( NAME = N'LogModeTest_log', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\LogModeTest_log.ldf' , SIZE = 1024KB , MAXSIZE = 2048GB , FILEGROWTH = 10%)
 COLLATE Latin1_General_CI_AI

With a minor alteration to the MSDN script you can get the backup history for this database:

SELECT
    T1.Name as DatabaseName,
    COALESCE(Convert(varchar(12), MAX(T2.backup_finish_date), 101), 'Not Yet Taken') as LastBackUpTaken
FROM sys.sysdatabases T1
LEFT OUTER JOIN msdb.dbo.backupset T2 ON T2.database_name = T1.name
WHERE T1.Name = 'LogModeTest'
GROUP BY T1.Name

The results show the database is not yet backed up:

DatabaseName                  LastBackUpTaken 
----------------------------- ---------------
LogModeTest                   Not Yet Taken

That is easy to fix, so let’s take a backup of the database and recheck the last backup value.

BACKUP DATABASE [LogModeTest]
TO DISK = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Backup\LogModeTest.bak'
WITH NOFORMAT, NOINIT, NAME = N'LogModeTest-Full Database Backup', SKIP, NOREWIND, NOUNLOAD, STATS = 10

DatabaseName                   LastBackUpTaken 
------------------------------ ---------------
LogModeTest                    12/13/2009

As expected the date of the backup is now set. If we alter the logging mode of the database to simple we will break the transaction log chain. To demonstrate the backup information being an unreliable source, let’s change to simple, create a table and then return to the fully logged mode.

ALTER DATABASE [LogModeTest] SET RECOVERY SIMPLE WITH NO_WAIT
CREATE TABLE foo(id int identity)
ALTER DATABASE [LogModeTest] SET RECOVERY FULL WITH NO_WAIT

If we now attempt to backup the transaction log, SQL is going to throw an error.

BACKUP LOG [LogModeTest]
TO DISK = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Backup\LogModeTest.bak'
WITH NOFORMAT, NOINIT, NAME = N'LogModeTest-Transaction Log Backup', SKIP, NOREWIND, NOUNLOAD, STATS = 10

Msg 4214, Level 16, State 1, Line 1
BACKUP LOG cannot be performed because there is no current database backup.
Msg 3013, Level 16, State 1, Line 1
BACKUP LOG is terminating abnormally.

And if we check the database backup history using the MSDN script:

DatabaseName                   LastBackUpTaken
------------------------------ ---------------
LogModeTest                    12/13/2009

So the backup history continues to show a date of the last full backup even though the transaction log chain is now broken. SQL certainly knows the database has not had a full backup since swapping into fully logged mode, so any transaction log backup is invalid, thus the error.

There is an easier way to find out that you are in pseudo-simple mode, without trying to perform a transaction log backup:

SELECT name, COALESCE(Convert(varchar(30),last_log_backup_lsn), 'No Full Backup Taken') as BackupLSN 
FROM sys.databases
INNER JOIN sys.database_recovery_status on sys.databases.database_id = sys.database_recovery_status.database_id

Run this against your server and it lists which databases have had a backup taken (shown by the existence of a backup LSN) and which have not had a full backup that could be used in recovery. If we then back up the database and recheck the values, the test database now records an LSN, showing it has moved out of pseudo-simple and into the full / bulk logged modes.

So that indicates whether we are in pseudo-simple or not, but it does not link back to the properties of the database to check what the declared logging mode actually is – you are primarily interested in databases that are not in simple mode in the first place, but are running in pseudo-simple due to the lack of a relevant full database backup. We can alter the query to handle this specific situation and the result is:

SELECT name, recovery_model_desc, COALESCE(Convert(varchar(30), last_log_backup_lsn), 'No Full Backup Taken') as BackupLSN
FROM sys.databases
INNER JOIN sys.database_recovery_status on sys.databases.database_id = sys.database_recovery_status.database_id
WHERE sys.databases.recovery_model <> 3 -- 3 = SIMPLE, so only databases declared full / bulk logged are checked
AND last_log_backup_lsn is null

If you run that query against your database server and get any results, then you have databases that are not running in the recovery model they indicate / that you think they are – which would generally not be a good thing.
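
The fix for any database caught out like this is simply a new full backup to restart the chain; a rough sketch that generates the commands rather than running them (the backup path is a placeholder):

-- Generate a BACKUP DATABASE statement for each database declared full / bulk logged
-- but still waiting on a full backup; review the output before executing it.
SELECT 'BACKUP DATABASE [' + name + '] TO DISK = N''C:\Backups\' + name + '.bak'''
FROM sys.databases
INNER JOIN sys.database_recovery_status ON sys.databases.database_id = sys.database_recovery_status.database_id
WHERE sys.databases.recovery_model <> 3
AND last_log_backup_lsn IS NULL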

Rolling a Partition Forward – Part 2

December 10, 2009 2 comments

The first part of this topic provided a mini-guide to loading data into a partitioned table and a few helpful DMV-based statements that can help you automate the process. Unloading the data should in theory be easier, but to do it in an automated fashion you are more reliant on the DMVs and system views to get to the right information.

The steps to unload a partition of data are:

  • Discover which file group the oldest partition is on.
  • Create a staging table on the same filegroup with an identical schema and indexes
  • Switch the data out to the staging table
  • Merge the partition function
  • Archive / Drop the data as appropriate.

As in part 1, there are three sections of the process which are less common, whilst the creation of a table and the archive / drop of the old data at the end is standard T-SQL that you will be using regularly.

Discover which Filegroup the Oldest Partition is On

When checking for the oldest filegroup, I have assumed that the basis of the rolling window is that the highest boundary is the most recent data, whilst the lowest boundary is the oldest – in essence time is moving forward and the partition key ascends, not descends. The oldest boundary will therefore be boundary 1; so how do you get the name of the filegroup that partition is on? With a somewhat complex join across a set of DMVs.

SELECT sys.filegroups.Name as FileGroupName FROM sys.partition_schemes 
INNER JOIN sys.destination_data_spaces ON sys.destination_data_spaces.partition_scheme_id = sys.partition_schemes.data_space_id
INNER JOIN sys.filegroups ON  sys.filegroups.data_space_id = sys.destination_data_spaces.data_space_ID
INNER JOIN sys.partition_range_values ON  sys.partition_range_values.Boundary_ID = sys.destination_data_spaces.destination_id
AND sys.partition_range_values.function_id = sys.partition_schemes.function_id
WHERE sys.partition_schemes.name = 'YourPartitionScheme'
and sys.partition_range_values.boundary_id = 1

This will return the name of the filegroup, which allows you to create the staging table for the partition switch-out on the correct filegroup.

Whilst the data space IDs do alter in sequence depending on whether the partition function is left or right based, the boundary ID remains consistent, which is why it is used to discover the oldest partition rather than the destination_id / data_space_id.
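
With the filegroup name to hand, the staging table can then be created on it. DDL will not take the filegroup as a variable, so a dynamic SQL sketch along these lines is one option – the table definition and names are placeholders and must match your partitioned table exactly:

DECLARE @FileGroupName sysname, @sql nvarchar(max)

SET @FileGroupName = 'YourOldestFG' -- populated from the filegroup query above

-- Build the CREATE TABLE so the staging table lands on the discovered filegroup.
SET @sql = N'CREATE TABLE dbo.YourStagingTable
(
    SaleDate datetime NOT NULL,
    Amount money NOT NULL
) ON [' + @FileGroupName + N']'

EXEC sp_executesql @sql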

Switch the Data Out to the Staging Table

Switching the data out is not complex; it is essentially the reverse syntax of switching a partition in. Under the hood you are redirecting IAM pointers, so the switch is considered a metadata command and is exceptionally fast.

ALTER TABLE YourPartitionedTable SWITCH PARTITION 1 TO YourStagingTable

The partition number used is in effect the boundary ID, and in a rolling window the oldest boundary is partition 1.

Merge the Partition Function

The last complex stage is the merging of the partition function; the command explicitly needs the value from the partition function that represents the partition. If you were doing this by hand you would know it, but automating the process requires discovering this information from the DMVs again.

SELECT value
FROM sys.partition_range_values
INNER JOIN sys.partition_functions ON sys.partition_functions.function_id = sys.partition_range_values.function_id  
 WHERE name = 'YourPartitionFunctionName' AND boundary_id = 1

Again, we are using a boundary_id of 1 to extract only the oldest partition’s boundary value, which can then be used in a partition function merge command.

ALTER PARTITION FUNCTION YourPartitionFunctionName() MERGE RANGE (YourBoundaryValue)
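
To automate the merge, the boundary value returned by that query has to be embedded into the DDL, since ALTER PARTITION FUNCTION will not accept a variable – a rough sketch, assuming a datetime partition key and placeholder names:

DECLARE @BoundaryValue datetime, @sql nvarchar(max) -- assumes the partition key is datetime

SELECT @BoundaryValue = CONVERT(datetime, value)
FROM sys.partition_range_values
INNER JOIN sys.partition_functions ON sys.partition_functions.function_id = sys.partition_range_values.function_id
WHERE name = 'YourPartitionFunctionName' AND boundary_id = 1

-- The value has to be built into the statement itself.
SET @sql = N'ALTER PARTITION FUNCTION YourPartitionFunctionName() MERGE RANGE (''' + CONVERT(nvarchar(30), @BoundaryValue, 126) + N''')'

EXEC sp_executesql @sql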

 

Conclusion

Using the DMVs and appropriate stored procedures, the rolling window can be automated and does not require hand-crafted SQL each time – just use the DMVs to get the key values you need to construct the harder parts of the process.

If you are following the guide on partition layout I wrote before, then the filegroup you have just removed the data from becomes the next spare filegroup, ready to house the data the next time it is imported. If you store this within the database, the next load will automatically know where to place the data and what to set the next used filegroup to – closing the loop, so to speak.

Rolling a Partition Forward – Part 1

December 9, 2009 Leave a comment

I have covered how to lay out a partitioned table across filegroups previously, but have not gone through the steps of rolling the partitioned window forward – it sounds a simple process, but with all the filegroup prerequisites needed for it to run smoothly, anyone starting with partitioned tables could probably use a little guide. As you are about to see the process is quite intricate, so I will go through the load process in this post and the unload in the next.

Because no one case fits all, I have made some assumptions / limitations to provide a guide, specifically:

  • The main partitioned table has a clustered index.
  • The layout is following the mechanism of keeping a staging filegroup and spare filegroup as detailed in the layout post.
  • The rollout process intends to remove the oldest data / partition.
  • The process is designed for large loads, not single inserts.

So let’s see what it takes to prepare and get data into a partitioned table:

  • Create a staging table on your dedicated ETL filegroup, of an identical column schema to your partitioned table.
  • Load the data into the staging table.
  • Move the staging table to the spare filegroup, using a clustered index creation. (The need for the spare was covered in the layout post)
  • Add any additional Non-Clustered indexes required to match the partitioned table indexes.
  • Constrain the data so that it is considered trusted – the constraint must ensure all values are within the partition boundary you intend to place it within.
  • Set the Partition Schema Next Used Filegroup
  • Split the Partition Function
  • Switch the staging table into the main partitioned table

That was all just to bulk load data into a partitioned table – a long list and plenty of opportunity for it to go wrong, but most of these steps use T-SQL that you will be very familiar with – it is only the last three items that use less common SQL and are harder to automate, since there are no built-in tools to do the work for you.

Setting the Next Used Filegroup

The intention when setting the next used filegroup is to declare where the partition scheme should locate data for the partition function split that is about to occur. Whilst you can discover what the previous setting might be, it is not advisable to rely on it; set it every time, just before performing a partition function split. The syntax for the command is:

ALTER PARTITION SCHEME YourPartitionSchemeName NEXT USED [YourSpareFG]

Splitting the Partition Function

Splitting the partition function is in effect creating an extra dividing line on the number line / date line representing the partitioned table. If you split a partition that already contains data, the operation can be quite expensive since it can be forced to move data between filegroups, so in a rolling window scenario it is common to split only to handle the incoming data, which is always in advance of your existing data. For example, if you are storing sales data partitioned by the month / year of the sales date and currently only hold data up until November, you would not insert any data for December until the partition for December had been created.

The syntax is straightforward:

ALTER PARTITION FUNCTION YourPartitionFunctionName() SPLIT RANGE (YourBoundaryValue)

But when importing new data in an automated fashion, you might not know whether the new partition split has already been performed or not, so how can you check whether the new boundary value is already created in the partition function? DMV’s can provide the answer:

SELECT count(value) as ValueExists
FROM sys.partition_range_values
INNER JOIN sys.partition_functions ON sys.partition_functions.function_id = sys.partition_range_values.function_id
WHERE name = 'YourPartitionFunctionName' AND value = YourBoundaryValue

A returned value of 0 would indicate it did not exist, whilst a 1 would indicate a boundary value had already been created.
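
Putting the check and the split together, here is a sketch of the guard you might wrap around the DDL – again assuming a datetime boundary and placeholder names:

DECLARE @NewBoundary datetime, @sql nvarchar(max)

SET @NewBoundary = '20091201' -- the boundary for the incoming data, a placeholder value

IF NOT EXISTS (SELECT 1 FROM sys.partition_range_values
               INNER JOIN sys.partition_functions ON sys.partition_functions.function_id = sys.partition_range_values.function_id
               WHERE name = 'YourPartitionFunctionName' AND value = @NewBoundary)
BEGIN
    -- Point the scheme at the spare filegroup, then create the new boundary.
    ALTER PARTITION SCHEME YourPartitionSchemeName NEXT USED [YourSpareFG]

    SET @sql = N'ALTER PARTITION FUNCTION YourPartitionFunctionName() SPLIT RANGE (''' + CONVERT(nvarchar(30), @NewBoundary, 126) + N''')'
    EXEC sp_executesql @sql
END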

Switching the Staging Table In

Switching the staging table into the newly created partition looks relatively easy but needs the partition number:

ALTER TABLE yourStagingTable SWITCH TO YourPartitionedTable PARTITION PartitionNumber

Where do you get the partition number from? The partition number is the boundary ID, which is numbered sequentially starting at 1 from the furthest left partition. If you know the boundary value you have set for the partition, you can get the boundary ID using the DMVs again:

SELECT boundary_id
FROM sys.partition_range_values
INNER JOIN sys.partition_functions ON sys.partition_functions.function_id  = sys.partition_range_values.function_id
WHERE name = 'YourPartitionFunctionName' AND value = YourBoundaryValue

These additional DMV queries give you access to the data you need to automate the process in stored procedures, finding out the boundary IDs in one step to be used in the next, and so on.
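
To automate the switch itself, the partition number can be looked up from the boundary value in the same way; a sketch following the boundary-ID-to-partition-number mapping described above (placeholder names, datetime key assumed):

DECLARE @PartitionNumber int, @BoundaryValue datetime

SET @BoundaryValue = '20091201' -- the boundary value used for the split, a placeholder

SELECT @PartitionNumber = boundary_id
FROM sys.partition_range_values
INNER JOIN sys.partition_functions ON sys.partition_functions.function_id = sys.partition_range_values.function_id
WHERE name = 'YourPartitionFunctionName' AND value = @BoundaryValue

-- The PARTITION clause accepts a variable, so no dynamic SQL should be needed here.
ALTER TABLE dbo.YourStagingTable SWITCH TO dbo.YourPartitionedTable PARTITION @PartitionNumber

If the PARTITION clause complains about the variable on your build, fall back to building the statement dynamically in the same way as the split.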

These are the trickier parts of the process to automate that need the help of the DMVs. In the next post I will go through the unloading of the old data.

How Can You Spot the Procedure Cache Being Flooded?

December 3, 2009 Leave a comment

This comes from a question I had a couple of days ago – the SQL Server: Buffer Manager: Page Life Expectancy performance counter indicates the current lifetime of a page within memory. As data pages and query object pages are added to the buffer pool they will of course build up, and SQL will come under memory pressure as a result. The normal advice is that this figure should be above 300 seconds, indicating that a page stays in memory for at least 5 minutes.

This figure, however, includes both the data cache and the procedure cache – which means you cannot determine whether the pages being flushed are a result of churning data pages, or whether you are in a situation where ad hoc queries are flooding the procedure cache. You can of course look at the procedure cache using DMVs and watch the number of objects grow and then shrink, but this is not particularly scientific, nor is it measurable within a trace.

The page life expectancy can easily be traced within Perfmon, but how do you measure the procedure cache? Well, there are a couple of events you can trace in SQL Profiler; the primary one I would like to use does not seem to register the event properly, whilst the secondary one does at least work. The two events are SP:Cache Remove and SP:Cache Insert.

SP:Cache Remove has two event sub classes listed in documentation produced by the SQL Programmability team: sub class 2 is for a deliberate procedure cache flush, such as a DBCC FreeProcCache command, and sub class 1 is for when a compiled plan is removed due to memory pressure. In testing, the deliberate procedure cache flush does show up in the profiler traces, with an event subclass value of ‘2 – Proc Cache Flush’ – but after a number of tests, I cannot ever get the event to be raised when the procedure cache is under memory pressure. If it did, we would have exactly what I was after: an easy, traceable and recordable way to show a procedure cache under too much pressure.

SP:Cache Insert is more of a backup mechanism to show the procedure cache is being flooded, but only on the basis that you count the number of times the event shows up within a trace over a period of time. In essence, an SP:Cache Insert is only going to occur if a query does not have a matching query plan within the cache. A large number of these within a short period of time is also an indication that the procedure cache is potentially being flooded.

Combine a large number of SP:Cache Inserts with a low Page Life Expectancy and you can suspect you definitely have a procedure cache flooding problem.

So there is a mechanism, of a kind, to determine whether a low page life expectancy comes from data page churn or query page churn, but if the SP:Cache Remove subclass 1 event actually worked it would be a lot easier. Once you know your plan cache is being flooded, you are then looking to check whether forced parameterization is worth using to eliminate the issue.
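
Alongside the trace events, a rough supporting indicator is the volume of single-use plans sitting in the cache; a minimal sketch using the plan cache DMV:

-- Count and size of compiled plans that have only ever been used once;
-- a large, fast-growing figure here suggests ad hoc queries flooding the cache.
SELECT objtype, COUNT(*) AS SingleUsePlans, SUM(CAST(size_in_bytes AS bigint)) / 1024 / 1024 AS CacheMB
FROM sys.dm_exec_cached_plans
WHERE usecounts = 1 AND objtype IN ('Adhoc', 'Prepared')
GROUP BY objtype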

Prodata SQL Academy Events

December 2, 2009 Leave a comment

If you haven’t seen them advertised, Bob Duffy from Prodata is running a series of SQL Academy half-day training sessions in Dublin, hosted at the Microsoft Auditorium in their offices in Leopardstown – the events are level 300, which suits the half-day slot allocated for the sessions. Yesterday’s was about performance tuning and optimisation, so myself and a colleague took a short flight over and enjoyed the excellent Irish hospitality. The talk was recorded, so there will no doubt be a webcast published at some point by Technet in Ireland. The talk primarily went through using perfmon counters and wait states – and the available tools that can make this a lot easier by wrapping up and correlating results from different logging mechanisms.

I would recommend keeping an eye out for the webcast when it appears, since troubleshooting a production environment is all about using non-intrusive means to understand what is crippling the system – memory, CPU, IO etc. If you are not practised at this form of troubleshooting it is very difficult to know which performance counters and wait states to observe amongst the thousands that exist – as well as which DMVs can give you the critical information to diagnose the problems. (It was quite interesting that the demonstration performance issue he was looking at was fundamentally a combination of a missing index and, more critically, a lack of query parameterisation since the database was in simple parameterization mode. The counters used to diagnose this problem, and the symptoms that you might encounter, I have previously written about.)

The wait-state side of the talk was very interesting, I often use a combination of DMV’s and perfmon in the field to diagnose, but have only used a certain amount of the wait-state information and do not delve into it as deeply – I will definitely be adding a few more wait states to the list for the future.

The next event is on February 16th and covers SQL Analysis Services – registration is already open.

Why is SQL Azure and Index Fragmentation a Bad Combination?

November 25, 2009 Leave a comment

I’ve been thinking through and experimenting a bit more with some of the concepts in SQL Azure – specifically, I was considering the impact of fragmentation on both the storage (in terms of the storage limit) and the maintenance. This is not a new issue – DBAs face fragmentation regularly and can deal with it in a variety of ways – but with SQL Azure the problem looks magnified by a lack of tools and working space. Whilst looking into this, I realised that there is an unfortunate consequence of not knowing how much data space your index is actually using.

Each table in SQL Azure has to have a clustered index if data is going to be inserted into it, and clustered indexes can suffer from fragmentation if chosen poorly. The combination of SQL Azure and time-honoured fragmentation has three consequences; fragmentation:

  • will occur and you have no way in which to measure it due to the lack of DMV support.
  • will create wasted space within your space allocation limit.
  • will reduce your performance.

You could work it out if you knew how much space you had actually used vs. the size of the data held, but we are unable to measure either of those values. If you have chosen the data compression option on the index, then even those values would not give you a fragmentation ratio.
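
For comparison, on a regular on-premise SQL Server instance you would measure the fragmentation directly; at the time of writing SQL Azure does not expose this DMV (the table name is a placeholder):

-- The usual on-premise check, unavailable in SQL Azure at present.
SELECT index_id, avg_fragmentation_in_percent, page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.YourTable'), NULL, NULL, 'LIMITED')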

This leaves us with a situation in which you cannot know how much you are fragmented, meaning you either:

  • Schedule a regular index rebuild.
  • Hope SQL Azure performs index rebuilds for you.

I’m not aware of SQL Azure doing this for you – and you do not have SQL Agent facilities either.

So this seems very wrong; the concept of SQL Azure is to take away a lot of the implementation details and hassle from the subscriber – DR and failover are handled, etc. But there looks to be a gap into which certain items such as fragmentation are falling – I have not seen any documentation saying SQL Azure handles it (there could be some hidden somewhere, and I hope there is!) and neither are you given the right tools with which to program around it and handle it yourself.

What happens when you hit that size limit?

Msg 40544, Level 20, State 5, Line 1 The database has reached its size quota. Partition or delete data, drop indexes, or consult the documentation for possible resolutions. Code: 524289 

That took a lot of time to get to, (SQL Azure is not fast), but was generated using a simple example that would also demonstrate fragmentation.

Create Table fragtest ( id uniqueidentifier primary key clustered,
padding char(3000)
) 

Very simple stuff: deliberately using a clustered key on a GUID to cause a decent level of fragmentation, and using the fixed-width character padding field to ensure only 2 rows per page, maximising the page splits.

insert into fragtest values (newid(), replicate('a',1000))
go 200000

Because of the randomness of the newid() function, the level of fragmentation is not predictable but will certainly occur – in my test I hit the wall at 196,403 records and failed with an out of space message.

Given the 2 rows per page and the number of rows, with ~0% fragmentation the data should take about ~767MB – considerably short of 1GB – so there is a significant level of fragmentation in there wasting space, about 23% of it. If you include the 2KB per page being wasted by the awkward row size, then the actual raw data stored is roughly ~60% of the overall size, allowing for row overheads etc.
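
For reference, the rough arithmetic behind that figure (approximate, ignoring row and page overheads):

196,403 rows / 2 rows per page = 98,202 pages (rounded up)
98,202 pages x 8KB per page    = ~767MB of perfectly packed data, against a database that has hit its quota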

So there are two important points from this contrived example:

  • You can lose significant space from bad design.
  • Doing this backs you into a corner that you will not be able to get out of – this is the worst part.

How are you cornered? Well, try to work out how to get out of the situation and defrag the clustered index / free up the space; you could:

  • Attempt an index rebuild.
  • Try to rebuild it with SORT_IN_TEMPDB.
  • Drop the index.
  • Delete data.

The first three fail; SORT_IN_TEMPDB is not supported, and it would not have rescued the situation either, since you have no working space in which to write the newly sorted rows prior to removing the old ones. So do you really want to delete data? I don’t think we can consider that an option for now.

This all seems like a ‘rock’ and a ‘hard place’; whilst SQL Azure can support these data quantities, it seems prudent never to go close to them at all – and equally, you are going to find it difficult to know whether you are close to them, since there is no way of measuring the fragmentation. The alternative is that you manually rebuild indexes on a regular basis to control fragmentation, but then enough free space has to be left to allow you to rebuild your largest index without running out of space – reducing your data capacity significantly.

The corner is not entirely closed off; the way out would be to create another SQL Azure database within my account, select the data from database1.fragtest into database2.fragtest, then drop the original table and transfer it back – not ideal, but it would work in an emergency.

I think the key is to design so that you never have to face this issue; keep your data quantities very much under the SQL Azure size limits, and watch for the potential of tables being larger than the remaining space, preventing a re-index from occurring.

Interested to know your thoughts on this one, and what other consequences of being close to the limit will come out.


PDC09 : Day 3 – SQL Azure and Codename ‘Houston’ announcement

November 20, 2009 Leave a comment

The PDC is just about over; the final sessions have finished and the place is emptying rapidly. The third day included a lot of good information about SQL Azure, the progress made to date on it as well as the overall direction – including a new announcement by David Robinson, Senior PM on the Azure team, about a project codenamed ‘Houston’.

During the sessions today the 10GB limit on a SQL Azure database was mentioned a number of times, but each time it was caveated with the suggestion that this is purely the limit right now and it will be increased. To get around the limit you can partition your data across multiple SQL Azure databases, as long as your application logic understands which database to get the data from. There is no intrinsic way of creating a view across the databases, but it immediately made me consider that if you were able to use the linked server feature of the full SQL Server, you could link to multiple Azure databases and create a partitioned view across them – got to try that out when I get back to the office, but I do not expect it to work.

SQL Azure handles all of the resilience, backup, DR modes etc., and it remains hidden from you – although when connected to a SQL Azure database you do see a ‘master’ database present. It is not really a ‘master’ in the same way that we think of one, and it quickly becomes apparent how limited that version of ‘master’ really is – it exists purely to give you a place to create logins and databases. It could have been called something else to make this a bit clearer, but one of the SQL Azure team said it was kept for compatibility with 3rd party applications that expect there to be a master.

SQL Azure supports transactions as mentioned before, but given the current 10GB limit on a database you will be partitioning your data across databases. That will be a problem, because the system does not support distributed transactions, so any atomic work that is to be committed on multiple databases at once is going to have to be controlled manually / crafted in code, which is not ideal and a limitation to be aware of.

Equally, cross-database joins came up as an area with problems – they can be made, but it appears there are performance issues – I am interested to start running some more tests there and see whether you can mimic a partitioned view across databases using joins. The recommendation was to duplicate reference data between databases to avoid joins, so lookup tables would in effect appear in each database, removing the cross-database join.

On the futures list:

  • The ability to have dynamic partition splits looked interesting, regular SQL server does not have this facility within a partitioned table – so if Azure can do it across databases then this might come up on the SQL roadmap as a feature – that could be wishful thinking.
  • Better tooling for developers and administrators – that is a standard future roadmap entry.
  • Ability to Merge database partitions.
  • Ability to Split database partitions.

So SQL Azure has grown up considerably and continues to grow; in the hands-on labs today I got to have more of a play with it and start testing more of the subtle limitations and boundaries that are in place. Connecting to an Azure database via SQL Server Management Studio is trivial, and the object explorer contains a cut-down version of the normal object tree, but includes all the things you would expect such as tables, views and stored procedures.

Some limitations of the lack of master and real admin access become apparent pretty fast: no DMV support, no ability to check your current size, and no ability to change a significant number of options – in fact, the bulk of the options are not even exposed.

Two of my personal favourites I took an immediate look at, maxdop and parameterization.

  • Maxdop is set at 1; although you cannot see the setting, attempting to override it throws an error from the query window, telling you that it is not permitted. Do not plan on parallel query execution – you will not get it.
  • I attempted to test the query parameterisation using the date literal trick and the query appeared to remain parameterised, as though the database is in ‘forced’ parameterisation mode and so more likely to suffer parameter sniffing problems. I have not been able to prove it concretely as yet, but the early indication is that the setting is ‘Forced’.

One other interesting concept was that a table has to have a clustered index if you want to get data into it; that said, it did not stop me from creating a table without a clustered index – I just did not get time to try populating it to see the limit in action. A case of too much to do and so little time.

On one of the final talks about SQL Azure, David Robinson announced a project codenamed ‘Houston’ (there will be so many ‘we have a problem’ jokes on that one), which is basically a Silverlight equivalent of SQL Server Management Studio. The concept comes from SQL Azure being within the cloud; if the only way to interact with it is by installing SSMS locally, then it does not feel like a consistent story.

From the limited preview it only contains the basics, but it clearly lets you create tables, stored procedures and views, edit them, and even add data to tables in a grid view reminiscent of Microsoft Access. The UI was based around the standard ribbon bar, object window on the left and working pane on the right. It was lo-fi to say the least, but you could see conceptually where it could go – given enough time it could become a very good SSMS replacement, but I doubt it will be taken that far. There were Import and Export buttons on the ribbon with what looked to be ‘Excel’-like icons, but nothing was said / shown of them. Date-wise it is ‘targeting sometime in 2010’, so this has some way to go and is not even in beta as yet.

So that was PDC09, excellent event, roll on the next one!

PDC09 : Day 1 Keynote

November 18, 2009 Leave a comment

As promised, I wanted to only blog about the bits of the PDC that relate to SQL / Database / Data Services, and not every session within the PDC that I am attending. Many of the sessions have been interesting, but I am viewing them with my Architect’s hat on, and not from the viewpoint of my personal passion for SQL Server. I feel fortunate to be here and listening to the speakers and chatting to them offline instead of watching the PDC on the released videos after the event.

The keynote today contained a number of very interesting looking prospects on the data side of the fence, primarily ‘compered’ by Ray Ozzie, Chief Software Architect at Microsoft. There were also some demos, some of which were quite good, whilst others suffered from over-scripting. I am sure twitter was going wild at times during the keynote as people were giving real-time feedback about what they thought. (Whether that is a good thing or not I am not sure, walking off stage to find a few hundred bad reviews can not be nice.) But this is not about the demos but about the SQL / Data stuff.

A lot of the work Microsoft have been doing was summed up by the phrase repeated throughout: ‘3 screens and a cloud’ – using the 3 screens of mobile, computer and TV to represent 3 different delivery paradigms, but fundamentally using the same technology stack to deliver all 3.

The Azure data centres were announced to be going into production on Jan 1st 2010, and billing for those services will commence on the 1st Feb. However, the European and far eastern data centres were not listed as coming online until late in 2010, so the only data centres that will be up and running will be the Chicago and San Antonio data centres.

This may not seem a big problem, and in fact having 3 pairs of data centres around the world is far more ideal than a single centralised resource, but for Europeans there are data protection laws in place that prohibit the movement of personal data outside of the bounds of Europe. In effect, you may not move the data into another jurisdiction where the data laws remove the legal protection the data subject owns. So from a data angle, it will be more interesting when the Dublin / Amsterdam data centre comes online in 2010, at which point storing data in the Azure cloud has a better data protection story.

SQL Azure has clearly been ‘beefed’ up and can now be connected to via SQL Server Management Studio just like a normal database, and be administered / interacted with – even supporting transactions. The disaster recovery and physical administration of the SQL remains out of sight and handled by the cloud, and not the application / vendor. SQL Azure understands TDS, so connecting to the SQL Azure is pretty seamless and appears like a regular SQL server. It has clearly matured as a platform, and rightly so.

Another project, codenamed ‘Dallas’ was announced which forms part of pinpoint. Pinpoint is a products / services portal, which instantly made me think of Apple’s ‘AppStore’ but for windows products and companies offering services. The interesting part is the ‘Dallas’ section, which is something like a ‘Data Store’ – allowing the discovery and consumption of centralised data services.

There has always been an issue when consuming data from other sources, that you are required to download it, understand the schema of the data and often ETL it from the format it is being supplied in, such as CSV, XML, Atom etc into a format that you can work with. Each data source often has its own schema and delivery mechanism and handling updates to the data remains an operational issue.

With ‘Dallas’ you are buying into the data being held within the cloud, and it will auto-generate the proxy class for the data being consumed, so the schema of the data is available to you within code on the development side. This is an awesome concept, and if they can tie in some form of micro-payment structure you could easily visualise a set of data services that you consume within an application on an as-needed basis. Without micro-payments you would have to purchase a license, whether that is a one-off cost or a monthly subscription; neither deals with the ‘elastic’ nature of the applications being placed onto the cloud – one of the key benefits being that the data centres can scale up / down as your apps require. Given that the billing is based on usage, and you specifically want to take advantage of the elasticity of the infrastructure provision, it would make sense to have a similar elasticity in the data service charging arena.

This is definitely a technology to keep a close eye on, and I will be signing up an account to get access to the free data services that they are going to expose.
