I’ve been thinking through and experimenting a bit more with some of the concepts in SQL Azure – specifically I was considering the impact of fragmentation on both the storage (in terms of the storage limit) as well as the maintenance. This is not a new issue, DBA’s face fragmentation regularly and can deal with it in a variety of ways, but with SQL Azure the problem looks magnified by a lack of tools and working space. Whilst looking into this, I then realised that there is an unfortunate consequence of not knowing how much data space your index is actually using.
Each table in SQL Azure has to have a clustered index if data is going to be inserted into it and clustered indexes can suffer from fragmentation if chosen poorly. The combination of SQL Azure and the time-honoured fragmentation provides three consequences about it, fragmentation:
- will occur and you have no way in which to measure it due to the lack of DMV support.
- will create wasted space within your space allocation limit.
- will reduce your performance.
You could work it out if you knew how much space you had actually used vs. what the size of the data held is, but we are unable to measure either of those values. If you have chosen the data compression option on the index then even those values would not give you a fragmentation ratio.
This leaves us with a situation in which you can not know how much you are fragmented, meaning:
- You schedule a regular index rebuild.
- Hope SQL Azure performs index rebuilds for you.
I’m not aware of SQL Azure doing this for you – and you do not have SQL Agent facilities either.
So this seems very wrong, the concept of SQL Azure is to take away a lot of the implementation details and hassle from the subscriber – DR and failover is handled etc. But there looks to be as gap in which certain items such as fragmentation is falling – I have not seen any documentation saying SQL Azure handles it (but there could be some hidden somewhere and I hope there is!) and neither are you given the right tools in which to program and handle it yourself.
What happens when you hit that size limit?
Msg 40544, Level 20, State 5, Line 1 The database has reached its size quota. Partition or delete data, drop indexes, or consult the documentation for possible resolutions. Code: 524289
That took a lot of time to get to, (SQL Azure is not fast), but was generated using a simple example that would also demonstrate fragmentation.
Create Table fragtest ( id uniqueidentifier primary key clustered, padding char(3000) )
Very simple stuff, deliberately using a clustered key on a GUID to cause a decent level of fragmentation, and also using the padding fixed with character field to ensure 2 rows per page only, maximising the page splits.
insert into fragtest values (newid(), replicate('a',1000)) go 200000
Because of the randomness of the newid() function, the level of fragmentation is not predictable but will certainly occur – in my test I hit the wall on 196,403 records and failed with an out of space message.
Given the 2 rows per page and the number of rows, with ~0% fragmentation the data should be able ~767Mb – that is considerably short of 1 Gb – so there is a significant level of fragmentation in there wasting space, about 23% of it. If you include the 2k per page being wasted by the awkward row size then the actual raw data stored is roughly ~60% of the overall size allowing for row overheads etc.
So there are two important points from this contrived example:
- You can lose significant space from bad design.
- Doing this backs you into a corner that you will not be able to get out of – this is the worst part.
How are you cornered? well, try work out how to get out of the situation and defrag the clustered index / free up the space, you could:
- Attempt an index rebuild.
- Try to rebuild it with SORT_IN_TEMP.
- Drop the index.
- Delete data.
The first three fail, the SORT_IN_TEMP is not supported and would not of rescued the situation either since you have no working space in which to write the newly sorted rows prior to removing the old ones. So do you really want to delete data? I don’t think we can consider that an option for now.
This all seems like a ‘rock’ and a ‘hard place’; whilst SQL Azure can support these data quantities, it seems prudent that you never consider actually going close to them at all – and that you equally are going to find it difficult to understand if you are close to them, since there is no way of measuring the fragmentation. The alternative is that you manually rebuild indexes on a regular basis to control fragmentation, but then enough free space is going to have to be left to allow you to rebuild your largest index without running out of space – reducing your data capacity significantly.
The corner is not entirely closed off, the way out of the corner would be to create another SQL Azure database within my account and select the data from database1.fragtest to database2.fragtest and then drop the original table and transfer it back – not ideal but it would work in an emergency.
I think the key is to design to make sure you do not have to face this issue; keep your data quantities very much under the SQL Azure size limits, and watch for the potential of tables being larger than the remaining space and preventing an re-indexing from occurring.
Interested to know your thoughts on this one, and what other consequences of being close to the limit will come out.
The PDC is just about over, the final sessions have finished and the place is emptying rapidly – the third day has included a lot of good information about SQL Azure, the progress made to date on it as well as the overall direction – including a new announcement by David Robinson, Senior PM on the Azure team about a project codenamed ‘Houston’ .
During the sessions today the 10Gb limit on a SQL Azure database was mentioned a number of times, but each time was caveated with the suggestion that this is purely the limit right now, and it will be increased. To get around this limit, you can partition your data across multiple SQL Azure databases, as long as your application logic understands which database to get the data from. There was no intrinsic way of creating a view across the databases, but it immediately made me consider that if you were able to use the linked server feature of the full SQL Server, you could link to multiple Azure databases and created a partitioned view across the SQL Azure databases – got to try that out when I get back to the office but I do not expect it to work.
SQL Azure handles all of the resilience, backup, DR modes etc, and it remains hidden from you – although when connected to the SQL Azure database you do see a ‘Master’ database present. It is not really a ‘Master’ in the same way that we think of one, and it quickly becomes apparent how limited that version of the ‘Master’ really is – it exists purely to give you a place to create logins and databases. It could have been called something else to make it a bit clearer but one of the SQL Azure team said it was to keep compatibility to other 3rd party applications that expected there to be a master.
SQL Azure supports transactions as mentioned before, but given the 10GB limit currently on a database you will be partitioning your data across databases. That will be a problem, because the system does not support distributed transactions, so any atomic work that is to be committed on multiple databases at once it going to have to be controlled manually / crufted in code, which is not ideal and a limitation to be aware of.
Equally cross database joins came up as an area with problems – they can be made, but it appears there are performance issues – interested to start running some more tests there and see whether you can mimic a partitioned view across databases using joins. The recommendation was to duplicate reference data between databases to avoid joins, so lookup tables would appear in each database in effect, removing the cross database join.
On the futures list:
- The ability to have dynamic partition splits looked interesting, regular SQL server does not have this facility within a partitioned table – so if Azure can do it across databases then this might come up on the SQL roadmap as a feature – that could be wishful thinking.
- Better tooling for developers and administrators – that is a standard future roadmap entry.
- Ability to Merge database partitions.
- Ability to Split database partitions.
So SQL Azure has grown up considerably and continues to grow, in the hands-on-labs today I got to have more of a play with it and start testing more of the subtle limitations and boundaries that are in place. Connecting to an azure database via SQL Server Management Studio is trivial and the object explorer contains a cut down version of the normal object tree, but includes all the things you would expect such as tables, views and stored procedures.
Some limitations of the lack of master and real admin access become apparent pretty fast, no DMV support, no ability to check your current size. No ability to change a significant number of options, in fact, the bulk of the options are not even exposed.
Two of my personal favourites I took an immediate look at, maxdop and parameterization.
- Maxdop is set at 1, although you can not see it, and attempting to override it throws an error from the query windows, telling you that it is not permitted. Do not plan on parallel query execution, you will not get it.
- I attempted to test the query parameterisation using the date literal trick and it appeared to remain parametrized, as though the database is in ‘forced’ parameterisation mode, so is more likely to get parameter sniffing problems but I have not been able to concretely prove it as yet, but the early indication is the setting is ‘Forced’
One other interesting concept was that a table had to have a clustered index, it was not optional if you wanted to get data into the table, although is did not stop me from creating a table without a clustered index, I had not attempted to populate data into it to see this limit in action – a case of too much to do and so little time.
On one of the final talks about SQL Azure, David Robinson announced a project codenamed ‘Houston’ – (there will be so many ‘we have a problem’ jokes on that one) which is basically a silverlight equivalent of SQL Server Management Studio. The concept comes from the SQL Azure being within the cloud, and if the only way to interact with it is by installing SSMS locally then it does not feel like a consistent story.
From the limited preview, it only contains the basics but it clearly let you create tables, stored procedures and views, edit them, even add data to tables in a grid view reminiscent of Microsoft Access. The UI was based around the standard ribbon bar, object window on the left and working pane on the right. It was lo-fi to say the least but you could see conceptually where it could go – given enough time it could become a very good SSMS replacement, but I doubt it will be taken that far. There was an import and Export button on the ribbon with what looked to be ‘Excel’ like icons but nothing was said / shown of them. Date wise ‘Targetting sometime in 2010’, so this has some way to go and is not even in beta as yet.
So that was PDC09, excellent event, roll on the next one!
As promised, I wanted to only blog about the bits of the PDC that relate to SQL / Database / Data Services, and not every session within the PDC that I am attending. Many of the sessions have been interesting, but I am viewing them with my Architect’s hat on, and not from the viewpoint of my personal passion for SQL Server. I feel fortunate to be here and listening to the speakers and chatting to them offline instead of watching the PDC on the released videos after the event.
The keynote today contained a number of very interesting looking prospects on the data side of the fence, primarily ‘compered’ by Ray Ozzie, Chief Software Architect at Microsoft. There were also some demos, some of which were quite good, whilst others suffered from over-scripting. I am sure twitter was going wild at times during the keynote as people were giving real-time feedback about what they thought. (Whether that is a good thing or not I am not sure, walking off stage to find a few hundred bad reviews can not be nice.) But this is not about the demos but about the SQL / Data stuff.
A lot of work Microsoft have been doing and the phrase repeated throughout was ‘3 screens and a cloud’, using the 3 screens of mobile, computer and tv to represent 3 different delivery paradigms, but fundamentally using the same technology stack to deliver all 3.
The Azure data centres were announced to be going into production on Jan 1st 2010, and billing for those services will commence on the 1st Feb. However, the European and far eastern data centres were not listed as coming online until late in 2010, so the only data centres that will be up and running will be the Chicago and San Antonio data centres.
This may not seem a big problem, and in fact having 3 pair’s of data centres around the world is far more ideal and a single centralised resource, but for Europeans there are data protection laws in place that prohibit the movement of personal data outside of the bounds of Europe. In effect, you may not move the data into another jurisdiction where the data laws remove the legal protection the data subject owns. So from a data angle, it will be more interesting when the Dublin / Amsterdam data centre comes online in 2010, at which point storing data in the Azure cloud has a better data protection story.
SQL Azure has clearly been ‘beefed’ up and can now be connected to via SQL Server Management Studio just like a normal database, and be administered / interacted with – even supporting transactions. The disaster recovery and physical administration of the SQL remains out of sight and handled by the cloud, and not the application / vendor. SQL Azure understands TDS, so connecting to the SQL Azure is pretty seamless and appears like a regular SQL server. It has clearly matured as a platform, and rightly so.
Another project, codenamed ‘Dallas’ was announced which forms part of pinpoint. Pinpoint is a products / services portal, which instantly made me think of Apple’s ‘AppStore’ but for windows products and companies offering services. The interesting part is the ‘Dallas’ section, which is something like a ‘Data Store’ – allowing the discovery and consumption of centralised data services.
There has always been an issue when consuming data from other sources, that you are required to download it, understand the schema of the data and often ETL it from the format it is being supplied in, such as CSV, XML, Atom etc into a format that you can work with. Each data source often has its own schema and delivery mechanism and handling updates to the data remains an operational issue.
With ‘Dallas’ you are buying into the data being held within the cloud and it will auto-generate the proxy class for the data being consumed, so the schema of the data is available to you within code on the development side. This is an awesome concept and if they can tie in some form of micro-payment structure, you could easily visualise a set of data services that you consume within an application on an as needed basis. Without the micro-payments, you would have to have purchased a license, whether that is a one off cost, or a monthly subscription, neither deals with the ‘elastic’ nature of the applications that are being placed onto the cloud and one of the key benefits in that the data centres can scale up / down as your apps require. Given the billing of that is based on usage and you specifically want to take advantage of the elasticity of the infrastructure provision, it would make sense to have a similar elasticity in the data service charging arena.
This is definitely a technology to keep a close eye on, and I will be signing up an account to get access to the free data services that they are going to expose.