Oracle : ambiguous join statements being accepted and executed


I came across a weird piece of SQL earlier today where my first reaction was “That is just invalid SQL” – except that I was viewing it in OEM, so it clearly had been parsed and executed. Perhaps it is not too bizarre or surprising to some, but because this is a form of SQL I would never write, due to the inherent problem within it, I did not expect Oracle to behave as it did.
In abstract form the statement is as follows:

select *
from a
full outer join b on a.id = b.id
inner join c on a.id = c.id

Perhaps that seems innocent enough, but it is ambiguous and non-resolvable. To demonstrate, let's give A, B and C some minimal data.

A - records with ID = 1,2 and 3.
B - records with ID = 1,2 and 4.
C - records with ID = 1 and 2.

If A joins to C first, the inner join resolves the set down to IDs 1 and 2 – and the full outer join to B then gives a final result of 1, 2 and 4.
If A joins to B first, the full outer join gives rows 1, 2, 3 and 4 – and the inner join to C then gives a final result of 1 and 2.

This is why I considered the statement invalid as soon as I saw it – it is ambiguous, and if SQL is treated correctly as a declarative language there is nothing within the statement as it stands to say which of those two answers is the correct one.

So far that is just theory crafting – so let's put it into a simple, concrete example.

with a as
(
select 1 as id from dual
union
select 2 as id from dual
union
select 3 as id from dual
)
, b as (
select 1 as id from dual
union
select 2 as id from dual
union
select 4 as id from dual
)
, c as (
select 1 as id from dual
union
select 2 as id from dual
)
select a.id as a_id, b.id as b_id, c.id as c_id
from a
full outer join b on a.id = b.id
inner join c on a.id = c.id

The results:


>A_ID       B_ID       C_ID
>---------- ---------- ----------
>1          1          1
>2          2          2

So that looks clear: it performs the outer join and then the inner join – seems straightforward? But if the statement were properly treated as declarative, the ambiguity is still there; Oracle has simply ignored it and applied a rule to make the decision for you. I already have an issue with that approach, since there is no indication that it had to step in and resolve an ambiguity in the statement submitted.

Normally in SQL we know that changing the order of the joins should not make a difference in the result – it can change the explain plan for the query, especially with an ordered hint in place, but the results should be the same – so let’s test that concept on this.

with a as
(
select 1 as id from dual
union
select 2 as id from dual
union
select 3 as id from dual
)
, b as (
select 1 as id from dual
union
select 2 as id from dual
union
select 4 as id from dual
)
, c as (
select 1 as id from dual
union
select 2 as id from dual
)
select a.id as a_id, b.id as b_id, c.id as c_id
from a
inner join c on a.id = c.id
full outer join b on a.id = b.id

 

All that has changed is that the inner join has moved above the full outer join – that is the limit of the difference. The results:


>A_ID       B_ID       C_ID
>---------- ---------- ----------
>1          1          1
>2          2          2
>           4

 

Now, in my view of the SQL world, this is just incorrect – the statement was ambiguous to start with and should not have been permitted to execute; it should have been rejected and an exception thrown. SQL should be treated as a declarative language, and no inference should be drawn from the order of the joins.

What is surprising to me is that it accepts, parses and executes the statement with no warning or declaration of what it is doing, and then, to make matters worse, it is not even consistent in the result – the answer depends on an inference that will vary with the style of the developer writing the SQL. I would normally say that the order of joins in a statement does not affect the final output – evidently that is not entirely true.
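If the intent really is "inner join A to C first, then full outer join the result to B", the safest thing is to say so explicitly rather than rely on whichever rule the parser applies. A sketch of one way to do that, re-using the a, b and c WITH blocks from above and pushing the inner join into an inline view:

select ac.a_id, b.id as b_id, ac.c_id
from (
  select a.id as a_id, c.id as c_id
  from a
  inner join c on a.id = c.id
) ac
full outer join b on ac.a_id = b.id

Written that way there is only one possible interpretation, whatever order the optimizer chooses to evaluate it in.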


Oracle : ANSI vs Old Style Joins

August 17, 2013 1 comment

I came across this problem and I still cannot find any documented reason why the Oracle query processor cannot handle mixing join styles within a statement. The reason I was mixing join styles at the time was an ORA-600 bug involving nested table types: the cross join to the nested table field was not working properly. That forced me to use a combination of ANSI joins (which has been my default approach for many years) and the old style join using a comma and the where clause.

So a quick look at the docs : http://docs.oracle.com/cd/B28359_01/server.111/b28286/statements_10002.htm – the query_block::= section is what interests me here.

query_block

Now, for me the train tracks here show a syntax fault. The loop back to add another join shows a comma on it, which is not required depending on the route taken after the FROM part of the statement. Technically, that would allow a FROM clause like this:

FROM tbl_a, INNER JOIN tbl_b on tbl_a.field = tbl_b.field

We know that is just not legal syntax – so we can already consider this train track diagram a bit off. Am I reading the diagram incorrectly? http://docs.oracle.com/cd/B28359_01/server.111/b28286/ap_syntx.htm#SQLRF018 is the guide to reading them, and the syntax loops section shows an example where fields are separated by commas in a loop. The section on optional keywords and parameters shows that if the comma were optional, it would appear differently in the diagram.

Using the diagram we should also be able to use this syntax:
select *
FROM tbl_a a, tbl_b b
INNER JOIN tbl_c c on c.field1 = a.field1
WHERE c.field2 = b.field2

This again fails – and although the error thrown is correct, it is not as helpful as you might like.
ORA-00904: "A"."FIELD1": invalid identifier
00904. 00000 - "%s: invalid identifier"

So the error is basically saying that the query processor cannot find a reference to A.FIELD1, even though it is clearly there in the FROM clause. This leads me to assume that the ANSI joins are processed first and the old style joins in the where clause are processed second – at the point the ANSI join is processed, it has no way of knowing that A.FIELD1 does indeed exist.

Could the error message be better? Most definitely. If you try the same thing but using an old style outer join, like this:
select *
FROM tbl_a a, tbl_b b
INNER JOIN tbl_c c on c.field1 = a.field1
WHERE c.field2 (+)= b.field2

Oracle will kindly inform you that ORA-25156: old style outer join (+) cannot be used with ANSI joins – so the query processor is quite aware of the distinction between the join styles. Just to add to the frustration, re-ordering the tables in the FROM clause allows it to work:
select *
FROM tbl_b b, tbl_a a
INNER JOIN tbl_c c on c.field1 = a.field1
WHERE c.field2 = b.field2

This is because the ANSI join is now processed as tbl_a inner join tbl_c, so tbl_a is ‘visible’ to the query processor again.
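Of course, when you are not forced into mixing styles the cleanest fix is not to mix them at all – the comma can be expressed as an ANSI CROSS JOIN, which keeps every alias visible to the later joins. A sketch using the same placeholder tables and columns as the examples above:

select *
FROM tbl_a a
CROSS JOIN tbl_b b
INNER JOIN tbl_c c on c.field1 = a.field1
WHERE c.field2 = b.field2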

This is just an annoying bit of behaviour from the query processor, with poor error messaging and seemingly flawed documentation – to be fair, if I had not been forced to mix join styles by the ORA-600, I would not have come across it either.


Oracle : Recursive Common Table Expressions and Parallelism

April 14, 2012 3 comments

Sometimes, no matter how hard you search, you just cannot find an answer – that was the problem this week. Oracle's recursive common table expressions (RCTEs), or recursive subquery factoring to put it in Oracle's terms, were proving to perform pretty badly. (Hopefully the next person searching will now find this answer.)

As features go, this should be a relatively well-known one – it is part of the ANSI SQL-99 standard and available in a number of RDBMSs, with near identical syntax across implementations.

Even the esteemed Mr Kyte has changed his position on RCTE’s from being more code and harder to understand than the CONNECT BY syntax, to being a somewhat useful feature.

So what was the question to which we could find no answer?

Why does an RCTE seem to ignore parallel hints?

Amazingly, we could not find anything documented about this against RCTEs themselves or in the parallelism sections of the documentation. No mention of any restriction on parallelism for RCTEs appears anywhere.

We had quite a complex example, but needed a simple scenario to submit to Oracle to get an answer. Kudos for the crafting of this goes to Phil Miesle – it was his turn to deal with Oracle support.

First, create a numbers table and fill it with data – we even used an RCTE to do that part.

create table vals (n ,constraint vals_pk primary key (n) ) 
as 
with numbers(n) as (
select 1 as n    
from dual   
union all  
select n+1 as n    
from numbers   
where n < 1000000 
)
select n from numbers;

We now need a data table that is basically going to act as a hierarchy for us to test the RCTE against. A simple parent / child table will suffice:

create table pairs (
  parent 
  ,child 
  ,constraint pairs_pk primary key (parent,child) ) 
as 
select 
  'B'||to_char(mod(n,100000)+1) as parent      
  ,'A'||to_char(n) as child   from vals;

Using the numbers table, the pairs table now contains A1 to A1000000 linking to B1 to B100000 – so in effect, every B has 10 A’s linked to it.

This then continues, with every 10 B’s linking to a C:

insert into pairs 
select distinct        
  'C'||to_char(mod(n,10000)+1) as parent      
   ,'B'||to_char(mod(n,100000)+1) as child  
from vals;

And so on, with each successive layer having a 10 to 1 ratio:

insert into pairs 
select distinct       
  'D'||to_char(mod(n,1000)+1) as parent      
  ,'C'||to_char(mod(n,10000)+1) as child  
from vals;

insert into pairs 
select distinct       
  'E'||to_char(mod(n,100)+1) as parent       
  ,'D'||to_char(mod(n,1000)+1) as child  
from vals;

insert into pairs 
select distinct        
  'F'||to_char(mod(n,10)+1) as parent      
  ,'E'||to_char(mod(n,100)+1) as child  
from vals;

insert into pairs 
select distinct       
  'G'||to_char(mod(n,1)+1) as parent      
  ,'F'||to_char(mod(n,10)+1) as child  
from vals;
commit;

And finally we have G1 linking to F1 to F10 – so the structure is clearly a very basic tree.
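To sanity-check the shape of the data, a quick count of links per level (the substr trick is just a convenience for this naming scheme) should come back as 1,000,000 B-to-A links shrinking by a factor of ten at each level, down to 10 G-to-F links:

select substr(parent,1,1) as lvl
      ,count(*)           as links
from   pairs
group  by substr(parent,1,1)
order  by 1;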

Gather the stats to make sure the optimizer is going to have a half chance at a decent plan.

begin 
dbms_stats.gather_table_stats(ownname=>user,tabname=>'PAIRS'); 
end; 
/

So, for the test case query, we wished to generate all the possible paths within the tree – which is in effect a non-cyclical directed graph. This is the ideal scenario for connect by / RCTE to perform its magic: recursing the data set in a single set-based statement.

explain plan for
select count(*)
from (
  with e (parent,child) 
  as (
    select parent,child
    from pairs
    union all
    select e.parent,pairs.child
    from e
    join pairs on e.child = pairs.parent
  )
  select * from e
);

Grab the plan output using

select * from table(dbms_xplan.display)

– the important part of which is shown below:

---------------------------------------------------- 
| Id  | Operation                                  |
---------------------------------------------------- 
|   0 | SELECT STATEMENT                           |
|   1 |  SORT AGGREGATE                            |
|   2 |   VIEW                                     |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST|
|   4 |     TABLE ACCESS FULL                      |
|*  5 |     HASH JOIN                              |
|   6 |      TABLE ACCESS FULL                     |
|   7 |      RECURSIVE WITH PUMP                   |
----------------------------------------------------

There is nothing shocking or unusual about the plan; it is what we would expect to see. So let's now add some parallelism to the query:

explain plan for
select count(*)
from (
  with e (parent,child) 
  as (
    select /*+ parallel(pairs,4) */
      parent,child
    from pairs
    union all
    select  /*+ parallel(pairs,4) parallel(e,4) */ 
      e.parent,pairs.child
    from e
    join pairs on e.child = pairs.parent
  )
  select * from e
);

The expected effect on the plan should be that we would see parallelism operations against both table accesses.

----------------------------------------------------
| Id  | Operation                                  |
----------------------------------------------------
|   0 | SELECT STATEMENT                           |
|   1 |  SORT AGGREGATE                            |
|   2 |   VIEW                                     |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST|
|   4 |     PX COORDINATOR                         |
|   5 |      PX SEND QC (RANDOM)                   |
|   6 |       PX BLOCK ITERATOR                    |
|   7 |        TABLE ACCESS FULL                   |
|*  8 |     HASH JOIN                              |
|   9 |      TABLE ACCESS FULL                     |
|  10 |      RECURSIVE WITH PUMP                   |
----------------------------------------------------

This now shows us the problem: you can see the PX COORDINATOR is present within the anchor branch of the RCTE, but there is no parallelism listed against the recursion. At first we thought it might be ignoring the hints for some reason, but the following test disproved that theory immediately.

explain plan for
select count(*)
from (
  with e (parent,child) 
  as (
    select /*+ parallel(pairs,4) */
      parent,child
    from pairs
    union all
    select  /*+ parallel(pairs,4) parallel(e,4) use_merge(e,pairs) */ 
      e.parent,pairs.child
    from e
    join pairs on e.child = pairs.parent
  )
  select * from e
);

The plan altered to use the merge hint as follows:

---------------------------------------------------- 
| Id  | Operation                                  |
----------------------------------------------------
|   0 | SELECT STATEMENT                           |
|   1 |  SORT AGGREGATE                            |
|   2 |   VIEW                                     |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST|
|   4 |     PX COORDINATOR                         |
|   5 |      PX SEND QC (RANDOM)                   |
|   6 |       PX BLOCK ITERATOR                    |
|   7 |        TABLE ACCESS FULL                   |
|   8 |     MERGE JOIN                             |
|   9 |      SORT JOIN                             |
|  10 |       RECURSIVE WITH PUMP                  |
|* 11 |      SORT JOIN                             |
|  12 |       TABLE ACCESS FULL                    |
----------------------------------------------------

The explain plan is of course a terrible plan – there would be no reason to use a merge join here – but the fact it appears demonstrates that the hints on the recursion branch are being read by the query engine, and that it chose to discard the parallelism ones.

Given this example, an SR was raised to find out why the performance is so bad and whether we were looking at a bug. If it was a bug, then we could hope for a fix of some kind.

The test case was accepted and reproduced inside Oracle very efficiently, and was passed to the parallel query team to determine what the problem was.

The response back?

This is the expected behavior. Oracle does not parallelize the iterations sub-query of connect-by/recursive-with clause.

That is the last thing we wanted to hear – ‘by design’. It is by design that this feature is going to be incredibly slow on larger data sets. That is not so much ‘design’ as rendering RCTEs useless in Oracle unless you have small data sets, or do not mind waiting around a long time to get answers back.

We were already close to ditching any use of the RCTE syntax; this nailed the coffin shut on that feature.
(The other reason is one we are still working up a test case for – we have witnessed problems with RCTEs contained within a view. When the view is joined to and accessed with a predicate against it, we have seen the predicate pushed into the recursion, which produces an incorrect answer because the predicate pushing cuts the recursion short. We had worked around this, but it was an annoying bug.)

Oracle stalwarts will say we were foolish to use RCTEs over Oracle's connect by syntax – except that we were not. An RCTE can express far more complex recursion than CONNECT BY can, and for the specific case we wanted to use it for, that complexity was required.

Another reason for trying to go down that route was performance, because the connect by clause is no better at parallelism:

explain plan for 
select count(*) 
from 
(  
  select /*+ parallel(pairs,4) */         
    parent,child    
  from pairs   
  start with parent = 'G1' 
  connect by parent = prior child 
);

-------------------------------------
| Id  | Operation                   | 
-------------------------------------
|   0 | SELECT STATEMENT            |
|   1 |  SORT AGGREGATE             |
|   2 |   VIEW                      |
|*  3 |    CONNECT BY WITH FILTERING|
|*  4 |     INDEX RANGE SCAN        |
|   5 |     NESTED LOOPS            |
|   6 |      CONNECT BY PUMP        |
|*  7 |      INDEX RANGE SCAN       |
-------------------------------------

The plan is no better for using a CONNECT BY – though from a pure performance perspective the connect by clause was clearly faster when we ran some comparisons.

So the verdict on Oracle and RCTE / recursive subquery factoring: excellent language feature, unscalable performance, refuses to parallelise the recursion – all but useless for those of us in the VLDB world.
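For anyone hitting the same wall, one possible fallback – a rough sketch only, not something suggested on the SR – is to unroll the recursion into a level-at-a-time loop, since each individual join can then be parallelised in the normal way. The PATHS work table name here is made up for the example:

-- seed a work table with the direct links, then extend it one level at a
-- time with ordinary (parallel) joins until no new rows appear
create table paths as
select parent, child from pairs;

begin
  loop
    insert into paths
    select /*+ parallel(p,4) parallel(pr,4) */
           p.parent, pr.child
    from   paths p
           join pairs pr on p.child = pr.parent
    where  not exists (select null
                       from   paths x
                       where  x.parent = p.parent
                       and    x.child  = pr.child);
    exit when sql%rowcount = 0;
    commit;
  end loop;
  commit;
end;
/

It is more code and more commits, but each pass is just a plain join that the optimizer is happy to run in parallel.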

Oracle : Duplicate GUID values being returned from sys_guid() when run in parallel

January 22, 2012 4 comments

A post? Yes, it has been a while, and because I am having to spend all my time on Oracle these days, it is a post relating to a problem in Oracle.

I had to construct a test case recently to try to track down a primary key failure. The primary key was a sys_guid value and the failure was coming from the insertion of new values, which did not make much sense since the odds of a collision on a GUID should be astronomically small – assuming an up to date algorithm is used. Even with those odds, primary key failures were occurring very regularly, so the immediate suspicion was that the sys_guid algorithm in Oracle is not up to date and not consistent across all platforms. It can return GUIDs that appear totally random, or GUIDs that are clearly within a sequence. It is easy enough to test any individual platform to see how it behaves:

select sys_guid() from dual
union
select sys_guid() from dual;

SYS_GUID()
--------------------------------
B71D52B1531167D9E040760ADD7E0B80
B71D52B1531267D9E040760ADD7E0B80

The 12th character in has increased by one; the rest of the GUID remains identical.

This isn't too surprising; the documentation is delightfully vague in using the term ‘most’:

SYS_GUID generates and returns a globally unique identifier (RAW value) made up of 16 bytes. On most platforms, the generated identifier consists of a host identifier, a process or thread identifier of the process or thread invoking the function, and a nonrepeating value (sequence of bytes) for that process or thread.

So ‘most’ platforms will behave like this – that’s helpful documentation, thanks for that.

So back to the problem and the test case. Whenever I come across potential Oracle bugs, my immediate suspicion is that parallelism is at play – purely from consistent experience of Oracle getting parallelism wrong within the database. I have multiple outstanding SRs for various features that fail when combined with parallelism, anything from ORA-600s to incorrect data being returned (Parallel + Pivot = missing columns, nice!).

When these GUIDs are generated as a pseudo sequence, it makes sense that adding parallelism is a recipe for disaster, since the parallel slaves would all have to communicate and co-ordinate to ensure they did not duplicate values in that sequence. After many hours whittling down the original statement, I was able to construct a repeatable test case to submit to Oracle for fixing – the shocking part is how trivial it was to demonstrate the problem on a specific AIX environment.

So let's walk through the test case. Firstly, create a numbers table:

create table n (c1 number);

..and populate it:

begin  
  for i in 1..30 loop   
    insert into n     
      select i*100000 + level from dual connect by level<=100000;  
  end loop;
  commit;
end;
/

This just populates the table with 3 million rows – 30 iterations of 100k rows. It is a bit faster to do it that way than to populate it in a single statement, as the connect by level generator gets slower as the number rises.

That is all we need for the set-up. The test code is pretty simple, but I will explain it:

declare
  e number := 0;
begin
  for i in 1..10 loop
    begin
      select count(*) into e
      from (
        select sid, count(*)
        from (
          select /*+ parallel(n,40) */
            sys_guid() as sid
          from n
        )
        group by sid
        having count(*) > 1
      ) t;
    exception
      when no_data_found then null;
      when others then raise;
    end;
    if e > 0 then
      raise_application_error(-20000
        , e||' duplicates found in iteration '||i);
    end if;
  end loop;
end;
/

The easiest way to explain this is from the inside out – the innermost query generates 3 million sys_guid values by selecting from the numbers table and asking for a sys_guid value per row, and the statement is given a parallel hint.

We then perform an outer select that groups by the SID (Sys guID) values and uses a having count(*) > 1 clause to keep only the duplicates. Under normal conditions this should return 0 rows, since every sys_guid generated should be unique. The next outer select counts how many duplicated values occurred and places that number into the variable e.

If e is ever greater than 0, we have encountered a duplicate and an error will be raised.

When run on an AIX box with SMT enabled, the error does get raised.

202148 duplicates found in iteration 1

The number of duplicates changes per run and seems to have no pattern; it can be anything from roughly 40k duplicates up to around 250k. If you take the parallel hint out of the script, it never fails – so it is clearly linked to the simultaneous creation of sys_guid values.

As yet, Oracle have not been able to reproduce this themselves, which suggests a platform specific bug, but the client's DBAs have been given the script and have seen it churn out duplicates time and time again, much to their amazement. They really should use a better algorithm – having such a predictable, sequential GUID as the default on ‘most’ platforms is less than ideal.


SQL Server Denali – Sequences

January 9, 2011 Leave a comment

Another new programmability feature added in the Denali CTP is ‘Sequences’ – a concept very familiar to those of us who already deal with Oracle, but an unusual addition for SQL Server, and one that makes me scratch my head and think: why? We already have the identity column feature available within SQL Server; it is not available within Oracle, hence the need for sequences there. When using any identity / numeric key / FK mechanism it is important that the value has no actual relation to the data it represents, other than being an arbitrary number for the row – if anything it would be bad design to rely on the identity values being in sequence, or contiguous in any way. In SQL Server they are not guaranteed to be contiguous at all – a transaction rollback, or an increment other than 1, will prevent that, for example.

Sequences are primarily for when you wish to know a number in advance of using it, or perhaps wish to use the same number for records spread across several tables (and thus relate them using that number).

Looking at the syntax you can see that the SQL Server and Oracle Syntax are very similar and share the same keywords. 

SQL Server Syntax:

CREATE SEQUENCE [schema_name . ] sequence_name
    [ AS { built_in_integer_type | user-defined_integer_type } ]
    [ START WITH <constant> ]
    [ INCREMENT BY <constant> ]
    [ { MINVALUE <constant> } | { NO MINVALUE } ]
    [ { MAXVALUE <constant> } | { NO MAXVALUE } ]
    [ CYCLE | NO CYCLE ]
    [ { CACHE [ <constant> ] } | { NO CACHE } ]
[ ; ]

Oracle Syntax:

CREATE SEQUENCE <sequence_name>
INCREMENT BY <#>
START WITH <#>
MAXVALUE <#> / NOMAXVALUE
MINVALUE <#> / NOMINVALUE
CYCLE / NOCYCLE
CACHE <#> / NOCACHE
ORDER / NOORDER

Most of the keywords are pretty self-explanatory; the one that makes me cringe the most is CYCLE. It is bad enough using a sequence number instead of an identity, but even worse when you consider that the value may not be unique. The advice is to create an additional unique index on the field to prevent an insert / update from taking a duplicate – but that seems like a bit of a ‘fudge’ that works around the real problem instead of solving it.
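To illustrate that workaround – a sketch only, the sequence and table names here are made up for the example:

-- a sequence that wraps back to 1 once it reaches 1000
create sequence orderRefSeq as int
start with 1
increment by 1
minvalue 1
maxvalue 1000
cycle;

create table OrderRefs (
    OrderRefID int not null,
    OrderDate  datetime2(7) not null
);

-- the unique index is doing the real uniqueness work, not the sequence
create unique index ux_OrderRefs_OrderRefID on OrderRefs (OrderRefID);

insert into OrderRefs (OrderRefID, OrderDate)
values (next value for orderRefSeq, sysdatetime());

Once the sequence cycles, that insert starts failing on the unique index rather than the sequence itself protecting you.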

To add to the weirdness of the construct, you can even request sequence values with an OVER clause. Using the Adventure Works database as an example, I created a sequence:

create sequence testSequence as integer
start with 1
increment by 1
minvalue 1
maxvalue 10000

And then used it within a select statement as follows:

select next value for testSequence over (order by Name asc) as id, Name
from Production.Product

The results come back:

1 Adjustable Race
2 All-Purpose Bike Stand
3 AWC Logo Cap
4 BB Ball Bearing
...

In case you were thinking that was relatively useful: when you re-run the command you are, of course, returned a different set of numbers, as the sequence does not restart – making this one of the weirdest features I have seen.

505 Adjustable Race
506 All-Purpose Bike Stand
507 AWC Logo Cap
508 BB Ball Bearing
...

If you attempt to place the ORDER BY on the outside of the statement in the following manner, SQL Server will just throw an error.

select next value for testSequence as id, Name
from Production.Product
order by Name asc
Msg 11723, Level 15, State 1, Line 1
NEXT VALUE FOR function cannot be used directly in a statement that contains 
an ORDER BY clause unless the OVER clause is specified.

And to round off the errors you can expect to see when using this: run the sequence out of values and you will get:

Msg 11728, Level 16, State 1, Line 1
The sequence object 'testSequence' has reached its minimum or maximum value. 
Restart the sequence object to allow new values to be generated.

Try creating a sequence based on a numeric or decimal with some scale, such as numeric(6,2), and you get:

Msg 11702, Level 16, State 2, Line 1
The sequence object 'testSequence' must be of data type int, bigint, smallint,
tinyint, or decimal or numeric with a scale of 0, or any user-defined data type
that is based on one of the above integer data types.

Or if you fail to get your starting value within the min and max boundaries you are setting:

Msg 11703, Level 16, State 1, Line 1
The start value for sequence object 'testSequence' must be between the minimum
and maximum value of the sequence object.

Overall, sequences remain a bit of a niche feature for me in SQL Server; I just cannot see any normal everyday activity needing them, although they would make porting applications between Oracle and SQL Server a bit easier since both will now support them.

In that kind of situation, though, I would still prefer the GUID mechanisms we have available. They have the same benefit of letting you know a record ID in advance of using it, as well as being storable in either database, and they have the added advantage that they can be created while offline from the database – something a sequence cannot do.

Which User Made That Change?

January 8, 2011 4 comments

If you have spent any time tinkering about in the transaction log, you will have already come across a bit of a problem when trying to decide what was done and by whom – the ‘what’ part I have decoded in a few posts, but the ‘whom’ part is a lot harder. As far as I can tell the log only contains the SPID of the user who opened the transaction, and gives no indication as to who that user really was.

From an investigative perspective this is a bit painful: I can see a row was deleted, but to find out who or what did the actual deletion I would have to start examining either the SQL Server logs or the Windows Server logs. The default behaviour of SQL Server security is to log only failed login attempts, so the successful ones will not show up by default – to get those appearing you need to change your SQL Server security settings. You can access these logs from Management Studio using either the xp_readerrorlog or sp_readerrorlog procedures, although the nature of the log and its textual values makes it difficult to combine in a set based manner – I can read the values by eye, but machine reading them for any purpose is a bit of a pain. There is also the issue that those logs get cycled, and the old logs could well be completely offline.

So I would prefer an easier solution: keeping a record of the logins within the database, regardless of the SQL Server security settings, in a form that allows a more set based approach. To start with, we will need a table to store the information available to us during the logon process:

create table master.dbo.spidArchive (
 LoginTime  datetime2(7)
 ,SPID   integer
 ,ServerName  nvarchar(100)
 ,LoginName  nvarchar(100)
 ,LoginType  nvarchar(100)
 ,LoginSID  nvarchar(100)
 ,ClientHost nvarchar(100)
 ,IsPooled  tinyint
)

The spidArchive table here is created in the master database so that it can cover connections to any of the databases. You can see we have access to a lot of useful information – not just who executed the command, but the machine they logged in from. The next step is to get SQL Server to add a row to the table every time a login occurs; from SQL Server 2005 onwards we have had DDL triggers as well as DML triggers, and the ability to intercept a number of non-DML events.

create trigger spidLogin on all server
after logon
as
 declare @eventdata xml;
 set @eventdata = EVENTDATA();

 INSERT INTO master.dbo.spidArchive
 (
  LoginTime
  ,SPID   
  ,ServerName  
  ,LoginName  
  ,LoginType  
  ,LoginSID  
  ,ClientHost 
  ,IsPooled
 )
 VALUES 
 (
  @eventdata.value('(/EVENT_INSTANCE/PostTime)[1]','datetime2(7)')
  ,@eventdata.value('(/EVENT_INSTANCE/SPID)[1]','nvarchar(100)')
  ,@eventdata.value('(/EVENT_INSTANCE/ServerName)[1]','nvarchar(100)')
  ,@eventdata.value('(/EVENT_INSTANCE/LoginName)[1]','nvarchar(100)')
  ,@eventdata.value('(/EVENT_INSTANCE/LoginType)[1]','nvarchar(100)')
  ,@eventdata.value('(/EVENT_INSTANCE/SID)[1]','nvarchar(100)')
  ,@eventdata.value('(/EVENT_INSTANCE/ClientHost)[1]','nvarchar(100)')
  ,@eventdata.value('(/EVENT_INSTANCE/IsPooled)[1]','tinyint')
 )

During the login process, the EVENTDATA() function returns a fixed format XML fragment, from which we can extract the values we seek and simply insert them into our spidArchive table. Now that we have a log of all connections being established to the server, we can start using it to translate from a SPID to a user, even when the user is no longer connected – as long as we know the SPID and the time, we just need to look for the closest entry in the past for that SPID, and that indicates which user was logged on at the time. This function should go in the master database again.

CREATE FUNCTION dbo.ConvertSpidToName(@SPID integer, @Date datetime2(7)) RETURNS nvarchar(100) AS
BEGIN
 DECLARE @name nvarchar(100)
 SELECT TOP(1) @name = LoginName
 FROM master.dbo.spidArchive
 WHERE SPID = @SPID AND LoginTime <= @Date
 ORDER BY LoginTime DESC;
 RETURN @name;
END

This function just performs the logic stated above and converts the SPID and datetime into the login name of the user. Once this infrastructure is in place we can use it directly in a call to ::fn_dblog(null,null) to translate the SPID column:

select master.dbo.ConvertSpidToName(log.SPID, log.[Begin Time]) as UserName, log.* from ::fn_dblog(null,null) log

What you will notice is that for the majority of log lines there is no user name displayed – this is because the SPID is only recorded against the LOP_BEGIN_XACT entry, the beginning of the transaction. That does not really present a problem; from previous experiments we know all the entries for an individual transaction are given the same Transaction ID, which we can use to group them together. It becomes pretty trivial to join the log back to itself, connect any transaction entry to its LOP_BEGIN_XACT record, and produce the name on every row possible.

select master.dbo.ConvertSpidToName(log2.SPID, log2.[Begin Time]) as UserName, log.*
from ::fn_dblog(null,null) log
left join ::fn_dblog(null,null) log2 on log.[Transaction ID] = log2.[Transaction ID] and log2.Operation = 'LOP_BEGIN_XACT'

So overall it is not too hard to get the log entries attributed to the accounts that generated them.

A couple of final notes / caveats:

  • If your application uses a trusted sub-system approach this technique will of course not work, since all the users log into the application through an internal mechanism (such as a users table) and the application service connects using its own credentials – always a good thing, since the users then have no direct access to the database. In that kind of situation this is of no value; every connection will show up as the same user / source.
  • Within my code I chose datetime2(7), to be as accurate as possible on the connection timings. You could drop to plain datetime for SQL Server 2005, but with only 1/300th of a second accuracy there is a chance on a very busy server that you could see two entries for a single SPID at the same datetime – which would pose a bit of a problem.
  • The spidArchive table cannot be allowed to grow unconstrained – I have not included anything here for clearing down the table, but it is not difficult to conceive of it being archived off, or cleaned up weekly via a SQL Agent job; a minimal sketch of the latter is shown below.
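As an example of that last point, something along these lines scheduled in an Agent job would keep the table in check – the 90 day retention is just an arbitrary figure for the sketch:

-- remove spidArchive rows older than the retention window; run weekly from a SQL Agent job
delete from master.dbo.spidArchive
where LoginTime < dateadd(day, -90, sysdatetime());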


SQL Server Denali – Paging

December 31, 2010 1 comment

The introduction of paging within SQL Server Denali will have made a significant number of developers happy, all of whom will have previously created home-baked solutions to the same problem. All the solutions share the same underlying issue – paging is, by its nature, inefficient. Most solutions use the row number analytic function and then sub-select data from that. For a large dataset that presents a problem: the data has to be fully scanned, sorted and allocated row numbers. A suitable index can eliminate the sort operator, but you still end up scanning the entire index to allocate the row numbers.

In Denali, support has been added to the ORDER BY clause for a starting offset and syntax to denote how many rows we wish to return. The syntax can be examined on MSDN (http://msdn.microsoft.com/en-us/library/ms188385%28v=SQL.110%29.aspx#Offset) and in brief is:

ORDER BY order_by_expression
    [ COLLATE collation_name ]
    [ ASC | DESC ]
    [ ,...n ]
[ <offset_fetch> ]

<offset_fetch> ::=
{
    OFFSET { integer_constant | offset_row_count_expression } { ROW | ROWS }
    [
      FETCH { FIRST | NEXT } {integer_constant | fetch_row_count_expression }
      { ROW | ROWS } ONLY
    ]
}

Seeing this new syntax made me want to try it out and see how the query plans are affected. I am using the trusty Adventure Works as usual – a version for Denali has been put on CodePlex, so one quick download later and I was ready to test the new syntax. (Adventure Works download: http://msftdbprodsamples.codeplex.com/releases/view/55330 )

For my tests, I used the Production.Product table and wished to page the products based on their name. There is a non-clustered index on the Name field of the product table as well as a clustered index on ProductID, so what would the query plan give?

select * from Production.Product order by name asc 
offset 10 rows fetch first 10 rows only

And the query plan is not very surprising

 

So even with a new syntax the underlying problem remains – the nature of paging is that you are scanning the data. With statistics io turned on, the stats come back as Table ‘Product’. Scan count 1, logical reads 15 – not particularly exciting, and what we would expect given the table is contained within 15 pages. It was because of the stats, though, that I noticed an anomaly: in one of the tests I had dropped to returning only a single row from the table, as follows:

select * from Production.Product order by name asc
offset 10 rows fetch first 1 rows only

What I noticed was that the statistics changed to Table ‘Product’. Scan count 1, logical reads 24 – the entire table is contained within 15 pages, so how could it jump to reading 24?

 

A quick check of the query plan showed what had changed: the engine decided it was cheaper to use the Name index, which for the purposes of the ordering was narrower and therefore more efficient, and then join back to the main table via the clustered key. Understandable, although the additional pages read make it unlikely to be more efficient in practice – I doubt you would see much real world difference. An oddity, but nothing really significant in it.

This triggered a more interesting thought: what happens if we reduce the selected fields so that the index becomes a covering index? Is SQL going to get smart when making a selection? So far we have only seen full scans occurring.

select Name, ProductID from Production.Product order by name asc
offset 20 rows fetch first 10 rows only

The query is now covered by the Name index, since the non-clustered index includes the clustered key (ProductID) – and this changes the query plan again, although it is a pretty subtle change to notice.

The expected index scan appears, but if you look closely at the tooltip for the scan, the number of rows being read is not the total number of rows in the index, but the offset plus the number of rows requested. This was also reflected within the statistics, showing only 2 logical reads – the index uses 6 pages in total. As I changed the number of rows to offset / return, the actual number of rows read changed accordingly. So with a covering index in place, the query engine gets a bit more efficient and does a forward scan of the index only until it has passed a sufficient number of rows. This sounds good – we have avoided scanning the whole index to provide the paged results, in a slightly more efficient manner.

Except, those with a quick mind will realise that the performance degrades as you go further and further down the list: requesting the 490th-500th products will result in 500 rows being checked, not 30. By putting in a covering index we have sacrificed consistency of query times to gain some potential performance – the full scan solutions will, broadly speaking, take the same time regardless of which 10 rows you request, since they have to scan, sort, allocate numbers and then sub-select.
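To see that in action, push the offset toward the end of the table using the same covering query as above – the actual rows read climbs with the offset, exactly as described:

select Name, ProductID from Production.Product order by name asc
offset 490 rows fetch first 10 rows only
-- the index scan now reads 500 rows (offset + fetch), not 30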

As features go, I like the paging – it removes the need for all the different homegrown solutions out there – but its performance remains a problem; this is no silver bullet for the paging performance problems people have.

SQL Internals Viewer

November 27, 2010 1 comment

As Captain Oates once said, ‘I am just going outside and may be some time’ – it feels like quite a while since I had the time to sit down and write something.

I had a bit of time to take a look at the SQL Internals Viewer (http://internalsviewer.codeplex.com/). It has been out for some time, but I had never downloaded it to play around with and see how useful it is as a way of learning more about the internals.

The Page Viewer is excellent, the breakdown of a page into the component parts for a row and the display of the page data is a superb aid to anyone wanting to understand how the data is stored on a page. Whilst you can use DBCC PAGE to get at all this information, presenting it in such a readable form will satisfy most people easily.

The page allocation map is a nice little addition, but really is just an extension of showing you what pages belong to which object etc.

The transaction log viewer, though, was the part I was really looking forward to seeing, primarily to help me decode more transactions, but it has been a bit disappointing – the level of detail shown is very limited, and it provides no real benefit over just looking at the log directly, or using the transaction log tricks I have previously posted.

As you can see from the screenshot, the level of detail is pretty light for a simple transaction – no actual breakdown of the log record itself is provided, which is a shame. Whilst it does give you some basic information and will help some people, I think that if you are at the stage where you are taking an interest in the transaction log, you are already beyond this point.

So as an educational / learning aid it is pretty good on the page internals side – anyone wanting an easier way to visualise that for learning should still grab it. I would love to see more on the log side, but at present the project appears to be in hibernation, with no changes in some considerable time, so I suspect we will not see any enhancements now.

Interval Partitioning in SQL Server 2008

September 12, 2010 2 comments

Another of those features that Oracle has but SQL Server does not natively have is interval partitioning, where new partitions are generated for you automatically based on an interval. Interval partitioning is, yet again, a form of dynamic partitioning, so the thought was: could this be achieved within SQL Server?

My initial thought was to use an INSTEAD OF trigger, which would intercept the incoming values and take action on them, if appropriate, to extend the partition function. An initial look into the documentation suggests it will not be trivial – the BoL states:

"Additionally, the following Transact-SQL statements are not allowed inside the
body of a DML trigger when it is used against the table or view that is the
target of the triggering action. CREATE INDEX, ALTER INDEX, DROP INDEX, DBCC
REINDEX, ALTER PARTITION FUNCTION, DROP TABLE..."

The issue there is the ALTER PARTITION FUNCTION – any response to an incoming piece of data that is not covered by the existing partitions will need to alter the partition information. Of course there are still ways around such restrictions, but when experimenting it seems the BoL is not correct to list the ‘ALTER PARTITION FUNCTION’ command in the restrictions.

There are a few caveats to the following snippets of code: I am not attempting to deal with the rolling window effect, or the more complex storage patterns of a partitioned table, which I have written about in other posts. This is purely designed to demonstrate that it can be achieved, not to provide a total solution to the problem – the partitions created on the fly will all go to the PRIMARY filegroup, for example.

So start with a very basic partition function and scheme:

CREATE PARTITION FUNCTION pfTest (datetime)
AS RANGE LEFT FOR VALUES ('20100104' , '20100111' , '20100118', '20100125')
CREATE PARTITION SCHEME psTest AS
PARTITION pfTest ALL TO ([PRIMARY])

And generate a table on the partition scheme:

CREATE TABLE IntervalTest (
MyID int identity(1,1) not null,   
MyField Varchar(200),
MyDate datetime
) ON psTest(MyDate)

The next step is an ‘Instead of’ trigger, which has to intercept the incoming data from the inserted table, and extend the partition function if required:

CREATE TRIGGER tr_IntervalTest ON IntervalTest INSTEAD OF INSERT
AS
 BEGIN     

  -- get the current maximum partition value
  DECLARE @max_part_dt datetime;   DECLARE @max_inserted_dt datetime;
  DECLARE @weeks_to_add int;

  SET @max_inserted_dt = (SELECT max(MyDate) FROM inserted);

  SET @max_part_dt = (
    SELECT max(convert(datetime, value))
    FROM sys.partition_functions f
    INNER JOIN sys.partition_range_values rv ON f.function_id = rv.function_id
    WHERE f.name = 'pfTest');

  IF (@max_inserted_dt > dateadd(D,7,@max_part_dt))
  BEGIN
   -- need to potentially add multiple partition splits, it depends on
   -- how many weeks in advance the new data is.
   -- get a whole number of the weeks to add to ensure that we cover 
   -- the new data
   SET @weeks_to_add = ceiling(datediff(D,@max_part_dt, @max_inserted_dt) / 7.0)

   -- loop around splitting the partition function with the new weekly values
   -- that need to be covered
   WHILE @weeks_to_add > 0
   BEGIN
    -- increase the maximum partition date by 7 days and split the function
    SET @max_part_dt = dateadd(D,7,@max_part_dt);

    ALTER PARTITION SCHEME psTest NEXT USED [Primary];

    ALTER PARTITION FUNCTION pfTest() SPLIT RANGE (@max_part_dt);
    SET @weeks_to_add = @weeks_to_add - 1
   END;
  END;

  -- finally insert the values
  INSERT INTO IntervalTest (MyField, MyDate)
  SELECT MyField, MyDate
  FROM inserted;  
END     

The code is pretty self-explanatory, but I would point out that it only covers an insert, not an update – and this is not production code but an experiment to see whether it could be done (contrary to the BoL). To ‘productionize’ it would require significant work on exception handling, performance tuning, handling multiple filegroup partitioning, and so on – but all of it achievable.

A little test, inserting a couple of values set ahead of the existing partition boundaries:

insert into IntervalTest (MyField, MyDate)
select 'b', '20100505'
union
select 'c', '20100606'

And a check of the partition function values now shows 23 values in the partition function, instead of the original 4, as follows:

2010-01-04 00:00:00.000, 2010-01-11 00:00:00.000, 2010-01-18 00:00:00.000,
2010-01-25 00:00:00.000, 2010-02-01 00:00:00.000, 2010-02-08 00:00:00.000,
2010-02-15 00:00:00.000, 2010-02-22 00:00:00.000, 2010-03-01 00:00:00.000,
2010-03-08 00:00:00.000, 2010-03-15 00:00:00.000, 2010-03-22 00:00:00.000,
2010-03-29 00:00:00.000, 2010-04-05 00:00:00.000, 2010-04-12 00:00:00.000,
2010-04-19 00:00:00.000, 2010-04-26 00:00:00.000, 2010-05-03 00:00:00.000,
2010-05-10 00:00:00.000, 2010-05-17 00:00:00.000, 2010-05-24 00:00:00.000,
2010-05-31 00:00:00.000, 2010-06-07 00:00:00.000

It has clearly created the partitions needed to cover the data being inserted and then performed the insertion.
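A quick way to confirm where the rows actually landed – not part of the original test, just a sanity check against the same IntervalTest table – is to look at sys.partitions:

-- one row per partition of the heap / clustered index, with its row count
select p.partition_number, p.rows
from sys.partitions p
where p.object_id = object_id('IntervalTest')
and   p.index_id in (0,1)
order by p.partition_number;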

So it can be done, but the constant cost of intercepting every insert and update to provide this kind of dynamic partitioning is really not ideal. Whether it could be made sufficiently efficient to work at the sort of scale that partitioning tends to be used at is debatable – I have a feeling it would struggle. I would need to be lent a suitably sized server and SAN to test that and see whether it could be made efficient enough.

Blank Transactions

September 5, 2010 Leave a comment

Busy times, with a new addition to the household – sleep is clearly an optional parameter these days – but on to one of those oddities you might see in the transaction log. On many occasions you will see transactions in the log that have no operations: the individual transaction entry just has a LOP_BEGIN_XACT followed by a LOP_COMMIT_XACT, with no operations recorded in between.

So what causes these?
The immediate thought is a straight:

Begin Transaction
Commit Transaction

If you try that and inspect the log, you will notice that it has not added this mysterious zero-operation transaction. So that is not the cause.

How about a rolled back transaction? Well, you should already know that this would not be the answer, since the log reserves space during a transaction to record the undo operations needed by a rollback. To show that in more detail, given the following simple snippet of SQL:

begin transaction
insert into test (name) values ('a')
rollback transaction

The log then shows the transaction beginning, performing the insert, and then rolling it back by deleting the record and recording it as an aborted transaction (newest entry first):

LOP_ABORT_XACT
LOP_DELETE_ROWS
LOP_INSERT_ROWS
LOP_BEGIN_XACT

So neither of the first obvious choices is the cause. The reason seems a bit bizarre, but centres around whether a change was attempted and then ignored because it was unnecessary. Setting up a test table and inserting a single row with a value of ‘a’, run the following statement:

begin transaction
update test set name = 'a'
rollback transaction

Now when you inspect the log there is a blank transaction: it recorded the start and end of the transaction, but no operations are shown. The same is true if the transaction is committed rather than rolled back.

If the code is altered slightly so that no modification could possibly occur, though, the same does not hold true:

begin transaction
update test set name = 'a' where 1 = 0
commit transaction

Clearly this code is designed to make no modification, so it is not surprising that no entry occurs in the transaction log. To make the test a bit fairer, let's write the code in a way that it might make a modification, but does not:

begin transaction
update test set name = 'a' where name = 'b'
commit transaction

Still no entry in the transaction log.

So the distinction in the logging of these zero-op transactions is whether or not there was matching data to be altered: if a record was found but the alteration was unnecessary, a zero-op transaction appears in the log. It does make you wonder – why?

It also means that, from an auditing perspective, the attempt to modify the data was not logged – not because it was unsuccessful, but because the data was being altered to the values it already had.
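If you want to go hunting for these in your own log, a rough way to spot them – a sketch against the same fn_dblog output used above – is to look for transactions whose only records are the begin and commit markers:

-- transactions made up solely of a LOP_BEGIN_XACT / LOP_COMMIT_XACT pair
select [Transaction ID], count(*) as log_records
from ::fn_dblog(null, null)
group by [Transaction ID]
having count(*) = 2
   and min(Operation) = 'LOP_BEGIN_XACT'
   and max(Operation) = 'LOP_COMMIT_XACT';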