Archive for January 22, 2012

Oracle : Duplicate GUID values being returned from sys_guid() when run in parallel

January 22, 2012 4 comments

A post? yes, it’s been a while and because I am having to spend all my time on Oracle these days – it’s a post relating to a problem in Oracle.

I had to construct a test case recently to try track down a primary key failure. The primary key was a sys_guid value and the failure was coming from the insertion of new values, that didn’t make much sense since the odds of a collision on a GUID should be astronomically high – assuming they used an up to date algorithm. Even with those astronomical odds, primary key failures were occurring very regularly, so the immediate suspicion is that the sys_guid algorithm in Oracle is not up to date and not-consistent across all platforms. It can return GUIDs that appear totally random, or GUIDs that are clearly within a sequence. It’s easy enough to test any individual platform to see how it behaves:

select sys_guid from dual
select sys_guid() from dual;

12th character in has increased by one, the rest of the guid remains identical.

This isn’t too surprising, the documentation is delightfully vague in using the term ‘most’:

SYS_GUID generates and returns a globally unique identifier (RAW value) made up of 16 bytes. On most platforms, the generated identifier consists of a host identifier, a process or thread identifier of the process or thread invoking the function, and a nonrepeating value (sequence of bytes) for that process or thread.

So ‘most’ platforms will behave like this – that’s helpful documentation, thanks for that.

So back to the problem and test case – whenever I come across potential Oracle bugs, I have an immediate suspicion that parallelism is at play – this is just from the consistent experience of Oracle getting parallelism wrong within the database – I have multiple outstanding SR’s for various features when combined with parallelism causing failures – anything from ORA-600’s to incorrect data being returned. (Parallel + Pivot = missing columns, nice!).

When you have these GUIDs being generated in a pseudo sequence, it makes sense that adding parallelism is a recipe for disaster, since the parallel slaves would all have to communicate and co-ordinate to ensure that they did not duplicate values in that sequence. After many hours whittling down the original statement, I was able to construct a repeatable test case to finally submit to Oracle for fixing – the shocking part is how trivial it was to demonstrate the problem on a specific AIX environment.

So let’s walk through the test case, firstly, create a numbers table:

create table n (c1 number);

..and populate it:

  for i in 1..30 loop   
    insert into n     
      select i*100000 + level from dual connect by level<=100000;  
  end loop;

This just populates the table with 3 million rows, 30 iterations of 100k rows, it’s a bit faster to do it that way than populate it in a single statement – the connect by level goes slower as the number rises.

That is all we need for the set up, the test code is pretty simple but I will explain it:

  e number := 0;
  for i in 1..10 loop
      select count(*) into e      
      FROM (        
        select sid, count(*)        
        from (
          select /*+ parallel(n,40) */                
            sys_guid() as sid              
          from n              
        group by sid        
        having count(*) > 1      
      ) t;    
      when no_data_found then null;      
      when others then raise;    
    if e>0 then raise_application_error(-20000
        ,e||' duplicates found in iteration '||i); end if;
  end loop;
end; /

The easiest way to explain this is from the inside out – the inner most query generates 3 million sys_guid values by selecting from the numbers table and asking for a sys_guid value per row – the statement is given a parallel hint.

We then perform an outer select that group’s by the SID (Sys guID) values, and uses a having count(*) > 1 clause to only show duplicates. Under normal conditions this of course should return 0 rows at that point, since every sys_guid generated should be unique. The next outer select count’s up how many instances of duplicates occurred and finally places this into a variable e.

If e is ever greater than 0, we have encountered a duplicate and an error will be raised.

When run on an AIX box with SMT enabled, the error does get raised.

202148 duplicates found in iteration 1

The number of duplicates changes per run and seems to have no pattern; it can be anything from about ~40k duplicates up to ~250k duplicates. If you take the parallel hint out of the script, it never fails. So it is clearly linked to the simultanesous creation of sys_guid values.

As yet, Oracle have not been able to reproduce this themselves which is indicating that this is a platform specific bug, but the client’s DBA’s have been provided the script and have seen it churn out duplicates time and time again, much to their amazement. They really should use a better algorithm, having such a predictable sequentially guid as their default guid for ‘most’ platforms is less than ideal.

Categories: Oracle Tags: ,