Tony’s Oracle Tips

July 2, 2017

CSV parsing and tokenizing strings

Filed under: Uncategorized — tonyhasler @ 2:28 pm

Most of you will be familiar with the "Comma Separated Values" data format. It is used in spreadsheets and other places to store ordered lists of character data. The delimiter is typically a comma, and if a list item itself contains a comma the item as a whole needs to be quoted, typically with double quotes.

So

a,"b,c",d

is actually just three items

a
"b,c"
d

Also, if we wish to include a double-quote in our item it must be repeated to ensure that it is not interpreted as a sentinel for the list item. So, for example:

item 1," this has not ""one"", but ""two"" quoted items",item 3

has just three items.

Frequently we need our Oracle SQL or PL/SQL code to parse CSV strings and convert them into ordered lists of items. External tables are often touted as the solution to this problem but they come with the not insignificant complication of needing the data stored on the database server in flat files. Can we not just parse the CSV data in our code?

Parsing CSV data has been the topic of many a post, such as this Ask Tom article, but to my mind there are two key parts to this puzzle:

  1. Find a regular expression that will identify the items.
  2. Find a way to separate the string into items.

I struggled with the first part of this for a while but then came across the answer buried in XSLT code here. To demonstrate how this works, let us change the delimiters in a CSV text string to pipes:

WITH q1 AS (SELECT 'a,"b,""c""",d' comma_string FROM dual)
SELECT rtrim(
          regexp_replace(comma_string || ',', '(("[^"]*")+|[^,]*),', '\1|'),
          '|')
  FROM q1;

gives us

a|"b,""c"""|d

Now that we have that bit sorted out, how do we tokenize the data? The Ask Tom article above is one of many to discuss this topic, but my benchmarks suggest that APEX_UTIL.STRING_TO_TABLE offers the best prospects for performance. If you want to pass your items to PL/SQL then you are home, but if you want the items returned to SQL then a little more work is needed.
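For reference, this is all APEX_UTIL.STRING_TO_TABLE does on its own (a minimal PL/SQL sketch, assuming APEX is installed and DBMS_OUTPUT is enabled):

DECLARE
   v_items   apex_application_global.vc_arr2;
BEGIN
   -- Split a plain string on commas; prints a, b, c on separate lines
   v_items := apex_util.string_to_table ('a,b,c', ',');

   FOR i IN 1 .. v_items.COUNT LOOP
      DBMS_OUTPUT.put_line (v_items (i));
   END LOOP;
END;
/

Note that this simple split takes no account of quoted items, which is why the regular expression above is needed first.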

Take a look at this PL/SQL function and its associated types.

CREATE OR REPLACE TYPE col_type AS OBJECT (
col_number INTEGER,
col_value VARCHAR2(32767)
);
/

CREATE OR REPLACE TYPE col_type_t AS
TABLE OF col_type;
/

CREATE OR REPLACE FUNCTION tokenize (
    p_csv_row   VARCHAR2
) RETURN col_type_t
    PIPELINED
IS
    str_table   apex_application_global.vc_arr2;
    c_delim     CONSTANT CHAR(1) := chr(1); -- An unprintable character hopefully not in the string
BEGIN
    str_table := apex_util.string_to_table(
        regexp_replace(
            p_csv_row || ',',
            '(("[^"]*")+|[^,]*),',
            '\1' || c_delim
        ),
        c_delim
    );
--
-- We get an extra item because
-- we added a delimiter at the end
--
    FOR i IN 1..str_table.count - 1 LOOP
        PIPE ROW (col_type(i, rtrim(str_table(i), c_delim)));
    END LOOP;

    RETURN;
END tokenize;
/
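Before dissecting the code, here is a quick sanity check from SQL (a sketch; the three items from the earlier example should come back as three rows):

SELECT * FROM TABLE (tokenize('a,"b,""c""",d'));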

The code replaces the comma separators (and only those commas that are separators…not the ones inside quotes) with an unprintable character and then uses APEX_UTIL.STRING_TO_TABLE to tokenize the strings before returning the results to the caller. Let me create some test data for benchmarking before proceeding:

CREATE TABLE csv_tests
AS SELECT
ROWNUM row_number,
rpad('a,"b,""c,"",d",e',1000,'x') csv_row
FROM dual CONNECT BY level <= 10000;

We have 10,000 rows, each approximately 1K in length with three items. Here is how you might use the pipelined function defined above:

WITH r AS (
    SELECT *
      FROM csv_tests
     ORDER BY row_number
)
SELECT r.row_number,
       c.col_number,
       c.col_value
  FROM r,
       TABLE ( tokenize(r.csv_row) ) c;

Here are the first six of the 30,000 result rows (values truncated):


ROW_NUMBER COL_NUMBER COL_VALUE
          1         1 a
          1         2 "b,""c,"",d"
          1         3 exxxxxxxxxxxx
          2         1 a
          2         2 "b,""c,"",d"
          2         3 exxxxxxxxxxxx

Notice that table expansion automatically preserves the order of the items within each row, so I just ordered the rows in advance.

Using a benchmark that excludes network time (such as CREATE TABLE AS SELECT) the query runs in under 5 seconds on my cloud database.
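For the record, the benchmark was of this form (a sketch; the table name csv_benchmark is illustrative):

CREATE TABLE csv_benchmark AS
WITH r AS (
    SELECT *
      FROM csv_tests
     ORDER BY row_number
)
SELECT r.row_number,
       c.col_number,
       c.col_value
  FROM r,
       TABLE ( tokenize(r.csv_row) ) c;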

March 3, 2017

Object oriented logging

Filed under: Uncategorized — tonyhasler @ 11:34 am

It is a very common requirement. You need to be able to log the start and end of procedure and function calls. You also need to log error and exception events and sometimes some informational messages as well. This is typically done by writing a simple logging package that uses autonomous transactions so that any transaction rollback does not roll back the diagnostic logging information as well. Apart from logging text (and usually other things as well) the package typically logs the name of the package and procedure performing the logging. Here is a simplified example that might be implemented at a mythical AJAX company.
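The log table is assumed to look something like this (a minimal sketch, with columns inferred from the INSERT statement that follows):

CREATE TABLE ajax_log_table
(
   log_timestamp    TIMESTAMP
  ,package_name     VARCHAR2(30)
  ,procedure_name   VARCHAR2(30)
  ,message          VARCHAR2(4000)
);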

create or replace PACKAGE ajax_log_pkg IS
    PROCEDURE log_message (
        p_package_name     VARCHAR2,
        p_procedure_name   VARCHAR2,
        p_message          VARCHAR2
    );
END ajax_log_pkg;
/

create or replace PACKAGE BODY ajax_log_pkg IS
    PROCEDURE log_message (
        p_package_name     VARCHAR2,
        p_procedure_name   VARCHAR2,
        p_message          VARCHAR2 
    ) IS
        PRAGMA autonomous_transaction;
    BEGIN
        INSERT INTO ajax_log_table (
            log_timestamp,
            package_name,
            procedure_name,
            message
        ) VALUES (
            systimestamp,
            p_package_name,
            p_procedure_name,
            p_message
        );

        COMMIT;
    END log_message;

END ajax_log_pkg;
/

The package has a prefix of the company name and a suffix of “_pkg”, a curiously common convention. Here is how such a package might be used (once again simplified).

create or replace PACKAGE ajax_log_demo_pkg AS
    PROCEDURE inner_proc;
    PROCEDURE outer_proc;
END ajax_log_demo_pkg;
/

create or replace PACKAGE BODY ajax_log_demo_pkg AS
    g_package_name     CONSTANT VARCHAR2(30) := 'log_demo_pkg';

    PROCEDURE inner_proc IS
        v_procedure_name   CONSTANT VARCHAR2(50) := 'inner_proc';
    BEGIN
        ajax_log_pkg.log_message(g_package_name, v_procedure_name, 'START');
        -- Do something
        ajax_log_pkg.log_message(g_package_name, v_procedure_name, 'END');
    END inner_proc;

    PROCEDURE outer_proc IS
        v_procedure_name   CONSTANT VARCHAR2(50) := 'outer_proc';
    BEGIN
        ajax_log_pkg.log_message(g_package_name, v_procedure_name, 'START');
        ajax_log_pkg.log_message(g_package_name, v_procedure_name, 'Before call to inner_proc');
        inner_proc ();
        ajax_log_pkg.log_message(g_package_name, v_procedure_name, 'After call to inner_proc');
        ajax_log_pkg.log_message(g_package_name, v_procedure_name, 'END');
    END outer_proc;

END ajax_log_demo_pkg;
/

We can see that there is a lot of wasted typing here. The calling package name and calling procedure name are specified repeatedly on each call. Still, cut-and-paste can limit the productivity cost. What else is there to do? Presumably nothing: every logging implementation that I have seen has looked very similar to this.

This is roughly what the entries in AJAX_LOG_TABLE might look like after a call to ajax_log_demo_pkg.outer_proc:

03-MAR-17 10.05.27 log_demo_pkg outer_proc START
03-MAR-17 10.05.27 log_demo_pkg outer_proc Before call to inner_proc
03-MAR-17 10.05.27 log_demo_pkg inner_proc START
03-MAR-17 10.05.27 log_demo_pkg inner_proc END
03-MAR-17 10.05.27 log_demo_pkg outer_proc After call to inner_proc
03-MAR-17 10.05.27 log_demo_pkg outer_proc END

The other day a curious thought sped briefly past my mind as I typed CTRL/V. “This would be a lot easier in Java”, I thought. Then I paused and started a short conversation with myself:

“Tony, why do you think ANYTHING in Java would be easier than PL/SQL”?
“Well, Tony, in Java you have classes not packages. You could create an instance of a logging object and avoid specifying the calling package and procedure names on each call”
“Good point, Tony…..”

At that point I sat straight upright in my chair and opened my eyes wide.

“Tony, Oracle has that capability! It has been there since 8i; it is just that we have all forgotten about it!”

It took just moments to implement an object version of AJAX_LOG_PKG.

create or replace TYPE ajax_log_typ AS OBJECT (
    o_package_name     VARCHAR2(30),
    o_procedure_name   VARCHAR2(30),

    MEMBER PROCEDURE log_message (
        p_message   VARCHAR2
    )
) ;
/

create or replace TYPE BODY ajax_log_typ AS
    MEMBER PROCEDURE log_message (
        p_message   VARCHAR2 
    ) IS
        PRAGMA autonomous_transaction;
    BEGIN
        INSERT INTO ajax_log_table (
            log_timestamp,
            package_name,
            procedure_name,
            message
        ) VALUES (
            systimestamp,
            o_package_name,
            o_procedure_name,
            p_message
        );

        COMMIT;
    END log_message;

END;
/

The object specification replaces the package specification and the object body replaces the package body. I then added two members to the class for the calling package name and the calling procedure name. As a consequence, we can eliminate these two parameters from the member procedure or procedures. Look at how the object-oriented logging implementation is used:

create or replace PACKAGE ajax_log_demo_pkg2 AS
    PROCEDURE inner_proc;
    PROCEDURE outer_proc;
END ajax_log_demo_pkg2;
/

create or replace PACKAGE BODY ajax_log_demo_pkg2 AS
    g_package_name     CONSTANT VARCHAR2(30) := 'log_demo_pkg';

    PROCEDURE inner_proc IS
        o_log   ajax_log_typ := ajax_log_typ(g_package_name, 'inner_proc');
    BEGIN
        o_log.log_message('START');
        -- Do something
        o_log.log_message('END');
    END inner_proc;

    PROCEDURE outer_proc IS
        o_log   ajax_log_typ := ajax_log_typ(g_package_name, 'outer_proc');
    BEGIN
        o_log.log_message('START');
        o_log.log_message('Before call to inner_proc');
        inner_proc ();
        o_log.log_message('After call to inner_proc');
        o_log.log_message('END');
    END outer_proc;

END ajax_log_demo_pkg2;
/

Each procedure creates a new instance of AJAX_LOG_TYP object and specifies the package and procedure name. Subsequent calls just need to reference the local variable and the message to log; the rest of the context is retrieved from the object instance.
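A quick smoke test of the new version (a sketch; the timestamps in your output will obviously vary):

BEGIN
   ajax_log_demo_pkg2.outer_proc;
END;
/

SELECT package_name, procedure_name, message
  FROM ajax_log_table
 ORDER BY log_timestamp;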
Hardly rocket science and yet I am quite sure that I am very far from alone in missing this neat trick for more years than I care to admit.

February 9, 2016

Parallel partition scans

Filed under: Uncategorized — tonyhasler @ 9:51 pm

When a table is accessed by multiple members of a parallel query server set, the execution plan may show the use of block range granules (PX BLOCK ITERATOR) or partition granules (PX PARTITION [RANGE|LIST|HASH] ITERATOR or PX PARTITION [RANGE|LIST|HASH] ALL).

The basic ideas surrounding these concepts are discussed in numerous blogs and books, including my own, but discussion of partition range granules is normally restricted to either partition-wise joins or the creation and rebuild of partitioned indexes.

However, a colleague came to me recently with a problem involving another use case, one that a quick Google search was unable to resolve, and so after I worked it out I thought it was time for one of my rare blog posts.

Let us begin by creating a table partitioned by range using the BUSINESS_DATE and AMOUNT columns.


CREATE TABLE part2
(
   business_date   DATE
  ,amount          NUMBER
)
PARTITION BY RANGE (business_date, amount)
   (PARTITION pdefault VALUES LESS THAN (maxvalue, maxvalue));

BEGIN
   FOR r
      IN (    SELECT TO_CHAR (DATE '2000-01-01' + TRUNC ((ROWNUM-1) / 3), 'YYYYMMDD')
                        partition_date
                    ,DECODE (MOD (ROWNUM-1, 3),  0, '10',  1, '100',  'MAXVALUE')
                        partition_amount
                FROM DUAL
          CONNECT BY LEVEL <= 510)
   LOOP
      EXECUTE IMMEDIATE
            '
ALTER TABLE PART2
   SPLIT PARTITION PDEFAULT
      AT (TO_DATE('''
         || r.partition_date
         || ''',''YYYYMMDD''),'
         || r.partition_amount
         || ')
      INTO (PARTITION P'
         || r.partition_date
         || '_'
         || r.partition_amount
         || ', PARTITION PDEFAULT)';
   END LOOP;
END;
/

After creating the table with 1 partition I added 510 more based on 170 dates with three amount ranges each.
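A quick sanity check of the partition count (expect 511 at this point):

SELECT COUNT (*)
  FROM user_tab_partitions
 WHERE table_name = 'PART2';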

Let us begin by looking at the execution plans of a few queries using 8 parallel query servers:


SET HEADING OFF PAGES 0 FEED OFF
EXPLAIN PLAN
   FOR
      SELECT /*+ parallel(8) */
            * FROM part2;

SELECT *
  FROM TABLE (DBMS_XPLAN.display (format => 'BASIC +PARTITION +OUTLINE'));

EXPLAIN PLAN
   FOR
      SELECT /*+ parallel(8) */
            *
        FROM part2
       WHERE business_date = DATE '2000-01-01';

SELECT *
  FROM TABLE (DBMS_XPLAN.display (format => 'BASIC +PARTITION +OUTLINE'));

EXPLAIN PLAN
   FOR
      SELECT /*+ parallel(8) */
            *
        FROM part2
       WHERE business_date = :b1;

SELECT *
  FROM TABLE (DBMS_XPLAN.display (format => 'BASIC +PARTITION +OUTLINE'));
  
Plan hash value: 3087476059                                                     
                                                                                
---------------------------------------------------------                       
| Id  | Operation            | Name     | Pstart| Pstop |                       
---------------------------------------------------------                       
|   0 | SELECT STATEMENT     |          |       |       |                       
|   1 |  PX COORDINATOR      |          |       |       |                       
|   2 |   PX SEND QC (RANDOM)| :TQ10000 |       |       |                       
|   3 |    PX BLOCK ITERATOR |          |     1 |   511 |                       
|   4 |     TABLE ACCESS FULL| PART2    |     1 |   511 |                       
---------------------------------------------------------                       
                                                                                
Outline Data                                                                    
-------------                                                                   
                                                                                
  /*+                                                                           
      BEGIN_OUTLINE_DATA                                                        
      FULL(@"SEL$1" "PART2"@"SEL$1")                                            
      OUTLINE_LEAF(@"SEL$1")                                                    
      SHARED(8)                                                                 
      ALL_ROWS                                                                  
      DB_VERSION('12.1.0.1')                                                    
      OPTIMIZER_FEATURES_ENABLE('12.1.0.1')                                     
      IGNORE_OPTIM_EMBEDDED_HINTS                                               
      END_OUTLINE_DATA                                                          
  */
Plan hash value: 3618659711                                                     
                                                                                
---------------------------------------------------------                       
| Id  | Operation            | Name     | Pstart| Pstop |                       
---------------------------------------------------------                       
|   0 | SELECT STATEMENT     |          |       |       |                       
|   1 |  PX COORDINATOR      |          |       |       |                       
|   2 |   PX SEND QC (RANDOM)| :TQ10000 |       |       |                       
|   3 |    PX BLOCK ITERATOR |          |     1 |     3 |                       
|   4 |     TABLE ACCESS FULL| PART2    |     1 |     3 |                       
---------------------------------------------------------                       
                                                                                
Outline Data                                                                    
-------------                                                                   
                                                                                
  /*+                                                                           
      BEGIN_OUTLINE_DATA                                                        
      FULL(@"SEL$1" "PART2"@"SEL$1")                                            
      OUTLINE_LEAF(@"SEL$1")                                                    
      SHARED(8)                                                                 
      ALL_ROWS                                                                  
      DB_VERSION('12.1.0.1')                                                    
      OPTIMIZER_FEATURES_ENABLE('12.1.0.1')                                     
      IGNORE_OPTIM_EMBEDDED_HINTS                                               
      END_OUTLINE_DATA                                                          
  */
Plan hash value: 3618659711                                                     
                                                                                
---------------------------------------------------------                       
| Id  | Operation            | Name     | Pstart| Pstop |                       
---------------------------------------------------------                       
|   0 | SELECT STATEMENT     |          |       |       |                       
|   1 |  PX COORDINATOR      |          |       |       |                       
|   2 |   PX SEND QC (RANDOM)| :TQ10000 |       |       |                       
|   3 |    PX BLOCK ITERATOR |          |   KEY |   KEY |                       
|   4 |     TABLE ACCESS FULL| PART2    |   KEY |   KEY |                       
---------------------------------------------------------                       
                                                                                
Outline Data                                                                    
-------------                                                                   
                                                                                
  /*+                                                                           
      BEGIN_OUTLINE_DATA                                                        
      FULL(@"SEL$1" "PART2"@"SEL$1")                                            
      OUTLINE_LEAF(@"SEL$1")                                                    
      SHARED(8)                                                                 
      ALL_ROWS                                                                  
      DB_VERSION('12.1.0.1')                                                    
      OPTIMIZER_FEATURES_ENABLE('12.1.0.1')                                     
      IGNORE_OPTIM_EMBEDDED_HINTS                                               
      END_OUTLINE_DATA                                                          
  */

The first query selects all rows in the table, the second all rows for a specific hard-coded date, and the third all rows for a date specified by a bind variable. All three execution plans show the PX BLOCK ITERATOR operation indicating the use of block range granules.

Now let us add one more partition to bring the total number of partitions to 512:

ALTER TABLE part2
   SPLIT PARTITION pdefault
      AT (DATE '2000-06-19', maxvalue)
      INTO (PARTITION p20000619_maxvalue, PARTITION pdefault);

Let us see which execution plans change.

Plan hash value: 1807983963                                                     
                                                                                
-------------------------------------------------------------                   
| Id  | Operation                | Name     | Pstart| Pstop |                   
-------------------------------------------------------------                   
|   0 | SELECT STATEMENT         |          |       |       |                   
|   1 |  PX COORDINATOR          |          |       |       |                   
|   2 |   PX SEND QC (RANDOM)    | :TQ10000 |       |       |                   
|   3 |    PX PARTITION RANGE ALL|          |     1 |   512 |                   
|   4 |     TABLE ACCESS FULL    | PART2    |     1 |   512 |                   
-------------------------------------------------------------                   
                                                                                
Outline Data                                                                    
-------------                                                                   
                                                                                
  /*+                                                                           
      BEGIN_OUTLINE_DATA                                                        
      FULL(@"SEL$1" "PART2"@"SEL$1")                                            
      OUTLINE_LEAF(@"SEL$1")                                                    
      SHARED(8)                                                                 
      ALL_ROWS                                                                  
      DB_VERSION('12.1.0.1')                                                    
      OPTIMIZER_FEATURES_ENABLE('12.1.0.1')                                     
      IGNORE_OPTIM_EMBEDDED_HINTS                                               
      END_OUTLINE_DATA                                                          
  */
Plan hash value: 3618659711                                                     
                                                                                
---------------------------------------------------------                       
| Id  | Operation            | Name     | Pstart| Pstop |                       
---------------------------------------------------------                       
|   0 | SELECT STATEMENT     |          |       |       |                       
|   1 |  PX COORDINATOR      |          |       |       |                       
|   2 |   PX SEND QC (RANDOM)| :TQ10000 |       |       |                       
|   3 |    PX BLOCK ITERATOR |          |     1 |     3 |                       
|   4 |     TABLE ACCESS FULL| PART2    |     1 |     3 |                       
---------------------------------------------------------                       
                                                                                
Outline Data                                                                    
-------------                                                                   
                                                                                
  /*+                                                                           
      BEGIN_OUTLINE_DATA                                                        
      FULL(@"SEL$1" "PART2"@"SEL$1")                                            
      OUTLINE_LEAF(@"SEL$1")                                                    
      SHARED(8)                                                                 
      ALL_ROWS                                                                  
      DB_VERSION('12.1.0.1')                                                    
      OPTIMIZER_FEATURES_ENABLE('12.1.0.1')                                     
      IGNORE_OPTIM_EMBEDDED_HINTS                                               
      END_OUTLINE_DATA                                                          
  */
Plan hash value: 2732640421                                                     
                                                                                
------------------------------------------------------------------              
| Id  | Operation                     | Name     | Pstart| Pstop |              
------------------------------------------------------------------              
|   0 | SELECT STATEMENT              |          |       |       |              
|   1 |  PX COORDINATOR               |          |       |       |              
|   2 |   PX SEND QC (RANDOM)         | :TQ10000 |       |       |              
|   3 |    PX PARTITION RANGE ITERATOR|          |   KEY |   KEY |              
|   4 |     TABLE ACCESS FULL         | PART2    |   KEY |   KEY |              
------------------------------------------------------------------              
                                                                                
Outline Data                                                                    
-------------                                                                   
                                                                                
  /*+                                                                           
      BEGIN_OUTLINE_DATA                                                        
      FULL(@"SEL$1" "PART2"@"SEL$1")                                            
      OUTLINE_LEAF(@"SEL$1")                                                    
      SHARED(8)                                                                 
      ALL_ROWS                                                                  
      DB_VERSION('12.1.0.1')                                                    
      OPTIMIZER_FEATURES_ENABLE('12.1.0.1')                                     
      IGNORE_OPTIM_EMBEDDED_HINTS                                               
      END_OUTLINE_DATA                                                          
  */

We can see that block range granules continue to be used when a hard-coded date is specified but that in the other two cases partition granules are now used. Why?

The answer can be found by looking at the description of two hidden parameters:


NAME                            VALUE   DEFLT   TYPE    DESCRIPTION
_px_partition_scan_enabled      TRUE    TRUE    boolean enables or disables parallel
                                                        partition-based scan 
_px_partition_scan_threshold      64    TRUE    number  least number of partitions per slave
                                                        to start partition-based scan
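Listings like the one above can be produced with the well-known X$ query against the parameter tables (run as SYS; a sketch of the usual technique):

SELECT i.ksppinm  name,
       v.ksppstvl value,
       v.ksppstdf deflt,
       i.ksppdesc description
  FROM sys.x$ksppi i, sys.x$ksppcv v
 WHERE i.indx = v.indx
   AND i.ksppinm LIKE '%px_partition_scan%';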

What we learn from these descriptions is that once the number of partitions reaches 64 times the degree of parallelism (DOP), the CBO will elect to use partition granules. The DOP is 8, meaning that when the number of partitions reached 512 (64 x 8 for those of you with arithmetical challenges) the CBO switched to using partition granules. When partition elimination occurred as a result of the hard-coded date, the CBO knew that only 3 partitions would be referenced and so block range granules were selected. But what about the bind variable case?

The issue here can be a bit tricky for some people to spot (mea culpa). Although we know that there are at most three partitions for each date, this is not enforced and the CBO has no way of checking. Accordingly, in this specific case, partition elimination is not considered when determining which type of granule should be used. But we know that block range granules should be used if we are to utilize all 8 parallel query servers and not just 3.

Interestingly enough the outline hints do not change when the granule type changes, meaning that execution plans may be unstable even if you have a SQL Plan Baseline defined for a statement.

The good news is that if you recognize the problem you do have the option of adding an opt_param('_px_partition_scan_enabled','false') hint to force the use of block range granules (and that will propagate to your baseline). Furthermore, if you know your application well enough you may reasonably consider setting "_px_partition_scan_enabled" to FALSE at the system level. Do not worry. This won’t affect partition-wise joins!
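For example, the bind variable query above reverts to block range granules when hinted like this (a sketch):

EXPLAIN PLAN
   FOR
      SELECT /*+ parallel(8) opt_param('_px_partition_scan_enabled','false') */
            *
        FROM part2
       WHERE business_date = :b1;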

Three more quick points before I close:

1. You will have noticed that empty partitions are not discounted. Yet another reason not to create lots of them in advance!
2. If you are wondering about whether bind variable peeking affects the result, I can tell you that it doesn’t!
3. I tested this on 11.2.0.2 and 12.1 with identical results. I don’t have access to anything earlier or later.

Update on composite partitioned tables

When we look at composite partitioned tables life gets a bit more complicated.  I originally thought that I would need another post to address it but at the end of the day the conclusions can be stated reasonably concisely.

  • If no equality predicate is specified for the partitioning column(s) and an equality predicate is specified for the subpartitioning column (or equality predicates specified for all subpartitioning columns) then the CBO knows that there will be exactly one subpartition accessed per partition.  Partition granules will be used when the number of partitions is 64 x DOP or greater.
  • If no equality predicate is specified for the subpartitioning column(s) and an equality predicate is specified for the partitioning column (or equality predicates specified for all partitioning columns) then the CBO divides the total number of subpartitions in the table by the number of partitions.  Partition granules are used if this calculated average exceeds 64 x DOP.
  • When the predicates are any more complicated then partition elimination is not considered and partition granules are used whenever the total number of subpartitions exceeds 64 x DOP.

In my opinion this is all fuzzy thinking.  The high value of "_px_partition_scan_threshold" suggests a mantra of “if in doubt use block range granules”, but by ignoring the effect of empty partitions or complex predicates the mantra “if in doubt use partition granules” comes to mind.  The latter is surely dangerous.

September 13, 2015

The importance of the GLOBAL_STATS statistic on partitioned tables

Filed under: Uncategorized — tonyhasler @ 9:22 pm

I am currently working in a team alongside a gentleman known as Roberto Rigliaco. You are unlikely to have heard of Roberto as he is not (yet) a blogger and not (yet) known on the conference circuit.

Nevertheless, Roberto has done some excellent work recently analyzing the use of object statistics in partitioned tables. I haven’t seen his findings published anywhere else and so I am glad to say that Roberto has accepted my offer to publish his findings on my blog.

It turns out that the setting of the GLOBAL_STATS column in ALL_TAB_STATISTICS is critical to the analysis. Don’t be confused into thinking that the column GLOBAL_STATS has anything to do with the level (table, partition or subpartition) that the statistics refer to. In fact, GLOBAL_STATS indicates whether the table statistics have been aggregated from partition level statistics (“NO”) or explicitly gathered or set (“YES”). At the partition level a value of GLOBAL_STATS=”NO” would mean that the partition level statistics were aggregated from subpartition level statistics.

I will begin by giving you the punch lines and then providing some examples to prove the points. First some intuitive stuff:

  • If more than one partition is referenced then table level statistics will be used if available. No partition or subpartition level statistics will be used.
  • If partition elimination occurs but the optimizer does not know what single partition will be used then table level statistics will be used if available as above.
  • If partition elimination occurs (and in the case of composite partitioned tables elimination of all but one subpartition has not occurred) and Oracle knows what partition will be used then the statistics for that partition will be used if available.
  • If subpartition elimination occurs and the optimizer knows neither the partition nor subpartition that will be used then table level statistics will be used if available.
  • If subpartition elimination occurs and Oracle knows both the partition and subpartition that will be used then subpartition level statistics will be used if available.

The word “knows” in the above list needs to be taken with a pinch of salt: the optimizer might use bind variable peeking to create an execution plan based on statistics for one (sub)partition and then use that plan on an entirely different (sub)partition later on. As always with bind variable peeking, the execution plan obtained through the explain plan statement may differ from the one seen at runtime.

Now comes the not so intuitive stuff:

  • If Oracle knows what partition will be used (and in the case of composite partitioned tables elimination of all but one subpartition has not occurred) but partition level statistics are not available then table level statistics will be used only if GLOBAL_STATS is set to “YES” for the table level statistics.
  • If subpartition elimination occurs and the optimizer knows what single partition will be used but not which single subpartition then table level statistics will be used if available regardless of the value of GLOBAL_STATS.
  • If Oracle knows what single subpartition will be used and subpartition statistics are not available, then table level statistics will be used only if GLOBAL_STATS is set to “YES” for the table level statistics. Interestingly, partition level statistics will not be used even if GLOBAL_STATS=”YES”.

Roberto provided me with a full suite of test cases to verify the observations above and I have checked them on 11gR2 and 12cR1. For the purpose of this blog I will provide just three cases.

Let us begin by creating our test table

CREATE TABLE stats_test
(
   part_key      INTEGER
  ,subpart_key   INTEGER
)
PARTITION BY RANGE (part_key)
   SUBPARTITION BY LIST (subpart_key)
   (PARTITION p1 VALUES LESS THAN (2)
       (SUBPARTITION p1_sp1 VALUES (1), SUBPARTITION p1_sp2 VALUES (2))
   ,PARTITION p2 VALUES LESS THAN (3)
       (SUBPARTITION p2_sp1 VALUES (1), SUBPARTITION p2_sp2 VALUES (2)));

BEGIN
   DBMS_STATS.set_table_stats (ownname    => USER
                              ,tabname    => 'STATS_TEST'
                              ,partname   => 'P1'
                              ,numrows    => 100);
   DBMS_STATS.set_column_stats (ownname    => USER
                               ,tabname    => 'STATS_TEST'
                               ,partname   => 'P1'
                               ,colname    => 'PART_KEY'
                               ,distcnt    => 5
                               ,density    => 1 / 5);
   DBMS_STATS.set_column_stats (ownname    => USER
                               ,tabname    => 'STATS_TEST'
                               ,partname   => 'P1'
                               ,colname    => 'SUBPART_KEY'
                               ,distcnt    => 5
                               ,density    => 1 / 5);
   DBMS_STATS.set_table_stats (ownname    => USER
                              ,tabname    => 'STATS_TEST'
                              ,partname   => 'P2'
                              ,numrows    => 200);
   DBMS_STATS.set_column_stats (ownname    => USER
                               ,tabname    => 'STATS_TEST'
                               ,partname   => 'P2'
                               ,colname    => 'PART_KEY'
                               ,distcnt    => 2
                               ,density    => 1 / 2);
   DBMS_STATS.set_column_stats (ownname    => USER
                               ,tabname    => 'STATS_TEST'
                               ,partname   => 'P2'
                               ,colname    => 'SUBPART_KEY'
                               ,distcnt    => 2
                               ,density    => 1 / 2);
END;
/

STATS_TEST has two partitions, each of which has two subpartitions. The table is empty but I have set relevant statistics at the partition level only. This means that:

  • There are no subpartition level statistics
  • The partition level statistics have GLOBAL_STATS=YES
  • The table level statistics have GLOBAL_STATS=NO because they have been aggregated from the partition level statistics.
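These settings are easy to confirm from the data dictionary (a quick check):

SELECT partition_name, subpartition_name, num_rows, global_stats
  FROM user_tab_statistics
 WHERE table_name = 'STATS_TEST'
 ORDER BY partition_name, subpartition_name;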

Now let us run a couple of EXPLAIN PLAN statements and examine the associated execution plans. Let us start without using bind variables.

EXPLAIN PLAN
   FOR
      SELECT COUNT (*)
        FROM stats_test
       WHERE part_key = 1 AND subpart_key = 1;

SELECT *
  FROM TABLE (
          DBMS_XPLAN.display (NULL, NULL, 'basic +rows +partition +note'));
----------------------------------------------------------------------
| Id  | Operation               | Name       | Rows  | Pstart| Pstop |
----------------------------------------------------------------------
|   0 | SELECT STATEMENT        |            |     1 |       |       |
|   1 |  SORT AGGREGATE         |            |     1 |       |       |
|   2 |   PARTITION RANGE SINGLE|            |     1 |     1 |     1 |
|   3 |    PARTITION LIST SINGLE|            |     1 |     1 |     1 |
|   4 |     TABLE ACCESS FULL   | STATS_TEST |     1 |     1 |     1 |
----------------------------------------------------------------------

Note
-----
   - dynamic statistics used: dynamic sampling (level=2)

Notice that the precise subpartition is known because literals have been used for the predicates. However, there are no subpartition statistics to use. Partition level statistics have been mysteriously bypassed and the table level statistics are not used because GLOBAL_STATS=NO. The fact that no statistics have been used is confirmed by both the row count and the use of dynamic sampling.

Now let us replace one literal with a bind variable.

EXPLAIN PLAN
   FOR
      SELECT COUNT (*)
        FROM stats_test
       WHERE part_key = 1 AND subpart_key = :b1;

SELECT *
  FROM TABLE (
          DBMS_XPLAN.display (NULL, NULL, 'basic +rows +partition +note'));
----------------------------------------------------------------------
| Id  | Operation               | Name       | Rows  | Pstart| Pstop |
----------------------------------------------------------------------
|   0 | SELECT STATEMENT        |            |     1 |       |       |
|   1 |  SORT AGGREGATE         |            |     1 |       |       |
|   2 |   PARTITION RANGE SINGLE|            |    12 |     1 |     1 |
|   3 |    PARTITION LIST SINGLE|            |    12 |   KEY |   KEY |
|   4 |     TABLE ACCESS FULL   | STATS_TEST |    12 |   KEY |   KEY |
----------------------------------------------------------------------

In this example, the optimizer knows what partition is used but not what subpartition. In this case, the optimizer uses the table level statistics regardless of the fact that GLOBAL_STATS=NO. Notice that the row count is 12, obtained by dividing the number of rows the statistics indicate are in the table as a whole by the number of distinct values of the two columns (300/5/5=12).

Finally, let us actually run the query with the bind variable in place:

DECLARE
   dummy       PLS_INTEGER;
   v_subpart   PLS_INTEGER := 1;
BEGIN
   SELECT COUNT (*)
     INTO dummy
     FROM stats_test
    WHERE part_key = 1 AND subpart_key = v_subpart;
END;
/

SELECT *
  FROM TABLE (
          DBMS_XPLAN.display_cursor ('c9uw6vg3204v0'
                                    ,NULL
                                    ,'basic +rows +partition +peeked_binds +note'));
SELECT COUNT (*) FROM STATS_TEST WHERE PART_KEY = 1 AND SUBPART_KEY =
:B1

Plan hash value: 3184952168

----------------------------------------------------------------------
| Id  | Operation               | Name       | Rows  | Pstart| Pstop |
----------------------------------------------------------------------
|   0 | SELECT STATEMENT        |            |       |       |       |
|   1 |  SORT AGGREGATE         |            |     1 |       |       |
|   2 |   PARTITION RANGE SINGLE|            |     1 |     1 |     1 |
|   3 |    PARTITION LIST SINGLE|            |     1 |   KEY |   KEY |
|   4 |     TABLE ACCESS FULL   | STATS_TEST |     1 |   KEY |   KEY |
----------------------------------------------------------------------

Peeked Binds (identified by position):
--------------------------------------

   1 - :B1 (NUMBER): 1

Note
-----
   - dynamic statistics used: dynamic sampling (level=2)

The use of bind variable peeking means that the optimizer now knows that it wants to use statistics for subpartition P1_SP1 even though the actual subpartition at runtime might vary. The optimizer reverts to the use of dynamic sampling as with the use of literals.

I don’t know if this behaviour is deliberate or accidental but if it is deliberate the logic escapes me.

June 13, 2015

Interview question: second largest value

Filed under: Interview questions,SQL Fun — tonyhasler @ 8:32 pm

This is my first blog post for a very, very long time.  I stopped blogging a couple of years ago when I started working on my book, and after I finally finished the book I felt I needed a break to recuperate.

However, I have finally decided to put my toe back in the water.  This isn't a very serious post but I hope you find it thought-provoking and a bit of fun.

The thought behind this post came from a colleague who had just asked a candidate for employment the following question in order to help assess their command of the SQL syntax.

“How would you select the second largest value from a column in a table”.

A simple enough question. Or so it seems.  You may be able to come up with several answers yourself.  But what is the “best” answer?

There are others that have opined on this subject before. Here is one example. This post is just my thoughts on the topic.

Let me begin by creating a test table with the values 1 to 10 as the basis for some examples:

CREATE TABLE t AS SELECT ROWNUM n1 FROM DUAL CONNECT BY LEVEL <= 10; 

Here is one possible way to find the second highest value:

WITH q1
     AS (SELECT NTH_VALUE (
                   n1,
                   2)
                OVER (
                   ORDER BY n1 DESC
                   ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
                   n1
           FROM t)
SELECT *
  FROM q1
 WHERE ROWNUM < 2;

Solution 1

Solution 1 shows your interviewer that you know about new 11g Oracle features.  Solution 1 also has the benefit that it is readily changeable to obtain the third, fourth, or 100th top value from a table.  Nevertheless, I wouldn’t recommend offering this solution as it uses relatively obscure Oracle-specific syntax and may be viewed as overcomplicated.  This solution has a second disadvantage:  it isn’t correct!

Let me add an 11th row to the table:

INSERT INTO t
VALUES (10);

After adding this 11th row, “solution” 1 now returns 10.  But the second largest value is still 9, despite the fact that there are two rows with a value of 10.

You may try to impress on your interviewer a mastery of analytic functions.  How about this solution:

WITH q1 AS (SELECT n1, DENSE_RANK () OVER (ORDER BY n1 DESC) rnk FROM t)
SELECT n1
FROM q1
WHERE rnk = 2;

Solution 2

Solution 2 returns a value of 9 despite the presence of two rows with a value of 10. You might explain that the runtime engine will not, in practice, sort all of the rows in the table; it will just maintain the two largest values as it scans the table.  You explain that were the table to hold 11 million rows rather than 11 this may be an important consideration.  I don’t commend this solution to you either as it is also incorrect!

Let us add a twelfth row to the table to demonstrate the first of two flaws:

INSERT INTO t
VALUES (9);

Now our query returns two rows with a value of 9.  We need to limit the rows to one.

WITH q1 AS (SELECT n1, DENSE_RANK () OVER (ORDER BY n1 DESC) rnk FROM t)
SELECT n1
FROM q1
WHERE rnk = 2 AND ROWNUM = 1;

Solution 3

But even now the query is still incorrect.  Let me add a lucky 13th row to our table.

INSERT INTO t
VALUES (NULL);

Now solution 3 returns a value of 10!  If there is no NOT NULL constraint on our column we need to clarify the problem statement.  When a column is null, its value is unknown or unspecified; neither equality predicates nor inequality predicates will evaluate to true since we cannot assert that the column value is equal to anything (not even NULL) nor can we assert that it is not equal to anything (even NULL).  So how can we assert that the value (were it specified) would not be the largest or the second largest value?  We cannot.  If we are to “respect nulls” we have to return NULL as our second largest value, since the second largest value would be unknown.

However, let us assume (as do most aggregate functions in SQL) that we are to “ignore nulls” and return the second largest value from the subset of rows that are non-null.  We can now recode our solution to function correctly in the presence of duplicate values and NULLs.

WITH q1
AS (SELECT n1, DENSE_RANK () OVER (ORDER BY n1 DESC NULLS LAST) rnk
FROM t)
SELECT n1
FROM q1
WHERE rnk = 2 AND ROWNUM = 1;

UPDATE

The above is actually incorrect as no row is returned when there are fewer than two distinct values. One way to correct this is as follows:

WITH q1
AS (SELECT n1, DENSE_RANK () OVER (ORDER BY n1 DESC NULLS LAST) rnk
FROM t)
SELECT max(n1)
FROM q1
WHERE rnk = 2 AND ROWNUM = 1;

Solution 4

As far as I can see, solution 4 is semantically correct.

But is solution 4 the “best” solution?  How about this:

SELECT MAX (n1) n1
FROM t
WHERE n1 < (SELECT MAX (n1) FROM t);

Solution 5

Solution 5 handles duplicate values and nulls correctly.  It also avoids any sorts, uses no Oracle-specific SQL extensions and is very easy to read.

Although solution 5 avoids sorting large numbers of rows just like solution 4, will it be as fast?  In some cases, solution 5 will involve two full table scans and as a consequence might be a lot slower on a large table.

But this is not necessarily the case.  Let us add an index to our table:

CREATE INDEX i1
ON t (n1);

Once the index has been added the execution plan for solution 5 becomes this:

-------------------------------------------------------------------
| Id  | Operation                     | Name | Rows  | Cost (%CPU)|
-------------------------------------------------------------------
|   0 | SELECT STATEMENT              |      |     1 |     1   (0)|
|   1 |  SORT AGGREGATE               |      |     1 |            |
|   2 |   FIRST ROW                   |      |     1 |     1   (0)|
|*  3 |    INDEX RANGE SCAN (MIN/MAX) | I1   |     1 |     1   (0)|
|   4 |     SORT AGGREGATE            |      |     1 |            |
|   5 |      INDEX FULL SCAN (MIN/MAX)| I1   |     1 |     1   (0)|
-------------------------------------------------------------------

What this execution plan shows is that when the runtime engine executes solution 5 on an indexed column it will pick the maximum value of the column from the right edge of the index and then pick the second largest value by looking at the entry immediately preceding that index entry (or entries).  No full scan of the index or table is required.  This means that when the column is indexed, solution 5 is likely to return the second largest value in milliseconds.  So in the case of an indexed column solution 5 will likely outperform solution 4 on large tables and in the case of an unindexed column solution 4 will likely outperform solution 5.

Of course, it would also be possible to devise solutions to our problem statement based on MATCH_RECOGNIZE, the MODEL clause, or even user-written aggregate functions. I would be very surprised if any such solutions would be simpler than those shown in this post.  Nor is it likely that such solutions would outperform solutions 4 and 5 posted here.

If you enjoyed this, feel free to suggest other topics in this vein in a reply.

December 9, 2014

UKOUG TECH QUIZ TECH14 Answers

Filed under: Uncategorized — tonyhasler @ 10:05 am

The questions in the quiz were, of course, all trick questions.

Question 1: Answer D

SELECT * FROM T1 LEFT NATURAL JOIN T2;

SELECT * FROM T1 NATURAL LEFT JOIN T2;

The second example shows the correct syntax for a natural outer join.  In the first example, LEFT is interpreted as a table alias for T1 and so an inner join is used.

Question 2: Answer E

SELECT * FROM T1 INNER JOIN T2 ON C1=C2;

SELECT * FROM T1 OUTER JOIN T2 ON C1=C2;

This is the same sort of issue.  Any outer join needs to specify LEFT, RIGHT, or FULL, but in the second example above OUTER is interpreted as a table alias for T1.  The keyword INNER is optional.  A third construct that also generates identical results is:

SELECT * FROM T1 OUTER INNER JOIN T2 ON C1=C2;

Question 3: Answer D

SELECT *
FROM t1, t2, t3
WHERE t1.c1 = t2.c2(+) AND t2.c2(+) = t3.c3;

SELECT *
FROM t1
LEFT JOIN t2 ON t1.c1 = t2.c2
RIGHT JOIN t3 ON t2.c2 = t3.c3;

Those of you attending my talk on Monday would know the answer to this one.  The ambiguity in the first statement meant that it was illegal in releases prior to 12c but has now become legal for performance reasons.  To express the results of the first query in ANSI syntax you would have to say:

SELECT * FROM T1 CROSS JOIN T3 LEFT JOIN T2 ON T1.C1=T2.C2 AND T3.C3=T2.C2;

Question 4: Answer E

SELECT c1, COUNT (*)
FROM t1
GROUP BY c1
HAVING COUNT (*) = 1;

SELECT c1, COUNT (*)
FROM t1
HAVING COUNT (*) = 1
GROUP BY c1;

I wouldn’t consider it best practice but the HAVING clause may precede the GROUP BY clause.

Tie breaker 1:

It is actually possible to solve the Paul and Sam problem using just two query blocks.  Here is my solution:


WITH base
     AS (    SELECT TRUNC (LEVEL / 97) + 2 a,
                    MOD (LEVEL, 97) + 3 b,
                    TRUNC (LEVEL / 97) + MOD (LEVEL, 97) + 5 s,
                    (TRUNC (LEVEL / 97) + 2) * (MOD (LEVEL, 97) + 3) p,
                    COUNT (*)
                       OVER (PARTITION BY   (TRUNC (LEVEL / 97) + 2)
                                          * (MOD (LEVEL, 97) + 3))
                       p_cnt1,
                    COUNT (*)
                       OVER (PARTITION BY   (TRUNC (LEVEL / 97) + 2)
                                          + (MOD (LEVEL, 97) + 3))
                       s_cnt1
               FROM DUAL b
              WHERE TRUNC (LEVEL / 97) <= MOD (LEVEL, 97)
         CONNECT BY LEVEL <= 97 * 97)
-- Main query block
SELECT a, b, p, s
  FROM base b LEFT JOIN base p ON p.p_cnt1 = 1 AND b.s = p.s
 WHERE b.s_cnt1 > 1 AND p.s IS NULL
 GROUP BY b.p
HAVING COUNT (*) = 1
 MODEL UNIQUE SINGLE REFERENCE RETURN UPDATED ROWS
    DIMENSION BY (COUNT (*) OVER (PARTITION BY MAX (b.s)) s_cnt)
    MEASURES (MAX (b.a) a, MAX (b.b) b, b.p, MAX (b.s) s)
    RULES (a [1] = a [1]);

The first query block is easy enough to understand.  We generate all possible pairs of numbers where A is the lower and B is the higher.  P_CNT1 is the number of rows with the same product and S_CNT1 is the number of rows with the same sum.

To understand the main query block we need to build it up in stages.

Let us start with this:
.....
SELECT a, b, p, s
  FROM base b
 WHERE b.s_cnt1 > 1
   AND b.s NOT IN (SELECT s
                     FROM base p
                    WHERE p.p_cnt1 = 1);

Here we select just the rows where Sam doesn't know, and where Sam knew Paul didn't know because there is no pair of numbers with the same sum for which Paul could know.  We can get rid of the subquery with an outer join.


SELECT b.a,
       b.b,
       b.p,
       b.s
  FROM base b LEFT JOIN base p ON p.p_cnt1 = 1 AND b.s = p.s
 WHERE b.s_cnt1 > 1 AND b.p_cnt1 > 1 AND p.s IS NULL;

In the above example the last predicate selects just the rows preserved by the outer join by virtue of not matching.

We can now move on to identify the rows where Paul can now declare that he knows.  For example, if Paul had been given the product 18, he would know that the numbers couldn’t be 3 and 6 because then Sam would have been given the number 9 and Sam wouldn’t have known in advance that Paul hadn’t been given the number 14 and hence have known the answer.  Once Paul is able to eliminate 3 and 6 from consideration he knows that the numbers must be 2 and 9.

We can identify the 86 rows where Paul now knows by using a GROUP BY with a HAVING clause.

...
SELECT MAX (b.a) a,
       MAX (b.b) b,
       b.p,
       MAX (b.s) s,
       COUNT (*) OVER (PARTITION BY MAX (b.s)) s_cnt
  FROM base b LEFT JOIN base p ON p.p_cnt1 = 1 AND b.s = p.s
 WHERE b.s_cnt1 > 1 AND b.p_cnt1 > 1 AND p.s IS NULL
 GROUP BY b.p
HAVING COUNT (*) = 1;

The MAX aggregate functions could be MIN or even AVG because in the result set we are aggregating just one row.  The analytic function identifies the number of possibilities that Sam can now consider: the number out of the remaining 86 rows with identical values for the sum.  A visual inspection of these 86 rows yields the one possibility where Sam now knows the answer.  But how can we select that row without another query block?

The answer is to use the MODEL clause.

First of all, let us get the MODEL clause set up without attempting to restrict the rows.  When a model clause is used, the aggregate and analytic expressions move out of the SELECT list and into the PARTITION, DIMENSION, and MEASURES sub-clauses.  We ultimately want to look up using S_CNT so that goes in the DIMENSION clause.

...
SELECT a, b, p, s, s_cnt
  FROM paul_knows b LEFT JOIN paul_knows p ON p.p_cnt1 = 1 AND b.s = p.s
 WHERE b.s_cnt1 > 1 AND p.s IS NULL
 GROUP BY b.p
HAVING COUNT (*) = 1
 MODEL UNIQUE SINGLE REFERENCE
    DIMENSION BY (COUNT (*) OVER (PARTITION BY MAX (b.s)) s_cnt)
    MEASURES (MAX (b.a) a, MAX (b.b) b, b.p, MAX (b.s) s)
    RULES ();

The above construct generates the same 86 rows as the previous example.  The UNIQUE SINGLE REFERENCE bit is used to suppress a check the SQL engine makes; a reference to a measure such as a[3] would appear to reference a single cell but would actually reference six.  By specifying UNIQUE SINGLE REFERENCE we promise not to do anything naughty like that and the error is suppressed.

To restrict the rows further we add the keywords RETURN UPDATED ROWS.  If that is all we did we would get no rows returned because there are no rules yet to update rows.  The final thing to do is to update the single row that is our puzzle answer: a[1] = a[1] is a dummy update identifying the one row (a = 4 and b = 13).  We could have used any measure, such as b[1] = b[1] + 0.

December 2, 2014

UKOUG TECH14 SQL Quiz

Filed under: SQL Fun — tonyhasler @ 8:03 pm
Tags:

If you are an attendee of UKOUG Tech14 you have a chance to win a free copy of Troubleshooting Oracle Performance (2nd edition) by Christian Antognini.


Troubleshooting Oracle Performance, 2nd Edition provides a systematic approach to addressing the underlying causes of poor database application performance.

And you can also win a free copy of Expert Oracle SQL by Tony Hasler.

Expert Oracle SQL provides a systematic approach for tuning individual SQL statements and shows how to stabilize production performance with a radical approach to managing object statistics.

To get a chance to win a copy of both of these books just leave a comment on this blog with your answers to these questions:

For each of the following pairs of SQL statements indicate whether:

A) They are both illegal
B) The first is legal and the second illegal
C) The second is legal and the first illegal
D) They are both legal but may provide different results
E) They are both legal statements guaranteed to return identical results (ignoring order)

You should assume an Oracle 12cR1 database containing tables T1, T2 and T3 in the current schema. Table T1 has a column C1, table T2 has a column C2, and table T3 has a column C3.  All tables have a column called ID.

Question 1:

SELECT * FROM T1 LEFT NATURAL JOIN T2;

SELECT * FROM T1 NATURAL LEFT JOIN T2;

Question 2:

SELECT * from T1 INNER JOIN T2 ON C1=C2;

SELECT * FROM T1 OUTER JOIN T2 ON C1=C2;

Question 3:

SELECT *
FROM t1, t2, t3
WHERE t1.c1 = t2.c2(+) AND t2.c2(+) = t3.c3;

SELECT *
FROM t1
LEFT JOIN t2 ON t1.c1 = t2.c2
RIGHT JOIN t3 ON t2.c2 = t3.c3;

Question 4:

SELECT c1, COUNT (*)
FROM t1
GROUP BY c1
HAVING COUNT (*) = 1;

SELECT c1, COUNT (*)
FROM t1
HAVING COUNT (*) = 1
GROUP BY c1;

Tie breaker 1:

Consider this puzzle.

All the numbers from 2 to 99 are placed into a hat. John draws two numbers from the hat and whispers the product of the two numbers to Paul. John whispers the sum of the two numbers to Sam. Paul and Sam then have the following conversation:

Paul: I don’t know what the two numbers are.
Sam: I knew you didn’t. Neither did I.
Paul: I do now.
Sam: So do I.

Write an SQL query that runs on an Oracle database to identify the only possible pair of numbers that could have been extracted from the hat.

  • The only database object you should reference is DUAL.
  • Use the keyword SELECT as infrequently as possible.
  • No PL/SQL. For the avoidance of doubt, PL/SQL used in part of an SQL statement in 12cR1 is not allowed.
  • You cannot work out all or part of the answer yourself. For example, you might deduce that if Sam knew the answer immediately then so would Paul; your SQL cannot make any such assumption.

Tie breaker 2:

The first three names drawn out of a hat containing the names of everybody with correct answers will win a copy of each of the two books.

Rules

  • This competition is only available to attendees of the UKOUG Tech14 conference
  • To stand a chance of winning you must post your answers to the quiz questions by 11:00 a.m. on Tuesday 9th December 2014
  • You must be present at the Optimizer Round Table session at 12:00 on Tuesday 9th December to collect your prize. Otherwise another name will be drawn from the hat.

October 24, 2012

Model clause use cases

Filed under: SQL Fun — tonyhasler @ 9:22 pm

This note describes a couple of real-life use cases for the SQL MODEL (or SPREADSHEET) clause that appeared in Oracle 10g. It is mainly targeted at people attending my UKOUG presentation on the model clause (materials TBS), but anybody who knows the syntax of the model clause may find it of some interest.

We already have SQL (non-procedural) and PL/SQL (procedural) languages and it is easy to call SQL from PL/SQL and vice-versa. So you might be asking yourself why on earth we need the model clause. Doesn’t it seem like just a procedural extension to SQL? If so why wouldn’t we just use PL/SQL?

Well, over the last year or so I have found a number of cases where the model clause has been useful. The two key areas are:

  • Implementing analytics not available directly from an SQL function
  • Implementing calculations in parallel

Before we discuss my use cases let us create some test data:

CREATE TABLE model_test
AS
   WITH q1
        AS (    SELECT DATE '2000-01-01' + LEVEL mydate, DBMS_RANDOM.VALUE ( 90, 100) VALUE
                  FROM DUAL
                 WHERE MOD ( LEVEL, 11) <> 0
            CONNECT BY LEVEL <= 4000
            UNION ALL
                SELECT DATE '2000-01-01' + LEVEL mydate, DBMS_RANDOM.VALUE ( 100, 200) VALUE
                  FROM DUAL
                 WHERE MOD ( LEVEL, 11) = 0
            CONNECT BY LEVEL <= 4000)
       ,q2
        AS (    SELECT DBMS_RANDOM.string ( 'p', 10) account
                  FROM DUAL
            CONNECT BY LEVEL <= 100)
   SELECT *
     FROM q1, q2;

EXEC DBMS_STATS.gather_table_stats(USER,'MODEL_TEST');

This table contains 400,000 rows with 4,000 days’ worth of data for each of 100 accounts. The data is just a single value of some kind. The average value is approximately 100, but for each account there are 3637 closely packed values between 90 and 100 and 363 values between 100 and 200.
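As a quick sanity check (a sketch; the expected counts follow from the MOD (LEVEL, 11) split above):

SELECT COUNT (*) total_rows
      ,COUNT (CASE WHEN VALUE >= 100 THEN 1 END) wild_rows
  FROM model_test;

-- Should show 400,000 and 36,300: 363 values per account
-- come from the 100-200 branch of the UNION ALL.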

A genuine requirement from my investment bank client was to implement a moving median over a one-year window. The first problem I ran into related to the definition of a one-year window; I discussed this issue in my last post. The second issue is, of course, that the MEDIAN function does not support windowing: its OVER clause allows only PARTITION BY, with no ordering or windowing clause.
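So the sort of thing one might naively write is rejected by the parser (a sketch; it is the ORDER BY and windowing clause that draw the error):

SELECT mydate
      ,MEDIAN (VALUE)
          OVER (PARTITION BY account
                ORDER BY mydate
                RANGE BETWEEN INTERVAL '1' YEAR PRECEDING AND CURRENT ROW)
          mov_median
  FROM model_test;

This is a solution based on the model clause: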

SELECT /*+ parallel(10) */ * FROM model_test
  MODEL
  PARTITION BY (account)
  DIMENSION BY (mydate)
  MEASURES (VALUE, 0 mov_median)
  RULES(
    mov_median[ANY] = MEDIAN(VALUE)[mydate BETWEEN ADD_MONTHS(CV()+1,-12) AND CV()]);

This is a relatively concise way to express the requirement and it doesn’t look particularly procedural.

Trying to code this in PL/SQL would be pretty tricky because of the need to constantly manipulate sets of data. Furthermore, a moving median can’t be computed incrementally as the window slides, which is presumably why there is no native windowing support. The fact that we can use parallel query slaves to reduce the time taken to perform the calculations further adds to the benefit of using the model clause rather than a PL/SQL function of some kind. Each slave performs the calculations for some subset of accounts.

In another case I was asked to calculate a zscore based on semi-standard deviations. A semi-standard deviation is required when the distribution of data above the mean is substantially different from that below the mean as in my test data. Essentially we have one standard deviation for the values above the mean and a separate one for those less than or equal to the mean. Zscore is simply a term for the number of standard deviations (in this case semi) that a particular value is above or below the mean. The zscores were also to be calculated over a moving one year window.
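To make the definitions concrete, here is a sketch that computes the two semi-standard deviations for one (arbitrarily chosen) account over its whole history, ignoring the moving window for the moment:

WITH stats
     AS (SELECT VALUE, AVG (VALUE) OVER () mean_value
           FROM model_test
          WHERE account = (SELECT MIN (account) FROM model_test))
SELECT SQRT (AVG (CASE WHEN VALUE > mean_value
                       THEN POWER (VALUE - mean_value, 2) END)) upper_semi_sd
      ,SQRT (AVG (CASE WHEN VALUE <= mean_value
                       THEN POWER (VALUE - mean_value, 2) END)) lower_semi_sd
  FROM stats;

AVG ignores the NULLs generated by the CASE expressions, so each average runs over only the contributing rows; the CASE in the model rule further down works in exactly the same way.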

This is much more complicated than a moving median. Firstly, we have no built-in aggregate function that we can use for semi-standard deviations. Secondly, as we shall shortly see, we need an iterative model to implement our calculations.

To help explain my approach I need to invent some terms. I will say that:

  • The target date is the date for which we need our semi-standard deviation and zscore.
  • A contributing date is a date that contributes to the semi-standard deviation. Contributing
    dates are those on the correct side of the mean that are within the window for a target date.

A particular date will be part of the window for itself and all of the target dates in the following year. However, it will only contribute to some subset because its value will only sometimes be on the same side of the mean as the value of the target date. As a consequence we need a separate iteration for each target date.

To avoid unnecessary complications I will make two assumptions:

  • The data is dense. In other words, data is present for every account for every day. This wasn’t the case in real life.
  • A window of 365 days, not a calendar year, was adequate. This was the case in real life.

This is roughly how I did it:

   WITH q1
        AS (SELECT m.*
                  ,  ROW_NUMBER ()
                        OVER (PARTITION BY account ORDER BY mydate)
                   - 1
                      rn
                  ,0 AS mov_stdd
                  ,0 AS mov_avg
                  ,0 AS temp
                  ,0 AS zscore
              FROM model_test m)
   SELECT /*+ parallel(10) */
         account
         ,mydate
         ,VALUE
         ,mov_avg
         ,mov_stdd
         ,zscore
     FROM q1
   MODEL
      PARTITION BY (account)
      DIMENSION BY (rn)
      MEASURES (mydate, VALUE, mov_avg, mov_stdd, zscore, temp)
      RULES UPDATE
         ITERATE (100000) UNTIL VALUE[ITERATION_NUMBER] IS NULL
         (mov_avg [ITERATION_NUMBER] =
               AVG (VALUE)[rn BETWEEN ITERATION_NUMBER - 364
                                  AND ITERATION_NUMBER],
         temp [rn BETWEEN ITERATION_NUMBER - 364 AND ITERATION_NUMBER] =
               CASE
                  WHEN    (    VALUE[CV ()] <= mov_avg[ITERATION_NUMBER]
                           AND VALUE[ITERATION_NUMBER] <=
                                  mov_avg[ITERATION_NUMBER])
                       OR (    VALUE[CV ()] > mov_avg[ITERATION_NUMBER]
                           AND VALUE[ITERATION_NUMBER] >
                                  mov_avg[ITERATION_NUMBER])
                  THEN
                     POWER ( VALUE[CV ()] - mov_avg[ITERATION_NUMBER], 2)
               END,
         mov_stdd [ITERATION_NUMBER] =
               SQRT (
                  AVG (temp)[rn BETWEEN ITERATION_NUMBER - 364
                                    AND ITERATION_NUMBER]),
         zscore [ITERATION_NUMBER] =
               DECODE (mov_stdd[CV ()]
                      ,0, 0
                      , (VALUE[CV ()] - mov_avg[CV ()]) / mov_stdd[CV ()]));

We have associated a number with each target date so that we can use ITERATION_NUMBER to find our target values. Note that the measure TEMP for a particular date is typically recalculated for each of 365 target dates and holds the contribution to the semi-standard deviation. The variance is the average contribution and the standard deviation (or, more specifically, the population standard deviation) is the square root of the variance. The zscore is then a simple calculation, as you can see.

Even though iteration was required here, PL/SQL wasn’t the right tool; even the limited use of aggregate functions in the above example has substantially simplified the logic and once again we can use parallel query slaves for our computations.

Update in reply to Dushyant

Dushyant has posted an interesting question in his comment. He wants to know how to implement a moving RANK function in a similar way to the moving MEDIAN function above. This was my first attempt:

WITH a
     AS (SELECT 'a' sector, TRUNC (SYSDATE) dt, 64 v FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 1 dt, 2 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 2 dt, 4 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 3 dt, 128 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 4 dt, 8 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 5 dt, 16 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 6 dt, 32 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 7 dt, 256 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 8 dt, 1 v FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 9 dt, 512 FROM DUAL
         UNION ALL
         SELECT 'b' sector, TRUNC (SYSDATE) dt, 3 FROM DUAL
         UNION ALL
         SELECT 'b' sector, TRUNC (SYSDATE) - 1 dt, 27 FROM DUAL
         UNION ALL
         SELECT 'b' sector, TRUNC (SYSDATE) - 2 dt, 9 FROM DUAL
         UNION ALL
         SELECT 'b' sector, TRUNC (SYSDATE) - 3 dt, 81 FROM DUAL),
     b
     AS (SELECT a.*,
                RANK () OVER (PARTITION BY sector ORDER BY dt) dt_rnk,
                COUNT (DISTINCT dt) OVER (PARTITION BY sector) dt_cnt
           FROM a)
SELECT sector,
       dt,
       v,
       mov_rank
  FROM b
MODEL
   PARTITION BY (sector)
   DIMENSION BY (dt)
   MEASURES (v, 0 mov_rank)
   RULES UPDATE
 (     mov_rank [ANY] =
            RANK(v[CV(dt)]) WITHIN GROUP (ORDER BY v)[dt BETWEEN CV()-3 AND CV()]);

One problem with the code above is that the nested cell reference v[CV(dt)] is illegal. I have only been able to solve the problem using an iterative model clause.

WITH a
     AS (SELECT 'a' sector, TRUNC (SYSDATE) dt, 64 v FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 1 dt, 2 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 2 dt, 4 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 3 dt, 128 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 4 dt, 8 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 5 dt, 16 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 6 dt, 32 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 7 dt, 256 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 8 dt, 1 v FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 9 dt, 512 FROM DUAL
         UNION ALL
         SELECT 'b' sector, TRUNC (SYSDATE) dt, 3 FROM DUAL
         UNION ALL
         SELECT 'b' sector, TRUNC (SYSDATE) - 1 dt, 27 FROM DUAL
         UNION ALL
         SELECT 'b' sector, TRUNC (SYSDATE) - 2 dt, 9 FROM DUAL
         UNION ALL
         SELECT 'b' sector, TRUNC (SYSDATE) - 3 dt, 81 FROM DUAL),
     b
     AS (SELECT a.*,
                RANK () OVER (PARTITION BY sector ORDER BY dt) dt_rnk,
                COUNT (DISTINCT dt) OVER (PARTITION BY sector) dt_cnt
           FROM a)
SELECT sector,
       dt,
       v,
       mov_rank
  FROM b
MODEL
   PARTITION BY (sector)
   DIMENSION BY (dt_rnk)
   MEASURES (dt, v, dt_cnt, 0 mov_rank, 0 temp)
   RULES UPDATE
      ITERATE (100000) UNTIL dt_cnt[1] = ITERATION_NUMBER
      (temp [dt_rnk BETWEEN ITERATION_NUMBER - 2 AND ITERATION_NUMBER + 1] =
            RANK () OVER (ORDER BY v),
      mov_rank [ITERATION_NUMBER + 1] =
            temp[CV ()]);

This provides the desired effect: MOV_RANK is the rank of V within the window made up of the current date and up to three preceding dates.  Here is the output:

SECTOR DT           V MOV_RANK
------ ---------- --- --------
a      27/08/2014 512        1
a      28/08/2014   1        1
a      29/08/2014 256        2
a      30/08/2014  32        2
a      31/08/2014  16        2
a      01/09/2014   8        1
a      02/09/2014 128        4
a      03/09/2014   4        1
a      04/09/2014   2        1
a      05/09/2014  64        3
b      02/09/2014  81        1
b      03/09/2014   9        1
b      04/09/2014  27        2
b      05/09/2014   3        1

October 23, 2012

Non-inclusive windows for yearly or monthly intervals

Filed under: SQL Fun — tonyhasler @ 1:01 pm

About six months ago I came across the need to use analytic functions with a moving one year window. This is supposedly supported in Oracle 10g and beyond but the behaviour was not what I expected. Consider this test query:

WITH q1
     AS (    SELECT DATE '2012-01-01' + ROWNUM mydate
               FROM DUAL
         CONNECT BY LEVEL <= 400)
    ,q2
     AS (SELECT mydate
               ,COUNT (*)
                   OVER (ORDER BY mydate
                         RANGE BETWEEN INTERVAL '1' YEAR PRECEDING AND CURRENT ROW)
                   cnt
               ,MIN (mydate)
                   OVER (ORDER BY mydate
                         RANGE BETWEEN INTERVAL '1' YEAR PRECEDING AND CURRENT ROW)
                   mindate
               ,MAX (mydate)
                   OVER (ORDER BY mydate
                         RANGE BETWEEN INTERVAL '1' YEAR PRECEDING AND CURRENT ROW)
                   maxdate
           FROM q1)
SELECT *
  FROM q2
 WHERE mydate = DATE '2013-01-10';

This is the result:

MYDATE      CNT MINDATE     MAXDATE
----------- --- ----------- -----------
10-Jan-2013 367 10-Jan-2012 10-Jan-2013

The count of days in the year ending 10th January 2013 is 367. We know the window includes 29th February 2012, so we expect 366 days, but not 367! The explanation comes when we look at the minimum and maximum values in the range: we have included both 10th January 2012 and 10th January 2013. In reality, the fact that Oracle included an extra day was not an issue on this occasion, but it has been bothering me for the last six months that I couldn’t see a way to solve this problem without the use of the model clause (more on this in my next post). However, as is often the case when one muses on something like this, the answer came to me in my sleep and I awoke with a start!
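The boundary arithmetic is easy to check (a minimal sketch):

-- The lower bound of the window lands exactly on 10th January 2012, and
-- RANGE windows are inclusive at both ends, hence the extra day.
SELECT DATE '2013-01-10' - INTERVAL '1' YEAR lower_bound FROM DUAL;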

I may have been inspired by Stephen Hawking’s invention of virtual time when I finally came up with this solution:

WITH q1
     AS (    SELECT DATE '2012-01-01' + ROWNUM mydate
               FROM DUAL
         CONNECT BY LEVEL <= 400)
    ,q2
     AS (SELECT mydate
               ,COUNT (*)
                   OVER (ORDER BY   ( (mydate - DATE '1950-01-01') * (86401 / 86400))
                                  + DATE '1950-01-01'
                         RANGE BETWEEN INTERVAL '1' YEAR PRECEDING AND CURRENT ROW)
                   cnt
               ,MIN (mydate)
                   OVER (ORDER BY   ( (mydate - DATE '1950-01-01') * (86401 / 86400))
                                  + DATE '1950-01-01'
                         RANGE BETWEEN INTERVAL '1' YEAR PRECEDING AND CURRENT ROW)
                   mindate
               ,MAX (mydate)
                   OVER (ORDER BY   ( (mydate - DATE '1950-01-01') * (86401 / 86400))
                                  + DATE '1950-01-01'
                         RANGE BETWEEN INTERVAL '1' YEAR PRECEDING AND CURRENT ROW)
                   maxdate
           FROM q1)
SELECT *
  FROM q2
 WHERE mydate = DATE '2013-01-10';

This window function works using a somewhat simpler concept of virtual time than that of the famous physicist: each day is translated into 1 day and one second. This is accomplished by subtracting an arbitrary date that is earlier than the first date in your data and no more than about 236 years earlier than the last date in your data (86,400 days, at which point the accumulated seconds add up to a full day). You then multiply the resulting number of days by 86401/86400 (there are 86400 seconds in one day) and then add the arbitrary date back in again. Now, when converted into virtual time, 10th January 2012 is 1 year and 366 seconds prior to 10th January 2013, rather than exactly 1 year, and so is excluded from the window. As long as your window is less than around 236 years and your real dates do not include a time of day, this technique allows you to implement windows of months and/or years that exclude the start date of the window.
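You can verify the virtual gap between the two boundary dates directly (a sketch):

SELECT (  (DATE '2013-01-10' - DATE '1950-01-01') * (86401 / 86400)
        - (DATE '2012-01-10' - DATE '1950-01-01') * (86401 / 86400)) gap_days
  FROM DUAL;

-- Returns roughly 366.00424 days: 366 days plus 366 seconds, which is more
-- than the one (leap) year, so 10th January 2012 falls outside the window.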

June 11, 2012

Ordering Non-sorting aggregate functions

Filed under: SQL Fun — tonyhasler @ 3:49 pm

Aggregate functions fall into two categories. The first is what I call “Sorting aggregate functions” that need their data sorted in some way to operate. Examples of sorting aggregate functions are MEDIAN and DENSE_RANK. The second category is non-sorting aggregate functions that do not require a sort to operate. These include COUNT, AVG, and RANK.

In Oracle 10g the concept of HASH AGGREGATION was introduced as an optimisation of the GROUP BY operation. The idea is that non-sorting aggregate functions (let me call them NSAFs from now on) do not need to be kept in a sorted list; you just keep a hash table of the various accumulating counters that are needed for that function’s operation. Since a hash table is quicker to access than a sorted list this offers some performance benefit, more so when the list of groups is large.

Hash aggregation obviously only works for NSAFs. If you include MEDIAN(X) in your select list or HAVING clauses then the input data will all need to be sorted. One big list is created, sorted first by the GROUP BY columns and then by, in this case, column X.

For some reason, there are five NSAFs that do not take advantage of hash aggregation. These are FIRST, LAST, RANK, SYS_XMLAGG and XMLAGG. I can’t imagine why not but there it is.

Even if an NSAF supports hash aggregation the CBO doesn’t always use it when the data needs to be sorted anyway, for example as a result of an ORDER BY operation.

Guy Harrison has suggested in his blog that the logic is “seriously flawed” because if you have 1,000,000 rows and 100 groups it would be best to defer the sort until after the aggregation, because then you have only 100 rows to sort and not 1,000,000. He has even done some experiments to show that the CPU overhead is much less if you use the USE_HASH_AGGREGATION hint.

However, it turns out that Oracle does not sort 1,000,000 rows after all, just the 100 group rows. Let us create a test case.

CREATE TABLE sort_test
AS
   WITH q1
        AS (    SELECT ROWNUM rn
                  FROM DUAL
            CONNECT BY LEVEL <= 100)
   SELECT MOD (a.rn, 2) grp, a.rn + b.rn + c.rn myint1, b.rn myint2
     FROM q1 a, q1 b, q1 c;

This table has 1,000,000 rows but there are only two distinct values for the GRP column. Let us set a 1K SORT_AREA_SIZE as follows:

ALTER SESSION SET workarea_size_policy=manual;
ALTER SESSION SET sort_area_size=1024;

Now let us see what the execution plan looks like for a statement that truly requires a sort of all 1,000,000 rows:

  SELECT grp, MEDIAN (myint1) myavg
    FROM sort_test
GROUP BY grp
ORDER BY grp;

--------------------------------------------------------------------------------
| Id  | Operation          | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           |   966K|    23M|   603   (9)| 00:00:08 |
|   1 |  SORT GROUP BY     |           |   966K|    23M|   603   (9)| 00:00:08 |
|   2 |   TABLE ACCESS FULL| SORT_TEST |   966K|    23M|   559   (2)| 00:00:07 |
--------------------------------------------------------------------------------

The 23M of space that we require is much greater than the 1K we have allocated and so when we run it we get a multi-pass sort that on my laptop took 51 minutes.

Let us change the function to a NSAF:

  SELECT grp, AVG (myint1) myavg
    FROM sort_test
GROUP BY grp
ORDER BY grp;

The execution plan reported by the CBO is identical. However, the query ran in under 1 second and the session statistics report a memory sort!

What is clearly happening is that only the accumulating counters are being maintained, not the values from each row. I did the experiment on 10.2.0.4 and 11.2.0.1, so I don’t think it is a new feature.

Incidentally, the same mechanism is used for NSAFs that don’t support hash aggregation:

  SELECT grp, MIN (myint1) KEEP (DENSE_RANK FIRST ORDER BY myint2) myvalue
    FROM sort_test
GROUP BY grp
ORDER BY grp;

The above query also finishes very quickly.

So why was Guy able to demonstrate a performance improvement with the use of the USE_HASH_AGGREGATION hint? Well, the answer is that wandering up and down a tree of sorted items can be much more expensive than performing a direct hash lookup, particularly as Guy had 200,000 groups. However, this benefit may be small, or even negative, if the number of groups is very small, as in my artificial case.
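If you want hash aggregation anyway, you can ask for it explicitly (a sketch; with only two groups the difference here should be negligible, and you would expect the plan to show HASH GROUP BY with the ORDER BY satisfied by sorting just the two aggregated rows):

SELECT /*+ USE_HASH_AGGREGATION */ grp, AVG (myint1) myavg
  FROM sort_test
GROUP BY grp
ORDER BY grp;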

So Guy is right about the CBO being “flawed” because a) the execution plan costs the operation incorrectly and b) the sorted tree is going to be expensive to walk up and down a huge number of times. However, things aren’t quite as bad as they first appear.
