Tony’s Oracle Tips

October 24, 2012

Model clause use cases

Filed under: SQL Fun — tonyhasler @ 9:22 pm

This note describes a couple of real-life use cases for the SQL MODEL (or SPREADSHEET) clause that appeared in Oracle 10g. This is mainly targeted at people attending my UKOUG presentation on the model clause (materials TBS) but anybody who knows the syntax of the model clause may find this of some interest.

We already have SQL (non-procedural) and PL/SQL (procedural) languages and it is easy to call SQL from PL/SQL and vice-versa. So you might be asking yourself why on earth we need the model clause. Doesn’t it seem like just a procedural extension to SQL? If so why wouldn’t we just use PL/SQL?

Well, over the last year or so I have found a number of cases where the model clause has been useful. The two key areas are:

  • Implementing analytics not available directly from an SQL function
  • Implementing calculations in parallel

Before we discuss my use cases let us create some test data:

CREATE TABLE model_test
AS
   WITH q1
        AS (    SELECT DATE '2000-01-01' + LEVEL mydate, DBMS_RANDOM.VALUE ( 90, 100) VALUE
                  FROM DUAL
                 WHERE MOD ( LEVEL, 11)  0
            CONNECT BY LEVEL <= 4000
            UNION ALL
                SELECT DATE '2000-01-01' + LEVEL mydate, DBMS_RANDOM.VALUE ( 100, 200) VALUE
                  FROM DUAL
                 WHERE MOD ( LEVEL, 11) = 0
            CONNECT BY LEVEL <= 4000)
       ,q2
        AS (    SELECT DBMS_RANDOM.string ( 'p', 10) account
                  FROM DUAL
            CONNECT BY LEVEL <= 100)
   SELECT *
     FROM q1, q2;

EXEC DBMS_STATS.gather_table_stats(USER,'MODEL_TEST');

This table contains 40,000 rows with 4,000 days’ worth of data for each of 100 accounts. The data is just a single value of some kind. The average value is approximately 100 but for each account there are 3637 closely packed values between 90 and 100 and 363 values between 100 and 200.

A genuine requirement from my investment bank client was to implement a moving median over a one year window. The first problem I ran into related to the definition of a one year window. I discussed this issue in my last post. The second issue is, of course, that the median function does not support windowing. This is a solution based on the model clause:

SELECT /*+ parallel(10) */ * FROM model_test
  MODEL
  PARTITION BY (account)
  DIMENSION BY (mydate)
  MEASURES (VALUE, 0 mov_median)
  RULES(
    mov_median[ANY] = MEDIAN(VALUE)[mydate BETWEEN ADD_MONTHS(CV()+1,-12) AND CV()]);

This is a relatively concise way to express the requirement and it doesn’t look particularly procedural.

Trying to code this in PL/SQL would be pretty tricky because of the need to constantly manipulate sets of data. Furthermore, a moving median can’t be calculated in any efficient way, which is why there is no native support. The fact that we can use parallel query slaves to reduce the amount of time taken to perform the calculations further adds to the benefit of using the model clause rather than a PL/SQL function of some kind. Each slave performs the calculations for some subset of accounts.

In another case I was asked to calculate a zscore based on semi-standard deviations. A semi-standard deviation is required when the distribution of data above the mean is substantially different from that below the mean as in my test data. Essentially we have one standard deviation for the values above the mean and a separate one for those less than or equal to the mean. Zscore is simply a term for the number of standard deviations (in this case semi) that a particular value is above or below the mean. The zscores were also to be calculated over a moving one year window.

This is much more complicated than a moving median. Firstly, we have no built-in aggregate function that we can use for semi-standard deviations. Secondly, as we shall shortly see, we need an iterative model to implement our calculations.

To help explain my approach I need to invent some terms. I will say that:

  • The target date is the date for which we need our semi-standard deviation and zscore.
  • A contributing date is a date that contributes to the semi-standard deviation. Contributing
    dates are those on the correct side of the mean that are within the window for a target date.

A particular date will be part of the window for itself and all of the target dates in the following year. However, it will only contribute to some subset because its value will only sometimes be on the same side of the mean as the value of the target date. As a consequence we need a separate iteration for each target date.

To avoid unnecessary complications I will make two assumptions:

  • The data is dense. In other words, data is present for every account for every day. This wasn’t the case in real life.
  • A window of 365 days, not a calendar year, was adequate. This was the case in real life.

This is roughly how I did it:

   WITH q1
        AS (SELECT m.*
                  ,  ROW_NUMBER ()
                        OVER (PARTITION BY account ORDER BY mydate)
                   - 1
                      rn
                  ,0 AS mov_stdd
                  ,0 AS mov_avg
                  ,0 AS temp
                  ,0 AS zscore
              FROM model_test m)
   SELECT /*+ parallel(10) */
         account
         ,mydate
         ,VALUE
         ,mov_avg
         ,mov_stdd
         ,zscore
     FROM q1
   MODEL
      PARTITION BY (account)
      DIMENSION BY (rn)
      MEASURES (mydate, VALUE, mov_avg, mov_stdd, zscore, temp)
      RULES UPDATE
         ITERATE (100000) UNTIL VALUE[ITERATION_NUMBER] IS NULL
         (mov_avg [ITERATION_NUMBER] =
               AVG (VALUE)[rn BETWEEN ITERATION_NUMBER - 364
                                  AND ITERATION_NUMBER],
         temp [rn BETWEEN ITERATION_NUMBER - 364 AND ITERATION_NUMBER] =
               CASE
                  WHEN    (    VALUE[CV ()] <= mov_avg[ITERATION_NUMBER]
                           AND VALUE[ITERATION_NUMBER] <=
                                  mov_avg[ITERATION_NUMBER])
                       OR (    VALUE[CV ()] > mov_avg[ITERATION_NUMBER]
                           AND VALUE[ITERATION_NUMBER] >
                                  mov_avg[ITERATION_NUMBER])
                  THEN
                     POWER ( VALUE[CV ()] - mov_avg[ITERATION_NUMBER], 2)
               END,
         mov_stdd [ITERATION_NUMBER] =
               SQRT (
                  AVG (temp)[rn BETWEEN ITERATION_NUMBER - 364
                                    AND ITERATION_NUMBER]),
         zscore [ITERATION_NUMBER] =
               DECODE (mov_stdd[CV ()]
                      ,0, 0
                      , (VALUE[CV ()] - mov_avg[CV ()]) / mov_stdd[CV ()]));

We have associated a number with each target date so that we could use the ITERATION_NUMBER to find our target values. Note that the measure TEMP for a particular date is typically recalculated for each of 365 target dates and holds the contribution to the semi-standard standard deviation. The variance is the average contribution and the standard deviation (or more specifically, the population standard deviation) is the square root of the variance. The zscore is then a simple calculation as you can see.

Even though iteration was required here, PL/SQL wasn’t the right tool; even the limited use of aggregate functions in the above example has substantially simplified the logic and once again we can use parallel query slaves for our computations.

October 23, 2012

Non-inclusive windows for yearly or monthly intervals

Filed under: SQL Fun — tonyhasler @ 1:01 pm

About six months ago I came across the need to use analytic functions with a moving one year window. This is supposedly supported in Oracle 10g and beyond but the behaviour was not what I expected. Consider this test query:

WITH q1
     AS (    SELECT DATE '2012-01-01' + ROWNUM mydate
               FROM DUAL
         CONNECT BY LEVEL <= 400)
    ,q2
     AS (SELECT mydate
               ,COUNT (
                   *)
                OVER (
                   ORDER BY mydate
                   RANGE BETWEEN INTERVAL '1' YEAR PRECEDING AND CURRENT ROW)
                   cnt
               ,MIN (
                   mydate)
                OVER (
                   ORDER BY mydate
                   RANGE BETWEEN INTERVAL '1' YEAR PRECEDING AND CURRENT ROW)
                   mindate
               ,MAX (
                   mydate)
                OVER (
                   ORDER BY mydate
                   RANGE BETWEEN INTERVAL '1' YEAR PRECEDING AND CURRENT ROW)
                   maxdate
           FROM q1)
SELECT *
  FROM q2
 WHERE mydate = DATE '2013-01-10';

This is the result:

MYDATE CNT MINDATE MAXDATE
10-Jan-2013
367
10-Jan-2012 10-Jan-2013

The count of days in the year ending 10th January is 367 days. Now we know we have included 29th January 2012 so we do expect 366 days but not 367! The explanation comes when we look at the minimum and maximum values in the range: we have included both the 10th January 2012 and 10th January 2013 in our range. In reality, the fact that Oracle included an extra day was not an issue on this occasion but it has been bothering me for the last six months that I couldn’t see a way to solve this problem without the use of the model clause (more of this in my next post). However, as is often the case when one muses on something like this the answer came to me in my sleep and I awoke with a start!

I may have been inspired by Stephen Hawking’s invention of virual time when I finally came up with this solution:

WITH q1
     AS (    SELECT DATE '2012-01-01' + ROWNUM mydate
               FROM DUAL
         CONNECT BY LEVEL <= 400)
    ,q2
     AS (SELECT mydate
               ,COUNT (
                   *)
                OVER (
                   ORDER BY
                        ( (mydate - DATE '1950-01-01') * (86401 / 86400))
                      + DATE '1950-01-01'
                   RANGE BETWEEN INTERVAL '1' YEAR PRECEDING AND CURRENT ROW)
                   cnt
               ,MIN (
                   mydate)
                OVER (
                   ORDER BY
                        ( (mydate - DATE '1950-01-01') * (86401 / 86400))
                      + DATE '1950-01-01'
                   RANGE BETWEEN INTERVAL '1' YEAR PRECEDING AND CURRENT ROW)
                   mindate
               ,MAX (
                   mydate)
                OVER (
                   ORDER BY
                        ( (mydate - DATE '1950-01-01') * (86401 / 86400))
                      + DATE '1950-01-01'
                   RANGE BETWEEN INTERVAL '1' YEAR PRECEDING AND CURRENT ROW)
                   maxdate
           FROM q1)
SELECT *
  FROM q2
 WHERE mydate = DATE '2013-01-10';

This window function works using a somewhat simpler concept of virtual time than that of the famous physicist: each day is translated into 1 day and one second. This is accomplished by subtracting an arbitrary date earlier than the first date in your data and no more than 236 days earlier than the last date in your data. You then multiply the resultant number of days by 86401/86400 (there are 86400 seconds in one day) and then add the arbitrary date back in again. Now, when converted into virtual time 10th January 2012 is 1 year and 366 seconds prior to 10 January 2013 rather than 1 year and so is excluded from the window. As long as your window is less than around 236 years and your real dates do not include a time of day this technique allows you to implement windows of months and/or years that exclude the start date of the window.

June 11, 2012

Ordering Non-sorting aggregate functions

Filed under: SQL Fun — tonyhasler @ 3:49 pm

Aggregate functions fall into two categories. The first is what I call “Sorting aggregate functions” that need their data sorted in some way to operate. Examples of sorting aggregate functions are MEDIAN and DENSE_RANK. The second category is non-sorting aggregate functions that do not require a sort to operate. These include COUNT, AVG, and RANK.

In Oracle 10g the concept of HASH AGGREGATION was introduced as an optimisation of the GROUP BY operation. The idea is that non-sorting aggregate functions (let me call them NSAFs from now on) do not need to be kept in a sorted list; you just keep a hash table of the various accumulating counters that are needed for that function’s operation. Since a hash table is quicker to access than a sorted list this offers some performance benefit, more so when the list of groups is large.

Hash aggregation obviously only works for NSAFs. If you include MEDIAN(X) in your select list or HAVING clauses then the input data will all need to be sorted. One big list is created, sorted first by the GROUP BY columns and then by, in this case, column X.

For some reason, there are five NSAFs that do not take advantage of hash aggregation. These are FIRST, LAST, RANK, SYS_XMLAGG and XMLAGG. I can’t imagine why not but there it is.

Even if an NSAF supports hash aggregation the CBO doesn’t always use it when the data needs to be sorted anyway, for example as a result of an ORDER BY operation.

Guy Harrison has suggested in his blog that the logic is “seriously flawed” because if you have 1,000,000 rows and 100 groups it would be best to defer the sort until after the aggregation because then you have only 100 rows to sort and not 1,000,000. He has even done some experiements to show that the CPU overhead is much less if you use the USE_HASH_AGGREGATION hint.

However, it turns out that Oracle does not sort 1,000,000 rows after all. Just the 100. Let us create a test case.

CREATE TABLE sort_test
AS
   WITH q1
        AS (    SELECT ROWNUM rn
                  FROM DUAL
            CONNECT BY LEVEL <= 100)
   SELECT MOD (a.rn, 2) grp, a.rn + b.rn + c.rn myint1, b.rn myint2
     FROM q1 a, q1 b, q1 c;

This table has 1,000,000 rows but there are only two distinct values for the GRP column. Let us set a 1K SORT_AREA_SIZE as follows:

ALTER SESSION SET workarea_size_policy=manual;
ALTER SESSION SET sort_area_size=1024;

Now let us see what the execution plan looks like for a statement that truly requires a sort of all 1,000,000 rows:

  SELECT grp, MEDIAN (myint1) myavg
    FROM sort_test
GROUP BY grp
ORDER BY grp;

--------------------------------------------------------------------------------
| Id  | Operation          | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           |   966K|    23M|   603   (9)| 00:00:08 |
|   1 |  SORT GROUP BY     |           |   966K|    23M|   603   (9)| 00:00:08 |
|   2 |   TABLE ACCESS FULL| SORT_TEST |   966K|    23M|   559   (2)| 00:00:07 |
--------------------------------------------------------------------------------

The 23M of space that we require is much greater than the 1K we have allocated and so when we run it we get a multi-pass sort that on my laptop took 51 minutes.

Let us change the function to a NSAF:

  SELECT grp, AVG (myint1) myavg
    FROM sort_test
GROUP BY grp
ORDER BY grp;

The execution plan reported by the CBO is identical. However, the query ran in under 1 second and the session statistics report a memory sort!

What is clearly happening is that only the accumulating counters are being maintained not the values from each row. I did the experiment on 10.2.0.4 and 11.2.0.1 so I don’t think it is a new feature.

Incidentally, the same mechansim is used for NSAFs that don’t support hash aggregation:

  SELECT grp, MIN (myint1) KEEP (DENSE_RANK FIRST ORDER BY myint2) myvalue
    FROM sort_test
GROUP BY grp
ORDER BY grp;

The above query also finishes very quickly.

So why was Guy able to demonstrate a performance improvement with the use of the USE_HASH_AGGREGATION hint? Well the answer is that wandering up and down a tree of sorted items can be much more expensive than performing a direct hash. Particularly as Guy had 200,000 groups. However, this benefit may be small, or even negative, if the number of groups is very small as in my artificial case.

So Guy is right about the CBO being “flawed” bacause a) the execution plan costs the operation incorrectly and b) the sorted tree is going to be expensive to walk up and down a huge number of times. However, things aren’t quite as bad as they first appear.

Pushing parameters into views a.k.a predicate pushing

Filed under: SQL Fun — tonyhasler @ 12:18 am

It is a very common and oft discussed issue. You have a complex view definition, possibly containing an analytic function, and when you SELECT from the view you can’t get any advantage from a WHERE clause. For an old, and lengthy, discussion of why Oracle “just can’t do it” you can see this article on Ask Tom. Well it turns out that there is something you can do after all. First, let us set the scene. We need a table and associated index with some values:

CREATE TABLE t1
AS
   WITH q1
        AS (    SELECT ROWNUM rn
                  FROM DUAL
            CONNECT BY LEVEL <= 100)
   SELECT MOD (a.rn, 100) grp, a.rn+b.rn fact_int, RPAD ('X', 100) vpad
     FROM q1 a, q1 b;

CREATE INDEX i1
   ON t1 (grp);

We have a table with 10,000 rows with 100 values of GRP. Now let us build our view with the awkward analytic query:

CREATE OR REPLACE VIEW v1
AS
     SELECT grp
           ,AVG (SUM (fact_int)) OVER (ORDER BY grp RANGE 2 PRECEDING) mv_avg
       FROM t1
   GROUP BY grp;

When we query this view with a specific value for GRP we get no index usage.

SELECT *
  FROM v1
 WHERE grp BETWEEN 50 AND 51;
 
-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |  9485 |   240K|    49   (3)| 00:00:01 |
|*  1 |  VIEW                | V1   |  9485 |   240K|    49   (3)| 00:00:01 |
|   2 |   WINDOW BUFFER      |      |  9485 |   240K|    49   (3)| 00:00:01 |
|   3 |    SORT GROUP BY     |      |  9485 |   240K|    49   (3)| 00:00:01 |
|   4 |     TABLE ACCESS FULL| T1   |  9485 |   240K|    48   (0)| 00:00:01 |
-----------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter("GRP">=50 AND "GRP"<=51)

Note how our predicate is applied at id 1, after all 10,000 rows have been aggregated.
Wouldn’t it be nice if somehow we could convince the optimizer to transform our query into this:

WITH q1
     AS (  SELECT grp
                 ,AVG (SUM (fact_int)) OVER (ORDER BY grp RANGE 2 PRECEDING) mv_avg
             FROM t1
            WHERE grp BETWEEN 48 AND 51
         GROUP BY grp)
SELECT *
  FROM q1
 WHERE grp BETWEEN 50 AND 51;

For the moving average calculation we actually only need to look at the rows with values within the range being aggregated. Well, it turns out that you can rewrite your view to allow this providing you have a convenient way to identify the complete set of legal values for the predicated column (grp in our case). If the table being queried is a fact table in a data warehouse you may already have a suitable dimension table you can use. In our case we will build one:

CREATE TABLE grp_table (grp PRIMARY KEY)
ORGANIZATION INDEX
AS
   SELECT DISTINCT grp FROM t1;

I have made this an IOT given that (in my example) it has only one column and we plan to use it exclusively for unique primary key lookups. A heap dimension table with several columns is fine if it is reasonably small.
We also need to define a couple of types to reflect the result of our original query:

CREATE OR REPLACE TYPE pair_num AS OBJECT (n1 NUMBER, n2 NUMBER);
CREATE OR REPLACE TYPE pair_num_t AS TABLE OF pair_num;

In our case, our desired query returns two numeric values. We define our types generically so that we can reuse them for other views with similar result sets. Now redefine our view as follows:

CREATE OR REPLACE VIEW v1 (startgrp, endgrp, grp, mv_avg)
AS
   WITH q1 AS (SELECT grp startgrp FROM grp_table)
       ,q2 AS (SELECT grp endgrp FROM grp_table)
     SELECT startgrp, endgrp, n1 AS grp, n2 AS mv_avg
     FROM q1
         ,q2
         ,TABLE (CAST (MULTISET (  SELECT grp
                                         ,AVG (SUM (fact_int)) OVER (ORDER BY grp RANGE 2 PRECEDING) mv_avg
                                     FROM t1
                                    WHERE grp BETWEEN startgrp-2 AND endgrp
                                 GROUP BY grp) AS pair_num_t)) pairs
    WHERE n1 >= startgrp;

This view now contains two extra columns STARTGRP and ENDGRP that can be used in our WHERE clause to identify the subset of rows that we want. The new query and associated execution plan are as follows:

SELECT grp,mv_avg
  FROM v1
 WHERE startgrp = 50 AND endgrp = 51;
 
---------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name              | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                   |    20 |   560 |    30   (0)| 00:00:01 |
|   1 |  NESTED LOOPS                       |                   |    20 |   560 |    30   (0)| 00:00:01 |
|   2 |   NESTED LOOPS                      |                   |     1 |    26 |     1   (0)| 00:00:01 |
|*  3 |    INDEX UNIQUE SCAN                | SYS_IOT_TOP_84132 |     1 |    13 |     1   (0)| 00:00:01 |
|*  4 |    INDEX UNIQUE SCAN                | SYS_IOT_TOP_84132 |     1 |    13 |     0   (0)| 00:00:01 |
|*  5 |   COLLECTION ITERATOR SUBQUERY FETCH|                   |    20 |    40 |    29   (0)| 00:00:01 |
|   6 |    VIEW                             |                   |    24 |   624 |     3   (0)| 00:00:01 |
|   7 |     WINDOW BUFFER                   |                   |    24 |   624 |     3   (0)| 00:00:01 |
|   8 |      SORT GROUP BY NOSORT           |                   |    24 |   624 |     3   (0)| 00:00:01 |
|*  9 |       FILTER                        |                   |       |       |            |          |
|  10 |        TABLE ACCESS BY INDEX ROWID  | T1                |    24 |   624 |     3   (0)| 00:00:01 |
|* 11 |         INDEX RANGE SCAN            | I1                |    43 |       |     2   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access("GRP"=51)
   4 - access("GRP"=50)
   5 - filter(SYS_OP_ATG(VALUE(KOKBF$),1,2,2)>=50 AND "GRP"<=SYS_OP_ATG(VALUE(KOKBF$),1,2,2))
   9 - filter(:B1-2<=:B2)
  11 - access("GRP">=:B1-2 AND "GRP"<=:B2)

Our new query has a small unnecessary overhead where we "look up" the STARTDATE and ENDDATE in our dimension table but on the other hand we have been able to use information about our predicate to restrict the rows aggregated to the point that we can take advantage of our index.
This technique of passing information from one rows source in a join to another is known as a lateral join and is available with the TABLE and XMLTABLE operators.
Note that we will get very undesirable results if a user does not specify a WHERE clause as now we will get results matching every combination of STARTDATE and ENDDATE where ENDDATE >= STARTDATE. If you want to protect against this you can write a pipelined function.

CREATE OR REPLACE VIEW v1 (startgrp, endgrp, grp, mv_avg)
AS
   WITH q1
        AS (  SELECT /*+ no_merge no_eliminate_oby */
                    grp startgrp
                FROM grp_table
            ORDER BY startgrp DESC)
       ,q2
        AS (  SELECT /*+ no_merge no_eliminate_oby */
                    grp endgrp
                FROM grp_table
            ORDER BY endgrp)
   SELECT /*+ leading(q1) use_nl(q2) */
         startgrp, endgrp, n1 AS grp, n2 AS mv_avg
     FROM q1, q2
         ,TABLE (my_package.my_pipelined_function (startgrp, endgrp)) pairs;

Note the order by clauses with associated hints will cause the pipelined function to be first invoked with a value of ENDDATE less than STARTDATE if no predicate is supplied. The package containing the piplelined function might look like this:

CREATE OR REPLACE PACKAGE my_package
AS
   FUNCTION my_pipelined_function (startgrp NUMBER, endgrp NUMBER)
      RETURN pair_num_t
      PIPELINED;
END my_package;

CREATE OR REPLACE PACKAGE BODY my_package
AS
   FUNCTION my_pipelined_function (startgrp NUMBER, endgrp NUMBER)
      RETURN pair_num_t
      PIPELINED
   IS
      CURSOR c1
      IS
         WITH q1
              AS (  SELECT grp
                          ,AVG (SUM (fact_int)) OVER (ORDER BY grp RANGE 2 PRECEDING) mv_avg
                      FROM t1
                     WHERE grp BETWEEN startgrp - 2 AND endgrp
                  GROUP BY grp)
         SELECT *
           FROM q1
          WHERE grp BETWEEN startgrp AND endgrp;
   BEGIN
      IF startgrp > endgrp
      THEN
         raise_application_error (-20000, 'STARTGRP must be specified and be less than ENDGRP');
      END IF;

      FOR r IN c1
      LOOP
         PIPE ROW (pair_num (r.grp, r.mv_avg));
      END LOOP;
   END my_pipelined_function;
END my_package;

June 10, 2012

Protected: tony-hasler-cv

Filed under: Uncategorized — tonyhasler @ 1:50 pm

This post is password protected. To view it please enter your password below:

March 27, 2012

FIRST/LAST versus FIRST_VALUE/LAST_VALUE

Filed under: SQL Fun — tonyhasler @ 2:28 pm

As I promised in my last post, I want to discuss the difference between FIRST/LAST and FIRST_VALUE/LAST_VALUE.

Previously I discussed FIRST and LAST purely as aggregate functions but they can be used as analytic functions as well. FIRST_VALUE and LAST_VALUE can only be used as an analytic functions and in that respect at least they are somewhat less powerful.

Most of us take a while to get to grasp with what analytic functions are and how they differ from aggregate functions when we first learn about them but there are many articles and discussions on the web about the general topic. Here is one you might like.

Now that you know about analytic functions in general, let us return to look at an example of FIRST using the same table as last time:

SELECT c1, MIN (c3) KEEP (DENSE_RANK FIRST ORDER BY c1) OVER () "Analytic"
  FROM t1
 WHERE c1 <= 50;
 
 QUERY 1

What this query says is: After filtering out rows where C1 is NULL or has a value greater than 50, find the row or rows that have the minimum value of C1 and amongst those rows find the minimum value of C3. This value is used for the “Analytic” column; all rows in the final result set have the same value for the “Analytic” column.

Now let us look at a similar query using FIRST_VALUE:

SELECT c1, FIRST_VALUE (c3) OVER (ORDER BY c1) "Analytic"
  FROM t1
 WHERE c1 <= 50;
 
 QUERY 2
 

What this query says is: After filtering out rows where C1 is NULL or has a value greater than 50, find the row or rows that have the minimum value of C1 and amongst those rows pick an arbitrary row and then select the value of C3. This value is used for the “Analytic” column; all rows in the final result set have the same value for the “Analytic” column.

Let us first assume that the expression used in the ORDER BY clause is unique (as recommended in the documentation of FIRST_VALUE). In this case, the results of QUERY 2 are deterministic and absolutely the same as QUERY 1. Of course, in this case the aggregate function used in conjunction with FIRST could be MIN, MAX, or even AVG. The result is the same as there is only one row from which to pick.

However, if the ORDER BY CLAUSE does not represent a unique key then I would suggest you use the FIRST function in preference to FIRST_VALUE as it provides a way to explicitly deal with any duplicates.

The FIRST function may also provide better performance. Here is the execution plan for QUERY 1:

Here is the execution plan:

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    50 |  1300 |     6   (0)| 00:00:01 |
|   1 |  WINDOW BUFFER     |      |    50 |  1300 |     6   (0)| 00:00:01 |
|*  2 |   TABLE ACCESS FULL| T1   |    50 |  1300 |     6   (0)| 00:00:01 |
---------------------------------------------------------------------------

As you can see, we don’t need a sort as explained in my last post but we do need to buffer the rows as we don’t know what to put into the “Analytic” column for any row until we process the last row in our result set.

Now let us look at the execution plan for QUERY 2:

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    50 |  1300 |     7  (15)| 00:00:01 |
|   1 |  WINDOW SORT       |      |    50 |  1300 |     7  (15)| 00:00:01 |
|*  2 |   TABLE ACCESS FULL| T1   |    50 |  1300 |     6   (0)| 00:00:01 |
---------------------------------------------------------------------------

As you can see, with FIRST_VALUE we have a superfluous sort and as a consequence a higher cost.

You can’t always avoid a sort with FIRST. Consider this example:

  SELECT c1
        ,MIN (c3) KEEP (DENSE_RANK FIRST ORDER BY c1) OVER (PARTITION BY c2)
            "Analytic"
    FROM t1
   WHERE c1 <= 50
ORDER BY c2, c1 DESC;
 
 QUERY 3
 

This query is similar to QUERY 1 but this time there is a seperate value of "Analytic" for each value of C2. This PARTITION clause means we have to sort the data by C2. However, the ORDER BY clause requires that at some stage we need to order the rows by "C2, C1 DESC". This single sort is sufficient for both purposes as we can see in the execution plan:

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    50 |  1950 |     6   (0)| 00:00:01 |
|   1 |  WINDOW SORT       |      |    50 |  1950 |     6   (0)| 00:00:01 |
|*  2 |   TABLE ACCESS FULL| T1   |    50 |  1950 |     6   (0)| 00:00:01 |
---------------------------------------------------------------------------

Let us look at the equivalent statement using FIRST_VALUE:

SELECT c1, FIRST_VALUE (c3) OVER (PARTITION BY c2 ORDER BY c1) "Analytic"
    FROM t1
   WHERE c1 <= 50
ORDER BY c2, c1 DESC;

QUERY 4


Here is the execution plan for QUERY 4:

----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |    50 |  1950 |     8  (25)| 00:00:01 |
|   1 |  SORT ORDER BY      |      |    50 |  1950 |     8  (25)| 00:00:01 |
|   2 |   WINDOW SORT       |      |    50 |  1950 |     8  (25)| 00:00:01 |
|*  3 |    TABLE ACCESS FULL| T1   |    50 |  1950 |     6   (0)| 00:00:01 |
----------------------------------------------------------------------------

Now we see an extra sort. The reason is that the FIRST_VALUE function requires the data sorted by the composite sort key "c2,c1" and so needs to be reordered by "c2,c1 DESC" for the final result.

Of course, all the above observations can be applied to LAST and LAST_VALUE.

So, is there ever a need for FIRST_VALUE and LAST_VALUE? Well consider this query:

SELECT c1
      ,FIRST_VALUE (
          c3)
       OVER (PARTITION BY c2
             ORDER BY c1
             ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
          "Analytic"
  FROM t1
 WHERE c1 <= 50;
 
QUERY 5

This time FIRST_VALUE is being used almost like LAG(C3,3) in that it is picking up the value of C3 "three rows up" so to speak. The difference is that LAG(c3,3) returns NULL for the first three rows and FIRST_VALUE doesn't. There isn't a way to code this expression using FIRST because FIRST doesn't support either an ORDER BY clause or a windowing clause.

Whilst we are on the topic of windowing clauses, the implicit and unchangeable window clause for the FIRST and LAST functions is ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, in other words all rows in our partition. For FIRST_VALUE and LAST_VALUE the default but changeable windowing clause is ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, in other words we exclude rows after the current one. Dropping rows off the bottom of a list makes no difference when we are looking for the first row in the list (FIRST_VALUE) but it does make a difference when we are looking for the last row in the list (LAST_VALUE) so you will usually need either to specify ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING explicitly when using LAST_VALUE or just use FIRST_VALUE and reverse the sort order.

In conclusion, I would recommend that you only ever use FIRST_VALUE or LAST_VALUE when you need a windowing function other than ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. When FIRST or LAST are available the performance is never worse and frequently better than the equivalent FIRST_VALUE or LAST_VALUE construct. In addition, FIRST and LAST avoid the possibility of non-deterministic results when non-unique ORDER BY clauses are supplied.

March 26, 2012

Aggregate Functions and Sorts

Filed under: SQL Fun,Uncategorized — tonyhasler @ 3:46 pm

I have been prompted to write one of my rare blogs because I have just discovered that the topic of aggregate functions is not quite as straightforward as I previously thought.

Before we start, I want you to look long and hard at the following query and tell me (figuratively speaking) how many sorts are involved in it:

SELECT AVG (c1)
      ,MIN (c1)
      ,MAX (c1)
      ,STDDEV (c2)
      ,RANK (2) WITHIN GROUP (ORDER BY c1 DESC)
      ,STDDEV (c1) KEEP (DENSE_RANK FIRST ORDER BY c2)
      ,VARIANCE (c2) KEEP (DENSE_RANK LAST ORDER BY c3)
  FROM t1;

How are you doing? Let us have a look at the execution plan and see if that will help you?

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |    39 |     6   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |    39 |            |          |
|   2 |   TABLE ACCESS FULL| T1   |  6000 |   228K|     6   (0)| 00:00:01 |
---------------------------------------------------------------------------

Has that changed your opinion?

So the correct answer is……zero!

If you are in any doubt, you can create the table I used for the query and try it yourself:

CREATE TABLE t1
AS
       SELECT ROWNUM c1, MOD (ROWNUM, 2) c2, MOD (ROWNUM, 3) c3
         FROM DUAL
   CONNECT BY LEVEL <= 6000;

You can use the following query before and after the query under test to check the sort statistics:

SELECT name, VALUE
  FROM v$mystat NATURAL JOIN v$statname
 WHERE name LIKE 'sort%';

WARNING: Do not use autotrace when checking the stats. as autotrace can perform a sort.

Let us see how it is possible for this statement to complete without any sorts.

The AVG function calculates the mean of a set of values. AVG(c1) can be calculated as the rows are scanned by simply keeping a running total of c1 and a count of the number of rows where c1 is not null and then performing a division at the end. No sort is required. It should be clear that no sort is required for MIN or MAX either. It may be less clear (until you Google it) that STDDEV and VARIANCE aggregations can be performed in a single pass with no sort. In my query, the RANK function simply counts the number of rows that have a value greater than or equal to 2.

This is all very straightforward but what about the last two calculations? Those ORDER BY clauses surely imply some kind of sort don’t they?

Well no. In my query, the FIRST function returns the standard deviation of all values of c1 amongst the rows that have the lowest value of c2.

In general, the FIRST function almost certainly operates by keeping track of the lowest value of c2 encountered so far and on encountering a new row does the following:

  1. If the value of c2 is greater than the stored minimum value of c2, ignore the row for the purposes of this aggregate function.
  2. If the value of c2 equals the minimum encountered so far then perform the aggregation calculations on c1 defined on the left as if no FIRST function was specified.
  3. If the value of c2 is less than the minimum stored so far, replace the stored minimum, delete any stored aggregate values of c1 calculated so far and then process the row as per step 2.

Now, so far I have been very careful in my selection of aggregate functions. Some aggregate functions do require a sort. Let us make a tiny change to the select statement:

SELECT AVG (c1)
      ,MIN (c1)
      ,MAX (c1)
      ,STDDEV (c2)
      ,DENSE_RANK (2) WITHIN GROUP (ORDER BY c1 DESC)
      ,STDDEV (c1) KEEP (DENSE_RANK FIRST ORDER BY c2)
      ,VARIANCE (c2) KEEP (DENSE_RANK LAST ORDER BY c3)
  FROM t1;

The only difference is that I have replaced the RANK function call with a DENSE_RANK function call. The DENSE_RANK function requires a sort as confirmed by the statistics. This is the new execution plan:

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |    39 |     6   (0)| 00:00:01 |
|   1 |  SORT GROUP BY     |      |     1 |    39 |            |          |
|   2 |   TABLE ACCESS FULL| T1   |  6000 |   228K|     6   (0)| 00:00:01 |
---------------------------------------------------------------------------

Note that the SORT AGGREGATE operation has been replaced by a SORT GROUP BY. So we have discovered something interesting: SORT AGGREGATE does not perform a sort! Ever!

What happens if we need more than one sort:

SELECT DENSE_RANK (3000) WITHIN GROUP (ORDER BY c1 DESC), MEDIAN (c2) FROM t1;

The execution plan remains the same and the statistic “sorts (memory)” seems to indicate that there has been only one sort but this is impossible. In fact, if you look at “sorts (rows)” you will see that it has been incremented by 9000. 6000 of these rows are for the MEDIAN function but DENSE_RANK only needs to sort the 3000 rows where c1 has a value of less than or equal to 3000. So SORT GROUP BY just counts one sort in the statistics, no matter how many are actually performed.

Let us introduce a GROUP BY clause:

SELECT DENSE_RANK (3000) WITHIN GROUP (ORDER BY c1 DESC) from t1 GROUP BY c2;

The execution plan remains unchanged and the statistics once again show only one sort. In this case, however, there may be several sorts: the first sort is by c2, the others will be sorts to evaluate DENSE_RANK within each group. The “sorts (rows)” statistic seems to support this theory.

I think that is almost enough for this blog. I will post another soon on the difference between the FIRST and LAST functions and the FIRST_VALUE and LAST_VALUE functions. But let me close this note with a comment about the HASH GROUP BY operation introduced in Oracle 10g.

HASH GROUP BY doesn’t perform a sort (as you might expect) and so can only be used for (sub)queries that do not involve aggregates that require sorts. Consider this query:

SELECT MAX(c1) FROM t1 GROUP BY c2;

Here is the execution plan:

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |  6000 |   152K|     7  (15)| 00:00:01 |
|   1 |  HASH GROUP BY     |      |  6000 |   152K|     7  (15)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| T1   |  6000 |   152K|     6   (0)| 00:00:01 |
---------------------------------------------------------------------------

In this case we simply keep an array of maximums (one for each distinct value of c2) as the table is scanned.

December 26, 2011

FORCE_MATCH for Stored Outlines and/or SQL Baselines????? – follow up

Filed under: Uncategorized — tonyhasler @ 1:45 pm

As has happened a couple of times recently, comments on a blog of mine have helped me understand the subject matter on which I am commenting far better.

In this case a dialogue with Dom Brooks on my previous blog on how to simulate a stored outline/SQL baseline with FORCE_MATCH using SQL profiles has provided me with a simplified way of approaching the creation of said profiles.

The observation is that there is a column OTHER_XML that appears in a number of tables and views from which you can directly copy the outline hints necessary to fix the execution plan of a statement – and use FORCE_MATCH to boot.

It seems that we are not the first to experiment with this technique. See for example this blog by Kerry Osborne published a while ago but I think these posts have slightly different emphasis.

The OTHER_XML column appears in PLAN_TABLE/SYS.PLAN_TABLE$, V$SQL_PLAN, and DBA_HIST_SQL_PLAN. The outline hints from these views are often displayed using the OUTLINE or ADVANCED formatting options to DBMS_XPLAN.DISPLAY, DBMS_XPLAN.DISPLAY_CURSOR, or DBMS_XPLAN.DISPLAY_AWR table functions respectively.

There is also an OTHER_XML column in DBA_SQLTUNE_PLANS that provides an alternative way to accept a SQL profile. Let me demonstrate. Let us setup some tables first:


-- 
-- Setup tables for test and gather stats 
-- 
DROP TABLE t1; 
DROP TABLE t2; 
DROP TABLE t3; 

BEGIN 
   FOR r IN (SELECT name FROM dba_sql_profiles) 
   LOOP 
      DBMS_SQLTUNE.drop_sql_profile (r.name); 
   END LOOP; 
END; 
/ 

CREATE TABLE t1 
AS 
       SELECT ROWNUM * 2 c1 
         FROM DUAL 
   CONNECT BY LEVEL <= 100000; 

CREATE TABLE t2 
AS 
       SELECT ROWNUM * 2 - 1 c1 
         FROM DUAL 
   CONNECT BY LEVEL <= 50000 
   UNION ALL 
   SELECT 2 FROM DUAL; 

CREATE TABLE t3 
AS 
       SELECT ROWNUM * 2 c1 
         FROM DUAL 
   CONNECT BY LEVEL <= 200000; 

EXEC dbms_stats.gather_table_stats(user,'T1'); 
EXEC dbms_stats.gather_table_stats(user,'T2'); 
EXEC dbms_stats.gather_table_stats(user,'T3'); 

CREATE INDEX t3_i1 
   ON t3 (c1); 

These tables are setup in way to deliberately confuse the CBO. Now, let us run the SQL Tuning advisor.

 
BEGIN 
   DBMS_SQLTUNE.drop_sql_profile (name => 'Tony''s SQLTUNE profile'); 
EXCEPTION 
   WHEN OTHERS 
   THEN 
      NULL; 
END; 
/ 

VARIABLE task_name VARCHAR2(20) 

BEGIN 
   :task_name := 
      DBMS_SQLTUNE.create_tuning_task ( 
         sql_text => ' 
SELECT * 
        FROM t1, t2, t3 
       WHERE t1.c1 = t2.c1 AND t1.c1 = t3.c1 
 '); 
   DBMS_SQLTUNE.execute_tuning_task (:task_name); 
END; 
/ 

SET LONG 100000 LONGC 100000 PAGES 10000 

SELECT DBMS_SQLTUNE.report_tuning_task (:task_name) FROM DUAL 
/ 

-- 
-- This is the standard way to accept a profile (with FORCE_MATCHING) 
-- 

--BEGIN 
--   DBMS_SQLTUNE.accept_sql_profile (task_name => :task_name, REPLACE => TRUE, name => 'Tony''s SQLTUNE profile', force_match => TRUE);

--END; 
--/ 

--EXPLAIN PLAN 
--   FOR 
--      SELECT * 
--        FROM t1, t2, t3 
--       WHERE t1.c1 = t2.c1 AND t1.c1 = t3.c1; 

--SELECT * FROM TABLE (DBMS_XPLAN.display); 

The hints stored in the SQL profile by the commented-out method are not guaranteed to (and are not intended to) fix the SQL profile in quite the same way as a stored outline or SQL baseline. By extracting the outline hints from the advisor task, however, we can fix the plan with the same level of certainty as a SQL baseline or stored outline.

 

DECLARE 
   v_sqlprof_xml   CLOB; 
BEGIN 
   SELECT REGEXP_SUBSTR (other_xml, '<outline_data>.*</outline_data>') 
     INTO v_sqlprof_xml 
     FROM dba_sqltune_plans JOIN dba_advisor_tasks USING (task_id) 
    WHERE     attribute = 'Using SQL profile' 
          AND other_xml IS NOT NULL 
          AND task_name = :task_name; 

   DBMS_SQLTUNE.import_sql_profile (name => 'Tony''s SQLTUNE profile' 
                                    , description => 'Example profile from Tony Hasler''s blog' 
                                    , category => 'DEFAULT' 
                                    , sql_text => q'[ 
SELECT * 
        FROM t1, t2, t3 
       WHERE t1.c1 = t2.c1 AND t1.c1 = t3.c1 
]' 
                                    , REPLACE => TRUE, 
                                    force_match => TRUE, 
                                    profile_xml => v_sqlprof_xml); 
END; 
/ 

EXPLAIN PLAN 
   FOR 
      SELECT * 
        FROM t1, t2, t3 
       WHERE t1.c1 = t2.c1 AND t1.c1 = t3.c1; 

SELECT * FROM TABLE (DBMS_XPLAN.display); 

Note that REGEXP_SUBSTR seems adequate for extracting the outline hints from the OTHER_XML column.

December 16, 2011

FORCE_MATCH for Stored Outlines and/or SQL Baselines?????

Filed under: Uncategorized — tonyhasler @ 2:11 pm

Note that there is a follow up to this post here that you should read after this post.

Stored outlines were introduced in Oracle 9i as a way of helping stabilise execution plans.  In 11gR1 these are deprecated in favour of enterprise-edition-only SQL Baselines (sorry standard edition users )-:) but as of 11gR2 neither facility has the capability built into SQL profiles for matching a plan to any statement that matches text when literal values and whitespace are ignored.  I heard this raised as an enhancement request at last week’s Uk Oracle User Group conference.

By a strange coincidence, I was faced with this precise problem this week.  The issue arose with a large application that was very time consuming and expensive to regression test.  It involved a dynamically generated SQL statement against partitioned tables that avoided the use of bind variables for literals, including a couple of dates.  Apart from the literal values, the query was identical from call to call.

When historic data was queried, the execution plan was fine.  However, when the current day was queried the plan was often very poor.  Neither gathering statistics during the day nor copying partition statistics from one day to the next helped.

My long term plan is to identify why the statistics copying isn’t working well and either to customise the copy process or to abandon the use of partition level statistics altogether.  But that is root cause analysis.  What about a recovery plan for my client?

The funny thing is that SQL baselines, stored outlines, and SQL profiles all store hints.  Normally, the types of hints that are stored in a SQL profile are very different from those stored in a SQL baseline or in a stored outline but they don’t have to be.

Let me demonstrate with a two table join for Christmas Day:

CREATE TABLE t1 (d1, c1 DEFAULT 'x') 
PARTITION BY RANGE 
   (d1) 
   ( 
      PARTITION p1 VALUES LESS THAN (DATE '2012-01-01'), 
      PARTITION pdefault VALUES LESS THAN (maxvalue)) 
AS 
       SELECT DATE '2011-12-25', RPAD ('x', 2000) 
         FROM DUAL 
   CONNECT BY LEVEL <= 10000; 

CREATE TABLE t2 
AS 
       SELECT DATE '2011-12-25' d1 
         FROM DUAL 
   CONNECT BY LEVEL <= 100; 

EXEC dbms_stats.gather_table_stats(ownname=>user,tabname=>'T1',no_invalidate=>false); 
EXEC dbms_stats.gather_table_stats(ownname=>user,tabname=>'T2',no_invalidate=>false); 

EXPLAIN PLAN 
   FOR 
      SELECT * 
        FROM t1 NATURAL JOIN t2 
       WHERE d1 = DATE '2011-12-25'; 

SET PAGES 0 LINES 132 

SELECT * FROM TABLE (DBMS_XPLAN.display (NULL, NULL, 'outline')); 

This is the output I get on my 11.2.0.1 test database:

Plan hash value: 937381588             

----------------------------------------------------------------------------------------------------
| Id  | Operation                | Name    | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
---------------------------------------------------------------------------------------------------- 
|   0 | SELECT STATEMENT         |         |  1000K|  1923M|   932   (1)| 00:00:12 |       |       | 
|   1 |  HASH JOIN               |         |  1000K|  1923M|   932   (1)| 00:00:12 |       |       | 
|   2 |   PART JOIN FILTER CREATE| :BF0000 |   100 |   800 |     3   (0)| 00:00:01 |       |       |
|   3 |    TABLE ACCESS FULL     | T2      |   100 |   800 |     3   (0)| 00:00:01 |       |       |
|   4 |   PARTITION RANGE SINGLE |         | 10000 |    19M|   923   (1)| 00:00:12 |KEY(AP)|KEY(AP)| 
|   5 |    TABLE ACCESS FULL     | T1      | 10000 |    19M|   923   (1)| 00:00:12 |     1 |     1 | 
---------------------------------------------------------------------------------------------------- 

Outline Data                           
-------------                          

  /*+   
      BEGIN_OUTLINE_DATA               
      USE_HASH(@"SEL$58A6D7F6" "T1"@"SEL$1")                          
      LEADING(@"SEL$58A6D7F6" "T2"@"SEL$1" "T1"@"SEL$1")              
      FULL(@"SEL$58A6D7F6" "T1"@"SEL$1")                              
      FULL(@"SEL$58A6D7F6" "T2"@"SEL$1")                             
      OUTLINE(@"SEL$1")                
      OUTLINE(@"SEL$2")               
      MERGE(@"SEL$1")                  
      OUTLINE_LEAF(@"SEL$58A6D7F6")    
      ALL_ROWS                         
      DB_VERSION('11.2.0.1')           
      OPTIMIZER_FEATURES_ENABLE('11.2.0.1')                           
      IGNORE_OPTIM_EMBEDDED_HINTS      
      END_OUTLINE_DATA                 
  */

Now what if I re-gather statistics and issue the same statement for New Years day?

EXEC dbms_stats.gather_table_stats(ownname=>user,tabname=>'T1',no_invalidate=>false); 
EXEC dbms_stats.gather_table_stats(ownname=>user,tabname=>'T2',no_invalidate=>false); 
EXPLAIN PLAN 
   FOR 
      SELECT * 
        FROM t1 NATURAL JOIN t2 
       WHERE d1 = DATE '2012-01-01'; 

SELECT * FROM TABLE (DBMS_XPLAN.display (NULL, NULL, 'outline -predicate'));

This is the plan for that date:

Plan hash value: 2331183230                                 
------------------------------------------------------------------------------------------------
| Id  | Operation               | Name | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |      |     1 |  1019 |     6  (17)| 00:00:01 |       |       |
|   1 |  HASH JOIN              |      |     1 |  1019 |     6  (17)| 00:00:01 |       |       |
|   2 |   PARTITION RANGE SINGLE|      |     1 |  1011 |     2   (0)| 00:00:01 |     2 |     2 |
|   3 |    TABLE ACCESS FULL    | T1   |     1 |  1011 |     2   (0)| 00:00:01 |     2 |     2 |
|   4 |   TABLE ACCESS FULL     | T2   |     1 |     8 |     3   (0)| 00:00:01 |       |       |
------------------------------------------------------------------------------------------------
Outline Data            
-------------           
  /*+                   
      BEGIN_OUTLINE_DATA
      USE_HASH(@"SEL$58A6D7F6" "T2"@"SEL$1")                
      LEADING(@"SEL$58A6D7F6" "T1"@"SEL$1" "T2"@"SEL$1")    
      FULL(@"SEL$58A6D7F6" "T2"@"SEL$1")                    
      FULL(@"SEL$58A6D7F6" "T1"@"SEL$1")                    
      OUTLINE(@"SEL$1") 
      OUTLINE(@"SEL$2") 
      MERGE(@"SEL$1")   
      OUTLINE_LEAF(@"SEL$58A6D7F6")                         
      ALL_ROWS          
      DB_VERSION('11.2.0.1')                                
      OPTIMIZER_FEATURES_ENABLE('11.2.0.1')                 
      IGNORE_OPTIM_EMBEDDED_HINTS                           
      END_OUTLINE_DATA  
  */

Now what if I know that the first plan is desirable and the second one isn’t? Well, I can load the outline hints associated with the good plan into a SQL profile:

BEGIN 
   DBMS_SQLTUNE.import_sql_profile ( 
      name          => 'BLOG profile', 
      description   => 'Example profile from Tony Hasler''s blog', 
      category      => 'DEFAULT', 
      sql_text      => q'[ 
 SELECT * 
        FROM t1 NATURAL JOIN t2 
       WHERE d1 = DATE '2011-12-25' 
   ]', 
      REPLACE       => TRUE, 
      force_match   => TRUE, 
      profile       => sqlprof_attr('BEGIN_OUTLINE_DATA', 
'USE_HASH(@"SEL$58A6D7F6" "T1"@"SEL$1")', 
'LEADING(@"SEL$58A6D7F6" "T2"@"SEL$1" "T1"@"SEL$1")', 
'FULL(@"SEL$58A6D7F6" "T1"@"SEL$1")', 
'FULL(@"SEL$58A6D7F6" "T2"@"SEL$1")', 
'OUTLINE(@"SEL$1")', 
'OUTLINE(@"SEL$2")', 
'MERGE(@"SEL$1")', 
'OUTLINE_LEAF(@"SEL$58A6D7F6")', 
'ALL_ROWS', 
'DB_VERSION(''11.2.0.1'')', 
'OPTIMIZER_FEATURES_ENABLE(''11.2.0.1'')', 
'IGNORE_OPTIM_EMBEDDED_HINTS', 
'END_OUTLINE_DATA')); 
END;

Careful editing is required: each hint has to be an element in the SQLPROF_ATTR array so be careful with multi-line hints. If you have a very long hint you may have to use the overloaded variant of this undocumented routine that allows you to specify XML rather than an array (profile_xml rather than profile). In my case, I was able to simply replace a long list of columns in an index hint with an index name. Also note the need to deal with sinqle quotes inside the strings (as in OPTIMIZER_FEATURES_ENABLE).

Note the important FORCE_MATCH parameter.

With this profile in place, we get the same plan for both Christmas Day and New Years Day.

You may wonder what happens if the SQL text already contains hints? Well, you have to include these embedded hints in the SQL text when you create the profile but the BEGIN_OUTLINE_DATA, IGNORE_EMBEDDED_HINTS, and END_OUTLINE_DATA hints in the SQL profile combine to allow the embedded hints to be overridden.

August 24, 2011

ACID and TPC-C

Filed under: ACID — tonyhasler @ 7:53 pm

I saw Tom Kyte yesterday and he stated that in his opinion the ACID issue had no bearing on TPC-C tests.  Well, I beg to differ.  The current specification can be viewed in full here but here are the three key parts of the specification.

First from section 3.4 on isolation requirements:

P1 (“Dirty Read”): Database transaction T1 modifies a data element. Database transaction T2 then reads that data element before T1 performs a COMMIT. If T1 were to perform a ROLLBACK, T2 will have read a value that was never committed and that may thus be considered to have never existed.

A little ater on it says that dirty reads are not allowed, as you might expect.

Isolation Level P0 P1 P2 P3
0 Not Possible Possible Possible Possible
1 Not Possible Not Possible Possible Possible
2 Not Possible Not Possible Not Possible Possible
3 Not Possible Not Possible Not Possible Not Possible

skipping ahead a bit…

2. {Ti, Tn}1 ≤ i ≤ 4 P0, P1, P2 Ti Level 2 isolation for New-Order, Payment, Delivery, and Order-Status transactions relative to any arbitrary transaction.

Now, in section 3.5.2 we have the committed property definition:

A transaction is considered committed when the transaction manager component of the system has either written the log or written the data for the committed updates associated with the transaction to a durable medium.

Therefore, Oracle violates the isolation property because data not yet committed has been made visible to other transactions and “Dirty Reads” are not prevented.  Specifically, the read-only order status transaction may see changes made by the other read-write transactions listed above.

Next Page »

Theme: Rubric. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.