Tony’s Oracle Tips

October 24, 2012

Model clause use cases

Filed under: SQL Fun — tonyhasler @ 9:22 pm

This note describes a couple of real-life use cases for the SQL MODEL (or SPREADSHEET) clause that appeared in Oracle 10g. It is mainly targeted at people attending my UKOUG presentation on the model clause (materials TBS), but anybody who knows the syntax of the model clause may find it of some interest.

We already have SQL (non-procedural) and PL/SQL (procedural) languages, and it is easy to call SQL from PL/SQL and vice versa. So you might be asking yourself why on earth we need the model clause. Doesn’t it seem like just a procedural extension to SQL? If so, why wouldn’t we just use PL/SQL?

Well, over the last year or so I have found a number of cases where the model clause has been useful. The two key areas are:

  • Implementing analytics not available directly from an SQL function
  • Implementing calculations in parallel

Before we discuss my use cases let us create some test data:

CREATE TABLE model_test
AS
   WITH q1
        AS (    SELECT DATE '2000-01-01' + LEVEL mydate, DBMS_RANDOM.VALUE ( 90, 100) VALUE
                  FROM DUAL
                 WHERE MOD ( LEVEL, 11) <> 0
            CONNECT BY LEVEL <= 4000
            UNION ALL
                SELECT DATE '2000-01-01' + LEVEL mydate, DBMS_RANDOM.VALUE ( 100, 200) VALUE
                  FROM DUAL
                 WHERE MOD ( LEVEL, 11) = 0
            CONNECT BY LEVEL <= 4000)
       ,q2
        AS (    SELECT DBMS_RANDOM.string ( 'p', 10) account
                  FROM DUAL
            CONNECT BY LEVEL <= 100)
   SELECT *
     FROM q1, q2;

EXEC DBMS_STATS.gather_table_stats(USER,'MODEL_TEST');

This table contains 400,000 rows: 4,000 days’ worth of data for each of 100 accounts. The data is just a single value of some kind. The average value is approximately 100, but for each account there are 3,637 closely packed values between 90 and 100 and 363 values between 100 and 200.

A genuine requirement from my investment bank client was to implement a moving median over a one-year window. The first problem I ran into related to the definition of a one-year window, an issue I discussed in my last post. The second issue is, of course, that the MEDIAN function does not support windowing. Here is a solution based on the model clause:

SELECT /*+ parallel(10) */ * FROM model_test
  MODEL
  PARTITION BY (account)
  DIMENSION BY (mydate)
  MEASURES (VALUE, 0 mov_median)
  RULES(
    mov_median[ANY] = MEDIAN(VALUE)[mydate BETWEEN ADD_MONTHS(CV()+1,-12) AND CV()]);

This is a relatively concise way to express the requirement and it doesn’t look particularly procedural.

Trying to code this in PL/SQL would be pretty tricky because of the need to constantly manipulate sets of data. Furthermore, a moving median can’t be calculated incrementally as the window slides (each window needs its own sort), which is presumably why there is no native windowing support. The fact that we can use parallel query slaves to reduce the amount of time taken to perform the calculations further adds to the benefit of using the model clause rather than a PL/SQL function of some kind. Each slave performs the calculations for some subset of accounts.

In another case I was asked to calculate a zscore based on semi-standard deviations. A semi-standard deviation is required when the distribution of data above the mean is substantially different from that below the mean as in my test data. Essentially we have one standard deviation for the values above the mean and a separate one for those less than or equal to the mean. Zscore is simply a term for the number of standard deviations (in this case semi) that a particular value is above or below the mean. The zscores were also to be calculated over a moving one year window.
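Before tackling the moving version, the static (whole-history) calculation can be sketched in plain SQL. This is my own illustration, not code from the original system, and the `:acct` bind variable is hypothetical:

```sql
-- Sketch: static (non-moving) semi-standard deviations for a single account.
-- CASE without ELSE yields NULL for rows on the other side of the mean, and
-- AVG ignores NULLs, so each average covers only the contributing rows.
SELECT SQRT (AVG (CASE WHEN VALUE >  avg_v THEN POWER (VALUE - avg_v, 2) END))
          upper_semi_stddev
      ,SQRT (AVG (CASE WHEN VALUE <= avg_v THEN POWER (VALUE - avg_v, 2) END))
          lower_semi_stddev
  FROM (SELECT VALUE, AVG (VALUE) OVER () avg_v
          FROM model_test
         WHERE account = :acct);
```

The moving version is harder precisely because `avg_v`, and therefore the set of contributing rows, changes with every target date.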

This is much more complicated than a moving median. Firstly, we have no built-in aggregate function that we can use for semi-standard deviations. Secondly, as we shall shortly see, we need an iterative model to implement our calculations.

To help explain my approach I need to invent some terms. I will say that:

  • The target date is the date for which we need our semi-standard deviation and zscore.
  • A contributing date is a date that contributes to the semi-standard deviation. Contributing
    dates are those on the correct side of the mean that are within the window for a target date.

A particular date will be part of the window for itself and all of the target dates in the following year. However, it will only contribute to some subset because its value will only sometimes be on the same side of the mean as the value of the target date. As a consequence we need a separate iteration for each target date.

To avoid unnecessary complications I will make two assumptions:

  • The data is dense. In other words, data is present for every account for every day. This wasn’t the case in real life.
  • A window of 365 days, not a calendar year, was adequate. This was the case in real life.

This is roughly how I did it:

   WITH q1
        AS (SELECT m.*
                  ,  ROW_NUMBER ()
                        OVER (PARTITION BY account ORDER BY mydate)
                   - 1
                      rn
                  ,0 AS mov_stdd
                  ,0 AS mov_avg
                  ,0 AS temp
                  ,0 AS zscore
              FROM model_test m)
   SELECT /*+ parallel(10) */
         account
         ,mydate
         ,VALUE
         ,mov_avg
         ,mov_stdd
         ,zscore
     FROM q1
   MODEL
      PARTITION BY (account)
      DIMENSION BY (rn)
      MEASURES (mydate, VALUE, mov_avg, mov_stdd, zscore, temp)
      RULES UPDATE
         ITERATE (100000) UNTIL VALUE[ITERATION_NUMBER] IS NULL
         (mov_avg [ITERATION_NUMBER] =
               AVG (VALUE)[rn BETWEEN ITERATION_NUMBER - 364
                                  AND ITERATION_NUMBER],
         temp [rn BETWEEN ITERATION_NUMBER - 364 AND ITERATION_NUMBER] =
               CASE
                  WHEN    (    VALUE[CV ()] <= mov_avg[ITERATION_NUMBER]
                           AND VALUE[ITERATION_NUMBER] <=
                                  mov_avg[ITERATION_NUMBER])
                       OR (    VALUE[CV ()] > mov_avg[ITERATION_NUMBER]
                           AND VALUE[ITERATION_NUMBER] >
                                  mov_avg[ITERATION_NUMBER])
                  THEN
                     POWER ( VALUE[CV ()] - mov_avg[ITERATION_NUMBER], 2)
               END,
         mov_stdd [ITERATION_NUMBER] =
               SQRT (
                  AVG (temp)[rn BETWEEN ITERATION_NUMBER - 364
                                    AND ITERATION_NUMBER]),
         zscore [ITERATION_NUMBER] =
               DECODE (mov_stdd[CV ()]
                      ,0, 0
                      , (VALUE[CV ()] - mov_avg[CV ()]) / mov_stdd[CV ()]));

We have associated a number with each target date so that we can use ITERATION_NUMBER to find our target values. Note that the measure TEMP for a particular date is typically recalculated for each of 365 target dates and holds that date’s contribution to the semi-standard deviation. The variance is the average contribution and the standard deviation (or, more specifically, the population standard deviation) is the square root of the variance. The zscore is then a simple calculation, as you can see.
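As a sanity check, the same arithmetic can be run over a handful of literal values. This is my own toy example: the window of 365 dates is replaced by five numbers whose mean is 4, and the target value 8 lies above the mean:

```sql
-- Values 6 and 8 contribute (both above the mean of 4): the squared
-- deviations are 4 and 16, the semi-variance is their average (10), the
-- semi-stddev is SQRT(10), and the zscore of the target value 8 is
-- (8 - 4) / SQRT(10), i.e. roughly 1.26.
WITH w AS (SELECT COLUMN_VALUE v
             FROM TABLE (SYS.ODCINUMBERLIST (1, 2, 3, 6, 8)))
    ,s AS (SELECT v, AVG (v) OVER () mov_avg FROM w)
SELECT SQRT (AVG (POWER (v - mov_avg, 2)))                       mov_stdd
      ,(8 - MAX (mov_avg)) / SQRT (AVG (POWER (v - mov_avg, 2))) zscore
  FROM s
 WHERE v > mov_avg;
```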

Even though iteration was required here, PL/SQL wasn’t the right tool; even the limited use of aggregate functions in the above example has substantially simplified the logic and once again we can use parallel query slaves for our computations.

Update in reply to Dushyant

Dushyant has posted an interesting question in his comment. He wants to know how to implement a moving RANK function in a similar way to the moving MEDIAN function above. This was my first attempt:

WITH a
     AS (SELECT 'a' sector, TRUNC (SYSDATE) dt, 64 v FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 1 dt, 2 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 2 dt, 4 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 3 dt, 128 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 4 dt, 8 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 5 dt, 16 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 6 dt, 32 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 7 dt, 256 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 8 dt, 1 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 9 dt, 512 FROM DUAL
         UNION ALL
         SELECT 'b' sector, TRUNC (SYSDATE) dt, 3 FROM DUAL
         UNION ALL
         SELECT 'b' sector, TRUNC (SYSDATE) - 1 dt, 27 FROM DUAL
         UNION ALL
         SELECT 'b' sector, TRUNC (SYSDATE) - 2 dt, 9 FROM DUAL
         UNION ALL
         SELECT 'b' sector, TRUNC (SYSDATE) - 3 dt, 81 FROM DUAL),
     b
     AS (SELECT a.*,
                RANK () OVER (PARTITION BY sector ORDER BY dt) dt_rnk,
                COUNT (DISTINCT dt) OVER (PARTITION BY sector) dt_cnt
           FROM a)
SELECT sector,
       dt,
       v,
       mov_rank
  FROM b
MODEL
   PARTITION BY (sector)
   DIMENSION BY (dt)
   MEASURES (v, 0 mov_rank)
   RULES UPDATE
 (     mov_rank [ANY] =
            RANK(v[CV(dt)]) WITHIN GROUP (ORDER BY v)[dt BETWEEN CV()-3 AND CV()]);

One problem with the code above is that the nested cell reference v[CV(dt)] is illegal. I have only been able to solve the problem using an iterative model clause.

WITH a
     AS (SELECT 'a' sector, TRUNC (SYSDATE) dt, 64 v FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 1 dt, 2 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 2 dt, 4 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 3 dt, 128 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 4 dt, 8 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 5 dt, 16 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 6 dt, 32 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 7 dt, 256 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 8 dt, 1 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 9 dt, 512 FROM DUAL
         UNION ALL
         SELECT 'b' sector, TRUNC (SYSDATE) dt, 3 FROM DUAL
         UNION ALL
         SELECT 'b' sector, TRUNC (SYSDATE) - 1 dt, 27 FROM DUAL
         UNION ALL
         SELECT 'b' sector, TRUNC (SYSDATE) - 2 dt, 9 FROM DUAL
         UNION ALL
         SELECT 'b' sector, TRUNC (SYSDATE) - 3 dt, 81 FROM DUAL),
     b
     AS (SELECT a.*,
                RANK () OVER (PARTITION BY sector ORDER BY dt) dt_rnk,
                COUNT (DISTINCT dt) OVER (PARTITION BY sector) dt_cnt
           FROM a)
SELECT sector,
       dt,
       v,
       mov_rank
  FROM b
MODEL
   PARTITION BY (sector)
   DIMENSION BY (dt_rnk)
   MEASURES (dt, v, dt_cnt, 0 mov_rank, 0 temp)
   RULES UPDATE
      ITERATE (100000) UNTIL dt_cnt[1] = ITERATION_NUMBER
      (temp [dt_rnk BETWEEN ITERATION_NUMBER - 2 AND ITERATION_NUMBER + 1] =
            RANK () OVER (ORDER BY v),
      mov_rank [ITERATION_NUMBER + 1] =
            temp[CV ()]);

This provides the desired effect!  Here is the output:

SECTOR DT           V    MOV_RANK
a      27/08/2014   512  1
a      28/08/2014   1    1
a      29/08/2014   256  2
a      30/08/2014   32   2
a      31/08/2014   16   2
a      01/09/2014   8    1
a      02/09/2014   128  4
a      03/09/2014   4    1
a      04/09/2014   2    1
a      05/09/2014   64   3
b      02/09/2014   81   1
b      03/09/2014   9    1
b      04/09/2014   27   2
b      05/09/2014   3    1

October 23, 2012

Non-inclusive windows for yearly or monthly intervals

Filed under: SQL Fun — tonyhasler @ 1:01 pm

About six months ago I came across the need to use analytic functions with a moving one year window. This is supposedly supported in Oracle 10g and beyond but the behaviour was not what I expected. Consider this test query:

WITH q1
     AS (    SELECT DATE '2012-01-01' + ROWNUM mydate
               FROM DUAL
         CONNECT BY LEVEL <= 400)
    ,q2
     AS (SELECT mydate
               ,COUNT (*)
                   OVER (ORDER BY mydate
                         RANGE BETWEEN INTERVAL '1' YEAR PRECEDING
                                   AND CURRENT ROW)
                   cnt
               ,MIN (mydate)
                   OVER (ORDER BY mydate
                         RANGE BETWEEN INTERVAL '1' YEAR PRECEDING
                                   AND CURRENT ROW)
                   mindate
               ,MAX (mydate)
                   OVER (ORDER BY mydate
                         RANGE BETWEEN INTERVAL '1' YEAR PRECEDING
                                   AND CURRENT ROW)
                   maxdate
           FROM q1)
SELECT *
  FROM q2
 WHERE mydate = DATE '2013-01-10';

This is the result:

MYDATE      CNT MINDATE     MAXDATE
10-Jan-2013 367 10-Jan-2012 10-Jan-2013

The count of days in the year ending 10th January 2013 is 367. Since the window includes 29th February 2012 we do expect 366 days, but not 367! The explanation comes when we look at the minimum and maximum values in the range: we have included both 10th January 2012 and 10th January 2013 in our window. In reality, the fact that Oracle included an extra day was not an issue on this occasion, but it has bothered me for the last six months that I couldn’t see a way to solve this problem without the use of the model clause (more of this in my next post). However, as is often the case when one muses on something like this, the answer came to me in my sleep and I awoke with a start!

I may have been inspired by Stephen Hawking’s invention of virtual time when I finally came up with this solution:

WITH q1
     AS (    SELECT DATE '2012-01-01' + ROWNUM mydate
               FROM DUAL
         CONNECT BY LEVEL <= 400)
    ,q2
     AS (SELECT mydate
               ,COUNT (*)
                   OVER (ORDER BY   ( (mydate - DATE '1950-01-01')
                                    * (86401 / 86400))
                                  + DATE '1950-01-01'
                         RANGE BETWEEN INTERVAL '1' YEAR PRECEDING
                                   AND CURRENT ROW)
                   cnt
               ,MIN (mydate)
                   OVER (ORDER BY   ( (mydate - DATE '1950-01-01')
                                    * (86401 / 86400))
                                  + DATE '1950-01-01'
                         RANGE BETWEEN INTERVAL '1' YEAR PRECEDING
                                   AND CURRENT ROW)
                   mindate
               ,MAX (mydate)
                   OVER (ORDER BY   ( (mydate - DATE '1950-01-01')
                                    * (86401 / 86400))
                                  + DATE '1950-01-01'
                         RANGE BETWEEN INTERVAL '1' YEAR PRECEDING
                                   AND CURRENT ROW)
                   maxdate
           FROM q1)
SELECT *
  FROM q2
 WHERE mydate = DATE '2013-01-10';

This window function works using a somewhat simpler concept of virtual time than that of the famous physicist: each day is translated into one day and one second. This is accomplished by subtracting an arbitrary anchor date that is earlier than the first date in your data but no more than about 236 years earlier than the last date in your data. You then multiply the resultant number of days by 86401/86400 (there are 86400 seconds in one day) and add the anchor date back in again. Now, when converted into virtual time, 10th January 2012 is 1 year and 366 seconds prior to 10th January 2013, rather than exactly 1 year, and so is excluded from the window. As long as your window is less than around 236 years (86400 days is the point at which the accumulated extra seconds add up to a full day) and your real dates do not include a time of day, this technique allows you to implement windows of months and/or years that exclude the start date of the window.
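The translation can be seen in isolation with a query like the following (my own illustration; DATE '1950-01-01' is just the arbitrary anchor date used above):

```sql
-- Map 10th January 2012 and 2013 into virtual time. The virtual gap between
-- them is 366 days plus 366 seconds, so the earlier date falls just outside
-- a one-year RANGE window anchored on the later one.
SELECT TO_CHAR (  ( (DATE '2012-01-10' - DATE '1950-01-01') * (86401 / 86400))
                + DATE '1950-01-01'
               ,'DD-Mon-YYYY HH24:MI:SS') virtual_2012
      ,TO_CHAR (  ( (DATE '2013-01-10' - DATE '1950-01-01') * (86401 / 86400))
                + DATE '1950-01-01'
               ,'DD-Mon-YYYY HH24:MI:SS') virtual_2013
  FROM DUAL;
```

Each virtual date carries a spurious time-of-day component equal to one second per day elapsed since the anchor, which is exactly what pushes the window boundary past the unwanted start date.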

Blog at WordPress.com.