Model clause use cases

October 24, 2012

Model clause use cases

Filed under: SQL Fun — tonyhasler @ 9:22 pm

This note describes a couple of real-life use cases for the SQL MODEL (or SPREADSHEET) clause that appeared in Oracle 10g. This is mainly targeted at people attending my UKOUG presentation on the model clause (materials TBS) but anybody who knows the syntax of the model clause may find this of some interest.

We already have SQL (non-procedural) and PL/SQL (procedural) languages and it is easy to call SQL from PL/SQL and vice-versa. So you might be asking yourself why on earth we need the model clause. Doesn’t it seem like just a procedural extension to SQL? If so why wouldn’t we just use PL/SQL?

Well, over the last year or so I have found a number of cases where the model clause has been useful. The two key areas are:

Implementing analytics not available directly from an SQL function
Implementing calculations in parallel

Before we discuss my use cases let us create some test data:

CREATE TABLE model_test
AS
   WITH q1
        AS (    SELECT DATE '2000-01-01' + LEVEL mydate, DBMS_RANDOM.VALUE ( 90, 100) VALUE
                  FROM DUAL
                 WHERE MOD ( LEVEL, 11)  0
            CONNECT BY LEVEL <= 4000
            UNION ALL
                SELECT DATE '2000-01-01' + LEVEL mydate, DBMS_RANDOM.VALUE ( 100, 200) VALUE
                  FROM DUAL
                 WHERE MOD ( LEVEL, 11) = 0
            CONNECT BY LEVEL <= 4000)
       ,q2
        AS (    SELECT DBMS_RANDOM.string ( 'p', 10) account
                  FROM DUAL
            CONNECT BY LEVEL <= 100)
   SELECT *
     FROM q1, q2;

EXEC DBMS_STATS.gather_table_stats(USER,'MODEL_TEST');

This table contains 40,000 rows with 4,000 days’ worth of data for each of 100 accounts. The data is just a single value of some kind. The average value is approximately 100 but for each account there are 3637 closely packed values between 90 and 100 and 363 values between 100 and 200.

A genuine requirement from my investment bank client was to implement a moving median over a one year window. The first problem I ran into related to the definition of a one year window. I discussed this issue in my last post. The second issue is, of course, that the median function does not support windowing. This is a solution based on the model clause:

SELECT /*+ parallel(10) */ * FROM model_test
  MODEL
  PARTITION BY (account)
  DIMENSION BY (mydate)
  MEASURES (VALUE, 0 mov_median)
  RULES(
    mov_median[ANY] = MEDIAN(VALUE)[mydate BETWEEN ADD_MONTHS(CV()+1,-12) AND CV()]);

This is a relatively concise way to express the requirement and it doesn’t look particularly procedural.

Trying to code this in PL/SQL would be pretty tricky because of the need to constantly manipulate sets of data. Furthermore, a moving median can’t be calculated in any efficient way, which is why there is no native support. The fact that we can use parallel query slaves to reduce the amount of time taken to perform the calculations further adds to the benefit of using the model clause rather than a PL/SQL function of some kind. Each slave performs the calculations for some subset of accounts.

In another case I was asked to calculate a zscore based on semi-standard deviations. A semi-standard deviation is required when the distribution of data above the mean is substantially different from that below the mean as in my test data. Essentially we have one standard deviation for the values above the mean and a separate one for those less than or equal to the mean. Zscore is simply a term for the number of standard deviations (in this case semi) that a particular value is above or below the mean. The zscores were also to be calculated over a moving one year window.

This is much more complicated than a moving median. Firstly, we have no built-in aggregate function that we can use for semi-standard deviations. Secondly, as we shall shortly see, we need an iterative model to implement our calculations.

To help explain my approach I need to invent some terms. I will say that:

The target date is the date for which we need our semi-standard deviation and zscore.
A contributing date is a date that contributes to the semi-standard deviation. Contributing
dates are those on the correct side of the mean that are within the window for a target date.

A particular date will be part of the window for itself and all of the target dates in the following year. However, it will only contribute to some subset because its value will only sometimes be on the same side of the mean as the value of the target date. As a consequence we need a separate iteration for each target date.

To avoid unnecessary complications I will make two assumptions:

The data is dense. In other words, data is present for every account for every day. This wasn’t the case in real life.
A window of 365 days, not a calendar year, was adequate. This was the case in real life.

This is roughly how I did it:

   WITH q1
        AS (SELECT m.*
                  ,  ROW_NUMBER ()
                        OVER (PARTITION BY account ORDER BY mydate)
                   - 1
                      rn
                  ,0 AS mov_stdd
                  ,0 AS mov_avg
                  ,0 AS temp
                  ,0 AS zscore
              FROM model_test m)
   SELECT /*+ parallel(10) */
         account
         ,mydate
         ,VALUE
         ,mov_avg
         ,mov_stdd
         ,zscore
     FROM q1
   MODEL
      PARTITION BY (account)
      DIMENSION BY (rn)
      MEASURES (mydate, VALUE, mov_avg, mov_stdd, zscore, temp)
      RULES UPDATE
         ITERATE (100000) UNTIL VALUE[ITERATION_NUMBER] IS NULL
         (mov_avg [ITERATION_NUMBER] =
               AVG (VALUE)[rn BETWEEN ITERATION_NUMBER - 364
                                  AND ITERATION_NUMBER],
         temp [rn BETWEEN ITERATION_NUMBER - 364 AND ITERATION_NUMBER] =
               CASE
                  WHEN    (    VALUE[CV ()] <= mov_avg[ITERATION_NUMBER]
                           AND VALUE[ITERATION_NUMBER] <=
                                  mov_avg[ITERATION_NUMBER])
                       OR (    VALUE[CV ()] > mov_avg[ITERATION_NUMBER]
                           AND VALUE[ITERATION_NUMBER] >
                                  mov_avg[ITERATION_NUMBER])
                  THEN
                     POWER ( VALUE[CV ()] - mov_avg[ITERATION_NUMBER], 2)
               END,
         mov_stdd [ITERATION_NUMBER] =
               SQRT (
                  AVG (temp)[rn BETWEEN ITERATION_NUMBER - 364
                                    AND ITERATION_NUMBER]),
         zscore [ITERATION_NUMBER] =
               DECODE (mov_stdd[CV ()]
                      ,0, 0
                      , (VALUE[CV ()] - mov_avg[CV ()]) / mov_stdd[CV ()]));

We have associated a number with each target date so that we could use the ITERATION_NUMBER to find our target values. Note that the measure TEMP for a particular date is typically recalculated for each of 365 target dates and holds the contribution to the semi-standard standard deviation. The variance is the average contribution and the standard deviation (or more specifically, the population standard deviation) is the square root of the variance. The zscore is then a simple calculation as you can see.

Even though iteration was required here, PL/SQL wasn’t the right tool; even the limited use of aggregate functions in the above example has substantially simplified the logic and once again we can use parallel query slaves for our computations.

Update in reply to Dushyant

Dushyant has posted an interesting question in his comment. He wants to know how to implement a moving RANK function in a similar way to the moving MEDIAN function above. This was my first attempt:

WITH a
     AS (SELECT 'a' sector, TRUNC (SYSDATE) dt, 64 v FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 1 dt, 2 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 2 dt, 4 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 3 dt, 128 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 4 dt, 8 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 5 dt, 16 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 6 dt, 32 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 7 dt, 256 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 8 dt, 1 v FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 9 dt, 512 FROM DUAL
         UNION ALL
         SELECT 'b' sector, TRUNC (SYSDATE) dt, 3 FROM DUAL
         UNION ALL
         SELECT 'b' sector, TRUNC (SYSDATE) - 1 dt, 27 FROM DUAL
         UNION ALL
         SELECT 'b' sector, TRUNC (SYSDATE) - 2 dt, 9 FROM DUAL
         UNION ALL
         SELECT 'b' sector, TRUNC (SYSDATE) - 3 dt, 81 FROM DUAL),
     b
     AS (SELECT a.*,
                RANK () OVER (PARTITION BY sector ORDER BY dt) dt_rnk,
                COUNT (DISTINCT dt) OVER (PARTITION BY sector) dt_cnt
           FROM a)
SELECT sector,
       dt,
       v,
       mov_rank
  FROM b
MODEL
   PARTITION BY (sector)
   DIMENSION BY (dt)
   MEASURES (v, 0 mov_rank)
   RULES UPDATE
 (     mov_rank [ANY] =
            RANK(v[CV(dt)]) WITHIN GROUP (ORDER BY v)[dt BETWEEN CV()-3 AND CV()]);

One problem with the code above is that the nested cell reference v[CV(dt)] is illegal. I have only been able to solve the problem using an iterative model clause.

WITH a
     AS (SELECT 'a' sector, TRUNC (SYSDATE) dt, 64 v FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 1 dt, 2 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 2 dt, 4 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 3 dt, 128 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 4 dt, 8 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 5 dt, 16 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 6 dt, 32 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 7 dt, 256 FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 8 dt, 1 v FROM DUAL
         UNION ALL
         SELECT 'a' sector, TRUNC (SYSDATE) - 9 dt, 512 FROM DUAL
         UNION ALL
         SELECT 'b' sector, TRUNC (SYSDATE) dt, 3 FROM DUAL
         UNION ALL
         SELECT 'b' sector, TRUNC (SYSDATE) - 1 dt, 27 FROM DUAL
         UNION ALL
         SELECT 'b' sector, TRUNC (SYSDATE) - 2 dt, 9 FROM DUAL
         UNION ALL
         SELECT 'b' sector, TRUNC (SYSDATE) - 3 dt, 81 FROM DUAL),
     b
     AS (SELECT a.*,
                RANK () OVER (PARTITION BY sector ORDER BY dt) dt_rnk,
                COUNT (DISTINCT dt) OVER (PARTITION BY sector) dt_cnt
           FROM a)
SELECT sector,
       dt,
       v,
       mov_rank
  FROM b
MODEL
   PARTITION BY (sector)
   DIMENSION BY (dt_rnk)
   MEASURES (dt, v, dt_cnt, 0 mov_rank, 0 temp)
   RULES UPDATE
      ITERATE (100000) UNTIL dt_cnt[1] = ITERATION_NUMBER
      (temp [dt_rnk BETWEEN ITERATION_NUMBER - 2 AND ITERATION_NUMBER + 1] =
            RANK () OVER (ORDER BY v),
      mov_rank [ITERATION_NUMBER + 1] =
            temp[CV ()]);

This provides the desired effect! Here is the output:

SECTOR	DT	V	MOV_RANK
a	27/08/2014	512	1
a	28/08/2014	1	1
a	29/08/2014	256	2
a	30/08/2014	32	2
a	31/08/2014	16	2
a	01/09/2014	8	1
a	02/09/2014	128	4
a	03/09/2014	4	1
a	04/09/2014	2	1
a	05/09/2014	64	3
b	02/09/2014	81	1
b	03/09/2014	9	1
b	04/09/2014	27	2
b	05/09/2014	3	1

Comments (5)

5 Comments »

Tony,

Thanks for your tips on the model clause. I am trying to use it in a similar situation as you described for the median case except that I need to apply a rank() and percent_rank() function on a moving time window. I just cannot get the following query to work. I would appreciate any help.

with a as (
select ‘a’ sector, trunc(sysdate) dt, 64 v from dual union all
select ‘a’ sector, trunc(sysdate)-1 dt, 2 from dual union all
select ‘a’ sector, trunc(sysdate)-2 dt, 4 from dual union all
select ‘a’ sector, trunc(sysdate)-3 dt, 128 from dual union all
select ‘a’ sector, trunc(sysdate)-4 dt, 8 from dual union all
select ‘a’ sector, trunc(sysdate)-5 dt, 16 from dual union all
select ‘a’ sector, trunc(sysdate)-6 dt, 32 from dual union all
select ‘a’ sector, trunc(sysdate)-7 dt, 256 from dual union all
select ‘a’ sector, trunc(sysdate)-8 dt, 1 v from dual union all
select ‘a’ sector, trunc(sysdate)-9 dt, 512 from dual union all
select ‘b’ sector, trunc(sysdate) dt, 3 from dual union all
select ‘b’ sector, trunc(sysdate)-1 dt, 27 from dual union all
select ‘b’ sector, trunc(sysdate)-2 dt, 9 from dual union all
select ‘b’ sector, trunc(sysdate)-3 dt, 81 from dual
)
select * from a
model
partition by (sector)
dimension by (dt)
measures (v, rank() over (order by v) mov_rank) — tried this based on something I read in the Oracle manual, started with 0 mov_rank as in your example
rules (
mov_rank[ANY] = mov_rank[dt between CV() -3 and CV()]
)
;

Thanks

Comment by Dushyant — September 4, 2014 @ 3:22 pm | Reply
- Hi Dushyant, a fascinating question! I have updated my blog in response to your post.
  
  Comment by tonyhasler — September 5, 2014 @ 5:31 pm | Reply
Hi Tony,

First of all thanks for nice post on model clause, I am currently learning model clause and have confusion regarding moving rank solution which you post for Dushyant.

I have hard time understanding how below piece of code work.

ITERATE (100000) UNTIL dt_cnt[1] = ITERATION_NUMBER

(temp [dt_rnk BETWEEN ITERATION_NUMBER – 2 AND ITERATION_NUMBER + 1] =
RANK () OVER (ORDER BY v),
mov_rank [ITERATION_NUMBER + 1] =
temp[CV ()]);

lets consider sector b as it has less row (i.e. 4), so loop will iterate for each row with in each sector? so for b it will run for total 32 time 4 times per row?

also temp [dt_rnk BETWEEN ITERATION_NUMBER – 2 AND ITERATION_NUMBER + 1] will reference from -2 to 1 during first iteration, so I believe -2 to 0 will be ignore as
dt_rnk will start from 1, also RANK () OVER (ORDER BY v) will apply on complete data set or with in given sector?

I would appreciate if you explain how this logic work to get final output i.e. mov_rank?

Thanks in Advance

Regard’s

Comment by Henish — September 24, 2014 @ 4:18 pm | Reply
- Henish,
  
  You are right in saying that there are four iterations of the loop for sector ‘b’. Let us take them one by one. The first iteration will apply the RANK analytic function to just the first row. You are right in saying that rows -2 to 0 are ignored. So we get a value of 1 for TEMP[1] and MOV_RANK[1]. The second iteration considers just the first two rows. This will generate TEMP[1] = 2 and TEMP[2] = 1. However, we only set MOV_RANK[2] = 1 and leave MOV_RANK[1] alone. The third iteration performs the RANK analytic function on the first three rows generating TEMP[1] = 3, TEMP{2] = 1 and TEMP[3] = 2. This reflects the order of the corresponding values of V (81, 9, 27). We update MOV_RANK[3] = 2. The final iteration considers all four rows for sector ‘b’ and sets MOV_RANK[4] to 1 as the corresponding value of 3 is the lowest of the four. Notice that I didn’t actually need the TEMP measure, I could have just said:
  
  mov_rank [dt_rnk BETWEEN ITERATION_NUMBER – 2 AND ITERATION_NUMBER + 1] =
  CASE
  WHEN CV (dt_rnk) = ITERATION_NUMBER + 1
  THEN
  RANK () OVER (ORDER BY v)
  ELSE
  mov_rank[CV ()]
  END
  
  As far as sector ‘a’ is concerned this is a seperate partition and dealt with independantly, Sector ‘a’ will have a total of 10 iterations.
  
  Comment by tonyhasler — September 25, 2014 @ 7:45 am | Reply
  - Thank you very much Tony for your explanation, it clarify my doubt.
    
    Comment by Henish — September 25, 2014 @ 7:07 pm

RSS feed for comments on this post. TrackBack URI

Tony’s Oracle Tips

October 24, 2012