This note describes a couple of reallife use cases for the SQL MODEL (or SPREADSHEET) clause that appeared in Oracle 10g. This is mainly targeted at people attending my UKOUG presentation on the model clause (materials TBS) but anybody who knows the syntax of the model clause may find this of some interest.
We already have SQL (nonprocedural) and PL/SQL (procedural) languages and it is easy to call SQL from PL/SQL and viceversa. So you might be asking yourself why on earth we need the model clause. Doesn’t it seem like just a procedural extension to SQL? If so why wouldn’t we just use PL/SQL?
Well, over the last year or so I have found a number of cases where the model clause has been useful. The two key areas are:
 Implementing analytics not available directly from an SQL function
 Implementing calculations in parallel
Before we discuss my use cases let us create some test data:
CREATE TABLE model_test
AS
WITH q1
AS ( SELECT DATE '20000101' + LEVEL mydate, DBMS_RANDOM.VALUE ( 90, 100) VALUE
FROM DUAL
WHERE MOD ( LEVEL, 11) 0
CONNECT BY LEVEL <= 4000
UNION ALL
SELECT DATE '20000101' + LEVEL mydate, DBMS_RANDOM.VALUE ( 100, 200) VALUE
FROM DUAL
WHERE MOD ( LEVEL, 11) = 0
CONNECT BY LEVEL <= 4000)
,q2
AS ( SELECT DBMS_RANDOM.string ( 'p', 10) account
FROM DUAL
CONNECT BY LEVEL <= 100)
SELECT *
FROM q1, q2;
EXEC DBMS_STATS.gather_table_stats(USER,'MODEL_TEST');
This table contains 40,000 rows with 4,000 days’ worth of data for each of 100 accounts. The data is just a single value of some kind. The average value is approximately 100 but for each account there are 3637 closely packed values between 90 and 100 and 363 values between 100 and 200.
A genuine requirement from my investment bank client was to implement a moving median over a one year window. The first problem I ran into related to the definition of a one year window. I discussed this issue in my last post. The second issue is, of course, that the median function does not support windowing. This is a solution based on the model clause:
SELECT /*+ parallel(10) */ * FROM model_test
MODEL
PARTITION BY (account)
DIMENSION BY (mydate)
MEASURES (VALUE, 0 mov_median)
RULES(
mov_median[ANY] = MEDIAN(VALUE)[mydate BETWEEN ADD_MONTHS(CV()+1,12) AND CV()]);
This is a relatively concise way to express the requirement and it doesn’t look particularly procedural.
Trying to code this in PL/SQL would be pretty tricky because of the need to constantly manipulate sets of data. Furthermore, a moving median can’t be calculated in any efficient way, which is why there is no native support. The fact that we can use parallel query slaves to reduce the amount of time taken to perform the calculations further adds to the benefit of using the model clause rather than a PL/SQL function of some kind. Each slave performs the calculations for some subset of accounts.
In another case I was asked to calculate a zscore based on semistandard deviations. A semistandard deviation is required when the distribution of data above the mean is substantially different from that below the mean as in my test data. Essentially we have one standard deviation for the values above the mean and a separate one for those less than or equal to the mean. Zscore is simply a term for the number of standard deviations (in this case semi) that a particular value is above or below the mean. The zscores were also to be calculated over a moving one year window.
This is much more complicated than a moving median. Firstly, we have no builtin aggregate function that we can use for semistandard deviations. Secondly, as we shall shortly see, we need an iterative model to implement our calculations.
To help explain my approach I need to invent some terms. I will say that:
 The target date is the date for which we need our semistandard deviation and zscore.
 A contributing date is a date that contributes to the semistandard deviation. Contributing
dates are those on the correct side of the mean that are within the window for a target date.
A particular date will be part of the window for itself and all of the target dates in the following year. However, it will only contribute to some subset because its value will only sometimes be on the same side of the mean as the value of the target date. As a consequence we need a separate iteration for each target date.
To avoid unnecessary complications I will make two assumptions:
 The data is dense. In other words, data is present for every account for every day. This wasn’t the case in real life.
 A window of 365 days, not a calendar year, was adequate. This was the case in real life.
This is roughly how I did it:
WITH q1
AS (SELECT m.*
, ROW_NUMBER ()
OVER (PARTITION BY account ORDER BY mydate)
 1
rn
,0 AS mov_stdd
,0 AS mov_avg
,0 AS temp
,0 AS zscore
FROM model_test m)
SELECT /*+ parallel(10) */
account
,mydate
,VALUE
,mov_avg
,mov_stdd
,zscore
FROM q1
MODEL
PARTITION BY (account)
DIMENSION BY (rn)
MEASURES (mydate, VALUE, mov_avg, mov_stdd, zscore, temp)
RULES UPDATE
ITERATE (100000) UNTIL VALUE[ITERATION_NUMBER] IS NULL
(mov_avg [ITERATION_NUMBER] =
AVG (VALUE)[rn BETWEEN ITERATION_NUMBER  364
AND ITERATION_NUMBER],
temp [rn BETWEEN ITERATION_NUMBER  364 AND ITERATION_NUMBER] =
CASE
WHEN ( VALUE[CV ()] <= mov_avg[ITERATION_NUMBER]
AND VALUE[ITERATION_NUMBER] <=
mov_avg[ITERATION_NUMBER])
OR ( VALUE[CV ()] > mov_avg[ITERATION_NUMBER]
AND VALUE[ITERATION_NUMBER] >
mov_avg[ITERATION_NUMBER])
THEN
POWER ( VALUE[CV ()]  mov_avg[ITERATION_NUMBER], 2)
END,
mov_stdd [ITERATION_NUMBER] =
SQRT (
AVG (temp)[rn BETWEEN ITERATION_NUMBER  364
AND ITERATION_NUMBER]),
zscore [ITERATION_NUMBER] =
DECODE (mov_stdd[CV ()]
,0, 0
, (VALUE[CV ()]  mov_avg[CV ()]) / mov_stdd[CV ()]));
We have associated a number with each target date so that we could use the ITERATION_NUMBER to find our target values. Note that the measure TEMP for a particular date is typically recalculated for each of 365 target dates and holds the contribution to the semistandard standard deviation. The variance is the average contribution and the standard deviation (or more specifically, the population standard deviation) is the square root of the variance. The zscore is then a simple calculation as you can see.
Even though iteration was required here, PL/SQL wasn’t the right tool; even the limited use of aggregate functions in the above example has substantially simplified the logic and once again we can use parallel query slaves for our computations.
Update in reply to Dushyant
Dushyant has posted an interesting question in his comment. He wants to know how to implement a moving RANK function in a similar way to the moving MEDIAN function above. This was my first attempt:
WITH a
AS (SELECT 'a' sector, TRUNC (SYSDATE) dt, 64 v FROM DUAL
UNION ALL
SELECT 'a' sector, TRUNC (SYSDATE)  1 dt, 2 FROM DUAL
UNION ALL
SELECT 'a' sector, TRUNC (SYSDATE)  2 dt, 4 FROM DUAL
UNION ALL
SELECT 'a' sector, TRUNC (SYSDATE)  3 dt, 128 FROM DUAL
UNION ALL
SELECT 'a' sector, TRUNC (SYSDATE)  4 dt, 8 FROM DUAL
UNION ALL
SELECT 'a' sector, TRUNC (SYSDATE)  5 dt, 16 FROM DUAL
UNION ALL
SELECT 'a' sector, TRUNC (SYSDATE)  6 dt, 32 FROM DUAL
UNION ALL
SELECT 'a' sector, TRUNC (SYSDATE)  7 dt, 256 FROM DUAL
UNION ALL
SELECT 'a' sector, TRUNC (SYSDATE)  8 dt, 1 v FROM DUAL
UNION ALL
SELECT 'a' sector, TRUNC (SYSDATE)  9 dt, 512 FROM DUAL
UNION ALL
SELECT 'b' sector, TRUNC (SYSDATE) dt, 3 FROM DUAL
UNION ALL
SELECT 'b' sector, TRUNC (SYSDATE)  1 dt, 27 FROM DUAL
UNION ALL
SELECT 'b' sector, TRUNC (SYSDATE)  2 dt, 9 FROM DUAL
UNION ALL
SELECT 'b' sector, TRUNC (SYSDATE)  3 dt, 81 FROM DUAL),
b
AS (SELECT a.*,
RANK () OVER (PARTITION BY sector ORDER BY dt) dt_rnk,
COUNT (DISTINCT dt) OVER (PARTITION BY sector) dt_cnt
FROM a)
SELECT sector,
dt,
v,
mov_rank
FROM b
MODEL
PARTITION BY (sector)
DIMENSION BY (dt)
MEASURES (v, 0 mov_rank)
RULES UPDATE
( mov_rank [ANY] =
RANK(v[CV(dt)]) WITHIN GROUP (ORDER BY v)[dt BETWEEN CV()3 AND CV()]);
One problem with the code above is that the nested cell reference v[CV(dt)] is illegal. I have only been able to solve the problem using an iterative model clause.
WITH a
AS (SELECT 'a' sector, TRUNC (SYSDATE) dt, 64 v FROM DUAL
UNION ALL
SELECT 'a' sector, TRUNC (SYSDATE)  1 dt, 2 FROM DUAL
UNION ALL
SELECT 'a' sector, TRUNC (SYSDATE)  2 dt, 4 FROM DUAL
UNION ALL
SELECT 'a' sector, TRUNC (SYSDATE)  3 dt, 128 FROM DUAL
UNION ALL
SELECT 'a' sector, TRUNC (SYSDATE)  4 dt, 8 FROM DUAL
UNION ALL
SELECT 'a' sector, TRUNC (SYSDATE)  5 dt, 16 FROM DUAL
UNION ALL
SELECT 'a' sector, TRUNC (SYSDATE)  6 dt, 32 FROM DUAL
UNION ALL
SELECT 'a' sector, TRUNC (SYSDATE)  7 dt, 256 FROM DUAL
UNION ALL
SELECT 'a' sector, TRUNC (SYSDATE)  8 dt, 1 v FROM DUAL
UNION ALL
SELECT 'a' sector, TRUNC (SYSDATE)  9 dt, 512 FROM DUAL
UNION ALL
SELECT 'b' sector, TRUNC (SYSDATE) dt, 3 FROM DUAL
UNION ALL
SELECT 'b' sector, TRUNC (SYSDATE)  1 dt, 27 FROM DUAL
UNION ALL
SELECT 'b' sector, TRUNC (SYSDATE)  2 dt, 9 FROM DUAL
UNION ALL
SELECT 'b' sector, TRUNC (SYSDATE)  3 dt, 81 FROM DUAL),
b
AS (SELECT a.*,
RANK () OVER (PARTITION BY sector ORDER BY dt) dt_rnk,
COUNT (DISTINCT dt) OVER (PARTITION BY sector) dt_cnt
FROM a)
SELECT sector,
dt,
v,
mov_rank
FROM b
MODEL
PARTITION BY (sector)
DIMENSION BY (dt_rnk)
MEASURES (dt, v, dt_cnt, 0 mov_rank, 0 temp)
RULES UPDATE
ITERATE (100000) UNTIL dt_cnt[1] = ITERATION_NUMBER
(temp [dt_rnk BETWEEN ITERATION_NUMBER  2 AND ITERATION_NUMBER + 1] =
RANK () OVER (ORDER BY v),
mov_rank [ITERATION_NUMBER + 1] =
temp[CV ()]);
This provides the desired effect! Here is the output:
SECTOR  DT  V  MOV_RANK 

a  27/08/2014 
512

1

a  28/08/2014 
1

1

a  29/08/2014 
256

2

a  30/08/2014 
32

2

a  31/08/2014 
16

2

a  01/09/2014 
8

1

a  02/09/2014 
128

4

a  03/09/2014 
4

1

a  04/09/2014 
2

1

a  05/09/2014 
64

3

b  02/09/2014 
81

1

b  03/09/2014 
9

1

b  04/09/2014 
27

2

b  05/09/2014 
3

1

Tony,
Thanks for your tips on the model clause. I am trying to use it in a similar situation as you described for the median case except that I need to apply a rank() and percent_rank() function on a moving time window. I just cannot get the following query to work. I would appreciate any help.
with a as (
select ‘a’ sector, trunc(sysdate) dt, 64 v from dual union all
select ‘a’ sector, trunc(sysdate)1 dt, 2 from dual union all
select ‘a’ sector, trunc(sysdate)2 dt, 4 from dual union all
select ‘a’ sector, trunc(sysdate)3 dt, 128 from dual union all
select ‘a’ sector, trunc(sysdate)4 dt, 8 from dual union all
select ‘a’ sector, trunc(sysdate)5 dt, 16 from dual union all
select ‘a’ sector, trunc(sysdate)6 dt, 32 from dual union all
select ‘a’ sector, trunc(sysdate)7 dt, 256 from dual union all
select ‘a’ sector, trunc(sysdate)8 dt, 1 v from dual union all
select ‘a’ sector, trunc(sysdate)9 dt, 512 from dual union all
select ‘b’ sector, trunc(sysdate) dt, 3 from dual union all
select ‘b’ sector, trunc(sysdate)1 dt, 27 from dual union all
select ‘b’ sector, trunc(sysdate)2 dt, 9 from dual union all
select ‘b’ sector, trunc(sysdate)3 dt, 81 from dual
)
select * from a
model
partition by (sector)
dimension by (dt)
measures (v, rank() over (order by v) mov_rank) — tried this based on something I read in the Oracle manual, started with 0 mov_rank as in your example
rules (
mov_rank[ANY] = mov_rank[dt between CV() 3 and CV()]
)
;
Thanks
Comment by Dushyant — September 4, 2014 @ 3:22 pm 
Hi Dushyant, a fascinating question! I have updated my blog in response to your post.
Comment by tonyhasler — September 5, 2014 @ 5:31 pm 
Hi Tony,
First of all thanks for nice post on model clause, I am currently learning model clause and have confusion regarding moving rank solution which you post for Dushyant.
I have hard time understanding how below piece of code work.
ITERATE (100000) UNTIL dt_cnt[1] = ITERATION_NUMBER
(temp [dt_rnk BETWEEN ITERATION_NUMBER – 2 AND ITERATION_NUMBER + 1] =
RANK () OVER (ORDER BY v),
mov_rank [ITERATION_NUMBER + 1] =
temp[CV ()]);
lets consider sector b as it has less row (i.e. 4), so loop will iterate for each row with in each sector? so for b it will run for total 32 time 4 times per row?
also temp [dt_rnk BETWEEN ITERATION_NUMBER – 2 AND ITERATION_NUMBER + 1] will reference from 2 to 1 during first iteration, so I believe 2 to 0 will be ignore as
dt_rnk will start from 1, also RANK () OVER (ORDER BY v) will apply on complete data set or with in given sector?
I would appreciate if you explain how this logic work to get final output i.e. mov_rank?
Thanks in Advance
Regard’s
Comment by Henish — September 24, 2014 @ 4:18 pm 
Henish,
You are right in saying that there are four iterations of the loop for sector ‘b’. Let us take them one by one. The first iteration will apply the RANK analytic function to just the first row. You are right in saying that rows 2 to 0 are ignored. So we get a value of 1 for TEMP[1] and MOV_RANK[1]. The second iteration considers just the first two rows. This will generate TEMP[1] = 2 and TEMP[2] = 1. However, we only set MOV_RANK[2] = 1 and leave MOV_RANK[1] alone. The third iteration performs the RANK analytic function on the first three rows generating TEMP[1] = 3, TEMP{2] = 1 and TEMP[3] = 2. This reflects the order of the corresponding values of V (81, 9, 27). We update MOV_RANK[3] = 2. The final iteration considers all four rows for sector ‘b’ and sets MOV_RANK[4] to 1 as the corresponding value of 3 is the lowest of the four. Notice that I didn’t actually need the TEMP measure, I could have just said:
mov_rank [dt_rnk BETWEEN ITERATION_NUMBER – 2 AND ITERATION_NUMBER + 1] =
CASE
WHEN CV (dt_rnk) = ITERATION_NUMBER + 1
THEN
RANK () OVER (ORDER BY v)
ELSE
mov_rank[CV ()]
END
As far as sector ‘a’ is concerned this is a seperate partition and dealt with independantly, Sector ‘a’ will have a total of 10 iterations.
Comment by tonyhasler — September 25, 2014 @ 7:45 am 
Thank you very much Tony for your explanation, it clarify my doubt.
Comment by Henish — September 25, 2014 @ 7:07 pm