This note describes a couple of real-life use cases for the SQL MODEL (or SPREADSHEET) clause that appeared in Oracle 10g. This is mainly targeted at people attending my UKOUG presentation on the model clause (materials TBS) but anybody who knows the syntax of the model clause may find this of some interest.
We already have SQL (non-procedural) and PL/SQL (procedural) languages and it is easy to call SQL from PL/SQL and vice-versa. So you might be asking yourself why on earth we need the model clause. Doesn’t it seem like just a procedural extension to SQL? If so why wouldn’t we just use PL/SQL?
Well, over the last year or so I have found a number of cases where the model clause has been useful. The two key areas are:
- Implementing analytics not available directly from an SQL function
- Implementing calculations in parallel
Before we discuss my use cases let us create some test data:
CREATE TABLE model_test
AS
WITH q1
AS ( SELECT DATE '2000-01-01' + LEVEL mydate, DBMS_RANDOM.VALUE ( 90, 100) VALUE
FROM DUAL
WHERE MOD ( LEVEL, 11) 0
CONNECT BY LEVEL <= 4000
UNION ALL
SELECT DATE '2000-01-01' + LEVEL mydate, DBMS_RANDOM.VALUE ( 100, 200) VALUE
FROM DUAL
WHERE MOD ( LEVEL, 11) = 0
CONNECT BY LEVEL <= 4000)
,q2
AS ( SELECT DBMS_RANDOM.string ( 'p', 10) account
FROM DUAL
CONNECT BY LEVEL <= 100)
SELECT *
FROM q1, q2;
EXEC DBMS_STATS.gather_table_stats(USER,'MODEL_TEST');
This table contains 40,000 rows with 4,000 days’ worth of data for each of 100 accounts. The data is just a single value of some kind. The average value is approximately 100 but for each account there are 3637 closely packed values between 90 and 100 and 363 values between 100 and 200.
A genuine requirement from my investment bank client was to implement a moving median over a one year window. The first problem I ran into related to the definition of a one year window. I discussed this issue in my last post. The second issue is, of course, that the median function does not support windowing. This is a solution based on the model clause:
SELECT /*+ parallel(10) */ * FROM model_test
MODEL
PARTITION BY (account)
DIMENSION BY (mydate)
MEASURES (VALUE, 0 mov_median)
RULES(
mov_median[ANY] = MEDIAN(VALUE)[mydate BETWEEN ADD_MONTHS(CV()+1,-12) AND CV()]);
This is a relatively concise way to express the requirement and it doesn’t look particularly procedural.
Trying to code this in PL/SQL would be pretty tricky because of the need to constantly manipulate sets of data. Furthermore, a moving median can’t be calculated in any efficient way, which is why there is no native support. The fact that we can use parallel query slaves to reduce the amount of time taken to perform the calculations further adds to the benefit of using the model clause rather than a PL/SQL function of some kind. Each slave performs the calculations for some subset of accounts.
In another case I was asked to calculate a zscore based on semi-standard deviations. A semi-standard deviation is required when the distribution of data above the mean is substantially different from that below the mean as in my test data. Essentially we have one standard deviation for the values above the mean and a separate one for those less than or equal to the mean. Zscore is simply a term for the number of standard deviations (in this case semi) that a particular value is above or below the mean. The zscores were also to be calculated over a moving one year window.
This is much more complicated than a moving median. Firstly, we have no built-in aggregate function that we can use for semi-standard deviations. Secondly, as we shall shortly see, we need an iterative model to implement our calculations.
To help explain my approach I need to invent some terms. I will say that:
- The target date is the date for which we need our semi-standard deviation and zscore.
- A contributing date is a date that contributes to the semi-standard deviation. Contributing
dates are those on the correct side of the mean that are within the window for a target date.
A particular date will be part of the window for itself and all of the target dates in the following year. However, it will only contribute to some subset because its value will only sometimes be on the same side of the mean as the value of the target date. As a consequence we need a separate iteration for each target date.
To avoid unnecessary complications I will make two assumptions:
- The data is dense. In other words, data is present for every account for every day. This wasn’t the case in real life.
- A window of 365 days, not a calendar year, was adequate. This was the case in real life.
This is roughly how I did it:
WITH q1
AS (SELECT m.*
, ROW_NUMBER ()
OVER (PARTITION BY account ORDER BY mydate)
- 1
rn
,0 AS mov_stdd
,0 AS mov_avg
,0 AS temp
,0 AS zscore
FROM model_test m)
SELECT /*+ parallel(10) */
account
,mydate
,VALUE
,mov_avg
,mov_stdd
,zscore
FROM q1
MODEL
PARTITION BY (account)
DIMENSION BY (rn)
MEASURES (mydate, VALUE, mov_avg, mov_stdd, zscore, temp)
RULES UPDATE
ITERATE (100000) UNTIL VALUE[ITERATION_NUMBER] IS NULL
(mov_avg [ITERATION_NUMBER] =
AVG (VALUE)[rn BETWEEN ITERATION_NUMBER - 364
AND ITERATION_NUMBER],
temp [rn BETWEEN ITERATION_NUMBER - 364 AND ITERATION_NUMBER] =
CASE
WHEN ( VALUE[CV ()] <= mov_avg[ITERATION_NUMBER]
AND VALUE[ITERATION_NUMBER] <=
mov_avg[ITERATION_NUMBER])
OR ( VALUE[CV ()] > mov_avg[ITERATION_NUMBER]
AND VALUE[ITERATION_NUMBER] >
mov_avg[ITERATION_NUMBER])
THEN
POWER ( VALUE[CV ()] - mov_avg[ITERATION_NUMBER], 2)
END,
mov_stdd [ITERATION_NUMBER] =
SQRT (
AVG (temp)[rn BETWEEN ITERATION_NUMBER - 364
AND ITERATION_NUMBER]),
zscore [ITERATION_NUMBER] =
DECODE (mov_stdd[CV ()]
,0, 0
, (VALUE[CV ()] - mov_avg[CV ()]) / mov_stdd[CV ()]));
We have associated a number with each target date so that we could use the ITERATION_NUMBER to find our target values. Note that the measure TEMP for a particular date is typically recalculated for each of 365 target dates and holds the contribution to the semi-standard standard deviation. The variance is the average contribution and the standard deviation (or more specifically, the population standard deviation) is the square root of the variance. The zscore is then a simple calculation as you can see.
Even though iteration was required here, PL/SQL wasn’t the right tool; even the limited use of aggregate functions in the above example has substantially simplified the logic and once again we can use parallel query slaves for our computations.
Update in reply to Dushyant
Dushyant has posted an interesting question in his comment. He wants to know how to implement a moving RANK function in a similar way to the moving MEDIAN function above. This was my first attempt:
WITH a
AS (SELECT 'a' sector, TRUNC (SYSDATE) dt, 64 v FROM DUAL
UNION ALL
SELECT 'a' sector, TRUNC (SYSDATE) - 1 dt, 2 FROM DUAL
UNION ALL
SELECT 'a' sector, TRUNC (SYSDATE) - 2 dt, 4 FROM DUAL
UNION ALL
SELECT 'a' sector, TRUNC (SYSDATE) - 3 dt, 128 FROM DUAL
UNION ALL
SELECT 'a' sector, TRUNC (SYSDATE) - 4 dt, 8 FROM DUAL
UNION ALL
SELECT 'a' sector, TRUNC (SYSDATE) - 5 dt, 16 FROM DUAL
UNION ALL
SELECT 'a' sector, TRUNC (SYSDATE) - 6 dt, 32 FROM DUAL
UNION ALL
SELECT 'a' sector, TRUNC (SYSDATE) - 7 dt, 256 FROM DUAL
UNION ALL
SELECT 'a' sector, TRUNC (SYSDATE) - 8 dt, 1 v FROM DUAL
UNION ALL
SELECT 'a' sector, TRUNC (SYSDATE) - 9 dt, 512 FROM DUAL
UNION ALL
SELECT 'b' sector, TRUNC (SYSDATE) dt, 3 FROM DUAL
UNION ALL
SELECT 'b' sector, TRUNC (SYSDATE) - 1 dt, 27 FROM DUAL
UNION ALL
SELECT 'b' sector, TRUNC (SYSDATE) - 2 dt, 9 FROM DUAL
UNION ALL
SELECT 'b' sector, TRUNC (SYSDATE) - 3 dt, 81 FROM DUAL),
b
AS (SELECT a.*,
RANK () OVER (PARTITION BY sector ORDER BY dt) dt_rnk,
COUNT (DISTINCT dt) OVER (PARTITION BY sector) dt_cnt
FROM a)
SELECT sector,
dt,
v,
mov_rank
FROM b
MODEL
PARTITION BY (sector)
DIMENSION BY (dt)
MEASURES (v, 0 mov_rank)
RULES UPDATE
( mov_rank [ANY] =
RANK(v[CV(dt)]) WITHIN GROUP (ORDER BY v)[dt BETWEEN CV()-3 AND CV()]);
One problem with the code above is that the nested cell reference v[CV(dt)] is illegal. I have only been able to solve the problem using an iterative model clause.
WITH a
AS (SELECT 'a' sector, TRUNC (SYSDATE) dt, 64 v FROM DUAL
UNION ALL
SELECT 'a' sector, TRUNC (SYSDATE) - 1 dt, 2 FROM DUAL
UNION ALL
SELECT 'a' sector, TRUNC (SYSDATE) - 2 dt, 4 FROM DUAL
UNION ALL
SELECT 'a' sector, TRUNC (SYSDATE) - 3 dt, 128 FROM DUAL
UNION ALL
SELECT 'a' sector, TRUNC (SYSDATE) - 4 dt, 8 FROM DUAL
UNION ALL
SELECT 'a' sector, TRUNC (SYSDATE) - 5 dt, 16 FROM DUAL
UNION ALL
SELECT 'a' sector, TRUNC (SYSDATE) - 6 dt, 32 FROM DUAL
UNION ALL
SELECT 'a' sector, TRUNC (SYSDATE) - 7 dt, 256 FROM DUAL
UNION ALL
SELECT 'a' sector, TRUNC (SYSDATE) - 8 dt, 1 v FROM DUAL
UNION ALL
SELECT 'a' sector, TRUNC (SYSDATE) - 9 dt, 512 FROM DUAL
UNION ALL
SELECT 'b' sector, TRUNC (SYSDATE) dt, 3 FROM DUAL
UNION ALL
SELECT 'b' sector, TRUNC (SYSDATE) - 1 dt, 27 FROM DUAL
UNION ALL
SELECT 'b' sector, TRUNC (SYSDATE) - 2 dt, 9 FROM DUAL
UNION ALL
SELECT 'b' sector, TRUNC (SYSDATE) - 3 dt, 81 FROM DUAL),
b
AS (SELECT a.*,
RANK () OVER (PARTITION BY sector ORDER BY dt) dt_rnk,
COUNT (DISTINCT dt) OVER (PARTITION BY sector) dt_cnt
FROM a)
SELECT sector,
dt,
v,
mov_rank
FROM b
MODEL
PARTITION BY (sector)
DIMENSION BY (dt_rnk)
MEASURES (dt, v, dt_cnt, 0 mov_rank, 0 temp)
RULES UPDATE
ITERATE (100000) UNTIL dt_cnt[1] = ITERATION_NUMBER
(temp [dt_rnk BETWEEN ITERATION_NUMBER - 2 AND ITERATION_NUMBER + 1] =
RANK () OVER (ORDER BY v),
mov_rank [ITERATION_NUMBER + 1] =
temp[CV ()]);
This provides the desired effect! Here is the output:
SECTOR |
DT |
V |
MOV_RANK |
a |
27/08/2014 |
512
|
1
|
a |
28/08/2014 |
1
|
1
|
a |
29/08/2014 |
256
|
2
|
a |
30/08/2014 |
32
|
2
|
a |
31/08/2014 |
16
|
2
|
a |
01/09/2014 |
8
|
1
|
a |
02/09/2014 |
128
|
4
|
a |
03/09/2014 |
4
|
1
|
a |
04/09/2014 |
2
|
1
|
a |
05/09/2014 |
64
|
3
|
b |
02/09/2014 |
81
|
1
|
b |
03/09/2014 |
9
|
1
|
b |
04/09/2014 |
27
|
2
|
b |
05/09/2014 |
3
|
1
|