I have been prompted to write one of my rare blogs because I have just discovered that the topic of aggregate functions is not quite as straightforward as I previously thought.

Before we start, I want you to look long and hard at the following query and tell me (figuratively speaking) how many sorts are involved in it:

```
SELECT AVG (c1)
,MIN (c1)
,MAX (c1)
,STDDEV (c2)
,RANK (2) WITHIN GROUP (ORDER BY c1 DESC)
,STDDEV (c1) KEEP (DENSE_RANK FIRST ORDER BY c2)
,VARIANCE (c2) KEEP (DENSE_RANK LAST ORDER BY c3)
FROM t1;
```

How are you doing? Let us have a look at the execution plan and see if that will help you?

```
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 39 | 6 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 39 | | |
| 2 | TABLE ACCESS FULL| T1 | 6000 | 228K| 6 (0)| 00:00:01 |
---------------------------------------------------------------------------
```

Has that changed your opinion?

So the correct answer is……zero!

If you are in any doubt, you can create the table I used for the query and try it yourself:

```
CREATE TABLE t1
AS
SELECT ROWNUM c1, MOD (ROWNUM, 2) c2, MOD (ROWNUM, 3) c3
FROM DUAL
CONNECT BY LEVEL <= 6000;
```

You can use the following query before and after the query under test to check the sort statistics:

```
SELECT name, VALUE
FROM v$mystat NATURAL JOIN v$statname
WHERE name LIKE 'sort%';
```

WARNING: Do not use autotrace when checking the stats. as autotrace can perform a sort.

Let us see how it is possible for this statement to complete without any sorts.

The AVG function calculates the mean of a set of values. AVG(c1) can be calculated as the rows are scanned by simply keeping a running total of c1 and a count of the number of rows where c1 is not null and then performing a division at the end. No sort is required. It should be clear that no sort is required for MIN or MAX either. It may be less clear (until you Google it) that STDDEV and VARIANCE aggregations can be performed in a single pass with no sort. In my query, the RANK function simply counts the number of rows that have a value greater than or equal to 2.

This is all very straightforward but what about the last two calculations? Those ORDER BY clauses surely imply some kind of sort don’t they?

Well no. In my query, the FIRST function returns the standard deviation of all values of c1 amongst the rows that have the lowest value of c2.

In general, the FIRST function almost certainly operates by keeping track of the lowest value of c2 encountered so far and on encountering a new row does the following:

- If the value of c2 is greater than the stored minimum value of c2, ignore the row for the purposes of this aggregate function.
- If the value of c2 equals the minimum encountered so far then perform the aggregation calculations on c1 defined on the left as if no FIRST function was specified.
- If the value of c2 is less than the minimum stored so far, replace the stored minimum, delete any stored aggregate values of c1 calculated so far and then process the row as per step 2.

Now, so far I have been very careful in my selection of aggregate functions. Some aggregate functions do require a sort. Let us make a tiny change to the select statement:

```
SELECT AVG (c1)
,MIN (c1)
,MAX (c1)
,STDDEV (c2)
,
```**DENSE_RANK** (2) WITHIN GROUP (ORDER BY c1 DESC)
,STDDEV (c1) KEEP (DENSE_RANK FIRST ORDER BY c2)
,VARIANCE (c2) KEEP (DENSE_RANK LAST ORDER BY c3)
FROM t1;

The only difference is that I have replaced the RANK function call with a DENSE_RANK function call. The DENSE_RANK function requires a sort as confirmed by the statistics. This is the new execution plan:

```
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 39 | 6 (0)| 00:00:01 |
| 1 | SORT GROUP BY | | 1 | 39 | | |
| 2 | TABLE ACCESS FULL| T1 | 6000 | 228K| 6 (0)| 00:00:01 |
---------------------------------------------------------------------------
```

Note that the SORT AGGREGATE operation has been replaced by a SORT GROUP BY. So we have discovered something interesting: SORT AGGREGATE does not perform a sort! Ever!

What happens if we need more than one sort:

`SELECT DENSE_RANK (3000) WITHIN GROUP (ORDER BY c1 DESC), MEDIAN (c2) FROM t1;`

The execution plan remains the same and the statistic “sorts (memory)” seems to indicate that there has been only one sort but this is impossible. In fact, if you look at “sorts (rows)” you will see that it has been incremented by 9000. 6000 of these rows are for the MEDIAN function but DENSE_RANK only needs to sort the 3000 rows where c1 has a value of less than or equal to 3000. So SORT GROUP BY just counts one sort in the statistics, no matter how many are actually performed.

Let us introduce a GROUP BY clause:

`SELECT DENSE_RANK (3000) WITHIN GROUP (ORDER BY c1 DESC) from t1 GROUP BY c2;`

The execution plan remains unchanged and the statistics once again show only one sort. In this case, however, there may be several sorts: the first sort is by c2, the others will be sorts to evaluate DENSE_RANK within each group. The “sorts (rows)” statistic seems to support this theory.

I think that is almost enough for this blog. I will post another soon on the difference between the FIRST and LAST functions and the FIRST_VALUE and LAST_VALUE functions. But let me close this note with a comment about the HASH GROUP BY operation introduced in Oracle 10g.

HASH GROUP BY doesn’t perform a sort (as you might expect) and so can only be used for (sub)queries that do not involve aggregates that require sorts. Consider this query:

`SELECT MAX(c1) FROM t1 GROUP BY c2;`

Here is the execution plan:

```
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 6000 | 152K| 7 (15)| 00:00:01 |
| 1 | HASH GROUP BY | | 6000 | 152K| 7 (15)| 00:00:01 |
| 2 | TABLE ACCESS FULL| T1 | 6000 | 152K| 6 (0)| 00:00:01 |
---------------------------------------------------------------------------
```

In this case we simply keep an array of maximums (one for each distinct value of c2) as the table is scanned.

## Leave a Reply