Tony’s Oracle Tips

March 27, 2012

FIRST/LAST versus FIRST_VALUE/LAST_VALUE

Filed under: SQL Fun — tonyhasler @ 2:28 pm

As I promised in my last post, I want to discuss the difference between FIRST/LAST and FIRST_VALUE/LAST_VALUE.

Previously I discussed FIRST and LAST purely as aggregate functions, but they can be used as analytic functions as well. FIRST_VALUE and LAST_VALUE can only be used as analytic functions, and in that respect at least they are somewhat less powerful.

Most of us take a while to grasp what analytic functions are, and how they differ from aggregate functions, when we first learn about them, but there are many articles and discussions on the web about the general topic. Here is one you might like.

Now that you know about analytic functions in general, let us return to look at an example of FIRST using the same table as last time:

SELECT c1, MIN (c3) KEEP (DENSE_RANK FIRST ORDER BY c1) OVER () "Analytic"
  FROM t1
 WHERE c1 <= 50;
 
 QUERY 1

What this query says is: After filtering out rows where C1 is NULL or has a value greater than 50, find the row or rows that have the minimum value of C1 and amongst those rows find the minimum value of C3. This value is used for the “Analytic” column; all rows in the final result set have the same value for the “Analytic” column.
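As a concrete illustration, here is a Python sketch of those semantics (a simulation, not Oracle itself), using the same data as the T1 table created in the post below (c1 = ROWNUM, c2 = MOD(ROWNUM, 2), c3 = MOD(ROWNUM, 3), 6000 rows):

```python
# Hypothetical re-creation of the T1 data from the post below.
rows = [(n, n % 2, n % 3) for n in range(1, 6001)]

# QUERY 1 semantics: filter out NULL c1 and c1 > 50, find the row(s) with
# the minimum c1, then take MIN(c3) amongst those rows.
filtered = [(c1, c2, c3) for (c1, c2, c3) in rows if c1 is not None and c1 <= 50]
min_c1 = min(c1 for (c1, _, _) in filtered)
analytic = min(c3 for (c1, _, c3) in filtered if c1 == min_c1)

print(analytic)  # every row of the result set carries this one value
```

With this data the minimum c1 is 1 and the only such row has c3 = 1, so every row's "Analytic" column is 1.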

Now let us look at a similar query using FIRST_VALUE:

SELECT c1, FIRST_VALUE (c3) OVER (ORDER BY c1) "Analytic"
  FROM t1
 WHERE c1 <= 50;
 
 QUERY 2
 

What this query says is: After filtering out rows where C1 is NULL or has a value greater than 50, find the row or rows that have the minimum value of C1 and amongst those rows pick an arbitrary row and then select the value of C3. This value is used for the “Analytic” column; all rows in the final result set have the same value for the “Analytic” column.

Let us first assume that the expression used in the ORDER BY clause is unique (as recommended in the documentation of FIRST_VALUE). In this case, the results of QUERY 2 are deterministic and identical to those of QUERY 1. Of course, in this case the aggregate function used in conjunction with FIRST could be MIN, MAX, or even AVG: the result is the same, as there is only one row from which to pick.

However, if the ORDER BY clause does not represent a unique key then I would suggest you use the FIRST function in preference to FIRST_VALUE, as it provides a way to deal explicitly with any duplicates.
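The difference in tie handling can be sketched in Python on a toy data set (an illustration of the semantics, not Oracle itself):

```python
# Toy rows as (c1, c3) pairs with a duplicate ORDER BY value: c1 = 1 twice.
rows = [(1, 9), (1, 4), (2, 7)]

# FIRST semantics: amongst ALL rows tied on the minimum c1, aggregate c3
# with the specified function (MIN here) -- deterministic.
min_c1 = min(c1 for c1, _ in rows)
keep_first = min(c3 for c1, c3 in rows if c1 == min_c1)

# FIRST_VALUE semantics: order by c1 and take c3 from whichever tied row
# happens to sort first -- in Oracle that choice is arbitrary; here
# Python's stable sort just preserves the input order.
first_value = sorted(rows, key=lambda r: r[0])[0][1]

print(keep_first)   # 4: MIN over both tied rows
print(first_value)  # 9 here, but Oracle could legitimately return 4
```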

The FIRST function may also provide better performance. Here is the execution plan for QUERY 1:

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    50 |  1300 |     6   (0)| 00:00:01 |
|   1 |  WINDOW BUFFER     |      |    50 |  1300 |     6   (0)| 00:00:01 |
|*  2 |   TABLE ACCESS FULL| T1   |    50 |  1300 |     6   (0)| 00:00:01 |
---------------------------------------------------------------------------

As you can see, we don’t need a sort, as explained in my last post, but we do need to buffer the rows: we don’t know what to put into the “Analytic” column for any row until we have processed the last row in our result set.

Now let us look at the execution plan for QUERY 2:

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    50 |  1300 |     7  (15)| 00:00:01 |
|   1 |  WINDOW SORT       |      |    50 |  1300 |     7  (15)| 00:00:01 |
|*  2 |   TABLE ACCESS FULL| T1   |    50 |  1300 |     6   (0)| 00:00:01 |
---------------------------------------------------------------------------

As you can see, with FIRST_VALUE we have a superfluous sort and as a consequence a higher cost.

You can’t always avoid a sort with FIRST. Consider this example:

  SELECT c1
        ,MIN (c3) KEEP (DENSE_RANK FIRST ORDER BY c1) OVER (PARTITION BY c2)
            "Analytic"
    FROM t1
   WHERE c1 <= 50
ORDER BY c2, c1 DESC;
 
 QUERY 3
 

This query is similar to QUERY 1, but this time there is a separate value of "Analytic" for each value of C2. The PARTITION BY clause means we have to sort the data by C2. However, the ORDER BY clause requires that at some stage we order the rows by "C2, C1 DESC". This single sort is sufficient for both purposes, as we can see in the execution plan:

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    50 |  1950 |     6   (0)| 00:00:01 |
|   1 |  WINDOW SORT       |      |    50 |  1950 |     6   (0)| 00:00:01 |
|*  2 |   TABLE ACCESS FULL| T1   |    50 |  1950 |     6   (0)| 00:00:01 |
---------------------------------------------------------------------------

Let us look at the equivalent statement using FIRST_VALUE:

SELECT c1, FIRST_VALUE (c3) OVER (PARTITION BY c2 ORDER BY c1) "Analytic"
    FROM t1
   WHERE c1 <= 50
ORDER BY c2, c1 DESC;

QUERY 4


Here is the execution plan for QUERY 4:

----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |    50 |  1950 |     8  (25)| 00:00:01 |
|   1 |  SORT ORDER BY      |      |    50 |  1950 |     8  (25)| 00:00:01 |
|   2 |   WINDOW SORT       |      |    50 |  1950 |     8  (25)| 00:00:01 |
|*  3 |    TABLE ACCESS FULL| T1   |    50 |  1950 |     6   (0)| 00:00:01 |
----------------------------------------------------------------------------

Now we see an extra sort. The reason is that the FIRST_VALUE function requires the data to be sorted by the composite key "C2, C1", so the rows must then be reordered by "C2, C1 DESC" to satisfy the final ORDER BY clause.

Of course, all the above observations can be applied to LAST and LAST_VALUE.

So, is there ever a need for FIRST_VALUE and LAST_VALUE? Well consider this query:

SELECT c1
      ,FIRST_VALUE (c3)
          OVER (PARTITION BY c2
                ORDER BY c1
                ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
          "Analytic"
  FROM t1
 WHERE c1 <= 50;
 
QUERY 5

This time FIRST_VALUE is being used almost like LAG(c3,3), in that it is picking up the value of C3 from three rows earlier, so to speak. The difference is that LAG(c3,3) returns NULL for the first three rows of each partition and FIRST_VALUE doesn't. There isn't a way to code this expression using FIRST because, as an analytic function, FIRST supports neither an ORDER BY clause in its OVER clause nor a windowing clause.
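A quick Python sketch of that difference, over one ordered partition (an illustration of the window arithmetic, not Oracle itself):

```python
# c3 values within a single partition, already ordered by c1.
c3_values = [10, 20, 30, 40, 50]

# FIRST_VALUE over ROWS BETWEEN 3 PRECEDING AND CURRENT ROW: the window
# start is clamped at the top of the partition for the first three rows.
first_value = [c3_values[max(i - 3, 0)] for i in range(len(c3_values))]

# LAG(c3, 3): NULL (None) until three prior rows exist.
lag3 = [c3_values[i - 3] if i >= 3 else None for i in range(len(c3_values))]

print(first_value)  # [10, 10, 10, 10, 20]
print(lag3)         # [None, None, None, 10, 20]
```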

Whilst we are on the topic of windowing clauses, the implicit and unchangeable window for the FIRST and LAST functions is ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING; in other words, all rows in our partition. For FIRST_VALUE and LAST_VALUE the default, but changeable, windowing clause is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW; in other words, we exclude rows after the current one. Dropping rows off the bottom of a list makes no difference when we are looking for the first row in the list (FIRST_VALUE), but it does make a difference when we are looking for the last row in the list (LAST_VALUE). So you will usually need either to specify ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING explicitly when using LAST_VALUE or just to use FIRST_VALUE and reverse the sort order.
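The LAST_VALUE trap can be sketched in Python (a simulation of the window semantics, assuming no duplicate sort keys so that ROWS and RANGE behave alike):

```python
# One ordered partition of values.
values = [3, 1, 4, 1, 5]

# Default window ends at the current row, so LAST_VALUE just echoes the
# current row's value -- rarely what was intended.
default_window = [values[i] for i in range(len(values))]

# With ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING the
# genuinely last value of the partition is returned for every row.
full_window = [values[-1] for _ in values]

print(default_window)  # [3, 1, 4, 1, 5]
print(full_window)     # [5, 5, 5, 5, 5]
```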

In conclusion, I would recommend that you only ever use FIRST_VALUE or LAST_VALUE when you need a windowing clause other than ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. When FIRST or LAST is available, the performance is never worse, and frequently better, than that of the equivalent FIRST_VALUE or LAST_VALUE construct. In addition, FIRST and LAST avoid the possibility of non-deterministic results when a non-unique ORDER BY clause is supplied.


March 26, 2012

Aggregate Functions and Sorts

Filed under: SQL Fun,Uncategorized — tonyhasler @ 3:46 pm

I have been prompted to write one of my rare blogs because I have just discovered that the topic of aggregate functions is not quite as straightforward as I previously thought.

Before we start, I want you to look long and hard at the following query and tell me (figuratively speaking) how many sorts are involved in it:

SELECT AVG (c1)
      ,MIN (c1)
      ,MAX (c1)
      ,STDDEV (c2)
      ,RANK (2) WITHIN GROUP (ORDER BY c1 DESC)
      ,STDDEV (c1) KEEP (DENSE_RANK FIRST ORDER BY c2)
      ,VARIANCE (c2) KEEP (DENSE_RANK LAST ORDER BY c3)
  FROM t1;

How are you doing? Let us have a look at the execution plan and see if that helps:

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |    39 |     6   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |    39 |            |          |
|   2 |   TABLE ACCESS FULL| T1   |  6000 |   228K|     6   (0)| 00:00:01 |
---------------------------------------------------------------------------

Has that changed your opinion?

So the correct answer is……zero!

If you are in any doubt, you can create the table I used for the query and try it yourself:

CREATE TABLE t1
AS
       SELECT ROWNUM c1, MOD (ROWNUM, 2) c2, MOD (ROWNUM, 3) c3
         FROM DUAL
   CONNECT BY LEVEL <= 6000;

You can use the following query before and after the query under test to check the sort statistics:

SELECT name, VALUE
  FROM v$mystat NATURAL JOIN v$statname
 WHERE name LIKE 'sort%';

WARNING: Do not use autotrace when checking the stats, as autotrace can perform a sort.

Let us see how it is possible for this statement to complete without any sorts.

The AVG function calculates the mean of a set of values. AVG(c1) can be calculated as the rows are scanned by simply keeping a running total of c1 and a count of the number of rows where c1 is not null and then performing a division at the end. No sort is required. It should be clear that no sort is required for MIN or MAX either. It may be less clear (until you Google it) that STDDEV and VARIANCE aggregations can be performed in a single pass with no sort. In my query, the RANK function simply counts the number of rows that have a value greater than or equal to 2.
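The single-pass variance trick the Google search turns up is usually Welford's algorithm. Here is a Python sketch of it (an illustration, not Oracle's actual internals), showing that mean and standard deviation fall out of one scan with no sort:

```python
import math

# One-pass mean and sample standard deviation (Welford's algorithm).
def running_stats(xs):
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)   # accumulates sum of squared deviations
    # Oracle's STDDEV is the sample standard deviation (divide by n - 1).
    return mean, math.sqrt(m2 / (n - 1))

mean, stddev = running_stats([2, 4, 4, 4, 5, 5, 7, 9])
print(mean)    # 5.0
```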

This is all very straightforward but what about the last two calculations? Those ORDER BY clauses surely imply some kind of sort don’t they?

Well no. In my query, the FIRST function returns the standard deviation of all values of c1 amongst the rows that have the lowest value of c2.

In general, the FIRST function almost certainly operates by keeping track of the lowest value of c2 encountered so far and on encountering a new row does the following:

  1. If the value of c2 is greater than the stored minimum value of c2, ignore the row for the purposes of this aggregate function.
  2. If the value of c2 equals the minimum encountered so far then perform the aggregation calculations on c1 defined on the left as if no FIRST function was specified.
  3. If the value of c2 is less than the minimum stored so far, replace the stored minimum, delete any stored aggregate values of c1 calculated so far and then process the row as per step 2.
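The three steps above can be sketched directly in Python (my speculation about the mechanism, not Oracle's documented internals), here computing STDDEV(c1) KEEP (DENSE_RANK FIRST ORDER BY c2) in a single pass:

```python
import math

def stddev_keep_first(rows):  # rows: (c1, c2) pairs, in scan order
    min_c2, kept = None, []
    for c1, c2 in rows:
        if min_c2 is None or c2 < min_c2:
            min_c2, kept = c2, [c1]   # step 3: new minimum, start over
        elif c2 == min_c2:
            kept.append(c1)           # step 2: a tie, aggregate this c1
        # step 1: c2 > min_c2 -- ignore the row
    n = len(kept)
    mean = sum(kept) / n
    return math.sqrt(sum((x - mean) ** 2 for x in kept) / (n - 1))

# With c2 = MOD(c1, 2), the rows with the minimum c2 (0) are the even c1s.
result = stddev_keep_first([(n, n % 2) for n in range(1, 11)])
print(result)  # sample standard deviation of [2, 4, 6, 8, 10]
```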

Now, so far I have been very careful in my selection of aggregate functions. Some aggregate functions do require a sort. Let us make a tiny change to the select statement:

SELECT AVG (c1)
      ,MIN (c1)
      ,MAX (c1)
      ,STDDEV (c2)
      ,DENSE_RANK (2) WITHIN GROUP (ORDER BY c1 DESC)
      ,STDDEV (c1) KEEP (DENSE_RANK FIRST ORDER BY c2)
      ,VARIANCE (c2) KEEP (DENSE_RANK LAST ORDER BY c3)
  FROM t1;

The only difference is that I have replaced the RANK function call with a DENSE_RANK function call. The DENSE_RANK function requires a sort as confirmed by the statistics. This is the new execution plan:

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |    39 |     6   (0)| 00:00:01 |
|   1 |  SORT GROUP BY     |      |     1 |    39 |            |          |
|   2 |   TABLE ACCESS FULL| T1   |  6000 |   228K|     6   (0)| 00:00:01 |
---------------------------------------------------------------------------

Note that the SORT AGGREGATE operation has been replaced by a SORT GROUP BY. So we have discovered something interesting: SORT AGGREGATE does not perform a sort! Ever!

What happens if we need more than one sort?

SELECT DENSE_RANK (3000) WITHIN GROUP (ORDER BY c1 DESC), MEDIAN (c2) FROM t1;

The execution plan remains the same and the statistic “sorts (memory)” seems to indicate that there has been only one sort but this is impossible. In fact, if you look at “sorts (rows)” you will see that it has been incremented by 9000. 6000 of these rows are for the MEDIAN function but DENSE_RANK only needs to sort the 3000 rows where c1 has a value of less than or equal to 3000. So SORT GROUP BY just counts one sort in the statistics, no matter how many are actually performed.
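To make the arithmetic concrete, here is a Python sketch of what those two functions actually return on this data (an illustration of the hypothetical-rank and median semantics, not of Oracle's sorting internals):

```python
# The T1 data: c1 = ROWNUM, c2 = MOD(ROWNUM, 2), 6000 rows.
c1 = list(range(1, 6001))
c2 = [n % 2 for n in c1]

# DENSE_RANK(3000) WITHIN GROUP (ORDER BY c1 DESC): one more than the
# number of distinct c1 values that sort ahead of 3000, i.e. exceed it.
dense_rank = len({v for v in c1 if v > 3000}) + 1

# MEDIAN(c2): 3000 zeros and 3000 ones, so the median interpolates to 0.5.
ordered = sorted(c2)
median = (ordered[2999] + ordered[3000]) / 2

print(dense_rank)  # 3001
print(median)      # 0.5
```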

Let us introduce a GROUP BY clause:

SELECT DENSE_RANK (3000) WITHIN GROUP (ORDER BY c1 DESC) from t1 GROUP BY c2;

The execution plan remains unchanged and the statistics once again show only one sort. In this case, however, there may be several sorts: the first sort is by c2, the others will be sorts to evaluate DENSE_RANK within each group. The “sorts (rows)” statistic seems to support this theory.
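A Python sketch of the per-group evaluation (again an illustration of the semantics, not the internals): first group the rows by c2, then compute the hypothetical rank within each group.

```python
# The T1 data as (c1, c2) pairs.
rows = [(n, n % 2) for n in range(1, 6001)]

# First "sort": group the rows by c2.
groups = {}
for c1, c2 in rows:
    groups.setdefault(c2, []).append(c1)

# Then, within each group, DENSE_RANK(3000) (ORDER BY c1 DESC) is one more
# than the number of distinct c1 values greater than 3000 in that group.
result = {c2: len({v for v in vals if v > 3000}) + 1
          for c2, vals in groups.items()}

print(sorted(result.items()))  # 1500 values above 3000 in each group
```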

I think that is almost enough for this blog. I will post another soon on the difference between the FIRST and LAST functions and the FIRST_VALUE and LAST_VALUE functions. But let me close this note with a comment about the HASH GROUP BY operation introduced in Oracle 10g.

HASH GROUP BY doesn’t perform a sort (as you might expect) and so can only be used for (sub)queries that do not involve aggregates that require sorts. Consider this query:

SELECT MAX(c1) FROM t1 GROUP BY c2;

Here is the execution plan:

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |  6000 |   152K|     7  (15)| 00:00:01 |
|   1 |  HASH GROUP BY     |      |  6000 |   152K|     7  (15)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| T1   |  6000 |   152K|     6   (0)| 00:00:01 |
---------------------------------------------------------------------------

In this case we simply keep an array of maximums (one for each distinct value of c2) as the table is scanned.
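That array-of-maximums idea can be sketched in a few lines of Python (my illustration of the hashing approach, not Oracle's actual implementation):

```python
# HASH GROUP BY sketch for MAX(c1) ... GROUP BY c2: one scan of T1,
# keeping a hash table with one running maximum per distinct c2.
maxima = {}
for n in range(1, 6001):        # T1: c1 = ROWNUM, c2 = MOD(ROWNUM, 2)
    c1, c2 = n, n % 2
    if c2 not in maxima or c1 > maxima[c2]:
        maxima[c2] = c1

print(sorted(maxima.items()))   # [(0, 6000), (1, 5999)]
```

No ordering of the input is ever needed, which is why the plan shows HASH GROUP BY rather than a sort.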
