Tony’s Oracle Tips

June 13, 2015

Interview question: second largest value

Filed under: Interview questions,SQL Fun — tonyhasler @ 8:32 pm

This is my first blog for a very very long time.  I stopped blogging a couple of years ago when I started working on my book and after I finally finished the book I felt I needed a break to recuperate.

However, I have finally decided to put my toe back in the water.  This isn’t a very serious post but I hope you find it thought provoking and a bit of fun.

The thought behind this post came from a colleague who had just asked a candidate for employment the following question in order to help assess their command of the SQL syntax.

“How would you select the second largest value from a column in a table”.

A simple enough question. Or so it seems.  You may be able to come up with several answers yourself.  But what is the “best” answer?

There are others that have opined on this subject before. Here is one example. This post is just my thoughts on the topic.

Let me begin by creating a test table with the values 1 t 10 as the basis for some examples:

CREATE TABLE t AS SELECT ROWNUM n1 FROM DUAL CONNECT BY LEVEL <= 10; 

Here is one possible way to find the second highest value:

WITH q1
     AS (SELECT NTH_VALUE (
                   n1,
                   2)
                OVER (
                   ORDER BY n1 DESC
                   ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
                   n1
           FROM t)
SELECT *
  FROM q1
 WHERE ROWNUM < 2;

Solution 1

Solution 1 shows your interviewer that you know about new 11g Oracle features.  Solution 1 also has the benefit that it is readily changeable to obtain the third, fourth, or 100th top value from a table.  Nevertheless, I wouldn’t recommend offering this solution as it uses relatively obscure Oracle-specific syntax and may be viewed as overcomplicated.  This solution has a second disadvantage:  it isn’t correct!

Let me add an 11th row to the table:

INSERT INTO t
VALUES (10);

After adding this 11th row, “solution” 1 now returns 10.  But the second largest value is still 9, despite the fact that there are two rows with a value of 10.

You may try to impress on your interviewer a mastery of analytic functions.  How about this solution:

WITH q1 AS (SELECT n1, DENSE_RANK () OVER (ORDER BY n1 DESC) rnk FROM t)
SELECT n1
FROM q1
WHERE rnk = 2;

Solution 2

Solution 2 returns a value of 9 despite the presence of two rows with a value of 10. You might explain that the runtime engine will not, in practice, sort all of the rows in the table; it will just maintain the two largest values as it scans the table.  You explain that were the table to hold 11 million rows rather than 11 this may be an important consideration.  I don’t commend this solution to you either as it is also incorrect!

let us add a twelfth row to the table to demonstrate one of two flaws:

INSERT INTO t
VALUES (9);

Now our query returns two rows with a value of 9.  We need to limit the rows to one.

WITH q1 AS (SELECT n1, DENSE_RANK () OVER (ORDER BY n1 DESC) rnk FROM t)
SELECT n1
FROM q1
WHERE rnk = 2 AND ROWNUM = 1;

Solution 3

But even now the query is still incorrect.  Let me add a lucky 13th row to our table.

INSERT INTO t
VALUES (NULL);

Now solution 3 returns a value of 10!  If there is no NOT NULL constraint on our column we need to clarify the problem statement.  When a column is null, its value is unknown or unspecified; neither equality predicates nor inequality predicates will evaluate to true since we cannot assert that the column value is equal to anything (not even NULL) nor can we assert that it is not equal to anything (even NULL).  So how can we assert that the value (were it specified) would not be the largest or the second largest value?  We cannot.  If we are to “respect nulls” we have to return NULL as our second largest value, since the second largest value would be unknown.

However, let us assume (as do most aggregate functions in SQL) that we are to “ignore nulls” and return the second largest value from the subset of rows that are non-null.  We can now recode our solution to function correctly in the presence of duplicate values and NULLs.

WITH q1
AS (SELECT n1, DENSE_RANK () OVER (ORDER BY n1 DESC NULLS LAST) rnk
FROM t)
SELECT n1
FROM q1
WHERE rnk = 2 AND ROWNUM = 1;

UPDATE

The above is actually incorrect as no row is returned when there are fewer than two distinct values. One way to correct this is as follows:

WITH q1
AS (SELECT n1, DENSE_RANK () OVER (ORDER BY n1 DESC NULLS LAST) rnk
FROM t)
SELECT max(n1)
FROM q1
WHERE rnk = 2 AND ROWNUM = 1;
 

Solution 4

As far as I can see, solution 4 is semantically correct.

But is solution 4 the “best” solution?  How about this:

SELECT MAX (n1) n1
FROM t
WHERE n1 < (SELECT MAX (n1) FROM t);

Solution 5

Solution 5 handles duplicate values and nulls correctly.  It also avoids any sorts, uses no Oracle-specific SQL extensions and is very easy to read.

Although solution 5 avoids sorting large numbers of rows just like solution 4, will it be as fast?  In some cases, solution 5 will involve two full table scans and as a consequence might be a lot slower on a large table.

But this is not necessarily the case.  Let us add an index to our table:

CREATE INDEX i1
ON t (n1);

Once the index has been added the execution plan for solution 5 becomes this:

-------------------------------------------------------------------
| Id  | Operation                     | Name | Rows  | Cost (%CPU)|
-------------------------------------------------------------------
|   0 | SELECT STATEMENT              |      |     1 |     1   (0)|
|   1 |  SORT AGGREGATE               |      |     1 |            |
|   2 |   FIRST ROW                   |      |     1 |     1   (0)|
|*  3 |    INDEX RANGE SCAN (MIN/MAX) | I1   |     1 |     1   (0)|
|   4 |     SORT AGGREGATE            |      |     1 |            |
|   5 |      INDEX FULL SCAN (MIN/MAX)| I1   |     1 |     1   (0)|
-------------------------------------------------------------------

What this execution plan shows is that when the runtime engine executes solution 5 on an indexed column it will pick the maximum value of the column from the right edge of the index and then pick the second largest value by looking at the entry immediately preceding that index entry (or entries).  No full scan of the index or table is required.  This means that when the column is indexed, solution 5 is likely to return the second largest value in milliseconds.  So in the case of an indexed column solution 5 will likely outperform solution 4 on large tables and in the case of an unindexed column solution 4 will likely outperform solution 5.

Of course, it would also be possible to devise solutions to our problem statement based on MATCH RECOGNIZE, the MODEL clause, or even user written aggregate functions. I would be very surprised if any such solutions would be simpler than those shown in this post.  Nor is it likely that such solutions would outperform solutions 4 and 5 posted here.

If you enjoyed this, feel free to suggest other topics in the vein in a reply.

Advertisements

4 Comments »

  1. No sorting.
    No index.

    with
    function second
    return number
    is
    l_first number;
    l_second number;
    begin
    for rec_t in (
    select n1 from t
    where n1 is not null
    )
    loop
    if l_first is null
    then
    l_first := rec_t.n1;
    l_second := rec_t.n1;
    end if;

    if rec_t.n1 > l_first
    then
    l_second := l_first;
    l_first := rec_t.n1;
    else
    if rec_t.n1 l_second
    then
    l_second := rec_t.n1;
    end if;
    end if;
    end loop;
    return l_second;
    end;
    select second from dual
    /

    Comment by exagriddba — June 14, 2015 @ 1:40 am | Reply

    • A bit lengthy, but nice command of 12c features! This should have similar performance characteristics as DENSE_RANK: takes longer the larger the table but one pass and (as you say) no sort.

      Thanks for the contribution.

      Comment by tonyhasler — June 14, 2015 @ 6:30 pm | Reply

  2. My previous comment got cut off, so I’m trying again… I guess the “less than” sign was treated as an html tag start, so I’ll replace it with words…

    Hi Tony.
    Thanks for a fun and thought-provoking post.

    Here is another solution for the indexed case (let’s call it solution 7, assuming Brian’s solution is number 6):

    SELECT MIN(n1)
    FROM (
    SELECT /*+ index_desc(t,i1) */ DISTINCT n1
    FROM t
    WHERE n1 IS NOT NULL
    ORDER BY n1 DESC)
    WHERE rownum “less than or equal to” 2

    Solution 7 will likely outperform solution 5 (slightly) on large tables (unless the table contains many many records with the highest value).

    Solution 5 does 2 index lookups, which means about 2*H logical reads, where H is the index height.
    Solution 7 does 1 index lookup, and then a descending range scan until the second highest value is found. If there aren’t too many records with the highest value, then good chances are that the second highest value is in the same index block as the highest value, so it will take about H logical reads. Even if it needs to visit one more index block, it will be H+1 which is still less than 2*H (for H>1).

    Solution 7 can be used also for looking the Nth highest value, where N>2, and the performance will still be good if N is not too big.

    Thanks,
    Oren.

    Comment by Oren Nakdimon (@DBoriented) — August 5, 2015 @ 6:21 am | Reply

    • Thanks for that Oren. Really clever!

      Comment by tonyhasler — August 6, 2015 @ 3:29 pm | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Blog at WordPress.com.

%d bloggers like this: