This is my first blog for a very very long time. I stopped blogging a couple of years ago when I started working on my book and after I finally finished the book I felt I needed a break to recuperate.
However, I have finally decided to put my toe back in the water. This isn’t a very serious post but I hope you find it thought provoking and a bit of fun.
The thought behind this post came from a colleague who had just asked a candidate for employment the following question in order to help assess their command of the SQL syntax.
“How would you select the second largest value from a column in a table”.
A simple enough question. Or so it seems. You may be able to come up with several answers yourself. But what is the “best” answer?
There are others that have opined on this subject before. Here is one example. This post is just my thoughts on the topic.
Let me begin by creating a test table with the values 1 t 10 as the basis for some examples:
CREATE TABLE t AS SELECT ROWNUM n1 FROM DUAL CONNECT BY LEVEL <= 10;
Here is one possible way to find the second highest value:
WITH q1 AS (SELECT NTH_VALUE ( n1, 2) OVER ( ORDER BY n1 DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) n1 FROM t) SELECT * FROM q1 WHERE ROWNUM < 2;
Solution 1 shows your interviewer that you know about new 11g Oracle features. Solution 1 also has the benefit that it is readily changeable to obtain the third, fourth, or 100th top value from a table. Nevertheless, I wouldn’t recommend offering this solution as it uses relatively obscure Oracle-specific syntax and may be viewed as overcomplicated. This solution has a second disadvantage: it isn’t correct!
Let me add an 11th row to the table:
INSERT INTO t VALUES (10);
After adding this 11th row, “solution” 1 now returns 10. But the second largest value is still 9, despite the fact that there are two rows with a value of 10.
You may try to impress on your interviewer a mastery of analytic functions. How about this solution:
WITH q1 AS (SELECT n1, DENSE_RANK () OVER (ORDER BY n1 DESC) rnk FROM t) SELECT n1 FROM q1 WHERE rnk = 2;
Solution 2 returns a value of 9 despite the presence of two rows with a value of 10. You might explain that the runtime engine will not, in practice, sort all of the rows in the table; it will just maintain the two largest values as it scans the table. You explain that were the table to hold 11 million rows rather than 11 this may be an important consideration. I don’t commend this solution to you either as it is also incorrect!
let us add a twelfth row to the table to demonstrate one of two flaws:
INSERT INTO t VALUES (9);
Now our query returns two rows with a value of 9. We need to limit the rows to one.
WITH q1 AS (SELECT n1, DENSE_RANK () OVER (ORDER BY n1 DESC) rnk FROM t) SELECT n1 FROM q1 WHERE rnk = 2 AND ROWNUM = 1;
But even now the query is still incorrect. Let me add a lucky 13th row to our table.
INSERT INTO t VALUES (NULL);
Now solution 3 returns a value of 10! If there is no NOT NULL constraint on our column we need to clarify the problem statement. When a column is null, its value is unknown or unspecified; neither equality predicates nor inequality predicates will evaluate to true since we cannot assert that the column value is equal to anything (not even NULL) nor can we assert that it is not equal to anything (even NULL). So how can we assert that the value (were it specified) would not be the largest or the second largest value? We cannot. If we are to “respect nulls” we have to return NULL as our second largest value, since the second largest value would be unknown.
However, let us assume (as do most aggregate functions in SQL) that we are to “ignore nulls” and return the second largest value from the subset of rows that are non-null. We can now recode our solution to function correctly in the presence of duplicate values and NULLs.
WITH q1 AS (SELECT n1, DENSE_RANK () OVER (ORDER BY n1 DESC NULLS LAST) rnk FROM t) SELECT n1 FROM q1 WHERE rnk = 2 AND ROWNUM = 1;
The above is actually incorrect as no row is returned when there are fewer than two distinct values. One way to correct this is as follows:
WITH q1 AS (SELECT n1, DENSE_RANK () OVER (ORDER BY n1 DESC NULLS LAST) rnk FROM t) SELECT max(n1) FROM q1 WHERE rnk = 2 AND ROWNUM = 1;
As far as I can see, solution 4 is semantically correct.
But is solution 4 the “best” solution? How about this:
SELECT MAX (n1) n1 FROM t WHERE n1 < (SELECT MAX (n1) FROM t);
Solution 5 handles duplicate values and nulls correctly. It also avoids any sorts, uses no Oracle-specific SQL extensions and is very easy to read.
Although solution 5 avoids sorting large numbers of rows just like solution 4, will it be as fast? In some cases, solution 5 will involve two full table scans and as a consequence might be a lot slower on a large table.
But this is not necessarily the case. Let us add an index to our table:
CREATE INDEX i1 ON t (n1);
Once the index has been added the execution plan for solution 5 becomes this:
------------------------------------------------------------------- | Id | Operation | Name | Rows | Cost (%CPU)| ------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 1 (0)| | 1 | SORT AGGREGATE | | 1 | | | 2 | FIRST ROW | | 1 | 1 (0)| |* 3 | INDEX RANGE SCAN (MIN/MAX) | I1 | 1 | 1 (0)| | 4 | SORT AGGREGATE | | 1 | | | 5 | INDEX FULL SCAN (MIN/MAX)| I1 | 1 | 1 (0)| -------------------------------------------------------------------
What this execution plan shows is that when the runtime engine executes solution 5 on an indexed column it will pick the maximum value of the column from the right edge of the index and then pick the second largest value by looking at the entry immediately preceding that index entry (or entries). No full scan of the index or table is required. This means that when the column is indexed, solution 5 is likely to return the second largest value in milliseconds. So in the case of an indexed column solution 5 will likely outperform solution 4 on large tables and in the case of an unindexed column solution 4 will likely outperform solution 5.
Of course, it would also be possible to devise solutions to our problem statement based on MATCH RECOGNIZE, the MODEL clause, or even user written aggregate functions. I would be very surprised if any such solutions would be simpler than those shown in this post. Nor is it likely that such solutions would outperform solutions 4 and 5 posted here.
If you enjoyed this, feel free to suggest other topics in the vein in a reply.