A while ago I started to investigate the possibility of using parallel query on some large tables. These tables were partitioned by date and joined together, and it struck me that by subpartitioning on the join column I could get good results.
As I had learned from Christian Antognini’s excellent book, this would allow a “Full Partition-wise Join”. Let me describe how this should have helped in my case.
Let us assume that we are joining two tables T1 and T2, both partitioned by range on a date column D1 and subpartitioned by hash on the join column J1. It should be possible for a parallel query slave to join one subpartition of T1 to the corresponding subpartition of T2 without any need to communicate with the other parallel query slaves.
So I wrote a test script:
set autotrace off
DROP TABLE T1;
CREATE TABLE T1(D1 DATE, J1 CHAR(2000), C1 CHAR(2000))
PCTFREE 99 -- To make the table a reasonable size
PARTITION BY RANGE (D1)
SUBPARTITION BY HASH (J1)
SUBPARTITIONS 8 (PARTITION P1
VALUES LESS THAN (DATE '2010-01-02')
,PARTITION P2
VALUES LESS THAN (DATE '2010-01-03')
,PARTITION P3
VALUES LESS THAN (DATE '2010-01-04')
,PARTITION P4
VALUES LESS THAN (DATE '2010-01-05')
,PARTITION P5
VALUES LESS THAN (DATE '2010-01-06'))
PARALLEL(DEGREE 4);
DROP TABLE T2;
CREATE TABLE T2(D1 DATE, J1 CHAR(2000), C1 CHAR(2000))
PCTFREE 99 -- To make the table a reasonable size
PARTITION BY RANGE (D1)
SUBPARTITION BY HASH (J1)
SUBPARTITIONS 8 (PARTITION P1
VALUES LESS THAN (DATE '2010-01-02')
,PARTITION P2
VALUES LESS THAN (DATE '2010-01-03')
,PARTITION P3
VALUES LESS THAN (DATE '2010-01-04')
,PARTITION P4
VALUES LESS THAN (DATE '2010-01-05')
,PARTITION P5
VALUES LESS THAN (DATE '2010-01-06'))
PARALLEL(DEGREE 4);
INSERT INTO T1(D1, J1, C1)
SELECT DATE '2010-01-03', ROWNUM, ROWNUM
FROM DUAL
CONNECT BY LEVEL <= 30000;
INSERT INTO T2(D1, J1, C1)
SELECT DATE '2010-01-03', ROWNUM, ROWNUM
FROM DUAL
CONNECT BY LEVEL <= 30000;
BEGIN
DBMS_STATS.GATHER_TABLE_STATS(USER
,'T1'
,estimate_percent => 100);
DBMS_STATS.GATHER_TABLE_STATS(USER
,'T2'
,estimate_percent => 100);
END;
/
set autotrace traceonly
set timing on
ALTER SESSION SET EVENTS '10053 trace name context forever';
--
-- First let us try without the hint
--
SELECT COUNT( * )
FROM T1, T2
WHERE t1.d1 = DATE '2010-01-03'
AND t2.d1 = DATE '2010-01-03'
AND t1.j1 = t2.j1;
--
-- And now hinted
--
SELECT /*+ leading(t1, t2) pq_distribute(t2 none none) */
COUNT( * )
FROM T1, T2
WHERE t1.d1 = DATE '2010-01-03'
AND t2.d1 = DATE '2010-01-03'
AND t1.j1 = t2.j1;
ALTER SESSION SET EVENTS '10053 trace name context off';
set autotrace off
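As a quick sanity check on the layout, the data dictionary should show eight hash subpartitions under each of the five range partitions of each table. Something like this (an aside on my part; USER_TAB_SUBPARTITIONS is the standard dictionary view) will confirm it:
SELECT table_name, partition_name, COUNT(*) AS subpartition_count
  FROM user_tab_subpartitions
 WHERE table_name IN ('T1', 'T2')
 GROUP BY table_name, partition_name
 ORDER BY table_name, partition_name;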
Although parallel query was deployed, I got a sub-optimal distribution method that not only took longer but used twice as many parallel query slaves as necessary unless I added hints. First, the unhinted execution plan:
----------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| P
----------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 4018 | 2090 (1)| 00:00:11 | |
| 1 | SORT AGGREGATE | | 1 | 4018 | | | |
| 2 | PX COORDINATOR | | | | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10001 | 1 | 4018 | | | |
| 4 | SORT AGGREGATE | | 1 | 4018 | | | |
|* 5 | HASH JOIN | | 30000 | 114M| 2090 (1)| 00:00:11 | |
| 6 | PX RECEIVE | | 30000 | 57M| 1044 (0)| 00:00:06 | |
| 7 | PX SEND BROADCAST LOCAL| :TQ10000 | 30000 | 57M| 1044 (0)| 00:00:06 | |
| 8 | PX BLOCK ITERATOR | | 30000 | 57M| 1044 (0)| 00:00:06 | 1 |
|* 9 | TABLE ACCESS FULL | T2 | 30000 | 57M| 1044 (0)| 00:00:06 | 17 |
| 10 | PX BLOCK ITERATOR | | 30000 | 57M| 1044 (0)| 00:00:06 | 1 |
|* 11 | TABLE ACCESS FULL | T1 | 30000 | 57M| 1044 (0)| 00:00:06 | 17 |
----------------------------------------------------------------------------------------------------
and now hinted:
----------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart|
----------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 4018 | 2090 (1)| 00:00:11 | |
| 1 | SORT AGGREGATE | | 1 | 4018 | | | |
| 2 | PX COORDINATOR | | | | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10000 | 1 | 4018 | | | |
| 4 | SORT AGGREGATE | | 1 | 4018 | | | |
| 5 | PX PARTITION HASH ALL | | 30000 | 114M| 2090 (1)| 00:00:11 | 1 |
|* 6 | HASH JOIN | | 30000 | 114M| 2090 (1)| 00:00:11 | |
| 7 | PX PARTITION RANGE SINGLE| | 30000 | 57M| 1044 (0)| 00:00:06 | 3 |
|* 8 | TABLE ACCESS FULL | T1 | 30000 | 57M| 1044 (0)| 00:00:06 | 17 |
| 9 | PX PARTITION RANGE SINGLE| | 30000 | 57M| 1044 (0)| 00:00:06 | 3 |
|* 10 | TABLE ACCESS FULL | T2 | 30000 | 57M| 1044 (0)| 00:00:06 | 17 |
----------------------------------------------------------------------------------------------------
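Incidentally, you can see the extra slave set directly in the plans: the unhinted plan has two table queues (:TQ10000 and :TQ10001) and therefore needs two slave sets of four, whereas the hinted plan has just :TQ10000. To confirm this at run time, querying V$PQ_TQSTAT from the same session immediately after each query is one way (a suggestion of mine, not part of the original test):
SELECT dfo_number, tq_id, server_type, process, num_rows
  FROM v$pq_tqstat
 ORDER BY dfo_number, tq_id, server_type, process;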
The curious thing was that no matter what distribution mechanism I picked, the cost was the same!
I asked Christian about this and his full reply came in this blog post. Christian was able to explain why the estimated costs in the execution plan were misleading, and he further suggested that my problem with the wrong distribution mechanism was due to partition elimination. I changed my query to remove the date predicate, but this didn’t help.
So I took Christian’s lead and looked at the 10053 trace file. The following is an extract of the relevant section of the trace file for the unhinted query:
-- Enumerating distribution methods for #Hash Join:
---- cost NONE = 0.00
Outer table:
resc: 3759.85 card 30000.00 bytes: 2009 deg: 4 resp: 1044.40
Inner table: T2 Alias: T2
resc: 3759.85 card: 30000.00 bytes: 2009 deg: 4 resp: 1044.40
using dmeth: 129 #groups: 1
Cost per ptn: 0.62 #ptns: 8
hash_area: 16430 (max=82150) buildfrag: 926 probefrag: 926 passes: 1
Hash join: Resc: 7524.65 Resp: 2090.04 [multiMatchCost=0.00]
---- cost(Hash Join) = 2090.04 (w/o dist), 2090.04 (w/ dist)
---- cost VALUE = 16.16
---- cost with slave mapping = 6.07
Outer table:
resc: 3759.85 card 30000.00 bytes: 2009 deg: 4 resp: 1044.40
Inner table: T2 Alias: T2
resc: 3759.85 card: 30000.00 bytes: 2009 deg: 4 resp: 1044.40
using dmeth: 2 #groups: 1
Cost per ptn: 0.74 #ptns: 4
hash_area: 16430 (max=82150) buildfrag: 1851 probefrag: 1851 passes: 1
Hash join: Resc: 7522.65 Resp: 2089.54 [multiMatchCost=0.00]
---- cost(Hash Join) = 2089.54 (w/o dist), 2095.61 (w/ dist)
---- cost PARTITION-RIGHT = 4.04
Outer table:
resc: 3759.85 card 30000.00 bytes: 2009 deg: 4 resp: 1044.40
Inner table: T2 Alias: T2
resc: 3759.85 card: 30000.00 bytes: 2009 deg: 4 resp: 1044.40
using dmeth: 192 #groups: 1
Cost per ptn: 0.62 #ptns: 8
hash_area: 16430 (max=82150) buildfrag: 926 probefrag: 926 passes: 1
Hash join: Resc: 7524.65 Resp: 2090.04 [multiMatchCost=0.00]
---- cost(Hash Join) = 2090.04 (w/o dist), 2094.08 (w/ dist)
---- cost PARTITION-LEFT = 4.04
Outer table:
resc: 3759.85 card 30000.00 bytes: 2009 deg: 4 resp: 1044.40
Inner table: T2 Alias: T2
resc: 3759.85 card: 30000.00 bytes: 2009 deg: 4 resp: 1044.40
using dmeth: 160 #groups: 1
Cost per ptn: 0.62 #ptns: 8
hash_area: 16430 (max=82150) buildfrag: 926 probefrag: 926 passes: 1
Hash join: Resc: 7524.65 Resp: 2090.04 [multiMatchCost=0.00]
---- cost(Hash Join) = 2090.04 (w/o dist), 2094.08 (w/ dist)
---- cost BROADCAST-RIGHT = 31.94
---- cost with slave mapping = 0.00
Outer table:
resc: 3759.85 card 30000.00 bytes: 2009 deg: 4 resp: 1044.40
Inner table: T2 Alias: T2
resc: 3759.85 card: 30000.00 bytes: 2009 deg: 4 resp: 1044.40
using dmeth: 8 #groups: 8
Cost per ptn: 0.69 #ptns: 4
hash_area: 16430 (max=82150) buildfrag: 1851 probefrag: 926 passes: 1
Hash join: Resc: 7522.46 Resp: 2089.50 [multiMatchCost=0.00]
Outer table:
resc: 3759.85 card 30000.00 bytes: 2009 deg: 4 resp: 1044.40
Inner table: T1 Alias: T1
resc: 3759.85 card: 30000.00 bytes: 2009 deg: 4 resp: 1044.40
using dmeth: 16 #groups: 8
Cost per ptn: 0.67 #ptns: 4
hash_area: 16430 (max=82150) buildfrag: 926 probefrag: 1851 passes: 1
Hash join: Resc: 7522.36 Resp: 2089.47 [multiMatchCost=0.00]
---- cost(Hash Join) = 2089.47 (w/o dist), 2089.47 (w/ dist)
---- cost BROADCAST-LEFT = 31.94
---- cost with slave mapping = 0.00
Outer table:
resc: 3759.85 card 30000.00 bytes: 2009 deg: 4 resp: 1044.40
Inner table: T2 Alias: T2
resc: 3759.85 card: 30000.00 bytes: 2009 deg: 4 resp: 1044.40
using dmeth: 16 #groups: 8
Cost per ptn: 0.67 #ptns: 4
hash_area: 16430 (max=82150) buildfrag: 926 probefrag: 1851 passes: 1
Hash join: Resc: 7522.36 Resp: 2089.47 [multiMatchCost=0.00]
---- cost(Hash Join) = 2089.47 (w/o dist), 2089.47 (w/ dist)
(newjo-save) [1 0 ]
Final - All Rows Plan: Best join order: 1
Cost: 2089.5431 Degree: 4 Card: 30000.0000 Bytes: 120540000
Resc: 7522.6511 Resc_io: 7518.0000 Resc_cpu: 36316296
Resp: 2089.5431 Resp_io: 2088.3333 Resc_cpu: 9445741
You can see that for the desired plan (NONE, shown first) the cost with distribution (2090.04) is the same as the cost without (2090.04), and that for the other distribution methods the distribution adds either nothing or a positive amount. However, the total cost for some of the other distribution methods is lower! This is because the cost without the distribution is not constant. That in turn seems to be because for some distribution methods the number of partitions (subpartitions in our case) has been correctly calculated as 8, while for others it has been set to 4, the degree of parallelism! I confirmed that when I altered the degree of parallelism this number changed accordingly.
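To see where the numbers come from, here is my reconstruction of the arithmetic from the trace figures (an educated guess on my part, not documented behaviour): each join's Resp appears to be the two table scan costs plus the per-partition join cost multiplied by #ptns and divided by the degree of parallelism:
NONE      (dmeth 129): 1044.40 + 1044.40 + (0.62 x 8)/4 = 2090.04
VALUE     (dmeth 2):   1044.40 + 1044.40 + (0.74 x 4)/4 = 2089.54
BROADCAST (dmeth 16):  1044.40 + 1044.40 + (0.67 x 4)/4 = 2089.47
On this reading, the whole 0.50 gap between NONE and VALUE comes from the "Cost per ptn" term, and the only reason that term is larger for NONE is that #ptns is 8 there and 4 elsewhere.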
One other oddity: although the selected distribution mechanism has a cost (2089.5431) that is less than 2090.04, it is not the lowest in the trace (2089.47). But that is a problem for another day.
It struck me that a good workaround for what seems to be a bug is to ensure that the number of subpartitions and the degree of parallelism are the same. Let us try it:
ALTER TABLE t1 PARALLEL(DEGREE 8);
ALTER TABLE t2 PARALLEL(DEGREE 8);
Voila! The correct plan is produced unhinted.
----------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart|
----------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 4018 | 1045 (1)| 00:00:06 | |
| 1 | SORT AGGREGATE | | 1 | 4018 | | | |
| 2 | PX COORDINATOR | | | | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10000 | 1 | 4018 | | | |
| 4 | SORT AGGREGATE | | 1 | 4018 | | | |
| 5 | PX PARTITION HASH ALL | | 30000 | 114M| 1045 (1)| 00:00:06 | 1 |
|* 6 | HASH JOIN | | 30000 | 114M| 1045 (1)| 00:00:06 | |
| 7 | PX PARTITION RANGE SINGLE| | 30000 | 57M| 522 (0)| 00:00:03 | 3 |
|* 8 | TABLE ACCESS FULL | T1 | 30000 | 57M| 522 (0)| 00:00:03 | 17 |
| 9 | PX PARTITION RANGE SINGLE| | 30000 | 57M| 522 (0)| 00:00:03 | 3 |
|* 10 | TABLE ACCESS FULL | T2 | 30000 | 57M| 522 (0)| 00:00:03 | 17 |
----------------------------------------------------------------------------------------------------
So here is the tip of the day:
Set the number of partitions or subpartitions and the degree of parallelism to be the same when performing full partition-wise joins.
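If you want to find tables in an existing schema where the two differ, a join between USER_PART_TABLES and USER_TABLES is a reasonable starting point (just a sketch; note that DEGREE in USER_TABLES is stored as a blank-padded string):
SELECT t.table_name,
       TRIM(t.degree) AS degree,
       p.def_subpartition_count
  FROM user_part_tables p
  JOIN user_tables t ON t.table_name = p.table_name
 WHERE p.subpartitioning_type = 'HASH';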
I have reproduced the results in this blog entry on both 10.2.0.4 and 11.1.0.6.
Tried with 11.2.0.1. No such problem: the partition-wise join is used already with the first combination and without hints. Seems like your example is missing the stats gathering for t1.
Comment by Timo Raitalaakso — March 10, 2010 @ 7:44 pm
Timo,
Thanks for your comments. There was a cut-and-paste error in the stats gathering for T1 (now fixed). Glad to hear that the problem is fixed in 11.2. I haven’t been able to get 11.2 up and running yet (I can’t download a suitable copy of SUSE!), so this was a useful update.
Comment by tonyhasler — March 11, 2010 @ 1:08 pm
With 11.1.0.7 the behaviour is as described here for 10.2.0.4 and 11.1.0.6. For a trial installation of 11.2.0.1 I am using CentOS 5.4. There is still an extra ) in the insert into t2.
Comment by Timo Raitalaakso — March 11, 2010 @ 7:14 pm
Second typo fixed. Thanks!
Comment by tonyhasler — March 12, 2010 @ 9:14 pm
[…] 20-Degree of Parallelism and partition/subpartition relationship? Tony Hasler-Parallel Query Distribution Methods […]
Pingback by Blogroll Report 05/02/2010 – 12/03/2010 « Coskan’s Approach to Oracle — April 23, 2010 @ 1:08 am
Hi,
I’m looking for courses and books about parallel query, distributed databases, query processing in distributed databases, and so on.
If you find any, please mail me.
Thanks.
Comment by raspina — July 25, 2010 @ 8:23 am
I am not sure that you will find a book entirely dedicated to parallel processing, but I can recommend an excellent book by Christian Antognini that contains a substantial section on this topic with downloadable examples. Check it out here:
http://antognini.ch/top/
Comment by tonyhasler — July 25, 2010 @ 1:20 pm
[…] Parallel query distribution methods by Tony Hasler […]
Pingback by Divide to conquer: Parallel load distribution and balance | ocpdba oracle weblog — July 12, 2011 @ 4:28 pm