Tony’s Oracle Tips

March 10, 2010

Parallel query distribution methods

Filed under: SQL Fun — tonyhasler @ 4:38 pm

A while ago I started to investigate the possibility of using parallel query on some large tables. These tables were partitioned by date and joined together, and it struck me that by subpartitioning on the join column I could get good results.

As I had learned from Christian Antognini’s excellent book, this would allow “full partition-wise joins”. Let me describe how this should have helped in my case.

Let us assume that we are joining two tables, T1 and T2, both partitioned by range on a date column D1 and subpartitioned by hash on the join column J1. It should be possible for a parallel query slave to join one subpartition of T1 to the corresponding subpartition of T2 without any need to communicate with the other parallel query slaves.

So I wrote a test script:


set autotrace off
DROP TABLE T1;

CREATE TABLE T1(D1 DATE, J1 CHAR(2000), C1 CHAR(2000))
PCTFREE 99 -- To make the table a reasonable size
PARTITION BY RANGE (D1)
   SUBPARTITION BY HASH (J1)
      SUBPARTITIONS 8 (PARTITION P1
                          VALUES LESS THAN (DATE '2010-01-02')
                      ,PARTITION P2
                          VALUES LESS THAN (DATE '2010-01-03')
                      ,PARTITION P3
                          VALUES LESS THAN (DATE '2010-01-04')
                      ,PARTITION P4
                          VALUES LESS THAN (DATE '2010-01-05')
                      ,PARTITION P5
                          VALUES LESS THAN (DATE '2010-01-06'))

PARALLEL(DEGREE 4);

DROP TABLE T2;

CREATE TABLE T2(D1 DATE, J1 CHAR(2000), C1 CHAR(2000))
PCTFREE 99 -- To make the table a reasonable size
PARTITION BY RANGE (D1)
   SUBPARTITION BY HASH (J1)
      SUBPARTITIONS 8 (PARTITION P1
                          VALUES LESS THAN (DATE '2010-01-02')
                      ,PARTITION P2
                          VALUES LESS THAN (DATE '2010-01-03')
                      ,PARTITION P3
                          VALUES LESS THAN (DATE '2010-01-04')
                      ,PARTITION P4
                          VALUES LESS THAN (DATE '2010-01-05')
                      ,PARTITION P5
                          VALUES LESS THAN (DATE '2010-01-06'))

PARALLEL(DEGREE 4);

INSERT INTO T1(D1, J1, C1)
       SELECT   DATE '2010-01-03', ROWNUM, ROWNUM
         FROM   DUAL
   CONNECT BY   LEVEL <= 30000;

INSERT INTO T2(D1, J1, C1)
       SELECT   DATE '2010-01-03', ROWNUM, ROWNUM
         FROM   DUAL
   CONNECT BY   LEVEL <= 30000;

BEGIN
   DBMS_STATS.GATHER_TABLE_STATS(USER
                                ,'T1'
                                ,estimate_percent => 100);
   DBMS_STATS.GATHER_TABLE_STATS(USER
                                ,'T2'
                                ,estimate_percent => 100);
END;
/

set autotrace traceonly
set timing on
ALTER SESSION SET EVENTS '10053 trace name context forever';
--
-- First let us try without the hint
--

SELECT   COUNT( * )
  FROM   T1, T2
 WHERE       t1.d1 = DATE '2010-01-03'
         AND t2.d1 = DATE '2010-01-03'
         AND t1.j1 = t2.j1;

--
-- And now hinted
--

SELECT /*+ leading(t1, t2) pq_distribute(t2 none none) */
      COUNT( * )
  FROM   T1, T2
 WHERE       t1.d1 = DATE '2010-01-03'
         AND t2.d1 = DATE '2010-01-03'
         AND t1.j1 = t2.j1;

ALTER SESSION SET EVENTS '10053 trace name context off';
set autotrace off

Although parallel query was deployed, I got a sub-optimal distribution method that not only took longer but also used twice as many parallel query slaves as necessary, unless I added hints. First, the unhinted execution plan:


----------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| P
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |          |     1 |  4018 |  2090   (1)| 00:00:11 |       |
|   1 |  SORT AGGREGATE               |          |     1 |  4018 |            |          |       |
|   2 |   PX COORDINATOR              |          |       |       |            |          |       |
|   3 |    PX SEND QC (RANDOM)        | :TQ10001 |     1 |  4018 |            |          |       |
|   4 |     SORT AGGREGATE            |          |     1 |  4018 |            |          |       |
|*  5 |      HASH JOIN                |          | 30000 |   114M|  2090   (1)| 00:00:11 |       |
|   6 |       PX RECEIVE              |          | 30000 |    57M|  1044   (0)| 00:00:06 |       |
|   7 |        PX SEND BROADCAST LOCAL| :TQ10000 | 30000 |    57M|  1044   (0)| 00:00:06 |       |
|   8 |         PX BLOCK ITERATOR     |          | 30000 |    57M|  1044   (0)| 00:00:06 |     1 |
|*  9 |          TABLE ACCESS FULL    | T2       | 30000 |    57M|  1044   (0)| 00:00:06 |    17 |
|  10 |       PX BLOCK ITERATOR       |          | 30000 |    57M|  1044   (0)| 00:00:06 |     1 |
|* 11 |        TABLE ACCESS FULL      | T1       | 30000 |    57M|  1044   (0)| 00:00:06 |    17 |
----------------------------------------------------------------------------------------------------

and now hinted:


----------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Pstart|
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |          |     1 |  4018 |  2090   (1)| 00:00:11 |       |
|   1 |  SORT AGGREGATE                 |          |     1 |  4018 |            |          |       |
|   2 |   PX COORDINATOR                |          |       |       |            |          |       |
|   3 |    PX SEND QC (RANDOM)          | :TQ10000 |     1 |  4018 |            |          |       |
|   4 |     SORT AGGREGATE              |          |     1 |  4018 |            |          |       |
|   5 |      PX PARTITION HASH ALL      |          | 30000 |   114M|  2090   (1)| 00:00:11 |     1 |
|*  6 |       HASH JOIN                 |          | 30000 |   114M|  2090   (1)| 00:00:11 |       |
|   7 |        PX PARTITION RANGE SINGLE|          | 30000 |    57M|  1044   (0)| 00:00:06 |     3 |
|*  8 |         TABLE ACCESS FULL       | T1       | 30000 |    57M|  1044   (0)| 00:00:06 |    17 |
|   9 |        PX PARTITION RANGE SINGLE|          | 30000 |    57M|  1044   (0)| 00:00:06 |     3 |
|* 10 |         TABLE ACCESS FULL       | T2       | 30000 |    57M|  1044   (0)| 00:00:06 |    17 |
----------------------------------------------------------------------------------------------------

The curious thing was that, no matter which distribution mechanism I picked, the cost was the same!

I asked Christian about this, and his full reply came in this blog post. Christian was able to explain why the estimated costs in the execution plan were misleading, and he further suggested that my problem with the wrong distribution mechanism was due to partition elimination. I changed my query to remove the date predicate, but this didn’t help.

So I took Christian’s lead and looked at the 10053 trace file. The following is an extract of the relevant section of the trace file for the un-hinted query:


-- Enumerating distribution methods for #Hash Join:
---- cost NONE = 0.00
  Outer table:
    resc: 3759.85  card 30000.00  bytes: 2009  deg: 4  resp: 1044.40
  Inner table: T2  Alias: T2
    resc: 3759.85  card: 30000.00  bytes: 2009  deg: 4  resp: 1044.40
    using dmeth: 129  #groups: 1
    Cost per ptn: 0.62  #ptns: 8
    hash_area: 16430 (max=82150)       buildfrag: 926  probefrag: 926  passes: 1
  Hash join: Resc: 7524.65  Resp: 2090.04  [multiMatchCost=0.00]
---- cost(Hash Join) = 2090.04 (w/o dist), 2090.04 (w/ dist)
---- cost VALUE = 16.16
---- cost with slave mapping = 6.07
  Outer table:
    resc: 3759.85  card 30000.00  bytes: 2009  deg: 4  resp: 1044.40
  Inner table: T2  Alias: T2
    resc: 3759.85  card: 30000.00  bytes: 2009  deg: 4  resp: 1044.40
    using dmeth: 2  #groups: 1
    Cost per ptn: 0.74  #ptns: 4
    hash_area: 16430 (max=82150)       buildfrag: 1851  probefrag: 1851  passes: 1
  Hash join: Resc: 7522.65  Resp: 2089.54  [multiMatchCost=0.00]
---- cost(Hash Join) = 2089.54 (w/o dist), 2095.61 (w/ dist)
---- cost PARTITION-RIGHT = 4.04
  Outer table:
    resc: 3759.85  card 30000.00  bytes: 2009  deg: 4  resp: 1044.40
  Inner table: T2  Alias: T2
    resc: 3759.85  card: 30000.00  bytes: 2009  deg: 4  resp: 1044.40
    using dmeth: 192  #groups: 1
    Cost per ptn: 0.62  #ptns: 8
    hash_area: 16430 (max=82150)       buildfrag: 926  probefrag: 926  passes: 1
  Hash join: Resc: 7524.65  Resp: 2090.04  [multiMatchCost=0.00]
---- cost(Hash Join) = 2090.04 (w/o dist), 2094.08 (w/ dist)
---- cost PARTITION-LEFT = 4.04
  Outer table:
    resc: 3759.85  card 30000.00  bytes: 2009  deg: 4  resp: 1044.40
  Inner table: T2  Alias: T2
    resc: 3759.85  card: 30000.00  bytes: 2009  deg: 4  resp: 1044.40
    using dmeth: 160  #groups: 1
    Cost per ptn: 0.62  #ptns: 8
    hash_area: 16430 (max=82150)       buildfrag: 926  probefrag: 926  passes: 1
  Hash join: Resc: 7524.65  Resp: 2090.04  [multiMatchCost=0.00]
---- cost(Hash Join) = 2090.04 (w/o dist), 2094.08 (w/ dist)
---- cost BROADCAST-RIGHT = 31.94
---- cost with slave mapping = 0.00
  Outer table:
    resc: 3759.85  card 30000.00  bytes: 2009  deg: 4  resp: 1044.40
  Inner table: T2  Alias: T2
    resc: 3759.85  card: 30000.00  bytes: 2009  deg: 4  resp: 1044.40
    using dmeth: 8  #groups: 8
    Cost per ptn: 0.69  #ptns: 4
    hash_area: 16430 (max=82150)       buildfrag: 1851  probefrag: 926  passes: 1
  Hash join: Resc: 7522.46  Resp: 2089.50  [multiMatchCost=0.00]
  Outer table:
    resc: 3759.85  card 30000.00  bytes: 2009  deg: 4  resp: 1044.40
  Inner table: T1  Alias: T1
    resc: 3759.85  card: 30000.00  bytes: 2009  deg: 4  resp: 1044.40
    using dmeth: 16  #groups: 8
    Cost per ptn: 0.67  #ptns: 4
    hash_area: 16430 (max=82150)       buildfrag: 926  probefrag: 1851  passes: 1
  Hash join: Resc: 7522.36  Resp: 2089.47  [multiMatchCost=0.00]
---- cost(Hash Join) = 2089.47 (w/o dist), 2089.47 (w/ dist)
---- cost BROADCAST-LEFT = 31.94
---- cost with slave mapping = 0.00
  Outer table:
    resc: 3759.85  card 30000.00  bytes: 2009  deg: 4  resp: 1044.40
  Inner table: T2  Alias: T2
    resc: 3759.85  card: 30000.00  bytes: 2009  deg: 4  resp: 1044.40
    using dmeth: 16  #groups: 8
    Cost per ptn: 0.67  #ptns: 4
    hash_area: 16430 (max=82150)       buildfrag: 926  probefrag: 1851  passes: 1
  Hash join: Resc: 7522.36  Resp: 2089.47  [multiMatchCost=0.00]
---- cost(Hash Join) = 2089.47 (w/o dist), 2089.47 (w/ dist)
(newjo-save)    [1 0 ]
Final - All Rows Plan:  Best join order: 1
  Cost: 2089.5431  Degree: 4  Card: 30000.0000  Bytes: 120540000
  Resc: 7522.6511  Resc_io: 7518.0000  Resc_cpu: 36316296
  Resp: 2089.5431  Resp_io: 2088.3333  Resc_cpu: 9445741

You can see that for the desired plan (shown first) the difference between the cost without distribution (2090.04) and the cost with it (2090.04) is zero, and that for the other distribution methods the difference is either also zero or positive. However, the total cost for some of the other distribution methods is lower! This is because the cost without the distribution is not constant. That in turn seems to be because for some distribution methods the number of partitions (subpartitions in our case) has been correctly calculated as 8, while for others it has been set to 4: the degree of parallelism! I confirmed that this number changed accordingly when I altered the degree of parallelism.
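The arithmetic behind this can be checked directly from the trace figures: the join component of each cost is, roughly, the per-partition cost multiplied by the number of partitions the optimizer assumed. Here is a quick sketch (the figures are copied from the trace above; the breakdown is my reading of the trace, not documented behaviour):

```sql
-- Join cost component = cost per partition * number of partitions assumed.
-- NONE  distribution: 0.62 per subpartition * 8 subpartitions = 4.96
-- VALUE distribution: 0.74 per "partition"  * 4 (the DOP!)    = 2.96
SELECT 0.62 * 8 AS join_cost_none,
       0.74 * 4 AS join_cost_value
  FROM DUAL;
```

The gap of 2.00, spread across a degree of parallelism of 4, appears to account for exactly the 0.50 difference between the two response costs (2090.04 versus 2089.54).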

One other oddity: although the selected distribution mechanism has a cost (2089.5431) below 2090.04, it is not the lowest possible according to the trace (2089.47). But that is a problem for another day.

It struck me that a good workaround for what seems to be a bug is to ensure that the number of sub-partitions and the degree of parallelism are the same. Let us try it:

ALTER TABLE t1 PARALLEL(DEGREE 8);
ALTER TABLE t2 PARALLEL(DEGREE 8);

Voila! The correct plan is produced unhinted.

----------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Pstart|
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |          |     1 |  4018 |  1045   (1)| 00:00:06 |       |
|   1 |  SORT AGGREGATE                 |          |     1 |  4018 |            |          |       |
|   2 |   PX COORDINATOR                |          |       |       |            |          |       |
|   3 |    PX SEND QC (RANDOM)          | :TQ10000 |     1 |  4018 |            |          |       |
|   4 |     SORT AGGREGATE              |          |     1 |  4018 |            |          |       |
|   5 |      PX PARTITION HASH ALL      |          | 30000 |   114M|  1045   (1)| 00:00:06 |     1 |
|*  6 |       HASH JOIN                 |          | 30000 |   114M|  1045   (1)| 00:00:06 |       |
|   7 |        PX PARTITION RANGE SINGLE|          | 30000 |    57M|   522   (0)| 00:00:03 |     3 |
|*  8 |         TABLE ACCESS FULL       | T1       | 30000 |    57M|   522   (0)| 00:00:03 |    17 |
|   9 |        PX PARTITION RANGE SINGLE|          | 30000 |    57M|   522   (0)| 00:00:03 |     3 |
|* 10 |         TABLE ACCESS FULL       | T2       | 30000 |    57M|   522   (0)| 00:00:03 |    17 |
----------------------------------------------------------------------------------------------------

So here is the tip of the day:

Set the number of partitions or sub-partitions and the degree of parallelism to be the same when performing full partition-wise joins.
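As a sanity check before relying on this, it may be worth confirming that the subpartition count and the declared degree of parallelism actually line up. A hypothetical query against the standard dictionary views (the table names are taken from the example above; adjust to suit):

```sql
-- Compare subpartitions per partition with each table's declared DOP.
-- For a full partition-wise join we want these numbers to match.
SELECT s.table_name,
       s.partition_name,
       COUNT(*)            AS subpartition_count,
       MAX(TRIM(t.degree)) AS declared_degree
  FROM user_tab_subpartitions s
  JOIN user_tables t ON t.table_name = s.table_name
 WHERE s.table_name IN ('T1', 'T2')
 GROUP BY s.table_name, s.partition_name;
```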

I have reproduced the results in this blog entry on both 10.2.0.4 and 11.1.0.6.

8 Comments

  1. Tried with 11.2.0.1. No such problem. Partition-wise join used already with the first combination and without hints. Seems like your example is missing gather stats for t1.

    Comment by Timo Raitalaakso — March 10, 2010 @ 7:44 pm

    • Timo,

      Thanks for your comments. There was a cut-and-paste error in the stats gathering for T1 (now fixed). Glad to hear that the problem is fixed in 11.2. I haven’t been able to get 11.2 up and running yet (I can’t download a suitable copy of SUSE!) so this was a useful update.

      Comment by tonyhasler — March 11, 2010 @ 1:08 pm

  2. With 11.1.0.7 the behaviour is as described here with 10.2.0.4 and 11.1.0.6. For trial installation of 11.2.0.1 I am using CentOS 5.4. Still an extra ) in insert into t2.

    Comment by Timo Raitalaakso — March 11, 2010 @ 7:14 pm

    • Second typo fixed. Thanks!

      Comment by tonyhasler — March 12, 2010 @ 9:14 pm

  3. [...] 20-Degree of Parallelism and partition/subpartition relationship? Tony Hasler-Parallel Query Distribution Methods [...]

    Pingback by Blogroll Report 05/02/2010 – 12/03/2010 « Coskan’s Approach to Oracle — April 23, 2010 @ 1:08 am

hi,
    I’m looking for courses and books about parallel query, distributed databases, query processing in distributed databases, etc.
    If you find any, please mail me.
    Thanks.

    Comment by raspina — July 25, 2010 @ 8:23 am

    • I am not sure that you will find a book entirely dedicated to parallel processing, but I can recommend an excellent book by Christian Antognini that contains a substantial section on this topic, with downloadable examples. Check it out here:

      http://antognini.ch/top/

      Comment by tonyhasler — July 25, 2010 @ 1:20 pm

  5. [...] Parallel query distribution methods by Tony Hasler [...]

    Pingback by Divide to conquer: Parallel load distribution and balance | ocpdba oracle weblog — July 12, 2011 @ 4:28 pm

