speeding up planning with partitions


Amit Langote-2
It is more or less well known that the planner doesn't perform well with
more than a few hundred partitions, even when only a handful of partitions
are ultimately included in the plan.  The situation improved a bit in PG
11, where we replaced the older method of pruning partitions one-by-one
using constraint exclusion with a much faster method that finds relevant
partitions by using partitioning metadata.  However, we could only use it
for SELECT queries, because UPDATE/DELETE are handled by a completely
different code path, whose structure doesn't allow it to call the new
pruning module's functionality.  Actually, not being able to use the new
pruning is not the only problem for UPDATE/DELETE; more on that further
below.

While the situation improved with the new pruning where it could be used,
there are still overheads in the way the planner handles partitions.  As
things stand today, it will spend cycles and allocate memory for
partitions even before pruning is performed, meaning most of that effort
could be for partitions that were better left untouched.  Currently, in an
earlier phase of planning, the planner will lock and heap_open *all*
partitions, create range table entries and AppendRelInfos for them, and
finally initialize RelOptInfos for them, even touching the disk file of
each partition in the process.  All of that processing is in vain for
partitions that are pruned, because they won't be included in the final
plan.  This problem grows worse as the number of partitions grows into the
thousands, because the range table grows too big.

That could be fixed by delaying all that per-partition activity to a point
where pruning has already been performed, so that we know which partitions
to open and create planning data structures for, such as somewhere
downstream of query_planner.  But before we can do that, we must do
something about the fact that UPDATE/DELETE won't be able to cope with
that, because the code path that currently handles the planning of
UPDATE/DELETE on partitioned tables (inheritance_planner, called from
subquery_planner) relies on AppendRelInfos for all partitions having been
initialized by an earlier planning phase.  Delaying it to query_planner
would be too late, because inheritance_planner calls query_planner for
each partition, not for the parent.  That is, if query_planner, which is
downstream of inheritance_planner, were in charge of determining which
partitions to open, the latter wouldn't know which partitions to call the
former for. :)

That would be fixed if this ordering dependency no longer existed, which
is what I propose to do with the attached patch 0001.  I've tried to
describe how the patch manages that in its commit message, but I'll
summarize here.  As things stand today, inheritance_planner modifies the
query for each leaf partition to make the partition act as the query's
result relation instead of the original partitioned table, and calls
grouping_planner on the query.  That means anything that's joined to the
partitioned table looks to instead be joined to the partition, and join
paths are generated accordingly.  Also, the resulting path's targetlist is
adjusted to be suitable for the result partition.  Upon studying how this
works, I concluded that the same result can be achieved if we call
grouping_planner only once and repeat, for each partition, the portions of
query_planner's and grouping_planner's processing that generate the join
paths and the appropriate target list, respectively.  That way, we can
rely on query_planner determining the result partitions for us, which in
turn relies on the faster partprune.c-based method of pruning.  That
speeds things up in two ways: pruning is faster, and we no longer repeat
common processing for each partition.


With 0001 in place, nothing requires that partitions be opened by an
earlier planning phase, so I propose patch 0002, which refactors the
opening and creation of planner data structures for partitions such that
it is performed after pruning.  However, it doesn't do anything about the
fact that all partitions are still locked in the earlier phase.

With various overheads gone thanks to 0001 and 0002, locking of all
partitions via find_all_inheritors can be seen as the single largest
bottleneck, which 0003 tries to address.  I've kept it a separate patch,
because I'll need to think a bit more before saying that it's actually
safe to defer locking until late in planning, mainly due to the concern
about the change in the order of locking from the current method.  I'm
attaching it here because I also want to show the performance improvement
we can expect with it.


I measured the gain in performance due to each patch on a modest virtual
machine.  Details of the measurement and results follow.

* Benchmark scripts

update.sql
update ht set a = 0 where b = 1;

select.sql
select * from ht where b = 1;

* Table:

create table ht (a int, b int) partition by hash (b);
create table ht_1 partition of ht for values with (modulus N, remainder 0);
...
create table ht_N partition of ht for values with (modulus N, remainder N-1);
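Since writing out thousands of such statements by hand is impractical, the DDL can be generated mechanically; a small sketch (the table and partition names follow the schema above):

```python
N = 8  # number of hash partitions to create

ddl = ["create table ht (a int, b int) partition by hash (b);"]
for r in range(N):
    # ht_1 .. ht_N, each taking one remainder of the hash modulus
    ddl.append(
        f"create table ht_{r + 1} partition of ht"
        f" for values with (modulus {N}, remainder {r});"
    )

# Feed the result to psql to create the benchmark tables.
print("\n".join(ddl))
```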

* Rounded tps with update.sql and select.sql against a regular table
(nparts = 0) and a partitioned table with various partition counts:

pgbench -n -T 60 -f update.sql

nparts  master    0001   0002   0003
======  ======    ====   ====   ====
0         2856    2893   2862   2816
8          507    1115   1447   1872
16         260     765   1173   1892
32         119     483    922   1884
64          59     282    615   1881
128         29     153    378   1835
256         14      79    210   1803
512          5      40    113   1728
1024         2      17     57   1616
2048         0*      9     30   1471
4096         0+      4     15   1236
8192         0=      2      7    975

* 0.46
+ 0.0064
= 0 (OOM on a virtual machine with 4GB RAM)

As can be seen here, 0001 is a big help for update queries.
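To make the improvement concrete, the speedup factors over master can be derived from a few rows of the table above (a quick sketch; the tps figures are copied verbatim from the UPDATE results):

```python
# tps for UPDATE at a few partition counts, taken from the table above
tps = {
    8:   {"master": 507, "0001": 1115, "0002": 1447, "0003": 1872},
    64:  {"master": 59,  "0001": 282,  "0002": 615,  "0003": 1881},
    256: {"master": 14,  "0001": 79,   "0002": 210,  "0003": 1803},
}

for nparts, row in tps.items():
    base = row["master"]
    # speedup of each cumulative patch set relative to master
    gains = {p: round(row[p] / base, 1) for p in ("0001", "0002", "0003")}
    print(nparts, gains)
```

The gap widens with the partition count: at 8 partitions the full patch set is a few times faster than master, while at 256 partitions it is over a hundred times faster.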

pgbench -n -T 60 -f select.sql

For a select query that doesn't contain a join and needs to scan only one
partition:

nparts  master    0001   0002   0003
======  ======    ====   ====   ====
0         2290    2329   2319   2268
8         1058    1077   1414   1788
16         711     729   1124   1789
32         450     475    879   1773
64         265     272    603   1765
128        146     149    371   1685
256         76      77    214   1678
512         39      39    112   1636
1024        16      17     59   1525
2048         8       9     29   1416
4096         4       4     15   1195
8192         2       2      7    932

Actually, here we get almost the same numbers with 0001 as with master,
because 0001 changes nothing for SELECT queries.  We start seeing an
improvement with 0002, the patch that delays opening partitions.

Thanks,
Amit


0001-Overhaul-partitioned-table-update-delete-planning.patch (36K)
0002-Lazy-creation-of-partition-objects-for-planning.patch (63K)
0003-Only-lock-partitions-that-will-be-scanned-by-a-query.patch (4K)

Re: speeding up planning with partitions

David Rowley-3
On 30 August 2018 at 00:06, Amit Langote <[hidden email]> wrote:

> nparts  master    0001   0002   0003
> ======  ======    ====   ====   ====
> 0         2856    2893   2862   2816
> 8          507    1115   1447   1872
> 16         260     765   1173   1892
> 32         119     483    922   1884
> 64          59     282    615   1881
> 128         29     153    378   1835
> 256         14      79    210   1803
> 512          5      40    113   1728
> 1024         2      17     57   1616
> 2048         0*      9     30   1471
> 4096         0+      4     15   1236
> 8192         0=      2      7    975

Those look promising. Are you going to submit these patches to the
September 'fest?


--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


RE: speeding up planning with partitions

Tsunakawa, Takayuki
In reply to this post by Amit Langote-2
From: Amit Langote [mailto:[hidden email]]
> I measured the gain in performance due to each patch on a modest virtual
> machine.  Details of the measurement and results follow.

Amazing!

UPDATE
> nparts  master    0001   0002   0003
> ======  ======    ====   ====   ====
> 0         2856    2893   2862   2816
> 8          507    1115   1447   1872

SELECT
> nparts  master    0001   0002   0003
> ======  ======    ====   ====   ====
> 0         2290    2329   2319   2268
> 8         1058    1077   1414   1788

Even a small number of partitions still introduces a non-trivial overhead (UPDATE: 34%, SELECT: 22%).  Do you think this overhead can be reduced further?  Which part do you guess is relevant?  This one?


> that it is now performed after pruning. However, it doesn't do anything
> about the fact that partitions are all still locked in the earlier phase.


Regards
Takayuki Tsunakawa


Re: speeding up planning with partitions

Amit Langote-2
In reply to this post by David Rowley-3
On 2018/08/30 7:27, David Rowley wrote:

> On 30 August 2018 at 00:06, Amit Langote <[hidden email]> wrote:
>> nparts  master    0001   0002   0003
>> ======  ======    ====   ====   ====
>> 0         2856    2893   2862   2816
>> 8          507    1115   1447   1872
>> 16         260     765   1173   1892
>> 32         119     483    922   1884
>> 64          59     282    615   1881
>> 128         29     153    378   1835
>> 256         14      79    210   1803
>> 512          5      40    113   1728
>> 1024         2      17     57   1616
>> 2048         0*      9     30   1471
>> 4096         0+      4     15   1236
>> 8192         0=      2      7    975
>
> Those look promising. Are you going to submit these patches to the
> September 'fest?

Thanks, I just did.

https://commitfest.postgresql.org/19/1778/

Regards,
Amit



Re: speeding up planning with partitions

Amit Langote-2
In reply to this post by Tsunakawa, Takayuki
On 2018/08/30 10:09, Tsunakawa, Takayuki wrote:
> From: Amit Langote [mailto:[hidden email]]
>> I measured the gain in performance due to each patch on a modest virtual
>> machine.  Details of the measurement and results follow.
>
> Amazing!

Thanks.

> UPDATE
>> nparts  master    0001   0002   0003
>> ======  ======    ====   ====   ====
>> 0         2856    2893   2862   2816
>> 8          507    1115   1447   1872
>
> SELECT
>> nparts  master    0001   0002   0003
>> ======  ======    ====   ====   ====
>> 0         2290    2329   2319   2268
>> 8         1058    1077   1414   1788
>
> Even a small number of partitions still introduces a non-trivial overhead (UPDATE: 34%, SELECT: 22%).

Yeah, that's true.

> Do you think this overhead can be reduced further?

We can definitely try, but I'm not immediately sure if the further
improvements will come from continuing to fix the planner.  Maybe, the
overhead of partitioning could be attributed to other parts of the system.

> What part do you guess would be relevant?  This one?
>
>> that it is now performed after pruning. However, it doesn't do anything
>> about the fact that partitions are all still locked in the earlier phase.

Actually, I wrote that for patch 0002.  The next patch (0003) is meant to
fix that.  So, the overhead you're seeing is even after making sure that
only the selected partitions are locked.

Thanks,
Amit



RE: speeding up planning with partitions

Tsunakawa, Takayuki
From: Amit Langote [mailto:[hidden email]]
> We can definitely try, but I'm not immediately sure if the further
> improvements will come from continuing to fix the planner.  Maybe, the
> overhead of partitioning could be attributed to other parts of the system.

> Actually, I wrote that for patch 0002.  The next patch (0003) is meant to
> fix that.  So, the overhead you're seeing is even after making sure that
> only the selected partitions are locked.

Thanks for sharing your thoughts.  I understand that we should find the bottleneck with profiling first.


Regards
Takayuki Tsunakawa




Re: speeding up planning with partitions

Amit Langote-2
In reply to this post by Amit Langote-2
On 2018/08/29 21:06, Amit Langote wrote:

> I measured the gain in performance due to each patch on a modest virtual
> machine.  Details of the measurement and results follow.
>
> UPDATE:
>
> nparts  master    0001   0002   0003
> ======  ======    ====   ====   ====
> 0         2856    2893   2862   2816
> 8          507    1115   1447   1872
> 16         260     765   1173   1892
> 32         119     483    922   1884
> 64          59     282    615   1881
> 128         29     153    378   1835
> 256         14      79    210   1803
> 512          5      40    113   1728
> 1024         2      17     57   1616
> 2048         0*      9     30   1471
> 4096         0+      4     15   1236
> 8192         0=      2      7    975
>
> For SELECT:
>
> nparts  master    0001   0002   0003
> ======  ======    ====   ====   ====
> 0         2290    2329   2319   2268
> 8         1058    1077   1414   1788
> 16         711     729   1124   1789
> 32         450     475    879   1773
> 64         265     272    603   1765
> 128        146     149    371   1685
> 256         76      77    214   1678
> 512         39      39    112   1636
> 1024        16      17     59   1525
> 2048         8       9     29   1416
> 4096         4       4     15   1195
> 8192         2       2      7    932

Prompted by Tsunakawa-san's comment, I looked at the profiles when running
the benchmark with partitioning and noticed a few things that make clear
why, even with 0003 applied, tps numbers decrease as the number of
partitions increases.  Some functions that appeared high up in the
profiles were related to partitioning:

* set_relation_partition_info calling partition_bounds_copy(), which calls
  datumCopy() on N Datums, where N is the number of partitions.  The more
  partitions there are, the higher up it appears in profiles.  I suspect
  that this copying might be redundant; the planner could keep using the
  same pointer as the relcache

There are a few existing and newly introduced sites in the planner where
the code iterates over *all* partitions of a table, where processing just
the partitions selected for scanning would suffice.  I observed the
following functions in profiles:

* make_partitionedrel_pruneinfo, which goes over all partitions to
  generate the subplan_map and subpart_map arrays to put into the
  PartitionedRelPruneInfo data structure that it is in charge of
  generating

* apply_scanjoin_target_to_paths, which goes over all partitions to adjust
  their Paths for applying the required scanjoin target, although most of
  those are dummy ones that won't need the adjustment

* For UPDATE, a couple of functions I introduced in patch 0001 were doing
  the same thing as apply_scanjoin_target_to_paths, which is unnecessary

To fix the above three instances of redundant processing, I added a
Bitmapset 'live_parts' to RelOptInfo, which stores the set of indexes
(into the RelOptInfo.part_rels array) of only the unpruned partitions, and
replaced the for (i = 0; i < rel->nparts; i++) loops in those sites with a
loop that iterates over the members of 'live_parts'.
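The change described above can be illustrated with a standalone sketch.  The real code is C and uses PostgreSQL's Bitmapset type with bms_next_member(); the next_member() helper and the plain-integer bitmap here are stand-ins for illustration only:

```python
def next_member(bitset: int, prev: int) -> int:
    """Return the smallest member of bitset greater than prev, or -1.

    Mimics the contract of PostgreSQL's bms_next_member(), which is used
    to walk the 'live_parts' set in the patch.
    """
    i = prev + 1
    while bitset >> i:
        if (bitset >> i) & 1:
            return i
        i += 1
    return -1

nparts = 8192
live_parts = (1 << 5) | (1 << 4096)   # only two partitions survived pruning

# Old style: touch every part_rels slot, pruned or not:
#     for i in range(nparts): process(part_rels[i])
#
# New style: touch only the unpruned partitions.
visited = []
i = -1
while (i := next_member(live_parts, i)) >= 0:
    visited.append(i)

print(visited)  # indexes of the live partitions only
```

The work done per query thus scales with the number of unpruned partitions rather than with nparts.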

The results were promising indeed, especially after applying 0003, which
gets rid of locking all partitions.

UPDATE:

nparts  master    0001    0002   0003
======  ======    ====    ====   ====
0         2856    2893    2862   2816
8          507    1115    1466   1845
16         260     765    1161   1876
32         119     483     910   1862
64          59     282     609   1895
128         29     153     376   1884
256         14      79     212   1874
512          5      40     115   1859
1024         2      17      58   1847
2048         0       9      29   1883
4096         0       4      15   1867
8192         0       2       7   1826

SELECT:

nparts  master    0001    0002   0003
======  ======    ====    ====   ====
0       2290      2329    2319   2268
8       1058      1077    1431   1800
16       711       729    1158   1781
32       450       475     908   1777
64       265       272     612   1791
128      146       149     379   1777
256       76        77     213   1785
512       39        39     114   1776
1024      16        17      59   1756
2048       8         9      30   1746
4096       4         4      15   1722
8192       2         2       7   1706

Note that with 0003, tps doesn't degrade as the number of partitions increases.

Attached updated patches, with 0002 containing the changes mentioned above.

Thanks,
Amit

v2-0001-Overhaul-partitioned-table-update-delete-planning.patch (36K)
v2-0002-Lazy-creation-of-partition-objects-for-planning.patch (67K)
v2-0003-Only-lock-partitions-that-will-be-scanned-by-a-qu.patch (4K)

RE: speeding up planning with partitions

Kato, Sho
In reply to this post by Amit Langote-2
Hi, Amit

Great!
With the total number of records being 6400, I benchmarked while increasing the number of partitions from 100 to 6400.
Applying all three patches, performance improved by 20% for 100 partitions.

I have the same number of records in each partition; do you do the same?

Also, in my case, performance was better when not using prepared statements.
I think these patches do not improve the execute case, so we need the faster runtime pruning patch[1], right?

Details of measurement conditions and results are as follows.

- base source
  master(@777e6ddf17) + Speeding up Insert v8 patch[1]

- table definition (e.g., 100 partitions)
  create table test.accounts(aid serial, abalance int) partition by range(aid);
  create table test.accounts_history(id serial, aid int, delta int, mtime timestamp without time zone) partition by range(aid);
 
  create table test.account_part_1 partition of test.accounts for values from (1) to (65);
  create table test.account_part_2 partition of test.accounts for values from (65) to (129);
  ...
  create table test.account_part_100 partition of test.accounts for values from (6337) to (6400);
 
  create table test.ah_part_1 partition of test.accounts_history for values from (1) to (65);
  create table test.ah_part_2 partition of test.accounts_history for values from (65) to (129);
  ...
  create table test.ah_part_100 partition of test.accounts_history for values from (6337) to (6400);

- benchmark script
  \set aid random(1, 6400)
  \set delta random(-5000, 5000)
  BEGIN;
  UPDATE test.accounts SET abalance = abalance + :delta WHERE aid = :aid;
  SELECT abalance FROM test.accounts WHERE aid = :aid;
  INSERT INTO test.accounts_history (aid, delta, mtime) VALUES (:aid, :delta, CURRENT_TIMESTAMP);
  END;

- command option
  pgbench -d testdb -f benchmark.pgbench -T 180 -r -n -M prepared
  pgbench -d testdb -f benchmark.pgbench -T 180 -r -n
 
- results
  base source no prepared
   part_num |   tps_ex   | update_latency | select_latency | insert_latency
  ----------+------------+----------------+----------------+----------------
        100 | 662.414805 |          0.357 |          0.265 |          0.421
        200 | 494.478431 |          0.439 |          0.349 |          0.579
        400 | 307.982089 |          0.651 |          0.558 |          0.927
        800 | 191.360676 |          0.979 |          0.876 |          1.548
       1600 |  75.344947 |          2.253 |          2.003 |          4.301
       3200 |  30.643902 |          5.716 |          4.955 |         10.118
       6400 |  16.726056 |         12.512 |          8.582 |         18.054

  0001 no prepared
   part_num |   tps_ex   | update_latency | select_latency | insert_latency
  ----------+------------+----------------+----------------+----------------
        100 | 429.816329 |          0.811 |           0.75 |          0.365
        200 | 275.211531 |          1.333 |          1.248 |          0.501
        400 | 160.499833 |          2.384 |          2.252 |          0.754
        800 |  79.387776 |          4.935 |          4.698 |          1.468
       1600 |  24.787377 |         16.593 |         15.954 |          4.302
       3200 |   9.846421 |          42.96 |         42.139 |          8.848
       6400 |   4.919772 |          87.43 |         83.109 |          16.56

  0001 prepared
   part_num |   tps_ex   | update_latency | select_latency | insert_latency
  ----------+------------+----------------+----------------+----------------
        100 | 245.100776 |          2.728 |          0.374 |          0.476
        200 | 140.249283 |          5.116 |          0.603 |          0.686
        400 |  67.608559 |         11.383 |          1.055 |          1.179
        800 |  23.085806 |         35.781 |          2.585 |          2.677
       1600 |   6.211247 |        141.093 |          7.096 |          6.785
       3200 |   1.808214 |        508.045 |         15.741 |         13.243
       6400 |   0.495635 |       1919.362 |         37.691 |         28.177

  0001 + 0002 no prepared
   part_num |   tps_ex   | update_latency | select_latency | insert_latency
  ----------+------------+----------------+----------------+----------------
        100 |  682.53091 |          0.388 |           0.35 |           0.35
        200 | 469.906601 |          0.543 |          0.496 |           0.51
        400 | 321.915349 |           0.78 |          0.721 |          0.752
        800 | 201.620975 |          1.246 |          1.156 |          1.236
       1600 |  94.438204 |          2.612 |          2.335 |          2.745
       3200 |  38.292922 |          6.657 |          5.579 |          6.808
       6400 |   21.48462 |         11.989 |         10.104 |         12.601

  0001 + 0002 prepared
   part_num |   tps_ex   | update_latency | select_latency | insert_latency
  ----------+------------+----------------+----------------+----------------
        100 |  591.10863 |          0.433 |          0.342 |          0.422
        200 | 393.223638 |          0.625 |          0.522 |          0.614
        400 | 253.672736 |          0.946 |          0.828 |          0.928
        800 | 146.840262 |          1.615 |          1.448 |          1.604
       1600 |  52.805593 |          4.656 |          3.811 |          4.473
       3200 |  21.461606 |          11.48 |           9.56 |         10.661
       6400 |  11.888232 |         22.869 |         16.841 |         18.871
       
  0001 + 0002 + 0003 no prepared
   part_num |   tps_ex   | update_latency | select_latency | insert_latency
  ----------+------------+----------------+----------------+----------------
        100 | 798.962029 |          0.304 |          0.267 |          0.339
        200 | 577.893396 |          0.384 |          0.346 |          0.487
        400 | 426.542177 |          0.472 |          0.435 |          0.717
        800 | 288.616213 |           0.63 |          0.591 |          1.162
       1600 | 154.247034 |          1.056 |          0.987 |          2.384
       3200 |  59.711446 |          2.416 |          2.233 |          6.514
       6400 |  37.109761 |          3.387 |          3.099 |         11.762
       
  0001 + 0002 + 0003 prepared
   part_num |   tps_ex   | update_latency | select_latency | insert_latency
  ----------+------------+----------------+----------------+----------------
        100 | 662.414805 |          0.357 |          0.265 |          0.421
        200 | 494.478431 |          0.439 |          0.349 |          0.579
        400 | 307.982089 |          0.651 |          0.558 |          0.927
        800 | 191.360676 |          0.979 |          0.876 |          1.548
       1600 |  75.344947 |          2.253 |          2.003 |          4.301
       3200 |  30.643902 |          5.716 |          4.955 |         10.118
       6400 |  16.726056 |         12.512 |          8.582 |         18.054
       
       
  Although it may not be related to this, when measured with PG 11 beta2, the performance was somehow better.

  11beta2 + v1-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch[3] prepared
   part_num |   tps_ex    | update_latency | select_latency | insert_latency
  ----------+-------------+----------------+----------------+----------------
        100 |   756.07228 |          0.942 |          0.091 |          0.123
   
[1] https://www.postgresql.org/message-id/CAKJS1f_QN-nmf6jCQ4gRU_8ab0zrd0ms-U%3D_Dj0KUARJiuGpOA%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAKJS1f9T_32Xpb-p8cWwo5ezSfVhXviUW8QTWncP8ksPHDRK8g%40mail.gmail.com
[3] https://www.postgresql.org/message-id/CAKJS1f_1RJyFquuCKRFHTdcXqoPX-PYqAd7nz%3DGVBwvGh4a6xA%40mail.gmail.com

regards,
Sho Kato
-----Original Message-----
From: Amit Langote [mailto:[hidden email]]
Sent: Wednesday, August 29, 2018 9:06 PM
To: Pg Hackers <[hidden email]>
Subject: speeding up planning with partitions


Re: speeding up planning with partitions

Amit Langote-2
Thank you Kato-san for testing.

On 2018/08/31 19:48, Kato, Sho wrote:
> Hi, Amit
>
> Great!
> With the total number of records being 6400, I benchmarked while increasing the number of partitions from 100 to 6400.
> Applying all three patches, performance improved by 20% for 100 partitions.
>
> I used the same number of records in each partition; did you do the same?

I didn't load any data into the tables when running the tests, because
these patches are meant to reduce planner latency.  More specifically, they
address the current planner behavior whereby latency increases with the
number of partitions, focusing on the common case where only a single
partition needs to be scanned by a given query.

I'd try to avoid using a benchmark whose results are affected by anything
other than planning latency.  It would be a good idea to leave the tables
empty and vary just the number of partitions and nothing else.

Also, we're interested in planning latency, so using just SELECT and
UPDATE in your benchmark script will be enough, because their planning
time is affected by the number of partitions.  No need to try to measure
the INSERT latency, because its planning latency is not affected by the
number of partitions.  Moreover, I'd rather suggest you take out the
INSERT statement from the benchmark for now, because its execution time
does vary unfavorably with increasing number of partitions.  Sure, there
are other patches to address that, but it's better not to mix the patches
to avoid confusion.

> Also, in my case, performance was better without prepared statements.

The patches in this thread do nothing for execution, so there is no need
to compare prepared vs. non-prepared.  In fact, just measure non-prepared
tps and latency, because we're only interested in planning time here.

> I think these patches do not improve execute case, so we need faster runtime pruning patch[1], right?

We already have run-time pruning code (that is the code in the patch you
linked) committed into the tree in PG 11, so the master tree also has it.
But since we're not interested in execution time, no need to worry about
run-time pruning.

> Details of measurement conditions and results are as follows.
> - base source
>   master(@777e6ddf17) + Speeding up Insert v8 patch[1]
>
> - table definition(e.g. 100 partition)
>   create table test.accounts(aid serial, abalance int)
>   partition by range(aid);
>   create table test.accounts_history(id serial, aid int, delta int,
>   mtime timestamp without time zone)
>   partition by range(aid);
>
> - command option
>   pgbench -d testdb -f benchmark.pgbench -T 180 -r -n -M prepared
>   pgbench -d testdb -f benchmark.pgbench -T 180 -r -n
>  
> -results
>   base source no prepared
>    part_num |   tps_ex   | update_latency | select_latency | insert_latency
>   ----------+------------+----------------+----------------+----------------
>         100 | 662.414805 |          0.357 |          0.265 |          0.421
>         200 | 494.478431 |          0.439 |          0.349 |          0.579
>         400 | 307.982089 |          0.651 |          0.558 |          0.927
>         800 | 191.360676 |          0.979 |          0.876 |          1.548
>        1600 |  75.344947 |          2.253 |          2.003 |          4.301
>        3200 |  30.643902 |          5.716 |          4.955 |         10.118
>        6400 |  16.726056 |         12.512 |          8.582 |         18.054
>
>   0001 no prepared
>    part_num |   tps_ex   | update_latency | select_latency | insert_latency
>   ----------+------------+----------------+----------------+----------------
>         100 | 429.816329 |          0.811 |           0.75 |          0.365
>         200 | 275.211531 |          1.333 |          1.248 |          0.501
>         400 | 160.499833 |          2.384 |          2.252 |          0.754
>         800 |  79.387776 |          4.935 |          4.698 |          1.468
>        1600 |  24.787377 |         16.593 |         15.954 |          4.302
>        3200 |   9.846421 |          42.96 |         42.139 |          8.848
>        6400 |   4.919772 |          87.43 |         83.109 |          16.56

Hmm, since 0001 is meant to improve update planning time, it seems odd
that you'd get poorer results compared to base source.  But, it seems base
source is actually master + the patch to improve the execution time of
update, so maybe that patch is playing a part here, although I'm not sure
why even that's making this much of a difference.

I suggest that you use un-patched master as base source, that is, leave
out any patches to improve execution time.

[ ... ]

>   0001 + 0002 no prepared
>    part_num |   tps_ex   | update_latency | select_latency | insert_latency
>   ----------+------------+----------------+----------------+----------------
>         100 |  682.53091 |          0.388 |           0.35 |           0.35
>         200 | 469.906601 |          0.543 |          0.496 |           0.51
>         400 | 321.915349 |           0.78 |          0.721 |          0.752
>         800 | 201.620975 |          1.246 |          1.156 |          1.236
>        1600 |  94.438204 |          2.612 |          2.335 |          2.745
>        3200 |  38.292922 |          6.657 |          5.579 |          6.808
>        6400 |   21.48462 |         11.989 |         10.104 |         12.601
>

[ ... ]

>        
>   0001 + 0002 + 0003 no prepared
>    part_num |   tps_ex   | update_latency | select_latency | insert_latency
>   ----------+------------+----------------+----------------+----------------
>         100 | 798.962029 |          0.304 |          0.267 |          0.339
>         200 | 577.893396 |          0.384 |          0.346 |          0.487
>         400 | 426.542177 |          0.472 |          0.435 |          0.717
>         800 | 288.616213 |           0.63 |          0.591 |          1.162
>        1600 | 154.247034 |          1.056 |          0.987 |          2.384
>        3200 |  59.711446 |          2.416 |          2.233 |          6.514
>        6400 |  37.109761 |          3.387 |          3.099 |         11.762
>        

[ ... ]

By the way, as you may have noticed, I posted a version 2 of the patches
on this thread.  If you apply them, you will be able to see almost the same
TPS for any number of partitions with master + 0001 + 0002 + 0003.

>   Although it may not be related to this, when measured with pg11 beta2, somehow the performance was better.
>
>   11beta2 + v1-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch[3] prepared
>    part_num |   tps_ex    | update_latency | select_latency | insert_latency
>   ----------+-------------+----------------+----------------+----------------
>      100    | 756.07228   |          0.942 |          0.091 |          0.123

I guess if you had applied the latest version of
"Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch" (which is v8
posted at [1]) on top of master + 0001 + 0002 + 0003, you'd get better
performance too.  But, as mentioned above, we're interested in measuring
planning latency, not execution latency, so we should leave out any
patches aimed at improving execution latency to avoid confusion.


Thanks again.

Regards,
Amit

[1]
https://www.postgresql.org/message-id/CAKJS1f9T_32Xpb-p8cWwo5ezSfVhXviUW8QTWncP8ksPHDRK8g%40mail.gmail.com



Re: speeding up planning with partitions

Dilip Kumar-2
In reply to this post by Amit Langote-2
On Wed, Aug 29, 2018 at 5:36 PM, Amit Langote
<[hidden email]> wrote:

> It is more or less well known that the planner doesn't perform well with
> more than a few hundred partitions even when only a handful of partitions
> are ultimately included in the plan.  Situation has improved a bit in PG
> 11 where we replaced the older method of pruning partitions one-by-one
> using constraint exclusion with a much faster method that finds relevant
> partitions by using partitioning metadata.  However, we could only use it
> for SELECT queries, because UPDATE/DELETE are handled by a completely
> different code path, whose structure doesn't allow it to call the new
> pruning module's functionality.  Actually, not being able to use the new
> pruning is not the only problem for UPDATE/DELETE, more on which further
> below.
>
>
> pgbench -n -T 60 -f update.sql
>
> nparts  master    0001   0002   0003
> ======  ======    ====   ====   ====
> 0         2856    2893   2862   2816
> 8          507    1115   1447   1872
> 16         260     765   1173   1892
> 32         119     483    922   1884
> 64          59     282    615   1881
> 128         29     153    378   1835
> 256         14      79    210   1803
> 512          5      40    113   1728
> 1024         2      17     57   1616
> 2048         0*      9     30   1471
> 4096         0+      4     15   1236
> 8192         0=      2      7    975
>
> * 0.46
> + 0.0064
> = 0 (OOM on a virtual machine with 4GB RAM)
>

The idea looks interesting.  While going through the patch I observed
this comment.

/*
 * inheritance_planner
 *   Generate Paths in the case where the result relation is an
 *   inheritance set.
 *
 * We have to handle this case differently from cases where a source relation
 * is an inheritance set. Source inheritance is expanded at the bottom of the
 * plan tree (see allpaths.c), but target inheritance has to be expanded at
 * the top.

I think with your patch these comments need to be changed?


  if (parse->resultRelation &&
- rt_fetch(parse->resultRelation, parse->rtable)->inh)
+ rt_fetch(parse->resultRelation, parse->rtable)->inh &&
+ rt_fetch(parse->resultRelation, parse->rtable)->relkind !=
+ RELKIND_PARTITIONED_TABLE)
  inheritance_planner(root);
  else
  grouping_planner(root, false, tuple_fraction);

I think we can add some comments to explain why, if the target rel itself
is a partitioned rel, we can go directly to the grouping planner.


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


RE: speeding up planning with partitions

Kato, Sho
In reply to this post by Amit Langote-2
Hi, Amit. Thank you for your reply.

> I didn't load any data into the tables when running the tests, because these patches are meant to reduce planner latency.  More specifically, they address the current planner behavior whereby latency increases with the number of partitions, focusing on the common case where only a single partition needs to be scanned by a given query.

Thank you for telling me details. It is very helpful.

> No need to try to measure the INSERT latency, because its planning latency is not affected by the number of partitions.  Moreover, I'd rather suggest you take out the INSERT statement from the benchmark for now, because its execution time does vary unfavorably with increasing number of partitions.

Thank you for your advice.

> In fact, just measure non-prepared tps and latency, because we're only interested in planning time here.

I got it.

> Hmm, since 0001 is meant to improve update planning time, it seems odd that you'd get poorer results compared to base source.  But, it seems base source is actually master + the patch to improve the execution time of update, so maybe that patch is playing a part here, although I'm not sure why even that's making this much of a difference.
> I suggest that you use un-patched master as base source, that is, leave out any patches to improve execution time.

> By the way, as you may have noticed, I posted a version 2 of the patches on this thread.  If you apply them, you will be able to see almost the same TPS for any number of partitions with master + 0001 + 0002 + 0003.

> I guess if you had applied the latest version of "Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch" (which is v8 posted at [1]) on top of master + 0001 + 0002 + 0003, you'd get better performance too.

Thank you. I will try out these cases.

Thanks,

--
Sho Kato
-----Original Message-----
From: Amit Langote [mailto:[hidden email]]
Sent: Monday, September 3, 2018 2:12 PM
To: Kato, Sho/加藤 翔 <[hidden email]>; Pg Hackers <[hidden email]>
Subject: Re: speeding up planning with partitions

[ ... ]



Re: speeding up planning with partitions

Amit Langote-2
In reply to this post by Dilip Kumar-2
Hi Dilip,

Thanks for taking a look.

On 2018/09/03 20:57, Dilip Kumar wrote:

> The idea looks interesting while going through the patch I observed
> this comment.
>
> /*
>  * inheritance_planner
>  *   Generate Paths in the case where the result relation is an
>  *   inheritance set.
>  *
>  * We have to handle this case differently from cases where a source relation
>  * is an inheritance set. Source inheritance is expanded at the bottom of the
>  * plan tree (see allpaths.c), but target inheritance has to be expanded at
>  * the top.
>
> I think with your patch these comments needs to be change?

Yes, it might be a good idea to mention that partitioned table result
relations are not handled here.

Actually, I've been wondering if this patch (0001) shouldn't get rid of
inheritance_planner altogether and implement the new approach for *all*
inheritance sets, not just partitioned tables, but haven't spent much time
on that idea yet.

>   if (parse->resultRelation &&
> - rt_fetch(parse->resultRelation, parse->rtable)->inh)
> + rt_fetch(parse->resultRelation, parse->rtable)->inh &&
> + rt_fetch(parse->resultRelation, parse->rtable)->relkind !=
> + RELKIND_PARTITIONED_TABLE)
>   inheritance_planner(root);
>   else
>   grouping_planner(root, false, tuple_fraction);
>
> I think we can add some comments to explain if the target rel itself
> is partitioned rel then why
> we can directly go to the grouping planner.

Okay, I will try to add more explanatory comments here and there in the
next version I will post later this week.

Thanks,
Amit



Re: speeding up planning with partitions

David Rowley-3
In reply to this post by Amit Langote-2
On 30 August 2018 at 00:06, Amit Langote <[hidden email]> wrote:
> With various overheads gone thanks to 0001 and 0002, locking of all
> partitions via find_all_inheritors can be seen as the single largest
> bottleneck, which 0003 tries to address.  I've kept it a separate patch,
> because I'll need to think a bit more to say that it's actually safe to
> defer locking to late planning, due mainly to the concern about the change
> in the order of locking from the current method.  I'm attaching it here,
> because I also want to show the performance improvement we can expect with it.

For now, find_all_inheritors() locks the tables in ascending Oid
order.  This makes sense with inheritance parent tables as it's much
cheaper to sort on this rather than on something like the table's
namespace and name.  I see no reason why what we sort on really
matters, as long as we always sort on the same thing and the key we
sort on is always unique so that the locking order is well defined.

For partitioned tables, there's really not much sense in sticking to
the same lock by Oid order. The order of the PartitionDesc is already
well defined so I don't see any reason why we can't just perform the
locking in PartitionDesc order. This would mean you could perform the
locking of the partitions once pruning is complete somewhere around
add_rel_partitions_to_query(). Also, doing this would remove the need
for scanning pg_inherits during find_all_inheritors() and would likely
further speed up the planning of queries to partitioned tables with
many partitions. I wrote a function named
get_partition_descendants_worker() to do this in patch 0002 in [1] (it
may have been a bad choice to have made this a separate function
rather than just part of find_all_inheritors() as it meant I had to
change a bit too much code in tablecmds.c). There might be something
you can salvage from that patch to help here.

[1] https://www.postgresql.org/message-id/CAKJS1f9QjUwQrio20Pi%3DyCHmnouf4z3SfN8sqXaAcwREG6k0zQ%40mail.gmail.com

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: speeding up planning with partitions

Amit Langote-2
On 2018/09/04 22:24, David Rowley wrote:

> On 30 August 2018 at 00:06, Amit Langote <[hidden email]> wrote:
>> With various overheads gone thanks to 0001 and 0002, locking of all
>> partitions via find_all_inheritors can be seen as the single largest
>> bottleneck, which 0003 tries to address.  I've kept it a separate patch,
>> because I'll need to think a bit more to say that it's actually safe to
>> defer locking to late planning, due mainly to the concern about the change
>> in the order of locking from the current method.  I'm attaching it here,
>> because I also want to show the performance improvement we can expect with it.
>
> For now, find_all_inheritors() locks the tables in ascending Oid
> order.  This makes sense with inheritance parent tables as it's much
> cheaper to sort on this rather than on something like the table's
> namespace and name.  I see no reason why what we sort on really
> matters, as long as we always sort on the same thing and the key we
> sort on is always unique so that the locking order is well defined.
>
> For partitioned tables, there's really not much sense in sticking to
> the same lock by Oid order. The order of the PartitionDesc is already
> well defined so I don't see any reason why we can't just perform the
> locking in PartitionDesc order.

The reason we do the locking with find_all_inheritors for regular queries
(planner (expand_inherited_rtentry) does it for select/update/delete and
executor (ExecSetupPartitionTupleRouting) for insert) is that that's the
order used by various DDL commands when locking partitions, such as when
adding a column.  So, we'd have one piece of code locking partitions by
Oid order and others by canonical partition bound order or PartitionDesc
order.  I'm no longer sure if that's problematic though, about which more
below.

> This would mean you could perform the
> locking of the partitions once pruning is complete somewhere around
> add_rel_partitions_to_query(). Also, doing this would remove the need
> for scanning pg_inherits during find_all_inheritors() and would likely
> further speed up the planning of queries to partitioned tables with
> many partitions.

That's what happens with 0003; note the following hunk in it:

@@ -1555,14 +1555,15 @@ expand_inherited_rtentry(PlannerInfo *root,
RangeTblEntry *rte, Index rti)
         lockmode = AccessShareLock;

     /* Scan for all members of inheritance set, acquire needed locks */
-    inhOIDs = find_all_inheritors(parentOID, lockmode, NULL);
+    if (rte->relkind != RELKIND_PARTITIONED_TABLE)
+        inhOIDs = find_all_inheritors(parentOID, lockmode, NULL);

> I wrote a function named
> get_partition_descendants_worker() to do this in patch 0002 in [1] (it
> may have been a bad choice to have made this a separate function
> rather than just part of find_all_inheritors() as it meant I had to
> change a bit too much code in tablecmds.c). There might be something
> you can salvage from that patch to help here.
>
> [1] https://www.postgresql.org/message-id/CAKJS1f9QjUwQrio20Pi%3DyCHmnouf4z3SfN8sqXaAcwREG6k0zQ%40mail.gmail.com

Actually, I had written a similar patch to replace the usages of
find_all_inheritors and find_inheritance_children with different
partitioning-specific functions which would collect the partition OIDs
from the already open parent table's PartitionDesc, more or less like the
patch you mention does.  But I think we don't need any new function(s) to
do that, that is, we don't need to change all the sites that call
find_all_inheritors or find_inheritance_children in favor of new functions
that return partition OIDs in PartitionDesc order, just because we want to
change the planner to lock partitions in PartitionDesc order.  I'm failing
to see why the difference in locking order matters.  I understood the
concern to be that locking partitions in a different order could lead to a
deadlock if concurrent backends request mutually conflicting locks, but
only one of the conflicting backends (the one that got the lock on the
parent) would be allowed to lock the children.  Thoughts on that?
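
The deadlock question boils down to whether every backend acquires the same set of locks in the same well-defined order.  A minimal illustrative sketch (Python, not PostgreSQL code; the Oids and bound positions are made up) of the two candidate orderings:

```python
# Illustrative sketch (not PostgreSQL code): deadlock avoidance only
# requires that every backend lock the same set of tables in the same
# well-defined order.  Whether that order is ascending Oid (as in
# find_all_inheritors) or canonical partition bound order (PartitionDesc)
# does not matter, as long as all backends use the same one.

def lock_order_by_oid(oids):
    # find_all_inheritors-style ordering: ascending Oid
    return sorted(oids)

def lock_order_by_partdesc(oids, bound_position):
    # PartitionDesc-style ordering; bound_position is a hypothetical
    # mapping from Oid to position in the canonical bound order
    return sorted(oids, key=lambda oid: bound_position[oid])

oids = [70031, 70010, 70025]            # made-up Oids
pos = {70025: 0, 70031: 1, 70010: 2}    # made-up bound order
# Every backend computing either ordering gets the same sequence,
# so lock acquisition order is deterministic either way.
assert lock_order_by_oid(oids) == [70010, 70025, 70031]
assert lock_order_by_partdesc(oids, pos) == [70025, 70031, 70010]
```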

Thanks,
Amit



Re: speeding up planning with partitions

David Rowley-3
In reply to this post by Amit Langote-2
On 30 August 2018 at 21:29, Amit Langote <[hidden email]> wrote:
> Attached updated patches, with 0002 containing the changes mentioned above.

Here's my first pass review on this:

0001:

1. I think the following paragraph should probably mention the
performance difference between the two methods:

   <para>
    Constraint exclusion works in a very similar way to partition
    pruning, except that it uses each table's <literal>CHECK</literal>
    constraints &mdash; which gives it its name &mdash; whereas partition
    pruning uses the table's partition bounds, which exist only in the
    case of declarative partitioning.  Another difference is that
    constraint exclusion is only applied at plan time; there is no attempt
    to remove partitions at execution time.
   </para>

Perhaps tagging on. "Constraint exclusion is also a much less
efficient way of eliminating unneeded partitions as metadata for each
partition must be loaded in the planner before constraint exclusion
can be applied.  This is not a requirement for partition pruning."

2. I think "rootrel" should be named "targetpart" in:

+ RelOptInfo *rootrel = root->simple_rel_array[root->parse->resultRelation];

3. Why does the following need to list_copy()?

+ List    *saved_join_info_list = list_copy(root->join_info_list);

4. Is the "root->parse->commandType != CMD_INSERT" required in:

@@ -181,13 +185,30 @@ make_one_rel(PlannerInfo *root, List *joinlist)

  /*
  * Generate access paths for the entire join tree.
+ *
+ * If we're doing this for an UPDATE or DELETE query whose target is a
+ * partitioned table, we must do the join planning against each of its
+ * leaf partitions instead.
  */
- rel = make_rel_from_joinlist(root, joinlist);
+ if (root->parse->resultRelation &&
+ root->parse->commandType != CMD_INSERT &&
+ root->simple_rel_array[root->parse->resultRelation] &&
+ root->simple_rel_array[root->parse->resultRelation]->part_scheme)
+ {

Won't the simple_rel_array entry for the resultRelation always be NULL
for an INSERT?

5. In regards to:

+ /*
+ * Hack to make the join planning code believe that 'partrel' can
+ * be joined against.
+ */
+ partrel->reloptkind = RELOPT_BASEREL;

Have you thought about other implications of join planning for other
member rels, for example, equivalence classes and em_is_child?

6. It would be good to not have to rt_fetch the same RangeTblEntry twice here:

@@ -959,7 +969,9 @@ subquery_planner(PlannerGlobal *glob, Query *parse,
  * needs special processing, else go straight to grouping_planner.
  */
  if (parse->resultRelation &&
- rt_fetch(parse->resultRelation, parse->rtable)->inh)
+ rt_fetch(parse->resultRelation, parse->rtable)->inh &&
+ rt_fetch(parse->resultRelation, parse->rtable)->relkind !=
+ RELKIND_PARTITIONED_TABLE)
  inheritance_planner(root);

7. Why don't you just pass the Parse into the function as a parameter
instead of overwriting PlannerInfo's copy in:

+ root->parse = partition_parse;
+ partitionwise_adjust_scanjoin_target(root, child_rel,
+ subroots,
+ partitioned_rels,
+ resultRelations,
+ subpaths,
+ WCOLists,
+ returningLists,
+ rowMarks);
+ /* Restore the Query for processing the next partition. */
+ root->parse = parse;

8. Can you update the following comment to mention why you're not
calling add_paths_to_append_rel for this case?

@@ -6964,7 +7164,9 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
  }

  /* Build new paths for this relation by appending child paths. */
- if (live_children != NIL)
+ if (live_children != NIL &&
+ !(rel->reloptkind == RELOPT_BASEREL &&
+   rel->relid == root->parse->resultRelation))
  add_paths_to_append_rel(root, rel, live_children);

9. The following should use >= 0, not > 0

+ while ((relid = bms_next_member(child_rel->relids, relid)) > 0)
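
The reason ">= 0" matters is that bms_next_member returns the next member *index*, and index 0 is a legitimate member; PostgreSQL signals exhaustion with a negative value (-2).  An illustrative sketch (Python, not the real bms_* API) of the difference:

```python
# Illustrative sketch of the bms_next_member pitfall (not the real bms_*
# API): the function returns the next member index, which can
# legitimately be 0; exhaustion is signalled by a negative value (-2 in
# PostgreSQL).  A "> 0" loop test therefore terminates as soon as
# member 0 is returned.

def bms_next_member(members, prevbit):
    nxt = sorted(m for m in members if m > prevbit)
    return nxt[0] if nxt else -2

def members_in_order(members):
    found, relid = [], -1
    while (relid := bms_next_member(members, relid)) >= 0:   # correct
        found.append(relid)
    return found

def members_skipping_zero(members):
    found, relid = [], -1
    while (relid := bms_next_member(members, relid)) > 0:    # the bug
        found.append(relid)
    return found

assert members_in_order({0, 3, 7}) == [0, 3, 7]
assert members_skipping_zero({0, 3, 7}) == []   # stops at member 0
```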

0002:

10. I think moving the PlannerInfo->total_table_pages calculation
needs to be done in its own patch. It is a behavioural change, since
we would no longer count pruned relations in the calculation, which
can result in plan changes. I posted a patch in [1] to fix this as a
bug, as I think the current code is incorrect. My patch also updates
the first paragraph of the comment; you've not done that.

11. "pruning"

+ * And do prunin.  Note that this adds AppendRelInfo's of only the

12. It's much more efficient just to do bms_add_range() outside the loop in:

+ for (i = 0; i < rel->nparts; i++)
+ {
+ rel->part_rels[i] = build_partition_rel(root, rel,
+ rel->part_oids[i]);
+ rel->live_parts = bms_add_member(rel->live_parts, i);
+ }

13. In set_append_rel_size() the foreach(l, root->append_rel_list)
loop could be made to loop over RelOptInfo->live_parts instead which
would save having to skip over AppendRelInfos that don't belong to
this parent. You'd need to make live_parts more general purpose and
also use it to mark children of inheritance parents.

14. I think you can skip the following if both are NULL. You could
likely add more smarts for different join types, but I think you
should at least skip when both are NULL. Perhaps the loop could be
driven off of the bms_intersect() of the two rels' live_parts?

+ if (child_rel1 == NULL)
+ child_rel1 = build_dummy_partition_rel(root, rel1, cnt_parts);
+ if (child_rel2 == NULL)
+ child_rel2 = build_dummy_partition_rel(root, rel2, cnt_parts);

15. The following is not required when append_rel_array was previously NULL.

+ MemSet(root->append_rel_array + root->simple_rel_array_size,
+    0, sizeof(AppendRelInfo *) * num_added_parts);

16. I don't think scan_all_parts is worth the extra code.  The cost of
doing bms_num_members is going to be pretty small in comparison to
building paths for and maybe doing a join search for all partitions.

+ num_added_parts = scan_all_parts ? rel->nparts :
+ bms_num_members(partindexes);

In any case, you've got a bug in prune_append_rel_partitions() where
you're setting scan_all_parts to true instead of returning when
contradictory == true.

17. lockmode should be of type LOCKMODE, not int.

+    Oid childOID, int lockmode,

18. Comment contradicts the code.

+ /* Open rel if needed; we already have required locks */
+ if (childOID != parentOID)
+ {
+ childrel = heap_open(childOID, lockmode);

I think you should be using NoLock here.

19. Comment contradicts the code.

+ /* Close child relations, but keep locks */
+ if (childOID != parentOID)
+ {
+ Assert(childrel != NULL);
+ heap_close(childrel, lockmode);
+ }

20. I assume translated_vars can now be NIL due to
build_dummy_partition_rel() not setting it.

- if (var->varlevelsup == 0 && appinfo)
+ if (var->varlevelsup == 0 && appinfo && appinfo->translated_vars)

It might be worth writing a comment to explain that, otherwise it's
not quite clear why you're doing this.

21. Unrelated change:

  Assert(relid > 0 && relid < root->simple_rel_array_size);
+

22. The following comment mentions that Oids are copied, but that does
not happen in the code.

+ /*
+ * For partitioned tables, we just allocate space for RelOptInfo's.
+ * pointers for all partitions and copy the partition OIDs from the
+ * relcache.  Actual RelOptInfo is built for a partition only if it is
+ * not pruned.
+ */

The Oid copying already happened during get_relation_info().

23. Traditionally, translated_vars is populated with a sequence of
Vars in the same order to mean no translation. Here you're changing
how that works:

+ /* leaving translated_vars to NIL to mean no translation needed */

This seems to be about the extent of your documentation on this, which
is not good enough.

24. "each" -> "reach"? ... Actually, I don't understand the comment.
In a partitioned hierarchy, how can the one before the top-level
partitioned table not be a partitioned table?

/*
* Keep moving up until we each the parent rel that's not a
* partitioned table.  The one before that one would be the root
* parent.
*/

25. "already"

+ * expand_inherited_rtentry alreay locked all partitions, so pass

26. Your changes to make_partitionedrel_pruneinfo() look a bit broken.
You're wrongly assuming live_parts is the same as present_parts.  If a
CHECK constraint eliminated the partition then those will be present
in live_parts but won't be part of the Append/MergeAppend subplans.
You might be able to maintain some of this optimisation by checking
for dummy rels inside the loop, but you're going to need to put back
the code that sets present_parts.

+ present_parts = bms_copy(subpart->live_parts);

27. Comment contradicts the code:

+ Bitmapset  *live_parts; /* unpruned parts; NULL if all are live */

in add_rel_partitions_to_query() you're doing:

+ if (scan_all_parts)
+ {
+ for (i = 0; i < rel->nparts; i++)
+ {
+ rel->part_rels[i] = build_partition_rel(root, rel,
+ rel->part_oids[i]);
+ rel->live_parts = bms_add_member(rel->live_parts, i);
+ }
+ }

so the NULL part seems untrue.

28. Missing comments:

+ TupleDesc tupdesc;
+ Oid reltype;

29. The comment for prune_append_rel_partitions claims it "Returns RT
indexes", but that's not the case, per:

-Relids
-prune_append_rel_partitions(RelOptInfo *rel)
+void
+prune_append_rel_partitions(PlannerInfo *root, RelOptInfo *rel)

0003:

30. The second test can be moved inside the first to remove the
redundant partition check:

- inhOIDs = find_all_inheritors(parentOID, lockmode, NULL);
+ if (rte->relkind != RELKIND_PARTITIONED_TABLE)
+ inhOIDs = find_all_inheritors(parentOID, lockmode, NULL);

  /*
  * Check that there's at least one descendant, else treat as no-child
  * case.  This could happen despite above has_subclass() check, if table
  * once had a child but no longer does.
  */
- if (list_length(inhOIDs) < 2)
+ if (rte->relkind != RELKIND_PARTITIONED_TABLE && list_length(inhOIDs) < 2)
  {

31. The following code is wrong:

+ /* Determine the correct lockmode to use. */
+ if (rootRTindex == root->parse->resultRelation)
+ lockmode = RowExclusiveLock;
+ else if (rootrc && RowMarkRequiresRowShareLock(rootrc->markType))
+ lockmode = RowShareLock;
+ else
+ lockmode = AccessShareLock;

rootRTindex remains at 0 if there are no row marks, and resultRelation
will be 0 for SELECT queries; this means you'll use a RowExclusiveLock
for a SELECT instead of an AccessShareLock.

Why not just check the lockmode of the parent and use that?

[1] https://commitfest.postgresql.org/19/1768/

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: speeding up planning with partitions

Peter Geoghegan-4
In reply to this post by Amit Langote-2
On Wed, Aug 29, 2018 at 5:06 AM, Amit Langote
<[hidden email]> wrote:

> It is more or less well known that the planner doesn't perform well with
> more than a few hundred partitions even when only a handful of partitions
> are ultimately included in the plan.  Situation has improved a bit in PG
> 11 where we replaced the older method of pruning partitions one-by-one
> using constraint exclusion with a much faster method that finds relevant
> partitions by using partitioning metadata.  However, we could only use it
> for SELECT queries, because UPDATE/DELETE are handled by a completely
> different code path, whose structure doesn't allow it to call the new
> pruning module's functionality.  Actually, not being able to use the new
> pruning is not the only problem for UPDATE/DELETE, more on which further
> below.

This was a big problem for the SQL MERGE patch. I hope that this
problem can be fixed.

--
Peter Geoghegan


Re: speeding up planning with partitions

Amit Langote-2
On 2018/09/11 10:11, Peter Geoghegan wrote:

> On Wed, Aug 29, 2018 at 5:06 AM, Amit Langote
> <[hidden email]> wrote:
>> It is more or less well known that the planner doesn't perform well with
>> more than a few hundred partitions even when only a handful of partitions
>> are ultimately included in the plan.  Situation has improved a bit in PG
>> 11 where we replaced the older method of pruning partitions one-by-one
>> using constraint exclusion with a much faster method that finds relevant
>> partitions by using partitioning metadata.  However, we could only use it
>> for SELECT queries, because UPDATE/DELETE are handled by a completely
>> different code path, whose structure doesn't allow it to call the new
>> pruning module's functionality.  Actually, not being able to use the new
>> pruning is not the only problem for UPDATE/DELETE, more on which further
>> below.
>
> This was a big problem for the SQL MERGE patch. I hope that this
> problem can be fixed.

Sorry if I've missed some prior discussion about this, but could you
clarify which aspect of UPDATE/DELETE planning got in the way of the SQL
MERGE patch?  It'd be great if you could point to an email or portion of
the SQL MERGE discussion thread where this aspect was discussed.

In the updated patch that I'm still hacking on (I also need to look at
David's comments from earlier today before posting it), I have managed to
eliminate the separate code paths handling SELECT vs. UPDATE/DELETE on
inheritance tables, but I can't be sure whether the new approach also
solves the problems that were faced during SQL MERGE development.

Thanks,
Amit



Re: speeding up planning with partitions

Dilip Kumar-2
In reply to this post by Amit Langote-2
On Tue, Sep 4, 2018 at 12:44 PM, Amit Langote
<[hidden email]> wrote:

> Hi Dilip,
>
> Thanks for taking a look.
>
> On 2018/09/03 20:57, Dilip Kumar wrote:
>> The idea looks interesting while going through the patch I observed
>> this comment.
>>
>> /*
>>  * inheritance_planner
>>  *   Generate Paths in the case where the result relation is an
>>  *   inheritance set.
>>  *
>>  * We have to handle this case differently from cases where a source relation
>>  * is an inheritance set. Source inheritance is expanded at the bottom of the
>>  * plan tree (see allpaths.c), but target inheritance has to be expanded at
>>  * the top.
>>
>> I think with your patch these comments needs to be change?
>
> Yes, maybe a good idea to mention that partitioned table result relations
> are not handled here.
>
> Actually, I've been wondering if this patch (0001) shouldn't get rid of
> inheritance_planner altogether and implement the new approach for *all*
> inheritance sets, not just partitioned tables, but haven't spent much time
> on that idea yet.

That will be interesting.

>
>>   if (parse->resultRelation &&
>> - rt_fetch(parse->resultRelation, parse->rtable)->inh)
>> + rt_fetch(parse->resultRelation, parse->rtable)->inh &&
>> + rt_fetch(parse->resultRelation, parse->rtable)->relkind !=
>> + RELKIND_PARTITIONED_TABLE)
>>   inheritance_planner(root);
>>   else
>>   grouping_planner(root, false, tuple_fraction);
>>
>> I think we can add some comments to explain if the target rel itself
>> is partitioned rel then why
>> we can directly go to the grouping planner.
>
> Okay, I will try to add more explanatory comments here and there in the
> next version I will post later this week.

Okay.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: speeding up planning with partitions

Amit Langote-2
In reply to this post by David Rowley-3
Thanks a lot for your detailed review.

On 2018/09/11 8:23, David Rowley wrote:
> On 30 August 2018 at 21:29, Amit Langote <[hidden email]> wrote:
>> Attached updated patches, with 0002 containing the changes mentioned above.
>
> Here's my first pass review on this:

I've gone through your comments on 0001 so far, but didn't go through
others yet, to which I'll reply separately.

> 0001:
>
> 1. I think the following paragraph should probably mention something
> about some performance difference between the two methods:
>
>    <para>
>     Constraint exclusion works in a very similar way to partition
>     pruning, except that it uses each table's <literal>CHECK</literal>
>     constraints &mdash; which gives it its name &mdash; whereas partition
>     pruning uses the table's partition bounds, which exist only in the
>     case of declarative partitioning.  Another difference is that
>     constraint exclusion is only applied at plan time; there is no attempt
>     to remove partitions at execution time.
>    </para>
>
> Perhaps tagging on. "Constraint exclusion is also a much less
> efficient way of eliminating unneeded partitions as metadata for each
> partition must be loaded in the planner before constraint exclusion
> can be applied.  This is not a requirement for partition pruning."

Hmm, isn't that implied by the existing text?  It already says that
constraint exclusion uses *each* table's/partition's CHECK constraints,
which should make it clear that, for a whole lot of partitions, it will be
slower than partition pruning, which requires accessing only one table's
metadata.

If these patches dissociate constraint exclusion completely from
partitioned tables, I'm not sure we need to stress that it is inefficient
for a large number of tables.

> 2. I think "rootrel" should be named "targetpart" in:
>
> + RelOptInfo *rootrel = root->simple_rel_array[root->parse->resultRelation];

To me, "targetpart" makes a partitioned table sound like a partition,
which it is not.  I get that using "root" can be ambiguous, because a
query can specify a non-root partitioned table.

How about "targetrel"?

> 3. Why does the following need to list_copy()?
>
> + List    *saved_join_info_list = list_copy(root->join_info_list);

In earlier versions of this code, root->join_info_list would be modified
during make_one_rel_from_joinlist, but that no longer seems true.

Removed the list_copy.

> 4. Is the "root->parse->commandType != CMD_INSERT" required in:
>
> @@ -181,13 +185,30 @@ make_one_rel(PlannerInfo *root, List *joinlist)
>
>   /*
>   * Generate access paths for the entire join tree.
> + *
> + * If we're doing this for an UPDATE or DELETE query whose target is a
> + * partitioned table, we must do the join planning against each of its
> + * leaf partitions instead.
>   */
> - rel = make_rel_from_joinlist(root, joinlist);
> + if (root->parse->resultRelation &&
> + root->parse->commandType != CMD_INSERT &&
> + root->simple_rel_array[root->parse->resultRelation] &&
> + root->simple_rel_array[root->parse->resultRelation]->part_scheme)
> + {
>
> Won't the simple_rel_array entry for the resultRelation always be NULL
> for an INSERT?
Yep, you're right.  Removed.

> 5. In regards to:
>
> + /*
> + * Hack to make the join planning code believe that 'partrel' can
> + * be joined against.
> + */
> + partrel->reloptkind = RELOPT_BASEREL;
>
> Have you thought about other implications of join planning for other
> member rels, for example, equivalence classes and em_is_child?
Haven't really, but that seems like an important point.  I will study it
more closely.

> 6. It would be good to not have to rt_fetch the same RangeTblEntry twice here:

>
> @@ -959,7 +969,9 @@ subquery_planner(PlannerGlobal *glob, Query *parse,
>   * needs special processing, else go straight to grouping_planner.
>   */
>   if (parse->resultRelation &&
> - rt_fetch(parse->resultRelation, parse->rtable)->inh)
> + rt_fetch(parse->resultRelation, parse->rtable)->inh &&
> + rt_fetch(parse->resultRelation, parse->rtable)->relkind !=
> + RELKIND_PARTITIONED_TABLE)
>   inheritance_planner(root);

The new version doesn't call inheritance_planner at all; in fact,
inheritance_planner no longer exists there.

> 7. Why don't you just pass the Parse into the function as a parameter
> instead of overwriting PlannerInfo's copy in:
>
> + root->parse = partition_parse;
> + partitionwise_adjust_scanjoin_target(root, child_rel,
> + subroots,
> + partitioned_rels,
> + resultRelations,
> + subpaths,
> + WCOLists,
> + returningLists,
> + rowMarks);
> + /* Restore the Query for processing the next partition. */
> + root->parse = parse;
Okay, done.

> 8. Can you update the following comment to mention why you're not
> calling add_paths_to_append_rel for this case?
>
> @@ -6964,7 +7164,9 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
>   }
>
>   /* Build new paths for this relation by appending child paths. */
> - if (live_children != NIL)
> + if (live_children != NIL &&
> + !(rel->reloptkind == RELOPT_BASEREL &&
> +   rel->relid == root->parse->resultRelation))
>   add_paths_to_append_rel(root, rel, live_children);
Oops, stray code from a previous revision.  Removed this hunk.

>
> 9. The following should use >= 0, not > 0
>
> + while ((relid = bms_next_member(child_rel->relids, relid)) > 0)

Yeah, fixed.


Sorry, I haven't yet worked on your comments on 0002 and 0003.  For the
time being, I'd like to report what I've been up to these past couple of
days, because from tomorrow until the end of this month I won't be able to
reply to emails on -hackers due to personal vacation.

So, I spent the better part of the last few days trying to revise the
patches so that they change the planning code for *all* inheritance cases
instead of just focusing on partitioning.  Because, really, the only
difference between the partitioning code and the regular inheritance code
should be that partitioning is able to use faster pruning; all the other
parts should look and work more or less the same.  There shouldn't be code
here to deal with partitioning and code there to deal with inheritance.
Minimizing code this way seems a good end to aim for, imho.

Attached is what I have at the moment.  As part of the effort mentioned
above, I made 0001 remove inheritance_planner altogether, instead of just
taking the partitioning case out of it.  Then there is the WIP patch 0004,
which tries to move even regular inheritance expansion to a late planning
stage, just as the earlier versions did for partitioning.  It will need
quite a bit of polishing before we can consider merging it with 0002.  Of
course, I'll need to address your comments before considering that
seriously.

Thanks,
Amit

v3-0001-Overhaul-inheritance-update-delete-planning.patch (82K)
v3-0002-Lazy-creation-of-partition-objects-for-planning.patch (67K)
v3-0003-Only-lock-partitions-that-will-be-scanned-by-a-qu.patch (14K)
v3-0004-WIP-generate-all-inheritance-child-RelOptInfos-af.patch (87K)