Speeding up INSERTs and UPDATEs to partitioned tables


Speeding up INSERTs and UPDATEs to partitioned tables

David Rowley-3
Hi,

As part of my efforts to make partitioning scale better for larger
numbers of partitions, I've been looking at primarily INSERT VALUES
performance.  Here the overheads are almost completely in the
executor. Planning of this type of statement is very simple since
there is no FROM clause to process.

My benchmarks have been around a RANGE partitioned table with 10k leaf
partitions and no sub-partitioned tables. The partition key is a
timestamp column.

I've found that ExecSetupPartitionTupleRouting() is very slow indeed
and there are a number of things slow about it.  The biggest culprit
for the slowness is the locking of each partition inside of
find_all_inheritors().  For now, this needs to remain as we must hold
locks on each partition while performing RelationBuildPartitionDesc(),
otherwise, one of the partitions may get dropped out from under us.
There might be other valid reasons too, but please see my note at the
bottom of this email.

The locking is not the only slow thing. I found the following to also be slow:

1. RelationGetPartitionDispatchInfo uses a List and lappend() must
perform a palloc() each time a partition is added to the list.
2. A foreach loop is performed over leaf_parts to search for subplans
belonging to this partition. This seems pointless to do for INSERTs as
there's never any to find.
3. ExecCleanupTupleRouting() loops through the entire partitions
array. If a single tuple was inserted then all but one of the elements
will be NULL.
4. Tuple conversion map allocates an empty array thinking there might
be something to put into it. This is costly when the array is large
and pointless when there are no maps to store.
5. During get_partition_dispatch_recurse(), get_rel_relkind() is
called to determine if the partition is a partitioned table or a leaf
partition. This results in a slow relcache hashtable lookup.
6. get_partition_dispatch_recurse() also ends up just building the
indexes array with a sequence of numbers from 0 to nparts - 1 if there
are no sub-partitioned tables. Doing this is slow when there are many
partitions.

Besides the locking, the only thing that remains slow now is the
palloc0() for the 'partitions' array. In my test, it takes 0.6% of
execution time. I don't see any pretty ways to fix that.

I've written fixes for items 1-6 above.

I did:

1. Use an array instead of a List.
2. Don't do this loop. palloc0() the partitions array instead. Let
UPDATE add whatever subplans exist to the zeroed array.
3. Track what we initialize in a gapless array and cleanup just those
ones. Make this array small and increase it only when we need more
space.
4. Only allocate the map array when we need to store a map.
5. Work that out in relcache beforehand.
6. ditto

The only questionable thing I see is what I did for 6.  In partcache.c
I'm basically building an array of nparts elements containing 0 to
nparts - 1.  It seems a bit pointless, so perhaps there's a better way.
I was also a bit too tight to memcpy() that out of relcache, and just
pointed directly to it. That might be a no-go area.
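
Fix 3 in particular boils down to a "track only what you touched, grow
only on demand" pattern. To illustrate, here's a minimal standalone C
sketch (the names are invented for illustration; this is not code from
the patch): keep a small, gapless array of just the partition indexes
that tuple routing actually initialized, doubling it only when it fills,
so cleanup can walk those few entries instead of the whole partitions
array.

#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical routing state: 'touched' is a gap-free list of the
 * partition indexes that were actually initialized, so cleanup need not
 * scan a huge partitions array looking for the few non-NULL entries.
 */
typedef struct RoutingState
{
    int    *touched;        /* partition indexes initialized so far */
    int     ntouched;       /* slots used */
    int     touched_size;   /* slots allocated */
} RoutingState;

/* Record that 'part_index' was initialized, doubling the array when full. */
static void
remember_touched(RoutingState *rs, int part_index)
{
    if (rs->ntouched == rs->touched_size)
    {
        rs->touched_size *= 2;
        rs->touched = realloc(rs->touched, rs->touched_size * sizeof(int));
    }
    rs->touched[rs->ntouched++] = part_index;
}

int
main(void)
{
    RoutingState rs;
    int     i;

    /* Start small; a single-row INSERT typically touches one partition. */
    rs.touched_size = 8;
    rs.ntouched = 0;
    rs.touched = malloc(rs.touched_size * sizeof(int));

    /* Pretend tuple routing initialized partitions 4123 and 77. */
    remember_touched(&rs, 4123);
    remember_touched(&rs, 77);

    /* Cleanup visits 2 entries rather than all 10,000 partitions. */
    for (i = 0; i < rs.ntouched; i++)
        printf("cleaning up partition %d\n", rs.touched[i]);

    free(rs.touched);
    return 0;
}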

I've attached 2 patches:

0001: implements items 1-6
0002: Is not intended for commit. It's just a demo of where we could get
the performance if we were smarter about locking partitions. I've just
included this to show 0001's worth.

Performance

AWS: m5d.large fsync=off

Unpatched:

$ pgbench -n -T 60 -f partbench_insert.sql postgres
transaction type: partbench_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 2836
latency average = 21.162 ms
tps = 47.254409 (including connections establishing)
tps = 47.255756 (excluding connections establishing)

(yes, it's bad)

0001:

$ pgbench -n -T 60 -f partbench_insert.sql postgres
transaction type: partbench_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 3235
latency average = 18.548 ms
tps = 53.913121 (including connections establishing)
tps = 53.914629 (excluding connections establishing)

(a small improvement from 0001)

0001+0002:

$ pgbench -n -T 60 -f partbench_insert.sql postgres
transaction type: partbench_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 660079
latency average = 0.091 ms
tps = 11001.303764 (including connections establishing)
tps = 11001.602377 (excluding connections establishing)

(something to aspire towards)

0002 (only):

$ pgbench -n -T 60 -f partbench_insert.sql postgres
transaction type: partbench_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 27682
latency average = 2.168 ms
tps = 461.350885 (including connections establishing)
tps = 461.363327 (excluding connections establishing)

(shows that doing 0002 alone does not fix all our problems)

Unpartitioned table (control test):

$ pgbench -n -T 60 -f partbench__insert.sql postgres
transaction type: partbench__insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 801260
latency average = 0.075 ms
tps = 13354.311397 (including connections establishing)
tps = 13354.656163 (excluding connections establishing)

Test setup:

CREATE TABLE partbench_ (date TIMESTAMP NOT NULL, i1 INT NOT NULL, i2
INT NOT NULL, i3 INT NOT NULL, i4 INT NOT NULL, i5 INT NOT NULL);
CREATE TABLE partbench (date TIMESTAMP NOT NULL, i1 INT NOT NULL, i2
INT NOT NULL, i3 INT NOT NULL, i4 INT NOT NULL, i5 INT NOT NULL)
PARTITION BY RANGE (date);
\o /dev/null
select 'CREATE TABLE partbench' || x::text || ' PARTITION OF partbench
FOR VALUES FROM (''' || '2017-03-06'::date + (x::text || '
hours')::interval || ''') TO (''' || '2017-03-06'::date + ((x+1)::text
|| ' hours')::interval || ''');'
from generate_Series(0,9999) x;
\gexec
\o

partbench_insert.sql contains:
insert into partbench values('2018-04-26 15:00:00',1,2,3,4,5);

partbench__insert.sql contains:
insert into partbench_ values('2018-04-26 15:00:00',1,2,3,4,5);

I don't want to discuss the locking on this thread. That discussion
will detract from discussing what I'm proposing here... Which is not
to change anything relating to locks. I'm still working on that and
will post elsewhere. Please start another thread if you'd like to
discuss that in the meantime. Feel free to link it in here so others
can follow.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment: v1-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch (41K)
Attachment: v1-0002-Unsafe-locking-reduction-for-partitioned-INSERT-U.patch (3K)

Re: Speeding up INSERTs and UPDATEs to partitioned tables

David Rowley-3
On 22 June 2018 at 18:28, David Rowley <[hidden email]> wrote:

> I've written fixes for items 1-6 above.
>
> I did:
>
> 1. Use an array instead of a List.
> 2. Don't do this loop. palloc0() the partitions array instead. Let
> UPDATE add whatever subplans exist to the zeroed array.
> 3. Track what we initialize in a gapless array and cleanup just those
> ones. Make this array small and increase it only when we need more
> space.
> 4. Only allocate the map array when we need to store a map.
> 5. Work that out in relcache beforehand.
> 6. ditto

I've added this to the July 'fest:

https://commitfest.postgresql.org/18/1690/

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


RE: Speeding up INSERTs and UPDATEs to partitioned tables

Kato, Sho
Hi,

I tried to benchmark with v1-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch, but when I create the second partition, the server process gets a segmentation fault.

I don't know the cause, but it seems that an incorrect value is set in partdesc->boundinfo.

(gdb) p partdesc->boundinfo[0]
$6 = {strategy = 0 '\000', ndatums = 2139062142, datums = 0x7f7f7f7f7f7f7f7f, kind = 0x7f7f7f7f7f7f7f7f, indexes = 0x7f7f7f7f7f7f7f7f, null_index = 2139062143, default_index = 2139062143}


$ psql postgres
psql (11beta2)
Type "help" for help.

postgres=# create table a(i int) partition by range(i);
CREATE TABLE
postgres=# create table a_1 partition of a for values from(1) to (200);
CREATE TABLE
postgres=# create table a_2 partition of a for values from(200) to (400);
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

2018-07-05 14:02:52.405 JST [60250] LOG:  server process (PID 60272) was terminated by signal 11: Segmentation fault
2018-07-05 14:02:52.405 JST [60250] DETAIL:  Failed process was running: create table a_2 partition of a for values from(200) to (400);

(gdb) bt
#0  0x0000000000596e52 in get_default_oid_from_partdesc (partdesc=0x259e928) at partition.c:269
#1  0x0000000000677355 in DefineRelation (stmt=0x259e610, relkind=114 'r', ownerId=10, typaddress=0x0, queryString=0x24d58b8 "create table a_2 partition of a for values from(200) to (400);") at tablecmds.c:832
#2  0x00000000008b6893 in ProcessUtilitySlow (pstate=0x259e4f8, pstmt=0x24d67d8, queryString=0x24d58b8 "create table a_2 partition of a for values from(200) to (400);", context=PROCESS_UTILITY_TOPLEVEL,
    params=0x0, queryEnv=0x0, dest=0x24d6ac8, completionTag=0x7ffc05932330 "") at utility.c:1000
#3  0x00000000008b66c2 in standard_ProcessUtility (pstmt=0x24d67d8, queryString=0x24d58b8 "create table a_2 partition of a for values from(200) to (400);", context=PROCESS_UTILITY_TOPLEVEL, params=0x0,
    queryEnv=0x0, dest=0x24d6ac8, completionTag=0x7ffc05932330 "") at utility.c:920
#4  0x00000000008b583b in ProcessUtility (pstmt=0x24d67d8, queryString=0x24d58b8 "create table a_2 partition of a for values from(200) to (400);", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
    dest=0x24d6ac8, completionTag=0x7ffc05932330 "") at utility.c:360
#5  0x00000000008b482c in PortalRunUtility (portal=0x253af38, pstmt=0x24d67d8, isTopLevel=true, setHoldSnapshot=false, dest=0x24d6ac8, completionTag=0x7ffc05932330 "") at pquery.c:1178
#6  0x00000000008b4a45 in PortalRunMulti (portal=0x253af38, isTopLevel=true, setHoldSnapshot=false, dest=0x24d6ac8, altdest=0x24d6ac8, completionTag=0x7ffc05932330 "") at pquery.c:1324
#7  0x00000000008b3f7d in PortalRun (portal=0x253af38, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x24d6ac8, altdest=0x24d6ac8, completionTag=0x7ffc05932330 "") at pquery.c:799
#8  0x00000000008adf16 in exec_simple_query (query_string=0x24d58b8 "create table a_2 partition of a for values from(200) to (400);") at postgres.c:1122
#9  0x00000000008b21a5 in PostgresMain (argc=1, argv=0x24ff5b0, dbname=0x24ff410 "postgres", username=0x24d2358 "symfo") at postgres.c:4153
#10 0x00000000008113f4 in BackendRun (port=0x24f73f0) at postmaster.c:4361
#11 0x0000000000810b67 in BackendStartup (port=0x24f73f0) at postmaster.c:4033
#12 0x000000000080d0ed in ServerLoop () at postmaster.c:1706
#13 0x000000000080c9a3 in PostmasterMain (argc=1, argv=0x24d0310) at postmaster.c:1379
#14 0x0000000000737392 in main (argc=1, argv=0x24d0310) at main.c:228

(gdb) disassemble
Dump of assembler code for function get_default_oid_from_partdesc:
   0x0000000000596e0a <+0>:     push   %rbp
   0x0000000000596e0b <+1>:     mov    %rsp,%rbp
   0x0000000000596e0e <+4>:     mov    %rdi,-0x8(%rbp)
   0x0000000000596e12 <+8>:     cmpq   $0x0,-0x8(%rbp)
   0x0000000000596e17 <+13>:    je     0x596e56 <get_default_oid_from_partdesc+76>
   0x0000000000596e19 <+15>:    mov    -0x8(%rbp),%rax
   0x0000000000596e1d <+19>:    mov    0x10(%rax),%rax
   0x0000000000596e21 <+23>:    test   %rax,%rax
   0x0000000000596e24 <+26>:    je     0x596e56 <get_default_oid_from_partdesc+76>
   0x0000000000596e26 <+28>:    mov    -0x8(%rbp),%rax
   0x0000000000596e2a <+32>:    mov    0x10(%rax),%rax
   0x0000000000596e2e <+36>:    mov    0x24(%rax),%eax
   0x0000000000596e31 <+39>:    cmp    $0xffffffff,%eax
   0x0000000000596e34 <+42>:    je     0x596e56 <get_default_oid_from_partdesc+76>
   0x0000000000596e36 <+44>:    mov    -0x8(%rbp),%rax
   0x0000000000596e3a <+48>:    mov    0x8(%rax),%rdx
   0x0000000000596e3e <+52>:    mov    -0x8(%rbp),%rax
   0x0000000000596e42 <+56>:    mov    0x10(%rax),%rax
   0x0000000000596e46 <+60>:    mov    0x24(%rax),%eax
   0x0000000000596e49 <+63>:    cltq  
   0x0000000000596e4b <+65>:    shl    $0x2,%rax
   0x0000000000596e4f <+69>:    add    %rdx,%rax
=> 0x0000000000596e52 <+72>:    mov    (%rax),%eax
   0x0000000000596e54 <+74>:    jmp    0x596e5b <get_default_oid_from_partdesc+81>
   0x0000000000596e56 <+76>:    mov    $0x0,%eax
   0x0000000000596e5b <+81>:    pop    %rbp
   0x0000000000596e5c <+82>:    retq  
End of assembler dump.

(gdb) i r
rax            0x20057e77c      8595695484
rbx            0x72     114
rcx            0x7f50ce90e0e8   139985039712488
rdx            0x259e980        39446912
rsi            0x7f50ce90e0a8   139985039712424
rdi            0x259e928        39446824
rbp            0x7ffc05931890   0x7ffc05931890
rsp            0x7ffc05931890   0x7ffc05931890
r8             0x7ffc059317bf   140720402012095
r9             0x0      0
r10            0x6b     107
r11            0x7f50cdbc3f10   139985025777424
r12            0x70     112
r13            0x0      0
r14            0x0      0
r15            0x0      0
rip            0x596e52 0x596e52 <get_default_oid_from_partdesc+72>
eflags         0x10202  [ IF RF ]
cs             0x33     51
ss             0x2b     43
ds             0x0      0
es             0x0      0
fs             0x0      0
gs             0x0      0

(gdb) list *0x596e52
0x596e52 is in get_default_oid_from_partdesc (partition.c:269).
264     Oid
265     get_default_oid_from_partdesc(PartitionDesc partdesc)
266     {
267             if (partdesc && partdesc->boundinfo &&
268                     partition_bound_has_default(partdesc->boundinfo))
269                     return partdesc->oids[partdesc->boundinfo->default_index];
270
271             return InvalidOid;
272     }
273

regards,
-----Original Message-----
From: David Rowley [mailto:[hidden email]]
Sent: Saturday, June 23, 2018 7:19 AM
To: PostgreSQL Hackers <[hidden email]>
Subject: Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 22 June 2018 at 18:28, David Rowley <[hidden email]> wrote:

> I've written fixes for items 1-6 above.
>
> I did:
>
> 1. Use an array instead of a List.
> 2. Don't do this loop. palloc0() the partitions array instead. Let
> UPDATE add whatever subplans exist to the zeroed array.
> 3. Track what we initialize in a gapless array and cleanup just those
> ones. Make this array small and increase it only when we need more
> space.
> 4. Only allocate the map array when we need to store a map.
> 5. Work that out in relcache beforehand.
> 6. ditto

I've added this to the July 'fest:

https://commitfest.postgresql.org/18/1690/

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Speeding up INSERTs and UPDATEs to partitioned tables

David Rowley-3
On 5 July 2018 at 18:39, Kato, Sho <[hidden email]> wrote:
> postgres=# create table a(i int) partition by range(i);
> CREATE TABLE
> postgres=# create table a_1 partition of a for values from(1) to (200);
> CREATE TABLE
> postgres=# create table a_2 partition of a for values from(200) to (400);
> server closed the connection unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.

Hi,

Thanks for testing. I'm unable to reproduce this on beta2 or master as
of f61988d16.

Did you try make clean then building again?  The 0001 patch does
change PartitionDescData, so if you've not rebuilt all .c files which
use that then that might explain your crash.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Speeding up INSERTs and UPDATEs to partitioned tables

Jesper Pedersen
In reply to this post by David Rowley-3
Hi David,

On 06/22/2018 02:28 AM, David Rowley wrote:
> I've attached 2 patches:
>
> 0001: implements items 1-6
> 0002: Is not intended for commit. It just a demo of where we could get
> the performance if we were smarter about locking partitions. I've just
> included this to show 0001's worth.
>

I did some tests with a 64 hash partition setup, and see a speedup for
INSERT / UPDATE scenarios.

> I don't want to discuss the locking on this thread. That discussion
> will detract from discussing what I'm proposing here... Which is not
> to change anything relating to locks. I'm still working on that and
> will post elsewhere.

With 0002 INSERTs get close to the TPS of the non-partitioned case.
However, UPDATEs don't see the same speedup. But, as you said, that's a
discussion for another thread.

Best regards,
  Jesper


Re: Speeding up INSERTs and UPDATEs to partitioned tables

David Rowley-3
On 6 July 2018 at 01:18, Jesper Pedersen <[hidden email]> wrote:
> With 0002 INSERTs get close to the TPS of the non-partitioned case. However,
> UPDATEs doesn't see the same speedup. But, as you said, a discussion for
> another thread.

Hi Jesper,

Thanks for testing this out.  It was only really the locking I didn't
want to discuss here due to the risk of the discussion of removing the
other overheads getting lost in discussions about locking.

It's most likely that the UPDATE is slower due to the planner still
being quite slow when dealing with partitioned tables. It still builds
RangeTblEntry and RelOptInfo objects for each partition even if the
partition is pruned. With an INSERT using a VALUES clause, the planner
does not build these objects; in fact, the planner barely does any
work at all, so this can be sped up just by removing the executor
overheads.

(I do have other WIP patches to speed up the planner. After delaying
building the RelOptInfo and RangeTblEntry objects, with my 10k partition
setup, planning a SELECT became 1600 times faster. UPDATE did not
finish in the unpatched version, so gains there are harder to measure.
There's still much work to do on these patches, and there's still more
performance to squeeze out too. Hopefully, I'll be discussing this on
another thread soon.)

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


RE: Speeding up INSERTs and UPDATEs to partitioned tables

Kato, Sho
In reply to this post by David Rowley-3
Thanks David!

I ran a benchmark with pgbench and see a speedup for INSERT / UPDATE scenarios.
I used range partitioning.

Benchmark results are as follows.

1. 11beta2 result

 part_num |   tps_ex   | latency_avg | update_latency | select_latency | insert_latency
----------+------------+-------------+----------------+----------------+----------------
      100 | 479.456278 |       2.086 |          1.382 |          0.365 |          0.168
      200 | 169.155411 |       5.912 |          4.628 |          0.737 |          0.299
      400 |  24.857495 |       40.23 |         36.606 |          2.252 |          0.881
      800 |   6.718104 |     148.853 |        141.471 |          5.253 |          1.433
     1600 |   1.934908 |     516.825 |        489.982 |         21.102 |          3.701
     3200 |   0.456967 |    2188.362 |       2101.247 |         72.784 |          8.833
     6400 |   0.116643 |    8573.224 |        8286.79 |        257.904 |         14.949


2. 11beta2 + patch1 + patch2

patch1: Allow direct lookups of AppendRelInfo by child relid
        commit 7d872c91a3f9d49b56117557cdbb0c3d4c620687
patch2: 0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch

 part_num |   tps_ex    | latency_avg | update_latency | select_latency | insert_latency
----------+-------------+-------------+----------------+----------------+----------------
      100 | 1224.430344 |       0.817 |          0.551 |          0.085 |          0.048
      200 |  689.567511 |        1.45 |           1.12 |          0.119 |           0.05
      400 |  347.876616 |       2.875 |          2.419 |          0.185 |          0.052
      800 |  140.489269 |       7.118 |          6.393 |          0.329 |          0.059
     1600 |   29.681672 |      33.691 |         31.272 |          1.517 |          0.147
     3200 |    7.021957 |     142.412 |          136.4 |          4.033 |          0.214
     6400 |    1.462949 |     683.557 |        669.187 |          7.677 |          0.264


benchmark script:

\set aid random(1, 100 * 1)
\set delta random(-5000, 5000)
BEGIN;
UPDATE test.accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM test.accounts WHERE aid = :aid;
INSERT INTO test.accounts_history (aid, delta, mtime) VALUES (:aid, :delta, CURRENT_TIMESTAMP);
END;

The partition key is aid.

-----Original Message-----
From: David Rowley [mailto:[hidden email]]
Sent: Thursday, July 05, 2018 6:19 PM
To: Kato, Sho/加藤 翔 <[hidden email]>
Cc: PostgreSQL Hackers <[hidden email]>
Subject: Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 5 July 2018 at 18:39, Kato, Sho <[hidden email]> wrote:
> postgres=# create table a(i int) partition by range(i);
> CREATE TABLE
> postgres=# create table a_1 partition of a for values from(1) to (200);
> CREATE TABLE
> postgres=# create table a_2 partition of a for values from(200) to (400);
> server closed the connection unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.

Hi,

Thanks for testing. I'm unable to reproduce this on beta2 or master as of f61988d16.

Did you try make clean then building again?  The 0001 patch does change PartitionDescData, so if you've not rebuilt all .c files which use that then that might explain your crash.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Speeding up INSERTs and UPDATEs to partitioned tables

David Rowley-3
On 6 July 2018 at 21:25, Kato, Sho <[hidden email]> wrote:

> 2. 11beta2 + patch1 + patch2
>
> patch1: Allow direct lookups of AppendRelInfo by child relid
>         commit 7d872c91a3f9d49b56117557cdbb0c3d4c620687
> patch2: 0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch
>
>  part_num |   tps_ex    | latency_avg | update_latency | select_latency | insert_latency
> ----------+-------------+-------------+----------------+----------------+----------------
>       100 | 1224.430344 |       0.817 |          0.551 |          0.085 |          0.048
>       200 |  689.567511 |        1.45 |           1.12 |          0.119 |           0.05
>       400 |  347.876616 |       2.875 |          2.419 |          0.185 |          0.052
>       800 |  140.489269 |       7.118 |          6.393 |          0.329 |          0.059
>      1600 |   29.681672 |      33.691 |         31.272 |          1.517 |          0.147
>      3200 |    7.021957 |     142.412 |          136.4 |          4.033 |          0.214
>      6400 |    1.462949 |     683.557 |        669.187 |          7.677 |          0.264

Hi,

Thanks a lot for benchmarking this.

Just a note to say that the "Allow direct lookups of AppendRelInfo by
child relid" patch is already in master. It's much more relevant to be
testing with master than pg11. This patch is not intended for pg11.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Speeding up INSERTs and UPDATEs to partitioned tables

Alvaro Herrera-9
On 2018-Jul-11, David Rowley wrote:

> On 6 July 2018 at 21:25, Kato, Sho <[hidden email]> wrote:
> > 2. 11beta2 + patch1 + patch2
> >
> > patch1: Allow direct lookups of AppendRelInfo by child relid
> >         commit 7d872c91a3f9d49b56117557cdbb0c3d4c620687
> > patch2: 0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch
> >
> >  part_num |   tps_ex    | latency_avg | update_latency | select_latency | insert_latency
> > ----------+-------------+-------------+----------------+----------------+----------------
> >       100 | 1224.430344 |       0.817 |          0.551 |          0.085 |          0.048
> >       200 |  689.567511 |        1.45 |           1.12 |          0.119 |           0.05
> >       400 |  347.876616 |       2.875 |          2.419 |          0.185 |          0.052
> >       800 |  140.489269 |       7.118 |          6.393 |          0.329 |          0.059
> >      1600 |   29.681672 |      33.691 |         31.272 |          1.517 |          0.147
> >      3200 |    7.021957 |     142.412 |          136.4 |          4.033 |          0.214
> >      6400 |    1.462949 |     683.557 |        669.187 |          7.677 |          0.264

> Just a note to say that the "Allow direct lookups of AppendRelInfo by
> child relid" patch is already in master. It's much more relevant to be
> testing with master than pg11. This patch is not intended for pg11.

That commit is also in pg11, though -- just not in beta2.  So we still
don't know how much of an improvement patch2 is by itself :-)

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Speeding up INSERTs and UPDATEs to partitioned tables

Amit Langote-2
In reply to this post by David Rowley-3
Hi David.

On 2018/06/22 15:28, David Rowley wrote:
> Hi,
>
> As part of my efforts to make partitioning scale better for larger
> numbers of partitions, I've been looking at primarily INSERT VALUES
> performance.  Here the overheads are almost completely in the
> executor. Planning of this type of statement is very simple since
> there is no FROM clause to process.

Thanks for this effort.

> My benchmarks have been around a RANGE partitioned table with 10k leaf
> partitions and no sub-partitioned tables. The partition key is a
> timestamp column.
>
> I've found that ExecSetupPartitionTupleRouting() is very slow indeed
> and there are a number of things slow about it.  The biggest culprit
> for the slowness is the locking of each partition inside of
> find_all_inheritors().

Yes. :-(

> For now, this needs to remain as we must hold
> locks on each partition while performing RelationBuildPartitionDesc(),
> otherwise, one of the partitions may get dropped out from under us.

We lock all partitions using find_all_inheritors to keep locking order
consistent with other sites that may want to lock tables in the same
partition tree but with a possibly conflicting lock mode.  If we remove
the find_all_inheritors call in ExecSetupPartitionTupleRouting (like your
0002 does), we may end up locking partitions in arbitrary order in a given
transaction, because input tuples will be routed to various partitions in
an order that's not predetermined.

But, maybe it's not necessary to be that paranoid.  If we've locked the
parent, any concurrent lockers would have to wait for the lock on the
parent anyway, so it doesn't matter in which order tuple routing locks
the partitions.

> The locking is not the only slow thing. I found the following to also be slow:
>
> 1. RelationGetPartitionDispatchInfo uses a List and lappend() must
> perform a palloc() each time a partition is added to the list.
> 2. A foreach loop is performed over leaf_parts to search for subplans
> belonging to this partition. This seems pointless to do for INSERTs as
> there's never any to find.
> 3. ExecCleanupTupleRouting() loops through the entire partitions
> array. If a single tuple was inserted then all but one of the elements
> will be NULL.
> 4. Tuple conversion map allocates an empty array thinking there might
> be something to put into it. This is costly when the array is large
> and pointless when there are no maps to store.
> 5. During get_partition_dispatch_recurse(), get_rel_relkind() is
> called to determine if the partition is a partitioned table or a leaf
> partition. This results in a slow relcache hashtable lookup.
> 6. get_partition_dispatch_recurse() also ends up just building the
> indexes array with a sequence of numbers from 0 to nparts - 1 if there
> are no sub-partitioned tables. Doing this is slow when there are many
> partitions.
>
> Besides the locking, the only thing that remains slow now is the
> palloc0() for the 'partitions' array. In my test, it takes 0.6% of
> execution time. I don't see any pretty ways to fix that.
>
> I've written fixes for items 1-6 above.
>
> I did:
>
> 1. Use an array instead of a List.
> 2. Don't do this loop. palloc0() the partitions array instead. Let
> UPDATE add whatever subplans exist to the zeroed array.
> 3. Track what we initialize in a gapless array and cleanup just those
> ones. Make this array small and increase it only when we need more
> space.
> 4. Only allocate the map array when we need to store a map.
> 5. Work that out in relcache beforehand.
> 6. ditto

The issues you list all seem legitimate to me, and so do your proposed
fixes for each, except I think we could go a bit further.

Why don't we abandon the notion altogether that
ExecSetupPartitionTupleRouting *has to* process the whole partition tree?
ISTM, there is no need to determine the exact number of leaf partitions
and partitioned tables in the partition tree and allocate the arrays in
PartitionTupleRouting based on that.  I know that the indexes array in
PartitionDispatchData contains mapping from local partition indexes (0 to
partdesc->nparts - 1) to those that span *all* leaf partitions and *all*
partitioned tables (0 to proute->num_partitions or proute->num_dispatch)
in a partition tree, but we can change that.

The idea I had was inspired by looking at the partitions_init stuff in
your patch.  We could allocate proute->partition_dispatch_info and
proute->partitions arrays to be of a predetermined size, which doesn't
require us to calculate the exact number of leaf partitions and
partitioned tables beforehand.  So, RelationGetPartitionDispatchInfo need
not recursively go over all of the partition tree.  Instead we create just
one PartitionDispatch object for the root parent table, whose indexes
array is initialized with -1, meaning none of the partitions has been
encountered yet.  In ExecFindPartition, once tuple routing chooses a
partition, we create either a ResultRelInfo (if leaf) or a
PartitionDispatch for it and store it in the 0th slot in
proute->partitions or proute->partition_dispatch_info, respectively.
Also, we update the indexes array in the parent's PartitionDispatch to
replace -1 with 0 so that future tuples routing to that partition don't
allocate it again.  The process is repeated if the tuple needs to be
routed one more level down.  If the query needs to allocate more
ResultRelInfos and/or PartitionDispatch objects than we initially
allocated space for, we expand those arrays.  Finally, during
ExecCleanupTupleRouting, we only "clean up" the partitions that we
allocated ResultRelInfos and PartitionDispatch objects for, which is very
similar to the partitions_init idea in your patch.
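
To make that concrete, here is a minimal standalone C sketch of the lazy
scheme described above (purely illustrative, with invented names; the
real delta patch of course works with PartitionDispatch, ResultRelInfo
and the other executor structures): the parent's indexes array starts
out all -1, and the first tuple routed to a given partition claims the
next slot in a small array that is grown only on demand.

#include <stdio.h>
#include <stdlib.h>

#define NPARTS 2500             /* partitions of a single-level parent */

/* Stand-ins for per-partition executor state and the routing struct. */
typedef struct PartState
{
    int         part_index;
} PartState;

typedef struct Routing
{
    int         indexes[NPARTS];    /* -1 = partition not encountered yet */
    PartState **parts;              /* only as many slots as we have used */
    int         nparts_used;
    int         parts_allocated;
} Routing;

/*
 * Return the state for the partition a tuple routed to, building it on
 * first use and doubling the parts array when it fills up.
 */
static PartState *
find_partition(Routing *r, int part_index)
{
    if (r->indexes[part_index] == -1)
    {
        if (r->nparts_used == r->parts_allocated)
        {
            r->parts_allocated *= 2;
            r->parts = realloc(r->parts,
                               r->parts_allocated * sizeof(PartState *));
        }
        r->parts[r->nparts_used] = malloc(sizeof(PartState));
        r->parts[r->nparts_used]->part_index = part_index;
        r->indexes[part_index] = r->nparts_used++;
    }
    return r->parts[r->indexes[part_index]];
}

int
main(void)
{
    Routing     r;
    PartState  *ps;
    int         i;

    for (i = 0; i < NPARTS; i++)
        r.indexes[i] = -1;
    r.parts_allocated = 8;      /* small, predetermined initial size */
    r.nparts_used = 0;
    r.parts = malloc(r.parts_allocated * sizeof(PartState *));

    /* A single-row INSERT touches one partition; only one slot is built. */
    ps = find_partition(&r, 1234);
    printf("built state for partition %d; slots used: %d of %d\n",
           ps->part_index, r.nparts_used, NPARTS);
    return 0;
}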

I implemented that idea in the attached patch, which applies on top of
your 0001 patch, but I'd say it's too big to be just called a delta.  I
was able to get the following performance numbers using this pgbench
test:

pgbench -n -T 180 -f insert-ht.sql
cat insert-ht.sql
\set b random(1, 1000)
\set a random(1, 1000)
insert into ht values (:b, :a);

Note that pgbench is run 3 times and every tps result is listed below.

HEAD - 0 parts (unpartitioned table)
tps = 2519.603076 (including connections establishing)
tps = 2486.903189 (including connections establishing)
tps = 2518.771182 (including connections establishing)

HEAD - 2500 hash parts (no subpart)
tps = 13.158224 (including connections establishing)
tps = 12.940713 (including connections establishing)
tps = 12.882465 (including connections establishing)

David - 2500 hash parts (no subpart)
tps = 18.717628 (including connections establishing)
tps = 18.602280 (including connections establishing)
tps = 18.945527 (including connections establishing)

Amit - 2500 hash parts (no subpart)
tps = 18.576858 (including connections establishing)
tps = 18.431902 (including connections establishing)
tps = 18.797023 (including connections establishing)

HEAD - 2500 hash parts (4 hash subparts each)
tps = 2.339332 (including connections establishing)
tps = 2.339582 (including connections establishing)
tps = 2.317037 (including connections establishing)

David - 2500 hash parts (4 hash subparts each)
tps = 3.225044 (including connections establishing)
tps = 3.214053 (including connections establishing)
tps = 3.239591 (including connections establishing)

Amit - 2500 hash parts (4 hash subparts each)
tps = 3.321918 (including connections establishing)
tps = 3.305952 (including connections establishing)
tps = 3.301036 (including connections establishing)

Applying the lazy locking patch on top of David's and my patch,
respectively, produces the following results.

David - 2500 hash parts (no subpart)
tps = 1577.854360 (including connections establishing)
tps = 1532.681499 (including connections establishing)
tps = 1464.254096 (including connections establishing)

Amit - 2500 hash parts (no subpart)
tps = 1532.475751 (including connections establishing)
tps = 1534.650325 (including connections establishing)
tps = 1527.840837 (including connections establishing)

David - 2500 hash parts (4 hash subparts each)
tps = 78.845916 (including connections establishing)
tps = 79.167079 (including connections establishing)
tps = 79.621686 (including connections establishing)

Amit - 2500 hash parts (4 hash subparts each)
tps = 329.887008 (including connections establishing)
tps = 327.428103 (including connections establishing)
tps = 326.863248 (including connections establishing)

About the last two results: after getting rid of the time-hog that is the
find_all_inheritors() call in ExecSetupPartitionTupleRouting for locking
all partitions, it seems that we'll end up spending most of the time in
RelationGetPartitionDispatchInfo() without my patch, because it will call
get_partition_dispatch_recurse() for each of the 2500 first-level
partitions that are themselves partitioned.  With my patch, we won't do
that and won't end up generating 2499 PartitionDispatch objects that would
not be needed for a single-row insert statement.

Thanks,
Amit

Attachment: david-0001-delta.patch (67K)

Re: Speeding up INSERTs and UPDATEs to partitioned tables

David Rowley-3
On 13 July 2018 at 20:20, Amit Langote <[hidden email]> wrote:
> Why don't we abandon the notion altogether that
> ExecSetupPartitionTupleRouting *has to* process the whole partition tree?

[...]

> I implemented that idea in the attached patch, which applies on top of
> your 0001 patch, but I'd say it's too big to be just called a delta.  I
> was able to get following performance numbers using the following pgbench
> test:

Thanks for looking at this.  I like that your idea gets rid of the
indexes cache in syscache. I was not very happy with that part.

I've looked over the code and the ExecUseUpdateResultRelForRouting()
function is broken.  Your while loop only skips partitions for the
current partitioned table; it does not skip ModifyTable subnodes that
belong to other partitioned tables.

You can use the following. The code does not find the t1_a2 subnode.

create table t1 (a int, b int) partition by list(a);
create table t1_a1 partition of t1 for values in(1) partition by list(b);
create table t1_a2 partition of t1 for values in(2);
create table t1_a1_b1 partition of t1_a1 for values in(1);
create table t1_a1_b2 partition of t1_a1 for values in(2);
insert into t1 values(2,2);

update t1 set a = a;

I think there might not be enough information to make this work
correctly: if you change the loop to skip subnodes, then it won't
work in cases where partition[0] was pruned.

I've another patch sitting here, partly done, that changes
pg_class.relispartition into pg_class.relpartitionparent.  If we had
that then we could code your loop to work correctly.   Alternatively,
I guess we could just ignore the UPDATE's ResultRelInfos and just
build new ones. Unsure if there's actually a reason we need to reuse
the existing ones, is there?

I think you'd need to know the owning partition and skip subnodes that
don't belong to pd->reldesc. Alternatively, a hashtable could be built
with all the oids belonging to pd->reldesc, then we could loop over
the update_rris finding subnodes that can be found in the hashtable.
Likely this will be much slower than the sort of merge lookup that the
previous code did.
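
Roughly this sort of thing, as a standalone sketch (invented names and a
toy open-addressing table purely for illustration; a real version would
use the backend's usual hash table support): the table is built once,
keyed by relation OID, and each lookup afterwards is a single probe
rather than a nested loop.

#include <stdio.h>
#include <string.h>

typedef unsigned int Oid;

/* Stand-in for an UPDATE subnode's ResultRelInfo. */
typedef struct ResultRel
{
    Oid         relid;
} ResultRel;

#define TABLE_SIZE 64           /* power of two, larger than the subnode count */

typedef struct OidTableEntry
{
    Oid         key;            /* 0 (InvalidOid) means the slot is empty */
    ResultRel  *value;
} OidTableEntry;

/*
 * Find-or-insert with linear probing: returns the slot for 'key',
 * claiming an empty slot if the key wasn't present yet.
 */
static OidTableEntry *
oidtable_probe(OidTableEntry *table, Oid key)
{
    unsigned int i = (key * 2654435761u) & (TABLE_SIZE - 1);

    while (table[i].key != 0 && table[i].key != key)
        i = (i + 1) & (TABLE_SIZE - 1);
    table[i].key = key;
    return &table[i];
}

int
main(void)
{
    OidTableEntry table[TABLE_SIZE];
    ResultRel   update_rris[] = {{16400}, {16410}, {16420}};
    int         i;

    memset(table, 0, sizeof(table));

    /* Build the table once from the UPDATE subnodes... */
    for (i = 0; i < 3; i++)
        oidtable_probe(table, update_rris[i].relid)->value = &update_rris[i];

    /* ...then finding the subnode for a routed partition is one probe. */
    printf("found result rel for OID %u\n",
           oidtable_probe(table, 16410)->value->relid);
    return 0;
}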

Another thing that I don't like is the PARTITION_ROUTING_MAXSIZE code.
The code seems to assume that there can only be at the most 65536
partitions, but I don't think there's any code which restricts us to
that. There is code in the planner that will bork when trying to
create a RangeTblEntry up that high, but as far as I know that won't
be noticed on the INSERT path.  I don't think this code has any
business knowing what the special varnos are set to either.  It would
be better to just remove the limit and suffer the small wasted array
space.  I understand you've probably coded it like this due to the
similar code that was in my patch, but with mine I knew the total
number of partitions. Your patch does not.

Other thoughts on the patch:

I wonder if it's worth having syscache keep a count of the number of
sub-partitioned tables a partition has. If there are none in the root
partition then the partition_dispatch_info can be initialized with
just 1 element to store the root details. Although, maybe it's not
worth it to reduce the array size by 7 elements.

Also, I'm a bit confused about why you changed the comments in
execPartition.h for PartitionTupleRouting to be inline again. I
brought those out of line as I thought the complexity of the code
warranted that. Your inlining them again goes against what all the
other structs do in that file.

Apart from that, I think the idea is promising. We'll just need to
find a way to make ExecUseUpdateResultRelForRouting work correctly.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Speeding up INSERTs and UPDATEs to partitioned tables

Amit Langote-2
Hi David,

Thanks for taking a look.

On 2018/07/15 17:34, David Rowley wrote:

> I've looked over the code and the ExecUseUpdateResultRelForRouting()
> function is broken.  Your while loop only skips partitions for the
> current partitioned table, it does not skip ModifyTable subnodes that
> belong to other partitioned tables.
>
> You can use the following. The code does not find the t1_a2 subnode.
>
> create table t1 (a int, b int) partition by list(a);
> create table t1_a1 partition of t1 for values in(1) partition by list(b);
> create table t1_a2 partition of t1 for values in(2);
> create table t1_a1_b1 partition of t1_a1 for values in(1);
> create table t1_a1_b2 partition of t1_a1 for values in(2);
> insert into t1 values(2,2);
>
> update t1 set a = a;
Hmm, it indeed is broken.

> I think there might not be enough information to make this work
> correctly, as if you change the loop to skip subnodes, then it won't
> work in cases where the partition[0] was pruned.
>
> I've another patch sitting here, partly done, that changes
> pg_class.relispartition into pg_class.relpartitionparent.  If we had
> that then we could code your loop to work correctly.  Alternatively,
> I guess we could just ignore the UPDATE's ResultRelInfos and just
> build new ones. Unsure if there's actually a reason we need to reuse
> the existing ones, is there?
We try to use the existing ones because we thought back when the patch was
written (not by me though) that redoing all the work that
InitResultRelInfo does for each partition, for which we could have instead
used an existing one, would cumulatively end up being more expensive than
figuring out which ones we could reuse by a linear scan of partition and
result rels arrays in parallel.  I don't remember seeing a benchmark to
demonstrate the benefit of doing this though.  Maybe it was posted, but I
don't remember having looked at it closely.

> I think you'd need to know the owning partition and skip subnodes that
> don't belong to pd->reldesc. Alternatively, a hashtable could be built
> with all the oids belonging to pd->reldesc, then we could loop over
> the update_rris finding subnodes that can be found in the hashtable.
> Likely this will be much slower than the sort of merge lookup that the
> previous code did.

I think one option is to simply give up on the idea of matching *all*
UPDATE result rels that belong to a given partitioned table (pd->reldesc)
in one call of ExecUseUpdateResultRelForRouting.  Instead, pass the index
of the partition (in pd->partdesc->oids) to find the ResultRelInfo for,
loop over all UPDATE result rels looking for one, and return immediately
on finding one after having stored its pointer in proute->partitions.  In
the worst case, we'll end up scanning the UPDATE result rels array for
every partition that gets touched, but maybe such an UPDATE query is less
common, and even if such a query occurs, tuple routing might be the least
of its bottlenecks.

I have implemented that approach in the updated patch.
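
In rough outline, the per-partition lookup amounts to something like the
following standalone sketch (invented names just for illustration; the
real code of course deals with ResultRelInfo, pd->partdesc->oids and
proute->partitions):

#include <stdio.h>
#include <stddef.h>

typedef unsigned int Oid;

/* Stand-in for an UPDATE subplan's ResultRelInfo. */
typedef struct UpdateResultRel
{
    Oid         relid;
} UpdateResultRel;

/*
 * Scan the UPDATE result rels for the partition a tuple just routed to;
 * return it so the caller can cache it in its partitions array, or NULL
 * if tuple routing must build a fresh ResultRelInfo for that partition.
 */
static UpdateResultRel *
find_update_result_rel(UpdateResultRel *rels, int nrels, Oid partition_oid)
{
    int         i;

    for (i = 0; i < nrels; i++)
    {
        if (rels[i].relid == partition_oid)
            return &rels[i];
    }
    return NULL;
}

int
main(void)
{
    UpdateResultRel update_rels[] = {{16400}, {16410}, {16420}};

    printf("OID 16410: %s\n",
           find_update_result_rel(update_rels, 3, 16410) ? "reuse" : "build new");
    printf("OID 16999: %s\n",
           find_update_result_rel(update_rels, 3, 16999) ? "reuse" : "build new");
    return 0;
}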

That means I also needed to change things so that
ExecUseUpdateResultRelsForRouting is now only called by ExecFindPartition,
because with the new arrangement, it's useless to call it from
ExecSetupPartitionTupleRouting.  Moreover, an UPDATE may not use tuple
routing at all, even if the fact that the partition key is being updated
results in calling ExecSetupPartitionTupleRouting.

> Another thing that I don't like is the PARTITION_ROUTING_MAXSIZE code.
> The code seems to assume that there can only be at the most 65536
> partitions, but I don't think there's any code which restricts us to
> that. There is code in the planner that will bork when trying to
> create a RangeTblEntry up that high, but as far as I know that won't
> be noticed on the INSERT path.  I don't think this code has any
> business knowing what the special varnos are set to either.  It would
> be better to just remove the limit and suffer the small wasted array
> space.  I understand you've probably coded it like this due to the
> similar code that was in my patch, but with mine I knew the total
> number of partitions. Your patch does not.
OK, I changed it to UINT_MAX.

> Other thoughts on the patch:
>
> I wonder if it's worth having syscache keep a count on the number of
> sub-partitioned tables a partition has. If there are none in the root
> partition then the partition_dispatch_info can be initialized with
> just 1 element to store the root details. Although, maybe it's not
> worth it to reduce the array size by 7 elements.

Hmm yes.  Allocating space for 8 pointers when we really need 1 is not too
bad, if the alternative is to modify partcache.c.

> Also, I'm a bit confused why you change the comments in
> execPartition.h for PartitionTupleRouting to be inline again. I
> brought those out of line as I thought the complexity of the code
> warranted that. You're inlining them again goes against what all the
> other structs do in that file.

It was out-of-line to begin with but it started to become distracting when
updating the comments.  But I agree about being consistent and hence I
have moved them back to where they were.  I have significantly rewritten
those comments though to be clearer.

> Apart from that, I think the idea is promising. We'll just need to
> find a way to make ExecUseUpdateResultRelForRouting work correctly.

Let me know what you think of the code in the updated patch.

Thanks,
Amit

Attachment: david-0001-delta-v2.patch (71K)

RE: Speeding up INSERTs and UPDATEs to partitioned tables

Kato, Sho
In reply to this post by Alvaro Herrera-9
On 2018-Jul-11, Alvaro Herrera wrote:
> That commit is also in pg11, though -- just not in beta2.  So we still don't know how much of an improvement patch2 is by itself :-)

Oops! I benchmarked with 11beta2 + 0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch.
Results are as follows.

Performance seems to be improved.

part_num | latency_avg |   tps_ex   | update_latency | select_latency | insert_latency
----------+-------------+------------+----------------+----------------+----------------
      100 |        2.09 | 478.379516 |          1.407 |           0.36 |          0.159
      200 |       5.871 | 170.322179 |          4.621 |          0.732 |          0.285
      400 |      39.029 |  25.622384 |         35.542 |          2.273 |          0.758
      800 |     142.624 |   7.011494 |        135.447 |           5.04 |          1.388
     1600 |     559.872 |   1.786138 |        534.301 |         20.318 |          3.122
     3200 |    2161.834 |   0.462574 |       2077.737 |         72.804 |          7.037
     6400 |     8282.38 |   0.120739 |       7996.212 |        259.406 |         14.514

Thanks

Kato Sho
-----Original Message-----
From: Alvaro Herrera [mailto:[hidden email]]
Sent: Wednesday, July 11, 2018 10:30 PM
To: David Rowley <[hidden email]>
Cc: Kato, Sho/加藤 翔 <[hidden email]>; PostgreSQL Hackers <[hidden email]>
Subject: Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 2018-Jul-11, David Rowley wrote:

> On 6 July 2018 at 21:25, Kato, Sho <[hidden email]> wrote:
> > 2. 11beta2 + patch1 + patch2
> >
> > patch1: Allow direct lookups of AppendRelInfo by child relid
> >         commit 7d872c91a3f9d49b56117557cdbb0c3d4c620687
> > patch2: 0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch
> >
> >  part_num |   tps_ex    | latency_avg | update_latency | select_latency | insert_latency
> > ----------+-------------+-------------+----------------+----------------+----------------
> >       100 | 1224.430344 |       0.817 |          0.551 |          0.085 |          0.048
> >       200 |  689.567511 |        1.45 |           1.12 |          0.119 |           0.05
> >       400 |  347.876616 |       2.875 |          2.419 |          0.185 |          0.052
> >       800 |  140.489269 |       7.118 |          6.393 |          0.329 |          0.059
> >      1600 |   29.681672 |      33.691 |         31.272 |          1.517 |          0.147
> >      3200 |    7.021957 |     142.412 |          136.4 |          4.033 |          0.214
> >      6400 |    1.462949 |     683.557 |        669.187 |          7.677 |          0.264

> Just a note to say that the "Allow direct lookups of AppendRelInfo by
> child relid" patch is already in master. It's much more relevant to be
> testing with master than pg11. This patch is not intended for pg11.

That commit is also in pg11, though -- just not in beta2.  So we still don't know how much of an improvement patch2 is by itself :-)

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speeding up INSERTs and UPDATEs to partitioned tables

David Rowley-3
On 18 July 2018 at 21:44, Kato, Sho <[hidden email]> wrote:
> part_num | latency_avg |   tps_ex   | update_latency | select_latency | insert_latency
> ----------+-------------+------------+----------------+----------------+----------------
>       100 |        2.09 | 478.379516 |          1.407 |           0.36 |          0.159
>       200 |       5.871 | 170.322179 |          4.621 |          0.732 |          0.285
>       400 |      39.029 |  25.622384 |         35.542 |          2.273 |          0.758
>       800 |     142.624 |   7.011494 |        135.447 |           5.04 |          1.388
>      1600 |     559.872 |   1.786138 |        534.301 |         20.318 |          3.122
>      3200 |    2161.834 |   0.462574 |       2077.737 |         72.804 |          7.037
>      6400 |     8282.38 |   0.120739 |       7996.212 |        259.406 |         14.514

Thanks for testing. It's fairly customary to include before/after,
unpatched/patched results.  I don't think your patched results really
mean much by themselves. It's pretty well known that adding more
partitions slows down the planner and the executor, to a lesser
extent. This patch only aims to reduce some of the executor startup
overheads for INSERT and UPDATE.

Also, the 0001 patch is not really aiming to break any performance
records. I posted results already and there is only a very small
improvement. The main aim with the 0001 patch is to remove the
bottlenecks so that the performance drop between partitioned and
non-partitioned is primarily due to the partition locking.  I'd like
to fix that too, but it's more work and I see no reason that we
shouldn't fix up the other slow parts first. I imagine this will
increase the motivation to resolve the locking all partitions issue
too.

I'd also recommend that if you're testing this, you do so with a
recent master. The patch is not intended for pg11.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Speeding up INSERTs and UPDATEs to partitioned tables

David Rowley-3
In reply to this post by Amit Langote-2
On 18 July 2018 at 20:29, Amit Langote <[hidden email]> wrote:
> Let me know what you think of the code in the updated patch.

Thanks for sending the updated patch.

I looked over it tonight and made a number of changes:

1) Got rid of PARTITION_ROUTING_MAXSIZE. The code using this was
useless since the int would have wrapped long before it reached
UINT_MAX. There's no shortage of other code doubling the size of an
array by multiplying it by 2 unconditionally without considering
overflowing an int. Unsure why you considered this more risky.
2) Fixed a series of bugs regarding the size of the arrays in
PartitionTupleRouting. The map arrays and the partitions array could
differ in size despite your comment that claimed
child_parent_tupconv_maps was the same size as 'partitions' when
non-NULL. The map arrays being a different size than the partitions
array caused the following two cases to segfault. I've included both
cases as they were caused by two separate bugs.

-- case 1
drop table listp;
create table listp (a int, b int) partition by list (a);
create table listp1 partition of listp for values in (1);
create table listp2 partition of listp for values in (2);
create table listp3 partition of listp for values in (3);
create table listp4 partition of listp for values in (4);
create table listp5 partition of listp for values in (5);
create table listp6 partition of listp for values in (6);
create table listp7 partition of listp for values in (7);
create table listp8 partition of listp for values in (8);
create table listp9 (b int, a int);

alter table listp attach partition listp9 for values in(9);

insert into listp select generate_series(1,9);

-- case 2
drop table listp;
create table listp (a int, b int) partition by list (a);
create table listp1 (b int, a int);

alter table listp attach partition listp1 for values in(1);

create table listp1 partition of listp for values in (1);
create table listp2 partition of listp for values in (2);
create table listp3 partition of listp for values in (3);
create table listp4 partition of listp for values in (4);
create table listp5 partition of listp for values in (5);
create table listp6 partition of listp for values in (6);
create table listp7 partition of listp for values in (7);
create table listp8 partition of listp for values in (8);
create table listp9 partition of listp for values in (9);

insert into listp select generate_series(1,9);

3) Got rid of ExecUseUpdateResultRelForRouting. I started to change
this to remove references to UPDATE in order to make it more friendly
towards other possible future node types that it would get used for
(aka MERGE). In the end, I found that performance could regress in
cases like:

drop table listp;
create table listp (a int) partition by list(a);
\o /dev/null
\timing off
select 'create table listp'||x::Text||' partition of listp for values
in('||x::Text||');' from generate_series(1,1000) x;
\gexec
\o
insert into listp select x from generate_series(1,999) x;
\timing on
update listp set a = a+1;

It's true that the performance of UPDATEs with a large number of subplans
is quite terrible in the planner today, but this code made the
performance of planning+execution a bit worse. If we get around to
fixing the inheritance planner then I think
ExecUseUpdateResultRelForRouting() could easily appear in profiles.

I ended up rewriting it to just get called once and build a hash table
keyed by Oid storing a ResultRelInfo pointer.  This also gets rid of the
slow nested loop in the cleanup operation inside
ExecCleanupTupleRouting().

4) Did some tuning work in ExecFindPartition() getting rid of a
redundant check after the loop completion. Also added some likely()
and unlikely() decorations around some conditions.

5) Updated some newly out-dated comments since your patch in execPartition.h.

6) Replaced the palloc0() in ExecSetupPartitionTupleRouting() with a
palloc(), updating the few fields that were not initialised. This might
save a few TPS (at least once we get rid of the all-partition locking)
in the single-row INSERT case, but I've not tested the performance of
this yet.

7) Also moved and edited some comments above
ExecSetupPartitionTupleRouting() that I felt explained a little too
much about some internal implementation details.

One thing that I thought of, but didn't do was just having
ExecFindPartition() return the ResultRelInfo. I think it would be much
nicer in both call sites to not have to check the ->partitions array
to get that.  The copy.c call site would need a few modifications
around the detection code to see if the partition has changed, but it
all looks quite possible to change. I left this for now as I have
another patch which touches all that code that I feel is closer to
commit than this patch is.

I've attached a delta of the changes I made since your v2 delta and
also a complete updated patch.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment: v2-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch (74K)
Attachment: v2_insert_speedups_delta.patch (45K)

Re: Speeding up INSERTs and UPDATEs to partitioned tables

Tom Lane-2
David Rowley <[hidden email]> writes:
> 1) Got rid of PARTITION_ROUTING_MAXSIZE. The code using this was
> useless since the int would have wrapped long before it reached
> UINT_MAX. There's no shortage of other code doubling the size of an
> array by multiplying it by 2 unconditionally without considering
> overflowing an int. Unsure why you considered this more risky.

As long as you're re-palloc'ing the array each time, and not increasing
its size more than 2X, this is perfectly safe because of the 1GB size
limit on palloc requests.  You'll fail because of that in the iteration
where the request is between 1GB and 2GB, just before integer overflow
can occur.

(Yes, this is intentional.)
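
Worked through with concrete numbers, as a standalone sketch (MAX_ALLOC
stands in for palloc's just-under-1GB MaxAllocSize; this is not backend
code):

#include <stdio.h>
#include <limits.h>

#define MAX_ALLOC 0x3fffffffLL  /* stand-in for palloc's request limit */

int
main(void)
{
    int         nelems = 8;     /* array of 8-byte pointers, doubled as needed */
    long long   elem_size = 8;

    for (;;)
    {
        long long   request = (long long) nelems * 2 * elem_size;

        if (request > MAX_ALLOC)
        {
            /*
             * repalloc would error out on this request, long before
             * nelems * 2 could overflow int.
             */
            printf("request of %lld bytes rejected at nelems = %d (INT_MAX = %d)\n",
                   request, nelems, INT_MAX);
            break;
        }
        nelems *= 2;
    }
    return 0;
}

The doubling stops with a failed ~1GB request while nelems is still far
below INT_MAX.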

                        regards, tom lane


Re: Speeding up INSERTs and UPDATEs to partitioned tables

David Rowley-3
In reply to this post by David Rowley-3
On 27 July 2018 at 04:19, David Rowley <[hidden email]> wrote:
> I've attached a delta of the changes I made since your v2 delta and
> also a complete updated patch.

I did a very quick performance test of this patch on an AWS m5d.large
instance with fsync=off.

The test setup is the same as described in my initial email on this thread.

The test compares the performance of INSERTs into a partitioned table
with 10k partitions compared to a non-partitioned table.

Patched with v2 patch on master@39d51fe87

-- partitioned
$ pgbench -n -T 60 -f partbench_insert.sql postgres
transaction type: partbench_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 1063764
latency average = 0.056 ms
tps = 17729.375930 (including connections establishing)
tps = 17729.855215 (excluding connections establishing)

-- non-partitioned
$ pgbench -n -T 60 -f partbench__insert.sql postgres
transaction type: partbench__insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 1147273
latency average = 0.052 ms
tps = 19121.194366 (including connections establishing)
tps = 19121.695469 (excluding connections establishing)

Here we're within 92% of the non-partitioned performance.

Looking back at the first email in this thread where I tested the v1
patch, we were within 82% with:

-- partitioned
tps = 11001.602377 (excluding connections establishing)

-- non-partitioned
tps = 13354.656163 (excluding connections establishing)

Again, same as with the v1 test, the v2 test was done with the locking
of all partitions removed with:

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index d7b18f52ed..6223c62094 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -80,9 +80,6 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
  PartitionTupleRouting *proute;
  ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;

- /* Lock all the partitions. */
- (void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
-
  /*
  * Here we attempt to expend as little effort as possible in setting up
  * the PartitionTupleRouting.  Each partition's ResultRelInfo is built
@@ -442,7 +439,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
  * We locked all the partitions in ExecSetupPartitionTupleRouting
  * including the leaf partitions.
  */
- partrel = heap_open(partoid, NoLock);
+ partrel = heap_open(partoid, RowExclusiveLock);

  /*
  * Keep ResultRelInfo and other information for this partition in the

Again, the reduced locking is not meant for commit as part of this patch.
Changing the locking will require a discussion on its own thread.

And just for fun, the unpatched performance on the partitioned table:

ubuntu@ip-10-0-0-33:~$ pgbench -n -T 60 -f partbench_insert.sql postgres
transaction type: partbench_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 5751
latency average = 10.434 ms
tps = 95.836052 (including connections establishing)
tps = 95.838490 (excluding connections establishing)

(185x increase)

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Speeding up INSERTs and UPDATEs to partitioned tables

Amit Langote-2
In reply to this post by David Rowley-3
On 2018/07/27 1:19, David Rowley wrote:

> On 18 July 2018 at 20:29, Amit Langote <[hidden email]> wrote:
>> Let me know what you think of the code in the updated patch.
>
> Thanks for sending the updated patch.
>
> I looked over it tonight and made a number of changes:
>
> 1) Got rid of PARTITION_ROUTING_MAXSIZE. The code using this was
> useless since the int would have wrapped long before it reached
> UINT_MAX.
Oops, you're right.

> There's no shortage of other code doubling the size of an
> array by multiplying it by 2 unconditionally without considering
> overflowing an int. Unsure why you considered this more risky.

Just ill-informed paranoia on my part.  Let's just drop it as you say,
given also Tom's comment that repalloc would fail anyway for requests
over 1GB.

> 2) Fixed a series of bugs regarding the size of the arrays in
> PartitionTupleRouting. The map arrays and the partitions array could
> differ in size despite your comment that claimed
> child_parent_tupconv_maps was the same size as 'partitions' when
> non-NULL. The map arrays being a different size than the partitions
> array caused the following two cases to segfault. I've included two
> cases as it was two seperate bugs that caused them.
>
> -- case 1

[ .... ]

> -- case 2

Indeed, there were some holes in the logic that led me to come up with
that code.

> 3) Got rid of ExecUseUpdateResultRelForRouting. I started to change
> this to remove references to UPDATE in order to make it more friendly
> towards other possible future node types that it would get used for
> (aka MERGE). In the end, I found that performance could regress when
> in cases like:
>
> drop table listp;
> create table listp (a int) partition by list(a);
> \o /dev/null
> \timing off
> select 'create table listp'||x::Text||' partition of listp for values
> in('||x::Text||');' from generate_series(1,1000) x;
> \gexec
> \o
> insert into listp select x from generate_series(1,999) x;
> \timing on
> update listp set a = a+1;
>
> It's true that the planner's performance for UPDATEs with a large number
> of subplans is quite terrible today, but this code made the
> performance of planning+execution a bit worse. If we get around to
> fixing the inheritance planner then I think
> ExecUseUpdateResultRelForRouting() could easily appear in profiles.
>
> I ended up rewriting it to just get called once and build a hash table
> by Oid storing a ResultRelInfo pointer.  This also gets rid of the
> slow nested loop in the cleanup operation inside
> ExecCleanupTupleRouting().
OK, looks neat, although I'd name the hash table subplan_resultrel_hash
(like join_rel_hash in PlannerInfo), instead of subplan_partition_table.

> 4) Did some tuning work in ExecFindPartition() getting rid of a
> redundant check after the loop completion. Also added some likely()
> and unlikely() decorations around some conditions.

All changes seem good.

> 5) Updated some comments in execPartition.h that had become outdated with your patch.
>
> 6) Replaced the palloc0() in ExecSetupPartitionTupleRouting() with a
> palloc() updating the few fields that were not initialised. This might
> save a few TPS (at least once we get rid of the all partition locking)
> in the single-row INSERT case, but I've not tested the performance of
> this yet.
>
> 7) Also moved and edited some comments above
> ExecSetupPartitionTupleRouting() that I felt explained a little too
> much about some internal implementation details.
Thanks, changes look good.

> One thing that I thought of, but didn't do was just having
> ExecFindPartition() return the ResultRelInfo. I think it would be much
> nicer in both call sites to not have to check the ->partitions array
> to get that.  The copy.c call site would need a few modifications
> around the detection code to see if the partition has changed, but it
> all looks quite possible to change. I left this for now as I have
> another patch which touches all that code that I feel is closer to
> commit than this patch is.

I had wondered about that too, but gave up on doing anything about it
because the callers of ExecFindPartition want to access other fields of
PartitionTupleRouting using the returned index.  Maybe we could make it
return a ResultRelInfo * and also return the index itself using a separate
output argument.  Seems like a cosmetic improvement that can be made later.
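
For what it's worth, the interface I'm imagining would look something like
this (just a sketch based on the current declaration, not something I've
actually tried):

/*
 * Returns the leaf partition's ResultRelInfo and, via *partidx, its index
 * in proute->partitions (illustrative prototype only).
 */
extern ResultRelInfo *ExecFindPartition(ResultRelInfo *resultRelInfo,
                                        PartitionDispatch *pd,
                                        TupleTableSlot *slot,
                                        EState *estate,
                                        int *partidx);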

> I've attached a delta of the changes I made since your v2 delta and
> also a complete updated patch.

Thanks.  Here are some other minor comments on the complete v2 patch.

-            tuple = ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[leaf_part_index],
+            tuple = ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps ?
+                                              proute->parent_child_tupconv_maps[leaf_part_index] :
+                                              NULL,

This piece of code that's present in both ExecPrepareTupleRouting and
CopyFrom can be written as:

if (proute->parent_child_tupconv_maps)
    tuple = ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[partidx],
                                      tuple,
                                      proute->partition_tuple_slot,
                                      &slot);

+    /*
+     * If UPDATE needs to do tuple routing, we'll need a slot that will
+     * transiently store the tuple being routed using the root parent's
+     * rowtype.  We must set up at least this slot, because it's needed even
+     * before tuple routing begins.  Other necessary information is
+     * initialized when  tuple routing code calls
+     * ExecUseUpdateResultRelForRouting.
+     */
     if (node && node->operation == CMD_UPDATE)

This comment needs to be updated, because you changed the if block's body to:

+        ExecHashSubPlanResultRelsByOid(mtstate, proute);
         proute->root_tuple_slot = MakeTupleTableSlot(NULL);

So, we don't just set up the slot here, we also now set up the hash table
to store sub-plan result rels.  Also, ExecUseUpdateResultRelForRouting no
longer exists.
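
Maybe something along these lines for the comment (wording is only a
suggestion):

    /*
     * If this is an UPDATE that requires tuple routing, build a hash table
     * of the subplan result rels, keyed by table OID, so that tuple routing
     * can reuse their ResultRelInfos, and set up a slot to transiently
     * store tuples being routed using the root parent's rowtype.
     */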

+            /*
+             * Get the index for PartitionTupleRouting->partitions array index
+             * for this leaf partition.  This may require building a new
+             * ResultRelInfo.
+             */


1st sentence reads a bit strange.  Did you mean to write the following?

            /*
             * Get this leaf partition's index in the
             * PartitionTupleRouting->partitions array.
             * This may require building a new ResultRelInfo.
             */

The following block of code could use a one-line comment describing what's
going on (although, what's going on might be pretty clear to some eyes
just by looking at the code):

            else
            {
                if (proute->subplan_partition_table)
                {
                    ResultRelInfo *rri;
                    Oid         partoid = partdesc->oids[partidx];

                    rri = hash_search(proute->subplan_partition_table,
                                      &partoid, HASH_FIND, NULL);
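
For example, a one-liner along these lines might be enough (wording
entirely up to you):

                    /* Reuse the ResultRelInfo if this partition is also one of the subplan result rels. */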

 /*
+ * ExecInitPartitionDispatchInfo
+ *      Initialize PartitionDispatch for a partitioned table
+ *
+ * This also stores it in the proute->partition_dispatch_info array at the
+ * specified index ('dispatchidx'), possibly expanding the array if there
+ * isn't space left in it.
+ */

You renamed 'dispatchidx' to 'partidx' in the function's signature but
forgot to update this comment.

I've attached a delta patch to make the above changes.  I'm leaving the
hash table rename up to you though.

Thanks
Amit

v2-delta.patch (5K)

Re: Speeding up INSERTs and UPDATEs to partitioned tables

David Rowley-3
On 27 July 2018 at 19:11, Amit Langote <[hidden email]> wrote:
> I've attached a delta patch to make the above changes.  I'm leaving the
> hash table rename up to you though.

Thanks for the delta patch. I took all of it, just rewrote a comment slightly.

I also renamed the hash table to your suggestion and changed a few more things.

Attached a delta based on v2 and the full v3 patch.

This includes another small change to make
PartitionDispatchData->indexes an array that's allocated in the same
memory as the PartitionDispatchData. This will save a palloc() call
and also should be a bit more cache friendly.
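
In case it helps to visualise, the struct ends up shaped roughly like this
(simplified, with the field list from memory rather than copied from the
patch):

typedef struct PartitionDispatchData
{
    Relation        reldesc;
    PartitionKey    key;
    List           *keystate;
    PartitionDesc   partdesc;
    TupleTableSlot *tupslot;
    TupleConversionMap *tupmap;
    int             indexes[FLEXIBLE_ARRAY_MEMBER]; /* no longer a separate palloc */
} PartitionDispatchData;

    /* ...and the single allocation covering both parts: */
    pd = (PartitionDispatch)
        palloc(offsetof(PartitionDispatchData, indexes) +
               sizeof(int) * partdesc->nparts);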

This also required a rebase on master again due to 3e32109049.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

v2-delta2.patch (26K)
v3-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch (76K)

Re: Speeding up INSERTs and UPDATEs to partitioned tables

Amit Langote-2
On 2018/07/28 10:54, David Rowley wrote:

> On 27 July 2018 at 19:11, Amit Langote <[hidden email]> wrote:
>> I've attached a delta patch to make the above changes.  I'm leaving the
>> hash table rename up to you though.
>
> Thanks for the delta patch. I took all of it, just rewrote a comment slightly.
>
> I also renamed the hash table to your suggestion and changed a few more things.
>
> Attached a delta based on v2 and the full v3 patch.
>
> This includes another small change to make
> PartitionDispatchData->indexes an array that's allocated in the same
> memory as the PartitionDispatchData. This will save a palloc() call
> and also should be a bit more cache friendly.
>
> This also required a rebase on master again due to 3e32109049.
Thanks for the updated patch.

I couldn't find much to complain about in the latest v3, except I noticed
a few instances of the word "setup" where I think what's really meant is
"set up".

+     * must be setup, but any sub-partitioned tables can be setup lazily as

+                 * A ResultRelInfo has not been setup for this partition yet,


By the way, when going over the updated code, I noticed that the code
around child_parent_tupconv_maps could use some refactoring too.
Especially, I noticed that ExecSetupChildParentMapForLeaf() allocates
child-to-parent map array needed for transition tuple capture even if not
needed by any of the leaf partitions.  I'm attaching here a patch that
applies on top of your v3 to show what I'm thinking we could do.
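
The gist of it, in simplified form (this is a sketch of the idea, not
lifted verbatim from the attached patch, and the names are illustrative):

    /* Build the child-to-parent map only if the descriptors differ. */
    map = convert_tuples_by_name(RelationGetDescr(partrel),
                                 RelationGetDescr(rootrel),
                                 gettext_noop("could not convert row type"));
    if (map != NULL)
    {
        /* Allocate the array lazily, on first use. */
        if (proute->child_parent_tupconv_maps == NULL)
            proute->child_parent_tupconv_maps = (TupleConversionMap **)
                palloc0(sizeof(TupleConversionMap *) * proute->num_partitions);
        proute->child_parent_tupconv_maps[partidx] = map;
    }

That way we don't pay for the array at all unless at least one leaf
partition actually needs a conversion map.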

Thanks,
Amit

0002-Some-refactoring-around-child_parent_tupconv_maps.patch (12K)