Delay locking partitions during query execution


Delay locking partitions during query execution

David Rowley-3
Over on [1] I'm proposing to delay locking partitions of a partitioned
table that's the target of an INSERT or UPDATE command until we first
route a tuple to the partition.   Currently, we go and lock all
partitions, even if we just insert a single tuple to a single
partition. The patch in [1] improves the situation when there are many
partitions and only a few tuples to route to just a few partitions.

Over here and along similar lines to the above, but this time I'd like
to take this even further and change things so we don't lock *any*
partitions during AcquireExecutorLocks() and instead just lock them
when we first access them with ExecGetRangeTableRelation().  This
improves the situation when many partitions get run-time pruned, as
we'll never bother locking those at all since we'll never call
ExecGetRangeTableRelation() on them. We'll of course still lock the
partitioned table so that plan invalidation works correctly.

This does make the locking order less well defined, but I'm already
proposing similar in [1] and over there I've mentioned that I can't
quite see any huge issues with doing that.  We already don't lock all
partitions inside AcquireExecutorLocks() during INSERT with VALUES
anyway.

The attached patch implements this delayed locking.  A quick benchmark
shows a pretty good performance improvement when there are a large
number of partitions and run-time pruning prunes almost all of them.

Setup:
create table hashp (a int) partition by hash(a);
select 'create table hashp'|| x::text || ' partition of hashp for
values with (modulus 10000, remainder ' || x ::text || ');' from
generate_Series(0,9999) x;
\gexec
insert into hashp select generate_Series(1,1000000);

bench.sql:
\set p_a 13315
select * from hashp where a = :p_a;

Master: 10000 parts

$ pgbench -n -f bench.sql -M prepared -T 60 postgres
tps = 108.882749 (excluding connections establishing)
tps = 108.245437 (excluding connections establishing)

delaylock: 10000 parts

$ pgbench -n -f bench.sql -M prepared -T 60 postgres
tps = 1068.289505 (excluding connections establishing)
tps = 1092.797015 (excluding connections establishing)

More could be done to make this quite a bit faster again, but that
mostly requires the range table coming directly from the planner as an
array and marking which array elements require locking with a
Bitmapset. This'll save having to loop over the entire large array
that mostly does not need anything locked. Something similar can be done
for ExecCheckRTPerms(), but that's also for another day. With those
changes and some further tweaks done on the Append/MergeAppend code,
tps is about 22k on my machine, which is just slightly slower than
with an equivalent non-partitioned table.

I'll add the attached patch to the January commitfest.

[1] https://www.postgresql.org/message-id/CAKJS1f-%3DFnMqmQP6qitkD%2BxEddxw22ySLP-0xFk3JAqUX2yfMw%40mail.gmail.com

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment: v1-0001-Allow-lock-acquisitions-for-partitions-to-be-dela.patch (17K)

Re: Delay locking partitions during query execution

Tomas Vondra-4
On 12/3/18 12:42 PM, David Rowley wrote:

> ...
>
> Master: 10000 parts
>
> $ pgbench -n -f bench.sql -M prepared -T 60 postgres
> tps = 108.882749 (excluding connections establishing)
> tps = 108.245437 (excluding connections establishing)
>
> delaylock: 10000 parts
>
> $ pgbench -n -f bench.sql -M prepared -T 60 postgres
> tps = 1068.289505 (excluding connections establishing)
> tps = 1092.797015 (excluding connections establishing)
>
I'm a bit confused, because I can't reproduce any such speedup. I've
used the attached script that varies the number of partitions (which
worked quite nicely in the INSERT thread), but I'm getting results like
this:

    partitions      0       100     1000   10000
    --------------------------------------------
    master         49      1214      186      11
    patched        53      1225      187      11

So I don't see any significant speedup, for some reason :-(

Before I start digging into this, is there something that needs to be
done to enable it?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments: select.sql (82 bytes), run-select.sh (1K)

Re: Delay locking partitions during query execution

David Rowley-3
On Fri, 4 Jan 2019 at 02:40, Tomas Vondra <[hidden email]> wrote:

> I'm a bit confused, because I can't reproduce any such speedup. I've
> used the attached script that varies the number of partitions (which
> worked quite nicely in the INSERT thread), but I'm getting results like
> this:
>
>     partitions      0       100     1000   10000
>     --------------------------------------------
>     master         49      1214      186      11
>     patched        53      1225      187      11
>
> So I don't see any significant speedup, for some reason :-(
>
> Before I start digging into this, is there something that needs to be
> done to enable it?

Thanks for looking at this.

One thing I seem to quite often forget to mention is that I was running with:

plan_cache_mode = force_generic_plan
max_parallel_workers_per_gather = 0;

Without changing plan_cache_mode, the planner would likely never
favour a generic plan, since the generic plan will not appear to be
very efficient due to run-time partition pruning not being considered
in the costing.

Also, with a generic plan, the planner will likely want to build a
parallel plan since it sees up to 10k partitions that need to be
scanned. Setting max_parallel_workers_per_gather = 0 puts that right.

(Ideally, the planner would cost run-time pruning, but it's not quite
so simple for RANGE partitions with non-equality operators. Likely
we'll want to fix that one day, but that's not for here)

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Delay locking partitions during query execution

Tomas Vondra-4
On 1/3/19 10:50 PM, David Rowley wrote:

> On Fri, 4 Jan 2019 at 02:40, Tomas Vondra <[hidden email]> wrote:
>> I'm a bit confused, because I can't reproduce any such speedup. I've
>> used the attached script that varies the number of partitions (which
>> worked quite nicely in the INSERT thread), but I'm getting results like
>> this:
>>
>>     partitions      0       100     1000   10000
>>     --------------------------------------------
>>     master         49      1214      186      11
>>     patched        53      1225      187      11
>>
>> So I don't see any significant speedup, for some reason :-(
>>
>> Before I start digging into this, is there something that needs to be
>> done to enable it?
>
> Thanks for looking at this.
>
> One thing I seem to quite often forget to mention is that I was running with:
>
> plan_cache_mode = force_generic_plan
> max_parallel_workers_per_gather = 0;
>
> Without changing plan_cache_mode then the planner would likely never
> favour a generic plan since it will not appear to be very efficient
> due to the lack of consideration to the costing of run-time partition
> pruning.
>
> Also, then with a generic plan, the planner will likely want to build
> a parallel plan since it sees up to 10k partitions that need to be
> scanned. max_parallel_workers_per_gather = 0 puts it right.
>
> (Ideally, the planner would cost run-time pruning, but it's not quite
> so simple for RANGE partitions with non-equality operators. Likely
> we'll want to fix that one day, but that's not for here)
>

Nope, that doesn't seem to make any difference :-( In all cases the
resulting plan (with 10k partitions) looks like this:

test=# explain analyze select * from hashp where a = 13442;

                              QUERY PLAN
-----------------------------------------------------------------------
 Append  (cost=0.00..41.94 rows=13 width=4)
         (actual time=0.018..0.018 rows=0 loops=1)
   ->  Seq Scan on hashp6784 (cost=0.00..41.88 rows=13 width=4)
                             (actual time=0.017..0.018 rows=0 loops=1)
         Filter: (a = 13442)
 Planning Time: 75.870 ms
 Execution Time: 0.471 ms
(5 rows)

and it doesn't change (neither the timings nor the plan shape) no
matter how I set any of the GUCs.

Furthermore, I've repeatedly run into this issue:

test=# \d hashp
ERROR:  unrecognized token: "false"
LINE 2: ...catalog.array_to_string(array(select rolname from pg_catalog...
                                                             ^
I have no idea why it breaks like this, and it's somewhat random (i.e.
not readily reproducible). But I've only ever seen it with this patch
applied.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Delay locking partitions during query execution

David Rowley-3
On Fri, 4 Jan 2019 at 11:48, Tomas Vondra <[hidden email]> wrote:

> Nope, that doesn't seem to make any difference :-( In all cases the
> resulting plan (with 10k partitions) looks like this:
>
> test=# explain analyze select * from hashp where a = 13442;
>
>                               QUERY PLAN
> -----------------------------------------------------------------------
>  Append  (cost=0.00..41.94 rows=13 width=4)
>          (actual time=0.018..0.018 rows=0 loops=1)
>    ->  Seq Scan on hashp6784 (cost=0.00..41.88 rows=13 width=4)
>                              (actual time=0.017..0.018 rows=0 loops=1)
>          Filter: (a = 13442)
>  Planning Time: 75.870 ms
>  Execution Time: 0.471 ms
> (5 rows)
>
> and it doesn't change (the timings on shape) no matter how I set any of
> the GUCs.

For this to work, run-time pruning needs to take place, so it must be
a PREPAREd statement.

With my test I used:

bench.sql:
\set p_a 13315
select * from hashp where a = :p_a;

$ pgbench -n -f bench.sql -M prepared -T 60 postgres

You'll know you're getting a generic plan when you see "Filter (a =
$1)" and see "Subplans Removed: 9999" below the Append.

> Furthermore, I've repeatedly ran into this issue:
>
> test=# \d hashp
> ERROR:  unrecognized token: "false"
> LINE 2: ...catalog.array_to_string(array(select rolname from pg_catalog...
>                                                              ^
> I have no idea why it breaks like this, and it's somewhat random (i.e.
> not readily reproducible). But I've only ever seen it with this patch
> applied.

You'll probably need to initdb with the patch applied as there's a new
field in RangeTblEntry. If there's a serialised one of these stored in
the catalogue somewhere then the new read function will have
issues reading the old serialised format.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Delay locking partitions during query execution

Tomas Vondra-4


On 1/3/19 11:57 PM, David Rowley wrote:

> On Fri, 4 Jan 2019 at 11:48, Tomas Vondra <[hidden email]> wrote:
>> Nope, that doesn't seem to make any difference :-( In all cases the
>> resulting plan (with 10k partitions) looks like this:
>>
>> test=# explain analyze select * from hashp where a = 13442;
>>
>>                               QUERY PLAN
>> -----------------------------------------------------------------------
>>  Append  (cost=0.00..41.94 rows=13 width=4)
>>          (actual time=0.018..0.018 rows=0 loops=1)
>>    ->  Seq Scan on hashp6784 (cost=0.00..41.88 rows=13 width=4)
>>                              (actual time=0.017..0.018 rows=0 loops=1)
>>          Filter: (a = 13442)
>>  Planning Time: 75.870 ms
>>  Execution Time: 0.471 ms
>> (5 rows)
>>
>> and it doesn't change (the timings on shape) no matter how I set any of
>> the GUCs.
>
> For this to work, run-time pruning needs to take place, so it must be
> a PREPAREd statement.
>
> With my test I used:
>
> bench.sql:
> \set p_a 13315
> select * from hashp where a = :p_a;
>
> $ pgbench -n -f bench.sql -M prepared -T 60 postgres
>
> You'll know you're getting a generic plan when you see "Filter (a =
> $1)" and see "Subplans Removed: 9999" below the Append.
>

Indeed, with prepared statements I now see some improvements:

    partitions    0      100     1000    10000
    --------------------------------------------
    master       19     1590     2090      128
    patched      18     1780     6820     1130

So, that's nice. I wonder why the throughput drops so fast between 1k
and 10k partitions, but I'll look into that later.

Does this mean this optimization can only ever work with prepared
statements, or can it be made to work with regular plans too?

>> Furthermore, I've repeatedly ran into this issue:
>>
>> test=# \d hashp
>> ERROR:  unrecognized token: "false"
>> LINE 2: ...catalog.array_to_string(array(select rolname from pg_catalog...
>>                                                              ^
>> I have no idea why it breaks like this, and it's somewhat random (i.e.
>> not readily reproducible). But I've only ever seen it with this patch
>> applied.
>
> You'll probably need to initdb with the patch applied as there's a new
> field in RangeTblEntry. If there's a serialised one of these stored in
> the catalogue somewhere then the new read function will have
> issues reading the old serialised format.
>

D'oh! That explains it, because switching from/to patched binaries might
have easily been triggering the error. I've checked that there are no
changes to catalogs, but it did not occur to me adding a new RTE field
could have such consequences ...

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Delay locking partitions during query execution

David Rowley-3
On Fri, 4 Jan 2019 at 13:01, Tomas Vondra <[hidden email]> wrote:

> On 1/3/19 11:57 PM, David Rowley wrote:
> > You'll know you're getting a generic plan when you see "Filter (a =
> > $1)" and see "Subplans Removed: 9999" below the Append.
> >
>
> Indeed, with prepared statements I now see some improvements:
>
>     partitions    0      100     1000    10000
>     --------------------------------------------
>     master       19     1590     2090      128
>     patched      18     1780     6820     1130
>
> So, that's nice. I wonder why the throughput drops so fast between 1k
> and 10k partitions, but I'll look into that later.

Those look strange. Why is it so slow with the non-partitioned case?
I'd have expected that to be the fastest result.

> Does this mean this optimization can only ever work with prepared
> statements, or can it be made to work with regular plans too?

That's a good question.  I can confirm it's only of any use if
run-time pruning occurs before init plan. For this patch to be any
use, you'll need to see a "Subplans Removed: <N>" in the query's
EXPLAIN ANALYZE output.  If you don't see this then all
Append/MergeAppend subplans were initialised and the relation lock
would have been obtained regardless of whether delaylock is set for
the relation. The effect of the patch here would just have been to
obtain the lock during the first call to ExecGetRangeTableRelation()
for that relation instead of during AcquireExecutorLocks().  There may
actually be a tiny overhead in this case since AcquireExecutorLocks()
must skip the delaylock relations, but they'll get locked later
anyway. I doubt you could measure that though.

When run-time pruning is able to prune partitions before execution
starts, the optimisation is useful, since AcquireExecutorLocks() won't
obtain the lock and ExecGetRangeTableRelation() won't be called for
the pruned partitions' rels, as we don't bother to init the
Append/MergeAppend subplans for those.

I'm a little unsure if there are any cases where this type of run-time
pruning can occur when PREPAREd statements are not in use.  Initplan
parameters can't prune before executor run since we need to run the
executor to obtain the values of those. Likewise for evaluation of
volatile functions. So I think run-time pruning before initplan is
only ever going to happen for PARAM_EXTERN type parameters, i.e. with
PREPAREd statements (REF: analyze_partkey_exprs() partprune.c).
Without PREPAREd statements, if the planner itself was unable to prune
the partitions it would already have obtained the lock during
planning, so AcquireExecutorLocks(), in this case, would bump into the
local lock hash table entry and forego trying to obtain the lock
itself.  That's not free, but it's significantly faster than obtaining
a lock.

Or in short... it's only good for prepared statements where the
statement's parameters allow for run-time pruning. However, that's a
pretty large case since the planner is still very slow at planning for
large numbers of partitions, meaning it's common (or at least it will
be) for people to use PREPAREd statements and plan_cache_mode =
force_generic_plan.

> >> Furthermore, I've repeatedly ran into this issue:
> >>
> >> test=# \d hashp
> >> ERROR:  unrecognized token: "false"
> >> LINE 2: ...catalog.array_to_string(array(select rolname from pg_catalog...
> >>                                                              ^
> >> I have no idea why it breaks like this, and it's somewhat random (i.e.
> >> not readily reproducible). But I've only ever seen it with this patch
> >> applied.
> >
> > You'll probably need to initdb with the patch applied as there's a new
> > field in RangeTblEntry. If there's a serialised one of these stored in
> > the catalogue somewhere then the new read function will have
> > issues reading the old serialised format.
> >
>
> D'oh! That explains it, because switching from/to patched binaries might
> have easily been triggering the error. I've checked that there are no
> changes to catalogs, but it did not occur to me adding a new RTE field
> could have such consequences ...

schema-wise, no changes, but data-wise, there are changes.

$ pg_dump --schema=pg_catalog --data-only postgres | grep ":rellockmode" | wc -l
    121

All of which are inside the pg_rewrite table:

$ pg_dump --schema=pg_catalog --data-only --table=pg_rewrite postgres \
    | grep ":rellockmode" | wc -l
    121

I just used ":rellockmode" here as it's a field that exists in RangeTblEntry.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Delay locking partitions during query execution

Tomas Vondra-4


On 1/4/19 1:53 AM, David Rowley wrote:

> On Fri, 4 Jan 2019 at 13:01, Tomas Vondra <[hidden email]> wrote:
>> On 1/3/19 11:57 PM, David Rowley wrote:
>>> You'll know you're getting a generic plan when you see "Filter (a =
>>> $1)" and see "Subplans Removed: 9999" below the Append.
>>>
>>
>> Indeed, with prepared statements I now see some improvements:
>>
>>     partitions    0      100     1000    10000
>>     --------------------------------------------
>>     master       19     1590     2090      128
>>     patched      18     1780     6820     1130
>>
>> So, that's nice. I wonder why the throughput drops so fast between 1k
>> and 10k partitions, but I'll look into that later.
>
> Those look strange. Why is it so slow with the non-partitioned case?
> I'd have expected that to be the fastest result.
>

Because there are 1M rows in the table, and it's doing a seqscan.

That also makes the other cases difficult to compare, because with few
partitions there will be multiple pages per partition, scanned
sequentially. And with many partitions it's likely only a single page
with a couple of rows on it. I'll think about constructing a better
benchmark, to make it easier to compare - perhaps by using a single row
per table and/or adding indexes. Or something ...

>> Does this mean this optimization can only ever work with prepared
>> statements, or can it be made to work with regular plans too?
>
> That's a good question.  I can confirm it's only of any use if run-time
> pruning occurs before init plan. For this patch to be any use, you'll
> need to see a "Subplans Removed: <N>" in the query's EXPLAIN ANALYZE
> output.  If you don't see this then all Append/MergeAppend subplans
> were initialised and the relation lock would have been obtained
> regardless of if delaylock is set for the relation. The effect of the
> patch here would just have been to obtain the lock during the first
> call to ExecGetRangeTableRelation() for that relation instead of
> during AcquireExecutorLocks().  There may actually be a tiny overhead
> in this case since AcquireExecutorLocks() must skip the delaylock
> relations, but they'll get locked later anyway. I doubt you could
> measure that though.
>
> When run-time pruning is able to prune partitions before execution
> starts then the optimisation is useful since AcquireExecutorLocks()
> won't obtain the lock and ExecGetRangeTableRelation() won't be called
> for all pruned partition's rels as we don't bother to init the
> Append/MergeAppend subplan for those.
>
> I'm a little unsure if there are any cases where this type of run-time
> pruning can occur when PREPAREd statements are not in use.  Initplan
> parameters can't prune before executor run since we need to run the
> executor to obtain the values of those. Likewise for evaluation of
> volatile functions. So I think run-time pruning before initplan is
> only ever going to happen for PARAM_EXTERN type parameters, i.e. with
> PREPAREd statements (REF: analyze_partkey_exprs() partprune.c).
> Without PREPAREd statements, if the planner itself was unable to prune
> the partitions it would already have obtained the lock during
> planning, so AcquireExecutorLocks(), in this case, would bump into the
> local lock hash table entry and forego trying to obtain the lock
> itself.  That's not free, but it's significantly faster than obtaining
> a lock.
>
> Or in short... it's only good for prepared statements where the
> statement's parameters allow for run-time pruning. However, that's a
> pretty large case since the planner is still very slow at planning for
> large numbers of partitions, meaning it's common (or at least it will
> be) for people to use PREPAREd statements and plan_cache_mode =
> force_generic_plan.
>

OK, thanks for the explanation. One more reason to use prepared
statements in such cases ...

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Delay locking partitions during query execution

David Rowley-3
On Sat, 5 Jan 2019 at 03:12, Tomas Vondra <[hidden email]> wrote:

> >>
> >>     partitions    0      100     1000    10000
> >>     --------------------------------------------
> >>     master       19     1590     2090      128
> >>     patched      18     1780     6820     1130
> >>
> >> So, that's nice. I wonder why the throughput drops so fast between 1k
> >> and 10k partitions, but I'll look into that later.
> >
> > Those look strange. Why is it so slow with the non-partitioned case?
> > I'd have expected that to be the fastest result.
> >
>
> Because there are 1M rows in the table, and it's doing a seqscan.

Of course. My test did the same, but I didn't consider that because I
had so few rows per partition.  Likely just adding an index would have
made the results make more sense.


--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Delay locking partitions during query execution

David Rowley-3
In reply to this post by David Rowley-3
On Tue, 4 Dec 2018 at 00:42, David Rowley <[hidden email]> wrote:
> Over here and along similar lines to the above, but this time I'd like
> to take this even further and change things so we don't lock *any*
> partitions during AcquireExecutorLocks() and instead just lock them
> when we first access them with ExecGetRangeTableRelation().

I've attached a rebase version of this. The previous version
conflicted with some changes made in b60c397599.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment: v2-0001-Allow-lock-acquisitions-for-partitions-to-be-dela.patch (17K)

Re: Delay locking partitions during query execution

Amit Langote-2
In reply to this post by David Rowley-3
On 2019/01/04 9:53, David Rowley wrote:
> Without PREPAREd statements, if the planner itself was unable to prune
> the partitions it would already have obtained the lock during
> planning, so AcquireExecutorLocks(), in this case, would bump into the
> local lock hash table entry and forego trying to obtain the lock
> itself.  That's not free, but it's significantly faster than obtaining
> a lock.

Hmm, AcquireExecutorLocks is only called if prepared statements are
used, and even then only if a generic cached plan is reused.

GetCachedPlan -> CheckCachedPlan -> AcquireExecutorLocks

In GetCachedPlan:

    if (!customplan)
    {
        if (CheckCachedPlan(plansource))


Thanks,
Amit



Re: Delay locking partitions during query execution

David Rowley-3
On Thu, 17 Jan 2019 at 17:18, Amit Langote
<[hidden email]> wrote:

>
> On 2019/01/04 9:53, David Rowley wrote:
> > Without PREPAREd statements, if the planner itself was unable to prune
> > the partitions it would already have obtained the lock during
> > planning, so AcquireExecutorLocks(), in this case, would bump into the
> > local lock hash table entry and forego trying to obtain the lock
> > itself.  That's not free, but it's significantly faster than obtaining
> > a lock.
>
> Hmm, AcquireExecutorLocks is only called if prepared statements are used
> and that too if a generic cached plan is reused.
>
> GetCachedPlan -> CheckCachedPlan -> AcquireExecutorLocks

Ah, okay. Thanks for the correction. But either way, it's really only
useful when run-time pruning prunes partitions before initialising the
Append/MergeAppend node.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services