Small performance tweak to run-time partition pruning

Small performance tweak to run-time partition pruning

David Rowley-3
While reviewing some other patches to improve partitioning performance
I noticed that one of the loops in ExecFindInitialMatchingSubPlans()
could be coded a bit more efficiently.  The current code loops over
all the original subplans checking whether each subplan has been newly
pruned; if it has, the code sets the corresponding new_subplan_indexes
array element to -1, otherwise it assigns the new subplan index.  This
can be done more efficiently if we make the array 1-based, initialise
the whole thing to 0, and then loop over only the non-pruned subplans
instead of all of them. Pruning all but 1 subplan is quite common.
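In plain C, the re-sequencing idea sketches out roughly like this (a simplified illustration with assumed array sizes and helper names, not the actual PostgreSQL code):

```c
#include <string.h>
#include <assert.h>

/*
 * Simplified sketch of the re-sequencing described above.  The old
 * approach looped over all nplans original subplans, writing -1 for
 * pruned entries.  The new approach makes the map 1-based, zeroes the
 * whole thing in one memset (0 now means "pruned"), and loops over
 * only the surviving subplans -- often just one.
 */
static void
remap_subplans(int nplans,
               const int *surviving, int nsurviving,
               int *new_subplan_indexes)    /* size nplans + 1, 1-based */
{
    memset(new_subplan_indexes, 0, sizeof(int) * (nplans + 1));

    /* Each survivor gets its new, gapless position (also 1-based). */
    for (int i = 0; i < nsurviving; i++)
        new_subplan_indexes[surviving[i] + 1] = i + 1;
}
```

With 300 partitions and one survivor, this touches 1 element plus one memset rather than iterating 300 times.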

In profiles, I'd seen ExecFindInitialMatchingSubPlans() consume about
5.2% of CPU time. With the patch, that dropped to 0.72%.

A quick test with just 300 partitions shows about a 2.3% performance
improvement. Hardly groundbreaking, but it seems like a small enough
change for it to be worth it.

The test was conducted as follows:

postgresql.conf:
plan_cache_mode = 'force_generic_plan'
max_parallel_workers_per_gather = 0

setup:
CREATE TABLE partbench (id BIGINT NOT NULL, i1 INT NOT NULL, i2 INT
NOT NULL, i3 INT NOT NULL, i4 INT NOT NULL, i5 INT NOT NULL) PARTITION
BY RANGE (id);
\o /dev/null
select 'CREATE TABLE partbench' || x::text || ' PARTITION OF partbench
FOR VALUES FROM (' || (x*100000)::text || ') TO (' ||
((x+1)*100000)::text || ');' from generate_Series(0,299) x;
\gexec
\o

select.sql:
\set p_id 29999999
select * from partbench where id = :p_id;

Test:
$ pgbench -n -f select.sql -M prepared -T 60 postgres

Unpatched:
tps = 6946.940678 (excluding connections establishing)
tps = 6913.993655 (excluding connections establishing)
tps = 6854.693214 (excluding connections establishing)

Patched:
tps = 7066.854267 (excluding connections establishing)
tps = 7082.890458 (excluding connections establishing)
tps = 7052.255429 (excluding connections establishing)

Patch attached. I'll park this here until the November 'fest.

I've also included an additional test to ensure that other_subplans
gets updated correctly. The other tests for this seem to perform
run-time pruning only during init plan and do no further pruning, so
they don't fully test that other_subplans gets updated correctly.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

v1-0001-Improve-performance-of-run-time-partition-pruning.patch (9K)

Re: Small performance tweak to run-time partition pruning

David Rowley-3
On 7 September 2018 at 19:29, David Rowley <[hidden email]> wrote:
> While reviewing some other patches to improve partitioning performance
> I noticed that one of the loops in ExecFindInitialMatchingSubPlans()
> could be coded a bit more efficiently.  The current code loops over
> all the original subplans checking whether each subplan has been newly
> pruned; if it has, the code sets the corresponding new_subplan_indexes
> array element to -1, otherwise it assigns the new subplan index.  This
> can be done more efficiently if we make the array 1-based, initialise
> the whole thing to 0, and then loop over only the non-pruned subplans
> instead of all of them. Pruning all but 1 subplan is quite common.

I was looking at this again and I realised that we can completely skip
the re-sequence of the subplan map when we're not going to perform any
further pruning during execution. We possibly could also not make a
copy of the subplan_map in this case at all in
ExecCreatePartitionPruneState(), and just take the planner's copy
verbatim as we do for the subpart_map.  I was just unable to see any
performance gains from doing this, so I've just left it for now.
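The shortcut described here amounts to a simple condition (a hedged sketch; the field and function names below are illustrative, not the real executor API):

```c
#include <stdbool.h>

/*
 * Hypothetical sketch of the skip logic described above.  The subplan
 * map only needs re-sequencing when subplans were pruned at init time
 * AND further pruning will run during execution; in every other case
 * the possibly-stale map is never consulted again, so the work can be
 * skipped entirely.
 */
static bool
need_resequence(bool pruned_at_init, bool do_exec_prune)
{
    return pruned_at_init && do_exec_prune;
}
```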

Currently, this improves performance by about 2% with prepared queries
and 300 partitions.

Patched:

tps = 5169.169452 (excluding connections establishing)
tps = 5155.914286 (excluding connections establishing)

Unpatched:
tps = 5059.511370 (excluding connections establishing)
tps = 5082.851062 (excluding connections establishing)

However with other patches to remove partitioning bottlenecks in the
executor, the TPS goes to about 25,000, so 2% becomes 10%, which seems
more meaningful.

I've attached an updated patch which skips the re-sequence work when
doing that is not required for anything.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

v2-0001-Improve-performance-of-run-time-partition-pruning.patch (11K)

RE: Small performance tweak to run-time partition pruning

Imai, Yoshikazu
Hi, David.
Thanks for the patch!

On Mon, Oct 8, 2018 at 1:00 AM, David Rowley wrote:
> I was looking at this again and I realised that we can completely skip
> the re-sequence of the subplan map when we're not going to perform any
> further pruning during execution.

I checked codes and I think so too.

To confirm my understanding, and for the information of others:

The subplan map is used when we call ExecFindInitialMatchingSubPlans or
ExecFindMatchingSubPlans.
ExecFindInitialMatchingSubPlans is called by ExecInit(Merge)Append.
ExecFindMatchingSubPlans is called by the four functions below, which
run after ExecInit(Merge)Append, and only when as_valid_subplans (or
ms_valid_subplans) is NULL.

1. choose_next_subplan_locally(AppendState *node)
2. choose_next_subplan_for_leader(AppendState *node)
3. choose_next_subplan_for_worker(AppendState *node)
4. ExecMergeAppend(PlanState *pstate)

as_valid_subplans (or ms_valid_subplans) is set in ExecInit(Merge)Append
when no pruning during execution is required.

That's why we can completely skip the re-sequence of the subplan map
when we're not going to perform any further pruning during execution.
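The lazy computation just described can be sketched like so (illustrative struct and function names only, not the real executor code):

```c
#include <stdlib.h>

/*
 * Illustrative sketch (not the real executor structs) of the pattern
 * described above: the valid-subplan set is filled in at init time
 * when no exec-time pruning is needed; otherwise it stays NULL and is
 * computed on first access by the choose_next_subplan_* /
 * ExecMergeAppend callers.
 */
typedef struct DemoAppendState
{
    int *valid_subplans;    /* NULL means "not computed yet" */
    int  nvalid;
} DemoAppendState;

static int *
demo_find_matching_subplans(DemoAppendState *node)
{
    /* stand-in for ExecFindMatchingSubPlans: pretend subplan 2 matches */
    node->valid_subplans = malloc(sizeof(int));
    node->valid_subplans[0] = 2;
    node->nvalid = 1;
    return node->valid_subplans;
}

static int *
demo_choose_next_subplan(DemoAppendState *node)
{
    /* compute the set lazily, only when exec-time pruning left it NULL */
    if (node->valid_subplans == NULL)
        node->valid_subplans = demo_find_matching_subplans(node);
    return node->valid_subplans;
}
```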


> I've attached an updated patch which skips the re-sequence work when
> doing that is not required for anything.

I looked at the patch and the code seems good to me.
I haven't yet checked the additional test cases in the patch.


BTW, while looking at the code, I thought there might be a further
improvement possible in ExecInitAppend, described below.

        j = i = 0;
        firstvalid = nplans;
        foreach(lc, node->appendplans)
        {
                if (bms_is_member(i, validsubplans))
                {
                        Plan       *initNode = (Plan *) lfirst(lc);

                        /*
                         * Record the lowest appendplans index which is a valid partial
                         * plan.
                         */
                        if (i >= node->first_partial_plan && j < firstvalid)
                                firstvalid = j;

                        appendplanstates[j++] = ExecInitNode(initNode, estate, eflags);
                }
                i++;
        }

If the number of valid subplans is small compared to node->appendplans,
it is wasteful to check bms_is_member(i, validsubplans) for every element
of node->appendplans.
Of course, node->appendplans is a List, and we have to walk it to find
each valid plan, so we can't simply use a while (bms_next_member(...)) loop.
I don't have a good idea for this yet, but can we improve it?


--
Yoshikazu Imai

Reply | Threaded
Open this post in threaded view
|

Re: Small performance tweak to run-time partition pruning

David Rowley-3
On 9 October 2018 at 14:23, Imai, Yoshikazu
<[hidden email]> wrote:

> I checked codes and I think so too.
>
> Confirmation of my understanding and For more information to others:
>
> The subplan map is used when we call ExecFindInitialMatchingSubPlans or
> ExecFindMatchingSubPlans.
> ExecFindInitialMatchingSubPlans is called by ExecInit(Merge)Append.
> ExecFindMatchingSubPlans is called by the four functions below, which
> run after ExecInit(Merge)Append, and only when as_valid_subplans (or
> ms_valid_subplans) is NULL.

Thanks for looking at this.

Yeah, so subplan_map is just an array that stores the List index of
the Append or MergeAppend's subplans. When we perform run-time pruning
during executor initialisation, if we prune away some of these
subplans, then we don't initialise those pruned subplans at all, which
results in missing executor nodes for those plans. Instead of
maintaining an array with a bunch of NULLs in it to represent the
pruned subnodes, for performance reasons, we make a gapless array and
store them in there. This means that for the run-time pruning we may
do during actual execution (ExecFindMatchingSubPlans), the old
subplan_map would be out of date, as it would contain the indexes of
the subplans as if we'd not pruned any.  We can simply not bother
adjusting the subplan_map if no further pruning is required. This
could leave the map pointing to subplans that don't exist, but we only
need to care about that when the map is actually going to be used for
something. The good news is, we know in advance whether the map will
be used again.
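The gapless-array behaviour described above can be illustrated with a small standalone sketch (assumed names and shapes, not the actual executor code):

```c
#include <stdbool.h>

/*
 * Illustration of why subplan_map goes stale after init-time pruning.
 * Survivors are stored gaplessly rather than in an array with NULL
 * holes, so the planner's indexes no longer line up and must be
 * remapped -- but only if the map will be consulted again for
 * exec-time pruning.
 */
#define NPLANS 5

static int
init_surviving_subplans(const bool *valid,  /* [NPLANS] init-prune result */
                        int *states,        /* gapless: new idx -> old idx */
                        int *subplan_map)   /* old idx -> new idx, or -1 */
{
    int nstates = 0;

    for (int i = 0; i < NPLANS; i++)
    {
        if (valid[i])
        {
            subplan_map[i] = nstates;
            states[nstates++] = i;  /* stand-in for ExecInitNode() result */
        }
        else
            subplan_map[i] = -1;    /* pruned: no executor node exists */
    }
    return nstates;
}
```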

> I looked at the patch and the code seems good to me.
> I haven't yet checked the additional test cases in the patch.
>
>
> BTW, while looking at the code, I thought there might be a further
> improvement possible in ExecInitAppend, described below.
>
>         j = i = 0;
>         firstvalid = nplans;
>         foreach(lc, node->appendplans)
>         {
>                 if (bms_is_member(i, validsubplans))
>                 {
>                         Plan       *initNode = (Plan *) lfirst(lc);
>
>                         /*
>                          * Record the lowest appendplans index which is a valid partial
>                          * plan.
>                          */
>                         if (i >= node->first_partial_plan && j < firstvalid)
>                                 firstvalid = j;
>
>                         appendplanstates[j++] = ExecInitNode(initNode, estate, eflags);
>                 }
>                 i++;
>         }
>
> If the number of valid subplans is small compared to node->appendplans,
> it is wasteful to check bms_is_member(i, validsubplans) for every element
> of node->appendplans.
> Of course, node->appendplans is a List, and we have to walk it to find
> each valid plan, so we can't simply use a while (bms_next_member(...)) loop.
> I don't have a good idea for this yet, but can we improve it?

I do have other ideas for that, though I'm not ready to share code
yet. It basically requires reimplementing List to use arrays as the
underlying representation. This would allow the bms_next_member() type
of loop, and list_nth() would be O(1) instead of O(N).
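The array-backed List idea might look roughly like this (an assumption about the design, not the eventual implementation):

```c
/*
 * Sketch of an array-backed List.  With a contiguous array underneath,
 * list_nth becomes a direct index instead of an O(N) walk of cons
 * cells, which combines naturally with a bms_next_member()-style loop
 * over only the surviving members.
 */
typedef struct ArrayList
{
    int  nelems;
    int *elems;
} ArrayList;

static int
array_list_nth(const ArrayList *list, int n)
{
    return list->elems[n];  /* O(1) regardless of n */
}
```

Notably, fetching the second-to-last element then costs the same as fetching the first, which addresses the "this month's partition" case discussed below.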

It might be possible to squeeze a bit more performance out of this
code with the current List implementation, but it might actually slow
performance in some cases, for example, if the only surviving
partition was one of the last ones in the List.  Getting the final
element with list_nth() is optimized, but consider a time-based, say
monthly, RANGE partitioned table: a DBA might maintain a partition
ready for the next month of data, so it might be very common to access
the second-to-last element in the list for all queries looking at
"this month's" data. In that case, list_nth(), in its current form, is
as slow as can be.  You'd also need to do a bms_num_members() or
bms_get_singleton_member() in order to decide if the alternative
method is going to be any faster. That call is not going to be free.

It might also be possible to form the loop so that it calls
bms_next_member() then store the result and loop until we reach that
number. That would only save the bms_is_member call per loop, but I'd
much rather do the array idea as it should also speed up plenty of
other things.
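The loop shape being described could be sketched as follows (the next_member() helper is a stand-in for bms_next_member(), and all names are hypothetical):

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Sketch of the loop shape described above.  Instead of testing
 * membership for every list element, fetch the next set member up
 * front and walk the list until that position is reached, saving one
 * membership test per element.
 */
typedef struct Cell { int value; struct Cell *next; } Cell;

/* stand-in for bms_next_member(): next true index after prev, or -1 */
static int
next_member(const bool *set, int nbits, int prev)
{
    for (int i = prev + 1; i < nbits; i++)
        if (set[i])
            return i;
    return -1;
}

static int
init_valid_members(Cell *head, const bool *valid, int nbits,
                   int *out)            /* collects "initialized" values */
{
    int nout = 0;
    int target = next_member(valid, nbits, -1);
    int i = 0;

    for (Cell *lc = head; lc != NULL && target >= 0; lc = lc->next, i++)
    {
        if (i == target)
        {
            out[nout++] = lc->value;    /* stand-in for ExecInitNode() */
            target = next_member(valid, nbits, target);
        }
    }
    return nout;
}
```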

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Reply | Threaded
Open this post in threaded view
|

RE: Small performance tweak to run-time partition pruning

Imai, Yoshikazu
On Tue, Oct 9, 2018 at 2:02 AM, David Rowley wrote:

> Yeah, so subplan_map is just an array that stores the List index of
> the Append or MergeAppend's subplans. When we perform run-time pruning
> during executor initialisation, if we prune away some of these
> subplans then we don't initialise those pruned subplans at all which
> results in missing executor nodes for those plans. Instead of
> maintaining an array to store these with a bunch of NULLs in them to
> represent the pruned subnodes, for performance reasons, we make a
> gapless array and store them in there. This means that for the
> run-time pruning that we may do during actual execution
> (ExecFindMatchingSubPlans), the old subplan_map would be out of date,
> as it would contain the indexes of the subplans as if we'd not pruned
> any.  We can simply not bother adjusting the subplan_map if no further
> pruning is required. This could leave the map pointing to subplans
> that don't exist, but we only need to care about that when the map is
> actually going to be used for something. The good news is, we know in
> advance if the map will be used again.

Thanks for the detailed explanation! I understand it fully now.

> > I looked at the patch and the code seems good to me.
> > I haven't yet checked the additional test cases in the patch.
> >
> >
> > BTW, while looking at the code, I thought there might be a further
> > improvement possible in ExecInitAppend, described below.
> >
> >         j = i = 0;
> >         firstvalid = nplans;
> >         foreach(lc, node->appendplans)
> >         {
> >                 if (bms_is_member(i, validsubplans))
> >                 {
> >                         Plan       *initNode = (Plan *) lfirst(lc);
> >
> >                         /*
> >                          * Record the lowest appendplans index which is a valid partial
> >                          * plan.
> >                          */
> >                         if (i >= node->first_partial_plan && j < firstvalid)
> >                                 firstvalid = j;
> >
> >                         appendplanstates[j++] = ExecInitNode(initNode, estate, eflags);
> >                 }
> >                 i++;
> >         }
> >
> > If the number of valid subplans is small compared to node->appendplans,
> > it is wasteful to check bms_is_member(i, validsubplans) for every element
> > of node->appendplans.
> > Of course, node->appendplans is a List, and we have to walk it to find
> > each valid plan, so we can't simply use a while (bms_next_member(...)) loop.
> > I don't have a good idea for this yet, but can we improve it?
>
>
>
> I do have other ideas for that but not ready to share code yet, but it
> basically requires reimplementing List to use arrays as their
> underlying implementation. This will allow the bms_next_member() type
> loop and list_nth() will be O(1) instead of O(N).

Great!
So there might be a small memory cost, but we get a large performance improvement.

> It might be possible to squeeze a bit more performance out of this
> code with the current List implementation, but it might actually slow
> performance in some cases, for example, if the only surviving
> partition was one of the last ones in the List.  Getting the final
> element with list_nth() is optimized, but consider a time-based, say
> monthly, RANGE partitioned table: a DBA might maintain a partition
> ready for the next month of data, so it might be very common to access
> the second-to-last element in the list for all queries looking at
> "this month's" data. In that case, list_nth(), in its current form, is
> as slow as can be.

I understand that accessing the second-to-last element is slow in the
current List implementation, but I don't know whether your new
implementation would also be slow in that case.
I may have misunderstood the "squeeze ~ out of ~ with ~" sentence.


> You'd also need to do a bms_num_members() or
> bms_get_singleton_member(), in order to decide if the alternative
> method is going to be any faster. That call is not going to be free.

Yeah, we would need bms_num_members() or bms_get_singleton_member() to
check whether the number of valid subplans is sufficiently smaller than
the total number of append plans, in order to decide whether the
alternative method would be faster.

> It might also be possible to form the loop so that it calls
> bms_next_member() then store the result and loop until we reach that
> number. That would only save the bms_is_member call per loop, but I'd
> much rather do the array idea as it should also speed up plenty of
> other things.

Actually, it is possible to refactor that loop with the method you
described, but it would be complex and hacky code just to save the
wasted iterations.

So I also prefer your array idea.

--
Yoshikazu Imai


RE: Small performance tweak to run-time partition pruning

Imai, Yoshikazu
In reply to this post by Imai, Yoshikazu
On Tue, Oct 9, 2018 at 1:24 AM, I wrote:
> On Mon, Oct 8, 2018 at 1:00 AM, David Rowley wrote:
> > I've attached an updated patch which skips the re-sequence work when
> > doing that is not required for anything.
>
>
>
> I saw the patch and it seems good to me about the codes.
> I still couldn't check additional test cases in the patch.

I checked an additional test which is:

On Thu, Sept 6, 2018 at 7:30 PM, David Rowley wrote:
> I've also included an additional test to ensure the other_subplans
> gets updated correctly. The other tests for this seem to only perform
> run-time pruning during init plan and do no further pruning, so don't
> fully test that other_subplans gets updated correctly.

I executed the SQL in this test under gdb and confirmed that it tests
that other_subplans gets updated correctly. The test performs exec-time
run-time pruning and actually goes through the code in the patch that
updates other_subplans.

I also ran "check-world" on the latest master (e9edc1ba0b) and all
tests passed successfully.

It seems to me that there is no problem with this patch so far.
Is there anything else I need to do for the review?


--
Yoshikazu Imai

Re: Small performance tweak to run-time partition pruning

David Rowley-3
On 11 October 2018 at 16:00, Imai, Yoshikazu
<[hidden email]> wrote:

> On Thu, Sept 6, 2018 at 7:30 PM, David Rowley wrote:
>> I've also included an additional test to ensure the other_subplans
>> gets updated correctly. The other tests for this seem to only perform
>> run-time pruning during init plan and do no further pruning, so don't
>> fully test that other_subplans gets updated correctly.
>
> I executed the SQL in this test under gdb and confirmed that it tests
> that other_subplans gets updated correctly. The test performs exec-time
> run-time pruning and actually goes through the code in the patch that
> updates other_subplans.
>
> I also ran "check-world" on the latest master (e9edc1ba0b) and all
> tests passed successfully.

Many thanks for checking this in detail.

> It seems to me that there is no problem with this patch so far.
> Is there anything else I need to do for the review?

There's a checklist in [1]. Perhaps there's something mentioned there
that you've missed.

[1] https://wiki.postgresql.org/wiki/Reviewing_a_Patch

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


RE: Small performance tweak to run-time partition pruning

Imai, Yoshikazu
Sorry for the delay in replying.

On Wed, Oct 10, 2018 at 4:05 PM, David Rowley wrote:
> > It seems to me that there is no problem with this patch so far.
> > Is there anything else I need to do for the review?
>
> There's a checklist in [1]. Perhaps there's something mentioned there
> that you've missed.
>
> [1] https://wiki.postgresql.org/wiki/Reviewing_a_Patch

Thanks for the URL.


I ran a performance test which is almost the same as yours, except that
I changed select.sql to "\set p_id 1" in preparation for another
performance test whose results I will send in my next mail. That
probably doesn't affect the performance, so it should be OK.

I tested with master(28d750c) + v2patch.

[Unpatched]
ave 3669 TPS

tps = 3700.319242 (excluding connections establishing)
tps = 3642.287089 (excluding connections establishing)
tps = 3668.243399 (excluding connections establishing)
tps = 3689.457722 (excluding connections establishing)
tps = 3714.309178 (excluding connections establishing)
tps = 3697.488958 (excluding connections establishing)
tps = 3573.372327 (excluding connections establishing)
tps = 3620.473191 (excluding connections establishing)
tps = 3689.794860 (excluding connections establishing)
tps = 3692.317099 (excluding connections establishing)

[Patched]
ave 3718 TPS

tps = 3751.639616 (excluding connections establishing)
tps = 3736.482071 (excluding connections establishing)
tps = 3747.613223 (excluding connections establishing)
tps = 3745.578446 (excluding connections establishing)
tps = 3662.612013 (excluding connections establishing)
tps = 3715.271028 (excluding connections establishing)
tps = 3718.671552 (excluding connections establishing)
tps = 3698.766946 (excluding connections establishing)
tps = 3639.026099 (excluding connections establishing)
tps = 3760.507508 (excluding connections establishing)

The patch improves performance by about 1.3%, which is less than
David's result, but it still seems to be an improvement.


The performance test above doesn't perform exec-time run-time pruning,
so it can't exercise the loop code the patch modified. I therefore also
ran the test below, which does perform exec-time pruning.

setup:
CREATE TABLE partbench (id BIGINT NOT NULL, i1 INT NOT NULL, i2 INT
NOT NULL, i3 INT NOT NULL, i4 INT NOT NULL, i5 INT NOT NULL) PARTITION
BY RANGE (id);

\o /dev/null
select 'CREATE TABLE partbench' || x::text || ' PARTITION OF partbench
FOR VALUES FROM (' || (x*100000)::text || ') TO (' ||
((x+1)*100000)::text || ') PARTITION BY RANGE (i1);'
from generate_Series(0,299) x;
\gexec
\o

\o /dev/null
select 'CREATE TABLE partbench' || x::text ||
'_i1a PARTITION OF partbench' || x::text ||
' FOR VALUES FROM (0) TO (1);' from generate_Series(0,299) x;
\gexec
\o

\o /dev/null
select 'CREATE TABLE partbench' || x::text ||
'_i1b PARTITION OF partbench' || x::text ||
' FOR VALUES FROM (1) TO (2);' from generate_Series(0,299) x;
\gexec
\o


select-exec-init.sql:
\set p_id 1
select * from partbench where id = :p_id and i1 = (select 0);


[Unpatched]
ave  1060 TPS

tps = 1067.286500 (excluding connections establishing)
tps = 1052.705136 (excluding connections establishing)
tps = 1056.684966 (excluding connections establishing)
tps = 1059.803865 (excluding connections establishing)
tps = 1053.418776 (excluding connections establishing)
tps = 1053.383518 (excluding connections establishing)
tps = 1058.542617 (excluding connections establishing)
tps = 1071.875455 (excluding connections establishing)
tps = 1058.064092 (excluding connections establishing)
tps = 1066.869393 (excluding connections establishing)

[Patched]
ave 1069 TPS

tps = 1071.621247 (excluding connections establishing)
tps = 1067.881709 (excluding connections establishing)
tps = 1073.274357 (excluding connections establishing)
tps = 1058.648528 (excluding connections establishing)
tps = 1068.490598 (excluding connections establishing)
tps = 1064.739885 (excluding connections establishing)
tps = 1069.189778 (excluding connections establishing)
tps = 1070.253092 (excluding connections establishing)
tps = 1070.395411 (excluding connections establishing)
tps = 1071.003647 (excluding connections establishing)

The patch also improves performance by about 0.85% in this case.

--
Yoshikazu Imai


Re: Small performance tweak to run-time partition pruning

David Rowley-3
On 18 October 2018 at 16:13, Imai, Yoshikazu
<[hidden email]> wrote:
> The patch improves performance by about 1.3%, which is less than
> David's result, but it still seems to be an improvement.

Thanks for doing these benchmarks.

The speedup is small, but it becomes much more significant once other
bottlenecks are removed. More partitions may show a larger increase,
but more partitions also means that a larger range table array gets
built during ExecInitRangeTable(), which is also slow.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


RE: Small performance tweak to run-time partition pruning

Imai, Yoshikazu
On Mon, Oct 22, 2018 at 8:31 PM, David Rowley wrote:

> On 18 October 2018 at 16:13, Imai, Yoshikazu
> <[hidden email]> wrote:
> > The patch improves performance by about 1.3%, which is less than
> > David's result, but it still seems to be an improvement.
>
> Thanks for doing these benchmarks.
>
> The speedup is small, but it becomes much more significant once other
> bottlenecks are removed. More partitions may show a larger increase, but
> more partitions also means that a larger range table array gets built
> during ExecInitRangeTable(), which is also slow.

Since I've now done sufficient testing and ISTM the patch is in good
shape, I changed the state of this patch to "Ready for Committer".

--
Yoshikazu Imai


Re: Small performance tweak to run-time partition pruning

Tom Lane-2
"Imai, Yoshikazu" <[hidden email]> writes:
> I changed the state of this patch to "Ready for Committer".

I pushed this with a bit of extra tweaking --- notably, I added an
Assert to ExecFindMatchingSubPlans to guard against the possibility
that someone calls it when the remapping hasn't been done.

I think that the main win here is the recognition that we only need
remapping when both do_initial_prune and do_exec_prune are true.
I suspect that's an unusual case, so that we'll usually get to skip
this code altogether.

I am not really convinced that the original idea of replacing the
loop-over-subplans with a bms_next_member() loop is a good one.
Yeah, it wins in the best case where there are a lot of subplans
and initial pruning got rid of nearly all of them --- but how often
is that true, given the context that both do_initial_prune and
do_exec_prune are needed?  And does it still win if most of the
subplans *didn't* get pruned here?  bms_next_member() is not cheap
compared to an increment-and-compare.  I am distressed by the fact
that your performance testing seems to have examined only this
best case.  IMO performance testing should usually worry more about
the possible downside in the worst case.

Rather than arguing about that, though, I think we should focus
on something else: I'd be a lot happier if this remapping logic
just disappeared completely.  I doubt it's a performance issue
now that we run it only when initial and exec pruning are both
needed.  But it seems like a source of complexity and bugs, most
especially the latter now that common cases won't run it.  Why
is it that we can't fix things so that ExecFindMatchingSubPlans
just uses the original planner output data?

                        regards, tom lane


Re: Small performance tweak to run-time partition pruning

David Rowley-3
On Fri, 16 Nov 2018 at 07:50, Tom Lane <[hidden email]> wrote:
> I pushed this with a bit of extra tweaking --- notably, I added an
> Assert to ExecFindMatchingSubPlans to guard against the possibility
> that someone calls it when the remapping hasn't been done.

Thanks for pushing.

> I am not really convinced that the original idea of replacing the
> loop-over-subplans with a bms_next_member() loop is a good one.
> Yeah, it wins in the best case where there are a lot of subplans
> and initial pruning got rid of nearly all of them --- but how often
> is that true, given the context that both do_initial_prune and
> do_exec_prune are needed?  And does it still win if most of the
> subplans *didn't* get pruned here?  bms_next_member() is not cheap
> compared to an increment-and-compare.  I am distressed by the fact
> that your performance testing seems to have examined only this
> best case.  IMO performance testing should usually worry more about
> the possible downside in the worst case.

You are correct that a for loop over a Bitmapset with all members set
up to N, testing bms_is_member(), is faster than a while loop over
bms_next_member(); however, I don't think the case where many
partitions remain after the initial prune is an interesting one to
focus on. The additional overhead of doing init plan on a large number
of partitions is most likely to more than drown out anything we can do
to improve the performance of ExecFindInitialMatchingSubPlans().  As
an example, I tried on master with:

create table listp (b bool, z int) partition by list(b);
create table listp_false partition of listp for values in(false);
create table listp_true partition of listp for values in(true)
partition by list (z);
select 'create table listp_true'||x::Text || ' partition of listp_true
for values in(' || x::text || ');' from generate_series(1,1000) x;
\gexec

Having changed postgresql.conf:

plan_cache_mode = force_generic_plan
max_parallel_workers_per_gather = 0

I ran this through pgbench:

\set p_b true
prepare q1 (bool) as select * from listp where b = :p_b and z = (select 1);

tps = 230.317940 (excluding connections establishing)

With my original test case, I reported about a 4-microsecond per query
performance improvement from this patch. That does not really account
for very much in a query that takes well over 4000 microseconds to
complete. I've not benchmarked after having reverted the patch as I
don't think it would be easy to measure something that only takes 0.1%
of the total query time.

If we could have pruned all those other partitions during executor startup, aka:

\set p_b true
\set p_z 1
select * from listp where b = :p_b and z = :p_z

then we'd get something more like:

tps = 1952.976045 (excluding connections establishing)

The saving here is not having to init/shutdown nearly as many scan
nodes. Any additional overhead in ExecFindInitialMatchingSubPlans() is
going to be noticed much more for a fast query such as this one, and
in this case, where most or all but one partition was pruned, the
bms_next_member() loop will be significantly faster.

> Rather than arguing about that, though, I think we should focus
> on something else: I'd be a lot happier if this remapping logic
> just disappeared completely.  I doubt it's a performance issue
> now that we run it only when initial and exec pruning are both
> needed.  But it seems like a source of complexity and bugs, most
> especially the latter now that common cases won't run it.  Why
> is it that we can't fix things so that ExecFindMatchingSubPlans
> just uses the original planner output data?

I understand you'd rather not have this code, but I believe it's worth
doing things this way for the added performance improvements we get
from not having to allocate a large empty array to store the
initialized subnodes. I wrote a quick patch to get rid of the re-map
code and just palloc0 an array for all subnodes, initializing only the
ones that were not pruned.  Testing this against master and the
patched version with a partitioned table with 10k hash partitions I
get:

$ pgbench -n -f bench.sql -T 60 -M prepared postgres

-- Don't re-map
tps = 95.459176 (excluding connections establishing)
tps = 96.320146 (excluding connections establishing)

-- master
tps = 102.019177 (excluding connections establishing)
tps = 104.052788 (excluding connections establishing)

So there's a non-zero performance penalty if we remove the re-map code.
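The rejected alternative can be illustrated with a standalone sketch (simplified, assumed shapes; not the actual quick patch):

```c
#include <stdlib.h>
#include <stdbool.h>

/*
 * Sketch of the alternative discussed above: calloc (palloc0 in the
 * real code) an entry for every planned subnode and initialize only
 * the survivors in place.  Planner subplan indexes then stay valid
 * with no remapping, but the executor pays for a large zeroed array
 * full of NULL holes when most subplans are pruned.
 */
typedef struct DemoNode { int plan_id; } DemoNode;

static DemoNode **
init_with_holes(int nplans, const bool *valid)
{
    DemoNode **states = calloc(nplans, sizeof(DemoNode *));

    for (int i = 0; i < nplans; i++)
    {
        if (valid[i])
        {
            states[i] = malloc(sizeof(DemoNode));   /* ExecInitNode() stand-in */
            states[i]->plan_id = i;
        }
        /* pruned subplans remain NULL; indexes keep their planner values */
    }
    return states;
}
```

With 10k hash partitions and one survivor, the zeroed 10k-pointer array is the cost that shows up as the TPS difference above.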

As I see it, we've still got lots to improve in the executor to speed
this up further: having to populate the large 10k-element range table
array in ExecInitRangeTable(); running ExecCheckRTPerms() on a range
table that requires no permission checks for any of the partitions;
and locking each partition in AcquireExecutorLocks() instead of
delaying the locking of partitions until after pruning takes place.
I have WIP patches locally to resolve each of these, which takes
performance to about 900 tps. The overhead of the palloc0() on the
almost entirely empty Append subnode array then becomes a much larger
percentage.

Perhaps a compromise would be to somehow move the re-map code into
partprune.c so as to keep all that logic in one place.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services