Small performance tweak to run-time partition pruning

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

Small performance tweak to run-time partition pruning

David Rowley-3
While reviewing some other patches to improve partitioning performance
I noticed that one of the loops in ExecFindInitialMatchingSubPlans()
could be coded a bit more efficiently.  The current code loops over
all the original subplans checking if the subplan is newly pruned, if
it is, the code sets the new_subplan_indexes array element to -1, else
it sets it assigns the new subplan index.  This can be done more
efficiently if we make this array 1-based and initialise the whole
thing to 0 then just loop over the non-pruned subplans instead of all
subplans. Pruning all but 1 subplan is quite common.

In profiles, I'd seen ExecFindInitialMatchingSubPlans() consume about
5.2% percent of CPU time. With the patch that dropped to 0.72%.

A quick test with just 300 partitions shows about a 2.3% performance
improvement. Hardly groundbreaking, but it seems like a small enough
change for it to be worth it.

The test was conducted as follows:

postgresql.conf:
plan_cache_mode = 'force_generic_plan'
max_parallel_workers_per_gather = 0

setup:
CREATE TABLE partbench (id BIGINT NOT NULL, i1 INT NOT NULL, i2 INT
NOT NULL, i3 INT NOT NULL, i4 INT NOT NULL, i5 INT NOT NULL) PARTITION
BY RANGE (id);
\o /dev/null
select 'CREATE TABLE partbench' || x::text || ' PARTITION OF partbench
FOR VALUES FROM (' || (x*100000)::text || ') TO (' ||
((x+1)*100000)::text || ');' from generate_Series(0,299) x;
\gexec
\o

select.sql:
\set p_id 29999999
select * from partbench where id = :p_id;

Test:
$ pgbench -n -f select.sql -M prepared -T 60 postgres

Unpatched:
tps = 6946.940678 (excluding connections establishing)
tps = 6913.993655 (excluding connections establishing)
tps = 6854.693214 (excluding connections establishing)

Patched
tps = 7066.854267 (excluding connections establishing)
tps = 7082.890458 (excluding connections establishing)
tps = 7052.255429 (excluding connections establishing)

Patch attached. I'll park this here until the November 'fest.

I've also included an additional test to ensure the other_subplans
gets updated correctly. The other tests for this seem to only perform
run-time pruning during init plan and do no further pruning, so don't
fully test that other_subplans gets updated correctly.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

v1-0001-Improve-performance-of-run-time-partition-pruning.patch (9K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Small performance tweak to run-time partition pruning

David Rowley-3
On 7 September 2018 at 19:29, David Rowley <[hidden email]> wrote:
> While reviewing some other patches to improve partitioning performance
> I noticed that one of the loops in ExecFindInitialMatchingSubPlans()
> could be coded a bit more efficiently.  The current code loops over
> all the original subplans checking if the subplan is newly pruned, if
> it is, the code sets the new_subplan_indexes array element to -1, else
> it sets it assigns the new subplan index.  This can be done more
> efficiently if we make this array 1-based and initialise the whole
> thing to 0 then just loop over the non-pruned subplans instead of all
> subplans. Pruning all but 1 subplan is quite common.

I was looking at this again and I realised that we can completely skip
the re-sequence of the subplan map when we're not going to perform any
further pruning during execution. We possibly could also not make a
copy of the subplan_map in this case at all in
ExecCreatePartitionPruneState(), and just take the planner's copy
verbatim as we do for the subpart_map.  I was just unable to see any
performance gains from doing this, so I've just left it for now.

Currently, this improves performance about 2% with prepared queries
and 300 partitions.

Patched:

tps = 5169.169452 (excluding connections establishing)
tps = 5155.914286 (excluding connections establishing)

Unpatched:
tps = 5059.511370 (excluding connections establishing)
tps = 5082.851062 (excluding connections establishing)

However with other patches to remove partitioning bottlenecks in the
executor, the TPS goes to about 25,000, so 2% becomes 10%, which seems
more meaningful.

I've attached an updated patch which skips the re-sequence work when
doing that is not required for anything.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

v2-0001-Improve-performance-of-run-time-partition-pruning.patch (11K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

RE: Small performance tweak to run-time partition pruning

Imai, Yoshikazu
Hi, David.
Thanks for the patch!

On Mon, Oct 8, 2018 at 1:00 AM, David Rowley wrote:
> I was looking at this again and I realised that we can completely skip
> the re-sequence of the subplan map when we're not going to perform any
> further pruning during execution.

I checked codes and I think so too.

Confirmation of my understanding and For more information to others:

The subplan map is used when we call ExecFindInitialMatchingSubPlans or
ExecFindMatchingSubPlans.
ExecFindInitialMatchingSubPlans is called by ExecInit(Merge)Append.
ExecFindmatchingSubPlans is called by below fours which is executed after
ExecInit(Merge)Append and it is called when the
as_valid_subplans(or ms_valid_subplans) is not NULL.

1. choose_next_subplan_locally(AppendState *node)
2. choose_next_subplan_for_leader(AppendState *node)
3. choose_next_subplan_for_worker(AppendState *node)
4. ExecMergeAppend(PlanState *pstate)

The as_valid_subplans(or ms_valid_subplans) is set in ExecInit(Merge)Append
if pruning during execution is not required.

That's why we can completely skip the re-sequence of the subplan map
when we're not going to perform any further pruning during execution.


> I've attached an updated patch which skips the re-sequence work when
> doing that is not required for anything.

I saw the patch and it seems good to me about the codes.
I still couldn't check additional test cases in the patch.


BTW, when I looking the codes, I thought there is further improvements in
ExecInitAppend which is described below.

        j = i = 0;
        firstvalid = nplans;
        foreach(lc, node->appendplans)
        {
                if (bms_is_member(i, validsubplans))
                {
                        Plan       *initNode = (Plan *) lfirst(lc);

                        /*
                         * Record the lowest appendplans index which is a valid partial
                         * plan.
                         */
                        if (i >= node->first_partial_plan && j < firstvalid)
                                firstvalid = j;

                        appendplanstates[j++] = ExecInitNode(initNode, estate, eflags);
                }
                i++;
        }

If number of valid subplans is few compared to node->appendplans, it is a waste to check
bms_is_member(i, validsubplans) for all node->appendplans.
Of course, node->appendplans is List struct and we have to loop it to find valid plan,
that we can't simply use while(bms_next_member(i, validsubplans)) loop.
I don't have any good idea for it now, but can we improve it?


--
Yoshikazu Imai

Reply | Threaded
Open this post in threaded view
|

Re: Small performance tweak to run-time partition pruning

David Rowley-3
On 9 October 2018 at 14:23, Imai, Yoshikazu
<[hidden email]> wrote:

> I checked codes and I think so too.
>
> Confirmation of my understanding and For more information to others:
>
> The subplan map is used when we call ExecFindInitialMatchingSubPlans or
> ExecFindMatchingSubPlans.
> ExecFindInitialMatchingSubPlans is called by ExecInit(Merge)Append.
> ExecFindmatchingSubPlans is called by below fours which is executed after
> ExecInit(Merge)Append and it is called when the
> as_valid_subplans(or ms_valid_subplans) is not NULL.

Thanks for looking at this.

Yeah, so subplan_map is just an array that stores the List index of
the Append or MergeAppend's subplans. When we perform run-time pruning
during executor initialisation, if we prune away some of these
subplans then we don't initialise those pruned subplans at all which
results in missing executor nodes for those plans. Instead of
maintaining an array to store these with a bunch of NULLs in them to
represent the pruned subnodes, for performance reasons, we make a
gapless array and store them in there. This means that for the
run-time pruning that we could do running actual execution
(ExecFindMatchingSubPlans), the old subplan_map would be out of date,
as it would contain the indexes of the subplans as if we'd not pruned
any.  We can simply not bother adjusting the subplan_map if no further
pruning is required. This could leave the map pointing to subplans
that don't exist, but we only need to care about that when the map is
actually going to be used for something. The good news is, we know in
advance if the map will be used again.

> I saw the patch and it seems good to me about the codes.
> I still couldn't check additional test cases in the patch.
>
>
> BTW, when I looking the codes, I thought there is further improvements in
> ExecInitAppend which is described below.
>
>         j = i = 0;
>         firstvalid = nplans;
>         foreach(lc, node->appendplans)
>         {
>                 if (bms_is_member(i, validsubplans))
>                 {
>                         Plan       *initNode = (Plan *) lfirst(lc);
>
>                         /*
>                          * Record the lowest appendplans index which is a valid partial
>                          * plan.
>                          */
>                         if (i >= node->first_partial_plan && j < firstvalid)
>                                 firstvalid = j;
>
>                         appendplanstates[j++] = ExecInitNode(initNode, estate, eflags);
>                 }
>                 i++;
>         }
>
> If number of valid subplans is few compared to node->appendplans, it is a waste to check
> bms_is_member(i, validsubplans) for all node->appendplans.
> Of course, node->appendplans is List struct and we have to loop it to find valid plan,
> that we can't simply use while(bms_next_member(i, validsubplans)) loop.
> I don't have any good idea for it now, but can we improve it?

I do have other ideas for that but not ready to share code yet, but it
basically requires reimplementing List to use arrays as their
underlying implementation. This will allow the bms_next_member() type
loop and list_nth() will be O(1) instead of O(N).

It might be possible to squeeze a bit more performance out of this
code with the current List implementation, but I it might actually
slow performance in some cases, for example, if the only surviving
partition was one of the last ones in the List.  Getting the final
element with list_nth() is optimized, but if you consider a
time-based, say monthly, RANGE partition, a DBA might maintain a
partition ready for the next month of data, so it might be very common
to access the 2nd last element in the list for all queries looking at
"this months" data. In that case, list_nth(), in its current form, is
as slow as can be.  You'd also need to do a bms_num_members() or
bms_get_singleton_member(), in order to decide if the alternative
method is going to be any faster. That call is not going to be free.

It might also be possible to form the loop so that it calls
bms_next_member() then store the result and loop until we reach that
number. That would only save the bms_is_member call per loop, but I'd
much rather do the array idea as it should also speed up plenty of
other things.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Reply | Threaded
Open this post in threaded view
|

RE: Small performance tweak to run-time partition pruning

Imai, Yoshikazu
On Tue, Oct 9, 2018 at 2:02 AM, David Rowley wrote:

> Yeah, so subplan_map is just an array that stores the List index of
> the Append or MergeAppend's subplans. When we perform run-time pruning
> during executor initialisation, if we prune away some of these
> subplans then we don't initialise those pruned subplans at all which
> results in missing executor nodes for those plans. Instead of
> maintaining an array to store these with a bunch of NULLs in them to
> represent the pruned subnodes, for performance reasons, we make a
> gapless array and store them in there. This means that for the
> run-time pruning that we could do running actual execution
> (ExecFindMatchingSubPlans), the old subplan_map would be out of date,
> as it would contain the indexes of the subplans as if we'd not pruned
> any.  We can simply not bother adjusting the subplan_map if no further
> pruning is required. This could leave the map pointing to subplans
> that don't exist, but we only need to care about that when the map is
> actually going to be used for something. The good news is, we know in
> advance if the map will be used again.

Thanks for the detail explanation! I got it fully.

> > I saw the patch and it seems good to me about the codes.
> > I still couldn't check additional test cases in the patch.
> >
> >
> > BTW, when I looking the codes, I thought there is further improvements in
> > ExecInitAppend which is described below.
> >
> >         j = i = 0;
> >         firstvalid = nplans;
> >         foreach(lc, node->appendplans)
> >         {
> >                 if (bms_is_member(i, validsubplans))
> >                 {
> >                         Plan       *initNode = (Plan *) lfirst(lc);
> >
> >                         /*
> >                          * Record the lowest appendplans index which is a valid partial
> >                          * plan.
> >                          */
> >                         if (i >= node->first_partial_plan && j < firstvalid)
> >                                 firstvalid = j;
> >
> >                         appendplanstates[j++] = ExecInitNode(initNode, estate, eflags);
> >                 }
> >                 i++;
> >         }
> >
> > If number of valid subplans is few compared to node->appendplans, it is a waste to check
> > bms_is_member(i, validsubplans) for all node->appendplans.
> > Of course, node->appendplans is List struct and we have to loop it to find valid plan,
> > that we can't simply use while(bms_next_member(i, validsubplans)) loop.
> > I don't have any good idea for it now, but can we improve it?
>
>
>
> I do have other ideas for that but not ready to share code yet, but it
> basically requires reimplementing List to use arrays as their
> underlying implementation. This will allow the bms_next_member() type
> loop and list_nth() will be O(1) instead of O(N).

Great!
So there might be little gain in using memory, but we get large performance improvement.

> It might be possible to squeeze a bit more performance out of this
> code with the current List implementation, but I it might actually
> slow performance in some cases, for example, if the only surviving
> partition was one of the last ones in the List.  Getting the final
> element with list_nth() is optimized, but if you consider a
> time-based, say monthly, RANGE partition, a DBA might maintain a
> partition ready for the next month of data, so it might be very common
> to access the 2nd last element in the list for all queries looking at
> "this months" data. In that case, list_nth(), in its current form, is
> as slow as can be.  

I understand accessing 2nd last element is slow in current List implementation,
but I don't know your new implementation is also slow in that case.
I don't know I might misunderstood "squeeze ~ out of ~ with ~" sentence.


> You'd also need to do a bms_num_members() or
> bms_get_singleton_member(), in order to decide if the alternative
> method is going to be any faster. That call is not going to be free.

Yeah, I need to use bms_num_members() or bms_get_singleton_member() to
check number of valid subplans is enough smaller than number of all append plans
to check alternative method is faster.

> It might also be possible to form the loop so that it calls
> bms_next_member() then store the result and loop until we reach that
> number. That would only save the bms_is_member call per loop, but I'd
> much rather do the array idea as it should also speed up plenty of
> other things.

Actually, it is possible refactor that loop with the method you described,
but it would be complex and hacky codes for only saving the wasting loop.

So, I also like your array idea.

--
Yoshikazu Imai

Reply | Threaded
Open this post in threaded view
|

RE: Small performance tweak to run-time partition pruning

Imai, Yoshikazu
In reply to this post by Imai, Yoshikazu
On Tue, Oct 9, 2018 at 1:24 AM, I wrote:
> On Mon, Oct 8, 2018 at 1:00 AM, David Rowley wrote:
> > I've attached an updated patch which skips the re-sequence work when
> > doing that is not required for anything.
>
>
>
> I saw the patch and it seems good to me about the codes.
> I still couldn't check additional test cases in the patch.

I checked an additional test which is:

On Thu, Sept 6, 2018 at 7:30 PM, David Rowley wrote:
> I've also included an additional test to ensure the other_subplans
> gets updated correctly. The other tests for this seem to only perform
> run-time pruning during init plan and do no further pruning, so don't
> fully test that other_subplans gets updated correctly.

I execute the sql in this test with gdb and confirmed that it tests
other_subplans gets updated correctly. It also performs exec run-time pruning
and actually is through the codes in the patch which update other_subplans.

I also did "check world" at the latest master e9edc1ba0b and all tests passed
successfully.

It seems to me that there is no problem in this patch as far.
Is there another thing I have to do for the review?


--
Yoshikazu Imai
Reply | Threaded
Open this post in threaded view
|

Re: Small performance tweak to run-time partition pruning

David Rowley-3
On 11 October 2018 at 16:00, Imai, Yoshikazu
<[hidden email]> wrote:

> On Thu, Sept 6, 2018 at 7:30 PM, David Rowley wrote:
>> I've also included an additional test to ensure the other_subplans
>> gets updated correctly. The other tests for this seem to only perform
>> run-time pruning during init plan and do no further pruning, so don't
>> fully test that other_subplans gets updated correctly.
>
> I execute the sql in this test with gdb and confirmed that it tests
> other_subplans gets updated correctly. It also performs exec run-time pruning
> and actually is through the codes in the patch which update other_subplans.
>
> I also did "check world" at the latest master e9edc1ba0b and all tests passed
> successfully.

Many thanks for checking this in detail.

> It seems to me that there is no problem in this patch as far.
> Is there another thing I have to do for the review?

There's a checklist in [1]. Perhaps there's something mentioned there
that you've missed.

[1] https://wiki.postgresql.org/wiki/Reviewing_a_Patch

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services