parallelize queries containing initplans

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
58 messages Options
123
Reply | Threaded
Open this post in threaded view
|

parallelize queries containing initplans

akapila
By seeing the subject line, one might wonder why we need to consider
parallelizing the queries containing initplans differently from
queries containing subplans considering that I have posted a mail to
achieve later a few hours back.  The reason is that both are treated
differently with respect to parallelism and otherwise as well and both
can be parallelized in a different way depending on the design we
choose.  InitPlans can be used in three forms (a) a Param node
representing a single scalar result (b) a row comparison tree
containing multiple Param nodes (c) NULL constant for MULTIEXPR
subquery whereas SubPlans are used as SubPlan nodes. Here, I am
primarily interested in parallelizing queries that contain InitPlans
of the form (a) and the reason is that I have seen that form used more
as compared to other forms (primarily based on a study of TPC-H and
TPC-DS workloads).  However, if we find that parallelizing other forms
can be done along with it easily, then that is excellent.  To start
with let us see the plan of TPC-H query (Q-22) and understand how it
can be improved.

Limit
   InitPlan 1 (returns $0)
     ->  Finalize Aggregate
           ->  Gather
                 Workers Planned: 2
                 ->  Partial Aggregate
                       ->  Parallel Seq Scan on customer customer_1
                             Filter: (...)
   ->  GroupAggregate
         Group Key: ("substring"((customer.c_phone)::text, 1, 2))
         ->  Sort
               Sort Key: ("substring"((customer.c_phone)::text, 1, 2))
               ->  Nested Loop Anti Join
                     ->  Seq Scan on customer
                           Filter: ((c_acctbal > $0) AND (...)))
                     ->  Index Only Scan using idx_orders_custkey on orders
                           Index Cond: (o_custkey = customer.c_custkey)


In the above plan, we can see that the join on customer and orders
table (Nested Loop Anti Join) is not parallelised even though we have
the capability to parallelize Nested Loop Joins. The reason for not
choosing the parallel plan is that one of the nodes (Seq Scan on
customer) is referring to initplan and we consider such nodes as
parallel-restricted which means they can't be parallelised.  Now, I
could see three ways of parallelizing such a query.  The first way is
that we just push parallel-safe initplans to workers and allow them to
execute it, the drawback of this approach is that it won't be able to
push initplans in cases as shown above where initplan is
parallel-unsafe (contains Gather node) and second is we will lose the
expectation of single evaluation.  The second way is that we always
execute the initplan in the master backend and pass the resultant
value to the worker, this will allow above form of plans to push
initplans to workers and hence can help in enabling parallelism for
other nodes in plan tree.   The drawback of the second approach is
that we need to evaluate the initplan before it is actually required
which means that we might evaluate it even when it is not required.  I
am not sure if it is always safe to assume that we can evaluate the
initplan before pushing it to workers especially for the cases when it
is far enough down in the plan tree which we are parallelizing,
however, I think we can assume it when the iniplan is above the plan
tree where it is used (like in the above case).  The third way is that
we allow Gather node to be executed below another Gather node, but I
think that will be bad especially for the plans like above because
each worker needs to further spawn another set of workers to evaluate
the iniplan which could be done once.  Now we can build some way such
that only one of the workers executes such an initplan and share the
values with other workers, but I think overall this requires much more
effort than first or second approach.

Among all the three approaches, first seems to be simpler than the
other two, but I feel if we just do that then we leave a lot on the
table.   Another way to accomplish this project could be that we do a
mix of first and second such that when the initplan is above the plan
tree to be parallelized, then use the second approach (one-time
evaluation by master backend and share the result with workers),
otherwise use the first approach of pushing down the initplan to
workers.

Thoughts?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: parallelize queries containing initplans

akapila
On Wed, Dec 28, 2016 at 5:20 PM, Amit Kapila <[hidden email]> wrote:

>  To start
> with let us see the plan of TPC-H query (Q-22) and understand how it
> can be improved.
>
> Limit
>    InitPlan 1 (returns $0)
>      ->  Finalize Aggregate
>            ->  Gather
>                  Workers Planned: 2
>                  ->  Partial Aggregate
>                        ->  Parallel Seq Scan on customer customer_1
>                              Filter: (...)
>    ->  GroupAggregate
>          Group Key: ("substring"((customer.c_phone)::text, 1, 2))
>          ->  Sort
>                Sort Key: ("substring"((customer.c_phone)::text, 1, 2))
>                ->  Nested Loop Anti Join
>                      ->  Seq Scan on customer
>                            Filter: ((c_acctbal > $0) AND (...)))
>                      ->  Index Only Scan using idx_orders_custkey on orders
>                            Index Cond: (o_custkey = customer.c_custkey)
>
>
> In the above plan, we can see that the join on customer and orders
> table (Nested Loop Anti Join) is not parallelised even though we have
> the capability to parallelize Nested Loop Joins. The reason for not
> choosing the parallel plan is that one of the nodes (Seq Scan on
> customer) is referring to initplan and we consider such nodes as
> parallel-restricted which means they can't be parallelised.  Now, I
> could see three ways of parallelizing such a query.  The first way is
> that we just push parallel-safe initplans to workers and allow them to
> execute it, the drawback of this approach is that it won't be able to
> push initplans in cases as shown above where initplan is
> parallel-unsafe (contains Gather node) and second is we will lose the
> expectation of single evaluation.  The second way is that we always
> execute the initplan in the master backend and pass the resultant
> value to the worker, this will allow above form of plans to push
> initplans to workers and hence can help in enabling parallelism for
> other nodes in plan tree.
>
I have used the second way to parallelize queries containing initplans
as that can help in cases where initplans in itself also uses
parallelism and it will also retain an existing expectation of single
evaluation for initplans. The basic idea as mentioned in above mail is
to evaluate the initplans at Gather node and pass the value to worker
backends which can use it as required. The patch has used
*plan->allParam* bitmapset to evaluate the initplans at Gather node
(we traverse the planstate tree to find params at each node and we
take care to avoid multiple evaluations of same initplan).  To
identify initplan params among other params in *allParams*, the patch
has added an additional bool variable (isinitplan) in ParamExecData.
We can do it in some other way as well if there is any better
suggestion.

The patch has also changed the explain output of queries where
initplan param is evaluated at Gather node. For ex.

postgres=# explain (costs off) select t1.i from t1, t2 where t1.j=t2.j
and t1.k < (select max(k) from t3) and t1.k < (select max(k) from t3);
                       QUERY PLAN
--------------------------------------------------------
 Hash Join
   Hash Cond: (t2.j = t1.j)
   InitPlan 1 (returns $0)
     ->  Finalize Aggregate
           ->  Gather
                 Workers Planned: 1
                 ->  Partial Aggregate
                       ->  Parallel Seq Scan on t3
   InitPlan 2 (returns $1)
     ->  Finalize Aggregate
           ->  Gather
                 Workers Planned: 1
                 ->  Partial Aggregate
                       ->  Parallel Seq Scan on t3 t3_1
   ->  Gather
         Workers Planned: 1
         ->  Parallel Seq Scan on t2
   ->  Hash
         ->  Gather
               Workers Planned: 1
               Params Evaluated: $0, $1
               ->  Parallel Seq Scan on t1
                     Filter: ((k < $0) AND (k < $1))
(23 rows)


In the above plan, you can notice a line (Params Evaluated: $0, $1)
which indicates the params evaluated at Gather node.  As of now,
explain just uses the *allParam* params present at the Gather node,
but we need to traverse the planstate tree as we do during execution.
This patch gives 2.5~3x performance benefit for Q-22 of TPC-H.

>   The drawback of the second approach is
> that we need to evaluate the initplan before it is actually required
> which means that we might evaluate it even when it is not required.  I
> am not sure if it is always safe to assume that we can evaluate the
> initplan before pushing it to workers especially for the cases when it
> is far enough down in the plan tree which we are parallelizing,
>

I think we can always pull up un-correlated initplans at Gather node,
however, if there is a correlated initplan, then it is better not to
allow such initplans for being pushed below gather.  Ex. of correlated
initplans:

postgres=# explain (costs off) select * from t1 where t1.i in (select
t2.i from t2 where t1.k = (select max(k) from t3 where t3.i=t1.i));
                  QUERY PLAN
----------------------------------------------
 Seq Scan on t1
   Filter: (SubPlan 2)
   SubPlan 2
     ->  Gather
           Workers Planned: 1
           Params Evaluated: $1
           InitPlan 1 (returns $1)
             ->  Aggregate
                   ->  Seq Scan on t3
                         Filter: (i = t1.i)
           ->  Result
                 One-Time Filter: (t1.k = $1)
                 ->  Parallel Seq Scan on t2
(13 rows)

It might be safe to allow above plan, but in general, such plans
should not be allowed, because it might not be feasible to compute
such initplan references at Gather node.  I am still thinking on the
best way to deal with such initplans.

Thoughts?

Thanks to Kuntal who is a co-author of this patch for doing the
investigation along with me of different plans which contain
references to initplans.


Note - This patch needs to be applied on top of subplan patches [1][2].

[1] - https://www.postgresql.org/message-id/CAA4eK1KYQjQzQMpEz%2BQRA2fmim386gQLQBEf%2Bp2Wmtqjh1rjwg%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAA4eK1LK3NjNY4ghHUOwYfBFa%2BAb2SvccTKAxTHbOdW1NhUjvQ%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

pq_pushdown_initplan_v1.patch (62K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: parallelize queries containing initplans

akapila
On Tue, Jan 31, 2017 at 4:16 PM, Amit Kapila <[hidden email]> wrote:

> On Wed, Dec 28, 2016 at 5:20 PM, Amit Kapila <[hidden email]> wrote:
>
>>   The drawback of the second approach is
>> that we need to evaluate the initplan before it is actually required
>> which means that we might evaluate it even when it is not required.  I
>> am not sure if it is always safe to assume that we can evaluate the
>> initplan before pushing it to workers especially for the cases when it
>> is far enough down in the plan tree which we are parallelizing,
>>
>
> I think we can always pull up un-correlated initplans at Gather node,
> however, if there is a correlated initplan, then it is better not to
> allow such initplans for being pushed below gather.  Ex. of correlated
> initplans:
>
> postgres=# explain (costs off) select * from t1 where t1.i in (select
> t2.i from t2 where t1.k = (select max(k) from t3 where t3.i=t1.i));
>                   QUERY PLAN
> ----------------------------------------------
>  Seq Scan on t1
>    Filter: (SubPlan 2)
>    SubPlan 2
>      ->  Gather
>            Workers Planned: 1
>            Params Evaluated: $1
>            InitPlan 1 (returns $1)
>              ->  Aggregate
>                    ->  Seq Scan on t3
>                          Filter: (i = t1.i)
>            ->  Result
>                  One-Time Filter: (t1.k = $1)
>                  ->  Parallel Seq Scan on t2
> (13 rows)
>
> It might be safe to allow above plan, but in general, such plans
> should not be allowed, because it might not be feasible to compute
> such initplan references at Gather node.  I am still thinking on the
> best way to deal with such initplans.
>
I could see two possibilities to determine whether the plan (for which
we are going to generate an initplan) contains a reference to a
correlated var param node.  One is to write a plan or path walker to
determine any such reference and the second is to keep the information
about the correlated param in path node.   I think the drawback of the
first approach is that traversing path tree during generation of
initplan can be costly, so for now I have kept the information in path
node to prohibit generating parallel initplans which contain a
reference to correlated vars. I think we can go with first approach of
using path walker if people feel that is better than maintaining a
reference in path.  Attached patch
prohibit_parallel_correl_params_v1.patch implements the second
approach of keeping the correlated var param reference in path node
and pq_pushdown_initplan_v2.patch uses that to generate parallel
initplans.

Thoughts?

These patches build on top of parallel subplan patch [1].

[1] - https://www.postgresql.org/message-id/CAA4eK1KYQjQzQMpEz+QRA2fmim386gQLQBEf+p2Wmtqjh1rjwg@...


--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

prohibit_parallel_correl_params_v1.patch (54K) Download Attachment
pq_pushdown_initplan_v2.patch (64K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: parallelize queries containing initplans

akapila
On Fri, Feb 10, 2017 at 4:34 PM, Amit Kapila <[hidden email]> wrote:

>
> I could see two possibilities to determine whether the plan (for which
> we are going to generate an initplan) contains a reference to a
> correlated var param node.  One is to write a plan or path walker to
> determine any such reference and the second is to keep the information
> about the correlated param in path node.   I think the drawback of the
> first approach is that traversing path tree during generation of
> initplan can be costly, so for now I have kept the information in path
> node to prohibit generating parallel initplans which contain a
> reference to correlated vars. I think we can go with first approach of
> using path walker if people feel that is better than maintaining a
> reference in path.  Attached patch
> prohibit_parallel_correl_params_v1.patch implements the second
> approach of keeping the correlated var param reference in path node
> and pq_pushdown_initplan_v2.patch uses that to generate parallel
> initplans.
>
Two weeks back when Robert was in Bangalore, we (myself, Kuntal and
Robert) had a discussion on this patch.   He mentioned that the idea
of pulling up initplans (uncorrelated initplans) at Gather node (and
then execute them and share the values to each worker) used in this
patch doesn't sound appealing and has a chance of bugs in some corner
cases. We discussed an idea where the first worker to access the
initplan will evaluate it and then share the value with other
participating processes, but with that, we won't be able to use
parallelism in the execution of Initplan due to the restriction of
multiple levels of Gather node.  Another idea we discussed is that we
can evaluate the Initplans at Gather node if it is used as an external
param (plan->extParam) at or below the Gather node.

Based on that idea, I have modified the patch such that it will
compute the set of initplans Params that are required below gather
node and store them as bitmap of initplan params at gather node.
During set_plan_references, we can find the intersection of external
parameters that are required at Gather or nodes below it with the
initplans that are passed from same or above query level. Once the set
of initplan params are established, we evaluate those (if they are not
already evaluated) before execution of gather node and then pass the
computed value to each of the workers.   To identify whether a
particular param is parallel safe or not, we check if the paramid of
the param exists in initplans at same or above query level.  We don't
allow to generate gather path if there are initplans at some query
level below the current query level as those plans could be
parallel-unsafe or undirect correlated plans.

This restricts some of the cases for parallelism like when initplans
are below gather node, but the patch looks better. We can open up
those cases if required in a separate patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

pq_pushdown_initplan_v3.patch (46K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: parallelize queries containing initplans

Kuntal Ghosh
On Tue, Mar 14, 2017 at 3:20 PM, Amit Kapila <[hidden email]> wrote:

> Based on that idea, I have modified the patch such that it will
> compute the set of initplans Params that are required below gather
> node and store them as bitmap of initplan params at gather node.
> During set_plan_references, we can find the intersection of external
> parameters that are required at Gather or nodes below it with the
> initplans that are passed from same or above query level. Once the set
> of initplan params are established, we evaluate those (if they are not
> already evaluated) before execution of gather node and then pass the
> computed value to each of the workers.   To identify whether a
> particular param is parallel safe or not, we check if the paramid of
> the param exists in initplans at same or above query level.  We don't
> allow to generate gather path if there are initplans at some query
> level below the current query level as those plans could be
> parallel-unsafe or undirect correlated plans.
I would like to mention different test scenarios with InitPlans that
we've considered while developing and testing of the patch.

An InitPlan is a subselect that doesn't take any reference from its
immediate outer query level and it returns a param value. For example,
consider the following query:

          QUERY PLAN
------------------------------
 Seq Scan on t1
   Filter: (k = $0)
   allParams: $0
   InitPlan 1 (returns $0)
     ->  Aggregate
           ->  Seq Scan on t3
In this case, the InitPlan is evaluated once when the filter is
checked for the first time. For subsequent checks, we need not
evaluate the initplan again since we already have the value. In our
approach, we parallelize the sequential scan by inserting a Gather
node on top of parallel sequential scan node. At the Gather node, we
evaluate the InitPlan before spawning the workers and pass this value
to the worker using dynamic shared memory. This yields the following
plan:
                    QUERY PLAN
---------------------------------------------------
 Gather
   Workers Planned: 2
   Params Evaluated: $0
   InitPlan 1 (returns $0)
       ->  Aggregate
           ->  Seq Scan on t3
   ->  Parallel Seq Scan on t1
         Filter: (k = $0)
As Amit mentioned up in the thread, at a Gather node, we evaluate only
those InitPlans that are attached to this query level or any higher
one and are used under the Gather node. extParam at a node includes
the InitPlan params that should be passed from an outer node. I've
attached a patch to show extParams and allParams for each node. Here
is the output with that patch:
                    QUERY PLAN
---------------------------------------------------
 Gather
   Workers Planned: 2
   Params Evaluated: $0
   allParams: $0
   InitPlan 1 (returns $0)
     ->  Finalize Aggregate
           ->  Gather
                 Workers Planned: 2
                 ->  Partial Aggregate
                       ->  Parallel Seq Scan on t3
   ->  Parallel Seq Scan on t1
         Filter: (k = $0)
         allParams: $0
         extParams: $0
In this case, $0 is included in extParam of parallel sequential scan
and the InitPlan corresponding to this param is attached to the same
query level that contains the Gather node. Hence, we evaluate $0 at
Gather and pass it to workers.

But, for generating a plan like this requires marking an InitPlan
param as parallel_safe. We can't mark all params as parallel_safe
because of correlated subselects. Hence, in
max_parallel_hazard_walker, the only params marked safe are InitPlan
params from current or outer query level. An InitPlan param from inner
query level isn't marked safe since we can't evaluate this param at
any Gather node above the current node(where the param is used). As
mentioned by Amit, we also don't allow generation of gather path if
there are InitPlans at some query level below the current query level
as those plans could be parallel-unsafe or undirect correlated plans.

I've attached a script file and its output containing several
scenarios relevant to InitPlans. I've also attached the patch for
displaying extParam and allParam at each node. This patch can be
applied on top of pq_pushdown_initplan_v3.patch. Please find the
attachments.


> This restricts some of the cases for parallelism like when initplans
> are below gather node, but the patch looks better. We can open up
> those cases if required in a separate patch.
+1. Unfortunately, this patch doesn't enable parallelism for all
possible cases with InitPlans. Our objective is to keep things simple
and clean. Still, TPC-H q22 runs 2.5~3 times faster with this patch.



--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

script.out (26K) Download Attachment
script.sql (3K) Download Attachment
0002-Show-extParams-and-allParams-in-explain.patch (3K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: parallelize queries containing initplans

akapila
On Thu, Mar 16, 2017 at 2:34 AM, Kuntal Ghosh
<[hidden email]> wrote:

> On Tue, Mar 14, 2017 at 3:20 PM, Amit Kapila <[hidden email]> wrote:
>> Based on that idea, I have modified the patch such that it will
>> compute the set of initplans Params that are required below gather
>> node and store them as bitmap of initplan params at gather node.
>> During set_plan_references, we can find the intersection of external
>> parameters that are required at Gather or nodes below it with the
>> initplans that are passed from same or above query level. Once the set
>> of initplan params are established, we evaluate those (if they are not
>> already evaluated) before execution of gather node and then pass the
>> computed value to each of the workers.   To identify whether a
>> particular param is parallel safe or not, we check if the paramid of
>> the param exists in initplans at same or above query level.  We don't
>> allow to generate gather path if there are initplans at some query
>> level below the current query level as those plans could be
>> parallel-unsafe or undirect correlated plans.
>
> I would like to mention different test scenarios with InitPlans that
> we've considered while developing and testing of the patch.
>

Thanks a lot Kuntal for sharing different test scenarios.
Unfortunately, this patch doesn't received any review till now, so
there is no chance of making it in to PostgreSQL-10.  I have moved
this to next CF.


--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: parallelize queries containing initplans

akapila
On Tue, Mar 28, 2017 at 7:25 AM, Amit Kapila <[hidden email]> wrote:

> On Thu, Mar 16, 2017 at 2:34 AM, Kuntal Ghosh
> <[hidden email]> wrote:
>> On Tue, Mar 14, 2017 at 3:20 PM, Amit Kapila <[hidden email]> wrote:
>>> Based on that idea, I have modified the patch such that it will
>>> compute the set of initplans Params that are required below gather
>>> node and store them as bitmap of initplan params at gather node.
>>> During set_plan_references, we can find the intersection of external
>>> parameters that are required at Gather or nodes below it with the
>>> initplans that are passed from same or above query level. Once the set
>>> of initplan params are established, we evaluate those (if they are not
>>> already evaluated) before execution of gather node and then pass the
>>> computed value to each of the workers.   To identify whether a
>>> particular param is parallel safe or not, we check if the paramid of
>>> the param exists in initplans at same or above query level.  We don't
>>> allow to generate gather path if there are initplans at some query
>>> level below the current query level as those plans could be
>>> parallel-unsafe or undirect correlated plans.
>>
>> I would like to mention different test scenarios with InitPlans that
>> we've considered while developing and testing of the patch.
>>
>
> Thanks a lot Kuntal for sharing different test scenarios.
> Unfortunately, this patch doesn't received any review till now, so
> there is no chance of making it in to PostgreSQL-10.  I have moved
> this to next CF.
>
Attached is a rebased version of the patch with below changes:
a. SubplanState now directly stores Subplan rather than ExprState, so
patch needs some adjustment in that regard.
b. While rejecting the paths (based on if there are initplans at level
below the current query level) for parallelism, the rejected paths
were not marked as parallel unsafe.  Due to this in
force_parallel_mode=regress, we were able to add gather node above
parallel unsafe paths.  The modified patch ensures to mark such paths
as parallel unsafe.
c. Added regression test.
d. Improve comments in the code.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

pq_pushdown_initplan_v4.patch (51K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: parallelize queries containing initplans

Haribabu Kommi-2


On Mon, Jul 17, 2017 at 10:53 PM, Amit Kapila <[hidden email]> wrote:
On Tue, Mar 28, 2017 at 7:25 AM, Amit Kapila <[hidden email]> wrote:
> On Thu, Mar 16, 2017 at 2:34 AM, Kuntal Ghosh
> <[hidden email]> wrote:
>> On Tue, Mar 14, 2017 at 3:20 PM, Amit Kapila <[hidden email]> wrote:
>>> Based on that idea, I have modified the patch such that it will
>>> compute the set of initplans Params that are required below gather
>>> node and store them as bitmap of initplan params at gather node.
>>> During set_plan_references, we can find the intersection of external
>>> parameters that are required at Gather or nodes below it with the
>>> initplans that are passed from same or above query level. Once the set
>>> of initplan params are established, we evaluate those (if they are not
>>> already evaluated) before execution of gather node and then pass the
>>> computed value to each of the workers.   To identify whether a
>>> particular param is parallel safe or not, we check if the paramid of
>>> the param exists in initplans at same or above query level.  We don't
>>> allow to generate gather path if there are initplans at some query
>>> level below the current query level as those plans could be
>>> parallel-unsafe or undirect correlated plans.
>>
>> I would like to mention different test scenarios with InitPlans that
>> we've considered while developing and testing of the patch.
>>
>
> Thanks a lot Kuntal for sharing different test scenarios.
> Unfortunately, this patch doesn't received any review till now, so
> there is no chance of making it in to PostgreSQL-10.  I have moved
> this to next CF.
>

Attached is a rebased version of the patch with below changes:
a. SubplanState now directly stores Subplan rather than ExprState, so
patch needs some adjustment in that regard.
b. While rejecting the paths (based on if there are initplans at level
below the current query level) for parallelism, the rejected paths
were not marked as parallel unsafe.  Due to this in
force_parallel_mode=regress, we were able to add gather node above
parallel unsafe paths.  The modified patch ensures to mark such paths
as parallel unsafe.
c. Added regression test.
d. Improve comments in the code.


I tested the latest patch and the parallel plan is getting choose for most of
the init plans.

For the following query the parallel plan is not chosen. The query contains
an init plan that refer the outer node. 

postgres=# explain analyze select * from t1 where t1.i in (select t2.i from t2 where t1.k = (select max(k) from t3 where t3.i=t1.i));
                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Seq Scan on t1  (cost=0.00..22426.28 rows=448 width=12) (actual time=8.335..132.557 rows=2 loops=1)
   Filter: (SubPlan 2)
   Rows Removed by Filter: 894
   SubPlan 2
     ->  Result  (cost=16.27..31.26 rows=999 width=4) (actual time=0.146..0.146 rows=0 loops=896)
           One-Time Filter: (t1.k = $1)
           InitPlan 1 (returns $1)
             ->  Aggregate  (cost=16.25..16.27 rows=1 width=4) (actual time=0.145..0.145 rows=1 loops=896)
                   ->  Seq Scan on t3  (cost=0.00..16.25 rows=2 width=4) (actual time=0.131..0.144 rows=0 loops=896)
                         Filter: (i = t1.i)
                         Rows Removed by Filter: 900
           ->  Seq Scan on t2  (cost=16.27..31.26 rows=999 width=4) (actual time=0.012..0.013 rows=10 loops=2)
 Planning time: 0.272 ms
 Execution time: 132.623 ms
(14 rows)

If I change the query a little bit, the Result node doesn't appear and the parallel plan
gets chosen.

postgres=# explain analyze select * from t1 where t1.i in (select t2.i from t2 where t2.k = (select max(k) from t3 where t3.i=t1.i));
                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Seq Scan on t1  (cost=0.00..19162.88 rows=448 width=12) (actual time=3501.483..3501.483 rows=0 loops=1)
   Filter: (SubPlan 2)
   Rows Removed by Filter: 896
   SubPlan 2
     ->  Gather  (cost=16.27..26.47 rows=2 width=4) (actual time=3.471..3.795 rows=0 loops=896)
           Workers Planned: 2
           Params Evaluated: $1
           Workers Launched: 2
           InitPlan 1 (returns $1)
             ->  Aggregate  (cost=16.25..16.27 rows=1 width=4) (actual time=0.161..0.161 rows=1 loops=896)
                   ->  Seq Scan on t3  (cost=0.00..16.25 rows=2 width=4) (actual time=0.144..0.156 rows=0 loops=896)
                         Filter: (i = t1.i)
                         Rows Removed by Filter: 900
           ->  Parallel Seq Scan on t2  (cost=0.00..10.20 rows=1 width=4) (actual time=0.001..0.001 rows=0 loops=804608)
                 Filter: (k = $1)
                 Rows Removed by Filter: 1
 Planning time: 0.480 ms
 Execution time: 3502.016 ms
(18 rows)

I didn't check the code why the plan is not getting chosen.
Just shared it for your reference, whether it is a known already.


Regards,
Hari Babu
Fujitsu Australia
Reply | Threaded
Open this post in threaded view
|

Re: parallelize queries containing initplans

Kuntal Ghosh
On Wed, Aug 9, 2017 at 10:24 AM, Haribabu Kommi
<[hidden email]> wrote:
>
> I tested the latest patch and the parallel plan is getting choose for most
> of
> the init plans.
>
Thanks for testing.

> For the following query the parallel plan is not chosen. The query contains
> an init plan that refer the outer node.
>
> postgres=# explain analyze select * from t1 where t1.i in (select t2.i from
> t2 where t1.k = (select max(k) from t3 where t3.i=t1.i));
>                                                      QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------
>  Seq Scan on t1  (cost=0.00..22426.28 rows=448 width=12) (actual
> time=8.335..132.557 rows=2 loops=1)
>    Filter: (SubPlan 2)
>    Rows Removed by Filter: 894
>    SubPlan 2
>      ->  Result  (cost=16.27..31.26 rows=999 width=4) (actual
> time=0.146..0.146 rows=0 loops=896)
>            One-Time Filter: (t1.k = $1)
>            InitPlan 1 (returns $1)
>              ->  Aggregate  (cost=16.25..16.27 rows=1 width=4) (actual
> time=0.145..0.145 rows=1 loops=896)
>                    ->  Seq Scan on t3  (cost=0.00..16.25 rows=2 width=4)
> (actual time=0.131..0.144 rows=0 loops=896)
>                          Filter: (i = t1.i)
>                          Rows Removed by Filter: 900
>            ->  Seq Scan on t2  (cost=16.27..31.26 rows=999 width=4) (actual
> time=0.012..0.013 rows=10 loops=2)
>  Planning time: 0.272 ms
>  Execution time: 132.623 ms
> (14 rows)
>
An observation is that the filter at Result node can't be pushed down
to the sequential scan on t2 because the filter is on t1. So, it has
to scan the complete t2 relation and send all the tuple to upper node,
a worst case for parallelism. Probably, this is the reason the
optimizer doesn't pick parallel plan for the above case.

Just for clarification, do you see any changes in the plan after
forcing parallelism(parallel_tuple_cost, parallel_setup_cost,
min_parallel_table_scan_size=0)?


--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: parallelize queries containing initplans

akapila
In reply to this post by Haribabu Kommi-2
On Wed, Aug 9, 2017 at 10:24 AM, Haribabu Kommi
<[hidden email]> wrote:

>
>
> On Mon, Jul 17, 2017 at 10:53 PM, Amit Kapila <[hidden email]>
> wrote:
>>
>> On Tue, Mar 28, 2017 at 7:25 AM, Amit Kapila <[hidden email]>
>> wrote:
>> > On Thu, Mar 16, 2017 at 2:34 AM, Kuntal Ghosh
>> > <[hidden email]> wrote:
>> >> On Tue, Mar 14, 2017 at 3:20 PM, Amit Kapila <[hidden email]>
>> >> wrote:
>> >>> Based on that idea, I have modified the patch such that it will
>> >>> compute the set of initplans Params that are required below gather
>> >>> node and store them as bitmap of initplan params at gather node.
>> >>> During set_plan_references, we can find the intersection of external
>> >>> parameters that are required at Gather or nodes below it with the
>> >>> initplans that are passed from same or above query level. Once the set
>> >>> of initplan params are established, we evaluate those (if they are not
>> >>> already evaluated) before execution of gather node and then pass the
>> >>> computed value to each of the workers.   To identify whether a
>> >>> particular param is parallel safe or not, we check if the paramid of
>> >>> the param exists in initplans at same or above query level.  We don't
>> >>> allow to generate gather path if there are initplans at some query
>> >>> level below the current query level as those plans could be
>> >>> parallel-unsafe or undirect correlated plans.
>> >>
>> >> I would like to mention different test scenarios with InitPlans that
>> >> we've considered while developing and testing of the patch.
>> >>
>> >
>> > Thanks a lot Kuntal for sharing different test scenarios.
>> > Unfortunately, this patch doesn't received any review till now, so
>> > there is no chance of making it in to PostgreSQL-10.  I have moved
>> > this to next CF.
>> >
>>
>> Attached is a rebased version of the patch with below changes:
>> a. SubplanState now directly stores Subplan rather than ExprState, so
>> patch needs some adjustment in that regard.
>> b. While rejecting the paths (based on if there are initplans at level
>> below the current query level) for parallelism, the rejected paths
>> were not marked as parallel unsafe.  Due to this in
>> force_parallel_mode=regress, we were able to add gather node above
>> parallel unsafe paths.  The modified patch ensures to mark such paths
>> as parallel unsafe.
>> c. Added regression test.
>> d. Improve comments in the code.
>>
>
> I tested the latest patch and the parallel plan is getting choose for most
> of
> the init plans.
>
Thanks for looking into this patch.

> For the following query the parallel plan is not chosen. The query contains
> an init plan that refer the outer node.
>

We don't want to generate the parallel plan for such cases.  Whenever
initplan refers to any outer node (aka correlated plan), it won't
generate a parallel plan.  Also, for t2, it doesn't choose a parallel
plan because one-time filter refers to the outer node (again
correlated plan case). Basically, till now we don't support parallel
plan for any case where the correlated plan is used.  So, it is
perfectly valid that it doesn't use parallel plan here.

> postgres=# explain analyze select * from t1 where t1.i in (select t2.i from
> t2 where t1.k = (select max(k) from t3 where t3.i=t1.i));
>                                                      QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------
>  Seq Scan on t1  (cost=0.00..22426.28 rows=448 width=12) (actual
> time=8.335..132.557 rows=2 loops=1)
>    Filter: (SubPlan 2)
>    Rows Removed by Filter: 894
>    SubPlan 2
>      ->  Result  (cost=16.27..31.26 rows=999 width=4) (actual
> time=0.146..0.146 rows=0 loops=896)
>            One-Time Filter: (t1.k = $1)
>            InitPlan 1 (returns $1)
>              ->  Aggregate  (cost=16.25..16.27 rows=1 width=4) (actual
> time=0.145..0.145 rows=1 loops=896)
>                    ->  Seq Scan on t3  (cost=0.00..16.25 rows=2 width=4)
> (actual time=0.131..0.144 rows=0 loops=896)
>                          Filter: (i = t1.i)
>                          Rows Removed by Filter: 900
>            ->  Seq Scan on t2  (cost=16.27..31.26 rows=999 width=4) (actual
> time=0.012..0.013 rows=10 loops=2)
>  Planning time: 0.272 ms
>  Execution time: 132.623 ms
> (14 rows)
>
> If I change the query a little bit, the Result node doesn't appear and the
> parallel plan
> gets chosen.
>
This is a valid case for choosing a parallel plan.

> postgres=# explain analyze select * from t1 where t1.i in (select t2.i from
> t2 where t2.k = (select max(k) from t3 where t3.i=t1.i));
>                                                        QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------------
>  Seq Scan on t1  (cost=0.00..19162.88 rows=448 width=12) (actual
> time=3501.483..3501.483 rows=0 loops=1)
>    Filter: (SubPlan 2)
>    Rows Removed by Filter: 896
>    SubPlan 2
>      ->  Gather  (cost=16.27..26.47 rows=2 width=4) (actual
> time=3.471..3.795 rows=0 loops=896)
>            Workers Planned: 2
>            Params Evaluated: $1
>            Workers Launched: 2
>            InitPlan 1 (returns $1)
>              ->  Aggregate  (cost=16.25..16.27 rows=1 width=4) (actual
> time=0.161..0.161 rows=1 loops=896)
>                    ->  Seq Scan on t3  (cost=0.00..16.25 rows=2 width=4)
> (actual time=0.144..0.156 rows=0 loops=896)
>                          Filter: (i = t1.i)
>                          Rows Removed by Filter: 900
>            ->  Parallel Seq Scan on t2  (cost=0.00..10.20 rows=1 width=4)
> (actual time=0.001..0.001 rows=0 loops=804608)
>                  Filter: (k = $1)
>                  Rows Removed by Filter: 1
>  Planning time: 0.480 ms
>  Execution time: 3502.016 ms
> (18 rows)
>
Here if you notice the parallel node t2 refers to the initplan which
can be parallelised after this patch.  Basically, whenever the
initplan is attached at or above Gather node, we compute its value and
pass down to workers.

> I didn't check the code why the plan is not getting chosen.
> Just shared it for your reference, whether it is a known already.
>

Yeah, it is known the behavior of the patch.

By the way, the patch doesn't apply on HEAD, so attached rebased patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

pq_pushdown_initplan_v5.patch (51K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: parallelize queries containing initplans

Haribabu Kommi-2
In reply to this post by Kuntal Ghosh


On Wed, Aug 9, 2017 at 8:54 PM, Kuntal Ghosh <[hidden email]> wrote:
On Wed, Aug 9, 2017 at 10:24 AM, Haribabu Kommi
<[hidden email]> wrote:
>
> I tested the latest patch and the parallel plan is getting choose for most
> of
> the init plans.
>
Thanks for testing.

> For the following query the parallel plan is not chosen. The query contains
> an init plan that refer the outer node.
>
> postgres=# explain analyze select * from t1 where t1.i in (select t2.i from
> t2 where t1.k = (select max(k) from t3 where t3.i=t1.i));
>                                                      QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------
>  Seq Scan on t1  (cost=0.00..22426.28 rows=448 width=12) (actual
> time=8.335..132.557 rows=2 loops=1)
>    Filter: (SubPlan 2)
>    Rows Removed by Filter: 894
>    SubPlan 2
>      ->  Result  (cost=16.27..31.26 rows=999 width=4) (actual
> time=0.146..0.146 rows=0 loops=896)
>            One-Time Filter: (t1.k = $1)
>            InitPlan 1 (returns $1)
>              ->  Aggregate  (cost=16.25..16.27 rows=1 width=4) (actual
> time=0.145..0.145 rows=1 loops=896)
>                    ->  Seq Scan on t3  (cost=0.00..16.25 rows=2 width=4)
> (actual time=0.131..0.144 rows=0 loops=896)
>                          Filter: (i = t1.i)
>                          Rows Removed by Filter: 900
>            ->  Seq Scan on t2  (cost=16.27..31.26 rows=999 width=4) (actual
> time=0.012..0.013 rows=10 loops=2)
>  Planning time: 0.272 ms
>  Execution time: 132.623 ms
> (14 rows)
>
An observation is that the filter at Result node can't be pushed down
to the sequential scan on t2 because the filter is on t1. So, it has
to scan the complete t2 relation and send all the tuple to upper node,
a worst case for parallelism. Probably, this is the reason the
optimizer doesn't pick parallel plan for the above case.

Just for clarification, do you see any changes in the plan after
forcing parallelism(parallel_tuple_cost, parallel_setup_cost,
min_parallel_table_scan_size=0)?

There is no plan change with parallel* GUC changes.

Regards,
Hari Babu
Fujitsu Australia
Reply | Threaded
Open this post in threaded view
|

Re: parallelize queries containing initplans

Haribabu Kommi-2
In reply to this post by akapila


On Wed, Aug 9, 2017 at 9:26 PM, Amit Kapila <[hidden email]> wrote:
On Wed, Aug 9, 2017 at 10:24 AM, Haribabu Kommi
<[hidden email]> wrote:
>
>
> For the following query the parallel plan is not chosen. The query contains
> an init plan that refer the outer node.
>

We don't want to generate the parallel plan for such cases.  Whenever
initplan refers to any outer node (aka correlated plan), it won't
generate a parallel plan.  Also, for t2, it doesn't choose a parallel
plan because one-time filter refers to the outer node (again
correlated plan case). Basically, till now we don't support parallel
plan for any case where the correlated plan is used.  So, it is
perfectly valid that it doesn't use parallel plan here.

Thanks for providing the details.
 
Here if you notice the parallel node t2 refers to the initplan which
can be parallelised after this patch.  Basically, whenever the
initplan is attached at or above Gather node, we compute its value and
pass down to workers.

Thanks for the details. I checked the code also.

By the way, I tested the patch with by DML support for parallel patch to
check the returning of clause of insert, and all the returning clause init plans
are parallel plans with this patch.
 
By the way, the patch doesn't apply on HEAD, so attached rebased patch.


Thanks for the updated patch. I have some comments and I am yet to finish
the review.

+ /*
+ * We don't want to generate gather or gather merge node if there are
+ * initplans at some query level below the current query level as those
+ * plans could be parallel-unsafe or undirect correlated plans.  Ensure to
+ * mark all the partial and non-partial paths for a relation at this level
+ * to be parallel-unsafe.
+ */
+ if (is_initplan_below_current_query_level(root))
+ {
+ foreach(lc, rel->partial_pathlist)
+ {
+ Path   *subpath = (Path *) lfirst(lc);
+
+ subpath->parallel_safe = false;
+ }
+
+ foreach(lc, rel->pathlist)
+ {
+ Path   *subpath = (Path *) lfirst(lc);
+
+ subpath->parallel_safe = false;
+ }
+ return;
+ }
+

The above part of the code is almost same in mutiple places, is it possible
to change to function?


+ node->initParam = NULL;

This new parameter is set to NULL in make_gather function, the same parameter
is added to GatherMerge structure also, but anyway this parameter is set to NULL
makeNode macro, why only setting it to here, but not the other place. 

Do we need to set it to default value such as NULL or false if it is already the same value?
This is not related to the above parameter, for all existing parameters also.


+ if (IsA(plan, Gather))
+ ((Gather *) plan)->initParam = bms_intersect(plan->lefttree->extParam, initSetParam);
+ else if (IsA(plan, GatherMerge))
+ ((GatherMerge *) plan)->initParam = bms_intersect(plan->lefttree->extParam, initSetParam);
+ else
+ elog(ERROR, "unrecognized node type: %d", nodeTag(plan));

The else case is not possible, because it is already validated for Gather or GatherMerge.
Can we change it simple if and else?

Regards,
Hari Babu
Fujitsu Australia
Reply | Threaded
Open this post in threaded view
|

Re: parallelize queries containing initplans

akapila
On Wed, Aug 9, 2017 at 6:51 PM, Haribabu Kommi <[hidden email]> wrote:
>
> On Wed, Aug 9, 2017 at 9:26 PM, Amit Kapila <[hidden email]> wrote:
>>
>
> By the way, I tested the patch with by DML support for parallel patch to
> check the returning of clause of insert, and all the returning clause init
> plans
> are parallel plans with this patch.
>

Good to know.

>>
>> By the way, the patch doesn't apply on HEAD, so attached rebased patch.
>
>
>
> Thanks for the updated patch. I have some comments and I am yet to finish
> the review.
>
> + /*
> + * We don't want to generate gather or gather merge node if there are
> + * initplans at some query level below the current query level as those
> + * plans could be parallel-unsafe or undirect correlated plans.  Ensure to
> + * mark all the partial and non-partial paths for a relation at this level
> + * to be parallel-unsafe.
> + */
> + if (is_initplan_below_current_query_level(root))
> + {
> + foreach(lc, rel->partial_pathlist)
> + {
> + Path   *subpath = (Path *) lfirst(lc);
> +
> + subpath->parallel_safe = false;
> + }
> +
> + foreach(lc, rel->pathlist)
> + {
> + Path   *subpath = (Path *) lfirst(lc);
> +
> + subpath->parallel_safe = false;
> + }
> + return;
> + }
> +
>
> The above part of the code is almost same in mutiple places, is it possible
> to change to function?
>
Sure, we can do that and I think that makes sense as well, so changed
accordingly in the attached patch.

>
> + node->initParam = NULL;
>
> This new parameter is set to NULL in make_gather function, the same
> parameter
> is added to GatherMerge structure also, but anyway this parameter is set to
> NULL
> makeNode macro, why only setting it to here, but not the other place.
>
> Do we need to set it to default value such as NULL or false if it is already
> the same value?
> This is not related to the above parameter, for all existing parameters
> also.
>
Strictly speaking, it is not required, but I have initialised at only
one of the place to make it consistent to near by code.  We already do
similar stuff in some other functions like make_seqscan,
make_samplescan, ..

I don't see any value in trying to initialize it for GatherMerge as
well, so left it as it is.

>
> + if (IsA(plan, Gather))
> + ((Gather *) plan)->initParam = bms_intersect(plan->lefttree->extParam,
> initSetParam);
> + else if (IsA(plan, GatherMerge))
> + ((GatherMerge *) plan)->initParam =
> bms_intersect(plan->lefttree->extParam, initSetParam);
> + else
> + elog(ERROR, "unrecognized node type: %d", nodeTag(plan));
>
> The else case is not possible, because it is already validated for Gather or
> GatherMerge.
> Can we change it simple if and else?
>
As we already have an assert in this function to protect from any
other node type (nodes other than Gather and Gather Merge), it makes
sense to change the code to just if...else, so changed accordingly.


--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

pq_pushdown_initplan_v.6.patch (50K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: parallelize queries containing initplans

Haribabu Kommi-2


On Fri, Aug 11, 2017 at 1:18 AM, Amit Kapila <[hidden email]> wrote:
On Wed, Aug 9, 2017 at 6:51 PM, Haribabu Kommi <[hidden email]> wrote:
>
>
> + if (IsA(plan, Gather))
> + ((Gather *) plan)->initParam = bms_intersect(plan->lefttree->extParam,
> initSetParam);
> + else if (IsA(plan, GatherMerge))
> + ((GatherMerge *) plan)->initParam =
> bms_intersect(plan->lefttree->extParam, initSetParam);
> + else
> + elog(ERROR, "unrecognized node type: %d", nodeTag(plan));
>
> The else case is not possible, because it is already validated for Gather or
> GatherMerge.
> Can we change it simple if and else?
>

As we already have an assert in this function to protect from any
other node type (nodes other than Gather and Gather Merge), it makes
sense to change the code to just if...else, so changed accordingly.

Thanks for the updated patch. Patch looks fine. I just have some
minor comments.

+ * ExecEvalParamExecParams
+ *
+ * Execute the subplan stored in PARAM_EXEC initplans params, if not executed
+ * till now.
+ */
+void
+ExecEvalParamExecParams(Bitmapset *params, EState *estate)

I feel it is better to explain when this function executes the sub plans that are
not executed till now? Means like in what scenario?


+ if (params == NULL)
+ nparams = 0;
+ else
+ nparams = bms_num_members(params);

bms_num_members return 0 in case if the params is NULL.
Is it fine to keep the specific check for NULL is required for performance benefit
or just remove it? Anything is fine for me.


+ if (IsA(plan, Gather))
+ ((Gather *) plan)->initParam = bms_intersect(plan->lefttree->extParam, initSetParam);
+ else
+ ((GatherMerge *) plan)->initParam = bms_intersect(plan->lefttree->extParam, initSetParam);


I think the above code is to find out the common parameters that are prsent in the external
and out params. It may be better to explain the logic in the comments.


Regards,
Hari Babu
Fujitsu Australia
Reply | Threaded
Open this post in threaded view
|

Re: parallelize queries containing initplans

akapila
On Sun, Aug 13, 2017 at 6:49 PM, Haribabu Kommi
<[hidden email]> wrote:

> On Fri, Aug 11, 2017 at 1:18 AM, Amit Kapila <[hidden email]>
> wrote:
>>
>
> Thanks for the updated patch. Patch looks fine. I just have some
> minor comments.
>
> + * ExecEvalParamExecParams
> + *
> + * Execute the subplan stored in PARAM_EXEC initplans params, if not
> executed
> + * till now.
> + */
> +void
> +ExecEvalParamExecParams(Bitmapset *params, EState *estate)
>
> I feel it is better to explain when this function executes the sub plans
> that are
> not executed till now? Means like in what scenario?
>
It just means that it will execute the same initplan (subplan) just
once in master backend even if it used in multiple places.  This is
the same kind of usage as we have in ExecEvalParamExec.  You can find
its usage by using some query like

explain analyse select sum(t1.i) from t1, t2 where t1.j=t2.j and t1.k
= (select count(k) from t3) and t1.k=t2.k;

Ensure you insert some rows in t1 and t2 that match the count from t3.
If you are using the schema and data given in Kuntal's script in email
above, then you need to insert something like
t1(900,900,900);t2(900,900,900);

It generates plan like below:

postgres=# explain analyse select sum(t1.i) from t1, t2 where
t1.j=t2.j and t1.k = (select count(k) from t3) and t1.k=t2.k;
                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=29.65..29.66 rows=1 width=8) (actual
time=22572.521..22572.521 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Finalize Aggregate  (cost=9.70..9.71 rows=1 width=8) (actual
time=4345.110..4345.111 rows=1 loops=1)
           ->  Gather  (cost=9.69..9.70 rows=2 width=8) (actual
time=4285.019..4345.098 rows=3 loops=1)
                 Workers Planned: 2
                 Workers Launched: 2
                 ->  Partial Aggregate  (cost=9.69..9.70 rows=1
width=8) (actual time=0.154..0.155 rows=1 loops=3)
                       ->  Parallel Seq Scan on t3  (cost=0.00..8.75
rows=375 width=4) (actual time=0.011..0.090 rows=300 loops=3)
   ->  Nested Loop  (cost=0.00..19.93 rows=1 width=4) (actual
time=22499.918..22572.512 rows=1 loops=1)
         Join Filter: (t1.j = t2.j)
         ->  Gather  (cost=0.00..9.67 rows=2 width=12) (actual
time=10521.356..10521.363 rows=1 loops=1)
               Workers Planned: 2
               Params Evaluated: $0
               Workers Launched: 2
               ->  Parallel Seq Scan on t1  (cost=0.00..9.67 rows=1
width=12) (actual time=0.506..0.507 rows=0 loops=3)
                     Filter: (k = $0)
                     Rows Removed by Filter: 299
         ->  Materialize  (cost=0.00..10.21 rows=2 width=8) (actual
time=11978.557..12051.142 rows=1 loops=1)
               ->  Gather  (cost=0.00..10.20 rows=2 width=8) (actual
time=11978.530..12051.113 rows=1 loops=1)
                     Workers Planned: 2
                     Params Evaluated: $0
                     Workers Launched: 2
                     ->  Parallel Seq Scan on t2  (cost=0.00..10.20
rows=1 width=8) (actual time=0.067..0.067 rows=0 loops=3)
                           Filter: (k = $0)
                           Rows Removed by Filter: 333
 Planning time: 15103.237 ms
 Execution time: 22574.703 ms
(27 rows)

You can notice that initplan is used at multiple nodes, but it will be
evaluated just once.  If you want, I can add a sentence in the
comments, but I think this is somewhat obvious and the same use case
already exists.  Let me know if you still think that comments need to
be expanded?

>
> + if (params == NULL)
> + nparams = 0;
> + else
> + nparams = bms_num_members(params);
>
> bms_num_members return 0 in case if the params is NULL.
> Is it fine to keep the specific check for NULL is required for performance
> benefit
> or just remove it? Anything is fine for me.
>
I don't see any performance benefit here, so removed the if check.

>
> + if (IsA(plan, Gather))
> + ((Gather *) plan)->initParam = bms_intersect(plan->lefttree->extParam,
> initSetParam);
> + else
> + ((GatherMerge *) plan)->initParam =
> bms_intersect(plan->lefttree->extParam, initSetParam);
>
>
> I think the above code is to find out the common parameters that are prsent
> in the external
> and out params.
>
Here, we want to save all the initplan params that can be used below
the gather node.  extParam contains the set of all external PARAM_EXEC
params that can be used below gather node and we just want initplan
params out of those.

> It may be better to explain the logic in the comments.
>

I have kept comments atop of the function set_param_references to
explain the context, but now I have added few more in the code as
suggested by you.  See if that suffices the need.


--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

pq_pushdown_initplan_v7.patch (50K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: parallelize queries containing initplans

Haribabu Kommi-2


On Mon, Aug 14, 2017 at 8:41 PM, Amit Kapila <[hidden email]> wrote:
On Sun, Aug 13, 2017 at 6:49 PM, Haribabu Kommi
<[hidden email]> wrote:
> On Fri, Aug 11, 2017 at 1:18 AM, Amit Kapila <[hidden email]>
> wrote:
>>
>
> Thanks for the updated patch. Patch looks fine. I just have some
> minor comments.
>
> + * ExecEvalParamExecParams
> + *
> + * Execute the subplan stored in PARAM_EXEC initplans params, if not
> executed
> + * till now.
> + */
> +void
> +ExecEvalParamExecParams(Bitmapset *params, EState *estate)
>
> I feel it is better to explain when this function executes the sub plans
> that are
> not executed till now? Means like in what scenario?
>

It just means that it will execute the same initplan (subplan) just
once in master backend even if it used in multiple places.  This is
the same kind of usage as we have in ExecEvalParamExec.  You can find
its usage by using some query like

explain analyse select sum(t1.i) from t1, t2 where t1.j=t2.j and t1.k
= (select count(k) from t3) and t1.k=t2.k;

Ensure you insert some rows in t1 and t2 that match the count from t3.
If you are using the schema and data given in Kuntal's script in email
above, then you need to insert something like
t1(900,900,900);t2(900,900,900);

It generates plan like below:

postgres=# explain analyse select sum(t1.i) from t1, t2 where
t1.j=t2.j and t1.k = (select count(k) from t3) and t1.k=t2.k;
                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=29.65..29.66 rows=1 width=8) (actual
time=22572.521..22572.521 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Finalize Aggregate  (cost=9.70..9.71 rows=1 width=8) (actual
time=4345.110..4345.111 rows=1 loops=1)
           ->  Gather  (cost=9.69..9.70 rows=2 width=8) (actual
time=4285.019..4345.098 rows=3 loops=1)
                 Workers Planned: 2
                 Workers Launched: 2
                 ->  Partial Aggregate  (cost=9.69..9.70 rows=1
width=8) (actual time=0.154..0.155 rows=1 loops=3)
                       ->  Parallel Seq Scan on t3  (cost=0.00..8.75
rows=375 width=4) (actual time=0.011..0.090 rows=300 loops=3)
   ->  Nested Loop  (cost=0.00..19.93 rows=1 width=4) (actual
time=22499.918..22572.512 rows=1 loops=1)
         Join Filter: (t1.j = t2.j)
         ->  Gather  (cost=0.00..9.67 rows=2 width=12) (actual
time=10521.356..10521.363 rows=1 loops=1)
               Workers Planned: 2
               Params Evaluated: $0
               Workers Launched: 2
               ->  Parallel Seq Scan on t1  (cost=0.00..9.67 rows=1
width=12) (actual time=0.506..0.507 rows=0 loops=3)
                     Filter: (k = $0)
                     Rows Removed by Filter: 299
         ->  Materialize  (cost=0.00..10.21 rows=2 width=8) (actual
time=11978.557..12051.142 rows=1 loops=1)
               ->  Gather  (cost=0.00..10.20 rows=2 width=8) (actual
time=11978.530..12051.113 rows=1 loops=1)
                     Workers Planned: 2
                     Params Evaluated: $0
                     Workers Launched: 2
                     ->  Parallel Seq Scan on t2  (cost=0.00..10.20
rows=1 width=8) (actual time=0.067..0.067 rows=0 loops=3)
                           Filter: (k = $0)
                           Rows Removed by Filter: 333
 Planning time: 15103.237 ms
 Execution time: 22574.703 ms
(27 rows)

You can notice that initplan is used at multiple nodes, but it will be
evaluated just once.  If you want, I can add a sentence in the
comments, but I think this is somewhat obvious and the same use case
already exists.  Let me know if you still think that comments need to
be expanded?

Thanks for providing details. Yes it is clear to me.


>
> + if (IsA(plan, Gather))
> + ((Gather *) plan)->initParam = bms_intersect(plan->lefttree->extParam,
> initSetParam);
> + else
> + ((GatherMerge *) plan)->initParam =
> bms_intersect(plan->lefttree->extParam, initSetParam);
>
>
> I think the above code is to find out the common parameters that are prsent
> in the external
> and out params.
>

Here, we want to save all the initplan params that can be used below
the gather node.  extParam contains the set of all external PARAM_EXEC
params that can be used below gather node and we just want initplan
params out of those.

> It may be better to explain the logic in the comments.
>

I have kept comments atop of the function set_param_references to
explain the context, but now I have added few more in the code as
suggested by you.  See if that suffices the need.

Thanks for adding more details. It is easy to understand.

I marked the patch as ready for committer in the commitfest.


Regards,
Hari Babu
Fujitsu Australia
Reply | Threaded
Open this post in threaded view
|

Re: parallelize queries containing initplans

akapila
On Mon, Aug 21, 2017 at 1:44 PM, Haribabu Kommi
<[hidden email]> wrote:
>
>
> Thanks for adding more details. It is easy to understand.
>
> I marked the patch as ready for committer in the commitfest.
>

Thank you.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: parallelize queries containing initplans

akapila
On Mon, Aug 21, 2017 at 2:40 PM, Amit Kapila <[hidden email]> wrote:
> On Mon, Aug 21, 2017 at 1:44 PM, Haribabu Kommi
> <[hidden email]> wrote:
>>
>>
>> Thanks for adding more details. It is easy to understand.
>>
>> I marked the patch as ready for committer in the commitfest.
>>

Rebased the patch.  The output of test case added by the patch is also
slightly changed because of the recent commit
7df2c1f8daeb361133ac8bdeaf59ceb0484e315a.  I have verified that the
new test result is as expected.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

pq_pushdown_initplan_v8.patch (51K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: parallelize queries containing initplans

akapila
On Thu, Aug 31, 2017 at 11:23 AM, Amit Kapila <[hidden email]> wrote:

> On Mon, Aug 21, 2017 at 2:40 PM, Amit Kapila <[hidden email]> wrote:
>> On Mon, Aug 21, 2017 at 1:44 PM, Haribabu Kommi
>> <[hidden email]> wrote:
>>>
>>>
>>> Thanks for adding more details. It is easy to understand.
>>>
>>> I marked the patch as ready for committer in the commitfest.
>>>
>
> Rebased the patch.  The output of test case added by the patch is also
> slightly changed because of the recent commit
> 7df2c1f8daeb361133ac8bdeaf59ceb0484e315a.  I have verified that the
> new test result is as expected.
>
The latest patch again needs to be rebased.  Find rebased patch
attached with this email.


--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

pq_pushdown_initplan_v9.patch (51K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: parallelize queries containing initplans

Daniel Gustafsson
> On 15 Sep 2017, at 04:45, Amit Kapila <[hidden email]> wrote:
>
> On Thu, Aug 31, 2017 at 11:23 AM, Amit Kapila <[hidden email]> wrote:
>> On Mon, Aug 21, 2017 at 2:40 PM, Amit Kapila <[hidden email]> wrote:
>>> On Mon, Aug 21, 2017 at 1:44 PM, Haribabu Kommi
>>> <[hidden email]> wrote:
>>>>
>>>>
>>>> Thanks for adding more details. It is easy to understand.
>>>>
>>>> I marked the patch as ready for committer in the commitfest.
>>
>> Rebased the patch.  The output of test case added by the patch is also
>> slightly changed because of the recent commit
>> 7df2c1f8daeb361133ac8bdeaf59ceb0484e315a.  I have verified that the
>> new test result is as expected.
>
> The latest patch again needs to be rebased.  Find rebased patch
> attached with this email.

Moved to next commitfest, but changed to Waiting for author since it no longer
applies cleanly.

cheers ./daniel

--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
123
Previous Thread Next Thread