Parallel INSERT (INTO ... SELECT ...)


Re: Parallel INSERT (INTO ... SELECT ...)

vignesh C
On Wed, Sep 23, 2020 at 2:21 PM Greg Nancarrow <[hidden email]> wrote:

>
> > - When INSERTs are made parallel, currently the reported row-count in
> > the "INSERT 0 <row-count>" status only reflects the rows that the
> > leader has processed (not the workers) - so it is obviously less than
> > the actual number of rows inserted.
>
> Attached an updated patch which fixes this issue (for parallel
> INSERTs, each worker's processed tuple count is communicated in shared
> memory back to the leader, where it is added to the global
> "es_processed" count).

I noticed that there is no check to skip parallel insert for temporary
tables.

/* Check if the target relation has foreign keys; if so, avoid
* creating a parallel Insert plan (because inserting into
* such tables would result in creation of new CommandIds, and
* this isn't supported by parallel workers).
* Similarly, avoid creating a parallel Insert plan if ON
* CONFLICT ... DO UPDATE ... has been specified, because
* parallel UPDATE is not supported.
* However, do allow any underlying query to be run by parallel
* workers in these cases.
*/

You should also include a temporary-table check here, as parallel
workers might not have access to temporary tables.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel INSERT (INTO ... SELECT ...)

Greg Nancarrow
On Sun, Sep 27, 2020 at 2:03 AM vignesh C <[hidden email]> wrote:
>
> I noticed that we are not having any check for skipping temporary
> table insertion.
>

> You should also include temporary tables check here, as parallel
> workers might not have access to temporary tables.
>

Thanks Vignesh, you are right; I need to test this and add it to the
list of further exclusions that the patch needs to check for.
Hopefully I can provide an updated patch soon that caters for these
additional cases.

Regards,
Greg Nancarrow
Fujitsu Australia



Re: Parallel INSERT (INTO ... SELECT ...)

Greg Nancarrow
In reply to this post by Bharath Rupireddy
On Sat, Sep 26, 2020 at 3:30 PM Bharath Rupireddy
<[hidden email]> wrote:

> I further checked on full txn id and command id. Yes, these are
> getting passed to workers  via InitializeParallelDSM() ->
> SerializeTransactionState(). I tried to summarize what we need to do
> in case of parallel inserts in general i.e. parallel COPY, parallel
> inserts in INSERT INTO and parallel inserts in CTAS.
>
> In the leader:
>     GetCurrentFullTransactionId()
>     GetCurrentCommandId(true)
>     EnterParallelMode();
>     InitializeParallelDSM() --> calls SerializeTransactionState()
> (both full txn id and command id are serialized into parallel DSM)
>
> In the workers:
> ParallelWorkerMain() -->  calls StartParallelWorkerTransaction() (both
> full txn id and command id are restored into workers'
> CurrentTransactionState->fullTransactionId and currentCommandId)
> If the parallel workers are meant for insertions, then we need to set
> currentCommandIdUsed = true; Maybe we can lift the assert in
> GetCurrentCommandId(), if we don't want to touch that function, then
> we can have a new function GetCurrentCommandidInWorker() whose
> functionality will be same as GetCurrentCommandId() without the
> Assert(!IsParallelWorker());.
>
> Am I missing something?
>
> If the above points are true, we might have to update the parallel
> copy patch set, test the use cases and post separately in the parallel
> copy thread in coming days.
>
Hi Bharath,

I pretty much agree with your above points.

I've attached an updated Parallel INSERT...SELECT patch, that:
- Only uses existing transaction state serialization support for
transfer of xid and cid.
- Adds a "SetCurrentCommandIdUsedForWorker" function, for setting
currentCommandIdUsed=true at the start of a parallel operation (used
for Parallel INSERT case, where we know the currentCommandId has been
assigned to the worker at the start of the parallel operation).
- Tweaks the Assert condition in the "used == true" case of
GetCurrentCommandId(), so that it only fires if running in a parallel
worker and currentCommandIdUsed is false - refer to the updated comment
in that function.
- Does not modify any existing GetCurrentCommandId() calls.
- Does not remove any existing parallel-related asserts/checks, except
for the "cannot insert tuples in a parallel worker" error in
heap_prepare_insert(). I am still considering what to do with the
original error-check here.
[- Does not yet cater for other exclusion cases that you and Vignesh
have pointed out]

This patch is mostly a lot cleaner, but it does contain a possibly ugly
hack: where it needs to call GetCurrentFullTransactionId(), it must
temporarily escape parallel mode (recall that parallel mode is set true
right at the top of ExecutePlan() in the Parallel INSERT/SELECT cases).

Regards,
Greg Nancarrow
Fujitsu Australia

Attachment: 0003-ParallelInsertSelect.patch (39K)

Re: Parallel INSERT (INTO ... SELECT ...)

Bharath Rupireddy
On Mon, Sep 28, 2020 at 8:45 AM Greg Nancarrow <[hidden email]> wrote:

>
> On Sat, Sep 26, 2020 at 3:30 PM Bharath Rupireddy
> <[hidden email]> wrote:
> > [...]
>
> [...]

Thanks Greg.

In general, I see a few things common to all parallel insert
cases (CTAS[1], COPY[2], INSERT INTO ... SELECT):
1. Removal of "cannot insert tuples in a parallel worker" restriction
from heap_prepare_insert()
2. Each worker should be able to set currentCommandIdUsed to true.
3. The change you proposed to make in GetCurrentCommandId()'s assert condition.

Please add any other common points I may have missed.

Common solutions to each of the above points would benefit all the
parallel insert cases. How about having a common thread, discussion,
and patch for all three points?

@Amit Kapila  @Greg Nancarrow @vignesh C Thoughts?

[1] https://www.postgresql.org/message-id/CALj2ACWj%2B3H5TQqwxANZmdePEnSNxk-YAeT1c5WE184Gf75XUw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAA4eK1%2BkpddvvLxWm4BuG_AhVvYz8mKAEa7osxp_X0d4ZEiV%3Dg%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel INSERT (INTO ... SELECT ...)

akapila
On Mon, Sep 28, 2020 at 4:06 PM Bharath Rupireddy
<[hidden email]> wrote:

>
> On Mon, Sep 28, 2020 at 8:45 AM Greg Nancarrow <[hidden email]> wrote:
> > [...]
>
> Thanks Greg.
>
> In general, see a few things common to all parallel insert
> cases(CTAS[1], COPY[2], INSERT INTO SELECTs):
> 1. Removal of "cannot insert tuples in a parallel worker" restriction
> from heap_prepare_insert()
> 2. Each worker should be able to set currentCommandIdUsed to true.
> 3. The change you proposed to make in GetCurrentCommandId()'s assert condition.
>
> Please add if I miss any other common point.
>
> Common solutions to each of the above points would be beneficial to
> all the parallel insert cases. How about having a common thread,
> discussion and a common patch for all the 3 points?
>

I am not sure that is required at this stage; let's first sort out
other parts of the design, because there could be other bigger problems
which we have not thought about yet. I have already shared some
thoughts on those points in this thread; let's first get that done and
have the basic patch ready, then if required we can discuss these
points in detail in another thread.

--
With Regards,
Amit Kapila.



Re: Parallel INSERT (INTO ... SELECT ...)

Dilip Kumar-2
In reply to this post by Greg Nancarrow
On Mon, Sep 28, 2020 at 8:45 AM Greg Nancarrow <[hidden email]> wrote:

> [...]

I think you still need to work on the costing part; basically, if we
are parallelizing the whole insert then the plan is like below

-> Gather
  -> Parallel Insert
      -> Parallel Seq Scan

That means the tuples we are selecting via the scan are not sent back
to the Gather node, so in cost_gather we need to check whether it is
for an INSERT; if so, no rows are transferred through the parallel
queue, which means we need not pay any parallel tuple cost.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel INSERT (INTO ... SELECT ...)

Dilip Kumar-2
On Tue, Sep 29, 2020 at 8:27 PM Dilip Kumar <[hidden email]> wrote:

>
> On Mon, Sep 28, 2020 at 8:45 AM Greg Nancarrow <[hidden email]> wrote:
> > [...]
>
> I think you still need to work on the costing part, basically if we
> are parallelizing whole insert then plan is like below
>
> -> Gather
>   -> Parallel Insert
>       -> Parallel Seq Scan
>
> That means the tuple we are selecting via scan are not sent back to
> the gather node, so in cost_gather we need to see if it is for the
> INSERT then there is no row transferred through the parallel queue
> that mean we need not to pay any parallel tuple cost.

I just looked into the parallel CTAS patch[1] for the same thing, and
I can see that it is handled there.

[1] https://www.postgresql.org/message-id/CALj2ACWFq6Z4_jd9RPByURB8-Y8wccQWzLf%2B0-Jg%2BKYT7ZO-Ug%40mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel INSERT (INTO ... SELECT ...)

Greg Nancarrow
> > [...]
>
> I just looked into the parallel CTAS[1] patch for the same thing, and
> I can see in that patch it is being handled.
>
> [1] https://www.postgresql.org/message-id/CALj2ACWFq6Z4_jd9RPByURB8-Y8wccQWzLf%2B0-Jg%2BKYT7ZO-Ug%40mail.gmail.com
>

Hi Dilip,

You're right, the costing for Parallel INSERT is not finished; I'm
still working on it, and haven't posted an updated patch for it yet.
As far as the cost_gather() method is concerned, for Parallel INSERT it
can probably use the same costing approach as the CTAS patch, except in
the case of a specified RETURNING clause.

Regards,
Greg Nancarrow
Fujitsu Australia



Re: Parallel INSERT (INTO ... SELECT ...)

Dilip Kumar-2
On Wed, Sep 30, 2020 at 7:38 AM Greg Nancarrow <[hidden email]> wrote:

>
> > > [...]
>
> Hi Dilip,
>
> You're right, the costing for Parallel Insert is not done and
> finished, I'm still working on the costing, and haven't posted an
> updated patch for it yet.

Okay.

> As far as cost_gather() method is concerned, for Parallel INSERT, it
> can probably use the same costing approach as the CTAS patch except in
> the case of a specified RETURNING clause.

Yeah, right. I did not think about the RETURNING part.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel INSERT (INTO ... SELECT ...)

Bharath Rupireddy
In reply to this post by Greg Nancarrow
On Wed, Sep 30, 2020 at 7:38 AM Greg Nancarrow <[hidden email]> wrote:

>
> [...]
>

I have one question which is common to both this patch and parallel
inserts in CTAS[1]: do we need to skip creating the tuple queues
(ExecParallelSetupTupleQueues), as we don't have any tuples being
shared from workers to the leader? Put another way, do we use the
tuple queue for sharing any info other than tuples from workers to
the leader?

[1] https://www.postgresql.org/message-id/CALj2ACWFq6Z4_jd9RPByURB8-Y8wccQWzLf%2B0-Jg%2BKYT7ZO-Ug%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel INSERT (INTO ... SELECT ...)

Dilip Kumar-2
On Mon, Oct 5, 2020 at 4:26 PM Bharath Rupireddy
<[hidden email]> wrote:

>
> On Wed, Sep 30, 2020 at 7:38 AM Greg Nancarrow <[hidden email]> wrote:
> > [...]
>
> I have one question which is common to both this patch and parallel
> inserts in CTAS[1], do we need to skip creating tuple
> queues(ExecParallelSetupTupleQueues) as we don't have any tuples
> that's being shared from workers to leader? Put it another way, do we
> use the tuple queue for sharing any info other than tuples from
> workers to leader?

Ideally, we don't need the tuple queue unless we want to transfer
tuples to the Gather node.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel INSERT (INTO ... SELECT ...)

akapila
In reply to this post by Bharath Rupireddy
On Mon, Oct 5, 2020 at 4:26 PM Bharath Rupireddy
<[hidden email]> wrote:
>
> On Wed, Sep 30, 2020 at 7:38 AM Greg Nancarrow <[hidden email]> wrote:
> >
>
> I have one question which is common to both this patch and parallel
> inserts in CTAS[1], do we need to skip creating tuple
> queues(ExecParallelSetupTupleQueues) as we don't have any tuples
> that's being shared from workers to leader?
>

As far as this patch is concerned, we might need to return tuples when
there is a RETURNING clause. I think for the cases where we don't need
to return tuples, we might want to skip creating these queues, if that
is feasible without too many changes.

--
With Regards,
Amit Kapila.



Re: Parallel INSERT (INTO ... SELECT ...)

Dilip Kumar-2
On Mon, Oct 5, 2020 at 4:53 PM Amit Kapila <[hidden email]> wrote:

>
> On Mon, Oct 5, 2020 at 4:26 PM Bharath Rupireddy
> <[hidden email]> wrote:
> >
> > On Wed, Sep 30, 2020 at 7:38 AM Greg Nancarrow <[hidden email]> wrote:
> > >
> >
> > I have one question which is common to both this patch and parallel
> > inserts in CTAS[1], do we need to skip creating tuple
> > queues(ExecParallelSetupTupleQueues) as we don't have any tuples
> > that's being shared from workers to leader?
> >
>
> As far as this patch is concerned we might need to return tuples when
> there is a Returning clause. I think for the cases where we don't need
> to return tuples we might want to skip creating these queues if it is
> feasible without too many changes.

+1

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel INSERT (INTO ... SELECT ...)

Greg Nancarrow
On Mon, Oct 5, 2020 at 10:36 PM Dilip Kumar <[hidden email]> wrote:

>
> > > I have one question which is common to both this patch and parallel
> > > inserts in CTAS[1], do we need to skip creating tuple
> > > queues(ExecParallelSetupTupleQueues) as we don't have any tuples
> > > that's being shared from workers to leader?
> > >
> >
> > As far as this patch is concerned we might need to return tuples when
> > there is a Returning clause. I think for the cases where we don't need
> > to return tuples we might want to skip creating these queues if it is
> > feasible without too many changes.
>
Hi Dilip,

You're right. I've included that in my latest version of the patch (so
Gather should only start tuple queues in the case of parallel SELECT
or parallel INSERT with a RETURNING clause).
Other functionality updated includes:
- Added more necessary exclusions for Parallel INSERT INTO ... SELECT
... (but allowing the underlying query to still be parallel):
  - non-parallel-safe triggers
  - non-parallel-safe default and check expressions
  - foreign tables
  - temporary tables
- Added support for before/after statement-level INSERT triggers
(can't allow parallel workers to execute these)
- Adjusted cost of Gather node, for when RETURNING clause is not specified
I have not found issues with partitioned tables (yet) or TOAST column values.

Also, I have attached a separate patch (requested by Andres Freund)
that just allows the underlying SELECT part of "INSERT INTO ... SELECT
..." to be parallel.

Regards,
Greg Nancarrow
Fujitsu Australia

Attachments: 0004-ParallelInsertSelect.patch (54K), 0001-InsertParallelSelect.patch (4K)

Re: Parallel INSERT (INTO ... SELECT ...)

Bharath Rupireddy
On Tue, Oct 6, 2020 at 3:08 PM Greg Nancarrow <[hidden email]> wrote:
>
> I have not found issues with partition tables (yet) or toast column values.
>

I think for TOAST column values there may not be a problem, as each
parallel worker inserts TOAST column values individually.

But a problem may arise if a partitioned table has a foreign table as
a partition; I think we cannot allow parallelism for this case either,
but it's hard to determine ahead of time whether a table has a foreign
partition. (See [1] in copy.c)

>
> - Added support for before/after statement-level INSERT triggers
> (can't allow parallel workers to execute these)
>

I think we can allow parallelism for BEFORE STATEMENT-level triggers.
The leader can execute this trigger and then go on to the parallel inserts.

What about BEFORE ROW, AFTER ROW, INSTEAD OF ROW, and transition-table
(REFERENCING NEW TABLE) triggers?

[1]
    else
    {
        /*
         * For partitioned tables, we may still be able to perform bulk
         * inserts.  However, the possibility of this depends on which types
         * of triggers exist on the partition.  We must disable bulk inserts
         * if the partition is a foreign table or it has any before row insert
         * or insert instead triggers (same as we checked above for the parent
         * table).  Since the partition's resultRelInfos are initialized only
         * when we actually need to insert the first tuple into them, we must
         * have the intermediate insert method of CIM_MULTI_CONDITIONAL to
         * flag that we must later determine if we can use bulk-inserts for
         * the partition being inserted into.
         */
        if (proute)
            insertMethod = CIM_MULTI_CONDITIONAL;

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel INSERT (INTO ... SELECT ...)

Greg Nancarrow
On Tue, Oct 6, 2020 at 9:10 PM Bharath Rupireddy
<[hidden email]> wrote:

> But the problem may arise if a partitioned table has foreign table as
> a partition, I think we can not allow parallelism for this case too,
> but it's hard to determine ahead of time whether a table has a foreign
> partition.(See [1] in copy.c)
>

Thanks, I had seen that as a potential issue when scanning the code,
but had forgotten to note it. I'll check your code again.

> >
> > - Added support for before/after statement-level INSERT triggers
> > (can't allow parallel workers to execute these)
> >
>
> I think we can allow parallelism for before statement level-triggers.
> Leader can execute this trigger and go for parallel inserts.
>

My attached patch implements the before/after statement-level trigger
invocation.
(For the INSERT INTO ... SELECT ... case, it needs to account for both
parallel and non-parallel INSERT, and also for the fact that, as the
patch currently stands, the leader also participates in a parallel
INSERT - so I found it necessary to invoke those triggers at the Gather
node level in that case).
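As a toy illustration of that control flow (this is not the actual executor code; the functions and counters are invented purely to show where statement-level triggers fire relative to the worker inserts):

```c
#include <assert.h>

static int before_stmt_fired = 0;
static int after_stmt_fired = 0;
static int rows_inserted = 0;

/* Stand-in for a worker (or the leader) inserting a batch of rows. */
static void
worker_insert(int nrows)
{
    rows_inserted += nrows;
}

/*
 * Leader-side driver at the Gather node: statement-level triggers run
 * exactly once in the leader, no matter how many workers insert, and
 * the leader itself also participates in the insert.
 */
static void
gather_parallel_insert(int nworkers, int rows_per_worker)
{
    before_stmt_fired++;                /* BEFORE STATEMENT: leader only */
    for (int w = 0; w < nworkers; w++)  /* stands in for parallel workers */
        worker_insert(rows_per_worker);
    worker_insert(rows_per_worker);     /* leader participates too */
    after_stmt_fired++;                 /* AFTER STATEMENT: leader only */
}
```

The point of the model is simply that workers never execute statement-level triggers; only the Gather-level leader does, once per statement.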

> How about before row, after row, instead row, new table type triggers?
>

My attached patch does not allow parallel INSERT if there are any
row-level triggers (as the trigger functions could see a different and
unpredictable table state compared to non-parallel INSERT, even if
otherwise parallel-safe).

Regards,
Greg Nancarrow
Fujitsu Australia



Re: Parallel INSERT (INTO ... SELECT ...)

Bharath Rupireddy
On Tue, Oct 6, 2020 at 4:13 PM Greg Nancarrow <[hidden email]> wrote:

>
> On Tue, Oct 6, 2020 at 9:10 PM Bharath Rupireddy
> <[hidden email]> wrote:
>
> > But the problem may arise if a partitioned table has foreign table as
> > a partition, I think we can not allow parallelism for this case too,
> > but it's hard to determine ahead of time whether a table has a foreign
> > partition.(See [1] in copy.c)
> >
>
> Thanks, I had seen that as a potential issue when scanning the code,
> but had forgotten to note it. I'll check your code again.
>

In parallel copy, we are not doing anything (for the same reason
explained in the above comment) to find out whether there is a foreign
partition while deciding between a parallel and a non-parallel copy;
we simply throw an error during the first tuple insertion into the
partition.

errmsg("cannot perform PARALLEL COPY if partition has BEFORE/INSTEAD
OF triggers, or if the partition is foreign partition"),
                            errhint("Try COPY without PARALLEL option")));

> > >
> > > - Added support for before/after statement-level INSERT triggers
> > > (can't allow parallel workers to execute these)
> > >
> >
> > I think we can allow parallelism for before statement level-triggers.
> > Leader can execute this trigger and go for parallel inserts.
> >
>
> My attached patch implements the before/after statement-level trigger
> invocation.
> (For INSERT INTO ... SELECT... case, it needs to account for parallel
> and non-parallel INSERT, and also the fact that, as the patch
> currently stands, the leader also participates in a parallel INSERT -
> so I found it necessary to invoke those triggers at the Gather node
> level in that case).
>

Allowing the leader to execute the BEFORE STATEMENT trigger at the
Gather node level, before invoking the parallel plan and the parallel
inserts, makes sense. But if there are any AFTER STATEMENT triggers,
transition tables may come into play; see Amit's findings under Case-1
in [1]. We must disable parallelism in that case.

[1] - https://www.postgresql.org/message-id/flat/CAA4eK1%2BANNEaMJCCXm4naweP5PLY6LhJMvGo_V7-Pnfbh6GsOA%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel INSERT (INTO ... SELECT ...)

Greg Nancarrow
On Wed, Oct 7, 2020 at 12:40 AM Bharath Rupireddy
<[hidden email]> wrote:

>
> In parallel, we are not doing anything(due to the same reason
> explained in above comment) to find whether there is a foreign
> partition or not while deciding to go with parallel/non-parallel copy,
> we are just throwing an error during the first tuple insertion into
> the partition.
>
> errmsg("cannot perform PARALLEL COPY if partition has BEFORE/INSTEAD
> OF triggers, or if the partition is foreign partition"),
>                             errhint("Try COPY without PARALLEL option")));
>

I may well need to do something similar for parallel INSERT, but I'm
kind of surprised that it can't be detected earlier. I'll need to test
this further.

>
> Allowing the leader to execute before statement triggers at Gather
> node level before invoking the parallel plan and then parallel inserts
> makes sense. But if there are any after statement triggers, there may
> come transition tables, see Amit's findings under Case-1 in [1] and we
> must disable parallelism in that case.
>
> [1] - https://www.postgresql.org/message-id/flat/CAA4eK1%2BANNEaMJCCXm4naweP5PLY6LhJMvGo_V7-Pnfbh6GsOA%40mail.gmail.com
>

The patch I last posted for parallel INSERT does detect use of
transition tables in this case (trigdesc->trig_insert_new_table) and
disables INSERT parallelism (I tested it against Amit's example), yet
still otherwise allows AFTER STATEMENT triggers for parallel INSERT.
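A minimal self-contained sketch of that check (the TriggerDesc here is a cut-down stand-in for PostgreSQL's real structure; only the trig_insert_new_table field comes from the patch discussion above, and the function name is invented):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Minimal stand-in for PostgreSQL's TriggerDesc, modelling just the
 * two flags this check cares about.
 */
typedef struct TriggerDesc
{
    bool trig_insert_after_statement; /* AFTER STATEMENT trigger present */
    bool trig_insert_new_table;       /* REFERENCING NEW TABLE (transition table) */
} TriggerDesc;

/*
 * AFTER STATEMENT triggers alone don't block parallel INSERT (the
 * leader fires them), but a transition table does: workers can't feed
 * the leader's transition tuplestore.
 */
static bool
triggers_allow_parallel_insert(const TriggerDesc *trigdesc)
{
    if (trigdesc == NULL)
        return true;            /* no triggers at all */
    return !trigdesc->trig_insert_new_table;
}
```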

Regards,
Greg Nancarrow
Fujitsu Australia



Re: Parallel INSERT (INTO ... SELECT ...)

Greg Nancarrow
In reply to this post by Bharath Rupireddy
On Wed, Oct 7, 2020 at 12:40 AM Bharath Rupireddy
<[hidden email]> wrote:

> In parallel, we are not doing anything(due to the same reason
> explained in above comment) to find whether there is a foreign
> partition or not while deciding to go with parallel/non-parallel copy,
> we are just throwing an error during the first tuple insertion into
> the partition.
>
> errmsg("cannot perform PARALLEL COPY if partition has BEFORE/INSTEAD
> OF triggers, or if the partition is foreign partition"),
>                             errhint("Try COPY without PARALLEL option")));
>

I'm wondering whether code similar to the following can safely be used
to detect a foreign partition:

    if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
    {
        int i;
        PartitionDesc pd = RelationGetPartitionDesc(rel);
        for (i = 0; i < pd->nparts; i++)
        {
            if (get_rel_relkind(pd->oids[i]) == RELKIND_FOREIGN_TABLE)
            {
                table_close(rel, NoLock);
                return false;
            }
        }
    }

Thoughts?

Regards,
Greg Nancarrow
Fujitsu Australia



Re: Parallel INSERT (INTO ... SELECT ...)

Greg Nancarrow
On Wed, Oct 7, 2020 at 7:25 PM Greg Nancarrow <[hidden email]> wrote:

>
> On Wed, Oct 7, 2020 at 12:40 AM Bharath Rupireddy
> <[hidden email]> wrote:
> > In parallel, we are not doing anything(due to the same reason
> > explained in above comment) to find whether there is a foreign
> > partition or not while deciding to go with parallel/non-parallel copy,
> > we are just throwing an error during the first tuple insertion into
> > the partition.
> >
> > errmsg("cannot perform PARALLEL COPY if partition has BEFORE/INSTEAD
> > OF triggers, or if the partition is foreign partition"),
> >                             errhint("Try COPY without PARALLEL option")));
> >
>
> I'm wondering whether code similar to the following can safely be used
> to detect a foreign partition:
>
>     if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
>     {
>         int i;
>         PartitionDesc pd = RelationGetPartitionDesc(rel);
>         for (i = 0; i < pd->nparts; i++)
>         {
>             if (get_rel_relkind(pd->oids[i]) == RELKIND_FOREIGN_TABLE)
>             {
>                 table_close(rel, NoLock);
>                 return false;
>             }
>         }
>     }
>

Actually, the addition of this kind of check is still not good enough.
Partitions can have their own constraints, triggers, column default
expressions, etc., and a partition can itself be partitioned.
I've written code that recursively walks the partitions and does all
the various parallel-insert-safety checks as before, but it's doing a
fair bit of work.
Any other idea of dealing with this? Seems it can't be avoided if you
want to support partitioned tables and partitions.
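For illustration, the recursive walk can be sketched over a toy relation tree (this is not the patch code; Rel and its fields are invented stand-ins for the relcache and PartitionDesc data the real check would consult):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy relation tree: each relation may itself be partitioned, and any
 * level can carry a property that makes parallel INSERT unsafe.
 */
typedef struct Rel
{
    bool unsafe;            /* foreign table, unsafe trigger/default, ... */
    int  nparts;
    struct Rel **parts;     /* sub-partitions (may themselves be partitioned) */
} Rel;

/*
 * Recursive walk mirroring the approach described above: a partitioned
 * target is parallel-safe only if the table itself and every partition,
 * at every depth, is safe.
 */
static bool
rel_tree_parallel_safe(const Rel *rel)
{
    if (rel->unsafe)
        return false;
    for (int i = 0; i < rel->nparts; i++)
    {
        if (!rel_tree_parallel_safe(rel->parts[i]))
            return false;
    }
    return true;
}
```

The cost concern raised above is visible even in the toy version: the walk has to open and inspect every partition up front, before a single row is inserted.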

Regards,
Greg Nancarrow
Fujitsu Australia

