partition routing layering in nodeModifyTable.c

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
79 messages Options
1234
Reply | Threaded
Open this post in threaded view
|

partition routing layering in nodeModifyTable.c

Andres Freund
Hi,

While discussing partition related code with David in [1], I again was
confused by the layering of partition related code in
nodeModifyTable.c.


1) How come partition routing is done outside of ExecInsert()?

            case CMD_INSERT:
                /* Prepare for tuple routing if needed. */
                if (proute)
                    slot = ExecPrepareTupleRouting(node, estate, proute,
                                                   resultRelInfo, slot);
                slot = ExecInsert(node, slot, planSlot,
                                  estate, node->canSetTag);
                /* Revert ExecPrepareTupleRouting's state change. */
                if (proute)
                    estate->es_result_relation_info = resultRelInfo;
                break;

That already seems like a layering violation, but it's made worse by
ExecUpdate() having its partition handling solely inside - including another
call to ExecInsert(), including the surrounding partition setup code.

And even worse, after all that, ExecInsert() still contains partitioning
code.

It seems to me that if we just moved the ExecPrepareTupleRouting() into
ExecInsert(), we could remove the duplication.


2) The contents of the
        /*
         * If a partition check failed, try to move the row into the right
         * partition.
         */
        if (partition_constraint_failed)

block ought to be moved to a separate function (maybe
ExecCrossPartitionUpdate or ExecMove). ExecUpdate() is already
complicated enough without dealing with the partition move.


3) How come we reset estate->es_result_relation_info after partition
   routing, but not the mtstate wide changes by
   ExecPrepareTupleRouting()?  Note that its comment says:

 * Caller must revert the estate changes after executing the insertion!
 * In mtstate, transition capture changes may also need to be reverted.

ExecUpdate() contains

            /*
             * Updates set the transition capture map only when a new subplan
             * is chosen.  But for inserts, it is set for each row. So after
             * INSERT, we need to revert back to the map created for UPDATE;
             * otherwise the next UPDATE will incorrectly use the one created
             * for INSERT.  So first save the one created for UPDATE.
             */
            if (mtstate->mt_transition_capture)
                saved_tcs_map = mtstate->mt_transition_capture->tcs_map;

but as I read the code, that's not really true?  It's
ExecPrepareTupleRouting() that does so, and that's called directly in ExecUpdate().


4)
    /*
     * If this insert is the result of a partition key update that moved the
     * tuple to a new partition, put this row into the transition NEW TABLE,
     * if there is one. We need to do this separately for DELETE and INSERT
     * because they happen on different tables.
     */
    ar_insert_trig_tcs = mtstate->mt_transition_capture;
    if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
        && mtstate->mt_transition_capture->tcs_update_new_table)
    {
        ExecARUpdateTriggers(estate, resultRelInfo, NULL,
                             NULL,
                             slot,
                             NULL,
                             mtstate->mt_transition_capture);

        /*
         * We've already captured the NEW TABLE row, so make sure any AR
         * INSERT trigger fired below doesn't capture it again.
         */
        ar_insert_trig_tcs = NULL;
    }

    /* AFTER ROW INSERT Triggers */
    ExecARInsertTriggers(estate, resultRelInfo, slot, recheckIndexes,
                         ar_insert_trig_tcs);

Besides not using the just defined ar_insert_trig_tcs and instead
repeatedly referring to mtstate->mt_transition_capture, wouldn't this be
a easier to understand if the were an if/else, instead of resetting
ar_insert_trig_tcs?  If the block were

    /*
     * triggers behave differently depending on this being a delete as
     * part of a partion move, or a deletion proper.
    if (mtstate->operation == CMD_UPDATE)
    {
        /*
         * If this insert is the result of a partition key update that moved the
         * tuple to a new partition, put this row into the transition NEW TABLE,
         * if there is one. We need to do this separately for DELETE and INSERT
         * because they happen on different tables.
         */
        ExecARUpdateTriggers(estate, resultRelInfo, NULL,
                             NULL,
                             slot,
                             NULL,
                             mtstate->mt_transition_capture);

        /*
         * But we do want to fire plain per-row INSERT triggers on the
         * new table. By not passing in transition_capture we prevent
         * ....
         */
         ExecARInsertTriggers(estate, resultRelInfo, slot, recheckIndexes,
                              NULL);
    }
    else
    {
        /* AFTER ROW INSERT Triggers */
        ExecARInsertTriggers(estate, resultRelInfo, slot, recheckIndexes,
                             ar_insert_trig_tcs);
    }

it seems like it'd be quite a bit clearer (although I do think the
comments also need a fair bit of polishing independent of this proposed
change).


Greetings,

Andres Freund


[1] https://www.postgresql.org/message-id/CAKJS1f-YObQJTbncGJGRZ6gSFiS%2Bgw_Y5kvrpR%3DvEnFKH17AVA%40mail.gmail.com


Reply | Threaded
Open this post in threaded view
|

Re: partition routing layering in nodeModifyTable.c

Amit Langote
Hi Andres,

On Thu, Jul 18, 2019 at 10:09 AM Andres Freund <[hidden email]> wrote:

> 1) How come partition routing is done outside of ExecInsert()?
>
>             case CMD_INSERT:
>                 /* Prepare for tuple routing if needed. */
>                 if (proute)
>                     slot = ExecPrepareTupleRouting(node, estate, proute,
>                                                    resultRelInfo, slot);
>                 slot = ExecInsert(node, slot, planSlot,
>                                   estate, node->canSetTag);
>                 /* Revert ExecPrepareTupleRouting's state change. */
>                 if (proute)
>                     estate->es_result_relation_info = resultRelInfo;
>                 break;
>
> That already seems like a layering violation,

The decision to move partition routing out of ExecInsert() came about
when we encountered a bug [1] whereby ExecInsert() would fail to reset
estate->es_result_relation_info back to the root table if it had to
take an abnormal path out (early return), of which there are quite a
few instances.  The first solution I came up with was to add a goto
label  for the code to reset estate->es_result_relation_info and jump
to it from the various places that do an early return, which was
complained about as reducing readability.  So, the solution we
eventually settled on in 6666ee49f was to perform ResultRelInfos
switching at a higher level.

> but it's made worse by
> ExecUpdate() having its partition handling solely inside - including another
> call to ExecInsert(), including the surrounding partition setup code.
>
> And even worse, after all that, ExecInsert() still contains partitioning
> code.

AFAIK, it's only to check the partition constraint when necessary.
Partition routing complexity is totally outside, but based on what you
write in point 4 below there's bit more...

> It seems to me that if we just moved the ExecPrepareTupleRouting() into
> ExecInsert(), we could remove the duplication.

I agree that there's duplication here.  Given what I wrote above, I
can think of doing this: move all of ExecInsert()'s code into
ExecInsertInternal() and make the former instead look like this:

static TupleTableSlot *
ExecInsert(ModifyTableState *mtstate,
           TupleTableSlot *slot,
           TupleTableSlot *planSlot,
           EState *estate,
           bool canSetTag)
{
    PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
    ResultRelInfo *resultRelInfo = estate->es_result_relation_info;

    /* Prepare for tuple routing if needed. */
    if (proute)
        slot = ExecPrepareTupleRouting(mtstate, estate, proute, resultRelInfo,
                                       slot);

    slot = ExecInsertInternal(mtstate, slot, planSlot, estate,
                              mtstate->canSetTag);

    /* Revert ExecPrepareTupleRouting's state change. */
    if (proute)
        estate->es_result_relation_info = resultRelInfo;

    return slot;
}

> 2) The contents of the
>         /*
>          * If a partition check failed, try to move the row into the right
>          * partition.
>          */
>         if (partition_constraint_failed)
>
> block ought to be moved to a separate function (maybe
> ExecCrossPartitionUpdate or ExecMove). ExecUpdate() is already
> complicated enough without dealing with the partition move.

I tend to agree with this.  Adding Amit Khandekar in case he wants to
chime in about this.

> 3) How come we reset estate->es_result_relation_info after partition
>    routing, but not the mtstate wide changes by
>    ExecPrepareTupleRouting()?  Note that its comment says:
>
>  * Caller must revert the estate changes after executing the insertion!
>  * In mtstate, transition capture changes may also need to be reverted.
>
> ExecUpdate() contains
>
>             /*
>              * Updates set the transition capture map only when a new subplan
>              * is chosen.  But for inserts, it is set for each row. So after
>              * INSERT, we need to revert back to the map created for UPDATE;
>              * otherwise the next UPDATE will incorrectly use the one created
>              * for INSERT.  So first save the one created for UPDATE.
>              */
>             if (mtstate->mt_transition_capture)
>                 saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
>
> but as I read the code, that's not really true?  It's
> ExecPrepareTupleRouting() that does so, and that's called directly in ExecUpdate().

Calling ExecPrepareTupleRouting() is considered a part of a given
INSERT operation, so anything it does is to facilitate the INSERT.  In
this case, which map to assign to tcs_map can only be determined after
a partition is chosen and determining the partition (routing) is a job
of ExecPrepareTupleRouting().  Perhaps, we need to update the comment
here a bit.

> 4)
>     /*
>      * If this insert is the result of a partition key update that moved the
>      * tuple to a new partition, put this row into the transition NEW TABLE,
>      * if there is one. We need to do this separately for DELETE and INSERT
>      * because they happen on different tables.
>      */
>     ar_insert_trig_tcs = mtstate->mt_transition_capture;
>     if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
>         && mtstate->mt_transition_capture->tcs_update_new_table)
>     {
>         ExecARUpdateTriggers(estate, resultRelInfo, NULL,
>                              NULL,
>                              slot,
>                              NULL,
>                              mtstate->mt_transition_capture);
>
>         /*
>          * We've already captured the NEW TABLE row, so make sure any AR
>          * INSERT trigger fired below doesn't capture it again.
>          */
>         ar_insert_trig_tcs = NULL;
>     }
>
>     /* AFTER ROW INSERT Triggers */
>     ExecARInsertTriggers(estate, resultRelInfo, slot, recheckIndexes,
>                          ar_insert_trig_tcs);
>
> Besides not using the just defined ar_insert_trig_tcs and instead
> repeatedly referring to mtstate->mt_transition_capture, wouldn't this be
> a easier to understand if the were an if/else, instead of resetting
> ar_insert_trig_tcs?  If the block were
>
>     /*
>      * triggers behave differently depending on this being a delete as
>      * part of a partion move, or a deletion proper.
>     if (mtstate->operation == CMD_UPDATE)
>     {
>         /*
>          * If this insert is the result of a partition key update that moved the
>          * tuple to a new partition, put this row into the transition NEW TABLE,
>          * if there is one. We need to do this separately for DELETE and INSERT
>          * because they happen on different tables.
>          */
>         ExecARUpdateTriggers(estate, resultRelInfo, NULL,
>                              NULL,
>                              slot,
>                              NULL,
>                              mtstate->mt_transition_capture);
>
>         /*
>          * But we do want to fire plain per-row INSERT triggers on the
>          * new table. By not passing in transition_capture we prevent
>          * ....
>          */
>          ExecARInsertTriggers(estate, resultRelInfo, slot, recheckIndexes,
>                               NULL);
>     }
>     else
>     {
>         /* AFTER ROW INSERT Triggers */
>         ExecARInsertTriggers(estate, resultRelInfo, slot, recheckIndexes,
>                              ar_insert_trig_tcs);
>     }

Maybe you meant to use mtstate->mt_transition_capture instead of
ar_insert_trig_tcs in the else block.  We don't need
ar_insert_trig_tcs at all.

> it seems like it'd be quite a bit clearer (although I do think the
> comments also need a fair bit of polishing independent of this proposed
> change).

Fwiw, I agree with your proposed restructuring, although I'd let Amit
Kh chime in as he'd be more familiar with this code.  I wasn't aware
of this partitioning-related bit being present in ExecInsert().

Would you like me to write a patch for some or all items?

Thanks,
Amit

[1] https://www.postgresql.org/message-id/flat/0473bf5c-57b1-f1f7-3d58-455c2230bc5f%40lab.ntt.co.jp


Reply | Threaded
Open this post in threaded view
|

Re: partition routing layering in nodeModifyTable.c

Andres Freund
Hi,

On 2019-07-18 14:24:29 +0900, Amit Langote wrote:

> On Thu, Jul 18, 2019 at 10:09 AM Andres Freund <[hidden email]> wrote:
> > 1) How come partition routing is done outside of ExecInsert()?
> >
> >             case CMD_INSERT:
> >                 /* Prepare for tuple routing if needed. */
> >                 if (proute)
> >                     slot = ExecPrepareTupleRouting(node, estate, proute,
> >                                                    resultRelInfo, slot);
> >                 slot = ExecInsert(node, slot, planSlot,
> >                                   estate, node->canSetTag);
> >                 /* Revert ExecPrepareTupleRouting's state change. */
> >                 if (proute)
> >                     estate->es_result_relation_info = resultRelInfo;
> >                 break;
> >
> > That already seems like a layering violation,
>
> The decision to move partition routing out of ExecInsert() came about
> when we encountered a bug [1] whereby ExecInsert() would fail to reset
> estate->es_result_relation_info back to the root table if it had to
> take an abnormal path out (early return), of which there are quite a
> few instances.  The first solution I came up with was to add a goto
> label  for the code to reset estate->es_result_relation_info and jump
> to it from the various places that do an early return, which was
> complained about as reducing readability.  So, the solution we
> eventually settled on in 6666ee49f was to perform ResultRelInfos
> switching at a higher level.

I think that was the wrong path, given that the code now lives in
multiple places. Without even a comment explaining that if one has to be
changed, the other has to be changed too.


> > It seems to me that if we just moved the ExecPrepareTupleRouting() into
> > ExecInsert(), we could remove the duplication.
>
> I agree that there's duplication here.  Given what I wrote above, I
> can think of doing this: move all of ExecInsert()'s code into
> ExecInsertInternal() and make the former instead look like this:

For me just having the gotos is cleaner than that here.

But perhaps the right fix would be to not have ExecPrepareTupleRouting()
change global state at all, and instead change it much more locally
inside ExecInsert(), around the calls that need it to be set
differently.

Or perhaps the actually correct fix is to remove es_result_relation_info
alltogether, and just pass it down the places that need it - we've a lot
more code setting it than using the value.  And it'd not be hard to
actually pass it to the places that read it.  Given all the
setting/resetting of it it's pretty obvious that a query-global resource
isn't the right place for it.



> >     /*
> >      * triggers behave differently depending on this being a delete as
> >      * part of a partion move, or a deletion proper.
> >     if (mtstate->operation == CMD_UPDATE)
> >     {
> >         /*
> >          * If this insert is the result of a partition key update that moved the
> >          * tuple to a new partition, put this row into the transition NEW TABLE,
> >          * if there is one. We need to do this separately for DELETE and INSERT
> >          * because they happen on different tables.
> >          */
> >         ExecARUpdateTriggers(estate, resultRelInfo, NULL,
> >                              NULL,
> >                              slot,
> >                              NULL,
> >                              mtstate->mt_transition_capture);
> >
> >         /*
> >          * But we do want to fire plain per-row INSERT triggers on the
> >          * new table. By not passing in transition_capture we prevent
> >          * ....
> >          */
> >          ExecARInsertTriggers(estate, resultRelInfo, slot, recheckIndexes,
> >                               NULL);
> >     }
> >     else
> >     {
> >         /* AFTER ROW INSERT Triggers */
> >         ExecARInsertTriggers(estate, resultRelInfo, slot, recheckIndexes,
> >                              ar_insert_trig_tcs);
> >     }
>
> Maybe you meant to use mtstate->mt_transition_capture instead of
> ar_insert_trig_tcs in the else block.  We don't need
> ar_insert_trig_tcs at all.

Yes, it was just a untested example of how the code could be made
clearer.


> > it seems like it'd be quite a bit clearer (although I do think the
> > comments also need a fair bit of polishing independent of this proposed
> > change).
>
> Fwiw, I agree with your proposed restructuring, although I'd let Amit
> Kh chime in as he'd be more familiar with this code.  I wasn't aware
> of this partitioning-related bit being present in ExecInsert().
>
> Would you like me to write a patch for some or all items?

Yes, that would be awesome.

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: partition routing layering in nodeModifyTable.c

Amit Langote
On Thu, Jul 18, 2019 at 2:53 PM Andres Freund <[hidden email]> wrote:

> On 2019-07-18 14:24:29 +0900, Amit Langote wrote:
> > On Thu, Jul 18, 2019 at 10:09 AM Andres Freund <[hidden email]> wrote:
> > > 1) How come partition routing is done outside of ExecInsert()?
> > >
> > >             case CMD_INSERT:
> > >                 /* Prepare for tuple routing if needed. */
> > >                 if (proute)
> > >                     slot = ExecPrepareTupleRouting(node, estate, proute,
> > >                                                    resultRelInfo, slot);
> > >                 slot = ExecInsert(node, slot, planSlot,
> > >                                   estate, node->canSetTag);
> > >                 /* Revert ExecPrepareTupleRouting's state change. */
> > >                 if (proute)
> > >                     estate->es_result_relation_info = resultRelInfo;
> > >                 break;
> > >
> > > That already seems like a layering violation,
> >
> > The decision to move partition routing out of ExecInsert() came about
> > when we encountered a bug [1] whereby ExecInsert() would fail to reset
> > estate->es_result_relation_info back to the root table if it had to
> > take an abnormal path out (early return), of which there are quite a
> > few instances.  The first solution I came up with was to add a goto
> > label  for the code to reset estate->es_result_relation_info and jump
> > to it from the various places that do an early return, which was
> > complained about as reducing readability.  So, the solution we
> > eventually settled on in 6666ee49f was to perform ResultRelInfos
> > switching at a higher level.
>
> I think that was the wrong path, given that the code now lives in
> multiple places. Without even a comment explaining that if one has to be
> changed, the other has to be changed too.
>
>
> > > It seems to me that if we just moved the ExecPrepareTupleRouting() into
> > > ExecInsert(), we could remove the duplication.
> >
> > I agree that there's duplication here.  Given what I wrote above, I
> > can think of doing this: move all of ExecInsert()'s code into
> > ExecInsertInternal() and make the former instead look like this:
>
> For me just having the gotos is cleaner than that here.
>
> But perhaps the right fix would be to not have ExecPrepareTupleRouting()
> change global state at all, and instead change it much more locally
> inside ExecInsert(), around the calls that need it to be set
> differently.
>
> Or perhaps the actually correct fix is to remove es_result_relation_info
> alltogether, and just pass it down the places that need it - we've a lot
> more code setting it than using the value.  And it'd not be hard to
> actually pass it to the places that read it.  Given all the
> setting/resetting of it it's pretty obvious that a query-global resource
> isn't the right place for it.

I tend to agree that managing state through es_result_relation_info
across various operations on a result relation has turned a bit messy
at this point.  That said, while most of the places that access the
currently active result relation from es_result_relation_info can be
easily modified to receive it directly, the FDW API BeginDirectModify
poses bit of a challenge.  BeginDirectlyModify() is called via
ExecInitForeignScan() that in turn can't be changed to add a result
relation (Index or ResultRelInfo *) argument, so the only way left for
BeginDirectlyModify() is to access it via es_result_relation_info.

Maybe we can do to ExecPrepareTupleRouting() what you say -- remove
all code in it that changes ModifyTable-global and EState-global
states.  Also, maybe call ExecPrepareTupleRouting() inside
ExecInsert() at the beginning instead of outside of it.  I agree that
setting and reverting global states around the exact piece of code
that need that to be done is better for clarity.  All of that assuming
you're not saying that we scrap ExecPrepareTupleRouting altogether.

Thoughts?  Other opinions?

> > Would you like me to write a patch for some or all items?
>
> Yes, that would be awesome.

OK, I will try to post a patch soon.

Thanks,
Amit


Reply | Threaded
Open this post in threaded view
|

Re: partition routing layering in nodeModifyTable.c

Etsuro Fujita-2
On Thu, Jul 18, 2019 at 4:51 PM Amit Langote <[hidden email]> wrote:

> On Thu, Jul 18, 2019 at 2:53 PM Andres Freund <[hidden email]> wrote:
> > On 2019-07-18 14:24:29 +0900, Amit Langote wrote:
> > > On Thu, Jul 18, 2019 at 10:09 AM Andres Freund <[hidden email]> wrote:
> > > > 1) How come partition routing is done outside of ExecInsert()?
> > > >
> > > >             case CMD_INSERT:
> > > >                 /* Prepare for tuple routing if needed. */
> > > >                 if (proute)
> > > >                     slot = ExecPrepareTupleRouting(node, estate, proute,
> > > >                                                    resultRelInfo, slot);
> > > >                 slot = ExecInsert(node, slot, planSlot,
> > > >                                   estate, node->canSetTag);
> > > >                 /* Revert ExecPrepareTupleRouting's state change. */
> > > >                 if (proute)
> > > >                     estate->es_result_relation_info = resultRelInfo;
> > > >                 break;
> > > >
> > > > That already seems like a layering violation,
> > >
> > > The decision to move partition routing out of ExecInsert() came about
> > > when we encountered a bug [1] whereby ExecInsert() would fail to reset
> > > estate->es_result_relation_info back to the root table if it had to
> > > take an abnormal path out (early return), of which there are quite a
> > > few instances.  The first solution I came up with was to add a goto
> > > label  for the code to reset estate->es_result_relation_info and jump
> > > to it from the various places that do an early return, which was
> > > complained about as reducing readability.  So, the solution we
> > > eventually settled on in 6666ee49f was to perform ResultRelInfos
> > > switching at a higher level.
> >
> > I think that was the wrong path, given that the code now lives in
> > multiple places. Without even a comment explaining that if one has to be
> > changed, the other has to be changed too.

I thought this would be OK because we have the
ExecPrepareTupleRouting() call in just two places in a single source
file, at least currently.

> > Or perhaps the actually correct fix is to remove es_result_relation_info
> > alltogether, and just pass it down the places that need it - we've a lot
> > more code setting it than using the value.  And it'd not be hard to
> > actually pass it to the places that read it.  Given all the
> > setting/resetting of it it's pretty obvious that a query-global resource
> > isn't the right place for it.
>
> I tend to agree that managing state through es_result_relation_info
> across various operations on a result relation has turned a bit messy
> at this point.  That said, while most of the places that access the
> currently active result relation from es_result_relation_info can be
> easily modified to receive it directly, the FDW API BeginDirectModify
> poses bit of a challenge.  BeginDirectlyModify() is called via
> ExecInitForeignScan() that in turn can't be changed to add a result
> relation (Index or ResultRelInfo *) argument, so the only way left for
> BeginDirectlyModify() is to access it via es_result_relation_info.

That's right.  I'm not sure that's a good idea, because I think other
extensions also might look at es_result_relation_info, and if so,
removing es_result_relation_info altogether would require the
extension authors to update their extensions without any benefit,
which I think isn't a good thing.

Best regards,
Etsuro Fujita


Reply | Threaded
Open this post in threaded view
|

Re: partition routing layering in nodeModifyTable.c

Amit Langote
In reply to this post by Amit Langote
On Thu, Jul 18, 2019 at 4:50 PM Amit Langote <[hidden email]> wrote:

> On Thu, Jul 18, 2019 at 2:53 PM Andres Freund <[hidden email]> wrote:
> > On 2019-07-18 14:24:29 +0900, Amit Langote wrote:
> > > On Thu, Jul 18, 2019 at 10:09 AM Andres Freund <[hidden email]> wrote:
> > Or perhaps the actually correct fix is to remove es_result_relation_info
> > alltogether, and just pass it down the places that need it - we've a lot
> > more code setting it than using the value.  And it'd not be hard to
> > actually pass it to the places that read it.  Given all the
> > setting/resetting of it it's pretty obvious that a query-global resource
> > isn't the right place for it.
>>
> > > Would you like me to write a patch for some or all items?
> >
> > Yes, that would be awesome.
>
> OK, I will try to post a patch soon.
Attached are two patches.

The first one (0001) deals with reducing the core executor's reliance
on es_result_relation_info to access the currently active result
relation, in favor of receiving it from the caller as a function
argument.  So no piece of core code relies on it being correctly set
anymore.  It still needs to be set correctly for the third-party code
such as FDWs.  Also, because the partition routing related suggestions
upthread are closely tied into this, especially those around
ExecInsert(), I've included them in the same patch.  I chose to keep
the function ExecPrepareTupleRouting, even though it's now only called
from ExecInsert(), to preserve the readability of the latter.

The second patch (0002) implements some rearrangement of the UPDATE
tuple movement code, addressing the point 2 of in the first email of
this thread.  Mainly the block of code in ExecUpdate() that implements
row movement proper has been moved in a function called ExecMove().
It also contains the cosmetic improvements suggested in point 4.

Thanks,
Amit

v1-0001-Reduce-es_result_relation_info-usage.patch (49K) Download Attachment
v1-0002-Rearrange-partition-update-row-movement-code-a-bi.patch (20K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: partition routing layering in nodeModifyTable.c

Andres Freund
Hi,

On 2019-07-19 17:52:20 +0900, Amit Langote wrote:
> Attached are two patches.

Awesome.


> The first one (0001) deals with reducing the core executor's reliance
> on es_result_relation_info to access the currently active result
> relation, in favor of receiving it from the caller as a function
> argument.  So no piece of core code relies on it being correctly set
> anymore.  It still needs to be set correctly for the third-party code
> such as FDWs.

I'm inclined to just remove it. There's not much code out there relying
on it, as far as I can tell. Most FDWs don't support the direct modify
API, and that's afaict the case where we one needs to use
es_result_relation_info?

In fact, I searched through alllFDWs listed on https://wiki.postgresql.org/wiki/Foreign_data_wrappers
that are on github and in first few categories (up and including to
"file wrappers"), and there was only one reference to
es_result_relation_info, and that just in comments in a test:
https://github.com/pgspider/griddb_fdw/search?utf8=%E2%9C%93&q=es_result_relation_info&type=
which I think was just copied from our source code.

IOW, we should just change the direct modify calls to get the relevant
ResultRelationInfo or something in that vein (perhaps just the relevant
RT index?).

pglogical also references it, but just because it creates its own
EState afaict.



> @@ -334,32 +335,50 @@ ExecComputeStoredGenerated(EState *estate, TupleTableSlot *slot)
>   * ExecInsert
>   *
>   * For INSERT, we have to insert the tuple into the target relation
> - * and insert appropriate tuples into the index relations.
> + * (or partition thereof) and insert appropriate tuples into the index
> + * relations.
>   *
>   * Returns RETURNING result if any, otherwise NULL.
> + *
> + * This may change the currently active tuple conversion map in
> + * mtstate->mt_transition_capture, so the callers must take care to
> + * save the previous value to avoid losing track of it.
>   * ----------------------------------------------------------------
>   */
>  static TupleTableSlot *
>  ExecInsert(ModifyTableState *mtstate,
> +   ResultRelInfo *resultRelInfo,
>     TupleTableSlot *slot,
>     TupleTableSlot *planSlot,
>     EState *estate,
>     bool canSetTag)
>  {
> - ResultRelInfo *resultRelInfo;
>   Relation resultRelationDesc;
>   List   *recheckIndexes = NIL;
>   TupleTableSlot *result = NULL;
>   TransitionCaptureState *ar_insert_trig_tcs;
>   ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
>   OnConflictAction onconflict = node->onConflictAction;
> + PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
> +
> + /*
> + * If the input result relation is a partitioned table, find the leaf
> + * partition to insert the tuple into.
> + */
> + if (proute)
> + {
> + ResultRelInfo *partRelInfo;
> +
> + slot = ExecPrepareTupleRouting(mtstate, estate, proute,
> +   resultRelInfo, slot,
> +   &partRelInfo);
> + resultRelInfo = partRelInfo;
> + /* Result relation has changed, so update EState reference too. */
> + estate->es_result_relation_info = resultRelInfo;
> + }

I think by removing es_result_relation entirely, this would look
cleaner.


> @@ -1271,18 +1274,18 @@ lreplace:;
>   mtstate->mt_root_tuple_slot);
>  
>   /*
> - * Prepare for tuple routing, making it look like we're inserting
> - * into the root.
> + * ExecInsert() may scribble on mtstate->mt_transition_capture,
> + * so save the currently active map.
>   */
> + if (mtstate->mt_transition_capture)
> + saved_tcs_map = mtstate->mt_transition_capture->tcs_map;

Wonder if we could remove the need for this somehow, it's still pretty
darn ugly.  Thomas, perhaps you have some insights?

To me the need to modify these ModifyTable wide state on a per-subplan
and even per-partition basis indicates that the datastructures are in
the wrong place.



> @@ -2212,23 +2207,17 @@ ExecModifyTable(PlanState *pstate)
>   switch (operation)
>   {
>   case CMD_INSERT:
> - /* Prepare for tuple routing if needed. */
> - if (proute)
> - slot = ExecPrepareTupleRouting(node, estate, proute,
> -   resultRelInfo, slot);
> - slot = ExecInsert(node, slot, planSlot,
> + slot = ExecInsert(node, resultRelInfo, slot, planSlot,
>    estate, node->canSetTag);
> - /* Revert ExecPrepareTupleRouting's state change. */
> - if (proute)
> - estate->es_result_relation_info = resultRelInfo;
>   break;
>   case CMD_UPDATE:
> - slot = ExecUpdate(node, tupleid, oldtuple, slot, planSlot,
> -  &node->mt_epqstate, estate, node->canSetTag);
> + slot = ExecUpdate(node, resultRelInfo, tupleid, oldtuple, slot,
> +  planSlot, &node->mt_epqstate, estate,
> +  node->canSetTag);
>   break;
>   case CMD_DELETE:
> - slot = ExecDelete(node, tupleid, oldtuple, planSlot,
> -  &node->mt_epqstate, estate,
> + slot = ExecDelete(node, resultRelInfo, tupleid, oldtuple,
> +  planSlot, &node->mt_epqstate, estate,
>    true, node->canSetTag,
>    false /* changingPart */ , NULL, NULL);
>   break;

This reminds me of another complaint: ExecDelete and ExecInsert() have
gotten more boolean parameters for partition moving, but only one of
them is explained with a comment (/* changingPart */) - think we should
do that for all.



> diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
> index 43edfef089..7df3e78b22 100644
> --- a/src/backend/replication/logical/worker.c
> +++ b/src/backend/replication/logical/worker.c
> @@ -173,10 +173,10 @@ ensure_transaction(void)
>   * This is based on similar code in copy.c
>   */
>  static EState *
> -create_estate_for_relation(LogicalRepRelMapEntry *rel)
> +create_estate_for_relation(LogicalRepRelMapEntry *rel,
> +   ResultRelInfo **resultRelInfo)
>  {
>   EState   *estate;
> - ResultRelInfo *resultRelInfo;
>   RangeTblEntry *rte;
>  
>   estate = CreateExecutorState();
> @@ -188,12 +188,11 @@ create_estate_for_relation(LogicalRepRelMapEntry *rel)
>   rte->rellockmode = AccessShareLock;
>   ExecInitRangeTable(estate, list_make1(rte));
>  
> - resultRelInfo = makeNode(ResultRelInfo);
> - InitResultRelInfo(resultRelInfo, rel->localrel, 1, NULL, 0);
> + *resultRelInfo = makeNode(ResultRelInfo);
> + InitResultRelInfo(*resultRelInfo, rel->localrel, 1, NULL, 0);
>  
> - estate->es_result_relations = resultRelInfo;
> + estate->es_result_relations = *resultRelInfo;
>   estate->es_num_result_relations = 1;
> - estate->es_result_relation_info = resultRelInfo;
>  
>   estate->es_output_cid = GetCurrentCommandId(true);
>  
> @@ -567,6 +566,7 @@ GetRelationIdentityOrPK(Relation rel)
>  static void
>  apply_handle_insert(StringInfo s)
>  {
> + ResultRelInfo *resultRelInfo;
>   LogicalRepRelMapEntry *rel;
>   LogicalRepTupleData newtup;
>   LogicalRepRelId relid;
> @@ -589,7 +589,7 @@ apply_handle_insert(StringInfo s)
>   }
>  
>   /* Initialize the executor state. */
> - estate = create_estate_for_relation(rel);
> + estate = create_estate_for_relation(rel, &resultRelInfo);

Hm. It kinda seems cleaner if we were to instead return the relevant
index, rather than the entire ResultRelInfo, as an output from
create_estate_for_relation().  Makes it clearer that it's still in the
EState.

Or perhaps we ought to compute it in a separate step? Then that'd be
more amenable to support replcating into partition roots.


>   /*
> - * If this insert is the result of a partition key update that moved the
> - * tuple to a new partition, put this row into the transition NEW TABLE,
> - * if there is one. We need to do this separately for DELETE and INSERT
> - * because they happen on different tables.
> + * If this delete is a part of a partition key update, put this row into
> + * the UPDATE trigger's NEW TABLE instead of that of an INSERT trigger.
>   */
> - ar_insert_trig_tcs = mtstate->mt_transition_capture;
> - if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
> - && mtstate->mt_transition_capture->tcs_update_new_table)
> + if (mtstate->operation == CMD_UPDATE &&
> + mtstate->mt_transition_capture &&
> + mtstate->mt_transition_capture->tcs_update_new_table)
>   {
> - ExecARUpdateTriggers(estate, resultRelInfo, NULL,
> - NULL,
> - slot,
> - NULL,
> - mtstate->mt_transition_capture);
> + ExecARUpdateTriggers(estate, resultRelInfo, NULL, NULL, slot,
> + NIL, mtstate->mt_transition_capture);
>  
>   /*
> - * We've already captured the NEW TABLE row, so make sure any AR
> - * INSERT trigger fired below doesn't capture it again.
> + * Execute AFTER ROW INSERT Triggers, but such that the row is not
> + * captured again in the transition table if any.
>   */
> - ar_insert_trig_tcs = NULL;
> + ExecARInsertTriggers(estate, resultRelInfo, slot, recheckIndexes,
> + NULL);
> + }
> + else
> + {
> + /* AFTER ROW INSERT Triggers */
> + ExecARInsertTriggers(estate, resultRelInfo, slot, recheckIndexes,
> + mtstate->mt_transition_capture);
>   }
> -
> - /* AFTER ROW INSERT Triggers */
> - ExecARInsertTriggers(estate, resultRelInfo, slot, recheckIndexes,
> - ar_insert_trig_tcs);

While a tiny bit more code, perhaps, this is considerably clearer
imo. Thanks.


> +/*
> + * ExecMove
> + * Move an updated tuple from the input result relation to the
> + * new partition of its root parent table
> + *
> + * This works by first deleting the tuple from the input result relation
> + * followed by inserting it into the root parent table, that is,
> + * mtstate->rootResultRelInfo.
> + *
> + * Returns true if it's detected that the tuple we're trying to move has
> + * been concurrently updated.
> + */
> +static bool
> +ExecMove(ModifyTableState *mtstate, ResultRelInfo *resultRelInfo,
> + ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *planSlot,
> + EPQState *epqstate, bool canSetTag, TupleTableSlot **slot,
> + TupleTableSlot **inserted_tuple)
> +{

I know that it was one of the names I proposed, but now that I'm
thinking about it again, it sounds too generic. Perhaps
ExecCrossPartitionUpdate() wouldn't be a quite so generic name? Since
there's only one reference the longer name wouldn't be painful.


> + /*
> + * Row movement, part 1.  Delete the tuple, but skip RETURNING
> + * processing. We want to return rows from INSERT.
> + */
> + ExecDelete(mtstate, resultRelInfo, tupleid, oldtuple, planSlot,
> +   epqstate, estate, false, false /* canSetTag */ ,
> +   true /* changingPart */ , &tuple_deleted, &epqslot);

Here again it'd be nice if all the booleans would be explained with a
comment.


Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: partition routing layering in nodeModifyTable.c

Alvaro Herrera-9
On 2019-Jul-19, Andres Freund wrote:

> On 2019-07-19 17:52:20 +0900, Amit Langote wrote:
> > The first one (0001) deals with reducing the core executor's reliance
> > on es_result_relation_info to access the currently active result
> > relation, in favor of receiving it from the caller as a function
> > argument.  So no piece of core code relies on it being correctly set
> > anymore.  It still needs to be set correctly for the third-party code
> > such as FDWs.
>
> I'm inclined to just remove it. There's not much code out there relying
> on it, as far as I can tell. Most FDWs don't support the direct modify
> API, and that's afaict the case where we one needs to use
> es_result_relation_info?

Yeah, I too agree with removing it; since our code doesn't use it, it
seems very likely that it will become slightly out of sync with reality
and we'd not notice until some FDW misbehaves weirdly.

> > - slot = ExecDelete(node, tupleid, oldtuple, planSlot,
> > -  &node->mt_epqstate, estate,
> > + slot = ExecDelete(node, resultRelInfo, tupleid, oldtuple,
> > +  planSlot, &node->mt_epqstate, estate,
> >    true, node->canSetTag,
> >    false /* changingPart */ , NULL, NULL);
> >   break;
>
> This reminds me of another complaint: ExecDelete and ExecInsert() have
> gotten more boolean parameters for partition moving, but only one of
> them is explained with a comment (/* changingPart */) - think we should
> do that for all.

Maybe change the API to use a flags bitmask?

(IMO the placement of the comment inside the function call, making the
comma appear preceded with a space, looks ugly.  If we want to add
comments, let's put each param on its own line with the comment beyond
the comma.  That's what we do in other places where this pattern is
used.)

> >   /* Initialize the executor state. */
> > - estate = create_estate_for_relation(rel);
> > + estate = create_estate_for_relation(rel, &resultRelInfo);
>
> Hm. It kinda seems cleaner if we were to instead return the relevant
> index, rather than the entire ResultRelInfo, as an output from
> create_estate_for_relation().  Makes it clearer that it's still in the
> EState.

Yeah.

> Or perhaps we ought to compute it in a separate step? Then that'd be
> more amenable to support replcating into partition roots.

I'm not quite seeing the shape that you're imagining this would take.
I vote not to mess with that for this patch; I bet that we'll have to
change a few other things in this code when we add better support for
partitioning in logical replication.

> > +/*
> > + * ExecMove
> > + * Move an updated tuple from the input result relation to the
> > + * new partition of its root parent table
> > + *
> > + * This works by first deleting the tuple from the input result relation
> > + * followed by inserting it into the root parent table, that is,
> > + * mtstate->rootResultRelInfo.
> > + *
> > + * Returns true if it's detected that the tuple we're trying to move has
> > + * been concurrently updated.
> > + */
> > +static bool
> > +ExecMove(ModifyTableState *mtstate, ResultRelInfo *resultRelInfo,
> > + ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *planSlot,
> > + EPQState *epqstate, bool canSetTag, TupleTableSlot **slot,
> > + TupleTableSlot **inserted_tuple)
> > +{
>
> I know that it was one of the names I proposed, but now that I'm
> thinking about it again, it sounds too generic. Perhaps
> ExecCrossPartitionUpdate() wouldn't be a quite so generic name? Since
> there's only one reference the longer name wouldn't be painful.

That name sounds good.  Isn't the return convention backwards?  Sounds
like "true" should mean that it succeeded.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: partition routing layering in nodeModifyTable.c

Andres Freund
Hi,

On 2019-07-19 17:11:10 -0400, Alvaro Herrera wrote:

> On 2019-Jul-19, Andres Freund wrote:
> > > - slot = ExecDelete(node, tupleid, oldtuple, planSlot,
> > > -  &node->mt_epqstate, estate,
> > > + slot = ExecDelete(node, resultRelInfo, tupleid, oldtuple,
> > > +  planSlot, &node->mt_epqstate, estate,
> > >    true, node->canSetTag,
> > >    false /* changingPart */ , NULL, NULL);
> > >   break;
> >
> > This reminds me of another complaint: ExecDelete and ExecInsert() have
> > gotten more boolean parameters for partition moving, but only one of
> > them is explained with a comment (/* changingPart */) - think we should
> > do that for all.
>
> Maybe change the API to use a flags bitmask?
>
> (IMO the placement of the comment inside the function call, making the
> comma appear preceded with a space, looks ugly.  If we want to add
> comments, let's put each param on its own line with the comment beyond
> the comma.  That's what we do in other places where this pattern is
> used.)

Well, that's the pre-existing style, so I'd just have gone with
that. I'm not sure I buy there's much point in going for a bitmask, as
this is file-private code, not code where changing the signature
requires modifying multiple files.


> > >   /* Initialize the executor state. */
> > > - estate = create_estate_for_relation(rel);
> > > + estate = create_estate_for_relation(rel, &resultRelInfo);
> >
> > Hm. It kinda seems cleaner if we were to instead return the relevant
> > index, rather than the entire ResultRelInfo, as an output from
> > create_estate_for_relation().  Makes it clearer that it's still in the
> > EState.
>
> Yeah.
>
> > Or perhaps we ought to compute it in a separate step? Then that'd be
> > more amenable to support replcating into partition roots.
>
> I'm not quite seeing the shape that you're imagining this would take.
> I vote not to mess with that for this patch; I bet that we'll have to
> change a few other things in this code when we add better support for
> partitioning in logical replication.

Yea, I think it's fine to do that separately.  If we wanted to support
replication roots as replication targets, we'd obviously need to do
something pretty similar to what ExecInsert()/ExecUpdate() already
do. And there we can't just reference an index in EState, as partition
children aren't in there.

I kind of was wondering if we were to have a separate function for
getting the ResultRelInfo targeted, we'd be able to just extend that
function to support replication.  But now that I think about it a bit
more, that's so much just scratching the surface...

We really ought to have the replication "sink" code share more code with
nodeModifyTable.c.

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: partition routing layering in nodeModifyTable.c

Amit Langote
In reply to this post by Andres Freund
Hi Andres,

Sorry about the delay in replying as I was on vacation for the last few days.

On Sat, Jul 20, 2019 at 1:52 AM Andres Freund <[hidden email]> wrote:

> > The first one (0001) deals with reducing the core executor's reliance
> > on es_result_relation_info to access the currently active result
> > relation, in favor of receiving it from the caller as a function
> > argument.  So no piece of core code relies on it being correctly set
> > anymore.  It still needs to be set correctly for the third-party code
> > such as FDWs.
>
> I'm inclined to just remove it. There's not much code out there relying
> on it, as far as I can tell. Most FDWs don't support the direct modify
> API, and that's afaict the case where we one needs to use
> es_result_relation_info?
Right, only the directly modify API uses it.

> In fact, I searched through alllFDWs listed on https://wiki.postgresql.org/wiki/Foreign_data_wrappers
> that are on github and in first few categories (up and including to
> "file wrappers"), and there was only one reference to
> es_result_relation_info, and that just in comments in a test:
> https://github.com/pgspider/griddb_fdw/search?utf8=%E2%9C%93&q=es_result_relation_info&type=
> which I think was just copied from our source code.
>
> IOW, we should just change the direct modify calls to get the relevant
> ResultRelationInfo or something in that vein (perhaps just the relevant
> RT index?).
It seems easy to make one of the two functions that constitute the
direct modify API, IterateDirectModify(), access the result relation
from ForeignScanState by saving either the result relation RT index or
ResultRelInfo pointer itself into the ForeignScanState's FDW-private
area.  For example, for postgres_fdw, one would simply add a new
member to PgFdwDirectModifyState struct.

Doing that for the other function BeginDirectModify() seems a bit more
involved.  We could add a new field to ForeignScan, say
resultRelation, that's set by either PlanDirectModify() (the FDW code)
or make_modifytable() (the core code) if the ForeignScan node contains
the command for direct modification.  BeginDirectModify() can then use
that value instead of relying on es_result_relation_info being set.

Thoughts?  Fujita-san, do you have any opinion on whether that would
be a good idea?

> pglogical also references it, but just because it creates its own
> EState afaict.

That sounds easily manageable.

> > @@ -334,32 +335,50 @@ ExecComputeStoredGenerated(EState *estate, TupleTableSlot *slot)
> >   *           ExecInsert
> >   *
> >   *           For INSERT, we have to insert the tuple into the target relation
> > - *           and insert appropriate tuples into the index relations.
> > + *           (or partition thereof) and insert appropriate tuples into the index
> > + *           relations.
> >   *
> >   *           Returns RETURNING result if any, otherwise NULL.
> > + *
> > + *           This may change the currently active tuple conversion map in
> > + *           mtstate->mt_transition_capture, so the callers must take care to
> > + *           save the previous value to avoid losing track of it.
> >   * ----------------------------------------------------------------
> >   */
> >  static TupleTableSlot *
> >  ExecInsert(ModifyTableState *mtstate,
> > +                ResultRelInfo *resultRelInfo,
> >                  TupleTableSlot *slot,
> >                  TupleTableSlot *planSlot,
> >                  EState *estate,
> >                  bool canSetTag)
> >  {
> > -     ResultRelInfo *resultRelInfo;
> >       Relation        resultRelationDesc;
> >       List       *recheckIndexes = NIL;
> >       TupleTableSlot *result = NULL;
> >       TransitionCaptureState *ar_insert_trig_tcs;
> >       ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
> >       OnConflictAction onconflict = node->onConflictAction;
> > +     PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
> > +
> > +     /*
> > +      * If the input result relation is a partitioned table, find the leaf
> > +      * partition to insert the tuple into.
> > +      */
> > +     if (proute)
> > +     {
> > +             ResultRelInfo *partRelInfo;
> > +
> > +             slot = ExecPrepareTupleRouting(mtstate, estate, proute,
> > +                                                                        resultRelInfo, slot,
> > +                                                                        &partRelInfo);
> > +             resultRelInfo = partRelInfo;
> > +             /* Result relation has changed, so update EState reference too. */
> > +             estate->es_result_relation_info = resultRelInfo;
> > +     }
>
> I think by removing es_result_relation entirely, this would look
> cleaner.
I agree.  Maybe, setting es_result_relation_info here isn't really
needed, because the ResultRelInfo is directly passed through
ExecForeignInsert.  Still, some FDWs may be relying on
es_result_relation_info being correctly set despite the
aforementioned.  Again, the only way to get them to stop doing so may
be to remove it.


> > @@ -1271,18 +1274,18 @@ lreplace:;
> >                                                                                        mtstate->mt_root_tuple_slot);
> >
> >                       /*
> > -                      * Prepare for tuple routing, making it look like we're inserting
> > -                      * into the root.
> > +                      * ExecInsert() may scribble on mtstate->mt_transition_capture,
> > +                      * so save the currently active map.
> >                        */
> > +                     if (mtstate->mt_transition_capture)
> > +                             saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
>
> Wonder if we could remove the need for this somehow, it's still pretty
> darn ugly.  Thomas, perhaps you have some insights?
>
> To me the need to modify these ModifyTable wide state on a per-subplan
> and even per-partition basis indicates that the datastructures are in
> the wrong place.
I agree that having to ensure tcs_map is set correctly is cumbersome,
because it has to be reset every time the currently active result
relation changes.  I think a better place for the map to be is
ResultRelInfo itself.  The trigger code can just get the correct map
from the ResultRelInfo of the result relation it's processing.

Regarding that idea, the necessary map is already present in the
tuple-routing state struct that's embedded in the partition's
ResultRelInfo.  But the UPDATE result relations that are never
processed as tuple routing targets don't have routing info initialized
(also think non-partition inheritance children), so we could add
another TupleConversionMap * field in ResultRelInfo.  Attached patch
0003 implements that.

With this change, we no longer need to track the map in a global
variable, that is, TransitionCaptureState no longer needs tcs_map.  We
still have tcs_original_insert_tuple though, which must be set during
ExecInsert and reset after it's read by AfterTriggerSaveEvent.  I have
moved the resetting of its value to right after where the originally
set value is read to make it clear that the value must be read only
once.

> > @@ -2212,23 +2207,17 @@ ExecModifyTable(PlanState *pstate)
> >               switch (operation)
> >               {
> >                       case CMD_INSERT:
> > -                             /* Prepare for tuple routing if needed. */
> > -                             if (proute)
> > -                                     slot = ExecPrepareTupleRouting(node, estate, proute,
> > -                                                                                                resultRelInfo, slot);
> > -                             slot = ExecInsert(node, slot, planSlot,
> > +                             slot = ExecInsert(node, resultRelInfo, slot, planSlot,
> >                                                                 estate, node->canSetTag);
> > -                             /* Revert ExecPrepareTupleRouting's state change. */
> > -                             if (proute)
> > -                                     estate->es_result_relation_info = resultRelInfo;
> >                               break;
> >                       case CMD_UPDATE:
> > -                             slot = ExecUpdate(node, tupleid, oldtuple, slot, planSlot,
> > -                                                               &node->mt_epqstate, estate, node->canSetTag);
> > +                             slot = ExecUpdate(node, resultRelInfo, tupleid, oldtuple, slot,
> > +                                                               planSlot, &node->mt_epqstate, estate,
> > +                                                               node->canSetTag);
> >                               break;
> >                       case CMD_DELETE:
> > -                             slot = ExecDelete(node, tupleid, oldtuple, planSlot,
> > -                                                               &node->mt_epqstate, estate,
> > +                             slot = ExecDelete(node, resultRelInfo, tupleid, oldtuple,
> > +                                                               planSlot, &node->mt_epqstate, estate,
> >                                                                 true, node->canSetTag,
> >                                                                 false /* changingPart */ , NULL, NULL);
> >                               break;
>
> This reminds me of another complaint: ExecDelete and ExecInsert() have
> gotten more boolean parameters for partition moving, but only one of
> them is explained with a comment (/* changingPart */) - think we should
> do that for all.
Agree about the confusing state of ExecDelete call sites.  I've
reformatted the calls to properly label the arguments (the changes are
contained in the revised 0001).  I don't see many
partitioning-specific boolean parameters in ExecInsert though.

> > diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
> > index 43edfef089..7df3e78b22 100644
> > --- a/src/backend/replication/logical/worker.c
> > +++ b/src/backend/replication/logical/worker.c
> > @@ -173,10 +173,10 @@ ensure_transaction(void)
> >   * This is based on similar code in copy.c
> >   */
> >  static EState *
> > -create_estate_for_relation(LogicalRepRelMapEntry *rel)
> > +create_estate_for_relation(LogicalRepRelMapEntry *rel,
> > +                                                ResultRelInfo **resultRelInfo)
> >  {
> >       EState     *estate;
> > -     ResultRelInfo *resultRelInfo;
> >       RangeTblEntry *rte;
> >
> >       estate = CreateExecutorState();
> > @@ -188,12 +188,11 @@ create_estate_for_relation(LogicalRepRelMapEntry *rel)
> >       rte->rellockmode = AccessShareLock;
> >       ExecInitRangeTable(estate, list_make1(rte));
> >
> > -     resultRelInfo = makeNode(ResultRelInfo);
> > -     InitResultRelInfo(resultRelInfo, rel->localrel, 1, NULL, 0);
> > +     *resultRelInfo = makeNode(ResultRelInfo);
> > +     InitResultRelInfo(*resultRelInfo, rel->localrel, 1, NULL, 0);
> >
> > -     estate->es_result_relations = resultRelInfo;
> > +     estate->es_result_relations = *resultRelInfo;
> >       estate->es_num_result_relations = 1;
> > -     estate->es_result_relation_info = resultRelInfo;
> >
> >       estate->es_output_cid = GetCurrentCommandId(true);
> >
> > @@ -567,6 +566,7 @@ GetRelationIdentityOrPK(Relation rel)
> >  static void
> >  apply_handle_insert(StringInfo s)
> >  {
> > +     ResultRelInfo *resultRelInfo;
> >       LogicalRepRelMapEntry *rel;
> >       LogicalRepTupleData newtup;
> >       LogicalRepRelId relid;
> > @@ -589,7 +589,7 @@ apply_handle_insert(StringInfo s)
> >       }
> >
> >       /* Initialize the executor state. */
> > -     estate = create_estate_for_relation(rel);
> > +     estate = create_estate_for_relation(rel, &resultRelInfo);
>
> Hm. It kinda seems cleaner if we were to instead return the relevant
> index, rather than the entire ResultRelInfo, as an output from
> create_estate_for_relation().  Makes it clearer that it's still in the
> EState.
For now, I've reverted these changes in favor of just doing this:

     /* Initialize the executor state. */
     estate = create_estate_for_relation(rel);
+    resultRelInfo = &estate->es_result_relations[0];

This seems OK as we know for sure that there is only one target relation.

> Or perhaps we ought to compute it in a separate step? Then that'd be
> more amenable to support replcating into partition roots.

If we think of create_estate_for_relation() being like InitPlan(),
then perhaps it makes sense to leave it as is.  Any setup needed for
replicating into partition roots will have to be in a separate
function anyway.

> > +/*
> > + *   ExecMove
> > + *           Move an updated tuple from the input result relation to the
> > + *           new partition of its root parent table
> > + *
> > + *   This works by first deleting the tuple from the input result relation
> > + *   followed by inserting it into the root parent table, that is,
> > + *   mtstate->rootResultRelInfo.
> > + *
> > + *   Returns true if it's detected that the tuple we're trying to move has
> > + *   been concurrently updated.
> > + */
> > +static bool
> > +ExecMove(ModifyTableState *mtstate, ResultRelInfo *resultRelInfo,
> > +              ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *planSlot,
> > +              EPQState *epqstate, bool canSetTag, TupleTableSlot **slot,
> > +              TupleTableSlot **inserted_tuple)
> > +{
>
> I know that it was one of the names I proposed, but now that I'm
> thinking about it again, it sounds too generic. Perhaps
> ExecCrossPartitionUpdate() wouldn't be a quite so generic name? Since
> there's only one reference the longer name wouldn't be painful.
OK, I've renamed ExecMove to ExecCrossPartitionUpdate.

> > +     /*
> > +      * Row movement, part 1.  Delete the tuple, but skip RETURNING
> > +      * processing. We want to return rows from INSERT.
> > +      */
> > +     ExecDelete(mtstate, resultRelInfo, tupleid, oldtuple, planSlot,
> > +                        epqstate, estate, false, false /* canSetTag */ ,
> > +                        true /* changingPart */ , &tuple_deleted, &epqslot);
>
> Here again it'd be nice if all the booleans would be explained with a
> comment.
Done too.

Attached updated 0001, 0002, and the new 0003 for transition tuple
conversion map related refactoring as explained above.

Thanks,
Amit

v2-0003-Refactor-transition-tuple-capture-code-a-bit.patch (26K) Download Attachment
v2-0002-Rearrange-partition-update-row-movement-code-a-bi.patch (21K) Download Attachment
v2-0001-Reduce-es_result_relation_info-usage.patch (48K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: partition routing layering in nodeModifyTable.c

Amit Langote
On Tue, Jul 30, 2019 at 4:20 PM Amit Langote <[hidden email]> wrote:

> On Sat, Jul 20, 2019 at 1:52 AM Andres Freund <[hidden email]> wrote:
> > > The first one (0001) deals with reducing the core executor's reliance
> > > on es_result_relation_info to access the currently active result
> > > relation, in favor of receiving it from the caller as a function
> > > argument.  So no piece of core code relies on it being correctly set
> > > anymore.  It still needs to be set correctly for the third-party code
> > > such as FDWs.
> >
> > I'm inclined to just remove it. There's not much code out there relying
> > on it, as far as I can tell. Most FDWs don't support the direct modify
> > API, and that's afaict the case where we one needs to use
> > es_result_relation_info?
>
> Right, only the directly modify API uses it.
>
> > In fact, I searched through alllFDWs listed on https://wiki.postgresql.org/wiki/Foreign_data_wrappers
> > that are on github and in first few categories (up and including to
> > "file wrappers"), and there was only one reference to
> > es_result_relation_info, and that just in comments in a test:
> > https://github.com/pgspider/griddb_fdw/search?utf8=%E2%9C%93&q=es_result_relation_info&type=
> > which I think was just copied from our source code.
> >
> > IOW, we should just change the direct modify calls to get the relevant
> > ResultRelationInfo or something in that vein (perhaps just the relevant
> > RT index?).
>
> It seems easy to make one of the two functions that constitute the
> direct modify API, IterateDirectModify(), access the result relation
> from ForeignScanState by saving either the result relation RT index or
> ResultRelInfo pointer itself into the ForeignScanState's FDW-private
> area.  For example, for postgres_fdw, one would simply add a new
> member to PgFdwDirectModifyState struct.
>
> Doing that for the other function BeginDirectModify() seems a bit more
> involved.  We could add a new field to ForeignScan, say
> resultRelation, that's set by either PlanDirectModify() (the FDW code)
> or make_modifytable() (the core code) if the ForeignScan node contains
> the command for direct modification.  BeginDirectModify() can then use
> that value instead of relying on es_result_relation_info being set.
>
> Thoughts?  Fujita-san, do you have any opinion on whether that would
> be a good idea?
I looked into trying to do the things I mentioned above and it seems
to me that revising BeginDirectModify()'s API to receive the
ResultRelInfo directly as Andres suggested might be the best way
forward.  I've implemented that in the attached 0001.  Patches that
were previously 0001, 0002, and 0003 are now 0002, 003, and 0004,
respectively.  0002 is now a patch to "remove"
es_result_relation_info.

Thanks,
Amit

v3-0001-Revise-BeginDirectModify-API-to-pass-ResultRelInf.patch (10K) Download Attachment
v3-0002-Remove-es_result_relation_info.patch (49K) Download Attachment
v3-0004-Refactor-transition-tuple-capture-code-a-bit.patch (26K) Download Attachment
v3-0003-Rearrange-partition-update-row-movement-code-a-bi.patch (21K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: partition routing layering in nodeModifyTable.c

Etsuro Fujita-2
On Wed, Jul 31, 2019 at 5:05 PM Amit Langote <[hidden email]> wrote:

> On Tue, Jul 30, 2019 at 4:20 PM Amit Langote <[hidden email]> wrote:
> > On Sat, Jul 20, 2019 at 1:52 AM Andres Freund <[hidden email]> wrote:
> > > IOW, we should just change the direct modify calls to get the relevant
> > > ResultRelationInfo or something in that vein (perhaps just the relevant
> > > RT index?).
> >
> > It seems easy to make one of the two functions that constitute the
> > direct modify API, IterateDirectModify(), access the result relation
> > from ForeignScanState by saving either the result relation RT index or
> > ResultRelInfo pointer itself into the ForeignScanState's FDW-private
> > area.  For example, for postgres_fdw, one would simply add a new
> > member to PgFdwDirectModifyState struct.
> >
> > Doing that for the other function BeginDirectModify() seems a bit more
> > involved.  We could add a new field to ForeignScan, say
> > resultRelation, that's set by either PlanDirectModify() (the FDW code)
> > or make_modifytable() (the core code) if the ForeignScan node contains
> > the command for direct modification.  BeginDirectModify() can then use
> > that value instead of relying on es_result_relation_info being set.
> >
> > Thoughts?  Fujita-san, do you have any opinion on whether that would
> > be a good idea?

I'm still not sure that it's a good idea to remove
es_result_relation_info, but if I had to say then I think the latter
would probably be better.  I'm planning to rework on direct
modification to base it on upper planner pathification so we can
perform direct modification without the ModifyTable node.  (I'm not
sure we can really do this for inherited UPDATE/DELETE, though.)  For
that rewrite, I'm thinking to call BeginDirectModify() from the
ForeignScan node (ie, ExecInitForeignScan()) as-is.  The latter
approach would allow that without any changes and avoid changing that
API many times.  That's the reason why I think the latter would
probably be better.

> I looked into trying to do the things I mentioned above and it seems
> to me that revising BeginDirectModify()'s API to receive the
> ResultRelInfo directly as Andres suggested might be the best way
> forward.  I've implemented that in the attached 0001.  Patches that
> were previously 0001, 0002, and 0003 are now 0002, 003, and 0004,
> respectively.  0002 is now a patch to "remove"
> es_result_relation_info.

Sorry for speaking this late.

Best regards,
Etsuro Fujita


Reply | Threaded
Open this post in threaded view
|

Re: partition routing layering in nodeModifyTable.c

Andres Freund
Hi,

On 2019-07-31 21:03:58 +0900, Etsuro Fujita wrote:

> I'm still not sure that it's a good idea to remove
> es_result_relation_info, but if I had to say then I think the latter
> would probably be better.  I'm planning to rework on direct
> modification to base it on upper planner pathification so we can
> perform direct modification without the ModifyTable node.  (I'm not
> sure we can really do this for inherited UPDATE/DELETE, though.)  For
> that rewrite, I'm thinking to call BeginDirectModify() from the
> ForeignScan node (ie, ExecInitForeignScan()) as-is.  The latter
> approach would allow that without any changes and avoid changing that
> API many times.  That's the reason why I think the latter would
> probably be better.

I think if we did that, it'd become *more* urgent to remove
es_result_relation. Having more and more plan nodes change global
resources is a recipse for disaster.

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: partition routing layering in nodeModifyTable.c

Amit Langote
In reply to this post by Etsuro Fujita-2
Fujita-san,

Thanks for the reply and sorry I didn't wait a bit more before posting
the patch.

On Wed, Jul 31, 2019 at 9:04 PM Etsuro Fujita <[hidden email]> wrote:

> On Wed, Jul 31, 2019 at 5:05 PM Amit Langote <[hidden email]> wrote:
> > On Tue, Jul 30, 2019 at 4:20 PM Amit Langote <[hidden email]> wrote:
> > > On Sat, Jul 20, 2019 at 1:52 AM Andres Freund <[hidden email]> wrote:
> > > > IOW, we should just change the direct modify calls to get the relevant
> > > > ResultRelationInfo or something in that vein (perhaps just the relevant
> > > > RT index?).
> > >
> > > It seems easy to make one of the two functions that constitute the
> > > direct modify API, IterateDirectModify(), access the result relation
> > > from ForeignScanState by saving either the result relation RT index or
> > > ResultRelInfo pointer itself into the ForeignScanState's FDW-private
> > > area.  For example, for postgres_fdw, one would simply add a new
> > > member to PgFdwDirectModifyState struct.
> > >
> > > Doing that for the other function BeginDirectModify() seems a bit more
> > > involved.  We could add a new field to ForeignScan, say
> > > resultRelation, that's set by either PlanDirectModify() (the FDW code)
> > > or make_modifytable() (the core code) if the ForeignScan node contains
> > > the command for direct modification.  BeginDirectModify() can then use
> > > that value instead of relying on es_result_relation_info being set.
> > >
> > > Thoughts?  Fujita-san, do you have any opinion on whether that would
> > > be a good idea?
>
> I'm still not sure that it's a good idea to remove
> es_result_relation_info, but if I had to say then I think the latter
> would probably be better.

Could you please clarify what you meant by the "latter"?

If it's the approach of adding a resultRelation Index field to
ForeignScan node, I tried and had to give up, realizing that we don't
maintain ResultRelInfos in an array that is indexable by RT indexes.
It would've worked if es_result_relations had mirrored es_range_table,
although that probably complicates how the individual ModifyTable
nodes attach to that array.  In any case, given this discussion,
further hacking on a global variable like es_result_relations may be a
course we might not want to pursue.

>  I'm planning to rework on direct
> modification to base it on upper planner pathification so we can
> perform direct modification without the ModifyTable node.  (I'm not
> sure we can really do this for inherited UPDATE/DELETE, though.)  For
> that rewrite, I'm thinking to call BeginDirectModify() from the
> ForeignScan node (ie, ExecInitForeignScan()) as-is.  The latter
> approach would allow that without any changes and avoid changing that
> API many times.  That's the reason why I think the latter would
> probably be better.

Will the new planning approach you're thinking of get rid of needing
any result relations at all (and so the ResultRelInfos in the
executor)?

Thanks,
Amit


Reply | Threaded
Open this post in threaded view
|

Re: partition routing layering in nodeModifyTable.c

Etsuro Fujita-2
Amit-san,

On Thu, Aug 1, 2019 at 10:33 AM Amit Langote <[hidden email]> wrote:

> On Wed, Jul 31, 2019 at 9:04 PM Etsuro Fujita <[hidden email]> wrote:
> > On Wed, Jul 31, 2019 at 5:05 PM Amit Langote <[hidden email]> wrote:
> > > On Tue, Jul 30, 2019 at 4:20 PM Amit Langote <[hidden email]> wrote:
> > > > On Sat, Jul 20, 2019 at 1:52 AM Andres Freund <[hidden email]> wrote:
> > > > > IOW, we should just change the direct modify calls to get the relevant
> > > > > ResultRelationInfo or something in that vein (perhaps just the relevant
> > > > > RT index?).
> > > >
> > > > It seems easy to make one of the two functions that constitute the
> > > > direct modify API, IterateDirectModify(), access the result relation
> > > > from ForeignScanState by saving either the result relation RT index or
> > > > ResultRelInfo pointer itself into the ForeignScanState's FDW-private
> > > > area.  For example, for postgres_fdw, one would simply add a new
> > > > member to PgFdwDirectModifyState struct.
> > > >
> > > > Doing that for the other function BeginDirectModify() seems a bit more
> > > > involved.  We could add a new field to ForeignScan, say
> > > > resultRelation, that's set by either PlanDirectModify() (the FDW code)
> > > > or make_modifytable() (the core code) if the ForeignScan node contains
> > > > the command for direct modification.  BeginDirectModify() can then use
> > > > that value instead of relying on es_result_relation_info being set.
> > > >
> > > > Thoughts?  Fujita-san, do you have any opinion on whether that would
> > > > be a good idea?
> >
> > I'm still not sure that it's a good idea to remove
> > es_result_relation_info, but if I had to say then I think the latter
> > would probably be better.
>
> Could you please clarify what you meant by the "latter"?
>
> If it's the approach of adding a resultRelation Index field to
> ForeignScan node, I tried and had to give up, realizing that we don't
> maintain ResultRelInfos in an array that is indexable by RT indexes.
> It would've worked if es_result_relations had mirrored es_range_table,
> although that probably complicates how the individual ModifyTable
> nodes attach to that array.  In any case, given this discussion,
> further hacking on a global variable like es_result_relations may be a
> course we might not want to pursue.
Yeah, I mean that approach.  To get the ResultRelInfo, I think
searching through the es_result_relations for the ResultRelInfo for
the resultRelation added to the ForeignScan in BeginDirectModify()
like the attached, which is created along your proposal.
ExecFindResultRelInfo() added by the patch wouldn't work efficiently
for inherited UPDATE/DELETE where there are many children that are
foreign tables, but I think that would probably be OK because in most
use-cases, including sharding, the number of such children would be at
most < 100 or so.  For improving the efficiency for the cases where
there are a lot more such children, however, I think it would be an
option to do something about global variables so that we can access
the ResultRelInfos by RT indexes more efficiently, because IMO I don't
think that would be against the point here ie, removing the dependency
on es_result_relation_info.  Maybe I'm missing something, though.

> >  I'm planning to rework on direct
> > modification to base it on upper planner pathification so we can
> > perform direct modification without the ModifyTable node.  (I'm not
> > sure we can really do this for inherited UPDATE/DELETE, though.)  For
> > that rewrite, I'm thinking to call BeginDirectModify() from the
> > ForeignScan node (ie, ExecInitForeignScan()) as-is.  The latter
> > approach would allow that without any changes and avoid changing that
> > API many times.  That's the reason why I think the latter would
> > probably be better.
>
> Will the new planning approach you're thinking of get rid of needing
> any result relations at all (and so the ResultRelInfos in the
> executor)?
I think the new planning approach would still need result relations
and ResultRelInfos in the executor as-is; and the FDW would probably
use the ResultRelInfo for the foreign table created by the core.  Some
of the ResultRelInfo data would prpbably need to be initialized by the
FDW itesef, though (eg, WCO constraints and/or RETURNING if any).

Best regards,
Etsuro Fujita

postgres-fdw-dont-depend-on-es_result_relation_info.patch (10K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: partition routing layering in nodeModifyTable.c

Andres Freund
In reply to this post by Amit Langote
Hi,

On 2019-07-31 17:04:38 +0900, Amit Langote wrote:
> I looked into trying to do the things I mentioned above and it seems
> to me that revising BeginDirectModify()'s API to receive the
> ResultRelInfo directly as Andres suggested might be the best way
> forward.  I've implemented that in the attached 0001.  Patches that
> were previously 0001, 0002, and 0003 are now 0002, 003, and 0004,
> respectively.  0002 is now a patch to "remove"
> es_result_relation_info.

Thanks!  Some minor quibbles aside, the non FDW patches look good to me.

Fujita-san, do you have any comments on the FDW API change?  Or anybody
else?

I'm a bit woried about the move of BeginDirectModify() into
nodeModifyTable.c - it just seems like an odd control flow to me. Not
allowing any intermittent nodes between ForeignScan and ModifyTable also
seems like an undesirable restriction for the future. I realize that we
already do that for BeginForeignModify() (just btw, that already accepts
resultRelInfo as a parameter, so being symmetrical for BeginDirectModify
makes sense), but it still seems like the wrong direction to me.

The need for that move, I assume, comes from needing knowing the correct
ResultRelInfo, correct?  I wonder if we shouldn't instead determine the
at plan time (in setrefs.c), somewhat similar to how we determine
ModifyTable.resultRelIndex. Doesn't look like that'd be too hard?

Then we could just have BeginForeignModify, BeginDirectModify,
BeginForeignScan all be called from ExecInitForeignScan().



Path 04 is such a nice improvement. Besides getting rid of a substantial
amount of code, it also makes the control flow a lot easier to read.


> @@ -4644,9 +4645,7 @@ GetAfterTriggersTableData(Oid relid, CmdType cmdType)
>   * If there are no triggers in 'trigdesc' that request relevant transition
>   * tables, then return NULL.
>   *
> - * The resulting object can be passed to the ExecAR* functions.  The caller
> - * should set tcs_map or tcs_original_insert_tuple as appropriate when dealing
> - * with child tables.
> + * The resulting object can be passed to the ExecAR* functions.
>   *
>   * Note that we copy the flags from a parent table into this struct (rather
>   * than subsequently using the relation's TriggerDesc directly) so that we can
> @@ -5750,14 +5749,26 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
>   */
>   if (row_trigger && transition_capture != NULL)
>   {
> - TupleTableSlot *original_insert_tuple = transition_capture->tcs_original_insert_tuple;
> - TupleConversionMap *map = transition_capture->tcs_map;
> + TupleTableSlot *original_insert_tuple;
> + PartitionRoutingInfo *pinfo = relinfo->ri_PartitionInfo;
> + TupleConversionMap *map = pinfo ?
> + pinfo->pi_PartitionToRootMap :
> + relinfo->ri_ChildToRootMap;
>   bool delete_old_table = transition_capture->tcs_delete_old_table;
>   bool update_old_table = transition_capture->tcs_update_old_table;
>   bool update_new_table = transition_capture->tcs_update_new_table;
>   bool insert_new_table = transition_capture->tcs_insert_new_table;
>  
>   /*
> + * Get the originally inserted tuple from the global variable and set
> + * the latter to NULL because any given tuple must be read only once.
> + * Note that the TransitionCaptureState is shared across many calls
> + * to this function.
> + */
> + original_insert_tuple = transition_capture->tcs_original_insert_tuple;
> + transition_capture->tcs_original_insert_tuple = NULL;

Maybe I'm missing something, but original_insert_tuple is not a global
variable?


> @@ -888,7 +889,8 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
>   PartitionTupleRouting *proute,
>   PartitionDispatch dispatch,
>   ResultRelInfo *partRelInfo,
> - int partidx)
> + int partidx,
> + bool is_update_result_rel)
>  {
>   MemoryContext oldcxt;
>   PartitionRoutingInfo *partrouteinfo;
> @@ -935,10 +937,15 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
>   if (mtstate &&
>   (mtstate->mt_transition_capture || mtstate->mt_oc_transition_capture))
>   {
> - partrouteinfo->pi_PartitionToRootMap =
> - convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_RelationDesc),
> -   RelationGetDescr(partRelInfo->ri_PartitionRoot),
> -   gettext_noop("could not convert row type"));
> + /* If partition is an update target, then we already got the map. */
> + if (is_update_result_rel)
> + partrouteinfo->pi_PartitionToRootMap =
> + partRelInfo->ri_ChildToRootMap;
> + else
> + partrouteinfo->pi_PartitionToRootMap =
> + convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_RelationDesc),
> +   RelationGetDescr(partRelInfo->ri_PartitionRoot),
> +   gettext_noop("could not convert row type"));
>   }

Hm, isn't is_update_result_rel just ModifyTable->operation == CMD_UPDATE?



Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: partition routing layering in nodeModifyTable.c

Etsuro Fujita-2
Hi,

On Sat, Aug 3, 2019 at 3:01 AM Andres Freund <[hidden email]> wrote:
> On 2019-07-31 17:04:38 +0900, Amit Langote wrote:
> > I looked into trying to do the things I mentioned above and it seems
> > to me that revising BeginDirectModify()'s API to receive the
> > ResultRelInfo directly as Andres suggested might be the best way
> > forward.  I've implemented that in the attached 0001.

> Fujita-san, do you have any comments on the FDW API change?  Or anybody
> else?
>
> I'm a bit woried about the move of BeginDirectModify() into
> nodeModifyTable.c - it just seems like an odd control flow to me. Not
> allowing any intermittent nodes between ForeignScan and ModifyTable also
> seems like an undesirable restriction for the future. I realize that we
> already do that for BeginForeignModify() (just btw, that already accepts
> resultRelInfo as a parameter, so being symmetrical for BeginDirectModify
> makes sense), but it still seems like the wrong direction to me.
>
> The need for that move, I assume, comes from needing knowing the correct
> ResultRelInfo, correct?  I wonder if we shouldn't instead determine the
> at plan time (in setrefs.c), somewhat similar to how we determine
> ModifyTable.resultRelIndex. Doesn't look like that'd be too hard?

I'd vote for that; I created a patch for that [1].

> Then we could just have BeginForeignModify, BeginDirectModify,
> BeginForeignScan all be called from ExecInitForeignScan().

I think so too.

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/CAPmGK15%3DoFHmWNND5vopfokSGfn6jMXVvnHa7K7P49F7k1hWPQ%40mail.gmail.com


Reply | Threaded
Open this post in threaded view
|

Re: partition routing layering in nodeModifyTable.c

Andres Freund
Hi,

On 2019-08-03 05:20:35 +0900, Etsuro Fujita wrote:

> On Sat, Aug 3, 2019 at 3:01 AM Andres Freund <[hidden email]> wrote:
> > On 2019-07-31 17:04:38 +0900, Amit Langote wrote:
> > > I looked into trying to do the things I mentioned above and it seems
> > > to me that revising BeginDirectModify()'s API to receive the
> > > ResultRelInfo directly as Andres suggested might be the best way
> > > forward.  I've implemented that in the attached 0001.
>
> > Fujita-san, do you have any comments on the FDW API change?  Or anybody
> > else?
> >
> > I'm a bit woried about the move of BeginDirectModify() into
> > nodeModifyTable.c - it just seems like an odd control flow to me. Not
> > allowing any intermittent nodes between ForeignScan and ModifyTable also
> > seems like an undesirable restriction for the future. I realize that we
> > already do that for BeginForeignModify() (just btw, that already accepts
> > resultRelInfo as a parameter, so being symmetrical for BeginDirectModify
> > makes sense), but it still seems like the wrong direction to me.
> >
> > The need for that move, I assume, comes from needing knowing the correct
> > ResultRelInfo, correct?  I wonder if we shouldn't instead determine the
> > at plan time (in setrefs.c), somewhat similar to how we determine
> > ModifyTable.resultRelIndex. Doesn't look like that'd be too hard?
>
> I'd vote for that; I created a patch for that [1].
>
> [1] https://www.postgresql.org/message-id/CAPmGK15%3DoFHmWNND5vopfokSGfn6jMXVvnHa7K7P49F7k1hWPQ%40mail.gmail.com

Oh, missed that. But that's not quite what I'm proposing. I don't like
ExecFindResultRelInfo at all. What's the point of it? It's introduction
is still an API break - I don't understand why that break is better than
just passing the ResultRelInfo directly to BeginDirectModify()?  I want
to again remark that BeginForeignModify() does get the ResultRelInfo -
it should have been done the same when adding direct modify.

Even if you need the loop - which I don't think is right - it should
live somewhere that individual FDWs don't have to care about.

- Andres


Reply | Threaded
Open this post in threaded view
|

Re: partition routing layering in nodeModifyTable.c

Andres Freund
In reply to this post by Etsuro Fujita-2
Hi,

On 2019-08-01 18:38:09 +0900, Etsuro Fujita wrote:
> On Thu, Aug 1, 2019 at 10:33 AM Amit Langote <[hidden email]> wrote:
> > If it's the approach of adding a resultRelation Index field to
> > ForeignScan node, I tried and had to give up, realizing that we don't
> > maintain ResultRelInfos in an array that is indexable by RT indexes.
> > It would've worked if es_result_relations had mirrored es_range_table,
> > although that probably complicates how the individual ModifyTable
> > nodes attach to that array.

We know at plan time what the the resultRelation offset for a
ModifyTable node is. We just need to transport that to the respective
foreign scan node, and update it properly in setrefs?  Then we can index
es_result_relations without any additional mapping?

Maybe I'm missing something? I think all we need to do is to have
setrefs.c:set_plan_refs() iterate over ->fdwDirectModifyPlans or such,
and set the respective node's result_relation_offset or whatever we're
naming it to splan->resultRelIndex + offset from fdwDirectModifyPlans?


> > In any case, given this discussion, further hacking on a global
> > variable like es_result_relations may be a course we might not want
> > to pursue.

I don't think es_result_relations really is problem - it doesn't have to
change while processing individual subplans / partitions / whatnot.  If
we needed a mapping between rtis and result indexes, I'd not see a
problem. Doubtful it's needed though.

There's a fundamental difference between EState->es_result_relations and
EState->es_result_relation_info. The former stays static during the
whole query once initialized, whereas es_result_relation_info changes
depending on which relation we're processing. The latter is what makes
the code more complicated, because we cannot ever return early etc.

Similarly, ModifyTableState->mt_per_subplan_tupconv_maps is not a
problem, it stays static, but e.g. mtstate->mt_transition_capture is a
problem, because we have to change for each subplan / routing /
partition movement.

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: partition routing layering in nodeModifyTable.c

Etsuro Fujita-2
In reply to this post by Andres Freund
Hi,

On Sat, Aug 3, 2019 at 5:31 AM Andres Freund <[hidden email]> wrote:

> On 2019-08-03 05:20:35 +0900, Etsuro Fujita wrote:
> > On Sat, Aug 3, 2019 at 3:01 AM Andres Freund <[hidden email]> wrote:
> > > On 2019-07-31 17:04:38 +0900, Amit Langote wrote:
> > > > I looked into trying to do the things I mentioned above and it seems
> > > > to me that revising BeginDirectModify()'s API to receive the
> > > > ResultRelInfo directly as Andres suggested might be the best way
> > > > forward.  I've implemented that in the attached 0001.
> >
> > > Fujita-san, do you have any comments on the FDW API change?  Or anybody
> > > else?
> > >
> > > I'm a bit woried about the move of BeginDirectModify() into
> > > nodeModifyTable.c - it just seems like an odd control flow to me. Not
> > > allowing any intermittent nodes between ForeignScan and ModifyTable also
> > > seems like an undesirable restriction for the future. I realize that we
> > > already do that for BeginForeignModify() (just btw, that already accepts
> > > resultRelInfo as a parameter, so being symmetrical for BeginDirectModify
> > > makes sense), but it still seems like the wrong direction to me.
> > >
> > > The need for that move, I assume, comes from needing knowing the correct
> > > ResultRelInfo, correct?  I wonder if we shouldn't instead determine the
> > > at plan time (in setrefs.c), somewhat similar to how we determine
> > > ModifyTable.resultRelIndex. Doesn't look like that'd be too hard?
> >
> > I'd vote for that; I created a patch for that [1].
> >
> > [1] https://www.postgresql.org/message-id/CAPmGK15%3DoFHmWNND5vopfokSGfn6jMXVvnHa7K7P49F7k1hWPQ%40mail.gmail.com
>
> Oh, missed that. But that's not quite what I'm proposing.

Sorry, I misread your message.  I think I was too tired.

> I don't like
> ExecFindResultRelInfo at all. What's the point of it? It's introduction
> is still an API break - I don't understand why that break is better than
> just passing the ResultRelInfo directly to BeginDirectModify()?

What API does that function break?  The point of that function was to
keep the direct modify layering/API as-is, because 1) I too felt the
same way about the move of BeginDirectModify() to nodeModifyTable.c,
and 2) I was thinking that when rewriting direct modify with upper
planner pathification so that we can perform it without ModifyTable,
we could still use the existing layering/API as-is, leading to smaller
changes to the core for that.

> I want
> to again remark that BeginForeignModify() does get the ResultRelInfo -
> it should have been done the same when adding direct modify.

Might have been so.

> Even if you need the loop - which I don't think is right - it should
> live somewhere that individual FDWs don't have to care about.

I was thinking to use hash lookup in ExecFindResultRelInfo() when
es_result_relations is very long, but I think the setters.c approach
you mentioned above might be much better.  Will consider that.

Best regards,
Etsuro Fujita


1234