repeated decoding of prepared transactions

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
61 messages Options
1234
Reply | Threaded
Open this post in threaded view
|

repeated decoding of prepared transactions

Markus Wanner-4
Amit, Ajin, hackers,

testing logical decoding for two-phase transactions, I stumbled over
what I first thought is a bug.  But comments seems to indicate this is
intended behavior.  Could you please clarify or elaborate on the design
decision?  Or indicate this indeed is a bug?

What puzzled me is that if a decoder is restarted in between the PREPARE
and the COMMIT PREPARED, it repeats the entire transaction, despite it
being already sent and potentially prepared on the receiving side.

In terms of `pg_logical_slot_get_changes` (and roughly from the
prepare.sql test), this looks as follows:

                         data
----------------------------------------------------
  BEGIN
  table public.test_prepared1: INSERT: id[integer]:1
  PREPARE TRANSACTION 'test_prepared#1'
(3 rows)


This is the first delivery of the transaction.  After a restart, it will
get all of the changes again, though:


                         data
----------------------------------------------------
  BEGIN
  table public.test_prepared1: INSERT: id[integer]:1
  PREPARE TRANSACTION 'test_prepared#1'
  COMMIT PREPARED 'test_prepared#1'
(4 rows)


I did not expect this, as any receiver that wants to have decoded 2PC is
likely supporting some kind of two-phase commits itself.  And would
therefore prepare the transaction upon its first reception.  Potentially
receiving it a second time would require complicated filtering on every
prepared transaction.

Furthermore, this clearly and unnecessarily holds back the restart LSN.
Meaning even just a single prepared transaction can block advancing the
restart LSN.  In most cases, these are short lived.  But on the other
hand, there may be an arbitrary amount of other transactions in between
a PREPARE and the corresponding COMMIT PREPARED in the WAL.  Not being
able to advance over a prepared transaction seems like a bad thing in
such a case.

I fail to see where this repetition would ever be useful.  Is there any
reason for the current implementation that I'm missing or can this be
corrected?  Thanks for elaborating.

Regards

Markus


Reply | Threaded
Open this post in threaded view
|

Re: repeated decoding of prepared transactions

akapila
On Mon, Feb 8, 2021 at 2:01 PM Markus Wanner
<[hidden email]> wrote:

>
> Amit, Ajin, hackers,
>
> testing logical decoding for two-phase transactions, I stumbled over
> what I first thought is a bug.  But comments seems to indicate this is
> intended behavior.  Could you please clarify or elaborate on the design
> decision?  Or indicate this indeed is a bug?
>
> What puzzled me is that if a decoder is restarted in between the PREPARE
> and the COMMIT PREPARED, it repeats the entire transaction, despite it
> being already sent and potentially prepared on the receiving side.
>
> In terms of `pg_logical_slot_get_changes` (and roughly from the
> prepare.sql test), this looks as follows:
>
>                          data
> ----------------------------------------------------
>   BEGIN
>   table public.test_prepared1: INSERT: id[integer]:1
>   PREPARE TRANSACTION 'test_prepared#1'
> (3 rows)
>
>
> This is the first delivery of the transaction.  After a restart, it will
> get all of the changes again, though:
>
>
>                          data
> ----------------------------------------------------
>   BEGIN
>   table public.test_prepared1: INSERT: id[integer]:1
>   PREPARE TRANSACTION 'test_prepared#1'
>   COMMIT PREPARED 'test_prepared#1'
> (4 rows)
>
>
> I did not expect this, as any receiver that wants to have decoded 2PC is
> likely supporting some kind of two-phase commits itself.  And would
> therefore prepare the transaction upon its first reception.  Potentially
> receiving it a second time would require complicated filtering on every
> prepared transaction.
>

The reason was mentioned in ReorderBufferFinishPrepared(). See below
comments in code:
/*
 * It is possible that this transaction is not decoded at prepare time
 * either because by that time we didn't have a consistent snapshot or it
 * was decoded earlier but we have restarted. We can't distinguish between
 * those two cases so we send the prepare in both the cases and let
 * downstream decide whether to process or skip it. We don't need to
 * decode the xact for aborts if it is not done already.
 */
This won't happen when we replicate via pgoutput  (the patch for which
is still not committed) because it won't restart from a previous point
(unless the server needs to be restarted due to some reason) as you
are doing via logical decoding APIs. Now, we don't send again the
prepared xacts on repeated calls of pg_logical_slot_get_changes()
unless we encounter commit. This behavior is already explained in docs
[1] (See, Transaction Begin Prepare Callback in docs) and a way to
skip the prepare.

> Furthermore, this clearly and unnecessarily holds back the restart LSN.
> Meaning even just a single prepared transaction can block advancing the
> restart LSN.  In most cases, these are short lived.  But on the other
> hand, there may be an arbitrary amount of other transactions in between
> a PREPARE and the corresponding COMMIT PREPARED in the WAL.  Not being
> able to advance over a prepared transaction seems like a bad thing in
> such a case.
>

That anyway is true without this work as well where restart_lsn can be
advanced on commits. We haven't changed anything in that regard.

[1] - https://www.postgresql.org/docs/devel/logicaldecoding-output-plugin.html


--
With Regards,
Amit Kapila.


Reply | Threaded
Open this post in threaded view
|

Re: repeated decoding of prepared transactions

Markus Wanner-4
Hello Amit,

thanks for your very quick response.

On 08.02.21 11:13, Amit Kapila wrote:
> /*
>   * It is possible that this transaction is not decoded at prepare time
>   * either because by that time we didn't have a consistent snapshot or it
>   * was decoded earlier but we have restarted. We can't distinguish between
>   * those two cases so we send the prepare in both the cases and let
>   * downstream decide whether to process or skip it. We don't need to
>   * decode the xact for aborts if it is not done already.
>   */

The way I read the surrounding code, the only case a 2PC transaction
does not get decoded a prepare time is if the transaction is empty.  Or
are you aware of any other situation that might currently happen?

> (unless the server needs to be restarted due to some reason)

Right, the repetition occurs only after a restart of the walsender in
between a prepare and a commit prepared record.

> That anyway is true without this work as well where restart_lsn can be
> advanced on commits. We haven't changed anything in that regard.

I did not mean to blame the patch, but merely try to understand some of
the design decisions behind it.

And as I just learned, even if we managed to avoid the repetition, a
restarted walsender still needs to see prepared transactions as
in-progress in its snapshots.  So we cannot move forward the restart_lsn
to after a prepare record (until the final commit or rollback is consumed).

Best Regards

Markus


Reply | Threaded
Open this post in threaded view
|

Re: repeated decoding of prepared transactions

akapila
On Mon, Feb 8, 2021 at 8:36 PM Markus Wanner
<[hidden email]> wrote:

>
> Hello Amit,
>
> thanks for your very quick response.
>
> On 08.02.21 11:13, Amit Kapila wrote:
> > /*
> >   * It is possible that this transaction is not decoded at prepare time
> >   * either because by that time we didn't have a consistent snapshot or it
> >   * was decoded earlier but we have restarted. We can't distinguish between
> >   * those two cases so we send the prepare in both the cases and let
> >   * downstream decide whether to process or skip it. We don't need to
> >   * decode the xact for aborts if it is not done already.
> >   */
>
> The way I read the surrounding code, the only case a 2PC transaction
> does not get decoded a prepare time is if the transaction is empty.  Or
> are you aware of any other situation that might currently happen?
>

We also skip decoding at prepare time if we haven't reached a
consistent snapshot by that time. See below code in DecodePrepare().
DecodePrepare()
{
..
/* We can't start streaming unless a consistent state is reached. */
if (SnapBuildCurrentState(builder) < SNAPBUILD_CONSISTENT)
{
ReorderBufferSkipPrepare(ctx->reorder, xid);
return;
}
..
}

There are other reasons as well like the output plugin doesn't want to
allow decoding at prepare time but I don't think those are relevant to
the discussion here.

> > (unless the server needs to be restarted due to some reason)
>
> Right, the repetition occurs only after a restart of the walsender in
> between a prepare and a commit prepared record.
>
> > That anyway is true without this work as well where restart_lsn can be
> > advanced on commits. We haven't changed anything in that regard.
>
> I did not mean to blame the patch, but merely try to understand some of
> the design decisions behind it.
>
> And as I just learned, even if we managed to avoid the repetition, a
> restarted walsender still needs to see prepared transactions as
> in-progress in its snapshots.  So we cannot move forward the restart_lsn
> to after a prepare record (until the final commit or rollback is consumed).
>

Right and say if we forget the prepared transactions and move forward
with restart_lsn once we get the prepare for any transaction. Then we
will open up a window where we haven't actually sent the prepared xact
because of say "snapshot has not yet reached consistent state" and we
have moved the restart_lsn. Then later when we get the commit
corresponding to the prepared transaction by which time say the
"snapshot has reached consistent state" then we will miss sending the
transaction contents and prepare for it. I think for such reasons we
allow restart_lsn to moved only once the transaction is finished
(committed or rolled back).

--
With Regards,
Amit Kapila.


Reply | Threaded
Open this post in threaded view
|

Re: repeated decoding of prepared transactions

Ashutosh Bapat-2
On Tue, Feb 9, 2021 at 8:32 AM Amit Kapila <[hidden email]> wrote:

>
> On Mon, Feb 8, 2021 at 8:36 PM Markus Wanner
> <[hidden email]> wrote:
> >
> > Hello Amit,
> >
> > thanks for your very quick response.
> >
> > On 08.02.21 11:13, Amit Kapila wrote:
> > > /*
> > >   * It is possible that this transaction is not decoded at prepare time
> > >   * either because by that time we didn't have a consistent snapshot or it
> > >   * was decoded earlier but we have restarted. We can't distinguish between
> > >   * those two cases so we send the prepare in both the cases and let
> > >   * downstream decide whether to process or skip it. We don't need to
> > >   * decode the xact for aborts if it is not done already.
> > >   */
> >
> > The way I read the surrounding code, the only case a 2PC transaction
> > does not get decoded a prepare time is if the transaction is empty.  Or
> > are you aware of any other situation that might currently happen?
> >
>
> We also skip decoding at prepare time if we haven't reached a
> consistent snapshot by that time. See below code in DecodePrepare().
> DecodePrepare()
> {
> ..
> /* We can't start streaming unless a consistent state is reached. */
> if (SnapBuildCurrentState(builder) < SNAPBUILD_CONSISTENT)
> {
> ReorderBufferSkipPrepare(ctx->reorder, xid);
> return;
> }
> ..
> }

Can you please provide steps which can lead to this situation? If
there is an earlier discussion which has example scenarios, please
point us to the relevant thread.

If we are not sending PREPARED transactions that's fine, but sending
the same prepared transaction as many times as the WAL sender is
restarted between sending prepare and commit prepared is a waste of
network bandwidth. The wastage is proportional to the amount of
changes in the transaction and number of such transactions themselves.
Also this will cause performance degradation. So if we can avoid
resending prepared transactions twice that will help.

--
Best Wishes,
Ashutosh Bapat


Reply | Threaded
Open this post in threaded view
|

Re: repeated decoding of prepared transactions

Ajin Cherian
On Tue, Feb 9, 2021 at 4:59 PM Ashutosh Bapat
<[hidden email]> wrote:

> Can you please provide steps which can lead to this situation? If
> there is an earlier discussion which has example scenarios, please
> point us to the relevant thread.
>
> If we are not sending PREPARED transactions that's fine, but sending
> the same prepared transaction as many times as the WAL sender is
> restarted between sending prepare and commit prepared is a waste of
> network bandwidth. The wastage is proportional to the amount of
> changes in the transaction and number of such transactions themselves.
> Also this will cause performance degradation. So if we can avoid
> resending prepared transactions twice that will help.

One of this scenario is explained in the test case in

postgres/contrib/test_decoding/specs/twophase_snapshot.spec

regards,
Ajin Cherian
Fujitsu Australia


Reply | Threaded
Open this post in threaded view
|

Re: repeated decoding of prepared transactions

akapila
In reply to this post by Ashutosh Bapat-2
On Tue, Feb 9, 2021 at 11:29 AM Ashutosh Bapat
<[hidden email]> wrote:

>
> On Tue, Feb 9, 2021 at 8:32 AM Amit Kapila <[hidden email]> wrote:
> >
> > On Mon, Feb 8, 2021 at 8:36 PM Markus Wanner
> > <[hidden email]> wrote:
> > >
> > > Hello Amit,
> > >
> > > thanks for your very quick response.
> > >
> > > On 08.02.21 11:13, Amit Kapila wrote:
> > > > /*
> > > >   * It is possible that this transaction is not decoded at prepare time
> > > >   * either because by that time we didn't have a consistent snapshot or it
> > > >   * was decoded earlier but we have restarted. We can't distinguish between
> > > >   * those two cases so we send the prepare in both the cases and let
> > > >   * downstream decide whether to process or skip it. We don't need to
> > > >   * decode the xact for aborts if it is not done already.
> > > >   */
> > >
> > > The way I read the surrounding code, the only case a 2PC transaction
> > > does not get decoded a prepare time is if the transaction is empty.  Or
> > > are you aware of any other situation that might currently happen?
> > >
> >
> > We also skip decoding at prepare time if we haven't reached a
> > consistent snapshot by that time. See below code in DecodePrepare().
> > DecodePrepare()
> > {
> > ..
> > /* We can't start streaming unless a consistent state is reached. */
> > if (SnapBuildCurrentState(builder) < SNAPBUILD_CONSISTENT)
> > {
> > ReorderBufferSkipPrepare(ctx->reorder, xid);
> > return;
> > }
> > ..
> > }
>
> Can you please provide steps which can lead to this situation?
>

Ajin has already shared the example with you.

> If
> there is an earlier discussion which has example scenarios, please
> point us to the relevant thread.
>

It started in the email [1] and from there you can read later emails
to know more about this.

> If we are not sending PREPARED transactions that's fine,
>

Hmm, I am not sure if that is fine because if the output plugin sets
the two-phase-commit option, it would expect all prepared xacts to
arrive not some only some of them.

> but sending
> the same prepared transaction as many times as the WAL sender is
> restarted between sending prepare and commit prepared is a waste of
> network bandwidth.
>

I think similar happens without any of the work done in PG-14 as well
if we restart the apply worker before the commit completes on the
subscriber. After the restart, we will send the start_decoding_at
point based on some previous commit which will make publisher send the
entire transaction again. I don't think restart of WAL sender or WAL
receiver is such a common thing. It can only happen due to some bug in
code or user wishes to stop the nodes or some crash happened.

[1] - https://www.postgresql.org/message-id/CAA4eK1%2Bd3gzCyzsYjt1m6sfGf_C_uFmo9JK%3D3Wafp6yR8Mg8uQ%40mail.gmail.com

--
With Regards,
Amit Kapila.


Reply | Threaded
Open this post in threaded view
|

Re: repeated decoding of prepared transactions

Robert Haas
On Tue, Feb 9, 2021 at 6:57 AM Amit Kapila <[hidden email]> wrote:
> I think similar happens without any of the work done in PG-14 as well
> if we restart the apply worker before the commit completes on the
> subscriber. After the restart, we will send the start_decoding_at
> point based on some previous commit which will make publisher send the
> entire transaction again. I don't think restart of WAL sender or WAL
> receiver is such a common thing. It can only happen due to some bug in
> code or user wishes to stop the nodes or some crash happened.

Really? My impression is that the logical replication protocol is
supposed to be designed in such a way that once a transaction is
successfully confirmed, it won't be sent again. Now if something is
not confirmed then it has to be sent again. But if it is confirmed
then it shouldn't happen.

--
Robert Haas
EDB: http://www.enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: repeated decoding of prepared transactions

akapila
On Wed, Feb 10, 2021 at 12:08 AM Robert Haas <[hidden email]> wrote:

>
> On Tue, Feb 9, 2021 at 6:57 AM Amit Kapila <[hidden email]> wrote:
> > I think similar happens without any of the work done in PG-14 as well
> > if we restart the apply worker before the commit completes on the
> > subscriber. After the restart, we will send the start_decoding_at
> > point based on some previous commit which will make publisher send the
> > entire transaction again. I don't think restart of WAL sender or WAL
> > receiver is such a common thing. It can only happen due to some bug in
> > code or user wishes to stop the nodes or some crash happened.
>
> Really? My impression is that the logical replication protocol is
> supposed to be designed in such a way that once a transaction is
> successfully confirmed, it won't be sent again. Now if something is
> not confirmed then it has to be sent again. But if it is confirmed
> then it shouldn't happen.
>

If by successfully confirmed, you mean that once the subscriber node
has received, it won't be sent again then as far as I know that is not
true. We rely on the flush location sent by the subscriber to advance
the decoding locations. We update the flush locations after we apply
the transaction's commit successfully. Also, after the restart, we use
the replication origin's last flush location as a point from where we
need the transactions and the origin's progress is updated at commit
time.

OTOH, If by successfully confirmed, you mean that once the subscriber
has applied the complete transaction (including commit), then you are
right that it won't be sent again.

--
With Regards,
Amit Kapila.


Reply | Threaded
Open this post in threaded view
|

Re: repeated decoding of prepared transactions

Ashutosh Bapat


On Wed, Feb 10, 2021 at 8:02 AM Amit Kapila <[hidden email]> wrote:
On Wed, Feb 10, 2021 at 12:08 AM Robert Haas <[hidden email]> wrote:
>
> On Tue, Feb 9, 2021 at 6:57 AM Amit Kapila <[hidden email]> wrote:
> > I think similar happens without any of the work done in PG-14 as well
> > if we restart the apply worker before the commit completes on the
> > subscriber. After the restart, we will send the start_decoding_at
> > point based on some previous commit which will make publisher send the
> > entire transaction again. I don't think restart of WAL sender or WAL
> > receiver is such a common thing. It can only happen due to some bug in
> > code or user wishes to stop the nodes or some crash happened.
>
> Really? My impression is that the logical replication protocol is
> supposed to be designed in such a way that once a transaction is
> successfully confirmed, it won't be sent again. Now if something is
> not confirmed then it has to be sent again. But if it is confirmed
> then it shouldn't happen.
>

If by successfully confirmed, you mean that once the subscriber node
has received, it won't be sent again then as far as I know that is not
true. We rely on the flush location sent by the subscriber to advance
the decoding locations. We update the flush locations after we apply
the transaction's commit successfully. Also, after the restart, we use
the replication origin's last flush location as a point from where we
need the transactions and the origin's progress is updated at commit
time.

OTOH, If by successfully confirmed, you mean that once the subscriber
has applied the complete transaction (including commit), then you are
right that it won't be sent again.

I think we need to treat a prepared transaction slightly different from an uncommitted transaction when sending downstream. We need to send a whole uncommitted transaction downstream again because previously applied changes must have been aborted and hence lost by the downstream and thus it needs to get all of those again. But when a downstream prepares a transaction, even if it's not committed, those changes are not lost even after restart of a walsender. If the downstream confirms that it has "flushed" PREPARE, there is no need to send all the changes again.

--
Best Wishes,
Ashutosh
Reply | Threaded
Open this post in threaded view
|

Re: repeated decoding of prepared transactions

Ajin Cherian
On Wed, Feb 10, 2021 at 3:43 PM Ashutosh Bapat
<[hidden email]> wrote:


> I think we need to treat a prepared transaction slightly different from an uncommitted transaction when sending downstream. We need to send a whole uncommitted transaction downstream again because previously applied changes must have been aborted and hence lost by the downstream and thus it needs to get all of those again. But when a downstream prepares a transaction, even if it's not committed, those changes are not lost even after restart of a walsender. If the downstream confirms that it has "flushed" PREPARE, there is no need to send all the changes again.

But the other side of the problem is that ,without this, if the
prepared transaction is prior to a consistent snapshot when decoding
starts/restarts, then only the "commit prepared" is sent to downstream
(as seen in the test scenario I shared above), and downstream has to
error away the commit prepared because it does not have the
corresponding prepared transaction. We did not find an easy way to
distinguish between these two scenarios for prepared transactions.
a. A consistent snapshot being formed in between a prepare and a
commit prepared for the first time.
b. Decoder restarting between a prepare and a commit prepared.

For plugins to be able to handle this, we have added a special
callback "Begin Prepare" as explained in [1] section 49.6.4.10

"The required begin_prepare_cb callback is called whenever the start
of a prepared transaction has been decoded. The gid field, which is
part of the txn parameter can be used in this callback to check if the
plugin has already received this prepare in which case it can skip the
remaining changes of the transaction. This can only happen if the user
restarts the decoding after receiving the prepare for a transaction
but before receiving the commit prepared say because of some error."

The pgoutput plugin is also being updated to be able to handle this
situation of duplicate prepared transactions.

regards,
Ajin Cherian
Fujitsu Australia


Reply | Threaded
Open this post in threaded view
|

Re: repeated decoding of prepared transactions

akapila
On Wed, Feb 10, 2021 at 11:45 AM Ajin Cherian <[hidden email]> wrote:

>
> On Wed, Feb 10, 2021 at 3:43 PM Ashutosh Bapat
> <[hidden email]> wrote:
>
>
> > I think we need to treat a prepared transaction slightly different from an uncommitted transaction when sending downstream. We need to send a whole uncommitted transaction downstream again because previously applied changes must have been aborted and hence lost by the downstream and thus it needs to get all of those again. But when a downstream prepares a transaction, even if it's not committed, those changes are not lost even after restart of a walsender. If the downstream confirms that it has "flushed" PREPARE, there is no need to send all the changes again.
>
> But the other side of the problem is that ,without this, if the
> prepared transaction is prior to a consistent snapshot when decoding
> starts/restarts, then only the "commit prepared" is sent to downstream
> (as seen in the test scenario I shared above), and downstream has to
> error away the commit prepared because it does not have the
> corresponding prepared transaction.
>

I think it is not only simple error handling, it is required for
data-consistency. We need to send the transactions whose commits are
encountered after a consistent snapshot is reached.

--
With Regards,
Amit Kapila.


Reply | Threaded
Open this post in threaded view
|

Re: repeated decoding of prepared transactions

Markus Wanner-4
On 10.02.21 07:32, Amit Kapila wrote:

> On Wed, Feb 10, 2021 at 11:45 AM Ajin Cherian <[hidden email]> wrote:
>> But the other side of the problem is that ,without this, if the
>> prepared transaction is prior to a consistent snapshot when decoding
>> starts/restarts, then only the "commit prepared" is sent to downstream
>> (as seen in the test scenario I shared above), and downstream has to
>> error away the commit prepared because it does not have the
>> corresponding prepared transaction.
>
> I think it is not only simple error handling, it is required for
> data-consistency. We need to send the transactions whose commits are
> encountered after a consistent snapshot is reached.

I'm with Ashutosh here.  If a replica is properly in sync, it knows
about prepared transactions and all the gids of those.  Sending the
transactional changes and the prepare again is inconsistent.

The point of a two-phase transaction is to have two phases.  An output
plugin must have the chance of treating them as independent events.
Once a PREPARE is confirmed, it must not be sent again.  Even if the
transaction is still in-progress and its changes are not yet visible on
the origin node.

Regards

Markus


Reply | Threaded
Open this post in threaded view
|

Re: repeated decoding of prepared transactions

akapila
On Wed, Feb 10, 2021 at 1:40 PM Markus Wanner
<[hidden email]> wrote:

>
> On 10.02.21 07:32, Amit Kapila wrote:
> > On Wed, Feb 10, 2021 at 11:45 AM Ajin Cherian <[hidden email]> wrote:
> >> But the other side of the problem is that ,without this, if the
> >> prepared transaction is prior to a consistent snapshot when decoding
> >> starts/restarts, then only the "commit prepared" is sent to downstream
> >> (as seen in the test scenario I shared above), and downstream has to
> >> error away the commit prepared because it does not have the
> >> corresponding prepared transaction.
> >
> > I think it is not only simple error handling, it is required for
> > data-consistency. We need to send the transactions whose commits are
> > encountered after a consistent snapshot is reached.
>
> I'm with Ashutosh here.  If a replica is properly in sync, it knows
> about prepared transactions and all the gids of those.  Sending the
> transactional changes and the prepare again is inconsistent.
>
> The point of a two-phase transaction is to have two phases.  An output
> plugin must have the chance of treating them as independent events.
>

I am not sure I understand what problem you are facing to deal with
this in the output plugin, it is explained in docs and Ajin also
pointed out the same. Ajin and I have explained to you the design
constraints on the publisher-side due to which we have done this way.
Do you have any better ideas to deal with this?

--
With Regards,
Amit Kapila.


Reply | Threaded
Open this post in threaded view
|

Re: repeated decoding of prepared transactions

Robert Haas
In reply to this post by akapila
On Tue, Feb 9, 2021 at 9:32 PM Amit Kapila <[hidden email]> wrote:

> If by successfully confirmed, you mean that once the subscriber node
> has received, it won't be sent again then as far as I know that is not
> true. We rely on the flush location sent by the subscriber to advance
> the decoding locations. We update the flush locations after we apply
> the transaction's commit successfully. Also, after the restart, we use
> the replication origin's last flush location as a point from where we
> need the transactions and the origin's progress is updated at commit
> time.
>
> OTOH, If by successfully confirmed, you mean that once the subscriber
> has applied the complete transaction (including commit), then you are
> right that it won't be sent again.

I meant - once the subscriber has advanced the flush location.

--
Robert Haas
EDB: http://www.enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: repeated decoding of prepared transactions

akapila
In reply to this post by Markus Wanner-4
On Mon, Feb 8, 2021 at 2:01 PM Markus Wanner
<[hidden email]> wrote:
>
> I did not expect this, as any receiver that wants to have decoded 2PC is
> likely supporting some kind of two-phase commits itself.  And would
> therefore prepare the transaction upon its first reception.  Potentially
> receiving it a second time would require complicated filtering on every
> prepared transaction.
>

I would like to bring one other scenario to your notice where you
might want to handle things differently for prepared transactions on
the plugin side. Assume we have multiple publications (for simplicity
say 2) on publisher with corresponding subscriptions (say 2, each
corresponding to one publication on the publisher). When a user
performs a transaction on a publisher that involves the tables from
both publications, on the subscriber-side, we do that work via two
different transactions, corresponding to each subscription. But, we
need some way to deal with prepared xacts because they need GID and we
can't use the same GID for both subscriptions. Please see the detailed
example and one idea to deal with the same in the main thread[1]. It
would be really helpful if you or others working on the plugin side
can share your opinion on the same.

Now, coming back to the restart case where the prepared transaction
can be sent again by the publisher. I understand yours and others
point that we should not send prepared transaction if there is a
restart between prepare and commit but there are reasons why we have
done that way and I am open to your suggestions. I'll once again try
to explain the exact case to you which is not very apparent. The basic
idea is that we ship/replay all transactions where commit happens
after the snapshot has a consistent state (SNAPBUILD_CONSISTENT), see
atop snapbuild.c for details. Now, for transactions where prepare is
before snapshot state SNAPBUILD_CONSISTENT and commit prepared is
after SNAPBUILD_CONSISTENT, we need to send the entire transaction
including prepare at the commit time. One might think it is quite easy
to detect that, basically if we skip prepare when the snapshot state
was not SNAPBUILD_CONSISTENT, then mark a flag in ReorderBufferTxn and
use the same to detect during commit and accordingly take the decision
to send prepare but unfortunately it is not that easy. There is always
a chance that on restart we reuse the snapshot serialized by some
other Walsender at a location prior to Prepare and if that happens
then this time the prepare won't be skipped due to snapshot state
(SNAPBUILD_CONSISTENT) but due to start_decodint_at point (considering
we have already shipped some of the later commits but not prepare).
Now, this will actually become the same situation where the restart
has happened after we have sent the prepare but not commit. This is
the reason we have to resend the prepare when the subscriber restarts
between prepare and commit.

You can reproduce the case where we can't distinguish between two
situations by using the test case in twophase_snapshot.spec and
additionally starting a separate session via the debugger. So, the
steps in the test case are as below:

"s2b" "s2txid" "s1init" "s3b" "s3txid" "s2c" "s2b" "s2insert" "s2p"
"s3c" "s1insert" "s1start" "s2cp" "s1start"

Define new steps as

"s4init" {SELECT 'init' FROM
pg_create_logical_replication_slot('isolation_slot_1',
'test_decoding');}
"s4start" {SELECT data  FROM
pg_logical_slot_get_changes('isolation_slot_1', NULL, NULL,
'include-xids', 'false', 'skip-empty-xacts', '1', 'two-phase-commit',
'1');}

The first thing we need to do is s4init and stop the debugger in
SnapBuildProcessRunningXacts. Now perform steps from 's2b' till first
's1start' in twophase_snapshot.spec. Then continue in the s4 session
and perform s4start. After this, if you debug (or add the logs) the
second s1start, you will notice that we are skipping prepare not
because of inconsistent snapshot but a forward location in
start_decoding_at. If you don't involve session-4, then it will always
skip prepare due to an inconsistent snapshot state. This involves a
debugger so not easy to write an automated test for it.

I have used a bit tricky scenario to explain this but not sure if
there was any other simpler way.

[1] - https://www.postgresql.org/message-id/CAA4eK1%2BLvkeX%3DB3xon7RcBwD4CVaFSryPj3pTBAALrDxQVPDwA%40mail.gmail.com

--
With Regards,
Amit Kapila.


Reply | Threaded
Open this post in threaded view
|

Re: repeated decoding of prepared transactions

Markus Wanner-4
Hello Amit,

thanks a lot for your extensive explanation and examples, I appreciate
this very much.  I'll need to think this through and see how we can make
this work for us.

Best Regards

Markus


Reply | Threaded
Open this post in threaded view
|

Re: repeated decoding of prepared transactions

Robert Haas
In reply to this post by akapila
On Thu, Feb 11, 2021 at 5:37 AM Amit Kapila <[hidden email]> wrote:
> to explain the exact case to you which is not very apparent. The basic
> idea is that we ship/replay all transactions where commit happens
> after the snapshot has a consistent state (SNAPBUILD_CONSISTENT), see
> atop snapbuild.c for details. Now, for transactions where prepare is
> before snapshot state SNAPBUILD_CONSISTENT and commit prepared is
> after SNAPBUILD_CONSISTENT, we need to send the entire transaction
> including prepare at the commit time.

This might be a dumb question, but: why?

Is this because the effects of the prepared transaction might
otherwise be included neither in the initial synchronization of the
data nor in any subsequently decoded transaction, thus leaving the
replica out of sync?

--
Robert Haas
EDB: http://www.enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: repeated decoding of prepared transactions

akapila
On Fri, Feb 12, 2021 at 1:10 AM Robert Haas <[hidden email]> wrote:

>
> On Thu, Feb 11, 2021 at 5:37 AM Amit Kapila <[hidden email]> wrote:
> > to explain the exact case to you which is not very apparent. The basic
> > idea is that we ship/replay all transactions where commit happens
> > after the snapshot has a consistent state (SNAPBUILD_CONSISTENT), see
> > atop snapbuild.c for details. Now, for transactions where prepare is
> > before snapshot state SNAPBUILD_CONSISTENT and commit prepared is
> > after SNAPBUILD_CONSISTENT, we need to send the entire transaction
> > including prepare at the commit time.
>
> This might be a dumb question, but: why?
>
> Is this because the effects of the prepared transaction might
> otherwise be included neither in the initial synchronization of the
> data nor in any subsequently decoded transaction, thus leaving the
> replica out of sync?
>

Yes.

--
With Regards,
Amit Kapila.


Reply | Threaded
Open this post in threaded view
|

Re: repeated decoding of prepared transactions

Andres Freund
In reply to this post by akapila
Hi,

On 2021-02-10 08:02:17 +0530, Amit Kapila wrote:

> On Wed, Feb 10, 2021 at 12:08 AM Robert Haas <[hidden email]> wrote:
> >
> > On Tue, Feb 9, 2021 at 6:57 AM Amit Kapila <[hidden email]> wrote:
> > > I think similar happens without any of the work done in PG-14 as well
> > > if we restart the apply worker before the commit completes on the
> > > subscriber. After the restart, we will send the start_decoding_at
> > > point based on some previous commit which will make publisher send the
> > > entire transaction again. I don't think restart of WAL sender or WAL
> > > receiver is such a common thing. It can only happen due to some bug in
> > > code or user wishes to stop the nodes or some crash happened.
> >
> > Really? My impression is that the logical replication protocol is
> > supposed to be designed in such a way that once a transaction is
> > successfully confirmed, it won't be sent again. Now if something is
> > not confirmed then it has to be sent again. But if it is confirmed
> > then it shouldn't happen.

Correct.


> If by successfully confirmed, you mean that once the subscriber node
> has received, it won't be sent again then as far as I know that is not
> true. We rely on the flush location sent by the subscriber to advance
> the decoding locations. We update the flush locations after we apply
> the transaction's commit successfully. Also, after the restart, we use
> the replication origin's last flush location as a point from where we
> need the transactions and the origin's progress is updated at commit
> time.

That's not quite right. Yes, the flush location isn't guaranteed to be
updated at that point, but a replication client will send the last
location they've received and successfully processed, and that has to
*guarantee* that they won't receive anything twice, or miss
something. Otherwise you've broken the protocol.

Greetings,

Andres Freund


1234