Amit, Ajin, hackers,
testing logical decoding for two-phase transactions, I stumbled over what
I first thought is a bug. But comments seem to indicate this is intended
behavior. Could you please clarify or elaborate on the design decision?
Or indicate this indeed is a bug?

What puzzled me is that if a decoder is restarted in between the PREPARE
and the COMMIT PREPARED, it repeats the entire transaction, despite it
being already sent and potentially prepared on the receiving side.

In terms of `pg_logical_slot_get_changes` (and roughly from the
prepare.sql test), this looks as follows:

                        data
 ----------------------------------------------------
  BEGIN
  table public.test_prepared1: INSERT: id[integer]:1
  PREPARE TRANSACTION 'test_prepared#1'
 (3 rows)

This is the first delivery of the transaction. After a restart, it will
get all of the changes again, though:

                        data
 ----------------------------------------------------
  BEGIN
  table public.test_prepared1: INSERT: id[integer]:1
  PREPARE TRANSACTION 'test_prepared#1'
  COMMIT PREPARED 'test_prepared#1'
 (4 rows)

I did not expect this, as any receiver that wants to have decoded 2PC is
likely supporting some kind of two-phase commits itself, and would
therefore prepare the transaction upon its first reception. Potentially
receiving it a second time would require complicated filtering on every
prepared transaction.

Furthermore, this clearly and unnecessarily holds back the restart LSN,
meaning even just a single prepared transaction can block advancing the
restart LSN. In most cases, these are short lived. But on the other hand,
there may be an arbitrary amount of other transactions in between a
PREPARE and the corresponding COMMIT PREPARED in the WAL. Not being able
to advance over a prepared transaction seems like a bad thing in such a
case.

I fail to see where this repetition would ever be useful. Is there any
reason for the current implementation that I'm missing, or can this be
corrected?

Thanks for elaborating.

Regards

Markus
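For reference, the scenario can be reproduced via the SQL interface
roughly as follows (a minimal sketch only; the slot and table names are
merely examples, and the test_decoding options follow the ones used in
the twophase tests quoted later in this thread):

SELECT 'init' FROM pg_create_logical_replication_slot(
    'regression_slot', 'test_decoding');
CREATE TABLE test_prepared1 (id integer PRIMARY KEY);

BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';

-- first consumption: BEGIN, INSERT, PREPARE TRANSACTION (3 rows)
SELECT data FROM pg_logical_slot_get_changes('regression_slot',
    NULL, NULL, 'skip-empty-xacts', '1', 'two-phase-commit', '1');

COMMIT PREPARED 'test_prepared#1';

-- consuming again after the commit (comparable to a decoder restart)
-- replays the whole transaction once more, now followed by
-- COMMIT PREPARED (4 rows)
SELECT data FROM pg_logical_slot_get_changes('regression_slot',
    NULL, NULL, 'skip-empty-xacts', '1', 'two-phase-commit', '1');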
On Mon, Feb 8, 2021 at 2:01 PM Markus Wanner
<[hidden email]> wrote:
>
> Amit, Ajin, hackers,
>
> testing logical decoding for two-phase transactions, I stumbled over
> what I first thought is a bug. But comments seem to indicate this is
> intended behavior. Could you please clarify or elaborate on the design
> decision? Or indicate this indeed is a bug?
>
> What puzzled me is that if a decoder is restarted in between the PREPARE
> and the COMMIT PREPARED, it repeats the entire transaction, despite it
> being already sent and potentially prepared on the receiving side.
>
> In terms of `pg_logical_slot_get_changes` (and roughly from the
> prepare.sql test), this looks as follows:
>
>                         data
>  ----------------------------------------------------
>   BEGIN
>   table public.test_prepared1: INSERT: id[integer]:1
>   PREPARE TRANSACTION 'test_prepared#1'
>  (3 rows)
>
> This is the first delivery of the transaction. After a restart, it will
> get all of the changes again, though:
>
>                         data
>  ----------------------------------------------------
>   BEGIN
>   table public.test_prepared1: INSERT: id[integer]:1
>   PREPARE TRANSACTION 'test_prepared#1'
>   COMMIT PREPARED 'test_prepared#1'
>  (4 rows)
>
> I did not expect this, as any receiver that wants to have decoded 2PC is
> likely supporting some kind of two-phase commits itself, and would
> therefore prepare the transaction upon its first reception. Potentially
> receiving it a second time would require complicated filtering on every
> prepared transaction.
>

The reason was mentioned in ReorderBufferFinishPrepared(). See the
comment in the code:

/*
 * It is possible that this transaction is not decoded at prepare time
 * either because by that time we didn't have a consistent snapshot or it
 * was decoded earlier but we have restarted. We can't distinguish between
 * those two cases so we send the prepare in both the cases and let
 * downstream decide whether to process or skip it. We don't need to
 * decode the xact for aborts if it is not done already.
 */

This won't happen when we replicate via pgoutput (the patch for which is
still not committed) because it won't restart from a previous point
(unless the server needs to be restarted due to some reason) as you are
doing via the logical decoding APIs. Now, we don't send the prepared
xacts again on repeated calls of pg_logical_slot_get_changes() unless we
encounter the commit. This behavior, and a way to skip the prepare, is
already explained in the docs [1] (see "Transaction Begin Prepare
Callback").

> Furthermore, this clearly and unnecessarily holds back the restart LSN,
> meaning even just a single prepared transaction can block advancing the
> restart LSN. In most cases, these are short lived. But on the other
> hand, there may be an arbitrary amount of other transactions in between
> a PREPARE and the corresponding COMMIT PREPARED in the WAL. Not being
> able to advance over a prepared transaction seems like a bad thing in
> such a case.
>

That anyway is true without this work as well where restart_lsn can be
advanced on commits. We haven't changed anything in that regard.

[1] - https://www.postgresql.org/docs/devel/logicaldecoding-output-plugin.html

--
With Regards,
Amit Kapila.
Hello Amit,
thanks for your very quick response.

On 08.02.21 11:13, Amit Kapila wrote:
> /*
> * It is possible that this transaction is not decoded at prepare time
> * either because by that time we didn't have a consistent snapshot or it
> * was decoded earlier but we have restarted. We can't distinguish between
> * those two cases so we send the prepare in both the cases and let
> * downstream decide whether to process or skip it. We don't need to
> * decode the xact for aborts if it is not done already.
> */

The way I read the surrounding code, the only case a 2PC transaction does
not get decoded at prepare time is if the transaction is empty. Or are
you aware of any other situation that might currently happen?

> (unless the server needs to be restarted due to some reason)

Right, the repetition occurs only after a restart of the walsender in
between a prepare and a commit prepared record.

> That anyway is true without this work as well where restart_lsn can be
> advanced on commits. We haven't changed anything in that regard.

I did not mean to blame the patch, but merely try to understand some of
the design decisions behind it.

And as I just learned, even if we managed to avoid the repetition, a
restarted walsender still needs to see prepared transactions as
in-progress in its snapshots. So we cannot move forward the restart_lsn
to after a prepare record (until the final commit or rollback is
consumed).

Best Regards

Markus
On Mon, Feb 8, 2021 at 8:36 PM Markus Wanner
<[hidden email]> wrote:
>
> Hello Amit,
>
> thanks for your very quick response.
>
> On 08.02.21 11:13, Amit Kapila wrote:
> > /*
> > * It is possible that this transaction is not decoded at prepare time
> > * either because by that time we didn't have a consistent snapshot or it
> > * was decoded earlier but we have restarted. We can't distinguish between
> > * those two cases so we send the prepare in both the cases and let
> > * downstream decide whether to process or skip it. We don't need to
> > * decode the xact for aborts if it is not done already.
> > */
>
> The way I read the surrounding code, the only case a 2PC transaction
> does not get decoded at prepare time is if the transaction is empty. Or
> are you aware of any other situation that might currently happen?
>

We also skip decoding at prepare time if we haven't reached a
consistent snapshot by that time. See below code in DecodePrepare().

DecodePrepare()
{
..
    /* We can't start streaming unless a consistent state is reached. */
    if (SnapBuildCurrentState(builder) < SNAPBUILD_CONSISTENT)
    {
        ReorderBufferSkipPrepare(ctx->reorder, xid);
        return;
    }
..
}

There are other reasons as well, like the output plugin not wanting to
allow decoding at prepare time, but I don't think those are relevant to
the discussion here.

> > (unless the server needs to be restarted due to some reason)
>
> Right, the repetition occurs only after a restart of the walsender in
> between a prepare and a commit prepared record.
>
> > That anyway is true without this work as well where restart_lsn can be
> > advanced on commits. We haven't changed anything in that regard.
>
> I did not mean to blame the patch, but merely try to understand some of
> the design decisions behind it.
>
> And as I just learned, even if we managed to avoid the repetition, a
> restarted walsender still needs to see prepared transactions as
> in-progress in its snapshots. So we cannot move forward the restart_lsn
> to after a prepare record (until the final commit or rollback is
> consumed).
>

Right. And say we forget the prepared transactions and move forward with
restart_lsn once we get the prepare for any transaction: then we open up
a window where we haven't actually sent the prepared xact (because, say,
the snapshot had not yet reached a consistent state) but have already
moved the restart_lsn. Later, when we get the commit corresponding to the
prepared transaction, by which time the snapshot may have reached a
consistent state, we would miss sending the transaction contents and the
prepare for it. I think for such reasons we allow restart_lsn to be moved
only once the transaction is finished (committed or rolled back).

--
With Regards,
Amit Kapila.
On Tue, Feb 9, 2021 at 8:32 AM Amit Kapila <[hidden email]> wrote:
>
> On Mon, Feb 8, 2021 at 8:36 PM Markus Wanner
> <[hidden email]> wrote:
> >
> > Hello Amit,
> >
> > thanks for your very quick response.
> >
> > On 08.02.21 11:13, Amit Kapila wrote:
> > > /*
> > > * It is possible that this transaction is not decoded at prepare time
> > > * either because by that time we didn't have a consistent snapshot or it
> > > * was decoded earlier but we have restarted. We can't distinguish between
> > > * those two cases so we send the prepare in both the cases and let
> > > * downstream decide whether to process or skip it. We don't need to
> > > * decode the xact for aborts if it is not done already.
> > > */
> >
> > The way I read the surrounding code, the only case a 2PC transaction
> > does not get decoded at prepare time is if the transaction is empty. Or
> > are you aware of any other situation that might currently happen?
> >
>
> We also skip decoding at prepare time if we haven't reached a
> consistent snapshot by that time. See below code in DecodePrepare().
>
> DecodePrepare()
> {
> ..
>     /* We can't start streaming unless a consistent state is reached. */
>     if (SnapBuildCurrentState(builder) < SNAPBUILD_CONSISTENT)
>     {
>         ReorderBufferSkipPrepare(ctx->reorder, xid);
>         return;
>     }
> ..
> }

Can you please provide steps which can lead to this situation? If there
is an earlier discussion which has example scenarios, please point us to
the relevant thread.

If we are not sending PREPARED transactions that's fine, but sending the
same prepared transaction as many times as the WAL sender is restarted
between sending prepare and commit prepared is a waste of network
bandwidth. The wastage is proportional to the amount of changes in the
transaction and number of such transactions themselves. Also this will
cause performance degradation. So if we can avoid resending prepared
transactions twice that will help.

--
Best Wishes,
Ashutosh Bapat
On Tue, Feb 9, 2021 at 4:59 PM Ashutosh Bapat
<[hidden email]> wrote:

> Can you please provide steps which can lead to this situation? If
> there is an earlier discussion which has example scenarios, please
> point us to the relevant thread.
>
> If we are not sending PREPARED transactions that's fine, but sending
> the same prepared transaction as many times as the WAL sender is
> restarted between sending prepare and commit prepared is a waste of
> network bandwidth. The wastage is proportional to the amount of
> changes in the transaction and number of such transactions themselves.
> Also this will cause performance degradation. So if we can avoid
> resending prepared transactions twice that will help.

One such scenario is explained in the test case in
postgres/contrib/test_decoding/specs/twophase_snapshot.spec.

regards,
Ajin Cherian
Fujitsu Australia
On Tue, Feb 9, 2021 at 11:29 AM Ashutosh Bapat
<[hidden email]> wrote:
>
> On Tue, Feb 9, 2021 at 8:32 AM Amit Kapila <[hidden email]> wrote:
> >
> > On Mon, Feb 8, 2021 at 8:36 PM Markus Wanner
> > <[hidden email]> wrote:
> > >
> > > Hello Amit,
> > >
> > > thanks for your very quick response.
> > >
> > > On 08.02.21 11:13, Amit Kapila wrote:
> > > > /*
> > > > * It is possible that this transaction is not decoded at prepare time
> > > > * either because by that time we didn't have a consistent snapshot or it
> > > > * was decoded earlier but we have restarted. We can't distinguish between
> > > > * those two cases so we send the prepare in both the cases and let
> > > > * downstream decide whether to process or skip it. We don't need to
> > > > * decode the xact for aborts if it is not done already.
> > > > */
> > >
> > > The way I read the surrounding code, the only case a 2PC transaction
> > > does not get decoded at prepare time is if the transaction is empty. Or
> > > are you aware of any other situation that might currently happen?
> > >
> >
> > We also skip decoding at prepare time if we haven't reached a
> > consistent snapshot by that time. See below code in DecodePrepare().
> >
> > DecodePrepare()
> > {
> > ..
> >     /* We can't start streaming unless a consistent state is reached. */
> >     if (SnapBuildCurrentState(builder) < SNAPBUILD_CONSISTENT)
> >     {
> >         ReorderBufferSkipPrepare(ctx->reorder, xid);
> >         return;
> >     }
> > ..
> > }
>
> Can you please provide steps which can lead to this situation?
>

Ajin has already shared the example with you.

> If there is an earlier discussion which has example scenarios, please
> point us to the relevant thread.
>

It started in the email [1], and from there you can read the later emails
to know more about this.

> If we are not sending PREPARED transactions that's fine,
>

Hmm, I am not sure if that is fine, because if the output plugin sets the
two-phase-commit option, it would expect all prepared xacts to arrive,
not only some of them.

> but sending the same prepared transaction as many times as the WAL
> sender is restarted between sending prepare and commit prepared is a
> waste of network bandwidth.
>

I think similar happens without any of the work done in PG-14 as well if
we restart the apply worker before the commit completes on the
subscriber. After the restart, we will send the start_decoding_at point
based on some previous commit which will make publisher send the entire
transaction again. I don't think restart of WAL sender or WAL receiver is
such a common thing. It can only happen due to some bug in code or user
wishes to stop the nodes or some crash happened.

[1] - https://www.postgresql.org/message-id/CAA4eK1%2Bd3gzCyzsYjt1m6sfGf_C_uFmo9JK%3D3Wafp6yR8Mg8uQ%40mail.gmail.com

--
With Regards,
Amit Kapila.
On Tue, Feb 9, 2021 at 6:57 AM Amit Kapila <[hidden email]> wrote:
> I think similar happens without any of the work done in PG-14 as well
> if we restart the apply worker before the commit completes on the
> subscriber. After the restart, we will send the start_decoding_at
> point based on some previous commit which will make publisher send the
> entire transaction again. I don't think restart of WAL sender or WAL
> receiver is such a common thing. It can only happen due to some bug in
> code or user wishes to stop the nodes or some crash happened.

Really? My impression is that the logical replication protocol is
supposed to be designed in such a way that once a transaction is
successfully confirmed, it won't be sent again. Now if something is not
confirmed then it has to be sent again. But if it is confirmed then it
shouldn't happen.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Feb 10, 2021 at 12:08 AM Robert Haas <[hidden email]> wrote:
>
> On Tue, Feb 9, 2021 at 6:57 AM Amit Kapila <[hidden email]> wrote:
> > I think similar happens without any of the work done in PG-14 as well
> > if we restart the apply worker before the commit completes on the
> > subscriber. After the restart, we will send the start_decoding_at
> > point based on some previous commit which will make publisher send the
> > entire transaction again. I don't think restart of WAL sender or WAL
> > receiver is such a common thing. It can only happen due to some bug in
> > code or user wishes to stop the nodes or some crash happened.
>
> Really? My impression is that the logical replication protocol is
> supposed to be designed in such a way that once a transaction is
> successfully confirmed, it won't be sent again. Now if something is
> not confirmed then it has to be sent again. But if it is confirmed
> then it shouldn't happen.
>

If by successfully confirmed, you mean that once the subscriber node has
received, it won't be sent again then as far as I know that is not true.
We rely on the flush location sent by the subscriber to advance the
decoding locations. We update the flush locations after we apply the
transaction's commit successfully. Also, after the restart, we use the
replication origin's last flush location as a point from where we need
the transactions and the origin's progress is updated at commit time.

OTOH, if by successfully confirmed, you mean that once the subscriber has
applied the complete transaction (including commit), then you are right
that it won't be sent again.

--
With Regards,
Amit Kapila.
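For reference, the positions involved here can be inspected from SQL (a
rough sketch; view and column names as in recent releases):

-- on the publisher: what the subscriber has confirmed as flushed
SELECT slot_name, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots;

SELECT application_name, sent_lsn, flush_lsn
FROM pg_stat_replication;

-- on the subscriber: replication origin progress, advanced at commit time
SELECT external_id, remote_lsn, local_lsn
FROM pg_replication_origin_status;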
On Wed, Feb 10, 2021 at 8:02 AM Amit Kapila <[hidden email]> wrote:
> On Wed, Feb 10, 2021 at 12:08 AM Robert Haas <[hidden email]> wrote:

I think we need to treat a prepared transaction slightly different from
an uncommitted transaction when sending downstream. We need to send a
whole uncommitted transaction downstream again because previously applied
changes must have been aborted and hence lost by the downstream and thus
it needs to get all of those again. But when a downstream prepares a
transaction, even if it's not committed, those changes are not lost even
after restart of a walsender. If the downstream confirms that it has
"flushed" PREPARE, there is no need to send all the changes again.

--
Best Wishes,
Ashutosh
On Wed, Feb 10, 2021 at 3:43 PM Ashutosh Bapat
<[hidden email]> wrote:
>
> I think we need to treat a prepared transaction slightly different from
> an uncommitted transaction when sending downstream. We need to send a
> whole uncommitted transaction downstream again because previously
> applied changes must have been aborted and hence lost by the downstream
> and thus it needs to get all of those again. But when a downstream
> prepares a transaction, even if it's not committed, those changes are
> not lost even after restart of a walsender. If the downstream confirms
> that it has "flushed" PREPARE, there is no need to send all the changes
> again.

But the other side of the problem is that, without this, if the prepared
transaction is prior to a consistent snapshot when decoding
starts/restarts, then only the "commit prepared" is sent to downstream
(as seen in the test scenario I shared above), and downstream has to
error away the commit prepared because it does not have the corresponding
prepared transaction.

We did not find an easy way to distinguish between these two scenarios
for prepared transactions:

a. A consistent snapshot being formed in between a prepare and a commit
   prepared for the first time.
b. Decoder restarting between a prepare and a commit prepared.

For plugins to be able to handle this, we have added a special callback
"Begin Prepare", as explained in [1] section 49.6.4.10:

"The required begin_prepare_cb callback is called whenever the start of a
prepared transaction has been decoded. The gid field, which is part of
the txn parameter, can be used in this callback to check if the plugin
has already received this prepare, in which case it can skip the
remaining changes of the transaction. This can only happen if the user
restarts the decoding after receiving the prepare for a transaction but
before receiving the commit prepared, say because of some error."

The pgoutput plugin is also being updated to be able to handle this
situation of duplicate prepared transactions.

regards,
Ajin Cherian
Fujitsu Australia
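To illustrate, a minimal, purely hypothetical sketch of how an output
plugin might use begin_prepare_cb to skip a duplicate prepare (this is
not taken from the patch or from pgoutput; peer_has_prepared() stands in
for whatever GID tracking the receiving side provides):

#include "postgres.h"

#include "lib/stringinfo.h"
#include "replication/logical.h"
#include "replication/reorderbuffer.h"
#include "utils/rel.h"

typedef struct SketchDecodingData
{
    bool    skip_xact;  /* true while replaying an already-known prepare */
} SketchDecodingData;

/* hypothetical helper: does the downstream already have this GID prepared? */
static bool
peer_has_prepared(const char *gid)
{
    return false;
}

static void
sketch_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
    SketchDecodingData *data = ctx->output_plugin_private;

    /* txn->gid is the identifier given to PREPARE TRANSACTION */
    data->skip_xact = peer_has_prepared(txn->gid);
    if (data->skip_xact)
        return;             /* duplicate: emit nothing for this transaction */

    OutputPluginPrepareWrite(ctx, true);
    appendStringInfo(ctx->out, "BEGIN PREPARE %s", txn->gid);
    OutputPluginWrite(ctx, true);
}

static void
sketch_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
              Relation relation, ReorderBufferChange *change)
{
    SketchDecodingData *data = ctx->output_plugin_private;

    if (data->skip_xact)
        return;             /* drop the changes of a duplicate prepare */

    /* ... normal per-change output would go here ... */
}

A real plugin would additionally register these callbacks in
_PG_output_plugin_init(), set up output_plugin_private in its startup
callback, and reset the skip flag again in its prepare_cb; that plumbing
is omitted here.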
On Wed, Feb 10, 2021 at 11:45 AM Ajin Cherian <[hidden email]> wrote:
>
> On Wed, Feb 10, 2021 at 3:43 PM Ashutosh Bapat
> <[hidden email]> wrote:
> >
> > I think we need to treat a prepared transaction slightly different from
> > an uncommitted transaction when sending downstream. We need to send a
> > whole uncommitted transaction downstream again because previously
> > applied changes must have been aborted and hence lost by the downstream
> > and thus it needs to get all of those again. But when a downstream
> > prepares a transaction, even if it's not committed, those changes are
> > not lost even after restart of a walsender. If the downstream confirms
> > that it has "flushed" PREPARE, there is no need to send all the changes
> > again.
>
> But the other side of the problem is that, without this, if the
> prepared transaction is prior to a consistent snapshot when decoding
> starts/restarts, then only the "commit prepared" is sent to downstream
> (as seen in the test scenario I shared above), and downstream has to
> error away the commit prepared because it does not have the
> corresponding prepared transaction.
>

I think it is not only simple error handling, it is required for data
consistency. We need to send the transactions whose commits are
encountered after a consistent snapshot is reached.

--
With Regards,
Amit Kapila.
On 10.02.21 07:32, Amit Kapila wrote:
> On Wed, Feb 10, 2021 at 11:45 AM Ajin Cherian <[hidden email]> wrote:
>> But the other side of the problem is that, without this, if the
>> prepared transaction is prior to a consistent snapshot when decoding
>> starts/restarts, then only the "commit prepared" is sent to downstream
>> (as seen in the test scenario I shared above), and downstream has to
>> error away the commit prepared because it does not have the
>> corresponding prepared transaction.
>
> I think it is not only simple error handling, it is required for data
> consistency. We need to send the transactions whose commits are
> encountered after a consistent snapshot is reached.

I'm with Ashutosh here. If a replica is properly in sync, it knows about
prepared transactions and all the gids of those. Sending the
transactional changes and the prepare again is inconsistent.

The point of a two-phase transaction is to have two phases. An output
plugin must have the chance of treating them as independent events. Once
a PREPARE is confirmed, it must not be sent again, even if the
transaction is still in progress and its changes are not yet visible on
the origin node.

Regards

Markus
On Wed, Feb 10, 2021 at 1:40 PM Markus Wanner
<[hidden email]> wrote:
>
> On 10.02.21 07:32, Amit Kapila wrote:
> > On Wed, Feb 10, 2021 at 11:45 AM Ajin Cherian <[hidden email]> wrote:
> >> But the other side of the problem is that, without this, if the
> >> prepared transaction is prior to a consistent snapshot when decoding
> >> starts/restarts, then only the "commit prepared" is sent to downstream
> >> (as seen in the test scenario I shared above), and downstream has to
> >> error away the commit prepared because it does not have the
> >> corresponding prepared transaction.
> >
> > I think it is not only simple error handling, it is required for data
> > consistency. We need to send the transactions whose commits are
> > encountered after a consistent snapshot is reached.
>
> I'm with Ashutosh here. If a replica is properly in sync, it knows
> about prepared transactions and all the gids of those. Sending the
> transactional changes and the prepare again is inconsistent.
>
> The point of a two-phase transaction is to have two phases. An output
> plugin must have the chance of treating them as independent events.
>

I am not sure I understand what problem you are facing in dealing with
this in the output plugin; it is explained in the docs, and Ajin also
pointed out the same. Ajin and I have explained to you the design
constraints on the publisher side due to which we have done it this way.
Do you have any better ideas to deal with this?

--
With Regards,
Amit Kapila.
On Tue, Feb 9, 2021 at 9:32 PM Amit Kapila <[hidden email]> wrote:
> If by successfully confirmed, you mean that once the subscriber node
> has received, it won't be sent again then as far as I know that is not
> true. We rely on the flush location sent by the subscriber to advance
> the decoding locations. We update the flush locations after we apply
> the transaction's commit successfully. Also, after the restart, we use
> the replication origin's last flush location as a point from where we
> need the transactions and the origin's progress is updated at commit
> time.
>
> OTOH, if by successfully confirmed, you mean that once the subscriber
> has applied the complete transaction (including commit), then you are
> right that it won't be sent again.

I meant: once the subscriber has advanced the flush location.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Feb 8, 2021 at 2:01 PM Markus Wanner
<[hidden email]> wrote:
>
> I did not expect this, as any receiver that wants to have decoded 2PC is
> likely supporting some kind of two-phase commits itself, and would
> therefore prepare the transaction upon its first reception. Potentially
> receiving it a second time would require complicated filtering on every
> prepared transaction.
>

I would like to bring one other scenario to your notice where you might
want to handle things differently for prepared transactions on the plugin
side. Assume we have multiple publications (for simplicity, say 2) on the
publisher with corresponding subscriptions (say 2, each corresponding to
one publication on the publisher). When a user performs a transaction on
the publisher that involves tables from both publications, on the
subscriber side we do that work via two different transactions,
corresponding to each subscription. But we need some way to deal with
prepared xacts because they need a GID and we can't use the same GID for
both subscriptions. Please see the detailed example and one idea to deal
with this in the main thread [1]. It would be really helpful if you or
others working on the plugin side could share your opinion on it.

Now, coming back to the restart case where the prepared transaction can
be sent again by the publisher. I understand your and others' point that
we should not send the prepared transaction again if there is a restart
between prepare and commit prepared, but there are reasons why we have
done it that way, and I am open to your suggestions. I'll once again try
to explain the exact case to you which is not very apparent. The basic
idea is that we ship/replay all transactions where commit happens after
the snapshot has a consistent state (SNAPBUILD_CONSISTENT), see atop
snapbuild.c for details. Now, for transactions where prepare is before
snapshot state SNAPBUILD_CONSISTENT and commit prepared is after
SNAPBUILD_CONSISTENT, we need to send the entire transaction including
prepare at the commit time.

One might think it is quite easy to detect that: basically, if we skip
the prepare when the snapshot state was not SNAPBUILD_CONSISTENT, mark a
flag in ReorderBufferTXN and use it during commit to decide whether to
send the prepare. Unfortunately, it is not that easy. There is always a
chance that on restart we reuse a snapshot serialized by some other
walsender at a location prior to the prepare, and if that happens then
this time the prepare won't be skipped due to snapshot state
(SNAPBUILD_CONSISTENT) but due to the start_decoding_at point
(considering we have already shipped some of the later commits but not
the prepare). This actually becomes the same situation as when the
restart has happened after we have sent the prepare but not the commit.
This is the reason we have to resend the prepare when the subscriber
restarts between prepare and commit.

You can reproduce the case where we can't distinguish between the two
situations by using the test case in twophase_snapshot.spec and
additionally starting a separate session via the debugger.

So, the steps in the test case are as below:

"s2b" "s2txid" "s1init" "s3b" "s3txid" "s2c" "s2b" "s2insert" "s2p" "s3c"
"s1insert" "s1start" "s2cp" "s1start"

Define new steps as:

"s4init" {SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot_1', 'test_decoding');}
"s4start" {SELECT data FROM pg_logical_slot_get_changes('isolation_slot_1', NULL, NULL, 'include-xids', 'false', 'skip-empty-xacts', '1', 'two-phase-commit', '1');}

The first thing we need to do is s4init and stop the debugger in
SnapBuildProcessRunningXacts. Now perform the steps from "s2b" till the
first "s1start" in twophase_snapshot.spec. Then continue in the s4
session and perform s4start. After this, if you debug (or add logs to)
the second s1start, you will notice that we are skipping the prepare not
because of an inconsistent snapshot but because of a forward location in
start_decoding_at. If you don't involve session 4, then it will always
skip the prepare due to an inconsistent snapshot state. This involves a
debugger, so it is not easy to write an automated test for it.

I have used a bit of a tricky scenario to explain this, but I am not sure
if there was any other simpler way.

[1] - https://www.postgresql.org/message-id/CAA4eK1%2BLvkeX%3DB3xon7RcBwD4CVaFSryPj3pTBAALrDxQVPDwA%40mail.gmail.com

--
With Regards,
Amit Kapila.
Hello Amit,
thanks a lot for your extensive explanation and examples, I appreciate
this very much. I'll need to think this through and see how we can make
this work for us.

Best Regards

Markus
On Thu, Feb 11, 2021 at 5:37 AM Amit Kapila <[hidden email]> wrote:
> to explain the exact case to you which is not very apparent. The basic
> idea is that we ship/replay all transactions where commit happens
> after the snapshot has a consistent state (SNAPBUILD_CONSISTENT), see
> atop snapbuild.c for details. Now, for transactions where prepare is
> before snapshot state SNAPBUILD_CONSISTENT and commit prepared is
> after SNAPBUILD_CONSISTENT, we need to send the entire transaction
> including prepare at the commit time.

This might be a dumb question, but: why?

Is this because the effects of the prepared transaction might otherwise
be included neither in the initial synchronization of the data nor in any
subsequently decoded transaction, thus leaving the replica out of sync?

--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Feb 12, 2021 at 1:10 AM Robert Haas <[hidden email]> wrote:
>
> On Thu, Feb 11, 2021 at 5:37 AM Amit Kapila <[hidden email]> wrote:
> > to explain the exact case to you which is not very apparent. The basic
> > idea is that we ship/replay all transactions where commit happens
> > after the snapshot has a consistent state (SNAPBUILD_CONSISTENT), see
> > atop snapbuild.c for details. Now, for transactions where prepare is
> > before snapshot state SNAPBUILD_CONSISTENT and commit prepared is
> > after SNAPBUILD_CONSISTENT, we need to send the entire transaction
> > including prepare at the commit time.
>
> This might be a dumb question, but: why?
>
> Is this because the effects of the prepared transaction might
> otherwise be included neither in the initial synchronization of the
> data nor in any subsequently decoded transaction, thus leaving the
> replica out of sync?
>

Yes.

--
With Regards,
Amit Kapila.
Hi,
On 2021-02-10 08:02:17 +0530, Amit Kapila wrote:
> On Wed, Feb 10, 2021 at 12:08 AM Robert Haas <[hidden email]> wrote:
> >
> > On Tue, Feb 9, 2021 at 6:57 AM Amit Kapila <[hidden email]> wrote:
> > > I think similar happens without any of the work done in PG-14 as well
> > > if we restart the apply worker before the commit completes on the
> > > subscriber. After the restart, we will send the start_decoding_at
> > > point based on some previous commit which will make publisher send the
> > > entire transaction again. I don't think restart of WAL sender or WAL
> > > receiver is such a common thing. It can only happen due to some bug in
> > > code or user wishes to stop the nodes or some crash happened.
> >
> > Really? My impression is that the logical replication protocol is
> > supposed to be designed in such a way that once a transaction is
> > successfully confirmed, it won't be sent again. Now if something is
> > not confirmed then it has to be sent again. But if it is confirmed
> > then it shouldn't happen.

Correct.

> If by successfully confirmed, you mean that once the subscriber node
> has received, it won't be sent again then as far as I know that is not
> true. We rely on the flush location sent by the subscriber to advance
> the decoding locations. We update the flush locations after we apply
> the transaction's commit successfully. Also, after the restart, we use
> the replication origin's last flush location as a point from where we
> need the transactions and the origin's progress is updated at commit
> time.

That's not quite right. Yes, the flush location isn't guaranteed to be
updated at that point, but a replication client will send the last
location it has received and successfully processed, and that has to
*guarantee* that it won't receive anything twice, or miss something.
Otherwise you've broken the protocol.

Greetings,

Andres Freund