Single transaction in the tablesync worker?

akapila
The tablesync worker in logical replication performs the table data
sync in a single transaction, which means it copies the initial data
and then catches up with the apply worker in the same transaction.
There is a comment in LogicalRepSyncTableStart ("We want to do the
table data sync in a single transaction.") saying so, but I can't find
the concrete reasoning behind it. Is there any fundamental problem if
we commit the transaction after the initial copy and slot creation in
LogicalRepSyncTableStart and then allow transactions to be applied as
happens in the apply worker? I have tried doing so in the attached
patch (a quick prototype to test) and didn't find any problems with
the regression tests. I have tried a few manual tests as well to see
if it works and didn't find any problem. Now, it is quite possible
that it is mandatory to do it the way we are doing currently, or maybe
something else is required to remove this requirement, but I think we
can do better with respect to the comments in this area.
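
(For reference, a minimal manual test of this path could look like the
following; the table, publication, and subscription names here are
hypothetical, not the ones used in the actual tests.)

-- on the publisher
CREATE TABLE t1(a int PRIMARY KEY);
INSERT INTO t1 SELECT generate_series(1, 100000);
CREATE PUBLICATION pub FOR TABLE t1;

-- on the subscriber
CREATE TABLE t1(a int PRIMARY KEY);
CREATE SUBSCRIPTION sub
        CONNECTION 'host=localhost port=5432 dbname=postgres'
        PUBLICATION pub;

-- on the publisher, while the tablesync worker is still copying, so
-- that the catch-up phase has transactions to apply:
INSERT INTO t1 SELECT generate_series(100001, 200000);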

The reason why I am looking into this area is to support the logical
decoding of prepared transactions. See the problem [1] reported by
Peter Smith. Basically, when we stream prepared transactions to the
tablesync worker, it will simply commit them due to the requirement of
maintaining a single transaction for the entire duration of the copy
and the streaming of transactions. Now, we can fix that problem by
disabling the decoding of prepared xacts in the tablesync worker. But
that gives rise to a different kind of problem: the prepare will not
be sent by the publisher, but a later commit might move the LSN
forward far enough for the tablesync worker to catch up with the apply
worker. So, now the prepared transaction will be skipped by both the
tablesync and apply workers.

I think apart from unblocking the development of 'logical decoding of
prepared xacts', it will make the code consistent between the apply
and tablesync workers and reduce the chances of future bugs in this
area. Basically, it will reduce the checks related to
am_tablesync_worker() at various places in the code.

I see that this code is added as part of commit
7c4f52409a8c7d85ed169bbbc1f6092274d03920 (Logical replication support
for initial data copy).

Thoughts?

[1] - https://www.postgresql.org/message-id/CAHut+PuEMk4SO8oGzxc_ftzPkGA8uC-y5qi-KRqHSy_P0i30DA@...

--
With Regards,
Amit Kapila.

Attachment: v1-0001-Allow-more-than-one-transaction-in-tablesync-work.patch (4K)

Re: Single transaction in the tablesync worker?

Ashutosh Bapat-2
On Thu, Dec 3, 2020 at 2:55 PM Amit Kapila <[hidden email]> wrote:

>
> The tablesync worker in logical replication performs the table data
> sync in a single transaction which means it will copy the initial data
> and then catch up with apply worker in the same transaction. There is
> a comment in LogicalRepSyncTableStart ("We want to do the table data
> sync in a single transaction.") saying so but I can't find the
> concrete theory behind the same. Is there any fundamental problem if
> we commit the transaction after initial copy and slot creation in
> LogicalRepSyncTableStart and then allow the apply of transactions as
> it happens in apply worker? I have tried doing so in the attached (a
> quick prototype to test) and didn't find any problems with regression
> tests. I have tried a few manual tests as well to see if it works and
> didn't find any problem. Now, it is quite possible that it is
> mandatory to do the way we are doing currently, or maybe something
> else is required to remove this requirement but I think we can do
> better with respect to comments in this area.

If we commit the initial copy, the data up to the initial copy's
snapshot will be visible downstream. If we apply the changes by
committing per transaction, the data visible to other transactions
will differ as the apply progresses. You haven't clarified whether we
will respect the transaction boundaries in the apply log or not; I
assume we will. Whereas if we apply all the changes in one go, other
transactions see the data either before the resync or after it,
without any intermediate states. That will not violate consistency, I
think.

That's all I can think of as the reason behind doing the whole resync
as a single transaction.

--
Best Wishes,
Ashutosh Bapat



Re: Single transaction in the tablesync worker?

akapila
On Thu, Dec 3, 2020 at 7:04 PM Ashutosh Bapat
<[hidden email]> wrote:

>
> On Thu, Dec 3, 2020 at 2:55 PM Amit Kapila <[hidden email]> wrote:
> >
> > The tablesync worker in logical replication performs the table data
> > sync in a single transaction which means it will copy the initial data
> > and then catch up with apply worker in the same transaction. There is
> > a comment in LogicalRepSyncTableStart ("We want to do the table data
> > sync in a single transaction.") saying so but I can't find the
> > concrete theory behind the same. Is there any fundamental problem if
> > we commit the transaction after initial copy and slot creation in
> > LogicalRepSyncTableStart and then allow the apply of transactions as
> > it happens in apply worker? I have tried doing so in the attached (a
> > quick prototype to test) and didn't find any problems with regression
> > tests. I have tried a few manual tests as well to see if it works and
> > didn't find any problem. Now, it is quite possible that it is
> > mandatory to do the way we are doing currently, or maybe something
> > else is required to remove this requirement but I think we can do
> > better with respect to comments in this area.
>
> If we commit the initial copy, the data upto the initial copy's
> snapshot will be visible downstream. If we apply the changes by
> committing changes per transaction, the data visible to the other
> transactions will differ as the apply progresses.
>

It is not clear what you mean by the above. The way you have written
it, it appears you are saying that instead of copying the initial
data, I am proposing to copy it transaction-by-transaction. But that
is not the case. I am saying: copy the initial data using the
REPEATABLE READ isolation level as we do now, commit it, and then
process transaction-by-transaction till we reach the sync point (the
point up to which the apply worker has already received the data).
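
(As a rough sketch of the current behavior being described: the
commands the tablesync worker sends over its publisher connection look
something like the below; the slot and table names are illustrative.
The proposal changes only what happens on the subscriber side
afterwards.)

BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ;
CREATE_REPLICATION_SLOT "pg_16398_sync_16384" TEMPORARY LOGICAL
    pgoutput USE_SNAPSHOT;
COPY public.t1 TO STDOUT;  -- runs under the snapshot set up by the slot
COMMIT;
-- proposal: also commit the subscriber-side transaction here, then
-- apply the streamed changes transaction-by-transaction up to the
-- sync point.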

> You haven't
> clarified whether we will respect the transaction boundaries in the
> apply log or not. I assume we will.
>

It will be transaction-by-transaction.

> Whereas if we apply all the
> changes in one go, other transactions either see the data before
> resync or after it without any intermediate states.
>

What is the problem even if the user is able to see the data after the
initial copy?

> That will not
> violate consistency, I think.
>

I am not sure how consistency will be broken.

> That's all I can think of as the reason behind doing a whole resync as
> a single transaction.
>

Thanks for sharing your thoughts.

--
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

Craig Ringer-5
In reply to this post by akapila
On Thu, 3 Dec 2020 at 17:25, Amit Kapila <[hidden email]> wrote:

> Is there any fundamental problem if
> we commit the transaction after initial copy and slot creation in
> LogicalRepSyncTableStart and then allow the apply of transactions as
> it happens in apply worker?

No fundamental problem. Both approaches are fine. Committing the
initial copy then doing the rest in individual txns means an
incomplete sync state for the table becomes visible, which may not be
ideal. Ideally we'd do something like sync the data into a clone of
the table then swap the table relfilenodes out once we're synced up.

IMO the main advantage of committing as we go is that it would let us
use a non-temporary slot and support recovering an incomplete sync and
finishing it after interruption by connection loss, crash, etc. That
would be advantageous for big table syncs or where the sync has lots
of lag to replay. But it means we have to remember sync states, and
give users a way to cancel/abort them. Otherwise forgotten temp slots
for syncs will cause a mess on the upstream.

It also allows the sync slot to advance, freeing any held upstream
resources before the whole sync is done, which is good if the upstream
is busy and generating lots of WAL.

Finally, committing as we go means we won't exceed the CommandId
increment limit in a single txn (CommandIds are 32-bit, so a single
transaction is limited to about four billion commands).

> The reason why I am looking into this area is to support the logical
> decoding of prepared transactions. See the problem [1] reported by
> Peter Smith. Basically, when we stream prepared transactions in the
> tablesync worker, it will simply commit the same due to the
> requirement of maintaining a single transaction for the entire
> duration of copy and streaming of transactions. Now, we can fix that
> problem by disabling the decoding of prepared xacts in tablesync
> worker.

Tablesync should indeed only receive a txn when the commit arrives, it
should not attempt to handle uncommitted prepared xacts.

> But that gives rise to a different kind of problem: the prepare will
> not be sent by the publisher, but a later commit might move the LSN
> forward far enough for the tablesync worker to catch up with the
> apply worker. So, now the prepared transaction will be skipped by
> both the tablesync and apply workers.

I'm not sure I understand. If what you describe is possible then
there's already a bug in prepared xact handling. Prepared xact commit
progress should be tracked by commit lsn, not by prepare lsn.

Can you set out the ordering of events in more detail?

> I think apart from unblocking the development of 'logical decoding of
> prepared xacts', it will make the code consistent between apply and
> tablesync worker and reduce the chances of future bugs in this area.
> Basically, it will reduce the checks related to am_tablesync_worker()
> at various places in the code.

I think we made similar changes in pglogical to switch to applying
sync work in individual txns.



Re: Single transaction in the tablesync worker?

akapila
On Fri, Dec 4, 2020 at 7:53 AM Craig Ringer
<[hidden email]> wrote:

>
> On Thu, 3 Dec 2020 at 17:25, Amit Kapila <[hidden email]> wrote:
>
>
> > The reason why I am looking into this area is to support the logical
> > decoding of prepared transactions. See the problem [1] reported by
> > Peter Smith. Basically, when we stream prepared transactions in the
> > tablesync worker, it will simply commit the same due to the
> > requirement of maintaining a single transaction for the entire
> > duration of copy and streaming of transactions. Now, we can fix that
> > problem by disabling the decoding of prepared xacts in tablesync
> > worker.
>
> Tablesync should indeed only receive a txn when the commit arrives, it
> should not attempt to handle uncommitted prepared xacts.
>

Why? If we go with the commit-as-we-go approach for individual
transactions in the tablesync worker then this shouldn't be a problem.

> > But that gives rise to a different kind of problem: the prepare will
> > not be sent by the publisher, but a later commit might move the LSN
> > forward far enough for the tablesync worker to catch up with the
> > apply worker. So, now the prepared transaction will be skipped by
> > both the tablesync and apply workers.
>
> I'm not sure I understand. If what you describe is possible then
> there's already a bug in prepared xact handling. Prepared xact commit
> progress should be tracked by commit lsn, not by prepare lsn.
>

Oh no, I am talking about commit of some other transaction.

> Can you set out the ordering of events in more detail?
>

Sure. It will be something like the following, where the apply worker
is ahead of the sync worker:

Assume t1 has some data which the tablesync worker has to copy first.

tx1
Begin;
Insert into t1....
Prepare Transaction 'foo'

tx2
Begin;
Insert into t1....
Commit

apply worker
• tx1: replays - does not apply anything because
should_apply_changes_for_rel thinks the relation is not ready
• tx2: replays - does not apply anything because
should_apply_changes_for_rel thinks the relation is not ready

tablesync worker
• tx1: handles: BEGIN - INSERT - PREPARE 'foo';  (but tablesync gets
nothing because, say, we disable 2PC for it)
• tx2: handles: BEGIN - INSERT - COMMIT;
• tablesync exits

Now the situation is that the apply worker has skipped the prepared
xact data and the tablesync worker has not received it, so it has not
applied it. Next, when we get Commit Prepared for tx1, it will
silently commit the prepared transaction without any data being
updated. The commit prepared won't error out on the subscriber,
because the prepare would have been successful even though the data
was skipped via should_apply_changes_for_rel.
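
(A hypothetical way to observe the end state on the subscriber: tx1's
row never arrives, yet the COMMIT PREPARED resolves without error.)

-- on the subscriber, after COMMIT PREPARED 'foo' has been replayed
SELECT count(*) FROM t1;            -- tx1's insert is missing
SELECT gid FROM pg_prepared_xacts;  -- empty; the prepare was resolved anyway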

> > I think apart from unblocking the development of 'logical decoding of
> > prepared xacts', it will make the code consistent between apply and
> > tablesync worker and reduce the chances of future bugs in this area.
> > Basically, it will reduce the checks related to am_tablesync_worker()
> > at various places in the code.
>
> I think we made similar changes in pglogical to switch to applying
> sync work in individual txns.
>

Oh, cool. Did you make the additional changes you mentioned in the
earlier part of your email?

--
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

akapila
In reply to this post by Craig Ringer-5
On Fri, Dec 4, 2020 at 7:53 AM Craig Ringer
<[hidden email]> wrote:

>
> On Thu, 3 Dec 2020 at 17:25, Amit Kapila <[hidden email]> wrote:
>
> > Is there any fundamental problem if
> > we commit the transaction after initial copy and slot creation in
> > LogicalRepSyncTableStart and then allow the apply of transactions as
> > it happens in apply worker?
>
> No fundamental problem. Both approaches are fine. Committing the
> initial copy then doing the rest in individual txns means an
> incomplete sync state for the table becomes visible, which may not be
> ideal. Ideally we'd do something like sync the data into a clone of
> the table then swap the table relfilenodes out once we're synced up.
>
> IMO the main advantage of committing as we go is that it would let us
> use a non-temporary slot and support recovering an incomplete sync and
> finishing it after interruption by connection loss, crash, etc. That
> would be advantageous for big table syncs or where the sync has lots
> of lag to replay. But it means we have to remember sync states, and
> give users a way to cancel/abort them. Otherwise forgotten temp slots
> for syncs will cause a mess on the upstream.
>
> It also allows the sync slot to advance, freeing any held upstream
> resources before the whole sync is done, which is good if the upstream
> is busy and generating lots of WAL.
>
> Finally, committing as we go means we won't exceed the cid increment
> limit in a single txn.
>


Yeah, all these are advantages of processing
transaction-by-transaction. IIUC, we primarily need to do two things
to achieve it: one is to have an additional state in the catalog (say
"catchup") which says that the initial copy is done; the other is to
have a permanent slot with which we can track progress, so that after
a restart (due to crash, connection break, etc.) we can start from
the appropriate position.

Apart from the above, I think with the current design of tablesync we
can see partial data of transactions because we allow all the
tablesync workers to run in parallel. Consider the below scenario:

CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
CREATE TABLE mytbl2(id SERIAL PRIMARY KEY, somedata int, text varchar(120));

Tx1
BEGIN;
INSERT INTO mytbl1(somedata, text) VALUES (1, 1);
INSERT INTO mytbl2(somedata, text) VALUES (1, 1);
COMMIT;

CREATE PUBLICATION mypublication FOR TABLE mytbl;

CREATE SUBSCRIPTION mysub
         CONNECTION 'host=localhost port=5432 dbname=postgres'
        PUBLICATION mypublication;

Tx2
BEGIN;
INSERT INTO mytbl1(somedata, text) VALUES (1, 2);
INSERT INTO mytbl2(somedata, text) VALUES (1, 2);
COMMIT;

Tx3
BEGIN;
INSERT INTO mytbl1(somedata, text) VALUES (1, 3);
INSERT INTO mytbl2(somedata, text) VALUES (1, 3);
COMMIT;

Now, I could see the below results on subscriber:

postgres=# select * from mytbl1;
 id | somedata | text
----+----------+------
(0 rows)


postgres=# select * from mytbl2;
 id | somedata | text
----+----------+------
  1 |        1 | 1
  2 |        1 | 2
  3 |        1 | 3
(3 rows)

Basically, the results of Tx1, Tx2, and Tx3 are visible in mytbl2 but
not in mytbl1. To reproduce this I stopped the tablesync workers (via
debugger) for mytbl1 and mytbl2 in LogicalRepSyncTableStart before
they change the relstate to SUBREL_STATE_SYNCWAIT, then allowed Tx2
and Tx3 to be processed by the apply worker, and then allowed the
tablesync worker for mytbl2 to proceed. After that, I can see the
above state.

Now, won't this behavior be considered a transaction inconsistency,
where partial transaction data or later transaction data is visible?
I don't think we can have such a situation on the master (publisher)
node or on a physical standby.

--
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

akapila
On Fri, Dec 4, 2020 at 10:29 AM Amit Kapila <[hidden email]> wrote:

>
> On Fri, Dec 4, 2020 at 7:53 AM Craig Ringer
> <[hidden email]> wrote:
> >
> > On Thu, 3 Dec 2020 at 17:25, Amit Kapila <[hidden email]> wrote:
> >
> > > Is there any fundamental problem if
> > > we commit the transaction after initial copy and slot creation in
> > > LogicalRepSyncTableStart and then allow the apply of transactions as
> > > it happens in apply worker?
> >
> > No fundamental problem. Both approaches are fine. Committing the
> > initial copy then doing the rest in individual txns means an
> > incomplete sync state for the table becomes visible, which may not be
> > ideal. Ideally we'd do something like sync the data into a clone of
> > the table then swap the table relfilenodes out once we're synced up.
> >
> > IMO the main advantage of committing as we go is that it would let us
> > use a non-temporary slot and support recovering an incomplete sync and
> > finishing it after interruption by connection loss, crash, etc. That
> > would be advantageous for big table syncs or where the sync has lots
> > of lag to replay. But it means we have to remember sync states, and
> > give users a way to cancel/abort them. Otherwise forgotten temp slots
> > for syncs will cause a mess on the upstream.
> >
> > It also allows the sync slot to advance, freeing any held upstream
> > resources before the whole sync is done, which is good if the upstream
> > is busy and generating lots of WAL.
> >
> > Finally, committing as we go means we won't exceed the cid increment
> > limit in a single txn.
> >
>
> Yeah, all these are advantages of processing
> transaction-by-transaction. IIUC, we need to primarily do two things
> to achieve it, one is to have an additional state in the catalog (say
> catch up) which will say that the initial copy is done. Then we need
> to have a permanent slot using which we can track the progress of the
> slot so that after restart (due to crash, connection break, etc.) we
> can start from the appropriate position.
>
> Apart from the above, I think with the current design of tablesync we
> can see partial data of transactions because we allow all the
> tablesync workers to run parallelly. Consider the below scenario:
>
> CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
> CREATE TABLE mytbl2(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
>
> Tx1
> BEGIN;
> INSERT INTO mytbl1(somedata, text) VALUES (1, 1);
> INSERT INTO mytbl2(somedata, text) VALUES (1, 1);
> COMMIT;
>
> CREATE PUBLICATION mypublication FOR TABLE mytbl;
>

Oops, the above statement should be: CREATE PUBLICATION mypublication
FOR TABLE mytbl1, mytbl2;

--
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

akapila
In reply to this post by akapila
On Fri, Dec 4, 2020 at 10:29 AM Amit Kapila <[hidden email]> wrote:

>
> On Fri, Dec 4, 2020 at 7:53 AM Craig Ringer
> <[hidden email]> wrote:
> >
> > On Thu, 3 Dec 2020 at 17:25, Amit Kapila <[hidden email]> wrote:
> >
> > > Is there any fundamental problem if
> > > we commit the transaction after initial copy and slot creation in
> > > LogicalRepSyncTableStart and then allow the apply of transactions as
> > > it happens in apply worker?
> >
> > No fundamental problem. Both approaches are fine. Committing the
> > initial copy then doing the rest in individual txns means an
> > incomplete sync state for the table becomes visible, which may not be
> > ideal. Ideally we'd do something like sync the data into a clone of
> > the table then swap the table relfilenodes out once we're synced up.
> >
> > IMO the main advantage of committing as we go is that it would let us
> > use a non-temporary slot and support recovering an incomplete sync and
> > finishing it after interruption by connection loss, crash, etc. That
> > would be advantageous for big table syncs or where the sync has lots
> > of lag to replay. But it means we have to remember sync states, and
> > give users a way to cancel/abort them. Otherwise forgotten temp slots
> > for syncs will cause a mess on the upstream.
> >
> > It also allows the sync slot to advance, freeing any held upstream
> > resources before the whole sync is done, which is good if the upstream
> > is busy and generating lots of WAL.
> >
> > Finally, committing as we go means we won't exceed the cid increment
> > limit in a single txn.
> >
>
>
> Yeah, all these are advantages of processing
> transaction-by-transaction. IIUC, we need to primarily do two things
> to achieve it, one is to have an additional state in the catalog (say
> catch up) which will say that the initial copy is done. Then we need
> to have a permanent slot using which we can track the progress of the
> slot so that after restart (due to crash, connection break, etc.) we
> can start from the appropriate position.
>
> Apart from the above, I think with the current design of tablesync we
> can see partial data of transactions because we allow all the
> tablesync workers to run parallelly. Consider the below scenario:
>
..
..

>
> Basically, the results for Tx1, Tx2, Tx3 are visible for mytbl2 but
> not for mytbl1. To reproduce this I have stopped the tablesync workers
> (via debugger) for mytbl1 and mytbl2 in LogicalRepSyncTableStart
> before it changes the relstate to SUBREL_STATE_SYNCWAIT. Then allowed
> Tx2 and Tx3 to be processed by apply worker and then allowed tablesync
> worker for mytbl2 to proceed. After that, I can see the above state.
>
> Now, won't this behavior be considered as transaction inconsistency
> where partial transaction data or later transaction data is visible? I
> don't think we can have such a situation on the master (publisher)
> node or in physical standby.
>

On briefly checking the pglogical code [1], it seems this problem
won't be there in pglogical, because it appears to first copy all the
tables (via pglogical_sync_table) in one process and then catch up
with the apply worker in a transaction-by-transaction manner. Am I
reading it correctly? If so, then why did we follow a different
approach for the in-core solution? Or is it that pglogical has
improved over time, but all the improvements couldn't be implemented
in-core because of some missing features?

[1] - https://github.com/2ndQuadrant/pglogical

--
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

Ashutosh Bapat-2
In reply to this post by akapila
On Thu, Dec 3, 2020 at 7:24 PM Amit Kapila <[hidden email]> wrote:

>
> On Thu, Dec 3, 2020 at 7:04 PM Ashutosh Bapat
> <[hidden email]> wrote:
> >
> > On Thu, Dec 3, 2020 at 2:55 PM Amit Kapila <[hidden email]> wrote:
> > >
> > > The tablesync worker in logical replication performs the table data
> > > sync in a single transaction which means it will copy the initial data
> > > and then catch up with apply worker in the same transaction. There is
> > > a comment in LogicalRepSyncTableStart ("We want to do the table data
> > > sync in a single transaction.") saying so but I can't find the
> > > concrete theory behind the same. Is there any fundamental problem if
> > > we commit the transaction after initial copy and slot creation in
> > > LogicalRepSyncTableStart and then allow the apply of transactions as
> > > it happens in apply worker? I have tried doing so in the attached (a
> > > quick prototype to test) and didn't find any problems with regression
> > > tests. I have tried a few manual tests as well to see if it works and
> > > didn't find any problem. Now, it is quite possible that it is
> > > mandatory to do the way we are doing currently, or maybe something
> > > else is required to remove this requirement but I think we can do
> > > better with respect to comments in this area.
> >
> > If we commit the initial copy, the data upto the initial copy's
> > snapshot will be visible downstream. If we apply the changes by
> > committing changes per transaction, the data visible to the other
> > transactions will differ as the apply progresses.
> >
>
> It is not clear what you mean by the above.  The way you have written
> appears that you are saying that instead of copying the initial data,
> I am saying to copy it transaction-by-transaction. But that is not the
> case. I am saying copy the initial data by using REPEATABLE READ
> isolation level as we are doing now, commit it and then process
> transaction-by-transaction till we reach sync-point (point till where
> apply worker has already received the data).

Craig in his mail has clarified this. The changes after the initial
COPY will be visible before the table sync catches up.

>
> > You haven't
> > clarified whether we will respect the transaction boundaries in the
> > apply log or not. I assume we will.
> >
>
> It will be transaction-by-transaction.
>
> > Whereas if we apply all the
> > changes in one go, other transactions either see the data before
> > resync or after it without any intermediate states.
> >
>
> What is the problem even if the user is able to see the data after the
> initial copy?
>
> > That will not
> > violate consistency, I think.
> >
>
> I am not sure how consistency will be broken.

Some of the transactions applied by the apply worker may not have
been applied by the resync and vice versa. If the intermediate states
of the table resync worker are visible, this difference in applied
transactions will result in a loss of consistency if those
transactions change both the table being resynced and some other
table in the same transaction: the changes won't be atomically
visible. Thinking more about this, the problem exists today for a
table being resynced, but at least it's only the table being resynced
that is behind the other tables, so it's predictable.

--
Best Wishes,
Ashutosh Bapat



Re: Single transaction in the tablesync worker?

akapila
On Fri, Dec 4, 2020 at 7:12 PM Ashutosh Bapat
<[hidden email]> wrote:

>
> On Thu, Dec 3, 2020 at 7:24 PM Amit Kapila <[hidden email]> wrote:
> >
> > On Thu, Dec 3, 2020 at 7:04 PM Ashutosh Bapat
> > <[hidden email]> wrote:
> > >
> > > On Thu, Dec 3, 2020 at 2:55 PM Amit Kapila <[hidden email]> wrote:
> > > >
> > > > The tablesync worker in logical replication performs the table data
> > > > sync in a single transaction which means it will copy the initial data
> > > > and then catch up with apply worker in the same transaction. There is
> > > > a comment in LogicalRepSyncTableStart ("We want to do the table data
> > > > sync in a single transaction.") saying so but I can't find the
> > > > concrete theory behind the same. Is there any fundamental problem if
> > > > we commit the transaction after initial copy and slot creation in
> > > > LogicalRepSyncTableStart and then allow the apply of transactions as
> > > > it happens in apply worker? I have tried doing so in the attached (a
> > > > quick prototype to test) and didn't find any problems with regression
> > > > tests. I have tried a few manual tests as well to see if it works and
> > > > didn't find any problem. Now, it is quite possible that it is
> > > > mandatory to do the way we are doing currently, or maybe something
> > > > else is required to remove this requirement but I think we can do
> > > > better with respect to comments in this area.
> > >
> > > If we commit the initial copy, the data upto the initial copy's
> > > snapshot will be visible downstream. If we apply the changes by
> > > committing changes per transaction, the data visible to the other
> > > transactions will differ as the apply progresses.
> > >
> >
> > It is not clear what you mean by the above.  The way you have written
> > appears that you are saying that instead of copying the initial data,
> > I am saying to copy it transaction-by-transaction. But that is not the
> > case. I am saying copy the initial data by using REPEATABLE READ
> > isolation level as we are doing now, commit it and then process
> > transaction-by-transaction till we reach sync-point (point till where
> > apply worker has already received the data).
>
> Craig in his mail has clarified this. The changes after the initial
> COPY will be visible before the table sync catches up.
>

I think the problem is not that the changes are visible after COPY;
rather, it is that we don't have a mechanism to restart if we crash
after COPY unless we do the entire sync in one transaction. Assume we
commit after COPY and then process transaction-by-transaction, and it
errors out (due to connection loss) or crashes in between one of the
transactions that follow the COPY; then after the restart we won't
know from where to start for that relation. This is because the
catalog (pg_subscription_rel) will show the state as 'd' (data is
being copied) and the slot will be gone, since it was a temporary
slot. But as mentioned in one of my emails above [1], we can solve
these problems, which Craig also seems to be advocating for, as there
are many advantages to not doing the entire sync (initial copy +
streaming the changes for that relation) in one single transaction.
It will allow us to support decoding of prepared xacts on the
subscriber. Also, it seems pglogical already processes
transaction-by-transaction after the initial copy. The only thing
which is not clear to me is why we didn't go that way initially, and
it would probably be better if the original authors would also chime
in to clarify this.
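
(For reference, the per-table sync state can be inspected on the
subscriber as below; 'd' is the state a crashed sync would be left in.)

SELECT srrelid::regclass AS relation, srsubstate
FROM pg_subscription_rel;
-- srsubstate: 'i' = initialize, 'd' = data is being copied,
--             's' = synchronized, 'r' = ready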

> >
> > > You haven't
> > > clarified whether we will respect the transaction boundaries in the
> > > apply log or not. I assume we will.
> > >
> >
> > It will be transaction-by-transaction.
> >
> > > Whereas if we apply all the
> > > changes in one go, other transactions either see the data before
> > > resync or after it without any intermediate states.
> > >
> >
> > What is the problem even if the user is able to see the data after the
> > initial copy?
> >
> > > That will not
> > > violate consistency, I think.
> > >
> >
> > I am not sure how consistency will be broken.
>
> Some of the transactions applied by apply workers may not have been
> applied by the resync and vice versa. If the intermediate states of
> table resync worker are visible, this difference in applied
> transaction will result in loss of consistency if those transactions
> are changing the table being resynced and some other table in the same
> transaction. The changes won't be atomically visible. Thinking more
> about this, this problem exists today for a table being resynced, but
> at least it's only the table being resynced that is behind the other
> tables so it's predictable.
>

Yeah, I have already shown [1] that this problem exists today, and it
won't be predictable when more tables need to be synced. I am not
sure why, but it seems it was acceptable to the original authors that
the data of transactions is visible only partially during the initial
synchronization phase of a subscription. I don't see it documented
clearly either.


[1] - https://www.postgresql.org/message-id/CAA4eK1Ld9XaLoTZCoKF_gET7kc1fDf8CPR3CM48MQb1N1jDLYg%40mail.gmail.com

--
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

Craig Ringer-5


On Sat, 5 Dec 2020, 10:03 Amit Kapila, <[hidden email]> wrote:
> On Fri, Dec 4, 2020 at 7:12 PM Ashutosh Bapat
> <[hidden email]> wrote:
> >
> > On Thu, Dec 3, 2020 at 7:24 PM Amit Kapila <[hidden email]> wrote:
> > >
> > > On Thu, Dec 3, 2020 at 7:04 PM Ashutosh Bapat
> > > <[hidden email]> wrote:
> > > >
> > > > On Thu, Dec 3, 2020 at 2:55 PM Amit Kapila <[hidden email]> wrote:
> > > > >
> > > > > The tablesync worker in logical replication performs the table data
> > > > > sync in a single transaction which means it will copy the initial data
> > > > > and then catch up with apply worker in the same transaction. There is
> > > > > a comment in LogicalRepSyncTableStart ("We want to do the table data
> > > > > sync in a single transaction.") saying so but I can't find the
> > > > > concrete theory behind the same. Is there any fundamental problem if
> > > > > we commit the transaction after initial copy and slot creation in
> > > > > LogicalRepSyncTableStart and then allow the apply of transactions as
> > > > > it happens in apply worker? I have tried doing so in the attached (a
> > > > > quick prototype to test) and didn't find any problems with regression
> > > > > tests. I have tried a few manual tests as well to see if it works and
> > > > > didn't find any problem. Now, it is quite possible that it is
> > > > > mandatory to do the way we are doing currently, or maybe something
> > > > > else is required to remove this requirement but I think we can do
> > > > > better with respect to comments in this area.
> > > >
> > > > If we commit the initial copy, the data upto the initial copy's
> > > > snapshot will be visible downstream. If we apply the changes by
> > > > committing changes per transaction, the data visible to the other
> > > > transactions will differ as the apply progresses.
> > > >
> > >
> > > It is not clear what you mean by the above.  The way you have written
> > > appears that you are saying that instead of copying the initial data,
> > > I am saying to copy it transaction-by-transaction. But that is not the
> > > case. I am saying copy the initial data by using REPEATABLE READ
> > > isolation level as we are doing now, commit it and then process
> > > transaction-by-transaction till we reach sync-point (point till where
> > > apply worker has already received the data).
> >
> > Craig in his mail has clarified this. The changes after the initial
> > COPY will be visible before the table sync catches up.
> >
>
> I think the problem is not that the changes are visible after COPY
> rather it is that we don't have a mechanism to restart if it crashes
> after COPY unless we do all the sync up in one transaction. Assume we
> commit after COPY and then process transaction-by-transaction and it
> errors out (due to connection loss) or crashes, in-between one of the
> following transactions after COPY then after the restart we won't know
> from where to start for that relation. This is because the catalog
> (pg_subscription_rel) will show the state as 'd' (data is being
> copied) and the slot would have gone as it was a temporary slot. But
> as mentioned in one of my emails above [1] we can solve these problems
> which Craig also seems to be advocating for as there are many
> advantages of not doing the entire sync (initial copy + stream changes
> for that relation) in one single transaction. It will allow us to
> support decode of prepared xacts in the subscriber. Also, it seems
> pglogical already does processing transaction-by-transaction after the
> initial copy. The only thing which is not clear to me is why we
> haven't decided to go ahead initially and it would be probably better
> if the original authors would also chime-in to at least clarify the
> same.

It's partly a resource management issue.

Replication origins are a limited resource. We need to use a replication origin for any sync we want to be durable across restarts.

Then again, so are slots, and we use a temp slot for each sync.

If a sync fails, cleanup on the upstream side is simple with a temp slot. With persistent slots we have more risk of creating upstream issues. But then, as long as the subscriber exists, it can deal with that; and if the subscriber no longer exists, its primary slot is an issue too.

It'd help if we could register pg_shdepend entries between catalog entries and slots, and from a main subscription slot to any extra slots used for resynchronization.

And I should write a patch for a resource retention summarisation view.
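
(Until such a view exists, a rough approximation is to query the
existing views on each side:)

-- upstream: slots, and the WAL each one is holding back
SELECT slot_name, slot_type, active, restart_lsn
FROM pg_replication_slots;

-- downstream: replication origins and their replay progress
SELECT external_id, remote_lsn, local_lsn
FROM pg_replication_origin_status;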


> I am not sure why, but it seems it was acceptable to the original
> authors that the data of transactions is visible only partially
> during the initial synchronization phase of a subscription.

I don't think there's much alternative there.

Pg would need some kind of cross-commit visibility control mechanism that separates durable commit from visibility.

Re: Single transaction in the tablesync worker?

Peter Smith
Hi,

I wanted to float another idea to solve these tablesync/apply worker problems.

This idea may or may not have merit. Please consider it.

~

Basically, I was wondering why the "tablesync" worker can't just
gather messages in a similar way to how the current streaming feature
gathers messages into a "changes" file, so that they can be replayed
later.

e.g. Imagine if

A) The "tablesync" worker (after the COPY) does not ever apply any of
the incoming messages, but instead it just gobbles them into a
"changes" file until it decides it has reached SYNCDONE state and
exits.

B) Then, when the "apply" worker proceeds, if it detects the existence
of the "changes" file it will replay/apply_dispatch all those gobbled
messages before just continuing as normal.

So:
- IIUC this kind of replay is like how the current code applies the
streamed "changes" file at stream commit.
- The "tablesync" worker would only be doing table sync (COPY), as
its name suggests. Any detected "changes" are recorded and left for
the "apply" worker to handle.
- The "tablesync" worker would just operate in a single tx with a
temporary slot, as per the current code.
- Then the "apply" worker would be the *only* worker that actually
applies anything (as its name suggests).

Thoughts?

---

Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

akapila
In reply to this post by Craig Ringer-5
On Mon, Dec 7, 2020 at 6:20 AM Craig Ringer
<[hidden email]> wrote:

>
> On Sat, 5 Dec 2020, 10:03 Amit Kapila, <[hidden email]> wrote:
>>
>> On Fri, Dec 4, 2020 at 7:12 PM Ashutosh Bapat
>> <[hidden email]> wrote:
>>
>> I think the problem is not that the changes are visible after COPY
>> rather it is that we don't have a mechanism to restart if it crashes
>> after COPY unless we do all the sync up in one transaction. Assume we
>> commit after COPY and then process transaction-by-transaction and it
>> errors out (due to connection loss) or crashes, in-between one of the
>> following transactions after COPY then after the restart we won't know
>> from where to start for that relation. This is because the catalog
>> (pg_subscription_rel) will show the state as 'd' (data is being
>> copied) and the slot would have gone as it was a temporary slot. But
>> as mentioned in one of my emails above [1] we can solve these problems
>> which Craig also seems to be advocating for as there are many
>> advantages of not doing the entire sync (initial copy + stream changes
>> for that relation) in one single transaction. It will allow us to
>> support decode of prepared xacts in the subscriber. Also, it seems
>> pglogical already does processing transaction-by-transaction after the
>> initial copy. The only thing which is not clear to me is why we
>> haven't decided to go ahead initially and it would be probably better
>> if the original authors would also chime-in to at least clarify the
>> same.
>
>
> It's partly a resource management issue.
>
> Replication origins are a limited resource. We need to use a replication origin for any sync we want to be durable across restarts.
>
> Then again so are slots and we use temp slots for each sync.
>
> If a sync fails cleanup on the upstream side is simple with a temp slot. With persistent slots we have more risk of creating upstream issues. But then, so long as the subscriber exists it can deal with that. And if the subscriber no longer exists its primary slot is an issue too.
>

I think if the only issue is slot cleanup, then the same issue exists
today for the slot created by the apply worker (which I think you are
referring to as the primary slot). This can only happen if the
subscriber goes away without dropping the subscription. Also, if we
are worried about using up too many slots, then the slots used by
tablesync workers will probably be freed sooner.
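
(For context, this is the manual cleanup needed today when a
subscriber disappears without dropping its subscription; the
subscription/slot name is illustrative.)

-- on the publisher: drop the orphaned slot by hand
SELECT pg_drop_replication_slot('mysub');

-- on the subscriber, if it still exists but the publisher's slot is
-- already gone or unreachable:
ALTER SUBSCRIPTION mysub DISABLE;
ALTER SUBSCRIPTION mysub SET (slot_name = NONE);
DROP SUBSCRIPTION mysub;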

> It'd help if we could register pg_shdepend entries between catalog entries and slots, and from a main subscription slot to any extra slots used for resynchronization.
>

Which catalog entries you are referring to here?

> And I should write a patch for a resource retention summarisation view.
>

That would be great.

>>
>> I am not sure why, but it seems it was acceptable to the original
>> authors that the data of transactions is visible only partially
>> during the initial synchronization phase of a subscription.
>
>
> I don't think there's much alternative there.
>

I am not sure about this. I think it is primarily to allow some more
parallelism between the apply and sync workers. One primitive way to
achieve parallelism and avoid this problem is to make the apply
worker wait till all the tablesync workers are in the DONE state.
Then we will never have the inconsistency problem or the prepared
xact problem. Now, surely if large copies are required for multiple
relations then the apply worker would replay transactions with some
delay, but I don't know how much that matters compared to the
transaction visibility issue; and anyway we would have achieved the
maximum parallelism by allowing the copy via multiple workers.

--
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

Craig Ringer-5
In reply to this post by Peter Smith
On Mon, 7 Dec 2020 at 11:44, Peter Smith <[hidden email]> wrote:

> Basically, I was wondering why the "tablesync" worker can't just
> gather messages in a similar way to how the current streaming feature
> gathers messages into a "changes" file, so that they can be replayed
> later.


See the related thread "Logical archiving"

https://www.postgresql.org/message-id/20D9328B-A189-43D1-80E2-EB25B9284AD6@...

where I addressed some parts of this topic in detail earlier today.

> A) The "tablesync" worker (after the COPY) does not ever apply any of
> the incoming messages, but instead it just gobbles them into a
> "changes" file until it decides it has reached SYNCDONE state and
> exits.

This has a few issues.

Most importantly, the sync worker must cooperate with the main apply worker to achieve a consistent end-of-sync cutover. The sync worker must have replayed the pending changes in order to make this cut-over, because the non-sync apply worker will need to start applying changes on top of the resync'd table potentially as soon as the next transaction it starts applying, so it needs to see the rows there.

Doing this would also add another round of write multiplication, since the data would get spooled, then applied to WAL, then to the heap. Write multiplication is already an issue for logical replication, so adding to it isn't particularly desirable without a really compelling reason. With the write multiplication come disk space management issues for big transactions, as well as the obvious performance/throughput impact.

It adds even more latency between upstream commit and downstream apply, something that is again already an issue for logical replication.

Right now we don't have any concept of a durable and locally flushed spool.

It's not impossible to do as you suggest but the cutover requirement makes it far from simple. As discussed in the logical archiving thread I think it'd be good to have something like this, and there are times the write multiplication price would be well worth paying. But it's not easy.

> B) Then, when the "apply" worker proceeds, if it detects the existence
> of the "changes" file it will replay/apply_dispatch all those gobbled
> messages before just continuing as normal.

That's going to introduce a really big stall in the apply worker's progress in many cases. During that time it won't be receiving from upstream (since we don't spool logical changes to disk at this time) so the upstream lag will grow. That will impact synchronous replication, pg_wal size management, catalog bloat, etc. It'll also leave the upstream logical decoding session idle, so when it resumes it may create a spike of I/O and CPU load as it catches up, as well as a spike of network traffic. And depending on how close the upstream write rate is to the max decode speed, network throughput max, and downstream apply speed max, it may take some time to catch up over the resulting lag.

Not a big fan of that approach.

Re: Single transaction in the tablesync worker?

akapila
On Mon, Dec 7, 2020 at 10:02 AM Craig Ringer
<[hidden email]> wrote:

>
> On Mon, 7 Dec 2020 at 11:44, Peter Smith <[hidden email]> wrote:
>>
>>
>> Basically, I was wondering why can't the "tablesync" worker just
>> gather messages in a similar way to how the current streaming feature
>> gathers messages into a "changes" file, so that they can be replayed
>> later.
>>
>
> See the related thread "Logical archiving"
>
> https://www.postgresql.org/message-id/20D9328B-A189-43D1-80E2-EB25B9284AD6@...
>
> where I addressed some parts of this topic in detail earlier today.
>
>> A) The "tablesync" worker (after the COPY) does not ever apply any of
>> the incoming messages, but instead it just gobbles them into a
>> "changes" file until it decides it has reached SYNCDONE state and
>> exits.
>
>
> This has a few issues.
>
> Most importantly, the sync worker must cooperate with the main apply worker to achieve a consistent end-of-sync cutover.
>

In this idea, there is no need to change the end-of-sync cutover. It
will work as it is now. I am not sure what makes you think so.

> The sync worker must have replayed the pending changes in order to make this cut-over, because the non-sync apply worker will need to start applying changes on top of the resync'd table potentially as soon as the next transaction it starts applying, so it needs to see the rows there.
>

The change here would be that the apply worker will check for the
changes file, and if it exists then apply those changes before it
sets the relstate to SUBREL_STATE_READY in
process_syncing_tables_for_apply(). So, it will not miss seeing any
rows.

> Doing this would also add another round of write multiplication since the data would get spooled then applied to WAL then heap. Write multiplication is already an issue for logical replication so adding to it isn't particularly desirable without a really compelling reason.
>

It will solve our problem of allowing decoding of prepared xacts in
pgoutput. I have explained the problem above [1]. The other idea we
discussed is to allow an additional state in pg_subscription_rel,
make the slot permanent in the tablesync worker, and then process
transaction-by-transaction in the apply worker. Does that approach
sound better? Is there any bigger change involved in this approach
(making the tablesync slot permanent) that I am missing?

> With  the write multiplication comes disk space management issues for big transactions as well as the obvious performance/throughput impact.
>
> It adds even more latency between upstream commit and downstream apply, something that is again already an issue for logical replication.
>
> Right now we don't have any concept of a durable and locally flushed spool.
>

I think we have a concept quite close to it for writing the changes
of in-progress xacts, as done in PG 14. It is not durable, but that
shouldn't be a big problem if we allow syncing the changes file.

> It's not impossible to do as you suggest but the cutover requirement makes it far from simple. As discussed in the logical archiving thread I think it'd be good to have something like this, and there are times the write multiplication price would be well worth paying. But it's not easy.
>
>> B) Then, when the "apply" worker proceeds, if it detects the existence
>> of the "changes" file it will replay/apply_dispatch all those gobbled
>> messages before just continuing as normal.
>
>
> That's going to introduce a really big stall in the apply worker's progress in many cases. During that time it won't be receiving from upstream (since we don't spool logical changes to disk at this time) so the upstream lag will grow. That will impact synchronous replication, pg_wal size management, catalog bloat, etc. It'll also leave the upstream logical decoding session idle, so when it resumes it may create a spike of I/O and CPU load as it catches up, as well as a spike of network traffic. And depending on how close the upstream write rate is to the max decode speed, network throughput max, and downstream apply speed max, it may take some time to catch up over the resulting lag.
>

This is just for the initial tablesync phase. I think it is
equivalent to saying that during a basebackup we need to start
physical replication in parallel. I agree that sometimes it can take
a lot of time to copy large tables, but it is a one-time cost and no
worse than other situations like basebackup.

[1] - https://www.postgresql.org/message-id/CAA4eK1KFsjf6x-S7b0dJLvEL3tcn9x-voBJiFoGsccyH5xgDzQ%40mail.gmail.com

--
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

akapila
In reply to this post by akapila
On Mon, Dec 7, 2020 at 9:21 AM Amit Kapila <[hidden email]> wrote:

>
> On Mon, Dec 7, 2020 at 6:20 AM Craig Ringer
> <[hidden email]> wrote:
> >
>
> >>
> >> I am not sure why, but it seems it was acceptable to the original
> >> authors that the data of transactions is visible only partially
> >> during the initial synchronization phase of a subscription.
> >
> >
> > I don't think there's much alternative there.
> >
>
> I am not sure about this. I think it is primarily to allow some more
> parallelism between the apply and sync workers. One primitive way to
> achieve parallelism and avoid this problem is to make the apply
> worker wait till all the tablesync workers are in the DONE state.
>

As the slot of the apply worker is created before all the tablesync
workers, it should never miss any LSN which the tablesync workers
would have processed. Also, the tablesync workers should not process
any xact if the apply worker has not processed anything. I think
tablesync currently always processes at least one transaction
(because we call process_sync_tables at the commit of a txn) even if
that is not required to be in sync with the apply worker. This should
solve both problems: (a) visibility of partial transactions, and (b)
allowing prepared transactions, because the tablesync worker no
longer needs to combine multiple transactions' data.

I think the other advantage of this would be that it would reduce the
load (both CPU and I/O) on the publisher side by decoding the data
only once, instead of once for each tablesync worker and separately
for the apply worker. I think it will use fewer resources to finish
the work.

Is there any flaw in this idea that I am missing?

--
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

akapila
On Mon, Dec 7, 2020 at 2:21 PM Amit Kapila <[hidden email]> wrote:

>
> On Mon, Dec 7, 2020 at 9:21 AM Amit Kapila <[hidden email]> wrote:
> >
> > On Mon, Dec 7, 2020 at 6:20 AM Craig Ringer
> > <[hidden email]> wrote:
> > >
> >
> > >>
> > >> I am not sure why, but it seems it was acceptable to the original
> > >> authors that the data of transactions is visible only partially
> > >> during the initial synchronization phase of a subscription.
> > >
> > >
> > > I don't think there's much alternative there.
> > >
> >
> > I am not sure about this. I think it is primarily to allow some more
> > parallelism between the apply and sync workers. One primitive way to
> > achieve parallelism and avoid this problem is to make the apply
> > worker wait till all the tablesync workers are in the DONE state.
> >
>
> As the slot of the apply worker is created before all the tablesync
> workers, it should never miss any LSN which the tablesync workers
> would have processed. Also, the tablesync workers should not process
> any xact if the apply worker has not processed anything. I think
> tablesync currently always processes at least one transaction
> (because we call process_sync_tables at the commit of a txn) even if
> that is not required to be in sync with the apply worker.
>

One more thing to consider here is that currently the tablesync
worker creates a slot with the CRS_USE_SNAPSHOT option, which creates
a transaction snapshot on the publisher, and then we use that same
snapshot for the copy from the publisher. After this, when we receive
data from the publisher using the same slot, it will be in sync with
the COPY. I think to keep the same consistency between the COPY and
the data we receive from the publisher in this approach, we need to
export the snapshot while creating the slot in the apply worker by
using CRS_EXPORT_SNAPSHOT, and then use that same snapshot in all the
tablesync workers doing the copy. In the tablesync workers, we can
use the SET TRANSACTION SNAPSHOT command after "BEGIN READ ONLY
ISOLATION LEVEL REPEATABLE READ" to achieve this. That way the COPY
will use the same snapshot as is used for receiving the changes in
the apply worker, and the data will be in sync.
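
(A sketch of the publisher-side commands under that scheme; the slot
name and snapshot identifier are illustrative.)

-- apply worker's connection: create the slot and export its snapshot
CREATE_REPLICATION_SLOT "mysub" LOGICAL pgoutput EXPORT_SNAPSHOT;
-- returns a snapshot name, e.g. '00000003-00000002-1'

-- each tablesync worker's connection: copy under that same snapshot
BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000003-00000002-1';
COPY public.mytbl1 TO STDOUT;
COMMIT;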

--
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

Peter Smith
In reply to this post by akapila
On Mon, Dec 7, 2020 at 7:49 PM Amit Kapila <[hidden email]> wrote:

> As the slot of the apply worker is created before all the tablesync
> workers, it should never miss any LSN which the tablesync workers
> would have processed. Also, the tablesync workers should not process
> any xact if the apply worker has not processed anything. I think
> tablesync currently always processes at least one transaction
> (because we call process_sync_tables at the commit of a txn) even if
> that is not required to be in sync with the apply worker. This should
> solve both problems: (a) visibility of partial transactions, and (b)
> allowing prepared transactions, because the tablesync worker no
> longer needs to combine multiple transactions' data.
>
> I think the other advantage of this would be that it would reduce the
> load (both CPU and I/O) on the publisher side by decoding the data
> only once, instead of once for each tablesync worker and separately
> for the apply worker. I think it will use fewer resources to finish
> the work.

Yes, I observed this same behavior.

IIUC the only way for the tablesync worker to go from CATCHUP mode to
SYNCDONE is via the call to process_sync_tables.

But a side-effect of this is that when messages arrive during this
CATCHUP phase, one tx will be handled by the tablesync worker before
process_sync_tables() is ever encountered.

I have created and attached a simple patch which allows the tablesync
worker to detect if there is anything to do *before* it enters the
apply main loop. Calling process_sync_tables() before the apply main
loop offers a quick way out, so the message handling will not be
split unnecessarily between the workers.

~

The result of the patch is demonstrated by the following test/logs
which are also attached.
Note: I added more logging (not in this patch) to make it easier to
see what is going on.

LOGS1. Current code.
Test: 10 x INSERTs done at CATCHUP time.
Result: tablesync worker does 1 x INSERT, then the apply worker skips
that one and does the remaining 9 x INSERTs.

LOGS2. Patched code.
Test: Same 10 x INSERTs done at CATCHUP time.
Result: tablesync worker can exit early; the apply worker handles all
10 x INSERTs.

LOGS3. Patched code.
Test: 2PC PREPARE then COMMIT PREPARED [1] done at CATCHUP time.
  psql -d test_pub -c "BEGIN; INSERT INTO test_tab VALUES(1, 'foo'); PREPARE TRANSACTION 'test_prepared_tab';"
  psql -d test_pub -c "COMMIT PREPARED 'test_prepared_tab';"
Result: The PREPARE and COMMIT PREPARED are both handled by the apply
worker. This avoids the complications which the split otherwise
causes.
[1] The 2PC prepare test requires the v29 patch from
https://www.postgresql.org/message-id/flat/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3%2BQ%40mail.gmail.com

---

Kind Regards,
Peter Smith.
Fujitsu Australia

LOGS1-10xInsertTest.txt (13K) Download Attachment
tablesync-early-exit.patch (940 bytes) Download Attachment
LOGS3-Patched-PrepareTest.txt (5K) Download Attachment
LOGS2-Patched-10xInsertTest.txt (5K) Download Attachment

Re: Single transaction in the tablesync worker?

akapila
On Tue, Dec 8, 2020 at 11:53 AM Peter Smith <[hidden email]> wrote:

>
> Yes, I observed this same behavior.
>
> IIUC the only way for the tablesync worker to go from CATCHUP mode to
> SYNCDONE is via the call to process_syncing_tables().
>
> But a side-effect of this is that when messages arrive during this
> CATCHUP phase, one transaction will already have been handled by the
> tablesync worker before process_syncing_tables() is ever reached.
>
> I have created and attached a simple patch which allows the tablesync
> worker to detect if there is anything to do *before* it enters the
> apply main loop. Calling process_syncing_tables() before the apply
> main loop offers a quick way out, so the message handling will not be
> split unnecessarily between the workers.
>

Yeah, this demonstrates the idea can work, but as mentioned in my
previous email [1] this needs much more work to make the COPY and the
later fetching of changes from the publisher consistent. So, let me
summarize the discussion so far. We wanted to enhance the tablesync
phase of Logical Replication to enable decoding of prepared
transactions [2]. The problem was that when we stream prepared
transactions in the tablesync worker, it simply commits them, due to
the requirement of maintaining a single transaction for the entire
duration of the copy and the streaming of transactions afterward. We
can't simply disable the decoding of prepared xacts for tablesync
workers because that can skip some of the prepared xacts forever on
the subscriber, as explained in one of the emails above [3]. Now,
while investigating the solutions to enhance tablesync to support
decoding at prepare time, I found that due to the current design of
tablesync we can see partial transaction data on subscribers, which is
also explained with an example in the email above [4]. This visibility
problem has existed ever since Logical Replication was introduced in
PostgreSQL, and the only answer I have received so far is that there
doesn't seem to be any other alternative, which I think is not true; I
have provided one alternative as well.

Next, we have discussed three different solutions, all of which solve
the first problem (allow the tablesync worker to decode transactions
at prepare time) and one of which also solves the second problem
(partial transaction data visibility).

Solution-1: Allow the tablesync worker to use multiple transactions.
The reason for doing it in a single transaction today is that if we
commit after the initial COPY and then crash while streaming changes
of other transactions, the state of the table won't be known after the
restart: as we are using a temporary slot, we don't know from where to
restart syncing the table.

IIUC, we primarily need to do two things to achieve multiple
transactions. One is to have an additional state in the catalog (say
"catchup") which says that the initial copy is done. Then we need a
permanent slot through which we can track the progress, so that after
a restart (due to a crash, connection break, etc.) we can resume from
the appropriate position. This will allow us to do less work after
recovering from a crash because we will know the restart point. As
Craig mentioned, it also allows the sync slot to advance, freeing any
held upstream resources before the whole sync is done, which is good
if the upstream is busy and generating lots of WAL. Finally,
committing as we go means we won't exceed the command-id (cid)
increment limit in a single transaction.
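
As a rough sketch of how this could surface on the subscriber (the new
state is hypothetical; 'i', 'd', 's', and 'r' are the persistent
values of srsubstate that exist today):

    -- Persistent per-table sync states in pg_subscription_rel today:
    --   'i' = init, 'd' = data copy in progress,
    --   's' = sync done, 'r' = ready (normal replication)
    -- Solution-1 would add a persistent "initial copy finished" state
    -- and record the restart position, so that a restarted tablesync
    -- worker knows it can skip the COPY and just catch up from the
    -- recorded point:
    SELECT srrelid::regclass AS tbl, srsubstate, srsublsn
      FROM pg_subscription_rel;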

Solution-2: The next solution we discussed is to make the tablesync
worker just gather messages after the COPY, in a way similar to how
the current streaming of in-progress transactions feature gathers
messages into a "changes" file, so that they can be replayed later by
the apply worker. Here, as we no longer need to replay the individual
transactions in the tablesync worker within a single transaction, the
publisher can send prepared transactions to the subscriber at decode
time. This has some disadvantages: each transaction processed by the
tablesync worker needs to be durably written to a file, and it can
also lead to some apply lag later when the apply worker processes the
same changes.

Solution-3: Allow the tablesync workers to perform only the initial
COPY, and then once the COPY is done for all relations, have the apply
worker stream all the future changes. Surely, if large copies are
required for multiple relations, then replay of transactions by the
apply worker would be delayed a bit, but we don't know how much that
matters compared to the transaction visibility issue, and in any case
we would still achieve the maximum parallelism by allowing the copy
via multiple workers. This would also reduce the load (both CPU and
I/O) on the publisher side by decoding the data only once, instead of
once for each tablesync worker and separately for the apply worker; I
think it will use fewer resources to finish the work.

Currently, the tablesync worker creates its slot with the
CRS_USE_SNAPSHOT option, which creates a transaction snapshot on the
publisher, and we then use the same snapshot for the COPY from the
publisher. After this, when we receive data from the publisher using
the same slot, it is in sync with the COPY. To keep that same
consistency between the COPY and the data we receive from the
publisher in this approach, we need to export the snapshot while
creating the slot in the apply worker (by using CRS_EXPORT_SNAPSHOT)
and then have all the tablesync workers doing the copy use that
exported snapshot. In the tablesync workers, we can issue the SET
TRANSACTION SNAPSHOT command after "BEGIN READ ONLY ISOLATION LEVEL
REPEATABLE READ" to use the exported snapshot. That way the COPY will
use the same snapshot as is used for receiving the changes in the
apply worker, and the data will be in sync.

Then we also need a way to export a snapshot while the apply worker is
already receiving changes, because users can run 'ALTER SUBSCRIPTION
name REFRESH PUBLICATION', which allows new tables to be synced. I
think we need to introduce a new command in exec_replication_command()
to export a snapshot from the existing slot so that a newly launched
tablesync worker can use it.
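
For example, it could look something like the following (the command
name is invented here purely for illustration; no such walsender
command exists today):

    -- Hypothetical command, issued on the apply worker's existing
    -- slot when REFRESH PUBLICATION adds a new table:
    --   EXPORT_SNAPSHOT_FOR_SLOT sub1_slot
    -- It would return a snapshot name which the new tablesync worker
    -- then uses via SET TRANSACTION SNAPSHOT before its COPY.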


Among the above three solutions, the first two solve the first problem
(allow the tablesync worker to decode transactions at prepare time)
and the third solution solves both the first and the second problem
(partial transaction data visibility). The third solution requires
quite some redesign of how the Logical Replication work is
synchronized between the apply and tablesync workers and might turn
out to be a bigger implementation effort. I am tentatively thinking of
going with the first or second solution at this stage; if people later
feel that we need a bigger redesign, then we can go with something
along the lines of Solution-3.

Thoughts?

[1] - https://www.postgresql.org/message-id/CAA4eK1%2BQC74wRQmbYT%2BMmOs%3DYbdUjuq0_A9CBbVoQMB1Ryi-OA%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAHut+PuEMk4SO8oGzxc_ftzPkGA8uC-y5qi-KRqHSy_P0i30DA@...
[3] - https://www.postgresql.org/message-id/CAA4eK1KFsjf6x-S7b0dJLvEL3tcn9x-voBJiFoGsccyH5xgDzQ%40mail.gmail.com
[4] - https://www.postgresql.org/message-id/CAA4eK1Ld9XaLoTZCoKF_gET7kc1fDf8CPR3CM48MQb1N1jDLYg%40mail.gmail.com

--
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

Peter Smith
On Tue, Dec 8, 2020 at 9:14 PM Amit Kapila <[hidden email]> wrote:

> [... full summary of the three proposed solutions trimmed; see the
> previous message ...]
>
> Among the above three solutions, the first two solve the first problem
> (allow the tablesync worker to decode transactions at prepare time)
> and the third solution solves both the first and the second problem
> (partial transaction data visibility). [...]
>
> Thoughts?

Hi Amit,

- Solution-3 has become too complicated for me to attempt. Anyway, we
may be better off just focusing on eliminating the new problems
exposed by the 2PC work [1], rather than burning too much effort on
fixing some other quirk which has apparently existed for years.
[1] https://www.postgresql.org/message-id/CAHut%2BPtm7E5Jj92tJWPtnnjbNjJN60_%3DaGGKYW3h23b7J%3DqeDg%40mail.gmail.com

- Solution-2 has some potential lag problems, and maybe file resource
problems as well. This idea did not get a very favourable response
when I first proposed it.

- This leaves Solution-1 as the best viable option to fix the current
known 2PC trouble.

~~

So I will try to write a patch for the proposed Solution-1.

---
Kind Regards,
Peter Smith.
Fujitsu Australia

