BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop


PG Bug reporting form
The following bug has been logged on the website:

Bug reference:      16643
Logged by:          Henry Hinze
Email address:      [hidden email]
PostgreSQL version: 13.0
Operating system:   Ubuntu 20.04
Description:        

The following setup of logical replication for one table works nicely with
PG 12.4. In PG 13.0 the initial startup never finishes. Instead it starts
over and over again.
All settings are default except for wal_level, which was changed to
"logical".

To replicate:
-- Create 2 Databases
psql

CREATE DATABASE d1;
CREATE DATABASE d2;

-- In d1: Create table, fill with data, create publication, create
replication slot
psql d1

CREATE TABLE t (f int);
INSERT INTO t SELECT * FROM generate_series(1, 10000000);
ALTER TABLE t ADD PRIMARY KEY (f);

CREATE PUBLICATION p;
ALTER PUBLICATION p ADD TABLE t;
SELECT pg_create_logical_replication_slot('my_slot','pgoutput');

-- In d2: Create table, create subscription

psql d2

CREATE TABLE t (f int);
ALTER TABLE t ADD PRIMARY KEY (f);
CREATE SUBSCRIPTION s CONNECTION 'user=postgres dbname=d1' PUBLICATION p
WITH (create_slot = false, slot_name='my_slot');
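
While the subscription is stuck in this loop, the state of the initial sync
can also be watched from SQL; a minimal sketch using the standard PG 13
catalogs (the sync slot name shown comes from the log below):

-- On the subscriber (d2): the table should eventually reach srsubstate 'r'
-- (ready), but here it never does.
SELECT srrelid::regclass AS tbl, srsubstate, srsublsn
FROM pg_subscription_rel;

-- On the publisher (d1): the per-table sync slot (e.g.
-- my_slot_16398_sync_16393) keeps reappearing as each sync attempt restarts.
SELECT slot_name, active, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots;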


Here are the Logs:
**********************
2020-09-29 21:27:13.267 CEST [17178] postgres@d1 LOG:  logical decoding
found consistent point at 0/33CDDBB8
2020-09-29 21:27:13.267 CEST [17178] postgres@d1 DETAIL:  There are no
running transactions.
2020-09-29 21:27:13.267 CEST [17178] postgres@d1 STATEMENT:  select
pg_create_logical_replication_slot('my_slot','pgoutput');
2020-09-29 21:27:37.543 CEST [17229] LOG:  logical replication apply worker
for subscription "s" has started
2020-09-29 21:27:37.548 CEST [17230] postgres@d1 LOG:  starting logical
decoding for slot "my_slot"
2020-09-29 21:27:37.548 CEST [17230] postgres@d1 DETAIL:  Streaming
transactions committing after 0/33CDDBF0, reading WAL from 0/33CDDBB8.
2020-09-29 21:27:37.548 CEST [17230] postgres@d1 LOG:  logical decoding
found consistent point at 0/33CDDBB8
2020-09-29 21:27:37.548 CEST [17230] postgres@d1 DETAIL:  There are no
running transactions.
2020-09-29 21:27:37.552 CEST [17231] LOG:  logical replication table
synchronization worker for subscription "s", table "t" has started
2020-09-29 21:27:37.704 CEST [17232] postgres@d1 LOG:  logical decoding
found consistent point at 0/33D49208
2020-09-29 21:27:37.704 CEST [17232] postgres@d1 DETAIL:  There are no
running transactions.
2020-09-29 21:27:58.888 CEST [17232] postgres@d1 LOG:  starting logical
decoding for slot "my_slot_16398_sync_16393"
2020-09-29 21:27:58.888 CEST [17232] postgres@d1 DETAIL:  Streaming
transactions committing after 0/33D49240, reading WAL from 0/33D49208.
2020-09-29 21:27:58.888 CEST [17232] postgres@d1 LOG:  logical decoding
found consistent point at 0/33D49208
2020-09-29 21:27:58.888 CEST [17232] postgres@d1 DETAIL:  There are no
running transactions.
2020-09-29 21:28:05.138 CEST [17231] ERROR:  error reading result of
streaming command:
2020-09-29 21:28:05.143 CEST [17130] LOG:  background worker "logical
replication worker" (PID 17231) exited with exit code 1
2020-09-29 21:28:05.145 CEST [17256] LOG:  logical replication table
synchronization worker for subscription "s", table "t" has started
2020-09-29 21:28:05.222 CEST [17257] postgres@d1 LOG:  logical decoding
found consistent point at 0/62FBD4E8
2020-09-29 21:28:05.222 CEST [17257] postgres@d1 DETAIL:  There are no
running transactions.
2020-09-29 21:28:49.187 CEST [17257] postgres@d1 LOG:  starting logical
decoding for slot "my_slot_16398_sync_16393"
2020-09-29 21:28:49.187 CEST [17257] postgres@d1 DETAIL:  Streaming
transactions committing after 0/62FBD520, reading WAL from 0/62FBD4E8.
2020-09-29 21:28:49.187 CEST [17257] postgres@d1 LOG:  logical decoding
found consistent point at 0/62FBD4E8
2020-09-29 21:28:49.187 CEST [17257] postgres@d1 DETAIL:  There are no
running transactions.
2020-09-29 21:28:55.672 CEST [17256] ERROR:  error reading result of
streaming command:
2020-09-29 21:28:55.676 CEST [17130] LOG:  background worker "logical
replication worker" (PID 17256) exited with exit code 1
2020-09-29 21:28:55.679 CEST [17308] LOG:  logical replication table
synchronization worker for subscription "s", table "t" has started
2020-09-29 21:28:55.833 CEST [17309] postgres@d1 LOG:  logical decoding
found consistent point at 0/9E9AEA38
2020-09-29 21:28:55.833 CEST [17309] postgres@d1 DETAIL:  There are no
running transactions.

Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

Tom Lane
PG Bug reporting form <[hidden email]> writes:
> The following setup of logical replication for one table works nicely with
> PG 12.4. In PG 13.0 the initial startup never finishes. Instead it starts
> over and over again.

It kinda looks like something is timing out before it's managed to
catch up with the (rather large) initial state of table t.  What
non-default replication settings are you using?

                        regards, tom lane


Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

Henry Hinze
Hi Tom,

thanks for your reply!

For my comparison test of PG 12 and 13 I did not change any default setting except wal_level.

I also did some tests where I increased wal_keep_size and max_slot_wal_keep_size to 1GB, and set wal_sender_timeout to 1h, but without success. The setup works in PG 13 only with a small amount of data.
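
For reference, those changes amount to roughly the following on the publisher
(a sketch of the settings described above, applied via ALTER SYSTEM plus a
reload):

ALTER SYSTEM SET wal_keep_size = '1GB';
ALTER SYSTEM SET max_slot_wal_keep_size = '1GB';
ALTER SYSTEM SET wal_sender_timeout = '1h';
SELECT pg_reload_conf();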

Best,
Henry

Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

Henry Hinze
I've made an important observation!

Since I had the impression this setup was already working with RC1 of PG 13, I re-installed RC1 and did the same test. And it's working fine!

So it seems that something has changed from RC1 to the final version that breaks logical replication in some cases.

Best,
Henry


Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

Tom Lane
Henry Hinze <[hidden email]> writes:
> I've made an important observation!
> Since I had the impression this setup was already working with RC1 of PG
> 13, I re-installed RC1 and did the same test. And it's working fine!

Ugh.  So that points the finger at commits 07082b08c/bfb12cd2b,
which are the only nearby changes between rc1 and 13.0.  A quick
comparison of before-and-after checkouts confirms it.

After some digging around, I realize that that commit actually
resulted in a protocol break.  libpqwalreceiver is expecting to
get an additional CommandComplete message after COPY OUT finishes,
per libpqrcv_endstreaming(), and it's no longer getting one.

(I have not read the protocol document to see if this is per spec;
but spec or no, that's what libpqwalreceiver is expecting.)

The question that this raises is how the heck did that get past
our test suites?  It seems like the error should have been obvious
to even the most minimal testing.

                        regards, tom lane


Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

Álvaro Herrera
On 2020-Sep-30, Tom Lane wrote:

> Henry Hinze <[hidden email]> writes:
> > I've made an important observation!
> > Since I had the impression this setup was already working with RC1 of PG
> > 13, I re-installed RC1 and did the same test. And it's working fine!
>
> Ugh.  So that points the finger at commits 07082b08c/bfb12cd2b,
> which are the only nearby change between rc1 and 13.0.  A quick
> comparison of before-and-after checkouts confirms it.

Oh dear.

> After some digging around, I realize that that commit actually
> resulted in a protocol break.  libpqwalreceiver is expecting to
> get an additional CommandComplete message after COPY OUT finishes,
> per libpqrcv_endstreaming(), and it's no longer getting one.
>
> (I have not read the protocol document to see if this is per spec;
> but spec or no, that's what libpqwalreceiver is expecting.)

Yeah, definitely.

The minimal fix seems to be to add an EndReplicationCommand() call in
the T_StartReplicationCmd case.  Testing this now ...

> The question that this raises is how the heck did that get past
> our test suites?  It seems like the error should have been obvious
> to even the most minimal testing.

... yeah, that's indeed an important question.  I'm going to guess that
the TAP suites are too forgiving :-(


Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

Tom Lane
Alvaro Herrera <[hidden email]> writes:
> On 2020-Sep-30, Tom Lane wrote:
>> The question that this raises is how the heck did that get past
>> our test suites?  It seems like the error should have been obvious
>> to even the most minimal testing.

> ... yeah, that's indeed an important question.  I'm going to guess that
> the TAP suites are too forgiving :-(

One thing I noticed while trying to trace this down is that while the
initial table sync is happening, we have *both* a regular
walsender/walreceiver pair and a "sync" pair, eg

postgres  905650  0.0  0.0 186052 11888 ?        Ss   17:12   0:00 postgres: logical replication worker for subscription 16398
postgres  905651 50.1  0.0 173704 13496 ?        Ss   17:12   0:09 postgres: walsender postgres [local] idle
postgres  905652  104  0.4 186832 148608 ?       Rs   17:12   0:19 postgres: logical replication worker for subscription 16398 sync 16393
postgres  905653 12.2  0.0 174380 15524 ?        Ss   17:12   0:02 postgres: walsender postgres [local] COPY

Is it supposed to be like that?  Notice also that the regular walsender
has consumed significant CPU time; it's not pinning a CPU like the sync
walreceiver is, but it's eating maybe 20% of a CPU according to "top".
I wonder whether in cases with only small tables (which is likely all
that our tests test), the regular walreceiver manages to complete the
table sync despite repeated(?) failures of the sync worker.
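
For what it's worth, the same pair is visible from SQL on the publisher while
the sync runs; a rough sketch (column subset only):

-- One row per walsender: the main apply worker's walsender plus the
-- tablesync worker's walsender while its COPY is in progress.
SELECT pid, application_name, state, sent_lsn, replay_lsn
FROM pg_stat_replication;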

                        regards, tom lane


Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

Peter Eisentraut
In reply to this post by Tom Lane
On 2020-09-30 23:32, Tom Lane wrote:
> The question that this raises is how the heck did that get past
> our test suites?  It seems like the error should have been obvious
> to even the most minimal testing.

What's the difference between this case and what the test suite is
testing?  Is it that it replicates between two databases on the same
instance?

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

Peter Eisentraut
In reply to this post by Tom Lane
On 2020-09-30 23:52, Tom Lane wrote:
> One thing I noticed while trying to trace this down is that while the
> initial table sync is happening, we have *both* a regular
> walsender/walreceiver pair and a "sync" pair, eg
>
> postgres  905650  0.0  0.0 186052 11888 ?        Ss   17:12   0:00 postgres: logical replication worker for subscription 16398
> postgres  905651 50.1  0.0 173704 13496 ?        Ss   17:12   0:09 postgres: walsender postgres [local] idle
> postgres  905652  104  0.4 186832 148608 ?       Rs   17:12   0:19 postgres: logical replication worker for subscription 16398 sync 16393
> postgres  905653 12.2  0.0 174380 15524 ?        Ss   17:12   0:02 postgres: walsender postgres [local] COPY

That's normal.  You could also have even more if tables are syncing in
parallel.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

Álvaro Herrera
In reply to this post by Peter Eisentraut
On 2020-Oct-01, Peter Eisentraut wrote:

> On 2020-09-30 23:32, Tom Lane wrote:
> > The question that this raises is how the heck did that get past
> > our test suites?  It seems like the error should have been obvious
> > to even the most minimal testing.
>
> What's the difference between this case and what the test suite is testing?
> Is it that it replicates between two databases on the same instance?

I don't know why the tests pass, but the message
  ERROR:  error reading result of streaming command:
does appear in the logs after running src/test/subscription many times
(I see it in tests 001, 002, 013 and 014, apart from the new one in
100).  It's indeed surprising that these tests all pass!

I turned Henry's reproducer into the attached TAP test, and it does
reproduce the problem; but if I reduce the number of rows from 5000 to
1000, then it no longer does.  I don't quite see why this would be a
problem with a larger table only.  Do you?

The fix is the commented-out line in walsender.c; the test reliably
passes for me if I uncomment that, and the error message disappears from
the server logs in all the other tests.

Attachment: 0001-Restore-double-command-completion-tags.patch (2K)
Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

Tom Lane
Alvaro Herrera <[hidden email]> writes:
> On 2020-Oct-01, Peter Eisentraut wrote:
>> What's the difference between this case and what the test suite is testing?
>> Is it that it replicates between two databases on the same instance?

> I don't know why the tests pass, but the message
>   ERROR:  error reading result of streaming command:
> does appear in the logs after running src/test/subscription many times
> (I see it in tests 001, 002, 013 and 014, apart from the new one in
> 100).  It's indeed surprising that these tests all pass!

> I turned Henry's reproducer into the attached TAP test, and it does
> reproduce the problem; but if I reduce the number of rows from 5000 to
> 1000, then it no longer does.  I don't quite see why this would be a
> problem with a larger table only.  Do you?

I think we really need to figure that out before concluding that this
problem is solved.  Now that we've seen this, I'm wondering urgently
what other coverage gaps we've got there.

> The fix is the commented-out line in walsender.c; the test reliably
> passes for me if I uncomment that, and the error message disappear from
> the server logs in all the other tests.

I agree that this is what we need to do code-wise; we can't let the
protocol break stand, or we'll break all sorts of cross-version
replication scenarios.  However, we'd better also change the protocol
spec to say that this is what is supposed to happen.

                        regards, tom lane


Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

Álvaro Herrera
In reply to this post by Tom Lane
On 2020-Sep-30, Tom Lane wrote:

> After some digging around, I realize that that commit actually
> resulted in a protocol break.  libpqwalreceiver is expecting to
> get an additional CommandComplete message after COPY OUT finishes,
> per libpqrcv_endstreaming(), and it's no longer getting one.
>
> (I have not read the protocol document to see if this is per spec;
> but spec or no, that's what libpqwalreceiver is expecting.)
>
> The question that this raises is how the heck did that get past
> our test suites?  It seems like the error should have been obvious
> to even the most minimal testing.

So I think I can answer this now.  Petr, as intellectual author of this
code I would appreciate your thoughts on this -- you probably understand
this much better.  Same goes for Peter as committer.

I think the logical replication feature is missing a detailed
architectural diagram.

What is happening is that the tablesync code, which is in charge of the
initial syncing of tables, has two main code paths.  The first one is an
optimization (?) in LogicalRepSyncTableStart(): if the sync takes little
time, the process exits immediately (via finish_sync_worker()) without
returning control to the main logical replication apply worker code.  In
the TAP tests, all tables are pretty small, so that optimization always
fires.

So in that case we never get control back to walrcv_endstreaming (the
complainant in Henry's test case), which is why replication never fails
in the tests.  You can verify this because the final "return slotname"
line in LogicalRepSyncTableStart has zero coverage.


The other code path returns the slot name to ApplyWorkerMain, which
does the normal walrcv_startstreaming / LogicalRepApplyLoop /
endstreaming dance.  And whenever that code path is taken, all the
src/test/subscription tests fail with the missing command tag, and
they work correctly when the command tag is restored, which is what
we wanted.

My proposal at this stage is to remove the optimization in
LogicalRepSyncTableStart, per 0001, if only because the code becomes
simpler (0002 reindents code).  With this, we get a coverage increase
for tablesync.c not only in the final "return slotname" but also for
process_syncing_tables_for_sync() which is currently uncovered:
https://coverage.postgresql.org/src/backend/replication/logical/tablesync.c.gcov.html#268

However, and this is one reason why I'd welcome Petr/Peter thoughts on
this, I don't really understand what happens in LogicalRepApplyLoop
afterwards with a tablesync worker; are we actually doing anything
useful there, considering that the actual data copy seems to have
occurred in the CopyFrom() call in copy_table?  In other words, by the
time we return control to ApplyWorkerMain with a slot name, isn't the
work all done, and the only thing we need is to synchronize protocol and
close the connection?

Or is it?


Another thing I noticed is that if I crank up the number of rows in the
new t/100_bugs.pl test case(*), the two sync workers run in sequence -- not
in parallel.  What's going on with that?  Shouldn't there be two
simultaneous workers?

(*) In the submitted patch, the test uses 10 million rows.  I intend to
reduce that to, say, 5000 rows in the committed version.  Strictly
speaking, there's no need for this test at all since with the
LogicalRepSyncTableStart changes I propose, even t/001_rep_changes.pl
fails.  But since my new test case fails even without that change, I
prefer to keep it anyway.


Attachment: 0001-Restore-double-command-completion-tags.patch (7K)
Attachment: 0002-move-declarations-and-reindent.patch (6K)
Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

Petr Jelinek
Hi,

On 10/10/2020 21:06, Alvaro Herrera wrote:

> On 2020-Sep-30, Tom Lane wrote:
>
>> After some digging around, I realize that that commit actually
>> resulted in a protocol break.  libpqwalreceiver is expecting to
>> get an additional CommandComplete message after COPY OUT finishes,
>> per libpqrcv_endstreaming(), and it's no longer getting one.
>>
>> (I have not read the protocol document to see if this is per spec;
>> but spec or no, that's what libpqwalreceiver is expecting.)
>>
>> The question that this raises is how the heck did that get past
>> our test suites?  It seems like the error should have been obvious
>> to even the most minimal testing.
>
> So I think I can answer this now.  Petr, as intellectual author of this
> code I would appreciate your thoughts on this -- you probably understand
> this much better.  Same goes for Peter as committer.
>
> I think the logical replication feature is missing an detailed
> architectural diagram.
>
> What is happening is that the tablesync code, which is in charge of
> initial syncing of tables, has two main code paths: the first one is an
> optimization (?) in LogicalRepSyncTableStart() that if the sync takes
> little time, the process will exit immediately (via
> finish_sync_worker()) without returning control to the main logical
> replication apply worker code.  In the TAP tests, all tables are pretty
> small, so that optimization always fires.

It's not only about the size of the tables; it's mainly that there is no
write activity, so the main apply is not moving past the LSN at which
the table sync started. With a bigger table you get at least something
written (running xacts, autovacuum, or whatever) that eventually moves
the LSN forward.
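
In other words, on an otherwise idle publisher any small committed write is
enough to move that position; a minimal illustration (the scratch table name
here is made up):

-- On the publisher (d1): generate a bit of WAL and check the current position.
CREATE TABLE wal_nudge (x int);      -- hypothetical scratch table
INSERT INTO wal_nudge VALUES (1);
SELECT pg_current_wal_lsn();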

>
> So in that case we never get control back to walrcv_endstreaming (the
> complainant in Henry's test case), which is why replication never fails
> in the tests.  You can verify this because the final "return slotname"
> line in LogicalRepSyncTableStart has zero coverage.
>

Correct.

>
> The other code path returns the slot name to ApplyWorkerMain, which
> does the normal walrcv_startstreaming / LogicalRepApplyLoop /
> endstreaming dance.  And whenever that code path is taken, all the
> src/test/subscription tests fail with the missing command tag, and
> they work correctly when the command tag is restored, which is what
> we wanted.
>
> My proposal at this stage is to remove the optimization in
> LogicalRepSyncTableStart, per 0001, if only because the code becomes
> simpler (0002 reindents code).  With this, we get a coverage increase
> for tablesync.c not only in the final "return slotname" but also for
> process_syncing_tables_for_sync() which is currently uncovered:
> https://coverage.postgresql.org/src/backend/replication/logical/tablesync.c.gcov.html#268
>
> However, and this is one reason why I'd welcome Petr/Peter thoughts on
> this, I don't really understand what happens in LogicalRepApplyLoop
> afterwards with a tablesync worker; are we actually doing anything
> useful there, considering that the actual data copy seems to have
> occurred in the CopyFrom() call in copy_table?  In other words, by the
> time we return control to ApplyWorkerMain with a slot name, isn't the
> work all done, and the only thing we need is to synchronize protocol and
> close the connection?
>

There are two possible states at that point: either the tablesync is
ahead (when the main apply lags or nothing is happening on the
publication side) or it's behind the main apply. When the tablesync is
ahead we are indeed done and just need to update the state of the table
(which is what the code you removed did, but LogicalRepApplyLoop should
do it as well, just a bit later). When it's behind, we need to do
catchup for that table only, which still happens in the tablesync
worker. See the explanation at the beginning of tablesync.c; it probably
needs some small adjustments after the changes in your first patch.
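
The per-table state machine described there is also visible from the
subscriber's catalog; a minimal sketch (state letters as documented for
pg_subscription_rel):

-- i = initialize, d = data is being copied, s = synchronized, r = ready
SELECT srrelid::regclass AS tbl, srsubstate
FROM pg_subscription_rel
WHERE srsubid = (SELECT oid FROM pg_subscription WHERE subname = 's');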

>
> Another thing I noticed is that if I crank up the number of rows in the
> new t/100_bugs.pl test case(*), the two sync workers run in sequence -- not
> in parallel.  What's going on with that?  Shouldn't there be two
> simultaneous workers?
>

That's likely because the second replication slot is waiting for the xid
created by the first one to commit or abort while building a consistent
snapshot.
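
If so, that wait should be observable while it happens; a rough sketch of
one way to look on the publisher (column subset only):

-- The second sync walsender should show up waiting while it builds the
-- consistent snapshot for its CREATE_REPLICATION_SLOT.
SELECT pid, wait_event_type, wait_event, state, query
FROM pg_stat_activity
WHERE backend_type = 'walsender';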

--
Petr Jelinek
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/


Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

Álvaro Herrera
On 2020-Oct-12, Petr Jelinek wrote:

> It's not only about size of the tables, it's mainly that there is no write
> activity so the main apply is not moving past the LSN at which table sync
> has started at. With bigger table you get at least something written
> (running xacts, autovacuum, or whatever) that moves lsn forward eventually.

I see -- yeah, okay.

> > However, and this is one reason why I'd welcome Petr/Peter thoughts on
> > this, I don't really understand what happens in LogicalRepApplyLoop
> > afterwards with a tablesync worker; are we actually doing anything
> > useful there, considering that the actual data copy seems to have
> > occurred in the CopyFrom() call in copy_table?  In other words, by the
> > time we return control to ApplyWorkerMain with a slot name, isn't the
> > work all done, and the only thing we need is to synchronize protocol and
> > close the connection?
>
> There are 2 possible states at that point, either tablesync is ahead (when
> main apply lags or nothing is happening on publication side) or it's behind
> the main apply. When tablesync is ahead we are indeed done and just need to
> update the state of the table (which is what the code you removed did, but
> LogicalRepApplyLoop should do it as well, just a bit later). When it's
> behind we need to do catchup for that table only which still happens in the
> tablesync worker. See the explanation at the beginning of tablesync.c, it
> probably needs some small adjustments after the changes in your first patch.

... Ooh, things start to make some sense now.  So how about the
attached?  There are some not-really-related cleanups.  (Changes to
protocol.sgml are still pending.)

If I understand correctly, the early exit in tablesync.c is not saving *a
lot* of time (we don't actually skip replaying any WAL), even if it's
saving execution of a bunch of code.  So I stand by my position that
removing the code is better because it's clearer about what is actually
happening.

Attachment: 0001-Restore-logical-replication-dupe-command-tags.patch (3K)
Attachment: 0002-Review-logical-replication-tablesync-code.patch (16K)
Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

Petr Jelinek
Hi,

On 14/10/2020 03:12, Alvaro Herrera wrote:

> On 2020-Oct-12, Petr Jelinek wrote:
>
>>> However, and this is one reason why I'd welcome Petr/Peter thoughts on
>>> this, I don't really understand what happens in LogicalRepApplyLoop
>>> afterwards with a tablesync worker; are we actually doing anything
>>> useful there, considering that the actual data copy seems to have
>>> occurred in the CopyFrom() call in copy_table?  In other words, by the
>>> time we return control to ApplyWorkerMain with a slot name, isn't the
>>> work all done, and the only thing we need is to synchronize protocol and
>>> close the connection?
>>
>> There are 2 possible states at that point, either tablesync is ahead (when
>> main apply lags or nothing is happening on publication side) or it's behind
>> the main apply. When tablesync is ahead we are indeed done and just need to
>> update the state of the table (which is what the code you removed did, but
>> LogicalRepApplyLoop should do it as well, just a bit later). When it's
>> behind we need to do catchup for that table only which still happens in the
>> tablesync worker. See the explanation at the beginning of tablesync.c, it
>> probably needs some small adjustments after the changes in your first patch.
>
> ... Ooh, things start to make some sense now.  So how about the
> attached?  There are some not really related cleanups.  (Changes to
> protocol.sgml are still pending.)
>

It would be nice if the new sentences at the beginning of tablesync.c
started with uppercase, but that's about as nitpicky as I can be :)

> If I understand correcly, the early exit in tablesync.c is not saving *a
> lot* of time (we don't actually skip replaying any WAL), even if it's
> saving execution of a bunch of code.  So I stand by my position that
> removing the code is better because it's clearer about what is actually
> happening.
>

I don't really have any problems with the simplification you propose.
The saved time is probably on the order of hundreds of ms, which for
table sync is insignificant.

--
Petr Jelinek
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/


Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

Álvaro Herrera
On 2020-Oct-14, Petr Jelinek wrote:

> It would be nice if the new sentences at the beginning of tablesync.c
> started with uppercase, but that's about as nitpicky as I can be :)

OK, fixed that :-)  And pushed (to master only).  There's one more
change I added at the last minute, which is to remove the 'missing_ok'
parameter of GetSubscriptionRelState.  There are some other cosmetic
changes, but nothing of substance.

> > If I understand correcly, the early exit in tablesync.c is not saving *a
> > lot* of time (we don't actually skip replaying any WAL), even if it's
> > saving execution of a bunch of code.  So I stand by my position that
> > removing the code is better because it's clearer about what is actually
> > happening.
>
> I don't really have any problems with the simplification you propose. The
> saved time is probably in order of hundreds of ms which for table sync is
> insignificant.

Great, thanks.

If you think this is done ... I have taken a few notes in the process:

* STREAM COMMIT bug?
  In apply_handle_stream_commit, we do CommitTransactionCommand, but
  apparently in a tablesync worker we shouldn't do it.
* process_syncing_tables_for_apply does not belong in tablesync.c (maybe
  worker.c?)
* DropSubscription bug / logicalrep_find_workers
  DropSubscription believes that it's valid to get a list of workers, then
  kill them one by one. But this is racy, since one worker could end for other
  reasons and be replaced by a worker for another subscription.  This code
  should use a loop that restarts after killing each worker.
* subscription_change_cb too coarse
  It seems that we should only reset "subscription valid" when *our*
  subscription is changed, not other subscriptions.  Otherwise we read our own
  sub data far too often.
* PostgresNode uses print where it should use Test::More's note()


Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

akapila
On Thu, Oct 15, 2020 at 8:20 PM Alvaro Herrera <[hidden email]> wrote:

>
> On 2020-Oct-14, Petr Jelinek wrote:
>
> > It would be nice if the new sentences at the beginning of tablesync.c
> > started with uppercase, but that's about as nitpicky as I can be :)
>
> OK, fixed that :-)  And pushed (to master only).  There's one more
> change I added at the last minute, which is to remove the 'missing_ok'
> parameter of GetSubscriptionRelState.  There are some other cosmetic
> changes, but nothing of substance.
>
> > > If I understand correcly, the early exit in tablesync.c is not saving *a
> > > lot* of time (we don't actually skip replaying any WAL), even if it's
> > > saving execution of a bunch of code.  So I stand by my position that
> > > removing the code is better because it's clearer about what is actually
> > > happening.
> >
> > I don't really have any problems with the simplification you propose. The
> > saved time is probably in order of hundreds of ms which for table sync is
> > insignificant.
>
> Great, thanks.
>
> If you think this is done ... I have taken a few notes in the process:
>
> * STREAM COMMIT bug?
>   In apply_handle_stream_commit, we do CommitTransactionCommand, but
>   apparently in a tablesync worker we shouldn't do it.
>

In the tablesync stage, we don't allow streaming; see pgoutput_startup,
where we disable streaming for the init phase. As far as I understand,
for tablesync we create the initial slot (during which streaming is
disabled), then we copy the table (logical decoding isn't used here),
and then allow the apply worker to get any other data that is inserted
in the meantime. Now, I might be missing something here, but if you can
explain it a bit more or share a test showing how we can reach this
point via the tablesync worker, then we can discuss a possible
solution.

Sorry for responding late.

--
With Regards,
Amit Kapila.


Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

Álvaro Herrera
On 2020-Nov-04, Amit Kapila wrote:

> On Thu, Oct 15, 2020 at 8:20 PM Alvaro Herrera <[hidden email]> wrote:

> > * STREAM COMMIT bug?
> >   In apply_handle_stream_commit, we do CommitTransactionCommand, but
> >   apparently in a tablesync worker we shouldn't do it.
>
> In the tablesync stage, we don't allow streaming. See pgoutput_startup
> where we disable streaming for the init phase. As far as I understand,
> for tablesync we create the initial slot during which streaming will
> be disabled then we will copy the table (here logical decoding won't
> be used) and then allow the apply worker to get any other data which
> is inserted in the meantime. Now, I might be missing something here
> but if you can explain it a bit more or share some test to show how we
> can reach here via tablesync worker then we can discuss the possible
> solution.

Hmm, okay, that sounds like there would be no bug then.  Maybe what we
need is just an assert in apply_handle_stream_commit that
!am_tablesync_worker(), as in the attached patch.  Passes tests.

Attachment: assert.patch (522 bytes)
Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

akapila
On Wed, Nov 4, 2020 at 7:19 PM Alvaro Herrera <[hidden email]> wrote:

>
> On 2020-Nov-04, Amit Kapila wrote:
>
> > On Thu, Oct 15, 2020 at 8:20 PM Alvaro Herrera <[hidden email]> wrote:
>
> > > * STREAM COMMIT bug?
> > >   In apply_handle_stream_commit, we do CommitTransactionCommand, but
> > >   apparently in a tablesync worker we shouldn't do it.
> >
> > In the tablesync stage, we don't allow streaming. See pgoutput_startup
> > where we disable streaming for the init phase. As far as I understand,
> > for tablesync we create the initial slot during which streaming will
> > be disabled then we will copy the table (here logical decoding won't
> > be used) and then allow the apply worker to get any other data which
> > is inserted in the meantime. Now, I might be missing something here
> > but if you can explain it a bit more or share some test to show how we
> > can reach here via tablesync worker then we can discuss the possible
> > solution.
>
> Hmm, okay, that sounds like there would be no bug then.  Maybe what we
> need is just an assert in apply_handle_stream_commit that
> !am_tablesync_worker(), as in the attached patch.  Passes tests.
>

+1. But do we want to have this Assert only in the stream_commit API, or
in all the stream APIs as well?

--
With Regards,
Amit Kapila.


Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

akapila
On Thu, Nov 5, 2020 at 9:13 AM Amit Kapila <[hidden email]> wrote:

>
> On Wed, Nov 4, 2020 at 7:19 PM Alvaro Herrera <[hidden email]> wrote:
> >
> > On 2020-Nov-04, Amit Kapila wrote:
> >
> > > On Thu, Oct 15, 2020 at 8:20 PM Alvaro Herrera <[hidden email]> wrote:
> >
> > > > * STREAM COMMIT bug?
> > > >   In apply_handle_stream_commit, we do CommitTransactionCommand, but
> > > >   apparently in a tablesync worker we shouldn't do it.
> > >
> > > In the tablesync stage, we don't allow streaming. See pgoutput_startup
> > > where we disable streaming for the init phase. As far as I understand,
> > > for tablesync we create the initial slot during which streaming will
> > > be disabled then we will copy the table (here logical decoding won't
> > > be used) and then allow the apply worker to get any other data which
> > > is inserted in the meantime. Now, I might be missing something here
> > > but if you can explain it a bit more or share some test to show how we
> > > can reach here via tablesync worker then we can discuss the possible
> > > solution.
> >
> > Hmm, okay, that sounds like there would be no bug then.  Maybe what we
> > need is just an assert in apply_handle_stream_commit that
> > !am_tablesync_worker(), as in the attached patch.  Passes tests.
> >
>
> +1. But do we want to have this Assert only in stream_commit API or
> all stream APIs as well?
>

One more point to look at here: at what point, if any, is the tablesync
worker involved in applying decoded transactions?  Basically, I would
like to ensure that if it uses the slot it initially created (before
the copy) then it is probably fine, because we don't enable streaming
with it during the initial phase; but if it uses the slot to decode
xacts after the copy, then we probably need to check once whether it is
still true that streaming is not enabled at that point. I am not
completely sure whether there are existing test cases covering any such
scenarios, so thinking a bit more about this might be helpful.


--
With Regards,
Amit Kapila.

