Unresolved replication hang and stop problem.

Krzysztof Kois
Hello,
After upgrading the cluster from 10.x to 13.1 we've started hitting a problem described on pgsql-general:
https://www.postgresql.org/message-id/8bf8785c-f47d-245c-b6af-80dc1eed40db%40unitygroup.com
We've noticed a similar issue being described on this list in
https://www.postgresql-archive.org/Logical-replication-CPU-bound-with-TRUNCATE-DROP-CREATE-many-tables-tt6155123.html
with a fix being rolled out in 13.2.

After the 13.2 release we upgraded to it, but unfortunately this did not solve the issue: replication still stalls just as described in the original report.
Please advise how to debug and solve this issue.

Re: Unresolved replication hang and stop problem.

akapila
On Tue, Apr 13, 2021 at 1:18 PM Krzysztof Kois
<[hidden email]> wrote:
>
> Hello,
> After upgrading the cluster from 10.x to 13.1 we've started hitting a problem described on pgsql-general:
> https://www.postgresql.org/message-id/8bf8785c-f47d-245c-b6af-80dc1eed40db%40unitygroup.com
> We've noticed a similar issue being described on this list in
> https://www.postgresql-archive.org/Logical-replication-CPU-bound-with-TRUNCATE-DROP-CREATE-many-tables-tt6155123.html
> with a fix being rolled out in 13.2.
>

The fix for the problem discussed in the above threads is committed
only in PG-14, see [1]. I don't know what makes you think it is fixed
in 13.2. Also, it is not easy to back-patch that because this fix
depends on some of the infrastructure introduced in PG-14.

[1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d7eb52d7181d83cf2363570f7a205b8eb1008dbc

--
With Regards,
Amit Kapila.



Re: Unresolved replication hang and stop problem.

Álvaro Herrera
On 2021-Apr-14, Amit Kapila wrote:

> On Tue, Apr 13, 2021 at 1:18 PM Krzysztof Kois
> <[hidden email]> wrote:
> >
> > Hello,
> > After upgrading the cluster from 10.x to 13.1 we've started hitting a problem described on pgsql-general:
> > https://www.postgresql.org/message-id/8bf8785c-f47d-245c-b6af-80dc1eed40db%40unitygroup.com
> > We've noticed a similar issue being described on this list in
> > https://www.postgresql-archive.org/Logical-replication-CPU-bound-with-TRUNCATE-DROP-CREATE-many-tables-tt6155123.html
> > with a fix being rolled out in 13.2.
>
> The fix for the problem discussed in the above threads is committed
> only in PG-14, see [1]. I don't know what makes you think it is fixed
> in 13.2. Also, it is not easy to back-patch that because this fix
> depends on some of the infrastructure introduced in PG-14.
>
> [1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d7eb52d7181d83cf2363570f7a205b8eb1008dbc

Hmm ... On what does it depend (other than plain git conflicts, which
are aplenty)?  On a quick look at the commit, it's clear that we need to
be careful in order not to cause an ABI break.  That doesn't seem
impossible to solve, but I'm wondering if there is more to it than that.

--
Álvaro Herrera                            39°49'30"S 73°17'W



Re: Unresolved replication hang and stop problem.

akapila
On Wed, Apr 28, 2021 at 6:48 AM Alvaro Herrera <[hidden email]> wrote:

>
> On 2021-Apr-14, Amit Kapila wrote:
>
> > On Tue, Apr 13, 2021 at 1:18 PM Krzysztof Kois
> > <[hidden email]> wrote:
> > >
> > > Hello,
> > > After upgrading the cluster from 10.x to 13.1 we've started hitting a problem described on pgsql-general:
> > > https://www.postgresql.org/message-id/8bf8785c-f47d-245c-b6af-80dc1eed40db%40unitygroup.com
> > > We've noticed a similar issue being described on this list in
> > > https://www.postgresql-archive.org/Logical-replication-CPU-bound-with-TRUNCATE-DROP-CREATE-many-tables-tt6155123.html
> > > with a fix being rolled out in 13.2.
> >
> > The fix for the problem discussed in the above threads is committed
> > only in PG-14, see [1]. I don't know what makes you think it is fixed
> > in 13.2. Also, it is not easy to back-patch that because this fix
> > depends on some of the infrastructure introduced in PG-14.
> >
> > [1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d7eb52d7181d83cf2363570f7a205b8eb1008dbc
>
> Hmm ... On what does it depend (other than plain git conflicts, which
> are aplenty)?  On a quick look at the commit, it's clear that we need to
> be careful in order not to cause an ABI break.  That doesn't seem
> impossible to solve, but I'm wondering if there is more to it than that.
>

As mentioned in the commit message, we also need the change from
another commit [1] to make this work.

[1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c55040ccd0

--
With Regards,
Amit Kapila.



Re: Unresolved replication hang and stop problem.

Álvaro Herrera
On 2021-Apr-28, Amit Kapila wrote:

> On Wed, Apr 28, 2021 at 6:48 AM Alvaro Herrera <[hidden email]> wrote:

> > Hmm ... On what does it depend (other than plain git conflicts, which
> > are aplenty)?  On a quick look at the commit, it's clear that we need to
> > be careful in order not to cause an ABI break.  That doesn't seem
> > impossible to solve, but I'm wondering if there is more to it than that.
>
> As mentioned in the commit message, we also need the change from
> another commit [1] to make this work.
>
> [1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c55040ccd0

Oh, yeah, that looks tougher.  (Still not impossible: it adds a new WAL
message type, but we have added such on a minor release before.)

... It's strange that replication worked for them on pg10 though and
broke on 13.  Did we change anything to make it so?  (I don't have
any fish to fry on this topic at present, but it seems a bit
concerning.)

--
Álvaro Herrera       Valdivia, Chile



Re: Unresolved replication hang and stop problem.

akapila
On Wed, Apr 28, 2021 at 7:36 PM Alvaro Herrera <[hidden email]> wrote:

>
> On 2021-Apr-28, Amit Kapila wrote:
>
> > On Wed, Apr 28, 2021 at 6:48 AM Alvaro Herrera <[hidden email]> wrote:
>
> > > Hmm ... On what does it depend (other than plain git conflicts, which
> > > are aplenty)?  On a quick look at the commit, it's clear that we need to
> > > be careful in order not to cause an ABI break.  That doesn't seem
> > > impossible to solve, but I'm wondering if there is more to it than that.
> >
> > As mentioned in the commit message, we also need the change from
> > another commit [1] to make this work.
> >
> > [1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c55040ccd0
>
> Oh, yeah, that looks tougher.  (Still not impossible: it adds a new WAL
> message type, but we have added such on a minor release before.)
>

Yeah, we can try to make it possible if it is really a pressing issue
but I guess even in that case it is better to do it after we release
PG14 so that it can get some more testing.

> ... It's strange that replication worked for them on pg10 though and
> broke on 13.  Did we change anything to make it so?
>

No idea but probably if the other person can share the exact test case
which he sees working fine on PG10 but not on PG13 then it might be a
bit easier to investigate.

--
With Regards,
Amit Kapila.



Re: Unresolved replication hang and stop problem.

Álvaro Herrera
Cc'ing Lukasz Biegaj because of the pgsql-general thread.

On 2021-Apr-29, Amit Kapila wrote:

> On Wed, Apr 28, 2021 at 7:36 PM Alvaro Herrera <[hidden email]> wrote:

> > ... It's strange that replication worked for them on pg10 though and
> > broke on 13.  Did we change anything to make it so?
>
> No idea but probably if the other person can share the exact test case
> which he sees working fine on PG10 but not on PG13 then it might be a
> bit easier to investigate.

Ah, noticed now that Krzysztof posted links to these older threads,
where a problem is described:

https://www.postgresql.org/message-id/flat/CANDwggKYveEtXjXjqHA6RL3AKSHMsQyfRY6bK%2BNqhAWJyw8psQ%40mail.gmail.com
https://www.postgresql.org/message-id/flat/8bf8785c-f47d-245c-b6af-80dc1eed40db%40unitygroup.com

Krzysztof said "after upgrading to pg13 we started having problems",
which implicitly indicates that the same thing worked well in pg10 ---
but if the problem has been correctly identified, then this wouldn't
have worked in pg10 either.  So something in the story doesn't quite
match up.  Maybe it's not the same problem after all, or maybe they
weren't doing X in pg10 which they are attempting in pg13.

Krzysztof, Lukasz, maybe you can describe more?

--
Álvaro Herrera       Valdivia, Chile



Re: Unresolved replication hang and stop problem.

Lukasz Biegaj
Hey, thanks for reaching out, and sorry for the late reply - we had a few
days of national holidays.

On 29.04.2021 15:55, Alvaro Herrera wrote:

> https://www.postgresql.org/message-id/flat/CANDwggKYveEtXjXjqHA6RL3AKSHMsQyfRY6bK%2BNqhAWJyw8psQ%40mail.gmail.com
> https://www.postgresql.org/message-id/flat/8bf8785c-f47d-245c-b6af-80dc1eed40db%40unitygroup.com
>
> Krzysztof said "after upgrading to pg13 we started having problems",
> which implicitly indicates that the same thing worked well in pg10 ---
> but if the problem has been correctly identified, then this wouldn't
> have worked in pg10 either.  So something in the story doesn't quite
> match up.  Maybe it's not the same problem after all, or maybe they
> weren't doing X in pg10 which they are attempting in pg13.
>

The problem started occurring after the upgrade from pg10 to pg13. No other
changes were made, in particular none to the database structure or to the
operations performed.

The problem is as described in
https://www.postgresql.org/message-id/flat/8bf8785c-f47d-245c-b6af-80dc1eed40db%40unitygroup.com

It does occur on two separate production clusters and one test cluster -
all belonging to the same customer, although processing slightly
different data (it's an e-commerce store with multiple languages and
separate production databases for each language).

We've tried recreating the database from dump, and recreating the
replication, but without any positive effect - the problem persists.

We did not roll back the databases to pg10; instead we've stayed with
pg13 and implemented a shell script to kill the walsender process if it
seems stuck in `hash_seq_search`. It's ugly, but it works, and we back up
and monitor the data integrity anyway.
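
For readers wondering what such a watchdog could look like, here is a minimal sketch of the decision logic, written in Python rather than shell. It is not the script described above. It assumes that textual stack samples of the walsender have already been collected by some external tool, and the function names, the 90% threshold and the choice of SIGTERM are all hypothetical choices made for illustration.

```python
import os
import signal

STUCK_SYMBOL = "hash_seq_search"  # function name reported in this thread


def looks_stuck(stack_samples, symbol=STUCK_SYMBOL, min_fraction=0.9):
    """Return True if `symbol` appears in at least `min_fraction` of samples.

    `stack_samples` is a list of strings, each a textual backtrace of the
    walsender taken at a different moment. How they are collected is left
    out on purpose; that part is an assumption of this sketch.
    """
    if not stack_samples:
        return False
    hits = sum(1 for sample in stack_samples if symbol in sample)
    return hits / len(stack_samples) >= min_fraction


def terminate_if_stuck(pid, stack_samples, kill=os.kill):
    """Send SIGTERM to `pid` when the samples suggest it is stuck.

    `kill` is injectable so the logic can be exercised without touching a
    real process. Returns True if a signal was sent.
    """
    if looks_stuck(stack_samples):
        kill(pid, signal.SIGTERM)
        return True
    return False
```

Requiring the symbol in most of several samples, rather than in a single one, is meant to avoid killing a walsender that merely passes through `hash_seq_search` during normal work.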

I'd be happy to help debug the issue if I knew how to do it
:-). If you'd like, we can also try rolling the installation
back to pg10 to be certain that this was not caused by schema changes.


--
Lukasz Biegaj | Unity Group | https://www.unitygroup.com/
System Architect, AWS Certified Solutions Architect



Re: Unresolved replication hang and stop problem.

Álvaro Herrera
Hi Lukasz, thanks for following up.

On 2021-May-04, Lukasz Biegaj wrote:

> The problem is as described in https://www.postgresql.org/message-id/flat/8bf8785c-f47d-245c-b6af-80dc1eed40db%40unitygroup.com
>
> It does occur on two separate production clusters and one test cluster - all
> belonging to the same customer, although processing slightly different data
> (it's an e-commerce store with multiple languages and separate production
> databases for each language).

I think the best next move would be to make certain that the problem is
what we think it is, so that we can discuss if Amit's commit is an
appropriate fix.  I would suggest doing that by running the problematic
workload in the test system under "perf record -g" and then getting a report
with "perf report -g", which should hopefully give enough of a clue.
(Sometimes the reports are much better if you use a binary that was
compiled with -fno-omit-frame-pointer, so if you're in a position to try
that, it might be useful -- or apparently you could try "perf record
--call-graph dwarf" or "perf record --call-graph lbr", depending.)
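
As a rough illustration of those invocations, the sketch below only assembles the perf command lines and runs nothing. It assumes perf is attached to an already running walsender by PID and that sampling is bounded by a `sleep`; the PID, the duration, the output file name and the helper names are hypothetical and not taken from this thread.

```python
import shlex


def perf_record_cmd(pid, seconds=30, call_graph=None, output="perf.data"):
    """Build an argv list for sampling one process with call graphs.

    call_graph=None uses plain "-g"; "dwarf" or "lbr" selects the
    alternative unwinding modes mentioned in the message. Other modes
    are rejected only to keep this sketch small.
    """
    if call_graph not in (None, "dwarf", "lbr"):
        raise ValueError(f"unsupported call-graph mode: {call_graph!r}")
    graph_args = ["-g"] if call_graph is None else ["--call-graph", call_graph]
    return ["perf", "record", *graph_args, "-p", str(pid), "-o", output,
            "--", "sleep", str(seconds)]


def perf_report_cmd(input_file="perf.data"):
    """Build an argv list for a plain-text call-graph report."""
    return ["perf", "report", "-g", "-i", input_file, "--stdio"]


if __name__ == "__main__":
    # 12345 is a placeholder PID, not a value from the thread.
    print(shlex.join(perf_record_cmd(12345, call_graph="dwarf")))
    print(shlex.join(perf_report_cmd()))
```

The walsender PID would have to be looked up separately, for example from the `pid` column of `pg_stat_replication`.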

Also I would be much more comfortable about proposing to backpatch such
an invasive change if you could ensure that in pg10 the same workload
does not cause the problem.  If that is confirmed, it'd be clear we're
talking about a regression.

--
Álvaro Herrera       Valdivia, Chile
"I'm always right, but sometimes I'm more right than other times."
                                                  (Linus Torvalds)



Re: Unresolved replication hang and stop problem.

Lukasz Biegaj
On 04.05.2021 16:35, Alvaro Herrera wrote:
> I would suggest doing that by running the problematic
> workload in the test system under "perf record -g"
> [..]
> you could ensure that in pg10 the same workload
> does not cause the problem.

We'll go with both suggestions. I expect to get back to you with
results in a week or two.

--
Lukasz Biegaj | Unity Group | https://www.unitygroup.com/
System Architect, AWS Certified Solutions Architect