BUG #15567: Wal receiver process restart failed when a damaged wal record arrived at standby.

BUG #15567: Wal receiver process restart failed when a damaged wal record arrived at standby.

PG Doc comments form
The following bug has been logged on the website:

Bug reference:      15567
Logged by:          movead
Email address:      [hidden email]
PostgreSQL version: 10.5
Operating system:   centos7
Description:        

[Problem description]
We use PostgreSQL 9.5.7 with a replica, and I found that if a damaged WAL record
arrives at the standby, the startup process on the standby tries to restart the
walreceiver process but never succeeds, logging the messages below in an
infinite cycle on the standby. If I restart the database at that point, it succeeds.
(I tried it on 10.5 with the same result.)
-----------------------------------------------------------------------------------------------------------------
2018-12-15 10:38:00.302 CST,,,30928,,5c1422d5.78d0,2,,2018-12-15 05:38:29
CST,,0,FATAL,57P01,"terminating walreceiver process due to administrator
command",,,,,,,,,""
2018-12-15 10:38:00.302 CST,,,5546,,5c1276c1.15aa,8,,2018-12-13 23:12:01
CST,1/0,0,LOG,00000,"incorrect resource manager data checksum in record at
5A9/11E0A90",,,,,,,,,""
[... the same "incorrect resource manager data checksum in record at
5A9/11E0A90" LOG entry repeats every ~5 seconds ...]
2018-12-15 10:39:00.370 CST,,,5546,,5c1276c1.15aa,22,,2018-12-13 23:12:01
CST,1/0,0,LOG,00000,"incorrect resource manager data checksum in record at
5A9/11E0A90",,,,,,,,,""
-----------------------------------------------------------------------------------------------------------------


[Code review]
I read the code for the startup process and the walreceiver process.

When a damaged WAL record arrives:
1. The startup process shuts down the walreceiver process using ShutdownWalRcv().
2. The startup process starts the walreceiver process using RequestXLogStreaming(),
via the signal PMSIGNAL_START_WALRECEIVER.
3. The startup process tries to read a WAL record.
        Because the walreceiver has not finished starting up yet, the startup
process reads the damaged WAL record again,
        and sets walrcv->walRcvState = WALRCV_STOPPED via ShutdownWalRcv().
4. Go to 2.
That's the infinite cycle.

[Question]
I wonder whether this is a bug or whether you just don't want to restart the
walreceiver process.

Re: BUG #15567: Wal receiver process restart failed when a damaged wal record arrived at standby.

Maxim Boguk


On Fri, Dec 28, 2018 at 8:10 PM PG Bug reporting form <[hidden email]> wrote:
> [Problem description]
> [...]
> [Question]
> I wonder whether this is a bug or whether you just don't want to restart the
> walreceiver process.

It looks like the same issue as I reported in BUG #15151
(unfortunately I didn't have time to explore it further then).

You have a good description of what's going on under the hood.


--
Maxim Boguk
Senior PostgreSQL DBA
https://dataegret.com/


Re: BUG #15567: Wal receiver process restart failed when a damaged wal record arrived at standby.

lichuancheng@highgo.com

>It looks like the same issue as I reported in BUG #15151.
>https://www.postgresql.org/message-id/flat/152353560977.31233.11406243320876395727%40wrigleys.postgresql.org
>(unfortunately I didn't have time to explore it further then).

>You have a good description of what's going on under the hood.

Hi, I read BUG #15151.
You said that restarting the slave did not help.

But in 15567, either restarting the slave or moving the WAL segment from the master to the slave makes the slave normal again.
BUG #15567: Wal receiver process restart failed when a damaged wal record arrived at standby.

lichuancheng@highgo.com
Hello,

I analyzed the cause of the bug when I reported it.

>[Code review]
>I read the code for the startup process and the walreceiver process.
>When a damaged WAL record arrives:
>1. The startup process shuts down the walreceiver process using ShutdownWalRcv().
>2. The startup process starts the walreceiver process using RequestXLogStreaming(), via the signal PMSIGNAL_START_WALRECEIVER.
>3. The startup process tries to read a WAL record.
> Because the walreceiver has not finished starting up yet, the startup process reads the damaged WAL record again,
> and sets walrcv->walRcvState = WALRCV_STOPPED via ShutdownWalRcv().
>4. Go to 2.
>That's the infinite cycle.


Now I attach a patch to this mail to solve the bug.
I tested it on 10.5 and it works well.

-----
movead

bug15567_lichuancheng_20190103.patch (1K) Download Attachment
Re: BUG #15567: Wal receiver process restart failed when a damaged wal record arrived at standby.

Michael Paquier
On Thu, Jan 03, 2019 at 05:50:48PM +0800, [hidden email] wrote:

>> I read the code for the startup process and the walreceiver process.
>> When a damaged WAL record arrives:
>> 1. The startup process shuts down the walreceiver process using ShutdownWalRcv().
>> 2. The startup process starts the walreceiver process using RequestXLogStreaming(), via the signal PMSIGNAL_START_WALRECEIVER.
>> 3. The startup process tries to read a WAL record.
>>  Because the walreceiver has not finished starting up yet, the startup process reads the damaged WAL record again,
>>  and sets walrcv->walRcvState = WALRCV_STOPPED via ShutdownWalRcv().
>> 4. Go to 2.
>> That's the infinite cycle.
>
> Now I attach a patch to this mail to solve the bug.
> I tested it on 10.5 and it works well.
So the take of the patch is to make sure that the WAL receiver has
been correctly started if the startup process requested one before
trying to replay more WAL records.  Now it seems to me that you do not
close all the gaps here.  The WAL receiver could finish starting, but
still not have the time to stream up to the point where the startup
process could consume things safely.  The patch makes somewhat sure
that the gap is closed by applying a timeout of 1s, still in
environments with more latency it would make the failure still
happen.  Wouldn't it be better instead to make sure that the startup
process does not try to use once again the same record where the
failure happened before the WAL receiver confirms that it has flushed
something newer than the LSN position which was known as bad?

One problem may be that upon restart we reset receivedUpto, and that
visibly we may want to be more careful with its handling to avoid
successive failures, so that the startup process does not bump
repetitively into the same bogus record.  Looking at the code, the
computation looks quite safe though (I need to double-check to be 100%
sure), so isn't the origin of the problem just that the WAL segment
from the primary is corrupted, and that what gets streamed is broken
at its origin point, preventing the standby from moving on?  You may want
to double-check that as well.

As a side note, the patch is incorrect for a couple of things though:
1) Formatting of comments and indentation.
2) Please be careful of whitespaces.
3) Usually no more than 72-80 characters per line.
4) Why a wait phase of 1s?
5) And much worse, the WAL receiver status cannot be looked at without
taking its mutex lock first, so this makes the logic look at something
which may be concurrently updated.  Oops.
6) Still the patch is incorrect per the reasons above ;)
--
Michael

signature.asc (849 bytes) Download Attachment
Re: BUG #15567: Wal receiver process restart failed when a damaged wal record arrived at standby.

lichuancheng@highgo.com

>So the take of the patch is to make sure that the WAL receiver has
>been correctly started if the startup process requested one before
>trying to replay more WAL records.  Now it seems to me that you do not
>close all the gaps here.  The WAL receiver could finish starting, but
>still not have the time to stream up to the point where the startup
>process could consume things safely.  The patch makes somewhat sure
>that the gap is closed by applying a timeout of 1s, still in
>environments with more latency it would make the failure still
>happen.  Wouldn't it be better instead to make sure that the startup
>process does not try to use once again the same record where the
>failure happened before the WAL receiver confirms that it has flushed
>something newer than the LSN position which was known as bad?
 
>One problem may be that upon restart we reset receivedUpto, and that
>visibly we may want to be more careful with its handling to avoid
>successive failures, so that the startup process does not bump
>repetitively into the same bogus record.  Looking at the code, the
>computation looks quite safe though (I need to double-check to be 100%
>sure), so isn't the origin of the problem just that the WAL segment
>from the primary is corrupted, and that what gets streamed is broken
>at its origin point, preventing the standby from moving on?  You may want
>to double-check that as well.

I understand what you describe.
Normally, when the walreceiver process restarts, the code resets the redo LSN to the
start of the WAL segment. But here the problem is that the walreceiver process cannot
restart successfully, so a correct WAL record never arrives, and the startup process is
stuck in an infinite cycle of starting and stopping the walreceiver process.
So the way to solve the problem is to give the walreceiver process some time to
restart. As long as it restarts successfully, replay resumes at the start of the
WAL segment from the master and the correct WAL record will arrive. So I do not
think your point that 'still in environments with more latency it would make the
failure still happen' applies here. I think the patch can solve the problem after
modifying it per your side notes below.

I have made sure that the broken WAL record was broken during streaming;
it is fine on the master and on the other slaves.

 
>As a side note, the patch is incorrect for a couple of things though:
>1) Formatting of comments and indentation.
>2) Please be careful of whitespaces.
>3) Usually no more than 72-80 characters per line.
>4) Why a wait phase of 1s?
I think 1s is enough for the walreceiver process to restart. Maybe it needs discussion.

>5) And much worse, the WAL receiver status cannot be looked at without
>taking its mutex lock first, so this makes the logic look at something
>which may be concurrently updated.  Oops.
I am sorry.

>6) Still the patch is incorrect per the reasons above ;)
I am new here; thanks for your explanation in the side notes.

-----
movead


Re: BUG #15567: Wal receiver process restart failed when a damaged wal record arrived at standby.

lichuancheng@highgo.com

Thanks, Michael Paquier, for your explanation.
I have modified the patch in the mail attachment; I look forward to your reply.


-----
movead


bug15567_lichuancheng_20190111.patch (1K) Download Attachment