Race condition in recovery?

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Race condition in recovery?

Dilip Kumar-2
While analyzing one of the customer issues, based on the log it
appeared that there is a race condition in the recovery process.

The summary of the issue is, that one of the standby is promoted as
the new primary (Node2) and another standby (Node3) is restarted and
set the primary_info and the restore_command so that it can
stream/restore from Node2 (new primary).  The problem is that during
the promotion the timeline switch happened in the middle of the
segment in Node2 and the Node3 is able to restore the newTLI.history
file from the archive but the WAL file with the new TLI is not yet
archived.  Now, we will try to stream the wal file from the primary
but if we are fetching the checkpoint that time we will not use the
latest timeline instead we will try with the checkpoint timeline and
walsender will send the WAL from the new timeline file because
requested WAL recptr is before the TLI switch.   And once that
happened the expectedTLEs will be set based on the oldTLI.history
file.  Now, whenever we try to restore the required WAL file and
recoveryTargetTimeLineGoal is set to the latest we again try to rescan
the latest timeline (rescanLatestTimeLine) but the problem is
recoveryTargetTLI was already set to the latest timeline.  So now the
problem is expectedTLEs is set to oldTLI and recoveryTargetTLI is set
to newTLI and rescanLatestTimeLine will never update the expectedTLEs.
Now,  walsender will fail to stream new WAL using the old TLI and the
archiver process will also fail because that file doesn't not exists
anymore (converted to .partial).  Basically, now we will never try
with the newTLI.

I have given the sequence of the events based on my analysis.

Refer to the sequence of event
-----------------------------------------
Node1 primary, Node2 standby1, Node3 standby2

1. Node2 got promoted to new primary, and node 2 picked new TL 13 in
the middle of the segment.
2. Node3, restarted with new primary info of Node2 and restore command
3. Node3, found the newest TL13 in validateRecoveryParameters()
Because the latest TL was requested in recovery.conf (history file
restored from TL13) and set recoveryTargetTLI to 13
So point to note is recoveryTargetTLI is set to 13 but expectedTLEs is
not yet set.
4. Node3, entered into the standby mode.
5. Node3, tries to read the checkpoint Record, on Node3 still the
checkpoint TL (ControlFile->checkPointCopy.ThisTimeLineID) is 12.
6. Node3, tries to get the checkpoint record file using new TL13 from
the archive which it should get ideally but it may not if the Node2
haven't yet archived it.
7. Node3, tries to stream from primary but using TL12 because
ControlFile->checkPointCopy.ThisTimeLineID is 12.
8. Node3, get it because walsender of Node2 read it from TL13 and send
it and Node2 write in the new WAL file but with TL12.

WalSndSegmentOpen()
{
/*-------
* When reading from a historic timeline, and there is a timeline switch
* within this segment, read from the WAL segment belonging to the new
* timeline.
}

9. Node3, now set the expectedTLEs to 12 because that is what
walreceiver has streamed the WAL using.

10. Node3, now recoveryTargetTLI is 13 and expectedTLEs is 12.  So
whenever it tries to find the latest TLE (rescanLatestTimeLine ) it
finds it is 13 which is the same as recoveryTargetTLI so nothing to
change but expectedTLEs is 12 using which it will try to
restore/stream further WAL and fail every time.

rescanLatestTimeLine(void)
{
....
newtarget = findNewestTimeLine(recoveryTargetTLI);
if (newtarget == recoveryTargetTLI)
{
      /* No new timelines found */
        return false;
}

11.  So now the situation is that ideally in rescanLatestTimeLine() we
should get the latest TLI but it assumes that it is already on the
latest TLI because recoveryTargetTLI is on the latest TLI.

Other points to be noted:
- The actual issue happened on 9.6.11 but based on the code comparison
it appeared that same can occur on the latest code as well.
- After Node3 is shutdown wal from its pg_wal/ directory were removed
so that it can follow the new primary.

Based on the sequence of events It is quite clear that something is
wrong in rescanLatestTimeLine, maybe after selecting the latest TLI we
should compare it with the head of the expectedTLEs as well instead of
just comparing it to the recoveryTargetTLI?


Log from Node2:
2020-12-22 04:49:02 UTC [1019]: [9-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  received promote request
2020-12-22 04:49:02 UTC [1019]: [10-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  redo done at 0/F8000028
2020-12-22 04:49:02 UTC [1019]: [11-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  last completed transaction was at log time
2020-12-22 04:48:01.833476+00
rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000C00000000000000F8"
failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous
errors) (code 23) at main.c(1179) [sender=3.1.2]
rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000D.history" failed:
No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous
errors) (code 23) at main.c(1179) [sender=3.1.2]
2020-12-22 04:49:02 UTC [1019]: [12-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  selected new timeline ID: 13
2020-12-22 04:49:02 UTC [1019]: [13-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  archive recovery complete


Log From Node3 (with pointwise analysis):

1. Node3 restarted, restored 0000000D.history from archive and
recoveryTargetTLI is set to 13
2020-12-22 04:49:40 UTC [2896]: [2-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  database system is shut down
2020-12-22 04:49:40 UTC [2872]: [6-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  database system is shut down
2020-12-22 04:49:41 UTC [9082]: [1-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  database system was shut down in recovery at
2020-12-22 04:49:40 UTC
2020-12-22 04:49:41 UTC [9082]: [2-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  creating missing WAL directory
"pg_xlog/archive_status"
2020-12-22 04:49:41 UTC [9082]: [3-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  restored log file "0000000D.history" from archive
rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000E.history" failed:
No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous
errors) (code 23) at main.c(1179) [sender=3.1.2]
2020-12-22 04:49:41 UTC [9082]: [4-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  entering standby mode
2020-12-22 04:49:41 UTC [9082]: [5-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  restored log file "0000000D.history" from archive


2. Tries to get the WAL file with different timelines from the archive
but did not get so expectedTLEs is not yet set

rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000D00000000000000F8"
failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous
errors) (code 23) at main.c(1179) [sender=3.1.2]
rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000C00000000000000F8"
failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous
errors) (code 23) at main.c(1179) [sender=3.1.2]
rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000B00000000000000F8"
failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous
errors) (code 23) at main.c(1179) [sender=3.1.2]
rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000100000000000000F8"
failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous
errors) (code 23) at main.c(1179) [sender=3.1.2]

3. Since fetching the checkpoint record so use the checkpoint TLI
which is 12,  although primary doesn’t have 0000000C00000000000000F8
file as it renamed it to 0000000C00000000000000F8.partial

But there is the logic in walsender that if requested wal is there in
the current TLI then send from their so it will stream from
0000000D00000000000000F8 file

2020-12-22 04:49:42 UTC [9105]: [1-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  fetching timeline history file for timeline 12
from primary server
2020-12-22 04:49:42 UTC [9105]: [2-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  started streaming WAL from primary at 0/F8000000
on timeline 12
2020-12-22 04:49:42 UTC [9105]: [3-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  replication terminated by primary server
2020-12-22 04:49:42 UTC [9105]: [4-1] user=; db=; app=; client=;
SQLcode=00000 DETAIL:  End of WAL reached on timeline 12 at
0/F8000098.



4. Now since walreciver assumes that it has restore the WAL from the
TL 12 the recieveTLI is 12 and the expectedTLEs is set base on the
0000000C.history.
See below Logic in WaitForWalToBecomeAvailable
                        if (readFile < 0)
                        {
                            if (!expectedTLEs)
                                expectedTLEs = readTimeLineHistory(receiveTLI);


2020-12-22 04:49:42 UTC [9082]: [6-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  restored log file "0000000C.history" from archive
rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000E.history" failed:
No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous
errors) (code 23) at main.c(1179) [sender=3.1.2]
rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000C00000000000000F8"
failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous
errors) (code 23) at main.c(1179) [sender=3.1.2]
2020-12-22 04:49:42 UTC [9082]: [7-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  restored log file "0000000C.history" from archive
2020-12-22 04:49:42 UTC [9082]: [8-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  consistent recovery state reached at 0/F8000098
2020-12-22 04:49:42 UTC [9082]: [9-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  invalid record length at 0/F8000098: wanted 24,
got 0
2020-12-22 04:49:42 UTC [9074]: [3-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  database system is ready to accept read only
connections
2020-12-22 04:49:42 UTC [9105]: [5-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  restarted WAL streaming at 0/F8000000 on timeline
12
2020-12-22 04:49:42 UTC [9105]: [6-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  replication terminated by primary server
2020-12-22 04:49:42 UTC [9105]: [7-1] user=; db=; app=; client=;
SQLcode=00000 DETAIL:  End of WAL reached on timeline 12 at
0/F8000098.


4. Now, expectedTLEs head is as 12 and recoveryTargetTLI is 13 so in
rescanLatestTimeLine we always assume we are at the latest Ali but we
try to archive from expectedTLEs which is old TLI.

rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000E.history" failed:
No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous
errors) (code 23) at main.c(1179) [sender=3.1.2]
rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000C00000000000000F8"
failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous
errors) (code 23) at main.c(1179) [sender=3.1.2]
2020-12-22 04:49:47 UTC [9105]: [8-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  restarted WAL streaming at 0/F8000000 on timeline
12
2020-12-22 04:49:47 UTC [9105]: [9-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  replication terminated by primary server
2020-12-22 04:49:47 UTC [9105]: [10-1] user=; db=; app=; client=;
SQLcode=00000 DETAIL:  End of WAL reached on timeline 12 at
0/F8000098.
rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000E.history" failed:
No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous
errors) (code 23) at main.c(1179) [sender=3.1.2]
rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000C00000000000000F8"
failed: No such file or directory (2)

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: Race condition in recovery?

Robert Haas
On Thu, Jan 21, 2021 at 4:00 AM Dilip Kumar <[hidden email]> wrote:

> 8. Node3, get it because walsender of Node2 read it from TL13 and send
> it and Node2 write in the new WAL file but with TL12.
>
> WalSndSegmentOpen()
> {
> /*-------
> * When reading from a historic timeline, and there is a timeline switch
> * within this segment, read from the WAL segment belonging to the new
> * timeline.
> }
>
> 9. Node3, now set the expectedTLEs to 12 because that is what
> walreceiver has streamed the WAL using.

This seems to be incorrect, because the comment for expectedTLEs says:

 * expectedTLEs: a list of TimeLineHistoryEntries for
recoveryTargetTLI and the timelines of
 * its known parents, newest first (so recoveryTargetTLI is always the
 * first list member).  Only these TLIs are expected to be seen in the WAL
 * segments we read, and indeed only these TLIs will be considered as
 * candidate WAL files to open at all.

But in your scenario apparently we end up with a situation that
contradicts that, because you go on to say:

> 10. Node3, now recoveryTargetTLI is 13 and expectedTLEs is 12.  So

As I understand, expectedTLEs should end up being a list where the
first element is the timeline we want to end up on, and the last
element is the timeline where we are now, and every timeline in the
list branches off of the next timeline in the list. So here if 13
branches of 12 then the list should be 13,12 not just 12; and if 13
does not branch off of 12 OR if 13 branches off of 12 at an earlier
point in the WAL stream than where we are now, then that should be an
error that shuts down the standby, because then there is no way for
replay to get from where it is now to the correct timeline.

--
Robert Haas
EDB: http://www.enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: Race condition in recovery?

Dilip Kumar-2
On Fri, Jan 22, 2021 at 2:05 AM Robert Haas <[hidden email]> wrote:

>
> On Thu, Jan 21, 2021 at 4:00 AM Dilip Kumar <[hidden email]> wrote:
> > 8. Node3, get it because walsender of Node2 read it from TL13 and send
> > it and Node2 write in the new WAL file but with TL12.
> >
> > WalSndSegmentOpen()
> > {
> > /*-------
> > * When reading from a historic timeline, and there is a timeline switch
> > * within this segment, read from the WAL segment belonging to the new
> > * timeline.
> > }
> >
> > 9. Node3, now set the expectedTLEs to 12 because that is what
> > walreceiver has streamed the WAL using.
>
> This seems to be incorrect, because the comment for expectedTLEs says:
>
>  * expectedTLEs: a list of TimeLineHistoryEntries for
> recoveryTargetTLI and the timelines of
>  * its known parents, newest first (so recoveryTargetTLI is always the
>  * first list member).  Only these TLIs are expected to be seen in the WAL
>  * segments we read, and indeed only these TLIs will be considered as
>  * candidate WAL files to open at all.
>
> But in your scenario apparently we end up with a situation that
> contradicts that, because you go on to say:
>
> > 10. Node3, now recoveryTargetTLI is 13 and expectedTLEs is 12.  So
>
> As I understand, expectedTLEs should end up being a list where the
> first element is the timeline we want to end up on, and the last
> element is the timeline where we are now, and every timeline in the
> list branches off of the next timeline in the list. So here if 13
> branches of 12 then the list should be 13,12 not just 12; and if 13
> does not branch off of 12 OR if 13 branches off of 12 at an earlier
> point in the WAL stream than where we are now, then that should be an
> error that shuts down the standby, because then there is no way for
> replay to get from where it is now to the correct timeline.

Yeah, I agree with this.  So IMHO the expectedTLEs should be set based
on the recoveryTargetTLI instead of receiveTLI.  Based on the
expectedTLEs definition it can never be correct to set it based on the
receiveTLI.

Basically, the simple fix could be this.

diff --git a/src/backend/access/transam/xlog.c
b/src/backend/access/transam/xlog.c
index b18257c198..465bc7c929 100644

--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12533,7 +12533,8 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr,
bool randAccess,
                                                if (readFile < 0)
                                                {
                                                        if (!expectedTLEs)
-
expectedTLEs = readTimeLineHistory(receiveTLI);
+
expectedTLEs = readTimeLineHistory(recoveryTargetTLI);
+
                                                       readFile =
XLogFileRead(readSegNo, PANIC,

                                 receiveTLI,

                                 XLOG_FROM_STREAM, false);

But I am afraid that whether this adjustment (setting based on
receiveTLI) is done based on some analysis or part of some bug fix.  I
will try to find the history of this and maybe based on that we can
make a better decision.


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com