Incorrect snapshots while promoting hot standby node when 2PC is used


Incorrect snapshots while promoting hot standby node when 2PC is used

Andres Freund
Hi,

Michael Paquier (running locally I think), and subsequently Thomas Munro
(noticing [1]), privately reported that they noticed an assertion failure in
GetSnapshotData(). Both reasonably were wondering if that's related to the
snapshot scalability patches.

Michael reported the following assertion failure in 023_pitr_prepared_xact.pl:
> TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", File: "procarray.c", Line: 2468, PID: 22901)

> The failure was triggered by one of the new TAP tests,
> 023_pitr_prepared_xact.pl, after recovering a 2PC transaction that
> used a transaction ID that matches with RecentXmin:
> (gdb) p RecentXmin
> $1 = 588
> (gdb) p TransactionXmin
> $2 = 589

I tried for a while to reproduce that, but couldn't. Adding a bunch of
debugging output and increasing the log level shows the problem pretty clearly
however, just not tripping any asserts:

2021-04-21 17:55:54.287 PDT [1829098] [unknown] LOG:  setting xmin to 588
...
2021-04-21 17:55:54.377 PDT [1829049] DEBUG:  removing all KnownAssignedXids
2021-04-21 17:55:54.377 PDT [1829049] DEBUG:  release all standby locks
...
2021-04-21 17:55:54.396 PDT [1829100] [unknown] LOG:  setting xmin to 589
...
2021-04-21 17:55:55.379 PDT [1829048] LOG:  database system is ready to accept connections
...
2021-04-21 17:55:55.380 PDT [1829120] LOG:  setting xmin to 588
...
2021-04-21 17:55:55.386 PDT [1829126] [unknown] LOG:  setting xmin to 588
2021-04-21 17:55:55.387 PDT [1829126] 023_pitr_prepared_xact.pl LOG:  statement: COMMIT PREPARED 'fooinsert';
...
2021-04-21 17:55:55.428 PDT [1829128] [unknown] LOG:  setting xmin to 589

So there's clear proof of xmin going from 588 to 589 and back again.


After looking some more, the bug isn't even that subtle. And it's definitely not
new - it has likely existed since the introduction of hot standby.

The sequence in StartupXLOG() leading to the issue is the following:

1) RecoverPreparedTransactions();
2) ShutdownRecoveryTransactionEnvironment();
3) XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;

Because 2) resets the KnownAssignedXids machinery, snapshots that happen
between 2) and 3) will not actually look at the procarray to compute
snapshots, as that only happens within

        snapshot->takenDuringRecovery = RecoveryInProgress();
        if (!snapshot->takenDuringRecovery)

and RecoveryInProgress() checks XLogCtl->SharedRecoveryState !=
RECOVERY_STATE_DONE, which is set in 3).

So snapshots within that window will always be "empty", i.e. xmin is
latestCompletedXid and xmax is latestCompletedXid + 1. Once we reach 3), we'll
look at the procarray, which then leads to xmin going back to 588.


I think that this can lead to data corruption, because a too new xmin horizon
could lead to rows from a prepared transaction getting hint bitted as dead (or
perhaps even pruned, although that's probably harder to hit). Due to the too
new xmin horizon we won't treat rows by the prepared xact as in-progress, and
TransactionIdDidCommit() will return false, as the transaction didn't commit
yet. Which afaict can result in row versions created by the prepared
transaction being invisible even after the prepared transaction commits.

Without prepared transactions there probably isn't an issue, because there
shouldn't be any other in-progress xids at that point?


I think to fix the issue we'd have to move
ShutdownRecoveryTransactionEnvironment() to after XLogCtl->SharedRecoveryState
= RECOVERY_STATE_DONE.

The acquisition of ProcArrayLock in
ShutdownRecoveryTransactionEnvironment()->ExpireAllKnownAssignedTransactionIds()
should prevent the data from being removed between the RecoveryInProgress()
and the KnownAssignedXidsGetAndSetXmin() calls in GetSnapshotData().

I haven't yet figured out whether there would be a problem with deferring the
other tasks in ShutdownRecoveryTransactionEnvironment() until after
RECOVERY_STATE_DONE.


I think we ought to introduce assertions that have a higher chance to
catch cases like this. The window to hit the new assertion that caused
Michael to hit this is pretty darn small (xmin needs to move backwards
between two snapshot computations inside a single transaction).  I
*think* we can safely assert that xmin doesn't move backwards globally,
if we store it as a 64bit xid, and don't perform that check in
walsender?

Greetings,

Andres Freund

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-04-20%2003%3A04%3A04



Re: Incorrect snapshots while promoting hot standby node when 2PC is used

Andrey Borodin
Hi Andres!

> On 23 Apr 2021, at 01:36, Andres Freund <[hidden email]> wrote:
>
> So snapshots within that window will always be "empty", i.e. xmin is
> latestCompletedXid and xmax is latestCompletedXid + 1. Once we reach 3), we'll
> look at the procarray, which then leads to xmin going back to 588.
>
>
> I think that this can lead to data corruption, because a too new xmin horizon
> could lead to rows from a prepared transaction getting hint bitted as dead (or
> perhaps even pruned, although that's probably harder to hit). Due to the too
> new xmin horizon we won't treat rows by the prepared xact as in-progress, and
> TransactionIdDidCommit() will return false, as the transaction didn't commit
> yet. Which afaict can result in row versions created by the prepared
> transaction being invisible even after the prepared transaction commits.
>
> Without prepared transactions there probably isn't an issue, because there
> shouldn't be any other in-progress xids at that point?

I'm investigating a somewhat similar case.
We have an OLTP sharded installation where shards are almost always under rebalancing. Data movement is implemented with 2PC.
Switchover happens quite often due to datacenter drills. The installation is running PostgreSQL 12.6.

In January, a heapcheck of a backup reported some suspicious problems:
ERROR:  Page marked as all-frozen, but found non frozen tuple. Oid(relation)=18487, blkno(page)=1470240, offnum(tuple)=1
ERROR:  Page marked as all-frozen, but found non frozen tuple. Oid(relation)=18487, blkno(page)=1470241, offnum(tuple)=1
ERROR:  Page marked as all-frozen, but found non frozen tuple. Oid(relation)=18487, blkno(page)=1470242, offnum(tuple)=1
...
and so on for ~100 pages - tuples with lp==1 were not frozen.

We froze the tuples with pg_dirty_hands and ran VACUUM (DISABLE_PAGE_SKIPPING) on the table.

At the end of March the same shard struck again with:
ERROR:  Page marked as all-frozen, but found non frozen tuple. Oid(relation)=18487, blkno(page)=1470240, offnum(tuple)=42
....
~1040 blocks (starting from the same 1470240!) had a non-frozen tuple at lp==42.
I've run
update s3.objects_65 set created = created where ctid = '(1470241,42)' returning *;

After that, heapcheck showed a VM problem:
ERROR:  XX001: Found non all-visible tuple. Oid(relation)=18487, blkno(page)=1470240, offnum(tuple)=42
LOCATION:  collect_corrupt_items, heap_check.c:186

VACUUM fixed it with a warning:
WARNING: 01000: page is not marked all-visible but visibility map bit is set in relation "objects_65" page 1470240
and failed on the next page:
ERROR:  XX001: found xmin 1650436694 from before relfrozenxid 1752174172
LOCATION:  heap_prepare_freeze_tuple, heapam.c:6172

I ran an UPDATE for all tuples in heapcheck's ctid list and a subsequent VACUUM (without page skipping). This satisfied the corruption monitoring.


Can this case be related to the problem that you described?

Or, perhaps, does it look more like a hardware problem? data_checksums are on, but a few years ago we observed SSD firmware that was losing updates but passing checksums. I'm sure we would benefit from having a separate relation fork for checksums or LSNs.


We observe similar cases 3-5 times a year. To date no data has been lost due to this, but it's somewhat annoying.
BTW I'd say that such things are an argument for back-porting pg_surgery and heapcheck to old versions.

Thanks!

Best regards, Andrey Borodin.




Re: Incorrect snapshots while promoting hot standby node when 2PC is used

Andres Freund
Hi,

On 2021-05-01 17:35:09 +0500, Andrey Borodin wrote:
> I'm investigating a somewhat similar case.
> We have an OLTP sharded installation where shards are almost always under rebalancing. Data movement is implemented with 2PC.
> Switchover happens quite often due to datacenter drills. The installation is running PostgreSQL 12.6.

If you still have the data it would be useful if you could check if the
LSNs of the corrupted pages are LSNs from shortly after standby
promotion/switchover?


> Can this case be related to the problem that you described?

It seems possible, but it's hard to know without a lot more information.


> Or, perhaps, does it look more like a hardware problem? data_checksums are
> on, but a few years ago we observed SSD firmware that was losing
> updates but passing checksums. I'm sure we would benefit from
> having a separate relation fork for checksums or LSNs.

Right - checksums are "page local". They can only detect if a page is
corrupted, not if e.g. an older version of the page (with correct
checksum) has been restored. While there are schemes to have stronger
error detection properties, they do come with substantial overhead (at
least the ones I can think of right now).


Greetings,

Andres Freund



Re: Incorrect snapshots while promoting hot standby node when 2PC is used

Andrey Borodin


> On 3 May 2021, at 23:10, Andres Freund <[hidden email]> wrote:
>
> Hi,
>
> On 2021-05-01 17:35:09 +0500, Andrey Borodin wrote:
>> I'm investigating a somewhat similar case.
>> We have an OLTP sharded installation where shards are almost always under rebalancing. Data movement is implemented with 2PC.
>> Switchover happens quite often due to datacenter drills. The installation is running PostgreSQL 12.6.
>
> If you still have the data it would be useful if you could check if the
> LSNs of the corrupted pages are LSNs from shortly after standby
> promotion/switchover?
That's a neat idea, I'll check if I can restore a backup with corruptions.
I have a test cluster with corruptions, but it has undergone tens of switchovers...

>> Or, perhaps, does it look more like a hardware problem? data_checksums are
>> on, but a few years ago we observed SSD firmware that was losing
>> updates but passing checksums. I'm sure we would benefit from
>> having a separate relation fork for checksums or LSNs.
>
> Right - checksums are "page local". They can only detect if a page is
> corrupted, not if e.g. an older version of the page (with correct
> checksum) has been restored. While there are schemes to have stronger
> error detection properties, they do come with substantial overhead (at
> least the ones I can think of right now).

We could have a PTRACK-like fork with page LSNs. It could be flushed on checkpoint and restored from WAL after a crash, so we could always detect a stale page version. Like LSN-track :) We would get much faster rewind and delta-backups for free.

Though I don't think it's worth the effort until we at least checksum the CLOG.

Thanks!

Best regards, Andrey Borodin.


Re: Incorrect snapshots while promoting hot standby node when 2PC is used

Tom Lane
In reply to this post by Andres Freund
Andres Freund <[hidden email]> writes:
> Michael Paquier (running locally I think), and subsequently Thomas Munro
> (noticing [1]), privately reported that they noticed an assertion failure in
> GetSnapshotData(). Both reasonably were wondering if that's related to the
> snapshot scalability patches.
> Michael reported the following assertion failure in 023_pitr_prepared_xact.pl:
>> TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", File: "procarray.c", Line: 2468, PID: 22901)

mantid just showed a failure that looks like the same thing; at
least it's also in 023_pitr_prepared_xact.pl:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mantid&dt=2021-05-03%2013%3A07%3A06

The assertion line number is rather different though:

TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", File: "procarray.c", Line: 2094, PID: 1163004)

and interestingly, this happened in a parallel worker:

postgres: node_pitr: parallel worker for PID 1162998 (ExceptionalCondition+0x7a)[0x946eca]
postgres: node_pitr: parallel worker for PID 1162998 (GetSnapshotData+0x897)[0x7ef327]
postgres: node_pitr: parallel worker for PID 1162998 (GetNonHistoricCatalogSnapshot+0x4a)[0x986a5a]
postgres: node_pitr: parallel worker for PID 1162998 (systable_beginscan+0x189)[0x4fffe9]
postgres: node_pitr: parallel worker for PID 1162998 [0x937336]
postgres: node_pitr: parallel worker for PID 1162998 [0x937743]
postgres: node_pitr: parallel worker for PID 1162998 (RelationIdGetRelation+0x85)[0x93f155]
postgres: node_pitr: parallel worker for PID 1162998 (relation_open+0x5c)[0x4a348c]
postgres: node_pitr: parallel worker for PID 1162998 (index_open+0x6)[0x5007a6]
postgres: node_pitr: parallel worker for PID 1162998 (systable_beginscan+0x177)[0x4fffd7]
postgres: node_pitr: parallel worker for PID 1162998 [0x937336]
postgres: node_pitr: parallel worker for PID 1162998 [0x93e1c1]
postgres: node_pitr: parallel worker for PID 1162998 (RelationIdGetRelation+0xbd)[0x93f18d]
postgres: node_pitr: parallel worker for PID 1162998 (relation_open+0x5c)[0x4a348c]
postgres: node_pitr: parallel worker for PID 1162998 (table_open+0x6)[0x532656]
postgres: node_pitr: parallel worker for PID 1162998 [0x937306]
postgres: node_pitr: parallel worker for PID 1162998 [0x93e1c1]
postgres: node_pitr: parallel worker for PID 1162998 (RelationIdGetRelation+0xbd)[0x93f18d]
postgres: node_pitr: parallel worker for PID 1162998 (relation_open+0x5c)[0x4a348c]
postgres: node_pitr: parallel worker for PID 1162998 (table_open+0x6)[0x532656]
postgres: node_pitr: parallel worker for PID 1162998 [0x92c4f1]
postgres: node_pitr: parallel worker for PID 1162998 (SearchCatCache1+0x176)[0x92e236]
postgres: node_pitr: parallel worker for PID 1162998 (TupleDescInitEntry+0xb3)[0x4a7eb3]
postgres: node_pitr: parallel worker for PID 1162998 [0x688f30]
postgres: node_pitr: parallel worker for PID 1162998 (ExecInitResultTupleSlotTL+0x1b)[0x68a8fb]
postgres: node_pitr: parallel worker for PID 1162998 (ExecInitResult+0x92)[0x6af902]
postgres: node_pitr: parallel worker for PID 1162998 (ExecInitNode+0x446)[0x684b06]
postgres: node_pitr: parallel worker for PID 1162998 (standard_ExecutorStart+0x269)[0x67d7b9]
postgres: node_pitr: parallel worker for PID 1162998 (ParallelQueryMain+0x1a3)[0x681d83]
postgres: node_pitr: parallel worker for PID 1162998 (ParallelWorkerMain+0x408)[0x53c218]
postgres: node_pitr: parallel worker for PID 1162998 (StartBackgroundWorker+0x23f)[0x774ebf]
postgres: node_pitr: parallel worker for PID 1162998 [0x780f3d]

                        regards, tom lane



Re: Incorrect snapshots while promoting hot standby node when 2PC is used

Andres Freund
Hi,

On 2021-05-04 12:32:34 -0400, Tom Lane wrote:

> Andres Freund <[hidden email]> writes:
> > Michael Paquier (running locally I think), and subsequently Thomas Munro
> > (noticing [1]), privately reported that they noticed an assertion failure in
> > GetSnapshotData(). Both reasonably were wondering if that's related to the
> > snapshot scalability patches.
> > Michael reported the following assertion failure in 023_pitr_prepared_xact.pl:
> >> TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", File: "procarray.c", Line: 2468, PID: 22901)
>
> mantid just showed a failure that looks like the same thing, at
> least it's also in 023_pitr_prepared_xact.pl:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mantid&dt=2021-05-03%2013%3A07%3A06
>
> The assertion line number is rather different though:
>
> TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", File: "procarray.c", Line: 2094, PID: 1163004)

I managed to hit that one as well, and it's also what fairywren hit - the
assertions at lines 2094 and 2468 are basically copies of the same check, and
which one is hit is a question of timing.


> and interestingly, this happened in a parallel worker:

I think the issue can be hit (or rather detected) whenever a transaction
builds one snapshot while in recovery, and a second one during
end-of-recovery.  The parallel query here is just
2021-05-03 09:18:35.602 EDT [1162987:6] DETAIL:  Failed process was running: SELECT pg_is_in_recovery() = 'f';
(parallel due to force_parallel_mode) - which of course is likely to run
during end-of-recovery.

So it does seem like the same bug of resetting the KnownAssignedXids
stuff too early.

Greetings,

Andres Freund