Online verification of checksums


Re: Online verification of checksums

Magnus Hagander-2


On Tue, Nov 10, 2020 at 5:44 AM Michael Paquier <[hidden email]> wrote:
> On Thu, Nov 05, 2020 at 10:57:16AM +0900, Michael Paquier wrote:
>> I was referring to the patch I sent on this thread that fixes the
>> detection of a corruption for the zero-only case and where pd_lsn
>> and/or pd_upper are trashed by a corruption of the page header.  Both
>> cases allow a base backup to complete on HEAD, while sending pages
>> that could be corrupted, which is wrong.  Once you make the page
>> verification rely only on pd_checksum, as the patch does because the
>> checksum is the only source of truth in the page header, corrupted
>> pages are correctly detected, causing pg_basebackup to complain as it
>> should.  However, it has also the risk to cause pg_basebackup to fail
>> *and* to report as broken pages that are in the process of being
>> written, depending on how slow a disk is able to finish an 8kB write.
>> That's a different kind of wrongness, and users have two more reasons
>> to be pissed.  Note that if a page is found as torn we have a
>> consistent page header, meaning that on HEAD the PageIsNew() and
>> PageGetLSN() would pass, but the checksum verification would fail as
>> the contents at the end of the page do not match the checksum.
>
> Magnus, as the original committer of 4eb77d5, do you have an opinion
> to share?

I admit that I at some point lost track of the overlapping threads around this, and just figured there was enough different checksum-involved-people on those threads to handle it :) Meaning the short answer is "no, I don't really have one at this point".

Slightly longer comment is that it does seem reasonable, but I have not read in on all the different issues discussed over the whole thread, so take that as a weak-certainty comment.

--

Re: Online verification of checksums

Michael Paquier-2
On Sun, Nov 15, 2020 at 04:37:36PM +0100, Magnus Hagander wrote:

> On Tue, Nov 10, 2020 at 5:44 AM Michael Paquier <[hidden email]> wrote:
>> On Thu, Nov 05, 2020 at 10:57:16AM +0900, Michael Paquier wrote:
>>> I was referring to the patch I sent on this thread that fixes the
>>> detection of a corruption for the zero-only case and where pd_lsn
>>> and/or pd_upper are trashed by a corruption of the page header.  Both
>>> cases allow a base backup to complete on HEAD, while sending pages
>>> that could be corrupted, which is wrong.  Once you make the page
>>> verification rely only on pd_checksum, as the patch does because the
>>> checksum is the only source of truth in the page header, corrupted
>>> pages are correctly detected, causing pg_basebackup to complain as it
>>> should.  However, it has also the risk to cause pg_basebackup to fail
>>> *and* to report as broken pages that are in the process of being
>>> written, depending on how slow a disk is able to finish an 8kB write.
>>> That's a different kind of wrongness, and users have two more reasons
>>> to be pissed.  Note that if a page is found as torn we have a
>>> consistent page header, meaning that on HEAD the PageIsNew() and
>>> PageGetLSN() would pass, but the checksum verification would fail as
>>> the contents at the end of the page do not match the checksum.
>>
>> Magnus, as the original committer of 4eb77d5, do you have an opinion
>> to share?
>>
>
> I admit that I at some point lost track of the overlapping threads around
> this, and just figured there was enough different checksum-involved-people
> on those threads to handle it :) Meaning the short answer is "no, I don't
> really have one at this point".
>
> Slightly longer comment is that it does seem reasonable, but I have not
> read in on all the different issues discussed over the whole thread, so
> take that as a weak-certainty comment.
Which part are you considering as reasonable?  The removal-feature
part on a stable branch or perhaps something else?
--
Michael


Re: Online verification of checksums

Magnus Hagander-2


On Mon, Nov 16, 2020 at 1:23 AM Michael Paquier <[hidden email]> wrote:
> On Sun, Nov 15, 2020 at 04:37:36PM +0100, Magnus Hagander wrote:
>> On Tue, Nov 10, 2020 at 5:44 AM Michael Paquier <[hidden email]> wrote:
>>> On Thu, Nov 05, 2020 at 10:57:16AM +0900, Michael Paquier wrote:
>>>> I was referring to the patch I sent on this thread that fixes the
>>>> detection of a corruption for the zero-only case and where pd_lsn
>>>> and/or pd_upper are trashed by a corruption of the page header.  Both
>>>> cases allow a base backup to complete on HEAD, while sending pages
>>>> that could be corrupted, which is wrong.  Once you make the page
>>>> verification rely only on pd_checksum, as the patch does because the
>>>> checksum is the only source of truth in the page header, corrupted
>>>> pages are correctly detected, causing pg_basebackup to complain as it
>>>> should.  However, it has also the risk to cause pg_basebackup to fail
>>>> *and* to report as broken pages that are in the process of being
>>>> written, depending on how slow a disk is able to finish an 8kB write.
>>>> That's a different kind of wrongness, and users have two more reasons
>>>> to be pissed.  Note that if a page is found as torn we have a
>>>> consistent page header, meaning that on HEAD the PageIsNew() and
>>>> PageGetLSN() would pass, but the checksum verification would fail as
>>>> the contents at the end of the page do not match the checksum.
>>>
>>> Magnus, as the original committer of 4eb77d5, do you have an opinion
>>> to share?
>>
>> I admit that I at some point lost track of the overlapping threads around
>> this, and just figured there was enough different checksum-involved-people
>> on those threads to handle it :) Meaning the short answer is "no, I don't
>> really have one at this point".
>>
>> Slightly longer comment is that it does seem reasonable, but I have not
>> read in on all the different issues discussed over the whole thread, so
>> take that as a weak-certainty comment.
>
> Which part are you considering as reasonable?  The removal-feature
> part on a stable branch or perhaps something else?

I was referring to the latest patch on the thread. But as I said, I have not read up on all the different issues raised in the thread, so take it with a big grain of salt.

And I would also echo the previous comment that this code was adapted from what the pgbackrest folks do. As such, it would be good to get a comment from for example David on that -- I don't see any of them having commented after that was mentioned?

--

Re: Online verification of checksums

Michael Paquier-2
On Mon, Nov 16, 2020 at 11:41:51AM +0100, Magnus Hagander wrote:
> I was referring to the latest patch on the thread. But as I said, I have
> not read up on all the different issues raised in the thread, so take it
> with a big grain of salt.
>
> And I would also echo the previous comment that this code was adapted from
> what the pgbackrest folks do. As such, it would be good to get a comment
> from for example David on that -- I don't see any of them having commented
> after that was mentioned?

Agreed.  I am adding Stephen as well in CC.  From the code of
backrest, the same logic happens in src/command/backup/pageChecksum.c
(see pageChecksumProcess), where two checks on pd_upper and pd_lsn
happen before verifying the checksum.  So, if the page header finishes
with random junk because of some kind of corruption, even corrupted
pages would be incorrectly considered as correct if the random data
passes the pd_upper and pd_lsn checks :/
--
Michael
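
For illustration, the ordering Michael describes boils down to something like the sketch below -- plain C with stand-in types and invented helper names, not the actual pgBackRest or basebackup code. Header fields that happen to contain plausible junk short-circuit verification before the checksum is ever compared:

#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the relevant PageHeaderData fields (illustrative only). */
typedef struct
{
    uint64_t pd_lsn;        /* LSN of the last change to the page */
    uint16_t pd_checksum;   /* checksum of the whole page */
    uint16_t pd_upper;      /* end of free space; 0 is treated as "new" */
} SketchPageHeader;

/*
 * The ordering under discussion: header fields are consulted *before*
 * the checksum, so random corruption that leaves pd_upper == 0, or
 * leaves pd_lsn at or past the backup start LSN, is waved through and
 * pd_checksum is never compared at all.
 */
static bool
page_passes_sketch_checks(const SketchPageHeader *hdr,
                          uint64_t backup_start_lsn,
                          uint16_t computed_checksum)
{
    if (hdr->pd_upper == 0)
        return true;        /* looks like a new page: skipped */

    if (hdr->pd_lsn >= backup_start_lsn)
        return true;        /* "written during the backup": skipped */

    /* Only now is the checksum actually compared. */
    return hdr->pd_checksum == computed_checksum;
}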


Re: Online verification of checksums

David Steele
Hi Michael,

On 11/20/20 2:28 AM, Michael Paquier wrote:

> On Mon, Nov 16, 2020 at 11:41:51AM +0100, Magnus Hagander wrote:
>> I was referring to the latest patch on the thread. But as I said, I have
>> not read up on all the different issues raised in the thread, so take it
>> with a big grain of salt.
>>
>> And I would also echo the previous comment that this code was adapted from
>> what the pgbackrest folks do. As such, it would be good to get a comment
>> from for example David on that -- I don't see any of them having commented
>> after that was mentioned?
>
> Agreed.  I am adding Stephen as well in CC.  From the code of
> backrest, the same logic happens in src/command/backup/pageChecksum.c
> (see pageChecksumProcess), where two checks on pd_upper and pd_lsn
> happen before verifying the checksum.  So, if the page header finishes
> with random junk because of some kind of corruption, even corrupted
> pages would be incorrectly considered as correct if the random data
> passes the pd_upper and pd_lsn checks :/

Indeed, this is not good, as Andres pointed out some time ago. My
apologies for not getting to this sooner.

Our current plan for pgBackRest:

1) Remove the LSN check as you have done in your patch and when
rechecking see if the page has become valid *or* the LSN is ascending.
2) Check the LSN against the max LSN reported by PostgreSQL to make sure
it is valid.

These do completely rule out any type of corruption, but they certainly
narrows the possibility by a lot.

In the future we would also like to scan the WAL to verify that the page
is definitely being written to.

As for your patch, it mostly looks good but my objection is that a page
may be reported as invalid after 5 retries when in fact it may just be
very hot.

Maybe checking for an ascending LSN is a good idea there as well? At
least in that case we could issue a different warning, instead of
"checksum verification failed" perhaps "checksum verification skipped
due to concurrent modifications".

Regards,
--
-David
[hidden email]
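
The reporting split David suggests could look roughly like this -- a sketch with a hypothetical function, where the message wording follows his suggestion:

#include <stdbool.h>
#include <stdio.h>

/*
 * Sketch only: a page that keeps failing verification but shows an
 * ascending LSN is reported as skipped, not as corrupt.
 */
static void
report_page(unsigned int blkno, bool lsn_ascending)
{
    if (lsn_ascending)
        fprintf(stderr,
                "checksum verification skipped in block %u due to concurrent modifications\n",
                blkno);
    else
        fprintf(stderr,
                "checksum verification failed in block %u\n",
                blkno);
}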



Re: Online verification of checksums

Stephen Frost
Greetings,

* David Steele ([hidden email]) wrote:

> On 11/20/20 2:28 AM, Michael Paquier wrote:
> >On Mon, Nov 16, 2020 at 11:41:51AM +0100, Magnus Hagander wrote:
> >>I was referring to the latest patch on the thread. But as I said, I have
> >>not read up on all the different issues raised in the thread, so take it
> >>with a big grain of salt.
> >>
> >>And I would also echo the previous comment that this code was adapted from
> >>what the pgbackrest folks do. As such, it would be good to get a comment
> >>from for example David on that -- I don't see any of them having commented
> >>after that was mentioned?
> >
> >Agreed.  I am adding Stephen as well in CC.  From the code of
> >backrest, the same logic happens in src/command/backup/pageChecksum.c
> >(see pageChecksumProcess), where two checks on pd_upper and pd_lsn
> >happen before verifying the checksum.  So, if the page header finishes
> >with random junk because of some kind of corruption, even corrupted
> >pages would be incorrectly considered as correct if the random data
> >passes the pd_upper and pd_lsn checks :/
>
> Indeed, this is not good, as Andres pointed out some time ago. My apologies
> for not getting to this sooner.
Yeah, it's been on our backlog to improve this.

> Our current plan for pgBackRest:
>
> 1) Remove the LSN check as you have done in your patch and when rechecking
> see if the page has become valid *or* the LSN is ascending.
> 2) Check the LSN against the max LSN reported by PostgreSQL to make sure it
> is valid.

Yup, that's my recollection also as to our plans for how to improve
things here.

> These do completely rule out any type of corruption, but they certainly
> narrows the possibility by a lot.

*don't :)

> In the future we would also like to scan the WAL to verify that the page is
> definitely being written to.

Yeah, that'd certainly be nice to do too.

> As for your patch, it mostly looks good but my objection is that a page may
> be reported as invalid after 5 retries when in fact it may just be very hot.

Yeah.. while unlikely that it'd actually get written out that much, it
does seem at least possible.

> Maybe checking for an ascending LSN is a good idea there as well? At least
> in that case we could issue a different warning, instead of "checksum
> verification failed" perhaps "checksum verification skipped due to
> concurrent modifications".

+1.

Thanks,

Stephen


Re: Online verification of checksums

Michael Paquier-2
On Fri, Nov 20, 2020 at 11:08:27AM -0500, Stephen Frost wrote:

> David Steele ([hidden email]) wrote:
>> Our current plan for pgBackRest:
>>
>> 1) Remove the LSN check as you have done in your patch and when rechecking
>> see if the page has become valid *or* the LSN is ascending.
>> 2) Check the LSN against the max LSN reported by PostgreSQL to make sure it
>> is valid.
>
> Yup, that's my recollection also as to our plans for how to improve
> things here.
>
>> These do completely rule out any type of corruption, but they certainly
>> narrows the possibility by a lot.
>
> *don't :)
Have you considered the possibility of only using pd_checksum for the
validation?  This is the only source of truth in the page header we
can rely on to validate the full contents of the page, so if the logic
relies on anything but the checksum then you expose the logic to risks
of reporting pages as corrupted while they were just torn, or just
miss corrupted pages, which is what we should avoid for such things.
Both are bad.

>> As for your patch, it mostly looks good but my objection is that a page may
>> be reported as invalid after 5 retries when in fact it may just be very hot.
>
> Yeah.. while unlikely that it'd actually get written out that much, it
> does seem at least possible.
>
>> Maybe checking for an ascending LSN is a good idea there as well? At least
>> in that case we could issue a different warning, instead of "checksum
>> verification failed" perhaps "checksum verification skipped due to
>> concurrent modifications".
>
> +1.
I don't quite understand how you can make sure that the page is not
corrupted here?  It could be possible that the last 4kB of an 8kB page
got corrupted, where the header had valid data but failing the
checksum verification.  So if you are not careful you could have at
hand a corrupted page discarded because it failed the retry
multiple times in a row.  The only method I can think of as being really
reliable is based on two facts:
- Do a check only on pd_checksum, as that validates the full contents
of the page.
- When doing a retry, make sure that there is no concurrent I/O
activity in the shared buffers.  This requires an API we don't have
yet.
--
Michael


Re: Online verification of checksums

Anastasia Lubennikova
On 21.11.2020 04:30, Michael Paquier wrote:
> The only method I can think of as being really
> reliable is based on two facts:
> - Do a check only on pd_checksum, as that validates the full contents
> of the page.
> - When doing a retry, make sure that there is no concurrent I/O
> activity in the shared buffers.  This requires an API we don't have
> yet.

It seems reasonable to me to rely on checksums only.

As for retry, I think that API for concurrent I/O will be complicated.
Instead, we can introduce a function to read the page directly from
shared buffers after PAGE_RETRY_THRESHOLD attempts. It looks like a
bullet-proof solution to me. Do you see any possible problems with it?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Online verification of checksums

Stephen Frost
In reply to this post by Michael Paquier-2
Greetings,

* Michael Paquier ([hidden email]) wrote:

> On Fri, Nov 20, 2020 at 11:08:27AM -0500, Stephen Frost wrote:
> > David Steele ([hidden email]) wrote:
> >> Our current plan for pgBackRest:
> >>
> >> 1) Remove the LSN check as you have done in your patch and when rechecking
> >> see if the page has become valid *or* the LSN is ascending.
> >> 2) Check the LSN against the max LSN reported by PostgreSQL to make sure it
> >> is valid.
> >
> > Yup, that's my recollection also as to our plans for how to improve
> > things here.
> >
> >> These do completely rule out any type of corruption, but they certainly
> >> narrows the possibility by a lot.
> >
> > *don't :)
>
> Have you considered the possibility of only using pd_checksum for the
> validation?  This is the only source of truth in the page header we
> can rely on to validate the full contents of the page, so if the logic
> relies on anything but the checksum then you expose the logic to risks
> of reporting pages as corrupted while they were just torn, or just
> miss corrupted pages, which is what we should avoid for such things.
> Both are bad.
There's no doubt that you'll get checksum failures from time to time,
and that it's an entirely valid case if the page is being concurrently
written, so we have to decide if we should be reporting those failures,
retrying, or what.

It's not at all clear what you're suggesting here as to how you can use
'only' the checksum.

> >> As for your patch, it mostly looks good but my objection is that a page may
> >> be reported as invalid after 5 retries when in fact it may just be very hot.
> >
> > Yeah.. while unlikely that it'd actually get written out that much, it
> > does seem at least possible.
> >
> >> Maybe checking for an ascending LSN is a good idea there as well? At least
> >> in that case we could issue a different warning, instead of "checksum
> >> verification failed" perhaps "checksum verification skipped due to
> >> concurrent modifications".
> >
> > +1.
>
> I don't quite understand how you can make sure that the page is not
> corrupted here?  It could be possible that the last 4kB of an 8kB page
> got corrupted, where the header had valid data but failing the
> checksum verification.
Not sure that the proposed approach was really understood here.
Specifically what we're talking about is:

- read(), save the LSN seen
- calculate the checksum - get a failure
- re-read(), compare LSN to prior LSN, maybe also re-check checksum

If checksum fails again AND the LSN has changed and increased (and
perhaps otherwise seems reasonable) then we have at least a bit more
confidence that the failing checksum is due to the page being rewritten
concurrently and not due to latent storage corruption, which is the
specific distinction that we're trying to discern here.
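
As a sketch of that sequence (hypothetical helper names; not code from the patch under discussion):

#include <stdint.h>

/* Hypothetical helpers -- assumed for the sketch, not real APIs. */
extern void     read_block(int fd, uint32_t blkno, char *buf);
extern uint64_t page_lsn(const char *buf);
extern int      checksum_matches(const char *buf, uint32_t blkno);

typedef enum { PAGE_OK, PAGE_IN_FLIGHT, PAGE_CORRUPT } PageVerdict;

/*
 * Read, verify, and on failure re-read: a second failure combined with
 * a strictly increasing LSN is taken as evidence of a concurrent write
 * (the page will be rewritten by WAL replay), not of corruption.
 */
static PageVerdict
verify_with_reread(int fd, uint32_t blkno, char *buf)
{
    uint64_t first_lsn;

    read_block(fd, blkno, buf);
    first_lsn = page_lsn(buf);
    if (checksum_matches(buf, blkno))
        return PAGE_OK;

    read_block(fd, blkno, buf);         /* re-read the same block */
    if (checksum_matches(buf, blkno))
        return PAGE_OK;

    if (page_lsn(buf) > first_lsn)
        return PAGE_IN_FLIGHT;          /* rewritten underneath us */

    return PAGE_CORRUPT;
}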

> So if you are not careful you could have at
> hand a corrupted page discarded because it failed the retry
> multiple times in a row.

The point of checking for an ascending LSN is to see if the page is
being concurrently modified.  If it is, then we actually don't care if
the page is corrupted because it's going to be rewritten during WAL
replay as part of the restore process.

> The only method I can think of as being really
> reliable is based on two facts:
> - Do a check only on pd_checksum, as that validates the full contents
> of the page.
> - When doing a retry, make sure that there is no concurrent I/O
> activity in the shared buffers.  This requires an API we don't have
> yet.

I don't think we actually want the backup process to start locking
pages, which it seems like is what you're suggesting here..?  Trying to
do a check without a lock and without having PG end up reading the page
back in if it had been evicted due to pressure seems likely to be hard
to do reliably and without race conditions complicating things.

The other 100% reliable approach, as David discussed before, is to be
scanning the WAL at the same time and to ignore any checksum failures
for pages that we know are in the WAL with FPIs.  Unfortunately, reading
WAL for all different versions of PG is a fair bit of work and we
haven't quite gotten to biting that off yet (though it's on the
roadmap), and the core code certainly doesn't help us in that regard
since any given version only supports the current major version WAL (an
issue pg_basebackup would also have to deal with, were it to be
modified to use such an approach and to continue working with older
versions of PG..).  In a similar vein to what we do (in pgbackrest) with
pg_control, we expect to develop our own library basically vendorizing
WAL reading code from all the major versions of PG which we support in
order to track FPIs, restore points, all the kinds of potential recovery
targets, and other useful information.

Thanks,

Stephen


Re: Online verification of checksums

Stephen Frost
In reply to this post by Anastasia Lubennikova
Greetings,

* Anastasia Lubennikova ([hidden email]) wrote:

> On 21.11.2020 04:30, Michael Paquier wrote:
> >The only method I can think of as being really
> >reliable is based on two facts:
> >- Do a check only on pd_checksum, as that validates the full contents
> >of the page.
> >- When doing a retry, make sure that there is no concurrent I/O
> >activity in the shared buffers.  This requires an API we don't have
> >yet.
>
> It seems reasonable to me to rely on checksums only.
>
> As for retry, I think that API for concurrent I/O will be complicated.
> Instead, we can introduce a function to read the page directly from shared
> buffers after PAGE_RETRY_THRESHOLD attempts. It looks like a bullet-proof
> solution to me. Do you see any possible problems with it?
We might end up reading pages back in that have been evicted, for one
thing, which doesn't seem great, and this also seems likely to be
awkward for cases which aren't using the replication protocol, unless
every process maintains a connection to PG the entire time, which also
doesn't seem great.

Also- what is the point of reading the page from shared buffers
anyway..?  All we need to do is prove that the page will be rewritten
during WAL replay.  If we can prove that, we don't actually care what
the contents of the page are.  We certainly can't calculate the
checksum on a page we plucked out of shared buffers since we only
calculate the checksum when we go to write the page out.

Thanks,

Stephen


Re: Online verification of checksums

Anastasia Lubennikova
On 23.11.2020 18:35, Stephen Frost wrote:
> Greetings,
>
> * Anastasia Lubennikova ([hidden email]) wrote:
>> On 21.11.2020 04:30, Michael Paquier wrote:
>>> The only method I can think of as being really
>>> reliable is based on two facts:
>>> - Do a check only on pd_checksum, as that validates the full contents
>>> of the page.
>>> - When doing a retry, make sure that there is no concurrent I/O
>>> activity in the shared buffers.  This requires an API we don't have
>>> yet.
>> It seems reasonable to me to rely on checksums only.
>>
>> As for retry, I think that API for concurrent I/O will be complicated.
>> Instead, we can introduce a function to read the page directly from shared
>> buffers after PAGE_RETRY_THRESHOLD attempts. It looks like a bullet-proof
>> solution to me. Do you see any possible problems with it?
> We might end up reading pages back in that have been evicted, for one
> thing, which doesn't seem great,

TBH, I think it is highly unlikely that the page that was just updated
will be evicted.

> and this also seems likely to be
> awkward for cases which aren't using the replication protocol, unless
> every process maintains a connection to PG the entire time, which also
> doesn't seem great.

Have I missed something? Now pg_basebackup has only one process + one
child process for streaming. Anyway, I totally agree with your argument.
The need to maintain connection(s) to PG is the most unpleasant part of
the proposed approach.

> Also- what is the point of reading the page from shared buffers
> anyway..?

Well... Reading a page from shared buffers is a reliable way to get a
correct page from postgres under any concurrent load. So it just seems
natural to me.

> All we need to do is prove that the page will be rewritten
> during WAL replay.

Yes and this is a tricky part. Until you have explained it in your
latest message, I wasn't sure how we can distinguish concurrent update
from a page header corruption. Now I agree that if page LSN updated and
increased between rereads, it is safe enough to conclude that we have
some concurrent load.

> If we can prove that, we don't actually care what
> the contents of the page are.  We certainly can't calculate the
> checksum on a page we plucked out of shared buffers since we only
> calculate the checksum when we go to write the page out.

Good point. I was thinking that we can recalculate checksum. Or even
save a page without it, as we have checked LSN and know for sure that
it will be rewritten by WAL replay.


To sum up, I agree with your proposal to reread the page and rely on
ascending LSNs. Can you submit a patch?
You can write it on top of the latest attachment in this thread:
v8-master-0001-Fix-page-verifications-in-base-backups.patch from this
message https://www.postgresql.org/message-id/20201030023028.GC1693@...

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Online verification of checksums

Stephen Frost
Greetings,

* Anastasia Lubennikova ([hidden email]) wrote:

> On 23.11.2020 18:35, Stephen Frost wrote:
> >* Anastasia Lubennikova ([hidden email]) wrote:
> >>On 21.11.2020 04:30, Michael Paquier wrote:
> >>>The only method I can think of as being really
> >>>reliable is based on two facts:
> >>>- Do a check only on pd_checksum, as that validates the full contents
> >>>of the page.
> >>>- When doing a retry, make sure that there is no concurrent I/O
> >>>activity in the shared buffers.  This requires an API we don't have
> >>>yet.
> >>It seems reasonable to me to rely on checksums only.
> >>
> >>As for retry, I think that API for concurrent I/O will be complicated.
> >>Instead, we can introduce a function to read the page directly from shared
> >>buffers after PAGE_RETRY_THRESHOLD attempts. It looks like a bullet-proof
> >>solution to me. Do you see any possible problems with it?
> >We might end up reading pages back in that have been evicted, for one
> >thing, which doesn't seem great,
> TBH, I think it is highly unlikely that the page that was just updated will
> be evicted.
Is it though..?  Consider that the page which was being written out was
being written specifically to free a page for use by another backend-
while perhaps that doesn't happen all the time, it certainly happens
enough on very busy systems.

> >and this also seems likely to be
> >awkward for cases which aren't using the replication protocol, unless
> >every process maintains a connection to PG the entire time, which also
> >doesn't seem great.
> Have I missed something? Now pg_basebackup has only one process + one child
> process for streaming. Anyway, I totally agree with your argument. The need
> to maintain connection(s) to PG is the most unpleasant part of the proposed
> approach.

I was thinking beyond pg_basebackup, yes; apologies for that not being
clear but that's what I was meaning when I said "aren't using the
replication protocol".

> >Also- what is the point of reading the page from shared buffers
> >anyway..?
> Well... Reading a page from shared buffers is a reliable way to get a
> correct page from postgres under any concurrent load. So it just seems
> natural to me.

Yes, that's true, but if a dirty page was just written out by a backend
in order to be able to evict it, so that the backend can then pull in a
new page, then having pg_basebackup pull that page back in really isn't
great.

> >All we need to do is prove that the page will be rewritten
> >during WAL replay.
> Yes and this is a tricky part. Until you have explained it in your latest
> message, I wasn't sure how we can distinguish concurrent update from a page
> header corruption. Now I agree that if page LSN updated and increased
> between rereads, it is safe enough to conclude that we have some concurrent
> load.

Even in this case, it's almost free to compare the LSN to the starting
backup LSN, and to the current LSN position, and make sure it's
somewhere between the two.  While that doesn't entirely eliminate the
possibility that the page happened to get corrupted *and* return a
different result on subsequent reads *and* that it was corrupted in such
a way that the LSN ended up falling between the starting backup LSN and
the current LSN, it's certainly reducing the chances of a false negative
a fair bit.
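
That bounds check is cheap to state explicitly -- a sketch under the same assumptions, with all names hypothetical:

#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch only: an LSN picked up on a re-read is accepted as a plausible
 * concurrent write only if it falls between the backup's starting LSN
 * and the current LSN reported by the server.  Anything outside that
 * window cannot be a legitimate write made during the backup.
 */
static bool
lsn_within_backup_window(uint64_t reread_lsn,
                         uint64_t backup_start_lsn,
                         uint64_t current_lsn)
{
    return reread_lsn >= backup_start_lsn && reread_lsn <= current_lsn;
}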

A concern here, however, is- can we be 100% sure that we'll get a
different result from the two subsequent reads?  For my part, at least,
I've been doubtful that it's possible but it'd be nice to hear it from
someone who has really looked at the kernel side.  To try and clarify,
let me illustrate:

pg_basebackup (the backend that's sending data to it anyway) starts
reading an 8K page, but gets interrupted halfway through, meaning that
it's read 4K and is now paused.

PG writes that same 8K page, and is able to successfully write the
entire block.

pg_basebackup then wakes up, reads the second half, computes a checksum
and gets a checksum failure.

At this point the question is: if pg_basebackup loops, seeks and
re-reads the same 8K block again, is it possible that pg_basebackup will
get the "old" starting 4K and the "new" ending 4K again?  I'd like to
think that the answer is 'no' and that the kernel will guarantee that if
we managed to read a "new" ending 4K block then the following read of
the full 8K block would be guaranteed to give us the "new" starting 4K.
If that is truly guaranteed then we could be much more confident that
the idea here of simply checking for an ascending LSN, which falls
between the starting LSN of the backup and the current LSN (or perhaps
the final LSN for the backup) would be sufficient to detect this case.

I would also think that, if we can trust that, then there really isn't
any need for the delay in performing the re-read, which I have to admit
that I don't particularly care for.

> >  If we can prove that, we don't actually care what
> >the contents of the page are.  We certainly can't calculate the
> >checksum on a page we plucked out of shared buffers since we only
> >calculate the checksum when we go to write the page out.
> Good point. I was thinking that we can recalculate checksum. Or even save a
> page without it, as we have checked LSN and know for sure that it will be
> rewritten by WAL replay.

At the point that we know the page is in the WAL which must be replayed
to make this backup consistent, we could theoretically zero the page out
of the actual backup (or if we're doing some kind of incremental magic,
skip it entirely, as long as we zero-fill it on restore).

> To sum up, I agree with your proposal to reread the page and rely on
> ascending LSNs. Can you submit a patch?

Probably would make sense to give Michael an opportunity to comment and
get his thoughts on this, and for him to update the patch if he agrees.

As it relates to pgbackrest, we're currently contemplating having a
higher level loop which, upon detecting any page with an invalid
checksum, continues to scan to the end of that file and perform the
compression, encryption, et al, but then loops back after we've
completed that file and skips through the file again, re-reading those
pages which didn't have a valid checksum the first time to see if their
LSN has changed and is within the range of the backup.  This will
certainly give more opportunity for the kernel to 'catch up', if needed,
and give us an updated page without a random 100ms delay, and will also
make it easier for us to, eventually, check and make sure the page was
in the WAL that was produced as part of the backup, to give us a
complete guarantee that the contents of this page don't matter and that
the failed checksum isn't a sign of latent storage corruption.

Thanks,

Stephen


Re: Online verification of checksums

Michael Paquier-2
In reply to this post by Stephen Frost
On Mon, Nov 23, 2020 at 10:35:54AM -0500, Stephen Frost wrote:
> * Anastasia Lubennikova ([hidden email]) wrote:
>> It seems reasonable to me to rely on checksums only.
>>
>> As for retry, I think that API for concurrent I/O will be complicated.
>> Instead, we can introduce a function to read the page directly from shared
>> buffers after PAGE_RETRY_THRESHOLD attempts. It looks like a bullet-proof
>> solution to me. Do you see any possible problems with it?

It seems to me that you are missing the point here.  It is not
necessary to read a page from shared buffers.  What is necessary is to
make sure that there is zero concurrent I/O activity in shared buffers
while a page is getting checked on disk, giving the assurance that
there is zero risk of having a torn page for a check for anything
working with shared buffers.  You could do that only on a retry if we
found a page where there was a checksum mismatch, meaning that the
page was either torn or corrupted, but needs an extra verification
anyway.

> We might end up reading pages back in that have been evicted, for one
> thing, which doesn't seem great, and this also seems likely to be
> awkward for cases which aren't using the replication protocol, unless
> every process maintains a connection to PG the entire time, which also
> doesn't seem great.

I don't quite see a problem in checking pages that have been just
evicted if we are able to detect faster that a page is corrupted,
because the initial check may fail because a page was torn, meaning
that it was in the middle of an eviction, but the page could also be
corrupted, meaning also that it was *not* torn, and would fail a retry
where we should make sure that there is no s_b concurrent activity.
So in the worst case you make the detection of a corrupted page
faster.

Please note that Andres also mentioned the potential need to worry
about table AMs that call smgrwrite() directly, bypassing shared
buffers.  The only cases in-core where it is used are related to init
forks when an unlogged relation gets created, where it would not
matter if you are doing a page check while holding a database
transaction as the newly-created relation would not be visible yet,
but it would matter in the case of base backups doing direct page
lookups.  Fun.

> Also- what is the point of reading the page from shared buffers
> anyway..?  All we need to do is prove that the page will be rewritten
> during WAL replay.  If we can prove that, we don't actually care what
> the contents of the page are.  We certainly can't calculate the
> checksum on a page we plucked out of shared buffers since we only
> calculate the checksum when we go to write the page out.

A LSN-based check makes the thing tricky.  How do you make sure that
pd_lsn is not itself broken?  It could be perfectly possible that a
random on-disk corruption makes pd_lsn seen as having a correct value,
still the rest of the page is borked.
--
Michael


Re: Online verification of checksums

Michael Paquier-2
In reply to this post by Stephen Frost
On Mon, Nov 23, 2020 at 05:28:52PM -0500, Stephen Frost wrote:

> * Anastasia Lubennikova ([hidden email]) wrote:
>> Yes and this is a tricky part. Until you have explained it in your latest
>> message, I wasn't sure how we can distinct concurrent update from a page
>> header corruption. Now I agree that if page LSN updated and increased
>> between rereads, it is safe enough to conclude that we have some concurrent
>> load.
>
> Even in this case, it's almost free to compare the LSN to the starting
> backup LSN, and to the current LSN position, and make sure it's
> somewhere between the two.  While that doesn't entirely eliminate the
> possibility that the page happened to get corrupted *and* return a
> different result on subsequent reads *and* that it was corrupted in such
> a way that the LSN ended up falling between the starting backup LSN and
> the current LSN, it's certainly reducing the chances of a false negative
> a fair bit.
FWIW, I am not much a fan of designs that are not bullet-proof by
design.  This reduces the odds of problems, sure, still it does not
discard the possibility of incorrect results, confusing users as well
as people looking at such reports.

>> To sum up, I agree with your proposal to reread the page and rely on
>> ascending LSNs. Can you submit a patch?
>
> Probably would make sense to give Michael an opportunity to comment and
> get his thoughts on this, and for him to update the patch if he agrees.

I think that a LSN check would be a safe thing to do iff pd_checksum
is already checked first to make sure that the page contents are fine
to use.   Still, what's the point in doing a LSN check anyway if we
know that the checksum is valid?  Then on a retry if the first attempt
failed you also need the guarantee that there is zero concurrent I/O
activity while a page is rechecked (no need to do that unless the
initial page check doing a checksum match failed).  So the retry needs
to do some s_b interactions, but then comes the much trickier point of
concurrent smgrwrite() calls bypassing the shared buffers.

> As it relates to pgbackrest, we're currently contemplating having a
> higher level loop which, upon detecting any page with an invalid
> checksum, continues to scan to the end of that file and perform the
> compression, encryption, et al, but then loops back after we've
> completed that file and skips through the file again, re-reading those
> pages which didn't have a valid checksum the first time to see if their
> LSN has changed and is within the range of the backup.  This will
> certainly give more opportunity for the kernel to 'catch up', if needed,
> and give us an updated page without a random 100ms delay, and will also
> make it easier for us to, eventually, check and make sure the page was
> in the WAL that was produced as part of the backup, to give us a
> complete guarantee that the contents of this page don't matter and that
> the failed checksum isn't a sign of latent storage corruption.
That would reduce the likelihood of facing torn pages, still you
cannot fully discard the problem either as the same page may get changed
again once you loop over, no?  And what if a corruption has updated
pd_lsn on-disk?  Unlikely so, still possible.
--
Michael


Re: Online verification of checksums

Stephen Frost
Greetings,

On Mon, Nov 23, 2020 at 20:28 Michael Paquier <[hidden email]> wrote:
> On Mon, Nov 23, 2020 at 05:28:52PM -0500, Stephen Frost wrote:
>> * Anastasia Lubennikova ([hidden email]) wrote:
>>> Yes and this is a tricky part. Until you have explained it in your latest
>>> message, I wasn't sure how we can distinguish concurrent update from a page
>>> header corruption. Now I agree that if page LSN updated and increased
>>> between rereads, it is safe enough to conclude that we have some concurrent
>>> load.
>>
>> Even in this case, it's almost free to compare the LSN to the starting
>> backup LSN, and to the current LSN position, and make sure it's
>> somewhere between the two.  While that doesn't entirely eliminate the
>> possibility that the page happened to get corrupted *and* return a
>> different result on subsequent reads *and* that it was corrupted in such
>> a way that the LSN ended up falling between the starting backup LSN and
>> the current LSN, it's certainly reducing the chances of a false negative
>> a fair bit.
>
> FWIW, I am not much a fan of designs that are not bullet-proof by
> design.  This reduces the odds of problems, sure, still it does not
> discard the possibility of incorrect results, confusing users as well
> as people looking at such reports.

Let’s be clear about this- our checksums are, themselves, far from
bulletproof, regardless of all of our other efforts.  They are not
foolproof against any corruption, and certainly not even close to being
sufficient for guarantees you’d expect in, say, encryption integrity.
We cannot say with certainty that a page which passes checksum
validation isn’t corrupted in some way.  A page which doesn’t pass
checksum validation may be corrupted or may be torn and we aren’t 100%
sure of that either, but we can work to try and make a sensible call
about which it is.

>>> To sum up, I agree with your proposal to reread the page and rely on
>>> ascending LSNs. Can you submit a patch?
>>
>> Probably would make sense to give Michael an opportunity to comment and
>> get his thoughts on this, and for him to update the patch if he agrees.
>
> I think that a LSN check would be a safe thing to do iff pd_checksum
> is already checked first to make sure that the page contents are fine
> to use.  Still, what's the point in doing a LSN check anyway if we
> know that the checksum is valid?  Then on a retry if the first attempt
> failed you also need the guarantee that there is zero concurrent I/O
> activity while a page is rechecked (no need to do that unless the
> initial page check doing a checksum match failed).  So the retry needs
> to do some s_b interactions, but then comes the much trickier point of
> concurrent smgrwrite() calls bypassing the shared buffers.

I agree that the LSN check isn’t interesting if the page passes the
checksum validation.  I do think we can look at the LSN and make
reasonable inferences based off of it even if the checksum doesn’t
validate- in particular, in my experience at least, the result of a
read, without any intervening write, is very likely to be the same if
performed multiple times quickly even if there is latent storage
corruption- due to caching, if nothing else.  What’s interesting about
the LSN check is that we are specifically looking to see if it
*changed* in a reasonable and predictable manner, and that it was
replaced with a new yet reasonable value. The chances of that happening
due to latent storage corruption is vanishingly small.

>> As it relates to pgbackrest, we're currently contemplating having a
>> higher level loop which, upon detecting any page with an invalid
>> checksum, continues to scan to the end of that file and perform the
>> compression, encryption, et al, but then loops back after we've
>> completed that file and skips through the file again, re-reading those
>> pages which didn't have a valid checksum the first time to see if their
>> LSN has changed and is within the range of the backup.  This will
>> certainly give more opportunity for the kernel to 'catch up', if needed,
>> and give us an updated page without a random 100ms delay, and will also
>> make it easier for us to, eventually, check and make sure the page was
>> in the WAL that was produced as part of the backup, to give us a
>> complete guarantee that the contents of this page don't matter and that
>> the failed checksum isn't a sign of latent storage corruption.
>
> That would reduce the likelihood of facing torn pages, still you
> cannot fully discard the problem either as the same page may get changed
> again once you loop over, no?  And what if a corruption has updated
> pd_lsn on-disk?  Unlikely so, still possible.

We surely don’t care about a page which has been changed multiple times
by PG during the backup, since all those changes will be, by
definition, in the WAL, no?  Therefore, one loop to see that the value
of the LSN *changed*, meaning something wrote something new there, with
a cross-check to see that the LSN was in the expected range, is going
an awfully long way to assuring that this isn’t a case of latent
storage corruption. If there is an attacker who is not the PG process
but who is modifying files then, yes, that’s a risk, and won’t be
picked up by this, but why would they create an invalid checksum in the
first place..?

We aren’t attempting to protect against a sophisticated attack, we are
trying to detect latent storage corruption.

I would also ask for a clarification as to if you feel that checking
the WAL for the page to be insufficient somehow, since I mentioned that
as also being on the roadmap.  If there’s some reason that checking the
WAL for the page wouldn’t be sufficient, I am anxious to understand
that reasoning.

Thanks,

Stephen

Re: Online verification of checksums

David Steele
In reply to this post by Michael Paquier-2
Hi Michael,

On 11/23/20 8:10 PM, Michael Paquier wrote:

> On Mon, Nov 23, 2020 at 10:35:54AM -0500, Stephen Frost wrote:
>
>> Also- what is the point of reading the page from shared buffers
>> anyway..?  All we need to do is prove that the page will be rewritten
>> during WAL replay.  If we can prove that, we don't actually care what
>> the contents of the page are.  We certainly can't calculate the
>> checksum on a page we plucked out of shared buffers since we only
>> calculate the checksum when we go to write the page out.
>
> A LSN-based check makes the thing tricky.  How do you make sure that
> pd_lsn is not itself broken?  It could be perfectly possible that a
> random on-disk corruption makes pd_lsn seen as having a correct value,
> still the rest of the page is borked.

We are not just looking at one LSN value. Here are the steps we are
proposing (I'll skip checks for zero pages here):

1) Test the page checksum. If it passes the page is OK.
2) If the checksum does not pass then record the page offset and LSN and
continue.
3) After the file is copied, reopen and reread the file, seeking to
offsets where possible invalid pages were recorded in the first pass.
     a) If the page is now valid then it is OK.
     b) If the page is not valid but the LSN has increased from the LSN
recorded in the previous pass then it is OK. We can infer this because
the LSN has been updated in a way that is not consistent with storage
corruption.

This is what we are planning for the first round of improving our page
checksum validation. We believe that doing the retry in a second pass
will be faster and more reliable because some time will have passed
since the first read without having to build in a delay for each page error.

A further improvement is to check the ascending LSNs found in 3b against
PostgreSQL to be completely sure they are valid. We are planning this
for our second round of improvements.

Reopening the file for the second pass does require some additional logic:

1) The file may have been deleted by PG since the first pass and in that
case we won't report any page errors.
2) The file may have been truncated by PG since the first pass so we
won't report any errors past the point of truncation.

A malicious attacker could easily trick these checks, but as Stephen
pointed out elsewhere they would likely make the checksums valid which
would escape detection anyway.

We believe that the chances of random storage corruption passing all
these checks is incredibly small, but eventually we'll also check
against the WAL to be completely sure.

Regards,
--
-David
[hidden email]
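
The second pass David outlines could be sketched as follows -- plain C, with checksum_matches(), page_lsn() and report_corruption() as hypothetical stand-ins rather than pgBackRest or PostgreSQL APIs:

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_BLCKSZ 8192      /* PostgreSQL's default block size */

/* Hypothetical helpers -- assumed for the sketch, not real APIs. */
extern int      checksum_matches(const char *buf, uint32_t blkno);
extern uint64_t page_lsn(const char *buf);
extern void     report_corruption(off_t offset);

typedef struct
{
    off_t    offset;     /* where the suspect page lives in the file */
    uint64_t first_lsn;  /* LSN recorded on the first pass */
} SuspectPage;

/*
 * Second pass over pages that failed verification the first time.  A
 * page is cleared if it now passes the checksum (3a) or if its LSN has
 * moved forward since the first read (3b); otherwise it is reported.
 * Per the additional logic above, the file may have been deleted or
 * truncated by PG in the meantime, which must not be reported.
 */
static void
recheck_suspects(int fd, const SuspectPage *suspects, int nsuspects)
{
    char buf[SKETCH_BLCKSZ];

    for (int i = 0; i < nsuspects; i++)
    {
        if (pread(fd, buf, sizeof(buf), suspects[i].offset) != sizeof(buf))
            continue;   /* truncated since the first pass: no error */

        if (checksum_matches(buf, (uint32_t) (suspects[i].offset / SKETCH_BLCKSZ)))
            continue;                               /* 3a: now valid */

        if (page_lsn(buf) > suspects[i].first_lsn)
            continue;                               /* 3b: LSN ascended */

        report_corruption(suspects[i].offset);
    }
}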



Re: Online verification of checksums

Michael Paquier-2
On Tue, Nov 24, 2020 at 12:38:30PM -0500, David Steele wrote:
> We are not just looking at one LSN value. Here are the steps we are
> proposing (I'll skip checks for zero pages here):
>
> 1) Test the page checksum. If it passes the page is OK.
> 2) If the checksum does not pass then record the page offset and LSN and
> continue.

But here the checksum is broken, so while the offset is something we
can rely on, how do you make sure that the LSN is fine?  A broken
checksum could perfectly mean that the LSN is actually *not* fine if
the page header got corrupted.

> 3) After the file is copied, reopen and reread the file, seeking to offsets
> where possible invalid pages were recorded in the first pass.
>     a) If the page is now valid then it is OK.
>     b) If the page is not valid but the LSN has increased from the LSN

As per the previous point about the LSN value that we cannot rely on.

> A malicious attacker could easily trick these checks, but as Stephen pointed
> out elsewhere they would likely make the checksums valid which would escape
> detection anyway.
>
> We believe that the chances of random storage corruption passing all these
> checks is incredibly small, but eventually we'll also check against the WAL
> to be completely sure.

The lack of check for any concurrent I/O on the follow-up retries is
disturbing.  How do you guarantee that on the second retry what you
have is a torn page and not something corrupted?  Init forks for
example are made of up to 2 blocks, so the window would get short for
at least those.  There are many instances with tables that have few
pages as well.
--
Michael


Re: Online verification of checksums

Magnus Hagander-2
On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier <[hidden email]> wrote:

>
> On Tue, Nov 24, 2020 at 12:38:30PM -0500, David Steele wrote:
> > We are not just looking at one LSN value. Here are the steps we are
> > proposing (I'll skip checks for zero pages here):
> >
> > 1) Test the page checksum. If it passes the page is OK.
> > 2) If the checksum does not pass then record the page offset and LSN and
> > continue.
>
> But here the checksum is broken, so while the offset is something we
> can rely on, how do you make sure that the LSN is fine?  A broken
> checksum could perfectly mean that the LSN is actually *not* fine if
> the page header got corrupted.
>
> > 3) After the file is copied, reopen and reread the file, seeking to offsets
> > where possible invalid pages were recorded in the first pass.
> >     a) If the page is now valid then it is OK.
> >     b) If the page is not valid but the LSN has increased from the LSN
>
> As per the previous point about the LSN value that we cannot rely on.

We cannot rely on the LSN itself. But it's a lot more likely that we
can rely on the LSN changing, and on the LSN changing in a "correct
way". That is, if the LSN *decreases* we know it's corrupt. If the LSN
*doesn't change* we know it's corrupt. But if the LSN *increases* AND
the new page now has a correct checksum, it's very likely to be
correct. You could perhaps even put a cap on it saying "if the LSN
increased, but less than <n>", where <n> is a sufficiently high number
that it's entirely unreasonable to advance that far between the
reading of two blocks. But it has to have a very high margin in that
case.
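
That decision table can be written down directly; a sketch (illustrative only, with the cap <n> left as a parameter, as in the comment above):

#include <stdbool.h>
#include <stdint.h>

typedef enum { VERDICT_CORRUPT, VERDICT_LIKELY_OK, VERDICT_UNDECIDED } Verdict;

/*
 * Classify the second read of a page whose first read failed its
 * checksum.  max_advance is the cap <n> on how far the LSN may
 * plausibly advance between two reads of the same block.
 */
static Verdict
classify_reread(uint64_t old_lsn, uint64_t new_lsn,
                bool new_checksum_ok, uint64_t max_advance)
{
    if (new_lsn < old_lsn)
        return VERDICT_CORRUPT;       /* LSN went backwards */
    if (new_lsn == old_lsn)
        return VERDICT_CORRUPT;       /* page did not change */
    if (new_checksum_ok && new_lsn - old_lsn <= max_advance)
        return VERDICT_LIKELY_OK;     /* increased, now checksums fine */
    return VERDICT_UNDECIDED;         /* increased, but not provably fine */
}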


> > A malicious attacker could easily trick these checks, but as Stephen pointed
> > out elsewhere they would likely make the checksums valid which would escape
> > detection anyway.
> >
> > We believe that the chances of random storage corruption passing all these
> > checks is incredibly small, but eventually we'll also check against the WAL
> > to be completely sure.
>
> The lack of check for any concurrent I/O on the follow-up retries is
> disturbing.  How do you guarantee that on the second retry what you
> have is a torn page and not something corrupted?  Init forks for
> example are made of up to 2 blocks, so the window would get short for
> at least those.  There are many instances with tables that have few
> pages as well.

Here I was more worried that the window might get *too long* if tables
are large :)

The risk is certainly that you get a torn page *again* on the second
read. It could be the same torn page (if it hasn't changed), but you
can detect that (by the fact that it hasn't actually changed) and
possibly do a short delay before trying again if it gets that far.
That could happen if the process is too quick. It could also be that
you are unlucky and that you hit a *new* write, and you were so
unlucky that both times it happened to hit exactly when you were
reading the page the next time. I'm not sure the chance of that
happening is even big enough we have to care about it, though?

--
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/



Re: Online verification of checksums

Stephen Frost
Greetings,

* Magnus Hagander ([hidden email]) wrote:

> On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier <[hidden email]> wrote:
> > On Tue, Nov 24, 2020 at 12:38:30PM -0500, David Steele wrote:
> > > We are not just looking at one LSN value. Here are the steps we are
> > > proposing (I'll skip checks for zero pages here):
> > >
> > > 1) Test the page checksum. If it passes the page is OK.
> > > 2) If the checksum does not pass then record the page offset and LSN and
> > > continue.
> >
> > But here the checksum is broken, so while the offset is something we
> > can rely on, how do you make sure that the LSN is fine?  A broken
> > checksum could perfectly mean that the LSN is actually *not* fine if
> > the page header got corrupted.
Of course that could be the case, but it gets to be a smaller and
smaller chance by checking that the LSN read falls within reasonable
bounds.

> > > 3) After the file is copied, reopen and reread the file, seeking to offsets
> > > where possible invalid pages were recorded in the first pass.
> > >     a) If the page is now valid then it is OK.
> > >     b) If the page is not valid but the LSN has increased from the LSN
> >
> > As per the previous point about the LSN value that we cannot rely on.
>
> We cannot rely on the LSN itself. But it's a lot more likely that we
> can rely on the LSN changing, and on the LSN changing in a "correct
> way". That is, if the LSN *decreases* we know it's corrupt. If the LSN
> *doesn't change* we know it's corrupt. But if the LSN *increases* AND
> the new page now has a correct checksum, it's very likely to be
> correct. You could perhaps even put a cap on it saying "if the LSN
> increased, but less than <n>", where <n> is a sufficiently high number
> that it's entirely unreasonable to advance that far between the
> reading of two blocks. But it has to have a very high margin in that
> case.
This is, in fact, included in what was proposed- the "max increase"
would be "the ending LSN of the backup".  I don't think we can make it
any tighter than that though without risking false positives, which is
surely worse than a false negative in this particular case- we already
risk false negatives due to the fact that our checksum isn't perfect, so
even a perfect check to make sure that the page will, in fact, be
replayed over during crash recovery doesn't guarantee that there's no
corruption.

> > > A malicious attacker could easily trick these checks, but as Stephen pointed
> > > out elsewhere they would likely make the checksums valid which would escape
> > > detection anyway.
> > >
> > > We believe that the chances of random storage corruption passing all these
> > > checks is incredibly small, but eventually we'll also check against the WAL
> > > to be completely sure.
> >
> > The lack of check for any concurrent I/O on the follow-up retries is
> > disturbing.  How do you guarantee that on the second retry what you
> > have is a torn page and not something corrupted?  Init forks for
> > example are made of up to 2 blocks, so the window would get short for
> > at least those.  There are many instances with tables that have few
> > pages as well.
If there's an easy and cheap way to see if there was concurrent i/o
happening for the page, then let's hear it.  One idea that has occurred
to me which hasn't been discussed is checking the file's mtime to see if
it's changed since the backup started.  In that case, I would think it'd
be something like:

- Checksum is invalid
- LSN is within range
- Close file
- Stat file
- If mtime is from before the backup then signal possible corruption

If the checksum is invalid and the LSN isn't in range, then signal
corruption.
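
A sketch of that mtime heuristic (illustrative only; as noted later in the thread, mtime can be unreliable on some filesystems such as NFS):

#include <stdbool.h>
#include <sys/stat.h>
#include <time.h>

/*
 * A page with a failing checksum but an in-range LSN should be one
 * that PG is actively rewriting, in which case the containing file
 * ought to have been modified since the backup began.  An old mtime
 * means nothing wrote the file recently: possible corruption.
 */
static bool
mtime_suggests_corruption(const char *path, time_t backup_start_time)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return false;   /* file gone: handled elsewhere, not corruption */

    return st.st_mtime < backup_start_time;
}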

In general, however, I don't like the idea of reaching into PG and
asking PG for this page.

> Here I was more worried that the window might get *too long* if tables
> are large :)

I'm not sure that there's really a 'too long' possibility here.

> The risk is certainly that you get a torn page *again* on the second
> read. It could be the same torn page (if it hasn't changed), but you
> can detect that (by the fact that it hasn't actually changed) and
> possibly do a short delay before trying again if it gets that far.

I'm really not a fan of introducing these delays in the hopes that
they'll work..

> That could happen if the process is too quick. It could also be that
> you are unlucky and that you hit a *new* write, and you were so
> unlucky that both times it happened to hit exactly when you were
> reading the page the next time. I'm not sure the chance of that
> happening is even big enough we have to care about it, though?

If there's actually a new write, surely the LSN would be new?  At the
least, it wouldn't be the same LSN as the first read that picked up a
torn page.

In general though, I agree, we are getting to the point here where the
chances of missing something with this approach seems extremely slim.  I
do still like the idea of doing better by actually scanning the WAL but
at least for now, this is far better than what we have today while not
introducing a huge amount of additional code or complexity.

Thanks,

Stephen


Re: Online verification of checksums

Michael Paquier-2
On Fri, Nov 27, 2020 at 11:15:27AM -0500, Stephen Frost wrote:

> * Magnus Hagander ([hidden email]) wrote:
>> On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier <[hidden email]> wrote:
>>> But here the checksum is broken, so while the offset is something we
>>> can rely on how do you make sure that the LSN is fine?  A broken
>>> checksum could perfectly mean that the LSN is actually *not* fine if
>>> the page header got corrupted.
>
> Of course that could be the case, but it gets to be a smaller and
> smaller chance by checking that the LSN read falls within reasonable
> bounds.
FWIW, I find that scary.

>> We cannot rely on the LSN itself. But it's a lot more likely that we
>> can rely on the LSN changing, and on the LSN changing in a "correct
>> way". That is, if the LSN *decreases* we know it's corrupt. If the LSN
>> *doesn't change* we know it's corrupt. But if the LSN *increases* AND
>> the new page now has a correct checksum, it's very likely to be
>> correct. You could perhaps even put a cap on it saying "if the LSN
>> increased, but less than <n>", where <n> is a sufficiently high number
>> that it's entirely unreasonable to advance that far between the
>> reading of two blocks. But it has to have a very high margin in that
>> case.
>
> This is, in fact, included in what was proposed- the "max increase"
> would be "the ending LSN of the backup".  I don't think we can make it
> any tighter than that though without risking false positives, which is
> surely worse than a false negative in this particular case- we already
> risk false negatives due to the fact that our checksum isn't perfect, so
> even a perfect check to make sure that the page will, in fact, be
> replayed over during crash recovery doesn't guarantee that there's no
> corruption.
>
> If there's an easy and cheap way to see if there was concurrent i/o
> happening for the page, then let's hear it.
This has been discussed for a couple of months now.  I would recommend
going through this thread:
https://www.postgresql.org/message-id/CAOBaU_aVvMjQn=ge5qPiJOPMmOj5=ii3st5Q0Y+WuLML5sR17w@...

And this bit is interesting, because that would give the guarantees
you are looking for with a page retry (just grep for BM_IO_IN_PROGRESS
on the thread):
https://www.postgresql.org/message-id/20201102193457.fc2hoen7ahth4bbc@...

> One idea that has occurred
> to me which hasn't been discussed is checking the file's mtime to see if
> it's changed since the backup started.  In that case, I would think it'd
> be something like:
>
> - Checksum is invalid
> - LSN is within range
> - Close file
> - Stat file
> - If mtime is from before the backup then signal possible corruption
I suspect that relying on mtime may cause problems.  One case coming
to my mind is NFS.

> In general, however, I don't like the idea of reaching into PG and
> asking PG for this page.

It seems to me that if we don't ask PG what it thinks about a page,
we will never have a fully bullet-proof design either.
--
Michael
