Online verification of checksums


Online verification of checksums

Michael Banck
Hi,

v10 almost added online activation of checksums, but all we've got is
pg_verify_checksums, i.e. offline verification of checksums.

However, we also got (online) checksum verification during base backups,
and I have ported/adapted David Steele's recheck code to my personal
fork of pg_checksums[1], removed the online check (for verification) and
that seems to work fine.

I've now forward-ported this change to pg_verify_checksums, in order to
make this application useful for online clusters, see attached patch.

I've tested this in a tight loop[2]:

    while true; do pg_verify_checksums -D data1 -d > /dev/null || /bin/true; done

while concurrently running:

    while true; do createdb pgbench; pgbench -i -s 10 pgbench > /dev/null; dropdb pgbench; done

I had already used this workload to develop the original code in the
fork, and it turned up a few bugs.

I got one checksum verification failure this way; all the others were
caught by the recheck (I've introduced a 500ms delay for the first ten
failures), like this:

|pg_verify_checksums: checksum verification failed on first attempt in
|file "data1/base/16837/16850", block 7770: calculated checksum 785 but
|expected 5063
|pg_verify_checksums: block 7770 in file "data1/base/16837/16850"
|verified ok on recheck

However, I am also seeing sporadic (maybe 0.5 times per pgbench run)
failures like this:

|pg_verify_checksums: short read of block 2644 in file
|"data1/base/16637/16650", got only 4096 bytes

This is not strictly a verification failure; should we do anything about
it? In my fork I also recheck on this[3] (and I am happy to extend the
patch that way), but that makes the code and the patch more complicated,
so I wanted to gauge the general opinion on this case first.


Michael

[1] https://github.com/credativ/pg_checksums/commit/dc052f0d6f1282d3c8215b0eb28b8e7c4e74f9e5
[2] while patching out the somewhat unhelpful (in regular operation,
anyway) debug message for every successful checksum verification
[3] https://github.com/credativ/pg_checksums/blob/master/pg_checksums.c#L160
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: [hidden email]

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz

Attachment: online-verification-of-checksums_V1.patch (5K)

Re: Online verification of checksums

Peter Eisentraut
On 26/07/2018 13:59, Michael Banck wrote:
> I've now forward-ported this change to pg_verify_checksums, in order to
> make this application useful for online clusters, see attached patch.

Why not provide this functionality as a server function or command?
Then you can access blocks with proper locks and don't have to do this
rather ad hoc retry logic on concurrent access.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Online verification of checksums

Magnus Hagander
On Thu, Aug 30, 2018 at 8:06 PM, Peter Eisentraut <[hidden email]> wrote:
On 26/07/2018 13:59, Michael Banck wrote:
> I've now forward-ported this change to pg_verify_checksums, in order to
> make this application useful for online clusters, see attached patch.

Why not provide this functionality as a server function or command?
Then you can access blocks with proper locks and don't have to do this
rather ad hoc retry logic on concurrent access.

I think it would make sense to provide this functionality in the "checksum worker" infrastructure suggested in the online checksum enabling patch. But I think being able to run it from the outside would also be useful, particularly when it's this simple.

But why do we need a sleep in it? AFAICT this is basically the same code that we have in basebackup.c, and that one does not need the sleep? Certainly 500ms would be very long since we're just protecting against a torn page, but I think the comment is wrong and we're actually sleeping 0.5ms?


Re: Online verification of checksums

Fabien COELHO
In reply to this post by Michael Banck-2

Hallo Michael,

> I've now forward-ported this change to pg_verify_checksums, in order to
> make this application useful for online clusters, see attached patch.

Patch does not seem to apply anymore, could you rebase it?

--
Fabien.


Re: Online verification of checksums

Tomas Vondra
In reply to this post by Michael Banck-2
Hi,

The patch is mostly copying the verification / retry logic from
basebackup.c, but I think it omitted a rather important detail that
makes it incorrect in the presence of concurrent writes.

The very first thing basebackup does is this:

    startptr = do_pg_start_backup(...);

i.e. it waits for a checkpoint, remembering the LSN. And then when
checking a page it does this:

   if (!PageIsNew(page) && PageGetLSN(page) < startptr)
   {
       ... verify the page checksum
   }

Obviously, pg_verify_checksums can't do that easily because it's
supposed to run from outside the database instance. But the startptr
detail is pretty important because it supports this retry reasoning:

    /*
     * Retry the block on the first failure.  It's
     * possible that we read the first 4K page of the
     * block just before postgres updated the entire block
     * so it ends up looking torn to us.  We only need to
     * retry once because the LSN should be updated to
     * something we can ignore on the next pass.  If the
     * error happens again then it is a true validation
     * failure.
     */

Imagine the 8kB page as two 4kB pages, with the initial state being
[A1,A2] and another process over-writing it with [B1,B2]. If you read
the 8kB page, what states can you see?

I don't think POSIX provides any guarantees about atomicity of the write
calls (and even if it does, the filesystems on Linux don't seem to). So
you may observe either [A1,B2] or [B1,A2], or various inconsistent mixes
of the two versions, depending on timing. Well, torn pages ...

Pretty much the only thing you can rely on is that when one process does

    write([B1,B2])

the other process may first read [A1,B2], but the next read will return
[B1,B2] (or possibly newer data, if there was another write). It will
not read the "stale" A1 again.

The basebackup code relies on this somewhat implicitly - on the retry it'll
notice the LSN changed (thanks to the startptr check), and the page will
be skipped entirely. This is pretty important, because the new page
might be torn in some other way.

pg_verify_checksums apparently ignores this skip logic, because on
the retry it simply re-reads the page again, verifies the checksum and
reports an error. Which is broken, because the newly read page might be
torn again due to a concurrent write.

So IMHO this should do something similar to basebackup - check the page
LSN, and if it changed then skip the page.

I'm afraid this requires using the last checkpoint LSN, the way startptr
is used in basebackup. In particular we can't simply remember LSN from
the first read, because we might actually read [B1,A2] on the first try,
and then [B1,B2] or [B1,C2] on the retry. (Actually, the page may be
torn in various other ways, not necessarily at the 4kB boundary - it
might be torn right after the LSN, for example).


FWIW I also don't understand the purpose of pg_sleep(), it does not seem
to protect against anything, really.

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Online verification of checksums

Michael Banck
Hi,

On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote:

> The patch is mostly copying the verification / retry logic from
> basebackup.c, but I think it omitted a rather important detail that
> makes it incorrect in the presence of concurrent writes.
>
> The very first thing basebackup does is this:
>
>     startptr = do_pg_start_backup(...);
>
> i.e. it waits for a checkpoint, remembering the LSN. And then when
> checking a page it does this:
>
>    if (!PageIsNew(page) && PageGetLSN(page) < startptr)
>    {
>        ... verify the page checksum
>    }
>
> Obviously, pg_verify_checksums can't do that easily because it's
> supposed to run from outside the database instance.

It reads pg_control anyway, so couldn't we just take
ControlFile->checkPoint?

Other than that, basebackup.c seems to only look at pages which haven't
been changed since the backup starting checkpoint (see above if
statement). That's reasonable for backups, but is it just as reasonable
for online verification?

> But the startptr detail is pretty important because it supports this
> retry reasoning:
>
>     /*
>      * Retry the block on the first failure.  It's
>      * possible that we read the first 4K page of the
>      * block just before postgres updated the entire block
>      * so it ends up looking torn to us.  We only need to
>      * retry once because the LSN should be updated to
>      * something we can ignore on the next pass.  If the
>      * error happens again then it is a true validation
>      * failure.
>      */
>
> Imagine the 8kB page as two 4kB pages, with the initial state being
> [A1,A2] and another process over-writing it with [B1,B2]. If you read
> the 8kB page, what states can you see?
>
> I don't think POSIX provides any guarantees about atomicity of the write
> calls (and even if it does, the filesystems on Linux don't seem to). So
> you may observe either [A1,B2] or [B1,A2], or various inconsistent mixes
> of the two versions, depending on timing. Well, torn pages ...
>
> Pretty much the only thing you can rely on is that when one process does
>
>     write([B1,B2])
>
> the other process may first read [A1,B2], but the next read will return
> [B1,B2] (or possibly newer data, if there was another write). It will
> not read the "stale" A1 again.
>
> The basebackup relies on this kinda implicitly - on the retry it'll
> notice the LSN changed (thanks to the startptr check), and the page will
> be skipped entirely. This is pretty important, because the new page
> might be torn in some other way.
>
> The pg_verify_checksum apparently ignores this skip logic, because on
> the retry it simply re-reads the page again, verifies the checksum and
> reports an error. Which is broken, because the newly read page might be
> torn again due to a concurrent write.

Well, ok.
 
> So IMHO this should do something similar to basebackup - check the page
> LSN, and if it changed then skip the page.
>
> I'm afraid this requires using the last checkpoint LSN, the way startptr
> is used in basebackup. In particular we can't simply remember LSN from
> the first read, because we might actually read [B1,A2] on the first try,
> and then [B1,B2] or [B1,C2] on the retry. (Actually, the page may be
> torn in various other ways, not necessarily at the 4kB boundary - it
> might be torn right after the LSN, for example).

I'd prefer to come up with a plan where we don't just give up once we
see a new LSN, if possible. If I run a modified pg_verify_checksums
which skips on newer pages in a tight benchmark, basically everything
gets skipped as checkpoints don't happen often enough.

So how about we do check every page, but if one fails on retry, and the
LSN is newer than the checkpoint, we then skip it? Is that logic sound?

In any case, if we decide we really should skip the page if it is newer
than the checkpoint, I think it makes sense to track those skipped pages
and print their number out at the end, if there are any.

> FWIW I also don't understand the purpose of pg_sleep(), it does not seem
> to protect against anything, really.

Well, I've noticed that without it I get sporadic checksum failures on
reread, so I've added it to make them go away. It was certainly a
phenomenological decision that I am happy to trade for a better one.

Also, I noticed there's sometimes a 'data/global/pg_internal.init.606'
or some such file which pg_verify_checksums gets confused on, I guess we
should skip that as well.  Can we assume that all files that start with
the ones in skip[] are safe to skip or should we have an exception for
files starting with pg_internal.init?


Michael



Re: Online verification of checksums

Stephen Frost
Greetings,

* Michael Banck ([hidden email]) wrote:

> On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote:
> > Obviously, pg_verify_checksums can't do that easily because it's
> > supposed to run from outside the database instance.
>
> It reads pg_control anyway, so couldn't we just take
> ControlFile->checkPoint?
>
> Other than that, basebackup.c seems to only look at pages which haven't
> been changed since the backup starting checkpoint (see above if
> statement). That's reasonable for backups, but is it just as reasonable
> for online verification?

Right, basebackup doesn't need to look at other pages.

> > The pg_verify_checksum apparently ignores this skip logic, because on
> > the retry it simply re-reads the page again, verifies the checksum and
> > reports an error. Which is broken, because the newly read page might be
> > torn again due to a concurrent write.
>
> Well, ok.

The newly read page will have an updated LSN though then on the re-read,
in which case basebackup can know that what happened was a rewrite of
the page and it no longer has to care about the page and can skip it.

I haven't looked, but if basebackup isn't checking the LSN again for the
newly read page then that'd be broken, but I believe it does (at least,
that's the algorithm we came up with for pgBackRest, and I know David
shared that when the basebackup code was being written).

> > So IMHO this should do something similar to basebackup - check the page
> > LSN, and if it changed then skip the page.
> >
> > I'm afraid this requires using the last checkpoint LSN, the way startptr
> > is used in basebackup. In particular we can't simply remember LSN from
> > the first read, because we might actually read [B1,A2] on the first try,
> > and then [B1,B2] or [B1,C2] on the retry. (Actually, the page may be
> > torn in various other ways, not necessarily at the 4kB boundary - it
> > might be torn right after the LSN, for example).
>
> I'd prefer to come up with a plan where we don't just give up once we
> see a new LSN, if possible. If I run a modified pg_verify_checksums
> which skips on newer pages in a tight benchmark, basically everything
> gets skipped as checkpoints don't happen often enough.

I'm really not sure how you expect to be able to do something different
here.  Even if we started poking into shared buffers, all you'd be able
to see is that there's a bunch of dirty pages- and we don't maintain the
checksums in shared buffers, so it's not like you could verify them
there.

You could possibly have an option that says "force a checkpoint" but,
honestly, that's really not all that interesting either- all you'd be
doing is forcing all the pages to be written out from shared buffers
into the kernel cache and then reading them from there instead, it's not
like you'd actually be able to tell if there was a disk/storage error
because you'll only be looking at the kernel cache.

> So how about we do check every page, but if one fails on retry, and the
> LSN is newer than the checkpoint, we then skip it? Is that logic sound?

I thought that's what basebackup did- if it doesn't do that today, then
it really should.

> In any case, if we decide we really should skip the page if it is newer
> than the checkpoint, I think it makes sense to track those skipped pages
> and print their number out at the end, if there are any.

Not sure what the point of this is.  If we wanted to really do something
to cross-check here, we'd track the pages that were skipped and then
look through the WAL to make sure that they're there.  That's something
we've talked about doing with pgBackRest, but don't currently.

> > FWIW I also don't understand the purpose of pg_sleep(), it does not seem
> > to protect against anything, really.
>
> Well, I've noticed that without it I get sporadic checksum failures on
> reread, so I've added it to make them go away. It was certainly a
> phenomenological decision that I am happy to trade for a better one.

That then sounds like we really aren't re-checking the LSN, and we
really should be, to avoid getting these sporadic checksum failures on
reread.

> Also, I noticed there's sometimes a 'data/global/pg_internal.init.606'
> or some such file which pg_verify_checksums gets confused on, I guess we
> should skip that as well.  Can we assume that all files that start with
> the ones in skip[] are safe to skip or should we have an exception for
> files starting with pg_internal.init?

Everything listed in skip[] is safe to skip on a restore. I've not
really thought too much about whether they're all safe to skip when
checking checksums on an online system, but I would generally think so.

Thanks!

Stephen


Re: Online verification of checksums

Tomas Vondra

On 09/17/2018 04:46 PM, Stephen Frost wrote:

> Greetings,
>
> * Michael Banck ([hidden email]) wrote:
>> On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote:
>>> Obviously, pg_verify_checksums can't do that easily because it's
>>> supposed to run from outside the database instance.
>>
>> It reads pg_control anyway, so couldn't we just take
>> ControlFile->checkPoint?
>>
>> Other than that, basebackup.c seems to only look at pages which haven't
>> been changed since the backup starting checkpoint (see above if
>> statement). That's reasonable for backups, but is it just as reasonable
>> for online verification?
>
> Right, basebackup doesn't need to look at other pages.
>
>>> The pg_verify_checksum apparently ignores this skip logic, because on
>>> the retry it simply re-reads the page again, verifies the checksum and
>>> reports an error. Which is broken, because the newly read page might be
>>> torn again due to a concurrent write.
>>
>> Well, ok.
>
> The newly read page will have an updated LSN though then on the re-read,
> in which case basebackup can know that what happened was a rewrite of
> the page and it no longer has to care about the page and can skip it.
>
> I haven't looked, but if basebackup isn't checking the LSN again for the
> newly read page then that'd be broken, but I believe it does (at least,
> that's the algorithm we came up with for pgBackRest, and I know David
> shared that when the basebackup code was being written).
>

Yes, basebackup does check the LSN on re-read, and skips the page if it
changed on re-read (because it eliminates the consistency guarantees
provided by the checkpoint).

>>> So IMHO this should do something similar to basebackup - check the page
>>> LSN, and if it changed then skip the page.
>>>
>>> I'm afraid this requires using the last checkpoint LSN, the way startptr
>>> is used in basebackup. In particular we can't simply remember LSN from
>>> the first read, because we might actually read [B1,A2] on the first try,
>>> and then [B1,B2] or [B1,C2] on the retry. (Actually, the page may be
>>> torn in various other ways, not necessarily at the 4kB boundary - it
>>> might be torn right after the LSN, for example).
>>
>> I'd prefer to come up with a plan where we don't just give up once we
>> see a new LSN, if possible. If I run a modified pg_verify_checksums
>> which skips on newer pages in a tight benchmark, basically everything
>> gets skipped as checkpoints don't happen often enough.
>
> I'm really not sure how you expect to be able to do something different
> here.  Even if we started poking into shared buffers, all you'd be able
> to see is that there's a bunch of dirty pages- and we don't maintain the
> checksums in shared buffers, so it's not like you could verify them
> there.
>
> You could possibly have an option that says "force a checkpoint" but,
> honestly, that's really not all that interesting either- all you'd be
> doing is forcing all the pages to be written out from shared buffers
> into the kernel cache and then reading them from there instead, it's not
> like you'd actually be able to tell if there was a disk/storage error
> because you'll only be looking at the kernel cache.
>

Yeah.

>> So how about we do check every page, but if one fails on retry, and the
>> LSN is newer than the checkpoint, we then skip it? Is that logic sound?
>
> I thought that's what basebackup did- if it doesn't do that today, then
> it really should.
>

The crucial distinction here is that the trick is not in comparing LSNs
from the two page reads, but comparing it to the checkpoint LSN. If it's
greater, the page may be torn or broken, and there's no way to know
which case it is - so basebackup simply skips it.

>> In any case, if we decide we really should skip the page if it is newer
>> than the checkpoint, I think it makes sense to track those skipped pages
>> and print their number out at the end, if there are any.
>
> Not sure what the point of this is.  If we wanted to really do something
> to cross-check here, we'd track the pages that were skipped and then
> look through the WAL to make sure that they're there.  That's something
> we've talked about doing with pgBackRest, but don't currently.
>

I agree simply printing the page numbers seems rather useless. What we
could do is remember which pages we skipped and then try checking them
after another checkpoint. Or something like that.

>>> FWIW I also don't understand the purpose of pg_sleep(), it does not seem
>>> to protect against anything, really.
>>
>> Well, I've noticed that without it I get sporadic checksum failures on
>> reread, so I've added it to make them go away. It was certainly a
>> phenomenological decision that I am happy to trade for a better one.
>
> That then sounds like we really aren't re-checking the LSN, and we
> really should be, to avoid getting these sporadic checksum failures on
> reread..
>

Again, it's not enough to check the LSN against the preceding read. We
need a checkpoint LSN or something like that.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Online verification of checksums

Tomas Vondra
In reply to this post by Michael Banck-2


On 09/17/2018 04:04 PM, Michael Banck wrote:

> Hi,
>
> On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote:
>> The patch is mostly copying the verification / retry logic from
>> basebackup.c, but I think it omitted a rather important detail that
>> makes it incorrect in the presence of concurrent writes.
>>
>> The very first thing basebackup does is this:
>>
>>     startptr = do_pg_start_backup(...);
>>
>> i.e. it waits for a checkpoint, remembering the LSN. And then when
>> checking a page it does this:
>>
>>    if (!PageIsNew(page) && PageGetLSN(page) < startptr)
>>    {
>>        ... verify the page checksum
>>    }
>>
>> Obviously, pg_verify_checksums can't do that easily because it's
>> supposed to run from outside the database instance.
>
> It reads pg_control anyway, so couldn't we just take
> ControlFile->checkPoint?
>
> Other than that, basebackup.c seems to only look at pages which haven't
> been changed since the backup starting checkpoint (see above if
> statement). That's reasonable for backups, but is it just as reasonable
> for online verification?
>

I suppose we might refresh the checkpoint LSN regularly, and use the
most recent one. On large/busy databases that would allow checking
larger part of the database.

>> But the startptr detail is pretty important because it supports this
>> retry reasoning:
>>
>>     /*
>>      * Retry the block on the first failure.  It's
>>      * possible that we read the first 4K page of the
>>      * block just before postgres updated the entire block
>>      * so it ends up looking torn to us.  We only need to
>>      * retry once because the LSN should be updated to
>>      * something we can ignore on the next pass.  If the
>>      * error happens again then it is a true validation
>>      * failure.
>>      */
>>
>> Imagine the 8kB page as two 4kB pages, with the initial state being
>> [A1,A2] and another process over-writing it with [B1,B2]. If you read
>> the 8kB page, what states can you see?
>>
>> I don't think POSIX provides any guarantees about atomicity of the write
>> calls (and even if it does, the filesystems on Linux don't seem to). So
>> you may observe either [A1,B2] or [B1,A2], or various inconsistent mixes
>> of the two versions, depending on timing. Well, torn pages ...
>>
>> Pretty much the only thing you can rely on is that when one process does
>>
>>     write([B1,B2])
>>
>> the other process may first read [A1,B2], but the next read will return
>> [B1,B2] (or possibly newer data, if there was another write). It will
>> not read the "stale" A1 again.
>>
>> The basebackup relies on this kinda implicitly - on the retry it'll
>> notice the LSN changed (thanks to the startptr check), and the page will
>> be skipped entirely. This is pretty important, because the new page
>> might be torn in some other way.
>>
>> The pg_verify_checksum apparently ignores this skip logic, because on
>> the retry it simply re-reads the page again, verifies the checksum and
>> reports an error. Which is broken, because the newly read page might be
>> torn again due to a concurrent write.
>
> Well, ok.
>  
>> So IMHO this should do something similar to basebackup - check the page
>> LSN, and if it changed then skip the page.
>>
>> I'm afraid this requires using the last checkpoint LSN, the way startptr
>> is used in basebackup. In particular we can't simply remember LSN from
>> the first read, because we might actually read [B1,A2] on the first try,
>> and then [B1,B2] or [B1,C2] on the retry. (Actually, the page may be
>> torn in various other ways, not necessarily at the 4kB boundary - it
>> might be torn right after the LSN, for example).
>
> I'd prefer to come up with a plan where we don't just give up once we
> see a new LSN, if possible. If I run a modified pg_verify_checksums
> which skips on newer pages in a tight benchmark, basically everything
> gets skipped as checkpoints don't happen often enough.
>

But in that case the checksums are verified when reading the buffer into
shared buffers, so it's not like we wouldn't notice the checksum error
at all. We are interested in the pages that have not been read/written
for an extended period of time. So I think this is not a problem.

> So how about we do check every page, but if one fails on retry, and the
> LSN is newer than the checkpoint, we then skip it? Is that logic sound?
>

Hmmm, maybe.

> In any case, if we decide we really should skip the page if it is newer
> than the checkpoint, I think it makes sense to track those skipped pages
> and print their number out at the end, if there are any.
>

I agree it might be useful to know how many pages were skipped, and how
many actually passed the checksum check.

>> FWIW I also don't understand the purpose of pg_sleep(), it does not seem
>> to protect against anything, really.
>
> Well, I've noticed that without it I get sporadic checksum failures on
> reread, so I've added it to make them go away. It was certainly a
> phenomenological decision that I am happy to trade for a better one.
>

My guess is this happened because both the read and re-read completed
during the same write.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Online verification of checksums

Stephen Frost
In reply to this post by Tomas Vondra-4
Greetings,

* Tomas Vondra ([hidden email]) wrote:

> On 09/17/2018 04:46 PM, Stephen Frost wrote:
> > * Michael Banck ([hidden email]) wrote:
> >> On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote:
> >>> Obviously, pg_verify_checksums can't do that easily because it's
> >>> supposed to run from outside the database instance.
> >>
> >> It reads pg_control anyway, so couldn't we just take
> >> ControlFile->checkPoint?
> >>
> >> Other than that, basebackup.c seems to only look at pages which haven't
> >> been changed since the backup starting checkpoint (see above if
> >> statement). That's reasonable for backups, but is it just as reasonable
> >> for online verification?
> >
> > Right, basebackup doesn't need to look at other pages.
> >
> >>> The pg_verify_checksum apparently ignores this skip logic, because on
> >>> the retry it simply re-reads the page again, verifies the checksum and
> >>> reports an error. Which is broken, because the newly read page might be
> >>> torn again due to a concurrent write.
> >>
> >> Well, ok.
> >
> > The newly read page will have an updated LSN though then on the re-read,
> > in which case basebackup can know that what happened was a rewrite of
> > the page and it no longer has to care about the page and can skip it.
> >
> > I haven't looked, but if basebackup isn't checking the LSN again for the
> > newly read page then that'd be broken, but I believe it does (at least,
> > that's the algorithm we came up with for pgBackRest, and I know David
> > shared that when the basebackup code was being written).
>
> Yes, basebackup does check the LSN on re-read, and skips the page if it
> changed on re-read (because it eliminates the consistency guarantees
> provided by the checkpoint).

Ok, good, though I'm not sure what you mean by 'eliminates the
consistency guarantees provided by the checkpoint'.  The point is that
the page will be in the WAL and the WAL will be replayed during the
restore of the backup.

> >> So how about we do check every page, but if one fails on retry, and the
> >> LSN is newer than the checkpoint, we then skip it? Is that logic sound?
> >
> > I thought that's what basebackup did- if it doesn't do that today, then
> > it really should.
>
> The crucial distinction here is that the trick is not in comparing LSNs
> from the two page reads, but comparing it to the checkpoint LSN. If it's
> greater, the page may be torn or broken, and there's no way to know
> which case it is - so basebackup simply skips it.

Sure, because we don't care about it any longer- that page isn't
interesting because the WAL will replay over it.  IIRC it actually goes
something like: check the checksum, if it failed then check if the LSN
is greater than the checkpoint (of the backup start..), if not, then
re-read, if the LSN is now newer than the checkpoint then skip, if the
LSN is the same then throw an error.

> >> In any case, if we decide we really should skip the page if it is newer
> >> than the checkpoint, I think it makes sense to track those skipped pages
> >> and print their number out at the end, if there are any.
> >
> > Not sure what the point of this is.  If we wanted to really do something
> > to cross-check here, we'd track the pages that were skipped and then
> > look through the WAL to make sure that they're there.  That's something
> > we've talked about doing with pgBackRest, but don't currently.
>
> I agree simply printing the page numbers seems rather useless. What we
> could do is remember which pages we skipped and then try checking them
> after another checkpoint. Or something like that.
I'm still not sure I'm seeing the point of that.  They're still going to
almost certainly be in the kernel cache.  The reason for checking
against the WAL would be to detect errors in PG where we aren't putting
a page into the WAL when it really should be, or something similar,
which seems like it at least could be useful.

Maybe to put it another way- there's very little point in checking the
checksum of a page which we know must be re-written during recovery to
get to a consistent point.  I don't think it hurts in the general case,
but I wouldn't write a lot of code which then needs to be tested to
handle it.  I also don't think that we really need to make
pg_verify_checksum spend lots of extra cycles trying to verify that
*every* page had its checksum validated when we know that lots of pages
are going to be in memory marked dirty and our checking of them will be
ultimately pointless as they'll either be written out by the
checkpointer or some other process, or we'll replay them from the WAL if
we crash.

> >>> FWIW I also don't understand the purpose of pg_sleep(), it does not seem
> >>> to protect against anything, really.
> >>
> >> Well, I've noticed that without it I get sporadic checksum failures on
> >> reread, so I've added it to make them go away. It was certainly a
> >> phenomenological decision that I am happy to trade for a better one.
> >
> > That then sounds like we really aren't re-checking the LSN, and we
> > really should be, to avoid getting these sporadic checksum failures on
> > reread..
>
> Again, it's not enough to check the LSN against the preceding read. We
> need a checkpoint LSN or something like that.
I actually tend to disagree with you that, for this purpose, it's
actually necessary to check against the checkpoint LSN- if the LSN
changed and everything is operating correctly then the new LSN must be
more recent than the last checkpoint location or things are broken
badly.

Now, that said, I do think it's a good *idea* to check against the
checkpoint LSN (presuming this is for online checking of checksums- for
basebackup, we could just check against the backup-start LSN as anything
after that point will be rewritten by WAL anyway).  The reason that I
think it's a good idea to check against the checkpoint LSN is that we'd
want to throw a big warning if the kernel is just feeding us random
garbage on reads and only finding a difference between two reads isn't
really doing any kind of validation, whereas checking against the
checkpoint-LSN would at least give us some idea that the value being
read isn't completely ridiculous.

When it comes to if the pg_sleep() is necessary or not, I have to admit
to being unsure about that..  I could see how it might be but it seems a
bit surprising- I'd probably want to see exactly what the page was at
the time of the failure and at the time of the second (no-sleep) re-read
and then after a delay and convince myself that it was just an unlucky
case of being scheduled in twice to read that page before the process
writing it out got a chance to finish the write.

Thanks!

Stephen



Re: Online verification of checksums

Tomas Vondra-4
Hi,

On 09/17/2018 06:42 PM, Stephen Frost wrote:

> Greetings,
>
> * Tomas Vondra ([hidden email]) wrote:
>> On 09/17/2018 04:46 PM, Stephen Frost wrote:
>>> * Michael Banck ([hidden email]) wrote:
>>>> On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote:
>>>>> Obviously, pg_verify_checksums can't do that easily because it's
>>>>> supposed to run from outside the database instance.
>>>>
>>>> It reads pg_control anyway, so couldn't we just take
>>>> ControlFile->checkPoint?
>>>>
>>>> Other than that, basebackup.c seems to only look at pages which haven't
>>>> been changed since the backup starting checkpoint (see above if
>>>> statement). That's reasonable for backups, but is it just as reasonable
>>>> for online verification?
>>>
>>> Right, basebackup doesn't need to look at other pages.
>>>
>>>>> The pg_verify_checksum apparently ignores this skip logic, because on
>>>>> the retry it simply re-reads the page again, verifies the checksum and
>>>>> reports an error. Which is broken, because the newly read page might be
>>>>> torn again due to a concurrent write.
>>>>
>>>> Well, ok.
>>>
>>> The newly read page will have an updated LSN though then on the re-read,
>>> in which case basebackup can know that what happened was a rewrite of
>>> the page and it no longer has to care about the page and can skip it.
>>>
>>> I haven't looked, but if basebackup isn't checking the LSN again for the
>>> newly read page then that'd be broken, but I believe it does (at least,
>>> that's the algorithm we came up with for pgBackRest, and I know David
>>> shared that when the basebackup code was being written).
>>
>> Yes, basebackup does check the LSN on re-read, and skips the page if it
>> changed on re-read (because it eliminates the consistency guarantees
>> provided by the checkpoint).
>
> Ok, good, though I'm not sure what you mean by 'eliminates the
> consistency guarantees provided by the checkpoint'.  The point is that
> the page will be in the WAL and the WAL will be replayed during the
> restore of the backup.
>

The checkpoint guarantees that the whole page was written and flushed to
disk with an LSN before the checkpoint LSN. So when you read a page with
that LSN, you know the whole write already completed and a read won't
return data from before the LSN.

Without the checkpoint that's not guaranteed, and simply re-reading the
page and rechecking it vs. the first read does not help:

1) write the first 512B of the page (sector), which includes the LSN

2) read the whole page, which will be a mix [new 512B, ... old ... ]

3) the checksum verification fails

4) read the page again (possibly reading a bit more new data)

5) the LSN did not change compared to the first read, yet the checksum
still fails
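
A toy model of that sequence makes the failure mode concrete (sector sizes and names here are purely illustrative): comparing the two reads' LSNs proves nothing, while comparing against the checkpoint LSN does flag the page as concurrently written.

```python
# A page is modeled as a list of sectors; the first sector carries the LSN.
SECTORS = 16  # e.g. an 8kB page split into 512B sectors

def torn_read(new_page, old_page, sectors_written):
    # A read that sees the first `sectors_written` sectors from the new
    # version of the page and the remainder from the old one.
    return new_page[:sectors_written] + old_page[sectors_written:]

old_lsn, new_lsn = 100, 250
old = [("old", old_lsn)] + [("old", None)] * (SECTORS - 1)
new = [("new", new_lsn)] + [("new", None)] * (SECTORS - 1)

first  = torn_read(new, old, 1)   # only the LSN sector is new yet
second = torn_read(new, old, 3)   # a bit more new data has arrived

lsn = lambda page: page[0][1]

# The LSN did not change between the two reads, so re-reading and
# comparing LSNs cannot tell this torn page from real corruption ...
assert lsn(first) == lsn(second) == new_lsn
# ... but the page LSN is past the checkpoint LSN, so a checkpoint-LSN
# comparison correctly classifies it as a concurrent write to skip.
checkpoint_lsn = 200
assert lsn(second) > checkpoint_lsn
```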


>>>> So how about we do check every page, but if one fails on retry, and the
>>>> LSN is newer than the checkpoint, we then skip it? Is that logic sound?
>>>
>>> I thought that's what basebackup did- if it doesn't do that today, then
>>> it really should.
>>
>> The crucial distinction here is that the trick is not in comparing LSNs
>> from the two page reads, but comparing it to the checkpoint LSN. If it's
>> greater, the page may be torn or broken, and there's no way to know
>> which case it is - so basebackup simply skips it.
>
> Sure, because we don't care about it any longer- that page isn't
> interesting because the WAL will replay over it.  IIRC it actually goes
> something like: check the checksum, if it failed then check if the LSN
> is greater than the checkpoint (of the backup start..), if not, then
> re-read, if the LSN is now newer than the checkpoint then skip, if the
> LSN is the same then throw an error.
>

Nope, we only verify the checksum if its LSN precedes the checkpoint:

https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454
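
Boiled down to a predicate, the condition at that line amounts to something like this (a sketch, not the actual C):

```python
def should_verify(page_lsn, checkpoint_lsn):
    # basebackup only verifies checksums of pages whose LSN precedes the
    # backup's starting checkpoint; anything newer will be overwritten by
    # WAL replay during restore anyway, so it is not checked at all.
    return page_lsn < checkpoint_lsn

print(should_verify(100, 200))  # True
```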

>>>> In any case, if we decide we really should skip the page if it is newer
>>>> than the checkpoint, I think it makes sense to track those skipped pages
>>>> and print their number out at the end, if there are any.
>>>
>>> Not sure what the point of this is.  If we wanted to really do something
>>> to cross-check here, we'd track the pages that were skipped and then
>>> look through the WAL to make sure that they're there.  That's something
>>> we've talked about doing with pgBackRest, but don't currently.
>>
>> I agree simply printing the page numbers seems rather useless. What we
>> could do is remember which pages we skipped and then try checking them
>> after another checkpoint. Or something like that.
>
> I'm still not sure I'm seeing the point of that.  They're still going to
> almost certainly be in the kernel cache.  The reason for checking
> against the WAL would be to detect errors in PG where we aren't putting
> a page into the WAL when it really should be, or something similar,
> which seems like it at least could be useful.
>
> Maybe to put it another way- there's very little point in checking the
> checksum of a page which we know must be re-written during recovery to
> get to a consistent point.  I don't think it hurts in the general case,
> but I wouldn't write a lot of code which then needs to be tested to
> handle it.  I also don't think that we really need to make
> pg_verify_checksum spend lots of extra cycles trying to verify that
> *every* page had its checksum validated when we know that lots of pages
> are going to be in memory marked dirty and our checking of them will be
> ultimately pointless as they'll either be written out by the
> checkpointer or some other process, or we'll replay them from the WAL if
> we crash.
>

Yeah, I agree.

>>>>> FWIW I also don't understand the purpose of pg_sleep(), it does not seem
>>>>> to protect against anything, really.
>>>>
>>>> Well, I've noticed that without it I get sporadic checksum failures on
>>>> reread, so I've added it to make them go away. It was certainly a
>>>> phenomenological decision that I am happy to trade for a better one.
>>>
>>> That then sounds like we really aren't re-checking the LSN, and we
>>> really should be, to avoid getting these sporadic checksum failures on
>>> reread..
>>
>> Again, it's not enough to check the LSN against the preceding read. We
>> need a checkpoint LSN or something like that.
>
> I actually tend to disagree with you that, for this purpose, it's
> actually necessary to check against the checkpoint LSN- if the LSN
> changed and everything is operating correctly then the new LSN must be
> more recent than the last checkpoint location or things are broken
> badly.
>

I don't follow. Are you suggesting we don't need the checkpoint LSN?

I'm pretty sure that's not the case. The thing is - the LSN may not
change between the two reads, but that's not a guarantee the page was
not torn. The example I posted earlier in this message illustrates that.

> Now, that said, I do think it's a good *idea* to check against the
> checkpoint LSN (presuming this is for online checking of checksums- for
> basebackup, we could just check against the backup-start LSN as anything
> after that point will be rewritten by WAL anyway).  The reason that I
> think it's a good idea to check against the checkpoint LSN is that we'd
> want to throw a big warning if the kernel is just feeding us random
> garbage on reads and only finding a difference between two reads isn't
> really doing any kind of validation, whereas checking against the
> checkpoint-LSN would at least give us some idea that the value being
> read isn't completely ridiculous.
>
> When it comes to if the pg_sleep() is necessary or not, I have to admit
> to being unsure about that..  I could see how it might be but it seems a
> bit surprising- I'd probably want to see exactly what the page was at
> the time of the failure and at the time of the second (no-sleep) re-read
> and then after a delay and convince myself that it was just an unlucky
> case of being scheduled in twice to read that page before the process
> writing it out got a chance to finish the write.
>

I think the pg_sleep() is a pretty strong sign there's something broken.
At the very least, it's likely to misbehave on machines with different
timings, machines under memory pressure, etc.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Online verification of checksums

Stephen Frost
Greetings,

* Tomas Vondra ([hidden email]) wrote:

> On 09/17/2018 06:42 PM, Stephen Frost wrote:
> > Ok, good, though I'm not sure what you mean by 'eliminates the
> > consistency guarantees provided by the checkpoint'.  The point is that
> > the page will be in the WAL and the WAL will be replayed during the
> > restore of the backup.
>
> The checkpoint guarantees that the whole page was written and flushed to
> disk with an LSN before the checkpoint LSN. So when you read a page with
> that LSN, you know the whole write already completed and a read won't
> return data from before the LSN.
Well, you know that the first part was written out at some prior point,
but you could end up reading the first part of a page with an older LSN
while also reading the second part with new data.

> Without the checkpoint that's not guaranteed, and simply re-reading the
> page and rechecking it vs. the first read does not help:
>
> 1) write the first 512B of the page (sector), which includes the LSN
>
> 2) read the whole page, which will be a mix [new 512B, ... old ... ]
>
> 3) the checksum verification fails
>
> 4) read the page again (possibly reading a bit more new data)
>
> 5) the LSN did not change compared to the first read, yet the checksum
> still fails
So, I agree with all of the above though I've found it to be extremely
rare to get a single read which you've managed to catch part-way through
a write, getting multiple of them over a period of time strikes me as
even more unlikely.  Still, if we can come up with a solution to solve
all of this, great, but I'm not sure that I'm hearing one.

> > Sure, because we don't care about it any longer- that page isn't
> > interesting because the WAL will replay over it.  IIRC it actually goes
> > something like: check the checksum, if it failed then check if the LSN
> > is greater than the checkpoint (of the backup start..), if not, then
> > re-read, if the LSN is now newer than the checkpoint then skip, if the
> > LSN is the same then throw an error.
>
> Nope, we only verify the checksum if its LSN precedes the checkpoint:
>
> https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454
That seems like it's leaving something on the table, but, to be fair, we
know that all of those pages should be rewritten by WAL anyway so they
aren't all that interesting to us, particularly in the basebackup case.

> > I actually tend to disagree with you that, for this purpose, it's
> > actually necessary to check against the checkpoint LSN- if the LSN
> > changed and everything is operating correctly then the new LSN must be
> > more recent than the last checkpoint location or things are broken
> > badly.
>
> I don't follow. Are you suggesting we don't need the checkpoint LSN?
>
> I'm pretty sure that's not the case. The thing is - the LSN may not
> change between the two reads, but that's not a guarantee the page was
> not torn. The example I posted earlier in this message illustrates that.
I agree that there's some risk there, but it's certainly much less
likely.

> > Now, that said, I do think it's a good *idea* to check against the
> > checkpoint LSN (presuming this is for online checking of checksums- for
> > basebackup, we could just check against the backup-start LSN as anything
> > after that point will be rewritten by WAL anyway).  The reason that I
> > think it's a good idea to check against the checkpoint LSN is that we'd
> > want to throw a big warning if the kernel is just feeding us random
> > garbage on reads and only finding a difference between two reads isn't
> > really doing any kind of validation, whereas checking against the
> > checkpoint-LSN would at least give us some idea that the value being
> > read isn't completely ridiculous.
> >
> > When it comes to if the pg_sleep() is necessary or not, I have to admit
> > to being unsure about that..  I could see how it might be but it seems a
> > bit surprising- I'd probably want to see exactly what the page was at
> > the time of the failure and at the time of the second (no-sleep) re-read
> > and then after a delay and convince myself that it was just an unlucky
> > case of being scheduled in twice to read that page before the process
> > writing it out got a chance to finish the write.
>
> I think the pg_sleep() is a pretty strong sign there's something broken.
> At the very least, it's likely to misbehave on machines with different
> timings, machines under memory pressure, etc.
If we assume that what you've outlined above is a serious enough issue
that we have to address it, and do so without a pg_sleep(), then I think
we have to bake into this a way for the process to check with PG as to
what the page's current LSN is, in shared buffers, because that's the
only place where we've got the locking required to ensure that we don't
end up with a read of a partially written page, and I'm really not
entirely convinced that we need to go to that level.  It'd certainly add
a huge amount of additional complexity for what appears to be a quite
unlikely gain.

I'll chat w/ David shortly about this again though and get his thoughts
on it.  This is certainly an area we've spent time thinking about but
are obviously also open to finding a better solution.

Thanks!

Stephen


Re: Online verification of checksums

Tomas Vondra-4
On 09/17/2018 07:11 PM, Stephen Frost wrote:

> Greetings,
>
> * Tomas Vondra ([hidden email]) wrote:
>> On 09/17/2018 06:42 PM, Stephen Frost wrote:
>>> Ok, good, though I'm not sure what you mean by 'eliminates the
>>> consistency guarantees provided by the checkpoint'.  The point is that
>>> the page will be in the WAL and the WAL will be replayed during the
>>> restore of the backup.
>>
>> The checkpoint guarantees that the whole page was written and flushed to
>> disk with an LSN before the checkpoint LSN. So when you read a page with
>> that LSN, you know the whole write already completed and a read won't
>> return data from before the LSN.
>
> Well, you know that the first part was written out at some prior point,
> but you could end up reading the first part of a page with an older LSN
> while also reading the second part with new data.
>

Doesn't the checkpoint fsync pretty much guarantee this can't happen?

>> Without the checkpoint that's not guaranteed, and simply re-reading the
>> page and rechecking it vs. the first read does not help:
>>
>> 1) write the first 512B of the page (sector), which includes the LSN
>>
>> 2) read the whole page, which will be a mix [new 512B, ... old ... ]
>>
>> 3) the checksum verification fails
>>
>> 4) read the page again (possibly reading a bit more new data)
>>
>> 5) the LSN did not change compared to the first read, yet the checksum
>> still fails
>
> So, I agree with all of the above though I've found it to be extremely
> rare to get a single read which you've managed to catch part-way through
> a write, getting multiple of them over a period of time strikes me as
> even more unlikely.  Still, if we can come up with a solution to solve
> all of this, great, but I'm not sure that I'm hearing one.
>

I don't recall claiming catching many such torn pages - I'm sure it's
not very common in most workloads. But I suspect constructing workloads
hitting them regularly is not very difficult either (something with a
lot of churn in shared buffers should do the trick).

>>> Sure, because we don't care about it any longer- that page isn't
>>> interesting because the WAL will replay over it.  IIRC it actually goes
>>> something like: check the checksum, if it failed then check if the LSN
>>> is greater than the checkpoint (of the backup start..), if not, then
>>> re-read, if the LSN is now newer than the checkpoint then skip, if the
>>> LSN is the same then throw an error.
>>
>> Nope, we only verify the checksum if its LSN precedes the checkpoint:
>>
>> https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454
>
> That seems like it's leaving something on the table, but, to be fair, we
> know that all of those pages should be rewritten by WAL anyway so they
> aren't all that interesting to us, particularly in the basebackup case.
>

Yep.

>>> I actually tend to disagree with you that, for this purpose, it's
>>> actually necessary to check against the checkpoint LSN- if the LSN
>>> changed and everything is operating correctly then the new LSN must be
>>> more recent than the last checkpoint location or things are broken
>>> badly.
>>
>> I don't follow. Are you suggesting we don't need the checkpoint LSN?
>>
>> I'm pretty sure that's not the case. The thing is - the LSN may not
>> change between the two reads, but that's not a guarantee the page was
>> not torn. The example I posted earlier in this message illustrates that.
>
> I agree that there's some risk there, but it's certainly much less
> likely.
>

Well. If we're going to report a checksum failure, we better be sure it
actually is a broken page. I don't want users to start chasing bogus
data corruption issues.

>>> Now, that said, I do think it's a good *idea* to check against the
>>> checkpoint LSN (presuming this is for online checking of checksums- for
>>> basebackup, we could just check against the backup-start LSN as anything
>>> after that point will be rewritten by WAL anyway).  The reason that I
>>> think it's a good idea to check against the checkpoint LSN is that we'd
>>> want to throw a big warning if the kernel is just feeding us random
>>> garbage on reads and only finding a difference between two reads isn't
>>> really doing any kind of validation, whereas checking against the
>>> checkpoint-LSN would at least give us some idea that the value being
>>> read isn't completely ridiculous.
>>>
>>> When it comes to if the pg_sleep() is necessary or not, I have to admit
>>> to being unsure about that..  I could see how it might be but it seems a
>>> bit surprising- I'd probably want to see exactly what the page was at
>>> the time of the failure and at the time of the second (no-sleep) re-read
>>> and then after a delay and convince myself that it was just an unlucky
>>> case of being scheduled in twice to read that page before the process
>>> writing it out got a chance to finish the write.
>>
>> I think the pg_sleep() is a pretty strong sign there's something broken.
>> At the very least, it's likely to misbehave on machines with different
>> timings, machines under memory pressure, etc.
>
> If we assume that what you've outlined above is a serious enough issue
> that we have to address it, and do so without a pg_sleep(), then I think
> we have to bake into this a way for the process to check with PG as to
> what the page's current LSN is, in shared buffers, because that's the
> only place where we've got the locking required to ensure that we don't
> end up with a read of a partially written page, and I'm really not
> entirely convinced that we need to go to that level.  It'd certainly add
> a huge amount of additional complexity for what appears to be a quite
> unlikely gain.
>
> I'll chat w/ David shortly about this again though and get his thoughts
> on it.  This is certainly an area we've spent time thinking about but
> are obviously also open to finding a better solution.
>

Why not simply look at the last checkpoint LSN and use that the same
way basebackup does? AFAICS that should make the pg_sleep() unnecessary.
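
As noted upthread, pg_verify_checksums already reads pg_control, so it could take ControlFile->checkPoint directly; from outside a running program, the same value is visible in pg_controldata's "Latest checkpoint location" line. A rough sketch of extracting it from that text output (the sample value below is illustrative):

```python
import re

def checkpoint_lsn_from_controldata(text):
    # Pull "Latest checkpoint location" out of pg_controldata output and
    # convert the XXX/XXXXXXXX LSN notation into a comparable integer.
    m = re.search(r"Latest checkpoint location:\s*([0-9A-F]+)/([0-9A-F]+)",
                  text)
    if not m:
        raise ValueError("no checkpoint location found")
    hi, lo = (int(x, 16) for x in m.groups())
    return (hi << 32) | lo

sample = "Latest checkpoint location:           0/22000060"
print(hex(checkpoint_lsn_from_controldata(sample)))  # 0x22000060
```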

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Online verification of checksums

Stephen Frost
Greetings,

On Mon, Sep 17, 2018 at 13:20 Tomas Vondra <[hidden email]> wrote:
On 09/17/2018 07:11 PM, Stephen Frost wrote:
> Greetings,
>
> * Tomas Vondra ([hidden email]) wrote:
>> On 09/17/2018 06:42 PM, Stephen Frost wrote:
>>> Ok, good, though I'm not sure what you mean by 'eliminates the
>>> consistency guarantees provided by the checkpoint'.  The point is that
>>> the page will be in the WAL and the WAL will be replayed during the
>>> restore of the backup.
>>
>> The checkpoint guarantees that the whole page was written and flushed to
>> disk with an LSN before the checkpoint LSN. So when you read a page with
>> that LSN, you know the whole write already completed and a read won't
>> return data from before the LSN.
>
> Well, you know that the first part was written out at some prior point,
> but you could end up reading the first part of a page with an older LSN
> while also reading the second part with new data.


Doesn't the checkpoint fsync pretty much guarantee this can't happen?

How? Either it’s possible for the latter half of a page to be updated before the first half (where the LSN lives), or it isn’t. If it’s possible then that LSN could be ancient and it wouldn’t matter. 

>> Without the checkpoint that's not guaranteed, and simply re-reading the
>> page and rechecking it vs. the first read does not help:
>>
>> 1) write the first 512B of the page (sector), which includes the LSN
>>
>> 2) read the whole page, which will be a mix [new 512B, ... old ... ]
>>
>> 3) the checksum verification fails
>>
>> 4) read the page again (possibly reading a bit more new data)
>>
>> 5) the LSN did not change compared to the first read, yet the checksum
>> still fails
>
> So, I agree with all of the above though I've found it to be extremely
> rare to get a single read which you've managed to catch part-way through
> a write, getting multiple of them over a period of time strikes me as
> even more unlikely.  Still, if we can come up with a solution to solve
> all of this, great, but I'm not sure that I'm hearing one.

I don't recall claiming catching many such torn pages - I'm sure it's
not very common in most workloads. But I suspect constructing workloads
hitting them regularly is not very difficult either (something with a
lot of churn in shared buffers should do the trick).

The question is whether it’s possible to catch a torn page where the second half is updated *before* the first half of the page in a read, and then have that state be maintained across subsequent reads.  I have some skepticism that it’s really possible in the first place- having an interrupted system call stay stalled across two more system calls just seems terribly unlikely, and this is all based on the assumption that the kernel might write the second half of a write to the kernel cache before the first.

>>> Sure, because we don't care about it any longer- that page isn't
>>> interesting because the WAL will replay over it.  IIRC it actually goes
>>> something like: check the checksum, if it failed then check if the LSN
>>> is greater than the checkpoint (of the backup start..), if not, then
>>> re-read, if the LSN is now newer than the checkpoint then skip, if the
>>> LSN is the same then throw an error.
>>
>> Nope, we only verify the checksum if its LSN precedes the checkpoint:
>>
>> https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454
>
> That seems like it's leaving something on the table, but, to be fair, we
> know that all of those pages should be rewritten by WAL anyway so they
> aren't all that interesting to us, particularly in the basebackup case.
>

Yep.

>>> I actually tend to disagree with you that, for this purpose, it's
>>> actually necessary to check against the checkpoint LSN- if the LSN
>>> changed and everything is operating correctly then the new LSN must be
>>> more recent than the last checkpoint location or things are broken
>>> badly.
>>
>> I don't follow. Are you suggesting we don't need the checkpoint LSN?
>>
>> I'm pretty sure that's not the case. The thing is - the LSN may not
>> change between the two reads, but that's not a guarantee the page was
>> not torn. The example I posted earlier in this message illustrates that.
>
> I agree that there's some risk there, but it's certainly much less
> likely.
>

Well. If we're going to report a checksum failure, we better be sure it
actually is a broken page. I don't want users to start chasing bogus
data corruption issues.

Yes, I definitely agree that we don’t want to mis-report checksum failures if we can avoid it. 

>>> Now, that said, I do think it's a good *idea* to check against the
>>> checkpoint LSN (presuming this is for online checking of checksums- for
>>> basebackup, we could just check against the backup-start LSN as anything
>>> after that point will be rewritten by WAL anyway).  The reason that I
>>> think it's a good idea to check against the checkpoint LSN is that we'd
>>> want to throw a big warning if the kernel is just feeding us random
>>> garbage on reads and only finding a difference between two reads isn't
>>> really doing any kind of validation, whereas checking against the
>>> checkpoint-LSN would at least give us some idea that the value being
>>> read isn't completely ridiculous.
>>>
>>> When it comes to if the pg_sleep() is necessary or not, I have to admit
>>> to being unsure about that..  I could see how it might be but it seems a
>>> bit surprising- I'd probably want to see exactly what the page was at
>>> the time of the failure and at the time of the second (no-sleep) re-read
>>> and then after a delay and convince myself that it was just an unlucky
>>> case of being scheduled in twice to read that page before the process
>>> writing it out got a chance to finish the write.
>>
>> I think the pg_sleep() is a pretty strong sign there's something broken.
>> At the very least, it's likely to misbehave on machines with different
>> timings, machines under memory pressure, etc.
>
> If we assume that what you've outlined above is a serious enough issue
> that we have to address it, and do so without a pg_sleep(), then I think
> we have to bake into this a way for the process to check with PG as to
> what the page's current LSN is, in shared buffers, because that's the
> only place where we've got the locking required to ensure that we don't
> end up with a read of a partially written page, and I'm really not
> entirely convinced that we need to go to that level.  It'd certainly add
> a huge amount of additional complexity for what appears to be a quite
> unlikely gain.
>
> I'll chat w/ David shortly about this again though and get his thoughts
> on it.  This is certainly an area we've spent time thinking about but
> are obviously also open to finding a better solution.

Why not simply look at the last checkpoint LSN and use that the same
way basebackup does? AFAICS that should make the pg_sleep() unnecessary.

Use that to compare to what?  The LSN in the first half of the page could be from well before the checkpoint or even the backup started.

Thanks!

Stephen

Re: Online verification of checksums

Michael Banck-2
In reply to this post by Tomas Vondra-4
Hi,

so, trying some intermediate summary here, sorry for (also) top-posting:

1. the basebackup checksum verification logic only checks pages not
changed since the checkpoint, which makes sense for the basebackup. 

2. However, it would be desirable to go further for pg_verify_checksums
and (re-)check all pages.

3. pg_verify_checksums should read the checkpoint LSN on startup and
compare the page LSN against it on re-read, and discard pages which have
checksum failures but are new. (Maybe it should read new checkpoint LSNs
as they come in during its runtime as well? See below).

4. The pg_sleep should go.

5. There seems to be no consensus on whether the number of skipped pages
should be summarized at the end.
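Points 1-4 above boil down to a small decision procedure. Here is a minimal sketch of that logic (in Python, with hypothetical helper names; the real pg_verify_checksums is C code reading relation files directly, and `reread` stands in for re-reading the block from disk):

```python
# Sketch of the proposed online verification decision (hypothetical
# names; a model of the logic, not the actual pg_verify_checksums code).

def verify_page(page_lsn, checksum_ok, reread, checkpoint_lsn):
    """Classify one page as 'ok', 'skipped' (newer than the checkpoint,
    so WAL replay covers it), or 'failed' (genuine checksum failure)."""
    if checksum_ok:
        return "ok"
    if page_lsn > checkpoint_lsn:
        # Page was (re)written after the checkpoint; don't verify it.
        return "skipped"
    # Checksum failed on an "old" page: re-read once and check again.
    page_lsn, checksum_ok = reread()
    if checksum_ok:
        return "ok"
    if page_lsn > checkpoint_lsn:
        # The re-read saw a newer LSN: it was a torn page, skip it.
        return "skipped"
    # Old LSN and a failing checksum on the re-read: report corruption.
    return "failed"
```

The key property is that a page is only ever reported as corrupt when its LSN precedes the checkpoint LSN on both reads and the checksum fails both times.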

Further comments:

Am Montag, den 17.09.2018, 19:19 +0200 schrieb Tomas Vondra:

> On 09/17/2018 07:11 PM, Stephen Frost wrote:
> > * Tomas Vondra ([hidden email]) wrote:
> > > On 09/17/2018 06:42 PM, Stephen Frost wrote:
> > > Without the checkpoint that's not guaranteed, and simply re-reading the
> > > page and rechecking it vs. the first read does not help:
> > >
> > > 1) write the first 512B of the page (sector), which includes the LSN
> > >
> > > 2) read the whole page, which will be a mix [new 512B, ... old ... ]
> > >
> > > 3) the checksum verification fails
> > >
> > > 4) read the page again (possibly reading a bit more new data)
> > >
> > > 5) the LSN did not change compared to the first read, yet the checksum
> > > still fails
> >
> > So, I agree with all of the above though I've found it to be extremely
> > rare to get a single read which you've managed to catch part-way through
> > a write, getting multiple of them over a period of time strikes me as
> > even more unlikely.  Still, if we can come up with a solution to solve
> > all of this, great, but I'm not sure that I'm hearing one.
>
> I don't recall claiming catching many such torn pages - I'm sure it's
> not very common in most workloads. But I suspect constructing workloads
> hitting them regularly is not very difficult either (something with a
> lot of churn in shared buffers should do the trick).
>
> > > > Sure, because we don't care about it any longer- that page isn't
> > > > interesting because the WAL will replay over it.  IIRC it actually goes
> > > > something like: check the checksum, if it failed then check if the LSN
> > > > is greater than the checkpoint (of the backup start..), if not, then
> > > > re-read, if the LSN is now newer than the checkpoint then skip, if the
> > > > LSN is the same then throw an error.
> > >
> > > Nope, we only verify the checksum if its LSN precedes the checkpoint:
> > >
> > > https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454
> >
> > That seems like it's leaving something on the table, but, to be fair, we
> > know that all of those pages should be rewritten by WAL anyway so they
> > aren't all that interesting to us, particularly in the basebackup case.
>
> Yep.

Right, see point 1 above.

> > > > I actually tend to disagree with you that, for this purpose, it's
> > > > actually necessary to check against the checkpoint LSN- if the LSN
> > > > changed and everything is operating correctly then the new LSN must be
> > > > more recent than the last checkpoint location or things are broken
> > > > badly.
> > >
> > > I don't follow. Are you suggesting we don't need the checkpoint LSN?
> > >
> > > I'm pretty sure that's not the case. The thing is - the LSN may not
> > > change between the two reads, but that's not a guarantee the page was
> > > not torn. The example I posted earlier in this message illustrates that.
> >
> > I agree that there's some risk there, but it's certainly much less
> > likely.
>
> Well. If we're going to report a checksum failure, we better be sure it
> actually is a broken page. I don't want users to start chasing bogus
> data corruption issues.

I agree.

> > > > Now, that said, I do think it's a good *idea* to check against the
> > > > checkpoint LSN (presuming this is for online checking of checksums- for
> > > > basebackup, we could just check against the backup-start LSN as anything
> > > > after that point will be rewritten by WAL anyway).  The reason that I
> > > > think it's a good idea to check against the checkpoint LSN is that we'd
> > > > want to throw a big warning if the kernel is just feeding us random
> > > > garbage on reads and only finding a difference between two reads isn't
> > > > really doing any kind of validation, whereas checking against the
> > > > checkpoint-LSN would at least give us some idea that the value being
> > > > read isn't completely ridiculous.

Are you suggesting here that we always check against the current
checkpoint, or is checking against the checkpoint that we saw at startup
enough? I think re-reading pg_control all the time might be more
error-prone than what we would gain from it, so I would prefer not to do
this.

> > > > When it comes to if the pg_sleep() is necessary or not, I have to admit
> > > > to being unsure about that..  I could see how it might be but it seems a
> > > > bit surprising- I'd probably want to see exactly what the page was at
> > > > the time of the failure and at the time of the second (no-sleep) re-read
> > > > and then after a delay and convince myself that it was just an unlucky
> > > > case of being scheduled in twice to read that page before the process
> > > > writing it out got a chance to finish the write.
> > >
> > > I think the pg_sleep() is a pretty strong sign there's something broken.
> > > At the very least, it's likely to misbehave on machines with different
> > > timings, machines under memory and/or memory pressure, etc.

I swapped out the pg_sleep earlier today for the check-against-
checkpoint-LSN-on-reread, and that seems to work just as well, at least
in the tests I ran.

> > If we assume that what you've outlined above is a serious enough issue
> > that we have to address it, and do so without a pg_sleep(), then I think
> > we have to bake into this a way for the process to check with PG as to
> > what the page's current LSN is, in shared buffers, because that's the
> > only place where we've got the locking required to ensure that we don't
> > end up with a read of a partially written page, and I'm really not
> > entirely convinced that we need to go to that level.  It'd certainly add
> > a huge amount of additional complexity for what appears to be a quite
> > unlikely gain.
> >
> > I'll chat w/ David shortly about this again though and get his thoughts
> > on it.  This is certainly an area we've spent time thinking about but
> > are obviously also open to finding a better solution.
>
> Why not simply look at the last checkpoint LSN and use that the same
> way basebackup does? AFAICS that should make the pg_sleep() unnecessary.

Right.


Michael

--
Michael Banck
Project Lead / Senior Consultant
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: [hidden email]

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Management: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Our handling of personal data is governed by
the following policy: https://www.credativ.de/datenschutz


Re: Online verification of checksums

Stephen Frost
Greetings,

On Mon, Sep 17, 2018 at 13:38 Michael Banck <[hidden email]> wrote:
so, trying some intermediate summary here, sorry for (also) top-posting:

1. the basebackup checksum verification logic only checks pages not
changed since the checkpoint, which makes sense for the basebackup. 

Right. I’m tending towards the idea that this should also be adopted for pg_verify_checksums. 

2. However, it would be desirable to go further for pg_verify_checksums
and (re-)check all pages.

Maybe.  I’m not entirely convinced that it’s all that useful. 

3. pg_verify_checksums should read the checkpoint LSN on startup and
compare the page LSN against it on re-read, and discard pages which have
checksum failures but are new. (Maybe it should read new checkpoint LSNs
as they come in during its runtime as well? See below).

I’m not sure that we really need to, and I’m not against it either, but in that case you’re definitely going to see checksum failures on torn pages.

4. The pg_sleep should go.

I know that pgbackrest does not have a sleep currently, and we have not yet seen or been able to reproduce the case where, on a reread, we still see an older LSN (we also check the LSN first). If it’s possible that the LSN still hasn’t changed on the reread, then maybe we do need a sleep to force ourselves off the CPU and allow the other process to finish writing, or maybe we should finish the file and come back around to these pages later. But we have yet to see this behavior in the wild, nor have we been able to reproduce it. 

5. There seems to be no consensus on whether the number of skipped pages
should be summarized at the end.

I agree with printing the number of skipped pages; that does seem like a nice-to-have.  I don’t know that actually printing the pages themselves is all that useful, though. 

Further comments:

Am Montag, den 17.09.2018, 19:19 +0200 schrieb Tomas Vondra:
> On 09/17/2018 07:11 PM, Stephen Frost wrote:
> > * Tomas Vondra ([hidden email]) wrote:
> > > On 09/17/2018 06:42 PM, Stephen Frost wrote:
> > > Without the checkpoint that's not guaranteed, and simply re-reading the
> > > page and rechecking it vs. the first read does not help:
> > >
> > > 1) write the first 512B of the page (sector), which includes the LSN
> > >
> > > 2) read the whole page, which will be a mix [new 512B, ... old ... ]
> > >
> > > 3) the checksum verification fails
> > >
> > > 4) read the page again (possibly reading a bit more new data)
> > >
> > > 5) the LSN did not change compared to the first read, yet the checksum
> > > still fails
> >
> > So, I agree with all of the above though I've found it to be extremely
> > rare to get a single read which you've managed to catch part-way through
> > a write, getting multiple of them over a period of time strikes me as
> > even more unlikely.  Still, if we can come up with a solution to solve
> > all of this, great, but I'm not sure that I'm hearing one.
>
> I don't recall claiming catching many such torn pages - I'm sure it's
> not very common in most workloads. But I suspect constructing workloads
> hitting them regularly is not very difficult either (something with a
> lot of churn in shared buffers should do the trick).
>
> > > > Sure, because we don't care about it any longer- that page isn't
> > > > interesting because the WAL will replay over it.  IIRC it actually goes
> > > > something like: check the checksum, if it failed then check if the LSN
> > > > is greater than the checkpoint (of the backup start..), if not, then
> > > > re-read, if the LSN is now newer than the checkpoint then skip, if the
> > > > LSN is the same then throw an error.
> > >
> > > Nope, we only verify the checksum if its LSN precedes the checkpoint:
> > >
> > > https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454
> >
> > That seems like it's leaving something on the table, but, to be fair, we
> > know that all of those pages should be rewritten by WAL anyway so they
> > aren't all that interesting to us, particularly in the basebackup case.
>
> Yep.

Right, see point 1 above.

> > > > I actually tend to disagree with you that, for this purpose, it's
> > > > actually necessary to check against the checkpoint LSN- if the LSN
> > > > changed and everything is operating correctly then the new LSN must be
> > > > more recent than the last checkpoint location or things are broken
> > > > badly.
> > >
> > > I don't follow. Are you suggesting we don't need the checkpoint LSN?
> > >
> > > I'm pretty sure that's not the case. The thing is - the LSN may not
> > > change between the two reads, but that's not a guarantee the page was
> > > not torn. The example I posted earlier in this message illustrates that.
> >
> > I agree that there's some risk there, but it's certainly much less
> > likely.
>
> Well. If we're going to report a checksum failure, we better be sure it
> actually is a broken page. I don't want users to start chasing bogus
> data corruption issues.

I agree.

> > > > Now, that said, I do think it's a good *idea* to check against the
> > > > checkpoint LSN (presuming this is for online checking of checksums- for
> > > > basebackup, we could just check against the backup-start LSN as anything
> > > > after that point will be rewritten by WAL anyway).  The reason that I
> > > > think it's a good idea to check against the checkpoint LSN is that we'd
> > > > want to throw a big warning if the kernel is just feeding us random
> > > > garbage on reads and only finding a difference between two reads isn't
> > > > really doing any kind of validation, whereas checking against the
> > > > checkpoint-LSN would at least give us some idea that the value being
> > > > read isn't completely ridiculous.

Are you suggesting here that we always check against the current
checkpoint, or is checking against the checkpoint that we saw at startup
enough? I think re-reading pg_control all the time might be more
error-prone than what we would gain from it, so I would prefer not to do
this.

I don’t follow why rereading pg_control would be error-prone.  That said, I don’t have a particularly strong opinion either way on this. 

> > > > When it comes to if the pg_sleep() is necessary or not, I have to admit
> > > > to being unsure about that..  I could see how it might be but it seems a
> > > > bit surprising- I'd probably want to see exactly what the page was at
> > > > the time of the failure and at the time of the second (no-sleep) re-read
> > > > and then after a delay and convince myself that it was just an unlucky
> > > > case of being scheduled in twice to read that page before the process
> > > > writing it out got a chance to finish the write.
> > >
> > > I think the pg_sleep() is a pretty strong sign there's something broken.
> > > At the very least, it's likely to misbehave on machines with different
> > > timings, machines under memory and/or memory pressure, etc.

I swapped out the pg_sleep earlier today for the check-against-
checkpoint-LSN-on-reread, and that seems to work just as well, at least
in the tests I ran.

Ok, this sounds like you were probably seeing ordinary forward-torn pages, and we have certainly seen that before.

> > If we assume that what you've outlined above is a serious enough issue
> > that we have to address it, and do so without a pg_sleep(), then I think
> > we have to bake into this a way for the process to check with PG as to
> > what the page's current LSN is, in shared buffers, because that's the
> > only place where we've got the locking required to ensure that we don't
> > end up with a read of a partially written page, and I'm really not
> > entirely convinced that we need to go to that level.  It'd certainly add
> > a huge amount of additional complexity for what appears to be a quite
> > unlikely gain.
> >
> > I'll chat w/ David shortly about this again though and get his thoughts
> > on it.  This is certainly an area we've spent time thinking about but
> > are obviously also open to finding a better solution.
>
> Why not to simply look at the last checkpoint LSN and use that the same
> way basebackup does? AFAICS that should make the pg_sleep() unnecessary.

Right.

This is fine if you know the kernel will always write the first part of the page first, or if you accept that a reread of an invalid page will always return a completely updated page.

We’ve made the assumption that a reread, after a failure where the LSN on the first read was older than the backup-start LSN, will give us an updated first half of the page, whose LSN we then check, but we have yet to prove that this is actually guaranteed.

Thanks!

Stephen

Re: Online verification of checksums

Tomas Vondra-4
In reply to this post by Stephen Frost
On 09/17/2018 07:35 PM, Stephen Frost wrote:

> Greetings,
>
> On Mon, Sep 17, 2018 at 13:20 Tomas Vondra <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     On 09/17/2018 07:11 PM, Stephen Frost wrote:
>     > Greetings,
>     >
>     > * Tomas Vondra ([hidden email]
>     <mailto:[hidden email]>) wrote:
>     >> On 09/17/2018 06:42 PM, Stephen Frost wrote:
>     >>> Ok, good, though I'm not sure what you mean by 'eliminates the
>     >>> consistency guarantees provided by the checkpoint'.  The point
>     is that
>     >>> the page will be in the WAL and the WAL will be replayed during the
>     >>> restore of the backup.
>     >>
>     >> The checkpoint guarantees that the whole page was written and
>     flushed to
>     >> disk with an LSN before the ckeckpoint LSN. So when you read a
>     page with
>     >> that LSN, you know the whole write already completed and a read won't
>     >> return data from before the LSN.
>     >
>     > Well, you know that the first part was written out at some prior
>     point,
>     > but you could end up reading the first part of a page with an
>     older LSN
>     > while also reading the second part with new data.
>
>
>
>     Doesn't the checkpoint fsync pretty much guarantee this can't happen?
>
>
> How? Either it’s possible for the latter half of a page to be updated
> before the first half (where the LSN lives), or it isn’t. If it’s
> possible then that LSN could be ancient and it wouldn’t matter. 
>

I'm not sure I understand what you're saying here.

It is not about the latter half of the page being updated before the first
half. I don't think that's quite possible, because write() into the page
cache does in fact write the data sequentially.

The problem is that the write is not atomic, and AFAIK it happens in
sectors (which are either 512B or 4K these days). And it may arbitrarily
interleave with reads.

So you may do write(8k), but it actually happens in 512B chunks and a
concurrent read may observe some mix of those.

But the trick is that if the read sees the effect of the write somewhere
in the middle of the page, the next read is guaranteed to see all the
preceding new data.

Without the checkpoint we risk seeing the same write() both in read and
re-read, just in a different stage - so the LSN would not change, making
the check futile.

But by waiting for the checkpoint we know that the original write is no
longer in progress, so if we saw a partial write we're guaranteed to see
a new LSN on re-read.

This is what I mean by the checkpoint / fsync guarantee.


>     >> Without the checkpoint that's not guaranteed, and simply
>     re-reading the
>     >> page and rechecking it vs. the first read does not help:
>     >>
>     >> 1) write the first 512B of the page (sector), which includes the LSN
>     >>
>     >> 2) read the whole page, which will be a mix [new 512B, ... old ... ]
>     >>
>     >> 3) the checksum verification fails
>     >>
>     >> 4) read the page again (possibly reading a bit more new data)
>     >>
>     >> 5) the LSN did not change compared to the first read, yet the
>     checksum
>     >> still fails
>     >
>     > So, I agree with all of the above though I've found it to be extremely
>     > rare to get a single read which you've managed to catch part-way
>     through
>     > a write, getting multiple of them over a period of time strikes me as
>     > even more unlikely.  Still, if we can come up with a solution to solve
>     > all of this, great, but I'm not sure that I'm hearing one.
>
>
>     I don't recall claiming catching many such torn pages - I'm sure it's
>     not very common in most workloads. But I suspect constructing workloads
>     hitting them regularly is not very difficult either (something with a
>     lot of churn in shared buffers should do the trick).
>
>
> The question is if it’s possible to catch a torn page where the second
> half is updated *before* the first half of the page in a read (and then
> in subsequent reads having that state be maintained).  I have some
> skepticism that it’s really possible to happen in the first place but
> having an interrupted system call be stalled across two more system
> calls just seems terribly unlikely, and this is all based on the
> assumption that the kernel might write the second half of a write before
> the first to the kernel cache in the first place.
>

Yes, if that was possible, the explanation about the checkpoint fsync
guarantee would be bogus, obviously.

I've spent quite a bit of time looking into how write() is handled, and
I believe seeing only the second half is not possible. You may observe a
page torn in various ways (not necessarily in half), e.g.

    [old,new,old]

but then the re-read you should be guaranteed to see new data up until
the last "new" chunk:

    [new,new,old]

At least that's my understanding. I failed to deduce what POSIX says
about this, or how it behaves on various OS/filesystems.

The one thing I've done was writing a simple stress test that writes a
single 8kB page in a loop, reads it concurrently and checks the
behavior. And it seems consistent with my understanding.
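The interleaving argument can also be modeled deterministically (this is a model of the claimed guarantee under the stated assumptions, namely that both the write and the read proceed through the page in order, 512 bytes at a time; it is not a kernel or filesystem test):

```python
# Model: an 8kB write() lands as sixteen 512-byte chunks strictly in
# order, and a concurrent read also copies chunks in order. A single
# read can therefore observe a mix like [old, new, old], but a re-read
# started after the first read finished must see 'new' at least up to
# the last 'new' chunk of the first read.

CHUNKS = 16  # one 8kB page as sixteen 512-byte sectors

def read_page(write_times, start):
    """Sequential read: chunk i is copied at time start + i and shows
    the new data iff the writer had already written it by then."""
    return ["new" if write_times[i] <= start + i else "old"
            for i in range(CHUNKS)]

def check_reread(write_times):
    first = read_page(write_times, start=0)
    second = read_page(write_times, start=CHUNKS)  # begins after 1st ends
    last_new = max((i for i, c in enumerate(first) if c == "new"),
                   default=-1)
    # Everything up to the last chunk the first read saw as new must
    # read as new the second time around.
    assert all(c == "new" for c in second[: last_new + 1])

# A writer timing that produces the [old, new, old, ...] mix from the
# text, plus uniform writers at various speeds and offsets.
torn_writer = [0.4, 0.9] + [5.0 + i for i in range(CHUNKS - 2)]
assert read_page(torn_writer, 0)[:3] == ["old", "new", "old"]
check_reread(torn_writer)
for offset in range(-CHUNKS, CHUNKS):
    for speed in (0, 1, 2, 3):
        check_reread([offset + i * speed for i in range(CHUNKS)])
```

In this model the prefix guarantee holds for any non-decreasing sequence of chunk write times, which matches the observed behavior of the stress test.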

>
>     >>> Now, that said, I do think it's a good *idea* to check against the
>     >>> checkpoint LSN (presuming this is for online checking of
>     checksums- for
>     >>> basebackup, we could just check against the backup-start LSN as
>     anything
>     >>> after that point will be rewritten by WAL anyway).  The reason
>     that I
>     >>> think it's a good idea to check against the checkpoint LSN is
>     that we'd
>     >>> want to throw a big warning if the kernel is just feeding us random
>     >>> garbage on reads and only finding a difference between two reads
>     isn't
>     >>> really doing any kind of validation, whereas checking against the
>     >>> checkpoint-LSN would at least give us some idea that the value being
>     >>> read isn't completely ridiculous.
>     >>>
>     >>> When it comes to if the pg_sleep() is necessary or not, I have
>     to admit
>     >>> to being unsure about that..  I could see how it might be but it
>     seems a
>     >>> bit surprising- I'd probably want to see exactly what the page
>     was at
>     >>> the time of the failure and at the time of the second (no-sleep)
>     re-read
>     >>> and then after a delay and convince myself that it was just an
>     unlucky
>     >>> case of being scheduled in twice to read that page before the
>     process
>     >>> writing it out got a chance to finish the write.
>     >>
>     >> I think the pg_sleep() is a pretty strong sign there's something
>     broken.
>     >> At the very least, it's likely to misbehave on machines with
>     different
>     >> timings, machines under memory and/or memory pressure, etc.
>     >
>     > If we assume that what you've outlined above is a serious enough issue
>     > that we have to address it, and do so without a pg_sleep(), then I
>     think
>     > we have to bake into this a way for the process to check with PG as to
>     > what the page's current LSN is, in shared buffers, because that's the
>     > only place where we've got the locking required to ensure that we
>     don't
>     > end up with a read of a partially written page, and I'm really not
>     > entirely convinced that we need to go to that level.  It'd
>     certainly add
>     > a huge amount of additional complexity for what appears to be a quite
>     > unlikely gain.
>     >
>     > I'll chat w/ David shortly about this again though and get his
>     thoughts
>     > on it.  This is certainly an area we've spent time thinking about but
>     > are obviously also open to finding a better solution.
>
>
>     Why not simply look at the last checkpoint LSN and use that the same
>     way basebackup does? AFAICS that should make the pg_sleep() unnecessary.
>
>
> Use that to compare to what?  The LSN in the first half of the page
> could be from well before the checkpoint or even the backup started.
>

Not sure I follow. If the LSN in the page header is old, and the
checksum check failed, then on re-read we either find a new LSN (in
which case we skip the page) or consider this to be a checksum failure.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Online verification of checksums

Stephen Frost
Greetings,

* Tomas Vondra ([hidden email]) wrote:

> On 09/17/2018 07:35 PM, Stephen Frost wrote:
> > On Mon, Sep 17, 2018 at 13:20 Tomas Vondra <[hidden email]
> > <mailto:[hidden email]>> wrote:
> >     Doesn't the checkpoint fsync pretty much guarantee this can't happen?
> >
> > How? Either it’s possible for the latter half of a page to be updated
> > before the first half (where the LSN lives), or it isn’t. If it’s
> > possible then that LSN could be ancient and it wouldn’t matter. 
>
> I'm not sure I understand what you're saying here.
>
> It is not about the latter half of the page being updated before the first
> half. I don't think that's quite possible, because write() into the page
> cache does in fact write the data sequentially.

Well, maybe 'updated before' wasn't quite the right way to talk about
it, but consider if a read(8K) gets only half-way through the copy
before having to go do something else and by the time it gets back, a
write has come in and rewritten the page, such that the read(8K)
returns half-old and half-new data.

> The problem is that the write is not atomic, and AFAIK it happens in
> sectors (which are either 512B or 4K these days). And it may arbitrarily
> interleave with reads.

Yes, of course the write isn't atomic, that's clear.

> So you may do write(8k), but it actually happens in 512B chunks and a
> concurrent read may observe some mix of those.

Right, I'm not sure that we really need to worry about sub-4K writes
though I suppose they're technically possible, but it doesn't much
matter in this case since the LSN is early on in the page, of course.

> But the trick is that if the read sees the effect of the write somewhere
> in the middle of the page, the next read is guaranteed to see all the
> preceding new data.

If that's guaranteed then we can just check the LSN and be done.

> Without the checkpoint we risk seeing the same write() both in read and
> re-read, just in a different stage - so the LSN would not change, making
> the check futile.

This is the part that isn't making much sense to me.  If we are
guaranteed that writes into the kernel cache are always in order and
always at least 512B in size, then if we check the LSN first and
discover it's "old", and then read the rest of the page and calculate
the checksum, discover it's a bad checksum, and then go back and re-read
the page then we *must* see that the LSN has changed OR conclude that
the checksum is invalidated.

The reason this can happen in the first place is that our 8K read might
only get half-way done before getting scheduled off and a 8K write
happened on the page before our read(8K) gets back to finishing the
read, but if what you're saying is true, then we can't ever have a case
where such a thing would happen and a re-read would still see the "old"
LSN.

If we check the LSN first and discover it's "new" (as in, more recent
than our last checkpoint, or the checkpoint where the backup started)
then, sure, there's going to be a risk that the page is currently being
written right that moment and isn't yet completely valid.

The problem that we aren't solving for is if, somehow, we do a read(8K)
and get the first half/second half mixup and then on a subsequent
read(8K) we see that *again*, implying that somehow the kernel's copy
has the latter-half of the page updated consistently but not the first
half.  That's a problem that I haven't got a solution to today.  I'd
love to have a guarantee that it's not possible- we've certainly never
seen it but it's been a concern and I thought Michael was suggesting
he'd seen that, but it sounds like there wasn't a check on the LSN in
the first read, in which case it could have just been a 'regular' torn
page case.

> But by waiting for the checkpoint we know that the original write is no
> longer in progress, so if we saw a partial write we're guaranteed to see
> a new LSN on re-read.
>
> This is what I mean by the checkpoint / fsync guarantee.

I don't think any of this really has anything to do with either fsync
being called or with the actual checkpointing process (except to the
extent that the checkpointer is the thing doing the writing, and that we
should be checking the LSN against the LSN of the last checkpoint when
we started, or against the start of the backup LSN if we're talking
about doing a backup).

> > The question is if it’s possible to catch a torn page where the second
> > half is updated *before* the first half of the page in a read (and then
> > in subsequent reads having that state be maintained).  I have some
> > skepticism that it’s really possible to happen in the first place but
> > having an interrupted system call be stalled across two more system
> > calls just seems terribly unlikely, and this is all based on the
> > assumption that the kernel might write the second half of a write before
> > the first to the kernel cache in the first place.
>
> Yes, if that was possible, the explanation about the checkpoint fsync
> guarantee would be bogus, obviously.
>
> I've spent quite a bit of time looking into how write() is handled, and
> I believe seeing only the second half is not possible. You may observe a
> page torn in various ways (not necessarily in half), e.g.
>
>     [old,new,old]
>
> but then the re-read you should be guaranteed to see new data up until
> the last "new" chunk:
>
>     [new,new,old]
>
> At least that's my understanding. I failed to deduce what POSIX says
> about this, or how it behaves on various OS/filesystems.
>
> The one thing I've done was writing a simple stress test that writes a
> single 8kB page in a loop, reads it concurrently and checks the
> behavior. And it seems consistent with my understanding.

Good.

> > Use that to compare to what?  The LSN in the first half of the page
> > could be from well before the checkpoint or even the backup started.
>
> Not sure I follow. If the LSN in the page header is old, and the
> checksum check failed, then on re-read we either find a new LSN (in
> which case we skip the page) or consider this to be a checksum failure.

Right, I'm in agreement with doing that, and it's what is done in
pg_basebackup and pgBackRest.

Thanks!

Stephen


Re: Online verification of checksums

Tomas Vondra-4

On 09/18/2018 12:01 AM, Stephen Frost wrote:

> Greetings,
>
> * Tomas Vondra ([hidden email]) wrote:
>> On 09/17/2018 07:35 PM, Stephen Frost wrote:
>>> On Mon, Sep 17, 2018 at 13:20 Tomas Vondra <[hidden email]
>>> <mailto:[hidden email]>> wrote:
>>>     Doesn't the checkpoint fsync pretty much guarantee this can't happen?
>>>
>>> How? Either it’s possible for the latter half of a page to be updated
>>> before the first half (where the LSN lives), or it isn’t. If it’s
>>> possible then that LSN could be ancient and it wouldn’t matter. 
>>
>> I'm not sure I understand what you're saying here.
>>
>> It is not about the latter half of the page being updated before the first
>> half. I don't think that's quite possible, because write() into the page
>> cache does in fact write the data sequentially.
>
> Well, maybe 'updated before' wasn't quite the right way to talk about
> it, but consider if a read(8K) gets only half-way through the copy
> before having to go do something else and by the time it gets back, a
> write has come in and rewritten the page, such that the read(8K)
> returns half-old and half-new data.
>
>> The problem is that the write is not atomic, and AFAIK it happens in
>> sectors (which are either 512B or 4K these days). And it may arbitrarily
>> interleave with reads.
>
> Yes, of course the write isn't atomic, that's clear.
>
>> So you may do write(8k), but it actually happens in 512B chunks and a
>> concurrent read may observe some mix of those.
>
> Right, I'm not sure that we really need to worry about sub-4K writes
> though I suppose they're technically possible, but it doesn't much
> matter in this case since the LSN is early on in the page, of course.
>
>> But the trick is that if the read sees the effect of the write somewhere
>> in the middle of the page, the next read is guaranteed to see all the
>> preceding new data.
>
> If that's guaranteed then we can just check the LSN and be done.
>

What do you mean by "check the LSN"? Compare it to LSN from the first
read? You don't know if the first read already saw the new LSN or not
(see the next example).

>> Without the checkpoint we risk seeing the same write() both in read and
>> re-read, just in a different stage - so the LSN would not change, making
>> the check futile.
>
> This is the part that isn't making much sense to me.  If we are
> guaranteed that writes into the kernel cache are always in order and
> always at least 512B in size, then if we check the LSN first and
> discover it's "old", and then read the rest of the page and calculate
> the checksum, discover it's a bad checksum, and then go back and re-read
> the page then we *must* see that the LSN has changed OR conclude that
> the checksum is invalidated.
>

Even if the writes are in order and in 512B chunks, you don't know how
they are interleaved with the reads.

Let's assume we're doing a write(), which splits the 8kB page into 512B
chunks. A concurrent read may observe a random mix of old and new data,
depending on timing.

So let's say a read sees the first 4kB of the page like this:

[new, new, new, old, new, old, new, old]

OK, the page is obviously torn, checksum fails, and we try reading it
again. We should see new data at least until the last 'new' chunk in the
first read, so let's say we got this:

[new, new, new, new, new, new, new, old]

Obviously, this page is also torn (there are old data at the end), but
we've read the new data in both cases, which includes the LSN. So the
LSN is the same in both cases, and your detection fails.

Comparing the page LSN to the last checkpoint LSN solves this: if the
page LSN is older than the checkpoint LSN, that write must have
completed by now, so we're not in danger of seeing only its incomplete
effects. And any newer write will update the LSN.
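The decision rule described above can be sketched as a small self-contained function; the type and function names here are illustrative, not taken from the actual patch:

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative stand-in for the state of one re-read of a page. */
typedef struct
{
    uint64_t page_lsn;      /* LSN found in the page header */
    bool     checksum_ok;   /* result of verifying the page checksum */
} PageRead;

/*
 * Decide whether a page whose checksum failed on first read should be
 * counted as corrupt.  A page with an LSN newer than the last checkpoint
 * may still be mid-write, so it is skipped rather than reported.
 */
static bool
is_real_checksum_failure(PageRead reread, uint64_t checkpoint_lsn)
{
    if (reread.checksum_ok)
        return false;       /* torn read resolved itself on re-read */
    if (reread.page_lsn > checkpoint_lsn)
        return false;       /* write may be in progress: skip */
    return true;            /* old LSN and bad checksum: corruption */
}
```

The key property is that the checkpoint LSN is an external reference point: any page still being written concurrently must carry an LSN newer than it.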

> The reason this can happen in the first place is that our 8K read might
> only get half-way done before getting scheduled off, and an 8K write
> happens on the page before our read(8K) gets back to finishing the
> read. But if what you're saying is true, then we can't ever have a case
> where such a thing would happen and a re-read would still see the "old"
> LSN.
>
> If we check the LSN first and discover it's "new" (as in, more recent
> than our last checkpoint, or the checkpoint where the backup started)
> then, sure, there's going to be a risk that the page is currently being
> written right that moment and isn't yet completely valid.
>

Right.

> The problem that we aren't solving for is if, somehow, we do a read(8K)
> and get the first half/second half mixup and then on a subsequent
> read(8K) we see that *again*, implying that somehow the kernel's copy
> has the latter-half of the page updated consistently but not the first
> half.  That's a problem that I haven't got a solution to today.  I'd
> love to have a guarantee that it's not possible- we've certainly never
> seen it but it's been a concern and I thought Michael was suggesting
> he'd seen that, but it sounds like there wasn't a check on the LSN in
> the first read, in which case it could have just been a 'regular' torn
> page case.
>

Well, yeah. If that were possible, we'd be in serious trouble. I've
done quite a bit of experimentation with concurrent reads and writes and
I have not observed such behavior. Of course, that's hardly a proof it
can't happen, and it wouldn't be the first surprise with respect to
kernel I/O this year ...

>> But by waiting for the checkpoint we know that the original write is no
>> longer in progress, so if we saw a partial write we're guaranteed to see
>> a new LSN on re-read.
>>
>> This is what I mean by the checkpoint / fsync guarantee.
>
> I don't think any of this really has anything to do with either fsync
> being called or with the actual checkpointing process (except to the
> extent that the checkpointer is the thing doing the writing, and that we
> should be checking the LSN against the LSN of the last checkpoint when
> we started, or against the start of the backup LSN if we're talking
> about doing a backup).
>

You're right it's not about the fsync, sorry for the confusion. My point
is that using the checkpoint LSN gives us a guarantee that the write is
no longer in progress, so we can't see a page torn because of it. And if
we see a partial write due to a new write, it's guaranteed to update the
page LSN (and we'll notice it).

>>> The question is if it’s possible to catch a torn page where the second
>>> half is updated *before* the first half of the page in a read (and then
>>> in subsequent reads having that state be maintained).  I have some
>>> skepticism that it’s really possible to happen in the first place but
>>> having an interrupted system call be stalled across two more system
>>> calls just seems terribly unlikely, and this is all based on the
>>> assumption that the kernel might write the second half of a write before
>>> the first to the kernel cache in the first place.
>>
>> Yes, if that was possible, the explanation about the checkpoint fsync
>> guarantee would be bogus, obviously.
>>
>> I've spent quite a bit of time looking into how write() is handled, and
>> I believe seeing only the second half is not possible. You may observe a
>> page torn in various ways (not necessarily in half), e.g.
>>
>>     [old,new,old]
>>
>> but then on re-read you should be guaranteed to see new data up until
>> the last "new" chunk:
>>
>>     [new,new,old]
>>
>> At least that's my understanding. I failed to deduce what POSIX says
>> about this, or how it behaves on various OS/filesystems.
>>
>> The one thing I've done was writing a simple stress test that writes a
>> single 8kB page in a loop, reads it concurrently and checks the
>> behavior. And it seems consistent with my understanding.
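The stress test itself isn't posted in the thread; as a simplified, single-threaded model of the assumed front-to-back write behavior (all names here are made up for this sketch), the point is that if writes land sequentially in 512B chunks, the new-data prefix of the page only ever grows:

```c
#include <string.h>

#define PAGE_SIZE  8192
#define CHUNK_SIZE 512
#define NCHUNKS    (PAGE_SIZE / CHUNK_SIZE)

/*
 * Simulate a write() that copies an 8kB page into place front-to-back
 * in 512B chunks, stopping after 'progress' chunks -- the state a
 * concurrent read could observe if writes land sequentially.
 */
static void
partial_write(char *page, const char *new_data, int progress)
{
    memcpy(page, new_data, (size_t) progress * CHUNK_SIZE);
}

/* Count how many leading 512B chunks already contain the new data. */
static int
new_prefix_len(const char *page, const char *new_data)
{
    int i;

    for (i = 0; i < NCHUNKS; i++)
        if (memcmp(page + i * CHUNK_SIZE,
                   new_data + i * CHUNK_SIZE, CHUNK_SIZE) != 0)
            break;
    return i;
}
```

Under this model, a later snapshot never shows less new data than an earlier one, which is the property the re-read logic relies on; the real test, of course, has to exercise actual concurrent read() and write() calls.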
>
> Good.
>
>>> Use that to compare to what?  The LSN in the first half of the page
>>> could be from well before the checkpoint or even the backup started.
>>
>> Not sure I follow. If the LSN in the page header is old, and the
>> checksum check failed, then on re-read we either find a new LSN (in
>> which case we skip the page) or consider this to be a checksum failure.
>
> Right, I'm in agreement with doing that and it's what is done in
> pg_basebackup and pgBackRest.
>

OK. All I'm saying is pg_verify_checksums should probably do the same
thing, i.e. grab the checkpoint LSN and roll with that.
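For pg_verify_checksums, the checkpoint LSN is available from pg_control; a sketch using the frontend control-file reader that exists in src/common as of v11 might look like the following (this is illustrative, not the actual patch, and only compiles inside a PostgreSQL source tree):

```c
#include "postgres_fe.h"
#include "common/controldata_utils.h"
#include "catalog/pg_control.h"

/*
 * Read the last checkpoint's LSN from pg_control, bailing out if the
 * control file's CRC does not verify.
 */
static XLogRecPtr
get_checkpoint_lsn(const char *DataDir, const char *progname)
{
	bool		crc_ok;
	ControlFileData *ControlFile = get_controlfile(DataDir, progname, &crc_ok);

	if (!crc_ok)
	{
		fprintf(stderr, "%s: pg_control CRC check failed\n", progname);
		exit(1);
	}
	return ControlFile->checkPoint;
}
```

Pages whose header LSN is newer than this value would then be skipped (or rechecked) rather than reported as checksum failures.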


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Online verification of checksums

Stephen Frost
Greetings,

* Tomas Vondra ([hidden email]) wrote:

> On 09/18/2018 12:01 AM, Stephen Frost wrote:
> > * Tomas Vondra ([hidden email]) wrote:
> >> On 09/17/2018 07:35 PM, Stephen Frost wrote:
> >> But the trick is that if the read sees the effect of the write somewhere
> >> in the middle of the page, the next read is guaranteed to see all the
> >> preceding new data.
> >
> > If that's guaranteed then we can just check the LSN and be done.
>
> What do you mean by "check the LSN"? Compare it to LSN from the first
> read? You don't know if the first read already saw the new LSN or not
> (see the next example).

Hmm, ok, I can see your point there.  I've been going back and forth
between checking against what the prior LSN was on the page and checking
it against an independent source (like the last checkpoint's LSN), but...

[...]

> Comparing the page LSN to the last checkpoint LSN solves this, because
> if the LSN is older than the checkpoint LSN, that write must have been
> completed by now, and so we're not in danger of seeing only incomplete
> effects of it. And newer write will update the LSN.

Yeah, that makes sense- we need to be looking at something which only
gets updated once the write has actually completed, and the last
checkpoint's LSN gives us that guarantee.

> > The problem that we aren't solving for is if, somehow, we do a read(8K)
> > and get the first half/second half mixup and then on a subsequent
> > read(8K) we see that *again*, implying that somehow the kernel's copy
> > has the latter-half of the page updated consistently but not the first
> > half.  That's a problem that I haven't got a solution to today.  I'd
> > love to have a guarantee that it's not possible- we've certainly never
> > seen it but it's been a concern and I thought Michael was suggesting
> > he'd seen that, but it sounds like there wasn't a check on the LSN in
> > the first read, in which case it could have just been a 'regular' torn
> > page case.
>
> Well, yeah. If that would be possible, we'd be in serious trouble. I've
> done quite a bit of experimentation with concurrent reads and writes and
> I have not observed such behavior. Of course, that's hardly a proof it
> can't happen, and it wouldn't be the first surprise with respect to
> kernel I/O this year ...

I'm glad to hear that you've done a lot of experimentation in this area
and haven't seen such strange behavior happen; we've got quite a few
people running pgBackRest with checksum-checking and haven't seen it
either, but it's always been a bit of a concern.

> You're right it's not about the fsync, sorry for the confusion. My point
> is that using the checkpoint LSN gives us a guarantee that write is no
> longer in progress, and so we can't see a page torn because of it. And
> if we see a partial write due to a new write, it's guaranteed to update
> the page LSN (and we'll notice it).

Right, no worries about the confusion. I hadn't fully thought through
the LSN bit either; what we really need is some external confirmation of
a write having *completed* (not just started), and that makes a definite
difference.

> > Right, I'm in agreement with doing that and it's what is done in
> > pg_basebackup and pgBackRest.
>
> OK. All I'm saying is pg_verify_checksums should probably do the same
> thing, i.e. grab checkpoint LSN and roll with that.

Agreed.

Thanks!

Stephen
