Online checksums verification in the backend


Online checksums verification in the backend

Julien Rouhaud
Hi,

This topic was discussed several times, with the most recent
discussions found at [1] and [2].  Based on those discussions, my
understanding is that the current approach in BASE_BACKUP has too many
drawbacks and we should instead do this check in the backend.  I've
been working on such an approach at VMware, and I'm submitting it here
to discuss the approach and rationale, and hopefully have such a
feature integrated.

First, this was originally developed as an extension.  It means that
the check is performed using an SRF.  That's maybe not the best
approach, as a transaction has to be kept open for the total processing
time.  That can be mitigated by checking each relation independently,
but that's still not ideal.  Maybe using some utility commands (as part of
VACUUM or a new CHECK command for instance) would be a better
approach.

This brings the second consideration: how to report the list of
corrupted blocks to end users.  As I said, this is for now returned via the SRF,
but this is clearly not ideal and should rather be made available more
globally.  One usage of this information could be block level
recovery.  I'm Cc-ing Sawada-san, as I know he's working on this and
mentioned to me that he had ideas on passing the list of corrupted
blocks using the stats collector.

Finally, the read and locking considerations.  I tried to cover that
extensively in the comments, but here are some details on how I tried
to make the check safe while trying to keep the overhead as low as
possible.  First thing is that this is only doing buffered reads,
without any attempt to discard OS cache.  Therefore, any discrepancy
between the OS cache and the disk cannot be detected unless you do
other actions, such as sync / drop_caches on GNU/Linux.
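(Something along the lines of "sync && echo 1 | sudo tee
/proc/sys/vm/drop_caches" would do it on Linux, at the cost of
dropping the page cache for the whole system.)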

An access share lock on the currently checked relation is held,
meaning that it can't get deleted/truncated.  The total number of
blocks for the given fork is retrieved first, so any new block will be
ignored.  Such new blocks are considered out of scope, as they are
written after the start of the check.

Each time a buffer is being checked, the target buffer mapping
partition lock is acquired in shared mode, to prevent concurrent
eviction.  If the buffer is found in shared buffers, it's pinned and
released immediately, just to get the state.  If the buffer is found
dirty, no check is performed as it'll be written to disk by the
checkpointer, or during recovery in case of unclean shutdown.
Otherwise, an IO lock is held while the buffer is being read into a
private buffer.  The IO lock and buffer mapping lock are then released,
and the check is performed.

If the buffer is not found in shared buffers, the buffer mapping
partition lock is released immediately and the block is read from
disk.  It's therefore possible to get a false positive here, as the
block could be concurrently read, modified and partially written to
disk.  So, if an error is detected in this case, the check is
restarted from scratch, and if the buffer is still not found in shared
buffers, the read is done while still holding the buffer mapping
partition lock, to make sure that it can't get concurrently loaded and
modified.  This is an optimistic approach to avoid performance
overhead, assuming that there shouldn't be many errors, and that the
window for false positives is very narrow.
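
To make that more concrete, here's a simplified sketch of the per-block
logic in C.  This is not the patch code: buffer_is_dirty(),
io_lock_acquire()/io_lock_release() and verify_page() are hypothetical
helpers standing in for the real bufmgr interactions, and pinning and
error handling are omitted.

    /* Sketch only -- approximates the locking strategy described above. */
    static bool
    check_one_block(Relation rel, ForkNumber forknum, BlockNumber blkno,
                    char *pagebuf)      /* private BLCKSZ-sized buffer */
    {
        BufferTag   tag;
        uint32      hash;
        LWLock     *partlock;

        /* assumes RelationOpenSmgr(rel) was already called */
        INIT_BUFFERTAG(tag, rel->rd_node, forknum, blkno);
        hash = BufTableHashCode(&tag);
        partlock = BufMappingPartitionLock(hash);

        /* Prevent concurrent eviction while we look the block up. */
        LWLockAcquire(partlock, LW_SHARED);

        if (BufTableLookup(&tag, hash) >= 0)
        {
            if (buffer_is_dirty(&tag))          /* hypothetical */
            {
                /* Checkpointer or recovery will rewrite it: skip. */
                LWLockRelease(partlock);
                return true;
            }

            /* Read the on-disk version under the IO lock: no torn read. */
            io_lock_acquire(&tag);              /* hypothetical */
            smgrread(rel->rd_smgr, forknum, blkno, pagebuf);
            io_lock_release(&tag);              /* hypothetical */
            LWLockRelease(partlock);
            return verify_page(pagebuf, blkno); /* see next sketch */
        }

        /* Not in shared buffers: optimistic lockless read first. */
        LWLockRelease(partlock);
        smgrread(rel->rd_smgr, forknum, blkno, pagebuf);
        if (verify_page(pagebuf, blkno))
            return true;

        /*
         * Possible false positive: the block may have been concurrently
         * loaded, modified and partially written.  If it's still not in
         * shared buffers, re-read it while holding the mapping partition
         * lock so that can't happen; otherwise restart from scratch.
         */
        LWLockAcquire(partlock, LW_SHARED);
        if (BufTableLookup(&tag, hash) < 0)
        {
            smgrread(rel->rd_smgr, forknum, blkno, pagebuf);
            LWLockRelease(partlock);
            return verify_page(pagebuf, blkno);
        }
        LWLockRelease(partlock);
        return check_one_block(rel, forknum, blkno, pagebuf);
    }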

The check consists of simply comparing the stored and computed
checksums, with an additional check that the page is really new (using
PageIsVerified) if it's found as PageIsNew().  Since this is done
after releasing all locks, we could definitely add more checks without
causing much overhead, like pd_lower/pd_upper sanity checks.  I prefer to
keep the check simple for now and rather focus on the general
approach.
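
Continuing the sketch above, the hypothetical verify_page() helper
would boil down to something like this:

    /* Sketch of the check itself, run after all locks are released. */
    static bool
    verify_page(char *pagebuf, BlockNumber blkno)
    {
        PageHeader  phdr = (PageHeader) pagebuf;

        if (PageIsNew((Page) pagebuf))
        {
            /*
             * An all-zero page is valid; PageIsVerified() makes sure the
             * page really is new, and not corrupted in a way that merely
             * looks new.
             */
            return PageIsVerified((Page) pagebuf, blkno);
        }

        /* Compare the stored checksum with a freshly computed one. */
        return phdr->pd_checksum == pg_checksum_page(pagebuf, blkno);
    }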

Finally, I also reused the vacuum costing GUCs (for simplicity) and
approach to add some throttling.

I'm attaching a patch that adds a new pg_check_relation() SQL function
to perform a check of one or all relations, and some simple regression
tests.
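
Usage would look something like SELECT * FROM
pg_check_relation('t1'::regclass); with one row returned per corrupted
block found (exact signature and output columns subject to change).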

[1] https://www.postgresql.org/message-id/flat/1532606373.3422.5.camel%40credativ.de
[2] https://www.postgresql.org/message-id/flat/20190326170820.6sylklg7eh6uhabd%40alap3.anarazel.de

Attachment: online_checksum_verification-v1.diff (39K)

Re: Online checksums verification in the backend

Robert Haas
On Fri, Dec 6, 2019 at 9:51 AM Julien Rouhaud <[hidden email]> wrote:
> This topic was discussed several times, with the most recent
> discussions found at [1] and [2].  Based on those discussions, my
> understanding is that the current approach in BASE_BACKUP has too many
> drawbacks and we should instead do this check in the backend.

Good idea.

> This brings the second consideration: how to report the list of
> corrupted blocks to end users.  As I said, this is for now returned via the SRF,
> but this is clearly not ideal and should rather be made available more
> globally.

Some people might prefer notices, because you can get those while the
thing is still running, rather than a result set, which you will only
see when the query finishes. Other people might prefer an SRF, because
they want to have the data in structured form so that they can
postprocess it. Not sure what you mean by "more globally." I guess one
idea would be to provide a way to kick this off in the background via
a background worker or similar and then have it put the results in a
table. But that might fail if there are checksum errors in the
catalogs themselves.

I don't really know what's best.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Online checksums verification in the backend

Julien Rouhaud
On Mon, Dec 9, 2019 at 5:21 PM Robert Haas <[hidden email]> wrote:

>
> On Fri, Dec 6, 2019 at 9:51 AM Julien Rouhaud <[hidden email]> wrote:
>
> > This brings the second consideration: how to report the list of
> > corrupted blocks to end users.  As I said, this is for now returned via the SRF,
> > but this is clearly not ideal and should rather be made available more
> > globally.
>
> Some people might prefer notices, because you can get those while the
> thing is still running, rather than a result set, which you will only
> see when the query finishes. Other people might prefer an SRF, because
> they want to have the data in structured form so that they can
> postprocess it. Not sure what you mean by "more globally."

I meant having the results available system-wide, not only to the
caller.  I think that emitting a log/notice-level message should always
be done on top of whatever other communication facility we're using.

> I guess one
> idea would be to provide a way to kick this off in the background via
> a background worker or similar and then have it put the results in a
> table. But that might fail if there are checksum errors in the
> catalogs themselves.

Yes, that's a concern.  We could maintain a list in (dynamic) shared
memory with a simple SQL wrapper to read the data, but that would be
lost with a crash/restart.  Or use
pgstat_report_checksum_failures_in_db(), modifying it to get a
relfilenode, blocknum and forknum, and append that to some flat files,
hoping that it won't get corrupted either.



Re: Online checksums verification in the backend

Michael Paquier
On Mon, Dec 09, 2019 at 07:02:43PM +0100, Julien Rouhaud wrote:

> On Mon, Dec 9, 2019 at 5:21 PM Robert Haas <[hidden email]> wrote:
>> Some people might prefer notices, because you can get those while the
>> thing is still running, rather than a result set, which you will only
>> see when the query finishes. Other people might prefer an SRF, because
>> they want to have the data in structured form so that they can
>> postprocess it. Not sure what you mean by "more globally."
>
> I meant having the results available system-wide, not only to the
> caller.  I think that emitting a log/notice-level message should always
> be done on top of whatever other communication facility we're using.
The problem with notices and logs is that they tend to be ignored.  Now I
don't see any problem either in adding something into the logs which
can be found later on for parsing, on top of a SRF returned by the
caller which includes all the corruption details, say with pgbadger
or your friendly neighborhood grep.  I think that any backend function
should also make sure to call pgstat_report_checksum_failure() to
make a report visible at database level in the catalogs, so that it is
possible to use that as a cheap high-level warning.  The details of
the failures could always be dug from the logs or the result of the
function itself after finding out that something is wrong in
pg_stat_database.
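
In other words, a query as simple as "SELECT datname,
checksum_failures FROM pg_stat_database;" would be enough as a first
alarm, before digging into the details.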

>> I guess one
>> idea would be to provide a way to kick this off in the background via
>> a background worker or similar and then have it put the results in a
>> table. But that might fail if there are checksum errors in the
>> catalogs themselves.
>
> Yes, that's a concern.  We could maintain a list in (dynamic) shared
> memory with a simple SQL wrapper to read the data, but that would be
> lost with a crash/restart.  Or use
> pgstat_report_checksum_failures_in_db(), modifying it to get a
> relfilenode, blocknum and forknum, and append that to some flat files,
> hoping that it won't get corrupted either.
If a lot of blocks are corrupted, that could bloat things.  Hence some
retention policies would be necessary, and that's tricky to define and
configure properly.  I'd tend to be in the school of just logging the
information and being done with it, because that's simple and because you
won't need to worry about any more configuration.  Doing the work in
the background is still separate from a SQL-callable function though,
no?  In this case you need a connection to a database to allow the
checksum verification to happen on a relfilenode based on the relation
to check, and also because you want the thing to be safe concurrently
(background work here would be a combo of a bgworker triggering dynamic
children working on one database, not necessarily something that needs
to be in core).
--
Michael


Re: Online checksums verification in the backend

Julien Rouhaud
On Tue, Dec 10, 2019 at 3:26 AM Michael Paquier <[hidden email]> wrote:

>
> On Mon, Dec 09, 2019 at 07:02:43PM +0100, Julien Rouhaud wrote:
> > On Mon, Dec 9, 2019 at 5:21 PM Robert Haas <[hidden email]> wrote:
> >> Some people might prefer notices, because you can get those while the
> >> thing is still running, rather than a result set, which you will only
> >> see when the query finishes. Other people might prefer an SRF, because
> >> they want to have the data in structured form so that they can
> >> postprocess it. Not sure what you mean by "more globally."
> >
> > I meant having the results available system-wide, not only to the
> > caller.  I think that emitting a log/notice-level message should always
> > be done on top of whatever other communication facility we're using.
>
> The problem with notices and logs is that they tend to be ignored.  Now I
> don't see any problem either in adding something into the logs which
> can be found later on for parsing, on top of a SRF returned by the
> caller which includes all the corruption details, say with pgbadger
> or your friendly neighborhood grep.  I think that any backend function
> should also make sure to call pgstat_report_checksum_failure() to
> make a report visible at database level in the catalogs, so that it is
> possible to use that as a cheap high-level warning.  The details of
> the failures could always be dug from the logs or the result of the
> function itself after finding out that something is wrong in
> pg_stat_database.

I agree that adding extra information in the logs and calling
pgstat_report_checksum_failure() is a must-do, and I changed that
locally.  However, I doubt that the logs are the right place to find
the details of corrupted blocks.  There's no guarantee that the file
will be accessible to the DBA, nor that the content won't get
truncated by the time it's needed.  I really think that corruption is
important enough to justify a more specific location.

> >> I guess one
> >> idea would be to provide a way to kick this off in the background via
> >> a background worker or similar and then have it put the results in a
> >> table. But that might fail if there are checksum errors in the
> >> catalogs themselves.
> >
> > Yes, that's a concern.  We could maintain a list in (dynamic) shared
> > memory with a simple SQL wrapper to read the data, but that would be
> > lost with a crash/restart.  Or use
> > pgstat_report_checksum_failures_in_db(), modifying it to get a
> > relfilenode, blocknum and forknum, and append that to some flat files,
> > hoping that it won't get corrupted either.
>
> If a lot of blocks are corrupted, that could bloat things.  Hence some
> retention policies would be necessary, and that's tricky to define and
> configure properly.  I'd tend to be in the school of just logging the
> information and being done with it, because that's simple and because you
> won't need to worry about any more configuration.

If the number of corrupted blocks becomes high enough to excessively
bloat things, it's likely that the instance is doomed anyway, so I'm
not especially concerned about it.



Re: Online checksums verification in the backend

Masahiko Sawada
On Fri, Dec 6, 2019 at 11:51 PM Julien Rouhaud <[hidden email]> wrote:

>
> Hi,
>
> This topic was discussed several times, with the most recent
> discussions found at [1] and [2].  Based on those discussions, my
> understanding is that the current approach in BASE_BACKUP has too many
> drawbacks and we should instead do this check in the backend.  I've
> been working on such an approach at VMware, and I'm submitting it here
> to discuss the approach and rationale, and hopefully have such a
> feature integrated.

Thank you for working on this!

>
> First, this was originally developed as an extension.  It means that
> the check is performed using an SRF.  That's maybe not the best
> approach, as a transaction has to be kept open for the total
> processing time.  That can be mitigated by checking each relation
> independently, but that's still not ideal.  Maybe using some utility
> commands (as part of
> VACUUM or a new CHECK command for instance) would be a better
> approach.
>
> This brings the second consideration: how to report the list of
> corrupted blocks to end users.  As I said, this is for now returned via the SRF,
> but this is clearly not ideal and should rather be made available more
> globally.  One usage of this information could be block level
> recovery.  I'm Cc-ing Sawada-san, as I know he's working on this and
> mentioned to me that he had ideas on passing the list of corrupted
> blocks using the stats collector.

Yes, the list of corrupted pages is necessary for single page
recovery.  Apart from single page recovery, I think it's helpful for
DBAs if they can find the corrupted blocks in the server logs and in a
system view.

I've also tried to report corrupted pages to the stats collector
while researching single page recovery in PostgreSQL, but one
problem is that the statistics in the stats collector are cleared
during crash recovery.  I want the information about block corruption
to survive even when the server goes down.  And we might want to add
checksums to the permanent file holding the information about database
corruption.  The correctness of that information is important, because
we can fix a database by restoring some tables from a logical backup
or by doing reindex etc. as long as we have non-broken information
about the database corruption.
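
Just to illustrate the idea (a sketch only, the record layout here is
completely made up), each entry of such a permanent file could carry
its own CRC, similarly to what WAL records do:

    #include "port/pg_crc32c.h"

    /* Hypothetical on-disk record for one corrupted block. */
    typedef struct CorruptedBlockRecord
    {
        RelFileNode rnode;          /* which relation file */
        ForkNumber  forknum;
        BlockNumber blkno;
        pg_crc32c   crc;            /* CRC of the fields above */
    } CorruptedBlockRecord;

    static void
    corrupted_block_record_set_crc(CorruptedBlockRecord *rec)
    {
        INIT_CRC32C(rec->crc);
        COMP_CRC32C(rec->crc, rec, offsetof(CorruptedBlockRecord, crc));
        FIN_CRC32C(rec->crc);
    }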

>
> Finally, the read and locking considerations.  I tried to cover that
> extensively in the comments, but here are some details on how I tried
> to make the check safe while trying to keep the overhead as low as
> possible.  First thing is that this is only doing buffered reads,
> without any attempt to discard OS cache.  Therefore, any discrepancy
> between the OS cache and the disk cannot be detected unless you do
> other actions, such as sync / drop_caches on GNU/Linux.
>
> An access share lock on the currently checked relation is held,
> meaning that it can't get deleted/truncated.  The total number of
> blocks for the given fork is retrieved first, so any new block will be
> ignored.  Such new blocks are considered out of scope, as they are
> written after the start of the check.
>
> Each time a buffer is being checked, the target buffer mapping
> partition lock is acquired in shared mode, to prevent concurrent
> eviction.  If the buffer is found in shared buffers, it's pinned and
> released immediately, just to get the state.

I wonder if there is a possibility that blocks on disk can be corrupted
even if they are loaded in shared buffers.  ISTM the above method
cannot detect such corruption.  Reading and checking blocks fast is
attractive, but I think it's also important to check blocks precisely,
without overlooking anything.

Regards,

--
Masahiko Sawada  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Online checksums verification in the backend

Julien Rouhaud
On Tue, Dec 24, 2019 at 4:23 AM Masahiko Sawada <[hidden email]> wrote:

>
> On Fri, Dec 6, 2019 at 11:51 PM Julien Rouhaud <[hidden email]> wrote:
> >
> > This brings the second consideration: how to report the list of
> > corrupted blocks to end users.  As I said, this is for now returned via the SRF,
> > but this is clearly not ideal and should rather be made available more
> > globally.  One usage of this information could be block level
> > recovery.  I'm Cc-ing Sawada-san, as I know he's working on this and
> > mentioned to me that he had ideas on passing the list of corrupted
> > blocks using the stats collector.
>
> Yes, the list of corrupted pages is necessary for single page
> recovery.  Apart from single page recovery, I think it's helpful for
> DBAs if they can find the corrupted blocks in the server logs and in a
> system view.
>
> I've also tried to report corrupted pages to the stats collector
> while researching single page recovery in PostgreSQL, but one
> problem is that the statistics in the stats collector are cleared
> during crash recovery.  I want the information about block corruption
> to survive even when the server goes down.

Yes, having the list of corrupted blocks surviving a crash-and-restart
cycle, and also available after a clean shutdown is definitely
important.

> And we might want to add checksums to the
> permanent file holding the information about database corruption.
> The correctness of that information is important, because we can fix
> a database by restoring some tables from a logical backup or by doing
> reindex etc. as long as we have non-broken information about the
> database corruption.

Agreed

> > Finally, the read and locking considerations.  I tried to cover that
> > extensively in the comments, but here are some details on how I tried
> > to make the check safe while trying to keep the overhead as low as
> > possible.  First thing is that this is only doing buffered reads,
> > without any attempt to discard OS cache.  Therefore, any discrepancy
> > between the OS cache and the disk cannot be detected unless you do
> > other actions, such as sync / drop_caches on GNU/Linux.
> >
> > An access share lock on the currently checked relation is held,
> > meaning that it can't get deleted/truncated.  The total number of
> > blocks for the given fork is retrieved first, so any new block will be
> > ignored.  Such new blocks are considered out of scope, as they are
> > written after the start of the check.
> >
> > Each time a buffer is being checked, the target buffer mapping
> > partition lock is acquired in shared mode, to prevent concurrent
> > eviction.  If the buffer is found in shared buffers, it's pinned and
> > released immediately, just to get the state.
>
> I wonder if there is a possibility that blocks on disk can be corrupted
> even if they are loaded in shared buffers.  ISTM the above method
> cannot detect such corruption.  Reading and checking blocks fast is
> attractive, but I think it's also important to check blocks precisely,
> without overlooking anything.

It can definitely happen, and it's the usual doomsday scenario:
the database works fine for months, then postgres is restarted, say
for a minor version upgrade, and then boom, the most popular blocks
that are constantly used in read-only mode were corrupted on disk but
never evicted from shared buffers, and you have a major outage.  I have
unfortunately witnessed that too many times.  This is especially bad,
as in this kind of scenario you typically discover the corruption
once every backup contains only the corrupted blocks.

Note that in the approach I'm suggesting, I do verify blocks that are
loaded in shared buffers; I only ignore the dirty blocks, as they'll
be written by the checkpointer or recovery process in case of unclean
shutdown.  A buffer pin isn't necessary to avoid a torn page read: an
IO lock also guarantees that, and causes less overhead.  The included
TAP test should also detect the corruption of a
present-in-shared-buffers-non-dirty block.  It could however be
improved, e.g. by calling pg_prewarm to make sure that the block is
indeed in shared_buffers, and also by running the same test after a
clean restart to make sure that it's hitting the
not-in-shared-buffers case.
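
(With the pg_prewarm extension installed, something like SELECT
pg_prewarm('corrupted_table'); right before the check should be enough
to guarantee that the block is resident in shared_buffers.)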



Re: Online checksums verification in the backend

Masahiko Sawada
On Tue, 24 Dec 2019 at 16:09, Julien Rouhaud <[hidden email]> wrote:

>
> On Tue, Dec 24, 2019 at 4:23 AM Masahiko Sawada <[hidden email]> wrote:
> >
> > On Fri, Dec 6, 2019 at 11:51 PM Julien Rouhaud <[hidden email]> wrote:
> > >
> > > This brings the second consideration: how to report the list of
> > > corrupted blocks to end users.  As I said, this is for now returned via the SRF,
> > > but this is clearly not ideal and should rather be made available more
> > > globally.  One usage of this information could be block level
> > > recovery.  I'm Cc-ing Sawada-san, as I know he's working on this and
> > > mentioned to me that he had ideas on passing the list of corrupted
> > > blocks using the stats collector.
> >
> > Yes, the list of corrupted pages is necessary for single page
> > recovery.  Apart from single page recovery, I think it's helpful for
> > DBAs if they can find the corrupted blocks in the server logs and in a
> > system view.
> >
> > I've also tried to report corrupted pages to the stats collector
> > while researching single page recovery in PostgreSQL, but one
> > problem is that the statistics in the stats collector are cleared
> > during crash recovery.  I want the information about block corruption
> > to survive even when the server goes down.
>
> Yes, having the list of corrupted blocks surviving a crash-and-restart
> cycle, and also available after a clean shutdown is definitely
> important.
>
> > And we might want to add checksums to the
> > permanent file holding the information about database corruption.
> > The correctness of that information is important, because we can fix
> > a database by restoring some tables from a logical backup or by doing
> > reindex etc. as long as we have non-broken information about the
> > database corruption.
>
> Agreed
>
> > > Finally, the read and locking considerations.  I tried to cover that
> > > extensively in the comments, but here are some details on how I tried
> > > to make the check safe while trying to keep the overhead as low as
> > > possible.  First thing is that this is only doing buffered reads,
> > > without any attempt to discard OS cache.  Therefore, any discrepancy
> > > between the OS cache and the disk cannot be detected unless you do
> > > other actions, such as sync / drop_caches on GNU/Linux.
> > >
> > > An access share lock on the currently checked relation is held,
> > > meaning that it can't get deleted/truncated.  The total number of
> > > blocks for the given fork is retrieved first, so any new block will be
> > > ignored.  Such new blocks are considered out of scope, as they are
> > > written after the start of the check.
> > >
> > > Each time a buffer is being checked, the target buffer mapping
> > > partition lock is acquired in shared mode, to prevent concurrent
> > > eviction.  If the buffer is found in shared buffers, it's pinned and
> > > released immediately, just to get the state.
> >
> > I wonder if there is a possibility that blocks on disk can be corrupted
> > even if they are loaded in shared buffers.  ISTM the above method
> > cannot detect such corruption.  Reading and checking blocks fast is
> > attractive, but I think it's also important to check blocks precisely,
> > without overlooking anything.
>
> It can definitely happen, and it's the usual doomsday scenario:
> the database works fine for months, then postgres is restarted, say
> for a minor version upgrade, and then boom, the most popular blocks
> that are constantly used in read-only mode were corrupted on disk but
> never evicted from shared buffers, and you have a major outage.  I have
> unfortunately witnessed that too many times.  This is especially bad,
> as in this kind of scenario you typically discover the corruption
> once every backup contains only the corrupted blocks.
>
> Note that in the approach I'm suggesting, I do verify blocks that are
> loaded in shared buffers; I only ignore the dirty blocks, as they'll
> be written by the checkpointer or recovery process in case of unclean
> shutdown.  A buffer pin isn't necessary to avoid a torn page read: an
> IO lock also guarantees that, and causes less overhead.  The included
> TAP test should also detect the corruption of a
> present-in-shared-buffers-non-dirty block.  It could however be
> improved, e.g. by calling pg_prewarm to make sure that the block is
> indeed in shared_buffers, and also by running the same test after a
> clean restart to make sure that it's hitting the
> not-in-shared-buffers case.

It reads blocks from disk even if they are loaded in shared buffers.
Now I understand. Thanks!

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Online checksums verification in the backend

Julien Rouhaud
On Tue, Dec 10, 2019 at 11:12:34AM +0100, Julien Rouhaud wrote:

> On Tue, Dec 10, 2019 at 3:26 AM Michael Paquier <[hidden email]> wrote:
> >
> > On Mon, Dec 09, 2019 at 07:02:43PM +0100, Julien Rouhaud wrote:
> > > On Mon, Dec 9, 2019 at 5:21 PM Robert Haas <[hidden email]> wrote:
> > >> Some people might prefer notices, because you can get those while the
> > >> thing is still running, rather than a result set, which you will only
> > >> see when the query finishes. Other people might prefer an SRF, because
> > >> they want to have the data in structured form so that they can
> > >> postprocess it. Not sure what you mean by "more globally."
> > >
> > > I meant having the results available system-wide, not only to the
> > > caller.  I think that emitting a log/notice-level message should always
> > > be done on top of whatever other communication facility we're using.
> >
> > The problem with notices and logs is that they tend to be ignored.  Now I
> > don't see any problem either in adding something into the logs which
> > can be found later on for parsing, on top of a SRF returned by the
> > caller which includes all the corruption details, say with pgbadger
> > or your friendly neighborhood grep.  I think that any backend function
> > should also make sure to call pgstat_report_checksum_failure() to
> > make a report visible at database level in the catalogs, so that it is
> > possible to use that as a cheap high-level warning.  The details of
> > the failures could always be dug from the logs or the result of the
> > function itself after finding out that something is wrong in
> > pg_stat_database.
>
> I agree that adding extra information in the logs and calling
> pgstat_report_checksum_failure() is a must-do, and I changed that
> locally.  However, I doubt that the logs are the right place to find
> the details of corrupted blocks.  There's no guarantee that the file
> will be accessible to the DBA, nor that the content won't get
> truncated by the time it's needed.  I really think that corruption is
> important enough to justify a more specific location.

The cfbot reported a build failure, so here's a rebased v2 which also contains
the pg_stat_report_failure() call and extra log info.

Attachment: online_checksum_verification-v2.diff (30K)

Re: Online checksums verification in the backend

Michael Paquier
On Wed, Mar 11, 2020 at 08:18:23AM +0100, Julien Rouhaud wrote:
> The cfbot reported a build failure, so here's a rebased v2 which also contains
> the pg_stat_report_failure() call and extra log info.

+ * - if a block is not found in shared_buffers, the LWLock is relased and the
+ *   block is read from disk without taking any lock.  If an error is detected,
+ *   the read block will be discarded and retrieved again while holding the
+ *   LWLock.  This is because an error due to concurrent write is possible but
+ *   very unlikely, so it's better to have an optimistic approach to limit
+ *   locking overhead
This can be risky with false positives, no?  With a large amount of
shared buffer eviction you actually increase the risk of torn page
reads.  Instead of logic relying on partition mapping locks, which
could be unwise on performance grounds, did you consider different
approaches?  For example, a kind of pre-emptive lock on the page in
storage, preventing any shared buffer operation from happening while
the block is read from storage; that would act like a barrier.

+ * Vacuum's GUCs are used to avoid consuming too much resources while running
+ * this tool.
Shouldn't this involve separate GUCs instead of the VACUUM ones?  I
guess that this leads to the fact that this function may be better as
a contrib module, with the addition of some better-suited APIs in core
(see paragraph above).

+Run
+    make check
+or
+    make installcheck
Why is installcheck mentioned here?

I don't think that it is appropriate to place the SQL-callable part in
the existing checksum.c.  I would suggest instead a new file, say
checksumfuncs.c in src/backend/utils/adt/, holding any SQL functions
for checksums.

-SUBDIRS = perl regress isolation modules authentication recovery
 subscription
+SUBDIRS = perl regress isolation modules authentication check_relation \
+     recovery subscription
It seems to me that this test would be a good fit for
src/test/modules/test_misc/.

+static void
+check_all_relations(TupleDesc tupdesc, Tuplestorestate *tupstore,
+                         ForkNumber forknum)
Per the argument of bloat, I think that I would remove
check_all_relations(), as this function could take a very long time to
run, and just make the SQL function strict.

+ * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
+ *   disk either before the end of the next checkpoint or during recovery in
+ *   case of unsafe shutdown
Wouldn't it actually be a good thing to check that the page on storage
is fine in this case?  This depends on the system settings and the
checkpoint frequency, but checkpoint_timeout can be extended up to 1
day.  And plenty of things could happen to the storage in one day,
including a base backup picking up a corrupted page on storage
that this function would not be able to detect.

+ * - if a block is otherwise found in shared_buffers, an IO lock is taken on
+ *   the block and the block is then read from storage, ignoring the block in
+ *   shared_buffers
Yeah, I think that you are right here to check the page on storage
anyway.

+ *   we detect if a block is in shared_buffers or not.  See get_buffer()
+ *   comments for more details about the locking strategy.
get_buffer() does not exist in your patch, check_get_buffer() does.

+ * - if a block is not found in shared_buffers, the LWLock is relased and the
[...]
+ * To avoid torn page and possible false postives when reading data, and
Typos.

+   if (!DataChecksumsEnabled())
+       elog(ERROR, "Data checksums are not enabled");
Note that elog() is for the class of errors which are never expected,
and here a caller of pg_check_relation() with checksums disabled can
trigger that.  So you need to call ereport() with
ERRCODE_FEATURE_NOT_SUPPORTED.

+ * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
+ *   disk either before the end of the next checkpoint or during recovery in
+ *   case of unsafe shutdown
Not sure that the indentation is going to react well on that part of
the patch, perhaps it would be better to add some "/*-------" at the
beginning and end of the comment block to tell pgindent to ignore this
part?

Based on the feedback gathered on this thread, I guess that you should
have a SRF returning the list of broken blocks, as well as NOTICE
messages.  Another thing to consider is the addition of a range
argument to only check a certain portion of the blocks, say one
segment file at a time, etc.  Fine by me to not include in the first
flavor of the patch.

The patch needs documentation.
--
Michael


Re: Online checksums verification in the backend

Masahiko Sawada
On Wed, 11 Mar 2020 at 16:18, Julien Rouhaud <[hidden email]> wrote:

>
> On Tue, Dec 10, 2019 at 11:12:34AM +0100, Julien Rouhaud wrote:
> > On Tue, Dec 10, 2019 at 3:26 AM Michael Paquier <[hidden email]> wrote:
> > >
> > > On Mon, Dec 09, 2019 at 07:02:43PM +0100, Julien Rouhaud wrote:
> > > > On Mon, Dec 9, 2019 at 5:21 PM Robert Haas <[hidden email]> wrote:
> > > >> Some people might prefer notices, because you can get those while the
> > > >> thing is still running, rather than a result set, which you will only
> > > >> see when the query finishes. Other people might prefer an SRF, because
> > > >> they want to have the data in structured form so that they can
> > > >> postprocess it. Not sure what you mean by "more globally."
> > > >
> > > > I meant having the results available system-wide, not only to the
> > > > caller.  I think that emitting a log/notice-level message should always
> > > > be done on top of whatever other communication facility we're using.
> > >
> > > The problem with notices and logs is that they tend to be ignored.  Now I
> > > don't see any problem either in adding something into the logs which
> > > can be found later on for parsing, on top of a SRF returned by the
> > > caller which includes all the corruption details, say with pgbadger
> > > or your friendly neighborhood grep.  I think that any backend function
> > > should also make sure to call pgstat_report_checksum_failure() to
> > > make a report visible at database level in the catalogs, so that it is
> > > possible to use that as a cheap high-level warning.  The details of
> > > the failures could always be dug from the logs or the result of the
> > > function itself after finding out that something is wrong in
> > > pg_stat_database.
> >
> > I agree that adding extra information in the logs and calling
> > pgstat_report_checksum_failure() is a must-do, and I changed that
> > locally.  However, I doubt that the logs are the right place to find
> > the details of corrupted blocks.  There's no guarantee that the file
> > will be accessible to the DBA, nor that the content won't get
> > truncated by the time it's needed.  I really think that corruption is
> > important enough to justify a more specific location.
>
>
> The cfbot reported a build failure, so here's a rebased v2 which also contains
> the pg_stat_report_failure() call and extra log info.

In addition to comments from Michael-san, here are my comments:

1.
+   if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES))
+       ereport(ERROR,
+               (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+                errmsg("only superuser or a member of the
pg_read_server_files role may use this function")));
+
+   if (!DataChecksumsEnabled())
+       elog(ERROR, "Data checksums are not enabled");

I think it's better to reverse the order of the above checks.

2.
+#define CRF_COLS           5   /* Number of output arguments in the SRF */

Should it be SRF_COLS?

3.
+static void
+check_delay_point(void)
+{
+   /* Always check for interrupts */
+   CHECK_FOR_INTERRUPTS();
+
+   /* Nap if appropriate */
+   if (!InterruptPending && VacuumCostBalance >= VacuumCostLimit)
+   {
+       int         msec;
+
+       msec = VacuumCostDelay * VacuumCostBalance / VacuumCostLimit;
+       if (msec > VacuumCostDelay * 4)
+           msec = VacuumCostDelay * 4;
+
+       pg_usleep(msec * 1000L);
+
+       VacuumCostBalance = 0;
+
+       /* Might have gotten an interrupt while sleeping */
+       CHECK_FOR_INTERRUPTS();
+   }
+}

Even if we use vacuum delay for this function, I think we need to
check VacuumCostActive and return if it's false, or it's better to just
return if VacuumCostDelay == 0.
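
For example (just a sketch), right after the unconditional
CHECK_FOR_INTERRUPTS() at the top of the function:

    /* Throttling disabled: no cost accounting or napping needed. */
    if (VacuumCostDelay <= 0)
        return;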

4.
+static void
+check_all_relations(TupleDesc tupdesc, Tuplestorestate *tupstore,
+                         ForkNumber forknum)

I also agree with Michael-san to remove this function. Instead we can
check all relations by:

select pg_check_relation(oid) from pg_class;

5.
Other typos

s/dirted/dirtied/
s/explictly/explicitly/

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Online checksums verification in the backend

Julien Rouhaud
Thanks for the review, Michael!

On Mon, Mar 16, 2020 at 12:29:28PM +0900, Michael Paquier wrote:

> On Wed, Mar 11, 2020 at 08:18:23AM +0100, Julien Rouhaud wrote:
> > The cfbot reported a build failure, so here's a rebased v2 which also contains
> > the pg_stat_report_failure() call and extra log info.
>
> + * - if a block is not found in shared_buffers, the LWLock is relased and the
> + *   block is read from disk without taking any lock.  If an error is detected,
> + *   the read block will be discarded and retrieved again while holding the
> + *   LWLock.  This is because an error due to concurrent write is possible but
> + *   very unlikely, so it's better to have an optimistic approach to limit
> + *   locking overhead
> This can be risky with false positives, no?


Do you mean a high probability of false positives in the 1st iteration, so
frequently running the recheck that can't have false positives, rather than
the 2nd check leading to false positives?


> With a large amount of
> shared buffer eviction you actually increase the risk of torn page
> reads.  Instead of logic relying on partition mapping locks, which
> could be unwise on performance grounds, did you consider different
> approaches?  For example, a kind of pre-emptive lock on the page in
> storage, preventing any shared buffer operation from happening while
> the block is read from storage; that would act like a barrier.


Even with a workload having a large shared_buffers eviction pattern, I don't
think that there's a high probability of hitting a torn page.  Unless I'm
mistaken, it can only happen if all those steps happen concurrently
with our block read, just after releasing the LWLock:

- postgres reads the same block into shared_buffers (including all the locking)
- dirties it
- writes part of the page

It's certainly possible, but it seems so unlikely that the optimistic lock-less
approach looks like a very good tradeoff.


>
> + * Vacuum's GUCs are used to avoid consuming too much resources while running
> + * this tool.
> Shouldn't this involve separate GUCs instead of the VACUUM ones?


We could, but the access pattern looked so similar that it seemed like a good
idea to avoid adding 2 new GUCs for that, to keep configuration simple.  Unless
there are objections, I'll add them in the next version.

> I guess that this leads to the fact that this function may be better as
> a contrib module, with the addition of some better-suited APIs in core
> (see paragraph above).


Below?


>
> +Run
> +    make check
> +or
> +    make installcheck
> Why is installcheck mentioned here?


Oops, a copy/pasto from the original contrib module this stuff was
initially implemented as; will fix.

>
> I don't think that it is appropriate to place the SQL-callable part in
> the existing checksum.c.  I would suggest instead a new file, say
> checksumfuncs.c in src/backend/utils/adt/, holding any SQL functions
> for checksums.


Agreed.

>
> -SUBDIRS = perl regress isolation modules authentication recovery
>  subscription
> +SUBDIRS = perl regress isolation modules authentication check_relation \
> +     recovery subscription
> It seems to me that this test would be a good fit for
> src/test/modules/test_misc/.


WFM.

>
> +static void
> +check_all_relations(TupleDesc tupdesc, Tuplestorestate *tupstore,
> +                         ForkNumber forknum)
> Per the argument of bloat, I think that I would remove
> check_all_relations(), as this function could take a very long time to
> run, and just make the SQL function strict.


No objection.

>
> + * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
> + *   disk either before the end of the next checkpoint or during recovery in
> + *   case of unsafe shutdown
> Wouldn't it actually be a good thing to check that the page on storage
> is fine in this case?  This depends on the system settings and the
> checkpoint frequency, but checkpoint_timeout can be extended up to 1
> day.  And plenty of things could happen to the storage in one day,
> including a base backup picking up a corrupted page on storage
> that this function would not be able to detect.


How could that lead to data corruption?  If postgres crashes before the
checkpoint completes, the block will be overwritten during recovery, and if a
base backup is taken the block will also be overwritten while replaying all the
required WAL.  Detecting corrupted blocks in those cases would have the
merit of warning about possibly broken hardware sooner, but it would
also make the check more expensive, as the odds of preventing postgres from
evicting a dirty block are way higher.  Maybe an additional GUC for that?

For the record, when I first tested that feature I did try to check dirty
blocks, and it seemed that dirty blocks of shared relations were sometimes
wrongly reported as corrupted.  I didn't try to investigate more though.


> + *   we detect if a block is in shared_buffers or not.  See get_buffer()
> + *   comments for more details about the locking strategy.
> get_buffer() does not exist in your patch, check_get_buffer() does.


Oops, will fix.


>

> + * - if a block is not found in shared_buffers, the LWLock is relased and the
> [...]
> + * To avoid torn page and possible false postives when reading data, and
> Typos.
>
> +   if (!DataChecksumsEnabled())
> +       elog(ERROR, "Data checksums are not enabled");
> Note that elog() is for the class of errors which are never expected,
> and here a caller of pg_check_relation() with checksums disabled can
> trigger that.  So you need to call ereport() with
> ERRCODE_FEATURE_NOT_SUPPORTED.


Indeed, will fix.


>
> + * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
> + *   disk either before the end of the next checkpoint or during recovery in
> + *   case of unsafe shutdown
> Not sure that the indentation is going to react well on that part of
> the patch, perhaps it would be better to add some "/*-------" at the
> beginning and end of the comment block to tell pgindent to ignore this
> part?


Ok.  Although I think only the beginning comment is needed?

>
> Based on the feedback gathered on this thread, I guess that you should
> have a SRF returning the list of broken blocks, as well as NOTICE
> messages.


The current patch has an SRF and a WARNING message; do you want an additional
NOTICE message, or to downgrade the existing one?

> Another thing to consider is the addition of a range
> argument to only check a certain portion of the blocks, say one
> segment file at a time, etc.  Fine by me to not include in the first
> flavor of the patch.


Ok!


> The patch needs documentation.


I'll try to add some.



Re: Online checksums verification in the backend

Julien Rouhaud
On Mon, Mar 16, 2020 at 01:53:35PM +0900, Masahiko Sawada wrote:
>
> In addition to comments from Michael-san, here are my comments:
>
> 1.
> +   if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES))
> +       ereport(ERROR,
> +               (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
> +                errmsg("only superuser or a member of the
> pg_read_server_files role may use this function")));


Good point!  I'll fix it.


> +
> +   if (!DataChecksumsEnabled())
> +       elog(ERROR, "Data checksums are not enabled");
>
> I think it's better to reverse the order of the above checks.


Indeed.


>
> 2.
> +#define CRF_COLS           5   /* Number of output arguments in the SRF */
>
> Should it be SRF_COLS?


Oops, will fix.


>
> 3.
> +static void
> +check_delay_point(void)
> +{
> +   /* Always check for interrupts */
> +   CHECK_FOR_INTERRUPTS();
> +
> +   /* Nap if appropriate */
> +   if (!InterruptPending && VacuumCostBalance >= VacuumCostLimit)
> +   {
> +       int         msec;
> +
> +       msec = VacuumCostDelay * VacuumCostBalance / VacuumCostLimit;
> +       if (msec > VacuumCostDelay * 4)
> +           msec = VacuumCostDelay * 4;
> +
> +       pg_usleep(msec * 1000L);
> +
> +       VacuumCostBalance = 0;
> +
> +       /* Might have gotten an interrupt while sleeping */
> +       CHECK_FOR_INTERRUPTS();
> +   }
> +}
>
> Even if we use vacuum delay for this function, I think we need to
> check VacuumCostActive and return if it's false, or it's better to just
> return if VacuumCostDelay == 0.


Good point, I'll fix that.


>
> 4.
> +static void
> +check_all_relations(TupleDesc tupdesc, Tuplestorestate *tupstore,
> +                         ForkNumber forknum)
>
> I also agree with Michael-san to remove this function. Instead we can
> check all relations by:
>
> select pg_check_relation(oid) from pg_class;


Sure, but ideally we should do that in a client program (e.g. pg_checksums)
that wouldn't keep a transaction active for the whole execution.


> 5.
> Other typos
>
> s/dirted/dirtied/
> s/explictly/explicitly/


Will fix, thanks!


Reply | Threaded
Open this post in threaded view
|

Re: Online checksums verification in the backend

Julien Rouhaud
On Mon, Mar 16, 2020 at 09:42:39AM +0100, Julien Rouhaud wrote:
> On Mon, Mar 16, 2020 at 01:53:35PM +0900, Masahiko Sawada wrote:
> >
> > In addition to comments from Michael-san, here are my comments:

Thanks both for the reviews.  I'm attaching a v3 with all comments addressed,
except:

> It seems to me that this test would be a good fit for
> src/test/modules/test_misc/.


AFAICT this is explicitly documented as tests for various extensions, and for
now it's a core function, so I didn't move it.


> +Run
> +    make check
> +or
> +    make installcheck
> Why is installcheck mentioned here?


This is actually already used in multiple other test READMEs.

Attachment: v3-0001-Add-a-pg_check_relation-function.patch (38K)

Re: Online checksums verification in the backend

Julien Rouhaud
On Mon, Mar 16, 2020 at 02:15:27PM +0100, Julien Rouhaud wrote:

> On Mon, Mar 16, 2020 at 09:42:39AM +0100, Julien Rouhaud wrote:
> > On Mon, Mar 16, 2020 at 01:53:35PM +0900, Masahiko Sawada wrote:
> > >
> > > In addition to comments from Michael-san, here are my comments:
>
> Thanks both for the reviews.  I'm attaching a v3 with all comments addressed,
> except:
>
> > It seems to me that this test would be a good fit for
> > src/test/modules/test_misc/.
>
>
> AFAICT this is explicitly documented as tests for various extensions, and for
> now it's a core function, so I didn't move it.
>
>
> > +Run
> > +    make check
> > +or
> > +    make installcheck
> > Why is installcheck mentioned here?
>
>
> This is actually already used in multiple other test READMEs.

Sorry, I forgot to update the regression tests.  v4 attached.

Attachment: v4-0001-Add-a-pg_check_relation-function.patch (39K)

Re: Online checksums verification in the backend

Michael Paquier
On Mon, Mar 16, 2020 at 09:21:22AM +0100, Julien Rouhaud wrote:

> On Mon, Mar 16, 2020 at 12:29:28PM +0900, Michael Paquier wrote:
>> With a large amount of
>> shared buffer eviction you actually increase the risk of torn page
>> reads.  Instead of logic relying on partition mapping locks, which
>> could be unwise on performance grounds, did you consider different
>> approaches?  For example, a kind of pre-emptive lock on the page in
>> storage, preventing any shared buffer operation from happening while
>> the block is read from storage; that would act like a barrier.
>
> Even with a workload having a large shared_buffers eviction pattern, I don't
> think that there's a high probability of hitting a torn page.  Unless I'm
> mistaken, it can only happen if all those steps happen concurrently
> with our block read, just after releasing the LWLock:
>
> - postgres reads the same block into shared_buffers (including all the locking)
> - dirties it
> - writes part of the page
>
> It's certainly possible, but it seems so unlikely that the optimistic lock-less
> approach looks like a very good tradeoff.
Having false reports in this area could be very confusing for the
user.  That's for example possible now with checksum verification and
base backups.

>> I guess that this leads to the fact that this function may be better as
>> a contrib module, with the addition of some better-suited APIs in core
>> (see paragraph above).
>
> Below?

Above.  This thought, more precisely:
>> For example, a kind of pre-emptive lock on the page in
>> storage, preventing any shared buffer operation from happening while
>> the block is read from storage; that would act like a barrier.

> For the record, when I first tested that feature I did try to check dirty
> blocks, and it seemed that dirty blocks of shared relations were sometimes
> wrongly reported as corrupted.  I didn't try to investigate more though.

Hmm.  It would be good to look at that, correct verification of shared
relations matters.

>> + * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
>> + *   disk either before the end of the next checkpoint or during recovery in
>> + *   case of unsafe shutdown
>> Not sure that the indentation is going to react well on that part of
>> the patch, perhaps it would be better to add some "/*-------" at the
>> beginning and end of the comment block to tell pgindent to ignore this
>> part?
>
> Ok.  Although I think only the beginning comment is needed?

From src/tools/pgindent/README:
"pgindent will reflow any comment block that's not at the left margin.
If this messes up manual formatting that ought to be preserved,
protect the comment block with some dashes:"

        /*----------
         * Text here will not be touched by pgindent.
         *----------
         */

>> Based on the feedback gathered on this thread, I guess that you should
>> have a SRF returning the list of broken blocks, as well as NOTICE
>> messages.
>
> The current patch has an SRF and a WARNING message; do you want an additional
> NOTICE message, or to downgrade the existing one?

Right, I'm not sure which one is better; for zero_damaged_pages a WARNING
is used.
--
Michael


Re: Online checksums verification in the backend

Julien Rouhaud
On Wed, Mar 18, 2020 at 01:20:47PM +0900, Michael Paquier wrote:

> On Mon, Mar 16, 2020 at 09:21:22AM +0100, Julien Rouhaud wrote:
> > On Mon, Mar 16, 2020 at 12:29:28PM +0900, Michael Paquier wrote:
> >> With a large amount of
> >> shared buffer eviction you actually increase the risk of torn page
> >> reads.  Instead of logic relying on partition mapping locks, which
> >> could be unwise on performance grounds, did you consider different
> >> approaches?  For example, a kind of pre-emptive lock on the page in
> >> storage, preventing any shared buffer operation from happening while
> >> the block is read from storage; that would act like a barrier.
> >
> > Even with a workload having a large shared_buffers eviction pattern, I don't
> > think that there's a high probability of hitting a torn page.  Unless I'm
> > mistaken, it can only happen if all those steps happen concurrently
> > with our block read, just after releasing the LWLock:
> >
> > - postgres reads the same block into shared_buffers (including all the locking)
> > - dirties it
> > - writes part of the page
> >
> > It's certainly possible, but it seems so unlikely that the optimistic lock-less
> > approach looks like a very good tradeoff.
>
> Having false reports in this area could be very confusing for the
> user.  That's for example possible now with checksum verification and
> base backups.


I agree, however this shouldn't be the case here, as the block will be
rechecked a second time while holding the proper lock, in case of a
possible false positive, before being reported as corrupted.  So the
only downside is checking twice a corrupted block that's not found in
shared buffers (or concurrently loaded/modified/half-flushed).  As the
number of corrupted or concurrently loaded/modified/half-flushed blocks
should usually be close to zero, it seems worthwhile to have a lockless
check first for performance reasons.


> > For the record, when I first tested that feature I did try to check dirty
> > blocks, and it seemed that dirty blocks of shared relations were sometimes
> > wrongly reported as corrupted.  I didn't try to investigate more though.
>
> Hmm.  It would be good to look at that, correct verification of shared
> relations matters.


I'll try to investigate, but non-dirty shared relation blocks can be checked
and work as intended.


>
> >> + * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
> >> + *   disk either before the end of the next checkpoint or during recovery in
> >> + *   case of unsafe shutdown
> >> Not sure that the indentation is going to react well on that part of
> >> the patch, perhaps it would be better to add some "/*-------" at the
> >> beginning and end of the comment block to tell pgindent to ignore this
> >> part?
> >
> > Ok.  Although I think only the beginning comment is needed?
>
> From src/tools/pgindent/README:
> "pgindent will reflow any comment block that's not at the left margin.
> If this messes up manual formatting that ought to be preserved,
> protect the comment block with some dashes:"
>
>         /*----------
>          * Text here will not be touched by pgindent.
>          *----------
>          */


For instance, the block comment in gen_partprune_steps_internal()
disagrees.  Anyway, I added both, as all the nearby code does that for
overall function comments.



Re: Online checksums verification in the backend

Julien Rouhaud
On Wed, Mar 18, 2020 at 01:20:47PM +0900, Michael Paquier wrote:

> On Mon, Mar 16, 2020 at 09:21:22AM +0100, Julien Rouhaud wrote:
> > On Mon, Mar 16, 2020 at 12:29:28PM +0900, Michael Paquier wrote:
> >> Based on the feedback gathered on this thread, I guess that you should
> >> have a SRF returning the list of broken blocks, as well as NOTICE
> >> messages.
> >
> > The current patch has an SRF and a WARNING message; do you want an additional
> > NOTICE message, or to downgrade the existing one?
>
> Right, I'm not sure which one is better; for zero_damaged_pages a WARNING
> is used.


Sorry, I forgot to answer that.  IMHO a WARNING is better here, as we're talking
about data corruption.  Also, a WARNING will be reported to both the client and
server logs, which sounds like a good thing.



Re: Online checksums verification in the backend

Julien Rouhaud
In reply to this post by Julien Rouhaud
On Wed, Mar 18, 2020 at 07:06:19AM +0100, Julien Rouhaud wrote:

> On Wed, Mar 18, 2020 at 01:20:47PM +0900, Michael Paquier wrote:
> > On Mon, Mar 16, 2020 at 09:21:22AM +0100, Julien Rouhaud wrote:
> > > On Mon, Mar 16, 2020 at 12:29:28PM +0900, Michael Paquier wrote:
> > >> With a large amount of
> > >> shared buffer eviction you actually increase the risk of torn page
> > >> reads.  Instead of a logic relying on partition mapping locks, which
> > >> could be unwise on performance grounds, did you consider different
> > >> approaches?  For example a kind of pre-emptive lock on the page in
> > >> storage to prevent any shared buffer operation to happen while the
> > >> block is read from storage, that would act like a barrier.
> > >
> > > Even with a workload having a large shared_buffers eviction pattern, I don't
> > > think that there's a high probability of hitting a torn page.  Unless I'm
> > > mistaken it can only happen if all those steps happen concurrently to doing the
> > > block read just after releasing the LWLock:
> > >
> > > - postgres reads the same block in shared_buffers (including all the locking)
> > > - dirties it
> > > - writes part of the page
> > >
> > > It's certainly possible, but it seems so unlikely that the optimistic lock-less
> > > approach seems like a very good tradeoff.
> >
> > Having false reports in this area could be very confusing for the
> > user.  That's for example possible now with checksum verification and
> > base backups.
>
>
> I agree, however this shouldn't be the case here, as a suspect block will be
> rechecked a second time while holding the proper lock, to rule out a false
> positive, before being reported as corrupted.  So the only downside is
> checking twice a corrupted block that's not found in shared buffers (or one
> concurrently loaded/modified/half flushed).  As the number of corrupted or
> concurrently loaded/modified/half-flushed blocks should usually be close to
> zero, it seems worthwhile to do a lockless check first for performance
> reasons.

I just noticed some dumb mistakes while adding the new GUCs.  v5 attached to
fix that, no other changes.

Attachment: v5-0001-Add-a-pg_check_relation-function.patch (39K)

Re: Online checksums verification in the backend

Masahiko Sawada-2
On Wed, 18 Mar 2020 at 19:11, Julien Rouhaud <[hidden email]> wrote:

>
> On Wed, Mar 18, 2020 at 07:06:19AM +0100, Julien Rouhaud wrote:
> > On Wed, Mar 18, 2020 at 01:20:47PM +0900, Michael Paquier wrote:
> > > On Mon, Mar 16, 2020 at 09:21:22AM +0100, Julien Rouhaud wrote:
> > > > On Mon, Mar 16, 2020 at 12:29:28PM +0900, Michael Paquier wrote:
> > > >> With a large amount of
> > > >> shared buffer eviction you actually increase the risk of torn page
> > > >> reads.  Instead of a logic relying on partition mapping locks, which
> > > >> could be unwise on performance grounds, did you consider different
> > > >> approaches?  For example a kind of pre-emptive lock on the page in
> > > >> storage to prevent any shared buffer operation to happen while the
> > > >> block is read from storage, that would act like a barrier.
> > > >
> > > > Even with a workload having a large shared_buffers eviction pattern, I don't
> > > > think that there's a high probability of hitting a torn page.  Unless I'm
> > > > mistaken it can only happen if all those steps happen concurrently to doing the
> > > > block read just after releasing the LWLock:
> > > >
> > > > - postgres reads the same block in shared_buffers (including all the locking)
> > > > - dirties it
> > > > - writes part of the page
> > > >
> > > > It's certainly possible, but it seems so unlikely that the optimistic lock-less
> > > > approach seems like a very good tradeoff.
> > >
> > > Having false reports in this area could be very confusing for the
> > > user.  That's for example possible now with checksum verification and
> > > base backups.
> >
> >
> > I agree, however this shouldn't be the case here, as a suspect block will be
> > rechecked a second time while holding the proper lock, to rule out a false
> > positive, before being reported as corrupted.  So the only downside is
> > checking twice a corrupted block that's not found in shared buffers (or one
> > concurrently loaded/modified/half flushed).  As the number of corrupted or
> > concurrently loaded/modified/half-flushed blocks should usually be close to
> > zero, it seems worthwhile to do a lockless check first for performance
> > reasons.
>
>
> I just noticed some dumb mistakes while adding the new GUCs.  v5 attached to
> fix that, no other changes.

Thank you for updating the patch. I have some comments:

1.
+       <entry>
+        <literal><function>pg_check_relation(<parameter>relation</parameter>
<type>oid</type>, <parameter>fork</parameter>
<type>text</type>)</function></literal>
+       </entry>

Looking at the declaration of pg_check_relation, 'relation' and 'fork'
are optional arguments. So I think the above is not correct. But as I
commented below, 'relation' should not be optional, so maybe the above
line could be:

+        <literal><function>pg_check_relation(<parameter>relation</parameter>
<type>oid</type>[, <parameter>fork</parameter>
<type>text</type>])</function></literal>

2.
+   <indexterm>
+    <primary>pg_check_relation</primary>
+   </indexterm>
+   <para>
+    <function>pg_check_relation</function> iterates over all the blocks of all
+    or the specified fork of a given relation and verifies their checksums.  It
+    returns the list of blocks for which the found checksum doesn't match the
+    expected one.  You must be a member of the
+    <literal>pg_read_all_stats</literal> role to use this function.  It can
+    only be used if data checksums are enabled.  See <xref
+    linkend="app-initdb-data-checksums"/> for more information.
+   </para>

* I think we need a description of the possible values for 'fork'
(i.e., 'main', 'vm', 'fsm' and 'init'), and of the behavior when
'fork' is omitted.

* Do we need to explain the checksum cost-based delay here?

3.
+CREATE OR REPLACE FUNCTION pg_check_relation(
+  IN relation regclass DEFAULT NULL::regclass, IN fork text DEFAULT NULL::text,
+  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
+  OUT expected_checksum integer, OUT found_checksum integer)
+  RETURNS SETOF record VOLATILE LANGUAGE internal AS 'pg_check_relation'
+  PARALLEL RESTRICTED;

Now that pg_check_relation doesn't accept NULL as 'relation', I think
we need to make 'relation' a mandatory argument.

4.
+   /* Check if the relation (still) exists */
+   if (SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid)))
+   {
+       /*
+        * We use a standard relation_open() to acquire the initial lock.  It
+        * means that this will block until the lock is acquired, or will
+        * raise an ERROR if lock_timeout has been set.  If caller wants to
+        * check multiple tables while relying on a maximum wait time, it
+        * should process tables one by one instead of relying on a global
+        * processing with the main SRF.
+        */
+       relation = relation_open(relid, AccessShareLock);
+   }

IIUC the above was necessary because we used to have
check_all_relations(), which iterated over all relations in the
database to do checksum checks. But now that we don't have that
function and pg_check_relation() processes a single relation, can we
just do relation_open() here?
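
That is, something as simple as the following sketch (note that
relation_open() would then raise an error if the relation has been
concurrently dropped; try_relation_open() could be used instead to
keep the silent-skip behavior):

    /* No syscache probe needed: just open the relation directly. */
    relation = relation_open(relid, AccessShareLock);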

5.
I think we need to check if the relation is a temp relation. I'm not
sure it's worth to check checksums of temp relations but at least we
need not to check other's temp relations.
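
For instance, something like this sketch (the error wording here is
only illustrative):

    /*
     * Reject temporary relations of other sessions: their pages live
     * in the owning backend's local buffers and cannot be read safely
     * from this backend.
     */
    if (RELATION_IS_OTHER_TEMP(relation))
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("cannot access temporary tables of other sessions")));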

6.
+/*
+ * Safely read the wanted buffer from disk, dealing with possible concurrency
+ * issues.  Note that if a buffer is found dirty in shared_buffers, no read will
+ * be performed and the caller will be informed that no check should be done.
+ * We can safely ignore such buffers as they'll be written before the next
+ * checkpoint's completion.
+ *
+ * The following locks can be used in this function:
+ *
+ *   - shared LWLock on the target buffer pool partition mapping.
+ *   - IOLock on the buffer
+ *
+ * The IOLock is taken when reading the buffer from disk if it exists in
+ * shared_buffers, to avoid torn pages.
+ *
+ * If the buffer isn't in shared_buffers, it'll be read from disk without any
+ * lock unless caller asked otherwise, setting needlock.  In this case, the
+ * read will be done while the buffer mapping partition LWLock is still being
+ * held.  Reading with this lock is to avoid the unlikely but possible case
+ * that a buffer wasn't present in shared buffers when we checked, but was
+ * then alloc'ed in shared_buffers, modified and flushed concurrently when we
+ * later try to read it, leading to a false positive due to a torn page.  The
+ * caller can first read the buffer without holding the target buffer mapping
+ * partition LWLock, as an optimistic approach, and reread the buffer
+ * from disk in case of error.
+ *
+ * Caller should hold an AccessShareLock on the Relation
+ */

I think the above comment also needs some "/*-------" at the beginning and end.

7.
+static void
+check_get_buffer(Relation relation, ForkNumber forknum,
+                BlockNumber blkno, char *buffer, bool needlock, bool *checkit,
+                bool *found_in_sb)
+{

Maybe we can make check_get_buffer() return a bool indicating whether
we found a buffer to check, instead of having '*checkit'. That way, we
can simplify the following code:

+       check_get_buffer(relation, forknum, blkno, buffer, force_lock,
+                        &checkit, &found_in_sb);
+
+       if (!checkit)
+           continue;

to something like:

+ if (!check_get_buffer(relation, forknum, blkno, buffer, force_lock,
+                          &found_in_sb))
+     continue;

8.
+       if (PageIsVerified(buffer, blkno))
+       {
+           /*
+            * If the page is really new, there won't be any checksum to be
+            * computed or expected.
+            */
+           *chk_expected = *chk_found = NoComputedChecksum;
+           return true;
+       }
+       else
+       {
+           /*
+            * There's a corruption, but since this affects PageIsNew, we
+            * can't compute a checksum, so set NoComputedChecksum for the
+            * expected checksum.
+            */
+           *chk_expected = NoComputedChecksum;
+           *chk_found = hdr->pd_checksum;
+       }
+       return false;

* I think the 'else' is not necessary here.

* Setting *chk_expected and *chk_found seems useless when we return
true, since the caller doesn't use them.

* Should we forcibly overwrite ignore_checksum_failure to off in
pg_check_relation()? Otherwise, this logic does not seem to work
correctly (see the sketch after this list).

* I don't understand why we cannot compute a checksum in the case
where a page looks like a new page but is actually corrupted. Could
you please elaborate on that?
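
To illustrate the ignore_checksum_failure point, here is a hypothetical
sketch (not code from the patch):

    /*
     * Temporarily force checksum failures to be detected, restoring
     * the previous value on exit, even on error.
     */
    bool        save_ignore = ignore_checksum_failure;

    ignore_checksum_failure = false;

    PG_TRY();
    {
        /* ... perform the per-block checks here ... */
    }
    PG_FINALLY();
    {
        ignore_checksum_failure = save_ignore;
    }
    PG_END_TRY();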

9.
+   {
+       {"checksum_cost_page_hit", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+           gettext_noop("Checksum cost for a page found in the buffer cache."),
+           NULL
+       },
+       &ChecksumCostPageHit,
+       1, 0, 10000,
+       NULL, NULL, NULL
+   },

* There is no description of the four newly added GUC parameters in
the docs.

* We need to put the new GUC parameters into postgresql.conf.sample as
well.

* The patch doesn't use checksum_cost_page_hit at all.
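
For context, the throttling these GUCs presumably drive would look
roughly like the following, modeled on vacuum_delay_point() (a
hypothetical checksum_delay_point(), not the patch's actual code):

    static void
    checksum_delay_point(void)
    {
        /* Nap if appropriate, mirroring the vacuum cost-based delay. */
        if (ChecksumCostActive && !InterruptPending &&
            ChecksumCostBalance >= ChecksumCostLimit)
        {
            double      msec;

            msec = ChecksumCostDelay * ChecksumCostBalance / ChecksumCostLimit;
            if (msec > ChecksumCostDelay * 4)
                msec = ChecksumCostDelay * 4;

            pg_usleep((long) (msec * 1000));

            /* Reset the accumulated cost after sleeping. */
            ChecksumCostBalance = 0;
        }
    }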

10.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index eb19644419..37f63e747c 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -134,6 +134,14 @@ int            max_worker_processes = 8;
 int            max_parallel_workers = 8;
 int            MaxBackends = 0;

+int            ChecksumCostPageHit = 1;    /* GUC parameters for checksum check */
+int            ChecksumCostPageMiss = 10;
+int            ChecksumCostLimit = 200;
+double     ChecksumCostDelay = 0;
+
+int            ChecksumCostBalance = 0;    /* working state for checksums check */
+bool       ChecksumCostActive = false;

Can we declare them in checksum.c, since these parameters are used only
in checksum.c and it does the I/O by itself?

11.
+       /* Report the failure to the stat collector and the logs. */
+       pgstat_report_checksum_failure();
+       ereport(WARNING,
+               (errcode(ERRCODE_DATA_CORRUPTED),
+                errmsg("invalid page in block %u of relation %s",
+                       blkno,
+                       relpath(relation->rd_smgr->smgr_rnode, forknum))));

I think we could end up calling pgstat_report_checksum_failure() and
emitting the WARNING twice for the same page, since PageIsVerified()
also does the same.
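
One way to avoid the duplication would be to make the logging and the
stats report in PageIsVerified() optional, e.g. with a flags-taking
variant (a purely hypothetical interface sketch):

    /* Hypothetical flags controlling PageIsVerified()'s side effects. */
    #define PIV_LOG_WARNING   (1 << 0)  /* emit the WARNING on failure */
    #define PIV_REPORT_STAT   (1 << 1)  /* report to the stats collector */

    extern bool PageIsVerifiedExtended(Page page, BlockNumber blkno,
                                       int flags);

The checksum check could then pass neither flag and do its own, single,
report.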

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

