Gotchas about pg_verify_checksums

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
15 messages Options
Reply | Threaded
Open this post in threaded view
|

Gotchas about pg_verify_checksums

Michael Paquier-2
Hi all,

I have not been giving much attention to the thread about enabling
checksums online, which has resulted in the revert of the feature, but
there is still pg_verify_checksums around.  So I looked at it a bit.

I have a couple of questions/gotchas about it:
1) The documentation states that the cluster needs to be offline.
Doesn't this imply that the cluster can also be forcibly killed?  It
seems to me that the documentation ought to say that the cluster needs
to be shut down cleanly instead.  Mentioning that only in the notes of
the documentation would be enough in my opinion.
2) On a cluster where checksums are disabled, aka the control file says
so, then using --force does not result in incorrect blocks to be
reported.  This is caused by the presence of the check on
PG_DATA_CHECKSUM_VERSION which does not match for a cluster where
checksums have been disabled.  Wouldn't one want to know this
information as well to know what are the portions of the data folder not
treated yet by checksum updates?
3) Is the force option actually useful?  I assume that --force is useful
if one wants to check how checksums are computed if a switch off -> on
is used to see the progress of the operation or to see how much progress
has been done after doing an online switch, which is also what a228cc13
outlines.  Still this requires the cluster to be offline...

Attached is a patch which addresses point 1) (with one markup fix in
bonus).  I'd love to get more input about the other bullet points.

Thanks,
--
Michael

checksum-docs.patch (968 bytes) Download Attachment
signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Gotchas about pg_verify_checksums

Daniel Gustafsson
> On 10 Apr 2018, at 06:21, Michael Paquier <[hidden email]> wrote:

> 1) The documentation states that the cluster needs to be offline.
> Doesn't this imply that the cluster can also be forcibly killed?  It
> seems to me that the documentation ought to say that the cluster needs
> to be shut down cleanly instead.  Mentioning that only in the notes of
> the documentation would be enough in my opinion.

Does it really imply that?  Either way, the tool could potentially be useful
for debugging a broken cluster so I’m not sure that stating it requires a
cleanly shut down server is useful.  That being said, you’re absolutely right
that the current wording isn’t great, I think “The cluster must be shut down
before running..” would be better.

> 2) On a cluster where checksums are disabled, aka the control file says
> so, then using --force does not result in incorrect blocks to be
> reported.  This is caused by the presence of the check on
> PG_DATA_CHECKSUM_VERSION which does not match for a cluster where
> checksums have been disabled.  Wouldn't one want to know this
> information as well to know what are the portions of the data folder not
> treated yet by checksum updates?
> 3) Is the force option actually useful?  I assume that --force is useful
> if one wants to check how checksums are computed if a switch off -> on
> is used to see the progress of the operation or to see how much progress
> has been done after doing an online switch, which is also what a228cc13
> outlines.  Still this requires the cluster to be offline…
Thinking more on this, I don’t think the -f option should be in the tool until
we have the ability to turn on/off checksums.  Since checksums are always on,
or always off, -f is at best confusing IMO.  The attached patch removes -f,
when we can turn checksums on/off we can rethink how -f should behave.

cheers ./daniel


remove_force.patch (3K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Gotchas about pg_verify_checksums

Peter Geoghegan-4
On Tue, Apr 10, 2018 at 1:27 PM, Daniel Gustafsson <[hidden email]> wrote:

>> On 10 Apr 2018, at 06:21, Michael Paquier <[hidden email]> wrote:
>> 1) The documentation states that the cluster needs to be offline.
>> Doesn't this imply that the cluster can also be forcibly killed?  It
>> seems to me that the documentation ought to say that the cluster needs
>> to be shut down cleanly instead.  Mentioning that only in the notes of
>> the documentation would be enough in my opinion.
>
> Does it really imply that?  Either way, the tool could potentially be useful
> for debugging a broken cluster so I’m not sure that stating it requires a
> cleanly shut down server is useful.  That being said, you’re absolutely right
> that the current wording isn’t great, I think “The cluster must be shut down
> before running..” would be better.

I agree with Michael -- shutting down the server using immediate mode
could lead to torn pages, that crash recovery will need to repair at a
later stage. I think that some strong caveats around this are required
in the pg_verify_checksums docs, at a minimum.

--
Peter Geoghegan

Reply | Threaded
Open this post in threaded view
|

Re: Gotchas about pg_verify_checksums

Michael Paquier-2
On Tue, Apr 10, 2018 at 02:40:58PM -0700, Peter Geoghegan wrote:
> I agree with Michael -- shutting down the server using immediate mode
> could lead to torn pages, that crash recovery will need to repair at a
> later stage. I think that some strong caveats around this are required
> in the pg_verify_checksums docs, at a minimum.

Peter, the code does the right thing as it requires the instance's
control file state to be either DB_SHUTDOWNED_IN_RECOVERY or
DB_SHUTDOWNED.  The documentation, on the contrary, implies that
the instance just needs to be offline, which can be anything as long as
the postmaster is stopped.  That's how I understand the current
wording.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Gotchas about pg_verify_checksums

Michael Paquier-2
In reply to this post by Daniel Gustafsson
On Tue, Apr 10, 2018 at 10:27:19PM +0200, Daniel Gustafsson wrote:
>> On 10 Apr 2018, at 06:21, Michael Paquier <[hidden email]> wrote:
> Does it really imply that?  Either way, the tool could potentially be useful
> for debugging a broken cluster so I’m not sure that stating it requires a
> cleanly shut down server is useful.

Torn pages could lead to false positives.  So I think that the tool's
assumptions are right.

> Thinking more on this, I don’t think the -f option should be in the tool until
> we have the ability to turn on/off checksums.  Since checksums are always on,
> or always off, -f is at best confusing IMO.  The attached patch removes -f,
> when we can turn checksums on/off we can rethink how -f should behave.

That's my impression as well, thanks for confirming.  Your patch looks
fine to me.

I am wondering as well if we should not actually rename the tool?  Why
not naming it pg_checksums instead of pg_verify_checksums, and add an
--action switch to it which can be used to work on checksums.  The
obvious option to support in v11 is a "verify" mode, but I would imagine
that a "disable" and "enable" modes would be useful as well, and all the
APIs are here already to be able to do an in-place update of the
checksums, and then switch the control file properly.  We have no idea
at this stage if a patch to enable checksums while the cluster is online
will be able to make it, still a way to switch checksums while the
cluster is offline is both reliable and easy to implement.  Not saying
do to that for v11 of course, I would like to keep the door open for
v12.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Gotchas about pg_verify_checksums

Peter Geoghegan-4
In reply to this post by Michael Paquier-2
On Tue, Apr 10, 2018 at 4:44 PM, Michael Paquier <[hidden email]> wrote:
> Peter, the code does the right thing as it requires the instance's
> control file state to be either DB_SHUTDOWNED_IN_RECOVERY or
> DB_SHUTDOWNED.  The documentation, on the contrary, implies that
> the instance just needs to be offline, which can be anything as long as
> the postmaster is stopped.  That's how I understand the current
> wording.

I see. The problem is clearly the documentation, then.

--
Peter Geoghegan

Reply | Threaded
Open this post in threaded view
|

Re: Gotchas about pg_verify_checksums

Daniel Gustafsson
In reply to this post by Michael Paquier-2
> On 11 Apr 2018, at 01:53, Michael Paquier <[hidden email]> wrote:
>
> On Tue, Apr 10, 2018 at 10:27:19PM +0200, Daniel Gustafsson wrote:
>>> On 10 Apr 2018, at 06:21, Michael Paquier <[hidden email]> wrote:
>> Does it really imply that?  Either way, the tool could potentially be useful
>> for debugging a broken cluster so I’m not sure that stating it requires a
>> cleanly shut down server is useful.
>
> Torn pages could lead to false positives.  So I think that the tool's
> assumptions are right.

Right, I misunderstood your initial email but I see what you mean.  Yes, you
are right and with that +1 on your patch.

> I am wondering as well if we should not actually rename the tool?  Why
> not naming it pg_checksums instead of pg_verify_checksums, and add an
> --action switch to it which can be used to work on checksums.  The
> obvious option to support in v11 is a "verify" mode, but I would imagine
> that a "disable" and "enable" modes would be useful as well, and all the
> APIs are here already to be able to do an in-place update of the
> checksums, and then switch the control file properly.  We have no idea
> at this stage if a patch to enable checksums while the cluster is online
> will be able to make it, still a way to switch checksums while the
> cluster is offline is both reliable and easy to implement.  Not saying
> do to that for v11 of course, I would like to keep the door open for
> v12.

Naming it pg_checksums, with only verification as an option, seems to me to
imply future direction for 12 more than what pg_verify_checksums does.  I would
leave it the way it is, but I don’t have very strong opinions (or any plans on
hacking on offline checksum enabling for that matter).

cheers ./daniel
Reply | Threaded
Open this post in threaded view
|

Re: Gotchas about pg_verify_checksums

Michael Paquier-2
On Wed, Apr 11, 2018 at 10:21:29PM +0200, Daniel Gustafsson wrote:
> Right, I misunderstood your initial email but I see what you mean.  Yes, you
> are right and with that +1 on your patch.

OK, no problem.

> Naming it pg_checksums, with only verification as an option, seems to me to
> imply future direction for 12 more than what pg_verify_checksums does.  I would
> leave it the way it is, but I don’t have very strong opinions (or any plans on
> hacking on offline checksum enabling for that matter).

Okay, I am fine to let such decision to you and Magnus at the end as the
authors and committers of the thing.  I think that I will just hack out
this tool myself after reusing this code if you don't mind of course..
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Gotchas about pg_verify_checksums

Michael Paquier-2
In reply to this post by Daniel Gustafsson
On Tue, Apr 10, 2018 at 10:27:19PM +0200, Daniel Gustafsson wrote:
> Thinking more on this, I don’t think the -f option should be in the tool until
> we have the ability to turn on/off checksums.  Since checksums are always on,
> or always off, -f is at best confusing IMO.  The attached patch removes -f,
> when we can turn checksums on/off we can rethink how -f should behave.

You have forgotten to update the list of arguments in getopt().
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Gotchas about pg_verify_checksums

Daniel Gustafsson
> On 15 Apr 2018, at 09:36, Michael Paquier <[hidden email]> wrote:
>
> On Tue, Apr 10, 2018 at 10:27:19PM +0200, Daniel Gustafsson wrote:
>> Thinking more on this, I don’t think the -f option should be in the tool until
>> we have the ability to turn on/off checksums.  Since checksums are always on,
>> or always off, -f is at best confusing IMO.  The attached patch removes -f,
>> when we can turn checksums on/off we can rethink how -f should behave.
>
> You have forgotten to update the list of arguments in getopt().

Doh. Thanks, fixed in attached patch.

cheers ./daniel


remove_force-v2.patch (3K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Gotchas about pg_verify_checksums

Magnus Hagander-2


On Sun, Apr 15, 2018 at 9:49 AM, Daniel Gustafsson <[hidden email]> wrote:
> On 15 Apr 2018, at 09:36, Michael Paquier <[hidden email]> wrote:
>
> On Tue, Apr 10, 2018 at 10:27:19PM +0200, Daniel Gustafsson wrote:
>> Thinking more on this, I don’t think the -f option should be in the tool until
>> we have the ability to turn on/off checksums.  Since checksums are always on,
>> or always off, -f is at best confusing IMO.  The attached patch removes -f,
>> when we can turn checksums on/off we can rethink how -f should behave.
>
> You have forgotten to update the list of arguments in getopt().

Doh. Thanks, fixed in attached patch.

I have applied this patch along with the documentation patch from Michael.

Thanks! 

--
Reply | Threaded
Open this post in threaded view
|

Re: Gotchas about pg_verify_checksums

Michael Paquier-2
In reply to this post by Michael Paquier-2
On Thu, Apr 12, 2018 at 05:47:29AM +0900, Michael Paquier wrote:
> On Wed, Apr 11, 2018 at 10:21:29PM +0200, Daniel Gustafsson wrote:
>> Naming it pg_checksums, with only verification as an option, seems to me to
>> imply future direction for 12 more than what pg_verify_checksums does.  I would
>> leave it the way it is, but I don’t have very strong opinions (or any plans on
>> hacking on offline checksum enabling for that matter).
>
> Okay, I am fine to let such decision to you and Magnus at the end as the
> authors and committers of the thing.  I think that I will just hack out
> this tool myself after reusing this code if you don't mind of course..

If anybody on this list is interested, I have extended
pg_verify_checksums into this tool I called pg_checksums which is able
to do a couple of more things like enable checksums, disable checksums,
bypass the final fsync of the data folder with a --no-sync.  The core
feature to verify checksums is still around of course.

This allows anybody to control the checksums of a cluster that was shut
down cleanly, which I wanted for some time myself:
https://github.com/michaelpq/pg_plugins/tree/master/pg_checksums

That's only compatible with v11, but it can be easily tweaked to be
compatible with past versions.  I am of course not proposing to include
that in v11 or above.  If anybody wishes to do a proposal based on that
stuff for v12, please feel free to.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Gotchas about pg_verify_checksums

Michael Paquier-2
In reply to this post by Magnus Hagander-2
On Sun, Apr 15, 2018 at 01:54:13PM +0200, Magnus Hagander wrote:
> I have applied this patch along with the documentation patch from Michael.

Thanks, I saw the commits.  That's fine by me.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Gotchas about pg_verify_checksums

Michael Banck-2
In reply to this post by Michael Paquier-2
Hi Michael,

On Mon, Apr 16, 2018 at 11:30:30AM +0900, Michael Paquier wrote:

> On Thu, Apr 12, 2018 at 05:47:29AM +0900, Michael Paquier wrote:
> > On Wed, Apr 11, 2018 at 10:21:29PM +0200, Daniel Gustafsson wrote:
> >> Naming it pg_checksums, with only verification as an option, seems to me to
> >> imply future direction for 12 more than what pg_verify_checksums does.  I would
> >> leave it the way it is, but I don’t have very strong opinions (or any plans on
> >> hacking on offline checksum enabling for that matter).
> >
> > Okay, I am fine to let such decision to you and Magnus at the end as the
> > authors and committers of the thing.  I think that I will just hack out
> > this tool myself after reusing this code if you don't mind of course..
>
> If anybody on this list is interested, I have extended
> pg_verify_checksums into this tool I called pg_checksums which is able
> to do a couple of more things like enable checksums, disable checksums,
> bypass the final fsync of the data folder with a --no-sync.  The core
> feature to verify checksums is still around of course.
>
> This allows anybody to control the checksums of a cluster that was shut
> down cleanly, which I wanted for some time myself:
> https://github.com/michaelpq/pg_plugins/tree/master/pg_checksums

I didn't see this part of the thread immediately, but following a
discussion with Magnus on offline activation/deactivation of checksums
at PGConf.DE, I independently adapted pg_verify_checksums in a very
similar fashion on my way back from the conference, also renaming it to
pg_checksums etc...

> That's only compatible with v11, but it can be easily tweaked to be
> compatible with past versions.  I am of course not proposing to include
> that in v11 or above.  If anybody wishes to do a proposal based on that
> stuff for v12, please feel free to.

I think it is useful for legacy installation which "forgot" to enable
checksums when they setup their instances and/or came around that they
are useful in the meantime and are willing to take the downtime in order
to activate them.

So I decided to add some support for earlier version in my version of
the program, and pushed it to https://github.com/credativ/pg_checksums
if anybody is interested in that. I have to admit that it is quite less
fancy than your version.


Michael

--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: [hidden email]

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Reply | Threaded
Open this post in threaded view
|

Re: Gotchas about pg_verify_checksums

Michael Paquier-2
On Tue, Apr 17, 2018 at 09:43:47PM +0200, Michael Banck wrote:
> So I decided to add some support for earlier version in my version of
> the program, and pushed it to https://github.com/credativ/pg_checksums
> if anybody is interested in that. I have to admit that it is quite less
> fancy than your version.

Both things have the same features, except that I added as well long
options.  I can see two minor issues with your version:
1) The tool should not be allowed to run as root when using the enable
and disable modes.
2) On 11 and newer versions, you should use GetDataDirectoryCreatePerm()
to get the correct folder permissions and then set umask(pg_mode_mask).
--
Michael

signature.asc (849 bytes) Download Attachment
Previous Thread Next Thread