Checksum errors in pg_stat_database

classic Classic list List threaded Threaded
50 messages Options
123
Reply | Threaded
Open this post in threaded view
|

Checksum errors in pg_stat_database

Magnus Hagander-2
Would it make sense to add a column to pg_stat_database showing the total number of checksum errors that have occurred in a database?

It's really a ">1 means it's bad", but it's a lot easier to monitor that in the statistics views, and given how much a lot of people set their systems out to log, it's far too easy to miss individual checksum matches in the logs.

If we track it at the database level, I don't think the overhead of adding one more counter would be very high either.

Reply | Threaded
Open this post in threaded view
|

Re: Checksum errors in pg_stat_database

Robert Haas
On Fri, Jan 11, 2019 at 5:21 AM Magnus Hagander <[hidden email]> wrote:
> Would it make sense to add a column to pg_stat_database showing the total number of checksum errors that have occurred in a database?
>
> It's really a ">1 means it's bad", but it's a lot easier to monitor that in the statistics views, and given how much a lot of people set their systems out to log, it's far too easy to miss individual checksum matches in the logs.
>
> If we track it at the database level, I don't think the overhead of adding one more counter would be very high either.

It's probably not the idea way to track it.  If you have a terabyte or
fifty of data, and you see that you have some checksum failures, good
luck finding the offending blocks.

But I'm tentatively in favor of your proposal anyway, because it's
pretty simple and cheap and might help people, and doing something
noticeably better is probably annoyingly complicated.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Reply | Threaded
Open this post in threaded view
|

Re: Checksum errors in pg_stat_database

Tomas Vondra-4



On 1/11/19 7:40 PM, Robert Haas wrote:

> On Fri, Jan 11, 2019 at 5:21 AM Magnus Hagander <[hidden email]> wrote:
>> Would it make sense to add a column to pg_stat_database showing
>> the total number of checksum errors that have occurred in a database?
>>
>> It's really a ">1 means it's bad", but it's a lot easier to monitor
>> that in the statistics views, and given how much a lot of people
>> set their systems out to log, it's far too easy to miss individual
>> checksum matches in the logs.
>>
>> If we track it at the database level, I don't think the overhead
>> of adding one more counter would be very high either.
>
> It's probably not the idea way to track it.  If you have a terabyte or
> fifty of data, and you see that you have some checksum failures, good
> luck finding the offending blocks.
>

Isn't that somewhat similar to deadlocks, which we also track in
pg_stat_database? The number of deadlocks is rather useless on it's own,
you need to dive into the server log to find the details. Same for
checksum errors.

> But I'm tentatively in favor of your proposal anyway, because it's
> pretty simple and cheap and might help people, and doing something
> noticeably better is probably annoyingly complicated.
>

+1

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Checksum errors in pg_stat_database

Magnus Hagander-2
On Fri, Jan 11, 2019 at 9:20 PM Tomas Vondra <[hidden email]> wrote:



On 1/11/19 7:40 PM, Robert Haas wrote:
> On Fri, Jan 11, 2019 at 5:21 AM Magnus Hagander <[hidden email]> wrote:
>> Would it make sense to add a column to pg_stat_database showing
>> the total number of checksum errors that have occurred in a database?
>>
>> It's really a ">1 means it's bad", but it's a lot easier to monitor
>> that in the statistics views, and given how much a lot of people
>> set their systems out to log, it's far too easy to miss individual
>> checksum matches in the logs.
>>
>> If we track it at the database level, I don't think the overhead
>> of adding one more counter would be very high either.
>
> It's probably not the idea way to track it.  If you have a terabyte or
> fifty of data, and you see that you have some checksum failures, good
> luck finding the offending blocks.
>

Isn't that somewhat similar to deadlocks, which we also track in
pg_stat_database? The number of deadlocks is rather useless on it's own,
you need to dive into the server log to find the details. Same for
checksum errors.

It is a bit similar yeah. Though a checksum counter is really a "you need to look at fixing this right away" in a bit more sense than deadlocks. But yes, the fact that we already tracks deadlocks there is a good example. (Of course, I believe I added that one at some point as well, so I'm clearly biased there)


> But I'm tentatively in favor of your proposal anyway, because it's
> pretty simple and cheap and might help people, and doing something
> noticeably better is probably annoyingly complicated.
>

+1

Yeah, that's the idea behind it -- it's cheap, and an early-warning-indicator.  

--
Reply | Threaded
Open this post in threaded view
|

Re: Checksum errors in pg_stat_database

David Steele
On 1/11/19 10:25 PM, Magnus Hagander wrote:

> On Fri, Jan 11, 2019 at 9:20 PM Tomas Vondra
>     On 1/11/19 7:40 PM, Robert Haas wrote:
>      > But I'm tentatively in favor of your proposal anyway, because it's
>      > pretty simple and cheap and might help people, and doing something
>      > noticeably better is probably annoyingly complicated.
>      >
>
>     +1
>
> Yeah, that's the idea behind it -- it's cheap, and an
> early-warning-indicator.

+1

--
-David
[hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Checksum errors in pg_stat_database

Magnus Hagander-2
On Sat, Jan 12, 2019 at 5:16 AM David Steele <[hidden email]> wrote:
On 1/11/19 10:25 PM, Magnus Hagander wrote:
> On Fri, Jan 11, 2019 at 9:20 PM Tomas Vondra
>     On 1/11/19 7:40 PM, Robert Haas wrote:
>      > But I'm tentatively in favor of your proposal anyway, because it's
>      > pretty simple and cheap and might help people, and doing something
>      > noticeably better is probably annoyingly complicated.
>      >
>
>     +1
>
> Yeah, that's the idea behind it -- it's cheap, and an
> early-warning-indicator.

+1

PFA is a patch to do this.

It tracks things that happen in the general backends. Possibly we should also consider counting the errors actually found when running base backups? OTOH, that part of the code doesn't really track things like databases (as it operates just on the raw data directory underneath), so that implementation would definitely not be as clean...

--

stat_checksums.patch (10K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Checksum errors in pg_stat_database

Julien Rouhaud
On Fri, Feb 22, 2019 at 3:01 PM Magnus Hagander <[hidden email]> wrote:
>
> PFA is a patch to do this.

+void
+pgstat_report_checksum_failure(void)
+{
+   PgStat_MsgDeadlock msg;

I think that you meant PgStat_MsgChecksumFailure :)

+/* ----------
+ * pgstat_recv_checksum_failure() -
+ *
+ * Process a DEADLOCK message.
+ * ----------

same here

Otherwise LGTM.

Reply | Threaded
Open this post in threaded view
|

Re: Checksum errors in pg_stat_database

Magnus Hagander-2


On Fri, Feb 22, 2019 at 3:16 PM Julien Rouhaud <[hidden email]> wrote:
On Fri, Feb 22, 2019 at 3:01 PM Magnus Hagander <[hidden email]> wrote:
>
> PFA is a patch to do this.

+void
+pgstat_report_checksum_failure(void)
+{
+   PgStat_MsgDeadlock msg;

I think that you meant PgStat_MsgChecksumFailure :)

+/* ----------
+ * pgstat_recv_checksum_failure() -
+ *
+ * Process a DEADLOCK message.
+ * ----------

same here

Otherwise LGTM.

Haha, damit, that's embarassing. You can probably guess where I copy/pasted from :)

--
Reply | Threaded
Open this post in threaded view
|

Re: Checksum errors in pg_stat_database

Magnus Hagander-2
On Fri, Feb 22, 2019 at 3:23 PM Magnus Hagander <[hidden email]> wrote:


On Fri, Feb 22, 2019 at 3:16 PM Julien Rouhaud <[hidden email]> wrote:
On Fri, Feb 22, 2019 at 3:01 PM Magnus Hagander <[hidden email]> wrote:
>
> PFA is a patch to do this.

+void
+pgstat_report_checksum_failure(void)
+{
+   PgStat_MsgDeadlock msg;

I think that you meant PgStat_MsgChecksumFailure :)

+/* ----------
+ * pgstat_recv_checksum_failure() -
+ *
+ * Process a DEADLOCK message.
+ * ----------

same here

Otherwise LGTM.

Haha, damit, that's embarassing. You can probably guess where I copy/pasted from :)


And of course, then I forgot to attach the new file.
 
--

stat_checksums2.patch (10K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Checksum errors in pg_stat_database

Julien Rouhaud
On Fri, Feb 22, 2019 at 3:25 PM Magnus Hagander <[hidden email]> wrote:

>
> On Fri, Feb 22, 2019 at 3:23 PM Magnus Hagander <[hidden email]> wrote:
>>
>>
>>
>> On Fri, Feb 22, 2019 at 3:16 PM Julien Rouhaud <[hidden email]> wrote:
>>>
>>> On Fri, Feb 22, 2019 at 3:01 PM Magnus Hagander <[hidden email]> wrote:
>>> >
>>> > PFA is a patch to do this.
>>>
>>> +void
>>> +pgstat_report_checksum_failure(void)
>>> +{
>>> +   PgStat_MsgDeadlock msg;
>>>
>>> I think that you meant PgStat_MsgChecksumFailure :)
>>>
>>> +/* ----------
>>> + * pgstat_recv_checksum_failure() -
>>> + *
>>> + * Process a DEADLOCK message.
>>> + * ----------
>>>
>>> same here
>>>
>>> Otherwise LGTM.
>>
>>
>> Haha, damit, that's embarassing. You can probably guess where I copy/pasted from :)

heh :)

>>
>
> And of course, then I forgot to attach the new file.

It all looks fine.  One minor nitpicking issue I just noticed, there's
an extra space there:

+   dbentry->n_checksum_failures ++;

I'm marking it as ready for committer!

Reply | Threaded
Open this post in threaded view
|

Re: Checksum errors in pg_stat_database

Julien Rouhaud
In reply to this post by Magnus Hagander-2
On Fri, Feb 22, 2019 at 3:01 PM Magnus Hagander <[hidden email]> wrote:
>
> It tracks things that happen in the general backends. Possibly we should also consider counting the errors actually found when running base backups? OTOH, that part of the code doesn't really track things like databases (as it operates just on the raw data directory underneath), so that implementation would definitely not be as clean...

Sorry I just realized that I totally forgot this part of the thread.

While it's true that we operate on raw directory, I see that sendDir()
already setup a isDbDir var, and if this is true lastDir should
contain the oid of the underlying database.  Wouldn't it be enough to
call sendFile() using this, something like (untested):

if (!sizeonly)
- sent = sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf, true);
+ sent = sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf, true,
isDbDir ? pg_atoi(lastDir+1, 4) : InvalidOid);

and accordingly report any checksum error from sendFile()?

Reply | Threaded
Open this post in threaded view
|

Re: Checksum errors in pg_stat_database

Julien Rouhaud
On Mon, Mar 4, 2019 at 8:31 PM Julien Rouhaud <[hidden email]> wrote:

>
> On Fri, Feb 22, 2019 at 3:01 PM Magnus Hagander <[hidden email]> wrote:
> >
> > It tracks things that happen in the general backends. Possibly we should also consider counting the errors actually found when running base backups? OTOH, that part of the code doesn't really track things like databases (as it operates just on the raw data directory underneath), so that implementation would definitely not be as clean...
>
> Sorry I just realized that I totally forgot this part of the thread.
>
> While it's true that we operate on raw directory, I see that sendDir()
> already setup a isDbDir var, and if this is true lastDir should
> contain the oid of the underlying database.  Wouldn't it be enough to
> call sendFile() using this, something like (untested):
>
> if (!sizeonly)
> - sent = sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf, true);
> + sent = sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf, true,
> isDbDir ? pg_atoi(lastDir+1, 4) : InvalidOid);
>
> and accordingly report any checksum error from sendFile()?
So this seem to work just fine without adding much code.  PFA v3 of
Magnus' patch including error reporting for BASE_BACKUP.

stat_checksums_v3.diff (14K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Checksum errors in pg_stat_database

Magnus Hagander-2
In reply to this post by Julien Rouhaud
On Mon, Mar 4, 2019 at 11:31 AM Julien Rouhaud <[hidden email]> wrote:
On Fri, Feb 22, 2019 at 3:01 PM Magnus Hagander <[hidden email]> wrote:
>
> It tracks things that happen in the general backends. Possibly we should also consider counting the errors actually found when running base backups? OTOH, that part of the code doesn't really track things like databases (as it operates just on the raw data directory underneath), so that implementation would definitely not be as clean...

Sorry I just realized that I totally forgot this part of the thread.

While it's true that we operate on raw directory, I see that sendDir()
already setup a isDbDir var, and if this is true lastDir should
contain the oid of the underlying database.  Wouldn't it be enough to
call sendFile() using this, something like (untested):

if (!sizeonly)
- sent = sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf, true);
+ sent = sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf, true,
isDbDir ? pg_atoi(lastDir+1, 4) : InvalidOid);

and accordingly report any checksum error from sendFile()?

That seems it was easy enough. PFA an updated patch that does this, and also rebased so it doesn't conflict on oid.

(yes, while moving this from draft to publish after lunch, I realized that you put a patch togerher for about the same. So let's merge it)

--

stat_checksums_2.patch (15K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Checksum errors in pg_stat_database

Julien Rouhaud
On Sat, Mar 9, 2019 at 12:35 AM Magnus Hagander <[hidden email]> wrote:

>
> On Mon, Mar 4, 2019 at 11:31 AM Julien Rouhaud <[hidden email]> wrote:
>>
>> On Fri, Feb 22, 2019 at 3:01 PM Magnus Hagander <[hidden email]> wrote:
>> >
>> > It tracks things that happen in the general backends. Possibly we should also consider counting the errors actually found when running base backups? OTOH, that part of the code doesn't really track things like databases (as it operates just on the raw data directory underneath), so that implementation would definitely not be as clean...
>>
>> Sorry I just realized that I totally forgot this part of the thread.
>>
>> While it's true that we operate on raw directory, I see that sendDir()
>> already setup a isDbDir var, and if this is true lastDir should
>> contain the oid of the underlying database.  Wouldn't it be enough to
>> call sendFile() using this, something like (untested):
>>
>> if (!sizeonly)
>> - sent = sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf, true);
>> + sent = sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf, true,
>> isDbDir ? pg_atoi(lastDir+1, 4) : InvalidOid);
>>
>> and accordingly report any checksum error from sendFile()?
>
>
> That seems it was easy enough. PFA an updated patch that does this, and also rebased so it doesn't conflict on oid.
>
> (yes, while moving this from draft to publish after lunch, I realized that you put a patch togerher for about the same. So let's merge it)

Thanks!  Our implementations are quite similar, so I'm fine with most
of the changes :) I'm just not sure about having two distinct
functions for reporting failures, given that there's only one caller
for each.  On the other hand it avoids to include miscadmin.h in
bufpage.c.

That's just a detail, so I'm marking it (again) as ready for committer!

Reply | Threaded
Open this post in threaded view
|

Re: Checksum errors in pg_stat_database

Julien Rouhaud
On Sat, Mar 9, 2019 at 9:34 AM Julien Rouhaud <[hidden email]> wrote:

>
> On Sat, Mar 9, 2019 at 12:35 AM Magnus Hagander <[hidden email]> wrote:
> >
> > On Mon, Mar 4, 2019 at 11:31 AM Julien Rouhaud <[hidden email]> wrote:
> >>
> >> On Fri, Feb 22, 2019 at 3:01 PM Magnus Hagander <[hidden email]> wrote:
> >> >
> >> > It tracks things that happen in the general backends. Possibly we should also consider counting the errors actually found when running base backups? OTOH, that part of the code doesn't really track things like databases (as it operates just on the raw data directory underneath), so that implementation would definitely not be as clean...
> >>
> >> Sorry I just realized that I totally forgot this part of the thread.
> >>
> >> While it's true that we operate on raw directory, I see that sendDir()
> >> already setup a isDbDir var, and if this is true lastDir should
> >> contain the oid of the underlying database.  Wouldn't it be enough to
> >> call sendFile() using this, something like (untested):
> >>
> >> if (!sizeonly)
> >> - sent = sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf, true);
> >> + sent = sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf, true,
> >> isDbDir ? pg_atoi(lastDir+1, 4) : InvalidOid);
> >>
> >> and accordingly report any checksum error from sendFile()?
> >
> > That seems it was easy enough. PFA an updated patch that does this, and also rebased so it doesn't conflict on oid.
> >

Sorry, I have again new comments after a little bit more thinking.
I'm wondering if we can do something about shared objects while we're
at it.  They don't belong to any database, so it's a little bit
orthogonal to this proposal, but it seems quite important to track
error on those too!

What about adding a new field in PgStat_GlobalStats for that?  We can
use the same lastDir to easily detect such objects and slightly adapt
sendFile again, which seems quite straightforward.

Reply | Threaded
Open this post in threaded view
|

Re: Checksum errors in pg_stat_database

Magnus Hagander-2
In reply to this post by Julien Rouhaud
On Sat, Mar 9, 2019 at 12:33 AM Julien Rouhaud <[hidden email]> wrote:
On Sat, Mar 9, 2019 at 12:35 AM Magnus Hagander <[hidden email]> wrote:
>
> On Mon, Mar 4, 2019 at 11:31 AM Julien Rouhaud <[hidden email]> wrote:
>>
>> On Fri, Feb 22, 2019 at 3:01 PM Magnus Hagander <[hidden email]> wrote:
>> >
>> > It tracks things that happen in the general backends. Possibly we should also consider counting the errors actually found when running base backups? OTOH, that part of the code doesn't really track things like databases (as it operates just on the raw data directory underneath), so that implementation would definitely not be as clean...
>>
>> Sorry I just realized that I totally forgot this part of the thread.
>>
>> While it's true that we operate on raw directory, I see that sendDir()
>> already setup a isDbDir var, and if this is true lastDir should
>> contain the oid of the underlying database.  Wouldn't it be enough to
>> call sendFile() using this, something like (untested):
>>
>> if (!sizeonly)
>> - sent = sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf, true);
>> + sent = sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf, true,
>> isDbDir ? pg_atoi(lastDir+1, 4) : InvalidOid);
>>
>> and accordingly report any checksum error from sendFile()?
>
>
> That seems it was easy enough. PFA an updated patch that does this, and also rebased so it doesn't conflict on oid.
>
> (yes, while moving this from draft to publish after lunch, I realized that you put a patch togerher for about the same. So let's merge it)

Thanks!  Our implementations are quite similar, so I'm fine with most
of the changes :) I'm just not sure about having two distinct
functions for reporting failures, given that there's only one caller
for each.  On the other hand it avoids to include miscadmin.h in
bufpage.c.

Yeah, and it brings "cosistence" to at least the calling point(s) within regular backends. 

 
That's just a detail, so I'm marking it (again) as ready for committer!

Thanks, and pushed :)

--
Reply | Threaded
Open this post in threaded view
|

Re: Checksum errors in pg_stat_database

Magnus Hagander-2
In reply to this post by Julien Rouhaud
On Sat, Mar 9, 2019 at 10:41 AM Julien Rouhaud <[hidden email]> wrote:
On Sat, Mar 9, 2019 at 9:34 AM Julien Rouhaud <[hidden email]> wrote:
>
> On Sat, Mar 9, 2019 at 12:35 AM Magnus Hagander <[hidden email]> wrote:
> >
> > On Mon, Mar 4, 2019 at 11:31 AM Julien Rouhaud <[hidden email]> wrote:
> >>
> >> On Fri, Feb 22, 2019 at 3:01 PM Magnus Hagander <[hidden email]> wrote:
> >> >
> >> > It tracks things that happen in the general backends. Possibly we should also consider counting the errors actually found when running base backups? OTOH, that part of the code doesn't really track things like databases (as it operates just on the raw data directory underneath), so that implementation would definitely not be as clean...
> >>
> >> Sorry I just realized that I totally forgot this part of the thread.
> >>
> >> While it's true that we operate on raw directory, I see that sendDir()
> >> already setup a isDbDir var, and if this is true lastDir should
> >> contain the oid of the underlying database.  Wouldn't it be enough to
> >> call sendFile() using this, something like (untested):
> >>
> >> if (!sizeonly)
> >> - sent = sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf, true);
> >> + sent = sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf, true,
> >> isDbDir ? pg_atoi(lastDir+1, 4) : InvalidOid);
> >>
> >> and accordingly report any checksum error from sendFile()?
> >
> > That seems it was easy enough. PFA an updated patch that does this, and also rebased so it doesn't conflict on oid.
> >

Sorry, I have again new comments after a little bit more thinking.
I'm wondering if we can do something about shared objects while we're
at it.  They don't belong to any database, so it's a little bit
orthogonal to this proposal, but it seems quite important to track
error on those too!

What about adding a new field in PgStat_GlobalStats for that?  We can
use the same lastDir to easily detect such objects and slightly adapt
sendFile again, which seems quite straightforward.

Ah, didn't spot that one until after I pushed :/ Sorry about that.

Hmm. That's an interesting thought. And then add a column to pg_stat_bgwriter, I assume? (Which is an ever increasingly bad name for the view, but that's unrelated to this)

Question is then what number that should show -- only the checksum counter in non-database-fields, or the total number across the cluster?

--
Reply | Threaded
Open this post in threaded view
|

Re: Checksum errors in pg_stat_database

Julien Rouhaud
In reply to this post by Magnus Hagander-2
On Sat, Mar 9, 2019 at 7:48 PM Magnus Hagander <[hidden email]> wrote:

>
> On Sat, Mar 9, 2019 at 12:33 AM Julien Rouhaud <[hidden email]> wrote:
>>
>> Thanks!  Our implementations are quite similar, so I'm fine with most
>> of the changes :) I'm just not sure about having two distinct
>> functions for reporting failures, given that there's only one caller
>> for each.  On the other hand it avoids to include miscadmin.h in
>> bufpage.c.
>
>
> Yeah, and it brings "cosistence" to at least the calling point(s) within regular backends.
>
>
>>
>> That's just a detail, so I'm marking it (again) as ready for committer!
>
>
> Thanks, and pushed :)

Thanks :)

Reply | Threaded
Open this post in threaded view
|

Re: Checksum errors in pg_stat_database

Julien Rouhaud
In reply to this post by Magnus Hagander-2
On Sat, Mar 9, 2019 at 7:50 PM Magnus Hagander <[hidden email]> wrote:

>
> On Sat, Mar 9, 2019 at 10:41 AM Julien Rouhaud <[hidden email]> wrote:
>>
>> Sorry, I have again new comments after a little bit more thinking.
>> I'm wondering if we can do something about shared objects while we're
>> at it.  They don't belong to any database, so it's a little bit
>> orthogonal to this proposal, but it seems quite important to track
>> error on those too!
>>
>> What about adding a new field in PgStat_GlobalStats for that?  We can
>> use the same lastDir to easily detect such objects and slightly adapt
>> sendFile again, which seems quite straightforward.
>
>
> Ah, didn't spot that one until after I pushed :/ Sorry about that.

No problem, I should have thought about it sooner anyway.

> Hmm. That's an interesting thought. And then add a column to pg_stat_bgwriter, I assume?

Yes, and a new entry for PgStat_Shared_Reset_Target I guess.

 (Which is an ever increasingly bad name for the view, but that's
unrelated to this)

yeah :/

> Question is then what number that should show -- only the checksum counter in non-database-fields, or the total number across the cluster?

I'd say only for non-database-fields errors, especially if we can
reset each counters separately.  If necessary, we can add a new view
to give a global overview of checksum errors for DBA convenience.

Reply | Threaded
Open this post in threaded view
|

Re: Checksum errors in pg_stat_database

Julien Rouhaud
On Sat, Mar 9, 2019 at 7:58 PM Julien Rouhaud <[hidden email]> wrote:

>
> On Sat, Mar 9, 2019 at 7:50 PM Magnus Hagander <[hidden email]> wrote:
> >
> > On Sat, Mar 9, 2019 at 10:41 AM Julien Rouhaud <[hidden email]> wrote:
> >>
> >> Sorry, I have again new comments after a little bit more thinking.
> >> I'm wondering if we can do something about shared objects while we're
> >> at it.  They don't belong to any database, so it's a little bit
> >> orthogonal to this proposal, but it seems quite important to track
> >> error on those too!
> >>
> >> What about adding a new field in PgStat_GlobalStats for that?  We can
> >> use the same lastDir to easily detect such objects and slightly adapt
> >> sendFile again, which seems quite straightforward.
>
> > Question is then what number that should show -- only the checksum counter in non-database-fields, or the total number across the cluster?
>
> I'd say only for non-database-fields errors, especially if we can
> reset each counters separately.  If necessary, we can add a new view
> to give a global overview of checksum errors for DBA convenience.

So, after reading current implementation, I don't think that
PgStat_GlobalStats is the right place.  It's already enough of a mess,
and clearly pg_stat_reset_shared('bgwriter') would not make any sense
if it did reset the shared relations checksum errors (though arguably
the fact that's it's resetting checkpointer stats right now is hardly
better), and handling a different target to reset part of
PgStat_GlobalStats counters would be an ugly kludge.

I'm considering adding a new PgStat_ChecksumStats for that purpose
instead, but I don't know if that's acceptable to do so in the last
commitfest.  It seems worthwhile to add it eventually, since we'll
probably end up having more things to report to users related to
checksum.  Online enabling of checksum could be the most immediate
potential target.

If that's acceptable, I was thinking this new stat could have those
fields with the first drop:

- number of non-db-related checksum checks done
- number of non-db-related checksum checks failed
- last stats reset

(and adding the number of checks for db-related blocks done with the
current checksum errors counter).  Maybe also adding a
pg_checksum_stats view that would summarise all the counters in one
place.

It'll obviously add some traffic to the stats collector, but I'd hope
not too much, since BufferAlloc shouldn't be that frequent, and
BASE_BACKUP reports stats only once per file.

123