Checksum errors in pg_stat_database

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Checksum errors in pg_stat_database

Magnus Hagander-2
Would it make sense to add a column to pg_stat_database showing the total number of checksum errors that have occurred in a database?

It's really a ">1 means it's bad", but it's a lot easier to monitor that in the statistics views, and given how much a lot of people set their systems out to log, it's far too easy to miss individual checksum matches in the logs.

If we track it at the database level, I don't think the overhead of adding one more counter would be very high either.

Reply | Threaded
Open this post in threaded view
|

Re: Checksum errors in pg_stat_database

Robert Haas
On Fri, Jan 11, 2019 at 5:21 AM Magnus Hagander <[hidden email]> wrote:
> Would it make sense to add a column to pg_stat_database showing the total number of checksum errors that have occurred in a database?
>
> It's really a ">1 means it's bad", but it's a lot easier to monitor that in the statistics views, and given how much a lot of people set their systems out to log, it's far too easy to miss individual checksum matches in the logs.
>
> If we track it at the database level, I don't think the overhead of adding one more counter would be very high either.

It's probably not the idea way to track it.  If you have a terabyte or
fifty of data, and you see that you have some checksum failures, good
luck finding the offending blocks.

But I'm tentatively in favor of your proposal anyway, because it's
pretty simple and cheap and might help people, and doing something
noticeably better is probably annoyingly complicated.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Reply | Threaded
Open this post in threaded view
|

Re: Checksum errors in pg_stat_database

Tomas Vondra-4



On 1/11/19 7:40 PM, Robert Haas wrote:

> On Fri, Jan 11, 2019 at 5:21 AM Magnus Hagander <[hidden email]> wrote:
>> Would it make sense to add a column to pg_stat_database showing
>> the total number of checksum errors that have occurred in a database?
>>
>> It's really a ">1 means it's bad", but it's a lot easier to monitor
>> that in the statistics views, and given how much a lot of people
>> set their systems out to log, it's far too easy to miss individual
>> checksum matches in the logs.
>>
>> If we track it at the database level, I don't think the overhead
>> of adding one more counter would be very high either.
>
> It's probably not the idea way to track it.  If you have a terabyte or
> fifty of data, and you see that you have some checksum failures, good
> luck finding the offending blocks.
>

Isn't that somewhat similar to deadlocks, which we also track in
pg_stat_database? The number of deadlocks is rather useless on it's own,
you need to dive into the server log to find the details. Same for
checksum errors.

> But I'm tentatively in favor of your proposal anyway, because it's
> pretty simple and cheap and might help people, and doing something
> noticeably better is probably annoyingly complicated.
>

+1

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Checksum errors in pg_stat_database

Magnus Hagander-2
On Fri, Jan 11, 2019 at 9:20 PM Tomas Vondra <[hidden email]> wrote:



On 1/11/19 7:40 PM, Robert Haas wrote:
> On Fri, Jan 11, 2019 at 5:21 AM Magnus Hagander <[hidden email]> wrote:
>> Would it make sense to add a column to pg_stat_database showing
>> the total number of checksum errors that have occurred in a database?
>>
>> It's really a ">1 means it's bad", but it's a lot easier to monitor
>> that in the statistics views, and given how much a lot of people
>> set their systems out to log, it's far too easy to miss individual
>> checksum matches in the logs.
>>
>> If we track it at the database level, I don't think the overhead
>> of adding one more counter would be very high either.
>
> It's probably not the idea way to track it.  If you have a terabyte or
> fifty of data, and you see that you have some checksum failures, good
> luck finding the offending blocks.
>

Isn't that somewhat similar to deadlocks, which we also track in
pg_stat_database? The number of deadlocks is rather useless on it's own,
you need to dive into the server log to find the details. Same for
checksum errors.

It is a bit similar yeah. Though a checksum counter is really a "you need to look at fixing this right away" in a bit more sense than deadlocks. But yes, the fact that we already tracks deadlocks there is a good example. (Of course, I believe I added that one at some point as well, so I'm clearly biased there)


> But I'm tentatively in favor of your proposal anyway, because it's
> pretty simple and cheap and might help people, and doing something
> noticeably better is probably annoyingly complicated.
>

+1

Yeah, that's the idea behind it -- it's cheap, and an early-warning-indicator.  

--
Reply | Threaded
Open this post in threaded view
|

Re: Checksum errors in pg_stat_database

David Steele
On 1/11/19 10:25 PM, Magnus Hagander wrote:

> On Fri, Jan 11, 2019 at 9:20 PM Tomas Vondra
>     On 1/11/19 7:40 PM, Robert Haas wrote:
>      > But I'm tentatively in favor of your proposal anyway, because it's
>      > pretty simple and cheap and might help people, and doing something
>      > noticeably better is probably annoyingly complicated.
>      >
>
>     +1
>
> Yeah, that's the idea behind it -- it's cheap, and an
> early-warning-indicator.

+1

--
-David
[hidden email]