Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING


Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

Dilip Kumar-2
The attached patch allows VACUUM to continue by emitting a WARNING
for the corrupted tuple instead of erroring out immediately, as
discussed at [1].

Basically, it provides a new GUC called vacuum_tolerate_damage to
control whether to continue the vacuum or to stop on encountering a
corrupted tuple.  If vacuum_tolerate_damage is set, then in all the
cases in heap_prepare_freeze_tuple where a corrupted xid is detected,
it will emit a warning, report that nothing was changed in the tuple,
and set 'tuple_totally_frozen' to false.  Since we return false, the
caller will not try to freeze such a tuple, and because
tuple_totally_frozen is also false, the page will not be marked as
all-frozen even if every other tuple on the page is frozen.
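
To make the shape of the change concrete, here is a minimal sketch of
one such check in heap_prepare_freeze_tuple() - an illustration of the
idea only, not the patch text itself; the GUC variable name and the
exact message wording are assumptions:

xid = HeapTupleHeaderGetXmin(tuple);
if (TransactionIdIsNormal(xid) &&
    TransactionIdPrecedes(xid, relfrozenxid))
{
    /* Corrupted xmin: warn and skip instead of aborting the vacuum. */
    ereport(vacuum_tolerate_damage ? WARNING : ERROR,
            (errcode(ERRCODE_DATA_CORRUPTED),
             errmsg_internal("found xmin %u from before relfrozenxid %u",
                             xid, relfrozenxid)));

    /*
     * Leave the tuple unchanged, and keep the page off the all-frozen
     * path by reporting it as not totally frozen.
     */
    *totally_frozen_p = false;
    return false;
}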

Alternatively, we could try to freeze the other, non-corrupted XIDs
in the tuple, but I don't think we would gain anything from that:
if one of xmin or xmax is wrong, the next vacuum run will hit the
same WARNING or ERROR anyway.  Is there any other opinion on this?

[1] http://postgr.es/m/CA+TgmoaZwZHtFFU6NUJgEAp6adDs-qWfNOXpZGQpZMmm0VTDfg@...

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment: v1-0001-Provide-a-GUC-to-allow-vacuum-to-continue-on-corr.patch (12K)

Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

Andrey M. Borodin
Hi Dilip!


> On 17 July 2020, at 15:46, Dilip Kumar <[hidden email]> wrote:
>
> The attached patch allows VACUUM to continue by emitting a WARNING
> for the corrupted tuple instead of erroring out immediately, as
> discussed at [1].
>
> Basically, it provides a new GUC called vacuum_tolerate_damage to
> control whether to continue the vacuum or to stop on encountering a
> corrupted tuple.  If vacuum_tolerate_damage is set, then in all the
> cases in heap_prepare_freeze_tuple where a corrupted xid is detected,
> it will emit a warning, report that nothing was changed in the tuple,
> and set 'tuple_totally_frozen' to false.  Since we return false, the
> caller will not try to freeze such a tuple, and because
> tuple_totally_frozen is also false, the page will not be marked as
> all-frozen even if every other tuple on the page is frozen.
>
> Alternatively, we could try to freeze the other, non-corrupted XIDs
> in the tuple, but I don't think we would gain anything from that:
> if one of xmin or xmax is wrong, the next vacuum run will hit the
> same WARNING or ERROR anyway.  Is there any other opinion on this?

FWIW, AFAIK this ERROR was the reason why we had to use older versions of heap_prepare_freeze_tuple() in our recovery kit [0].
So +1 from me.
But I do not think that just ignoring corruption here is sufficient: soon after this freeze problem the user will probably have to deal with an absent CLOG.
I think this GUC is only a part of an incomplete solution.
Personally, I'd be happy if this were backported - our recovery kit would be much smaller. But this does not seem like a valid reason.

Thanks!

Best regards, Andrey Borodin.


[0] https://github.com/dsarafan/pg_dirty_hands/blob/master/src/pg_dirty_hands.c#L443





Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

Dilip Kumar-2
In reply to this post by Dilip Kumar-2
On Fri, Jul 17, 2020 at 4:16 PM Dilip Kumar <[hidden email]> wrote:

>
> The attached patch allows VACUUM to continue by emitting a WARNING
> for the corrupted tuple instead of erroring out immediately, as
> discussed at [1].
>
> Basically, it provides a new GUC called vacuum_tolerate_damage to
> control whether to continue the vacuum or to stop on encountering a
> corrupted tuple.  If vacuum_tolerate_damage is set, then in all the
> cases in heap_prepare_freeze_tuple where a corrupted xid is detected,
> it will emit a warning, report that nothing was changed in the tuple,
> and set 'tuple_totally_frozen' to false.  Since we return false, the
> caller will not try to freeze such a tuple, and because
> tuple_totally_frozen is also false, the page will not be marked as
> all-frozen even if every other tuple on the page is frozen.
>
> Alternatively, we could try to freeze the other, non-corrupted XIDs
> in the tuple, but I don't think we would gain anything from that:
> if one of xmin or xmax is wrong, the next vacuum run will hit the
> same WARNING or ERROR anyway.  Is there any other opinion on this?

Robert has mentioned at [1] that we should probably refuse to update
'relfrozenxid/relminmxid' when we encounter such a tuple, and emit a
WARNING instead of an error.  I think we should do that in some cases,
but IMHO it's not a good idea in all cases.  Basically, if the xmin
precedes the relfrozenxid then we should probably still allow
relfrozenxid to be updated, whereas if the xmin precedes the cutoff
xid and is still uncommitted then we should probably stop relfrozenxid
from being updated, so that the CLOG does not get truncated.  I will
make these changes if we agree with the idea.  Or should we keep it
simple and never allow 'relfrozenxid/relminmxid' to be updated in such
cases?

[1] http://postgr.es/m/CA+TgmoaZwZHtFFU6NUJgEAp6adDs-qWfNOXpZGQpZMmm0VTDfg@...

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

Dilip Kumar-2
In reply to this post by Andrey M. Borodin
On Sun, Jul 19, 2020 at 4:56 PM Andrey M. Borodin <[hidden email]> wrote:

>
> Hi Dilip!
>
>
> > On 17 July 2020, at 15:46, Dilip Kumar <[hidden email]> wrote:
> >
> > The attached patch allows VACUUM to continue by emitting a WARNING
> > for the corrupted tuple instead of erroring out immediately, as
> > discussed at [1].
> >
> > Basically, it provides a new GUC called vacuum_tolerate_damage to
> > control whether to continue the vacuum or to stop on encountering a
> > corrupted tuple.  If vacuum_tolerate_damage is set, then in all the
> > cases in heap_prepare_freeze_tuple where a corrupted xid is detected,
> > it will emit a warning, report that nothing was changed in the tuple,
> > and set 'tuple_totally_frozen' to false.  Since we return false, the
> > caller will not try to freeze such a tuple, and because
> > tuple_totally_frozen is also false, the page will not be marked as
> > all-frozen even if every other tuple on the page is frozen.
> >
> > Alternatively, we could try to freeze the other, non-corrupted XIDs
> > in the tuple, but I don't think we would gain anything from that:
> > if one of xmin or xmax is wrong, the next vacuum run will hit the
> > same WARNING or ERROR anyway.  Is there any other opinion on this?
>
> FWIW, AFAIK this ERROR was the reason why we had to use older versions of heap_prepare_freeze_tuple() in our recovery kit [0].
> So +1 from me.

Thanks for showing interest in this patch.

> But I do not think that just ignoring corruption here is sufficient: soon after this freeze problem the user will probably have to deal with an absent CLOG.
> I think this GUC is only a part of an incomplete solution.
> Personally, I'd be happy if this were backported - our recovery kit would be much smaller. But this does not seem like a valid reason.

I agree that this is just solving one part of the problem, and in
some cases it may not work if the CLOG itself is corrupted, i.e. does
not exist for xids that are not yet frozen.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

Alvaro Herrera-9
In reply to this post by Dilip Kumar-2
On 2020-Jul-20, Dilip Kumar wrote:

> On Fri, Jul 17, 2020 at 4:16 PM Dilip Kumar <[hidden email]> wrote:

> > If vacuum_tolerate_damage is set, then in all the
> > cases in heap_prepare_freeze_tuple where a corrupted xid is detected,
> > it will emit a warning, report that nothing was changed in the tuple,
> > and set 'tuple_totally_frozen' to false.  Since we return false, the
> > caller will not try to freeze such a tuple, and because
> > tuple_totally_frozen is also false, the page will not be marked as
> > all-frozen even if every other tuple on the page is frozen.

> Robert has mentioned at [1] that we should probably refuse to update
> 'relfrozenxid/relminmxid' when we encounter such a tuple, and emit a
> WARNING instead of an error.

Isn't this already happening per your description above?

> I think we should do that in some cases, but IMHO it's not a good
> idea in all cases.  Basically, if the xmin precedes the relfrozenxid
> then we should probably still allow relfrozenxid to be updated,
> whereas if the xmin precedes the cutoff xid and is still uncommitted
> then we should probably stop relfrozenxid from being updated, so that
> the CLOG does not get truncated.

I'm not sure I understand 100% what you're talking about here (the
first half seems dangerous, unless you misspoke), but in any case it
seems a pointless optimization.  I mean, if the heap is corrupted, you
can hope to complete the vacuum (which will hopefully report which
*other* tuples are similarly corrupt), but trying to advance
relfrozenxid is a lost cause.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

Andrey M. Borodin


> On 20 July 2020, at 21:44, Alvaro Herrera <[hidden email]> wrote:
>
>> I think we should do that in some cases, but IMHO it's not a good
>> idea in all cases.  Basically, if the xmin precedes the relfrozenxid
>> then we should probably still allow relfrozenxid to be updated,
>> whereas if the xmin precedes the cutoff xid and is still uncommitted
>> then we should probably stop relfrozenxid from being updated, so that
>> the CLOG does not get truncated.
>
> I'm not sure I understand 100% what you're talking about here (the
> first half seems dangerous, unless you misspoke), but in any case it
> seems a pointless optimization.  I mean, if the heap is corrupted, you
> can hope to complete the vacuum (which will hopefully report which
> *other* tuples are similarly corrupt), but trying to advance
> relfrozenxid is a lost cause.

I think the point here is to actually move relfrozenxid back. But the mince can't be turned back: if CLOG is rotated, the table is corrupted beyond easy repair.

I'm not sure it's Dilip's case, but I'll try to describe what I was encountering.

We were observing this kind of corruption in three cases:
1. With a bug in a patched Linux kernel page cache, we could lose an FS page write.
2. With a bug in WAL-G block-level incremental backup, we could lose an update of a page.
3. With a firmware bug in one vendor's SSD drives, a write to the block storage device was lost.
One page in the database ends up at some non-latest version (but with a correct checksum; it's just an old version). In our case, usually the vacuuming of a page was lost (with freezing of all its tuples). Some tuples are not marked as frozen, while the VM has the frozen bit for the page. Everything works just fine until someone updates a tuple on the same page: the VM bit is reset, and eventually a user will try to consult the CLOG, which has already been truncated.

This is why we may need to defer CLOG truncation or even move relfrozenxid back.

FWIW, we coped with this by actively monitoring this kind of corruption with this amcheck patch [0]. One can observe these lost page updates cheaply in indexes and act at the first sight of corruption: identify the source of the buggy behaviour.

Dilip, does this sound like a corruption case you are working on?

Thanks!

Best regards, Andrey Borodin.

[0] https://commitfest.postgresql.org/24/2254/


Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

Alvaro Herrera-9
On 2020-Jul-20, Andrey M. Borodin wrote:

> I think the point here is to actually move relfrozenxid back. But the
> mince can't be turned back: if CLOG is rotated, the table is
> corrupted beyond easy repair.

Oh, I see.  Hmm.  Well, if you discover a relfrozenxid that's newer
and the pg_clog files are still there, then yeah, it might make sense
to move relfrozenxid back.  But it seems difficult to do correctly ...
you have to move datfrozenxid back too ... frankly, I'd rather not go
there.

> I'm not sure it's Dilip's case, but I'll try to describe what I was encountering.
>
> We were observing this kind of corruption in three cases:
> 1. With a bug in a patched Linux kernel page cache, we could lose an FS page write.

I think I've seen this too.  (Or possibly your #3, which from Postgres
POV is the same thing -- a write was claimed done but actually not
done.)

> FWIW, we coped with this by actively monitoring this kind of
> corruption with this amcheck patch [0]. One can observe these lost
> page updates cheaply in indexes and act at the first sight of
> corruption: identify the source of the buggy behaviour.

Right.

I wish we had some way to better protect against this kind of problem,
but I don't have any ideas.  Some things can be protected against with
checksums, but if you just lose a write, there's nothing to indicate
that.  We don't have a per-page write counter, or a central repository
of per-page LSNs or checksums, and it seems very expensive to maintain
such things.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

Andres Freund
In reply to this post by Dilip Kumar-2
Hi,

On 2020-07-17 16:16:23 +0530, Dilip Kumar wrote:

> The attached patch allows VACUUM to continue by emitting a WARNING
> for the corrupted tuple instead of erroring out immediately, as
> discussed at [1].
>
> Basically, it provides a new GUC called vacuum_tolerate_damage to
> control whether to continue the vacuum or to stop on encountering a
> corrupted tuple.  If vacuum_tolerate_damage is set, then in all the
> cases in heap_prepare_freeze_tuple where a corrupted xid is detected,
> it will emit a warning, report that nothing was changed in the tuple,
> and set 'tuple_totally_frozen' to false.  Since we return false, the
> caller will not try to freeze such a tuple, and because
> tuple_totally_frozen is also false, the page will not be marked as
> all-frozen even if every other tuple on the page is frozen.

I'm extremely doubtful this is a good idea. In all likelihood this will
just exacerbate corruption.

You cannot just stop freezing tuples, that'll lead to relfrozenxid
getting *further* out of sync with the actual table contents. And you
cannot just freeze such tuples, because that has a good chance of making
deleted tuples suddenly visible, leading to unique constraint violations
etc. Which will then subsequently lead to clog lookup errors and such.

At the very least you'd need to signal up that relfrozenxid/relminmxid
cannot be increased. Without that I think it's entirely unacceptable to
do this.


If we really were to do something like this, the option would need to
be called vacuum_allow_making_corruption_worse or such.  It needs to
be *exceedingly* clear that it will likely lead to making everything
much worse.


> @@ -6123,6 +6124,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
>   frz->t_infomask = tuple->t_infomask;
>   frz->xmax = HeapTupleHeaderGetRawXmax(tuple);

I don't think it can be right to just update heap_prepare_freeze_tuple()
but not FreezeMultiXactId().

Greetings,

Andres Freund



Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

Andrey M. Borodin
In reply to this post by Alvaro Herrera-9


> On 21 July 2020, at 00:36, Alvaro Herrera <[hidden email]> wrote:
>
>
>> FWIW, we coped with this by actively monitoring this kind of
>> corruption with this amcheck patch [0]. One can observe these lost
>> page updates cheaply in indexes and act at the first sight of
>> corruption: identify the source of the buggy behaviour.
>
> Right.
>
> I wish we had some way to better protect against this kind of problem,
> but I don't have any ideas.  Some things can be protected against with
> checksums, but if you just lose a write, there's nothing to indicate
> that.  We don't have a per-page write counter, or a central repository
> of per-page LSNs or checksums, and it seems very expensive to maintain
> such things.

If we had data checksums in another fork, we could flush them on checkpoint.
These checksums could protect against lost page updates.
And it would be much easier to maintain such checksums for SLRUs.

Best regards, Andrey Borodin.


Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

Dilip Kumar-2
In reply to this post by Alvaro Herrera-9
On Mon, Jul 20, 2020 at 10:14 PM Alvaro Herrera
<[hidden email]> wrote:

>
> On 2020-Jul-20, Dilip Kumar wrote:
>
> > On Fri, Jul 17, 2020 at 4:16 PM Dilip Kumar <[hidden email]> wrote:
>
> > > If vacuum_tolerate_damage is set, then in all the
> > > cases in heap_prepare_freeze_tuple where a corrupted xid is detected,
> > > it will emit a warning, report that nothing was changed in the tuple,
> > > and set 'tuple_totally_frozen' to false.  Since we return false, the
> > > caller will not try to freeze such a tuple, and because
> > > tuple_totally_frozen is also false, the page will not be marked as
> > > all-frozen even if every other tuple on the page is frozen.
>
> > Robert has mentioned at [1] that we should probably refuse to update
> > 'relfrozenxid/relminmxid' when we encounter such a tuple, and emit a
> > WARNING instead of an error.
>
> Isn't this already happening per your description above?

As per the above description, we avoid marking the page as
all-frozen.  But the vacrelstats->scanned_pages count has already been
increased for this page, and right after lazy_scan_heap we will update
the pg_class tuple with the new FreezeLimit and MultiXactCutoff.

>
> > I think we should do that in some cases, but IMHO it's not a good
> > idea in all cases.  Basically, if the xmin precedes the relfrozenxid
> > then we should probably still allow relfrozenxid to be updated,
> > whereas if the xmin precedes the cutoff xid and is still uncommitted
> > then we should probably stop relfrozenxid from being updated, so that
> > the CLOG does not get truncated.
>
> I'm not sure I understand 100% what you're talking about here (the
> first half seems dangerous, unless you misspoke), but in any case it
> seems a pointless optimization.  I mean, if the heap is corrupted, you
> can hope to complete the vacuum (which will hopefully report which
> *other* tuples are similarly corrupt), but trying to advance
> relfrozenxid is a lost cause.

I agree with your point.  I think we just need to avoid advancing the
relfrozenxid in all such cases.
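
As a sketch of what that could look like in lazy_vacuum_rel() - the
found_corruption field is a hypothetical flag collected during the
scan, and passing InvalidTransactionId/InvalidMultiXactId already
tells vac_update_relstats() to leave the old values alone:

new_frozen_xid = vacrelstats->found_corruption ? InvalidTransactionId
                                               : FreezeLimit;
new_min_multi = vacrelstats->found_corruption ? InvalidMultiXactId
                                              : MultiXactCutoff;

vac_update_relstats(onerel,
                    new_rel_pages,
                    new_live_tuples,
                    new_rel_allvisible,
                    vacrelstats->hasindex,
                    new_frozen_xid,   /* InvalidTransactionId => keep old */
                    new_min_multi,    /* InvalidMultiXactId => keep old */
                    false);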

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

Dilip Kumar-2
In reply to this post by Andres Freund
On Tue, Jul 21, 2020 at 2:00 AM Andres Freund <[hidden email]> wrote:

>
> Hi,
>
> On 2020-07-17 16:16:23 +0530, Dilip Kumar wrote:
> > The attached patch allows VACUUM to continue by emitting a WARNING
> > for the corrupted tuple instead of erroring out immediately, as
> > discussed at [1].
> >
> > Basically, it provides a new GUC called vacuum_tolerate_damage to
> > control whether to continue the vacuum or to stop on encountering a
> > corrupted tuple.  If vacuum_tolerate_damage is set, then in all the
> > cases in heap_prepare_freeze_tuple where a corrupted xid is detected,
> > it will emit a warning, report that nothing was changed in the tuple,
> > and set 'tuple_totally_frozen' to false.  Since we return false, the
> > caller will not try to freeze such a tuple, and because
> > tuple_totally_frozen is also false, the page will not be marked as
> > all-frozen even if every other tuple on the page is frozen.
>
> I'm extremely doubtful this is a good idea. In all likelihood this will
> just exacerbate corruption.
>
> You cannot just stop freezing tuples, that'll lead to relfrozenxid
> getting *further* out of sync with the actual table contents. And you
> cannot just freeze such tuples, because that has a good chance of making
> deleted tuples suddenly visible, leading to unique constraint violations
> etc. Which will then subsequently lead to clog lookup errors and such.

I agree with the point.  But if we keep giving the ERROR in such
cases, the situation is not any better: we are not freezing such
tuples, and we also cannot advance the relfrozenxid.  So we follow the
same rule - we don't freeze those tuples and we don't advance the
relfrozenxid; the only difference is that we allow the vacuum to
continue with the other tuples.

> At the very least you'd need to signal up that relfrozenxid/relminmxid
> cannot be increased. Without that I think it's entirely unacceptable to
> do this.

I agree with that point.  I was just unsure whether we should
disallow advancing the relfrozenxid in all such cases, or whether, in
the cases where the xid already precedes the relfrozenxid, we could
allow it to advance since it cannot become any worse.  But, as Alvaro
pointed out, there is no point in optimizing such cases.  I will
update the patch to stop advancing the relfrozenxid if we find any
corrupted xid during tuple freezing.

> If we really were to do something like this, the option would need to
> be called vacuum_allow_making_corruption_worse or such.  It needs to
> be *exceedingly* clear that it will likely lead to making everything
> much worse.
>
Maybe we can clearly describe this in the documentation.

> > @@ -6123,6 +6124,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
> >       frz->t_infomask = tuple->t_infomask;
> >       frz->xmax = HeapTupleHeaderGetRawXmax(tuple);
>
> I don't think it can be right to just update heap_prepare_freeze_tuple()
> but not FreezeMultiXactId().

oh, I missed this part.  I will fix it.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

Dilip Kumar-2
On Tue, Jul 21, 2020 at 11:00 AM Dilip Kumar <[hidden email]> wrote:

>
> On Tue, Jul 21, 2020 at 2:00 AM Andres Freund <[hidden email]> wrote:
> >
> > Hi,
> >
> > On 2020-07-17 16:16:23 +0530, Dilip Kumar wrote:
> > > The attached patch allows VACUUM to continue by emitting a WARNING
> > > for the corrupted tuple instead of erroring out immediately, as
> > > discussed at [1].
> > >
> > > Basically, it provides a new GUC called vacuum_tolerate_damage to
> > > control whether to continue the vacuum or to stop on encountering a
> > > corrupted tuple.  If vacuum_tolerate_damage is set, then in all the
> > > cases in heap_prepare_freeze_tuple where a corrupted xid is detected,
> > > it will emit a warning, report that nothing was changed in the tuple,
> > > and set 'tuple_totally_frozen' to false.  Since we return false, the
> > > caller will not try to freeze such a tuple, and because
> > > tuple_totally_frozen is also false, the page will not be marked as
> > > all-frozen even if every other tuple on the page is frozen.
> >
> > I'm extremely doubtful this is a good idea. In all likelihood this will
> > just exacerbate corruption.
> >
> > You cannot just stop freezing tuples, that'll lead to relfrozenxid
> > getting *further* out of sync with the actual table contents. And you
> > cannot just freeze such tuples, because that has a good chance of making
> > deleted tuples suddenly visible, leading to unique constraint violations
> > etc. Which will then subsequently lead to clog lookup errors and such.
>
> I agree with the point.  But if we keep giving the ERROR in such
> cases, the situation is not any better: we are not freezing such
> tuples, and we also cannot advance the relfrozenxid.  So we follow the
> same rule - we don't freeze those tuples and we don't advance the
> relfrozenxid; the only difference is that we allow the vacuum to
> continue with the other tuples.
>
> > At the very least you'd need to signal up that relfrozenxid/relminmxid
> > cannot be increased. Without that I think it's entirely unacceptable to
> > do this.
>
> I agree with that point.  I was just unsure whether we should
> disallow advancing the relfrozenxid in all such cases, or whether, in
> the cases where the xid already precedes the relfrozenxid, we could
> allow it to advance since it cannot become any worse.  But, as Alvaro
> pointed out, there is no point in optimizing such cases.  I will
> update the patch to stop advancing the relfrozenxid if we find any
> corrupted xid during tuple freezing.
>
> > If we really were to do something like this, the option would need
> > to be called vacuum_allow_making_corruption_worse or such.  It needs
> > to be *exceedingly* clear that it will likely lead to making
> > everything much worse.
> >
> Maybe we can clearly describe this in the documentation.
>
> > > @@ -6123,6 +6124,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
> > >       frz->t_infomask = tuple->t_infomask;
> > >       frz->xmax = HeapTupleHeaderGetRawXmax(tuple);
> >
> > I don't think it can be right to just update heap_prepare_freeze_tuple()
> > but not FreezeMultiXactId().
>
> oh, I missed this part.  I will fix it.
Please find the updated patch.  In this version, we don't allow
relfrozenxid and relminmxid to advance if corruption is detected, and
I have also added the handling in FreezeMultiXactId.
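
For reference, the corresponding check inside FreezeMultiXactId()
could look roughly like this - a sketch only; the elevel and
found_corruption plumbing are assumptions about the patch, not its
exact text:

if (MultiXactIdPrecedes(multi, relminmxid))
{
    /* WARNING instead of ERROR when the caller tolerates damage */
    ereport(elevel,
            (errcode(ERRCODE_DATA_CORRUPTED),
             errmsg_internal("found multixact %u from before relminmxid %u",
                             multi, relminmxid)));
    *found_corruption = true;
    *flags |= FRM_NOOP;         /* leave the xmax untouched */
    return InvalidTransactionId;
}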

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment: v2-0001-Provide-a-GUC-to-allow-vacuum-to-continue-on-corr.patch (24K)

Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

Dilip Kumar-2
On Tue, Jul 21, 2020 at 4:08 PM Dilip Kumar <[hidden email]> wrote:

>
> On Tue, Jul 21, 2020 at 11:00 AM Dilip Kumar <[hidden email]> wrote:
> >
> > On Tue, Jul 21, 2020 at 2:00 AM Andres Freund <[hidden email]> wrote:
> > >
> > > Hi,
> > >
> > > On 2020-07-17 16:16:23 +0530, Dilip Kumar wrote:
> > > > The attached patch allows VACUUM to continue by emitting a WARNING
> > > > for the corrupted tuple instead of erroring out immediately, as
> > > > discussed at [1].
> > > >
> > > > Basically, it provides a new GUC called vacuum_tolerate_damage to
> > > > control whether to continue the vacuum or to stop on encountering a
> > > > corrupted tuple.  If vacuum_tolerate_damage is set, then in all the
> > > > cases in heap_prepare_freeze_tuple where a corrupted xid is detected,
> > > > it will emit a warning, report that nothing was changed in the tuple,
> > > > and set 'tuple_totally_frozen' to false.  Since we return false, the
> > > > caller will not try to freeze such a tuple, and because
> > > > tuple_totally_frozen is also false, the page will not be marked as
> > > > all-frozen even if every other tuple on the page is frozen.
> > >
> > > I'm extremely doubtful this is a good idea. In all likelihood this will
> > > just exacerbate corruption.
> > >
> > > You cannot just stop freezing tuples, that'll lead to relfrozenxid
> > > getting *further* out of sync with the actual table contents. And you
> > > cannot just freeze such tuples, because that has a good chance of making
> > > deleted tuples suddenly visible, leading to unique constraint violations
> > > etc. Which will then subsequently lead to clog lookup errors and such.
> >
> > I agree with the point.  But if we keep giving the ERROR in such
> > cases, the situation is not any better: we are not freezing such
> > tuples, and we also cannot advance the relfrozenxid.  So we follow the
> > same rule - we don't freeze those tuples and we don't advance the
> > relfrozenxid; the only difference is that we allow the vacuum to
> > continue with the other tuples.
> >
> > > At the very least you'd need to signal up that relfrozenxid/relminmxid
> > > cannot be increased. Without that I think it's entirely unacceptable to
> > > do this.
> >
> > I agree with that point.  I was just unsure whether we should
> > disallow advancing the relfrozenxid in all such cases, or whether, in
> > the cases where the xid already precedes the relfrozenxid, we could
> > allow it to advance since it cannot become any worse.  But, as Alvaro
> > pointed out, there is no point in optimizing such cases.  I will
> > update the patch to stop advancing the relfrozenxid if we find any
> > corrupted xid during tuple freezing.
> >
> > > If we really were to do something like this, the option would need
> > > to be called vacuum_allow_making_corruption_worse or such.  It needs
> > > to be *exceedingly* clear that it will likely lead to making
> > > everything much worse.
> > >
> > Maybe we can clearly describe this in the documentation.
> >
> > > > @@ -6123,6 +6124,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
> > > >       frz->t_infomask = tuple->t_infomask;
> > > >       frz->xmax = HeapTupleHeaderGetRawXmax(tuple);
> > >
> > > I don't think it can be right to just update heap_prepare_freeze_tuple()
> > > but not FreezeMultiXactId().
> >
> > oh, I missed this part.  I will fix it.
>
> Please find the updated patch.  In this version, we don't allow
> relfrozenxid and relminmxid to advance if corruption is detected, and
> I have also added the handling in FreezeMultiXactId.
In the previous version, the feature was enabled for the CLUSTER and
VACUUM FULL commands as well.  In the attached patch I have enabled it
only when we are running the VACUUM command; it will not be enabled
during a table rewrite.  If we think that it should be enabled for
VACUUM FULL as well, we would need to pass a flag from cluster_rel all
the way down to heap_freeze_tuple.
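
A rough sketch of that arrangement (the helper and flag names are
assumptions about the patch, not its exact text):

/*
 * Only plain VACUUM relaxes the elevel; table rewrites such as
 * CLUSTER/VACUUM FULL keep forcing an ERROR by passing
 * force_error = true.
 */
static int
vacuum_damage_elevel(bool force_error)
{
    if (force_error || !vacuum_tolerate_damage)
        return ERROR;
    return WARNING;
}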

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment: v3-0001-Provide-a-GUC-to-allow-vacuum-to-continue-on-corr.patch (25K)

Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

Robert Haas
In reply to this post by Andres Freund
On Mon, Jul 20, 2020 at 4:30 PM Andres Freund <[hidden email]> wrote:
> I'm extremely doubtful this is a good idea. In all likelihood this will
> just exacerbate corruption.
>
> You cannot just stop freezing tuples, that'll lead to relfrozenxid
> getting *further* out of sync with the actual table contents. And you
> cannot just freeze such tuples, because that has a good chance of making
> deleted tuples suddenly visible, leading to unique constraint violations
> etc. Which will then subsequently lead to clog lookup errors and such.

I think that the behavior ought to be:

- If we encounter any damaged tuples (e.g. tuple xid < relfrozenxid),
we give up on advancing relfrozenxid and relminmxid. This vacuum won't
change them at all.

- We do nothing to the damaged tuples themselves.

- We can still prune pages, and we can still freeze tuples that do not
appear to be damaged.

This amounts to an assumption that relfrozenxid is probably sane, and
that there are individual tuples that are messed up. It's probably not
the right thing if relfrozenxid got overwritten with a nonsense value
without changing the table contents. But, I think it's difficult to
cater to all contingencies. In my experience, the normal problem here
is that there are a few tuples or pages in the table that somehow
escaped vacuuming for long enough that they contain references to XIDs
from before the last time relfrozenxid was advanced - so continuing to
do what we can to the rest of the table is the right thing to do.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

Robert Haas
In reply to this post by Dilip Kumar-2
On Tue, Jul 21, 2020 at 9:21 AM Dilip Kumar <[hidden email]> wrote:
> In the previous version, the feature was enabled for the CLUSTER and
> VACUUM FULL commands as well.  In the attached patch I have enabled it
> only when we are running the VACUUM command; it will not be enabled
> during a table rewrite.  If we think that it should be enabled for
> VACUUM FULL as well, we would need to pass a flag from cluster_rel all
> the way down to heap_freeze_tuple.

I think this is a somewhat clunky way of accomplishing this. The
caller passes down a flag to heap_prepare_freeze_tuple() which decides
whether or not an error is forced, and then that function and
FreezeMultiXactId use vacuum_damage_elevel() to combine the results of
that flag with the value of the vacuum_tolerate_damage GUC. But that
means that a decision that could be made in one place is instead made
in many places. If we just had heap_freeze_tuple() and
FreezeMultiXactId() take an argument int vacuum_damage_elevel, then
heap_freeze_tuple() could pass ERROR and lazy_scan_heap() could
arrange to pass WARNING or ERROR based on the value of
vacuum_tolerate_damage. I think that would likely end up cleaner. What
do you think?
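
A sketch of the suggested shape (the signature is abbreviated and
hypothetical, not the committed API):

/* Callers pick the elevel once and simply pass it down. */
extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
                                      TransactionId relfrozenxid,
                                      TransactionId relminmxid,
                                      TransactionId cutoff_xid,
                                      TransactionId cutoff_multi,
                                      xl_heap_freeze_tuple *frz,
                                      bool *totally_frozen_p,
                                      int vacuum_damage_elevel,
                                      bool *found_corruption);

/*
 * heap_freeze_tuple() would pass ERROR unconditionally, while
 * lazy_scan_heap() would pass vacuum_tolerate_damage ? WARNING : ERROR.
 */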

I also suggest renaming is_corrupted_xid to found_corruption. With the
current name, it's not very clear which XID we're saying is corrupted;
in fact, the problem might be a MultiXactId rather than an XID, and
the real issue might be with the table's pg_class entry or something.

The new arguments to heap_prepare_freeze_tuple() need to be documented
in its header comment.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

Robert Haas
In reply to this post by Andres Freund
On Mon, Jul 20, 2020 at 4:30 PM Andres Freund <[hidden email]> wrote:
> If we really were to do something like this, the option would need to
> be called vacuum_allow_making_corruption_worse or such.  It needs to
> be *exceedingly* clear that it will likely lead to making everything
> much worse.

I don't really understand this objection. How does letting VACUUM
continue after problems have been detected make anything worse? I
agree that if it does, it shouldn't touch relfrozenxid or relminmxid,
but the patch has been adjusted to work that way. Assuming you don't
touch relfrozenxid or relminmxid, what harm results if you continue
freezing undamaged tuples and continue removing dead tuples after
finding a bad tuple? You may have already done an arbitrary amount of
that before encountering the damage, and doing it afterward is no
different. Doing the index vacuuming step is different, but I don't
see how that would exacerbate corruption either.

The point is that when you make VACUUM fail, you not only don't
advance relfrozenxid/relminmxid, but also don't remove dead tuples. In
the long run, either thing will kill you, but it is not difficult to
have a situation where failing to remove dead tuples kills you a lot
faster. The table can just bloat until performance tanks, and then the
application goes down, even if you still had 100+ million XIDs before
you needed a wraparound vacuum.

Honestly, I wonder why continuing (but without advancing relfrozenxid
or relminmxid) shouldn't be the default behavior. I mean, if it
actually corrupts your data, then it clearly shouldn't be, and
probably shouldn't even be an optional behavior, but I still don't see
why it would do that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

Andres Freund
Hi,

On 2020-08-28 12:37:17 -0400, Robert Haas wrote:
> On Mon, Jul 20, 2020 at 4:30 PM Andres Freund <[hidden email]> wrote:
> > If we really were to do something like this, the option would need
> > to be called vacuum_allow_making_corruption_worse or such.  It needs
> > to be *exceedingly* clear that it will likely lead to making
> > everything much worse.
>
> I don't really understand this objection. How does letting VACUUM
> continue after problems have been detected make anything worse?

It can break HOT chains, plain ctid chains, etc., for example. And if
earlier or follower tuples are removed, that can't be detected anymore
at a later time.


> The point is that when you make VACUUM fail, you not only don't
> advance relfrozenxid/relminmxid, but also don't remove dead tuples. In
> the long run, either thing will kill you, but it is not difficult to
> have a situation where failing to remove dead tuples kills you a lot
> faster. The table can just bloat until performance tanks, and then the
> application goes down, even if you still had 100+ million XIDs before
> you needed a wraparound vacuum.
>
> Honestly, I wonder why continuing (but without advancing relfrozenxid
> or relminmxid) shouldn't be the default behavior. I mean, if it
> actually corrupts your data, then it clearly shouldn't be, and
> probably shouldn't even be an optional behavior, but I still don't see
> why it would do that.

I think it's an EXTREMELY bad idea to enable anything like this by
default. It'll make bugs entirely undiagnosable, because we'll remove a
lot of the evidence of what the problem is. And we've had many
long-standing bugs in this area, several found only because we
actually started to bleat about them. And quite evidently, we have
more bugs to fix in this area.

Greetings,

Andres Freund



Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

Robert Haas
On Fri, Aug 28, 2020 at 1:29 PM Andres Freund <[hidden email]> wrote:
> It can break HOT chains, plain ctid chains, etc., for example. And if
> earlier or follower tuples are removed, that can't be detected anymore
> at a later time.

I think I need a more specific example here to understand the problem.
If the xmax of one tuple matches the xmin of the next, then either
both values precede relfrozenxid or both follow it. In the former
case, neither tuple should be frozen and the chain should not get
broken; in the latter case, everything's normal anyway. If the xmax
and xmin don't match, then the chain was already broken. Perhaps we
are removing important evidence, though it seems like that might've
happened anyway prior to reaching the damaged page, but we're not
making whatever corruption may exist any worse. At least, not as far
as I can see.

> And we've had many long-standing bugs in this area, several found
> only because we actually started to bleat about them. And quite
> evidently, we have more bugs to fix in this area.

I agree with all of this, but I do not think that it establishes that
we should abandon the entire VACUUM. "Bleating" about something
usually means logging it, and I think you understand that I am not now
nor have I ever complained about the logging we are doing here. I also
think you understand why I don't like the current behavior, and that
EDB has actual customers who have actually been damaged by it. All the
same, I don't expect to win an argument about changing the default,
but I hope to win one about at least providing an option. And if we're
not even going to do that much, then I hope to come out of this
discussion with a clear understanding of exactly why that's a bad
idea. I don't think "we need the data for forensics" is a sufficient
justification for "if you end up with one corrupted XID in a
billion-row table, your entire table will bloat out the wazoo, and
there is no option to get any other behavior."

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

Dilip Kumar-2
In reply to this post by Robert Haas
On Fri, Aug 28, 2020 at 9:49 PM Robert Haas <[hidden email]> wrote:

>
> On Tue, Jul 21, 2020 at 9:21 AM Dilip Kumar <[hidden email]> wrote:
> > In the previous version, the feature was enabled for the CLUSTER and
> > VACUUM FULL commands as well.  In the attached patch I have enabled it
> > only when we are running the VACUUM command; it will not be enabled
> > during a table rewrite.  If we think that it should be enabled for
> > VACUUM FULL as well, we would need to pass a flag from cluster_rel all
> > the way down to heap_freeze_tuple.
>
> I think this is a somewhat clunky way of accomplishing this. The
> caller passes down a flag to heap_prepare_freeze_tuple() which decides
> whether or not an error is forced, and then that function and
> FreezeMultiXactId use vacuum_damage_elevel() to combine the results of
> that flag with the value of the vacuum_tolerate_damage GUC. But that
> means that a decision that could be made in one place is instead made
> in many places. If we just had heap_freeze_tuple() and
> FreezeMultiXactId() take an argument int vacuum_damage_elevel, then
> heap_freeze_tuple() could pass ERROR and lazy_scan_heap() could
> arrange to pass WARNING or ERROR based on the value of
> vacuum_tolerate_damage. I think that would likely end up cleaner. What
> do you think?
I agree, this way it is much cleaner.  I have made that change in the attached patch.

> I also suggest renaming is_corrupted_xid to found_corruption. With the
> current name, it's not very clear which XID we're saying is corrupted;
> in fact, the problem might be a MultiXactId rather than an XID, and
> the real issue might be with the table's pg_class entry or something.

Okay, changed to found_corruption.

> The new arguments to heap_prepare_freeze_tuple() need to be documented
> in its header comment.

Done.

I have also made a few more cosmetic changes to the patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment: v4-0001-Provide-a-GUC-to-allow-vacuum-to-continue-on-corr.patch (25K)

Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

Dilip Kumar-2
In reply to this post by Robert Haas
On Sat, Aug 29, 2020 at 1:46 AM Robert Haas <[hidden email]> wrote:

>
> On Fri, Aug 28, 2020 at 1:29 PM Andres Freund <[hidden email]> wrote:
> > It can break HOT chains, plain ctid chains, etc., for example. And if
> > earlier or follower tuples are removed, that can't be detected anymore
> > at a later time.
>
> I think I need a more specific example here to understand the problem.
> If the xmax of one tuple matches the xmin of the next, then either
> both values precede relfrozenxid or both follow it. In the former
> case, neither tuple should be frozen and the chain should not get
> broken; in the latter case, everything's normal anyway. If the xmax
> and xmin don't match, then the chain was already broken. Perhaps we
> are removing important evidence, though it seems like that might've
> happened anyway prior to reaching the damaged page, but we're not
> making whatever corruption may exist any worse. At least, not as far
> as I can see.

One example: suppose that during vacuum there are two tuples in a HOT
chain, and the xmin of the first tuple is corrupted (i.e. smaller than
relfrozenxid), while the xmax of this tuple (which is the same as the
xmin of the second tuple) is smaller than the cutoff_xid used for
freezing.  As a result, vacuum will freeze the second tuple but leave
the first tuple untouched.

Now, if we come through heap_hot_search_buffer, the xmax of the first
tuple will no longer match the xmin of the second tuple, because we
have frozen the second tuple.  But I feel this is easily fixable,
right?  Instead of doing nothing to the corrupted tuple, we could
partially freeze it: leave the corrupted xid alone, but mark the other
xid as frozen if it is smaller than the cutoff_xid.
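
For context, the check that cuts the chain is essentially this test in
heap_hot_search_buffer(): once the second tuple's xmin reads back as
FrozenTransactionId, it can no longer equal the first tuple's raw xmax:

/*
 * When following a HOT chain, the next tuple's xmin must match the
 * prior tuple's xmax; a frozen xmin no longer does, so we stop here.
 */
if (TransactionIdIsValid(prev_xmax) &&
    !TransactionIdEquals(prev_xmax,
                         HeapTupleHeaderGetXmin(heapTuple->t_data)))
    break;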

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

