xid wraparound danger due to INDEX_CLEANUP false


xid wraparound danger due to INDEX_CLEANUP false

Andres Freund
Hi,

commit a96c41feec6b6616eb9d5baee9a9e08c20533c38
Author: Robert Haas <[hidden email]>
Date:   2019-04-04 14:58:53 -0400

    Allow VACUUM to be run with index cleanup disabled.

    This commit adds a new reloption, vacuum_index_cleanup, which
    controls whether index cleanup is performed for a particular
    relation by default.  It also adds a new option to the VACUUM
    command, INDEX_CLEANUP, which can be used to override the
    reloption.  If neither the reloption nor the VACUUM option is
    used, the default is true, as before.

    Masahiko Sawada, reviewed and tested by Nathan Bossart, Alvaro
    Herrera, Kyotaro Horiguchi, Darafei Praliaskouski, and me.
    The wording of the documentation is mostly due to me.

    Discussion: http://postgr.es/m/CAD21AoAt5R3DNUZSjOoXDUY=naYPUOuffVsRzuTYMz29yLzQCA@...

made the index scan that is part of vacuum optional.  I'm afraid that it
is not safe to do so unconditionally. Until this commit indexes could
rely on at least the amvacuumcleanup callback being called once per
vacuum. Which guaranteed that an index could ensure that there are no
too-old xids anywhere in the index.

But now that's not the case anymore:

        vacrelstats->useindex = (nindexes > 0 &&
                                 params->index_cleanup == VACOPT_TERNARY_ENABLED);
...
        /* Do post-vacuum cleanup */
        if (vacrelstats->useindex)
                lazy_cleanup_all_indexes(Irel, indstats, vacrelstats, lps, nindexes);

E.g. btree has xids both in the metapage contents, as well as using it
on normal index pages as part of page deletion.

The slightly older feature to avoid unnecessary scans during cleanup
protects against this issue by skipping the scan inside the index AM:

/*
 * _bt_vacuum_needs_cleanup() -- Checks if index needs cleanup assuming that
 * btbulkdelete() wasn't called.
 */
static bool
_bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
{
...
        else if (TransactionIdIsValid(metad->btm_oldest_btpo_xact) &&
                         TransactionIdPrecedes(metad->btm_oldest_btpo_xact,
                                                                   RecentGlobalXmin))
        {
                /*
                 * If oldest btpo.xact in the deleted pages is older than
                 * RecentGlobalXmin, then at least one deleted page can be recycled.
                 */
                result = true;
        }

which will afaict result in all such xids getting removed (or at least
give the AM the choice to do so).


It's possible that something protects against dangers in the case of
INDEX CLEANUP false, or that the consequences aren't too bad. But I
didn't see any comments about the dangers in the patch.

Greetings,

Andres Freund



Re: xid wraparound danger due to INDEX_CLEANUP false

Robert Haas
On Wed, Apr 15, 2020 at 7:38 PM Andres Freund <[hidden email]> wrote:
> It's possible that something protects against dangers in the case of
> INDEX CLEANUP false, or that the consequences aren't too bad. But I
> didn't see any comments about the dangers in the patch.

I seem to recall Simon raising this issue at the time that the patch
was being discussed, and I thought that we had eventually decided that
it was OK for some reason. But I don't remember the details, and it is
possible that we got it wrong. :-(

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: xid wraparound danger due to INDEX_CLEANUP false

Peter Geoghegan-4
On Wed, Apr 15, 2020 at 4:57 PM Robert Haas <[hidden email]> wrote:
> I seem to recall Simon raising this issue at the time that the patch
> was being discussed, and I thought that we had eventually decided that
> it was OK for some reason. But I don't remember the details, and it is
> possible that we got it wrong. :-(

It must be unreliable because it's based on something that is known to
be unreliable:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/nbtree/README;h=c5b0a30e4ebd4fe3bd4a6f8192284c452d1170b9;hb=refs/heads/REL_12_STABLE#l331

Also, the commit message of 6655a729 says that nbtree has had this
problem "since time immemorial". I am planning to work on that
problem, eventually.

--
Peter Geoghegan



Re: xid wraparound danger due to INDEX_CLEANUP false

Masahiko Sawada
In reply to this post by Andres Freund
On Thu, Apr 16, 2020 at 8:38 AM Andres Freund <[hidden email]> wrote:

> [... full quote of Andres' message trimmed ...]
> which will afaict result in all such xids getting removed (or at least
> give the AM the choice to do so).

For btree indexes, IIRC skipping index cleanup cannot cause corruption,
but it can cause index bloat, since it leaves recyclable pages that are
not marked as recyclable. That bloat is the main side effect of skipping
index cleanup. When the user later executes VACUUM with INDEX_CLEANUP on
to reclaim index garbage, will such pages also be recycled sooner or
later? Or can skipping index cleanup cause a recyclable page to never be
recycled?

Regards,

--
Masahiko Sawada  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: xid wraparound danger due to INDEX_CLEANUP false

Andres Freund
Hi,

On 2020-04-16 16:30:02 +0900, Masahiko Sawada wrote:
> For btree indexes, IIRC skipping index cleanup could not be a cause of
> corruption, but be a cause of index bloat since it leaves recyclable
> pages which are not marked as recyclable. The index bloat is the main
> side effect of skipping index cleanup. When user executes VACUUM with
> INDEX_CLEANUP to reclaim index garbage, such pages will also be
> recycled sooner or later? Or skipping index cleanup can be a cause of
> recyclable page never being recycled?

Well, it depends on what you define as "never". Once the xids on the
pages have wrapped around, the page level xids will appear to be from
the future for a long time. And the metapage xid appearing to be from
the future will prevent some vacuums from actually doing the scan too,
even if INDEX_CLEANUP is reenabled.  So a VACUUM, even with
INDEX_CLEANUP on, will not be able to recycle those pages anymore.  At
some point the wrapped around xids will be "current" again, if there's
enough new xids.


It's not ok for vacuumlazy.c to make decisions like this. I think the
INDEX_CLEANUP logic clearly needs to be pushed down into the
amvacuumcleanup callbacks, and it needs to be left to the index AMs to
decide what the correct behaviour is.


You can't just change things like this without reviewing the
consequences to AMs and documenting them?

Greetings,

Andres Freund



Re: xid wraparound danger due to INDEX_CLEANUP false

Andres Freund
In reply to this post by Peter Geoghegan-4
Hi,

On 2020-04-15 18:11:40 -0700, Peter Geoghegan wrote:

> On Wed, Apr 15, 2020 at 4:57 PM Robert Haas <[hidden email]> wrote:
> > I seem to recall Simon raising this issue at the time that the patch
> > was being discussed, and I thought that we had eventually decided that
> > it was OK for some reason. But I don't remember the details, and it is
> > possible that we got it wrong. :-(
>
> It must be unreliable because it's based on something that is known to
> be unreliable:
>
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/nbtree/README;h=c5b0a30e4ebd4fe3bd4a6f8192284c452d1170b9;hb=refs/heads/REL_12_STABLE#l331

Sure, there is some pre-existing wraparound danger for individual
pages. But it was a pretty narrow corner case before the INDEX_CLEANUP
off feature.

That comment says something about "shared-memory free space map", making
it sound like any crash would lose the page. But it's a normal FSM
these days. Vacuum will insert the deleted page into the free space
map. So either the FSM would need to be corrupted to not find the
inserted page anymore, or the index would need to grow slow enough to
not use a page before the wraparound.  And then such wrapped around xids
would exist on individual pages. Not on all deleted pages, like with
INDEX_CLEANUP false.

And, what's worse, in the INDEX_CLEANUP off case, future VACUUMs with
INDEX_CLEANUP on might not even visit the index. As there very well
might not be many dead heap tuples around anymore (previous vacuums with
cleanup off will have removed them), the
vacuum_cleanup_index_scale_factor logic may prevent index vacuums. In
contrast to the normal situations where the btm_oldest_btpo_xact check
will prevent that from becoming a problem.


Peter, as far as I can tell, with INDEX_CLEANUP off, nbtree will never
be able to recycle half-dead pages? And thus would effectively never
recycle any dead space? Is that correct?

Greetings,

Andres Freund



Re: xid wraparound danger due to INDEX_CLEANUP false

Peter Geoghegan-4
On Thu, Apr 16, 2020 at 11:27 AM Andres Freund <[hidden email]> wrote:
> Sure, there is some pre-existing wraparound danger for individual
> pages. But it's a pretty narrow corner case before INDEX_CLEANUP
> off.

It's a matter of degree. Hard to judge something like that.

> And, what's worse, in the INDEX_CLEANUP off case, future VACUUMs with
> INDEX_CLEANUP on might not even visit the index. As there very well
> might not be many dead heap tuples around anymore (previous vacuums with
> cleanup off will have removed them), the
> vacuum_cleanup_index_scale_factor logic may prevent index vacuums. In
> contrast to the normal situations where the btm_oldest_btpo_xact check
> will prevent that from becoming a problem.

I guess that they should visit the metapage to see if they need to do
that much. That would allow us to fix the problem while mostly
honoring INDEX_CLEANUP off, I think.

> Peter, as far as I can tell, with INDEX_CLEANUP off, nbtree will never
> be able to recycle half-dead pages? And thus would effectively never
> recycle any dead space? Is that correct?

I agree. The fact that btm_oldest_btpo_xact is an all-or-nothing thing
(with wraparound hazards) is bad in itself, and introduced new risk to
v11 compared to previous versions (without the INDEX_CLEANUP = off
feature entering into it).  The simple fact that we don't even check
it with INDEX_CLEANUP = off is a bigger problem, though, and one that
now seems unrelated.

BTW, a lot of people get confused about what half-dead pages are. I
would like to make something clear that may not be obvious: While it's
bad that the implementation leaks pages that should go in the FSM,
it's not the end of the world. They should get evicted from
shared_buffers pretty quickly if there is any pressure, and impose no
real cost on index scans.

There are (roughly) 3 types of pages that we're concerned about here
in the common case where we're just deleting a leaf page:

* A half-dead page -- no downlink in its parent, marked dead.

* A deleted page -- now no sidelinks, either. Not initially safe to recycle.

* A deleted page in the FSM -- this is what we have the interlock for.

Half-dead pages are pretty rare, because VACUUM really has to have a
hard crash for that to happen (that might not be 100% true, but it's
at least 99% true). That's always been the case, and we don't really
need to talk about them here at all. We're just concerned with deleted
pages in the context of this discussion (and whether or not they can
be recycled without confusing in-flight index scans). These are the
only pages that are marked with an XID at all.

Another thing that's worth pointing out is that this whole
RecentGlobalXmin business is how we opted to implement what Lanin &
Sasha call "the drain technique". It is rather different to the usual
ways in which we use RecentGlobalXmin. We're only using it as a proxy
(an absurdly conservative proxy) for whether or not there might be an
in-flight index scan that lands on a concurrently recycled index page
and gets completely confused. So it is purely about the integrity of
the data structure itself. It is a consequence of doing so little
locking when descending the tree -- our index scans don't need to
couple buffer locks on the way down the tree at all. So we make VACUUM
worry about that, rather than making index scans worry about VACUUM
(though the latter design is a reasonable and common one).

There is absolutely no reason why we have to delay recycling for very
long, even in cases with long running transactions or whatever. I
agree that it's just an accident that it works that way. VACUUM could
probably remember deleted pages, and then revisit those pages at the
end of the index vacuuming -- that might make a big difference in a
lot of workloads. Or it could chain them together as a linked list
which can be accessed much more eagerly in some cases.

--
Peter Geoghegan



Re: xid wraparound danger due to INDEX_CLEANUP false

Andres Freund
Hi,

On 2020-04-16 13:28:00 -0700, Peter Geoghegan wrote:

> > And, what's worse, in the INDEX_CLEANUP off case, future VACUUMs with
> > INDEX_CLEANUP on might not even visit the index. As there very well
> > might not be many dead heap tuples around anymore (previous vacuums with
> > cleanup off will have removed them), the
> > vacuum_cleanup_index_scale_factor logic may prevent index vacuums. In
> > contrast to the normal situations where the btm_oldest_btpo_xact check
> > will prevent that from becoming a problem.
>
> I guess that they should visit the metapage to see if they need to do
> that much. That would allow us to fix the problem while mostly
> honoring INDEX_CLEANUP off, I think.

Yea. _bt_vacuum_needs_cleanup() needs to check if
metad->btm_oldest_btpo_xact is older than the FreezeLimit computed by
vacuum_set_xid_limits() and vacuum the index if so even if INDEX_CLEANUP
false.


> BWT, a lot of people get confused about what half-dead pages are. I
> would like to make something clear that may not be obvious: While it's
> bad that the implementation leaks pages that should go in the FSM,
> it's not the end of the world. They should get evicted from
> shared_buffers pretty quickly if there is any pressure, and impose no
> real cost on index scans.

Yea, half-dead pages aren't the main problem. It's pages that contain
only dead tuples, but aren't unlinked from the tree. Without a vacuum
scan we'll never reuse them - even if we know they're all dead.

Note that the page being in the FSM is not protection against wraparound
:(. We recheck whether a page is recyclable when getting it from the FSM
(probably required, due to the FSM not being crashsafe). It's of course
much less likely to happen at that stage, because the pages can get
reused.

I think we should really just stop being miserly and update the xid to be
FrozenTransactionId or InvalidTransactionId when vacuum encounters one
that's from before the xid cutoff used by vacuum (i.e. what could
become the new relfrozenxid).  That seems like it'd be a few lines, not
more.


> Another thing that's worth pointing out is that this whole
> RecentGlobalXmin business is how we opted to implement what Lanin &
> Sasha call "the drain technique". It is rather different to the usual
> ways in which we use RecentGlobalXmin. We're only using it as a proxy
> (an absurdly conservative proxy) for whether or not there might be an
> in-flight index scan that lands on a concurrently recycled index page
> and gets completely confused. So it is purely about the integrity of
> the data structure itself. It is a consequence of doing so little
> locking when descending the tree -- our index scans don't need to
> couple buffer locks on the way down the tree at all. So we make VACUUM
> worry about that, rather than making index scans worry about VACUUM
> (though the latter design is a reasonable and common one).
>
> There is absolutely no reason why we have to delay recycling for very
> long, even in cases with long running transactions or whatever. I
> agree that it's just an accident that it works that way. VACUUM could
> probably remember deleted pages, and then revisited those pages at the
> end of the index vacuuming -- that might make a big difference in a
> lot of workloads. Or it could chain them together as a linked list
> which can be accessed much more eagerly in some cases.

I think it doesn't really help meaningfully for vacuum to be a bit
smarter about when to recognize pages as being recyclable. IMO the big
issue is that vacuum won't be very frequent, so we'll grow the index
until that time, even if there's many "effectively empty" pages.

I.e. even if the killtuples logic allows us to recognize that all actual
index tuples are fully dead, we'll not benefit from that unless there's
a new insertion that belongs onto the "empty" page. That's fine for
indexes that are updated roughly evenly across the value range, but
terrible for indexes that "grow" mostly on one side, and "shrink" on the
other.

I'd bet it'd be beneficial if we were to either have scans unlink such
pages directly, or if they just entered the page into the FSM and have
_bt_getbuf() do the unlinking.  I'm not sure if the current locking
model assumes anywhere that there is only one process (vacuum) unlinking
pages though?

Greetings,

Andres Freund



Re: xid wraparound danger due to INDEX_CLEANUP false

Peter Geoghegan-4
On Thu, Apr 16, 2020 at 3:49 PM Andres Freund <[hidden email]> wrote:
> I think we really just stop being miserly and update the xid to be
> FrozenTransactionId or InvalidTransactionId when vacuum encounters one
> that's from before the the xid cutoff used by vacuum (i.e. what could
> become the new relfrozenxid).  That seems like it'd be a few lines, not
> more.

Okay.

> > There is absolutely no reason why we have to delay recycling for very
> > long, even in cases with long running transactions or whatever. I
> > agree that it's just an accident that it works that way. VACUUM could
> > probably remember deleted pages, and then revisited those pages at the
> > end of the index vacuuming -- that might make a big difference in a
> > lot of workloads. Or it could chain them together as a linked list
> > which can be accessed much more eagerly in some cases.
>
> I think it doesn't really help meaningfully for vacuum to be a bit
> smarter about when to recognize pages as being recyclable. IMO the big
> issue is that vacuum won't be very frequent, so we'll grow the index
> until that time, even if there's many "effectively empty" pages.

(It seems like you're talking about the big picture now, not the
problems in Postgres 11 and 12 features in this area -- you're talking
about what happens to empty pages, not what happens to deleted pages.)

I'll say some more things about the less ambitious goal of more eager
recycling of pages, as they are deleted:

An individual VACUUM operation cannot recycle a page right after
_bt_pagedel() is called to delete the page. VACUUM will both set a
target leaf page half dead and delete it all at once in _bt_pagedel()
(it does that much in the simple and common case). Again, this is
because recycling very soon after the call to _bt_pagedel() will
introduce races with concurrent index scans -- they could fail to
observe the deletion in the parent (i.e. see its downlink, since child
isn't even half dead), and land on a concurrently recycled page
(VACUUM concurrently marks the page half-dead, fully dead/deleted, and
then even goes as far as recycling it). So the current design makes a
certain amount of sense -- we can't be super aggressive like that.
(Actually, maybe it doesn't make sense to not just put the page in the
FSM there and then -- see "Thinks some more" below.)

Even still, nothing stops the same VACUUM operation from (for example)
remembering a list of pages it has deleted during the current scan,
and then coming back at the end of the bulk scan of the index to
reconsider if it can recycle the pages now (2 minutes later instead of
2 months later). With a new RecentGlobalXmin (or something that's
conceptually like a new RecentGlobalXmin).

Similarly, we could do limited VACUUMs that only visit previously
deleted pages, once VACUUM is taught to chain deleted pages together
to optimize recycling. We don't have to repeat another pass over the
entire index to recycle the pages because of this special deleted page
linking. This is something that we use when we have to recycle pages,
but it's an "INDEX_CLEANUP = off" index VACUUM -- we don't really want
to do most of the stuff that index vacuuming needs to do, but we must
still visit the metapage to check btm_oldest_btpo_xact, and then maybe
walk the deleted page linked list.

*** Thinks some more ***

As you pointed out, _bt_getbuf() already distrusts the FSM -- it has
its own _bt_page_recyclable() check, probably because the FSM isn't
crash safe. Maybe we could improve matters by teaching _bt_pagedel()
to put a page it deleted in the FSM immediately -- no need to wait
until the next index VACUUM for the RecordFreeIndexPage() call. It
still isn't quite enough that _bt_getbuf() distrusts the FSM, so we'd
also have to teach _bt_getbuf() some heuristics that made it
understand that VACUUM is now designed to put stuff in the FSM
immediately, so we don't have to wait for the next VACUUM operation to
get to it. Maybe _bt_getbuf() should try the FSM a few times before
giving up and allocating a new page, etc.

This wouldn't make VACUUM delete any more pages any sooner, but it
would make those pages reclaimable much sooner. Also, it wouldn't
solve the wraparound problem, but that is a bug, not a
performance/efficiency issue.

> I.e. even if the killtuples logic allows us to recognize that all actual
> index tuples are fully dead, we'll not benefit from that unless there's
> a new insertion that belongs onto the "empty" page. That's fine for
> indexes that are updated roughly evenly across the value range, but
> terrible for indexes that "grow" mostly on one side, and "shrink" on the
> other.

That could be true, but there are certain things about B-Tree space
utilization that might surprise you:

https://www.drdobbs.com/reexamining-b-trees/184408694?pgno=3

> I'd bet it be beneficial if we were to either have scans unlink such
> pages directly, or if they just entered the page into the FSM and have
> _bt_getbuf() do the unlinking.

That won't work, since you're now talking about pages that aren't
deleted (or even half-dead) that are just candidates to be deleted
because they're empty. So you'd have to do all the steps in
_bt_pagedel() within a new _bt_getbuf() path, which would have many
deadlock hazards. Unlinking the page from the tree itself (deleting)
is really complicated.

> I'm not sure if the current locking
> model assumes anywhere that there is only one process (vacuum) unlinking
> pages though?

I'm not sure, though _bt_unlink_halfdead_page() has comments supposing
that there could be concurrent page deletions like that.

--
Peter Geoghegan



Re: xid wraparound danger due to INDEX_CLEANUP false

Masahiko Sawada-2
In reply to this post by Andres Freund
On Fri, 17 Apr 2020 at 02:58, Andres Freund <[hidden email]> wrote:

> [... full quote trimmed ...]
> It's no ok for vacuumlazy.c to make decisions like this. I think the
> INDEX_CLEANUP logic clearly needs to be pushed down into the
> amvacuumcleanup callbacks, and it needs to be left to the index AMs to
> decide what the correct behaviour is.

I wanted to clarify the impact of this bug. I agree with you.

On Fri, 17 Apr 2020 at 07:49, Andres Freund <[hidden email]> wrote:
>
> Hi,
>
> On 2020-04-16 13:28:00 -0700, Peter Geoghegan wrote:
> > > And, what's worse, in the INDEX_CLEANUP off case, future VACUUMs with
> > > INDEX_CLEANUP on might not even visit the index. As there very well
> > > might not be many dead heap tuples around anymore (previous vacuums with
> > > cleanup off will have removed them), the
> > > vacuum_cleanup_index_scale_factor logic may prevent index vacuums.

I think this doesn't happen because, in the INDEX_CLEANUP off case,
vacuum marks the line pointers of dead tuples as dead but leaves them in
place. Therefore future VACUUMs with INDEX_CLEANUP on will see these
dead line pointers and invoke ambulkdelete.

> > I guess that they should visit the metapage to see if they need to do
> > that much. That would allow us to fix the problem while mostly
> > honoring INDEX_CLEANUP off, I think.
>
> Yea. _bt_vacuum_needs_cleanup() needs to check if
> metad->btm_oldest_btpo_xact is older than the FreezeLimit computed by
> vacuum_set_xid_limits() and vacuum the index if so even if INDEX_CLEANUP
> false.

Agreed. So _bt_vacuum_needs_cleanup() would become something like the following?

if (metad->btm_version < BTREE_NOVAC_VERSION)
    result = true;
else if (TransactionIdIsValid(metad->btm_oldest_btpo_xact) &&
         TransactionIdPrecedes(metad->btm_oldest_btpo_xact,
                               FreezeLimit))
    result = true;
else if (index_cleanup_disabled)
    result = false;
else if (TransactionIdIsValid(metad->btm_oldest_btpo_xact) &&
         TransactionIdPrecedes(metad->btm_oldest_btpo_xact,
                               RecentGlobalXmin))
    result = true;
else
   result = determine based on vacuum_cleanup_index_scale_factor;

Or perhaps we can change _bt_vacuum_needs_cleanup() so that it does
index cleanup if metad->btm_oldest_btpo_xact is older than the
FreezeLimit *and* it's an aggressive vacuum.

Anyway, if we change IndexVacuumInfo to tell the AM that the
INDEX_CLEANUP option is disabled and to pass FreezeLimit, a problem is
that it would break compatibility.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: xid wraparound danger due to INDEX_CLEANUP false

Robert Haas
In reply to this post by Andres Freund
On Thu, Apr 16, 2020 at 6:49 PM Andres Freund <[hidden email]> wrote:
> Yea. _bt_vacuum_needs_cleanup() needs to check if
> metad->btm_oldest_btpo_xact is older than the FreezeLimit computed by
> vacuum_set_xid_limits() and vacuum the index if so even if INDEX_CLEANUP
> false.

I'm still fairly unclear on what the actual problem is here, and on
how we propose to fix it. It seems to me that we probably don't have a
problem in the case where we don't advance relfrozenxid or relminmxid,
because in that case there's not much difference between the behavior
created by this patch and a case where we just error out due to an
interrupt or something before reaching the index cleanup stage. I
think that the problem is that in the case where we do advance relfrozenxid,
we might advance it past some XID value stored in the index metadata.
Is that right?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: xid wraparound danger due to INDEX_CLEANUP false

Peter Geoghegan-4
In reply to this post by Masahiko Sawada
On Thu, Apr 16, 2020 at 12:30 AM Masahiko Sawada <[hidden email]> wrote:
> For btree indexes, IIRC skipping index cleanup could not be a cause of
> corruption, but be a cause of index bloat since it leaves recyclable
> pages which are not marked as recyclable.

I spotted a bug in "Skip full index scan during cleanup of B-tree
indexes when possible" which is unrelated to the index cleanup issue.

This code is wrong, because we don't have a buffer lock (or a buffer
pin) on the buffer anymore:

        ndel = _bt_pagedel(rel, buf);

        /* count only this page, else may double-count parent */
        if (ndel)
        {
            stats->pages_deleted++;
            if (!TransactionIdIsValid(vstate->oldestBtpoXact) ||
                TransactionIdPrecedes(opaque->btpo.xact,
                                      vstate->oldestBtpoXact))
                vstate->oldestBtpoXact = opaque->btpo.xact;
        }

        MemoryContextSwitchTo(oldcontext);
        /* pagedel released buffer, so we shouldn't */

(As the comment says, _bt_pagedel() releases it.)

There is another, more fundamental issue, though: _bt_pagedel() can
delete more than one page. That might be okay if the "extra" pages
were always internal pages, but they're not -- it says so in the
comments above _bt_pagedel(). See the code at the end of
_bt_pagedel(), that says something about why we delete the right
sibling page in some cases.

I think that the fix is to push down the vstate into lower level code
in nbtpage.c. Want to have a go at fixing it?

(It would be nice if we could teach Valgrind to "poison" buffers when
we don't have a pin held...that would probably have caught this issue
almost immediately.)

--
Peter Geoghegan
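[Editor's note: the shape of the bug, and of the narrow local fix of copying the XID out before the call that releases the buffer, can be sketched with stand-in types. `BTPageOpaque`, `BTVacState`, and `record_oldest_btpo_xact` here are simplified stand-ins invented for the sketch, not the real nbtree definitions; as Peter notes, the complete fix also requires pushing the vstate down into nbtpage.c because _bt_pagedel() can delete more than one page.]

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint32_t TransactionId;

/* Simplified stand-ins for the nbtree structures involved. */
typedef struct BTPageOpaque { TransactionId btpo_xact; } BTPageOpaque;
typedef struct BTVacState  { TransactionId oldestBtpoXact; } BTVacState;

static bool
xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * The buggy pattern: opaque points into the buffer, but _bt_pagedel()
 * releases that buffer, so dereferencing opaque afterwards is a
 * use-after-release.  The fix sketched here copies the XID out first.
 */
static void
record_oldest_btpo_xact(BTVacState *vstate, const BTPageOpaque *opaque)
{
    TransactionId xact = opaque->btpo_xact; /* copy BEFORE buffer release */

    /* ... _bt_pagedel(rel, buf) would run here and release the buffer ... */

    if (vstate->oldestBtpoXact == 0 ||      /* 0 standing in for InvalidTransactionId */
        xid_precedes(xact, vstate->oldestBtpoXact))
        vstate->oldestBtpoXact = xact;
}
```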



Re: xid wraparound danger due to INDEX_CLEANUP false

Peter Geoghegan-4
On Wed, Apr 22, 2020 at 6:05 PM Peter Geoghegan <[hidden email]> wrote:
> (It would be nice if we could teach Valgrind to "poison" buffers when
> we don't have a pin held...that would probably have caught this issue
> almost immediately.)

I can get Valgrind to complain about it when the regression tests are
run with the attached patch applied. I see this in the logs at several
points when "make installcheck" runs:

==1082059== VALGRINDERROR-BEGIN
==1082059== Invalid read of size 4
==1082059==    at 0x21D8DE: btvacuumpage (nbtree.c:1370)
==1082059==    by 0x21DA61: btvacuumscan (nbtree.c:1039)
==1082059==    by 0x21DBD5: btbulkdelete (nbtree.c:879)
==1082059==    by 0x215821: index_bulk_delete (indexam.c:698)
==1082059==    by 0x20FDCE: lazy_vacuum_index (vacuumlazy.c:2427)
==1082059==    by 0x2103EA: lazy_vacuum_all_indexes (vacuumlazy.c:1794)
==1082059==    by 0x211EA1: lazy_scan_heap (vacuumlazy.c:1681)
==1082059==    by 0x211EA1: heap_vacuum_rel (vacuumlazy.c:510)
==1082059==    by 0x360414: table_relation_vacuum (tableam.h:1457)
==1082059==    by 0x360414: vacuum_rel (vacuum.c:1880)
==1082059==    by 0x361785: vacuum (vacuum.c:449)
==1082059==    by 0x361F0E: ExecVacuum (vacuum.c:249)
==1082059==    by 0x4D979C: standard_ProcessUtility (utility.c:823)
==1082059==    by 0x4D9C7F: ProcessUtility (utility.c:522)
==1082059==    by 0x4D6791: PortalRunUtility (pquery.c:1157)
==1082059==    by 0x4D725F: PortalRunMulti (pquery.c:1303)
==1082059==    by 0x4D7CEF: PortalRun (pquery.c:779)
==1082059==    by 0x4D3BB7: exec_simple_query (postgres.c:1239)
==1082059==    by 0x4D4ABD: PostgresMain (postgres.c:4315)
==1082059==    by 0x45B0C9: BackendRun (postmaster.c:4510)
==1082059==    by 0x45B0C9: BackendStartup (postmaster.c:4202)
==1082059==    by 0x45B0C9: ServerLoop (postmaster.c:1727)
==1082059==    by 0x45C754: PostmasterMain (postmaster.c:1400)
==1082059==    by 0x3BDD68: main (main.c:210)
==1082059==  Address 0x6cc7378 is in a rw- anonymous segment
==1082059==
==1082059== VALGRINDERROR-END

(The line numbers might be slightly different to master here, but the
line from btvacuumpage() is definitely the one that accesses the
special area of the B-Tree page after we drop the pin.)

This patch is very rough -- it was just the first thing that I tried.
I don't know how Valgrind remembers the status of shared memory
regions across backends when they're marked with
VALGRIND_MAKE_MEM_NOACCESS(). Even still, this idea looks promising. I
should try to come up with a committable patch before too long.

The good news is that the error I showed is the only error that I see,
at least with this rough patch + "make installcheck". It's possible
that the patch isn't as effective as it could be, though. For one
thing, it definitely won't detect incorrect buffer accesses where a
pin is held but a buffer lock is not held. That seems possible, but a
bit harder.

--
Peter Geoghegan

Attachment: 0002-Valgrind-instrumentation.patch (4K)
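[Editor's note: the core idea of the patch, mark a shared buffer's page NOACCESS while it is unpinned and DEFINED again when a pin is taken, can be sketched with Valgrind's client-request macros. The fallback no-op definitions below let the sketch build without Valgrind headers, mirroring how the real macros compile away; `SketchBuffer` and the `sketch_*` functions are invented for the illustration, not bufmgr.c code.]

```c
#include <stdbool.h>

/* With Valgrind available these come from <valgrind/memcheck.h>;
 * otherwise fall back to no-ops, as the real client requests do
 * when not running under Valgrind. */
#ifndef VALGRIND_MAKE_MEM_NOACCESS
#define VALGRIND_MAKE_MEM_NOACCESS(addr, len) ((void) 0)
#define VALGRIND_MAKE_MEM_DEFINED(addr, len)  ((void) 0)
#endif

#define BLCKSZ 8192

/* Hypothetical minimal per-backend buffer state for the sketch. */
typedef struct SketchBuffer {
    char page[BLCKSZ];
    int  local_pin_count;
} SketchBuffer;

/* Taking the first local pin makes the page legal to read. */
static void
sketch_pin_buffer(SketchBuffer *buf)
{
    if (buf->local_pin_count++ == 0)
        VALGRIND_MAKE_MEM_DEFINED(buf->page, BLCKSZ);
}

/* Dropping the last local pin poisons the page so stray reads trap. */
static void
sketch_unpin_buffer(SketchBuffer *buf)
{
    if (--buf->local_pin_count == 0)
        VALGRIND_MAKE_MEM_NOACCESS(buf->page, BLCKSZ);
}
```

One coarseness to note (Andres raises it below): MAKE_MEM_DEFINED sets both addressability and definedness, since Memcheck offers no request that changes addressability alone.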

Re: xid wraparound danger due to INDEX_CLEANUP false

Andres Freund
Hi,

On 2020-04-22 20:08:42 -0700, Peter Geoghegan wrote:
> I can get Valgrind to complain about it when the regression tests are
> run with the attached patch applied.

Nice!  Have you checked how much of an incremental slowdown this causes?


> This patch is very rough -- it was just the first thing that I tried.
> I don't know how Valgrind remembers the status of shared memory
> regions across backends when they're marked with
> VALGRIND_MAKE_MEM_NOACCESS(). Even still, this idea looks promising. I
> should try to come up with a committable patch before too long.

IIRC valgrind doesn't at all share access markings across processes.


> The good news is that the error I showed is the only error that I see,
> at least with this rough patch + "make installcheck". It's possible
> that the patch isn't as effective as it could be, though. For one
> thing, it definitely won't detect incorrect buffer accesses where a
> pin is held but a buffer lock is not held. That seems possible, but a
> bit harder.

Given hint bits it seems fairly hard to make that a reliable check.


> +#ifdef USE_VALGRIND
> + if (!isLocalBuf)
> + {
> + Buffer b = BufferDescriptorGetBuffer(bufHdr);
> + VALGRIND_MAKE_MEM_DEFINED(BufferGetPage(b), BLCKSZ);
> + }
> +#endif

Hm. It's a bit annoying that we have to mark the contents defined. It'd
be kinda useful to be able to mark unused parts of pages as undefined
initially. But there's afaict no way to just set/unset addressability,
while not touching definedness. So this is probably the best we can do
without adding a lot of complexity.


>   if (isExtend)
>   {
>   /* new buffers are zero-filled */
> @@ -1039,6 +1047,12 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
>   buf = GetBufferDescriptor(buf_id);
>  
>   valid = PinBuffer(buf, strategy);
> +#ifdef USE_VALGRIND
> + {
> + Buffer b = BufferDescriptorGetBuffer(buf);
> + VALGRIND_MAKE_MEM_DEFINED(BufferGetPage(b), BLCKSZ);
> + }
> +#endif

Why aren't we doing this in PinBuffer() and PinBuffer_Locked(), but at
their callsites?


> @@ -1633,6 +1653,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
>     buf_state))
>   {
>   result = (buf_state & BM_VALID) != 0;
> +#ifdef USE_VALGRIND
> + {
> + Buffer b = BufferDescriptorGetBuffer(buf);
> + VALGRIND_MAKE_MEM_DEFINED(BufferGetPage(b), BLCKSZ);
> + }
> +#endif
>   break;
>   }
>   }

Oh, but we actually are doing it in PinBuffer() too?


>   /*
>   * Decrement the shared reference count.
> @@ -2007,6 +2039,12 @@ BufferSync(int flags)
>   */
>   if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
>   {
> +#ifdef USE_VALGRIND
> + {
> + Buffer b = BufferDescriptorGetBuffer(bufHdr);
> + VALGRIND_MAKE_MEM_DEFINED(BufferGetPage(b), BLCKSZ);
> + }
> +#endif
>   if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
>   {
>   TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);

Shouldn't the pin we finally acquire in SyncOneBuffer() be sufficient?


> @@ -2730,6 +2768,12 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
>   * Run PageGetLSN while holding header lock, since we don't have the
>   * buffer locked exclusively in all cases.
>   */
> +#ifdef USE_VALGRIND
> + {
> + Buffer b = BufferDescriptorGetBuffer(buf);
> + VALGRIND_MAKE_MEM_DEFINED(BufferGetPage(b), BLCKSZ);
> + }
> +#endif
>   recptr = BufferGetLSN(buf);

This shouldn't be needed, as the caller ought to hold a pin:
 *
 * The caller must hold a pin on the buffer and have share-locked the
 * buffer contents.  (Note: a share-lock does not prevent updates of
 * hint bits in the buffer, so the page could change while the write
 * is in progress, but we assume that that will not invalidate the data
 * written.)
 *

Greetings,

Andres Freund



Re: xid wraparound danger due to INDEX_CLEANUP false

Peter Geoghegan-4
On Wed, Apr 22, 2020 at 8:33 PM Andres Freund <[hidden email]> wrote:
> On 2020-04-22 20:08:42 -0700, Peter Geoghegan wrote:
> > I can get Valgrind to complain about it when the regression tests are
> > run with the attached patch applied.
>
> Nice!  Have you checked how much of an incremental slowdown this causes?

No, but I didn't notice much of a slowdown.

> > This patch is very rough -- it was just the first thing that I tried.
> > I don't know how Valgrind remembers the status of shared memory
> > regions across backends when they're marked with
> > VALGRIND_MAKE_MEM_NOACCESS(). Even still, this idea looks promising. I
> > should try to come up with a committable patch before too long.
>
> IIRC valgrind doesn't at all share access markings across processes.

I didn't think so.

> > The good news is that the error I showed is the only error that I see,
> > at least with this rough patch + "make installcheck". It's possible
> > that the patch isn't as effective as it could be, though. For one
> > thing, it definitely won't detect incorrect buffer accesses where a
> > pin is held but a buffer lock is not held. That seems possible, but a
> > bit harder.
>
> Given hint bits it seems fairly hard to make that a reliable check.

I don't follow. It doesn't have to be a perfect check. Detecting if
there is *any* buffer lock held at all would be a big improvement.

> Why aren't we doing this in PinBuffer() and PinBuffer_Locked(), but at
> their callsites?

I wrote this patch in a completely careless manner in less than 10
minutes, just to see how hard it was (I thought that it might have
been much harder). I wasn't expecting you to review it. I thought that
I was clear about that.

--
Peter Geoghegan



Re: xid wraparound danger due to INDEX_CLEANUP false

Peter Geoghegan-4
On Wed, Apr 22, 2020 at 9:05 PM Peter Geoghegan <[hidden email]> wrote:
> > Given hint bits it seems fairly hard to make that a reliable check.
>
> I don't follow. It doesn't have to be a perfect check. Detecting if
> there is *any* buffer lock held at all would be a big improvement.

It is true that the assumptions that heapam makes about what a buffer
pin will prevent (that a pin will prevent any kind of page
defragmentation) are not really compatible with marking pages as
undefined in lower level code like bufmgr.c. There are too many
exceptions for it to work like that.

The case I was really thinking about was the nbtree
_bt_drop_lock_and_maybe_pin() stuff, which is very confusing. The
confusing structure of the
BTScanPosIsPinned()/_bt_drop_lock_and_maybe_pin() code more or less
caused the skip scan patch to have numerous bugs involving code
holding a buffer pin, but not a buffer lock, at least when I last
looked at it a couple of months ago. The only thing having a pin on a
leaf page guarantees is that the TIDs from tuples on the page won't be
concurrently recycled by VACUUM. This is a very weak guarantee -- in
particular, it's much weaker than the guarantees around buffer pins
that apply in heapam. It's certainly not going to prevent any kind of
defragmentation of the page -- the page can even split, for example.
Any code that relies on holding a pin to prevent anything more than
that is broken, but possibly only in a subtle way. It's not like page
splits happen all that frequently.

Given that I was concerned about a fairly specific situation, a
specific solution seems like it might be the best way to structure the
extra checks. The attached rough patch shows the kind of approach that
might be practical in specific index access methods. This works on top
of the patch I posted yesterday. The idea is to mark the buffer's page
as a noaccess region within _bt_drop_lock_and_maybe_pin(), and then
mark it defined again at either of the two points that we might have
to relock (but not repin) the buffer to re-read the page. This doesn't
cause any regression test failures, so maybe there are no bugs like
this currently, but it still seems like it might be worth pursuing on
top of the buffer pin stuff.

--
Peter Geoghegan

Attachment: 0003-Mark-lockless-nbtree-leaf-page-memory-undefined.patch (4K)
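[Editor's note: the invariant being enforced, a pin alone only keeps a leaf page's TIDs from being recycled, while reading page contents requires the lock, can be modeled with a tiny shadow state machine. `ScanPosState` and the function names are invented for the sketch; the real code paths are _bt_drop_lock_and_maybe_pin() and the relock points in nbtsearch.c.]

```c
#include <stdbool.h>
#include <assert.h>

/* Hypothetical shadow of a scan position: a pin keeps the page's TIDs
 * from being recycled by VACUUM; only a lock keeps the contents stable
 * (without it the page can be defragmented or even split). */
typedef struct ScanPosState {
    bool pinned;
    bool locked;
} ScanPosState;

static void
drop_lock_and_maybe_pin(ScanPosState *sp, bool keep_pin)
{
    sp->locked = false;
    if (!keep_pin)
        sp->pinned = false;
    /* the real patch would VALGRIND_MAKE_MEM_NOACCESS the page here */
}

static void
relock_page(ScanPosState *sp)
{
    assert(sp->pinned);       /* may relock only while still pinned */
    sp->locked = true;
    /* ...and VALGRIND_MAKE_MEM_DEFINED the page again */
}

/* Guard for any read of page contents: holding only a pin is not enough. */
static bool
can_read_page_contents(const ScanPosState *sp)
{
    return sp->locked;
}
```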

Re: xid wraparound danger due to INDEX_CLEANUP false

Peter Geoghegan-4
In reply to this post by Andres Freund
On Thu, Apr 16, 2020 at 11:27 AM Andres Freund <[hidden email]> wrote:

> Sure, there is some pre-existing wraparound danger for individual
> pages. But it's a pretty narrow corner case before INDEX_CLEANUP
> off.
>
> That comment says something about "shared-memory free space map", making
> it sound like any crash would lose the page. But it's a normal FSM
> these days. Vacuum will insert the deleted page into the free space
> map. So either the FSM would need to be corrupted to not find the
> inserted page anymore, or the index would need to grow slow enough to
> not use a page before the wraparound.  And then such wrapped around xids
> would exist on individual pages. Not on all deleted pages, like with
> INDEX_CLEANUP false.

Is that really that narrow, even without "INDEX_CLEANUP false"? It's
not as if the index needs to grow very slowly to have only very few
page splits hour to hour (it depends on whether the inserts are random
or not, and so on). Especially if you had a bulk DELETE affecting many
rows, which is hardly that uncommon.

Fundamentally, btvacuumpage() doesn't freeze 32-bit XIDs (from
btpo.xact) when it recycles deleted pages. It simply puts them in the
FSM without changing anything about the page itself. This means
surprisingly little in the context of nbtree: the
_bt_page_recyclable() XID check that takes place in btvacuumpage()
also takes place in _bt_getbuf(), at the point where the page actually
gets recycled by the client. That's not great.

It wouldn't be so unreasonable if btvacuumpage() actually did freeze
the btpo.xact value at the point where it puts the page in the FSM. It
doesn't need to be crash safe; it can work as a hint. Maybe "freezing"
is the wrong word (too much baggage). More like we'd have VACUUM
represent that "this deleted B-Tree page is definitely not considered
to still be a part of the tree by any possible other backend" using a
page flag hint -- btvacuumpage() would "mark the deleted page as
recyclable" explicitly. Note that we still need to keep the original
btpo.xact XID around for _bt_log_reuse_page() (also, do we need to
worry about _bt_log_reuse_page() with a wrapped-around XID?).

--
Peter Geoghegan
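[Editor's note: the recyclability test and the "freeze as a hint" idea can be sketched together. `PageOpaqueSketch`, the `SKETCH_*` flag names, and `mark_recyclable_hint` are invented for the illustration; `page_recyclable` paraphrases the logic of _bt_page_recyclable(), which compares the deleted page's btpo.xact against the oldest XID that could still see the page.]

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint32_t TransactionId;

typedef struct PageOpaqueSketch {
    TransactionId btpo_xact;   /* XID stamped when the page was deleted */
    uint16_t      flags;
} PageOpaqueSketch;

#define SKETCH_DELETED    0x0001
#define SKETCH_RECYCLABLE 0x0002   /* hypothetical hint flag */

static bool
xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}

/* Paraphrase of _bt_page_recyclable(): a deleted page may be reused
 * once no possibly-running transaction could still land on it. */
static bool
page_recyclable(const PageOpaqueSketch *op, TransactionId oldest_xmin)
{
    return (op->flags & SKETCH_DELETED) != 0 &&
           xid_precedes(op->btpo_xact, oldest_xmin);
}

/* The hint idea: once vacuum has seen the page become recyclable,
 * record that verdict on the page so it no longer depends on an
 * unfrozen 32-bit XID that can wrap around later. */
static void
mark_recyclable_hint(PageOpaqueSketch *op, TransactionId oldest_xmin)
{
    if (page_recyclable(op, oldest_xmin))
        op->flags |= SKETCH_RECYCLABLE;
}
```

Without the hint, the same XID comparison must be redone in _bt_getbuf() at actual reuse time, by which point the stored XID may already appear to be in the future.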



Re: xid wraparound danger due to INDEX_CLEANUP false

Andres Freund
Hi,

On 2020-04-29 11:28:00 -0700, Peter Geoghegan wrote:

> On Thu, Apr 16, 2020 at 11:27 AM Andres Freund <[hidden email]> wrote:
> > Sure, there is some pre-existing wraparound danger for individual
> > pages. But it's a pretty narrow corner case before INDEX_CLEANUP
> > off.
> >
> > That comment says something about "shared-memory free space map", making
> > it sound like any crash would lose the page. But it's a normal FSM
> > these days. Vacuum will insert the deleted page into the free space
> > map. So either the FSM would need to be corrupted to not find the
> > inserted page anymore, or the index would need to grow slow enough to
> > not use a page before the wraparound.  And then such wrapped around xids
> > would exist on individual pages. Not on all deleted pages, like with
> > INDEX_CLEANUP false.
>
> Is that really that narrow, even without "INDEX_CLEANUP false"? It's
> not as if the index needs to grow very slowly to have only very few
> page splits hour to hour (it depends on whether the inserts are random
> or not, and so on). Especially if you had a bulk DELETE affecting many
> rows, which is hardly that uncommon.

Well, you'd need to have a workload that has bulk deletes, high xid
usage *and* doesn't insert new data to use those empty pages.


> Fundamentally, btvacuumpage() doesn't freeze 32-bit XIDs (from
> btpo.xact) when it recycles deleted pages. It simply puts them in the
> FSM without changing anything about the page itself. This means
> surprisingly little in the context of nbtree: the
> _bt_page_recyclable() XID check that takes place in btvacuumpage()
> also takes place in _bt_getbuf(), at the point where the page actually
> gets recycled by the client. That's not great.

I think it's quite foolish for btvacuumpage() to not freeze xids. If we
only do so when necessary (i.e. older than a potential new relfrozenxid,
and only when the vacuum didn't yet skip pages), the costs are pretty
minuscule.


> It wouldn't be so unreasonable if btvacuumpage() actually did freeze
> the btpo.xact value at the point where it puts the page in the FSM. It
> doesn't need to be crash safe; it can work as a hint.

I'd much rather make sure the xid is guaranteed to be removed. As
outlined above, the cost would be small, and I think the likelihood of
the consequences of wrapped around xids getting worse over time is
substantial.


> Note that we still need to keep the original btpo.xact XID around for
> _bt_log_reuse_page() (also, do we need to worry about _bt_log_reuse_page()
> with a wrapped-around XID?).

I'd just WAL log the reuse when freezing the xid. Then there's no worry
about wraparounds. And I don't think it'd cause additional conflicts;
the vacuum itself (or a prior vacuum) would also have to cause them, I
think?

Greetings,

Andres Freund



Re: xid wraparound danger due to INDEX_CLEANUP false

Peter Geoghegan-4
On Wed, Apr 29, 2020 at 12:54 PM Andres Freund <[hidden email]> wrote:

> > Fundamentally, btvacuumpage() doesn't freeze 32-bit XIDs (from
> > btpo.xact) when it recycles deleted pages. It simply puts them in the
> > FSM without changing anything about the page itself. This means
> > surprisingly little in the context of nbtree: the
> > _bt_page_recyclable() XID check that takes place in btvacuumpage()
> > also takes place in _bt_getbuf(), at the point where the page actually
> > gets recycled by the client. That's not great.
>
> I think it's quite foolish for btvacuumpage() to not freeze xids. If we
> only do so when necessary (i.e. older than a potential new relfrozenxid,
> and only when the vacuum didn't yet skip pages), the costs are pretty
> minuscule.

I wonder if we should just bite the bullet and mark pages that are
recycled by VACUUM as explicitly recycled, with WAL-logging and
everything (this is like freezing, but stronger). That way, the
_bt_page_recyclable() call within _bt_getbuf() would only be required
to check that state (while btvacuumpage() would use something like a
_bt_page_eligible_for_recycling(), which would do the same thing as
the current _bt_page_recyclable()).

I'm not sure how low the costs would be, but at least we'd only have
to do it once per already-deleted page (i.e. only the first time a
VACUUM operation found _bt_page_eligible_for_recycling() returned true
for the page and marked it recycled in a crash safe manner). That
design would be quite a lot simpler, because it expresses the problem
in terms that make sense to the nbtree code. _bt_getbuf() should not
have to make a distinction between "possibly recycled" versus
"definitely recycled".

It makes sense that the FSM is not crash safe, I suppose, but we're
talking about relatively large amounts of free space here. Can't we
just do it properly/reliably?

--
Peter Geoghegan
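[Editor's note: the explicit-recycled-state design can be sketched as follows. The `SKETCH_RECYCLED` flag and all function names are hypothetical; the point is that reuse becomes a flag test rather than an XID comparison, and the flag doubles as a cheap sanity cross-check for code that lands on a page via a downlink or sibling link.]

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint16_t PageFlags;

#define SKETCH_DELETED  0x0001
#define SKETCH_RECYCLED 0x0004   /* hypothetical explicit, WAL-logged state */

/* VACUUM, having decided the deleted page is safe to reuse, durably
 * marks it.  (The real design would WAL-log this transition; here it
 * is reduced to a flag set.) */
static void
mark_page_recycled(PageFlags *flags)
{
    *flags |= SKETCH_RECYCLED;
}

/* _bt_getbuf()'s job shrinks to a flag test: no XID comparison at the
 * point of reuse, hence no wraparound hazard there. */
static bool
page_ok_to_reuse(PageFlags flags)
{
    return (flags & SKETCH_RECYCLED) != 0;
}

/* Cross-check: any scan that follows a downlink or sibling link onto a
 * recycled page has hit a can't-happen condition (the real code would
 * throw an ERROR). */
static bool
landed_page_is_sane(PageFlags flags)
{
    return (flags & SKETCH_RECYCLED) == 0;
}
```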



Re: xid wraparound danger due to INDEX_CLEANUP false

Peter Geoghegan-4
On Wed, Apr 29, 2020 at 1:40 PM Peter Geoghegan <[hidden email]> wrote:
> I'm not sure how low the costs would be, but at least we'd only have
> to do it once per already-deleted page (i.e. only the first time a
> VACUUM operation found _bt_page_eligible_for_recycling() returned true
> for the page and marked it recycled in a crash safe manner). That
> design would be quite a lot simpler, because it expresses the problem
> in terms that make sense to the nbtree code. _bt_getbuf() should not
> have to make a distinction between "possibly recycled" versus
> "definitely recycled".

As a bonus, we now have an easy correctness cross-check: if some
random piece of nbtree code lands on a page (having followed a
downlink or sibling link) that is marked recycled, then clearly
something is very wrong -- throw a "can't happen" error.

This would be especially useful in places like _bt_readpage(), I
suppose. Think of extreme cases like cursors, which can have a scan
that remembers a block number of a leaf page, that only actually gets
accessed hours or days later (for whatever reason). If that code was
buggy in some way, we might have a hope of figuring it out at some
point with this cross-check.

--
Peter Geoghegan

