New IndexAM API controlling index vacuum strategies

Masahiko Sawada
Hi all,

I've started this separate thread from [1] for discussing the general
API design of index vacuum.

Summary:

* Call ambulkdelete and amvacuumcleanup even when INDEX_CLEANUP is
false, and leave it to the index AM whether or not to skip them.
* Add a new index AM API, amvacuumstrategy(), that asks the index AM
for its strategy before calling ambulkdelete.
* Whether or not to remove garbage tuples from the heap depends on
multiple factors, including the INDEX_CLEANUP option and the answers
of amvacuumstrategy() from each index AM.

The first point is to fix the inappropriate behavior discussed in thread [1].

The second and third points are to introduce a general framework for
future extensibility. User-visible behavior is not changed by this
patch.

The new index AM API, amvacuumstrategy(), is called before
bulkdelete() for each index and asks the index for its bulk-deletion
strategy. Through this API, lazy vacuum asks, "Hey index X, I collected
garbage heap tuples during heap scanning, how urgent is vacuuming for
you?", and the index answers either "it's urgent" when it wants to do
bulk-deletion or "it's not urgent, I can skip it". The point of this
proposal is to isolate heap vacuum and index vacuum for each index so
that we can employ different strategies for each index. Lazy vacuum
can decide whether or not to do the heap clean based on the answers
from the indexes.

By default, if all indexes answer 'yes' (meaning they will do
bulkdelete()), lazy vacuum can do the heap clean. On the other hand,
if even one index answers 'no' (meaning it will not do bulkdelete()),
lazy vacuum doesn't do the heap clean. Lazy vacuum would also be able
to require indexes to do bulkdelete() for some reason, such as the
user specifying the INDEX_CLEANUP option. It's something like saying
"Hey index X, you answered not to do bulkdelete(), but since heap
clean is necessary for me please don't skip bulkdelete()".
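
Expressed as code, that combining rule could be as simple as the
following sketch (the function and its arguments are made up for
illustration, building on the enum sketched earlier):

/*
 * Sketch only: decide whether lazy vacuum may do the heap clean.
 * One "skip" answer vetoes heap cleaning, unless the caller forces
 * it (e.g. the user specified INDEX_CLEANUP).
 */
static bool
decide_heap_clean(IndexVacuumStrategy *answers, int nindexes, bool force)
{
    if (force)
        return true;            /* user explicitly requested heap clean */

    for (int i = 0; i < nindexes; i++)
    {
        if (answers[i] == INDEX_VACUUM_STRATEGY_NONE)
            return false;       /* one "no" means no heap clean */
    }
    return true;
}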

Currently, if the INDEX_CLEANUP option is not set (i.e.
VACOPT_TERNARY_DEFAULT in the code), it's treated as true and we do
the heap clean. But with this patch we use the default as a neutral
state ('smart' mode; a sketch of resolving it follows the list below).
This neutral state could resolve to "on" or "off" depending on several
factors, including the answers of amvacuumstrategy(), the table
status, and the user's request. In this context, specifying
INDEX_CLEANUP would mean forcing the neutral state "on" or "off" at
the user's request. The table status that could influence the decision
includes, for instance:

* Removing LP_DEAD items accumulated due to skipping bulkdelete() for a long time.
* Making pages all-visible for index-only scans.
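
A sketch of how the neutral state might be resolved (the
VACOPT_TERNARY_* values exist in vacuum.h today; the helper function
and its inputs are assumptions for illustration):

/* Sketch: resolve the 'smart' INDEX_CLEANUP default */
static bool
resolve_index_cleanup(VacOptTernaryValue index_cleanup,
                      bool indexes_want_bulkdelete,
                      bool table_status_needs_clean)
{
    /* An explicit user request always wins */
    if (index_cleanup != VACOPT_TERNARY_DEFAULT)
        return (index_cleanup == VACOPT_TERNARY_ENABLED);

    /* 'smart' mode: consult the index AMs and the table status */
    return indexes_want_bulkdelete || table_status_needs_clean;
}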

There are also potential enhancements using this API:

* If the bottom-up index deletion feature[2] is introduced, individual
indexes could be in different situations in terms of dead tuple
accumulation; some indexes on the table can delete their garbage index
tuples without bulkdelete(). Doing bulkdelete() for such indexes would
then be inefficient. This problem is solved by this proposal because
we can do bulkdelete() for just a subset of the indexes on the table.

* If the retail index deletion feature[3] is introduced, we can make
the return value of amvacuumstrategy() a ternary value:
"do_bulkdelete", "do_indexscandelete", and "no" (sketched after this
list).

* We could probably introduce a threshold on the number of dead tuples
to control whether or not to do index tuple bulk-deletion (like a
bulkdelete() version of vacuum_cleanup_index_scale_factor; a sketch
follows this list). In the case where the amount of dead tuples is
slightly larger than maintenance_work_mem, the second call to
bulkdelete() will be made with a small number of dead tuples, which is
inefficient. This problem is also solved by this proposal by allowing
a subset of indexes to skip bulkdelete() if the number of dead tuples
doesn't exceed the threshold.
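
Two of these enhancements can be sketched against the illustrative
names used earlier. Neither snippet is in the PoC patch: the ternary
enum is the retail-deletion extension, and the threshold check invents
its inputs (there is no such field or GUC today, though
num_heap_tuples does exist in IndexVacuumInfo):

/* Ternary answer, once retail index deletion exists */
typedef enum IndexVacuumStrategy
{
    INDEX_VACUUM_STRATEGY_NONE,             /* "no" */
    INDEX_VACUUM_STRATEGY_INDEXSCANDELETE,  /* "do_indexscandelete" */
    INDEX_VACUUM_STRATEGY_BULKDELETE        /* "do_bulkdelete" */
} IndexVacuumStrategy;

/*
 * A threshold-based answer, like a bulkdelete() version of
 * vacuum_cleanup_index_scale_factor (both parameters are assumed).
 */
static IndexVacuumStrategy
vacuumstrategy_by_threshold(IndexVacuumInfo *info, double num_dead_tuples,
                            double bulkdelete_scale_factor)
{
    if (num_dead_tuples < bulkdelete_scale_factor * info->num_heap_tuples)
        return INDEX_VACUUM_STRATEGY_NONE;      /* too few to bother */

    return INDEX_VACUUM_STRATEGY_BULKDELETE;
}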

I've attached the PoC patch for the above idea. By default, lazy
vacuum chooses the bulkdelete strategy based on the answers of
amvacuumstrategy(), so it can be either true or false (although it's
always true in the current patch). But for amvacuumcleanup() there is
no neutral state; lazy vacuum treats the default as true.

Comments and feedback are very welcome.

Regards,

[1] https://www.postgresql.org/message-id/20200415233848.saqp72pcjv2y6ryi%40alap3.anarazel.de
[2] https://www.postgresql.org/message-id/CAH2-Wzm%2BmaE3apHB8NOtmM%3Dp-DO65j2V5GzAWCOEEuy3JZgb2g%40mail.gmail.com
[3] https://www.postgresql.org/message-id/425db134-8bba-005c-b59d-56e50de3b41e%40postgrespro.ru

--
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/

Attachment: poc_vacuumstrategy.patch (50K)

Re: New IndexAM API controlling index vacuum strategies

Peter Geoghegan
On Tue, Dec 22, 2020 at 2:54 PM Masahiko Sawada <[hidden email]> wrote:
> I've started this separate thread from [1] for discussing the general
> API design of index vacuum.

This is a very difficult and very important problem. Clearly defining
the problem is probably the hardest part. This prototype patch seems
like a good start, though.

Private discussion between Masahiko and myself led to a shared
understanding of what the best *general* direction is for VACUUM now.
It is necessary to deal with several problems all at once here, and to
at least think about several more problems that will need to be solved
later. If anybody reading the thread initially finds it hard to see
the connection between the specific items that Masahiko has
introduced, they should note that that's *expected*.

> Summary:
>
> * Call ambulkdelete and amvacuumcleanup even when INDEX_CLEANUP is
> false, and leave it to the index AM whether or not to skip them.

Makes sense. I like the way you unify INDEX_CLEANUP and the
vacuum_cleanup_index_scale_factor stuff in a way that is now quite
explicit and obvious in the code.

> The second and third points are to introduce a general framework for
> future extensibility. User-visible behavior is not changed by this
> patch.

In some ways the ideas in your patch might be considered radical, or
at least novel: they introduce the idea that bloat can be a
qualitative thing. But at the same time the design is quite
conservative: these are fairly isolated changes, at least code-wise. I
am not 100% sure that this approach will be successful in
vacuumlazy.c, in the end (I'm ~95% sure). But I am 100% sure that our
collective understanding of the problems in this area will be
significantly improved by this effort. A fundamental rethink does not
necessarily require a fundamental redesign, and yet it might be just
as effective.

This is certainly what I see when testing my bottom-up index deletion
patch, which adds an incremental index deletion mechanism that merely
intervenes in a precise, isolated way. Despite my patch's simplicity,
it manages to practically eliminate an entire important *class* of
index bloat (at least once you make certain mild assumptions about the
duration of snapshots). Sometimes it is possible to solve a hard
problem by thinking about it only *slightly* differently.

This is a tantalizing possibility for VACUUM, too. I'm willing to risk
sounding grandiose if that's what it takes to get more hackers
interested in these questions. With that in mind, here is a summary of
the high level hypothesis behind this VACUUM patch:

VACUUM can and should be reimagined as a top-down mechanism that
complements various bottom-up mechanisms (including the stuff from my
deletion patch, heap pruning, and possibly an enhanced version of heap
pruning based on similar principles). This will be possible without
changing any of the fundamental invariants of the current vacuumlazy.c
design. VACUUM's problems are largely pathological behaviors of one
kind or another, that can be fixed with specific well-targeted
interventions. Workload characteristics can naturally determine how
much of the cleanup is done by VACUUM itself -- large variations are
possible within a single database, and even across indexes on the same
table.

> The new index AM API, amvacuumstrategy(), is called before
> bulkdelete() for each index and asks the index for its bulk-deletion
> strategy. Through this API, lazy vacuum asks, "Hey index X, I collected
> garbage heap tuples during heap scanning, how urgent is vacuuming for
> you?", and the index answers either "it's urgent" when it wants to do
> bulk-deletion or "it's not urgent, I can skip it". The point of this
> proposal is to isolate heap vacuum and index vacuum for each index so
> that we can employ different strategies for each index. Lazy vacuum
> can decide whether or not to do the heap clean based on the answers
> from the indexes.

Right -- workload characteristics (plus appropriate optimizations at
the local level) make it possible that amvacuumstrategy() will give
*very* different answers from different indexes *on the same table*.
The idea that all indexes on the table are more or less equally
bloated at any given point in time is mostly wrong. Actually,
*sometimes* it really is correct! But other times it is *dramatically
wrong* -- it all depends on workload characteristics. What is likely
to be true *on average* across all tables/indexes is *irrelevant* (the
mean/average is simply not a useful concept, in fact).

The basic lazy vacuum design needs to recognize this important
difference, and other similar issues. That's the point of
amvacuumstrategy().

> Currently, if the INDEX_CLEANUP option is not set (i.e.
> VACOPT_TERNARY_DEFAULT in the code), it's treated as true and we do
> the heap clean. But with this patch we use the default as a neutral
> state ('smart' mode). This neutral state could resolve to "on" or
> "off" depending on several factors, including the answers of
> amvacuumstrategy(), the table status, and the user's request. In this
> context, specifying INDEX_CLEANUP would mean forcing the neutral
> state "on" or "off" at the user's request. The table status that
> could influence the decision includes, for instance:
>
> * Removing LP_DEAD items accumulated due to skipping bulkdelete() for a long time.
> * Making pages all-visible for index-only scans.

So you have several different kinds of back pressure - 'smart' mode
really is smart.

> There are also potential enhancements using this API:

> * If the retail index deletion feature[3] is introduced, we can make
> the return value of amvacuumstrategy() a ternary value:
> "do_bulkdelete", "do_indexscandelete", and "no".

Makes sense.

> * We could probably introduce a threshold on the number of dead tuples
> to control whether or not to do index tuple bulk-deletion (like a
> bulkdelete() version of vacuum_cleanup_index_scale_factor). In the
> case where the amount of dead tuples is slightly larger than
> maintenance_work_mem, the second call to bulkdelete() will be made
> with a small number of dead tuples, which is inefficient. This
> problem is also solved by this proposal by allowing a subset of
> indexes to skip bulkdelete() if the number of dead tuples doesn't
> exceed the threshold.

Good idea. I bet other people can come up with other ideas a little
like this just by thinking about it. The "untangling" performed by
your patch creates many possibilities.

> I've attached the PoC patch for the above idea. By default, lazy
> vacuum chooses the bulkdelete strategy based on the answers of
> amvacuumstrategy(), so it can be either true or false (although it's
> always true in the current patch). But for amvacuumcleanup() there is
> no neutral state; lazy vacuum treats the default as true.

As you said, the next question must be: How do we teach lazy vacuum to
not do what gets requested by amvacuumcleanup() when it cannot respect
the wishes of an individual index, for example when the
accumulation of LP_DEAD items in the heap becomes a big problem in
itself? That really could be the thing that forces full heap
vacuuming, even with several indexes.

I will need to experiment in order to improve my understanding of how
to make this cooperate with bottom-up index deletion. But that's
mostly just a question for my patch (and a relatively easy one).

--
Peter Geoghegan



Re: New IndexAM API controlling index vacuum strategies

Masahiko Sawada
On Thu, Dec 24, 2020 at 12:59 PM Peter Geoghegan <[hidden email]> wrote:

>
> On Tue, Dec 22, 2020 at 2:54 PM Masahiko Sawada <[hidden email]> wrote:
> > I've started this separate thread from [1] for discussing the general
> > API design of index vacuum.
>
> This is a very difficult and very important problem. Clearly defining
> the problem is probably the hardest part. This prototype patch seems
> like a good start, though.
>
> Private discussion between Masahiko and myself led to a shared
> understanding of what the best *general* direction is for VACUUM now.
> It is necessary to deal with several problems all at once here, and to
> at least think about several more problems that will need to be solved
> later. If anybody reading the thread initially finds it hard to see
> the connection between the specific items that Masahiko has
> introduced, they should note that that's *expected*.
>
> > Summary:
> >
> > * Call ambulkdelete and amvacuumcleanup even when INDEX_CLEANUP is
> > false, and leave it to the index AM whether or not to skip them.
>
> Makes sense. I like the way you unify INDEX_CLEANUP and the
> vacuum_cleanup_index_scale_factor stuff in a way that is now quite
> explicit and obvious in the code.
>
> > The second and third points are to introduce a general framework for
> > future extensibility. User-visible behavior is not changed by this
> > patch.
>
> In some ways the ideas in your patch might be considered radical, or
> at least novel: they introduce the idea that bloat can be a
> qualitative thing. But at the same time the design is quite
> conservative: these are fairly isolated changes, at least code-wise. I
> am not 100% sure that this approach will be successful in
> vacuumlazy.c, in the end (I'm ~95% sure). But I am 100% sure that our
> collective understanding of the problems in this area will be
> significantly improved by this effort. A fundamental rethink does not
> necessarily require a fundamental redesign, and yet it might be just
> as effective.
>
> This is certainly what I see when testing my bottom-up index deletion
> patch, which adds an incremental index deletion mechanism that merely
> intervenes in a precise, isolated way. Despite my patch's simplicity,
> it manages to practically eliminate an entire important *class* of
> index bloat (at least once you make certain mild assumptions about the
> duration of snapshots). Sometimes it is possible to solve a hard
> problem by thinking about it only *slightly* differently.
>
> This is a tantalizing possibility for VACUUM, too. I'm willing to risk
> sounding grandiose if that's what it takes to get more hackers
> interested in these questions. With that in mind, here is a summary of
> the high level hypothesis behind this VACUUM patch:
>
> VACUUM can and should be reimagined as a top-down mechanism that
> complements various bottom-up mechanisms (including the stuff from my
> deletion patch, heap pruning, and possibly an enhanced version of heap
> pruning based on similar principles). This will be possible without
> changing any of the fundamental invariants of the current vacuumlazy.c
> design. VACUUM's problems are largely pathological behaviors of one
> kind or another, that can be fixed with specific well-targeted
> interventions. Workload characteristics can naturally determine how
> much of the cleanup is done by VACUUM itself -- large variations are
> possible within a single database, and even across indexes on the same
> table.

Agreed.

Ideally, the bottom-up mechanisms work well and reclaim almost all
garbage. VACUUM should be a feature that complements them when the
bottom-up mechanisms cannot work well for some reason, and it is also
used to make sure that all collected garbage has been vacuumed. For
heaps, we already have such a mechanism: opportunistic HOT pruning
plus lazy vacuum. For indexes, especially btree indexes, bottom-up
index deletion and ambulkdelete() would have a similar relationship.

>
> > The new index AM API, amvacuumstrategy(), is called before
> > bulkdelete() for each index and asks the index for its bulk-deletion
> > strategy. Through this API, lazy vacuum asks, "Hey index X, I collected
> > garbage heap tuples during heap scanning, how urgent is vacuuming for
> > you?", and the index answers either "it's urgent" when it wants to do
> > bulk-deletion or "it's not urgent, I can skip it". The point of this
> > proposal is to isolate heap vacuum and index vacuum for each index so
> > that we can employ different strategies for each index. Lazy vacuum
> > can decide whether or not to do the heap clean based on the answers
> > from the indexes.
>
> Right -- workload characteristics (plus appropriate optimizations at
> the local level) make it possible that amvacuumstrategy() will give
> *very* different answers from different indexes *on the same table*.
> The idea that all indexes on the table are more or less equally
> bloated at any given point in time is mostly wrong. Actually,
> *sometimes* it really is correct! But other times it is *dramatically
> wrong* -- it all depends on workload characteristics. What is likely
> to be true *on average* across all tables/indexes is *irrelevant* (the
> mean/average is simply not a useful concept, in fact).
>
> The basic lazy vacuum design needs to recognize this important
> difference, and other similar issues. That's the point of
> amvacuumstrategy().

Agreed.

In terms of bloat, the characteristics of the index AM also bring such
differences (e.g., btree vs. brin). With the bottom-up index deletion
feature, even btree indexes on the same table will differ from each
other.

>
> > Currently, if the INDEX_CLEANUP option is not set (i.e.
> > VACOPT_TERNARY_DEFAULT in the code), it's treated as true and we do
> > the heap clean. But with this patch we use the default as a neutral
> > state ('smart' mode). This neutral state could resolve to "on" or
> > "off" depending on several factors, including the answers of
> > amvacuumstrategy(), the table status, and the user's request. In this
> > context, specifying INDEX_CLEANUP would mean forcing the neutral
> > state "on" or "off" at the user's request. The table status that
> > could influence the decision includes, for instance:
> >
> > * Removing LP_DEAD items accumulated due to skipping bulkdelete() for a long time.
> > * Making pages all-visible for index-only scans.
>
> So you have several different kinds of back pressure - 'smart' mode
> really is smart.
>
> > There are also potential enhancements using this API:
>
> > * If the retail index deletion feature[3] is introduced, we can make
> > the return value of amvacuumstrategy() a ternary value:
> > "do_bulkdelete", "do_indexscandelete", and "no".
>
> Makes sense.
>
> > * We could probably introduce a threshold on the number of dead tuples
> > to control whether or not to do index tuple bulk-deletion (like a
> > bulkdelete() version of vacuum_cleanup_index_scale_factor). In the
> > case where the amount of dead tuples is slightly larger than
> > maintenance_work_mem, the second call to bulkdelete() will be made
> > with a small number of dead tuples, which is inefficient. This
> > problem is also solved by this proposal by allowing a subset of
> > indexes to skip bulkdelete() if the number of dead tuples doesn't
> > exceed the threshold.
>
> Good idea. I bet other people can come up with other ideas a little
> like this just by thinking about it. The "untangling" performed by
> your patch creates many possibilities.
>
> > I've attached the PoC patch for the above idea. By default, lazy
> > vacuum chooses the bulkdelete strategy based on the answers of
> > amvacuumstrategy(), so it can be either true or false (although it's
> > always true in the current patch). But for amvacuumcleanup() there is
> > no neutral state; lazy vacuum treats the default as true.
>
> As you said, the next question must be: How do we teach lazy vacuum to
> not do what gets requested by amvacuumcleanup() when it cannot respect
> the wishes of an individual index, for example when the
> accumulation of LP_DEAD items in the heap becomes a big problem in
> itself? That really could be the thing that forces full heap
> vacuuming, even with several indexes.

You mean requested by amvacuumstrategy(), not by amvacuumcleanup()? I
think amvacuumstrategy() affects only ambulkdelete(). But when all
ambulkdelete() calls are skipped at the request of the index AMs, we
might want to skip amvacuumcleanup() as well.

>
> I will need to experiment in order to improve my understanding of how
> to make this cooperate with bottom-up index deletion. But that's
> mostly just a question for my patch (and a relatively easy one).

Yeah, I think we might need something like statistics about garbage
per index so that each index can make a different decision based
on its status. For example, a btree index might want to skip
ambulkdelete() if it has only a few dead index tuples in its leaf
pages. The statistics could live in the stats collector or in btree's
meta page.

Regards,

--
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/



Re: New IndexAM API controlling index vacuum strategies

Peter Geoghegan
On Sun, Dec 27, 2020 at 10:55 PM Masahiko Sawada <[hidden email]> wrote:

> > As you said, the next question must be: How do we teach lazy vacuum to
> > not do what gets requested by amvacuumcleanup() when it cannot respect
> > the wishes of an individual index, for example when the
> > accumulation of LP_DEAD items in the heap becomes a big problem in
> > itself? That really could be the thing that forces full heap
> > vacuuming, even with several indexes.
>
> You mean requested by amvacuumstrategy(), not by amvacuumcleanup()? I
> think amvacuumstrategy() affects only ambulkdelete(). But when all
> ambulkdelete() calls are skipped at the request of the index AMs, we
> might want to skip amvacuumcleanup() as well.

No, I was asking about how we should decide to do a real VACUUM (a
real ambulkdelete() call) even when no index asks for it because
bottom-up deletion works very well in every index. Clearly we will
need to eventually remove remaining LP_DEAD items from the heap at
some point if nothing else happens -- eventually LP_DEAD items in the
heap alone will force a traditional heap vacuum (which will still have
to go through indexes that have not grown, just to be safe/avoid
recycling a TID that's still in the index).

Postgres heap fillfactor is 100 by default, though I believe it's 90
in another well known DB system. If you set Postgres heap fill factor
to 90 you can fit a little over 200 LP_DEAD items in the "extra space"
left behind in each heap page after initial bulk loading/INSERTs take
place that respect our lower fill factor setting. This is about 4x the
number of initial heap tuples in the pgbench_accounts table -- it's
quite a lot!
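
That figure is easy to check with back-of-the-envelope arithmetic
(BLCKSZ = 8192 and the 4-byte line pointer are real Postgres
constants; the rest is plain division):

#include <stdio.h>

int
main(void)
{
    const int blcksz = 8192;        /* BLCKSZ: default heap page size */
    const int itemid_size = 4;      /* sizeof(ItemIdData): one line pointer */
    int     reserved = blcksz * 10 / 100;   /* space left by fillfactor 90 */

    /* An LP_DEAD item keeps only its line pointer: 819 / 4 = ~204 */
    printf("LP_DEAD capacity per page: %d\n", reserved / itemid_size);
    return 0;
}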

If we pessimistically assume that all updates are non-HOT updates,
we'll still usually have enough space for each logical row to get
updated several times before the heap page "overflows". Even when
there is significant skew in the UPDATEs, the skew is not noticeable
at the level of individual heap pages. We have a surprisingly large
general capacity to temporarily "absorb" extra garbage LP_DEAD items
in heap pages this way. Nobody really cared about this extra capacity
very much before now, because it did not help with the big problem of
index bloat that you naturally see with this workload. But that big
problem may go away soon, and so this extra capacity may become
important at the same time.

I think that it could make sense for lazy_scan_heap() to maintain
statistics about the number of LP_DEAD items remaining in each heap
page (just local stack variables). From there, it can pass the
statistics to the choose_vacuum_strategy() function from your patch.
Perhaps choose_vacuum_strategy() will notice that the heap page with
the most LP_DEAD items encountered within lazy_scan_heap() (among
those encountered so far in the event of multiple index passes) has
too many LP_DEAD items -- this indicates that there is a danger that
some heap pages will start to "overflow" soon, which is now a problem
that lazy_scan_heap() must think about. Maybe if the "extra space"
left by applying heap fill factor (with settings below 100) is
insufficient to fit perhaps 2/3 of the LP_DEAD items needed on the
heap page that has the most LP_DEAD items (among all heap pages), we
stop caring about what amvacuumstrategy()/the indexes say. So we do
the right thing for the heap pages, while still mostly avoiding index
vacuuming and the final heap pass.
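
A sketch of that heuristic in code (this is not the actual patch;
maxdeadpage would be the per-page maximum tracked by lazy_scan_heap(),
while the macros used here are existing Postgres ones):

/* Sketch: should LP_DEAD accumulation override the indexes' wishes? */
static bool
heap_overflow_danger(Relation rel, int maxdeadpage)
{
    int     fillfactor = RelationGetFillFactor(rel, HEAP_DEFAULT_FILLFACTOR);
    int     slack = BLCKSZ * (100 - fillfactor) / 100;  /* "extra space" */
    int     capacity = slack / sizeof(ItemIdData);  /* LP_DEAD items that fit */

    /* With fillfactor = 100 there is no slack, so don't defer heap vacuum */
    if (capacity == 0)
        return true;

    /* Stop deferring once the worst page is ~2/3 of the way to overflow */
    return maxdeadpage > capacity * 2 / 3;
}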

I experimented with this today, and I think that it is a good way to
do it. I like the idea of choose_vacuum_strategy() understanding that
heap pages that are subject to many non-HOT updates have a "natural
extra capacity for LP_DEAD items" that it must care about directly (at
least with non-default heap fill factor settings). My early testing
shows that it will often take a surprisingly long time for the most
heavily updated heap page to have more than about 100 LP_DEAD items.

> > I will need to experiment in order to improve my understanding of how
> > to make this cooperate with bottom-up index deletion. But that's
> > mostly just a question for my patch (and a relatively easy one).
>
> Yeah, I think we might need something like statistics about garbage
> per index so that each index can make a different decision based
> on its status. For example, a btree index might want to skip
> ambulkdelete() if it has only a few dead index tuples in its leaf
> pages. The statistics could live in the stats collector or in btree's
> meta page.

Right. I think that even a very conservative approach could work well.
For example, maybe we teach nbtree's amvacuumstrategy() routine to ask
to do a real ambulkdelete(), except in the extreme case where the
index is *exactly* the same size as it was after the last VACUUM.
This will happen regularly with bottom-up index deletion. Maybe that
approach is a bit too conservative, though.
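
That conservative rule is only a few lines. A sketch, assuming the
block count at the end of the previous VACUUM has been saved somewhere
such as the meta page (how it is stored is the open question):

static IndexVacuumStrategy
btvacuumstrategy(IndexVacuumInfo *info, BlockNumber prev_num_blocks)
{
    /* Skip only when the index did not grow at all since the last VACUUM */
    if (RelationGetNumberOfBlocks(info->index) == prev_num_blocks)
        return INDEX_VACUUM_STRATEGY_NONE;

    return INDEX_VACUUM_STRATEGY_BULKDELETE;
}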

--
Peter Geoghegan



Re: New IndexAM API controlling index vacuum strategies

Peter Geoghegan
On Sun, Dec 27, 2020 at 11:41 PM Peter Geoghegan <[hidden email]> wrote:
> I experimented with this today, and I think that it is a good way to
> do it. I like the idea of choose_vacuum_strategy() understanding that
> heap pages that are subject to many non-HOT updates have a "natural
> extra capacity for LP_DEAD items" that it must care about directly (at
> least with non-default heap fill factor settings). My early testing
> shows that it will often take a surprisingly long time for the most
> heavily updated heap page to have more than about 100 LP_DEAD items.

Attached is a rough patch showing what I did here. It was applied on
top of my bottom-up index deletion patch series and your
poc_vacuumstrategy.patch patch. This patch was written as a quick and
dirty way of simulating what I thought would work best for bottom-up
index deletion for one specific benchmark/test, which was
non-hot-update heavy. This consists of a variant pgbench with several
indexes on pgbench_accounts (almost the same as most other bottom-up
deletion benchmarks I've been running). Only one index is "logically
modified" by the updates, but of course we still physically modify all
indexes on every update. I set fill factor to 90 for this benchmark,
which is an important factor for how your VACUUM patch works during
the benchmark.

This rough supplementary patch includes VACUUM logic that assumes (but
doesn't check) that the table has heap fill factor set to 90 -- see my
changes to choose_vacuum_strategy(). This benchmark is really about
stability over time more than performance (though performance is also
improved significantly). I wanted to keep both the table/heap and the
logically unmodified indexes (i.e. 3 out of 4 indexes on
pgbench_accounts) exactly the same size *forever*.

Does this make sense?

Anyway, with a 15k TPS limit on a pgbench scale 3000 DB, I see that
pg_stat_database shows an almost ~28% reduction in blks_read after an
overnight run for the patch series (it was 508,820,699 for the
patches, 705,282,975 for the master branch). I think that the VACUUM
component is responsible for some of that reduction. There were 11
VACUUMs for the patch, 7 of which did not call lazy_vacuum_heap()
(these 7 VACUUM operations all only did a btbulkdelete() call for the
one problematic index on the table, named "abalance_ruin", which my
supplementary patch has hard-coded knowledge of).

--
Peter Geoghegan

Attachment: 0007-btvacuumstrategy-bottom-up-index-deletion-changes.patch (13K)

Re: New IndexAM API controlling index vacuum strategies

Masahiko Sawada
On Mon, Dec 28, 2020 at 4:42 PM Peter Geoghegan <[hidden email]> wrote:

>
> On Sun, Dec 27, 2020 at 10:55 PM Masahiko Sawada <[hidden email]> wrote:
> > > As you said, the next question must be: How do we teach lazy vacuum to
> > > not do what gets requested by amvacuumcleanup() when it cannot respect
> > > the wishes of an individual index, for example when the
> > > accumulation of LP_DEAD items in the heap becomes a big problem in
> > > itself? That really could be the thing that forces full heap
> > > vacuuming, even with several indexes.
> >
> > You mean requested by amvacuumstrategy(), not by amvacuumcleanup()? I
> > think amvacuumstrategy() affects only ambulkdelete(). But when all
> > ambulkdelete() calls are skipped at the request of the index AMs, we
> > might want to skip amvacuumcleanup() as well.
>
> No, I was asking about how we should decide to do a real VACUUM (a
> real ambulkdelete() call) even when no index asks for it because
> bottom-up deletion works very well in every index. Clearly we will
> need to eventually remove remaining LP_DEAD items from the heap at
> some point if nothing else happens -- eventually LP_DEAD items in the
> heap alone will force a traditional heap vacuum (which will still have
> to go through indexes that have not grown, just to be safe/avoid
> recycling a TID that's still in the index).
>
> Postgres heap fillfactor is 100 by default, though I believe it's 90
> in another well known DB system. If you set Postgres heap fill factor
> to 90 you can fit a little over 200 LP_DEAD items in the "extra space"
> left behind in each heap page after initial bulk loading/INSERTs take
> place that respect our lower fill factor setting. This is about 4x the
> number of initial heap tuples in the pgbench_accounts table -- it's
> quite a lot!
>
> If we pessimistically assume that all updates are non-HOT updates,
> we'll still usually have enough space for each logical row to get
> updated several times before the heap page "overflows". Even when
> there is significant skew in the UPDATEs, the skew is not noticeable
> at the level of individual heap pages. We have a surprisingly large
> general capacity to temporarily "absorb" extra garbage LP_DEAD items
> in heap pages this way. Nobody really cared about this extra capacity
> very much before now, because it did not help with the big problem of
> index bloat that you naturally see with this workload. But that big
> problem may go away soon, and so this extra capacity may become
> important at the same time.
>
> I think that it could make sense for lazy_scan_heap() to maintain
> statistics about the number of LP_DEAD items remaining in each heap
> page (just local stack variables). From there, it can pass the
> statistics to the choose_vacuum_strategy() function from your patch.
> Perhaps choose_vacuum_strategy() will notice that the heap page with
> the most LP_DEAD items encountered within lazy_scan_heap() (among
> those encountered so far in the event of multiple index passes) has
> too many LP_DEAD items -- this indicates that there is a danger that
> some heap pages will start to "overflow" soon, which is now a problem
> that lazy_scan_heap() must think about. Maybe if the "extra space"
> left by applying heap fill factor (with settings below 100) is
> insufficient to fit perhaps 2/3 of the LP_DEAD items needed on the
> heap page that has the most LP_DEAD items (among all heap pages), we
> stop caring about what amvacuumstrategy()/the indexes say. So we do
> the right thing for the heap pages, while still mostly avoiding index
> vacuuming and the final heap pass.

Agreed. I like the idea that we calculate how many LP_DEAD items we
can absorb based on the extra space left by applying the fill factor.
Since there is a limit on the maximum number of line pointers in a
heap page, we might need to consider that limit in the calculation.

From another point of view, given that the maximum number of heap
tuples in one 8kB heap page (MaxHeapTuplesPerPage) is 291, I think how
bad it is to store LP_DEAD items in a heap page varies depending on
the tuple size.

For example, suppose the tuple size is 200: we can store 40 tuples in
one heap page if there are no LP_DEAD items at all. Even if there are
150 LP_DEAD items on the page, we are still able to store 37 tuples,
because we can still have up to 141 more line pointers, which is
enough to store the maximum number of heap tuples when there are no
LP_DEAD items, and we have (8192 - (4 * 150)) bytes of space to store
tuples (with their line pointers). That is, we can think of having 150
LP_DEAD items as causing an overflow of 3 tuples. On the other hand,
suppose the tuple size is 40: we can store 204 tuples in one heap page
if there are no LP_DEAD items at all. If there are 150 LP_DEAD items
on the page, we are able to store only 141 tuples. That is, having 150
LP_DEAD items causes an overflow of 63 tuples. I think the impact of
absorbing LP_DEAD items on table bloat is larger in the latter case.

The larger the tuple size, the more LP_DEAD items can be absorbed in a
heap page with less of a bad effect. Considering a 32-byte tuple, the
minimum heap tuple size including the tuple header, absorbing
approximately up to 70 LP_DEAD items would not have much effect in
terms of bloat. In other words, if a heap page has more than 70
LP_DEAD items, absorbing LP_DEAD items may become a table-bloat
problem. This threshold of 70 LP_DEAD items is a conservative value
and probably would be a lower bound. If the tuple size is larger, we
may be able to absorb more LP_DEAD items.
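
The numbers above can be reproduced mechanically. A small standalone
program, assuming an 8kB page with a 24-byte page header, 4-byte line
pointers, and MaxHeapTuplesPerPage = 291 (the "40-byte" case is
modeled as 36 bytes of tuple data plus its 4-byte line pointer, which
is how the 204 figure falls out):

#include <stdio.h>

/* How many tuples fit on a page that already holds n_dead LP_DEAD items? */
static int
tuples_per_page(int tuple_size, int n_dead)
{
    const int page = 8192, header = 24, lp = 4, max_tuples = 291;
    int     space = page - header - n_dead * lp;
    int     by_space = space / (tuple_size + lp);   /* space constraint */
    int     by_slots = max_tuples - n_dead;         /* line pointer limit */

    return by_space < by_slots ? by_space : by_slots;
}

int
main(void)
{
    printf("%d\n", tuples_per_page(200, 0));    /* 40 */
    printf("%d\n", tuples_per_page(200, 150));  /* 37 */
    printf("%d\n", tuples_per_page(36, 0));     /* 204 */
    printf("%d\n", tuples_per_page(36, 150));   /* 141 */
    return 0;
}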

FYI I've attached a graph showing how the number of LP_DEAD items on
one heap page affects the maximum number of heap tuples on the same
heap page. The X-axis is the number of LP_DEAD items in one heap page
and the Y-axis is the number of heap tuples that can be stored on the
page. Each line in the graph corresponds to a different heap tuple
size. For example, in the pgbench workload, since the tuple size is
about 120 bytes, page bloat accelerates if we leave more than about
230 LP_DEAD items in a heap page.

>
> I experimented with this today, and I think that it is a good way to
> do it. I like the idea of choose_vacuum_strategy() understanding that
> heap pages that are subject to many non-HOT updates have a "natural
> extra capacity for LP_DEAD items" that it must care about directly (at
> least with non-default heap fill factor settings). My early testing
> shows that it will often take a surprisingly long time for the most
> heavily updated heap page to have more than about 100 LP_DEAD items.

Agreed.

>
> > > I will need to experiment in order to improve my understanding of how
> > > to make this cooperate with bottom-up index deletion. But that's
> > > mostly just a question for my patch (and a relatively easy one).
> >
> > Yeah, I think we might need something like statistics about garbage
> > per index so that each index can make a different decision based
> > on its status. For example, a btree index might want to skip
> > ambulkdelete() if it has only a few dead index tuples in its leaf
> > pages. The statistics could live in the stats collector or in btree's
> > meta page.
>
> Right. I think that even a very conservative approach could work well.
> For example, maybe we teach nbtree's amvacuumstrategy() routine to ask
> to do a real ambulkdelete(), except in the extreme case where the
> index is *exactly* the same size as it was after the last VACUUM.
> This will happen regularly with bottom-up index deletion. Maybe that
> approach is a bit too conservative, though.

Agreed.

Regards,

--
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/

Attachment: lp_dead.png (31K)

Re: New IndexAM API controlling index vacuum strategies

Masahiko Sawada
On Tue, Dec 29, 2020 at 7:06 AM Peter Geoghegan <[hidden email]> wrote:

>
> On Sun, Dec 27, 2020 at 11:41 PM Peter Geoghegan <[hidden email]> wrote:
> > I experimented with this today, and I think that it is a good way to
> > do it. I like the idea of choose_vacuum_strategy() understanding that
> > heap pages that are subject to many non-HOT updates have a "natural
> > extra capacity for LP_DEAD items" that it must care about directly (at
> > least with non-default heap fill factor settings). My early testing
> > shows that it will often take a surprisingly long time for the most
> > heavily updated heap page to have more than about 100 LP_DEAD items.
>
> Attached is a rough patch showing what I did here. It was applied on
> top of my bottom-up index deletion patch series and your
> poc_vacuumstrategy.patch patch. This patch was written as a quick and
> dirty way of simulating what I thought would work best for bottom-up
> index deletion for one specific benchmark/test, which was
> non-hot-update heavy. This consists of a variant pgbench with several
> indexes on pgbench_accounts (almost the same as most other bottom-up
> deletion benchmarks I've been running). Only one index is "logically
> modified" by the updates, but of course we still physically modify all
> indexes on every update. I set fill factor to 90 for this benchmark,
> which is an important factor for how your VACUUM patch works during
> the benchmark.
>
> This rough supplementary patch includes VACUUM logic that assumes (but
> doesn't check) that the table has heap fill factor set to 90 -- see my
> changes to choose_vacuum_strategy(). This benchmark is really about
> stability over time more than performance (though performance is also
> improved significantly). I wanted to keep both the table/heap and the
> logically unmodified indexes (i.e. 3 out of 4 indexes on
> pgbench_accounts) exactly the same size *forever*.
>
> Does this make sense?

Thank you for sharing the patch. That makes sense.

+        if (!vacuum_heap)
+        {
+            if (maxdeadpage > 130 ||
+                /* Also check if maintenance_work_mem space is running out */
+                vacrelstats->dead_tuples->num_tuples >
+                vacrelstats->dead_tuples->max_tuples / 2)
+                vacuum_heap = true;
+        }

The second test, checking if maintenance_work_mem space is running
out, also makes sense to me. Perhaps another idea would be to compare
the number of collected garbage tuples to the total number of heap
tuples, so that we do lazy_vacuum_heap() only when we're likely to
reclaim a certain amount of garbage in the table.
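
In code terms, that extra test might be one more condition in the
snippet quoted above (RECLAIM_SCALE_FACTOR is a made-up name;
old_live_tuples is the pre-vacuum tuple estimate that vacuumlazy.c
already keeps):

/* Sketch: also vacuum the heap once the collected garbage is a
 * meaningful fraction of the table (threshold value is made up). */
#define RECLAIM_SCALE_FACTOR    0.02

    if (!vacuum_heap &&
        vacrelstats->dead_tuples->num_tuples >
        RECLAIM_SCALE_FACTOR * vacrelstats->old_live_tuples)
        vacuum_heap = true;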

>
> Anyway, with a 15k TPS limit on a pgbench scale 3000 DB, I see that
> pg_stat_database shows an almost ~28% reduction in blks_read after an
> overnight run for the patch series (it was 508,820,699 for the
> patches, 705,282,975 for the master branch). I think that the VACUUM
> component is responsible for some of that reduction. There were 11
> VACUUMs for the patch, 7 of which did not call lazy_vacuum_heap()
> (these 7 VACUUM operations all only did a btbulkdelete() call for the
> one problematic index on the table, named "abalance_ruin", which my
> supplementary patch has hard-coded knowledge of).

That's a very good result in terms of skipping lazy_vacuum_heap(). How
much did the table and indexes bloat? Also, I'm curious which test in
choose_vacuum_strategy() turned vacuum_heap on: the 130 test, or the
test of whether maintenance_work_mem space is running out? And what
was the impact on clearing all-visible bits?

Regards,

--
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/



Re: New IndexAM API controlling index vacuum strategies

Peter Geoghegan
On Mon, Dec 28, 2020 at 10:26 PM Masahiko Sawada <[hidden email]> wrote:
> The second test, checking if maintenance_work_mem space is running
> out, also makes sense to me. Perhaps another idea would be to compare
> the number of collected garbage tuples to the total number of heap
> tuples, so that we do lazy_vacuum_heap() only when we're likely to
> reclaim a certain amount of garbage in the table.

Right. Or considering whether this is an anti-wraparound VACUUM might be
nice -- maybe we should skip index vacuuming + lazy_vacuum_heap() if
and only if we're under pressure to advance datfrozenxid for the whole
DB, and really need to hurry up. (I think that we could both probably
think of way too many ideas like this one.)

> That's a very good result in terms of skipping lazy_vacuum_heap(). How
> much did the table and indexes bloat? Also, I'm curious which test in
> choose_vacuum_strategy() turned vacuum_heap on: the 130 test, or the
> test of whether maintenance_work_mem space is running out? And what
> was the impact on clearing all-visible bits?

The pgbench_accounts heap table and 3 out of 4 of its indexes (i.e.
all indexes except "abalance_ruin") had zero growth. They did not even
become larger by 1 block. As I often say when talking about work in
this area, this is not a quantitative difference -- it's a qualitative
difference. (If they grew even a tiny amount, say by only 1 block,
further growth is likely to follow.)

The "abalance_ruin" index was smaller with the patch. Its size started
off at 253,779 blocks with both the patch and master branch (which is
very small, because of B-Tree deduplication). By the end of 2 pairs of
runs for the patch (two 3-hour runs) the size grew to 502,016 blocks.
But with the master branch it grew to 540,090 blocks. (For reference,
the primary key on pgbench_accounts started out at 822,573 blocks.)

My guess is that this would compare favorably with "magic VACUUM" [1]
(I refer to a thought experiment that is useful for understanding the
principles behind bottom-up index deletion). The fact that
"abalance_ruin" becomes bloated probably doesn't have that much to do
with MVCC versioning. In other words, I suspect that the index
wouldn't be that much smaller in a traditional two-phase locking database
system with the same workload. Words like "bloat" and "fragmentation"
have always been overloaded/ambiguous in highly confusing ways, which
is why I find it useful to compare a real world workload/benchmark to
some kind of theoretical ideal behavior.

This test wasn't particularly sympathetic to the patch because most of
the indexes (all but the PK) were useless -- they did not get used by
query plans. So the final size of "abalance_ruin" (or any other index)
isn't even the truly important thing IMV (the benchmark doesn't
advertise the truly important thing for me). The truly important thing
is that the worst case number of versions *per logical row* is tightly
controlled. It doesn't necessarily matter all that much if 30% of an
index's tuples are garbage, as long as the garbage tuples are evenly
spread across all logical rows in the table (in practice it's pretty
unlikely that that would actually happen, but it's possible in theory,
and if it did happen it really wouldn't be so bad).

[1] https://postgr.es/m/CAH2-Wz=rPkB5vXS7AZ+v8t3VE75v_dKGro+w4nWd64E9yiCLEQ@...
--
Peter Geoghegan



Re: New IndexAM API controlling index vacuum strategies

Peter Geoghegan
On Mon, Dec 28, 2020 at 11:20 PM Peter Geoghegan <[hidden email]> wrote:

> > That's a very good result in terms of skipping lazy_vacuum_heap(). How
> > much did the table and indexes bloat? Also, I'm curious which test in
> > choose_vacuum_strategy() turned vacuum_heap on: the 130 test, or the
> > test of whether maintenance_work_mem space is running out? And what
> > was the impact on clearing all-visible bits?
>
> The pgbench_accounts heap table and 3 out of 4 of its indexes (i.e.
> all indexes except "abalance_ruin") had zero growth. They did not even
> become larger by 1 block. As I often say when talking about work in
> this area, this is not a quantitative difference -- it's a qualitative
> difference. (If they grew even a tiny amount, say by only 1 block,
> further growth is likely to follow.)

I forgot to say: I don't know what the exact impact was on the VM bit
setting, but I doubt that it was noticeably worse for the patch. It
cannot have been better, though.

It's inherently almost impossible to keep most of the VM bits set for
long with this workload. Perhaps VM bit setting would be improved with
workloads that have some HOT updates, but as I mentioned this workload
only had non-HOT updates (except in a tiny number of cases where
abalance did not change, just by random luck).

I also forgot to say that the maintenance_work_mem test wasn't that
relevant, though I believe it triggered once. maintenance_work_mem was
set very high (5GB).

Here is a link with more detailed information, in case that is
interesting: https://drive.google.com/file/d/1TqpAQnqb4SMMuhehD8ELpf6Cv9A8ux2E/view?usp=sharing

--
Peter Geoghegan



Re: New IndexAM API controlling index vacuum strategies

Masahiko Sawada
On Tue, Dec 29, 2020 at 3:25 PM Masahiko Sawada <[hidden email]> wrote:

>
> On Tue, Dec 29, 2020 at 7:06 AM Peter Geoghegan <[hidden email]> wrote:
> >
> > On Sun, Dec 27, 2020 at 11:41 PM Peter Geoghegan <[hidden email]> wrote:
> > > I experimented with this today, and I think that it is a good way to
> > > do it. I like the idea of choose_vacuum_strategy() understanding that
> > > heap pages that are subject to many non-HOT updates have a "natural
> > > extra capacity for LP_DEAD items" that it must care about directly (at
> > > least with non-default heap fill factor settings). My early testing
> > > shows that it will often take a surprisingly long time for the most
> > > heavily updated heap page to have more than about 100 LP_DEAD items.
> >
> > Attached is a rough patch showing what I did here. It was applied on
> > top of my bottom-up index deletion patch series and your
> > poc_vacuumstrategy.patch patch. This patch was written as a quick and
> > dirty way of simulating what I thought would work best for bottom-up
> > index deletion for one specific benchmark/test, which was
> > non-hot-update heavy. This consists of a variant pgbench with several
> > indexes on pgbench_accounts (almost the same as most other bottom-up
> > deletion benchmarks I've been running). Only one index is "logically
> > modified" by the updates, but of course we still physically modify all
> > indexes on every update. I set fill factor to 90 for this benchmark,
> > which is an important factor for how your VACUUM patch works during
> > the benchmark.
> >
> > This rough supplementary patch includes VACUUM logic that assumes (but
> > doesn't check) that the table has heap fill factor set to 90 -- see my
> > changes to choose_vacuum_strategy(). This benchmark is really about
> > stability over time more than performance (though performance is also
> > improved significantly). I wanted to keep both the table/heap and the
> > logically unmodified indexes (i.e. 3 out of 4 indexes on
> > pgbench_accounts) exactly the same size *forever*.
> >
> > Does this make sense?
>
> Thank you for sharing the patch. That makes sense.
>
> +        if (!vacuum_heap)
> +        {
> +            if (maxdeadpage > 130 ||
> +                /* Also check if maintenance_work_mem space is running out */
> +                vacrelstats->dead_tuples->num_tuples >
> +                vacrelstats->dead_tuples->max_tuples / 2)
> +                vacuum_heap = true;
> +        }
>
> The second test, checking if maintenance_work_mem space is running
> out, also makes sense to me. Perhaps another idea would be to compare
> the number of collected garbage tuples to the total number of heap
> tuples, so that we do lazy_vacuum_heap() only when we're likely to
> reclaim a certain amount of garbage in the table.
>
> >
> > Anyway, with a 15k TPS limit on a pgbench scale 3000 DB, I see that
> > pg_stat_database shows an almost ~28% reduction in blks_read after an
> > overnight run for the patch series (it was 508,820,699 for the
> > patches, 705,282,975 for the master branch). I think that the VACUUM
> > component is responsible for some of that reduction. There were 11
> > VACUUMs for the patch, 7 of which did not call lazy_vacuum_heap()
> > (these 7 VACUUM operations all only did a btbulkdelete() call for the
> > one problematic index on the table, named "abalance_ruin", which my
> > supplementary patch has hard-coded knowledge of).
>
> That's a very good result in terms of skipping lazy_vacuum_heap(). How
> much did the table and indexes bloat? Also, I'm curious which test in
> choose_vacuum_strategy() turned vacuum_heap on: the 130 test, or the
> test of whether maintenance_work_mem space is running out? And what
> was the impact on clearing all-visible bits?
>
I merged these patches and polished the result.

In the 0002 patch, we calculate how many LP_DEAD items can be
accumulated in the space on a single heap page left by fillfactor. I
increased MaxHeapTuplesPerPage so that we can accumulate more LP_DEAD
items on a heap page, because otherwise accumulating LP_DEAD items
unnecessarily constrains the number of heap tuples in a single page,
especially with small tuples, as I mentioned before. Previously, we
constrained the number of line pointers to avoid excessive
line-pointer bloat and to not require an increase in the size of the
work array. However, once the amvacuumstrategy stuff entered the
picture, accumulating line pointers has value. Also, we might want to
store the returned value of amvacuumstrategy so that the index AM can
refer to it on index-deletion.
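
One way to let ambulkdelete() see that answer would be to carry it on
the struct vacuum already hands to the index AM. A sketch; only the
vacuum_strategy field is new, the rest is today's IndexVacuumInfo:

typedef struct IndexVacuumInfo
{
    Relation    index;              /* the index being vacuumed */
    bool        analyze_only;       /* ANALYZE (without any actual vacuum) */
    bool        report_progress;    /* emit progress.h status reports */
    bool        estimated_count;    /* num_heap_tuples is an estimate */
    int         message_level;      /* ereport level for progress messages */
    double      num_heap_tuples;    /* tuples remaining in the heap */
    BufferAccessStrategy strategy;  /* access strategy for reads */
    IndexVacuumStrategy vacuum_strategy;    /* assumed new field: this AM's
                                             * earlier amvacuumstrategy()
                                             * answer */
} IndexVacuumInfo;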

The 0003 patch has btree indexes skip bulk-deletion if the index
hasn't grown since the last bulk-deletion. I stored the number of
blocks in the meta page but didn't implement meta page upgrading.

I've attached the draft version patches. Note that the documentation
update is still lacking.

Regards,

--
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/

Attachment: 0001-Introduce-IndexAM-API-for-choosing-index-vacuum-stra.patch (21K)
Attachment: 0002-Choose-index-vacuum-strategy-based-on-amvacuumstrate.patch (50K)
Attachment: 0003-Skip-btree-bulkdelete-if-the-index-doesn-t-grow.patch (12K)

Re: New IndexAM API controlling index vacuum strategies

Masahiko Sawada
On Tue, Jan 5, 2021 at 10:35 AM Masahiko Sawada <[hidden email]> wrote:

>
> On Tue, Dec 29, 2020 at 3:25 PM Masahiko Sawada <[hidden email]> wrote:
> >
> > On Tue, Dec 29, 2020 at 7:06 AM Peter Geoghegan <[hidden email]> wrote:
> > >
> > > On Sun, Dec 27, 2020 at 11:41 PM Peter Geoghegan <[hidden email]> wrote:
> > > > I experimented with this today, and I think that it is a good way to
> > > > do it. I like the idea of choose_vacuum_strategy() understanding that
> > > > heap pages that are subject to many non-HOT updates have a "natural
> > > > extra capacity for LP_DEAD items" that it must care about directly (at
> > > > least with non-default heap fill factor settings). My early testing
> > > > shows that it will often take a surprisingly long time for the most
> > > > heavily updated heap page to have more than about 100 LP_DEAD items.
> > >
> > > Attached is a rough patch showing what I did here. It was applied on
> > > top of my bottom-up index deletion patch series and your
> > > poc_vacuumstrategy.patch patch. This patch was written as a quick and
> > > dirty way of simulating what I thought would work best for bottom-up
> > > index deletion for one specific benchmark/test, which was
> > > non-hot-update heavy. This consists of a variant pgbench with several
> > > indexes on pgbench_accounts (almost the same as most other bottom-up
> > > deletion benchmarks I've been running). Only one index is "logically
> > > modified" by the updates, but of course we still physically modify all
> > > indexes on every update. I set fill factor to 90 for this benchmark,
> > > which is an important factor for how your VACUUM patch works during
> > > the benchmark.
> > >
> > > This rough supplementary patch includes VACUUM logic that assumes (but
> > > doesn't check) that the table has heap fill factor set to 90 -- see my
> > > changes to choose_vacuum_strategy(). This benchmark is really about
> > > stability over time more than performance (though performance is also
> > > improved significantly). I wanted to keep both the table/heap and the
> > > logically unmodified indexes (i.e. 3 out of 4 indexes on
> > > pgbench_accounts) exactly the same size *forever*.
> > >
> > > Does this make sense?
> >
> > Thank you for sharing the patch. That makes sense.
> >
> > +        if (!vacuum_heap)
> > +        {
> > +            if (maxdeadpage > 130 ||
> > +                /* Also check if maintenance_work_mem space is running out */
> > +                vacrelstats->dead_tuples->num_tuples >
> > +                vacrelstats->dead_tuples->max_tuples / 2)
> > +                vacuum_heap = true;
> > +        }
> >
> > The second test, checking if maintenance_work_mem space is running
> > out, also makes sense to me. Perhaps another idea would be to compare
> > the number of collected garbage tuples to the total number of heap
> > tuples, so that we do lazy_vacuum_heap() only when we're likely to
> > reclaim a certain amount of garbage in the table.
> >
> > >
> > > Anyway, with a 15k TPS limit on a pgbench scale 3000 DB, I see that
> > > pg_stat_database shows an almost ~28% reduction in blks_read after an
> > > overnight run for the patch series (it was 508,820,699 for the
> > > patches, 705,282,975 for the master branch). I think that the VACUUM
> > > component is responsible for some of that reduction. There were 11
> > > VACUUMs for the patch, 7 of which did not call lazy_vacuum_heap()
> > > (these 7 VACUUM operations all only did a btbulkdelete() call for the
> > > one problematic index on the table, named "abalance_ruin", which my
> > > supplementary patch has hard-coded knowledge of).
> >
> > That's a very good result in terms of skipping lazy_vacuum_heap(). How
> > much did the table and indexes bloat? Also, I'm curious which test in
> > choose_vacuum_strategy() turned vacuum_heap on: the 130 test, or the
> > test of whether maintenance_work_mem space is running out? And what
> > was the impact on clearing all-visible bits?
> >
>
> I merged these patches and polished the result.
>
> In the 0002 patch, we calculate how many LP_DEAD items can be
> accumulated in the space on a single heap page left by fillfactor. I
> increased MaxHeapTuplesPerPage so that we can accumulate more LP_DEAD
> items on a heap page, because otherwise accumulating LP_DEAD items
> unnecessarily constrains the number of heap tuples in a single page,
> especially with small tuples, as I mentioned before. Previously, we
> constrained the number of line pointers to avoid excessive
> line-pointer bloat and to not require an increase in the size of the
> work array. However, once the amvacuumstrategy stuff entered the
> picture, accumulating line pointers has value. Also, we might want to
> store the returned value of amvacuumstrategy so that the index AM can
> refer to it on index-deletion.
>
> The 0003 patch has btree indexes skip bulk-deletion if the index
> hasn't grown since the last bulk-deletion. I stored the number of
> blocks in the meta page but didn't implement meta page upgrading.
>
After more thought, I think that ambulkdelete needs to be able to
refer to the answer of amvacuumstrategy. That way, the index can skip
bulk-deletion when lazy vacuum doesn't vacuum the heap and the index
also doesn't want to do it.

I’ve attached the updated version patch that includes the following changes:

* Store the answers to amvacuumstrategy in either local memory or
DSM (in the parallel vacuum case) so that ambulkdelete can refer to
them.
* Fix regression failures.
* Update the documentation and comments.

Note that 0003 patch is still PoC quality, lacking the btree meta page
version upgrade.

Regards,

--
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/

v2-0003-Skip-btree-bulkdelete-if-the-index-doesn-t-grow.patch (13K)
v2-0003-PoC-skip-btree-bulkdelete-if-the-index-doesn-t-gr.patch (13K)
v2-0001-Introduce-IndexAM-API-for-choosing-index-vacuum-s.patch (23K)

Re: New IndexAM API controlling index vacuum strategies

Masahiko Sawada
On Mon, Jan 18, 2021 at 2:18 PM Masahiko Sawada <[hidden email]> wrote:

>
> *** SNIP ***
>
> Note that 0003 patch is still PoC quality, lacking the btree meta page
> version upgrade.
Sorry, I missed the 0002 patch. I've attached the patch set again.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

v2-0003-PoC-skip-btree-bulkdelete-if-the-index-doesn-t-gr.patch (13K)
v2-0001-Introduce-IndexAM-API-for-choosing-index-vacuum-s.patch (23K)
v2-0002-Choose-index-vacuum-strategy-based-on-amvacuumstr.patch (70K)

Re: New IndexAM API controlling index vacuum strategies

Zhihong Yu
Hi, Masahiko-san:

For v2-0001-Introduce-IndexAM-API-for-choosing-index-vacuum-s.patch :

For blvacuumstrategy():

+   if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+       return INDEX_VACUUM_STRATEGY_NONE;
+   else
+       return INDEX_VACUUM_STRATEGY_BULKDELETE;

The 'else' can be omitted.
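That is, something like:

    if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
        return INDEX_VACUUM_STRATEGY_NONE;

    return INDEX_VACUUM_STRATEGY_BULKDELETE;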

Similar comment for ginvacuumstrategy().

For v2-0002-Choose-index-vacuum-strategy-based-on-amvacuumstr.patch :

If index_cleanup option is specified neither VACUUM command nor
storage option

I think this is what you meant (by not using passive voice):

If index_cleanup option specifies neither VACUUM command nor
storage option,

- * integer, but you can't fit that many items on a page. 11 ought to be more
+ * integer, but you can't fit that many items on a page. 13 ought to be more

It would be nice to add a note on why the number of bits is increased.

For choose_vacuum_strategy():

+       IndexVacuumStrategy ivstrat;

The variable is only used inside the loop. You can use vacrelstats->ivstrategies[i] directly and omit the variable.

+       int ndeaditems_limit = (int) ((freespace / sizeof(ItemIdData)) * 0.7);

How was the factor of 0.7 determined? The comment below only mentions
'safety factor' but not how it was chosen.
I also wonder if this factor should be exposed as a GUC.

+   if (nworkers > 0)
+       nworkers--;

Should log / assert be added when nworkers is <= 0 ?

+ * XXX: allowing to fill the heap page with only line pointer seems a overkill.

'a overkill' -> 'an overkill'

Cheers

On Sun, Jan 17, 2021 at 10:21 PM Masahiko Sawada <[hidden email]> wrote:
*** SNIP ***

Re: New IndexAM API controlling index vacuum strategies

Peter Geoghegan-4
In reply to this post by Masahiko Sawada
On Sun, Jan 17, 2021 at 9:18 PM Masahiko Sawada <[hidden email]> wrote:
> After more thought, I think that ambulkdelete needs to be able to
> refer to the answer from amvacuumstrategy. That way, the index can
> skip bulk-deletion when lazy vacuum doesn't vacuum the heap and the
> index doesn't want to do it either.

Makes sense.

BTW, your patch has bitrot already. Peter E's recent pageinspect
commit happens to conflict with this patch. It might make sense to
produce a new version that just fixes the bitrot, so that other people
don't have to deal with it each time.

> I’ve attached the updated version patch that includes the following changes:

Looks good. I'll give this version a review now. I will do a lot more
soon. I need to come up with a good benchmark for this, that I can
return to again and again during review as needed.

Some feedback on the first patch:

* Just so you know: I agree with you about handling
VACOPT_TERNARY_DISABLED in the index AM's amvacuumstrategy routine. I
think that it's better to do that there, even though this choice may
have some downsides.

* Can you add some "stub" sgml doc changes for this? Doesn't have to
be complete in any way. Just a placeholder for later, that has the
correct general "shape" to orientate the reader of the patch. It can
just be a FIXME comment, plus basic mechanical stuff -- details of the
new amvacuumstrategy_function routine and its signature.
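For reference, a rough guess at the general shape -- the strategy
values below are taken from the v2 patch hunks quoted earlier in this
thread, but the exact signature here is an assumption, not the patch's
actual definition:

    /* hypothetical sketch, not the patch's actual code */
    typedef enum IndexVacuumStrategy
    {
        INDEX_VACUUM_STRATEGY_NONE,
        INDEX_VACUUM_STRATEGY_BULKDELETE
    } IndexVacuumStrategy;

    typedef IndexVacuumStrategy (*amvacuumstrategy_function) (IndexVacuumInfo *info,
                                                              VacuumParams *params);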

Some feedback on the second patch:

* Why do you move around IndexVacuumStrategy in the second patch?
Looks like a rebasing oversight.

* Actually, do we really need the first and second patches to be
separate patches? I agree that the nbtree patch should be a separate
patch, but dividing the first two sets of changes doesn't seem like it
adds much. Did I miss something?

* Is the "MAXALIGN(sizeof(ItemIdData)))" change in the definition of
MaxHeapTuplesPerPage appropriate? Here is the relevant section from
the patch:

diff --git a/src/include/access/htup_details.h
b/src/include/access/htup_details.h
index 7c62852e7f..038e7cd580 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -563,17 +563,18 @@ do { \
 /*
  * MaxHeapTuplesPerPage is an upper bound on the number of tuples that can
  * fit on one heap page.  (Note that indexes could have more, because they
- * use a smaller tuple header.)  We arrive at the divisor because each tuple
- * must be maxaligned, and it must have an associated line pointer.
+ * use a smaller tuple header.)  We arrive at the divisor because each line
+ * pointer must be maxaligned.
*** SNIP ***
 #define MaxHeapTuplesPerPage    \
-    ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
-            (MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData))))
+    ((int) ((BLCKSZ - SizeOfPageHeaderData) / (MAXALIGN(sizeof(ItemIdData)))))

It's true that ItemIdData structs (line pointers) are aligned, but
they're not MAXALIGN()'d. If they were then the on-disk size of line
pointers would generally be 8 bytes, not 4 bytes.

* Maybe it would be better if you just changed the definition such
that "MAXALIGN(SizeofHeapTupleHeader)" became "MAXIMUM_ALIGNOF", with
no other changes? (Some variant of this suggestion might be better,
not sure.)

For some reason that feels a bit safer: we still have an "imaginary
tuple header", but it's just 1 MAXALIGN() quantum now. This is still
much less than the current 3 MAXALIGN() quantums (i.e. what
MaxHeapTuplesPerPage treats as the tuple header size). Do you think
that this alternative approach will be noticeably less effective
within vacuumlazy.c?
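Concretely, one variant of that suggestion might look like this (just a
sketch, not a finished proposal):

    #define MaxHeapTuplesPerPage    \
        ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
                (MAXIMUM_ALIGNOF + sizeof(ItemIdData))))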

Note that you probably understand the issue with MaxHeapTuplesPerPage
for vacuumlazy.c better than I do currently. I'm still trying to
understand your choices, and to understand what is really important
here.

* Maybe add a #define for the value 0.7? (I refer to the value used in
choose_vacuum_strategy() to calculate a "this is the number of LP_DEAD
line pointers that we consider too many" cut off point, which is to be
applied throughout lazy_scan_heap() processing.)

* I notice that your new lazy_vacuum_table_and_indexes() function is
the only place that calls lazy_vacuum_table_and_indexes(). I think
that you should merge them together -- replace the only remaining call
to lazy_vacuum_table_and_indexes() with the body of the function
itself. Having a separate lazy_vacuum_table_and_indexes() function
doesn't seem useful to me -- it doesn't actually hide complexity, and
might even be harder to maintain.

* I suggest thinking about what the last item will mean for the
reporting that currently takes place in
lazy_vacuum_table_and_indexes(), but will now go in an expanded
lazy_vacuum_table_and_indexes() -- how do we count the total number of
index scans now?

I don't actually believe that the logic needs to change, but some kind
of consolidation and streamlining seems like it might be helpful.
Maybe just a comment that says "note that all index scans might just
be no-ops because..." -- stuff like that.

* Any idea about how hard it will be to teach is_wraparound VACUUMs to
skip index cleanup automatically, based on some practical/sensible
criteria?

It would be nice to have a basic PoC for that, even if it remains a
PoC for the foreseeable future (i.e. even if it cannot be committed to
Postgres 14). This feature should definitely be something that your
patch series *enables*. I'd feel good about having covered that
question as part of this basic design work if there was a PoC. That
alone should make it 100% clear that it's easy to do (or no harder
than it should be -- it should ideally be compatible with your basic
design). The exact criteria that we use for deciding whether or not to
skip index cleanup (which probably should not just be "this VACUUM is
is_wraparound, good enough" in the final version) may need to be
debated at length on pgsql-hackers. Even still, it is "just a detail"
in the code. Whereas being *able* to do that with your design (now or
in the future) seems essential now.

> * Store the answers to amvacuumstrategy into either the local memory
> or DSM (in parallel vacuum case) so that ambulkdelete can refer the
> answer to amvacuumstrategy.
> * Fix regression failures.
> * Update the documentation and commments.
>
> Note that 0003 patch is still PoC quality, lacking the btree meta page
> version upgrade.

This patch is not the hard part, of course -- there really isn't that
much needed here compared to vacuumlazy.c. So this patch seems like
the simplest of the three (at least to me).

Some feedback on the third patch:

* The new btm_last_deletion_nblocks metapage field should use P_NONE
(which is 0) to indicate never having been vacuumed -- not
InvalidBlockNumber (which is 0xFFFFFFFF).

This is more idiomatic in nbtree, which is nice, but it has a very
significant practical advantage: it ensures that every heapkeyspace
nbtree index (i.e. those on recent nbtree versions) can be treated as
if it has the new btm_last_deletion_nblocks field all along, even when
it actually built on Postgres 12 or 13. This trick will let you avoid
dealing with the headache of bumping BTREE_VERSION, which is a huge
advantage.

Note that this is the same trick I used to avoid bumping BTREE_VERSION
when the btm_allequalimage field needed to be added (for the nbtree
deduplication feature added to Postgres 13).
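To illustrate -- a hypothetical sketch, not the patch's actual code:
with P_NONE as the never-vacuumed sentinel, a heapkeyspace index built
on Postgres 12 or 13 simply reads the field as 0 and behaves as if it
had never recorded a block count:

    /* hypothetical check in nbtree's amvacuumstrategy routine */
    if (metad->btm_last_deletion_nblocks == P_NONE ||
        RelationGetNumberOfBlocks(rel) > metad->btm_last_deletion_nblocks)
        return INDEX_VACUUM_STRATEGY_BULKDELETE;   /* never vacuumed, or grew */

    return INDEX_VACUUM_STRATEGY_NONE;             /* didn't grow; skip */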

* Forgot to do this in the third patch (think I made this same mistake
once myself):

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 8bb180bbbe..88dfea9af3 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -653,7 +653,7 @@ bt_metap(PG_FUNCTION_ARGS)
     BTMetaPageData *metad;
     TupleDesc   tupleDesc;
     int         j;
-    char       *values[9];
+    char       *values[10];
     Buffer      buffer;
     Page        page;
     HeapTuple   tuple;
@@ -734,6 +734,11 @@ bt_metap(PG_FUNCTION_ARGS)

That's all I have for now...
--
Peter Geoghegan



Re: New IndexAM API controlling index vacuum strategies

Peter Geoghegan-4
On Tue, Jan 19, 2021 at 2:57 PM Peter Geoghegan <[hidden email]> wrote:

> * Maybe it would be better if you just changed the definition such
> that "MAXALIGN(SizeofHeapTupleHeader)" became "MAXIMUM_ALIGNOF", with
> no other changes? (Some variant of this suggestion might be better,
> not sure.)
>
> For some reason that feels a bit safer: we still have an "imaginary
> tuple header", but it's just 1 MAXALIGN() quantum now. This is still
> much less than the current 3 MAXALIGN() quantums (i.e. what
> MaxHeapTuplesPerPage treats as the tuple header size). Do you think
> that this alternative approach will be noticeably less effective
> within vacuumlazy.c?

BTW, I think that increasing MaxHeapTuplesPerPage will make it
necessary to consider tidbitmap.c. Comments at the top of that file
say that it is assumed that MaxHeapTuplesPerPage is about 256. So
there is a risk of introducing performance regressions affecting
bitmap scans here.

Apparently some other DB systems make the equivalent of
MaxHeapTuplesPerPage dynamically configurable at the level of heap
tables. It usually doesn't matter, but it can matter with on-disk
bitmap indexes, where the bitmap must be encoded from raw TIDs (this
must happen before the bitmap is compressed -- there must be a simple
mapping from every possible TID to some bit in a bitmap first). The
item offset component of each heap TID is not usually very large, so
there is a trade-off between keeping the representation of bitmaps
efficient and not unduly restricting the number of distinct heap
tuples on each heap page. I think that there might be a similar
consideration here, in tidbitmap.c (even though it's not concerned
about on-disk bitmaps).
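For reference, tidbitmap.c's TID-to-bit mapping works roughly like this
on master, which is why the per-page bit array must cover
MaxHeapTuplesPerPage bits:

    /* tidbitmap.c (excerpt) */
    #define WORDNUM(x)  ((x) / BITS_PER_BITMAPWORD)
    #define BITNUM(x)   ((x) % BITS_PER_BITMAPWORD)
    ...
    wordnum = WORDNUM(off - 1);     /* off is the TID's offset number */
    bitnum = BITNUM(off - 1);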

--
Peter Geoghegan



Re: New IndexAM API controlling index vacuum strategies

Peter Geoghegan-4
On Tue, Jan 19, 2021 at 4:45 PM Peter Geoghegan <[hidden email]> wrote:
> BTW, I think that increasing MaxHeapTuplesPerPage will make it
> necessary to consider tidbitmap.c. Comments at the top of that file
> say that it is assumed that MaxHeapTuplesPerPage is about 256. So
> there is a risk of introducing performance regressions affecting
> bitmap scans here.

More concretely, WORDS_PER_PAGE increases from 5 on the master branch
to 16 with the latest version of the patch series on most platforms
(while WORDS_PER_CHUNK is 4 with or without the patches).
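For reference, tidbitmap.c derives that as

    #define WORDS_PER_PAGE  ((MAX_TUPLES_PER_PAGE - 1) / BITS_PER_BITMAPWORD + 1)

so with 64-bit bitmapwords, master's MaxHeapTuplesPerPage of 291 gives
(291 - 1) / 64 + 1 = 5, while the patched definition -- 8168 /
MAXALIGN(sizeof(ItemIdData)) = 8168 / 8 = 1021 on typical 64-bit
platforms -- gives (1021 - 1) / 64 + 1 = 16.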

--
Peter Geoghegan



Re: New IndexAM API controlling index vacuum strategies

Masahiko Sawada
In reply to this post by Peter Geoghegan-4
On Wed, Jan 20, 2021 at 9:45 AM Peter Geoghegan <[hidden email]> wrote:

>
> On Tue, Jan 19, 2021 at 2:57 PM Peter Geoghegan <[hidden email]> wrote:
> > * Maybe it would be better if you just changed the definition such
> > that "MAXALIGN(SizeofHeapTupleHeader)" became "MAXIMUM_ALIGNOF", with
> > no other changes? (Some variant of this suggestion might be better,
> > not sure.)
> >
> > For some reason that feels a bit safer: we still have an "imaginary
> > tuple header", but it's just 1 MAXALIGN() quantum now. This is still
> > much less than the current 3 MAXALIGN() quantums (i.e. what
> > MaxHeapTuplesPerPage treats as the tuple header size). Do you think
> > that this alternative approach will be noticeably less effective
> > within vacuumlazy.c?
>
> BTW, I think that increasing MaxHeapTuplesPerPage will make it
> necessary to consider tidbitmap.c. Comments at the top of that file
> say that it is assumed that MaxHeapTuplesPerPage is about 256. So
> there is a risk of introducing performance regressions affecting
> bitmap scans here.
>
> Apparently some other DB systems make the equivalent of
> MaxHeapTuplesPerPage dynamically configurable at the level of heap
> tables. It usually doesn't matter, but it can matter with on-disk
> bitmap indexes, where the bitmap must be encoded from raw TIDs (this
> must happen before the bitmap is compressed -- there must be a simple
> mapping from every possible TID to some bit in a bitmap first). The
> item offset component of each heap TID is not usually very large, so
> there is a trade-off between keeping the representation of bitmaps
> efficient and not unduly restricting the number of distinct heap
> tuples on each heap page. I think that there might be a similar
> consideration here, in tidbitmap.c (even though it's not concerned
> about on-disk bitmaps).

That's a good point. With the patch, MaxHeapTuplesPerPage increases to
2042 with an 8kB page and to 8186 with a 32kB page, whereas it's
currently 291 with an 8kB page and 1169 with a 32kB page. So it is
likely to be a problem as you pointed out. If we change
"MAXALIGN(SizeofHeapTupleHeader)" to "MAXIMUM_ALIGNOF", it's 680 with
an 8kB page and 2728 with a 32kB page, which seems much better.
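For reference, the arithmetic behind the 8kB-page figures, assuming
8-byte MAXALIGN (23 is SizeofHeapTupleHeader, 4 is sizeof(ItemIdData)):

    master:          (8192 - 24) / (MAXALIGN(23) + 4) = 8168 / 28 = 291
    MAXIMUM_ALIGNOF: (8192 - 24) / (8 + 4)            = 8168 / 12 = 680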

The purpose of increasing MaxHeapTuplesPerPage in the patch is to have
a heap page accumulate more LP_DEAD line pointers. As I explained
before, because of the MaxHeapTuplesPerPage limit, we cannot calculate
how many LP_DEAD line pointers can be accumulated in the space
reserved by fillfactor simply as ((the space reserved by fillfactor) /
(size of a line pointer)). We need to consider both how many line
pointers are available for LP_DEAD and how much space is available for
LP_DEAD.

For example, suppose the tuple size is 50 bytes and fillfactor is 80.
Each page then has 1633 bytes (= (8192 - 24) * 0.2) of free space
reserved by fillfactor, where 408 line pointers can fit. However, if
we store 250 LP_DEAD line pointers in that space, the number of tuples
that can be stored on the page is only 41 (because
MaxHeapTuplesPerPage is 291), even though we have 6534 bytes
(= (8192 - 24) * 0.8) where 121 tuples (plus line pointers) can fit.
In this case, where the tuple size is 50 and fillfactor is 80, we can
accumulate up to about 170 LP_DEAD line pointers while storing 121
tuples. Increasing MaxHeapTuplesPerPage raises this 291 limit and lets
us ignore it when calculating the maximum number of LP_DEAD line
pointers that can be accumulated on a single page.

An alternative approach would be to calculate it using the average
tuple size. If we know the tuple size, the maximum number of LP_DEAD
line pointers that can be accumulated on a single page is the minimum
of the following two formulas:

(1) MaxHeapTuplesPerPage - (((BLCKSZ - SizeOfPageHeaderData) *
(fillfactor/100)) / (sizeof(ItemIdData) + tuple_size))); //how many
line pointers are available for LP_DEAD?

(2) ((BLCKSZ - SizeOfPageHeaderData) * ((100 - fillfactor)/100)) /
sizeof(ItemIdData); //how much space is available for LP_DEAD?
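In code form, the combined calculation might look something like this
sketch (the function name is hypothetical):

    /* how many LP_DEAD line pointers fit on one heap page? (sketch) */
    static int
    max_lp_dead_items(Size tuple_size, int fillfactor)
    {
        Size    space = BLCKSZ - SizeOfPageHeaderData;
        Size    used = (Size) (space * (fillfactor / 100.0));

        /* (1) how many line pointers are available for LP_DEAD? */
        int     by_count = MaxHeapTuplesPerPage -
            (int) (used / (sizeof(ItemIdData) + tuple_size));

        /* (2) how much space is available for LP_DEAD? */
        int     by_space = (int) ((space - used) / sizeof(ItemIdData));

        return Min(by_count, by_space);
    }

With tuple_size = 50 and fillfactor = 80 this returns Min(170, 408) =
170, matching the example above.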

But I'd prefer to increase MaxHeapTuplesPerPage in a way that doesn't
affect the bitmap much, rather than introducing a complex formula.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: New IndexAM API controlling index vacuum strategies

Masahiko Sawada
In reply to this post by Peter Geoghegan-4
On Wed, Jan 20, 2021 at 7:58 AM Peter Geoghegan <[hidden email]> wrote:

>
> On Sun, Jan 17, 2021 at 9:18 PM Masahiko Sawada <[hidden email]> wrote:
> > After more thought, I think that ambulkdelete needs to be able to
> > refer to the answer from amvacuumstrategy. That way, the index can
> > skip bulk-deletion when lazy vacuum doesn't vacuum the heap and the
> > index doesn't want to do it either.
>
> Makes sense.
>
> BTW, your patch has bitrot already. Peter E's recent pageinspect
> commit happens to conflict with this patch. It might make sense to
> produce a new version that just fixes the bitrot, so that other people
> don't have to deal with it each time.
>
> > I’ve attached the updated version patch that includes the following changes:
>
> Looks good. I'll give this version a review now. I will do a lot more
> soon. I need to come up with a good benchmark for this, that I can
> return to again and again during review as needed.

Thank you for reviewing the patches.

>
> Some feedback on the first patch:
>
> * Just so you know: I agree with you about handling
> VACOPT_TERNARY_DISABLED in the index AM's amvacuumstrategy routine. I
> think that it's better to do that there, even though this choice may
> have some downsides.
>
> * Can you add some "stub" sgml doc changes for this? Doesn't have to
> be complete in any way. Just a placeholder for later, that has the
> correct general "shape" to orientate the reader of the patch. It can
> just be a FIXME comment, plus basic mechanical stuff -- details of the
> new amvacuumstrategy_function routine and its signature.
>

The 0002 patch had the doc update (I mistakenly included it in the
0002 patch). Is that update what you meant?

> Some feedback on the second patch:
>
> * Why do you move around IndexVacuumStrategy in the second patch?
> Looks like a rebasing oversight.

Check.

>
> * Actually, do we really need the first and second patches to be
> separate patches? I agree that the nbtree patch should be a separate
> patch, but dividing the first two sets of changes doesn't seem like it
> adds much. Did I miss some something?

I separated the patches since I used to have separate patches when
adding other index AM options required by parallel vacuum. But I
agreed to merge the first two patches.

>
> * Is the "MAXALIGN(sizeof(ItemIdData)))" change in the definition of
> MaxHeapTuplesPerPage appropriate? Here is the relevant section from
> the patch:
>
> diff --git a/src/include/access/htup_details.h
> b/src/include/access/htup_details.h
> index 7c62852e7f..038e7cd580 100644
> --- a/src/include/access/htup_details.h
> +++ b/src/include/access/htup_details.h
> @@ -563,17 +563,18 @@ do { \
>  /*
>   * MaxHeapTuplesPerPage is an upper bound on the number of tuples that can
>   * fit on one heap page.  (Note that indexes could have more, because they
> - * use a smaller tuple header.)  We arrive at the divisor because each tuple
> - * must be maxaligned, and it must have an associated line pointer.
> + * use a smaller tuple header.)  We arrive at the divisor because each line
> + * pointer must be maxaligned.
> *** SNIP ***
>  #define MaxHeapTuplesPerPage    \
> -    ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
> -            (MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData))))
> +    ((int) ((BLCKSZ - SizeOfPageHeaderData) / (MAXALIGN(sizeof(ItemIdData)))))
>
> It's true that ItemIdData structs (line pointers) are aligned, but
> they're not MAXALIGN()'d. If they were then the on-disk size of line
> pointers would generally be 8 bytes, not 4 bytes.

You're right. Will fix.

>
> * Maybe it would be better if you just changed the definition such
> that "MAXALIGN(SizeofHeapTupleHeader)" became "MAXIMUM_ALIGNOF", with
> no other changes? (Some variant of this suggestion might be better,
> not sure.)
>
> For some reason that feels a bit safer: we still have an "imaginary
> tuple header", but it's just 1 MAXALIGN() quantum now. This is still
> much less than the current 3 MAXALIGN() quantums (i.e. what
> MaxHeapTuplesPerPage treats as the tuple header size). Do you think
> that this alternative approach will be noticeably less effective
> within vacuumlazy.c?
>
> Note that you probably understand the issue with MaxHeapTuplesPerPage
> for vacuumlazy.c better than I do currently. I'm still trying to
> understand your choices, and to understand what is really important
> here.

Yeah, using MAXIMUM_ALIGNOF seems better for safety. I shared my
thoughts on the issue with MaxHeapTuplesPerPage yesterday. I think we
need to discuss how to deal with that.

>
> * Maybe add a #define for the value 0.7? (I refer to the value used in
> choose_vacuum_strategy() to calculate a "this is the number of LP_DEAD
> line pointers that we consider too many" cut off point, which is to be
> applied throughout lazy_scan_heap() processing.)
>

Agreed.

> * I notice that your new lazy_vacuum_table_and_indexes() function is
> the only place that calls lazy_vacuum_table_and_indexes(). I think
> that you should merge them together -- replace the only remaining call
> to lazy_vacuum_table_and_indexes() with the body of the function
> itself. Having a separate lazy_vacuum_table_and_indexes() function
> doesn't seem useful to me -- it doesn't actually hide complexity, and
> might even be harder to maintain.

lazy_vacuum_table_and_indexes() is called from two places: after
maintenance_work_mem runs out (around L1097) and at the end of
lazy_scan_heap() (around L1726). I defined this function to pack
together the operations of choosing a strategy, vacuuming indexes, and
vacuuming the heap. Without this function, wouldn't we end up writing
the same code twice there?

>
> * I suggest thinking about what the last item will mean for the
> reporting that currently takes place in
> lazy_vacuum_table_and_indexes(), but will now go in an expanded
> lazy_vacuum_table_and_indexes() -- how do we count the total number of
> index scans now?
>
> I don't actually believe that the logic needs to change, but some kind
> of consolidation and streamlining seems like it might be helpful.
> Maybe just a comment that says "note that all index scans might just
> be no-ops because..." -- stuff like that.

What do you mean by the last item, and which reporting? I think
lazy_vacuum_table_and_indexes() itself doesn't report anything, and
vacrelstats->num_index_scans counts the total number of index scans.

>
> * Any idea about how hard it will be to teach is_wraparound VACUUMs to
> skip index cleanup automatically, based on some practical/sensible
> criteria?

One simple idea would be to have an anti-wraparound autovacuum worker
disable index cleanup (i.e., set index_cleanup to
VACOPT_TERNARY_DISABLED). But a downside (though not a common case) is
that since an anti-wraparound vacuum is not necessarily an aggressive
vacuum, it could skip index cleanup even though it cannot move
relfrozenxid forward.
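A minimal sketch of that idea, assuming it hooks in where VACUUM
resolves its options -- the placement is a guess, not the actual 0003
patch:

    /* hypothetical: force off index cleanup for anti-wraparound autovacuums */
    if (params->is_wraparound &&
        params->index_cleanup == VACOPT_TERNARY_DEFAULT)
        params->index_cleanup = VACOPT_TERNARY_DISABLED;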

>
> It would be nice to have a basic PoC for that, even if it remains a
> PoC for the foreseeable future (i.e. even if it cannot be committed to
> Postgres 14). This feature should definitely be something that your
> patch series *enables*. I'd feel good about having covered that
> question as part of this basic design work if there was a PoC. That
> alone should make it 100% clear that it's easy to do (or no harder
> than it should be -- it should ideally be compatible with your basic
> design). The exact criteria that we use for deciding whether or not to
> skip index cleanup (which probably should not just be "this VACUUM is
> is_wraparound, good enough" in the final version) may need to be
> debated at length on pgsql-hackers. Even still, it is "just a detail"
> in the code. Whereas being *able* to do that with your design (now or
> in the future) seems essential now.

Agreed. I'll write a PoC patch for that.

>
> > * Store the answers to amvacuumstrategy into either the local memory
> > or DSM (in parallel vacuum case) so that ambulkdelete can refer the
> > answer to amvacuumstrategy.
> > * Fix regression failures.
> > * Update the documentation and commments.
> >
> > Note that 0003 patch is still PoC quality, lacking the btree meta page
> > version upgrade.
>
> This patch is not the hard part, of course -- there really isn't that
> much needed here compared to vacuumlazy.c. So this patch seems like
> the simplest 1 out of the 3 (at least to me).
>
> Some feedback on the third patch:
>
> * The new btm_last_deletion_nblocks metapage field should use P_NONE
> (which is 0) to indicate never having been vacuumed -- not
> InvalidBlockNumber (which is 0xFFFFFFFF).
>
> This is more idiomatic in nbtree, which is nice, but it has a very
> significant practical advantage: it ensures that every heapkeyspace
> nbtree index (i.e. those on recent nbtree versions) can be treated as
> if it has the new btm_last_deletion_nblocks field all along, even when
> it actually built on Postgres 12 or 13. This trick will let you avoid
> dealing with the headache of bumping BTREE_VERSION, which is a huge
> advantage.
>
> Note that this is the same trick I used to avoid bumping BTREE_VERSION
> when the btm_allequalimage field needed to be added (for the nbtree
> deduplication feature added to Postgres 13).
>

That's a nice trick with a great advantage. I'll use P_NONE.

> * Forgot to do this in the third patch (think I made this same mistake
> once myself):
>
> diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
> index 8bb180bbbe..88dfea9af3 100644
> --- a/contrib/pageinspect/btreefuncs.c
> +++ b/contrib/pageinspect/btreefuncs.c
> @@ -653,7 +653,7 @@ bt_metap(PG_FUNCTION_ARGS)
>      BTMetaPageData *metad;
>      TupleDesc   tupleDesc;
>      int         j;
> -    char       *values[9];
> +    char       *values[10];
>      Buffer      buffer;
>      Page        page;
>      HeapTuple   tuple;
> @@ -734,6 +734,11 @@ bt_metap(PG_FUNCTION_ARGS)

Check.

I'm updating and testing the patch. I'll submit the updated version
patches tomorrow.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: New IndexAM API controlling index vacuum strategies

Peter Geoghegan-4
In reply to this post by Peter Geoghegan-4
On Tue, Jan 19, 2021 at 2:57 PM Peter Geoghegan <[hidden email]> wrote:
> Looks good. I'll give this version a review now. I will do a lot more
> soon. I need to come up with a good benchmark for this, that I can
> return to again and again during review as needed.

I performed another benchmark, similar to the last one but with the
latest version (v2), and over a much longer period. Attached is a
summary of the whole benchmark, and log_autovacuum output from the
logs of both the master branch and the patch.

This was pgbench scale 2000, 4 indexes on pgbench_accounts, and a
transaction with one update and two selects. Each run was 4 hours; we
alternated between patch and master for each run, and between 16 and
32 clients. There were eight 4-hour runs in total, meaning the entire
set of runs took 8 * 4 hours = 32 hours (not including
initial load time and a few other small things like that). I used a
10k TPS rate limit, so TPS isn't interesting here. Latency is
interesting -- we see a nice improvement in latency (i.e. a reduction)
for the patch (see all.summary.out).

The benefits of the patch are clearly visible when I drill down and
look at the details. Each pgbench_accounts autovacuum VACUUM operation
can finish faster with the patch because they can often skip at least
some indexes (usually the PK, sometimes 3 out of 4 indexes total). But
it's more subtle than some might assume. We're skipping indexes that
VACUUM actually would have deleted *some* index tuples from, which is
very good. Bottom-up index deletion is usually lazy, and only
occasionally very eager, so you still have plenty of "floating
garbage" index tuples in most pages. And now we see VACUUM behave a
little more like bottom-up index deletion -- it is lazy when that is
appropriate (with indexes that really only have floating garbage that
is spread diffusely throughout the index structure), and eager when
that is appropriate (with indexes that get much more garbage).

The benefit is not really that we're avoiding doing I/O for index
vacuuming (though that is one of the smaller benefits here). The real
benefit is that VACUUM is not dirtying pages, since it skips indexes
when it would be "premature" to vacuum them from an efficiency point
of view. This is important because we know that Postgres throughput is
very often limited by page cleaning. Also, the "economics" of this new
behavior make perfect sense -- obviously it's more efficient to delay
garbage cleanup until the point when the same page will be modified by
a backend anyway -- in the case of this benchmark via bottom-up index
deletion (which deletes all garbage tuples in the leaf page at the
point that it runs for a subset of pointed-to heap pages -- it's not
using an oldestXmin cutoff from 30 minutes ago). So whenever we dirty
a page, we now get more value per additional-page-dirtied.

I believe that controlling the number of pages dirtied by VACUUM is
usually much more important than reducing the amount of read I/O from
VACUUM, for reasons I go into on the recent "vacuum_cost_page_miss
default value and modern hardware" thread. As a further consequence of
all this, VACUUM can go faster safely and sustainably (since the cost
limit is not affected so much by vacuum_cost_page_miss), which has its
own benefits (e.g. oldestXmin cutoff doesn't get so old towards the
end).

Another closely related huge improvement that we see here is that the
number of FPIs generated by VACUUM can be significantly reduced. This
cost is closely related to the cost of dirtying pages, but it's worth
mentioning separately. You'll see some of that in the log_autovacuum
log output I attached.

There is an archive with much more detailed information, including
dumps from most pg_stat_* views at key intervals. This has way more
information than anybody is likely to want:

https://drive.google.com/file/d/1OTiErELKRZmYnuJuczO2Tfcm1-cBYITd/view?usp=sharing

I did notice a problem, though. I now think that the criterion for
skipping an index vacuum in the third patch of the series is too
conservative, and that this led to an excessive number of index
vacuums with the patch. This is probably because there was a tiny
number of page splits in some of the indexes that were not really
supposed to grow. I believe that this is caused by ANALYZE running --
I think that it prevented bottom-up deletion from keeping a few of the
hottest pages from splitting (that can take 5 or 6 seconds) at a few
points over the 32 hour run. For example, the index named "tenner"
grew by 9 blocks, starting out at 230,701 and ending up at 230,710 (to
see this, extract the files from the archive and "diff
patch.r1c16.initial_pg_relation_size.out
patch.r2c32.after_pg_relation_size.out").

I now think that 0 blocks added is unnecessarily restrictive -- a
small tolerance still seems like a good idea, though (let's still be
somewhat conservative about it).

Maybe a better criterion would be for nbtree to always proceed with
index vacuuming when the index size is less than 2048 blocks (16MiB
with 8KiB BLCKSZ). If an index is larger than that, then compare the
last/old block count to the current block count (at the point that we
decide if index vacuuming is going to go ahead) by rounding up both
values to the next highest 2048 block increment. This formula is
pretty arbitrary, but probably works as well as most others. It's a
good iteration for the next version of the patch/further testing, at
least.
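A sketch of that formula, with hypothetical names (just to pin down the
rounding, not actual patch code):

    #define BULKDELETE_QUANTUM  2048    /* 16MiB worth of blocks at 8KiB BLCKSZ */

    static bool
    btvacuum_needs_bulkdelete(BlockNumber prev_nblocks, BlockNumber cur_nblocks)
    {
        BlockNumber prev_bucket,
                    cur_bucket;

        /* small indexes: always proceed with index vacuuming */
        if (cur_nblocks < BULKDELETE_QUANTUM)
            return true;

        /* round both block counts up to the next 2048-block increment */
        prev_bucket = (prev_nblocks + BULKDELETE_QUANTUM - 1) / BULKDELETE_QUANTUM;
        cur_bucket = (cur_nblocks + BULKDELETE_QUANTUM - 1) / BULKDELETE_QUANTUM;

        return cur_bucket > prev_bucket;
    }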

BTW, it would be nice if there was more instrumentation, say in the
log output produced when log_autovacuum is on. That would make it
easier to run these benchmarks -- I could verify my understanding of
the work done for each particular av operation represented in the log.
Though the default log_autovacuum log output is quite informative, it
would be nice if the specifics were more obvious (maybe this could
just be for the review/testing, but it might become something for
users if it seems useful).

--
Peter Geoghegan

patch_vacuum_logs.txt (5K)
all.summary.out (3K)
master_vacuum_logs.txt (3K)

Re: New IndexAM API controlling index vacuum strategies

Masahiko Sawada
In reply to this post by Masahiko Sawada
On Thu, Jan 21, 2021 at 11:24 PM Masahiko Sawada <[hidden email]> wrote:

>
> *** SNIP ***
>
> I'm updating and testing the patch. I'll submit the updated version
> patches tomorrow.
>
Sorry for the delay.

I've attached the updated version patch that incorporates the comments
I got so far.

I merged the previous 0001 and 0002 patches. The 0003 patch is now
another PoC patch that disables index cleanup automatically when an
autovacuum is both an anti-xid-wraparound vacuum and an aggressive
vacuum.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

v3-0002-Skip-btree-bulkdelete-if-the-index-doesn-t-grow.patch (13K)
v3-0003-PoC-disable-index-cleanup-when-an-anti-wraparound.patch (1K)
v3-0001-Choose-vacuum-strategy-before-heap-and-index-vacu.patch (92K)