Hi all,
I've started this separate thread from [1] for discussing the general API design of index vacuum.

Summary:

* Call ambulkdelete and amvacuumcleanup even when INDEX_CLEANUP is false, and leave it to the index AM whether or not to skip them.
* Add a new index AM API, amvacuumstrategy(), which asks the index AM for its strategy before calling ambulkdelete.
* Whether or not we remove garbage tuples from the heap depends on multiple factors, including the INDEX_CLEANUP option and the answers of amvacuumstrategy() from each index AM.

The first point fixes the inappropriate behavior discussed on the thread[1]. The second and third points introduce a general framework for future extensibility. User-visible behavior is not changed by this change.

The new index AM API, amvacuumstrategy(), is called before bulkdelete() for each index and asks the index for its bulk-deletion strategy. Through this API, lazy vacuum asks, "Hey index X, I collected garbage heap tuples during heap scanning, how urgent is vacuuming for you?", and the index answers either "it's urgent" when it wants to do bulk-deletion or "it's not urgent, I can skip it". The point of this proposal is to isolate heap vacuum and index vacuum for each index so that we can employ different strategies for each index. Lazy vacuum can decide whether or not to do the heap clean based on the answers from the indexes. By default, if all indexes answer 'yes' (meaning they will do bulkdelete()), lazy vacuum can do the heap clean. On the other hand, if even one index answers 'no' (meaning it will not do bulkdelete()), lazy vacuum doesn't do the heap clean. Lazy vacuum would also be able to require indexes to do bulkdelete() for some reason, such as the user specifying the INDEX_CLEANUP option. It's something like saying "Hey index X, you answered not to do bulkdelete(), but since the heap clean is necessary for me, please don't skip bulkdelete()".

Currently, if the INDEX_CLEANUP option is not set (i.e. VACOPT_TERNARY_DEFAULT in the code), it's treated as true and we will do the heap clean. But with this patch we use the default as a neutral state ('smart' mode). This neutral state could become "on" or "off" depending on several factors, including the answers of amvacuumstrategy(), the table status, and the user's request. In this context, specifying INDEX_CLEANUP would mean forcing the neutral state to "on" or "off" by the user's request. The table status that could influence the decision could concretely be, for instance:

* Removing LP_DEAD accumulation due to skipping bulkdelete() for a long time.
* Making pages all-visible for index-only scan.

Also there are potential enhancements using this API:

* If the bottom-up index deletion feature[2] is introduced, individual indexes could be in different situations in terms of dead tuple accumulation; some indexes on the table can delete their garbage index tuples without bulkdelete(). The problem is that doing bulkdelete() for such indexes would not be efficient. This problem is solved by this proposal because we can do bulkdelete() for just a subset of the indexes on the table.

* If the retail index deletion feature[3] is introduced, we can make the return value of amvacuumstrategy() a ternary value: "do_bulkdelete", "do_indexscandelete", and "no".

* We probably can introduce a threshold on the number of dead tuples to control whether or not to do index tuple bulk-deletion (like a bulkdelete() version of vacuum_cleanup_index_scale_factor).
In the case where the amount of dead tuples is only slightly larger than maintenance_work_mem, the second call to bulkdelete() will be made with a small number of dead tuples, which is inefficient. This problem is also solved by this proposal by allowing a subset of indexes to skip bulkdelete() if the number of dead tuples doesn't exceed the threshold.

I've attached the PoC patch for the above idea. By default, lazy vacuum chooses the bulkdelete strategy based on the answers of amvacuumstrategy(), so it can be either true or false (although it's always true in the current patch). But for amvacuumcleanup() there is no neutral state; lazy vacuum treats the default as true.

Comments and feedback are very welcome.

Regards,

[1] https://www.postgresql.org/message-id/20200415233848.saqp72pcjv2y6ryi%40alap3.anarazel.de
[2] https://www.postgresql.org/message-id/CAH2-Wzm%2BmaE3apHB8NOtmM%3Dp-DO65j2V5GzAWCOEEuy3JZgb2g%40mail.gmail.com
[3] https://www.postgresql.org/message-id/425db134-8bba-005c-b59d-56e50de3b41e%40postgrespro.ru

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/
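[To make the proposed control flow concrete, here is a rough sketch of how the new callback and the heap-clean decision could fit together. The INDEX_VACUUM_STRATEGY_* values appear in the attached PoC patch and the VACOPT_TERNARY_* values exist in PostgreSQL, but decide_heap_clean() and the callback typedef below are hypothetical names used only for illustration, and the sketch assumes the usual PostgreSQL headers such as access/genam.h and commands/vacuum.h.]

typedef enum IndexVacuumStrategy
{
    INDEX_VACUUM_STRATEGY_NONE,         /* index would prefer to skip ambulkdelete() */
    INDEX_VACUUM_STRATEGY_BULKDELETE    /* index wants a full ambulkdelete() pass */
} IndexVacuumStrategy;

/* per-AM callback, called once per index before any ambulkdelete() call */
typedef IndexVacuumStrategy (*amvacuumstrategy_function) (IndexVacuumInfo *info);

/*
 * One possible way to combine the answers: do the heap clean only if every
 * index is willing to bulk-delete, unless the user forced INDEX_CLEANUP on
 * or off.
 */
static bool
decide_heap_clean(IndexVacuumStrategy *answers, int nindexes,
                  VacOptTernaryValue index_cleanup)
{
    if (index_cleanup != VACOPT_TERNARY_DEFAULT)
        return (index_cleanup == VACOPT_TERNARY_ENABLED);

    for (int i = 0; i < nindexes; i++)
    {
        /* one "not urgent" answer is enough to skip the heap clean */
        if (answers[i] == INDEX_VACUUM_STRATEGY_NONE)
            return false;
    }

    return true;
}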
On Tue, Dec 22, 2020 at 2:54 PM Masahiko Sawada <[hidden email]> wrote:
> I've started this separate thread from [1] for discussing the general > API design of index vacuum. This is a very difficult and very important problem. Clearly defining the problem is probably the hardest part. This prototype patch seems like a good start, though. Private discussion between Masahiko and myself led to a shared understanding of what the best *general* direction is for VACUUM now. It is necessary to deal with several problems all at once here, and to at least think about several more problems that will need to be solved later. If anybody reading the thread initially finds it hard to see the connection between the specific items that Masahiko has introduced, they should note that that's *expected*. > Summary: > > * Call ambulkdelete and amvacuumcleanup even when INDEX_CLEANUP is > false, and leave it to the index AM whether or not skip them. Makes sense. I like the way you unify INDEX_CLEANUP and the vacuum_cleanup_index_scale_factor stuff in a way that is now quite explicit and obvious in the code. > The second and third points are to introduce a general framework for > future extensibility. User-visible behavior is not changed by this > change. In some ways the ideas in your patch might be considered radical, or at least novel: they introduce the idea that bloat can be a qualitative thing. But at the same time the design is quite conservative: these are fairly isolated changes, at least code-wise. I am not 100% sure that this approach will be successful in vacuumlazy.c, in the end (I'm ~95% sure). But I am 100% sure that our collective understanding of the problems in this area will be significantly improved by this effort. A fundamental rethink does not necessarily require a fundamental redesign, and yet it might be just as effective. This is certainly what I see when testing my bottom-up index deletion patch, which adds an incremental index deletion mechanism that merely intervenes in a precise, isolated way. Despite my patch's simplicity, it manages to practically eliminate an entire important *class* of index bloat (at least once you make certain mild assumptions about the duration of snapshots). Sometimes it is possible to solve a hard problem by thinking about it only *slightly* differently. This is a tantalizing possibility for VACUUM, too. I'm willing to risk sounding grandiose if that's what it takes to get more hackers interested in these questions. With that in mind, here is a summary of the high level hypothesis behind this VACUUM patch: VACUUM can and should be reimagined as a top-down mechanism that complements various bottom-up mechanisms (including the stuff from my deletion patch, heap pruning, and possibly an enhanced version of heap pruning based on similar principles). This will be possible without changing any of the fundamental invariants of the current vacuumlazy.c design. VACUUM's problems are largely pathological behaviors of one kind or another, that can be fixed with specific well-targeted interventions. Workload characteristics can naturally determine how much of the cleanup is done by VACUUM itself -- large variations are possible within a single database, and even across indexes on the same table. > The new index AM API, amvacuumstrategy(), which is called before > bulkdelete() for each index and asks the index bulk-deletion strategy. 
> On this API, lazy vacuum asks, "Hey index X, I collected garbage heap > tuples during heap scanning, how urgent is vacuuming for you?", and > the index answers either "it's urgent" when it wants to do > bulk-deletion or "it's not urgent, I can skip it". The point of this > proposal is to isolate heap vacuum and index vacuum for each index so > that we can employ different strategies for each index. Lazy vacuum > can decide whether or not to do heap clean based on the answers from > the indexes. Right -- workload characteristics (plus appropriate optimizations at the local level) make it possible that amvacuumstrategy() will give *very* different answers from different indexes *on the same table*. The idea that all indexes on the table are more or less equally bloated at any given point in time is mostly wrong. Actually, *sometimes* it really is correct! But other times it is *dramatically wrong* -- it all depends on workload characteristics. What is likely to be true *on average* across all tables/indexes is *irrelevant* (the mean/average is simply not a useful concept, in fact). The basic lazy vacuum design needs to recognize this important difference, and other similar issues. That's the point of amvacuumstrategy(). > Currently, if INDEX_CLEANUP option is not set (i.g. > VACOPT_TERNARY_DEFAULT in the code), it's treated as true and will do > heap clean. But with this patch we use the default as a neutral state > ('smart' mode). This neutral state could be "on" and "off" depending > on several factors including the answers of amvacuumstrategy(), the > table status, and user's request. In this context, specifying > INDEX_CLEANUP would mean making the neutral state "on" or "off" by > user's request. The table status that could influence the decision > could concretely be, for instance: > > * Removing LP_DEAD accumulation due to skipping bulkdelete() for a long time. > * Making pages all-visible for index-only scan. So you have several different kinds of back pressure - 'smart' mode really is smart. > Also there are potential enhancements using this API: > * If retail index deletion feature[3] is introduced, we can make the > return value of bulkvacuumstrategy() a ternary value: "do_bulkdelete", > "do_indexscandelete", and "no". Makes sense. > * We probably can introduce a threshold of the number of dead tuples > to control whether or not to do index tuple bulk-deletion (like > bulkdelete() version of vacuum_cleanup_index_scale_factor). In the > case where the amount of dead tuples is slightly larger than > maitenance_work_mem the second time calling to bulkdelete will be > called with a small number of dead tuples, which is inefficient. This > problem is also solved by this proposal by allowing a subset of > indexes to skip bulkdelete() if the number of dead tuple doesn't > exceed the threshold. Good idea. I bet other people can come up with other ideas a little like this just by thinking about it. The "untangling" performed by your patch creates many possibilities > I’ve attached the PoC patch for the above idea. By default, since lazy > vacuum choose the vacuum bulkdelete strategy based on answers of > amvacuumstrategy() so it can be either true or false ( although it’s > always true in the currene patch). But for amvacuumcleanup() there is > no the neutral state, lazy vacuum treats the default as true. 
As you said, the next question must be: How do we teach lazy vacuum to not do what gets requested by amvacuumcleanup() when it cannot respect the wishes of one individual index, for example when the accumulation of LP_DEAD items in the heap becomes a big problem in itself? That really could be the thing that forces full heap vacuuming, even with several indexes.

I will need to experiment in order to improve my understanding of how to make this cooperate with bottom-up index deletion. But that's mostly just a question for my patch (and a relatively easy one).

--
Peter Geoghegan
On Thu, Dec 24, 2020 at 12:59 PM Peter Geoghegan <[hidden email]> wrote:
> > On Tue, Dec 22, 2020 at 2:54 PM Masahiko Sawada <[hidden email]> wrote: > > I've started this separate thread from [1] for discussing the general > > API design of index vacuum. > > This is a very difficult and very important problem. Clearly defining > the problem is probably the hardest part. This prototype patch seems > like a good start, though. > > Private discussion between Masahiko and myself led to a shared > understanding of what the best *general* direction is for VACUUM now. > It is necessary to deal with several problems all at once here, and to > at least think about several more problems that will need to be solved > later. If anybody reading the thread initially finds it hard to see > the connection between the specific items that Masahiko has > introduced, they should note that that's *expected*. > > > Summary: > > > > * Call ambulkdelete and amvacuumcleanup even when INDEX_CLEANUP is > > false, and leave it to the index AM whether or not skip them. > > Makes sense. I like the way you unify INDEX_CLEANUP and the > vacuum_cleanup_index_scale_factor stuff in a way that is now quite > explicit and obvious in the code. > > > The second and third points are to introduce a general framework for > > future extensibility. User-visible behavior is not changed by this > > change. > > In some ways the ideas in your patch might be considered radical, or > at least novel: they introduce the idea that bloat can be a > qualitative thing. But at the same time the design is quite > conservative: these are fairly isolated changes, at least code-wise. I > am not 100% sure that this approach will be successful in > vacuumlazy.c, in the end (I'm ~95% sure). But I am 100% sure that our > collective understanding of the problems in this area will be > significantly improved by this effort. A fundamental rethink does not > necessarily require a fundamental redesign, and yet it might be just > as effective. > > This is certainly what I see when testing my bottom-up index deletion > patch, which adds an incremental index deletion mechanism that merely > intervenes in a precise, isolated way. Despite my patch's simplicity, > it manages to practically eliminate an entire important *class* of > index bloat (at least once you make certain mild assumptions about the > duration of snapshots). Sometimes it is possible to solve a hard > problem by thinking about it only *slightly* differently. > > This is a tantalizing possibility for VACUUM, too. I'm willing to risk > sounding grandiose if that's what it takes to get more hackers > interested in these questions. With that in mind, here is a summary of > the high level hypothesis behind this VACUUM patch: > > VACUUM can and should be reimagined as a top-down mechanism that > complements various bottom-up mechanisms (including the stuff from my > deletion patch, heap pruning, and possibly an enhanced version of heap > pruning based on similar principles). This will be possible without > changing any of the fundamental invariants of the current vacuumlazy.c > design. VACUUM's problems are largely pathological behaviors of one > kind or another, that can be fixed with specific well-targeted > interventions. Workload characteristics can naturally determine how > much of the cleanup is done by VACUUM itself -- large variations are > possible within a single database, and even across indexes on the same > table. Agreed. Ideally, the bottom-up mechanism works well and reclaim almost all garbage. 
VACUUM should be a feature that complements these works if the bottom-up mechanism cannot work well for some reason, and also is used to make sure that all collected garbage has been vacuumed. For heaps, we already have such a mechanism: opportunistically hot-pruning and lazy vacuum. For indexes especially btree indexes, the bottom-up index deletion and ambulkdelete() would have a similar relationship. > > > The new index AM API, amvacuumstrategy(), which is called before > > bulkdelete() for each index and asks the index bulk-deletion strategy. > > On this API, lazy vacuum asks, "Hey index X, I collected garbage heap > > tuples during heap scanning, how urgent is vacuuming for you?", and > > the index answers either "it's urgent" when it wants to do > > bulk-deletion or "it's not urgent, I can skip it". The point of this > > proposal is to isolate heap vacuum and index vacuum for each index so > > that we can employ different strategies for each index. Lazy vacuum > > can decide whether or not to do heap clean based on the answers from > > the indexes. > > Right -- workload characteristics (plus appropriate optimizations at > the local level) make it possible that amvacuumstrategy() will give > *very* different answers from different indexes *on the same table*. > The idea that all indexes on the table are more or less equally > bloated at any given point in time is mostly wrong. Actually, > *sometimes* it really is correct! But other times it is *dramatically > wrong* -- it all depends on workload characteristics. What is likely > to be true *on average* across all tables/indexes is *irrelevant* (the > mean/average is simply not a useful concept, in fact). > > The basic lazy vacuum design needs to recognize this important > difference, and other similar issues. That's the point of > amvacuumstrategy(). Agreed. In terms of bloat, the characteristics of index AM also bring such differences (e.g., btree vs. brin). With the bottom-up index deletion feature, even btree indexes on the same table will also different to each other. > > > Currently, if INDEX_CLEANUP option is not set (i.g. > > VACOPT_TERNARY_DEFAULT in the code), it's treated as true and will do > > heap clean. But with this patch we use the default as a neutral state > > ('smart' mode). This neutral state could be "on" and "off" depending > > on several factors including the answers of amvacuumstrategy(), the > > table status, and user's request. In this context, specifying > > INDEX_CLEANUP would mean making the neutral state "on" or "off" by > > user's request. The table status that could influence the decision > > could concretely be, for instance: > > > > * Removing LP_DEAD accumulation due to skipping bulkdelete() for a long time. > > * Making pages all-visible for index-only scan. > > So you have several different kinds of back pressure - 'smart' mode > really is smart. > > > Also there are potential enhancements using this API: > > > * If retail index deletion feature[3] is introduced, we can make the > > return value of bulkvacuumstrategy() a ternary value: "do_bulkdelete", > > "do_indexscandelete", and "no". > > Makes sense. > > > * We probably can introduce a threshold of the number of dead tuples > > to control whether or not to do index tuple bulk-deletion (like > > bulkdelete() version of vacuum_cleanup_index_scale_factor). 
In the > > case where the amount of dead tuples is slightly larger than > > maitenance_work_mem the second time calling to bulkdelete will be > > called with a small number of dead tuples, which is inefficient. This > > problem is also solved by this proposal by allowing a subset of > > indexes to skip bulkdelete() if the number of dead tuple doesn't > > exceed the threshold. > > Good idea. I bet other people can come up with other ideas a little > like this just by thinking about it. The "untangling" performed by > your patch creates many possibilities > > > I’ve attached the PoC patch for the above idea. By default, since lazy > > vacuum choose the vacuum bulkdelete strategy based on answers of > > amvacuumstrategy() so it can be either true or false ( although it’s > > always true in the currene patch). But for amvacuumcleanup() there is > > no the neutral state, lazy vacuum treats the default as true. > > As you said, the next question must be: How do we teach lazy vacuum to > not do what gets requested by amvacuumcleanup() when it cannot respect > the wishes of one individual indexes, for example when the > accumulation of LP_DEAD items in the heap becomes a big problem in > itself? That really could be the thing that forces full heap > vacuuming, even with several indexes.

You mean requested by amvacuumstrategy(), not by amvacuumcleanup()? I think amvacuumstrategy() affects only ambulkdelete(). But when all ambulkdelete() calls are skipped at the request of the index AMs, we might want to skip amvacuumcleanup() as well.

> > I will need to experiment in order to improve my understanding of how > to make this cooperate with bottom-up index deletion. But that's > mostly just a question for my patch (and a relatively easy one).

Yeah, I think we might need something like statistics about garbage per index so that each individual index can make a different decision based on its status. For example, a btree index might want to skip ambulkdelete() if it has only a few dead index tuples in its leaf pages. It could be in the stats collector or in btree's meta page.

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/
On Sun, Dec 27, 2020 at 10:55 PM Masahiko Sawada <[hidden email]> wrote:
> > As you said, the next question must be: How do we teach lazy vacuum to > > not do what gets requested by amvacuumcleanup() when it cannot respect > > the wishes of one individual indexes, for example when the > > accumulation of LP_DEAD items in the heap becomes a big problem in > > itself? That really could be the thing that forces full heap > > vacuuming, even with several indexes. > > You mean requested by amvacuumstreategy(), not by amvacuumcleanup()? I > think amvacuumstrategy() affects only ambulkdelete(). But when all > ambulkdelete() were skipped by the requests by index AMs we might want > to skip amvacuumcleanup() as well. No, I was asking about how we should decide to do a real VACUUM even (a real ambulkdelete() call) when no index asks for it because bottom-up deletion works very well in every index. Clearly we will need to eventually remove remaining LP_DEAD items from the heap at some point if nothing else happens -- eventually LP_DEAD items in the heap alone will force a traditional heap vacuum (which will still have to go through indexes that have not grown, just to be safe/avoid recycling a TID that's still in the index). Postgres heap fillfactor is 100 by default, though I believe it's 90 in another well known DB system. If you set Postgres heap fill factor to 90 you can fit a little over 200 LP_DEAD items in the "extra space" left behind in each heap page after initial bulk loading/INSERTs take place that respect our lower fill factor setting. This is about 4x the number of initial heap tuples in the pgbench_accounts table -- it's quite a lot! If we pessimistically assume that all updates are non-HOT updates, we'll still usually have enough space for each logical row to get updated several times before the heap page "overflows". Even when there is significant skew in the UPDATEs, the skew is not noticeable at the level of individual heap pages. We have a surprisingly large general capacity to temporarily "absorb" extra garbage LP_DEAD items in heap pages this way. Nobody really cared about this extra capacity very much before now, because it did not help with the big problem of index bloat that you naturally see with this workload. But that big problem may go away soon, and so this extra capacity may become important at the same time. I think that it could make sense for lazy_scan_heap() to maintain statistics about the number of LP_DEAD items remaining in each heap page (just local stack variables). From there, it can pass the statistics to the choose_vacuum_strategy() function from your patch. Perhaps choose_vacuum_strategy() will notice that the heap page with the most LP_DEAD items encountered within lazy_scan_heap() (among those encountered so far in the event of multiple index passes) has too many LP_DEAD items -- this indicates that there is a danger that some heap pages will start to "overflow" soon, which is now a problem that lazy_scan_heap() must think about. Maybe if the "extra space" left by applying heap fill factor (with settings below 100) is insufficient to fit perhaps 2/3 of the LP_DEAD items needed on the heap page that has the most LP_DEAD items (among all heap pages), we stop caring about what amvacuumstrategy()/the indexes say. So we do the right thing for the heap pages, while still mostly avoiding index vacuuming and the final heap pass. I experimented with this today, and I think that it is a good way to do it. 
I like the idea of choose_vacuum_strategy() understanding that heap pages that are subject to many non-HOT updates have a "natural extra capacity for LP_DEAD items" that it must care about directly (at least with non-default heap fill factor settings). My early testing shows that it will often take a surprisingly long time for the most heavily updated heap page to have more than about 100 LP_DEAD items.

> > I will need to experiment in order to improve my understanding of how > > to make this cooperate with bottom-up index deletion. But that's > > mostly just a question for my patch (and a relatively easy one). > > Yeah, I think we might need something like statistics about garbage > per index so that individual index can make a different decision based > on their status. For example, a btree index might want to skip > ambulkdelete() if it has a few dead index tuples in its leaf pages. It > could be on stats collector or on btree's meta page.

Right. I think that even a very conservative approach could work well. For example, maybe we teach nbtree's amvacuumstrategy() routine to ask to do a real ambulkdelete(), except in the extreme case where the index is *exactly* the same size as it was after the last VACUUM. This will happen regularly with bottom-up index deletion. Maybe that approach is a bit too conservative, though.

--
Peter Geoghegan
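[A rough sketch of the heuristic described above, i.e. forcing heap vacuuming once the busiest heap page risks overflowing the slack left by a non-default fillfactor. The helper name and the exact capacity formula are assumptions for illustration; the actual patch may compute this differently.]

/*
 * Does any heap page risk running out of the "extra capacity" reserved by
 * heap fillfactor?  If so, stop trusting the indexes' "not urgent" answers.
 */
static bool
heap_pages_near_overflow(int max_lpdead_on_any_page, int heap_fillfactor)
{
    Size    slack_bytes;
    int     lpdead_capacity;

    if (heap_fillfactor >= 100)
        return true;            /* no reserved slack to absorb LP_DEAD items */

    /* free space deliberately left on each page by fillfactor */
    slack_bytes = (Size) (BLCKSZ * (100 - heap_fillfactor) / 100);

    /* each LP_DEAD item only needs a line pointer */
    lpdead_capacity = (int) (slack_bytes / sizeof(ItemIdData));

    /* force heap vacuuming once ~2/3 of that capacity is already used */
    return max_lpdead_on_any_page > (lpdead_capacity * 2) / 3;
}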
On Sun, Dec 27, 2020 at 11:41 PM Peter Geoghegan <[hidden email]> wrote:
> I experimented with this today, and I think that it is a good way to > do it. I like the idea of choose_vacuum_strategy() understanding that > heap pages that are subject to many non-HOT updates have a "natural > extra capacity for LP_DEAD items" that it must care about directly (at > least with non-default heap fill factor settings). My early testing > shows that it will often take a surprisingly long time for the most > heavily updated heap page to have more than about 100 LP_DEAD items.

Attached is a rough patch showing what I did here. It was applied on top of my bottom-up index deletion patch series and your poc_vacuumstrategy.patch patch. This patch was written as a quick and dirty way of simulating what I thought would work best for bottom-up index deletion for one specific benchmark/test, which was non-hot-update heavy. This consists of a variant pgbench with several indexes on pgbench_accounts (almost the same as most other bottom-up deletion benchmarks I've been running). Only one index is "logically modified" by the updates, but of course we still physically modify all indexes on every update. I set fill factor to 90 for this benchmark, which is an important factor for how your VACUUM patch works during the benchmark.

This rough supplementary patch includes VACUUM logic that assumes (but doesn't check) that the table has heap fill factor set to 90 -- see my changes to choose_vacuum_strategy(). This benchmark is really about stability over time more than performance (though performance is also improved significantly). I wanted to keep both the table/heap and the logically unmodified indexes (i.e. 3 out of 4 indexes on pgbench_accounts) exactly the same size *forever*.

Does this make sense?

Anyway, with a 15k TPS limit on a pgbench scale 3000 DB, I see that pg_stat_database shows an almost ~28% reduction in blks_read after an overnight run for the patch series (it was 508,820,699 for the patches, 705,282,975 for the master branch). I think that the VACUUM component is responsible for some of that reduction. There were 11 VACUUMs for the patch, 7 of which did not call lazy_vacuum_heap() (these 7 VACUUM operations all only did a btbulkdelete() call for the one problematic index on the table, named "abalance_ruin", which my supplementary patch has hard-coded knowledge of).

--
Peter Geoghegan
On Mon, Dec 28, 2020 at 4:42 PM Peter Geoghegan <[hidden email]> wrote:
> > On Sun, Dec 27, 2020 at 10:55 PM Masahiko Sawada <[hidden email]> wrote: > > > As you said, the next question must be: How do we teach lazy vacuum to > > > not do what gets requested by amvacuumcleanup() when it cannot respect > > > the wishes of one individual indexes, for example when the > > > accumulation of LP_DEAD items in the heap becomes a big problem in > > > itself? That really could be the thing that forces full heap > > > vacuuming, even with several indexes. > > > > You mean requested by amvacuumstreategy(), not by amvacuumcleanup()? I > > think amvacuumstrategy() affects only ambulkdelete(). But when all > > ambulkdelete() were skipped by the requests by index AMs we might want > > to skip amvacuumcleanup() as well. > > No, I was asking about how we should decide to do a real VACUUM even > (a real ambulkdelete() call) when no index asks for it because > bottom-up deletion works very well in every index. Clearly we will > need to eventually remove remaining LP_DEAD items from the heap at > some point if nothing else happens -- eventually LP_DEAD items in the > heap alone will force a traditional heap vacuum (which will still have > to go through indexes that have not grown, just to be safe/avoid > recycling a TID that's still in the index). > > Postgres heap fillfactor is 100 by default, though I believe it's 90 > in another well known DB system. If you set Postgres heap fill factor > to 90 you can fit a little over 200 LP_DEAD items in the "extra space" > left behind in each heap page after initial bulk loading/INSERTs take > place that respect our lower fill factor setting. This is about 4x the > number of initial heap tuples in the pgbench_accounts table -- it's > quite a lot! > > If we pessimistically assume that all updates are non-HOT updates, > we'll still usually have enough space for each logical row to get > updated several times before the heap page "overflows". Even when > there is significant skew in the UPDATEs, the skew is not noticeable > at the level of individual heap pages. We have a surprisingly large > general capacity to temporarily "absorb" extra garbage LP_DEAD items > in heap pages this way. Nobody really cared about this extra capacity > very much before now, because it did not help with the big problem of > index bloat that you naturally see with this workload. But that big > problem may go away soon, and so this extra capacity may become > important at the same time. > > I think that it could make sense for lazy_scan_heap() to maintain > statistics about the number of LP_DEAD items remaining in each heap > page (just local stack variables). From there, it can pass the > statistics to the choose_vacuum_strategy() function from your patch. > Perhaps choose_vacuum_strategy() will notice that the heap page with > the most LP_DEAD items encountered within lazy_scan_heap() (among > those encountered so far in the event of multiple index passes) has > too many LP_DEAD items -- this indicates that there is a danger that > some heap pages will start to "overflow" soon, which is now a problem > that lazy_scan_heap() must think about. Maybe if the "extra space" > left by applying heap fill factor (with settings below 100) is > insufficient to fit perhaps 2/3 of the LP_DEAD items needed on the > heap page that has the most LP_DEAD items (among all heap pages), we > stop caring about what amvacuumstrategy()/the indexes say. So we do > the right thing for the heap pages, while still mostly avoiding index > vacuuming and the final heap pass. 
That makes sense. We can calculate how many LP_DEAD items a heap page can absorb based on the extra space left by applying the fill factor. Since there is a limit on the maximum number of line pointers in a heap page, we might need to consider that limit in the calculation as well.

From another point of view, given that the maximum number of heap tuples in one 8kB heap page (MaxHeapTuplesPerPage) is 291, I think how bad it is to store LP_DEAD items in a heap page varies depending on the tuple size. For example, suppose the tuple size is 200: we can store 40 tuples in one heap page if there is no LP_DEAD item at all. Even if there are 150 LP_DEAD items on the page, we are still able to store 37 tuples, because we can still have up to 141 line pointers, which is more than enough for the maximum number of heap tuples we could store with no LP_DEAD items, and we have (8192 - (4 * 150)) bytes of space to store tuples (with their line pointers). That is, having 150 LP_DEAD items ends up causing an overflow of only 3 tuples. On the other hand, suppose the tuple size is 40: we can store 204 tuples in one heap page if there is no LP_DEAD item at all. If there are 150 LP_DEAD items on the page, we are able to store only 141 tuples. That is, having 150 LP_DEAD items ends up causing an overflow of 63 tuples. I think the impact on table bloat from absorbing LP_DEAD items is larger in the latter case. The larger the tuple size, the more LP_DEAD items can be absorbed in a heap page with less bad effect.

Considering a 32-byte tuple, the minimum heap tuple size including the tuple header, absorbing approximately up to 70 LP_DEAD items would not have much effect in terms of bloat. In other words, if a heap page has more than 70 LP_DEAD items, absorbing LP_DEAD items may become a table bloat problem. This threshold of 70 LP_DEAD items is a conservative value and probably would be a lower bound. If the tuple size is larger, we may be able to absorb more LP_DEAD items.

FYI I've attached a graph showing how the number of LP_DEAD items on one heap page affects the maximum number of heap tuples on the same heap page. The X-axis is the number of LP_DEAD items in one heap page and the Y-axis is the number of heap tuples that can be stored on the page. Each line in the graph corresponds to a different heap tuple size. For example, in the pgbench workload, since the tuple size is about 120 bytes, page bloat accelerates if we leave more than about 230 LP_DEAD items in a heap page.

> > I experimented with this today, and I think that it is a good way to > do it. I like the idea of choose_vacuum_strategy() understanding that > heap pages that are subject to many non-HOT updates have a "natural > extra capacity for LP_DEAD items" that it must care about directly (at > least with non-default heap fill factor settings). My early testing > shows that it will often take a surprisingly long time for the most > heavily updated heap page to have more than about 100 LP_DEAD items.

Agreed.

> > > > I will need to experiment in order to improve my understanding of how > > > to make this cooperate with bottom-up index deletion. But that's > > > mostly just a question for my patch (and a relatively easy one). > > > > Yeah, I think we might need something like statistics about garbage > > per index so that individual index can make a different decision based > > on their status. For example, a btree index might want to skip > > ambulkdelete() if it has a few dead index tuples in its leaf pages. It > > could be on stats collector or on btree's meta page. > > Right. I think that even a very conservative approach could work well.
> For example, maybe we teach nbtree's amvacuumstrategy() routine to ask > to do a real ambulkdelete(), except in the extreme case where the > index is *exactly* the same size as it was after the last VACUUM. > This will happen regularly with bottom-up index deletion. Maybe that > approach is a bit too conservative, though.

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/
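[For anyone who wants to reproduce the arithmetic above, here is a small standalone C sketch of the model. It deliberately ignores the page header and per-tuple line-pointer overhead, mirroring the simplified numbers in the mail, so treat the output as approximate.]

#include <stdio.h>

#define PAGE_SIZE          8192
#define LINE_POINTER_SIZE     4
#define MAX_LINE_POINTERS   291    /* MaxHeapTuplesPerPage for 8kB pages */

/* How many tuples still fit on a page already carrying n_lpdead LP_DEAD items? */
static int
tuples_that_fit(int tuple_size, int n_lpdead)
{
    int space_left = PAGE_SIZE - n_lpdead * LINE_POINTER_SIZE;
    int by_space = space_left / tuple_size;             /* space-based limit */
    int by_pointers = MAX_LINE_POINTERS - n_lpdead;     /* line-pointer limit */

    return (by_space < by_pointers) ? by_space : by_pointers;
}

int
main(void)
{
    /* 200-byte tuples lose only a few slots to 150 LP_DEAD items (40 -> 37)... */
    printf("200-byte tuples: %d -> %d\n",
           tuples_that_fit(200, 0), tuples_that_fit(200, 150));
    /* ...while 40-byte tuples lose dozens, because the line-pointer limit kicks in (204 -> 141) */
    printf(" 40-byte tuples: %d -> %d\n",
           tuples_that_fit(40, 0), tuples_that_fit(40, 150));
    return 0;
}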
In reply to this post by Peter Geoghegan-4
On Tue, Dec 29, 2020 at 7:06 AM Peter Geoghegan <[hidden email]> wrote:
> > On Sun, Dec 27, 2020 at 11:41 PM Peter Geoghegan <[hidden email]> wrote: > > I experimented with this today, and I think that it is a good way to > > do it. I like the idea of choose_vacuum_strategy() understanding that > > heap pages that are subject to many non-HOT updates have a "natural > > extra capacity for LP_DEAD items" that it must care about directly (at > > least with non-default heap fill factor settings). My early testing > > shows that it will often take a surprisingly long time for the most > > heavily updated heap page to have more than about 100 LP_DEAD items. > > Attached is a rough patch showing what I did here. It was applied on > top of my bottom-up index deletion patch series and your > poc_vacuumstrategy.patch patch. This patch was written as a quick and > dirty way of simulating what I thought would work best for bottom-up > index deletion for one specific benchmark/test, which was > non-hot-update heavy. This consists of a variant pgbench with several > indexes on pgbench_accounts (almost the same as most other bottom-up > deletion benchmarks I've been running). Only one index is "logically > modified" by the updates, but of course we still physically modify all > indexes on every update. I set fill factor to 90 for this benchmark, > which is an important factor for how your VACUUM patch works during > the benchmark. > > This rough supplementary patch includes VACUUM logic that assumes (but > doesn't check) that the table has heap fill factor set to 90 -- see my > changes to choose_vacuum_strategy(). This benchmark is really about > stability over time more than performance (though performance is also > improved significantly). I wanted to keep both the table/heap and the > logically unmodified indexes (i.e. 3 out of 4 indexes on > pgbench_accounts) exactly the same size *forever*. > > Does this make sense?

Thank you for sharing the patch. That makes sense.

+   if (!vacuum_heap)
+   {
+       if (maxdeadpage > 130 ||
+           /* Also check if maintenance_work_mem space is running out */
+           vacrelstats->dead_tuples->num_tuples >
+           vacrelstats->dead_tuples->max_tuples / 2)
+           vacuum_heap = true;
+   }

The second test, checking whether maintenance_work_mem space is running out, also makes sense to me. Perhaps another idea would be to compare the number of collected garbage tuples to the total number of heap tuples so that we do lazy_vacuum_heap() only when we're likely to reclaim a certain amount of garbage in the table.

> > Anyway, with a 15k TPS limit on a pgbench scale 3000 DB, I see that > pg_stat_database shows an almost ~28% reduction in blks_read after an > overnight run for the patch series (it was 508,820,699 for the > patches, 705,282,975 for the master branch). I think that the VACUUM > component is responsible for some of that reduction. There were 11 > VACUUMs for the patch, 7 of which did not call lazy_vacuum_heap() > (these 7 VACUUM operations all only dead a btbulkdelete() call for the > one problematic index on the table, named "abalance_ruin", which my > supplementary patch has hard-coded knowledge of).

That's a very good result in terms of skipping lazy_vacuum_heap(). How much did the table and indexes bloat? Also, I'm curious about which test in choose_vacuum_strategy() turned vacuum_heap on: the 130 test, or the test for maintenance_work_mem space running out? And what was the impact on clearing all-visible bits?

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/
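[A tiny sketch of that alternative test, in the same spirit as the hunk quoted above. The function name and the 2% threshold are purely illustrative assumptions, not values from the patch.]

/* Illustrative only: would we reclaim a meaningful fraction of the table? */
#define GARBAGE_RATIO_THRESHOLD  0.02   /* assumed value, not from the patch */

static bool
worth_vacuuming_heap(double num_dead_tuples, double num_heap_tuples)
{
    if (num_heap_tuples <= 0)
        return true;

    return (num_dead_tuples / num_heap_tuples) >= GARBAGE_RATIO_THRESHOLD;
}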
On Mon, Dec 28, 2020 at 10:26 PM Masahiko Sawada <[hidden email]> wrote:
> The second test checking if maintenane_work_mem space is running out > also makes sense to me. Perhaps another idea would be to compare the > number of collected garbage tuple to the total number of heap tuples > so that we do lazy_vacuum_heap() only when we’re likely to reclaim a > certain amount of garbage in the table. Right. Or to consider if this is an anti-wraparound VACUUM might be nice -- maybe we should skip index vacuuming + lazy_vacuum_heap() if and only if we're under pressure to advance datfrozenxid for the whole DB, and really need to hurry up. (I think that we could both probably think of way too many ideas like this one.) > That's a very good result in terms of skipping lazy_vacuum_heap(). How > much the table and indexes bloated? Also, I'm curious about that which > tests in choose_vacuum_strategy() turned vacuum_heap on: 130 test or > test if maintenance_work_mem space is running out? And what was the > impact on clearing all-visible bits? The pgbench_accounts heap table and 3 out of 4 of its indexes (i.e. all indexes except "abalance_ruin") had zero growth. They did not even become larger by 1 block. As I often say when talking about work in this area, this is not a quantitative difference -- it's a qualitative difference. (If they grew even a tiny amount, say by only 1 block, further growth is likely to follow.) The "abalance_ruin" index was smaller with the patch. Its size started off at 253,779 blocks with both the patch and master branch (which is very small, because of B-Tree deduplication). By the end of 2 pairs of runs for the patch (2 3 hour runs) the size grew to 502,016 blocks. But with the master branch it grew to 540,090 blocks. (For reference, the primary key on pgbench_accounts started out at 822,573 blocks.) My guess is that this would compare favorably with "magic VACUUM" [1] (I refer to a thought experiment that is useful for understanding the principles behind bottom-up index deletion). The fact that "abalance_ruin" becomes bloated probably doesn't have that much to do with MVCC versioning. In other words, I suspect that the index wouldn't be that smaller in a traditional two-phase locking database system with the same workload. Words like "bloat" and "fragmentation" have always been overloaded/ambiguous in highly confusing ways, which is why I find it useful to compare a real world workload/benchmark to some kind of theoretical ideal behavior. This test wasn't particularly sympathetic to the patch because most of the indexes (all but the PK) were useless -- they did not get used by query plans. So the final size of "abalance_ruin" (or any other index) isn't even the truly important thing IMV (the benchmark doesn't advertise the truly important thing for me). The truly important thing is that the worst case number of versions *per logical row* is tightly controlled. It doesn't necessarily matter all that much if 30% of an index's tuples are garbage, as long as the garbage tuples are evenly spread across all logical rows in the table (in practice it's pretty unlikely that that would actually happen, but it's possible in theory, and if it did happen it really wouldn't be so bad). [1] https://postgr.es/m/CAH2-Wz=rPkB5vXS7AZ+v8t3VE75v_dKGro+w4nWd64E9yiCLEQ@... -- Peter Geoghegan |
On Mon, Dec 28, 2020 at 11:20 PM Peter Geoghegan <[hidden email]> wrote:
> > That's a very good result in terms of skipping lazy_vacuum_heap(). How > > much the table and indexes bloated? Also, I'm curious about that which > > tests in choose_vacuum_strategy() turned vacuum_heap on: 130 test or > > test if maintenance_work_mem space is running out? And what was the > > impact on clearing all-visible bits? > > The pgbench_accounts heap table and 3 out of 4 of its indexes (i.e. > all indexes except "abalance_ruin") had zero growth. They did not even > become larger by 1 block. As I often say when talking about work in > this area, this is not a quantitative difference -- it's a qualitative > difference. (If they grew even a tiny amount, say by only 1 block, > further growth is likely to follow.)

I forgot to say: I don't know what the exact impact was on the VM bit setting, but I doubt that it was noticeably worse for the patch. It cannot have been better, though. It's inherently almost impossible to keep most of the VM bits set for long with this workload. Perhaps VM bit setting would be improved with workloads that have some HOT updates, but as I mentioned this workload only had non-HOT updates (except in a tiny number of cases where abalance did not change, just by random luck).

I also forgot to say that the maintenance_work_mem test wasn't that relevant, though I believe it triggered once. maintenance_work_mem was set very high (5GB).

Here is a link with more detailed information, in case that is interesting: https://drive.google.com/file/d/1TqpAQnqb4SMMuhehD8ELpf6Cv9A8ux2E/view?usp=sharing

--
Peter Geoghegan
On Tue, Dec 29, 2020 at 3:25 PM Masahiko Sawada <[hidden email]> wrote:
> > On Tue, Dec 29, 2020 at 7:06 AM Peter Geoghegan <[hidden email]> wrote: > > > > On Sun, Dec 27, 2020 at 11:41 PM Peter Geoghegan <[hidden email]> wrote: > > > I experimented with this today, and I think that it is a good way to > > > do it. I like the idea of choose_vacuum_strategy() understanding that > > > heap pages that are subject to many non-HOT updates have a "natural > > > extra capacity for LP_DEAD items" that it must care about directly (at > > > least with non-default heap fill factor settings). My early testing > > > shows that it will often take a surprisingly long time for the most > > > heavily updated heap page to have more than about 100 LP_DEAD items. > > > > Attached is a rough patch showing what I did here. It was applied on > > top of my bottom-up index deletion patch series and your > > poc_vacuumstrategy.patch patch. This patch was written as a quick and > > dirty way of simulating what I thought would work best for bottom-up > > index deletion for one specific benchmark/test, which was > > non-hot-update heavy. This consists of a variant pgbench with several > > indexes on pgbench_accounts (almost the same as most other bottom-up > > deletion benchmarks I've been running). Only one index is "logically > > modified" by the updates, but of course we still physically modify all > > indexes on every update. I set fill factor to 90 for this benchmark, > > which is an important factor for how your VACUUM patch works during > > the benchmark. > > > > This rough supplementary patch includes VACUUM logic that assumes (but > > doesn't check) that the table has heap fill factor set to 90 -- see my > > changes to choose_vacuum_strategy(). This benchmark is really about > > stability over time more than performance (though performance is also > > improved significantly). I wanted to keep both the table/heap and the > > logically unmodified indexes (i.e. 3 out of 4 indexes on > > pgbench_accounts) exactly the same size *forever*. > > > > Does this make sense? > > Thank you for sharing the patch. That makes sense. > > + if (!vacuum_heap) > + { > + if (maxdeadpage > 130 || > + /* Also check if maintenance_work_mem space is running out */ > + vacrelstats->dead_tuples->num_tuples > > + vacrelstats->dead_tuples->max_tuples / 2) > + vacuum_heap = true; > + } > > The second test checking if maintenane_work_mem space is running out > also makes sense to me. Perhaps another idea would be to compare the > number of collected garbage tuple to the total number of heap tuples > so that we do lazy_vacuum_heap() only when we’re likely to reclaim a > certain amount of garbage in the table. > > > > > Anyway, with a 15k TPS limit on a pgbench scale 3000 DB, I see that > > pg_stat_database shows an almost ~28% reduction in blks_read after an > > overnight run for the patch series (it was 508,820,699 for the > > patches, 705,282,975 for the master branch). I think that the VACUUM > > component is responsible for some of that reduction. There were 11 > > VACUUMs for the patch, 7 of which did not call lazy_vacuum_heap() > > (these 7 VACUUM operations all only dead a btbulkdelete() call for the > > one problematic index on the table, named "abalance_ruin", which my > > supplementary patch has hard-coded knowledge of). > > That's a very good result in terms of skipping lazy_vacuum_heap(). How > much the table and indexes bloated? Also, I'm curious about that which > tests in choose_vacuum_strategy() turned vacuum_heap on: 130 test or > test if maintenance_work_mem space is running out? 
> And what was the impact on clearing all-visible bits?

I merged these patches and polished it.

In the 0002 patch, we calculate how many LP_DEAD items can be accumulated in the space on a single heap page left by fillfactor. I increased MaxHeapTuplesPerPage so that we can accumulate LP_DEAD items on a heap page, because otherwise accumulating LP_DEAD items unnecessarily constrains the number of heap tuples in a single page, especially when tuples are small, as I mentioned before. Previously, we constrained the number of line pointers to avoid excessive line-pointer bloat and to not require an increase in the size of the work array. However, once the amvacuumstrategy stuff entered the picture, accumulating line pointers has value. Also, we might want to store the returned value of amvacuumstrategy so that the index AM can refer to it during index deletion.

The 0003 patch has btree indexes skip bulk-deletion if the index hasn't grown since the last bulk-deletion. I stored the number of blocks in the meta page but didn't implement meta page upgrading.

I've attached the draft version patches. Note that the documentation update is still lacking.

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/
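[To illustrate what the 0003 patch's idea could look like in nbtree. The function and parameter names below are guesses made for the sake of discussion; the real patch stores the block count in the metapage and may differ in detail.]

/*
 * Sketch only: skip bulk deletion when the index has not grown since the
 * previous bulk deletion, on the assumption that bottom-up deletion kept up.
 */
static IndexVacuumStrategy
btvacuumstrategy_sketch(Relation index, BlockNumber blocks_at_last_bulkdelete)
{
    BlockNumber current_blocks = RelationGetNumberOfBlocks(index);

    if (current_blocks <= blocks_at_last_bulkdelete)
        return INDEX_VACUUM_STRATEGY_NONE;      /* no growth: skip ambulkdelete() */

    return INDEX_VACUUM_STRATEGY_BULKDELETE;    /* index grew: ask for a real pass */
}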
On Tue, Jan 5, 2021 at 10:35 AM Masahiko Sawada <[hidden email]> wrote:
> > On Tue, Dec 29, 2020 at 3:25 PM Masahiko Sawada <[hidden email]> wrote: > > > > On Tue, Dec 29, 2020 at 7:06 AM Peter Geoghegan <[hidden email]> wrote: > > > > > > On Sun, Dec 27, 2020 at 11:41 PM Peter Geoghegan <[hidden email]> wrote: > > > > I experimented with this today, and I think that it is a good way to > > > > do it. I like the idea of choose_vacuum_strategy() understanding that > > > > heap pages that are subject to many non-HOT updates have a "natural > > > > extra capacity for LP_DEAD items" that it must care about directly (at > > > > least with non-default heap fill factor settings). My early testing > > > > shows that it will often take a surprisingly long time for the most > > > > heavily updated heap page to have more than about 100 LP_DEAD items. > > > > > > Attached is a rough patch showing what I did here. It was applied on > > > top of my bottom-up index deletion patch series and your > > > poc_vacuumstrategy.patch patch. This patch was written as a quick and > > > dirty way of simulating what I thought would work best for bottom-up > > > index deletion for one specific benchmark/test, which was > > > non-hot-update heavy. This consists of a variant pgbench with several > > > indexes on pgbench_accounts (almost the same as most other bottom-up > > > deletion benchmarks I've been running). Only one index is "logically > > > modified" by the updates, but of course we still physically modify all > > > indexes on every update. I set fill factor to 90 for this benchmark, > > > which is an important factor for how your VACUUM patch works during > > > the benchmark. > > > > > > This rough supplementary patch includes VACUUM logic that assumes (but > > > doesn't check) that the table has heap fill factor set to 90 -- see my > > > changes to choose_vacuum_strategy(). This benchmark is really about > > > stability over time more than performance (though performance is also > > > improved significantly). I wanted to keep both the table/heap and the > > > logically unmodified indexes (i.e. 3 out of 4 indexes on > > > pgbench_accounts) exactly the same size *forever*. > > > > > > Does this make sense? > > > > Thank you for sharing the patch. That makes sense. > > > > + if (!vacuum_heap) > > + { > > + if (maxdeadpage > 130 || > > + /* Also check if maintenance_work_mem space is running out */ > > + vacrelstats->dead_tuples->num_tuples > > > + vacrelstats->dead_tuples->max_tuples / 2) > > + vacuum_heap = true; > > + } > > > > The second test checking if maintenane_work_mem space is running out > > also makes sense to me. Perhaps another idea would be to compare the > > number of collected garbage tuple to the total number of heap tuples > > so that we do lazy_vacuum_heap() only when we’re likely to reclaim a > > certain amount of garbage in the table. > > > > > > > > Anyway, with a 15k TPS limit on a pgbench scale 3000 DB, I see that > > > pg_stat_database shows an almost ~28% reduction in blks_read after an > > > overnight run for the patch series (it was 508,820,699 for the > > > patches, 705,282,975 for the master branch). I think that the VACUUM > > > component is responsible for some of that reduction. There were 11 > > > VACUUMs for the patch, 7 of which did not call lazy_vacuum_heap() > > > (these 7 VACUUM operations all only dead a btbulkdelete() call for the > > > one problematic index on the table, named "abalance_ruin", which my > > > supplementary patch has hard-coded knowledge of). > > > > That's a very good result in terms of skipping lazy_vacuum_heap(). 
How > > much the table and indexes bloated? Also, I'm curious about that which > > tests in choose_vacuum_strategy() turned vacuum_heap on: 130 test or > > test if maintenance_work_mem space is running out? And what was the > > impact on clearing all-visible bits? > > > > I merged these patches and polished it. > > In the 0002 patch, we calculate how many LP_DEAD items can be > accumulated in the space on a single heap page left by fillfactor. I > increased MaxHeapTuplesPerPage so that we can accumulate LP_DEAD items > on a heap page. Because otherwise accumulating LP_DEAD items > unnecessarily constrains the number of heap tuples in a single page, > especially when small tuples, as I mentioned before. Previously, we > constrained the number of line pointers to avoid excessive > line-pointer bloat and not require an increase in the size of the work > array. However, once amvacuumstrategy stuff entered the picture, > accumulating line pointers has value. Also, we might want to store the > returned value of amvacuumstrategy so that index AM can refer to it on > index-deletion. > > The 0003 patch has btree indexes skip bulk-deletion if the index > doesn't grow since last bulk-deletion. I stored the number of blocks > in the meta page but didn't implement meta page upgrading.

After more thought, I think that ambulkdelete needs to be able to refer to the answer to amvacuumstrategy. That way, the index can skip bulk-deletion when lazy vacuum doesn't vacuum the heap and it also doesn't want to do that.

I've attached the updated version patch that includes the following changes:

* Store the answers to amvacuumstrategy into either local memory or DSM (in the parallel vacuum case) so that ambulkdelete can refer to the answer to amvacuumstrategy.
* Fix regression failures.
* Update the documentation and comments.

Note that the 0003 patch is still PoC quality, lacking the btree meta page version upgrade.

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/
On Mon, Jan 18, 2021 at 2:18 PM Masahiko Sawada <[hidden email]> wrote:
> > On Tue, Jan 5, 2021 at 10:35 AM Masahiko Sawada <[hidden email]> wrote: > > > > On Tue, Dec 29, 2020 at 3:25 PM Masahiko Sawada <[hidden email]> wrote: > > > > > > On Tue, Dec 29, 2020 at 7:06 AM Peter Geoghegan <[hidden email]> wrote: > > > > > > > > On Sun, Dec 27, 2020 at 11:41 PM Peter Geoghegan <[hidden email]> wrote: > > > > > I experimented with this today, and I think that it is a good way to > > > > > do it. I like the idea of choose_vacuum_strategy() understanding that > > > > > heap pages that are subject to many non-HOT updates have a "natural > > > > > extra capacity for LP_DEAD items" that it must care about directly (at > > > > > least with non-default heap fill factor settings). My early testing > > > > > shows that it will often take a surprisingly long time for the most > > > > > heavily updated heap page to have more than about 100 LP_DEAD items. > > > > > > > > Attached is a rough patch showing what I did here. It was applied on > > > > top of my bottom-up index deletion patch series and your > > > > poc_vacuumstrategy.patch patch. This patch was written as a quick and > > > > dirty way of simulating what I thought would work best for bottom-up > > > > index deletion for one specific benchmark/test, which was > > > > non-hot-update heavy. This consists of a variant pgbench with several > > > > indexes on pgbench_accounts (almost the same as most other bottom-up > > > > deletion benchmarks I've been running). Only one index is "logically > > > > modified" by the updates, but of course we still physically modify all > > > > indexes on every update. I set fill factor to 90 for this benchmark, > > > > which is an important factor for how your VACUUM patch works during > > > > the benchmark. > > > > > > > > This rough supplementary patch includes VACUUM logic that assumes (but > > > > doesn't check) that the table has heap fill factor set to 90 -- see my > > > > changes to choose_vacuum_strategy(). This benchmark is really about > > > > stability over time more than performance (though performance is also > > > > improved significantly). I wanted to keep both the table/heap and the > > > > logically unmodified indexes (i.e. 3 out of 4 indexes on > > > > pgbench_accounts) exactly the same size *forever*. > > > > > > > > Does this make sense? > > > > > > Thank you for sharing the patch. That makes sense. > > > > > > + if (!vacuum_heap) > > > + { > > > + if (maxdeadpage > 130 || > > > + /* Also check if maintenance_work_mem space is running out */ > > > + vacrelstats->dead_tuples->num_tuples > > > > + vacrelstats->dead_tuples->max_tuples / 2) > > > + vacuum_heap = true; > > > + } > > > > > > The second test checking if maintenane_work_mem space is running out > > > also makes sense to me. Perhaps another idea would be to compare the > > > number of collected garbage tuple to the total number of heap tuples > > > so that we do lazy_vacuum_heap() only when we’re likely to reclaim a > > > certain amount of garbage in the table. > > > > > > > > > > > Anyway, with a 15k TPS limit on a pgbench scale 3000 DB, I see that > > > > pg_stat_database shows an almost ~28% reduction in blks_read after an > > > > overnight run for the patch series (it was 508,820,699 for the > > > > patches, 705,282,975 for the master branch). I think that the VACUUM > > > > component is responsible for some of that reduction. 
There were 11 > > > > VACUUMs for the patch, 7 of which did not call lazy_vacuum_heap() > > > > (these 7 VACUUM operations all only dead a btbulkdelete() call for the > > > > one problematic index on the table, named "abalance_ruin", which my > > > > supplementary patch has hard-coded knowledge of). > > > > > > That's a very good result in terms of skipping lazy_vacuum_heap(). How > > > much the table and indexes bloated? Also, I'm curious about that which > > > tests in choose_vacuum_strategy() turned vacuum_heap on: 130 test or > > > test if maintenance_work_mem space is running out? And what was the > > > impact on clearing all-visible bits? > > > > > > > I merged these patches and polished it. > > > > In the 0002 patch, we calculate how many LP_DEAD items can be > > accumulated in the space on a single heap page left by fillfactor. I > > increased MaxHeapTuplesPerPage so that we can accumulate LP_DEAD items > > on a heap page. Because otherwise accumulating LP_DEAD items > > unnecessarily constrains the number of heap tuples in a single page, > > especially when small tuples, as I mentioned before. Previously, we > > constrained the number of line pointers to avoid excessive > > line-pointer bloat and not require an increase in the size of the work > > array. However, once amvacuumstrategy stuff entered the picture, > > accumulating line pointers has value. Also, we might want to store the > > returned value of amvacuumstrategy so that index AM can refer to it on > > index-deletion. > > > > The 0003 patch has btree indexes skip bulk-deletion if the index > > doesn't grow since last bulk-deletion. I stored the number of blocks > > in the meta page but didn't implement meta page upgrading. > > > > After more thought, I think that ambulkdelete needs to be able to > refer the answer to amvacuumstrategy. That way, the index can skip > bulk-deletion when lazy vacuum doesn't vacuum heap and it also doesn’t > want to do that. > > I’ve attached the updated version patch that includes the following changes: > > * Store the answers to amvacuumstrategy into either the local memory > or DSM (in parallel vacuum case) so that ambulkdelete can refer the > answer to amvacuumstrategy. > * Fix regression failures. > * Update the documentation and commments. > > Note that 0003 patch is still PoC quality, lacking the btree meta page > version upgrade. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/ ![]() ![]() ![]() |
Hi, Masahiko-san: For v2-0001-Introduce-IndexAM-API-for-choosing-index-vacuum-s.patch : For blvacuumstrategy(): + if (params->index_cleanup == VACOPT_TERNARY_DISABLED) + return INDEX_VACUUM_STRATEGY_NONE; + else + return INDEX_VACUUM_STRATEGY_BULKDELETE; The 'else' can be omitted. Similar comment for ginvacuumstrategy(). For v2-0002-Choose-index-vacuum-strategy-based-on-amvacuumstr.patch : If index_cleanup option is specified neither VACUUM command nor storage option I think this is what you meant (by not using passive voice): If index_cleanup option specifies neither VACUUM command nor storage option, - * integer, but you can't fit that many items on a page. 11 ought to be more + * integer, but you can't fit that many items on a page. 13 ought to be more It would be nice to add a note why the number of bits is increased. For choose_vacuum_strategy(): + IndexVacuumStrategy ivstrat; The variable is only used inside the loop. You can use vacrelstats->ivstrategies[i] directly and omit the variable. + int ndeaditems_limit = (int) ((freespace / sizeof(ItemIdData)) * 0.7); How was the factor of 0.7 determined ? Comment below only mentioned 'safety factor' but not how it was chosen. I also wonder if this factor should be exposed as GUC. + if (nworkers > 0) + nworkers--; Should log / assert be added when nworkers is <= 0 ? + * XXX: allowing to fill the heap page with only line pointer seems a overkill. 'a overkill' -> 'an overkill' Cheers On Sun, Jan 17, 2021 at 10:21 PM Masahiko Sawada <[hidden email]> wrote: On Mon, Jan 18, 2021 at 2:18 PM Masahiko Sawada <[hidden email]> wrote: |
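The suggested simplification would read something like this (a sketch only; the function signature is assumed from the snippet above rather than copied from the patch):

    static IndexVacuumStrategy
    blvacuumstrategy(IndexVacuumInfo *info, VacuumParams *params)
    {
        /* Skip bulk deletion only when the user disabled index cleanup. */
        if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
            return INDEX_VACUUM_STRATEGY_NONE;

        return INDEX_VACUUM_STRATEGY_BULKDELETE;
    }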
In reply to this post by Masahiko Sawada
On Sun, Jan 17, 2021 at 9:18 PM Masahiko Sawada <[hidden email]> wrote:
> After more thought, I think that ambulkdelete needs to be able to > refer the answer to amvacuumstrategy. That way, the index can skip > bulk-deletion when lazy vacuum doesn't vacuum heap and it also doesn’t > want to do that. Makes sense. BTW, your patch has bitrot already. Peter E's recent pageinspect commit happens to conflict with this patch. It might make sense to produce a new version that just fixes the bitrot, so that other people don't have to deal with it each time. > I’ve attached the updated version patch that includes the following changes: Looks good. I'll give this version a review now. I will do a lot more soon. I need to come up with a good benchmark for this, that I can return to again and again during review as needed. Some feedback on the first patch: * Just so you know: I agree with you about handling VACOPT_TERNARY_DISABLED in the index AM's amvacuumstrategy routine. I think that it's better to do that there, even though this choice may have some downsides. * Can you add some "stub" sgml doc changes for this? Doesn't have to be complete in any way. Just a placeholder for later, that has the correct general "shape" to orientate the reader of the patch. It can just be a FIXME comment, plus basic mechanical stuff -- details of the new amvacuumstrategy_function routine and its signature. Some feedback on the second patch: * Why do you move around IndexVacuumStrategy in the second patch? Looks like a rebasing oversight. * Actually, do we really need the first and second patches to be separate patches? I agree that the nbtree patch should be a separate patch, but dividing the first two sets of changes doesn't seem like it adds much. Did I miss some something? * Is the "MAXALIGN(sizeof(ItemIdData)))" change in the definition of MaxHeapTuplesPerPage appropriate? Here is the relevant section from the patch: diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h index 7c62852e7f..038e7cd580 100644 --- a/src/include/access/htup_details.h +++ b/src/include/access/htup_details.h @@ -563,17 +563,18 @@ do { \ /* * MaxHeapTuplesPerPage is an upper bound on the number of tuples that can * fit on one heap page. (Note that indexes could have more, because they - * use a smaller tuple header.) We arrive at the divisor because each tuple - * must be maxaligned, and it must have an associated line pointer. + * use a smaller tuple header.) We arrive at the divisor because each line + * pointer must be maxaligned. *** SNIP *** #define MaxHeapTuplesPerPage \ - ((int) ((BLCKSZ - SizeOfPageHeaderData) / \ - (MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData)))) + ((int) ((BLCKSZ - SizeOfPageHeaderData) / (MAXALIGN(sizeof(ItemIdData))))) It's true that ItemIdData structs (line pointers) are aligned, but they're not MAXALIGN()'d. If they were then the on-disk size of line pointers would generally be 8 bytes, not 4 bytes. * Maybe it would be better if you just changed the definition such that "MAXALIGN(SizeofHeapTupleHeader)" became "MAXIMUM_ALIGNOF", with no other changes? (Some variant of this suggestion might be better, not sure.) For some reason that feels a bit safer: we still have an "imaginary tuple header", but it's just 1 MAXALIGN() quantum now. This is still much less than the current 3 MAXALIGN() quantums (i.e. what MaxHeapTuplesPerPage treats as the tuple header size). Do you think that this alternative approach will be noticeably less effective within vacuumlazy.c? 
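For reference, that alternative definition would presumably look something like the following (a sketch of the suggestion, not a tested change):

    #define MaxHeapTuplesPerPage \
        ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
                (MAXIMUM_ALIGNOF + sizeof(ItemIdData))))

    /* with 8KiB pages and 8-byte maxalign: (8192 - 24) / (8 + 4) = 680 */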
Note that you probably understand the issue with MaxHeapTuplesPerPage for vacuumlazy.c better than I do currently. I'm still trying to understand your choices, and to understand what is really important here. * Maybe add a #define for the value 0.7? (I refer to the value used in choose_vacuum_strategy() to calculate a "this is the number of LP_DEAD line pointers that we consider too many" cut off point, which is to be applied throughout lazy_scan_heap() processing.) * I notice that your new lazy_vacuum_table_and_indexes() function is the only place that calls lazy_vacuum_table_and_indexes(). I think that you should merge them together -- replace the only remaining call to lazy_vacuum_table_and_indexes() with the body of the function itself. Having a separate lazy_vacuum_table_and_indexes() function doesn't seem useful to me -- it doesn't actually hide complexity, and might even be harder to maintain. * I suggest thinking about what the last item will mean for the reporting that currently takes place in lazy_vacuum_table_and_indexes(), but will now go in an expanded lazy_vacuum_table_and_indexes() -- how do we count the total number of index scans now? I don't actually believe that the logic needs to change, but some kind of consolidation and streamlining seems like it might be helpful. Maybe just a comment that says "note that all index scans might just be no-ops because..." -- stuff like that. * Any idea about how hard it will be to teach is_wraparound VACUUMs to skip index cleanup automatically, based on some practical/sensible criteria? It would be nice to have a basic PoC for that, even if it remains a PoC for the foreseeable future (i.e. even if it cannot be committed to Postgres 14). This feature should definitely be something that your patch series *enables*. I'd feel good about having covered that question as part of this basic design work if there was a PoC. That alone should make it 100% clear that it's easy to do (or no harder than it should be -- it should ideally be compatible with your basic design). The exact criteria that we use for deciding whether or not to skip index cleanup (which probably should not just be "this VACUUM is is_wraparound, good enough" in the final version) may need to be debated at length on pgsql-hackers. Even still, it is "just a detail" in the code. Whereas being *able* to do that with your design (now or in the future) seems essential now. > * Store the answers to amvacuumstrategy into either the local memory > or DSM (in parallel vacuum case) so that ambulkdelete can refer the > answer to amvacuumstrategy. > * Fix regression failures. > * Update the documentation and commments. > > Note that 0003 patch is still PoC quality, lacking the btree meta page > version upgrade. This patch is not the hard part, of course -- there really isn't that much needed here compared to vacuumlazy.c. So this patch seems like the simplest 1 out of the 3 (at least to me). Some feedback on the third patch: * The new btm_last_deletion_nblocks metapage field should use P_NONE (which is 0) to indicate never having been vacuumed -- not InvalidBlockNumber (which is 0xFFFFFFFF). This is more idiomatic in nbtree, which is nice, but it has a very significant practical advantage: it ensures that every heapkeyspace nbtree index (i.e. those on recent nbtree versions) can be treated as if it has the new btm_last_deletion_nblocks field all along, even when it actually built on Postgres 12 or 13. 
This trick will let you avoid dealing with the headache of bumping BTREE_VERSION, which is a huge advantage. Note that this is the same trick I used to avoid bumping BTREE_VERSION when the btm_allequalimage field needed to be added (for the nbtree deduplication feature added to Postgres 13). * Forgot to do this in the third patch (think I made this same mistake once myself): diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c index 8bb180bbbe..88dfea9af3 100644 --- a/contrib/pageinspect/btreefuncs.c +++ b/contrib/pageinspect/btreefuncs.c @@ -653,7 +653,7 @@ bt_metap(PG_FUNCTION_ARGS) BTMetaPageData *metad; TupleDesc tupleDesc; int j; - char *values[9]; + char *values[10]; Buffer buffer; Page page; HeapTuple tuple; @@ -734,6 +734,11 @@ bt_metap(PG_FUNCTION_ARGS) That's all I have for now... -- Peter Geoghegan |
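To illustrate the btm_last_deletion_nblocks / P_NONE convention suggested above, the read side might look roughly like this (the helper name and surrounding details are made up; only the field name and P_NONE come from the discussion, and the metapage version check is omitted):

    static bool
    _bt_vacuum_skip_bulkdelete(Relation rel, BlockNumber cur_nblocks)
    {
        Buffer      metabuf;
        BTMetaPageData *metad;
        BlockNumber last_nblocks;

        metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
        metad = BTPageGetMeta(BufferGetPage(metabuf));
        last_nblocks = metad->btm_last_deletion_nblocks;
        _bt_relbuf(rel, metabuf);

        /*
         * P_NONE (0) doubles as "never bulk-deleted".  An index built before
         * the field existed reads as zeroes here too, so no BTREE_VERSION
         * bump is required.
         */
        if (last_nblocks == P_NONE)
            return false;

        /* Only skip if the index has not grown since the last bulk deletion. */
        return cur_nblocks <= last_nblocks;
    }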
On Tue, Jan 19, 2021 at 2:57 PM Peter Geoghegan <[hidden email]> wrote:
> * Maybe it would be better if you just changed the definition such > that "MAXALIGN(SizeofHeapTupleHeader)" became "MAXIMUM_ALIGNOF", with > no other changes? (Some variant of this suggestion might be better, > not sure.) > > For some reason that feels a bit safer: we still have an "imaginary > tuple header", but it's just 1 MAXALIGN() quantum now. This is still > much less than the current 3 MAXALIGN() quantums (i.e. what > MaxHeapTuplesPerPage treats as the tuple header size). Do you think > that this alternative approach will be noticeably less effective > within vacuumlazy.c? BTW, I think that increasing MaxHeapTuplesPerPage will make it necessary to consider tidbitmap.c. Comments at the top of that file say that it is assumed that MaxHeapTuplesPerPage is about 256. So there is a risk of introducing performance regressions affecting bitmap scans here. Apparently some other DB systems make the equivalent of MaxHeapTuplesPerPage dynamically configurable at the level of heap tables. It usually doesn't matter, but it can matter with on-disk bitmap indexes, where the bitmap must be encoded from raw TIDs (this must happen before the bitmap is compressed -- there must be a simple mapping from every possible TID to some bit in a bitmap first). The item offset component of each heap TID is not usually very large, so there is a trade-off between keeping the representation of bitmaps efficient and not unduly restricting the number of distinct heap tuples on each heap page. I think that there might be a similar consideration here, in tidbitmap.c (even though it's not concerned about on-disk bitmaps). -- Peter Geoghegan |
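The constraint being described can be seen in miniature with a toy stand-in for an exact per-page bitmap entry. This is a simplified illustration, not the real tidbitmap.c code, but it shows why the per-page bitmap size is driven directly by MaxHeapTuplesPerPage:

    #include "postgres.h"
    #include "access/htup_details.h"   /* MaxHeapTuplesPerPage */
    #include "storage/block.h"
    #include "storage/off.h"

    typedef uint64 toy_bitmapword;
    #define TOY_BITS_PER_WORD   64
    #define TOY_WORDS_PER_PAGE  ((MaxHeapTuplesPerPage - 1) / TOY_BITS_PER_WORD + 1)

    /* One bit per possible line pointer on the heap page */
    typedef struct ToyPageEntry
    {
        BlockNumber     blockno;
        toy_bitmapword  words[TOY_WORDS_PER_PAGE];
    } ToyPageEntry;

    static void
    toy_set_bit(ToyPageEntry *entry, OffsetNumber offnum)
    {
        int     bitnum = offnum - 1;    /* offset numbers are 1-based */

        entry->words[bitnum / TOY_BITS_PER_WORD] |=
            ((toy_bitmapword) 1) << (bitnum % TOY_BITS_PER_WORD);
    }

Raising MaxHeapTuplesPerPage makes every such entry proportionally larger, whether or not most heap pages ever hold that many items.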
On Tue, Jan 19, 2021 at 4:45 PM Peter Geoghegan <[hidden email]> wrote:
> BTW, I think that increasing MaxHeapTuplesPerPage will make it > necessary to consider tidbitmap.c. Comments at the top of that file > say that it is assumed that MaxHeapTuplesPerPage is about 256. So > there is a risk of introducing performance regressions affecting > bitmap scans here. More concretely, WORDS_PER_PAGE increases from 5 on the master branch to 16 with the latest version of the patch series on most platforms (while WORDS_PER_CHUNK is 4 with or without the patches). -- Peter Geoghegan |
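For anyone following along, the arithmetic behind those numbers is roughly as follows (assuming 64-bit bitmap words; the post-patch MaxHeapTuplesPerPage shown is approximate):

    WORDS_PER_PAGE = (MaxHeapTuplesPerPage - 1) / BITS_PER_BITMAPWORD + 1

    master:  (291 - 1) / 64 + 1  =  5 words  ->  40 bytes of bitmap per exact page entry
    patch:   with MaxHeapTuplesPerPage around 1000, e.g. (1021 - 1) / 64 + 1 = 16 words
             -> 128 bytes of bitmap per exact page entry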
In reply to this post by Peter Geoghegan-4
On Wed, Jan 20, 2021 at 9:45 AM Peter Geoghegan <[hidden email]> wrote:
> > On Tue, Jan 19, 2021 at 2:57 PM Peter Geoghegan <[hidden email]> wrote: > > * Maybe it would be better if you just changed the definition such > > that "MAXALIGN(SizeofHeapTupleHeader)" became "MAXIMUM_ALIGNOF", with > > no other changes? (Some variant of this suggestion might be better, > > not sure.) > > > > For some reason that feels a bit safer: we still have an "imaginary > > tuple header", but it's just 1 MAXALIGN() quantum now. This is still > > much less than the current 3 MAXALIGN() quantums (i.e. what > > MaxHeapTuplesPerPage treats as the tuple header size). Do you think > > that this alternative approach will be noticeably less effective > > within vacuumlazy.c? > > BTW, I think that increasing MaxHeapTuplesPerPage will make it > necessary to consider tidbitmap.c. Comments at the top of that file > say that it is assumed that MaxHeapTuplesPerPage is about 256. So > there is a risk of introducing performance regressions affecting > bitmap scans here. > > Apparently some other DB systems make the equivalent of > MaxHeapTuplesPerPage dynamically configurable at the level of heap > tables. It usually doesn't matter, but it can matter with on-disk > bitmap indexes, where the bitmap must be encoded from raw TIDs (this > must happen before the bitmap is compressed -- there must be a simple > mapping from every possible TID to some bit in a bitmap first). The > item offset component of each heap TID is not usually very large, so > there is a trade-off between keeping the representation of bitmaps > efficient and not unduly restricting the number of distinct heap > tuples on each heap page. I think that there might be a similar > consideration here, in tidbitmap.c (even though it's not concerned > about on-disk bitmaps).

That's a good point. With the patch, MaxHeapTuplesPerPage increases to 2042 with an 8k page and to 8186 with a 32k page, whereas it's currently 291 with an 8k page and 1169 with a 32k page. So it is likely to be a problem, as you pointed out. If we change "MAXALIGN(SizeofHeapTupleHeader)" to "MAXIMUM_ALIGNOF", it's 680 with an 8k page and 2728 with a 32k page, which seems much better.

The purpose of increasing MaxHeapTuplesPerPage in the patch is to let a heap page accumulate more LP_DEAD line pointers. As I explained before, given MaxHeapTuplesPerPage, we cannot calculate how many LP_DEAD line pointers can be accumulated in the space left by fillfactor simply as ((the space left by fillfactor) / (size of a line pointer)). We need to consider both how many line pointers are available for LP_DEAD and how much space is available for LP_DEAD.

For example, suppose the tuple size is 50 bytes and fillfactor is 80. Each page then has 1633 bytes (= (8192-24) * 0.2) of free space left by fillfactor, where 408 line pointers can fit. However, if we store 250 LP_DEAD line pointers in that space, then because MaxHeapTuplesPerPage is 291, only 41 tuples can be stored on the page, even though the 6534 bytes (= (8192-24) * 0.8) under the fillfactor limit have room for 121 tuples (plus their line pointers). So in this case, with a tuple size of 50 and fillfactor 80, we can accumulate up to about 170 LP_DEAD line pointers while still storing 121 tuples. Increasing MaxHeapTuplesPerPage raises this 291 limit and lets us ignore it when calculating the maximum number of LP_DEAD line pointers that can be accumulated on a single page.

An alternative approach would be to calculate it using the average tuple size.
I think that if we know the tuple size, the maximum number of LP_DEAD line pointers that can be accumulated on a single page is the minimum of the following two formulas:

(1) MaxHeapTuplesPerPage - (((BLCKSZ - SizeOfPageHeaderData) * (fillfactor/100)) / (sizeof(ItemIdData) + tuple_size))   // how many line pointers are available for LP_DEAD?

(2) ((BLCKSZ - SizeOfPageHeaderData) * ((100 - fillfactor)/100)) / sizeof(ItemIdData)   // how much space is available for LP_DEAD?

But I'd prefer to increase MaxHeapTuplesPerPage without affecting the bitmap much, rather than introducing a complex theory.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
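A sketch of that calculation as code, plugged with the example numbers from the previous message (50-byte tuples, fillfactor 80, 8KiB pages); the function and variable names are illustrative only:

    static int
    max_lpdead_items(Size tuple_size, int fillfactor)
    {
        Size    space = BLCKSZ - SizeOfPageHeaderData;  /* 8168 with 8KiB pages */
        int     lp_available;
        int     lp_space;

        /* (1) line pointers not used up by the tuples that fit under fillfactor */
        lp_available = MaxHeapTuplesPerPage -
            (int) ((space * fillfactor / 100) / (sizeof(ItemIdData) + tuple_size));

        /* (2) line pointers that fit in the space left free by fillfactor */
        lp_space = (int) ((space * (100 - fillfactor) / 100) / sizeof(ItemIdData));

        /* example above: Min(291 - 121, 408) = 170 LP_DEAD items */
        return Min(lp_available, lp_space);
    }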
In reply to this post by Peter Geoghegan-4
On Wed, Jan 20, 2021 at 7:58 AM Peter Geoghegan <[hidden email]> wrote:
> > On Sun, Jan 17, 2021 at 9:18 PM Masahiko Sawada <[hidden email]> wrote: > > After more thought, I think that ambulkdelete needs to be able to > > refer the answer to amvacuumstrategy. That way, the index can skip > > bulk-deletion when lazy vacuum doesn't vacuum heap and it also doesn’t > > want to do that. > > Makes sense. > > BTW, your patch has bitrot already. Peter E's recent pageinspect > commit happens to conflict with this patch. It might make sense to > produce a new version that just fixes the bitrot, so that other people > don't have to deal with it each time. > > > I’ve attached the updated version patch that includes the following changes: > > Looks good. I'll give this version a review now. I will do a lot more > soon. I need to come up with a good benchmark for this, that I can > return to again and again during review as needed. Thank you for reviewing the patches. > > Some feedback on the first patch: > > * Just so you know: I agree with you about handling > VACOPT_TERNARY_DISABLED in the index AM's amvacuumstrategy routine. I > think that it's better to do that there, even though this choice may > have some downsides. > > * Can you add some "stub" sgml doc changes for this? Doesn't have to > be complete in any way. Just a placeholder for later, that has the > correct general "shape" to orientate the reader of the patch. It can > just be a FIXME comment, plus basic mechanical stuff -- details of the > new amvacuumstrategy_function routine and its signature. > 0002 patch had the doc update (I mistakenly included it to 0002 patch). Is that update what you meant? > Some feedback on the second patch: > > * Why do you move around IndexVacuumStrategy in the second patch? > Looks like a rebasing oversight. Check. > > * Actually, do we really need the first and second patches to be > separate patches? I agree that the nbtree patch should be a separate > patch, but dividing the first two sets of changes doesn't seem like it > adds much. Did I miss some something? I separated the patches since I used to have separate patches when adding other index AM options required by parallel vacuum. But I agreed to merge the first two patches. > > * Is the "MAXALIGN(sizeof(ItemIdData)))" change in the definition of > MaxHeapTuplesPerPage appropriate? Here is the relevant section from > the patch: > > diff --git a/src/include/access/htup_details.h > b/src/include/access/htup_details.h > index 7c62852e7f..038e7cd580 100644 > --- a/src/include/access/htup_details.h > +++ b/src/include/access/htup_details.h > @@ -563,17 +563,18 @@ do { \ > /* > * MaxHeapTuplesPerPage is an upper bound on the number of tuples that can > * fit on one heap page. (Note that indexes could have more, because they > - * use a smaller tuple header.) We arrive at the divisor because each tuple > - * must be maxaligned, and it must have an associated line pointer. > + * use a smaller tuple header.) We arrive at the divisor because each line > + * pointer must be maxaligned. > *** SNIP *** > #define MaxHeapTuplesPerPage \ > - ((int) ((BLCKSZ - SizeOfPageHeaderData) / \ > - (MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData)))) > + ((int) ((BLCKSZ - SizeOfPageHeaderData) / (MAXALIGN(sizeof(ItemIdData))))) > > It's true that ItemIdData structs (line pointers) are aligned, but > they're not MAXALIGN()'d. If they were then the on-disk size of line > pointers would generally be 8 bytes, not 4 bytes. You're right. Will fix. 
> > * Maybe it would be better if you just changed the definition such > that "MAXALIGN(SizeofHeapTupleHeader)" became "MAXIMUM_ALIGNOF", with > no other changes? (Some variant of this suggestion might be better, > not sure.) > > For some reason that feels a bit safer: we still have an "imaginary > tuple header", but it's just 1 MAXALIGN() quantum now. This is still > much less than the current 3 MAXALIGN() quantums (i.e. what > MaxHeapTuplesPerPage treats as the tuple header size). Do you think > that this alternative approach will be noticeably less effective > within vacuumlazy.c? > > Note that you probably understand the issue with MaxHeapTuplesPerPage > for vacuumlazy.c better than I do currently. I'm still trying to > understand your choices, and to understand what is really important > here. Yeah, using MAXIMUM_ALIGNOF seems better for safety. I shared my thoughts on the issue with MaxHeapTuplesPerPage yesterday. I think we need to discuss how to deal with that. > > * Maybe add a #define for the value 0.7? (I refer to the value used in > choose_vacuum_strategy() to calculate a "this is the number of LP_DEAD > line pointers that we consider too many" cut off point, which is to be > applied throughout lazy_scan_heap() processing.) > Agreed. > * I notice that your new lazy_vacuum_table_and_indexes() function is > the only place that calls lazy_vacuum_table_and_indexes(). I think > that you should merge them together -- replace the only remaining call > to lazy_vacuum_table_and_indexes() with the body of the function > itself. Having a separate lazy_vacuum_table_and_indexes() function > doesn't seem useful to me -- it doesn't actually hide complexity, and > might even be harder to maintain. lazy_vacuum_table_and_indexes() is called at two places: after maintenance_work_mem run out (around L1097) and the end of lazy_scan_heap() (around L1726). I defined this function to pack the operations such as choosing a strategy, vacuuming indexes and vacuuming heap. Without this function, will we end up writing the same codes twice there? > > * I suggest thinking about what the last item will mean for the > reporting that currently takes place in > lazy_vacuum_table_and_indexes(), but will now go in an expanded > lazy_vacuum_table_and_indexes() -- how do we count the total number of > index scans now? > > I don't actually believe that the logic needs to change, but some kind > of consolidation and streamlining seems like it might be helpful. > Maybe just a comment that says "note that all index scans might just > be no-ops because..." -- stuff like that. What do you mean by the last item and what report? I think lazy_vacuum_table_and_indexes() itself doesn't report anything and vacrelstats->num_index_scan counts the total number of index scans. > > * Any idea about how hard it will be to teach is_wraparound VACUUMs to > skip index cleanup automatically, based on some practical/sensible > criteria? One simple idea would be to have a to-prevent-wraparound autovacuum worker disables index cleanup (i.g., setting VACOPT_TERNARY_DISABLED to index_cleanup). But a downside (but not a common case) is that since a to-prevent-wraparound vacuum is not necessarily an aggressive vacuum, it could skip index cleanup even though it cannot move relfrozenxid forward. > > It would be nice to have a basic PoC for that, even if it remains a > PoC for the foreseeable future (i.e. even if it cannot be committed to > Postgres 14). This feature should definitely be something that your > patch series *enables*. 
I'd feel good about having covered that > question as part of this basic design work if there was a PoC. That > alone should make it 100% clear that it's easy to do (or no harder > than it should be -- it should ideally be compatible with your basic > design). The exact criteria that we use for deciding whether or not to > skip index cleanup (which probably should not just be "this VACUUM is > is_wraparound, good enough" in the final version) may need to be > debated at length on pgsql-hackers. Even still, it is "just a detail" > in the code. Whereas being *able* to do that with your design (now or > in the future) seems essential now. Agreed. I'll write a PoC patch for that. > > > * Store the answers to amvacuumstrategy into either the local memory > > or DSM (in parallel vacuum case) so that ambulkdelete can refer the > > answer to amvacuumstrategy. > > * Fix regression failures. > > * Update the documentation and commments. > > > > Note that 0003 patch is still PoC quality, lacking the btree meta page > > version upgrade. > > This patch is not the hard part, of course -- there really isn't that > much needed here compared to vacuumlazy.c. So this patch seems like > the simplest 1 out of the 3 (at least to me). > > Some feedback on the third patch: > > * The new btm_last_deletion_nblocks metapage field should use P_NONE > (which is 0) to indicate never having been vacuumed -- not > InvalidBlockNumber (which is 0xFFFFFFFF). > > This is more idiomatic in nbtree, which is nice, but it has a very > significant practical advantage: it ensures that every heapkeyspace > nbtree index (i.e. those on recent nbtree versions) can be treated as > if it has the new btm_last_deletion_nblocks field all along, even when > it actually built on Postgres 12 or 13. This trick will let you avoid > dealing with the headache of bumping BTREE_VERSION, which is a huge > advantage. > > Note that this is the same trick I used to avoid bumping BTREE_VERSION > when the btm_allequalimage field needed to be added (for the nbtree > deduplication feature added to Postgres 13). > That's a nice way with a great advantage. I'll use P_NONE. > * Forgot to do this in the third patch (think I made this same mistake > once myself): > > diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c > index 8bb180bbbe..88dfea9af3 100644 > --- a/contrib/pageinspect/btreefuncs.c > +++ b/contrib/pageinspect/btreefuncs.c > @@ -653,7 +653,7 @@ bt_metap(PG_FUNCTION_ARGS) > BTMetaPageData *metad; > TupleDesc tupleDesc; > int j; > - char *values[9]; > + char *values[10]; > Buffer buffer; > Page page; > HeapTuple tuple; > @@ -734,6 +734,11 @@ bt_metap(PG_FUNCTION_ARGS) Check. I'm updating and testing the patch. I'll submit the updated version patches tomorrow. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/ |
In reply to this post by Peter Geoghegan-4
On Tue, Jan 19, 2021 at 2:57 PM Peter Geoghegan <[hidden email]> wrote:
> Looks good. I'll give this version a review now. I will do a lot more > soon. I need to come up with a good benchmark for this, that I can > return to again and again during review as needed. I performed another benchmark, similar to the last one but with the latest version (v2), and over a much longer period. Attached is a summary of the whole benchmark, and log_autovacuum output from the logs of both the master branch and the patch. This was pgbench scale 2000, 4 indexes on pgbench_accounts, and a transaction with one update and two selects. Each run was 4 hours, and we alternate between patch and master for each run, and alternate between 16 and 32 clients. There were 8 4 hour runs in total, meaning the entire set of runs took 8 * 4 hours = 32 hours (not including initial load time and a few other small things like that). I used a 10k TPS rate limit, so TPS isn't interesting here. Latency is interesting -- we see a nice improvement in latency (i.e. a reduction) for the patch (see all.summary.out). The benefits of the patch are clearly visible when I drill down and look at the details. Each pgbench_accounts autovacuum VACUUM operation can finish faster with the patch because they can often skip at least some indexes (usually the PK, sometimes 3 out of 4 indexes total). But it's more subtle than some might assume. We're skipping indexes that VACUUM actually would have deleted *some* index tuples from, which is very good. Bottom-up index deletion is usually lazy, and only occasionally very eager, so you still have plenty of "floating garbage" index tuples in most pages. And now we see VACUUM behave a little more like bottom-up index deletion -- it is lazy when that is appropriate (with indexes that really only have floating garbage that is spread diffusely throughout the index structure), and eager when that is appropriate (with indexes that get much more garbage). The benefit is not really that we're avoiding doing I/O for index vacuuming (though that is one of the smaller benefits here). The real benefit is that VACUUM is not dirtying pages, since it skips indexes when it would be "premature" to vacuum them from an efficiency point of view. This is important because we know that Postgres throughput is very often limited by page cleaning. Also, the "economics" of this new behavior make perfect sense -- obviously it's more efficient to delay garbage cleanup until the point when the same page will be modified by a backend anyway -- in the case of this benchmark via bottom-up index deletion (which deletes all garbage tuples in the leaf page at the point that it runs for a subset of pointed-to heap pages -- it's not using an oldestXmin cutoff from 30 minutes ago). So whenever we dirty a page, we now get more value per additional-page-dirtied. I believe that controlling the number of pages dirtied by VACUUM is usually much more important than reducing the amount of read I/O from VACUUM, for reasons I go into on the recent "vacuum_cost_page_miss default value and modern hardware" thread. As a further consequence of all this, VACUUM can go faster safely and sustainably (since the cost limit is not affected so much by vacuum_cost_page_miss), which has its own benefits (e.g. oldestXmin cutoff doesn't get so old towards the end). Another closely related huge improvement that we see here is that the number of FPIs generated by VACUUM can be significantly reduced. This cost is closely related to the cost of dirtying pages, but it's worth mentioning separately. 
You'll see some of that in the log_autovacuum log output I attached. There is an archive with much more detailed information, including dumps from most pg_stat_* views at key intervals. This has way more information than anybody is likely to want: https://drive.google.com/file/d/1OTiErELKRZmYnuJuczO2Tfcm1-cBYITd/view?usp=sharing I did notice a problem, though. I now think that the criteria for skipping an index vacuum in the third patch from the series is too conservative, and that this led to an excessive number of index vacuums with the patch. This is probably because there was a tiny number of page splits in some of the indexes that were not really supposed to grow. I believe that this is caused by ANALYZE running -- I think that it prevented bottom-up deletion from keeping a few of the hottest pages from splitting (that can take 5 or 6 seconds) at a few points over the 32 hour run. For example, the index named "tenner" grew by 9 blocks, starting out at 230,701 and ending up at 230,710 (to see this, extract the files from the archive and "diff patch.r1c16.initial_pg_relation_size.out patch.r2c32.after_pg_relation_size.out"). I now think that 0 blocks added is unnecessarily restrictive -- a small tolerance still seems like a good idea, though (let's still be somewhat conservative about it). Maybe a better criteria would be for nbtree to always proceed with index vacuuming when the index size is less than 2048 blocks (16MiB with 8KiB BLCKSZ). If an index is larger than that, then compare the last/old block count to the current block count (at the point that we decide if index vacuuming is going to go ahead) by rounding up both values to the next highest 2048 block increment. This formula is pretty arbitrary, but probably works as well as most others. It's a good iteration for the next version of the patch/further testing, at least. BTW, it would be nice if there was more instrumentation, say in the log output produced when log_autovacuum is on. That would make it easier to run these benchmarks -- I could verify my understanding of the work done for each particular av operation represented in the log. Though the default log_autovacuum log output is quite informative, it would be nice if the specifics were more obvious (maybe this could just be for the review/testing, but it might become something for users if it seems useful). -- Peter Geoghegan ![]() ![]() ![]() |
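The rounding rule suggested above might be sketched as follows (names are illustrative only). Under this rule the 230,701 -> 230,710 growth of "tenner" rounds to the same 2048-block bucket, so it would no longer force an index vacuum:

    #define BT_VACUUM_SIZE_QUANTUM  2048    /* 16MiB with 8KiB blocks */

    static bool
    btree_considered_grown(BlockNumber last_nblocks, BlockNumber cur_nblocks)
    {
        BlockNumber last_rounded;
        BlockNumber cur_rounded;

        /* Small indexes always get a real bulk deletion */
        if (cur_nblocks < BT_VACUUM_SIZE_QUANTUM)
            return true;

        /* Round both block counts up to the next 2048-block boundary */
        last_rounded = ((last_nblocks + BT_VACUUM_SIZE_QUANTUM - 1) /
                        BT_VACUUM_SIZE_QUANTUM) * BT_VACUUM_SIZE_QUANTUM;
        cur_rounded = ((cur_nblocks + BT_VACUUM_SIZE_QUANTUM - 1) /
                       BT_VACUUM_SIZE_QUANTUM) * BT_VACUUM_SIZE_QUANTUM;

        return cur_rounded > last_rounded;
    }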
In reply to this post by Masahiko Sawada
On Thu, Jan 21, 2021 at 11:24 PM Masahiko Sawada <[hidden email]> wrote:
> > On Wed, Jan 20, 2021 at 7:58 AM Peter Geoghegan <[hidden email]> wrote: > > > > On Sun, Jan 17, 2021 at 9:18 PM Masahiko Sawada <[hidden email]> wrote: > > > After more thought, I think that ambulkdelete needs to be able to > > > refer the answer to amvacuumstrategy. That way, the index can skip > > > bulk-deletion when lazy vacuum doesn't vacuum heap and it also doesn’t > > > want to do that. > > > > Makes sense. > > > > BTW, your patch has bitrot already. Peter E's recent pageinspect > > commit happens to conflict with this patch. It might make sense to > > produce a new version that just fixes the bitrot, so that other people > > don't have to deal with it each time. > > > > > I’ve attached the updated version patch that includes the following changes: > > > > Looks good. I'll give this version a review now. I will do a lot more > > soon. I need to come up with a good benchmark for this, that I can > > return to again and again during review as needed. > > Thank you for reviewing the patches. > > > > > Some feedback on the first patch: > > > > * Just so you know: I agree with you about handling > > VACOPT_TERNARY_DISABLED in the index AM's amvacuumstrategy routine. I > > think that it's better to do that there, even though this choice may > > have some downsides. > > > > * Can you add some "stub" sgml doc changes for this? Doesn't have to > > be complete in any way. Just a placeholder for later, that has the > > correct general "shape" to orientate the reader of the patch. It can > > just be a FIXME comment, plus basic mechanical stuff -- details of the > > new amvacuumstrategy_function routine and its signature. > > > > 0002 patch had the doc update (I mistakenly included it to 0002 > patch). Is that update what you meant? > > > Some feedback on the second patch: > > > > * Why do you move around IndexVacuumStrategy in the second patch? > > Looks like a rebasing oversight. > > Check. > > > > > * Actually, do we really need the first and second patches to be > > separate patches? I agree that the nbtree patch should be a separate > > patch, but dividing the first two sets of changes doesn't seem like it > > adds much. Did I miss some something? > > I separated the patches since I used to have separate patches when > adding other index AM options required by parallel vacuum. But I > agreed to merge the first two patches. > > > > > * Is the "MAXALIGN(sizeof(ItemIdData)))" change in the definition of > > MaxHeapTuplesPerPage appropriate? Here is the relevant section from > > the patch: > > > > diff --git a/src/include/access/htup_details.h > > b/src/include/access/htup_details.h > > index 7c62852e7f..038e7cd580 100644 > > --- a/src/include/access/htup_details.h > > +++ b/src/include/access/htup_details.h > > @@ -563,17 +563,18 @@ do { \ > > /* > > * MaxHeapTuplesPerPage is an upper bound on the number of tuples that can > > * fit on one heap page. (Note that indexes could have more, because they > > - * use a smaller tuple header.) We arrive at the divisor because each tuple > > - * must be maxaligned, and it must have an associated line pointer. > > + * use a smaller tuple header.) We arrive at the divisor because each line > > + * pointer must be maxaligned. 
> > *** SNIP *** > > #define MaxHeapTuplesPerPage \ > > - ((int) ((BLCKSZ - SizeOfPageHeaderData) / \ > > - (MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData)))) > > + ((int) ((BLCKSZ - SizeOfPageHeaderData) / (MAXALIGN(sizeof(ItemIdData))))) > > > > It's true that ItemIdData structs (line pointers) are aligned, but > > they're not MAXALIGN()'d. If they were then the on-disk size of line > > pointers would generally be 8 bytes, not 4 bytes. > > You're right. Will fix. > > > > > * Maybe it would be better if you just changed the definition such > > that "MAXALIGN(SizeofHeapTupleHeader)" became "MAXIMUM_ALIGNOF", with > > no other changes? (Some variant of this suggestion might be better, > > not sure.) > > > > For some reason that feels a bit safer: we still have an "imaginary > > tuple header", but it's just 1 MAXALIGN() quantum now. This is still > > much less than the current 3 MAXALIGN() quantums (i.e. what > > MaxHeapTuplesPerPage treats as the tuple header size). Do you think > > that this alternative approach will be noticeably less effective > > within vacuumlazy.c? > > > > Note that you probably understand the issue with MaxHeapTuplesPerPage > > for vacuumlazy.c better than I do currently. I'm still trying to > > understand your choices, and to understand what is really important > > here. > > Yeah, using MAXIMUM_ALIGNOF seems better for safety. I shared my > thoughts on the issue with MaxHeapTuplesPerPage yesterday. I think we > need to discuss how to deal with that. > > > > > * Maybe add a #define for the value 0.7? (I refer to the value used in > > choose_vacuum_strategy() to calculate a "this is the number of LP_DEAD > > line pointers that we consider too many" cut off point, which is to be > > applied throughout lazy_scan_heap() processing.) > > > > Agreed. > > > * I notice that your new lazy_vacuum_table_and_indexes() function is > > the only place that calls lazy_vacuum_table_and_indexes(). I think > > that you should merge them together -- replace the only remaining call > > to lazy_vacuum_table_and_indexes() with the body of the function > > itself. Having a separate lazy_vacuum_table_and_indexes() function > > doesn't seem useful to me -- it doesn't actually hide complexity, and > > might even be harder to maintain. > > lazy_vacuum_table_and_indexes() is called at two places: after > maintenance_work_mem run out (around L1097) and the end of > lazy_scan_heap() (around L1726). I defined this function to pack the > operations such as choosing a strategy, vacuuming indexes and > vacuuming heap. Without this function, will we end up writing the same > codes twice there? > > > > > * I suggest thinking about what the last item will mean for the > > reporting that currently takes place in > > lazy_vacuum_table_and_indexes(), but will now go in an expanded > > lazy_vacuum_table_and_indexes() -- how do we count the total number of > > index scans now? > > > > I don't actually believe that the logic needs to change, but some kind > > of consolidation and streamlining seems like it might be helpful. > > Maybe just a comment that says "note that all index scans might just > > be no-ops because..." -- stuff like that. > > What do you mean by the last item and what report? I think > lazy_vacuum_table_and_indexes() itself doesn't report anything and > vacrelstats->num_index_scan counts the total number of index scans. > > > > > * Any idea about how hard it will be to teach is_wraparound VACUUMs to > > skip index cleanup automatically, based on some practical/sensible > > criteria? 
> > One simple idea would be to have a to-prevent-wraparound autovacuum > worker disables index cleanup (i.g., setting VACOPT_TERNARY_DISABLED > to index_cleanup). But a downside (but not a common case) is that > since a to-prevent-wraparound vacuum is not necessarily an aggressive > vacuum, it could skip index cleanup even though it cannot move > relfrozenxid forward. > > > > > It would be nice to have a basic PoC for that, even if it remains a > > PoC for the foreseeable future (i.e. even if it cannot be committed to > > Postgres 14). This feature should definitely be something that your > > patch series *enables*. I'd feel good about having covered that > > question as part of this basic design work if there was a PoC. That > > alone should make it 100% clear that it's easy to do (or no harder > > than it should be -- it should ideally be compatible with your basic > > design). The exact criteria that we use for deciding whether or not to > > skip index cleanup (which probably should not just be "this VACUUM is > > is_wraparound, good enough" in the final version) may need to be > > debated at length on pgsql-hackers. Even still, it is "just a detail" > > in the code. Whereas being *able* to do that with your design (now or > > in the future) seems essential now. > > Agreed. I'll write a PoC patch for that. > > > > > > * Store the answers to amvacuumstrategy into either the local memory > > > or DSM (in parallel vacuum case) so that ambulkdelete can refer the > > > answer to amvacuumstrategy. > > > * Fix regression failures. > > > * Update the documentation and commments. > > > > > > Note that 0003 patch is still PoC quality, lacking the btree meta page > > > version upgrade. > > > > This patch is not the hard part, of course -- there really isn't that > > much needed here compared to vacuumlazy.c. So this patch seems like > > the simplest 1 out of the 3 (at least to me). > > > > Some feedback on the third patch: > > > > * The new btm_last_deletion_nblocks metapage field should use P_NONE > > (which is 0) to indicate never having been vacuumed -- not > > InvalidBlockNumber (which is 0xFFFFFFFF). > > > > This is more idiomatic in nbtree, which is nice, but it has a very > > significant practical advantage: it ensures that every heapkeyspace > > nbtree index (i.e. those on recent nbtree versions) can be treated as > > if it has the new btm_last_deletion_nblocks field all along, even when > > it actually built on Postgres 12 or 13. This trick will let you avoid > > dealing with the headache of bumping BTREE_VERSION, which is a huge > > advantage. > > > > Note that this is the same trick I used to avoid bumping BTREE_VERSION > > when the btm_allequalimage field needed to be added (for the nbtree > > deduplication feature added to Postgres 13). > > > > That's a nice way with a great advantage. I'll use P_NONE. > > > * Forgot to do this in the third patch (think I made this same mistake > > once myself): > > > > diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c > > index 8bb180bbbe..88dfea9af3 100644 > > --- a/contrib/pageinspect/btreefuncs.c > > +++ b/contrib/pageinspect/btreefuncs.c > > @@ -653,7 +653,7 @@ bt_metap(PG_FUNCTION_ARGS) > > BTMetaPageData *metad; > > TupleDesc tupleDesc; > > int j; > > - char *values[9]; > > + char *values[10]; > > Buffer buffer; > > Page page; > > HeapTuple tuple; > > @@ -734,6 +734,11 @@ bt_metap(PG_FUNCTION_ARGS) > > Check. > > I'm updating and testing the patch. I'll submit the updated version > patches tomorrow. 
I've attached the updated version patch that incorporated the comments I got so far. I merged the previous 0001 and 0002 patches. The 0003 patch is now another PoC patch that disables index cleanup automatically when an autovacuum is both a to-prevent-wraparound vacuum and an aggressive vacuum.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
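For reference, the core of such a PoC presumably boils down to something like this, early in heap_vacuum_rel() (a guess at the shape, not the actual 0003 patch):

    /*
     * Sketch: an anti-wraparound autovacuum that is also an aggressive vacuum
     * skips index cleanup automatically, unless the user asked for a specific
     * INDEX_CLEANUP setting.
     */
    if (params->is_wraparound && aggressive &&
        params->index_cleanup == VACOPT_TERNARY_DEFAULT)
        params->index_cleanup = VACOPT_TERNARY_DISABLED;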