On Thu, Dec 31, 2020 at 11:01 AM Zhihong Yu <[hidden email]> wrote:
> Happy New Year.

Happy New Year.

> For v12-0001-Pass-down-logically-unchanged-index-hint.patch
>
> +    if (hasexpression)
> +        return false;
> +
> +    return true;
>
> The above can be written as return !hasexpression;
> The negation is due to the return value from index_unchanged_by_update_var_walker() being the inverse of "index unchanged".
> If you align the meaning of the return value from index_unchanged_by_update_var_walker() with that of index_unchanged_by_update(), the negation is not needed.

I don't think that that represents an improvement. The negation seems clearer to me because we're negating the *absence* of something that we search for more or less linearly (a modified column from the index). This happens when determining whether to do an extra thing (provide the "logically unchanged" hint to this particular index/aminsert() call). To me, the negation reflects that high-level structure.

> For struct xl_btree_delete:
>
> +    /* DELETED TARGET OFFSET NUMBERS FOLLOW */
> +    /* UPDATED TARGET OFFSET NUMBERS FOLLOW */
> +    /* UPDATED TUPLES METADATA (xl_btree_update) ARRAY FOLLOWS */
>
> I guess the comment is for illustration purposes. Maybe you can write the comment in lower case.

The comment is written like this (in upper case) to be consistent with an existing convention. If this were a green-field situation I suppose I might not write it that way, but that's not how I assess these things. I always try to give the existing convention the benefit of the doubt. In this case I don't think that it matters very much, and so I'm inclined to stick with the existing style.

> +#define BOTTOMUP_FAVORABLE_STRIDE 3
>
> Adding a comment on why the number 3 is chosen would be helpful for people to understand the code.

Noted - I will add something about that to the next revision.

--
Peter Geoghegan
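For readers following the exchange, the shape being debated is roughly the following. This is only a paraphrase with a simplified argument list, not the actual patch code:

static bool
index_unchanged_by_update(Node *indexExprs, Bitmapset *updatedCols)
{
    bool        hasexpression;

    /* true when the walker finds an indexed column modified by the UPDATE */
    hasexpression = index_unchanged_by_update_var_walker(indexExprs,
                                                         updatedCols);

    /* Zhihong's suggestion amounts to: return !hasexpression; */
    if (hasexpression)
        return false;

    return true;
}

Peter's point is that the two-statement form emphasizes that the "unchanged" answer is the negation of a linear search that came up empty.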
In reply to this post by Peter Geoghegan-4
On 02/12/2020 00:18, Peter Geoghegan wrote:
> On Tue, Dec 1, 2020 at 1:50 AM Heikki Linnakangas <[hidden email]> wrote:
>> On 30/11/2020 21:50, Peter Geoghegan wrote:
>>> +} TM_IndexDelete;
>>> +} TM_IndexStatus;
>>
>> Is it really significantly faster to have two arrays? If you had just one array, each element would be only 12 bytes long. That's not much smaller than TM_IndexDeletes, which is 8 bytes.
>
> Yeah, but the swap operations really matter here. At one point I found that to be the case, anyway. That might no longer be true, though. It might be that the code became less performance critical because other parts of the design improved, resulting in the code getting called less frequently. But if that is true then it has a lot to do with the power-of-two bucketing that you go on to ask about -- that helped performance a lot in certain specific cases (as I go into below).
>
> I will add a TODO item for myself, to look into this again. You may well be right.
>
>>> + /* First sort caller's array by TID */
>>> + heap_tid_shellsort(delstate->deltids, delstate->ndeltids);
>>
>> Instead of sorting the array by TID, wouldn't it be enough to sort by just the block numbers?
>
> I don't understand. Yeah, I guess that we could do our initial sort of the deltids array (the heap_tid_shellsort() call) just using BlockNumber (not TID). But OTOH there might be some small locality benefit to doing a proper TID sort at the level of each heap page. And even if there isn't any such benefit, does it really matter?

You said above that heap_tid_shellsort() is very performance critical, and that's why you use the two arrays approach. If it's so performance critical that swapping 8 bytes vs 12 byte array elements makes a difference, I would guess that comparing TID vs just the block numbers would also make a difference.

If you really want to shave cycles, you could also store BlockNumber and OffsetNumber in the TM_IndexDelete array, instead of ItemPointerData. What's the difference, you might ask? ItemPointerData stores the block number as two 16 bit integers that need to be reassembled into a 32 bit integer in the ItemPointerGetBlockNumber() macro.

My argument here is two-pronged: If this is performance critical, you could do these additional optimizations. If it's not, then you don't need the two-arrays trick; one array would be simpler.

I'm not sure how performance critical this really is, or how many instructions each of these optimizations might shave off, so of course I might be wrong and the tradeoff you have in the patch now really is the best. However, my intuition would be to use a single array, because that's simpler, and do all the other optimizations instead: store the tid in the struct as separate BlockNumber and OffsetNumber fields, and sort only by the block number. Those optimizations are very deep in the low level functions and won't add much to the mental effort of understanding the algorithm at a high level.

>>> + * While in general the presence of promising tuples (the hint that index
>>> + * AMs provide) is the best information that we have to go on, it is based
>>> + * on simple heuristics about duplicates in indexes that are understood to
>>> + * have specific flaws. We should not let the most promising heap pages
>>> + * win or lose on the basis of _relatively_ small differences in the total
>>> + * number of promising tuples. Small differences between the most
>>> + * promising few heap pages are effectively ignored by applying this
>>> + * power-of-two bucketing scheme.
>>> + *
>>
>> What are the "specific flaws"?
>
> I just meant the obvious: the index AM doesn't actually know for sure that there are any old versions on the leaf page that it will ultimately be able to delete. This uncertainty needs to be managed, including inside heapam.c. Feel free to suggest a better way of expressing that sentiment.

Hmm, maybe: "... is the best information that we have to go on, it is just a guess based on simple heuristics about duplicates in indexes".

>> I understand that this is all based on rough heuristics, but is there any advantage to rounding/bucketing the numbers like this? Even if small differences in the total number of promising tuples are essentially noise that can be safely ignored, is there any harm in letting those small differences guide the choice? Wouldn't it be essentially the same as picking at random, or are those small differences somehow biased?
>
> Excellent question! It actually helps enormously, especially with low cardinality data that makes good use of the deduplication optimization (where it is especially important to keep the costs and the benefits in balance). This has been validated by benchmarking.
>
> This design naturally allows the heapam.c code to take advantage of both temporal and spatial locality. For example, let's say that you have several indexes all on the same table, which get lots of non-HOT UPDATEs, which are skewed. Naturally, the heap TIDs will match across each index -- these are index entries that are needed to represent successor versions (which are logically unchanged/version duplicate index tuples from the point of view of nbtree/nbtdedup.c). Introducing a degree of determinism to the order in which heap blocks are processed naturally takes advantage of the correlated nature of the index bloat. It naturally makes it much more likely that the also-correlated bottom-up index deletion passes (that occur across indexes on the same table) each process the same heap blocks close together in time -- with obvious performance benefits.
>
> In the extreme (but not particularly uncommon) case of non-HOT UPDATEs with many low cardinality indexes, each heapam.c call will end up doing *almost the same thing* across indexes. So we're making the correlated nature of the bloat (which is currently a big problem) work in our favor -- turning it on its head, you could say. Highly correlated bloat is not the exception -- it's actually the norm in the cases we're targeting here. If it wasn't highly correlated then it would already be okay to rely on VACUUM to get around to cleaning it later.
>
> This power-of-two bucketing design probably also helps when there is only one index. I could go into more detail on that, plus other variations, but perhaps the multiple index example suffices for now. I believe that there are a few interesting kinds of correlations here -- I bet you can think of some yourself. (Of course it's also possible and even likely that heap block correlation won't be important at all, but my response is "what specifically is the harm in being open to the possibility?".)

I see. Would be good to explain that pattern with multiple indexes in the comment.

- Heikki
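To make the layout point concrete, this is roughly what is being referred to (paraphrased from memory of the block.h/itemptr.h definitions; TM_IndexDeleteAlt is a hypothetical single-array element, not anything that exists in the tree):

/* How an ItemPointerData stores its block number: as two 16-bit halves... */
typedef struct BlockIdData
{
    uint16      bi_hi;
    uint16      bi_lo;
} BlockIdData;

typedef struct ItemPointerData
{
    BlockIdData     ip_blkid;
    OffsetNumber    ip_posid;
} ItemPointerData;

/* ...which ItemPointerGetBlockNumber() must reassemble on every comparison */
#define BlockIdGetBlockNumber(blockId) \
    ((((BlockNumber) (blockId)->bi_hi) << 16) | ((BlockNumber) (blockId)->bi_lo))

/* Hypothetical single-array element along the lines Heikki suggests:
 * TID stored pre-split, status fields folded in (about 12 bytes) */
typedef struct TM_IndexDeleteAlt
{
    BlockNumber     blk;            /* plain 32-bit block number */
    OffsetNumber    off;            /* offset within the heap block */
    OffsetNumber    idxoffnum;      /* index page offset number */
    bool            knowndeletable;
    bool            promising;
    int16           freespace;
} TM_IndexDeleteAlt;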
In reply to this post by Peter Geoghegan-4
Thu, 31 Dec 2020 at 03:55, Peter Geoghegan <[hidden email]>:

> Attached is v12, which fixed bitrot against the master branch.

I've looked through the v12 patch.

I like the new outline:

- _bt_delete_or_dedup_one_page() is the main entry for the new code
- first we try _bt_simpledel_pass(), which does improved cleanup of LP_DEAD entries
- then (if necessary) _bt_bottomupdel_pass() for bottom-up deletion
- finally, we perform _bt_dedup_pass() for deduplication

We split the leaf page only if all the actions above failed to provide enough space.

Some comments on the code.

v12-0001
--------

1. For the following comment

+ * Only do this for key columns.  A change to a non-key column within an
+ * INCLUDE index should not be considered because that's just payload to
+ * the index (they're not unlike table TIDs to the index AM).

The last part of it (in the parenthesis) is difficult to grasp due to the double negation (not unlike). I think it's better to rephrase it.

2. After reading the patch, I also think that the fact that index_unchanged_by_update() and index_unchanged_by_update_var_walker() return different bool states (i.e. when the latter returns true, the first one returns false) is a bit misleading. Although the logic as it is looks fine, maybe index_unchanged_by_update_var_walker() can be renamed to avoid this confusion, to something like index_expression_changed_walker()?

v12-0002
--------

1. Thanks for the comments, they're well made and do help to read the code.

2. I'm not sure the bottomup_delete_items parameter is very helpful. In order to disable bottom-up deletion, a DBA needs somehow to measure its impact on a particular index. Currently I do not see how to achieve this. Not sure if this is overly important, though, as you have a similar parameter for deduplication.

3. It feels like indexUnchanged would be better as indexChanged, with its usage negated in the code. As !indexChanged reads more naturally than !indexUnchanged, at least to me.

This is all I have. I agree that this code is pretty close to being committed.

Now for the tests.

First, I ran a 2-hour long case with the same setup as I used in my e-mail from 15 November. I found no difference between patch and master whatsoever. Which makes me think that current master is quite good at keeping bloat under control (not sure if this is an effect of commit 4228817449 or deduplication).

I created another setup (see attached testcases). Basically, I emulated queue operations (INSERT at the end and DELETE

Victor Yegorov
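The outline Victor gives above can be pictured as the following toy model. The types and stub passes are made up, not the actual nbtree code; each *_pass() stands in for the corresponding _bt_*_pass() function:

#include <stdbool.h>

typedef struct LeafPage
{
    int     freespace;      /* bytes currently free on the page */
    bool    has_lp_dead;    /* any items already marked LP_DEAD? */
} LeafPage;

static int simpledel_pass(LeafPage *p)   { return p->freespace; }
static int bottomupdel_pass(LeafPage *p) { return p->freespace; }
static int dedup_pass(LeafPage *p)       { return p->freespace; }

/* true: the incoming tuple now fits; false: the caller must split the page */
static bool
delete_or_dedup_one_page(LeafPage *page, int newitemsz, bool indexUnchanged)
{
    /* 1. improved cleanup of already-set LP_DEAD items */
    if (page->has_lp_dead && simpledel_pass(page) >= newitemsz)
        return true;

    /* 2. bottom-up deletion, only when the hint suggests version churn */
    if (indexUnchanged && bottomupdel_pass(page) >= newitemsz)
        return true;

    /* 3. deduplication as the last resort before splitting */
    if (dedup_pass(page) >= newitemsz)
        return true;

    return false;
}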
Mon, 4 Jan 2021 at 17:28, Victor Yegorov <[hidden email]>:
Sorry, hit Send too early.

So, I emulated queue operations (INSERT at the end and DELETE from the head). And also made 5-minute transactions appear in the background for the whole duration of the test. 3 pgbench instances were run in parallel on a scale 3000 bench database with modifications (attached).

Master
------
        relname        |   nrows   | blk_before | mb_before | blk_after | mb_after |  diff
-----------------------+-----------+------------+-----------+-----------+----------+--------
 pgbench_accounts      | 300000000 |    4918033 |   38422.1 |   5065575 |  39574.8 |  +3.0%
 accounts_mtime        | 300000000 |    1155119 |    9024.4 |   1287656 |  10059.8 | +11.5%
 fiver                 | 300000000 |     427039 |    3336.2 |    567755 |   4435.6 | +33.0%
 pgbench_accounts_pkey | 300000000 |     822573 |    6426.4 |   1033344 |   8073.0 | +25.6%
 score                 | 300000000 |     284022 |    2218.9 |    458502 |   3582.0 | +61.4%
 tenner                | 300000000 |     346050 |    2703.5 |    417985 |   3265.5 | +20.8%
(6 rows)

DB size: 65.2..72.3 (+7.1GB / +10.9%)
TPS: 2297 / 495

Patched
-------
        relname        |   nrows   | blk_before | mb_before | blk_after | mb_after |  diff
-----------------------+-----------+------------+-----------+-----------+----------+--------
 pgbench_accounts      | 300000000 |    4918033 |   38422.1 |   5067500 |  39589.8 |  +3.0%
 accounts_mtime        | 300000000 |    1155119 |    9024.4 |   1283441 |  10026.9 | +11.1%
 fiver                 | 300000000 |     427039 |    3336.2 |    429101 |   3352.4 |  +0.5%
 pgbench_accounts_pkey | 300000000 |     822573 |    6426.4 |    826056 |   6453.6 |  +0.4%
 score                 | 300000000 |     284022 |    2218.9 |    285465 |   2230.2 |  +0.5%
 tenner                | 300000000 |     346050 |    2703.5 |    347695 |   2716.4 |  +0.5%
(6 rows)

DB size: 65.2..67.5 (+2.3GB / +3.5%)

As you can see, TPS are very much similar, but the fact that we have no bloat for the patched version makes me very happy!

On the graphs, you can clearly see extra write activity performed by the backends of the patched version.

Victor Yegorov
In reply to this post by Heikki Linnakangas
On Mon, Jan 4, 2021 at 4:08 AM Heikki Linnakangas <[hidden email]> wrote:
> You said above that heap_tid_shellsort() is very performance critical, and that's why you use the two arrays approach. If it's so performance critical that swapping 8 bytes vs 12 byte array elements makes a difference, I would guess that comparing TID vs just the block numbers would also make a difference.
>
> If you really want to shave cycles, you could also store BlockNumber and OffsetNumber in the TM_IndexDelete array, instead of ItemPointerData. What's the difference, you might ask? ItemPointerData stores the block number as two 16 bit integers that need to be reassembled into a 32 bit integer in the ItemPointerGetBlockNumber() macro.
>
> My argument here is two-pronged: If this is performance critical, you could do these additional optimizations. If it's not, then you don't need the two-arrays trick; one array would be simpler.

That's reasonable. The reason why I haven't been more decisive here is because the question of whether or not it matters is actually very complicated, for reasons that have little to do with sorting. The more effective the mechanism is each time (the more bytes it allows nbtree to free from each leaf page), the less often it is called, and the less performance critical the overhead per operation is. On the other hand there are a couple of other questions about what we do in heapam.c that aren't quite resolved yet (e.g. exact "favorable blocks"/prefetching behavior in bottom-up case), that probably affect how important the heap_tid_shellsort() microoptimisations are.

I think that it makes sense to make a final decision on this at the last minute, once everything else is resolved, since the implicit dependencies make any other approach much too complicated. I agree that this kind of microoptimization is best avoided, but I really don't want to have to worry about regressions in workloads that I now understand fully. I think that the sort became less important on perhaps 2 or 3 occasions during development, even though that was never really the goal that I had in mind in each case. I'll do my best to avoid it.

> Hmm, maybe: "... is the best information that we have to go on, it is just a guess based on simple heuristics about duplicates in indexes".

I'll add something like that to the next revision.

> I see. Would be good to explain that pattern with multiple indexes in the comment.

Will do -- it is the single best example of how heap block locality can matter with a real workload, so it makes sense to go with it in explanatory comments.

--
Peter Geoghegan
In reply to this post by Victor Yegorov
On Mon, Jan 4, 2021 at 8:28 AM Victor Yegorov <[hidden email]> wrote:
> I've looked through the v12 patch.
>
> I like the new outline:
>
> - _bt_delete_or_dedup_one_page() is the main entry for the new code
> - first we try _bt_simpledel_pass(), which does improved cleanup of LP_DEAD entries
> - then (if necessary) _bt_bottomupdel_pass() for bottom-up deletion
> - finally, we perform _bt_dedup_pass() for deduplication
>
> We split the leaf page only if all the actions above failed to provide enough space.

I'm thinking of repeating the LP_DEAD enhancement within GiST and hash shortly after this, as follow-up work. Their implementation of LP_DEAD deletion was always based on the nbtree original, and I think that it makes sense to keep everything in sync. The simple deletion enhancement probably makes just as much sense in these other places.

> + * Only do this for key columns.  A change to a non-key column within an
> + * INCLUDE index should not be considered because that's just payload to
> + * the index (they're not unlike table TIDs to the index AM).
>
> The last part of it (in the parenthesis) is difficult to grasp due to the double negation (not unlike). I think it's better to rephrase it.

Okay, will do.

> 2. After reading the patch, I also think that the fact that index_unchanged_by_update() and index_unchanged_by_update_var_walker() return different bool states (i.e. when the latter returns true, the first one returns false) is a bit misleading. Although the logic as it is looks fine, maybe index_unchanged_by_update_var_walker() can be renamed to avoid this confusion, to something like index_expression_changed_walker()?

I agree. I'll use the name index_expression_changed_walker() in the next revision.

> 2. I'm not sure the bottomup_delete_items parameter is very helpful. In order to disable bottom-up deletion, a DBA needs somehow to measure its impact on a particular index. Currently I do not see how to achieve this. Not sure if this is overly important, though, as you have a similar parameter for deduplication.

I'll take bottomup_delete_items out, since I think that you may be right about that. It can be another "decision to recheck mid-beta" thing (on the PostgreSQL 14 open items list), so we can delay making a final decision on it until after it gets tested by the broadest possible group of people.

> 3. It feels like indexUnchanged would be better as indexChanged, with its usage negated in the code. As !indexChanged reads more naturally than !indexUnchanged, at least to me.

I don't want to do that because then we're negating "indexChanged" as part of our gating condition to the bottom-up deletion pass code. To me this feels wrong, since the hint only exists to be used to trigger index tuple deletion. This is unlikely to ever change.

> First, I ran a 2-hour long case with the same setup as I used in my e-mail from 15 November. I found no difference between patch and master whatsoever. Which makes me think that current master is quite good at keeping bloat under control (not sure if this is an effect of commit 4228817449 or deduplication).

Commit 4228817449 can't have improved the performance of the master branch -- that commit was just about providing a *correct* latestRemovedXid value. It cannot affect how many index tuples get deleted per pass, or anything like that.

Not surprised that you didn't see many problems with the master branch on your first attempt. It's normal for there to be non-linear effects with this stuff, and these are very hard (maybe even impossible) to model. For example, even with artificial test data you often see distinct "waves" of page splits, which is a phenomenon pretty well described by the "Waves of misery after index creation" paper [1]. It's likely that your original stress test case (the 15 November one) would have shown significant bloat if you just kept it up for long enough.

One goal for this project is to make the performance characteristics more predictable. The performance characteristics are unpredictable precisely because they're pathological. B-Trees will always have non-linear performance characteristics, but they can be made a lot less chaotic in practice.

To be honest, I was slightly surprised that your more recent stress test (the queue thing) worked out so well for the patch. But not too surprised, because I don't necessarily expect to understand how the patch can help for any given workload -- this is dynamic behavior that's part of a complex system. I first thought that maybe the presence of the INSERTs and DELETEs would mess things up. But I was wrong -- everything still worked well, probably because bottom-up index deletion "cooperated" with deduplication. The INSERTs could not trigger a bottom-up deletion (which might have been useful for these particular INSERTs), but that didn't actually matter because deduplication was triggered instead, which "held the line" for long enough for bottom-up deletion to eventually get triggered (by non-HOT UPDATEs).

As I've said before, I believe that the workload should "figure out the best strategy for handling bloat on its own". A diversity of strategies is available, which can adapt to many different kinds of workloads organically. That kind of approach is sometimes necessary with a complex system IMV.

[1] https://btw.informatik.uni-rostock.de/download/tagungsband/B2-2.pdf

--
Peter Geoghegan
On Thu, Jan 7, 2021 at 3:07 PM Peter Geoghegan <[hidden email]> wrote:
> I agree. I'll use the name index_expression_changed_walker() in the next revision.

Attached is v13, which has this tweak, and other miscellaneous cleanup based on review from both Victor and Heikki. I consider this version of the patch to be committable. I intend to commit something close to it in the next week, probably no later than Thursday. I still haven't got to the bottom of the shellsort question raised by Heikki. I intend to do further performance validation before committing the patch. I will look into the shellsort thing again as part of this final performance validation work -- perhaps I can get rid of the specialized shellsort implementation entirely, simplifying the state structs added to tableam.h. (As I said before, it seems best to address this last of all to avoid making performance validation even more complicated.)

This version of the patch is notable for removing the index storage param, and for having lots of comment updates and documentation consolidation, particularly in heapam.c. Many of the latter changes are based on feedback from Heikki. Note that all of the discussion of heapam level locality has been consolidated, and is now mostly confined to a fairly large comment block over bottomup_nblocksfavorable() in heapam.c. I also cut down on redundancy among comments about the design at the whole-patch level. A small amount of redundancy in design docs/comments is a good thing IMV. It was hard to get the balance exactly right, since bottom-up index deletion is by its very nature a mechanism that requires the index AM and the tableam to closely cooperate -- which is a novel thing.

This isn't 100% housekeeping changes, though. I did add one new minor optimization to v13: We now count the heap block of the incoming new item index tuple's TID (the item that won't fit on the leaf page as-is) as an LP_DEAD-related block for the purposes of determining which heap blocks will be visited during simple index tuple deletion. The extra cost of doing this is low: when the new item heap block is visited purely due to this new behavior, we're still practically guaranteed to not get a buffer miss to read from the heap page. The reason should be obvious: the executor is currently in the process of modifying that same heap page anyway. The benefits are also high relative to the cost. This heap block in particular seems to be very promising as a place to look for deletable TIDs (I tested this with custom instrumentation and microbenchmarks). I believe that this effect exists because by its very nature garbage is often concentrated in recently modified pages. This is per the generational hypothesis, an important part of the theory behind GC algorithms for automated memory management (GC theory seems to have real practical relevance to the GC/VACUUM problems in Postgres, at least at a high level).

Of course we still won't do any simple deletion operations unless there is at least one index tuple with its LP_DEAD bit set in the first place at the point that it looks like the page will overflow (no change there). As always, we're just piggy-backing some extra work on top of an expensive operation that needed to take place anyway. I couldn't resist adding this new minor optimization at this late stage, because it is such a bargain.

Thanks
--
Peter Geoghegan
Mon, 11 Jan 2021 at 01:07, Peter Geoghegan <[hidden email]>:
I've checked this version quickly. It applies and compiles without issues. `make check` and `make check-world` reported no issue.

But `make installcheck-world` failed on:

…
test explain                      ... FAILED       22 ms
test event_trigger                ... ok          178 ms
test fast_default                 ... ok          262 ms
test stats                        ... ok          586 ms

========================
 1 of 202 tests failed.
========================

(see attached diff). It doesn't look like the fault of this patch, though.

I suppose you plan to send another revision before committing this. Therefore I didn't perform any tests here, will wait for the next version.

Victor Yegorov
On Mon, Jan 11, 2021 at 12:19 PM Victor Yegorov <[hidden email]> wrote:
> (see attached diff). It doesn't look like the fault of this patch, though.
>
> I suppose you plan to send another revision before committing this. Therefore I didn't perform any tests here, will wait for the next version.

I imagine that this happened because you have track_io_timing=on in postgresql.conf. Doesn't the same failure take place with the current master branch? I have my own way of running the locally installed Postgres when I want "make installcheck" to pass that specifically accounts for this (and a few other similar things).

--
Peter Geoghegan
Mon, 11 Jan 2021 at 22:10, Peter Geoghegan <[hidden email]>:

> I imagine that this happened because you have track_io_timing=on

Oh, right, I hadn't thought of this. Thanks for pointing that out. Now everything looks good!

Victor Yegorov
In reply to this post by Peter Geoghegan-4
On Sun, Jan 10, 2021 at 4:06 PM Peter Geoghegan <[hidden email]> wrote:
> Attached is v13, which has this tweak, and other miscellaneous cleanup based on review from both Victor and Heikki. I consider this version of the patch to be committable. I intend to commit something close to it in the next week, probably no later than Thursday. I still haven't got to the bottom of the shellsort question raised by Heikki. I intend to do further performance validation before committing the patch.

I benchmarked the patch with one array and without the shellsort specialization using two patches on top of v13, both of which are attached. This benchmark was similar to other low cardinality index benchmarks I've run in the past (with indexes named fiver, tenner, score, plus a pgbench_accounts INCLUDE unique index instead of the regular primary key). I used pgbench scale 500, for 30 minutes, no rate limit. One run with 16 clients, another with 32 clients.

Original v13:

patch.r1c32.bench.out:
"tps = 32709.772257 (including connections establishing)"
"latency average = 0.974 ms"
"latency stddev = 0.514 ms"

patch.r1c16.bench.out:
"tps = 34670.929998 (including connections establishing)"
"latency average = 0.461 ms"
"latency stddev = 0.314 ms"

v13 + attached simplifying patches:

patch.r1c32.bench.out:
"tps = 31848.632770 (including connections establishing)"
"latency average = 1.000 ms"
"latency stddev = 0.548 ms"

patch.r1c16.bench.out:
"tps = 33864.099333 (including connections establishing)"
"latency average = 0.472 ms"
"latency stddev = 0.399 ms"

Clearly the optimization work still has some value, since we're looking at about a 2% - 3% increase in throughput here. This seems like it makes the cost of retaining the optimizations acceptable. The difference is much less visible with a rate limit, which is rather more realistic.

To some extent the sort is hot here because the patched version of Postgres updates tuples as fast as it can, and must therefore delete from the index as fast as it can. The sort itself was consistently near the top of the CPU cycle consumers according to "perf top", even if it didn't get as bad as what I saw in earlier versions of the patch months ago. There are actually two sorts involved here (not just the heapam.c shellsort). We need to sort the deltids array twice -- once in heapam.c, and a second time (to restore the original leaf-page-wise order) in nbtpage.c, using qsort(). I'm pretty sure that the latter sort also matters, though it matters less than the heapam.c shellsort.

I'm going to proceed with committing the original version of the patch -- I feel that this settles it. The code would be a bit tidier without two arrays or the shellsort, but it really doesn't make that much difference. Whereas the benefit is quite visible, and is something that all varieties of index tuple deletion (not just bottom-up deletion) will see.

--
Peter Geoghegan
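The two sorts compare different keys. Roughly, with the TM_IndexDelete element paraphrased from tableam.h and the comparators sketched rather than copied from the real code:

typedef struct TM_IndexDelete
{
    ItemPointerData tid;    /* table TID from the index tuple */
    int16           id;     /* back-link into the status array; ids were
                             * assigned in leaf-page order by nbtree */
} TM_IndexDelete;

/* heapam.c side: order by heap TID (block number, then offset) */
static int
delete_tid_cmp(TM_IndexDelete *a, TM_IndexDelete *b)
{
    return ItemPointerCompare(&a->tid, &b->tid);
}

/* nbtree side: once heapam.c is done, a qsort() on id puts the entries
 * back into the original leaf-page-wise order */
static int
restore_leafpage_order_cmp(const void *a, const void *b)
{
    const TM_IndexDelete *d1 = (const TM_IndexDelete *) a;
    const TM_IndexDelete *d2 = (const TM_IndexDelete *) b;

    return (d1->id > d2->id) - (d1->id < d2->id);
}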
On Mon, Jan 11, 2021 at 9:26 PM Peter Geoghegan <[hidden email]> wrote:
> I'm going to proceed with committing the original version of the patch
> -- I feel that this settles it.

Pushed both patches from the patch series just now.

Thanks for the code reviews and benchmarking work!

--
Peter Geoghegan
On Wed, Jan 13, 2021 at 11:16 PM Peter Geoghegan <[hidden email]> wrote:
> > On Mon, Jan 11, 2021 at 9:26 PM Peter Geoghegan <[hidden email]> wrote:
> > I'm going to proceed with committing the original version of the patch
> > -- I feel that this settles it.
>
> Pushed both patches from the patch series just now.

Nice work! I briefly read the commits and have a few questions.

Do we do this optimization (bottom-up deletion) only when the last item which can lead to page split has indexUnchanged set to true? If so, what if the last tuple doesn't have indexUnchanged but the previous ones do have?

Why are we using terminology bottom-up deletion and why not simply duplicate version deletion or something on those lines?

There is the following comment in the code:

'Index AM/tableam coordination is central to the design of bottom-up index deletion. The index AM provides hints about where to look to the tableam by marking some entries as "promising". Index AM does this with duplicate index tuples that are strongly suspected to be old versions left behind by UPDATEs that did not logically modify indexed values.'

How do we distinguish between version duplicate tuples (added due to the reason that some other index column is updated) or duplicate tuples (added because a user has inserted a row with duplicate value) for the purpose of bottom-up deletion? I think we need to distinguish between them because in the earlier case we can remove the old version of the tuple in the index but not in later. Or is it the case that indexAm doesn't differentiate between them but relies on heapAM to give the correct answer?

--
With Regards,
Amit Kapila.
On Sat, Jan 16, 2021 at 9:55 PM Amit Kapila <[hidden email]> wrote:
> Do we do this optimization (bottom-up deletion) only when the last item which can lead to page split has indexUnchanged set to true? If so, what if the last tuple doesn't have indexUnchanged but the previous ones do have?

Using the indexUnchanged hint as a triggering condition makes sure that non-HOT updaters are the ones that pay the cost of a bottom-up deletion pass. We create backpressure for non-HOT updaters (in indexes that are logically unchanged) specifically. (Of course it's also true that the presence of the indexUnchanged hint is highly suggestive of there being more version churn duplicates on the leaf page already, which is not actually certain.)

It's possible that there will be some mixture of inserts and non-HOT updates on the same leaf page, and as you say the implementation might fail to do a bottom-up pass based on the fact that an incoming item at the point of a would-be page split was a plain insert (and so didn't receive the hint). The possibility of these kinds of "missed opportunities" is okay because a page split was inevitable either way. You can imagine a case where that isn't true (i.e. a missed opportunity to avoid a page split), but that's kind of like imagining a fair coin flip taking place 100 times and coming up heads each time. It just isn't realistic for such a "mixed tendencies" leaf page to stay in flux (i.e. not ever split) forever, with the smartest triggering condition in the world -- it's too unstable.

An important part of understanding the design is to imagine things at the leaf page level, while thinking about how what that actually looks like differs from an ideal physical representation of the same leaf page that is much closer to the logical contents of the database. We're only interested in leaf pages where the number of logical rows is basically fixed over time (when there is version churn). Usually this just means that there are lots of non-HOT updates, but it can also work with lots of queue-like inserts and deletes in a unique index.

Ultimately, the thing that determines whether or not the bottom-up deletion optimization is effective for any given leaf page is the fact that it actually prevents the page from splitting despite lots of version churn -- this happens again and again. Another key concept here is that bottom-up deletion passes for the same leaf page are related in important ways -- it is often a mistake to think of individual bottom-up passes as independent events.

> Why are we using terminology bottom-up deletion and why not simply duplicate version deletion or something on those lines?

Why is that simpler? Also, it doesn't exactly delete duplicates. See my later remarks.

> How do we distinguish between version duplicate tuples (added due to the reason that some other index column is updated) or duplicate tuples (added because a user has inserted a row with duplicate value) for the purpose of bottom-up deletion? I think we need to distinguish between them because in the earlier case we can remove the old version of the tuple in the index but not in later. Or is it the case that indexAm doesn't differentiate between them but relies on heapAM to give the correct answer?

Bottom-up deletion uses the presence of duplicates as a starting point for determining which heap pages to visit, based on the assumption that at least some are obsolete versions. But heapam.c has additional heap-level heuristics that help too. It is quite possible and even likely that we will delete some non-duplicate tuples in passing, just because they're checked in passing -- they might turn out to be deletable, for whatever reason. We're also concerned (on the heapam.c side) about which heap pages have the most TIDs (any kind of TID, not just ones marked promising by the index AM), so this kind of "serendipity" is quite common in practice. Often the total number of heap pages that are pointed to by all index tuples on the page just isn't that many (8 - 10). And often cases with lots of HOT pruning can have hundreds of LP_DEAD item pointers on a heap page, which we'll tend to get to before too long anyway (with or without many duplicates).

The nbtree/caller side makes inferences about what is likely to be true about the "logical contents" of the physical leaf page, as a starting point for the heuristic-driven search for deletable index tuples. There are various ways in which each inference (considered individually) might be wrong, including the one that you pointed out: inserts of duplicates will look like update version churn. But that particular issue won't matter if the inserted duplicates are on mostly-distinct heap blocks (which is likely), because then they'll only make a noise-level contribution to the behavior in heapam.c. Also, we can fall back on deduplication when bottom-up deletion fails, which will merge together the duplicates-from-insert, so any future bottom-up deletion pass over the same leaf page won't have the same problem.

Bear in mind that heapam.c is only looking for a select few heap blocks, and doesn't even need to get the order exactly right. We're only concerned about extremes, which are actually what we see in cases that we're interested in helping. We only have to be very approximately right, or right on average. Sure, there might be some tiny cost in marginal cases, but that's not something that I've been able to quantify in any practical way. Because once we fail we fail for good -- the page splits and the question of doing a bottom-up deletion pass for that same leaf page ends.

Another important concept is the cost asymmetry -- the asymmetry here is huge. Leaf page splits driven by version churn are very expensive in the short term and in the long term. Our previous behavior was to assume that they were necessary. Now we're initially assuming that they're unnecessary, and requiring non-HOT updaters to do some work that shows (with some margin of error) that such a split is in fact necessary. This is a form of backpressure. Bottom-up deletion doesn't intervene unless and until that happens. It is only interested in pages where version churn is concentrated -- it is absolutely fine to leave it up to VACUUM to get to any "floating garbage" tuples later. This is a pathological condition, and as such isn't hard to spot, regardless of workload conditions.

If you or anyone else can think of a gap in my reasoning, or a workload in which the heuristics either fail to prevent page splits where that might be expected or impose too high a cost, do let me know. I admit that my design is unorthodox, but the results speak for themselves.

--
Peter Geoghegan
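The power-of-two bucketing that heapam.c applies can be pictured with a small self-contained illustration. This is only the idea; the real logic lives in bottomup_sort_and_shrink() and weighs several factors:

static unsigned int
promising_bucket(unsigned int npromising)
{
    /*
     * Round the count of promising tuples for a heap block down to a power
     * of two.  Blocks whose counts differ only slightly then compare as
     * equal, so the final ordering can be decided by other information
     * (such as the total number of TIDs pointing at the block), rather
     * than overtrusting the index AM's duplicate-based hints.
     */
    unsigned int bucket = 1;

    if (npromising == 0)
        return 0;

    while (bucket * 2 <= npromising)
        bucket *= 2;

    return bucket;
}

/* Example: counts of 13, 14 and 15 all map to bucket 8 and tie with each
 * other; a count of 30 maps to bucket 16 and still clearly wins. */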
On Mon, Jan 18, 2021 at 12:43 AM Peter Geoghegan <[hidden email]> wrote:
> > On Sat, Jan 16, 2021 at 9:55 PM Amit Kapila <[hidden email]> wrote:
> > Do we do this optimization (bottom-up deletion) only when the last item which can lead to page split has indexUnchanged set to true? If so, what if the last tuple doesn't have indexUnchanged but the previous ones do have?
>
> Using the indexUnchanged hint as a triggering condition makes sure that non-HOT updaters are the ones that pay the cost of a bottom-up deletion pass. We create backpressure for non-HOT updaters (in indexes that are logically unchanged) specifically. (Of course it's also true that the presence of the indexUnchanged hint is highly suggestive of there being more version churn duplicates on the leaf page already, which is not actually certain.)
>
> It's possible that there will be some mixture of inserts and non-HOT updates on the same leaf page, and as you say the implementation might fail to do a bottom-up pass based on the fact that an incoming item at the point of a would-be page split was a plain insert (and so didn't receive the hint).

or it would do the scan when that is the only time for this leaf page to receive such a hint.

> The possibility of these kinds of "missed opportunities" is okay because a page split was inevitable either way. You can imagine a case where that isn't true (i.e. a missed opportunity to avoid a page split), but that's kind of like imagining a fair coin flip taking place 100 times and coming up heads each time. It just isn't realistic for such a "mixed tendencies" leaf page to stay in flux (i.e. not ever split) forever, with the smartest triggering condition in the world -- it's too unstable.

But even if we don't want to avoid it forever, delaying it will also have advantages depending upon the workload. Let's try to see by some example; say the item and page size are such that it would require 12 items to fill the page completely. Case-1, where we decide based on the hint received with the last item, and Case-2, where we decide based on whether we ever received such a hint for the leaf page.

Case-1:
========
12 new items (6 inserts, 6 non-HOT updates)
Page-1: 12 items - no split would be required because we received the hint (indexUnchanged) with the last item, so page-1 will have 6 items after clean up.

6 new items (3 inserts, 3 non-HOT updates)
Page-1: 12 items (9 inserts, 3 non-HOT updates) leads to a split because we received the last item without a hint.

The SPLIT-1 happens after 18 operations (6 after the initial 12).

After this split (SPLIT-1), we will have 2 pages with the below state:
Page-1: 6 items (4 inserts, 2 non-HOT updates)
Page-2: 6 items (5 inserts, 1 non-HOT update)

6 new items (3 inserts, 3 non-HOT updates)
Page-1 got 3 new items, 1 insert and 2 non-HOT updates;
Page-1: 9 items (5 inserts, 4 non-HOT updates)
Page-2 got 3 new items, 2 inserts and 1 non-HOT update;
Page-2: 9 items (7 inserts, 2 non-HOT updates)

6 new items (3 inserts, 3 non-HOT updates)
Page-1 got 3 new items, 1 insert and 2 non-HOT updates;
Page-1: 12 items (6 inserts, 6 non-HOT updates) doesn't lead to a split
Page-2 got 3 new items, 2 inserts and 1 non-HOT update;
Page-2: 12 items (9 inserts, 3 non-HOT updates) leads to a split (SPLIT-2)

The SPLIT-2 happens after 30 operations (12 new operations after the previous split).

Case-2:
========
12 new items (6 inserts, 6 non-HOT updates)
Page-1: 12 items - no split would be required because we received the hint (indexUnchanged) with at least one of the items, so page-1 will have 6 items after clean up.

6 new items (3 inserts, 3 non-HOT updates)
Page-1: 12 items (9 inserts, 3 non-HOT updates), cleanup happens and Page-1 will have 9 items.

6 new items (3 inserts, 3 non-HOT updates), at this stage in whichever order the new items are received one split can't be avoided.

The SPLIT-1 happens after 24 new operations (12 new ops after the initial 12).

Page-1: 6 items (6 inserts)
Page-2: 6 items (6 inserts)

6 new items (3 inserts, 3 non-HOT updates)
Page-1 got 3 new items, 1 insert and 2 non-HOT updates;
Page-1: 9 items (7 inserts, 2 non-HOT updates)
Page-2 got 3 new items, 2 inserts and 1 non-HOT update;
Page-2: 9 items (8 inserts, 1 non-HOT update)

6 new items (3 inserts, 3 non-HOT updates)
Page-1 got 3 new items, 1 insert and 2 non-HOT updates;
Page-1: 12 items (8 inserts, 4 non-HOT updates), clean up happens and page-1 will have 8 items.
Page-2 got 3 new items, 2 inserts and 1 non-HOT update;
Page-2: 12 items (10 inserts, 2 non-HOT updates), clean up happens and page-2 will have 10 items.

6 new items (3 inserts, 3 non-HOT updates)
Page-1 got 3 new items, 1 insert and 2 non-HOT updates;
Page-1: 11 items (9 inserts, 2 non-HOT updates)
Page-2 got 3 new items, 2 inserts and 1 non-HOT update;
Page-2: 12 items (12 inserts, 0 non-HOT updates), cleanup happens for one of the non-HOT updates

6 new items (3 inserts, 3 non-HOT updates)
Page-1 got 3 new items, 1 insert and 2 non-HOT updates;
Page-1: 12 items (12 inserts, 0 non-HOT updates), cleanup happens for one of the non-HOT updates
Page-2 got 3 new items, 2 inserts and 1 non-HOT update;
Page-2: split happens

After the split:
Page-1: 12 items (12 inserts)
Page-2: 6 items (6 inserts)
Page-3: 6 items (6 inserts)

The SPLIT-2 happens after 48 operations (24 new operations).

The summary of the above is that with Case-1 (clean-up based on hint received with the last item on the page) it takes fewer operations to cause a page split as compared to Case-2 (clean-up based on hint received with at least of the items on the page) for a mixed workload. How can we say that it doesn't matter?

> An important part of understanding the design is to imagine things at the leaf page level, while thinking about how what that actually looks like differs from an ideal physical representation of the same leaf page that is much closer to the logical contents of the database. We're only interested in leaf pages where the number of logical rows is basically fixed over time (when there is version churn).

With the above example, it seems like it would also help when this is not true.

> Usually this just means that there are lots of non-HOT updates, but it can also work with lots of queue-like inserts and deletes in a unique index.
>
> Ultimately, the thing that determines whether or not the bottom-up deletion optimization is effective for any given leaf page is the fact that it actually prevents the page from splitting despite lots of version churn -- this happens again and again. Another key concept here is that bottom-up deletion passes for the same leaf page are related in important ways -- it is often a mistake to think of individual bottom-up passes as independent events.
>
> > Why are we using terminology bottom-up deletion and why not simply duplicate version deletion or something on those lines?
>
> Why is that simpler?

I am not sure what I proposed fits here, but "bottom-up" sounds like we are starting from the leaf level and moving upwards to the root level, which I think is not true here.

> Also, it doesn't exactly delete duplicates. See my later remarks.
>
> > How do we distinguish between version duplicate tuples (added due to the reason that some other index column is updated) or duplicate tuples (added because a user has inserted a row with duplicate value) for the purpose of bottom-up deletion? I think we need to distinguish between them because in the earlier case we can remove the old version of the tuple in the index but not in later. Or is it the case that indexAm doesn't differentiate between them but relies on heapAM to give the correct answer?
>
> Bottom-up deletion uses the presence of duplicates as a starting point for determining which heap pages to visit, based on the assumption that at least some are obsolete versions. But heapam.c has additional heap-level heuristics that help too.
>
> It is quite possible and even likely that we will delete some non-duplicate tuples in passing, just because they're checked in passing -- they might turn out to be deletable, for whatever reason.

How does that work? Can heapam.c somehow indicate additional tuples (extra to what has been marked/sent by IndexAM as promising) as deletable? I see in heap_index_delete_tuples that we mark the status of the passed tuples as deletable (by setting the knowndeletable flag for them).

--
With Regards,
Amit Kapila.
Mon, 18 Jan 2021 at 07:44, Amit Kapila <[hidden email]>:

> The summary of the above is that with Case-1 (clean-up based on hint received with the last item on the page) it takes fewer operations to cause a page split as compared to Case-2 (clean-up based on hint received with at least of the items on the page) for a mixed workload. How can we say that it doesn't matter?

I cannot understand this, perhaps there's a word missing in the brackets?..

Thinking of your proposal to undertake bottom-up deletion also on the last-to-fit tuple insertion, I'd like to start with my understanding of the design:

* we try to avoid index bloat, but we do it in the "lazy" way (for a reason, see below)

* it means that if there is enough space on the leaf page to fit the new index tuple, we just fit it there

* if there's not enough space, we first look at the presence of LP_DEAD tuples, and if they do exist, we scan the whole (index) page to re-check all tuples in order to find other, not-yet-marked-but-actually-being-LP_DEAD tuples and clean all those up.

* if there's still not enough space, only now do we try bottom-up deletion. This is a heavy operation and can cause extra IO (tests do show this), therefore this operation is undertaken at the point when we can justify extra IO against a leaf page split.

* if there's no space even after bottom-up deletion, we perform deduplication (if possible)

* finally, we split the page.

Should we try a bottom-up deletion pass in a situation where we're inserting the last possible tuple into the page (enough space for the *current* insert, but likely no space for the next), then (among others) the following possibilities exist:

- no new tuples ever come to this page anymore (happy accident), which means that we've wasted IO cycles

- we undertake a bottom-up deletion pass without success and we're asked to insert a new tuple into this fully packed index page. This can unroll to:
  - due to the delay, we've managed to find some space either due to LP_DEAD or bottom-up cleanup, which means we've wasted IO cycles on the previous iteration (too early attempt).
  - we still failed to find any space and are forced to split the page; in this case we've wasted IO cycles twice.

In my view these cases where we generate wasted IO cycles (producing no benefit) should be avoided. And this is the main reason for the current approach.

Again, this is my understanding and I hope I got it right.

Victor Yegorov
On Mon, Jan 18, 2021 at 5:11 PM Victor Yegorov <[hidden email]> wrote:
> Mon, 18 Jan 2021 at 07:44, Amit Kapila <[hidden email]>:
>
>> The summary of the above is that with Case-1 (clean-up based on hint received with the last item on the page) it takes fewer operations to cause a page split as compared to Case-2 (clean-up based on hint received with at least of the items on the page) for a mixed workload. How can we say that it doesn't matter?
>
> I cannot understand this, perhaps there's a word missing in the brackets?..

There is a missing word (one) in Case-2, let me write it again to avoid any confusion: Case-2 (clean-up based on hint received with at least one of the items on the page).

> Thinking of your proposal to undertake bottom-up deletion also on the last-to-fit tuple insertion,

I think there is some misunderstanding between what I am trying to say and your conclusion of it. See below.

> I'd like to start with my understanding of the design:
>
> * we try to avoid index bloat, but we do it in the "lazy" way (for a reason, see below)
>
> * it means that if there is enough space on the leaf page to fit the new index tuple, we just fit it there
>
> * if there's not enough space, we first look at the presence of LP_DEAD tuples, and if they do exist, we scan the whole (index) page to re-check all tuples in order to find other, not-yet-marked-but-actually-being-LP_DEAD tuples and clean all those up.
>
> * if there's still not enough space, only now do we try bottom-up deletion. This is a heavy operation and can cause extra IO (tests do show this), therefore this operation is undertaken at the point when we can justify extra IO against a leaf page split.
>
> * if there's no space even after bottom-up deletion, we perform deduplication (if possible)
>
> * finally, we split the page.
>
> Should we try a bottom-up deletion pass in a situation where we're inserting the last possible tuple into the page (enough space for the *current* insert, but likely no space for the next),

I am saying that we try bottom-up deletion when the new insert item didn't find enough space on the page and there was previously some indexUnchanged tuple(s) inserted into that page. Of course, like now, it will be attempted after an attempt to remove LP_DEAD items.

> then (among others) the following possibilities exist:
>
> - no new tuples ever come to this page anymore (happy accident), which means that we've wasted IO cycles
>
> - we undertake a bottom-up deletion pass without success and we're asked to insert a new tuple into this fully packed index page. This can unroll to:
>   - due to the delay, we've managed to find some space either due to LP_DEAD or bottom-up cleanup, which means we've wasted IO cycles on the previous iteration (too early attempt).
>   - we still failed to find any space and are forced to split the page; in this case we've wasted IO cycles twice.
>
> In my view these cases where we generate wasted IO cycles (producing no benefit) should be avoided.

I don't think any of these can happen in what I am actually saying. Do you still have the same feeling after reading this email? Off-hand, I don't see any downsides as compared to the current approach, and it will have fewer splits in some other workloads, like when there is a mix of inserts and non-HOT updates (that don't logically change the index).

--
With Regards,
Amit Kapila.
Mon, 18 Jan 2021 at 13:42, Amit Kapila <[hidden email]>:

> I don't think any of these can happen in what I am actually saying. Do you still have the same feeling after reading this email?

If I understand you correctly, you suggest tracking _all_ the hints that came from the executor for possible non-HOT logical duplicates somewhere on the page. And when we hit the no-space case, we'll check not only the last item being hinted, but all items on the page, which makes it more probable to kick in and do something.

Sounds interesting. And judging by Peter's tests of extra LP_DEAD tuples found on the page (almost always more than were actually flagged), this can have some positive effects.

Also, I'm not sure where to put it. We've deprecated the BTP_HAS_GARBAGE flag, maybe it can be reused for this purpose?

Victor Yegorov
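A minimal sketch of what that might look like, with entirely made-up names (nothing like this exists in the patch; in nbtree terms the bit would live among the leaf page's btpo_flags):

#include <stdbool.h>
#include <stdint.h>

#define P_HAS_HINTED_VERSIONS   (1 << 15)   /* hypothetical flag bit */

typedef struct LeafPageOpaque
{
    uint16_t    flags;
} LeafPageOpaque;

/* At insert time: remember that *some* insert on this leaf page carried
 * the executor's indexUnchanged hint */
static void
remember_index_unchanged_hint(LeafPageOpaque *opaque, bool indexUnchanged)
{
    if (indexUnchanged)
        opaque->flags |= P_HAS_HINTED_VERSIONS;
}

/* At would-be split time: attempt bottom-up deletion if any earlier insert
 * was hinted, not only the one that happens to overflow the page */
static bool
should_attempt_bottomup(const LeafPageOpaque *opaque, bool indexUnchanged)
{
    return indexUnchanged ||
        (opaque->flags & P_HAS_HINTED_VERSIONS) != 0;
}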
In reply to this post by akapila
On Sun, Jan 17, 2021 at 10:44 PM Amit Kapila <[hidden email]> wrote:
> With the above example, it seems like it would also help when this is not true.

I'll respond to your remarks here separately, in a later email.

> I am not sure what I proposed fits here, but "bottom-up" sounds like we are starting from the leaf level and moving upwards to the root level, which I think is not true here.

I guess that that's understandable, because it is true that B-Trees are maintained in a bottom-up fashion. However, it's also true that you can have top-down and bottom-up approaches in query optimizers, and many other things (it's even something that is used to describe governance models). The whole point of the term "bottom-up" is to suggest that bottom-up deletion complements top-down cleanup by VACUUM. I think that this design embodies certain principles that can be generalized to other areas, such as heap pruning.

I recall that Robert Haas hated the name deduplication. I'm not about to argue that my choice of "deduplication" was objectively better than whatever he would have preferred. The same thing applies here - I more or less chose a novel name because the concept is itself novel (unlike deduplication). Reasonable people can disagree about what exact name might have been better, but it's not like we're going to arrive at something that everybody can be happy with. And whatever we arrive at probably won't matter that much - the vast majority of users will never need to know what either thing is. They may be important things, but that doesn't mean that many people will ever think about them (in fact I specifically hope that they don't ever need to think about them).

> How does that work? Can heapam.c somehow indicate additional tuples (extra to what has been marked/sent by IndexAM as promising) as deletable? I see in heap_index_delete_tuples that we mark the status of the passed tuples as deletable (by setting the knowndeletable flag for them).

The bottom-up nbtree caller to heap_index_delete_tuples()/table_index_delete_tuples() (not to be confused with the simple/LP_DEAD heap_index_delete_tuples() caller) always provides heapam.c with a complete picture of the index page, in the sense that it exhaustively has a delstate.deltids entry for each and every TID on the page, no matter what. This is the case even though in practice there is usually no possible way to check even 20% of the deltids entries within heapam.c.

In general, the goal during a bottom-up pass is *not* to maximize expected utility (i.e. the number of deleted index tuples/space freed/whatever), exactly. The goal is to lower the variance across related calls, so that we'll typically manage to free a fair number of index tuples when we need to. In general it is much better for heapam.c to make its decisions based on 2 or 3 good reasons rather than just 1 excellent reason. And so heapam.c applies a power-of-two bucketing scheme, never truly giving too much weight to what nbtree tells it about duplicates. See the comments above bottomup_nblocksfavorable(), and the bottomup_sort_and_shrink() comments (both are in heapam.c).

--
Peter Geoghegan
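For reference, the state that crosses the AM boundary looks approximately like this (recalled from tableam.h; the field lists may not be exhaustive, and the comments are paraphrased):

typedef struct TM_IndexStatus
{
    OffsetNumber    idxoffnum;          /* index page offset of the tuple */
    bool            knowndeletable;     /* set by heapam.c when proven dead */
    bool            promising;          /* nbtree's duplicate-based hint */
    int16           freespace;          /* leaf space freed if deleted */
} TM_IndexStatus;

typedef struct TM_IndexDeleteOp
{
    bool            bottomup;           /* bottom-up (not simple) deletion? */
    int             bottomupfreespace;  /* caller's target amount to free */
    int             ndeltids;           /* one deltids entry per TID on page */
    TM_IndexDelete *deltids;            /* element layout shown upthread */
    TM_IndexStatus *status;
} TM_IndexDeleteOp;

In the bottom-up case nbtree fills deltids exhaustively and marks only the suspected version-churn duplicates as promising; heapam.c is then free to set knowndeletable on any entry it happens to prove dead, duplicate or not, which is the "serendipity" described earlier in the thread.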
In reply to this post by Victor Yegorov
On Mon, Jan 18, 2021 at 6:11 AM Victor Yegorov <[hidden email]> wrote:
> If I understand you correctly, you suggest tracking _all_ the hints that came from the executor for possible non-HOT logical duplicates somewhere on the page. And when we hit the no-space case, we'll check not only the last item being hinted, but all items on the page, which makes it more probable to kick in and do something.
>
> Also, I'm not sure where to put it. We've deprecated the BTP_HAS_GARBAGE flag, maybe it can be reused for this purpose?

There actually was a similar flag (named BTP_UNCHANGED_UPDATE and later BTP_HAS_DUPS) that appeared in earlier versions of the patch (several versions in total, up to and including v6). This was never discussed on the list because I assumed that that wouldn't be helpful (I was already writing too many overlong e-mails). It was unsettled in my mind at the time, so it didn't make sense to start discussing. I changed my mind at some point, and so it never came up until now, when Amit raised the question.

Looking back on my personal notes, I am reminded that I debated this exact question with myself at length. The argument for keeping the nbtree flag (i.e. what Amit is arguing for now) involved an isolated example that seems very similar to the much more recent example from Amit (that he arrived at independently). I am at least sympathetic to this view of things, even now. Let me go into why I changed my mind after a couple of months of on and off deliberation.

In general, the highly speculative nature of the extra work that heapam.c does for index deleters in the bottom-up caller case can only be justified because the cost is paid by non-HOT updaters that are definitely about to split the page just to fit another version, and because we only risk wasting one heap page access at any given point of the entire process (the latter point about risk/cost is not 100% true, because you have additional fixed CPU costs and stuff like that, but it's at least true in spirit).

We can justify "gambling" like this only because the game is rigged in our favor to an *absurd* degree: there are many ways to win big (relative to the likely version churn page split baseline case), and we only have to put down relatively small "bets" at any given point -- heapam.c will give up everything when it encounters one whole heap page that lacks a single deletable entry, no matter what the reason is. The algorithm *appears* to behave very intelligently when seen from afar, but is in fact stupid and opportunistic when you look at it up close. It's possible to be so permissive about the upside benefit by also being very conservative about the downside cost. Almost all of our individual inferences can be wrong, and yet we still win in the end. And the effect is robust over time. You could say that it is an organic approach: it adapts to the workload, rather than trying to make the workload adapt to it. This is less radical than you'd think -- in some sense this is how B-Trees have always worked.

In the end, I couldn't justify imposing this cost on anything other than a non-HOT updater, which is what the flag proposal would require me to do -- then it's not 100% clear that the relative cost of each "bet" placed in heapam.c really is extremely low (we can no longer say for sure that we have very little to lose, given the certain alternative that is a version churn page split). The fact is that "version chains in indexes" still cannot get very long. Plus there are other subtle ways in which it's unlikely to be a problem (e.g. the LP_DEAD deletion stuff also got much better, deduplication can combine with bottom-up deletion so that we'll trigger a useful bottom-up deletion pass sooner or later without much downside, and possibly other things I haven't even thought of).

It's possible that I have been too conservative. I admit that my decision on this point is at least a little arbitrary, but I stand by it for now. I cannot feel too bad about theoretically leaving some gains on the table, given that we're only talking about deleting a group of related versions a little later than we would otherwise, and only in some circumstances, and in a way that seems likely to be imperceptible to users. I reserve the right to change my mind about it, but for now it doesn't look like an absurdly good deal, and those are the kind of deals that it makes sense to focus on here. I am very happy about the fact that it is relatively easy to understand why the worst case for bottom-up index deletion cannot be that bad.

--
Peter Geoghegan