On Tue, Nov 17, 2020 at 7:05 AM Victor Yegorov <[hidden email]> wrote:
> I've looked over the BTP_HAS_GARBAGE modifications, they look sane.
> I've double checked that heapkeyspace indexes don't use this flag (don't rely on it),
> while pre-v4 ones still use it.

Cool.

> I have a question. This flag is raised in the _bt_check_unique() and in _bt_killitems().
> If we're deprecating this flag, perhaps it'd be good to either avoid raising it at least for
> _bt_check_unique(), as it seems to me that conditions are dealing with postings, therefore
> we are speaking of heapkeyspace indexes here.

Well, we still want to mark LP_DEAD bits set in all cases, just as before. The difference is that heapkeyspace indexes won't rely on the page-level flag later on.

> If we'll conditionally raise this flag in the functions above, we can get rid of blocks that drop it
> in _bt_delitems_delete(), I think.

I prefer to continue to maintain the flag in the same way, regardless of which B-Tree version is in use (i.e. if it's heapkeyspace or not). Maintaining the flag is not expensive, may have some small value for forensic or debugging purposes, and saves callers the trouble of telling _bt_delitems_delete() (and code like it) whether or not this is a heapkeyspace index.

--
Peter Geoghegan
Tue, 17 Nov 2020 at 17:24, Peter Geoghegan <[hidden email]>:

> I prefer to continue to maintain the flag in the same way, regardless [...]

OK. Can you explain what deprecation means here?
If this functionality is left as is, it is not really deprecation?..

Victor Yegorov
On Tue, Nov 17, 2020 at 9:19 AM Victor Yegorov <[hidden email]> wrote:
> OK. Can you explain what deprecation means here?
> If this functionality is left as is, it is not really deprecation?..

It just means that we only keep it around for compatibility purposes. We would like to remove it, but can't right now. If we ever stop supporting version 3 indexes, then we can probably remove it entirely.

I would like to avoid special cases across B-Tree index versions. Simply maintaining the page flag in the same way as we always have is the simplest approach.

Pushed the BTP_HAS_GARBAGE patch just now. I'll post a rebased version of the patch series later on today.

Thanks
--
Peter Geoghegan
On Tue, Nov 17, 2020 at 7:24 AM Victor Yegorov <[hidden email]> wrote:
> I've looked through the code and it looks very good from my end:
> - plenty comments, good description of what's going on
> - I found no loose ends in terms of AM integration
> - magic constants replaced with defines
> Code looks good. Still, it'd be good if somebody with more experience could look into this patch.

Great, thank you.

> Question: why in the comments you're using double spaces after dots?
> Is this a convention of the project?

Not really. It's based on my habit of trying to be as consistent as possible with existing code. There seems to be a weak consensus among English speakers on this question, which is: the two space convention is antiquated, and only ever made sense in the era of mechanical typewriters. I don't really care either way, and I doubt that any other committer pays much attention to these things. You may have noticed that I use only one space in my e-mails.

Actually, I probably shouldn't care about it myself. It's just what I decided to do at some point. I find it useful to decide that this or that practice is now a best practice, and then stick to it without thinking about it very much (this frees up space in my head to think about more important things). But this particular habit of mine around spaces is definitely not something I'd insist on from other contributors. It's just that: a habit.

> I am thinking of two more scenarios that require testing:
> - queue in the table, with a high rate of INSERTs+DELETEs and a long transaction.

I see your point. This is going to be hard to make work outside of unique indexes, though. Unique indexes are already not dependent on the executor hint -- they can just use the "uniquedup" hint. The code for unique indexes is prepared to notice duplicates in _bt_check_unique() in passing, and apply the optimization for that reason.

Maybe there is some argument for forgetting about the hint entirely, and always assuming that we should try to find tuples to delete at the point that a page is about to be split. I think that that argument is a lot harder to make, though. And it can be revisited in the future. It would be nice to do better with INSERTs+DELETEs, but that's surely not the big problem for us right now.

I realize that this unique indexes/_bt_check_unique() thing is not even really a partial fix to the problem you describe. The indexes that have real problems with such an INSERTs+DELETEs workload will naturally not be unique indexes -- _bt_check_unique() already does a fairly good job of controlling bloat without bottom-up deletion.

> - upgraded cluster with !heapkeyspace indexes.

I do have a patch that makes that easy to test, that I used for the Postgres 13 deduplication work -- I can rebase it and post it if you like. You will be able to apply the patch, and run the regression tests with a !heapkeyspace index. This works with only one or two tweaks to the tests (IIRC the amcheck tests need to be tweaked in one place for this to work).

I don't anticipate that !heapkeyspace indexes will be a problem, because they won't use any of the new stuff anyway, and because nothing about the on-disk format is changed by bottom-up index deletion.

--
Peter Geoghegan
On Tue, Nov 17, 2020 at 7:17 AM Victor Yegorov <[hidden email]> wrote:
> Thu, 12 Nov 2020 at 23:00, Peter Geoghegan <[hidden email]>:
>> Another thing that I'll probably add to v8: Prefetching. This is
>> probably necessary just so I can have parity with the existing
>> heapam.c function that the new code is based on,
>> heap_compute_xid_horizon_for_tuples(). That will probably help here,
>> too.
>
> I don't quite see this part. Do you mean top_block_groups_favorable() here?

I meant to add prefetching to the version of the patch that became v8, but that didn't happen because I ran out of time. I wanted to get out a version with the low cardinality fix, to see if that helped with the regression you talked about last week. (Prefetching seems to make a small difference when we're I/O bound, so it may not be that important.)

Attached is v9 of the patch series. This actually has prefetching in heapam.c. Prefetching is not just applied to favorable blocks, though -- it's applied to all the blocks that we might visit, even though we often won't really visit the last few blocks in line. This needs more testing. The specific choices I made around prefetching were definitely a bit arbitrary. To be honest, it was a bit of a box-ticking thing (parity with similar code for its own sake). But maybe I failed to consider particular scenarios in which prefetching really is important.

My high level goal for v9 was to do cleanup of v8. There isn't very much that you could call a new enhancement (just the prefetching thing). Other changes in v9 include:

* Much simpler approach to passing down an aminsert() hint from the executor in the v9-0002* patch. Rather than exposing some HOT implementation details from heap_update(), we use executor state that tracks updated columns. Now all we have to do is tell ExecInsertIndexTuples() "this round of index tuple inserts is for an UPDATE statement". It then figures out the specific details (whether it passes the hint or not) on an index by index basis. This interface feels much more natural to me. This also made it easy to handle expression indexes sensibly. And, we get support for the logical replication UPDATE caller to ExecInsertIndexTuples(). It only has to say "this is for an UPDATE", in the usual way, without any special effort (actually I need to test logical replication, just to be sure, but I think that it works fine in v9).

* New B-Tree sgml documentation in the v9-0003* patch. I've added an extensive user-facing description of the feature to the end of "Chapter 64. B-Tree Indexes", near the existing discussion of deduplication.

* New delete_items storage parameter. This makes it possible to disable the optimization. Like deduplicate_items in Postgres 13, it is not expected to be set to "off" very often. I'm not yet 100% sure that a storage parameter is truly necessary -- I might still change my mind and remove it later.

Thanks
--
Peter Geoghegan
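To illustrate the shape of the per-index decision described above (a simplified standalone sketch, not the code from v9 -- the names UpdatedCols, IndexInfoSketch and index_unchanged_by_update_sketch are invented here), the hint amounts to reporting "logically unchanged" only when none of the columns the UPDATE modified are covered by the index:

#include <stdbool.h>

/* Simplified stand-ins: a bitmap of updated attribute numbers (assumed
 * to be < 64 for this sketch), and a per-index list of key attributes. */
typedef struct UpdatedCols { bool attr[64]; } UpdatedCols;

typedef struct IndexInfoSketch
{
    int  nkeyattrs;             /* number of key columns in the index */
    int  keyattrs[16];          /* attribute numbers of those columns */
    bool hasexpressions;        /* expression columns present? */
} IndexInfoSketch;

/*
 * Decide whether to pass the "logically unchanged" hint to the index AM
 * for one index, given that this round of inserts is for an UPDATE.
 */
static bool
index_unchanged_by_update_sketch(const IndexInfoSketch *ii,
                                 const UpdatedCols *updated)
{
    if (ii->hasexpressions)
        return false;           /* expression indexes need extra care; be conservative here */

    for (int i = 0; i < ii->nkeyattrs; i++)
        if (updated->attr[ii->keyattrs[i]])
            return false;       /* an indexed column was modified */

    return true;                /* the new index tuple is a logical duplicate */
}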
On Tue, Nov 17, 2020 at 12:45 PM Peter Geoghegan <[hidden email]> wrote:
> > I am thinking of two more scenarios that require testing:
> > - queue in the table, with a high rate of INSERTs+DELETEs and a long transaction.
>
> I see your point. This is going to be hard to make work outside of
> unique indexes, though. Unique indexes are already not dependent on
> the executor hint -- they can just use the "uniquedup" hint. The code
> for unique indexes is prepared to notice duplicates in
> _bt_check_unique() in passing, and apply the optimization for that
> reason.

I thought about this some more. My first idea was to simply always try out bottom-up deletion (i.e. behave as if the hint from the executor always indicates that it's favorable). I couldn't really justify that approach, though. It results in many bottom-up deletion passes that end up wasting cycles (and unnecessarily accessing heap blocks).

Then I had a much better idea: Make the existing LP_DEAD stuff a little more like bottom-up index deletion. We usually have to access heap blocks that the index tuples point to today, in order to have a latestRemovedXid cutoff (to generate recovery conflicts). It's worth scanning the leaf page for index tuples with TIDs whose heap block matches the index tuples that actually have their LP_DEAD bits set. This only consumes a few more CPU cycles. We don't have to access any more heap blocks to try these extra TIDs, so it seems like a very good idea to try them out.

I ran the regression tests with an enhanced version of the patch, with this LP_DEAD-deletion-with-extra-TIDs thing. It also had custom instrumentation that showed exactly what happens in each case. We manage to delete at least a small number of extra index tuples in almost all cases -- so we get some benefit in practically all cases. And in the majority of cases we can delete significantly more. It's not uncommon to increase the number of index tuples deleted. It could go from 1 - 10 or so without the enhancement to LP_DEAD deletion, to 50 - 250 with the LP_DEAD enhancement. Some individual LP_DEAD deletion calls can free more than 50% of the space on the leaf page.

I believe that this is a lower risk way of doing better when there is a high rate of INSERTs+DELETEs. Most of the regression test cases I looked at were in the larger system catalog indexes, which often look like that. We don't have to be that lucky for a passing index scan to set at least one or two LP_DEAD bits. If there is any kind of physical/logical correlation, then we're bound to also end up deleting some extra index tuples by the time that the page looks like it might need to be split.

--
Peter Geoghegan
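To make the idea above concrete, here is a minimal standalone sketch (illustrative only -- the types and the function are simplified stand-ins, not the nbtree code): first remember which heap blocks already contain items marked LP_DEAD, then treat every other item on the leaf page whose TID points into one of those blocks as an extra deletion candidate, to be verified against the heap later.

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t BlockNumber;   /* simplified stand-in for the heap block number */

typedef struct IndexItem
{
    BlockNumber heapblk;        /* heap block the index tuple points to */
    uint16_t    heapoff;        /* offset within that heap block */
    bool        lpdead;         /* LP_DEAD bit already set? */
} IndexItem;

/*
 * Collect deletion candidates for one leaf page: every LP_DEAD item, plus
 * every not-yet-dead item whose TID lands on a heap block that must be
 * visited anyway (because an LP_DEAD item already points there).  Writes
 * candidate item indexes into "candidates" and returns how many there are.
 */
static int
collect_candidates(const IndexItem *items, int nitems, int *candidates)
{
    BlockNumber *deadblocks = malloc(sizeof(BlockNumber) * nitems);
    int          ndeadblocks = 0;
    int          ncandidates = 0;

    /* Pass 1: remember which heap blocks have LP_DEAD items */
    for (int i = 0; i < nitems; i++)
        if (items[i].lpdead)
            deadblocks[ndeadblocks++] = items[i].heapblk;

    /* Pass 2: keep LP_DEAD items, plus "extra" items on those same blocks */
    for (int i = 0; i < nitems; i++)
    {
        bool matches = items[i].lpdead;

        for (int j = 0; !matches && j < ndeadblocks; j++)
            matches = (items[i].heapblk == deadblocks[j]);

        if (matches)
            candidates[ncandidates++] = i;  /* still checked against the heap later */
    }

    free(deadblocks);
    return ncandidates;
}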
Wed, 25 Nov 2020 at 05:35, Peter Geoghegan <[hidden email]>:

> Then I had a much better idea: Make the existing LP_DEAD stuff a
> little more like bottom-up index deletion. [...]

I don't seem to understand this.

Is it: we're scanning the leaf page for all LP_DEAD tuples that point to the same heap block? Which heap block are we talking about here, the one that holds the entry we're about to add (the one that triggered bottom-up deletion due to lack of space, I mean)?

> I ran the regression tests with an enhanced version of the patch, with
> this LP_DEAD-deletion-with-extra-TIDs thing. [...]

I am missing a general perspective here.

Is it true, that despite the long (vacuum preventing) transaction we can re-use space, as after the DELETE statements commit, IndexScans are setting LP_DEAD hints after they check the state of the corresponding heap tuple?

If my thinking is correct for both cases — the nature of LP_DEAD hint bits and the mechanics of the suggested optimization — then I consider this a very promising improvement!

I haven't done any testing so far since sending my last e-mail.
If you'll have a chance to send a new v10 version with the LP_DEAD-deletion-with-extra-TIDs thing, I will do some tests (planned).

Victor Yegorov
On Wed, Nov 25, 2020 at 4:43 AM Victor Yegorov <[hidden email]> wrote:
>> Then I had a much better idea: Make the existing LP_DEAD stuff a
>> little more like bottom-up index deletion. We usually have to access
>> heap blocks that the index tuples point to today, in order to have a
>> latestRemovedXid cutoff (to generate recovery conflicts). It's worth
>> scanning the leaf page for index tuples with TIDs whose heap block
>> matches the index tuples that actually have their LP_DEAD bits set.
>> This only consumes a few more CPU cycles. We don't have to access any
>> more heap blocks to try these extra TIDs, so it seems like a very good
>> idea to try them out.
>
> I don't seem to understand this.
>
> Is it: we're scanning the leaf page for all LP_DEAD tuples that point to the same
> heap block? Which heap block we're talking about here, the one that holds
> entry we're about to add (the one that triggered bottom-up-deletion due to lack
> of space I mean)?

No, the incoming tuple isn't significant.

As you know, bottom-up index deletion uses heuristics that are concerned with duplicates on the page, and the "logically unchanged by an UPDATE" hint that the executor passes to btinsert(). Bottom-up deletion runs when all LP_DEAD bits have been cleared (either because there never were any LP_DEAD bits set, or because they were set and then deleted, which wasn't enough). But before bottom-up deletion may run, traditional deletion of LP_DEAD index tuples runs -- this is always our preferred strategy, because index tuples with their LP_DEAD bits set are already known to be deletable. We can make this existing process (which has been around since PostgreSQL 8.2) better by applying similar principles.

We have promising tuples for bottom-up deletion. Why not have "promising heap blocks" for traditional LP_DEAD index tuple deletion? Or if you prefer, we can consider index tuples that *don't* have their LP_DEAD bits set already but happen to point to the *same heap block* as other tuples that *do* have their LP_DEAD bits set promising. (The tuples with their LP_DEAD bits set are not just promising -- they're already a sure thing.)

This means that traditional LP_DEAD deletion is now slightly more speculative in one way (it speculates about what is likely to be true using heuristics). But it's much less speculative than bottom-up index deletion. We are required to visit these heap blocks anyway, since a call to _bt_delitems_delete() for LP_DEAD deletion must already call table_compute_xid_horizon_for_tuples(), which has to access the blocks to get a latestRemovedXid for the WAL record.

The only thing that we have to lose here is a few CPU cycles to find extra TIDs to consider. We'll visit exactly the same number of heap blocks as before. (Actually, _bt_delitems_delete() does not have to do that in all cases, but it has to do it with a logged table with wal_level >= replica, which is the vast majority of cases in practice.)

This means that traditional LP_DEAD deletion reuses some of the bottom-up index deletion infrastructure. So maybe nbtree never calls table_compute_xid_horizon_for_tuples() now, since everything goes through the new heapam stuff instead (which knows how to check extra TIDs that might not be dead at all).

> I am missing a general perspective here.
>
> Is it true, that despite the long (vacuum preventing) transaction we can re-use space,
> as after the DELETE statements commit, IndexScans are setting LP_DEAD hints after
> they check the state of the corresponding heap tuple?
The enhancement to traditional LP_DEAD deletion that I just described does not affect the current restrictions on setting LP_DEAD bits in the presence of a long-running transaction, or anything like that. That seems like an unrelated project. The value of this enhancement is purely its ability to delete *extra* index tuples that could have had their LP_DEAD bits set already (it was possible in principle), but didn't. And only when they are nearby to index tuples that really do have their LP_DEAD bits set.

> I haven't done any testing so far since sending my last e-mail.
> If you'll have a chance to send a new v10 version with LP_DEAD-deletion-with-extra-TIDs thing,
> I will do some tests (planned).

Thanks! I think that it will be next week. It's a relatively big change.

--
Peter Geoghegan
Wed, 25 Nov 2020 at 19:41, Peter Geoghegan <[hidden email]>:

> We have promising tuples for bottom-up deletion. Why not have
> "promising heap blocks" for traditional LP_DEAD index tuple deletion? [...]

In _bt_delete_or_dedup_one_page() we start with a simple loop over the items on the page, and if there are any LP_DEAD tuples, we kick off _bt_delitems_delete().

So if I understood you right, you plan to make this loop (or a similar one somewhere around it) track the TIDs of the LP_DEAD tuples, and then (perhaps on a second loop over the page) compare all other currently-not-LP_DEAD tuples, marking those pages that have at least 2 TIDs pointing at them (one LP_DEAD and one not) as promising. Later, should we need to kick off deduplication, we'll go visit the promising pages first.

Is my understanding correct?

> I am missing a general perspective here.

I wasn't considering improvements here, that was a general question about how this works (trying to clear up gaps in my understanding).

What I meant to ask — will LP_DEAD be set by an IndexScan in the presence of the long transaction?

Victor Yegorov
On Wed, Nov 25, 2020 at 1:20 PM Victor Yegorov <[hidden email]> wrote:
> In the _bt_delete_or_dedup_one_page() we start with the simple loop over items on the page and
> if there're any LP_DEAD tuples, we're kicking off _bt_delitems_delete().

Right.

> So if I understood you right, you plan to make this loop (or a similar one somewhere around)
> to track TIDs of the LP_DEAD tuples and then (perhaps on a second loop over the page) compare all other
> currently-not-LP_DEAD tuples and mark those pages, that have at least 2 TIDs pointing at (one LP_DEAD and other not)
> as a promising one.

Yes. We notice extra TIDs that can be included in our heapam test "for free". The cost is low, but the benefits are also often quite high: in practice there are *natural* correlations that we can exploit. For example: maybe there were non-HOT updates, and some but not all of the versions got marked LP_DEAD. We can get them all in one go, avoiding a true bottom-up index deletion pass for much longer (compared to doing LP_DEAD deletion the old way, which is what happens in v9 of the patch). We're better off doing the deletions all at once. It's cheaper.

(We still really need to have bottom-up deletion passes, of course, because that covers the important case where there are no LP_DEAD bits set at all, which is an important goal of this project.)

Minor note: Technically there aren't any promising tuples involved, because that only makes sense when we are not going to visit every possible heap page (just the "most promising" heap pages). But we are going to visit every possible heap page with the new LP_DEAD bit deletion code (which could occasionally mean visiting 10 or more heap pages, which is a lot more than bottom-up index deletion will ever visit). All we need to do with the new LP_DEAD deletion logic is to include all possible matching TIDs (not just those that are marked LP_DEAD already).

> What I meant to ask — will LP_DEAD be set by IndexScan in the presence of the long transaction?

That works in the same way as before, even with the new LP_DEAD deletion code. The new code uses the same information as before (set LP_DEAD bits), which is generated in the same way as before. The difference is in how the information is actually used during LP_DEAD deletion -- we can now delete some extra things in certain common cases.

In practice this (and bottom-up deletion) make nbtree more robust against disruption caused by long running transactions that hold a snapshot open. It's hard to give a simple explanation of why that is, because it's a second order effect. The patch is going to make it possible to recover when LP_DEAD bits suddenly stop being set because of an old snapshot -- now we'll have a "second chance", and maybe even a third chance. But if the snapshot is held open *forever*, then a second chance has no value.

Here is a thought experiment that might be helpful:

Imagine Postgres just as it is today (without the patch), except that VACUUM runs very frequently, and is infinitely fast (this is a magical version of VACUUM). This solves many problems, but does not solve all problems. Magic Postgres will become just as slow as earthly Postgres when there is a snapshot that is held open for a very long time. That will take longer to happen compared to earthly/mortal Postgres, but eventually there will be no difference between the two at all. But, when you don't have such an extreme problem, magic Postgres really is much faster.
I think that it will be possible to approximate the behavior of magic Postgres using techniques like bottom-up deletion, the new LP_DEAD deletion thing we've been talking about today, and maybe other enhancements in other areas (like in heap pruning). It doesn't matter that we don't physically remove garbage immediately, as long as we "logically" remove it immediately. The actual physical removal can occur in a just-in-time, incremental fashion, creating the illusion that VACUUM really does run infinitely fast. No magic required.

Actually, in a way this isn't new; we have always "logically" removed garbage at the earliest opportunity (by which I mean we allow that it can be physically removed according to an oldestXmin style cutoff, which can be reacquired/updated the second the oldest MVCC snapshot goes away). We don't think of useless old versions as being "logically removed" the instant an old snapshot goes away. But maybe we should -- it's a useful mental model.

It will also be very helpful to add "remove useless intermediate versions" logic at some point. This is quite a distinct area from what I just described, but it's also important. We need both, I think.

--
Peter Geoghegan
On Wed, Nov 25, 2020 at 10:41 AM Peter Geoghegan <[hidden email]> wrote:
> We have promising tuples for bottom-up deletion. Why not have
> "promising heap blocks" for traditional LP_DEAD index tuple deletion?
> Or if you prefer, we can consider index tuples that *don't* have their
> LP_DEAD bits set already but happen to point to the *same heap block*
> as other tuples that *do* have their LP_DEAD bits set promising. (The
> tuples with their LP_DEAD bits set are not just promising -- they're
> already a sure thing.)
>
> This means that traditional LP_DEAD deletion is now slightly more
> speculative in one way (it speculates about what is likely to be true
> using heuristics). But it's much less speculative than bottom-up index
> deletion. We are required to visit these heap blocks anyway, since a
> call to _bt_delitems_delete() for LP_DEAD deletion must already call
> table_compute_xid_horizon_for_tuples(), which has to access the blocks
> to get a latestRemovedXid for the WAL record.
>
> The only thing that we have to lose here is a few CPU cycles to find
> extra TIDs to consider. We'll visit exactly the same number of heap
> blocks as before. (Actually, _bt_delitems_delete() does not have to do
> that in all cases, but it has to do it with a logged table
> with wal_level >= replica, which is the vast majority of cases in
> practice.)
>
> This means that traditional LP_DEAD deletion reuses some of the
> bottom-up index deletion infrastructure. So maybe nbtree never calls
> table_compute_xid_horizon_for_tuples() now, since everything goes
> through the new heapam stuff instead (which knows how to check extra
> TIDs that might not be dead at all).

Attached is v10, which adds the LP_DEAD deletion enhancement that I just described. (It also fixed bitrot -- v9 no longer applies.)

This revision does a little refactoring to make this possible. Now there is less new code in nbtdedup.c, and more code in nbtpage.c, because some of the logic used by bottom-up deletion has been generalized (in order to be used by the new-to-v10 LP_DEAD deletion enhancement).

Other than that, no big changes between this v10 and v9. Just polishing and refactoring.

I decided to make it mandatory for tableams to support the new interface that heapam implements, since it's hardly okay for them to not allow LP_DEAD deletion in nbtree (which is what making supporting the interface optional would imply, given the LP_DEAD changes). So now the heapam and tableam changes are included in one patch/commit, which is to be applied first among the patches in the series.

--
Peter Geoghegan
This is a wholly new concept with a lot of heuristics. It's a lot to
swallow. But here are a few quick comments after a first read-through:

On 30/11/2020 21:50, Peter Geoghegan wrote:
> +/*
> + * State used when calling table_index_delete_check() to perform "bottom up"
> + * deletion of duplicate index tuples.  State is initialized by index AM
> + * caller, while state is finalized by tableam, which modifies state.
> + */
> +typedef struct TM_IndexDelete
> +{
> +	ItemPointerData tid;		/* table TID from index tuple */
> +	int16		id;				/* Offset into TM_IndexStatus array */
> +} TM_IndexDelete;
> +
> +typedef struct TM_IndexStatus
> +{
> +	OffsetNumber idxoffnum;		/* Index am page offset number */
> +	int16		tupsize;		/* Space freed in index if tuple deleted */
> +	bool		ispromising;	/* Duplicate in index? */
> +	bool		deleteitup;		/* Known dead-to-all? */
> +} TM_IndexStatus;
> ...
> + * The two arrays are conceptually one single array.  Two arrays/structs are
> + * used for performance reasons.  (We really need to keep the TM_IndexDelete
> + * struct small so that the tableam can do an initial sort by TID as quickly
> + * as possible.)

Is it really significantly faster to have two arrays? If you had just
one array, each element would be only 12 bytes long. That's not much
smaller than TM_IndexDeletes, which is 8 bytes.

> +	/* First sort caller's array by TID */
> +	heap_tid_shellsort(delstate->deltids, delstate->ndeltids);
> +
> +	/* alltids caller visits all blocks, so make sure that happens */
> +	if (delstate->alltids)
> +		return delstate->ndeltids;
> +
> +	/* Calculate per-heap-block count of TIDs */
> +	blockcounts = palloc(sizeof(IndexDeleteCounts) * delstate->ndeltids);
> +	for (int i = 0; i < delstate->ndeltids; i++)
> +	{
> +		ItemPointer deltid = &delstate->deltids[i].tid;
> +		TM_IndexStatus *dstatus = delstate->status + delstate->deltids[i].id;
> +		bool	ispromising = dstatus->ispromising;
> +
> +		if (curblock != ItemPointerGetBlockNumber(deltid))
> +		{
> +			/* New block group */
> +			nblockgroups++;
> +
> +			Assert(curblock < ItemPointerGetBlockNumber(deltid) ||
> +				   !BlockNumberIsValid(curblock));
> +
> +			curblock = ItemPointerGetBlockNumber(deltid);
> +			blockcounts[nblockgroups - 1].ideltids = i;
> +			blockcounts[nblockgroups - 1].ntids = 1;
> +			blockcounts[nblockgroups - 1].npromisingtids = 0;
> +		}
> +		else
> +		{
> +			blockcounts[nblockgroups - 1].ntids++;
> +		}
> +
> +		if (ispromising)
> +			blockcounts[nblockgroups - 1].npromisingtids++;
> +	}

Instead of sorting the array by TID, wouldn't it be enough to sort by
just the block numbers?

> + * While in general the presence of promising tuples (the hint that index
> + * AMs provide) is the best information that we have to go on, it is based
> + * on simple heuristics about duplicates in indexes that are understood to
> + * have specific flaws.  We should not let the most promising heap pages
> + * win or lose on the basis of _relatively_ small differences in the total
> + * number of promising tuples.  Small differences between the most
> + * promising few heap pages are effectively ignored by applying this
> + * power-of-two bucketing scheme.
> + *

What are the "specific flaws"?

I understand that this is all based on rough heuristics, but is there
any advantage to rounding/bucketing the numbers like this? Even if small
differences in the total number of promising tuples are essentially noise
that can be safely ignored, is there any harm in letting those small
differences guide the choice? Wouldn't it be essentially the same as
picking at random, or are those small differences somehow biased?

- Heikki
On Tue, Dec 1, 2020 at 1:50 AM Heikki Linnakangas <[hidden email]> wrote:
> This is a wholly new concept with a lot of heuristics. It's a lot to
> swallow.

Thanks for taking a look!

Yes, it's a little unorthodox. Ideally, you'd find time to grok the patch and help me codify the design in some general kind of way. If there are general lessons to be learned here (and I suspect that there are), then this should not be left to chance. The same principles can probably be applied in heapam, for example. Even if I'm wrong about the techniques being highly generalizable, it should still be interesting! (Something to think about a little later.)

Some of the intuitions behind the design might be vaguely familiar to you as the reviewer of my earlier B-Tree work. In particular, the whole way that we reason about how successive related operations (in this case bottom-up deletion passes) affect individual leaf pages over time might give you a feeling of déjà vu. It's a little like the nbtsplitloc.c stuff that we worked on together during the Postgres 12 cycle. We can make what might seem like rather bold assumptions about what's going on, and adapt to the workload. This is okay because we're sure that the downside of our being wrong is a fixed, low performance penalty. (To a lesser degree it's okay because the empirical evidence shows that we're almost always right, because we apply the optimization again and again precisely because it worked out last time.)

I like to compare it to the familiar "rightmost leaf page applies leaf fillfactor" heuristic, which applies an assumption that is wrong in the general case, but nevertheless obviously helps enormously as a practical matter. Of course it's still true that anybody reviewing this patch ought to start with a principled skepticism of this claim -- that's how you review any patch. I can say for sure that that's the idea behind the patch, though. I want to be clear about that up front, to save you time -- if this claim is wrong, then I'm wrong about everything.

Generational garbage collection influenced this work, too. It also applies pragmatic assumptions about where garbage is likely to appear. Assumptions that are based on nothing more than empirical observations about what is likely to happen in the real world, that are validated by experience and by benchmarking.

> On 30/11/2020 21:50, Peter Geoghegan wrote:
> > +} TM_IndexDelete;
> > +} TM_IndexStatus;
>
> Is it really significantly faster to have two arrays? If you had just
> one array, each element would be only 12 bytes long. That's not much
> smaller than TM_IndexDeletes, which is 8 bytes.

Yeah, but the swap operations really matter here. At one point I found that to be the case, anyway. That might no longer be true, though. It might be that the code became less performance critical because other parts of the design improved, resulting in the code getting called less frequently. But if that is true then it has a lot to do with the power-of-two bucketing that you go on to ask about -- that helped performance a lot in certain specific cases (as I go into below).

I will add a TODO item for myself, to look into this again. You may well be right.

> > +	/* First sort caller's array by TID */
> > +	heap_tid_shellsort(delstate->deltids, delstate->ndeltids);
>
> Instead of sorting the array by TID, wouldn't it be enough to sort by
> just the block numbers?

I don't understand. Yeah, I guess that we could do our initial sort of the deltids array (the heap_tid_shellsort() call) just using BlockNumber (not TID).
But OTOH there might be some small locality benefit to doing a proper TID sort at the level of each heap page. And even if there isn't any such benefit, does it really matter?

If you are asking about the later sort of the block counts array (which helps us sort the deltids array a second time, leaving it in its final order for bottom-up deletion heapam.c processing), then the answer is no. This block counts metadata array sort is useful because it allows us to leave the deltids array in what I believe to be the most useful order for processing. We'll access heap blocks primarily based on the number of promising tuples (though as I go into below, sometimes the number of promising tuples isn't a decisive influence on processing order).

> > + * While in general the presence of promising tuples (the hint that index
> > + * AMs provide) is the best information that we have to go on, it is based
> > + * on simple heuristics about duplicates in indexes that are understood to
> > + * have specific flaws.  We should not let the most promising heap pages
> > + * win or lose on the basis of _relatively_ small differences in the total
> > + * number of promising tuples.  Small differences between the most
> > + * promising few heap pages are effectively ignored by applying this
> > + * power-of-two bucketing scheme.
> > + *
>
> What are the "specific flaws"?

I just meant the obvious: the index AM doesn't actually know for sure that there are any old versions on the leaf page that it will ultimately be able to delete. This uncertainty needs to be managed, including inside heapam.c. Feel free to suggest a better way of expressing that sentiment.

> I understand that this is all based on rough heuristics, but is there
> any advantage to rounding/bucketing the numbers like this? Even if small
> differences in the total number of promising tuples are essentially noise
> that can be safely ignored, is there any harm in letting those small
> differences guide the choice? Wouldn't it be essentially the same as
> picking at random, or are those small differences somehow biased?

Excellent question! It actually helps enormously, especially with low cardinality data that makes good use of the deduplication optimization (where it is especially important to keep the costs and the benefits in balance). This has been validated by benchmarking.

This design naturally allows the heapam.c code to take advantage of both temporal and spatial locality. For example, let's say that you have several indexes all on the same table, which get lots of non-HOT UPDATEs, which are skewed. Naturally, the heap TIDs will match across each index -- these are index entries that are needed to represent successor versions (which are logically unchanged/version duplicate index tuples from the point of view of nbtree/nbtdedup.c). Introducing a degree of determinism to the order in which heap blocks are processed naturally takes advantage of the correlated nature of the index bloat. It naturally makes it much more likely that the also-correlated bottom-up index deletion passes (that occur across indexes on the same table) each process the same heap blocks close together in time -- with obvious performance benefits.

In the extreme (but not particularly uncommon) case of non-HOT UPDATEs with many low cardinality indexes, each heapam.c call will end up doing *almost the same thing* across indexes. So we're making the correlated nature of the bloat (which is currently a big problem) work in our favor -- turning it on its head, you could say.
Highly correlated bloat is not the exception -- it's actually the norm in the cases we're targeting here. If it wasn't highly correlated then it would already be okay to rely on VACUUM to get around to cleaning it later.

This power-of-two bucketing design probably also helps when there is only one index. I could go into more detail on that, plus other variations, but perhaps the multiple index example suffices for now. I believe that there are a few interesting kinds of correlations here -- I bet you can think of some yourself. (Of course it's also possible and even likely that heap block correlation won't be important at all, but my response is "what specifically is the harm in being open to the possibility?".)

BTW, I tried to make the tableam interface changes more or less compatible with Zedstore, which is notable for not using TIDs in the same way as heapam (or zheap). Let me know what you think about that. I can go into detail about it if it isn't obvious to you.

--
Peter Geoghegan
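The bucketing being discussed can be shown with a small standalone comparator sketch (again purely illustrative -- the ordering logic in the actual patch is more involved): heap blocks whose promising-tuple counts fall into the same power-of-two bucket compare as equal on that criterion, so relatively small differences in the counts never decide the processing order on their own.

#include <stdint.h>
#include <stdlib.h>

typedef struct BlockGroupSketch
{
    uint32_t blkno;             /* heap block number */
    int      npromisingtids;    /* "promising" TIDs pointing to this block */
} BlockGroupSketch;

/* Bucket a count by power of two: 0-1 -> 0, 2-3 -> 1, 4-7 -> 2, ... */
static int
power_of_two_bucket(int n)
{
    int bucket = 0;

    while (n > 1)
    {
        n >>= 1;
        bucket++;
    }
    return bucket;
}

/*
 * qsort() comparator: groups in higher promising-tuple buckets sort first;
 * ties are broken by block number, which keeps the visit order
 * deterministic across the correlated deletion passes described above.
 */
static int
block_group_cmp(const void *a, const void *b)
{
    const BlockGroupSketch *ga = a;
    const BlockGroupSketch *gb = b;
    int ba = power_of_two_bucket(ga->npromisingtids);
    int bb = power_of_two_bucket(gb->npromisingtids);

    if (ba != bb)
        return (ba < bb) ? 1 : -1;      /* larger bucket sorts first */
    if (ga->blkno != gb->blkno)
        return (ga->blkno < gb->blkno) ? -1 : 1;
    return 0;
}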
On Tue, Dec 1, 2020 at 2:18 PM Peter Geoghegan <[hidden email]> wrote:
> On Tue, Dec 1, 2020 at 1:50 AM Heikki Linnakangas <[hidden email]> wrote:
> > This is a wholly new concept with a lot of heuristics. It's a lot to
> > swallow.

Attached is v11, which cleans everything up around the tableam interface. There are only two patches in v11, since the tableam refactoring made it impossible to split the second patch into a heapam patch and nbtree patch (there is no reduction in functionality compared to v10).

Most of the real changes in v11 (compared to v10) are in heapam.c. I've completely replaced the table_compute_xid_horizon_for_tuples() interface with a new interface that supports all existing requirements (from index deletions that support LP_DEAD deletion), while also supporting these new requirements (e.g. bottom-up index deletion). So now heap_compute_xid_horizon_for_tuples() becomes heap_compute_delete_for_tuples(), which has different arguments but the same overall structure. All of the new requirements can now be thought of as additive things that we happen to use for nbtree callers, that could easily also be used in other index AMs later on. This means that there is a lot less new code in heapam.c.

Prefetching of heap blocks for the new bottom-up index deletion caller now works in the same way as it has worked in heap_compute_xid_horizon_for_tuples() since Postgres 12 (more or less). This is a significant improvement compared to my original approach.

Changing heap_compute_xid_horizon_for_tuples() rather than adding a sibling function started to make sense when v10 of the patch taught regular nbtree LP_DEAD deletion (the thing that has been around since PostgreSQL 8.2) to add "extra" TIDs to check in passing, just in case we find that they're also deletable -- why not just have one function? It also means that hash indexes and GiST indexes now use the heap_compute_delete_for_tuples() function, though they get the same behavior as before. I imagine that it would be pretty straightforward to bring that same benefit to those other index AMs, since their implementations are already derived from the nbtree implementation of LP_DEAD deletion (and because adding extra TIDs to check in passing in these other index AMs should be fairly easy).

> > > +} TM_IndexDelete;
> > > +} TM_IndexStatus;
> >
> > Is it really significantly faster to have two arrays? If you had just
> > one array, each element would be only 12 bytes long. That's not much
> > smaller than TM_IndexDeletes, which is 8 bytes.

I kept this facet of the design in v11, following some deliberation. I found that the TID sort operation appeared quite prominently in profiles, and so the optimizations mostly seemed to still make sense. I also kept one shellsort specialization. However, I removed the second specialized sort implementation, so at least there is only one specialization now (which is small anyway). I found that the second sort specialization (added to heapam.c in v10) really wasn't pulling its weight, even in more extreme cases of the kind that justified the optimizations in the first place.

--
Peter Geoghegan
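For readers trying to picture how the two parallel arrays quoted earlier fit together, here is a rough standalone sketch (simplified local definitions modeled on the quoted structs, not the actual v11 interface): the index AM fills one entry in each array per candidate TID, the table AM sorts and walks only the small TID array, and the index AM then reads the per-tuple verdicts back through the id cross-reference.

#include <stdbool.h>
#include <stdint.h>

typedef struct TidSketch { uint32_t block; uint16_t offset; } TidSketch;

typedef struct IndexDeleteSketch
{
    TidSketch tid;              /* table TID taken from the index tuple */
    int16_t   id;               /* offset into the status array */
} IndexDeleteSketch;            /* small: this is the array that gets sorted */

typedef struct IndexStatusSketch
{
    uint16_t idxoffnum;         /* leaf page offset of the index tuple */
    int16_t  tupsize;           /* space freed in the index if deleted */
    bool     ispromising;       /* duplicate in the index? */
    bool     deletable;         /* filled in by the table AM: dead-to-all */
} IndexStatusSketch;

/* Index AM side: record one candidate TID in both arrays. */
static void
add_candidate(IndexDeleteSketch *deltids, IndexStatusSketch *status,
              int *ndeltids, TidSketch tid, uint16_t idxoffnum,
              int16_t tupsize, bool ispromising)
{
    int i = (*ndeltids)++;

    deltids[i].tid = tid;
    deltids[i].id = (int16_t) i;        /* survives any later sort of deltids */
    status[i].idxoffnum = idxoffnum;
    status[i].tupsize = tupsize;
    status[i].ispromising = ispromising;
    status[i].deletable = false;        /* the table AM decides this */
}

/* Index AM side, after the table AM call: leaf offsets that can be killed. */
static int
collect_deletable(const IndexDeleteSketch *deltids,
                  const IndexStatusSketch *status, int ndeltids,
                  uint16_t *deadoffsets)
{
    int ndead = 0;

    for (int i = 0; i < ndeltids; i++)
        if (status[deltids[i].id].deletable)
            deadoffsets[ndead++] = status[deltids[i].id].idxoffnum;

    return ndead;
}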
Hi,

In v11-0001-Pass-down-logically-unchanged-index-hint.patch:

+   if (hasexpression)
+       return false;
+
+   return true;

The above can be written as 'return !hasexpression;'

For +index_unchanged_by_update_var_walker:

+ * Returns true when Var that appears within allUpdatedCols located.

The sentence seems incomplete.

Currently the return value of index_unchanged_by_update_var_walker() is the reverse of index_unchanged_by_update(). Maybe it is easier to read the code if their return values have the same meaning.

Cheers

On Wed, Dec 9, 2020 at 5:13 PM Peter Geoghegan <[hidden email]> wrote:
> On Tue, Dec 1, 2020 at 2:18 PM Peter Geoghegan <[hidden email]> wrote: [...]
On Wed, Dec 9, 2020 at 5:12 PM Peter Geoghegan <[hidden email]> wrote:
> Most of the real changes in v11 (compared to v10) are in heapam.c.
> I've completely replaced the table_compute_xid_horizon_for_tuples()
> interface with a new interface that supports all existing requirements
> (from index deletions that support LP_DEAD deletion), while also
> supporting these new requirements (e.g. bottom-up index deletion).

I came up with a microbenchmark that is designed to give some general sense of how helpful it is to include "extra" TIDs alongside LP_DEAD-in-index TIDs (when they happen to point to the same table block as the LP_DEAD-in-index TIDs, which we'll have to visit anyway). The basic idea is simple: Add custom instrumentation that summarizes each B-Tree index deletion in LOG output, and then run the regression tests. Attached is the result of this for a "make installcheck-world".

If you're just looking at this thread for the first time in a while: note that what I'm about to go into isn't technically bottom-up deletion (though you will see some of that in the full log output if you look). It's a closely related optimization that only appeared in recent versions of the patch. So I'm comparing the current approach to simple deletion of LP_DEAD-marked index tuples to a new enhanced approach (that makes it a little more like bottom-up deletion, but only a little).

Here is some sample output (selected almost at random from a text file consisting of 889 lines of similar output):

... exact TIDs deleted 17, LP_DEAD tuples 4, LP_DEAD-related table blocks 2 )
... exact TIDs deleted 38, LP_DEAD tuples 28, LP_DEAD-related table blocks 1 )
... exact TIDs deleted 39, LP_DEAD tuples 1, LP_DEAD-related table blocks 1 )
... exact TIDs deleted 44, LP_DEAD tuples 42, LP_DEAD-related table blocks 3 )
... exact TIDs deleted 6, LP_DEAD tuples 2, LP_DEAD-related table blocks 2 )

(The initial contents of each line were snipped here, to focus on the relevant metrics.)

Here we see that the actual number of TIDs/index tuples deleted often *vastly* exceeds the number of LP_DEAD-marked tuples (which are all we would have been able to delete with the existing approach of just deleting LP_DEAD items). It's pretty rare for us to fail to at least delete a couple of extra TIDs.

Clearly this technique is broadly effective, because in practice there are significant locality-related effects that we can benefit from. It doesn't really matter that it's hard to precisely describe all of these locality effects. IMV the question that really matters is whether or not the cost of trying is consistently very low (relative to the cost of our current approach to simple LP_DEAD deletion). We do need to understand that fully.

It's tempting to think about this quantitatively (and it also bolsters the case for the patch), but that misses the point. The right way to think about this is as a *qualitative* thing. The presence of LP_DEAD bits gives us certain reliable information about the nature of the index tuple (that it is dead-to-all, and therefore safe to delete), but it also *suggests* quite a lot more than that. In practice bloat is usually highly correlated/concentrated, especially when we limit our consideration to workloads where bloat is noticeable in any practical sense. So we're very well advised to look for nearby deletable index tuples in passing -- especially since it's practically free to do so. (Which is what the patch does here, of course.)

Let's compare this to an extreme variant of my patch that runs the same test, to see what changes.
Consider a variant that exhaustively checks every index tuple on the page at the point of a simple LP_DEAD deletion operation, no matter what. Rather than only including those TIDs that happen to be on the same heap/table blocks (and thus are practically free to check), we include all of them. This design isn't acceptable in the real world because it does a lot of unnecessary I/O, but that shouldn't invalidate any comparison I make here. This is still a reasonable approximation of a version of the patch with magical foreknowledge of where to find dead TIDs. It's an Oracle (ahem) that we can sensibly compare to the real patch within certain constraints.

The results of this comparative analysis seem to suggest something important about the general nature of what's going on. The results are: There are only 844 deletion operations total with the Oracle. While this is less than the actual patch's 889 deletion operations, you would expect a bigger improvement from using what is after all supposed to apply magic! This suggests to me that the precise intervention of the patch here (the new LP_DEAD deletion stuff) has an outsized impact. The correlations that naturally exist make this asymmetrical payoff-to-cost situation possible.

Simple cheap heuristics once again go a surprisingly long way, kind of like bottom-up deletion itself.

--
Peter Geoghegan
On Wed, Dec 9, 2020 at 5:12 PM Peter Geoghegan <[hidden email]> wrote:
> Attached is v11, which cleans everything up around the tableam
> interface. There are only two patches in v11, since the tableam
> refactoring made it impossible to split the second patch into a heapam
> patch and nbtree patch (there is no reduction in functionality
> compared to v10).

Attached is v12, which fixed bitrot against the master branch. This version has significant comment and documentation revisions. It is functionally equivalent to v11, though.

I intend to commit the patch in the next couple of weeks. While it certainly would be nice to get a more thorough review, I don't feel that it is strictly necessary. The patch provides very significant benefits with certain workloads that have traditionally been considered an Achilles' heel for Postgres. Even zheap doesn't provide a solution to these problems. The only thing that I can think of that might reasonably be considered in competition with this design is WARM, which hasn't been under active development since 2017 (I assume that it has been abandoned by those involved).

I should also point out that the design doesn't change the on-disk format in any way, and so doesn't commit us to supporting something that might become onerous in the event of somebody else finding a better way to address at least some of the same problems.

--
Peter Geoghegan
Thu, 31 Dec 2020 at 03:55, Peter Geoghegan <[hidden email]>:

> Attached is v12, which fixed bitrot against the master branch. This [...]

I am planning to look into this patch in the next few days.

Victor Yegorov
Hi, Peter:

Happy New Year.

For v12-0001-Pass-down-logically-unchanged-index-hint.patch:

+   if (hasexpression)
+       return false;
+
+   return true;

The above can be written as return !hasexpression;

The negation is needed because the return value of index_unchanged_by_update_var_walker() is the inverse of "index unchanged". If you align the meaning of the return value of index_unchanged_by_update_var_walker() with that of index_unchanged_by_update(), the negation is not needed.

For v12-0002-Add-bottom-up-index-deletion.patch:

For struct xl_btree_delete:

+   /* DELETED TARGET OFFSET NUMBERS FOLLOW */
+   /* UPDATED TARGET OFFSET NUMBERS FOLLOW */
+   /* UPDATED TUPLES METADATA (xl_btree_update) ARRAY FOLLOWS */

I guess the comments are for illustration purposes. Maybe you can write them in lower case.

+#define BOTTOMUP_FAVORABLE_STRIDE 3

Adding a comment on why the number 3 is chosen would be helpful for people to understand the code.

Cheers

On Wed, Dec 30, 2020 at 6:55 PM Peter Geoghegan <[hidden email]> wrote:
> On Wed, Dec 9, 2020 at 5:12 PM Peter Geoghegan <[hidden email]> wrote: [...]
Thu, 31 Dec 2020 at 20:01, Zhihong Yu <[hidden email]>:
To be honest, I prefer the way Peter has it in his patch.

Yes, it's possible to shorten this part. But readability is hurt — for the current code I just read it, for the suggested change I need to think about it.

Victor Yegorov