Deleting older versions in unique indexes to avoid page splits


Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
On Thu, Dec 31, 2020 at 11:01 AM Zhihong Yu <[hidden email]> wrote:
> Happy New Year.

Happy New Year.

> For v12-0001-Pass-down-logically-unchanged-index-hint.patch
>
> +   if (hasexpression)
> +       return false;
> +
> +   return true;
>
> The above can be written as return !hasexpression;
> The negation is needed because the return value of index_unchanged_by_update_var_walker() is the inverse of "index unchanged".
> If you align the meaning of index_unchanged_by_update_var_walker()'s return value with that of index_unchanged_by_update(), the negation is not needed.

I don't think that that represents an improvement. The negation seems
clearer to me because we're negating the *absence* of something that
we search for more or less linearly (a modified column from the
index). This happens when determining whether to do an extra thing
(provide the "logically unchanged" hint to this particular
index/aminsert() call). To me, the negation reflects that high level
structure.
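To make the shape of the code concrete, here is a compilable sketch of the structure under discussion (names follow the patch, but the walker is reduced to a stub, so this is illustrative only):

    #include <stdbool.h>

    /* Stub standing in for the patch's walker: it returns true as soon
     * as it finds a modified indexed column/expression (simplified). */
    static bool
    index_unchanged_by_update_var_walker(void)
    {
        return false;
    }

    static bool
    index_unchanged_by_update(void)
    {
        bool        hasexpression = index_unchanged_by_update_var_walker();

        if (hasexpression)
            return false;       /* something changed: index is affected */

        return true;            /* equivalent to "return !hasexpression;" */
    }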

> For struct xl_btree_delete:
>
> +   /* DELETED TARGET OFFSET NUMBERS FOLLOW */
> +   /* UPDATED TARGET OFFSET NUMBERS FOLLOW */
> +   /* UPDATED TUPLES METADATA (xl_btree_update) ARRAY FOLLOWS */
>
> I guess the comment is for illustration purposes. Maybe you can write the comment in lower case.

The comment is written like this (in upper case) to be consistent
with an existing convention. If this was a green field situation I
suppose I might not write it that way, but that's not how I assess
these things. I always try to give the existing convention the benefit
of the doubt. In this case I don't think that it matters very much,
and so I'm inclined to stick with the existing style.
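For context, those three comments sit at the end of the WAL record struct, marking the variable-length data that follows it on disk. A paraphrase of the struct from the patch (the typedef shims stand in for PostgreSQL's own headers; field list simplified):

    #include <stdint.h>

    typedef uint32_t TransactionId;     /* shim for the PostgreSQL typedef */
    typedef uint16_t uint16;            /* shim for the PostgreSQL typedef */

    typedef struct xl_btree_delete
    {
        TransactionId latestRemovedXid;
        uint16      ndeleted;           /* number of deleted offsets */
        uint16      nupdated;           /* number of updated offsets */

        /* DELETED TARGET OFFSET NUMBERS FOLLOW */
        /* UPDATED TARGET OFFSET NUMBERS FOLLOW */
        /* UPDATED TUPLES METADATA (xl_btree_update) ARRAY FOLLOWS */
    } xl_btree_delete;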

> +#define BOTTOMUP_FAVORABLE_STRIDE  3
>
> Adding a comment on why the number 3 is chosen would be helpful for people to understand the code.

Noted - I will add something about that to the next revision.

--
Peter Geoghegan



Re: Deleting older versions in unique indexes to avoid page splits

Heikki Linnakangas
In reply to this post by Peter Geoghegan-4
On 02/12/2020 00:18, Peter Geoghegan wrote:

> On Tue, Dec 1, 2020 at 1:50 AM Heikki Linnakangas <[hidden email]> wrote:
>> On 30/11/2020 21:50, Peter Geoghegan wrote:
>>> +} TM_IndexDelete;
>
>>> +} TM_IndexStatus;
>
>> Is it really significantly faster to have two arrays? If you had just
>> one array, each element would be only 12 bytes long. That's not much
>> bigger than TM_IndexDelete, which is 8 bytes.
>
> Yeah, but the swap operations really matter here. At one point I found
> that to be the case, anyway. That might no longer be true, though. It
> might be that the code became less performance critical because other
> parts of the design improved, resulting in the code getting called
> less frequently. But if that is true then it has a lot to do with the
> power-of-two bucketing that you go on to ask about -- that helped
> performance a lot in certain specific cases (as I go into below).
>
> I will add a TODO item for myself, to look into this again. You may
> well be right.
>
>>> +     /* First sort caller's array by TID */
>>> +     heap_tid_shellsort(delstate->deltids, delstate->ndeltids);
>
>> Instead of sorting the array by TID, wouldn't it be enough to sort by
>> just the block numbers?
>
> I don't understand. Yeah, I guess that we could do our initial sort of
> the deltids array (the heap_tid_shellsort() call) just using
> BlockNumber (not TID). But OTOH there might be some small locality
> benefit to doing a proper TID sort at the level of each heap page. And
> even if there isn't any such benefit, does it really matter?

You said above that heap_tid_shellsort() is very performance critical,
and that's why you use the two arrays approach. If it's so performance
critical that swapping 8 bytes vs 12 byte array elements makes a
difference, I would guess that comparing TID vs just the block numbers
would also make a difference.

If you really want to shave cycles, you could also store BlockNumber and
OffsetNumber in the TM_IndexDelete array, instead of ItemPointerData.
What's the difference, you might ask? ItemPointerData stores the block
number as two 16 bit integers that need to be reassembled into a 32 bit
integer in the ItemPointerGetBlockNumber() macro.
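For reference, the layout in question looks roughly like this (paraphrased from PostgreSQL's storage/block.h and storage/itemptr.h; the typedef shims just make the fragment self-contained):

    #include <stdint.h>

    typedef uint16_t uint16;
    typedef uint32_t BlockNumber;
    typedef uint16_t OffsetNumber;

    typedef struct BlockIdData
    {
        uint16      bi_hi;      /* high 16 bits of the block number */
        uint16      bi_lo;      /* low 16 bits of the block number */
    } BlockIdData;

    typedef struct ItemPointerData  /* 6 bytes total */
    {
        BlockIdData ip_blkid;
        OffsetNumber ip_posid;
    } ItemPointerData;

    /* ItemPointerGetBlockNumber() must reassemble the two halves: */
    #define BlockIdGetBlockNumber(blockId) \
        ((((BlockNumber) (blockId)->bi_hi) << 16) | \
         ((BlockNumber) (blockId)->bi_lo))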

My argument here is two-pronged: If this is performance critical, you
could do these additional optimizations. If it's not, then you don't
need the two-arrays trick; one array would be simpler.

I'm not sure how performance critical this really is, or how many
instructions each of these optimizations might shave off, so of course I
might be wrong and the tradeoff you have in the patch now really is the
best. However, my intuition would be to use a single array, because
that's simpler, and do all the other optimizations instead: store the
tid in the struct as separate BlockNumber and OffsetNumber fields, and
sort only by the block number. Those optimizations are very deep in the
low level functions and won't add much to the mental effort of
understanding the algorithm at a high level.

>>>         * While in general the presence of promising tuples (the hint that index
>>> +      * AMs provide) is the best information that we have to go on, it is based
>>> +      * on simple heuristics about duplicates in indexes that are understood to
>>> +      * have specific flaws.  We should not let the most promising heap pages
>>> +      * win or lose on the basis of _relatively_ small differences in the total
>>> +      * number of promising tuples.  Small differences between the most
>>> +      * promising few heap pages are effectively ignored by applying this
>>> +      * power-of-two bucketing scheme.
>>> +      *
>>
>> What are the "specific flaws"?
>
> I just meant the obvious: the index AM doesn't actually know for sure
> that there are any old versions on the leaf page that it will
> ultimately be able to delete. This uncertainty needs to be managed,
> including inside heapam.c. Feel free to suggest a better way of
> expressing that sentiment.

Hmm, maybe: "... is the best information that we have to go on, it is
just a guess based on simple heuristics about duplicates in indexes".

>> I understand that this is all based on rough heuristics, but is there
>> any advantage to rounding/bucketing the numbers like this? Even if small
>> differences in the total number of promising tuple is essentially noise
>> that can be safely ignored, is there any harm in letting those small
>> differences guide the choice? Wouldn't it be the essentially the same as
>> picking at random, or are those small differences somehow biased?
>
> Excellent question! It actually helps enormously, especially with low
> cardinality data that makes good use of the deduplication optimization
> (where it is especially important to keep the costs and the benefits
> in balance). This has been validated by benchmarking.
>
> This design naturally allows the heapam.c code to take advantage of
> both temporal and spatial locality. For example, let's say that you
> have several indexes all on the same table, which get lots of non-HOT
> UPDATEs, which are skewed. Naturally, the heap TIDs will match across
> each index -- these are index entries that are needed to represent
> successor versions (which are logically unchanged/version duplicate
> index tuples from the point of view of nbtree/nbtdedup.c). Introducing
> a degree of determinism to the order in which heap blocks are
> processed naturally takes advantage of the correlated nature of the
> index bloat. It naturally makes it much more likely that the
> also-correlated bottom-up index deletion passes (that occur across
> indexes on the same table) each process the same heap blocks close
> together in time -- with obvious performance benefits.
>
> In the extreme (but not particularly uncommon) case of non-HOT UPDATEs
> with many low cardinality indexes, each heapam.c call will end up
> doing *almost the same thing* across indexes. So we're making the
> correlated nature of the bloat (which is currently a big problem) work
> in our favor -- turning it on its head, you could say. Highly
> correlated bloat is not the exception -- it's actually the norm in the
> cases we're targeting here. If it wasn't highly correlated then it
> would already be okay to rely on VACUUM to get around to cleaning it
> later.
>
> This power-of-two bucketing design probably also helps when there is
> only one index. I could go into more detail on that, plus other
> variations, but perhaps the multiple index example suffices for now. I
> believe that there are a few interesting kinds of correlations here --
> I bet you can think of some yourself. (Of course it's also possible
> and even likely that heap block correlation won't be important at all,
> but my response is "what specifically is the harm in being open to the
> possibility?".)

I see. Would be good to explain that pattern with multiple indexes in
the comment.

- Heikki



Re: Deleting older versions in unique indexes to avoid page splits

Victor Yegorov
In reply to this post by Peter Geoghegan-4
Thu, 31 Dec 2020 at 03:55, Peter Geoghegan <[hidden email]>:
Attached is v12, which fixed bitrot against the master branch. This
version has significant comment and documentation revisions. It is
functionally equivalent to v11, though.

I intend to commit the patch in the next couple of weeks. While it
certainly would be nice to get a more thorough review, I don't feel
that it is strictly necessary. The patch provides very significant
benefits with certain workloads that have traditionally been
considered an Achilles' heel for Postgres. Even zheap doesn't provide
a solution to these problems. The only thing that I can think of that
might reasonably be considered in competition with this design is
WARM, which hasn't been under active development since 2017 (I assume
that it has been abandoned by those involved).

I've looked through the v12 patch.

I like the new outline:

- _bt_delete_or_dedup_one_page() is the main entry for the new code
- first we try _bt_simpledel_pass(), which does improved cleanup of LP_DEAD entries
- then (if necessary) _bt_bottomupdel_pass() for bottomup deletion
- finally, we perform _bt_dedup_pass() for deduplication

We split the leaf page only if all the actions above failed to provide enough space.
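A hypothetical sketch of that control flow (stand-in stubs; the real _bt_delete_or_dedup_one_page() in nbtinsert.c takes page/state arguments and has more preconditions):

    #include <stdbool.h>

    /* Stubs standing in for _bt_simpledel_pass(), _bt_bottomupdel_pass()
     * and _bt_dedup_pass(); each returns true if enough space was freed. */
    static bool simpledel_pass(void)   { return false; }
    static bool bottomupdel_pass(void) { return false; }
    static bool dedup_pass(void)       { return false; }

    static bool
    delete_or_dedup_one_page(bool lpdead_items, bool index_unchanged_hint)
    {
        if (lpdead_items && simpledel_pass())
            return true;        /* 1. improved LP_DEAD cleanup sufficed */
        if (index_unchanged_hint && bottomupdel_pass())
            return true;        /* 2. bottom-up deletion sufficed */
        if (dedup_pass())
            return true;        /* 3. deduplication sufficed */
        return false;           /* caller must split the leaf page */
    }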

Some comments on the code.

v12-0001
--------

1. For the following comment

+    * Only do this for key columns.  A change to a non-key column within an
+    * INCLUDE index should not be considered because that's just payload to
+    * the index (they're not unlike table TIDs to the index AM).

The last part of it (in the parenthesis) is difficult to grasp due to
the double negation (not unlike). I think it's better to rephrase it.

2. After reading the patch, I also think that the fact that index_unchanged_by_update()
and index_unchanged_by_update_var_walker() return opposite bool states
(i.e. when the latter returns true, the former returns false) is a bit misleading.

Although the logic as it is looks fine, maybe index_unchanged_by_update_var_walker()
can be renamed to avoid this confusion, to something like index_expression_changed_walker()?

v12-0002
--------

1. Thanks for the comments, they're well made and do help to read the code.

2. I'm not sure the bottomup_delete_items parameter is very helpful. In order to disable
bottom-up deletion, a DBA needs some way to measure its impact on a particular index.
Currently I do not see how to achieve this. Not sure if this is overly important, though, as
you have a similar parameter for deduplication.

3. It feels like indexUnchanged would be better as indexChanged, with its usage negated in the code,
    as !indexChanged reads more naturally than !indexUnchanged, at least to me.

This is all I have. I agree that this code is pretty close to being committed.

Now for the tests.

First, I ran a 2-hour-long case with the same setup as I used in my e-mail from 15 November.
I found no difference between patch and master whatsoever. Which makes me think that current
master is already quite good at bloat control (not sure if this is an effect of
commit 4228817449 or of deduplication).

I created another setup (see attached testcases). Basically, I emulated queue operations (INSERT at the end and DELETE



--
Victor Yegorov

Re: Deleting older versions in unique indexes to avoid page splits

Victor Yegorov
Mon, 4 Jan 2021 at 17:28, Victor Yegorov <[hidden email]>:
I created another setup (see attached testcases). Basically, I emulated queue operations (INSERT at the end and DELETE

Sorry, hit Send too early.

So, I emulated queue operations (INSERT at the end and DELETE from the head), and also made 5-minute transactions
appear in the background for the whole duration of the test. Three pgbench instances were run in parallel on a scale-3000 bench database
with modifications (attached).

Master
------

        relname        |   nrows   | blk_before | mb_before | blk_after | mb_after |  diff
-----------------------+-----------+------------+-----------+-----------+----------+--------
 pgbench_accounts      | 300000000 |    4918033 |   38422.1 |   5065575 |  39574.8 |  +3.0%
 accounts_mtime        | 300000000 |    1155119 |    9024.4 |   1287656 |  10059.8 | +11.5%
 fiver                 | 300000000 |     427039 |    3336.2 |    567755 |   4435.6 | +33.0%
 pgbench_accounts_pkey | 300000000 |     822573 |    6426.4 |   1033344 |   8073.0 | +25.6%
 score                 | 300000000 |     284022 |    2218.9 |    458502 |   3582.0 | +61.4%
 tenner                | 300000000 |     346050 |    2703.5 |    417985 |   3265.5 | +20.8%
(6 rows)

DB size: 65.2..72.3 (+7.1GB / +10.9%)
TPS: 2297 / 495

Patched
------
        relname        |   nrows   | blk_before | mb_before | blk_after | mb_after |  diff
-----------------------+-----------+------------+-----------+-----------+----------+--------
 pgbench_accounts      | 300000000 |    4918033 |   38422.1 |   5067500 |  39589.8 |  +3.0%
 accounts_mtime        | 300000000 |    1155119 |    9024.4 |   1283441 |  10026.9 | +11.1%
 fiver                 | 300000000 |     427039 |    3336.2 |    429101 |   3352.4 |  +0.5%
 pgbench_accounts_pkey | 300000000 |     822573 |    6426.4 |    826056 |   6453.6 |  +0.4%
 score                 | 300000000 |     284022 |    2218.9 |    285465 |   2230.2 |  +0.5%
 tenner                | 300000000 |     346050 |    2703.5 |    347695 |   2716.4 |  +0.5%
(6 rows)

DB size: 65.2..67.5 (+2.3GB / +3.5%)
TPS: 2216 / 492

As you can see, TPS are very much similar, but the fact that we have no bloat for the patched version makes me very happy!

On the graphs, you can clearly see the extra write activity performed by the backends of the patched version.
 

--
Victor Yegorov

Attachments:
- 20210103-pg_bench-alterations.txt (766 bytes)
- 20210103-master-overview.png (294K)
- 20210103-master-dbsize.png (48K)
- 20210103-v12-overview.png (332K)
- 20210103-v12-dbsize.png (46K)
- 20210103-postgresql.auto.conf (980 bytes)
- 20210103-testcase1.pgbench (380 bytes)
- 20210103-testcase2.pgbench (484 bytes)
- 20210103-testcase3.pgbench (154 bytes)
- 20210103-results-master.txt (4K)
- 20210103-results-patched.txt (3K)

Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
In reply to this post by Heikki Linnakangas
On Mon, Jan 4, 2021 at 4:08 AM Heikki Linnakangas <[hidden email]> wrote:

> You said above that heap_tid_shellsort() is very performance critical,
> and that's why you use the two arrays approach. If it's so performance
> critical that swapping 8 bytes vs 12 byte array elements makes a
> difference, I would guess that comparing TID vs just the block numbers
> would also make a difference.
>
> If you really want to shave cycles, you could also store BlockNumber and
> OffsetNumber in the TM_IndexDelete array, instead of ItemPointerData.
> What's the difference, you might ask? ItemPointerData stores the block
> number as two 16 bit integers that need to be reassembled into a 32 bit
> integer in the ItemPointerGetBlockNumber() macro.
>
> My argument here is two-pronged: If this is performance critical, you
> could do these additional optimizations. If it's not, then you don't
> need the two-arrays trick; one array would be simpler.

That's reasonable. The reason why I haven't been more decisive here is
because the question of whether or not it matters is actually very
complicated, for reasons that have little to do with sorting. The more
effective the mechanism is each time (the more bytes it allows nbtree
to free from each leaf page), the less often it is called, and the
less performance critical the overhead per operation is. On the other
hand there are a couple of other questions about what we do in
heapam.c that aren't quite resolved yet (e.g. exact "favorable
blocks"/prefetching behavior in bottom-up case), that probably affect
how important the heap_tid_shellsort() microoptimisations are.

I think that it makes sense to make a final decision on this at the
last minute, once everything else is resolved, since the implicit
dependencies make any other approach much too complicated. I agree
that this kind of microoptimization is best avoided, but I really
don't want to have to worry about regressions in workloads that I now
understand fully. I think that the sort became less important on
perhaps 2 or 3 occasions during development, even though that was
never really the goal that I had in mind in each case.

I'll do my best to avoid it.

> Hmm, maybe: "... is the best information that we have to go on, it is
> just a guess based on simple heuristics about duplicates in indexes".

I'll add something like that to the next revision.

> I see. Would be good to explain that pattern with multiple indexes in
> the comment.

Will do -- it is the single best example of how heap block locality
can matter with a real workload, so it makes sense to go with it in
explanatory comments.

--
Peter Geoghegan



Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
In reply to this post by Victor Yegorov
On Mon, Jan 4, 2021 at 8:28 AM Victor Yegorov <[hidden email]> wrote:

> I've looked through the v12 patch.
>
> I like the new outline:
>
> - _bt_delete_or_dedup_one_page() is the main entry for the new code
> - first we try _bt_simpledel_pass(), which does improved cleanup of LP_DEAD entries
> - then (if necessary) _bt_bottomupdel_pass() for bottomup deletion
> - finally, we perform _bt_dedup_pass() for deduplication
>
> We split the leaf page only if all the actions above failed to provide enough space.

I'm thinking of repeating the LP_DEAD enhancement within GiST and hash
shortly after this, as follow-up work. Their implementation of LP_DEAD
deletion was always based on the nbtree original, and I think that it
makes sense to keep everything in sync. The simple deletion
enhancement probably makes just as much sense in these other places.

> +    * Only do this for key columns.  A change to a non-key column within an
> +    * INCLUDE index should not be considered because that's just payload to
> +    * the index (they're not unlike table TIDs to the index AM).
>
> The last part of it (in the parenthesis) is difficult to grasp due to
> the double negation (not unlike). I think it's better to rephrase it.

Okay, will do.

> 2. After reading the patch, I also think that the fact that index_unchanged_by_update()
> and index_unchanged_by_update_var_walker() return opposite bool states
> (i.e. when the latter returns true, the former returns false) is a bit misleading.

> Although the logic as it is looks fine, maybe index_unchanged_by_update_var_walker()
> can be renamed to avoid this confusion, to something like index_expression_changed_walker()?

I agree. I'll use the name index_expression_changed_walker() in the
next revision.

> 2. I'm not sure the bottomup_delete_items parameter is very helpful. In order to disable
> bottom-up deletion, a DBA needs some way to measure its impact on a particular index.
> Currently I do not see how to achieve this. Not sure if this is overly important, though, as
> you have a similar parameter for deduplication.

I'll take bottomup_delete_items out, since I think that you may be
right about that. It can be another "decision to recheck mid-beta"
thing (on the PostgreSQL 14 open items list), so we can delay making a
final decision on it until after it gets tested by the broadest
possible group of people.

> 3. It feels like indexUnchanged would be better as indexChanged, with its usage negated in the code,
>     as !indexChanged reads more naturally than !indexUnchanged, at least to me.

I don't want to do that because then we're negating "indexChanged" as
part of our gating condition to the bottom-up deletion pass code. To
me this feels wrong, since the hint only exists to be used to trigger
index tuple deletion. This is unlikely to ever change.

> First, I ran a 2-hour-long case with the same setup as I used in my e-mail from 15 November.
> I found no difference between patch and master whatsoever. Which makes me think that current
> master is already quite good at bloat control (not sure if this is an effect of
> commit 4228817449 or of deduplication).

Commit 4228817449 can't have improved the performance of the master
branch -- that commit was just about providing a *correct*
latestRemovedXid value. It cannot affect how many index tuples get
deleted per pass, or anything like that.

Not surprised that you didn't see many problems with the master branch
on your first attempt. It's normal for there to be non-linear effects
with this stuff, and these are very hard (maybe even impossible) to
model. For example, even with artificial test data you often see
distinct "waves" of page splits, which is a phenomenon pretty well
described by the "Waves of misery after index creation" paper [1].
It's likely that your original stress test case (the 15 November one)
would have shown significant bloat if you just kept it up for long
enough. One goal for this project is to make the performance
characteristics more predictable. The performance characteristics are
unpredictable precisely because they're pathological. B-Trees will
always have non-linear performance characteristics, but they can be
made a lot less chaotic in practice.

To be honest, I was slightly surprised that your more recent stress
test (the queue thing) worked out so well for the patch. But not too
surprised, because I don't necessarily expect to understand how the
patch can help for any given workload -- this is dynamic behavior
that's part of a complex system. I first thought that maybe the
presence of the INSERTs and DELETEs would mess things up. But I was
wrong -- everything still worked well, probably because bottom-up
index deletion "cooperated" with deduplication. The INSERTs could not
trigger a bottom-up deletion (which might have been useful for these
particular INSERTs), but that didn't actually matter because
deduplication was triggered instead, which "held the line" for long
enough for bottom-up deletion to eventually get triggered (by non-HOT
UPDATEs).

As I've said before, I believe that the workload should "figure out
the best strategy for handling bloat on its own". A diversity of
strategies are available, which can adapt to many different kinds of
workloads organically. That kind of approach is sometimes necessary
with a complex system IMV.

[1] https://btw.informatik.uni-rostock.de/download/tagungsband/B2-2.pdf
--
Peter Geoghegan



Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
On Thu, Jan 7, 2021 at 3:07 PM Peter Geoghegan <[hidden email]> wrote:
> I agree. I'll use the name index_expression_changed_walker() in the
> next revision.

Attached is v13, which has this tweak, and other miscellaneous cleanup
based on review from both Victor and Heikki. I consider this version
of the patch to be committable. I intend to commit something close to
it in the next week, probably no later than Thursday. I still haven't
got to the bottom of the shellsort question raised by Heikki. I intend
to do further performance validation before committing the patch. I
will look into the shellsort thing again as part of this final
performance validation work -- perhaps I can get rid of the
specialized shellsort implementation entirely, simplifying the state
structs added to tableam.h. (As I said before, it seems best to
address this last of all to avoid making performance validation even
more complicated.)

This version of the patch is notable for removing the index storage
param, and for having lots of comment updates and documentation
consolidation, particularly in heapam.c. Many of the latter changes
are based on feedback from Heikki. Note that all of the discussion of
heapam level locality has been consolidated, and is now mostly
confined to a fairly large comment block over
bottomup_nblocksfavorable() in heapam.c. I also cut down on redundancy
among comments about the design at the whole-patch level. A small
amount of redundancy in design docs/comments is a good thing IMV. It
was hard to get the balance exactly right, since bottom-up index
deletion is by its very nature a mechanism that requires the index AM
and the tableam to closely cooperate -- which is a novel thing.

This isn't 100% housekeeping changes, though. I did add one new minor
optimization to v13: We now count the heap block of the incoming new
item index tuple's TID (the item that won't fit on the leaf page
as-is) as an LP_DEAD-related block for the purposes of determining
which heap blocks will be visited during simple index tuple deletion.
The extra cost of doing this is low: when the new item heap block is
visited purely due to this new behavior, we're still practically
guaranteed to not get a buffer miss to read from the heap page. The
reason should be obvious: the executor is currently in the process of
modifying that same heap page anyway. The benefits are also high
relative to the cost. This heap block in particular seems to be very
promising as a place to look for deletable TIDs (I tested this with
custom instrumentation and microbenchmarks). I believe that this
effect exists because by its very nature garbage is often concentrated
in recently modified pages. This is per the generational hypothesis,
an important part of the theory behind GC algorithms for automated
memory management (GC theory seems to have real practical relevance to
the GC/VACUUM problems in Postgres, at least at a high level).
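Sketched as a hypothetical helper (the names here are illustrative, not the patch's):

    #include <stdint.h>

    typedef uint32_t BlockNumber;

    /*
     * Hypothetical sketch: when collecting the heap blocks that simple
     * index tuple deletion will visit (normally those referenced by
     * LP_DEAD-marked index tuples), also count the incoming new item's
     * heap block.  The executor is modifying that heap page right now,
     * so visiting it is practically free, and recently modified pages
     * tend to hold garbage (the generational hypothesis).
     */
    static int
    add_newitem_block(BlockNumber *deadblocks, int ndeadblocks,
                      BlockNumber newitemblock)
    {
        deadblocks[ndeadblocks++] = newitemblock;
        return ndeadblocks;
    }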

Of course we still won't do any simple deletion operations unless
there is at least one index tuple with its LP_DEAD bit set in the
first place at the point that it looks like the page will overflow (no
change there). As always, we're just piggy-backing some extra work on
top of an expensive operation that needed to take place anyway. I
couldn't resist adding this new minor optimization at this late stage,
because it is such a bargain.

Thanks
--
Peter Geoghegan

Attachments:
- v13-0001-Pass-down-logically-unchanged-index-hint.patch (40K)
- v13-0002-Enhance-nbtree-index-tuple-deletion.patch (196K)

Re: Deleting older versions in unique indexes to avoid page splits

Victor Yegorov
Mon, 11 Jan 2021 at 01:07, Peter Geoghegan <[hidden email]>:

Attached is v13, which has this tweak, and other miscellaneous cleanup
based on review from both Victor and Heikki. I consider this version
of the patch to be committable. I intend to commit something close to
it in the next week, probably no later than Thursday. I still haven't
got to the bottom of the shellsort question raised by Heikki. I intend
to do further performance validation before committing the patch. I
will look into the shellsort thing again as part of this final
performance validation work -- perhaps I can get rid of the
specialized shellsort implementation entirely, simplifying the state
structs added to tableam.h. (As I said before, it seems best to
address this last of all to avoid making performance validation even
more complicated.)

I've checked this version quickly. It applies and compiles without issues.
`make check` and `make check-world` reported no issue.

But `make installcheck-world` failed on:

test explain                      ... FAILED       22 ms
test event_trigger                ... ok          178 ms
test fast_default                 ... ok          262 ms
test stats                        ... ok          586 ms

========================
 1 of 202 tests failed.
========================


(see attached diff). It doesn't look like the fault of this patch, though.

I suppose you plan to send another revision before committing this.
Therefore I didn't perform any tests here, will wait for the next version.


--
Victor Yegorov

Attachment:
- 20210111-v13-regression.diffs (7K)

Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
On Mon, Jan 11, 2021 at 12:19 PM Victor Yegorov <[hidden email]> wrote:
> (see attached diff). It doesn't look like the fault of this patch, though.
>
> I suppose you plan to send another revision before committing this.
> Therefore I didn't perform any tests here, will wait for the next version.

I imagine that this happened because you have track_io_timing=on in
postgresql.conf. Doesn't the same failure take place with the current
master branch?

I have my own way of running the locally installed Postgres when I
want "make installcheck" to pass that specifically accounts for this
(and a few other similar things).

--
Peter Geoghegan



Re: Deleting older versions in unique indexes to avoid page splits

Victor Yegorov
Mon, 11 Jan 2021 at 22:10, Peter Geoghegan <[hidden email]>:
I imagine that this happened because you have track_io_timing=on in
postgresql.conf. Doesn't the same failure take place with the current
master branch?

I have my own way of running the locally installed Postgres when I
want "make installcheck" to pass that specifically accounts for this
(and a few other similar things).

Oh, right, haven't thought of this. Thanks for pointing that out.

Now everything looks good!


--
Victor Yegorov

Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
In reply to this post by Peter Geoghegan-4
On Sun, Jan 10, 2021 at 4:06 PM Peter Geoghegan <[hidden email]> wrote:
> Attached is v13, which has this tweak, and other miscellaneous cleanup
> based on review from both Victor and Heikki. I consider this version
> of the patch to be committable. I intend to commit something close to
> it in the next week, probably no later than Thursday. I still haven't
> got to the bottom of the shellsort question raised by Heikki. I intend
> to do further performance validation before committing the patch.

I benchmarked the patch with one array and without the shellsort
specialization using two patches on top of v13, both of which are
attached.

This benchmark was similar to other low cardinality index benchmarks
I've run in the past (with indexes named fiver, tenner, score, plus a
pgbench_accounts INCLUDE unique index instead of the regular primary
key). I used pgbench scale 500, for 30 minutes, no rate limit. One run
with 16 clients, another with 32 clients.

Original v13:

patch.r1c32.bench.out: "tps = 32709.772257 (including connections
establishing)" "latency average = 0.974 ms" "latency stddev = 0.514
ms"
patch.r1c16.bench.out: "tps = 34670.929998 (including connections
establishing)" "latency average = 0.461 ms" "latency stddev = 0.314
ms"

v13 + attached simplifying patches:

patch.r1c32.bench.out: "tps = 31848.632770 (including connections
establishing)" "latency average = 1.000 ms" "latency stddev = 0.548
ms"
patch.r1c16.bench.out: "tps = 33864.099333 (including connections
establishing)" "latency average = 0.472 ms" "latency stddev = 0.399
ms"

Clearly the optimization work still has some value, since we're
looking at about a 2% - 3% increase in throughput here. This seems
like it makes the cost of retaining the optimizations acceptable.

The difference is much less visible with a rate-limit, which is rather
more realistic. To some extent the sort is hot here because the
patched version of Postgres updates tuples as fast as it can, and must
therefore delete from the index as fast as it can. The sort itself was
consistently near the top consumer of CPU cycles according to "perf
top", even if it didn't get as bad as what I saw in earlier versions
of the patch months ago.

There are actually two sorts involved here (not just the heapam.c
shellsort). We need to sort the deltids array twice -- once in
heapam.c, and a second time (to restore the original leaf-page-wise
order) in nbtpage.c, using qsort(). I'm pretty sure that the latter
sort also matters, though it matters less than the heapam.c shellsort.
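For illustration, the second sort can be thought of as a qsort() over entries that remember their leaf-page offset. This is a simplified stand-in for the patch's TM_IndexDelete/TM_IndexStatus pairing, not its actual code:

    #include <stdint.h>
    #include <stdlib.h>

    typedef uint16_t OffsetNumber;

    typedef struct DeleteEntry
    {
        OffsetNumber idxoffnum;     /* tuple's offset on the leaf page */
        /* ... TID and status fields elided ... */
    } DeleteEntry;

    /* Restore leaf-page-wise order after heapam.c worked in TID order */
    static int
    offset_cmp(const void *a, const void *b)
    {
        const DeleteEntry *ea = (const DeleteEntry *) a;
        const DeleteEntry *eb = (const DeleteEntry *) b;

        if (ea->idxoffnum != eb->idxoffnum)
            return (ea->idxoffnum < eb->idxoffnum) ? -1 : 1;
        return 0;
    }

    /* usage: qsort(entries, nentries, sizeof(DeleteEntry), offset_cmp); */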

I'm going to proceed with committing the original version of the patch
-- I feel that this settles it. The code would be a bit tidier without
two arrays or the shellsort, but it really doesn't make that much
difference. Whereas the benefit is quite visible, and will be
something that all varieties of index tuple deletion see a performance
benefit from (not just bottom-up deletion).

--
Peter Geoghegan

Attachments:
- 0006-Experiment-One-array-again.patch (20K)
- 0007-Experiment-Remove-shellsort-just-use-qsort.patch (3K)

Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
On Mon, Jan 11, 2021 at 9:26 PM Peter Geoghegan <[hidden email]> wrote:
> I'm going to proceed with committing the original version of the patch
> -- I feel that this settles it.

Pushed both patches from the patch series just now.

Thanks for the code reviews and benchmarking work!

--
Peter Geoghegan



Re: Deleting older versions in unique indexes to avoid page splits

akapila
On Wed, Jan 13, 2021 at 11:16 PM Peter Geoghegan <[hidden email]> wrote:
>
> On Mon, Jan 11, 2021 at 9:26 PM Peter Geoghegan <[hidden email]> wrote:
> > I'm going to proceed with committing the original version of the patch
> > -- I feel that this settles it.
>
> Pushed both patches from the patch series just now.
>

Nice work! I briefly read the commits and have a few questions.

Do we do this optimization (bottom-up deletion) only when the last
item which can lead to a page split has indexUnchanged set to true? If
so, what if the last tuple doesn't have indexUnchanged but the
previous ones do?

Why are we using the terminology "bottom-up deletion" and why not simply
"duplicate version deletion" or something along those lines?

There is the following comment in the code:
'Index AM/tableam coordination is central to the design of bottom-up
index deletion.  The index AM provides hints about where to look to
the tableam by marking some entries as "promising".  Index AM does
this with duplicate index tuples that are strongly suspected to be old
versions left behind by UPDATEs that did not logically modify indexed
values.'

How do we distinguish between version duplicate tuples (added because
some other index column is updated) and duplicate tuples (added
because a user has inserted a row with a duplicate value) for the
purpose of bottom-up deletion?  I think we need to distinguish between
them because in the former case we can remove the old version of the
tuple in the index but not in the latter. Or is it the case that the
indexAM doesn't differentiate between them but relies on the heapAM to
give the correct answer?

--
With Regards,
Amit Kapila.



Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
On Sat, Jan 16, 2021 at 9:55 PM Amit Kapila <[hidden email]> wrote:
> Do we do this optimization (bottom-up deletion) only when the last
> item which can lead to a page split has indexUnchanged set to true? If
> so, what if the last tuple doesn't have indexUnchanged but the
> previous ones do?

Using the indexUnchanged hint as a triggering condition makes sure
that non-HOT updaters are the ones that pay the cost of a bottom-up
deletion pass. We create backpressure for non-HOT updaters (in indexes
that are logically unchanged) specifically. (Of course it's also true
that the presence of the indexUnchanged hint is highly suggestive of
there being more version churn duplicates on the leaf page already,
which is not actually certain.)
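For reference, the hint being discussed is the boolean that aminsert implementations receive as of the committed patch. A self-contained paraphrase of the nbtree declaration follows (the typedef shims stand in for PostgreSQL's own headers):

    #include <stdbool.h>

    /* Shim typedefs so the fragment stands alone; in PostgreSQL these
     * come from the tree's own headers. */
    typedef struct RelationData *Relation;
    typedef struct ItemPointerData *ItemPointer;
    typedef unsigned long Datum;
    typedef struct IndexInfo IndexInfo;
    typedef enum IndexUniqueCheck
    {
        UNIQUE_CHECK_NO, UNIQUE_CHECK_YES,
        UNIQUE_CHECK_PARTIAL, UNIQUE_CHECK_EXISTING
    } IndexUniqueCheck;

    /* Paraphrased from PostgreSQL 14's nbtree declarations: */
    extern bool btinsert(Relation rel, Datum *values, bool *isnull,
                         ItemPointer ht_ctid, Relation heapRel,
                         IndexUniqueCheck checkUnique,
                         bool indexUnchanged,   /* the executor's hint */
                         IndexInfo *indexInfo);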

It's possible that there will be some mixture of inserts and non-HOT
updates on the same leaf page, and as you say the implementation might
fail to do a bottom-up pass based on the fact that an incoming item at
the point of a would-be page split was a plain insert (and so didn't
receive the hint). The possibility of these kinds of "missed
opportunities" are okay because a page split was inevitable either
way. You can imagine a case where that isn't true (i.e. a missed
opportunity to avoid a page split), but that's kind of like imagining
a fair coin flip taking place 100 times and coming up heads each time.
It just isn't realistic for such a "mixed tendencies" leaf page to
stay in flux (i.e. not ever split) forever, with the smartest
triggering condition in the world -- it's too unstable.

An important part of understanding the design is to imagine things at
the leaf page level, while thinking about how what that actually looks
like differs from an ideal physical representation of the same leaf
page that is much closer to the logical contents of the database.
We're only interested in leaf pages where the number of logical rows
is basically fixed over time (when there is version churn). Usually
this just means that there are lots of non-HOT updates, but it can
also work with lots of queue-like inserts and deletes in a unique
index.

Ultimately, the thing that determines whether or not the bottom-up
deletion optimization is effective for any given leaf page is the fact
that it actually prevents the page from splitting despite lots of
version churn -- this happens again and again. Another key concept
here is that bottom-up deletion passes for the same leaf page are
related in important ways -- it is often a mistake to think of
individual bottom-up passes as independent events.

> Why are we using the terminology "bottom-up deletion" and why not simply
> "duplicate version deletion" or something along those lines?

Why is that simpler? Also, it doesn't exactly delete duplicates. See
my later remarks.

> How do we distinguish between version duplicate tuples (added because
> some other index column is updated) and duplicate tuples (added
> because a user has inserted a row with a duplicate value) for the
> purpose of bottom-up deletion?  I think we need to distinguish between
> them because in the former case we can remove the old version of the
> tuple in the index but not in the latter. Or is it the case that the
> indexAM doesn't differentiate between them but relies on the heapAM to
> give the correct answer?

Bottom-up deletion uses the presence of duplicates as a starting point
for determining which heap pages to visit, based on the assumption
that at least some are obsolete versions. But heapam.c has additional
heap-level heuristics that help too.

It is quite possible and even likely that we will delete some
non-duplicate tuples in passing, just because they're checked in
passing -- they might turn out to be deletable, for whatever reason.
We're also concerned (on the heapam.c side) about which heap pages
have the most TIDs (any kind of TID, not just one marked promising in
index AM), so this kind of "serendipity" is quite common in practice.
Often the total number of heap pages that are pointed to by all index
tuples on the page just isn't that many (8 - 10). And often cases with
lots of HOT pruning can have hundreds of LP_DEAD item pointers on a
heap page, which we'll tend to get to before too long anyway (with or
without many duplicates).

The nbtree/caller side makes inferences about what is likely to be
true about the "logical contents" of the physical leaf page, as a
starting point for the heuristic-driven search for deletable index
tuples. There are various ways in which each inference (considered
individually) might be wrong, including the one that you pointed out:
inserts of duplicates will look like update version churn. But that
particular issue won't matter if the inserted duplicates are on
mostly-distinct heap blocks (which is likely), because then they'll
only make a noise level contribution to the behavior in heapam.c.
Also, we can fall back on deduplication when bottom-up deletion fails,
which will merge together the duplicates-from-insert, so now any
future bottom-up deletion pass over the same leaf page won't have the
same problem.

Bear in mind that heapam.c is only looking for a select few heap
blocks, and doesn't even need to get the order exactly right. We're
only concerned about extremes, which are actually what we see in cases
that we're interested in helping. We only have to be very
approximately right, or right on average. Sure, there might be some
tiny cost in marginal cases, but that's not something that I've been
able to quantify in any practical way. Because once we fail we fail
for good -- the page splits and the question of doing a bottom-up
deletion pass for that same leaf page ends.

Another important concept is the cost asymmetry -- the asymmetry here
is huge. Leaf page splits driven by version churn are very expensive
in the short term and in the long term. Our previous behavior was to
assume that they were necessary. Now we're initially assuming that
they're unnecessary, and requiring non-HOT updaters to do some work
that shows (with some margin of error) that such a split is in fact
necessary. This is a form of backpressure.

Bottom-up deletion doesn't intervene unless and until that happens. It
is only interested in pages where version churn is concentrated -- it
is absolutely fine to leave it up to VACUUM to get to any "floating
garbage" tuples later. This is a pathological condition, and as such
isn't hard to spot, regardless of workload conditions.

If you or anyone else can think of a gap in my reasoning, or a
workload in which the heuristics either fail to prevent page splits
where that might be expected or impose too high a cost, do let me
know. I admit that my design is unorthodox, but the results speak for
themselves.

--
Peter Geoghegan



Re: Deleting older versions in unique indexes to avoid page splits

akapila
On Mon, Jan 18, 2021 at 12:43 AM Peter Geoghegan <[hidden email]> wrote:

>
> On Sat, Jan 16, 2021 at 9:55 PM Amit Kapila <[hidden email]> wrote:
> > Do we do this optimization (bottom-up deletion) only when the last
> > item which can lead to a page split has indexUnchanged set to true? If
> > so, what if the last tuple doesn't have indexUnchanged but the
> > previous ones do?
>
> Using the indexUnchanged hint as a triggering condition makes sure
> that non-HOT updaters are the ones that pay the cost of a bottom-up
> deletion pass. We create backpressure for non-HOT updaters (in indexes
> that are logically unchanged) specifically. (Of course it's also true
> that the presence of the indexUnchanged hint is highly suggestive of
> there being more version churn duplicates on the leaf page already,
> which is not actually certain.)
>
> It's possible that there will be some mixture of inserts and non-HOT
> updates on the same leaf page, and as you say the implementation might
> fail to do a bottom-up pass based on the fact that an incoming item at
> the point of a would-be page split was a plain insert (and so didn't
> receive the hint).
>

Or, conversely, it would do the scan when the incoming item happens to
be the only one on this leaf page to have received such a hint.

> The possibility of these kinds of "missed
> opportunities" are okay because a page split was inevitable either
> way. You can imagine a case where that isn't true (i.e. a missed
> opportunity to avoid a page split), but that's kind of like imagining
> a fair coin flip taking place 100 times and coming up heads each time.
> It just isn't realistic for such a "mixed tendencies" leaf page to
> stay in flux (i.e. not ever split) forever, with the smartest
> triggering condition in the world -- it's too unstable.
>

But even if we don't want to avoid it forever, delaying it will also
have advantages depending upon the workload. Let's try to see this with
an example: say the item and page sizes are such that it would require 12
items to fill the page completely. Consider Case-1, where we decide based
on the hint received with the last item, and Case-2, where we decide based
on whether we ever received such a hint for the leaf page.

Case-1:
========
12 new items (6 inserts, 6 non-HOT updates)
Page-1: 12 items - no split would be required because we received the
hint (indexUnchanged) with the last item, so page-1 will have 6 items
after clean up.

6 new items (3 inserts, 3 non-HOT updates)
Page-1: 12 items (9 inserts, 3 non-HOT updates) lead to split because
we received the last item without a hint.

The SPLIT-1 happens after 18 operations (6 after the initial 12).

After this split (SPLIT-1), we will have 2 pages with the below state:
Page-1: 6 items (4 inserts, 2 non-HOT updates)
Page-2: 6 items (5 inserts, 1 non-HOT updates)

6 new items (3 inserts, 3 non-HOT updates)
Page-1 got 3 new items 1 insert and 2 non-HOT updates; Page-1: 9 items
(5 inserts, 4 non-HOT updates)
Page-2 got 3 new items 2 inserts and 1 non-HOT update; Page-2: 9 items
(7 inserts, 2 non-HOT updates)

6 new items (3 inserts, 3 non-HOT updates)
Page-1 got 3 new items 1 insert and 2 non-HOT updates; Page-1: 12
items (6 inserts, 6 non-HOT updates) doesn't lead to split
Page-2 got 3 new items 2 inserts and 1 non-HOT update; Page-2: 9 items
(9 inserts, 3 non-HOT updates) lead to split (split-2)

The SPLIT-2 happens after 30 operations (12 new operations after the
previous split).

Case-2:
========
12 new items (6 inserts, 6 non-HOT updates)
Page-1: 12 items - no split would be required because we received the
hint (indexUnchanged) with at least one of the items, so page-1 will
have 6 items after clean up.

6 new items (3 inserts, 3 non-HOT updates)
Page-1: 12 items (9 inserts, 3 non-HOT updates), cleanup happens and
Page-1 will have 9 items.

6 new items (3 inserts, 3 non-HOT updates); at this stage, in whichever
order the new items are received, one split can't be avoided.

The SPLIT-1 happens after 24 new operations (12 new ops after initial 12).
Page-1: 6 items (6 inserts)
Page-2: 6 items (6 inserts)

6 new items (3 inserts, 3 non-HOT updates)
Page-1 got 3 new items 1 insert and 2 non-HOT updates; Page-1: 9 items
(7 inserts, 2 non-HOT updates)
Page-2 got 3 new items 2 inserts and 1 non-HOT update; Page-2: 9 items
(8 inserts, 1 non-HOT updates)

6 new items (3 inserts, 3 non-HOT updates)
Page-1 got 3 new items 1 insert and 2 non-HOT updates; Page-1: 12
items (8 inserts, 4 non-HOT updates) clean up happens and page-1 will
have 8 items.
Page-2 got 3 new items 2 inserts and 1 non-HOT update; Page-2: 9 items
(10 inserts, 2 non-HOT updates) clean up happens and page-2 will have
10 items.

6 new items (3 inserts, 3 non-HOT updates)
Page-1 got 3 new items, 1 insert and 2 non-HOT updates; Page-1: 11
items (9 inserts, 2 non-HOT updates)
Page-2 got 3 new items, 2 inserts and 1 non-HOT update; Page-2: 12
items (12 inserts, 0 non-HOT updates) cleanup happens for one of the
non-HOT updates

6 new items (3 inserts, 3 non-HOT updates)
Page-1 got 3 new items, 1 insert, and 2 non-HOT updates; Page-1: 12
items (12 inserts, 0 non-HOT updates) cleanup happens for one of the
non-HOT updates
Page-2 got 3 new items, 2 inserts, and 1 non-HOT update; Page-2: split happens

After split
Page-1: 12 items (12 inserts)
Page-2: 6 items (6 inserts)
Page-3: 6 items (6 inserts)

The SPLIT-2 happens after 48 operations (24 new operations)

The summary of the above is that with Case-1 (clean-up based on hint
received with the last item on the page) it takes fewer operations to
cause a page split as compared to Case-2 (clean-up based on hint
received with at least of the items on the page) for a mixed workload.
How can we say that it doesn't matter?

> An important part of understanding the design is to imagine things at
> the leaf page level, while thinking about how what that actually looks
> like differs from an ideal physical representation of the same leaf
> page that is much closer to the logical contents of the database.
> We're only interested in leaf pages where the number of logical rows
> is basically fixed over time (when there is version churn).
>

With the above example, it seems like it would also help when this is not true.

> Usually
> this just means that there are lots of non-HOT updates, but it can
> also work with lots of queue-like inserts and deletes in a unique
> index.
>
> Ultimately, the thing that determines whether or not the bottom-up
> deletion optimization is effective for any given leaf page is the fact
> that it actually prevents the page from splitting despite lots of
> version churn -- this happens again and again. Another key concept
> here is that bottom-up deletion passes for the same leaf page are
> related in important ways -- it is often a mistake to think of
> individual bottom-up passes as independent events.
>
> > Why are we using the terminology "bottom-up deletion" and why not simply
> > "duplicate version deletion" or something along those lines?
>
> Why is that simpler?
>

I am not sure what I proposed fits here, but "bottom-up" sounds like
we are starting from the leaf level and moving upwards to the root level,
which I think is not true here.

> Also, it doesn't exactly delete duplicates. See
> my later remarks.
>
> > How do we distinguish between version duplicate tuples (added because
> > some other index column is updated) and duplicate tuples (added
> > because a user has inserted a row with a duplicate value) for the
> > purpose of bottom-up deletion?  I think we need to distinguish between
> > them because in the former case we can remove the old version of the
> > tuple in the index but not in the latter. Or is it the case that the
> > indexAM doesn't differentiate between them but relies on the heapAM to
> > give the correct answer?
>
> Bottom-up deletion uses the presence of duplicates as a starting point
> for determining which heap pages to visit, based on the assumption
> that at least some are obsolete versions. But heapam.c has additional
> heap-level heuristics that help too.
>
> It is quite possible and even likely that we will delete some
> non-duplicate tuples in passing, just because they're checked in
> passing -- they might turn out to be deletable, for whatever reason.
>

How does that work? Can heapam.c somehow indicate additional
tuples (beyond what the indexAM has marked/sent as promising) as
deletable? I see that in heap_index_delete_tuples() we mark the status
of the passed tuples as deletable (by setting the knowndeletable flag for
them).

--
With Regards,
Amit Kapila.



Re: Deleting older versions in unique indexes to avoid page splits

Victor Yegorov
Mon, 18 Jan 2021 at 07:44, Amit Kapila <[hidden email]>:
The summary of the above is that with Case-1 (clean-up based on hint
received with the last item on the page) it takes fewer operations to
cause a page split as compared to Case-2 (clean-up based on hint
received with at least of the items on the page) for a mixed workload.
How can we say that it doesn't matter?

I cannot understand this, perhaps there's a word missing in the brackets?..


Thinking of your proposal to undertake bottom-up deletion also on the last-to-fit tuple insertion,
I'd like to start with my understanding of the design:

* we try to avoid index bloat, but we do it in the “lazy” way (for a reason, see below)

* it means, that if there is enough space on the leaf page to fit new index tuple,
  we just fit it there

* if there's not enough space, we first look at the presence of LP_DEAD tuples,
  and if they do exist, we scan the whole (index) page to re-check all tuples in order
  to find other, not-yet-marked-but-actually-dead tuples and clean all those up.

* if there is still not enough space, only now do we try bottom-up deletion. This is a heavy operation and
  can cause extra IO (tests do show this), therefore this operation is undertaken at the point
  when we can justify extra IO against a leaf page split.

* if no space also after bottom-up-deletion, we perform deduplication (if possible)

* finally, we split the page.

Should we try a bottom-up deletion pass in a situation where we're inserting the last possible tuple
into the page (enough space for the *current* insert, but likely no space for the next),
then (among others) the following possibilities exist:

- no new tuples ever come to this page anymore (happy accident), which means that
  we've wasted IO cycles

- we undertake a bottom-up deletion pass without success and we're asked to insert a new tuple into this
  fully packed index page. This can unfold as:
  - due to the delay, we've managed to find some space either due to LP_DEAD or bottom-up-cleanup.
    which means we've wasted IO cycles on the previous iteration (too early attempt).
  - we still failed to find any space and are forced to split the page.
    in this case we've wasted IO cycles twice.

In my view, these cases where we generate wasted IO cycles (producing no benefit) should be avoided.
And this is the main reason for the current approach.

Again, this is my understanding and I hope I got it right.


--
Victor Yegorov

Re: Deleting older versions in unique indexes to avoid page splits

akapila
On Mon, Jan 18, 2021 at 5:11 PM Victor Yegorov <[hidden email]> wrote:

>
> пн, 18 янв. 2021 г. в 07:44, Amit Kapila <[hidden email]>:
>>
>> The summary of the above is that with Case-1 (clean-up based on hint
>> received with the last item on the page) it takes fewer operations to
>> cause a page split as compared to Case-2 (clean-up based on hint
>> received with at least of the items on the page) for a mixed workload.
>> How can we say that it doesn't matter?
>
>
> I cannot understand this, perhaps there's a word missing in the brackets?..
>

There is a missing word (one) in Case-2, let me write it again to
avoid any confusion. Case-2 (clean-up based on hint received with at
least one of the items on the page).

>
> Thinking of your proposal to undertake bottom-up deletion also on the last-to-fit tuple insertion,
>

I think there is some misunderstanding between what I am trying to say and
the conclusion you have drawn from it. See below.

> I'd like to start with my understanding of the design:
>
> * we try to avoid index bloat, but we do it in the “lazy” way (for a reason, see below)
>
> * it means, that if there is enough space on the leaf page to fit new index tuple,
>   we just fit it there
>
> * if there's not enough space, we first look at the presence of LP_DEAD tuples,
>   and if they do exits, we scan the whole (index) page to re-check all tuples in order
>   to find others, not-yet-marked-but-actually-being-LP_DEAD tuples and clean all those up.
>
> * if there is still not enough space, only now do we try bottom-up deletion. This is a heavy operation and
>   can cause extra IO (tests do show this), therefore this operation is undertaken at the point
>   when we can justify extra IO against a leaf page split.
>
> * if no space also after bottom-up-deletion, we perform deduplication (if possible)
>
> * finally, we split the page.
>
> Should we try a bottom-up deletion pass in a situation where we're inserting the last possible tuple
> into the page (enough space for the *current* insert, but likely no space for the next),
>

I am saying that we should try bottom-up deletion when the new insert item
doesn't find enough space on the page and at least one indexUnchanged
tuple was previously inserted into that page. Of course, as now, it would
be attempted after an attempt to remove LP_DEAD items. (A sketch of the
two trigger conditions follows.)
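
Here is a minimal, compilable sketch of the difference, assuming a
hypothetical page-level bit (unchanged_seen) that is set whenever any
indexUnchanged insert touches the page -- all names here are invented:

#include <stdbool.h>
#include <stdio.h>

typedef struct
{
    bool unchanged_seen;    /* hypothetical page flag (the proposal) */
} SketchPageOpaque;

/* Current approach: only the incoming insert's own hint matters */
static bool
try_bottomup_current(bool indexUnchanged)
{
    return indexUnchanged;
}

/* Proposed approach: any earlier hinted insert on the page suffices */
static bool
try_bottomup_proposed(bool indexUnchanged, const SketchPageOpaque *opaque)
{
    return indexUnchanged || opaque->unchanged_seen;
}

int
main(void)
{
    SketchPageOpaque opaque = { .unchanged_seen = true };

    /* A plain insert arrives at a full page, carrying no hint of its own */
    printf("current: %d, proposed: %d\n",
           try_bottomup_current(false),
           try_bottomup_proposed(false, &opaque));      /* prints 0 vs. 1 */
    return 0;
}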

> then (among others) the following possibilities exist:
>
> - no new tuple ever comes to this page again (happy accident), which means that
>   we've wasted IO cycles
>
> - we undertake a bottom-up deletion pass without success and are later asked to insert a new
>   tuple into this fully packed index page. This can unfold as:
>   - due to the delay, we've managed to find some space, either due to LP_DEAD items or
>     bottom-up cleanup, which means we wasted IO cycles on the previous iteration (a too-early attempt).
>   - we still fail to find any space and are forced to split the page;
>     in this case we've wasted IO cycles twice.
>
> In my view, these cases where we waste IO cycles (producing no benefit) should be avoided.
>

I don't think any of these can happen in what I am actually saying. Do
you still have the same feeling after reading this email? Off-hand, I
don't see any downsides as compared to the current approach, and it
will lead to fewer splits in some other workloads, such as when there
is a mix of inserts and non-HOT updates (ones that don't logically
change the index).

--
With Regards,
Amit Kapila.



Re: Deleting older versions in unique indexes to avoid page splits

Victor Yegorov
On Mon, 18 Jan 2021 at 13:42, Amit Kapila <[hidden email]>:
I don't think any of these can happen in what I am actually saying. Do
you still have the same feeling after reading this email? Off-hand, I
don't see any downsides as compared to the current approach, and it
will lead to fewer splits in some other workloads, such as when there
is a mix of inserts and non-HOT updates (ones that don't logically
change the index).

If I understand you correctly, you suggest tracking _all_ the hints that came
from the executor for possible non-HOT logical duplicates somewhere on
the page. Then, when we hit the no-space case, we'll consider not only the
last item hinted, but all items on the page, which makes it more probable
that the mechanism kicks in and does something.

Sounds interesting. And judging by Peter's tests, where the extra deletable
tuples found on a page were almost always more numerous than those actually
flagged LP_DEAD, this can have some positive effects.

Also, I'm not sure where to store it. We've deprecated the BTP_HAS_GARBAGE
flag; maybe its bit can be reused for this purpose? A sketch of that idea
follows.
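
Purely to make the suggestion concrete (this is not what the patch does;
the flag value and struct below are invented for illustration):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SKETCH_HAS_DUPS  (1 << 6)   /* reuse the deprecated bit's slot */

typedef struct
{
    uint16_t btpo_flags;    /* modeled on nbtree's page opaque flags */
} SketchOpaque;

/* Set whenever an insert arrives carrying the indexUnchanged hint */
static void
mark_unchanged_insert(SketchOpaque *o)
{
    o->btpo_flags |= SKETCH_HAS_DUPS;
}

/* Checked when the page runs out of space */
static bool
page_worth_bottomup(const SketchOpaque *o)
{
    return (o->btpo_flags & SKETCH_HAS_DUPS) != 0;
}

int
main(void)
{
    SketchOpaque o = { 0 };

    mark_unchanged_insert(&o);
    printf("%d\n", page_worth_bottomup(&o));    /* prints 1 */
    return 0;
}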


-- 
Victor Yegorov

Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
In reply to this post by akapila
On Sun, Jan 17, 2021 at 10:44 PM Amit Kapila <[hidden email]> wrote:
> With the above example, it seems like it would also help when this is not true.

I'll respond to your remarks here separately, in a later email.

> I am not sure that what I proposed fits here, but "bottom-up" sounds like
> we are starting from the leaf level and moving upwards to the root level,
> which I think is not true here.

I guess that that's understandable, because it is true that B-Trees
are maintained in a bottom-up fashion. However, it's also true that
you can have top-down and bottom-up approaches in query optimizers,
and many other things (it's even something that is used to describe
governance models). The whole point of the term "bottom-up" is to
suggest that bottom-up deletion complements top-down cleanup by
VACUUM. I think that this design embodies certain principles that can
be generalized to other areas, such as heap pruning.

I recall that Robert Haas hated the name deduplication. I'm not about
to argue that my choice of "deduplication" was objectively better than
whatever he would have preferred. Same thing applies here - I more or
less chose a novel name because the concept is itself novel (unlike
deduplication). Reasonable people can disagree about what exact name
might have been better, but it's not like we're going to arrive at
something that everybody can be happy with. And whatever we arrive at
probably won't matter that much - the vast majority of users will
never need to know what either thing is. They may be important things,
but that doesn't mean that many people will ever think about them (in
fact I specifically hope that they don't ever need to think about
them).

> How is that working? Can heapam.c somehow indicate additional
> tuples (extra to what has been marked/sent by the index AM as promising) as
> deletable? I see in heap_index_delete_tuples that we mark the status
> of the passed tuples as deletable (by setting the knowndeletable flag for
> them).

The bottom-up nbtree caller to
heap_index_delete_tuples()/table_index_delete_tuples() (not to be
confused with the simple/LP_DEAD heap_index_delete_tuples() caller)
always provides heapam.c with a complete picture of the index page, in
the sense that it exhaustively has a delstate.deltids entry for each
and every TID on the page, no matter what. This is the case even
though in practice there is usually no possible way to check even 20%
of the deltids entries within heapam.c.
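
A toy sketch of building such an exhaustive array (the struct layout
and names below are simplified stand-ins, not the real TM_IndexDelete
machinery):

#include <stdbool.h>
#include <stdint.h>

typedef struct { uint32_t blkno; uint16_t offnum; } SketchTid;

typedef struct
{
    SketchTid tid;
    bool      promising;    /* nbtree's duplicate hint -- never a promise */
} SketchDeltid;

/* Every TID on the leaf page gets an entry, hinted or not */
static int
build_deltids(const SketchTid *page_tids, const bool *dup_hint,
              int ntids, SketchDeltid *deltids)
{
    for (int i = 0; i < ntids; i++)
    {
        deltids[i].tid = page_tids[i];
        deltids[i].promising = dup_hint[i];
    }
    return ntids;       /* heapam.c decides which entries to actually visit */
}

int
main(void)
{
    SketchTid tids[2] = { {7, 1}, {7, 2} };
    bool hints[2] = { true, false };
    SketchDeltid deltids[2];

    return build_deltids(tids, hints, 2, deltids) == 2 ? 0 : 1;
}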

In general, the goal during a bottom-up pass is *not* to maximize
expected utility (i.e. the number of deleted index tuples/space
freed/whatever), exactly. The goal is to lower the variance across
related calls, so that we'll typically manage to free a fair number of
index tuples when we need to. In general it is much better for
heapam.c to make its decisions based on 2 or 3 good reasons rather
than just 1 excellent reason. And so heapam.c applies a power-of-two
bucketing scheme, never truly giving too much weight to what nbtree
tells it about duplicates. See the comments above
bottomup_nblocksfavorable() and bottomup_sort_and_shrink() (both in
heapam.c). A toy illustration of the bucketing idea follows.
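
This is a from-scratch illustration of power-of-two bucketing, not the
real code: heap blocks whose duplicate counts fall into the same
power-of-two bucket compare as equal, so no single count carries too
much weight.

#include <stdio.h>

/* Bucket by the position of the highest set bit in the TID count */
static int
pow2_bucket(int ntids)
{
    int bucket = 0;

    while (ntids > 1)
    {
        ntids >>= 1;
        bucket++;
    }
    return bucket;      /* 1 -> 0, 2..3 -> 1, 4..7 -> 2, 8..15 -> 3 */
}

int
main(void)
{
    /* 5 and 7 TIDs land in the same bucket: "roughly as promising" */
    printf("%d %d %d\n",
           pow2_bucket(5), pow2_bucket(7), pow2_bucket(17));
    return 0;           /* prints "2 2 4" */
}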

--
Peter Geoghegan



Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
In reply to this post by Victor Yegorov
On Mon, Jan 18, 2021 at 6:11 AM Victor Yegorov <[hidden email]> wrote:
> If I understand you correctly, you suggest tracking _all_ the hints that came
> from the executor for possible non-HOT logical duplicates somewhere on
> the page. Then, when we hit the no-space case, we'll consider not only the
> last item hinted, but all items on the page, which makes it more probable
> that the mechanism kicks in and does something.

> Also, I'm not sure where to store it. We've deprecated the BTP_HAS_GARBAGE
> flag; maybe its bit can be reused for this purpose?

There actually was a similar flag (named BTP_UNCHANGED_UPDATE and
later BTP_HAS_DUPS) that appeared in earlier versions of the patch
(several versions in total, up to and including v6). This was never
discussed on the list because I assumed that that wouldn't be helpful
(I was already writing too many overlong e-mails). It was unsettled in
my mind at the time, so it didn't make sense to start discussing it. I
changed my mind at some point, and so it never came up until now, when
Amit raised the question.

Looking back on my personal notes, I am reminded that I debated this
exact question with myself at length. The argument for keeping the
nbtree flag (i.e. what Amit is arguing for now) involved an isolated
example that seems very similar to the much more recent example from
Amit (that he arrived at independently). I am at least sympathetic to
this view of things, even now. Let me go into why I changed my mind
after a couple of months of on and off deliberation.

In general, the highly speculative nature of the extra work that
heapam.c does for index deleters in the bottom-up caller case can only
be justified because the cost is paid by non-HOT updaters that are
definitely about to split the page just to fit another version, and
because we only risk wasting one heap page access at any given point
of the entire process (the latter point about risk/cost is not 100%
true, because you have additional fixed CPU costs and stuff like that,
but it's at least true in spirit). We can justify "gambling" like this
only because the game is rigged in our favor to an *absurd* degree:
there are many ways to win big (relative to the likely version churn
page split baseline case), and we only have to put down relatively
small "bets" at any given point -- heapam.c will give up everything
when it encounters one whole heap page that lacks a single deletable
entry, no matter what the reason is.
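
A toy version of that give-up rule (canned numbers and invented names;
the real accounting in heapam.c is more involved):

#include <stdio.h>

/* Pretend heap-block visits, in order of expected profitability */
static int
check_heap_block(int blkno)
{
    static const int deletable[] = {4, 2, 0, 9};   /* canned results */

    return deletable[blkno];
}

int
main(void)
{
    int freed = 0;

    for (int blkno = 0; blkno < 4; blkno++)
    {
        int ndeleted = check_heap_block(blkno);

        if (ndeleted == 0)
            break;      /* one losing bet ends the entire pass */
        freed += ndeleted;
    }
    printf("deleted %d index entries before giving up\n", freed);
    return 0;           /* block 3's 9 deletable TIDs go unexamined */
}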

The algorithm *appears* to behave very intelligently when seen from
afar, but is in fact stupid and opportunistic when you look at it up
close. It's possible to be so permissive about the upside benefit by
also being very conservative about the downside cost. Almost all of
our individual inferences can be wrong, and yet we still win in the
end. And the effect is robust over time. You could say that it is an
organic approach: it adapts to the workload, rather than trying to
make the workload adapt to it. This is less radical than you'd think
-- in some sense this is how B-Trees have always worked.

In the end, I couldn't justify imposing this cost on anything other
than a non-HOT updater, which is what the flag proposal would require
me to do -- then it's not 100% clear that the relative cost of each
"bet" placed in heapam.c really is extremely low (we can no longer say
for sure that we have very little to lose, given the certain
alternative that is a version churn page split). The fact is that
"version chains in indexes" still cannot get very long. Plus there are
other subtle ways in which it's unlikely to be a problem (e.g. the
LP_DEAD deletion stuff also got much better, deduplication can combine
with bottom-up deletion so that we'll trigger a useful bottom-up
deletion pass sooner or later without much downside, and possibly
other things I haven't even thought of).

It's possible that I have been too conservative. I admit that my
decision on this point is at least a little arbitrary, but I stand by
it for now. I cannot feel too bad about theoretically leaving some
gains on the table, given that we're only talking about deleting a
group of related versions a little later than we would otherwise, and
only in some circumstances, and in a way that seems likely to be
imperceptible to users. I reserve the right to change my mind about
it, but for now it doesn't look like an absurdly good deal, and those
are the kind of deals that it makes sense to focus on here. I am very
happy about the fact that it is relatively easy to understand why the
worst case for bottom-up index deletion cannot be that bad.

--
Peter Geoghegan

