Deleting older versions in unique indexes to avoid page splits


Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
On Tue, Nov 17, 2020 at 7:05 AM Victor Yegorov <[hidden email]> wrote:
> I've looked over the BTP_HAS_GARBAGE modifications, they look sane.
> I've double checked that heapkeyspace indexes don't use this flag (don't rely on it),
> while pre-v4 ones still use it.

Cool.

> I have a question. This flag is raised in the _bt_check_unique() and in _bt_killitems().
> If we're deprecating this flag, perhaps it'd be good to avoid raising it, at least in
> _bt_check_unique(), as it seems to me that the conditions there deal with postings, and therefore
> we are speaking of heapkeyspace indexes here.

Well, we still want to mark the page as having LP_DEAD bits set in all
cases, just as before. The difference is that heapkeyspace indexes
won't rely on the page-level flag later on.

> If we'll conditionally raise this flag in the functions above, we can get rid of blocks that drop it
> in _bt_delitems_delete(), I think.

I prefer to continue to maintain the flag in the same way, regardless
of which B-Tree version is in use (i.e. if it's heapkeyspace or not).
Maintaining the flag is not expensive, may have some small value for
forensic or debugging purposes, and saves callers the trouble of
telling  _bt_delitems_delete() (and code like it) whether or not this
is a heapkeyspace index.

--
Peter Geoghegan



Re: Deleting older versions in unique indexes to avoid page splits

Victor Yegorov
On Tue, Nov 17, 2020 at 17:24, Peter Geoghegan <[hidden email]> wrote:
> I prefer to continue to maintain the flag in the same way, regardless
> of which B-Tree version is in use (i.e. if it's heapkeyspace or not).
> Maintaining the flag is not expensive, may have some small value for
> forensic or debugging purposes, and saves callers the trouble of
> telling _bt_delitems_delete() (and code like it) whether or not this
> is a heapkeyspace index.

OK. Can you explain what deprecation means here?
If this functionality is left as is, it is not really a deprecation, is it?

--
Victor Yegorov

Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
On Tue, Nov 17, 2020 at 9:19 AM Victor Yegorov <[hidden email]> wrote:
> OK. Can you explain what deprecation means here?
> If this functionality is left as is, it is not really a deprecation, is it?

It just means that we only keep it around for compatibility purposes.
We would like to remove it, but can't right now. If we ever stop
supporting version 3 indexes, then we can probably remove it entirely.
I would like to avoid special cases across B-Tree index versions.
Simply maintaining the page flag in the same way as we always have is
the simplest approach.

Pushed the BTP_HAS_GARBAGE patch just now. I'll post a rebased version
of the patch series later on today.

Thanks
--
Peter Geoghegan



Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
In reply to this post by Victor Yegorov
On Tue, Nov 17, 2020 at 7:24 AM Victor Yegorov <[hidden email]> wrote:
> I've looked through the code and it looks very good from my end:
> - plenty comments, good description of what's going on
> - I found no loose ends in terms of AM integration
> - magic constants replaced with defines
> Code looks good. Still, it'd be good if somebody with more experience could look into this patch.

Great, thank you.

> Question: why in the comments you're using double spaces after dots?
> Is this a convention of the project?

Not really. It's based on my habit of trying to be as consistent as
possible with existing code.

There seems to be a weak consensus among English speakers on this
question, which is: the two space convention is antiquated, and only
ever made sense in the era of mechanical typewriters. I don't really
care either way, and I doubt that any other committer pays much
attention to these things. You may have noticed that I use only one
space in my e-mails.

Actually, I probably shouldn't care about it myself. It's just what I
decided to do at some point. I find it useful to decide that this or
that practice is now a best practice, and then stick to it without
thinking about it very much (this frees up space in my head to think
about more important things). But this particular habit of mine around
spaces is definitely not something I'd insist on from other
contributors. It's just that: a habit.

> I am thinking of two more scenarios that require testing:
> - queue in the table, with a high rate of INSERTs+DELETEs and a long transaction.

I see your point. This is going to be hard to make work outside of
unique indexes, though. Unique indexes are already not dependent on
the executor hint -- they can just use the "uniquedup" hint. The code
for unique indexes is prepared to notice duplicates in
_bt_check_unique() in passing, and apply the optimization for that
reason.

Maybe there is some argument to forgetting about the hint entirely,
and always assuming that we should try to find tuples to delete at the
point that a page is about to be split. I think that that argument is
a lot harder to make, though. And it can be revisited in the future.
It would be nice to do better with INSERTs+DELETEs, but that's surely
not the big problem for us right now.

I realize that this unique indexes/_bt_check_unique() thing is not
even really a partial fix to the problem you describe. The indexes
that have real problems with such an INSERTs+DELETEs workload will
naturally not be unique indexes -- _bt_check_unique() already does a
fairly good job of controlling bloat without bottom-up deletion.

> - upgraded cluster with !heapkeyspace indexes.

I do have a patch that makes that easy to test, that I used for the
Postgres 13 deduplication work -- I can rebase it and post it if you
like. You will be able to apply the patch, and run the regression
tests with a !heapkeyspace index. This works with only one or two
tweaks to the tests (IIRC the amcheck tests need to be tweaked in one
place for this to work). I don't anticipate that !heapkeyspace indexes
will be a problem, because they won't use any of the new stuff anyway,
and because nothing about the on-disk format is changed by bottom-up
index deletion.

--
Peter Geoghegan



Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
In reply to this post by Victor Yegorov
On Tue, Nov 17, 2020 at 7:17 AM Victor Yegorov <[hidden email]> wrote:
> On Thu, Nov 12, 2020 at 23:00, Peter Geoghegan <[hidden email]> wrote:
>> Another thing that I'll probably add to v8: Prefetching. This is
>> probably necessary just so I can have parity with the existing
>> heapam.c function that the new code is based on,
>> heap_compute_xid_horizon_for_tuples(). That will probably help here,
>> too.
>
> I don't quite see this part. Do you mean top_block_groups_favorable() here?

I meant to add prefetching to the version of the patch that became v8,
but that didn't happen because I ran out of time. I wanted to get out
a version with the low cardinality fix, to see if that helped with the
regression you talked about last week. (Prefetching seems to make a
small difference when we're I/O bound, so it may not be that
important.)

Attached is v9 of the patch series. This actually has prefetching in
heapam.c. Prefetching is not just applied to favorable blocks, though
-- it's applied to all the blocks that we might visit, even though we
often won't really visit the last few blocks in line. This needs more
testing. The specific choices I made around prefetching were
definitely a bit arbitrary. To be honest, it was a bit of a
box-ticking thing (parity with similar code for its own sake). But
maybe I failed to consider particular scenarios in which prefetching
really is important.

My high level goal for v9 was to do cleanup of v8. There isn't very
much that you could call a new enhancement (just the prefetching
thing).

Other changes in v9 include:

* Much simpler approach to passing down an aminsert() hint from the
executor in v9-0002* patch.

Rather than exposing some HOT implementation details from
heap_update(), we use executor state that tracks updated columns. Now
all we have to do is tell ExecInsertIndexTuples() "this round of index
tuple inserts is for an UPDATE statement". It then figures out the
specific details (whether it passes the hint or not) on an index by
index basis. This interface feels much more natural to me.

This also made it easy to handle expression indexes sensibly. And, we
get support for the logical replication UPDATE caller to
ExecInsertIndexTuples(). It only has to say "this is for an UPDATE",
in the usual way, without any special effort (actually I need to test
logical replication, just to be sure, but I think that it works fine
in v9).

* New B-Tree sgml documentation in v9-0003* patch. I've added an
extensive user-facing description of the feature to the end of
"Chapter 64. B-Tree Indexes", near the existing discussion of
deduplication.

* New delete_items storage parameter. This makes it possible to
disable the optimization. Like deduplicate_items in Postgres 13, it is
not expected to be set to "off" very often.

I'm not yet 100% sure that a storage parameter is truly necessary -- I
might still change my mind and remove it later.

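To illustrate the aminsert() hint described in the first item above, here is
a rough sketch of what a per-index "logically unchanged" check could look
like. This is not the patch's code: the function name and the blanket refusal
for expression and partial indexes are assumptions made for the example.

#include "postgres.h"
#include "access/sysattr.h"
#include "catalog/index.h"
#include "nodes/bitmapset.h"
#include "nodes/execnodes.h"
#include "utils/rel.h"

/*
 * Sketch: given the set of columns assigned to by the UPDATE (tracked in
 * executor state), decide whether this particular index is "logically
 * unchanged", so that the aminsert() hint can be passed for it.
 */
static bool
index_probably_unchanged(Relation indexrel, Bitmapset *updatedCols)
{
    IndexInfo  *ii = BuildIndexInfo(indexrel);

    /*
     * Be conservative with expression and partial indexes: whether they are
     * logically affected can't be decided from column numbers alone.
     */
    if (ii->ii_Expressions != NIL || ii->ii_Predicate != NIL)
        return false;

    for (int i = 0; i < ii->ii_NumIndexKeyAttrs; i++)
    {
        AttrNumber  keycol = ii->ii_IndexAttrNumbers[i];

        /* updatedCols is offset by FirstLowInvalidHeapAttributeNumber */
        if (bms_is_member(keycol - FirstLowInvalidHeapAttributeNumber,
                          updatedCols))
            return false;       /* an indexed column was assigned to */
    }

    /* No key column was assigned to: pass the hint for this index */
    return true;
}

ExecInsertIndexTuples() could call something along these lines once per index
when told "this round of index tuple inserts is for an UPDATE", passing the
hint to aminsert() only where it returns true.
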
Thanks
--
Peter Geoghegan

v9-0001-Make-tableam-interface-support-bottom-up-deletion.patch (10K)
v9-0002-Pass-down-logically-unchanged-index-hint.patch (39K)
v9-0003-Teach-nbtree-to-use-bottom-up-index-deletion.patch (93K)
v9-0004-Teach-heapam-to-support-bottom-up-index-deletion.patch (34K)

Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
In reply to this post by Peter Geoghegan-4
On Tue, Nov 17, 2020 at 12:45 PM Peter Geoghegan <[hidden email]> wrote:
> > I am thinking of two more scenarios that require testing:
> > - queue in the table, with a high rate of INSERTs+DELETEs and a long transaction.
>
> I see your point. This is going to be hard to make work outside of
> unique indexes, though. Unique indexes are already not dependent on
> the executor hint -- they can just use the "uniquedup" hint. The code
> for unique indexes is prepared to notice duplicates in
> _bt_check_unique() in passing, and apply the optimization for that
> reason.

I thought about this some more. My first idea was to simply always try
out bottom-up deletion (i.e. behave as if the hint from the executor
always indicates that it's favorable). I couldn't really justify that
approach, though. It results in many bottom-up deletion passes that
end up wasting cycles (and unnecessarily accessing heap blocks).

Then I had a much better idea: Make the existing LP_DEAD stuff a
little more like bottom-up index deletion. We usually have to access
heap blocks that the index tuples point to today, in order to have a
latestRemovedXid cutoff (to generate recovery conflicts). It's worth
scanning the leaf page for index tuples with TIDs whose heap block
matches the index tuples that actually have their LP_DEAD bits set.
This only consumes a few more CPU cycles. We don't have to access any
more heap blocks to try these extra TIDs, so it seems like a very good
idea to try them out.

I ran the regression tests with an enhanced version of the patch, with
this LP_DEAD-deletion-with-extra-TIDs thing. It also had custom
instrumentation that showed exactly what happens in each case. We
manage to delete at least a small number of extra index tuples in
almost all cases -- so we get some benefit in practically all cases.
And in the majority of cases we can delete significantly more. It's
not uncommon to increase the number of index tuples deleted. It could
go from 1 - 10 or so without the enhancement to LP_DEAD deletion, to
50 - 250 with the LP_DEAD enhancement. Some individual LP_DEAD
deletion calls can free more than 50% of the space on the leaf page.

I believe that this is a lower risk way of doing better when there is
a high rate of INSERTs+DELETEs. Most of the regression test cases I
looked at were in the larger system catalog indexes, which often look
like that.

We don't have to be that lucky for a passing index scan to set at
least one or two LP_DEAD bits. If there is any kind of
physical/logical correlation, then we're bound to also end up deleting
some extra index tuples by the time that the page looks like it might
need to be split.

--
Peter Geoghegan



Re: Deleting older versions in unique indexes to avoid page splits

Victor Yegorov
On Wed, Nov 25, 2020 at 05:35, Peter Geoghegan <[hidden email]> wrote:
> Then I had a much better idea: Make the existing LP_DEAD stuff a
> little more like bottom-up index deletion. We usually have to access
> heap blocks that the index tuples point to today, in order to have a
> latestRemovedXid cutoff (to generate recovery conflicts). It's worth
> scanning the leaf page for index tuples with TIDs whose heap block
> matches the index tuples that actually have their LP_DEAD bits set.
> This only consumes a few more CPU cycles. We don't have to access any
> more heap blocks to try these extra TIDs, so it seems like a very good
> idea to try them out.

I don't seem to understand this.

Is it that we're scanning the leaf page for all LP_DEAD tuples that point to the same
heap block? Which heap block are we talking about here, the one that holds the
entry we're about to add (the one that triggered bottom-up deletion due to lack
of space, I mean)?

> I ran the regression tests with an enhanced version of the patch, with
> this LP_DEAD-deletion-with-extra-TIDs thing. It also had custom
> instrumentation that showed exactly what happens in each case. We
> manage to delete at least a small number of extra index tuples in
> almost all cases -- so we get some benefit in practically all cases.
> And in the majority of cases we can delete significantly more. It's
> not uncommon to increase the number of index tuples deleted. It could
> go from 1 - 10 or so without the enhancement to LP_DEAD deletion, to
> 50 - 250 with the LP_DEAD enhancement. Some individual LP_DEAD
> deletion calls can free more than 50% of the space on the leaf page.

I am missing a general perspective here.

Is it true that, despite the long (vacuum-preventing) transaction, we can re-use space,
because after the DELETE statement commits, index scans set LP_DEAD hints after
they check the state of the corresponding heap tuple?

If my thinking is correct for both cases (the nature of LP_DEAD hint bits and the mechanics of
the suggested optimization), then I consider this a very promising improvement!

I haven't done any testing so far since sending my last e-mail.
If you have a chance to send a new v10 version with the LP_DEAD-deletion-with-extra-TIDs thing,
I will do some tests (as planned).

--
Victor Yegorov

Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
On Wed, Nov 25, 2020 at 4:43 AM Victor Yegorov <[hidden email]> wrote:

>> Then I had a much better idea: Make the existing LP_DEAD stuff a
>> little more like bottom-up index deletion. We usually have to access
>> heap blocks that the index tuples point to today, in order to have a
>> latestRemovedXid cutoff (to generate recovery conflicts). It's worth
>> scanning the leaf page for index tuples with TIDs whose heap block
>> matches the index tuples that actually have their LP_DEAD bits set.
>> This only consumes a few more CPU cycles. We don't have to access any
>> more heap blocks to try these extra TIDs, so it seems like a very good
>> idea to try them out.
>
>
> I don't seem to understand this.
>
> Is it that we're scanning the leaf page for all LP_DEAD tuples that point to the same
> heap block? Which heap block are we talking about here, the one that holds the
> entry we're about to add (the one that triggered bottom-up deletion due to lack
> of space, I mean)?

No, the incoming tuple isn't significant.

As you know, bottom-up index deletion uses heuristics that are
concerned with duplicates on the page, and the "logically unchanged by
an UPDATE" hint that the executor passes to btinsert(). Bottom-up
deletion runs when all LP_DEAD bits have been cleared (either because
there never were any LP_DEAD bits set, or because they were set and
then deleted, which wasn't enough).

But before bottom-up deletion may run, traditional deletion of LP_DEAD
index tuples runs -- this is always our preferred strategy because
index tuples with their LP_DEAD bits set are already known to be
deletable. We can make this existing process (which has been around
since PostgreSQL 8.2) better by applying similar principles.

We have promising tuples for bottom-up deletion. Why not have
"promising heap blocks" for traditional LP_DEAD index tuple deletion?
Or if you prefer, we can consider index tuples that *don't* have their
LP_DEAD bits set already but happen to point to the *same heap block*
as other tuples that *do* have their LP_DEAD bits set promising. (The
tuples with their LP_DEAD bits set are not just promising -- they're
already a sure thing.)

This means that traditional LP_DEAD deletion is now slightly more
speculative in one way (it speculates about what is likely to be true
using heuristics). But it's much less speculative than bottom-up index
deletion. We are required to visit these heap blocks anyway, since a
call to _bt_delitems_delete() for LP_DEAD deletion must already call
table_compute_xid_horizon_for_tuples(), which has to access the blocks
to get a latestRemovedXid for the WAL record.

The only thing that we have to lose here is a few CPU cycles to find
extra TIDs to consider. We'll visit exactly the same number of heap
blocks as before. (Actually, _bt_delitems_delete() does not have to do
that in all cases, but it has to do it with a logged table
with wal_level >= replica, which is the vast majority of cases in
practice.)

This means that traditional LP_DEAD deletion reuses some of the
bottom-up index deletion infrastructure. So maybe nbtree never calls
table_compute_xid_horizon_for_tuples() now, since everything goes
through the new heapam stuff instead (which knows how to check extra
TIDs that might not be dead at all).

> I am missing a general perspective here.
>
> Is it true that, despite the long (vacuum-preventing) transaction, we can re-use space,
> because after the DELETE statement commits, index scans set LP_DEAD hints after
> they check the state of the corresponding heap tuple?

The enhancement to traditional LP_DEAD deletion that I just described
does not affect the current restrictions on setting LP_DEAD bits in
the presence of a long-running transaction, or anything like that.
That seems like an unrelated project. The value of this enhancement is
purely its ability to delete *extra* index tuples that could have had
their LP_DEAD bits set already (it was possible in principle), but
didn't. And only when they are nearby to index tuples that really do
have their LP_DEAD bits set.

> I haven't done any testing so far since sending my last e-mail.
> If you have a chance to send a new v10 version with the LP_DEAD-deletion-with-extra-TIDs thing,
> I will do some tests (as planned).

Thanks! I think that it will be next week. It's a relatively big change.

--
Peter Geoghegan



Re: Deleting older versions in unique indexes to avoid page splits

Victor Yegorov
On Wed, Nov 25, 2020 at 19:41, Peter Geoghegan <[hidden email]> wrote:
> We have promising tuples for bottom-up deletion. Why not have
> "promising heap blocks" for traditional LP_DEAD index tuple deletion?
> Or if you prefer, we can consider index tuples that *don't* have their
> LP_DEAD bits set already but happen to point to the *same heap block*
> as other tuples that *do* have their LP_DEAD bits set promising. (The
> tuples with their LP_DEAD bits set are not just promising -- they're
> already a sure thing.)

In _bt_delete_or_dedup_one_page() we start with a simple loop over the items on the page, and
if there are any LP_DEAD tuples, we kick off _bt_delitems_delete().

So if I understood you right, you plan to make this loop (or a similar one nearby)
track the TIDs of the LP_DEAD tuples and then (perhaps in a second loop over the page) check all other
currently-not-LP_DEAD tuples, marking as promising those heap pages that have at least 2 TIDs pointing at
them (one LP_DEAD and one not).

Later, should we need to kick off deduplication, we'll go visit the promising pages first.

Is my understanding correct?


>> I am missing a general perspective here.
>>
>> Is it true that, despite the long (vacuum-preventing) transaction, we can re-use space,
>> because after the DELETE statement commits, index scans set LP_DEAD hints after
>> they check the state of the corresponding heap tuple?

> The enhancement to traditional LP_DEAD deletion that I just described
> does not affect the current restrictions on setting LP_DEAD bits in
> the presence of a long-running transaction, or anything like that.
> That seems like an unrelated project. The value of this enhancement is
> purely its ability to delete *extra* index tuples that could have had
> their LP_DEAD bits set already (it was possible in principle), but
> didn't. And only when they are nearby to index tuples that really do
> have their LP_DEAD bits set.

I wasn't considering improvements here; that was a general question about how this works
(trying to clear up gaps in my understanding).

What I meant to ask is: will LP_DEAD bits be set by index scans in the presence of a long transaction?


--
Victor Yegorov

Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
On Wed, Nov 25, 2020 at 1:20 PM Victor Yegorov <[hidden email]> wrote:
> In _bt_delete_or_dedup_one_page() we start with a simple loop over the items on the page, and
> if there are any LP_DEAD tuples, we kick off _bt_delitems_delete().

Right.

> So if I understood you right, you plan to make this loop (or a similar one nearby)
> track the TIDs of the LP_DEAD tuples and then (perhaps in a second loop over the page) check all other
> currently-not-LP_DEAD tuples, marking as promising those heap pages that have at least 2 TIDs pointing at
> them (one LP_DEAD and one not).

Yes. We notice extra TIDs that can be included in our heapam test "for
free". The cost is low, but the benefits are also often quite high: in
practice there are *natural* correlations that we can exploit.

For example: maybe there were non-HOT updates, and some but not all of
the versions got marked LP_DEAD. We can get them all in one go,
avoiding a true bottom-up index deletion pass for much longer
(compared to doing LP_DEAD deletion the old way, which is what happens
in v9 of the patch). We're better off doing the deletions all at once.
It's cheaper.

(We still really need to have bottom-up deletion passes, of course,
because that covers the important case where there are no LP_DEAD bits
set at all, which is an important goal of this project.)

Minor note: Technically there aren't any promising tuples involved,
because that only makes sense when we are not going to visit every
possible heap page (just the "most promising" heap pages). But we are
going to visit every possible heap page with the new LP_DEAD bit
deletion code (which could occasionally mean visiting 10 or more heap
pages, which is a lot more than bottom-up index deletion will ever
visit). All we need to do with the new LP_DEAD deletion logic is to
include all possible matching TIDs (not just those that are marked
LP_DEAD already).

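To make the two steps concrete, here is a minimal sketch of the idea. The
structure and function names are invented for illustration (the patch's real
data structures differ), and posting list tuples and the leaf page high key
are ignored for brevity.

#include "postgres.h"
#include "access/itup.h"
#include "storage/bufpage.h"
#include "storage/itemptr.h"

/* Sketch-only deletion candidate */
typedef struct DelCandidate
{
    ItemPointerData tid;        /* heap TID taken from the index tuple */
    OffsetNumber    idxoffnum;  /* index tuple's offset on the leaf page */
    bool            knowndead;  /* LP_DEAD bit already set? */
} DelCandidate;

/*
 * Collect deletion candidates from a leaf page: every LP_DEAD item, plus
 * every other item whose TID points into a heap block that some LP_DEAD
 * item also points into ("extra" TIDs that heapam must still verify).
 */
static int
collect_candidates(Page page, DelCandidate *cand)
{
    OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
    int         ncand = 0;
    int         nknowndead;

    /* Pass 1: items whose LP_DEAD bit is already set */
    for (OffsetNumber off = FirstOffsetNumber; off <= maxoff; off++)
    {
        ItemId      itemid = PageGetItemId(page, off);
        IndexTuple  itup = (IndexTuple) PageGetItem(page, itemid);

        if (!ItemIdIsDead(itemid))
            continue;
        cand[ncand].tid = itup->t_tid;
        cand[ncand].idxoffnum = off;
        cand[ncand].knowndead = true;
        ncand++;
    }
    nknowndead = ncand;

    /* Pass 2: extra TIDs that point into the same heap blocks */
    for (OffsetNumber off = FirstOffsetNumber; off <= maxoff; off++)
    {
        ItemId      itemid = PageGetItemId(page, off);
        IndexTuple  itup = (IndexTuple) PageGetItem(page, itemid);
        BlockNumber blk;

        if (ItemIdIsDead(itemid))
            continue;           /* already collected in pass 1 */
        blk = ItemPointerGetBlockNumber(&itup->t_tid);

        for (int i = 0; i < nknowndead; i++)
        {
            if (ItemPointerGetBlockNumber(&cand[i].tid) == blk)
            {
                cand[ncand].tid = itup->t_tid;
                cand[ncand].idxoffnum = off;
                cand[ncand].knowndead = false;  /* heapam decides */
                ncand++;
                break;
            }
        }
    }

    return ncand;
}

The whole candidate array would then be handed to the heapam side, which
visits only heap blocks that it already had to read for the latestRemovedXid
computation and reports back which of the extra TIDs are in fact dead-to-all.
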
> What I meant to ask is: will LP_DEAD bits be set by index scans in the presence of a long transaction?

That works in the same way as before, even with the new LP_DEAD
deletion code. The new code uses the same information as before (set
LP_DEAD bits), which is generated in the same way as before. The
difference is in how the information is actually used during LP_DEAD
deletion -- we can now delete some extra things in certain common
cases.

In practice this (and bottom-up deletion) make nbtree more robust
against disruption caused by long running transactions that hold a
snapshot open. It's hard to give a simple explanation of why that is,
because it's a second order effect. The patch is going to make it
possible to recover when LP_DEAD bits suddenly stop being set because
of an old snapshot -- now we'll have a "second chance", and maybe even
a third chance. But if the snapshot is held open *forever*, then a
second chance has no value.

Here is a thought experiment that might be helpful:

Imagine Postgres just as it is today (without the patch), except that
VACUUM runs very frequently, and is infinitely fast (this is a magical
version of VACUUM). This solves many problems, but does not solve all
problems. Magic Postgres will become just as slow as earthly Postgres
when there is a snapshot that is held open for a very long time. That
will take longer to happen compared to earthly/mortal Postgres, but
eventually there will be no difference between the two at all. But,
when you don't have such an extreme problem, magic Postgres really is
much faster.

I think that it will be possible to approximate the behavior of magic
Postgres using techniques like bottom-up deletion, the new LP_DEAD
deletion thing we've been talking today, and maybe other enhancements
in other areas (like in heap pruning). It doesn't matter that we don't
physically remove garbage immediately, as long as we "logically"
remove it immediately. The actually physical removal can occur in a
just in time, incremental fashion, creating the illusion that VACUUM
really does run infinitely fast. No magic required.

Actually, in a way this isn't new; we have always "logically" removed
garbage at the earliest opportunity (by which I mean we allow that it
can be physically removed according to an oldestXmin style cutoff,
which can be reacquired/updated the second the oldest MVCC snapshot
goes away). We don't think of useless old versions as being "logically
removed" the instant an old snapshot goes away. But maybe we should --
it's a useful mental model.

It will also be very helpful to add "remove useless intermediate
versions" logic at some point. This is quite a distinct area to what I
just described, but it's also important. We need both, I think.

--
Peter Geoghegan



Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
In reply to this post by Peter Geoghegan-4
On Wed, Nov 25, 2020 at 10:41 AM Peter Geoghegan <[hidden email]> wrote:

> We have promising tuples for bottom-up deletion. Why not have
> "promising heap blocks" for traditional LP_DEAD index tuple deletion?
> Or if you prefer, we can consider index tuples that *don't* have their
> LP_DEAD bits set already but happen to point to the *same heap block*
> as other tuples that *do* have their LP_DEAD bits set promising. (The
> tuples with their LP_DEAD bits set are not just promising -- they're
> already a sure thing.)
>
> This means that traditional LP_DEAD deletion is now slightly more
> speculative in one way (it speculates about what is likely to be true
> using heuristics). But it's much less speculative than bottom-up index
> deletion. We are required to visit these heap blocks anyway, since a
> call to _bt_delitems_delete() for LP_DEAD deletion must already call
> table_compute_xid_horizon_for_tuples(), which has to access the blocks
> to get a latestRemovedXid for the WAL record.
>
> The only thing that we have to lose here is a few CPU cycles to find
> extra TIDs to consider. We'll visit exactly the same number of heap
> blocks as before. (Actually, _bt_delitems_delete() does not have to do
> that in all cases, but it has to do it with a logged table
> with wal_level >= replica, which is the vast majority of cases in
> practice.)
>
> This means that traditional LP_DEAD deletion reuses some of the
> bottom-up index deletion infrastructure. So maybe nbtree never calls
> table_compute_xid_horizon_for_tuples() now, since everything goes
> through the new heapam stuff instead (which knows how to check extra
> TIDs that might not be dead at all).

Attached is v10, which has this LP_DEAD deletion enhancement I
described. (It also fixed bitrot -- v9 no longer applies.)

This revision does a little refactoring to make this possible. Now
there is less new code in nbtdedup.c, and more code in nbtpage.c,
because some of the logic used by bottom-up deletion has been
generalized (in order to be used by the new-to-v10 LP_DEAD deletion
enhancement).

Other than that, no big changes between this v10 and v9. Just
polishing and refactoring. I decided to make it mandatory for tableams
to support the new interface that heapam implements, since it's hardly
okay for them to not allow LP_DEAD deletion in nbtree (which is what
making supporting the interface optional would imply, given the
LP_DEAD changes). So now the heapam and tableam changes are included
in one patch/commit, which is to be applied first among patches in the
series.

--
Peter Geoghegan

v10-0001-Teach-heapam-to-support-bottom-up-index-deletion.patch (46K)
v10-0002-Pass-down-logically-unchanged-index-hint.patch (40K)
v10-0003-Teach-nbtree-to-use-bottom-up-index-deletion.patch (121K)

Re: Deleting older versions in unique indexes to avoid page splits

Heikki Linnakangas
This is a wholly new concept with a lot of heuristics. It's a lot to
swallow. But here are a few quick comments after a first read-through:

On 30/11/2020 21:50, Peter Geoghegan wrote:

> +/*
> + * State used when calling table_index_delete_check() to perform "bottom up"
> + * deletion of duplicate index tuples.  State is intialized by index AM
> + * caller, while state is finalized by tableam, which modifies state.
> + */
> +typedef struct TM_IndexDelete
> +{
> + ItemPointerData tid; /* table TID from index tuple */
> + int16 id; /* Offset into TM_IndexStatus array */
> +} TM_IndexDelete;
> +
> +typedef struct TM_IndexStatus
> +{
> + OffsetNumber idxoffnum; /* Index am page offset number */
> + int16 tupsize; /* Space freed in index if tuple deleted */
> + bool ispromising; /* Duplicate in index? */
> + bool deleteitup; /* Known dead-to-all? */
> +} TM_IndexStatus;
> ...
> + * The two arrays are conceptually one single array.  Two arrays/structs are
> + * used for performance reasons.  (We really need to keep the TM_IndexDelete
> + * struct small so that the tableam can do an initial sort by TID as quickly
> + * as possible.)

Is it really significantly faster to have two arrays? If you had just
one array, each element would be only 12 bytes long. That's not much
bigger than TM_IndexDeletes, which is 8 bytes.

> + /* First sort caller's array by TID */
> + heap_tid_shellsort(delstate->deltids, delstate->ndeltids);
> +
> + /* alltids caller visits all blocks, so make sure that happens */
> + if (delstate->alltids)
> + return delstate->ndeltids;
> +
> + /* Calculate per-heap-block count of TIDs */
> + blockcounts = palloc(sizeof(IndexDeleteCounts) * delstate->ndeltids);
> + for (int i = 0; i < delstate->ndeltids; i++)
> + {
> + ItemPointer deltid = &delstate->deltids[i].tid;
> + TM_IndexStatus *dstatus = delstate->status + delstate->deltids[i].id;
> + bool ispromising = dstatus->ispromising;
> +
> + if (curblock != ItemPointerGetBlockNumber(deltid))
> + {
> + /* New block group */
> + nblockgroups++;
> +
> + Assert(curblock < ItemPointerGetBlockNumber(deltid) ||
> +   !BlockNumberIsValid(curblock));
> +
> + curblock = ItemPointerGetBlockNumber(deltid);
> + blockcounts[nblockgroups - 1].ideltids = i;
> + blockcounts[nblockgroups - 1].ntids = 1;
> + blockcounts[nblockgroups - 1].npromisingtids = 0;
> + }
> + else
> + {
> + blockcounts[nblockgroups - 1].ntids++;
> + }
> +
> + if (ispromising)
> + blockcounts[nblockgroups - 1].npromisingtids++;
> + }

Instead of sorting the array by TID, wouldn't it be enough to sort by
just the block numbers?

> * While in general the presence of promising tuples (the hint that index
> + * AMs provide) is the best information that we have to go on, it is based
> + * on simple heuristics about duplicates in indexes that are understood to
> + * have specific flaws.  We should not let the most promising heap pages
> + * win or lose on the basis of _relatively_ small differences in the total
> + * number of promising tuples.  Small differences between the most
> + * promising few heap pages are effectively ignored by applying this
> + * power-of-two bucketing scheme.
> + *

What are the "specific flaws"?

I understand that this is all based on rough heuristics, but is there
any advantage to rounding/bucketing the numbers like this? Even if small
differences in the total number of promising tuples are essentially noise
that can be safely ignored, is there any harm in letting those small
differences guide the choice? Wouldn't it be essentially the same as
picking at random, or are those small differences somehow biased?

- Heikki



Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
On Tue, Dec 1, 2020 at 1:50 AM Heikki Linnakangas <[hidden email]> wrote:
> This is a wholly new concept with a lot of heuristics. It's a lot to
> swallow.

Thanks for taking a look! Yes, it's a little unorthodox.

Ideally, you'd find time to grok the patch and help me codify the
design in some general kind of way. If there are general lessons to be
learned here (and I suspect that there are), then this should not be
left to chance. The same principles can probably be applied in heapam,
for example. Even if I'm wrong about the techniques being highly
generalizable, it should still be interesting! (Something to think
about a little later.)

Some of the intuitions behind the design might be vaguely familiar to
you as the reviewer of my earlier B-Tree work. In particular, the
whole way that we reason about how successive related operations (in
this case bottom-up deletion passes) affect individual leaf pages over
time might give you a feeling of déjà vu. It's a little like the
nbtsplitloc.c stuff that we worked on together during the Postgres 12
cycle. We can make what might seem like rather bold assumptions about
what's going on, and adapt to the workload. This is okay because we're
sure that the downside of our being wrong is a fixed, low performance
penalty. (To a lesser degree it's okay because the empirical evidence
shows that we're almost always right, because we apply the
optimization again and again precisely because it worked out last
time.)

I like to compare it to the familiar "rightmost leaf page applies leaf
fillfactor" heuristic, which applies an assumption that is wrong in
the general case, but nevertheless obviously helps enormously as a
practical matter. Of course it's still true that anybody reviewing
this patch ought to start with a principled skepticism of this claim
-- that's how you review any patch. I can say for sure that that's the
idea behind the patch, though. I want to be clear about that up front,
to save you time -- if this claim is wrong, then I'm wrong about
everything.

Generational garbage collection influenced this work, too. It also
applies pragmatic assumptions about where garbage is likely to appear.
Assumptions that are based on nothing more than empirical observations
about what is likely to happen in the real world, that are validated
by experience and by benchmarking.

> On 30/11/2020 21:50, Peter Geoghegan wrote:
> > +} TM_IndexDelete;

> > +} TM_IndexStatus;

> Is it really significantly faster to have two arrays? If you had just
> one array, each element would be only 12 bytes long. That's not much
> bigger than TM_IndexDeletes, which is 8 bytes.

Yeah, but the swap operations really matter here. At one point I found
that to be the case, anyway. That might no longer be true, though. It
might be that the code became less performance critical because other
parts of the design improved, resulting in the code getting called
less frequently. But if that is true then it has a lot to do with the
power-of-two bucketing that you go on to ask about -- that helped
performance a lot in certain specific cases (as I go into below).

I will add a TODO item for myself, to look into this again. You may
well be right.

> > +     /* First sort caller's array by TID */
> > +     heap_tid_shellsort(delstate->deltids, delstate->ndeltids);

> Instead of sorting the array by TID, wouldn't it be enough to sort by
> just the block numbers?

I don't understand. Yeah, I guess that we could do our initial sort of
the deltids array (the heap_tid_shellsort() call) just using
BlockNumber (not TID). But OTOH there might be some small locality
benefit to doing a proper TID sort at the level of each heap page. And
even if there isn't any such benefit, does it really matter?

If you are asking about the later sort of the block counts array
(which helps us sort the deltids array a second time, leaving it in
its final order for bottom-up deletion heapam.c processing), then the
answer is no. This block counts metadata array sort is useful because
it allows us to leave the deltids array in what I believe to be the
most useful order for processing. We'll access heap blocks primarily
based on the number of promising tuples (though as I go into below,
sometimes the number of promising tuples isn't a decisive influence on
processing order).

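For reference, a full TID sort orders by heap block number first and uses the
offset number only to break ties within a block, so it already groups the
array by heap block. Here is a bare-bones sketch of such a sort over plain
TIDs (not the patch's actual heap_tid_shellsort(), which works on its own
structs):

#include "postgres.h"
#include "storage/itemptr.h"

/*
 * Sketch of a shellsort over heap TIDs.  ItemPointerCompare() orders by
 * block number first, then by offset number.
 */
static void
tid_shellsort(ItemPointerData *tids, int ntids)
{
    /* decreasing gap sequence ending in 1 (plain insertion sort last) */
    static const int gaps[] = {57, 23, 10, 4, 1};

    for (int g = 0; g < (int) lengthof(gaps); g++)
    {
        int         gap = gaps[g];

        for (int i = gap; i < ntids; i++)
        {
            ItemPointerData tmp = tids[i];
            int         j = i;

            while (j >= gap && ItemPointerCompare(&tids[j - gap], &tmp) > 0)
            {
                tids[j] = tids[j - gap];
                j -= gap;
            }
            tids[j] = tmp;
        }
    }
}
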
> >        * While in general the presence of promising tuples (the hint that index
> > +      * AMs provide) is the best information that we have to go on, it is based
> > +      * on simple heuristics about duplicates in indexes that are understood to
> > +      * have specific flaws.  We should not let the most promising heap pages
> > +      * win or lose on the basis of _relatively_ small differences in the total
> > +      * number of promising tuples.  Small differences between the most
> > +      * promising few heap pages are effectively ignored by applying this
> > +      * power-of-two bucketing scheme.
> > +      *
>
> What are the "specific flaws"?

I just meant the obvious: the index AM doesn't actually know for sure
that there are any old versions on the leaf page that it will
ultimately be able to delete. This uncertainty needs to be managed,
including inside heapam.c. Feel free to suggest a better way of
expressing that sentiment.

> I understand that this is all based on rough heuristics, but is there
> any advantage to rounding/bucketing the numbers like this? Even if small
> differences in the total number of promising tuples are essentially noise
> that can be safely ignored, is there any harm in letting those small
> differences guide the choice? Wouldn't it be essentially the same as
> picking at random, or are those small differences somehow biased?

Excellent question! It actually helps enormously, especially with low
cardinality data that makes good use of the deduplication optimization
(where it is especially important to keep the costs and the benefits
in balance). This has been validated by benchmarking.

This design naturally allows the heapam.c code to take advantage of
both temporal and spatial locality. For example, let's say that you
have several indexes all on the same table, which get lots of non-HOT
UPDATEs, which are skewed. Naturally, the heap TIDs will match across
each index -- these are index entries that are needed to represent
successor versions (which are logically unchanged/version duplicate
index tuples from the point of view of nbtree/nbtdedup.c). Introducing
a degree of determinism to the order in which heap blocks are
processed naturally takes advantage of the correlated nature of the
index bloat. It naturally makes it much more likely that the
also-correlated bottom-up index deletion passes (that occur across
indexes on the same table) each process the same heap blocks close
together in time -- with obvious performance benefits.

In the extreme (but not particularly uncommon) case of non-HOT UPDATEs
with many low cardinality indexes, each heapam.c call will end up
doing *almost the same thing* across indexes. So we're making the
correlated nature of the bloat (which is currently a big problem) work
in our favor -- turning it on its head, you could say. Highly
correlated bloat is not the exception -- it's actually the norm in the
cases we're targeting here. If it wasn't highly correlated then it
would already be okay to rely on VACUUM to get around to cleaning it
later.

This power-of-two bucketing design probably also helps when there is
only one index. I could go into more detail on that, plus other
variations, but perhaps the multiple index example suffices for now. I
believe that there are a few interesting kinds of correlations here --
I bet you can think of some yourself. (Of course it's also possible
and even likely that heap block correlation won't be important at all,
but my response is "what specifically is the harm in being open to the
possibility?".)

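As an illustration of the bucketing idea, here is a sketch of a comparator
over per-heap-block groups. The struct and function names are assumptions for
the example, not the patch's code; the point is only that the primary sort
key is the power-of-two bucket of the promising-tuple count, so nearly-equal
counts compare as equal and more stable criteria decide instead.

#include "postgres.h"
#include "storage/block.h"

/* Sketch-only per-heap-block group */
typedef struct BlockGroup
{
    BlockNumber block;          /* heap block number */
    int         npromisingtids; /* "promising" TIDs pointing to this block */
    int         ntids;          /* all candidate TIDs pointing to this block */
} BlockGroup;

/* Bucket a count by power of two: 0 -> 0, 1 -> 1, 2-3 -> 2, 4-7 -> 3, ... */
static int
power_of_two_bucket(int count)
{
    int         bucket = 0;

    while (count > 0)
    {
        count >>= 1;
        bucket++;
    }
    return bucket;
}

/* qsort() comparator: most promising block groups sort first */
static int
block_group_cmp(const void *arg1, const void *arg2)
{
    const BlockGroup *g1 = (const BlockGroup *) arg1;
    const BlockGroup *g2 = (const BlockGroup *) arg2;
    int         bucket1 = power_of_two_bucket(g1->npromisingtids);
    int         bucket2 = power_of_two_bucket(g2->npromisingtids);

    if (bucket1 != bucket2)
        return (bucket1 > bucket2) ? -1 : 1;    /* bigger bucket wins */
    if (g1->ntids != g2->ntids)
        return (g1->ntids > g2->ntids) ? -1 : 1;
    if (g1->block != g2->block)
        return (g1->block < g2->block) ? -1 : 1;    /* stable, arbitrary */
    return 0;
}

Under this ordering, a block with 60 promising tuples and one with 50 land in
the same bucket, while a block with 5 clearly loses to both; and because the
final tie-breaker is deterministic, correlated passes across different indexes
on the same table tend to visit the same heap blocks in the same order.
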
BTW, I tried to make the tableam interface changes more or less
compatible with Zedstore, which is notable for not using TIDs in the
same way as heapam (or zheap). Let me know what you think about that.
I can go into detail about it if it isn't obvious to you.

--
Peter Geoghegan



Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
On Tue, Dec 1, 2020 at 2:18 PM Peter Geoghegan <[hidden email]> wrote:
> On Tue, Dec 1, 2020 at 1:50 AM Heikki Linnakangas <[hidden email]> wrote:
> > This is a wholly new concept with a lot of heuristics. It's a lot to
> > swallow.

Attached is v11, which cleans everything up around the tableam
interface. There are only two patches in v11, since the tableam
refactoring made it impossible to split the second patch into a heapam
patch and nbtree patch (there is no reduction in functionality
compared to v10).

Most of the real changes in v11 (compared to v10) are in heapam.c.
I've completely replaced the table_compute_xid_horizon_for_tuples()
interface with a new interface that supports all existing requirements
(from index deletions that support LP_DEAD deletion), while also
supporting these new requirements (e.g. bottom-up index deletion). So
now heap_compute_xid_horizon_for_tuples() becomes
heap_compute_delete_for_tuples(), which has different arguments but
the same overall structure. All of the new requirements can now be
thought of as additive things that we happen to use for nbtree
callers, that could easily also be used in other index AMs later on.
This means that there is a lot less new code in heapam.c.

Prefetching of heap blocks for the new bottom-up index deletion caller
now works in the same way as it has worked in
heap_compute_xid_horizon_for_tuples() since Postgres 12 (more or
less). This is a significant improvement compared to my original
approach.

Changing heap_compute_xid_horizon_for_tuples() rather than adding a
sibling function started to make sense when v10 of the patch taught
regular nbtree LP_DEAD deletion (the thing that has been around since
PostgreSQL 8.2) to add "extra" TIDs to check in passing, just in case
we find that they're also deletable -- why not just have one function?
It also means that hash indexes and GiST indexes now use the
heap_compute_delete_for_tuples() function, though they get the same
behavior as before. I imagine that it would be pretty straightforward
to bring that same benefit to those other index AMs, since their
implementations are already derived from the nbtree implementation of
LP_DEAD deletion (and because adding extra TIDs to check in passing in
these other index AMs should be fairly easy).

> > > +} TM_IndexDelete;
>
> > > +} TM_IndexStatus;
>
> > Is it really significantly faster to have two arrays? If you had just
> > one array, each element would be only 12 bytes long. That's not much
> > bigger than TM_IndexDeletes, which is 8 bytes.

I kept this facet of the design in v11, following some deliberation. I
found that the TID sort operation appeared quite prominently in
profiles, and so the optimizations mostly seemed to still make sense.
I also kept one shellsort specialization. However, I removed the
second specialized sort implementation, so at least there is only one
specialization now (which is small anyway). I found that the second
sort specialization (added to heapam.c in v10) really wasn't pulling
its weight, even in more extreme cases of the kind that justified the
optimizations in the first place.

--
Peter Geoghegan

v11-0001-Pass-down-logically-unchanged-index-hint.patch (40K)
v11-0002-Add-bottom-up-index-deletion.patch (176K)

Re: Deleting older versions in unique indexes to avoid page splits

Zhihong Yu
Hi,
In v11-0001-Pass-down-logically-unchanged-index-hint.patch :

+   if (hasexpression)
+       return false;
+
+   return true;

The above can be written as 'return !hasexpression;'

For +index_unchanged_by_update_var_walker:

+ * Returns true when Var that appears within allUpdatedCols located.

The sentence seems incomplete.

Currently the return value of index_unchanged_by_update_var_walker() is the reverse of index_unchanged_by_update().
Maybe it is easier to read the code if their return values have the same meaning.

Cheers


Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
In reply to this post by Peter Geoghegan-4
On Wed, Dec 9, 2020 at 5:12 PM Peter Geoghegan <[hidden email]> wrote:
> Most of the real changes in v11 (compared to v10) are in heapam.c.
> I've completely replaced the table_compute_xid_horizon_for_tuples()
> interface with a new interface that supports all existing requirements
> (from index deletions that support LP_DEAD deletion), while also
> supporting these new requirements (e.g. bottom-up index deletion).

I came up with a microbenchmark that is designed to give some general
sense of how helpful it is to include "extra" TIDs alongside
LP_DEAD-in-index TIDs (when they happen to point to the same table
block as the LP_DEAD-in-index TIDs, which we'll have to visit anyway).
The basic idea is simple: Add custom instrumentation that summarizes
each B-Tree index deletion in LOG output, and then run the regression
tests. Attached is the result of this for a "make installcheck-world".

If you're just looking at this thread for the first time in a while:
note that what I'm about to go into isn't technically bottom-up
deletion (though you will see some of that in the full log output if
you look). It's a closely related optimization that only appeared in
recent versions of the patch. So I'm comparing the current approach to
simple deletion of LP_DEAD-marked index tuples to a new enhanced
approach (that makes it a little more like bottom-up deletion, but
only a little).

Here is some sample output (selected almost at random from a text file
consisting of 889 lines of similar output):

... exact TIDs deleted 17, LP_DEAD tuples 4, LP_DEAD-related table blocks 2 )
... exact TIDs deleted 38, LP_DEAD tuples 28, LP_DEAD-related table blocks 1 )
... exact TIDs deleted 39, LP_DEAD tuples 1, LP_DEAD-related table blocks 1 )
... exact TIDs deleted 44, LP_DEAD tuples 42, LP_DEAD-related table blocks 3 )
... exact TIDs deleted 6, LP_DEAD tuples 2, LP_DEAD-related table blocks 2 )

(The initial contents of each line were snipped here, to focus on the
relevant metrics.)

Here we see that the actual number of TIDs/index tuples deleted often
*vastly* exceeds the number of LP_DEAD-marked tuples (which are all we
would have been able to delete with the existing approach of just
deleting LP_DEAD items). It's pretty rare for us to fail to at least
delete a couple of extra TIDs. Clearly this technique is broadly
effective, because in practice there are significant locality-related
effects that we can benefit from. It doesn't really matter that it's
hard to precisely describe all of these locality effects. IMV the
question that really matters is whether or not the cost of trying is
consistently very low (relative to the cost of our current approach to
simple LP_DEAD deletion). We do need to understand that fully.

It's tempting to think about this quantitatively (and it also bolsters
the case for the patch), but that misses the point. The right way to
think about this is as a *qualitative* thing. The presence of LP_DEAD
bits gives us certain reliable information about the nature of the
index tuple (that it is dead-to-all, and therefore safe to delete),
but it also *suggests* quite a lot more than that. In practice bloat
is usually highly correlated/concentrated, especially when we limit
our consideration to workloads where bloat is noticeable in any
practical sense. So we're very well advised to look for nearby
deletable index tuples in passing -- especially since it's practically
free to do so. (Which is what the patch does here, of course.)

Let's compare this to an extreme variant of my patch that runs the
same test, to see what changes. Consider a variant that exhaustively
checks every index tuple on the page at the point of a simple LP_DEAD
deletion operation, no matter what. Rather than only including those
TIDs that happen to be on the same heap/table blocks (and thus are
practically free to check), we include all of them. This design isn't
acceptable in the real world because it does a lot of unnecessary I/O,
but that shouldn't invalidate any comparison I make here. This is
still a reasonable approximation of a version of the patch with
magical foreknowledge of where to find dead TIDs. It's an Oracle
(ahem) that we can sensibly compare to the real patch within certain
constraints.

The results of this comparative analysis seem to suggest something
important about the general nature of what's going on. The results
are: There are only 844 deletion operations total with the Oracle.
While this is less than the actual patch's 889 deletion operations,
you would expect a bigger improvement from using what is after all
supposed to apply magic! This suggests to me that the precise
intervention of the patch here (the new LP_DEAD deletion stuff) has an
outsized impact. The correlations that naturally exist make this
asymmetrical payoff-to-cost situation possible. Simple cheap
heuristics once again go a surprisingly long way, kind of like
bottom-up deletion itself.

--
Peter Geoghegan

delete_stats_regression_tests.txt.gz (10K)

Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
In reply to this post by Peter Geoghegan-4
On Wed, Dec 9, 2020 at 5:12 PM Peter Geoghegan <[hidden email]> wrote:
> Attached is v11, which cleans everything up around the tableam
> interface. There are only two patches in v11, since the tableam
> refactoring made it impossible to split the second patch into a heapam
> patch and nbtree patch (there is no reduction in functionality
> compared to v10).

Attached is v12, which fixed bitrot against the master branch. This
version has significant comment and documentation revisions. It is
functionally equivalent to v11, though.

I intend to commit the patch in the next couple of weeks. While it
certainly would be nice to get a more thorough review, I don't feel
that it is strictly necessary. The patch provides very significant
benefits with certain workloads that have traditionally been
considered an Achilles' heel for Postgres. Even zheap doesn't provide
a solution to these problems. The only thing that I can think of that
might reasonably be considered in competition with this design is
WARM, which hasn't been under active development since 2017 (I assume
that it has been abandoned by those involved).

I should also point out that the design doesn't change the on-disk
format in any way, and so doesn't commit us to supporting something
that might become onerous in the event of somebody else finding a
better way to address at least some of the same problems.

--
Peter Geoghegan

v12-0002-Add-bottom-up-index-deletion.patch (191K)
v12-0001-Pass-down-logically-unchanged-index-hint.patch (40K)

Re: Deleting older versions in unique indexes to avoid page splits

Victor Yegorov
On Thu, Dec 31, 2020 at 03:55, Peter Geoghegan <[hidden email]> wrote:
> Attached is v12, which fixed bitrot against the master branch. This
> version has significant comment and documentation revisions. It is
> functionally equivalent to v11, though.
>
> I intend to commit the patch in the next couple of weeks. While it
> certainly would be nice to get a more thorough review, I don't feel
> that it is strictly necessary. The patch provides very significant
> benefits with certain workloads that have traditionally been
> considered an Achilles' heel for Postgres. Even zheap doesn't provide
> a solution to these problems. The only thing that I can think of that
> might reasonably be considered in competition with this design is
> WARM, which hasn't been under active development since 2017 (I assume
> that it has been abandoned by those involved).

I am planning to look into this patch in the next few days.

--
Victor Yegorov

Re: Deleting older versions in unique indexes to avoid page splits

Zhihong Yu
In reply to this post by Peter Geoghegan-4
Hi, Peter:
Happy New Year.

For v12-0001-Pass-down-logically-unchanged-index-hint.patch

+   if (hasexpression)
+       return false;
+
+   return true;

The above can be written as return !hasexpression;
The negation is needed because the return value of index_unchanged_by_update_var_walker() is the inverse of "index unchanged".
If you align the meaning of the return value of index_unchanged_by_update_var_walker() with that of index_unchanged_by_update(), the negation is not needed.

For v12-0002-Add-bottom-up-index-deletion.patch :

For struct xl_btree_delete:

+   /* DELETED TARGET OFFSET NUMBERS FOLLOW */
+   /* UPDATED TARGET OFFSET NUMBERS FOLLOW */
+   /* UPDATED TUPLES METADATA (xl_btree_update) ARRAY FOLLOWS */

I guess the comment is for illustration purposes. Maybe you can write the comment in lower case.

+#define BOTTOMUP_FAVORABLE_STRIDE  3

Adding a comment on why the number 3 is chosen would be helpful for people to understand the code.

Cheers


Re: Deleting older versions in unique indexes to avoid page splits

Victor Yegorov
On Thu, Dec 31, 2020 at 20:01, Zhihong Yu <[hidden email]> wrote:
> For v12-0001-Pass-down-logically-unchanged-index-hint.patch
>
> +   if (hasexpression)
> +       return false;
> +
> +   return true;
>
> The above can be written as return !hasexpression;

To be honest, I prefer the way Peter has it in his patch.
Yes, it's possible to shorten this part, but readability is hurt: the current code I can just read, while for the suggested change I need to think about it.

--
Victor Yegorov