Making all nbtree entries unique by having heap TIDs participate in comparisons

Making all nbtree entries unique by having heap TIDs participate in comparisons

Peter Geoghegan-4
I've been thinking about using heap TID as a tie-breaker when
comparing B-Tree index tuples for a while now [1]. I'd like to make
all tuples at the leaf level unique, as assumed by L&Y. This can
enable "retail index tuple deletion", which I think we'll probably end
up implementing in some form or another, possibly as part of the zheap
project. It's also possible that this work will facilitate GIN-style
deduplication based on run length encoding of TIDs, or storing
versioned heap TIDs in an out-of-line nbtree-versioning structure
(unique indexes only). I can see many possibilities, but we have to
start somewhere.

I attach an unfinished prototype of suffix truncation, that also
sometimes *adds* a new attribute in pivot tuples. It adds an extra
heap TID from the leaf level when truncating away non-distinguishing
attributes during a leaf page split, though only when it must. The
patch also has nbtree treat heap TID as a first class part of the key
space of the index. Claudio wrote a patch that did something similar,
though without the suffix truncation part [2] (I haven't studied his
patch, to be honest). My patch is actually a very indirect spin-off of
Anastasia's covering index patch, and I want to show what I have in
mind now, while it's still swapped into my head. I won't do any
serious work on this project unless and until I see a way to implement
retail index tuple deletion, which seems like a multi-year project
that requires the buy-in of multiple senior community members. On its
own, my patch regresses performance unacceptably in some workloads,
probably due to interactions with kill_prior_tuple()/LP_DEAD hint
setting, and interactions with page space management when there are
many "duplicates" (it can still help performance in some pgbench
workloads with non-unique indexes, though).

Note that the approach to suffix truncation that I've taken isn't even
my preferred approach [3] -- it's a medium-term solution that enables
making a heap TID attribute part of the key space, which enables
everything else. Cheap incremental/retail tuple deletion is the real
prize here; don't lose sight of that when looking through my patch. If
we're going to teach nbtree to truncate this new implicit heap TID
attribute, which seems essential, then we might as well teach nbtree
to do suffix truncation of other (user-visible) attributes while we're
at it. This patch isn't a particularly effective implementation of
suffix truncation, because that's not what I'm truly interested in
improving here (plus I haven't even bothered to optimize the logic for
picking a split point in light of suffix truncation).

amcheck
=======

This patch adds amcheck coverage, which seems like essential
infrastructure for developing a feature such as this. Extensive
amcheck coverage gave me confidence in my general approach. The basic
idea, invariant-wise, is to treat truncated attributes (often
including a truncated heap TID attribute in internal pages) as "minus
infinity" attributes, which participate in comparisons if and only if
we reach such attributes before the end of the scan key (a smaller
keysz for the index scan could prevent this). I've generalized the
minus infinity concept that _bt_compare() has always considered as a
special case, extending it to individual attributes. It's actually
possible to remove that old hard-coded _bt_compare() logic with this
patch applied without breaking anything, since we can rely on the
comparison of an explicitly 0-attribute tuple working the same way
(pg_upgrade'd databases will break if we do this, however, so I didn't
go that far).
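
To make the comparison rule concrete, here is a deliberately simplified,
self-contained sketch of the idea -- it is not the patch's _bt_compare()
code, and the fixed two-column integer key and always-complete insertion
scan key are assumptions made purely for brevity:

#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified index tuple: up to two integer key attributes plus a heap TID.
 * natts < 2 models suffix truncation of user-visible attributes, and
 * has_tid == false models a truncated heap TID; both behave as "minus
 * infinity" in comparisons.
 */
typedef struct ToyTuple
{
    int      natts;         /* number of untruncated key attributes */
    int32_t  key[2];
    bool     has_tid;
    uint64_t tid;           /* flattened (block, offset) heap TID */
} ToyTuple;

/*
 * Compare a complete insertion scan key against a (possibly truncated)
 * pivot tuple.  Returns <0, 0, or >0.  Reaching a truncated attribute
 * before the end of the scan key means the scan key is strictly greater,
 * which is the generalized "minus infinity" rule described above.
 */
static int
toy_compare(const ToyTuple *scankey, const ToyTuple *pivot)
{
    for (int i = 0; i < 2; i++)
    {
        if (i >= pivot->natts)
            return 1;           /* pivot attribute is "minus infinity" */
        if (scankey->key[i] != pivot->key[i])
            return scankey->key[i] < pivot->key[i] ? -1 : 1;
    }

    if (!pivot->has_tid)
        return 1;               /* truncated heap TID: also minus infinity */
    if (scankey->tid != pivot->tid)
        return scankey->tid < pivot->tid ? -1 : 1;

    return 0;                   /* fully equal, which is now impossible for
                                 * two distinct leaf tuples */
}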

Note that I didn't change the logic that has _bt_binsrch() treat
internal pages in a special way when tuples compare as equal. We still
need that logic for cases where keysz is less than the number of
indexed columns. It's only possible to avoid this _bt_binsrch() thing
for internal pages when all attributes, including heap TID, were
specified and compared (an insertion scan key has to have an entry for
every indexed column, including even heap TID). Doing better there
doesn't seem worth the trouble of teaching _bt_compare() to tell the
_bt_binsrch() caller about this as a special case. That means that we
still move left on equality in some cases where it isn't strictly
necessary, contrary to L&Y. However, amcheck verifies that the classic
"Ki < v <= Ki+1" invariant holds (as opposed to "Ki <= v <= Ki+1")
when verifying parent/child relationships, which demonstrates that I
have restored the classic invariant (I just don't find it worthwhile
to take advantage of it within _bt_binsrch() just yet).

Most of this work was done while I was an employee of VMware, though I
joined Crunchy Data on Monday and cleaned it up a bit more since then.
I'm excited about joining Crunchy, but I should also acknowledge
VMware's strong support of my work.

[1] https://wiki.postgresql.org/wiki/Key_normalization#Making_all_items_in_the_index_unique_by_treating_heap_TID_as_an_implicit_last_attribute
[2] https://postgr.es/m/CAGTBQpZ-kTRQiAa13xG1GNe461YOwrA-s-ycCQPtyFrpKTaDBQ@...
[3] https://wiki.postgresql.org/wiki/Key_normalization#Suffix_truncation_of_normalized_keys
--
Peter Geoghegan

Attachment: 0001-Ensure-nbtree-leaf-tuple-keys-are-always-unique.patch (135K)

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

Robert Haas
On Thu, Jun 14, 2018 at 2:44 PM, Peter Geoghegan <[hidden email]> wrote:

> I've been thinking about using heap TID as a tie-breaker when
> comparing B-Tree index tuples for a while now [1]. I'd like to make
> all tuples at the leaf level unique, as assumed by L&Y. This can
> enable "retail index tuple deletion", which I think we'll probably end
> up implementing in some form or another, possibly as part of the zheap
> project. It's also possible that this work will facilitate GIN-style
> deduplication based on run length encoding of TIDs, or storing
> versioned heap TIDs in an out-of-line nbtree-versioning structure
> (unique indexes only). I can see many possibilities, but we have to
> start somewhere.

Yes, retail index deletion is essential for the delete-marking
approach that is proposed for zheap.

It could also be extremely useful in some workloads with the regular
heap.  If the indexes are large -- say, 100GB -- and the number of
tuples that vacuum needs to kill is small -- say, 5 -- scanning them
all to remove the references to those tuples is really inefficient.
If we had retail index deletion, then we could make a cost-based
decision about which approach to use in a particular case.
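
To put rough numbers on that trade-off, a toy cost comparison might look
like the following -- every name and constant here is hypothetical, not
something that exists in the tree today:

#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical cost-based choice between retail index tuple deletion and a
 * conventional full index scan by VACUUM.
 */
static bool
use_retail_deletion(uint64_t index_pages, uint64_t dead_tuples,
                    double tree_height)
{
    /* one full pass over the index, costed as sequential page reads */
    double  bulk_cost = (double) index_pages;

    /*
     * one root-to-leaf descent per dead tuple, costed as random page
     * reads; 4.0 mirrors the usual random-vs-sequential cost ratio
     */
    double  retail_cost = (double) dead_tuples * tree_height * 4.0;

    return retail_cost < bulk_cost;
}

/*
 * Example: a 100GB index is ~13 million 8KB pages; with 5 dead tuples and
 * a 4-level tree, retail deletion costs ~80 units versus ~13,000,000 for
 * the full scan.
 */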

> mind now, while it's still swapped into my head. I won't do any
> serious work on this project unless and until I see a way to implement
> retail index tuple deletion, which seems like a multi-year project
> that requires the buy-in of multiple senior community members.

Can you enumerate some of the technical obstacles that you see?

> On its
> own, my patch regresses performance unacceptably in some workloads,
> probably due to interactions with kill_prior_tuple()/LP_DEAD hint
> setting, and interactions with page space management when there are
> many "duplicates" (it can still help performance in some pgbench
> workloads with non-unique indexes, though).

I think it would be helpful if you could talk more about these
regressions (and the wins).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

Peter Geoghegan-4
On Fri, Jun 15, 2018 at 2:36 PM, Robert Haas <[hidden email]> wrote:
> Yes, retail index deletion is essential for the delete-marking
> approach that is proposed for zheap.

Makes sense.

I don't know that much about zheap. I'm sure that retail index tuple
deletion is really important in general, though. The Gray & Reuter
book treats unique keys as a basic assumption, as do other
authoritative reference works and papers. Other database systems
probably make unique indexes simply use the user-visible attributes as
unique values, but appending heap TID as a unique-ifier is probably a
reasonably common design for secondary indexes (it would also be nice
if we could simply not store duplicates for unique indexes, rather
than using heap TID). I generally have a very high opinion of the
nbtree code, but this seems like a problem that ought to be fixed.

I've convinced myself that I basically have the right idea with this
patch, because the classic L&Y invariants have all been tested with an
enhanced amcheck run against all indexes in a regression test
database. There was other stress-testing, too. The remaining problems
are fixable, but I need some guidance.

> It could also be extremely useful in some workloads with the regular
> heap.  If the indexes are large -- say, 100GB -- and the number of
> tuples that vacuum needs to kill is small -- say, 5 -- scanning them
> all to remove the references to those tuples is really inefficient.
> If we had retail index deletion, then we could make a cost-based
> decision about which approach to use in a particular case.

I remember talking to Andres about this in a bar 3 years ago. I can
imagine variations of pruning that do some amount of this when there
are lots of duplicates. Perhaps something like InnoDB's purge threads,
which do things like in-place deletes of secondary indexes after an
updating (or deleting) xact commits. I believe that that mechanism
targets secondary indexes specifically, and that it operates quite
eagerly.

> Can you enumerate some of the technical obstacles that you see?

The #1 technical obstacle is that I simply don't know where I should
try to take this patch, given that it probably needs to be tied to
some much bigger project, such as zheap. I have an open mind, though,
and intend to help if I can. I'm not really sure what the #2 and #3
problems are, because I'd need to be able to see a few steps ahead to
be sure. Maybe #2 is that I'm doing something wonky to avoid breaking
duplicate checking for unique indexes. (The way that duplicate checking
for unique indexes has always worked [1] is perhaps questionable, though.)

> I think it would be helpful if you could talk more about these
> regressions (and the wins).

I think that the performance regressions are due to the fact that when
you have a huge number of duplicates today, it's useful to be able to
claim space to fit further duplicates from almost any of the multiple
leaf pages that contain or have contained duplicates. I'd hoped that
the increased temporal locality that the patch gets would more than
make up for that. As far as I can tell, the problem is that temporal
locality doesn't help enough. I saw that performance was somewhat
improved with extreme Zipf distribution contention, but it went the
other way with less extreme contention. The details are not that fresh
in my mind, since I shelved this patch for a while following limited
performance testing.

The code could certainly use more performance testing, and more
general polishing. I'm not strongly motivated to do that right now,
because I don't quite see a clear path to making this patch useful.
But, as I said, I have an open mind about what the next step should
be.

[1] https://wiki.postgresql.org/wiki/Key_normalization#Avoiding_unnecessary_unique_index_enforcement
--
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

Claudio Freire
On Fri, Jun 15, 2018 at 8:47 PM Peter Geoghegan <[hidden email]> wrote:

> > I think it would be helpful if you could talk more about these
> > regressions (and the wins).
>
> I think that the performance regressions are due to the fact that when
> you have a huge number of duplicates today, it's useful to be able to
> claim space to fit further duplicates from almost any of the multiple
> leaf pages that contain or have contained duplicates. I'd hoped that
> the increased temporal locality that the patch gets would more than
> make up for that. As far as I can tell, the problem is that temporal
> locality doesn't help enough. I saw that performance was somewhat
> improved with extreme Zipf distribution contention, but it went the
> other way with less extreme contention. The details are not that fresh
> in my mind, since I shelved this patch for a while following limited
> performance testing.
>
> The code could certainly use more performance testing, and more
> general polishing. I'm not strongly motivated to do that right now,
> because I don't quite see a clear path to making this patch useful.
> But, as I said, I have an open mind about what the next step should
> be.

Way back when I was dabbling in this kind of endeavor, my main idea to
counteract that, and possibly improve performance overall, was a
microvacuum kind of thing that would do some on-demand cleanup to
remove duplicates or make room before page splits. Since nbtree
uniqueification enables efficient retail deletions, that could end up
as a net win.

I never got around to implementing it though, and it does get tricky
if you don't want to allow unbounded latency spikes.


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

Peter Geoghegan-4
On Mon, Jun 18, 2018 at 7:57 AM, Claudio Freire <[hidden email]> wrote:
> Way back when I was dabbling in this kind of endeavor, my main idea to
> counteract that, and possibly improve performance overall, was a
> microvacuum kind of thing that would do some on-demand cleanup to
> remove duplicates or make room before page splits. Since nbtree
> uniqueification enables efficient retail deletions, that could end up
> as a net win.

That sounds like a mechanism that works a bit like
_bt_vacuum_one_page(), which we run at the last second before a page
split. We do this to see if a page split that looks necessary can
actually be avoided.
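
In outline, that last-second step looks something like this simplified
sketch (toy bookkeeping only, not the real _bt_vacuum_one_page() or page
split code):

#include <stdbool.h>

/* Toy leaf page: only the bookkeeping needed to show the control flow. */
typedef struct ToyLeafPage
{
    int free_bytes;         /* unused space on the page */
    int lp_dead_bytes;      /* space held by tuples already marked LP_DEAD */
} ToyLeafPage;

/* Stand-in for _bt_vacuum_one_page(): reclaim LP_DEAD space in place. */
static void
toy_vacuum_one_page(ToyLeafPage *page)
{
    page->free_bytes += page->lp_dead_bytes;
    page->lp_dead_bytes = 0;
}

/* Returns true only if inserting newsize bytes still forces a page split. */
static bool
toy_insert_needs_split(ToyLeafPage *page, int newsize)
{
    if (page->free_bytes < newsize)
        toy_vacuum_one_page(page);  /* last-second attempt to avoid a split */
    if (page->free_bytes < newsize)
        return true;                /* the split really is unavoidable */
    page->free_bytes -= newsize;
    return false;
}

/* e.g. a page with 16 free bytes but 64 LP_DEAD bytes absorbs a 40-byte
 * insertion without splitting. */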

I imagine that retail index tuple deletion (the whole point of this
project) would be run by a VACUUM-like process that kills tuples that
are dead to everyone. Even with something like zheap, you cannot just
delete index tuples until you establish that they're truly dead. I
guess that the delete marking stuff that Robert mentioned marks tuples
as dead when the deleting transaction commits. Maybe we could justify
having _bt_vacuum_one_page() do cleanup to those tuples (i.e. check if
they're visible to anyone, and if not recycle), because we at least
know that the deleting transaction committed there. That is, they
could be recently dead or dead, and it may be worth going to the extra
trouble of checking which when we know that it's one of the two
possibilities.

--
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

Claudio Freire
On Mon, Jun 18, 2018 at 2:03 PM Peter Geoghegan <[hidden email]> wrote:

>
> On Mon, Jun 18, 2018 at 7:57 AM, Claudio Freire <[hidden email]> wrote:
> > Way back when I was dabbling in this kind of endeavor, my main idea to
> > counteract that, and possibly improve performance overall, was a
> > microvacuum kind of thing that would do some on-demand cleanup to
> > remove duplicates or make room before page splits. Since nbtree
> > uniqueification enables efficient retail deletions, that could end up
> > as a net win.
>
> That sounds like a mechanism that works a bit like
> _bt_vacuum_one_page(), which we run at the last second before a page
> split. We do this to see if a page split that looks necessary can
> actually be avoided.
>
> I imagine that retail index tuple deletion (the whole point of this
> project) would be run by a VACUUM-like process that kills tuples that
> are dead to everyone. Even with something like zheap, you cannot just
> delete index tuples until you establish that they're truly dead. I
> guess that the delete marking stuff that Robert mentioned marks tuples
> as dead when the deleting transaction commits. Maybe we could justify
> having _bt_vacuum_one_page() do cleanup to those tuples (i.e. check if
> they're visible to anyone, and if not recycle), because we at least
> know that the deleting transaction committed there. That is, they
> could be recently dead or dead, and it may be worth going to the extra
> trouble of checking which when we know that it's one of the two
> possibilities.

Yes, but currently _bt_vacuum_one_page() does local work on the pinned
page. Doing dead tuple deletion however involves reading the heap to
check visibility at the very least, and doing it on the whole page
might involve several heap fetches, so it's an order of magnitude
heavier if done naively.

But the idea is to do just that, only not naively.


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

akapila
In reply to this post by Peter Geoghegan-4
On Mon, Jun 18, 2018 at 10:33 PM, Peter Geoghegan <[hidden email]> wrote:

> On Mon, Jun 18, 2018 at 7:57 AM, Claudio Freire <[hidden email]> wrote:
>> Way back when I was dabbling in this kind of endeavor, my main idea to
>> counteract that, and possibly improve performance overall, was a
>> microvacuum kind of thing that would do some on-demand cleanup to
>> remove duplicates or make room before page splits. Since nbtree
>> uniqueification enables efficient retail deletions, that could end up
>> as a net win.
>
> That sounds like a mechanism that works a bit like
> _bt_vacuum_one_page(), which we run at the last second before a page
> split. We do this to see if a page split that looks necessary can
> actually be avoided.
>
> I imagine that retail index tuple deletion (the whole point of this
> project) would be run by a VACUUM-like process that kills tuples that
> are dead to everyone. Even with something like zheap, you cannot just
> delete index tuples until you establish that they're truly dead. I
> guess that the delete marking stuff that Robert mentioned marks tuples
> as dead when the deleting transaction commits.
>

No, I don't think that is the case because we want to perform in-place
updates for indexed-column-updates.  If we won't delete-mark the index
tuple before performing in-place update, then we will have two tuples
in the index which point to the same heap-TID.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

Peter Geoghegan-4
On Tue, Jun 19, 2018 at 4:03 AM, Amit Kapila <[hidden email]> wrote:

>> I imagine that retail index tuple deletion (the whole point of this
>> project) would be run by a VACUUM-like process that kills tuples that
>> are dead to everyone. Even with something like zheap, you cannot just
>> delete index tuples until you establish that they're truly dead. I
>> guess that the delete marking stuff that Robert mentioned marks tuples
>> as dead when the deleting transaction commits.
>>
>
> No, I don't think that is the case because we want to perform in-place
> updates for indexed-column-updates.  If we won't delete-mark the index
> tuple before performing in-place update, then we will have two tuples
> in the index which point to the same heap-TID.

How can an old MVCC snapshot that needs to find the heap tuple using
some now-obsolete key values get to the heap tuple via an index scan
if there are no index tuples that stick around until "recently dead"
heap tuples become "fully dead"? How can you avoid keeping around both
old and new index tuples at the same time?

--
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

akapila
On Tue, Jun 19, 2018 at 11:13 PM, Peter Geoghegan <[hidden email]> wrote:

> On Tue, Jun 19, 2018 at 4:03 AM, Amit Kapila <[hidden email]> wrote:
>>> I imagine that retail index tuple deletion (the whole point of this
>>> project) would be run by a VACUUM-like process that kills tuples that
>>> are dead to everyone. Even with something like zheap, you cannot just
>>> delete index tuples until you establish that they're truly dead. I
>>> guess that the delete marking stuff that Robert mentioned marks tuples
>>> as dead when the deleting transaction commits.
>>>
>>
>> No, I don't think that is the case because we want to perform in-place
>> updates for indexed-column-updates.  If we won't delete-mark the index
>> tuple before performing in-place update, then we will have two tuples
>> in the index which point to the same heap-TID.
>
> How can an old MVCC snapshot that needs to find the heap tuple using
> some now-obsolete key values get to the heap tuple via an index scan
> if there are no index tuples that stick around until "recently dead"
> heap tuples become "fully dead"? How can you avoid keeping around both
> old and new index tuples at the same time?
>

Both values will be present in the index, but the old value will be
delete-marked.  It is correct that we can't remove the value (index
tuple) from the index until it is truly dead (not visible to anyone),
but during a delete or index-update operation, we need to traverse the
index to mark the entries as delete-marked.  At this stage, I don't
want to go into too much detail about how delete-marking will happen
in zheap, and I am also not sure this thread is the right place to
discuss the details of that technology.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

Peter Geoghegan-4
On Tue, Jun 19, 2018 at 8:52 PM, Amit Kapila <[hidden email]> wrote:
> Both values will be present in the index, but the old value will be
> delete-marked.  It is correct that we can't remove the value (index
> tuple) from the index until it is truly dead (not visible to anyone),
> but during a delete or index-update operation, we need to traverse the
> index to mark the entries as delete-marked.  At this stage, I don't
> want to go into too much detail about how delete-marking will happen
> in zheap, and I am also not sure this thread is the right place to
> discuss the details of that technology.

I don't understand, but okay. I can provide feedback once a design for
delete marking is available.


--
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

Peter Geoghegan-4
In reply to this post by Peter Geoghegan-4
On Thu, Jun 14, 2018 at 11:44 AM, Peter Geoghegan <[hidden email]> wrote:

> I attach an unfinished prototype of suffix truncation, that also
> sometimes *adds* a new attribute in pivot tuples. It adds an extra
> heap TID from the leaf level when truncating away non-distinguishing
> attributes during a leaf page split, though only when it must. The
> patch also has nbtree treat heap TID as a first class part of the key
> space of the index. Claudio wrote a patch that did something similar,
> though without the suffix truncation part [2] (I haven't studied his
> patch, to be honest). My patch is actually a very indirect spin-off of
> Anastasia's covering index patch, and I want to show what I have in
> mind now, while it's still swapped into my head. I won't do any
> serious work on this project unless and until I see a way to implement
> retail index tuple deletion, which seems like a multi-year project
> that requires the buy-in of multiple senior community members. On its
> own, my patch regresses performance unacceptably in some workloads,
> probably due to interactions with kill_prior_tuple()/LP_DEAD hint
> setting, and interactions with page space management when there are
> many "duplicates" (it can still help performance in some pgbench
> workloads with non-unique indexes, though).
I attach a revised version, which is still very much of prototype
quality, but manages to solve a few of the problems that v1 had.
Andrey Lepikhov (CC'd) asked me to post any improved version I might
have for use with his retail index tuple deletion patch, so I thought
I'd post what I have.

The main development for v2 is that the sort order of the implicit
heap TID attribute is flipped. In v1, it was in "ascending" order. In
v2, comparisons of heap TIDs are inverted to make the attribute order
"descending". This has a number of advantages:

* It's almost consistent with the current behavior when there are
repeated insertions of duplicates. Currently, this tends to result in
page splits of the leftmost leaf page among pages that mostly consist
of the same duplicated value. This means that the destabilizing impact
on DROP SCHEMA ... CASCADE regression test output noted before [1] is
totally eliminated. There is now only a single trivial change to
regression test "expected" files, whereas in v1 dozens of "expected"
files had to be changed, often resulting in less useful reports for
the user.

* The performance regression I observed with various pgbench workloads
seems to have gone away, or is now within the noise range. A patch
like this one requires a lot of validation and testing, so this should
be taken with a grain of salt.
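
The flip itself is trivial. A sketch of the idea, with heap TIDs
flattened to plain integers for brevity:

#include <stdint.h>

/* v1 ordered the implicit heap TID attribute ASC; v2 simply inverts the
 * comparison so that it sorts DESC. */
static int
toy_tid_cmp_asc(uint64_t a, uint64_t b)     /* v1 behavior */
{
    return (a > b) - (a < b);
}

static int
toy_tid_cmp_desc(uint64_t a, uint64_t b)    /* v2 behavior: inverted */
{
    return -toy_tid_cmp_asc(a, b);
}

/*
 * With DESC order, a newly inserted duplicate (whose heap TID is typically
 * the highest seen so far) sorts *before* the existing duplicates, so
 * repeated insertions keep hitting the leftmost page of a run of
 * duplicates -- much closer to the pre-patch behavior that the regression
 * tests' DROP ... CASCADE ordering implicitly depends on.
 */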

I may have been too quick to give up on my original ambition of
writing a stand-alone patch that can be justified entirely on its own
merits, without being tied to some much more ambitious project like
retail index tuple deletion by VACUUM, or zheap's deletion marking. I
still haven't tried to replace the kludgey handling of unique index
enforcement, even though that would probably have a measurable
additional performance benefit. I think that this patch could become
an unambiguous win.

[1] https://postgr.es/m/CAH2-Wz=wAKwhv0PqEBFuK2_s8E60kZRMzDdyLi=-MvcM_pHN_w@...
--
Peter Geoghegan

Attachment: v2-0001-Ensure-nbtree-leaf-tuple-keys-are-always-unique.patch (86K)

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

Peter Geoghegan-4
Attached is my v3, which has some significant improvements:

* The hinting for unique index inserters within _bt_findinsertloc()
has been restored, more or less.

* Bug fix for case where left side of split comes from tuple being
inserted. We need to pass this to _bt_suffix_truncate() as the left
side of the split, which we previously failed to do. The amcheck
coverage I've added allowed me to catch this issue during a benchmark.
(I use amcheck during benchmarks to get some amount of stress-testing
in.)

* New performance optimization that allows us to descend a downlink
when its user-visible attributes have scankey-equal values. We avoid
an unnecessary move left by using a sentinel scan tid that's less than
any possible real heap TID, but still greater than minus infinity to
_bt_compare().
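
Conceptually, the comparator now distinguishes three kinds of heap TID
values. A self-contained sketch of just that ordering (it is not the
patch's actual representation, and it leaves out the ASC/DESC question
for the real TIDs themselves):

#include <stdint.h>

/* Three kinds of "heap TID" values seen by the comparator in this sketch:
 * a truncated TID in a pivot (minus infinity), the sentinel used when
 * descending with scankey-equal user-visible attributes, and a real TID. */
typedef enum { TID_MINUS_INFTY = 0, TID_SENTINEL = 1, TID_REAL = 2 } TidKind;

typedef struct ToyTid
{
    TidKind  kind;
    uint64_t value;         /* only meaningful when kind == TID_REAL */
} ToyTid;

/* Ordering: minus infinity < sentinel < every real TID.  That is what lets
 * us descend a downlink whose user-visible attributes are scankey-equal
 * without an unnecessary move left. */
static int
toy_tid_compare(ToyTid a, ToyTid b)
{
    if (a.kind != b.kind)
        return a.kind < b.kind ? -1 : 1;
    if (a.kind != TID_REAL || a.value == b.value)
        return 0;
    return a.value < b.value ? -1 : 1;
}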

I am now considering pursuing this as a project in its own right,
which can be justified without being part of some larger effort to add
retail index tuple deletion (e.g. by VACUUM). I think that I can get
it to the point of being a totally unambiguous win, if I haven't
already. So, this patch is no longer just an interesting prototype of
a new architectural direction we should take. In any case, it has far
fewer problems than v2.

Testing the performance characteristics of this patch has proven
difficult. My home server seems to show a nice win with a pgbench
workload that uses a Gaussian distribution for the pgbench_accounts
queries (script attached). That seems consistent and reproducible. My
home server has 32GB of RAM, and has a Samsung 850 EVO SSD, with a
250GB capacity. With shared_buffers set to 12GB, 80 minute runs at
scale 4800 look like this:

Master:

25 clients:
tps = 15134.223357 (excluding connections establishing)

50 clients:
tps = 13708.419887 (excluding connections establishing)

75 clients:
tps = 12951.286926 (excluding connections establishing)

90 clients:
tps = 12057.852088 (excluding connections establishing)

Patch:

25 clients:
tps = 17857.863353 (excluding connections establishing)

50 clients:
tps = 14319.514825 (excluding connections establishing)

75 clients:
tps = 14015.794005 (excluding connections establishing)

90 clients:
tps = 12495.683053 (excluding connections establishing)

I ran this twice, and got pretty consistent results each time (there
were many other benchmarks on my home server -- this was the only one
that tested this exact patch, though). Note that there was only one
pgbench initialization for each set of runs. It looks like a pretty
strong result for the patch - note that the accounts table is about
twice the size of available main memory. The server is pretty well
overloaded in every individual run.

Unfortunately, I have a hard time showing much of any improvement on a
storage-optimized AWS instance with EBS storage, with scaled up
pgbench scale and main memory. I'm using an i3.4xlarge, which has 16
vCPUs, 122 GiB RAM, and 2 SSDs in a software RAID0 configuration. It
appears to more or less make no overall difference there, for reasons
that I have yet to get to the bottom of. I conceived this AWS
benchmark as something that would have far longer run times with a
scaled-up database size. My expectation was that it would confirm the
preliminary result, but it hasn't.

Maybe the issue is that it's far harder to fill the I/O queue on this
AWS instance? Or perhaps it's related to the higher latency of EBS,
compared to the local SSD on my home server? I would welcome any ideas
about how to benchmark the patch. It doesn't necessarily have to be a
huge win for a very generic workload like the one I've tested, since
it would probably still be enough of a win for things like free space
management in secondary indexes [1]. Plus, of course, it seems likely
that we're going to eventually add retail index tuple deletion in some
form or another, which this is prerequisite to.

For a project like this, I expect an unambiguous, across the board win
from the committed patch, even if it isn't a huge win. I'm encouraged
by the fact that this is starting to look credible as a stand-alone
patch, but I have to admit that there are probably still significant
gaps in my understanding of how it affects real-world
performance. I don't have a lot of recent experience with benchmarking
workloads like this one.

[1] https://postgr.es/m/CAH2-Wzmf0fvVhU+SSZpGW4Qe9t--j_DmXdX3it5JcdB8FF2EsA@...
--
Peter Geoghegan

Attachment: dell-server.txt (738 bytes)
Attachment: tpcb.sql (750 bytes)
Attachment: v3-0001-Make-all-nbtree-index-tuples-have-unique-keys.patch (144K)

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

Andrey Lepikhov
I use the v3 version of the patch for retail index tuple deletion, and
from time to time I catch a regression test error (see attachment).
As I see in regression.diff, the problem is an unstable order of DROP
... CASCADE deletions.
Most frequently I get the error in the 'updatable views' test.
I check the nbtree invariants during all tests, and the index relations
are in a consistent state the whole time.
My hypothesis is: an unstable order of logical duplicates in the
indexes on the pg_depend relation.
But the 'updatable views' test contains no sources of instability:
concurrent insertions, updates, vacuum and so on. This fact discourages
me. Do you have any ideas about this problem?


On 18.07.2018 00:21, Peter Geoghegan wrote:

> [full quote of the v3 announcement above trimmed]
--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company

Attachment: collate_bug.tar.gz (3K)
Attachment: updatable_views.tar.gz (3K)

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

Peter Geoghegan-4
On Wed, Aug 1, 2018 at 9:48 PM, Andrey Lepikhov
<[hidden email]> wrote:

> I use the v3 version of the patch for retail index tuple deletion, and from
> time to time I catch a regression test error (see attachment).
> As I see in regression.diff, the problem is an unstable order of DROP ...
> CASCADE deletions.
> Most frequently I get the error in the 'updatable views' test.
> I check the nbtree invariants during all tests, and the index relations are
> in a consistent state the whole time.
> My hypothesis is: an unstable order of logical duplicates in the indexes on
> the pg_depend relation.
> But the 'updatable views' test contains no sources of instability:
> concurrent insertions, updates, vacuum and so on. This fact discourages me.
> Do you have any ideas about this problem?

It's caused by an implicit dependency on the order of items in an
index. See https://www.postgresql.org/message-id/20180504022601.fflymidf7eoencb2%40alvherre.pgsql.

I've been making "\set VERBOSITY terse" changes like this whenever it
happens in a new place. It seems to have finally stopped happening.
Note that this is a preexisting issue; there are already places in the
regression tests where we paper over the problem in a similar way. I
notice that it tends to happen when the machine running the regression
tests is heavily loaded.

--
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

Peter Geoghegan-4
In reply to this post by Peter Geoghegan-4
Attached is v4. I have two goals in mind for this revision, goals that
are of great significance to the project as a whole:

* Making better choices around leaf page split points, in order to
maximize suffix truncation and thereby maximize fan-out. This is
important when there are mostly-distinct index tuples on each leaf
page (i.e. most of the time). Maximizing the effectiveness of suffix
truncation needs to be weighed against the existing/main
consideration: evenly distributing space among each half of a page
split. This is tricky.

* Not regressing the logic that lets us pack leaf pages full when
there are a great many logical duplicates. That is, I still want to
get the behavior I described on the '"Write amplification" is made
worse by "getting tired" while inserting into nbtree secondary
indexes' thread [1]. This is not something that happens as a
consequence of thinking about suffix truncation specifically, and
seems like a fairly distinct thing to me. It's actually a bit similar
to the rightmost 90/10 page split case.

v4 adds significant new logic to make us do better on the first goal,
without hurting the second goal. It's easy to regress one while
focussing on the other, so I've leaned on a custom test suite
throughout development. Previous versions mostly got the first goal
wrong, but got the second goal right. For the time being, I'm
focussing on index size, on the assumption that I'll be able to
demonstrate a nice improvement in throughput or latency later. I can
get the main TPC-C order_line pkey about 7% smaller after an initial
bulk load with the new logic added to get the first goal (note that
the benefits with a fresh CREATE INDEX are close to zero). The index
is significantly smaller, even though the internal page index tuples
can themselves never be any smaller due to alignment -- this is all
about not restricting what can go on each leaf page by too much. 7% is
not as dramatic as the "get tired" case, which saw something like a
50% decrease in bloat for one pathological case, but it's still
clearly well worth having. The order_line primary key is the largest
TPC-C index, and I'm merely doing a standard bulk load to get this 7%
shrinkage. The TPC-C order_line primary key happens to be kind of
adversarial or pathological to B-Tree space management in general, but
it's still fairly realistic.

For the first goal, page splits now weigh what I've called the
"distance" between tuples, with a view to getting the most
discriminating split point -- the leaf split point that maximizes the
effectiveness of suffix truncation, within a range of acceptable split
points (acceptable from the point of view of not implying a lopsided
page split). This is based on probing IndexTuple contents naively when
deciding on a split point, without regard for the underlying
opclass/types. We mostly just use char integer comparisons to probe,
on the assumption that that's a good enough proxy for using real
insertion scankey comparisons (only actual truncation goes to those
lengths, since that's a strict matter of correctness). This distance
business might be considered a bit iffy by some, so I want to get
early feedback. This new "distance" code clearly needs more work, but
I felt that I'd gone too long without posting a new version.
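
As a toy illustration of what I mean by probing for "distance" --
byte-wise and type-oblivious, like the naive probing described above; the
real patch works on raw IndexTuple contents:

#include <stddef.h>

/*
 * Toy proxy for how "different" two keys are: the length of their common
 * byte prefix.  This ignores types, collations, NULLs, and alignment,
 * exactly like the naive probing described above.
 */
static size_t
toy_common_prefix_len(const unsigned char *lastleft,
                      const unsigned char *firstright, size_t len)
{
    size_t i = 0;

    while (i < len && lastleft[i] == firstright[i])
        i++;
    return i;
}

/*
 * Split point selection then becomes: among candidate split points whose
 * left/right free space delta is acceptable, prefer the one where the last
 * tuple on the left and the first tuple on the right share the *shortest*
 * common prefix.  That candidate is the most "distinguishing" one, and so
 * lets suffix truncation remove the most attributes from the new high key.
 */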

For the second goal, I've added a new macro that can be enabled for
debugging purposes. This has the implementation sort heap TIDs in ASC
order, rather than DESC order. This nicely demonstrates how my two
goals for v4 are fairly independent; uncommenting "#define
BTREE_ASC_HEAP_TID" will cause a huge regression with cases where many
duplicates are inserted, but won't regress things like the TPC-C
indexes. (Note that BTREE_ASC_HEAP_TID will break the regression
tests, though in a benign way can safely be ignored.)

Open items:

* Do more traditional benchmarking.

* Add pg_upgrade support.

* Simplify _bt_findsplitloc() logic.

[1] https://postgr.es/m/CAH2-Wzmf0fvVhU+SSZpGW4Qe9t--j_DmXdX3it5JcdB8FF2EsA@...
--
Peter Geoghegan

Attachment: v4-0001-Make-all-nbtree-index-tuples-have-unique-keys.patch (177K)

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

Peter Geoghegan-4
Attached is v5, which significantly simplifies the _bt_findsplitloc()
logic. It's now a great deal easier to follow. It would be helpful if
someone could do code-level review of the overhauled
_bt_findsplitloc(). That's the most important part of the patch. It
involves relatively subjective trade-offs around total effort spent
during a page split, space utilization, and avoiding "false sharing"
(I call the situation where a range of duplicate values straddle two
leaf pages unnecessarily "false sharing", since it necessitates that
subsequent index scans visit two leaf pages rather than just one,
even when that's avoidable.)

This version has slightly improved performance, especially for cases
where an index gets bloated without any garbage being generated. With
the UK land registry data [1], an index on (county, city, locality) is
shrunk by just over 18% by the new logic (I recall that it was shrunk
by ~15% in an earlier version). In concrete terms, it goes from being
1.288 GiB on master to being 1.054 GiB with v5 of the patch. This is
mostly because the patch intelligently packs together duplicate-filled
pages tightly (in particular, it avoids "getting tired"), but also
because it makes pivots less restrictive about where leaf tuples can
go. I still manage to shrink the largest TPC-C and TPC-H indexes by at
least 5% following an initial load performed by successive INSERTs.
Those are unique indexes, so the benefits are certainly not limited to
cases involving many duplicates.

3 modes
-------

My new approach is to teach _bt_findsplitloc() 3 distinct modes of
operation: Regular/default mode, many duplicates mode, and single
value mode. The higher level split code always asks for a default mode
call to _bt_findsplitloc(), so that's always where we start. For leaf
page splits, _bt_findsplitloc() will occasionally call itself
recursively in either many duplicates mode or single value mode. This
happens when the default strategy doesn't work out.

* Default mode almost does what we do already, but remembers the top n
candidate split points, sorted by the delta between left and right
post-split free space, rather than just looking for the overall lowest
delta split point.

Then, we go through a second pass over the temp array of "acceptable"
split points, that considers the needs of suffix truncation.

* Many duplicates mode is used when we fail to find a "distinguishing"
split point in regular mode, but have determined that it's possible to
get one if a new, exhaustive search is performed.

We go to great lengths to avoid having to append a heap TID to the new
left page high key -- that's what I mean by "distinguishing". We're
particularly concerned with false sharing by subsequent point lookup
index scans here.

* Single value mode is used when we see that even many duplicates mode
would be futile, as the leaf page is already *entirely* full of
logical duplicates.

Single value mode isn't exhaustive, since there is clearly nothing to
exhaustively search for. Instead, it packs together as many tuples as
possible on the right side of the split. Since heap TIDs sort in
descending order, this is very much like a "leftmost" split that tries
to free most of the space on the left side, and pack most of the page
contents on the right side. Except that it's leftmost, and in
particular is leftmost among pages full of logical duplicates (as
opposed to being leftmost/rightmost among pages on an entire level of
the tree, as with the traditional rightmost 90:10 split thing).
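
Reduced to a sketch, mode selection looks roughly like this -- the mode
names mirror the description above, but nothing else about the code is
meant to match the patch:

#include <stdbool.h>

typedef enum
{
    SPLIT_DEFAULT,          /* always tried first */
    SPLIT_MANY_DUPLICATES,  /* exhaustive search for a distinguishing point */
    SPLIT_SINGLE_VALUE      /* page is entirely one value: pack right side */
} ToySplitMode;

/* Toy summary of a leaf page's keyspace, just enough for mode selection. */
typedef struct ToyLeafSummary
{
    bool found_distinguishing;  /* default pass found a split point that
                                 * avoids a heap TID in the new high key */
    bool all_one_value;         /* every tuple has the same user-visible key */
} ToyLeafSummary;

static ToySplitMode
toy_choose_split_mode(const ToyLeafSummary *s)
{
    if (s->found_distinguishing)
        return SPLIT_DEFAULT;
    if (s->all_one_value)
        return SPLIT_SINGLE_VALUE;  /* even an exhaustive search is futile */
    return SPLIT_MANY_DUPLICATES;
}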

Other changes
-------------

* I now explicitly use fillfactor in the manner of a rightmost split
to get the single value mode behavior.

I call these types of splits (rightmost and single value mode splits)
"weighted" splits in the patch. This is much more consistent with our
existing conventions than my previous approach.

* Improved approach to inexpensively determining how effective
suffix truncation will be for a given candidate split point.

I no longer naively probe the contents of index tuples to do char
comparisons.  Instead, I use a tuple descriptor to get offsets to each
attribute in each tuple in turn, and then call datumIsEqual() to
determine if they're equal. This is almost as good as a full scan key
comparison. This actually seems to be a bit faster, and also takes
care of INCLUDE indexes without special care (no need to worry about
probing non-key attributes, and reaching a faulty conclusion about
which split point helps with suffix truncation).
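
A sketch of that attribute-by-attribute test, assuming PostgreSQL's
backend headers and glossing over NULL handling -- it is not the patch's
actual function:

#include "postgres.h"
#include "access/itup.h"
#include "utils/datum.h"

/* Count how many leading key attributes compare as binary-equal. */
static int
toy_leading_attrs_equal(IndexTuple left, IndexTuple right,
                        TupleDesc itupdesc, int nkeyatts)
{
    int     attnum;

    for (attnum = 1; attnum <= nkeyatts; attnum++)
    {
        Form_pg_attribute att = TupleDescAttr(itupdesc, attnum - 1);
        bool    lnull,
                rnull;
        Datum   ldatum = index_getattr(left, attnum, itupdesc, &lnull);
        Datum   rdatum = index_getattr(right, attnum, itupdesc, &rnull);

        if (lnull || rnull)
            break;              /* keep the sketch simple: stop at NULLs */
        if (!datumIsEqual(ldatum, rdatum, att->attbyval, att->attlen))
            break;
    }
    return attnum - 1;          /* number of equal leading key attributes */
}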

I still haven't managed to add pg_upgrade support, but that's my next
step. I am more or less happy with the substance of the patch in v5,
and feel that I can now work backwards towards figuring out the best
way to deal with on-disk compatibility. It shouldn't be too hard --
most of the effort will involve coming up with a good test suite.

[1] https://wiki.postgresql.org/wiki/Sample_Databases
--
Peter Geoghegan

Attachment: v5-0003-Allow-nbtree-to-use-ASC-heap-TID-attribute-order.patch (9K)
Attachment: v5-0002-Add-temporary-page-split-debug-LOG-instrumentatio.patch (5K)
Attachment: v5-0001-Make-nbtree-indexes-have-unique-keys-in-tuples.patch (184K)

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

Andrey Lepikhov
In reply to this post by Peter Geoghegan-4
I use the v5 version in the quick vacuum strategy and in the heap &
index cleaner (see new patches in the corresponding thread a little
later). It works fine and gives quick vacuum a 2-3% performance gain
over v3 on my 24-core test server.
Note that the interface of the _bt_moveright() and _bt_binsrch()
functions, with the combination of scankey, scantid and nextkey
parameters, is semantically overloaded.
Every time I read the code I have to spend time remembering exactly
what these functions do.
Perhaps the comments should be rewritten. For example, the
_bt_moveright() comments could include a phrase like:
nextkey=false: Traverse to the next suitable index page if the current
page does not contain the value (scan key; scan tid).

What do you think about submitting the patch to the next CF?

On 12.09.2018 23:11, Peter Geoghegan wrote:

> [full quote of the v4 announcement above trimmed]

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

Peter Geoghegan-4
On Wed, Sep 19, 2018 at 9:56 PM, Andrey Lepikhov
<[hidden email]> wrote:
> Note that the interface of the _bt_moveright() and _bt_binsrch() functions,
> with the combination of scankey, scantid and nextkey parameters, is
> semantically overloaded.
> Every time I read the code I have to spend time remembering exactly what
> these functions do.
> Perhaps the comments should be rewritten.

I think that it might be a good idea to create a "BTInsertionScankey"
struct, or similar, since keysz, nextkey, the scankey array and now
scantid are all part of that, and are all common to these 4 or so
functions. It could have a flexible array at the end, so that we still
only need a single palloc(). I'll look into that.

> What do you think about submitting the patch to the next CF?

Clearly the project that you're working on is a difficult one. It's
easy for me to understand why you might want to take an iterative
approach, with lots of prototyping. Your patch needs attention to
advance, and IMV the CF is the best way to get that attention. So, I
think that it would be fine to go submit it now.

I must admit that I didn't even notice that your patch lacked a CF
entry. Everyone has a different process, perhaps.

--
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

Peter Geoghegan-4
In reply to this post by Peter Geoghegan-4
On Wed, Sep 19, 2018 at 11:23 AM Peter Geoghegan <[hidden email]> wrote:
> 3 modes
> -------
>
> My new approach is to teach _bt_findsplitloc() 3 distinct modes of
> operation: Regular/default mode, many duplicates mode, and single
> value mode.

I think that I'll have to add a fourth mode, since I came up with
another strategy that is really effective though totally complementary
to the other 3 -- "multiple insertion point" mode. Credit goes to
Kevin Grittner for pointing out that this technique exists about 2
years ago [1]. The general idea is to pick a split point just after
the insertion point of the new item (the incoming tuple that prompted
a page split) when it looks like there are localized monotonically
increasing ranges.  This is like a rightmost 90:10 page split, except
the insertion point is not at the rightmost page on the level -- it's
rightmost within some local grouping of values.
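
In sketch form -- the heuristic flag is discussed further below, and
nothing here matches the real _bt_findsplitloc() interface:

#include <stdbool.h>

/*
 * Toy model of the proposed "multiple insertion point" mode.  A leaf page
 * is just a count of items; newitemoff is where the incoming tuple would
 * go.  When the (still hypothetical) heuristic fires, split immediately
 * after the new item -- a local analogue of the rightmost 90:10 split --
 * instead of near the middle of the page.
 */
static int
toy_choose_split_offset(int nitems, int newitemoff,
                        bool looks_locally_monotonic)
{
    if (looks_locally_monotonic)
        return newitemoff + 1;      /* new item becomes the last item on
                                     * the left half */
    return nitems / 2;              /* conventional 50:50 split */
}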

This makes the two largest TPC-C indexes *much* smaller. Previously,
they were shrunk by a little over 5% by using the new generic
strategy, a win that now seems like small potatoes. With this new
mode, TPC-C's order_line primary key, which is the largest index of
all, is ~45% smaller following a standard initial bulk load at
scalefactor 50. It shrinks from 99,085 blocks (774.10 MiB) to 55,020
blocks (429.84 MiB). It's actually slightly smaller than it would be
after a fresh REINDEX with the new strategy. We see almost as big a
win with the second largest TPC-C index, the stock table's primary key
-- it's ~40% smaller.

Here is the definition of the biggest index, the order line primary key index:

pg@tpcc[3666]=# \d order_line_pkey
     Index "public.order_line_pkey"
  Column   │  Type   │ Key? │ Definition
───────────┼─────────┼──────┼────────────
 ol_w_id   │ integer │ yes  │ ol_w_id
 ol_d_id   │ integer │ yes  │ ol_d_id
 ol_o_id   │ integer │ yes  │ ol_o_id
 ol_number │ integer │ yes  │ ol_number
primary key, btree, for table "public.order_line"

The new strategy/mode works very well because we see monotonically
increasing inserts on ol_number (an order's item number), but those
are grouped by order. It's kind of an adversarial case for our
existing implementation, and yet it seems like it's probably a fairly
common scenario in the real world.

Obviously these are very significant improvements. They really exceed
my initial expectations for the patch. TPC-C is generally considered
to be by far the most influential database benchmark of all time, and
this is something that we need to pay more attention to. My sense is
that the TPC-C benchmark is deliberately designed to almost require
that the system under test have this "multiple insertion point" B-Tree
optimization, suffix truncation, etc. This is exactly the same index
that we've seen reports of out of control bloat on when people run
TPC-C over hours or days [2].

My next task is to find heuristics to make the new page split
mode/strategy kick in when it's likely to help, but not kick in when
it isn't (when we want something close to a generic 50:50 page split).
These heuristics should look similar to what I've already done to get
cases with lots of duplicates to behave sensibly. Anyone have any
ideas on how to do this? I might end up inferring a "multiple
insertion point" case from the fact that there are multiple
pass-by-value attributes for the index, with the new/incoming tuple
having distinct-to-immediate-left-tuple attribute values for the last
column, but not the first few. It also occurs to me to consider the
fragmentation of the page as a guide, though I'm less sure about that.
I'll probably need to experiment with a variety of datasets before I
settle on something that looks good. Forcing the new strategy without
considering any of this actually works surprisingly well on cases
where you'd think it wouldn't, since a 50:50 page split is already
something of a guess about where future insertions will end up.
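
As a sketch of that first idea -- the incoming tuple matches its
immediate-left neighbor on every key attribute except the last -- with
integer-only keys assumed purely to keep it self-contained:

#include <stdbool.h>
#include <string.h>

/*
 * One possible detector for the flag used in the earlier sketch: the
 * leading key attributes are identical to the immediate-left neighbor,
 * and the last attribute is increasing.
 */
static bool
toy_looks_locally_monotonic(const int *leftneighbor, const int *newitem,
                            int nkeyatts)
{
    if (nkeyatts < 2)
        return false;
    if (memcmp(leftneighbor, newitem, (nkeyatts - 1) * sizeof(int)) != 0)
        return false;               /* leading attributes differ */
    return newitem[nkeyatts - 1] > leftneighbor[nkeyatts - 1];
}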

[1] https://postgr.es/m/CACjxUsN5fV0kV=YirXwA0S7LqoOJuy7soPtipDhUCemhgwoVFg@...
[2] https://www.commandprompt.com/blog/postgres_autovacuum_bloat_tpc-c/
--
Peter Geoghegan


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

Peter Eisentraut-6
In reply to this post by Peter Geoghegan-4
On 19/09/2018 20:23, Peter Geoghegan wrote:
> Attached is v5,

So.  I don't know much about the btree code, so don't believe anything I
say.

I was very interested in the bloat test case that you posted on
2018-07-09 and I tried to understand it more.  The current method for
inserting a duplicate value into a btree is going to the leftmost point
for that value and then move right until we find some space or we get
"tired" of searching, in which case just make some space right there.
The problem is that it's tricky to decide when to stop searching, and
there are scenarios when we stop too soon and repeatedly miss all the
good free space to the right, leading to bloat even though the index is
perhaps quite empty.

I tried playing with the getting-tired factor (it could plausibly be a
reloption), but that wasn't very successful.  You can use that to
postpone the bloat, but you won't stop it, and performance becomes terrible.
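
For reference, the behavior I'm describing boils down to something like
this toy loop; the giving-up probability is a parameter here, which is
roughly what playing with the getting-tired factor amounts to (the real
nbtree code uses a small fixed per-page probability):

#include <stdbool.h>
#include <stdlib.h>

/*
 * Toy rendering of the current "move right until we find space or get
 * tired" behavior.  Pages holding a run of duplicates are modeled as an
 * array of free space counters.
 */
static int
toy_place_duplicate(const int *free_space, int npages, int need,
                    double tired_probability)
{
    for (int i = 0; i < npages; i++)
    {
        if (free_space[i] >= need)
            return i;               /* found free space: reuse it */
        if ((double) rand() / RAND_MAX < tired_probability)
            return i;               /* got "tired": split page i instead,
                                     * even if free space exists further
                                     * right */
    }
    return npages - 1;              /* reached the end of the run */
}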

You propose to address this by appending the tid to the index key, so
each key, even if its "payload" is a duplicate value, is unique and has
a unique place, so we never have to do this "tiresome" search.  This
makes a lot of sense, and the results in the bloat test you posted are
impressive and reproducible.

I tried a silly alternative approach by placing a new duplicate key in a
random location.  This should be equivalent since tids are effectively
random.  I didn't quite get this to fully work yet, but at least it
doesn't blow up, and it gets the same regression test ordering
differences for pg_depend scans that you are trying to paper over. ;-)

As far as the code is concerned, I agree with Andrey Lepikhov that one
more abstraction layer that somehow combines the scankey and the tid or
some combination like that would be useful, instead of passing the tid
as a separate argument everywhere.

I think it might help this patch move along if it were split up a bit,
for example 1) suffix truncation, 2) tid stuff, 3) new split strategies.
 That way, it would also be easier to test out each piece separately.
For example, how much space does suffix truncation save in what
scenario, are there any performance regressions, etc.  In the last few
versions, the patches have still been growing significantly in size and
functionality, and most of the supposed benefits are not readily visible
in tests.

And of course we need to think about how to handle upgrades, but you
have already started a separate discussion about that.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
