[PROPOSAL] Effective storage of duplicates in B-tree index.

Re: [WIP] Effective storage of duplicates in B-tree index.

Anastasia Lubennikova
The new version of the patch is attached.
This version is even simpler than the previous one,
thanks to the recent btree design changes and all the feedback I received.
I consider it ready for review and testing.

[feature overview]
This patch implements the deduplication of btree non-pivot tuples on
leaf pages
in a manner similar to GIN index "posting lists".

A non-pivot posting tuple has the following format:
t_tid | t_info | key values | posting_list[]

where the t_tid and t_info fields are used to store metadata about the
tuple's posting list. The posting list is an array of ItemPointerData.
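To make the layout concrete, here is a toy, standalone C sketch of such a posting tuple. All names here (DemoItemPointer, DemoPostingTuple, demo_posting_tuple_size) are illustrative stand-ins, not the actual PostgreSQL definitions; in particular, the real tuple reuses bits of t_tid/t_info as the header rather than having dedicated fields:

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for ItemPointerData: a 6-byte heap tuple identifier
 * (block number split into two 16-bit halves, plus a line-item offset). */
typedef struct DemoItemPointer
{
    uint16_t bi_hi;   /* high 16 bits of heap block number */
    uint16_t bi_lo;   /* low 16 bits of heap block number */
    uint16_t offnum;  /* item offset within the heap page */
} DemoItemPointer;

/* A posting tuple stores the key once, followed by an array of TIDs:
 *   header (posting-list metadata) | key value | posting_list[] */
typedef struct DemoPostingTuple
{
    uint16_t nposting;               /* number of TIDs in posting_list[] */
    uint16_t flags;                  /* e.g. an "is posting tuple" bit */
    int32_t  key;                    /* one integer key, for illustration */
    DemoItemPointer posting_list[];  /* C99 flexible array member */
} DemoPostingTuple;

/* Space used by a posting tuple holding n duplicates of one key. */
static inline size_t
demo_posting_tuple_size(uint16_t n)
{
    return offsetof(DemoPostingTuple, posting_list)
           + (size_t) n * sizeof(DemoItemPointer);
}
```

Storing n duplicates then costs one key plus n TIDs instead of n full index tuples, which is where the space savings in the benchmark below come from.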

Currently, compression is applied to all indexes except system indexes,
unique indexes, and indexes with included columns.

On insertion, compression is applied not to each individual tuple, but
to the whole page before a split: if the target page is full, we try to
compress it first.
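As a hedged illustration of that lazy, page-level approach (toy code, not the patch's actual _bt_compress_one_page()): given the sorted items of a full leaf page, runs of equal keys are merged into posting groups in a single pass, and this pass only runs when the page would otherwise have to split:

```c
/* Toy leaf-page item: one key plus one heap TID (simplified to an int). */
typedef struct DemoEntry
{
    int key;
    int tid;
} DemoEntry;

/* Merge runs of equal keys on a sorted "page" into posting groups.
 * Returns the number of groups (posting tuples after compression) and
 * records each group's TID count in ntids_out.  In the patch this kind
 * of work happens lazily: only when the page is full and would
 * otherwise split. */
static int
demo_dedup_page(const DemoEntry *page, int nitems, int *ntids_out)
{
    int ngroups = 0;
    int i = 0;

    while (i < nitems)
    {
        int j = i + 1;

        /* extend the run of duplicates of page[i].key */
        while (j < nitems && page[j].key == page[i].key)
            j++;

        ntids_out[ngroups++] = j - i;   /* TIDs in this posting list */
        i = j;
    }
    return ngroups;
}
```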

[benchmark results]
idx ON tbl(c1);
index contains 10000000 integer values

i - number of distinct values in the index.
So i=1 means that all rows have the same key,
and i=10000000 means that all keys are different.

i / old size (MB) / new size (MB)
1              215             88
1000           215             90
100000         215             71
10000000       214            214

For more, see the attached diagram with test results.

[future work]
Many things can be improved in this feature.
Personally, I'd prefer to keep this patch as small as possible
and work on other improvements after the basic part is committed,
though I understand that some of these may be considered essential
for this patch to be approved.

1. Implement a split of the posting tuples on a page split.
2. Implement microvacuum of posting tuples.
3. Add a flag into pg_index, which allows enabling/disabling compression
for a particular index.
4. Implement posting list compression.

--
Anastasia Lubennikova
Postgres Professional:http://www.postgrespro.com
The Russian Postgres Company


Attachment: btree_compression_pg12_v1.patch (56K)
Attachment: btree_compression_test_result.png (22K)

Re: [WIP] Effective storage of duplicates in B-tree index.

Peter Geoghegan-4
On Thu, Jul 4, 2019 at 5:06 AM Anastasia Lubennikova
<[hidden email]> wrote:

> i - number of distinct values in the index.
> So i=1 means that all rows have the same key,
> and i=10000000 means that all keys are different.
>
> i / old size (MB) / new size (MB)
> 1            215    88
> 1000        215    90
> 100000        215    71
> 10000000    214    214
>
> For more, see the attached diagram with test results.

I tried this on my own "UK land registry" test data [1], which was
originally used for the v12 nbtree work. My test case has a low
cardinality, multi-column text index. I chose this test case because
it was convenient for me.

On v12/master, the index is 1100MB. Whereas with your patch, it ends
up being 196MB -- over 5.5x smaller!

I also tried it out with the "Mouse genome informatics" database [2],
which was already improved considerably by the v12 work on duplicates.
This is helped tremendously by your patch. It's not quite 5.5x across
the board, of course. There are 187 indexes (on 28 tables), and almost
all of the indexes are smaller. Actually, *most* of the indexes are
*much* smaller. Very often 50% smaller.

I don't have time to do an in-depth analysis of these results today,
but clearly the patch is very effective on real world data. I think
that we tend to underestimate just how common indexes with a huge
number of duplicates are.

[1] https://postgr.es/m/CAH2-Wzn_NAyK4pR0HRWO0StwHmxjP5qyu+X8vppt030XpqrO6w@...
[2] http://www.informatics.jax.org/software.shtml
--
Peter Geoghegan



Re: [WIP] Effective storage of duplicates in B-tree index.

Peter Geoghegan-4
On Thu, Jul 4, 2019 at 10:38 AM Peter Geoghegan <[hidden email]> wrote:
> I tried this on my own "UK land registry" test data [1], which was
> originally used for the v12 nbtree work. My test case has a low
> cardinality, multi-column text index. I chose this test case because
> it was convenient for me.
>
> On v12/master, the index is 1100MB. Whereas with your patch, it ends
> up being 196MB -- over 5.5x smaller!

I also see a huge and consistent space saving for TPC-H. All 9 indexes
are significantly smaller. The lineitem orderkey index is "just" 1/3
smaller, which is the smallest improvement among TPC-H indexes in my
index bloat test case. The two largest indexes after the initial bulk
load are *much* smaller: the lineitem parts supplier index is ~2.7x
smaller, while the lineitem ship date index is a massive ~4.2x
smaller. Also, the orders customer key index is ~2.8x smaller, and the
order date index is ~2.43x smaller. Note that the test involved retail
insertions, not CREATE INDEX.

I haven't seen any regression in the size of any index so far,
including when the number of internal pages is all that we measure.
Actually, there seem to be cases where there is a noticeably larger
reduction in internal pages than in leaf pages, probably because of
interactions with suffix truncation.

This result is very impressive. We'll need to revisit what the right
trade-off is for the compression scheme, which Heikki had some
thoughts on when we left off 3 years ago, but that should be a lot
easier now. I am very encouraged by the fact that this relatively
simple approach already works quite nicely. It's also great to see
that bulk insertions with lots of compression are very clearly faster
with this latest revision of your patch, unlike earlier versions from
2016 that made those cases slower (though I haven't tested indexes
that don't really use compression). I think that this is because you
now do the compression lazily, at the point where it looks like we may
need to split the page. Previous versions of the patch had to perform
compression eagerly, just like GIN, which is not really appropriate
for nbtree.

--
Peter Geoghegan



Re: [WIP] Effective storage of duplicates in B-tree index.

Bruce Momjian
On Thu, Jul  4, 2019 at 05:06:09PM -0700, Peter Geoghegan wrote:

> This result is very impressive. We'll need to revisit what the right
> trade-off is for the compression scheme, which Heikki had some
> thoughts on when we left off 3 years ago, but that should be a lot
> easier now. I am very encouraged by the fact that this relatively
> simple approach already works quite nicely. It's also great to see
> that bulk insertions with lots of compression are very clearly faster
> with this latest revision of your patch, unlike earlier versions from
> 2016 that made those cases slower (though I haven't tested indexes
> that don't really use compression). I think that this is because you
> now do the compression lazily, at the point where it looks like we may
> need to split the page. Previous versions of the patch had to perform
> compression eagerly, just like GIN, which is not really appropriate
> for nbtree.

I am also encouraged and am happy we can finally move this duplicate
optimization forward.

--
  Bruce Momjian  <[hidden email]>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [WIP] Effective storage of duplicates in B-tree index.

Peter Geoghegan-4
In reply to this post by Anastasia Lubennikova
On Thu, Jul 4, 2019 at 5:06 AM Anastasia Lubennikova
<[hidden email]> wrote:
> The new version of the patch is attached.
> This version is even simpler than the previous one,
> thanks to the recent btree design changes and all the feedback I received.
> I consider it ready for review and testing.

I took a closer look at this patch, and have some general thoughts on
its design, and specific feedback on the implementation.

Preserving the *logical contents* of B-Tree indexes that use
compression is very important -- that should not change in a way that
outside code can notice. The heap TID itself should count as logical
contents here, since we want to be able to implement retail index
tuple deletion in the future. Even without retail index tuple
deletion, amcheck's "rootdescend" verification assumes that it can
find one specific tuple (which could now just be one specific "logical
tuple") using specific key values from the heap, including the heap
tuple's heap TID. This requirement makes things a bit harder for your
patch, because you have to deal with one or two edge-cases that you
currently don't handle: insertion of new duplicates that fall inside
the min/max range of some existing posting list. That should be rare
enough in practice, so the performance penalty won't be too bad. This
probably means that code within _bt_findinsertloc() and/or
_bt_binsrch_insert() will need to think about a logical tuple as a
distinct thing from a physical tuple, though that won't be necessary
in most places.
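The edge case described above can be sketched in toy C (illustrative only; the real work would live around _bt_binsrch_insert(), and real code compares full ItemPointers rather than ints): when a new duplicate's heap TID lands strictly inside an existing posting list's min/max range, the insertion code must locate a position inside that list rather than simply appending a sibling tuple:

```c
#include <stdbool.h>

/* Lower-bound binary search over a posting list's sorted TIDs
 * (simplified to ints).  Returns the index at which newtid belongs. */
static int
demo_posting_insert_pos(const int *tids, int ntids, int newtid)
{
    int lo = 0;
    int hi = ntids;

    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;

        if (tids[mid] < newtid)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* True when the new duplicate falls strictly inside the posting list's
 * min/max TID range, i.e. the case that forces the list to be expanded
 * or split instead of a new tuple being placed before or after it. */
static bool
demo_tid_inside_posting_range(const int *tids, int ntids, int newtid)
{
    return ntids > 0 && newtid > tids[0] && newtid < tids[ntids - 1];
}
```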

The need to "preserve the logical contents" also means that the patch
will need to recognize when indexes are not safe as a target for
compression/deduplication (maybe we should call this feature
deduplication, so it's clear how it differs from TOAST?). For
example, if we have a case-insensitive ICU collation, then it is not
okay to treat an opclass-equal pair of text strings that use the
collation as having the same value when considering merging the two
into one. You don't actually do that in the patch, but you also don't
try to deal with the fact that such a pair of strings are equal, and
so must have their final positions determined by the heap TID column
(deduplication within _bt_compress_one_page() must respect that).
Possibly equal-but-distinct values seems like a problem that's not
worth truly fixing, but it will be necessary to store metadata about
whether or not we're willing to do deduplication in the meta page,
based on operator class and collation details. That seems like a
restriction that we're just going to have to accept, though I'm not
too worried about exactly what that will look like right now. We can
work it out later.

I think that the need to be careful about the logical contents of
indexes already causes bugs, even with "safe for compression" indexes.
For example, I can sometimes see an assertion failure
within _bt_truncate(), at the point where we check if heap TID values
are safe:

    /*
     * Lehman and Yao require that the downlink to the right page, which is to
     * be inserted into the parent page in the second phase of a page split be
     * a strict lower bound on items on the right page, and a non-strict upper
     * bound for items on the left page.  Assert that heap TIDs follow these
     * invariants, since a heap TID value is apparently needed as a
     * tiebreaker.
     */
#ifndef DEBUG_NO_TRUNCATE
    Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
                              BTreeTupleGetMinTID(firstright)) < 0);
...

This bug is not that easy to see, but it will happen with a big index,
even without updates or deletes. I think that this happens because
compression can allow the "logical tuples" to be in the wrong heap TID
order when there are multiple posting lists for the same value. As I
said, I think that it's necessary to see a posting list as being
comprised of multiple logical tuples in the context of inserting new
tuples, even when you're not performing compression or splitting the
page. I also see that amcheck's bt_index_parent_check() function
fails, though bt_index_check() does not fail when I don't use any of
its extra verification options. (You haven't updated amcheck, but I
don't think that you need to update it for these basic checks to
continue to work.)
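The invariant that the assertion enforces can be stated with a toy check (illustrative only; real code compares full ItemPointers, not ints): among tuples with equal keys, each posting list's maximum TID must sort strictly before the next list's minimum TID, so that the "logical tuples" stay in heap TID order:

```c
#include <stdbool.h>

/* For a sequence of same-key posting lists, given each list's minimum
 * and maximum heap TID (simplified to ints), verify that the TID ranges
 * are disjoint and ascending: maxs[i] < mins[i + 1].  Overlapping
 * ranges mean the logical tuples are out of heap TID order, which is
 * what trips the _bt_truncate() assertion at a page split. */
static bool
demo_posting_ranges_ok(const int *mins, const int *maxs, int nlists)
{
    for (int i = 0; i + 1 < nlists; i++)
    {
        if (maxs[i] >= mins[i + 1])
            return false;
    }
    return true;
}
```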

Other feedback on specific things:

* A good way to assess whether or not the "logical tuple versus
physical tuple" thing works is to make sure that amcheck's
"rootdescend" verification works with a variety of indexes. As I said,
it has the same requirements for nbtree as retail index tuple deletion
will.

* _bt_findinsertloc() should not call _bt_compress_one_page() for
!heapkeyspace (version 3) indexes -- the second call to
_bt_compress_one_page() should be removed.

* Why can't compression be used on system catalog indexes? I
understand that they are not a compelling case, but we tend to do
things the same way with catalog tables and indexes unless there is a
very good reason not to (e.g. HOT, suffix truncation). I see that the
tests fail when that restriction is removed, but I don't think that
that has anything to do with system catalogs. I think that that's due
to a bug somewhere else. Why have this restriction at all?

* It looks like we could be less conservative in nbtsplitloc.c to good
effect. We know for sure that a posting list will be truncated down to
one heap TID even in the worst case, so we can safely assume that the
new high key will be a lot smaller than the firstright tuple that it
is based on when it has a posting list. We only have to keep one TID.
This will allow us to leave more tuples on the left half of the page
in certain cases, further improving space utilization.

* Don't you need to update nbtdesc.c?

* Maybe we could do compression with unique indexes when inserting
values with NULLs? Note that we now treat an insertion of a tuple with
NULLs into a unique index as if it wasn't even a unique index -- see
the "checkingunique" optimization at the beginning of _bt_doinsert().
Having many NULL values in a unique index is probably fairly common.

* It looks like amcheck's heapallindexed verification needs to have
normalization added, to avoid false positives. This situation is
specifically anticipated by existing comments above
bt_normalize_tuple(). Again, being careful about "logical versus
physical tuple" seems necessary.

* Doesn't the nbtsearch.c/_bt_readpage() code that deals with
backwards scans need to return posting lists backwards, not forwards?
It seems like a good idea to try to "preserve the logical contents"
here too, just to be conservative.

Within nbtsort.c:

* Is the new code in _bt_buildadd() actually needed? If so, why?

* insert_itupprev_to_page_buildadd() is only called within nbtsort.c,
and so should be static. The name also seems very long.

* add_item_to_posting() is called within both nbtsort.c and
nbtinsert.c, and so should remain non-static, but should have a less
generic (and shorter) name.  (Use the usual _bt_* style instead.)

* Is nbtsort.c the right place for these functions, anyway? (Maybe,
but maybe not, IMV.)
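On the backwards-scan point above, the intended behavior can be sketched as follows (toy code; the real change would sit in _bt_readpage()'s handling of posting lists): TIDs are emitted ascending for a forward scan and descending for a backward scan, preserving heap TID order in both directions:

```c
#include <stdbool.h>

/* Emit one posting list's TIDs (simplified to ints) in scan order:
 * ascending for a forward scan, descending for a backward scan. */
static void
demo_emit_posting(const int *tids, int ntids, bool backward, int *out)
{
    for (int i = 0; i < ntids; i++)
        out[i] = backward ? tids[ntids - 1 - i] : tids[i];
}
```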

I ran pgindent on the patch, and made some small manual whitespace
adjustments, which is attached. There are no real changes, but some of
the formatting in the original version you posted was hard to read.
Please work off this for your next revision.

--
Peter Geoghegan

Attachment: 0001-btree_compression_pg12_v1.patch-with-pg_indent-run.patch (75K)

Re: [WIP] Effective storage of duplicates in B-tree index.

Peter Geoghegan-4
On Sat, Jul 6, 2019 at 4:08 PM Peter Geoghegan <[hidden email]> wrote:
> I took a closer look at this patch, and have some general thoughts on
> its design, and specific feedback on the implementation.

I have some high level concerns about how the patch might increase
contention, which could make queries slower. Apparently that is a real
problem in other systems that use MVCC when their bitmap index feature
is used -- they are never really supposed to be used with OLTP apps.
This patch makes nbtree behave rather a lot like a bitmap index.
That's not exactly true, because you're not storing a bitmap or
compressing the TID lists, but they're definitely quite similar. It's
easy to imagine a hybrid approach, that starts with a B-Tree with
deduplication/TID lists, and eventually becomes a bitmap index as more
duplicates are added [1].

It doesn't seem like it would be practical for these other MVCC
database systems to have standard B-Tree secondary indexes that
compress duplicates gracefully in the way that you propose to with
this patch, because lock contention would presumably be a big problem
for the same reason as it is with their bitmap indexes (whatever the
true reason actually is). Is it really possible to have something
that's adaptive, offering the best of both worlds?

Having dug into it some more, I think that the answer for us might
actually be "yes, we can have it both ways". Other database systems
that are also based on MVCC still probably use a limited form of index
locking, even in READ COMMITTED mode, though this isn't very widely
known. They need this for unique indexes, but they also need it for
transaction rollback, to remove old entries from the index when the
transaction must abort. The section "6.7 Standard Practice" from the
paper "Architecture of a Database System" [2] goes into this, saying:

"All production databases today support ACID transactions. As a rule,
they use write-ahead logging for durability, and two-phase locking for
concurrency control. An exception is PostgreSQL, which uses
multiversion concurrency control throughout."

I suggest reading "6.7 Standard Practice" in full.

Anyway, I think that *hundreds* or even *thousands* of rows are
effectively locked all at once when a bitmap index needs to be updated
in these other systems -- and I mean a heavyweight lock that lasts
until the xact commits or aborts, like a Postgres row lock. As I said,
this is necessary simply because the transaction might need to roll
back. Of course, your patch never needs to do anything like that --
the only risk is that buffer lock contention will be increased. Maybe
VACUUM isn't so bad after all!

Doing deduplication adaptively and automatically in nbtree seems like
it might play to the strengths of Postgres, while also ameliorating
its weaknesses. As the same paper goes on to say, it's actually quite
unusual that PostgreSQL has *transactional* full text search built in
(using GIN), and offers transactional, high concurrency spatial
indexing (using GiST). Actually, this is an additional advantages of
our "pure" approach to MVCC -- we can add new high concurrency,
transactional access methods relatively easily.


[1] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.98.3159&rep=rep1&type=pdf
[2] http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf
--
Peter Geoghegan



Re: [WIP] Effective storage of duplicates in B-tree index.

Bruce Momjian
On Wed, Jul 10, 2019 at 09:53:04PM -0700, Peter Geoghegan wrote:

> Anyway, I think that *hundreds* or even *thousands* of rows are
> effectively locked all at once when a bitmap index needs to be updated
> in these other systems -- and I mean a heavyweight lock that lasts
> until the xact commits or aborts, like a Postgres row lock. As I said,
> this is necessary simply because the transaction might need to roll
> back. Of course, your patch never needs to do anything like that --
> the only risk is that buffer lock contention will be increased. Maybe
> VACUUM isn't so bad after all!
>
> Doing deduplication adaptively and automatically in nbtree seems like
> it might play to the strengths of Postgres, while also ameliorating
> its weaknesses. As the same paper goes on to say, it's actually quite
> unusual that PostgreSQL has *transactional* full text search built in
> (using GIN), and offers transactional, high concurrency spatial
> indexing (using GiST). Actually, this is an additional advantage of
> our "pure" approach to MVCC -- we can add new high concurrency,
> transactional access methods relatively easily.

Wow, I never thought of that.  The only things I know we lock until
transaction end are rows we update (against concurrent updates), and
additions to unique indexes.  By definition, indexes with many
duplicates are not unique, so that doesn't apply.

--
  Bruce Momjian  <[hidden email]>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [WIP] Effective storage of duplicates in B-tree index.

Alexander Korotkov
In reply to this post by Peter Geoghegan-4
Hi Peter,

Thank you very much for your attention to this patch. Let me comment
some points of your message.

On Sun, Jul 7, 2019 at 2:09 AM Peter Geoghegan <[hidden email]> wrote:

> On Thu, Jul 4, 2019 at 5:06 AM Anastasia Lubennikova
> <[hidden email]> wrote:
> > The new version of the patch is attached.
> > This version is even simpler than the previous one,
> > thanks to the recent btree design changes and all the feedback I received.
> > I consider it ready for review and testing.
>
> I took a closer look at this patch, and have some general thoughts on
> its design, and specific feedback on the implementation.
>
> Preserving the *logical contents* of B-Tree indexes that use
> compression is very important -- that should not change in a way that
> outside code can notice. The heap TID itself should count as logical
> contents here, since we want to be able to implement retail index
> tuple deletion in the future. Even without retail index tuple
> deletion, amcheck's "rootdescend" verification assumes that it can
> find one specific tuple (which could now just be one specific "logical
> tuple") using specific key values from the heap, including the heap
> tuple's heap TID. This requirement makes things a bit harder for your
> patch, because you have to deal with one or two edge-cases that you
> currently don't handle: insertion of new duplicates that fall inside
> the min/max range of some existing posting list. That should be rare
> enough in practice, so the performance penalty won't be too bad. This
> probably means that code within _bt_findinsertloc() and/or
> _bt_binsrch_insert() will need to think about a logical tuple as a
> distinct thing from a physical tuple, though that won't be necessary
> in most places.

Could you please elaborate more on preserving the logical contents?  I
understand it as follows: "B-Tree should have the same structure
and invariants as if each TID in a posting list were a separate tuple".
So, if we imagine each TID becoming a separate tuple, it would be the
same B-tree, which just can sometimes magically store more tuples per
page.  Is my understanding correct?  But outside code will still
notice changes as soon as it directly accesses B-tree pages (like
contrib/amcheck does).  Do you mean we need an API for accessing
logical B-tree tuples or something?

> The need to "preserve the logical contents" also means that the patch
> will need to recognize when indexes are not safe as a target for
> compression/deduplication (maybe we should call this feature
> deduplication, so it's clear how it differs from TOAST?). For
> example, if we have a case-insensitive ICU collation, then it is not
> okay to treat an opclass-equal pair of text strings that use the
> collation as having the same value when considering merging the two
> into one. You don't actually do that in the patch, but you also don't
> try to deal with the fact that such a pair of strings are equal, and
> so must have their final positions determined by the heap TID column
> (deduplication within _bt_compress_one_page() must respect that).
> Possibly equal-but-distinct values seems like a problem that's not
> worth truly fixing, but it will be necessary to store metadata about
> whether or not we're willing to do deduplication in the meta page,
> based on operator class and collation details. That seems like a
> restriction that we're just going to have to accept, though I'm not
> too worried about exactly what that will look like right now. We can
> work it out later.

I think that in order to deduplicate "equal but distinct" values we
would at least have to give up on index-only scans, because we have no
guarantee that values that are equal according to a B-tree opclass are
interchangeable for other operations and/or user output.

> I think that the need to be careful about the logical contents of
> indexes already causes bugs, even with "safe for compression" indexes.
> For example, I can sometimes see an assertion failure
> within _bt_truncate(), at the point where we check if heap TID values
> are safe:
>
>     /*
>      * Lehman and Yao require that the downlink to the right page, which is to
>      * be inserted into the parent page in the second phase of a page split be
>      * a strict lower bound on items on the right page, and a non-strict upper
>      * bound for items on the left page.  Assert that heap TIDs follow these
>      * invariants, since a heap TID value is apparently needed as a
>      * tiebreaker.
>      */
> #ifndef DEBUG_NO_TRUNCATE
>     Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
>                               BTreeTupleGetMinTID(firstright)) < 0);
> ...
>
> This bug is not that easy to see, but it will happen with a big index,
> even without updates or deletes. I think that this happens because
> compression can allow the "logical tuples" to be in the wrong heap TID
> order when there are multiple posting lists for the same value. As I
> said, I think that it's necessary to see a posting list as being
> comprised of multiple logical tuples in the context of inserting new
> tuples, even when you're not performing compression or splitting the
> page. I also see that amcheck's bt_index_parent_check() function
> fails, though bt_index_check() does not fail when I don't use any of
> its extra verification options. (You haven't updated amcheck, but I
> don't think that you need to update it for these basic checks to
> continue to work.)

Do I understand correctly that the current patch may produce posting
lists for the same value with overlapping ranges of TIDs?  If so, that's
definitely wrong.

> * Maybe we could do compression with unique indexes when inserting
> values with NULLs? Note that we now treat an insertion of a tuple with
> NULLs into a unique index as if it wasn't even a unique index -- see
> the "checkingunique" optimization at the beginning of _bt_doinsert().
> Having many NULL values in a unique index is probably fairly common.

I think unique indexes may benefit from deduplication not only because
of NULL values.  Non-HOT updates produce duplicates of non-NULL values
in unique indexes, and those duplicates can take up significant space.


------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [WIP] Effective storage of duplicates in B-tree index.

Alexander Korotkov
In reply to this post by Peter Geoghegan-4
On Thu, Jul 11, 2019 at 7:53 AM Peter Geoghegan <[hidden email]> wrote:

> Anyway, I think that *hundreds* or even *thousands* of rows are
> effectively locked all at once when a bitmap index needs to be updated
> in these other systems -- and I mean a heavyweight lock that lasts
> until the xact commits or aborts, like a Postgres row lock. As I said,
> this is necessary simply because the transaction might need to roll
> back. Of course, your patch never needs to do anything like that --
> the only risk is that buffer lock contention will be increased. Maybe
> VACUUM isn't so bad after all!
>
> Doing deduplication adaptively and automatically in nbtree seems like
> it might play to the strengths of Postgres, while also ameliorating
> its weaknesses. As the same paper goes on to say, it's actually quite
> unusual that PostgreSQL has *transactional* full text search built in
> (using GIN), and offers transactional, high concurrency spatial
> indexing (using GiST). Actually, this is an additional advantage of
> our "pure" approach to MVCC -- we can add new high concurrency,
> transactional access methods relatively easily.

Good finding, thank you!

BTW, I think deduplication could cause some small performance
degradation in particular cases, because page-level locks become more
coarse-grained once pages hold more tuples.  However, this doesn't seem
like something we should worry about much.  Providing an option to turn
deduplication off seems sufficient to me.

Regarding bitmap indexes themselves, I think our BRIN could provide
them.  However, it would be useful to have opclass parameters to make
them tunable.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Fwd: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

Rafia Sabih
In reply to this post by Peter Geoghegan-4
On Sun, 7 Jul 2019 at 01:08, Peter Geoghegan <[hidden email]> wrote:

> * Maybe we could do compression with unique indexes when inserting
> values with NULLs? Note that we now treat an insertion of a tuple with
+1

I tried this patch and found the improvements impressive. However,
when I tried it with multi-column indexes it didn't give any
improvement; is that a known limitation of the patch?
I am surprised to find that such a patch has been on the radar for
quite some years now and is not yet committed.

Going through the patch, here are a few comments from me,

 /* Add the new item into the page */
+ offnum = OffsetNumberNext(offnum);
+
+ elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d
IndexTupleSize %zu free %zu",
+ compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
+
and other such DEBUG4 statements are meant to be removed, right?
I didn't find any other such statements in this area of the code, and
there are many in this patch, so I am not sure how necessary they are.

/*
* If we have only 10 uncompressed items on the full page, it probably
* won't worth to compress them.
*/
if (maxoff - n_posting_on_page < 10)
return;

Is this a magic number...?

/*
* We do not expect to meet any DEAD items, since this function is
* called right after _bt_vacuum_one_page(). If for some reason we
* found dead item, don't compress it, to allow upcoming microvacuum
* or vacuum clean it up.
*/
if (ItemIdIsDead(itemId))
continue;

This makes me wonder about those 'some' reasons.

Caller is responsible for checking BTreeTupleIsPosting to ensure that
+ * he will get what he expects

This comment could be rephrased to be gender-neutral.

Other than that, I am curious about the plans for its backward compatibility.

--
Regards,
Rafia Sabih



Re: [WIP] Effective storage of duplicates in B-tree index.

Peter Geoghegan-4
In reply to this post by Bruce Momjian
On Thu, Jul 11, 2019 at 7:30 AM Bruce Momjian <[hidden email]> wrote:
> Wow, I never thought of that.  The only things I know we lock until
> transaction end are rows we update (against concurrent updates), and
> additions to unique indexes.  By definition, indexes with many
> duplicates are not unique, so that doesn't apply.

Right. Another advantage of their approach is that you can make
queries like this work:

UPDATE tab SET unique_col = unique_col + 1

This will not throw a unique violation error on most/all other DB
systems when the updated column (in this case "unique_col") has a
unique constraint/is the primary key. This behavior is actually
required by the SQL standard. An SQL statement is supposed to be
all-or-nothing, which Postgres doesn't quite manage here.
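The difference can be illustrated with a toy model (illustrative only; this is not how any real executor is structured): checking uniqueness row by row while applying the update trips over a transient duplicate, whereas applying the whole statement first and checking once at the end succeeds:

```c
#include <stdbool.h>

/* Statement-level check: apply every row's update, then verify the
 * unique constraint once.  UPDATE tab SET unique_col = unique_col + 1
 * succeeds, because transient mid-statement duplicates are never seen. */
static bool
demo_update_then_check(int *col, int n)
{
    for (int i = 0; i < n; i++)
        col[i] += 1;

    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (col[i] == col[j])
                return false;   /* statement rolls back */
    return true;
}

/* Row-level check: verify uniqueness immediately after each row's
 * update, roughly modeling an immediate unique check.  The first
 * updated row collides with a not-yet-updated neighbor. */
static bool
demo_update_row_by_row(int *col, int n)
{
    for (int i = 0; i < n; i++)
    {
        col[i] += 1;
        for (int j = 0; j < n; j++)
            if (j != i && col[j] == col[i])
                return false;   /* spurious unique violation */
    }
    return true;
}
```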

The section "6.6 Interdependencies of Transactional Storage" from the
paper "Architecture of a Database System" provides additional
background information (I should have suggested reading both 6.6 and
6.7 together).

--
Peter Geoghegan



Re: [WIP] Effective storage of duplicates in B-tree index.

Peter Geoghegan-4
In reply to this post by Alexander Korotkov
On Thu, Jul 11, 2019 at 8:02 AM Alexander Korotkov
<[hidden email]> wrote:
> Could you please elaborate more on preserving the logical contents?  I
> can understand it as following: "B-Tree should have the same structure
> and invariants as if each TID in posting list be a separate tuple".

That's exactly what I mean.

> So, if we imagine each TID to become separate tuple it would be the
> same B-tree, which just can magically sometimes store more tuples in
> page.  Is my understanding correct?

Yes.

> But outside code will still
> notice changes as soon as it directly accesses B-tree pages (like
> contrib/amcheck does).  Do you mean we need an API for accessing
> logical B-tree tuples or something?

Well, contrib/amcheck isn't really outside code. But amcheck's
"rootdescend" option will still need to be able to supply a heap TID
as just another column, and get back zero or one logical tuples from
the index. This is important because retail index tuple deletion needs
to be able to think about logical tuples in the same way. I also think
that it might be useful for the planner to expect to get back
duplicates in heap TID order in the future (or in reverse order in the
case of a backwards scan). Query execution and VACUUM code outside of
nbtree should be able to pretend that there is no such thing as a
posting list.
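To make the "logical tuples" idea concrete, here is a toy sketch in plain C (the struct names and layout are illustrative, not the patch's actual definitions) of how one physical posting tuple can be treated as N logical key/TID pairs by code outside nbtree:

```c
#include <stdio.h>

/* Illustrative stand-ins for ItemPointerData and a posting tuple; the
 * real patch stores an array of TIDs after the key values. */
typedef struct TidSketch { unsigned blk; unsigned off; } TidSketch;

typedef struct PostingSketch {
    int       key;          /* the duplicated key value */
    int       nposting;     /* number of TIDs in the posting list */
    TidSketch posting[8];   /* TIDs, kept in ascending order */
} PostingSketch;

/* Code outside nbtree can pretend the posting tuple is nposting
 * separate index tuples: one (key, TID) pair per posting entry. */
static int
emit_logical_tuples(const PostingSketch *tup)
{
    for (int i = 0; i < tup->nposting; i++)
        printf("key=%d tid=(%u,%u)\n",
               tup->key, tup->posting[i].blk, tup->posting[i].off);
    return tup->nposting;   /* number of logical tuples produced */
}
```

One physical tuple with three TIDs yields three logical tuples, which is exactly the equivalence that amcheck's "rootdescend" check and retail index tuple deletion need to rely on.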

The main thing that the patch is missing that is needed to "preserve
logical contents" is the ability to update/expand an *existing*
posting list due to a retail insertion of a new duplicate that happens
to be within the range of that existing posting list. This will
usually be a non-HOT update that doesn't change the value for the row
in the index -- that must change the posting list, even when there is
available space on the page without recompressing. We must still
occasionally be eager, like GIN always is, though in practice we'll
almost always add to posting lists in a lazy fashion, when it looks
like we might have to split the page -- the lazy approach seems to
perform best.

> I think in order to deduplicate "equal but distinct" values we need at
> least to give up with index only scans.  Because we have no
> restriction that equal according to B-tree opclass values are same for
> other operations and/or user output.

We can either prevent index-only scans in the case of affected
indexes, or prevent compression, or give the user a choice. I'm not
too worried about how that will work for users just yet.

> Do I understand correctly that current patch may produce posting lists
> of the same value with overlapping ranges of TIDs?  If so, it's
> definitely wrong.

Yes, it can, since the assertion fails. It looks like the assertion
itself was changed to match what I expect, so I assume that this bug
will be fixed in the next version of the patch. It fails with a fairly
big index on text for me.

> > * Maybe we could do compression with unique indexes when inserting
> > values with NULLs? Note that we now treat an insertion of a tuple with
> > NULLs into a unique index as if it wasn't even a unique index -- see
> > the "checkingunique" optimization at the beginning of _bt_doinsert().
> > Having many NULL values in a unique index is probably fairly common.
>
> I think unique indexes may benefit from deduplication not only because
> of NULL values.  Non-HOT updates produce duplicates of non-NULL values
> in unique indexes.  And those duplicates can take significant space.

I agree that we should definitely have an open mind about unique
indexes, even with non-NULL values. If we can prevent a page split by
deduplicating the contents of a unique index page, then we'll probably
win. Why not try? This will need to be tested.

--
Peter Geoghegan



Re: [WIP] Effective storage of duplicates in B-tree index.

Peter Geoghegan-4
In reply to this post by Alexander Korotkov
On Thu, Jul 11, 2019 at 8:09 AM Alexander Korotkov
<[hidden email]> wrote:
> BTW, I think deduplication could cause some small performance
> degradation in some particular cases, because page-level locks became
> more coarse grained once pages hold more tuples.  However, this
> doesn't seem like something we should much care about.  Providing an
> option to turn deduplication off looks enough for me.

There was an issue like this with my v12 work on nbtree, with the
TPC-C indexes. They were always ~40% smaller, but there was a
regression when TPC-C was used with a small number of warehouses, when
the data could easily fit in memory (which is not allowed by the TPC-C
spec, in effect). TPC-C is very write-heavy, which combined with
everything else causes this problem. I wasn't doing anything too fancy
there -- the regression seemed to happen simply because the index was
smaller, not because of the overhead of doing page splits differently
or anything like that (there were far fewer splits).

I expect there to be some regression for workloads like this. I am
willing to accept that provided it's not too noticeable, and doesn't
have an impact on other workloads. I am optimistic about it.

> Regarding bitmap indexes itself, I think our BRIN could provide them.
> However, it would be useful to have opclass parameters to make them
> tunable.

I thought that we might implement them in nbtree myself. But we don't
need to decide now.

--
Peter Geoghegan



Re: [WIP] Effective storage of duplicates in B-tree index.

Peter Geoghegan-4
In reply to this post by Rafia Sabih
On Thu, Jul 11, 2019 at 8:34 AM Rafia Sabih <[hidden email]> wrote:
> I tried this patch and found the improvements impressive. However,
> when I tried with multi-column indexes it wasn't giving any
> improvement, is it the known limitation of the patch?

It'll only deduplicate full duplicates. It works with multi-column
indexes, provided the entire set of values is duplicated -- not just a
prefix. Prefix compression is possible, but it's more complicated. It
seems to generally require the DBA to specify a prefix length,
expressed as a number of prefix columns.

> I am surprised to find that such a patch is on radar since quite some
> years now and not yet committed.

The v12 work on nbtree (making heap TID a tiebreaker column) seems to
have made the general approach a lot more effective. Compression is
performed lazily, not eagerly, which seems to work a lot better.

> + elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d
> IndexTupleSize %zu free %zu",
> + compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
> +
> and other such DEBUG4 statements are meant to be removed, right...?

I hope so too.

> /*
> * If we have only 10 uncompressed items on the full page, it probably
> * won't worth to compress them.
> */
> if (maxoff - n_posting_on_page < 10)
> return;
>
> Is this a magic number...?

I think that this should be a constant or something.

> /*
> * We do not expect to meet any DEAD items, since this function is
> * called right after _bt_vacuum_one_page(). If for some reason we
> * found dead item, don't compress it, to allow upcoming microvacuum
> * or vacuum clean it up.
> */
> if (ItemIdIsDead(itemId))
> continue;
>
> This makes me wonder about those 'some' reasons.

I think that this is just defensive. Note that _bt_vacuum_one_page()
is prepared to find no dead items, even when the BTP_HAS_GARBAGE flag
is set for the page.

> Caller is responsible for checking BTreeTupleIsPosting to ensure that
> + * he will get what he expects
>
> This can be re-framed to make the caller more gender neutral.

Agreed. I also don't like anthropomorphizing code like this.

> Other than that, I am curious about the plans for its backward compatibility.

Me too. There is something about a new version 5 in comments in
nbtree.h, but the version number isn't changed. I think that we may be
able to get away with not increasing the B-Tree version from 4 to 5,
actually. Deduplication is performed lazily when it looks like we
might have to split the page, so there isn't any expectation that
tuples will either be compressed or uncompressed in any context.

--
Peter Geoghegan



Re: [WIP] Effective storage of duplicates in B-tree index.

Peter Geoghegan-4
In reply to this post by Peter Geoghegan-4
On Thu, Jul 11, 2019 at 10:42 AM Peter Geoghegan <[hidden email]> wrote:
> > I think unique indexes may benefit from deduplication not only because
> > of NULL values.  Non-HOT updates produce duplicates of non-NULL values
> > in unique indexes.  And those duplicates can take significant space.
>
> I agree that we should definitely have an open mind about unique
> indexes, even with non-NULL values. If we can prevent a page split by
> deduplicating the contents of a unique index page, then we'll probably
> win. Why not try? This will need to be tested.

I thought about this some more. I believe that the LP_DEAD bit setting
within _bt_check_unique() is generally more important than the more
complicated kill_prior_tuple mechanism for setting LP_DEAD bits, even
though the _bt_check_unique() thing can only be used with unique
indexes. Also, I have often thought that we don't do enough to take
advantage of the special characteristics of unique indexes -- they
really are quite different. I believe that other database systems do
this in various ways. Maybe we should too.

Unique indexes are special because there can only ever be zero or one
tuples of the same value that are visible to any possible MVCC
snapshot. Within the index AM, there is little difference between an
UPDATE by a transaction and a DELETE + INSERT of the same value by a
transaction. If there are 3 or 5 duplicates within a unique index,
then there is a strong chance that VACUUM could reclaim some of them,
given the chance. It is worth going to a little effort to find out.

In a traditional serial/bigserial primary key, the key space that is
typically "owned" by the left half of a rightmost page split describes
a range of about ~366 items, with few or no gaps for other values that
didn't exist at the time of the split (i.e. the two pivot tuples on
each side cover a range that is equal to the number of items itself).
If the page ever splits again, the chance of it being due to non-HOT
updates is perhaps 100%. Maybe VACUUM just didn't get around to the
index in time, or maybe there is a long running xact, or whatever. If
we can delay page splits in indexes like this, then we could easily
prevent them from *ever* happening.

Our first line of defense against page splits within unique indexes
will probably always be LP_DEAD bits set within _bt_check_unique(),
because it costs so little -- same as today. We could also add a
second line of defense: deduplication -- same as with non-unique
indexes with the patch. But we can even add a third line of defense on
top of those two: more aggressive reclaiming of posting list space, by
going to the heap to check the visibility status of earlier posting
list entries. We can do this optimistically when there is no LP_DEAD
bit set, based on heuristics.

The high level principle here is that we can justify going to a small
amount of extra effort for the chance to avoid a page split, and maybe
even more than a small amount. Our chances of reversing the split by
merging pages later on are almost zero. The two halves of the split
will probably each get dirtied again and again in the future if we
cannot avoid it, plus we have to dirty the parent page, and the old
sibling page (to update its left link). In general, a page split is
already really expensive. We could do something like amortize the cost
of accessing the heap a second time for tuples that we won't have
considered setting the LP_DEAD bit on within _bt_check_unique() by
trying the *same* heap page a *second* time where possible (distinct
values are likely to be nearby on the same page). I think that an
approach like this could work quite well for many workloads. You only
pay a cost (visiting the heap an extra time) when it looks like you'll
get a benefit (not splitting the page).

As you know, Andres already changed nbtree to get an XID for conflict
purposes on the primary by visiting the heap a second time (see commit
558a9165e08), when we need to actually reclaim LP_DEAD space. I
anticipated that we could extend this to do more clever/eager/lazy
cleanup of additional items before that went in, which is a closely
related idea. See:

https://www.postgresql.org/message-id/flat/CAH2-Wznx8ZEuXu7BMr6cVpJ26G8OSqdVo6Lx_e3HSOOAU86YoQ%40mail.gmail.com#46ffd6f32a60e086042a117f2bfd7df7

I know that this is a bit hand-wavy; the details certainly need to be
worked out. However, it is not so different to the "ghost bit" design
that other systems use with their non-unique indexes (though this idea
applies specifically to unique indexes in our case). The main
difference is that we're going to the heap rather than to UNDO,
because that's where we store our visibility information. That doesn't
seem like such a big difference -- we are also reasonably confident
that we'll find that the TID is dead, even without LP_DEAD bits being
set, because we only do the extra stuff with unique indexes. And, we
do it lazily.

--
Peter Geoghegan



Re: [WIP] Effective storage of duplicates in B-tree index.

Anastasia Lubennikova
In reply to this post by Peter Geoghegan-4

11.07.2019 21:19, Peter Geoghegan wrote:
> On Thu, Jul 11, 2019 at 8:34 AM Rafia Sabih <[hidden email]> wrote:
Hi,
Peter, Rafia, thanks for the review. New version is attached.

>> + elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d
>> IndexTupleSize %zu free %zu",
>> + compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
>> +
>> and other such DEBUG4 statements are meant to be removed, right...?
> I hope so too.
Yes, these messages are only for debugging.
I haven't deleted them, since this is still a work in progress
and it's handy to be able to print internal details.
Maybe I should also write a patch for pageinspect.

>> /*
>> * If we have only 10 uncompressed items on the full page, it probably
>> * won't worth to compress them.
>> */
>> if (maxoff - n_posting_on_page < 10)
>> return;
>>
>> Is this a magic number...?
> I think that this should be a constant or something.
Fixed. Now this is a constant in nbtree.h. I'm not 100% sure about the
value.
When the code stabilizes, we can benchmark it and find the optimal value.

>> /*
>> * We do not expect to meet any DEAD items, since this function is
>> * called right after _bt_vacuum_one_page(). If for some reason we
>> * found dead item, don't compress it, to allow upcoming microvacuum
>> * or vacuum clean it up.
>> */
>> if (ItemIdIsDead(itemId))
>> continue;
>>
>> This makes me wonder about those 'some' reasons.
> I think that this is just defensive. Note that _bt_vacuum_one_page()
> is prepared to find no dead items, even when the BTP_HAS_GARBAGE flag
> is set for the page.
You are right, it is currently impossible to encounter dead items in this
function.
Though that could change in the future if, for example, _bt_vacuum_one_page
starts behaving lazily.
So this is just a sanity check. Maybe it's worth moving it into an Assert.

>
>> Caller is responsible for checking BTreeTupleIsPosting to ensure that
>> + * he will get what he expects
>>
>> This can be re-framed to make the caller more gender neutral.
> Agreed. I also don't like anthropomorphizing code like this.

Fixed.

>> Other than that, I am curious about the plans for its backward compatibility.
> Me too. There is something about a new version 5 in comments in
> nbtree.h, but the version number isn't changed. I think that we may be
> able to get away with not increasing the B-Tree version from 4 to 5,
> actually. Deduplication is performed lazily when it looks like we
> might have to split the page, so there isn't any expectation that
> tuples will either be compressed or uncompressed in any context.
The current implementation is backward compatible.
To distinguish posting tuples, it only adds one new flag combination,
which was never possible before. The comment about version 5 has been
deleted.
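The flag-combination trick can be sketched like this (a hedged illustration -- the bit names and values below are hypothetical stand-ins, not the patch's actual definitions):

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical flag bits; nbtree reserves bits in t_info and in the
 * offset field of t_tid for alternative tuple formats. */
#define SK_ALT_TID_MASK  0x2000  /* t_info bit: t_tid holds metadata  */
#define SK_POSTING_BIT   0x1000  /* t_tid-offset bit: posting tuple   */

typedef struct TupleHdrSketch {
    uint16_t t_tid_offset;  /* offset part of t_tid, reused for flags */
    uint16_t t_info;        /* size + flag bits                       */
} TupleHdrSketch;

/* A posting tuple sets both bits at once -- a combination that plain
 * (pre-patch) tuples never used, so existing tuples on disk are never
 * misread as posting tuples. */
static bool
tuple_is_posting(const TupleHdrSketch *h)
{
    return (h->t_info & SK_ALT_TID_MASK) != 0 &&
           (h->t_tid_offset & SK_POSTING_BIT) != 0;
}
```

Because the combination was previously impossible, no on-disk version bump is strictly required to tell old tuples from new ones.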

I also added a patch for amcheck.

There is one major issue left - preserving TID order in posting lists.
For a start, I added a sort into BTreeFormPostingTuple function.
It turned out to be not very helpful, because we cannot check this
invariant lazily.

Now I work on patching _bt_binsrch_insert() and _bt_insertonpg() to
implement
insertion into the middle of the posting list. I'll send a new version
this week.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


0001-btree_compression_pg12_v2.patch (58K) Download Attachment

Re: [WIP] Effective storage of duplicates in B-tree index.

Anastasia Lubennikova
17.07.2019 19:36, Anastasia Lubennikova:

>
> There is one major issue left - preserving TID order in posting lists.
> For a start, I added a sort into BTreeFormPostingTuple function.
> It turned out to be not very helpful, because we cannot check this
> invariant lazily.
>
> Now I work on patching _bt_binsrch_insert() and _bt_insertonpg() to
> implement
> insertion into the middle of the posting list. I'll send a new version
> this week.
Patch 0002 (must be applied on top of 0001) implements preserving of
correct TID order
inside posting list when inserting new tuples.
This version passes all regression tests including amcheck test.
I also used following script to test insertion into the posting list:

set client_min_messages to debug4;
drop table tbl;
create table tbl (i1 int, i2 int);
insert into tbl select 1, i from generate_series(0,1000) as i;
insert into tbl select 1, i from generate_series(0,1000) as i;
create index idx on tbl (i1);
delete from tbl where i2 <500;
vacuum tbl ;
insert into tbl select 1, i from generate_series(1001, 1500) as i;

The last INSERT triggers several posting-list insertions, which can be
seen in the debug messages.
I suppose it is not the final version of the patch yet,
so I left some debug messages and TODO comments to ease review.

Please, in your review, pay particular attention to usage of
BTreeTupleGetHeapTID.
For posting tuples it returns the first tid from posting list like
BTreeTupleGetMinTID,
but maybe some callers are not ready for that and want
BTreeTupleGetMaxTID instead.
Incorrect usage of these macros may cause some subtle bugs,
which are probably not covered by tests. So, please double-check it.
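A sketch of the min/max distinction (illustrative types, not the patch's macros): on a posting tuple "the heap TID" is ambiguous, and each caller has to decide which end of the posting list it actually wants.

```c
/* Illustrative stand-ins; in the patch, BTreeTupleGetHeapTID on a
 * posting tuple returns the *first* posting-list TID, i.e. the
 * BTreeTupleGetMinTID behavior. Callers that need the upper bound of
 * the tuple's TID range must ask for the max instead. */
typedef struct TidSketch { unsigned blk; unsigned off; } TidSketch;

typedef struct PostingSketch {
    int       nposting;     /* number of TIDs, always >= 1 */
    TidSketch posting[8];   /* kept in ascending TID order */
} PostingSketch;

static TidSketch
posting_min_tid(const PostingSketch *t)
{
    return t->posting[0];               /* lowest TID */
}

static TidSketch
posting_max_tid(const PostingSketch *t)
{
    return t->posting[t->nposting - 1]; /* highest TID */
}
```

A caller that silently gets the min where it needed the max would see a tuple's key space appear narrower than it really is, which is exactly the kind of subtle bug being warned about.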

Next week I'm going to check performance and try to find specific
scenarios where this
feature can lead to degradation and measure it, to understand if we need
to make this deduplication optional.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


0001-btree_compression_pg12_v2.patch (58K) Download Attachment
0002-btree_compression_pg12_v2.patch (18K) Download Attachment

Re: [WIP] Effective storage of duplicates in B-tree index.

Peter Geoghegan-4
On Fri, Jul 19, 2019 at 10:53 AM Anastasia Lubennikova
<[hidden email]> wrote:
> Patch 0002 (must be applied on top of 0001) implements preserving of
> correct TID order
> inside posting list when inserting new tuples.
> This version passes all regression tests including amcheck test.
> I also used following script to test insertion into the posting list:

Nice!

> I suppose it is not the final version of the patch yet,
> so I left some debug messages and TODO comments to ease review.

I'm fine with leaving them in. I have sometimes distributed a separate
patch with debug messages, but now that I think about it, that
probably wasn't a good use of time.

You will probably want to remove at least some of the debug messages
during performance testing. I'm thinking of code that appears in very
tight inner loops, such as the _bt_compare() code.

> Please, in your review, pay particular attention to usage of
> BTreeTupleGetHeapTID.
> For posting tuples it returns the first tid from posting list like
> BTreeTupleGetMinTID,
> but maybe some callers are not ready for that and want
> BTreeTupleGetMaxTID instead.
> Incorrect usage of these macros may cause some subtle bugs,
> which are probably not covered by tests. So, please double-check it.

One testing strategy that I plan to use for the patch is to
deliberately corrupt a compressed index in a subtle way using
pg_hexedit, and then see if amcheck detects the problem. For example,
I may swap the order of two TIDs in the middle of a posting list,
which is something that is unlikely to produce wrong answers to
queries, and won't even be detected by the "heapallindexed" check, but
is still wrong. If we can detect very subtle, adversarial corruption
like this, then we can detect any real-world problem.

Once we have confidence in amcheck's ability to detect problems with
posting lists in general, we can use it in many different contexts
without much thought. For example, we'll probably need to do long
running benchmarks to validate the performance of the patch. It's easy
to add amcheck testing at the end of each run. Every benchmark is now
also a correctness/stress test, for free.

> Next week I'm going to check performance and try to find specific
> scenarios where this
> feature can lead to degradation and measure it, to understand if we need
> to make this deduplication optional.

Sounds good, though I think it might be a bit too early to decide
whether or not it needs to be enabled by default. For one thing, the
approach to WAL-logging within _bt_compress_one_page() is probably
fairly inefficient, which may be a problem for certain workloads. It's
okay to leave it that way for now, because it is not relevant to the
core design of the patch. I'm sure that _bt_compress_one_page() can be
carefully optimized when the time comes.

My current focus is not on the raw performance itself. For now, I am
focussed on making sure that the compression works well, and that the
resulting indexes "look nice" in general. FWIW, the first few versions
of my v12 work on nbtree didn't actually make *anything* go faster. It
took a couple of months to fix the more important regressions, and a
few more months to fix all of them. I think that the work on this
patch may develop in a similar way. I am willing to accept regressions
in the unoptimized code during development because it seems likely
that you have the right idea about the data structure itself, which is
the one thing that I *really* care about. Once you get that right, the
remaining problems are very likely to either be fixable with further
work on optimizing specific code, or a price that users will mostly be
happy to pay to get the benefits.

--
Peter Geoghegan



Re: [WIP] Effective storage of duplicates in B-tree index.

Peter Geoghegan-4
On Fri, Jul 19, 2019 at 12:32 PM Peter Geoghegan <[hidden email]> wrote:
> On Fri, Jul 19, 2019 at 10:53 AM Anastasia Lubennikova
> <[hidden email]> wrote:
> > Patch 0002 (must be applied on top of 0001) implements preserving of
> > correct TID order
> > inside posting list when inserting new tuples.
> > This version passes all regression tests including amcheck test.
> > I also used following script to test insertion into the posting list:
>
> Nice!

Hmm. So, the attached test case fails amcheck verification for me with
the latest version of the patch:

$ psql -f amcheck-compress-test.sql
DROP TABLE
CREATE TABLE
CREATE INDEX
CREATE EXTENSION
INSERT 0 2001
psql:amcheck-compress-test.sql:6: ERROR:  down-link lower bound
invariant violated for index "idx_desc_nl"
DETAIL:  Parent block=3 child index tid=(2,2) parent page lsn=10/F87A3438.

Note that this test only has an INSERT statement. You have to use
bt_index_parent_check() to see the problem -- bt_index_check() will
not detect the problem.

--
Peter Geoghegan

amcheck-compress-test.sql (426 bytes) Download Attachment

Re: [WIP] Effective storage of duplicates in B-tree index.

Peter Geoghegan-4
On Fri, Jul 19, 2019 at 7:24 PM Peter Geoghegan <[hidden email]> wrote:
> Hmm. So, the attached test case fails amcheck verification for me with
> the latest version of the patch:

Attached is a revised version of your v2 that fixes this issue -- I'll
call this v3. In general, my goal for the revision was to make sure
that all of my old tests from the v12 work passed, and to make sure
that amcheck can detect almost any possible problem. I tested the
amcheck changes by corrupting random state in a test index using
pg_hexedit, then making sure that amcheck actually complained in each
case.

I also fixed one or two bugs in passing, including the bug that caused
an assertion failure in _bt_truncate(). That was down to a subtle
off-by-one issue within _bt_insertonpg_in_posting(). Overall, I didn't
make that many changes to your v2. There are probably some things
about the patch that I still don't understand, or things that I have
misunderstood.

Other changes:

* We now support system catalog indexes. There is no reason not to support them.

* Removed unnecessary code from _bt_buildadd().

* Added my own new DEBUG4 trace to _bt_insertonpg_in_posting(), which
I used to fix that bug I mentioned. I agree that we should keep the
DEBUG4 traces around until the overall design settles down. I found
the ones that you added helpful, too.

* Added quite a few new assertions. For example, we need to still
support !heapkeyspace (pre Postgres 12) nbtree indexes, but we cannot
let them use compression -- new defensive assertions were added to
make this break loudly.

* Changed the custom binary search code within _bt_compare_posting()
to look more like _bt_binsrch() and _bt_binsrch_insert(). Do you know
of any reason not to do it that way?

* Added quite a few "FIXME"/"XXX" comments at various points, to
indicate where I have general concerns that need more discussion.

* Included my own pageinspect hack to visualize the minimum TIDs in
posting lists. It's broken out into a separate patch file. The code is
very rough, but it might help someone else, so I thought I'd include
it.
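For reference, the _bt_binsrch()-style search shape mentioned above is roughly the following (a sketch with illustrative names, not the patch's code): maintain a shrinking [low, high) range and return the first posting-list entry >= the search TID.

```c
/* Illustrative TID type; ItemPointerData compares block number first,
 * offset number second. */
typedef struct TidSketch { unsigned blk; unsigned off; } TidSketch;

static int
tid_cmp(TidSketch a, TidSketch b)
{
    if (a.blk != b.blk)
        return a.blk < b.blk ? -1 : 1;
    if (a.off != b.off)
        return a.off < b.off ? -1 : 1;
    return 0;
}

/* Return the index of the first entry >= key, or n if every entry is
 * smaller -- the insertion point for a new TID in a sorted posting
 * list. Invariant: the answer always lies in [low, high]. */
static int
posting_lower_bound(const TidSketch *posting, int n, TidSketch key)
{
    int low = 0, high = n;

    while (low < high)
    {
        int mid = low + (high - low) / 2;

        if (tid_cmp(posting[mid], key) < 0)
            low = mid + 1;
        else
            high = mid;
    }
    return low;
}
```

Mirroring the _bt_binsrch()/_bt_binsrch_insert() shape keeps the loop invariant easy to audit, which matters for a search that amcheck will also depend on.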


I also have some new concerns about the code in the patch that I will
point out now (though only as something to think about a solution on
-- I am unsure myself):

* It's a bad sign that compression involves calls to PageAddItem()
that are allowed to fail (we just give up on compression when that
happens). For one thing, all existing calls to PageAddItem() in
Postgres are never expected to fail -- if they do fail we get a "can't
happen" error that suggests corruption. It was a good idea to take
this approach to get the patch to work, and to prove the general idea,
but we now need to fully work out all the details about the use of
space. This includes complicated new questions around how alignment is
supposed to work.

Alignment in nbtree is already complicated today -- you're supposed to
MAXALIGN() everything in nbtree, so that the MAXALIGN() within
bufpage.c routines cannot be different to the lp_len/IndexTupleSize()
length (note that heapam can have tuples whose lp_len isn't aligned,
so nbtree could do it differently if it proved useful). Code within
nbtsplitloc.c fully understands the space requirements for the
bufpage.c routines, and is very careful about it. (The bufpage.c
details are supposed to be totally hidden from code like
nbtsplitloc.c, but I guess that that ideal isn't quite possible in
reality. Code comments don't really explain the situation today.)

I'm not sure what it would look like for this patch to be as precise
about free space as nbtsplitloc.c already is, even though that seems
desirable (I just know that it would mean you would trust
PageAddItem() to work in all cases). The patch is different to what we
already have today in that it tries to add *less than* a single
MAXALIGN() quantum at a time in some places (when a posting list needs
to grow by one item). The devil is in the details.

* As you know, the current approach to WAL logging is very
inefficient. It's okay for now, but we'll need a fine-grained approach
for the patch to be commitable. I think that this is subtly related to
the last item (i.e. the one about alignment). I have done basic
performance tests using unlogged tables. The patch seems to either
make big INSERT queries run as fast or faster than before when
inserting into unlogged tables, which is a very good start.

* Since we can now split a posting list in two, we may also have to
reconsider BTMaxItemSize, or some similar mechanism that worries about
extreme cases where it becomes impossible to split because even two
pages are not enough to fit everything. Think of what happens when
there is a tuple with a single large datum, that gets split in two
(the tuple is split, not the page), with each half receiving its own
copy of the datum. I haven't proven to myself that this is broken, but
that may just be because I haven't spent any time on it. OTOH, maybe
you already have it right, in which case it seems like it should be
explained somewhere. Possibly in nbtree.h. This is tricky stuff.

* I agree with all of your existing TODO items -- most of them seem
very important to me.

* Do we really need to keep BTreeTupleGetHeapTID(), now that we have
BTreeTupleGetMinTID()? Can't we combine the two macros into one, so
that callers don't need to think about the pivot vs posting list thing
themselves? See the new code added to _bt_mkscankey() by v3, for
example. It now handles both cases/macros at once, in order to keep
its amcheck caller happy. amcheck's verify_nbtree.c received similar
ugly code in v3.

* We should at least experiment with applying compression when
inserting into unique indexes. Like Alexander, I think that
compression in unique indexes might work well, given how they must
work in Postgres.
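The alignment arithmetic behind the "less than a single MAXALIGN() quantum" point above can be sketched numerically (the 16-byte header+key size is a made-up example; only the MAXALIGN() definition itself matches Postgres on a typical 64-bit platform):

```c
#include <stddef.h>

#define MAXIMUM_ALIGNOF 8   /* typical 64-bit platform */
#define MAXALIGN(len) \
    (((size_t) (len) + (MAXIMUM_ALIGNOF - 1)) & \
     ~((size_t) (MAXIMUM_ALIGNOF - 1)))

#define SIZEOF_TID 6        /* sizeof(ItemPointerData) */

/* On-page space for a posting tuple with ntids TIDs, assuming a
 * hypothetical 16-byte header + key portion. Because 6-byte TIDs do
 * not evenly divide the 8-byte quantum, adding one TID only sometimes
 * grows the MAXALIGN()'d size. */
static size_t
posting_tuple_space(int ntids)
{
    return MAXALIGN(16 + ntids * SIZEOF_TID);
}
```

For example, growing from 3 to 4 TIDs (34 -> 40 raw bytes) fits inside the same 40-byte aligned size, while growing from 1 to 2 (22 -> 28 raw bytes) crosses into a new quantum -- which is why "add one TID to an existing posting list" may or may not need extra page space, and why precise free-space accounting is harder here than in nbtsplitloc.c.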

My next steps will be to study the design of the
_bt_insertonpg_in_posting() stuff some more. It seems like you already
have the right general idea there, but I would like to come up with a
way of making _bt_insertonpg_in_posting() understand how to work with
space on the page with total certainty, much like nbtsplitloc.c does
today. This should allow us to make WAL-logging more
precise/incremental.

--
Peter Geoghegan

v3-0002-DEBUG-Add-pageinspect-instrumentation.patch (10K) Download Attachment
v3-0001-Compression-deduplication-in-nbtree.patch (114K) Download Attachment