[PROPOSAL] Effective storage of duplicates in B-tree index.

Re: [WIP] Effective storage of duplicates in B-tree index.

Peter Geoghegan
On Tue, Jul 23, 2019 at 6:22 PM Peter Geoghegan <[hidden email]> wrote:
> Attached is a revised version of your v2 that fixes this issue -- I'll
> call this v3.

Remember that index that I said was 5.5x smaller with the patch
applied, following retail insertions (a single big INSERT ... SELECT
...)? Well, it's 6.5x smaller with this small additional patch applied
on top of the v3 I posted yesterday. Many of the indexes in my test
suite are about ~20% smaller *in addition to* the very big size
reductions already seen. Some are even ~30% smaller than they were with v3 of the
patch. For example, the fair use implementation of TPC-H that my test
data comes from has an index on the "orders" o_orderdate column, named
idx_orders_orderdate, which is made ~30% smaller by the addition of
this simple patch (once again, this is following a single big INSERT
... SELECT ...). This change makes idx_orders_orderdate ~3.3x smaller
than it is with master/Postgres 12, in case you were wondering.

This new patch teaches nbtsplitloc.c to subtract posting list overhead
when sizing the new high key for the left half of a candidate split
point, since we know for sure that _bt_truncate() will at least manage
to truncate away that much from the new high key, even in the worst
case. Since posting lists are often very large, this can make a big
difference. This is actually just a bugfix, not a new idea -- I merely
made nbtsplitloc.c understand how truncation works with posting lists.
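
To illustrate the accounting (this is only a sketch -- the real logic
lives in nbtsplitloc.c, and posting_list_size() is a hypothetical
helper, not an actual function):

/*
 * Sketch: when sizing the new high key for the left page, credit
 * _bt_truncate() with removing at least the posting list, since the
 * TID list never needs to appear in the new high key.
 */
Size
projected_highkey_size(IndexTuple firstright)
{
    Size        size = MAXALIGN(IndexTupleSize(firstright));

    if (BTreeTupleIsPosting(firstright))
    {
        size -= posting_list_size(firstright);  /* hypothetical helper */
        size = MAXALIGN(size);
    }
    return size;
}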

There seems to be a kind of "synergy" between the nbtsplitloc.c
handling of pages that have lots of duplicates and posting list
compression. It seems as if the former mechanism "sets up the bowling
pins", while the latter mechanism "knocks them down", which is really
cool. We should try to gain a better understanding of how that works,
because it's possible that it could be even more effective in some
cases.

--
Peter Geoghegan

Attachment: 0003-Account-for-posting-list-overhead-during-splits.patch (3K)

Re: [WIP] Effective storage of duplicates in B-tree index.

Peter Geoghegan
On Wed, Jul 24, 2019 at 3:06 PM Peter Geoghegan <[hidden email]> wrote:
> There seems to be a kind of "synergy" between the nbtsplitloc.c
> handling of pages that have lots of duplicates and posting list
> compression. It seems as if the former mechanism "sets up the bowling
> pins", while the latter mechanism "knocks them down", which is really
> cool. We should try to gain a better understanding of how that works,
> because it's possible that it could be even more effective in some
> cases.

I found another important way in which this synergy can fail to take
place, which I can fix.

By removing the BT_COMPRESS_THRESHOLD limit entirely, certain indexes
from my test suite become much smaller, while most are not affected.
These indexes were not helped too much by the patch before. For
example, the TPC-E i_t_st_id index is 50% smaller. It is entirely full
of duplicates of a single value (that's how it appears after an
initial TPC-E bulk load), as are a couple of other TPC-E indexes.
TPC-H's idx_partsupp_partkey index becomes ~18% smaller, while its
idx_lineitem_orderkey index becomes ~15% smaller.

I believe that this happened because rightmost page splits were an
inefficient case for compression. But rightmost page split heavy
indexes with lots of duplicates are not that uncommon. Think of any
index with many NULL values, for example.
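
For context, the guard in question amounts to something like this (a
sketch only -- the constant's actual value and call site are in
Anastasia's patch, not reproduced here):

/*
 * Sketch: give up on deduplication unless the page has a minimum
 * number of compressible items. A rightmost page that fills with
 * duplicates can split before the threshold is ever reached, so it
 * never gets compressed.
 */
if (n_compressible_items < BT_COMPRESS_THRESHOLD)
    return false;               /* caller falls back to a page split */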

I don't know for sure if BT_COMPRESS_THRESHOLD should be removed. I'm
not sure what the idea is behind it. My sense is that we're likely to
benefit by delaying page splits, no matter what. Though I am still
looking at it purely from a space utilization point of view, at least
for now.

--
Peter Geoghegan



Re: [WIP] Effective storage of duplicates in B-tree index.

Rafia Sabih
On Thu, 25 Jul 2019 at 05:49, Peter Geoghegan <[hidden email]> wrote:

>
> On Wed, Jul 24, 2019 at 3:06 PM Peter Geoghegan <[hidden email]> wrote:
> > There seems to be a kind of "synergy" between the nbtsplitloc.c
> > handling of pages that have lots of duplicates and posting list
> > compression. It seems as if the former mechanism "sets up the bowling
> > pins", while the latter mechanism "knocks them down", which is really
> > cool. We should try to gain a better understanding of how that works,
> > because it's possible that it could be even more effective in some
> > cases.
>
> I found another important way in which this synergy can fail to take
> place, which I can fix.
>
> By removing the BT_COMPRESS_THRESHOLD limit entirely, certain indexes
> from my test suite become much smaller, while most are not affected.
> These indexes were not helped too much by the patch before. For
> example, the TPC-E i_t_st_id index is 50% smaller. It is entirely full
> of duplicates of a single value (that's how it appears after an
> initial TPC-E bulk load), as are a couple of other TPC-E indexes.
> TPC-H's idx_partsupp_partkey index becomes ~18% smaller, while its
> idx_lineitem_orderkey index becomes ~15% smaller.
>
> I believe that this happened because rightmost page splits were an
> inefficient case for compression. But rightmost page split heavy
> indexes with lots of duplicates are not that uncommon. Think of any
> index with many NULL values, for example.
>
> I don't know for sure if BT_COMPRESS_THRESHOLD should be removed. I'm
> not sure what the idea is behind it. My sense is that we're likely to
> benefit by delaying page splits, no matter what. Though I am still
> looking at it purely from a space utilization point of view, at least
> for now.
>
Minor comment fix, pointes --> pointer. Plus, are we really splitting
in half, or just splitting into two?
/*
+ * Split posting tuple into two halves.
+ *
+ * Left tuple contains all item pointes less than the new one and
+ * right tuple contains new item pointer and all to the right.
+ *
+ * TODO Probably we can come up with more clever algorithm.
+ */
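
To make the question concrete (with hypothetical TIDs): if the posting
list holds (1,1) (1,2) (1,4) and the new item pointer is (1,3), the rule
above gives a left tuple with (1,1) (1,2) and a right tuple with (1,3)
(1,4) -- a split at the insertion point, not necessarily at the midpoint.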

Some remnants of 'he'.
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list.
+ * Caller is responsible for checking BTreeTupleIsPosting to ensure that
+ * it will get what he expects
+ */

Everything reads just fine without 'us'.
/*
+ * This field helps us to find beginning of the remaining tuples from
+ * postings which follow array of offset numbers.
+ */
--
Regards,
Rafia Sabih



Re: [WIP] Effective storage of duplicates in B-tree index.

Anastasia Lubennikova
In reply to this post by Peter Geoghegan-4
24.07.2019 4:22, Peter Geoghegan wrote:

>
> Attached is a revised version of your v2 that fixes this issue -- I'll
> call this v3. In general, my goal for the revision was to make sure
> that all of my old tests from the v12 work passed, and to make sure
> that amcheck can detect almost any possible problem. I tested the
> amcheck changes by corrupting random state in a test index using
> pg_hexedit, then making sure that amcheck actually complained in each
> case.
>
> I also fixed one or two bugs in passing, including the bug that caused
> an assertion failure in _bt_truncate(). That was down to a subtle
> off-by-one issue within _bt_insertonpg_in_posting(). Overall, I didn't
> make that many changes to your v2. There are probably some things
> about the patch that I still don't understand, or things that I have
> misunderstood.
>
Thank you for this review and fixes.
> * Changed the custom binary search code within _bt_compare_posting()
> to look more like _bt_binsrch() and _bt_binsrch_insert(). Do you know
> of any reason not to do it that way?
It's ok to update it. There was no particular reason, just my habit.
> * Added quite a few "FIXME"/"XXX" comments at various points, to
> indicate where I have general concerns that need more discussion.

+         * FIXME:  The calls to BTreeGetNthTupleOfPosting() allocate memory,

If we only need to check TIDs, we don't need BTreeGetNthTupleOfPosting(),
we can use BTreeTupleGetPostingN() instead and iterate over TIDs, not
tuples.

Fixed in version 4.
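
The shape of the fix is roughly this (a sketch; BTreeTupleGetNPosting is
assumed here to return the number of TIDs in the posting list, and
target_tid is a stand-in for whatever TID the caller is checking):

/* walk the TID array in place instead of materializing whole tuples */
for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
{
    ItemPointer tid = BTreeTupleGetPostingN(itup, i);

    if (ItemPointerCompare(tid, target_tid) == 0)
        return true;            /* found the heap TID in question */
}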

> * Included my own pageinspect hack to visualize the minimum TIDs in
> posting lists. It's broken out into a separate patch file. The code is
> very rough, but it might help someone else, so I thought I'd include
> it.
Cool, I think we should add it to the final patchset, probably as a
separate function, by analogy with tuple_data_split.

> I also have some new concerns about the code in the patch that I will
> point out now (though only as something to think about a solution on
> -- I am unsure myself):
>
> * It's a bad sign that compression involves calls to PageAddItem()
> that are allowed to fail (we just give up on compression when that
> happens). For one thing, all existing calls to PageAddItem() in
> Postgres are never expected to fail -- if they do fail we get a "can't
> happen" error that suggests corruption. It was a good idea to take
> this approach to get the patch to work, and to prove the general idea,
> but we now need to fully work out all the details about the use of
> space. This includes complicated new questions around how alignment is
> supposed to work.
The main reason to implement this gentle error handling is the fact that
deduplication could cause storage overhead, which could lead to running
out of space on the page.

First of all, it is a legacy of the previous versions, where
BTreeFormPostingTuple was not able to form a non-posting tuple even in
the case where the number of posting items is 1.

Another case that I had in mind is the situation where we have 2 tuples:
t_tid | t_info | key + t_tid | t_info | key

and the compressed result is:
t_tid | t_info | key | t_tid | t_tid

If sizeof(t_info) + sizeof(key) < sizeof(t_tid), the resulting posting
tuple can be larger. This may happen if keysize <= 4 bytes.
In that situation the original tuples must have been aligned to a size
of 16 bytes each, and the resulting tuple is at most 24 bytes
(6+2+4+6+6). So this case is also safe.
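
Spelled out (just the arithmetic from above, assuming 8-byte MAXALIGN):

/* two original tuples: t_tid(6) + t_info(2) + key(4) = 12 -> 16 each */
Size originals = 2 * MAXALIGN(6 + 2 + 4);       /* 32 bytes in total */

/* one posting tuple: 6 + 2 + 4 + 6 + 6 = 24, already a multiple of 8 */
Size posting = MAXALIGN(6 + 2 + 4 + 6 + 6);     /* 24 bytes */

/* 24 < 32, so even this worst case cannot make the page contents grow */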

I changed the DEBUG message to ERROR in v4, and it passes all regression
tests. I doubt that this covers all corner cases, so I'll try to add
more special tests.

> Alignment in nbtree is already complicated today -- you're supposed to
> MAXALIGN() everything in nbtree, so that the MAXALIGN() within
> bufpage.c routines cannot be different to the lp_len/IndexTupleSize()
> length (note that heapam can have tuples whose lp_len isn't aligned,
> so nbtree could do it differently if it proved useful). Code within
> nbtsplitloc.c fully understands the space requirements for the
> bufpage.c routines, and is very careful about it. (The bufpage.c
> details are supposed to be totally hidden from code like
> nbtsplitloc.c, but I guess that that ideal isn't quite possible in
> reality. Code comments don't really explain the situation today.)
>
> I'm not sure what it would look like for this patch to be as precise
> about free space as nbtsplitloc.c already is, even though that seems
> desirable (I just know that it would mean you would trust
> PageAddItem() to work in all cases). The patch is different to what we
> already have today in that it tries to add *less than* a single
> MAXALIGN() quantum at a time in some places (when a posting list needs
> to grow by one item). The devil is in the details.
>
> * As you know, the current approach to WAL logging is very
> inefficient. It's okay for now, but we'll need a fine-grained approach
> for the patch to be commitable. I think that this is subtly related to
> the last item (i.e. the one about alignment). I have done basic
> performance tests using unlogged tables. The patch seems to either
> make big INSERT queries run as fast or faster than before when
> inserting into unlogged tables, which is a very good start.
>
> * Since we can now split a posting list in two, we may also have to
> reconsider BTMaxItemSize, or some similar mechanism that worries about
> extreme cases where it becomes impossible to split because even two
> pages are not enough to fit everything. Think of what happens when
> there is a tuple with a single large datum, that gets split in two
> (the tuple is split, not the page), with each half receiving its own
> copy of the datum. I haven't proven to myself that this is broken, but
> that may just be because I haven't spent any time on it. OTOH, maybe
> you already have it right, in which case it seems like it should be
> explained somewhere. Possibly in nbtree.h. This is tricky stuff.
Hmm, I can't see the problem.
In the current implementation each posting tuple is smaller than
BTMaxItemSize, so no split can lead to a tuple of larger size.

> * I agree with all of your existing TODO items -- most of them seem
> very important to me.
>
> * Do we really need to keep BTreeTupleGetHeapTID(), now that we have
> BTreeTupleGetMinTID()? Can't we combine the two macros into one, so
> that callers don't need to think about the pivot vs posting list thing
> themselves? See the new code added to _bt_mkscankey() by v3, for
> example. It now handles both cases/macros at once, in order to keep
> its amcheck caller happy. amcheck's verify_nbtree.c received similar
> ugly code in v3.
No, we don't need them both. I don't mind combining them into one macro.
Actually, we never needed BTreeTupleGetMinTID(),
since its functionality is covered by BTreeTupleGetHeapTID.
On the other hand, in some cases BTreeTupleGetMinTID() looks more readable.
For example here:

Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lefttup),
                          BTreeTupleGetMinTID(righttup)) < 0);

> * We should at least experiment with applying compression when
> inserting into unique indexes. Like Alexander, I think that
> compression in unique indexes might work well, given how they must
> work in Postgres.

The main reason why I decided to avoid applying compression to unique
indexes is the performance of microvacuum. It is not applied to items
inside a posting tuple. And I expect it to be important for unique
indexes, which ideally contain only a few live values.

One more thing I want to discuss:
/*
 * We do not expect to meet any DEAD items, since this function is
 * called right after _bt_vacuum_one_page(). If for some reason we
 * found a dead item, don't compress it, to allow upcoming microvacuum
 * or vacuum to clean it up.
 */
if (ItemIdIsDead(itemId))
    continue;

In the previous review Rafia asked about "some reason".
Trying to figure out if this situation is possible, I changed this line
to Assert(!ItemIdIsDead(itemId)) in our test version, and it failed in a
performance test. Unfortunately, I was not able to reproduce it.
The explanation I see is that the page had DEAD items, but for some
reason BTP_HAS_GARBAGE was not set, so _bt_vacuum_one_page() was not
called. I find it difficult to understand what could lead to this
situation, so we probably need to inspect it closer to exclude the
possibility of a bug.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment: v4-0001-Compression-deduplication-in-nbtree.patch (84K)

Re: [WIP] Effective storage of duplicates in B-tree index.

Peter Geoghegan
On Wed, Jul 31, 2019 at 9:23 AM Anastasia Lubennikova
<[hidden email]> wrote:
> > * Included my own pageinspect hack to visualize the minimum TIDs in
> > posting lists. It's broken out into a separate patch file. The code is
> > very rough, but it might help someone else, so I thought I'd include
> > it.
> Cool, I think we should add it to the final patchset,
> probably, as separate function by analogy with tuple_data_split.

Good idea.

Attached is v5, which is based on your v4. The three main differences
between this and v4 are:

* Removed BT_COMPRESS_THRESHOLD stuff, for the reasons explained in my
July 24 e-mail. We can always add something like this back during
performance validation of the patch. Right now, having no
BT_COMPRESS_THRESHOLD limit definitely improves space utilization for
certain important cases, which seems more important than the
uncertain/speculative downside.

* We now have experimental support for unique indexes. This is broken
out into its own patch.

* We now handle LP_DEAD items in a special way within
_bt_insertonpg_in_posting().

As you pointed out already, we do need to think about LP_DEAD items
directly, rather than assuming that they cannot be on the page that
_bt_insertonpg_in_posting() must process. More on that later.

> If sizeof(t_info) + sizeof(key) < sizeof(t_tid), the resulting posting
> tuple can be larger. This may happen if keysize <= 4 bytes.
> In that situation the original tuples must have been aligned to a size
> of 16 bytes each, and the resulting tuple is at most 24 bytes
> (6+2+4+6+6). So this case is also safe.

I still need to think about the exact details of alignment within
_bt_insertonpg_in_posting(). I'm worried about boundary cases there. I
could be wrong.

> I changed the DEBUG message to ERROR in v4, and it passes all
> regression tests. I doubt that this covers all corner cases, so I'll
> try to add more special tests.

It also passes my tests, FWIW.

> Hmm, I can't see the problem.
> In the current implementation each posting tuple is smaller than
> BTMaxItemSize, so no split can lead to a tuple of larger size.

That sounds correct, then.

> No, we don't need them both. I don't mind combining them into one macro.
> Actually, we never needed BTreeTupleGetMinTID(),
> since its functionality is covered by BTreeTupleGetHeapTID.

I've removed BTreeTupleGetMinTID() in v5. I think it's fine to just
have a comment next to BTreeTupleGetHeapTID(), and another comment
next to BTreeTupleGetMaxTID().

> The main reason why I decided to avoid applying compression to unique
> indexes is the performance of microvacuum. It is not applied to items
> inside a posting tuple. And I expect it to be important for unique
> indexes, which ideally contain only a few live values.

I found that the performance of my experimental patch with unique
index was significantly worse. It looks like this is a bad idea, as
you predicted, though we may still want to do
deduplication/compression with NULL values in unique indexes. I did
learn a few things from implementing unique index support, though.

BTW, there is a subtle bug in how my unique index patch does
WAL-logging -- see my comments within
index_compute_xid_horizon_for_tuples(). The bug shouldn't matter if
replication isn't used. I don't think that we're going to use this
experimental patch at all, so I didn't bother fixing the bug.

> if (ItemIdIsDead(itemId))
> continue;
>
> In the previous review Rafia asked about "some reason".
> Trying to figure out if this situation is possible, I changed this line
> to Assert(!ItemIdIsDead(itemId)) in our test version, and it failed in
> a performance test. Unfortunately, I was not able to reproduce it.

I found it easy enough to see LP_DEAD items within
_bt_insertonpg_in_posting() when running pgbench with the extra unique
index patch. To give you a simple example of how this can happen,
consider the comments about BTP_HAS_GARBAGE within
_bt_delitems_vacuum(). That probably isn't the only way it can happen,
either. ISTM that we need to be prepared for LP_DEAD items during
deduplication, rather than trying to prevent deduplication from ever
having to see an LP_DEAD item.

v5 makes _bt_insertonpg_in_posting() prepared to overwrite an
existing item if it's an LP_DEAD item that falls in the same TID range
(that's _bt_compare()-wise "equal" to an existing tuple, which may or
may not be a posting list tuple already). I haven't made this code do
something like call index_compute_xid_horizon_for_tuples(), even
though that's needed for correctness (i.e. this new code is currently
broken in the same way that I mentioned unique index support is
broken). I also added a nearby FIXME comment to
_bt_insertonpg_in_posting() -- I don't think that the code for
splitting a posting list in two is currently crash-safe.

How do you feel about officially calling this deduplication, not
compression? I think that it's a more accurate name for the technique.
--
Peter Geoghegan

Attachment: v5-0001-Compression-deduplication-in-nbtree.patch (115K)
Attachment: v5-0003-DEBUG-Add-pageinspect-instrumentation.patch (10K)
Attachment: v5-0002-Experimental-support-for-unique-indexes.patch (13K)

Re: [WIP] Effective storage of duplicates in B-tree index.

Anastasia Lubennikova
06.08.2019 4:28, Peter Geoghegan wrote:
> Attached is v5, which is based on your v4. The three main differences
> between this and v4 are:
>
> * Removed BT_COMPRESS_THRESHOLD stuff, for the reasons explained in my
> July 24 e-mail. We can always add something like this back during
> performance validation of the patch. Right now, having no
> BT_COMPRESS_THRESHOLD limit definitely improves space utilization for
> certain important cases, which seems more important than the
> uncertain/speculative downside.
Fair enough.
I think we can measure performance and make a decision once the patch
stabilizes.

> * We now have experimental support for unique indexes. This is broken
> out into its own patch.
>
> * We now handle LP_DEAD items in a special way within
> _bt_insertonpg_in_posting().
>
> As you pointed out already, we do need to think about LP_DEAD items
> directly, rather than assuming that they cannot be on the page that
> _bt_insertonpg_in_posting() must process. More on that later.
>
>> If sizeof(t_info) + sizeof(key) < sizeof(t_tid), the resulting posting
>> tuple can be larger. This may happen if keysize <= 4 bytes.
>> In that situation the original tuples must have been aligned to a size
>> of 16 bytes each, and the resulting tuple is at most 24 bytes
>> (6+2+4+6+6). So this case is also safe.
> I still need to think about the exact details of alignment within
> _bt_insertonpg_in_posting(). I'm worried about boundary cases there. I
> could be wrong.
Could you explain more about these cases?
Right now I don't understand the problem.

>> The main reason why I decided to avoid applying compression to unique
>> indexes is the performance of microvacuum. It is not applied to items
>> inside a posting tuple. And I expect it to be important for unique
>> indexes, which ideally contain only a few live values.
> I found that the performance of my experimental patch with unique
> index was significantly worse. It looks like this is a bad idea, as
> you predicted, though we may still want to do
> deduplication/compression with NULL values in unique indexes. I did
> learn a few things from implementing unique index support, though.
>
> BTW, there is a subtle bug in how my unique index patch does
> WAL-logging -- see my comments within
> index_compute_xid_horizon_for_tuples(). The bug shouldn't matter if
> replication isn't used. I don't think that we're going to use this
> experimental patch at all, so I didn't bother fixing the bug.
Thank you for the patch.
Still, I'd suggest leaving it as a possible future improvement, so that
it doesn't distract us from the original feature.

>> if (ItemIdIsDead(itemId))
>> continue;
>>
>> In the previous review Rafia asked about "some reason".
>> Trying to figure out if this situation is possible, I changed this
>> line to Assert(!ItemIdIsDead(itemId)) in our test version, and it
>> failed in a performance test. Unfortunately, I was not able to
>> reproduce it.
> I found it easy enough to see LP_DEAD items within
> _bt_insertonpg_in_posting() when running pgbench with the extra unique
> index patch. To give you a simple example of how this can happen,
> consider the comments about BTP_HAS_GARBAGE within
> _bt_delitems_vacuum(). That probably isn't the only way it can happen,
> either. ISTM that we need to be prepared for LP_DEAD items during
> deduplication, rather than trying to prevent deduplication from ever
> having to see an LP_DEAD item.
I added to v6 another related fix for _bt_compress_one_page().
The previous code implicitly deleted DEAD items without calling
index_compute_xid_horizon_for_tuples(). The new code checks whether any
DEAD items exist on the page and removes them if so.
Another possible solution is to copy dead items as-is from the old page
to the new one, but I think it's better to remove dead tuples as fast
as possible.
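
The flow is roughly the following (a sketch only -- the array handling
is illustrative, while _bt_delitems_delete() is the existing microvacuum
entry point that computes the xid horizon for WAL):

OffsetNumber deletable[MaxIndexTuplesPerPage];
int          ndeletable = 0;

/* collect LP_DEAD items before attempting any deduplication */
for (offnum = P_FIRSTDATAKEY(opaque); offnum <= maxoff;
     offnum = OffsetNumberNext(offnum))
{
    ItemId      itemid = PageGetItemId(page, offnum);

    if (ItemIdIsDead(itemid))
        deletable[ndeletable++] = offnum;
}

if (ndeletable > 0)
    _bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);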

> v5 makes _bt_insertonpg_in_posting() prepared to overwrite an
> existing item if it's an LP_DEAD item that falls in the same TID range
> (that's _bt_compare()-wise "equal" to an existing tuple, which may or
> may not be a posting list tuple already). I haven't made this code do
> something like call index_compute_xid_horizon_for_tuples(), even
> though that's needed for correctness (i.e. this new code is currently
> broken in the same way that I mentioned unique index support is
> broken).
Is it possible that the DEAD tuple to delete was smaller than itup?
>   I also added a nearby FIXME comment to
> _bt_insertonpg_in_posting() -- I don't think that the code for
> splitting a posting list in two is currently crash-safe.
>
Good catch. It seems that I need to rearrange the code.
I'll send an updated patch this week.

> How do you feel about officially calling this deduplication, not
> compression? I think that it's a more accurate name for the technique.
I agree.
Should I rename all the related function and variable names in the patch?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment: v6-0001-Compression-deduplication-in-nbtree.patch (86K)

Re: [WIP] Effective storage of duplicates in B-tree index.

Anastasia Lubennikova
13.08.2019 18:45, Anastasia Lubennikova wrote:
> > I also added a nearby FIXME comment to
> > _bt_insertonpg_in_posting() -- I don't think that the code for
> > splitting a posting list in two is currently crash-safe.
>
> Good catch. It seems that I need to rearrange the code.
> I'll send an updated patch this week.

Attached is v7.

In this version of the patch, I heavily refactored the code for insertion
into a posting tuple. The _bt_split logic is quite complex, so I omitted
a couple of optimizations. They are mentioned in TODO comments.

Now the algorithm is the following (a rough code sketch follows the list):

- If _bt_findinsertloc() finds that the tuple belongs to an existing
posting tuple's TID interval, it sets the 'in_posting_offset' variable
and passes it to _bt_insertonpg().

- If 'in_posting_offset' is valid and origtup is valid,
merge our itup into origtup.

It can result in one tuple, neworigtup, that must replace origtup; or in
two tuples, neworigtup and newrighttup, if the result exceeds
BTMaxItemSize.

- If the new tuple(s) fit into the old page, we're lucky:
call _bt_delete_and_insert(..., neworigtup, newrighttup, newitemoff) to
atomically replace oldtup with the new tuple(s) and generate an xlog
record.

- In case a page split is needed, pass both tuples to _bt_split().
_bt_findsplitloc() is now aware of the upcoming replacement of origtup
with neworigtup, so it uses the correct item size where needed.
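
Put together as pseudocode (a sketch of the control flow only;
merge_posting() and tuples_fit_on_page() are hypothetical stand-ins,
not functions from the patch):

/* inside _bt_insertonpg(), after _bt_findinsertloc() set the offset */
if (in_posting_offset != InvalidOffsetNumber)
{
    /* merging may produce one tuple, or two if BTMaxItemSize is hit */
    neworigtup = merge_posting(origtup, itup, &newrighttup);

    if (tuples_fit_on_page(page, neworigtup, newrighttup))
        _bt_delete_and_insert(...);   /* atomic replace + one WAL record */
    else
        _bt_split(...);               /* split knows origtup is replaced */
}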

It seems that now all replace operations are crash-safe. The new patch passes
all regression tests, so I think it's ready for review again.

In the meantime, I'll run more stress-tests.
-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment: v7-0001-Compression-deduplication-in-nbtree.patch (108K)

Re: [WIP] Effective storage of duplicates in B-tree index.

Peter Geoghegan
In reply to this post by Anastasia Lubennikova
On Tue, Aug 13, 2019 at 8:45 AM Anastasia Lubennikova
<[hidden email]> wrote:
> > I still need to think about the exact details of alignment within
> > _bt_insertonpg_in_posting(). I'm worried about boundary cases there. I
> > could be wrong.
> Could you explain more about these cases?
> Now I don't understand the problem.

Maybe there is no problem.

> Thank you for the patch.
> Still, I'd suggest leaving it as a possible future improvement, so that
> it doesn't distract us from the original feature.

I don't even think that it's useful work for the future. It's just
nice to be sure that we could support unique index deduplication if it
made sense. Which it doesn't. If I didn't write the patch that
implements deduplication for unique indexes, I might still not realize
that we need the index_compute_xid_horizon_for_tuples() stuff in
certain other places. I'm not serious about it at all, except as a
learning exercise/experiment.

> I added to v6 another related fix for _bt_compress_one_page().
> The previous code implicitly deleted DEAD items without calling
> index_compute_xid_horizon_for_tuples(). The new code checks whether
> any DEAD items exist on the page and removes them if so.
> Another possible solution is to copy dead items as-is from the old
> page to the new one, but I think it's better to remove dead tuples as
> fast as possible.

I think that what you've done in v7 is probably the best way to do it.
It's certainly simple, which is appropriate given that we're not
really expecting to see LP_DEAD items within _bt_compress_one_page()
(we just need to be prepared for them).

> > v5 makes _bt_insertonpg_in_posting() prepared to overwrite an
> > existing item if it's an LP_DEAD item that falls in the same TID range
> > (that's _bt_compare()-wise "equal" to an existing tuple, which may or
> > may not be a posting list tuple already). I haven't made this code do
> > something like call index_compute_xid_horizon_for_tuples(), even
> > though that's needed for correctness (i.e. this new code is currently
> > broken in the same way that I mentioned unique index support is
> > broken).
> Is it possible that the DEAD tuple to delete was smaller than itup?

I'm not sure what you mean by this. I suppose that it doesn't matter,
since we both prefer the alternative that you came up with anyway.

> > How do you feel about officially calling this deduplication, not
> > compression? I think that it's a more accurate name for the technique.
> I agree.
> Should I rename all the related function and variable names in the patch?

Please rename them when convenient.

--
Peter Geoghegan



Re: [WIP] Effective storage of duplicates in B-tree index.

Peter Geoghegan
In reply to this post by Anastasia Lubennikova
On Fri, Aug 16, 2019 at 8:56 AM Anastasia Lubennikova
<[hidden email]> wrote:

> Now the algorithm is the following:
>
> - If _bt_findinsertloc() finds that the tuple belongs to an existing
> posting tuple's TID interval, it sets the 'in_posting_offset' variable
> and passes it to _bt_insertonpg().
>
> - If 'in_posting_offset' is valid and origtup is valid,
> merge our itup into origtup.
>
> It can result in one tuple, neworigtup, that must replace origtup; or
> in two tuples, neworigtup and newrighttup, if the result exceeds
> BTMaxItemSize.

That sounds like the right way to do it.

> - If the new tuple(s) fit into the old page, we're lucky:
> call _bt_delete_and_insert(..., neworigtup, newrighttup, newitemoff) to
> atomically replace oldtup with the new tuple(s) and generate an xlog
> record.
>
> - In case a page split is needed, pass both tuples to _bt_split().
> _bt_findsplitloc() is now aware of the upcoming replacement of origtup
> with neworigtup, so it uses the correct item size where needed.

That makes sense, since _bt_split() is responsible for both splitting
the page and inserting the new item on either the left or right page,
as part of the first phase of a page split. In other words, if you're
adding something new to _bt_insertonpg(), you probably also need to
add something new to _bt_split(). So that's what you did.

> It seems that now all replace operations are crash-safe. The new patch passes
> all regression tests, so I think it's ready for review again.

I'm looking at it now. I'm going to spend a significant amount of time
on this tomorrow.

I think that we should start to think about efficient WAL-logging now.

> In the meantime, I'll run more stress-tests.

As you probably realize, wal_consistency_checking is a good thing to
use with your tests here.

--
Peter Geoghegan

