[PROPOSAL] Effective storage of duplicates in B-tree index.


Re: [WIP] Effective storage of duplicates in B-tree index.

Peter Geoghegan-4
On Tue, Jul 23, 2019 at 6:22 PM Peter Geoghegan <[hidden email]> wrote:
> Attached is a revised version of your v2 that fixes this issue -- I'll
> call this v3.

Remember that index that I said was 5.5x smaller with the patch
applied, following retail insertions (a single big INSERT ... SELECT
...)? Well, it's 6.5x smaller with this small additional patch applied
on top of the v3 I posted yesterday. Many of the indexes in my test
suite are a further ~20% smaller, *in addition to* the already very big
size reductions. Some are even ~30% smaller than they were with v3 of the
patch. For example, the fair use implementation of TPC-H that my test
data comes from has an index on the "orders" o_orderdate column, named
idx_orders_orderdate, which is made ~30% smaller by the addition of
this simple patch (once again, this is following a single big INSERT
... SELECT ...). This change makes idx_orders_orderdate ~3.3x smaller
than it is with master/Postgres 12, in case you were wondering.

This new patch teaches nbtsplitloc.c to subtract posting list overhead
when sizing the new high key for the left half of a candidate split
point, since we know for sure that _bt_truncate() will at least manage
to truncate away that much from the new high key, even in the worst
case. Since posting lists are often very large, this can make a big
difference. This is actually just a bugfix, not a new idea -- I merely
made nbtsplitloc.c understand how truncation works with posting lists.
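
To make that concrete, here is a rough standalone sketch of the accounting
idea (this is not the actual nbtsplitloc.c change -- the function name and
the tuple sizes are made up for illustration, MAXALIGN is ignored, and the
6-byte TID size matches sizeof(ItemPointerData)):

#include <stdio.h>
#include <stddef.h>

#define TID_SIZE 6              /* sizeof(ItemPointerData) */

/*
 * Worst-case size of the new high key derived from a posting-list tuple:
 * _bt_truncate() will leave at most one heap TID behind, so at least
 * (nposting_tids - 1) * TID_SIZE bytes of posting list overhead are
 * guaranteed to be truncated away.
 */
static size_t
worst_case_highkey_size(size_t tuple_size, int nposting_tids)
{
    if (nposting_tids > 1)
        return tuple_size - (size_t) (nposting_tids - 1) * TID_SIZE;
    return tuple_size;
}

int
main(void)
{
    /* a 64-byte posting tuple with 8 TIDs leaves at most a 22-byte high key */
    printf("%zu\n", worst_case_highkey_size(64, 8));
    return 0;
}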

There seems to be a kind of "synergy" between the nbtsplitloc.c
handling of pages that have lots of duplicates and posting list
compression. It seems as if the former mechanism "sets up the bowling
pins", while the latter mechanism "knocks them down", which is really
cool. We should try to gain a better understanding of how that works,
because it's possible that it could be even more effective in some
cases.

--
Peter Geoghegan

0003-Account-for-posting-list-overhead-during-splits.patch (3K)

Re: [WIP] Effective storage of duplicates in B-tree index.

Peter Geoghegan-4
On Wed, Jul 24, 2019 at 3:06 PM Peter Geoghegan <[hidden email]> wrote:
> There seems to be a kind of "synergy" between the nbtsplitloc.c
> handling of pages that have lots of duplicates and posting list
> compression. It seems as if the former mechanism "sets up the bowling
> pins", while the latter mechanism "knocks them down", which is really
> cool. We should try to gain a better understanding of how that works,
> because it's possible that it could be even more effective in some
> cases.

I found another important way in which this synergy can fail to take
place, which I can fix.

By removing the BT_COMPRESS_THRESHOLD limit entirely, certain indexes
from my test suite become much smaller, while most are not affected.
These indexes were not helped too much by the patch before. For
example, the TPC-E i_t_st_id index is 50% smaller. It is entirely full
of duplicates of a single value (that's how it appears after an
initial TPC-E bulk load), as are a couple of other TPC-E indexes.
TPC-H's idx_partsupp_partkey index becomes ~18% smaller, while its
idx_lineitem_orderkey index becomes ~15% smaller.

I believe that this happened because rightmost page splits were an
inefficient case for compression. But rightmost page split heavy
indexes with lots of duplicates are not that uncommon. Think of any
index with many NULL values, for example.

I don't know for sure if BT_COMPRESS_THRESHOLD should be removed. I'm
not sure what the idea is behind it. My sense is that we're likely to
benefit by delaying page splits, no matter what. Though I am still
looking at it purely from a space utilization point of view, at least
for now.

--
Peter Geoghegan



Re: [WIP] Effective storage of duplicates in B-tree index.

Rafia Sabih
On Thu, 25 Jul 2019 at 05:49, Peter Geoghegan <[hidden email]> wrote:

>
> On Wed, Jul 24, 2019 at 3:06 PM Peter Geoghegan <[hidden email]> wrote:
> > There seems to be a kind of "synergy" between the nbtsplitloc.c
> > handling of pages that have lots of duplicates and posting list
> > compression. It seems as if the former mechanism "sets up the bowling
> > pins", while the latter mechanism "knocks them down", which is really
> > cool. We should try to gain a better understanding of how that works,
> > because it's possible that it could be even more effective in some
> > cases.
>
> I found another important way in which this synergy can fail to take
> place, which I can fix.
>
> By removing the BT_COMPRESS_THRESHOLD limit entirely, certain indexes
> from my test suite become much smaller, while most are not affected.
> These indexes were not helped too much by the patch before. For
> example, the TPC-E i_t_st_id index is 50% smaller. It is entirely full
> of duplicates of a single value (that's how it appears after an
> initial TPC-E bulk load), as are a couple of other TPC-E indexes.
> TPC-H's idx_partsupp_partkey index becomes ~18% smaller, while its
> idx_lineitem_orderkey index becomes ~15% smaller.
>
> I believe that this happened because rightmost page splits were an
> inefficient case for compression. But rightmost page split heavy
> indexes with lots of duplicates are not that uncommon. Think of any
> index with many NULL values, for example.
>
> I don't know for sure if BT_COMPRESS_THRESHOLD should be removed. I'm
> not sure what the idea is behind it. My sense is that we're likely to
> benefit by delaying page splits, no matter what. Though I am still
> looking at it purely from a space utilization point of view, at least
> for now.
>
Minor comment fix, pointes --> pointer. Also, are we really splitting into
halves, or just splitting into two parts?
/*
+ * Split posting tuple into two halves.
+ *
+ * Left tuple contains all item pointes less than the new one and
+ * right tuple contains new item pointer and all to the right.
+ *
+ * TODO Probably we can come up with more clever algorithm.
+ */

Some remains of 'he'.
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list.
+ * Caller is responsible for checking BTreeTupleIsPosting to ensure that
+ * it will get what he expects
+ */

Everything reads just fine without 'us'.
/*
+ * This field helps us to find beginning of the remaining tuples from
+ * postings which follow array of offset numbers.
+ */
--
Regards,
Rafia Sabih



Re: [WIP] Effective storage of duplicates in B-tree index.

Anastasia Lubennikova
In reply to this post by Peter Geoghegan-4
24.07.2019 4:22, Peter Geoghegan wrote:

>
> Attached is a revised version of your v2 that fixes this issue -- I'll
> call this v3. In general, my goal for the revision was to make sure
> that all of my old tests from the v12 work passed, and to make sure
> that amcheck can detect almost any possible problem. I tested the
> amcheck changes by corrupting random state in a test index using
> pg_hexedit, then making sure that amcheck actually complained in each
> case.
>
> I also fixed one or two bugs in passing, including the bug that caused
> an assertion failure in _bt_truncate(). That was down to a subtle
> off-by-one issue within _bt_insertonpg_in_posting(). Overall, I didn't
> make that many changes to your v2. There are probably some things
> about the patch that I still don't understand, or things that I have
> misunderstood.
>
Thank you for this review and fixes.
> * Changed the custom binary search code within _bt_compare_posting()
> to look more like _bt_binsrch() and _bt_binsrch_insert(). Do you know
> of any reason not to do it that way?
It's ok to update it. There was no particular reason, just my habit.
> * Added quite a few "FIXME"/"XXX" comments at various points, to
> indicate where I have general concerns that need more discussion.

+         * FIXME:  The calls to BTreeGetNthTupleOfPosting() allocate memory,

If we only need to check TIDs, we don't need BTreeGetNthTupleOfPosting();
we can use BTreeTupleGetPostingN() instead and iterate over TIDs, not
tuples.

Fixed in version 4.

> * Included my own pageinspect hack to visualize the minimum TIDs in
> posting lists. It's broken out into a separate patch file. The code is
> very rough, but it might help someone else, so I thought I'd include
> it.
Cool, I think we should add it to the final patchset,
probably, as separate function by analogy with tuple_data_split.

> I also have some new concerns about the code in the patch that I will
> point out now (though only as something to think about a solution on
> -- I am unsure myself):
>
> * It's a bad sign that compression involves calls to PageAddItem()
> that are allowed to fail (we just give up on compression when that
> happens). For one thing, all existing calls to PageAddItem() in
> Postgres are never expected to fail -- if they do fail we get a "can't
> happen" error that suggests corruption. It was a good idea to take
> this approach to get the patch to work, and to prove the general idea,
> but we now need to fully work out all the details about the use of
> space. This includes complicated new questions around how alignment is
> supposed to work.
The main reason to implement this gentle error handling is the fact that
deduplication could cause storage overhead, which can lead to running out of
space on the page.

First of all, it is a legacy of the previous versions, where
BTreeFormPostingTuple was not able to form a non-posting tuple even in the
case where the number of posting items is 1.

Another case that I had in mind is the situation where we have 2 tuples:
t_tid | t_info | key + t_tid | t_info | key

and the compressed result is:
t_tid | t_info | key | t_tid | t_tid

If sizeof(t_info) + sizeof(key) < sizeof(t_tid), the resulting posting tuple
can be larger. It may happen if keysize <= 4 bytes. In that situation the
original tuples must each have been aligned to 16 bytes, while the resulting
tuple is at most 24 bytes (6+2+4+6+6). So this case is also safe.
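
To make the arithmetic concrete, here is a tiny standalone program (not patch
code; it assumes 8-byte MAXALIGN and the 6-byte t_tid, 2-byte t_info and
4-byte key sizes from the example above):

#include <stdio.h>
#include <stdint.h>

/* 8-byte MAXALIGN, as on typical 64-bit platforms */
#define MAXALIGN(len) (((uintptr_t) (len) + 7) & ~((uintptr_t) 7))

int
main(void)
{
    size_t      t_tid = 6;      /* sizeof(ItemPointerData) */
    size_t      t_info = 2;
    size_t      key = 4;        /* the keysize <= 4 bytes case */

    /* two original tuples: each one is MAXALIGN(6 + 2 + 4) = 16 bytes */
    size_t      original = 2 * MAXALIGN(t_tid + t_info + key);

    /* merged posting tuple: MAXALIGN(6 + 2 + 4 + 6 + 6) = 24 bytes */
    size_t      posting = MAXALIGN(t_tid + t_info + key + 2 * t_tid);

    printf("two plain tuples: %zu bytes, one posting tuple: %zu bytes\n",
           original, posting);
    return 0;
}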

I changed the DEBUG message to ERROR in v4, and it passes all regression
tests. I doubt that this covers all corner cases, so I'll try to add more
special tests.

> Alignment in nbtree is already complicated today -- you're supposed to
> MAXALIGN() everything in nbtree, so that the MAXALIGN() within
> bufpage.c routines cannot be different to the lp_len/IndexTupleSize()
> length (note that heapam can have tuples whose lp_len isn't aligned,
> so nbtree could do it differently if it proved useful). Code within
> nbtsplitloc.c fully understands the space requirements for the
> bufpage.c routines, and is very careful about it. (The bufpage.c
> details are supposed to be totally hidden from code like
> nbtsplitloc.c, but I guess that that ideal isn't quite possible in
> reality. Code comments don't really explain the situation today.)
>
> I'm not sure what it would look like for this patch to be as precise
> about free space as nbtsplitloc.c already is, even though that seems
> desirable (I just know that it would mean you would trust
> PageAddItem() to work in all cases). The patch is different to what we
> already have today in that it tries to add *less than* a single
> MAXALIGN() quantum at a time in some places (when a posting list needs
> to grow by one item). The devil is in the details.
>
> * As you know, the current approach to WAL logging is very
> inefficient. It's okay for now, but we'll need a fine-grained approach
> for the patch to be commitable. I think that this is subtly related to
> the last item (i.e. the one about alignment). I have done basic
> performance tests using unlogged tables. The patch seems to either
> make big INSERT queries run as fast or faster than before when
> inserting into unlogged tables, which is a very good start.
>
> * Since we can now split a posting list in two, we may also have to
> reconsider BTMaxItemSize, or some similar mechanism that worries about
> extreme cases where it becomes impossible to split because even two
> pages are not enough to fit everything. Think of what happens when
> there is a tuple with a single large datum, that gets split in two
> (the tuple is split, not the page), with each half receiving its own
> copy of the datum. I haven't proven to myself that this is broken, but
> that may just be because I haven't spent any time on it. OTOH, maybe
> you already have it right, in which case it seems like it should be
> explained somewhere. Possibly in nbtree.h. This is tricky stuff.
Hmm, I can't see the problem.
In the current implementation each posting tuple is smaller than
BTMaxItemSize, so no split can lead to a tuple of larger size.

> * I agree with all of your existing TODO items -- most of them seem
> very important to me.
>
> * Do we really need to keep BTreeTupleGetHeapTID(), now that we have
> BTreeTupleGetMinTID()? Can't we combine the two macros into one, so
> that callers don't need to think about the pivot vs posting list thing
> themselves? See the new code added to _bt_mkscankey() by v3, for
> example. It now handles both cases/macros at once, in order to keep
> its amcheck caller happy. amcheck's verify_nbtree.c received similar
> ugly code in v3.
No, we don't need them both. I don't mind combining them into one macro.
Actually, we never needed BTreeTupleGetMinTID(),
since its functionality is covered by BTreeTupleGetHeapTID.
On the other hand, in some cases BTreeTupleGetMinTID() looks more readable.
For example here:

    Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lefttup),
                              BTreeTupleGetMinTID(righttup)) < 0);

> * We should at least experiment with applying compression when
> inserting into unique indexes. Like Alexander, I think that
> compression in unique indexes might work well, given how they must
> work in Postgres.

The main reason why I decided to avoid applying compression to unique indexes
is the performance of microvacuum. It is not applied to items inside a posting
tuple, and I expect it to be important for unique indexes, which ideally
contain only a few live values.

One more thing I want to discuss:
    /*
     * We do not expect to meet any DEAD items, since this function is
     * called right after _bt_vacuum_one_page(). If for some reason we
     * found dead item, don't compress it, to allow upcoming microvacuum
     * or vacuum clean it up.
     */
    if (ItemIdIsDead(itemId))
        continue;

In the previous review Rafia asked about "some reason". Trying to figure out
whether this situation is possible, I changed this line to
Assert(!ItemIdIsDead(itemId)) in our test version, and it failed in a
performance test. Unfortunately, I was not able to reproduce it.
The explanation I see is that the page had DEAD items, but for some reason
BTP_HAS_GARBAGE was not set, so _bt_vacuum_one_page() was not called.
I find it difficult to understand what could lead to this situation,
so we probably need to inspect it more closely to rule out the possibility
of a bug.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


v4-0001-Compression-deduplication-in-nbtree.patch (84K)

Re: [WIP] Effective storage of duplicates in B-tree index.

Peter Geoghegan-4
On Wed, Jul 31, 2019 at 9:23 AM Anastasia Lubennikova
<[hidden email]> wrote:
> > * Included my own pageinspect hack to visualize the minimum TIDs in
> > posting lists. It's broken out into a separate patch file. The code is
> > very rough, but it might help someone else, so I thought I'd include
> > it.
> Cool, I think we should add it to the final patchset,
> probably, as separate function by analogy with tuple_data_split.

Good idea.

Attached is v5, which is based on your v4. The three main differences
between this and v4 are:

* Removed BT_COMPRESS_THRESHOLD stuff, for the reasons explained in my
July 24 e-mail. We can always add something like this back during
performance validation of the patch. Right now, having no
BT_COMPRESS_THRESHOLD limit definitely improves space utilization for
certain important cases, which seems more important than the
uncertain/speculative downside.

* We now have experimental support for unique indexes. This is broken
out into its own patch.

* We now handle LP_DEAD items in a special way within
_bt_insertonpg_in_posting().

As you pointed out already, we do need to think about LP_DEAD items
directly, rather than assuming that they cannot be on the page that
_bt_insertonpg_in_posting() must process. More on that later.

> If sizeof(t_info) + sizeof(key) < sizeof(t_tid), resulting posting tuple
> can be
> larger. It may happen if keysize <= 4 byte.
> In this situation original tuples must have been aligned to size 16
> bytes each,
> and resulting tuple is at most 24 bytes (6+2+4+6+6). So this case is
> also safe.

I still need to think about the exact details of alignment within
_bt_insertonpg_in_posting(). I'm worried about boundary cases there. I
could be wrong.

> I changed DEBUG message to ERROR in v4 and it passes all regression tests.
> I doubt that it covers all corner cases, so I'll try to add more special
> tests.

It also passes my tests, FWIW.

> Hmm, I can't get the problem.
> In current implementation each posting tuple is smaller than BTMaxItemSize,
> so no split can lead to having tuple of larger size.

That sounds correct, then.

> No, we don't need them both. I don't mind combining them into one macro.
> Actually, we never needed BTreeTupleGetMinTID(),
> since its functionality is covered by BTreeTupleGetHeapTID.

I've removed BTreeTupleGetMinTID() in v5. I think it's fine to just
have a comment next to BTreeTupleGetHeapTID(), and another comment
next to BTreeTupleGetMaxTID().

> The main reason why I decided to avoid applying compression to unique
> indexes
> is the performance of microvacuum. It is not applied to items inside a
> posting
> tuple. And I expect it to be important for unique indexes, which ideally
> contain only a few live values.

I found that the performance of my experimental patch with unique
index was significantly worse. It looks like this is a bad idea, as
you predicted, though we may still want to do
deduplication/compression with NULL values in unique indexes. I did
learn a few things from implementing unique index support, though.

BTW, there is a subtle bug in how my unique index patch does
WAL-logging -- see my comments within
index_compute_xid_horizon_for_tuples(). The bug shouldn't matter if
replication isn't used. I don't think that we're going to use this
experimental patch at all, so I didn't bother fixing the bug.

> if (ItemIdIsDead(itemId))
> continue;
>
> In the previous review Rafia asked about "some reason".
> Trying to figure out if this situation possible, I changed this line to
> Assert(!ItemIdIsDead(itemId)) in our test version. And it failed in a
> performance
> test. Unfortunately, I was not able to reproduce it.

I found it easy enough to see LP_DEAD items within
_bt_insertonpg_in_posting() when running pgbench with the extra unique
index patch. To give you a simple example of how this can happen,
consider the comments about BTP_HAS_GARBAGE within
_bt_delitems_vacuum(). That probably isn't the only way it can happen,
either. ISTM that we need to be prepared for LP_DEAD items during
deduplication, rather than trying to prevent deduplication from ever
having to see an LP_DEAD item.

v5 makes _bt_insertonpg_in_posting() prepared to overwrite an
existing item if it's an LP_DEAD item that falls in the same TID range
(that's _bt_compare()-wise "equal" to an existing tuple, which may or
may not be a posting list tuple already). I haven't made this code do
something like call index_compute_xid_horizon_for_tuples(), even
though that's needed for correctness (i.e. this new code is currently
broken in the same way that I mentioned unique index support is
broken). I also added a nearby FIXME comment to
_bt_insertonpg_in_posting() -- I don't think that the code for
splitting a posting list in two is currently crash-safe.

How do you feel about officially calling this deduplication, not
compression? I think that it's a more accurate name for the technique.
--
Peter Geoghegan

v5-0001-Compression-deduplication-in-nbtree.patch (115K)
v5-0003-DEBUG-Add-pageinspect-instrumentation.patch (10K)
v5-0002-Experimental-support-for-unique-indexes.patch (13K)

Re: [WIP] Effective storage of duplicates in B-tree index.

Anastasia Lubennikova
06.08.2019 4:28, Peter Geoghegan wrote:
> Attached is v5, which is based on your v4. The three main differences
> between this and v4 are:
>
> * Removed BT_COMPRESS_THRESHOLD stuff, for the reasons explained in my
> July 24 e-mail. We can always add something like this back during
> performance validation of the patch. Right now, having no
> BT_COMPRESS_THRESHOLD limit definitely improves space utilization for
> certain important cases, which seems more important than the
> uncertain/speculative downside.
Fair enough. I think we can measure performance and make a decision when the
patch stabilizes.

> * We now have experimental support for unique indexes. This is broken
> out into its own patch.
>
> * We now handle LP_DEAD items in a special way within
> _bt_insertonpg_in_posting().
>
> As you pointed out already, we do need to think about LP_DEAD items
> directly, rather than assuming that they cannot be on the page that
> _bt_insertonpg_in_posting() must process. More on that later.
>
>> If sizeof(t_info) + sizeof(key) < sizeof(t_tid), resulting posting tuple
>> can be
>> larger. It may happen if keysize <= 4 byte.
>> In this situation original tuples must have been aligned to size 16
>> bytes each,
>> and resulting tuple is at most 24 bytes (6+2+4+6+6). So this case is
>> also safe.
> I still need to think about the exact details of alignment within
> _bt_insertonpg_in_posting(). I'm worried about boundary cases there. I
> could be wrong.
Could you explain more about these cases? Right now I don't understand the
problem.

>> The main reason why I decided to avoid applying compression to unique
>> indexes
>> is the performance of microvacuum. It is not applied to items inside a
>> posting
>> tuple. And I expect it to be important for unique indexes, which ideally
>> contain only a few live values.
> I found that the performance of my experimental patch with unique
> index was significantly worse. It looks like this is a bad idea, as
> you predicted, though we may still want to do
> deduplication/compression with NULL values in unique indexes. I did
> learn a few things from implementing unique index support, though.
>
> BTW, there is a subtle bug in how my unique index patch does
> WAL-logging -- see my comments within
> index_compute_xid_horizon_for_tuples(). The bug shouldn't matter if
> replication isn't used. I don't think that we're going to use this
> experimental patch at all, so I didn't bother fixing the bug.
Thank you for the patch.
Still, I'd suggest leaving it as a possible future improvement, so that it
doesn't distract us from the original feature.

>> if (ItemIdIsDead(itemId))
>> continue;
>>
>> In the previous review Rafia asked about "some reason".
>> Trying to figure out if this situation possible, I changed this line to
>> Assert(!ItemIdIsDead(itemId)) in our test version. And it failed in a
>> performance
>> test. Unfortunately, I was not able to reproduce it.
> I found it easy enough to see LP_DEAD items within
> _bt_insertonpg_in_posting() when running pgbench with the extra unique
> index patch. To give you a simple example of how this can happen,
> consider the comments about BTP_HAS_GARBAGE within
> _bt_delitems_vacuum(). That probably isn't the only way it can happen,
> either. ISTM that we need to be prepared for LP_DEAD items during
> deduplication, rather than trying to prevent deduplication from ever
> having to see an LP_DEAD item.
I added another related fix for _bt_compress_one_page() to v6.
The previous code implicitly deleted DEAD items without calling
index_compute_xid_horizon_for_tuples(). The new code checks whether any
DEAD items exist on the page and removes them if so.
Another possible solution is to copy dead items as is from the old page to
the new one, but I think it's good to remove dead tuples as fast as possible.

> v5 makes  _bt_insertonpg_in_posting() prepared to overwrite an
> existing item if it's an LP_DEAD item that falls in the same TID range
> (that's _bt_compare()-wise "equal" to an existing tuple, which may or
> may not be a posting list tuple already). I haven't made this code do
> something like call  index_compute_xid_horizon_for_tuples(), even
> though that's needed for correctness (i.e. this new code is currently
> broken in the same way that I mentioned unique index support is
> broken).
Is it possible that the DEAD tuple to delete was smaller than itup?
>   I also added a nearby FIXME comment to
> _bt_insertonpg_in_posting() -- I don't think that the code for
> splitting a posting list in two is currently crash-safe.
>
Good catch. It seems that I need to rearrange the code.
I'll send an updated patch this week.

> How do you feel about officially calling this deduplication, not
> compression? I think that it's a more accurate name for the technique.
I agree.
Should I rename all related names of functions and variables in the patch?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


v6-0001-Compression-deduplication-in-nbtree.patch (86K)

Re: [WIP] Effective storage of duplicates in B-tree index.

Anastasia Lubennikova
13.08.2019 18:45, Anastasia Lubennikova wrote:
  I also added a nearby FIXME comment to
_bt_insertonpg_in_posting() -- I don't think that the code for
splitting a posting list in two is currently crash-safe.

Good catch. It seems that I need to rearrange the code.
I'll send an updated patch this week.

Attached is v7.

In this version of the patch, I heavily refactored the code for insertion into
a posting tuple. The _bt_split logic is quite complex, so I omitted a couple of
optimizations; they are mentioned in TODO comments.

Now the algorithm is the following:

- If _bt_findinsertloc() finds that the tuple belongs to an existing posting
tuple's TID interval (see the small sketch after this list), it sets the
'in_posting_offset' variable and passes it to _bt_insertonpg().

- If 'in_posting_offset' is valid and origtup is valid, merge our itup into
origtup. This can result in one tuple, neworigtup, that must replace origtup;
or in two tuples, neworigtup and newrighttup, if the result exceeds
BTMaxItemSize.

- If the new tuple(s) fit into the old page, we're lucky. Call
_bt_delete_and_insert(..., neworigtup, newrighttup, newitemoff) to atomically
replace oldtup with the new tuple(s) and generate an xlog record.

- In case a page split is needed, pass both tuples to _bt_split().
_bt_findsplitloc() is now aware of the upcoming replacement of origtup with
neworigtup, so it uses the correct item size where needed.
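
Here is a toy, standalone sketch of the interval test in the first step. This
is not patch code: the real code compares ItemPointerData values with
ItemPointerCompare(), while here a TID is reduced to a (block, offset) pair
and the struct and function names are made up for illustration:

#include <stdio.h>
#include <stdbool.h>

/* A TID reduced to a (block, offset) pair, for illustration only. */
typedef struct ToyTid
{
    unsigned    block;
    unsigned    offset;
} ToyTid;

static int
toy_tid_compare(ToyTid a, ToyTid b)
{
    if (a.block != b.block)
        return (a.block < b.block) ? -1 : 1;
    if (a.offset != b.offset)
        return (a.offset < b.offset) ? -1 : 1;
    return 0;
}

/* Does newtid fall inside the posting tuple's [mintid, maxtid] interval? */
static bool
tid_in_posting_interval(ToyTid newtid, ToyTid mintid, ToyTid maxtid)
{
    return toy_tid_compare(newtid, mintid) >= 0 &&
        toy_tid_compare(newtid, maxtid) <= 0;
}

int
main(void)
{
    ToyTid      mintid = {10, 1};
    ToyTid      maxtid = {42, 7};
    ToyTid      newtid = {17, 3};

    printf("in interval: %d\n", tid_in_posting_interval(newtid, mintid, maxtid));
    return 0;
}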

It seems that now all replace operations are crash-safe. The new patch passes
all regression tests, so I think it's ready for review again.

In the meantime, I'll run more stress-tests.
-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

v7-0001-Compression-deduplication-in-nbtree.patch (108K)

Re: [WIP] Effective storage of duplicates in B-tree index.

Peter Geoghegan-4
In reply to this post by Anastasia Lubennikova
On Tue, Aug 13, 2019 at 8:45 AM Anastasia Lubennikova
<[hidden email]> wrote:
> > I still need to think about the exact details of alignment within
> > _bt_insertonpg_in_posting(). I'm worried about boundary cases there. I
> > could be wrong.
> Could you explain more about these cases?
> Now I don't understand the problem.

Maybe there is no problem.

> Thank you for the patch.
> Still, I'd suggest to leave it as a possible future improvement, so that
> it doesn't
> distract us from the original feature.

I don't even think that it's useful work for the future. It's just
nice to be sure that we could support unique index deduplication if it
made sense. Which it doesn't. If I didn't write the patch that
implements deduplication for unique indexes, I might still not realize
that we need the index_compute_xid_horizon_for_tuples() stuff in
certain other places. I'm not serious about it at all, except as a
learning exercise/experiment.

> I added to v6 another related fix for _bt_compress_one_page().
> Previous code was implicitly deleted DEAD items without
> calling index_compute_xid_horizon_for_tuples().
> New code has a check whether DEAD items on the page exist and remove
> them if any.
> Another possible solution is to copy dead items as is from old page to
> the new one,
> but I think it's good to remove dead tuples as fast as possible.

I think that what you've done in v7 is probably the best way to do it.
It's certainly simple, which is appropriate given that we're not
really expecting to see LP_DEAD items within _bt_compress_one_page()
(we just need to be prepared for them).

> > v5 makes  _bt_insertonpg_in_posting() prepared to overwrite an
> > existing item if it's an LP_DEAD item that falls in the same TID range
> > (that's _bt_compare()-wise "equal" to an existing tuple, which may or
> > may not be a posting list tuple already). I haven't made this code do
> > something like call  index_compute_xid_horizon_for_tuples(), even
> > though that's needed for correctness (i.e. this new code is currently
> > broken in the same way that I mentioned unique index support is
> > broken).
> Is it possible that DEAD tuple to delete was smaller than itup?

I'm not sure what you mean by this. I suppose that it doesn't matter,
since we both prefer the alternative that you came up with anyway.

> > How do you feel about officially calling this deduplication, not
> > compression? I think that it's a more accurate name for the technique.
> I agree.
> Should I rename all related names of functions and variables in the patch?

Please rename them when convenient.

--
Peter Geoghegan



Re: [WIP] Effective storage of duplicates in B-tree index.

Peter Geoghegan-4
In reply to this post by Anastasia Lubennikova
On Fri, Aug 16, 2019 at 8:56 AM Anastasia Lubennikova
<[hidden email]> wrote:

> Now the algorithm is the following:
>
> - If bt_findinsertloc() found out that tuple belongs to existing posting tuple's
> TID interval, it sets 'in_posting_offset' variable and passes it to
> _bt_insertonpg()
>
> - If 'in_posting_offset' is valid and origtup is valid,
> merge our itup into origtup.
>
> It can result in one tuple neworigtup, that must replace origtup; or two tuples:
> neworigtup and newrighttup, if the result exceeds BTMaxItemSize,

That sounds like the right way to do it.

> - If two new tuple(s) fit into the old page, we're lucky.
> call _bt_delete_and_insert(..., neworigtup, newrighttup, newitemoff) to
> atomically replace oldtup with new tuple(s) and generate xlog record.
>
> - In case page split is needed, pass both tuples to _bt_split().
>  _bt_findsplitloc() is now aware of upcoming replacement of origtup with
> neworigtup, so it uses correct item size where needed.

That makes sense, since _bt_split() is responsible for both splitting
the page, and inserting the new item on either the left or right page,
as part of the first phase of a page split. In other words, if you're
adding something new to _bt_insertonpg(), you probably also need to
add something new to _bt_split(). So that's what you did.

> It seems that now all replace operations are crash-safe. The new patch passes
> all regression tests, so I think it's ready for review again.

I'm looking at it now. I'm going to spend a significant amount of time
on this tomorrow.

I think that we should start to think about efficient WAL-logging now.

> In the meantime, I'll run more stress-tests.

As you probably realize, wal_consistency_checking is a good thing to
use with your tests here.

--
Peter Geoghegan



Re: [WIP] Effective storage of duplicates in B-tree index.

Anastasia Lubennikova
20.08.2019 4:04, Peter Geoghegan wrote:
> On Fri, Aug 16, 2019 at 8:56 AM Anastasia Lubennikova
> <[hidden email]> wrote:
>
>> It seems that now all replace operations are crash-safe. The new patch passes
>> all regression tests, so I think it's ready for review again.
> I'm looking at it now. I'm going to spend a significant amount of time
> on this tomorrow.
>
> I think that we should start to think about efficient WAL-logging now.

Thank you for the review.

The new version v8 is attached. Compared to the previous version, this patch
includes updated btree_xlog_insert() and btree_xlog_split(), so that WAL
records now only contain data about the updated posting tuple and don't
require full page writes. I haven't updated pg_waldump yet; that is postponed
until we agree on the nbtxlog changes.

Also in this patch I renamed all 'compress' keywords to 'deduplicate' and did
minor cleanup of outdated comments.

I'm going to look through the patch once more to update nbtxlog comments
where needed, and answer your remarks that are still left in the comments.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


v8-0001-Deduplication-in-nbtree.patch (113K)

Re: [WIP] Effective storage of duplicates in B-tree index.

Peter Geoghegan-4
On Wed, Aug 21, 2019 at 10:19 AM Anastasia Lubennikova
<[hidden email]> wrote:
> I'm going to look through the patch once more to update nbtxlog
> comments, where needed and
> answer to your remarks that are still left in the comments.

Have you been using amcheck's rootdescend verification? I see this
problem with v8, with the TPC-H test data:

DEBUG:  finished verifying presence of 1500000 tuples from table
"customer" with bitset 51.09% set
ERROR:  could not find tuple using search from root page in index
"idx_customer_nationkey2"

I've been running my standard amcheck query with these databases, which is:

SELECT bt_index_parent_check(index => c.oid, heapallindexed => true,
rootdescend => true),
c.relname,
c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE am.amname = 'btree'
AND c.relpersistence != 't'
AND c.relkind = 'i' AND i.indisready AND i.indisvalid
ORDER BY c.relpages DESC;

There were many large indexes that amcheck didn't detect a problem
with. I don't yet understand what the problem is, or why we only see
the problem for a small number of indexes. Note that all of these
indexes passed verification with v5, so this is some kind of
regression.

I also noticed that there were some regressions in the size of indexes
-- indexes were not nearly as small as they were in v5 in some cases.
The overall picture was a clear regression in how effective
deduplication is.

I think that it would save time if you had direct access to my test
data, even though it's a bit cumbersome. You'll have to download about
10GB of dumps, which require plenty of disk space when restored:

regression=# \l+
                                                                 List of databases
    Name    | Owner | Encoding |  Collate   |   Ctype    | Access privileges |  Size   | Tablespace |                Description
------------+-------+----------+------------+------------+-------------------+---------+------------+---------------------------------------------
 land       | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 6425 MB | pg_default |
 mgd        | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 61 GB   | pg_default |
 postgres   | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 7753 kB | pg_default | default administrative connection database
 regression | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 886 MB  | pg_default |
 template0  | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 | =c/pg            +| 7609 kB | pg_default | unmodifiable empty database
            |       |          |            |            | pg=CTc/pg         |         |            |
 template1  | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 | =c/pg            +| 7609 kB | pg_default | default template for new databases
            |       |          |            |            | pg=CTc/pg         |         |            |
 tpcc       | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 10 GB   | pg_default |
 tpce       | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 26 GB   | pg_default |
 tpch       | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 32 GB   | pg_default |
(9 rows)

I have found it very valuable to use this test data when changing
nbtsplitloc.c, or anything that could affect where page splits make
free space available. If this is too much data to handle conveniently,
then you could skip "mgd" and almost have as much test coverage. There
really does seem to be a benefit to using diverse test cases like
this, because sometimes regressions only affect a small number of
specific indexes for specific reasons. For example, only TPC-H has a
small number of indexes that have tuples that are inserted in order,
but also have many duplicates. Removing the BT_COMPRESS_THRESHOLD
stuff really helped with those indexes.

Want me to send this data and the associated tests script over to you?

--
Peter Geoghegan



Re: [WIP] Effective storage of duplicates in B-tree index.

Anastasia Lubennikova
23.08.2019 7:33, Peter Geoghegan wrote:
> On Wed, Aug 21, 2019 at 10:19 AM Anastasia Lubennikova
> <[hidden email]> wrote:
>> I'm going to look through the patch once more to update nbtxlog
>> comments, where needed and
>> answer to your remarks that are still left in the comments.
> Have you been using amcheck's rootdescend verification?

No, I haven't checked it with the latest version yet.

> There were many large indexes that amcheck didn't detect a problem
> with. I don't yet understand what the problem is, or why we only see
> the problem for a small number of indexes. Note that all of these
> indexes passed verification with v5, so this is some kind of
> regression.
>
> I also noticed that there were some regressions in the size of indexes
> -- indexes were not nearly as small as they were in v5 in some cases.
> The overall picture was a clear regression in how effective
> deduplication is.
Do these indexes have something in common? Maybe some specific workload?
Are there any error messages in the log?

I'd like to pin down what caused the problem.
There were several major changes between v5 and v8:
- dead tuples handling added in v6;
- _bt_split changes for posting tuples in v7;
- WAL logging of posting tuple changes in v8.

I don't think the last one could break regular indexes on master.
Do you see the same regression in v6, v7?

> I think that it would save time if you had direct access to my test
> data, even though it's a bit cumbersome. You'll have to download about
> 10GB of dumps, which require plenty of disk space when restored:
>
>
> Want me to send this data and the associated tests script over to you?
>
Yes, I think it will help me to debug the patch faster.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [WIP] Effective storage of duplicates in B-tree index.

Peter Geoghegan-4
In reply to this post by Anastasia Lubennikova
On Fri, Aug 16, 2019 at 8:56 AM Anastasia Lubennikova
<[hidden email]> wrote:
> Now the algorithm is the following:

> - In case page split is needed, pass both tuples to _bt_split().
>  _bt_findsplitloc() is now aware of upcoming replacement of origtup with
> neworigtup, so it uses correct item size where needed.
>
> It seems that now all replace operations are crash-safe. The new patch passes
> all regression tests, so I think it's ready for review again.

I think that the way this works within nbtsplitloc.c is too
complicated. In v5, the only thing that  nbtsplitloc.c knew about
deduplication was that it could be sure that suffix truncation would
at least make a posting list into a single heap TID in the worst case.
This consideration was mostly about suffix truncation, not
deduplication, which seemed like a good thing to me. _bt_split() and
_bt_findsplitloc() should know as little as possible about posting
lists.

Obviously it will sometimes be necessary to deal with the case where a
posting list is about to become too big (i.e. it's about to go over
BTMaxItemSize()), and so must be split. Less often, a page split will
be needed because of one of these posting list splits. These are two
complicated areas (posting list splits and page splits), and it would
be a good idea to find a way to separate them as much as possible.
Remember, nbtsplitloc.c works by pretending that the new item that
cannot fit on the page is already on its own imaginary version of the
page that *can* fit the new item, along with everything else from the
original/actual page. That gets *way* too complicated when it has to
deal with the fact that the new item is being merged with an existing
item. Perhaps nbtsplitloc.c could also "pretend" that the new item is
always a plain tuple, without knowing anything about posting lists.
Almost like how it worked in v5.

We always want posting lists to be as close to the BTMaxItemSize()
size as possible, because that helps with space utilization. In v5 of
the patch, this was what happened, because, in effect, we didn't try
to do anything complicated with the new item. This worked well, apart
from the crash safety issue. Maybe we can simulate the v5 approach,
giving us the best of all worlds (good space utilization, simplicity,
and crash safety). Something like this:

* Posting list splits should always result in one posting list that is
at or just under BTMaxItemSize() in size, plus one plain tuple to its
immediate right on the page. This is similar to the more common case
where we cannot add additional tuples to a posting list due to the
BTMaxItemSize() restriction, and so end up with a single tuple (or a
smaller posting list with the same value) to the right of a
BTMaxItemSize()-sized posting list tuple. I don't see a reason to
split a posting list in the middle -- we should always split to the
right, leaving the posting list as large as possible.

* When there is a simple posting list split, with no page split, the
logic required is fairly straightforward: We rewrite the posting list
in-place so that our new item goes wherever it belongs in the existing
posting list on the page (we memmove() the posting list to make space
for the new TID, basically). The old last/rightmost TID in the
original posting list becomes a new, plain tuple (a toy sketch of this
in-place rewrite appears right after this list). We may need a new WAL
record for this, but it's not that different to a regular leaf page
insert.

* When this happens to result in a page split, we then have a "fake"
new item -- the right half of the posting list that we split, which is
always a plain item. Obviously we need to be a bit careful with the
WAL logging, but the space accounting within _bt_split() and
_bt_findsplitloc() can work just the same as now. nbtsplitloc.c can
work like it did in v5, when the only thing it knew about posting
lists was that _bt_truncate() always removes them, maybe leaving a
single TID behind in the new high key. (Note also that it's not okay
to remove the conservative assumption about at least having space for
one heap TID within _bt_recsplitloc() -- that needs to be restored to
its v5 state in the next version of the patch.)
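
To illustrate the second point above (the in-place posting list rewrite),
here is a toy standalone sketch. It is not patch code: it uses a plain int
array in place of ItemPointerData, the function name is made up, and WAL
logging and page management are ignored entirely:

#include <stdio.h>
#include <string.h>

/*
 * Insert newtid into a sorted posting list of ntids entries, keeping the
 * list the same length: the tail is shifted right with memmove() and the
 * old rightmost TID is returned so that the caller can insert it as a new,
 * plain tuple immediately to the right.
 */
static int
posting_list_insert(int *tids, int ntids, int newtid)
{
    int     lasttid = tids[ntids - 1];
    int     pos = 0;

    while (pos < ntids - 1 && tids[pos] < newtid)
        pos++;

    memmove(&tids[pos + 1], &tids[pos], sizeof(int) * (ntids - 1 - pos));
    tids[pos] = newtid;

    return lasttid;
}

int
main(void)
{
    int     tids[] = {10, 20, 40, 50};
    int     plain = posting_list_insert(tids, 4, 30);

    /* prints "posting list: 10 20 30 40, new plain tuple: 50" */
    printf("posting list: %d %d %d %d, new plain tuple: %d\n",
           tids[0], tids[1], tids[2], tids[3], plain);
    return 0;
}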

Because deduplication is lazy, there is little value in doing
deduplication of the new item (which may or may not be the fake new
item). The nbtsplitloc.c logic will "trap" duplicates on the same page
today, so we can just let deduplication of the new item happen at a
later time. _bt_split() can almost pretend that posting lists don't
exist, and nbtsplitloc.c needs to know nothing about posting lists
(apart from the way that _bt_truncate() behaves with posting lists).
We "lie" to  _bt_findsplitloc(), and tell it that the new item is our
fake new item -- it doesn't do anything that will be broken by that
lie, because it doesn't care about the actual content of posting
lists. And, we can fix the "fake new item is not actually real new
item" issue at one point within _bt_split(), just as we're about to
WAL log.

What do you think of that approach?

--
Peter Geoghegan



Re: [WIP] Effective storage of duplicates in B-tree index.

Anastasia Lubennikova
28.08.2019 6:19, Peter Geoghegan wrote:

> On Fri, Aug 16, 2019 at 8:56 AM Anastasia Lubennikova
> <[hidden email]>  wrote:
>> Now the algorithm is the following:
>> - In case page split is needed, pass both tuples to _bt_split().
>>   _bt_findsplitloc() is now aware of upcoming replacement of origtup with
>> neworigtup, so it uses correct item size where needed.
>>
>> It seems that now all replace operations are crash-safe. The new patch passes
>> all regression tests, so I think it's ready for review again.
> I think that the way this works within nbtsplitloc.c is too
> complicated. In v5, the only thing that  nbtsplitloc.c knew about
> deduplication was that it could be sure that suffix truncation would
> at least make a posting list into a single heap TID in the worst case.
> This consideration was mostly about suffix truncation, not
> deduplication, which seemed like a good thing to me. _bt_split() and
> _bt_findsplitloc() should know as little as possible about posting
> lists.
>
> Obviously it will sometimes be necessary to deal with the case where a
> posting list is about to become too big (i.e. it's about to go over
> BTMaxItemSize()), and so must be split. Less often, a page split will
> be needed because of one of these posting list splits. These are two
> complicated areas (posting list splits and page splits), and it would
> be a good idea to find a way to separate them as much as possible.
> Remember, nbtsplitloc.c works by pretending that the new item that
> cannot fit on the page is already on its own imaginary version of the
> page that *can* fit the new item, along with everything else from the
> original/actual page. That gets *way* too complicated when it has to
> deal with the fact that the new item is being merged with an existing
> item. Perhaps nbtsplitloc.c could also "pretend" that the new item is
> always a plain tuple, without knowing anything about posting lists.
> Almost like how it worked in v5.
>
> We always want posting lists to be as close to the BTMaxItemSize()
> size as possible, because that helps with space utilization. In v5 of
> the patch, this was what happened, because, in effect, we didn't try
> to do anything complicated with the new item. This worked well, apart
> from the crash safety issue. Maybe we can simulate the v5 approach,
> giving us the best of all worlds (good space utilization, simplicity,
> and crash safety). Something like this:
>
> * Posting list splits should always result in one posting list that is
> at or just under BTMaxItemSize() in size, plus one plain tuple to its
> immediate right on the page. This is similar to the more common case
> where we cannot add additional tuples to a posting list due to the
> BTMaxItemSize() restriction, and so end up with a single tuple (or a
> smaller posting list with the same value) to the right of a
> BTMaxItemSize()-sized posting list tuple. I don't see a reason to
> split a posting list in the middle -- we should always split to the
> right, leaving the posting list as large as possible.
>
> * When there is a simple posting list split, with no page split, the
> logic required is fairly straightforward: We rewrite the posting list
> in-place so that our new item goes wherever it belongs in the existing
> posting list on the page (we memmove() the posting list to make space
> for the new TID, basically). The old last/rightmost TID in the
> original posting list becomes a new, plain tuple. We may need a new
> WAL record for this, but it's not that different to a regular leaf
> page insert.
>
> * When this happens to result in a page split, we then have a "fake"
> new item -- the right half of the posting list that we split, which is
> always a plain item. Obviously we need to be a bit careful with the
> WAL logging, but the space accounting within _bt_split() and
> _bt_findsplitloc() can work just the same as now. nbtsplitloc.c can
> work like it did in v5, when the only thing it knew about posting
> lists was that _bt_truncate() always removes them, maybe leaving a
> single TID behind in the new high key. (Note also that it's not okay
> to remove the conservative assumption about at least having space for
> one heap TID within _bt_recsplitloc() -- that needs to be restored to
> its v5 state in the next version of the patch.)
>
> Because deduplication is lazy, there is little value in doing
> deduplication of the new item (which may or may not be the fake new
> item). The nbtsplitloc.c logic will "trap" duplicates on the same page
> today, so we can just let deduplication of the new item happen at a
> later time. _bt_split() can almost pretend that posting lists don't
> exist, and nbtsplitloc.c needs to know nothing about posting lists
> (apart from the way that _bt_truncate() behaves with posting lists).
> We "lie" to  _bt_findsplitloc(), and tell it that the new item is our
> fake new item -- it doesn't do anything that will be broken by that
> lie, because it doesn't care about the actual content of posting
> lists. And, we can fix the "fake new item is not actually real new
> item" issue at one point within _bt_split(), just as we're about to
> WAL log.
>
> What do you think of that approach?
I think it's a good idea. Thank you for such a detailed description of the
various cases. I had already started to simplify this code while debugging
the amcheck error in v8. At first, I rewrote it to split a posting tuple into
a posting and a regular tuple instead of two posting tuples.

Your explanation helped me to understand that this approach can be extended
to the case of insertion into a posting list that doesn't trigger a posting
split, and that nbtsplitloc.c indeed doesn't need to know about posting tuple
specifics. The code is much cleaner now.

The new version is attached. It passes regression tests. I also ran the land
and tpch tests. They pass amcheck rootdescend, and if I interpreted the
results correctly, the new version shows slightly better compression.
\l+
 tpch      | anastasia | UTF8     | ru_RU.UTF-8 | ru_RU.UTF-8 |  | 31 GB   | pg_default |
 land      | anastasia | UTF8     | ru_RU.UTF-8 | ru_RU.UTF-8 |  | 6380 MB | pg_default |

Some individual indexes are larger and some are smaller compared to the
expected output.

This patch is based on v6, so it again contains "compression" instead of
"deduplication" in variable names and comments. I will rename them when the
code becomes more stable.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


v9-0001-Compression-deduplication-in-nbtree.patch (105K)

Re: [WIP] Effective storage of duplicates in B-tree index.

Peter Geoghegan-4
On Thu, Aug 29, 2019 at 5:13 AM Anastasia Lubennikova
<[hidden email]> wrote:
> Your explanation helped me to understand that this approach can be
> extended to
> the case of insertion into posting list, that doesn't trigger posting
> split,
> and that nbtsplitloc indeed doesn't need to know about posting tuples
> specific.
> The code is much cleaner now.

Fantastic!

> Some individual indexes are larger, some are smaller compared to the
> expected output.

I agree that v9 might be ever so slightly more space efficient than v5
was, on balance. In any case v9 completely fixes the regression that I
saw in the last version. I have pushed the changes to the test output
for the serial tests that I privately maintain, that I gave you access
to. The MGD test output also looks perfect.

We may find that deduplication is a little too effective, in the sense
that it packs so many tuples on to leaf pages that *concurrent*
inserters will tend to get excessive page splits. We may find that it
makes sense to aim for posting lists that are maybe 96% of
BTMaxItemSize() -- note that BTREE_SINGLEVAL_FILLFACTOR is 96 for this
reason. Concurrent inserters will tend to have heap TIDs that are
slightly out of order, so we want to at least have enough space
remaining on the left half of a "single value mode" split. We may end
up with a design where deduplication anticipates what will be useful
for nbtsplitloc.c.

I still think that it's too early to start worrying about problems
like this one -- I feel it will be useful to continue to focus on the
code and the space utilization of the serial test cases for now. We
can look at it at the same time that we think about adding back
something like BT_COMPRESS_THRESHOLD. I am mentioning it now because
it's probably a good time for you to start thinking about it, if you
haven't already (actually, maybe I'm just describing what
BT_COMPRESS_THRESHOLD was supposed to do in the first place). We'll
need to have a good benchmark to assess these questions, and it's not
obvious what that will be. Two possible candidates are TPC-H and
TPC-E. (Of course, I mean running them for real -- not using their
indexes to make sure that the nbtsplitloc.c stuff works well in
isolation.)

Any thoughts on a conventional benchmark that allows us to understand
the patch's impact on both throughput and latency?

BTW, I notice that we often have indexes that are quite a lot smaller
when they were created with retail insertions rather than with CREATE
INDEX/REINDEX. This is not new, but the difference is much larger than
it typically is without the patch. For example, the TPC-E index on
trade.t_ca_id (which is named "i_t_ca_id" or  "i_t_ca_id2" in my test)
is 162 MB with CREATE INDEX/REINDEX, and 121 MB with retail insertions
(assuming the insertions use the actual order from the test). I'm not
sure what to do about this, if anything. I mean, the reason that the
retail insertions do better is that they have the nbtsplitloc.c stuff,
and because we don't split the page until it's 100% full and until
deduplication stops helping -- we could apply several rounds of
deduplication before we actually have to split the page. So the
difference that we see here is both logical and surprising.

How do you feel about this CREATE INDEX index-size-is-larger business?

--
Peter Geoghegan



Re: [WIP] Effective storage of duplicates in B-tree index.

Peter Geoghegan-4
On Thu, Aug 29, 2019 at 5:07 PM Peter Geoghegan <[hidden email]> wrote:
> I agree that v9 might be ever so slightly more space efficient than v5
> was, on balance.

I see some Valgrind errors on v9, all of which look like the following
two sample errors I go into below.

First one:

==11193== VALGRINDERROR-BEGIN
==11193== Unaddressable byte(s) found during client check request
==11193==    at 0x4C0E03: PageAddItemExtended (bufpage.c:332)
==11193==    by 0x20F6C3: _bt_split (nbtinsert.c:1643)
==11193==    by 0x20F6C3: _bt_insertonpg (nbtinsert.c:1206)
==11193==    by 0x21239B: _bt_doinsert (nbtinsert.c:306)
==11193==    by 0x2150EE: btinsert (nbtree.c:207)
==11193==    by 0x20D63A: index_insert (indexam.c:186)
==11193==    by 0x36B7F2: ExecInsertIndexTuples (execIndexing.c:393)
==11193==    by 0x391793: ExecInsert (nodeModifyTable.c:593)
==11193==    by 0x3924DC: ExecModifyTable (nodeModifyTable.c:2219)
==11193==    by 0x37306D: ExecProcNodeFirst (execProcnode.c:445)
==11193==    by 0x36C738: ExecProcNode (executor.h:240)
==11193==    by 0x36C738: ExecutePlan (execMain.c:1648)
==11193==    by 0x36C738: standard_ExecutorRun (execMain.c:365)
==11193==    by 0x36C7DD: ExecutorRun (execMain.c:309)
==11193==    by 0x4CC41A: ProcessQuery (pquery.c:161)
==11193==    by 0x4CC5EB: PortalRunMulti (pquery.c:1283)
==11193==    by 0x4CD31C: PortalRun (pquery.c:796)
==11193==    by 0x4C8EFC: exec_simple_query (postgres.c:1231)
==11193==    by 0x4C9EE0: PostgresMain (postgres.c:4256)
==11193==    by 0x453650: BackendRun (postmaster.c:4446)
==11193==    by 0x453650: BackendStartup (postmaster.c:4137)
==11193==    by 0x453650: ServerLoop (postmaster.c:1704)
==11193==    by 0x454CAC: PostmasterMain (postmaster.c:1377)
==11193==    by 0x3B85A1: main (main.c:210)
==11193==  Address 0x9c11350 is 0 bytes after a recently re-allocated
block of size 8,192 alloc'd
==11193==    at 0x4C2FB0F: malloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==11193==    by 0x61085A: AllocSetAlloc (aset.c:914)
==11193==    by 0x617AD8: palloc (mcxt.c:938)
==11193==    by 0x21A829: _bt_mkscankey (nbtutils.c:107)
==11193==    by 0x2118F3: _bt_doinsert (nbtinsert.c:93)
==11193==    by 0x2150EE: btinsert (nbtree.c:207)
==11193==    by 0x20D63A: index_insert (indexam.c:186)
==11193==    by 0x36B7F2: ExecInsertIndexTuples (execIndexing.c:393)
==11193==    by 0x391793: ExecInsert (nodeModifyTable.c:593)
==11193==    by 0x3924DC: ExecModifyTable (nodeModifyTable.c:2219)
==11193==    by 0x37306D: ExecProcNodeFirst (execProcnode.c:445)
==11193==    by 0x36C738: ExecProcNode (executor.h:240)
==11193==    by 0x36C738: ExecutePlan (execMain.c:1648)
==11193==    by 0x36C738: standard_ExecutorRun (execMain.c:365)
==11193==    by 0x36C7DD: ExecutorRun (execMain.c:309)
==11193==    by 0x4CC41A: ProcessQuery (pquery.c:161)
==11193==    by 0x4CC5EB: PortalRunMulti (pquery.c:1283)
==11193==    by 0x4CD31C: PortalRun (pquery.c:796)
==11193==    by 0x4C8EFC: exec_simple_query (postgres.c:1231)
==11193==    by 0x4C9EE0: PostgresMain (postgres.c:4256)
==11193==    by 0x453650: BackendRun (postmaster.c:4446)
==11193==    by 0x453650: BackendStartup (postmaster.c:4137)
==11193==    by 0x453650: ServerLoop (postmaster.c:1704)
==11193==    by 0x454CAC: PostmasterMain (postmaster.c:1377)
==11193==
==11193== VALGRINDERROR-END
{
   <insert_a_suppression_name_here>
   Memcheck:User
   fun:PageAddItemExtended
   fun:_bt_split
   fun:_bt_insertonpg
   fun:_bt_doinsert
   fun:btinsert
   fun:index_insert
   fun:ExecInsertIndexTuples
   fun:ExecInsert
   fun:ExecModifyTable
   fun:ExecProcNodeFirst
   fun:ExecProcNode
   fun:ExecutePlan
   fun:standard_ExecutorRun
   fun:ExecutorRun
   fun:ProcessQuery
   fun:PortalRunMulti
   fun:PortalRun
   fun:exec_simple_query
   fun:PostgresMain
   fun:BackendRun
   fun:BackendStartup
   fun:ServerLoop
   fun:PostmasterMain
   fun:main
}

nbtinsert.c:1643 is the first PageAddItem() in _bt_split() -- the
lefthikey call.

Second one:

==11193== VALGRINDERROR-BEGIN
==11193== Invalid read of size 2
==11193==    at 0x20FDF5: _bt_insertonpg (nbtinsert.c:1126)
==11193==    by 0x21239B: _bt_doinsert (nbtinsert.c:306)
==11193==    by 0x2150EE: btinsert (nbtree.c:207)
==11193==    by 0x20D63A: index_insert (indexam.c:186)
==11193==    by 0x36B7F2: ExecInsertIndexTuples (execIndexing.c:393)
==11193==    by 0x391793: ExecInsert (nodeModifyTable.c:593)
==11193==    by 0x3924DC: ExecModifyTable (nodeModifyTable.c:2219)
==11193==    by 0x37306D: ExecProcNodeFirst (execProcnode.c:445)
==11193==    by 0x36C738: ExecProcNode (executor.h:240)
==11193==    by 0x36C738: ExecutePlan (execMain.c:1648)
==11193==    by 0x36C738: standard_ExecutorRun (execMain.c:365)
==11193==    by 0x36C7DD: ExecutorRun (execMain.c:309)
==11193==    by 0x4CC41A: ProcessQuery (pquery.c:161)
==11193==    by 0x4CC5EB: PortalRunMulti (pquery.c:1283)
==11193==    by 0x4CD31C: PortalRun (pquery.c:796)
==11193==    by 0x4C8EFC: exec_simple_query (postgres.c:1231)
==11193==    by 0x4C9EE0: PostgresMain (postgres.c:4256)
==11193==    by 0x453650: BackendRun (postmaster.c:4446)
==11193==    by 0x453650: BackendStartup (postmaster.c:4137)
==11193==    by 0x453650: ServerLoop (postmaster.c:1704)
==11193==    by 0x454CAC: PostmasterMain (postmaster.c:1377)
==11193==    by 0x3B85A1: main (main.c:210)
==11193==  Address 0x9905b90 is 11,088 bytes inside a recently
re-allocated block of size 524,288 alloc'd
==11193==    at 0x4C2FB0F: malloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==11193==    by 0x61085A: AllocSetAlloc (aset.c:914)
==11193==    by 0x617AD8: palloc (mcxt.c:938)
==11193==    by 0x1C5677: CopyIndexTuple (indextuple.c:508)
==11193==    by 0x20E887: _bt_compress_one_page (nbtinsert.c:2751)
==11193==    by 0x21241E: _bt_findinsertloc (nbtinsert.c:773)
==11193==    by 0x21241E: _bt_doinsert (nbtinsert.c:303)
==11193==    by 0x2150EE: btinsert (nbtree.c:207)
==11193==    by 0x20D63A: index_insert (indexam.c:186)
==11193==    by 0x36B7F2: ExecInsertIndexTuples (execIndexing.c:393)
==11193==    by 0x391793: ExecInsert (nodeModifyTable.c:593)
==11193==    by 0x3924DC: ExecModifyTable (nodeModifyTable.c:2219)
==11193==    by 0x37306D: ExecProcNodeFirst (execProcnode.c:445)
==11193==    by 0x36C738: ExecProcNode (executor.h:240)
==11193==    by 0x36C738: ExecutePlan (execMain.c:1648)
==11193==    by 0x36C738: standard_ExecutorRun (execMain.c:365)
==11193==    by 0x36C7DD: ExecutorRun (execMain.c:309)
==11193==    by 0x4CC41A: ProcessQuery (pquery.c:161)
==11193==    by 0x4CC5EB: PortalRunMulti (pquery.c:1283)
==11193==    by 0x4CD31C: PortalRun (pquery.c:796)
==11193==    by 0x4C8EFC: exec_simple_query (postgres.c:1231)
==11193==    by 0x4C9EE0: PostgresMain (postgres.c:4256)
==11193==    by 0x453650: BackendRun (postmaster.c:4446)
==11193==    by 0x453650: BackendStartup (postmaster.c:4137)
==11193==    by 0x453650: ServerLoop (postmaster.c:1704)
==11193==
==11193== VALGRINDERROR-END
{
   <insert_a_suppression_name_here>
   Memcheck:Addr2
   fun:_bt_insertonpg
   fun:_bt_doinsert
   fun:btinsert
   fun:index_insert
   fun:ExecInsertIndexTuples
   fun:ExecInsert
   fun:ExecModifyTable
   fun:ExecProcNodeFirst
   fun:ExecProcNode
   fun:ExecutePlan
   fun:standard_ExecutorRun
   fun:ExecutorRun
   fun:ProcessQuery
   fun:PortalRunMulti
   fun:PortalRun
   fun:exec_simple_query
   fun:PostgresMain
   fun:BackendRun
   fun:BackendStartup
   fun:ServerLoop
   fun:PostmasterMain
   fun:main
}

nbtinsert.c:1126 is this code from _bt_insertonpg():

            elog(DEBUG4, "dest before (%u,%u)",
                 ItemPointerGetBlockNumberNoCheck((ItemPointer) dest),
                 ItemPointerGetOffsetNumberNoCheck((ItemPointer) dest));

This is probably harmless, but it needs to be fixed.

--
Peter Geoghegan



Re: [WIP] Effective storage of duplicates in B-tree index.

Peter Geoghegan-4
On Thu, Aug 29, 2019 at 10:10 PM Peter Geoghegan <[hidden email]> wrote:
> I see some Valgrind errors on v9, all of which look like the following
> two sample errors I go into below.

I've found a fix for these Valgrind issues. It's a matter of making
sure that _bt_truncate() sizes new pivot tuples properly, which is
quite subtle:

--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -2155,8 +2155,11 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
         {
             BTreeTupleClearBtIsPosting(pivot);
             BTreeTupleSetNAtts(pivot, keepnatts);
-            pivot->t_info &= ~INDEX_SIZE_MASK;
-            pivot->t_info |= BTreeTupleGetPostingOffset(firstright);
+            if (keepnatts == natts)
+            {
+                pivot->t_info &= ~INDEX_SIZE_MASK;
+                pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+            }
         }

I'm varying how the new pivot tuple is sized here according to whether
or not index_truncate_tuple() just does a CopyIndexTuple(). This very
slightly changes the behavior of the nbtsplitloc.c stuff, but that's
not a concern for me.

I will post a patch with this and other tweaks next week.

--
Peter Geoghegan



Re: [WIP] Effective storage of duplicates in B-tree index.

Peter Geoghegan-4
On Sat, Aug 31, 2019 at 1:04 AM Peter Geoghegan <[hidden email]> wrote:
> I've found a fix for these Valgrind issues.

Attached is v10, which fixes the Valgrind issue.

Other changes:

* The code now fully embraces the idea that posting list splits
involve "changing the incoming item" in a way that "avoids" having the
new/incoming item overlap with an existing posting list tuple. This
allowed me to cut down on the changes required within nbtinsert.c
considerably.

* Streamlined a lot of the code in nbtsearch.c. I was able to
significantly simplify _bt_compare() and _bt_binsrch_insert().

* Removed the DEBUG4 traces. A lot of these had to go when I
refactored the nbtsearch.c code, so I thought I might as well remove the
remaining ones. I hope that you don't mind (go ahead and add them back
where that makes sense).

* A backwards scan will return "logical tuples" in descending order
now. We should do this on general principle, and also because of the
possibility of future external code that expects and takes advantage
of consistent heap TID order.

This change might even have a small performance benefit today, though:
Index scans that visit multiple heap pages but only match on a single
key will only pin each heap page visited once. Visiting the heap pages
in descending order within a B-Tree page full of duplicates, but
ascending order within individual posting lists could result in
unnecessary extra pinning.

* Standardized terminology. We consistently call what the patch adds
"deduplication" rather than "compression".

* Added a new section on the design to the nbtree README. This is
fairly high level, and talks about dynamics that we can't really talk
about anywhere else, such as how nbtsplitloc.c "cooperates" with
deduplication, producing an effect that is greater than the sum of its
parts.

* I also made some changes to the WAL logging for leaf page insertions
and page splits.

I didn't add the optimization that you anticipated in your nbtxlog.h
comments (i.e. only WAL-log a rewritten posting list when it will go
on the left half of the split, just like the new/incoming item thing
we have already). I agree that that's a good idea, and should be added
soon. Actually, I think the whole "new item vs. rewritten posting list
item" thing makes the WAL logging confusing, so this is not really
about performance.

Maybe the easiest way to do this is also the way that performs best.
I'm thinking of this: maybe we could completely avoid WAL-logging the
entire rewritten/split posting list. After all, the contents of the
rewritten posting list are derived from the existing/original posting
list, as well as the new/incoming item. We can make the WAL record
much smaller on average by making standbys repeat a little bit of the
work performed on the primary. Maybe we could WAL-log
"in_posting_offset" itself, and an ItemPointerData (obviously the new
item offset number tells us the offset number of the posting list that
must be replaced/memmoved()'d). Then have the standby repeat some of
the work performed on the primary -- at least the work of swapping a
heap TID could be repeated on standbys, since that's very little extra
work for them, while it could really reduce the WAL volume. This might
actually be simpler.
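
To make that concrete, here is a rough standalone sketch of the kind of
record and redo step I have in mind. Everything below is illustrative --
the type names, field names, and helper are invented for this email, and
don't match the actual xl_btree_* layouts or the patch's code:

/* Illustrative only: a posting list split logged "logically" */
#include <stdint.h>

typedef struct ToyTid
{
    uint32_t    blkno;              /* heap block number */
    uint16_t    offnum;             /* heap offset number */
} ToyTid;

typedef struct ToyPostingSplitRecord
{
    uint16_t    offnum;             /* page offset of the posting list being split */
    uint16_t    in_posting_offset;  /* where the incoming TID lands inside it */
    ToyTid      newitem_tid;        /* the incoming heap TID itself */
} ToyPostingSplitRecord;

/*
 * Toy redo step: the standby re-derives the rewritten posting list instead
 * of reading it from WAL.  The incoming TID is placed at in_posting_offset,
 * and the displaced TID (the posting list's old maximum) is returned so
 * that the caller can insert it as the "new" item, just as the primary did.
 */
static ToyTid
toy_posting_split_redo(ToyTid *posting, int ntids,
                       const ToyPostingSplitRecord *rec)
{
    ToyTid      displaced = posting[ntids - 1];

    for (int i = ntids - 1; i > rec->in_posting_offset; i--)
        posting[i] = posting[i - 1];
    posting[rec->in_posting_offset] = rec->newitem_tid;

    return displaced;
}

The record stays the same small size no matter how long the posting list
is, which is the whole point.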

The WAL logging that I didn't touch in v10 is the most important thing
to improve. I am talking about the WAL-logging that is performed as
part of deduplicating all items on a page, to avoid a page split (i.e.
the WAL-logging within _bt_dedup_one_page()). That still just does a
log_newpage_buffer() in v10, which is pretty inefficient. Much like
the posting list split WAL logging stuff, WAL logging in
_bt_dedup_one_page() can probably be made more efficient by describing
deduplication in terms of logical changes. For example, the WAL
records should consist of metadata that could be read by a human as
"merge the tuples from offset number 15 until offset number 27".
Perhaps this could also share code with the posting list split stuff.
What do you think?
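
For example (again, these names are invented for illustration, not a
record that exists in the patch today), the WAL record could be little
more than a list of merge instructions:

/* Illustrative only: page deduplication described as logical intervals */
#include <stdint.h>

typedef struct ToyDedupInterval
{
    uint16_t    baseoff;    /* page offset of the first tuple in the group */
    uint16_t    nitems;     /* number of physical tuples merged into one */
} ToyDedupInterval;

typedef struct ToyDedupRecord
{
    uint16_t    nintervals;
    /* an array of ToyDedupInterval entries follows in the record payload */
} ToyDedupRecord;

Replay would walk the intervals, form one posting list tuple per interval
from the tuples already present on the standby's copy of the page, and
leave everything outside the intervals alone -- which should be far
smaller than a log_newpage_buffer() image in the common case.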

Once we make the WAL-logging within _bt_dedup_one_page() more
efficient, that also makes it fairly easy to make the deduplication
that it performs occur incrementally, maybe even very incrementally. I
can imagine the _bt_dedup_one_page() caller specifying "my new tuple
is 32 bytes, and I'd really like to not have to split the page, so
please at least do enough deduplication to make it fit". Delaying
deduplication increases the amount of time that we have to set the
LP_DEAD bit for remaining items on the page, which might be important.
Also, spreading out the volume of WAL produced by deduplication over
time might be important with certain workloads. We would still
probably do somewhat more work than strictly necessary to avoid a page
split if we were to make _bt_dedup_one_page() incremental like this,
though not by a huge amount.
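
As a toy model of the control flow I have in mind (the constants, types,
and bookkeeping here are all made up -- this is not code from the patch):

#include <stdbool.h>
#include <stdio.h>

#define TOY_PAGE_SIZE       8192
#define TOY_ITEM_OVERHEAD   16      /* pretend overhead saved per merged-away tuple */

typedef struct ToyItem
{
    int         key;                /* duplicates share a key */
    int         size;               /* bytes the item consumes on the page */
} ToyItem;

/*
 * Merge runs of equal keys left to right, but stop as soon as the page has
 * enough free space for the incoming tuple.  Returns true if the caller no
 * longer needs to split the page.
 */
static bool
toy_dedup_until_fits(ToyItem *items, int *nitems, int *used, int newitemsz)
{
    int         i = 0;

    while (TOY_PAGE_SIZE - *used < newitemsz && i < *nitems - 1)
    {
        if (items[i].key == items[i + 1].key)
        {
            /* absorb items[i + 1] into items[i], saving its overhead */
            items[i].size += items[i + 1].size - TOY_ITEM_OVERHEAD;
            *used -= TOY_ITEM_OVERHEAD;
            for (int j = i + 1; j < *nitems - 1; j++)
                items[j] = items[j + 1];
            (*nitems)--;
        }
        else
            i++;
    }

    return TOY_PAGE_SIZE - *used >= newitemsz;
}

int
main(void)
{
    /* 'used' is tracked separately in this toy; pretend the page is nearly full */
    ToyItem     items[] = {{1, 100}, {1, 100}, {1, 100}, {2, 100}};
    int         nitems = 4;
    int         used = TOY_PAGE_SIZE - 16;  /* only 16 bytes free */

    printf("fits without split: %d\n",
           toy_dedup_until_fits(items, &nitems, &used, 32));
    return 0;
}

The point is just that the loop can stop after the first merge if that is
already enough, rather than always rewriting the whole page.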

OTOH, maybe I am completely wrong about "incremental deduplication"
being a good idea. It seems worth experimenting with, though. It's not
that much more work on top of making the _bt_dedup_one_page()
WAL-logging efficient, which seems like the thing we should focus on
now.

Thoughts?
--
Peter Geoghegan

v10-0002-DEBUG-Add-pageinspect-instrumentation.patch (10K) Download Attachment
v10-0001-Add-deduplication-to-nbtree.patch (142K) Download Attachment

Re: [WIP] Effective storage of duplicates in B-tree index.

Peter Geoghegan-4
On Mon, Sep 2, 2019 at 6:53 PM Peter Geoghegan <[hidden email]> wrote:
> Attach is v10, which fixes the Valgrind issue.

Attached is v11, which makes the kill_prior_tuple optimization work
with posting list tuples. The only catch is that it can only work when
all "logical tuples" within a posting list are known-dead, since of
course there is only one LP_DEAD bit available for each posting list.

The hardest part of this kill_prior_tuple work was writing the new
_bt_killitems() code, which I'm still not 100% happy with. Still, it
seems to work well -- new pageinspect LP_DEAD status info was added to
the second patch to verify that we're setting LP_DEAD bits as needed
for posting list tuples. I also had to add a new nbtree-specific,
posting-list-aware version of index_compute_xid_horizon_for_tuples()
-- _bt_compute_xid_horizon_for_tuples(). Finally, it was necessary to
avoid splitting a posting list with the LP_DEAD bit set. I took a
naive approach to avoiding that problem, adding code to
_bt_findinsertloc() to prevent it. Posting list splits are generally
assumed to be rare, so the fact that this is slightly inefficient
should be fine IMV.
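
The posting list rule itself fits in a few lines. Here is a standalone
sketch of the only case where the bit can be set -- hypothetical types,
not the real _bt_killitems() code:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t ToyHeapTid;        /* stand-in for ItemPointerData */

typedef struct ToyPostingItem
{
    bool        lp_dead;            /* the single dead bit for the whole item */
    int         ntids;
    ToyHeapTid  tids[64];           /* heap TIDs, one per "logical tuple" */
} ToyPostingItem;

/* is_dead: caller-supplied check for "was this TID returned and killed?" */
static void
toy_maybe_set_lp_dead(ToyPostingItem *item, bool (*is_dead)(ToyHeapTid))
{
    for (int i = 0; i < item->ntids; i++)
    {
        if (!is_dead(item->tids[i]))
            return;                 /* some logical tuple may still be visible */
    }
    item->lp_dead = true;           /* every TID is dead: safe to set the bit */
}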

I also refactored deduplication itself in anticipation of making the
WAL logging more efficient, and incremental. So, the structure of the
code within _bt_dedup_one_page() was simplified, without really
changing it very much (I think). I also fixed a bug in
_bt_dedup_one_page(). The check for dead items was broken in previous
versions, because the loop examined the high key tuple in every
iteration.

Making _bt_dedup_one_page() more efficient and incremental is still
the most important open item for the patch.

--
Peter Geoghegan

v11-0001-Add-deduplication-to-nbtree.patch (158K) Download Attachment
v11-0002-DEBUG-Add-pageinspect-instrumentation.patch (11K) Download Attachment

Re: [WIP] Effective storage of duplicates in B-tree index.

Anastasia Lubennikova
09.09.2019 22:54, Peter Geoghegan wrote:

>
> Attached is v11, which makes the kill_prior_tuple optimization work
> with posting list tuples. The only catch is that it can only work when
> all "logical tuples" within a posting list are known-dead, since of
> course there is only one LP_DEAD bit available for each posting list.
>
> The hardest part of this kill_prior_tuple work was writing the new
> _bt_killitems() code, which I'm still not 100% happy with. Still, it
> seems to work well -- new pageinspect LP_DEAD status info was added to
> the second patch to verify that we're setting LP_DEAD bits as needed
> for posting list tuples. I also had to add a new nbtree-specific,
> posting-list-aware version of index_compute_xid_horizon_for_tuples()
> -- _bt_compute_xid_horizon_for_tuples(). Finally, it was necessary to
> avoid splitting a posting list with the LP_DEAD bit set. I took a
> naive approach to avoiding that problem, adding code to
> _bt_findinsertloc() to prevent it. Posting list splits are generally
> assumed to be rare, so the fact that this is slightly inefficient
> should be fine IMV.
>
> I also refactored deduplication itself in anticipation of making the
> WAL logging more efficient, and incremental. So, the structure of the
> code within _bt_dedup_one_page() was simplified, without really
> changing it very much (I think). I also fixed a bug in
> _bt_dedup_one_page(). The check for dead items was broken in previous
> versions, because the loop examined the high key tuple in every
> iteration.
>
> Making _bt_dedup_one_page() more efficient and incremental is still
> the most important open item for the patch.

Hi, thank you for the fixes and improvements.
I reviewed them, and everything looks good except for the idea of not
splitting dead posting tuples.
According to the comments about scan->ignore_killed_tuples in genam.c:107,
it may lead to incorrect tuple order on a replica.
I'm not sure whether that leads to any real problem, though, or whether it
will be resolved by subsequent visibility checks. Either way, it's worth
adding more comments to _bt_killitems() explaining why it's safe.


Attached is v12, which contains WAL optimizations for posting list splits
and page deduplication. Changes relative to the prior version:

* The xl_btree_split record doesn't contain the posting tuple anymore;
instead, it keeps 'in_posting_offset' and repeats the logic of
_bt_insertonpg(), as you proposed upthread.

* I introduced a new xlog record, XLOG_BTREE_DEDUP_PAGE, which contains
info about the groups of tuples that were deduplicated into posting
tuples. In principle, it would be possible to fit this into some existing
record, but I preferred to keep things clear.

I haven't measured how these changes affect WAL size yet.
Do you have any suggestions on how to automate testing of new WAL records?
Is there any suitable place in regression tests?

* I also noticed that _bt_dedup_one_page() can be optimized to return
early when no tuples were deduplicated. I wonder if we could introduce
some internal statistics to tune deduplication? That would be a return to
the idea of BT_COMPRESS_THRESHOLD, which can help to avoid extra work for
pages that have very few duplicates, or pages that are already full of
posting lists. Something like the pre-check sketched below is what I have
in mind.
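
This is just a sketch with made-up names and a made-up threshold, not code
from the patch:

#include <stdbool.h>

#define TOY_COMPRESS_THRESHOLD  3   /* made-up tuning knob */

typedef struct ToyTuple
{
    int         key;
    bool        is_posting;         /* already a posting list tuple? */
} ToyTuple;

/* Skip deduplication when too few plain duplicates could be folded away. */
static bool
toy_worth_deduplicating(const ToyTuple *tuples, int ntuples)
{
    int         mergeable = 0;

    for (int i = 1; i < ntuples; i++)
    {
        if (!tuples[i].is_posting && tuples[i].key == tuples[i - 1].key)
            mergeable++;
    }

    return mergeable >= TOY_COMPRESS_THRESHOLD;
}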

To be honest, I don't believe that incremental deduplication can really
improve anything: no matter how many items are deduplicated, we still
rewrite all of the items from the original page onto the new one, so why
not do our best? What do we save with the incremental approach?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


v12-0001-Add-deduplication-to-nbtree.patch (129K) Download Attachment