[PROPOSAL] Effective storage of duplicates in B-tree index.


[PROPOSAL] Effective storage of duplicates in B-tree index.

Anastasia Lubennikova
Hi, hackers!
I'm going to begin work on effective storage of duplicate keys in B-tree index.
The main idea is to implement posting lists and posting trees for B-tree index pages as it's already done for GIN.

In a nutshell, effective storage of duplicates in GIN is organised as follows.
The index stores a single index tuple for each unique key. That index tuple points to a posting list, which contains pointers to heap tuples (TIDs). If too many rows have the same key, multiple pages are allocated for the TIDs, and these constitute a so-called posting tree.
You can find wonderful detailed descriptions in the GIN readme and articles.
This also makes it possible to apply a compression algorithm to the posting list/tree and significantly decrease the index size. Read more in the presentation (part 1).

Currently, a new B-tree index tuple must be inserted for each table row that we index, and that can cause a page split. Because of MVCC, even a unique index can contain duplicates.
Storing duplicates in a posting list/tree helps to avoid superfluous splits.

So it seems to be a very useful improvement. Of course, it requires a lot of changes in the B-tree implementation, so I need approval from the community.

1. Compatibility.
It's important to preserve compatibility with older index versions.
I'm going to change BTREE_VERSION to 3,
and use the new (posting) features for v3, keeping the old implementation for v2.
Any objections?

2. There are several tricks for handling non-unique keys in B-tree.
More info is in the btree readme (chapter "Differences to the Lehman & Yao algorithm").
In the new version they'll become useless. Am I right?

3. Microvacuum.
Killed items are marked LP_DEAD and can be deleted from a page at the time of insertion.
Right now that is fine, because each item corresponds to a separate TID. But a posting-list implementation requires a different approach. I've got two ideas:
The first is to set LP_DEAD only on those tuples in which none of the TIDs are visible any more.
The second is to add an LP_DEAD flag to each TID in the posting list (tree). This requires a bit more space, but allows microvacuum of the posting list/tree.
Which one is better?
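
Just to illustrate the second idea, here is a minimal C sketch (the names are invented, not existing PostgreSQL structures, and a real patch would more likely pack the flag into a spare bit of the stored TID):

#include "postgres.h"
#include "storage/itemptr.h"

/*
 * Minimal sketch of the second idea: keep a per-TID "dead" hint inside the
 * posting list instead of a single LP_DEAD flag for the whole item.
 */
typedef struct PostingEntrySketch
{
    ItemPointerData tid;        /* heap tuple this entry points to */
    bool        is_dead;        /* microvacuum hint, analogous to LP_DEAD */
} PostingEntrySketch;

/* The whole posting list is reclaimable only when every TID is dead. */
static bool
posting_list_all_dead(const PostingEntrySketch *entries, int nentries)
{
    int         i;

    for (i = 0; i < nentries; i++)
    {
        if (!entries[i].is_dead)
            return false;
    }
    return true;
}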
-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: [PROPOSAL] Effective storage of duplicates in B-tree index.

Tomas Vondra-4
Hi,

On 08/31/2015 09:41 AM, Anastasia Lubennikova wrote:

> Hi, hackers!
> I'm going to begin work on effective storage of duplicate keys in B-tree
> index.
> The main idea is to implement posting lists and posting trees for B-tree
> index pages as it's already done for GIN.
>
> In a nutshell, effective storing of duplicates in GIN is organised as
> follows.
> Index stores single index tuple for each unique key. That index tuple
> points to posting list which contains pointers to heap tuples (TIDs). If
> too many rows having the same key, multiple pages are allocated for the
> TIDs and these constitute so called posting tree.
> You can find wonderful detailed descriptions in gin readme
> <https://github.com/postgres/postgres/blob/master/src/backend/access/gin/README>
> and articles <http://www.cybertec.at/gin-just-an-index-type/>.
> It also makes possible to apply compression algorithm to posting
> list/tree and significantly decrease index size. Read more in
> presentation (part 1)
> <http://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf>.
>
> Now new B-tree index tuple must be inserted for each table row that we
> index.
> It can possibly cause page split. Because of MVCC even unique index
> could contain duplicates.
> Storing duplicates in posting list/tree helps to avoid superfluous splits.
>
> So it seems to be very useful improvement. Of course it requires a lot
> of changes in B-tree implementation, so I need approval from community.

In general, index size is often a serious issue - cases where indexes
need more space than the tables themselves are not uncommon in my experience. So
I think efforts to lower the space requirements of indexes are good.

But if we introduce posting lists into btree indexes, how different are
they from GIN? It seems to me that if I create a GIN index (using
btree_gin), I do get mostly the same thing you propose, no?

Sure, there are differences - GIN indexes don't handle UNIQUE indexes,
but the compression can only be effective when there are duplicate rows.
So either the index is not UNIQUE (so the b-tree feature is not needed),
or there are many updates.

Which brings me to the other benefit of btree indexes - they are
designed for high concurrency. How much is this going to be affected by
introducing the posting lists?

kind regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PROPOSAL] Effective storage of duplicates in B-tree index.

Alexander Korotkov
Hi, Tomas!

On Mon, Aug 31, 2015 at 6:26 PM, Tomas Vondra <[hidden email]> wrote:
...

In general, index size is often a serious issue - cases where indexes need more space than tables are not quite uncommon in my experience. So I think the efforts to lower space requirements for indexes are good.

But if we introduce posting lists into btree indexes, how different are they from GIN? It seems to me that if I create a GIN index (using btree_gin), I do get mostly the same thing you propose, no?
 
Yes, in general GIN is a btree with effective duplicate handling plus support for splitting single datums into multiple keys.
This proposal is mostly about porting the duplicate handling from GIN to btree.

Sure, there are differences - GIN indexes don't handle UNIQUE indexes,

The difference between btree_gin and btree is not only the UNIQUE feature.
1) There is no gingettuple in GIN. GIN supports only bitmap scans. And it's not feasible to add gingettuple to GIN, at least with the same semantics as in btree.
2) GIN doesn't support multicolumn indexes in the way btree does. A multicolumn GIN is more like a set of separate single-column GINs: it doesn't have composite keys.
3) btree_gin can't effectively handle range searches: "a < x < b" would be handled as "a < x" intersect "x < b", which is extremely inefficient. It is possible to fix, but there is no clear proposal yet for how to fit this case into the GIN interface.
 
but the compression can only be effective when there are duplicate rows. So either the index is not UNIQUE (so the b-tree feature is not needed), or there are many updates.

From my observations, users can use btree_gin only in some cases. They like the compression, but mostly can't use btree_gin because of #1.

Which brings me to the other benefit of btree indexes - they are designed for high concurrency. How much is this going to be affected by introducing the posting lists?

I'd note that the current handling of duplicates in PostgreSQL is a hack on top of the original btree. It is designed that way in PostgreSQL's btree access method, not in btrees in general.
Posting lists shouldn't change concurrency much. Currently, in btree you have to lock one page exclusively when you're inserting a new value.
When a posting list is small and fits on one page, you have to do a similar thing: take an exclusive lock on one page to insert the new value.
When you have a posting tree, you have to take an exclusive lock on one page of the posting tree.

One could say that concurrency would become worse because the index would become smaller and the number of pages would become smaller too. Since the number of pages is smaller, backends are more likely to contend for the same page. But this argument can be used against any compression and in favor of any bloat.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [PROPOSAL] Effective storage of duplicates in B-tree index.

Tomas Vondra-4


On 09/01/2015 11:31 AM, Alexander Korotkov wrote:
...

>
> Yes, In general GIN is a btree with effective duplicates handling +
> support of splitting single datums into multiple keys.
> This proposal is mostly porting duplicates handling from GIN to btree.
>
>     Sure, there are differences - GIN indexes don't handle UNIQUE indexes,
>
>
> The difference between btree_gin and btree is not only UNIQUE feature.
> 1) There is no gingettuple in GIN. GIN supports only bitmap scans. And
> it's not feasible to add gingettuple to GIN. At least with same
> semantics as it is in btree.
> 2) GIN doesn't support multicolumn indexes in the way btree does.
> Multicolumn GIN is more like set of separate singlecolumn GINs: it
> doesn't have composite keys.
> 3) btree_gin can't effectively handle range searches. "a < x < b" would
> be handled as "a < x" intersect "x < b". That is extremely inefficient.
> It is possible to fix. However, there is no clear proposal how to fit
> this case into GIN interface, yet.
>
>     but the compression can only be effective when there are duplicate
>     rows. So either the index is not UNIQUE (so the b-tree feature is
>     not needed), or there are many updates.
>
>  From my observations users can use btree_gin only in some cases. They
> like compression, but can't use btree_gin mostly because of #1.

Thanks for the explanation! I'm not that familiar with GIN internals,
but this mostly matches my understanding. I have only mentioned UNIQUE
because the lack of gettuple() method seems obvious - and it works fine
when GIN indexes are used as "bitmap indexes".

But you're right - we can't do index only scans on GIN indexes, which is
a huge benefit of btree indexes.

>
>     Which brings me to the other benefit of btree indexes - they are
>     designed for high concurrency. How much is this going to be affected
>     by introducing the posting lists?
>
>
> I'd notice that current duplicates handling in PostgreSQL is hack over
> original btree. It is designed so in btree access method in PostgreSQL,
> not btree in general.
> Posting lists shouldn't change concurrency much. Currently, in btree you
> have to lock one page exclusively when you're inserting new value.
> When posting list is small and fits one page you have to do similar
> thing: exclusive lock of one page to insert new value.
> When you have posting tree, you have to do exclusive lock on one page of
> posting tree.

OK.

>
> One can say that concurrency would become worse because the index would
> become smaller and the number of pages would become smaller too. Since
> the number of pages would be smaller, backends are more likely to contend for
> the same page. But this argument can be used against any compression and
> for any bloat.

Which might be a problem for some use cases, but I assume we could add
an option disabling this per-index. Probably having it "off" by default,
and only enabling the compression explicitly.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PROPOSAL] Effective storage of duplicates in B-tree index.

Peter Geoghegan-3
In reply to this post by Anastasia Lubennikova
On Mon, Aug 31, 2015 at 12:41 AM, Anastasia Lubennikova
<[hidden email]> wrote:
> Now new B-tree index tuple must be inserted for each table row that we
> index.
> It can possibly cause page split. Because of MVCC even unique index could
> contain duplicates.
> Storing duplicates in posting list/tree helps to avoid superfluous splits.

I'm glad someone is thinking about this, because it is certainly
needed. I thought about working on it myself, but there is always
something else to do. I should be able to assist with review, though.

> So it seems to be very useful improvement. Of course it requires a lot of
> changes in B-tree implementation, so I need approval from community.
>
> 1. Compatibility.
> It's important to save compatibility with older index versions.
> I'm going to change BTREE_VERSION to 3.
> And use new (posting) features for v3, saving old implementation for v2.
> Any objections?

It might be better to just have a flag bit for pages that are
compressed -- there are IIRC 8 free bits in the B-Tree page special
area flags variable. But no real opinion on this from me, yet. You
have plenty of bitspace to work with to mark B-Tree pages, in any
case.
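
For example, a sketch of what such a flag might look like (BTP_HAS_POSTING is hypothetical; the surrounding types and macros are the existing ones):

#include "postgres.h"
#include "access/nbtree.h"
#include "storage/bufpage.h"

/*
 * Sketch of the flag-bit idea.  BTPageOpaqueData, btpo_flags and the existing
 * BTP_* flags are real; BTP_HAS_POSTING is a hypothetical new bit taken from
 * the unused part of the 16-bit flags word.
 */
#define BTP_HAS_POSTING     (1 << 8)    /* hypothetical: page holds posting lists */

static bool
page_has_posting_lists(Page page)
{
    BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);

    return (opaque->btpo_flags & BTP_HAS_POSTING) != 0;
}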

> 2. There are several tricks to handle non-unique keys in B-tree.
> More info in btree readme (chapter - Differences to the Lehman & Yao
> algorithm).
> In the new version they'll become useless. Am I right?

I think that the L&Y algorithm makes assumptions for the sake of
simplicity, rather than because they really believed that there were
real problems. For example, they say that deletion can occur offline
or something along those lines, even though that's clearly
impractical. They say that because they didn't want to write a paper
about deletion within B-Trees, I suppose.

See also, my opinion of how they claim to not need read locks [1].
Also, note that despite the fact that the GIN README mentions "Lehman
& Yao style right links", it doesn't actually do the L&Y trick of
avoiding lock coupling -- the whole point of L&Y -- so that remark is
misleading. This must be why B-Tree has much better concurrency than
GIN in practice.

Anyway, the way that I always imagined this would work is a layer
"below" the current implementation. In other words, you could easily
have prefix compression with a prefix that could end at a point within
a reference IndexTuple. It could be any arbitrary point in the second
or subsequent attribute, and would not "care" about the structure of
the IndexTuple when it comes to where attributes begin and end, etc
(although, in reality, it probably would end up caring, because of the
complexity -- not caring is the ideal only, at least to me). As
Alexander pointed out, GIN does not care about composite keys.

That seems quite different to a GIN posting list (something that I
know way less about, FYI). So I'm really talking about a slightly
different thing -- prefix compression, rather than handling
duplicates. Whether or not you should do prefix compression instead of
deduplication is certainly not clear to me, but it should be
considered. Also, I always imagined that prefix compression would use
the highkey as the thing that is offset for each "real" IndexTuple,
because it's there anyway, and that's simple. However, I suppose that
that means that duplicate handling can't really work in a way that
makes duplicates have a fixed cost, which may be a particularly
important property to you.
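
To sketch the highkey-offset idea (purely illustrative; nothing like this exists in the tree today, and byte-wise comparison glosses over attribute boundaries and data types):

#include "postgres.h"

/*
 * Purely illustrative: compute how many leading bytes a key shares with the
 * page's high key, i.e. the part that would not need to be stored again.
 */
static Size
shared_prefix_len(const char *highkey, Size highkey_len,
                  const char *key, Size key_len)
{
    Size        n = Min(highkey_len, key_len);
    Size        i;

    for (i = 0; i < n; i++)
    {
        if (highkey[i] != key[i])
            break;
    }
    return i;
}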

> 3. Microvacuum.
> Killed items are marked LP_DEAD and could be deleted from separate page at
> time of insertion.
> Now it's fine, because each item corresponds with separate TID. But posting
> list implementation requires another way. I've got two ideas:
> First is to mark LP_DEAD only those tuples where all TIDs are not visible.
> Second is to add LP_DEAD flag to each TID in posting list(tree). This way
> requires a bit more space, but allows to do microvacuum of posting
> list/tree.

No real opinion on this point, except that I agree that doing
something is necessary.

Couple of further thoughts on this general topic:

* Currently, B-Tree must be able to store at least 3 items on each
page, for the benefit of the L&Y algorithm. You need room for 1
"highkey", plus 2 downlink IndexTuples. Obviously an internal B-Tree
page is redundant if you cannot get to any child page based on the
scanKey value differing one way or the other (so 2 downlinks are a
sensible minimum), plus a highkey is usually needed (just not on the
rightmost page). As you probably know, we enforce this by making sure
every IndexTuple is no more than 1/3 of the size that will fit.

You should start thinking about how to deal with this in a world where
the physical size could actually be quite variable. The solution is
probably to simply pretend that every IndexTuple is its original size.
This applies to both prefix compression and duplicate suppression, I
suppose.
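
For reference, the 1/3 limit is enforced today through the real BTMaxItemSize() macro; a sketch of the style of check made in nbtinsert.c (the error wording here is illustrative):

#include "postgres.h"
#include "access/nbtree.h"
#include "storage/bufpage.h"

/* Restates the existing size check; BTMaxItemSize() is the real macro. */
static void
check_index_tuple_size(Page page, Size itemsz)
{
    if (itemsz > BTMaxItemSize(page))
        elog(ERROR, "index tuple size %lu exceeds btree maximum %lu",
             (unsigned long) itemsz, (unsigned long) BTMaxItemSize(page));
}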

* Since everything is aligned within B-Tree, it's probably worth
considering the alignment boundaries when doing prefix compression, if
you want to go that way. We can probably imagine a world where
alignment is not required for B-Tree, which would work on x86
machines, but I can't see it happening soon. It isn't worth
compressing unless it compresses enough to cross an "alignment
boundary", where we're not actually obliged to store as much data on
disk. This point may be obvious, not sure.
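
A trivial sketch of that point, using the real MAXALIGN() macro (the helper itself is illustrative only):

#include "postgres.h"

/*
 * Index tuples are stored MAXALIGN'ed (typically to 8 bytes), so shrinking a
 * tuple only saves disk space once the saving crosses an alignment boundary.
 */
static bool
compression_saves_disk_space(Size orig_size, Size compressed_size)
{
    return MAXALIGN(compressed_size) < MAXALIGN(orig_size);
}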

[1] http://www.postgresql.org/message-id/flat/CAM3SWZT-T9o_dchK8E4_YbKQ+LPJTpd89E6dtPwhXnBV_5NE3Q@...#CAM3SWZT-T9o_dchK8E4_YbKQ+LPJTpd89E6dtPwhXnBV_5NE3Q@...

--
Peter Geoghegan



Re: [PROPOSAL] Effective storage of duplicates in B-tree index.

Anastasia Lubennikova


01.09.2015 21:23, Peter Geoghegan:

> On Mon, Aug 31, 2015 at 12:41 AM, Anastasia Lubennikova
> <[hidden email]> wrote:
>> Now new B-tree index tuple must be inserted for each table row that we
>> index.
>> It can possibly cause page split. Because of MVCC even unique index could
>> contain duplicates.
>> Storing duplicates in posting list/tree helps to avoid superfluous splits.
> I'm glad someone is thinking about this, because it is certainly
> needed. I thought about working on it myself, but there is always
> something else to do. I should be able to assist with review, though.
Thank you)

>> So it seems to be very useful improvement. Of course it requires a lot of
>> changes in B-tree implementation, so I need approval from community.
>>
>> 1. Compatibility.
>> It's important to save compatibility with older index versions.
>> I'm going to change BTREE_VERSION to 3.
>> And use new (posting) features for v3, saving old implementation for v2.
>> Any objections?
> It might be better to just have a flag bit for pages that are
> compressed -- there are IIRC 8 free bits in the B-Tree page special
> area flags variable. But no real opinion on this from me, yet. You
> have plenty of bitspace to work with to mark B-Tree pages, in any
> case.
>
Hmm... If we are talking about storing duplicates in posting lists (and
trees) as in GIN, I don't see a way to apply it to some pages
while not applying it to others. See some notes below.

>> 2. There are several tricks to handle non-unique keys in B-tree.
>> More info in btree readme (chapter - Differences to the Lehman & Yao
>> algorithm).
>> In the new version they'll become useless. Am I right?
> I think that the L&Y algorithm makes assumptions for the sake of
> simplicity, rather than because they really believed that there were
> real problems. For example, they say that deletion can occur offline
> or something along those lines, even though that's clearly
> impractical. They say that because they didn't want to write a paper
> about deletion within B-Trees, I suppose.
>
> See also, my opinion of how they claim to not need read locks [1].
> Also, note that despite the fact that the GIN README mentions "Lehman
> & Yao style right links", it doesn't actually do the L&Y trick of
> avoiding lock coupling -- the whole point of L&Y -- so that remark is
> misleading. This must be why B-Tree has much better concurrency than
> GIN in practice.

Yes, thanks for the extensive explanation.
I mean such tricks as moving right in _bt_findinsertloc(), for example.

/*----------
      * If we will need to split the page to put the item on this page,
      * check whether we can put the tuple somewhere to the right,
      * instead.  Keep scanning right until we
      *        (a) find a page with enough free space,
      *        (b) reach the last page where the tuple can legally go, or
      *        (c) get tired of searching.
      * (c) is not flippant; it is important because if there are many
      * pages' worth of equal keys, it's better to split one of the early
      * pages than to scan all the way to the end of the run of equal keys
      * on every insert.  We implement "get tired" as a random choice,
      * since stopping after scanning a fixed number of pages wouldn't work
      * well (we'd never reach the right-hand side of previously split
      * pages).  Currently the probability of moving right is set at 0.99,
      * which may seem too high to change the behavior much, but it does an
      * excellent job of preventing O(N^2) behavior with many equal keys.
      *----------
      */

If there are no multiple tuples with the same key, we shouldn't care
about this at all. It would be possible to skip these steps in an "effective
B-tree implementation". That's why I want to change btree_version.

>   So I'm really talking about a slightly
> different thing -- prefix compression, rather than handling
> duplicates. Whether or not you should do prefix compression instead of
> deduplication is certainly not clear to me, but it should be
> considered. Also, I always imagined that prefix compression would use
> the highkey as the thing that is offset for each "real" IndexTuple,
> because it's there anyway, and that's simple. However, I suppose that
> that means that duplicate handling can't really work in a way that
> makes duplicates have a fixed cost, which may be a particularly
> important property to you.

You're right, those are two different techniques.
1. Effective storage of duplicates, which I propose, works with equal
keys and allows us to avoid storing repeated keys.
Index tuples are stored like this:

IndexTupleData + Attrs (key) | IndexTupleData + Attrs (key) |
IndexTupleData + Attrs (key)

If all Attrs are equal, it seems reasonable not to repeat them. So we
can store them in the following structure:

MetaData + Attrs (key) | IndexTupleData | IndexTupleData | IndexTupleData

This is a posting list. It doesn't require significant changes to the index
page layout, because we can use an ordinary IndexTupleData for the meta
information. Each IndexTupleData has a fixed size, so it's easy to handle
the posting list as an array.
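
A rough sketch of the posting list part of that layout (illustrative names only; in the real layout the count and offset are carried in the meta tuple's header rather than in a separate field):

#include "postgres.h"
#include "storage/itemptr.h"

/*
 * Illustrative view of the posting list only (not the patch's actual
 * definitions).  The key attributes live once in an ordinary IndexTupleData
 * acting as the "meta" tuple; what follows is a plain array of fixed-size
 * (6-byte) ItemPointerData, which is why it is easy to treat as an array.
 */
typedef struct PostingListSketch
{
    uint16      ntids;          /* number of heap TIDs that follow */
    ItemPointerData tids[FLEXIBLE_ARRAY_MEMBER];    /* heap pointers */
} PostingListSketch;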

2. Prefix compression handles different keys and somehow compresses them.
I think it will require non-trivial changes to the btree index tuple
representation. Furthermore, any compression leads to extra
computation. Right now, I don't have a clear idea of how to implement this
technique.

> * Currently, B-Tree must be able to store at least 3 items on each
> page, for the benefit of the L&Y algorithm. You need room for 1
> "highkey", plus 2 downlink IndexTuples. Obviously an internal B-Tree
> page is redundant if you cannot get to any child page based on the
> scanKey value differing one way or the other (so 2 downlinks are a
> sensible minimum), plus a highkey is usually needed (just not on the
> rightmost page). As you probably know, we enforce this by making sure
> every IndexTuple is no more than 1/3 of the size that will fit.
That is the point at which an overly large posting list would turn into a posting
tree. But I think that, in the first patch, I'll do it another way: just
by splitting a long posting list into two lists of appropriate length.

> * Since everything is aligned within B-Tree, it's probably worth
> considering the alignment boundaries when doing prefix compression, if
> you want to go that way. We can probably imagine a world where
> alignment is not required for B-Tree, which would work on x86
> machines, but I can't see it happening soon. It isn't worth
> compressing unless it compresses enough to cross an "alignment
> boundary", where we're not actually obliged to store as much data on
> disk. This point may be obvious, not sure.

That is another reason why I doubt prefix compression, whereas
effective duplicate storage doesn't have this problem.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [PROPOSAL] Effective storage of duplicates in B-tree index.

Peter Geoghegan-3
On Thu, Sep 3, 2015 at 8:35 AM, Anastasia Lubennikova
<[hidden email]> wrote:

>> * Since everything is aligned within B-Tree, it's probably worth
>> considering the alignment boundaries when doing prefix compression, if
>> you want to go that way. We can probably imagine a world where
>> alignment is not required for B-Tree, which would work on x86
>> machines, but I can't see it happening soon. It isn't worth
>> compressing unless it compresses enough to cross an "alignment
>> boundary", where we're not actually obliged to store as much data on
>> disk. This point may be obvious, not sure.
>
> That is another reason why I doubt prefix compression, whereas effective
> duplicate storage doesn't have this problem.

Okay. That sounds reasonable. I think duplicate handling is a good project.

A good learning tool for Postgres B-Trees -- or at least one of the
better ones -- is my amcheck tool. See:

https://github.com/petergeoghegan/postgres/tree/amcheck

This is a tool for verifying B-Tree invariants hold, which is loosely
based on pageinspect. It checks that certain conditions hold for
B-Trees. A simple example is that all items on each page must be in the
correct, logical order. Some invariants checked are far more
complicated, though, and span multiple pages or multiple levels. See
the source code for exact details. This tool works well when running
the regression tests (see stress.sql -- I used it with pgbench), with
no problems reported last I checked. It often only needs light locks
on relations, and single shared locks on buffers. (Buffers are copied
to local memory for the tool to operate on, much like
contrib/pageinspect).
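
To give a flavor of the simplest invariant, here is an illustrative check (not amcheck's actual code) that leaf page items are in order, built from the real _bt_mkscankey()/_bt_compare() routines:

#include "postgres.h"
#include "access/itup.h"
#include "access/nbtree.h"
#include "access/skey.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

/* Illustrative only: warn if leaf items are not in non-decreasing key order. */
static void
check_leaf_page_order(Relation rel, Page page)
{
    BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
    OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
    OffsetNumber off;
    int         natts = RelationGetNumberOfAttributes(rel);

    if (!P_ISLEAF(opaque))
        return;                 /* keep the sketch to the simple leaf case */

    for (off = P_FIRSTDATAKEY(opaque); off < maxoff; off++)
    {
        IndexTuple  itup = (IndexTuple) PageGetItem(page,
                                                    PageGetItemId(page, off));
        ScanKey     skey = _bt_mkscankey(rel, itup);

        /* the item at 'off' must not sort after the item at 'off + 1' */
        if (_bt_compare(rel, natts, skey, page, off + 1) > 0)
            elog(WARNING, "index items out of order at offset %u", off);

        _bt_freeskey(skey);
    }
}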

While I have yet to formally submit amcheck to a CF (I once asked for
input on the goals for the project on -hackers), the comments are
fairly comprehensive, and it wouldn't be too hard to adapt this to
guide your work on duplicate handling. Maybe it'll happen for 9.6.
Feedback appreciated.

The tool calls _bt_compare() for many things currently, but doesn't
care about many lower level details, which is (very roughly speaking)
the level that duplicate handling will work at. You aren't actually
proposing to change anything about the fundamental structure that
B-Tree indexes have, so the tool could be quite useful and low-effort
for debugging your code during development.

Debugging this stuff is sometimes like keyhole surgery. If you could
just see at/get to the structure that you care about, it would be 10
times easier. Hopefully this tool makes it easier to identify problems.

--
Peter Geoghegan



Re: [PROPOSAL] Effective storage of duplicates in B-tree index.

Peter Geoghegan-3
On Sun, Sep 27, 2015 at 4:11 PM, Peter Geoghegan <[hidden email]> wrote:
> Debugging this stuff is sometimes like keyhole surgery. If you could
> just see at/get to the structure that you care about, it would be 10
> times easier. Hopefully this tool makes it easier to identify problems.

I should add that the way that the L&Y technique works, and the way
that Postgres code is generally very robust/defensive can make direct
testing a difficult thing. I have seen cases where a completely messed
up B-Tree still gave correct results most of the time, and was just
slower. That can happen, for example, because the "move right" thing
results in a degenerate linear scan of the entire index. The
comparisons in the internal pages were totally messed up, but it
"didn't matter" once a scan could get to leaf pages and could move
right and find the value that way.

I wrote amcheck because I thought it was scary how B-Tree indexes
could be *completely* messed up without it being obvious; what hope is
there of a test finding a subtle problem in their structure, then?
Testing the invariants directly seemed like the only way to have a
chance of not introducing bugs when adding new stuff to the B-Tree
code. I believe that adding optimizations to the B-Tree code will be
important in the next couple of years, and there is no other way to
approach it IMV.

--
Peter Geoghegan



Re: [WIP] Effective storage of duplicates in B-tree index.

Anastasia Lubennikova
In reply to this post by Anastasia Lubennikova
31.08.2015 10:41, Anastasia Lubennikova:
...

I'd like to share the progress of my work. So here is a WIP patch.
It provides effective duplicate handling using posting lists, in the same way GIN does.

Layout of the tuples on the page is changed in the following way:
before:
TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key
with patch:
TID (N item pointers, posting list offset) + key, TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)
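
Since the posting list is a plain array of fixed-size ItemPointerData, and assuming it is kept ordered by TID as in GIN, the slot for a new heap pointer can be found with a simple binary search. An illustrative sketch (not the patch code), using the real ItemPointerCompare():

#include "postgres.h"
#include "storage/itemptr.h"

/* Sketch only: binary search for the insertion offset of a new TID. */
static int
posting_list_insert_position(ItemPointerData *tids, int ntids,
                             ItemPointer newtid)
{
    int         low = 0;
    int         high = ntids;

    while (low < high)
    {
        int         mid = low + (high - low) / 2;

        if (ItemPointerCompare(&tids[mid], newtid) < 0)
            low = mid + 1;
        else
            high = mid;
    }
    return low;                 /* offset at which newtid belongs */
}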

It seems that backward compatibility works well without any changes. But I haven't tested it properly yet.

Here are some test results. They were obtained with the test functions test_btbuild and test_ginbuild, which you can find in the attached SQL file.
i is the number of distinct values in the index, so i=1 means that all rows have the same key, and i=10000000 means that all keys are different.
The other columns contain the index size in MB.

i           B-tree (old)    B-tree (new)    GIN
1           214.234375      87.7109375      10.2109375
10          214.234375      87.7109375      10.71875
100         214.234375      87.4375         15.640625
1000        214.234375      86.2578125      31.296875
10000       214.234375      78.421875       104.3046875
100000      214.234375      65.359375       49.078125
1000000     214.234375      90.140625       106.8203125
10000000    214.234375      214.234375      534.0625

You can see that the last row contains the same index size for both B-tree columns, which is quite logical: there is no compression if all the keys are distinct.
The other cases look really nice to me.
The next thing to say is that I haven't implemented posting list compression yet, so there is still potential to decrease the size of the compressed btree further.

I'm almost sure there are still some tiny bugs and missing functions, but on the whole the patch is ready for testing.
I'd like to get feedback from testing the patch on some real datasets. Any bug reports and suggestions are welcome.

Here are a couple of useful queries to inspect the data inside the index pages:
create extension pageinspect;
select * from bt_metap('idx');
select bt.* from generate_series(1,1) as n, lateral bt_page_stats('idx', n) as bt;
select n, bt.* from generate_series(1,1) as n, lateral bt_page_items('idx', n) as bt;

And finally, the list of items I'm going to complete in the near future (a small sketch of item 1 follows the list):
1. Add a storage parameter 'enable_compression' for the btree access method, which specifies whether the index compresses duplicates; the default is 'off'.
2. Bring back microvacuum functionality for compressed indexes.
3. Improve insertion speed. Insertions became significantly slower with the compressed btree, which is obviously not what we want.
4. Clean up the code and comments, and add related documentation.
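
As a rough sketch of item 1, the proposed 'enable_compression' parameter could be registered through the existing reloptions API (only the API call is real; the option name and default are the proposal, and where the value gets consulted is still to be written):

#include "postgres.h"
#include "access/reloptions.h"

/* Sketch: register the proposed per-index boolean storage parameter. */
void
register_btree_compression_reloption(void)
{
    add_bool_reloption(RELOPT_KIND_BTREE,
                       "enable_compression",
                       "Enables posting-list compression of duplicate keys",
                       false);      /* proposed default: off */
}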
-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Attachment: btree_compression_1.0.patch (40K)
Attachment: btree_compression_test.sql (5K)

Re: [WIP] Effective storage of duplicates in B-tree index.

Thom Brown-2
On 28 January 2016 at 14:06, Anastasia Lubennikova <[hidden email]> wrote:

...

This doesn't apply cleanly against current git head.  Have you caught up past commit 65c5fcd35?
 
Thom

Re: [WIP] Effective storage of duplicates in B-tree index.

Anastasia Lubennikova
28.01.2016 18:12, Thom Brown:
...

This doesn't apply cleanly against current git head.  Have you caught up past commit 65c5fcd35?

Thank you for the notice. New patch is attached.

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Attachment: btree_compression_1.0(rebased).patch (59K)

Re: [WIP] Effective storage of duplicates in B-tree index.

Thom Brown-2
On 28 January 2016 at 16:12, Anastasia Lubennikova <[hidden email]> wrote:

...

Thank you for the notice. New patch is attached.

Thanks for the quick rebase.

Okay, a quick check with pgbench:

CREATE INDEX ON pgbench_accounts(bid);

Timing
Scale: master / patch
100: 10657ms / 13555ms (rechecked and got 9745ms)
500: 56909ms / 56985ms

Size
Scale: master / patch
100: 214MB / 87MB (40.7%)
500: 1071MB / 437MB (40.8%)

No performance issues from what I can tell.

I'm surprised that efficiencies can't be realised beyond this point.  Your results show a sweet spot at around 1000 / 10000000, with it getting slightly worse beyond that.  I kind of expected a lot of efficiency where all the values are the same, but perhaps that's due to my lack of understanding regarding the way they're being stored.

Thom

Re: [WIP] Effective storage of duplicates in B-tree index.

Peter Geoghegan-3
On Thu, Jan 28, 2016 at 9:03 AM, Thom Brown <[hidden email]> wrote:
> I'm surprised that efficiencies can't be realised beyond this point.  Your results show a sweet spot at around 1000 / 10000000, with it getting slightly worse beyond that.  I kind of expected a lot of efficiency where all the values are the same, but perhaps that's due to my lack of understanding regarding the way they're being stored.

I think that you'd need an I/O bound workload to see significant
benefits. That seems unsurprising. I believe that random I/O from
index writes is a big problem for us.



--
Peter Geoghegan



Re: [WIP] Effective storage of duplicates in B-tree index.

Thom Brown-2
On 28 January 2016 at 17:09, Peter Geoghegan <[hidden email]> wrote:
> On Thu, Jan 28, 2016 at 9:03 AM, Thom Brown <[hidden email]> wrote:
>> I'm surprised that efficiencies can't be realised beyond this point.  Your results show a sweet spot at around 1000 / 10000000, with it getting slightly worse beyond that.  I kind of expected a lot of efficiency where all the values are the same, but perhaps that's due to my lack of understanding regarding the way they're being stored.
>
> I think that you'd need an I/O bound workload to see significant
> benefits. That seems unsurprising. I believe that random I/O from
> index writes is a big problem for us.

I was thinking more from the point of view of the index size.  An
index containing 10 million duplicate values is around 40% of the size
of an index with 10 million unique values.

Thom



Re: [WIP] Effective storage of duplicates in B-tree index.

Thom Brown-2
In reply to this post by Thom Brown-2
On 28 January 2016 at 17:03, Thom Brown <[hidden email]> wrote:

...

Okay, now for some badness.  I've restored a database containing 2 tables, one 318MB, another 24kB.  The 318MB table contains 5 million rows with a sequential id column.  I get a problem if I try to delete many rows from it:

# delete from contacts where id % 3 != 0 ;
WARNING:  out of shared memory
WARNING:  out of shared memory
WARNING:  out of shared memory
WARNING:  out of shared memory
WARNING:  out of shared memory
WARNING:  out of shared memory

The query completes, but I get this message a lot before it does.

This happens even if I drop the primary key and foreign key constraints, so somehow the memory usage has massively increased with this patch.

Thom

Re: [WIP] Effective storage of duplicates in B-tree index.

Anastasia Lubennikova
In reply to this post by Thom Brown-2
28.01.2016 20:03, Thom Brown:
...

I'm surprised that efficiencies can't be realised beyond this point.  Your results show a sweet spot at around 1000 / 10000000, with it getting slightly worse beyond that.  I kind of expected a lot of efficiency where all the values are the same, but perhaps that's due to my lack of understanding regarding the way they're being stored.

Thank you for the prompt reply. I see what you're confused about. I'll try to clarify it.

First of all, what is implemented in the patch is not actually compression. It's more about index page layout changes to compact ItemPointers (TIDs).
Instead of TID + key, TID + key, ... we now store META + key + list of TIDs (also known as a posting list).

before:
TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key
with patch:
TID (N item pointers, posting list offset) + key, TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)

TID (N item pointers, posting list offset) - this is the meta information. So, we have to store this meta information in addition to useful data.
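To see the effect concretely, pageinspect can be used; a minimal sketch, with table and index names made up for illustration:

create extension if not exists pageinspect;
create table dup_test (id int);
insert into dup_test select 1 from generate_series(1, 10000);
create index dup_test_idx on dup_test (id);
-- with the patch, a leaf page should hold a few large posting-list tuples
-- instead of hundreds of small "TID + key" tuples:
select itemoffset, ctid, itemlen from bt_page_items('dup_test_idx', 1) limit 5;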

The next point is the requirement of having at least three tuples on a page: we need at least two tuples to point to the children, plus the high key.
This requirement limits the maximum index tuple size.
/*
 * Maximum size of a btree index entry, including its tuple header.
 *
 * We actually need to be able to fit three items on every page,
 * so restrict any one item to 1/3 the per-page available space.
 */
#define BTMaxItemSize(page) \
	MAXALIGN_DOWN((PageGetPageSize(page) - \
				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)

Although, it occurs to me just now that this size could be increased for compressed tuples, at least for leaf pages.

That's the reason why we end up storing more meta information than meets the eye.

For example, say we have 100000 duplicates of the same key. It seems that the compression should be really significant:
something like 1 meta TID + 1 key instead of 100000 keys --> 6 bytes (the size of the meta TID) + keysize instead of 600000 bytes.
But we have to split one huge posting list into smaller ones so that each fits into an index page.

It depends on the key size, of course. As I can see from pageinspect, an index on a single integer key has to split the tuples into pieces of size
2704, containing 447 TIDs in each posting list.
So we have
1 meta TID + 1 key instead of 447 keys. As you can see, that is much less impressive than expected.
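A quick back-of-the-envelope check of that figure, assuming the default 8 kB page, 6-byte ItemPointers, and roughly 16 bytes of index tuple header plus an int4 key (the exact overhead is an assumption):

-- BTMaxItemSize on an 8 kB page is 2704 bytes; what is left after the
-- header and key can hold about (2704 - 16) / 6 = 448 TIDs, which is
-- in line with the ~447 TIDs per posting list reported by pageinspect.
select (2704 - 16) / 6 as tids_per_posting_list;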

There is the idea of posting trees in GIN: the key is stored just once, and a posting list which doesn't fit into a page becomes a tree.
You can find a great article about it here: http://www.cybertec.at/2013/03/gin-just-an-index-type/
But I don't think that's the best way for the btree AM, because a posting tree is not designed to handle concurrent insertions.

As I mentioned before, I'm going to implement prefix compression of posting lists, which should be efficient and quite simple, since it's already implemented in GIN. You can find a presentation about it here: https://www.pgcon.org/2014/schedule/events/698.en.html
-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Reply | Threaded
Open this post in threaded view
|

Re: [WIP] Effective storage of duplicates in B-tree index.

Aleksander Alekseev
I tested this patch on x64 and ARM servers for a few hours today. The
only problem I could find is that INSERT works considerably slower after
applying the patch. Besides that, everything looks fine - no crashes, tests
pass, memory doesn't seem to leak, etc.
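For the record, a crude way to gauge that slowdown on a patched vs. unpatched build (table and index names are made up; only the relative timings matter):

\timing on
create table ins_test (id int);
create index ins_test_idx on ins_test (id);
-- many duplicates per key, so the posting-list code path is exercised:
insert into ins_test select g % 1000 from generate_series(1, 1000000) g;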

> Okay, now for some badness.  I've restored a database containing 2
> tables, one 318MB, another 24kB.  The 318MB table contains 5 million
> rows with a sequential id column.  I get a problem if I try to delete
> many rows from it:
> # delete from contacts where id % 3 != 0 ;
> WARNING:  out of shared memory
> WARNING:  out of shared memory
> WARNING:  out of shared memory

I didn't manage to reproduce this. Thom, could you describe the exact steps
to reproduce this issue, please?


Reply | Threaded
Open this post in threaded view
|

Re: [WIP] Effective storage of duplicates in B-tree index.

Thom Brown-2
On 29 January 2016 at 15:47, Aleksander Alekseev
<[hidden email]> wrote:

> I tested this patch on x64 and ARM servers for a few hours today. The
> only problem I could find is that INSERT works considerably slower after
> applying a patch. Beside that everything looks fine - no crashes, tests
> pass, memory doesn't seem to leak, etc.
>
>> Okay, now for some badness.  I've restored a database containing 2
>> tables, one 318MB, another 24kB.  The 318MB table contains 5 million
>> rows with a sequential id column.  I get a problem if I try to delete
>> many rows from it:
>> # delete from contacts where id % 3 != 0 ;
>> WARNING:  out of shared memory
>> WARNING:  out of shared memory
>> WARNING:  out of shared memory
>
> I didn't manage to reproduce this. Thom, could you describe exact steps
> to reproduce this issue please?

Sure, I used my pg_rep_test tool to create a primary (pg_rep_test
-r0), which creates an instance with a custom config, which is as
follows:

shared_buffers = 8MB
max_connections = 7
wal_level = 'hot_standby'
cluster_name = 'primary'
max_wal_senders = 3
wal_keep_segments = 6

Then create a pgbench data set (I didn't originally use pgbench, but
you can get the same results with it):

createdb -p 5530 pgbench
pgbench -p 5530 -i -s 100 pgbench

And delete some stuff:

thom@swift:~/Development/test$ psql -p 5530 pgbench
Timing is on.
psql (9.6devel)
Type "help" for help.


 ➤ psql://thom@[local]:5530/pgbench

# DELETE FROM pgbench_accounts WHERE aid % 3 != 0;
WARNING:  out of shared memory
WARNING:  out of shared memory
WARNING:  out of shared memory
WARNING:  out of shared memory
WARNING:  out of shared memory
WARNING:  out of shared memory
WARNING:  out of shared memory
...
WARNING:  out of shared memory
WARNING:  out of shared memory
DELETE 6666667
Time: 22218.804 ms

There were 358 lines of that warning message.  I don't get these
messages without the patch.

Thom


Reply | Threaded
Open this post in threaded view
|

Re: [WIP] Effective storage of duplicates in B-tree index.

Anastasia Lubennikova
29.01.2016 19:01, Thom Brown:
> On 29 January 2016 at 15:47, Aleksander Alekseev
> <[hidden email]> wrote:
>> I tested this patch on x64 and ARM servers for a few hours today. The
>> only problem I could find is that INSERT works considerably slower after
>> applying a patch. Beside that everything looks fine - no crashes, tests
>> pass, memory doesn't seem to leak, etc.
Thank you for testing. I rechecked that, and insertions are really very
very very slow. It seems like a bug.

>>> Okay, now for some badness.  I've restored a database containing 2
>>> tables, one 318MB, another 24kB.  The 318MB table contains 5 million
>>> rows with a sequential id column.  I get a problem if I try to delete
>>> many rows from it:
>>> # delete from contacts where id % 3 != 0 ;
>>> WARNING:  out of shared memory
>>> WARNING:  out of shared memory
>>> WARNING:  out of shared memory
>> I didn't manage to reproduce this. Thom, could you describe exact steps
>> to reproduce this issue please?
> Sure, I used my pg_rep_test tool to create a primary (pg_rep_test
> -r0), which creates an instance with a custom config, which is as
> follows:
>
> shared_buffers = 8MB
> max_connections = 7
> wal_level = 'hot_standby'
> cluster_name = 'primary'
> max_wal_senders = 3
> wal_keep_segments = 6
>
> Then create a pgbench data set (I didn't originally use pgbench, but
> you can get the same results with it):
>
> createdb -p 5530 pgbench
> pgbench -p 5530 -i -s 100 pgbench
>
> And delete some stuff:
>
> thom@swift:~/Development/test$ psql -p 5530 pgbench
> Timing is on.
> psql (9.6devel)
> Type "help" for help.
>
>
>   ➤ psql://thom@[local]:5530/pgbench
>
> # DELETE FROM pgbench_accounts WHERE aid % 3 != 0;
> WARNING:  out of shared memory
> WARNING:  out of shared memory
> WARNING:  out of shared memory
> WARNING:  out of shared memory
> WARNING:  out of shared memory
> WARNING:  out of shared memory
> WARNING:  out of shared memory
> ...
> WARNING:  out of shared memory
> WARNING:  out of shared memory
> DELETE 6666667
> Time: 22218.804 ms
>
> There were 358 lines of that warning message.  I don't get these
> messages without the patch.
>
> Thom
Thank you for this report.
I tried to reproduce it, but I couldn't. Debugging will be much easier now.

I hope I'll fix these issues within the next few days.

BTW, I found a silly mistake: the previous patch contained some unrelated
changes. I fixed it in the new version (attached).

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




btree_compression_2.0.patch (40K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [WIP] Effective storage of duplicates in B-tree index.

Thom Brown-2
On 29 January 2016 at 16:50, Anastasia Lubennikova
<[hidden email]> wrote:

> 29.01.2016 19:01, Thom Brown:
>>
>> On 29 January 2016 at 15:47, Aleksander Alekseev
>> <[hidden email]> wrote:
>>>
>>> I tested this patch on x64 and ARM servers for a few hours today. The
>>> only problem I could find is that INSERT works considerably slower after
>>> applying a patch. Beside that everything looks fine - no crashes, tests
>>> pass, memory doesn't seem to leak, etc.
>
> Thank you for testing. I rechecked that, and insertions are really very very
> very slow. It seems like a bug.
>
>>>> Okay, now for some badness.  I've restored a database containing 2
>>>> tables, one 318MB, another 24kB.  The 318MB table contains 5 million
>>>> rows with a sequential id column.  I get a problem if I try to delete
>>>> many rows from it:
>>>> # delete from contacts where id % 3 != 0 ;
>>>> WARNING:  out of shared memory
>>>> WARNING:  out of shared memory
>>>> WARNING:  out of shared memory
>>>
>>> I didn't manage to reproduce this. Thom, could you describe exact steps
>>> to reproduce this issue please?
>>
>> Sure, I used my pg_rep_test tool to create a primary (pg_rep_test
>> -r0), which creates an instance with a custom config, which is as
>> follows:
>>
>> shared_buffers = 8MB
>> max_connections = 7
>> wal_level = 'hot_standby'
>> cluster_name = 'primary'
>> max_wal_senders = 3
>> wal_keep_segments = 6
>>
>> Then create a pgbench data set (I didn't originally use pgbench, but
>> you can get the same results with it):
>>
>> createdb -p 5530 pgbench
>> pgbench -p 5530 -i -s 100 pgbench
>>
>> And delete some stuff:
>>
>> thom@swift:~/Development/test$ psql -p 5530 pgbench
>> Timing is on.
>> psql (9.6devel)
>> Type "help" for help.
>>
>>
>>   ➤ psql://thom@[local]:5530/pgbench
>>
>> # DELETE FROM pgbench_accounts WHERE aid % 3 != 0;
>> WARNING:  out of shared memory
>> WARNING:  out of shared memory
>> WARNING:  out of shared memory
>> WARNING:  out of shared memory
>> WARNING:  out of shared memory
>> WARNING:  out of shared memory
>> WARNING:  out of shared memory
>> ...
>> WARNING:  out of shared memory
>> WARNING:  out of shared memory
>> DELETE 6666667
>> Time: 22218.804 ms
>>
>> There were 358 lines of that warning message.  I don't get these
>> messages without the patch.
>>
>> Thom
>
>
> Thank you for this report.
> I tried to reproduce it, but I couldn't. Debug will be much easier now.
>
> I hope I'll fix these issueswithin the next few days.
>
> BTW, I found a dummy mistake, the previous patch contains some unrelated
> changes. I fixed it in the new version (attached).

Thanks.  Well I've tested this latest patch, and the warnings are no
longer generated.  However, the index sizes show that the patch
doesn't seem to be doing its job, so I'm wondering if you removed too
much from it.
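In case it helps to compare notes, the sizes in question can be checked with something like this (the 'pgbench_accounts%' pattern assumes the pgbench schema used earlier in the thread):

-- with the earlier patch version, the bid index came out at roughly 40%
-- of its size on master for this data set:
SELECT c.relname, pg_size_pretty(pg_relation_size(c.oid::regclass))
FROM pg_class c
WHERE c.relkind = 'i' AND c.relname LIKE 'pgbench_accounts%';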

Thom

