[PROPOSAL] Effective storage of duplicates in B-tree index.


Re: [WIP] Effective storage of duplicates in B-tree index.

Anastasia Lubennikova


29.01.2016 20:43, Thom Brown:

> On 29 January 2016 at 16:50, Anastasia Lubennikova
> <[hidden email]>  wrote:
>> 29.01.2016 19:01, Thom Brown:
>>> On 29 January 2016 at 15:47, Aleksander Alekseev
>>> <[hidden email]>  wrote:
>>>> I tested this patch on x64 and ARM servers for a few hours today. The
>>>> only problem I could find is that INSERT works considerably slower after
>>>> applying a patch. Beside that everything looks fine - no crashes, tests
>>>> pass, memory doesn't seem to leak, etc.
>> Thank you for testing. I rechecked that, and insertions are really very very
>> very slow. It seems like a bug.
>>
>>>>> Okay, now for some badness.  I've restored a database containing 2
>>>>> tables, one 318MB, another 24kB.  The 318MB table contains 5 million
>>>>> rows with a sequential id column.  I get a problem if I try to delete
>>>>> many rows from it:
>>>>> # delete from contacts where id % 3 != 0 ;
>>>>> WARNING:  out of shared memory
>>>>> WARNING:  out of shared memory
>>>>> WARNING:  out of shared memory
>>>> I didn't manage to reproduce this. Thom, could you describe exact steps
>>>> to reproduce this issue please?
>>> Sure, I used my pg_rep_test tool to create a primary (pg_rep_test
>>> -r0), which creates an instance with a custom config, which is as
>>> follows:
>>>
>>> shared_buffers = 8MB
>>> max_connections = 7
>>> wal_level = 'hot_standby'
>>> cluster_name = 'primary'
>>> max_wal_senders = 3
>>> wal_keep_segments = 6
>>>
>>> Then create a pgbench data set (I didn't originally use pgbench, but
>>> you can get the same results with it):
>>>
>>> createdb -p 5530 pgbench
>>> pgbench -p 5530 -i -s 100 pgbench
>>>
>>> And delete some stuff:
>>>
>>> thom@swift:~/Development/test$ psql -p 5530 pgbench
>>> Timing is on.
>>> psql (9.6devel)
>>> Type "help" for help.
>>>
>>>
>>>    ➤ psql://thom@[local]:5530/pgbench
>>>
>>> # DELETE FROM pgbench_accounts WHERE aid % 3 != 0;
>>> WARNING:  out of shared memory
>>> WARNING:  out of shared memory
>>> WARNING:  out of shared memory
>>> WARNING:  out of shared memory
>>> WARNING:  out of shared memory
>>> WARNING:  out of shared memory
>>> WARNING:  out of shared memory
>>> ...
>>> WARNING:  out of shared memory
>>> WARNING:  out of shared memory
>>> DELETE 6666667
>>> Time: 22218.804 ms
>>>
>>> There were 358 lines of that warning message.  I don't get these
>>> messages without the patch.
>>>
>>> Thom
>> Thank you for this report.
>> I tried to reproduce it, but I couldn't. Debugging will be much easier now.
>>
>> I hope I'll fix these issues within the next few days.
>>
>> BTW, I found a silly mistake: the previous patch contains some unrelated
>> changes. I fixed it in the new version (attached).
> Thanks.  Well I've tested this latest patch, and the warnings are no
> longer generated.  However, the index sizes show that the patch
> doesn't seem to be doing its job, so I'm wondering if you removed too
> much from it.

Huh, this patch seems to be enchanted. :) It works fine for me. Did you
perform "make distclean"?
Anyway, I'll send a new version soon.
I'm just writing to say that I haven't disappeared and I do remember
about the issue.
I've even almost fixed the insert speed problem, but I'm very busy this
week. I'll send an updated patch next week as soon as possible.

Thank you for your attention to this work.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [WIP] Effective storage of duplicates in B-tree index.

Thom Brown-2
On 2 February 2016 at 11:47, Anastasia Lubennikova
<[hidden email]> wrote:

>
>
> 29.01.2016 20:43, Thom Brown:
>
>> On 29 January 2016 at 16:50, Anastasia Lubennikova
>> <[hidden email]>  wrote:
>>>
>>> 29.01.2016 19:01, Thom Brown:
>>>>
>>>> On 29 January 2016 at 15:47, Aleksander Alekseev
>>>> <[hidden email]>  wrote:
>>>>>
>>>>> I tested this patch on x64 and ARM servers for a few hours today. The
>>>>> only problem I could find is that INSERT works considerably slower
>>>>> after
>>>>> applying a patch. Beside that everything looks fine - no crashes, tests
>>>>> pass, memory doesn't seem to leak, etc.
>>>
>>> Thank you for testing. I rechecked that, and insertions are really very
>>> very
>>> very slow. It seems like a bug.
>>>
>>>>>> Okay, now for some badness.  I've restored a database containing 2
>>>>>> tables, one 318MB, another 24kB.  The 318MB table contains 5 million
>>>>>> rows with a sequential id column.  I get a problem if I try to delete
>>>>>> many rows from it:
>>>>>> # delete from contacts where id % 3 != 0 ;
>>>>>> WARNING:  out of shared memory
>>>>>> WARNING:  out of shared memory
>>>>>> WARNING:  out of shared memory
>>>>>
>>>>> I didn't manage to reproduce this. Thom, could you describe exact steps
>>>>> to reproduce this issue please?
>>>>
>>>> Sure, I used my pg_rep_test tool to create a primary (pg_rep_test
>>>> -r0), which creates an instance with a custom config, which is as
>>>> follows:
>>>>
>>>> shared_buffers = 8MB
>>>> max_connections = 7
>>>> wal_level = 'hot_standby'
>>>> cluster_name = 'primary'
>>>> max_wal_senders = 3
>>>> wal_keep_segments = 6
>>>>
>>>> Then create a pgbench data set (I didn't originally use pgbench, but
>>>> you can get the same results with it):
>>>>
>>>> createdb -p 5530 pgbench
>>>> pgbench -p 5530 -i -s 100 pgbench
>>>>
>>>> And delete some stuff:
>>>>
>>>> thom@swift:~/Development/test$ psql -p 5530 pgbench
>>>> Timing is on.
>>>> psql (9.6devel)
>>>> Type "help" for help.
>>>>
>>>>
>>>>    ➤ psql://thom@[local]:5530/pgbench
>>>>
>>>> # DELETE FROM pgbench_accounts WHERE aid % 3 != 0;
>>>> WARNING:  out of shared memory
>>>> WARNING:  out of shared memory
>>>> WARNING:  out of shared memory
>>>> WARNING:  out of shared memory
>>>> WARNING:  out of shared memory
>>>> WARNING:  out of shared memory
>>>> WARNING:  out of shared memory
>>>> ...
>>>> WARNING:  out of shared memory
>>>> WARNING:  out of shared memory
>>>> DELETE 6666667
>>>> Time: 22218.804 ms
>>>>
>>>> There were 358 lines of that warning message.  I don't get these
>>>> messages without the patch.
>>>>
>>>> Thom
>>>
>>> Thank you for this report.
>>> I tried to reproduce it, but I couldn't. Debugging will be much easier now.
>>>
>>> I hope I'll fix these issues within the next few days.
>>>
>>> BTW, I found a silly mistake: the previous patch contains some unrelated
>>> changes. I fixed it in the new version (attached).
>>
>> Thanks.  Well I've tested this latest patch, and the warnings are no
>> longer generated.  However, the index sizes show that the patch
>> doesn't seem to be doing its job, so I'm wondering if you removed too
>> much from it.
>
>
> Huh, this patch seems to be enchanted. :) It works fine for me. Did you perform
> "make distclean"?

Yes.  Just tried it again:

git clean -fd
git stash
make distclean
patch -p1 < ~/Downloads/btree_compression_2.0.patch
../dopg.sh (script I've always used to build with)
pg_ctl start
createdb pgbench
pgbench -i -s 100 pgbench

$ psql pgbench
Timing is on.
psql (9.6devel)
Type "help" for help.


 ➤ psql://thom@[local]:5488/pgbench

# \di+
                                    List of relations
 Schema |         Name          | Type  | Owner |      Table       |  Size  | Description
--------+-----------------------+-------+-------+------------------+--------+-------------
 public | pgbench_accounts_pkey | index | thom  | pgbench_accounts | 214 MB |
 public | pgbench_branches_pkey | index | thom  | pgbench_branches | 24 kB  |
 public | pgbench_tellers_pkey  | index | thom  | pgbench_tellers  | 48 kB  |
(3 rows)

Previously, this would show an index size of 87MB for pgbench_accounts_pkey.

> Anyway, I'll send a new version soon.
> I'm just writing to say that I haven't disappeared and I do remember about the
> issue.
> I've even almost fixed the insert speed problem. But I'm very busy this
> week.
> I'll send an updated patch next week as soon as possible.

Thanks.

> Thank you for your attention to this work.

Thanks for your awesome patches.

Thom



Re: [WIP] Effective storage of duplicates in B-tree index.

Peter Geoghegan-3
On Tue, Feb 2, 2016 at 3:59 AM, Thom Brown <[hidden email]> wrote:
>  public | pgbench_accounts_pkey | index | thom  | pgbench_accounts | 214 MB |
>  public | pgbench_branches_pkey | index | thom  | pgbench_branches | 24 kB  |
>  public | pgbench_tellers_pkey  | index | thom  | pgbench_tellers  | 48 kB  |

I see the same.

I use my regular SQL query to see the breakdown of leaf/internal/root pages:

postgres=# with tots as (
  SELECT count(*) c,
  avg(live_items) avg_live_items,
  avg(dead_items) avg_dead_items,
  u.type,
  r.oid
  from (select c.oid,
          c.relpages,
          generate_series(1, c.relpages - 1) i
          from pg_index i
          join pg_opclass op on i.indclass[0] = op.oid
          join pg_am am on op.opcmethod = am.oid
          join pg_class c on i.indexrelid = c.oid
          where am.amname = 'btree') r,
        lateral (select * from bt_page_stats(r.oid::regclass::text, i)) u
  group by r.oid, type)
select ct.relname table_name,
  tots.oid::regclass::text index_name,
  (select relpages - 1 from pg_class c where c.oid = tots.oid) non_meta_pages,
  upper(type) page_type,
  c npages,
  to_char(avg_live_items, '990.999'),
  to_char(avg_dead_items, '990.999'),
  to_char(c/sum(c) over(partition by tots.oid) * 100, '990.999') || ' %' as prop_of_index
  from tots
  join pg_index i on i.indexrelid = tots.oid
  join pg_class ct on ct.oid = i.indrelid
  where tots.oid = 'pgbench_accounts_pkey'::regclass
  order by ct.relnamespace, table_name, index_name, npages, type;
    table_name    │      index_name       │ non_meta_pages │ page_type │ npages │ to_char  │ to_char  │ prop_of_index
──────────────────┼───────────────────────┼────────────────┼───────────┼────────┼──────────┼──────────┼───────────────
 pgbench_accounts │ pgbench_accounts_pkey │         27,421 │ R         │      1 │   97.000 │    0.000 │    0.004 %
 pgbench_accounts │ pgbench_accounts_pkey │         27,421 │ I         │     97 │  282.670 │    0.000 │    0.354 %
 pgbench_accounts │ pgbench_accounts_pkey │         27,421 │ L         │ 27,323 │  366.992 │    0.000 │   99.643 %
(3 rows)

But this looks healthy -- I see the same with master. And since the
accounts table is listed as 1281 MB, this looks like a plausible ratio
in the size of the table to its primary index (which I would not say
is true of an 87MB primary key index).
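
As a quick arithmetic sanity check, the non-meta page count above is
consistent with the 214 MB size reported by \di+, assuming the default
8 kB block size:

```python
non_meta_pages = 27421          # from the bt_page_stats breakdown above
BLCKSZ = 8192                   # assumed default PostgreSQL block size

# The total relation size also includes the meta page, hence the +1.
size_mb = (non_meta_pages + 1) * BLCKSZ / (1024 * 1024)
print(round(size_mb))           # 214
```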

Are you sure you have the details right, Thom?
--
Peter Geoghegan



Re: [WIP] Effective storage of duplicates in B-tree index.

Thom Brown-2
On 4 February 2016 at 15:07, Peter Geoghegan <[hidden email]> wrote:

> On Tue, Feb 2, 2016 at 3:59 AM, Thom Brown <[hidden email]> wrote:
>>  public | pgbench_accounts_pkey | index | thom  | pgbench_accounts | 214 MB |
>>  public | pgbench_branches_pkey | index | thom  | pgbench_branches | 24 kB  |
>>  public | pgbench_tellers_pkey  | index | thom  | pgbench_tellers  | 48 kB  |
>
> I see the same.
>
> I use my regular SQL query to see the breakdown of leaf/internal/root pages:
>
> postgres=# with tots as (
>   SELECT count(*) c,
>   avg(live_items) avg_live_items,
>   avg(dead_items) avg_dead_items,
>   u.type,
>   r.oid
>   from (select c.oid,
>           c.relpages,
>           generate_series(1, c.relpages - 1) i
>           from pg_index i
>           join pg_opclass op on i.indclass[0] = op.oid
>           join pg_am am on op.opcmethod = am.oid
>           join pg_class c on i.indexrelid = c.oid
>           where am.amname = 'btree') r,
>         lateral (select * from bt_page_stats(r.oid::regclass::text, i)) u
>   group by r.oid, type)
> select ct.relname table_name,
>   tots.oid::regclass::text index_name,
>   (select relpages - 1 from pg_class c where c.oid = tots.oid) non_meta_pages,
>   upper(type) page_type,
>   c npages,
>   to_char(avg_live_items, '990.999'),
>   to_char(avg_dead_items, '990.999'),
>   to_char(c/sum(c) over(partition by tots.oid) * 100, '990.999') || ' %' as prop_of_index
>   from tots
>   join pg_index i on i.indexrelid = tots.oid
>   join pg_class ct on ct.oid = i.indrelid
>   where tots.oid = 'pgbench_accounts_pkey'::regclass
>   order by ct.relnamespace, table_name, index_name, npages, type;
>     table_name    │      index_name       │ non_meta_pages │ page_type │ npages │ to_char  │ to_char  │ prop_of_index
> ──────────────────┼───────────────────────┼────────────────┼───────────┼────────┼──────────┼──────────┼───────────────
>  pgbench_accounts │ pgbench_accounts_pkey │         27,421 │ R         │      1 │   97.000 │    0.000 │    0.004 %
>  pgbench_accounts │ pgbench_accounts_pkey │         27,421 │ I         │     97 │  282.670 │    0.000 │    0.354 %
>  pgbench_accounts │ pgbench_accounts_pkey │         27,421 │ L         │ 27,323 │  366.992 │    0.000 │   99.643 %
> (3 rows)
>
> But this looks healthy -- I see the same with master. And since the
> accounts table is listed as 1281 MB, this looks like a plausible ratio
> in the size of the table to its primary index (which I would not say
> is true of an 87MB primary key index).
>
> Are you sure you have the details right, Thom?

*facepalm*

No, I'm not.  I've just realised that all I've been checking is the
primary key expecting it to change in size, which is, of course,
nonsense.  I should have been creating an index on the bid field of
pgbench_accounts and reviewing the size of that.

Now I've checked it with the latest patch, and can see it working
fine.  Apologies for the confusion.

Thom



Re: [WIP] Effective storage of duplicates in B-tree index.

Peter Geoghegan-3
On Thu, Feb 4, 2016 at 8:25 AM, Thom Brown <[hidden email]> wrote:
>
> No, I'm not.  I've just realised that all I've been checking is the
> primary key expecting it to change in size, which is, of course,
> nonsense.  I should have been creating an index on the bid field of
> pgbench_accounts and reviewing the size of that.

Right. Because, apart from everything else, unique indexes are not
currently supported.

--
Peter Geoghegan



Re: [WIP] Effective storage of duplicates in B-tree index.

Peter Geoghegan-3
In reply to this post by Anastasia Lubennikova
On Fri, Jan 29, 2016 at 8:50 AM, Anastasia Lubennikova
<[hidden email]> wrote:
> I fixed it in the new version (attached).

Some quick remarks on your V2.0:

* Seems unnecessary that _bt_binsrch() is passed a real pointer by all
callers. Maybe the one current posting list caller
_bt_findinsertloc(), or its caller, _bt_doinsert(), should do this
work itself:

@@ -373,7 +377,17 @@ _bt_binsrch(Relation rel,
     * scan key), which could be the last slot + 1.
     */
    if (P_ISLEAF(opaque))
+   {
+       if (low <= PageGetMaxOffsetNumber(page))
+       {
+           IndexTuple oitup = (IndexTuple) PageGetItem(page, PageGetItemId(page, low));
+           /* one excessive check of equality, for a possible posting tuple update or creation */
+           if ((_bt_compare(rel, keysz, scankey, page, low) == 0)
+               && (IndexTupleSize(oitup) + sizeof(ItemPointerData) < BTMaxItemSize(page)))
+               *updposing = true;
+       }
        return low;
+   }

* ISTM that you should not use _bt_compare() above, in any case. Consider this:

postgres=# select 5.0 = 5.000;
 ?column?
──────────
 t
(1 row)

B-Tree operator class indicates equality here. And yet, users will
expect to see the original value in an index-only scan, including the
trailing zeroes as they were originally input. So this should be a bit
closer to HeapSatisfiesHOTandKeyUpdate() (actually,
heap_tuple_attr_equals()), which looks for strict binary equality for
similar reasons.
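
The distinction can be illustrated outside of Postgres. As a rough
analogy (using Python's Decimal in place of the numeric type; this is
not the actual B-Tree code), operator-class equality can hold even when
the stored representations differ:

```python
from decimal import Decimal

a = Decimal("5.0")
b = Decimal("5.000")

# Operator-class-style equality: the values compare as equal...
print(a == b)            # True

# ...but their representations differ, and an index-only scan must
# return the value exactly as originally input, trailing zeroes and all.
print(str(a), str(b))    # 5.0 5.000
print(str(a) == str(b))  # False
```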

* Is this correct?:

@@ -555,7 +662,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
         * it off the old page, not the new one, in case we are not at leaf
         * level.
         */
-       state->btps_minkey = CopyIndexTuple(oitup);
+       ItemId iihk = PageGetItemId(opage, P_HIKEY);
+       IndexTuple hikey = (IndexTuple) PageGetItem(opage, iihk);
+       state->btps_minkey = CopyIndexTuple(hikey);

How this code has changed from the master branch is not clear to me.

I understand that this code is incomplete/draft:

+#define MaxPackedIndexTuplesPerPage    \
+   ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
+           (sizeof(ItemPointerData))))

But why is it different to the old (actually unchanged)
MaxIndexTuplesPerPage? I would like to see comments explaining your
understanding, even if they are quite rough. Why did GIN never require
this change to a generic header (itup.h)? Should such a change live in
that generic header file, and not another one more localized to
nbtree?
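
For what it's worth, the difference between the two limits is just
arithmetic: the packed variant divides the usable page space by a bare
item pointer, while MaxIndexTuplesPerPage divides it by a maxaligned
tuple header plus a line pointer. A sketch under assumed sizes (8 kB
pages, 24-byte page header, 6-byte ItemPointerData, 4-byte ItemIdData,
8-byte maxaligned IndexTupleData):

```python
# All sizes below are assumptions matching a typical PostgreSQL build.
BLCKSZ = 8192
SIZE_OF_PAGE_HEADER_DATA = 24
ITEM_POINTER_DATA = 6            # sizeof(ItemPointerData)
ITEM_ID_DATA = 4                 # sizeof(ItemIdData)
INDEX_TUPLE_MAXALIGNED = 8       # MAXALIGN(sizeof(IndexTupleData))

# MaxIndexTuplesPerPage: each entry costs a tuple header plus a line pointer.
max_index_tuples = (BLCKSZ - SIZE_OF_PAGE_HEADER_DATA) // (INDEX_TUPLE_MAXALIGNED + ITEM_ID_DATA)

# MaxPackedIndexTuplesPerPage (from the patch): only item pointers are counted.
max_packed_tuples = (BLCKSZ - SIZE_OF_PAGE_HEADER_DATA) // ITEM_POINTER_DATA

print(max_index_tuples, max_packed_tuples)   # 680 1361
```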

* More explanation of the design would be nice. I suggest modifying
the nbtree README file, so it's easy to tell what the current design
is. It's hard to follow this from the thread. When I reviewed Heikki's
B-Tree patches from a couple of years ago, we spent ~75% of the time
on design, and only ~25% on code.

* I have a paranoid feeling that the deletion locking protocol
(VACUUMing index tuples concurrently and safely) may need special
consideration here. Basically, with the B-Tree code, there are several
complicated locking protocols, like for page splits, page deletion,
and interlocking with vacuum ("super exclusive lock" stuff). These are
why the B-Tree code is complicated in general, and it's very important
to pin down exactly how we deal with each. Ideally, you'd have an
explanation for why your code was correct in each of these existing
cases (especially deletion). With very complicated and important code
like this, it's often wise to be very clear about when we are talking
about your design, and when we are talking about your code. It's
generally too hard to review both at the same time.

Ideally, when you talk about your design, you'll be able to say things
like "it's clear that this existing thing is correct; at least we have
no complaints from the field. Therefore, it must be true that my new
technique is also correct, because it makes that general situation no
worse". Obviously that kind of rigor is just something we aspire to,
and still fall short of at times. Still, it would be nice to
specifically see a reason why the new code isn't special from the
point of view of the super-exclusive lock thing (which is what I mean
by deletion locking protocol + special consideration). Or why it is
special, but that's okay, or whatever. This style of review is normal
when writing B-Tree code. Some other things don't need this rigor, or
have no invariants that need to be respected/used. Maybe this is
obvious to you already, but it isn't obvious to me.

It's okay if you don't know why, but knowing that you don't have a
strong opinion about something is itself useful information.

* I see you disabled the LP_DEAD thing; why? Just because that made
bugs go away?

* Have you done much stress testing? Using pgbench with many
concurrent VACUUM FREEZE operations would be a good idea, if you
haven't already, because that is insistent about getting super
exclusive locks, unlike regular VACUUM.

* Are you keeping the restriction of 1/3 of a buffer page, but that
just includes the posting list now? That's the kind of detail I'd like
to see in the README now.

* Why not support unique indexes? The obvious answer is that it isn't
worth it, but why? How useful would that be (a bit, just not enough)?
What's the trade-off?

Anyway, this is really cool work; I have often thought that we don't
have nearly enough people thinking about how to optimize B-Tree
indexing. It is hard, but so is anything worthwhile.

That's all I have for now. Just a quick review focused on code and
correctness (and not on the benefits). I want to do more on this,
especially the benefits, because it deserves more attention.

--
Peter Geoghegan



Re: [WIP] Effective storage of duplicates in B-tree index.

Anastasia Lubennikova
04.02.2016 20:16, Peter Geoghegan:
On Fri, Jan 29, 2016 at 8:50 AM, Anastasia Lubennikova
[hidden email] wrote:
I fixed it in the new version (attached).

Thank you for the review.
At last, here is the new patch, version 3.0. After some refactoring it looks much better.
I described all the details of the compression in this document: https://goo.gl/50O8Q0 (the same text, without pictures, is attached as btc_readme_1.0.txt).
Consider it a rough draft of the README. It contains some notes about tricky implementation details and questions about future work.
Please don't hesitate to comment on it.

Some quick remarks on your V2.0:

* Seems unnecessary that _bt_binsrch() is passed a real pointer by all
callers. Maybe the one current posting list caller
_bt_findinsertloc(), or its caller, _bt_doinsert(), should do this
work itself:

@@ -373,7 +377,17 @@ _bt_binsrch(Relation rel,
     * scan key), which could be the last slot + 1.
     */
    if (P_ISLEAF(opaque))
+   {
+       if (low <= PageGetMaxOffsetNumber(page))
+       {
+           IndexTuple oitup = (IndexTuple) PageGetItem(page, PageGetItemId(page, low));
+           /* one excessive check of equality, for a possible posting tuple update or creation */
+           if ((_bt_compare(rel, keysz, scankey, page, low) == 0)
+               && (IndexTupleSize(oitup) + sizeof(ItemPointerData) < BTMaxItemSize(page)))
+               *updposing = true;
+       }
        return low;
+   }

* ISTM that you should not use _bt_compare() above, in any case. Consider this:

postgres=# select 5.0 = 5.000;
 ?column?
──────────
 t
(1 row)

B-Tree operator class indicates equality here. And yet, users will
expect to see the original value in an index-only scan, including the
trailing zeroes as they were originally input. So this should be a bit
closer to HeapSatisfiesHOTandKeyUpdate() (actually,
heap_tuple_attr_equals()), which looks for strict binary equality for
similar reasons.
Thank you for the notice. Fixed.
* Is this correct?:

@@ -555,7 +662,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
         * it off the old page, not the new one, in case we are not at leaf
         * level.
         */
-       state->btps_minkey = CopyIndexTuple(oitup);
+       ItemId iihk = PageGetItemId(opage, P_HIKEY);
+       IndexTuple hikey = (IndexTuple) PageGetItem(opage, iihk);
+       state->btps_minkey = CopyIndexTuple(hikey);

How this code has changed from the master branch is not clear to me.
Yes, it is. I've expanded the comment above.
I understand that this code is incomplete/draft:

+#define MaxPackedIndexTuplesPerPage    \
+   ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
+           (sizeof(ItemPointerData))))

But why is it different to the old (actually unchanged)
MaxIndexTuplesPerPage? I would like to see comments explaining your
understanding, even if they are quite rough. Why did GIN never require
this change to a generic header (itup.h)? Should such a change live in
that generic header file, and not another one more localized to
nbtree?
I agree.
-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



btc_readme_1.0.patch (8K) Download Attachment
btree_compression_3.0.patch (41K) Download Attachment

Re: [WIP] Effective storage of duplicates in B-tree index.

Anastasia Lubennikova
18.02.2016 20:18, Anastasia Lubennikova:
04.02.2016 20:16, Peter Geoghegan:
On Fri, Jan 29, 2016 at 8:50 AM, Anastasia Lubennikova
[hidden email] wrote:
I fixed it in the new version (attached).

Thank you for the review.
At last, here is the new patch, version 3.0. After some refactoring it looks much better.
I described all the details of the compression in this document: https://goo.gl/50O8Q0 (the same text, without pictures, is attached as btc_readme_1.0.txt).
Consider it a rough draft of the README. It contains some notes about tricky implementation details and questions about future work.
Please don't hesitate to comment on it.

Sorry, the previous patch was dirty. A hotfix is attached.
-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



btc_readme_1.0.patch (8K) Download Attachment
btree_compression_3.1.patch (41K) Download Attachment

Re: [WIP] Effective storage of duplicates in B-tree index.

David Steele
Hi Anastasia,

On 2/18/16 12:29 PM, Anastasia Lubennikova wrote:

> 18.02.2016 20:18, Anastasia Lubennikova:
>> 04.02.2016 20:16, Peter Geoghegan:
>>> On Fri, Jan 29, 2016 at 8:50 AM, Anastasia Lubennikova
>>> <[hidden email]>  wrote:
>>>> I fixed it in the new version (attached).
>>
>> Thank you for the review.
>> At last, there is a new patch version 3.0. After some refactoring it
>> looks much better.
>> I described all details of the compression in this document
>> https://goo.gl/50O8Q0 (the same text without pictures is attached in
>> btc_readme_1.0.txt).
>> Consider it as a rough copy of readme. It contains some notes about
>> tricky moments of implementation and questions about future work.
>> Please don't hesitate to comment it.
>>
> Sorry, previous patch was dirty. Hotfix is attached.

This looks like an extremely valuable optimization for btree indexes but
unfortunately it is not getting a lot of attention.  It still applies
cleanly for anyone interested in reviewing.

It's not clear to me that you answered all of Peter's questions in [1].
  I understand that you've provided a README but it may not be clear if
the answers are in there (and where).

Also, at the end of the README it says:

13. Xlog. TODO.

Does that mean the patch is not yet complete?

Thanks,
--
-David
[hidden email]

[1]
http://www.postgresql.org/message-id/CAM3SWZQ3_PLQCH4w7uQ8q_f2t4HEseKTr2n0rQ5pxA18OeRTJw@...



Re: [WIP] Effective storage of duplicates in B-tree index.

Anastasia Lubennikova
14.03.2016 16:02, David Steele:

> Hi Anastasia,
>
> On 2/18/16 12:29 PM, Anastasia Lubennikova wrote:
>> 18.02.2016 20:18, Anastasia Lubennikova:
>>> 04.02.2016 20:16, Peter Geoghegan:
>>>> On Fri, Jan 29, 2016 at 8:50 AM, Anastasia Lubennikova
>>>> <[hidden email]>  wrote:
>>>>> I fixed it in the new version (attached).
>>>
>>> Thank you for the review.
>>> At last, there is a new patch version 3.0. After some refactoring it
>>> looks much better.
>>> I described all details of the compression in this document
>>> https://goo.gl/50O8Q0 (the same text without pictures is attached in
>>> btc_readme_1.0.txt).
>>> Consider it as a rough copy of readme. It contains some notes about
>>> tricky moments of implementation and questions about future work.
>>> Please don't hesitate to comment it.
>>>
>> Sorry, previous patch was dirty. Hotfix is attached.
>
> This looks like an extremely valuable optimization for btree indexes
> but unfortunately it is not getting a lot of attention. It still
> applies cleanly for anyone interested in reviewing.
>

Thank you for the attention.
I would be indebted to any reviewers who can try this patch on real data
and workloads (except WAL for now).
B-tree needs a great deal of testing.

> It's not clear to me that you answered all of Peter's questions in
> [1].  I understand that you've provided a README but it may not be
> clear if the answers are in there (and where).

I described all the points Peter asked about in the README.
But I see that it'd be better to answer directly.
Thanks for the reminder; I'll do it tomorrow.

> Also, at the end of the README it says:
>
> 13. Xlog. TODO.
>
> Does that mean the patch is not yet complete?

Yes, you're right.
Frankly speaking, I assumed that someone would help me with that, but now
I've almost completed it. I'll send the updated patch in my next message.

I'm still doubtful about some details of the patch. I mentioned them in
the README (in bold type), but they are mostly about future improvements.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [WIP] Effective storage of duplicates in B-tree index.

Anastasia Lubennikova
Please find the new version of the patch attached. It now has WAL
functionality.

A detailed description of the feature can be found in the README draft:
https://goo.gl/50O8Q0

This patch is pretty complicated, so I ask everyone who is interested in
this feature to help with reviewing and testing it. I will be grateful
for any feedback. But please don't complain about code style; it is
still a work in progress.

Next things I'm going to do:
1. More debugging and testing. In my next message I'm going to attach a
couple of SQL scripts for testing.
2. Fix NULL processing.
3. Add a flag to pg_index that allows enabling/disabling compression for
each particular index.
4. Recheck the locking considerations. I tried to write the code to be
as non-invasive as possible, but we need to make sure that the algorithm
is still correct.
5. Change BTMaxItemSize.
6. Bring back the microvacuum functionality.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




btree_compression_4.0.patch (52K) Download Attachment

Re: [WIP] Effective storage of duplicates in B-tree index.

Alexandr Popov


On 18.03.2016 20:19, Anastasia Lubennikova wrote:
Please find the new version of the patch attached. It now has WAL functionality.

A detailed description of the feature can be found in the README draft: https://goo.gl/50O8Q0

This patch is pretty complicated, so I ask everyone who is interested in this feature
to help with reviewing and testing it. I will be grateful for any feedback.
But please don't complain about code style; it is still a work in progress.

Next things I'm going to do:
1. More debugging and testing. In my next message I'm going to attach a couple of SQL scripts for testing.
2. Fix NULL processing.
3. Add a flag to pg_index that allows enabling/disabling compression for each particular index.
4. Recheck the locking considerations. I tried to write the code to be as non-invasive as possible, but we need to make sure that the algorithm is still correct.
5. Change BTMaxItemSize.
6. Bring back the microvacuum functionality.



Hi, hackers.

It's my first review, so please don't be too strict with me.

I have tested this patch on the following table:

create table message
    (
        id        serial,
        usr_id    integer,
        text      text
    );
CREATE INDEX message_usr_id ON message (usr_id);

The table has 10000000 records.

I found the following: the fewer unique keys there are, the smaller the
index becomes.

The next two tables demonstrate it.

New B-tree
unique keys (usr_id) | index size | creation time
10000000             | 214 MB     | 00:00:34.193441
3333333              | 214 MB     | 00:00:45.731173
2000000              | 129 MB     | 00:00:41.445876
1000000              | 129 MB     | 00:00:38.455616
100000               | 86 MB      | 00:00:40.887626
10000                | 79 MB      | 00:00:47.199774

Old B-tree
unique keys (usr_id) | index size | creation time
10000000             | 214 MB     | 00:00:35.043677
3333333              | 286 MB     | 00:00:40.922845
2000000              | 300 MB     | 00:00:46.454846
1000000              | 278 MB     | 00:00:42.323525
100000               | 287 MB     | 00:00:47.438132
10000                | 280 MB     | 00:01:00.307873

I inserted data both randomly and sequentially; it did not influence the
index's size.
The time to select, insert, and update random rows is unchanged. That is
great, but it certainly needs some more detailed study.
 
Alexander Popov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [WIP] Effective storage of duplicates in B-tree index.

Robert Haas
In reply to this post by Anastasia Lubennikova
On Fri, Mar 18, 2016 at 1:19 PM, Anastasia Lubennikova
<[hidden email]> wrote:

> Please, find the new version of the patch attached. Now it has WAL
> functionality.
>
> Detailed description of the feature you can find in README draft
> https://goo.gl/50O8Q0
>
> This patch is pretty complicated, so I ask everyone, who interested in this
> feature,
> to help with reviewing and testing it. I will be grateful for any feedback.
> But please, don't complain about code style, it is still work in progress.
>
> Next things I'm going to do:
> 1. More debugging and testing. I'm going to attach in next message couple of
> sql scripts for testing.
> 2. Fix NULLs processing
> 3. Add a flag into pg_index, that allows to enable/disable compression for
> each particular index.
> 4. Recheck locking considerations. I tried to write code as less invasive as
> possible, but we need to make sure that algorithm is still correct.
> 5. Change BTMaxItemSize
> 6. Bring back microvacuum functionality.

I really like this idea, and the performance results seem impressive,
but I think we should push this out to 9.7.  A btree patch that didn't
have WAL support until two and a half weeks into the final CommitFest
just doesn't seem to me like a good candidate.  First, as a general
matter, if a patch isn't code-complete at the start of a CommitFest,
it's reasonable to say that it should be reviewed but not necessarily
committed in that CommitFest.  This patch has had some review, but I'm
not sure how deep that review is, and I think it's had no code review
at all of the WAL logging changes, which were submitted only a week
ago, well after the CF deadline.  Second, the btree AM is a
particularly poor place to introduce possibly destabilizing changes.
Everybody depends on it, all the time, for everything.  And despite
new tools like amcheck, it's not a particularly easy thing to debug.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [WIP] Effective storage of duplicates in B-tree index.

Alexander Korotkov
On Thu, Mar 24, 2016 at 5:17 PM, Robert Haas <[hidden email]> wrote:
On Fri, Mar 18, 2016 at 1:19 PM, Anastasia Lubennikova
<[hidden email]> wrote:
> Please, find the new version of the patch attached. Now it has WAL
> functionality.
>
> Detailed description of the feature you can find in README draft
> https://goo.gl/50O8Q0
>
> This patch is pretty complicated, so I ask everyone, who interested in this
> feature,
> to help with reviewing and testing it. I will be grateful for any feedback.
> But please, don't complain about code style, it is still work in progress.
>
> Next things I'm going to do:
> 1. More debugging and testing. I'm going to attach in next message couple of
> sql scripts for testing.
> 2. Fix NULLs processing
> 3. Add a flag into pg_index, that allows to enable/disable compression for
> each particular index.
> 4. Recheck locking considerations. I tried to write code as less invasive as
> possible, but we need to make sure that algorithm is still correct.
> 5. Change BTMaxItemSize
> 6. Bring back microvacuum functionality.

I really like this idea, and the performance results seem impressive,
but I think we should push this out to 9.7.  A btree patch that didn't
have WAL support until two and a half weeks into the final CommitFest
just doesn't seem to me like a good candidate.  First, as a general
matter, if a patch isn't code-complete at the start of a CommitFest,
it's reasonable to say that it should be reviewed but not necessarily
committed in that CommitFest.  This patch has had some review, but I'm
not sure how deep that review is, and I think it's had no code review
at all of the WAL logging changes, which were submitted only a week
ago, well after the CF deadline.  Second, the btree AM is a
particularly poor place to introduce possibly destabilizing changes.
Everybody depends on it, all the time, for everything.  And despite
new tools like amcheck, it's not a particularly easy thing to debug.

It's all true.  But:
1) It's a great feature many users dream about.
2) The patch is not very big.
3) The patch doesn't introduce significant infrastructural changes.  It just changes some well-isolated places.

Let's give it a chance.  I've signed up as an additional reviewer and I'll do my best to spot all possible issues in this patch.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [WIP] Effective storage of duplicates in B-tree index.

Peter Geoghegan-3
In reply to this post by Robert Haas
On Thu, Mar 24, 2016 at 7:17 AM, Robert Haas <[hidden email]> wrote:

> I really like this idea, and the performance results seem impressive,
> but I think we should push this out to 9.7.  A btree patch that didn't
> have WAL support until two and a half weeks into the final CommitFest
> just doesn't seem to me like a good candidate.  First, as a general
> matter, if a patch isn't code-complete at the start of a CommitFest,
> it's reasonable to say that it should be reviewed but not necessarily
> committed in that CommitFest.  This patch has had some review, but I'm
> not sure how deep that review is, and I think it's had no code review
> at all of the WAL logging changes, which were submitted only a week
> ago, well after the CF deadline.  Second, the btree AM is a
> particularly poor place to introduce possibly destabilizing changes.
> Everybody depends on it, all the time, for everything.  And despite
> new tools like amcheck, it's not a particularly easy thing to debug.

Regrettably, I must agree. I don't see a plausible path to commit for
this patch in the ongoing CF.

I think that Anastasia did an excellent job here, and I wish I could
have been of greater help sooner. Nevertheless, it would be unwise to
commit this given the maturity of the code. There have been very few
instances of performance improvements to the B-Tree code for as long
as I've been interested, because it's so hard, and the standard is so
high. The only example I can think of from the last few years is
Kevin's commit 2ed5b87f96 and Tom's commit 1a77f8b63d both of which
were far less invasive, and Simon's commit c7111d11b1, which we just
outright reverted from 9.5 due to subtle bugs (and even that was
significantly less invasive than this patch). Improving nbtree is
something that requires several rounds of expert review, and that's
something that's in short supply for the B-Tree code in particular. I
think that a new testing strategy is needed to make this easier, and I
hope to get that going with amcheck. I need help with formalizing a
"testing first" approach for improving the B-Tree code, because I
think it's the only way that we can move forward with projects like
this. It's *incredibly* hard to push forward patches like this given
our current, limited testing strategy.

--
Peter Geoghegan



Re: [WIP] Effective storage of duplicates in B-tree index.

Jim Nasby-5
In reply to this post by Alexander Korotkov
On 3/24/16 10:21 AM, Alexander Korotkov wrote:
> 1) It's a great feature many users dream about.

Doesn't matter if it starts eating their data...

> 2) Patch is not very big.
> 3) Patch doesn't introduce significant infrastructural changes.  It just
> change some well-isolated placed.

It doesn't really matter how big the patch is; it's a question of "What
did the patch fail to consider?". With something as complicated as the
btree code, there are ample opportunities to miss things. (And FWIW,
I'd argue that a 51kB patch is certainly not small, and a patch that is
doing things in critical sections isn't terribly isolated.)

I do think this will be a great addition, but it's just too late to be
adding this to 9.6.

(BTW, I'm getting bounces from [hidden email], as well as
postmaster@. I emailed [hidden email] about this but never heard back.)
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: [WIP] Effective storage of duplicates in B-tree index.

Anastasia Lubennikova
In reply to this post by Peter Geoghegan-3
25.03.2016 01:12, Peter Geoghegan:
> On Thu, Mar 24, 2016 at 7:17 AM, Robert Haas <[hidden email]> wrote:
>> I really like this idea, and the performance results seem impressive,
>> but I think we should push this out to 9.7.  A btree patch that didn't
>> have WAL support until two and a half weeks into the final CommitFest
>> just doesn't seem to me like a good candidate.  First, as a general
>> matter, if a patch isn't code-complete at the start of a CommitFest,
>> it's reasonable to say that it should be reviewed but not necessarily
>> committed in that CommitFest.
You're right.
Frankly, I thought that someone would help me with the patch, but I had
to finish it myself.
*off-topic*
I wonder if we could add a new flag to the commitfest, something like
"Needs assistance",
to mark big and complicated patches that are still in progress,
while "Needs review" would mean that the patch is almost ready and only
requires a final review.

>>    This patch has had some review, but I'm
>> not sure how deep that review is, and I think it's had no code review
>> at all of the WAL logging changes, which were submitted only a week
>> ago, well after the CF deadline.  Second, the btree AM is a
>> particularly poor place to introduce possibly destabilizing changes.
>> Everybody depends on it, all the time, for everything.  And despite
>> new tools like amcheck, it's not a particularly easy thing to debug.
> Regrettably, I must agree. I don't see a plausible path to commit for
> this patch in the ongoing CF.
>
> I think that Anastasia did an excellent job here, and I wish I could
> have been of greater help sooner. Nevertheless, it would be unwise to
> commit this given the maturity of the code. There have been very few
> instances of performance improvements to the B-Tree code for as long
> as I've been interested, because it's so hard, and the standard is so
> high. The only example I can think of from the last few years is
> Kevin's commit 2ed5b87f96 and Tom's commit 1a77f8b63d both of which
> were far less invasive, and Simon's commit c7111d11b1, which we just
> outright reverted from 9.5 due to subtle bugs (and even that was
> significantly less invasive than this patch). Improving nbtree is
> something that requires several rounds of expert review, and that's
> something that's in short supply for the B-Tree code in particular. I
> think that a new testing strategy is needed to make this easier, and I
> hope to get that going with amcheck. I need help with formalizing a
> "testing first" approach for improving the B-Tree code, because I
> think it's the only way that we can move forward with projects like
> this. It's *incredibly* hard to push forward patches like this given
> our current, limited testing strategy.

Unfortunately, I must agree. This patch seems to be far from a final
version, and it won't get there before the feature freeze.
I'll move it to a future commitfest.

Anyway, it means that now we have more time to improve the patch.
If you have any ideas related to it, such as prefix/suffix
compression, I'll be glad to discuss them.
The same goes for any other B-tree optimization ideas.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [WIP] Effective storage of duplicates in B-tree index.

Claudio Freire
In reply to this post by Peter Geoghegan-3
On Thu, Mar 24, 2016 at 7:12 PM, Peter Geoghegan <[hidden email]> wrote:

> On Thu, Mar 24, 2016 at 7:17 AM, Robert Haas <[hidden email]> wrote:
>> I really like this idea, and the performance results seem impressive,
>> but I think we should push this out to 9.7.  A btree patch that didn't
>> have WAL support until two and a half weeks into the final CommitFest
>> just doesn't seem to me like a good candidate.  First, as a general
>> matter, if a patch isn't code-complete at the start of a CommitFest,
>> it's reasonable to say that it should be reviewed but not necessarily
>> committed in that CommitFest.  This patch has had some review, but I'm
>> not sure how deep that review is, and I think it's had no code review
>> at all of the WAL logging changes, which were submitted only a week
>> ago, well after the CF deadline.  Second, the btree AM is a
>> particularly poor place to introduce possibly destabilizing changes.
>> Everybody depends on it, all the time, for everything.  And despite
>> new tools like amcheck, it's not a particularly easy thing to debug.
>
> Regrettably, I must agree. I don't see a plausible path to commit for
> this patch in the ongoing CF.
>
> I think that Anastasia did an excellent job here, and I wish I could
> have been of greater help sooner. Nevertheless, it would be unwise to
> commit this given the maturity of the code. There have been very few
> instances of performance improvements to the B-Tree code for as long
> as I've been interested, because it's so hard, and the standard is so
> high. The only example I can think of from the last few years is
> Kevin's commit 2ed5b87f96 and Tom's commit 1a77f8b63d both of which
> were far less invasive, and Simon's commit c7111d11b1, which we just
> outright reverted from 9.5 due to subtle bugs (and even that was
> significantly less invasive than this patch). Improving nbtree is
> something that requires several rounds of expert review, and that's
> something that's in short supply for the B-Tree code in particular. I
> think that a new testing strategy is needed to make this easier, and I
> hope to get that going with amcheck. I need help with formalizing a
> "testing first" approach for improving the B-Tree code, because I
> think it's the only way that we can move forward with projects like
> this. It's *incredibly* hard to push forward patches like this given
> our current, limited testing strategy.

I've been toying with prefix compression myself (having gotten nowhere
concrete, really), and I agree that messing with the btree code is quite
a bit harder than it ought to be.

Perhaps trying experimental format changes in a separate experimental
am wouldn't be all that bad (say, nxbtree?). People could opt-in to
those, by creating the indexes with nxbtree instead of plain btree
(say in development environments) and get some testing going without
risking much.

Normally the same effect should be achievable with mere flags, but
since format changes to btree tend to be rather invasive, ensuring the
patch doesn't change behavior with the flag off is hard as well, hence
the wholly separate am idea.



Re: [WIP] Effective storage of duplicates in B-tree index.

Heikki Linnakangas
In reply to this post by Anastasia Lubennikova
On 18/03/16 19:19, Anastasia Lubennikova wrote:

> Please, find the new version of the patch attached. Now it has WAL
> functionality.
>
> Detailed description of the feature you can find in README draft
> https://goo.gl/50O8Q0
>
> This patch is pretty complicated, so I ask everyone, who interested in
> this feature,
> to help with reviewing and testing it. I will be grateful for any feedback.
> But please, don't complain about code style, it is still work in progress.
>
> Next things I'm going to do:
> 1. More debugging and testing. I'm going to attach in next message
> couple of sql scripts for testing.
> 2. Fix NULLs processing
> 3. Add a flag into pg_index, that allows to enable/disable compression
> for each particular index.
> 4. Recheck locking considerations. I tried to write code as less
> invasive as possible, but we need to make sure that algorithm is still
> correct.
> 5. Change BTMaxItemSize
> 6. Bring back microvacuum functionality.

I think we should pack the TIDs more tightly, like GIN does with the
varbyte encoding. It's tempting to commit this without it for now, and
add the compression later, but I'd like to avoid having to deal with
multiple binary-format upgrades, so let's figure out the final on-disk
format that we want, right from the beginning.

It would be nice to reuse the varbyte encoding code from GIN, but we
might not want to use that exact scheme for B-tree. Firstly, an
important criteria when we designed GIN's encoding scheme was to avoid
expanding on-disk size for any data set, which meant that a TID had to
always be encoded in 6 bytes or less. We don't have that limitation with
B-tree, because in B-tree, each item is currently stored as a separate
IndexTuple, which is much larger. So we are free to choose an encoding
scheme that's better at packing some values, at the expense of using
more bytes for other values, if we want to. Some analysis on what we
want would be nice. (It's still important that removing a TID from the
list never makes the list larger, for VACUUM.)
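As a concrete point of reference, GIN stores each TID as a delta from its predecessor, packed 7 bits per byte with the high bit marking "more bytes follow". The sketch below is an illustration of that general style of codec, not the actual ginpostinglist.c code, and the function names are made up for this example:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative varbyte codec, loosely modeled on GIN's posting-list
 * encoding.  A value (e.g. the delta between two sorted TIDs packed into
 * an integer) is stored 7 bits per byte, least-significant group first,
 * with the high bit set on every byte except the last.
 */
static size_t
varbyte_encode(uint64_t val, uint8_t *buf)
{
	size_t		n = 0;

	while (val > 0x7F)
	{
		buf[n++] = (uint8_t) ((val & 0x7F) | 0x80);
		val >>= 7;
	}
	buf[n++] = (uint8_t) val;
	return n;
}

static uint64_t
varbyte_decode(const uint8_t *buf, size_t *consumed)
{
	uint64_t	val = 0;
	int			shift = 0;
	size_t		n = 0;
	uint8_t		b;

	do
	{
		b = buf[n++];
		val |= (uint64_t) (b & 0x7F) << shift;
		shift += 7;
	} while (b & 0x80);

	*consumed = n;
	return val;
}
```

Since the TID list is kept sorted, the deltas are small, so most entries fit in one or two bytes instead of six; the trade-off is that decoding is sequential, which is the random-access concern raised below.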

Secondly, to be able to just always enable this feature, without a GUC
or reloption, we might need something that's faster for random access
than GIN's posting lists. Or we could just add the setting, but it would
be nice to have some more analysis of the worst-case performance before
we decide on that.

I find the macros in nbtree.h in the patch quite confusing. They're
similar to what we did in GIN, but again we might want to choose
differently here. So some discussion on the desired IndexTuple layout is
in order. (One clear bug is that using the high bit of BlockNumber for
the BT_POSTING flag will fail for a table larger than 2^31 blocks.)
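The overflow can be seen directly: BlockNumber is an unsigned 32-bit type, so stealing its top bit for a flag halves the addressable block range. A minimal sketch, using a hypothetical BT_POSTING bit rather than the patch's actual macros:

```c
#include <stdint.h>

typedef uint32_t BlockNumber;	/* as in PostgreSQL's block.h */

/* hypothetical flag stored in the high bit of the block number */
#define BT_POSTING			((BlockNumber) 0x80000000)

#define BT_SET_POSTING(blk)	((blk) | BT_POSTING)
#define BT_IS_POSTING(blk)	(((blk) & BT_POSTING) != 0)
#define BT_GET_BLOCK(blk)	((blk) & ~BT_POSTING)
```

Any heap block number at or above 2^31 already has that bit set, so BT_IS_POSTING reports a false positive and BT_GET_BLOCK mangles the block number, which is exactly the failure mode for a table larger than 2^31 blocks.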

- Heikki




Re: [WIP] Effective storage of duplicates in B-tree index.

Peter Geoghegan-3
On Mon, Jul 4, 2016 at 2:30 AM, Heikki Linnakangas <[hidden email]> wrote:
> I think we should pack the TIDs more tightly, like GIN does with the varbyte
> encoding. It's tempting to commit this without it for now, and add the
> compression later, but I'd like to avoid having to deal with multiple
> binary-format upgrades, so let's figure out the final on-disk format that we
> want, right from the beginning.

While the idea of duplicate storage is pretty obviously compelling,
there could be other, non-obvious benefits. I think that it could
bring further benefits if we could use duplicate storage to change
this property of nbtree (this is from the README):

"""
Lehman and Yao assume that the key range for a subtree S is described
by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
page.  This does not work for nonunique keys (for example, if we have
enough equal keys to spread across several leaf pages, there *must* be
some equal bounding keys in the first level up).  Therefore we assume
Ki <= v <= Ki+1 instead.  A search that finds exact equality to a
bounding key in an upper tree level must descend to the left of that
key to ensure it finds any equal keys in the preceding page.  An
insertion that sees the high key of its target page is equal to the key
to be inserted has a choice whether or not to move right, since the new
key could go on either page.  (Currently, we try to find a page where
there is room for the new key without a split.)

"""

If we could *guarantee* that all keys in the index are unique, then we
could maintain the keyspace as L&Y originally described.

The practical benefits to this would be:

* We wouldn't need to take the extra step described above -- finding a
bounding key/separator key that's fully equal to our scankey would no
longer necessitate a probably-useless descent to the left of that key.
(BTW, I wonder if we could get away with not inserting a downlink into
parent when a leaf page split finds an identical IndexTuple in parent,
*without* changing the keyspace invariant I mention -- if we're always
going to go to the left of an equal-to-scankey key in an internal
page, why even have more than one?)

* This would make suffix truncation of internal index tuples easier,
and that's important.
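One way to get that uniqueness guarantee is to treat the heap TID as an implicit trailing key column, so that two entries with equal user keys still compare unequal. The struct layout and names below are illustrative only, not PostgreSQL's actual index tuple format:

```c
#include <stdint.h>

/*
 * Sketch: make every index key globally unique by comparing the heap TID
 * as a tiebreaker after the user-visible key.  With this ordering, L&Y's
 * Ki < v <= Ki+1 invariant can hold even for "duplicate" user keys.
 */
typedef struct
{
	uint32_t	block;			/* heap block number */
	uint16_t	offset;			/* heap item offset */
} HeapTid;

typedef struct
{
	int32_t		key;			/* user-visible key (an int4 column here) */
	HeapTid		tid;			/* tiebreaker */
} UniqueKey;

static int
unique_key_cmp(const UniqueKey *a, const UniqueKey *b)
{
	if (a->key != b->key)
		return (a->key < b->key) ? -1 : 1;
	if (a->tid.block != b->tid.block)
		return (a->tid.block < b->tid.block) ? -1 : 1;
	if (a->tid.offset != b->tid.offset)
		return (a->tid.offset < b->tid.offset) ? -1 : 1;
	return 0;
}
```

Under such an ordering every separator key identifies a unique point in the keyspace, which is what makes the aggressive suffix truncation discussed below safe.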

The traditional reason why suffix truncation is important is that it
can keep the tree a lot shorter than it would otherwise be. These
days, that might not seem that important, because even if you have
twice the number of internal pages than strictly necessary, that still
isn't that many relative to typical main memory size (and even CPU
cache sizes, perhaps).

The reason I think it's important these days is that not having suffix
truncation makes our "separator keys" overly prescriptive about what
part of the keyspace is owned by each internal page. With a pristine
index (following REINDEX), this doesn't matter much. But, I think that
we get much bigger problems with index bloat due to the poor fan-out
that we sometimes see due to not having suffix truncation, *combined*
with the page deletion algorithms restriction on deleting internal
pages (it can only be done for internal pages with *no* children).

Adding another level or two to the B-Tree makes it so that your
workload's "sparse deletion patterns" really don't need to be that
sparse in order to bloat the B-Tree badly, necessitating a REINDEX to
get back to acceptable performance (VACUUM won't do it). To avoid
this, we should make the internal pages represent the key space in the
least restrictive way possible, by applying suffix truncation so that
it's much more likely that things will *stay* balanced as churn
occurs. This is probably a really bad problem with things like
composite indexes over text columns, or indexes with many NULL values.

--
Peter Geoghegan

