Index Skip Scan (new UniqueKeys)

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
17 messages Options
Reply | Threaded
Open this post in threaded view
|

Index Skip Scan (new UniqueKeys)

Dmitry Dolgov
Hi,

Here is a new version of index skip scan patch, based on v8 patch for
UniqueKeys implementation from [1]. I want to start a new thread to
simplify navigation, hopefully I didn't forget anyone who actively
participated in the discussion.

To simplify reviewing I've split it into several parts:

* First two are taken from [1] just for the reference and to make cfbot happy.

* Extend-UniqueKeys consists changes that needs to be done for
  UniqueKeys to be used in skip scan. Essentially this is a reduced
  version of previous implementation from Jesper & David, based on the
  new UniqueKeys infrastructure, as it does the very same thing.

* Index-Skip-Scan contains not am specific code and the overall
  infrastructure, including configuration elements and so on.

* Btree-implementation contains btree specific code to implement amskip,
  introduced in the previous patch.

* The last one contains just documentation bits.

Interesting enough with a new UniqueKey implementation skipping is
applied in some tests unexpectedly. For those I've disabled
indexskipscan to avoid confusion.

[1]: https://www.postgresql.org/message-id/flat/CAKU4AWrwZMAL%3DuaFUDMf4WGOVkEL3ONbatqju9nSXTUucpp_pw%40mail.gmail.com

v8-0001-Introduce-RelOptInfo-notnullattrs-attribute.patch (2K) Download Attachment
v8-0002-Introuduce-RelOptInfo.uniquekeys-attribute.patch (49K) Download Attachment
v35-0001-Extend-UniqueKeys.patch (13K) Download Attachment
v35-0002-Index-skip-scan.patch (39K) Download Attachment
v35-0003-Btree-implementation-of-skipping.patch (41K) Download Attachment
v35-0004-Index-skip-scan-documentation.patch (4K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Index Skip Scan (new UniqueKeys)

Andy Fan


On Tue, Jun 9, 2020 at 6:20 PM Dmitry Dolgov <[hidden email]> wrote:
Hi,

Here is a new version of index skip scan patch, based on v8 patch for
UniqueKeys implementation from [1]. I want to start a new thread to
simplify navigation, hopefully I didn't forget anyone who actively
participated in the discussion.

To simplify reviewing I've split it into several parts:

* First two are taken from [1] just for the reference and to make cfbot happy.

* Extend-UniqueKeys consists changes that needs to be done for
  UniqueKeys to be used in skip scan. Essentially this is a reduced
  version of previous implementation from Jesper & David, based on the
  new UniqueKeys infrastructure, as it does the very same thing.

* Index-Skip-Scan contains not am specific code and the overall
  infrastructure, including configuration elements and so on.

* Btree-implementation contains btree specific code to implement amskip,
  introduced in the previous patch.

* The last one contains just documentation bits.

Interesting enough with a new UniqueKey implementation skipping is
applied in some tests unexpectedly. For those I've disabled
indexskipscan to avoid confusion.

[1]: https://www.postgresql.org/message-id/flat/CAKU4AWrwZMAL%3DuaFUDMf4WGOVkEL3ONbatqju9nSXTUucpp_pw%40mail.gmail.com

Thanks for the patch.

I just get the rough idea of patch, looks we have to narrow down the user cases
where we can use this method.  Consider the below example:

create table j1(i int, im5 int,  im100 int, im1000 int);
insert into j1 select i, i%5, i%100, i%1000 from generate_series(1, 10000000)i;
create index on j1(im5, i);
insert into j1 values (1, 1, 0, 0);
analyze j1;

demo=# select distinct * from j1 where i < 2;
 i | im5 | im100 | im1000
---+-----+-------+--------
 1 |   1 |     1 |      1
(1 row)

demo=# set enable_indexskipscan to off;
SET
demo=# select distinct * from j1 where i < 2;
 i | im5 | im100 | im1000
---+-----+-------+--------
 1 |   1 |     0 |      0
 1 |   1 |     1 |      1
(2 rows)

drop index j1_im5_i_idx;

create index on j1(im5, i, im100);
demo=# select distinct im5, i, im100 from j1 where i < 2;
 im5 | i | im100
-----+---+-------
   1 | 1 |     0
   1 | 1 |     1
(2 rows)
demo=# set enable_indexskipscan to on;
SET
demo=# select distinct im5, i, im100 from j1 where i < 2;
 im5 | i | im100
-----+---+-------
   1 | 1 |     0
(1 row)


--
Best Regards
Andy Fan
Reply | Threaded
Open this post in threaded view
|

Re: Index Skip Scan (new UniqueKeys)

Dmitry Dolgov
> On Thu, Jun 11, 2020 at 04:14:07PM +0800, Andy Fan wrote:
>
> I just get the rough idea of patch, looks we have to narrow down the
> user cases where we can use this method. Consider the below example:

Hi

Not exactly narrow down, but rather get rid of wrong usage of skipping
for index scan. Since skipping for it was added later than for index
only scan I can imagine there are still blind spots, so good that you've
looked. In this particular case, when index expressions do not fully
cover those expressionse result need to be distinct on, skipping just
doesn't have enough information and should not be used. I'll add it to
the next version, thanks!


Reply | Threaded
Open this post in threaded view
|

Re: Index Skip Scan (new UniqueKeys)

Peter Geoghegan-4
In reply to this post by Dmitry Dolgov
On Tue, Jun 9, 2020 at 3:20 AM Dmitry Dolgov <[hidden email]> wrote:
> * Btree-implementation contains btree specific code to implement amskip,
>   introduced in the previous patch.

The way that you're dealing with B-Tree tuples here needs to account
for posting list tuples:

> +               currItem = &so->currPos.items[so->currPos.lastItem];
> +               itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
> +               nextOffset = ItemPointerGetOffsetNumber(&itup->t_tid);

But I wonder more generally what the idea here is. The following
comments that immediately follow provide some hints:

> +               /*
> +                * To check if we returned the same tuple, try to find a
> +                * startItup on the current page. For that we need to update
> +                * scankey to match the whole tuple and set nextkey to return
> +                * an exact tuple, not the next one. If the nextOffset is the
> +                * same as before, it means we are in the loop, return offnum
> +                * to the original position and jump further
> +                */

Why does it make sense to use the offset number like this? It isn't
stable or reliable. The patch goes on to do this:

> +                   startOffset = _bt_binsrch(scan->indexRelation,
> +                                             so->skipScanKey,
> +                                             so->currPos.buf);
> +
> +                   page = BufferGetPage(so->currPos.buf);
> +                   maxoff = PageGetMaxOffsetNumber(page);
> +
> +                   if (nextOffset <= startOffset)
> +                   {

Why compare a heap TID's offset number (an offset number for a heap
page) to another offset number for a B-Tree leaf page? They're
fundamentally different things.

--
Peter Geoghegan


Reply | Threaded
Open this post in threaded view
|

RE: Index Skip Scan (new UniqueKeys)

Floris Van Nee
In reply to this post by Dmitry Dolgov
Hi Dmitry,

Also took another look at the patch now, and found a case of incorrect data. It looks related to the new way of creating the paths, as I can't recall seeing this in earlier versions.

create table t1 as select a,b,b%5 as c, random() as d from generate_series(1, 10) a, generate_series(1,100) b;
create index on t1 (a,b,c);

postgres=# explain select distinct on (a) * from t1 order by a,b desc,c;
                                  QUERY PLAN                                  
-------------------------------------------------------------------------------
 Sort  (cost=2.92..2.94 rows=10 width=20)
   Sort Key: a, b DESC, c
   ->  Index Scan using t1_a_b_c_idx on t1  (cost=0.28..2.75 rows=10 width=20)
         Skip scan: true
(4 rows)


With the 'order by a, b desc, c' we expect the value of column 'b' to always be 100. With index_skipscan enabled, it always gives 1 though. It's not correct that the planner chooses a skip scan followed by sort in this case.

-Floris



Reply | Threaded
Open this post in threaded view
|

Re: Index Skip Scan (new UniqueKeys)

Dmitry Dolgov
In reply to this post by Peter Geoghegan-4
> On Wed, Jul 08, 2020 at 03:44:26PM -0700, Peter Geoghegan wrote:
>
> On Tue, Jun 9, 2020 at 3:20 AM Dmitry Dolgov <[hidden email]> wrote:
> > * Btree-implementation contains btree specific code to implement amskip,
> >   introduced in the previous patch.
>
> The way that you're dealing with B-Tree tuples here needs to account
> for posting list tuples:
>
> > +               currItem = &so->currPos.items[so->currPos.lastItem];
> > +               itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
> > +               nextOffset = ItemPointerGetOffsetNumber(&itup->t_tid);

Do you mean this last part with t_tid, which could also have a tid array
in case of posting tuple format?

> > +               /*
> > +                * To check if we returned the same tuple, try to find a
> > +                * startItup on the current page. For that we need to update
> > +                * scankey to match the whole tuple and set nextkey to return
> > +                * an exact tuple, not the next one. If the nextOffset is the
> > +                * same as before, it means we are in the loop, return offnum
> > +                * to the original position and jump further
> > +                */
>
> Why does it make sense to use the offset number like this? It isn't
> stable or reliable. The patch goes on to do this:
>
> > +                   startOffset = _bt_binsrch(scan->indexRelation,
> > +                                             so->skipScanKey,
> > +                                             so->currPos.buf);
> > +
> > +                   page = BufferGetPage(so->currPos.buf);
> > +                   maxoff = PageGetMaxOffsetNumber(page);
> > +
> > +                   if (nextOffset <= startOffset)
> > +                   {
>
> Why compare a heap TID's offset number (an offset number for a heap
> page) to another offset number for a B-Tree leaf page? They're
> fundamentally different things.

Well, it's obviously wrong, thanks for noticing. What is necessary is to
compare two index tuples, the start and the next one, to test if they're
the same (in which case if I'm not mistaken probably we can compare item
pointers). I've got this question when I was about to post a new version
with changes to address feedback from Andy, now I'll combine them and
send a cumulative patch.


Reply | Threaded
Open this post in threaded view
|

Re: Index Skip Scan (new UniqueKeys)

Dmitry Dolgov
In reply to this post by Floris Van Nee
> On Fri, Jul 10, 2020 at 05:03:37PM +0000, Floris Van Nee wrote:
>
> Also took another look at the patch now, and found a case of incorrect
> data. It looks related to the new way of creating the paths, as I
> can't recall seeing this in earlier versions.
>
> create table t1 as select a,b,b%5 as c, random() as d from generate_series(1, 10) a, generate_series(1,100) b;
> create index on t1 (a,b,c);
>
> postgres=# explain select distinct on (a) * from t1 order by a,b desc,c;
>                                   QUERY PLAN
> -------------------------------------------------------------------------------
>  Sort  (cost=2.92..2.94 rows=10 width=20)
>    Sort Key: a, b DESC, c
>    ->  Index Scan using t1_a_b_c_idx on t1  (cost=0.28..2.75 rows=10 width=20)
>          Skip scan: true
> (4 rows)

Good point, thanks for looking at this. With the latest planner version
there are indeed more possibilities to use skipping. It never occured to
me that some of those paths will still rely on index scan returning full
data set. I'll look in details and add verification to prevent putting
something like this on top of skip scan in the next version.


Reply | Threaded
Open this post in threaded view
|

RE: Index Skip Scan (new UniqueKeys)

Floris Van Nee
>
> Good point, thanks for looking at this. With the latest planner version there
> are indeed more possibilities to use skipping. It never occured to me that
> some of those paths will still rely on index scan returning full data set. I'll look
> in details and add verification to prevent putting something like this on top of
> skip scan in the next version.

I believe the required changes are something like in attached patch. There were a few things I've changed:
- build_uniquekeys was constructing the list incorrectly. For a DISTINCT a,b, it would create two unique keys, one with a and one with b. However, it should be one unique key with (a,b).
- the uniquekeys that is built, still contains some redundant keys, that are normally eliminated from the path keys lists.
- the distinct_pathkeys may be NULL, even though there's a possibility for skipping. But it wouldn't create the uniquekeys in this case. This makes the planner not choose skip scans even though it could. For example in queries that do SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b; Since a is constant, it's eliminated from regular pathkeys.
- to combat the issues mentioned earlier, there's now a check in build_index_paths that checks if the query_pathkeys matches the useful_pathkeys. Note that we have to use the path keys here rather than any of the unique keys. The unique keys are only Expr nodes - they do not contain the necessary information about ordering. Due to elimination of some constant path keys, we have to search the attributes of the index to find the correct prefix to use in skipping.
- creating the skip scan path did not actually fill the Path's unique keys. It should just set this to query_uniquekeys.

I've attached the first two unique-keys patches (v9, 0001, 0002)), your patches, but rebased on v9 of unique keys (0003-0006) + a diff patch (0007) that applies my suggested changes on top of it.

-Floris


0001-Introduce-RelOptInfo-notnullattrs-attribute.patch (6K) Download Attachment
0002-Introduce-UniqueKey-attributes-on-RelOptInfo-struct.patch (80K) Download Attachment
0003-Extend-UniqueKeys.patch (17K) Download Attachment
0004-Index-skip-scan.patch (52K) Download Attachment
0005-Btree-implementation-of-skipping.patch (54K) Download Attachment
0006-Index-skip-scan-documentation.patch (6K) Download Attachment
0007-planner-fixes.patch (15K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

RE: Index Skip Scan (new UniqueKeys)

Floris Van Nee
>
> I've attached the first two unique-keys patches (v9, 0001, 0002)), your
> patches, but rebased on v9 of unique keys (0003-0006) + a diff patch (0007)
> that applies my suggested changes on top of it.
>

I just realized there's another thing that looks a bit strange too. From reading the thread, I thought it should be the case that in create_distinct_paths, it is checked whether the uniqueness in the unique_pathlist matches the uniqueness that is needed by the query.
This means that I think what it should be comparing is this:
- The generated index path should have some path-level unique keys set
- The path-level unique keys must be at least as strict as the path-level unique keys. Eg. if path-level is unique on (a), then query-level must be (a), or possibly (a,b).

I've changed the patch to compare the path-level keys (set in create index path) with the query-level keys in create_distinct_path. Currently, I don't think the previous implementation was an actual issue leading to incorrect queries, but it would be causing problems if we tried to extend the uniqueness for distinct to join rels etc.

One question about the unique keys - probably for Andy or David: I've looked in the archives to find arguments for/against using Expr nodes or EquivalenceClasses in the Unique Keys patch. However, I couldn't really find a clear answer about why the current patch uses Expr rather than EquivalenceClasses. At some point David mentioned "that probably Expr nodes were needed rather than EquivalenceClasses", but it's not really clear to me why. What were the thoughts behind this?

-Floris


0001-Introduce-RelOptInfo-notnullattrs-attribute.patch (6K) Download Attachment
0002-Introduce-UniqueKey-attributes-on-RelOptInfo-struct.patch (80K) Download Attachment
0003-Extend-UniqueKeys.patch (17K) Download Attachment
0004-Index-skip-scan.patch (52K) Download Attachment
0005-Btree-implementation-of-skipping.patch (54K) Download Attachment
0006-Index-skip-scan-documentation.patch (6K) Download Attachment
0007-planner-fixes.patch (19K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Index Skip Scan (new UniqueKeys)

Dmitry Dolgov
In reply to this post by Floris Van Nee
> On Sun, Jul 12, 2020 at 12:48:47PM +0000, Floris Van Nee wrote:
> >
> > Good point, thanks for looking at this. With the latest planner version there
> > are indeed more possibilities to use skipping. It never occured to me that
> > some of those paths will still rely on index scan returning full data set. I'll look
> > in details and add verification to prevent putting something like this on top of
> > skip scan in the next version.
>
> I believe the required changes are something like in attached patch. There were a few things I've changed:
> - build_uniquekeys was constructing the list incorrectly. For a DISTINCT a,b, it would create two unique keys, one with a and one with b. However, it should be one unique key with (a,b).

Yes, I've also noticed that while preparing fix for index scan not
covered by index and included it.

> - the uniquekeys that is built, still contains some redundant keys, that are normally eliminated from the path keys lists.

I guess you're talking about:

+ if (EC_MUST_BE_REDUNDANT(ec))
+   continue;

Can you add some test cases to your changes to show the effect of it? It
seem to me redundant keys are already eliminated at this point by either
make_pathkeys_for_uniquekeys or even earlier for distinct on, but could
be I've missed something.

Along the lines I'm also curious about this part:

- ListCell   *k;
- List *exprs = NIL;
-
- foreach(k, ec->ec_members)
- {
- EquivalenceMember *mem = (EquivalenceMember *) lfirst(k);
- exprs = lappend(exprs, mem->em_expr);
- }
-
- result = lappend(result, makeUniqueKey(exprs, false, false));
+ EquivalenceMember *mem = (EquivalenceMember*) lfirst(list_head(ec->ec_members));

I'm curious about this myself, maybe someone can clarify. It looks like
generaly speaking there could be more than one member (if not
ec_has_volatile), which "representing knowledge that multiple items are
effectively equal". Is this information is not interesting enough to
preserve it in unique keys?

> - the distinct_pathkeys may be NULL, even though there's a possibility for skipping. But it wouldn't create the uniquekeys in this case. This makes the planner not choose skip scans even though it could. For example in queries that do SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b; Since a is constant, it's eliminated from regular pathkeys.

What would be the point of skipping if it's a constant?

> - to combat the issues mentioned earlier, there's now a check in build_index_paths that checks if the query_pathkeys matches the useful_pathkeys. Note that we have to use the path keys here rather than any of the unique keys. The unique keys are only Expr nodes - they do not contain the necessary information about ordering. Due to elimination of some constant path keys, we have to search the attributes of the index to find the correct prefix to use in skipping.

IIUC here you mean this function, right?

+ prefix = find_index_prefix_for_pathkey(root,
+     index,
+     BackwardScanDirection,
+     llast_node(PathKey,
+     root->distinct_pathkeys));

Doesn't it duplicate the job already done in build_index_pathkeys by
building those pathkeys again? If yes, probably it's possible to reuse
useful_pathkeys. Not sure about unordered indexes, but looks like
query_pathkeys should also match in this case.

Will also look at the follow up questions in the next email.


Reply | Threaded
Open this post in threaded view
|

RE: Index Skip Scan (new UniqueKeys)

Floris Van Nee
>
> > - the uniquekeys that is built, still contains some redundant keys, that are
> normally eliminated from the path keys lists.
>
> I guess you're talking about:
>
> + if (EC_MUST_BE_REDUNDANT(ec))
> +   continue;
>
> Can you add some test cases to your changes to show the effect of it? It
> seem to me redundant keys are already eliminated at this point by either
> make_pathkeys_for_uniquekeys or even earlier for distinct on, but could be
> I've missed something.
>

The build_uniquekeys function calls make_pathkeys_for_uniquekeys, which checks for uniqueness using pathkey_is_unique, but not for constantness. Consider a query like:
SELECT DISTINCT ON (a,b) * FROM t1 WHERE a=1 ORDER BY a,b,c

All the regular path keys filter out 'a' for constantness here - so they would end up with a distinct pathkeys of (b) and sort path keys of (b,c).
The unique keys would end up with (a,b) though. In the previous patch, it'd compared in create_distinct_paths, the pathkeys to the unique keys, so it wouldn't consider the skip scan.
Due to the other changes I made in create_distinct_paths/query_has_uniquekeys_for, it will choose a correct plan now, even without the EC_MUST_BE_REDUNDANT check though, so it's difficult to give an actual failing test case now. However, since all code filters out constant keys, I think uniqueness should do it too - otherwise you could get into problems later on. It's also more consistent. If you already know something is unique by just (b), it doesn't make sense to store that it's unique by (a,b). Now that I think of it, the best place to do this EC_MUST_BE_REDUNDANT check is probably inside make_pathkeys_for_uniquekeys, rather than build_uniquekeys though. It's probably good to move it there.

> Along the lines I'm also curious about this part:
>
> - ListCell   *k;
> - List *exprs = NIL;
> -
> - foreach(k, ec->ec_members)
> - {
> - EquivalenceMember *mem = (EquivalenceMember *)
> lfirst(k);
> - exprs = lappend(exprs, mem->em_expr);
> - }
> -
> - result = lappend(result, makeUniqueKey(exprs, false, false));
> + EquivalenceMember *mem = (EquivalenceMember*)
> +lfirst(list_head(ec->ec_members));
>
> I'm curious about this myself, maybe someone can clarify. It looks like
> generaly speaking there could be more than one member (if not
> ec_has_volatile), which "representing knowledge that multiple items are
> effectively equal". Is this information is not interesting enough to preserve it
> in unique keys?

Yeah, that's a good question. Hence my question about the choice for Expr rather than EquivalenceClass for the Unique Keys patch to Andy/David. When storing just Expr, it is rather difficult to check equivalence in joins for example. Suppose, later on we decide to add join support to the distinct skip scan. Consider a query like this:
SELECT DISTINCT t1.a FROM t1 JOIN t2 ON t1.a=t2.a
As far as my understanding goes (I didn't look into it in detail though), I think here the distinct_pathkey will have an EqClass {t1.a, t2.a}. That results in a UniqueKey with expr (t1.a) (because currently we only take the first Expr in the list to construct the UniqueKey). We could also construct *two* unique keys, one with Expr (t1.a) and one with Expr (t2.a), but I don't think that's the correct approach either, as it will explode when you have multiple pathkeys, each having multiple Expr inside their EqClasses.
That makes it difficult to check if we can perform the DISTINCT skip scan on table t2 as well (theoretically we could, but for that we need to check equivalence classes rather than expressions).

>
> > - the distinct_pathkeys may be NULL, even though there's a possibility for
> skipping. But it wouldn't create the uniquekeys in this case. This makes the
> planner not choose skip scans even though it could. For example in queries
> that do SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b; Since a
> is constant, it's eliminated from regular pathkeys.
>
> What would be the point of skipping if it's a constant?

For the query: SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b;
There may be 1000s of records with a=1. We're only interested in the first one though. The traditional non-skip approach would still scan all records with a=1. Skip would just fetch the first one with a=1 and then skip to the next prefix and stop the scan.

>
> > - to combat the issues mentioned earlier, there's now a check in
> build_index_paths that checks if the query_pathkeys matches the
> useful_pathkeys. Note that we have to use the path keys here rather than
> any of the unique keys. The unique keys are only Expr nodes - they do not
> contain the necessary information about ordering. Due to elimination of
> some constant path keys, we have to search the attributes of the index to
> find the correct prefix to use in skipping.
>
> IIUC here you mean this function, right?
>
> + prefix = find_index_prefix_for_pathkey(root,
> +
> index,
> +
> BackwardScanDirection,
> +
> llast_node(PathKey,
> +
> root->distinct_pathkeys));
>
> Doesn't it duplicate the job already done in build_index_pathkeys by building
> those pathkeys again? If yes, probably it's possible to reuse useful_pathkeys.
> Not sure about unordered indexes, but looks like query_pathkeys should
> also match in this case.
>

Yeah, there's definitely some double work there, but the actual impact may be limited - it doesn't actually allocate a new path key, but it looks it up in root->canon_pathkeys and returns that path key.
I wrote it like this, because I couldn't find a way to identify from a certain PathKey the actual location in the index of that column. The constructed path keys list filters out all redundant path keys. An index on (a,a,b,a,b) becomes path keys (a,b). Now if we skip on (a,b) we actually need to use prefix=3. But how to get from PathKey=b to that number 3, I didn't find a solid way except doing this. Maybe there is though?

-Floris



Reply | Threaded
Open this post in threaded view
|

Re: Index Skip Scan (new UniqueKeys)

Peter Geoghegan-4
In reply to this post by Dmitry Dolgov
On Sat, Jul 11, 2020 at 9:10 AM Dmitry Dolgov <[hidden email]> wrote:
> > > +               currItem = &so->currPos.items[so->currPos.lastItem];
> > > +               itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
> > > +               nextOffset = ItemPointerGetOffsetNumber(&itup->t_tid);
>
> Do you mean this last part with t_tid, which could also have a tid array
> in case of posting tuple format?

Yeah. There is a TID array at the end of the index when the tuple is a
posting list tuple (as indicated by BTreeTupleIsPivot()). It isn't
safe to assume that t_tid is a heap TID for this reason, even in code
that only ever considers data items (that is, non high-key tuples AKA
non-pivot tuples) on a leaf page. (Though new/incoming tuples cannot
be posting list tuples either, so you'll see the assumption that t_tid
is just a heap TID in parts of nbtinsert.c -- though only for the
new/incoming item.)

> Well, it's obviously wrong, thanks for noticing. What is necessary is to
> compare two index tuples, the start and the next one, to test if they're
> the same (in which case if I'm not mistaken probably we can compare item
> pointers). I've got this question when I was about to post a new version
> with changes to address feedback from Andy, now I'll combine them and
> send a cumulative patch.

This sounds like approximately the same problem as the one that
_bt_killitems() has to deal with as of Postgres 13. This is handled in
a way that is admittedly pretty tricky, even though the code does not
need to be 100% certain that it's "the same" tuple. Deduplication kind
of makes that a fuzzy concept. In principle there could be one big
index tuple instead of 5 tuples, even though the logical contents of
the page have not been changed between the time we recording heap TIDs
in local and the time _bt_killitems() tried to match on those heap
TIDs to kill_prior_tuple-kill some index tuples -- a concurrent
deduplication pass could do that. Your code needs to be prepared for
stuff like that.

Ultimately posting list tuples are just a matter of understanding the
on-disk representation -- a "Small Matter of Programming". Even
without deduplication there are potential hazards from the physical
deletion of LP_DEAD-marked tuples in _bt_vacuum_one_page() (which is
not code that runs in VACUUM, despite the name). Make sure that you
hold a buffer pin on the leaf page throughout, because you need to do
that to make sure that VACUUM cannot concurrently recycle heap TIDs.
If VACUUM *is* able to concurrently recycle heap TIDs then it'll be
subtly broken. _bt_killitems() is safe because it either holds on to a
pin or gives up when the LSN changes at all. (ISTM that your only
choice is to hold on to a leaf page pin, since you cannot just decide
to give up in the way that _bt_killitems() sometimes can.)

Note that the rules surrounding buffer locks/pins for nbtree were
tightened up a tiny bit today -- see commit 4a70f829. Also, it's no
longer okay to use raw LockBuffer() calls in nbtree, so you're going
to have to fix that up when you next rebase -- you must use the new
_bt_lockbuf() wrapper function instead, so that the new Valgrind
instrumentation is used. This shouldn't be hard.

Perhaps you can use Valgrind to verify that this patch doesn't have
any unsafe buffer accesses. I recall problems like that in earlier
versions of the patch series. Valgrind should be able to detect most
bugs like that (though see comments within _bt_lockbuf() for details
of a bug in this area that Valgrind cannot detect even now).

--
Peter Geoghegan


Reply | Threaded
Open this post in threaded view
|

Re: Index Skip Scan (new UniqueKeys)

Dmitry Dolgov
In reply to this post by Floris Van Nee
> On Tue, Jul 14, 2020 at 06:18:50PM +0000, Floris Van Nee wrote:
>
> Due to the other changes I made in create_distinct_paths/query_has_uniquekeys_for, it will choose a correct plan now, even without the EC_MUST_BE_REDUNDANT check though, so it's difficult to give an actual failing test case now. However, since all code filters out constant keys, I think uniqueness should do it too - otherwise you could get into problems later on. It's also more consistent. If you already know something is unique by just (b), it doesn't make sense to store that it's unique by (a,b). Now that I think of it, the best place to do this EC_MUST_BE_REDUNDANT check is probably inside make_pathkeys_for_uniquekeys, rather than build_uniquekeys though. It's probably good to move it there.

That would be my suggestion as well.

> > Along the lines I'm also curious about this part:
> >
> > - ListCell   *k;
> > - List *exprs = NIL;
> > -
> > - foreach(k, ec->ec_members)
> > - {
> > - EquivalenceMember *mem = (EquivalenceMember *)
> > lfirst(k);
> > - exprs = lappend(exprs, mem->em_expr);
> > - }
> > -
> > - result = lappend(result, makeUniqueKey(exprs, false, false));
> > + EquivalenceMember *mem = (EquivalenceMember*)
> > +lfirst(list_head(ec->ec_members));
> >
> > I'm curious about this myself, maybe someone can clarify. It looks like
> > generaly speaking there could be more than one member (if not
> > ec_has_volatile), which "representing knowledge that multiple items are
> > effectively equal". Is this information is not interesting enough to preserve it
> > in unique keys?
>
> Yeah, that's a good question. Hence my question about the choice for Expr rather than EquivalenceClass for the Unique Keys patch to Andy/David. When storing just Expr, it is rather difficult to check equivalence in joins for example. Suppose, later on we decide to add join support to the distinct skip scan. Consider a query like this:
> SELECT DISTINCT t1.a FROM t1 JOIN t2 ON t1.a=t2.a
> As far as my understanding goes (I didn't look into it in detail though), I think here the distinct_pathkey will have an EqClass {t1.a, t2.a}. That results in a UniqueKey with expr (t1.a) (because currently we only take the first Expr in the list to construct the UniqueKey). We could also construct *two* unique keys, one with Expr (t1.a) and one with Expr (t2.a), but I don't think that's the correct approach either, as it will explode when you have multiple pathkeys, each having multiple Expr inside their EqClasses.

One UniqueKey can have multiple corresponding expressions, which gives
us also possibility of having one unique key with (t1.a, t2.a) and it
looks now similar to EquivalenceClass.

> > > - the distinct_pathkeys may be NULL, even though there's a possibility for
> > skipping. But it wouldn't create the uniquekeys in this case. This makes the
> > planner not choose skip scans even though it could. For example in queries
> > that do SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b; Since a
> > is constant, it's eliminated from regular pathkeys.
> >
> > What would be the point of skipping if it's a constant?
>
> For the query: SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b;
> There may be 1000s of records with a=1. We're only interested in the first one though. The traditional non-skip approach would still scan all records with a=1. Skip would just fetch the first one with a=1 and then skip to the next prefix and stop the scan.

The idea behind this query sounds questionable to me, more transparent
would be to do this without distinct, skipping will actually do exactly
the same stuff just under another name. But if allowing skipping on
constants do not bring significant changes in the code probably it's
fine.

> > > - to combat the issues mentioned earlier, there's now a check in
> > build_index_paths that checks if the query_pathkeys matches the
> > useful_pathkeys. Note that we have to use the path keys here rather than
> > any of the unique keys. The unique keys are only Expr nodes - they do not
> > contain the necessary information about ordering. Due to elimination of
> > some constant path keys, we have to search the attributes of the index to
> > find the correct prefix to use in skipping.
> >
> > IIUC here you mean this function, right?
> >
> > + prefix = find_index_prefix_for_pathkey(root,
> > +
> > index,
> > +
> > BackwardScanDirection,
> > +
> > llast_node(PathKey,
> > +
> > root->distinct_pathkeys));
> >
> > Doesn't it duplicate the job already done in build_index_pathkeys by building
> > those pathkeys again? If yes, probably it's possible to reuse useful_pathkeys.
> > Not sure about unordered indexes, but looks like query_pathkeys should
> > also match in this case.
> >
>
> Yeah, there's definitely some double work there, but the actual impact may be limited - it doesn't actually allocate a new path key, but it looks it up in root->canon_pathkeys and returns that path key.
> I wrote it like this, because I couldn't find a way to identify from a certain PathKey the actual location in the index of that column. The constructed path keys list filters out all redundant path keys. An index on (a,a,b,a,b) becomes path keys (a,b). Now if we skip on (a,b) we actually need to use prefix=3. But how to get from PathKey=b to that number 3, I didn't find a solid way except doing this. Maybe there is though?

I don't think there is a direct way, but why not modify
build_index_paths to also provide this information, or compare
index_pathkeys expressions with indextlist without actually create those
pathkeys again?

And couple of words about this thread [1]. It looks to me like a strange
way of interacting with the community. Are you going to duplicate there
everything, or what are your plans? At the very least you could try to
include everyone involved in the recipients list, not exclude some of
the authors.

[1]: https://www.postgresql.org/message-id/flat/e4b623692a1447d4a13ac80fa271c8e6%40opammb0561.comp.optiver.com


Reply | Threaded
Open this post in threaded view
|

RE: Index Skip Scan (new UniqueKeys)

Floris Van Nee
>
> One UniqueKey can have multiple corresponding expressions, which gives us
> also possibility of having one unique key with (t1.a, t2.a) and it looks now
> similar to EquivalenceClass.
>

I believe the current definition of a unique key with two expressions (t1.a, t2.a) means that it's unique on the tuple (t1.a, t2.a) - this gives weaker guarantees than uniqueness on (t1.a) and uniqueness on (t2.a).

>
> The idea behind this query sounds questionable to me, more transparent
> would be to do this without distinct, skipping will actually do exactly the same
> stuff just under another name. But if allowing skipping on constants do not
> bring significant changes in the code probably it's fine.
>

Yeah indeed, I didn't say it's a query that people should generally write. :-) It's better to write as a regular SELECT with LIMIT 1 of course. However, it's more to be consistent and predictable to the user: if a SELECT DISTINCT ON (a) * FROM t1 runs fast, then it doesn't make sense to the user if a SELECT DISTINCT ON (a) * FROM t1 WHERE a=2 runs slow. And to support it also makes the implementation more consistent with little code changes.

> >
> > Yeah, there's definitely some double work there, but the actual impact may
> be limited - it doesn't actually allocate a new path key, but it looks it up in
> root->canon_pathkeys and returns that path key.
> > I wrote it like this, because I couldn't find a way to identify from a certain
> PathKey the actual location in the index of that column. The constructed path
> keys list filters out all redundant path keys. An index on (a,a,b,a,b) becomes
> path keys (a,b). Now if we skip on (a,b) we actually need to use prefix=3. But
> how to get from PathKey=b to that number 3, I didn't find a solid way except
> doing this. Maybe there is though?
>
> I don't think there is a direct way, but why not modify build_index_paths to
> also provide this information, or compare index_pathkeys expressions with
> indextlist without actually create those pathkeys again?
>

I agree there could be other ways - I don't currently have a strong preference for either. I can have a look at this later.

> And couple of words about this thread [1]. It looks to me like a strange way
> of interacting with the community. Are you going to duplicate there
> everything, or what are your plans? At the very least you could try to include
> everyone involved in the recipients list, not exclude some of the authors.
>

When I wrote the first mail in the thread, I went to this thread [1] and included everyone from there, but I see now that I only included the to: and cc: people and forgot the original thread author, you. I'm sorry about that - I should've looked better to make sure I had everyone.
In any case, my plan is to keep the patch at least applicable to master, as I believe it can be helpful for discussions about both patches.

[1] https://www.postgresql.org/message-id/20200609102247.jdlatmfyeecg52fi%40localhost


Reply | Threaded
Open this post in threaded view
|

Re: Index Skip Scan (new UniqueKeys)

Dmitry Dolgov
In reply to this post by Peter Geoghegan-4
> On Tue, Jul 21, 2020 at 04:35:55PM -0700, Peter Geoghegan wrote:
>
> > Well, it's obviously wrong, thanks for noticing. What is necessary is to
> > compare two index tuples, the start and the next one, to test if they're
> > the same (in which case if I'm not mistaken probably we can compare item
> > pointers). I've got this question when I was about to post a new version
> > with changes to address feedback from Andy, now I'll combine them and
> > send a cumulative patch.
>
> This sounds like approximately the same problem as the one that
> _bt_killitems() has to deal with as of Postgres 13. This is handled in
> a way that is admittedly pretty tricky, even though the code does not
> need to be 100% certain that it's "the same" tuple. Deduplication kind
> of makes that a fuzzy concept. In principle there could be one big
> index tuple instead of 5 tuples, even though the logical contents of
> the page have not been changed between the time we recording heap TIDs
> in local and the time _bt_killitems() tried to match on those heap
> TIDs to kill_prior_tuple-kill some index tuples -- a concurrent
> deduplication pass could do that. Your code needs to be prepared for
> stuff like that.
>
> Ultimately posting list tuples are just a matter of understanding the
> on-disk representation -- a "Small Matter of Programming". Even
> without deduplication there are potential hazards from the physical
> deletion of LP_DEAD-marked tuples in _bt_vacuum_one_page() (which is
> not code that runs in VACUUM, despite the name). Make sure that you
> hold a buffer pin on the leaf page throughout, because you need to do
> that to make sure that VACUUM cannot concurrently recycle heap TIDs.
> If VACUUM *is* able to concurrently recycle heap TIDs then it'll be
> subtly broken. _bt_killitems() is safe because it either holds on to a
> pin or gives up when the LSN changes at all. (ISTM that your only
> choice is to hold on to a leaf page pin, since you cannot just decide
> to give up in the way that _bt_killitems() sometimes can.)

I see, thanks for clarification. You're right, in this part of
implementation there is no way to give up if LSN changes like
_bt_killitems does. As far as I can see the leaf page is already pinned
all the time between reading relevant tuples and comparing them, I only
need to handle posting list tuples.


Reply | Threaded
Open this post in threaded view
|

Re: Index Skip Scan (new UniqueKeys)

David Rowley
In reply to this post by Floris Van Nee
On Mon, 13 Jul 2020 at 10:18, Floris Van Nee <[hidden email]> wrote:
> One question about the unique keys - probably for Andy or David: I've looked in the archives to find arguments for/against using Expr nodes or EquivalenceClasses in the Unique Keys patch. However, I couldn't really find a clear answer about why the current patch uses Expr rather than EquivalenceClasses. At some point David mentioned "that probably Expr nodes were needed rather than EquivalenceClasses", but it's not really clear to me why. What were the thoughts behind this?

I'm still not quite sure on this either way.  I did think
EquivalenceClasses were more suitable before I wrote the POC patch for
unique keys.  But after that, I had in mind that Exprs might be
better. The reason I thought this was due to the fact that the
DISTINCT clause list is a bunch of Exprs and if the UniqueKeys were
EquivalenceClasses then checking to see if the DISTINCT can be skipped
turned into something more complex that required looking through lists
of ec_members rather than just checking if the uniquekey exprs were a
subset of the DISTINCT clause.

Thinking about it a bit harder, if we did use Exprs then it would mean
it a case like the following wouldn't work for Andy's DISTINCT no-op
stuff.

CREATE TABLE xy (x int primary key, y int not null);

SELECT DISTINCT y FROM xy WHERE x=y;

whereas if we use EquivalenceClasses then we'll find that we have an
EC with x,y in it and can skip the DISTINCT since we have a UniqueKey
containing that EquivalenceClass.

Also, looking at what Andy wrote to make a case like the following
work in his populate_baserel_uniquekeys() function in the 0002 patch:

CREATE TABLE ab (a int, b int, primary key(a,b));
SELECT DISTINCT a FROM ab WHERE b = 1;

it's a bit uninspiring. Really what we want here when checking if we
can skip doing the DISTINCT is a UniqueKey set using
EquivalenceClasses as we can just insist that any unmatched UniqueKey
items have an ec_is_const == true. However, that means we have to loop
through the ec_members of the EquivalenceClasses in the uniquekeys
during the DISTINCT check.  That's particularly bad when you consider
that in a partitioned table case there might be an ec_member for each
child partition and there could be 1000s of child partitions and
following those ec_members chains is going to be too slow.

My current thoughts are that we should be using EquivalenceClasses but
we should first add some infrastructure to make them perform better.
My current thoughts are that we do something like what I mentioned in
[1] or something more like what Andres mentions in [2].  After that,
we could either make EquivalenceClass.ec_members a hash table or
binary search tree. Or even perhaps just have a single hash table/BST
for all EquivalenceClasses that allows very fast lookups from {Expr}
-> {EquivalenceClass}.  I think an Expr can only belong in a single
non-merged EquivalenceClass. So when we do merging of
EquivalenceClasses we could just repoint that data structure to point
to the new EquivalenceClass. We'd never point to ones that have
ec_merged != NULL.  This would also allow us to fix the poor
performance in regards to get_eclass_for_sort_expr() for partitioned
tables.

So, it seems the patch dependency chain for skip scans just got a bit longer :-(

David

[1] https://www.postgresql.org/message-id/CAApHDvrEXcadNYAAdq6RO0eKZUG6rRHXJGAbpzj8y432gCD9bA%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/20190920051857.2fhnvhvx4qdddviz%40alap3.anarazel.de#c3add3919c534591eae2179a6c82742c


Reply | Threaded
Open this post in threaded view
|

Re: Index Skip Scan (new UniqueKeys)

Andy Fan
Hi David:

Thanks for looking into this. 

On Fri, Jul 31, 2020 at 11:07 AM David Rowley <[hidden email]> wrote:
On Mon, 13 Jul 2020 at 10:18, Floris Van Nee <[hidden email]> wrote:
> One question about the unique keys - probably for Andy or David: I've looked in the archives to find arguments for/against using Expr nodes or EquivalenceClasses in the Unique Keys patch. However, I couldn't really find a clear answer about why the current patch uses Expr rather than EquivalenceClasses. At some point David mentioned "that probably Expr nodes were needed rather than EquivalenceClasses", but it's not really clear to me why. What were the thoughts behind this?

I'm still not quite sure on this either way.  I did think
EquivalenceClasses were more suitable before I wrote the POC patch for
unique keys.  But after that, I had in mind that Exprs might be
better. The reason I thought this was due to the fact that the
DISTINCT clause list is a bunch of Exprs and if the UniqueKeys were
EquivalenceClasses then checking to see if the DISTINCT can be skipped
turned into something more complex that required looking through lists
of ec_members rather than just checking if the uniquekey exprs were a
subset of the DISTINCT clause.

Thinking about it a bit harder, if we did use Exprs then it would mean
it a case like the following wouldn't work for Andy's DISTINCT no-op
stuff.

CREATE TABLE xy (x int primary key, y int not null);

SELECT DISTINCT y FROM xy WHERE x=y;

whereas if we use EquivalenceClasses then we'll find that we have an
EC with x,y in it and can skip the DISTINCT since we have a UniqueKey
containing that EquivalenceClass. 

Also, looking at what Andy wrote to make a case like the following
work in his populate_baserel_uniquekeys() function in the 0002 patch:

CREATE TABLE ab (a int, b int, primary key(a,b));
SELECT DISTINCT a FROM ab WHERE b = 1;

it's a bit uninspiring. Really what we want here when checking if we
can skip doing the DISTINCT is a UniqueKey set using
EquivalenceClasses as we can just insist that any unmatched UniqueKey
items have an ec_is_const == true. However, that means we have to loop
through the ec_members of the EquivalenceClasses in the uniquekeys
during the DISTINCT check.  That's particularly bad when you consider
that in a partitioned table case there might be an ec_member for each
child partition and there could be 1000s of child partitions and
following those ec_members chains is going to be too slow.

My current thoughts are that we should be using EquivalenceClasses but
we should first add some infrastructure to make them perform better.
My current thoughts are that we do something like what I mentioned in
[1] or something more like what Andres mentions in [2].  After that,
we could either make EquivalenceClass.ec_members a hash table or
binary search tree. Or even perhaps just have a single hash table/BST
for all EquivalenceClasses that allows very fast lookups from {Expr}
-> {EquivalenceClass}.  I think an Expr can only belong in a single
non-merged EquivalenceClass. So when we do merging of
EquivalenceClasses we could just repoint that data structure to point
to the new EquivalenceClass. We'd never point to ones that have
ec_merged != NULL.  This would also allow us to fix the poor
performance in regards to get_eclass_for_sort_expr() for partitioned
tables.

So, it seems the patch dependency chain for skip scans just got a bit longer :-(


I admit that  EquivalenceClasses has a better expressive power.   There are 2 more 
cases we can handle better with EquivalenceClasses.  SELECT DISTINCT a, b, c  
FROM t WHERE a = b;  Currently the UniqueKey is (a, b, c), but it is better be (a, c) 
and (b, c).   The other case happens similarly in group by case.  

After realizing this,  I am still hesitant to do that, due to the complexity.  If we do that,
we may have to maintain a EquivalenceClasses in one more place or make the existing
EquivalenceClasses List longer, for example:  SELECT pk FROM t;  The current 
infrastructure doesn't create any EquivalenceClasses for pk.  So we have to create
a new one in this case and reuse some existing ones in other cases.  Finally since the
EquivalenceClasses is not so straight to upper user, we have to depend on the 
infrastructure change to look up an EquivalenceClasses quickly from an Expr. 

I rethink more about the case you provide above,  IIUC, there is such issue for joinrel. 
then we can just add a EC checking for populate_baserel_uniquekeys.  As for the 
DISTINCT/GROUP BY case,  we should build the UniqueKeys from root->distinct_pathkeys
and root->group_pathkeys where the  EquivalenceClasses are already there. 

I am still not insisting on either Expr or EquivalenceClasses right now,  if we need to
change it to EquivalenceClasses,  I'd see if we need to have more places to take
care before doing that. 

--
Best Regards
Andy Fan