RE: Index Skip Scan

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
40 messages Options
12
Reply | Threaded
Open this post in threaded view
|

RE: Index Skip Scan

Floris Van Nee
Hi all,

I reviewed the latest version of the patch. Overall some good improvements I think. Please find my feedback below.

- I think I mentioned this before - it's not that big of a deal, but it just looks weird and inconsistent to me:
create table t2 as (select a, b, c, 10 d from generate_series(1,5) a, generate_series(1,100) b, generate_series(1,10000) c); create index on t2 (a,b,c desc);

postgres=# explain select distinct on (a,b) a,b,c from t2 where a=2 and b>=5 and b<=5 order by a,b,c desc;
                                   QUERY PLAN                                    
---------------------------------------------------------------------------------
 Index Only Scan using t2_a_b_c_idx on t2  (cost=0.43..216.25 rows=500 width=12)
   Skip scan: true
   Index Cond: ((a = 2) AND (b >= 5) AND (b <= 5))
(3 rows)

postgres=# explain select distinct on (a,b) a,b,c from t2 where a=2 and b=5 order by a,b,c desc;
                                       QUERY PLAN                                        
-----------------------------------------------------------------------------------------
 Unique  (cost=0.43..8361.56 rows=500 width=12)
   ->  Index Only Scan using t2_a_b_c_idx on t2  (cost=0.43..8361.56 rows=9807 width=12)
         Index Cond: ((a = 2) AND (b = 5))
(3 rows)

When doing a distinct on (params) and having equality conditions for all params, it falls back to the regular index scan even though there's no reason not to use the skip scan here. It's much faster to write b between 5 and 5 now rather than writing b=5. I understand this was a limitation of the unique-keys patch at the moment which could be addressed in the future. I think for the sake of consistency it would make sense if this eventually gets addressed.

- nodeIndexScan.c, line 126
This sets xs_want_itup to true in all cases (even for non skip-scans). I don't think this is acceptable given the side-effects this has (page will never be unpinned in between returned tuples in _bt_drop_lock_and_maybe_pin)

- nbsearch.c, _bt_skip, line 1440
_bt_update_skip_scankeys(scan, indexRel); This function is called twice now - once in the else {} and immediately after that outside of the else. The second call can be removed I think.

- nbtsearch.c _bt_skip line 1597
                                LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
                                scan->xs_itup = (IndexTuple) PageGetItem(page, itemid);

This is an UNLOCK followed by a read of the unlocked page. That looks incorrect?

- nbtsearch.c _bt_skip line 1440
if (BTScanPosIsValid(so->currPos) &&
                _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))

Is it allowed to look at the high key / low key of the page without have a read lock on it?

- nbtsearch.c line 1634
if (_bt_readpage(scan, indexdir, offnum))  ...
else
 error()

Is it really guaranteed that a match can be found on the page itself? Isn't it possible that an extra index condition, not part of the scan key, makes none of the keys on the page match?

- nbtsearch.c in general
Most of the code seems to rely quite heavily on the fact that xs_want_itup forces _bt_drop_lock_and_maybe_pin to never release the buffer pin. Have you considered that compacting of a page may still happen even if you hold the pin? [1] I've been trying to come up with cases in which this may break the patch, but I haven't able to produce such a scenario - so it may be fine. But it would be good to consider again. One thing I was thinking of was a scenario where page splits and/or compacting would happen in between returning tuples. Could this break the _bt_scankey_within_page check such that it thinks the scan key is within the current page, while it actually isn't? Mainly for backward and/or cursor scans. Forward scans shouldn't be a problem I think. Apologies for being a bit vague as I don't have a clear example ready when it would go wrong. It may well be fine, but it was one of the things on my mind.

[1] https://postgrespro.com/list/id/1566683972147.11682@...

-Floris


Reply | Threaded
Open this post in threaded view
|

Re: Index Skip Scan

Jesper Pedersen
Hi Floris,

On 1/15/20 8:33 AM, Floris Van Nee wrote:
> I reviewed the latest version of the patch. Overall some good improvements I think. Please find my feedback below.
>

Thanks for your review !

> - I think I mentioned this before - it's not that big of a deal, but it just looks weird and inconsistent to me:
> create table t2 as (select a, b, c, 10 d from generate_series(1,5) a, generate_series(1,100) b, generate_series(1,10000) c); create index on t2 (a,b,c desc);
>
> postgres=# explain select distinct on (a,b) a,b,c from t2 where a=2 and b>=5 and b<=5 order by a,b,c desc;
>                                     QUERY PLAN
> ---------------------------------------------------------------------------------
>   Index Only Scan using t2_a_b_c_idx on t2  (cost=0.43..216.25 rows=500 width=12)
>     Skip scan: true
>     Index Cond: ((a = 2) AND (b >= 5) AND (b <= 5))
> (3 rows)
>
> postgres=# explain select distinct on (a,b) a,b,c from t2 where a=2 and b=5 order by a,b,c desc;
>                                         QUERY PLAN
> -----------------------------------------------------------------------------------------
>   Unique  (cost=0.43..8361.56 rows=500 width=12)
>     ->  Index Only Scan using t2_a_b_c_idx on t2  (cost=0.43..8361.56 rows=9807 width=12)
>           Index Cond: ((a = 2) AND (b = 5))
> (3 rows)
>
> When doing a distinct on (params) and having equality conditions for all params, it falls back to the regular index scan even though there's no reason not to use the skip scan here. It's much faster to write b between 5 and 5 now rather than writing b=5. I understand this was a limitation of the unique-keys patch at the moment which could be addressed in the future. I think for the sake of consistency it would make sense if this eventually gets addressed.
>
Agreed, that it is an improvement that should be made. I would like
David's view on this since it relates to the UniqueKey patch.

> - nodeIndexScan.c, line 126
> This sets xs_want_itup to true in all cases (even for non skip-scans). I don't think this is acceptable given the side-effects this has (page will never be unpinned in between returned tuples in _bt_drop_lock_and_maybe_pin)
>

Correct - fixed.

> - nbsearch.c, _bt_skip, line 1440
> _bt_update_skip_scankeys(scan, indexRel); This function is called twice now - once in the else {} and immediately after that outside of the else. The second call can be removed I think.
>

Yes, removed the "else" call site.

> - nbtsearch.c _bt_skip line 1597
> LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
> scan->xs_itup = (IndexTuple) PageGetItem(page, itemid);
>
> This is an UNLOCK followed by a read of the unlocked page. That looks incorrect?
>

Yes, that needed to be changed.

> - nbtsearch.c _bt_skip line 1440
> if (BTScanPosIsValid(so->currPos) &&
> _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))
>
> Is it allowed to look at the high key / low key of the page without have a read lock on it?
>

In case of a split the page will still contain a high key and a low key,
so this should be ok.

> - nbtsearch.c line 1634
> if (_bt_readpage(scan, indexdir, offnum))  ...
> else
>   error()
>
> Is it really guaranteed that a match can be found on the page itself? Isn't it possible that an extra index condition, not part of the scan key, makes none of the keys on the page match?
>

The logic for this has been changed.

> - nbtsearch.c in general
> Most of the code seems to rely quite heavily on the fact that xs_want_itup forces _bt_drop_lock_and_maybe_pin to never release the buffer pin. Have you considered that compacting of a page may still happen even if you hold the pin? [1] I've been trying to come up with cases in which this may break the patch, but I haven't able to produce such a scenario - so it may be fine. But it would be good to consider again. One thing I was thinking of was a scenario where page splits and/or compacting would happen in between returning tuples. Could this break the _bt_scankey_within_page check such that it thinks the scan key is within the current page, while it actually isn't? Mainly for backward and/or cursor scans. Forward scans shouldn't be a problem I think. Apologies for being a bit vague as I don't have a clear example ready when it would go wrong. It may well be fine, but it was one of the things on my mind.
>

There is a BT_READ lock in place when finding the correct leaf page, or
searching within the leaf page itself. _bt_vacuum_one_page deletes only
LP_DEAD tuples, but those are already ignored in _bt_readpage. Peter, do
you have some feedback for this ?


Please, find the updated patches attached that Dmitry and I made.

Thanks again !

Best regards,
  Jesper

v31_0001-Unique-key.patch (26K) Download Attachment
v31-0002-Index-skip-scan.patch (71K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Index Skip Scan

Peter Geoghegan-4
On Mon, Jan 20, 2020 at 11:01 AM Jesper Pedersen
<[hidden email]> wrote:
> > - nbtsearch.c _bt_skip line 1440
> > if (BTScanPosIsValid(so->currPos) &&
> >               _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))
> >
> > Is it allowed to look at the high key / low key of the page without have a read lock on it?
> >
>
> In case of a split the page will still contain a high key and a low key,
> so this should be ok.

This is definitely not okay.

> > - nbtsearch.c in general
> > Most of the code seems to rely quite heavily on the fact that xs_want_itup forces _bt_drop_lock_and_maybe_pin to never release the buffer pin. Have you considered that compacting of a page may still happen even if you hold the pin? [1] I've been trying to come up with cases in which this may break the patch, but I haven't able to produce such a scenario - so it may be fine.

Try making _bt_findinsertloc() call _bt_vacuum_one_page() whenever the
page is P_HAS_GARBAGE(), regardless of whether or not the page is
about to split. That will still be correct, while having a much better
chance of breaking the patch during stress-testing.

Relying on a buffer pin to prevent the B-Tree structure itself from
changing in any important way seems likely to be broken already. Even
if it isn't, it sounds fragile.

A leaf page doesn't really have anything called a low key. It usually
has a current first "data item"/non-pivot tuple, which is an
inherently unstable thing. Also, it has a very loose relationship with
the high key of the left sibling page, which the the closest thing to
a low key that exists (often they'll have almost the same key values,
but that is not guaranteed at all). While I haven't studied the patch,
the logic within _bt_scankey_within_page() seems fishy to me for that
reason.

> There is a BT_READ lock in place when finding the correct leaf page, or
> searching within the leaf page itself. _bt_vacuum_one_page deletes only
> LP_DEAD tuples, but those are already ignored in _bt_readpage. Peter, do
> you have some feedback for this ?

It sounds like the design of the patch relies on doing something other
than stopping a scan "between" pages, in the sense that is outlined in
the commit message of commit 09cb5c0e. If so, then that's a serious
flaw in its design.

--
Peter Geoghegan


Reply | Threaded
Open this post in threaded view
|

Re: Index Skip Scan

Peter Geoghegan-4
On Mon, Jan 20, 2020 at 1:19 PM Peter Geoghegan <[hidden email]> wrote:

> On Mon, Jan 20, 2020 at 11:01 AM Jesper Pedersen
> <[hidden email]> wrote:
> > > - nbtsearch.c _bt_skip line 1440
> > > if (BTScanPosIsValid(so->currPos) &&
> > >               _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))
> > >
> > > Is it allowed to look at the high key / low key of the page without have a read lock on it?
> > >
> >
> > In case of a split the page will still contain a high key and a low key,
> > so this should be ok.
>
> This is definitely not okay.

I suggest that you find a way to add assertions to code like
_bt_readpage() that verify that we do in fact have the buffer content
lock. Actually, there is an existing assertion here that covers the
pin, but not the buffer content lock:

static bool
_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
{
    <declare variables>
    ...

    /*
     * We must have the buffer pinned and locked, but the usual macro can't be
     * used here; this function is what makes it good for currPos.
     */
    Assert(BufferIsValid(so->currPos.buf));

You can add another assertion that calls a new utility function in
bufmgr.c. That can use the same logic as this existing assertion in
FlushOneBuffer():

Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));

We haven't needed assertions like this so far because it's usually it
is clear whether or not a buffer lock is held (plus the bufmgr.c
assertions help on their own). The fact that it isn't clear whether or
not a buffer lock will be held by caller here suggests a problem. Even
still, having some guard rails in the form of these assertions could
be helpful. Also, it seems like _bt_scankey_within_page() should have
a similar set of assertions.

BTW, there is a paper that describes optimizations like loose index
scan and skip scan together, in fairly general terms: "Efficient
Search of Multidimensional B-Trees". Loose index scans are given the
name "MDAM duplicate elimination" in the paper. See:

http://vldb.org/conf/1995/P710.PDF

Goetz Graefe told me about the paper. It seems like the closest thing
that exists to a taxonomy or conceptual framework for these
techniques.

--
Peter Geoghegan


Reply | Threaded
Open this post in threaded view
|

Re: Index Skip Scan

Dmitry Dolgov
> On Mon, Jan 20, 2020 at 05:05:33PM -0800, Peter Geoghegan wrote:
>
> I suggest that you find a way to add assertions to code like
> _bt_readpage() that verify that we do in fact have the buffer content
> lock. Actually, there is an existing assertion here that covers the
> pin, but not the buffer content lock:
>
> static bool
> _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
> {
>     <declare variables>
>     ...
>
>     /*
>      * We must have the buffer pinned and locked, but the usual macro can't be
>      * used here; this function is what makes it good for currPos.
>      */
>     Assert(BufferIsValid(so->currPos.buf));
>
> You can add another assertion that calls a new utility function in
> bufmgr.c. That can use the same logic as this existing assertion in
> FlushOneBuffer():
>
> Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
>
> We haven't needed assertions like this so far because it's usually it
> is clear whether or not a buffer lock is held (plus the bufmgr.c
> assertions help on their own). The fact that it isn't clear whether or
> not a buffer lock will be held by caller here suggests a problem. Even
> still, having some guard rails in the form of these assertions could
> be helpful. Also, it seems like _bt_scankey_within_page() should have
> a similar set of assertions.

Thanks for suggestion. Agree, we will add such guards. It seems that in
general I need to go through the locking in the patch one more time,
since there are some gaps I din't notice/didn't know about before.

> BTW, there is a paper that describes optimizations like loose index
> scan and skip scan together, in fairly general terms: "Efficient
> Search of Multidimensional B-Trees". Loose index scans are given the
> name "MDAM duplicate elimination" in the paper. See:
>
> http://vldb.org/conf/1995/P710.PDF
>
> Goetz Graefe told me about the paper. It seems like the closest thing
> that exists to a taxonomy or conceptual framework for these
> techniques.

Yes, I've read this paper, as it's indeed the only reference I found
about this topic in literature. But unfortunately it's not much and (at
least from the first read) gives only an overview of the idea.


Reply | Threaded
Open this post in threaded view
|

Re: Index Skip Scan

Dmitry Dolgov
In reply to this post by Peter Geoghegan-4
> On Mon, Jan 20, 2020 at 01:19:30PM -0800, Peter Geoghegan wrote:

Thanks for the commentaries. I'm trying to clarify your conclusions for
myself, so couple of questions.

> > > - nbtsearch.c in general
> > > Most of the code seems to rely quite heavily on the fact that xs_want_itup forces _bt_drop_lock_and_maybe_pin to never release the buffer pin. Have you considered that compacting of a page may still happen even if you hold the pin? [1] I've been trying to come up with cases in which this may break the patch, but I haven't able to produce such a scenario - so it may be fine.
>
> Try making _bt_findinsertloc() call _bt_vacuum_one_page() whenever the
> page is P_HAS_GARBAGE(), regardless of whether or not the page is
> about to split. That will still be correct, while having a much better
> chance of breaking the patch during stress-testing.
>
> Relying on a buffer pin to prevent the B-Tree structure itself from
> changing in any important way seems likely to be broken already. Even
> if it isn't, it sounds fragile.

Except for checking low/high key (which should be done with a lock), I
believe the current implementation follows the same pattern I see quite
often, namely

* get a lock on a page of interest and test it's values (if we can find
  next distinct value right on the next one without goind down the tree).

* if not, unlock the current page, search within the tree with
  _bt_search (which locks a resuling new page) and examine values on a
  new page, when necessary do _bt_steppage

Is there an obvious problem with this approach, when it comes to the
page structure modification?

> A leaf page doesn't really have anything called a low key. It usually
> has a current first "data item"/non-pivot tuple, which is an
> inherently unstable thing.

Would this inherent instability be resolved for this particular case by
having a lock on a page while checking a first data item, or there is
something else I need to take into account?

> > There is a BT_READ lock in place when finding the correct leaf page, or
> > searching within the leaf page itself. _bt_vacuum_one_page deletes only
> > LP_DEAD tuples, but those are already ignored in _bt_readpage. Peter, do
> > you have some feedback for this ?
>
> It sounds like the design of the patch relies on doing something other
> than stopping a scan "between" pages, in the sense that is outlined in
> the commit message of commit 09cb5c0e. If so, then that's a serious
> flaw in its design.

Could you please elaborate why does it sound like that? If I understand
correctly, to stop a scan only "between" pages one need to use only
_bt_readpage/_bt_steppage? Other than that there is no magic with scan
position in the patch, so I'm not sure if I'm missing something here.


Reply | Threaded
Open this post in threaded view
|

Re: Index Skip Scan

Jesper Pedersen
In reply to this post by Peter Geoghegan-4
Hi Peter,

Thanks for your feedback; Dmitry has followed-up with some additional
questions.

On 1/20/20 8:05 PM, Peter Geoghegan wrote:
>> This is definitely not okay.
>
> I suggest that you find a way to add assertions to code like
> _bt_readpage() that verify that we do in fact have the buffer content
> lock.

If you apply the attached patch on master it will fail the test suite;
did you mean something else ?

Best regards,
  Jesper

buffer_locked.txt (2K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

RE: Index Skip Scan

Floris Van Nee
In reply to this post by Dmitry Dolgov
>
> Could you please elaborate why does it sound like that? If I understand
> correctly, to stop a scan only "between" pages one need to use only
> _bt_readpage/_bt_steppage? Other than that there is no magic with scan
> position in the patch, so I'm not sure if I'm missing something here.

Anyone please correct me if I'm wrong, but I think one case where the current patch relies on some data from the page it has locked before it in checking this hi/lo key. I think it's possible for the following sequence to happen. Suppose we have a very simple one leaf-page btree containing four elements: leaf page 1 = [2,4,6,8]
We do a backwards index skip scan on this and have just returned our first tuple (8). The buffer is left pinned but unlocked. Now, someone else comes in and inserts a tuple (value 5) into this page, but suppose the page happens to be full. So a page split occurs. As far as I know, a page split could happen at any random element in the page. One of the situations we could be left with is:
Leaf page 1 = [2,4]
Leaf page 2 = [5,6,8]
However, our scan is still pointing to leaf page 1. For non-skip scans this is not a problem, as we already read all matching elements in our local buffer and we'll return those. But the skip scan currently:
a) checks the lo-key of the page to see if the next prefix can be found on the leaf page 1
b) finds out that this is actually true
c) does a search on the page and returns value=4 (while it should have returned value=6)

Peter, is my understanding about the btree internals correct so far?

Now that I look at the patch again, I fear there currently may also be such a dependency in the "Advance forward but read backward"-case. It saves the offset number of a tuple in a variable, then does a _bt_search (releasing the lock and pin on the page). At this point, anything can happen to the tuples on this page - the page may be compacted by vacuum such that the offset number you have in your variable does not match the actual offset number of the tuple on the page anymore. Then, at the check for (nextOffset == startOffset) later, there's a possibility the offsets are different even though they relate to the same tuple.


-Floris



Reply | Threaded
Open this post in threaded view
|

Re: Index Skip Scan

Dmitry Dolgov
> On Wed, Jan 22, 2020 at 07:50:30AM +0000, Floris Van Nee wrote:
>
> Anyone please correct me if I'm wrong, but I think one case where the current patch relies on some data from the page it has locked before it in checking this hi/lo key. I think it's possible for the following sequence to happen. Suppose we have a very simple one leaf-page btree containing four elements: leaf page 1 = [2,4,6,8]
> We do a backwards index skip scan on this and have just returned our first tuple (8). The buffer is left pinned but unlocked. Now, someone else comes in and inserts a tuple (value 5) into this page, but suppose the page happens to be full. So a page split occurs. As far as I know, a page split could happen at any random element in the page. One of the situations we could be left with is:
> Leaf page 1 = [2,4]
> Leaf page 2 = [5,6,8]
> However, our scan is still pointing to leaf page 1.

In case if we just returned a tuple, the next action would be either
check the next page for another key or search down to the tree. Maybe
I'm missing something in your scenario, but the latter will land us on a
required page (we do not point to any leaf here), and before the former
there is a check for high/low key. Is there anything else missing?

> Now that I look at the patch again, I fear there currently may also be such a dependency in the "Advance forward but read backward"-case. It saves the offset number of a tuple in a variable, then does a _bt_search (releasing the lock and pin on the page). At this point, anything can happen to the tuples on this page - the page may be compacted by vacuum such that the offset number you have in your variable does not match the actual offset number of the tuple on the page anymore. Then, at the check for (nextOffset == startOffset) later, there's a possibility the offsets are different even though they relate to the same tuple.

Interesting point. The original idea here was to check that we're not
returned to the same position after jumping, so maybe instead of offsets
we can check a tuple we found.


Reply | Threaded
Open this post in threaded view
|

Re: Index Skip Scan

Floris Van Nee
Hi Dmitry,


> > On Wed, Jan 22, 2020 at 07:50:30AM +0000, Floris Van Nee wrote:
> >
> > Anyone please correct me if I'm wrong, but I think one case where the current patch relies on some data from the page it has locked before it in checking this hi/lo key. I think it's possible for the following sequence to happen. Suppose we have a very simple one leaf-page btree containing four elements: leaf page 1 = [2,4,6,8]
> > We do a backwards index skip scan on this and have just returned our first tuple (8). The buffer is left pinned but unlocked. Now, someone else comes in and inserts a tuple (value 5) into this page, but suppose the page happens to be full. So a page split occurs. As far as I know, a page split could happen at any random element in the page. One of the situations we could be left with is:
> > Leaf page 1 = [2,4]
> > Leaf page 2 = [5,6,8]
> > However, our scan is still pointing to leaf page 1.

> In case if we just returned a tuple, the next action would be either
>> check the next page for another key or search down to the tree. Maybe

But it won't look at the 'next page for another key', but rather at the 'same page or another key', right? In the _bt_scankey_within_page shortcut we're taking, there's no stepping to a next page involved. It just locks the page again that it previously also locked.

> I'm missing something in your scenario, but the latter will land us on a
> required page (we do not point to any leaf here), and before the former
> there is a check for high/low key. Is there anything else missing?

Let me try to clarify. After we return the first tuple, so->currPos.buf is pointing to page=1 in my example (it's the only page after all). We've returned item=8. Then the split happens and the items get rearranged as in my example. We're still pointing with so->currPos.buf to page=1, but the page now contains [2,4]. The split happened to the right, so there's a page=2 with [5,6,8], however the ongoing index scan is unaware of that.
Now _bt_skip gets called to fetch the next tuple. It starts by checking _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir), the result of which will be 'true': we're comparing the skip key to the low key of the page. So it thinks the next key can be found on the current page. It locks the page and does a _binsrch, finding item=4 to be returned.

The problem here is that _bt_scankey_within_page mistakenly returns true, thereby limiting the search to just the page that it's pointing to already.
It may be fine to just fix this function to return the proper value (I guess it'd also need to look at the high key in this example). It could also be fixed by not looking at the lo/hi key of the page, but to use the local tuple buffer instead. We already did a _read_page once, so if we have any matching tuples on that specific page, we have them locally in the buffer already. That way we never need to lock the same page twice.


Reply | Threaded
Open this post in threaded view
|

Re: Index Skip Scan

Peter Geoghegan-4
In reply to this post by Jesper Pedersen
On Tue, Jan 21, 2020 at 9:06 AM Jesper Pedersen
<[hidden email]> wrote:
> If you apply the attached patch on master it will fail the test suite;
> did you mean something else ?

Yeah, this is exactly what I had in mind for the _bt_readpage() assertion.

As I said, it isn't a great sign that this kind of assertion is even
necessary in index access method code (code like bufmgr.c is another
matter). Usually it's just obvious that a buffer lock is held. I can't
really blame this patch for that, though. You could say the same thing
about the existing "buffer pin held" _bt_readpage() assertion. It's
good that it verifies what is actually a fragile assumption, even
though I'd prefer to not make a fragile assumption.

--
Peter Geoghegan


Reply | Threaded
Open this post in threaded view
|

Re: Index Skip Scan

Peter Geoghegan-4
In reply to this post by Floris Van Nee
On Tue, Jan 21, 2020 at 11:50 PM Floris Van Nee
<[hidden email]> wrote:

> Anyone please correct me if I'm wrong, but I think one case where the current patch relies on some data from the page it has locked before it in checking this hi/lo key. I think it's possible for the following sequence to happen. Suppose we have a very simple one leaf-page btree containing four elements: leaf page 1 = [2,4,6,8]
> We do a backwards index skip scan on this and have just returned our first tuple (8). The buffer is left pinned but unlocked. Now, someone else comes in and inserts a tuple (value 5) into this page, but suppose the page happens to be full. So a page split occurs. As far as I know, a page split could happen at any random element in the page. One of the situations we could be left with is:
> Leaf page 1 = [2,4]
> Leaf page 2 = [5,6,8]
> However, our scan is still pointing to leaf page 1. For non-skip scans this is not a problem, as we already read all matching elements in our local buffer and we'll return those. But the skip scan currently:
> a) checks the lo-key of the page to see if the next prefix can be found on the leaf page 1
> b) finds out that this is actually true
> c) does a search on the page and returns value=4 (while it should have returned value=6)
>
> Peter, is my understanding about the btree internals correct so far?

This is a good summary. This is the kind of scenario I had in mind
when I expressed a general concern about "stopping between pages".
Processing a whole page at a time is a crucial part of how
_bt_readpage() currently deals with concurrent page splits.

Holding a buffer pin on a leaf page is only effective as an interlock
against VACUUM completely removing a tuple, which could matter with
non-MVCC scans.

> Now that I look at the patch again, I fear there currently may also be such a dependency in the "Advance forward but read backward"-case. It saves the offset number of a tuple in a variable, then does a _bt_search (releasing the lock and pin on the page). At this point, anything can happen to the tuples on this page - the page may be compacted by vacuum such that the offset number you have in your variable does not match the actual offset number of the tuple on the page anymore. Then, at the check for (nextOffset == startOffset) later, there's a possibility the offsets are different even though they relate to the same tuple.

If skip scan is restricted to heapkeyspace indexes (i.e. those created
on Postgres 12+), then it might be reasonable to save an index tuple,
and relocate it within the same page using a fresh binary search that
uses a scankey derived from the same index tuple -- without unsetting
scantid/the heap TID scankey attribute. I suppose that you'll need to
"find your place again" after releasing the buffer lock on a leaf page
for a time. Also, I think that this will only be safe with MVCC scans,
because otherwise the page could be concurrently deleted by VACUUM.

--
Peter Geoghegan


Reply | Threaded
Open this post in threaded view
|

Re: Index Skip Scan

Peter Geoghegan-4
On Wed, Jan 22, 2020 at 10:55 AM Peter Geoghegan <[hidden email]> wrote:
> This is a good summary. This is the kind of scenario I had in mind
> when I expressed a general concern about "stopping between pages".
> Processing a whole page at a time is a crucial part of how
> _bt_readpage() currently deals with concurrent page splits.

Note in particular that index scans cannot return the same index tuple
twice -- processing a page at a time ensures that that cannot happen.

Can a loose index scan return the same tuple (i.e. a tuple with the
same heap TID) to the executor more than once?

--
Peter Geoghegan


Reply | Threaded
Open this post in threaded view
|

RE: Index Skip Scan

Floris Van Nee
> Note in particular that index scans cannot return the same index tuple twice -
> - processing a page at a time ensures that that cannot happen.
>
> Can a loose index scan return the same tuple (i.e. a tuple with the same heap
> TID) to the executor more than once?
>

The loose index scan shouldn't return a tuple twice. It should only be able to skip 'further', so that shouldn't be a problem. Out of curiosity, why can't index scans return the same tuple twice? Is there something in the executor that isn't able to handle this?

-Floris

Reply | Threaded
Open this post in threaded view
|

Re: Index Skip Scan

Dmitry Dolgov
In reply to this post by Floris Van Nee
> On Wed, Jan 22, 2020 at 05:24:43PM +0000, Floris Van Nee wrote:
>
> > > Anyone please correct me if I'm wrong, but I think one case where the current patch relies on some data from the page it has locked before it in checking this hi/lo key. I think it's possible for the following sequence to happen. Suppose we have a very simple one leaf-page btree containing four elements: leaf page 1 = [2,4,6,8]
> > > We do a backwards index skip scan on this and have just returned our first tuple (8). The buffer is left pinned but unlocked. Now, someone else comes in and inserts a tuple (value 5) into this page, but suppose the page happens to be full. So a page split occurs. As far as I know, a page split could happen at any random element in the page. One of the situations we could be left with is:
> > > Leaf page 1 = [2,4]
> > > Leaf page 2 = [5,6,8]
> > > However, our scan is still pointing to leaf page 1.
>
> > In case if we just returned a tuple, the next action would be either
> >> check the next page for another key or search down to the tree. Maybe
>
> But it won't look at the 'next page for another key', but rather at the 'same page or another key', right? In the _bt_scankey_within_page shortcut we're taking, there's no stepping to a next page involved. It just locks the page again that it previously also locked.

Yep, it would look only on the same page. Not sure what do you mean by
"another key", if the current key is not found within the current page
at the first stage, we restart from the root.

> > I'm missing something in your scenario, but the latter will land us on a
> > required page (we do not point to any leaf here), and before the former
> > there is a check for high/low key. Is there anything else missing?
>
> Let me try to clarify. After we return the first tuple, so->currPos.buf is pointing to page=1 in my example (it's the only page after all). We've returned item=8. Then the split happens and the items get rearranged as in my example. We're still pointing with so->currPos.buf to page=1, but the page now contains [2,4]. The split happened to the right, so there's a page=2 with [5,6,8], however the ongoing index scan is unaware of that.
> Now _bt_skip gets called to fetch the next tuple. It starts by checking _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir), the result of which will be 'true': we're comparing the skip key to the low key of the page. So it thinks the next key can be found on the current page. It locks the page and does a _binsrch, finding item=4 to be returned.
>
> The problem here is that _bt_scankey_within_page mistakenly returns true, thereby limiting the search to just the page that it's pointing to already.
> It may be fine to just fix this function to return the proper value (I guess it'd also need to look at the high key in this example). It could also be fixed by not looking at the lo/hi key of the page, but to use the local tuple buffer instead. We already did a _read_page once, so if we have any matching tuples on that specific page, we have them locally in the buffer already. That way we never need to lock the same page twice.

Oh, that's what you mean. Yes, I was somehow tricked by the name of this
function and didn't notice that it checks only one boundary, so in case
of backward scan it returns wrong result. I think in the situation
you've describe it would actually not find any item on the current page
and restart from the root, but nevertheless we need to check for both
keys in _bt_scankey_within_page.


Reply | Threaded
Open this post in threaded view
|

Re: Index Skip Scan

Dmitry Dolgov
In reply to this post by Floris Van Nee
> On Wed, Jan 22, 2020 at 09:08:59PM +0000, Floris Van Nee wrote:
> > Note in particular that index scans cannot return the same index tuple twice -
> > - processing a page at a time ensures that that cannot happen.
> >
> > Can a loose index scan return the same tuple (i.e. a tuple with the same heap
> > TID) to the executor more than once?
> >
>
> The loose index scan shouldn't return a tuple twice. It should only be able to skip 'further', so that shouldn't be a problem.

Yes, it shouldn't happen.


Reply | Threaded
Open this post in threaded view
|

Re: Index Skip Scan

Peter Geoghegan-4
In reply to this post by Floris Van Nee
On Wed, Jan 22, 2020 at 1:09 PM Floris Van Nee <[hidden email]> wrote:
> The loose index scan shouldn't return a tuple twice. It should only be able to skip 'further', so that shouldn't be a problem. Out of curiosity, why can't index scans return the same tuple twice? Is there something in the executor that isn't able to handle this?

I have no reason to believe that the executor has a problem with index
scans that return a tuple more than once, aside from the very obvious:
in general, that will often be wrong. It might not be wrong when the
scan happens to be input to a unique node anyway, or something like
that.

I'm not particularly concerned about it. Just wanted to be clear on
our assumptions for loose index scans -- if loose index scans were
allowed to return a tuple more than once, then that would at least
have to at least be considered in the wider context of the executor
(but apparently they're not, so no need to worry about it). This may
have been mentioned somewhere already. If it is then I must have
missed it.

--
Peter Geoghegan


Reply | Threaded
Open this post in threaded view
|

Re: Index Skip Scan

Peter Geoghegan-4
In reply to this post by Dmitry Dolgov
On Wed, Jan 22, 2020 at 1:35 PM Dmitry Dolgov <[hidden email]> wrote:
> Oh, that's what you mean. Yes, I was somehow tricked by the name of this
> function and didn't notice that it checks only one boundary, so in case
> of backward scan it returns wrong result. I think in the situation
> you've describe it would actually not find any item on the current page
> and restart from the root, but nevertheless we need to check for both
> keys in _bt_scankey_within_page.

I suggest reading the nbtree README file's description of backwards
scans. Read the paragraph that begins with 'We support the notion of
an ordered "scan" of an index...'. I also suggest that you read a bit
of the stuff in the large section on page deletion. Certainly read the
paragraph that begins with 'Moving left in a backward scan is
complicated because...'.

It's important to grok why it's okay that we don't "couple" or "crab"
buffer locks as we descend the tree with Lehman & Yao's design -- we
can get away with having *no* interlock against page splits (e.g.,
pin, buffer lock) when we are "between" levels of the tree. This is
safe, since the page that we land on must still be "substantively the
same page", no matter how much time passes. That is, it must at least
cover the leftmost portion of the keyspace covered by the original
version of the page that we saw that we needed to descend to within
the parent page. The worst that can happen is that we have to recover
from a concurrent page split by moving right one or more times.
(Actually, page deletion can change the contents of a page entirely,
but that's not really an exception to the general rule -- page
deletion is careful about recycling pages that an in flight index scan
might land on.)

Lehman & Yao don't have backwards scans (or left links, or page
deletion). Unlike nbtree. This is why the usual Lehman & Yao
guarantees don't quite work with backward scans. We must therefore
compensate as described by the README file (basically, we check and
re-check for races, possibly returning to the original page when we
think that we might have overlooked something and need to make sure).
It's an exception to the general rule, you could say.


--
Peter Geoghegan


Reply | Threaded
Open this post in threaded view
|

Re: Index Skip Scan

Peter Geoghegan-4
In reply to this post by Peter Geoghegan-4
On Wed, Jan 22, 2020 at 10:55 AM Peter Geoghegan <[hidden email]> wrote:

> On Tue, Jan 21, 2020 at 11:50 PM Floris Van Nee
> <[hidden email]> wrote:
> > As far as I know, a page split could happen at any random element in the page. One of the situations we could be left with is:
> > Leaf page 1 = [2,4]
> > Leaf page 2 = [5,6,8]
> > However, our scan is still pointing to leaf page 1. For non-skip scans this is not a problem, as we already read all matching elements in our local buffer and we'll return those. But the skip scan currently:
> > a) checks the lo-key of the page to see if the next prefix can be found on the leaf page 1
> > b) finds out that this is actually true
> > c) does a search on the page and returns value=4 (while it should have returned value=6)
> >
> > Peter, is my understanding about the btree internals correct so far?
>
> This is a good summary. This is the kind of scenario I had in mind
> when I expressed a general concern about "stopping between pages".
> Processing a whole page at a time is a crucial part of how
> _bt_readpage() currently deals with concurrent page splits.

I want to be clear about what it means that the page doesn't have a
"low key". Let us once again start with a very simple one leaf-page
btree containing four elements: leaf page 1 = [2,4,6,8] -- just like
in Floris' original page split scenario.

Let us also say that page 1 has a left sibling page -- page 0. Page 0
happens to have a high key with the integer value 0. So you could
*kind of* claim that the "low key" of page 1 is the integer value 0
(page 1 values must be > 0) -- *not* the integer value 2 (the
so-called "low key" here is neither > 2, nor >= 2). More formally, an
invariant exists that says that all values on page 1 must be
*strictly* greater than the integer value 0. However, this formal
invariant thing is hard or impossible to rely on when we actually
reach page 1 and want to know about its lower bound -- since there is
no "low key" pivot tuple on page 1 (we can only speak of a "low key"
as an abstract concept, or something that works transitively from the
parent -- there is only a physical high key pivot tuple on page 1
itself).

Suppose further that Page 0 is now empty, apart from its "all values
on page are <= 0" high key (page 0 must have had a few negative
integer values in its tuples at some point, but not anymore). VACUUM
will delete the page, *changing the effective low key* of Page 0 in
the process. The lower bound from the shared parent page will move
lower/left as a consequence of the deletion of page 0. nbtree page
deletion makes the "keyspace move right, not left". So the "conceptual
low key" of page 1 just went down from 0 to -5 (say), without there
being any practical way of a skip scan reading page 1 noticing the
change (the left sibling of page 0, page -1, has a high key of <= -5,
say).

Not only is it possible for somebody to insert the value 1 in page 1
-- now they can insert the value -3 or -4!

More concretely, the pivot tuple in the parent that originally pointed
to page 0 is still there -- all that page deletion changed about this
tuple is its downlink, which now points to page 1 instead or page 0.
Confusingly, page deletion removes the pivot tuple of the right
sibling page from the parent -- *not* the pivot tuple of the empty
page that gets deleted (in this case page 0) itself.

Note: this example ignores things like negative infinity values in
truncated pivot tuples, and the heap TID tiebreaker column -- in
reality this would look a bit different because of those factors.

See also: amcheck's bt_right_page_check_scankey() function, which has
a huge comment that reasons about a race involving page deletion. In
general, page deletion is by far the biggest source of complexity when
reasoning about the key space.
--
Peter Geoghegan


Reply | Threaded
Open this post in threaded view
|

Re: Index Skip Scan

Dmitry Dolgov
In reply to this post by Dmitry Dolgov
> On Wed, Jan 22, 2020 at 10:36:03PM +0100, Dmitry Dolgov wrote:

>
> > Let me try to clarify. After we return the first tuple, so->currPos.buf is pointing to page=1 in my example (it's the only page after all). We've returned item=8. Then the split happens and the items get rearranged as in my example. We're still pointing with so->currPos.buf to page=1, but the page now contains [2,4]. The split happened to the right, so there's a page=2 with [5,6,8], however the ongoing index scan is unaware of that.
> > Now _bt_skip gets called to fetch the next tuple. It starts by checking _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir), the result of which will be 'true': we're comparing the skip key to the low key of the page. So it thinks the next key can be found on the current page. It locks the page and does a _binsrch, finding item=4 to be returned.
> >
> > The problem here is that _bt_scankey_within_page mistakenly returns true, thereby limiting the search to just the page that it's pointing to already.
> > It may be fine to just fix this function to return the proper value (I guess it'd also need to look at the high key in this example). It could also be fixed by not looking at the lo/hi key of the page, but to use the local tuple buffer instead. We already did a _read_page once, so if we have any matching tuples on that specific page, we have them locally in the buffer already. That way we never need to lock the same page twice.
>
> Oh, that's what you mean. Yes, I was somehow tricked by the name of this
> function and didn't notice that it checks only one boundary, so in case
> of backward scan it returns wrong result. I think in the situation
> you've describe it would actually not find any item on the current page
> and restart from the root, but nevertheless we need to check for both
> keys in _bt_scankey_within_page.
Thanks again everyone for commentaries and clarification. Here is the
version, where hopefully I've addressed all the mentioned issues.

As mentioned in the _bt_skip commentaries, before we were moving left to
check the next page to avoid significant issues in case if ndistinct was
underestimated and we need to skip too often. To make it work safe in
presense of splits we need to remember an original page and move right
again until we find a page with the right link pointing to it. It's not
clear whether it's worth to increase complexity for such sort of "edge
case" with ndistinct estimation while moving left, so at least for now
we ignore this in the implementation and just start from the root
immediately.

Offset based code in moving forward/reading backward was replaced with
remembering a start index tuple and an attempt to find it on the new
page. Also a missing page lock before _bt_scankey_within_page was added
and _bt_scankey_within_page checks for both page boundaries.

v31-0001-Unique-key.patch (26K) Download Attachment
v31-0002-Index-skip-scan.patch (74K) Download Attachment
12