Tid scan improvements


Re: Tid scan improvements

Tomas Vondra-4


On 11/24/18 1:56 AM, Edmund Horner wrote:

> On Fri, 23 Nov 2018 at 07:03, Tomas Vondra <[hidden email]> wrote:
>> On 11/22/18 8:41 AM, David Rowley wrote:
>>> ...
>>> 3. I'd rather see EnsureTidRangeSpace() keep doubling the size of the
>>> allocation until it reaches the required size. See how
>>> MakeSharedInvalidMessagesArray() does it.  Doing it this way ensures
>>> we always have a power of two sized array which is much nicer if we
>>> ever reach the palloc() limit as if the array is sized at the palloc()
>>> limit / 2 + 1, then if we try to double it'll fail.  Of course, it's
>>> unlikely to be a problem here, but... the question would be how to
>>> decide on the initial size.
>>
>> I think it kinda tries to do that in some cases, by doing this:
>>
>>      *numAllocRanges *= 2;
>>      ...
>>      tidRanges = (TidRange *)
>>          repalloc(tidRanges,
>>                   *numAllocRanges * sizeof(TidRange));
>>
>> The problem here is that what matters is not numAllocRanges being 2^N,
>> but the number of bytes allocated being 2^N. Because that's what ends up
>> in AllocSet, which keeps lists of 2^N chunks.
>>
>> And as TidRange is 12B, this is guaranteed to waste memory, because
>> no matter what the first factor is, the result will never be 2^N.
>
> For simplicity, I think making it a strict doubling of capacity each
> time is fine.  That's what we see in numerous other places in the
> backend code.
>

Sure.

> What we don't really see is intentionally setting the initial capacity
> so that each subsequent capacity is close-to-but-not-exceeding a power
> of 2 bytes.  You can't really do that optimally if working in terms of
> whole numbers of items that aren't each a power of 2 size.  This step,
> there may be 2/3 of an item spare; next step, we'll have a whole item
> spare that we're not going to use.

Ah, I missed the detail with setting initial size.

> So we could keep track in terms of bytes allocated, and then figure
> out how many items we can fit at the current time.
>
> In my opinion, such complexity is overkill for Tid scans.
>
> Currently, we try to pick an initial size based on the number of
> expressions.  We assume each expression will yield one range, and
> allow that a saop expression might require us to enlarge the array.
>
> Again, for simplicity, we should scrap that and pick something like
> floor(256/sizeof(TidRange)) = 21 items, with about 1.5% wastage.
>

Probably. I don't think it'd be a lot of code to do the exact sizing,
but you're right 1.5% is close enough. As long as there is a comment
explaining the initial sizing, I'm fine with that.
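For what it's worth, the 1.5% figure is easy to check with a standalone
sketch (the constants below are assumptions from the discussion: a
12-byte TidRange and a 256-byte initial allocation; none of this is
patch code):

```c
#include <assert.h>

/* Stand-ins for sizeof(TidRange) (12 bytes, per the discussion) and a
 * 256-byte initial allocation; both are assumptions for illustration. */
enum { TIDRANGE_SIZE = 12, INITIAL_CHUNK = 256 };

static int
initial_items(void)
{
    return INITIAL_CHUNK / TIDRANGE_SIZE;   /* floor(256/12) = 21 */
}

static double
wastage(void)
{
    int used = initial_items() * TIDRANGE_SIZE;             /* 252 bytes */
    return (double) (INITIAL_CHUNK - used) / INITIAL_CHUNK; /* 4/256 */
}
```

With these numbers, 21 items fit and 4 of 256 bytes (about 1.6%) go
unused, matching the figure above.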

If I could suggest one more thing, I'd define a struct combining the
array of ranges, numRanges and numAllocRanges, like:

typedef struct TidRanges
{
    int         numRanges;
    int         numAllocRanges;
    TidRange    ranges[FLEXIBLE_ARRAY_MEMBER];
} TidRanges;

and use that instead of the plain array. I find it easier to follow
compared to passing the various fields directly (sometimes as a value,
sometimes pointer to the value, etc.).
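The suggested layout might look like this as a standalone sketch, using
malloc in place of palloc and a made-up TidRange layout (both are
assumptions for illustration, not backend code):

```c
#include <assert.h>
#include <stdlib.h>

/* A made-up stand-in for the real TidRange; illustration only. */
typedef struct TidRange
{
    unsigned int   blockLo;
    unsigned short offsetLo;
    unsigned int   blockHi;
    unsigned short offsetHi;
} TidRange;

/* The struct layout suggested above, with a C99 flexible array member
 * (spelled FLEXIBLE_ARRAY_MEMBER in backend code). */
typedef struct TidRanges
{
    int      numRanges;         /* entries currently in use */
    int      numAllocRanges;    /* allocated capacity */
    TidRange ranges[];          /* flexible array member */
} TidRanges;

/* Allocate header and array in a single block, as one would with palloc. */
static TidRanges *
tidranges_create(int capacity)
{
    TidRanges *tr = malloc(sizeof(TidRanges) + capacity * sizeof(TidRange));

    if (tr != NULL)
    {
        tr->numRanges = 0;
        tr->numAllocRanges = capacity;
    }
    return tr;
}
```

One nicety of this shape is that a single pointer carries the array and
both counters together, so nothing gets passed sometimes by value and
sometimes by pointer.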

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Tid scan improvements

Edmund Horner
On Sat, 24 Nov 2018 at 15:46, Tomas Vondra <[hidden email]> wrote:

> [...]
>
> If I could suggest one more thing, I'd define a struct combining the
> array of ranges, numRanges and numAllocRanges, like:
>
> typedef struct TidRanges
> {
>     int         numRanges;
>     int         numAllocRanges;
>     TidRange    ranges[FLEXIBLE_ARRAY_MEMBER];
> } TidRanges;
>
> and use that instead of the plain array. I find it easier to follow
> compared to passing the various fields directly (sometimes as a value,
> sometimes pointer to the value, etc.).

OK, I've rewritten it to use a struct:

typedef struct TidRangeArray {
    TidRange *ranges;
    int numRanges;
    int numAllocated;
}           TidRangeArray;

which is slightly different from the flexible array member version you
suggested.  The TidRangeArray is allocated on the stack in the
function that builds it, and then ranges and numRanges are copied into
the TidScanState before the function returns.

Any particular pros/cons of this versus your approach?  With yours, I
presume we'd have a pointer to TidRanges in TidScanState.

My other concern now is that EnsureTidRangeSpace needs a loop to
double the allocated size.  Most such arrays in the backend only ever
grow by 1, so a single doubling is fine, but the TID scan one can grow
by an arbitrary number with a scalar array op, and it's nice to not
have to check the space for each individual item.  Here's what I've
got.

void
EnsureTidRangeSpace(TidRangeArray *tidRangeArray, int numNewItems)
{
    int requiredSpace = tidRangeArray->numRanges + numNewItems;
    if (requiredSpace <= tidRangeArray->numAllocated)
        return;

    /* it's not safe to double the size unless we're below INT_MAX / 2 */
    if (requiredSpace >= INT_MAX / 2)
        tidRangeArray->numAllocated = requiredSpace;
    else
        while (tidRangeArray->numAllocated < requiredSpace)
            tidRangeArray->numAllocated *= 2;

    tidRangeArray->ranges = (TidRange *)
        repalloc(tidRangeArray->ranges,
                 tidRangeArray->numAllocated * sizeof(TidRange));
}

If you're in danger of overflowing numAllocated with the number of
TIDs in your query, you're probably going to have other problems.  But
I'd prefer to at least not get stuck in an infinite doubling loop.

Note that you don't need any single ScalarArrayOp to return a huge
result, because you can have multiple such ops in your query, and the
results for each all need to get put into the TidRangeArray before
de-duplication occurs.

What's a safe way to check that we're not trying to process too many items?
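One option, sketched here outside the backend (the function name is
invented, and MAX_ITEMS is a stand-in for something like
MaxAllocSize / sizeof(TidRange)), is to clamp the doubling against a
hard item cap and let the caller raise the error:

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical hard cap; the real one would derive from MaxAllocSize. */
#define MAX_ITEMS (INT_MAX / 2)

/*
 * Compute a new capacity covering 'required', doubling from 'allocated'.
 * Returns -1 when 'required' exceeds the cap, so the caller can report
 * an error rather than loop or overflow.
 */
static int
grow_capacity(int allocated, int required)
{
    if (required > MAX_ITEMS)
        return -1;              /* too many items: caller reports an error */
    if (required <= allocated)
        return allocated;       /* already large enough */
    if (allocated < 1)
        allocated = 1;          /* avoid a 0 * 2 == 0 infinite loop */
    while (allocated < required && allocated < MAX_ITEMS)
        allocated *= 2;         /* no overflow: allocated < INT_MAX / 2 here */
    return (allocated < required) ? required : allocated;
}
```

Because the loop body only runs while allocated is below MAX_ITEMS,
the doubling can neither overflow nor spin forever, and any genuinely
oversized request surfaces as an explicit failure.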


Re: Tid scan improvements

Edmund Horner
In reply to this post by David Rowley-3
On Thu, 22 Nov 2018 at 20:41, David Rowley <[hidden email]> wrote:
> I've now had a look over the latest patches and I've found a few more
> things.  Many of these are a bit nitpicky, but certainly not all. I
> also reviewed 0004 this time.


Whew!  A lot more things to look at.

I've tried to address most of what you've raised, and attach yet
another set of patches.  There are a few things that I'm not settled
on, discussed below under Big Items.

CC'd Tomas, if he wants to check what I've done with the TidRange
array allocation.


***** Big Items *****

> 0001:
>
> 1. The row estimates are not quite right.  This causes the row
> estimation to go the wrong way for isgt.
>
> For example, the following gets 24 rows instead of 26.
>
> postgres=# create table t (a int);
> CREATE TABLE
> postgres=# insert into t select generate_Series(1,100);
> INSERT 0 100
> postgres=# analyze t;
> postgres=# explain analyze select * from t where ctid >= '(0,75)';
>                                          QUERY PLAN
> ---------------------------------------------------------------------------------------------
>  Seq Scan on t  (cost=0.00..2.25 rows=24 width=4) (actual
> time=0.046..0.051 rows=26 loops=1)
>    Filter: (ctid >= '(0,75)'::tid)
>    Rows Removed by Filter: 74
>  Planning Time: 0.065 ms
>  Execution Time: 0.074 ms
> (5 rows)
>
> The < and <= case is not quite right either. < should have 1 fewer
> tuple than the calculated average tuples per page, and <= should have
> the same (assuming no gaps)
>
> I've attached a small delta patch that I think improves things here.

Thanks, I've incorporated your patch.  I think the logic for iseq and
isgt makes sense now.

Since we only have the total number of tuples and the total number of
pages, and no real statistics, this might be the best we can
reasonably do.  There's still a noticeable rowcount error for the last
page, and slighter rowcount errors for other pages.  We estimate
density = ntuples/npages for all pages; but in a densely populated
table, the last page will on average hold only half as many tuples as
the earlier pages.

I guess we *could* estimate density = ntuples/(npages - 0.5) for all
but the last page; and half that for the last.  But that adds
complexity, and you'd still only get a good row count when the last
page was about half full.

I implemented this anyway, and it does improve row counts a bit.  I'll
include it in the next patch set and you can take a look.
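The arithmetic can be sketched as a hypothetical helper (invented name,
not the patch's code): treat every page except the last as holding
ntuples/(npages - 0.5) tuples and the last page as half that.

```c
#include <assert.h>

/*
 * Estimated rows returned by a scan from 'first_block' to the end of a
 * densely packed table, using density = ntuples / (npages - 0.5) for
 * ordinary pages and half that for the final page.
 */
static double
estimate_rows_from_block(double ntuples, double npages, double first_block)
{
    double density = ntuples / (npages - 0.5);
    double full_pages = npages - 1.0 - first_block; /* pages before the last */

    return full_pages * density + 0.5 * density;    /* half-full last page */
}
```

Note that a whole-table scan (first_block = 0) comes out to exactly
ntuples under this formula, which the plain ntuples/npages density
only achieves when the last page is as full as the rest.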

I also spent some time today agonising over how visibility would affect
things, but did not come up with anything useful to add to our
formulas.

> 3. I'd rather see EnsureTidRangeSpace() keep doubling the size of the
> allocation until it reaches the required size. See how
> MakeSharedInvalidMessagesArray() does it.  Doing it this way ensures
> we always have a power of two sized array which is much nicer if we
> ever reach the palloc() limit as if the array is sized at the palloc()
> limit / 2 + 1, then if we try to double it'll fail.  Of course, it's
> unlikely to be a problem here, but... the question would be how to
> decide on the initial size.

I've tried to change things that way, but we still need to deal with
excessive numbers of items.

I've defined a constant MaxTidRanges = MaxAllocSize/sizeof(TidRange),
and raise an error if the required size exceeds that.

> 4. "at" needs shifted left a couple of words
>
> /*
> * If the lower bound was already or above at the maximum block
> * number, then there is no valid range.
> */
>
> but I don't see how it could be "or above".  The ctid type does not
> have the room for that. Although, that's not to say you should test if
> (block == MaxBlockNumber), the >= seems better for the code. I'm just
> complaining about the comment.

We have to deal with TIDs entered by the user, which can include
invalid ones like (4294967295,0).  MaxBlockNumber is 4294967294.

> 12. expected_comparison_operator is a bit long a name:
>
> IsTidComparison(OpExpr *node, int varno, Oid expected_comparison_operator)
>
> How about just expected_opno?
>
> 14. This is not great:
>
> [horrible macros in tidpath.c]
>
> The 4 macros for >, >=, < and <= are only used by IsTidRangeClause()
> which means IsTidComparison() could get called up to 4 times. Most of
> the work it does would be redundant in that case.  Maybe it's better
> to rethink that?

Yeah.  I've rewritten all this as two functions, IsTidEqualClause and
IsTidRangeClause, which each check the opno, with a helper function
IsTidBinaryExpression that checks everything else.

> 18. I'd expect the following not to produce a sort above the Tid Scan.
>
> postgres=# set enable_seqscan=0;
> SET
> postgres=# explain select * from t inner join t t1 on t.ctid = t1.ctid
> where t.ctid < '(0,10)' ;
>                                       QUERY PLAN
> ---------------------------------------------------------------------------------------
>  Merge Join  (cost=10000000008.65..10000000009.28 rows=9 width=8)
>    Merge Cond: (t.ctid = t1.ctid)
>    ->  Sort  (cost=3.33..3.35 rows=9 width=10)
>          Sort Key: t.ctid
>          ->  Tid Scan on t  (cost=0.00..3.18 rows=9 width=10)
>                TID Cond: (ctid < '(0,10)'::tid)
>    ->  Sort  (cost=10000000005.32..10000000005.57 rows=100 width=10)
>          Sort Key: t1.ctid
>          ->  Seq Scan on t t1  (cost=10000000000.00..10000000002.00
> rows=100 width=10)
> (9 rows)
>
> On looking at why the planner did this, I see it's down to how you've
> coded create_tidscan_paths(). You're creating a tidpath if there's any
> quals or any useful pathkeys useful to the query's ORDER BY, but only
> including the pathkeys if they're useful for the query's ORDER BY.  I
> think it'll be better to include the forward pathkeys in all cases,
> and just make it a backward Tid Scan if backward keys are useful for
> the ORDER BY.   There's still a problem with this as a Merge Join
> would need a Sort if there was an ORDER BY ctid DESC for one relation
> even if the other relation had some valid ctid quals since the 2nd
> scan would create a forward Tid Scan.  Maybe that's not worth worrying
> about.   The only fix I can imagine is to always create a forward and
> backward Tid Scan path, which is pretty bad as it's two more paths
> that likely won't get used 99.9% of the time.

Two paths seem excessive just to cater for these unlikely plans.  We
don't provide any other support for joining on CTID.

But setting the path keys doesn't cost much, so we should do that.

> This also caused me to notice the costs are pretty broken for this:
>
> postgres=# explain select * from t order by ctid;
>                      QUERY PLAN
> ---------------------------------------------------
>  Tid Scan on t  (cost=0.00..0.00 rows=100 width=10)
> (1 row)

Yeah -- a side effect of treating empty tidquals as a scan over the
whole table.  I've added costing code for this case.


***** Smaller items *****

Compacted for brevity (hope you don't mind):

> 2. You should test for a non-empty List with list != NIL  [...]  Also, can you not just return after that if test? I think the code
> would be easier to read with it like that.
> 5. TidInArrayExprEval() lacks a header comment [...]
> 6. In MergeTidRanges(), you have: [leftover code]
> 7. "fist" -> "first" [...]
> 8. tss_TidRangePtr is a pretty confusingly named field. [...]
> 9. This comment seems to indicate that a range can only have one bound, but that does not seem to be the case. [...]
> 10. In cost_tidscan() I think you should ceil() the following: [...]
> 11. In the comment: /* TODO decide what the costs should be */ [...]
> 13. !rlst -> rlst != NIL
> 15. There's no field named NumTids: [...]
> 16. I think the following comment needs to be updated: [heapam comments]
> 17. Can you make a few changed to build_tidscan_pathkeys(): [...]
> 19. Looks like the ScanDirection's normally get named "scandir": [...]

These are mostly trivial, and I've generally gone with your recommendations.

Attachments:
- v5-0001-Add-selectivity-and-nullness-estimates-for-CTID-syst.patch (4K)
- v5-0002-Support-range-quals-in-Tid-Scan.patch (85K)
- v5-0003-Support-backward-scans-over-restricted-ranges-in-hea.patch (2K)
- v5-0004-Tid-Scan-results-are-ordered.patch (27K)
- v5-0005-TID-selectivity-reduce-the-density-of-the-last-page-.patch (2K)

Re: Tid scan improvements

David Rowley-3
Review of v5:

0001: looks good.

0002:

1. I don't think you need palloc0() here. palloc() looks like it would be fine.

if (tidRangeArray->ranges == NULL)
tidRangeArray->ranges = (TidRange *)
palloc0(tidRangeArray->numAllocated * sizeof(TidRange));

If that were not the case, then you'd also need to zero the additional
memory when you repalloc().

2. Can't the following code be moved into the correct
forwards/backwards if block inside the if inscan block above?

/* If we've finished iterating over the ranges, exit the loop. */
if (node->tss_CurrentTidRange >= numRanges ||
node->tss_CurrentTidRange < 0)
break;

Something like:

if (bBackward)
{
    if (node->tss_CurrentTidRange < 0)
    {
        /* initialize for backward scan */
        node->tss_CurrentTidRange = numRanges - 1;
    }
    else if (node->tss_CurrentTidRange == 0)
        break;
    else
        node->tss_CurrentTidRange--;
 }
else
{
    if (node->tss_CurrentTidRange < 0)
    {
        /* initialize for forward scan */
        node->tss_CurrentTidRange = 0;
    }
    else if (node->tss_CurrentTidRange >= numRanges - 1)
        break;
    else
        node->tss_CurrentTidRange++;
}

I think that's a few fewer lines and instructions, and (I think) a bit neater too.
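The suggested control flow can be restated as a standalone function (a
sketch with invented names, not the executor code; it also explicitly
guards against numRanges == 0, which the snippet above does not check):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Advance *cur through numRanges ranges in the given direction.
 * *cur == -1 means the scan has not started yet.  Returns false once
 * the iteration is exhausted (or there are no ranges at all).
 */
static bool
advance_range(int *cur, int numRanges, bool backward)
{
    if (numRanges <= 0)
        return false;               /* nothing to scan */

    if (backward)
    {
        if (*cur < 0)
            *cur = numRanges - 1;   /* initialize for backward scan */
        else if (*cur == 0)
            return false;           /* ran off the front */
        else
            (*cur)--;
    }
    else
    {
        if (*cur < 0)
            *cur = 0;               /* initialize for forward scan */
        else if (*cur >= numRanges - 1)
            return false;           /* ran off the end */
        else
            (*cur)++;
    }
    return true;
}
```

Driving it in a loop (`while (advance_range(&i, n, dir)) ...`) visits
the ranges 0..n-1 in either direction with a single exit condition.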

3. if (found_quals != NIL)  (yeah, I know there's already lots of
places not doing this)

/* If we found any, make an AND clause out of them. */
if (found_quals)

likewise in:

/* Use a range qual if any were found. */
if (found_quals)

4. The new tests in tidscan.sql should drop the newly created tables.
(I see some get dropped in the 0004 patch, but not all. Best not to
rely on a later patch to do work that this patch should do)

0003: looks okay.

0004:

5. Please add a comment to scandir in:

typedef struct TidScan
{
Scan scan;
List    *tidquals; /* qual(s) involving CTID = something */
ScanDirection scandir;
} TidScan;

/* forward or backward or don't care */ would do.

Likewise for struct TidPath. Likely IndexPath can be used for guidance.

6. Is it worth adding a Merge Join regression test for this patch?

Something like:

postgres=# explain select * from t1 inner join t1 t2 on t1.ctid =
t2.ctid order by t1.ctid desc;
                                 QUERY PLAN
-----------------------------------------------------------------------------
 Merge Join  (cost=0.00..21.25 rows=300 width=14)
   Merge Cond: (t1.ctid = t2.ctid)
   ->  Tid Scan Backward on t1  (cost=0.00..8.00 rows=300 width=10)
   ->  Materialize  (cost=0.00..8.75 rows=300 width=10)
         ->  Tid Scan Backward on t1 t2  (cost=0.00..8.00 rows=300 width=10)
(5 rows)

0005:

7. I see the logic behind this new patch, but quite possibly the
majority of the time the relpages will be out of date and you'll
mistakenly apply this to not the final page. I'm neither here nor
there with it. I imagine you might feel the same since you didn't
merge it with 0001. Maybe we can leave it out for now and see what
others think.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Tid scan improvements

Tom Lane-2
In reply to this post by Edmund Horner
Edmund Horner <[hidden email]> writes:
> [ tid scan patches ]

I'm having a hard time wrapping my mind around why you'd bother with
backwards TID scans.  The amount of code needed versus the amount of
usefulness seems like a pretty bad cost/benefit ratio, IMO.  I can
see that there might be value in knowing that a regular scan has
"ORDER BY ctid ASC" pathkeys (mainly, that it might let us mergejoin
on TID without an explicit sort).  It does not, however, follow that
there's any additional value in supporting the DESC case.

                        regards, tom lane


Re: Tid scan improvements

Andres Freund
Hi,

On 2018-12-20 17:21:07 -0500, Tom Lane wrote:

> Edmund Horner <[hidden email]> writes:
> > [ tid scan patches ]
>
> I'm having a hard time wrapping my mind around why you'd bother with
> backwards TID scans.  The amount of code needed versus the amount of
> usefulness seems like a pretty bad cost/benefit ratio, IMO.  I can
> see that there might be value in knowing that a regular scan has
> "ORDER BY ctid ASC" pathkeys (mainly, that it might let us mergejoin
> on TID without an explicit sort).  It does not, however, follow that
> there's any additional value in supporting the DESC case.

I've not followed this thread, but wouldn't that be quite useful to be
able to move old tuples to free space earlier in the table?

I've written multiple scripts that update the later pages in a table, to
force reuse of earlier free pages (in my case by generating ctid = ANY()
style queries with all possible tids for the last few pages, the most
efficient way I could think of).

Greetings,

Andres Freund


Re: Tid scan improvements

Tom Lane-2
Andres Freund <[hidden email]> writes:
> On 2018-12-20 17:21:07 -0500, Tom Lane wrote:
>> I'm having a hard time wrapping my mind around why you'd bother with
>> backwards TID scans.

> I've not followed this thread, but wouldn't that be quite useful to be
> able to move old tuples to free space earlier in the table?
> I've written multiple scripts that update the later pages in a table, to
> force reuse of earlier free pages (in my case by generating ctid = ANY()
> style queries with all possible tids for the last few pages, the most
> efficient way I could think of).

Sure, but wouldn't you now write those using something on the order of

      WHERE ctid >= '(cutoff_page_here, 1)'

?  I don't see that you'd want to write "ORDER BY ctid DESC LIMIT n"
because you wouldn't know what value of n to use to get all the
tuples on some-number-of-ending-pages.

                        regards, tom lane


Re: Tid scan improvements

Andres Freund
Hi,

On 2018-12-20 18:06:41 -0500, Tom Lane wrote:

> Andres Freund <[hidden email]> writes:
> > On 2018-12-20 17:21:07 -0500, Tom Lane wrote:
> >> I'm having a hard time wrapping my mind around why you'd bother with
> >> backwards TID scans.
>
> > I've not followed this thread, but wouldn't that be quite useful to be
> > able to move old tuples to free space earlier in the table?
> > I've written multiple scripts that update the later pages in a table, to
> > force reuse of earlier free pages (in my case by generating ctid = ANY()
> > style queries with all possible tids for the last few pages, the most
> > efficient way I could think of).
>
> Sure, but wouldn't you now write those using something on the order of
>
>       WHERE ctid >= '(cutoff_page_here, 1)'
>
> ?  I don't see that you'd want to write "ORDER BY ctid DESC LIMIT n"
> because you wouldn't know what value of n to use to get all the
> tuples on some-number-of-ending-pages.

I think you'd want both, to make sure there's not more tuples than
estimated. With the limit calculated to ensure there's enough free space
for them to actually fit.

Greetings,

Andres Freund


Re: Tid scan improvements

Edmund Horner
In reply to this post by Tom Lane-2
On Fri, 21 Dec 2018 at 11:21, Tom Lane <[hidden email]> wrote:

>
> Edmund Horner <[hidden email]> writes:
> > [ tid scan patches ]
>
> I'm having a hard time wrapping my mind around why you'd bother with
> backwards TID scans.  The amount of code needed versus the amount of
> usefulness seems like a pretty bad cost/benefit ratio, IMO.  I can
> see that there might be value in knowing that a regular scan has
> "ORDER BY ctid ASC" pathkeys (mainly, that it might let us mergejoin
> on TID without an explicit sort).  It does not, however, follow that
> there's any additional value in supporting the DESC case.

I have occasionally found myself running "SELECT MAX(ctid) FROM t"
when I was curious about why a table is so big after vacuuming.

Perhaps that's not a common enough use case to justify the amount of
code, especially the changes to heapam.c and explain.c.

We'd still need the pathkeys to make good use of forward scans.  (And
I think the executor still needs to support seeking backward for
cursors.)


Re: Tid scan improvements

David Rowley-3
On Fri, 21 Dec 2018 at 13:09, Edmund Horner <[hidden email]> wrote:

> On Fri, 21 Dec 2018 at 11:21, Tom Lane <[hidden email]> wrote:
> > I'm having a hard time wrapping my mind around why you'd bother with
> > backwards TID scans.  The amount of code needed versus the amount of
> > usefulness seems like a pretty bad cost/benefit ratio, IMO.  I can
> > see that there might be value in knowing that a regular scan has
> > "ORDER BY ctid ASC" pathkeys (mainly, that it might let us mergejoin
> > on TID without an explicit sort).  It does not, however, follow that
> > there's any additional value in supporting the DESC case.
>
> I have occasionally found myself running "SELECT MAX(ctid) FROM t"
> when I was curious about why a table is so big after vacuuming.
>
> Perhaps that's not a common enough use case to justify the amount of
> code, especially the changes to heapam.c and explain.c.
>
> We'd still need the pathkeys to make good use of forward scans.  (And
> I think the executor still needs to support seeking backward for
> cursors.)

I think the best thing to do here is separate out all the additional
backwards scan code into its own patch, to allow it to be more easily
considered and approved, or rejected.  If there's any hint of this
blocking the main patch, then it should be a separate patch so that
its worth can be considered independently.

Also, my primary interest in this patch is to find tuples that are
stopping the heap being truncated during a vacuum. Generally, when I'm
looking for that I have a good idea of what size I expect the relation
should be, (otherwise I'd not think it was bloated), in which case I'd
be doing WHERE ctid >= '(N,1)'. However, it might be easier to write
some auto-bloat-removal script if we could have an ORDER BY ctid DESC
LIMIT n.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Tid scan improvements

Edmund Horner
On Fri, 21 Dec 2018 at 13:25, David Rowley <[hidden email]> wrote:

> On Fri, 21 Dec 2018 at 13:09, Edmund Horner <[hidden email]> wrote:
> > [...]
>
> I think the best thing to do here is separate out all the additional
> backwards scan code into a separate patch to allow it to be easier
> considered and approved, or rejected. I think if there's any hint of
> this blocking the main patch then it should be a separate patch to
> allow its worth to be considered independently.

Yeah, I think you're right.  I'll separate those parts into the basic
forward scan, and then the optional backward scan support.  I think
we'll still only generate a backward scan if the query_pathkeys makes
use of it.

For the forward scan, I seem to recall, from your merge join example,
that it's useful to set the pathkeys even when there are no
query_pathkeys.  We just have to unconditionally set them so that the
larger plan can make use of them.

> Also, my primary interest in this patch is to find tuples that are
> stopping the heap being truncated during a vacuum. Generally, when I'm
> looking for that I have a good idea of what size I expect the relation
> should be, (otherwise I'd not think it was bloated), in which case I'd
> be doing WHERE ctid >= '(N,1)'. However, it might be easier to write
> some auto-bloat-removal script if we could have an ORDER BY ctid DESC
> LIMIT n.


Re: Tid scan improvements

Tom Lane-2
Edmund Horner <[hidden email]> writes:
> For the forward scan, I seem to recall, from your merge join example,
> that it's useful to set the pathkeys even when there are no
> query_pathkeys.  We just have to unconditionally set them so that the
> larger plan can make use of them.

No.  Look at indxpath.c: it does not worry about pathkeys unless
has_useful_pathkeys is true, and it definitely does not generate
pathkeys that don't get past truncate_useless_pathkeys.  Those
functions are responsible for worrying about whether mergejoin
can use the pathkeys.  It's not tidpath.c's job to outthink them.

                        regards, tom lane


Re: Tid scan improvements

Edmund Horner
On Fri, 21 Dec 2018 at 16:31, Tom Lane <[hidden email]> wrote:

>
> Edmund Horner <[hidden email]> writes:
> > For the forward scan, I seem to recall, from your merge join example,
> > that it's useful to set the pathkeys even when there are no
> > query_pathkeys.  We just have to unconditionally set them so that the
> > larger plan can make use of them.
>
> No.  Look at indxpath.c: it does not worry about pathkeys unless
> has_useful_pathkeys is true, and it definitely does not generate
> pathkeys that don't get past truncate_useless_pathkeys.  Those
> functions are responsible for worrying about whether mergejoin
> can use the pathkeys.  It's not tidpath.c's job to outthink them.

Ok.  I think that will simplify things.  So if I follow you correctly,
we should do:

1. If has_useful_pathkeys is true: generate pathkeys (for CTID ASC),
and use truncate_useless_pathkeys on them.
2. If we have tid quals or pathkeys, emit a TID scan path.

For the (optional) backwards scan support patch, should we separately
emit another path, in the reverse direction?  (My current patch only
creates one path, and tries to decide what the best direction is by
looking at query_pathkeys.  This doesn't fit into the above
algorithm.)


Re: Tid scan improvements

Tom Lane-2
Edmund Horner <[hidden email]> writes:
> Ok.  I think that will simplify things.  So if I follow you correctly,
> we should do:

> 1. If has_useful_pathkeys is true: generate pathkeys (for CTID ASC),
> and use truncate_useless_pathkeys on them.
> 2. If we have tid quals or pathkeys, emit a TID scan path.

Check.

> For the (optional) backwards scan support patch, should we separately
> emit another path, in the reverse direction?

What indxpath.c does is, if has_useful_pathkeys is true, to generate
pathkeys both ways and then build paths if the pathkeys get past
truncate_useless_pathkeys.  That seems sufficient in this case too.
There are various heuristics about whether it's really useful to
consider both sort directions, but that intelligence is already
built into truncate_useless_pathkeys.  tid quals with no pathkeys
would be reason to generate a forward path, but not reason to
generate a reverse path, because then that would be duplicative.

                        regards, tom lane


Re: Tid scan improvements

Tom Lane-2
In reply to this post by Edmund Horner
BTW, with respect to this bit in 0001:

@@ -1795,6 +1847,15 @@ nulltestsel(PlannerInfo *root, NullTestType nulltesttype, Node *arg,
                 return (Selectivity) 0; /* keep compiler quiet */
         }
     }
+    else if (vardata.var && IsA(vardata.var, Var) &&
+             ((Var *) vardata.var)->varattno == SelfItemPointerAttributeNumber)
+    {
+        /*
+         * There are no stats for system columns, but we know CTID is never
+         * NULL.
+         */
+        selec = (nulltesttype == IS_NULL) ? 0.0 : 1.0;
+    }
     else
     {
         /*

I'm not entirely sure why you're bothering; surely nulltestsel is
unrelated to what this patch is about?  And would anybody really
write "WHERE ctid IS NULL"?

However, if we do think it's worth adding code to cover this case,
I wouldn't make it specific to CTID.  *All* system columns can be
assumed not null, see heap_getsysattr().

                        regards, tom lane


Re: Tid scan improvements

Edmund Horner
On Sat, 22 Dec 2018 at 07:10, Tom Lane <[hidden email]> wrote:
> BTW, with respect to this bit in 0001:
>
> @@ -1795,6 +1847,15 @@ nulltestsel(PlannerInfo *root, NullTestType nulltesttype, Node *arg,
>                  return (Selectivity) 0; /* keep compiler quiet */
>          }
>      }
> +    else if (vardata.var && IsA(vardata.var, Var) &&
> +             ((Var *) vardata.var)->varattno == SelfItemPointerAttributeNumber)
> +    {
> +        /*
> +         * There are no stats for system columns, but we know CTID is never
> +         * NULL.
> +         */
> +        selec = (nulltesttype == IS_NULL) ? 0.0 : 1.0;
> +    }
>      else
>      {
>          /*
>
> I'm not entirely sure why you're bothering; surely nulltestsel is
> unrelated to what this patch is about?  And would anybody really
> write "WHERE ctid IS NULL"?

I found that it made a difference with selectivity of range comparisons, because clauselist_selectivity tries to correct for it (clausesel.c:274):

    s2 = rqlist->hibound + rqlist->lobound - 1.0;

    /* Adjust for double-exclusion of NULLs */
    s2 += nulltestsel(root, IS_NULL, rqlist->var,
                      varRelid, jointype, sjinfo);

It was adding DEFAULT_UNK_SEL = 0.005 to the selectivity, which (while not major) did make the selectivity less accurate.

> However, if we do think it's worth adding code to cover this case,
> I wouldn't make it specific to CTID.  *All* system columns can be
> assumed not null, see heap_getsysattr().

I guess we could have a standalone patch to add this for all system columns?



Re: Tid scan improvements

Tom Lane-2
Edmund Horner <[hidden email]> writes:
> On Sat, 22 Dec 2018 at 07:10, Tom Lane <[hidden email]> wrote:
>> I'm not entirely sure why you're bothering; surely nulltestsel is
>> unrelated to what this patch is about?

> I found that it made a difference with selectivity of range comparisons,
> because clauselist_selectivity tries to correct for it (clausesel.c:274):

Oh, I see.

> I guess we could have a standalone patch to add this for all system columns?

+1

                        regards, tom lane


Re: Tid scan improvements

Edmund Horner
In reply to this post by Tom Lane-2
Hi all,

I am a bit stuck and I think it's best to try to explain where.

I'm still rebasing the patches for the changes Tom made to support parameterised TID paths for joins.  While the addition of join support itself does not really touch the same code, the modernisation -- in particular, returning a list of RestrictInfos rather than raw quals -- does rewrite quite a bit of tidpath.c.

The original code returned:

    List (with OR semantics)
      CTID = ?   or   CTID = ANY (...)   or   IS CURRENT OF
      (more items)

That changed recently to return:

    List (with OR semantics)
      RestrictInfo
        CTID = ?   or ...
      (more items)

My last set of patches extended the tidqual extraction to pull out lists (with AND semantics) of range quals of the form CTID < ?, etc.  Each list of more than one item was converted into an AND clause before being added to the tidqual list; a single range qual can be added to tidquals as is.

This required looking across multiple RestrictInfos at the top level, for example:

  - "WHERE ctid > ? AND ctid < ?" would arrive at tidpath as a list of two RestrictInfos, from which we extract a single tidqual in the form of an AND clause.
  - "WHERE ctid = ? OR (ctid > ? AND ctid < ?)" arrives as only one RestrictInfo, but we extract two tidquals (an OpExpr, and an AND clause).

The code could also ignore additional unusable quals from a list of top-level RestrictInfos, or from a list of quals from an AND clause, for example:

  - "WHERE foo = ? AND ctid > ? AND ctid < ?" gives us the single tidqual "ctid > ? AND ctid < ?".
  - "WHERE (ctid = ? AND bar = ?) OR (foo = ? AND ctid > ? AND ctid < ?)" gives us the two tidquals "ctid = ?" and "ctid > ? AND ctid < ?".

As the extracted tidquals no longer match the original query quals, they aren't removed from scan_clauses in createplan.c, and hence are correctly checked by the filter.

Aside: The analogous situation with an indexed user attribute "x" behaves a bit differently:
  - "WHERE x = ? OR (x > ? AND x < ?)" won't use a regular index scan, but might use a bitmap index scan.

My patch uses the same path type and executor for all extractable tidquals.

This worked pretty well, but I am finding it difficult to reimplement it in the new tidpath.c code.

In the query information given to the path generator, there is no existing RestrictInfo relating to the whole expression "ctid > ? AND ctid < ?".  I am still learning about RestrictInfos, but my understanding is it doesn't make sense to have a RestrictInfo for an AND clause, anyway; you're supposed to have them for the sub-expressions of it.

And it doesn't seem a good idea to try to create new RestrictInfos in the path generation just to pass the tidquals back to plan creation.  They're complicated objects.

There's also the generation of scan_clauses in create_tidscan_plan (createplan.c:3107).  This now uses RestrictInfos -- I'd imagine we'd need each AND clause to be wrapped in a RestrictInfo to be able to check it properly.

To summarise, I'm not sure what kind of structure I should add to the tidquals list to represent a compound range expression.  Maybe it's better to create a different path (either a new path type, or a flag in TidPath to say what kind of quals are attached)?

Edmund


Re: Tid scan improvements

Tom Lane-2
Edmund Horner <[hidden email]> writes:
> My patch uses the same path type and executor for all extractable tidquals.

> This worked pretty well, but I am finding it difficult to reimplement it in
> the new tidpath.c code.

I didn't like that approach to begin with, and would suggest that you go
over to using a separate path type and executor node.  I don't think the
amount of commonality for the two cases is all that large, and doing it
as you had it required some ugly ad-hoc conventions about the semantics
of the tidquals list.  Where I think this should go is that the tidquals
list still has OR semantics in the existing path type, but you use AND
semantics in the new path type, so that "ctid > ? AND ctid < ?" is just
represented as an implicit-AND list of two simple RestrictInfos.

Now admittedly, this wouldn't give us an efficient way to execute
queries with conditions like "WHERE ctid = X OR (ctid > Y AND ctid < Z)",
but I find myself quite unable to get excited about supporting that.
I see no reason for the new code to worry about any cases more complex
than one or two TID inequalities at top level of the restriction list.

> In the query information given to the path generator, there is no existing
> RestrictInfo relating to the whole expression "ctid > ? AND ctid < ?".  I
> am still learning about RestrictInfos, but my understanding is it doesn't
> make sense to have a RestrictInfo for an AND clause, anyway; you're
> supposed to have them for the sub-expressions of it.

FWIW, the actual data structure for cases like that is that there's
a RestrictInfo for the whole clause ctid = X OR (ctid > Y AND ctid < Z),
and if you look into its "orclause" field, you will find RestrictInfos
attached to the primitive clauses ctid = X, ctid > Y, ctid < Z.  (The
old code in tidpath.c didn't know that, because it'd never been rewritten
since RestrictInfos were invented.)  However, I think this new code should
not worry about OR cases at all, but just pull out top-level TID
comparison clauses.

> And it doesn't seem a good idea to try to create new RestrictInfos in the
> path generation just to pass the tidquals back to plan creation.

No, you should avoid that.  There are places that assume there's only
one RestrictInfo for any given original clause (or sub-clause).

                        regards, tom lane


Re: Tid scan improvements

Edmund Horner
On Sat, 19 Jan 2019 at 05:35, Tom Lane <[hidden email]> wrote:
> Edmund Horner <[hidden email]> writes:
>> My patch uses the same path type and executor for all extractable tidquals.
>
>> This worked pretty well, but I am finding it difficult to reimplement it in
>> the new tidpath.c code.
>
> I didn't like that approach to begin with, and would suggest that you go
> over to using a separate path type and executor node.  I don't think the
> amount of commonality for the two cases is all that large, and doing it
> as you had it required some ugly ad-hoc conventions about the semantics
> of the tidquals list.  Where I think this should go is that the tidquals
> list still has OR semantics in the existing path type, but you use AND
> semantics in the new path type, so that "ctid > ? AND ctid < ?" is just
> represented as an implicit-AND list of two simple RestrictInfos.

Thanks for the advice.  This approach resembles my first draft, which had a separate executor type.  However, it did have a combined path type, with an enum TidPathMethod to determine how tidquals was interpreted.  At this point, I think a different path type is clearer, though generation of both types can live in tidpath.c (just as indxpath.c generates different index path types).
 
> Now admittedly, this wouldn't give us an efficient way to execute
> queries with conditions like "WHERE ctid = X OR (ctid > Y AND ctid < Z)",
> but I find myself quite unable to get excited about supporting that.
> I see no reason for the new code to worry about any cases more complex
> than one or two TID inequalities at top level of the restriction list.

I'm a bit sad to see support for multiple ranges go, though I never saw such queries as ever being particularly common.  (And there was always a nagging feeling that tidpath.c was beginning to perform feats of boolean acrobatics out of proportion to its importance.  Perhaps in some distant future, TID quals will become another way of supplying TIDs to a bitmap heap scan, which would enable complicated boolean queries using both indexes and TID scans.  But that's just musing, not a proposal.)

>> In the query information given to the path generator, there is no existing
>> RestrictInfo relating to the whole expression "ctid > ? AND ctid < ?".  I
>> am still learning about RestrictInfos, but my understanding is it doesn't
>> make sense to have a RestrictInfo for an AND clause, anyway; you're
>> supposed to have them for the sub-expressions of it.
>
> FWIW, the actual data structure for cases like that is that there's
> a RestrictInfo for the whole clause ctid = X OR (ctid > Y AND ctid < Z),
> and if you look into its "orclause" field, you will find RestrictInfos
> attached to the primitive clauses ctid = X, ctid > Y, ctid < Z.  (The
> old code in tidpath.c didn't know that, because it'd never been rewritten
> since RestrictInfos were invented.)  However, I think this new code should
> not worry about OR cases at all, but just pull out top-level TID
> comparison clauses.

Thanks for the explanation.

>> And it doesn't seem a good idea to try to create new RestrictInfos in the
>> path generation just to pass the tidquals back to plan creation.
>
> No, you should avoid that.  There are places that assume there's only
> one RestrictInfo for any given original clause (or sub-clause).
 