Tid scan improvements

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
93 messages Options
12345
Reply | Threaded
Open this post in threaded view
|

Re: Tid scan improvements

David Rowley
On Wed, 13 Jan 2021 at 15:38, Edmund Horner <[hidden email]> wrote:
> Thanks for updating the patch.  I'd be very happy if this got picked up again, and I'd certainly follow along and do some review.

Likewise here. I this patch was pretty close so it seems a shame to
let it slip through the cracks.

I spoke to Andres off-list about this patch. He mentioned that he
wasn't too keen on seeing the setscanlimits being baked into the table
AM API.  He mentioned that he'd rather not assume too much about table
AMs having all of their tids in a given range consecutively on a set
of pages.   That seems reasonable to me.  He suggested that we add a
new callback that just allows a range of ItemPointers to scan and
leave it up to the implementation to decide which pages should be
scanned to fetch the tuples within the given range.  I didn't argue. I
just went and coded it all, hopefully to Andres' description. The new
table AM callback is optional.

I've implemented this in the attached.

I also took the time to support backwards TID Range scans and added a
few more tests to test rescans.  I just found it annoying that TID
Scans supported backwards scans and TID Range scans did not.

The 0002 patch is the guts of it. The 0001 patch is an existing bug
that needs to be fixed before 0002 could go in (backwards TID Range
Scans are broken without this). I've posted separately about this bug
in [1]

I also didn't really like the idea of adding possibly lots of bool
fields to RelOptInfo to describe what the planner can do in regards to
what the given table AM supports.   I know that IndexOptInfo has such
a set of bool fields.  I'd rather not repeat that, so I just went with
a single int field named "amflags" and just added a single constant to
define a flag that specifies if the rel supports scanning ranges of
TIDs.

Edmund, will you get time to a look at this?

David

[1] https://www.postgresql.org/message-id/CAApHDvpGc9h0_oVD2CtgBcxCS1N-qDYZSeBRnUh+0CWJA9cMaA@...

v11-0001-Fix-hypothetical-bug-in-heap-backward-scans.patch (4K) Download Attachment
v11-0002-Add-TID-Range-Scans-to-support-efficient-scannin.patch (94K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Tid scan improvements

David Rowley
On Thu, 21 Jan 2021 at 18:16, David Rowley <[hidden email]> wrote:
> I've implemented this in the attached.

The bug fix in 0001 is now committed, so I'm just attaching the 0002
patch again after having rebased... This is mostly just to keep the
CFbot happy.

David

v12-0001-Add-TID-Range-Scans-to-support-efficient-scannin.patch (94K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Tid scan improvements

Zhihong Yu
Hi,

bq. within this range.  Table AMs where scanning ranges of TIDs does not make
sense or is difficult to implement efficiently may choose to not implement

Is there criterion on how to judge efficiency ?

+       if (tidopexpr->exprtype == TIDEXPR_LOWER_BOUND)
...
+       if (tidopexpr->exprtype == TIDEXPR_UPPER_BOUND)

The if statement for upper bound should be prefixed with 'else', right ?

+ * TidRecheck -- access method routine to recheck a tuple in EvalPlanQual
...
+TidRangeRecheck(TidRangeScanState *node, TupleTableSlot *slot)

The method name in the comment doesn't match the real method name.

+ *     ExecTidRangeScan(node)
...
+ExecTidRangeScan(PlanState *pstate)

Parameter names don't match.

Cheers

On Mon, Jan 25, 2021 at 5:23 PM David Rowley <[hidden email]> wrote:
On Thu, 21 Jan 2021 at 18:16, David Rowley <[hidden email]> wrote:
> I've implemented this in the attached.

The bug fix in 0001 is now committed, so I'm just attaching the 0002
patch again after having rebased... This is mostly just to keep the
CFbot happy.

David
Reply | Threaded
Open this post in threaded view
|

Re: Tid scan improvements

David Rowley
Thanks for having a look at this.

On Tue, 26 Jan 2021 at 15:48, Zhihong Yu <[hidden email]> wrote:
> bq. within this range.  Table AMs where scanning ranges of TIDs does not make
> sense or is difficult to implement efficiently may choose to not implement
>
> Is there criterion on how to judge efficiency ?

For example, if the AM had no way to index such a column and the
method needed to scan the entire table to find TIDs in that range. The
planner may as well just pick a SeqScan. If that's the case, then the
table AM may as well not bother implementing that function.

> +       if (tidopexpr->exprtype == TIDEXPR_LOWER_BOUND)
> ...
> +       if (tidopexpr->exprtype == TIDEXPR_UPPER_BOUND)
>
> The if statement for upper bound should be prefixed with 'else', right ?

Yeah, thanks.

> + * TidRecheck -- access method routine to recheck a tuple in EvalPlanQual
> ...
> +TidRangeRecheck(TidRangeScanState *node, TupleTableSlot *slot)
>
> The method name in the comment doesn't match the real method name.

Well noticed.

> + *     ExecTidRangeScan(node)
> ...
> +ExecTidRangeScan(PlanState *pstate)
>
> Parameter names don't match.

hmm. Looking around it seems there's lots of other places that do
this.  I think the the comment is really just indicating that the
function is taking an executor node state as a parameter.

Have a look at: git grep -E "\s\*.*\(node\)$"   that shows the other
places. Some of these have the parameter named "node", and many others
have some other name.

I've made the two changes locally. Since the two issues were mostly
cosmetic, I'll post an updated patch after some bigger changes are
required.

Thanks again for looking at this.

David


Reply | Threaded
Open this post in threaded view
|

Re: Tid scan improvements

Andres Freund
In reply to this post by David Rowley
Hi,

On 2021-01-26 14:22:42 +1300, David Rowley wrote:

> diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
> index 33bffb6815..d1c608b176 100644
> --- a/src/include/access/tableam.h
> +++ b/src/include/access/tableam.h
> @@ -325,6 +325,26 @@ typedef struct TableAmRoutine
>   ScanDirection direction,
>   TupleTableSlot *slot);
>  
> + /*
> + * Return next tuple from `scan` where TID is within the defined range.
> + * This behaves like scan_getnextslot but only returns tuples from the
> + * given range of TIDs.  Ranges are inclusive.  This function is optional
> + * and may be set to NULL if TID range scans are not supported by the AM.
> + *
> + * Implementations of this function must themselves handle ItemPointers
> + * of any value. i.e, they must handle each of the following:
> + *
> + * 1) mintid or maxtid is beyond the end of the table; and
> + * 2) mintid is above maxtid; and
> + * 3) item offset for mintid or maxtid is beyond the maximum offset
> + * allowed by the AM.
> + */
> + bool (*scan_getnextslot_inrange) (TableScanDesc scan,
> + ScanDirection direction,
> + TupleTableSlot *slot,
> + ItemPointer mintid,
> + ItemPointer maxtid);
> +
>  
>   /* ------------------------------------------------------------------------
>   * Parallel table scan related functions.
> @@ -1015,6 +1035,36 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS
>   return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
>  }
>  
> +/*
> + * Return next tuple from defined TID range from `scan` and store in slot.
> + */
> +static inline bool
> +table_scan_getnextslot_inrange(TableScanDesc sscan, ScanDirection direction,
> +   TupleTableSlot *slot, ItemPointer mintid,
> +   ItemPointer maxtid)
> +{
> + /*
> + * The planner should never make a plan which uses this function when the
> + * table AM has not defined any function for this callback.
> + */
> + Assert(sscan->rs_rd->rd_tableam->scan_getnextslot_inrange != NULL);
> +
> + slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
> +
> + /*
> + * We don't expect direct calls to table_scan_getnextslot_inrange with
> + * valid CheckXidAlive for catalog or regular tables.  See detailed
> + * comments in xact.c where these variables are declared.
> + */
> + if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
> + elog(ERROR, "unexpected table_scan_getnextslot_inrange call during logical decoding");
> +
> + return sscan->rs_rd->rd_tableam->scan_getnextslot_inrange(sscan,
> +  direction,
> +  slot,
> +  mintid,
> +  maxtid);
> +}


I don't really like that this API relies on mintid/maxtid to stay the
same across multiple scan_getnextslot_inrange() calls. I think we'd at
least need to document that that's required and what needs to be done to
scan a different set of min/max tid (or re-scan the same min/max from
scratch).

Perhaps something like

typedef struct TableScanTidRange TableScanTidRange;

TableScanTidRange* table_scan_tid_range_start(TableScanDesc sscan, ItemPointer mintid, ItemPointer maxtid);
bool table_scan_tid_range_nextslot(TableScanDesc sscan, TableScanTidRange *range, TupleTableSlot *slot);
void table_scan_tid_range_end(TableScanDesc sscan, TableScanTidRange* range);

would work better? That'd allow an AM to have arbitrarily large state
for a tid range scan, would make it clear what the lifetime of the
ItemPointer mintid, ItemPointer maxtid are etc.  Wouldn't, on the API
level, prevent multiple tid range scans from being in progress at the
same times though :(. Perhaps we could add a TableScanTidRange* pointer
to TableScanDesc which'd be checked/set by tableam.h which'd prevent that?

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: Tid scan improvements

David Rowley
Thanks for looking at this.

On Thu, 4 Feb 2021 at 10:19, Andres Freund <[hidden email]> wrote:

> Perhaps something like
>
> typedef struct TableScanTidRange TableScanTidRange;
>
> TableScanTidRange* table_scan_tid_range_start(TableScanDesc sscan, ItemPointer mintid, ItemPointer maxtid);
> bool table_scan_tid_range_nextslot(TableScanDesc sscan, TableScanTidRange *range, TupleTableSlot *slot);
> void table_scan_tid_range_end(TableScanDesc sscan, TableScanTidRange* range);
>
> would work better? That'd allow an AM to have arbitrarily large state
> for a tid range scan, would make it clear what the lifetime of the
> ItemPointer mintid, ItemPointer maxtid are etc.  Wouldn't, on the API
> level, prevent multiple tid range scans from being in progress at the
> same times though :(. Perhaps we could add a TableScanTidRange* pointer
> to TableScanDesc which'd be checked/set by tableam.h which'd prevent that?

Maybe the TableScanTidRange can just have a field to store the
TableScanDesc. That way table_scan_tid_range_nextslot and
table_scan_tid_range_end can just pass the TableScanTidRange pointer.

That way it seems like it would be ok for multiple scans to be going
on concurrently as nobody should be reusing the TableScanDesc.

Does that seem ok?

David


Reply | Threaded
Open this post in threaded view
|

Re: Tid scan improvements

David Rowley
On Thu, 4 Feb 2021 at 10:31, David Rowley <[hidden email]> wrote:

>
> Thanks for looking at this.
>
> On Thu, 4 Feb 2021 at 10:19, Andres Freund <[hidden email]> wrote:
> > Perhaps something like
> >
> > typedef struct TableScanTidRange TableScanTidRange;
> >
> > TableScanTidRange* table_scan_tid_range_start(TableScanDesc sscan, ItemPointer mintid, ItemPointer maxtid);
> > bool table_scan_tid_range_nextslot(TableScanDesc sscan, TableScanTidRange *range, TupleTableSlot *slot);
> > void table_scan_tid_range_end(TableScanDesc sscan, TableScanTidRange* range);
> >
> > would work better? That'd allow an AM to have arbitrarily large state
> > for a tid range scan, would make it clear what the lifetime of the
> > ItemPointer mintid, ItemPointer maxtid are etc.  Wouldn't, on the API
> > level, prevent multiple tid range scans from being in progress at the
> > same times though :(. Perhaps we could add a TableScanTidRange* pointer
> > to TableScanDesc which'd be checked/set by tableam.h which'd prevent that?
>
> Maybe the TableScanTidRange can just have a field to store the
> TableScanDesc. That way table_scan_tid_range_nextslot and
> table_scan_tid_range_end can just pass the TableScanTidRange pointer.
>
> That way it seems like it would be ok for multiple scans to be going
> on concurrently as nobody should be reusing the TableScanDesc.
I ended up adding just two new API functions to table AM.

void (*scan_set_tid_range) (TableScanDesc sscan,
   ItemPointer mintid,
   ItemPointer maxtid);

and
bool (*scan_tid_range_nextslot) (TableScanDesc sscan,
ScanDirection direction,
TupleTableSlot *slot);

I added an additional function in tableam.h that does not have a
corresponding API function:

static inline TableScanDesc
table_tid_range_start(Relation rel, Snapshot snapshot,
  ItemPointer mintid,
  ItemPointer maxtid)

This just calls the standard scan_begin then calls scan_set_tid_range
setting the specified mintid and maxtid.

I also added 2 new fields to TableScanDesc:

ItemPointerData rs_mintid;
ItemPointerData rs_maxtid;

I didn't quite see a need to have a new start and end scan API function.

Updated patch attached.

David

v13-0001-Add-TID-Range-Scans-to-support-efficient-scannin.patch (98K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Tid scan improvements

David Rowley
On Thu, 4 Feb 2021 at 23:51, David Rowley <[hidden email]> wrote:
> Updated patch attached.

I made another pass over this patch and did a bit of renaming work
around the heap_* functions and the tableam functions. I think the new
names are a bit more aligned to the existing names.

I don't really see anything else that I'm unhappy with about this
patch, so pending any objections or last-minute reviews, I plan to
push it later this week.

David

v14-0001-Add-TID-Range-Scans-to-support-efficient-scannin.patch (99K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Tid scan improvements

David Fetter
On Tue, Feb 16, 2021 at 10:22:41PM +1300, David Rowley wrote:
> On Thu, 4 Feb 2021 at 23:51, David Rowley <[hidden email]> wrote:
> > Updated patch attached.
>
> I made another pass over this patch and did a bit of renaming work
> around the heap_* functions and the tableam functions. I think the new
> names are a bit more aligned to the existing names.

Thanks! I'm looking forward to making use of this :)

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Reply | Threaded
Open this post in threaded view
|

Re: Tid scan improvements

Andres Freund
In reply to this post by David Rowley
Hi,

On 2021-02-04 23:51:39 +1300, David Rowley wrote:

> I ended up adding just two new API functions to table AM.
>
> void (*scan_set_tid_range) (TableScanDesc sscan,
>    ItemPointer mintid,
>    ItemPointer maxtid);
>
> and
> bool (*scan_tid_range_nextslot) (TableScanDesc sscan,
> ScanDirection direction,
> TupleTableSlot *slot);
>
> I added an additional function in tableam.h that does not have a
> corresponding API function:
>
> static inline TableScanDesc
> table_tid_range_start(Relation rel, Snapshot snapshot,
>   ItemPointer mintid,
>   ItemPointer maxtid)
>
> This just calls the standard scan_begin then calls scan_set_tid_range
> setting the specified mintid and maxtid.

Hm. But that means we can't rescan?


> I also added 2 new fields to TableScanDesc:
>
> ItemPointerData rs_mintid;
> ItemPointerData rs_maxtid;
>
> I didn't quite see a need to have a new start and end scan API function.

Yea. I guess they're not that large. Avoiding that was one of the two
reasons to have a separate scan state somewhere. The other that it
seemed like it'd possibly a bit cleaner API wise to deal with rescan.


> +bool
> +heap_getnextslot_tidrange(TableScanDesc sscan, ScanDirection direction,
> +  TupleTableSlot *slot)
> +{
> + HeapScanDesc scan = (HeapScanDesc) sscan;
> + ItemPointer mintid = &sscan->rs_mintid;
> + ItemPointer maxtid = &sscan->rs_maxtid;
> +
> + /* Note: no locking manipulations needed */
> + for (;;)
> + {
> + if (sscan->rs_flags & SO_ALLOW_PAGEMODE)
> + heapgettup_pagemode(scan, direction, sscan->rs_nkeys, sscan->rs_key);
> + else
> + heapgettup(scan, direction, sscan->rs_nkeys, sscan->rs_key);
> +
> + if (scan->rs_ctup.t_data == NULL)
> + {
> + ExecClearTuple(slot);
> + return false;
> + }
> +
> + /*
> + * heap_set_tidrange will have used heap_setscanlimits to limit the
> + * range of pages we scan to only ones that can contain the TID range
> + * we're scanning for.  Here we must filter out any tuples from these
> + * pages that are outwith that range.
> + */
> + if (ItemPointerCompare(&scan->rs_ctup.t_self, mintid) < 0)
> + {
> + ExecClearTuple(slot);
> +
> + /*
> + * When scanning backwards, the TIDs will be in descending order.
> + * Future tuples in this direction will be lower still, so we can
> + * just return false to indicate there will be no more tuples.
> + */
> + if (ScanDirectionIsBackward(direction))
> + return false;
> +
> + continue;
> + }
> +
> + /*
> + * Likewise for the final page, we must filter out TIDs greater than
> + * maxtid.
> + */
> + if (ItemPointerCompare(&scan->rs_ctup.t_self, maxtid) > 0)
> + {
> + ExecClearTuple(slot);
> +
> + /*
> + * When scanning forward, the TIDs will be in ascending order.
> + * Future tuples in this direction will be higher still, so we can
> + * just return false to indicate there will be no more tuples.
> + */
> + if (ScanDirectionIsForward(direction))
> + return false;
> + continue;
> + }
> +
> + break;
> + }
> +
> + /*
> + * if we get here it means we have a new current scan tuple, so point to
> + * the proper return buffer and return the tuple.
> + */
> + pgstat_count_heap_getnext(scan->rs_base.rs_rd);

I wonder if there's an argument for counting the misses above via
pgstat_count_heap_fetch()? Probably not, right?


> +#define IsCTIDVar(node)  \
> + ((node) != NULL && \
> + IsA((node), Var) && \
> + ((Var *) (node))->varattno == SelfItemPointerAttributeNumber && \
> + ((Var *) (node))->varlevelsup == 0)
> +
> +typedef enum
> +{
> + TIDEXPR_UPPER_BOUND,
> + TIDEXPR_LOWER_BOUND
> +} TidExprType;
> +
> +/* Upper or lower range bound for scan */
> +typedef struct TidOpExpr
> +{
> + TidExprType exprtype; /* type of op; lower or upper */
> + ExprState  *exprstate; /* ExprState for a TID-yielding subexpr */
> + bool inclusive; /* whether op is inclusive */
> +} TidOpExpr;
> +
> +/*
> + * For the given 'expr', build and return an appropriate TidOpExpr taking into
> + * account the expr's operator and operand order.
> + */
> +static TidOpExpr *
> +MakeTidOpExpr(OpExpr *expr, TidRangeScanState *tidstate)
> +{
> + Node   *arg1 = get_leftop((Expr *) expr);
> + Node   *arg2 = get_rightop((Expr *) expr);
> + ExprState  *exprstate = NULL;
> + bool invert = false;
> + TidOpExpr  *tidopexpr;
> +
> + if (IsCTIDVar(arg1))
> + exprstate = ExecInitExpr((Expr *) arg2, &tidstate->ss.ps);
> + else if (IsCTIDVar(arg2))
> + {
> + exprstate = ExecInitExpr((Expr *) arg1, &tidstate->ss.ps);
> + invert = true;
> + }
> + else
> + elog(ERROR, "could not identify CTID variable");
> +
> + tidopexpr = (TidOpExpr *) palloc(sizeof(TidOpExpr));
> + tidopexpr->inclusive = false; /* for now */
> +
> + switch (expr->opno)
> + {
> + case TIDLessEqOperator:
> + tidopexpr->inclusive = true;
> + /* fall through */
> + case TIDLessOperator:
> + tidopexpr->exprtype = invert ? TIDEXPR_LOWER_BOUND : TIDEXPR_UPPER_BOUND;
> + break;
> + case TIDGreaterEqOperator:
> + tidopexpr->inclusive = true;
> + /* fall through */
> + case TIDGreaterOperator:
> + tidopexpr->exprtype = invert ? TIDEXPR_UPPER_BOUND : TIDEXPR_LOWER_BOUND;
> + break;
> + default:
> + elog(ERROR, "could not identify CTID operator");
> + }
> +
> + tidopexpr->exprstate = exprstate;
> +
> + return tidopexpr;
> +}
> +
> +/*
> + * Extract the qual subexpressions that yield TIDs to search for,
> + * and compile them into ExprStates if they're ordinary expressions.
> + */
> +static void
> +TidExprListCreate(TidRangeScanState *tidrangestate)
> +{
> + TidRangeScan *node = (TidRangeScan *) tidrangestate->ss.ps.plan;
> + List   *tidexprs = NIL;
> + ListCell   *l;
> +
> + foreach(l, node->tidrangequals)
> + {
> + OpExpr   *opexpr = lfirst(l);
> + TidOpExpr  *tidopexpr;
> +
> + if (!IsA(opexpr, OpExpr))
> + elog(ERROR, "could not identify CTID expression");
> +
> + tidopexpr = MakeTidOpExpr(opexpr, tidrangestate);
> + tidexprs = lappend(tidexprs, tidopexpr);
> + }
> +
> + tidrangestate->trss_tidexprs = tidexprs;
> +}

Architecturally it feels like this is something that really belongs more
into plan time?


> +/*
> + * table_set_tidrange resets the minimum and maximum TID range to scan for a
> + * TableScanDesc created by table_beginscan_tidrange.
> + */
> +static inline void
> +table_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
> +   ItemPointer maxtid)
> +{
> + /* Ensure table_beginscan_tidrange() was used. */
> + Assert((sscan->rs_flags & SO_TYPE_TIDRANGESCAN) != 0);
> +
> + sscan->rs_rd->rd_tableam->scan_set_tidrange(sscan, mintid, maxtid);
> +}

How does this interact with rescans?

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: Tid scan improvements

David Rowley
Thanks for having a look at this.

On Wed, 17 Feb 2021 at 11:05, Andres Freund <[hidden email]> wrote:

>
> On 2021-02-04 23:51:39 +1300, David Rowley wrote:
> > and
> > bool (*scan_tid_range_nextslot) (TableScanDesc sscan,
> > ScanDirection direction,
> > TupleTableSlot *slot);
> >
> > I added an additional function in tableam.h that does not have a
> > corresponding API function:
> >
> > static inline TableScanDesc
> > table_tid_range_start(Relation rel, Snapshot snapshot,
> >   ItemPointer mintid,
> >   ItemPointer maxtid)
> >
> > This just calls the standard scan_begin then calls scan_set_tid_range
> > setting the specified mintid and maxtid.
>
> Hm. But that means we can't rescan?

It might not be perfect, but to rescan, we must call table_rescan()
then table_set_tidrange() before calling the
table_scan_getnextslot_tidrange() function.

> > +bool
> > +heap_getnextslot_tidrange(TableScanDesc sscan, ScanDirection direction,
> > +                                               TupleTableSlot *slot)
> > +{
> > +     HeapScanDesc scan = (HeapScanDesc) sscan;
> > +     ItemPointer mintid = &sscan->rs_mintid;
> > +     ItemPointer maxtid = &sscan->rs_maxtid;
> > +
> > +     /* Note: no locking manipulations needed */
> > +     for (;;)
> > +     {
> > +             if (sscan->rs_flags & SO_ALLOW_PAGEMODE)
> > +                     heapgettup_pagemode(scan, direction, sscan->rs_nkeys, sscan->rs_key);
> > +             else
> > +                     heapgettup(scan, direction, sscan->rs_nkeys, sscan->rs_key);
> > +
> > +             if (scan->rs_ctup.t_data == NULL)
> > +             {
> > +                     ExecClearTuple(slot);
> > +                     return false;
> > +             }
> > +
> > +             /*
> > +              * heap_set_tidrange will have used heap_setscanlimits to limit the
> > +              * range of pages we scan to only ones that can contain the TID range
> > +              * we're scanning for.  Here we must filter out any tuples from these
> > +              * pages that are outwith that range.
> > +              */
> > +             if (ItemPointerCompare(&scan->rs_ctup.t_self, mintid) < 0)
> > +             {
> > +                     ExecClearTuple(slot);
> > +
> > +                     /*
> > +                      * When scanning backwards, the TIDs will be in descending order.
> > +                      * Future tuples in this direction will be lower still, so we can
> > +                      * just return false to indicate there will be no more tuples.
> > +                      */
> > +                     if (ScanDirectionIsBackward(direction))
> > +                             return false;
> > +
> > +                     continue;
> > +             }
> > +
> > +             /*
> > +              * Likewise for the final page, we must filter out TIDs greater than
> > +              * maxtid.
> > +              */
> > +             if (ItemPointerCompare(&scan->rs_ctup.t_self, maxtid) > 0)
> > +             {
> > +                     ExecClearTuple(slot);
> > +
> > +                     /*
> > +                      * When scanning forward, the TIDs will be in ascending order.
> > +                      * Future tuples in this direction will be higher still, so we can
> > +                      * just return false to indicate there will be no more tuples.
> > +                      */
> > +                     if (ScanDirectionIsForward(direction))
> > +                             return false;
> > +                     continue;
> > +             }
> > +
> > +             break;
> > +     }
> > +
> > +     /*
> > +      * if we get here it means we have a new current scan tuple, so point to
> > +      * the proper return buffer and return the tuple.
> > +      */
> > +     pgstat_count_heap_getnext(scan->rs_base.rs_rd);
>
> I wonder if there's an argument for counting the misses above via
> pgstat_count_heap_fetch()? Probably not, right?

I'm a bit undecided about that. In theory, we're doing the heap
fetches of tuples on the target page which are outside of the range so
we should maybe count them. On the other hand, it might be a little
confusing for very observant users if they see the fetches going up
for the tuples we skip over in TID Range scans.

> > +#define IsCTIDVar(node)  \
> > +     ((node) != NULL && \
> > +      IsA((node), Var) && \
> > +      ((Var *) (node))->varattno == SelfItemPointerAttributeNumber && \
> > +      ((Var *) (node))->varlevelsup == 0)
> > +
> > +typedef enum
> > +{
> > +     TIDEXPR_UPPER_BOUND,
> > +     TIDEXPR_LOWER_BOUND
> > +} TidExprType;
> > +
> > +/* Upper or lower range bound for scan */
> > +typedef struct TidOpExpr
> > +{
> > +     TidExprType exprtype;           /* type of op; lower or upper */
> > +     ExprState  *exprstate;          /* ExprState for a TID-yielding subexpr */
> > +     bool            inclusive;              /* whether op is inclusive */
> > +} TidOpExpr;
> > +
> > +/*
> > + * For the given 'expr', build and return an appropriate TidOpExpr taking into
> > + * account the expr's operator and operand order.
> > + */
> > +static TidOpExpr *
> > +MakeTidOpExpr(OpExpr *expr, TidRangeScanState *tidstate)
> > +{
> > +     Node       *arg1 = get_leftop((Expr *) expr);
> > +     Node       *arg2 = get_rightop((Expr *) expr);
> > +     ExprState  *exprstate = NULL;
> > +     bool            invert = false;
> > +     TidOpExpr  *tidopexpr;
> > +
> > +     if (IsCTIDVar(arg1))
> > +             exprstate = ExecInitExpr((Expr *) arg2, &tidstate->ss.ps);
> > +     else if (IsCTIDVar(arg2))
> > +     {
> > +             exprstate = ExecInitExpr((Expr *) arg1, &tidstate->ss.ps);
> > +             invert = true;
> > +     }
> > +     else
> > +             elog(ERROR, "could not identify CTID variable");
> > +
> > +     tidopexpr = (TidOpExpr *) palloc(sizeof(TidOpExpr));
> > +     tidopexpr->inclusive = false;   /* for now */
> > +
> > +     switch (expr->opno)
> > +     {
> > +             case TIDLessEqOperator:
> > +                     tidopexpr->inclusive = true;
> > +                     /* fall through */
> > +             case TIDLessOperator:
> > +                     tidopexpr->exprtype = invert ? TIDEXPR_LOWER_BOUND : TIDEXPR_UPPER_BOUND;
> > +                     break;
> > +             case TIDGreaterEqOperator:
> > +                     tidopexpr->inclusive = true;
> > +                     /* fall through */
> > +             case TIDGreaterOperator:
> > +                     tidopexpr->exprtype = invert ? TIDEXPR_UPPER_BOUND : TIDEXPR_LOWER_BOUND;
> > +                     break;
> > +             default:
> > +                     elog(ERROR, "could not identify CTID operator");
> > +     }
> > +
> > +     tidopexpr->exprstate = exprstate;
> > +
> > +     return tidopexpr;
> > +}
> > +
> > +/*
> > + * Extract the qual subexpressions that yield TIDs to search for,
> > + * and compile them into ExprStates if they're ordinary expressions.
> > + */
> > +static void
> > +TidExprListCreate(TidRangeScanState *tidrangestate)
> > +{
> > +     TidRangeScan *node = (TidRangeScan *) tidrangestate->ss.ps.plan;
> > +     List       *tidexprs = NIL;
> > +     ListCell   *l;
> > +
> > +     foreach(l, node->tidrangequals)
> > +     {
> > +             OpExpr     *opexpr = lfirst(l);
> > +             TidOpExpr  *tidopexpr;
> > +
> > +             if (!IsA(opexpr, OpExpr))
> > +                     elog(ERROR, "could not identify CTID expression");
> > +
> > +             tidopexpr = MakeTidOpExpr(opexpr, tidrangestate);
> > +             tidexprs = lappend(tidexprs, tidopexpr);
> > +     }
> > +
> > +     tidrangestate->trss_tidexprs = tidexprs;
> > +}
>
> Architecturally it feels like this is something that really belongs more
> into plan time?

Possibly. It would mean TidOpExpr would have to become a Node type.
TID Range scan is really just following the lead set by TID Scan here.

I'm not sure if it's worth the trouble making these Node types for the
small gains there'd be in the performance of having the planner make
them once for prepared queries rather than them having to be built
during each execution.

Do you think it is?

> > +/*
> > + * table_set_tidrange resets the minimum and maximum TID range to scan for a
> > + * TableScanDesc created by table_beginscan_tidrange.
> > + */
> > +static inline void
> > +table_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
> > +                                ItemPointer maxtid)
> > +{
> > +     /* Ensure table_beginscan_tidrange() was used. */
> > +     Assert((sscan->rs_flags & SO_TYPE_TIDRANGESCAN) != 0);
> > +
> > +     sscan->rs_rd->rd_tableam->scan_set_tidrange(sscan, mintid, maxtid);
> > +}
>
> How does this interact with rescans?

We must call table_rescan() before calling table_set_tidrange() again.
That perhaps could be documented better. I'm just unsure if that
should be documented in tableam.h or if it's a restriction that only
needs to exist in heapam.c

David


Reply | Threaded
Open this post in threaded view
|

Re: Tid scan improvements

David Rowley
On Thu, 18 Feb 2021 at 09:45, David Rowley <[hidden email]> wrote:

>
> On Wed, 17 Feb 2021 at 11:05, Andres Freund <[hidden email]> wrote:
> > Architecturally it feels like this is something that really belongs more
> > into plan time?
>
> Possibly. It would mean TidOpExpr would have to become a Node type.
> TID Range scan is really just following the lead set by TID Scan here.
>
> I'm not sure if it's worth the trouble making these Node types for the
> small gains there'd be in the performance of having the planner make
> them once for prepared queries rather than them having to be built
> during each execution.
I changed the code around and added a new Node type to the planner and
made it create the TidRangeExpr during planning.

However, I'm pretty much set on this being pretty horrible and I ended
up ripping it back out again.  The reason is that there's quite a bit
of extra boilerplate code that goes with the new node type. e.g it
must be handled in setrefs.c.  EXPLAIN also needs to know about the
new Node. That either means teaching the deparse code about
TidRangeExprs or having the Plan node carry along the OpExprs just so
we can make EXPLAIN work.   Translating between the two might be
possible but it just seemed too much code and I started feeling pretty
bad about the whole idea.

> > > +/*
> > > + * table_set_tidrange resets the minimum and maximum TID range to scan for a
> > > + * TableScanDesc created by table_beginscan_tidrange.
> > > + */
> > > +static inline void
> > > +table_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
> > > +                                ItemPointer maxtid)
> > > +{
> > > +     /* Ensure table_beginscan_tidrange() was used. */
> > > +     Assert((sscan->rs_flags & SO_TYPE_TIDRANGESCAN) != 0);
> > > +
> > > +     sscan->rs_rd->rd_tableam->scan_set_tidrange(sscan, mintid, maxtid);
> > > +}
> >
> > How does this interact with rescans?
>
> We must call table_rescan() before calling table_set_tidrange() again.
> That perhaps could be documented better. I'm just unsure if that
> should be documented in tableam.h or if it's a restriction that only
> needs to exist in heapam.c
I've changed things around so that we no longer explicitly call
table_rescan() in nodeTidrangescan.c. Instead table_set_tidrange()
does a rescan call. I also adjusted the documentation to mention that
changing the tid range starts the scan again.  This does mean we'll do
a ->scan_rescan() the first time we do table_set_tidrange(). I'm not
all that sure that matters.

v15 attached.

David

v15-0001-Add-TID-Range-Scans-to-support-efficient-scannin.patch (99K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Tid scan improvements

David Rowley
On Fri, 19 Feb 2021 at 20:37, David Rowley <[hidden email]> wrote:

>
> On Thu, 18 Feb 2021 at 09:45, David Rowley <[hidden email]> wrote:
> >
> > On Wed, 17 Feb 2021 at 11:05, Andres Freund <[hidden email]> wrote:
> > > How does this interact with rescans?
> >
> > We must call table_rescan() before calling table_set_tidrange() again.
> > That perhaps could be documented better. I'm just unsure if that
> > should be documented in tableam.h or if it's a restriction that only
> > needs to exist in heapam.c
>
> I've changed things around so that we no longer explicitly call
> table_rescan() in nodeTidrangescan.c. Instead table_set_tidrange()
> does a rescan call. I also adjusted the documentation to mention that
> changing the tid range starts the scan again.  This does mean we'll do
> a ->scan_rescan() the first time we do table_set_tidrange(). I'm not
> all that sure that matters.

I've pushed this now. I did end up changing the function name in
tableam.h so that we no longer expose the table_set_tidrange().
Instead, the range is set by either table_beginscan_tidrange() or
table_rescan_tidrange().  There's no need to worry about what would
happen if someone were to change the TID range mid-scan.

Apart from that, I adjusted a few comments and changed the regression
tests a little to get rid of the tidrangescan_empty table. This was
created to ensure empty tables work correctly. Instead, I just did
those tests before populating the tidrangescan table.  This just makes
the test run a little faster since we're creating and dropping 1 less
table.

David


12345