Tid scan improvements


Tid scan improvements

Edmund Horner
Hello,

To scratch an itch, I have been working on teaching TidScan how to do
range queries, i.e. those using >=, <, BETWEEN, etc.  This means we
can write, for instance,

    SELECT * FROM t WHERE ctid >= '(1000,0)' AND ctid < '(2000,0)';

instead of resorting to the old trick:

    SELECT * FROM t WHERE ctid = ANY (ARRAY(SELECT format('(%s,%s)', i, j)::tid
    FROM generate_series(1000,1999) AS gs(i), generate_series(1,200)
AS gs2(j)));

where "200" is some guess at how many tuples can fit on a page for that table.

There's some previous discussion about this at
https://www.postgresql.org/message-id/flat/CAHyXU0zJhg_5RtxKnNbAK%3D4ZzQEFUFi%2B52RjpLrxtkRTD6CDFw%40mail.gmail.com#3ba2c3a6be217f40130655a3250d80a4
.

Since range scan execution is rather different from the existing
TidScan execution, I ended up making a new plan type, TidRangeScan.
There is still only one TidPath, but it has an additional member that
describes which method to use.

As part of the work I also taught TidScan that its results are ordered
by ctid, i.e. to set a pathkey on a TidPath.  The benefit of this is
that queries such as

    SELECT MAX(ctid) FROM t;
    SELECT * FROM t WHERE ctid IN (...) ORDER BY ctid;

are now planned a bit more efficiently.  Execution was already
returning tuples in ascending ctid order; I just had to add support
for descending order.

Attached are a couple of patches:
  - 01_tid_scan_ordering.patch
  - 02_tid_range_scan.patch, to be applied on top of 01.

Can I add this to the next CommitFest?

Obviously the whole thing needs thorough review, and I expect there to
be numerous problems.  (I had to make this prototype to demonstrate to
myself that it wasn't completely beyond me.  I know from experience
how easy it is to enthusiastically volunteer something for an open
source project, discover that one does not have the time or skill
required, and be too embarrassed to show one's face again!)

As well as actual correctness, some aspects that I am particularly
unsure about include:

  - Is it messy to use TidPath for both types of scan?
  - What is the planning cost for plans that don't end up being a
TidScan or TidRangeScan?
  - Have I put the various helper functions in the right files?
  - Is there a less brittle way to create tables of a specific number
of blocks/tuples in the regression tests?
  - Have I got the ScanDirection right during execution?
  - Are my changes to heapam ok?

Cheers,
Edmund

01_tid_scan_ordering.patch (22K)
02_tid_range_scan.patch (86K)

Re: Tid scan improvements

David Rowley-3
On 12 August 2018 at 14:29, Edmund Horner <[hidden email]> wrote:
> To scratch an itch, I have been working on teaching TidScan how to do
> range queries, i.e. those using >=, <, BETWEEN, etc.  This means we
> can write, for instance,
>
>     SELECT * FROM t WHERE ctid >= '(1000,0)' AND ctid < '(2000,0)';

I think this will be useful to UPDATE records at the end of a bloated
table to move them into space that's been freed up by vacuum to allow
the table to be trimmed back to size again.

> Since range scan execution is rather different from the existing
> TidScan execution, I ended up making a new plan type, TidRangeScan.
> There is still only one TidPath, but it has an additional member that
> describes which method to use.

I always thought that this would be implemented by overloading
TidScan.  I thought that TidListEval() could be modified to remove
duplicates accounting for range scans. For example:

SELECT * FROM t WHERE ctid BETWEEN '(0,1)' AND '(0,10)' OR ctid
IN('(0,5)','(0,30)');

would first sort all the tids along with their operator and then make
a pass over the sorted array to remove any equality ctids that are
redundant because they're covered in a range.

> As part of the work I also taught TidScan that its results are ordered
> by ctid, i.e. to set a pathkey on a TidPath.  The benefit of this is
> that queries such as
>
>     SELECT MAX(ctid) FROM t;
>     SELECT * FROM t WHERE ctid IN (...) ORDER BY ctid;

I think that can be done as I see you're passing allow_sync as false
in heap_beginscan_strat(), so the scan will start at the beginning of
the heap.

> Attached are a couple of patches:
>   - 01_tid_scan_ordering.patch
>   - 02_tid_range_scan.patch, to be applied on top of 01.
>
> Can I add this to the next CommitFest?

Please do.

> As well as actual correctness, some aspects that I am particularly
> unsure about include:
>
>   - Is it messy to use TidPath for both types of scan?

I wonder if there is a good reason to have a separate node type at
all? I've not looked, but if you've managed to overload the TidPath
struct without it getting out of control, then perhaps the same can be
done with the node type.

>   - What is the planning cost for plans that don't end up being a
> TidScan or TidRangeScan?

I suppose that wouldn't matter if there was just 1 path for a single node type.

>   - Is there a less brittle way to create tables of a specific number
> of blocks/tuples in the regression tests?

Perhaps you could just populate a table with some number of records
then DELETE the ones above ctid (x,100) on each page, where 100 is
whatever you can be certain will fit on a page on any platform. I'm
not quite sure if our regress test would pass with a very small block
size anyway, but probably worth verifying that before you write the
first test that will break it.

I'll try to look in a bit more detail during the commitfest.

It's perhaps a minor detail at this stage, but generally, we don't
have code lines over 80 chars in length. There are some exceptions,
e.g. not breaking error message strings so that they're easily
greppable.  src/tools/pgindent has a tool that you can run to fix the
whitespace so it's in line with project standard.

Thanks for working on this. It will be great to see improvements made in this area.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Tid scan improvements

Robert Haas
On Sun, Aug 12, 2018 at 8:07 AM, David Rowley
<[hidden email]> wrote:
> Thanks for working on this. It will be great to see improvements made in this area.

+1.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Tid scan improvements

Edmund Horner
In reply to this post by David Rowley-3
On 12 August 2018 at 20:07, David Rowley <[hidden email]> wrote:

>> Since range scan execution is rather different from the existing
>> TidScan execution, I ended up making a new plan type, TidRangeScan.
>> There is still only one TidPath, but it has an additional member that
>> describes which method to use.
>
> I always thought that this would be implemented by overloading
> TidScan.  I thought that TidListEval() could be modified to remove
> duplicates accounting for range scans. For example:
>
> SELECT * FROM t WHERE ctid BETWEEN '(0,1)' AND '(0,10)' OR ctid
> IN('(0,5)','(0,30)');
>
> would first sort all the tids along with their operator and then make
> a pass over the sorted array to remove any equality ctids that are
> redundant because they're covered in a range.

Initially, I figured that 99% of the time, the user either wants to
filter by a specific set of ctids (such as those returned by a
subquery), or wants to process a table in blocks.  Picking up an
OR-set of ctid-conditions and determining which parts should be picked
up row-wise (as in the existing code) versus which parts are blocks
that should be scanned -- and ensuring that any overlaps were removed
-- seemed more complicated than it was worth.

Having thought about it, I think what you propose might be worth it;
at least it limits us to a single TidScan plan to maintain.

The existing code:
  - Looks for a qual that's an OR-list of (ctid = ?) or (ctid IN (?))
  - Costs it by assuming each matching tuple is a separate page.
  - When beginning the scan, evaluates all the ?s and builds an array
of tids to fetch.
  - Sorts and remove duplicates.
  - Iterates over the array, fetching tuples.

So we'd extend that to:
  - Include in the OR-list "range" subquals of the form (ctid > ? AND
ctid < ?) (either side could be optional, and we have to deal with >=
and <= and having ctid on the rhs, etc.).
  - Cost the range subquals by assuming they don't overlap, and
estimating how many blocks and tuples they span.
  - When beginning the scan, evaluate all the ?s and build an array of
"tid ranges" to fetch.  A tid range is a struct with a starting tid,
and an ending tid, and might just be a single tid item.
  - Sort and remove duplicates.
  - Iterate over the array, using a single fetch for single-item tid
ranges, and starting/ending a heap scan for multi-item tid ranges.

I think I'll try implementing this.
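As a standalone sketch of the sort-and-merge step described above (the TidRange name matches the description, but the flattened single-integer tid representation and the merge_tid_ranges helper are made up for illustration; the real struct would hold two ItemPointers):

```c
#include <assert.h>
#include <stdlib.h>

/*
 * A tid range: a starting tid and an ending tid.  For illustration each
 * tid is flattened into a single integer so comparisons are trivial; a
 * real ItemPointerData is a (block, offset) pair.
 */
typedef struct TidRange
{
    unsigned long start;    /* first tid, inclusive */
    unsigned long end;      /* last tid, inclusive; == start for a single tid */
} TidRange;

static int
tidrange_cmp(const void *a, const void *b)
{
    const TidRange *ra = a;
    const TidRange *rb = b;

    if (ra->start != rb->start)
        return ra->start < rb->start ? -1 : 1;
    return 0;
}

/*
 * Sort the ranges, then merge overlapping or adjacent ones in place,
 * which also removes single tids already covered by a range.
 * Returns the new number of ranges.
 */
static int
merge_tid_ranges(TidRange *r, int n)
{
    int         out = 0;

    qsort(r, n, sizeof(TidRange), tidrange_cmp);
    for (int i = 0; i < n; i++)
    {
        if (out > 0 && r[i].start <= r[out - 1].end + 1)
        {
            /* Overlaps or touches the previous kept range: extend it. */
            if (r[i].end > r[out - 1].end)
                r[out - 1].end = r[i].end;
        }
        else
            r[out++] = r[i];    /* disjoint: start a new range */
    }
    return out;
}
```

After this pass the executor would walk the surviving ranges in order, doing a single fetch when start == end and a bounded heap scan otherwise.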

>> As part of the work I also taught TidScan that its results are ordered
>> by ctid, i.e. to set a pathkey on a TidPath.  The benefit of this is
>> that queries such as
>>
>>     SELECT MAX(ctid) FROM t;
>>     SELECT * FROM t WHERE ctid IN (...) ORDER BY ctid;
>
> I think that can be done as I see you're passing allow_sync as false
> in heap_beginscan_strat(), so the scan will start at the beginning of
> the heap.

I found that heap scan caters to parallel scans, synchronised scans,
and block range indexing; but it didn't quite work for my case of
specifying a subset of a table and scanning backward or forward over
it.  Hence my changes.  I'm not overly familiar with the heap scan
code though.

>>   - Is there a less brittle way to create tables of a specific number
>> of blocks/tuples in the regression tests?
>
> Perhaps you could just populate a table with some number of records
> then DELETE the ones above ctid (x,100) on each page, where 100 is
> whatever you can be certain will fit on a page on any platform. I'm
> not quite sure if our regress test would pass with a very small block
> size anyway, but probably worth verifying that before you write the
> first test that will break it.

I don't think I've tested with extreme block sizes.

> I'll try to look in a bit more detail during the commitfest.
>
> It's perhaps a minor detail at this stage, but generally, we don't
> have code lines over 80 chars in length. There are some exceptions,
> e.g. not breaking error message strings so that they're easily
> greppable.  src/tools/pgindent has a tool that you can run to fix the
> whitespace so it's in line with project standard.

I'll try to get pgindent running before my next patch.

Thanks for the comments!


Re: Tid scan improvements

David Rowley-3
On 15 August 2018 at 11:11, Edmund Horner <[hidden email]> wrote:
> So we'd extend that to:
>   - Include in the OR-list "range" subquals of the form (ctid > ? AND
> ctid < ?) (either side could be optional, and we have to deal with >=
> and <= and having ctid on the rhs, etc.).
>   - Cost the range subquals by assuming they don't overlap, and
> estimating how many blocks and tuples they span.
>   - When beginning the scan, evaluate all the ?s and build an array of
> "tid ranges" to fetch.  A tid range is a struct with a starting tid,
> and an ending tid, and might just be a single tid item.
>   - Sort and remove duplicates.
>   - Iterate over the array, using a single fetch for single-item tid
> ranges, and starting/ending a heap scan for multi-item tid ranges.
>
> I think I'll try implementing this.

I've set this patch as waiting on author in the commitfest app.


--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Tid scan improvements

Edmund Horner
On Mon, 17 Sep 2018 at 23:21, David Rowley <[hidden email]> wrote:

> On 15 August 2018 at 11:11, Edmund Horner <[hidden email]> wrote:
>> So we'd extend that to:
>>   - Include in the OR-list "range" subquals of the form (ctid > ? AND
>> ctid < ?) (either side could be optional, and we have to deal with >=
>> and <= and having ctid on the rhs, etc.).
>>   - Cost the range subquals by assuming they don't overlap, and
>> estimating how many blocks and tuples they span.
>>   - When beginning the scan, evaluate all the ?s and build an array of
>> "tid ranges" to fetch.  A tid range is a struct with a starting tid,
>> and an ending tid, and might just be a single tid item.
>>   - Sort and remove duplicates.
>>   - Iterate over the array, using a single fetch for single-item tid
>> ranges, and starting/ending a heap scan for multi-item tid ranges.
>>
>> I think I'll try implementing this.
>
>
> I've set this patch as waiting on author in the commitfest app.

Thanks David.

Between work I have found time here and there to work on it, but
making a path type that handles all the above turns out to be
surprisingly harder than my tid range scan.

In the earlier discussion from 2012, Tom Lane said:

> Bruce Momjian <bruce(at)momjian(dot)us> writes:
> > On Wed, Jun 13, 2012 at 03:21:17PM -0500, Merlin Moncure wrote:
> >> IMNSHO, it's a no-brainer for the todo (but I think it's more
> >> complicated than adding some comparisons -- which are working now):
>
> > I see. Seems we have to add index smarts to those comparisons. That
> > might be complicated.
>
> Uh, the whole point of a TID scan is to *not* need an index.
>
> What would be needed is for tidpath.c to let through more kinds of TID
> comparison quals than it does now, and then for nodeTidscan.c to know
> what to do with them. The latter logic might well look something like
> btree indexscan qual preparation, but it wouldn't be the same code.

I have been generally following this approach (handling more kinds of
TID comparisons), and have found myself doing things like pairing up >
with <, estimating how much of a table is covered by some set of >, <,
or "> AND <" quals, etc.  Things that I'm sure are handled in an
advanced way by index paths; unfortunately I didn't see any easily
reusable code in the index path code.  So I've ended up writing
special-case code for TID scans.  Hopefully it will be worth it.
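The "pairing up" step can be illustrated with a small standalone sketch (Tid, TidQual, and tightest_bounds are invented names; Tid is a simplification of ItemPointerData, and whether each bound is inclusive is ignored here for brevity, though the real code would track that alongside each tid):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for ItemPointerData: block number plus 1-based offset. */
typedef struct Tid
{
    unsigned    block;
    unsigned    offset;
} Tid;

typedef enum TidOpKind
{
    TID_GT, TID_GE, TID_LT, TID_LE
} TidOpKind;

typedef struct TidQual
{
    TidOpKind   op;
    Tid         tid;
} TidQual;

/* ctid literals like '(b,o)' compare by block first, then by offset. */
static int
tid_cmp(Tid a, Tid b)
{
    if (a.block != b.block)
        return a.block < b.block ? -1 : 1;
    if (a.offset != b.offset)
        return a.offset < b.offset ? -1 : 1;
    return 0;
}

/*
 * Pair up the lower-bound (> / >=) quals with the upper-bound (< / <=)
 * quals of one implicitly ANDed set, keeping the largest lower bound
 * and the smallest upper bound.  Returns false unless both kinds of
 * bound were found.
 */
static bool
tightest_bounds(const TidQual *quals, int n, Tid *lower, Tid *upper)
{
    bool        have_lower = false,
                have_upper = false;

    for (int i = 0; i < n; i++)
    {
        Tid         t = quals[i].tid;

        if (quals[i].op == TID_GT || quals[i].op == TID_GE)
        {
            if (!have_lower || tid_cmp(t, *lower) > 0)
                *lower = t;
            have_lower = true;
        }
        else
        {
            if (!have_upper || tid_cmp(t, *upper) < 0)
                *upper = t;
            have_upper = true;
        }
    }
    return have_lower && have_upper;
}
```

The fraction of the table covered by the pair can then be estimated from the block numbers of the two bounds.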

Edmund


Re: Tid scan improvements

David Rowley-3
On 19 September 2018 at 18:04, Edmund Horner <[hidden email]> wrote:
> I have been generally following this approach (handling more kinds of
> TID comparisons), and have found myself doing things like pairing up >
> with <, estimating how much of a table is covered by some set of >, <,
> or "> AND <" quals, etc.  Things that I'm sure are handled in an
> advanced way by index paths; unfortunately I didn't see any easily
> reusable code in the index path code.  So I've ended up writing
> special-case code for TID scans.  Hopefully it will be worth it.

I don't think it would need to be as complex as the index matching
code. Just looping over the quals and gathering up all compatible ctid
quals should be fine.  I imagine the complex handling of sorting the
quals by ctid and removal of redundant quals that are covered by some
range would be done in the executor.

Probably the costing will get more complex. At the moment it seems we
add a random_page_cost per ctid, but you'd probably need to make that
better and loop over the quals in each implicitly ANDed set and find
the max ctid for the > / >= quals and the min < / <= ctid, then
get the page number from each and assume max - min seq_page_cost, then
add random_page_cost for any remaining equality quals.  The costs from
other OR branches can likely just be added on.  This would double
count if someone did WHERE ctid BETWEEN '(0,0)' AND '(100,300)' OR
ctid BETWEEN '(0,0)' AND '(100,300)';  The current code seems to
double count now for duplicate ctids anyway. It even double counts if
the ctid being compared to is on the same page as another ctid, so I
don't think that would be unacceptable.
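Roughly, the costing described here for one ANDed set would come out as (a toy sketch, not planner code; cost_tid_qual_set is a made-up name, the two cost parameters are hard-coded at PostgreSQL's defaults, and block numbers are kept as double so nothing wraps on huge tables):

```c
#include <assert.h>

/* Cost parameters, hard-coded at PostgreSQL's default values. */
static const double seq_page_cost = 1.0;
static const double random_page_cost = 4.0;

/*
 * Cost one implicitly ANDed set of ctid quals: the pages between the
 * min and max bounding blocks are charged at seq_page_cost, and each
 * remaining equality qual is charged a random_page_cost of its own.
 * Costs from other OR branches would simply be added on top.
 */
static double
cost_tid_qual_set(double min_block, double max_block, int n_equality_quals)
{
    return (max_block - min_block) * seq_page_cost
        + (double) n_equality_quals * random_page_cost;
}
```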

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Tid scan improvements

Edmund Horner
On Wed, 19 Sep 2018 at 18:56, David Rowley <[hidden email]> wrote:

>
> On 19 September 2018 at 18:04, Edmund Horner <[hidden email]> wrote:
> > I have been generally following this approach (handling more kinds of
> > TID comparisons), and have found myself doing things like pairing up >
> > with <, estimating how much of a table is covered by some set of >, <,
> > or "> AND <" quals, etc.  Things that I'm sure are handled in an
> > advanced way by index paths; unfortunately I didn't see any easily
> > reusable code in the index path code.  So I've ended up writing
> > special-case code for TID scans.  Hopefully it will be worth it.
>
> I don't think it would need to be as complex as the index matching
> code. Just looping over the quals and gathering up all compatible ctid
> quals should be fine.  I imagine the complex handling of sorting the
> quals by ctid and removal of redundant quals that are covered by some
> range would be done in the executor.

I've got the path creation and execution pretty much working, though
with some inefficiencies:
  - Each individual TID is treated as a range of size 1 (but CURRENT
OF is handled as a single fetch)
  - Range scans have to scan whole blocks, and skip over the tuples
that are out of range.
But it's enough to get the tests passing.

Right now I'm looking at costing:

> Probably the costing will get more complex. At the moment it seems we
> add a random_page_cost per ctid, but you'd probably need to make that
> better and loop over the quals in each implicitly ANDed set and find
> the max ctid for the > / >= quals and the min < / <= ctid, then
> get the page number from each and assume max - min seq_page_cost, then
> add random_page_cost for any remaining equality quals.  The costs from
> other OR branches can likely just be added on.  This would double
> count if someone did WHERE ctid BETWEEN '(0,0)' AND '(100,300)' OR
> ctid BETWEEN '(0,0)' AND '(100,300)';  The current code seems to
> double count now for duplicate ctids anyway. It even double counts if
> the ctid being compared to is on the same page as another ctid, so I
> don't think that would be unacceptable.

There are two stages of costing:
1. Estimating the number of rows that the relation will return.  This
happens before path generation.
2. Estimating the cost of the path.

In the existing code, (1) goes through the normal clausesel.c
machinery, eventually getting to the restriction function defined in
pg_operator.  For range quals, e.g. >, it looks for a stats entry for
the variable, but since it's a system variable with no stats, it
returns DEFAULT_INEQ_SEL (in function scalarineqsel).  For equality
quals, it does have some special-case code (in function
get_variable_numdistinct) to use stadistinct=-1 for the CTID variable,
resulting in a selectivity estimate of 1/ntuples.

(2), on the other hand, has special-case code in costsize.c (function
cost_tidscan), which estimates each TID as being a separate tuple
fetch from a different page.  (The existing code only has to support
=, IN, and CURRENT OF as quals for a TID path.)

In my work, I have been adding support for range quals to (2), which
includes estimating the selectivity of expressions like (CTID > a AND
CTID < b).  I got tired of handling all the various ways of ordering
the quals, so I thought I would try re-using the clausesel.c
machinery.  In selfuncs.c, I've added special case code for
scalarineqsel and nulltestsel to handle CTID variables.  (This also
improves the row count estimates.)

I'm not 100% sure what the costs of each range should be.  I think the
first block should incur random_page_cost, with subsequent blocks
being seq_page_cost.  Simple "CTID = ?" quals are still estimated as 1
tuple + 1 random block.

Have a look at the attached WIP if you like and tell me if you think
it's going in the right direction.  I'm sorry for the size of the
patch; I couldn't find a nice way to cut it up.  I did run pgindent
over it though. :)

Cheers,
Edmund

tid_scan_improvements-v1.patch (102K)

Re: Tid scan improvements

Edmund Horner
On Fri, 28 Sep 2018 at 17:02, Edmund Horner <[hidden email]> wrote:
> I did run pgindent over it though. :)

But I didn't check if it still applied to master.  Sigh.  Here's one that does.

tid_scan_improvements-v2.patch (102K)

Re: Tid scan improvements

David Rowley-3
On 28 September 2018 at 18:13, Edmund Horner <[hidden email]> wrote:
> On Fri, 28 Sep 2018 at 17:02, Edmund Horner <[hidden email]> wrote:
>> I did run pgindent over it though. :)
>
> But I didn't check if it still applied to master.  Sigh.  Here's one that does.

I know commit fest is over, but I made a pass of this to hopefully
provide a bit of guidance so that it's closer for the November 'fest.

I've only done light testing on the patch and it does seem to work,
but there are a few things that I think should be changed. Most
importantly #11 below I think needs to be done. That might overwrite
some of the items that come before it in the list as you likely will
have to pull some of the code I mention out due to changing #11.
I've kept them around anyway just in case some of it remains.

1. Could wrap for tables > 16TB. Please use double. See index_pages_fetched()

int nrandompages;
int nseqpages;

2. Should multiply by nseqpages, not add.

run_cost += spc_random_page_cost * nrandompages + spc_seq_page_cost + nseqpages;

3. Should be double:

BlockNumber pages = selectivity * baserel->pages;

4. Comment needs updated to mention what the new code does in
heapgettup() and heapgettup_pagemode()

+
  /* start from last page of the scan */
- if (scan->rs_startblock > 0)
- page = scan->rs_startblock - 1;
+ if (scan->rs_numblocks == InvalidBlockNumber)
+ {
+ if (scan->rs_startblock > 0)
+ page = scan->rs_startblock - 1;
+ else
+ page = scan->rs_nblocks - 1;
+ }
  else
- page = scan->rs_nblocks - 1;
+ page = scan->rs_startblock + scan->rs_numblocks - 1;
+

5. Variables should be called "inclusive". We use "strict" to indicate
an operator comparison cannot match NULL values.

+ bool strict; /* Indicates < rather than <=, or > rather */
+ bool strict2; /* than >= */

Don't break the comment like that. If you need more space don't end
the comment and use a new line and tab the next line out to match the
* of the first line.

6. Why not pass the TidExpr into MakeTidOpExprState() and have it set
the type instead of repeating code

7. It's not very obvious why the following Assert() can't fail.

+ bool invert;
+ bool invert2;
+
+ Assert(list_length(((BoolExpr *) expr)->args) == 2);

I had to hunt around quite a bit to see that
TidQualFromBaseRestrictinfo could only ever make the list have 2
elements, and we'd not form a BoolExpr with just 1. (but see #11)

8. Many instances of the word "strict" are used to mean "inclusive".
Can you please change all of them.

9. Confusing comment:

+ * If the expression was non-strict (<=) and the offset is 0, then just
+ * pretend it was strict, because offset 0 doesn't exist and we may as
+ * well exclude that block.

Shouldn't this be, "If the operator is non-inclusive, then since TID
offsets are 1-based, for simplicity, we can just class the expression
as inclusive.", or something along those lines.
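To spell out the 1-based-offset point, the last block an upper-bound qual can touch would be computed along these lines (upper_bound_block is a made-up helper for illustration only):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Last heap block an upper-bound ctid qual can match.  Offsets are
 * 1-based, so '(b,0)' names no tuple on block b: both ctid < '(b,0)'
 * and ctid <= '(b,0)' end at block b - 1, as does ctid < '(b,1)'.
 */
static unsigned
upper_bound_block(unsigned block, unsigned offset, bool inclusive)
{
    if (offset == 0 || (!inclusive && offset == 1))
        return block > 0 ? block - 1 : 0;   /* clamp at block 0 */
    return block;
}
```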

10. Comment talks about LHS, but the first OpExpr in a list of two
OpExprs has nothing to do with left hand side.  You could use LHS if
you were talking about the first arg in an OpExpr, but this is not the
case here.

/* If the LHS is not the lower bound, swap them. */

You could probably just ensure that the >=, > ops is the first in the
list inside TidQualFromBaseRestrictinfo(), but you'd need to clearly
comment that this is the case in both locations. Perhaps use lcons()
for the lower end and lappend() for the upper end, but see #11.

11. I think the qual matching code needs an overhaul.  Really you
should attempt to find the smallest and largest ctid for your
implicitly ANDed ranges.  This would require you getting rid of the
BETWEEN type claused you're trying to build in
TidQualFromBaseRestrictinfo
and instead just include all quals, don't ignore other quals when
you've already found your complete range bounds.

The problem with doing it the way that you're doing it now is in cases like:

create table t1(a int);
insert into t1 select generate_Series(1,10000000);
create index on t1 (a);
select ctid,a from t1 order by a desc limit 1; -- find the max ctid.
    ctid     |    a
-------------+----------
 (44247,178) | 10000000
(1 row)

set max_parallel_workers_per_gather=0;
explain analyze select ctid,* from t1 where ctid > '(0,0)' and ctid <=
'(44247,178)' and ctid <= '(0,1)';
                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Tid Scan on t1  (cost=0.01..169248.78 rows=1 width=10) (actual
time=0.042..2123.432 rows=1 loops=1)
   TID Cond: ((ctid > '(0,0)'::tid) AND (ctid <= '(44247,178)'::tid))
   Filter: (ctid <= '(0,1)'::tid)
   Rows Removed by Filter: 9999999
 Planning Time: 4.049 ms
 Execution Time: 2123.464 ms
(6 rows)

Due to how you've coded TidQualFromBaseRestrictinfo(), the ctid <=
'(0,1)' qual does not make it into the range. It's left as a filter in
the Tid Scan.

I think I'm going to stop here as changing this going to cause quite a
bit of churn.

but one more...

12. I think the changes to selfuncs.c to get the selectivity estimate
is a fairly credible idea, but I think it also needs to account for
offsets. You should be able to work out the average number of items
per page with rel->tuples / rel->pages and factor that in to get a
better estimate for cases like:

postgres=# explain analyze select ctid,* from t1 where ctid <= '(0,200)';
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Tid Scan on t1  (cost=0.00..5.00 rows=1 width=10) (actual
time=0.025..0.065 rows=200 loops=1)
   TID Cond: (ctid <= '(0,200)'::tid)
 Planning Time: 0.081 ms
 Execution Time: 0.088 ms
(4 rows)

You can likely add on "(offset / avg_tuples_per_page) / rel->pages" to
the selectivity and get a fairly accurate estimate... at least when
there are no dead tuples in the heap
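In other words, something like (a standalone sketch; ctid_upper_selectivity is a made-up name, and dead tuples are ignored):

```c
#include <assert.h>

/*
 * Estimated selectivity of "ctid <= '(block,offset)'": the whole blocks
 * below the bound, plus the fraction of the bound's own block implied
 * by the offset, divided by the total number of pages.
 */
static double
ctid_upper_selectivity(unsigned block, unsigned offset,
                       double rel_pages, double rel_tuples)
{
    double      tuples_per_page = rel_tuples / rel_pages;
    double      sel;

    sel = ((double) block + (double) offset / tuples_per_page) / rel_pages;
    return sel > 1.0 ? 1.0 : sel;   /* clamp to a valid selectivity */
}
```

For the example above that would give roughly 200 / rel_tuples instead of the flat default, since the bound sits partway through block 0.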

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Tid scan improvements

Edmund Horner
On Wed, 3 Oct 2018 at 17:36, David Rowley <[hidden email]> wrote:
> I know commit fest is over, but I made a pass of this to hopefully
> provide a bit of guidance so that it's closer for the November 'fest.

Hi David.  Thanks for the review.  It's fairly thorough and you must
have put some time into it -- I really appreciate it.


> I've only done light testing on the patch and it does seem to work,
> but there are a few things that I think should be changed. Most
> importantly #11 below I think needs to be done. That might overwrite
> some of the items that come before it in the list as you likely will
> have to pull some of the code I mention out due to changing #11.
> I've kept them around anyway just in case some of it remains.

> 1. Could wrap for tables > 16TB. Please use double. See index_pages_fetched()
> 2. Should multiply by nseqpages, not add.
> 3. Should be double.

I agree with these three.


> 4. Comment needs updated to mention what the new code does in
> heapgettup() and heapgettup_pagemode()
>
> +
>   /* start from last page of the scan */
> - if (scan->rs_startblock > 0)
> - page = scan->rs_startblock - 1;
> + if (scan->rs_numblocks == InvalidBlockNumber)
> + {
> + if (scan->rs_startblock > 0)
> + page = scan->rs_startblock - 1;
> + else
> + page = scan->rs_nblocks - 1;
> + }
>   else
> - page = scan->rs_nblocks - 1;
> + page = scan->rs_startblock + scan->rs_numblocks - 1;
> +

I'm thinking that, as they don't depend on the others, the heapam.c
changes should be a separate preparatory patch?

The heap scan code has to support things like synchronised scans and
parallel scans, but as far as I know, its support for scanning
subranges is currently used only for building BRIN indexes.  I found
that although I could specify a subrange with heap_setscanlimits, I
could not scan backward over it, because the original version of the
above code would start at the end of the whole table.

I'm not especially comfortable with this understanding of heapam, so
close review would be appreciated.

I note that there's a lot of common code in heapgettup and
heapgettup_pagemode, which my changes add to.  It might be worth
trying to factor out somehow.


> 5. Variables should be called "inclusive". We use "strict" to indicate
> an operator comparison cannot match NULL values.
>
> + bool strict; /* Indicates < rather than <=, or > rather */
> + bool strict2; /* than >= */
>
> Don't break the comment like that. If you need more space don't end
> the comment and use a new line and tab the next line out to match the
> * of the first line.

> 8. Many instances of the word "strict" are used to mean "inclusive".
> Can you please change all of them.

I don't mind renaming it.  I took "strict" from "strictly greater/less
than" but I knew it was confusable with the other usages of "strict".


> 9. Confusing comment:
>
> + * If the expression was non-strict (<=) and the offset is 0, then just
> + * pretend it was strict, because offset 0 doesn't exist and we may as
> + * well exclude that block.
>
> Shouldn't this be, "If the operator is non-inclusive, then since TID
> offsets are 1-based, for simplicity, we can just class the expression
> as inclusive.", or something along those lines.

Ok, I'll try to reword it along those lines.


> I think I'm going to stop here as changing this going to cause quite a
> bit of churn.
>
> but one more...

> 12. I think the changes to selfuncs.c to get the selectivity estimate
> is a fairly credible idea, but I think it also needs to account for
> offsets. You should be able to work out the average number of items
> per page with rel->tuples / rel->pages and factor that in to get a
> better estimate for cases like:
>
> postgres=# explain analyze select ctid,* from t1 where ctid <= '(0,200)';
>                                           QUERY PLAN
> -----------------------------------------------------------------------------------------------
>  Tid Scan on t1  (cost=0.00..5.00 rows=1 width=10) (actual
> time=0.025..0.065 rows=200 loops=1)
>    TID Cond: (ctid <= '(0,200)'::tid)
>  Planning Time: 0.081 ms
>  Execution Time: 0.088 ms
> (4 rows)
>
> You can likely add on "(offset / avg_tuples_per_page) / rel->pages" to
> the selectivity and get a fairly accurate estimate... at least when
> there are no dead tuples in the heap

I think the changes to selfuncs.c could also be a separate patch?

I'll try to include the offset in the selectivity too.

Related -- what should the selectivity be on an empty table?  My code has:

    /* If the relation's empty, we're going to read all of it. */
    if (vardata->rel->pages == 0)
        return 1.0;

(which needs rewording, since selectivity isn't always about reading).
Is 1.0 the right thing to return?


> 6. Why not pass the TidExpr into MakeTidOpExprState() and have it set
> the type instead of repeating code
> 7. It's not very obvious why the following Assert() can't fail. [...]
> I had to hunt around quite a bit to see that
> TidQualFromBaseRestrictinfo could only ever make the list have 2
> elements, and we'd not form a BoolExpr with just 1. (but see #11)
> 10. Comment talks about LHS, but the first OpExpr in a list of two
> OpExprs has nothing to do with left hand side.  You could use LHS if
> you were talking about the first arg in an OpExpr, but this is not the
> case here.

These three might become non-issues if we change it along the lines of #11:

> 11. I think the qual matching code needs an overhaul.  Really you
> should attempt to find the smallest and largest ctid for your
> implicitly ANDed ranges.  This would require you getting rid of the
> BETWEEN type claused you're trying to build in
> TidQualFromBaseRestrictinfo
> and instead just include all quals, don't ignore other quals when
> you've already found your complete range bounds.
>
> The problem with doing it the way that you're doing it now is in cases like:
>
> create table t1(a int);
> insert into t1 select generate_Series(1,10000000);
> create index on t1 (a);
> select ctid,a from t1 order by a desc limit 1; -- find the max ctid.
>     ctid     |    a
> -------------+----------
>  (44247,178) | 10000000
> (1 row)
>
> set max_parallel_workers_per_gather=0;
> explain analyze select ctid,* from t1 where ctid > '(0,0)' and ctid <=
> '(44247,178)' and ctid <= '(0,1)';
>                                              QUERY PLAN
> -----------------------------------------------------------------------------------------------------
>  Tid Scan on t1  (cost=0.01..169248.78 rows=1 width=10) (actual
> time=0.042..2123.432 rows=1 loops=1)
>    TID Cond: ((ctid > '(0,0)'::tid) AND (ctid <= '(44247,178)'::tid))
>    Filter: (ctid <= '(0,1)'::tid)
>    Rows Removed by Filter: 9999999
>  Planning Time: 4.049 ms
>  Execution Time: 2123.464 ms
> (6 rows)
>
> Due to how you've coded TidQualFromBaseRestrictinfo(), the ctid <=
> '(0,1)' qual does not make it into the range. It's left as a filter in
> the Tid Scan.

My first thought was to support the fairly common use case of a
two-bound range, "ctid >= ? AND ctid < ?" (or single-bound variations);
hence my original patch for a "TID Range Scan".

Following the comments made earlier, I tried incorporating this into
the existing TID Scan; but I still had the same use case in mind, so
only the first lower and upper bounds were used.  My thoughts were
that, while we need to *correctly* handle more complicated cases like
"ctid > '(0,0)' AND ctid <= '(44247,178)' AND ctid <= '(0,1)'", such
queries will not come up in practice and hence it's OK if those extra
bounds are applied in the filter.  For the same reason, I did not
consider it worthwhile trying to pick which bound to use in the scan.

I've since realised that such queries aren't always redundant.  At
planning time we might not know which of the bounds is the "best", but
we will after evaluating them in the executor.  So I quite like the
idea of keeping all of them.

This means a TID path's qual list is an OR-list of:
  - "ctid = ?"
  - "ctid = ANY (?)" / "ctid IN (?)"
  - "(ctid op ?) AND ..."   (where op is one of >,>=,<,<=)
  - "CURRENT OF"

I still don't think the scan needs to support quals like "ctid = ? AND
ctid > ?", or "ctid IN (?) AND ctid IN (?)" -- the executor *could*
try to form the intersection but I don't think it's worth the code.
In these cases, picking a simple qual is usually enough for an
efficient scan; the full qual can go into the filter.

I'm part way through implementing this.  It looks like it might
actually be less code than what I had before.


Re: Tid scan improvements

Edmund Horner
Hi all,

I have managed to split my changes into 4 patches:

v3-0001-Add-selectivity-and-nullness-estimates-for-the-ItemP.patch
v3-0002-Support-range-quals-in-Tid-Scan.patch
v3-0003-Support-backward-scans-over-restricted-ranges-in-hea.patch
v3-0004-Tid-Scan-results-are-ordered.patch

(1) is basically independent, and usefully improves estimates for ctid quals.
(2) is the main patch, adding basic range scan support to TidPath and TidScan.
(3) is a small change to properly support backward scans over a
restricted range in heapam.c, and is needed for (4).
(4) adds Backward Tid Scans, and adds path keys to Tid Paths so that
the planner doesn't have to add a sort for certain queries.

I have tried to apply David's suggestions.

In (1), I've included the offset part of a CTID constant in the
selectivity calculation.  I've not included "allvisfrac" in the
calculation; I'm not sure it's worth it as it would only affect the
offset part.
I have tried to use iseq to differentiate <= and >= from < and >,
but I'm not sure I've got it right.  I am also not entirely sure
it's worth it; the changes are already an improvement over the current
behaviour of using hardcoded selectivity constants.

In (2), the planner now picks up a greater variety of TID quals,
including AND-clauses with arbitrary children instead of the original
lower bound/upper bound pair.  These are resolved in the executor into
a list of ranges to scan.

(3) is the same code, but I've added a couple of comments to explain the change.

(4) is basically the same pathkey/direction code as before (but as a
separate patch).

I hope the separation will make it easier to review.  Item (2) is
still quite big, but a third of it is tests.

Cheers.
Edmund

v3-0002-Support-range-quals-in-Tid-Scan.patch (81K) Download Attachment
v3-0001-Add-selectivity-and-nullness-estimates-for-the-ItemP.patch (3K) Download Attachment
v3-0003-Support-backward-scans-over-restricted-ranges-in-hea.patch (2K) Download Attachment
v3-0004-Tid-Scan-results-are-ordered.patch (26K) Download Attachment

Re: Tid scan improvements

David Rowley-3
On 4 November 2018 at 17:20, Edmund Horner <[hidden email]> wrote:
> I have managed to split my changes into 4 patches:
>
> v3-0001-Add-selectivity-and-nullness-estimates-for-the-ItemP.patch
> v3-0002-Support-range-quals-in-Tid-Scan.patch
> v3-0003-Support-backward-scans-over-restricted-ranges-in-hea.patch
> v3-0004-Tid-Scan-results-are-ordered.patch

Hi,

I've been looking over 0001 to 0003. I ran out of steam before 0004.

I like the design of the new patch. From what I've thrown at the
selectivity estimation code so far, it seems pretty good.  I also quite
like the design in nodeTidscan.c for range scans.

I didn't quite manage to wrap my head around the code that removes
redundant quals from the tidquals. For example, with:

postgres=# explain select * from t1 where ctid <= '(0,10)' and a = 0;
                    QUERY PLAN
--------------------------------------------------
 Tid Scan on t1  (cost=0.00..3.19 rows=1 width=4)
   TID Cond: (ctid <= '(0,10)'::tid)
   Filter: (a = 0)
(3 rows)

and:

postgres=# explain select * from t1 where ctid <= '(0,10)' or a = 20
and ctid >= '(0,0)';
                                  QUERY PLAN
------------------------------------------------------------------------------
 Tid Scan on t1  (cost=0.01..176.18 rows=12 width=4)
   TID Cond: ((ctid <= '(0,10)'::tid) OR (ctid >= '(0,0)'::tid))
   Filter: ((ctid <= '(0,10)'::tid) OR ((a = 20) AND (ctid >= '(0,0)'::tid)))
(3 rows)

I understand why the 2nd query didn't remove the ctid quals from the
filter, and I understand why the first query could. I just didn't
manage to convince myself that the code behaves correctly for all
cases.

During my pass through 0001, 0002 and 0003 I noted the following:

0001:

1. I see a few instances of:

#define DatumGetItemPointer(X) ((ItemPointer) DatumGetPointer(X))
#define ItemPointerGetDatum(X) PointerGetDatum(X)

in both tid.c and ginfuncs.c, and I see you have:

+ itemptr = (ItemPointer) DatumGetPointer(constval);

Do you think it would be worth moving the macros out of tid.c and
ginfuncs.c into postgres.h and use that macro instead?

(I see the code in this file already did this, so it might not matter
about this)

0002:

2. In TidCompoundRangeQualFromExpr() rlst is not really needed. You
can just return MakeTidRangeQuals(found_quals); or return NIL.

3. Can you explain why this only needs to take place when list_length() == 1?

/*
* In the case of a compound qual such as "ctid > ? AND ctid < ? AND ...",
* the various parts will have come from different RestrictInfos.  So
* remove each part separately.
*/
if (list_length(tidquals) == 1)
{
Node    *qual = linitial(tidquals);

if (and_clause(qual))
{
BoolExpr   *and_qual = ((BoolExpr *) qual);

scan_clauses = list_difference(scan_clauses, and_qual->args);
}
}

4. Accidental change?

- tidquals);
+ tidquals
+ );

5. Shouldn't this comment get changed?

- * NumTids    number of tids in this scan
+ * NumRanges    number of tids in this scan

6. There's no longer a field named NumTids

- * TidList    evaluated item pointers (array of size NumTids)
+ * TidRanges    evaluated item pointers (array of size NumTids)

7. The following field is not documented in TidScanState:

+ bool tss_inScan;

8. Can you name this exprtype instead?

+ TidExprType type; /* type of op */

"type" is used by Node types to indicate their type.

9. It would be neater to write this:

if (expr->opno == TIDLessOperator || expr->opno == TIDLessEqOperator)
tidopexpr->type = invert ? TIDEXPR_LOWER_BOUND : TIDEXPR_UPPER_BOUND;
else if (expr->opno == TIDGreaterOperator || expr->opno == TIDGreaterEqOperator)
tidopexpr->type = invert ? TIDEXPR_UPPER_BOUND : TIDEXPR_LOWER_BOUND;
else
tidopexpr->type = TIDEXPR_EQ;

tidopexpr->exprstate = exprstate;

tidopexpr->inclusive = expr->opno == TIDLessEqOperator || expr->opno
== TIDGreaterEqOperator;

as a switch:

switch (expr->opno)
{
case TIDLessEqOperator:
tidopexpr->inclusive = true;
/* fall through */
case TIDLessOperator:
tidopexpr->type = invert ? TIDEXPR_LOWER_BOUND : TIDEXPR_UPPER_BOUND;
break;
case TIDGreaterEqOperator:
tidopexpr->inclusive = true;
/* fall through */
case TIDGreaterOperator:
tidopexpr->type = invert ? TIDEXPR_UPPER_BOUND : TIDEXPR_LOWER_BOUND;
break;
default:
tidopexpr->type = TIDEXPR_EQ;
}
tidopexpr->exprstate = exprstate;

10. I don't quite understand this comment:

+ * Create an ExprState corresponding to the value part of a TID comparison,
+ * and wrap it in a TidOpExpr.  Set the type and inclusivity of the TidOpExpr
+ * appropriately, depending on the operator and position of the its arguments.

I don't quite see how the code sets the inclusivity depending on the
position of the arguments.

Maybe the comment should be:

+ * For the given 'expr' build and return an appropriate TidOpExpr taking into
+ * account the expr's operator and operand order.

11. ScalarArrayOpExpr are commonly named "saop":

+static TidOpExpr *
+MakeTidScalarArrayOpExpr(ScalarArrayOpExpr *saex, TidScanState *tidstate)

(Though I see it's saex in other places in that file, so might not matter...)

12. You need to code SetTidLowerBound() with similar wraparound logic
you have in SetTidUpperBound().

It's perhaps unlikely, but the following shows incorrect results.

postgres=# select ctid from t1 where ctid > '(0,65535)' limit 1;
 ctid
-------
 (0,1)
(1 row)

-- the following is fine.

Time: 1.652 ms
postgres=# select ctid from t1 where ctid >= '(0,65535)' limit 1;
 ctid
-------
 (1,1)
(1 row)

Likely you can just advance to the next block when the offset is >
MaxOffsetNumber.

13. It looks like the previous code didn't make the assumption you're making in:

+ * A current-of TidExpr only exists by itself, and we should
+ * already have allocated a tidList entry for it.  We don't
+ * need to check whether the tidList array needs to be
+ * resized.

I'm not sure if it's a good idea to lock the executor code into what
the grammar currently says is possible. The previous code didn't
assume that.

14. we pass 'false' to what?

+ * save the tuple and the buffer returned to us by the access methods in
+ * our scan tuple slot and return the slot.  Note: we pass 'false' because
+ * tuples returned by heap_getnext() are pointers onto disk pages and were
+ * not created with palloc() and so should not be pfree()'d.  Note also
+ * that ExecStoreHeapTuple will increment the refcount of the buffer; the
+ * refcount will not be dropped until the tuple table slot is cleared.
  */
- return ExecClearTuple(slot);
+ if (tuple)
+ ExecStoreBufferHeapTuple(tuple, /* tuple to store */
+ slot, /* slot to store in */
+ scandesc->rs_cbuf); /* buffer associated
+ * with this tuple */
+ else
+ ExecClearTuple(slot);
+
+ return slot;

0003:

Saw nothing wrong:

0004:

Not yet reviewed.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Tid scan improvements

Alvaro Herrera-9
On 2018-Nov-06, David Rowley wrote:

> 14. we pass 'false' to what?
>
> + * save the tuple and the buffer returned to us by the access methods in
> + * our scan tuple slot and return the slot.  Note: we pass 'false' because
> + * tuples returned by heap_getnext() are pointers onto disk pages and were
> + * not created with palloc() and so should not be pfree()'d.  Note also
> + * that ExecStoreHeapTuple will increment the refcount of the buffer; the
> + * refcount will not be dropped until the tuple table slot is cleared.
>   */

Seems a mistake stemming from 29c94e03c7d0 ...

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Tid scan improvements

Edmund Horner
On Tue, 6 Nov 2018 at 16:52, Alvaro Herrera <[hidden email]> wrote:

> On 2018-Nov-06, David Rowley wrote:
> > 14. we pass 'false' to what?
> >
> > + * save the tuple and the buffer returned to us by the access methods in
> > + * our scan tuple slot and return the slot.  Note: we pass 'false' because
> > + * tuples returned by heap_getnext() are pointers onto disk pages and were
> > + * not created with palloc() and so should not be pfree()'d.  Note also
> > + * that ExecStoreHeapTuple will increment the refcount of the buffer; the
> > + * refcount will not be dropped until the tuple table slot is cleared.
> >   */
>
> Seems a mistake stemming from 29c94e03c7d0 ...
Yep -- I copied that bit from nodeSeqscan.c.  Some of the notes were
removed in that change, but nodeSeqscan.c and nodeIndexscan.c still
have them.

I made a little patch to remove them.

remove-obsolete-ExecStoreTuple-notes.patch (1K) Download Attachment

Re: Tid scan improvements

Edmund Horner
In reply to this post by David Rowley-3
On Tue, 6 Nov 2018 at 16:40, David Rowley <[hidden email]> wrote:
> I've been looking over 0001 to 0003. I ran out of steam before 0004.

Hi David, thanks for another big review with lots of improvements.

> I like the design of the new patch. From what I threw so far at the
> selectivity estimation code, it seems pretty good.  I also quite like
> the design in nodeTidscan.c for range scans.


> I didn't quite manage to wrap my head around the code that removes
> redundant quals from the tidquals. For example, with:
>
> postgres=# explain select * from t1 where ctid <= '(0,10)' and a = 0;
>                     QUERY PLAN
> --------------------------------------------------
>  Tid Scan on t1  (cost=0.00..3.19 rows=1 width=4)
>    TID Cond: (ctid <= '(0,10)'::tid)
>    Filter: (a = 0)
> (3 rows)
>
> and:
>
> postgres=# explain select * from t1 where ctid <= '(0,10)' or a = 20
> and ctid >= '(0,0)';
>                                   QUERY PLAN
> ------------------------------------------------------------------------------
>  Tid Scan on t1  (cost=0.01..176.18 rows=12 width=4)
>    TID Cond: ((ctid <= '(0,10)'::tid) OR (ctid >= '(0,0)'::tid))
>    Filter: ((ctid <= '(0,10)'::tid) OR ((a = 20) AND (ctid >= '(0,0)'::tid)))
> (3 rows)
>
> I understand why the 2nd query didn't remove the ctid quals from the
> filter, and I understand why the first query could. I just didn't
> manage to convince myself that the code behaves correctly for all
> cases.

I agree it's not obvious.

1. We extract a set of tidquals that can be directly implemented by
the Tid scan.  This set is of the form:  "(CTID op ? AND ...) OR
(...)" (with some limitations).
2. If they happened to come verbatim from the original RestrictInfos,
then they will be found in scan_clauses, and we can remove them.
3. If they're not verbatim, i.e. the original RestrictInfos have
additional criteria that the Tid scan can't use, then tidquals won't
match anything in scan_clauses, and hence scan_clauses will be
unchanged.
4. We do a bit of extra work for the common and useful case of "(CTID
op ? AND ...)".  Since the top-level operation of the input quals is
an AND, it will typically be split into multiple RestrictInfo items.
We remove each part from scan_clauses.

> 1. I see a few instances of:
>
> #define DatumGetItemPointer(X) ((ItemPointer) DatumGetPointer(X))
> #define ItemPointerGetDatum(X) PointerGetDatum(X)
>
> in both tid.c and ginfuncs.c, and I see you have:
>
> + itemptr = (ItemPointer) DatumGetPointer(constval);
>
> Do you think it would be worth moving the macros out of tid.c and
> ginfuncs.c into postgres.h and use that macro instead?
>
> (I see the code in this file already did this, so it might not matter
> about this)

I'm not sure about this one -- I think it's better as a separate
patch, since we'd also be changing ginfuncs.c.  I have left it alone
for now.

> 2. In TidCompoundRangeQualFromExpr() rlst is not really needed. You
> can just return MakeTidRangeQuals(found_quals); or return NIL.

Yup, gone.

> 3. Can you explain why this only needs to take place when list_length() == 1?
>
> /*
> * In the case of a compound qual such as "ctid > ? AND ctid < ? AND ...",
> * the various parts will have come from different RestrictInfos.  So
> * remove each part separately.
> */
> ...

I've tried to improve the comment.

> 4. Accidental change?
>
> - tidquals);
> + tidquals
> + );
>
> 5. Shouldn't this comment get changed?
>
> - * NumTids    number of tids in this scan
> + * NumRanges    number of tids in this scan
>
> 6. There's no longer a field named NumTids
>
> - * TidList    evaluated item pointers (array of size NumTids)
> + * TidRanges    evaluated item pointers (array of size NumTids)
>
> 7. The following field is not documented in TidScanState:
>
> + bool tss_inScan;
>
> 8. Can you name this exprtype instead?
>
> + TidExprType type; /* type of op */
>
> "type" is used by Node types to indicate their type.

Yup, yup, yup, yup, yup.

> 9. It would be neater this:
>
> if (expr->opno == TIDLessOperator || expr->opno == TIDLessEqOperator)
> tidopexpr->type = invert ? TIDEXPR_LOWER_BOUND : TIDEXPR_UPPER_BOUND;
> else if (expr->opno == TIDGreaterOperator || expr->opno == TIDGreaterEqOperator)
> tidopexpr->type = invert ? TIDEXPR_UPPER_BOUND : TIDEXPR_LOWER_BOUND;
> else
> tidopexpr->type = TIDEXPR_EQ;
>
> tidopexpr->exprstate = exprstate;
>
> tidopexpr->inclusive = expr->opno == TIDLessEqOperator || expr->opno
> == TIDGreaterEqOperator;
>
> as a switch: ...

Yup, I think the switch is a bit nicer.

> 10. I don't quite understand this comment:
>
> + * Create an ExprState corresponding to the value part of a TID comparison,
> + * and wrap it in a TidOpExpr.  Set the type and inclusivity of the TidOpExpr
> + * appropriately, depending on the operator and position of the its arguments.
>
> I don't quite see how the code sets the inclusivity depending on the
> position of the arguments.
>
> Maybe the comment should be:
>
> + * For the given 'expr' build and return an appropriate TidOpExpr taking into
> + * account the expr's operator and operand order.

I'll go with your wording.

> 11. ScalarArrayOpExpr are commonly named "saop": ...

Yup.

> 12. You need to code SetTidLowerBound() with similar wraparound logic
> you have in SetTidUpperBound().
>
> It's perhaps unlikely, but the following shows incorrect results.
>
> postgres=# select ctid from t1 where ctid > '(0,65535)' limit 1;
>  ctid
> -------
>  (0,1)
> (1 row)
>
> -- the following is fine.
>
> Time: 1.652 ms
> postgres=# select ctid from t1 where ctid >= '(0,65535)' limit 1;
>  ctid
> -------
>  (1,1)
> (1 row)
>
> Likely you can just upgrade to the next block when the offset is >
> MaxOffsetNumber.

This is important, thanks for spotting it.

I've tried to add some code to handle this case (and also that of
"ctid < '(0,0)'") with a couple of tests too.

> 13. It looks like the previous code didn't make the assumption you're making in:
>
> + * A current-of TidExpr only exists by itself, and we should
> + * already have allocated a tidList entry for it.  We don't
> + * need to check whether the tidList array needs to be
> + * resized.
>
> I'm not sure if it's a good idea to lock the executor code into what
> the grammar currently says is possible. The previous code didn't
> assume that.

Fair enough, I've restored the previous code without the assumption.

> 14. we pass 'false' to what?

Obsolete comment (see reply to Alvaro).

I've applied most of these, and I'll post a new patch soon.


Re: Tid scan improvements

Edmund Horner
Hi, here's the new patch(s).

Mostly the same, but trying to address your comments from earlier as
well as clean up a few other things I noticed.

Cheers,
Edmund

On Fri, 9 Nov 2018 at 15:01, Edmund Horner <[hidden email]> wrote:

> [...]

v4-0001-Add-selectivity-and-nullness-estimates-for-CTID-syst.patch (3K) Download Attachment
v4-0003-Support-backward-scans-over-restricted-ranges-in-hea.patch (2K) Download Attachment
v4-0002-Support-range-quals-in-Tid-Scan.patch (76K) Download Attachment
v4-0004-Tid-Scan-results-are-ordered.patch (26K) Download Attachment

Re: Tid scan improvements

David Rowley-3
On Mon, 12 Nov 2018 at 17:35, Edmund Horner <[hidden email]> wrote:
> Hi, here's the new patch(s).
>
> Mostly the same, but trying to address your comments from earlier as
> well as clean up a few other things I noticed.

Thanks for making those changes.

I've now had a look over the latest patches and I've found a few more
things.  Many of these are a bit nitpicky, but certainly not all. I
also reviewed 0004 this time.

0001:

1. The row estimates are not quite right.  This causes the row
estimate to go the wrong way for isgt.

For example, the following gets 24 rows instead of 26.

postgres=# create table t (a int);
CREATE TABLE
postgres=# insert into t select generate_Series(1,100);
INSERT 0 100
postgres=# analyze t;
postgres=# explain analyze select * from t where ctid >= '(0,75)';
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..2.25 rows=24 width=4) (actual
time=0.046..0.051 rows=26 loops=1)
   Filter: (ctid >= '(0,75)'::tid)
   Rows Removed by Filter: 74
 Planning Time: 0.065 ms
 Execution Time: 0.074 ms
(5 rows)

The < and <= cases are not quite right either: < should have 1 fewer
tuple than the calculated average tuples per page, and <= should have
the same (assuming no gaps).

I've attached a small delta patch that I think improves things here.

0002:

2. You should test for a non-empty List with list != NIL

/*
* If no quals were specified, then a complete scan is assumed.  Make a
* TidExpr with an empty list of TidOpExprs.
*/
if (!node->tidquals)

Also, can you not just return after that if test? I think the code
would be easier to read with it like that.

3. I'd rather see EnsureTidRangeSpace() keep doubling the size of the
allocation until it reaches the required size. See how
MakeSharedInvalidMessagesArray() does it.  Doing it this way ensures
we always have a power-of-two-sized array, which is much nicer if we
ever approach the palloc() limit: if the array were sized at the
palloc() limit / 2 + 1, then trying to double it would fail.  Of
course, it's unlikely to be a problem here, but... the question would
be how to decide on the initial size.

4. "at" needs to be shifted left a couple of words

/*
* If the lower bound was already or above at the maximum block
* number, then there is no valid range.
*/

but I don't see how it could be "or above": the ctid type does not
have room for that.  That's not to say you should test (block ==
MaxBlockNumber) instead; the >= seems better for the code. I'm just
complaining about the comment.

5. TidInArrayExprEval() lacks a header comment, or any other comments
mentioning what it does. The function's args also push past the 80-char
line length.  There are also a few other functions in nodeTidscan.c
that are missing a header comment.

6. In MergeTidRanges(), you have:

ItemPointerData a_last = a->last;
ItemPointerData b_last;

if (!ItemPointerIsValid(&a_last))
a_last = a->first;

but I don't see anywhere you're setting ->last to an invalid item
pointer. Is this left over from a previous design of the range scan?
It looks like in TidExprEval() you're setting the upperbound to the
last page on the relation.

7. "fist" -> "first"

* If the tuple is in the fist block of the range and before the first

8. tss_TidRangePtr is a pretty confusingly named field.

if (node->tss_TidRangePtr >= numRanges || node->tss_TidRangePtr < 0)
break;

I'd expect anything with ptr in it to be a pointer, but this seems to
be an array index. Maybe "idx" is better than "ptr", or take note from
nodeAppend.c and have something like "tss_whichRange".

UPDATE: I see you've just altered what's there already. Perhaps it's
okay to leave it as you have it, but it's still not ideal.

9. This comment seems to indicate that a range can only have one
bound, but that does not seem to be the case.

* Ranges with only one item -- including one resulting from a
* CURRENT-OF qual -- are handled by looking up the item directly.

It seems open-bounded ranges just have the lowest or highest possible
ctid value on the open side.

Perhaps the comment could be written as:

/*
 * For ranges containing a single tuple, we can simply make an
 * attempt to fetch the tuple directly.
 */

10. In cost_tidscan() I think you should ceil() the following:

double pages = selectivity * baserel->pages;

Otherwise, you'll end up partially charging a seq_page_cost, which
seems pretty invalid since you can't partially read a page.

11. In the comment:

/* TODO decide what the costs should be */

I think you can just explain why you're charging 1 random_page_cost
and the remainder in seq_page_cost. Or is there something left to do
here that I've forgotten about?

12. expected_comparison_operator is a bit long as a name:

IsTidComparison(OpExpr *node, int varno, Oid expected_comparison_operator)

How about just expected_opno?

13. !rlst -> rlst == NIL

/* if no range qual was found, look for any other TID qual */
if (!rlst)

(Yeah I know there's various cases where it's done incorrectly there
already :-( )

14. This is not great:

#define IsTidEqualClause(node, varno) IsTidComparison(node, varno, TIDEqualOperator)
#define IsTidLTClause(node, varno) IsTidComparison(node, varno, TIDLessOperator)
#define IsTidLEClause(node, varno) IsTidComparison(node, varno, TIDLessEqOperator)
#define IsTidGTClause(node, varno) IsTidComparison(node, varno, TIDGreaterOperator)
#define IsTidGEClause(node, varno) IsTidComparison(node, varno, TIDGreaterEqOperator)

#define IsTidRangeClause(node, varno) (IsTidLTClause(node, varno) || \
IsTidLEClause(node, varno) || \
IsTidGTClause(node, varno) || \
IsTidGEClause(node, varno))

The 4 macros for >, >=, < and <= are only used by IsTidRangeClause()
which means IsTidComparison() could get called up to 4 times. Most of
the work it does would be redundant in that case.  Maybe it's better
to rethink that?
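One possible restructuring, sketched standalone (the OID values and
helper names are placeholders; the real helper would also do the
Var/varno/type checks once, which is the expensive part being avoided):

```c
#include <stdbool.h>

typedef unsigned int Oid;
#define InvalidOid  ((Oid) 0)

/* Placeholder OIDs standing in for the pg_operator entries. */
#define TID_EQ_OP   ((Oid) 1)
#define TID_LT_OP   ((Oid) 2)
#define TID_LE_OP   ((Oid) 3)
#define TID_GT_OP   ((Oid) 4)
#define TID_GE_OP   ((Oid) 5)

/*
 * Do the clause-shape checks once and return the clause's operator OID,
 * or InvalidOid if it is not a ctid comparison at all.  (Stubbed here;
 * the real version would inspect the OpExpr's args and varno.)
 */
static Oid
tid_comparison_opno(Oid clause_opno)
{
    switch (clause_opno)
    {
        case TID_EQ_OP:
        case TID_LT_OP:
        case TID_LE_OP:
        case TID_GT_OP:
        case TID_GE_OP:
            return clause_opno;
        default:
            return InvalidOid;
    }
}

/* Classify a ctid range clause with a single shape check. */
static bool
is_tid_range_clause(Oid clause_opno)
{
    Oid opno = tid_comparison_opno(clause_opno);

    return (opno == TID_LT_OP || opno == TID_LE_OP ||
            opno == TID_GT_OP || opno == TID_GE_OP);
}
```

The point is that the shape validation runs once per clause instead of
up to four times, with the operator classification reduced to cheap OID
comparisons.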

15. There's no field named NumTids:

 * TidRanges evaluated item pointers (array of size NumTids)

0003:

16. I think the following comment needs to be updated:

/* start from last page of the scan */

to:

/* When scanning the whole relation, start from the last page of the scan */

and drop:

/* Scanning the full relation: start just before start block. */

then maybe change:

/* Scanning a restricted range: start at end of range. */

to

/* Otherwise, if scanning just a subset of the relation, start at the
final block in the range */

0004:

17. Can you make a few changes to build_tidscan_pathkeys():

a. build_index_pathkeys() uses ScanDirectionIsBackward(scandir), can
you set the opno based on that rather than doing "direction ==
ForwardScanDirection"
b. varexpr can be an Expr and just be named expr. Please move the
declaration and assignment out onto separate lines and wrap the long
line.
c. wrap long line with the call to build_expression_pathkey(). Get rid
of the (Expr *) cast.

18. I'd expect the following not to produce a sort above the Tid Scan.

postgres=# set enable_seqscan=0;
SET
postgres=# explain select * from t inner join t t1 on t.ctid = t1.ctid
where t.ctid < '(0,10)' ;
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Merge Join  (cost=10000000008.65..10000000009.28 rows=9 width=8)
   Merge Cond: (t.ctid = t1.ctid)
   ->  Sort  (cost=3.33..3.35 rows=9 width=10)
         Sort Key: t.ctid
         ->  Tid Scan on t  (cost=0.00..3.18 rows=9 width=10)
               TID Cond: (ctid < '(0,10)'::tid)
   ->  Sort  (cost=10000000005.32..10000000005.57 rows=100 width=10)
         Sort Key: t1.ctid
         ->  Seq Scan on t t1  (cost=10000000000.00..10000000002.00 rows=100 width=10)
(9 rows)

On looking at why the planner did this, I see it's down to how you've
coded create_tidscan_paths(). You're creating a tidpath if there are
any quals or any pathkeys useful to the query's ORDER BY, but only
including the pathkeys when they're useful for the ORDER BY.  I
think it'll be better to include the forward pathkeys in all cases,
and just make it a backward Tid Scan if backward keys are useful for
the ORDER BY.   There's still a problem with this as a Merge Join
would need a Sort if there was an ORDER BY ctid DESC for one relation
even if the other relation had some valid ctid quals since the 2nd
scan would create a forward Tid Scan.  Maybe that's not worth worrying
about.   The only fix I can imagine is to always create a forward and
backward Tid Scan path, which is pretty bad as it's two more paths
that likely won't get used 99.9% of the time.

This also caused me to notice the costs are pretty broken for this:

postgres=# explain select * from t order by ctid;
                     QUERY PLAN
---------------------------------------------------
 Tid Scan on t  (cost=0.00..0.00 rows=100 width=10)
(1 row)

19. Looks like ScanDirection parameters normally get named "scandir":

static TidScan *make_tidscan(List *qptlist, List *qpqual, Index scanrelid,
List *tidquals, ScanDirection direction);

Likewise in the various .h files where you've added it as a new field
to various structs.

Setting back to waiting on author.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

fixes_for_v4_0001.diff (2K)

Re: Tid scan improvements

Tomas Vondra-4
On 11/22/18 8:41 AM, David Rowley wrote:
 >

> ...
>
> 3. I'd rather see EnsureTidRangeSpace() keep doubling the size of the
> allocation until it reaches the required size. See how
> MakeSharedInvalidMessagesArray() does it.  Doing it this way ensures
> we always have a power of two sized array which is much nicer if we
> ever reach the palloc() limit as if the array is sized at the palloc()
> limit / 2 + 1, then if we try to double it'll fail.  Of course, it's
> unlikely to be a problem here, but... the question would be how to
> decide on the initial size.
>

I think it kinda tries to do that in some cases, by doing this:

     *numAllocRanges *= 2;

     ...

     tidRanges = (TidRange *)
         repalloc(tidRanges,
                  *numAllocRanges * sizeof(TidRange));

The problem here is that what matters is not numAllocRanges being 2^N,
but the number of bytes allocated being 2^N. Because that's what ends up
in AllocSet, which keeps lists of 2^N chunks.

And as TidRange is 12B, this is guaranteed to waste memory, because
no matter what the first factor is, the result will never be 2^N.
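The effect can be shown with a small standalone sketch (next_pow2()
mimics aset.c's rounding of small chunks; requests above the AllocSet
chunk limit get dedicated blocks instead, so this applies to the
smaller sizes):

```c
#include <stddef.h>

/*
 * Round a request up to the next power of 2, as AllocSet does for
 * chunks below its size limit.
 */
static size_t
next_pow2(size_t n)
{
    size_t p = 1;

    while (p < n)
        p <<= 1;
    return p;
}

/* Bytes wasted when allocating n 12-byte TidRanges in one chunk. */
static size_t
wasted_bytes(size_t n)
{
    size_t request = n * 12;

    return next_pow2(request) - request;
}
```

For example, 100 ranges request 1200 bytes but occupy a 2048-byte
chunk, wasting 848 bytes; and since no power of 2 is divisible by 3, a
multiple of 12 can never land exactly on a chunk boundary.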

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Tid scan improvements

Edmund Horner
On Fri, 23 Nov 2018 at 07:03, Tomas Vondra <[hidden email]> wrote:

> On 11/22/18 8:41 AM, David Rowley wrote:
> > ...
> > 3. I'd rather see EnsureTidRangeSpace() keep doubling the size of the
> > allocation until it reaches the required size. See how
> > MakeSharedInvalidMessagesArray() does it.  Doing it this way ensures
> > we always have a power of two sized array which is much nicer if we
> > ever reach the palloc() limit as if the array is sized at the palloc()
> > limit / 2 + 1, then if we try to double it'll fail.  Of course, it's
> > unlikely to be a problem here, but... the question would be how to
> > decide on the initial size.
>
> I think it kinda tries to do that in some cases, by doing this:
>
>      *numAllocRanges *= 2;
>      ...
>      tidRanges = (TidRange *)
>          repalloc(tidRanges,
>                   *numAllocRanges * sizeof(TidRange));
>
> The problem here is that what matters is not numAllocRanges being 2^N,
> but the number of bytes allocated being 2^N. Because that's what ends up
> in AllocSet, which keeps lists of 2^N chunks.
>
> And as TidRange is 12B, so this is guaranteed to waste memory, because
> no matter what the first factor is, the result will never be 2^N.

For simplicity, I think making it a strict doubling of capacity each
time is fine.  That's what we see in numerous other places in the
backend code.

What we don't really see is intentionally setting the initial capacity
so that each subsequent capacity is close-to-but-not-exceeding a power
of 2 bytes.  You can't really do that optimally when working in terms
of whole numbers of items whose size isn't a power of 2.  In one step
there may be 2/3 of an item spare; in the next, a whole item spare
that we're not going to use.  So we could keep track in terms of bytes
allocated, and then figure out how many items fit at the current time.

In my opinion, such complexity is overkill for Tid scans.

Currently, we try to pick an initial size based on the number of
expressions.  We assume each expression will yield one range, and
allow that a saop expression might require us to enlarge the array.

Again, for simplicity, we should scrap that and pick something like
floor(256/sizeof(TidRange)) = 21 items, with about 1.5% wastage.
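As a quick check of that arithmetic (the 12-byte size is taken from the
thread; this is a sketch, not the patch's code):

```c
/* sizeof(TidRange) as discussed in this thread. */
#define TID_RANGE_SIZE 12

/* Largest item count whose byte size fits a 256-byte chunk. */
static int
initial_capacity(void)
{
    return 256 / TID_RANGE_SIZE;    /* integer floor(256 / 12) = 21 */
}

/* Bytes left unused in the 256-byte chunk: 256 - 21*12 = 4 (~1.5%). */
static int
initial_waste(void)
{
    return 256 - initial_capacity() * TID_RANGE_SIZE;
}
```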
