NOT IN subquery optimization

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
59 messages Options
123
Reply | Threaded
Open this post in threaded view
|

NOT IN subquery optimization

Jim Finnerty
The semantics of NOT IN (SELECT ...) are subtly different from the semantics
of NOT EXISTS (SELECT ...).  These differences center on how NULLs are
treated, and in general can result in statements that are harder to optimize
and slower to execute than the apparently similar NOT EXISTS statement.

A little over a year ago, Christian Antognini authored the blog "/How Well a
Query Optimizer Handles Subqueries?/" summarizing his findings about the
performance of PostgreSQL, MySQL, and Oracle on various subqueries:

   
https://antognini.ch/2017/12/how-well-a-query-optimizer-handles-subqueries/

His position was that you can classify the optimizations as correct or
incorrect, and based on that he provided the following comparison summary
(see below).  In short, PostgreSQL was the worst of the three systems:

    "Summary

        The number of queries that the query optimizers handle correctly are
the following:

        Oracle Database 12.2: 72 out of 80
        MySQL 8.0.3: 67 out of 80
        PostgreSQL 10.0: 60 out of 80

    Since not all queries are handled correctly, for best performance it is
sometimes necessary to rewrite them."

The subqueries that were found to be optimized "incorrectly" were almost
entirely due to poor or absent NOT IN subquery optimization.

The PostgreSQL community has been aware of the deficiencies in NOT IN
optimization for quite some time.  Based on an analysis of
psgsql-performance posts between 2013 and 2015, Robert Haas identified NOT
IN optimization as one of the common root causes of performance problems.

We have been working on improved optimization of NOT IN, and we would like
to share this optimizaton with the community.  With respect to the test
cases mentioned in the blog post mentioned above, it will elevate PostgreSQL
from "worst" to "first".  Generally the performance gains are large when the
optimization applies, though we have found one test case where performance
is worse.  We are investigating this now to see if we can disable the
optimization in that case.

We would like to include a patch for this change in the current commitfest.
This thread can be used to track comments about this optimization.




-----
Jim Finnerty, AWS, Amazon Aurora PostgreSQL
--
Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html

Jim Finnerty, AWS, Amazon Aurora PostgreSQL
Reply | Threaded
Open this post in threaded view
|

Re: NOT IN subquery optimization

Tom Lane-2
Jim Finnerty <[hidden email]> writes:
> We have been working on improved optimization of NOT IN, and we would like
> to share this optimizaton with the community.

The idea that's been kicked around in the past is to detect whether the
subselect's output column(s) can be proved NOT NULL, and if so, convert
to an antijoin just like NOT EXISTS.  Is that what you're doing, or
something different?

> We would like to include a patch for this change in the current commitfest.

Generally, people just send patches, they don't ask for permission first
;-)

Having said that, we have a general policy that we don't like complex
patches that first show up for the last commitfest of a dev cycle.
So unless this is a pretty small patch, it's probably going to get
delayed to v13.  Still, we'd like to have it in the queue, so submit
away ...

                        regards, tom lane

Reply | Threaded
Open this post in threaded view
|

Re: NOT IN subquery optimization

Jim Finnerty
re: The idea that's been kicked around in the past is to detect whether the
subselect's output column(s) can be proved NOT NULL, and if so, convert
to an antijoin just like NOT EXISTS

basically, yes.  this will handle nullability of both the outer and inner
correlated expression(s), multiple expressions, presence or absence of
predicates in the WHERE clause, and whether the correlated expressions are
on the null-padded side of an outer join.  If it is judged to be more
efficient, then it transforms the NOT IN sublink into an anti-join.

some complications enter into the decision to transform NOT IN to anti-join
based on whether a bitmap plan will/not be used, or whether it will/not be
eligible for PQ.



-----
Jim Finnerty, AWS, Amazon Aurora PostgreSQL
--
Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html

Jim Finnerty, AWS, Amazon Aurora PostgreSQL
Reply | Threaded
Open this post in threaded view
|

Re: NOT IN subquery optimization

Tom Lane-2
Jim Finnerty <[hidden email]> writes:
> re: The idea that's been kicked around in the past is to detect whether the
> subselect's output column(s) can be proved NOT NULL, and if so, convert
> to an antijoin just like NOT EXISTS

> basically, yes.  this will handle nullability of both the outer and inner
> correlated expression(s), multiple expressions, presence or absence of
> predicates in the WHERE clause, and whether the correlated expressions are
> on the null-padded side of an outer join.  If it is judged to be more
> efficient, then it transforms the NOT IN sublink into an anti-join.

Hmm, that seems overcomplicated ...

> some complications enter into the decision to transform NOT IN to anti-join
> based on whether a bitmap plan will/not be used, or whether it will/not be
> eligible for PQ.

... and that even more so, considering that this decision really needs
to be taken long before cost estimates would be available.

As far as I can see, there should be no situation where we'd not want
to transform to antijoin if we can prove it's semantically valid to
do so.  If there are cases where that comes out as a worse plan,
that indicates a costing error that would be something to address
separately (because it'd also be a problem for other antijoin cases).
Also, as long as it nearly always wins, I'm not going to cry too hard
if there are corner cases where it makes the wrong choice.  That's not
something that's possible to avoid completely.

                        regards, tom lane

Reply | Threaded
Open this post in threaded view
|

Re: NOT IN subquery optimization

Jim Finnerty
We can always correctly transform a NOT IN to a correlated NOT EXISTS.  In
almost all cases it is more efficient to do so.  In the one case that we've
found that is slower it does come down to a more general costing issue, so
that's probably the right way to think about it.



-----
Jim Finnerty, AWS, Amazon Aurora PostgreSQL
--
Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html

Jim Finnerty, AWS, Amazon Aurora PostgreSQL
Reply | Threaded
Open this post in threaded view
|

Re: NOT IN subquery optimization

David Rowley-3
On Thu, 21 Feb 2019 at 16:27, Jim Finnerty <[hidden email]> wrote:
> We can always correctly transform a NOT IN to a correlated NOT EXISTS.  In
> almost all cases it is more efficient to do so.  In the one case that we've
> found that is slower it does come down to a more general costing issue, so
> that's probably the right way to think about it.

I worked on this over 4 years ago [1]. I think the patch there is not
completely broken and seems just to need a few things fixed. I rebased
it on top of current master and looked at it. I think the main
remaining issue is fixing the code that ensures the outer side join
quals can't be NULL.  The code that's there looks broken still since
it attempts to use quals from any inner joined rel for proofs that
NULLs will be removed.  That might not work so well in a case like:
SELECT * FROM t1 LEFT JOIN t2 ON t1.a = t2.a AND t2.b NOT IN(select b
from t3),  however, I'd need to think harder about that since if there
was such a qual then the planner should convert the left join into an
inner join. But anyway, the function expressions_are_not_nullable()
was more intended to work with targetlists to ensure exprs there can't
be NULL. I just had done a poor job of trying to modify that into
allowing it to take exprs from any random place, likely that should be
a new function and expressions_are_not_nullable() should be put back
to what Tom ended up with.

I've attached the rebased and still broken version.

[1] https://www.postgresql.org/message-id/CAApHDvqRB-iFBy68%3DdCgqS46aRep7AuN2pou4KTwL8kX9YOcTQ%40mail.gmail.com

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

not_in_anti_join_v1.0.patch (51K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: NOT IN subquery optimization

David Rowley-3
On Fri, 22 Feb 2019 at 09:44, David Rowley <[hidden email]> wrote:
> I've attached the rebased and still broken version.

I set about trying to make a less broken version of this.

A quick reminder of the semantics of NOT IN:

1. WHERE <nullable_column> NOT IN(SELECT <not null column> FROM table);

If table is non-empty:
will filter out rows where <nullable_column> is NULL
and only show values that are not in <not null column>

If table is empty:
Filters nothing.

2. WHERE <nonnullable_column> NOT IN(SELECT <null column> FROM table);

If table contains NULLs in the <null column> no records will match.

The previous patch handled #2 correctly but neglected to do anything
about #1.  For #1 the only way we can implement this as a planner only
change is to insist that the outer side expressions also are not null.
If we could have somehow tested if "table" was non-empty then we could
have added a IS NOT NULL clause to the outer query and converted to an
anti-join, but ... can't know that during planning and can't add the
IS NOT NULL regardless as, if "table" is empty we will filter NULLs
when we shouldn't.

In the attached, I set about fixing #1 by determining if the outer
expressions could be NULL by checking

1. If expression is a Var from an inner joined relation it can't be
NULL if there's a NOT NULL constraint on the column; or
2. If expression is a Var from an inner joined relation and there is a
strict WHERE/ON clause, the expression can't be NULL; or
3. If expression is a Var from an outer joined relation check for
quals that were specified in the same syntactical level as the NOT IN
for proofs that NULL will be filtered.

An example of #3 is:

SELECT * FROM t1 LEFT JOIN t2 on t1.a = t2.a WHERE t2.a IS NOT NULL
AND t2.a NOT IN(SELECT a FROM t3); -- t2 becomes INNER JOINed later in
planning, but...
or;
SELECT * FROM t1 LEFT JOIN t2 on t1.a = t2.a AND t2.a NOT IN(SELECT a FROM t3);

In the latter of the two, the t1.a = t2.a join conditions ensures that
NULLs can't exist where the NOT IN is evaluated.

I implemented #3 by passing the quals down to
pull_up_sublinks_qual_recurse().  At the top level call 'node' and
'notnull_proofs' are the same, but that changes in recursive calls
like the one we make inside the is_andclause() condition.

Comments welcome.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

not_in_anti_join_v1.1.patch (64K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: NOT IN subquery optimization

zhengli
I'm attaching a working patch following the discussion.

You can also find the following patch description is the commit message:
NOT IN to ANTI JOIN transformation

    The semantics of ANTI JOIN were created to match the semantics of NOT
    EXISTS, which enables NOT EXISTS subqueries to be efficiently executed
    as a kind of join.  NOT IN subqueries have different NULL semantics than
    NOT EXISTS, but since there is no special join operator for NOT IN it is
    generally executed as a nested sub-plan.  It is possible, however, to
    transform NOT IN to a correlated NOT EXISTS so that it can be executed
    it as an ANTI JOIN with additional correlated predicates.

    A general transformation from NOT IN to NOT EXISTS for the
    single-expression case (the multi-expression case is just ANDs of the
    single-expressions) is:
        t1.x NOT IN (SELECT t2.y from t2 where p) <=> NOT EXISTS (select 1
                from t2 where (y=x or y is NULL or x is NULL) and p).

    If x or y is non-nullable, we can safely remove the predicate "x
    is NULL" or "y is NULL", and if there is no predicate p,
    then "x is NULL" may be factored out of the subquery.
    Experiments show that if we can remove one or the other ORed
    predicates, or if we can factor out the "x is NULL", then
    execution is typically much faster.

    Basically, for the single expression case (we also handle the multi
    expression case), we try to do the following transformation:
    When p does not exist:
        t1.x not in (t2.y) => ANTI JOIN t1(Filter: x is not null), t2 on
        join condition: t1.x=t2.y or t2.y is null.
    When x is non-nullable:
        t1.x not in (t2.y where p) => ANTI JOIN t1, t2 on join condition:
        (t1.x=t2.y or t2.y is null) and p.

    We implemented a nullability test routine is_node_nonnullable().
    Currently it handles Var, TargetEntry, CoalesceExpr and Const. Outer
    joins are taken into consideration in the nullability test.

    We adjust and apply reduce_outer_joins() before the transformation so
    that the outer joins have an opportunity to be converted to inner joins
    prior to the transformation.

    Using this transformation, we measured performance improvements of
    two to three orders of magnitude on most queries in a development
    environment. In our performance experiments, table s (small) has 11
    rows, table l (large) has 1 million rows. s.n and l.n have NULL
    values. s.nn and l.nn are NOT NULL.

        s.n not in (l.n)          1150 ms -> 0.49 ms
        s.nn not in (l.nn)      1120 ms -> 0.45 ms
        l.n not in (l.n)   over   20 min -> 1700 ms
        l.nn not in (l.nn) over 20 min -> 220 ms
        l.n not in (s.n)               63 ms -> 750 ms
        l.nn not in (s.nn)          58 ms -> 46 ms

    For the only case that performance drops - l.n not in (s.n). It is
    likely to be resolved by ending the nested loop anti join early as
    soon as we find NULL inner tuple entry/entries that satisfies the
    join condition during execution. This is still under investigation.

Comments are welcome.

With Regards,
---
Zheng Li, AWS, Amazon Aurora PostgreSQL

On 2/25/19, 7:32 AM, "David Rowley" <[hidden email]> wrote:

    On Fri, 22 Feb 2019 at 09:44, David Rowley <[hidden email]> wrote:
    > I've attached the rebased and still broken version.
   
    I set about trying to make a less broken version of this.
   
    A quick reminder of the semantics of NOT IN:
   
    1. WHERE <nullable_column> NOT IN(SELECT <not null column> FROM table);
   
    If table is non-empty:
    will filter out rows where <nullable_column> is NULL
    and only show values that are not in <not null column>
   
    If table is empty:
    Filters nothing.
   
    2. WHERE <nonnullable_column> NOT IN(SELECT <null column> FROM table);
   
    If table contains NULLs in the <null column> no records will match.
   
    The previous patch handled #2 correctly but neglected to do anything
    about #1.  For #1 the only way we can implement this as a planner only
    change is to insist that the outer side expressions also are not null.
    If we could have somehow tested if "table" was non-empty then we could
    have added a IS NOT NULL clause to the outer query and converted to an
    anti-join, but ... can't know that during planning and can't add the
    IS NOT NULL regardless as, if "table" is empty we will filter NULLs
    when we shouldn't.
   
    In the attached, I set about fixing #1 by determining if the outer
    expressions could be NULL by checking
   
    1. If expression is a Var from an inner joined relation it can't be
    NULL if there's a NOT NULL constraint on the column; or
    2. If expression is a Var from an inner joined relation and there is a
    strict WHERE/ON clause, the expression can't be NULL; or
    3. If expression is a Var from an outer joined relation check for
    quals that were specified in the same syntactical level as the NOT IN
    for proofs that NULL will be filtered.
   
    An example of #3 is:
   
    SELECT * FROM t1 LEFT JOIN t2 on t1.a = t2.a WHERE t2.a IS NOT NULL
    AND t2.a NOT IN(SELECT a FROM t3); -- t2 becomes INNER JOINed later in
    planning, but...
    or;
    SELECT * FROM t1 LEFT JOIN t2 on t1.a = t2.a AND t2.a NOT IN(SELECT a FROM t3);
   
    In the latter of the two, the t1.a = t2.a join conditions ensures that
    NULLs can't exist where the NOT IN is evaluated.
   
    I implemented #3 by passing the quals down to
    pull_up_sublinks_qual_recurse().  At the top level call 'node' and
    'notnull_proofs' are the same, but that changes in recursive calls
    like the one we make inside the is_andclause() condition.
   
    Comments welcome.
   
    --
     David Rowley                   http://www.2ndQuadrant.com/
     PostgreSQL Development, 24x7 Support, Training & Services
   


not_in_anti_join_transformation.patch (120K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: NOT IN subquery optimization

zhengli
Resend the patch with a whitespace removed so that "git apply patch" works directly.

---
Zheng Li, AWS, Amazon Aurora PostgreSQL

On 2/25/19, 12:39 PM, "Li, Zheng" <[hidden email]> wrote:

    I'm attaching a working patch following the discussion.
   
    You can also find the following patch description is the commit message:
    NOT IN to ANTI JOIN transformation
   
        The semantics of ANTI JOIN were created to match the semantics of NOT
        EXISTS, which enables NOT EXISTS subqueries to be efficiently executed
        as a kind of join.  NOT IN subqueries have different NULL semantics than
        NOT EXISTS, but since there is no special join operator for NOT IN it is
        generally executed as a nested sub-plan.  It is possible, however, to
        transform NOT IN to a correlated NOT EXISTS so that it can be executed
        it as an ANTI JOIN with additional correlated predicates.
   
        A general transformation from NOT IN to NOT EXISTS for the
        single-expression case (the multi-expression case is just ANDs of the
        single-expressions) is:
            t1.x NOT IN (SELECT t2.y from t2 where p) <=> NOT EXISTS (select 1
                    from t2 where (y=x or y is NULL or x is NULL) and p).
   
        If x or y is non-nullable, we can safely remove the predicate "x
        is NULL" or "y is NULL", and if there is no predicate p,
        then "x is NULL" may be factored out of the subquery.
        Experiments show that if we can remove one or the other ORed
        predicates, or if we can factor out the "x is NULL", then
        execution is typically much faster.
   
        Basically, for the single expression case (we also handle the multi
        expression case), we try to do the following transformation:
        When p does not exist:
            t1.x not in (t2.y) => ANTI JOIN t1(Filter: x is not null), t2 on
            join condition: t1.x=t2.y or t2.y is null.
        When x is non-nullable:
            t1.x not in (t2.y where p) => ANTI JOIN t1, t2 on join condition:
            (t1.x=t2.y or t2.y is null) and p.
   
        We implemented a nullability test routine is_node_nonnullable().
        Currently it handles Var, TargetEntry, CoalesceExpr and Const. Outer
        joins are taken into consideration in the nullability test.
   
        We adjust and apply reduce_outer_joins() before the transformation so
        that the outer joins have an opportunity to be converted to inner joins
        prior to the transformation.
   
        Using this transformation, we measured performance improvements of
        two to three orders of magnitude on most queries in a development
        environment. In our performance experiments, table s (small) has 11
        rows, table l (large) has 1 million rows. s.n and l.n have NULL
        values. s.nn and l.nn are NOT NULL.
   
            s.n not in (l.n)          1150 ms -> 0.49 ms
            s.nn not in (l.nn)      1120 ms -> 0.45 ms
            l.n not in (l.n)   over   20 min -> 1700 ms
            l.nn not in (l.nn) over 20 min -> 220 ms
            l.n not in (s.n)               63 ms -> 750 ms
            l.nn not in (s.nn)          58 ms -> 46 ms
   
        For the only case that performance drops - l.n not in (s.n). It is
        likely to be resolved by ending the nested loop anti join early as
        soon as we find NULL inner tuple entry/entries that satisfies the
        join condition during execution. This is still under investigation.
   
    Comments are welcome.
   
    With Regards,
    ---
    Zheng Li, AWS, Amazon Aurora PostgreSQL
   
    On 2/25/19, 7:32 AM, "David Rowley" <[hidden email]> wrote:
   
        On Fri, 22 Feb 2019 at 09:44, David Rowley <[hidden email]> wrote:
        > I've attached the rebased and still broken version.
       
        I set about trying to make a less broken version of this.
       
        A quick reminder of the semantics of NOT IN:
       
        1. WHERE <nullable_column> NOT IN(SELECT <not null column> FROM table);
       
        If table is non-empty:
        will filter out rows where <nullable_column> is NULL
        and only show values that are not in <not null column>
       
        If table is empty:
        Filters nothing.
       
        2. WHERE <nonnullable_column> NOT IN(SELECT <null column> FROM table);
       
        If table contains NULLs in the <null column> no records will match.
       
        The previous patch handled #2 correctly but neglected to do anything
        about #1.  For #1 the only way we can implement this as a planner only
        change is to insist that the outer side expressions also are not null.
        If we could have somehow tested if "table" was non-empty then we could
        have added a IS NOT NULL clause to the outer query and converted to an
        anti-join, but ... can't know that during planning and can't add the
        IS NOT NULL regardless as, if "table" is empty we will filter NULLs
        when we shouldn't.
       
        In the attached, I set about fixing #1 by determining if the outer
        expressions could be NULL by checking
       
        1. If expression is a Var from an inner joined relation it can't be
        NULL if there's a NOT NULL constraint on the column; or
        2. If expression is a Var from an inner joined relation and there is a
        strict WHERE/ON clause, the expression can't be NULL; or
        3. If expression is a Var from an outer joined relation check for
        quals that were specified in the same syntactical level as the NOT IN
        for proofs that NULL will be filtered.
       
        An example of #3 is:
       
        SELECT * FROM t1 LEFT JOIN t2 on t1.a = t2.a WHERE t2.a IS NOT NULL
        AND t2.a NOT IN(SELECT a FROM t3); -- t2 becomes INNER JOINed later in
        planning, but...
        or;
        SELECT * FROM t1 LEFT JOIN t2 on t1.a = t2.a AND t2.a NOT IN(SELECT a FROM t3);
       
        In the latter of the two, the t1.a = t2.a join conditions ensures that
        NULLs can't exist where the NOT IN is evaluated.
       
        I implemented #3 by passing the quals down to
        pull_up_sublinks_qual_recurse().  At the top level call 'node' and
        'notnull_proofs' are the same, but that changes in recursive calls
        like the one we make inside the is_andclause() condition.
       
        Comments welcome.
       
        --
         David Rowley                   http://www.2ndQuadrant.com/
         PostgreSQL Development, 24x7 Support, Training & Services
       
   
   


not_in_anti_join_transformation_v1.patch (120K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: NOT IN subquery optimization

David Rowley-3
On Tue, 26 Feb 2019 at 11:51, Li, Zheng <[hidden email]> wrote:
> Resend the patch with a whitespace removed so that "git apply patch" works directly.

I had a quick look at this and it seems to be broken for the empty
table case I mentioned up thread.

Quick example:

Setup:

create table t1 (a int);
create table t2 (a int not null);
insert into t1 values(NULL),(1),(2);

select * from t1 where a not in(select a from t2);

Patched:
 a
---
 1
 2
(2 rows)

Master:
 a
---

 1
 2
(3 rows)

This will be due to the fact you're adding an a IS NOT NULL qual to
the scan of a:

postgres=# explain select * from t1 where a not in(select a from t2);
                            QUERY PLAN
------------------------------------------------------------------
 Hash Anti Join  (cost=67.38..152.18 rows=1268 width=4)
   Hash Cond: (t1.a = t2.a)
   ->  Seq Scan on t1  (cost=0.00..35.50 rows=2537 width=4)
         Filter: (a IS NOT NULL)
   ->  Hash  (cost=35.50..35.50 rows=2550 width=4)
         ->  Seq Scan on t2  (cost=0.00..35.50 rows=2550 width=4)
(6 rows)

but as I mentioned, you can't do that as t2 might be empty and there's
no way to know that during planning.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: NOT IN subquery optimization

Richard Guo-2
Greenplum Database does this optimization. The idea is to use a new join
type, let's call it JOIN_LASJ_NOTIN, and its semantic regarding NULL is
defined as below:

1. If there is a NULL in the outer side, and the inner side is empty, the
   NULL should be part of the outputs.

2. If there is a NULL in the outer side, and the inner side is not empty,
   the NULL should not be part of the outputs.

3. If there is a NULL in the inner side, no outputs should be produced.

An example plan looks like:

gpadmin=# explain (costs off)  select * from t1 where a not in(select a from t2);
            QUERY PLAN
-----------------------------------
 Hash Left Anti Semi (Not-In) Join
   Hash Cond: (t1.a = t2.a)
   ->  Seq Scan on t1
   ->  Hash
         ->  Seq Scan on t2
(5 rows)

Thanks
Richard

On Tue, Feb 26, 2019 at 7:20 AM David Rowley <[hidden email]> wrote:
On Tue, 26 Feb 2019 at 11:51, Li, Zheng <[hidden email]> wrote:
> Resend the patch with a whitespace removed so that "git apply patch" works directly.

I had a quick look at this and it seems to be broken for the empty
table case I mentioned up thread.

Quick example:

Setup:

create table t1 (a int);
create table t2 (a int not null);
insert into t1 values(NULL),(1),(2);

select * from t1 where a not in(select a from t2);

Patched:
 a
---
 1
 2
(2 rows)

Master:
 a
---

 1
 2
(3 rows)

This will be due to the fact you're adding an a IS NOT NULL qual to
the scan of a:

postgres=# explain select * from t1 where a not in(select a from t2);
                            QUERY PLAN
------------------------------------------------------------------
 Hash Anti Join  (cost=67.38..152.18 rows=1268 width=4)
   Hash Cond: (t1.a = t2.a)
   ->  Seq Scan on t1  (cost=0.00..35.50 rows=2537 width=4)
         Filter: (a IS NOT NULL)
   ->  Hash  (cost=35.50..35.50 rows=2550 width=4)
         ->  Seq Scan on t2  (cost=0.00..35.50 rows=2550 width=4)
(6 rows)

but as I mentioned, you can't do that as t2 might be empty and there's
no way to know that during planning.

--
 David Rowley                   https://urldefense.proofpoint.com/v2/url?u=http-3A__www.2ndQuadrant.com_&d=DwIBaQ&c=lnl9vOaLMzsy2niBC8-h_K-7QJuNJEsFrzdndhuJ3Sw&r=5r3cnfZPUDOHrMiXq8Mq2g&m=dE1nglE17x3nD-oH_BrF0r4SLaFnQKzwwJBJGpDoaaA&s=dshupMomMvkDAd92918cU21AJ1E1s7QwbrxIGSRxZA8&e=
 PostgreSQL Development, 24x7 Support, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: NOT IN subquery optimization

Jim Finnerty
In reply to this post by David Rowley-3
The problem is that the special optimization that was proposed for the case
where the subquery has no WHERE clause isn't valid when the subquery returns
no rows.  That additional optimization needs to be removed, and preferably
replaced with some sort of efficient run-time test.





-----
Jim Finnerty, AWS, Amazon Aurora PostgreSQL
--
Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html

Jim Finnerty, AWS, Amazon Aurora PostgreSQL
Reply | Threaded
Open this post in threaded view
|

Re: NOT IN subquery optimization

David Rowley-3
On Wed, 27 Feb 2019 at 03:07, Jim Finnerty <[hidden email]> wrote:
>
> The problem is that the special optimization that was proposed for the case
> where the subquery has no WHERE clause isn't valid when the subquery returns
> no rows.  That additional optimization needs to be removed, and preferably
> replaced with some sort of efficient run-time test.

That is one option, however, the join type that Richard mentions, to
satisfy point #3, surely only can work for Hash joins and perhaps
Merge joins that required a sort, assuming there's some way for the
sort to communicate about if it found NULLs or not. Either way, we
need to have looked at the entire inner side to ensure there are no
nulls. Probably it would be possible to somehow team that up with a
planner check to see if the inner exprs could be NULL then just
implement points #1 and #2 for other join methods.

If you're proposing to do that for this thread then I can take my
planner only patch somewhere else.  I only posted my patch as I pretty
much already had what I thought you were originally talking about.
However, please be aware there are current patents around adding
execution time smarts in this area, so it's probably unlikely you'll
find a way to do this in the executor that does not infringe on those.
Probably no committer would want to touch it.  I think my patch covers
a good number of use cases and as far as I understand, does not go
near any current patents.

Please let me know if I should move my patch to another thread. I
don't want to hi-jack this if you're going in another direction.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: NOT IN subquery optimization

zhengli
I agree we will need some runtime smarts (such as a new anti join type as pointed out by Richard) to "ultimately" account for all the cases of NOT IN queries.

However, given that the March CommitFest is imminent and the runtime smarts patent concerns David had pointed out (which I was not aware of before), we would not move that direction at the moment.

I propose that we collaborate to build one patch from the two patches submitted in this thread for the CF. The two patches are for the same purpose and similar. However, they differ in the following ways as far as I can tell:

Nullability Test:
-David's patch uses strict predicates for nullability test.
-Our patch doesn't use strict predicates, but it accounts for COALESCE and null-padded rows from outer join. In addition, we made reduce_outer_joins() work before the transformation which makes the nullability test more accurate.

Anti Join Transformation:
-Dvaid's patch does the transformation when both inner and outer outputs are non-nullable.
-With the latest fix (for the empty table case), our patch does the transformation as long as the outer is non-nullable regardless of the inner nullability, experiments show that the results are always faster.

David, please let me know what you think. If you would like to collaborate, I'll start merging with your code on using strict predicates to make a better Nullability Test.

Thanks,
Zheng

Reply | Threaded
Open this post in threaded view
|

Re: NOT IN subquery optimization

Tom Lane-2
"Li, Zheng" <[hidden email]> writes:
> However, given that the March CommitFest is imminent and the runtime smarts patent concerns David had pointed out (which I was not aware of before), we would not move that direction at the moment.

> I propose that we collaborate to build one patch from the two patches submitted in this thread for the CF.

TBH, I think it's very unlikely that any patch for this will be seriously
considered for commit in v12.  It would be against project policy, and
spending a lot of time reviewing the patch would be quite unfair to other
patches that have been in the queue longer.  Therefore, I'd suggest that
you not bend things out of shape just to have some patch to submit before
March 1.  It'd be better to work with the goal of having a coherent patch
ready for the first v13 CF, probably July-ish.

                        regards, tom lane

Reply | Threaded
Open this post in threaded view
|

Re: NOT IN subquery optimization

David Rowley-3
In reply to this post by zhengli
On Wed, 27 Feb 2019 at 13:05, Li, Zheng <[hidden email]> wrote:
> -With the latest fix (for the empty table case), our patch does the transformation as long as the outer is non-nullable regardless of the inner nullability, experiments show that the results are always faster.

Hi Zheng,

I'm interested to know how this works without testing for inner
nullability.  If any of the inner side's join exprs are NULL then no
records can match. What do you propose to work around that?

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: NOT IN subquery optimization

David Rowley-3
In reply to this post by Tom Lane-2
On Wed, 27 Feb 2019 at 13:13, Tom Lane <[hidden email]> wrote:

>
> "Li, Zheng" <[hidden email]> writes:
> > However, given that the March CommitFest is imminent and the runtime smarts patent concerns David had pointed out (which I was not aware of before), we would not move that direction at the moment.
>
> > I propose that we collaborate to build one patch from the two patches submitted in this thread for the CF.
>
> TBH, I think it's very unlikely that any patch for this will be seriously
> considered for commit in v12.  It would be against project policy, and
> spending a lot of time reviewing the patch would be quite unfair to other
> patches that have been in the queue longer.  Therefore, I'd suggest that
> you not bend things out of shape just to have some patch to submit before
> March 1.  It'd be better to work with the goal of having a coherent patch
> ready for the first v13 CF, probably July-ish.

FWIW, I did add this to the March CF, but I set the target version to
13.  I wasn't considering this for PG12. I see Zheng was, but I agree
with you on PG13 being the target for this.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: NOT IN subquery optimization

zhengli
I'm totally fine with setting the target to PG13.

--
I'm interested to know how this works without testing for inner
nullability.  If any of the inner side's join exprs are NULL then no
records can match. What do you propose to work around that?
--

We still check for inner side's nullability, when it is nullable we
append a "var is NULL" to the anti join condition. So every outer
tuple is going to evaluate to true on the join condition when there
is indeed a null entry in the inner.
Actually I think the nested loop anti join can end early in this case,
but I haven't find a way to do it properly, this may be one other reason
why we need a new join type for NOT IN.

e.g.
explain select count(*) from s where u not in (select n from l);
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Aggregate  (cost=2892.88..2892.89 rows=1 width=8)
   ->  Nested Loop Anti Join  (cost=258.87..2892.88 rows=1 width=0)
         ->  Seq Scan on s  (cost=0.00..1.11 rows=11 width=4)
         ->  Bitmap Heap Scan on l  (cost=258.87..262.88 rows=1 width=4)
               Recheck Cond: ((s.u = n) OR (n IS NULL))
               ->  BitmapOr  (cost=258.87..258.87 rows=1 width=0)
                     ->  Bitmap Index Scan on l_n  (cost=0.00..4.43 rows=1 width=0)
                           Index Cond: (s.u = n)
                     ->  Bitmap Index Scan on l_n  (cost=0.00..4.43 rows=1 width=0)
                           Index Cond: (n IS NULL)

Zheng

Reply | Threaded
Open this post in threaded view
|

Re: NOT IN subquery optimization

David Rowley-3
On Wed, 27 Feb 2019 at 13:41, Li, Zheng <[hidden email]> wrote:
> We still check for inner side's nullability, when it is nullable we
> append a "var is NULL" to the anti join condition. So every outer
> tuple is going to evaluate to true on the join condition when there
> is indeed a null entry in the inner.

That's possible, at least providing the var is NULL is an OR
condition, but the problem there is that you force the plan into a
nested loop join. That's unfortunately not going to perform very well
when the number of rows to process is large.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: NOT IN subquery optimization

Richard Guo-2
In reply to this post by David Rowley-3

On Wed, Feb 27, 2019 at 4:52 AM David Rowley <[hidden email]> wrote:
On Wed, 27 Feb 2019 at 03:07, Jim Finnerty <[hidden email]> wrote:

If you're proposing to do that for this thread then I can take my
planner only patch somewhere else.  I only posted my patch as I pretty
much already had what I thought you were originally talking about.
However, please be aware there are current patents around adding
execution time smarts in this area, so it's probably unlikely you'll
find a way to do this in the executor that does not infringe on those.
Probably no committer would want to touch it.  I think my patch covers
a good number of use cases and as far as I understand, does not go
near any current patents.

Thanks for pointing out the patent concerns. I was not aware of that before.
Could you please provide some clue where I can find more info about the patents?

Thanks
Richard 
123