Discussion on missing optimizations


Discussion on missing optimizations

Adam Brusselback
Hopefully it's alright for me to post this here, please let me know if not.

I ran across an article on blog.jooq.org comparing all the major RDBMSs on their ability to optimize away unnecessary work in less-than-optimal queries, and saw some discussion on Hacker News and Reddit, but I hadn't seen any discussion here.

The article in question is here:
https://blog.jooq.org/2017/09/28/10-cool-sql-optimisations-that-do-not-depend-on-the-cost-model/

Commercial databases seem to have a serious leg up in this area, and there are even some of these optimizations that MySQL has and Postgres doesn't.

I was wondering which of these are planned, which have had discussion before and decided not to support, and which just haven't been thought of?

I thought I'd bring it up, and hopefully others who are more knowledgeable can chime in.


Re: Discussion on missing optimizations

Andres Freund
Hi,

On 2017-10-06 21:33:16 -0400, Adam Brusselback wrote:
> Hopefully it's alright for me to post this here, please let me know if not.

> I ran across an article on blog.jooq.org comparing all the major RDBMS'
> with their ability to optimize away unnecessary work with queries which are
> less than optimal, and saw some discussion on hackernews and reddit, but I
> hadn't seen any discussion here.
>
> The article in question is here:
> https://blog.jooq.org/2017/09/28/10-cool-sql-optimisations-that-do-not-depend-on-the-cost-model/

That's interesting.


> Commercial databases seem to have a serious leg up in this area, and there
> are even some that MySQL has that Postgres doesn't.
>
> I was wondering which of these are planned, which have had discussion
> before and decided not to support, and which just haven't been thought of?

Hm, going through the ones we don't [fully] support:

3. JOIN Elimination

There's been a lot of discussion and several patches. There are a bunch of
problems here, one being that there are cases (during trigger firing,
before the constraint checks) where foreign keys don't hold true, so we
can't quite generally make these optimizations.  Another concern is
whether the added plan time is actually worthwhile.


4. Removing “Silly” Predicates

(i.e. stuff like column = column)

This has been deemed not to be a valuable use of plan time. I'm doubtful
about that argument, but IIRC that was the consensus the last time this
came up.


6. Predicate Merging

(i.e. combining a IN (a,b,c) and a IN (c,d,f) into a IN (c), similar
with OR)

I can't remember proposals about adding this, but it seems worthwhile to
consider. I think we should be able to check for this without a lot of
planner overhead.
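To make the idea concrete, here is a minimal sketch (plain Python, purely illustrative, not planner code) of merging two IN lists on the same column by intersecting their constant lists:

```python
def merge_in_lists(list1, list2):
    """Intersect two IN-list constant sets, keeping the order of the first."""
    seen = set(list2)
    return [v for v in list1 if v in seen]

# "a IN ('a','b','c') AND a IN ('c','d','f')" collapses to "a IN ('c')"
print(merge_in_lists(['a', 'b', 'c'], ['c', 'd', 'f']))
```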


7. Provably Empty Sets
8. CHECK Constraints

I think some of this should actually work with constraint exclusion
turned on.


9. Unneeded Self JOIN

Can't remember discussions of this.


Greetings,

Andres Freund


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: Discussion on missing optimizations

Tom Lane
Andres Freund <[hidden email]> writes:
> On 2017-10-06 21:33:16 -0400, Adam Brusselback wrote:
>> The article in question is here:
>> https://blog.jooq.org/2017/09/28/10-cool-sql-optimisations-that-do-not-depend-on-the-cost-model/

> That's interesting.

The impression I have in a quick scan is that probably hardly any of these
are cases that any of the DB designers think are important in themselves.
Rather, they fall out of more general optimization attempts, or not,
depending on the optimization mechanisms in use in a particular DB.
For example, reducing "WHERE 1=1" to "WHERE TRUE" and then to nothing
comes out of a constant-subexpression-precalculation mechanism for us,
whereas "WHERE column=column" doesn't fall to that approach.  ISTM it
would be really dumb to expend planner cycles looking specifically for
that case, so I guess that DB2 et al are finding it as a side-effect of
some more general optimization ... I wonder what that is?

(edit: a few minutes later, I seem to remember that equivclass.c has
to do something special with the X=X case, so maybe it could do
something else special instead, with little new overhead.)

> (i.e. combining a IN (a,b,c) and a IN (c,d,f) into a IN (c), similar
> with OR)

> I can't remember proposals about adding this, but it seems worthwhile to
> consider. I think we should be able to check for this without a lot of
> planner overhead.

It would be enormously expensive to check that in the general case with
a bunch of entries in each IN list.  Perhaps it would be OK to add on
the presumption that few queries would contain multiple IN's on the same
column in the first place, though.  Not sure.

> 9. Unneeded Self JOIN

> Can't remember discussions of this.

I can't get very excited about that one either.

In the end, what the article fails to consider is that all of these are
tradeoffs, not unalloyed goods.  If you spend planner cycles on every
query to look for cases that only the most unabashedly brain-dead ORMs
ever generate, you're not really doing your users a favor on balance.

                        regards, tom lane



Re: Discussion on missing optimizations

Andres Freund
On 2017-10-06 22:19:54 -0400, Tom Lane wrote:
> The impression I have in a quick scan is that probably hardly any of these
> are cases that any of the DB designers think are important in
> themselves.

> In the end, what the article fails to consider is that all of these are
> tradeoffs, not unalloyed goods.  If you spend planner cycles on every
> query to look for cases that only the most unabashedly brain-dead ORMs
> ever generate, you're not really doing your users a favor on balance.

Partially agreed. A comment on the article also mentions that some other
database performs more optimizations depending on the cost of the
plan. That's not easy to do in our current plan structure, but I think
it's quite a worthwhile concept.


> Rather, they fall out of more general optimization attempts, or not,
> depending on the optimization mechanisms in use in a particular DB.
> For example, reducing "WHERE 1=1" to "WHERE TRUE" and then to nothing
> comes out of a constant-subexpression-precalculation mechanism for us,
> whereas "WHERE column=column" doesn't fall to that approach.  ISTM it
> would be really dumb to expend planner cycles looking specifically for
> that case, so I guess that DB2 et al are finding it as a side-effect of
> some more general optimization ... I wonder what that is?
>
> (edit: a few minutes later, I seem to remember that equivclass.c has
> to do something special with the X=X case, so maybe it could do
> something else special instead, with little new overhead.)

Yea, I think this should be inferrable from information we essentially
already compute for equivclasses.


> > (i.e. combining a IN (a,b,c) and a IN (c,d,f) into a IN (c), similar
> > with OR)
>
> > I can't remember proposals about adding this, but it seems worthwhile to
> > consider. I think we should be able to check for this without a lot of
> > planner overhead.
>
> It would be enormously expensive to check that in the general case with
> a bunch of entries in each IN list.  Perhaps it would be OK to add on
> the presumption that few queries would contain multiple IN's on the same
> column in the first place, though.  Not sure.

Merging of ORs should be near free if you leave duplicates in there, and
the duplicates aren't a huge concern if your alternative is to have
them, but also have two clauses to evaluate.

I think the IN AND IN case should usually end up a clear win as
well. Both lists are going to be evaluated for each row anyway - you
don't need a lot of rows where clauses are evaluated to make it worth
the O(n log n) of sorting the lists.  Sorting them would be beneficial
for other reasons as well, e.g. it improves access patterns for
ScalarArrayOp (SAO) index scans.
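A rough sketch of the two merges described above (illustrative Python, not planner internals, under the stated assumptions that duplicates are tolerable for OR and that sorting pays for itself):

```python
def merge_or(list1, list2):
    # OR-merge: simply concatenate, tolerating duplicates -- one combined
    # IN list still beats evaluating two separate clauses per row.
    # Sorting is the O(n log n) extra that also helps ordered index access.
    return sorted(list1 + list2)

def merge_and(list1, list2):
    # AND-merge: only constants present in both lists can match.
    return sorted(set(list1) & set(list2))

print(merge_or([3, 1], [1, 5]))        # duplicates left in: [1, 1, 3, 5]
print(merge_and([1, 2, 3], [3, 4, 1])) # intersection: [1, 3]
```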


Greetings,

Andres Freund



Re: Discussion on missing optimizations

Petr Jelinek
In reply to this post by Tom Lane
On 07/10/17 04:19, Tom Lane wrote:

> Andres Freund <[hidden email]> writes:
>> On 2017-10-06 21:33:16 -0400, Adam Brusselback wrote:
>>> The article in question is here:
>>> https://blog.jooq.org/2017/09/28/10-cool-sql-optimisations-that-do-not-depend-on-the-cost-model/
>
>> That's interesting.
>
> The impression I have in a quick scan is that probably hardly any of these
> are cases that any of the DB designers think are important in themselves.
> Rather, they fall out of more general optimization attempts, or not,
> depending on the optimization mechanisms in use in a particular DB.
> For example, reducing "WHERE 1=1" to "WHERE TRUE" and then to nothing
> comes out of a constant-subexpression-precalculation mechanism for us,
> whereas "WHERE column=column" doesn't fall to that approach.  ISTM it
> would be really dumb to expend planner cycles looking specifically for
> that case, so I guess that DB2 et al are finding it as a side-effect of
> some more general optimization ... I wonder what that is?
>
> (edit: a few minutes later, I seem to remember that equivclass.c has
> to do something special with the X=X case, so maybe it could do
> something else special instead, with little new overhead.)
>
What it actually does is specifically skip the processing for X=X
(the const expression will be simplified separately by
estimate_expression_value/eval_const_expressions). There is a
comment there that specifically states it's not worth processing
this, as it's a rare clause that is equivalent to X IS NOT NULL.

I don't actually agree with the argument in that comment: in practice,
if the "silly" equality is not there, we only waste an equal() call,
and if it is there, the optimization seems worth it, as it will lead
to orders-of-magnitude better estimates in many cases.

So I wrote a prototype achieving this optimization, and it seems to be
really quite simple code-wise (see attached). I did only limited
manual testing, but I don't see any negative impact on planning time.

Thoughts?

--
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



0001-Transform-X-X-expressions-into-X-IS-NOT-NULL.patch (2K) Download Attachment

Re: Discussion on missing optimizations

David Rowley
In reply to this post by Tom Lane
On 7 October 2017 at 15:19, Tom Lane <[hidden email]> wrote:

>> 9. Unneeded Self JOIN
>
>> Can't remember discussions of this.
>
> I can't get very excited about that one either.
>
> In the end, what the article fails to consider is that all of these are
> tradeoffs, not unalloyed goods.  If you spend planner cycles on every
> query to look for cases that only the most unabashedly brain-dead ORMs
> ever generate, you're not really doing your users a favor on balance.

I think that final sentence lacks imagination.

I've seen plenty of views being called where some column is
unavailable, but the caller joins the very same table again on the
primary key to add the column. There was no brain-dead ORM involved,
just perhaps questionable design. This was very common in my last job,
where we had some rat's nest of views several layers deep, the core of
which often had some complex logic that nobody dared to try to
replicate.

It would be fairly cheap to check whether any of the rtekind==RTE_RELATION
joinlist items have more than one RangeTblEntry with the same relid.  The
joinlist is never that big anyway, and if it were, the join search would
be slow. The more expensive part would be to build the join clauses,
check whether the expressions on either side of all OpExprs match and that
nothing else will cause a non-match, then perform the uniqueness check
on those OpExpr operands. We do have some infrastructure to do the
unique checks.  Likely the slowdown in planning would only hit
cases with a legitimately useful self-join; I doubt checking for a
duplicate RangeTblEntry->relid would cause much of a concern.
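The cheap pre-check described above might look roughly like this (hypothetical data shapes for illustration, not the actual planner structures): one scan of the joinlist to see whether any base relation's relid appears twice before doing any expensive clause matching.

```python
from collections import Counter

def has_duplicate_relid(joinlist):
    """joinlist: iterable of (rtekind, relid) pairs for range-table entries.

    Returns True only if some base relation appears more than once,
    i.e. the query is even a candidate for self-join removal.
    """
    counts = Counter(relid for kind, relid in joinlist
                     if kind == 'RTE_RELATION')
    return any(n > 1 for n in counts.values())

# A query joining table 42 to itself would trigger the deeper check:
print(has_duplicate_relid([('RTE_RELATION', 42),
                           ('RTE_SUBQUERY', 7),
                           ('RTE_RELATION', 42)]))
```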

Anyway, this is giving me some feeling of déjà vu. Perhaps we need
some pg_stats view that shows us planning time and execution time so
that we can get a better idea of how much these things matter in the
average case. We tend never to fare well in these sorts of
comparisons with commercial databases. It's not hard to imagine
someone with migration plans loading some rat's nest of views into
Postgres, taking it for a spin, and finding the performance is not quite
what they need. It's a bit sad that often the people with the loudest
voices are so quick to stomp on ideas for improvements. It
would be much nicer if you'd at least wait for benchmarks before
shooting.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Discussion on missing optimizations

Tom Lane
In reply to this post by Petr Jelinek
Petr Jelinek <[hidden email]> writes:
> On 07/10/17 04:19, Tom Lane wrote:
>> (edit: a few minutes later, I seem to remember that equivclass.c has
>> to do something special with the X=X case, so maybe it could do
>> something else special instead, with little new overhead.)

> So I wrote prototype of achieving this optimization and it seems to be
> really quite simple code-wise (see attached). I did only a limited
> manual testing of this but I don't see any negative impact on planning time.

No, I'm afraid you didn't read that comment closely enough.  This will
flat out fail for cases like "select ... where x=x order by x", because
there will already be a single-element EC for x and so the clause will
just disappear entirely.  If that doesn't happen, then instead you're
creating an EC with duplicate entries, which is an idea I seriously
dislike; the semantics of that don't seem clear at all to me.

What I was imagining was that having detected X=X, instead of "throwing
back" the clause as-is we could throw back an X IS NOT NULL clause,
along the lines of the attached.

This passes the smell test for me in the sense of not adding any
significant number of planner cycles except when the weird case occurs.
It leaves something on the table in that there are some situations
where X=X could be converted, but we won't do it because we don't see
the clause as being a potential EC (because it's not at top level),
as in the second new regression test case below.  I think that's
probably all right; I don't see any way to be more thorough without
adding a lot of new cycles all the time, and I don't believe this is
worth that.

                        regards, tom lane


diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 7997f50..a225414 100644
*** a/src/backend/optimizer/path/equivclass.c
--- b/src/backend/optimizer/path/equivclass.c
***************
*** 27,32 ****
--- 27,33 ----
  #include "optimizer/paths.h"
  #include "optimizer/planmain.h"
  #include "optimizer/prep.h"
+ #include "optimizer/restrictinfo.h"
  #include "optimizer/var.h"
  #include "utils/lsyscache.h"
 
*************** static bool reconsider_full_join_clause(
*** 71,78 ****
   *  any delay by an outer join, so its two sides can be considered equal
   *  anywhere they are both computable; moreover that equality can be
   *  extended transitively.  Record this knowledge in the EquivalenceClass
!  *  data structure.  Returns TRUE if successful, FALSE if not (in which
!  *  case caller should treat the clause as ordinary, not an equivalence).
   *
   * If below_outer_join is true, then the clause was found below the nullable
   * side of an outer join, so its sides might validly be both NULL rather than
--- 72,85 ----
   *  any delay by an outer join, so its two sides can be considered equal
   *  anywhere they are both computable; moreover that equality can be
   *  extended transitively.  Record this knowledge in the EquivalenceClass
!  *  data structure, if applicable.  Returns TRUE if successful, FALSE if not
!  *  (in which case caller should treat the clause as ordinary, not an
!  *  equivalence).
!  *
!  * In some cases, although we cannot convert a clause into EquivalenceClass
!  * knowledge, we can still modify it to a more useful form than the original.
!  * Then, *p_restrictinfo will be replaced by a new RestrictInfo, which is what
!  * the caller should use for further processing.
   *
   * If below_outer_join is true, then the clause was found below the nullable
   * side of an outer join, so its sides might validly be both NULL rather than
*************** static bool reconsider_full_join_clause(
*** 104,112 ****
   * memory context.
   */
  bool
! process_equivalence(PlannerInfo *root, RestrictInfo *restrictinfo,
  bool below_outer_join)
  {
  Expr   *clause = restrictinfo->clause;
  Oid opno,
  collation,
--- 111,121 ----
   * memory context.
   */
  bool
! process_equivalence(PlannerInfo *root,
! RestrictInfo **p_restrictinfo,
  bool below_outer_join)
  {
+ RestrictInfo *restrictinfo = *p_restrictinfo;
  Expr   *clause = restrictinfo->clause;
  Oid opno,
  collation,
*************** process_equivalence(PlannerInfo *root, R
*** 154,169 ****
    collation);
 
  /*
! * Reject clauses of the form X=X.  These are not as redundant as they
! * might seem at first glance: assuming the operator is strict, this is
! * really an expensive way to write X IS NOT NULL.  So we must not risk
! * just losing the clause, which would be possible if there is already a
! * single-element EquivalenceClass containing X.  The case is not common
! * enough to be worth contorting the EC machinery for, so just reject the
! * clause and let it be processed as a normal restriction clause.
  */
  if (equal(item1, item2))
! return false; /* X=X is not a useful equivalence */
 
  /*
  * If below outer join, check for strictness, else reject.
--- 163,207 ----
    collation);
 
  /*
! * Clauses of the form X=X cannot be translated into EquivalenceClasses.
! * We'd either end up with a single-entry EC, losing the knowledge that
! * the clause was present at all, or else make an EC with duplicate
! * entries, causing other issues.
  */
  if (equal(item1, item2))
! {
! /*
! * If the operator is strict, then the clause can be treated as just
! * "X IS NOT NULL".  (Since we know we are considering a top-level
! * qual, we can ignore the difference between FALSE and NULL results.)
! * It's worth making the conversion because we'll typically get a much
! * better selectivity estimate than we would for X=X.
! *
! * If the operator is not strict, we can't be sure what it will do
! * with NULLs, so don't attempt to optimize it.
! */
! set_opfuncid((OpExpr *) clause);
! if (func_strict(((OpExpr *) clause)->opfuncid))
! {
! NullTest   *ntest = makeNode(NullTest);
!
! ntest->arg = item1;
! ntest->nulltesttype = IS_NOT_NULL;
! ntest->argisrow = false; /* correct even if composite arg */
! ntest->location = -1;
!
! *p_restrictinfo =
! make_restrictinfo((Expr *) ntest,
!  restrictinfo->is_pushed_down,
!  restrictinfo->outerjoin_delayed,
!  restrictinfo->pseudoconstant,
!  restrictinfo->security_level,
!  NULL,
!  restrictinfo->outer_relids,
!  restrictinfo->nullable_relids);
! }
! return false;
! }
 
  /*
  * If below outer join, check for strictness, else reject.
*************** reconsider_outer_join_clause(PlannerInfo
*** 1741,1747 ****
    bms_copy(inner_relids),
    bms_copy(inner_nullable_relids),
    cur_ec->ec_min_security);
! if (process_equivalence(root, newrinfo, true))
  match = true;
  }
 
--- 1779,1785 ----
    bms_copy(inner_relids),
    bms_copy(inner_nullable_relids),
    cur_ec->ec_min_security);
! if (process_equivalence(root, &newrinfo, true))
  match = true;
  }
 
*************** reconsider_full_join_clause(PlannerInfo
*** 1884,1890 ****
    bms_copy(left_relids),
    bms_copy(left_nullable_relids),
    cur_ec->ec_min_security);
! if (process_equivalence(root, newrinfo, true))
  matchleft = true;
  }
  eq_op = select_equality_operator(cur_ec,
--- 1922,1928 ----
    bms_copy(left_relids),
    bms_copy(left_nullable_relids),
    cur_ec->ec_min_security);
! if (process_equivalence(root, &newrinfo, true))
  matchleft = true;
  }
  eq_op = select_equality_operator(cur_ec,
*************** reconsider_full_join_clause(PlannerInfo
*** 1899,1905 ****
    bms_copy(right_relids),
    bms_copy(right_nullable_relids),
    cur_ec->ec_min_security);
! if (process_equivalence(root, newrinfo, true))
  matchright = true;
  }
  }
--- 1937,1943 ----
    bms_copy(right_relids),
    bms_copy(right_nullable_relids),
    cur_ec->ec_min_security);
! if (process_equivalence(root, &newrinfo, true))
  matchright = true;
  }
  }
diff --git a/src/backend/optimizer/plan/initsplan.c b/src/backend/optimizer/plan/initsplan.c
index 9931ddd..974eb58 100644
*** a/src/backend/optimizer/plan/initsplan.c
--- b/src/backend/optimizer/plan/initsplan.c
*************** distribute_qual_to_rels(PlannerInfo *roo
*** 1964,1973 ****
  if (maybe_equivalence)
  {
  if (check_equivalence_delay(root, restrictinfo) &&
! process_equivalence(root, restrictinfo, below_outer_join))
  return;
  /* EC rejected it, so set left_ec/right_ec the hard way ... */
! initialize_mergeclause_eclasses(root, restrictinfo);
  /* ... and fall through to distribute_restrictinfo_to_rels */
  }
  else if (maybe_outer_join && restrictinfo->can_join)
--- 1964,1974 ----
  if (maybe_equivalence)
  {
  if (check_equivalence_delay(root, restrictinfo) &&
! process_equivalence(root, &restrictinfo, below_outer_join))
  return;
  /* EC rejected it, so set left_ec/right_ec the hard way ... */
! if (restrictinfo->mergeopfamilies) /* EC might have changed this */
! initialize_mergeclause_eclasses(root, restrictinfo);
  /* ... and fall through to distribute_restrictinfo_to_rels */
  }
  else if (maybe_outer_join && restrictinfo->can_join)
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index a15eee5..ea886b6 100644
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** typedef bool (*ec_matches_callback_type)
*** 127,133 ****
   EquivalenceMember *em,
   void *arg);
 
! extern bool process_equivalence(PlannerInfo *root, RestrictInfo *restrictinfo,
  bool below_outer_join);
  extern Expr *canonicalize_ec_expression(Expr *expr,
    Oid req_type, Oid req_collation);
--- 127,134 ----
   EquivalenceMember *em,
   void *arg);
 
! extern bool process_equivalence(PlannerInfo *root,
! RestrictInfo **p_restrictinfo,
  bool below_outer_join);
  extern Expr *canonicalize_ec_expression(Expr *expr,
    Oid req_type, Oid req_collation);
diff --git a/src/test/regress/expected/equivclass.out b/src/test/regress/expected/equivclass.out
index a96b2a1..c448d85 100644
*** a/src/test/regress/expected/equivclass.out
--- b/src/test/regress/expected/equivclass.out
*************** reset session authorization;
*** 421,423 ****
--- 421,441 ----
  revoke select on ec0 from regress_user_ectest;
  revoke select on ec1 from regress_user_ectest;
  drop user regress_user_ectest;
+ -- check that X=X is converted to X IS NOT NULL when appropriate
+ explain (costs off)
+   select * from tenk1 where unique1 = unique1 and unique2 = unique2;
+                          QUERY PLAN                          
+ -------------------------------------------------------------
+  Seq Scan on tenk1
+    Filter: ((unique1 IS NOT NULL) AND (unique2 IS NOT NULL))
+ (2 rows)
+
+ -- this could be converted, but isn't at present
+ explain (costs off)
+   select * from tenk1 where unique1 = unique1 or unique2 = unique2;
+                        QUERY PLAN                      
+ --------------------------------------------------------
+  Seq Scan on tenk1
+    Filter: ((unique1 = unique1) OR (unique2 = unique2))
+ (2 rows)
+
diff --git a/src/test/regress/sql/equivclass.sql b/src/test/regress/sql/equivclass.sql
index 0e4aa0c..85aa65d 100644
*** a/src/test/regress/sql/equivclass.sql
--- b/src/test/regress/sql/equivclass.sql
*************** revoke select on ec0 from regress_user_e
*** 254,256 ****
--- 254,264 ----
  revoke select on ec1 from regress_user_ectest;
 
  drop user regress_user_ectest;
+
+ -- check that X=X is converted to X IS NOT NULL when appropriate
+ explain (costs off)
+   select * from tenk1 where unique1 = unique1 and unique2 = unique2;
+
+ -- this could be converted, but isn't at present
+ explain (costs off)
+   select * from tenk1 where unique1 = unique1 or unique2 = unique2;



Re: Discussion on missing optimizations

Tom Lane
In reply to this post by David Rowley
David Rowley <[hidden email]> writes:
> It would be much nicer if you'd at least wait for benchmarks before
> shooting.

Benchmarks of what?  We'd have to expend quite a bit of effort just
to get to a place where we'd have something to benchmark.  I do not
think it's unreasonable of me to express an opinion that that would
likely be wasted effort.  If you disagree, and are willing to expend
such effort speculatively, I'm not stopping you.

                        regards, tom lane



Re: Discussion on missing optimizations

Petr Jelinek
In reply to this post by Tom Lane
On 07/10/17 18:15, Tom Lane wrote:

> Petr Jelinek <[hidden email]> writes:
>> On 07/10/17 04:19, Tom Lane wrote:
>>> (edit: a few minutes later, I seem to remember that equivclass.c has
>>> to do something special with the X=X case, so maybe it could do
>>> something else special instead, with little new overhead.)
>
>> So I wrote prototype of achieving this optimization and it seems to be
>> really quite simple code-wise (see attached). I did only a limited
>> manual testing of this but I don't see any negative impact on planning time.
>
> No, I'm afraid you didn't read that comment closely enough.  This will
> flat out fail for cases like "select ... where x=x order by x", because
> there will already be a single-element EC for x and so the clause will
> just disappear entirely.  If that doesn't happen, then instead you're
> creating an EC with duplicate entries, which is an idea I seriously
> dislike; the semantics of that don't seem clear at all to me.

Hmm, it did not disappear (and worked fine in SQL-level tests). I don't
think I fully understand the "EC with duplicate entries" part and what
the problem with it is, so I'll defer to your wisdom there.

> What I was imagining was that having detected X=X, instead of "throwing
> back" the clause as-is we could throw back an X IS NOT NULL clause,
> along the lines of the attached.

Right, I wanted to avoid messing with the incoming restrictinfo, but if
we don't want to call the code below, or write tons of code for this, I
guess it's the best we can do.

> This passes the smell test for me in the sense of not adding any
> significant number of planner cycles except when the weird case occurs.
> It leaves something on the table in that there are some situations
> where X=X could be converted, but we won't do it because we don't see
> the clause as being a potential EC (because it's not at top level),
> as in the second new regression test case below.  I think that's
> probably all right; I don't see any way to be more thorough without
> adding a lot of new cycles all the time, and I don't believe this is
> worth that.
>

My code had the same issue. I think going deeper would require quite a bit
of new code (and cycles) for something even less likely to happen
than simple X=X, while the current patch is a quite cheap win.

--
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



Re: Discussion on missing optimizations

Tom Lane
Petr Jelinek <[hidden email]> writes:
> On 07/10/17 18:15, Tom Lane wrote:
>> No, I'm afraid you didn't read that comment closely enough.  This will
>> flat out fail for cases like "select ... where x=x order by x", because
>> there will already be a single-element EC for x and so the clause will
>> just disappear entirely.  If that doesn't happen, then instead you're
>> creating an EC with duplicate entries, which is an idea I seriously
>> dislike; the semantics of that don't seem clear at all to me.

> Hmm it did not disappear (and worked fine in SQL level tests).

I may not be correctly remembering what the query would need to look
like for there to be single-element ECs in existence at this point; but
I surely would not like this code to assume that none will exist till
later.

> I don't
> think I fully understand the "EC with duplicate entries" part and what's
> the problem with it so I'll defer to your wisdom there.

Well, as one example, assume that we use your patch and consider what
happens with
        where x = y and x = x
vs
        where x = x and x = y

In the first case we end up with an EC that's just {x,y}, because the
second process_equivalence() will find both sides in the same EC and
conclude that the second clause is redundant.  (Which it is, if the
equality operators have the same notion of what to do with nulls.)
In the second case we end up with an EC containing {x,x,y}, which
at minimum will result in emitting redundant generated quals.  I'm
not sure if it could have any worse consequences than that, but I'm
not sure it couldn't either.  But this is bogus in any case, because
those WHERE clauses surely should produce identical results.

Looking further ahead, if ECs have to support being multisets rather
than pure sets, that would put a crimp in ever improving this logic to
use a smarter UNION-FIND algorithm.  (I've not yet seen queries where the
number of EC members is large enough to make that a serious issue, but
I think it could happen, particularly with inheritance/partitioning.)
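The set-vs-multiset point can be illustrated with a toy union-find over equivalence-class members (plain Python, not equivclass.c): treating membership as a set makes the result order-independent, so "x=y, x=x" and "x=x, x=y" both end up with the class {x, y} and no duplicate entry.

```python
class EquivClasses:
    """Toy union-find: each member maps to a parent; roots name the class."""

    def __init__(self):
        self.parent = {}

    def find(self, item):
        self.parent.setdefault(item, item)
        while self.parent[item] != item:
            # Path halving keeps the trees shallow.
            self.parent[item] = self.parent[self.parent[item]]
            item = self.parent[item]
        return item

    def add_equality(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:            # For X=X both sides resolve to one root:
            self.parent[ra] = rb  # nothing to merge, the clause is redundant.

    def members(self, item):
        root = self.find(item)
        return {m for m in self.parent if self.find(m) == root}

ecs = EquivClasses()
ecs.add_equality('x', 'y')
ecs.add_equality('x', 'x')   # no duplicate entry, regardless of order
print(ecs.members('x'))
```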

>> This passes the smell test for me in the sense of not adding any
>> significant number of planner cycles except when the weird case occurs.
>> It leaves something on the table in that there are some situations
>> where X=X could be converted, but we won't do it because we don't see
>> the clause as being a potential EC (because it's not at top level),
>> as in the second new regression test case below.  I think that's
>> probably all right; I don't see any way to be more thorough without
>> adding a lot of new cycles all the time, and I don't believe this is
>> worth that.

> My code had same issue. I think going deeper would require quite a bit
> of new code (and cycles) for something that's even less likely to happen
> than simple X=X while the current patch is quite cheap win.

Yeah.  I'm not really convinced it's a big win, but it's so cheap we
might as well do it.  The only case where it would expend cycles and
fail to give an improvement is if we have X=X with a non-strict operator,
which I think is a case that never arises in practice at present,
because btree effectively assumes that all btree operators are strict.
(Someday we might want to relax that, though, which is why I don't
want to let the planner just assume it.)

                        regards, tom lane


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: Discussion on missing optimizations

Nico Williams
In reply to this post by Tom Lane-2
On Fri, Oct 06, 2017 at 10:19:54PM -0400, Tom Lane wrote:
> Andres Freund <[hidden email]> writes:
> > On 2017-10-06 21:33:16 -0400, Adam Brusselback wrote:
> >> The article in question is here:
> >> https://blog.jooq.org/2017/09/28/10-cool-sql-optimisations-that-do-not-depend-on-the-cost-model/
>
> > That's interesting.
>
> The impression I have in a quick scan is that probably hardly any of these
> are cases that any of the DB designers think are important in themselves.

That's true for some of those.  But some of them might become important
when you start pushing WHERE constraints from outside into inner table
sources and subqueries, as dumb-looking constraints can simply appear
from pushing non-dumb-looking constraints.

More than the operator-level optimizations, these would make a big
difference for me:

 - turning subqueries into joins

 - turning ORs into UNIONs

   It is easy enough to work around the lack of this optimization in
   many cases, but it does make queries more verbose.

 - pushing WHERE constraints from outer queries into the table source
   queries (_including_ VIEWs)

    - determining that some table in a query that had WHERE constraints
      pushed into it... now has a very well-filled out lookup key,
      therefore it's the one that should be the table source to start
      the plan with, i.e., that it should be first in the outermost loop
      of a nested loop join

      For me these two would be huge wins.  I have to resort to
      functions with roughly the same body as views just so that I can
      have the optimizer pick the correct plan.  This causes a lot of
      code duplication in my schemas.

 - pushing WHERE constraints from outer queries into HAVING thence WHERE
   constraints on GROUP BY queries where the outer constraints are on
   columns used to GROUP BY

   I find myself making two versions of views that do aggregation: one
   that does not, and one that does.  This allows me to use the
   non-aggregating view in contexts where I need this optimization, but
   then I have to re-code the aggregation at that layer.  Again, lots of
   duplication.

These sorts of optimizations are huge.
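
As a quick illustration of the OR-to-UNION rewrite done by hand (a sketch using SQLite from Python; the table and data are invented), note that the manual rewrite needs UNION rather than UNION ALL, since a row can satisfy both arms:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY,"
             " customer INTEGER, status TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 10, 'open'), (2, 11, 'closed'),
                  (3, 10, 'closed'), (4, 12, 'open')])

with_or = conn.execute(
    "SELECT id FROM orders"
    " WHERE customer = 10 OR status = 'open' ORDER BY id").fetchall()

# Manual rewrite: UNION (not UNION ALL), so a row matching both arms
# (id 1 here) is not emitted twice -- part of why proving the
# transform safe in general is tricky.
with_union = conn.execute(
    "SELECT id FROM orders WHERE customer = 10"
    " UNION "
    "SELECT id FROM orders WHERE status = 'open' ORDER BY id").fetchall()

print(with_or == with_union)  # True
```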

> Rather, they fall out of more general optimization attempts, or not,
> depending on the optimization mechanisms in use in a particular DB.
> For example, reducing "WHERE 1=1" to "WHERE TRUE" and then to nothing
> comes out of a constant-subexpression-precalculation mechanism for us,
> whereas "WHERE column=column" doesn't fall to that approach.  ISTM it
> would be really dumb to expend planner cycles looking specifically for
> that case, so I guess that DB2 et al are finding it as a side-effect of
> some more general optimization ... I wonder what that is?

If you can reduce the number of compilations / optimization passes for
statements, then spending more time in the optimizer is not a big deal.
So, when invoked via PREPARE I would say spending more cycles looking
for this sort of thing is OK, but in many other cases it's not.

Also, sometimes these cases crop up due to pushing constraints into VIEWs
and sub-queries.  In those cases, constant sub-expression elimination
can be a win.

> (edit: a few minutes later, I seem to remember that equivclass.c has
> to do something special with the X=X case, so maybe it could do
> something else special instead, with little new overhead.)

I'd expect that column = column is not trivial to turn into TRUE, not
unless those columns are NOT NULLable.
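
A small demonstration of why (a sketch using SQLite from Python; the table is invented): `x = x` filters out NULLs, so it is not equivalent to TRUE unless the column is NOT NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")  # x is nullable
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (None,)])

all_rows   = conn.execute("SELECT count(*) FROM t").fetchone()[0]
x_equals_x = conn.execute("SELECT count(*) FROM t WHERE x = x").fetchone()[0]

print(all_rows)    # 3
print(x_equals_x)  # 2 -- the NULL row fails x = x, so the clause
                   #      cannot simply be replaced by TRUE
```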

> > 9. Unneeded Self JOIN
>
> > Can't remember discussions of this.
>
> I can't get very excited about that one either.
>
> In the end, what the article fails to consider is that all of these are
> tradeoffs, not unalloyed goods.  If you spend planner cycles on every
> query to look for cases that only the most unabashedly brain-dead ORMs
> ever generate, you're not really doing your users a favor on balance.

I can't get very excited about this one either, though I do believe it
can arise as the author says, "when you build complex views and JOIN
them to each other".  Maybe I'm not excited about it because I've not
needed it :)

Nico
--



Re: Discussion on missing optimizations

Petr Jelinek-4
In reply to this post by Tom Lane-2
On 07/10/17 19:59, Tom Lane wrote:

> Petr Jelinek <[hidden email]> writes:
>> On 07/10/17 18:15, Tom Lane wrote:
>>> No, I'm afraid you didn't read that comment closely enough.  This will
>>> flat out fail for cases like "select ... where x=x order by x", because
>>> there will already be a single-element EC for x and so the clause will
>>> just disappear entirely.  If that doesn't happen, then instead you're
>>> creating an EC with duplicate entries, which is an idea I seriously
>>> dislike; the semantics of that don't seem clear at all to me.
>
>> Hmm it did not disappear (and worked fine in SQL level tests).
>
> I may not be correctly remembering what the query would need to look
> like for there to be single-element ECs in existence at this point; but
> I surely would not like this code to assume that none will exist till
> later.
>
>> I don't
>> think I fully understand the "EC with duplicate entries" part and what's
>> the problem with it so I'll defer to your wisdom there.
>
> Well, as one example, assume that we use your patch and consider what
> happens with
> where x = y and x = x
> vs
> where x = x and x = y
>
> In the first case we end up with an EC that's just {x,y}, because the
> second process_equivalence() will find both sides in the same EC and
> conclude that the second clause is redundant.  (Which it is, if the
> equality operators have the same notion of what to do with nulls.)
> In the second case we end up with an EC containing {x,x,y}, which
> at minimum will result in emitting redundant generated quals.  I'm
> not sure if it could have any worse consequences than that, but I'm
> not sure it couldn't either.  But this is bogus in any case, because
> those WHERE clauses surely should produce identical results.
>
> Looking further ahead, if ECs have to support being multisets rather
> than pure sets, that would put a crimp in ever improving this logic to
> use a smarter UNION-FIND algorithm.  (I've not yet seen queries where the
> number of EC members is large enough to make that a serious issue, but
> I think it could happen, particularly with inheritance/partitioning.)
>

Okay, that makes sense, thanks for explanation. Your patch is the way to
go then.

>>> This passes the smell test for me in the sense of not adding any
>>> significant number of planner cycles except when the weird case occurs.
>>> It leaves something on the table in that there are some situations
>>> where X=X could be converted, but we won't do it because we don't see
>>> the clause as being a potential EC (because it's not at top level),
>>> as in the second new regression test case below.  I think that's
>>> probably all right; I don't see any way to be more thorough without
>>> adding a lot of new cycles all the time, and I don't believe this is
>>> worth that.
>
>> My code had same issue. I think going deeper would require quite a bit
>> of new code (and cycles) for something that's even less likely to happen
>> than simple X=X while the current patch is quite cheap win.
>
> Yeah.  I'm not really convinced it's a big win, but it's so cheap we
> might as well do it.  The only case where it would expend cycles and
> fail to give an improvement is if we have X=X with a non-strict operator,
> which I think is a case that never arises in practice at present,
> because btree effectively assumes that all btree operators are strict.
> (Someday we might want to relax that, though, which is why I don't
> want to let the planner just assume it.)
>

In real life probably not, but it seems these small bits can be useful
marketing-wise, given articles like the one which started this thread.
And it should not hurt anything either.

--
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



Re: Discussion on missing optimizations

Adam Brusselback
> I can't get very excited about this one either, though I do believe it
> can arise as the author says, "when you build complex views and JOIN
> them to each other".  Maybe I'm not excited about it because I've not
> needed it :)

This is one that I know would help with my database.  There is a ton
of logic stored in views, which get joined to the original table to
filter the set, rather than imposing that set of conditions in every
separate query.

It would be really nice if the optimizer could simplify those to
eliminate the self join.  The join almost always happens on the
table's primary key, and if not, it's on a non-nullable column for
sure.
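To show the shape of the redundancy (a sketch using SQLite from Python; the view and table names are invented), a self join on the primary key neither filters nor duplicates rows, so it could in principle be elided:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [(1, 'alice'), (2, 'bob')])

# A view that re-joins the table to itself on the primary key, as can
# happen when a filtering view is joined back to its base table.
conn.execute("""CREATE VIEW account_v AS
                SELECT a.* FROM account a JOIN account b ON a.id = b.id""")

joined = conn.execute("SELECT id, name FROM account_v ORDER BY id").fetchall()
direct = conn.execute("SELECT id, name FROM account ORDER BY id").fetchall()

# The join on the key neither filters nor duplicates any row.
print(joined == direct)  # True
```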


On another note:
> turning ORs into UNIONs
This is another one which would be incredibly useful for me.  I've had
to do this manually for performance
reasons far too often.


> Partially agreed. A comment to the article also mentions that some other
> database performs more optimizations depending on the cost of the
> plan. That's not easy to do in our current plan structure, but I think
> it's quite a worthwhile concept.

I would love to see this in Postgres.  It would allow the planner to
not waste cycles unnecessarily on
queries where it's just not needed, and to potentially spend a few
more cycles planning on very
costly queries to save a ton while executing.



Re: Discussion on missing optimizations

Tom Lane-2
Adam Brusselback <[hidden email]> writes:
> On another note:
>> turning ORs into UNIONs

> This is another one which would be incredibly useful for me.  I've had
> to do this manually for performance reasons far too often.

Well, maybe you could sign up to help review the open patch for that then:
https://commitfest.postgresql.org/15/1001/

The reason that's not in v10 is we haven't been able to convince
ourselves whether it's 100% correct.

                        regards, tom lane



Re: Discussion on missing optimizations

Andres Freund
On 2017-10-08 11:28:09 -0400, Tom Lane wrote:

> Adam Brusselback <[hidden email]> writes:
> > On another note:
> >> turning ORs into UNIONs
>
> > This is another one which would be incredibly useful for me.  I've had
> > to do this manually for performance reasons far too often.
>
> Well, maybe you could sign up to help review the open patch for that then:
> https://commitfest.postgresql.org/15/1001/
>
> The reason that's not in v10 is we haven't been able to convince
> ourselves whether it's 100% correct.

Unfortunately it won't help in this specific case (no support for UNION,
just UNION ALL), but I thought it might be interesting to reference
https://medium.com/@uwdb/introducing-cosette-527898504bd6
here.

Greetings,

Andres Freund



Re: Discussion on missing optimizations

Tom Lane-2
In reply to this post by Petr Jelinek-4
Petr Jelinek <[hidden email]> writes:
> Okay, that makes sense, thanks for explanation. Your patch is the way to
> go then.

Hearing no further comment, pushed.  Thanks for reviewing it.

                        regards, tom lane



Re: Discussion on missing optimizations

Tom Lane-2
In reply to this post by Andres Freund
Andres Freund <[hidden email]> writes:
> On 2017-10-08 11:28:09 -0400, Tom Lane wrote:
>> https://commitfest.postgresql.org/15/1001/
>> The reason that's not in v10 is we haven't been able to convince
>> ourselves whether it's 100% correct.

> Unfortunately it won't help in this specific case (no support for UNION,
> just UNION ALL), but I thought it might be interesting to reference
> https://medium.com/@uwdb/introducing-cosette-527898504bd6
> here.

Huh, that is an interesting project indeed.  Although I'm not sure that
it quite addresses the question of whether an optimization transform
is valid.  IIUC, it could prove that a particular query having been fed
through the transform didn't change semantics, but that offers only
limited insight into whether some other query fed through the transform
might change.

                        regards, tom lane



Re: Discussion on missing optimizations

Andres Freund
On 2017-10-08 17:11:44 -0400, Tom Lane wrote:

> Andres Freund <[hidden email]> writes:
> > On 2017-10-08 11:28:09 -0400, Tom Lane wrote:
> >> https://commitfest.postgresql.org/15/1001/
> >> The reason that's not in v10 is we haven't been able to convince
> >> ourselves whether it's 100% correct.
>
> > Unfortunately it won't help in this specific case (no support for UNION,
> > just UNION ALL), but I thought it might be interesting to reference
> > https://medium.com/@uwdb/introducing-cosette-527898504bd6
> > here.
>
> Huh, that is an interesting project indeed.  Although I'm not sure that
> it quite addresses the question of whether an optimization transform
> is valid.  IIUC, it could prove that a particular query having been fed
> through the transform didn't change semantics, but that offers only
> limited insight into whether some other query fed through the transform
> might change.

According to the guide it offers some support for more general
transformations:
http://cosette.cs.washington.edu/guide#24-symbolic-predicates That's
still only going to be sufficient for some of the interesting cases, but
still...

I wonder about pinging them about the OR -> UNION case; they've been
responsive to problems in some threads I found online.

Greetings,

Andres Freund



Re: Discussion on missing optimizations

David Rowley-3
In reply to this post by Andres Freund
On 7 October 2017 at 14:48, Andres Freund <[hidden email]> wrote:
> 3. JOIN Elimination
>
> There's been a lot of discussion and several patches. There's a bunch of
> problems here, one being that there's cases (during trigger firing,
> before the constraint checks) where foreign keys don't hold true, so we
> can't quite generally make these optimization.  Another concern is
> whether the added plan time is actually worthwhile.

I looked over this and it seems there's some pretty low hanging fruit
in there that we can add with just a handful of lines of new code.

This is the case for LEFT JOINs with a DISTINCT clause. Normally we
can only remove an unused LEFT JOINed relation if there are some
unique properties that ensure that the join does not duplicate any
outer row. We don't need to worry about this when there's a DISTINCT
clause, as the DISTINCT would remove any duplicates anyway. If I'm not
mistaken, we also don't need to bother looking at the actual distinct
clause's exprs since we'll already know that nothing is in there
regarding the relation being removed. The benefit to this could be
two-fold, as 1) we don't need to join to the unused relation and 2) we
don't need to remove any row duplication caused by the join.
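
To illustrate the claim (a sketch using SQLite from Python; the table names are invented), even when the LEFT JOINed relation has no unique index and duplicates outer rows, DISTINCT makes the join removable:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY, grp TEXT)")
conn.execute("CREATE TABLE child (parent_id INTEGER)")  # no unique index
conn.executemany("INSERT INTO parent VALUES (?, ?)", [(1, 'a'), (2, 'b')])
conn.executemany("INSERT INTO child VALUES (?)", [(1,), (1,), (2,)])

# The LEFT JOIN duplicates parent row 1 (two matching child rows), but
# DISTINCT collapses the duplicates, so the unreferenced join could be
# removed entirely without changing the result.
with_join = conn.execute(
    "SELECT DISTINCT p.grp FROM parent p"
    " LEFT JOIN child c ON c.parent_id = p.id ORDER BY p.grp").fetchall()
without_join = conn.execute(
    "SELECT DISTINCT grp FROM parent ORDER BY grp").fetchall()

print(with_join == without_join)  # True
```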

While someone might argue that this is not going to be that common a
case to hit, if we consider how cheap this is to implement, it does
seem to be worth doing a couple of NULL checks in the planner for it.

The only slight downside I can see is that we're letting a few more
cases through rel_supports_distinctness(), which is also used for
unique joins, and these proofs are not valid there. However, it may
not be worth the trouble doing anything about that, as relations
without unique indexes are pretty uncommon (at least in my world).

Thoughts?

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Attachment: remove_left_join_distinct.patch (9K)

Re: Discussion on missing optimizations

David Rowley-3
On 9 October 2017 at 17:41, David Rowley <[hidden email]> wrote:
> Thoughts?

Actually, I was a little inconsistent with my List NULL/NIL checks in
that last one.

I've attached an updated patch.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Attachment: remove_left_join_distinct_v2.patch (9K)