[PATCH] Overestimated filter cost and its mitigation

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
17 messages Options
Reply | Threaded
Open this post in threaded view
|

[PATCH] Overestimated filter cost and its mitigation

Yuto HAYAMIZU-2
Hi hackers,

Currently, cost of a filter with multiple clauses is estimated by
summing up estimated cost of each clause.  As long as a filter
consists of simple clauses and its cost is fairly small, it works
fine. However, when there exists some heavy clauses (like SubQuery or
user-defined functions) and most of tuples are filtered by other
simple clauses, total cost is likely to be overestimated.  An attached
patch mitigates this overestimation by using selectivity estimation of
subsets of clauses in a filter.

Suppose that there are three qual clauses in a scan node, current
postgres estimates per-tuple cost of the filter to be:
   cost(A) + cost(B) + cost(C)

And basic idea of the attached patch is:
   cost(A) + clauselist_selectivity({A}) * cost(B) + clauselist_selectivity({A, B}) * cost(C)


We first noticed this overestimation during performance analysis of
our real applications. In order to reproduce the situation, we created
a similar query with pgbench data.

The database was prepared by following commands:
    pgbench --scale=1000 -i pgb5
    pgbench  -c 10 -t 100000 pgb5

and the query is:
    select e.tid,
            e.aid,
            e.bid,
            e.mtime
       from pgbench_history e
      where e.tid between 1 and 1000
        and (select count(*)
               from pgbench_history f
              where f.mtime < e.mtime
                and f.bid = e.bid
              group by f.bid) > 100
        and e.aid between 1 and 100000;


== Query plan with current upstream ==

1 Seq Scan on pgbench_history e  (cost=0.00..21393523889.00 rows=28 width=20) (actual time=2391.683..21775.191 rows=85 loops=1)
2  Filter: ((tid >= 1) AND (tid <= 1000) AND (aid >= 1) AND (aid <= 100000) AND ((SubPlan 1) > 100))
3  Rows Removed by Filter: 999915
4    SubPlan 1
5     ->  GroupAggregate  (cost=0.00..21393.50 rows=283 width=12) (actual time=229.050..229.050 rows=1 loops=94)
6          Group Key: f.bid
7           ->  Seq Scan on pgbench_history f  (cost=0.00..21389.00 rows=333 width=4) (actual time=5.036..228.851 rows=529 loops=94)
8                 Filter: ((mtime < e.mtime) AND (bid = e.bid))
9                 Rows Removed by Filter: 999471

Most amount of total cost 21,393,523,889 comes from:
  (cost of SubPlan 1) * (# of rows in pgbench_history) = 21,393.50 * 1,000,000 = 21,393,000,000

but in actual run, SubPlan 1 was executed only 94 times, and total
cost was overestimated more than 10000 times greater.


== Query plan with patched upstream ==

1 Seq Scan on pgbench_history e  (cost=0.00..1687169.88 rows=28 width=20) (actual time=2388.802..21721.498 rows=85 loops=1)
2   Filter: ((tid >= 1) AND (tid <= 1000) AND (aid >= 1) AND (aid <= 100000) AND ((SubPlan 1) > 100))
3   Rows Removed by Filter: 999915
4   SubPlan 1
5     ->  GroupAggregate  (cost=0.00..19726.83 rows=283 width=12) (actual time=228.507..228.507 rows=1 loops=94)
6           Group Key: f.bid
7           ->  Seq Scan on pgbench_history f  (cost=0.00..19722.33 rows=333 width=4) (actual time=5.378..228.316 rows=529 loops=94)
8                 Filter: ((mtime < e.mtime) AND (bid = e.bid))
9                 Rows Removed by Filter: 999471

In patched version of upstream, total cost is largely decreased.  In
cost estimation, only 84.4 tuples of pgbench_history are expected to
be evaluated with SubPlan 1.  It is close to actual number 94 to some
extent, and total cost seems much more reasonable than cost with
current upstream.

Any thoughts?

Regards,
Yuto Hayamizu



--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Mitigate-filter-cost-overestimation.patch (4K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Overestimated filter cost and its mitigation

David Fetter
On Mon, Sep 11, 2017 at 04:43:46PM +0900, Yuto Hayamizu wrote:

> Hi hackers,
>
> Currently, cost of a filter with multiple clauses is estimated by
> summing up estimated cost of each clause.  As long as a filter
> consists of simple clauses and its cost is fairly small, it works
> fine. However, when there exists some heavy clauses (like SubQuery or
> user-defined functions) and most of tuples are filtered by other
> simple clauses, total cost is likely to be overestimated.  An attached
> patch mitigates this overestimation by using selectivity estimation of
> subsets of clauses in a filter.

I've taken the liberty of adding this to the upcoming commitfest.

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Overestimated filter cost and its mitigation

Thomas Munro-3
In reply to this post by Yuto HAYAMIZU-2
On Mon, Sep 11, 2017 at 7:43 PM, Yuto Hayamizu <[hidden email]> wrote:
> Suppose that there are three qual clauses in a scan node, current
> postgres estimates per-tuple cost of the filter to be:
>    cost(A) + cost(B) + cost(C)
>
> And basic idea of the attached patch is:
>    cost(A) + clauselist_selectivity({A}) * cost(B) +
> clauselist_selectivity({A, B}) * cost(C)

I am no planner expert and I haven't tested or studied the patch in
detail, but here is some feedback for what it's worth.

This idea seems to makes intuitive sense.  I see that you use
order_qual_clauses() to know what order they'll run in, so I'm
wondering if there is any reason we shouldn't do it up front and keep
it during path building, instead of running it again at plan creation
time.  Is there some way it could ever produce a different result at
the two times?  Why not also apply this logic to qpquals of joins,
foreign scans, subplans?  That is, instead of replacing cost_qual_eval
with this code for baserels, I wonder if we should teach
cost_qual_eval how to do this so those other users could also benefit
(having perhaps already ordered the qual clauses earlier).

This is one of those patches that risks having an impact on many query
plans.  Yikes.  Of all the regression test queries, only
updatable_views complained though, and that involves leakproof
functions.  I see that those get some kind of special treatment in
order_qual_clauses().

+ ordered_clauses = order_qual_clauses(root, rel->baserestrictinfo);
+ clause_list = ordered_clauses;

Is clause_list necessary?  Can't you just use ordered_clauses for the
outer and inner loops?

+ List *clause_list_for_sel = NULL;

The convention is to use NIL for empty lists (a nod to the Lisp
machine they prototyped this project on).

+ /* Make a temporary clause list for selectivity calcuation */

s/calcuation/calculation/

--
Thomas Munro
http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Overestimated filter cost and its mitigation

Ashutosh Bapat
On Mon, Nov 6, 2017 at 10:01 AM, Thomas Munro
<[hidden email]> wrote:
>
> This idea seems to makes intuitive sense.  I see that you use
> order_qual_clauses() to know what order they'll run in, so I'm
> wondering if there is any reason we shouldn't do it up front and keep
> it during path building, instead of running it again at plan creation
> time.  Is there some way it could ever produce a different result at
> the two times?

IIRC, only thing that changes between plan time quals and execution
time quals is constaint folding of constant parameters. But I don't
think we change the selectivity estimates when that's done. At the
same time, I don't think we should make a lot of effort to make sure
that the order used during the estimation is same as the order at the
execution; we are anyway estimating. There can always be some
difference between what's estimated and what actually happens.

> Why not also apply this logic to qpquals of joins,
> foreign scans, subplans?  That is, instead of replacing cost_qual_eval
> with this code for baserels, I wonder if we should teach
> cost_qual_eval how to do this so those other users could also benefit
> (having perhaps already ordered the qual clauses earlier).

+1.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Overestimated filter cost and its mitigation

Robert Haas
On Mon, Nov 6, 2017 at 5:19 AM, Ashutosh Bapat
<[hidden email]> wrote:
> IIRC, only thing that changes between plan time quals and execution
> time quals is constaint folding of constant parameters. But I don't
> think we change the selectivity estimates when that's done. At the
> same time, I don't think we should make a lot of effort to make sure
> that the order used during the estimation is same as the order at the
> execution; we are anyway estimating. There can always be some
> difference between what's estimated and what actually happens.

I don't think that's a good justification for allowing the order to
vary.  It's true that, at execution time, we may find row counts and
selectivities to be different than what we predicted, but that is a
case of the real data being different from our idealized data -- which
is difficult to avoid in general.  Allowing our actual decisions to be
different from the decisions we thought we would make seems like
programmer sloppiness.  It would also be very confusing if the planner
uses one ordering to estimate the cost and then a different order at
execution time and in the EXPLAIN ANALYZE output.  How could anybody
understand what had happened in such a case?

I think it would be a good idea, as Thomas says, to order the qual
clauses at an earlier stage and then remember our decision.  However,
we have to think about whether that's going to increase planning time
in a noticeable way.  I wonder why we currently postpone this until
such a late phase of planning.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Overestimated filter cost and its mitigation

Tom Lane-2
Robert Haas <[hidden email]> writes:
> I think it would be a good idea, as Thomas says, to order the qual
> clauses at an earlier stage and then remember our decision.  However,
> we have to think about whether that's going to increase planning time
> in a noticeable way.  I wonder why we currently postpone this until
> such a late phase of planning.

Because (1) up to now there's been no need to consider the qual ordering
till later, and (2) re-doing that sort for each path seemed unduly
expensive.  If we're to try to estimate whether later quals will be
reached, then sure the ordering becomes important.  I'm still concerned
about (2) though.  If there were a way to pre-sort the quals once for
all paths, the problem goes away, but I don't think that works ---
indexscans may segregate some quals, and in join cases different paths
will actually have different sets of quals they need to check depending
on the join order.

So the bottom line here is how much is it going to cost us to add this
additional refinement in cost estimation, and is it worth it given our
extremely poor level of accuracy in expression cost estimation.  Color
me dubious.

                        regards, tom lane


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Overestimated filter cost and its mitigation

Ashutosh Bapat
On Wed, Nov 8, 2017 at 3:18 AM, Tom Lane <[hidden email]> wrote:

> Robert Haas <[hidden email]> writes:
>> I think it would be a good idea, as Thomas says, to order the qual
>> clauses at an earlier stage and then remember our decision.  However,
>> we have to think about whether that's going to increase planning time
>> in a noticeable way.  I wonder why we currently postpone this until
>> such a late phase of planning.
>
> Because (1) up to now there's been no need to consider the qual ordering
> till later, and (2) re-doing that sort for each path seemed unduly
> expensive.  If we're to try to estimate whether later quals will be
> reached, then sure the ordering becomes important.  I'm still concerned
> about (2) though.  If there were a way to pre-sort the quals once for
> all paths, the problem goes away, but I don't think that works ---
> indexscans may segregate some quals, and in join cases different paths
> will actually have different sets of quals they need to check depending
> on the join order.
>

Looking at order_qual_clauses(), we can say that a set of quals q1
.... qn are ordered the same irrespective of the set of clauses they
are subset of. E.g. if {q1 .. qn} is subset of Q (ordered as Qo) and
also Q' (ordered as Q'o) the order in which they appear in Qo and Q'o
is same. So, even if different paths segregate the clauses in
different set, within each set the order is same. FWIW, we can order
all clauses in largest set once and use that order every time. Albeit
we will have to remember the order somewhere OR make the separator
routine retain the order in the larger set, which I guess is true
about all separator functions.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Overestimated filter cost and its mitigation

Michael Paquier
On Thu, Nov 9, 2017 at 12:33 PM, Ashutosh Bapat
<[hidden email]> wrote:

> Looking at order_qual_clauses(), we can say that a set of quals q1
> .... qn are ordered the same irrespective of the set of clauses they
> are subset of. E.g. if {q1 .. qn} is subset of Q (ordered as Qo) and
> also Q' (ordered as Q'o) the order in which they appear in Qo and Q'o
> is same. So, even if different paths segregate the clauses in
> different set, within each set the order is same. FWIW, we can order
> all clauses in largest set once and use that order every time. Albeit
> we will have to remember the order somewhere OR make the separator
> routine retain the order in the larger set, which I guess is true
> about all separator functions.

I am not sure what to think about this patch, so moved to next CF. The
patch still applies. Hayamizu-san, it would be nice as well if you
could review other's patches. One patch reviewed for one patch
submitted, with equal difficulty. You should also get a community
account so as it is possible to add your name as an author of this
patch.
--
Michael

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Overestimated filter cost and its mitigation

Stephen Frost
Greetings,

* Michael Paquier ([hidden email]) wrote:

> On Thu, Nov 9, 2017 at 12:33 PM, Ashutosh Bapat
> <[hidden email]> wrote:
> > Looking at order_qual_clauses(), we can say that a set of quals q1
> > .... qn are ordered the same irrespective of the set of clauses they
> > are subset of. E.g. if {q1 .. qn} is subset of Q (ordered as Qo) and
> > also Q' (ordered as Q'o) the order in which they appear in Qo and Q'o
> > is same. So, even if different paths segregate the clauses in
> > different set, within each set the order is same. FWIW, we can order
> > all clauses in largest set once and use that order every time. Albeit
> > we will have to remember the order somewhere OR make the separator
> > routine retain the order in the larger set, which I guess is true
> > about all separator functions.
>
> I am not sure what to think about this patch, so moved to next CF. The
> patch still applies. Hayamizu-san, it would be nice as well if you
> could review other's patches. One patch reviewed for one patch
> submitted, with equal difficulty. You should also get a community
> account so as it is possible to add your name as an author of this
> patch.
While this patch does still apply, it doesn't pass the 'make check'
regression tests and there appears to be concern raised up-thread about
the performance impact.

Yuto Hayamizu, it looks like some good questions were raised and which
need to be addressed, in addition to making sure that the patch at least
passes 'make check', so I'm leaving this as waiting-for-author.  If you
believe the patch is no longer viable, we can change it to 'returned
with feedback', otherwise, an updated patch and some responses to the
questions raised would be great as we're already a week into this
commitfest and this thread has been dormant since near the start of the
last CF.

Thanks!

Stephen

signature.asc (836 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Overestimated filter cost and its mitigation

Yuto HAYAMIZU-2
On Sun, Jan 7, 2018 at 8:25 AM, Stephen Frost <[hidden email]> wrote:

>
> Greetings,
>
> * Michael Paquier ([hidden email]) wrote:
> > On Thu, Nov 9, 2017 at 12:33 PM, Ashutosh Bapat
> > <[hidden email]> wrote:
> > > Looking at order_qual_clauses(), we can say that a set of quals q1
> > > .... qn are ordered the same irrespective of the set of clauses they
> > > are subset of. E.g. if {q1 .. qn} is subset of Q (ordered as Qo) and
> > > also Q' (ordered as Q'o) the order in which they appear in Qo and Q'o
> > > is same. So, even if different paths segregate the clauses in
> > > different set, within each set the order is same. FWIW, we can order
> > > all clauses in largest set once and use that order every time. Albeit
> > > we will have to remember the order somewhere OR make the separator
> > > routine retain the order in the larger set, which I guess is true
> > > about all separator functions.
> >
> > I am not sure what to think about this patch, so moved to next CF. The
> > patch still applies. Hayamizu-san, it would be nice as well if you
> > could review other's patches. One patch reviewed for one patch
> > submitted, with equal difficulty. You should also get a community
> > account so as it is possible to add your name as an author of this
> > patch.
>
> While this patch does still apply, it doesn't pass the 'make check'
> regression tests and there appears to be concern raised up-thread about
> the performance impact.
>
> Yuto Hayamizu, it looks like some good questions were raised and which
> need to be addressed, in addition to making sure that the patch at least
> passes 'make check', so I'm leaving this as waiting-for-author.  If you
> believe the patch is no longer viable, we can change it to 'returned
> with feedback', otherwise, an updated patch and some responses to the
> questions raised would be great as we're already a week into this
> commitfest and this thread has been dormant since near the start of the
> last CF.
>
> Thanks!
>
> Stephen

Thank you for your kind guidance, and sorry for late reply.

Attached patch fixes regression tests and now it passes the 'make check' tests.
Since this patch changes cost estimation in set_baserel_size_estimates,
picked query plans were changed for some tests, so I've updated their
expected EXPLAIN results.

I'm going to answer raised questions by replying to each message soon.

regards,

Yuto Hayamizu

Mitigate-filter-cost-overestimation-v2.patch (13K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Overestimated filter cost and its mitigation

Yuto HAYAMIZU-2
In reply to this post by Tom Lane-2
On Wed, Nov 8, 2017 at 6:48 AM, Tom Lane <[hidden email]> wrote:
> Because (1) up to now there's been no need to consider the qual ordering
> till later, and (2) re-doing that sort for each path seemed unduly
> expensive.  If we're to try to estimate whether later quals will be
> reached, then sure the ordering becomes important.  I'm still concerned
> about (2) though.

This patch changes 'set_baserel_size_estimates', and it is called only once
for each range table entry, so sorting does not happen for each path and
its negative impact on optimizer performance is negligible.

While sorting quals does not cost so much, I noticed another performance
problem of this patch.
Patched 'set_baserel_size_estimates'  is O(N^2) (N is the number of quals)
and would take so long time when a query has a large qual list.
After sorting quals, it loops over quals q_1,q_2, ..., q_N to
estimated "weighted"
cost of each qual q_k. In each iteration, in order to calculate "the
weight" of q_k,
it calls clauselist_selectivity({q_1,q_2, ..., q_k}), which loops k
times in the function.

I've checked actual negative impact on optimization time with the
artificial query:
    SELECT count(*) FROM pgbench_accounts WHERE (randomly generated N quals);

N=50: 0.5966 msec (original), 0.938 msec (patched)
N=100: 1.1992 msec (original), 1.5396 msec (patched)
N=200: 1.9167 msec (original),  4.6676 msec (patched)
N=400: 2.7583 msec (original), 12.0786 msec (patched)
N=800: 5.2919 msec (original), 38.0228 msec (patched)
N=1600: 9.3428 msec (original), 167.737 msec (patched)

From this result, patched version is almost O(N^2).
100 quals in a query seems unusually large number, but I think there may exists
some real-world queries that have hundreds or thousands of quals,
so I feel this patch needs improvement for handling large N.

My idea of improving this patch is that give a threshold N_limit,
and for q_1 ... q_N_limit, do the same weighted cost estimation in the
current version of this patch.
For q_{N_limit+1} ...., stop calling clauselist_selectivity for
calculating the weight
and reuse the result of clauselist_selectivity({q_1,q_2, ..., q_N_limit}).
For example, if N_limit=100, additional overhead is only
sub-milliseconds per each range table entry,
and cost estimation is surely better than the current postgres implementation.

Any thoughts?

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Overestimated filter cost and its mitigation

Yuto HAYAMIZU-2
In reply to this post by Ashutosh Bapat
On Thu, Nov 9, 2017 at 12:33 PM, Ashutosh Bapat
<[hidden email]> wrote:
> different set, within each set the order is same. FWIW, we can order
> all clauses in largest set once and use that order every time. Albeit
> we will have to remember the order somewhere OR make the separator
> routine retain the order in the larger set, which I guess is true
> about all separator functions.

For this patch, sorting of a qual list happens only once for each
range table entry, not for each path. So there is no need for caching
sorted qual lists as far as I know.

----
regards,
Yuto Hayamizu

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Overestimated filter cost and its mitigation

Yuto HAYAMIZU-2
In reply to this post by Yuto HAYAMIZU-2
On Fri, Jan 19, 2018 at 5:07 PM, Yuto Hayamizu <[hidden email]> wrote:
> My idea of improving this patch is that give a threshold N_limit,
> and for q_1 ... q_N_limit, do the same weighted cost estimation in the
> current version of this patch.
> For q_{N_limit+1} ...., stop calling clauselist_selectivity for
> calculating the weight
> and reuse the result of clauselist_selectivity({q_1,q_2, ..., q_N_limit}).
> For example, if N_limit=100, additional overhead is only
> sub-milliseconds per each range table entry,
> and cost estimation is surely better than the current postgres implementation.

Attached patch implemented the improvement idea above.
With this patch attached, performance degradation of the test query
with many quals was <1%.
Example test query is attached.

regards,

----
Yuto Hayamizu

Mitigate-filter-cost-overestimation-v3.patch (14K) Download Attachment
testquery.sql (44K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Overestimated filter cost and its mitigation

Ashutosh Bapat
On Mon, Jan 29, 2018 at 10:42 AM, Yuto Hayamizu <[hidden email]> wrote:

> On Fri, Jan 19, 2018 at 5:07 PM, Yuto Hayamizu <[hidden email]> wrote:
>> My idea of improving this patch is that give a threshold N_limit,
>> and for q_1 ... q_N_limit, do the same weighted cost estimation in the
>> current version of this patch.
>> For q_{N_limit+1} ...., stop calling clauselist_selectivity for
>> calculating the weight
>> and reuse the result of clauselist_selectivity({q_1,q_2, ..., q_N_limit}).
>> For example, if N_limit=100, additional overhead is only
>> sub-milliseconds per each range table entry,
>> and cost estimation is surely better than the current postgres implementation.
>
> Attached patch implemented the improvement idea above.
> With this patch attached, performance degradation of the test query
> with many quals was <1%.
> Example test query is attached.
>

I looked at the patch and the idea sounds reasonable to me. But I
still have doubts about the usefulness of the patch. What this patch
does is to produce more accurate costs of scan of a single relation.
That's good, but it does that for all the paths created for that
relation. So the accurate estimate doesn't help us to choose one
method of scanning the relation over the other method. It also does
not affect the costs of different join paths, sort paths etc. IOW, I
can not imagine a find a query which will be planned in a different
way than upstream when we apply the patch and the new plan will be
more efficient than the previous one. I might be missing something
but. Can you please provide such a query.

In the patch clauselist_selectivity() gets called repeatedly for every
new qual added to the clause list. Instead, if we try to combine the
cost/row estimation with order_qual_clauses() or
clauselist_selectivity(), we might be able to what the current patch
does in O(n). clauselist_selectivity() calculates combined selectivity
of all the given quals in O(n). We should do something similar to that
in this case.

As suggested upthread by Tom, it will be more useful to have this
feature work with not just scans but joins as well.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Overestimated filter cost and its mitigation

Dean Rasheed-3
On 26 July 2018 at 07:12, Ashutosh Bapat
<[hidden email]> wrote:
> In the patch clauselist_selectivity() gets called repeatedly for every
> new qual added to the clause list. Instead, if we try to combine the
> cost/row estimation with order_qual_clauses() or
> clauselist_selectivity(), we might be able to what the current patch
> does in O(n). clauselist_selectivity() calculates combined selectivity
> of all the given quals in O(n). We should do something similar to that
> in this case.

Duplicating the logic of clauselist_selectivity() seems like quite a
lot of effort to go to just for improved filter cost estimates. Note
also that clauselist_selectivity() might get a good deal more complex
with multivariate stats.

Perhaps a reasonable simplification would be to just treat the clauses
as independent when estimating the filter cost, and so use the
following as a "good enough" estimate for the filter cost:

  cost(A) + sel(A) * cost(B) + sel(A) * sel(B) * cost(C) + ...

That would probably be an improvement over what we currently have. It
would be O(n) to compute, and it would probably use the cached
selectivity estimates for the clauses.

Note also that with this simplified expression for the filter cost, it
would be possible to improve order_qual_clauses() to take into account
the clause selectivities when sorting the clauses, and minimise the
cost of the filter by evaluating more selective clauses sooner, if
they're not too expensive. I believe that amounts to sorting by
cost/(1-sel) rather than just cost, except for clauses with sel==1,
which it makes sense to move to the end, and just sort by cost.

Regards,
Dean

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Overestimated filter cost and its mitigation

Ashutosh Bapat
On Thu, Jul 26, 2018 at 10:30 PM, Dean Rasheed <[hidden email]> wrote:

> On 26 July 2018 at 07:12, Ashutosh Bapat
> <[hidden email]> wrote:
>> In the patch clauselist_selectivity() gets called repeatedly for every
>> new qual added to the clause list. Instead, if we try to combine the
>> cost/row estimation with order_qual_clauses() or
>> clauselist_selectivity(), we might be able to what the current patch
>> does in O(n). clauselist_selectivity() calculates combined selectivity
>> of all the given quals in O(n). We should do something similar to that
>> in this case.
>
> Duplicating the logic of clauselist_selectivity() seems like quite a
> lot of effort to go to just for improved filter cost estimates. Note
> also that clauselist_selectivity() might get a good deal more complex
> with multivariate stats.

I am not suggesting to duplicate code in clauselist_selectivity().
Instead I am suggesting that we incorporate costing in that function.
I don't know how feasible is that.

>
> Perhaps a reasonable simplification would be to just treat the clauses
> as independent when estimating the filter cost, and so use the
> following as a "good enough" estimate for the filter cost:
>
>   cost(A) + sel(A) * cost(B) + sel(A) * sel(B) * cost(C) + ...
>
> That would probably be an improvement over what we currently have. It
> would be O(n) to compute, and it would probably use the cached
> selectivity estimates for the clauses.

That looks like a good trade-off. But looking at the following comment
in clause_selectivity(), I doubt if this will work in all the cases
        /*
         * If possible, cache the result of the selectivity calculation for
         * the clause.  We can cache if varRelid is zero or the clause
         * contains only vars of that relid --- otherwise varRelid will affect
         * the result, so mustn't cache.  Outer join quals might be examined
         * with either their join's actual jointype or JOIN_INNER, so we need
         * two cache variables to remember both cases.  Note: we assume the
         * result won't change if we are switching the input relations or
         * considering a unique-ified case, so we only need one cache variable
         * for all non-JOIN_INNER cases.
         */


>
> Note also that with this simplified expression for the filter cost, it
> would be possible to improve order_qual_clauses() to take into account
> the clause selectivities when sorting the clauses, and minimise the
> cost of the filter by evaluating more selective clauses sooner, if
> they're not too expensive. I believe that amounts to sorting by
> cost/(1-sel) rather than just cost, except for clauses with sel==1,
> which it makes sense to move to the end, and just sort by cost.

+1, if we could do that.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Overestimated filter cost and its mitigation

Michael Paquier-2
On Fri, Jul 27, 2018 at 02:19:27PM +0530, Ashutosh Bapat wrote:
> [ removal of latest arguments ]
> +1, if we could do that.

The patch seems to have stuck a bit, so I am marking it as returned with
feedback because of no activity.
--
Michael

signature.asc (849 bytes) Download Attachment