Columns correlation and adaptive query optimization

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
12 messages Options
Reply | Threaded
Open this post in threaded view
|

Columns correlation and adaptive query optimization

konstantin knizhnik
Hi hackers,

Errors in selectivity estimations is one of the main reason of bad plans
generation by Postgres optimizer.
Postgres estimates selectivity based on the collected statistic
(histograms).
While it is able to more or less precisely estimated selectivity of
simple predicate for particular table,
it is much more difficult to estimate selectivity for result of join of
several tables and for complex predicate consisting of several
conjuncts/disjuncts
accessing different columns.

Postgres is not able to take in account correlation between columns
unless correspondent multicolumn statistic is explicitly created.
But even if such statistic is created, it can not be used in join
selectivity estimation.

The problem with adjusting selectivity using machine learning based on
the results of EXPLAIN ANALYZE was address in AQO project:

https://github.com/postgrespro/aqo

There are still many issues with proposed AQO approach (for example, it
doesn't take in account concrete constant values).
We are going  to continue its improvement.

But here I wan to propose much simpler patch which allows two things:
1. Use extended statistic in estimation of join selectivity
2. Create on demand multicolumn statistic in auto_explain extension if
there is larger gap between real and estimated number of tuples for the
concrete plan node.


create table inner_tab(x integer, y integer);
create table outer_tab(pk integer primary key, x integer, y integer);
create index on inner_tab(x,y);
insert into outer_tab values (generate_series(1,100000),
generate_series(1,100000), generate_series(1,100000)*10);
insert into inner_tab values (generate_series(1,1000000)/10,
generate_series(1,1000000)/10*10);
analyze inner_tab;
analyze outer_tab;


Without this patch:
explain select * from outer_tab join inner_tab using(x,y) where pk=1;
                                           QUERY PLAN
----------------------------------------------------------------------------------------------
  Nested Loop  (cost=0.72..16.77 rows=1 width=12)
    ->  Index Scan using outer_tab_pkey on outer_tab (cost=0.29..8.31
rows=1 width=12)
          Index Cond: (pk = 1)
    ->  Index Only Scan using inner_tab_x_y_idx on inner_tab
(cost=0.42..8.45 rows=1 width=8)
          Index Cond: ((x = outer_tab.x) AND (y = outer_tab.y))
(5 rows)


With this patch:

load 'auto_explain';
set auto_explain.log_min_duration=0;
set auto_explain.add_statistics_threshold=10.0;
set auto_explain.log_analyze=on;
select * from outer_tab join inner_tab using(x,y) where pk=1;
analyze inner_tab;
analyze outer_tab;

explain select * from outer_tab join inner_tab using(x,y) where pk=1;
                                            QUERY PLAN
------------------------------------------------------------------------------------------------
  Nested Loop  (cost=0.72..32.79 rows=10 width=12)
    ->  Index Scan using outer_tab_pkey on outer_tab (cost=0.29..8.31
rows=1 width=12)
          Index Cond: (pk = 1)
    ->  Index Only Scan using inner_tab_x_y_idx on inner_tab
(cost=0.42..24.38 rows=10 width=8)
          Index Cond: ((x = outer_tab.x) AND (y = outer_tab.y))
(5 rows)


As you can see now estimation of join result is correct (10).

I attached two patches: one for using extended statistic for join
selectivity estimation and another for auto_explain to implicitly add
this extended statistic on demand.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


join_selectivity.patch (3K) Download Attachment
auto_explain.patch (9K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Columns correlation and adaptive query optimization

legrand legrand
Hello Konstantin,

What you have proposed regarding join_selectivity and multicolumn statistics
is a very good new !

Regarding your auto_explain modification, maybe an "advisor" mode would also
be helpfull (with auto_explain_add_statistics_threshold=-1 for exemple).
This would allow to track which missing statistic should be tested (manually
or in an other environment).

In my point of view this advice should be an option of the EXPLAIN command,
that should also permit
auto_explain module to propose "learning" phase.

Regards
PAscal



--
Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html


Reply | Threaded
Open this post in threaded view
|

Re: Columns correlation and adaptive query optimization

konstantin knizhnik


On 15.10.2019 1:20, legrand legrand wrote:

> Hello Konstantin,
>
> What you have proposed regarding join_selectivity and multicolumn statistics
> is a very good new !
>
> Regarding your auto_explain modification, maybe an "advisor" mode would also
> be helpfull (with auto_explain_add_statistics_threshold=-1 for exemple).
> This would allow to track which missing statistic should be tested (manually
> or in an other environment).
>
> In my point of view this advice should be an option of the EXPLAIN command,
> that should also permit
> auto_explain module to propose "learning" phase.
Thank you for good suggestion. Advisor mode is really good idea.
I have added "auto_explain.suggest_only" GUC.
When it is switched on, suggested CREATE STATISTICS statement is just
printed in  log but not actually created.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


join_selectivity.patch (3K) Download Attachment
auto_explain_create_statistics.patch (10K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Columns correlation and adaptive query optimization

konstantin knizhnik
Smarter version of join selectivity patch handling cases like this:


explain select * from outer_tab join inner_tab using(x,y) where x=1;
                                            QUERY PLAN
------------------------------------------------------------------------------------------------
  Nested Loop  (cost=0.42..1815.47 rows=10 width=12)
    Join Filter: (outer_tab.y = inner_tab.y)
    ->  Seq Scan on outer_tab  (cost=0.00..1791.00 rows=1 width=12)
          Filter: (x = 1)
    ->  Index Only Scan using inner_tab_x_y_idx on inner_tab
(cost=0.42..24.35 rows=10 width=8)
          Index Cond: (x = 1)
(6 rows)


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


auto_explain_create_statistics.patch (10K) Download Attachment
join_selectivity-2.patch (17K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Columns correlation and adaptive query optimization

konstantin knizhnik
New version of patch implicitly adding multicolumn statistic in
auto_explain extension and using it in optimizer for more precise
estimation of join selectivity.
This patch fixes race condition while adding statistics and restricts
generated statistic name to fit in 64 bytes (NameData).

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


auto_explain_create_statistic-3.patch (18K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Columns correlation and adaptive query optimization

David Steele
On 12/24/19 3:15 AM, Konstantin Knizhnik wrote:
> New version of patch implicitly adding multicolumn statistic in
> auto_explain extension and using it in optimizer for more precise
> estimation of join selectivity.
> This patch fixes race condition while adding statistics and restricts
> generated statistic name to fit in 64 bytes (NameData).

This patch no longer applies: https://commitfest.postgresql.org/27/2386/

The CF entry has been updated to Waiting on Author.

Regards,
--
-David
[hidden email]


Reply | Threaded
Open this post in threaded view
|

Re: Columns correlation and adaptive query optimization

konstantin knizhnik


On 24.03.2020 20:12, David Steele wrote:

> On 12/24/19 3:15 AM, Konstantin Knizhnik wrote:
>> New version of patch implicitly adding multicolumn statistic in
>> auto_explain extension and using it in optimizer for more precise
>> estimation of join selectivity.
>> This patch fixes race condition while adding statistics and restricts
>> generated statistic name to fit in 64 bytes (NameData).
>
> This patch no longer applies: https://commitfest.postgresql.org/27/2386/
>
> The CF entry has been updated to Waiting on Author.
>
> Regards,
Rebased patch is attached.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


auto_explain_create_statistic-4.patch (18K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Columns correlation and adaptive query optimization

David Steele
On 3/25/20 6:57 AM, Konstantin Knizhnik wrote:

>
>
> On 24.03.2020 20:12, David Steele wrote:
>> On 12/24/19 3:15 AM, Konstantin Knizhnik wrote:
>>> New version of patch implicitly adding multicolumn statistic in
>>> auto_explain extension and using it in optimizer for more precise
>>> estimation of join selectivity.
>>> This patch fixes race condition while adding statistics and restricts
>>> generated statistic name to fit in 64 bytes (NameData).
>>
>> This patch no longer applies: https://commitfest.postgresql.org/27/2386/
>>
>> The CF entry has been updated to Waiting on Autho
>
> Rebased patch is attached.

The patch applies now but there are error on Windows and Linux:
https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.85481
https://travis-ci.org/postgresql-cfbot/postgresql/builds/666729979

Regards,
--
-David
[hidden email]


Reply | Threaded
Open this post in threaded view
|

Re: Columns correlation and adaptive query optimization

konstantin knizhnik


On 25.03.2020 16:00, David Steele wrote:

> On 3/25/20 6:57 AM, Konstantin Knizhnik wrote:
>>
>>
>> On 24.03.2020 20:12, David Steele wrote:
>>> On 12/24/19 3:15 AM, Konstantin Knizhnik wrote:
>>>> New version of patch implicitly adding multicolumn statistic in
>>>> auto_explain extension and using it in optimizer for more precise
>>>> estimation of join selectivity.
>>>> This patch fixes race condition while adding statistics and
>>>> restricts generated statistic name to fit in 64 bytes (NameData).
>>>
>>> This patch no longer applies:
>>> https://commitfest.postgresql.org/27/2386/
>>>
>>> The CF entry has been updated to Waiting on Autho
>>
>> Rebased patch is attached.
>
> The patch applies now but there are error on Windows and Linux:
> https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.85481 
>
> https://travis-ci.org/postgresql-cfbot/postgresql/builds/666729979
>
> Regards,
Sorry, yet another patch is attached.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


auto_explain_create_statistic-5.patch (18K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Columns correlation and adaptive query optimization

Rafia Sabih
Hello,

This sounded like an interesting addition to postgresql. I gave some
time to it today to review, here are few comments,

On Wed, 25 Mar 2020 at 14:28, Konstantin Knizhnik
<[hidden email]> wrote:

>
>
>
> On 25.03.2020 16:00, David Steele wrote:
> > On 3/25/20 6:57 AM, Konstantin Knizhnik wrote:
> >>
> >>
> >> On 24.03.2020 20:12, David Steele wrote:
> >>> On 12/24/19 3:15 AM, Konstantin Knizhnik wrote:
> >>>> New version of patch implicitly adding multicolumn statistic in
> >>>> auto_explain extension and using it in optimizer for more precise
> >>>> estimation of join selectivity.
> >>>> This patch fixes race condition while adding statistics and
> >>>> restricts generated statistic name to fit in 64 bytes (NameData).
> >>>
> >>> This patch no longer applies:
> >>> https://commitfest.postgresql.org/27/2386/
> >>>
> >>> The CF entry has been updated to Waiting on Autho
> >>
> >> Rebased patch is attached.
> >
> > The patch applies now but there are error on Windows and Linux:
> > https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.85481
> >
> > https://travis-ci.org/postgresql-cfbot/postgresql/builds/666729979
> >
> > Regards,
>
> Sorry, yet another patch is attached.
>
> --
> Konstantin Knizhnik
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>

+static void
+AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es);
+

This doesn't look like the right place for it, you might want to
declare it with other functions in the starting of the file.

Also, there is no description about any of the functions here,
wouldn’t hurt having some more comments there.

A few of more questions that cross my mind at this point,

- have you tried measuring the extra cost we have to pay for this
mores statistics , and also compare it with the benefit it gives in
terms of accuracy.
- I would also be interested in understanding if there are cases when
adding this extra step doesn’t help and have you excluded them already
or if some of them are easily identifiable at this stage...?
- is there any limit  on the number of columns for which this will
work, or should there be any such limit...?

--
Regards,
Rafia Sabih


Reply | Threaded
Open this post in threaded view
|

Re: Columns correlation and adaptive query optimization

konstantin knizhnik
Thank you very much for review.

On 25.03.2020 20:04, Rafia Sabih wrote:

>
> +static void
> +AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es);
> +
>
> This doesn't look like the right place for it, you might want to
> declare it with other functions in the starting of the file.
>
> Also, there is no description about any of the functions here,
> wouldn’t hurt having some more comments there.

Sorry, I will fix it.
Actually this patch contains of two independent parts:
first allows to use auto_explain extension to generate mutlicolumn
statistic for variables used in clauses
for which selectivity estimation gives wrong result. It affects only
auto_explain extension.

Second part allows to use multicolumn statistic for join selectivity
estimation.
As far as I know extended statistic is now actively improved:
https://www.postgresql.org/message-id/flat/20200309000157.ig5tcrynvaqu4ixd%40development#bfbdf9c41c31ef92819dfc5ecde4a67c

I think that using extended statistic for join selectivity is very
important and should also be addressed.
If my approach is on so good, I will be pleased for other suggestions.

>
> A few of more questions that cross my mind at this point,
>
> - have you tried measuring the extra cost we have to pay for this
> mores statistics , and also compare it with the benefit it gives in
> terms of accuracy.
Adding statistic not always leads to performance improvement but I never
observed any performance degradation caused by presence of extended
statistic.
Definitely we can manually create too many extended statistic entries
for different subsets of columns.
And it certainly increase planning time because optimizer has to
consider more alternatives.
But in practice I never noticed such slowdown.

> - I would also be interested in understanding if there are cases when
> adding this extra step doesn’t help and have you excluded them already
> or if some of them are easily identifiable at this stage...?

Unfortunately there are many cases when extended statistic can not help.
Either because optimizer is not able to use it (for example my patch
consider only cases with strict equality comparison,
but if you use predicate like "a.x=b.x and  a.y in (1,2,3)"  then
extended statistic for <x,y> can not be used.
Either because collected statistic itself is not precise enough ,
especially in case of data skews.


> - is there any limit  on the number of columns for which this will
> work, or should there be any such limit...?
>
Right now there is limit for maximal number of columns used in extended
statistic: 8 columns.
But in practice I rarely see join predicates involving more than 3 columns.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Reply | Threaded
Open this post in threaded view
|

Re: Columns correlation and adaptive query optimization

konstantin knizhnik
In reply to this post by Rafia Sabih


On 25.03.2020 20:04, Rafia Sabih wrote:
>
> Also, there is no description about any of the functions here,
> wouldn’t hurt having some more comments there.
>

Attached please find new version of the patch with more comments and
descriptions added.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


auto_explain_create_statistic-6.patch (19K) Download Attachment