Hi hackers,
Errors in selectivity estimation are one of the main reasons the Postgres optimizer generates bad plans. Postgres estimates selectivity based on collected statistics (histograms). While it can more or less precisely estimate the selectivity of a simple predicate on a particular table, it is much more difficult to estimate the selectivity of the result of joining several tables, or of a complex predicate consisting of several conjuncts/disjuncts accessing different columns. Postgres cannot take correlation between columns into account unless the corresponding multicolumn statistics are explicitly created. But even if such statistics are created, they cannot be used in join selectivity estimation.

The problem of adjusting selectivity using machine learning based on the results of EXPLAIN ANALYZE was addressed in the AQO project: https://github.com/postgrespro/aqo
There are still many issues with the proposed AQO approach (for example, it doesn't take concrete constant values into account). We are going to continue its improvement. But here I want to propose a much simpler patch which allows two things:

1. Use extended statistics in the estimation of join selectivity.
2. Create multicolumn statistics on demand in the auto_explain extension if there is a large gap between the real and estimated number of tuples for a concrete plan node.
create table inner_tab(x integer, y integer);
create table outer_tab(pk integer primary key, x integer, y integer);
create index on inner_tab(x,y);
insert into outer_tab values (generate_series(1,100000), generate_series(1,100000), generate_series(1,100000)*10);
insert into inner_tab values (generate_series(1,1000000)/10, generate_series(1,1000000)/10*10);
analyze inner_tab;
analyze outer_tab;

Without this patch:

explain select * from outer_tab join inner_tab using(x,y) where pk=1;
                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.72..16.77 rows=1 width=12)
   ->  Index Scan using outer_tab_pkey on outer_tab  (cost=0.29..8.31 rows=1 width=12)
         Index Cond: (pk = 1)
   ->  Index Only Scan using inner_tab_x_y_idx on inner_tab  (cost=0.42..8.45 rows=1 width=8)
         Index Cond: ((x = outer_tab.x) AND (y = outer_tab.y))
(5 rows)

With this patch:

load 'auto_explain';
set auto_explain.log_min_duration=0;
set auto_explain.add_statistics_threshold=10.0;
set auto_explain.log_analyze=on;
select * from outer_tab join inner_tab using(x,y) where pk=1;
analyze inner_tab;
analyze outer_tab;
explain select * from outer_tab join inner_tab using(x,y) where pk=1;
                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.72..32.79 rows=10 width=12)
   ->  Index Scan using outer_tab_pkey on outer_tab  (cost=0.29..8.31 rows=1 width=12)
         Index Cond: (pk = 1)
   ->  Index Only Scan using inner_tab_x_y_idx on inner_tab  (cost=0.42..24.38 rows=10 width=8)
         Index Cond: ((x = outer_tab.x) AND (y = outer_tab.y))
(5 rows)

As you can see, the estimate of the join result is now correct (10).

I attached two patches: one for using extended statistics in join selectivity estimation and another for auto_explain to implicitly add such extended statistics on demand.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
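The expected join cardinality in this example can be double-checked outside the database. The following Python sketch mirrors the generate_series expressions above (using integer division, as SQL does for integers) and counts the inner rows matching the outer row with pk = 1:

```python
# Mirror the SQL data generation from the example:
# outer_tab: pk = k, x = k, y = k * 10          for k in 1..100000
# inner_tab: x = k // 10, y = (k // 10) * 10    for k in 1..1000000
inner = [(k // 10, k // 10 * 10) for k in range(1, 1000001)]

# The outer row with pk = 1 has x = 1, y = 10.
outer_x, outer_y = 1, 10

# Each x value repeats ~10 times in inner_tab and y is functionally
# dependent on x, so exactly 10 inner rows join with this outer row.
matches = sum(1 for x, y in inner if x == outer_x and y == outer_y)
print(matches)  # 10
```

This is the value the patched planner estimates (rows=10); the unpatched planner multiplies the per-column selectivities of the x and y conditions as if they were independent and arrives at rows=1.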
Hello Konstantin,
What you have proposed regarding join_selectivity and multicolumn statistics is very good news!

Regarding your auto_explain modification, maybe an "advisor" mode would also be helpful (with auto_explain.add_statistics_threshold=-1, for example). This would allow tracking which missing statistics should be tested (manually or in another environment).

From my point of view this advice should be an option of the EXPLAIN command, which would also permit the auto_explain module to propose a "learning" phase.

Regards
PAscal
On 15.10.2019 1:20, legrand legrand wrote:
> Hello Konstantin,
>
> What you have proposed regarding join_selectivity and multicolumn statistics
> is a very good new !
>
> Regarding your auto_explain modification, maybe an "advisor" mode would also
> be helpfull (with auto_explain_add_statistics_threshold=-1 for exemple).
> This would allow to track which missing statistic should be tested (manually
> or in an other environment).
>
> In my point of view this advice should be an option of the EXPLAIN command,
> that should also permit
> auto_explain module to propose "learning" phase.

When such a mode is switched on, the suggested CREATE STATISTICS statement is just printed in the log and not actually executed.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Here is a smarter version of the join selectivity patch, handling cases like this:
explain select * from outer_tab join inner_tab using(x,y) where x=1;
                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.42..1815.47 rows=10 width=12)
   Join Filter: (outer_tab.y = inner_tab.y)
   ->  Seq Scan on outer_tab  (cost=0.00..1791.00 rows=1 width=12)
         Filter: (x = 1)
   ->  Index Only Scan using inner_tab_x_y_idx on inner_tab  (cost=0.42..24.35 rows=10 width=8)
         Index Cond: (x = 1)
(6 rows)

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Here is a new version of the patch, implicitly adding multicolumn statistics in
the auto_explain extension and using them in the optimizer for more precise
estimation of join selectivity.
This patch fixes a race condition while adding statistics and restricts the
generated statistics name to fit in 64 bytes (NameData).

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
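The NameData restriction mentioned above means an automatically generated statistics name must be cut to fit a Name (at most NAMEDATALEN-1 = 63 bytes by default). A small Python sketch of such truncation; the `<table>_<col1>_<col2>` naming scheme and the helper are illustrative assumptions, not the patch's actual code:

```python
NAMEDATALEN = 64  # PostgreSQL compile-time default; a Name holds at most 63 bytes

def make_stat_name(table: str, columns: list[str]) -> str:
    """Build a '<table>_<col1>_<col2>...' statistics name truncated to NameData size.

    The naming scheme here is an assumption for illustration only.
    """
    name = "_".join([table] + columns)
    # Truncate at the byte level, as the server does for over-long identifiers.
    return name.encode("utf-8")[:NAMEDATALEN - 1].decode("utf-8", errors="ignore")

print(make_stat_name("inner_tab", ["x", "y"]))  # inner_tab_x_y
long_name = make_stat_name("t" * 70, ["col_a", "col_b"])
print(len(long_name.encode("utf-8")) <= 63)     # True
```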
On 12/24/19 3:15 AM, Konstantin Knizhnik wrote:
> New version of patch implicitly adding multicolumn statistic in
> auto_explain extension and using it in optimizer for more precise
> estimation of join selectivity.
> This patch fixes race condition while adding statistics and restricts
> generated statistic name to fit in 64 bytes (NameData).

This patch no longer applies: https://commitfest.postgresql.org/27/2386/

The CF entry has been updated to Waiting on Author.

Regards,
--
-David
[hidden email]
On 24.03.2020 20:12, David Steele wrote:
> On 12/24/19 3:15 AM, Konstantin Knizhnik wrote:
>> New version of patch implicitly adding multicolumn statistic in
>> auto_explain extension and using it in optimizer for more precise
>> estimation of join selectivity.
>> This patch fixes race condition while adding statistics and restricts
>> generated statistic name to fit in 64 bytes (NameData).
>
> This patch no longer applies: https://commitfest.postgresql.org/27/2386/
>
> The CF entry has been updated to Waiting on Author.
>
> Regards,

Rebased patch is attached.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 3/25/20 6:57 AM, Konstantin Knizhnik wrote:
> On 24.03.2020 20:12, David Steele wrote:
>> On 12/24/19 3:15 AM, Konstantin Knizhnik wrote:
>>> New version of patch implicitly adding multicolumn statistic in
>>> auto_explain extension and using it in optimizer for more precise
>>> estimation of join selectivity.
>>> This patch fixes race condition while adding statistics and restricts
>>> generated statistic name to fit in 64 bytes (NameData).
>>
>> This patch no longer applies: https://commitfest.postgresql.org/27/2386/
>>
>> The CF entry has been updated to Waiting on Author.
>
> Rebased patch is attached.

The patch applies now but there are errors on Windows and Linux:

https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.85481
https://travis-ci.org/postgresql-cfbot/postgresql/builds/666729979

Regards,
--
-David
[hidden email]
On 25.03.2020 16:00, David Steele wrote:
> On 3/25/20 6:57 AM, Konstantin Knizhnik wrote:
>> On 24.03.2020 20:12, David Steele wrote:
>>> This patch no longer applies:
>>> https://commitfest.postgresql.org/27/2386/
>>>
>>> The CF entry has been updated to Waiting on Author.
>>
>> Rebased patch is attached.
>
> The patch applies now but there are errors on Windows and Linux:
> https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.85481
> https://travis-ci.org/postgresql-cfbot/postgresql/builds/666729979
>
> Regards,

Sorry, yet another patch is attached.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hello,
This sounded like an interesting addition to PostgreSQL. I gave some time to it today to review; here are a few comments.

On Wed, 25 Mar 2020 at 14:28, Konstantin Knizhnik <[hidden email]> wrote:
> On 25.03.2020 16:00, David Steele wrote:
> > On 3/25/20 6:57 AM, Konstantin Knizhnik wrote:
> >> On 24.03.2020 20:12, David Steele wrote:
> >>> On 12/24/19 3:15 AM, Konstantin Knizhnik wrote:
> >>>> New version of patch implicitly adding multicolumn statistic in
> >>>> auto_explain extension and using it in optimizer for more precise
> >>>> estimation of join selectivity.
> >>>> This patch fixes race condition while adding statistics and
> >>>> restricts generated statistic name to fit in 64 bytes (NameData).
> >>>
> >>> This patch no longer applies:
> >>> https://commitfest.postgresql.org/27/2386/
> >>>
> >>> The CF entry has been updated to Waiting on Author.
> >>
> >> Rebased patch is attached.
> >
> > The patch applies now but there are error on Windows and Linux:
> > https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.85481
> > https://travis-ci.org/postgresql-cfbot/postgresql/builds/666729979
> >
> > Regards,
>
> Sorry, yet another patch is attached.
>
> --
> Konstantin Knizhnik
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company

+static void
+AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es);
+

This doesn't look like the right place for it; you might want to declare it with the other functions at the start of the file.

Also, there is no description of any of the functions here; it wouldn't hurt to have some more comments there.

A few more questions that cross my mind at this point:

- Have you tried measuring the extra cost we have to pay for these additional statistics, and also compared it with the benefit they give in terms of accuracy?

- I would also be interested in understanding if there are cases when adding this extra step doesn't help. Have you excluded them already, or are some of them easily identifiable at this stage...?

- Is there any limit on the number of columns for which this will work, or should there be any such limit...?

--
Regards,
Rafia Sabih
Thank you very much for the review.

On 25.03.2020 20:04, Rafia Sabih wrote:
>
> +static void
> +AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es);
> +
>
> This doesn't look like the right place for it, you might want to
> declare it with other functions in the starting of the file.
>
> Also, there is no description about any of the functions here,
> wouldn’t hurt having some more comments there.

Sorry, I will fix it.

Actually, this patch consists of two independent parts. The first allows the auto_explain extension to generate multicolumn statistics for variables used in clauses for which selectivity estimation gives wrong results; it affects only the auto_explain extension. The second part allows multicolumn statistics to be used in join selectivity estimation.

As far as I know, extended statistics are now being actively improved:
https://www.postgresql.org/message-id/flat/20200309000157.ig5tcrynvaqu4ixd%40development#bfbdf9c41c31ef92819dfc5ecde4a67c

I think that using extended statistics for join selectivity is very important and should also be addressed. If my approach is not good enough, I will be pleased to hear other suggestions.

> A few of more questions that cross my mind at this point,
>
> - have you tried measuring the extra cost we have to pay for this
> mores statistics , and also compare it with the benefit it gives in
> terms of accuracy.

Adding statistics does not always lead to performance improvement, but I have never observed any performance degradation caused by the presence of extended statistics. Definitely we can manually create too many extended statistics entries for different subsets of columns, and that certainly increases planning time because the optimizer has to consider more alternatives. But in practice I have never noticed such a slowdown.

> - I would also be interested in understanding if there are cases when
> adding this extra step doesn’t help and have you excluded them already
> or if some of them are easily identifiable at this stage...?

Unfortunately there are many cases when extended statistics can not help. Either because the optimizer is not able to use them (for example, my patch considers only cases with strict equality comparison, so if you use a predicate like "a.x=b.x and a.y in (1,2,3)" then extended statistics for <x,y> can not be used), or because the collected statistics themselves are not precise enough, especially in case of data skew.

> - is there any limit on the number of columns for which this will
> work, or should there be any such limit...?

Right now there is a limit on the maximum number of columns used in extended statistics: 8 columns. But in practice I rarely see join predicates involving more than 3 columns.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 25.03.2020 20:04, Rafia Sabih wrote:
>
> Also, there is no description about any of the functions here,
> wouldn’t hurt having some more comments there.

Attached please find a new version of the patch with more comments and descriptions added.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hello,
On Thu, 26 Mar 2020 18:49:51 +0300
Konstantin Knizhnik <[hidden email]> wrote:

> Attached please find new version of the patch with more comments and
> descriptions added.

Adaptive query optimization is a very interesting feature for me, so I
looked into this patch. Here are some comments and questions.

(1)
This patch needs a rebase because clauselist_selectivity was modified to
improve the estimation of OR clauses.

(2)
If I understand correctly, your proposal consists of the following two features:

1. Add a feature to auto_explain that creates extended statistics automatically
   if the error in the estimated row count is large.

2. Improve row count estimation of join results by considering functional
   dependencies between vars in the join qual if the qual has more than one
   clause, and also functional dependencies between a var in the join qual
   and vars in quals of the inner/outer relation.

As you said, these two parts are independent of each other, so each feature
will work even without the other. I wonder whether it would be better to split
the patch again and register the parts with the commitfest separately.

(3)
+ DefineCustomBoolVariable("auto_explain.suggest_only",
+  "Do not create statistic but just record in WAL suggested create statistics statement.",
+  NULL,
+  &auto_explain_suggest_on

To imply that this parameter is related to add_statistics_threshold, it seems
better to me to use a more related name like add_statistics_suggest_only.

Also, additional documentation for the new parameters is required.

(4)
+ /*
+  * Prevent concurrent access to extended statistic table
+  */
+ stat_rel = table_open(StatisticExtRelationId, AccessExclusiveLock);
+ slot = table_slot_create(stat_rel, NULL);
+ scan = table_beginscan_catalog(stat_rel, 2, entry);
(snip)
+ table_close(stat_rel, AccessExclusiveLock);
+ }

When I tested the auto_explain part, I got the following WARNINGs.

WARNING: buffer refcount leak: [097] (rel=base/12879/3381, blockNum=0, flags=0x83000000, refcount=1 2)
WARNING: buffer refcount leak: [097] (rel=base/12879/3381, blockNum=0, flags=0x83000000, refcount=1 1)
WARNING: relcache reference leak: relation "pg_statistic_ext" not closed
WARNING: TupleDesc reference leak: TupleDesc 0x7fa439266338 (12029,-1) still referenced
WARNING: Snapshot reference leak: Snapshot 0x55c332c10418 still referenced

To suppress these, I think we need table_endscan(scan) and
ExecDropSingleTupleTableSlot(slot) before finishing this function.

(6)
+ elog(NOTICE, "Auto_explain suggestion: CREATE STATISTICS %s %s FROM %s", stat_name, create_stat_stmt, rel_name);

We should use ereport instead of elog for log messages.

(7)
+ double dep = find_var_dependency(root, innerRelid, var, clauses_attnums);
+ if (dep != 0.0)
+ {
+     s1 *= dep + (1 - dep) * s2;
+     continue;
+ }

I found the following comment in clauselist_apply_dependencies():

 * we actually combine selectivities using the formula
 *
 * P(a,b) = f * Min(P(a), P(b)) + (1-f) * P(a) * P(b)

So, isn't it necessary to use the same formula in this patch? That is,

 s1 *= dep + (1-dep) * s2            (if s1 <= s2)
 s1 *= dep * (s2/s1) + (1-dep) * s2  (otherwise)

(8)
+/*
+ * Try to find dependency between variables.
+ * var: variables whose dependencies are considered
+ * join_vars: list of variables used in other clauses
+ * This function returns the strongest dependency for some subset of variables
+ * from the same relation, or 0.0 if no dependency was found.
+ */
+static double
+var_depends_on(PlannerInfo *root, Var* var, List* clause_vars)
+{

The comment mentions join_vars but the actual argument name is clause_vars,
so they need to be unified.

(9)
Currently, it only considers functional dependency statistics. Can we also
consider the multivariate MCV list, and would that be useful?

(10)
To achieve adaptive query optimization (AQO) in PostgreSQL, this patch
proposes to use auto_explain for getting feedback from actual results.
So, could auto_explain become an infrastructure for AQO in the future?
Or do you have any plan or idea to make a built-in infrastructure for AQO?

Regards,
Yugo Nagata

--
Yugo NAGATA <[hidden email]>
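The formula quoted in point (7) above can be checked numerically. A small Python sketch (function names made up for illustration) comparing the patch's combination `s1 *= dep + (1-dep)*s2` with the Min-based formula from clauselist_apply_dependencies():

```python
def dep_combine_min(f: float, pa: float, pb: float) -> float:
    """P(a,b) = f * Min(P(a), P(b)) + (1-f) * P(a) * P(b),
    the formula from clauselist_apply_dependencies()."""
    return f * min(pa, pb) + (1 - f) * pa * pb

def dep_combine_patch(f: float, s1: float, s2: float) -> float:
    """The patch's combination: s1 * (dep + (1-dep) * s2)."""
    return s1 * (f + (1 - f) * s2)

f, pa, pb = 0.8, 0.01, 0.05
print(round(dep_combine_min(f, pa, pb), 6))    # 0.0081 (uses Min, symmetric in a and b)
print(round(dep_combine_patch(f, pa, pb), 6))  # 0.0081 (equals the Min form when s1 <= s2)
print(round(dep_combine_patch(f, pb, pa), 6))  # 0.0401 (differs when s1 > s2)
```

When s1 <= s2 the two agree algebraically: s1*(f + (1-f)*s2) = f*Min(s1,s2) + (1-f)*s1*s2. When s1 > s2 the patch's form overshoots, which is exactly the case the reviewer's correction `s1 *= dep * (s2/s1) + (1-dep) * s2` addresses.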
Hello,
Thank you for the review. My answers are inline.

On 21.01.2021 15:30, Yugo NAGATA wrote:
> Hello,
>
> On Thu, 26 Mar 2020 18:49:51 +0300
> Konstantin Knizhnik <[hidden email]> wrote:
>> Attached please find new version of the patch with more comments and
>> descriptions added.
>
> Adaptive query optimization is very interesting feature for me, so I
> looked into this patch. Here are some comments and questions.
>
> (1)
> This patch needs rebase because clauselist_selectivity was modified to
> improve estimation of OR clauses.

Rebased version is attached.

> (2)
> If I understand correctly, your proposal consists of the following two features.
>
> 1. Add a feature to auto_explain that creates an extended statistic automatically
> if an error on estimated rows number is large.
>
> 2. Improve rows number estimation of join results by considering functional
> dependencies between vars in the join qual if the qual has more than one clauses,
> and also functional dependencies between a var in the join qual and vars in quals
> of the inner/outer relation.
>
> As you said, these two parts are independent each other, so one feature will work
> even if we don't assume the other. I wonder it would be better to split the patch
> again, and register them to commitfest separately.

I agree with you that these are two almost unrelated changes, although without
the clausesel patch the additional statistics can not improve query planning.
But I already have too many patches at the commitfest.
Maybe it will be enough to split this patch in two?

> (3)
> + DefineCustomBoolVariable("auto_explain.suggest_only",
> +  "Do not create statistic but just record in WAL suggested create statistics statement.",
> +  NULL,
> +  &auto_explain_suggest_on
>
> To imply that this parameter is involving to add_statistics_threshold, it seems
> better for me to use more related name like add_statistics_suggest_only.
>
> Also, additional documentations for new parameters are required.

Done.

> (4)
> + /*
> +  * Prevent concurrent access to extended statistic table
> +  */
> + stat_rel = table_open(StatisticExtRelationId, AccessExclusiveLock);
> + slot = table_slot_create(stat_rel, NULL);
> + scan = table_beginscan_catalog(stat_rel, 2, entry);
> (snip)
> + table_close(stat_rel, AccessExclusiveLock);
> + }
>
> When I tested the auto_explain part, I got the following WARNING.
>
> WARNING: buffer refcount leak: [097] (rel=base/12879/3381, blockNum=0, flags=0x83000000, refcount=1 2)
> WARNING: buffer refcount leak: [097] (rel=base/12879/3381, blockNum=0, flags=0x83000000, refcount=1 1)
> WARNING: relcache reference leak: relation "pg_statistic_ext" not closed
> WARNING: TupleDesc reference leak: TupleDesc 0x7fa439266338 (12029,-1) still referenced
> WARNING: Snapshot reference leak: Snapshot 0x55c332c10418 still referenced
>
> To suppress this, I think we need table_endscan(scan) and
> ExecDropSingleTupleTableSlot(slot) before finishing this function.

Thank you for noticing the problem, fixed.

> (6)
> + elog(NOTICE, "Auto_explain suggestion: CREATE STATISTICS %s %s FROM %s", stat_name, create_stat_stmt, rel_name);
>
> We should use ereport instead of elog for log messages.

Changed.

> (7)
> I found the following comment of clauselist_apply_dependencies():
>
>  * we actually combine selectivities using the formula
>  *
>  * P(a,b) = f * Min(P(a), P(b)) + (1-f) * P(a) * P(b)
>
> so, is it not necessary using the same formula in this patch? That is,
>
>  s1 *= dep + (1-dep) * s2            (if s1 <= s2)
>  s1 *= dep * (s2/s1) + (1-dep) * s2  (otherwise)

Makes sense.

> (8)
> The comment mentions join_vars but the actual argument name is clause_vars,
> so it needs unification.

Fixed.

> (9)
> Currently, it only consider functional dependencies statistics. Can we also
> consider multivariate MCV list, and is it useful?

Right now auto_explain creates statistics without an explicit specification of
the statistics kind. According to the documentation, all supported statistics
kinds should be created in this case:

    statistics_kind
        A statistics kind to be computed in this statistics object.
        Currently supported kinds are ...

> (10)
> To achieve adaptive query optimization (AQO) in PostgreSQL, this patch proposes
> to use auto_explain for getting feedback from actual results. So, could auto_explain
> be a infrastructure of AQO in future? Or, do you have any plan or idea to make
> built-in infrastructure for AQO?

Sorry, I do not have an answer for this question.
I just patched the auto_explain extension because it is already doing half of
the required work (analyzing expensive statements). This could certainly be
moved to a separate extension, but in that case it would partly duplicate the
existing functionality and settings of auto_explain (like the statement
execution time threshold), and I am not sure that that is good. On the other
hand, my patch makes the auto_explain extension do some unexpected work...

Actually, the task of adaptive query optimization is much bigger. We have a
separate AQO extension which tries to use machine learning to correctly
adjust estimations. This patch is much simpler and uses an existing mechanism
(extended statistics) to improve estimations.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Mon, 25 Jan 2021 16:27:25 +0300
Konstantin Knizhnik <[hidden email]> wrote:

> Hello,
>
> Thank you for review.
> My answers are inside.

Thank you for updating the patch and answering my questions.

> > (2)
> > If I understand correctly, your proposal consists of the following two features.
> >
> > 1. Add a feature to auto_explain that creates an extended statistic automatically
> > if an error on estimated rows number is large.
> >
> > 2. Improve rows number estimation of join results by considering functional
> > dependencies between vars in the join qual if the qual has more than one clauses,
> > and also functional dependencies between a var in the join qual and vars in quals
> > of the inner/outer relation.
> >
> > As you said, these two parts are independent each other, so one feature will work
> > even if we don't assume the other. I wonder it would be better to split the patch
> > again, and register them to commitfest separately.
>
> I agree with you that this are two almost unrelated changes, although
> without clausesel patch additional statistic can not improve query planning.

I think extended statistics created by the auto_explain patch can improve query
planning even without the clausesel patch. For example, suppose the following case:

postgres=# create table t ( i int, j int);
CREATE TABLE
postgres=# insert into t select i/10, i/100 from generate_series(1,1000000) i;
INSERT 0 1000000
postgres=# analyze t;
ANALYZE
postgres=# explain analyze select * from t where i = 100 and j = 10;
                                            QUERY PLAN
-------------------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..19425.00 rows=1 width=8) (actual time=0.254..97.293 rows=10 loops=1)
   Filter: ((i = 100) AND (j = 10))
   Rows Removed by Filter: 999990
 Planning Time: 0.199 ms
 Execution Time: 97.327 ms
(5 rows)

After applying the auto_explain patch (without the clausesel patch) and issuing
the query, additional statistics were created.

postgres=# select * from t where i = 100 and j = 10;
LOG:  Add statistics t_i_j

Then, after analyze, the row estimation was improved.

postgres=# analyze t;
ANALYZE
postgres=# explain analyze select * from t where i = 100 and j = 10;
                                             QUERY PLAN
--------------------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..19425.00 rows=10 width=8) (actual time=0.255..95.347 rows=10 loops=1)
   Filter: ((i = 100) AND (j = 10))
   Rows Removed by Filter: 999990
 Planning Time: 0.124 ms
 Execution Time: 95.383 ms
(5 rows)

So, I think the auto_explain patch is useful just by itself, as a tool to
detect gaps between estimated and real row counts and adjust the plan. Also,
the clausesel patch would be useful without the auto_explain patch if
appropriate statistics are created.

> But I already have too many patches at commitfest.
> May be it will be enough to spit this patch into two?

Although we can discuss both of these patches in this thread, I don't think
they have to be committed together.
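The rows=1 estimate in the example above is exactly what the independence assumption predicts; the arithmetic can be reproduced with a short Python sketch (mirroring the SQL's integer division):

```python
# t has 1,000,000 rows with i = k // 10 and j = k // 100, so j is functionally
# dependent on i. Per-column selectivities for i = 100 and j = 10:
n = 1_000_000
sel_i = 10 / n    # i = 100 holds for k in 1000..1009 -> 10 rows
sel_j = 100 / n   # j = 10 holds for k in 1000..1099 -> 100 rows

# Without extended statistics the planner multiplies the selectivities
# (0.001 expected rows, clamped to at least one row):
independent_rows = n * sel_i * sel_j
print(max(1, round(independent_rows)))  # 1 (matches the rows=1 estimate)

# The true joint result equals the more selective clause alone, which is
# what the functional-dependency statistics let the planner see:
actual = sum(1 for k in range(1, n + 1) if k // 10 == 100 and k // 100 == 10)
print(actual)  # 10
```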
> > (3)
> > + DefineCustomBoolVariable("auto_explain.suggest_only",
> > +  "Do not create statistic but just record in WAL suggested create statistics statement.",
> > +  NULL,
> > +  &auto_explain_suggest_on
> >
> > To imply that this parameter is involving to add_statistics_threshold, it seems
> > better for me to use more related name like add_statistics_suggest_only.
> >
> > Also, additional documentations for new parameters are required.
>
> Done.

+
+  <varlistentry>
+   <term>
+    <varname>auto_explain.auto_explain.add_statistics_threshold</varname> (<type>real</type>)
+    <indexterm>
+     <primary><varname>auto_explain.add_statistics_threshold</varname> configuration parameter</primary>
+    </indexterm>
+   </term>
+   <listitem>
+    <para>
+     <varname>auto_explain.add_statistics_threshold</varname> sets the threshold for
+     actual/estimated #rows ratio triggering creation of multicolumn statistic
+     for the related columns. It can be used for adpative query optimization.
+     If there is large gap between real and estimated number of tuples for the
+     concrete plan node, then multicolumn statistic is created for involved
+     attributes. Zero value (default) disables implicit creation of multicolumn statistic.
+    </para>
+   </listitem>

I wonder if we need to say that this parameter has no effect unless log_analyze
is enabled, and that statistics are created only when the execution time exceeds
log_min_duration, if these behaviors are intentional.

In addition, additional statistics are created only when the actual row count
exceeds the estimated one, not in the opposite case. Although that seems good
as a criterion for creating multicolumn statistics, since extended statistics
are usually useful for fixing such estimation errors, I am not sure whether we
don't have to consider the opposite case at all.

> > (9)
> > Currently, it only consider functional dependencies statistics. Can we also
> > consider multivariate MCV list, and is it useful?
>
> Right now auto_explain create statistic without explicit specification
> of statistic kind.
> According to the documentation all supported statistics kinds should be
> created in this case:

Yes, auto_explain creates all kinds of extended statistics. However, IIUC, the
clausesel patch uses only functional dependency statistics for improving joins,
so my question was about the possibility of considering MCV in the clausesel
patch.

> > (10)
> > To achieve adaptive query optimization (AQO) in PostgreSQL, this patch proposes
> > to use auto_explain for getting feedback from actual results. So, could auto_explain
> > be a infrastructure of AQO in future? Or, do you have any plan or idea to make
> > built-in infrastructure for AQO?
>
> Sorry, I do not have answer for this question.
> I just patched auto_explain extension because it is doing half of the
> required work (analyze expensive statements).
> It can be certainly moved to separate extension. In this case it will
> party duplicate existed functionality and
> settings of auto_explain (like statement execution time threshold). I am
> not sure that it is good.
> But from the other side, this my patch makes auto_explain extension to
> do some unexpected work...

I think that auto_explain is an extension originally aimed at detecting and
logging plans that take a long time, so it doesn't seem so unnatural to me to
use it for improving such plans. Especially, the feature of finding tunable
points in executed plans seems useful.

> Actually task of adaptive query optimization is much bigger.
> We have separate AQO extension which tries to use machine learning to
> correctly adjust estimations.
> This my patch is much simpler and use existed mechanism (extended
> statistics) to improve estimations.

Well, this patch provides a kind of AQO as an auto_explain feature, but it is
independent of the AQO extension. Is that right? Anyway, I'm interested in the
AQO extension, so I'll look into this, too.

Regards,
Yugo Nagata

--
Yugo NAGATA <[hidden email]>
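The trigger condition described in the documentation quoted above ("actual/estimated #rows ratio triggering creation") can be sketched as follows. This is an illustrative model of the behaviour under discussion, not the patch's actual code; the function name and the exact comparison are assumptions:

```python
def should_add_statistics(actual_rows: float, estimated_rows: float,
                          threshold: float) -> bool:
    """Model of auto_explain.add_statistics_threshold (an assumption based on
    the quoted documentation): trigger statistics creation when the
    actual/estimated row-count ratio exceeds the threshold; a zero threshold
    disables the feature."""
    if threshold <= 0:
        return False
    # Only the case where the actual row count exceeds the estimate triggers;
    # when the estimate is too high, the ratio stays below 1.
    return actual_rows / max(estimated_rows, 1.0) > threshold

print(should_add_statistics(actual_rows=10, estimated_rows=1, threshold=5.0))   # True: estimate off by 10x
print(should_add_statistics(actual_rows=12, estimated_rows=10, threshold=5.0))  # False: estimate close enough
print(should_add_statistics(actual_rows=10, estimated_rows=1, threshold=0.0))   # False: feature disabled
```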
On 27.01.2021 8:45, Yugo NAGATA wrote: > On Mon, 25 Jan 2021 16:27:25 +0300 > Konstantin Knizhnik <[hidden email]> wrote: > >> Hello, >> >> Thank you for review. >> My answers are inside. > Thank you for updating the patch and answering my questions. > >>> (2) >>> If I understand correctly, your proposal consists of the following two features. >>> >>> 1. Add a feature to auto_explain that creates an extended statistic automatically >>> if an error on estimated rows number is large. >>> >>> 2. Improve rows number estimation of join results by considering functional >>> dependencies between vars in the join qual if the qual has more than one clauses, >>> and also functional dependencies between a var in the join qual and vars in quals >>> of the inner/outer relation. >>> >>> As you said, these two parts are independent each other, so one feature will work >>> even if we don't assume the other. I wonder it would be better to split the patch >>> again, and register them to commitfest separately. >> I agree with you that this are two almost unrelated changes, although >> without clausesel patch additional statistic can not improve query planning. > I think extended statistics created by the auto_explain patch can improve query > planning even without the clausesel patch. 
> For example, suppose the following case:
>
> postgres=# create table t ( i int, j int);
> CREATE TABLE
> postgres=# insert into t select i/10, i/100 from generate_series(1,1000000) i;
> INSERT 0 1000000
> postgres=# analyze t;
> ANALYZE
> postgres=# explain analyze select * from t where i = 100 and j = 10;
>                                            QUERY PLAN
> -------------------------------------------------------------------------------------------------
>  Seq Scan on t  (cost=0.00..19425.00 rows=1 width=8) (actual time=0.254..97.293 rows=10 loops=1)
>    Filter: ((i = 100) AND (j = 10))
>    Rows Removed by Filter: 999990
>  Planning Time: 0.199 ms
>  Execution Time: 97.327 ms
> (5 rows)
>
> After applying the auto_explain patch (without the clausesel patch) and issuing the
> query, additional statistics were created.
>
> postgres=# select * from t where i = 100 and j = 10;
> LOG:  Add statistics t_i_j
>
> Then, after analyze, the row estimation was improved.
>
> postgres=# analyze t;
> ANALYZE
> postgres=# explain analyze select * from t where i = 100 and j = 10;
>                                             QUERY PLAN
> --------------------------------------------------------------------------------------------------
>  Seq Scan on t  (cost=0.00..19425.00 rows=10 width=8) (actual time=0.255..95.347 rows=10 loops=1)
>    Filter: ((i = 100) AND (j = 10))
>    Rows Removed by Filter: 999990
>  Planning Time: 0.124 ms
>  Execution Time: 95.383 ms
> (5 rows)
>
> So, I think the auto_explain patch is useful on its own as a tool to detect a gap
> between the estimate and reality and to adjust the plan. Also, the clausesel patch
> would be useful without the auto_explain patch if appropriate statistics are created.
>
>> But I already have too many patches in the commitfest.
>> Maybe it will be enough to split this patch into two?
> Although we can discuss both of these patches in this thread,
> I wonder whether we really have to commit them together.
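For reference, the statistics object that the patched auto_explain reported creating in the log above (t_i_j) corresponds to what one could create by hand in stock PostgreSQL; a minimal sketch, reusing the table from the example:

```sql
-- Manual equivalent of what the patched auto_explain did implicitly.
-- With no kind list, all supported statistics kinds are built for (i, j).
CREATE STATISTICS t_i_j ON i, j FROM t;
ANALYZE t;  -- extended statistics are only populated by a subsequent ANALYZE
```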
>
>>> (3)
>>> +	DefineCustomBoolVariable("auto_explain.suggest_only",
>>> +							 "Do not create statistic but just record in WAL suggested create statistics statement.",
>>> +							 NULL,
>>> +							 &auto_explain_suggest_on
>>>
>>> To imply that this parameter is related to add_statistics_threshold, it seems
>>> better to me to use a more related name like add_statistics_suggest_only.
>>>
>>> Also, additional documentation for the new parameters is required.
>> Done.
> +
> +  <varlistentry>
> +   <term>
> +    <varname>auto_explain.auto_explain.add_statistics_threshold</varname> (<type>real</type>)
> +    <indexterm>
> +     <primary><varname>auto_explain.add_statistics_threshold</varname> configuration parameter</primary>
> +    </indexterm>
> +   </term>
> +   <listitem>
> +    <para>
> +     <varname>auto_explain.add_statistics_threshold</varname> sets the threshold for
> +     the actual/estimated #rows ratio triggering creation of multicolumn statistics
> +     for the related columns. It can be used for adaptive query optimization.
> +     If there is a large gap between the real and estimated number of tuples for a
> +     concrete plan node, then multicolumn statistics are created for the involved
> +     attributes. A zero value (default) disables implicit creation of multicolumn statistics.
> +    </para>
> +   </listitem>
>
> I wonder whether we need to say that this parameter has no effect unless log_analyze
> is enabled, and that statistics are created only when the execution time exceeds
> log_min_duration, if these behaviors are intentional.
>
> In addition, additional statistics are created only if #rows is over-estimated,
> not if it is under-estimated. Although this seems good as a criterion for creating
> multicolumn statistics, since extended statistics are usually useful for fixing
> over-estimation, I am not sure whether we can ignore the under-estimation case entirely.
>
>
>>> (9)
>>> Currently, it only considers functional dependencies statistics. Can we also
>>> consider the multivariate MCV list, and is it useful?
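To make the prerequisites Yugo points out concrete: with the patch applied, implicit statistics creation would only ever fire under a configuration along these lines (values are illustrative, taken from the example at the top of the thread; the last setting assumes the rename to add_statistics_suggest_only discussed above):

```sql
LOAD 'auto_explain';
SET auto_explain.log_analyze = on;                 -- needed to compare actual vs. estimated rows
SET auto_explain.log_min_duration = 0;             -- only statements that get logged are considered
SET auto_explain.add_statistics_threshold = 10.0;  -- fire when actual/estimated exceeds this ratio
SET auto_explain.add_statistics_suggest_only = on; -- log the suggested DDL instead of executing it
```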
>>
>> Right now auto_explain creates statistics without explicit specification
>> of the statistics kind.
>> According to the documentation, all supported statistics kinds should be
>> created in this case:
> Yes, auto_explain creates all kinds of extended statistics. However,
> IIUC, the clausesel patch uses only functional dependencies statistics for
> improving joins, so my question was about the possibility of considering MCV
> in the clausesel patch.
>
>>> (10)
>>> To achieve adaptive query optimization (AQO) in PostgreSQL, this patch proposes
>>> to use auto_explain for getting feedback from actual results. So, could auto_explain
>>> be an infrastructure for AQO in the future? Or do you have any plan or idea to make
>>> a built-in infrastructure for AQO?
>> Sorry, I do not have an answer to this question.
>> I just patched the auto_explain extension because it already does half of the
>> required work (analyzing expensive statements).
>> It could certainly be moved to a separate extension, but in that case it would
>> partly duplicate existing functionality and settings of auto_explain (like the
>> statement execution time threshold), and I am not sure that is good.
>> On the other hand, this patch makes the auto_explain extension do some
>> unexpected work...
> I think that auto_explain is an extension originally aimed at detecting and
> logging plans that take a long time, so it does not seem unnatural to me to use
> it for improving such plans as well. In particular, the feature of finding
> tunable points in executed plans seems useful.
>
>> Actually, the task of adaptive query optimization is much bigger.
>> We have a separate AQO extension which tries to use machine learning to
>> correctly adjust estimations.
>> This patch is much simpler and uses an existing mechanism (extended
>> statistics) to improve estimations.
> Well, this patch provides a kind of AQO as an auto_explain feature, but it is
> independent of the AQO extension. Is that right?
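For context on the kinds being discussed: stock PostgreSQL lets a statistics object be restricted to particular kinds, so the functional-dependencies and MCV variants mentioned here could be created separately (object names are illustrative):

```sql
-- Functional dependencies: the only kind the clausesel patch currently consumes.
CREATE STATISTICS t_ij_dep (dependencies) ON i, j FROM t;
-- Multivariate MCV list: the kind Yugo asks about considering as well.
CREATE STATISTICS t_ij_mcv (mcv) ON i, j FROM t;
ANALYZE t;
```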
> Anyway, I'm interested in the AQO extension, so I'll look into that, too.
>
> Regards,
> Yugo Nagata
>
I will register the patch for the auto_explain extension for the next commitfest.
I will create a separate thread for improving join selectivity estimation using
extended statistics.

> Well, this patch provides a kind of AQO as an auto_explain feature, but it is
> independent of the AQO extension. Is that right?

Yes. The basic idea is the same, but the approaches are different.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company