Hi,
Attached is a PoC/WIP patch adding support for extended statistics on expressions. This is by no means "ready" - most of the stuff works, but often in a rather hackish way. I certainly don't expect this to pass regression tests, for example. There's an example demonstrating how this works for two queries at the end of this message. Now let's talk about the main parts of the patch: 1) extending grammar to allow expressions, not just plain columns Fairly straighforward, I think. I'm sure the logic which expressions are allowed is not 100% (e.g. volatile functions etc.) but that's a detail we can deal with later. 2) store the expressions in pg_statistic_ext catalog I ended up adding a separate column, similar to indexprs, except that the order of columns/expressions does not matter, so we don't need to bother with storing 0s in stxkeys - we simply consider expressions to be "after" all the simple columns. 3) build statistics This should work too, for all three types of statistics we have (mcv, dependencies and ndistinct). This should work too, although the code changes are often very hackish "to make it work". The main challenge here was how to represent the expressions in the statistics - e.g. in ndistinct, which track ndistinct estimates for combinations of parameters, and so far we used attnums for that. I decided the easiest way it to keep doing that, but offset the expressions by MaxHeapAttributeNumber. That seems to work, but maybe there's a better way. 4) apply the statistics This is the hard part, really, and the exact state of the support depends on type of statistics. For ndistinct coefficients, it generally works. I'm sure there may be bugs in estimate_num_groups, etc. but in principle it works. For MCV lists, it generally works too - you can define statistics on the expressions and the estimates should improve. The main downside here is that it requires at least two expressions, otherwise we can't build/apply the extended statistics. So for example SELECT * FROM t WHERE mod(a,100) = 10 AND mod(b,11) = 0 may be estimated "correctly", once you drop any of the conditions it gets much worse as we don't have stats for individual expressions. That's rather annoying - it does not break the extended MCV, but the behavior will certainly cause confusion. For functional dependencies, the estimation does not work yet. Also, the missing per-column statistics have bigger impact than on MCV, because while MCV can work fine without it, the dependencies heavily rely on the per-column estimates. We only apply "corrections" based on the dependency degree, so we still need (good) per-column estimates, which does not quite work with the expressions. Of course, the lack of per-expression statistics may be somewhat fixed by adding indexes on expressions, but that's kinda expensive. Now, a simple example demonstrating how this improves estimates - let's create a table with 1M rows, and do queries with mod() expressions on it. It might be date_trunc() or something similar, that'd work too. table with 1M rows ================== test=# create table t (a int); test=# insert into t select i from generate_series(1,1000000) s(i); test=# analyze t; poorly estimated queries ======================== test=# explain (analyze, timing off) select * from t where mod(a,3) = 0 and mod(a,7) = 0; QUERY PLAN ---------------------------------------------------------------------------------- Seq Scan on t (cost=0.00..24425.00 rows=25 width=4) (actual rows=47619 loops=1) Filter: ((mod(a, 3) = 0) AND (mod(a, 7) = 0)) Rows Removed by Filter: 952381 Planning Time: 0.329 ms Execution Time: 156.675 ms (5 rows) test=# explain (analyze, timing off) select mod(a,3), mod(a,7) from t group by 1, 2; QUERY PLAN ----------------------------------------------------------------------------------------------- HashAggregate (cost=75675.00..98487.50 rows=1000000 width=8) (actual rows=21 loops=1) Group Key: mod(a, 3), mod(a, 7) Planned Partitions: 32 Batches: 1 Memory Usage: 1561kB -> Seq Scan on t (cost=0.00..19425.00 rows=1000000 width=8) (actual rows=1000000 loops=1) Planning Time: 0.277 ms Execution Time: 502.803 ms (6 rows) improved estimates ================== test=# create statistics s1 (ndistinct) on mod(a,3), mod(a,7) from t; test=# analyze t; test=# explain (analyze, timing off) select mod(a,3), mod(a,7) from t group by 1, 2; QUERY PLAN ----------------------------------------------------------------------------------------------- HashAggregate (cost=24425.00..24425.31 rows=21 width=8) (actual rows=21 loops=1) Group Key: mod(a, 3), mod(a, 7) Batches: 1 Memory Usage: 24kB -> Seq Scan on t (cost=0.00..19425.00 rows=1000000 width=8) (actual rows=1000000 loops=1) Planning Time: 0.135 ms Execution Time: 500.092 ms (6 rows) test=# create statistics s2 (mcv) on mod(a,3), mod(a,7) from t; test=# analyze t; test=# explain (analyze, timing off) select * from t where mod(a,3) = 0 and mod(a,7) = 0; QUERY PLAN ------------------------------------------------------------------------------------- Seq Scan on t (cost=0.00..24425.00 rows=46433 width=4) (actual rows=47619 loops=1) Filter: ((mod(a, 3) = 0) AND (mod(a, 7) = 0)) Rows Removed by Filter: 952381 Planning Time: 0.702 ms Execution Time: 152.280 ms (5 rows) Clearly, estimates for both queries are significantly improved. Of course, this example is kinda artificial/simplistic. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company 0001-Support-for-extended-statistics-on-expressi-20201116.patch (154K) Download Attachment |
On 11/16/20 2:49 PM, Tomas Vondra wrote:
> Hi, > > ... > > 4) apply the statistics > > This is the hard part, really, and the exact state of the support > depends on type of statistics. > > For ndistinct coefficients, it generally works. I'm sure there may be > bugs in estimate_num_groups, etc. but in principle it works. > > For MCV lists, it generally works too - you can define statistics on > the expressions and the estimates should improve. The main downside > here is that it requires at least two expressions, otherwise we can't > build/apply the extended statistics. So for example > > SELECT * FROM t WHERE mod(a,100) = 10 AND mod(b,11) = 0 > > may be estimated "correctly", once you drop any of the conditions it > gets much worse as we don't have stats for individual expressions. > That's rather annoying - it does not break the extended MCV, but the > behavior will certainly cause confusion. > > For functional dependencies, the estimation does not work yet. Also, > the missing per-column statistics have bigger impact than on MCV, > because while MCV can work fine without it, the dependencies heavily > rely on the per-column estimates. We only apply "corrections" based > on the dependency degree, so we still need (good) per-column > estimates, which does not quite work with the expressions. > > > Of course, the lack of per-expression statistics may be somewhat > fixed by adding indexes on expressions, but that's kinda expensive. > FWIW after re-reading [1], I think the plan to build pg_statistic rows for expressions and stash them in pg_statistic_ext_data is the way to go. I was thinking that maybe we'll need some new statistics type to request this, e.g. CREATE STATISTICS s (expressions) ON ... but on second thought I think we should just build this whenever there are expressions in the definition. It'll require some changes (e.g. we require at least two items in the list, but we'll want to allow building stats on a single expression too, I guess), but that's doable. Of course, we don't have any catalogs with composite types yet, so it's not 100% sure this will work, but it's worth a try. regards [1] https://www.postgresql.org/message-id/flat/6331.1579041473%40sss.pgh.pa.us#5ec6af7583e84cef2ca6a9e8a713511e -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company |
Hi,
attached is a significantly improved version of the patch, allowing defining extended statistics on expressions. This fixes most of the problems in the previous WIP version and AFAICS it does pass all regression tests (including under valgrind). There's a bunch of FIXMEs and a couple loose ends, but overall I think it's ready for reviews. Overall, the patch does two main things: * it adds a new "expressions" statistics kind, building per-expression statistics (i.e it's similar to having expression index) * it allows using expressions in definition of extended statistics, and properly handles that in all existing statistics kinds (dependencies, mcv, ndistinct) The expression handling mostly copies what we do for indexes, with similar restrictions - no volatile functions, aggregates etc. The list of expressions is stored in pg_statistic_ext catalog, but unlike for indexes we don't need to worry about the exact order of elements, so there are no "0" for expressions in stxkeys etc. We simply assume the expressions come after simple columns, and that's it. To reference expressions in the built statistics (e.g. in a dependency) we use "special attnums" computed from the expression index by adding MaxHeapAttributeNumber. So the first expression has attnum 1601 etc. This mapping expressions to attnums is used both while building and applying the statistics to clauses, as it makes the whole process much simpler than dealing with attnums and expressions entirely separately. The first part allows us to do something like this: CREATE TABLE t (a int); INSERT INTO t SELECT i FROM generate_series(1,1000000) s(i); ANALYZE t; EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE mod(a,10) = 0; CREATE STATISTICS s (expressions) ON mod(a,10) FROM t; ANALYZE t; EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE mod(a,10) = 0; Without the statistics we get this: QUERY PLAN -------------------------------------------------------- Seq Scan on t (cost=0.00..19425.00 rows=5000 width=4) (actual rows=100000 loops=1) Filter: (mod(a, 10) = 0) Rows Removed by Filter: 900000 Planning Time: 0.216 ms Execution Time: 157.552 ms (5 rows) while with the statistics we get this QUERY PLAN ---------------------------------------------------------- Seq Scan on t (cost=0.00..19425.00 rows=100900 width=4) (actual rows=100000 loops=1) Filter: (mod(a, 10) = 0) Rows Removed by Filter: 900000 Planning Time: 0.399 ms Execution Time: 157.530 ms (5 rows) So that's pretty nice improvement. In practice you could get the same effect by creating an expression index CREATE INDEX ON t (mod(a,10)); but of course that's far from free - there's cost to maintain the index, it blocks HOT, and it takes space on disk. The statistics have none of these issues. Implementation-wise, this simply builds per-column statistics for each expression, and stashes them into a new column in pg_statistic_ext_data catalog as an array of elements with pg_statistic composite type. And then in selfuncs.c we look not just at indexes, but also at this when looking for expression stats. So that gives us the per-expression stats. This is enabled by default when you don't specify the statistics type and the definition includes any expression that is not a simple column reference. Otherwise you may also request it explicitly by using "expressions" in the CREATE. Now, the second part is really just a natural extension of the existing stats to also work with expressions. The easiest thing is probably to show some examples, so consider this: CREATE TABLE t (a INT, b INT, c INT); INSERT INTO t SELECT i, i, i FROM generate_series(1,1000000) s(i); ANALYZE t; which without any statistics gives us this: EXPLAIN (ANALYZE, TIMING OFF) SELECT 1 FROM t WHERE mod(a,10) = 0 AND mod(b,5) = 0; QUERY PLAN ------------------------------------------------------ Seq Scan on t (cost=0.00..25406.00 rows=25 width=4) (actual rows=100000 loops=1) Filter: ((mod(a, 10) = 0) AND (mod(b, 5) = 0)) Rows Removed by Filter: 900000 Planning Time: 0.080 ms Execution Time: 161.445 ms (5 rows) EXPLAIN (ANALYZE, TIMING OFF) SELECT 1 FROM t GROUP BY mod(a,10), mod(b,5); QUERY PLAN ------------------------------------------------------------------ HashAggregate (cost=76656.00..99468.50 rows=1000000 width=12) (actual rows=10 loops=1) Group Key: mod(a, 10), mod(b, 5) Planned Partitions: 32 Batches: 1 Memory Usage: 1561kB -> Seq Scan on t (cost=0.00..20406.00 rows=1000000 width=8) (actual rows=1000000 loops=1) Planning Time: 0.232 ms Execution Time: 514.446 ms (6 rows) and now let's add statistics on the expressions: CREATE STATISTICS s ON mod(a,10), mod(b,5) FROM t; ANALYZE t; which ends up with these spot-on estimates: EXPLAIN (ANALYZE, TIMING OFF) SELECT 1 FROM t WHERE mod(a,10) = 0 AND mod(b,5) = 0; QUERY PLAN --------------------------------------------------------- Seq Scan on t (cost=0.00..25406.00 rows=97400 width=4) (actual rows=100000 loops=1) Filter: ((mod(a, 10) = 0) AND (mod(b, 5) = 0)) Rows Removed by Filter: 900000 Planning Time: 0.366 ms Execution Time: 159.207 ms (5 rows) EXPLAIN (ANALYZE, TIMING OFF) SELECT 1 FROM t GROUP BY mod(a,10), mod(b,5); QUERY PLAN ----------------------------------------------------------------- HashAggregate (cost=25406.00..25406.15 rows=10 width=12) (actual rows=10 loops=1) Group Key: mod(a, 10), mod(b, 5) Batches: 1 Memory Usage: 24kB -> Seq Scan on t (cost=0.00..20406.00 rows=1000000 width=8) (actual rows=1000000 loops=1) Planning Time: 0.299 ms Execution Time: 530.793 ms (6 rows) Of course, this is a very simple query, but hopefully you get the idea. There's about two main areas where I think might be hidden issues: 1) We're kinda faking the pg_statistic entries, and I suppose there might be some loose ends (e.g. with respect to ACLs etc.). 2) Memory management while evaluating the expressions during analyze is kinda simplistic, we're probably keeping the memory around longer than needed etc. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company 0001-Extended-statistics-on-expressions-20201122.patch (216K) Download Attachment |
On Sun, Nov 22, 2020 at 08:03:51PM +0100, Tomas Vondra wrote:
> attached is a significantly improved version of the patch, allowing > defining extended statistics on expressions. This fixes most of the > problems in the previous WIP version and AFAICS it does pass all > regression tests (including under valgrind). There's a bunch of FIXMEs > and a couple loose ends, but overall I think it's ready for reviews. I was looking at the previous patch, so now read this one instead, and attach some proposed fixes. + * This matters especially for * expensive expressions, of course. + The expression can refer only to columns of the underlying table, but + it can use all columns, not just the ones the statistics is defined + on. I don't know what these are trying to say? + errmsg("statistics expressions and predicates can refer only to the table being indexed"))); + * partial-index predicates. Create it in the per-index context to be I think these are copied and shouldn't mention "indexes" or "predicates". Or should statistics support predicates, too ? Idea: if a user specifies no stakinds, and there's no expression specified, then you automatically build everything except for expressional stats. But if they specify only one statistics "column", it gives an error. If that's a non-simple column reference, should that instead build *only* expressional stats (possibly with a NOTICE, since the user might be intending to make MV stats). I think pg_stats_ext should allow inspecting the pg_statistic data in pg_statistic_ext_data.stxdexprs. I guess array_agg() should be ordered by something, so maybe it should use ORDINALITY (?) I hacked more on bootstrap.c so included that here. -- Justin 0001-bootstrap-convert-Typ-to-a-List.patch (3K) Download Attachment 0002-Allow-composite-types-in-bootstrap.patch (1K) Download Attachment 0003-Extended-statistics-on-expressions.patch (215K) Download Attachment 0004-Fix-silly-errors.patch (2K) Download Attachment 0005-Small-language-fixen.patch (4K) Download Attachment 0006-Some-cleanup.patch (11K) Download Attachment 0007-WIP-Update-pg_stats_ext-for-expressional-stats-incom.patch (12K) Download Attachment |
On 11/23/20 3:26 AM, Justin Pryzby wrote: > On Sun, Nov 22, 2020 at 08:03:51PM +0100, Tomas Vondra wrote: >> attached is a significantly improved version of the patch, allowing >> defining extended statistics on expressions. This fixes most of the >> problems in the previous WIP version and AFAICS it does pass all >> regression tests (including under valgrind). There's a bunch of FIXMEs >> and a couple loose ends, but overall I think it's ready for reviews. > > I was looking at the previous patch, so now read this one instead, and attach > some proposed fixes. > > + * This matters especially for * expensive expressions, of course. > The point this was trying to make is that we evaluate the expressions only once, and use the results to build all extended statistics. Instead of leaving it up to every "build" to re-evaluate it. > + The expression can refer only to columns of the underlying table, but > + it can use all columns, not just the ones the statistics is defined > + on. > > I don't know what these are trying to say? > D'oh. That's bogus para, copied from the CREATE INDEX docs (where it talked about the index predicate, which is irrelevant here). > + errmsg("statistics expressions and predicates can refer only to the table being indexed"))); > + * partial-index predicates. Create it in the per-index context to be > > I think these are copied and shouldn't mention "indexes" or "predicates". Or > should statistics support predicates, too ? > Right. Stupid copy-pasto. > Idea: if a user specifies no stakinds, and there's no expression specified, > then you automatically build everything except for expressional stats. But if > they specify only one statistics "column", it gives an error. If that's a > non-simple column reference, should that instead build *only* expressional > stats (possibly with a NOTICE, since the user might be intending to make MV > stats). > Right, that was the intention - but I messed up and it works only if you specify the "expressions" kind explicitly (and I also added the ERROR message to expected output by mistake). I agree we should handle this automatically, so that CREATE STATISTICS s ON (a+b) FROM t works and only creates statistics for the expression. > I think pg_stats_ext should allow inspecting the pg_statistic data in > pg_statistic_ext_data.stxdexprs. I guess array_agg() should be ordered by > something, so maybe it should use ORDINALITY (?) > I agree we should expose the expression statistics, but I'm not convinced we should do that in the pg_stats_ext view itself. The problem is that it's a table bested in a table, essentially, with non-trivial structure, so I was thinking about adding a separate view exposing just this one part. Something like pg_stats_ext_expressions, with about the same structure as pg_stats, or something. > I hacked more on bootstrap.c so included that here. Thanks. As for the 0004-0007 patches: 0004 - Seems fine. IMHO not really "silly errors" but OK. 0005 - Mostly OK. The docs wording mostly comes from CREATE INDEX docs, though. The paragraph about "t1" is old, so if we want to reword it then maybe we should backpatch too. 0006 - Not sure. I think CreateStatistics can be fixed with less code, keeping it more like PG13 (good for backpatching). Not sure why rename extended statistics to multi-variate statistics - we use "extended" everywhere. Not sure what's the point of serialize_expr_stats changes, that's code is mostly copy-paste from update_attstats. 0007 - I suspect this makes the pg_stats_ext too complex to work with, IMHO we should move this to a separate view. Thanks for the review! I'll try to look more closely at those patches sometime next week, and merge most of it. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company |
Hi,
Attached is an updated version of the patch series, merging some of the changes proposed by Justin. I've kept the bootstrap patches separate, at least for now. As for the individual 0004-0007 patches: 1) 0004 - merged as is 2) 0005 - I've merged some of the docs changes, but some of the wording was copied from CREATE INDEX docs in which case I've kept that. I've also not merged changed to pre-existing docs, like the t1 example which is unrelated to this patch. OTOH I've corrected the t3 example description, which was somewhat bogus and unrelated to what the example actually did. I've also removed the irrelevant para which originally described index predicates and was copied from CREATE INDEX docs by mistake. 3) 0006 - I've committed something similar / less invasive, achieving the same goals (I think), and I've added a couple regression tests. 4) 0007 - I agreed we need a way to expose the stats, but including this in pg_stats_ext seems rather inconvenient (table in a table is difficult to work with). Instead I've added a new catalog pg_stats_ext_exprs with structure similar to pg_stats. I've also added the expressions to the pg_stats_ext catalog, which was only showing the attributes, and some basic docs for the catalog changes. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company 0001-bootstrap-convert-Typ-to-a-List-20201123.patch (3K) Download Attachment 0002-Allow-composite-types-in-bootstrap-20201123.patch (1K) Download Attachment 0003-Extended-statistics-on-expressions-20201123.patch (240K) Download Attachment |
In reply to this post by Tomas Vondra-6
On Mon, Nov 23, 2020 at 04:30:26AM +0100, Tomas Vondra wrote:
> 0004 - Seems fine. IMHO not really "silly errors" but OK. This is one of the same issues you pointed out - shadowing a variable. Could be backpatched. On Mon, Nov 23, 2020 at 04:30:26AM +0100, Tomas Vondra wrote: > > + errmsg("statistics expressions and predicates can refer only to the table being indexed"))); > > + * partial-index predicates. Create it in the per-index context to be > > > > I think these are copied and shouldn't mention "indexes" or "predicates". Or > > should statistics support predicates, too ? > > > > Right. Stupid copy-pasto. Right, but then I was wondering if CREATE STATS should actually support predicates, since one use case is to do what indexes do without their overhead. I haven't thought about it enough yet. > 0006 - Not sure. I think CreateStatistics can be fixed with less code, > keeping it more like PG13 (good for backpatching). Not sure why rename > extended statistics to multi-variate statistics - we use "extended" > everywhere. - if (build_expressions && (list_length(stxexprs) == 0)) + if (!build_expressions_only && (list_length(stmt->exprs) < 2)) ereport(ERROR, (errcode(ERRCODE_INVALID_OBJECT_DEFINITION), - errmsg("extended expression statistics require at least one expression"))); + errmsg("multi-variate statistics require at least two columns"))); I think all of "CREATE STATISTICS" has been known as "extended stats", so I think it may be confusing to say that it requires two columns for the general facility. > Not sure what's the point of serialize_expr_stats changes, > that's code is mostly copy-paste from update_attstats. Right. I think "i" is poor variable name when it isn't a loop variable and not of limited scope. > 0007 - I suspect this makes the pg_stats_ext too complex to work with, > IMHO we should move this to a separate view. Right - then unnest() the whole thing and return one row per expression rather than array, as you've done. Maybe the docs should say that this returns one row per expression. Looking quickly at your new patch: I guess you know there's a bunch of lingering references to "indexes" and "predicates": I don't know if you want to go to the effort to prohibit this. postgres=# CREATE STATISTICS asf ON (tableoid::int+1) FROM t; CREATE STATISTICS I think a lot of people will find this confusing: postgres=# CREATE STATISTICS asf ON i FROM t; ERROR: extended statistics require at least 2 columns postgres=# CREATE STATISTICS asf ON (i) FROM t; CREATE STATISTICS postgres=# CREATE STATISTICS asf (expressions) ON i FROM t; ERROR: extended expression statistics require at least one expression postgres=# CREATE STATISTICS asf (expressions) ON (i) FROM t; CREATE STATISTICS I haven't looked, but is it possible to make it work without parens ? -- Justin |
On 11/24/20 5:23 PM, Justin Pryzby wrote: > On Mon, Nov 23, 2020 at 04:30:26AM +0100, Tomas Vondra wrote: >> 0004 - Seems fine. IMHO not really "silly errors" but OK. > > This is one of the same issues you pointed out - shadowing a variable. > Could be backpatched. > > On Mon, Nov 23, 2020 at 04:30:26AM +0100, Tomas Vondra wrote: >>> + errmsg("statistics expressions and predicates can refer only to the table being indexed"))); >>> + * partial-index predicates. Create it in the per-index context to be >>> >>> I think these are copied and shouldn't mention "indexes" or "predicates". Or >>> should statistics support predicates, too ? >>> >> >> Right. Stupid copy-pasto. > > Right, but then I was wondering if CREATE STATS should actually support > predicates, since one use case is to do what indexes do without their overhead. > I haven't thought about it enough yet. > Well, it's not supported now, so the message is bogus. I'm not against supporting "partial statistics" with predicates in the future, but it's going to be non-trivial project on it's own. It's not something I can bolt onto the current patch easily. >> 0006 - Not sure. I think CreateStatistics can be fixed with less code, >> keeping it more like PG13 (good for backpatching). Not sure why rename >> extended statistics to multi-variate statistics - we use "extended" >> everywhere. > > - if (build_expressions && (list_length(stxexprs) == 0)) > + if (!build_expressions_only && (list_length(stmt->exprs) < 2)) > ereport(ERROR, > (errcode(ERRCODE_INVALID_OBJECT_DEFINITION), > - errmsg("extended expression statistics require at least one expression"))); > + errmsg("multi-variate statistics require at least two columns"))); > > I think all of "CREATE STATISTICS" has been known as "extended stats", so I > think it may be confusing to say that it requires two columns for the general > facility. > >> Not sure what's the point of serialize_expr_stats changes, >> that's code is mostly copy-paste from update_attstats. > > Right. I think "i" is poor variable name when it isn't a loop variable and not > of limited scope. > OK, I understand. I'll consider tweaking that. >> 0007 - I suspect this makes the pg_stats_ext too complex to work with, >> IMHO we should move this to a separate view. > > Right - then unnest() the whole thing and return one row per expression rather > than array, as you've done. Maybe the docs should say that this returns one > row per expression. > > Looking quickly at your new patch: I guess you know there's a bunch of > lingering references to "indexes" and "predicates": > > I don't know if you want to go to the effort to prohibit this. > postgres=# CREATE STATISTICS asf ON (tableoid::int+1) FROM t; > CREATE STATISTICS > Hmm, we're already rejecting system attributes, I suppose we should do the same thing for expressions on system attributes. > I think a lot of people will find this confusing: > > postgres=# CREATE STATISTICS asf ON i FROM t; > ERROR: extended statistics require at least 2 columns > postgres=# CREATE STATISTICS asf ON (i) FROM t; > CREATE STATISTICS > > postgres=# CREATE STATISTICS asf (expressions) ON i FROM t; > ERROR: extended expression statistics require at least one expression > postgres=# CREATE STATISTICS asf (expressions) ON (i) FROM t; > CREATE STATISTICS > > I haven't looked, but is it possible to make it work without parens ? > Hmm, you're right that may be surprising. I suppose we could walk the expressions while creating the statistics, and replace such trivial expressions with the nested variable, but I haven't tried. I wonder what the CREATE INDEX behavior would be in these cases. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company |
In reply to this post by Tomas Vondra-6
Hi,
Attached is a patch series rebased on top of 25a9e54d2d which improves estimation of OR clauses. There were only a couple minor conflicts. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company 0001-bootstrap-convert-Typ-to-a-List-20201203.patch (3K) Download Attachment 0002-Allow-composite-types-in-bootstrap-20201203.patch (1K) Download Attachment 0003-Extended-statistics-on-expressions-20201203.patch (240K) Download Attachment |
On Thu, 3 Dec 2020 at 15:23, Tomas Vondra <[hidden email]> wrote:
> > Attached is a patch series rebased on top of 25a9e54d2d. After reading this thread and [1], I think I prefer the name "standard" rather than "expressions", because it is meant to describe the kind of statistics being built rather than what they apply to, but maybe that name doesn't actually need to be exposed to the end user: Looking at the current behaviour, there are a couple of things that seem a little odd, even though they are understandable. For example, the fact that CREATE STATISTICS s (expressions) ON (expr), col FROM tbl; fails, but CREATE STATISTICS s (expressions, mcv) ON (expr), col FROM tbl; succeeds and creates both "expressions" and "mcv" statistics. Also, the syntax CREATE STATISTICS s (expressions) ON (expr1), (expr2) FROM tbl; tends to suggest that it's going to create statistics on the pair of expressions, describing their correlation, when actually it builds 2 independent statistics. Also, this error text isn't entirely accurate: CREATE STATISTICS s ON col FROM tbl; ERROR: extended statistics require at least 2 columns because extended statistics don't always require 2 columns, they can also just have an expression, or multiple expressions and 0 or 1 columns. I think a lot of this stems from treating "expressions" in the same way as the other (multi-column) stats kinds, and it might actually be neater to have separate documented syntaxes for single- and multi-column statistics: CREATE STATISTICS [ IF NOT EXISTS ] statistics_name ON (expression) FROM table_name CREATE STATISTICS [ IF NOT EXISTS ] statistics_name [ ( statistics_kind [, ... ] ) ] ON { column_name | (expression) } , { column_name | (expression) } [, ...] FROM table_name The first syntax would create single-column stats, and wouldn't accept a statistics_kind argument, because there is only one kind of single-column statistic. Maybe that might change in the future, but if so, it's likely that the kinds of single-column stats will be different from the kinds of multi-column stats. In the second syntax, the only accepted kinds would be the current multi-column stats kinds (ndistinct, dependencies, and mcv), and it would always build stats describing the correlations between the columns listed. It would continue to build standard/expression stats on any expressions in the list, but that's more of an implementation detail. It would no longer be possible to do "CREATE STATISTICS s (expressions) ON (expr1), (expr2) FROM tbl". Instead, you'd have to issue 2 separate "CREATE STATISTICS" commands, but that seems more logical, because they're independent stats. The parsing code might not change much, but some of the errors would be different. For example, the errors "building only extended expression statistics on simple columns not allowed" and "extended expression statistics require at least one expression" would go away, and the error "extended statistics require at least 2 columns" might become more specific, depending on the stats kind. Regards, Dean [1] https://www.postgresql.org/message-id/flat/1009.1579038764%40sss.pgh.pa.us#8624792a20ae595683b574f5933dae53 |
On 12/7/20 10:56 AM, Dean Rasheed wrote: > On Thu, 3 Dec 2020 at 15:23, Tomas Vondra <[hidden email]> wrote: >> >> Attached is a patch series rebased on top of 25a9e54d2d. > > After reading this thread and [1], I think I prefer the name > "standard" rather than "expressions", because it is meant to describe > the kind of statistics being built rather than what they apply to, but > maybe that name doesn't actually need to be exposed to the end user: > > Looking at the current behaviour, there are a couple of things that > seem a little odd, even though they are understandable. For example, > the fact that > > CREATE STATISTICS s (expressions) ON (expr), col FROM tbl; > > fails, but > > CREATE STATISTICS s (expressions, mcv) ON (expr), col FROM tbl; > > succeeds and creates both "expressions" and "mcv" statistics. Also, the syntax > > CREATE STATISTICS s (expressions) ON (expr1), (expr2) FROM tbl; > > tends to suggest that it's going to create statistics on the pair of > expressions, describing their correlation, when actually it builds 2 > independent statistics. Also, this error text isn't entirely accurate: > > CREATE STATISTICS s ON col FROM tbl; > ERROR: extended statistics require at least 2 columns > > because extended statistics don't always require 2 columns, they can > also just have an expression, or multiple expressions and 0 or 1 > columns. > > I think a lot of this stems from treating "expressions" in the same > way as the other (multi-column) stats kinds, and it might actually be > neater to have separate documented syntaxes for single- and > multi-column statistics: > > CREATE STATISTICS [ IF NOT EXISTS ] statistics_name > ON (expression) > FROM table_name > > CREATE STATISTICS [ IF NOT EXISTS ] statistics_name > [ ( statistics_kind [, ... ] ) ] > ON { column_name | (expression) } , { column_name | (expression) } [, ...] > FROM table_name > > The first syntax would create single-column stats, and wouldn't accept > a statistics_kind argument, because there is only one kind of > single-column statistic. Maybe that might change in the future, but if > so, it's likely that the kinds of single-column stats will be > different from the kinds of multi-column stats. > > In the second syntax, the only accepted kinds would be the current > multi-column stats kinds (ndistinct, dependencies, and mcv), and it > would always build stats describing the correlations between the > columns listed. It would continue to build standard/expression stats > on any expressions in the list, but that's more of an implementation > detail. > > It would no longer be possible to do "CREATE STATISTICS s > (expressions) ON (expr1), (expr2) FROM tbl". Instead, you'd have to > issue 2 separate "CREATE STATISTICS" commands, but that seems more > logical, because they're independent stats. > > The parsing code might not change much, but some of the errors would > be different. For example, the errors "building only extended > expression statistics on simple columns not allowed" and "extended > expression statistics require at least one expression" would go away, > and the error "extended statistics require at least 2 columns" might > become more specific, depending on the stats kind. > I think it makes sense in general. I see two issues with this approach, though: * By adding expression/standard stats for individual statistics, it makes the list of statistics longer - I wonder if this might have measurable impact on lookups in this list. * I'm not sure it's a good idea that the second syntax would always build the per-expression stats. Firstly, it seems a bit strange that it behaves differently than the other kinds. Secondly, I wonder if there are cases where it'd be desirable to explicitly disable building these per-expression stats. For example, what if we have multiple extended statistics objects, overlapping on a couple expressions. It seems pointless to build the stats for all of them. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company |
On Mon, 7 Dec 2020 at 14:15, Tomas Vondra <[hidden email]> wrote:
> > On 12/7/20 10:56 AM, Dean Rasheed wrote: > > it might actually be > > neater to have separate documented syntaxes for single- and > > multi-column statistics: > > > > CREATE STATISTICS [ IF NOT EXISTS ] statistics_name > > ON (expression) > > FROM table_name > > > > CREATE STATISTICS [ IF NOT EXISTS ] statistics_name > > [ ( statistics_kind [, ... ] ) ] > > ON { column_name | (expression) } , { column_name | (expression) } [, ...] > > FROM table_name > > I think it makes sense in general. I see two issues with this approach, > though: > > * By adding expression/standard stats for individual statistics, it > makes the list of statistics longer - I wonder if this might have > measurable impact on lookups in this list. > > * I'm not sure it's a good idea that the second syntax would always > build the per-expression stats. Firstly, it seems a bit strange that it > behaves differently than the other kinds. Secondly, I wonder if there > are cases where it'd be desirable to explicitly disable building these > per-expression stats. For example, what if we have multiple extended > statistics objects, overlapping on a couple expressions. It seems > pointless to build the stats for all of them. > Hmm, I'm not sure it would really be a good idea to build MCV stats on expressions without also building the standard stats for those expressions, otherwise the assumptions that mcv_combine_selectivities() makes about simple_sel and mcv_basesel wouldn't really hold. But then, if multiple MCV stats shared the same expression, it would be quite wasteful to build standard stats on the expression more than once. It feels like it should build a single extended stats object for each unique expression, with appropriate dependencies for any MCV stats that used those expressions, but I'm not sure how complex that would be. Dropping the last MCV stat object using a standard expression stat object might reasonably drop the expression stats ... except if they were explicitly created by the user, independently of any MCV stats. That could get quite messy. Regards, Dean |
On 12/7/20 5:02 PM, Dean Rasheed wrote: > On Mon, 7 Dec 2020 at 14:15, Tomas Vondra <[hidden email]> wrote: >> >> On 12/7/20 10:56 AM, Dean Rasheed wrote: >>> it might actually be >>> neater to have separate documented syntaxes for single- and >>> multi-column statistics: >>> >>> CREATE STATISTICS [ IF NOT EXISTS ] statistics_name >>> ON (expression) >>> FROM table_name >>> >>> CREATE STATISTICS [ IF NOT EXISTS ] statistics_name >>> [ ( statistics_kind [, ... ] ) ] >>> ON { column_name | (expression) } , { column_name | (expression) } [, ...] >>> FROM table_name >> >> I think it makes sense in general. I see two issues with this approach, >> though: >> >> * By adding expression/standard stats for individual statistics, it >> makes the list of statistics longer - I wonder if this might have >> measurable impact on lookups in this list. >> >> * I'm not sure it's a good idea that the second syntax would always >> build the per-expression stats. Firstly, it seems a bit strange that it >> behaves differently than the other kinds. Secondly, I wonder if there >> are cases where it'd be desirable to explicitly disable building these >> per-expression stats. For example, what if we have multiple extended >> statistics objects, overlapping on a couple expressions. It seems >> pointless to build the stats for all of them. >> > > Hmm, I'm not sure it would really be a good idea to build MCV stats on > expressions without also building the standard stats for those > expressions, otherwise the assumptions that > mcv_combine_selectivities() makes about simple_sel and mcv_basesel > wouldn't really hold. But then, if multiple MCV stats shared the same > expression, it would be quite wasteful to build standard stats on the > expression more than once. > Yeah. You're right it'd be problematic to build MCV on expressions without having the per-expression stats. In fact, that's exactly the problem what forced me to add the per-expression stats to this patch. Originally I planned to address it in a later patch, but I had to move it forward. So I think you're right we need to ensure we have standard stats for each expression at least once, to make this work well. > It feels like it should build a single extended stats object for each > unique expression, with appropriate dependencies for any MCV stats > that used those expressions, but I'm not sure how complex that would > be. Dropping the last MCV stat object using a standard expression stat > object might reasonably drop the expression stats ... except if they > were explicitly created by the user, independently of any MCV stats. > That could get quite messy. > Possibly. But I don't think it's worth the extra complexity. I don't expect people to have a lot of overlapping stats, so the amount of wasted space and CPU time is expected to be fairly limited. So I don't think it's worth spending too much time on this now. Let's just do what you proposed, and revisit this later if needed. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company |
On Tue, 8 Dec 2020 at 12:44, Tomas Vondra <[hidden email]> wrote:
> > Possibly. But I don't think it's worth the extra complexity. I don't > expect people to have a lot of overlapping stats, so the amount of > wasted space and CPU time is expected to be fairly limited. > > So I don't think it's worth spending too much time on this now. Let's > just do what you proposed, and revisit this later if needed. > Yes, I think that's a reasonable approach to take. As long as the documentation makes it clear that building MCV stats also causes standard expression stats to be built on any expressions included in the list, then the user will know and can avoid duplication most of the time. I don't think there's any need for code to try to prevent that -- just as we don't bother with code to prevent a user building multiple indexes on the same column. The only case where duplication won't be avoidable is where there are multiple MCV stats sharing the same expression, but that's probably quite unlikely in practice, and it seems acceptable to leave improving that as a possible future optimisation. Regards, Dean |
On 12/11/20 1:58 PM, Dean Rasheed wrote:
> On Tue, 8 Dec 2020 at 12:44, Tomas Vondra <[hidden email]> wrote: >> >> Possibly. But I don't think it's worth the extra complexity. I don't >> expect people to have a lot of overlapping stats, so the amount of >> wasted space and CPU time is expected to be fairly limited. >> >> So I don't think it's worth spending too much time on this now. Let's >> just do what you proposed, and revisit this later if needed. >> > > Yes, I think that's a reasonable approach to take. As long as the > documentation makes it clear that building MCV stats also causes > standard expression stats to be built on any expressions included in > the list, then the user will know and can avoid duplication most of > the time. I don't think there's any need for code to try to prevent > that -- just as we don't bother with code to prevent a user building > multiple indexes on the same column. > > The only case where duplication won't be avoidable is where there are > multiple MCV stats sharing the same expression, but that's probably > quite unlikely in practice, and it seems acceptable to leave improving > that as a possible future optimisation. > I tried tweaking the grammar to differentiate these two syntax variants, but that led to shift/reduce conflicts with the existing ones. I tried fixing that, but I ended up doing that in CreateStatistics(). The other thing is that we probably can't tie this to just MCV, because functional dependencies need the per-expression stats too. So I simply build expression stats whenever there's at least one expression. I also decided to keep the "expressions" statistics kind - it's not allowed to specify it in CREATE STATISTICS, but it's useful internally as it allows deciding whether to build the stats in a single place. Otherwise we'd need to do that every time we build the statistics, etc. I added a brief explanation to the sgml docs, not sure if that's good enough - maybe it needs more details. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company 0001-bootstrap-convert-Typ-to-a-List-20201211.patch (3K) Download Attachment 0002-Allow-composite-types-in-bootstrap-20201211.patch (1K) Download Attachment 0003-Extended-statistics-on-expressions-20201211.patch (241K) Download Attachment |
On Fri, 11 Dec 2020 at 20:17, Tomas Vondra
<[hidden email]> wrote: > > OK. Attached is an updated version, reworking it this way. Cool. I think this is an exciting development, so I hope it makes it into the next release. I have started looking at it. So far I have only looked at the catalog, parser and client changes, but I thought it's worth posting my comments so far. > I tried tweaking the grammar to differentiate these two syntax variants, > but that led to shift/reduce conflicts with the existing ones. I tried > fixing that, but I ended up doing that in CreateStatistics(). Yeah, that makes sense. I wasn't expecting the grammar to change. > The other thing is that we probably can't tie this to just MCV, because > functional dependencies need the per-expression stats too. So I simply > build expression stats whenever there's at least one expression. Makes sense. > I also decided to keep the "expressions" statistics kind - it's not > allowed to specify it in CREATE STATISTICS, but it's useful internally > as it allows deciding whether to build the stats in a single place. > Otherwise we'd need to do that every time we build the statistics, etc. Yes, I thought that would be the easiest way to do it. Essentially the "expressions" stats kind is an internal implementation detail, hidden from the user, because it's built automatically when required, so you don't need to (and can't) explicitly ask for it. This new behaviour seems much more logical to me. > I added a brief explanation to the sgml docs, not sure if that's good > enough - maybe it needs more details. Yes, I think that could use a little tidying up, but I haven't looked too closely yet. Some other comments: * I'm not sure I understand the need for 0001. Wasn't there an earlier version of this patch that just did it by re-populating the type array, but which still had it as an array rather than turning it into a list? Making it a list falsifies some of the comments and function/variable name choices in that file. * There's a comment typo in catalog/Makefile -- "are are reputedly other...", should be "there are reputedly other...". * Looking at the pg_stats_ext view, I think perhaps expressions stats should be omitted entirely from that view, since it doesn't show any useful information about them. So it could remove "e" from the "kinds" array, and exclude rows whose only kind is "e", since such rows have no interesting data in them. Essentially, the new view pg_stats_ext_exprs makes having any expression stats in pg_stats_ext redundant. Hiding this data in pg_stats_ext would also be consistent with making the "expressions" stats kind hidden from the user. * In gram.y, it wasn't quite obvious why you converted the column list for CREATE STATISTICS from an expr_list to a stats_params list. I figured it out, and it makes sense, but I think it could use a comment, perhaps something along the lines of the one for index_elem, e.g.: /* * Statistics attributes can be either simple column references, or arbitrary * expressions in parens. For compatibility with index attributes permitted * in CREATE INDEX, we allow an expression that's just a function call to be * written without parens. */ * In parse_func.c and parse_agg.c, there are a few new error strings that use the abbreviation "stats expressions", whereas most other errors refer to "statistics expressions". For consistency, I think they should all be the latter. * In generateClonedExtStatsStmt(), I think the "expressions" stats kind needs to be explicitly excluded, otherwise CREATE TABLE (LIKE ...) fails if the source table has expression stats. * CreateStatistics() uses ShareUpdateExclusiveLock, but in tcop/utility.c the relation is opened with a ShareLock. ISTM that the latter lock mode should be made to match CreateStatistics(). * Why does the new code in tcop/utility.c not use RangeVarGetRelidExtended together with RangeVarCallbackOwnsRelation? That seems preferable to doing the ACL check in CreateStatistics(). For one thing, as it stands, it allows the lock to be taken even if the user doesn't own the table. Is it intentional that the current code allows extended stats to be created on system catalogs? That would be one thing that using RangeVarCallbackOwnsRelation would change, but I can't see a reason to allow it. * In src/bin/psql/describe.c, I think the \d output should also exclude the "expressions" stats kind and just list the other kinds (or have no kinds list at all, if there are no other kinds), to make it consistent with the CREATE STATISTICS syntax. * The pg_dump output for a stats object whose only kind is "expressions" is broken -- it includes a spurious "()" for the kinds list. That's it for now. I'll look at the optimiser changes next, and try to post more comments later this week. Regards, Dean |
On Mon, Jan 04, 2021 at 03:34:08PM +0000, Dean Rasheed wrote:
> * I'm not sure I understand the need for 0001. Wasn't there an earlier > version of this patch that just did it by re-populating the type > array, but which still had it as an array rather than turning it into > a list? Making it a list falsifies some of the comments and > function/variable name choices in that file. This part is from me. I can review the names if it's desired , but it'd be fine to fall back to the earlier patch. I thought a pglist was cleaner, but it's not needed. -- Justin |
In reply to this post by Dean Rasheed-3
On 1/4/21 4:34 PM, Dean Rasheed wrote:
> > ... > > Some other comments: > > * I'm not sure I understand the need for 0001. Wasn't there an earlier > version of this patch that just did it by re-populating the type > array, but which still had it as an array rather than turning it into > a list? Making it a list falsifies some of the comments and > function/variable name choices in that file. > That's a bit done to Justin - I think it's fine to use the older version repopulating the type array, but that question is somewhat unrelated to this patch. > * There's a comment typo in catalog/Makefile -- "are are reputedly > other...", should be "there are reputedly other...". > > * Looking at the pg_stats_ext view, I think perhaps expressions stats > should be omitted entirely from that view, since it doesn't show any > useful information about them. So it could remove "e" from the "kinds" > array, and exclude rows whose only kind is "e", since such rows have > no interesting data in them. Essentially, the new view > pg_stats_ext_exprs makes having any expression stats in pg_stats_ext > redundant. Hiding this data in pg_stats_ext would also be consistent > with making the "expressions" stats kind hidden from the user. > Hmmm, not sure. I'm not sure removing 'e' from the array is a good idea. On the one hand it's internal detail, on the other hand most of that view is internal detail too. Excluding rows with only 'e' seems reasonable, though. I need to think about this. > * In gram.y, it wasn't quite obvious why you converted the column list > for CREATE STATISTICS from an expr_list to a stats_params list. I > figured it out, and it makes sense, but I think it could use a > comment, perhaps something along the lines of the one for index_elem, > e.g.: > > /* > * Statistics attributes can be either simple column references, or arbitrary > * expressions in parens. For compatibility with index attributes permitted > * in CREATE INDEX, we allow an expression that's just a function call to be > * written without parens. > */ > OH, right. I'd have trouble figuring this myself, and I wrote that code myself only one or two months ago. > * In parse_func.c and parse_agg.c, there are a few new error strings > that use the abbreviation "stats expressions", whereas most other > errors refer to "statistics expressions". For consistency, I think > they should all be the latter. > OK, will fix. > * In generateClonedExtStatsStmt(), I think the "expressions" stats > kind needs to be explicitly excluded, otherwise CREATE TABLE (LIKE > ...) fails if the source table has expression stats. > Yeah, will fix. I guess this also means we're missing some tests. > * CreateStatistics() uses ShareUpdateExclusiveLock, but in > tcop/utility.c the relation is opened with a ShareLock. ISTM that the > latter lock mode should be made to match CreateStatistics(). > Not sure, will check. > * Why does the new code in tcop/utility.c not use > RangeVarGetRelidExtended together with RangeVarCallbackOwnsRelation? > That seems preferable to doing the ACL check in CreateStatistics(). > For one thing, as it stands, it allows the lock to be taken even if > the user doesn't own the table. Is it intentional that the current > code allows extended stats to be created on system catalogs? That > would be one thing that using RangeVarCallbackOwnsRelation would > change, but I can't see a reason to allow it. > I think I copied the code from somewhere - probably expression indexes, or something like that. Not a proof that it's the right/better way to do this, though. > * In src/bin/psql/describe.c, I think the \d output should also > exclude the "expressions" stats kind and just list the other kinds (or > have no kinds list at all, if there are no other kinds), to make it > consistent with the CREATE STATISTICS syntax. > Not sure I understand. Why would this make it consistent with CREATE STATISTICS? Can you elaborate? > * The pg_dump output for a stats object whose only kind is > "expressions" is broken -- it includes a spurious "()" for the kinds > list. > Will fix. Again, this suggests there are TAP tests missing. > That's it for now. I'll look at the optimiser changes next, and try to > post more comments later this week. > Thanks! regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company |
On Tue, 5 Jan 2021 at 00:45, Tomas Vondra <[hidden email]> wrote:
> > On 1/4/21 4:34 PM, Dean Rasheed wrote: > > > > * In src/bin/psql/describe.c, I think the \d output should also > > exclude the "expressions" stats kind and just list the other kinds (or > > have no kinds list at all, if there are no other kinds), to make it > > consistent with the CREATE STATISTICS syntax. > > Not sure I understand. Why would this make it consistent with CREATE > STATISTICS? Can you elaborate? > This isn't absolutely essential, but I think it would be neater. For example, if I have a table with stats like this: CREATE TABLE foo(a int, b int); CREATE STATISTICS foo_s_ab (mcv) ON a,b FROM foo; then the \d output is as follows: \d foo Table "public.foo" Column | Type | Collation | Nullable | Default --------+---------+-----------+----------+--------- a | integer | | | b | integer | | | Statistics objects: "public"."foo_s_ab" (mcv) ON a, b FROM foo and the stats line matches the DDL used to create the stats. It could, for example, be copy-pasted and tweaked to create similar stats on another table, but even if that's not very likely, it's neat that it reflects how the stats were created. OTOH, if there are expressions in the list, it produces something like this: Table "public.foo" Column | Type | Collation | Nullable | Default --------+---------+-----------+----------+--------- a | integer | | | b | integer | | | Statistics objects: "public"."foo_s_ab" (mcv, expressions) ON a, b, ((a * b)) FROM foo which no longer matches the DDL used, and isn't part of an accepted syntax, so seems a bit inconsistent. In general, if we're making the "expressions" kind an internal implementation detail that just gets built automatically when needed, then I think we should hide it from this sort of output, so the list of kinds matches the list that the user used when the stats were created. Regards, Dean |
On 1/5/21 3:10 PM, Dean Rasheed wrote: > On Tue, 5 Jan 2021 at 00:45, Tomas Vondra <[hidden email]> wrote: >> >> On 1/4/21 4:34 PM, Dean Rasheed wrote: >>> >>> * In src/bin/psql/describe.c, I think the \d output should also >>> exclude the "expressions" stats kind and just list the other kinds (or >>> have no kinds list at all, if there are no other kinds), to make it >>> consistent with the CREATE STATISTICS syntax. >> >> Not sure I understand. Why would this make it consistent with CREATE >> STATISTICS? Can you elaborate? >> > > This isn't absolutely essential, but I think it would be neater. For > example, if I have a table with stats like this: > > CREATE TABLE foo(a int, b int); > CREATE STATISTICS foo_s_ab (mcv) ON a,b FROM foo; > > then the \d output is as follows: > > \d foo > Table "public.foo" > Column | Type | Collation | Nullable | Default > --------+---------+-----------+----------+--------- > a | integer | | | > b | integer | | | > Statistics objects: > "public"."foo_s_ab" (mcv) ON a, b FROM foo > > and the stats line matches the DDL used to create the stats. It could, > for example, be copy-pasted and tweaked to create similar stats on > another table, but even if that's not very likely, it's neat that it > reflects how the stats were created. > > OTOH, if there are expressions in the list, it produces something like this: > > Table "public.foo" > Column | Type | Collation | Nullable | Default > --------+---------+-----------+----------+--------- > a | integer | | | > b | integer | | | > Statistics objects: > "public"."foo_s_ab" (mcv, expressions) ON a, b, ((a * b)) FROM foo > > which no longer matches the DDL used, and isn't part of an accepted > syntax, so seems a bit inconsistent. > > In general, if we're making the "expressions" kind an internal > implementation detail that just gets built automatically when needed, > then I think we should hide it from this sort of output, so the list > of kinds matches the list that the user used when the stats were > created. > Hmm, I see. You're probably right it's not necessary to show this, given the modified handling of expression stats (which makes them an internal detail, not exposed to users). I'll tweak this. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company |
Free forum by Nabble | Edit this page |