PoC/WIP: Extended statistics on expressions

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
47 messages Options
123
Reply | Threaded
Open this post in threaded view
|

PoC/WIP: Extended statistics on expressions

Tomas Vondra-6
Hi,

Attached is a PoC/WIP patch adding support for extended statistics on
expressions. This is by no means "ready" - most of the stuff works, but
often in a rather hackish way. I certainly don't expect this to pass
regression tests, for example.

There's an example demonstrating how this works for two queries at the
end of this message. Now let's talk about the main parts of the patch:

1) extending grammar to allow expressions, not just plain columns

   Fairly straighforward, I think. I'm sure the logic which expressions
   are allowed is not 100% (e.g. volatile functions etc.) but that's
   a detail we can deal with later.

2) store the expressions in pg_statistic_ext catalog

   I ended up adding a separate column, similar to indexprs, except that
   the order of columns/expressions does not matter, so we don't need to
   bother with storing 0s in stxkeys - we simply consider expressions to
   be "after" all the simple columns.

3) build statistics

   This should work too, for all three types of statistics we have (mcv,
   dependencies and ndistinct). This should work too, although the code
   changes are often very hackish "to make it work".

   The main challenge here was how to represent the expressions in the
   statistics - e.g. in ndistinct, which track ndistinct estimates for
   combinations of parameters, and so far we used attnums for that. I
   decided the easiest way it to keep doing that, but offset the
   expressions by MaxHeapAttributeNumber. That seems to work, but maybe
   there's a better way.

4) apply the statistics

   This is the hard part, really, and the exact state of the support
   depends on type of statistics.

   For ndistinct coefficients, it generally works. I'm sure there may be
   bugs in estimate_num_groups, etc. but in principle it works.

   For MCV lists, it generally works too - you can define statistics on
   the expressions and the estimates should improve. The main downside
   here is that it requires at least two expressions, otherwise we can't
   build/apply the extended statistics. So for example

      SELECT * FROM t WHERE mod(a,100) = 10 AND mod(b,11) = 0

   may be estimated "correctly", once you drop any of the conditions it
   gets much worse as we don't have stats for individual expressions.
   That's rather annoying - it does not break the extended MCV, but the
   behavior will certainly cause confusion.

   For functional dependencies, the estimation does not work yet. Also,
   the missing per-column statistics have bigger impact than on MCV,
   because while MCV can work fine without it, the dependencies heavily
   rely on the per-column estimates. We only apply "corrections" based
   on the dependency degree, so we still need (good) per-column
   estimates, which does not quite work with the expressions.


   Of course, the lack of per-expression statistics may be somewhat
   fixed by adding indexes on expressions, but that's kinda expensive.


Now, a simple example demonstrating how this improves estimates - let's
create a table with 1M rows, and do queries with mod() expressions on
it. It might be date_trunc() or something similar, that'd work too.


table with 1M rows
==================

test=# create table t (a int);
test=# insert into t select i from generate_series(1,1000000) s(i);
test=# analyze t;


poorly estimated queries
========================

test=# explain (analyze, timing off) select * from t where mod(a,3) = 0
and mod(a,7) = 0;
                                    QUERY PLAN

----------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..24425.00 rows=25 width=4) (actual rows=47619
loops=1)
   Filter: ((mod(a, 3) = 0) AND (mod(a, 7) = 0))
   Rows Removed by Filter: 952381
 Planning Time: 0.329 ms
 Execution Time: 156.675 ms
(5 rows)

test=# explain (analyze, timing off) select mod(a,3), mod(a,7) from t
group by 1, 2;
                                          QUERY PLAN

-----------------------------------------------------------------------------------------------
 HashAggregate  (cost=75675.00..98487.50 rows=1000000 width=8) (actual
rows=21 loops=1)
   Group Key: mod(a, 3), mod(a, 7)
   Planned Partitions: 32  Batches: 1  Memory Usage: 1561kB
   ->  Seq Scan on t  (cost=0.00..19425.00 rows=1000000 width=8) (actual
rows=1000000 loops=1)
 Planning Time: 0.277 ms
 Execution Time: 502.803 ms
(6 rows)


improved estimates
==================

test=# create statistics s1 (ndistinct) on mod(a,3), mod(a,7) from t;
test=# analyze t;

test=# explain (analyze, timing off) select mod(a,3), mod(a,7) from t
group by 1, 2;
                                          QUERY PLAN

-----------------------------------------------------------------------------------------------
 HashAggregate  (cost=24425.00..24425.31 rows=21 width=8) (actual
rows=21 loops=1)
   Group Key: mod(a, 3), mod(a, 7)
   Batches: 1  Memory Usage: 24kB
   ->  Seq Scan on t  (cost=0.00..19425.00 rows=1000000 width=8) (actual
rows=1000000 loops=1)
 Planning Time: 0.135 ms
 Execution Time: 500.092 ms
(6 rows)

test=# create statistics s2 (mcv) on mod(a,3), mod(a,7) from t;
test=# analyze t;

test=# explain (analyze, timing off) select * from t where mod(a,3) = 0
and mod(a,7) = 0;
                                     QUERY PLAN

-------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..24425.00 rows=46433 width=4) (actual
rows=47619 loops=1)
   Filter: ((mod(a, 3) = 0) AND (mod(a, 7) = 0))
   Rows Removed by Filter: 952381
 Planning Time: 0.702 ms
 Execution Time: 152.280 ms
(5 rows)


Clearly, estimates for both queries are significantly improved. Of
course, this example is kinda artificial/simplistic.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

0001-Support-for-extended-statistics-on-expressi-20201116.patch (154K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: PoC/WIP: Extended statistics on expressions

Tomas Vondra-6
On 11/16/20 2:49 PM, Tomas Vondra wrote:

> Hi,
>
> ...
>
> 4) apply the statistics
>
>    This is the hard part, really, and the exact state of the support
>    depends on type of statistics.
>
>    For ndistinct coefficients, it generally works. I'm sure there may be
>    bugs in estimate_num_groups, etc. but in principle it works.
>
>    For MCV lists, it generally works too - you can define statistics on
>    the expressions and the estimates should improve. The main downside
>    here is that it requires at least two expressions, otherwise we can't
>    build/apply the extended statistics. So for example
>
>       SELECT * FROM t WHERE mod(a,100) = 10 AND mod(b,11) = 0
>
>    may be estimated "correctly", once you drop any of the conditions it
>    gets much worse as we don't have stats for individual expressions.
>    That's rather annoying - it does not break the extended MCV, but the
>    behavior will certainly cause confusion.
>
>    For functional dependencies, the estimation does not work yet. Also,
>    the missing per-column statistics have bigger impact than on MCV,
>    because while MCV can work fine without it, the dependencies heavily
>    rely on the per-column estimates. We only apply "corrections" based
>    on the dependency degree, so we still need (good) per-column
>    estimates, which does not quite work with the expressions.
>
>
>    Of course, the lack of per-expression statistics may be somewhat
>    fixed by adding indexes on expressions, but that's kinda expensive.
>

FWIW after re-reading [1], I think the plan to build pg_statistic rows
for expressions and stash them in pg_statistic_ext_data is the way to
go. I was thinking that maybe we'll need some new statistics type to
request this, e.g.

   CREATE STATISTICS s (expressions) ON ...

but on second thought I think we should just build this whenever there
are expressions in the definition. It'll require some changes (e.g. we
require at least two items in the list, but we'll want to allow building
stats on a single expression too, I guess), but that's doable.

Of course, we don't have any catalogs with composite types yet, so it's
not 100% sure this will work, but it's worth a try.

regards


[1]
https://www.postgresql.org/message-id/flat/6331.1579041473%40sss.pgh.pa.us#5ec6af7583e84cef2ca6a9e8a713511e

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply | Threaded
Open this post in threaded view
|

Re: PoC/WIP: Extended statistics on expressions

Tomas Vondra-6
Hi,

attached is a significantly improved version of the patch, allowing
defining extended statistics on expressions. This fixes most of the
problems in the previous WIP version and AFAICS it does pass all
regression tests (including under valgrind). There's a bunch of FIXMEs
and a couple loose ends, but overall I think it's ready for reviews.


Overall, the patch does two main things:

* it adds a new "expressions" statistics kind, building per-expression
statistics (i.e it's similar to having expression index)

* it allows using expressions in definition of extended statistics, and
properly handles that in all existing statistics kinds (dependencies,
mcv, ndistinct)


The expression handling mostly copies what we do for indexes, with
similar restrictions - no volatile functions, aggregates etc. The list
of expressions is stored in pg_statistic_ext catalog, but unlike for
indexes we don't need to worry about the exact order of elements, so
there are no "0" for expressions in stxkeys etc. We simply assume the
expressions come after simple columns, and that's it.

To reference expressions in the built statistics (e.g. in a dependency)
we use "special attnums" computed from the expression index by adding
MaxHeapAttributeNumber. So the first expression has attnum 1601 etc.

This mapping expressions to attnums is used both while building and
applying the statistics to clauses, as it makes the whole process much
simpler than dealing with attnums and expressions entirely separately.


The first part allows us to do something like this:

    CREATE TABLE t (a int);
    INSERT INTO t SELECT i FROM generate_series(1,1000000) s(i);
    ANALYZE t;

    EXPLAIN (ANALYZE, TIMING OFF)
    SELECT * FROM t WHERE mod(a,10) = 0;

    CREATE STATISTICS s (expressions) ON mod(a,10) FROM t;
    ANALYZE t;

    EXPLAIN (ANALYZE, TIMING OFF)
    SELECT * FROM t WHERE mod(a,10) = 0;

Without the statistics we get this:

                                 QUERY PLAN
    --------------------------------------------------------
     Seq Scan on t  (cost=0.00..19425.00 rows=5000 width=4)
                    (actual rows=100000 loops=1)
       Filter: (mod(a, 10) = 0)
       Rows Removed by Filter: 900000
     Planning Time: 0.216 ms
     Execution Time: 157.552 ms
    (5 rows)

while with the statistics we get this

                                 QUERY PLAN
    ----------------------------------------------------------
     Seq Scan on t  (cost=0.00..19425.00 rows=100900 width=4)
                    (actual rows=100000 loops=1)
       Filter: (mod(a, 10) = 0)
       Rows Removed by Filter: 900000
     Planning Time: 0.399 ms
     Execution Time: 157.530 ms
    (5 rows)

So that's pretty nice improvement. In practice you could get the same
effect by creating an expression index

    CREATE INDEX ON t (mod(a,10));

but of course that's far from free - there's cost to maintain the index,
it blocks HOT, and it takes space on disk. The statistics have none of
these issues.

Implementation-wise, this simply builds per-column statistics for each
expression, and stashes them into a new column in pg_statistic_ext_data
catalog as an array of elements with pg_statistic composite type. And
then in selfuncs.c we look not just at indexes, but also at this when
looking for expression stats.

So that gives us the per-expression stats. This is enabled by default
when you don't specify the statistics type and the definition includes
any expression that is not a simple column reference. Otherwise you may
also request it explicitly by using "expressions" in the CREATE.


Now, the second part is really just a natural extension of the existing
stats to also work with expressions. The easiest thing is probably to
show some examples, so consider this:

    CREATE TABLE t (a INT, b INT, c INT);
    INSERT INTO t SELECT i, i, i FROM generate_series(1,1000000) s(i);
    ANALYZE t;

which without any statistics gives us this:

    EXPLAIN (ANALYZE, TIMING OFF)
    SELECT 1 FROM t WHERE mod(a,10) = 0 AND mod(b,5) = 0;

                           QUERY PLAN
    ------------------------------------------------------
     Seq Scan on t  (cost=0.00..25406.00 rows=25 width=4)
                    (actual rows=100000 loops=1)
       Filter: ((mod(a, 10) = 0) AND (mod(b, 5) = 0))
       Rows Removed by Filter: 900000
     Planning Time: 0.080 ms
     Execution Time: 161.445 ms
    (5 rows)


    EXPLAIN (ANALYZE, TIMING OFF)
    SELECT 1 FROM t GROUP BY mod(a,10), mod(b,5);

                               QUERY PLAN
    ------------------------------------------------------------------
     HashAggregate  (cost=76656.00..99468.50 rows=1000000 width=12)
                    (actual rows=10 loops=1)
       Group Key: mod(a, 10), mod(b, 5)
       Planned Partitions: 32  Batches: 1  Memory Usage: 1561kB
       ->  Seq Scan on t  (cost=0.00..20406.00 rows=1000000 width=8)
                          (actual rows=1000000 loops=1)
     Planning Time: 0.232 ms
     Execution Time: 514.446 ms
    (6 rows)

and now let's add statistics on the expressions:

    CREATE STATISTICS s ON mod(a,10), mod(b,5) FROM t;
    ANALYZE t;

which ends up with these spot-on estimates:

    EXPLAIN (ANALYZE, TIMING OFF)
    SELECT 1 FROM t WHERE mod(a,10) = 0 AND mod(b,5) = 0;

                            QUERY PLAN
    ---------------------------------------------------------
     Seq Scan on t  (cost=0.00..25406.00 rows=97400 width=4)
                    (actual rows=100000 loops=1)
       Filter: ((mod(a, 10) = 0) AND (mod(b, 5) = 0))
       Rows Removed by Filter: 900000
     Planning Time: 0.366 ms
     Execution Time: 159.207 ms
    (5 rows)

    EXPLAIN (ANALYZE, TIMING OFF)
    SELECT 1 FROM t GROUP BY mod(a,10), mod(b,5);

                               QUERY PLAN
    -----------------------------------------------------------------
     HashAggregate  (cost=25406.00..25406.15 rows=10 width=12)
                    (actual rows=10 loops=1)
       Group Key: mod(a, 10), mod(b, 5)
       Batches: 1  Memory Usage: 24kB
       ->  Seq Scan on t  (cost=0.00..20406.00 rows=1000000 width=8)
                          (actual rows=1000000 loops=1)
     Planning Time: 0.299 ms
     Execution Time: 530.793 ms
    (6 rows)

Of course, this is a very simple query, but hopefully you get the idea.


There's about two main areas where I think might be hidden issues:

1) We're kinda faking the pg_statistic entries, and I suppose there
might be some loose ends (e.g. with respect to ACLs etc.).

2) Memory management while evaluating the expressions during analyze is
kinda simplistic, we're probably keeping the memory around longer than
needed etc.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

0001-Extended-statistics-on-expressions-20201122.patch (216K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: PoC/WIP: Extended statistics on expressions

Justin Pryzby
On Sun, Nov 22, 2020 at 08:03:51PM +0100, Tomas Vondra wrote:
> attached is a significantly improved version of the patch, allowing
> defining extended statistics on expressions. This fixes most of the
> problems in the previous WIP version and AFAICS it does pass all
> regression tests (including under valgrind). There's a bunch of FIXMEs
> and a couple loose ends, but overall I think it's ready for reviews.

I was looking at the previous patch, so now read this one instead, and attach
some proposed fixes.

+  * This matters especially for * expensive expressions, of course.

+   The expression can refer only to columns of the underlying table, but
+   it can use all columns, not just the ones the statistics is defined
+   on.

I don't know what these are trying to say?

+                                errmsg("statistics expressions and predicates can refer only to the table being indexed")));
+        * partial-index predicates.  Create it in the per-index context to be

I think these are copied and shouldn't mention "indexes" or "predicates".  Or
should statistics support predicates, too ?

Idea: if a user specifies no stakinds, and there's no expression specified,
then you automatically build everything except for expressional stats.  But if
they specify only one statistics "column", it gives an error.  If that's a
non-simple column reference, should that instead build *only* expressional
stats (possibly with a NOTICE, since the user might be intending to make MV
stats).

I think pg_stats_ext should allow inspecting the pg_statistic data in
pg_statistic_ext_data.stxdexprs.  I guess array_agg() should be ordered by
something, so maybe it should use ORDINALITY (?)

I hacked more on bootstrap.c so included that here.

--
Justin

0001-bootstrap-convert-Typ-to-a-List.patch (3K) Download Attachment
0002-Allow-composite-types-in-bootstrap.patch (1K) Download Attachment
0003-Extended-statistics-on-expressions.patch (215K) Download Attachment
0004-Fix-silly-errors.patch (2K) Download Attachment
0005-Small-language-fixen.patch (4K) Download Attachment
0006-Some-cleanup.patch (11K) Download Attachment
0007-WIP-Update-pg_stats_ext-for-expressional-stats-incom.patch (12K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: PoC/WIP: Extended statistics on expressions

Tomas Vondra-6


On 11/23/20 3:26 AM, Justin Pryzby wrote:

> On Sun, Nov 22, 2020 at 08:03:51PM +0100, Tomas Vondra wrote:
>> attached is a significantly improved version of the patch, allowing
>> defining extended statistics on expressions. This fixes most of the
>> problems in the previous WIP version and AFAICS it does pass all
>> regression tests (including under valgrind). There's a bunch of FIXMEs
>> and a couple loose ends, but overall I think it's ready for reviews.
>
> I was looking at the previous patch, so now read this one instead, and attach
> some proposed fixes.
>
> +  * This matters especially for * expensive expressions, of course.
>

The point this was trying to make is that we evaluate the expressions
only once, and use the results to build all extended statistics. Instead
of leaving it up to every "build" to re-evaluate it.

> +   The expression can refer only to columns of the underlying table, but
> +   it can use all columns, not just the ones the statistics is defined
> +   on.
>
> I don't know what these are trying to say?
>

D'oh. That's bogus para, copied from the CREATE INDEX docs (where it
talked about the index predicate, which is irrelevant here).

> +                                errmsg("statistics expressions and predicates can refer only to the table being indexed")));
> +        * partial-index predicates.  Create it in the per-index context to be
>
> I think these are copied and shouldn't mention "indexes" or "predicates".  Or
> should statistics support predicates, too ?
>

Right. Stupid copy-pasto.

> Idea: if a user specifies no stakinds, and there's no expression specified,
> then you automatically build everything except for expressional stats.  But if
> they specify only one statistics "column", it gives an error.  If that's a
> non-simple column reference, should that instead build *only* expressional
> stats (possibly with a NOTICE, since the user might be intending to make MV
> stats).
>

Right, that was the intention - but I messed up and it works only if you
specify the "expressions" kind explicitly (and I also added the ERROR
message to expected output by mistake). I agree we should handle this
automatically, so that

   CREATE STATISTICS s ON (a+b) FROM t

works and only creates statistics for the expression.


> I think pg_stats_ext should allow inspecting the pg_statistic data in
> pg_statistic_ext_data.stxdexprs.  I guess array_agg() should be ordered by
> something, so maybe it should use ORDINALITY (?)
>

I agree we should expose the expression statistics, but I'm not
convinced we should do that in the pg_stats_ext view itself. The problem
is that it's a table bested in a table, essentially, with non-trivial
structure, so I was thinking about adding a separate view exposing just
this one part. Something like pg_stats_ext_expressions, with about the
same structure as pg_stats, or something.

> I hacked more on bootstrap.c so included that here.

Thanks. As for the 0004-0007 patches:

0004 - Seems fine. IMHO not really "silly errors" but OK.

0005 - Mostly OK. The docs wording mostly comes from CREATE INDEX docs,
though. The paragraph about "t1" is old, so if we want to reword it then
maybe we should backpatch too.

0006 - Not sure. I think CreateStatistics can be fixed with less code,
keeping it more like PG13 (good for backpatching). Not sure why rename
extended statistics to multi-variate statistics - we use "extended"
everywhere. Not sure what's the point of serialize_expr_stats changes,
that's code is mostly copy-paste from update_attstats.

0007 - I suspect this makes the pg_stats_ext too complex to work with,
IMHO we should move this to a separate view.


Thanks for the review! I'll try to look more closely at those patches
sometime next week, and merge most of it.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply | Threaded
Open this post in threaded view
|

Re: PoC/WIP: Extended statistics on expressions

Tomas Vondra-6
Hi,

Attached is an updated version of the patch series, merging some of the
changes proposed by Justin. I've kept the bootstrap patches separate, at
least for now.

As for the individual 0004-0007 patches:

1) 0004 - merged as is

2) 0005 - I've merged some of the docs changes, but some of the wording
was copied from CREATE INDEX docs in which case I've kept that. I've
also not merged changed to pre-existing docs, like the t1 example which
is unrelated to this patch.

OTOH I've corrected the t3 example description, which was somewhat bogus
and unrelated to what the example actually did. I've also removed the
irrelevant para which originally described index predicates and was
copied from CREATE INDEX docs by mistake.

3) 0006 - I've committed something similar / less invasive, achieving
the same goals (I think), and I've added a couple regression tests.

4) 0007 - I agreed we need a way to expose the stats, but including this
in pg_stats_ext seems rather inconvenient (table in a table is difficult
to work with). Instead I've added a new catalog pg_stats_ext_exprs with
structure similar to pg_stats. I've also added the expressions to the
pg_stats_ext catalog, which was only showing the attributes, and some
basic docs for the catalog changes.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

0001-bootstrap-convert-Typ-to-a-List-20201123.patch (3K) Download Attachment
0002-Allow-composite-types-in-bootstrap-20201123.patch (1K) Download Attachment
0003-Extended-statistics-on-expressions-20201123.patch (240K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: PoC/WIP: Extended statistics on expressions

Justin Pryzby
In reply to this post by Tomas Vondra-6
On Mon, Nov 23, 2020 at 04:30:26AM +0100, Tomas Vondra wrote:
> 0004 - Seems fine. IMHO not really "silly errors" but OK.

This is one of the same issues you pointed out - shadowing a variable.
Could be backpatched.

On Mon, Nov 23, 2020 at 04:30:26AM +0100, Tomas Vondra wrote:
> > +                                errmsg("statistics expressions and predicates can refer only to the table being indexed")));
> > +        * partial-index predicates.  Create it in the per-index context to be
> >
> > I think these are copied and shouldn't mention "indexes" or "predicates".  Or
> > should statistics support predicates, too ?
> >
>
> Right. Stupid copy-pasto.

Right, but then I was wondering if CREATE STATS should actually support
predicates, since one use case is to do what indexes do without their overhead.
I haven't thought about it enough yet.

> 0006 - Not sure. I think CreateStatistics can be fixed with less code,
> keeping it more like PG13 (good for backpatching). Not sure why rename
> extended statistics to multi-variate statistics - we use "extended"
> everywhere.

-       if (build_expressions && (list_length(stxexprs) == 0))
+       if (!build_expressions_only && (list_length(stmt->exprs) < 2))
                ereport(ERROR,  
                                (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
-                                errmsg("extended expression statistics require at least one expression")));
+                                errmsg("multi-variate statistics require at least two columns")));

I think all of "CREATE STATISTICS" has been known as "extended stats", so I
think it may be confusing to say that it requires two columns for the general
facility.

> Not sure what's the point of serialize_expr_stats changes,
> that's code is mostly copy-paste from update_attstats.

Right.  I think "i" is poor variable name when it isn't a loop variable and not
of limited scope.

> 0007 - I suspect this makes the pg_stats_ext too complex to work with,
> IMHO we should move this to a separate view.

Right - then unnest() the whole thing and return one row per expression rather
than array, as you've done.  Maybe the docs should say that this returns one
row per expression.

Looking quickly at your new patch: I guess you know there's a bunch of
lingering references to "indexes" and "predicates":

I don't know if you want to go to the effort to prohibit this.
postgres=# CREATE STATISTICS asf ON (tableoid::int+1) FROM t;
CREATE STATISTICS

I think a lot of people will find this confusing:

postgres=# CREATE STATISTICS asf ON i FROM t;
ERROR:  extended statistics require at least 2 columns
postgres=# CREATE STATISTICS asf ON (i) FROM t;
CREATE STATISTICS

postgres=# CREATE STATISTICS asf (expressions) ON i FROM t;
ERROR:  extended expression statistics require at least one expression
postgres=# CREATE STATISTICS asf (expressions) ON (i) FROM t;
CREATE STATISTICS

I haven't looked, but is it possible to make it work without parens ?

--
Justin


Reply | Threaded
Open this post in threaded view
|

Re: PoC/WIP: Extended statistics on expressions

Tomas Vondra-6


On 11/24/20 5:23 PM, Justin Pryzby wrote:

> On Mon, Nov 23, 2020 at 04:30:26AM +0100, Tomas Vondra wrote:
>> 0004 - Seems fine. IMHO not really "silly errors" but OK.
>
> This is one of the same issues you pointed out - shadowing a variable.
> Could be backpatched.
>
> On Mon, Nov 23, 2020 at 04:30:26AM +0100, Tomas Vondra wrote:
>>> +                                errmsg("statistics expressions and predicates can refer only to the table being indexed")));
>>> +        * partial-index predicates.  Create it in the per-index context to be
>>>
>>> I think these are copied and shouldn't mention "indexes" or "predicates".  Or
>>> should statistics support predicates, too ?
>>>
>>
>> Right. Stupid copy-pasto.
>
> Right, but then I was wondering if CREATE STATS should actually support
> predicates, since one use case is to do what indexes do without their overhead.
> I haven't thought about it enough yet.
>

Well, it's not supported now, so the message is bogus. I'm not against
supporting "partial statistics" with predicates in the future, but it's
going to be non-trivial project on it's own. It's not something I can
bolt onto the current patch easily.

>> 0006 - Not sure. I think CreateStatistics can be fixed with less code,
>> keeping it more like PG13 (good for backpatching). Not sure why rename
>> extended statistics to multi-variate statistics - we use "extended"
>> everywhere.
>
> -       if (build_expressions && (list_length(stxexprs) == 0))
> +       if (!build_expressions_only && (list_length(stmt->exprs) < 2))
>                 ereport(ERROR,  
>                                 (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
> -                                errmsg("extended expression statistics require at least one expression")));
> +                                errmsg("multi-variate statistics require at least two columns")));
>
> I think all of "CREATE STATISTICS" has been known as "extended stats", so I
> think it may be confusing to say that it requires two columns for the general
> facility.
>
>> Not sure what's the point of serialize_expr_stats changes,
>> that's code is mostly copy-paste from update_attstats.
>
> Right.  I think "i" is poor variable name when it isn't a loop variable and not
> of limited scope.
>

OK, I understand. I'll consider tweaking that.

>> 0007 - I suspect this makes the pg_stats_ext too complex to work with,
>> IMHO we should move this to a separate view.
>
> Right - then unnest() the whole thing and return one row per expression rather
> than array, as you've done.  Maybe the docs should say that this returns one
> row per expression.
>
> Looking quickly at your new patch: I guess you know there's a bunch of
> lingering references to "indexes" and "predicates":
>
> I don't know if you want to go to the effort to prohibit this.
> postgres=# CREATE STATISTICS asf ON (tableoid::int+1) FROM t;
> CREATE STATISTICS
>

Hmm, we're already rejecting system attributes, I suppose we should do
the same thing for expressions on system attributes.

> I think a lot of people will find this confusing:
>
> postgres=# CREATE STATISTICS asf ON i FROM t;
> ERROR:  extended statistics require at least 2 columns
> postgres=# CREATE STATISTICS asf ON (i) FROM t;
> CREATE STATISTICS
>
> postgres=# CREATE STATISTICS asf (expressions) ON i FROM t;
> ERROR:  extended expression statistics require at least one expression
> postgres=# CREATE STATISTICS asf (expressions) ON (i) FROM t;
> CREATE STATISTICS
>
> I haven't looked, but is it possible to make it work without parens ?
>

Hmm, you're right that may be surprising. I suppose we could walk the
expressions while creating the statistics, and replace such trivial
expressions with the nested variable, but I haven't tried. I wonder what
the CREATE INDEX behavior would be in these cases.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply | Threaded
Open this post in threaded view
|

Re: PoC/WIP: Extended statistics on expressions

Tomas Vondra-6
In reply to this post by Tomas Vondra-6
Hi,

Attached is a patch series rebased on top of 25a9e54d2d which improves
estimation of OR clauses. There were only a couple minor conflicts.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

0001-bootstrap-convert-Typ-to-a-List-20201203.patch (3K) Download Attachment
0002-Allow-composite-types-in-bootstrap-20201203.patch (1K) Download Attachment
0003-Extended-statistics-on-expressions-20201203.patch (240K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: PoC/WIP: Extended statistics on expressions

Dean Rasheed-3
On Thu, 3 Dec 2020 at 15:23, Tomas Vondra <[hidden email]> wrote:
>
> Attached is a patch series rebased on top of 25a9e54d2d.

After reading this thread and [1], I think I prefer the name
"standard" rather than "expressions", because it is meant to describe
the kind of statistics being built rather than what they apply to, but
maybe that name doesn't actually need to be exposed to the end user:

Looking at the current behaviour, there are a couple of things that
seem a little odd, even though they are understandable. For example,
the fact that

  CREATE STATISTICS s (expressions) ON (expr), col FROM tbl;

fails, but

  CREATE STATISTICS s (expressions, mcv) ON (expr), col FROM tbl;

succeeds and creates both "expressions" and "mcv" statistics. Also, the syntax

  CREATE STATISTICS s (expressions) ON (expr1), (expr2) FROM tbl;

tends to suggest that it's going to create statistics on the pair of
expressions, describing their correlation, when actually it builds 2
independent statistics. Also, this error text isn't entirely accurate:

  CREATE STATISTICS s ON col FROM tbl;
  ERROR:  extended statistics require at least 2 columns

because extended statistics don't always require 2 columns, they can
also just have an expression, or multiple expressions and 0 or 1
columns.

I think a lot of this stems from treating "expressions" in the same
way as the other (multi-column) stats kinds, and it might actually be
neater to have separate documented syntaxes for single- and
multi-column statistics:

  CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
    ON (expression)
    FROM table_name

  CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
    [ ( statistics_kind [, ... ] ) ]
    ON { column_name | (expression) } , { column_name | (expression) } [, ...]
    FROM table_name

The first syntax would create single-column stats, and wouldn't accept
a statistics_kind argument, because there is only one kind of
single-column statistic. Maybe that might change in the future, but if
so, it's likely that the kinds of single-column stats will be
different from the kinds of multi-column stats.

In the second syntax, the only accepted kinds would be the current
multi-column stats kinds (ndistinct, dependencies, and mcv), and it
would always build stats describing the correlations between the
columns listed. It would continue to build standard/expression stats
on any expressions in the list, but that's more of an implementation
detail.

It would no longer be possible to do "CREATE STATISTICS s
(expressions) ON (expr1), (expr2) FROM tbl". Instead, you'd have to
issue 2 separate "CREATE STATISTICS" commands, but that seems more
logical, because they're independent stats.

The parsing code might not change much, but some of the errors would
be different. For example, the errors "building only extended
expression statistics on simple columns not allowed" and "extended
expression statistics require at least one expression" would go away,
and the error "extended statistics require at least 2 columns" might
become more specific, depending on the stats kind.

Regards,
Dean

[1] https://www.postgresql.org/message-id/flat/1009.1579038764%40sss.pgh.pa.us#8624792a20ae595683b574f5933dae53


Reply | Threaded
Open this post in threaded view
|

Re: PoC/WIP: Extended statistics on expressions

Tomas Vondra-6


On 12/7/20 10:56 AM, Dean Rasheed wrote:

> On Thu, 3 Dec 2020 at 15:23, Tomas Vondra <[hidden email]> wrote:
>>
>> Attached is a patch series rebased on top of 25a9e54d2d.
>
> After reading this thread and [1], I think I prefer the name
> "standard" rather than "expressions", because it is meant to describe
> the kind of statistics being built rather than what they apply to, but
> maybe that name doesn't actually need to be exposed to the end user:
>
> Looking at the current behaviour, there are a couple of things that
> seem a little odd, even though they are understandable. For example,
> the fact that
>
>   CREATE STATISTICS s (expressions) ON (expr), col FROM tbl;
>
> fails, but
>
>   CREATE STATISTICS s (expressions, mcv) ON (expr), col FROM tbl;
>
> succeeds and creates both "expressions" and "mcv" statistics. Also, the syntax
>
>   CREATE STATISTICS s (expressions) ON (expr1), (expr2) FROM tbl;
>
> tends to suggest that it's going to create statistics on the pair of
> expressions, describing their correlation, when actually it builds 2
> independent statistics. Also, this error text isn't entirely accurate:
>
>   CREATE STATISTICS s ON col FROM tbl;
>   ERROR:  extended statistics require at least 2 columns
>
> because extended statistics don't always require 2 columns, they can
> also just have an expression, or multiple expressions and 0 or 1
> columns.
>
> I think a lot of this stems from treating "expressions" in the same
> way as the other (multi-column) stats kinds, and it might actually be
> neater to have separate documented syntaxes for single- and
> multi-column statistics:
>
>   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
>     ON (expression)
>     FROM table_name
>
>   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
>     [ ( statistics_kind [, ... ] ) ]
>     ON { column_name | (expression) } , { column_name | (expression) } [, ...]
>     FROM table_name
>
> The first syntax would create single-column stats, and wouldn't accept
> a statistics_kind argument, because there is only one kind of
> single-column statistic. Maybe that might change in the future, but if
> so, it's likely that the kinds of single-column stats will be
> different from the kinds of multi-column stats.
>
> In the second syntax, the only accepted kinds would be the current
> multi-column stats kinds (ndistinct, dependencies, and mcv), and it
> would always build stats describing the correlations between the
> columns listed. It would continue to build standard/expression stats
> on any expressions in the list, but that's more of an implementation
> detail.
>
> It would no longer be possible to do "CREATE STATISTICS s
> (expressions) ON (expr1), (expr2) FROM tbl". Instead, you'd have to
> issue 2 separate "CREATE STATISTICS" commands, but that seems more
> logical, because they're independent stats.
>
> The parsing code might not change much, but some of the errors would
> be different. For example, the errors "building only extended
> expression statistics on simple columns not allowed" and "extended
> expression statistics require at least one expression" would go away,
> and the error "extended statistics require at least 2 columns" might
> become more specific, depending on the stats kind.
>

I think it makes sense in general. I see two issues with this approach,
though:

* By adding expression/standard stats for individual statistics, it
makes the list of statistics longer - I wonder if this might have
measurable impact on lookups in this list.

* I'm not sure it's a good idea that the second syntax would always
build the per-expression stats. Firstly, it seems a bit strange that it
behaves differently than the other kinds. Secondly, I wonder if there
are cases where it'd be desirable to explicitly disable building these
per-expression stats. For example, what if we have multiple extended
statistics objects, overlapping on a couple expressions. It seems
pointless to build the stats for all of them.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply | Threaded
Open this post in threaded view
|

Re: PoC/WIP: Extended statistics on expressions

Dean Rasheed-3
On Mon, 7 Dec 2020 at 14:15, Tomas Vondra <[hidden email]> wrote:

>
> On 12/7/20 10:56 AM, Dean Rasheed wrote:
> > it might actually be
> > neater to have separate documented syntaxes for single- and
> > multi-column statistics:
> >
> >   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
> >     ON (expression)
> >     FROM table_name
> >
> >   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
> >     [ ( statistics_kind [, ... ] ) ]
> >     ON { column_name | (expression) } , { column_name | (expression) } [, ...]
> >     FROM table_name
>
> I think it makes sense in general. I see two issues with this approach,
> though:
>
> * By adding expression/standard stats for individual statistics, it
> makes the list of statistics longer - I wonder if this might have
> measurable impact on lookups in this list.
>
> * I'm not sure it's a good idea that the second syntax would always
> build the per-expression stats. Firstly, it seems a bit strange that it
> behaves differently than the other kinds. Secondly, I wonder if there
> are cases where it'd be desirable to explicitly disable building these
> per-expression stats. For example, what if we have multiple extended
> statistics objects, overlapping on a couple expressions. It seems
> pointless to build the stats for all of them.
>

Hmm, I'm not sure it would really be a good idea to build MCV stats on
expressions without also building the standard stats for those
expressions, otherwise the assumptions that
mcv_combine_selectivities() makes about simple_sel and mcv_basesel
wouldn't really hold. But then, if multiple MCV stats shared the same
expression, it would be quite wasteful to build standard stats on the
expression more than once.

It feels like it should build a single extended stats object for each
unique expression, with appropriate dependencies for any MCV stats
that used those expressions, but I'm not sure how complex that would
be. Dropping the last MCV stat object using a standard expression stat
object might reasonably drop the expression stats ... except if they
were explicitly created by the user, independently of any MCV stats.
That could get quite messy.

Regards,
Dean


Reply | Threaded
Open this post in threaded view
|

Re: PoC/WIP: Extended statistics on expressions

Tomas Vondra-6


On 12/7/20 5:02 PM, Dean Rasheed wrote:

> On Mon, 7 Dec 2020 at 14:15, Tomas Vondra <[hidden email]> wrote:
>>
>> On 12/7/20 10:56 AM, Dean Rasheed wrote:
>>> it might actually be
>>> neater to have separate documented syntaxes for single- and
>>> multi-column statistics:
>>>
>>>   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
>>>     ON (expression)
>>>     FROM table_name
>>>
>>>   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
>>>     [ ( statistics_kind [, ... ] ) ]
>>>     ON { column_name | (expression) } , { column_name | (expression) } [, ...]
>>>     FROM table_name
>>
>> I think it makes sense in general. I see two issues with this approach,
>> though:
>>
>> * By adding expression/standard stats for individual statistics, it
>> makes the list of statistics longer - I wonder if this might have
>> measurable impact on lookups in this list.
>>
>> * I'm not sure it's a good idea that the second syntax would always
>> build the per-expression stats. Firstly, it seems a bit strange that it
>> behaves differently than the other kinds. Secondly, I wonder if there
>> are cases where it'd be desirable to explicitly disable building these
>> per-expression stats. For example, what if we have multiple extended
>> statistics objects, overlapping on a couple expressions. It seems
>> pointless to build the stats for all of them.
>>
>
> Hmm, I'm not sure it would really be a good idea to build MCV stats on
> expressions without also building the standard stats for those
> expressions, otherwise the assumptions that
> mcv_combine_selectivities() makes about simple_sel and mcv_basesel
> wouldn't really hold. But then, if multiple MCV stats shared the same
> expression, it would be quite wasteful to build standard stats on the
> expression more than once.
>

Yeah. You're right it'd be problematic to build MCV on expressions
without having the per-expression stats. In fact, that's exactly the
problem what forced me to add the per-expression stats to this patch.
Originally I planned to address it in a later patch, but I had to move
it forward.

So I think you're right we need to ensure we have standard stats for
each expression at least once, to make this work well.

> It feels like it should build a single extended stats object for each
> unique expression, with appropriate dependencies for any MCV stats
> that used those expressions, but I'm not sure how complex that would
> be. Dropping the last MCV stat object using a standard expression stat
> object might reasonably drop the expression stats ... except if they
> were explicitly created by the user, independently of any MCV stats.
> That could get quite messy.
>

Possibly. But I don't think it's worth the extra complexity. I don't
expect people to have a lot of overlapping stats, so the amount of
wasted space and CPU time is expected to be fairly limited.

So I don't think it's worth spending too much time on this now. Let's
just do what you proposed, and revisit this later if needed.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply | Threaded
Open this post in threaded view
|

Re: PoC/WIP: Extended statistics on expressions

Dean Rasheed-3
On Tue, 8 Dec 2020 at 12:44, Tomas Vondra <[hidden email]> wrote:
>
> Possibly. But I don't think it's worth the extra complexity. I don't
> expect people to have a lot of overlapping stats, so the amount of
> wasted space and CPU time is expected to be fairly limited.
>
> So I don't think it's worth spending too much time on this now. Let's
> just do what you proposed, and revisit this later if needed.
>

Yes, I think that's a reasonable approach to take. As long as the
documentation makes it clear that building MCV stats also causes
standard expression stats to be built on any expressions included in
the list, then the user will know and can avoid duplication most of
the time. I don't think there's any need for code to try to prevent
that -- just as we don't bother with code to prevent a user building
multiple indexes on the same column.

The only case where duplication won't be avoidable is where there are
multiple MCV stats sharing the same expression, but that's probably
quite unlikely in practice, and it seems acceptable to leave improving
that as a possible future optimisation.

Regards,
Dean


Reply | Threaded
Open this post in threaded view
|

Re: PoC/WIP: Extended statistics on expressions

Tomas Vondra-6
On 12/11/20 1:58 PM, Dean Rasheed wrote:

> On Tue, 8 Dec 2020 at 12:44, Tomas Vondra <[hidden email]> wrote:
>>
>> Possibly. But I don't think it's worth the extra complexity. I don't
>> expect people to have a lot of overlapping stats, so the amount of
>> wasted space and CPU time is expected to be fairly limited.
>>
>> So I don't think it's worth spending too much time on this now. Let's
>> just do what you proposed, and revisit this later if needed.
>>
>
> Yes, I think that's a reasonable approach to take. As long as the
> documentation makes it clear that building MCV stats also causes
> standard expression stats to be built on any expressions included in
> the list, then the user will know and can avoid duplication most of
> the time. I don't think there's any need for code to try to prevent
> that -- just as we don't bother with code to prevent a user building
> multiple indexes on the same column.
>
> The only case where duplication won't be avoidable is where there are
> multiple MCV stats sharing the same expression, but that's probably
> quite unlikely in practice, and it seems acceptable to leave improving
> that as a possible future optimisation.
>
OK. Attached is an updated version, reworking it this way.

I tried tweaking the grammar to differentiate these two syntax variants,
but that led to shift/reduce conflicts with the existing ones. I tried
fixing that, but I ended up doing that in CreateStatistics().

The other thing is that we probably can't tie this to just MCV, because
functional dependencies need the per-expression stats too. So I simply
build expression stats whenever there's at least one expression.

I also decided to keep the "expressions" statistics kind - it's not
allowed to specify it in CREATE STATISTICS, but it's useful internally
as it allows deciding whether to build the stats in a single place.
Otherwise we'd need to do that every time we build the statistics, etc.

I added a brief explanation to the sgml docs, not sure if that's good
enough - maybe it needs more details.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

0001-bootstrap-convert-Typ-to-a-List-20201211.patch (3K) Download Attachment
0002-Allow-composite-types-in-bootstrap-20201211.patch (1K) Download Attachment
0003-Extended-statistics-on-expressions-20201211.patch (241K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: PoC/WIP: Extended statistics on expressions

Dean Rasheed-3
On Fri, 11 Dec 2020 at 20:17, Tomas Vondra
<[hidden email]> wrote:
>
> OK. Attached is an updated version, reworking it this way.

Cool. I think this is an exciting development, so I hope it makes it
into the next release.

I have started looking at it. So far I have only looked at the
catalog, parser and client changes, but I thought it's worth posting
my comments so far.

> I tried tweaking the grammar to differentiate these two syntax variants,
> but that led to shift/reduce conflicts with the existing ones. I tried
> fixing that, but I ended up doing that in CreateStatistics().

Yeah, that makes sense. I wasn't expecting the grammar to change.

> The other thing is that we probably can't tie this to just MCV, because
> functional dependencies need the per-expression stats too. So I simply
> build expression stats whenever there's at least one expression.

Makes sense.

> I also decided to keep the "expressions" statistics kind - it's not
> allowed to specify it in CREATE STATISTICS, but it's useful internally
> as it allows deciding whether to build the stats in a single place.
> Otherwise we'd need to do that every time we build the statistics, etc.

Yes, I thought that would be the easiest way to do it. Essentially the
"expressions" stats kind is an internal implementation detail, hidden
from the user, because it's built automatically when required, so you
don't need to (and can't) explicitly ask for it. This new behaviour
seems much more logical to me.

> I added a brief explanation to the sgml docs, not sure if that's good
> enough - maybe it needs more details.

Yes, I think that could use a little tidying up, but I haven't looked
too closely yet.


Some other comments:

* I'm not sure I understand the need for 0001. Wasn't there an earlier
version of this patch that just did it by re-populating the type
array, but which still had it as an array rather than turning it into
a list? Making it a list falsifies some of the comments and
function/variable name choices in that file.

* There's a comment typo in catalog/Makefile -- "are are reputedly
other...", should be "there are reputedly other...".

* Looking at the pg_stats_ext view, I think perhaps expressions stats
should be omitted entirely from that view, since it doesn't show any
useful information about them. So it could remove "e" from the "kinds"
array, and exclude rows whose only kind is "e", since such rows have
no interesting data in them. Essentially, the new view
pg_stats_ext_exprs makes having any expression stats in pg_stats_ext
redundant. Hiding this data in pg_stats_ext would also be consistent
with making the "expressions" stats kind hidden from the user.

* In gram.y, it wasn't quite obvious why you converted the column list
for CREATE STATISTICS from an expr_list to a stats_params list. I
figured it out, and it makes sense, but I think it could use a
comment, perhaps something along the lines of the one for index_elem,
e.g.:

/*
 * Statistics attributes can be either simple column references, or arbitrary
 * expressions in parens.  For compatibility with index attributes permitted
 * in CREATE INDEX, we allow an expression that's just a function call to be
 * written without parens.
 */

* In parse_func.c and parse_agg.c, there are a few new error strings
that use the abbreviation "stats expressions", whereas most other
errors refer to "statistics expressions". For consistency, I think
they should all be the latter.

* In generateClonedExtStatsStmt(), I think the "expressions" stats
kind needs to be explicitly excluded, otherwise CREATE TABLE (LIKE
...) fails if the source table has expression stats.

* CreateStatistics() uses ShareUpdateExclusiveLock, but in
tcop/utility.c the relation is opened with a ShareLock. ISTM that the
latter lock mode should be made to match CreateStatistics().

* Why does the new code in tcop/utility.c not use
RangeVarGetRelidExtended together with RangeVarCallbackOwnsRelation?
That seems preferable to doing the ACL check in CreateStatistics().
For one thing, as it stands, it allows the lock to be taken even if
the user doesn't own the table. Is it intentional that the current
code allows extended stats to be created on system catalogs? That
would be one thing that using RangeVarCallbackOwnsRelation would
change, but I can't see a reason to allow it.

* In src/bin/psql/describe.c, I think the \d output should also
exclude the "expressions" stats kind and just list the other kinds (or
have no kinds list at all, if there are no other kinds), to make it
consistent with the CREATE STATISTICS syntax.

* The pg_dump output for a stats object whose only kind is
"expressions" is broken -- it includes a spurious "()" for the kinds
list.

That's it for now. I'll look at the optimiser changes next, and try to
post more comments later this week.

Regards,
Dean


Reply | Threaded
Open this post in threaded view
|

Re: PoC/WIP: Extended statistics on expressions

Justin Pryzby
On Mon, Jan 04, 2021 at 03:34:08PM +0000, Dean Rasheed wrote:
> * I'm not sure I understand the need for 0001. Wasn't there an earlier
> version of this patch that just did it by re-populating the type
> array, but which still had it as an array rather than turning it into
> a list? Making it a list falsifies some of the comments and
> function/variable name choices in that file.

This part is from me.

I can review the names if it's desired , but it'd be fine to fall back to the
earlier patch.  I thought a pglist was cleaner, but it's not needed.

--
Justin


Reply | Threaded
Open this post in threaded view
|

Re: PoC/WIP: Extended statistics on expressions

Tomas Vondra-6
In reply to this post by Dean Rasheed-3
On 1/4/21 4:34 PM, Dean Rasheed wrote:

>
> ...
>
> Some other comments:
>
> * I'm not sure I understand the need for 0001. Wasn't there an earlier
> version of this patch that just did it by re-populating the type
> array, but which still had it as an array rather than turning it into
> a list? Making it a list falsifies some of the comments and
> function/variable name choices in that file.
>

That's a bit done to Justin - I think it's fine to use the older version
repopulating the type array, but that question is somewhat unrelated to
this patch.

> * There's a comment typo in catalog/Makefile -- "are are reputedly
> other...", should be "there are reputedly other...".
>
> * Looking at the pg_stats_ext view, I think perhaps expressions stats
> should be omitted entirely from that view, since it doesn't show any
> useful information about them. So it could remove "e" from the "kinds"
> array, and exclude rows whose only kind is "e", since such rows have
> no interesting data in them. Essentially, the new view
> pg_stats_ext_exprs makes having any expression stats in pg_stats_ext
> redundant. Hiding this data in pg_stats_ext would also be consistent
> with making the "expressions" stats kind hidden from the user.
>

Hmmm, not sure. I'm not sure removing 'e' from the array is a good idea.
On the one hand it's internal detail, on the other hand most of that
view is internal detail too. Excluding rows with only 'e' seems
reasonable, though. I need to think about this.

> * In gram.y, it wasn't quite obvious why you converted the column list
> for CREATE STATISTICS from an expr_list to a stats_params list. I
> figured it out, and it makes sense, but I think it could use a
> comment, perhaps something along the lines of the one for index_elem,
> e.g.:
>
> /*
>   * Statistics attributes can be either simple column references, or arbitrary
>   * expressions in parens.  For compatibility with index attributes permitted
>   * in CREATE INDEX, we allow an expression that's just a function call to be
>   * written without parens.
>   */
>

OH, right. I'd have trouble figuring this myself, and I wrote that code
myself only one or two months ago.

> * In parse_func.c and parse_agg.c, there are a few new error strings
> that use the abbreviation "stats expressions", whereas most other
> errors refer to "statistics expressions". For consistency, I think
> they should all be the latter.
>

OK, will fix.

> * In generateClonedExtStatsStmt(), I think the "expressions" stats
> kind needs to be explicitly excluded, otherwise CREATE TABLE (LIKE
> ...) fails if the source table has expression stats.
>

Yeah, will fix. I guess this also means we're missing some tests.

> * CreateStatistics() uses ShareUpdateExclusiveLock, but in
> tcop/utility.c the relation is opened with a ShareLock. ISTM that the
> latter lock mode should be made to match CreateStatistics().
>

Not sure, will check.

> * Why does the new code in tcop/utility.c not use
> RangeVarGetRelidExtended together with RangeVarCallbackOwnsRelation?
> That seems preferable to doing the ACL check in CreateStatistics().
> For one thing, as it stands, it allows the lock to be taken even if
> the user doesn't own the table. Is it intentional that the current
> code allows extended stats to be created on system catalogs? That
> would be one thing that using RangeVarCallbackOwnsRelation would
> change, but I can't see a reason to allow it.
>

I think I copied the code from somewhere - probably expression indexes,
or something like that. Not a proof that it's the right/better way to do
this, though.

> * In src/bin/psql/describe.c, I think the \d output should also
> exclude the "expressions" stats kind and just list the other kinds (or
> have no kinds list at all, if there are no other kinds), to make it
> consistent with the CREATE STATISTICS syntax.
>

Not sure I understand. Why would this make it consistent with CREATE
STATISTICS? Can you elaborate?

> * The pg_dump output for a stats object whose only kind is
> "expressions" is broken -- it includes a spurious "()" for the kinds
> list.
>

Will fix. Again, this suggests there are TAP tests missing.

> That's it for now. I'll look at the optimiser changes next, and try to
> post more comments later this week.
>

Thanks!


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply | Threaded
Open this post in threaded view
|

Re: PoC/WIP: Extended statistics on expressions

Dean Rasheed-3
On Tue, 5 Jan 2021 at 00:45, Tomas Vondra <[hidden email]> wrote:

>
> On 1/4/21 4:34 PM, Dean Rasheed wrote:
> >
> > * In src/bin/psql/describe.c, I think the \d output should also
> > exclude the "expressions" stats kind and just list the other kinds (or
> > have no kinds list at all, if there are no other kinds), to make it
> > consistent with the CREATE STATISTICS syntax.
>
> Not sure I understand. Why would this make it consistent with CREATE
> STATISTICS? Can you elaborate?
>

This isn't absolutely essential, but I think it would be neater. For
example, if I have a table with stats like this:

CREATE TABLE foo(a int, b int);
CREATE STATISTICS foo_s_ab (mcv) ON a,b FROM foo;

then the \d output is as follows:

\d foo
                Table "public.foo"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
Statistics objects:
    "public"."foo_s_ab" (mcv) ON a, b FROM foo

and the stats line matches the DDL used to create the stats. It could,
for example, be copy-pasted and tweaked to create similar stats on
another table, but even if that's not very likely, it's neat that it
reflects how the stats were created.

OTOH, if there are expressions in the list, it produces something like this:

                Table "public.foo"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
Statistics objects:
    "public"."foo_s_ab" (mcv, expressions) ON a, b, ((a * b)) FROM foo

which no longer matches the DDL used, and isn't part of an accepted
syntax, so seems a bit inconsistent.

In general, if we're making the "expressions" kind an internal
implementation detail that just gets built automatically when needed,
then I think we should hide it from this sort of output, so the list
of kinds matches the list that the user used when the stats were
created.

Regards,
Dean


Reply | Threaded
Open this post in threaded view
|

Re: PoC/WIP: Extended statistics on expressions

Tomas Vondra-6


On 1/5/21 3:10 PM, Dean Rasheed wrote:

> On Tue, 5 Jan 2021 at 00:45, Tomas Vondra <[hidden email]> wrote:
>>
>> On 1/4/21 4:34 PM, Dean Rasheed wrote:
>>>
>>> * In src/bin/psql/describe.c, I think the \d output should also
>>> exclude the "expressions" stats kind and just list the other kinds (or
>>> have no kinds list at all, if there are no other kinds), to make it
>>> consistent with the CREATE STATISTICS syntax.
>>
>> Not sure I understand. Why would this make it consistent with CREATE
>> STATISTICS? Can you elaborate?
>>
>
> This isn't absolutely essential, but I think it would be neater. For
> example, if I have a table with stats like this:
>
> CREATE TABLE foo(a int, b int);
> CREATE STATISTICS foo_s_ab (mcv) ON a,b FROM foo;
>
> then the \d output is as follows:
>
> \d foo
>                  Table "public.foo"
>   Column |  Type   | Collation | Nullable | Default
> --------+---------+-----------+----------+---------
>   a      | integer |           |          |
>   b      | integer |           |          |
> Statistics objects:
>      "public"."foo_s_ab" (mcv) ON a, b FROM foo
>
> and the stats line matches the DDL used to create the stats. It could,
> for example, be copy-pasted and tweaked to create similar stats on
> another table, but even if that's not very likely, it's neat that it
> reflects how the stats were created.
>
> OTOH, if there are expressions in the list, it produces something like this:
>
>                  Table "public.foo"
>   Column |  Type   | Collation | Nullable | Default
> --------+---------+-----------+----------+---------
>   a      | integer |           |          |
>   b      | integer |           |          |
> Statistics objects:
>      "public"."foo_s_ab" (mcv, expressions) ON a, b, ((a * b)) FROM foo
>
> which no longer matches the DDL used, and isn't part of an accepted
> syntax, so seems a bit inconsistent.
>
> In general, if we're making the "expressions" kind an internal
> implementation detail that just gets built automatically when needed,
> then I think we should hide it from this sort of output, so the list
> of kinds matches the list that the user used when the stats were
> created.
>

Hmm, I see. You're probably right it's not necessary to show this, given
the modified handling of expression stats (which makes them an internal
detail, not exposed to users). I'll tweak this.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


123