Hi All,
I have a question about statistics hooks. I am trying to teach the planner that grouping by something like date_trunc('1 day', time) will produce far fewer rows than the number of distinct time values. I want to do that in an extension. The problem is that I don't see a way to make the get_relation_stats_hook work well for that, since by the time it's called you only see the `time` var and not the full expression. None of the other hooks seem appropriate either. So, 2 questions:

1) Would people be opposed to adding a code hook somewhere at the start of `examine_variable` (selfuncs.c) to allow creating statistics on complete expressions? I can submit a patch if this seems reasonable.

2) Do patches that add code hooks (and are probably under 10 lines) need to go through the entire commitfest process? I guess what I am really asking is whether PG12 would be the first version such a patch could appear in, or is PG11 still a possibility? Just wondering what the policy on such stuff is.

Thanks,
Mat
TimescaleDB
2018-03-12 14:03 GMT-03:00 Mat Arye <[hidden email]>:
> I have a question about statistics hooks. I am trying to teach the planner
> that grouping by something like date_trunc('1 day', time) will produce
> far fewer rows than the number of distinct time values. I want to do that
> in an extension. The problem is that I don't see a way to make the
> get_relation_stats_hook work well for that since by the time it's called you
> only see the `time` var and not the full expression. None of the other hooks
> seem appropriate either. So 2 questions:
>
Isn't it the case to extend the available hook?

> 1) Would people be opposed to adding a code hook somewhere at the start of
> `examine_variable` (selfuncs.c) to allow creating statistics on complete
> expressions? I can submit a patch if this seems reasonable.
>
If you explain the use case maybe it could be considered.

> 2) Do patches that add code hooks (and are probably under 10 lines) need to
> go through the entire commitfest process? I guess what I am really asking is
> if PG12 would be the first version such a patch could appear in or is PG11
> still a possibility? Just wondering what the policy on such stuff is.
>
If it is a new feature and is not in the last CF, it won't be considered for v11 (even small patches).

--
Euler Taveira
Timbira - http://www.timbira.com.br/
PostgreSQL: Consulting, Development, 24x7 Support and Training
On Mon, Mar 12, 2018 at 2:52 PM, Euler Taveira <[hidden email]> wrote:
> 2018-03-12 14:03 GMT-03:00 Mat Arye <[hidden email]>:

So the use case is an analytical query like

SELECT date_trunc('hour', time) AS MetricMinuteTs, AVG(value) as avg
FROM hyper
WHERE time >= '2001-01-04T00:00:00' AND time <= '2001-01-05T01:00:00'
GROUP BY MetricMinuteTs
ORDER BY MetricMinuteTs DESC;

Right now this query will choose a much-less-efficient GroupAggregate plan instead of a HashAggregate. It will choose this because it thinks the number of groups produced here is 9,000,000, because that's the number of distinct time values there are. But, because date_trunc "buckets" the values, there will be about 24 groups (1 for each hour).
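The misestimate Mat describes is simple arithmetic; a small illustrative sketch (not planner code — the 9,000,000 figure and the time range come from the query above):

```python
from datetime import datetime, timedelta

# Planner's current assumption: one group per distinct "time" value.
n_distinct_time = 9_000_000

# What date_trunc('hour', time) actually produces: one bucket per hour
# of the queried 25-hour range.
lo = datetime(2001, 1, 4, 0, 0, 0)
hi = datetime(2001, 1, 5, 1, 0, 0)
n_buckets = int((hi - lo) / timedelta(hours=1))

print(n_distinct_time, n_buckets)  # 9000000 vs. 25
```

With a group estimate that large the planner prefers GroupAggregate; with the true couple dozen groups, HashAggregate wins easily.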

Mat Arye <[hidden email]> writes:
> So the use case is an analytical query like
>
> SELECT date_trunc('hour', time) AS MetricMinuteTs, AVG(value) as avg
> FROM hyper
> WHERE time >= '2001-01-04T00:00:00' AND time <= '2001-01-05T01:00:00'
> GROUP BY MetricMinuteTs
> ORDER BY MetricMinuteTs DESC;
>
> Right now this query will choose a much-less-efficient GroupAggregate plan
> instead of a HashAggregate. It will choose this because it thinks the
> number of groups produced here is 9,000,000 because that's the number of
> distinct time values there are. But, because date_trunc "buckets" the
> values there will be about 24 groups (1 for each hour).

While it would certainly be nice to have better behavior for that, "add a hook so users who can write C can fix it by hand" doesn't seem like a great solution. On top of the sheer difficulty of writing a hook function, you'd have the problem that no pre-written hook could know about all available functions. I think somehow we'd need a way to add per-function knowledge, perhaps roughly like the protransform feature.

			regards, tom lane
On 13 March 2018 at 11:44, Tom Lane <[hidden email]> wrote:
> While it would certainly be nice to have better behavior for that,
> "add a hook so users who can write C can fix it by hand" doesn't seem
> like a great solution. On top of the sheer difficulty of writing a
> hook function, you'd have the problem that no pre-written hook could
> know about all available functions. I think somehow we'd need a way
> to add per-function knowledge, perhaps roughly like the protransform
> feature.

I always imagined that extended statistics could be used for this. Right now the estimates are much better when you create an index on the function, but there's no real reason to limit the stats that are gathered to just plain columns + expression indexes.

I believe I'm not the only person to have considered this. Originally extended statistics were named multivariate statistics. I think it was Dean and I (maybe others too) who suggested to Tomas to give the feature a more generic name so that it can be used for a more general purpose later.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
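The expression-index route David mentions is available today; a minimal sketch of that workaround (hypothetical index name, using the table from Mat's query):

```sql
-- ANALYZE collects statistics on the indexed expression itself,
-- so the planner can estimate the number of date_trunc groups.
CREATE INDEX hyper_time_hour_idx ON hyper (date_trunc('hour', "time"));
ANALYZE hyper;
```

The cost is maintaining an otherwise-unneeded index, which is exactly the overhead extended statistics on expressions would avoid.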
On Tue, Mar 13, 2018 at 4:14 AM, Tom Lane <[hidden email]> wrote:
> Mat Arye <[hidden email]> writes:
>> So the use case is an analytical query like
>>
>> SELECT date_trunc('hour', time) AS MetricMinuteTs, AVG(value) as avg
>> FROM hyper
>> WHERE time >= '2001-01-04T00:00:00' AND time <= '2001-01-05T01:00:00'
>> GROUP BY MetricMinuteTs
>> ORDER BY MetricMinuteTs DESC;
>>
>> Right now this query will choose a much-less-efficient GroupAggregate plan
>> instead of a HashAggregate. It will choose this because it thinks the
>> number of groups produced here is 9,000,000 because that's the number of
>> distinct time values there are. But, because date_trunc "buckets" the
>> values there will be about 24 groups (1 for each hour).
>
> While it would certainly be nice to have better behavior for that,
> "add a hook so users who can write C can fix it by hand" doesn't seem
> like a great solution. On top of the sheer difficulty of writing a
> hook function, you'd have the problem that no pre-written hook could
> know about all available functions. I think somehow we'd need a way
> to add per-function knowledge, perhaps roughly like the protransform
> feature.

Like the cost associated with a function, we may associate a mapping cardinality with a function. It tells how many distinct input values map to 1 output value. By input value, I mean the input argument tuple. In Mat's case the mapping cardinality will be 12. The number of distinct values that the function may output is estimated as the number of estimated rows / mapping cardinality of that function.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
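Ashutosh's proposal can be paraphrased in a few lines (illustrative only; the names are invented here, and the "12" is the mapping cardinality he cites for Mat's case):

```python
def estimate_output_ndistinct(estimated_rows: float,
                              mapping_cardinality: float) -> float:
    """Distinct outputs = estimated rows / number of distinct
    inputs that collapse onto a single output value."""
    return estimated_rows / mapping_cardinality

# e.g. 9,000,000 input rows with a mapping cardinality of 12:
print(estimate_output_ndistinct(9_000_000, 12))
```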
On Tue, Mar 13, 2018 at 6:56 AM, Ashutosh Bapat <[hidden email]> wrote:
> On Tue, Mar 13, 2018 at 4:14 AM, Tom Lane <[hidden email]> wrote:

I think this is complicated by the fact that the mapping cardinality is not a constant per function but depends on the constant given as the first argument to the function and the granularity of the underlying data (do you have second-granularity or microsecond granularity). I actually think the logic for the estimate here should be (max(time)-min(time))/interval.

On Tue, Mar 13, 2018 at 6:31 AM, David Rowley <[hidden email]> wrote:
> On 13 March 2018 at 11:44, Tom Lane <[hidden email]> wrote:

I think this isn't either-or. I think a general hook can be useful for extensions that want to optimize particular data distributions/workloads using domain knowledge about functions common for those workloads. That way users working with that data can use extensions to optimize workloads without writing C themselves. I also think a protransform-like feature would add a lot of power to the native planner, but this could take a while to get into core properly and may not handle all kinds of data distributions/cases. An example of a case a protransform-type system would not be able to optimize is mathematical operator expressions like bucketing integers by decile: (integer / 10) * 10. This is somewhat analogous to date_trunc in the integer space and would also change the number of resulting distinct rows.

I also think that the point about extended statistics is a good one and points to the need for more experimentation/experience, which I think a C hook is better suited for. Putting in a hook will allow extension writers like us to experiment and figure out the kinds of transforms on statistics that are useful, while having a small footprint on the core. I think designing a protransform-like system would benefit from more experience with the kinds of transformations that are useful. For example, can anything be done if the interval passed to date_trunc is not constant, or is it not even worth bothering with that case? Maybe extended statistics is a better approach, etc.
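The integer-bucketing case is easy to demonstrate outside the planner (plain Python, purely illustrative):

```python
values = range(1000)                        # 1000 distinct integers
buckets = {(v // 10) * 10 for v in values}  # (integer / 10) * 10, as above

# 1000 distinct inputs collapse to 100 distinct bucket values, so a
# per-function rule would need to scale ndistinct down by ~10x here.
print(len(set(values)), len(buckets))
```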

Mat Arye <[hidden email]> writes:
> An example of a case a protransform type system would not be able to
> optimize is mathematical operator expressions like bucketing integers by
> decile: (integer / 10) * 10.

Uh, why not? An estimation function that is specific to integer divide shouldn't have much trouble figuring out that x/10 has one-tenth as many distinct values as x does. I'd certainly rather have that knowledge associated directly with int4div, and the corresponding knowledge about date_trunc associated with that function, and similar knowledge about extension-provided operators provided by the extensions, than try to maintain a hook function that embeds all such knowledge.

> I also think that the point with extended statistics is a good one and
> points to the need for more experimentation/experience which I think
> a C hook is better suited for. Putting in a hook will allow extension
> writers like us to experiment and figure out the kinds of transforms on
> statistics that are useful while having a small footprint on the core.

If you're experimenting you might as well just change the source code. A hook is only useful if you're trying to ship something for production, and I doubt that factorizing things this way is a credible production solution.

			regards, tom lane
On Tue, Mar 13, 2018 at 8:55 PM, Mat Arye <[hidden email]> wrote:
>> Like cost associated with a function, we may associate mapping
>> cardinality with a function. It tells how many distinct input values
>> map to 1 output value. By input value, I mean input argument tuple. In
>> Mat's case the mapping cardinality will be 12. The number of distinct
>> values that function may output is estimated as number of estimated
>> rows / mapping cardinality of that function.
>
> I think this is complicated by the fact that the mapping cardinality is not
> a constant per function but depends on the constant given as the first
> argument to the function and the granularity of the underlying data (do you
> have second-granularity or microsecond granularity). I actually think the
> logic for the estimate here should be (max(time)-min(time))/interval. I
> think to be general you need to allow functions on statistics to determine
> the estimate.

I think my solution was quite short-sighted. You are right. We need a function that takes statistics about the input argument as input and outputs statistics about the output. The planner can then use these statistics to arrive at various estimates.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
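The "statistics in, statistics out" idea might look roughly like this for date_trunc (a sketch with invented names; a real version would be a C function consuming pg_statistic data):

```python
from datetime import datetime, timedelta

def date_trunc_ndistinct(min_val: datetime, max_val: datetime,
                         interval: timedelta) -> int:
    """Estimate distinct date_trunc outputs over [min_val, max_val]
    as (max - min) / interval, per Mat's suggestion above."""
    return max(1, int((max_val - min_val) / interval))

# The WHERE-clause range from the earlier query, with hourly buckets:
est = date_trunc_ndistinct(datetime(2001, 1, 4, 0),
                           datetime(2001, 1, 5, 1),
                           timedelta(hours=1))
print(est)
```

The planner would then compare this against the plain ndistinct of the column and take the smaller value.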
On 03/15/2018 06:00 AM, Ashutosh Bapat wrote:
> On Tue, Mar 13, 2018 at 8:55 PM, Mat Arye <[hidden email]> wrote:
>> I think this is complicated by the fact that the mapping cardinality is not
>> a constant per function but depends on the constant given as the first
>> argument to the function and the granularity of the underlying data (do you
>> have second-granularity or microsecond granularity). I actually think the
>> logic for the estimate here should be (max(time)-min(time))/interval. I
>> think to be general you need to allow functions on statistics to determine
>> the estimate.
>
> I think my solution was quite short-sighted. You are right. We need a
> function taking statistics about the input argument as input and
> output the statistics about the output. The planner can then use this
> statistics to arrive at various estimates.

I think the best solution is to extend CREATE STATISTICS so that it handles things like

    CREATE STATISTICS s ON date_trunc('day', column) FROM table

As David mentioned elsewhere in this thread, this was considered before CREATE STATISTICS was introduced in PG11, and it's why the feature is called 'extended' and not 'multivariate'. It would give us the same stats as we have for expression indexes, but without the extra overhead.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Mar 15, 2018 at 7:59 PM, Tomas Vondra
<[hidden email]> wrote:
>> I think my solution was quite short-sighted. You are right. We need a
>> function taking statistics about the input argument as input and
>> output the statistics about the output. The planner can then use this
>> statistics to arrive at various estimates.
>
> I think the best solution is to extend CREATE STATISTICS so that it
> handles things like
>
>     CREATE STATISTICS s ON date_trunc('day', column) FROM table

I think we need both kinds of solution here. My proposal would work for commonly used functions like int4div(). There is no point in creating statistics for every expression where int4div() is used; there will be many queries which use this function, and it's hard to spot since it's invoked for the '/' operator. Also there will be many tables with which this function is used, and maintaining statistics for all those tables would eat a lot of space.

But date_trunc() will not be used that widely, and thus crafting statistics for that function per table where it's used will work.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company