Multivariate MCV list vs. statistics target

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
20 messages Options
Reply | Threaded
Open this post in threaded view
|

Multivariate MCV list vs. statistics target

Tomas Vondra-4
Hi,

One slightly inconvenient thing I realized while playing with the
address data set is that it's somewhat difficult to set the desired size
of the multi-column MCV list.

At the moment, we simply use the maximum statistic target for attributes
the MCV list is built on. But that does not allow keeping default size
for per-column stats, and only increase size of multi-column MCV lists.

So I'm thinking we should allow tweaking the statistics for extended
stats, and serialize it in the pg_statistic_ext catalog. Any opinions
why that would be a bad idea?

I suppose it should be part of the CREATE STATISTICS command, but I'm
not sure what'd be the best syntax. We might also have something more
similar to ALTER COLUMNT, but perhaps

     ALTER STATISTICS s SET STATISTICS 1000;

looks a bit too weird.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: Multivariate MCV list vs. statistics target

Dean Rasheed-3
On Tue, 18 Jun 2019 at 22:34, Tomas Vondra <[hidden email]> wrote:

>
> One slightly inconvenient thing I realized while playing with the
> address data set is that it's somewhat difficult to set the desired size
> of the multi-column MCV list.
>
> At the moment, we simply use the maximum statistic target for attributes
> the MCV list is built on. But that does not allow keeping default size
> for per-column stats, and only increase size of multi-column MCV lists.
>
> So I'm thinking we should allow tweaking the statistics for extended
> stats, and serialize it in the pg_statistic_ext catalog. Any opinions
> why that would be a bad idea?
>

Seems reasonable to me. This might not be the only option we'll ever
want to add though, so perhaps a "stxoptions text[]" column along the
lines of a relation's reloptions would be the way to go.

> I suppose it should be part of the CREATE STATISTICS command, but I'm
> not sure what'd be the best syntax. We might also have something more
> similar to ALTER COLUMNT, but perhaps
>
>      ALTER STATISTICS s SET STATISTICS 1000;
>
> looks a bit too weird.
>

Yes it does look a bit weird, but that's the natural generalisation of
what we have for per-column statistics, so it's probably preferable to
do that rather than invent some other syntax that wouldn't be so
consistent.

Regards,
Dean


Reply | Threaded
Open this post in threaded view
|

Re: Multivariate MCV list vs. statistics target

Tomas Vondra-4
On Thu, Jun 20, 2019 at 08:08:44AM +0100, Dean Rasheed wrote:

>On Tue, 18 Jun 2019 at 22:34, Tomas Vondra <[hidden email]> wrote:
>>
>> One slightly inconvenient thing I realized while playing with the
>> address data set is that it's somewhat difficult to set the desired size
>> of the multi-column MCV list.
>>
>> At the moment, we simply use the maximum statistic target for attributes
>> the MCV list is built on. But that does not allow keeping default size
>> for per-column stats, and only increase size of multi-column MCV lists.
>>
>> So I'm thinking we should allow tweaking the statistics for extended
>> stats, and serialize it in the pg_statistic_ext catalog. Any opinions
>> why that would be a bad idea?
>>
>
>Seems reasonable to me. This might not be the only option we'll ever
>want to add though, so perhaps a "stxoptions text[]" column along the
>lines of a relation's reloptions would be the way to go.
>

I don't know - I kinda dislike the idea of stashing stuff like this into
text[] arrays unless there's a clear need for such flexibility (i.e.
vision to have more such options). Which I'm not sure is the case here.
And we kinda have a precedent in pg_attribute.attstattarget, so I'd use
the same approach here.

>> I suppose it should be part of the CREATE STATISTICS command, but I'm
>> not sure what'd be the best syntax. We might also have something more
>> similar to ALTER COLUMNT, but perhaps
>>
>>      ALTER STATISTICS s SET STATISTICS 1000;
>>
>> looks a bit too weird.
>>
>
>Yes it does look a bit weird, but that's the natural generalisation of
>what we have for per-column statistics, so it's probably preferable to
>do that rather than invent some other syntax that wouldn't be so
>consistent.
>

Yeah, I agree.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Reply | Threaded
Open this post in threaded view
|

Re: Multivariate MCV list vs. statistics target

Dean Rasheed-3
On Thu, 20 Jun 2019 at 23:12, Tomas Vondra <[hidden email]> wrote:

>
> On Thu, Jun 20, 2019 at 08:08:44AM +0100, Dean Rasheed wrote:
> >On Tue, 18 Jun 2019 at 22:34, Tomas Vondra <[hidden email]> wrote:
> >>
> >> So I'm thinking we should allow tweaking the statistics for extended
> >> stats, and serialize it in the pg_statistic_ext catalog. Any opinions
> >> why that would be a bad idea?
> >
> >Seems reasonable to me. This might not be the only option we'll ever
> >want to add though, so perhaps a "stxoptions text[]" column along the
> >lines of a relation's reloptions would be the way to go.
>
> I don't know - I kinda dislike the idea of stashing stuff like this into
> text[] arrays unless there's a clear need for such flexibility (i.e.
> vision to have more such options). Which I'm not sure is the case here.
> And we kinda have a precedent in pg_attribute.attstattarget, so I'd use
> the same approach here.
>

Hmm, maybe. I can certainly understand your dislike of using text[].
I'm not sure that we can confidently say that multivariate statistics
won't ever need additional tuning knobs, but I have no idea at the
moment what they might be, and nothing else has come up so far in all
the time spent considering MCV lists and histograms, so maybe this is
OK.

Regards,
Dean


Reply | Threaded
Open this post in threaded view
|

Re: Multivariate MCV list vs. statistics target

Tomas Vondra-4
On Fri, Jun 21, 2019 at 08:09:18AM +0100, Dean Rasheed wrote:

>On Thu, 20 Jun 2019 at 23:12, Tomas Vondra <[hidden email]> wrote:
>>
>> On Thu, Jun 20, 2019 at 08:08:44AM +0100, Dean Rasheed wrote:
>> >On Tue, 18 Jun 2019 at 22:34, Tomas Vondra <[hidden email]> wrote:
>> >>
>> >> So I'm thinking we should allow tweaking the statistics for extended
>> >> stats, and serialize it in the pg_statistic_ext catalog. Any opinions
>> >> why that would be a bad idea?
>> >
>> >Seems reasonable to me. This might not be the only option we'll ever
>> >want to add though, so perhaps a "stxoptions text[]" column along the
>> >lines of a relation's reloptions would be the way to go.
>>
>> I don't know - I kinda dislike the idea of stashing stuff like this into
>> text[] arrays unless there's a clear need for such flexibility (i.e.
>> vision to have more such options). Which I'm not sure is the case here.
>> And we kinda have a precedent in pg_attribute.attstattarget, so I'd use
>> the same approach here.
>>
>
>Hmm, maybe. I can certainly understand your dislike of using text[].
>I'm not sure that we can confidently say that multivariate statistics
>won't ever need additional tuning knobs, but I have no idea at the
>moment what they might be, and nothing else has come up so far in all
>the time spent considering MCV lists and histograms, so maybe this is
>OK.
>
OK, attached is a patch implementing this - it adds

    ALTER STATISTICS ... SET STATISTICS ...

modifying a new stxstattarget column in pg_statistic_ext catalog,
following the same logic as pg_attribute.attstattarget.

During analyze, the per-ext-statistic value is determined like this:

1) When pg_statistic_ext.stxstattarget != (-1), we just use this value
and we're done.

2) Otherwise we inspect per-column attstattarget values, and use the
largest value. This is what we do now, so it's backwards-compatible
behavior.

3) If the value is still (-1), we use default_statistics_target.



regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

alter-statistics-v1.patch (22K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Multivariate MCV list vs. statistics target

David Rowley-3
In reply to this post by Dean Rasheed-3
On Fri, 21 Jun 2019 at 19:09, Dean Rasheed <[hidden email]> wrote:
> Hmm, maybe. I can certainly understand your dislike of using text[].
> I'm not sure that we can confidently say that multivariate statistics
> won't ever need additional tuning knobs, but I have no idea at the
> moment what they might be, and nothing else has come up so far in all
> the time spent considering MCV lists and histograms, so maybe this is
> OK.

I agree with having the stxstattarget column. Even if something did
come up in the future, then we could consider merging the
stxstattarget column with a new text[] column at that time.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: Multivariate MCV list vs. statistics target

Tomas Vondra-4
In reply to this post by Tomas Vondra-4
Hi,

apparently v1 of the ALTER STATISTICS patch was a bit confused about
length of the VacAttrStats array passed to statext_compute_stattarget,
causing segfaults. Attached v2 patch fixes that, and it also makes sure
we print warnings about ignored statistics only once.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

alter-statistics-v2.patch (22K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Multivariate MCV list vs. statistics target

Tomas Vondra-4
On Sun, Jul 07, 2019 at 12:02:38AM +0200, Tomas Vondra wrote:
>Hi,
>
>apparently v1 of the ALTER STATISTICS patch was a bit confused about
>length of the VacAttrStats array passed to statext_compute_stattarget,
>causing segfaults. Attached v2 patch fixes that, and it also makes sure
>we print warnings about ignored statistics only once.
>

v3 of the patch, adding pg_dump support - it works just like when you
tweak statistics target for a column, for example. When the value stored
in the catalog is not -1, pg_dump emits a separate ALTER STATISTICS
command setting it (for the already created statistics object).

I've considered making it part of CREATE STATISTICS itself, but it seems
a bit cumbersome and we don't do it for columns either. I'm not against
adding it in the future, but at this point I don't see a need.

At this point I'm not aware of any missing or broken pieces, so I'd
welcome feedback.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

alter-statistics-v3.patch (26K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

RE: Multivariate MCV list vs. statistics target

Jamison, Kirk
On Tuesday, July 9, 2019, Tomas Vondra wrote:

> >apparently v1 of the ALTER STATISTICS patch was a bit confused about
> >length of the VacAttrStats array passed to statext_compute_stattarget,
> >causing segfaults. Attached v2 patch fixes that, and it also makes sure
> >we print warnings about ignored statistics only once.
> >
>
> v3 of the patch, adding pg_dump support - it works just like when you tweak
> statistics target for a column, for example. When the value stored in the
> catalog is not -1, pg_dump emits a separate ALTER STATISTICS command setting
> it (for the already created statistics object).
>

Hi Tomas, I stumbled upon your patch.

According to the CF bot, your patch applies cleanly, builds successfully, and
passes make world. Meaning, the pg_dump tap test passed, but there was no
test for the new SET STATISTICS yet. So you might want to add a regression
test for that and integrate it in the existing alter_generic file.

Upon quick read-through, the syntax and docs are correct because it's similar
to the format of ALTER TABLE/INDEX... SET STATISTICS... :
    ALTER [ COLUMN ] column_name SET STATISTICS integer

+ /* XXX What if the target is set to 0? Reset the statistic? */

This also makes me wonder. I haven't looked deeply into the code, but since 0 is
a valid value, I believe it should reset the stats.
After lookup though, this is how it's tested in ALTER TABLE:
/test/regress/sql/stats_ext.sql:-- Ensure things work sanely with SET STATISTICS 0

> I've considered making it part of CREATE STATISTICS itself, but it seems a
> bit cumbersome and we don't do it for columns either. I'm not against adding
> it in the future, but at this point I don't see a need.

I agree. Perhaps that's for another patch should you decide to add it in the future.

Regards,
Kirk Jamison


Reply | Threaded
Open this post in threaded view
|

Re: Multivariate MCV list vs. statistics target

Tomas Vondra-4
On Fri, Jul 19, 2019 at 06:12:20AM +0000, Jamison, Kirk wrote:

>On Tuesday, July 9, 2019, Tomas Vondra wrote:
>> >apparently v1 of the ALTER STATISTICS patch was a bit confused about
>> >length of the VacAttrStats array passed to statext_compute_stattarget,
>> >causing segfaults. Attached v2 patch fixes that, and it also makes sure
>> >we print warnings about ignored statistics only once.
>> >
>>
>> v3 of the patch, adding pg_dump support - it works just like when you tweak
>> statistics target for a column, for example. When the value stored in the
>> catalog is not -1, pg_dump emits a separate ALTER STATISTICS command setting
>> it (for the already created statistics object).
>>
>
>Hi Tomas, I stumbled upon your patch.
>
>According to the CF bot, your patch applies cleanly, builds successfully, and
>passes make world. Meaning, the pg_dump tap test passed, but there was no
>test for the new SET STATISTICS yet. So you might want to add a regression
>test for that and integrate it in the existing alter_generic file.
>
>Upon quick read-through, the syntax and docs are correct because it's similar
>to the format of ALTER TABLE/INDEX... SET STATISTICS... :
>    ALTER [ COLUMN ] column_name SET STATISTICS integer
>
>+ /* XXX What if the target is set to 0? Reset the statistic? */
>
>This also makes me wonder. I haven't looked deeply into the code, but since 0 is
>a valid value, I believe it should reset the stats.
I agree resetting the stats after setting the target to 0 seems quite
reasonable. But that's not what we do for attribute stats, because in
that case we simply skip the attribute during the future ANALYZE runs -
we don't reset the stats, we keep the existing stats. So I've done the
same thing here, and I've removed the XXX comment.

If we want to change that, I'd do it in a separate patch for both the
regular and extended stats.

>After lookup though, this is how it's tested in ALTER TABLE:
>/test/regress/sql/stats_ext.sql:-- Ensure things work sanely with SET STATISTICS 0
>

Well, yeah. But that tests we skip building the extended statistic
(because we excluded the column from the ANALYZE run).

>> I've considered making it part of CREATE STATISTICS itself, but it seems a
>> bit cumbersome and we don't do it for columns either. I'm not against adding
>> it in the future, but at this point I don't see a need.
>
>I agree. Perhaps that's for another patch should you decide to add it in the future.
>

Right.

Attached is v4 of the patch, with a couple more improvements:

1) I've renamed the if_not_exists flag to missing_ok, because that's
more consistent with the "IF EXISTS" clause in the grammar (the old flag
was kinda the exact opposite), and I've added a NOTICE about the skip.

2) I've renamed ComputeExtStatsTarget to ComputeExtStatsRows, because
that's what the function was doing anyway (computing sample size).

3) I've added a couple of regression tests to stats_ext.sql

Aside from that, I've cleaned up a couple of places and improved a bunch
of comments. Nothing huge.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

alter-statistics-v4.patch (31K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

RE: Multivariate MCV list vs. statistics target

Jamison, Kirk
On Sat, July 20, 2019 8:12 AM (GMT+9), Tomas Vondra wrote:

> >+ /* XXX What if the target is set to 0? Reset the statistic?
> */
> >
> >This also makes me wonder. I haven't looked deeply into the code, but
> >since 0 is a valid value, I believe it should reset the stats.
>
> I agree resetting the stats after setting the target to 0 seems quite
> reasonable. But that's not what we do for attribute stats, because in that
> case we simply skip the attribute during the future ANALYZE runs - we don't
> reset the stats, we keep the existing stats. So I've done the same thing here,
> and I've removed the XXX comment.
>
> If we want to change that, I'd do it in a separate patch for both the regular
> and extended stats.

Hi, Tomas

Sorry for my late reply.
You're right. I have no strong opinion whether we'd want to change that behavior.
I've also confirmed the change in the patch where setting statistics target 0
skips the statistics.

Maybe only some minor nitpicks in the source code comments below:
1. "it's" should be "its":
> + * Compute statistic target, based on what's set for the statistic
> + * object itself, and for it's attributes.

2. Consistency whether you'd use either "statistic " or "statisticS ".
Ex. statistic target vs statisticS target, statistics object vs statistic object, etc.

> Attached is v4 of the patch, with a couple more improvements:
>
> 1) I've renamed the if_not_exists flag to missing_ok, because that's more
> consistent with the "IF EXISTS" clause in the grammar (the old flag was kinda
> the exact opposite), and I've added a NOTICE about the skip.

+ bool missing_ok;      /* do nothing if statistics does not exist */

Confirmed. So we ignore if statistic does not exist, and skip the error.
Maybe to make it consistent with other data structures in parsernodes.h,
you can change the comment of missing_ok to:
/* skip error if statistics object does not exist */

> 2) I've renamed ComputeExtStatsTarget to ComputeExtStatsRows, because that's
> what the function was doing anyway (computing sample size).
>
> 3) I've added a couple of regression tests to stats_ext.sql
>
> Aside from that, I've cleaned up a couple of places and improved a bunch of
> comments. Nothing huge.

I have a question though regarding ComputeExtStatisticsRows.
I'm just curious with the value 300 when computing sample size.
Where did this value come from?

+ /* compute sample size based on the statistic target */
+ return (300 * result);

Overall, the patch is almost already in good shape for commit.
I'll wait for the next update.

Regards,
Kirk Jamison


Reply | Threaded
Open this post in threaded view
|

Re: Multivariate MCV list vs. statistics target

Tomas Vondra-4
On Fri, Jul 26, 2019 at 07:03:41AM +0000, Jamison, Kirk wrote:

>On Sat, July 20, 2019 8:12 AM (GMT+9), Tomas Vondra wrote:
>
>> >+ /* XXX What if the target is set to 0? Reset the statistic?
>> */
>> >
>> >This also makes me wonder. I haven't looked deeply into the code, but
>> >since 0 is a valid value, I believe it should reset the stats.
>>
>> I agree resetting the stats after setting the target to 0 seems quite
>> reasonable. But that's not what we do for attribute stats, because in that
>> case we simply skip the attribute during the future ANALYZE runs - we don't
>> reset the stats, we keep the existing stats. So I've done the same thing here,
>> and I've removed the XXX comment.
>>
>> If we want to change that, I'd do it in a separate patch for both the regular
>> and extended stats.
>
>Hi, Tomas
>
>Sorry for my late reply.
>You're right. I have no strong opinion whether we'd want to change that behavior.
>I've also confirmed the change in the patch where setting statistics target 0
>skips the statistics.
>
OK, thanks.

>Maybe only some minor nitpicks in the source code comments below:
>1. "it's" should be "its":
>> + * Compute statistic target, based on what's set for the statistic
>> + * object itself, and for it's attributes.
>
>2. Consistency whether you'd use either "statistic " or "statisticS ".
>Ex. statistic target vs statisticS target, statistics object vs statistic object, etc.
>
>> Attached is v4 of the patch, with a couple more improvements:
>>
>> 1) I've renamed the if_not_exists flag to missing_ok, because that's more
>> consistent with the "IF EXISTS" clause in the grammar (the old flag was kinda
>> the exact opposite), and I've added a NOTICE about the skip.
>
>+ bool missing_ok;      /* do nothing if statistics does not exist */
>
>Confirmed. So we ignore if statistic does not exist, and skip the error.
>Maybe to make it consistent with other data structures in parsernodes.h,
>you can change the comment of missing_ok to:
>/* skip error if statistics object does not exist */
>
Thanks, I've fixed all those places in the attached v5.

>> 2) I've renamed ComputeExtStatsTarget to ComputeExtStatsRows, because that's
>> what the function was doing anyway (computing sample size).
>>
>> 3) I've added a couple of regression tests to stats_ext.sql
>>
>> Aside from that, I've cleaned up a couple of places and improved a bunch of
>> comments. Nothing huge.
>
>I have a question though regarding ComputeExtStatisticsRows.
>I'm just curious with the value 300 when computing sample size.
>Where did this value come from?
>
>+ /* compute sample size based on the statistic target */
>+ return (300 * result);
>
>Overall, the patch is almost already in good shape for commit.
>I'll wait for the next update.
>
That's how we compute number of rows to sample, based on the statistics
target. See std_typanalyze() in analyze.c, which also cites the paper
where this comes from.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

alter-statistics-v5.patch (31K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

RE: Multivariate MCV list vs. statistics target

Jamison, Kirk
On Saturday, July 27, 2019 7:06 AM(GMT+9), Tomas Vondra wrote:

> On Fri, Jul 26, 2019 at 07:03:41AM +0000, Jamison, Kirk wrote:
> >On Sat, July 20, 2019 8:12 AM (GMT+9), Tomas Vondra wrote:
> >
> >> >+ /* XXX What if the target is set to 0? Reset the statistic?
> >> */
> >> >
> >> >This also makes me wonder. I haven't looked deeply into the code,
> >> >but since 0 is a valid value, I believe it should reset the stats.
> >>
> >> I agree resetting the stats after setting the target to 0 seems quite
> >> reasonable. But that's not what we do for attribute stats, because in
> >> that case we simply skip the attribute during the future ANALYZE runs
> >> - we don't reset the stats, we keep the existing stats. So I've done
> >> the same thing here, and I've removed the XXX comment.
> >>
> >> If we want to change that, I'd do it in a separate patch for both the
> >> regular and extended stats.
> >
> >Hi, Tomas
> >
> >Sorry for my late reply.
> >You're right. I have no strong opinion whether we'd want to change that
> behavior.
> >I've also confirmed the change in the patch where setting statistics
> >target 0 skips the statistics.
> >
>
> OK, thanks.
>
> >Maybe only some minor nitpicks in the source code comments below:
> >1. "it's" should be "its":
> >> + * Compute statistic target, based on what's set for the
> statistic
> >> + * object itself, and for it's attributes.
> >
> >2. Consistency whether you'd use either "statistic " or "statisticS ".
> >Ex. statistic target vs statisticS target, statistics object vs statistic
> object, etc.
> >
> >> Attached is v4 of the patch, with a couple more improvements:
> >>
> >> 1) I've renamed the if_not_exists flag to missing_ok, because that's
> >> more consistent with the "IF EXISTS" clause in the grammar (the old
> >> flag was kinda the exact opposite), and I've added a NOTICE about the skip.
> >
> >+ bool missing_ok;      /* do nothing if statistics does
> not exist */
> >
> >Confirmed. So we ignore if statistic does not exist, and skip the error.
> >Maybe to make it consistent with other data structures in
> >parsernodes.h, you can change the comment of missing_ok to:
> >/* skip error if statistics object does not exist */
> >
>
> Thanks, I've fixed all those places in the attached v5.

I've confirmed the fix.

> >> 2) I've renamed ComputeExtStatsTarget to ComputeExtStatsRows, because
> >> that's what the function was doing anyway (computing sample size).
> >>
> >> 3) I've added a couple of regression tests to stats_ext.sql
> >>
> >> Aside from that, I've cleaned up a couple of places and improved a
> >> bunch of comments. Nothing huge.
> >
> >I have a question though regarding ComputeExtStatisticsRows.
> >I'm just curious with the value 300 when computing sample size.
> >Where did this value come from?
> >
> >+ /* compute sample size based on the statistic target */
> >+ return (300 * result);
> >
> >Overall, the patch is almost already in good shape for commit.
> >I'll wait for the next update.
> >
>
> That's how we compute number of rows to sample, based on the statistics target.
> See std_typanalyze() in analyze.c, which also cites the paper where this comes
> from.
Noted. Found it. Thank you for the reference.


There's just a small whitespace (extra space) below after running git diff --check.
>src/bin/pg_dump/pg_dump.c:7226: trailing whitespace.
>+
It would be better to post an updated patch,
but other than that, I've confirmed the fixes.
The patch passed the make-world and regression tests as well.
I've marked this as "ready for committer".


Regards,
Kirk Jamison


Reply | Threaded
Open this post in threaded view
|

RE: Multivariate MCV list vs. statistics target

Jamison, Kirk
Hi,

> From: Jamison, Kirk [mailto:[hidden email]]
> Sent: Monday, July 29, 2019 10:53 AM
> To: 'Tomas Vondra' <[hidden email]>
> Cc: Dean Rasheed <[hidden email]>; PostgreSQL Hackers
> <[hidden email]>
> Subject: RE: Multivariate MCV list vs. statistics target
>
> On Saturday, July 27, 2019 7:06 AM(GMT+9), Tomas Vondra wrote:
> > On Fri, Jul 26, 2019 at 07:03:41AM +0000, Jamison, Kirk wrote:
> > >On Sat, July 20, 2019 8:12 AM (GMT+9), Tomas Vondra wrote:
> > >
> > >> >+ /* XXX What if the target is set to 0? Reset the statistic?
> > >> */
> > >> >
> > >> >This also makes me wonder. I haven't looked deeply into the code,
> > >> >but since 0 is a valid value, I believe it should reset the stats.
> > >>
> > >> I agree resetting the stats after setting the target to 0 seems
> > >> quite reasonable. But that's not what we do for attribute stats,
> > >> because in that case we simply skip the attribute during the future
> > >> ANALYZE runs
> > >> - we don't reset the stats, we keep the existing stats. So I've
> > >> done the same thing here, and I've removed the XXX comment.
> > >>
> > >> If we want to change that, I'd do it in a separate patch for both
> > >> the regular and extended stats.
> > >
> > >Hi, Tomas
> > >
> > >Sorry for my late reply.
> > >You're right. I have no strong opinion whether we'd want to change
> > >that
> > behavior.
> > >I've also confirmed the change in the patch where setting statistics
> > >target 0 skips the statistics.
> > >
> >
> > OK, thanks.
> >
> > >Maybe only some minor nitpicks in the source code comments below:
> > >1. "it's" should be "its":
> > >> + * Compute statistic target, based on what's set for the
> > statistic
> > >> + * object itself, and for it's attributes.
> > >
> > >2. Consistency whether you'd use either "statistic " or "statisticS ".
> > >Ex. statistic target vs statisticS target, statistics object vs
> > >statistic
> > object, etc.
> > >
> > >> Attached is v4 of the patch, with a couple more improvements:
> > >>
> > >> 1) I've renamed the if_not_exists flag to missing_ok, because
> > >> that's more consistent with the "IF EXISTS" clause in the grammar
> > >> (the old flag was kinda the exact opposite), and I've added a NOTICE
> about the skip.
> > >
> > >+ bool missing_ok;      /* do nothing if statistics does
> > not exist */
> > >
> > >Confirmed. So we ignore if statistic does not exist, and skip the error.
> > >Maybe to make it consistent with other data structures in
> > >parsernodes.h, you can change the comment of missing_ok to:
> > >/* skip error if statistics object does not exist */
> > >
> >
> > Thanks, I've fixed all those places in the attached v5.
>
> I've confirmed the fix.
>
> > >> 2) I've renamed ComputeExtStatsTarget to ComputeExtStatsRows,
> > >> because that's what the function was doing anyway (computing sample
> size).
> > >>
> > >> 3) I've added a couple of regression tests to stats_ext.sql
> > >>
> > >> Aside from that, I've cleaned up a couple of places and improved a
> > >> bunch of comments. Nothing huge.
> > >
> > >I have a question though regarding ComputeExtStatisticsRows.
> > >I'm just curious with the value 300 when computing sample size.
> > >Where did this value come from?
> > >
> > >+ /* compute sample size based on the statistic target */
> > >+ return (300 * result);
> > >
> > >Overall, the patch is almost already in good shape for commit.
> > >I'll wait for the next update.
> > >
> >
> > That's how we compute number of rows to sample, based on the statistics
> target.
> > See std_typanalyze() in analyze.c, which also cites the paper where
> > this comes from.
> Noted. Found it. Thank you for the reference.
>
>
> There's just a small whitespace (extra space) below after running git diff
> --check.
> >src/bin/pg_dump/pg_dump.c:7226: trailing whitespace.
> >+
> It would be better to post an updated patch, but other than that, I've confirmed
> the fixes.
> The patch passed the make-world and regression tests as well.
> I've marked this as "ready for committer".

The patch does not apply anymore.
In addition to the whitespace detected,
kindly rebase the patch as there were changes from recent commits
in the following files:
       src/backend/commands/analyze.c
       src/bin/pg_dump/pg_dump.c
       src/bin/pg_dump/t/002_pg_dump.pl
       src/test/regress/expected/stats_ext.out
       src/test/regress/sql/stats_ext.sql

Regards,
Kirk Jamison


Reply | Threaded
Open this post in threaded view
|

Re: Multivariate MCV list vs. statistics target

Thomas Munro-5
On Thu, Aug 1, 2019 at 12:16 PM Jamison, Kirk <[hidden email]> wrote:
> > On Saturday, July 27, 2019 7:06 AM(GMT+9), Tomas Vondra wrote:
> > > On Fri, Jul 26, 2019 at 07:03:41AM +0000, Jamison, Kirk wrote:
> > > >Overall, the patch is almost already in good shape for commit.
> > > >I'll wait for the next update.

> > The patch passed the make-world and regression tests as well.
> > I've marked this as "ready for committer".
>
> The patch does not apply anymore.

Based on the above, it sounds like this patch is super close and the
only problem is bitrot, so I've set it back to Ready for Committer.
Over to Tomas to rebase and commit, or move to the next CF if that's
more appropriate.

--
Thomas Munro
https://enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: Multivariate MCV list vs. statistics target

Kyotaro Horiguchi-4
In reply to this post by Jamison, Kirk
Hello.

At Thu, 1 Aug 2019 00:15:48 +0000, "Jamison, Kirk" <[hidden email]> wrote in <D09B13F772D2274BB348A310EE3027C6518F94@g01jpexmbkw24>
> The patch does not apply anymore.
> In addition to the whitespace detected,
> kindly rebase the patch as there were changes from recent commits
> in the following files:
>        src/backend/commands/analyze.c
>        src/bin/pg_dump/pg_dump.c
>        src/bin/pg_dump/t/002_pg_dump.pl
>        src/test/regress/expected/stats_ext.out
>        src/test/regress/sql/stats_ext.sql

The patch finally failed only for stats_ext.out, where 14ef15a222
is hitting. (for b2a3d706b8)

I looked through this patch and have some comments.



+++ b/src/backend/commands/statscmds.c
+#include "access/heapam.h"
..
+#include "utils/fmgroids.h"

These don't seem needed.


+    Assert(stmt->missing_ok);

Perhaps we shouldn't Assert on this condition. Isn't it better we
just "elog(ERROR" here?


+    DeconstructQualifiedName(stmt->defnames, &schemaname, &statname);

Maybe we don't need detailed analysis that the function emits on
error. Couldn't we use NameListToString() instead? That reduces
the number of ereport()s and considerably simplify the code
around.

+  oldtup = SearchSysCache1(STATEXTOID, ObjectIdGetDatum(stxoid));
+
+  /* Must be owner of the existing statistics object */
+  if (!pg_statistics_object_ownercheck(stxoid, GetUserId()))

This repeats the SearchSysCache twice in a quite short
duration. I suppose it'd be better that ACL (and validity) checks
done directly using oldtup.


+  /* replace the stxstattarget column */
+  repl_repl[Anum_pg_statistic_ext_stxstattarget - 1] = true;
+  repl_val[Anum_pg_statistic_ext_stxstattarget - 1] = Int32GetDatum(newtarget)

We usually do this kind of work using SearchSysCacheCopyN(),
which simplifies the code around, too.


+++ b/src/backend/statistics/mcv.c
>    * Maximum number of MCV items to store, based on the attribute with the
>    * largest stats target (and the number of groups we have available).
>    */
-  nitems = stats[0]->attr->attstattarget;
-  for (i = 1; i < numattrs; i++)
-  {
-    if (stats[i]->attr->attstattarget > nitems)
-      nitems = stats[i]->attr->attstattarget;
-  }
+  nitems = stattarget;

Maybe you forgot to modify the comment.


check_xact_readonly() returns false for this command. As the
result it emits a somewhat pointless error message.

=# alter statistics s1 set statistics 0;
ERROR:  cannot assign TransactionIds during recovery


+++ b/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.h
>   i_stxname = PQfnumber(res, "stxname");
>   i_stxnamespace = PQfnumber(res, "stxnamespace");
>   i_rolname = PQfnumber(res, "rolname");
+   i_stattarget = PQfnumber(res, "stxstattarget");

I'm not sure whether it is the convention here, but variable name
is different from column name only for the added line.


+++ b/src/bin/psql/tab-complete.c
- COMPLETE_WITH("OWNER TO", "RENAME TO", "SET SCHEMA");
+ COMPLETE_WITH("OWNER TO", "RENAME TO", "SET SCHEMA", "SET STATISTICS");

ALTER STATISTICS s2 SET STATISTICS<tab> is suggested with
ext-stats names,  but it's the place for target value.


+++ b/src/include/nodes/nodes.h
   T_CallStmt,
+  T_AlterStatsStmt,
 
I think it should be immediately below T_CreateStatsStmt.


+++ b/src/include/nodes/parsenodes.h
+  bool    missing_ok;    /* skip error if statistics object is missing */

Should be very trivial, but many bool members especially
missing_ok have a comment having "?" at the end.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


Reply | Threaded
Open this post in threaded view
|

Re: Multivariate MCV list vs. statistics target

Tomas Vondra-4
In reply to this post by Thomas Munro-5
On Thu, Aug 01, 2019 at 05:25:31PM +1200, Thomas Munro wrote:

>On Thu, Aug 1, 2019 at 12:16 PM Jamison, Kirk <[hidden email]> wrote:
>> > On Saturday, July 27, 2019 7:06 AM(GMT+9), Tomas Vondra wrote:
>> > > On Fri, Jul 26, 2019 at 07:03:41AM +0000, Jamison, Kirk wrote:
>> > > >Overall, the patch is almost already in good shape for commit.
>> > > >I'll wait for the next update.
>
>> > The patch passed the make-world and regression tests as well.
>> > I've marked this as "ready for committer".
>>
>> The patch does not apply anymore.
>
>Based on the above, it sounds like this patch is super close and the
>only problem is bitrot, so I've set it back to Ready for Committer.
>Over to Tomas to rebase and commit, or move to the next CF if that's
>more appropriate.
>

I'll move it to the next CF. Aside from the issues pointed by Kyotaro-san
in his review, I still haven't made my mind about whether to base the use
statistics targets set for the attributes. That's what we're doing now,
but I'm not sure it's a good idea after adding separate statistics target.
I wonder what Dean's opinion on this is, as he added the current logic.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Reply | Threaded
Open this post in threaded view
|

Re: Multivariate MCV list vs. statistics target

Dean Rasheed-3
On Thu, 1 Aug 2019 at 11:30, Tomas Vondra <[hidden email]> wrote:
>
> I'll move it to the next CF. Aside from the issues pointed by Kyotaro-san
> in his review, I still haven't made my mind about whether to base the use
> statistics targets set for the attributes. That's what we're doing now,
> but I'm not sure it's a good idea after adding separate statistics target.
> I wonder what Dean's opinion on this is, as he added the current logic.
>

If this were being released in the same version as MCV stats first
appeared, I'd say that there's not much point basing the default
multivariate stats target on the per-column targets, when it has its
own knob to control it. However, since this won't be released for a
year, those per-column-based defaults will be in the field for that
long, and so I'd say that we shouldn't change the default when adding
this, otherwise users who don't use this new feature might be
surprised by the change in behaviour.

Regards,
Dean


Reply | Threaded
Open this post in threaded view
|

Re: Multivariate MCV list vs. statistics target

Alvaro Herrera-9
In reply to this post by Tomas Vondra-4
On 2019-Aug-01, Tomas Vondra wrote:

> I'll move it to the next CF. Aside from the issues pointed by Kyotaro-san
> in his review, I still haven't made my mind about whether to base the use
> statistics targets set for the attributes. That's what we're doing now,
> but I'm not sure it's a good idea after adding separate statistics target.
> I wonder what Dean's opinion on this is, as he added the current logic.

Latest patch no longer applies.  Please update.  And since you already
seem to have handled all review comments since it was Ready for
Committer, and you now know Dean's opinion on the remaining question,
please get it pushed.

Thanks

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: Multivariate MCV list vs. statistics target

Tomas Vondra-4
On Tue, Sep 03, 2019 at 02:38:56PM -0400, Alvaro Herrera wrote:

>On 2019-Aug-01, Tomas Vondra wrote:
>
>> I'll move it to the next CF. Aside from the issues pointed by Kyotaro-san
>> in his review, I still haven't made my mind about whether to base the use
>> statistics targets set for the attributes. That's what we're doing now,
>> but I'm not sure it's a good idea after adding separate statistics target.
>> I wonder what Dean's opinion on this is, as he added the current logic.
>
>Latest patch no longer applies.  Please update.  And since you already
>seem to have handled all review comments since it was Ready for
>Committer, and you now know Dean's opinion on the remaining question,
>please get it pushed.
>

OK, I've pushed this the patch, after some minor polishing.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services