n_distinct off by a factor of 1000

n_distinct off by a factor of 1000

Klaudie Willis
Friends,

I run PostgreSQL 12.3 on Windows. I have just discovered a pretty significant problem with PostgreSQL and my data.  I have a large table, 500M rows, 50 columns, split into 3 partitions by year.  In addition to the primary key, one of the columns is indexed, and I do lookups on it.

Select * from bigtable b where b.instrument_ref in (x,y,z,...)
limit 1000

It responded well, with sub-second response times, and it used the index on the column.  However, when I changed it to:

Select * from bigtable b where b.instrument_ref in (x,y,z,...)
limit 10000 -- (notice 10K now)

The planner decided to do a full table scan of the entire 500M row table! And that did not work very well.  At first I had no clue why it did so; when I disabled sequential scans, the query returned immediately.  But I should not have to do that.
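For anyone reproducing this, here is a sketch of the diagnosis (table and column names from the post, placeholder id values); disabling sequential scans this way is a per-session testing aid, not a fix:

```sql
EXPLAIN
SELECT * FROM bigtable b
WHERE b.instrument_ref IN (1, 2, 3)   -- placeholder ids
LIMIT 10000;                          -- the bad plan shows a Seq Scan here

SET enable_seqscan = off;             -- testing only, current session
-- re-run the EXPLAIN / query to compare plans, then:
RESET enable_seqscan;
```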

I got my first hint of why this problem occurs when I looked at the statistics.  For the column in question, "instrument_ref" the statistics claimed it to be:

The default_statistics_target=500, and analyze has been run.
select * from pg_stats where attname like 'instr%_ref'; -- Result: 40.000
select count(distinct instrumentid_ref) from bigtable -- Result: 33 385 922 (!!)

That is an astonishing difference of almost 1000X.

When the planner thinks there are only 40K different values, it makes sense to switch to a table scan in order to fill the LIMIT of 10,000.  But it is wrong, very wrong, and the query returns in hundreds of seconds instead of a few.

I have tried increasing the statistics target to 5000, and it helps, but it only reduces the error to 100X.  Still crazy high.

I understand that this is a known problem.  I have read previous posts about it, yet I have never seen anyone report such a large difference factor.

I have considered these fixes:
- hardcode the statistics to a particular ratio of the total number of rows
- randomize the rows more, so that the sample does not suffer from page clustering.  However, this probably has other implications

Feel free to comment :)


K


Re: n_distinct off by a factor of 1000

Ron-2
Maybe I missed it, but did you run "ANALYZE VERBOSE bigtable;"?

On 6/23/20 7:42 AM, Klaudie Willis wrote:
[...]

--
Angular momentum makes the world go 'round.

Re: n_distinct off by a factor of 1000

Klaudie Willis
I didn't run it with "verbose", but otherwise yes, several times.  I can do it again with verbose if you are interested in the output; just give me some time.  500M rows x 50 columns is no small job :)


K

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Tuesday, June 23, 2020 2:51 PM, Ron <[hidden email]> wrote:

Maybe I missed it, but did you run "ANALYZE VERBOSE bigtable;"?


On 6/23/20 7:42 AM, Klaudie Willis wrote:
[...]


Re: n_distinct off by a factor of 1000

Fabio Pardi
In reply to this post by Klaudie Willis

On 23/06/2020 14:42, Klaudie Willis wrote:
I got my first hint of why this problem occurs when I looked at the statistics.  For the column in question, "instrument_ref" the statistics claimed it to be:

The default_statistics_target=500, and analyze has been run.
select * from pg_stats where attname like 'instr%_ref'; -- Result: 40.000
select count(distinct instrumentid_ref) from bigtable -- Result: 33 385 922 (!!)

That is an astonishing difference of almost a 1000X. 


I think you are counting 2 different things here.

The first query returns all the columns "like 'instr%_ref'" present in the statistics (so in the whole cluster), while the second is counting the actual number of distinct values in bigtable.


regards,

fabio pardi

Re: n_distinct off by a factor of 1000

Adrian Klaver-4
On 6/23/20 7:05 AM, Fabio Pardi wrote:

>
> On 23/06/2020 14:42, Klaudie Willis wrote:
>> I got my first hint of why this problem occurs when I looked at the
>> statistics.  For the column in question, "instrument_ref" the
>> statistics claimed it to be:
>>
>> The default_statistics_target=500, and analyze has been run.
>> select * from pg_stats where attname like 'instr%_ref'; -- Result:
>> *40.000*
>> select count(distinct instrumentid_ref) from bigtable -- Result: *33
>> 385 922 (!!)*
>>
>> That is an astonishing difference of almost a 1000X.
>>
>
> I think you are counting 2 different things here.
>
> The first query returns all the columns "like 'instr%_ref'" present in
> the statistics (so in the whole cluster), while the second is counting
> the actual number of different rows in bigtable.

I believe the OP actually meant the query to be:

select n_distinct from pg_stats where attname like 'instr%_ref';

>
>
> regards,
>
> fabio pardi


--
Adrian Klaver
[hidden email]



Re: n_distinct off by a factor of 1000

Klaudie Willis
Adrian, you are correct.  My mistake.

K

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Tuesday, June 23, 2020 4:14 PM, Adrian Klaver <[hidden email]> wrote:

> On 6/23/20 7:05 AM, Fabio Pardi wrote:
>
> > On 23/06/2020 14:42, Klaudie Willis wrote:
> >
> > > I got my first hint of why this problem occurs when I looked at the
> > > statistics.  For the column in question, "instrument_ref" the
> > > statistics claimed it to be:
> > > The default_statistics_target=500, and analyze has been run.
> > > select * from pg_stats where attname like 'instr%_ref'; -- Result:
> > > 40.000
> > > select count(distinct instrumentid_ref) from bigtable -- Result: 33
> > > 385 922 (!!)That is an astonishing difference of almost a 1000X.
> >
> > I think you are counting 2 different things here.
> > The first query returns all the columns "like 'instr%_ref'" present in
> > the statistics (so in the whole cluster), while the second is counting
> > the actual number of different rows in bigtable.
>
> I believe the OP actually meant the query to be:
>
> select n_distinct from pg_stats where attname like 'instr%_ref';
>
> > regards,
> > fabio pardi
>
> --
>
> Adrian Klaver
> [hidden email]





Re: n_distinct off by a factor of 1000

Michael Lewis
> On 23/06/2020 14:42, Klaudie Willis wrote:
> [...]

Try something like this to check how representative those "most common values" are. If n_distinct is very low compared to reality, and the fraction of the table that the "most common" values claim to cover is also low, then you can get very bad estimates when querying for values that are not in the MCVs list. The planner assumes an even distribution for the other values, and that may be much higher or lower than reality.

That is, if you have a statistics target of 100 like normal, the MCVs cover 5% of the table, and you have an ndistinct value of 500, then the other 400 values are assumed to evenly cover that remaining 95% of the table, so each value would be estimated at .95/400 * reltuples rows. If your real count of distinct values is 40000, then the number of rows you expect to get for each value in your IN clause drops hugely.

Using a custom ndistinct will dramatically impact the estimates the planner uses to decide between an index and a sequential scan. Also, if the custom ndistinct and the actual distinct count come to vary by 2x or 10x as your data grows, that matters very little IMO compared to relying on the sample taken by the (auto)analyze job being off by a factor of 1000x, or even 100x, as you have experienced.


SELECT
    (SELECT sum(x) FROM unnest(most_common_freqs) x) AS frac_mcv,
    tablename,
    attname,
    inherited,
    null_frac,
    n_distinct,
    array_length(most_common_vals, 1) AS n_mcv,
    array_length(histogram_bounds, 1) AS n_hist,
    correlation,
    *
FROM pg_stats
WHERE schemaname = 'public'
    AND tablename = 'table'
    AND attname = 'column';
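The even-distribution arithmetic described above can be sketched numerically (illustrative numbers from the paragraph, with reltuples set to the thread's 500M-row table for scale):

```python
# Worked example of the planner arithmetic described above.
reltuples  = 500_000_000   # planner's row-count estimate for the table
frac_mcv   = 0.05          # fraction of the table covered by the MCV list
n_mcv      = 100           # number of most-common values stored
n_distinct = 500           # planner's (under)estimated distinct count

# Selectivity assumed for any value NOT in the MCV list: the remaining
# fraction of the table spread evenly over the remaining distinct values.
rows_per_value = (1.0 - frac_mcv) / (n_distinct - n_mcv) * reltuples
print(f"{rows_per_value:,.0f}")   # 1,187,500 rows expected per value

# The same arithmetic with a much larger distinct count gives far fewer
# rows per value, which favors the index scan instead.
true_distinct = 40_000
true_rows = (1.0 - frac_mcv) / (true_distinct - n_mcv) * reltuples
print(f"{true_rows:,.0f}")        # ~11,905 rows per value
```

With a LIMIT of 10,000, an estimate of over a million rows per IN-list value makes the sequential scan look cheap to the planner, which matches the behavior reported at the top of the thread.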

Re: n_distinct off by a factor of 1000

Klaudie Willis
In reply to this post by Klaudie Willis
show default_statistics_target; --> 500
ALTER TABLE public.bigtable ALTER COLUMN instrumentid_ref SET STATISTICS 5000;

Here is the output of the "ANALYZE VERBOSE bigtable;"
INFO:  analyzing "public.bigtables" inheritance tree
INFO:  "bigtable_y2018": scanned 622250 of 10661013 pages, containing 11994670 live rows and 5091 dead rows; 622250 rows in sample, 205504753 estimated total rows
INFO:  "bigtable_y2019": scanned 520159 of 8911886 pages, containing 10017582 live rows and 6148 dead rows; 520159 rows in sample, 171631268 estimated total rows
INFO:  "bigtable_y2020": scanned 357591 of 6126616 pages, containing 7031238 live rows and 1534 dead rows; 357591 rows in sample, 120466385 estimated total rows
INFO:  analyzing "public.bigtable_y2018"
INFO:  "bigtable_y2018": scanned 1500000 of 10661013 pages, containing 28915115 live rows and 12589 dead rows; 1500000 rows in sample, 205509611 estimated total rows
INFO:  analyzing "public.bigtable_y2019"
INFO:  "bigtable_y2019": scanned 1500000 of 8911886 pages, containing 28888514 live rows and 17778 dead rows; 1500000 rows in sample, 171634096 estimated total rows
INFO:  analyzing "public.bigtable_y2020"
INFO:  "bigtable_y2020": scanned 1500000 of 6126616 pages, containing 29488967 live rows and 6330 dead rows; 1500000 rows in sample, 120445051 estimated total rows
INFO:  analyzing "public.bigtable_y2021"
INFO:  "bigtable_y2021": scanned 1 of 1 pages, containing 8 live rows and 0 dead rows; 8 rows in sample, 8 estimated total rows
ANALYZE


On the query suggested by Michael Lewis:

SELECT
    (SELECT sum(x) FROM unnest(most_common_freqs) x) AS frac_mcv,
    n_distinct,
    array_length(most_common_vals, 1) AS n_mcv,
    array_length(histogram_bounds, 1) AS n_hist,
    tablename
FROM pg_stats
WHERE schemaname = 'public'
    AND attname = 'instrumentid_ref';
 frac_mcv   | n_distinct | n_mcv | n_hist | tablename
------------+------------+-------+--------+----------------
 0.9205394  |     122160 |  2140 |   5001 | bigtable
 0.9203018  |     124312 |  1736 |   5001 | bigtable_y2018
 0.9258158  |     113846 |  2107 |   5001 | bigtable_y2020
 0.875      |     -0.375 |     2 |        | bigtable_y2021
 0.92304045 |     118267 |  2204 |   5001 | bigtable_y2019

select count(distinct instrumentid_ref) from bigtable --> 33 385 922
Bigtable's instrumentid_ref is underestimated by roughly 300X even with the column's statistics target set to 5000.  Pretty weird.


K

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Tuesday, June 23, 2020 3:07 PM, Klaudie Willis <[hidden email]> wrote:

I didn't run it with "verbose" but otherwise, yes, several times.  I can do it again with verbose if you are interested in the output.  Just give me some time.  500M rows 50 columns, is no small job :)


K

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Tuesday, June 23, 2020 2:51 PM, Ron <[hidden email]> wrote:

Maybe I missed it, but did you run "ANALYZE VERBOSE bigtable;"?


On 6/23/20 7:42 AM, Klaudie Willis wrote:
[...]



Re: n_distinct off by a factor of 1000

Peter J. Holzer
On 2020-06-24 07:30:05 +0000, Klaudie Willis wrote:
> show default_statistics_target; --> 500
> ALTER TABLE public.bigtable ALTER COLUMN instrumentid_ref SET STATISTICS 5000;
>
> Here is the output of the "ANALYZE VERBOSE bigtable;"
> INFO:  analyzing "public.bigtables" inheritance tree
[...]

> INFO:  analyzing "public.bigtable_y2018"
> INFO:  "bigtable_y2018": scanned 1500000 of 10661013 pages, containing 28915115
> live rows and 12589 dead rows; 1500000 rows in sample, 205509611 estimated
> total rows
> INFO:  analyzing "public.bigtable_y2019"
> INFO:  "bigtable_y2019": scanned 1500000 of 8911886 pages, containing 28888514
> live rows and 17778 dead rows; 1500000 rows in sample, 171634096 estimated
> total rows
> INFO:  analyzing "public.bigtable_y2020"
> INFO:  "bigtable_y2020": scanned 1500000 of 6126616 pages, containing 29488967
> live rows and 6330 dead rows; 1500000 rows in sample, 120445051 estimated total
> rows
> INFO:  analyzing "public.bigtable_y2021"
> INFO:  "bigtable_y2021": scanned 1 of 1 pages, containing 8 live rows and 0
> dead rows; 8 rows in sample, 8 estimated total rows
So it scans 1500000 rows from each partition, but ...

[...]
> frac_MCV;n_distinct; n_mcv; n_hist;tablename
> 0.9205394 122160 2140 5001 "bigtable"
> 0.9203018 124312 1736 5001 "bigtable_y2018"
> 0.9258158 113846 2107 5001 "bigtable_y2020"
> 0.875        -0.375      2              "bigtable_y2021"
> 0.92304045 118267 2204 5001 "bigtable_y2019"

it estimates the number of distinct values in each partition as between
113846 and 124312. So it can have encountered at most that many
different values, which means it must have encountered each value
about 12 or 13 times on average.

My guess is that there are relatively few (less than 120000) distinct
values which make up the bulk (over 90 %) of these tables, and a lot (33
million) of values which are very rare.

Is this guess correct?
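That guess can be illustrated with a toy Monte-Carlo sketch (made-up, scaled-down numbers: the real table has ~33M distinct values and a 1.5M-row sample):

```python
import random

random.seed(42)

N_COMMON = 1_200      # "common" ids supplying ~92% of rows
N_RARE   = 330_000    # "rare" ids supplying the remaining ~8%
SAMPLE   = 15_000     # rows drawn by a scaled-down ANALYZE

def draw_row() -> int:
    """Return one row's id from a heavily skewed population."""
    if random.random() < 0.92:
        return random.randrange(N_COMMON)        # a common id
    return N_COMMON + random.randrange(N_RARE)   # a rare id

seen = {draw_row() for _ in range(SAMPLE)}
# The sample contains only a few thousand distinct ids, even though the
# population holds hundreds of thousands -- any estimator working from
# this sample starts out orders of magnitude short.
print(len(seen))
```

Each common id shows up roughly a dozen times in the sample (matching the 12-13 repeats computed above), while almost every rare id drawn is a first sighting, which is exactly the pattern that defeats sample-based distinct-count estimators.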

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | [hidden email]         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"


Re: n_distinct off by a factor of 1000

Peter J. Holzer
In reply to this post by Klaudie Willis
[Please keep replies on the list]

On 2020-06-24 11:02:22 +0000, Klaudie Willis wrote:

> Holzer, thanks for your feedback.  Yes, your guess is very good.  The
> data consists of millions of instruments that occur only a handful of
> times (rare), and instruments that are very common.
>
> I am still a little surprised that it is so hard to sample data and
> estimate distinct values within a 1000X. My intuition misleads me into
> thinking this should be simpler, but I understand that this is no
> simple task at all.  For your information, it seems like SQL Server
> 2016 makes the same "mistake" when calculating distincts (or 1/density
> as they call it). I have similar data there that I have looked into,
> and it seems "fooled" as well.
Yes, estimating the number of distinct values from a relatively small
sample is hard when you don't know the underlying distribution. It might
be possible to analyze the sample to find the distribution and get a
better estimate. But I'm not sure how useful that would really be: If
a few values are very common and most very rare you are probably also
much more likely to use the common values in a query. And for those you
would massively underestimate their frequency if you had an accurate
n_distinct value. That might be just as bad or even worse.

I'm wondering whether it would make sense to use a range instead of a
single value in cost calculations. Then the planner could prefer plans
which had reasonable cost over the whole range over plans which are
good at one end of the range but horrible at the other.

I'm afraid that would lead to combinatorial explosion, though. Even
with a small number of ranges there would be a large number of possible
combinations to calculate. So one would probably have to resort to Monte
Carlo simulation or something like that.

        hp



Re: n_distinct off by a factor of 1000

Michael Lewis


On Wed, Jun 24, 2020, 2:35 PM Peter J. Holzer <[hidden email]> wrote:
Yes, estimating the number of distinct values from a relatively small
sample is hard when you don't know the underlying distribution. It might
be possible to analyze the sample to find the distribution and get a
better estimate. But I'm not sure how useful that would really be: If
a few values are very common and most very rare you are probably also
much more likely to use the common values in a query. And for those you
would massively underestimate their frequency if you had an accurate
n_distinct value. That might be just as bad or even worse.


This would only be true for values that are "common" but not in the MCVs list, right?

If we could increase the sampling ratio beyond the hard-coded 300x to get a more representative sample, and use that to estimate ndistinct (and also the frequency of the most common values), but only actually store the 100 MCVs (or whatever the stats target is set to for the system or column), then the issue might be mitigated without increasing planning time from stats that are larger than prudent. The "only" cost should be longer processing time when (auto)analyzing, plus the overhead of considering this potential new setting in all analyze cases, I suppose.
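The "hard coded 300x" refers to ANALYZE's sample size of 300 * statistics_target rows; a quick sketch of what that means for this thread's table:

```python
# ANALYZE samples 300 * statistics_target rows, regardless of table size.
def analyze_sample_rows(statistics_target: int) -> int:
    return 300 * statistics_target

reltuples = 500_000_000  # the thread's table
for target in (100, 500, 5000):
    rows = analyze_sample_rows(target)
    print(f"target={target}: sample={rows:,} rows "
          f"({rows / reltuples:.4%} of the table)")
```

The target=5000 case gives a 1,500,000-row sample, which matches the "1500000 rows in sample" lines in the ANALYZE VERBOSE output posted earlier in the thread, and is still only 0.3% of the table.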

Re: n_distinct off by a factor of 1000

Klaudie Willis

If we could increase the sampling ratio beyond the hard-coded 300x to get a more representative sample, and use that to estimate ndistinct (and also the frequency of the most common values), but only actually store the 100 MCVs (or whatever the stats target is set to for the system or column), then the issue might be mitigated without increasing planning time from stats that are larger than prudent. The "only" cost should be longer processing time when (auto)analyzing, plus the overhead of considering this potential new setting in all analyze cases, I suppose.

I found another large deviation in one of my bridge tables. It is an (int,int) table of 900M rows where the B column contains 2.7M distinct values; however, pg_stats claims it is only 10,400. These numbers are with a statistics target of 500.  I'm not sure that really matters for the planner for the queries I run, but it makes me a little nervous :)

Also, is it just my data samples, or is n_distinct far more often underestimated, and by a larger ratio, than overestimated?

K

Re: n_distinct off by a factor of 1000

Pavel Luzanov
In reply to this post by Klaudie Willis

Hello,

[...]

I have tried to increase the statistics target to 5000, and it helps, but it reduces the error to 100X.  Still crazy high.

As far as I know, increasing default_statistics_target will not help. [1]

I have considered these fixes:
- hardcode the statistics to a particular ratio of the total number of rows

You can hardcode the percentage of distinct values:
ALTER TABLE bigtable ALTER COLUMN instrument_ref SET ( n_distinct=-0.06 ); /* -1 * (33385922 / 500000000) */
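As a usage note (a sketch using the names above): the override only takes effect once the table is analyzed again, after which pg_stats should report the hardcoded value:

```sql
ALTER TABLE bigtable ALTER COLUMN instrument_ref SET (n_distinct = -0.06);
ANALYZE bigtable;  -- the override is applied at the next ANALYZE

SELECT n_distinct FROM pg_stats
WHERE tablename = 'bigtable' AND attname = 'instrument_ref';  -- expect -0.06
```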


[1] https://www.postgresql.org/message-id/4136ffa0812111823u645b6ec9wdca60b3da4b00499%40mail.gmail.com
-----
Pavel Luzanov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: n_distinct off by a factor of 1000

Michael Lewis
On Thu, Jun 25, 2020 at 7:27 AM Pavel Luzanov <[hidden email]> wrote:
[...]


Thanks for sharing. Very interesting read. If anyone has reference to the papers alluded to, that would be appreciated. I had forgotten about the option to set negative values.

Re: n_distinct off by a factor of 1000

Klaudie Willis
On the topic of n_distinct.

I am not sure whether I am misinterpreting something or whether it is a bug (probably the former), but when using partitions, isn't n_distinct_inherited supposed to propagate to the child partitions? It does not seem to do so. (Yes, I ran ANALYZE after setting the variable.)  I had to set n_distinct separately on all partitions to get the desired planner behavior, but I thought setting n_distinct_inherited was supposed to avoid having to set the partitions manually.
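For reference, the intended one-liner versus the workaround described above (partition names from earlier in the thread; either form only takes effect at the next ANALYZE):

```sql
-- Intended: set once on the partitioned parent
ALTER TABLE bigtable ALTER COLUMN instrumentid_ref
    SET (n_distinct_inherited = -0.06);

-- Workaround that was needed: set each partition explicitly
ALTER TABLE bigtable_y2018 ALTER COLUMN instrumentid_ref SET (n_distinct = -0.06);
ALTER TABLE bigtable_y2019 ALTER COLUMN instrumentid_ref SET (n_distinct = -0.06);
ALTER TABLE bigtable_y2020 ALTER COLUMN instrumentid_ref SET (n_distinct = -0.06);

ANALYZE bigtable;
```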


K


Re: n_distinct off by a factor of 1000

Klaudie Willis
[...]

I follow up with a dbfiddle for this problem:
 
best regards
K


Re: n_distinct off by a factor of 1000

Peter J. Holzer
In reply to this post by Michael Lewis
On 2020-06-24 16:27:35 -0600, Michael Lewis wrote:

> On Wed, Jun 24, 2020, 2:35 PM Peter J. Holzer <[hidden email]> wrote:
>
>     Yes, estimating the number of distinct values from a relatively small
>     sample is hard when you don't know the underlying distribution. It might
>     be possible to analyze the sample to find the distribution and get a
>     better estimate. But I'm not sure how useful that would really be: If
>     a few values are very common and most very rare you are probably also
>     much more likely to use the common values in a query: And for those you
>     you would massively underestimate their frequency if you had an accurate
>     n_distinct value. That might be just as bad or even worse.
>
>
>
> This would only be true for values that are "common" but not in the MCVs list,
> right?
Yes, but if you have 33 million values there are likely to be a lot of
them "common but not in the MCVs list", even for a very biased
distribution.


> If we could increase the sampling ratio beyond the hard coded 300x to get a
> more representative sample

I thought of that but abandoned it since I don't think a better estimate
for n_distinct will help (see above for the reason). The problem is that
the distribution is biased and the planner has no idea whether the value
it is searching for is common or rare if it isn't in the MCV list.

Unless ...

As I understood Klaudie, the values are ids, and since ids have no
inherent meaning, the common values are probably scattered randomly.

But it might be possible to change that. Group the ids by frequency:
ids < 1E12 occur at most 10 times, ids >= 1E12 and < 2E12 occur at most
100 times, and so on.

This may mean that ids aren't long time stable - they may change as
their frequency changes. But if an id always changes by a multiple of
1E12, the last 12 decimal digits are stable.

The advantage is that then the planner can use the histogram to get a
pretty good estimate of how frequent a value is.
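The scheme floated above can be sketched like this (hypothetical helper names): the frequency class lives in the digits above 1E12, so reclassifying an id changes it by a multiple of 1E12 and the low 12 digits stay stable.

```python
BLOCK = 10 ** 12  # frequency class occupies the digits above this

def make_id(freq_class: int, base_id: int) -> int:
    """Combine a frequency class with a stable 12-digit base id."""
    assert 0 <= base_id < BLOCK
    return freq_class * BLOCK + base_id

def stable_part(instrument_id: int) -> int:
    """The last 12 decimal digits, unchanged by reclassification."""
    return instrument_id % BLOCK

rare_id   = make_id(0, 123_456)   # class 0: occurs at most 10 times
common_id = make_id(2, 123_456)   # same instrument after growing common
print(stable_part(rare_id) == stable_part(common_id))  # True
```

With ids laid out this way, a histogram over the id values doubles as a histogram over frequency classes, which is what lets the planner infer how common an unseen value is.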

        hp
