Friends,

I run PostgreSQL 12.3 on Windows. I have just discovered a pretty significant problem with PostgreSQL and my data. I have a large table, 500M rows and 50 columns, split into 3 partitions by year. In addition to the primary key, one of the columns is indexed, and I do lookups on it:

select * from bigtable b where b.instrument_ref in (x, y, z, ...) limit 1000

This responded well, subsecond, and it used the index on the column. However, when I changed it to:

select * from bigtable b where b.instrument_ref in (x, y, z, ...) limit 10000  (notice 10K now)

the planner decided to do a full table scan on the entire 500M-row table! And that did not work very well. At first I had no clue why it did so, but when I disabled sequential scans the query returned immediately. I should not have to do that, though.

I got my first hint of why this problem occurs when I looked at the statistics. For the column in question, "instrument_ref", with default_statistics_target = 500 and ANALYZE having been run, the statistics claimed:

select * from pg_stats where attname like 'instr%_ref';  -- Result: 40,000
select count(distinct instrumentid_ref) from bigtable;  -- Result: 33,385,922 (!!)

That is an astonishing difference of almost 1000X. When the planner thinks there are only 40K different values, it makes sense to switch to a table scan in order to fill the limit of 10,000. But it is wrong, very wrong, and the query takes hundreds of seconds instead of a few.

I have tried increasing the statistics target to 5000, and it helps, but it only reduces the error to 100X. Still crazy high. I understand that this is a known problem, and I have read previous posts about it, but I have never seen anyone reach such a high difference factor.

I have considered these fixes:

- hardcode the statistics to a particular ratio of the total number of rows
- randomize the rows more, so that the column does not suffer from page clustering; however, this probably has other implications

Feel free to comment :)

K
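P.S. For completeness, roughly how I reproduce and work around it in a session (a sketch; the ids in the IN list are just placeholders):

-- compare the plans for the two limits
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM bigtable b WHERE b.instrument_ref IN (1, 2, 3) LIMIT 1000;

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM bigtable b WHERE b.instrument_ref IN (1, 2, 3) LIMIT 10000;

-- session-local workaround: forbid the sequential scan, rerun, then restore
SET enable_seqscan = off;
-- ... rerun the LIMIT 10000 query, which then uses the index ...
RESET enable_seqscan;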
Maybe I missed it, but did you run "ANALYZE VERBOSE bigtable;"?
On 6/23/20 7:42 AM, Klaudie Willis wrote:

Angular momentum makes the world go 'round. 
I didn't run it with "verbose", but otherwise yes, several times. I can do it again with verbose if you are interested in the output; just give me some time. 500M rows and 50 columns is no small job :)

K

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Tuesday, June 23, 2020 2:51 PM, Ron <[hidden email]> wrote:

On 23/06/2020 14:42, Klaudie Willis wrote:
I think you are counting 2 different things here.

The first query returns all the columns "like 'instr%_ref'" present in the statistics (so in the whole cluster), while the second is counting the actual number of distinct values in bigtable.

regards,

fabio pardi
On 6/23/20 7:05 AM, Fabio Pardi wrote:
> > On 23/06/2020 14:42, Klaudie Willis wrote:
> >
> > I got my first hint of why this problem occurs when I looked at the
> > statistics. For the column in question, "instrument_ref", the
> > statistics claimed:
> >
> > The default_statistics_target = 500, and analyze has been run.
> >
> > select * from pg_stats where attname like 'instr%_ref';  -- Result: 40,000
> > select count(distinct instrumentid_ref) from bigtable;  -- Result: 33,385,922 (!!)
> >
> > That is an astonishing difference of almost 1000X.
>
> I think you are counting 2 different things here.
>
> The first query returns all the columns "like 'instr%_ref'" present in
> the statistics (so in the whole cluster), while the second is counting
> the actual number of distinct values in bigtable.

I believe the OP actually meant the query to be:

select n_distinct from pg_stats where attname like 'instr%_ref';

> regards,
>
> fabio pardi

--
Adrian Klaver
[hidden email]
Adrian, you are correct. My mistake.
K

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Tuesday, June 23, 2020 4:14 PM, Adrian Klaver <[hidden email]> wrote:
Try something like this to check how representative those "most common values" are (query below). If n_distinct is very low compared to reality, and the fraction of the table that the "most common" values claim to cover is also low, then you can get very bad estimates when querying for values that are not in the MCVs list. The planner assumes an even distribution for the other values, and that may be much, much higher or lower than reality.

That is, if you have a statistics target of 100 like normal, the MCVs cover 5% of the table, and you have an n_distinct value of 500, then the other 400 values are assumed to evenly cover the remaining 95% of the table, so each value would be estimated as .95/400 * reltuples. If your real count of distinct values is 40000, then the number of rows you expect to get for each value in your IN clause drops hugely.

Using a custom n_distinct will dramatically impact the estimates the planner uses to decide between an index and a sequential scan. And if the custom n_distinct and the actual distinct count drift apart by 2x or 10x as your data grows, that matters very little IMO compared to relying on the sample taken by the (auto)analyze job being off by a factor of 1000x, or even 100x, as you have experienced.

SELECT
    ( SELECT SUM(x) FROM UNNEST(most_common_freqs) x ) AS frac_MCV,
    tablename,
    attname,
    inherited,
    null_frac,
    n_distinct,
    array_length(most_common_vals, 1) AS n_mcv,
    array_length(histogram_bounds, 1) AS n_hist,
    correlation,
    *
FROM pg_stats
WHERE schemaname = 'public'
  AND tablename = 'table'
  AND attname = 'column';
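To put rough numbers on that even-distribution assumption, here is some illustrative arithmetic only (it assumes the hypothetical figures above: statistics target 100, MCVs covering 5%, n_distinct 500, and a table of roughly 500M rows):

-- estimated rows per non-MCV value when the planner believes n_distinct = 500
SELECT (1 - 0.05) / (500 - 100) * 500000000 AS est_rows_per_value;   -- about 1,187,500

-- the same calculation if n_distinct were 40000 instead
SELECT (1 - 0.05) / (40000 - 100) * 500000000 AS est_rows_per_value; -- about 11,900

The larger the assumed rows-per-value, the cheaper a sequential scan with LIMIT looks, which is how an underestimated n_distinct pushes the planner away from the index.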
show default_statistics_target;
> 500

ALTER TABLE public.bigtable ALTER COLUMN instrumentid_ref SET STATISTICS 5000;

Here is the output of "ANALYZE VERBOSE bigtable;":

INFO:  analyzing "public.bigtables" inheritance tree
INFO:  "bigtable_y2018": scanned 622250 of 10661013 pages, containing 11994670 live rows and 5091 dead rows; 622250 rows in sample, 205504753 estimated total rows
INFO:  "bigtable_y2019": scanned 520159 of 8911886 pages, containing 10017582 live rows and 6148 dead rows; 520159 rows in sample, 171631268 estimated total rows
INFO:  "bigtable_y2020": scanned 357591 of 6126616 pages, containing 7031238 live rows and 1534 dead rows; 357591 rows in sample, 120466385 estimated total rows
INFO:  analyzing "public.bigtable_y2018"
INFO:  "bigtable_y2018": scanned 1500000 of 10661013 pages, containing 28915115 live rows and 12589 dead rows; 1500000 rows in sample, 205509611 estimated total rows
INFO:  analyzing "public.bigtable_y2019"
INFO:  "bigtable_y2019": scanned 1500000 of 8911886 pages, containing 28888514 live rows and 17778 dead rows; 1500000 rows in sample, 171634096 estimated total rows
INFO:  analyzing "public.bigtable_y2020"
INFO:  "bigtable_y2020": scanned 1500000 of 6126616 pages, containing 29488967 live rows and 6330 dead rows; 1500000 rows in sample, 120445051 estimated total rows
INFO:  analyzing "public.bigtable_y2021"
INFO:  "bigtable_y2021": scanned 1 of 1 pages, containing 8 live rows and 0 dead rows; 8 rows in sample, 8 estimated total rows
ANALYZE

On the comment from Adrian:

SELECT
    ( SELECT SUM(x) FROM UNNEST(most_common_freqs) x ) AS frac_MCV,
    n_distinct,
    array_length(most_common_vals, 1) AS n_mcv,
    array_length(histogram_bounds, 1) AS n_hist,
    tablename
FROM pg_stats
WHERE schemaname = 'public'
  AND attname like 'instrumentid_ref';

 frac_MCV   | n_distinct | n_mcv | n_hist | tablename
------------+------------+-------+--------+----------------
 0.9205394  | 122160     | 2140  | 5001   | bigtable
 0.9203018  | 124312     | 1736  | 5001   | bigtable_y2018
 0.9258158  | 113846     | 2107  | 5001   | bigtable_y2020
 0.875      | 0.375      | 2     |        | bigtable_y2021
 0.92304045 | 118267     | 2204  | 5001   | bigtable_y2019

select count(distinct instrumentid_ref) from bigtable;
> 33 385 922

So bigtable's instrumentid_ref is underestimated by roughly 300X even with a statistics target of 5000 on the column. Pretty weird.

K

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Tuesday, June 23, 2020 3:07 PM, Klaudie Willis <[hidden email]> wrote:

On 2020-06-24 07:30:05 +0000, Klaudie Willis wrote:
> show default_statistics_target;
> > 500
>
> ALTER TABLE public.bigtable ALTER COLUMN instrumentid_ref SET STATISTICS 5000;
>
> Here is the output of "ANALYZE VERBOSE bigtable;":
>
> INFO:  analyzing "public.bigtables" inheritance tree
[...]
> INFO:  analyzing "public.bigtable_y2018"
> INFO:  "bigtable_y2018": scanned 1500000 of 10661013 pages, containing 28915115 live rows and 12589 dead rows; 1500000 rows in sample, 205509611 estimated total rows
> INFO:  analyzing "public.bigtable_y2019"
> INFO:  "bigtable_y2019": scanned 1500000 of 8911886 pages, containing 28888514 live rows and 17778 dead rows; 1500000 rows in sample, 171634096 estimated total rows
> INFO:  analyzing "public.bigtable_y2020"
> INFO:  "bigtable_y2020": scanned 1500000 of 6126616 pages, containing 29488967 live rows and 6330 dead rows; 1500000 rows in sample, 120445051 estimated total rows
> INFO:  analyzing "public.bigtable_y2021"
> INFO:  "bigtable_y2021": scanned 1 of 1 pages, containing 8 live rows and 0 dead rows; 8 rows in sample, 8 estimated total rows
[...]
>  frac_MCV   | n_distinct | n_mcv | n_hist | tablename
> ------------+------------+-------+--------+----------------
>  0.9205394  | 122160     | 2140  | 5001   | bigtable
>  0.9203018  | 124312     | 1736  | 5001   | bigtable_y2018
>  0.9258158  | 113846     | 2107  | 5001   | bigtable_y2020
>  0.875      | 0.375      | 2     |        | bigtable_y2021
>  0.92304045 | 118267     | 2204  | 5001   | bigtable_y2019

So ANALYZE estimates the number of distinct values in each partition as between 113846 and 124312. It can therefore have encountered at most that many different values in its sample of 1.5 million rows per partition, which means that it must have encountered each value about 12 or 13 times on average.

My guess is that there are relatively few (less than 120000) distinct values which make up the bulk (over 90 %) of these tables, and a lot (33 million) of values which are very rare. Is this guess correct?

        hp

--
Peter J. Holzer | [hidden email] | http://www.hjp.at/
"Story must make more sense than reality."
    -- Charles Stross, "Creative writing challenge!"
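One way to check that guess directly, although it is an expensive full aggregation on a table this size (a sketch; table and column names taken from the earlier posts):

-- distribution of value frequencies: how many ids occur once, twice, three times, ...
SELECT occurrences, count(*) AS num_ids
FROM (
    SELECT instrumentid_ref, count(*) AS occurrences
    FROM bigtable
    GROUP BY instrumentid_ref
) per_id
GROUP BY occurrences
ORDER BY occurrences;

If the guess is right, this should show tens of millions of ids with only a handful of occurrences each, and a small number of ids with very large counts.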
[Please keep replies on the list]
On 2020-06-24 11:02:22 +0000, Klaudie Willis wrote:
> Holzer, thanks for your feedback. Yes, your guess is very good. The
> data consists of millions of instruments that occur in a handful of cases
> (rare), and instruments that are very common.
>
> I am still a little surprised that it is so hard to sample data and
> estimate distinct values to within 1000X. My intuition misleads me into
> thinking this should be simpler, but I understand that this is no
> simple task at all. For your information, it seems like SQL Server
> 2016 makes the same "mistake" when calculating distincts (or 1/density,
> as they call it). I have similar data there that I have looked into,
> and it seems "fooled" as well.

Yes, estimating the number of distinct values from a relatively small sample is hard when you don't know the underlying distribution. It might be possible to analyze the sample to find the distribution and get a better estimate. But I'm not sure how useful that would really be: if a few values are very common and most are very rare, you are probably also much more likely to use the common values in a query, and for those you would massively underestimate their frequency if you had an accurate n_distinct value. That might be just as bad or even worse.

I'm wondering whether it would make sense to use a range instead of a single value in cost calculations. Then the planner could prefer plans which have reasonable cost over the whole range over plans which are good at one end of the range but horrible at the other. I'm afraid that would lead to combinatorial explosion, though. Even with a small number of ranges there would be a large number of possible combinations to calculate, so one would probably have to resort to Monte Carlo simulation or something like that.

        hp

--
Peter J. Holzer | [hidden email] | http://www.hjp.at/
"Story must make more sense than reality."
    -- Charles Stross, "Creative writing challenge!"
On Wed, Jun 24, 2020, 2:35 PM Peter J. Holzer <[hidden email]> wrote:
> Yes, estimating the number of distinct values from a relatively small
> sample is hard when you don't know the underlying distribution. It might
> be possible to analyze the sample to find the distribution and get a
> better estimate. But I'm not sure how useful that would really be: if
> a few values are very common and most are very rare, you are probably also
> much more likely to use the common values in a query, and for those you
> would massively underestimate their frequency if you had an accurate
> n_distinct value. That might be just as bad or even worse.

This would only be true for values that are "common" but not in the MCVs list, right?

If we could increase the sampling ratio beyond the hard-coded 300x to get a more representative sample, and use that to estimate ndistinct (and also the frequency of the most common values) but only actually store the 100 MCVs (or whatever the stats target is set to for the system or column), then the issue may be mitigated without increasing planning time because of stats that are larger than prudent. The "only" cost should be longer processing time when (auto)analyzing... plus, I suppose, the overhead of considering this potential new setting in all analyze cases.
I found another large deviation in one of my bridge tables. It is an (int, int) table of 900M rows where the B column contains 2.7M distinct values; however, pg_stats claims it to be only 10,400. These numbers are with a statistics target of 500. I'm not sure it really matters for the planner for the queries I run, but it makes me a little nervous :)

Also, is it just my data samples, or is n_distinct underestimated by a large ratio far more often than it is overestimated?

K
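For reference, this is the kind of comparison I ran (a sketch; the table and column names here are placeholders, not my real schema):

-- actual number of distinct values in the second column (~2.7M in my case)
SELECT count(DISTINCT b) FROM bridge_table;

-- what ANALYZE recorded (~10,400 in my case)
SELECT tablename, attname, n_distinct
FROM pg_stats
WHERE schemaname = 'public' AND tablename = 'bridge_table' AND attname = 'b';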

Hello,
As far as I know, increasing default_statistics_target will not help. [1]
You can hardcode the fraction of distinct values:

ALTER TABLE bigtable ALTER COLUMN instrument_ref SET ( n_distinct = -0.067 );
/* a negative value is interpreted as a fraction of the row count:
   33385922 / 500000000 is roughly 0.067 */

[1] https://www.postgresql.org/message-id/4136ffa0812111823u645b6ec9wdca60b3da4b00499%40mail.gmail.com

--
Pavel Luzanov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
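Note that (if I read the docs right) the override only takes effect the next time the table is analyzed, and it can be removed again later. A sketch:

-- make the override take effect
ANALYZE bigtable;

-- later, go back to the sampled estimate
ALTER TABLE bigtable ALTER COLUMN instrument_ref RESET ( n_distinct );
ANALYZE bigtable;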
On Thu, Jun 25, 2020 at 7:27 AM Pavel Luzanov <[hidden email]> wrote:
Thanks for sharing. Very interesting read. If anyone has a reference to the papers alluded to, that would be appreciated. I had forgotten about the option to set negative values.
On the topic of n_distinct: I am not sure whether I am misinterpreting something or it is a bug (probably the former), but when using partitions, is n_distinct_inherited not supposed to propagate to the child partitions? It does not seem to do so (yes, I ran ANALYZE after setting the variable). I had to set n_distinct separately on all partitions to get the desired planner behavior, but I thought setting n_distinct_inherited was supposed to avoid having to set each partition manually.

K
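For clarity, roughly the two variants I tried (a sketch; the fraction is the ratio discussed earlier in the thread):

-- set on the partitioned parent; I expected this to cover the whole inheritance tree
ALTER TABLE bigtable
    ALTER COLUMN instrumentid_ref SET ( n_distinct_inherited = -0.067 );

-- what I ended up doing in addition: setting it per partition
ALTER TABLE bigtable_y2018 ALTER COLUMN instrumentid_ref SET ( n_distinct = -0.067 );
ALTER TABLE bigtable_y2019 ALTER COLUMN instrumentid_ref SET ( n_distinct = -0.067 );
ALTER TABLE bigtable_y2020 ALTER COLUMN instrumentid_ref SET ( n_distinct = -0.067 );

ANALYZE bigtable;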
I follow up with a dbfiddle for this problem:

best regards
K
On 2020-06-24 16:27:35 -0600, Michael Lewis wrote:
> On Wed, Jun 24, 2020, 2:35 PM Peter J. Holzer <[hidden email]> wrote:
> > Yes, estimating the number of distinct values from a relatively small
> > sample is hard when you don't know the underlying distribution. It might
> > be possible to analyze the sample to find the distribution and get a
> > better estimate. But I'm not sure how useful that would really be: if
> > a few values are very common and most are very rare, you are probably also
> > much more likely to use the common values in a query, and for those you
> > would massively underestimate their frequency if you had an accurate
> > n_distinct value. That might be just as bad or even worse.
>
> This would only be true for values that are "common" but not in the MCVs list,
> right?

Yes, but the MCV list can hold only a limited number of entries, so many of the values people actually query for will be exactly that: "common but not in the MCVs list", even for a very biased distribution.

> If we could increase the sampling ratio beyond the hard coded 300x to get a
> more representative sample

I thought of that, but abandoned it, since I don't think a better estimate for n_distinct will help (see above for the reason). The problem is that the distribution is biased and the planner has no idea whether the value it is searching for is common or rare if it isn't in the MCV list.

Unless ...

As I understood Klaudie, the values are ids, and since ids have no inherent meaning, the common values are probably scattered randomly. But it might be possible to change that: group the ids by frequency. Ids < 1E12 occur at most 10 times, ids >= 1E12 and < 2E12 occur at most 100 times, and so on. This may mean that ids aren't stable over time, since they may change as their frequency changes; but if an id always changes by a multiple of 1E12, the last 12 decimal digits are stable. The advantage is that the planner can then use the histogram to get a pretty good estimate of how frequent a value is.

        hp

--
Peter J. Holzer | [hidden email] | http://www.hjp.at/
"Story must make more sense than reality."
    -- Charles Stross, "Creative writing challenge!"