Inaccurate (sometimes wildly so) row estimates for simple join

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Inaccurate (sometimes wildly so) row estimates for simple join

Greg Nolle
Hi,

I’m working on diagnosing an issue with a complex query on a large PG11 database we have. The planner is choosing a plan that takes 60+ seconds but if we force it to use a particular index then it takes only a couple of seconds. I’ve narrowed down what I think is the cause to a very simple join for which PG is underestimating the rows by a factor of 20. It then chooses to do a nested loop which is actually much slower than it thinks it would be based on the estimate.

I’ve managed to reproduce this underestimation issue using a simple data set that is very similar in distribution to what we have in our real database:

CREATE TABLE test_a (
a_id VARCHAR(255) PRIMARY KEY,
group_val VARCHAR(255) NOT NULL
);
INSERT INTO test_a (a_id, group_val) SELECT 'a_id'||s, 'group'||s FROM generate_series(1, 100) s;
INSERT INTO test_a (a_id, group_val) VALUES ('a_id101', 'group1');

CREATE TABLE test_b (
b_id VARCHAR(255) PRIMARY KEY,
a_id VARCHAR(255) NOT NULL,
CONSTRAINT test_bk_fk01 FOREIGN KEY (a_id) REFERENCES test_a (a_id)
);
INSERT INTO test_b (b_id, a_id) SELECT 'b_id'||s, 'a_id1' FROM generate_series(1, 1500) s;
INSERT INTO test_b (b_id, a_id) SELECT 'b_id'||s, 'a_id2' FROM generate_series(1501, 2000) s;

ANALYZE test_a;
ANALYZE test_b;

EXPLAIN ANALYZE SELECT * FROM test_b b JOIN test_a a ON a.a_id = b.a_id WHERE a.group_val = 'group1';

This gives an estimated 40 rows for the join when in fact there are 1500 rows. The crux seems to be that test_b does not have an even distribution for a_id values: it only has records for two of the values in the referenced table. This is how our real dataset is too and isn’t something we can really change.

I’ve read through the row estimation information here: https://www.postgresql.org/docs/11/row-estimation-examples.html and also read through the source code for the eqjoinsel_inner function. I can see how it’s calculating the estimate but I have no clue how it can be made accurate for this (seemingly) simple scenario.

Any hints or tips would be greatly appreciated.
Reply | Threaded
Open this post in threaded view
|

Re: Inaccurate (sometimes wildly so) row estimates for simple join

Michael Lewis
On Fri, May 22, 2020 at 7:27 AM Greg Nolle <[hidden email]> wrote:
The crux seems to be that test_b does not have an even distribution for a_id values: it only has records for two of the values in the referenced table. This is how our real dataset is too and isn’t something we can really change.

How does it perform with extended statistics created for these two columns?

Reply | Threaded
Open this post in threaded view
|

Re: Inaccurate (sometimes wildly so) row estimates for simple join

Greg Nolle
On 22 May 2020 at 15:30:42, Michael Lewis ([hidden email]) wrote:
On Fri, May 22, 2020 at 7:27 AM Greg Nolle <[hidden email]> wrote:
The crux seems to be that test_b does not have an even distribution for a_id values: it only has records for two of the values in the referenced table. This is how our real dataset is too and isn’t something we can really change.

How does it perform with extended statistics created for these two columns?

I forgot to mention that I had already tried that actually, and it sadly made no difference. I tried both dependencies and ndistinct modes and the estimates remained the same. Having read through the docs I think that’s to be expected since dependencies is designed to help when you have multiple predicates and ndistinct when you’re doing grouping. Neither of which is the case with my query.

For good measure I just tried mcv mode on PG12 with the test script I posted and that also made no difference.