Partitioning Optimizer Questions and Issues

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Partitioning Optimizer Questions and Issues

keith anderson
I am using Postgres on a large system (recording approximately 20million transactions per day). We use partitioning by date to assist with both vacuum processing time and to archive old data. At the core of the system are records in 2 different tables detailing different types of activity for monetary transactions (e.g. money in and money out) -> a single transaction has entries in both tables, so to retrieve all details for a single transaction we need to join the 2 tables.

The use of partitioning however has a significant impact on the performance of retrieving this data. Being relatively new to Postgres I wanted to share my findings and understand how others address them. We run postgres version 9.6 on CentOS, but the same behaviour is apparent in postgres 10.6. The test case outputs are from version 10.6 running on my Ubuntu machine with default postgres configuration.

Below is an example script to populate the test data:

=======================
drop table if exists tablea cascade;
drop table if exists tableb cascade;

CREATE TABLE tablea (
    id            serial,
    reference     int not null,
    created       date not null
)  PARTITION BY RANGE (created);


CREATE TABLE tablea_part1 PARTITION OF tablea
    FOR VALUES FROM ('2018-01-01') TO ('2018-01-02');
CREATE TABLE tablea_part2 PARTITION OF tablea
    FOR VALUES FROM ('2018-01-02') TO ('2018-01-03');
CREATE TABLE tablea_part3 PARTITION OF tablea
    FOR VALUES FROM ('2018-01-03') TO ('2018-01-04');
CREATE TABLE tablea_part4 PARTITION OF tablea
    FOR VALUES FROM ('2018-01-04') TO ('2018-01-05');
CREATE TABLE tablea_part5 PARTITION OF tablea
    FOR VALUES FROM ('2018-01-05') TO ('2018-01-06');

CREATE INDEX tablea_id_1 ON tablea_part1 (id);
CREATE INDEX tablea_id_2 ON tablea_part2 (id);
CREATE INDEX tablea_id_3 ON tablea_part3 (id);
CREATE INDEX tablea_id_4 ON tablea_part4 (id);
CREATE INDEX tablea_id_5 ON tablea_part5 (id);
CREATE INDEX tablea_reference_1 ON tablea_part1 (reference);
CREATE INDEX tablea_reference_2 ON tablea_part2 (reference);
CREATE INDEX tablea_reference_3 ON tablea_part3 (reference);
CREATE INDEX tablea_reference_4 ON tablea_part4 (reference);
CREATE INDEX tablea_reference_5 ON tablea_part5 (reference);
CREATE INDEX tablea_created_1 ON tablea_part1 (created);
CREATE INDEX tablea_created_2 ON tablea_part2 (created);
CREATE INDEX tablea_created_3 ON tablea_part3 (created);
CREATE INDEX tablea_created_4 ON tablea_part4 (created);
CREATE INDEX tablea_created_5 ON tablea_part5 (created);
alter table tablea_part1 add CHECK ( created >= DATE '2018-01-01' AND created < DATE '2018-01-02');
alter table tablea_part2 add CHECK ( created >= DATE '2018-01-02' AND created < DATE '2018-01-03');
alter table tablea_part3 add CHECK ( created >= DATE '2018-01-03' AND created < DATE '2018-01-04');
alter table tablea_part4 add CHECK ( created >= DATE '2018-01-04' AND created < DATE '2018-01-05');
alter table tablea_part5 add CHECK ( created >= DATE '2018-01-05' AND created < DATE '2018-01-06');

create or replace function populate_tablea()
 RETURNS integer AS
$BODY$                                                             
DECLARE                                      
  i integer;                                   
  v_created date;
BEGIN                                           
    i := 0;                                       
    WHILE (i < 50000)
    loop
        i := i + 1;                                        
        IF (mod(i,5) = 1) THEN
            v_created = '2018-01-01';
        ELSIF (mod(i,5) = 2) THEN
            v_created = '2018-01-02';
         ELSIF (mod(i,5) = 3) THEN
            v_created = '2018-01-03';
        ELSIF (mod(i,5) = 4) THEN
            v_created = '2018-01-04';
        ELSIF (mod(i,5) = 0) THEN
            v_created = '2018-01-05';
        END IF;                            
        insert into tablea values (i, i, v_created);

    end loop;                                                                                   
           
    RETURN i;
END;
$BODY$
LANGUAGE plpgsql VOLATILE
    COST 100;

CREATE TABLE tableb(
    id         serial,
    reference  int not null,
    created    date not null
) PARTITION BY RANGE (created);


CREATE TABLE tableb_part1 PARTITION OF tableb
    FOR VALUES FROM ('2018-01-01') TO ('2018-01-02');
CREATE TABLE tableb_part2 PARTITION OF tableb
    FOR VALUES FROM ('2018-01-02') TO ('2018-01-03');
CREATE TABLE tableb_part3 PARTITION OF tableb
    FOR VALUES FROM ('2018-01-03') TO ('2018-01-04');
CREATE TABLE tableb_part4 PARTITION OF tableb
    FOR VALUES FROM ('2018-01-04') TO ('2018-01-05');
CREATE TABLE tableb_part5 PARTITION OF tableb
    FOR VALUES FROM ('2018-01-05') TO ('2018-01-06');


CREATE INDEX tableb_id_1 ON tableb_part1 (id);
CREATE INDEX tableb_id_2 ON tableb_part2 (id);
CREATE INDEX tableb_id_3 ON tableb_part3 (id);
CREATE INDEX tableb_id_4 ON tableb_part4 (id);
CREATE INDEX tableb_id_5 ON tableb_part5 (id);
CREATE INDEX tableb_reference_1 ON tableb_part1 (reference);
CREATE INDEX tableb_reference_2 ON tableb_part2 (reference);
CREATE INDEX tableb_reference_3 ON tableb_part3 (reference);
CREATE INDEX tableb_reference_4 ON tableb_part4 (reference);
CREATE INDEX tableb_reference_5 ON tableb_part5 (reference);
CREATE INDEX tableb_created_1 ON tableb_part1 (created);
CREATE INDEX tableb_created_2 ON tableb_part2 (created);
CREATE INDEX tableb_created_3 ON tableb_part3 (created);
CREATE INDEX tableb_created_4 ON tableb_part4 (created);
CREATE INDEX tableb_created_5 ON tableb_part5 (created);
alter table tableb_part1 add CHECK ( created >= DATE '2018-01-01' AND created < DATE '2018-01-02');
alter table tableb_part2 add CHECK ( created >= DATE '2018-01-02' AND created < DATE '2018-01-03');
alter table tableb_part3 add CHECK ( created >= DATE '2018-01-03' AND created < DATE '2018-01-04');
alter table tableb_part4 add CHECK ( created >= DATE '2018-01-04' AND created < DATE '2018-01-05');
alter table tableb_part5 add CHECK ( created >= DATE '2018-01-05' AND created < DATE '2018-01-06');


create or replace function populate_tableb()
 RETURNS integer AS
$BODY$
DECLARE
  i integer;
  v_created date;
BEGIN
    i := 0;
    WHILE (i < 50000)
    loop
        i := i + 1;
        IF (mod(i,5) = 0) THEN
            v_created = '2018-01-01';
        ELSIF (mod(i,5) = 1) THEN
            v_created = '2018-01-02';
         ELSIF (mod(i,5) = 2) THEN
            v_created = '2018-01-03';
        ELSIF (mod(i,5) = 3) THEN
            v_created = '2018-01-04';
        ELSIF (mod(i,5) = 4) THEN
            v_created = '2018-01-05';    
        END IF;
        insert into tableb values (i, i, v_created);
    end loop;
    RETURN i;
END;
$BODY$
LANGUAGE plpgsql VOLATILE
    COST 100;

select populate_tablea();
select populate_tableb();
vacuum analyze;
==================================


So it creates 2 tables, both with 5 partitions (using range partitioning on the created column). Each partition has 10000 rows in it.

Below are some example queries I have run, the outputs of explain analyze for each and notes on each of my findings/questions:


============

-- NOTICE IN THE BELOW THAT WE USE A SINGLE ID (ESSENTIALLY THE PRIMARY KEY) BUT WE HAVE ESTIMATED 5 ROWS RETURNED. WE SEEM TO BE BASING
-- ON PARTITION STATS ONLY AND SUMMING. SO EACH PARTITION ASSUMES ID IS UNIQUE, BUT WITH 5 PARTITIONS, THE TOTAL ROWS IS 5.

explain analyze select * from tablea where id = 101;
                                                           QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.29..41.51 rows=5 width=12) (actual time=0.027..0.066 rows=1 loops=1)
   ->  Index Scan using tablea_id_1 on tablea_part1  (cost=0.29..8.30 rows=1 width=12) (actual time=0.026..0.029 rows=1 loops=1)
         Index Cond: (id = 101)
   ->  Index Scan using tablea_id_2 on tablea_part2  (cost=0.29..8.30 rows=1 width=12) (actual time=0.010..0.010 rows=0 loops=1)
         Index Cond: (id = 101)
   ->  Index Scan using tablea_id_3 on tablea_part3  (cost=0.29..8.30 rows=1 width=12) (actual time=0.008..0.009 rows=0 loops=1)
         Index Cond: (id = 101)
   ->  Index Scan using tablea_id_4 on tablea_part4  (cost=0.29..8.30 rows=1 width=12) (actual time=0.008..0.008 rows=0 loops=1)
         Index Cond: (id = 101)
   ->  Index Scan using tablea_id_5 on tablea_part5  (cost=0.29..8.30 rows=1 width=12) (actual time=0.007..0.007 rows=0 loops=1)
         Index Cond: (id = 101)
 Planning time: 0.875 ms
 Execution time: 0.176 ms


============

-- IF WE USE AN IN WITH 10 ID'S WE ESTIMATE 50 ROWS RETURNED INSTEAD OF THE ACTUAL 10. AGAIN SEEMS TO BE AGGREGATING PARTITION STATISTICS.

explain analyze select * from tablea where id in (101,102,103,104,105,106,107,108,109,110);    
                                                            QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.29..215.13 rows=50 width=12) (actual time=0.040..0.283 rows=10 loops=1)
   ->  Index Scan using tablea_id_1 on tablea_part1  (cost=0.29..43.03 rows=10 width=12) (actual time=0.039..0.079 rows=2 loops=1)
         Index Cond: (id = ANY ('{101,102,103,104,105,106,107,108,109,110}'::integer[]))
   ->  Index Scan using tablea_id_2 on tablea_part2  (cost=0.29..43.03 rows=10 width=12) (actual time=0.021..0.052 rows=2 loops=1)
         Index Cond: (id = ANY ('{101,102,103,104,105,106,107,108,109,110}'::integer[]))
   ->  Index Scan using tablea_id_3 on tablea_part3  (cost=0.29..43.03 rows=10 width=12) (actual time=0.022..0.048 rows=2 loops=1)
         Index Cond: (id = ANY ('{101,102,103,104,105,106,107,108,109,110}'::integer[]))
   ->  Index Scan using tablea_id_4 on tablea_part4  (cost=0.29..43.03 rows=10 width=12) (actual time=0.026..0.049 rows=2 loops=1)
         Index Cond: (id = ANY ('{101,102,103,104,105,106,107,108,109,110}'::integer[]))
   ->  Index Scan using tablea_id_5 on tablea_part5  (cost=0.29..43.03 rows=10 width=12) (actual time=0.028..0.048 rows=2 loops=1)
         Index Cond: (id = ANY ('{101,102,103,104,105,106,107,108,109,110}'::integer[]))
 Planning time: 1.526 ms
 Execution time: 0.397 ms
                                                

===========

-- IF WE USE A RANGE INSTEAD OF INDIVIDUAL ID'S, WE GET ESTIMATED 10 ROWS RETURNED (GOOD). 
-- IS THIS USING THE GLOBAL TABLE STATISTICS INSTEAD? WHY DOES IT DIFFER FROM DISTINCT ID'S?

explain analyze select * from tablea where id >= 101 and id <= 110;
                                                           QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.29..41.62 rows=10 width=12) (actual time=0.022..0.074 rows=10 loops=1)
   ->  Index Scan using tablea_id_1 on tablea_part1  (cost=0.29..8.32 rows=2 width=12) (actual time=0.021..0.026 rows=2 loops=1)
         Index Cond: ((id >= 101) AND (id <= 110))
   ->  Index Scan using tablea_id_2 on tablea_part2  (cost=0.29..8.32 rows=2 width=12) (actual time=0.010..0.012 rows=2 loops=1)
         Index Cond: ((id >= 101) AND (id <= 110))
   ->  Index Scan using tablea_id_3 on tablea_part3  (cost=0.29..8.32 rows=2 width=12) (actual time=0.009..0.010 rows=2 loops=1)
         Index Cond: ((id >= 101) AND (id <= 110))
   ->  Index Scan using tablea_id_4 on tablea_part4  (cost=0.29..8.32 rows=2 width=12) (actual time=0.009..0.010 rows=2 loops=1)
         Index Cond: ((id >= 101) AND (id <= 110))
   ->  Index Scan using tablea_id_5 on tablea_part5  (cost=0.29..8.32 rows=2 width=12) (actual time=0.008..0.010 rows=2 loops=1)
         Index Cond: ((id >= 101) AND (id <= 110))
 Planning time: 1.845 ms
 Execution time: 0.196 ms

==========

-- HERE ARE THE TABLE STATS, SHOWING THAT POSTGRES IS AWARE THAT ID'S ARE GLOBALLY UNIQUE IN THE TABLEA TABLE. CAN IT USE THEM?

select tablename,n_distinct from pg_stats where tablename like '%tablea%' and attname = 'id';
  tablename   | n_distinct 
--------------+------------
 tablea_part3 |         -1
 tablea       |         -1
 tablea_part2 |         -1
 tablea_part4 |         -1
 tablea_part5 |         -1
 tablea_part1 |         -1

==========

-- WHEN I JOIN, THE NUMBER OF ROWS MULTIPLIES. NOTICE THE SEQUENTIAL SCAN OF THE TABLEB PARTITION. THIS IS CAUSED BY THE OVERESTIMATION
-- OF ROWS RETURNED BY TABLEA

explain analyze select * from tablea a, tableb b where a.reference = b.reference and a.id in (101,102,103,104,105,106,107,108,109,110);
                                                                    QUERY PLAN                                                                     
---------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=215.75..1178.75 rows=50 width=24) (actual time=0.386..46.845 rows=10 loops=1)
   Hash Cond: (b.reference = a.reference)
   ->  Append  (cost=0.00..775.00 rows=50000 width=12) (actual time=0.024..26.527 rows=50000 loops=1)
         ->  Seq Scan on tableb_part1 b  (cost=0.00..155.00 rows=10000 width=12) (actual time=0.022..4.006 rows=10000 loops=1)
         ->  Seq Scan on tableb_part2 b_1  (cost=0.00..155.00 rows=10000 width=12) (actual time=0.023..4.039 rows=10000 loops=1)
         ->  Seq Scan on tableb_part3 b_2  (cost=0.00..155.00 rows=10000 width=12) (actual time=0.023..3.247 rows=10000 loops=1)
         ->  Seq Scan on tableb_part4 b_3  (cost=0.00..155.00 rows=10000 width=12) (actual time=0.016..1.421 rows=10000 loops=1)
         ->  Seq Scan on tableb_part5 b_4  (cost=0.00..155.00 rows=10000 width=12) (actual time=0.007..1.113 rows=10000 loops=1)
   ->  Hash  (cost=215.13..215.13 rows=50 width=12) (actual time=0.316..0.316 rows=10 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Append  (cost=0.29..215.13 rows=50 width=12) (actual time=0.034..0.301 rows=10 loops=1)
               ->  Index Scan using tablea_id_1 on tablea_part1 a  (cost=0.29..43.03 rows=10 width=12) (actual time=0.033..0.074 rows=2 loops=1)
                     Index Cond: (id = ANY ('{101,102,103,104,105,106,107,108,109,110}'::integer[]))
               ->  Index Scan using tablea_id_2 on tablea_part2 a_1  (cost=0.29..43.03 rows=10 width=12) (actual time=0.020..0.051 rows=2 loops=1)
                     Index Cond: (id = ANY ('{101,102,103,104,105,106,107,108,109,110}'::integer[]))
               ->  Index Scan using tablea_id_3 on tablea_part3 a_2  (cost=0.29..43.03 rows=10 width=12) (actual time=0.021..0.048 rows=2 loops=1)
                     Index Cond: (id = ANY ('{101,102,103,104,105,106,107,108,109,110}'::integer[]))
               ->  Index Scan using tablea_id_4 on tablea_part4 a_3  (cost=0.29..43.03 rows=10 width=12) (actual time=0.025..0.072 rows=2 loops=1)
                     Index Cond: (id = ANY ('{101,102,103,104,105,106,107,108,109,110}'::integer[]))
               ->  Index Scan using tablea_id_5 on tablea_part5 a_4  (cost=0.29..43.03 rows=10 width=12) (actual time=0.028..0.049 rows=2 loops=1)
                     Index Cond: (id = ANY ('{101,102,103,104,105,106,107,108,109,110}'::integer[]))
 Planning time: 2.642 ms
 Execution time: 47.005 ms

===========

-- REPEAT, BUT USING A RANGE QUERY. NO LONGER SEQUENTIAL SCAN AS THE TABLEA ROW ESTIMATE DROPS TO 10 FROM 50. 
-- QUERY EXECUTION TIME DROPS FROM 47MS TO 0.7MS

explain analyze select * from tablea a, tableb b where a.reference = b.reference and a.id >= 101 and a. id <= 110;
                                                                    QUERY PLAN                                                                     
---------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.57..437.25 rows=10 width=24) (actual time=0.063..0.543 rows=10 loops=1)
   ->  Append  (cost=0.29..41.62 rows=10 width=12) (actual time=0.027..0.091 rows=10 loops=1)
         ->  Index Scan using tablea_id_1 on tablea_part1 a  (cost=0.29..8.32 rows=2 width=12) (actual time=0.026..0.031 rows=2 loops=1)
               Index Cond: ((id >= 101) AND (id <= 110))
         ->  Index Scan using tablea_id_2 on tablea_part2 a_1  (cost=0.29..8.32 rows=2 width=12) (actual time=0.010..0.013 rows=2 loops=1)
               Index Cond: ((id >= 101) AND (id <= 110))
         ->  Index Scan using tablea_id_3 on tablea_part3 a_2  (cost=0.29..8.32 rows=2 width=12) (actual time=0.009..0.011 rows=2 loops=1)
               Index Cond: ((id >= 101) AND (id <= 110))
         ->  Index Scan using tablea_id_4 on tablea_part4 a_3  (cost=0.29..8.32 rows=2 width=12) (actual time=0.009..0.012 rows=2 loops=1)
               Index Cond: ((id >= 101) AND (id <= 110))
         ->  Index Scan using tablea_id_5 on tablea_part5 a_4  (cost=0.29..8.32 rows=2 width=12) (actual time=0.015..0.018 rows=2 loops=1)
               Index Cond: ((id >= 101) AND (id <= 110))
   ->  Append  (cost=0.29..39.51 rows=5 width=12) (actual time=0.021..0.041 rows=1 loops=10)
         ->  Index Scan using tableb_reference_1 on tableb_part1 b  (cost=0.29..7.90 rows=1 width=12) (actual time=0.007..0.007 rows=0 loops=10)
               Index Cond: (reference = a.reference)
         ->  Index Scan using tableb_reference_2 on tableb_part2 b_1  (cost=0.29..7.90 rows=1 width=12) (actual time=0.006..0.006 rows=0 loops=10)
               Index Cond: (reference = a.reference)
         ->  Index Scan using tableb_reference_3 on tableb_part3 b_2  (cost=0.29..7.90 rows=1 width=12) (actual time=0.009..0.010 rows=0 loops=10)
               Index Cond: (reference = a.reference)
         ->  Index Scan using tableb_reference_4 on tableb_part4 b_3  (cost=0.29..7.90 rows=1 width=12) (actual time=0.006..0.007 rows=0 loops=10)
               Index Cond: (reference = a.reference)
         ->  Index Scan using tableb_reference_5 on tableb_part5 b_4  (cost=0.29..7.90 rows=1 width=12) (actual time=0.006..0.006 rows=0 loops=10)
               Index Cond: (reference = a.reference)
 Planning time: 3.629 ms
 Execution time: 0.762 ms

===========


So to summarise the findings/questions from above:

- It seems like the Postgres optimizer sometimes uses the partition level statistics, and sometimes the global table level statistics? Or is it using something else?
- With partitioning tables with unique identifier and retrieving explicitly on those identifiers, at present the optimizer will always understimate the selectivity and overestimate the rows returned. This inaccuracy increases in proportion to the number of partitions.
- As a result, when joining to other tables, you are liable to hitting sequential scans. This becomes more likely as you have more partitions or if join to more partitioned tables (note I am aware I could try and tune random_page_cost to try and prevent this).
- To me in the examples queries described above, it makes sense to use the partition statistics for the partition level access strategy, but the global statistics when estimating the actual rows returned by all the individual partition queries. Is there a reason not to do this? Or do others believe the optimizer is doing the right thing here?

And then some general questions:

- How do other people use partitioning but without a significant performance disadvantage on reading the data? Is there something else I should be doing here to achieve the same thing without the overhead? At present my reads have increased optimization cost (as it needs to optimize access to each partition) and also execution cost (access the index on every partition). Even without the optimizer issues described above, the cost of reading simple data is extremely high relative to non-partitioned data (unless you use the partition key as a filter for each table to eliminate those partitions).
- Is there any chance/plan to add global indexes to postgres? If so would that impact significantly the cost of the partition drop e.g. to clean up the index.

Thanks in advance for any feedback/support,

Keith


Reply | Threaded
Open this post in threaded view
|

Re: Partitioning Optimizer Questions and Issues

Justin Pryzby
On Fri, Feb 08, 2019 at 11:13:51AM +0000, keith anderson wrote:
> So to summarise the findings/questions from above:
> - It seems like the Postgres optimizer sometimes uses the partition level statistics, and sometimes the global table level statistics? Or is it using something else?- With partitioning tables with unique identifier and retrieving explicitly on those identifiers, at present the optimizer will always understimate the selectivity and overestimate the rows returned. This inaccuracy increases in proportion to the number of partitions.- As a result, when joining to other tables, you are liable to hitting sequential scans. This becomes more likely as you have more partitions or if join to more partitioned tables (note I am aware I could try and tune random_page_cost to try and prevent this).- To me in the examples queries described above, it makes sense to use the partition statistics for the partition level access strategy, but the global statistics when estimating the actual rows returned by all the individual partition queries. Is there a reason not to do this? Or do others believe the optimizer is doing the right thing here?
> And then some general questions:
> - How do other people use partitioning but without a significant performance disadvantage on reading the data? Is there something else I should be doing here to achieve the same thing without the overhead? At present my reads have increased optimization cost (as it needs to optimize access to each partition) and also execution cost (access the index on every partition). Even without the optimizer issues described above, the cost of reading simple data is extremely high relative to non-partitioned data (unless you use the partition key as a filter for each table to eliminate those partitions).- Is there any chance/plan to add global indexes to postgres? If so would that impact significantly the cost of the partition drop e.g. to clean up the index.
> Thanks in advance for any feedback/support,

An equality or IN() query will use the pg_stats most-common-values list,
whereas a range query will use the histogram.

The tables probably doesn't have complete MCV list.  By default, that's limited
to 100 entries.  Since the maximum allowed by ALTER..SET STATISTICS is 10k, I
don't think it'll help to change it (at least for your production use case).
Each partition's rowcount appears to be estimated from its ndistinct, and not
from its content, so each is estimated as having about the same rowcount.

Your partitions are sharing a sequence for their ID column, which causes the
DEFAULT IDs to be unique...but their global uniqueness isn't enforced nor
guaranteed.

Note, in postgres11, it's possible to create an index on the parent table.
It's NOT a global index, but it can be unique if it includes the partition key.
I don't know how closely your example describes your real use case, but I don't
think that helps with your example; it doesn't seems useful to partition on a
serial column.

You seem to be adding unnecessary CHECK constraints that duplicate the
partition bounds.  Note, it's still useful to include CHECK constraints on key
column if you're planning on DETACHing and re-ATTACHing the partitions, in
order to avoid seqscan to verify tuples don't violate specified bounds.

You might need to rethink your partitioning scheme - you should choose one that
causes performance to improve, and probably naturally includes the partition
key in most queries.

Perhaps you'd use 2 levels of partitioning: a RANGE partition by date, which
allows for archiving, and a HASH partition by ID, which allows for partition
pruning.  Note that it's also possible to partition on multiple keys, like
RANGE(id,date) - I don't think that's useful here, though.  PG11 also allows a
"default" partition.

Or perhaps you could partition by RANGE(date) but add CHECK constraints on ID,
after the table is fully populated, to optimize queries by allowing for
partition pruning.

Or you could maybe change the ID column to include the timestamp (like BIND
zonesfiles YYYYMMDDNNNNNNNN).  You'd set a bigint sequence on each partition's
ID as default to the beginning of the month.  A bigint is enough to handle
5*10^4 times your volume: 20190401000020111222.  (I think this is trying to be
unnecessarily clever, unless there's some reason the other two ideas don't
work.)

Justin

Reply | Threaded
Open this post in threaded view
|

Re: Partitioning Optimizer Questions and Issues

keith anderson
Thanks for the feedback Justin.

You are right, the most-common-values list is empty for my test case and so it is using n_distinct for the 'id IN()' scenario.
And I can see that with the pg_class.reltuples and the pg_stats.histogram_bounds values how the optimizer can conclude with my range query that only 1 in 5 entries in my range query are in each individual partition.

However, I can also see that the pg_stats.n_distinct value for tablea shows -1, as do all the individual child partitions. In my opinion it makes sense for the optimizer when using n_distinct on partitioned tables to use the n_distinct value of the parent table level when estimating row counts rather than a sum of the partition level statistics. Or can someone come up with a good reason to avoid this? 

A couple of examples of different data in a partitioned table:

- Unique identifier -> if providing a single value in a query -> using n_distinct from parent will estimate 1, using child tables will be 1 * (number of partitions). Use of parent table would be correct.
- Date of activity, with  1000 records per day -> if providing a single day to the query -> using n_distinct from parent would show 1000 rows returned, using child tables will be 1000 * (number of partitions). Use of parent table n_distinct is correct.

Perhaps when querying on columns that are part of the partition logic you could use the partition level stats, but I think the vast majority of the time, using the parent statistics would be much more reliable/accurate than summing across partitions.

In terms of the partition strategy, I agree that it should be done with a view to helping performance improve. I will look into more detail at your suggestions, but in general it is very hard to use effectively as there are competing priorities:

- I would like to not have to manage massive numbers of partitions
- I would like to be able to archive data easily using date (a big plus point to the existing date partitioning strategy)
- It is hard in most cases to come up with a partition strategy that allows for partition elimination e.g. consider a common 'transaction record' table with a primary key, an account identifier, and a date -> it is natural to want to be able to query on any one of these, but as things stand it cannot be achieved performantly with partitioning.

Global index support feels like it has potential to resolve many of the issues I have with partitioning (beyond the optimizer concern above). I assume this has been discussed and rejected though by the community?

I've attached as a file the original test script.

Keith



On Friday, 8 February 2019, 13:05:04 GMT, Justin Pryzby <[hidden email]> wrote:


On Fri, Feb 08, 2019 at 11:13:51AM +0000, keith anderson wrote:

> So to summarise the findings/questions from above:
> - It seems like the Postgres optimizer sometimes uses the partition level statistics, and sometimes the global table level statistics? Or is it using something else?- With partitioning tables with unique identifier and retrieving explicitly on those identifiers, at present the optimizer will always understimate the selectivity and overestimate the rows returned. This inaccuracy increases in proportion to the number of partitions.- As a result, when joining to other tables, you are liable to hitting sequential scans. This becomes more likely as you have more partitions or if join to more partitioned tables (note I am aware I could try and tune random_page_cost to try and prevent this).- To me in the examples queries described above, it makes sense to use the partition statistics for the partition level access strategy, but the global statistics when estimating the actual rows returned by all the individual partition queries. Is there a reason not to do this? Or do others believe the optimizer is doing the right thing here?
> And then some general questions:
> - How do other people use partitioning but without a significant performance disadvantage on reading the data? Is there something else I should be doing here to achieve the same thing without the overhead? At present my reads have increased optimization cost (as it needs to optimize access to each partition) and also execution cost (access the index on every partition). Even without the optimizer issues described above, the cost of reading simple data is extremely high relative to non-partitioned data (unless you use the partition key as a filter for each table to eliminate those partitions).- Is there any chance/plan to add global indexes to postgres? If so would that impact significantly the cost of the partition drop e.g. to clean up the index.
> Thanks in advance for any feedback/support,


An equality or IN() query will use the pg_stats most-common-values list,
whereas a range query will use the histogram.

The tables probably doesn't have complete MCV list.  By default, that's limited
to 100 entries.  Since the maximum allowed by ALTER..SET STATISTICS is 10k, I
don't think it'll help to change it (at least for your production use case).
Each partition's rowcount appears to be estimated from its ndistinct, and not
from its content, so each is estimated as having about the same rowcount.

Your partitions are sharing a sequence for their ID column, which causes the
DEFAULT IDs to be unique...but their global uniqueness isn't enforced nor
guaranteed.

Note, in postgres11, it's possible to create an index on the parent table.
It's NOT a global index, but it can be unique if it includes the partition key.
I don't know how closely your example describes your real use case, but I don't
think that helps with your example; it doesn't seems useful to partition on a
serial column.

You seem to be adding unnecessary CHECK constraints that duplicate the
partition bounds.  Note, it's still useful to include CHECK constraints on key
column if you're planning on DETACHing and re-ATTACHing the partitions, in
order to avoid seqscan to verify tuples don't violate specified bounds.

You might need to rethink your partitioning scheme - you should choose one that
causes performance to improve, and probably naturally includes the partition
key in most queries.

Perhaps you'd use 2 levels of partitioning: a RANGE partition by date, which
allows for archiving, and a HASH partition by ID, which allows for partition
pruning.  Note that it's also possible to partition on multiple keys, like
RANGE(id,date) - I don't think that's useful here, though.  PG11 also allows a
"default" partition.

Or perhaps you could partition by RANGE(date) but add CHECK constraints on ID,
after the table is fully populated, to optimize queries by allowing for
partition pruning.

Or you could maybe change the ID column to include the timestamp (like BIND
zonesfiles YYYYMMDDNNNNNNNN).  You'd set a bigint sequence on each partition's
ID as default to the beginning of the month.  A bigint is enough to handle
5*10^4 times your volume: 20190401000020111222.  (I think this is trying to be
unnecessarily clever, unless there's some reason the other two ideas don't
work.)

Justin


partitionExample.sql (7K) Download Attachment