Ordered Partitioned Table Scans

Ordered Partitioned Table Scans

David Rowley-3
RANGE partitioning is quite a common way to partition time-series
data, and such tables tend to grow fairly large.  Since we always
store the partitions of a RANGE partitioned table in the PartitionDesc
in ascending range order, I thought it might be useful to make use of
this: when the required pathkeys match the partition order, we could
use an Append node instead of a MergeAppend, since the MergeAppend
would just exhaust each subplan one at a time, in order, anyway.

It does not seem very hard to implement this and it does not add much
in the way of additional processing to the planner.

Performance wise it seems to give a good boost to getting sorted
results from a partitioned table. I performed a quick test just on my
laptop with:

Setup:
CREATE TABLE partbench (id BIGINT NOT NULL, i1 INT NOT NULL, i2 INT
NOT NULL, i3 INT NOT NULL, i4 INT NOT NULL, i5 INT NOT NULL) PARTITION
BY RANGE (id);
select 'CREATE TABLE partbench' || x::text || ' PARTITION OF partbench
FOR VALUES FROM (' || (x*100000)::text || ') TO (' ||
((x+1)*100000)::text || ');' from generate_Series(0,299) x;
\gexec
\o
INSERT INTO partbench SELECT x,1,2,3,4,5 from generate_Series(0,29999999) x;
create index on partbench (id);
vacuum analyze;

Test:
select * from partbench order by id limit 1 offset 29999999;

Results Patched:

Time: 4234.807 ms (00:04.235)
Time: 4237.928 ms (00:04.238)
Time: 4241.289 ms (00:04.241)
Time: 4234.030 ms (00:04.234)
Time: 4244.197 ms (00:04.244)
Time: 4266.000 ms (00:04.266)

Unpatched:

Time: 5917.288 ms (00:05.917)
Time: 5937.775 ms (00:05.938)
Time: 5911.146 ms (00:05.911)
Time: 5906.881 ms (00:05.907)
Time: 5918.309 ms (00:05.918)

(about 39% faster)
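
To see the plan shape behind these numbers (only a sketch of how to
check; the exact output will depend on the build and settings), the
same query can be run under EXPLAIN.  With the patch, the expectation
described above is an ordered Append over the partitions rather than a
MergeAppend:

EXPLAIN (COSTS OFF)
SELECT * FROM partbench ORDER BY id LIMIT 1 OFFSET 29999999;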

The implementation is fairly simple. One thing I don't like about it
is that I'd rather build_partition_pathkeys() performed all the checks
needed to know whether the partitioned table supports a natural
pathkey, but as of now I have the calling code ensuring that there are
no sub-partitioned tables, since those could cause tuples to be output
in the wrong order.

Does this idea seem like something we'd want?

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment: v1-0001-Allow-Append-to-be-used-in-place-of-MergeAppend-f.patch (49K)

Re: Ordered Partitioned Table Scans

Amit Langote-2
On 2018/10/26 11:50, David Rowley wrote:

> RANGE partitioning of time-series data is quite a common range to use
> partitioning, and such tables tend to grow fairly large.  I thought
> since we always store RANGE partitioned tables in the PartitionDesc in
> ascending range order that it might be useful to make use of this and
> when the required pathkeys match the order of the range, then we could
> make use of an Append node instead of uselessly using a MergeAppend,
> since the MergeAppend will just exhaust each subplan one at a time, in
> order.
>
> It does not seem very hard to implement this and it does not add much
> in the way of additional processing to the planner.
>
> Performance wise it seems to give a good boost to getting sorted
> results from a partitioned table. I performed a quick test just on my
> laptop with:
>
> Setup:
> CREATE TABLE partbench (id BIGINT NOT NULL, i1 INT NOT NULL, i2 INT
> NOT NULL, i3 INT NOT NULL, i4 INT NOT NULL, i5 INT NOT NULL) PARTITION
> BY RANGE (id);
> select 'CREATE TABLE partbench' || x::text || ' PARTITION OF partbench
> FOR VALUES FROM (' || (x*100000)::text || ') TO (' ||
> ((x+1)*100000)::text || ');' from generate_Series(0,299) x;
> \gexec
> \o
> INSERT INTO partbench SELECT x,1,2,3,4,5 from generate_Series(0,29999999) x;
> create index on partbench (id);
> vacuum analyze;
>
> Test:
> select * from partbench order by id limit 1 offset 29999999;
>
> Results Patched:
>
> Time: 4234.807 ms (00:04.235)
> Time: 4237.928 ms (00:04.238)
> Time: 4241.289 ms (00:04.241)
> Time: 4234.030 ms (00:04.234)
> Time: 4244.197 ms (00:04.244)
> Time: 4266.000 ms (00:04.266)
>
> Unpatched:
>
> Time: 5917.288 ms (00:05.917)
> Time: 5937.775 ms (00:05.938)
> Time: 5911.146 ms (00:05.911)
> Time: 5906.881 ms (00:05.907)
> Time: 5918.309 ms (00:05.918)
>
> (about 39% faster)
>
> The implementation is fairly simple. One thing I don't like about is
> I'd rather build_partition_pathkeys() performed all the checks to know
> if the partition should support a natural pathkey, but as of now, I
> have the calling code ensuring that there are no sub-partitioned
> tables. These could cause tuples to be output in the wrong order.
>
> Does this idea seem like something we'd want?

Definitely!  Thanks for creating the patch.

I recall Ronan Dunklau and Julien Rouhaud had proposed a patch for this
last year, but the partitioning-related planning code hadn't advanced then
as much as it has today, so they sort of postponed working on it.
Eventually their patch was returned with feedback last November.  Here's
the link to their email in case you want to read some of the comments
their proposal and patch received, although some of them might be
obsolete.

https://www.postgresql.org/message-id/2401607.SfZhPQhbS4%40ronan_laptop

Thanks,
Amit



Re: Ordered Partitioned Table Scans

David Rowley-3
On 26 October 2018 at 16:52, Amit Langote <[hidden email]> wrote:
> I recall Ronan Dunklau and Julien Rouhaud had proposed a patch for this
> last year, but the partitioning-related planning code hadn't advanced then
> as much as it has today, so they sort of postponed working on it.
> Eventually their patch was returned with feedback last November.  Here's
> the link to their email in case you wanted to read some comments their
> proposal and patch got, although some of them might be obsolete.
>
> https://www.postgresql.org/message-id/2401607.SfZhPQhbS4%40ronan_laptop

Thanks. I wasn't aware, or ... forgot. It looks like back then was a
tricky time to be doing this. Hopefully, the dust has settled a little
bit now.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Ordered Partitioned Table Scans

Julien Rouhaud
Hi,

On Fri, Oct 26, 2018 at 6:40 AM David Rowley
<[hidden email]> wrote:

>
> On 26 October 2018 at 16:52, Amit Langote <[hidden email]> wrote:
> > I recall Ronan Dunklau and Julien Rouhaud had proposed a patch for this
> > last year, but the partitioning-related planning code hadn't advanced then
> > as much as it has today, so they sort of postponed working on it.
> > Eventually their patch was returned with feedback last November.  Here's
> > the link to their email in case you wanted to read some comments their
> > proposal and patch got, although some of them might be obsolete.
> >
> > https://www.postgresql.org/message-id/2401607.SfZhPQhbS4%40ronan_laptop
>
> Thanks. I wasn't aware, or ... forgot. Looks like back then was tricky
> times to be doing this. Hopefully, the dust has settled a little bit
> now.

Yes, back then I unfortunately had limited time to work on that, and
I had to spend all of it rebasing the patch instead of working on the
various issues :(

Sadly, I have even less time now, but I'll try to look at your patch
this weekend!  As far as I remember, the biggest problems we had were
handling multi-level partitioning when the query is ordered by all
or a subset of the partition keys, and/or with a mix of ASC/DESC
clauses.  It also required some extra processing on the cost side for
queries that can be naturally ordered and contain a LIMIT clause,
since we can estimate how many partitions will have to be scanned.
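
To make the ASC/DESC problem concrete, here is a made-up sketch (the
table, bounds and data are illustrative only, not taken from our old
patch): with multi-column range partitioning, an ORDER BY that mixes
ASC and DESC over the partition keys generally matches neither the
partition bound order nor its reverse:

CREATE TABLE md (a int, b int) PARTITION BY RANGE (a, b);
CREATE TABLE md_1 PARTITION OF md FOR VALUES FROM (0, 0) TO (1, 5);
CREATE TABLE md_2 PARTITION OF md FOR VALUES FROM (1, 5) TO (2, 0);
INSERT INTO md VALUES (0, 9), (1, 1), (1, 9);
-- (0,9) and (1,1) land in md_1, (1,9) in md_2.  The desired output is
-- (0,9), (1,9), (1,1); neither md_1-then-md_2 nor the reverse gives
-- that, so the partitions can't simply be concatenated in bound order.
SELECT * FROM md ORDER BY a ASC, b DESC;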


Re: Ordered Partitioned Table Scans

Julien Rouhaud
Hi,

On Fri, Oct 26, 2018 at 1:01 PM Julien Rouhaud <[hidden email]> wrote:

> On Fri, Oct 26, 2018 at 6:40 AM David Rowley
> <[hidden email]> wrote:
> >
> > On 26 October 2018 at 16:52, Amit Langote <[hidden email]> wrote:
> > > I recall Ronan Dunklau and Julien Rouhaud had proposed a patch for this
> > > last year, but the partitioning-related planning code hadn't advanced then
> > > as much as it has today, so they sort of postponed working on it.
> > > Eventually their patch was returned with feedback last November.  Here's
> > > the link to their email in case you wanted to read some comments their
> > > proposal and patch got, although some of them might be obsolete.
> > >
> > > https://www.postgresql.org/message-id/2401607.SfZhPQhbS4%40ronan_laptop
> >
> > Thanks. I wasn't aware, or ... forgot. Looks like back then was tricky
> > times to be doing this. Hopefully, the dust has settled a little bit
> > now.
>
> As far as I remember, the biggest problems we had was
> to handle multi-level partitionning, when the query is ordered by all
> or a subset of the partition keys, and/or with a mix of ASC/DESC
> clauses.  It also required some extra processing on the cost part for
> queries that can be naturally ordered and contain a LIMIT clause,
> since we can estimate how many partitions will have to be scanned.

I just had a look at your patch.  I see that you implemented only a
subset of the possible optimizations (only the case for range
partitioning without subpartitions).  This has been previously
discussed, but we should be able to do a similar optimization for list
partitioning if there are no interleaved values, and also for some
cases of multi-level partitioning.

Concerning the implementation, there's at least one issue: it assumes
that each subpath of a range-partitioned table will be ordered, which
is not guaranteed.  You need to generate explicit Sort nodes (in the
same order as the query_pathkeys) for partitions that don't have an
ordered path, and make sure that this path is used in the Append.
Here's a simplistic case showing the issue (sorry, the partition names
are poorly chosen):

CREATE TABLE simple (id integer, val text) PARTITION BY RANGE (id);
CREATE TABLE simple_1_2 PARTITION OF simple FOR VALUES FROM (1) TO (100000);
CREATE TABLE simple_2_3 PARTITION OF simple FOR VALUES FROM (100000)
TO (200000);
CREATE TABLE simple_0_1 PARTITION OF simple FOR VALUES FROM (-100000) TO (1);

INSERT INTO simple SELECT id, 'line ' || id FROM
generate_series(-19999, 199999) id;

CREATE INDEX ON simple_1_2 (id);
CREATE INDEX ON simple_2_3 (id);

EXPLAIN SELECT * FROM simple ORDER BY id ;
                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Append  (cost=0.00..7705.56 rows=219999 width=15)
   ->  Seq Scan on simple_0_1  (cost=0.00..309.00 rows=20000 width=15)
   ->  Index Scan using simple_1_2_id_idx on simple_1_2
(cost=0.29..3148.28 rows=99999 width=14)
   ->  Index Scan using simple_2_3_id_idx on simple_2_3
(cost=0.29..3148.29 rows=100000 width=16)
(4 rows)

Also, if a LIMIT is specified, it should be able to give better
estimates, at least if there's no qual.  For instance:

EXPLAIN SELECT * FROM simple ORDER BY id LIMIT 10;
                          QUERY PLAN
---------------------------------------------------------------------
 Limit  (cost=0.00..0.35 rows=10 width=15)
   ->  Append  (cost=0.00..7705.56 rows=219999 width=15)
         ->  Seq Scan on simple_0_1  (cost=0.00..309.00 rows=20000 width=15)
         ->  Index Scan using simple_1_2_id_idx on simple_1_2
(cost=0.29..3148.28 rows=99999 width=14)
         ->  Index Scan using simple_2_3_id_idx on simple_2_3
(cost=0.29..3148.29 rows=100000 width=16)
(5 rows)

In this case, we should estimate that the SeqScan (or, in a corrected
version, the Sort) node should not return more than 10 rows, that the
following partitions shouldn't need to be scanned at all, and cost
each path accordingly.  I think that this is quite important, for
instance to make sure that a natively sorted Append is chosen over a
MergeAppend when there are some subpaths with explicit sorts, because
with the Append we probably won't have to execute all the sorts if the
previous partition scans returned enough rows.

FWIW, both those cases were handled (probably with some bugs though)
in the previous patches Ronan and I sent some time ago.  Also, I did
not forget about this feature; I planned to work on it in the hope of
having it included in pg12.  However, I won't have a lot of time to
work on it before December.


Re: Ordered Partitioned Table Scans

David Rowley-3
Thanks for looking at this.

On 28 October 2018 at 03:49, Julien Rouhaud <[hidden email]> wrote:
> I just had a look at your patch.  I see that you implemented only a
> subset of the possible optimizations (only the case for range
> partitionoing without subpartitions).  This has been previously
> discussed, but we should be able to do similar optimization for list
> partitioning if there's no interleaved values, and also for some cases
> of multi-level partitioning.

I had thought about these cases but originally thought they would be
more complex to implement than I could justify. On review, I've found
some pretty cheap ways to handle both sub-partitions and LIST
partitioned tables.  Currently, with LIST partitioned tables I've
coded it to only allow the optimisation if there's no DEFAULT
partition and all partitions are defined with exactly 1 Datum. This
guarantees that there are no interleaved values, but it'll just fail
to optimise cases like FOR VALUES IN (1,2) + FOR VALUES IN (3,4).  The
reason that I didn't go to the trouble of the additional checks was
that I don't really want to add any per-partition overhead to this.
If RelOptInfo had a Bitmapset of live partitions then we could just
check the partitions that survived pruning.  Amit Langote has a
pending patch which does that and some other useful stuff, so maybe we
can delay fixing that until the dust settles a bit in that area. Amit
and I are both working hard to remove all these per-partition
overheads. I imagine he'd also not be in favour of adding code that
does something for all partitions when we've pruned down to just 1.
I've personally no objection to doing the required additional
processing for the non-pruned partitions only. We could also then fix
the case where we disable the optimisation if there's a DEFAULT
partition without any regard to whether it's been pruned or not.
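
To make that rule concrete, here's a quick sketch (hypothetical table
names, not taken from the patch or its regression tests):

-- Allowed by the check described above: no DEFAULT partition and
-- exactly one value per partition, so the partitions cannot interleave.
CREATE TABLE lp (v int) PARTITION BY LIST (v);
CREATE TABLE lp_1 PARTITION OF lp FOR VALUES IN (1);
CREATE TABLE lp_2 PARTITION OF lp FOR VALUES IN (2);

-- Not recognised, even though the values don't actually interleave,
-- because each partition accepts more than one value.
CREATE TABLE lp2 (v int) PARTITION BY LIST (v);
CREATE TABLE lp2_a PARTITION OF lp2 FOR VALUES IN (1, 2);
CREATE TABLE lp2_b PARTITION OF lp2 FOR VALUES IN (3, 4);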

> Concerning the implementation, there's at least one issue: it assumes
> that each subpath of a range-partitioned table will be ordered, with
> is not guaranteed.  You need to to generate explicit Sort nodes nodes
> (in the same order as the query_pathkey) for partitions that don't
> have an ordered path and make sure that this path is used in the
> Append.  Here's a simplistic case showing the issue (sorry, the
> partition names are poorly chosen):

Thanks for noticing this. I had been thrown off by the fact that
Paths are never actually created for these sorts. On looking further I
see that we do checks during createplan to see if the path is suitably
sorted and just create a Sort node if it's not. This seems to go
against the whole point of paths, but I'm not going to fight for
changing it, so I've just made the Append handle it the same way
MergeAppend does.

> Also, if a LIMIT is specified, it should be able to give better
> estimates, at least if there's no qual.  For instance:
>
> EXPLAIN SELECT * FROM simple ORDER BY id LIMIT 10;
>                                                QUERY PLAN
>                                 >
> ------------------------------------------------------------------------------------------------------->
>  Limit  (cost=0.00..0.35 rows=10 width=15)
>    ->  Append  (cost=0.00..7705.56 rows=219999 width=15)
>          ->  Seq Scan on simple_0_1  (cost=0.00..309.00 rows=20000 width=15)
>          ->  Index Scan using simple_1_2_id_idx on simple_1_2
> (cost=0.29..3148.28 rows=99999 width=14)
>          ->  Index Scan using simple_2_3_id_idx on simple_2_3
> (cost=0.29..3148.29 rows=100000 width=16)
> (5 rows)
>
> In this case, we should estimate that the SeqScan (or in a corrected
> version the Sort) node should not return more than 10 rows, and each
> following partition should be scanned at all, and cost each path
> accordingly.  I think that this is quite important, for instance to
> make sure that natively sorted Append is chosen over a MergeAppend
> when there are some subpath with explicit sorts, because with the
> Append we probably won't have to execute all the sorts if the previous
> partition scans returned enough rows.

In my patch, I'm not adding any additional paths. I'm just adding an
Append instead of a MergeAppend.  For what you're talking about, the
limit only needs to be passed into any underlying Sort so that it can
become a top-N sort.  This is handled already in create_limit_path().
Notice in the plan you pasted above that the Limit has a lower total
cost than its Append subnode. That's because create_limit_path()
weighted the Limit total cost based on the row count of the limit and
its subpath: 7705.56 / 219999 * 10 = ~0.35.

> FWIW, both those cases were handled (probably with some bugs though)
> in the previous patches Ronan and I sent some time ago.  Also, I did
> not forget about this feature, I planned to work on it in hope to have
> it included in pg12.  However, I won't have a lot of time to work on
> it before December.

I apologise for not noticing your patch. I only went as far as
checking the November commitfest to see if anything existed already
and I found nothing there.  I have time to work on this now, so likely
it's better if I continue, just in case your time in December does not
materialise.

v2 of the patch is attached. I've not had time yet to give it a
thorough post-write review, but on first look it seems okay.

The known limitations are:

* Disables the optimisation when there's a DEFAULT partition, even if
that partition has been pruned.
* Disables the optimisation if LIST partitioned tables have any
partitions allowing > 1 value.
* Fails to optimise UNION ALLs with partitioned tables.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment: v2-0001-Allow-Append-to-be-used-in-place-of-MergeAppend-f.patch (66K)

Re: Ordered Partitioned Table Scans

David Rowley-3
On 29 October 2018 at 13:44, David Rowley <[hidden email]> wrote:
> v2 of the patch is attached. I've not had time yet to give it a
> throughout post write review, but on first look it seems okay.

Added to the November 'fest.

https://commitfest.postgresql.org/20/1850/

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Ordered Partitioned Table Scans

Julien Rouhaud
On Mon, Oct 29, 2018 at 1:44 AM David Rowley
<[hidden email]> wrote:

>
> On 28 October 2018 at 03:49, Julien Rouhaud <[hidden email]> wrote:
> > I just had a look at your patch.  I see that you implemented only a
> > subset of the possible optimizations (only the case for range
> > partitionoing without subpartitions).  This has been previously
> > discussed, but we should be able to do similar optimization for list
> > partitioning if there's no interleaved values, and also for some cases
> > of multi-level partitioning.
>
> I had thought about these cases but originally had thought they would
> be more complex to implement than I could justify. On review, I've
> found some pretty cheap ways to handle both sub-partitions and for
> LIST partitioned tables.  Currently, with LIST partitioned tables I've
> coded it to only allow the optimisation if there's no DEFAULT
> partition and all partitions are defined with exactly 1 Datum. This
> guarantees that there are no interleaved values, but it'll just fail
> to optimise cases like FOR VALUES IN(1,2) + FOR VALUES In(3,4).   The
> reason that I didn't go to the trouble of the additional checks was
> that I don't really want to add any per-partition overhead to this.

I see, but the overhead you mention is because you're doing that check
during planning in build_partition_pathkeys().  As advised by Robert
quite some time ago
(https://www.postgresql.org/message-id/CA+TgmobOWgT1=zyjx-q=7s8akXNODix46qG0_-YX7K369P6ADA@...),
we can store that information when the PartitionDesc is built, so it
wouldn't be problematic.  Since checking for overlapping values is
straightforward with the BoundInfoData infrastructure, it'd be a pity
to miss this optimization in such cases, which I believe would not be
rare.

> If RelOptInfo had a Bitmapset of live partitions then we could just
> check the partitions that survived pruning.  Amit Langote has a
> pending patch which does that and some other useful stuff, so maybe we
> can delay fixing that until the dust settles a bit in that area. Amit
> and I are both working hard to remove all these per-partition
> overheads. I imagine he'd also not be in favour of adding code that
> does something for all partitions when we've pruned down to just 1.
> I've personally no objection to doing the required additional
> processing for the non-pruned partitions only. We could also then fix
> the case where we disable the optimisation if there's a DEFAULT
> partition without any regards to if it's been pruned or not.

Those are quite worthwhile enhancements, and being able to avoid a
MergeAppend if the problematic partitions have been pruned would be
great!  I didn't follow thoroughly all the discussions about the
various optimizations Amit and you are working on, but I don't think
they would be incompatible with a new flag and the possibility of
having a sorted Append with multi-valued list partitions?

>
> > Concerning the implementation, there's at least one issue: it assumes
> > that each subpath of a range-partitioned table will be ordered, with
> > is not guaranteed.  You need to to generate explicit Sort nodes nodes
> > (in the same order as the query_pathkey) for partitions that don't
> > have an ordered path and make sure that this path is used in the
> > Append.  Here's a simplistic case showing the issue (sorry, the
> > partition names are poorly chosen):
>
> Thanks for noticing this. I had been thrown off due to the fact that
> Paths are never actually created for these sorts. On looking further I
> see that we do checks during createplan to see if the path is
> suitability sorted and just create a sort node if it's not. This seems
> to go against the whole point of paths, but I'm not going to fight for
> changing it, so I've just done the Append the same way as MergeAppend
> handles it.

Yes, I had quite the same reaction when I saw how MergeAppend handles it.

>
> > Also, if a LIMIT is specified, it should be able to give better
> > estimates, at least if there's no qual.  For instance:
> >
> > EXPLAIN SELECT * FROM simple ORDER BY id LIMIT 10;
> >                                                QUERY PLAN
> >                                 >
> > ------------------------------------------------------------------------------------------------------->
> >  Limit  (cost=0.00..0.35 rows=10 width=15)
> >    ->  Append  (cost=0.00..7705.56 rows=219999 width=15)
> >          ->  Seq Scan on simple_0_1  (cost=0.00..309.00 rows=20000 width=15)
> >          ->  Index Scan using simple_1_2_id_idx on simple_1_2
> > (cost=0.29..3148.28 rows=99999 width=14)
> >          ->  Index Scan using simple_2_3_id_idx on simple_2_3
> > (cost=0.29..3148.29 rows=100000 width=16)
> > (5 rows)
> >
> > In this case, we should estimate that the SeqScan (or in a corrected
> > version the Sort) node should not return more than 10 rows, and each
> > following partition should be scanned at all, and cost each path
> > accordingly.  I think that this is quite important, for instance to
> > make sure that natively sorted Append is chosen over a MergeAppend
> > when there are some subpath with explicit sorts, because with the
> > Append we probably won't have to execute all the sorts if the previous
> > partition scans returned enough rows.
>
> In my patch, I'm not adding any additional paths. I'm just adding an
> Append instead of a MergeAppend.  For what you're talking about the
> limit only needs to be passed into any underlying Sort so that it can
> become a top-N sort.  This is handled already in create_limit_path().
> Notice in the plan you pasted above that the limit has a lower total
> cost than its Append subnode. That's because create_limit_path()
> weighted the Limit total cost based on the row count of the limit and
> its subpath. ... 7705.56 / 219999 * 10 = ~0.35.

Yes.  But the cost of the first partition in this example is wrong
since there was no additional sort on top of the seq scan.

However, I now realize that, as you said, what your patch does is to
generate an Append *instead* of a MergeAppend if the optimization was
possible.  So there can't be the problem of a MergeAppend chosen over
a cheaper Append in some cases, sorry for the noise.  I totally missed
that because when I worked on the same topic last year we had to
generate both Append and MergeAppend.  At that time Append was not
parallel-aware yet, so there could be a faster parallel MergeAppend in
some cases.

> > FWIW, both those cases were handled (probably with some bugs though)
> > in the previous patches Ronan and I sent some time ago.  Also, I did
> > not forget about this feature, I planned to work on it in hope to have
> > it included in pg12.  However, I won't have a lot of time to work on
> > it before December.
>
> I apologise for not noticing your patch. I only went as far as
> checking the November commitfest to see if anything existed already
> and I found nothing there.

No worries, it's more than a year old now (I'm quite ashamed I didn't
come back to this sooner).

>  I have time to work on this now, so likely
> it's better if I continue, just in case your time in December does not
> materialise.

I entirely agree.

> v2 of the patch is attached. I've not had time yet to give it a
> throughout post write review, but on first look it seems okay.


I've registered as a reviewer.   I still didn't have a deep look at
the patch yet, but thanks a lot for working on it!


Re: Ordered Partitioned Table Scans

David Rowley-3
On 31 October 2018 at 12:24, Julien Rouhaud <[hidden email]> wrote:

> On Mon, Oct 29, 2018 at 1:44 AM David Rowley
> <[hidden email]> wrote:
>>
>> On 28 October 2018 at 03:49, Julien Rouhaud <[hidden email]> wrote:
>> > I just had a look at your patch.  I see that you implemented only a
>> > subset of the possible optimizations (only the case for range
>> > partitionoing without subpartitions).  This has been previously
>> > discussed, but we should be able to do similar optimization for list
>> > partitioning if there's no interleaved values, and also for some cases
>> > of multi-level partitioning.
>>
>> I had thought about these cases but originally had thought they would
>> be more complex to implement than I could justify. On review, I've
>> found some pretty cheap ways to handle both sub-partitions and for
>> LIST partitioned tables.  Currently, with LIST partitioned tables I've
>> coded it to only allow the optimisation if there's no DEFAULT
>> partition and all partitions are defined with exactly 1 Datum. This
>> guarantees that there are no interleaved values, but it'll just fail
>> to optimise cases like FOR VALUES IN(1,2) + FOR VALUES In(3,4).   The
>> reason that I didn't go to the trouble of the additional checks was
>> that I don't really want to add any per-partition overhead to this.
>
> I see, but the overhead you mention is because you're doing that check
> during the planning in build_partition_pathkeys().  As advised by
> Robert quite some time ago
> (https://www.postgresql.org/message-id/CA+TgmobOWgT1=zyjx-q=7s8akXNODix46qG0_-YX7K369P6ADA@...),
>  we can store that information when the PartitionDesc is built, so
> that would it wouldn't be problematic.  Since checking for overlapping
> values is straightforward with the BoundInfoData infrastructure, it'd
> be a pity to miss this optimization in such cases, which I believe
> would not be rare.

Thanks for looking at this again.

I retrospectively read that thread after Amit mentioned your patch. I
just disagree with Robert about caching this flag.  The reason is, if
the flag is false due to some problematic partitions and we then go
and prune those, we needlessly fail to optimise that case. I propose
we come back and do the remaining optimisations for interleaved LIST
partitions and partitioned tables with DEFAULT partitions later, once
we have a new "live_parts" field in RelOptInfo.  That way we can just
check the live parts to ensure they're compatible with the
optimization. If we get what's already done committed, then that's
already a big step forward.

[...]

>> v2 of the patch is attached. I've not had time yet to give it a
>> throughout post write review, but on first look it seems okay.
>
>
> I've registered as a reviewer.   I still didn't have a deep look at
> the patch yet, but thanks a lot for working on it!

Thanks for signing up to review.  I need to send another revision of
the patch to add a missing call to truncate_useless_pathkeys(). Will
try to do that today.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Ordered Partitioned Table Scans

David Rowley-3
On 31 October 2018 at 13:05, David Rowley <[hidden email]> wrote:
>>> On 28 October 2018 at 03:49, Julien Rouhaud <[hidden email]> wrote:
>> I've registered as a reviewer.   I still didn't have a deep look at
>> the patch yet, but thanks a lot for working on it!
>
> Thanks for signing up to review.  I need to send another revision of
> the patch to add a missing call to truncate_useless_pathkeys(). Will
> try to do that today.

I've attached a patch that removes the redundant pathkeys. This allows
cases like the following to work:

explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
                         QUERY PLAN
-------------------------------------------------------------
 Append
   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
         Index Cond: (a = 10)
   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
         Index Cond: (a = 10)
(5 rows)

One thing that could work but currently does not: when LIST
partitions each allow just a single value, we could give the Append
pathkeys even if there are no indexes.  One way to do this would be to
add PathKeys to the seqscan paths of the supporting partitions.
However, that's adding code in another area so it likely should be
another patch.

This could allow cases like:

create table bool_rp (b bool) partition by list(b);
create table bool_rp_true partition of bool_rp for values in(true);
create table bool_rp_false partition of bool_rp for values in(false);
explain (costs off) select * from bool_rp order by b;
                            QUERY PLAN
------------------------------------------------------------------
 Append
   ->  Seq Scan on bool_rp_false
   ->  Seq Scan on bool_rp_true
(3 rows)

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment: v3-0001-Allow-Append-to-be-used-in-place-of-MergeAppend-f.patch (67K)

Re: Ordered Partitioned Table Scans

Antonin Houska-2
David Rowley <[hidden email]> wrote:

> On 31 October 2018 at 13:05, David Rowley <[hidden email]> wrote:
> >>> On 28 October 2018 at 03:49, Julien Rouhaud <[hidden email]> wrote:
> >> I've registered as a reviewer.   I still didn't have a deep look at
> >> the patch yet, but thanks a lot for working on it!
> >
> > Thanks for signing up to review.  I need to send another revision of
> > the patch to add a missing call to truncate_useless_pathkeys(). Will
> > try to do that today.
>
> I've attached a patch that ...

I picked this one when looking around for something I could review.

* As for the logic, I found generate_mergeappend_paths() to be the most
  interesting part:

Imagine table partitioned by "i", so "partition_pathkeys" is {i}.

partition 1:

i | j
--+--
0 | 0
1 | 1
0 | 1
1 | 0

partition 2:

i | j
--+--
3 | 0
2 | 0
2 | 1
3 | 1

Even if "pathkeys" is {i, j}, i.e. not contained in "partition_pathkeys", the
ordering of the subpaths should not change the way tuples are split into
partitions.

Obviously a problem is if "partition_pathkeys" and "pathkeys" lists start with
different items. To propose more generic rule, I used this example of
range-partitioned table, where "i" and "j" are the partitioning keys:

partition 1:

 i | j | k
---+---+---
 0 | 0 | 1
 0 | 0 | 0

partition 2:

 i | j | k
---+---+---
 0 | 1 | 0
 0 | 1 | 1

If the output "pathkey" is {i, k}, then the Append path makes rows of both
partitions interleave:

 i | j | k
---+---+---
 0 | 0 | 0
 0 | 1 | 0
 0 | 0 | 1
 0 | 1 | 1

So in general I think the restriction is that "pathkeys" and
"partition_pathkeys" may not differ at any position where both lists
have an entry. Or in other words: the shorter of the 2 pathkey lists
must be contained in the longer one. Does that make sense to you?
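
For what it's worth, the interleaving example above can be reproduced
with something like this (the names are made up; the table is range
partitioned on (i, j)):

CREATE TABLE t (i int, j int, k int) PARTITION BY RANGE (i, j);
CREATE TABLE t_p1 PARTITION OF t FOR VALUES FROM (0, 0) TO (0, 1);
CREATE TABLE t_p2 PARTITION OF t FOR VALUES FROM (0, 1) TO (0, 2);
INSERT INTO t VALUES (0, 0, 1), (0, 0, 0), (0, 1, 0), (0, 1, 1);
-- Ordered by {i, k}, rows from t_p1 and t_p2 interleave, so simply
-- concatenating the (individually sorted) partitions cannot produce
-- this ordering.
SELECT * FROM t ORDER BY i, k;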


Another problem I see is that build_partition_pathkeys() continues even if it
fails to create a pathkey for some partitioning column. In the example above
it would mean that the table can have "partition_pathkeys" equal to {j}
instead of {i, j} on some EC-related conditions. However such a key does not
correspond to reality - this is easier to imagine if another partition is
considered.

partition 3:

 i | j | k
---+---+---
 1 | 0 | 1
 1 | 0 | 0

So I think no "partition_pathkeys" should be generated in that case. On the
other hand, if the function returned the part of the list it could construct
so far, it'd be wrong because such incomplete pathkeys could pass the checks I
proposed above for reasons unrelated to the partitioning scheme.


The following comments are mostly on coding:

* Both qsort_partition_list_value_cmp() and qsort_partition_rbound_cmp() have
  this sentence in the header comment:

Note: If changing this, see build_partition_pathkeys()

However I could not find other use besides that in
RelationBuildPartitionDesc().

* create_append_path():

        /*
         * Apply query-wide LIMIT if known and path is for sole base relation.
         * (Handling this at this low level is a bit klugy.)
         */
        if (root != NULL && pathkeys != NULL &&
                bms_equal(rel->relids, root->all_baserels))
                pathnode->limit_tuples = root->limit_tuples;
        else
                pathnode->limit_tuples = -1.0;

  I think this optimization is not specific to AppendPath / MergeAppendPath,
  so it could be moved elsewhere (as a separate patch of course). But
  specifically for AppendPath, why do we have to test pathkeys? The pathkeys
  of the AppendPath do not necessarily describe the order of the set to which
  LIMIT is applied, so their existence should not be important here.

* If pathkeys is passed, shouldn't create_append_path() include the
  cost_sort() of subpaths just like create_merge_append_path() does?  And if
  so, then create_append_path() and create_merge_append_path() might
  eventually have some common code (at least for the subpath processing) to be
  put into a separate function.

* Likewise, create_append_plan() / create_merge_append_plan() are going to be
  more similar than before, so some refactoring could also make sense.

Although it's not too much code, I admit the patch is not trivial, so I'm
curious about your opinion.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com


Re: Ordered Partitioned Table Scans

David Rowley-3
On 1 November 2018 at 04:01, Antonin Houska <[hidden email]> wrote:

> * As for the logic, I found generate_mergeappend_paths() to be the most
>   interesting part:
>
> Imagine table partitioned by "i", so "partition_pathkeys" is {i}.
>
> partition 1:
>
> i | j
> --+--
> 0 | 0
> 1 | 1
> 0 | 1
> 1 | 0
>
> partition 2:
>
> i | j
> --+--
> 3 | 0
> 2 | 0
> 2 | 1
> 3 | 1
>
> Even if "pathkeys" is {i, j}, i.e. not contained in "partition_pathkeys", the
> ordering of the subpaths should not change the way tuples are split into
> partitions.
>
> Obviously a problem is if "partition_pathkeys" and "pathkeys" lists start with
> different items. To propose more generic rule, I used this example of
> range-partitioned table, where "i" and "j" are the partitioning keys:
>
> partition 1:
>
>  i | j | k
> ---+---+---
>  0 | 0 | 1
>  0 | 0 | 0
>
> partition 2:
>
>  i | j | k
> ---+---+---
>  0 | 1 | 0
>  0 | 1 | 1
>
> If the output "pathkey" is {i, k}, then the Append path makes rows of both
> partitions interleave:
>
>  i | j | k
> ---+---+---
>  0 | 0 | 0
>  0 | 1 | 0
>  0 | 0 | 1
>  0 | 1 | 1
>
> So in general I think the restriction is that no valid position of "pathkeys"
> and "partition_pathkeys" may differ. Or in other words: the shorter of the 2
> pathkey lists must be contained in the longer one. Does it make sense to you?

I understand what you're saying. I just don't understand what you
think is wrong with the patch in this area.

> Another problem I see is that build_partition_pathkeys() continues even if it
> fails to create a pathkey for some partitioning column. In the example above
> it would mean that the table can have "partition_pathkeys" equal to {j}
> instead of {i, j} on some EC-related conditions. However such a key does not
> correspond to reality - this is easier to imagine if another partition is
> considered.
>
> partition 3:
>
>  i | j | k
> ---+---+---
>  1 | 0 | 1
>  1 | 0 | 0
>
> So I think no "partition_pathkeys" should be generated in that case. On the
> other hand, if the function returned the part of the list it could construct
> so far, it'd be wrong because such incomplete pathkeys could pass the checks I
> proposed above for reasons unrelated to the partitioning scheme.

Oops. That's a mistake. We should return what we have so far if we
can't make one of the pathkeys. Will fix.

> The following comments are mostly on coding:
>
> * Both qsort_partition_list_value_cmp() and qsort_partition_rbound_cmp() have
>   this sentence in the header comment:
>
> Note: If changing this, see build_partition_pathkeys()
>
> However I could not find other use besides that in
> RelationBuildPartitionDesc().

While the new code does not call those directly, the new code does
depend on the sort order of the partitions inside the PartitionDesc,
which those functions are responsible for. Perhaps there's a better
way to communicate that.

Actually, I think the partition checking code I added to pathkeys.c
does not belong there. Those checks should likely live with the other
partitioning code, in the form of a bool-returning function. I'll
change that now. It also means we don't have to work it out twice, as
I'm currently running it once for the forward and once for the
backward scan case.  Currently the code is very simple, but if we
start analysing list partition bounds then it will become slower.

> * create_append_path():
>
>         /*
>          * Apply query-wide LIMIT if known and path is for sole base relation.
>          * (Handling this at this low level is a bit klugy.)
>          */
>         if (root != NULL && pathkeys != NULL &&
>                 bms_equal(rel->relids, root->all_baserels))
>                 pathnode->limit_tuples = root->limit_tuples;
>         else
>                 pathnode->limit_tuples = -1.0;
>
>   I think this optimization is not specific to AppendPath / MergeAppendPath,
>   so it could be moved elsewhere (as a separate patch of course). But
>   specifically for AppendPath, why do we have to test pathkeys? The pathkeys
>   of the AppendPath do not necessarily describe the order of the set to which
>   LIMIT is applied, so their existence should not be important here.

The pathkeys != NULL could be removed. I was just trying to maintain
the status quo for Appends without pathkeys. In reality it currently
does not matter since that's only used as a parameter for cost_sort().
There'd be no reason previously to have a Sort path as a subpath in an
Append node since the order would be destroyed after the Append.
Perhaps we should just pass it through as one day it might be useful.
I just can't currently imagine why.

> * If pathkeys is passed, shouldn't create_append_path() include the
>   cost_sort() of subpaths just like create_merge_append_path() does?  And if
>   so, then create_append_path() and create_merge_append_path() might
>   eventually have some common code (at least for the subpath processing) to be
>   put into a separate function.

It does. It's just done via the call to cost_append().

> * Likewise, create_append_plan() / create_merge_append_plan() are going to be
>   more similar then before, so some refactoring could also make sense.
>
> Although it's not too much code, I admit the patch is not trivial, so I'm
> curious about your opinion.

I think the costing code is sufficiently different to warrant not
sharing more. For example, the startup costing is completely
different: Append can start on the startup cost of the first subpath,
but MergeAppend takes the sum of the startup costs of all subpaths.

I've attached v4 of the patch. I think this addresses all that you
mentioned apart from the first one, as I didn't understand the
problem.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment: v4-0001-Allow-Append-to-be-used-in-place-of-MergeAppend-f.patch (71K)

Re: Ordered Partitioned Table Scans

Antonin Houska-2
David Rowley <[hidden email]> wrote:

> On 1 November 2018 at 04:01, Antonin Houska <[hidden email]> wrote:
> > * As for the logic, I found generate_mergeappend_paths() to be the most
> >   interesting part:
> >
> > Imagine table partitioned by "i", so "partition_pathkeys" is {i}.
> >
> > partition 1:
> >
> > i | j
> > --+--
> > 0 | 0
> > 1 | 1
> > 0 | 1
> > 1 | 0
> >
> > partition 2:
> >
> > i | j
> > --+--
> > 3 | 0
> > 2 | 0
> > 2 | 1
> > 3 | 1
> >
> > Even if "pathkeys" is {i, j}, i.e. not contained in "partition_pathkeys", the
> > ordering of the subpaths should not change the way tuples are split into
> > partitions.
> >
> > ...
>
> I understand what you're saying. I just don't understand what you
> think is wrong with the patch in this area.

I think these conditions are too restrictive:

        /*
         * Determine if these pathkeys match the partition order, or reverse
         * partition order.  It can't match both, so only go to the trouble of
         * checking the reverse order when it's not in ascending partition
         * order.
         */
        partition_order = pathkeys_contained_in(pathkeys,
                                                partition_pathkeys);
        partition_order_desc = !partition_order &&
                                pathkeys_contained_in(pathkeys,
                                                        partition_pathkeys_desc);


In the example above I wanted to show that your new feature should work even
if "pathkeys" is not contained in "partition_pathkeys".

> > Another problem I see is that build_partition_pathkeys() continues even if it
> > fails to create a pathkey for some partitioning column. In the example above
> > it would mean that the table can have "partition_pathkeys" equal to {j}
> > instead of {i, j} on some EC-related conditions. However such a key does not
> > correspond to reality - this is easier to imagine if another partition is
> > considered.
> >
> > partition 3:
> >
> >  i | j | k
> > ---+---+---
> >  1 | 0 | 1
> >  1 | 0 | 0
> >
> > So I think no "partition_pathkeys" should be generated in that case. On the
> > other hand, if the function returned the part of the list it could construct
> > so far, it'd be wrong because such incomplete pathkeys could pass the checks I
> > proposed above for reasons unrelated to the partitioning scheme.
>
> Oops. That's a mistake. We should return what we have so far if we
> can't make one of the pathkeys. Will fix.

I think no "partition_pathkeys" should be created in this case, but before we
can discuss this in detail there needs to be an agreement on the evaluation of
the relationship between "pathkeys" and "partition_pathkeys", see above.

> > The following comments are mostly on coding:
> >
> > * Both qsort_partition_list_value_cmp() and qsort_partition_rbound_cmp() have
> >   this sentence in the header comment:
> >
> > Note: If changing this, see build_partition_pathkeys()
> >
> > However I could not find other use besides that in
> > RelationBuildPartitionDesc().
>
> While the new code does not call those directly, the new code does
> depend on the sort order of the partitions inside the PartitionDesc,
> which those functions are responsible for. Perhaps there's a better
> way to communicate that.

I pointed this out because I suspect that changes to these functions
would affect more features, not only the one you're trying to
implement.

> > * If pathkeys is passed, shouldn't create_append_path() include the
> >   cost_sort() of subpaths just like create_merge_append_path() does?  And if
> >   so, then create_append_path() and create_merge_append_path() might
> >   eventually have some common code (at least for the subpath processing) to be
> >   put into a separate function.
>
> It does. It's just done via the call to cost_append().

ok, I missed that.

> > * Likewise, create_append_plan() / create_merge_append_plan() are going to be
> >   more similar then before, so some refactoring could also make sense.
> >
> > Although it's not too much code, I admit the patch is not trivial, so I'm
> > curious about your opinion.
>
> I think the costing code is sufficiently different to warant not
> sharing more. For example, the startup costing is completely
> different. Append can start on the startup cost of the first subpath,
> but MergeAppend takes the sum of the startup cost of all subpaths.

ok

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com


Re: Ordered Partitioned Table Scans

David Rowley-3
On 1 November 2018 at 22:05, Antonin Houska <[hidden email]> wrote:

> I think these conditions are too restrictive:
>
>         /*
>          * Determine if these pathkeys match the partition order, or reverse
>          * partition order.  It can't match both, so only go to the trouble of
>          * checking the reverse order when it's not in ascending partition
>          * order.
>          */
>         partition_order = pathkeys_contained_in(pathkeys,
>                                                 partition_pathkeys);
>         partition_order_desc = !partition_order &&
>                                 pathkeys_contained_in(pathkeys,
>                                                         partition_pathkeys_desc);
>
>
> In the example above I wanted to show that your new feature should work even
> if "pathkeys" is not contained in "partition_pathkeys".

Okay, after a bit more time looking at this I see what you're saying
now and I agree, but see below.

>> > Another problem I see is that build_partition_pathkeys() continues even if it
>> > fails to create a pathkey for some partitioning column. In the example above
>> > it would mean that the table can have "partition_pathkeys" equal to {j}
>> > instead of {i, j} on some EC-related conditions. However such a key does not
>> > correspond to reality - this is easier to imagine if another partition is
>> > considered.
>> >
>> > partition 3:
>> >
>> >  i | j | k
>> > ---+---+---
>> >  1 | 0 | 1
>> >  1 | 0 | 0
>> >
>> > So I think no "partition_pathkeys" should be generated in that case. On the
>> > other hand, if the function returned the part of the list it could construct
>> > so far, it'd be wrong because such incomplete pathkeys could pass the checks I
>> > proposed above for reasons unrelated to the partitioning scheme.
>>
>> Oops. That's a mistake. We should return what we have so far if we
>> can't make one of the pathkeys. Will fix.
>
> I think no "partition_pathkeys" should be created in this case, but before we
> can discuss this in detail there needs to be an agreement on the evaluation of
> the relationship between "pathkeys" and "partition_pathkeys", see above.

The problem with doing that is that if the partition keys are better
than the pathkeys then we'll most likely fail to generate any
partition pathkeys at all due to the lack of any existing eclass to
use for the pathkeys. It's unsafe to use just the prefix in this case
as the eclass may not have been found due to, for example, one of the
partition keys having a different collation than the required sort
order of the query. In other words, we can't rely on a failure to
create the pathkey meaning that a stricter sort order is not required.

I'm a bit unsure how safe it would be to pass "create_it" as true
to make_pathkey_from_sortinfo(). We might be building partition
pathkeys for some sub-partitioned table, in which case the eclass
should likely have its member added with em_is_child = true.  The
existing code always sets em_is_child to false. It's not that clear to
me that setting up a new eclass with a single em_is_child = true
member is correct.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Ordered Partitioned Table Scans

David Rowley-3
On Mon, 5 Nov 2018 at 10:46, David Rowley <[hidden email]> wrote:

> On 1 November 2018 at 22:05, Antonin Houska <[hidden email]> wrote:
> > I think these conditions are too restrictive:
> >
> >         /*
> >          * Determine if these pathkeys match the partition order, or reverse
> >          * partition order.  It can't match both, so only go to the trouble of
> >          * checking the reverse order when it's not in ascending partition
> >          * order.
> >          */
> >         partition_order = pathkeys_contained_in(pathkeys,
> >                                                 partition_pathkeys);
> >         partition_order_desc = !partition_order &&
> >                                 pathkeys_contained_in(pathkeys,
> >                                                         partition_pathkeys_desc);
> >

> The problem with doing that is that if the partition keys are better
> than the pathkeys then we'll most likely fail to generate any
> partition keys at all due to lack of any existing eclass to use for
> the pathkeys. It's unsafe to use just the prefix in this case as the
> eclass may not have been found due to, for example one of the
> partition keys having a different collation than the required sort
> order of the query. In other words, we can't rely on a failure to
> create the pathkey meaning that a more strict sort order is not
> required.

I had another look at this patch and it seems okay just to add a new
flag to build_partition_pathkeys() to indicate whether the pathkey
List was truncated or not.  In generate_mergeappend_paths() we can
then just check that flag before checking if the partition pathkeys
are contained in pathkeys. It's fine if the partition keys were
truncated for the reverse of that check.

I've done this in the attached and added additional regression tests
for this case.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment: v5-0001-Allow-Append-to-be-used-in-place-of-MergeAppend-f.patch (74K)

Re: Ordered Partitioned Table Scans

Julien Rouhaud
Hi,

On Thu, Nov 22, 2018 at 11:27 AM David Rowley
<[hidden email]> wrote:

>
> On Mon, 5 Nov 2018 at 10:46, David Rowley <[hidden email]> wrote:
> > On 1 November 2018 at 22:05, Antonin Houska <[hidden email]> wrote:
> > > I think these conditions are too restrictive:
> > >
> > >         /*
> > >          * Determine if these pathkeys match the partition order, or reverse
> > >          * partition order.  It can't match both, so only go to the trouble of
> > >          * checking the reverse order when it's not in ascending partition
> > >          * order.
> > >          */
> > >         partition_order = pathkeys_contained_in(pathkeys,
> > >                                                 partition_pathkeys);
> > >         partition_order_desc = !partition_order &&
> > >                                 pathkeys_contained_in(pathkeys,
> > >                                                         partition_pathkeys_desc);
> > >
>
> > The problem with doing that is that if the partition keys are better
> > than the pathkeys then we'll most likely fail to generate any
> > partition keys at all due to lack of any existing eclass to use for
> > the pathkeys. It's unsafe to use just the prefix in this case as the
> > eclass may not have been found due to, for example one of the
> > partition keys having a different collation than the required sort
> > order of the query. In other words, we can't rely on a failure to
> > create the pathkey meaning that a more strict sort order is not
> > required.
>
> I had another look at this patch and it seems okay just to add a new
> flag to build_partition_pathkeys() to indicate if the pathkey List was
> truncated or not.  In generate_mergeappend_paths() we can then just
> check that flag before checking if the partiiton pathkeys are
> contained in pathkeys. It's fine if the partition keys were truncated
> for the reverse of that check.
>
> I've done this in the attached and added additional regression tests
> for this case.

I started to look at v5.

If I understand correctly, the new behavior is controlled by
partitions_are_ordered(), but it only checks for declared partitions,
not partitions that survived pruning.  Did I miss something or is it
the intended behavior?  Also, generate_mergeappend_paths should
probably be renamed to something like generate_sortedappend_paths
since it can now generate either Append or MergeAppend paths.

I'm also wondering if that's ok to only generate either a (sorted)
Append or a MergeAppend.  Is it possible that in some cases it's
better to have a MergeAppend rather than a sorted Append, given that
MergeAppend is parallel-aware and the sorted Append isn't?


Re: Ordered Partitioned Table Scans

David Rowley-3
On Wed, 19 Dec 2018 at 20:40, Julien Rouhaud <[hidden email]> wrote:
> I started to look at v5.

Thanks for giving this a look over.

> If I understand correctly, the new behavior is controlled by
> partitions_are_ordered(), but it only checks for declared partitions,
> not partitions that survived pruning.  Did I miss something or is it
> the intended behavior?

Yeah, it was mentioned up-thread a bit.

I wrote:

> I retrospectively read that thread after Amit mentioned your patch. I
> just disagree with Robert about caching this flag.  The reason is, if
> the flag is false due to some problematic partitions and we then go
> and prune those, we needlessly fail to optimise that case. I propose
> we come back and do the remaining optimisations for interleaved LIST
> partitions and partitioned tables with DEFAULT partitions later, once
> we have a new "live_parts" field in RelOptInfo.  That way we can just
> check the live parts to ensure they're compatible with the
> optimization. If we get what's already done committed, then that's
> already a big step forward.

The reason I'm keen to leave this alone today is that determining
which partitions are pruned requires looking at each partition's
RelOptInfo and checking if it's marked as a dummy rel. I'm trying to
minimise the overhead of this patch by avoiding doing any
per-partition processing. If we get the "live_parts" Bitmapset, then
this becomes cheaper as Bitmapsets are fairly efficient at finding the
next set member, even when they're large and sparsely populated.

> Also, generate_mergeappend_paths should
> probably be renamed to something like generate_sortedappend_paths
> since it can now generate either Append or MergeAppend paths.

You might be right about naming this something else, but
"sortedappend" sounds like an Append node with a Sort node above it.
"orderedappend" feels slightly better, although my personal vote would
be not to rename it at all.  The fact that it sometimes generates an
Append seems like an easy enough corner case to just mention in the
function body.

> I'm also wondering if that's ok to only generate either a (sorted)
> Append or a MergeAppend.  Is it possible that in some cases it's
> better to have a MergeAppend rather than a sorted Append, given that
> MergeAppend is parallel-aware and the sorted Append isn't?

That might have been worth a thought if we had parallel MergeAppends,
but we don't. You might be thinking of GatherMerge.

I've attached a v6 patch. The only change is that I've renamed the
generate_mergeappend_paths() function to
generate_orderedappend_paths(), along with the required comment
updates to go with it.
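
For reference, here's a tiny made-up example of the plan-shape change
this is aiming for (the comments show the hoped-for shapes rather than
verified EXPLAIN output):

CREATE TABLE events (ts timestamptz NOT NULL) PARTITION BY RANGE (ts);
CREATE TABLE events_2018 PARTITION OF events
  FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');
CREATE TABLE events_2019 PARTITION OF events
  FOR VALUES FROM ('2019-01-01') TO ('2020-01-01');
CREATE INDEX ON events (ts);

EXPLAIN (COSTS OFF) SELECT * FROM events ORDER BY ts;
-- unpatched: MergeAppend over the per-partition index scans
-- patched:   plain Append over the same index scans, in range order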

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

v6-0001-Allow-Append-to-be-used-in-place-of-MergeAppend-f.patch (79K) Download Attachment

Re: Ordered Partitioned Table Scans

Julien Rouhaud
On Wed, Dec 19, 2018 at 10:51 AM David Rowley
<[hidden email]> wrote:

>
> On Wed, 19 Dec 2018 at 20:40, Julien Rouhaud <[hidden email]> wrote:
>
> > If I understand correctly, the new behavior is controlled by
> > partitions_are_ordered(), but it only checks for declared partitions,
> > not partitions that survived pruning.  Did I miss something or is it
> > the intended behavior?
>
> Yeah, it was mentioned up-thread a bit.
>
> I wrote:
> > I retrospectively read that thread after Amit mentioned your patch.
> > I just disagree with Robert about caching this flag.  The reason is
> > that if the flag is false due to some problematic partitions and we
> > then go and prune those, we needlessly fail to optimise that case.
> > I propose we come back and do the remaining optimisations with
> > interleaved LIST partitions and partitioned tables with DEFAULT
> > partitions later, once we have a new "live_parts" field in
> > RelOptInfo.  That way we can just check the live parts to ensure
> > they're compatible with the optimization. If we get what's done
> > already in, then that's already a big step forward.

Ah, sorry, I did read this but I misunderstood it.  I really need to
catch up more thoroughly on what has changed for partitioning since pg11.

> The reason I'm keen to leave this alone today is that determining
> which partitions are pruned requires looking at each partition's
> RelOptInfo and checking if it's marked as a dummy rel. I'm trying to
> minimise the overhead of this patch by avoiding doing any
> per-partition processing. If we get the "live_parts" Bitmapset, then
> this becomes cheaper as Bitmapsets are fairly efficient at finding the
> next set member, even when they're large and sparsely populated.

I see.  But since for now the optimisation will only be applied when
all partitions are considered, I still think it's better to store a
bool flag in the PartitionDesc to describe whether it's natively
ordered or not, and therefore also handle the case of non-interleaved
multi-datum list partitioning.  It won't add much overhead and will
benefit way more cases.
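
For clarity, here's a sketch of what I mean (names invented just for
illustration):

-- Non-interleaved multi-datum list partitions: once the bound datums
-- are sorted, each partition's values form a contiguous block, so
-- scanning lp1 then lp2 already yields rows in "a" order.
CREATE TABLE lp (a int NOT NULL) PARTITION BY LIST (a);
CREATE TABLE lp1 PARTITION OF lp FOR VALUES IN (1, 2);
CREATE TABLE lp2 PARTITION OF lp FOR VALUES IN (3, 4);

-- Interleaved list partitions: values from ip1 and ip2 alternate, so
-- no ordering of the partitions can make a plain Append return rows
-- sorted by "a".
CREATE TABLE ip (a int NOT NULL) PARTITION BY LIST (a);
CREATE TABLE ip1 PARTITION OF ip FOR VALUES IN (1, 3);
CREATE TABLE ip2 PARTITION OF ip FOR VALUES IN (2, 4);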

We can still revisit that when a live_parts Bitmapset is available in
RelOptInfo (and maybe other flags that say whether partitions were
pruned, and/or whether the default partition was pruned).

> > Also, generate_mergeappend_paths should
> > probably be renamed to something like generate_sortedappend_paths
> > since it can now generate either Append or MergeAppend paths.
>
> You might be right about naming this something else, but
> "sortedappend" sounds like an Append node with a Sort node above it.
> "orderedappend" feels slightly better, although my personal vote would
> be not to rename it at all. Sometimes generating an Append seems like
> an easy enough corner case to mention in the function body.

Ok, I don't have a very strong opinion on it and orderedappend sounds
less ambiguous.

> > I'm also wondering if that's ok to only generate either a (sorted)
> > Append or a MergeAppend.  Is it possible that in some cases it's
> > better to have a MergeAppend rather than a sorted Append, given that
> > MergeAppend is parallel-aware and the sorted Append isn't?
>
> That might have been worth a thought if we had parallel MergeAppends,
> but we don't. You might be thinking of GatherMerge.

Ah, oops indeed :)


Re: Ordered Partitioned Table Scans

David Rowley-3
On Wed, 19 Dec 2018 at 23:25, Julien Rouhaud <[hidden email]> wrote:

> On Wed, Dec 19, 2018 at 10:51 AM David Rowley
> <[hidden email]> wrote:
> > The reason I'm keen to leave this alone today is that determining
> > which partitions are pruned requires looking at each partition's
> > RelOptInfo and checking if it's marked as a dummy rel. I'm trying to
> > minimise the overhead of this patch by avoiding doing any
> > per-partition processing. If we get the "live_parts" Bitmapset, then
> > this becomes cheaper as Bitmapsets are fairly efficient at finding the
> > next set member, even when they're large and sparsely populated.
>
> I see.  But since for now the optimisation will only be done
> considering all partitions, I still think that it's better to store a
> bool flag in the PartitionDesc to describe if it's natively ordered or
> not, and therefore also handle the case for
> non-intervleaved-multi-datums list partitioning.  It won't add much
> overhead and will benefit way more cases.

I'm not really in favour of adding a flag there only to remove it
again once we can more easily determine the pruned partitions.
Remember that the flag, because it's stored in the relation cache,
must be set accounting for all partitions. As soon as we want to add
smarts for pruned partitions, the flag becomes completely useless.  If
covering all cases in the first hit is your aim, then the way to go is
to add the live_parts field to RelOptInfo in this patch rather than in
Amit's patch in [1].  I'd much rather add the pruned-partition smarts
as part of another effort.  The most likely cases to benefit from this
are already covered by the current patch: range-partitioned tables.

[1] https://commitfest.postgresql.org/21/1778/

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Ordered Partitioned Table Scans

Julien Rouhaud
On Wed, Dec 19, 2018 at 1:18 PM David Rowley
<[hidden email]> wrote:

>
> On Wed, 19 Dec 2018 at 23:25, Julien Rouhaud <[hidden email]> wrote:
> >
> > I see.  But since for now the optimisation will only be done
> > considering all partitions, I still think that it's better to store a
> > bool flag in the PartitionDesc to describe if it's natively ordered or
> > not, and therefore also handle the case for
> > non-intervleaved-multi-datums list partitioning.  It won't add much
> > overhead and will benefit way more cases.
>
> I'm not really in favour of adding a flag there only to remove it
> again once we can more easily determine the pruned partitions.
> Remember the flag, because it's stored in the relation cache, must be
> set accounting for all partitions. As soon as we want to add smarts
> for pruned partitions, then the flag becomes completely useless for
> everything.

I don't see why we should drop this flag.  If we know that the
partitions are naturally ordered, they'll still be ordered after some
partitions have been pruned, so we can skip later checks if we already
have the information.  The only remaining cases this flag doesn't
cover are:

- partitions are naturally ordered but there's a default partition.
We could store this information and later check if the default
partition has been pruned or not
- partitions are not naturally ordered, but become naturally ordered
if enough partitions are pruned.  I may be wrong, but that doesn't
seem like a very frequent use case to me.  I'd imagine that in a lot
of cases either almost no partitions are pruned (or at least not
enough that the remaining ones are ordered), or all but one partition
is pruned.  So keeping a low overhead for the
almost-no-pruned-partitions case with naturally ordered partitions
still seems like a good idea to me.  (Both cases are sketched below.)
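
Here's what I have in mind for those two cases (names invented, and the
pruning behaviour described in the comments is what I'd hope for rather
than something I've verified against this patch):

-- Case 1: range partitions plus a DEFAULT partition.  The default can
-- hold rows from anywhere in the key space, so the scan is only safely
-- ordered once we know the default has been pruned, as the WHERE
-- clause below should allow.
CREATE TABLE r (a int NOT NULL) PARTITION BY RANGE (a);
CREATE TABLE r1 PARTITION OF r FOR VALUES FROM (0) TO (10);
CREATE TABLE r2 PARTITION OF r FOR VALUES FROM (10) TO (20);
CREATE TABLE rdef PARTITION OF r DEFAULT;
EXPLAIN (COSTS OFF) SELECT * FROM r WHERE a >= 0 AND a < 20 ORDER BY a;

-- Case 2: interleaved list partitions that only become ordered once
-- the interleaving partition (l1) is pruned, leaving l2 (2,3) and
-- l3 (5,6), which no longer overlap.
CREATE TABLE l (a int NOT NULL) PARTITION BY LIST (a);
CREATE TABLE l1 PARTITION OF l FOR VALUES IN (1, 4);
CREATE TABLE l2 PARTITION OF l FOR VALUES IN (2, 3);
CREATE TABLE l3 PARTITION OF l FOR VALUES IN (5, 6);
EXPLAIN (COSTS OFF) SELECT * FROM l WHERE a IN (2, 3, 5, 6) ORDER BY a;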

>  If covering all cases in the first hit is your aim then
> the way to go is to add the live_parts field to RelOptInfo in this
> patch rather than in Amit's patch in [1].  I'd much rather add the
> pruned partitions smarts as part of another effort.  The most likely
> cases to benefit from this are already covered by the current patch;
> range partitioned tables.

Covering all cases is definitely not my goal here, just grabbing the
low-hanging fruit.  The multi-level partitioning case, for instance,
is another thing that would need to be handled (and that's the main
reason I couldn't submit a new patch when I was working on it), and
I'm definitely not arguing to cover it in this patch.  That being
said, I'll try to have a look at this patch too, but as I said I have
a lot of catching-up to do in this part of the code, so I'm afraid
I won't be super efficient.
