Partition-wise join for join between (declaratively) partitioned tables

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
267 messages Options
1234 ... 14
Reply | Threaded
Open this post in threaded view
|

Partition-wise join for join between (declaratively) partitioned tables

Ashutosh Bapat
Amit Langote is working on supporting declarative partitioning in PostgreSQL [1]. I have started working on supporting partition-wise join. This mail describes very high level design and some insight into the performance improvements.

An equi-join between two partitioned tables can be broken down into pair-wise join between their partitions. This technique is called partition-wise join. Partition-wise joins between similarly partitioned tables with equi-join condition can be efficient because [2]
1. Each provably non-empty partition-wise join smaller. All such joins collectively might be more efficient than the join between their parent.
2. Such joins are able to exploit properties of partitions like indexes, their storage etc.
3. An N-way partition-wise join may have different efficient join orders compared to the efficient join order between the parent tables.

A partition-wise join is processed in following stages [2], [3].
1. Applicability testing: This phase checks if the join conditions match the partitioning scheme. A partition-wise join is efficient if there is an equi-join on the partition keys. E.g. join between tables R and S partitioned by columns a and b resp. can be broken down into partition-wise joins if there exists a join condition is R.a = S.b. Or in other words the number of provably non-empty partition-wise joins is O(N) where N is the number of partitions.

2. Matching: This phase determines which joins between the partitions of R and S can potentially produce tuples in the join and prunes empty joins between partitions.

3. Clustering: This phase aims at reducing the number of partition-wise joins by clubbing together partitions from joining relations. E.g. clubbing multiple partitions from either of the partitioned relations which can join to a single partition from the other partitioned relation.

4. Path/plan creation: This phase creates multiple paths for each partition-wise join. It also creates Append path/s representing the union of partition-wise joins.

The work here focuses on a subset of use-cases discussed in [2]. It only considers partition-wise join for join between similarly partitioned tables with same number of partitions with same properties, thus producing at most as many partition-wise joins as there are partitions. It should be possible to apply partition-wise join technique (with some special handling for OUTER joins) if both relations have some extra partitions with non-overlapping partition conditions, apart from the matching partitions. But I am not planning to implement this optimization in the first cut.

The attached patch is a POC implementation of partition-wise join. It is is based on the set of patches posted on 23rd May 2016 by Amit Langote for declarative partitioning. The patch gives an idea about the approach used. It has several TODOs, which I am working on.

Attached is a script with output which measures potential performance improvement because of partition-wise join. The script uses a GUC enable_partition_wise_join to disable/enable this feature for performance measurement. The scripts measures performance improvement of a join between two tables partitioned by range on integer column. Each table contains 50K rows. Each table has an integer and a varchar column. It shows around 10-15% reduction in execution time when partition-wise join is used. Accompanied with parallel query and FDWs, it opens up avenues for further improvements for joins between partitioned tables.

[1]. https://www.postgresql.org/message-id/55D3093C.5010800@...
[2]. https://users.cs.duke.edu/~shivnath/papers/sigmod295-herodotou.pdf
[3]. https://users.cs.duke.edu/~shivnath/tmp/paqo_draft.pdf

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

partitioned_join.out (14K) Download Attachment
partitioned_join.sql (4K) Download Attachment
pg_dp_join_POC.patch (32K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Partition-wise join for join between (declaratively) partitioned tables

Robert Haas
On Wed, Jun 15, 2016 at 3:25 AM, Ashutosh Bapat
<[hidden email]> wrote:

> Amit Langote is working on supporting declarative partitioning in PostgreSQL
> [1]. I have started working on supporting partition-wise join. This mail
> describes very high level design and some insight into the performance
> improvements.
>
> An equi-join between two partitioned tables can be broken down into
> pair-wise join between their partitions. This technique is called
> partition-wise join. Partition-wise joins between similarly partitioned
> tables with equi-join condition can be efficient because [2]
> 1. Each provably non-empty partition-wise join smaller. All such joins
> collectively might be more efficient than the join between their parent.
> 2. Such joins are able to exploit properties of partitions like indexes,
> their storage etc.
> 3. An N-way partition-wise join may have different efficient join orders
> compared to the efficient join order between the parent tables.
>
> A partition-wise join is processed in following stages [2], [3].
> 1. Applicability testing: This phase checks if the join conditions match the
> partitioning scheme. A partition-wise join is efficient if there is an
> equi-join on the partition keys. E.g. join between tables R and S
> partitioned by columns a and b resp. can be broken down into partition-wise
> joins if there exists a join condition is R.a = S.b. Or in other words the
> number of provably non-empty partition-wise joins is O(N) where N is the
> number of partitions.
>
> 2. Matching: This phase determines which joins between the partitions of R
> and S can potentially produce tuples in the join and prunes empty joins
> between partitions.
>
> 3. Clustering: This phase aims at reducing the number of partition-wise
> joins by clubbing together partitions from joining relations. E.g. clubbing
> multiple partitions from either of the partitioned relations which can join
> to a single partition from the other partitioned relation.
>
> 4. Path/plan creation: This phase creates multiple paths for each
> partition-wise join. It also creates Append path/s representing the union of
> partition-wise joins.
>
> The work here focuses on a subset of use-cases discussed in [2]. It only
> considers partition-wise join for join between similarly partitioned tables
> with same number of partitions with same properties, thus producing at most
> as many partition-wise joins as there are partitions. It should be possible
> to apply partition-wise join technique (with some special handling for OUTER
> joins) if both relations have some extra partitions with non-overlapping
> partition conditions, apart from the matching partitions. But I am not
> planning to implement this optimization in the first cut.

I haven't reviewed this code yet due to being busy with 9.6, but I
think this is a very important query planner improvement with the
potential for big wins on queries involving large amounts of data.

Suppose we have a pair of equi-partitioned tables.  Right now, if we
choose to perform a hash join, we'll have to build a giant hash table
with all of the rows from every inner partition and then probe it for
every row in every outer partition.  If there are few enough inner
rows that the resultant hash table still fits in work_mem, this is
somewhat inefficient but not terrible - but if it causes us to have to
batch the hash join where we otherwise would not need to do so, then
it really sucks.  Similarly, if we decide to merge-join each pair of
partitions, a partitionwise join may be able to use an internal sort
on some or all partitions whereas if we had to deal with all of the
data at the same time we'd need an external sort, possibly multi-pass.
  And if we choose a nested loop, say over an inner index-scan, we do
O(outer rows) index probes with this optimization but O(outer rows *
inner partitions) index probes without it.

In addition, parallel query can benefit significantly from this kind
of optimization.  Tom recently raised the case of an appendrel where
every child has a parallel-safe path but not every child has a partial
path; currently, we can't go parallel in that case, but it's easy to
see that we could handle it by scheduling the appendrel's children
across a pool of workers.  If we had this optimization, that sort of
thing would be much more likely to be useful, because it could create
appendrels where each member is an N-way join between equipartitioned
tables.  That's particularly important right now because of the
restriction that a partial path must be driven by a Parallel SeqScan,
but even after that restriction is lifted it's easy to imagine that
the effective degree of parallelism for a single index scan may be
limited - so this kind of thing may significantly increase the number
of workers that a given query can use productively.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Partition-wise join for join between (declaratively) partitioned tables

Ashutosh Bapat


On Fri, Jul 8, 2016 at 12:11 AM, Robert Haas <[hidden email]> wrote:

I haven't reviewed this code yet due to being busy with 9.6, but I
think this is a very important query planner improvement with the
potential for big wins on queries involving large amounts of data.

Suppose we have a pair of equi-partitioned tables.  Right now, if we
choose to perform a hash join, we'll have to build a giant hash table
with all of the rows from every inner partition and then probe it for
every row in every outer partition.  If there are few enough inner
rows that the resultant hash table still fits in work_mem, this is
somewhat inefficient but not terrible - but if it causes us to have to
batch the hash join where we otherwise would not need to do so, then
it really sucks.  Similarly, if we decide to merge-join each pair of
partitions, a partitionwise join may be able to use an internal sort
on some or all partitions whereas if we had to deal with all of the
data at the same time we'd need an external sort, possibly multi-pass.

Or we might be able to use indexes directly without need of a MergeAppend.
 
  And if we choose a nested loop, say over an inner index-scan, we do
O(outer rows) index probes with this optimization but O(outer rows *
inner partitions) index probes without it.

In addition, parallel query can benefit significantly from this kind
of optimization.  Tom recently raised the case of an appendrel where
every child has a parallel-safe path but not every child has a partial
path; currently, we can't go parallel in that case, but it's easy to
see that we could handle it by scheduling the appendrel's children
across a pool of workers.  If we had this optimization, that sort of
thing would be much more likely to be useful, because it could create
appendrels where each member is an N-way join between equipartitioned
tables.  That's particularly important right now because of the
restriction that a partial path must be driven by a Parallel SeqScan,
but even after that restriction is lifted it's easy to imagine that
the effective degree of parallelism for a single index scan may be
limited - so this kind of thing may significantly increase the number
of workers that a given query can use productively.

+1.

The attached patch implements the logic to assess whether two partitioned
tables can be joined using partition-wise join technique described in my last
mail on this thread.

Two partitioned relations are considered for partition-wise join if following
conditions are met (See build_joinrel_part_info() for details):
1. Both the partitions have same number of partitions, with same number of
partition keys and partitioned by same strategy - range or list.
2. They have matching datatypes for partition keys (partkey_types_match())
3. For list partitioned relations, they have same lists for each pair of
partitions, paired by position in which they appear.
4. For range partitioned relations, they have same bounds for each pair of
partitions, paired by their position when ordered in ascending fashion on the
upper bounds.
5. There exists an equi-join condition for each pair of partition keys, paired
by the position in which they appear.

Partition-wise join technique can be applied under more lenient constraints [1]
e.g. joins between tables with different number of partitions but having same
bounds/lists for the common partitions. I am planning to defer that to a later
version of this feature.

A join executed using partition-wise join technique is itself a relation
partitioned by the similar partitioning scheme as the joining relations with
the partition keys combined from the joining relations.

A PartitionOptInfo (uses name similar to RelOptInfo or IndexOptInfo) structure
is used to store the partitioning information for a given base or relation.
In build_simple_rel(), we construct PartitionOptInfo structure for the given
base relation by copying the relation's PartitionDesc and PartitionKey
(structures from Amit Langote's patch). While doing so, all the partition keys
are stored as expressions. The structure also holds the RelOptInfos of the
partition relations. For a join relation, most of the PartitionOptInfo is
copied from either of the joining relations, except the partition keys and
RelOptInfo of partition relations. Partition keys of the join relations are
created by combing partition keys from both the joining relations. The logic to
cosnstruct RelOptInfo for the partition-wise join relations is yet to be
implemented.

Since the logic to create the paths and RelOptInfos for partition-wise join
relations is not implemented yet, a query which can use partition-wise join
fails with error
"ERROR: the relation was considered for partition-wise join, which is not
supported right now.". It will also print messages to show which of the joins
can and can not use partition-wise join technique e.g.
"NOTICE:  join between relations (b 1) and (b 2) is considered for
partition-wise join." The relations are indicated by their relid in the query.
OR
"NOTICE:  join between relations (b 1) and (b 2) is NOT considered for
partition-wise join.".
These messages are for debugging only, and will be removed once path creation
logic is implemented.

The patch adds a test partition_join.sql, which has a number of positive and
negative testcases for joins between partitioned tables.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

pg_dp_join_assess_phase.patch (217K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Partition-wise join for join between (declaratively) partitioned tables

Ashutosh Bapat
Sorry forgot to mention: this patch applies on top of the v7 patches posted by Amit Langote on 27th June (https://www.postgresql.org/message-id/81371428-bb4b-1e33-5ad6-8c5c51b52cb7%40lab.ntt.co.jp).

On Tue, Jul 19, 2016 at 7:41 PM, Ashutosh Bapat <[hidden email]> wrote:


On Fri, Jul 8, 2016 at 12:11 AM, Robert Haas <[hidden email]> wrote:

I haven't reviewed this code yet due to being busy with 9.6, but I
think this is a very important query planner improvement with the
potential for big wins on queries involving large amounts of data.

Suppose we have a pair of equi-partitioned tables.  Right now, if we
choose to perform a hash join, we'll have to build a giant hash table
with all of the rows from every inner partition and then probe it for
every row in every outer partition.  If there are few enough inner
rows that the resultant hash table still fits in work_mem, this is
somewhat inefficient but not terrible - but if it causes us to have to
batch the hash join where we otherwise would not need to do so, then
it really sucks.  Similarly, if we decide to merge-join each pair of
partitions, a partitionwise join may be able to use an internal sort
on some or all partitions whereas if we had to deal with all of the
data at the same time we'd need an external sort, possibly multi-pass.

Or we might be able to use indexes directly without need of a MergeAppend.
 
  And if we choose a nested loop, say over an inner index-scan, we do
O(outer rows) index probes with this optimization but O(outer rows *
inner partitions) index probes without it.

In addition, parallel query can benefit significantly from this kind
of optimization.  Tom recently raised the case of an appendrel where
every child has a parallel-safe path but not every child has a partial
path; currently, we can't go parallel in that case, but it's easy to
see that we could handle it by scheduling the appendrel's children
across a pool of workers.  If we had this optimization, that sort of
thing would be much more likely to be useful, because it could create
appendrels where each member is an N-way join between equipartitioned
tables.  That's particularly important right now because of the
restriction that a partial path must be driven by a Parallel SeqScan,
but even after that restriction is lifted it's easy to imagine that
the effective degree of parallelism for a single index scan may be
limited - so this kind of thing may significantly increase the number
of workers that a given query can use productively.

+1.

The attached patch implements the logic to assess whether two partitioned
tables can be joined using partition-wise join technique described in my last
mail on this thread.

Two partitioned relations are considered for partition-wise join if following
conditions are met (See build_joinrel_part_info() for details):
1. Both the partitions have same number of partitions, with same number of
partition keys and partitioned by same strategy - range or list.
2. They have matching datatypes for partition keys (partkey_types_match())
3. For list partitioned relations, they have same lists for each pair of
partitions, paired by position in which they appear.
4. For range partitioned relations, they have same bounds for each pair of
partitions, paired by their position when ordered in ascending fashion on the
upper bounds.
5. There exists an equi-join condition for each pair of partition keys, paired
by the position in which they appear.

Partition-wise join technique can be applied under more lenient constraints [1]
e.g. joins between tables with different number of partitions but having same
bounds/lists for the common partitions. I am planning to defer that to a later
version of this feature.

A join executed using partition-wise join technique is itself a relation
partitioned by the similar partitioning scheme as the joining relations with
the partition keys combined from the joining relations.

A PartitionOptInfo (uses name similar to RelOptInfo or IndexOptInfo) structure
is used to store the partitioning information for a given base or relation.
In build_simple_rel(), we construct PartitionOptInfo structure for the given
base relation by copying the relation's PartitionDesc and PartitionKey
(structures from Amit Langote's patch). While doing so, all the partition keys
are stored as expressions. The structure also holds the RelOptInfos of the
partition relations. For a join relation, most of the PartitionOptInfo is
copied from either of the joining relations, except the partition keys and
RelOptInfo of partition relations. Partition keys of the join relations are
created by combing partition keys from both the joining relations. The logic to
cosnstruct RelOptInfo for the partition-wise join relations is yet to be
implemented.

Since the logic to create the paths and RelOptInfos for partition-wise join
relations is not implemented yet, a query which can use partition-wise join
fails with error
"ERROR: the relation was considered for partition-wise join, which is not
supported right now.". It will also print messages to show which of the joins
can and can not use partition-wise join technique e.g.
"NOTICE:  join between relations (b 1) and (b 2) is considered for
partition-wise join." The relations are indicated by their relid in the query.
OR
"NOTICE:  join between relations (b 1) and (b 2) is NOT considered for
partition-wise join.".
These messages are for debugging only, and will be removed once path creation
logic is implemented.

The patch adds a test partition_join.sql, which has a number of positive and
negative testcases for joins between partitioned tables.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Reply | Threaded
Open this post in threaded view
|

Re: Partition-wise join for join between (declaratively) partitioned tables

Ashutosh Bapat
Hi All,

PFA the patch to support partition-wise joins for partitioned tables. The patch
is based on the declarative parition support patches provided by Amit Langote
on 26th August 2016. The previous patch added support to assess whether two
tables can be joined using partition-wise join technique, but did not have
complete support to create plans which used partition-wise technique. This
patch implements three important pieces for supporting partition-wise join

1. Logic to assess whether join between two partitioned tables can be executed
using partition-wise join technique.
2. Construct RelOptInfo's representating join between matching partitions of
the joining relations and add join paths to those RelOptInfo's
3. Add append paths to the RelOptInfo representing the join between partitioned
tables. Rest of the planner code chooses the optimal path for join.

make_join_rel() now calls try_partition_wise_join(), which executes all of the
steps listed above. If the joining partitioned relations are deemed fit for
partition-wise join, we create one RelOptInfo (if not already present)
representing a join between every pair of partitions to be joined. Since the
join between parents is deemed legal, the join between the partitions is also
legal, hence legality of the join is not checked again. RelOptInfo representing
the join between partitions is constructed by translating the relevant members
of RelOptInfo of the parent join relation. Similarly SpecialJoinInfo,
restrictlist (for given join order) are constructed by translating those for
the parent join.

make_join_rel() is split into two portions, a. that deals with constructing
restrictlist and RelOptInfo for join relation b. that creates paths for the
join. The second portion is separated into a function
populate_joinrel_with_paths(), which is reused in try_partition_wise_join() to
create paths for join between matching partitions.

set_append_rel_pathlist() generates paths for child relations, marks the empty
children as dummy relations and creates append paths by collecting paths with
similar properties (parameterization and pathkeys) from non-empty children. It
then adds append paths to the parent relation. This patch divides
set_append_rel_pathlist() into two parts a. marking empty child relations as
dummy and generating paths for non-empty children. b. collecting children paths
into append paths for parent. Part b is separate into a function
add_paths_to_append_rel() which is reused for collecting paths from
partition-wise join child relations to construct append paths for join between
partitioned tables.

For an N-way join between partitioned tables, make_join_rel() is called as many
times as the number of valid join orders exist. For each such call, we will add
paths to join between partitions for corresponding join order between those
partitions. We can generate the append paths for parent joinrel only after all
such join orders have been considered. Hence before setting cheapest path forx
parent join relation, we set the cheapest path for each join relation between
partitions, followed by creating append paths for the parent joinrel. This
method needs some readjustment for multi-level partitions (TODO item 2 below).

A GUC enable_partition_wise_join is added to enable or disable partition-wise
join technique. I think the GUC is useful similar to other join related GUCs
like enable_hashjoin.

parameterized paths: While creating parameterized paths for child relations of
a partitioned tables, we do not have an idea as to whether we will be able to
use partition-wise join technique or not. Also we do not know the child
partition of the other partitioned table, to which a given partition would
join. Hence we do not create paths parameterized by child partitions of other
partitioned relations. But path for child of a partitioned relation
parameterized by other parent relation, can be considered to be parameterised
by any child relation of the other partitioned relation by replacing the parent
parameters by corresponding child parameters. This condition is used to
eliminate parameterized paths while creating merge and hash joins, to decide
the resultant parameterization of a join between child partitions and to create
nested loop paths with inner path parameterized by outer relation where inner
and outer relations are child partitions. While creating such nest loop join
paths we translate the path parameterized by other parent partitioned relation,
to that parameterized by the required child.

Functions like select_outer_pathkeys_for_merge(), make_sort_from_pathkeys(),
find_ec_member_for_tle() which did not expect to be called for a child
relation, are now used for child partition relations for joins. These functions
are adjusted for that usage.

Testing:
I have added partition_join.sql testcase to test partition-wise join feature.
That file has extensive tests for list, range, multi-level partitioning schemes
and various kinds of joins including nested loop join with inner relation
parameterized by outer relationThat file has extensive tests for list, range,
multi-level partitioning schemes and various kinds of joins including nested
loop join with inner relation parameterized by outer relation.

make check passes clean.

TODOs:

1. Instead of storing partitioning information in RelOptInfo of each of the
partitioned relations (base and join relations), we can keep a list of
canonical partition schemes in PlannerInfo. Every RelOptInfo gets a pointer to
the member of list representing the partitioning scheme of corresponding
relation. RelOptInfo's of all similarly partitioned relations get the same
pointer thus making it easy to match the partitioning schemes by comparing the
pointers. While we are supporting only exact partition matching scheme now,
it's possible to extend this method to match compatible partitioning schemes by
maintaining a list of compatible partitioning schemes.

Right now, I have moved some partition related structures from partition.c to
partition.h. These structures are still being reviewed and might change when
Amit Langote improves his patches. Having canonical partitioning scheme in
PlannerInfo may not require moving those structures out. So, that code is still
under development. A related change is renaming RangeBound structure in Amit
Langote's patches to PartitionRangeBound to avoid name conflict with
rangetypes.h. That change too should vanish once we decide where to keep that
structure and its final name.

2. Multi-level partitioned tables: For some reason path created for joining
partitions are not being picked up as the cheapest paths. I think, we need to
finalize the lower level paths before moving upwards in the partition
hierarchy. But I am yet to investigate the issue here. RelOptInfo::parent_relid
should point to top parents rather than immediate parents.

3. Testing: need more tests for testing partition-wise join with foreign tables
as partitions. More tests for parameterized joins for multi-level partitioned
joins.

4. Remove bms_to_char(): I have added this function to print Relids in the
debugger. I have found it very useful to quickly examine Relids in debugger,
which otherwise wasn't so easy. If others find it useful too, I can create a
separate patch to be considered for a separate commit.

5. In add_paths_to_append_rel() to find the possible set of outer relations for
generating parameterized paths for a given join. This code needs to be adjusted
to eliminate the parent relations possible set of outer relations for a join
between child partitions.

6. Add support to reparameterize more types of paths for child relations. I
will add this once we finalize the method to reparameterize a parent path for
child partition.

7. The patch adds make_joinrel() (name needs to be changed because of its
similariy with make_join_rel()) to construct an empty RelOptInfo for a join
between partitions. The function copies code doing the same from
build_join_rel(). build_join_rel() too can use this function, if we decide to
retain it.

8. Few small TODOs related to code reorganization, proper function,
variable naming etc. are in the patch. pg_indent run.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

pg_dp_join.patch (712K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Partition-wise join for join between (declaratively) partitioned tables

akapila
On Fri, Sep 9, 2016 at 3:17 PM, Ashutosh Bapat
<[hidden email]> wrote:
>
> 4. Remove bms_to_char(): I have added this function to print Relids in the
> debugger. I have found it very useful to quickly examine Relids in debugger,
> which otherwise wasn't so easy. If others find it useful too, I can create a
> separate patch to be considered for a separate commit.
>

+1 to have such a function.  I often need something like that whenever
I debug the optimizer code.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Partition-wise join for join between (declaratively) partitioned tables

Amit Langote-2
In reply to this post by Ashutosh Bapat
On 2016/09/09 18:47, Ashutosh Bapat wrote:
> A related change is renaming RangeBound structure in Amit
> Langote's patches to PartitionRangeBound to avoid name conflict with
> rangetypes.h. That change too should vanish once we decide where to keep
> that structure and its final name.

This change has been incorporated into the latest patch I posted on Sep 9 [1].

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/28ee345c-1278-700e-39a7-36a71f9a3b43@...




--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Partition-wise join for join between (declaratively) partitioned tables

rajkumar.raghuwanshi
In reply to this post by Ashutosh Bapat

On Fri, Sep 9, 2016 at 3:17 PM, Ashutosh Bapat <[hidden email]> wrote:
Hi All,

PFA the patch to support partition-wise joins for partitioned tables. The patch
is based on the declarative parition support patches provided by Amit Langote
on 26th August 2016.

I have applied declarative partitioning patches posted by Amit Langote on 26 Aug 2016 and then partition-wise-join patch,  getting below error while make install.

../../../../src/include/nodes/relation.h:706: error: redefinition of typedef ‘PartitionOptInfo’
../../../../src/include/nodes/relation.h:490: note: previous declaration of ‘PartitionOptInfo’ was here
make[4]: *** [gistbuild.o] Error 1
make[4]: Leaving directory `/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src/backend/access/gist'
make[3]: *** [gist-recursive] Error 2
make[3]: Leaving directory `/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src/backend/access'
make[2]: *** [access-recursive] Error 2
make[2]: Leaving directory `/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src/backend'
make[1]: *** [all-backend-recurse] Error 2
make[1]: Leaving directory `/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src'
make: *** [all-src-recurse] Error 2

PS : I am using - gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17)

Attached the patch for the fix of above error.

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

pwj_install_fix.patch (940 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Partition-wise join for join between (declaratively) partitioned tables

Ashutosh Bapat
On Fri, Sep 16, 2016 at 6:00 PM, Rajkumar Raghuwanshi
<[hidden email]> wrote:

>
> On Fri, Sep 9, 2016 at 3:17 PM, Ashutosh Bapat
> <[hidden email]> wrote:
>>
>> Hi All,
>>
>> PFA the patch to support partition-wise joins for partitioned tables. The
>> patch
>> is based on the declarative parition support patches provided by Amit
>> Langote
>> on 26th August 2016.
>
>
> I have applied declarative partitioning patches posted by Amit Langote on 26
> Aug 2016 and then partition-wise-join patch,  getting below error while make
> install.
>
> ../../../../src/include/nodes/relation.h:706: error: redefinition of typedef
> ‘PartitionOptInfo’
> ../../../../src/include/nodes/relation.h:490: note: previous declaration of
> ‘PartitionOptInfo’ was here
> make[4]: *** [gistbuild.o] Error 1
> make[4]: Leaving directory
> `/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src/backend/access/gist'
> make[3]: *** [gist-recursive] Error 2
> make[3]: Leaving directory
> `/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src/backend/access'
> make[2]: *** [access-recursive] Error 2
> make[2]: Leaving directory
> `/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src/backend'
> make[1]: *** [all-backend-recurse] Error 2
> make[1]: Leaving directory
> `/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src'
> make: *** [all-src-recurse] Error 2
>
> PS : I am using - gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17)
>
> Attached the patch for the fix of above error.

Thanks for the report. I will fix this in the next patch.



--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Partition-wise join for join between (declaratively) partitioned tables

Ashutosh Bapat
In reply to this post by Ashutosh Bapat
PFA patch which takes care of some of the TODOs mentioned in my
previous mail. The patch is based on the set of patches supporting
declarative partitioning by Amit Langoted posted on 26th August.

>
> TODOs:
>
> 1. Instead of storing partitioning information in RelOptInfo of each of the
> partitioned relations (base and join relations), we can keep a list of
> canonical partition schemes in PlannerInfo. Every RelOptInfo gets a pointer
> to
> the member of list representing the partitioning scheme of corresponding
> relation. RelOptInfo's of all similarly partitioned relations get the same
> pointer thus making it easy to match the partitioning schemes by comparing
> the
> pointers. While we are supporting only exact partition matching scheme now,
> it's possible to extend this method to match compatible partitioning schemes
> by
> maintaining a list of compatible partitioning schemes.
>
> Right now, I have moved some partition related structures from partition.c
> to
> partition.h. These structures are still being reviewed and might change when
> Amit Langote improves his patches. Having canonical partitioning scheme in
> PlannerInfo may not require moving those structures out. So, that code is
> still
> under development. A related change is renaming RangeBound structure in Amit
> Langote's patches to PartitionRangeBound to avoid name conflict with
> rangetypes.h. That change too should vanish once we decide where to keep
> that
> structure and its final name.
Done.

>
> 2. Multi-level partitioned tables: For some reason path created for joining
> partitions are not being picked up as the cheapest paths. I think, we need
> to
> finalize the lower level paths before moving upwards in the partition
> hierarchy. But I am yet to investigate the issue here.
> RelOptInfo::parent_relid
> should point to top parents rather than immediate parents.

Done

>
> 3. Testing: need more tests for testing partition-wise join with foreign
> tables
> as partitions. More tests for parameterized joins for multi-level
> partitioned
> joins.

Needs to be done.

>
> 4. Remove bms_to_char(): I have added this function to print Relids in the
> debugger. I have found it very useful to quickly examine Relids in debugger,
> which otherwise wasn't so easy. If others find it useful too, I can create a
> separate patch to be considered for a separate commit.

I will take care of this after rebasing the patch on the latest
sources and latest set of patches by Amit Langote.

>
> 5. In add_paths_to_append_rel() to find the possible set of outer relations
> for
> generating parameterized paths for a given join. This code needs to be
> adjusted
> to eliminate the parent relations possible set of outer relations for a join
> between child partitions.

Done.

>
> 6. Add support to reparameterize more types of paths for child relations. I
> will add this once we finalize the method to reparameterize a parent path
> for
> child partition.

Will wait for reviewer's opinion.

>
> 7. The patch adds make_joinrel() (name needs to be changed because of its
> similariy with make_join_rel()) to construct an empty RelOptInfo for a join
> between partitions. The function copies code doing the same from
> build_join_rel(). build_join_rel() too can use this function, if we decide
> to
> retain it.

This will be done as a separate cleanup patch.

>
> 8. Few small TODOs related to code reorganization, proper function,
> variable naming etc. are in the patch. pg_indent run.

I have taken care of most of the TODOs. But there are still some TODOs
remaining. I will take care of those in the next version of patches.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

pg_dp_join_v2.patch (713K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Partition-wise join for join between (declaratively) partitioned tables

rajkumar.raghuwanshi
In reply to this post by Ashutosh Bapat
Hi,

I got a server crash with partition_wise_join, steps to reproduce given below.

postgres=# set enable_partition_wise_join=true;
SET
postgres=# CREATE TABLE tbl (a int,c text) PARTITION BY LIST(a);
CREATE TABLE
postgres=# CREATE TABLE tbl_p1 PARTITION OF tbl FOR VALUES IN (1, 2);
CREATE TABLE
postgres=# CREATE TABLE tbl_p2 PARTITION OF tbl FOR VALUES IN (3, 4);
CREATE TABLE
postgres=# INSERT INTO tbl VALUES (1,'P1'),(2,'P1'),(3,'P2'),(4,'P2');
INSERT 0 4
postgres=# EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM tbl t1 INNER JOIN tbl t2 ON (t1.a = t2.a) WHERE t1.c = 'P1' AND t1.c  =  'P2';
NOTICE:  join between relations (b 1) and (b 2) is considered for partition-wise join.
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation
Reply | Threaded
Open this post in threaded view
|

Re: Partition-wise join for join between (declaratively) partitioned tables

Ashutosh Bapat
In reply to this post by rajkumar.raghuwanshi
Hi Rajkumar,


On Fri, Sep 16, 2016 at 6:00 PM, Rajkumar Raghuwanshi
<[hidden email]> wrote:

>
> On Fri, Sep 9, 2016 at 3:17 PM, Ashutosh Bapat
> <[hidden email]> wrote:
>>
>> Hi All,
>>
>> PFA the patch to support partition-wise joins for partitioned tables. The
>> patch
>> is based on the declarative parition support patches provided by Amit
>> Langote
>> on 26th August 2016.
>
>
> I have applied declarative partitioning patches posted by Amit Langote on 26
> Aug 2016 and then partition-wise-join patch,  getting below error while make
> install.
>
> ../../../../src/include/nodes/relation.h:706: error: redefinition of typedef
> ‘PartitionOptInfo’
> ../../../../src/include/nodes/relation.h:490: note: previous declaration of
> ‘PartitionOptInfo’ was here
> make[4]: *** [gistbuild.o] Error 1
> make[4]: Leaving directory
> `/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src/backend/access/gist'
> make[3]: *** [gist-recursive] Error 2
> make[3]: Leaving directory
> `/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src/backend/access'
> make[2]: *** [access-recursive] Error 2
> make[2]: Leaving directory
> `/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src/backend'
> make[1]: *** [all-backend-recurse] Error 2
> make[1]: Leaving directory
> `/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src'
> make: *** [all-src-recurse] Error 2
>
> PS : I am using - gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17)
>

Thanks for the report and the patch.

This is fixed by the patch posted with
https://www.postgresql.org/message-id/CAFjFpRdRFWMc4zNjeJB6p1Ncpznc9DMdXfZJmVK5X_us5zeD9Q%40mail.gmail.com.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Partition-wise join for join between (declaratively) partitioned tables

rajkumar.raghuwanshi
In reply to this post by Ashutosh Bapat
On Tue, Sep 20, 2016 at 4:26 PM, Ashutosh Bapat <[hidden email]> wrote:
PFA patch which takes care of some of the TODOs mentioned in my
previous mail. The patch is based on the set of patches supporting
declarative partitioning by Amit Langoted posted on 26th August.

I have applied declarative partitioning patches posted by Amit Langote on 26 Aug 2016 and then latest partition-wise-join patch,  getting below error while make install.

../../../../src/include/catalog/partition.h:37: error: redefinition of typedef ‘PartitionScheme’
../../../../src/include/nodes/relation.h:492: note: previous declaration of ‘PartitionScheme’ was here
make[4]: *** [commit_ts.o] Error 1
make[4]: Leaving directory `/home/edb/Desktop/edb_work/WORKDB/PG_PWJ/postgresql/src/backend/access/transam'
make[3]: *** [transam-recursive] Error 2
make[3]: Leaving directory `/home/edb/Desktop/edb_work/WORKDB/PG_PWJ/postgresql/src/backend/access'
make[2]: *** [access-recursive] Error 2
make[2]: Leaving directory `/home/edb/Desktop/edb_work/WORKDB/PG_PWJ/postgresql/src/backend'
make[1]: *** [all-backend-recurse] Error 2
make[1]: Leaving directory `/home/edb/Desktop/edb_work/WORKDB/PG_PWJ/postgresql/src'
make: *** [all-src-recurse] Error 2

PS : I am using - gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17)

I have commented below statement in src/include/catalog/partition.h file and then tried to install, it worked fine.

/* typedef struct PartitionSchemeData    *PartitionScheme; */

Thanks & Regards,
Rajkumar Raghuwanshi

Reply | Threaded
Open this post in threaded view
|

Re: Partition-wise join for join between (declaratively) partitioned tables

Ashutosh Bapat
> ../../../../src/include/catalog/partition.h:37: error: redefinition of
> typedef ‘PartitionScheme’
> ../../../../src/include/nodes/relation.h:492: note: previous declaration of
> ‘PartitionScheme’ was here
[...]
>
> PS : I am using - gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17)

Thanks for the report. For some reason, I am not getting these errors
with my compiler

[ashutosh@ubuntu regress]gcc --version
gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3

Anyway, I have fixed it in the attached patch.

The patch is based on sources upto commit

commit 2a7f4f76434d82eb0d1b5f4f7051043e1dd3ee1a
Author: Heikki Linnakangas <[hidden email]>
Date:   Wed Sep 21 13:24:13 2016 +0300

and Amit Langote's set of patches posted on 15th Sept. 2016 [1]

There are few implementation details that need to be worked out like
1. adjust_partitionrel_attrs() calls adjust_appendrel_attrs() as many
times as the number of base relations in the join, possibly producing
a new expression tree in every call. It can be optimized to call
adjust_appendrel_attrs() only once. I will work on that if reviewers
agree that adjust_partitionrel_attrs() is needed and should be
optimized.

2. As mentioned in earlier mails, the paths parameterized by parent
partitioned table are translated to be parameterized by child
partitions. That code needs to support more kinds of paths. I will
work on that, if reviewers agree that the approach of translating
paths is acceptable.

3. Because of an issue with declarative partitioning patch [2]
multi-level partition table tests are failing in partition_join.sql.
Those were not failing with an earlier set of patches supporting
declarative partitions. Those will be fixed based on the discussion in
that thread.

4. More tests for foreign tables as partitions and for multi-level
partitioned tables.

5. The tests use unpartitioned tables for verifying results. Those
tables and corresponding SQL statements will be removed once the tests
are finalised.

[1]. https://www.postgresql.org/message-id/e5c1c9cf-3f5a-c4d7-6047-7351147aaef9%40lab.ntt.co.jp
[2]. https://www.postgresql.org/message-id/CAFjFpRc%3DT%2BCjpGNkNSdOkHza8VAPb35bngaCdAzPgBkhijmJhg%40mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

pg_dp_join_v3.patch (720K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Partition-wise join for join between (declaratively) partitioned tables

rajkumar.raghuwanshi

On Thu, Sep 22, 2016 at 4:11 PM, Ashutosh Bapat <[hidden email]> wrote:
The patch is based on sources upto commit

commit 2a7f4f76434d82eb0d1b5f4f7051043e1dd3ee1a
Author: Heikki Linnakangas <[hidden email]>
Date:   Wed Sep 21 13:24:13 2016 +0300

and Amit Langote's set of patches posted on 15th Sept. 2016 [1]

I have applied your patch on top of Amit patches posted on 15th Sept. 2016, and tried to create some test cases on list and multi-level partition based on test cases written for range partition.

I got some server crash and errors which I have mentioned as comment in expected output file, which need to be updated once these issues will get fix. also for these issue expected output is generated by creating same query for non-partition table with same data.

Attached patch created on top to Ashutosh's patch posted on 22 Sept 2016.

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation



 


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

partition_join_extra_testcases.patch (226K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Partition-wise join for join between (declaratively) partitioned tables

Robert Haas
In reply to this post by Ashutosh Bapat
On Thu, Sep 22, 2016 at 6:41 AM, Ashutosh Bapat
<[hidden email]> wrote:
> [ new patch ]

This should probably get updated since Rajkumar reported a crash.
Meanwhile, here are some comments from an initial read-through:

+ * Multiple relations may be partitioned in the same way. The relations
+ * resulting from joining such relations may be partitioned in the same way as
+ * the joining relations.  Similarly, relations derived from such relations by
+ * grouping, sorting be partitioned in the same as the underlying relations.

I think you should change "may be partitioned in the same way" to "are
partitioned in the same way" or "can be regarded as partitioned in the
same way". The sentence that begins with "Similarly," is not
grammatical; it should say something like: ...by grouping or sorting
are partitioned in the same way as the underlying relations.

@@ -870,20 +902,21 @@ RelationBuildPartitionDesc(Relation rel)
                 result->bounds->rangeinfo = rangeinfo;
                 break;
             }
         }
     }

     MemoryContextSwitchTo(oldcxt);
     rel->rd_partdesc = result;
 }

+
 /*
  * Are two partition bound collections logically equal?
  *
  * Used in the keep logic of relcache.c (ie, in RelationClearRelation()).
  * This is also useful when b1 and b2 are bound collections of two separate
  * relations, respectively, because BoundCollection is a canonical
  * representation of a set partition bounds (for given partitioning strategy).
  */
 bool
 partition_bounds_equal(PartitionKey key,

Spurious hunk.

+ *     For an umpartitioned table, it returns NULL.

Spelling.

+             * two arguemnts and returns boolean. For types, it
suffices to match

Spelling.

+ * partition key expression is stored as a single member list to accomodate

Spelling.

+ * For a base relation, construct an array of partition key expressions. Each
+ * partition key expression is stored as a single member list to accomodate
+ * more partition keys when relations are joined.

How would joining relations result in more partitioning keys getting
added?  Especially given the comment for the preceding function, which
says that a new PartitionScheme gets created unless an exact match is
found.

+            if (!lc)

Test lc == NIL instead of !lc.

+extern int
+PartitionSchemeGetNumParts(PartitionScheme part_scheme)
+{
+    return part_scheme ? part_scheme->nparts : 0;
+}

I'm not convinced it's a very good idea for this function to have
special handling for when part_scheme is NULL.  In
try_partition_wise_join() that checks is not needed because it's
already been done, and in generate_partition_wise_join_paths it is
needed but only because you are initializing nparts too early.  If you
move this initialization down below the IS_DUMMY_REL() check you won't
need the NULL guard.  I would ditch this function and let the callers
access the structure member directly.

+extern int
+PartitionSchemeGetNumKeys(PartitionScheme part_scheme)
+{
+    return part_scheme ? part_scheme->partnatts : 0;
+}

Similarly here.  have_partkey_equi_join should probably have a
quick-exit path when part_scheme is NULL, and then num_pks can be set
afterwards unconditionally.  Same for match_expr_to_partition_keys.
build_joinrel_partition_info already has it and doesn't need this
double-check.

+extern Oid *
+PartitionDescGetPartOids(PartitionDesc part_desc)
+{
+    Oid       *part_oids;
+    int        cnt_parts;
+
+    if (!part_desc || part_desc->nparts <= 0)
+        return NULL;
+
+    part_oids = (Oid *) palloc(sizeof(Oid) * part_desc->nparts);
+    for (cnt_parts = 0; cnt_parts < part_desc->nparts; cnt_parts++)
+        part_oids[cnt_parts] = part_desc->oids[cnt_parts];
+
+    return part_oids;
+}

I may be missing something, but this looks like a bad idea in multiple
ways.  First, you've got checks for part_desc's validity here that
should be in the caller, as noted above.  Second, you're copying an
array by looping instead of using memcpy().  Third, the one and only
caller is set_append_rel_size, which doesn't seem to have any need to
copy this data in the first place.  If there is any possibility that
the PartitionDesc is going to change under us while that function is
running, something is deeply broken.  Nothing in the planner is going
to cope with the table structure changing under us, so it had better
not.

+    /*
+     * For a partitioned relation, we will save the child RelOptInfos in parent
+     * RelOptInfo in the same the order as corresponding bounds/lists are
+     * stored in the partition scheme.
+     */

This comment seems misplaced; shouldn't it be next to the code that is
actually doing this, rather than the code that is merely setting up
for it?  And, also, the comment implies that we're doing this instead
of what we'd normally do, whereas I think we are actually doing
something additional.

+        /*
+         * Save topmost parent's relid. If the parent itself is a child of some
+         * other relation, use parent's topmost parent relids.
+         */
+        if (rel->top_parent_relids)
+            childrel->top_parent_relids = rel->top_parent_relids;
+        else
+            childrel->top_parent_relids = bms_copy(rel->relids);

Comment should explain why we're doing it, not what we're doing.  The
comment as written just restates what anybody who's likely to be
looking at this can already see to be true from looking at the code
that follows.  The question is why do it.

+    /* Set only for "other" base or join relations. */
+    Relids        top_parent_relids;

Comment should say what it is, not just when it's set.

+    /* Should have found all the childrels of a partitioned relation. */
+    if (rel->part_scheme)
+    {
+        int        cnt_parts;
+        for (cnt_parts = 0; cnt_parts < nparts; cnt_parts++)
+            Assert(rel->part_rels[cnt_parts]);
+    }

A block that does nothing but Assert() should be guarded by #ifdef
USE_ASSERT_CHECKING.  Although, actually, maybe this should be an
elog(), just in case?

+    }
+
+    add_paths_to_append_rel(root, rel, live_childrels);
+}
+
+static void
+add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
+                        List *live_childrels)

The new function should have a header comment, which should include an
explanation of why this is now separate from
set_append_rel_pathlist().

+    if (!live_childrels)

As before, I think live_childrels == NIL is better style.

+            generate_partition_wise_join_paths(root, rel);

Needs an update to the comment earlier in the hunk.  It's important to
explain why this has to be done here and not within
join_search_one_level.

+            /* Recursively collect the paths from child joinrel. */
+            generate_partition_wise_join_paths(root, child_rel);

Given the recursion, check_stack_depth() at top of function is
probably appropriate.  Same for try_partition_wise_join().

+    if (live_children)
+        pfree(live_children);

Given that none of the substructure, including ListCells, will be
freed, this seems utterly pointless.  If it's necessary to recover
memory here at all, we probably need to be more aggressive about it.
Have you tested the effect of this patch on planner memory consumption
with multi-way joins between tables with many partitions?  If you
haven't, you probably should. (Testing runtime would be good, too.)
Does it grow linearly?  Quadratically?  Exponentially?  Minor leaks
don't matter, but if we're generating too much garbage we'll have to
make sure it gets cleaned up soon enough to prevent runaway memory
usage.

     /*
+     * An inner path parameterized by the parent relation of outer
+     * relation needs to be reparameterized by the outer relation to be used
+     * for parameterized nested loop join.
+     */

No doubt, but I think the comment is missing the bigger picture -- it
doesn't say anything about this being here to support partition-wise
joins, which seems like a key point.

+        /* If we could not translate the path, don't produce nest loop path. */
+        if (!inner_path)
+            return;

Why would that ever happen?

+/*
+ * If the join between the given two relations can be executed as
+ * partition-wise join create the join relations for partition-wise join,
+ * create paths for those and then create append paths to combine
+ * partition-wise join results.
+ */
+static void
+try_partition_wise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
+                        RelOptInfo *joinrel, SpecialJoinInfo *parent_sjinfo,
+                        List *parent_restrictlist)

This comment doesn't accurately describe what the function does.  No
append paths are created here; that happens at a much later stage.  I
think this comment needs quite a bit more work, and maybe the function
should be renamed, too.  There are really two steps involved here:
first, we create paths for each child, attached to a new RelOptInfo
flagged as RELOPT_OTHER_JOINREL paths; later, we create additional
paths for the parent RelOptInfo by appending a path for each child.

Broadly, I think there's a lack of adequate documentation of the
overall theory of operation of this patch.  I believe that an update
to the optimizer README would be appropriate, probably with a new
section but maybe incorporating the new material into an existing
section.  In addition, the comments for individual comments and chunks
of code need to do a better job explaining how each part of the patch
contributes to the overall picture.  I also think we need to do a
better join hammering out the terminology.  I don't particularly like
the term "partition-wise join" in the first place, although I don't
know what would be better, but we certainly need to avoid confusing a
partition-wise join -- which is a join performed by joining each
partition of one partitioned rel to the corresponding partition of a
similarly partitioned rel rather than by the usual execution strategy
of joining the parent rels -- with the concept of an other-join-rel,
which an other-member-rel analogue for joins.  I don't think the patch
is currently very clear about this right now, either in the code or in
the comments.  Maybe this function ought to be named something like
make_child_joins() or make_child_join_paths(), and we could use "child
joins" and/or "child join paths" as standard terminology throughout
the patch.

+    rel1_desc = makeStringInfo();
+    rel2_desc = makeStringInfo();
+
+    /* TODO: remove this notice when finalising the patch. */
+    outBitmapset(rel1_desc, rel1->relids);
+    outBitmapset(rel2_desc, rel2->relids);
+    elog(NOTICE, "join between relations %s and %s is considered for
partition-wise join.",
+         rel1_desc->data, rel2_desc->data);

Please remove your debugging cruft before submitting patches to
pgsql-hackers, or at least put #ifdef NOT_USED or something around it.

+     * We allocate the array for child RelOptInfos till we find at least one
+     * join order which can use partition-wise join technique. If no join order
+     * can use partition-wise join technique, there are no child relations.

This comment has problems.  I think "till" is supposed to be "until",
and there's supposed to be a "don't" in there somewhere.  But really,
I think what you're going for is just /* Allocate when first needed */
which would be a lot shorter and also more clear.

+     * Create join relations for the partition relations, if they do not exist
+     * already. Add paths to those for the given pair of joining relations.

I think the comment could be a bit more explanatory here.  Something
like: "This joinrel is partitioned, so iterate over the partitions and
create paths for each one, allowing us to eventually build an
append-of-joins path for the parent.  Since this routine may be called
multiple times for various join orders, the RelOptInfo needed for each
child join may or may not already exist, but the paths for this join
order definitely do not.  Note that we don't create any actual
AppendPath at this stage; it only makes sense to do that at the end,
after each possible join order has been considered for each child
join.  The best join order may differ from child to child."

+         * partiticipating in the given partition relations. We need them

Spelling.

+/*
+ * Construct the SpecialJoinInfo for the partition-wise join using parents'
+ * special join info. Also, instead of
+ * constructing an sjinfo everytime, we should probably save it in
+ * root->join_info_list and search within it like join_is_legal?
+ */

The lines here are of very different lengths for no particularly good
reason, and it should end with a period, not a question mark.

On the substance of the issue, it seems like the way you're doing this
right now could allocate a very large number of SpecialJoinInfo
structures.  For every join relation, you'll create one
SpecialJoinInfo per legal join order per partition.  That seems like
it could get to be a big number.  I don't know if that's going to be a
problem from a memory-usage standpoint, but it seems like it might.
It's not just the SpecialJoinInfo itself; all of the substructure gets
duplicated, too.

+    SpecialJoinInfo *sjinfo = copyObject(parent_sjinfo);
+    sjinfo->min_lefthand = adjust_partition_relids(sjinfo->min_lefthand,
+                                                   append_rel_infos1);

Missing a blank line here.

+        AppendRelInfo    *ari = lfirst(lc);

Standard naming convention for an AppendRelInfo variable seems to be
appinfo, not ari.  (I just did "git grep AppendRelInfo".)

+        /* Skip non-equi-join clauses. */
+        if (!rinfo->can_join ||
+            rinfo->hashjoinoperator == InvalidOid ||
+            !rinfo->mergeopfamilies)
+            continue;

There's definitely something ugly about this.  If rinfo->can_join is
false, then we're done.  But suppose one of mergeopfamilies == NIL and
rinfo->hashoperator == InvalidOid is true and the other is false.  Are
we really precluded from doing a partiion-wise join in that case, or
are we just prohibited from using certain join strategies?  In most
places where we make similar tests, we're careful not to require more
than we need.

I also think that these tests need to consider the partitioning
operator in use.  Suppose that the partition key is of a type T which
has two operator classes X and Y.  Both relations are partitioned
using an operator from opfamily X, but the join condition mentions
opfamily Y.  I'm pretty sure this precludes a partitionwise join.  If
the join condition used opfamily X, then we would have a guarantee
that two rows which compared as equal would be in the same partition,
but because it uses opfamily Y, that's not guaranteed.  For example,
if T is a text type, X might test for exact equality using "C"
collation rules, while Y might test for equality using some
case-insensitive set of rules.  If the partition boundaries are such
that "foo" and "FOO" are in different partitions, a partitionwise join
using the case-insensitive operator will produce wrong results.  You
can also imagine this happening with numeric, if you have one opclass
(like the default one) that considers 5.0 and 5.00 to be equal, but
another opclass that thinks they are different; if the latter is used
to set the partition bounds, 5.0 and 5.00 could end up in different
partitions - which will be fine if an operator from that opclass is
used for the join, but not if an operator from the regular opclass is
used.

After thinking this over a bit, I think the right way to think about this is:

1. Amit's patch currently only ever uses btree opfamilies for
partitioning.  It uses those for both range partitioning and list
partitioning.  If we ever support hash partitioning, we would
presumably use hash opfamilies for that purpose, but right now it's
all about btree opfamilies.

2. Therefore, if A and B are partitioned but the btree opfamilies
don't match, they don't have the same partitioning scheme and this
code should never be reached.  Similarly, if they use the same
opfamily but different collations, the partitioning schemes shouldn't
match and therefore this code should not be reached.

3. If A and B are partitioned and the partitioning opfamilies - which
are necessarily btree opfamilies - do match, then the operator which
appears in the query needs to be from the same opfamily and have
amopstrategy of BTEqualStrategyNumber within that opfamily.  If not,
then a partition-wise join is not possible.

4. Assuming the above conditions are met, have_partkey_equi_join
doesn't need to care whether the operator chosen has mergeopfamilies
or a valid hashjoinoperator.  Those factors will control which join
methods are legal, but not whether a partitionwise join is possible in
principle.

Let me know whether that seems right.

+     * RelabelType node; eval_const_expressions() will have simplied if more

Spelling.


     /*
+     * Code below scores equivalence classes by how many equivalence members
+     * can produce join clauses for this join relation. Equivalence members
+     * which do not cover the parents of a partition-wise join relation, can
+     * produce join clauses for partition-wise join relation.
+     */

I don't know what that means.  The comma in the second sentence
doesn't belong there.

+    /*
+     * TODO: Instead of copying and mutating the trees one child relation at a
+     * time, we should be able to do this en-masse for all the partitions
+     * involved.
+     */

I don't see how that would be possible, but if it's a TODO, you'd
better do it (or decide not to do it and remove or change the
comment).

     /*
      * Create explicit sort nodes for the outer and inner paths if necessary.
      */
     if (best_path->outersortkeys)
     {
+        Relids        outer_relids = outer_path->parent->relids;
         Sort       *sort = make_sort_from_pathkeys(outer_plan,
-                                                   best_path->outersortkeys);
+                                                   best_path->outersortkeys,
+                                                   outer_relids);

The changes related to make_sort_from_pathkeys() are pretty opaque to
me.  Can you explain?

+     * Change parameterization of sub paths recursively. Also carry out any

"sub paths" should not be two words, here or anywhere.

+reparameterize_path_for_child(PlannerInfo *root, Path *path,
+                              RelOptInfo *child_rel)

This is suspiciously unlike reparameterize_path.  Why?

+    /* Computer information relevant to the foreign relations. */
+    set_foreign_rel_properties(joinrel, outer_rel, inner_rel);

Perhaps this refactoring could be split out into a preliminary patch,
which would then simplify this patch.  And same for add_join_rel().

+     * Produce partition-wise joinrel's targetlist by translating the parent
+     * joinrel's targetlist. This will also include the required placeholder

Again the confusion between a "child" join and a partition-wise join...

+    /*
+     * Nothing to do if
+     * a. partition-wise join is disabled.
+     * b. joining relations are not partitioned.
+     * c. partitioning schemes do not match.
+     */
+

I don't think that's going to survive pgindent.

+     * are not considered equal, an equi-join involing inner partition keys

Spelling.

+     * Collect the partition key expressions. An OUTER join will produce rows
+     * where the partition key columns of inner side are NULL and may not fit
+     * the partitioning scheme with inner partition keys. Since two NULL values
+     * are not considered equal, an equi-join involing inner partition keys
+     * still prohibits cross-partition joins while joining with another
+     * similarly partitioned relation.

I can't figure out what this comment is trying to tell me.  Possibly I
just need more caffeine.

+ * Adding these two join_rel_level list also means that top level list has more
+ * than one join relation, which is symantically incorrect.

I don't understand this, either; also, spelling.

As a general comment, the ratio of tests-to-code in this patch is way
out of line with PostgreSQL's normal practices.  The total patch file
is 10965 lines. The test cases begin at line 3047, meaning that in
round figures you've got about one-quarter code and about
three-quarters test cases.  I suspect that a large fraction of those
test cases aren't adding any meaningful code coverage and will just
take work to maintain.  That needs to be slimmed down substantially in
any version of this considered for commit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Partition-wise join for join between (declaratively) partitioned tables

Ashutosh Bapat
On Wed, Sep 28, 2016 at 2:02 AM, Robert Haas <[hidden email]> wrote:
> On Thu, Sep 22, 2016 at 6:41 AM, Ashutosh Bapat
> <[hidden email]> wrote:
>> [ new patch ]
>
> This should probably get updated since Rajkumar reported a crash.
> Meanwhile, here are some comments from an initial read-through:

Done. Fixed those crashes. Also fixed some crashes in foreign table
code and postgres_fdw. The tests were provided by Rajkumar. I am
working on including those in my patch. The attached patch is still
based on Amit's patches set of patches posted on 15th Sept. 2016. He
is addressing your comments on his patches. So, I am expecting a more
stable version arrive soon. I will rebase my patches then. Because of
a bug in those patches related to multi-level partitioned tables and
lateral joins and also a restriction on sharing partition keys across
levels of partitions, the testcase is still failing. I will work on
that while rebasing the patch.

>
> + * Multiple relations may be partitioned in the same way. The relations
> + * resulting from joining such relations may be partitioned in the same way as
> + * the joining relations.  Similarly, relations derived from such relations by
> + * grouping, sorting be partitioned in the same as the underlying relations.
>
> I think you should change "may be partitioned in the same way" to "are
> partitioned in the same way" or "can be regarded as partitioned in the
> same way".

The relations resulting from joining partitioned relations are
partitioned in the same way, if there exist equi-join condition/s
between their partition keys. If such equi-joins do not exist, the
join is *not* partitioned. Hence I did not use "are" or "can be" which
indicate a certainty. Instead I used "may" which indicates
"uncertainty". I am not sure whether that's a good place to explain
the conditions under which such relations are partitioned. Those
conditions will change as we implement more and more partition-wise
join strategies. But that comment conveys two things 1. partition
scheme makes sense for all kinds of relations 2. multiple relations
(of any kind) may share partition scheme. I have slightly changed the
wording to make this point clear. Please let me know if it looks
better.

> The sentence that begins with "Similarly," is not
> grammatical; it should say something like: ...by grouping or sorting
> are partitioned in the same way as the underlying relations.

Done. Instead of "are" I have used "may" for the same reason as above.

>
> @@ -870,20 +902,21 @@ RelationBuildPartitionDesc(Relation rel)
>                  result->bounds->rangeinfo = rangeinfo;
>                  break;
>              }
>          }
>      }
>
>      MemoryContextSwitchTo(oldcxt);
>      rel->rd_partdesc = result;
>  }
>
> +
>  /*
>   * Are two partition bound collections logically equal?
>   *
>   * Used in the keep logic of relcache.c (ie, in RelationClearRelation()).
>   * This is also useful when b1 and b2 are bound collections of two separate
>   * relations, respectively, because BoundCollection is a canonical
>   * representation of a set partition bounds (for given partitioning strategy).
>   */
>  bool
>  partition_bounds_equal(PartitionKey key,
>
> Spurious hunk.
>
Thanks. Done.

> + *     For an umpartitioned table, it returns NULL.
>
> Spelling.

Done. Thanks.

>
> +             * two arguemnts and returns boolean. For types, it
> suffices to match
>
> Spelling.

Thanks. Done.

>
> + * partition key expression is stored as a single member list to accomodate
>
> Spelling.

Thanks. Done.

>
> + * For a base relation, construct an array of partition key expressions. Each
> + * partition key expression is stored as a single member list to accomodate
> + * more partition keys when relations are joined.
>
> How would joining relations result in more partitioning keys getting
> added?  Especially given the comment for the preceding function, which
> says that a new PartitionScheme gets created unless an exact match is
> found.

Let's assume that relation A and B are partitioned by columns a and b
resp. and have same partitioning scheme. This means that the datatypes
of a and b as well as the opclass used for comparing partition key
values of A and B are same. A join between A and B with condition A.a
= B.b is partitioned by both A.a and B.b. We need to keep track of
both the keys in case AB joins with C which is partitioned in the same
manner. I guess, the confusion is with the term "partition keys" -
which is being used to indicate the class of partition key as well as
instance of partition key. In the above example, the datatype of
partition key and the opclass together indicate partition key class
whereas A.a and B.b are instances of that class. Increase in partition
keys may mean both increase in the number of classes or increase in
the number of instances. In the above comment I used to mean number of
instances. May be we should use "partition key expressions" to
indicate the partition key instances and "partition key" to indicate
partition key class. I have changed the comments to use partition keys
and partition key expressions appropriately. Please let me know if the
comments are worded correctly.

PartitionScheme does not hold the actual partition key expressions. It
holds the partition key type and opclass used for comparison, which
should be same for all the relations sharing the partition scheme.

>
> +            if (!lc)
>
> Test lc == NIL instead of !lc.

NIL is defined as (List *) NULL and lc is ListCell *. So changed the
test to lc == NULL instead of !lc.

>
> +extern int
> +PartitionSchemeGetNumParts(PartitionScheme part_scheme)
> +{
> +    return part_scheme ? part_scheme->nparts : 0;
> +}
>
> I'm not convinced it's a very good idea for this function to have
> special handling for when part_scheme is NULL.  In
> try_partition_wise_join() that checks is not needed because it's
> already been done, and in generate_partition_wise_join_paths it is
> needed but only because you are initializing nparts too early.  If you
> move this initialization down below the IS_DUMMY_REL() check you won't
> need the NULL guard.  I would ditch this function and let the callers
> access the structure member directly.
>
> +extern int
> +PartitionSchemeGetNumKeys(PartitionScheme part_scheme)
> +{
> +    return part_scheme ? part_scheme->partnatts : 0;
> +}
>
> Similarly here.  have_partkey_equi_join should probably have a
> quick-exit path when part_scheme is NULL, and then num_pks can be set
> afterwards unconditionally.  Same for match_expr_to_partition_keys.
> build_joinrel_partition_info already has it and doesn't need this
> double-check.
>
> +extern Oid *
> +PartitionDescGetPartOids(PartitionDesc part_desc)
> +{
> +    Oid       *part_oids;
> +    int        cnt_parts;
> +
> +    if (!part_desc || part_desc->nparts <= 0)
> +        return NULL;
> +
> +    part_oids = (Oid *) palloc(sizeof(Oid) * part_desc->nparts);
> +    for (cnt_parts = 0; cnt_parts < part_desc->nparts; cnt_parts++)
> +        part_oids[cnt_parts] = part_desc->oids[cnt_parts];
> +
> +    return part_oids;
> +}
>
> I may be missing something, but this looks like a bad idea in multiple
> ways.  First, you've got checks for part_desc's validity here that
> should be in the caller, as noted above.  Second, you're copying an
> array by looping instead of using memcpy().  Third, the one and only
> caller is set_append_rel_size, which doesn't seem to have any need to
> copy this data in the first place.  If there is any possibility that
> the PartitionDesc is going to change under us while that function is
> running, something is deeply broken.  Nothing in the planner is going
> to cope with the table structure changing under us, so it had better
> not.
These three functions were written based on Amit Langote's patches
which did not expose partition related structures outside partition.c.
Hence they required wrappers. I have moved PartitionSchemeData to
partition.h and removed these functions. Instead the members are
accessed directly.

>
> +    /*
> +     * For a partitioned relation, we will save the child RelOptInfos in parent
> +     * RelOptInfo in the same the order as corresponding bounds/lists are
> +     * stored in the partition scheme.
> +     */
>
> This comment seems misplaced; shouldn't it be next to the code that is
> actually doing this, rather than the code that is merely setting up
> for it?  And, also, the comment implies that we're doing this instead
> of what we'd normally do, whereas I think we are actually doing
> something additional.
>
Ok. I have moved the comment few line below, near the code which saves
the partition RelOptInfos.

> +        /*
> +         * Save topmost parent's relid. If the parent itself is a child of some
> +         * other relation, use parent's topmost parent relids.
> +         */
> +        if (rel->top_parent_relids)
> +            childrel->top_parent_relids = rel->top_parent_relids;
> +        else
> +            childrel->top_parent_relids = bms_copy(rel->relids);
>
> Comment should explain why we're doing it, not what we're doing.  The
> comment as written just restates what anybody who's likely to be
> looking at this can already see to be true from looking at the code
> that follows.  The question is why do it.
>
The point of that comment is to explain how it percolates down the
hierarchy, which is not so clear from the code. I have changed it to
read
/*
 * Recursively save topmost parent's relid in RelOptInfos of
 * partitions.
*/

Or you are expecting that the comment to explain the purpose of
top_parent_relids? I don't think that's a good idea, since the purpose
will change over the time and the comment will soon be out of sync
with the actual code, unless the developers expanding the usage
remember to update the comment. I have not seen the comments,
explaining purpose, next to the assignments. Take for example
RelOptInfo::relids.

> +    /* Set only for "other" base or join relations. */
> +    Relids        top_parent_relids;
>
> Comment should say what it is, not just when it's set.

Done. Check if it looks good.

>
> +    /* Should have found all the childrels of a partitioned relation. */
> +    if (rel->part_scheme)
> +    {
> +        int        cnt_parts;
> +        for (cnt_parts = 0; cnt_parts < nparts; cnt_parts++)
> +            Assert(rel->part_rels[cnt_parts]);
> +    }
>
> A block that does nothing but Assert() should be guarded by #ifdef
> USE_ASSERT_CHECKING.  Although, actually, maybe this should be an
> elog(), just in case?
Changed it to elog().

>
> +    }
> +
> +    add_paths_to_append_rel(root, rel, live_childrels);
> +}
> +
> +static void
> +add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
> +                        List *live_childrels)
>
> The new function should have a header comment, which should include an
> explanation of why this is now separate from
> set_append_rel_pathlist().
Sorry for missing it. Added the prologue. Let me know, if it looks
good. I have made sure that all functions have a prologue and tried to
match the style with surrounding functions. Let me know if I have
still missed any or the styles do not match.

>
> +    if (!live_childrels)
>
> As before, I think live_childrels == NIL is better style.

Fixed.

>
> +            generate_partition_wise_join_paths(root, rel);
>
> Needs an update to the comment earlier in the hunk.  It's important to
> explain why this has to be done here and not within
> join_search_one_level.

Thanks for pointing that out. Similar to generate_gather_paths(), we
need to add explanation in standard_join_search() as well as in the
function prologue. Did that. Let me know if it looks good.

>
> +            /* Recursively collect the paths from child joinrel. */
> +            generate_partition_wise_join_paths(root, child_rel);
>
> Given the recursion, check_stack_depth() at top of function is
> probably appropriate.  Same for try_partition_wise_join().

Done. I wouldn't imagine a user creating that many levels of
partitions, but it's good to guard against some automated script that
has gone berserk.

>
> +    if (live_children)
> +        pfree(live_children);
>
> Given that none of the substructure, including ListCells, will be
> freed, this seems utterly pointless.  If it's necessary to recover
> memory here at all, we probably need to be more aggressive about it.

I intended to use list_free() instead of pfree(). Fixed that.

> Have you tested the effect of this patch on planner memory consumption
> with multi-way joins between tables with many partitions?  If you
> haven't, you probably should. (Testing runtime would be good, too.)
> Does it grow linearly?  Quadratically?  Exponentially?  Minor leaks
> don't matter, but if we're generating too much garbage we'll have to
> make sure it gets cleaned up soon enough to prevent runaway memory
> usage.

I tried to check memory usage with various combinations of number of
partitions and number of relations being joined. For higher number of
relations being joined like 10 with 100 partitions, OOM killer kicked
in during the planning phase. I am suspecting
adjust_partitionrel_attrs() (changed that name to
adjust_join_appendrel_attrs() to be in sync with
adjust_appendrel_attrs()) to be the culprit. It copies expression
trees every time for joining two children. That's an exponentially
increasing number as the number of legal joins increases
exponentially. I am still investigating this.

As a side question, do we have a function to free an expression tree?
I didn't find any.

>
>      /*
> +     * An inner path parameterized by the parent relation of outer
> +     * relation needs to be reparameterized by the outer relation to be used
> +     * for parameterized nested loop join.
> +     */
>
> No doubt, but I think the comment is missing the bigger picture -- it
> doesn't say anything about this being here to support partition-wise
> joins, which seems like a key point.
I have tried to explain the partition-wise join context. Let me know
if it looks good.

>
> +        /* If we could not translate the path, don't produce nest loop path. */
> +        if (!inner_path)
> +            return;
>
> Why would that ever happen?

Right now, reparameterize_path_for_child() does not support all kinds
of paths. So I have added that condition. I will add support for more
path types there once we agree that this is the right way to translate
the paths and that the path translation is required.

>
> +/*
> + * If the join between the given two relations can be executed as
> + * partition-wise join create the join relations for partition-wise join,
> + * create paths for those and then create append paths to combine
> + * partition-wise join results.
> + */
> +static void
> +try_partition_wise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
> +                        RelOptInfo *joinrel, SpecialJoinInfo *parent_sjinfo,
> +                        List *parent_restrictlist)
>
> This comment doesn't accurately describe what the function does.  No
> append paths are created here; that happens at a much later stage.
Removed reference to the append paths. Sorry for leaving it there,
when I moved the append path creation to a later stage.

> I
> think this comment needs quite a bit more work, and maybe the function
> should be renamed, too.

Improved the comments in the prologue and inside the function. Please
let me know, if they look good.

> There are really two steps involved here:
> first, we create paths for each child, attached to a new RelOptInfo
> flagged as RELOPT_OTHER_JOINREL paths; later, we create additional
> paths for the parent RelOptInfo by appending a path for each child.
>

Right, the first one is done in try_partition_wise_join() and the
later is done in generate_partition_wise_join_paths()

> Broadly, I think there's a lack of adequate documentation of the
> overall theory of operation of this patch.  I believe that an update
> to the optimizer README would be appropriate, probably with a new
> section but maybe incorporating the new material into an existing
> section.

Done. I have added a separate section to optimizer/README

> In addition, the comments for individual comments and chunks
> of code need to do a better job explaining how each part of the patch
> contributes to the overall picture.


> I also think we need to do a
> better join hammering out the terminology.  I don't particularly like
> the term "partition-wise join" in the first place, although I don't
> know what would be better, but we certainly need to avoid confusing a
> partition-wise join -- which is a join performed by joining each
> partition of one partitioned rel to the corresponding partition of a
> similarly partitioned rel rather than by the usual execution strategy
> of joining the parent rels -- with the concept of an other-join-rel,
> which an other-member-rel analogue for joins.  I don't think the patch
> is currently very clear about this right now, either in the code or in
> the comments.  Maybe this function ought to be named something like
> make_child_joins() or make_child_join_paths(), and we could use "child
> joins" and/or "child join paths" as standard terminology throughout
> the patch.
Partition-wise join is widely used term in the literature. Other
DBMSes use the same term as well. So, I think we should stick with
"partition-wise join". Partition-wise join as you have described is a
join performed by joining each partition of one partitioned rel to the
corresponding partition of a similarly partitioned rel rather than by
the usual execution strategy of joining the parent rels. I have
usually used the term "partition-wise join technique" to refer to this
method. I have changed the other usages of this term to use wording
like child joins or join between partiitions or join between child
relations as appropriate. Also, I have changed the names of functions
dealing with joins between partitions to use child_join instead of
partition_join or partition_wise_join.

Since partition-wise join is a method to join two relations just like
other methods, try_partition_wise_join() fits into the naming
convention try_<join technique> like try_nestloop_join.

>
> +    rel1_desc = makeStringInfo();
> +    rel2_desc = makeStringInfo();
> +
> +    /* TODO: remove this notice when finalising the patch. */
> +    outBitmapset(rel1_desc, rel1->relids);
> +    outBitmapset(rel2_desc, rel2->relids);
> +    elog(NOTICE, "join between relations %s and %s is considered for
> partition-wise join.",
> +         rel1_desc->data, rel2_desc->data);
>
> Please remove your debugging cruft before submitting patches to
> pgsql-hackers, or at least put #ifdef NOT_USED or something around it.
I kept this one intentionally. But as the TODO comment says, I do
intend to remove it once testing is over. Those messages make it very
easy to know whether partition-wise join was considered for a given
join or not. Without those messages, one has to break into
try_partition_wise_join() to figure out whether partition-wise join
was used or not. The final plan may not come out to be partition-wise
join plan even if partition-wise join was considered. Although, I have
now used DEBUG3 instead of NOTICE and removed those lines from the
expected output.

>
> +     * We allocate the array for child RelOptInfos till we find at least one
> +     * join order which can use partition-wise join technique. If no join order
> +     * can use partition-wise join technique, there are no child relations.
>
> This comment has problems.  I think "till" is supposed to be "until",
> and there's supposed to be a "don't" in there somewhere.  But really,
> I think what you're going for is just /* Allocate when first needed */
> which would be a lot shorter and also more clear.

Sorry for those mistakes. Yes, shorter version is better. Fixed the
comment as per your suggestion.

>
> +     * Create join relations for the partition relations, if they do not exist
> +     * already. Add paths to those for the given pair of joining relations.
>
> I think the comment could be a bit more explanatory here.  Something
> like: "This joinrel is partitioned, so iterate over the partitions and
> create paths for each one, allowing us to eventually build an
> append-of-joins path for the parent.  Since this routine may be called
> multiple times for various join orders, the RelOptInfo needed for each
> child join may or may not already exist, but the paths for this join
> order definitely do not.  Note that we don't create any actual
> AppendPath at this stage; it only makes sense to do that at the end,
> after each possible join order has been considered for each child
> join.  The best join order may differ from child to child."
>
Copied verbatim. Thanks for the detailed comment.


> +         * partiticipating in the given partition relations. We need them
>
> Spelling.
>

Done. Also fixed other grammatical mistakes and typos in that comment.

> +/*
> + * Construct the SpecialJoinInfo for the partition-wise join using parents'
> + * special join info. Also, instead of
> + * constructing an sjinfo everytime, we should probably save it in
> + * root->join_info_list and search within it like join_is_legal?
> + */
>
> The lines here are of very different lengths for no particularly good
> reason, and it should end with a period, not a question mark.

My bad. Sorry. Fixed.

>
> On the substance of the issue, it seems like the way you're doing this
> right now could allocate a very large number of SpecialJoinInfo
> structures.  For every join relation, you'll create one
> SpecialJoinInfo per legal join order per partition.  That seems like
> it could get to be a big number.  I don't know if that's going to be a
> problem from a memory-usage standpoint, but it seems like it might.
> It's not just the SpecialJoinInfo itself; all of the substructure gets
> duplicated, too.
>
Yes. We need the SpecialJoinInfo structures for the existing path
creation to work. The code will be complicated if we try to use parent
SpecialJoinInfo instead of creating those for children. We may free
memory allocated in SpecialJoinInfo to save some memory.
SpecialJoinInfos are not needed once the paths are created. Still we
will waste some memory for semi_rhs_exprs, which are reused for unique
paths. But otherwise we will reclaim the rest of the memory. Memory
wastage in adjust_partition_relids() may be minimized by modifying
adjust_appendrel_attrs() to accept list of AppendRelInfos and mutating
the tree only once rather than doing it N times for an N-way join.

> +    SpecialJoinInfo *sjinfo = copyObject(parent_sjinfo);
> +    sjinfo->min_lefthand = adjust_partition_relids(sjinfo->min_lefthand,
> +                                                   append_rel_infos1);
>
> Missing a blank line here.

Done.

>
> +        AppendRelInfo    *ari = lfirst(lc);
>
> Standard naming convention for an AppendRelInfo variable seems to be
> appinfo, not ari.  (I just did "git grep AppendRelInfo".)

Done.

>
> +        /* Skip non-equi-join clauses. */
> +        if (!rinfo->can_join ||
> +            rinfo->hashjoinoperator == InvalidOid ||
> +            !rinfo->mergeopfamilies)
> +            continue;
>
> There's definitely something ugly about this.  If rinfo->can_join is
> false, then we're done.  But suppose one of mergeopfamilies == NIL and
> rinfo->hashoperator == InvalidOid is true and the other is false.  Are
> we really precluded from doing a partiion-wise join in that case, or
> are we just prohibited from using certain join strategies?  In most
> places where we make similar tests, we're careful not to require more
> than we need.
Right. That condition is flawed. Corrected it.

>
> I also think that these tests need to consider the partitioning
> operator in use.  Suppose that the partition key is of a type T which
> has two operator classes X and Y.  Both relations are partitioned
> using an operator from opfamily X, but the join condition mentions
> opfamily Y.  I'm pretty sure this precludes a partitionwise join.  If
> the join condition used opfamily X, then we would have a guarantee
> that two rows which compared as equal would be in the same partition,
> but because it uses opfamily Y, that's not guaranteed.  For example,
> if T is a text type, X might test for exact equality using "C"
> collation rules, while Y might test for equality using some
> case-insensitive set of rules.  If the partition boundaries are such
> that "foo" and "FOO" are in different partitions, a partitionwise join
> using the case-insensitive operator will produce wrong results.  You
> can also imagine this happening with numeric, if you have one opclass
> (like the default one) that considers 5.0 and 5.00 to be equal, but
> another opclass that thinks they are different; if the latter is used
> to set the partition bounds, 5.0 and 5.00 could end up in different
> partitions - which will be fine if an operator from that opclass is
> used for the join, but not if an operator from the regular opclass is
> used.
Your description above uses opfamily and opclass interchangeably. It
starts saying X and Y are classed but then also refers to them as
families. But I got the point. I guess, similar to
relation_has_unique_index_for(), I have to check whether the operator
family specified in the partition scheme is present in the
mergeopfamilies in RestrictInfo for matching partition key. I have
added that check and restructured that portion of code to be readable.

>
> After thinking this over a bit, I think the right way to think about this is:
>
> 1. Amit's patch currently only ever uses btree opfamilies for
> partitioning.  It uses those for both range partitioning and list
> partitioning.  If we ever support hash partitioning, we would
> presumably use hash opfamilies for that purpose, but right now it's
> all about btree opfamilies.
>
> 2. Therefore, if A and B are partitioned but the btree opfamilies
> don't match, they don't have the same partitioning scheme and this
> code should never be reached.  Similarly, if they use the same
> opfamily but different collations, the partitioning schemes shouldn't
> match and therefore this code should not be reached.
That's right.

>
> 3. If A and B are partitioned and the partitioning opfamilies - which
> are necessarily btree opfamilies - do match, then the operator which
> appears in the query needs to be from the same opfamily and have
> amopstrategy of BTEqualStrategyNumber within that opfamily.  If not,
> then a partition-wise join is not possible.

I think this is achieved by checking whether the opfamily for given
partition key is present in the mergeopfamilies of corresponding
RestrictInfo, as stated above.

>
> 4. Assuming the above conditions are met, have_partkey_equi_join
> doesn't need to care whether the operator chosen has mergeopfamilies
> or a valid hashjoinoperator.  Those factors will control which join
> methods are legal, but not whether a partitionwise join is possible in
> principle.

If mergeopfamilies is NIL, above check will fail anyway. But skipping
a clause which has mergeopfamilies NIL will save some cycles in
matching expressions.

There is something strange happening with Amit's patch. When we create
a table partitioned by range on a column of type int2vector, it
somehow gets a btree operator family, but doesn't have mergeopfamilies
set in RestrictInfo of equality condition on that column. Instead the
RestrictInfo has hashjoinoperator. In this case if we ignore
hashjoinoperator, we won't be able to apply partition-wise join. I
guess, in such case we want to play safe and not apply partition-wise
join, even though applying it will give the correct result.

>
> +     * RelabelType node; eval_const_expressions() will have simplied if more
>
> Spelling.
>

Thanks. Done.

>
>      /*
> +     * Code below scores equivalence classes by how many equivalence members
> +     * can produce join clauses for this join relation. Equivalence members
> +     * which do not cover the parents of a partition-wise join relation, can
> +     * produce join clauses for partition-wise join relation.
> +     */
>
> I don't know what that means.  The comma in the second sentence
> doesn't belong there.
Sorry for that construction. I have changed the comment to be
something more meaningful.

>
> +    /*
> +     * TODO: Instead of copying and mutating the trees one child relation at a
> +     * time, we should be able to do this en-masse for all the partitions
> +     * involved.
> +     */
>
> I don't see how that would be possible, but if it's a TODO, you'd
> better do it (or decide not to do it and remove or change the
> comment).
That should be doable by passing a list of AppendRelInfo structures to
adjust_appendrel_attrs_mutator(). In the mutator, we have to check
each appinfo instead of just one. But that's a lot of refactoring. May
be done as a separate patch, if we are consuming too much memory. I
have removed TODO for now.

>
>      /*
>       * Create explicit sort nodes for the outer and inner paths if necessary.
>       */
>      if (best_path->outersortkeys)
>      {
> +        Relids        outer_relids = outer_path->parent->relids;
>          Sort       *sort = make_sort_from_pathkeys(outer_plan,
> -                                                   best_path->outersortkeys);
> +                                                   best_path->outersortkeys,
> +                                                   outer_relids);
>
> The changes related to make_sort_from_pathkeys() are pretty opaque to
> me.  Can you explain?
prepare_sort_from_pathkeys() accepts Relids as one of the argument to
find equivalence members belonging to child relations. The function
does not expect relids when searching equivalence members for parent
relations. Before this patch, make_sort_from_pathkeys() passed NULL to
this function, because it didn't expect child relations before.
Because of partition-wise joins, we need to sort child relations for
merge join or to create unique paths. So, make_sort_from_pathkeys() is
required to pass relids to prepare_sort_from_pathkeys() when
processing child relations, so that the later does not skip child
members.

>
> +     * Change parameterization of sub paths recursively. Also carry out any
>
> "sub paths" should not be two words, here or anywhere.

Fixed.

>
> +reparameterize_path_for_child(PlannerInfo *root, Path *path,
> +                              RelOptInfo *child_rel)
>
> This is suspiciously unlike reparameterize_path.  Why?

reparameterize_path() tries to create path with new parameterization
from an existing parameterized path. So, it looks for additional
conditions to expand the parameterization. But this functions
translates a path parameterized by parent to be parameterized by its
child. That does not involve looking for any extra conditions, but
involves translating the existing ones so that they can be used with a
child. A right name would be translate_parampath_to_child() or
something which uses word "translate" instead of "reparameterize". But
every name like that is getting too long. For now I have renamed it as
reparameterize_path_by_child(). Also added a comment in the function
prologue about cost, rows, width etc.

>
> +    /* Computer information relevant to the foreign relations. */
> +    set_foreign_rel_properties(joinrel, outer_rel, inner_rel);
>
> Perhaps this refactoring could be split out into a preliminary patch,
> which would then simplify this patch.  And same for add_join_rel().
>

Yes, that's better. I will separate the code out in a separate patch.

There's code in build_join_rel() and build_partition_join_rel() (I
will change that name) which creates a joinrel RelOptInfo. Most of
that code simply sets NULL or 0 fields and is duplicated in both the
functions. Do you see any value in separating it out in its own
function?

Also, makeNode() uses palloc0(), thus makeNode(RelOptInfo) would set
most of the fields to 0 or NULL. Why do we then again set those fields
as NULL or 0? Should I try to remove unnecessary assignments?

> +     * Produce partition-wise joinrel's targetlist by translating the parent
> +     * joinrel's targetlist. This will also include the required placeholder
>
> Again the confusion between a "child" join and a partition-wise join...
>
> +    /*
> +     * Nothing to do if
> +     * a. partition-wise join is disabled.
> +     * b. joining relations are not partitioned.
> +     * c. partitioning schemes do not match.
> +     */
> +
>
> I don't think that's going to survive pgindent.
Changed this code a bit.

>
> +     * are not considered equal, an equi-join involing inner partition keys
>
> Spelling.
>
> +     * Collect the partition key expressions. An OUTER join will produce rows
> +     * where the partition key columns of inner side are NULL and may not fit
> +     * the partitioning scheme with inner partition keys. Since two NULL values
> +     * are not considered equal, an equi-join involing inner partition keys
> +     * still prohibits cross-partition joins while joining with another
> +     * similarly partitioned relation.
>
> I can't figure out what this comment is trying to tell me.  Possibly I
> just need more caffeine.
Re-wrote the comment with examples and detailed explanation. The
comment talks about whether inner partition key expressions should be
considered as the partition key expressions for the join, given that
for an OUTER join the inner partition key expressions may go NULL. The
comment explains why it's safe to do so. If we don't do that, any FULL
OUTER join will have no partition expressions and thus partition-wise
join technique will be useless for a N-way FULL OUTER join even if
it's safe to use it.

>
> + * Adding these two join_rel_level list also means that top level list has more
> + * than one join relation, which is symantically incorrect.
>
> I don't understand this, either; also, spelling.

I think, that sentence is not required. Removed it.

>
> As a general comment, the ratio of tests-to-code in this patch is way
> out of line with PostgreSQL's normal practices.  The total patch file
> is 10965 lines. The test cases begin at line 3047, meaning that in
> round figures you've got about one-quarter code and about
> three-quarters test cases.  I suspect that a large fraction of those
> test cases aren't adding any meaningful code coverage and will just
> take work to maintain.  That needs to be slimmed down substantially in
> any version of this considered for commit.

I agree. We require two kinds of tests 1. those which test partition
scheme matching 2. those test the planner code, which deals with path
creation. I have added both kinds of testcases for all kinds of
partitioning schemes (range, list, multi-level, partition key being
expressions, columns). That's not required. We need 1st kind of tests
for all partitioning schemes and 2nd kind of testcases only for one of
the partitioning schemes. So, definitely the number of tests will
reduce. A possible extreme would be to use a single multi-level
partitioned tests, which includes all kinds of partitioning schemes at
various partition levels. But that kind of testcase will be highly
unreadable and harder to maintain. Let me know what do you think. I
will work on that in the next version of patch. The test still fails
because of a bug in Amit's earlier set of patches

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

pg_dp_join_v4.patch (706K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Partition-wise join for join between (declaratively) partitioned tables

Robert Haas
On Fri, Oct 14, 2016 at 12:37 AM, Ashutosh Bapat
<[hidden email]> wrote:

>> Have you tested the effect of this patch on planner memory consumption
>> with multi-way joins between tables with many partitions?  If you
>> haven't, you probably should. (Testing runtime would be good, too.)
>> Does it grow linearly?  Quadratically?  Exponentially?  Minor leaks
>> don't matter, but if we're generating too much garbage we'll have to
>> make sure it gets cleaned up soon enough to prevent runaway memory
>> usage.
>
> I tried to check memory usage with various combinations of number of
> partitions and number of relations being joined. For higher number of
> relations being joined like 10 with 100 partitions, OOM killer kicked
> in during the planning phase. I am suspecting
> adjust_partitionrel_attrs() (changed that name to
> adjust_join_appendrel_attrs() to be in sync with
> adjust_appendrel_attrs()) to be the culprit. It copies expression
> trees every time for joining two children. That's an exponentially
> increasing number as the number of legal joins increases
> exponentially. I am still investigating this.

I think the root of this problem is that the existing paths shares a
lot more substructure than the ones created by the new code.  Without
a partition-wise join, the incremental memory usage for a joinrel
isn't any different whether the underlying rel is partitioned or not.
If it's partitioned, we'll be pointing to an AppendPath; if not, we'll
be pointing to some kind of Scan.  But the join itself creates exactly
the same amount of new stuff regardless of what's underneath it.  With
partitionwise join, that ceases to be true.  Every joinrel - and the
number of those grows exponentially in the number of baserels, IICU -
needs its own list of paths for every member rel.  So if a
non-partition-wise join created X paths, and there are K partitions, a
partition-wise join creates X * K paths.  That's a lot.

Although we might be able to save some memory by tightening things up
here and there - for example, right now the planner isn't real smart
about recycling paths that are evicted by add_path(), and there's
probably other wastage as well - I suspect that what this shows is
that the basic design of this patch is not going to be viable.
Intuitively, it's often going to be the case that we want the "same
plan" for every partition-set.  That is, if we have A JOIN B ON A.x =
B.x JOIN C ON A.y = B.y, and if A, B, and C are all compatibility
partitioned, then the result should be an Append plan with 100 join
plans under it, and all 100 of those plans should be basically mirror
images of each other.  Of course, that's not really right in general:
for example, it could be that A1 is big and A2 is small while B1 is
small and B2 is big, so that the right plan for (A1 JOIN B1) and for
(A2 JOIN B2) are totally different from each other.  But in many
practical cases we'll want to end up with a plan of precisely the same
shape for all children, and the current design ignores this, expending
both memory and CPU time to compute essentially-equivalent paths
across all children.

One way of attacking this problem is to gang together partitions which
are equivalent for planning purposes, as discussed in the paper "Join
Optimization Techniques for Partitioned Tables" by Herodotou, Borisov,
and Babu.  However, it's not exactly clear how to do this: we could
gang together partitions that have the same index definitions, but the
sizes of the heaps, the sizes of their indexes, and the row counts
will vary from one partition to the next, and any of those things
could cause the plan choice to be different for one partition vs. the
next.  We could try to come up with heuristics for when those things
are likely to be true.  For example, suppose we compute the set of
partitions such that all joined relations have matching index
definitions on all tables; then, we take the biggest table in the set
and consider all tables more than half that size as part of one gang.
The biggest table becomes the leader and we compute partition-wise
paths for just that partition; the other members of the gang will
eventually get a plan that is of the same shape, but we don't actually
create it that plan until after scan/join planning is concluded.

Another idea is to try to reduce peak memory usage by performing
planning separately for each partition-set.  For example, suppose we
decide to do a partition-wise join of A, B, and C.  Initially, this
gets represented as a PartitionJoinPath tree, like this:

PartitionJoinPath
-> AppendPath for A
-> PartitionJoinPath
  -> AppendPath for B
  -> AppendPath for C

Because we haven't created individual join paths for the members, this
doesn't use much memory.  Somehow, we come up with a cost for the
PartitionJoinPath; it probably won't be entirely accurate.  Once
scan/join planning is concluded, if our final path contains a
PartitionJoinPath, we go back and loop over the partitions.  For each
partition, we switch to a new memory context, perform planning, copy
the best path and its substructure back to the parent context, and
then reset the context.  In that way, peak memory usage only grows by
about a factor of 2 rather than a factor equal to the partition count,
because we don't need to keep every possibly-useful path for every
partition all at the same time, but rather every possibly-useful path
for a single partition.

Maybe there are other ideas but I have a feeling any way you slice it
this is going to be a lot of work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Partition-wise join for join between (declaratively) partitioned tables

Ashutosh Bapat
On Tue, Oct 18, 2016 at 9:09 PM, Robert Haas <[hidden email]> wrote:

> On Fri, Oct 14, 2016 at 12:37 AM, Ashutosh Bapat
> <[hidden email]> wrote:
>>> Have you tested the effect of this patch on planner memory consumption
>>> with multi-way joins between tables with many partitions?  If you
>>> haven't, you probably should. (Testing runtime would be good, too.)
>>> Does it grow linearly?  Quadratically?  Exponentially?  Minor leaks
>>> don't matter, but if we're generating too much garbage we'll have to
>>> make sure it gets cleaned up soon enough to prevent runaway memory
>>> usage.
>>
>> I tried to check memory usage with various combinations of number of
>> partitions and number of relations being joined. For higher number of
>> relations being joined like 10 with 100 partitions, OOM killer kicked
>> in during the planning phase. I am suspecting
>> adjust_partitionrel_attrs() (changed that name to
>> adjust_join_appendrel_attrs() to be in sync with
>> adjust_appendrel_attrs()) to be the culprit. It copies expression
>> trees every time for joining two children. That's an exponentially
>> increasing number as the number of legal joins increases
>> exponentially. I am still investigating this.
>
> I think the root of this problem is that the existing paths shares a
> lot more substructure than the ones created by the new code.  Without
> a partition-wise join, the incremental memory usage for a joinrel
> isn't any different whether the underlying rel is partitioned or not.
> If it's partitioned, we'll be pointing to an AppendPath; if not, we'll
> be pointing to some kind of Scan.  But the join itself creates exactly
> the same amount of new stuff regardless of what's underneath it.  With
> partitionwise join, that ceases to be true.  Every joinrel - and the
> number of those grows exponentially in the number of baserels, IICU -
> needs its own list of paths for every member rel.  So if a
> non-partition-wise join created X paths, and there are K partitions, a
> partition-wise join creates X * K paths.  That's a lot.
>
> Although we might be able to save some memory by tightening things up
> here and there - for example, right now the planner isn't real smart
> about recycling paths that are evicted by add_path(), and there's
> probably other wastage as well - I suspect that what this shows is
> that the basic design of this patch is not going to be viable.
> Intuitively, it's often going to be the case that we want the "same
> plan" for every partition-set.  That is, if we have A JOIN B ON A.x =
> B.x JOIN C ON A.y = B.y, and if A, B, and C are all compatibility
> partitioned, then the result should be an Append plan with 100 join
> plans under it, and all 100 of those plans should be basically mirror
> images of each other.  Of course, that's not really right in general:
> for example, it could be that A1 is big and A2 is small while B1 is
> small and B2 is big, so that the right plan for (A1 JOIN B1) and for
> (A2 JOIN B2) are totally different from each other.  But in many
> practical cases we'll want to end up with a plan of precisely the same
> shape for all children, and the current design ignores this, expending
> both memory and CPU time to compute essentially-equivalent paths
> across all children.
I think there are going to be two kinds of partitioning use-cases.
First, carefully hand-crafted by DBAs so that every partition is
different from other and so is every join between two partitions.
There will be lesser number of partitions, but creating paths for each
join between partitions will be crucial from performance point of
view. Consider, for example, systems which use partitions to
consolidate results from different sources for analytical purposes or
sharding. If we consider various points you have listed in [1] as to
why a partition is equivalent to a table, each join between partitions
is going to have very different characteristics and thus deserves a
set of paths for its own. Add to that possibility of partition pruning
or certain conditions affecting particular partitions, the need for
detailed planning evident.

The other usage of partitioning is to distribute the data and/or
quickly eliminate the data by partition pruning. In such case, all
partitions of a given table will have very similar properties. There
is a large chance that we will end up having same plans for every
partition and for joins between partitions. In such cases, I think it
suffices to create paths for just one or may be a handful partitions
of join and repeat that plan for other partitions of join. But in such
cases it also makes sense to have a light-weight representation for
partitions as compared to partitions being a full-fledged tables. If
we have such a light-weight representation, we may not even create
RelOptInfos representing joins between partitions, and different paths
for each join between partitions.

>
> One way of attacking this problem is to gang together partitions which
> are equivalent for planning purposes, as discussed in the paper "Join
> Optimization Techniques for Partitioned Tables" by Herodotou, Borisov,
> and Babu.  However, it's not exactly clear how to do this: we could
> gang together partitions that have the same index definitions, but the
> sizes of the heaps, the sizes of their indexes, and the row counts
> will vary from one partition to the next, and any of those things
> could cause the plan choice to be different for one partition vs. the
> next.  We could try to come up with heuristics for when those things
> are likely to be true.  For example, suppose we compute the set of
> partitions such that all joined relations have matching index
> definitions on all tables; then, we take the biggest table in the set
> and consider all tables more than half that size as part of one gang.
> The biggest table becomes the leader and we compute partition-wise
> paths for just that partition; the other members of the gang will
> eventually get a plan that is of the same shape, but we don't actually
> create it that plan until after scan/join planning is concluded.
Section 5 of that paper talks about clustering partitions together for
joining, only when there is 1:m OR n:1 partition matching for join. In
such a case, it clusters all the partitions from one relation that are
all joined with a single partition of the other relation. But I think
your idea to gang up partitions with similar properties may reduce the
number of paths we create but as you have mentioned how to gang them
up is not very clear. There are just too many factors like
availability of the indexes, sizes of tables, size of intermediate
results etc. which make it difficult to identify the properties used
for ganging up. Even after we do that, in the worst case, we will
still end up creating paths for all partitions of all joins, thus
causing increase in paths proportionate to the number of partitions.

In the section 6.3, the paper mentions that the number of paths
retained are linear in the number of child joins per parent join. So,
it's clear that the paper never considered linear increase in the
paths to be a problem or at least a problem that that work had to
solve. Now, it's surprising that their memory usage increased by 7% to
10%. But 1. they might be measuring total memory and not the memory
used by the planner and they experimented with PostgreSQL 8.3.7, which
probably tried much less number of paths than the current optimizer.

>
> Another idea is to try to reduce peak memory usage by performing
> planning separately for each partition-set.  For example, suppose we
> decide to do a partition-wise join of A, B, and C.  Initially, this
> gets represented as a PartitionJoinPath tree, like this:
>
> PartitionJoinPath
> -> AppendPath for A
> -> PartitionJoinPath
>   -> AppendPath for B
>   -> AppendPath for C
>
> Because we haven't created individual join paths for the members, this
> doesn't use much memory.  Somehow, we come up with a cost for the
> PartitionJoinPath; it probably won't be entirely accurate.  Once
> scan/join planning is concluded, if our final path contains a
> PartitionJoinPath, we go back and loop over the partitions.
A typical join tree will be composite: some portion partitioned and
some portion unpartitioned or different portions partitioned by
different partition schemes. In such case, inaccurate costs for
PartitionJoinPath, can affect the plan heavily, causing a suboptimal
path to be picked. Assuming that partitioning will be useful for large
sets of data, choosing a suboptimal plan can be more dangerous than
consuming memory for creating paths.

If we could come up with costs for PartitionJoinPath using some
methods of interpolation, say by sampling few partitions and then
extrapolating their costs for entire PartitionJoinPath, we can use
this method. But unless the partitions have very similar
characteristics or have such characteristics that costs can be guessed
based on the differences between the characteristics, I do not see how
that can happen. For example, while costing a PartitionJoinPath with
pathkeys, the costs will change a lot based on whether underlying
relations have indexes, or which join methods are used, which in turn
is based on properties on the partitions. Same is the case for paths
with parameterization. All such paths are important when a partitioned
join relation joins with other unpartitioned relation or a partitioned
relation with different partitioning scheme.

When each partition of base relation being joined has different
properties, the cost for join between one set of partitions can differ
from join between other set of partitions. Not only that, the costs
for various properties of resultant paths like pathkeys,
parameterization can vary a lot, depending upon the available indexes
and estimates of rows for each join. So, we need to come up with these
cost estimates separately for each join between partitions to come up
with cost of each PartitionJoinPath. If we have to calculate those
costs to create PartitionJoinPath, we better save them in paths rather
than recalculating them in the second round of planning for joins
between partitions.

> For each
> partition, we switch to a new memory context, perform planning, copy
> the best path and its substructure back to the parent context, and
> then reset the context.

This could be rather tricky. It assumes that all the code that creates
paths for joins, should not allocate any memory which is linked to
some object in a context that lives longer than the path creation
context. There is some code like create_join_clause() or
make_canonical_pathkey(), which carefully chooses which memory context
to allocate memory in. But can we ensure it always? postgres_fdw for
example allocates memory for PgFdwRelationInfo in current memory
context and attaches it in RelOptInfo, which should be in the
planner's original context. So, if we create a new memory context for
each partition, fpinfos would be invalidated when those contexts are
released. Not that, we can not enforce some restriction on the memory
usage while planning, it's hard to enforce it and bugs arising from it
may go unnoticed. GEQO planner might have its own problems with this
approach. Third party FDWs will pose a problem.

A possible solution would be to keep the track of used paths using a
reference count. Once the paths for given join tree are created, free
up the unused paths by traversing pathlist in each of the RelOptInfos.
Attached patch has a prototype implementation for the same. There are
some paths which are not linked to RelOptInfos, which need a bit
different treatment, but they can be handled too.

> In that way, peak memory usage only grows by
> about a factor of 2 rather than a factor equal to the partition count,
> because we don't need to keep every possibly-useful path for every
> partition all at the same time, but rather every possibly-useful path
> for a single partition.
>
> Maybe there are other ideas but I have a feeling any way you slice it
> this is going to be a lot of work.

For the case of carefully hand-crafted partitions, I think, users
would expect the planner to use really the best plan and thus may be
willing to accommodate for increased memory usage. Going by any
approach that does not create the paths for joins between partitions
is not guaranteed to give the best plan. Users willing to provide
increased memory will be unhappy if we do not give them the best path.

The user who creates hundreds of partitions, will ideally be using
pretty powerful servers with a lot of memory. On such servers, the
linear increase in memory for paths may not be as bad as you are
portraying above, as long as its producing the best plan.

Just joining partitioned tables with hundreds of partitions does not
increase the number of paths. Number of paths increases when two
partitioned tables with similar partitioning scheme are joined with
equality condition on partition key. Unless we consider
repartitioning, how many of the joining relations share same
partitioning scheme? Section 8.6 mentions, "no TPC-H query plan,
regardless of the partitioning scheme, contains n-way child joins for
n >= 4. Maximum partitions that the paper mentions is 168 (Table 3).
My VM which has 8GB RAM and 4 cores handled that case pretty well. We
may add logic to free up space used by useless paths post-join to free
up some memory for next stages of query execution.

There will still be users, for whom the increase in the memory usage
is unexpected. Those will need to be educated or for them we might
take heuristic PartitionJoinPath based approach discussed above. But I
don't think that heuristic approach should be the default case. May be
we should supply a GUC which can switch between the approaches.

Some ideas for GUCs are 1. delay_partition_wise_join - when ON uses
the heuristic approach of PartitionJoinPath.
2. A GUC similar to join_collapse_limit may be used to limit the
number of partitioned relations being joined using partition-wise join
technique. A value of 1, indicates enable_partition_wise_join = false.
So, we may replace enable_partition_wise_join withe this GUC.
3. A GUC max_joinable_partitions (open to suggestions for name) may
specify the maximum number of partitions that two relations may have
to be eligible for partition-wise join.

I guess, using these GUCs allows a user handle the trade-off between
getting the best plan and memory usage consciously. I think, users
would like to accept a suboptimal plans consciously than being thrown
a suboptimal plan without choice.

[1] http://postgresql.nabble.com/design-for-a-partitioning-feature-was-inheritance-td5921603.html

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

free_unused_paths.patch (21K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Partition-wise join for join between (declaratively) partitioned tables

Robert Haas
On Fri, Oct 28, 2016 at 3:09 AM, Ashutosh Bapat
<[hidden email]> wrote:

> I think there are going to be two kinds of partitioning use-cases.
> First, carefully hand-crafted by DBAs so that every partition is
> different from other and so is every join between two partitions.
> There will be lesser number of partitions, but creating paths for each
> join between partitions will be crucial from performance point of
> view. Consider, for example, systems which use partitions to
> consolidate results from different sources for analytical purposes or
> sharding. If we consider various points you have listed in [1] as to
> why a partition is equivalent to a table, each join between partitions
> is going to have very different characteristics and thus deserves a
> set of paths for its own. Add to that possibility of partition pruning
> or certain conditions affecting particular partitions, the need for
> detailed planning evident.
>
> The other usage of partitioning is to distribute the data and/or
> quickly eliminate the data by partition pruning. In such case, all
> partitions of a given table will have very similar properties. There
> is a large chance that we will end up having same plans for every
> partition and for joins between partitions. In such cases, I think it
> suffices to create paths for just one or may be a handful partitions
> of join and repeat that plan for other partitions of join. But in such
> cases it also makes sense to have a light-weight representation for
> partitions as compared to partitions being a full-fledged tables. If
> we have such a light-weight representation, we may not even create
> RelOptInfos representing joins between partitions, and different paths
> for each join between partitions.

I'm not sure I see a real distinction between these two use cases.  I
think that the problem of differing data distribution between
partitions is almost always going to be an issue.  Take the simple
case of an "orders" table which is partitioned by month.  First, the
month that's currently in progress may be much smaller than a typical
completed month.  Second, many businesses are seasonal and may have
many more orders at certain times of year.  For example, in American
retail, many businesses have large spikes in December.  I think some
businesses may do four times as much business in December as any other
month, for example.  So you will have that sort of variation, at
least.

> A typical join tree will be composite: some portion partitioned and
> some portion unpartitioned or different portions partitioned by
> different partition schemes. In such case, inaccurate costs for
> PartitionJoinPath, can affect the plan heavily, causing a suboptimal
> path to be picked. Assuming that partitioning will be useful for large
> sets of data, choosing a suboptimal plan can be more dangerous than
> consuming memory for creating paths.

Well, sure.  But, I mean, every simplifying assumption which the
planner makes to limit resource consumption could have that effect.
join_collapse_limit, for example, can cause horrible plans.  However,
we have it anyway, because the alternative of having planning take far
too long is unpalatable.  Planning is always, at some level,
guesswork.

>> For each
>> partition, we switch to a new memory context, perform planning, copy
>> the best path and its substructure back to the parent context, and
>> then reset the context.
>
> This could be rather tricky. It assumes that all the code that creates
> paths for joins, should not allocate any memory which is linked to
> some object in a context that lives longer than the path creation
> context. There is some code like create_join_clause() or
> make_canonical_pathkey(), which carefully chooses which memory context
> to allocate memory in. But can we ensure it always? postgres_fdw for
> example allocates memory for PgFdwRelationInfo in current memory
> context and attaches it in RelOptInfo, which should be in the
> planner's original context. So, if we create a new memory context for
> each partition, fpinfos would be invalidated when those contexts are
> released. Not that, we can not enforce some restriction on the memory
> usage while planning, it's hard to enforce it and bugs arising from it
> may go unnoticed. GEQO planner might have its own problems with this
> approach. Third party FDWs will pose a problem.

Yep, there are problems.  :-)

> A possible solution would be to keep the track of used paths using a
> reference count. Once the paths for given join tree are created, free
> up the unused paths by traversing pathlist in each of the RelOptInfos.
> Attached patch has a prototype implementation for the same. There are
> some paths which are not linked to RelOptInfos, which need a bit
> different treatment, but they can be handled too.

So, if you apply this with your previous patch, how much does it cut
down memory consumption?

>> In that way, peak memory usage only grows by
>> about a factor of 2 rather than a factor equal to the partition count,
>> because we don't need to keep every possibly-useful path for every
>> partition all at the same time, but rather every possibly-useful path
>> for a single partition.
>>
>> Maybe there are other ideas but I have a feeling any way you slice it
>> this is going to be a lot of work.
>
> For the case of carefully hand-crafted partitions, I think, users
> would expect the planner to use really the best plan and thus may be
> willing to accommodate for increased memory usage. Going by any
> approach that does not create the paths for joins between partitions
> is not guaranteed to give the best plan. Users willing to provide
> increased memory will be unhappy if we do not give them the best path.
>
> The user who creates hundreds of partitions, will ideally be using
> pretty powerful servers with a lot of memory. On such servers, the
> linear increase in memory for paths may not be as bad as you are
> portraying above, as long as its producing the best plan.

No, I don't agree.  We should be trying to build something that scales
well.  I've heard reports of customers with hundreds or even thousands
of partitions; I think it is quite reasonable to think that we need to
scale to 1000 partitions.  If we use 3MB of memory to plan a query
involving unpartitioned, using 3GB to plan a query where the main
tables have been partitioned 1000 ways does not seem reasonable to me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
1234 ... 14
Previous Thread Next Thread