Declarative partitioning - another take

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
328 messages Options
1234 ... 17
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Declarative partitioning - another take

Amit Langote-2
Hi,

Attached is the latest set of patches to implement declarative
partitioning.  There is already a commitfest entry for the same:
https://commitfest.postgresql.org/10/611/

The old discussion is here:
http://www.postgresql.org/message-id/55D3093C.5010800@.../

Attached patches are described below:

0001-Catalog-and-DDL-for-partition-key.patch
0002-psql-and-pg_dump-support-for-partitioned-tables.patch

These patches create the infrastructure and DDL for partitioned
tables.

In addition to a catalog for storing the partition key information, this
adds a new relkind to pg_class.h. PARTITION BY clause is added to CREATE
TABLE. Tables so created are RELKIND_PARTITIONED_REL relations which are
to be special in a number of ways, especially with regard to their
interactions with regular table inheritance features.

PARTITION BY RANGE ({ column_name | ( expression ) } [ opclass ] [, ...])
PARTITION BY LIST ({ column_name | ( expression ) } [ opclass ])


0003-Catalog-and-DDL-for-partition-bounds.patch
0004-psql-and-pg_dump-support-for-partitions.patch

These patches create the infrastructure and DDL for partitions.

Parent-child relationships of a partitioned table and its partitions are
managed behind-the-scenes with inheritance.  That means there is a
pg_inherits entry and attributes, constraints, etc. are marked with
inheritance related information appropriately.  However this case differs
from a regular inheritance relationship in a number of ways.  While the
regular inheritance imposes certain restrictions on what elements a
child's schema is allowed to contain (both at creation time and
after-the-fact), the partitioning related code imposes further
restrictions.  For example, while regular inheritance allows a child to
contain its own columns, the partitioning code disallows that.  Stuff like
NO INHERIT marking on check constraints, ONLY are ignored by the the
partitioning code.

Partition DDL includes both a way to create new partition and "attach" an
existing table as a partition of parent partitioned table.  Attempt to
drop a partition using DROP TABLE causes an error. Instead a partition
needs first to be "detached" from parent partitioned table.  On the other
hand, dropping the parent drops all the partitions if CASCADE is specified.

CREATE TABLE partition_name
    PARTITION OF parent_table [ (
  { column_name WITH OPTIONS [ column_constraint [ ... ] ]
    | table_constraint }
    [, ... ]
) ] partition_bound_spec
[ PARTITION BY {RANGE | LIST} ( { column_name | ( expression ) } [ opclass
] [, ...] )

CREATE FOREIGN TABLE [ IF NOT EXISTS ] table_name
  PARTITION OF parent_table [ (
  { column_name WITH OPTIONS [ column_constraint [ ... ] ]
    | table_constraint }
    [, ... ]
) ] partition_bound_spec
  SERVER server_name
[ OPTIONS ( option 'value' [, ... ] ) ]

ALTER TABLE parent ATTACH PARTITION partition_name partition_bound_spec [
VALIDATE | NO VALIDATE ]

ALTER TABLE parent DETACH PARTITION partition_name

partition_bound_spec is:

FOR VALUES { list_spec | range_spec }

list_spec in FOR VALUES is:

IN ( expression [, ...] )

range_spec in FOR VALUES is:

START lower-bound [ INCLUSIVE | EXCLUSIVE ] END upper-bound [ INCLUSIVE |
EXCLUSIVE ]

where lower-bound and upper-bound are:

{ ( expression [, ...] ) | UNBOUNDED }

expression can be a string literal, a numeric literal or NULL.

Note that the one can specify PARTITION BY when creating a partition
itself. That is to allow creating multi-level partitioned tables.


0005-Teach-a-few-places-to-use-partition-check-constraint.patch

A partition's bound implicitly constrains the values that are allowed in
the partition key of its rows.  The same can be applied to partitions when
inserting data *directly* into them to make sure that only the correct
data is allowed in (if a tuple has been routed from the parent, that
becomes unnecessary). To that end, ExecConstraints() now includes the
above implicit check constraint in the list of constraints it enforces.

Further, to enable constraint based partition exclusion on partitioned
tables, the planner code includes in its list of constraints the above
implicitly defined constraints.  This arrangement is temporary however and
will be rendered unnecessary when we implement special data structures and
algorithms within the planner in future versions of this patch to use
partition metadata more effectively for partition exclusion.

Note that the "constraints" referred to above are not some on-disk
structures but those generated internally on-the-fly when requested by a
caller.

0006-Introduce-a-PartitionTreeNode-data-structure.patch
0007-Tuple-routing-for-partitioned-tables.patch

These patches enable routing of tuples inserted into a partitioned table
to one of its leaf partitions.  It applies to both COPY FROM and INSERT.
First of these patches introduces a data structure that provides a
convenient means for the tuple routing code to step down a partition tree
one level at a time.  The second one modifies copy.c and executor to
implement actual tuple routing.  When inserting into a partition, its row
constraints and triggers are applied.  Note that the partition's
constraints also include the constraints defined on the parent.  This
arrangements means however that the parent's triggers are not currently
applied.

Updates are handled like they are now for inheritance sets, however, if an
update makes a row change partition, an error will be thrown.

0008-Update-DDL-Partitioning-chapter.patch

This patch updates the partitioning section in the DDL chapter to reflect
the new methods made available for creating and managing partitioned table
and its partitions.  Especially considering that it is no longer necessary
to define CHECK constraints and triggers/rules manually for constraint
exclusion and tuple routing, respectively.

TODO (in short term):
* Add more regression tests and docs
* Add PartitionOptInfo and use it to perform partition pruning more
effectively (the added infrastructure should also help pairwise joins
patch proposed by Ashutosh Bapat [1])
* Fix internal representation of list partition bounds to be more efficient


Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAFjFpRfQ8GrQvzp3jA2wnLqrHmaXna-urjm_UY9BqXj%3DEaDTSA%40mail.gmail.com


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

0001-Catalog-and-DDL-for-partitioned-tables.patch (119K) Download Attachment
0002-psql-and-pg_dump-support-for-partitioned-tables.patch (23K) Download Attachment
0003-Catalog-and-DDL-for-partitions.patch (202K) Download Attachment
0004-psql-and-pg_dump-support-for-partitions.patch (18K) Download Attachment
0005-Teach-a-few-places-to-use-partition-check-constraint.patch (17K) Download Attachment
0006-Introduce-a-PartitionTreeNode-data-structure.patch (8K) Download Attachment
0007-Tuple-routing-for-partitioned-tables.patch (41K) Download Attachment
0008-Update-DDL-Partitioning-chapter-to-reflect-new-devel.patch (25K) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Declarative partitioning - another take

Robert Haas
On Wed, Aug 10, 2016 at 7:09 AM, Amit Langote
<[hidden email]> wrote:
> Attached is the latest set of patches to implement declarative
> partitioning.

Cool.  I would encourage you to give some thought to what is the least
committable subset of these patches, and think about whether it can be
reorganized to make that smaller.  Based on the descriptions, it
sounds to me like the least committable subset is probably all of 0001
- 0005 as one giant commit, and that's a lot of code.  Maybe there's
no real way to come up with something more compact than that, but it's
worth some thought.

> 0001-Catalog-and-DDL-for-partition-key.patch

+      <entry><link
linkend="catalog-pg-partitioned"><structname>pg_partitioned</structname></link></entry>

pg_partitioned seems like a slightly strange choice of name.  We have
no other catalog tables whose names end in "ed", so the use of a past
participle here is novel.  More generally, this patch seems like it's
suffering a bit of terminological confusion: while the catalog table
is called pg_partitioned, the relkind is RELKIND_PARTITION_REL.  And
get_relation_by_qualified_name() thinks that RELKIND_PARTITION_REL is
a "table", but getRelationDescription thinks it's a "partitioned
table".  I think we need to get all these bits and pieces on the same
page, or have some kind of consistent way of deciding what to do in
each case.  Maybe pg_partitioned_table, RELKIND_PARTITIONED_TABLE,
etc. for the internal stuff, but just "table" in user-facing messages.

Alternatively, I wonder if we should switch to calling this a
"partition root" rather than a "partitioned table".  It's not the
whole table, just the root.  And doesn't contain data - in fact it
probably shouldn't even have storage - so calling it a "table" might
be confusing.  But if we're using CREATE TABLE to create it in the
ifrst place, then calling it something other than a table later on is
also confusing.  Hmm.

+      <entry><structfield>partexprs</structfield></entry>

There's a certain symmetry between this and what we do for indexes,
but I'm wondering whether there's a use case for partitioning a table
by an expression rather than a column value.  I suppose if you've
already done the work, there's no harm in supporting it.

+[ PARTITION BY {RANGE | LIST} ( { <replaceable
class="parameter">column_name</replaceable> | ( <replaceable
class="parameter">expression</replaceable> ) } [ <replaceable
class="parameter">opclass</replaceable> ] [, ...] )

The spacing isn't right here.  Should say { RANGE | LIST } rather than
{RANGE|LIST}.  Similarly elsewhere.

+      thus created is called <firstterm>partitioned</firstterm> table.  Key
+      consists of an ordered list of column names and/or expressions when
+      using the <literal>RANGE</> method, whereas only a single column or
+      expression can be specified when using the <literal>LIST</> method.

Why do we support range partitioning with multiple columns, but list
partitioning only with a single column?

+      The type of a key column or an expresion must have an associated

Typo.

+     <para>
+      Currently, there are following limitations on definition of partitioned
+      tables: one cannot specify any UNIQUE, PRIMARY KEY, EXCLUDE and/or
+      FOREIGN KEY constraints.
+     </para>

But I presume you can define these constraints on the partitions;
that's probably worth mentioning.  I'd say something like this
"Partitioned tables do not support UNIQUE, PRIMARY, EXCLUDE, or
FOREIGN KEY constraints; however, you can define these constraints on
individual data partitions."

+    if (partexprs)
+        recordDependencyOnSingleRelExpr(&myself,
+                                        (Node *) partexprs,
+                                        RelationGetRelid(rel),
+                                        DEPENDENCY_NORMAL,
+                                        DEPENDENCY_IGNORE);

I don't think introducing a new DEPENDENCY_IGNORE type is a good idea
here.  Instead, you could just add an additional Boolean argument to
recordDependencyOnSingleRelExpr.  That seems less likely to create
bugs in unrelated portions of the code.

+    /*
+     * Override specified inheritance option, if relation is a partitioned
+     * table and it's a target of INSERT.
+     */
+    if (alsoSource)
+        rte->inh |= relid_is_partitioned(rte->relid);

This seems extremely unlikely to be the right way to do this.  We're
just doing parse analysis here, so depending on properties of the
table that are (or might be made to be) changeable is not good.  It's
also an extra catalog lookup whether the table is partitioned or not
(and whether rte->inh is already true or not). Furthermore, it seems
to go against the comment explaining what rte->inh is supposed to
mean, which says:

 *        inh is TRUE for relation references that should be expanded to include
 *        inheritance children, if the rel has any.  This *must* be FALSE for
 *        RTEs other than RTE_RELATION entries.

I am not sure what problem you're trying to solve here, but I suspect
this needs to be ripped out altogether or handled later, during
planning.  Similarly for the related change in addRangeTableEntry.

+    {PartitionedRelationId,        /* PARTEDRELID */
+        PartitionedRelidIndexId,
+        1,
+        {
+            Anum_pg_partitioned_partedrelid,
+            0,
+            0,
+            0
+        },
+        128

I'd probably cut the initial size of this down a bit.  Odds are good
that not all of the tables you access will be partitioned.  Of course
that assumes we'll avoid probing this syscache for non-partitioned
tables, but I think that's pretty important.

+            ereport(ERROR,
+                    (errcode(ERRCODE_INVALID_TABLE_DEFINITION),
+                     errmsg("cannot alter column named in partition key")));
+        else
+            ereport(ERROR,
+                    (errcode(ERRCODE_INVALID_TABLE_DEFINITION),
+                     errmsg("cannot alter column referenced in
partition key expression")));

Maybe "cannot alter type of column named in partition key"?  And
similarly for the other one.

+/*
+ * transformPartitionBy
+ *         Transform any expressions present in the partition key
+ */
+static PartitionBy *
+transformPartitionBy(Relation rel, PartitionBy *partitionby)

Shouldn't this be in src/backend/parser/parse_*.c instead of tablecmds.c?

Most of this (0001) looks pretty reasonable.  I'm sure that there is
more that needs fixing than what I've mentioned above, which is just
what I saw on an initial read-through, but overall I like the
approach.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Declarative partitioning - another take

Ashutosh Bapat




I think it makes sense to keep calling it a table because it has all the
logical properties of a table even though it will differ from a regular
table on the basis of physical implementation details such as that it does
not own physical storage.  Am I missing something?

>
> +      <entry><structfield>partexprs</structfield></entry>
>
> There's a certain symmetry between this and what we do for indexes,
> but I'm wondering whether there's a use case for partitioning a table
> by an expression rather than a column value.  I suppose if you've
> already done the work, there's no harm in supporting it.

Yeah, it's not a whole lot of code to manage expressions alongside simple
column references.

Users who would like to partition their tables by "age" will partition those by the month or year extracted out of a date column e.g. order_date. They will find it convenient to use an expression (extract(month from date)) as a partition key, instead of storing month or year as a separate column.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Declarative partitioning - another take

Robert Eckhardt
On Tue, Aug 16, 2016 at 2:30 AM, Ashutosh Bapat <[hidden email]> wrote:




I think it makes sense to keep calling it a table because it has all the
logical properties of a table even though it will differ from a regular
table on the basis of physical implementation details such as that it does
not own physical storage.  Am I missing something?

>
> +      <entry><structfield>partexprs</structfield></entry>
>
> There's a certain symmetry between this and what we do for indexes,
> but I'm wondering whether there's a use case for partitioning a table
> by an expression rather than a column value.  I suppose if you've
> already done the work, there's no harm in supporting it.

Yeah, it's not a whole lot of code to manage expressions alongside simple
column references.

Users who would like to partition their tables by "age" will partition those by the month or year extracted out of a date column e.g. order_date. They will find it convenient to use an expression (extract(month from date)) as a partition key, instead of storing month or year as a separate column.

In GPDB we have partitioning. It is almost always by date and then often the partitions are for different sizes, i.e. by day for 30 days then by month for 3 years then by year. What we also support, but isn't super performant, is sub-partitioning. 

This is where some on the newer indexing strategies is interesting to me. I see them as synergistic not redundant. 
 

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Declarative partitioning - another take

Robert Haas
In reply to this post by Amit Langote-2
On Wed, Aug 10, 2016 at 7:09 AM, Amit Langote
<[hidden email]> wrote:
> 0002-psql-and-pg_dump-support-for-partitioned-tables.patch

+    if (pset.sversion >= 90600 && tableinfo.relkind == 'P')

Version check is redundant, right?

+) PARTITION BY RANGE ((a+b));
+\d describe_range_key
+Partitioned table "public.describe_range_key"
+ Column |  Type   | Modifiers
+--------+---------+-----------
+ a      | integer |
+ b      | integer |
+Partition Key: PARTITION BY RANGE (((a + b)))

I understand that it's probably difficult not to end up with two sets
of parentheses here, but can we avoid ending up with three sets?

Also, I wonder if pg_get_partkeydef() should omit "PARTITION BY" and
pg_dump can add that part back.  Then this could say:

Partition Key: RANGE ((a + b))

...which seems a good deal more natural than what you have now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Declarative partitioning - another take

Robert Haas
In reply to this post by Amit Langote-2
On Wed, Aug 10, 2016 at 7:09 AM, Amit Langote
<[hidden email]> wrote:
> 0003-Catalog-and-DDL-for-partition-bounds.patch
>
> Partition DDL includes both a way to create new partition and "attach" an
> existing table as a partition of parent partitioned table.  Attempt to
> drop a partition using DROP TABLE causes an error. Instead a partition
> needs first to be "detached" from parent partitioned table.  On the other
> hand, dropping the parent drops all the partitions if CASCADE is specified.

Hmm, I don't think I like this.  Why should it be necessary to detach
a partition before dropping it?  That seems like an unnecessary step.

     [ DEFERRABLE | NOT DEFERRABLE ] [ INITIALLY DEFERRED | INITIALLY
IMMEDIATE ]
+
 </synopsis>

Unnecessary hunk.

+     <para>
+      If this table is a partition, one cannot perform <literal>DROP
NOT NULL</>
+      on a column if it is marked not null in the parent table.
+      not null.
+     </para>

Sentence fragment.

+     <para>
+      Note that unlike the <literal>ATTACH PARTITION</> command, a partition
+      being detached can be itself partitioned.  In that case, it continues
+      to exist as such.
+     </para>

This is another restriction I don't understand.  Why can't I attach a
partitioned table?

+        indicate that descendant tables are included.  Note that whether
+        <literal>ONLY</> or <literal>*</> is specified has no effect in case
+        of a partitioned table; descendant tables (in this case, partitions)
+        are always included.

Ugh, why?  I think this should work exactly the same way for
partitioned tables that it does for any other inheritance hierarchy.
Sure, you'll get no rows, but who cares?

+CREATE FOREIGN TABLE measurement_y2016m07
+    PARTITION OF measurement FOR VALUES START ('2016-07-01') END
('2016-08-01');
+    SERVER server_07;

Extra semicolon?

+      A partition cannot have columns other than those inherited from the
+      parent.  That includes the <structfield>oid</> column, which can be

I think experience suggests that this is a good restriction, but then
why does the syntax synopsis indicate that PARTITION BY can be
specified along with column definitions?  Similarly for CREATE FOREIGN
TABLE.

+      When specifying for a table being created as partition, one needs to
+      use column names from the parent table as part of the key.

This is not very clear.

-       /* Remove NO INHERIT flag if rel is a partitioned table */
-       if (relid_is_partitioned(relid))
+       /* Discard NO INHERIT, if relation is a partitioned table or a
partition */
+       if (relid_is_partitioned(relid) || relid_is_partition(relid))
                is_no_inherit = false;

It might be right to disallow NO INHERIT in this case, but I don't
think it can be right to just silently ignore it.

+ * Not flushed from the cache by RelationClearRelation() unless changed because
+ * of addition or removal of partitions.

This seems unlikely to be safe, unless I'm missing something.

+       form = (Form_pg_inherits) GETSTRUCT(tuple);
+
+       systable_endscan(scan);
+       heap_close(catalogRelation, AccessShareLock);
+
+       return form->inhparent;

This is unsafe.  After systable_endscan, it is no longer OK to access
form->inhparent.

Try building with CLOBBER_CACHE_ALWAYS to find other cache flush hazards.

There should probably be a note in the function header comment that it
is unsafe to use this for an inheritance child that is not a
partition, because there could be more than one parent in that case.
Or maybe the whole idea of this function just isn't very sound...

+static List *
+get_partitions(Oid relid, int lockmode)
+{
+       return find_inheritance_children(relid, lockmode);
+}

What's the point?  If we're going to have a wrapper here at all, then
shouldn't it have a name that matches the existing convention - e.g.
find_partitions() or find_child_partitions()?  But I think you might
as well just use find_inheritance_children() directly.

+                * Happens when we have created the pg_inherits entry
but not the
+                * pg_partition entry yet.

Why do we ever allow the flow of control to reach this point while we
are in such an intermediate state?

+free_partition_info(PartitionInfoData **p, int num)

Seems very error-prone.  Isn't this why MemoryContextReset was invented?

+relid_is_partition(Oid relid)
+{
+       return SearchSysCacheExists1(PARTRELID, ObjectIdGetDatum(relid));
+}

This is used in a lot of places, and the overhead of checking it in
all of those places is not necessarily nil.  Syscache lookups aren't
free.  What if we didn't create a new catalog for this and instead
just added a relpartitionbound attribute to pg_class?  It seems a bit
silly to have a whole extra catalog to store one extra column...

        /*
+        * If this foreign table is a partition, check that the FDW supports
+        * insert.
+        */
+       if (stmt->base.partbound != NULL)
+       {
+               FdwRoutine *fdw_routine;
+
+               fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+               if (fdw_routine->ExecForeignInsert == NULL)
+                       ereport(ERROR,
+                                       (errcode(ERRCODE_FDW_NO_SCHEMAS),
+                                        errmsg("cannot create foreign
table as partition"),
+                                        errdetail("foreign-data
wrapper \"%s\" does not support insert",
+                                                       fdw->fdwname)));
+       }

Why?  This seems like an entirely arbitrary prohibition.  If inserts
aren't supported, then they'll fail at runtime.  Same for updates or
deletes, for which you have not added checks.  I think you should just
remove this.

+               /* Force inheritance recursion, if partitioned table. */
+               if (recurse || relid_is_partitioned(myrelid))

Disagree with this, too.  There's no reason for partitioned tables to
be special in this way.  Similarly, disagree with all of the places
that do something similar.

-                               errmsg("column \"%s\" in child table
must be marked NOT NULL",
-                                          attributeName)));
+                               errmsg("column \"%s\" in %s table must
be marked NOT NULL",
+                                          attributeName,
+                                          is_attach_partition ?
"source" : "child")));

You have a few of these; they cause problems for translators, because
different languages have different word ordering.  Repeat the entire
message instead: is_attach_partition ? "column \"%s\" in source table
must be marked NOT NULL" : "column \"%s\" in child table must be
marked NOT NULL".

+-- XXX add ownership tests

So do that.  :-)

+ERROR:  column "b" is not null in parent
+HINT:  Please drop not null in the parent instead

Hmm.   That hint doesn't seem like project style, and I'm not sure
that it really makes sense to issue such a hint anyway.  Who knows
whether that is the right thing to do?  I think you should somehow be
complaining about the fact that this is a partition, rather than
complaining about the fact that the column is NOT NULL in the parent.
Are we insisting that the flags match exactly, or only that the child
may not allow nulls unless the parent does?

+ERROR:  new partition's list of values overlaps with partition
"lpart1" of "list_parted"

Maybe just:

ERROR: partitions must not overlap
-or-
ERROR: partition "%s" would overlap partition "%s"

As before, this is just an initial read-through, so apologies for
whatever I may have missed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Declarative partitioning - another take

Ashutosh Bapat



+relid_is_partition(Oid relid)
+{
+       return SearchSysCacheExists1(PARTRELID, ObjectIdGetDatum(relid));
+}

This is used in a lot of places, and the overhead of checking it in
all of those places is not necessarily nil.  Syscache lookups aren't
free.  What if we didn't create a new catalog for this and instead
just added a relpartitionbound attribute to pg_class?  It seems a bit
silly to have a whole extra catalog to store one extra column...

 

It looks like in most of the places where this function is called it's using relid_is_partition(RelationGetRelid(rel)). Instead probably we should check existence of rd_partdesc or rd_partkey within Relation() and make sure that those members are always set for a partitioned table. That will avoid cache lookup and may give better performance.

That brings up another question. Can we have rd_partdesc non null and rd_partkey null or vice-versa. If not, should we club those into a single structure like Partition (similar to Relation)?



--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Declarative partitioning - another take

Amit Langote-2
On 2016/08/17 14:33, Ashutosh Bapat wrote:

>> +relid_is_partition(Oid relid)
>> +{
>> +       return SearchSysCacheExists1(PARTRELID, ObjectIdGetDatum(relid));
>> +}
>>
>> This is used in a lot of places, and the overhead of checking it in
>> all of those places is not necessarily nil.  Syscache lookups aren't
>> free.  What if we didn't create a new catalog for this and instead
>> just added a relpartitionbound attribute to pg_class?  It seems a bit
>> silly to have a whole extra catalog to store one extra column...
>>
> It looks like in most of the places where this function is called it's
> using relid_is_partition(RelationGetRelid(rel)). Instead probably we should
> check existence of rd_partdesc or rd_partkey within Relation() and make
> sure that those members are always set for a partitioned table. That will
> avoid cache lookup and may give better performance.

It seems you are talking about a *partitioned* relation here, whereas
relid_is_partition() is to trying to check if a relation is *partition* by
looking up the pg_partition catalog (or the associated cache).  For the
former, the test you suggest or rd_rel->relkind ==
RELKIND_PARTITIONED_TABLE test is enough.

I am slightly tempted to eliminate the pg_partition catalog and associated
syscache altogether and add a column to pg_class as Robert suggested.
That way, all relid_is_partition() calls will be replaced by
rel->rd_partbound != NULL check.  But one potential problem with that
approach is that now whenever a parent relation is opened, all the
partition relations must be opened to get the partbound value (to form the
PartitionDesc to be stored in parent relation's rd_partdesc).  Whereas
currently, we just look up the pg_partition catalog (or the associated
cache) for every partition and that gets us the partbound.

> That brings up another question. Can we have rd_partdesc non null and
> rd_partkey null or vice-versa. If not, should we club those into a single
> structure like Partition (similar to Relation)?

It's true that rd_partkey and rd_partdesc are both either NULL or
non-NULL, so combining them into a single struct is an idea worth considering.

Thanks,
Amit




--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Declarative partitioning - another take

Ashutosh Bapat


On Wed, Aug 17, 2016 at 11:51 AM, Amit Langote <[hidden email]> wrote:
On 2016/08/17 14:33, Ashutosh Bapat wrote:
>> +relid_is_partition(Oid relid)
>> +{
>> +       return SearchSysCacheExists1(PARTRELID, ObjectIdGetDatum(relid));
>> +}
>>
>> This is used in a lot of places, and the overhead of checking it in
>> all of those places is not necessarily nil.  Syscache lookups aren't
>> free.  What if we didn't create a new catalog for this and instead
>> just added a relpartitionbound attribute to pg_class?  It seems a bit
>> silly to have a whole extra catalog to store one extra column...
>>
> It looks like in most of the places where this function is called it's
> using relid_is_partition(RelationGetRelid(rel)). Instead probably we should
> check existence of rd_partdesc or rd_partkey within Relation() and make
> sure that those members are always set for a partitioned table. That will
> avoid cache lookup and may give better performance.

It seems you are talking about a *partitioned* relation here, whereas
relid_is_partition() is to trying to check if a relation is *partition* by
looking up the pg_partition catalog (or the associated cache).  For the
former, the test you suggest or rd_rel->relkind ==
RELKIND_PARTITIONED_TABLE test is enough.

Uh, you are right. Sorry for my misunderstanding.
 

I am slightly tempted to eliminate the pg_partition catalog and associated
syscache altogether and add a column to pg_class as Robert suggested.
That way, all relid_is_partition() calls will be replaced by
rel->rd_partbound != NULL check.  But one potential problem with that
approach is that now whenever a parent relation is opened, all the
partition relations must be opened to get the partbound value (to form the
PartitionDesc to be stored in parent relation's rd_partdesc).  Whereas
currently, we just look up the pg_partition catalog (or the associated
cache) for every partition and that gets us the partbound.

> That brings up another question. Can we have rd_partdesc non null and
> rd_partkey null or vice-versa. If not, should we club those into a single
> structure like Partition (similar to Relation)?

It's true that rd_partkey and rd_partdesc are both either NULL or
non-NULL, so combining them into a single struct is an idea worth considering.

Thanks,
Amit





--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Declarative partitioning - another take

Robert Haas
In reply to this post by Amit Langote-2
On Wed, Aug 17, 2016 at 2:21 AM, Amit Langote
<[hidden email]> wrote:
> I am slightly tempted to eliminate the pg_partition catalog and associated
> syscache altogether and add a column to pg_class as Robert suggested.
> That way, all relid_is_partition() calls will be replaced by
> rel->rd_partbound != NULL check.  But one potential problem with that
> approach is that now whenever a parent relation is opened, all the
> partition relations must be opened to get the partbound value (to form the
> PartitionDesc to be stored in parent relation's rd_partdesc).  Whereas
> currently, we just look up the pg_partition catalog (or the associated
> cache) for every partition and that gets us the partbound.

Well, you could just look up the pg_class row without opening the
relation, too.  There is a system cache on pg_class.oid, after all.  I
think the issue is whether it's safe to read either one of those
things without a lock on the child relation.  If altering the
partitioning information for a relation requires holding only
AccessExclusiveLock on that relation, and no lock on the parent, then
you really can't read the information for any child relation without
taking at least AccessShareLock.  Otherwise, it might change under
you, and that would be bad.

I'm inclined to think that changing the partitioning information for a
child is going to require AccessExclusiveLock on both the child and
the parent.  That seems unfortunate from a concurrency point of view,
but we may be stuck with it: suppose you require only
ShareUpdateExclusiveLock on the parent.  Well, then a concurrent read
transaction might see the partition boundaries change when it does a
relcache rebuild, which would cause it to suddenly start expecting the
data to be in a different plan in mid-transaction, perhaps even in
mid-scan.  Maybe that's survivable with really careful coding, but it
seems like it's probably a bad thing.  For example, it would mean that
the executor would be unable to rely on the partitioning information
in the relcache remaining stable underneath it.  Moreover, the
relcache is always going to be scanned with the most recent possible
MVCC snapshot, but the transaction snapshot may be older, so such a
system creates all sorts of nasty possibilities for there to be skew
between the snapshot being used to via the data and the snapshot being
used to read the metadata that says where the data is.

This may need some more thought, but if we go with that approach of
requiring an AccessExclusiveLock on both parent and child, then it
seems to me that maybe we should consider the partitioning information
to be a property of the parent rather than the child.  Just take all
the partitioning information for all children and put it in one big
node tree and store it in the pg_class or pg_partition_root entry for
the parent as one big ol' varlena.  Now you can open the parent and
get all of the partitioning information for all of the children
without needing any lock on any child, and that's *really* good,
because it means that some day we might be able to do partition
elimination before locking any of the children!  That would be
excellent.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Declarative partitioning - another take

Ashutosh Bapat
In reply to this post by Amit Langote-2
The parent-child relationship of multi-level partitioned tables is not retained when creating the AppendRelInfo nodes. We create RelOptInfo nodes for all the leaf and intermediate tables. The AppendRelInfo nodes created for these RelOptInfos set the topmost table as the parent of all the leaf and child tables. Since partitioning scheme/s at each level is/are associated with the parent/s at that level, we loose information about the immediate parents and thus it becomes difficult to identify which leaf node falls where in the partition hierarchy. This stops us from doing any lump-sum partition pruning where we can eliminate all the partitions under a given parent-partition if that parent-partition gets pruned. It also restricts partition-wise join technique from being applied to partial partition hierarchy when the whole partitioning scheme of joining tables does not match. Maintaining a RelOptInfo hierarchy should not create corresponding Append (all kinds) plan hierarchy since accumulate_append_subpath() flattens any such hierarchy while creating paths. Can you please consider this point in your upcoming patch?


On Wed, Aug 10, 2016 at 4:39 PM, Amit Langote <[hidden email]> wrote:
Hi,

Attached is the latest set of patches to implement declarative
partitioning.  There is already a commitfest entry for the same:
https://commitfest.postgresql.org/10/611/

The old discussion is here:
http://www.postgresql.org/message-id/55D3093C.5010800@lab.ntt.co.jp/

Attached patches are described below:

0001-Catalog-and-DDL-for-partition-key.patch
0002-psql-and-pg_dump-support-for-partitioned-tables.patch

These patches create the infrastructure and DDL for partitioned
tables.

In addition to a catalog for storing the partition key information, this
adds a new relkind to pg_class.h. PARTITION BY clause is added to CREATE
TABLE. Tables so created are RELKIND_PARTITIONED_REL relations which are
to be special in a number of ways, especially with regard to their
interactions with regular table inheritance features.

PARTITION BY RANGE ({ column_name | ( expression ) } [ opclass ] [, ...])
PARTITION BY LIST ({ column_name | ( expression ) } [ opclass ])


0003-Catalog-and-DDL-for-partition-bounds.patch
0004-psql-and-pg_dump-support-for-partitions.patch

These patches create the infrastructure and DDL for partitions.

Parent-child relationships of a partitioned table and its partitions are
managed behind-the-scenes with inheritance.  That means there is a
pg_inherits entry and attributes, constraints, etc. are marked with
inheritance related information appropriately.  However this case differs
from a regular inheritance relationship in a number of ways.  While the
regular inheritance imposes certain restrictions on what elements a
child's schema is allowed to contain (both at creation time and
after-the-fact), the partitioning related code imposes further
restrictions.  For example, while regular inheritance allows a child to
contain its own columns, the partitioning code disallows that.  Stuff like
NO INHERIT marking on check constraints, ONLY are ignored by the the
partitioning code.

Partition DDL includes both a way to create new partition and "attach" an
existing table as a partition of parent partitioned table.  Attempt to
drop a partition using DROP TABLE causes an error. Instead a partition
needs first to be "detached" from parent partitioned table.  On the other
hand, dropping the parent drops all the partitions if CASCADE is specified.

CREATE TABLE partition_name
    PARTITION OF parent_table [ (
  { column_name WITH OPTIONS [ column_constraint [ ... ] ]
    | table_constraint }
    [, ... ]
) ] partition_bound_spec
[ PARTITION BY {RANGE | LIST} ( { column_name | ( expression ) } [ opclass
] [, ...] )

CREATE FOREIGN TABLE [ IF NOT EXISTS ] table_name
  PARTITION OF parent_table [ (
  { column_name WITH OPTIONS [ column_constraint [ ... ] ]
    | table_constraint }
    [, ... ]
) ] partition_bound_spec
  SERVER server_name
[ OPTIONS ( option 'value' [, ... ] ) ]

ALTER TABLE parent ATTACH PARTITION partition_name partition_bound_spec [
VALIDATE | NO VALIDATE ]

ALTER TABLE parent DETACH PARTITION partition_name

partition_bound_spec is:

FOR VALUES { list_spec | range_spec }

list_spec in FOR VALUES is:

IN ( expression [, ...] )

range_spec in FOR VALUES is:

START lower-bound [ INCLUSIVE | EXCLUSIVE ] END upper-bound [ INCLUSIVE |
EXCLUSIVE ]

where lower-bound and upper-bound are:

{ ( expression [, ...] ) | UNBOUNDED }

expression can be a string literal, a numeric literal or NULL.

Note that the one can specify PARTITION BY when creating a partition
itself. That is to allow creating multi-level partitioned tables.


0005-Teach-a-few-places-to-use-partition-check-constraint.patch

A partition's bound implicitly constrains the values that are allowed in
the partition key of its rows.  The same can be applied to partitions when
inserting data *directly* into them to make sure that only the correct
data is allowed in (if a tuple has been routed from the parent, that
becomes unnecessary). To that end, ExecConstraints() now includes the
above implicit check constraint in the list of constraints it enforces.

Further, to enable constraint based partition exclusion on partitioned
tables, the planner code includes in its list of constraints the above
implicitly defined constraints.  This arrangement is temporary however and
will be rendered unnecessary when we implement special data structures and
algorithms within the planner in future versions of this patch to use
partition metadata more effectively for partition exclusion.

Note that the "constraints" referred to above are not some on-disk
structures but those generated internally on-the-fly when requested by a
caller.

0006-Introduce-a-PartitionTreeNode-data-structure.patch
0007-Tuple-routing-for-partitioned-tables.patch

These patches enable routing of tuples inserted into a partitioned table
to one of its leaf partitions.  It applies to both COPY FROM and INSERT.
First of these patches introduces a data structure that provides a
convenient means for the tuple routing code to step down a partition tree
one level at a time.  The second one modifies copy.c and executor to
implement actual tuple routing.  When inserting into a partition, its row
constraints and triggers are applied.  Note that the partition's
constraints also include the constraints defined on the parent.  This
arrangements means however that the parent's triggers are not currently
applied.

Updates are handled like they are now for inheritance sets, however, if an
update makes a row change partition, an error will be thrown.

0008-Update-DDL-Partitioning-chapter.patch

This patch updates the partitioning section in the DDL chapter to reflect
the new methods made available for creating and managing partitioned table
and its partitions.  Especially considering that it is no longer necessary
to define CHECK constraints and triggers/rules manually for constraint
exclusion and tuple routing, respectively.

TODO (in short term):
* Add more regression tests and docs
* Add PartitionOptInfo and use it to perform partition pruning more
effectively (the added infrastructure should also help pairwise joins
patch proposed by Ashutosh Bapat [1])
* Fix internal representation of list partition bounds to be more efficient


Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAFjFpRfQ8GrQvzp3jA2wnLqrHmaXna-urjm_UY9BqXj%3DEaDTSA%40mail.gmail.com


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers




--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Declarative partitioning - another take

Amit Langote-2
On 2016/08/22 13:51, Ashutosh Bapat wrote:

> The parent-child relationship of multi-level partitioned tables is not
> retained when creating the AppendRelInfo nodes. We create RelOptInfo nodes
> for all the leaf and intermediate tables. The AppendRelInfo nodes created
> for these RelOptInfos set the topmost table as the parent of all the leaf
> and child tables. Since partitioning scheme/s at each level is/are
> associated with the parent/s at that level, we loose information about the
> immediate parents and thus it becomes difficult to identify which leaf node
> falls where in the partition hierarchy. This stops us from doing any
> lump-sum partition pruning where we can eliminate all the partitions under
> a given parent-partition if that parent-partition gets pruned. It also
> restricts partition-wise join technique from being applied to partial
> partition hierarchy when the whole partitioning scheme of joining tables
> does not match. Maintaining a RelOptInfo hierarchy should not create
> corresponding Append (all kinds) plan hierarchy since
> accumulate_append_subpath() flattens any such hierarchy while creating
> paths. Can you please consider this point in your upcoming patch?

I agree.  So there seem to be two things here:  a) when expanding a
partitioned table inheritance set, do it recursively such that resulting
AppendRelInfos preserve *immediate* parent-child relationship info.  b)
when accumulating append subpaths, do not flatten a subpath that is itself
an append when ((AppendPath *) subpath)->path.parent is a RelOptInfo with
non-NULL partitioning info.  Is the latter somehow necessary for
pairwise-join considerations?

I think I can manage to squeeze in (a) in the next version patch and will
also start working on (b), mainly the part about RelOptInfo getting some
partitioning info.

Thanks,
Amit




--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Declarative partitioning - another take

Ashutosh Bapat


On Thu, Aug 25, 2016 at 12:22 PM, Amit Langote <[hidden email]> wrote:
On 2016/08/22 13:51, Ashutosh Bapat wrote:
> The parent-child relationship of multi-level partitioned tables is not
> retained when creating the AppendRelInfo nodes. We create RelOptInfo nodes
> for all the leaf and intermediate tables. The AppendRelInfo nodes created
> for these RelOptInfos set the topmost table as the parent of all the leaf
> and child tables. Since partitioning scheme/s at each level is/are
> associated with the parent/s at that level, we loose information about the
> immediate parents and thus it becomes difficult to identify which leaf node
> falls where in the partition hierarchy. This stops us from doing any
> lump-sum partition pruning where we can eliminate all the partitions under
> a given parent-partition if that parent-partition gets pruned. It also
> restricts partition-wise join technique from being applied to partial
> partition hierarchy when the whole partitioning scheme of joining tables
> does not match. Maintaining a RelOptInfo hierarchy should not create
> corresponding Append (all kinds) plan hierarchy since
> accumulate_append_subpath() flattens any such hierarchy while creating
> paths. Can you please consider this point in your upcoming patch?

I agree.  So there seem to be two things here:  a) when expanding a
partitioned table inheritance set, do it recursively such that resulting
AppendRelInfos preserve *immediate* parent-child relationship info. 

Right.
 
b)
when accumulating append subpaths, do not flatten a subpath that is itself
an append when ((AppendPath *) subpath)->path.parent is a RelOptInfo with
non-NULL partitioning info.Is the latter somehow necessary for
pairwise-join considerations?

I don't think you need to do anything in the path creation code for this. As is it flattens all AppendPath hierarchies whether for partitioning or inheritance or subqueries. We should leave it as it is.



I think I can manage to squeeze in (a) in the next version patch and will
also start working on (b), mainly the part about RelOptInfo getting some
partitioning info.

I am fine with b, where you would include some partitioning information in RelOptInfo. But you don't need to do what you said in (b) above.

In a private conversation Robert Haas suggested a way slightly different than what my patch for partition-wise join does. He suggested that the partitioning schemes i.e strategy, number of partitions and bounds of the partitioned elations involved in the query should be stored in PlannerInfo in the form of a list. Each partitioning scheme is annotated with the relids of the partitioned relations. RelOptInfo of the partitioned relation will point to the partitioning scheme in PlannerInfo. Along-with that each RelOptInfo will need to store partition keys for corresponding relation. This simplifies matching the partitioning schemes of the joining relations. Also it reduces the number of copies of partition bounds floating around as we expect that a query will involve multiple partitioned tables following similar partitioning schemes. May be you want to consider this idea while working on (b).


Thanks,
Amit





--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Declarative partitioning - another take

Amit Langote-2
In reply to this post by Robert Haas
On 2016/08/18 5:23, Robert Haas wrote:

> On Wed, Aug 17, 2016 at 2:21 AM, Amit Langote
> <[hidden email]> wrote:
>> I am slightly tempted to eliminate the pg_partition catalog and associated
>> syscache altogether and add a column to pg_class as Robert suggested.
>> That way, all relid_is_partition() calls will be replaced by
>> rel->rd_partbound != NULL check.  But one potential problem with that
>> approach is that now whenever a parent relation is opened, all the
>> partition relations must be opened to get the partbound value (to form the
>> PartitionDesc to be stored in parent relation's rd_partdesc).  Whereas
>> currently, we just look up the pg_partition catalog (or the associated
>> cache) for every partition and that gets us the partbound.
>
> Well, you could just look up the pg_class row without opening the
> relation, too.  There is a system cache on pg_class.oid, after all.  I

Yes, I somehow didn't think of that.

> think the issue is whether it's safe to read either one of those
> things without a lock on the child relation.  If altering the
> partitioning information for a relation requires holding only
> AccessExclusiveLock on that relation, and no lock on the parent, then
> you really can't read the information for any child relation without
> taking at least AccessShareLock.  Otherwise, it might change under
> you, and that would be bad.

I'd imagine this won't be a problem because we take an AccessExclusiveLock
on the parent when adding/removing a partition.

> I'm inclined to think that changing the partitioning information for a
> child is going to require AccessExclusiveLock on both the child and
> the parent.  That seems unfortunate from a concurrency point of view,
> but we may be stuck with it: suppose you require only
> ShareUpdateExclusiveLock on the parent.  Well, then a concurrent read
> transaction might see the partition boundaries change when it does a
> relcache rebuild, which would cause it to suddenly start expecting the
> data to be in a different plan in mid-transaction, perhaps even in
> mid-scan.  Maybe that's survivable with really careful coding, but it
> seems like it's probably a bad thing.  For example, it would mean that
> the executor would be unable to rely on the partitioning information
> in the relcache remaining stable underneath it.  Moreover, the
> relcache is always going to be scanned with the most recent possible
> MVCC snapshot, but the transaction snapshot may be older, so such a
> system creates all sorts of nasty possibilities for there to be skew
> between the snapshot being used to via the data and the snapshot being
> used to read the metadata that says where the data is.

We do take a lock on the parent because we would be changing its partition
descriptor (relcache).  I changed MergeAttributes() such that an
AccessExclusiveLock instead of ShareUpdateExclusiveLock is taken if the
parent is a partitioned table.

> This may need some more thought, but if we go with that approach of
> requiring an AccessExclusiveLock on both parent and child, then it
> seems to me that maybe we should consider the partitioning information
> to be a property of the parent rather than the child.  Just take all
> the partitioning information for all children and put it in one big
> node tree and store it in the pg_class or pg_partition_root entry for
> the parent as one big ol' varlena.  Now you can open the parent and
> get all of the partitioning information for all of the children
> without needing any lock on any child, and that's *really* good,
> because it means that some day we might be able to do partition
> elimination before locking any of the children!  That would be
> excellent.

If we need an AccessExclusiveLock on parent to add/remove a partition
(IOW, changing that child table's partitioning information), then do we
need to lock the individual partitions when reading partition's
information?  I mean to ask why the simple syscache look-ups to get each
partition's bound wouldn't do.

Thanks,
Amit




--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Declarative partitioning - another take

Amit Langote-2
In reply to this post by Robert Haas

Sorry it took me a while to reply.  Attached updated patches including the
review comments on 0001 at [1].

On 2016/08/17 3:54, Robert Haas wrote:
> On Wed, Aug 10, 2016 at 7:09 AM, Amit Langote
> <[hidden email]> wrote:
>> 0002-psql-and-pg_dump-support-for-partitioned-tables.patch
>
> +    if (pset.sversion >= 90600 && tableinfo.relkind == 'P')
>
> Version check is redundant, right?

Yep, fixed.

> +) PARTITION BY RANGE ((a+b));
> +\d describe_range_key
> +Partitioned table "public.describe_range_key"
> + Column |  Type   | Modifiers
> +--------+---------+-----------
> + a      | integer |
> + b      | integer |
> +Partition Key: PARTITION BY RANGE (((a + b)))
>
> I understand that it's probably difficult not to end up with two sets
> of parentheses here, but can we avoid ending up with three sets?
Fixed.  This code was copy-pasted from the index code which has other
considerations for adding surrounding parentheses which don't apply to the
partition key code.

> Also, I wonder if pg_get_partkeydef() should omit "PARTITION BY" and
> pg_dump can add that part back.  Then this could say:
>
> Partition Key: RANGE ((a + b))
>
> ...which seems a good deal more natural than what you have now.

Agreed, so done that way.

>> 0003-Catalog-and-DDL-for-partition-bounds.patch
>>
>> Partition DDL includes both a way to create new partition and "attach" an
>> existing table as a partition of parent partitioned table.  Attempt to
>> drop a partition using DROP TABLE causes an error. Instead a partition
>> needs first to be "detached" from parent partitioned table.  On the other
>> hand, dropping the parent drops all the partitions if CASCADE is specified.
>
> Hmm, I don't think I like this.  Why should it be necessary to detach
> a partition before dropping it?  That seems like an unnecessary step.
I thought we had better lock the parent table when removing one of its
partitions and it seemed a bit odd to lock the parent table when dropping
a partition using DROP TABLE?  OTOH, with ALTER TABLE parent DETACH
PARTITION, the parent table is locked anyway.

>      [ DEFERRABLE | NOT DEFERRABLE ] [ INITIALLY DEFERRED | INITIALLY
> IMMEDIATE ]
> +
>  </synopsis>
>
> Unnecessary hunk.
>
> +     <para>
> +      If this table is a partition, one cannot perform <literal>DROP
> NOT NULL</>
> +      on a column if it is marked not null in the parent table.
> +      not null.
> +     </para>
>
> Sentence fragment.
Fixed.

>
> +     <para>
> +      Note that unlike the <literal>ATTACH PARTITION</> command, a partition
> +      being detached can be itself partitioned.  In that case, it continues
> +      to exist as such.
> +     </para>
>
> This is another restriction I don't understand.  Why can't I attach a
> partitioned table?

I removed this restriction.

ATExecAttachPartition() adds a AT work queue entry for the table being
attached to perform a heap scan in the rewrite phase.  The scan is done
for checking that no row violates the partition boundary condition.  When
attaching a partitioned table as partition, multiple AT work queue entries
are now added - one for each leaf partition of the table being attached.

> +        indicate that descendant tables are included.  Note that whether
> +        <literal>ONLY</> or <literal>*</> is specified has no effect in case
> +        of a partitioned table; descendant tables (in this case, partitions)
> +        are always included.
>
> Ugh, why?  I think this should work exactly the same way for
> partitioned tables that it does for any other inheritance hierarchy.
> Sure, you'll get no rows, but who cares?

Agreed, done that way.

> +CREATE FOREIGN TABLE measurement_y2016m07
> +    PARTITION OF measurement FOR VALUES START ('2016-07-01') END
> ('2016-08-01');
> +    SERVER server_07;
>
> Extra semicolon?

Fixed.

>
> +      A partition cannot have columns other than those inherited from the
> +      parent.  That includes the <structfield>oid</> column, which can be
>
> I think experience suggests that this is a good restriction, but then
> why does the syntax synopsis indicate that PARTITION BY can be
> specified along with column definitions?  Similarly for CREATE FOREIGN
> TABLE.

The syntax synopsis of CREATE TABLE ... PARTITION OF indicates that a list
of column WITH OPTION and/or table_constraint can be specified.  It does
not allow column definitions.

In this case, inherited columns will be listed in the PARTITION BY clause.

Do you mean that the CREATE TABLE ... PARTITION OF syntax should allow
column definitions just like INHERITS does and unlike regular inheritance,
throw error if columns other than those to be merged are found?

>
> +      When specifying for a table being created as partition, one needs to
> +      use column names from the parent table as part of the key.
>
> This is not very clear.

Sentence removed because it may be clear from the context that inherited
columns are to be used for the partition key.

> -       /* Remove NO INHERIT flag if rel is a partitioned table */
> -       if (relid_is_partitioned(relid))
> +       /* Discard NO INHERIT, if relation is a partitioned table or a
> partition */
> +       if (relid_is_partitioned(relid) || relid_is_partition(relid))
>                 is_no_inherit = false;
>
> It might be right to disallow NO INHERIT in this case, but I don't
> think it can be right to just silently ignore it.

OK, I changed this to instead throw an error if a NO INHERIT check
constraint is added to a partitioned table or a partition.

> + * Not flushed from the cache by RelationClearRelation() unless changed because
> + * of addition or removal of partitions.
>
> This seems unlikely to be safe, unless I'm missing something.

Like TupleDesc, a table's PartitionDesc is preserved across relcache
rebuilds.  PartitionDesc consists of arrays of OIDs and partition bounds
for a table's immediate partitions.  It can only change by adding/removing
partitions to/from the table which requires an exclusive lock on it.
Since this data can grow arbitrarily big it seemed better to not have to
copy it around, so a direct pointer to the relcache field (rd_partdesc) is
given to callers.  relcache rebuilds that do not logically change
PartitionDesc leave it intact so that some user of it is not left with a
dangling pointer.  Am I missing something?

>
> +       form = (Form_pg_inherits) GETSTRUCT(tuple);
> +
> +       systable_endscan(scan);
> +       heap_close(catalogRelation, AccessShareLock);
> +
> +       return form->inhparent;
>
> This is unsafe.  After systable_endscan, it is no longer OK to access
> form->inhparent.
>
> Try building with CLOBBER_CACHE_ALWAYS to find other cache flush hazards.
>
> There should probably be a note in the function header comment that it
> is unsafe to use this for an inheritance child that is not a
> partition, because there could be more than one parent in that case.
> Or maybe the whole idea of this function just isn't very sound...
Fixed unsafe coding and added a comment to the function saying it should
be called on tables known to be partitions.

>
> +static List *
> +get_partitions(Oid relid, int lockmode)
> +{
> +       return find_inheritance_children(relid, lockmode);
> +}
>
> What's the point?  If we're going to have a wrapper here at all, then
> shouldn't it have a name that matches the existing convention - e.g.
> find_partitions() or find_child_partitions()?  But I think you might
> as well just use find_inheritance_children() directly.
OK, changed to just use find_inheritance_children() directly.

>
> +                * Happens when we have created the pg_inherits entry
> but not the
> +                * pg_partition entry yet.
>
> Why do we ever allow the flow of control to reach this point while we
> are in such an intermediate state?

Fixed to prevent this from happening.

>
> +free_partition_info(PartitionInfoData **p, int num)
>
> Seems very error-prone.  Isn't this why MemoryContextReset was invented?

Got rid of this function.

>
> +relid_is_partition(Oid relid)
> +{
> +       return SearchSysCacheExists1(PARTRELID, ObjectIdGetDatum(relid));
> +}
>
> This is used in a lot of places, and the overhead of checking it in
> all of those places is not necessarily nil.  Syscache lookups aren't
> free.  What if we didn't create a new catalog for this and instead
> just added a relpartitionbound attribute to pg_class?  It seems a bit
> silly to have a whole extra catalog to store one extra column...
OK, I got rid of the pg_partition catalog and added a pg_class attribute
for storing partition bound.  Because the newly added attribute is a
pg_node_tree and hence not readily accessible without a heap_getattr()
call, I added a boolean relispartition as well. Now all the
relid_is_partition() calls have been replaced by checks using
relation->rd_rel->relispartition.

>
>         /*
> +        * If this foreign table is a partition, check that the FDW supports
> +        * insert.
> +        */
> +       if (stmt->base.partbound != NULL)
> +       {
> +               FdwRoutine *fdw_routine;
> +
> +               fdw_routine = GetFdwRoutine(fdw->fdwhandler);
> +               if (fdw_routine->ExecForeignInsert == NULL)
> +                       ereport(ERROR,
> +                                       (errcode(ERRCODE_FDW_NO_SCHEMAS),
> +                                        errmsg("cannot create foreign
> table as partition"),
> +                                        errdetail("foreign-data
> wrapper \"%s\" does not support insert",
> +                                                       fdw->fdwname)));
> +       }
>
> Why?  This seems like an entirely arbitrary prohibition.  If inserts
> aren't supported, then they'll fail at runtime.  Same for updates or
> deletes, for which you have not added checks.  I think you should just
> remove this.
OK, removed this check.

Instead, ExecInitModifyTable()/BeginCopy() call CheckValidResultRel() for
every leaf partition to check that they are ready for CMD_INSERT.

>
> +               /* Force inheritance recursion, if partitioned table. */
> +               if (recurse || relid_is_partitioned(myrelid))
>
> Disagree with this, too.  There's no reason for partitioned tables to
> be special in this way.  Similarly, disagree with all of the places
> that do something similar.

Removed the forced recursion bit here and a few other places.

>
> -                               errmsg("column \"%s\" in child table
> must be marked NOT NULL",
> -                                          attributeName)));
> +                               errmsg("column \"%s\" in %s table must
> be marked NOT NULL",
> +                                          attributeName,
> +                                          is_attach_partition ?
> "source" : "child")));
>
> You have a few of these; they cause problems for translators, because
> different languages have different word ordering.  Repeat the entire
> message instead: is_attach_partition ? "column \"%s\" in source table
> must be marked NOT NULL" : "column \"%s\" in child table must be
> marked NOT NULL".
Changed so that the entire message is repeated.

>
> +-- XXX add ownership tests
>
> So do that.  :-)

Done, sorry about that.

>
> +ERROR:  column "b" is not null in parent
> +HINT:  Please drop not null in the parent instead
>
> Hmm.   That hint doesn't seem like project style, and I'm not sure
> that it really makes sense to issue such a hint anyway.  Who knows
> whether that is the right thing to do?  I think you should somehow be
> complaining about the fact that this is a partition, rather than
> complaining about the fact that the column is NOT NULL in the parent.
> Are we insisting that the flags match exactly, or only that the child
> may not allow nulls unless the parent does?
I think the latter.  It is assumed that all the parent's constraints are
present in a child table, because in ExecInsert()/CopyFrom() we perform
ExecConstraints() using the child relation even if the actual insert was
on the parent.  Also, if an inherited NOT NULL constraint on a child's
column is dropped irrespective of parent's, selecting a parent's NOT NULL
column might return nulls from the child table that no longer has the
constraint.

I recently came across a related proposal whereby dropping *inherited* NOT
NULL  from child tables will be prevented.  Problems in letting it be be
dropped are mentioned here:

https://www.postgresql.org/message-id/21633.1448383428@...

That proposal is probably being reworked such that NOT NULL constraints
get a pg_constraint entry with proper accounting of inheritance count.

> +ERROR:  new partition's list of values overlaps with partition
> "lpart1" of "list_parted"
>
> Maybe just:
>
> ERROR: partitions must not overlap
> -or-
> ERROR: partition "%s" would overlap partition "%s"

OK, I changed to the second message.

>
> As before, this is just an initial read-through, so apologies for
> whatever I may have missed.

Thanks a lot for the review.

By the way, I am still working on the following items and will be included
in the next version of the patch.

* Fix internal representation of list partition bounds to be more efficient
* Add PartitionOptInfo

As mentioned in [2], I have refactored the inheritance expansion code
within optimizer so that a partitioned table's inheritance hierarchy is
preserved in resulting AppendRelInfos (patch 0005).  One immediate benefit
of that is that if constraint exclusion determines that an intermediate
partition is to be excluded then all partitions underneath it are excluded
automatically.  With the current flattened model, all partitions in that
subtree would have been processed and excluded one-by-one.

Regards,
Amit

[1]
https://www.postgresql.org/message-id/CA%2BTgmoZ008qTgd_Qg6_oZb3i0mOYrS6MdhncwgcqPKahixjarg%40mail.gmail.com
[2]
https://www.postgresql.org/message-id/f2a9592a-17e9-4c6a-e021-03b802195ce7%40lab.ntt.co.jp


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

0001-Catalog-and-DDL-for-partitioned-tables-2.patch (119K) Download Attachment
0002-psql-and-pg_dump-support-for-partitioned-tables-2.patch (23K) Download Attachment
0003-Catalog-and-DDL-for-partitions-2.patch (202K) Download Attachment
0004-psql-and-pg_dump-support-for-partitions-2.patch (18K) Download Attachment
0005-Refactor-optimizer-s-inheritance-set-expansion-code-2.patch (14K) Download Attachment
0006-Teach-a-few-places-to-use-partition-check-quals-2.patch (17K) Download Attachment
0007-Introduce-a-PartitionTreeNode-data-structure-2.patch (8K) Download Attachment
0008-Tuple-routing-for-partitioned-tables-2.patch (41K) Download Attachment
0009-Update-DDL-Partitioning-chapter-to-reflect-new-devel-2.patch (25K) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Declarative partitioning - another take

Robert Haas
In reply to this post by Amit Langote-2
On Fri, Aug 26, 2016 at 1:33 PM, Amit Langote
<[hidden email]> wrote:
> We do take a lock on the parent because we would be changing its partition
> descriptor (relcache).  I changed MergeAttributes() such that an
> AccessExclusiveLock instead of ShareUpdateExclusiveLock is taken if the
> parent is a partitioned table.

Hmm, that seems both good and bad.  On the good side, as mentioned,
being able to rely on the partition descriptor not changing under us
makes this sort of thing much easier to reason about.  On the bad
side, it isn't good for this feature to have worse concurrency than
regular inheritance.  Not sure what to do about this.

> If we need an AccessExclusiveLock on parent to add/remove a partition
> (IOW, changing that child table's partitioning information), then do we
> need to lock the individual partitions when reading partition's
> information?  I mean to ask why the simple syscache look-ups to get each
> partition's bound wouldn't do.

Well, if X can't be changed without having an AccessExclusiveLock on
the parent, then an AccessShareLock on the parent is sufficient to
read X, right?  Because those lock modes conflict.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Declarative partitioning - another take

Amit Langote-2
On 2016/08/29 20:53, Robert Haas wrote:

> On Fri, Aug 26, 2016 at 1:33 PM, Amit Langote
> <[hidden email]> wrote:
>> We do take a lock on the parent because we would be changing its partition
>> descriptor (relcache).  I changed MergeAttributes() such that an
>> AccessExclusiveLock instead of ShareUpdateExclusiveLock is taken if the
>> parent is a partitioned table.
>
> Hmm, that seems both good and bad.  On the good side, as mentioned,
> being able to rely on the partition descriptor not changing under us
> makes this sort of thing much easier to reason about.  On the bad
> side, it isn't good for this feature to have worse concurrency than
> regular inheritance.  Not sure what to do about this.
>
>> If we need an AccessExclusiveLock on parent to add/remove a partition
>> (IOW, changing that child table's partitioning information), then do we
>> need to lock the individual partitions when reading partition's
>> information?  I mean to ask why the simple syscache look-ups to get each
>> partition's bound wouldn't do.
>
> Well, if X can't be changed without having an AccessExclusiveLock on
> the parent, then an AccessShareLock on the parent is sufficient to
> read X, right?  Because those lock modes conflict.

Yes.  And hence we can proceed with performing partition elimination
before locking any of children.  Lock on parent (AccessShareLock) will
prevent any of existing partitions to be removed and any new partitions to
be added because those operations require AccessExclusiveLock on the
parent.  What I was trying to understand is why this would not be possible
with a design where partition bound is stored in the catalog as a property
of individual partitions instead of a design where we store collection of
partition bounds as a property of the parent.

Thanks,
Amit




--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Declarative partitioning - another take

Robert Haas
On Wed, Aug 31, 2016 at 12:37 PM, Amit Langote
<[hidden email]> wrote:

>>> If we need an AccessExclusiveLock on parent to add/remove a partition
>>> (IOW, changing that child table's partitioning information), then do we
>>> need to lock the individual partitions when reading partition's
>>> information?  I mean to ask why the simple syscache look-ups to get each
>>> partition's bound wouldn't do.
>>
>> Well, if X can't be changed without having an AccessExclusiveLock on
>> the parent, then an AccessShareLock on the parent is sufficient to
>> read X, right?  Because those lock modes conflict.
>
> Yes.  And hence we can proceed with performing partition elimination
> before locking any of children.  Lock on parent (AccessShareLock) will
> prevent any of existing partitions to be removed and any new partitions to
> be added because those operations require AccessExclusiveLock on the
> parent.

Agreed.

> What I was trying to understand is why this would not be possible
> with a design where partition bound is stored in the catalog as a property
> of individual partitions instead of a design where we store collection of
> partition bounds as a property of the parent.

From the point of view of feasibility, I don't think it matters very
much where the property is stored; it's the locking that is the key
thing.  In other words, I think this *would* be possible if the
partition bound is stored as a property of individual partitions, as
long as it can't change without a lock on the parent.

However, it seems a lot better to make it a property of the parent
from a performance point of view.  Suppose there are 1000 partitions.
Reading one toasted value for pg_class and running stringToNode() on
it is probably a lot faster than scanning pg_inherits to find all of
the child partitions and then doing an index scan to find the pg_class
tuple for each and then decoding all of those tuples and assembling
them into some data structure.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Declarative partitioning - another take

Amit Langote-2
On 2016/08/31 16:17, Robert Haas wrote:

> On Wed, Aug 31, 2016 at 12:37 PM, Amit Langote wrote:
>> What I was trying to understand is why this would not be possible
>> with a design where partition bound is stored in the catalog as a property
>> of individual partitions instead of a design where we store collection of
>> partition bounds as a property of the parent.
>
> From the point of view of feasibility, I don't think it matters very
> much where the property is stored; it's the locking that is the key
> thing.  In other words, I think this *would* be possible if the
> partition bound is stored as a property of individual partitions, as
> long as it can't change without a lock on the parent.
>
> However, it seems a lot better to make it a property of the parent
> from a performance point of view.  Suppose there are 1000 partitions.
> Reading one toasted value for pg_class and running stringToNode() on
> it is probably a lot faster than scanning pg_inherits to find all of
> the child partitions and then doing an index scan to find the pg_class
> tuple for each and then decoding all of those tuples and assembling
> them into some data structure.

Seems worth trying.  One point that bothers me a bit is how do we enforce
partition bound condition on individual partition basis.  For example when
a row is inserted into a partition directly, we better check that it does
not fall outside the bounds and issue an error otherwise.  With current
approach, we just look up a partition's bound from the catalog and gin up
a check constraint expression (and cache in relcache) to be enforced in
ExecConstraints().  With the new approach, I guess we would need to look
up the parent's partition descriptor.  Note that the checking in
ExecConstraints() is turned off when routing a tuple from the parent.

Thanks,
Amit




--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Declarative partitioning - another take

Amit Langote-2
In reply to this post by Ashutosh Bapat
On 2016/08/25 16:15, Ashutosh Bapat wrote:

> On Thu, Aug 25, 2016 at 12:22 PM, Amit Langote wrote:
>> b)
>> when accumulating append subpaths, do not flatten a subpath that is itself
>> an append when ((AppendPath *) subpath)->path.parent is a RelOptInfo with
>> non-NULL partitioning info.Is the latter somehow necessary for
>> pairwise-join considerations?
>
> I don't think you need to do anything in the path creation code for this.
> As is it flattens all AppendPath hierarchies whether for partitioning or
> inheritance or subqueries. We should leave it as it is.

I thought it would be convenient for pairwise join code to work with the
hierarchy intact even within the AppendPath tree.  If it turns out to be
so, maybe that patch can take care of it.

>> I think I can manage to squeeze in (a) in the next version patch and will
>> also start working on (b), mainly the part about RelOptInfo getting some
>> partitioning info.
>
> I am fine with b, where you would include some partitioning information in
> RelOptInfo. But you don't need to do what you said in (b) above.
>
> In a private conversation Robert Haas suggested a way slightly different
> than what my patch for partition-wise join does. He suggested that the
> partitioning schemes i.e strategy, number of partitions and bounds of the
> partitioned elations involved in the query should be stored in PlannerInfo
> in the form of a list. Each partitioning scheme is annotated with the
> relids of the partitioned relations. RelOptInfo of the partitioned relation
> will point to the partitioning scheme in PlannerInfo. Along-with that each
> RelOptInfo will need to store partition keys for corresponding relation.
> This simplifies matching the partitioning schemes of the joining relations.
> Also it reduces the number of copies of partition bounds floating around as
> we expect that a query will involve multiple partitioned tables following
> similar partitioning schemes. May be you want to consider this idea while
> working on (b).

So IIUC, a partitioned relation's (baserel or joinrel) RelOptInfo has only
the information about partition keys.  They will be matched with query
restriction quals pruning away any unneeded partitions which happens
individually for each such parent baserel (within set_append_rel_size() I
suppose).  Further, two joining relations are eligible to be considered
for pairwise joining if they have identical partition keys and query
equi-join quals match the same.  The resulting joinrel will have the same
partition key (as either joining relation) and will have as many
partitions as there are in the intersection of sets of partitions of
joining rels (intersection proceeds by matching partition bounds).

"Partition scheme" structs go into a PlannerInfo list member, one
corresponding to each partitioned relation - baserel or joinrel, right?
As you say, each such struct has the following pieces of information:
strategy, num_partitions, bounds (and other auxiliary info).  Before
make_one_rel() starts, the list has one for each partitioned baserel.
After make_one_rel() has formed baserel pathlists and before
make_rel_from_joinlist() is called, are the partition scheme structs of
processed baserels marked with some information about the pruning activity
that occurred so far?  Then as we build successively higher levels of
joinrels, new entries will be made for those joinrels for which we added
pairwise join paths, with relids matching the corresponding joinrels.
Does that make sense?

Thanks,
Amit




--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
1234 ... 17
Previous Thread Next Thread
Loading...