Should we warn against using too many partitions?

classic Classic list List threaded Threaded
34 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Should we warn against using too many partitions?

David Rowley-3
Over on [1] I raised a concern about the lack of any warning in our
documents to inform users that they might not want to use thousands of
partitions.  More recently there's [2], also suffering from OOM using
100 partitions.  Perhaps there's more too this, but the planner using
a lot of memory planning updates and deletes to partitioned tables
does seem to be a surprise to many people.

I had hoped we could get something it the documents sooner rather than
later about this. Probably the v12 patch will need to be adjusted now
that the memory consumption will be reduced when many partitions are
pruned, but I still think v12 needs to have some sort of warning in
there.

https://commitfest.postgresql.org/23/2065/

I'm moving this to a new thread with a better title, rather than
tagging onto that old thread that's become rather long.

[1] https://www.postgresql.org/message-id/CAKJS1f8RW-mHQ8aEWD5Dv0+8A1wH5tHHdYMGW9y5sXqnE0X9wA@...
[2] https://www.postgresql.org/message-id/87ftp6l2qr.fsf@...

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: Should we warn against using too many partitions?

Amit Langote-2
Hi David,

On 2019/05/23 18:02, David Rowley wrote:

> Over on [1] I raised a concern about the lack of any warning in our
> documents to inform users that they might not want to use thousands of
> partitions.  More recently there's [2], also suffering from OOM using
> 100 partitions.  Perhaps there's more too this, but the planner using
> a lot of memory planning updates and deletes to partitioned tables
> does seem to be a surprise to many people.
>
> I had hoped we could get something it the documents sooner rather than
> later about this. Probably the v12 patch will need to be adjusted now
> that the memory consumption will be reduced when many partitions are
> pruned, but I still think v12 needs to have some sort of warning in
> there.
>
> https://commitfest.postgresql.org/23/2065/

The latest patch on the thread linked from this CF entry (a modified
version of your patch sent by Justin Pryzby) looks good to me.  Why not
post it on this thread and link this one to the CF entry?  Or maybe, make
this an open item, because we should update documentation back to v11?

Thanks,
Amit



Reply | Threaded
Open this post in threaded view
|

Re: Should we warn against using too many partitions?

David Rowley-3
On Fri, 24 May 2019 at 14:04, Amit Langote
<[hidden email]> wrote:
> The latest patch on the thread linked from this CF entry (a modified
> version of your patch sent by Justin Pryzby) looks good to me.  Why not
> post it on this thread and link this one to the CF entry?

I'm not much of a fan of that patch:

+     <para>
+      When using table inheritance, partition hierarchies with more than a few
+      hundred partitions are not recommended.  Larger partition hierarchies may
+      incur long planning time, and, in the case of <command>UPDATE</command>
+      and <command>DELETE</command>, excessive memory usage.  When inheritance
+      is used, see also the limitations described in
+      <xref linkend="ddl-partitioning-constraint-exclusion"/>.
+     </para>

I'm a bit confused about this paragraph.  It introduces itself as
talking about table inheritance, then uses the word "partition" in
various places. I think that can be dropped.  The final sentence
throws me off as it tries to reduce the scope to only inheritance, but
as far as I understand that was already the scope of the paragraph,
unless of course "table inheritance" is not the same as "inheritance".
Without any insider knowledge on it, I've no idea if this
UPDATE/DELETE issue affects native partitioning too.

+     <para>
+      When using declarative partitioning, the overhead of query planning
+      is directly related to the number of unpruned partitions.  Planning is
+      generally fast with small numbers of unpruned partitions, even in
+      partition hierarchies containing many thousands of partitions.  However,
+      long planning time will be incurred by large partition hierarchies if
+      partition pruning is not possible during the planning phase.
+     </para>

This should really mention the excessive memory usage when many
partitions survive pruning.

I've attached 3 patches of what I think should go into master, pg11, and pg10.

> Or maybe, make
> this an open item, because we should update documentation back to v11?

I'll add this to the open items list since it includes master, and
shift the CF entry to point to this thread.

Authors are Robert Haas and Justin Pryzby, who I've included in the email.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

docs_partitioning_warning_master_v2.patch (2K) Download Attachment
docs_partitioning_warning_pg11_v2.patch (1K) Download Attachment
docs_partitioning_warning_pg10_v2.patch (1K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Should we warn against using too many partitions?

Amit Langote-2
On 2019/05/24 13:37, David Rowley wrote:
> I've attached 3 patches of what I think should go into master, pg11, and pg10.

Thanks for the updated patches.

In pg11 and pg10 patches, I see this text:

+      Whether using table inheritance or native partitioning, hierarchies

Maybe, it would better to use the word "declarative" instead of "native",
if only to be consistent; neighboring paragraphs use "declarative".

Thanks,
Amit



Reply | Threaded
Open this post in threaded view
|

Re: Should we warn against using too many partitions?

David Rowley-3
On Fri, 24 May 2019 at 17:58, Amit Langote
<[hidden email]> wrote:
> +      Whether using table inheritance or native partitioning, hierarchies
>
> Maybe, it would better to use the word "declarative" instead of "native",
> if only to be consistent; neighboring paragraphs use "declarative".

Thanks for having a look.

I've attached the pg10 and pg11 patches with that updated... and also
the master one (unchanged) with the hopes that the CF bot picks that
one.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

docs_partitioning_warning_master_v2.patch (2K) Download Attachment
docs_partitioning_warning_pg11_v3.patch (1K) Download Attachment
docs_partitioning_warning_pg10_v3.patch (1K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Should we warn against using too many partitions?

David Rowley-3
On Fri, 24 May 2019 at 22:00, David Rowley <[hidden email]> wrote:
> I've attached the pg10 and pg11 patches with that updated... and also
> the master one (unchanged) with the hopes that the CF bot picks that
> one.

I got talking to Andres about this at PGCon after a use case of 250k
partitions was brought to our attention. I was thinking about the best
way to handle this on the long flight home and after studying the
current docs I really feel that they fairly well describe what we've
done so far implementing table partitioning, but they offer next to
nothing on best practices on how to make the most of the feature.

I've done some work on this today and what I've ended up with is an
entirely new section to the partitioning docs about best practices
which provides a bit of detail on how you might go about choosing the
partition key. It gives an example of why LIST partitioning on a set
of values that may grow significantly over time might be a bad idea.
It talks about memory growth with more partitions and mentions that
rel cache might become a problem even if queries are touching a small
number of partitions per query, but a large number per session.

The attached patch is aimed at master. PG11 will need the planner
memory and performance part tweaked and for PG10 I'll do that plus
remove the mention of PRIMARY KEY and UNIQUE constraints on the
partitioned table.

Does anyone see anything wrong with doing this?  I don't think there
should be an issue adding a section to the docs right at the end as
it's not causing any resequencing.

Or does anyone have any better ideas or better examples to give? or
any comments?

If it looks okay I can post version for PG11 and PG10 for review, but
I'd like to get this in fairly soon.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

add_docs_about_best_partition_practices.patch (6K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Should we warn against using too many partitions?

Justin Pryzby
I suggest just minor variations on language.

On Thu, Jun 06, 2019 at 04:43:48PM +1200, David Rowley wrote:

>diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
>index cce1618fc1..ab26630199 100644
>--- a/doc/src/sgml/ddl.sgml
>+++ b/doc/src/sgml/ddl.sgml
>@@ -4674,6 +4675,76 @@ EXPLAIN SELECT count(*) FROM measurement WHERE logdate &gt;= DATE '2008-01-01';
>    </itemizedlist>
>    </para>
>   </sect2>
>+  
>+  <sect2 id="ddl-partitioning-declarative-best-practices">
>+   <title>Declarative Partitioning Best Practices</title>
>+
>+   <para>
>+    The choice of how to partition a table should be considered carefully as

Either say "How to partition consider should be .." or "The choice should MADE carefully" ?

>+   <para>
>+    One of the most critical design decisions will be the column or columns
>+    which you partition your data by.  Often the best choice will be to

by which ?

>+   <para>
>+    Choosing the number of partitions to divide the table into is also a

the TARGET number of partitions BY WHICH to divide the table ?

>+    critical decision to make.  Not having enough partitions may mean that
>+    indexes remain too large and that data locality remains poor which could
>+    result in poor cache hit ratios.  However, dividing the table into too
>+    many partitions can also cause issues.  Too many partitions can mean
>+    slower query planning times and higher memory consumption during both
>+    query planning and execution.  It's also important to consider what
>+    changes may occur in the future when choosing how to partition your table.
>+    For example, if you choose to have one partition per customer and you
>+    currently have a small number of large customers, what will the

have ONLY ?

>+    implications be if in several years you obtain a large number of small
>+    customers.  In this case, it may be better to choose to partition by
>+    <literal>HASH</literal> and choose a reasonable amount of partitions

reasonable NUMBER ?

>+   <para>
>+    It is also important to consider the overhead of partitioning during
>+    query planning and execution.  The query planner is generally able to
>+    handle partition hierarchies up a few thousand partitions fairly well,
>+    providing that the vast majority of them can be pruned during query

provided ?

I would say: "provided that typical queries prune all but a small number of
partitions during planning time".

>+    <command>DELETE</command> commands.  Also, even if most queries are
>+    able to prune a high number of partitions during query planning, it still

LARGE number?

>+    may be undesirable to have a large number of partitions as each partition

may still ?

>+    also will obtain a relation cache entry in each session which uses the

will require ?  Or occupy ?

>+   <para>
>+    With data warehouse type workloads it can make sense to use a larger
>+    number of partitions than with an OLTP type workload.  Generally, in data
>+    warehouses, query planning time is less of a concern as the majority of
>+    processing time is generally spent during query execution.  With either of

remove the 2nd "generally"

>+    these two types of workload, it is important to make the right decisions
>+    early as re-partitioning large quantities of data can be painstakingly

early COMMA ?

PAINFULLY slow

>+    When performance is critical, performing workload simulations to
>+    assist in making the correct decisions can be beneficial.  

I would say:
Simulations of the intended workload are beneficial for optimizing partitioning
strategy.

Thanks,
Justin


Reply | Threaded
Open this post in threaded view
|

Re: Should we warn against using too many partitions?

Amit Langote
In reply to this post by David Rowley-3
Hi,

On Thu, Jun 6, 2019 at 1:44 PM David Rowley
<[hidden email]> wrote:

>
> On Fri, 24 May 2019 at 22:00, David Rowley <[hidden email]> wrote:
> > I've attached the pg10 and pg11 patches with that updated... and also
> > the master one (unchanged) with the hopes that the CF bot picks that
> > one.
>
> I got talking to Andres about this at PGCon after a use case of 250k
> partitions was brought to our attention. I was thinking about the best
> way to handle this on the long flight home and after studying the
> current docs I really feel that they fairly well describe what we've
> done so far implementing table partitioning, but they offer next to
> nothing on best practices on how to make the most of the feature.

Agreed that some "best practices" text is overdue, so thanks for taking that up.

> I've done some work on this today and what I've ended up with is an
> entirely new section to the partitioning docs about best practices
> which provides a bit of detail on how you might go about choosing the
> partition key. It gives an example of why LIST partitioning on a set
> of values that may grow significantly over time might be a bad idea.

Design advice like this is good.

> It talks about memory growth with more partitions and mentions that
> rel cache might become a problem even if queries are touching a small
> number of partitions per query, but a large number per session.

I wasn't sure at first if stuff like this should be mentioned in the
user-facing documentation, but your wording seems fine in general.

Thanks,
Amit


Reply | Threaded
Open this post in threaded view
|

Re: Should we warn against using too many partitions?

Alvaro Herrera-9
In reply to this post by David Rowley-3
On 2019-Jun-06, David Rowley wrote:

> The attached patch is aimed at master. PG11 will need the planner
> memory and performance part tweaked and for PG10 I'll do that plus
> remove the mention of PRIMARY KEY and UNIQUE constraints on the
> partitioned table.

I think in PG10 something should be mentioned about PK and UNIQUE, so
that people doing their partitioning on that release can think ahead.
We don't want them to have to redesign and redo the whole setup when
upgrading to a newer release.  If we had written the pg10 material back
when pg10 was fresh, it wouldn't make sense, but now that we know the
future, I don't see why we wouldn't do it.  Maybe something like "The
current version does not support <this>, but future Postgres versions
do; consult their manuals for some limitations that may affect the
choice of partitioning strategy".

In the PG10 version you'll need to elide the mention of HASH
partitioning strategy.

Generally speaking, your material looks good to me.  Also generally I +1
Justin's suggestions.  The part that mentions a "relation cache entry"
seems too low-level as-is, though ... maybe just say it uses some memory
per partition without being too specific.

I think it'd be worthwhile to mention sub-partitioning.


I wonder if the PG10 manual should just suggest to skip to PG11 if
they're setting up partitioning for the first time.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: Should we warn against using too many partitions?

David Rowley-3
In reply to this post by Justin Pryzby
On Thu, 6 Jun 2019 at 17:29, Justin Pryzby <[hidden email]> wrote:
> >+
> >+  <sect2 id="ddl-partitioning-declarative-best-practices">
> >+   <title>Declarative Partitioning Best Practices</title>
> >+
> >+   <para>
> >+    The choice of how to partition a table should be considered carefully as
>
> Either say "How to partition consider should be .." or "The choice should MADE carefully" ?

I've changed "considered" to "made". I'm unable to make sense of the
first suggestion there :(

> >+   <para>
> >+    One of the most critical design decisions will be the column or columns
> >+    which you partition your data by.  Often the best choice will be to
>
> by which ?

okay. I've moved the "by" from after "data" to before "which"

> >+   <para>
> >+    Choosing the number of partitions to divide the table into is also a
>
> the TARGET number of partitions BY WHICH to divide the table ?

Changed.

> >+    critical decision to make.  Not having enough partitions may mean that
> >+    indexes remain too large and that data locality remains poor which could
> >+    result in poor cache hit ratios.  However, dividing the table into too
> >+    many partitions can also cause issues.  Too many partitions can mean
> >+    slower query planning times and higher memory consumption during both
> >+    query planning and execution.  It's also important to consider what
> >+    changes may occur in the future when choosing how to partition your table.
> >+    For example, if you choose to have one partition per customer and you
> >+    currently have a small number of large customers, what will the
>
> have ONLY ?

I assume you mean after the "have" before "one partition per
customer"?  I don't quite understand that since in the scenario we're
partitioning by customer, so it's not possible to have more than one
partition per customer, only the reverse is possible. It seems to me
injecting "only" there would just confuse things.

> >+    implications be if in several years you obtain a large number of small
> >+    customers.  In this case, it may be better to choose to partition by
> >+    <literal>HASH</literal> and choose a reasonable amount of partitions
>
> reasonable NUMBER ?

changed.

> >+   <para>
> >+    It is also important to consider the overhead of partitioning during
> >+    query planning and execution.  The query planner is generally able to
> >+    handle partition hierarchies up a few thousand partitions fairly well,
> >+    providing that the vast majority of them can be pruned during query
>
> provided ?
>
> I would say: "provided that typical queries prune all but a small number of
> partitions during planning time".

changed, only I used "during query planning" rather than "during planning time".

> >+    <command>DELETE</command> commands.  Also, even if most queries are
> >+    able to prune a high number of partitions during query planning, it still
>
> LARGE number?

changed

> >+    may be undesirable to have a large number of partitions as each partition
>
> may still ?
>
> >+    also will obtain a relation cache entry in each session which uses the
>
> will require ?  Or occupy ?

"require" seems better. Although, this may need to be reworded a bit
further per what Alvaro mentions.

> >+   <para>
> >+    With data warehouse type workloads it can make sense to use a larger
> >+    number of partitions than with an OLTP type workload.  Generally, in data
> >+    warehouses, query planning time is less of a concern as the majority of
> >+    processing time is generally spent during query execution.  With either of
>
> remove the 2nd "generally"

Oops. I should have caught that.

> >+    these two types of workload, it is important to make the right decisions
> >+    early as re-partitioning large quantities of data can be painstakingly
>
> early COMMA ?

removed

> PAINFULLY slow

yeah

> >+    When performance is critical, performing workload simulations to
> >+    assist in making the correct decisions can be beneficial.
>
> I would say:
> Simulations of the intended workload are beneficial for optimizing partitioning
> strategy.

I took that but added "often" before "beneficial"

I'll write the patches for PG10 and PG11 and send them all a bit later.

Thanks for the review.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: Should we warn against using too many partitions?

Justin Pryzby
On Fri, Jun 07, 2019 at 06:46:59AM +1200, David Rowley wrote:

> On Thu, 6 Jun 2019 at 17:29, Justin Pryzby <[hidden email]> wrote:
> > >+
> > >+  <sect2 id="ddl-partitioning-declarative-best-practices">
> > >+   <title>Declarative Partitioning Best Practices</title>
> > >+
> > >+   <para>
> > >+    The choice of how to partition a table should be considered carefully as
> >
> > Either say "How to partition consider should be .." or "The choice should MADE carefully" ?
>
> I've changed "considered" to "made". I'm unable to make sense of the
> first suggestion there :(

The first option was intended to be:
|How to partition a table should be considered carefully.

(The idea being that the "choice" doesn't need to be considered carefully but
the thing itself).

> > >+    critical decision to make.  Not having enough partitions may mean that
> > >+    indexes remain too large and that data locality remains poor which could
> > >+    result in poor cache hit ratios.  However, dividing the table into too
> > >+    many partitions can also cause issues.  Too many partitions can mean
> > >+    slower query planning times and higher memory consumption during both
> > >+    query planning and execution.  It's also important to consider what
> > >+    changes may occur in the future when choosing how to partition your table.
> > >+    For example, if you choose to have one partition per customer and you
> > >+    currently have a small number of large customers, what will the
> >
> > have ONLY ?
>
> I assume you mean after the "have" before "one partition per
> customer"?

No, I meant "currently have ONLY".

> I don't quite understand that since in the scenario we're
> partitioning by customer, so it's not possible to have more than one
> partition per customer, only the reverse is possible. It seems to me
> injecting "only" there would just confuse things.

Thanks,
Justin


Reply | Threaded
Open this post in threaded view
|

Re: Should we warn against using too many partitions?

David Rowley-3
In reply to this post by Alvaro Herrera-9
On Fri, 7 Jun 2019 at 03:12, Alvaro Herrera <[hidden email]> wrote:
> I think in PG10 something should be mentioned about PK and UNIQUE, so
> that people doing their partitioning on that release can think ahead.

That seems reasonable, but I feel caution would be required as we
don't want to provide any details about what a future version will
support, such information might not age very well. We could say that
future versions of PostgreSQL support PRIMARY KEY and UNIQUE
constraints, but we'll be unable to detail out that these must be a
super-set of the partition columns as if we get global indexes one day
that will no longer be a restriction. I'll have a think about it and
post a PG10 patch later.

> We don't want them to have to redesign and redo the whole setup when
> upgrading to a newer release.  If we had written the pg10 material back
> when pg10 was fresh, it wouldn't make sense, but now that we know the
> future, I don't see why we wouldn't do it.  Maybe something like "The
> current version does not support <this>, but future Postgres versions
> do; consult their manuals for some limitations that may affect the
> choice of partitioning strategy".

> In the PG10 version you'll need to elide the mention of HASH
> partitioning strategy.

Good point. I might need to rethink that example completely as I'm not
sure if swapping HASH for RANGE is such a great fix.

> Generally speaking, your material looks good to me.  Also generally I +1
> Justin's suggestions.  The part that mentions a "relation cache entry"
> seems too low-level as-is, though ... maybe just say it uses some memory
> per partition without being too specific.

Yeah, I wondered about that. I did grep the docs for "relation cache"
and saw two other mentions, that's why I ended up going with it, but I
do agree that it may be a problem since there's nothing in the docs
that explain what that actually means.

> I think it'd be worthwhile to mention sub-partitioning.

I'll try to come up with something for that.

> I wonder if the PG10 manual should just suggest to skip to PG11 if
> they're setting up partitioning for the first time.

I don't think so. I mean, if they just happened to have just installed
PG10 that might be okay, but they may already be heavily invested in
that version already. Suggesting an upgrade may not be a well-received
recommendation for some. Maybe a suggestion that significant
improvements have been made in later versions might be enough, but I'm
a bit on the fence about that.

Thanks for having a look. I'll post PG10 and 11 patches later.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: Should we warn against using too many partitions?

David Rowley-3
In reply to this post by Justin Pryzby
On Fri, 7 Jun 2019 at 06:54, Justin Pryzby <[hidden email]> wrote:

> > > >+    critical decision to make.  Not having enough partitions may mean that
> > > >+    indexes remain too large and that data locality remains poor which could
> > > >+    result in poor cache hit ratios.  However, dividing the table into too
> > > >+    many partitions can also cause issues.  Too many partitions can mean
> > > >+    slower query planning times and higher memory consumption during both
> > > >+    query planning and execution.  It's also important to consider what
> > > >+    changes may occur in the future when choosing how to partition your table.
> > > >+    For example, if you choose to have one partition per customer and you
> > > >+    currently have a small number of large customers, what will the
> > >
> > > have ONLY ?
> >
> > I assume you mean after the "have" before "one partition per
> > customer"?
>
> No, I meant "currently have ONLY".

I see, thanks for explaining. I've left that one out as I think adding
"only" would imply that having a small number of large customers is
less significant that a large number of small customers. I don't
really see why either of those has significance over the other, so I
think "only" is out of place there.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: Should we warn against using too many partitions?

David Rowley-3
In reply to this post by Alvaro Herrera-9
On Fri, 7 Jun 2019 at 03:12, Alvaro Herrera <[hidden email]> wrote:
> I think it'd be worthwhile to mention sub-partitioning.

In the attached I did briefly mention about sub-partitioning, however,
I didn't feel I had any very wise words to write about it other than
it can be useful to split up larger partitions.

I rather cheaply did the PG10 ones and just removed the mention about
PRIMARY KEYS and UNIQUE constraints. I also mention that PG11 is able
to handle "a few hundred partitions fairly well", and for PG10 I just
wrote that it's able to handle "a few hundred partitions" without the
"fairly well" part. master gets "a few thousand partitions fairly
well".

I also swapped out HASH for RANGE in the PG10 version which is not
quite perfect since its likely a customer ID would be a serial and
would fill the partitions one-by-one rather than more evenly as HASH
partitioning would.

Anyway comments welcome.  If I had a few more minutes to spare I'd
have wrapped OLTP in <acronym> tags, but out of time for now.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

part_doc_master.patch (7K) Download Attachment
part_doc_pg11.patch (6K) Download Attachment
part_doc_pg10.patch (6K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Should we warn against using too many partitions?

Amit Langote
Hi,

Thanks for the updated patches.

On Fri, Jun 7, 2019 at 2:34 PM David Rowley
<[hidden email]> wrote:
> Anyway comments welcome.  If I had a few more minutes to spare I'd
> have wrapped OLTP in <acronym> tags, but out of time for now.

Some rewording suggestions.

1.

+    ...    Removal of unwanted data is also a factor to consider when
+    planning your partitioning strategy as an entire partition can be removed
+    fairly quickly.  However, if data that you want to keep exists in that
+    partition then that means having to resort to using
+    <command>DELETE</command> instead of removing the partition.

Not sure if the 2nd sentence is necessary or perhaps should be
rewritten in a way that helps to design to benefit from this.

Maybe:

...    Removal of unwanted data is also a factor to consider when
planning your partitioning strategy as an entire partition can be
removed fairly quickly, especially if the partition keys are chosen
such that all data that can be deleted together are grouped into
separate partitions.

2.

+    ... For example, if you choose to have one partition
+    per customer and you currently have a small number of large customers,
+    what will the implications be if in several years you obtain a large
+    number of small customers.

The sentence could be rewritten a bit.  Maybe as:

... For example, choosing a design with one partition per customer,
because you currently have a small number of large customers, will not
scale well several years down the line when you might have a large
number of small customers.

Btw, doesn't it suffice here to say "large number of customers"
instead of "large number of small customers"?

3.

+    ... In this case, it may be better to choose to
+    partition by <literal>RANGE</literal> and choose a reasonable number of
+    partitions

Maybe:

... and choose reasonable number of partitions, each containing the
data of a fixed number of customers.

4.

+    ...  It also
+    may be undesirable to have a large number of partitions as each partition
+    requires metadata about the partition to be stored in each session that
+    touches it.  If each session touches a large number of partitions over a
+    period of time then the memory consumption for this may become
+    significant.

It might be a good idea to reorder the sentences here to put the
problem first and the cause later.  Maybe like this:

Another reason to be concerned about having a large number of
partitions is that the server's memory consumption may grow
significantly over a period of time, especially if many sessions touch
large numbers of partitions.  That's because each partition requires
its own metadata that must be loaded into the local memory of each
session that touches it.

5.

+    With data warehouse type workloads it can make sense to use a larger
+    number of partitions than with an OLTP type workload.

Is there a comma missing between "With data warehouse type workloads"
and the rest of the sentence?

Thanks,
Amit


Reply | Threaded
Open this post in threaded view
|

Re: Should we warn against using too many partitions?

Justin Pryzby
In reply to this post by David Rowley-3
I made another pass, hopefully it's useful and not too much of a pain.

diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index cce1618fc1..be2ca3be48 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -4674,6 +4675,88 @@ EXPLAIN SELECT count(*) FROM measurement WHERE logdate &gt;= DATE '2008-01-01';
    </itemizedlist>
    </para>
   </sect2>
+  
+  <sect2 id="ddl-partitioning-declarative-best-practices">
+   <title>Declarative Partitioning Best Practices</title>
+
+   <para>
+    The choice of how to partition a table should be made carefully as the
+    performance of query planning and execution can be negatively affected by
+    poorly made design decisions.

Maybe just "poor design"

+    partitioned table.  <literal>WHERE</literal> clause items that match and
+    are compatible with the partition key can be used to prune away unneeded

remove "away" ?

+    requirements for the <literal>PRIMARY KEY</literal> or a
+    <literal>UNIQUE</literal> constraint.  Removal of unwanted data is also
+    a factor to consider when planning your partitioning strategy as an entire
+    partition can be removed fairly quickly.  However, if data that you want

Can we just say "dropped" ?  On my first (re)reading, I briefly thought this
was now referring to "pruning" as "removal".

+    to keep exists in that partition then that means having to resort to using
+    <command>DELETE</command> instead of removing the partition.
+   </para>
+
+   <para>
+    Choosing the target number of partitions by which the table should be
+    divided into is also a critical decision to make.  Not having enough

Should be: ".. target number .. into which .. should be divided .."

+    partitions may mean that indexes remain too large and that data locality
+    remains poor which could result in poor cache hit ratios.  However,

Change the 2nd remains to "is" and the second poor to "low" ?

+    dividing the table into too many partitions can also cause issues.
+    Too many partitions can mean slower query planning times and higher memory

s/slower/longer/

+    consumption during both query planning and execution.  It's also important
+    to consider what changes may occur in the future when choosing how to
+    partition your table.  For example, if you choose to have one partition

Remove "when choosing ..."?  Or say:

|When choosing how to partition your table, it's also important to consider
|what changes may occur in the future.

+    per customer and you currently have a small number of large customers,
+    what will the implications be if in several years you obtain a large
+    number of small customers.  In this case, it may be better to choose to
+    partition by <literal>HASH</literal> and choose a reasonable number of
+    partitions rather than trying to partition by <literal>LIST</literal> and
+    hoping that the number of customers does not increase significantly over
+    time.
+   </para>

It's an unusual thing for which to hope :)

+   <para>
+    Sub-partitioning can be useful to further divide partitions that are
+    expected to become larger than other partitions, although excessive
+    sub-partitioning can easily lead to large numbers of partitions and can
+    cause the problems mentioned in the preceding paragraph.
+   </para>

cause the SAME problems ?

+    It is also important to consider the overhead of partitioning during
+    query planning and execution.  The query planner is generally able to
+    handle partition hierarchies up a few thousand partitions fairly well,
+    provided that typical queries prune all but a small number of partitions
+    during query planning.  Planning times become slower and memory

s/slower/longer/

Hm, maybe say "typical queries ALLOW PRUNNING .."

+    consumption becomes higher when more partitions remain after the planner
+    performs partition pruning.  This is particularly true for the

Just say: "remain after planning" ?

+    <command>UPDATE</command> and <command>DELETE</command> commands.  Also,
+    even if most queries are able to prune a large number of partitions during
+    query planning, it still may be undesirable to have a large number of

may still ?

+    partitions as each partition requires metadata about the partition to be
+    stored in each session that touches it.  If each session touches a large

stored for ?

+    number of partitions over a period of time then the memory consumption for
+    this may become significant.
+   </para>

Remove "over a period of time" ?
Add a comma?

Maybe say:

|If each session touches a large number of partitions, then the memory
|overhead may become significant.

+   <para>
+    With data warehouse type workloads it can make sense to use a larger
+    number of partitions than with an OLTP type workload.  Generally, in data
+    warehouses, query planning time is less of a concern as the majority of

VAST majority?  Or "essentially all"?  Or " .. query planning time is
insignificant compared to the time spent during query execution.

+    processing time is spent during query execution.  With either of these two
+    types of workload it is important to make the right decisions early as

early COMMA

Justin


Reply | Threaded
Open this post in threaded view
|

Re: Should we warn against using too many partitions?

David Rowley-3
In reply to this post by Amit Langote
Thanks for these suggestions.

On Fri, 7 Jun 2019 at 19:00, Amit Langote <[hidden email]> wrote:

> Some rewording suggestions.
>
> 1.
>
> +    ...    Removal of unwanted data is also a factor to consider when
> +    planning your partitioning strategy as an entire partition can be removed
> +    fairly quickly.  However, if data that you want to keep exists in that
> +    partition then that means having to resort to using
> +    <command>DELETE</command> instead of removing the partition.
>
> Not sure if the 2nd sentence is necessary or perhaps should be
> rewritten in a way that helps to design to benefit from this.
>
> Maybe:
>
> ...    Removal of unwanted data is also a factor to consider when
> planning your partitioning strategy as an entire partition can be
> removed fairly quickly, especially if the partition keys are chosen
> such that all data that can be deleted together are grouped into
> separate partitions.

It seems like a good idea to change this to have this mention the
benefits rather than the drawbacks. I've reworded it, but not using
your exact words as it seems the "especially" means that a partition
can be removed faster with properly chosen partition keys, which is
not the case.

I also split this out into its own paragraph since it's talking about
something quite different from the previous paragraph.

> 2.
>
> +    ... For example, if you choose to have one partition
> +    per customer and you currently have a small number of large customers,
> +    what will the implications be if in several years you obtain a large
> +    number of small customers.
>
> The sentence could be rewritten a bit.  Maybe as:
>
> ... For example, choosing a design with one partition per customer,
> because you currently have a small number of large customers, will not
> scale well several years down the line when you might have a large
> number of small customers.
>
> Btw, doesn't it suffice here to say "large number of customers"
> instead of "large number of small customers"?

I'm not really trying to imply to plan for business growth here, I'm
trying to angle it as "what if your business changes".  I've reworded
this slightly and it now says "what will the implications be if in
several years you instead find yourself with a large number of small
customers."

> 3.
>
> +    ... In this case, it may be better to choose to
> +    partition by <literal>RANGE</literal> and choose a reasonable number of
> +    partitions
>
> Maybe:
>
> ... and choose reasonable number of partitions, each containing the
> data of a fixed number of customers.

Yeah, that seems better. I'll change that for the PG10 version only.

> 4.
>
> +    ...  It also
> +    may be undesirable to have a large number of partitions as each partition
> +    requires metadata about the partition to be stored in each session that
> +    touches it.  If each session touches a large number of partitions over a
> +    period of time then the memory consumption for this may become
> +    significant.
>
> It might be a good idea to reorder the sentences here to put the
> problem first and the cause later.  Maybe like this:
>
> Another reason to be concerned about having a large number of
> partitions is that the server's memory consumption may grow
> significantly over a period of time, especially if many sessions touch
> large numbers of partitions.  That's because each partition requires
> its own metadata that must be loaded into the local memory of each
> session that touches it.

That seems better. I've taken that text.

> 5.
>
> +    With data warehouse type workloads it can make sense to use a larger
> +    number of partitions than with an OLTP type workload.
>
> Is there a comma missing between "With data warehouse type workloads"
> and the rest of the sentence?

I've added one.

Patches will follow once I've addressed Justin's review.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: Should we warn against using too many partitions?

David Rowley-3
In reply to this post by Justin Pryzby
Thanks for having another look.

On Sat, 8 Jun 2019 at 18:39, Justin Pryzby <[hidden email]> wrote:
> +   <para>
> +    The choice of how to partition a table should be made carefully as the
> +    performance of query planning and execution can be negatively affected by
> +    poorly made design decisions.
>
> Maybe just "poor design"

changed

> +    partitioned table.  <literal>WHERE</literal> clause items that match and
> +    are compatible with the partition key can be used to prune away unneeded
>
> remove "away" ?

removed

> +    requirements for the <literal>PRIMARY KEY</literal> or a
> +    <literal>UNIQUE</literal> constraint.  Removal of unwanted data is also
> +    a factor to consider when planning your partitioning strategy as an entire
> +    partition can be removed fairly quickly.  However, if data that you want
>
> Can we just say "dropped" ?  On my first (re)reading, I briefly thought this
> was now referring to "pruning" as "removal".

I used removed because that could be done via DROP TABLE or by DETACH
PARTITION. If I change it to "dropped" then it sounds like we might
only mean DROP TABLE.  I've reworded to use "detached" instead.

> +    to keep exists in that partition then that means having to resort to using
> +    <command>DELETE</command> instead of removing the partition.
> +   </para>
> +
> +   <para>
> +    Choosing the target number of partitions by which the table should be
> +    divided into is also a critical decision to make.  Not having enough
>
> Should be: ".. target number .. into which .. should be divided .."

I've changed "by" to "into". I think that's what you mean, otherwise,
you've lost me.

> +    partitions may mean that indexes remain too large and that data locality
> +    remains poor which could result in poor cache hit ratios.  However,
>
> Change the 2nd remains to "is" and the second poor to "low" ?

An internet search on "low cache hit ratio" turns up about twice as
many results as "poor cache hit ratio", but both seem fine to me.
However, since the search seems to show more for the former, I change
it to that.

> +    dividing the table into too many partitions can also cause issues.
> +    Too many partitions can mean slower query planning times and higher memory
>
> s/slower/longer/

changed

> +    consumption during both query planning and execution.  It's also important
> +    to consider what changes may occur in the future when choosing how to
> +    partition your table.  For example, if you choose to have one partition
>
> Remove "when choosing ..."?  Or say:

I don't see how that would make sense.

> |When choosing how to partition your table, it's also important to consider
> |what changes may occur in the future.

Changed to that.

> +    per customer and you currently have a small number of large customers,
> +    what will the implications be if in several years you obtain a large
> +    number of small customers.  In this case, it may be better to choose to
> +    partition by <literal>HASH</literal> and choose a reasonable number of
> +    partitions rather than trying to partition by <literal>LIST</literal> and
> +    hoping that the number of customers does not increase significantly over
> +    time.
> +   </para>
>
> It's an unusual thing for which to hope :)
I have reworded this slightly which may help with that.

> +   <para>
> +    Sub-partitioning can be useful to further divide partitions that are
> +    expected to become larger than other partitions, although excessive
> +    sub-partitioning can easily lead to large numbers of partitions and can
> +    cause the problems mentioned in the preceding paragraph.
> +   </para>
>
> cause the SAME problems ?

Added

> +    It is also important to consider the overhead of partitioning during
> +    query planning and execution.  The query planner is generally able to
> +    handle partition hierarchies up a few thousand partitions fairly well,
> +    provided that typical queries prune all but a small number of partitions
> +    during query planning.  Planning times become slower and memory
>
> s/slower/longer/

Changed

> Hm, maybe say "typical queries ALLOW PRUNNING .."
>
> +    consumption becomes higher when more partitions remain after the planner
> +    performs partition pruning.  This is particularly true for the
>
> Just say: "remain after planning" ?

I've changed this around, but not really how you've asked.

> +    <command>UPDATE</command> and <command>DELETE</command> commands.  Also,
> +    even if most queries are able to prune a large number of partitions during
> +    query planning, it still may be undesirable to have a large number of
>
> may still ?

This has been rewritten per Amit's review.

> +   <para>
> +    With data warehouse type workloads it can make sense to use a larger
> +    number of partitions than with an OLTP type workload.  Generally, in data
> +    warehouses, query planning time is less of a concern as the majority of
>
> VAST majority?  Or "essentially all"?  Or " .. query planning time is
> insignificant compared to the time spent during query execution.

I don't see any benefit in raising the significance of that.

> +    processing time is spent during query execution.  With either of these two
> +    types of workload it is important to make the right decisions early as
>
> early COMMA

I'm not really sure what you mean here as I don't see any comma in
that text. I guess you want me to add one? But I'm confused as you
seemed to ask me to remove a comma there in your previous review.

You wrote:
>>+    these two types of workload, it is important to make the right decisions
>>+    early as re-partitioning large quantities of data can be painstakingly

> early COMMA ?

Can you be more precise to the exact problem that you see with the
text? In the meantime, I've put the comma back where it was in the
original patch.

I've attached the updated patches.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

part_doc_pg10_v2.patch (6K) Download Attachment
part_doc_pg11_v2.patch (7K) Download Attachment
part_doc_master_v2.patch (7K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Should we warn against using too many partitions?

Justin Pryzby
On Sun, Jun 09, 2019 at 01:15:09PM +1200, David Rowley wrote:

> Thanks for having another look.
>
> On Sat, 8 Jun 2019 at 18:39, Justin Pryzby <[hidden email]> wrote:
> > +    to keep exists in that partition then that means having to resort to using
> > +    <command>DELETE</command> instead of removing the partition.
> > +   </para>
> > +
> > +   <para>
> > +    Choosing the target number of partitions by which the table should be
> > +    divided into is also a critical decision to make.  Not having enough
> >
> > Should be: ".. target number .. into which .. should be divided .."
>
> I've changed "by" to "into". I think that's what you mean, otherwise,
> you've lost me.

I meant it should say "into which it should be divided" and not "by which it
should be divided INTO", which has too many prepositions.  This is still an
issue:

+    Choosing the target number of partitions into which the table should be
+    divided into is also a critical decision to make.  Not having enough

> > +    partitions may mean that indexes remain too large and that data locality
> > +    remains poor which could result in poor cache hit ratios.  However,
> >
> > Change the 2nd remains to "is" and the second poor to "low" ?

> > +    consumption during both query planning and execution.  It's also important
> > +    to consider what changes may occur in the future when choosing how to
> > +    partition your table.  For example, if you choose to have one partition
> >
> > Remove "when choosing ..."?  Or say:
>
> I don't see how that would make sense.

I suggested it because otherwise it can read as: "in the future when choosing ...".

> > +    per customer and you currently have a small number of large customers,
> > +    what will the implications be if in several years you obtain a large
> > +    number of small customers.  In this case, it may be better to choose to
> > +    partition by <literal>HASH</literal> and choose a reasonable number of
> > +    partitions rather than trying to partition by <literal>LIST</literal> and
> > +    hoping that the number of customers does not increase significantly over
> > +    time.
> > +   </para>
> >
> > It's an unusual thing for which to hope :)
>
> I have reworded this slightly which may help with that.

I didn't mean there was any issue with this, just that it's amusing to find
oneself in the unfortunate position of hoping that one's company doesn't end up
with many customers.

> > +    processing time is spent during query execution.  With either of these two
> > +    types of workload it is important to make the right decisions early as
> >
> > early COMMA
>
> I'm not really sure what you mean here as I don't see any comma in
> that text. I guess you want me to add one? But I'm confused as you
> seemed to ask me to remove a comma there in your previous review.

I meant to add one then and now, like:

|    these two types of workload, it is important to make the right decisions
|    early, as re-partitioning large quantities of data can be ...

Thanks,
Justin


Reply | Threaded
Open this post in threaded view
|

Re: Should we warn against using too many partitions?

David Rowley-3
On Sun, 9 Jun 2019 at 16:21, Justin Pryzby <[hidden email]> wrote:
> I meant it should say "into which it should be divided" and not "by which it
> should be divided INTO", which has too many prepositions.  This is still an
> issue:

It now reads "divided by" instead of "divided into".

> |    these two types of workload, it is important to make the right decisions
> |    early, as re-partitioning large quantities of data can be ...

I've added a comma after "early".

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

part_doc_master_v3.patch (7K) Download Attachment
part_doc_pg11_v3.patch (7K) Download Attachment
part_doc_pg10_v3.patch (6K) Download Attachment
12