When to use PARTITION BY HASH?


When to use PARTITION BY HASH?

Oleksandr Shulgin
Hi!

I was reading up on declarative partitioning[1] and I'm not sure what could be a possible application of Hash partitioning.

Is anyone actually using it?  What are typical use cases?  What benefits does such a partitioning scheme provide?

On its face, it seems that it can only give you a number of tables which are smaller than the un-partitioned one, but I fail to see how it would provide any of the potential advantages listed in the documentation.

With a reasonable hash function, the distribution of rows across partitions should be more or less equal, so I wouldn't expect any of the following to hold true:
- "...most of the heavily accessed rows of the table are in a single partition or a small number of partitions."
- "Bulk loads and deletes can be accomplished by adding or removing partitions...",
etc.

That *might* turn out to be the case with a small number of distinct values in the partitioning column(s), but then why rely on hash assignment instead of using PARTITION BY LIST in the first place?

Regards,
--
Alex

[1] https://www.postgresql.org/docs/12/ddl-partitioning.html


Re: When to use PARTITION BY HASH?

Justin Pryzby
> To: [hidden email], [hidden email]

Please don't cross post to multiple lists.

On Tue, Jun 02, 2020 at 07:17:11PM +0200, Oleksandr Shulgin wrote:
> I was reading up on declarative partitioning[1] and I'm not sure what could
> be a possible application of Hash partitioning.

It's a good question.  See Tom's complaint here:
https://www.postgresql.org/message-id/31605.1586112900%40sss.pgh.pa.us

It *does* provide the benefit of smaller indexes and smaller tables, which
might allow seq scans to outperform index scans.

It's maybe only useful for equality conditions on the partition key, and not
for ranges.  Here, it scans a single partition:

postgres=# CREATE TABLE t (i int) PARTITION BY HASH (i);
postgres=# CREATE TABLE t1 PARTITION OF t FOR VALUES WITH (MODULUS 3, REMAINDER 0);
postgres=# CREATE TABLE t2 PARTITION OF t FOR VALUES WITH (MODULUS 3, REMAINDER 1);
postgres=# CREATE TABLE t3 PARTITION OF t FOR VALUES WITH (MODULUS 3, REMAINDER 2);
postgres=# INSERT INTO t SELECT i%9 FROM generate_series(1,9999) i;
postgres=# ANALYZE t;
postgres=# explain analyze SELECT * FROM t WHERE i=3;
 Seq Scan on t2  (cost=0.00..75.55 rows=2222 width=4) (actual time=0.021..0.518 rows=2222 loops=1)
   Filter: (i = 3)
   Rows Removed by Filter: 2222
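
For contrast, a range predicate can't be mapped through the hash function, so the planner has nothing to prune on.  A sketch of what to expect against the same table (plan shape indicated in comments, not actual output):

-- Hash pruning only works for equality on the partition key; a range
-- predicate has to scan every partition:
EXPLAIN SELECT * FROM t WHERE i BETWEEN 3 AND 5;
-- Expected plan shape:
--   Append
--     ->  Seq Scan on t1  Filter: ((i >= 3) AND (i <= 5))
--     ->  Seq Scan on t2  Filter: ((i >= 3) AND (i <= 5))
--     ->  Seq Scan on t3  Filter: ((i >= 3) AND (i <= 5))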

--
Justin



Re: When to use PARTITION BY HASH?

Michaeldba@sqlexec.com
In reply to this post by Oleksandr Shulgin
Hi,

I use it quite often, since I'm dealing with partitioning keys that have
high cardinality, i.e., a large number of distinct values.  If your
cardinality is very high but the spacing between values is not uniform,
HASH will balance your partitioned tables naturally.  If the spacing
between values is consistent, perhaps RANGE partitioning would be
better.
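
A sketch of that situation (table and column names are made up for illustration): the keys are unique and unevenly spaced, so there are no natural range boundaries, but hashing spreads them roughly evenly:

-- High-cardinality, unevenly spaced key: no obvious range boundaries,
-- so hash partitioning is used to balance the partitions.
CREATE TABLE events (
    account_id bigint NOT NULL,
    created_at timestamptz NOT NULL,
    payload    text
) PARTITION BY HASH (account_id);

CREATE TABLE events_p0 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE events_p1 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE events_p2 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE events_p3 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 3);

-- Row counts per partition should come out roughly equal regardless of
-- how the account_id values are spaced:
SELECT tableoid::regclass, count(*) FROM events GROUP BY 1;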

Regards,
Michael Vitale





Re: When to use PARTITION BY HASH?

Oleksandr Shulgin
In reply to this post by Justin Pryzby
On Tue, Jun 2, 2020 at 7:33 PM Justin Pryzby <[hidden email]> wrote:
> It's maybe only useful for equality conditions on the partition key, and
> not for ranges.  Here, it scans a single partition: [...]

I see.  So it works with low cardinality in the partitioning column.  With high cardinality, an index scan on an unpartitioned table would be preferable, I guess.

The documentation page I've linked only contains examples of partitioning BY RANGE.  I believe it'd be helpful to extend it with some meaningful examples for LIST and HASH partitioning.
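
For the LIST case, a rough sketch of what such an example might look like (table and values invented for illustration):

-- LIST partitioning: each partition holds an explicit set of key values.
CREATE TABLE orders (
    country text NOT NULL,
    id      bigint,
    total   numeric
) PARTITION BY LIST (country);

CREATE TABLE orders_eu   PARTITION OF orders FOR VALUES IN ('DE', 'FR', 'NL');
CREATE TABLE orders_us   PARTITION OF orders FOR VALUES IN ('US');
CREATE TABLE orders_rest PARTITION OF orders DEFAULT;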

Regards,
-- 
Alex


Re: When to use PARTITION BY HASH?

Justin Pryzby
On Wed, Jun 03, 2020 at 09:45:48AM +0200, Oleksandr Shulgin wrote:
> I see.  So it works with low cardinality in the partitioning column.
> With high cardinality, an index scan on an unpartitioned table would be
> preferable, I guess.
>
> The documentation page I've linked only contains examples of
> partitioning BY RANGE.  I believe it'd be helpful to extend it with some
> meaningful examples for LIST and HASH partitioning.

I agree.  I think it would also be useful to mention the "benefits" which
aren't likely to apply to hash partitioning.

Would you want to propose an example to include?
Eventually it needs to be submitted as a patch to -hackers.

--
Justin



Re: When to use PARTITION BY HASH?

Imre Samu
In reply to this post by Oleksandr Shulgin
> "Bulk loads ...",

There is an interesting bulk-load benchmark on this:

"How Bulkload performance is affected by table partitioning in PostgreSQL" by Beena Emerson (EnterpriseDB, December 4, 2019).
SUMMARY: This article covers how benchmark tests can be used to demonstrate the effect of table partitioning on performance. Tests using range- and hash-partitioned tables are compared and the reasons for their different results are explained:
  1. Range partitions
  2. Hash partitions
  3. Combination graphs
  4. Explaining the behavior
  5. Conclusion

"For the hash-partitioned table, the first value is inserted in the first partition, the second number in the second partition and so on till all the partitions are reached before it loops back to the first partition again until all the data is exhausted. Thus it exhibits the worst-case scenario where the partition is repeatedly switched for every value inserted. As a result, the number of times the partition is switched in a range-partitioned table is equal to the number of partitions, while in a hash-partitioned table, the number of times the partition has switched is equal to the amount of data being inserted. This causes the massive difference in timing for the two partition types."
Regards,
 Imre




Re: When to use PARTITION BY HASH?

Michaeldba@sqlexec.com
The article Imre referenced assumes a worst-case scenario for bulk loading into hash-partitioned tables.  It assumes that the values being inserted are in strictly ascending or descending order with no gaps (like a sequence number incrementing by 1), thereby ensuring every partition is hit in order before the process repeats.  If the values being inserted are not strictly sequential with no gaps, then the performance is much better.  Obviously, how much of the tables and indexes is in memory has a lot to do with it as well.
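
One way to measure that difference (a sketch only, with no timings asserted; it assumes a Unix shell providing seq and shuf, and a role allowed to use COPY FROM PROGRAM):

-- Same values, loaded once in strictly sequential order and once shuffled;
-- compare the timings, e.g. with \timing enabled in psql.
CREATE TABLE load_test (id int) PARTITION BY HASH (id);
CREATE TABLE load_test_p0 PARTITION OF load_test FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE load_test_p1 PARTITION OF load_test FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE load_test_p2 PARTITION OF load_test FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE load_test_p3 PARTITION OF load_test FOR VALUES WITH (MODULUS 4, REMAINDER 3);

COPY load_test FROM PROGRAM 'seq 1 1000000';           -- strictly sequential, no gaps
TRUNCATE load_test;
COPY load_test FROM PROGRAM 'seq 1 1000000 | shuf';    -- same values, shuffled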

Regards,
Michael Vitale



Re: When to use PARTITION BY HASH?

David Rowley
On Sun, 7 Jun 2020 at 23:41, MichaelDBA <[hidden email]> wrote:
> The article Imre referenced assumes a worst-case scenario for bulk loading into hash-partitioned tables.  It assumes that the values being inserted are in strictly ascending or descending order with no gaps (like a sequence number incrementing by 1), thereby ensuring every partition is hit in order before the process repeats. [...]

In PostgreSQL 12, COPY was modified to support bulk-inserts for
partitioned tables. This did speed up many scenarios.  Internally, how
this works is that we maintain a series of multi-insert buffers, one
per partition. We generally only flush those buffers to the table when
the buffer for the partition fills.  However, there is a sort of
sanity limit [1] on the number of multi-insert buffers we maintain at
once, and currently that is 32.  Technically we could increase that
limit, but there would still need to be a limit.  Unfortunately, for
this particular case, since we're most likely touching between 199 and
799 other partitions before hitting the first one again, that means we
really don't get any multi-inserts, which is likely the reason why the
performance is worse for hash partitioning.

With PG12 and for this particular case, you're likely to see COPY
performance drop quite drastically when going from 32 to 33
partitions.  The code was designed more for hitting partitions
randomly than for the sort-of round-robin pattern we're likely to get
from hash partitioning on a serial column.
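
For example, a quick way to build a table just past that limit and compare COPY timings against a 32-partition twin (a sketch; table and partition names are arbitrary):

-- 33 hash partitions: one past the 32 multi-insert-buffer limit
-- mentioned above.  The DO block just saves typing 33 CREATE TABLEs.
CREATE TABLE t33 (i int) PARTITION BY HASH (i);

DO $$
BEGIN
    FOR r IN 0..32 LOOP
        EXECUTE format(
            'CREATE TABLE t33_p%s PARTITION OF t33
               FOR VALUES WITH (MODULUS 33, REMAINDER %s)', r, r);
    END LOOP;
END
$$;

COPY t33 FROM PROGRAM 'seq 1 1000000';  -- compare against an otherwise
                                        -- identical MODULUS 32 table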

David

[1] https://github.com/postgres/postgres/blob/master/src/backend/commands/copy.c#L2569



Re: When to use PARTITION BY HASH?

Michaeldba@sqlexec.com
Wow! That is good to know!

