Memory-Bounded Hash Aggregation

Memory-Bounded Hash Aggregation

Jeff Davis-8
This is for design review. I have a patch (WIP) for Approach 1, and if
this discussion starts to converge on that approach I will polish and
post it.

Let's start at the beginning: why do we have two strategies -- hash
and sort -- for aggregating data? The two are more similar than they
first appear. A partitioned hash strategy writes randomly among the
partitions, and later reads the partitions sequentially; a sort will
write sorted runs sequentially, but then read among the runs
randomly during the merge phase. A hash is a convenient small
representation of the data that is cheaper to operate on; sort uses
abbreviated keys for the same reason.

Hash offers:

* Data is aggregated on-the-fly, effectively "compressing" the amount
  of data that needs to go to disk. This is particularly important
  when the data contains skewed groups (see below).

* Can output some groups after the first pass of the input data even
  if other groups spilled.

* Some data types only support hashing; not sorting.

Sort+Group offers:

* Only one group is accumulating at once, so if the transition state
  grows (like with ARRAY_AGG), it minimizes the memory needed.

* The input may already happen to be sorted.

* Some data types only support sorting; not hashing.

Currently, Hash Aggregation is only chosen if the optimizer believes
that all the groups (and their transition states) fit in
memory. Unfortunately, if the optimizer is wrong (often the case if the
input is not a base table), the hash table will
keep growing beyond work_mem, potentially bringing the entire system
to OOM. This patch fixes that problem by extending the Hash
Aggregation strategy to spill to disk when needed.

Previous discussions:


https://www.postgresql.org/message-id/1407706010.6623.16.camel@jeff-desktop

https://www.postgresql.org/message-id/1419326161.24895.13.camel%40jeff-desktop

https://www.postgresql.org/message-id/87be3bd5-6b13-d76e-5618-6db0a4db584d%40iki.fi

A lot was discussed, which I will try to summarize and address here.

Digression: Skewed Groups:

Imagine the input tuples have the following grouping keys:

  0, 1, 0, 2, 0, 3, 0, 4, ..., 0, N-1, 0, N

Group 0 is a skew group because it consists of 50% of all tuples in
the table, whereas every other group has a single tuple. If the
algorithm is able to keep group 0 in memory the whole time until
finalized, that means that it doesn't have to spill any group-0
tuples. In this example, that would amount to a 50% savings, and is a
major advantage of Hash Aggregation versus Sort+Group.

High-level approaches:

1. When the in-memory hash table fills, keep existing entries in the
hash table, and spill the raw tuples for all new groups in a
partitioned fashion. When all input tuples are read, finalize groups
in memory and emit. Now that the in-memory hash table is cleared (and
memory context reset), process a spill file the same as the original
input, but this time with a fraction of the group cardinality.

2. When the in-memory hash table fills, partition the hash space, and
evict the groups from all partitions except one by writing out their
partial aggregate states to disk. Any input tuples belonging to an
evicted partition get spilled to disk. When the input is read
entirely, finalize the groups remaining in memory and emit. Now that
the in-memory hash table is cleared, process the next partition by
loading its partial states into the hash table, and then processing
its spilled tuples.

3. Use some kind of hybrid[1][2] of hashing and sorting.
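
To make Approach 1 concrete, here is a rough sketch of its control flow
in C-like pseudocode. None of these names (hashagg_approach1,
hash_table_lookup, spill_write, and so on) come from the actual patch;
this is only an illustration of the algorithm described above. Approach
2 would additionally write out serialized partial aggregate states for
the evicted partitions, and later reload them before replaying that
partition's spilled tuples.

    /* Sketch only -- hypothetical helpers, not the patch's real functions. */
    static void
    hashagg_approach1(TupleSource *input, int npartitions)
    {
        SpillFile  *parts[npartitions];     /* raw-tuple spill files, one per partition */
        Tuple      *tup;

        for (int i = 0; i < npartitions; i++)
            parts[i] = spill_create();

        while ((tup = read_tuple(input)) != NULL)
        {
            uint32      hash = hash_group_keys(tup);
            Group      *grp = hash_table_lookup(hash, tup);

            if (grp != NULL)
                advance_transition(grp, tup);       /* existing group stays in memory */
            else if (!hash_table_is_full())
                advance_transition(hash_table_insert(hash, tup), tup);
            else
                spill_write(parts[hash % npartitions], tup);    /* new group, no room */
        }

        finalize_and_emit_in_memory_groups();
        hash_table_reset();                 /* also resets the memory context */

        /*
         * Each spill file is then processed exactly like the original input,
         * but with roughly 1/npartitions of the remaining group cardinality.
         */
        for (int i = 0; i < npartitions; i++)
            hashagg_approach1(spill_as_tuple_source(parts[i]), npartitions);
    }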

Evaluation of approaches:

Approach 1 is a nice incremental improvement on today's code. The
final patch may be around 1KLOC. It's a single kind of on-disk data
(spilled tuples), and a single algorithm (hashing). It also handles
skewed groups well because the skewed groups are likely to be
encountered before the hash table fills up the first time, and
therefore will stay in memory.

Approach 2 is nice because it resembles the approach of Hash Join, and
it can determine whether a tuple should be spilled without a hash
lookup. Unfortunately, those upsides are fairly mild, and it has
significant downsides:

* It doesn't handle skew values well because it's likely to evict
  them.

* If we leave part of the hash table in memory, it's difficult to
  ensure that we will be able to actually use the space freed by
  eviction, because the freed memory may be fragmented. That could
  force us to evict the entire in-memory hash table as soon as we
  partition, reducing a lot of the benefit of hashing.

* It requires eviction for the algorithm to work. That may be
  necessary for handling cases like ARRAY_AGG (see below) anyway, but
  this approach constrains the specifics of eviction.

Approach 3 is interesting because it unifies the two approaches and
can get some of the benefits of both. It's only a single path, so it
avoids planner mistakes. I really like this idea and it's possible we
will end up with approach 3. However:

* It requires that all data types support sorting, or that we punt
  somehow.

* Right now we are in a weird state because hash aggregation cheats,
  so it's difficult to evaluate whether Approach 3 is moving us in the
  right direction because we have no other correct implementation to
  compare against. Even if Approach 3 is where we end up, it seems
  like we should fix hash aggregation as a stepping stone first.

* It means we have a hash table and sort running concurrently, each
  using memory. Andres said this might not be a problem[3], but I'm
  not convinced that the problem is zero. If you use small work_mem
  for the write phase of sorting, you'll end up with a lot of runs to
  merge later and that has some kind of cost.

* The simplicity might start to evaporate when we consider grouping
  sets and eviction strategy.

Main topics to consider:

ARRAY_AGG:

Some aggregates, like ARRAY_AGG, have a transition state that grows
proportionally with the group size. In other words, it is not a
summary like COUNT or AVG, it contains all of the input data in a new
form.

These aggregates are not a good candidate for hash aggregation. Hash
aggregation is about keeping many transition states running in
parallel, which is just a bad fit for large transition states. Sorting
is better because it advances one transition state at a time. We could:

* Let ARRAY_AGG continue to exceed work_mem like today.

* Block or pessimize use of hash aggregation for such aggregates.

* Evict groups from the hash table when it becomes too large. This
  requires the ability to serialize and deserialize transition states,
  and some approaches here might also need combine_func
  specified. These requirements seem reasonable, but we still need
  some answer of what to do for aggregates that grow like ARRAY_AGG
  but don't have the required serialfunc, deserialfunc, or
  combine_func.
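
To illustrate that third option, here is a hedged sketch of what
evicting and later re-combining transition states could look like. The
struct fields and helpers are hypothetical; only
serialfunc/deserialfunc/combine_func correspond to the aggregate
support functions named above.

    /* Sketch only: evict every group by serializing its transition state. */
    static void
    evict_all_groups(SpillFile *state_file)
    {
        HashIter    it;
        Group      *grp;

        hash_table_start_iterate(&it);
        while ((grp = hash_table_iterate(&it)) != NULL)
        {
            bytea      *state = serialize_transition_state(grp);   /* serialfunc */

            spill_write_state(state_file, grp->grouping_keys, state);
        }
        hash_table_reset();     /* frees all transition states at once */
    }

    /* Later, a serialized state read back for an existing group is
     * deserialized (deserialfunc) and merged (combine_func). */
    static void
    merge_spilled_state(Group *grp, bytea *serialized)
    {
        Datum       partial = deserialize_transition_state(serialized);

        grp->transvalue = combine_transition_states(grp->transvalue, partial);
    }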

GROUPING SETS:

With grouping sets, there are multiple hash tables and each hash table
has its own hash function, so that makes partitioning more
complex. In Approach 1, that means we need to either (a) not partition
the spilled tuples; or (b) have a different set of partitions for each
hash table and spill the same tuple multiple times. In Approach 2, we
would be required to partition each hash table separately and spill
tuples multiple times. In Approach 3 (depending on the exact approach
but taking a guess here) we would need to add a set of phases (one
extra phase for each hash table) for spilled tuples.
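
For example, option (b) under Approach 1 might look roughly like the
sketch below, with one hash table and one set of spill partitions per
grouping set (all names hypothetical):

    /* Sketch of option (b): a tuple is aggregated in memory for each
     * grouping set where its group is present (or fits), and spilled to
     * that set's own partitions otherwise -- so the same tuple can be
     * written to disk more than once. */
    static void
    process_tuple_grouping_sets(Tuple *tup, HashAggSet *sets, int nsets,
                                int npartitions)
    {
        for (int i = 0; i < nsets; i++)
        {
            uint32      hash = compute_set_hash(&sets[i], tup);  /* per-set hash function */
            Group      *grp = set_lookup(&sets[i], hash, tup);

            if (grp != NULL || !set_is_full(&sets[i]))
                set_advance_or_insert(&sets[i], hash, tup);
            else
                spill_write(sets[i].partitions[hash % npartitions], tup);
        }
    }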

MEMORY TRACKING:

I have a patch to track the total allocated memory by
incrementing/decrementing it when blocks are malloc'd/free'd. This
doesn't do bookkeeping for each chunk, only each block. Previously,
Robert Haas raised some concerns[4] about performance, which were
mitigated[5] but perhaps not entirely eliminated (but did become
elusive).
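
For illustration, block-level accounting can be as simple as the sketch
below; the type and field names are placeholders, not the actual
memory-context structures touched by the patch.

    #include <stdlib.h>

    /* Sketch: account for whole malloc'd blocks only.  Chunks carved out
     * of a block never touch the counter, so the bookkeeping cost is paid
     * once per block rather than once per palloc. */
    typedef struct AccountedContext
    {
        size_t      mem_allocated;      /* total bytes in live blocks */
    } AccountedContext;

    static void *
    block_alloc(AccountedContext *cxt, size_t blksize)
    {
        void       *block = malloc(blksize);

        if (block != NULL)
            cxt->mem_allocated += blksize;      /* one increment per block */
        return block;
    }

    static void
    block_free(AccountedContext *cxt, void *block, size_t blksize)
    {
        cxt->mem_allocated -= blksize;          /* one decrement per block */
        free(block);
    }

    /* The executor can then cheaply check "how much memory is this thing
     * using?" and, e.g., start spilling once mem_allocated exceeds work_mem. */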

The only alternative is estimation, which is ugly and seems like a bad
idea. Memory usage isn't just driven by inputs, it's also driven by
patterns of use. Misestimates in the planner are fine (within reason)
because we don't have any other choice, and a small-factor misestimate
might not change the plan anyway. But in the executor, a small-factor
misestimate seems like it's just not doing the job. If a user found
that hash aggregation was using 3X work_mem, and my only explanation
is "well, it's just an estimate", I would be pretty embarrassed and
the user would likely lose confidence in the feature. I don't mean
that we must track memory perfectly everywhere, but using an estimate
seems like a mediocre improvement of the current state.

We should proceed with memory context tracking and try to eliminate or
mitigate performance concerns. I would not like to make any herculean
effort as a part of the hash aggregation work though; I think it's
basically just something a memory manager in a database system should
have supported all along. I think we will find other uses for it as
time goes on. We have more and more things happening in the executor
and having a cheap way to check "how much memory is this thing using?"
seems very likely to be useful.

Other points:

* Someone brought up the idea of using logtapes.c instead of writing
  separate files for each partition. That seems reasonable, but it's
  using logtapes.c slightly outside of its intended purpose. Also,
  it's awkward to need to specify the number of tapes up-front. Worth
  experimenting with to see if it's a win.

* Tomas did some experiments regarding the number of batches to choose
  and how to choose them. It seems like there's room for improvement
  over the simple calculation I'm doing now.

* A lot of discussion about a smart eviction strategy. I don't see
  strong evidence that it's worth the complexity at this time. The
  smarter we try to be, the more bookkeeping and memory fragmentation
  problems we will have. If we evict something, we should probably
  evict the whole hash table or some large part of it.

Regards,
        Jeff Davis

[1]
https://postgr.es/m/20180604185205.epue25jzpavokupf%40alap3.anarazel.de
[2]
https://postgr.es/m/message-id/CAGTBQpa__-NP7%3DkKwze_enkqw18vodRxKkOmNhxAPzqkruc-8g%40mail.gmail.com
[3]
https://www.postgresql.org/message-id/20180605175209.vavuqe4idovcpeie%40alap3.anarazel.de
[4]
https://www.postgresql.org/message-id/CA%2BTgmobnu7XEn1gRdXnFo37P79bF%3DqLt46%3D37ajP3Cro9dBRaA%40mail.gmail.com
[5]
https://www.postgresql.org/message-id/1413422787.18615.18.camel%40jeff-desktop





Re: Memory-Bounded Hash Aggregation

Tomas Vondra-4
Hi Jeff,

On Mon, Jul 01, 2019 at 12:13:53PM -0700, Jeff Davis wrote:
>This is for design review. I have a patch (WIP) for Approach 1, and if
>this discussion starts to converge on that approach I will polish and
>post it.
>

Thanks for working on this.

>Let's start at the beginning: why do we have two strategies -- hash
>and sort -- for aggregating data? The two are more similar than they
>first appear. A partitioned hash strategy writes randomly among the
>partitions, and later reads the partitions sequentially; a sort will
>write sorted runs sequentially, but then read among the runs
>randomly during the merge phase. A hash is a convenient small
>representation of the data that is cheaper to operate on; sort uses
>abbreviated keys for the same reason.
>

What does "partitioned hash strategy" do? It's probably explained in one
of the historical discussions, but I'm not sure which one. I assume it
simply hashes the group keys and uses that to partition the data, and then
passes it to hash aggregate.

>Hash offers:
>
>* Data is aggregated on-the-fly, effectively "compressing" the amount
>  of data that needs to go to disk. This is particularly important
>  when the data contains skewed groups (see below).
>
>* Can output some groups after the first pass of the input data even
>  if other groups spilled.
>
>* Some data types only support hashing; not sorting.
>
>Sort+Group offers:
>
>* Only one group is accumulating at once, so if the transition state
>  grows (like with ARRAY_AGG), it minimizes the memory needed.
>
>* The input may already happen to be sorted.
>
>* Some data types only support sorting; not hashing.
>
>Currently, Hash Aggregation is only chosen if the optimizer believes
>that all the groups (and their transition states) fit in
>memory. Unfortunately, if the optimizer is wrong (often the case if the
>input is not a base table), the hash table will
>keep growing beyond work_mem, potentially bringing the entire system
>to OOM. This patch fixes that problem by extending the Hash
>Aggregation strategy to spill to disk when needed.
>

OK, makes sense.

>Previous discussions:
>
>
>https://www.postgresql.org/message-id/1407706010.6623.16.camel@jeff-desktop
>
>https://www.postgresql.org/message-id/1419326161.24895.13.camel%40jeff-desktop
>
>https://www.postgresql.org/message-id/87be3bd5-6b13-d76e-5618-6db0a4db584d%40iki.fi
>
>A lot was discussed, which I will try to summarize and address here.
>
>Digression: Skewed Groups:
>
>Imagine the input tuples have the following grouping keys:
>
>  0, 1, 0, 2, 0, 3, 0, 4, ..., 0, N-1, 0, N
>
>Group 0 is a skew group because it consists of 50% of all tuples in
>the table, whereas every other group has a single tuple. If the
>algorithm is able to keep group 0 in memory the whole time until
>finalized, that means that it doesn't have to spill any group-0
>tuples. In this example, that would amount to a 50% savings, and is a
>major advantage of Hash Aggregation versus Sort+Group.
>

Right. I agree efficiently handling skew is important and may be crucial
for achieving good performance.

>High-level approaches:
>
>1. When the in-memory hash table fills, keep existing entries in the
>hash table, and spill the raw tuples for all new groups in a
>partitioned fashion. When all input tuples are read, finalize groups
>in memory and emit. Now that the in-memory hash table is cleared (and
>memory context reset), process a spill file the same as the original
>input, but this time with a fraction of the group cardinality.
>
>2. When the in-memory hash table fills, partition the hash space, and
>evict the groups from all partitions except one by writing out their
>partial aggregate states to disk. Any input tuples belonging to an
>evicted partition get spilled to disk. When the input is read
>entirely, finalize the groups remaining in memory and emit. Now that
>the in-memory hash table is cleared, process the next partition by
>loading its partial states into the hash table, and then processing
>its spilled tuples.
>
>3. Use some kind of hybrid[1][2] of hashing and sorting.
>

Unfortunately the second link does not work :-(

>Evaluation of approaches:
>
>Approach 1 is a nice incremental improvement on today's code. The
>final patch may be around 1KLOC. It's a single kind of on-disk data
>(spilled tuples), and a single algorithm (hashing). It also handles
>skewed groups well because the skewed groups are likely to be
>encountered before the hash table fills up the first time, and
>therefore will stay in memory.
>

I'm not going to block Approach 1, although I'd really like to see
something that helps with array_agg.

>Approach 2 is nice because it resembles the approach of Hash Join, and
>it can determine whether a tuple should be spilled without a hash
>lookup. Unfortunately, those upsides are fairly mild, and it has
>significant downsides:
>
>* It doesn't handle skew values well because it's likely to evict
>  them.
>
>* If we leave part of the hash table in memory, it's difficult to
>  ensure that we will be able to actually use the space freed by
>  eviction, because the freed memory may be fragmented. That could
>  force us to evict the entire in-memory hash table as soon as we
>  partition, reducing a lot of the benefit of hashing.
>

Yeah, and it may not work well with the memory accounting if we only track
the size of allocated blocks, not chunks (because pfree likely won't free
the blocks).

>* It requires eviction for the algorithm to work. That may be
>  necessary for handling cases like ARRAY_AGG (see below) anyway, but
>  this approach constrains the specifics of eviction.
>
>Approach 3 is interesting because it unifies the two approaches and
>can get some of the benefits of both. It's only a single path, so it
>avoids planner mistakes. I really like this idea and it's possible we
>will end up with approach 3. However:
>
>* It requires that all data types support sorting, or that we punt
>  somehow.
>
>* Right now we are in a weird state because hash aggregation cheats,
>  so it's difficult to evaluate whether Approach 3 is moving us in the
>  right direction because we have no other correct implementation to
>  compare against. Even if Approach 3 is where we end up, it seems
>  like we should fix hash aggregation as a stepping stone first.
>

Aren't all three approaches a way to "fix" hash aggregate? In any case,
it's certainly reasonable to make incremental changes. The question is
whether "approach 1" is sensible step towards some form of "approach 3"


>* It means we have a hash table and sort running concurrently, each
>  using memory. Andres said this might not be a problem[3], but I'm
>  not convinced that the problem is zero. If you use small work_mem
>  for the write phase of sorting, you'll end up with a lot of runs to
>  merge later and that has some kind of cost.
>

Why would we need to do both concurrently? I thought we'd empty the hash
table before doing the sort, no?

>* The simplicity might start to evaporate when we consider grouping
>  sets and eviction strategy.
>

Hmm, yeah :-/

>Main topics to consider:
>
>ARRAY_AGG:
>
>Some aggregates, like ARRAY_AGG, have a transition state that grows
>proportionally with the group size. In other words, it is not a
>summary like COUNT or AVG, it contains all of the input data in a new
>form.
>

Strictly speaking the state may grow even for count/avg aggregates, e.g.
for numeric types, but it's far less serious than array_agg etc.

>These aggregates are not a good candidate for hash aggregation. Hash
>aggregation is about keeping many transition states running in
>parallel, which is just a bad fit for large transition states. Sorting
>is better because it advances one transition state at a time. We could:
>
>* Let ARRAY_AGG continue to exceed work_mem like today.
>
>* Block or pessimize use of hash aggregation for such aggregates.
>
>* Evict groups from the hash table when it becomes too large. This
>  requires the ability to serialize and deserialize transition states,
>  and some approaches here might also need combine_func
>  specified. These requirements seem reasonable, but we still need
>  some answer of what to do for aggregates that grow like ARRAY_AGG
>  but don't have the required serialfunc, deserialfunc, or
>  combine_func.
>

Do we actually need to handle that case? How many such aggregates are
there? I think it's OK to just ignore that case (and keep doing what we do
now), and require serial/deserial functions for anything better.

>GROUPING SETS:
>
>With grouping sets, there are multiple hash tables and each hash table
>has its own hash function, so that makes partitioning more
>complex. In Approach 1, that means we need to either (a) not partition
>the spilled tuples; or (b) have a different set of partitions for each
>hash table and spill the same tuple multiple times. In Approach 2, we
>would be required to partition each hash table separately and spill
>tuples multiple times. In Approach 3 (depending on the exact approach
>but taking a guess here) we would need to add a set of phases (one
>extra phase for each hash table) for spilled tuples.
>

No thoughts about this yet.

>MEMORY TRACKING:
>
>I have a patch to track the total allocated memory by
>incrementing/decrementing it when blocks are malloc'd/free'd. This
>doesn't do bookkeeping for each chunk, only each block. Previously,
>Robert Haas raised some concerns[4] about performance, which were
>mitigated[5] but perhaps not entirely eliminated (but did become
>elusive).
>
>The only alternative is estimation, which is ugly and seems like a bad
>idea. Memory usage isn't just driven by inputs, it's also driven by
>patterns of use. Misestimates in the planner are fine (within reason)
>because we don't have any other choice, and a small-factor misestimate
>might not change the plan anyway. But in the executor, a small-factor
>misestimate seems like it's just not doing the job. If a user found
>that hash aggregation was using 3X work_mem, and my only explanation
>is "well, it's just an estimate", I would be pretty embarrassed and
>the user would likely lose confidence in the feature. I don't mean
>that we must track memory perfectly everywhere, but using an estimate
>seems like a mediocre improvement of the current state.

I agree estimates are not the right tool here.

>
>We should proceed with memory context tracking and try to eliminate or
>mitigate performance concerns. I would not like to make any herculean
>effort as a part of the hash aggregation work though; I think it's
>basically just something a memory manager in a database system should
>have supported all along. I think we will find other uses for it as
>time goes on. We have more and more things happening in the executor
>and having a cheap way to check "how much memory is this thing using?"
>seems very likely to be useful.
>

IMO we should just use the cheapest memory accounting (tracking the amount
of memory allocated for blocks). I agree it's a feature we need, and I don't
think we can devise anything cheaper than this.

>Other points:
>
>* Someone brought up the idea of using logtapes.c instead of writing
>  separate files for each partition. That seems reasonable, but it's
>  using logtapes.c slightly outside of its intended purpose. Also,
>  it's awkward to need to specify the number of tapes up-front. Worth
>  experimenting with to see if it's a win.
>
>* Tomas did some experiments regarding the number of batches to choose
>  and how to choose them. It seems like there's room for improvement
>  over the simple calculation I'm doing now.
>

Me? I don't recall such benchmarks, but maybe I did. But I think we'll
need to repeat those with the new patches etc. I think the question is
whether we see this as an emergency solution - in that case I wouldn't
obsess about getting the best possible parameters.

>* A lot of discussion about a smart eviction strategy. I don't see
>  strong evidence that it's worth the complexity at this time. The
>  smarter we try to be, the more bookkeeping and memory fragmentation
>  problems we will have. If we evict something, we should probably
>  evict the whole hash table or some large part of it.
>

Maybe. For each "smart" eviction strategy there is a (trivial) example
of data on which it performs poorly.

I think it's the same thing as with the number of partitions - if we
consider this to be an emergency solution, it's OK if the performance is
not entirely perfect when it kicks in.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Memory-Bounded Hash Aggregation

Jeff Davis-8
In reply to this post by Jeff Davis-8
On Mon, 2019-07-01 at 12:13 -0700, Jeff Davis wrote:
> This is for design review. I have a patch (WIP) for Approach 1, and
> if
> this discussion starts to converge on that approach I will polish and
> post it.

WIP patch attached (based on 9a81c9fa); targeting September CF.

Not intended for detailed review yet, but it seems to work in enough
cases (including grouping sets and JIT) to be a good proof-of-concept
for the algorithm and its complexity.

Initial performance numbers put it at 2X slower than sort for grouping
10M distinct integers. There are quite a few optimizations I haven't
tried yet and quite a few tunables I haven't tuned yet, so hopefully I
can close the gap a bit for the small-groups case.

I will offer more details soon when I have more confidence in the
numbers.

It does not attempt to spill ARRAY_AGG at all yet.

Regards,
        Jeff Davis


Attachment: hashagg-20190703.patch (63K)

Re: Memory-Bounded Hash Aggregation

Jeff Davis-8
In reply to this post by Tomas Vondra-4
On Wed, 2019-07-03 at 02:17 +0200, Tomas Vondra wrote:
> What does "partitioned hash strategy" do? It's probably explained in
> one
> of the historical discussions, but I'm not sure which one. I assume
> it
> simply hashes the group keys and uses that to partition the data, and
> then
> passes it to hash aggregate.

Yes. When spilling, it is cheap to partition on the hash value at the
same time, which dramatically reduces the need to spill multiple times.
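
(Roughly speaking, the hash already computed for the hash-table lookup
can be reused to pick the spill partition, as in the hypothetical
snippet below; no second pass over the group keys is needed.)

    /* Hypothetical: choose a spill partition from the group hash that was
     * already computed for the hash-table lookup, consuming fresh hash
     * bits at each level of recursive spilling. */
    static int
    choose_spill_partition(uint32 hash, int bits_used, int npartitions)
    {
        return (int) ((hash >> bits_used) % npartitions);
    }
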
Previous discussions:


> Unfortunately the second link does not work :-(

It's supposed to be:


https://www.postgresql.org/message-id/CAGTBQpa__-NP7%3DkKwze_enkqw18vodRxKkOmNhxAPzqkruc-8g%40mail.gmail.com


> I'm not going to block Approach 1, although I'd really like to see
> something that helps with array_agg.

I have a WIP patch that I just posted. It doesn't yet work with
ARRAY_AGG, but I think it can be made to work by evicting the entire
hash table, serializing the transition states, and then later combining
them.

> Aren't all three approaches a way to "fix" hash aggregate? In any
> case,
> it's certainly reasonable to make incremental changes. The question
> is
> whether "approach 1" is sensible step towards some form of "approach
> 3"

Disk-based hashing certainly seems like a reasonable algorithm on paper
that has some potential advantages over sorting. It certainly seems
sensible to me that we explore the disk-based hashing strategy first,
and then we would at least know what we are missing (if anything) by
going with the hybrid approach later.

There's also a fair amount of design space to explore in the hybrid
strategy. That could take a while to converge, especially if we don't
have anything in place to compare against.

> > * It means we have a hash table and sort running concurrently, each
> >  using memory. Andres said this might not be a problem[3], but I'm
> >  not convinced that the problem is zero. If you use small work_mem
> >  for the write phase of sorting, you'll end up with a lot of runs
> > to
> >  merge later and that has some kind of cost.
> >
>
> Why would we need to do both concurrently? I thought we'd empty the
> hash
> table before doing the sort, no?

So you are saying we spill the tuples into a tuplestore, then feed the
tuplestore through a tuplesort? Seems inefficient, but I guess we can.

> Do we actually need to handle that case? How many such aggregates are
> there? I think it's OK to just ignore that case (and keep doing what
> we do
> now), and require serial/deserial functions for anything better.

Punting on a few cases is fine with me, if the user has a way to fix
it.

Regards,
        Jeff Davis





Re: Memory-Bounded Hash Aggregation

Tomas Vondra-4
On Wed, Jul 03, 2019 at 07:03:06PM -0700, Jeff Davis wrote:

>On Wed, 2019-07-03 at 02:17 +0200, Tomas Vondra wrote:
>> What does "partitioned hash strategy" do? It's probably explained in
>> one
>> of the historical discussions, but I'm not sure which one. I assume
>> it
>> simply hashes the group keys and uses that to partition the data, and
>> then
>> passes it to hash aggregate.
>
>Yes. When spilling, it is cheap to partition on the hash value at the
>same time, which dramatically reduces the need to spill multiple times.
>Previous discussions:
>
>
>> Unfortunately the second link does not work :-(
>
>It's supposed to be:
>
>
>https://www.postgresql.org/message-id/CAGTBQpa__-NP7%3DkKwze_enkqw18vodRxKkOmNhxAPzqkruc-8g%40mail.gmail.com
>
>
>> I'm not going to block Approach 1, although I'd really like to see
>> something that helps with array_agg.
>
>I have a WIP patch that I just posted. It doesn't yet work with
>ARRAY_AGG, but I think it can be made to work by evicting the entire
>hash table, serializing the transition states, and then later combining
>them.
>
>> Aren't all three approaches a way to "fix" hash aggregate? In any
>> case,
>> it's certainly reasonable to make incremental changes. The question
>> is
>> whether "approach 1" is sensible step towards some form of "approach
>> 3"
>
>Disk-based hashing certainly seems like a reasonable algorithm on paper
>that has some potential advantages over sorting. It certainly seems
>sensible to me that we explore the disk-based hashing strategy first,
>and then we would at least know what we are missing (if anything) by
>going with the hybrid approach later.
>
>There's also a fair amount of design space to explore in the hybrid
>strategy. That could take a while to converge, especially if we don't
>have anything in place to compare against.
>

Makes sense. I haven't thought about how the hybrid approach would be
implemented very much, so I can't quite judge how complicated it would be
to extend "approach 1" later. But if you think it's a sensible first step,
I trust you. And I certainly agree we need something to compare the other
approaches against.


>> > * It means we have a hash table and sort running concurrently, each
>> >  using memory. Andres said this might not be a problem[3], but I'm
>> >  not convinced that the problem is zero. If you use small work_mem
>> >  for the write phase of sorting, you'll end up with a lot of runs
>> > to
>> >  merge later and that has some kind of cost.
>> >
>>
>> Why would we need to do both concurrently? I thought we'd empty the
>> hash
>> table before doing the sort, no?
>
>So you are saying we spill the tuples into a tuplestore, then feed the
>tuplestore through a tuplesort? Seems inefficient, but I guess we can.
>

I think the question is whether we see this as "emergency fix" (for cases
that are misestimated and could/would fail with OOM at runtime), or as
something that is meant to make "hash agg" more widely applicable.

I personally see it as an emergency fix, in which case it's perfectly
fine if it's not 100% efficient, assuming it kicks in only rarely.
Effectively, we're betting on hash agg, and from time to time we lose.

But even if we see it as a general optimization technique it does not have
to be perfectly efficient, as long as it's properly costed (so the planner
only uses it when appropriate).

If we have a better solution (in terms of efficiency, code complexity,
etc.) then sure - let's use that. But considering we've started this
discussion in ~2015 and we still don't have anything, I wouldn't hold my
breath. Let's do something good enough, and maybe improve it later.

>> Do we actually need to handle that case? How many such aggregates are
>> there? I think it's OK to just ignore that case (and keep doing what
>> we do
>> now), and require serial/deserial functions for anything better.
>
>Punting on a few cases is fine with me, if the user has a way to fix
>it.
>

+1 to doing that


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Memory-Bounded Hash Aggregation

Jeff Davis-8
On Thu, 2019-07-11 at 17:55 +0200, Tomas Vondra wrote:
> Makes sense. I haven't thought about how the hybrid approach would be
> implemented very much, so I can't quite judge how complicated it would
> be
> to extend "approach 1" later. But if you think it's a sensible first
> step,
> I trust you. And I certainly agree we need something to compare the
> other
> approaches against.

Is this a duplicate of your previous email?

I'm slightly confused but I will use the opportunity to put out another
WIP patch. The patch could use a few rounds of cleanup and quality
work, but the functionality is there and the performance seems
reasonable.

I rebased on master and fixed a few bugs, and most importantly, added
tests.

It seems to be working with grouping sets fine. It will take a little
longer to get good performance numbers, but even for group size of one,
I'm seeing HashAgg get close to Sort+Group in some cases.

You are right that the missed lookups appear to be costly, at least
when the data all fits in system memory. I think it's the cache misses,
because sometimes reducing work_mem improves performance. I'll try
tuning the number of buckets for the hash table and see if that helps.
If not, then the performance still seems pretty good to me.

Of course, HashAgg can beat sort for larger group sizes, but I'll try
to gather some more data on the cross-over point.

Regards,
        Jeff Davis


Attachment: hashagg-20190711.patch (103K)

Re: Memory-Bounded Hash Aggregation

Tomas Vondra-4
On Thu, Jul 11, 2019 at 06:06:33PM -0700, Jeff Davis wrote:

>On Thu, 2019-07-11 at 17:55 +0200, Tomas Vondra wrote:
>> Makes sense. I haven't thought about how the hybrid approach would be
>> implemented very much, so I can't quite judge how complicated it would
>> be
>> to extend "approach 1" later. But if you think it's a sensible first
>> step,
>> I trust you. And I certainly agree we need something to compare the
>> other
>> approaches against.
>
>Is this a duplicate of your previous email?
>

Yes. I don't know how I managed to send it again. Sorry.

>I'm slightly confused but I will use the opportunity to put out another
>WIP patch. The patch could use a few rounds of cleanup and quality
>work, but the functionality is there and the performance seems
>reasonable.
>
>I rebased on master and fixed a few bugs, and most importantly, added
>tests.
>
>It seems to be working with grouping sets fine. It will take a little
>longer to get good performance numbers, but even for group size of one,
>I'm seeing HashAgg get close to Sort+Group in some cases.
>

Nice! That's very nice progress!

>You are right that the missed lookups appear to be costly, at least
>when the data all fits in system memory. I think it's the cache misses,
>because sometimes reducing work_mem improves performance. I'll try
>tuning the number of buckets for the hash table and see if that helps.
>If not, then the performance still seems pretty good to me.
>
>Of course, HashAgg can beat sort for larger group sizes, but I'll try
>to gather some more data on the cross-over point.
>

Yes, makes sense. I think it's acceptable as long as we consider this
during costing (when we know in advance we'll need this) or treat it as an
emergency measure.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services