Spilling hashed SetOps and aggregates to disk

classic Classic list List threaded Threaded
64 messages Options
1234
Reply | Threaded
Open this post in threaded view
|

Spilling hashed SetOps and aggregates to disk

Heikki Linnakangas
Hi,

Hash Aggs and SetOps are currently not spilled to disk. If the planner's
estimate on the number of entries is badly off, you might run out of
memory at execution time, if all the entries don't fit in memory.

For HashAggs, this was discussed in depth a couple of years ago at [1].
SetOps have the same issue, but fixing that is simpler, as you don't
need to handle arbitrary aggregate transition values and functions.

So a while back, I started hacking on spilling SetOps, with the idea
that the code to deal with that could later be reused to deal with
HashAggs, too. I didn't get very far, but I'm posting this in this very
unfinished form to show what I've got, because I had a chat on this with
Jeff Davis and some others last week.

The logtape.c interface would be very useful for this. When you start
spilling, you want to create many spill files, so that when reloaded,
each file will fit comfortably in memory. With logtape.c, you can have
many logical tapes, without the overhead of real files. Furthermore, if
you need to re-spill because you a spill file grows too big in the first
pass, logtape.c allows reusing the space "on-the-fly". The only problem
with the current logtape interface is that it requires specifying the
number of "tapes" upfront, when the tapeset is created. However, I was
planning to change that, anyway [2].

[1]
https://www.postgresql.org/message-id/1407706010.6623.16.camel%40jeff-desktop

[2]
https://www.postgresql.org/message-id/420a0ec7-602c-d406-1e75-1ef7ddc58d83%40iki.fi

- Heikki


0001-Optimize-memory-usage-in-SetOp-executor-node.patch (5K) Download Attachment
0002-Allow-SetOps-to-spill.patch (26K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Spilling hashed SetOps and aggregates to disk

Andres Freund
Hi,

On 2018-06-04 10:32:47 +0200, Heikki Linnakangas wrote:
> Hash Aggs and SetOps are currently not spilled to disk. If the planner's
> estimate on the number of entries is badly off, you might run out of memory
> at execution time, if all the entries don't fit in memory.
>
> For HashAggs, this was discussed in depth a couple of years ago at [1].
> SetOps have the same issue, but fixing that is simpler, as you don't need to
> handle arbitrary aggregate transition values and functions.

That part has gotten a bit easier since, because we have serialize /
deserialize operations for aggregates these days.

I wonder whether, at least for aggregates, the better fix wouldn't be to
switch to feeding the tuples into tuplesort upon memory exhaustion and
doing a sort based aggregate.  We have most of the infrastructure to do
that due to grouping sets. It's just the pre-existing in-memory tuples
that'd be problematic, in that the current transition values would need
to serialized as well.  But with a stable sort that'd not be
particularly problematic, and that could easily be achieved.

Greetings,

Andres Freund

Reply | Threaded
Open this post in threaded view
|

Re: Spilling hashed SetOps and aggregates to disk

David Rowley-3
On 5 June 2018 at 06:52, Andres Freund <[hidden email]> wrote:
> That part has gotten a bit easier since, because we have serialize /
> deserialize operations for aggregates these days.

True. Although not all built in aggregates have those defined.

> I wonder whether, at least for aggregates, the better fix wouldn't be to
> switch to feeding the tuples into tuplesort upon memory exhaustion and
> doing a sort based aggregate.  We have most of the infrastructure to do
> that due to grouping sets. It's just the pre-existing in-memory tuples
> that'd be problematic, in that the current transition values would need
> to serialized as well.  But with a stable sort that'd not be
> particularly problematic, and that could easily be achieved.

Isn't there still a problem determining when the memory exhaustion
actually happens though?   As far as I know, we've still little
knowledge how much memory each aggregate state occupies.

Jeff tried to solve this in [1], but from what I remember, there was
too much concern about the overhead of the additional accounting code.

[1] https://www.postgresql.org/message-id/flat/CAKJS1f8yvvvj-sVDv_bcxkzcZKq0ZOTVhX0dHfnYDct2Mycq5Q%40mail.gmail.com#CAKJS1f8yvvvj-sVDv_bcxkzcZKq0ZOTVhX0dHfnYDct2Mycq5Q@...

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Spilling hashed SetOps and aggregates to disk

Tomas Vondra-4
On 06/05/2018 04:56 AM, David Rowley wrote:
> On 5 June 2018 at 06:52, Andres Freund <[hidden email]> wrote:
>> That part has gotten a bit easier since, because we have serialize /
>> deserialize operations for aggregates these days.
>
> True. Although not all built in aggregates have those defined.

Not sure what to do about those, unfortunately. We could stop adding
more groups and hope the aggregate state does not grow further, but
that's about it.

>
>> I wonder whether, at least for aggregates, the better fix wouldn't be to
>> switch to feeding the tuples into tuplesort upon memory exhaustion and
>> doing a sort based aggregate.  We have most of the infrastructure to do
>> that due to grouping sets. It's just the pre-existing in-memory tuples
>> that'd be problematic, in that the current transition values would need
>> to serialized as well.  But with a stable sort that'd not be
>> particularly problematic, and that could easily be achieved.

That's one of the possible solutions, yes. It's hard to say if it's
better or worse than the other approaches, because that depends on the
number of tuples vs. number of groups etc.

Evicting some (or all) of the in-memory groups can be easily better or
worse, depending on the data set. And I'm not sure the code complexity
would be significantly different.

I expect the eviction strategy to be the primary design challenge of
this patch. The other bits will be mostly determined by this one piece.

While the primary goal of the patch is addressing the OOM risks in hash
aggregate, I think it'd be a mistake to see it just that way. I see it
could allow us to perform hash aggregate more often, even if we know the
groups won't fit into work_mem. If we could estimate how much of the
aggregate state we'll have to spill to disk, we could still prefer
hashagg over groupagg. We would pay the price for eviction, but on large
data sets that can be massively cheaper than having to do sort.

I admit this is a bit hand-wavy, and the costing depends on us being
able to estimate the amount of evicted data. I certainly don't expect to
get this right away (that would move the goal posts for the patch very
far away), but it would be unfortunate to make this improvement
impossible/more difficult in the future.

>
> Isn't there still a problem determining when the memory exhaustion
> actually happens though?   As far as I know, we've still little
> knowledge how much memory each aggregate state occupies.
>
> Jeff tried to solve this in [1], but from what I remember, there was
> too much concern about the overhead of the additional accounting code.
>
> [1] https://www.postgresql.org/message-id/flat/CAKJS1f8yvvvj-sVDv_bcxkzcZKq0ZOTVhX0dHfnYDct2Mycq5Q%40mail.gmail.com#CAKJS1f8yvvvj-sVDv_bcxkzcZKq0ZOTVhX0dHfnYDct2Mycq5Q@...
>

I had a chat with Jeff Davis at pgcon about this, and IIRC he suggested
a couple of people who were originally worried about the overhead seem
to be accepting it now.

IMHO we do want a memory-bound hash aggregate, and doing some sort of
memory accounting is a must-have part of that. I don't see a way around
this, really. We need to minimize the overhead, of course.

I do not recall the exact approach we ended up with in 2015, though, or
what the measurements with that version were. I see 1-3% mentioned early
in the thread, and there were some additional improvements in subsequent
patch version I think.

I don't think we can realistically improve this (accounting at block
level), and there was a discussion if this is actually an overhead or
merely due to different binary layout.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Spilling hashed SetOps and aggregates to disk

Jeff Davis-8
In reply to this post by Andres Freund
On Mon, 2018-06-04 at 11:52 -0700, Andres Freund wrote:
> I wonder whether, at least for aggregates, the better fix wouldn't be
> to
> switch to feeding the tuples into tuplesort upon memory exhaustion
> and
> doing a sort based aggregate.  We have most of the infrastructure to
> do

That's an interesting idea, but it seems simpler to stick to hashing
rather than using a combination strategy. It also seems like it would
take less CPU effort.

What advantages do you have in mind? My patch partitions the spilled
data, so it should have similar disk costs as a sort approach.

Regards,
        Jeff Davis


Reply | Threaded
Open this post in threaded view
|

Re: Spilling hashed SetOps and aggregates to disk

Jeff Davis-8
In reply to this post by Tomas Vondra-4
On Tue, 2018-06-05 at 07:04 +0200, Tomas Vondra wrote:
> I expect the eviction strategy to be the primary design challenge of
> this patch. The other bits will be mostly determined by this one
> piece.

Not sure I agree that this is the primary challenge.

The cases that benefit from eviction are probably a minority. I see two
categories that would benefit:

  * Natural clustering in the heap. This sounds fairly common, but a
lot of the cases that come to mind are too low-cardinality to be
compelling; e.g. timestamps grouped by hour/day/month. If someone has
run into a high-cardinality natural grouping case, let me know.
  * ARRAY_AGG (or similar): individual state values can be large enough
that we need to evict to avoid exceeding work_mem even if not adding
any new groups.

In either case, it seems like a fairly simple eviction strategy would
work. For instance, we could just evict the entire hash table if
work_mem is exceeded or if the hit rate on the hash table falls below a
certain threshold. If there was really something important that should
have stayed in the hash table, it will go back in soon anyway.

So why should eviction be a major driver for the entire design? I agree
it should be an area of improvement for the future, so let me know if
you see a major problem, but I haven't been as focused on eviction.

> While the primary goal of the patch is addressing the OOM risks in
> hash
> aggregate, I think it'd be a mistake to see it just that way. I see
> it
> could allow us to perform hash aggregate more often, even if we know
> the
> groups won't fit into work_mem. If we could estimate how much of the
> aggregate state we'll have to spill to disk, we could still prefer
> hashagg over groupagg. We would pay the price for eviction, but on
> large
> data sets that can be massively cheaper than having to do sort.

Agreed. I ran some tests of my patch in the last round, and they
strongly supported choosing HashAgg a lot more often. A lot of sort
improvements have been made though, so I really need to run some new
numbers.

> far away), but it would be unfortunate to make this improvement
> impossible/more difficult in the future.

If you see anything that would make this difficult in the future, let
me know.

Regards,
     Jeff Davis


Reply | Threaded
Open this post in threaded view
|

Re: Spilling hashed SetOps and aggregates to disk

David Rowley-3
In reply to this post by Tomas Vondra-4
On 5 June 2018 at 17:04, Tomas Vondra <[hidden email]> wrote:

> On 06/05/2018 04:56 AM, David Rowley wrote:
>> Isn't there still a problem determining when the memory exhaustion
>> actually happens though?   As far as I know, we've still little
>> knowledge how much memory each aggregate state occupies.
>>
>> Jeff tried to solve this in [1], but from what I remember, there was
>> too much concern about the overhead of the additional accounting code.
>>
>> [1] https://www.postgresql.org/message-id/flat/CAKJS1f8yvvvj-sVDv_bcxkzcZKq0ZOTVhX0dHfnYDct2Mycq5Q%40mail.gmail.com#CAKJS1f8yvvvj-sVDv_bcxkzcZKq0ZOTVhX0dHfnYDct2Mycq5Q@...
>>
>
> I had a chat with Jeff Davis at pgcon about this, and IIRC he suggested
> a couple of people who were originally worried about the overhead seem
> to be accepting it now.

Is there any great need to make everything pay the small price for
this? Couldn't we just create a new MemoryContextMethod that
duplicates aset.c, but has the require additional accounting built in
at the implementation level, then make execGrouping.c use that
allocator for its hash table? The code would not really need to be
duplicated, we could just do things the same way as like.c does with
like_match.c and include a .c file. We'd need another interface
function in MemoryContextMethods to support getting the total memory
allocated in the context, but that does not seem like a problem.


--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Spilling hashed SetOps and aggregates to disk

Tomas Vondra-4

On 06/05/2018 09:22 AM, David Rowley wrote:

> On 5 June 2018 at 17:04, Tomas Vondra <[hidden email]> wrote:
>> On 06/05/2018 04:56 AM, David Rowley wrote:
>>> Isn't there still a problem determining when the memory exhaustion
>>> actually happens though?   As far as I know, we've still little
>>> knowledge how much memory each aggregate state occupies.
>>>
>>> Jeff tried to solve this in [1], but from what I remember, there was
>>> too much concern about the overhead of the additional accounting code.
>>>
>>> [1] https://www.postgresql.org/message-id/flat/CAKJS1f8yvvvj-sVDv_bcxkzcZKq0ZOTVhX0dHfnYDct2Mycq5Q%40mail.gmail.com#CAKJS1f8yvvvj-sVDv_bcxkzcZKq0ZOTVhX0dHfnYDct2Mycq5Q@...
>>>
>>
>> I had a chat with Jeff Davis at pgcon about this, and IIRC he suggested
>> a couple of people who were originally worried about the overhead seem
>> to be accepting it now.
>
> Is there any great need to make everything pay the small price for
> this? Couldn't we just create a new MemoryContextMethod that
> duplicates aset.c, but has the require additional accounting built in
> at the implementation level, then make execGrouping.c use that
> allocator for its hash table? The code would not really need to be
> duplicated, we could just do things the same way as like.c does with
> like_match.c and include a .c file. We'd need another interface
> function in MemoryContextMethods to support getting the total memory
> allocated in the context, but that does not seem like a problem.
>

There probably is not a need, but there was not a great way to enable it
explicitly only for some contexts. IIRC what was originally considered
back in 2015 was some sort of a flag in the memory context, and the
overhead was about the same as with the int64 arithmetic alone.

But I don't think we've considered copying the whole AllocSet. Maybe
that would work, not sure. I wonder if an aggregate might use a custom
context internally (I don't recall anything like that). The accounting
capability seems potentially useful for other places, and those might
not use AllocSet (or at least not directly).

All of this of course assumes the overhead is still there. I sure will
do some new measurements.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Spilling hashed SetOps and aggregates to disk

Tomas Vondra-4
In reply to this post by Jeff Davis-8
On 06/05/2018 07:46 AM, Jeff Davis wrote:

> On Tue, 2018-06-05 at 07:04 +0200, Tomas Vondra wrote:
>> I expect the eviction strategy to be the primary design challenge of
>> this patch. The other bits will be mostly determined by this one
>> piece.
>
> Not sure I agree that this is the primary challenge.
>
> The cases that benefit from eviction are probably a minority. I see two
> categories that would benefit:
>
>    * Natural clustering in the heap. This sounds fairly common, but a
> lot of the cases that come to mind are too low-cardinality to be
> compelling; e.g. timestamps grouped by hour/day/month. If someone has
> run into a high-cardinality natural grouping case, let me know.
>    * ARRAY_AGG (or similar): individual state values can be large enough
> that we need to evict to avoid exceeding work_mem even if not adding
> any new groups.
>
> In either case, it seems like a fairly simple eviction strategy would
> work. For instance, we could just evict the entire hash table if
> work_mem is exceeded or if the hit rate on the hash table falls below a
> certain threshold. If there was really something important that should
> have stayed in the hash table, it will go back in soon anyway.
>
> So why should eviction be a major driver for the entire design? I agree
> it should be an area of improvement for the future, so let me know if
> you see a major problem, but I haven't been as focused on eviction.
>

My concern is more about what happens when the input tuple ordering is
inherently incompatible with the eviction strategy, greatly increasing
the amount of data written to disk during evictions.

Say for example that we can fit 1000 groups into work_mem, and that
there are 2000 groups in the input data set. If the input is correlated
with the groups, everything is peachy because we'll evict the first
batch, and then group the remaining 1000 groups (or something like
that). But if the input data is random (which can easily happen, e.g.
for IP addresses, UUIDs and such) we'll hit the limit repeatedly and
will evict much sooner.

I know you suggested simply dumping the whole hash table and starting
from scratch while we talked about this at pgcon, but ISTM it has
exactly this issue.

But I don't know if there actually is a better option - maybe we simply
have to accept this problem. After all, running slowly is still better
than OOM (which may or may not happen now).

I wonder if we can somehow detect this at run-time and maybe fall-back
to groupagg. E.g. we could compare number of groups vs. number of input
tuples when we first hit the limit. It's a rough heuristics, but maybe
sufficient.

>> While the primary goal of the patch is addressing the OOM risks in
>> hash
>> aggregate, I think it'd be a mistake to see it just that way. I see
>> it
>> could allow us to perform hash aggregate more often, even if we know
>> the
>> groups won't fit into work_mem. If we could estimate how much of the
>> aggregate state we'll have to spill to disk, we could still prefer
>> hashagg over groupagg. We would pay the price for eviction, but on
>> large
>> data sets that can be massively cheaper than having to do sort.
>
> Agreed. I ran some tests of my patch in the last round, and they
> strongly supported choosing HashAgg a lot more often. A lot of sort
> improvements have been made though, so I really need to run some new
> numbers.
>

Yeah, Peter made the sort stuff a lot faster. But I think there still is
a lot of cases where the hashagg would greatly reduce the amount of data
that needs to be written to disk, and that's not something the sort
improvements could improve. If you need to aggregate a 1TB table, it's
still going to be ~1TB of data you need to write to disk during on-disk
sort. But if there is only a reasonable number of groups, that greatly
simplifies things.

>> far away), but it would be unfortunate to make this improvement
>> impossible/more difficult in the future.
>
> If you see anything that would make this difficult in the future, let
> me know.
>

Sure. Sorry for being too hand-wavy in this thread, and possibly moving
the goal posts further away :-/ I got excited by you planning to work on
this again and perhaps a bit carried away ;-)


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Spilling hashed SetOps and aggregates to disk

Andres Freund
In reply to this post by Jeff Davis-8
Hi,

On 2018-06-04 22:18:56 -0700, Jeff Davis wrote:

> On Mon, 2018-06-04 at 11:52 -0700, Andres Freund wrote:
> > I wonder whether, at least for aggregates, the better fix wouldn't be
> > to
> > switch to feeding the tuples into tuplesort upon memory exhaustion
> > and
> > doing a sort based aggregate.  We have most of the infrastructure to
> > do
>
> That's an interesting idea, but it seems simpler to stick to hashing
> rather than using a combination strategy. It also seems like it would
> take less CPU effort.

Isn't the locality of access going to considerably better with the sort
based approach?


> What advantages do you have in mind? My patch partitions the spilled
> data, so it should have similar disk costs as a sort approach.

I think one part of it is that I think the amount of code is going to be
lower - we essentially have already all the code to handle sort based
aggs, and to have both sort and hash based aggs in the same query. We'd
mostly need a way to scan the hashtable and stuff it into a tuplesort,
that's not hard.  nodeAgg.c is already more than complex enough, I'm not
sure that full blown partitioning is worth the cost.

Greetings,

Andres Freund

Reply | Threaded
Open this post in threaded view
|

Re: Spilling hashed SetOps and aggregates to disk

Andres Freund
In reply to this post by Tomas Vondra-4
Hi,

On 2018-06-05 09:35:13 +0200, Tomas Vondra wrote:
> But I don't think we've considered copying the whole AllocSet. Maybe that
> would work, not sure.

Don't think you'd need to fully copy it - you can just have a "wrapper"
memory context implementation that does the accounting and then hands
the actual work to AllocSet. That limits some things it can account for,
but I don't see those being really relevant.

> I wonder if an aggregate might use a custom context
> internally (I don't recall anything like that). The accounting capability
> seems potentially useful for other places, and those might not use AllocSet
> (or at least not directly).

Yea, that seems like a big issue.

Greetings,

Andres Freund

Reply | Threaded
Open this post in threaded view
|

Re: Spilling hashed SetOps and aggregates to disk

Andres Freund
In reply to this post by Tomas Vondra-4
Hi,

On 2018-06-05 10:05:35 +0200, Tomas Vondra wrote:

> My concern is more about what happens when the input tuple ordering is
> inherently incompatible with the eviction strategy, greatly increasing the
> amount of data written to disk during evictions.
>
> Say for example that we can fit 1000 groups into work_mem, and that there
> are 2000 groups in the input data set. If the input is correlated with the
> groups, everything is peachy because we'll evict the first batch, and then
> group the remaining 1000 groups (or something like that). But if the input
> data is random (which can easily happen, e.g. for IP addresses, UUIDs and
> such) we'll hit the limit repeatedly and will evict much sooner.

> I know you suggested simply dumping the whole hash table and starting from
> scratch while we talked about this at pgcon, but ISTM it has exactly this
> issue.

Yea, that's the case I was thinking of where going to sorting will very
likely have better performance.

I think it'd even be sensible to have a "skew tuple" like
optimization. When detecting getting closer to memory exhaustion, move
new groups to the tuplesort, but already hashed tuples stay in the
hashtable.  That'd still need tuples being moved to the sort in the
cases where the transition values get to big (say array_agg), but that
should be comparatively rare.  I'm sure we could do better in selecting
the hash-tabled values than just taking the first incoming ones, but
that shouldn't be too bad.

Greetings,

Andres Freund

Reply | Threaded
Open this post in threaded view
|

Re: Spilling hashed SetOps and aggregates to disk

David Rowley-3
In reply to this post by Andres Freund
On 6 June 2018 at 00:45, Andres Freund <[hidden email]> wrote:
> On 2018-06-05 09:35:13 +0200, Tomas Vondra wrote:
>> I wonder if an aggregate might use a custom context
>> internally (I don't recall anything like that). The accounting capability
>> seems potentially useful for other places, and those might not use AllocSet
>> (or at least not directly).
>
> Yea, that seems like a big issue.

Unfortunately, at least one of the built-in ones do. See initArrayResultArr.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Spilling hashed SetOps and aggregates to disk

Andres Freund
On 2018-06-06 00:53:42 +1200, David Rowley wrote:

> On 6 June 2018 at 00:45, Andres Freund <[hidden email]> wrote:
> > On 2018-06-05 09:35:13 +0200, Tomas Vondra wrote:
> >> I wonder if an aggregate might use a custom context
> >> internally (I don't recall anything like that). The accounting capability
> >> seems potentially useful for other places, and those might not use AllocSet
> >> (or at least not directly).
> >
> > Yea, that seems like a big issue.
>
> Unfortunately, at least one of the built-in ones do. See initArrayResultArr.

I think it's ok to only handle this gracefully if serialization is
supported.

But I think my proposal to continue use a hashtable for the already
known groups, and sorting for additional groups would largely address
that largely, right?  We couldn't deal with groups becoming too large,
but easily with the number of groups becoming too large.

Greetings,

Andres Freund

Reply | Threaded
Open this post in threaded view
|

Re: Spilling hashed SetOps and aggregates to disk

David Rowley-3
On 6 June 2018 at 00:57, Andres Freund <[hidden email]> wrote:

> On 2018-06-06 00:53:42 +1200, David Rowley wrote:
>> On 6 June 2018 at 00:45, Andres Freund <[hidden email]> wrote:
>> > On 2018-06-05 09:35:13 +0200, Tomas Vondra wrote:
>> >> I wonder if an aggregate might use a custom context
>> >> internally (I don't recall anything like that). The accounting capability
>> >> seems potentially useful for other places, and those might not use AllocSet
>> >> (or at least not directly).
>> >
>> > Yea, that seems like a big issue.
>>
>> Unfortunately, at least one of the built-in ones do. See initArrayResultArr.
>
> I think it's ok to only handle this gracefully if serialization is
> supported.
>
> But I think my proposal to continue use a hashtable for the already
> known groups, and sorting for additional groups would largely address
> that largely, right?  We couldn't deal with groups becoming too large,
> but easily with the number of groups becoming too large.

My concern is that only accounting memory for the group and not the
state is only solving half the problem. It might be fine for
aggregates that don't stray far from their aggtransspace, but for the
other ones, we could still see OOM.

If solving the problem completely is too hard, then a half fix (maybe
3/4) is better than nothing, but if we can get a design for a full fix
before too much work is done, then isn't that better?


--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Spilling hashed SetOps and aggregates to disk

Andres Freund
Hi,

On 2018-06-06 01:06:39 +1200, David Rowley wrote:

> On 6 June 2018 at 00:57, Andres Freund <[hidden email]> wrote:
> > I think it's ok to only handle this gracefully if serialization is
> > supported.
> >
> > But I think my proposal to continue use a hashtable for the already
> > known groups, and sorting for additional groups would largely address
> > that largely, right?  We couldn't deal with groups becoming too large,
> > but easily with the number of groups becoming too large.
>
> My concern is that only accounting memory for the group and not the
> state is only solving half the problem. It might be fine for
> aggregates that don't stray far from their aggtransspace, but for the
> other ones, we could still see OOM.

> If solving the problem completely is too hard, then a half fix (maybe
> 3/4) is better than nothing, but if we can get a design for a full fix
> before too much work is done, then isn't that better?

I don't think we actually disagree.  I was really primarily talking
about the case where we can't really do better because we don't have
serialization support.  I mean we could just rescan from scratch, using
a groupagg, but that obviously sucks.

Greetings,

Andres Freund

Reply | Threaded
Open this post in threaded view
|

Re: Spilling hashed SetOps and aggregates to disk

Tomas Vondra-4
In reply to this post by Andres Freund
On 06/05/2018 02:49 PM, Andres Freund wrote:

> Hi,
>
> On 2018-06-05 10:05:35 +0200, Tomas Vondra wrote:
>> My concern is more about what happens when the input tuple ordering is
>> inherently incompatible with the eviction strategy, greatly increasing the
>> amount of data written to disk during evictions.
>>
>> Say for example that we can fit 1000 groups into work_mem, and that there
>> are 2000 groups in the input data set. If the input is correlated with the
>> groups, everything is peachy because we'll evict the first batch, and then
>> group the remaining 1000 groups (or something like that). But if the input
>> data is random (which can easily happen, e.g. for IP addresses, UUIDs and
>> such) we'll hit the limit repeatedly and will evict much sooner.
>
>> I know you suggested simply dumping the whole hash table and starting from
>> scratch while we talked about this at pgcon, but ISTM it has exactly this
>> issue.
>
> Yea, that's the case I was thinking of where going to sorting will very
> likely have better performance.
>
> I think it'd even be sensible to have a "skew tuple" like
> optimization. When detecting getting closer to memory exhaustion, move
> new groups to the tuplesort, but already hashed tuples stay in the
> hashtable.  That'd still need tuples being moved to the sort in the
> cases where the transition values get to big (say array_agg), but that
> should be comparatively rare.  I'm sure we could do better in selecting
> the hash-tabled values than just taking the first incoming ones, but
> that shouldn't be too bad.
>

Not sure. I'd imagine the "compression" due to aggregating many tuples
into much smaller amount of memory would be a clear win, so I don't find
the "let's just dump all input tuples into a file" very attractive.

I think evicting a fraction of the aggregate state (say, ~25%) would
work better - choosing the "oldest" items seems OK, as those likely
represent many input tuples (thus having a high "compaction" ratio). Or
we could evict items representing rare groups, to make space for the
common ones. Perhaps a combined eviction strategy (10% of each type)
would work well. I'm conveniently ignoring the fact that we don't have
information to determine this, at the moment, of course.

I'm sure it's possible to make better decisions based on some cost
estimates, both at plan time and then during execution.

That being said, I don't want to over-think / over-engineer this.
Getting something that reduces the risk of OOM in the first step is a
good enough improvement. If we can do something better next, great.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Spilling hashed SetOps and aggregates to disk

David Rowley-3
In reply to this post by Andres Freund
On 6 June 2018 at 01:09, Andres Freund <[hidden email]> wrote:

> On 2018-06-06 01:06:39 +1200, David Rowley wrote:
>> My concern is that only accounting memory for the group and not the
>> state is only solving half the problem. It might be fine for
>> aggregates that don't stray far from their aggtransspace, but for the
>> other ones, we could still see OOM.
>
>> If solving the problem completely is too hard, then a half fix (maybe
>> 3/4) is better than nothing, but if we can get a design for a full fix
>> before too much work is done, then isn't that better?
>
> I don't think we actually disagree.  I was really primarily talking
> about the case where we can't really do better because we don't have
> serialization support.  I mean we could just rescan from scratch, using
> a groupagg, but that obviously sucks.

I don't think we do. To take yours to the 100% solution might just
take adding the memory accounting to palloc that Jeff proposed a few
years ago and use that accounting to decide when we should switch
method.

However, I don't quite fully recall how the patch accounted for memory
consumed by sub-contexts and if getting the entire consumption
required recursively looking at subcontexts. If that's the case then
checking the consumption would likely cost too much if it was done
after each tuple was aggregated.


--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Spilling hashed SetOps and aggregates to disk

Tomas Vondra-4
On 06/05/2018 03:17 PM, David Rowley wrote:

> On 6 June 2018 at 01:09, Andres Freund <[hidden email]> wrote:
>> On 2018-06-06 01:06:39 +1200, David Rowley wrote:
>>> My concern is that only accounting memory for the group and not the
>>> state is only solving half the problem. It might be fine for
>>> aggregates that don't stray far from their aggtransspace, but for the
>>> other ones, we could still see OOM.
>>
>>> If solving the problem completely is too hard, then a half fix (maybe
>>> 3/4) is better than nothing, but if we can get a design for a full fix
>>> before too much work is done, then isn't that better?
>>
>> I don't think we actually disagree.  I was really primarily talking
>> about the case where we can't really do better because we don't have
>> serialization support.  I mean we could just rescan from scratch, using
>> a groupagg, but that obviously sucks.
>
> I don't think we do. To take yours to the 100% solution might just
> take adding the memory accounting to palloc that Jeff proposed a few
> years ago and use that accounting to decide when we should switch
> method.
>
> However, I don't quite fully recall how the patch accounted for
> memory consumed by sub-contexts and if getting the entire
> consumption required recursively looking at subcontexts. If that's
> the case then checking the consumption would likely cost too much if
> it was done after each tuple was aggregated.

Yeah, a simple wrapper would not really fix the issue, because
allocating memory in the subcontext would not update the accounting
information. So we'd be oblivious of possibly large amounts of memory. I
don't quite see how is this related to (not) having serialization
support, as it's more about not knowing we already hit the memory limit.


That being said, I don't think array_agg actually has the issue, at
least since b419865a814abbca12bdd6eef6a3d5ed67f432e1 (it does behave
differently in aggregate and non-aggregate contexts, IIRC).

I don't know how common this issue is, though. I don't think any
built-in aggregates create additional contexts, but I may be wrong. But
we have this damn extensibility thing, where users can write their own
aggregates ...

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

RE: Re: Spilling hashed SetOps and aggregates to disk

srielau
In our code base we have added a WithStats-Flavor for creating memory contexts.
This api accepts a pointer to metric for accounting and it is inherited by all subcontexts unless overridden.
So we only needed to change context creation API where we wanted (such as TopTansactionContext, Message Context, ..)
That's quite trivial, actually.
 
Also we have fixed all those missing hash spills - albeit based on the 9.6 hash table design I think.
 
If there is interest by the community we are very willing to share.
 
Cheers
Serge
Salesforce
1234