Default setting for enable_hashagg_disk


Re: Default setting for enable_hashagg_disk

Jeff Davis-8
On Thu, 2020-06-25 at 15:56 -0400, Bruce Momjian wrote:
> It is my understanding that spill of sorts is mostly read
> sequentially,
> while hash reads are random.  Is that right?  Is that not being
> costed
> properly?

I don't think there's a major problem with the cost model, but it could
probably use some tweaking.

Hash writes are random. The hash reads should be mostly sequential (for
large partitions it will be 128-block extents, or 1MB). The cost model
assumes 50% sequential and 50% random.

Sorts are written sequentially and read randomly, but there's
prefetching to keep the reads from being too random. The cost model
assumes 75% sequential and 25% random.
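To put rough numbers on those blends (this is just illustrative
arithmetic against the page-cost GUCs, not the actual planner code):

SELECT 0.50 * current_setting('seq_page_cost')::float8
     + 0.50 * current_setting('random_page_cost')::float8 AS hashagg_page_cost,
       0.75 * current_setting('seq_page_cost')::float8
     + 0.25 * current_setting('random_page_cost')::float8 AS sort_page_cost;
-- with the defaults (1.0 and 4.0) that works out to 2.5 vs. 1.75 per page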

Overall, the IO pattern is better for Sort, but not dramatically so.
Tomas Vondra did some nice analysis here:


https://www.postgresql.org/message-id/20200525021045.dilgcsmgiu4l5jpa@development

That resulted in getting the prealloc and projection patches in.

Regards,
        Jeff Davis





Re: Default setting for enable_hashagg_disk

Alvaro Herrera-9
In reply to this post by Andres Freund
On 2020-Jun-25, Andres Freund wrote:

> > What are people doing for those cases already?  Do we have any
> > real-world queries that are a problem in PG 13 for this?
>
> I don't know about real world, but it's pretty easy to come up with
> examples.
>
> query:
> SELECT a, array_agg(b) FROM (SELECT generate_series(1, 10000)) a(a), (SELECT generate_series(1, 10000)) b(b) GROUP BY a HAVING array_length(array_agg(b), 1) = 0;
>
> work_mem = 4MB
>
> 12      18470.012 ms
> HEAD    44635.210 ms
>
> HEAD causes ~2.8GB of file IO, 12 doesn't cause any. If you're IO
> bandwidth constrained, this could be quite bad.

... however, you can pretty much get the previous performance back by
increasing work_mem.  I just tried your example here, and I get 32
seconds of runtime for work_mem 4MB, and 13.5 seconds for work_mem 1GB
(this one spills about 800 MB); if I increase that again to 1.7GB I get
no spilling and 9 seconds of runtime.  (For comparison, 12 takes 15.7
seconds regardless of work_mem).

My point here is that maybe we don't need to offer a GUC to explicitly
turn spilling off; it seems sufficient to let users change work_mem so
that spilling will naturally not occur.  Why do we need more?
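Concretely, I mean nothing fancier than this kind of session-level
tweak (a sketch only -- the value that avoids spilling obviously
depends on the query; ~1.7GB is what it took for this example here):

SET work_mem = '1700MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT a, array_agg(b)
  FROM (SELECT generate_series(1, 10000)) a(a),
       (SELECT generate_series(1, 10000)) b(b)
 GROUP BY a
HAVING array_length(array_agg(b), 1) = 0;
RESET work_mem;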

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Default setting for enable_hashagg_disk

Andres Freund
Hi,

On June 25, 2020 3:44:22 PM PDT, Alvaro Herrera <[hidden email]> wrote:

>On 2020-Jun-25, Andres Freund wrote:
>
>> > What are people doing for those cases already?  Do we have any
>> > real-world queries that are a problem in PG 13 for this?
>>
>> I don't know about real world, but it's pretty easy to come up with
>> examples.
>>
>> query:
>> SELECT a, array_agg(b) FROM (SELECT generate_series(1, 10000)) a(a),
>(SELECT generate_series(1, 10000)) b(b) GROUP BY a HAVING
>array_length(array_agg(b), 1) = 0;
>>
>> work_mem = 4MB
>>
>> 12      18470.012 ms
>> HEAD    44635.210 ms
>>
>> HEAD causes ~2.8GB of file IO, 12 doesn't cause any. If you're IO
>> bandwidth constrained, this could be quite bad.
>
>... however, you can pretty much get the previous performance back by
>increasing work_mem.  I just tried your example here, and I get 32
>seconds of runtime for work_mem 4MB, and 13.5 seconds for work_mem 1GB
>(this one spills about 800 MB); if I increase that again to 1.7GB I get
>no spilling and 9 seconds of runtime.  (For comparison, 12 takes 15.7
>seconds regardless of work_mem).
>
>My point here is that maybe we don't need to offer a GUC to explicitly
>turn spilling off; it seems sufficient to let users change work_mem so
>that spilling will naturally not occur.  Why do we need more?

That's not really a useful escape hatch, because it'll often lead to other nodes using more memory.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Default setting for enable_hashagg_disk

Alvaro Herrera-9
On 2020-Jun-25, Andres Freund wrote:

> >My point here is that maybe we don't need to offer a GUC to explicitly
> >turn spilling off; it seems sufficient to let users change work_mem so
> >that spilling will naturally not occur.  Why do we need more?
>
> That's not really a useful escape hatch, because it'll often lead to
> other nodes using more memory.

Ah -- other nodes in the same query -- you're right, that's not good.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Default setting for enable_hashagg_disk

Tomas Vondra-4
In reply to this post by Jeff Davis-8
On Thu, Jun 25, 2020 at 02:28:02PM -0700, Jeff Davis wrote:

>On Thu, 2020-06-25 at 15:56 -0400, Bruce Momjian wrote:
>> It is my understanding that spill of sorts is mostly read
>> sequentially,
>> while hash reads are random.  Is that right?  Is that not being
>> costed
>> properly?
>
>I don't think there's a major problem with the cost model, but it could
>probably use some tweaking.
>
>Hash writes are random. The hash reads should be mostly sequential (for
>large partitions it will be 128-block extents, or 1MB). The cost model
>assumes 50% sequential and 50% random.
>

The important bit here is that while the logical writes are random,
those are effectively combined in the page cache and the physical writes
end up pretty sequential. So I think the cost model is fairly reasonable.

Note: I'm judging by the iosnoop stats shared in the thread Jeff linked.

>Sorts are written sequentially and read randomly, but there's
>prefetching to keep the reads from being too random. The cost model
>assumes 75% sequential and 25% random.
>
>Overall, the IO pattern is better for Sort, but not dramatically so.
>Tomas Vondra did some nice analysis here:
>
>
>https://www.postgresql.org/message-id/20200525021045.dilgcsmgiu4l5jpa@development
>
>That resulted in getting the prealloc and projection patches in.
>
>Regards,
> Jeff Davis
>
>

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Default setting for enable_hashagg_disk

Tomas Vondra-4
In reply to this post by Andres Freund
On Thu, Jun 25, 2020 at 11:16:23AM -0700, Andres Freund wrote:

>Hi,
>
>On 2020-06-25 10:44:42 -0700, Jeff Davis wrote:
>> There are only two possible paths: HashAgg and Sort+Group, and we need
>> to pick one. If the planner expects one to spill, it is likely to
>> expect the other to spill. If one spills in the executor, then the
>> other is likely to spill, too. (I'm ignoring the case with a lot of
>> tuples and few groups because that doesn't seem relevant.)
>
>There's also ordered index scan + Group. Which will often be vastly
>better than Sort+Group, but still slower than HashAgg.
>
>
>> Imagine that there was only one path available to choose. Would you
>> suggest the same thing, that unexpected spills can exceed work_mem but
>> expected spills can't?
>
>I'm not saying what I propose is perfect, but I've yet to hear a better
>proposal. Given that there *are* different ways to implement
>aggregation, and that we use expected costs to choose, I think the
>assumed costs are relevant.
>

I share Jeff's opinion that this is quite counter-intuitive and we'll
have a hard time explaining it to users.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Default setting for enable_hashagg_disk

Tomas Vondra-4
In reply to this post by Jeff Davis-8
On Thu, Jun 25, 2020 at 09:42:33AM -0700, Jeff Davis wrote:

>On Wed, 2020-06-24 at 12:31 -0700, Andres Freund wrote:
>> nodeAgg.c already treats those separately:
>>
>> void
>> hash_agg_set_limits(double hashentrysize, uint64 input_groups,
>>                     int used_bits, Size *mem_limit,
>>                     uint64 *ngroups_limit, int *num_partitions)
>> {
>>     int     npartitions;
>>     Size    partition_mem;
>>
>>     /* if not expected to spill, use all of work_mem */
>>     if (input_groups * hashentrysize < work_mem * 1024L)
>>     {
>>         if (num_partitions != NULL)
>>             *num_partitions = 0;
>>         *mem_limit = work_mem * 1024L;
>>         *ngroups_limit = *mem_limit / hashentrysize;
>>         return;
>>     }
>
>The reason this code exists is to decide how much of work_mem to set
>aside for spilling (each spill partition needs an IO buffer).
>
>The alternative would be to fix the number of partitions before
>processing a batch, which didn't seem ideal. Or, we could just ignore
>the memory required for IO buffers, like HashJoin.
>

I think the conclusion from the recent HashJoin discussions is that not
accounting for BufFiles is an issue, and we want to fix it. So repeating
that for HashAgg would be a mistake, IMHO.

>Granted, this is an example where an underestimate can give an
>advantage, but I don't think we want to extend the concept into other
>areas.
>

I agree.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Default setting for enable_hashagg_disk

Tomas Vondra-4
In reply to this post by Bruce Momjian
On Thu, Jun 25, 2020 at 01:17:56PM -0400, Bruce Momjian wrote:

>On Thu, Jun 25, 2020 at 11:46:54AM -0400, Robert Haas wrote:
>> On Wed, Jun 24, 2020 at 7:38 PM Bruce Momjian <[hidden email]> wrote:
>> > I think my main point is that work_mem was not being honored for
>> > hash-agg before, but now that PG 13 can do it, we are again allowing
>> > work_mem not to apply in certain cases.  I am wondering if our hard
>> > limit for work_mem is the issue, and we should make that more flexible
>> > for all uses.
>>
>> I mean, that's pretty much what we're talking about here, isn't it? It
>> seems like in your previous two replies you were opposed to separating
>> the plan-type limit from the execution-time limit, but that idea is
>> precisely a way of being more flexible (and extending it to other plan
>> nodes is a way of making it more flexible for more use cases).
>
>I think it was Tom who was complaining about plan vs. execution-time
>control.
>
>> As I think you know, if you have a system where the workload varies a
>> lot, you may sometimes be using 0 copies of work_mem and at other
>> times 1000 or more copies, so the value has to be chosen
>> conservatively as a percentage of system memory, else you start
>> swapping or the OOM killer gets involved. On the other hand, some plan
>> nodes get a lot less efficient when the amount of memory available
>> falls below some threshold, so you can't just set this to a tiny value
>> and forget about it. Because the first problem is so bad, most people
>> set the value relatively conservatively and just live with the
>> performance consequences. But this also means that they have memory
>> left over most of the time, so the idea of letting a node burst above
>> its work_mem allocation when something unexpected happens isn't crazy:
>> as long as only a few nodes do that here and there, rather than, say,
>> all the nodes doing it all at the same time, it's actually fine. If we
>> had a smarter system that could dole out more work_mem to nodes that
>> would really benefit from it and less to nodes where it isn't likely
>> to make much difference, that would be similar in spirit but even
>> better.
>
>I think the issue is that in PG 13 work_mem controls sorts and hashes
>with a new hard limit for hash aggregation:
>
> https://www.postgresql.org/docs/12/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY
>
> Sort operations are used for ORDER BY, DISTINCT, and merge joins. Hash
> tables are used in hash joins, hash-based aggregation, and hash-based
> processing of IN subqueries.
>
>In pre-PG 13, we "controlled" it by avoiding hash-based aggregation if
>it was expected to exceed work_mem, but if we assumed it would be less
>than work_mem and it was more, we exceeded the work_mem allocation for
>that node.  In PG 13, we "limit" memory to work_mem and spill to disk
>if we exceed it.
>
>We should really have always documented that hash agg could exceed
>work_mem due to misestimation, and if we add a hash_agg work_mem
>misestimation bypass setting we should document that setting in work_mem
>as well.
>

I don't think that would change anything, really. For the users the
consequences would still be exactly the same, and they wouldn't even be
in a position to check whether they are affected.

So just documenting that hashagg does not respect work_mem at runtime
would be nice, but it would not make any difference for v13, just like
documenting a bug is not really the same thing as fixing it.

>But then the question is why do we allow this bypass only for hash agg?
>Should work_mem have settings for ORDER BY, merge join, hash join, and
>hash agg, e.g.:
>
> work_mem = 'order_by=10MB, hash_join=20MB, hash_agg=100MB'
>
>Yeah, crazy syntax, but you get the idea.  I understand some nodes are
>more sensitive to disk spill than others, so shouldn't we be controlling
>this at the work_mem level, rather than for a specific node type like
>hash agg?  We could allow for over-allocation of hash agg work_mem on
>misestimation by splitting up the hash agg values:
>
> work_mem = 'order_by=10MB, hash_join=20MB, hash_agg=100MB hash_agg_max=200MB'
>
>but _avoiding_ hash agg if it is estimated to exceed work_mem and spill
>to disk is not something to logically control at the work_mem level,
>which leads to something like David Rowley suggested, but with different
>names:
>
> enable_hashagg = on | soft | avoid | off
>
>where 'on' and 'off' are the current PG 13 behavior, 'soft' means to
>treat work_mem as a soft limit and allow exceeding it on misestimation,
>and 'avoid' means to avoid hash agg if it is estimated to exceed
>work_mem.  Neither 'soft' nor 'avoid' spills to disk.
>
>David's original terms of "trynospill" and "neverspill" were focused on
>spilling, not on its interaction with work_mem, and I found that
>confusing.
>
>Frankly, if it took me this long to get my head around this, I am
>unclear how many people will understand this tuning feature enough to
>actually use it.
>

Yeah. I agree with Andres that this may be a real issue, and that adding
some sort of "escape hatch" for v13 would be good. But I'm not convinced
adding a whole lot of new memory limits for every node that might spill
is the way to go. What exactly would be our tuning advice to users? Of
course, we could keep it set to work_mem by default, but we all know
engineers - we can't resist tuning a knob when we get one.

I'm not saying it's not beneficial to use different limits for different
nodes. Some nodes are less sensitive to the size (e.g. sorting often
gets faster with smaller work_mem). But I think we should instead have a
per-session limit, and the planner should "distribute" the memory to
different nodes. It's a hard problem, of course.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Default setting for enable_hashagg_disk

Bruce Momjian
On Fri, Jun 26, 2020 at 01:53:57AM +0200, Tomas Vondra wrote:
> I'm not saying it's not beneficial to use different limits for different
> nodes. Some nodes are less sensitive to the size (e.g. sorting often
> gets faster with smaller work_mem). But I think we should instead have a
> per-session limit, and the planner should "distribute" the memory to
> different nodes. It's a hard problem, of course.

Yeah, I am actually confused why we haven't developed a global memory
allocation strategy and continue to use per-session work_mem.

--
  Bruce Momjian  <[hidden email]>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Default setting for enable_hashagg_disk

Tomas Vondra-4
On Fri, Jun 26, 2020 at 12:02:10AM -0400, Bruce Momjian wrote:

>On Fri, Jun 26, 2020 at 01:53:57AM +0200, Tomas Vondra wrote:
>> I'm not saying it's not beneficial to use different limits for different
>> nodes. Some nodes are less sensitive to the size (e.g. sorting often
>> gets faster with smaller work_mem). But I think we should instead have a
>> per-session limit, and the planner should "distribute" the memory to
>> different nodes. It's a hard problem, of course.
>
>Yeah, I am actually confused why we haven't developed a global memory
>allocation strategy and continue to use per-session work_mem.
>

I think it's a pretty hard problem, actually. One of the reasons is that
the costing of a node depends on the amount of memory available to the
node, but as we're building the plan bottom-up, we have no information
about the nodes above us. So we don't know if there are operations that
will need memory, how sensitive they are, etc.

And so far the per-node limit served us pretty well, I think. So I'm not
very confused we don't have the per-session limit yet, TBH.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Default setting for enable_hashagg_disk

Bruce Momjian
On Fri, Jun 26, 2020 at 04:44:14PM +0200, Tomas Vondra wrote:

> On Fri, Jun 26, 2020 at 12:02:10AM -0400, Bruce Momjian wrote:
> > On Fri, Jun 26, 2020 at 01:53:57AM +0200, Tomas Vondra wrote:
> > > I'm not saying it's not beneficial to use different limits for different
> > > nodes. Some nodes are less sensitive to the size (e.g. sorting often
> > > gets faster with smaller work_mem). But I think we should instead have a
> > > per-session limit, and the planner should "distribute" the memory to
> > > different nodes. It's a hard problem, of course.
> >
> > Yeah, I am actually confused why we haven't developed a global memory
> > allocation strategy and continue to use per-session work_mem.
> >
>
> I think it's a pretty hard problem, actually. One of the reasons is that

Yes, it is a hard problem, because it is balancing memory for shared
buffers, work_mem, and kernel buffers:

        https://momjian.us/main/blogs/pgblog/2018.html#December_7_2018

I think the big problem is that the work_mem value is not one value but
a floating value that differs per query and per session, and depends on
concurrent session activity.

> the costing of a node depends on the amount of memory available to the
> node, but as we're building the plan bottom-up, we have no information
> about the nodes above us. So we don't know if there are operations that
> will need memory, how sensitive they are, etc.
>
> And so far the per-node limit served us pretty well, I think. So I'm not
> very confused we don't have the per-session limit yet, TBH.

I was thinking more of being able to allocate a single value to be
shared by all active sessions.

Also, doesn't this blog entry show that spilling to disk for ORDER
BY is similarly slow compared to hash aggs?

        https://momjian.us/main/blogs/pgblog/2012.html#February_2_2012

--
  Bruce Momjian  <[hidden email]>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Default setting for enable_hashagg_disk

Tomas Vondra-4
On Fri, Jun 26, 2020 at 12:37:26PM -0400, Bruce Momjian wrote:

>On Fri, Jun 26, 2020 at 04:44:14PM +0200, Tomas Vondra wrote:
>> On Fri, Jun 26, 2020 at 12:02:10AM -0400, Bruce Momjian wrote:
>> > On Fri, Jun 26, 2020 at 01:53:57AM +0200, Tomas Vondra wrote:
>> > > I'm not saying it's not beneficial to use different limits for different
>> > > nodes. Some nodes are less sensitive to the size (e.g. sorting often
>> > > gets faster with smaller work_mem). But I think we should instead have a
>> > > per-session limit, and the planner should "distribute" the memory to
>> > > different nodes. It's a hard problem, of course.
>> >
>> > Yeah, I am actually confused why we haven't developed a global memory
>> > allocation strategy and continue to use per-session work_mem.
>> >
>>
>> I think it's a pretty hard problem, actually. One of the reasons is that
>
>Yes, it is a hard problem, because it is balancing memory for shared
>buffers, work_mem, and kernel buffers:
>
> https://momjian.us/main/blogs/pgblog/2018.html#December_7_2018
>
>I think the big problem is that the work_mem value is not one value but
>a floating value that differs per query and per session, and depends on
>concurrent session activity.
>
>> the costing of a node depends on the amount of memory available to the
>> node, but as we're building the plan bottom-up, we have no information
>> about the nodes above us. So we don't know if there are operations that
>> will need memory, how sensitive they are, etc.
>>
>> And so far the per-node limit served us pretty well, I think. So I'm not
>> very confused we don't have the per-session limit yet, TBH.
>
>I was thinking more of being able to allocate a single value to be
>shared by all active sessions.
>

Not sure I understand. What "single value" do you mean?

Wasn't the idea to replace work_mem with something like query_mem?
That'd be nice, but I think it's inherently circular - we don't know how
to distribute this to different nodes until we know which nodes will
need a buffer, but the buffer size is important for costing (so we need
it when constructing the paths).

Plus then there's the question whether all nodes should get the same
fraction, or less sensitive nodes should get smaller chunks, etc.
Ultimately this would be based on costing too, I think, but it makes it
so much more complex ...

>Also, doesn't this blog entry show that spilling to disk for ORDER
>BY is similarly slow compared to hash aggs?
>
> https://momjian.us/main/blogs/pgblog/2012.html#February_2_2012
>

The post does not mention hashagg at all, so I'm not sure how it could
show that? But I think you're right the spilling itself is not that far
away, in most cases (thanks to the recent fixes made by Jeff).


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Default setting for enable_hashagg_disk

Peter Geoghegan-4
In reply to this post by Andres Freund
On Thu, Jun 25, 2020 at 1:36 PM Andres Freund <[hidden email]> wrote:

> 12      28164.865 ms
>
> fast ssd:
> HEAD    92520.680 ms
>
> magnetic:
> HEAD    183968.538 ms
>
> (no reads, there's plenty enough memory. Just writes because the age /
> amount thresholds for dirty data are reached)
>
> In the magnetic case we're IO bottlenecked nearly the whole time.

I agree with almost everything you've said on this thread, but at the
same time I question the emphasis on I/O here. You've shown that
spinning rust is about twice as slow as a fast SSD here. Fair enough,
but to me the real story is that spilling is clearly a lot slower in
general, regardless of how fast the storage subsystem happens to be (I
wonder how fast it is with a ramdisk). To me, it makes more sense to
think of the problem here as the fact that v13 will *not* do
aggregation using the fast strategy (i.e. in-memory) -- as opposed to
the problem being that v13 does the aggregation using the slow
strategy (which is assumed to be slow because it involves I/O instead
of memory buffers).

I get approximately the same query runtimes with your "make the
transition state a bit bigger" test case. With "set enable_hashagg =
off", I get a group aggregate + sort. It spills to disk, even with
work_mem = '15GB' -- leaving 4 runs to merge at the end. That takes
63702.992 ms on v13. But if I reduce the amount of work_mem radically,
to only 16MB (a x960 decrease!), then the run time only increases by
~30% -- it's only 83123.926 ms. So we're talking about a ~200%
increase (for hash aggregate) versus a ~30% increase (for groupagg +
sort) on fast SSDs.
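For anyone wanting to reproduce it, the session setup is roughly this
sketch (the query itself is Andres's "bigger transition state" test
case from upthread, not repeated here):

SET enable_hashagg = off;   -- force the groupagg + sort plan
SET work_mem = '15GB';      -- still spills; 4 runs to merge at the end
-- EXPLAIN ANALYZE <test query>;
SET work_mem = '16MB';      -- a ~960x decrease
-- EXPLAIN ANALYZE <test query>;  -- only ~30% slower here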

Changing the cost of I/O in the context of hashaggregate seems like it
misses the point. Jeff recently said "Overall, the IO pattern is
better for Sort, but not dramatically so". Whatever the IO pattern may
be, I think that it's pretty clear that the performance
characteristics of hash aggregation with limited memory are very
different to groupaggregate + sort, at least when only a fraction of
the optimal amount of memory we'd like is available. It's true that
hash aggregate was weird among plan nodes in v12, and is now in some
sense less weird among plan nodes. And yet we have a new problem now
-- so where does that leave that whole "weirdness" framing? ISTM that
the work_mem framework was and is the main problem. We seem to have
lost a crutch that ameliorated the problem before now, even though
that amelioration was kind of an accident. Or a thing that user apps
evolved to rely on.

--
Peter Geoghegan



Re: Default setting for enable_hashagg_disk

Bruce Momjian
In reply to this post by Tomas Vondra-4
On Fri, Jun 26, 2020 at 07:45:13PM +0200, Tomas Vondra wrote:
> On Fri, Jun 26, 2020 at 12:37:26PM -0400, Bruce Momjian wrote:
> > I was thinking more of being able to allocate a single value to be
> > shared by all active sessions.
>
> Not sure I understand. What "single value" do you mean?

I was thinking of a full-cluster work_mem maximum allocation that could
be given to various backends that request it.

Imagine we set the cluster-wide total of work_mem to 1GB.  If a session
asks for 100MB, if there are no other active sessions, it can grant the
entire 100MB.  If there are other sessions running, and 500MB has
already been allocated, maybe it is only given an active per-node
work_mem of 50MB.  As the amount of unallocated cluster-wide work_mem
gets smaller, requests are granted smaller actual allocations.

What we do now makes little sense, because we might have lots of free
memory, but we force nodes to spill to disk when they exceed a fixed
work_mem.  I realize this is very imprecise, because you don't know what
future work_mem requests are coming, or how long until existing
allocations are freed, but it seems it would have to be better than what
we do now.

> Wasn't the idea to replace work_mem with something like query_mem?
> That'd be nice, but I think it's inherently circular - we don't know how
> to distribute this to different nodes until we know which nodes will
> need a buffer, but the buffer size is important for costing (so we need
> it when constructing the paths).
>
> Plus then there's the question whether all nodes should get the same
> fraction, or less sensitive nodes should get smaller chunks, etc.
> Ultimately this would be based on costing too, I think, but it makes it
> so much more complex ...

Since work_mem affects the optimizer choices, I can imagine it getting
complex since nodes would have to ask the global work_mem allocator how
much memory it _might_ get, but then ask for final work_mem during
execution, and they might differ.  Still, our spill costs are so high
for so many node types, that reducing spills seems like it would be a
win, even if it sometimes causes poorer plans.

> > Also, doesn't this blog entry show that spilling to disk for ORDER
> > BY is similarly slow compared to hash aggs?
> >
> > https://momjian.us/main/blogs/pgblog/2012.html#February_2_2012
>
> The post does not mention hashagg at all, so I'm not sure how it could
> show that? But I think you're right the spilling itself is not that far
> away, in most cases (thanks to the recent fixes made by Jeff).

Yeah, I was just measuring ORDER BY spill, but it seems to be a similar
overhead to hashagg spill, which is being singled out in this discussion
as particularly expensive, and I am questioning that.

--
  Bruce Momjian  <[hidden email]>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Default setting for enable_hashagg_disk

Peter Geoghegan-4
On Fri, Jun 26, 2020 at 4:00 PM Bruce Momjian <[hidden email]> wrote:
> Imagine we set the cluster-wide total of work_mem to 1GB.  If a session
> asks for 100MB, if there are no other active sessions, it can grant the
> entire 100MB.  If there are other sessions running, and 500MB has
> already been allocated, maybe it is only given an active per-node
> work_mem of 50MB.  As the amount of unallocated cluster-wide work_mem
> gets smaller, requests are granted smaller actual allocations.

I think that that's the right approach long term. But right now the
DBA has no way to give hash-based nodes more memory, even though it's
clear that that's where it's truly needed in most cases, across almost
all workloads. I think that that's the really glaring problem.

This is just the intrinsic nature of hash-based aggregation and hash
join vs sort-based aggregation and merge join (roughly speaking). It's
much more valuable to be able to do hash-based aggregation in one
pass, especially in cases where hashing already did particularly well
in Postgres v12.

> What we do now makes little sense, because we might have lots of free
> memory, but we force nodes to spill to disk when they exceed a fixed
> work_mem.  I realize this is very imprecise, because you don't know what
> future work_mem requests are coming, or how long until existing
> allocations are freed, but it seems it would have to be better than what
> we do now.

Postgres 13 made hash aggregate respect work_mem. Perhaps it would
have made more sense to teach work_mem to respect hash aggregate,
though.

Hash aggregate cannot consume an unbounded amount of memory in v13,
since the old behavior was clearly unreasonable. Which is great. But
it may be even more unreasonable to force users to conservatively set
the limit on the size of the hash table in an artificial, generic way.

> Since work_mem affects the optimizer choices, I can imagine it getting
> complex since nodes would have to ask the global work_mem allocator how
> much memory it _might_ get, but then ask for final work_mem during
> execution, and they might differ.  Still, our spill costs are so high
> for so many node types, that reducing spills seems like it would be a
> win, even if it sometimes causes poorer plans.

I don't think it's really about the spill costs, at least in one
important sense. If performing a hash aggregate in memory uses twice
as much memory as spilling (with either sorting or hashing), but the
operation completes in one third the time, you have actually saved
memory in the aggregate (no pun intended). Also, the query is 3x
faster, which is a nice bonus! I don't think that this kind of
scenario is rare.
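To spell out that arithmetic (the numbers are purely illustrative):

-- memory-seconds consumed: in-memory hash agg vs. a spilling plan
SELECT 2.0 * (1.0 / 3.0) AS in_memory_mem_seconds,  -- 2x the memory, 1/3 the time
       1.0 * 1.0         AS spilled_mem_seconds;    -- baseline memory, full time
-- ~0.67 vs. 1.0, so the "greedy" in-memory plan uses less memory overall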

--
Peter Geoghegan



Re: Default setting for enable_hashagg_disk

Bruce Momjian
In reply to this post by Peter Geoghegan-4
On Fri, Jun 26, 2020 at 01:53:05PM -0700, Peter Geoghegan wrote:

> On Thu, Jun 25, 2020 at 1:36 PM Andres Freund <[hidden email]> wrote:
> > 12      28164.865 ms
> >
> > fast ssd:
> > HEAD    92520.680 ms
> >
> > magnetic:
> > HEAD    183968.538 ms
> >
> > (no reads, there's plenty enough memory. Just writes because the age /
> > amount thresholds for dirty data are reached)
> >
> > In the magnetic case we're IO bottlenecked nearly the whole time.
>
> I agree with almost everything you've said on this thread, but at the
> same time I question the emphasis on I/O here. You've shown that
> spinning rust is about twice as slow as a fast SSD here. Fair enough,
> but to me the real story is that spilling is clearly a lot slower in
> general, regardless of how fast the storage subsystem happens to be (I
> wonder how fast it is with a ramdisk). To me, it makes more sense to

This blog entry shows ORDER BY using ram disk, SSD, and magnetic:

        https://momjian.us/main/blogs/pgblog/2012.html#February_2_2012

It is from 2012, but I can re-run the test if you want.

--
  Bruce Momjian  <[hidden email]>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Default setting for enable_hashagg_disk

Tomas Vondra-4
In reply to this post by Bruce Momjian
On Fri, Jun 26, 2020 at 07:00:20PM -0400, Bruce Momjian wrote:

>On Fri, Jun 26, 2020 at 07:45:13PM +0200, Tomas Vondra wrote:
>> On Fri, Jun 26, 2020 at 12:37:26PM -0400, Bruce Momjian wrote:
>> > I was thinking more of being able to allocate a single value to be
>> > shared by all active sessions.
>>
>> Not sure I understand. What "single value" do you mean?
>
>I was thinking of a full-cluster work_mem maximum allocation that could
>be given to various backends that request it.
>
>Imagine we set the cluster-wide total of work_mem to 1GB.  If a session
>asks for 100MB, if there are no other active sessions, it can grant the
>entire 100MB.  If there are other sessions running, and 500MB has
>already been allocated, maybe it is only given an active per-node
>work_mem of 50MB.  As the amount of unallocated cluster-wide work_mem
>gets smaller, requests are granted smaller actual allocations.
>
>What we do now makes little sense, because we might have lots of free
>memory, but we force nodes to spill to disk when they exceed a fixed
>work_mem.  I realize this is very imprecise, because you don't know what
>future work_mem requests are coming, or how long until existing
>allocations are freed, but it seems it would have to be better than what
>we do now.
>
>> Wasn't the idea to replace work_mem with something like query_mem?
>> That'd be nice, but I think it's inherently circular - we don't know how
>> to distribute this to different nodes until we know which nodes will
>> need a buffer, but the buffer size is important for costing (so we need
>> it when constructing the paths).
>>
>> Plus then there's the question whether all nodes should get the same
>> fraction, or less sensitive nodes should get smaller chunks, etc.
>> Ultimately this would be based on costing too, I think, but it makes it
>> so much more complex ...
>
>Since work_mem affects the optimizer choices, I can imagine it getting
>complex since nodes would have to ask the global work_mem allocator how
>much memory it _might_ get, but then ask for final work_mem during
>execution, and they might differ.  Still, our spill costs are so high
>for so many node types, that reducing spills seems like it would be a
>win, even if it sometimes causes poorer plans.
>

I may not understand what you mean by "poorer plans" here, but I find it
hard to accept that reducing spills is generally worth poorer plans.

I agree larger work_mem for hashagg (and thus less spilling) may mean
lower work_mem for some other nodes that are less sensitive to this.
But I think this needs to be formulated as a cost-based decision,
although I don't know how to do that for the reasons I explained before
(bottom-up plan construction vs. distributing the memory budget).

FWIW some databases already do something like this - SQL Server has
something called "memory grant" which I think mostly does what you
described here.


>> > Also, doesn't this blog entry show that spilling to disk for ORDER
>> > BY is similarly slow compared to hash aggs?
>> >
>> > https://momjian.us/main/blogs/pgblog/2012.html#February_2_2012
>>
>> The post does not mention hashagg at all, so I'm not sure how it could
>> show that? But I think you're right the spilling itself is not that far
>> away, in most cases (thanks to the recent fixes made by Jeff).
>
>Yeah, I was just measuring ORDER BY spill, but it seems to be a similar
>overhead to hashagg spill, which is being singled out in this discussion
>as particularly expensive, and I am questioning that.
>

The difference between sort and hashagg spills is that for sorts there
is no behavior change. Plans that did (not) spill in v12 will behave the
same way on v13, modulo some random perturbation. For hashagg that's not
the case - some queries that did not spill before will spill now.

So even if the hashagg spills are roughly equal to sort spills, both are
significantly more expensive than not spilling.
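At least on v13 it's easy to spot when that happens - EXPLAIN ANALYZE on
a HashAggregate now reports something like "Batches: N  Memory Usage:
...kB  Disk Usage: ...kB" when it spills. A trivial way to see it with
the default work_mem (just a sketch; any high-cardinality GROUP BY will
do, and you may need to disable sorting to get the hash plan):

SET enable_sort = off;   -- only to make sure the demo picks HashAggregate
EXPLAIN (ANALYZE, COSTS OFF)
SELECT g, count(*)
  FROM generate_series(1, 5000000) AS s(g)
 GROUP BY g;
RESET enable_sort;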


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Default setting for enable_hashagg_disk

Bruce Momjian
On Sat, Jun 27, 2020 at 01:58:50AM +0200, Tomas Vondra wrote:

> > Since work_mem affects the optimizer choices, I can imagine it getting
> > complex since nodes would have to ask the global work_mem allocator how
> > much memory it _might_ get, but then ask for final work_mem during
> > execution, and they might differ.  Still, our spill costs are so high
> > for so many node types, that reducing spills seems like it would be a
> > win, even if it sometimes causes poorer plans.
> >
>
> I may not understand what you mean by "poorer plans" here, but I find it
> hard to accept that reducing spills is generally worth poorer plans.

We might cost a plan based on a work_mem that the global allocator
thinks it will give us when we reach the executor, but that might
change by the time we are in the executor.  We could code it so that an
optimizer request is always honored in the executor, but prepared plans
would be a problem, or perhaps already are if you prepare a plan and
change work_mem before EXECUTE.

> I agree larger work_mem for hashagg (and thus less spilling) may mean
> lower work_mem for some other nodes that are less sensitive to this.
> But I think this needs to be formulated as a cost-based decision,
> although I don't know how to do that for the reasons I explained before
> (bottom-up plan construction vs. distributing the memory budget).
>
> FWIW some databases already do something like this - SQL Server has
> something called "memory grant" which I think mostly does what you
> described here.

Yep, something like that.

> > > > Also, doesn't this blog entry show that spilling to disk for ORDER
> > > > BY is similarly slow compared to hash aggs?
> > > >
> > > > https://momjian.us/main/blogs/pgblog/2012.html#February_2_2012
> > >
> > > The post does not mention hashagg at all, so I'm not sure how it could
> > > show that? But I think you're right the spilling itself is not that far
> > > away, in most cases (thanks to the recent fixes made by Jeff).
> >
> > Yeah, I was just measuring ORDER BY spill, but it seems to be a similar
> > overhead to hashagg spill, which is being singled out in this discussion
> > as particularly expensive, and I am questioning that.
> >
>
> The difference between sort and hashagg spills is that for sorts there
> is no behavior change. Plans that did (not) spill in v12 will behave the
> same way on v13, modulo some random perturbation. For hashagg that's not
> the case - some queries that did not spill before will spill now.

Well, my point is that we already had ORDER BY problems, and if hash agg
now has them too in PG 13, I am fine with that.  We don't guarantee no
problems in major versions.  If we want to add a general knob that says,
"Hey allow this node to exceed work_mem by X%," I don't see the point
--- just increase work_mem, or have different work_mem settings for
different node types, as I outlined previously.

> So even if the hashagg spills are roughly equal to sort spills, both are
> significantly more expensive than not spilling.

Yes, but that means we need a more general fix and worrying about hash
agg is not addressing the core issue.

--
  Bruce Momjian  <[hidden email]>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Default setting for enable_hashagg_disk

Peter Geoghegan-4
In reply to this post by Tomas Vondra-4
On Fri, Jun 26, 2020 at 4:59 PM Tomas Vondra
<[hidden email]> wrote:
> I agree larger work_mem for hashagg (and thus less spilling) may mean
> lower work_mem for some other nodes that are less sensitive to this.
> But I think this needs to be formulated as a cost-based decision,
> although I don't know how to do that for the reasons I explained before
> (bottom-up plan construction vs. distributing the memory budget).

Why do you think that it needs to be formulated as a cost-based
decision? That's probably true of a scheme that allocates memory very
intelligently, but what about an approach that's slightly better than
work_mem?

What problems do you foresee (if any) with adding a hash_mem GUC that
gets used for both planning and execution for hash aggregate and hash
join nodes, in about the same way as work_mem is now?
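To be concrete, I have in mind usage no more sophisticated than this
(hash_mem doesn't exist today; the name and semantics here are only a
sketch of the idea):

SET work_mem = '64MB';    -- sorts and everything else keep the low limit
SET hash_mem = '512MB';   -- hash agg / hash join plan and execute with this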

> FWIW some databases already do something like this - SQL Server has
> something called "memory grant" which I think mostly does what you
> described here.

Same is true of Oracle. But Oracle also has simple work_mem-like
settings for sorting and hashing. People don't really use them
anymore, but apparently it was once common for the DBA to explicitly
give over more memory to hashing -- much like the hash_mem setting I
asked about. IIRC the same is true of DB2.

> The difference between sort and hashagg spills is that for sorts there
> is no behavior change. Plans that did (not) spill in v12 will behave the
> same way on v13, modulo some random perturbation. For hashagg that's not
> the case - some queries that did not spill before will spill now.
>
> So even if the hashagg spills are roughly equal to sort spills, both are
> significantly more expensive than not spilling.

Just to make sure we're on the same page: both are significantly more
expensive than a hash aggregate not spilling *specifically*. OTOH, a
group aggregate may not be much slower when it spills compared to an
in-memory sort group aggregate. It may even be noticeably faster, due
to caching effects, as you mentioned at one point upthread.

This is the property that makes hash aggregate special, and justifies
giving it more memory than other nodes on a system-wide basis (the
same thing applies to hash join). This could even work as a multiplier
of work_mem.

--
Peter Geoghegan



Re: Default setting for enable_hashagg_disk

akapila
In reply to this post by Robert Haas
On Thu, Jun 25, 2020 at 12:59 AM Robert Haas <[hidden email]> wrote:

>
> So, I don't think we can wire in a constant like 10x. That's really
> unprincipled and I think it's a bad idea. What we could do, though, is
> replace the existing Boolean-valued GUC with a new GUC that controls
> the size at which the aggregate spills. The default could be -1,
> meaning work_mem, but a user could configure a larger value if desired
> (presumably, we would just treat a value smaller than work_mem as
> work_mem, and document the same).
>
> I think that's actually pretty appealing. Separating the memory we
> plan to use from the memory we're willing to use before spilling seems
> like a good idea in general, and I think we should probably also do it
> in other places - like sorts.
>

+1.  I also think a GUC along these lines could help not only the
problem being discussed here but other cases as well.  However, I think
the real question is: do we want to design/implement it for PG13?  It
seems to me that at this stage we don't have a clear understanding of
what percentage of real-world cases will be impacted by the new
behavior of hash aggregates.  Providing some mechanism as a safety net
to avoid problems that users might face is not a bad idea, but what if
we wait and see the real impact of this?  Is it too bad to provide a
GUC later in a back-branch if we see users face such problems quite
often?  I think the advantage of delaying it is that we might see some
real problems (like where hash aggregate is not a good choice) which
can be fixed via the costing model.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

