Default setting for enable_hashagg_disk

Re: Default setting for enable_hashagg_disk

Tomas Vondra-4
On Wed, Jun 24, 2020 at 12:36:24PM -0700, Andres Freund wrote:

>Hi,
>
>On 2020-06-24 15:28:47 -0400, Robert Haas wrote:
>> On Wed, Jun 24, 2020 at 3:14 PM Andres Freund <[hidden email]> wrote:
>> > FWIW, my gut feeling is that we'll end up having to separate the
>> > "execution time" spilling from using plain work mem, because it'll
>> > trigger spilling too often. E.g. if the plan isn't expected to spill,
>> > only spill at 10 x work_mem or something like that.  Or we'll need
>> > better management of temp file data when there's plenty of memory
>> > available.
>>
>> So, I don't think we can wire in a constant like 10x. That's really
>> unprincipled and I think it's a bad idea. What we could do, though, is
>> replace the existing Boolean-valued GUC with a new GUC that controls
>> the size at which the aggregate spills. The default could be -1,
>> meaning work_mem, but a user could configure a larger value if desired
>> (presumably, we would just treat a value smaller than work_mem as
>> work_mem, and document the same).
>
>To be clear, I wasn't actually thinking of hard-coding 10x, but having a
>config option that specifies a factor of work_mem. A factor seems better
>because it'll work reasonably for different values of work_mem, whereas
>a concrete size wouldn't.
>

I'm not quite convinced we need/should introduce a new memory limit.
It's true that keeping it equal to work_mem by default makes this less of
an issue, but it's still another moving part that users will need to
learn how to use.

But if we do introduce a new limit, I very much think it should be a
plain limit, not a factor. That just makes it even more complicated, and
we don't have any such limit yet.
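
For illustration only, here is a rough C sketch contrasting the two shapes
being discussed; the GUC names (hashagg_spill_limit_kB, hashagg_spill_mult)
are hypothetical, not anything that exists:

    #include <stddef.h>

    /*
     * Hypothetical sketch only -- neither GUC below exists.  It just
     * contrasts a plain size limit with a multiplier of work_mem as the
     * point at which hash aggregation starts spilling.
     */
    extern int    work_mem;                       /* in kB, as in PostgreSQL */
    static int    hashagg_spill_limit_kB = -1;    /* plain limit; -1 = use work_mem */
    static double hashagg_spill_mult = 1.0;       /* factor of work_mem */

    static size_t
    hashagg_spill_threshold(void)
    {
        if (hashagg_spill_limit_kB >= 0)
            return (size_t) hashagg_spill_limit_kB * 1024;   /* plain limit */

        /* factor of work_mem, e.g. 10.0 for "10 x work_mem" */
        return (size_t) ((double) work_mem * 1024.0 * hashagg_spill_mult);
    }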

>
>> I think that's actually pretty appealing. Separating the memory we
>> plan to use from the memory we're willing to use before spilling seems
>> like a good idea in general, and I think we should probably also do it
>> in other places - like sorts.
>
>Indeed. And then perhaps we could eventually add some reporting /
>monitoring infrastructure for the cases where plan time and execution
>time memory estimate/usage widely differs.
>

I wouldn't mind something like that in general - not just for hashagg,
but for various other nodes.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Default setting for enable_hashagg_disk

Bruce Momjian
On Wed, Jun 24, 2020 at 11:02:10PM +0200, Tomas Vondra wrote:
> > Indeed. And then perhaps we could eventually add some reporting /
> > monitoring infrastructure for the cases where plan time and execution
> > time memory estimate/usage widely differs.
> >
>
> I wouldn't mind something like that in general - not just for hashagg,
> but for various other nodes.

Well, other than worrying about problems with pre-13 queries, how is
this different from any other spill to disk when we exceed work_mem,
like sorts for merge join?

--
  Bruce Momjian  <[hidden email]>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Default setting for enable_hashagg_disk

Bruce Momjian
In reply to this post by Andres Freund
On Wed, Jun 24, 2020 at 12:19:00PM -0700, Andres Freund wrote:

> Hi,
>
> On 2020-06-24 13:12:03 -0400, Bruce Momjian wrote:
> > Well, my point is that merge join works that way, and no one has needed
> > a knob to avoid mergejoin if it is going to spill to disk.  If they are
> > adjusting work_mem to prevent spill of merge join, they can do the same
> > for hash agg.  We just need to document this in the release notes.
>
> I don't think this is comparable. For starters, the IO indirectly
> triggered by mergejoin actually leads to plenty of people just straight
> out disabling it. For lots of workloads there's never a valid reason to
> use a mergejoin (and often the planner will never choose one). Secondly,
> the planner has better information about estimating the memory usage for
> the to-be-sorted data than it has about the size of the transition
> values. And lastly, there's a difference between a long-existing cause
> of bad IO behaviour and one that suddenly kicks in after a major
> version upgrade, to which there's no escape hatch (it's rarely realistic
> to disable hash aggs, in contrast to merge joins).

Well, this sounds like an issue of degree, rather than kind.  It sure
sounds like "ignore work_mem for this join type, but not the other".

--
  Bruce Momjian  <[hidden email]>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Default setting for enable_hashagg_disk

Bruce Momjian
On Wed, Jun 24, 2020 at 07:18:10PM -0400, Bruce Momjian wrote:

> On Wed, Jun 24, 2020 at 12:19:00PM -0700, Andres Freund wrote:
> > Hi,
> >
> > On 2020-06-24 13:12:03 -0400, Bruce Momjian wrote:
> > > Well, my point is that merge join works that way, and no one has needed
> > > a knob to avoid mergejoin if it is going to spill to disk.  If they are
> > > adjusting work_mem to prevent spill of merge join, they can do the same
> > > for hash agg.  We just need to document this in the release notes.
> >
> > I don't think this is comparable. For starters, the IO indirectly
> > triggered by mergejoin actually leads to plenty of people just straight
> > out disabling it. For lots of workloads there's never a valid reason to
> > use a mergejoin (and often the planner will never choose one). Secondly,
> > the planner has better information about estimating the memory usage for
> > the to-be-sorted data than it has about the size of the transition
> > values. And lastly, there's a difference between a long-existing cause
> > of bad IO behaviour and one that suddenly kicks in after a major
> > version upgrade, to which there's no escape hatch (it's rarely realistic
> > to disable hash aggs, in contrast to merge joins).
>
> Well, this sounds like an issue of degree, rather than kind.  It sure
> sounds like "ignore work_mem for this join type, but not the other".

I think my main point is that work_mem was not being honored for
hash-agg before, but now that PG 13 can do it, we are again allowing
work_mem not to apply in certain cases.  I am wondering if our hard
limit for work_mem is the issue, and we should make that more flexible
for all uses.

--
  Bruce Momjian  <[hidden email]>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Default setting for enable_hashagg_disk

Robert Haas
On Wed, Jun 24, 2020 at 7:38 PM Bruce Momjian <[hidden email]> wrote:
> I think my main point is that work_mem was not being honored for
> hash-agg before, but now that PG 13 can do it, we are again allowing
> work_mem not to apply in certain cases.  I am wondering if our hard
> limit for work_mem is the issue, and we should make that more flexible
> for all uses.

I mean, that's pretty much what we're talking about here, isn't it? It
seems like in your previous two replies you were opposed to separating
the plan-type limit from the execution-time limit, but that idea is
precisely a way of being more flexible (and extending it to other plan
nodes is a way of making it more flexible for more use cases).

As I think you know, if you have a system where the workload varies a
lot, you may sometimes be using 0 copies of work_mem and at other
times 1000 or more copies, so the value has to be chosen
conservatively as a percentage of system memory, else you start
swapping or the OOM killer gets involved. On the other hand, some plan
nodes get a lot less efficient when the amount of memory available
falls below some threshold, so you can't just set this to a tiny value
and forget about it. Because the first problem is so bad, most people
set the value relatively conservatively and just live with the
performance consequences. But this also means that they have memory
left over most of the time, so the idea of letting a node burst above
its work_mem allocation when something unexpected happens isn't crazy:
as long as only a few nodes do that here and there, rather than, say,
all the nodes doing it all at the same time, it's actually fine. If we
had a smarter system that could dole out more work_mem to nodes that
would really benefit from it and less to nodes where it isn't likely
to make much difference, that would be similar in spirit but even
better.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Default setting for enable_hashagg_disk

Jeff Davis-8
In reply to this post by Robert Haas
On Wed, 2020-06-24 at 15:28 -0400, Robert Haas wrote:
> On Wed, Jun 24, 2020 at 3:14 PM Andres Freund <[hidden email]>
> wrote:
> > FWIW, my gut feeling is that we'll end up having to separate the
> > "execution time" spilling from using plain work mem, because it'll
> > trigger spilling too often. E.g. if the plan isn't expected to spill,
> > only spill at 10 x work_mem or something like that.  Or we'll need
> > better management of temp file data when there's plenty of memory
> > available.

...

> I think that's actually pretty appealing. Separating the memory we
> plan to use from the memory we're willing to use before spilling
> seems
> like a good idea in general, and I think we should probably also do
> it
> in other places - like sorts.

I'm trying to make sense of this. Let's say there are two GUCs:
planner_work_mem=16MB and executor_work_mem=32MB.

And let's say a query comes along and generates a HashAgg path, and the
planner (correctly) thinks if you put all the groups in memory at once,
it would be 24MB. Then the planner, using planner_work_mem, would think
spilling was necessary, and generate a cost that involves spilling.

Then it's going to generate a Sort+Group path, as well. And perhaps it
estimates that sorting all of the tuples in memory would also take
24MB, so it generates a cost that involves spilling to disk.

But it has to choose one of them. We've penalized plans at risk of
spilling to disk, but what's the point? The planner needs to choose one
of them, and both are at risk of spilling to disk.

Regards,
        Jeff Davis





Re: Default setting for enable_hashagg_disk

Jeff Davis-8
In reply to this post by Andres Freund
On Wed, 2020-06-24 at 12:14 -0700, Andres Freund wrote:
> E.g. if the plan isn't expected to spill,
> only spill at 10 x work_mem or something like that.

Let's say you have work_mem=32MB and a query that's expected to use
16MB of memory. In reality, it uses 64MB of memory. So you are saying
this query would get to use all 64MB of memory, right?

But then you run ANALYZE. Now the query is (correctly) expected to use
64MB of memory. Are you saying this query, executed again with better
stats, would only get to use 32MB of memory, and therefore run slower?
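
A minimal sketch of the rule being questioned here, as I read it (the
function, the flag, and spill_factor are all made up for illustration): the
execution-time spill threshold depends on whether the plan expected to
spill, which is exactly what produces the run-slower-after-ANALYZE effect
described above.

    #include <stdbool.h>
    #include <stddef.h>

    extern int work_mem;    /* in kB, as in PostgreSQL */

    /*
     * Hypothetical: only allow bursting past work_mem when the planner did
     * not expect a spill.  With better stats the same query loses the burst.
     */
    static size_t
    spill_threshold(bool plan_expected_spill)
    {
        const double spill_factor = 10.0;          /* the "10 x work_mem" example */
        size_t       base = (size_t) work_mem * 1024;

        return plan_expected_spill ? base : (size_t) (base * spill_factor);
    }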

Regards,
        Jeff Davis





Re: Default setting for enable_hashagg_disk

Andres Freund
On 2020-06-25 09:24:52 -0700, Jeff Davis wrote:

> On Wed, 2020-06-24 at 12:14 -0700, Andres Freund wrote:
> > E.g. if the plan isn't expected to spill,
> > only spill at 10 x work_mem or something like that.
>
> Let's say you have work_mem=32MB and a query that's expected to use
> 16MB of memory. In reality, it uses 64MB of memory. So you are saying
> this query would get to use all 64MB of memory, right?
>
> But then you run ANALYZE. Now the query is (correctly) expected to use
> 64MB of memory. Are you saying this query, executed again with better
> stats, would only get to use 32MB of memory, and therefore run slower?

Yes. I think that's ok, because it was taken into account from a costing
perspective in the second case.



Re: Default setting for enable_hashagg_disk

Jeff Davis-8
In reply to this post by Andres Freund
On Wed, 2020-06-24 at 12:31 -0700, Andres Freund wrote:

> nodeAgg.c already treats those separately:
>
> void
> hash_agg_set_limits(double hashentrysize, uint64 input_groups, int used_bits,
>                     Size *mem_limit, uint64 *ngroups_limit,
>                     int *num_partitions)
> {
>     int     npartitions;
>     Size    partition_mem;
>
>     /* if not expected to spill, use all of work_mem */
>     if (input_groups * hashentrysize < work_mem * 1024L)
>     {
>         if (num_partitions != NULL)
>             *num_partitions = 0;
>         *mem_limit = work_mem * 1024L;
>         *ngroups_limit = *mem_limit / hashentrysize;
>         return;
>     }

The reason this code exists is to decide how much of work_mem to set
aside for spilling (each spill partition needs an IO buffer).

The alternative would be to fix the number of partitions before
processing a batch, which didn't seem ideal. Or, we could just ignore
the memory required for IO buffers, like HashJoin.

Granted, this is an example where an underestimate can give an
advantage, but I don't think we want to extend the concept into other
areas.
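
For context, a rough sketch of what the spill branch of that function does,
reconstructed from memory of the v13 code -- the buffer-size constants and
the function name here are illustrative placeholders, not the real values:

    #include <stddef.h>
    #include <stdint.h>

    extern int work_mem;                    /* in kB, as in PostgreSQL */

    #define SPILL_READ_BUFFER   (8 * 1024)  /* one shared read buffer (illustrative size) */
    #define SPILL_WRITE_BUFFER  (8 * 1024)  /* one write buffer per partition (illustrative size) */

    /* Reserve I/O buffer memory out of work_mem; the hash table gets the rest. */
    static void
    set_spill_limits(int npartitions, double hashentrysize,
                     size_t *mem_limit, uint64_t *ngroups_limit)
    {
        size_t total_mem     = (size_t) work_mem * 1024;
        size_t partition_mem = SPILL_READ_BUFFER +
                               (size_t) npartitions * SPILL_WRITE_BUFFER;

        /* keep a floor so the I/O buffers can't starve the hash table entirely */
        if (total_mem > 4 * partition_mem)
            *mem_limit = total_mem - partition_mem;
        else
            *mem_limit = total_mem * 3 / 4;

        *ngroups_limit = (uint64_t) (*mem_limit / hashentrysize);
    }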

Regards,
        Jeff Davis





Re: Default setting for enable_hashagg_disk

Jeff Davis-8
In reply to this post by Robert Haas
On Thu, 2020-06-25 at 11:46 -0400, Robert Haas wrote:
> Because the first problem is so bad, most people
> set the value relatively conservatively and just live with the
> performance consequences. But this also means that they have memory
> left over most of the time, so the idea of letting a node burst above
> its work_mem allocation when something unexpected happens isn't
> crazy:
> as long as only a few nodes do that here and there, rather than, say,
> all the nodes doing it all at the same time, it's actually fine.

Unexpected things (meaning underestimates) are not independent. All the
queries are based on the same stats, so if you have a lot of similar
queries, they will all get the same underestimate at once, and all be
surprised when they need to spill at once, and then all decide they are
entitled to ignore work_mem at once.

>  If we
> had a smarter system that could dole out more work_mem to nodes that
> would really benefit from it and less to nodes where it isn't likely
> to make much difference, that would be similar in spirit but even
> better.

That sounds more useful and probably not too hard to implement in a
crude form. Just have a shared counter in memory representing GB. If a
node is about to spill, it could try to decrement the counter by N, and
if it succeeds, it gets to exceed work_mem by N more GB.
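
A very crude sketch of that idea using plain C11 atomics (the real thing
would live in PostgreSQL shared memory and use its own atomics APIs; all
names here are made up):

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Pool of "extra" memory, in GB, that nodes may claim beyond work_mem. */
    static _Atomic long extra_mem_gb = 16;

    /* Try to claim n extra GB before spilling; true means the grant succeeded. */
    static bool
    claim_extra_mem(long n)
    {
        long cur = atomic_load(&extra_mem_gb);

        while (cur >= n)
        {
            if (atomic_compare_exchange_weak(&extra_mem_gb, &cur, cur - n))
                return true;    /* this node may exceed work_mem by n more GB */
            /* cur was reloaded by the failed CAS; loop and retry */
        }
        return false;           /* pool exhausted: spill as usual */
    }

    /* The grant should be returned once the node finishes. */
    static void
    release_extra_mem(long n)
    {
        atomic_fetch_add(&extra_mem_gb, n);
    }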

Regards,
        Jeff Davis





Re: Default setting for enable_hashagg_disk

Bruce Momjian
In reply to this post by Robert Haas
On Thu, Jun 25, 2020 at 11:46:54AM -0400, Robert Haas wrote:

> On Wed, Jun 24, 2020 at 7:38 PM Bruce Momjian <[hidden email]> wrote:
> > I think my main point is that work_mem was not being honored for
> > hash-agg before, but now that PG 13 can do it, we are again allowing
> > work_mem not to apply in certain cases.  I am wondering if our hard
> > limit for work_mem is the issue, and we should make that more flexible
> > for all uses.
>
> I mean, that's pretty much what we're talking about here, isn't it? It
> seems like in your previous two replies you were opposed to separating
> the plan-type limit from the execution-time limit, but that idea is
> precisely a way of being more flexible (and extending it to other plan
> nodes is a way of making it more flexible for more use cases).

I think it was Tom who was complaining about plan-time vs. execution-time
control.

> As I think you know, if you have a system where the workload varies a
> lot, you may sometimes be using 0 copies of work_mem and at other
> times 1000 or more copies, so the value has to be chosen
> conservatively as a percentage of system memory, else you start
> swapping or the OOM killer gets involved. On the other hand, some plan
> nodes get a lot less efficient when the amount of memory available
> falls below some threshold, so you can't just set this to a tiny value
> and forget about it. Because the first problem is so bad, most people
> set the value relatively conservatively and just live with the
> performance consequences. But this also means that they have memory
> left over most of the time, so the idea of letting a node burst above
> its work_mem allocation when something unexpected happens isn't crazy:
> as long as only a few nodes do that here and there, rather than, say,
> all the nodes doing it all at the same time, it's actually fine. If we
> had a smarter system that could dole out more work_mem to nodes that
> would really benefit from it and less to nodes where it isn't likely
> to make much difference, that would be similar in spirit but even
> better.

I think the issue is that in PG 13 work_mem controls sorts and hashes
with a new hard limit for hash aggregation:

        https://www.postgresql.org/docs/12/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY
       
        Sort operations are used for ORDER BY, DISTINCT, and merge joins. Hash
        tables are used in hash joins, hash-based aggregation, and hash-based
        processing of IN subqueries.

In pre-PG 13, we "controlled" it by avoiding hash-based aggregation if
was expected it to exceed work_mem, but if we assumed it would be less
than work_mem and it was more, we exceeded work_mem allocation for that
node.  In PG 13, we "limit" memory to work_mem and spill to disk if we
exceed it.

We should really have always documented that hash agg could exceed
work_mem on misestimation, and if we add a hash_agg work_mem
misestimation bypass setting we should document it under work_mem as
well.

But then the question is why do we allow this bypass only for hash agg?
Should work_mem have settings for ORDER BY, merge join, hash join, and
hash agg, e.g.:

        work_mem = 'order_by=10MB, hash_join=20MB, hash_agg=100MB'

Yeah, crazy syntax, but you get the idea.  I understand some nodes are
more sensitive to disk spill than others, so shouldn't we be controlling
this at the work_mem level, rather than for a specific node type like
hash agg?  We could allow hash agg to over-allocate work_mem on
misestimation by splitting up the hash agg values:

        work_mem = 'order_by=10MB, hash_join=20MB, hash_agg=100MB hash_agg_max=200MB'

but _avoiding_ hash agg if it is estimated to exceed work_mem and spill
to disk is not something to logically control at the work_mem level,
which leads to something like what David Rowley suggested, but with
different names:

        enable_hashagg = on | soft | avoid | off

where 'on' and 'off' are the current PG 13 behavior, 'soft' means to
treat work_mem as a soft limit and allow it to be exceeded on
misestimation, and 'avoid' means to avoid hash agg if it is estimated to
exceed work_mem.  Neither 'soft' nor 'avoid' spills to disk.

David's original terms of "trynospill" and "neverspill" were focused on
spilling, not on its interaction with work_mem, and I found that
confusing.

Frankly, if it took me this long to get my head around this, I am
unclear how many people will understand this tuning feature enough to
actually use it.

--
  Bruce Momjian  <[hidden email]>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Default setting for enable_hashagg_disk

Jeff Davis-8
In reply to this post by Andres Freund
On Thu, 2020-06-25 at 09:37 -0700, Andres Freund wrote:

> > Let's say you have work_mem=32MB and a query that's expected to use
> > 16MB of memory. In reality, it uses 64MB of memory. So you are
> > saying
> > this query would get to use all 64MB of memory, right?
> >
> > But then you run ANALYZE. Now the query is (correctly) expected to
> > use
> > 64MB of memory. Are you saying this query, executed again with
> > better
> > stats, would only get to use 32MB of memory, and therefore run
> > slower?
>
> Yes. I think that's ok, because it was taken into account from a
> costing
> perspective in the second case.

What do you mean by "taken into account"?

There are only two possible paths: HashAgg and Sort+Group, and we need
to pick one. If the planner expects one to spill, it is likely to
expect the other to spill. If one spills in the executor, then the
other is likely to spill, too. (I'm ignoring the case with a lot of
tuples and few groups because that doesn't seem relevant.)

Imagine that there was only one path available to choose. Would you
suggest the same thing, that unexpected spills can exceed work_mem but
expected spills can't?

Regards,
        Jeff Davis





Re: Default setting for enable_hashagg_disk

Robert Haas
In reply to this post by Jeff Davis-8
On Thu, Jun 25, 2020 at 1:15 PM Jeff Davis <[hidden email]> wrote:
> Unexpected things (meaning underestimates) are not independent. All the
> queries are based on the same stats, so if you have a lot of similar
> queries, they will all get the same underestimate at once, and all be
> surprised when they need to spill at once, and then all decide they are
> entitled to ignore work_mem at once.

Yeah, that's a risk. But what is proposed is a configuration setting,
so people can adjust it depending on what they think is likely to
happen in their environment.

> That sounds more useful and probably not too hard to implement in a
> crude form. Just have a shared counter in memory representing GB. If a
> node is about to spill, it could try to decrement the counter by N, and
> if it succeeds, it gets to exceed work_mem by N more GB.

That's a neat idea, although GB seems too coarse.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Default setting for enable_hashagg_disk

Jeff Davis-8
In reply to this post by Bruce Momjian
On Thu, 2020-06-25 at 13:17 -0400, Bruce Momjian wrote:
> Frankly, if it took me this long to get my head around this, I am
> unclear how many people will understand this tuning feature enough to
> actually use it.

The way I think about it is that v13 HashAgg is much more consistent
with the way we do everything else: the planner costs it (including any
spilling that is expected), and the executor executes it (including any
spilling that is required to obey work_mem).

In earlier versions, HashAgg was weird. If we add GUCs to get that
weird behavior back, then the GUCs will necessarily be weird; and
therefore hard to document.

I would feel more comfortable with some kind of GUC escape hatch (or
two). GROUP BY is just too common, and I don't think we can ignore the
potential for users experiencing a regression of some type (even if, in
principle, the v13 version is better).

If we have the GUCs there, then at least if someone comes to the
mailing list with a problem, we can offer them a temporary solution,
and have time to try to avoid the problem in a future release (tweaking
estimates, cost model, defaults, etc.).

One idea is to have undocumented GUCs. That way we don't have to
support them forever, and we are more likely to hear problem reports.

Regards,
        Jeff Davis





Re: Default setting for enable_hashagg_disk

Jeff Davis-8
In reply to this post by Tom Lane-2
On Wed, 2020-06-24 at 13:29 -0400, Tom Lane wrote:
> If we feel we need something to let people have the v12 behavior
> back, let's have
> (1) enable_hashagg on/off --- controls planner, same as it ever was
> (2) enable_hashagg_spill on/off --- controls executor by disabling
> spill
>
> But I'm not really convinced that we need (2).

If we're not going to have a planner GUC, one alternative is to just
penalize the disk costs of HashAgg for a release or two. It would only
affect the cost of HashAgg paths that are expected to spill, which
weren't even generated in previous releases.

In other words, multiply the disk costs by enough that the planner will
usually not choose HashAgg if expected to spill unless the average
group size is quite large (i.e. there are a lot more tuples than
groups, but still enough groups to spill).

As we learn more and optimize more, we can reduce or eliminate the
penalty in a future release. I'm not sure exactly what the penalty
would be, though.
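
Just to make the shape of that concrete, a sketch of where such a penalty
could be applied (the function name, factor, and structure are all
illustrative; this is not the actual costing code):

    #include <stdbool.h>

    extern double seq_page_cost;            /* existing planner cost parameter */

    #define HASHAGG_SPILL_PENALTY 2.0       /* made-up factor, purely for illustration */

    /* Disk-cost contribution of a HashAgg path, inflated when a spill is expected. */
    static double
    hashagg_disk_cost(double pages_written, double pages_read, bool expected_to_spill)
    {
        double cost = (pages_written + pages_read) * seq_page_cost;

        if (expected_to_spill)
            cost *= HASHAGG_SPILL_PENALTY;  /* discourage plans expected to spill */

        return cost;
    }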

Regards,
        Jeff Davis





Re: Default setting for enable_hashagg_disk

Andres Freund
In reply to this post by Jeff Davis-8
Hi,

On 2020-06-25 10:44:42 -0700, Jeff Davis wrote:
> There are only two possible paths: HashAgg and Sort+Group, and we need
> to pick one. If the planner expects one to spill, it is likely to
> expect the other to spill. If one spills in the executor, then the
> other is likely to spill, too. (I'm ignoring the case with a lot of
> tuples and few groups because that doesn't seem relevant.)

There's also ordered index scan + Group. Which will often be vastly
better than Sort+Group, but still slower than HashAgg.


> Imagine that there was only one path available to choose. Would you
> suggest the same thing, that unexpected spills can exceed work_mem but
> expected spills can't?

I'm not saying what I propose is perfect, but I've yet to hear a better
proposal. Given that there *are* different ways to implement
aggregation, and that we use expected costs to choose, I think the
assumed costs are relevant.

Greetings,

Andres Freund



Re: Default setting for enable_hashagg_disk

Bruce Momjian
In reply to this post by Jeff Davis-8
On Thu, Jun 25, 2020 at 11:02:30AM -0700, Jeff Davis wrote:
> If we have the GUCs there, then at least if someone comes to the
> mailing list with a problem, we can offer them a temporary solution,
> and have time to try to avoid the problem in a future release (tweaking
> estimates, cost model, defaults, etc.).
>
> One idea is to have undocumented GUCs. That way we don't have to
> support them forever, and we are more likely to hear problem reports.

Uh, our track record of adding GUCs just in case is not good, and
removing them is even harder.  Undocumented sounds interesting but then
how do we even document when we remove it?  I don't think we want to go
there.  Oracle has done that, and I don't think the user experience is
good.

Maybe we should just continue through beta, add an incompatibility item
to the PG 13 release notes, and see what feedback we get.  We know
increasing work_mem gets us the exceed-work_mem behavior, but that
affects other nodes too, and I can't think of a way to avoid hash agg
when a spill is predicted, except to disable it for that query.

I am still trying to get my head around why the spill is going to be so
much more work to adjust for with hash agg than with our other spillable
nodes.  What are people doing for those cases already?  Do we have any
real-world queries that are a problem in PG 13 for this?

--
  Bruce Momjian  <[hidden email]>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Default setting for enable_hashagg_disk

Bruce Momjian
In reply to this post by Jeff Davis-8
On Thu, Jun 25, 2020 at 11:10:58AM -0700, Jeff Davis wrote:

> On Wed, 2020-06-24 at 13:29 -0400, Tom Lane wrote:
> > If we feel we need something to let people have the v12 behavior
> > back, let's have
> > (1) enable_hashagg on/off --- controls planner, same as it ever was
> > (2) enable_hashagg_spill on/off --- controls executor by disabling
> > spill
> >
> > But I'm not really convinced that we need (2).
>
> If we're not going to have a planner GUC, one alternative is to just
> penalize the disk costs of HashAgg for a release or two. It would only
> affect the cost of HashAgg paths that are expected to spill, which
> weren't even generated in previous releases.
>
> In other words, multiply the disk costs by enough that the planner will
> usually not choose HashAgg if expected to spill unless the average
> group size is quite large (i.e. there are a lot more tuples than
> groups, but still enough groups to spill).

Well, the big question is whether this costing is actually more accurate
than what we have now.  What I am hearing is that spilling hash agg is
expensive, so whatever we can do to reflect the actual costs seems like
a win.  If it can be done, it certainly seems better than a cost setting
few people will use.

--
  Bruce Momjian  <[hidden email]>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Default setting for enable_hashagg_disk

Bruce Momjian
On Thu, Jun 25, 2020 at 03:24:42PM -0400, Bruce Momjian wrote:

> On Thu, Jun 25, 2020 at 11:10:58AM -0700, Jeff Davis wrote:
> > On Wed, 2020-06-24 at 13:29 -0400, Tom Lane wrote:
> > > If we feel we need something to let people have the v12 behavior
> > > back, let's have
> > > (1) enable_hashagg on/off --- controls planner, same as it ever was
> > > (2) enable_hashagg_spill on/off --- controls executor by disabling
> > > spill
> > >
> > > But I'm not really convinced that we need (2).
> >
> > If we're not going to have a planner GUC, one alternative is to just
> > penalize the disk costs of HashAgg for a release or two. It would only
> > affect the cost of HashAgg paths that are expected to spill, which
> > weren't even generated in previous releases.
> >
> > In other words, multiply the disk costs by enough that the planner will
> > usually not choose HashAgg if expected to spill unless the average
> > group size is quite large (i.e. there are a lot more tuples than
> > groups, but still enough groups to spill).
>
> Well, the big question is whether this costing is actually more accurate
> than what we have now.  What I am hearing is that spilling hash agg is
> expensive, so whatever we can do to reflect the actual costs seems like
> a win.  If it can be done, it certainly seems better than a cost setting
> few people will use.

It is my understanding that sort spills are mostly read sequentially,
while hash spill reads are random.  Is that right?  Is that not being
costed properly?

That doesn't fix the misestimation case, but increasing work mem does
allow pre-PG 13 behavior there.

--
  Bruce Momjian  <[hidden email]>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Default setting for enable_hashagg_disk

Andres Freund
In reply to this post by Bruce Momjian
Hi,

On 2020-06-25 14:25:12 -0400, Bruce Momjian wrote:
> I am still trying to get my head around why the spill is going to be so
> much more work to adjust for with hash agg than with our other spillable
> nodes.

Aggregates are the classical case used to process large amounts of
data. For larger amounts of data, sorted input (be it via explicit sort
or ordered index scan) isn't an attractive option. IOW hash-agg is the
common case.  There are also fewer stats for halfway-accurately estimating
the number of groups and the size of the transition state - a sort /
hash join doesn't have an equivalent to the variably sized transition
value.


> What are people doing for those cases already?  Do we have any
> real-world queries that are a problem in PG 13 for this?

I don't know about real world, but it's pretty easy to come up with
examples.

query:
SELECT a, array_agg(b) FROM (SELECT generate_series(1, 10000)) a(a), (SELECT generate_series(1, 10000)) b(b) GROUP BY a HAVING array_length(array_agg(b), 1) = 0;

work_mem = 4MB

12      18470.012 ms
HEAD    44635.210 ms

HEAD causes ~2.8GB of file IO, 12 doesn't cause any. If you're IO
bandwidth constrained, this could be quite bad.

Obviously this is contrived, and a pretty extreme case. But imagine
this happening on a system where disk IO isn't super fast (e.g. just
about any cloud provider).

An even more extreme version of the above is this:


query: SELECT a, array_agg(b) FROM (SELECT generate_series(1, 50000)) a(a), (SELECT generate_series(1, 10000)) b(b) GROUP BY a HAVING array_length(array_agg(b), 1) = 0;

work_mem = 16MB
12      81598.965 ms
HEAD    210772.360 ms

temporary tablespace on magnetic disk (raid 0 of two 7.2k server
spinning disks)

12      81136.530 ms
HEAD   225182.560 ms

The disks are busy in some periods, but still keep up. If, however, I
make the transition state a bit bigger:

query: SELECT a, array_agg(b), count(c), max(d),max(e) FROM (SELECT generate_series(1, 10000)) a(a), (SELECT generate_series(1, 5000)::text, repeat(random()::text, 10), repeat(random()::text, 10), repeat(random()::text, 10)) b(b,c,d,e) GROUP BY a HAVING array_length(array_agg(b), 1) = 0;

12 28164.865 ms

fast ssd:
HEAD    92520.680 ms

magnetic:
HEAD    183968.538 ms

(no reads, there's plenty of memory. Just writes because the age /
amount thresholds for dirty data are reached)

In the magnetic case we're IO bottlenecked nearly the whole time.


Just to be clear: I think this is a completely over-the-top example. But
I do think it shows the problem to some degree at least.

Greetings,

Andres Freund

