Default setting for enable_hashagg_disk


Re: Default setting for enable_hashagg_disk

Tomas Vondra-4
On Fri, Jun 26, 2020 at 05:24:36PM -0700, Peter Geoghegan wrote:

>On Fri, Jun 26, 2020 at 4:59 PM Tomas Vondra
><[hidden email]> wrote:
>> I agree larger work_mem for hashagg (and thus less spilling) may mean
>> lower work_mem for some other nodes that are less sensitive to this.
>> But I think this needs to be formulated as a cost-based decision,
>> although I don't know how to do that for the reasons I explained before
>> (bottom-up plan construction vs. distributing the memory budget).
>
>Why do you think that it needs to be formulated as a cost-based
>decision? That's probably true of a scheme that allocates memory very
>intelligently, but what about an approach that's slightly better than
>work_mem?
>

Well, there are multiple ideas discussed in this (sub)thread, one of
them being a per-query memory limit. That requires deciding how much
memory different nodes should get, which I think would need to be
cost-based.

>What problems do you foresee (if any) with adding a hash_mem GUC that
>gets used for both planning and execution for hash aggregate and hash
>join nodes, in about the same way as work_mem is now?
>

Of course, a simpler scheme like this would not require that. And maybe
introducing hash_mem is a good idea - I'm not particularly opposed to
that, actually. But I think we should not introduce separate memory
limits for each node type, which was also mentioned earlier.

The problem of course is that hash_mem does not really solve the issue
discussed at the beginning of this thread, i.e. regressions due to
underestimates and unexpected spilling at execution time.

The thread is getting a rather confusing mix of proposals how to fix
that for v13 and proposals how to improve our configuration of memory
limits :-(

>> FWIW some databases already do something like this - SQL Server has
>> something called "memory grant" which I think mostly does what you
>> described here.
>
>Same is true of Oracle. But Oracle also has simple work_mem-like
>settings for sorting and hashing. People don't really use them anymore,
>but apparently it was once common for the DBA to explicitly give over
>more memory to hashing -- much like the hash_mem setting I asked about.
>IIRC the same is true of DB2.
>

Interesting. What is not entirely clear to me is how these databases
decide how much each node should get during planning. With the separate
work_mem-like settings it's fairly obvious, but how do they do that with
the global limit (either per-instance or per-query)?

>> The difference between sort and hashagg spills is that for sorts
>> there is no behavior change. Plans that did (not) spill in v12 will
>> behave the same way on v13, modulo some random perturbation. For
>> hashagg that's not the case - some queries that did not spill before
>> will spill now.
>>
>> So even if the hashagg spills are roughly equal to sort spills, both
>> are significantly more expensive than not spilling.
>
>Just to make sure we're on the same page: both are significantly more
>expensive than a hash aggregate not spilling *specifically*. OTOH, a
>group aggregate may not be much slower when it spills compared to an
>in-memory sort group aggregate. It may even be noticeably faster, due
>to caching effects, as you mentioned at one point upthread.
>
>This is the property that makes hash aggregate special, and justifies
>giving it more memory than other nodes on a system-wide basis (the same
>thing applies to hash join). This could even work as a multiplier of
>work_mem.
>

Yes, I agree.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Default setting for enable_hashagg_disk

Peter Geoghegan-4
In reply to this post by akapila
On Sat, Jun 27, 2020 at 3:00 AM Amit Kapila <[hidden email]> wrote:
> I think the advantage of delaying it is that we
> might see some real problems (like where hash aggregate is not a good
> choice) which can be fixed via the costing model.

I think any problem that might come up with the costing is best
thought of as a distinct problem. This thread is mostly about the
problem of users getting fewer in-memory hash aggregates compared to a
previous release running the same application (though there has been
some discussion of the other problem, too [1]; it's thought to be
less serious).

The problem is that affected users were theoretically never entitled
to the performance they came to rely on, and yet there is good reason
to think that hash aggregate really should be entitled to more memory.
They won't care that they were theoretically never entitled to that
performance, though -- they *liked* the fact that hash agg could
cheat. And they'll dislike the fact that this cannot be corrected by
tuning work_mem, since that affects all node types that consume
work_mem, not just hash aggregate -- that could cause OOMs for them.

There are two or three similar ideas under discussion that might fix
the problem. They all seem to involve admitting that hash aggregate's
"cheating" might actually have been a good thing all along (even
though giving hash aggregate much much more memory than other nodes is
terrible), and giving hash aggregate license to "cheat openly". Note
that the problem isn't exactly a problem with the hash aggregate
spilling patch. You could think of the problem as a pre-existing issue
-- a failure to give more memory to hash aggregate, which really
should be entitled to more memory. Jeff's patch just made the issue
more obvious.

[1] https://postgr.es/m/20200624191433.5gnqgrxfmucexldm@...
--
Peter Geoghegan



Re: Default setting for enable_hashagg_disk

Peter Geoghegan-4
In reply to this post by Tomas Vondra-4
On Sat, Jun 27, 2020 at 3:41 AM Tomas Vondra
<[hidden email]> wrote:
> Well, there are multiple ideas discussed in this (sub)thread, one of
> them being a per-query memory limit. That requires deciding how much
> memory different nodes should get, which I think would need to be
> cost-based.

A design like that probably makes sense. But it's way out of scope for
Postgres 13, and not something that should be discussed further on
this thread IMV.

> Of course, a simpler scheme like this would not require that. And maybe
> introducing hash_mem is a good idea - I'm not particularly opposed to
> that, actually. But I think we should not introduce separate memory
> limits for each node type, which was also mentioned earlier.

I had imagined that hash_mem would apply to hash join and hash
aggregate only. A GUC that either represents a multiple of work_mem,
or an absolute work_mem-style KB value.
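
To make the two options concrete, here is a rough sketch in C (purely
illustrative; the helper names and default values are invented for the
example, not taken from any posted patch):

    /* Illustrative sketch only: two possible semantics for a hash_mem GUC. */
    static int    work_mem = 4096;            /* KB, stand-in for the real GUC */
    static int    hash_mem = 16384;           /* KB, absolute-value interpretation */
    static double hash_mem_multiplier = 4.0;  /* multiple-of-work_mem interpretation */

    /* Memory budget (in KB) that hash join and hash aggregate would use. */
    static int
    hash_mem_as_absolute(void)
    {
        return hash_mem;                      /* independent work_mem-style limit */
    }

    static int
    hash_mem_as_multiple(void)
    {
        return (int) (work_mem * hash_mem_multiplier);
    }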

> The problem of course is that hash_mem does not really solve the issue
> discussed at the beginning of this thread, i.e. regressions due to
> underestimates and unexpected spilling at execution time.

Like Andres, I suspect that that's a smaller problem in practice. A
hash aggregate that spills often has performance characteristics
somewhat like a group aggregate + sort, anyway. I'm worried about
cases where an *in-memory* hash aggregate is naturally far far faster
than other strategies, and yet we can't use it -- despite the fact
that Postgres 12 could "safely" do so. (It probably doesn't matter
whether the slow plan that you get in Postgres 13 is a hash aggregate
that spills, or something else -- this is not really a costing
problem.)

Besides, hash_mem *can* solve that problem to some extent. Other cases
(cases where the estimates are so bad that hash_mem won't help) seem
like less of a concern to me. To some extent, that's the price you pay
to avoid the risk of an OOM.

> The thread is getting a rather confusing mix of proposals how to fix
> that for v13 and proposals how to improve our configuration of memory
> limits :-(

As I said to Amit in my last message, I think that all of the ideas
that are worth pursuing involve giving hash aggregate nodes license to
use more memory than other nodes. One variant involves doing so only
at execution time, while the hash_mem idea involves formalizing and
documenting that hash-based nodes are special -- and taking that into
account during both planning and execution.

>> Interesting. What is not entirely clear to me is how these databases
>> decide how much each node should get during planning. With the separate
> work_mem-like settings it's fairly obvious, but how do they do that with
> the global limit (either per-instance or per-query)?

I don't know, but that seems like a much more advanced way of
approaching the problem. It isn't in scope here.

Perhaps I'm not considering some unintended consequence of the planner
giving hash-based nodes extra memory "for free" in the common case
where hash_mem exceeds work_mem (by 2x, say). But my guess is that
that won't be a significant problem that biases the planner in some
obviously undesirable way.

--
Peter Geoghegan



Re: Default setting for enable_hashagg_disk

Tomas Vondra-4
On Sun, Jun 28, 2020 at 06:39:40PM -0700, Peter Geoghegan wrote:

>On Sat, Jun 27, 2020 at 3:41 AM Tomas Vondra
><[hidden email]> wrote:
>> Well, there are multiple ideas discussed in this (sub)thread, one of
>> them being a per-query memory limit. That requires deciding how much
>> memory different nodes should get, which I think would need to be
>> cost-based.
>
>A design like that probably makes sense. But it's way out of scope for
>Postgres 13, and not something that should be discussed further on
>this thread IMV.
>

100% agree

>> Of course, a simpler scheme like this would not require that. And maybe
>> introducing hash_mem is a good idea - I'm not particularly opposed to
>> that, actually. But I think we should not introduce separate memory
>> limits for each node type, which was also mentioned earlier.
>
>I had imagined that hash_mem would apply to hash join and hash
>aggregate only. A GUC that either represents a multiple of work_mem,
>or an absolute work_mem-style KB value.
>

I'm not against having a hash_mem (and I'd vote for it to be a simple
work_mem-style value, not a multiple). Maybe we should have it; the
argument to allow hashagg (and perhaps hashjoin) to use more memory than
some other nodes seems convincing to me.

I'm just not sure which of the problems mentioned in this thread it
actually addresses ...

>> The problem of course is that hash_mem does not really solve the issue
>> discussed at the beginning of this thread, i.e. regressions due to
>> underestimates and unexpected spilling at execution time.
>
>Like Andres, I suspect that that's a smaller problem in practice. A
>hash aggregate that spills often has performance characteristics
>somewhat like a group aggregate + sort, anyway. I'm worried about
>cases where an *in-memory* hash aggregate is naturally far far faster
>than other strategies, and yet we can't use it -- despite the fact
>that Postgres 12 could "safely" do so. (It probably doesn't matter
>whether the slow plan that you get in Postgres 13 is a hash aggregate
>that spills, or something else -- this is not really a costing
>problem.)
>

Not sure I follow. Which cases do you mean when you say that 12 could
safely do them, but 13 won't? I see the following two cases:


a) Planner in 12 and 13 disagree about whether the hash table will fit
into work_mem.

I don't quite see why this would be the case (given the same cardinality
estimates etc.), though. That is, if 12 says "will fit" I'd expect 13 to
end up with the same conclusion. But maybe 13 has higher per-tuple
overhead or something? I know we set aside some memory for BufFiles, but
not when we expect the whole hash table to fit into memory.


b) Planner believes the hash table will fit, due to underestimate.

On 12, we'd just let the hash table overflow, which may be a win when
there's enough RAM and the estimate is not "too wrong". But it may
easily end with a sad OOM.

On 13, we'll just start spilling. True - people tend to use conservative
work_mem values exactly because of cases like this (which is somewhat
futile as the underestimate may be arbitrarily wrong) and also because
they don't know how many work_mem instances the query will use.

So yeah, I understand why people may not want to increase work_mem too
much, and maybe hash_mem would be a way to get the "no spill" behavior.


>Besides, hash_mem *can* solve that problem to some extent. Other cases
>(cases where the estimates are so bad that hash_mem won't help) seem
>like less of a concern to me. To some extent, that's the price you pay
>to avoid the risk of an OOM.
>

True, avoiding the risk of OOM has its cost.

>> The thread is getting a rather confusing mix of proposals how to fix
>> that for v13 and proposals how to improve our configuration of memory
>> limits :-(
>
>As I said to Amit in my last message, I think that all of the ideas
>that are worth pursuing involve giving hash aggregate nodes license to
>use more memory than other nodes. One variant involves doing so only
>at execution time, while the hash_mem idea involves formalizing and
>documenting that hash-based nodes are special -- and taking that into
>account during both planning and execution.
>

Understood. I mostly agree with this.

>> Interesting. What is not entirely clear to me is how these databases
>> decide how much each node should get during planning. With the separate
>> work_mem-like settings it's fairly obvious, but how do they do that with
>> the global limit (either per-instance or per-query)?
>
>I don't know, but that seems like a much more advanced way of
>approaching the problem. It isn't in scope here.
>

+1

>Perhaps I'm not considering some unintended consequence of the planner
>giving hash-based nodes extra memory "for free" in the common case
>where hash_mem exceeds work_mem (by 2x, say). But my guess is that
>that won't be a significant problem that biases the planner in some
>obviously undesirable way.
>

My concern is how much more difficult these proposals would make it to
reason about memory usage. Maybe not much, I'm not sure.

I certainly agree it may be beneficial to give more memory to hashagg at
the expense of other nodes.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Default setting for enable_hashagg_disk

Bruce Momjian
In reply to this post by Peter Geoghegan-4
On Sun, Jun 28, 2020 at 05:40:16PM -0700, Peter Geoghegan wrote:

> I think any problem that might come up with the costing is best
> thought of as a distinct problem. This thread is mostly about the
> problem of users getting fewer in-memory hash aggregates compared to a
> previous release running the same application (though there has been
> some discussion of the other problem, too [1]; it's thought to be
> less serious).
>
> The problem is that affected users were theoretically never entitled
> to the performance they came to rely on, and yet there is good reason
> to think that hash aggregate really should be entitled to more memory.
> They won't care that they were theoretically never entitled to that
> performance, though -- they *liked* the fact that hash agg could
> cheat. And they'll dislike the fact that this cannot be corrected by
> tuning work_mem, since that affects all node types that consume
> work_mem, not just hash aggregate -- that could cause OOMs for them.
>
> There are two or three similar ideas under discussion that might fix
> the problem. They all seem to involve admitting that hash aggregate's
> "cheating" might actually have been a good thing all along (even
> though giving hash aggregate much much more memory than other nodes is
> terrible), and giving hash aggregate license to "cheat openly". Note
> that the problem isn't exactly a problem with the hash aggregate
> spilling patch. You could think of the problem as a pre-existing issue
> -- a failure to give more memory to hash aggregate, which really
> should be entitled to more memory. Jeff's patch just made the issue
> more obvious.

In thinking some more about this, I came out with two ideas. First, in
pre-PG 13, we didn't choose hash_agg if we thought it would spill, but
if we misestimated and it used more work_mem, we allowed it.  The effect
of this is that if we were close, but it went over, we allowed it just
for hash_agg.  Is this something we want to codify for all node types,
i.e., choose a non-spill node type if we need a lot more than work_mem,
but then let work_mem be a soft limit if we do choose it, e.g., allow
50% over work_mem in the executor for misestimation before spill?  My
point is, do we want to use a lower work_mem for planning and a higher
one in the executor before spilling?
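
For what it's worth, the soft-limit idea could look roughly like this (a
sketch only; the 50% slack and the helper name are just the numbers from the
paragraph above, not code from any patch, and the planner would still cost
plans with plain work_mem):

    #include <stdbool.h>
    #include <stddef.h>

    static int work_mem = 4096;    /* KB, stand-in for the real GUC */

    static bool
    exceeds_soft_limit(size_t used_bytes)
    {
        size_t soft_limit = (size_t) work_mem * 1024;     /* planning-time budget */
        size_t hard_limit = soft_limit + soft_limit / 2;  /* allow 50% overrun */

        return used_bytes > hard_limit;                   /* only then start spilling */
    }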

My second thought is from an earlier report that spilling is very
expensive, but smaller work_mem doesn't seem to hurt much.  Would we
achieve better overall performance by giving a few nodes a lot of memory
(and not spill those), and other nodes very little, rather than having
them all be the same size, and all spill?

--
  Bruce Momjian  <[hidden email]>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Default setting for enable_hashagg_disk

Peter Geoghegan-4
In reply to this post by Tomas Vondra-4
On Mon, Jun 29, 2020 at 8:07 AM Tomas Vondra
<[hidden email]> wrote:
> Not sure I follow. Which cases do you mean when you say that 12 could
> safely do them, but 13 won't? I see the following two cases:

> a) Planner in 12 and 13 disagree about whether the hash table will fit
> into work_mem.
>
> I don't quite see why this would be the case (given the same cardinality
> estimates etc.), though. That is, if 12 says "will fit" I'd expect 13 to
> end up with the same conclusion. But maybe 13 has higher per-tuple
> overhead or something? I know we set aside some memory for BufFiles, but
> not when we expect the whole hash table to fit into memory.

I have no reason to believe that the planner is any more or any less
likely to conclude that the hash table will fit in memory in v13 as
things stand (I don't know if the BufFile issue matters).

In general, grouping estimates probably aren't very good compared to
join estimates. I imagine that in either v12 or v13 the planner is
likely to incorrectly believe that it'll all fit in memory fairly
often. v12 was much too permissive about what could happen. But v13 is
too conservative.

> b) Planner believes the hash table will fit, due to underestimate.
>
> On 12, we'd just let the hash table overflow, which may be a win when
> there's enough RAM and the estimate is not "too wrong". But it may
> easily end with a sad OOM.

It might end up with an OOM on v12 due to an underestimate -- but
probably not! The fact that a hash aggregate is faster than a group
aggregate ameliorates the higher memory usage. You might actually use
less memory this way.

> On 13, we'll just start spilling. True - people tend to use conservative
> work_mem values exactly because of cases like this (which is somewhat
> futile as the underestimate may be arbitrarily wrong) and also because
> they don't know how many work_mem instances the query will use.
>
> So yeah, I understand why people may not want to increase work_mem too
> much, and maybe hash_mem would be a way to get the "no spill" behavior.

Andres wanted to increase the amount of memory that could be used at
execution time, without changing planning. You could say that hash_mem
is a more ambitious proposal than that. It's changing the behavior
across the board -- though in a way that makes sense anyway. It has
the additional benefit of making it more likely that an in-memory hash
aggregate will be used. That isn't a problem that we're obligated to
solve now, so this may seem odd. But if the more ambitious plan is
actually easier to implement and support, why not pursue it?

hash_mem seems a lot easier to explain and reason about than having
different work_mem budgets during planning and execution, which is
clearly a kludge. hash_mem makes sense generally, and more or less
solves the problems raised on this thread.

--
Peter Geoghegan



Re: Default setting for enable_hashagg_disk

Bruce Momjian
On Mon, Jun 29, 2020 at 10:20:14AM -0700, Peter Geoghegan wrote:
> I have no reason to believe that the planner is any more or any less
> likely to conclude that the hash table will fit in memory in v13 as
> things stand (I don't know if the BufFile issue matters).
>
> In general, grouping estimates probably aren't very good compared to
> join estimates. I imagine that in either v12 or v13 the planner is
> likely to incorrectly believe that it'll all fit in memory fairly
> often. v12 was much too permissive about what could happen. But v13 is
> too conservative.

FYI, we have improved planner statistics estimates for years, which must
have affected node spill behavior on many node types (except hash_agg),
and I don't remember any complaints about it.

--
  Bruce Momjian  <[hidden email]>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Default setting for enable_hashagg_disk

Peter Geoghegan-4
In reply to this post by Bruce Momjian
On Mon, Jun 29, 2020 at 8:29 AM Bruce Momjian <[hidden email]> wrote:
> Is this something we want to codify for all node types,
> i.e., choose a non-spill node type if we need a lot more than work_mem,
> but then let work_mem be a soft limit if we do choose it, e.g., allow
> 50% over work_mem in the executor for misestimation before spill?  My
> point is, do we want to use a lower work_mem for planning and a higher
> one in the executor before spilling?

Andres said something about doing that with hash aggregate, which I
can see an argument for, but I don't think that it would make sense
with most other nodes. In particular, sorts still perform almost as
well with only a fraction of the "optimal" memory.

> My second thought is from an earlier report that spilling is very
> expensive, but smaller work_mem doesn't seem to hurt much.

It's not really about the spilling itself IMV. It's the inability to
do hash aggregation in a single pass.

You can think of hashing (say for hash join or hash aggregate) as a
strategy that consists of a logical division followed by a physical
combination. Sorting (or sort merge join, or group agg), in contrast,
consists of a physical division and logical combination. As a
consequence, it can be a huge win to do everything in memory in the
case of hash aggregate. Whereas sort-based aggregation can sometimes
be slightly faster with external sorts due to CPU caching effects, and
because an on-the-fly merge in tuplesort can output the first tuple
before the tuples are fully sorted.

> Would we
> achieve better overall performance by giving a few nodes a lot of memory
> (and not spill those), and other nodes very little, rather than having
> them all be the same size, and all spill?

If the nodes that we give more memory to use it for a hash table, then yes.

--
Peter Geoghegan



Re: Default setting for enable_hashagg_disk

Tomas Vondra-4
In reply to this post by Peter Geoghegan-4
On Mon, Jun 29, 2020 at 10:20:14AM -0700, Peter Geoghegan wrote:

>On Mon, Jun 29, 2020 at 8:07 AM Tomas Vondra
><[hidden email]> wrote:
>> Not sure I follow. Which cases do you mean when you say that 12 could
>> safely do them, but 13 won't? I see the following two cases:
>
>> a) Planner in 12 and 13 disagree about whether the hash table will fit
>> into work_mem.
>>
>> I don't quite see why this would be the case (given the same cardinality
>> estimates etc.), though. That is, if 12 says "will fit" I'd expect 13 to
>> end up with the same conclusion. But maybe 13 has higher per-tuple
>> overhead or something? I know we set aside some memory for BufFiles, but
>> not when we expect the whole hash table to fit into memory.
>
>I have no reason to believe that the planner is any more or any less
>likely to conclude that the hash table will fit in memory in v13 as
>things stand (I don't know if the BufFile issue matters).
>
>In general, grouping estimates probably aren't very good compared to
>join estimates. I imagine that in either v12 or v13 the planner is
>likely to incorrectly believe that it'll all fit in memory fairly
>often. v12 was much too permissive about what could happen. But v13 is
>too conservative.
>

Can you give an example of what you mean by being too permissive or too
conservative? Do you mean the possibility of unlimited memory usage in
v12, and strict enforcement in v13?

IMO enforcing the work_mem limit (in v13) is right in principle, but I
do understand the concerns about unexpected regressions compared to v12.

>> b) Planner believes the hash table will fit, due to underestimate.
>>
>> On 12, we'd just let the hash table overflow, which may be a win when
>> there's enough RAM and the estimate is not "too wrong". But it may
>> easily end with a sad OOM.
>
>It might end up with an OOM on v12 due to an underestimate -- but
>probably not! The fact that a hash aggregate is faster than a group
>aggregate ameliorates the higher memory usage. You might actually use
>less memory this way.
>

I don't understand what you mean by "less memory" when the whole issue
is significantly exceeding work_mem?

I don't think the OOM is the only performance risk here - using a
lot of memory also forces eviction of data from the page cache (although
writing a lot of temporary files may have a similar effect).

>> On 13, we'll just start spilling. True - people tend to use conservative
>> work_mem values exactly because of cases like this (which is somewhat
>> futile as the underestimate may be arbitrarily wrong) and also because
>> they don't know how many work_mem instances the query will use.
>>
>> So yeah, I understand why people may not want to increase work_mem too
>> much, and maybe hash_mem would be a way to get the "no spill" behavior.
>
>Andres wanted to increase the amount of memory that could be used at
>execution time, without changing planning. You could say that hash_mem
>is a more ambitious proposal than that. It's changing the behavior
>across the board -- though in a way that makes sense anyway. It has
>the additional benefit of making it more likely that an in-memory hash
>aggregate will be used. That isn't a problem that we're obligated to
>solve now, so this may seem odd. But if the more ambitious plan is
>actually easier to implement and support, why not pursue it?
>
>hash_mem seems a lot easier to explain and reason about than having
>different work_mem budgets during planning and execution, which is
>clearly a kludge. hash_mem makes sense generally, and more or less
>solves the problems raised on this thread.
>

I agree with this, and I'm mostly OK with having hash_mem. In fact, from
the proposals in this thread I like it the most - as long as it's used
both during planning and execution. It's a pretty clear solution.

It's not a perfect solution in the sense that it does not reintroduce
the v12 behavior exactly (i.e. we'll still spill after reaching
hash_mem), but that may be good enough.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Default setting for enable_hashagg_disk

Tomas Vondra-4
In reply to this post by Bruce Momjian
On Mon, Jun 29, 2020 at 01:31:40PM -0400, Bruce Momjian wrote:

>On Mon, Jun 29, 2020 at 10:20:14AM -0700, Peter Geoghegan wrote:
>> I have no reason to believe that the planner is any more or any less
>> likely to conclude that the hash table will fit in memory in v13 as
>> things stand (I don't know if the BufFile issue matters).
>>
>> In general, grouping estimates probably aren't very good compared to
>> join estimates. I imagine that in either v12 or v13 the planner is
>> likely to incorrectly believe that it'll all fit in memory fairly
>> often. v12 was much too permissive about what could happen. But v13 is
>> too conservative.
>
>FYI, we have improved planner statistics estimates for years, which must
>have affected node spill behavior on many node types (except hash_agg),
>and I don't remember any complaints about it.
>

I think misestimates for GROUP BY are quite common and very hard to fix.
Firstly, our ndistinct estimator may give pretty bad results depending
e.g. on how the table is correlated.

I've been running some TPC-H benchmarks, and for partsupp.ps_partkey our
estimate was 4338776 while the actual value is 15000000, i.e. ~3.5x
higher. This was with the statistics target increased to 1000. I can easily
imagine even worse estimates with lower values.

This ndistinct estimator is used even for extended statistics, so that
can't quite save us. Moreover, the grouping may be on top of a join, in
which case using ndistinct coefficients may not be possible :-(

So I think this is a quite real problem ...


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Default setting for enable_hashagg_disk

Peter Geoghegan-4
In reply to this post by Tomas Vondra-4
On Mon, Jun 29, 2020 at 2:22 PM Tomas Vondra
<[hidden email]> wrote:
> Can you give an example of what you mean by being too permissive or too
> conservative? Do you mean the possibility of unlimited memory usage in
> v12, and strict enforcement in v13?

Yes -- that's all I meant.

> IMO enforcing the work_mem limit (in v13) is right in principle, but I
> do understand the concerns about unexpected regressions compared to v12.

Yeah. Both of these two things are true at the same time.

> I don't understand what you mean by "less memory" when the whole issue
> is significantly exceeding work_mem?

I was just reiterating what I said a few times already: Not using an
in-memory hash aggregate when the amount of memory required is high
but not prohibitively high is penny wise, pound foolish. It's easy to
imagine this actually using up more memory when an entire workload is
considered. This observation does not apply to a system that only ever
has one active connection at a time, but who cares about that?

> I don't think the OOM is the only performance risk here - using a
> lot of memory also forces eviction of data from the page cache (although
> writing a lot of temporary files may have a similar effect).

True.

> I agree with this, and I'm mostly OK with having hash_mem. In fact, from
> the proposals in this thread I like it the most - as long as it's used
> both during planning and execution. It's a pretty clear solution.

Great.

It's not trivial to write the patch, since there are a few tricky
cases. For example, maybe there is some subtlety in nodeAgg.c with
AGG_MIXED cases. Is there somebody else that knows that code better
than I do that wants to have a go at writing a patch?

--
Peter Geoghegan



Re: Default setting for enable_hashagg_disk

Peter Geoghegan-4
On Mon, Jun 29, 2020 at 2:46 PM Peter Geoghegan <[hidden email]> wrote:

> On Mon, Jun 29, 2020 at 2:22 PM Tomas Vondra
> <[hidden email]> wrote:
> > I agree with this, and I'm mostly OK with having hash_mem. In fact, from
> > the proposals in this thread I like it the most - as long as it's used
> > both during planning and execution. It's a pretty clear solution.
>
> Great.
>
> It's not trivial to write the patch, since there are a few tricky
> cases. For example, maybe there is some subtlety in nodeAgg.c with
> AGG_MIXED cases.
Attached is an attempt at this. I have not been particularly thorough,
since it is still not completely clear that the hash_mem proposal has
a serious chance of resolving the "many users rely on hashagg
exceeding work_mem, regardless of whether or not that is the intended
behavior in Postgres 12" problem. But at least we have a patch now,
and so have some idea of how invasive this will have to be. We also
have something to test.

Note that I created a new open item for this "maybe we need something
like a hash_mem GUC now" problem today. To recap, this thread started
out being a discussion about the enable_hashagg_disk GUC, which seems
like a distinct problem to me. It won't make much sense to return to
discussing the original problem before we have a solution to this
other problem (the problem that I propose to address by inventing
hash_mem).

About the patch:

The patch adds hash_mem, which is just another work_mem-like GUC that
the patch has us use in certain cases -- cases where the work area is
a hash table (but not when it's a sort, or some kind of bitset, or
anything else). I still think that the new GUC should work as a
multiplier of work_mem, or something else along those lines, though
for now it's just an independent work_mem used for hashing. I bring it
up again because I'm concerned about users that upgrade to Postgres 13
incautiously, and find that hashing uses *less* memory than before.
Many users probably get away with setting work_mem quite high across
the board. At the very least, hash_mem should be ignored when it's set
to below work_mem (which isn't what the patch does).
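
In other words, something along these lines (just a sketch of the suggested
clamping, not what the attached patch does; the helper name and values are
made up):

    /* Sketch: treat work_mem as a floor, so a hash_mem set below work_mem
     * is effectively ignored. */
    static int work_mem = 4096;    /* KB, stand-in for the real GUC */
    static int hash_mem = 1024;    /* KB, accidentally set below work_mem */

    static int
    effective_hash_mem(void)
    {
        return (hash_mem < work_mem) ? work_mem : hash_mem;
    }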

It might have made more sense to call the new GUC hash_work_mem
instead of hash_mem. I don't feel strongly about the name. Again, this
is just a starting point for further discussion.

--
Peter Geoghegan

0001-Add-a-GUC-that-limits-memory-use-for-hash-tables.patch (64K)

Re: Default setting for enable_hashagg_disk (hash_mem)

Justin Pryzby
On Thu, Jul 02, 2020 at 07:05:48PM -0700, Peter Geoghegan wrote:

> On Mon, Jun 29, 2020 at 2:46 PM Peter Geoghegan <[hidden email]> wrote:
> > On Mon, Jun 29, 2020 at 2:22 PM Tomas Vondra <[hidden email]> wrote:
> > > I agree with this, and I'm mostly OK with having hash_mem. In fact, from
> > > the proposals in this thread I like it the most - as long as it's used
> > > both during planning and execution. It's a pretty clear solution.
> >
> > Great.
> >
> > It's not trivial to write the patch, since there are a few tricky
> > cases. For example, maybe there is some subtlety in nodeAgg.c with
> > AGG_MIXED cases.
>
> Attached is an attempt at this. I have not been particularly thorough,

Thanks for putting it together, I agree that hash_mem seems to be an obvious
"escape hatch" that generalizes existing GUCs and independently useful.

> anything else). I still think that the new GUC should work as a
> multiplier of work_mem, or something else along those lines, though
> for now it's just an independent work_mem used for hashing. I bring it
> up again because I'm concerned about users that upgrade to Postgres 13
> incautiously, and find that hashing uses *less* memory than before.
> Many users probably get away with setting work_mem quite high across
> the board. At the very least, hash_mem should be ignored when it's set
> to below work_mem (which isn't what the patch does).

I feel it should be the same as work_mem, as it's written, and not a multiplier.

And actually I don't think a lower value should be ignored: "mechanism not
policy".  Do we refuse atypical values of maintenance_work_mem < work_mem ?

I assumed that hash_mem would default to -1, which would mean "fall back to
work_mem".  We'd then advise users to increase it if they have an issue in v13
with performance of hashes spilled to disk.  (And maybe in other cases, too.)
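
Roughly like this (a sketch of the -1 fallback described above; the helper
name is made up, not from the patch):

    /* Sketch: hash_mem = -1 means "no separate setting, use work_mem". */
    static int work_mem = 4096;    /* KB, stand-in for the real GUC */
    static int hash_mem = -1;      /* proposed default */

    static int
    resolve_hash_mem(void)
    {
        return (hash_mem == -1) ? work_mem : hash_mem;
    }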

I read the argument that hash tables are a better use of RAM than sorts.
However it seems like setting the default to greater than work_mem is a
separate change from providing the mechanism allowing the user to do so.  I guess
the change in default is intended to mitigate the worst possible behavior
change someone might experience in v13 hashing, and might be expected to
improve "out of the box" performance.  I'm not opposed to it, but it's not an
essential part of the patch.

In nodeHash.c, you missed an underscore:
+        * Target in-memory hashtable size is hashmem kilobytes.

--
Justin



Re: Default setting for enable_hashagg_disk (hash_mem)

Bruce Momjian
On Thu, Jul  2, 2020 at 09:46:49PM -0500, Justin Pryzby wrote:

> On Thu, Jul 02, 2020 at 07:05:48PM -0700, Peter Geoghegan wrote:
> > anything else). I still think that the new GUC should work as a
> > multiplier of work_mem, or something else along those lines, though
> > for now it's just an independent work_mem used for hashing. I bring it
> > up again because I'm concerned about users that upgrade to Postgres 13
> > incautiously, and find that hashing uses *less* memory than before.
> > Many users probably get away with setting work_mem quite high across
> > the board. At the very least, hash_mem should be ignored when it's set
> > to below work_mem (which isn't what the patch does).
>
> I feel it should be the same as work_mem, as it's written, and not a multiplier.
>
> And actually I don't think a lower value should be ignored: "mechanism not
> policy".  Do we refuse atypical values of maintenance_work_mem < work_mem ?
>
> I assumed that hash_mem would default to -1, which would mean "fall back to
> work_mem".  We'd then advise users to increase it if they have an issue in v13
> with performance of hashes spilled to disk.  (And maybe in other cases, too.)

Uh, with this patch, don't we really have sort_mem and hash_mem, with
hash_mem defaulting to sort_mem, or something like that?  If hash_mem is a
multiplier, it would make more sense to keep the work_mem name.

--
  Bruce Momjian  <[hidden email]>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Default setting for enable_hashagg_disk (hash_mem)

Bruce Momjian
On Thu, Jul  2, 2020 at 10:58:34PM -0400, Bruce Momjian wrote:

> On Thu, Jul  2, 2020 at 09:46:49PM -0500, Justin Pryzby wrote:
> > On Thu, Jul 02, 2020 at 07:05:48PM -0700, Peter Geoghegan wrote:
> > > anything else). I still think that the new GUC should work as a
> > > multiplier of work_mem, or something else along those lines, though
> > > for now it's just an independent work_mem used for hashing. I bring it
> > > up again because I'm concerned about users that upgrade to Postgres 13
> > > incautiously, and find that hashing uses *less* memory than before.
> > > Many users probably get away with setting work_mem quite high across
> > > the board. At the very least, hash_mem should be ignored when it's set
> > > to below work_mem (which isn't what the patch does).
> >
> > I feel it should be the same as work_mem, as it's written, and not a multiplier.
> >
> > And actually I don't think a lower value should be ignored: "mechanism not
> > policy".  Do we refuse atypical values of maintenance_work_mem < work_mem ?
> >
> > I assumed that hash_mem would default to -1, which would mean "fall back to
> > work_mem".  We'd then advise users to increase it if they have an issue in v13
> > with performance of hashes spilled to disk.  (And maybe in other cases, too.)
>
> Uh, with this patch, don't we really have sort_mem and hash_mem, with
> hash_mem defaulting to sort_mem, or something like that?  If hash_mem is a
> multiplier, it would make more sense to keep the work_mem name.

Also, I feel this is all out of scope for PG 13, frankly.  I think our
only option is to revert the hash spill entirely, and return to pre-PG 13
behavior, if people are too worried about hash performance regression.

--
  Bruce Momjian  <[hidden email]>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Default setting for enable_hashagg_disk (hash_mem)

Peter Geoghegan-4
On Thu, Jul 2, 2020 at 8:00 PM Bruce Momjian <[hidden email]> wrote:
> Also, I feel this is all out of scope for PG 13, frankly.  I think our
> only option is to revert the hash spill entirely, and return to pre-PG 13
> behavior, if people are too worried about hash performance regression.

But the problem isn't really the hashaggs-that-spill patch itself.
Rather, the problem is the way that work_mem is supposed to behave in
general, and the impact that that has on hash aggregate now that it
has finally been brought into line with every other kind of executor
node. There just isn't much reason to think that we should give the
same amount of memory to a groupagg + sort as a hash aggregate. The
patch more or less broke an existing behavior that is itself
officially broken. That is, the problem that we're trying to fix here
is only a problem to the extent that the previous scheme isn't really
operating as intended (because grouping estimates are inherently very
hard). A revert doesn't seem like it helps anyone.

I accept that the idea of inventing hash_mem to fix this problem now
is unorthodox. In a certain sense it solves problems beyond the
problems that we're theoretically obligated to solve now. But any
"more conservative" approach that I can think of seems like a big
mess.

--
Peter Geoghegan



Re: Default setting for enable_hashagg_disk (hash_mem)

Peter Geoghegan-4
In reply to this post by Justin Pryzby
On Thu, Jul 2, 2020 at 7:46 PM Justin Pryzby <[hidden email]> wrote:
> Thanks for putting it together, I agree that hash_mem seems to be an obvious
> "escape hatch" that generalizes existing GUCs and independently useful.

It is independently useful. It's a natural consequence of "being
honest" about work_mem and hashing.

> I feel it should be the same as work_mem, as it's written, and not a multiplier.
>
> And actually I don't think a lower value should be ignored: "mechanism not
> policy".  Do we refuse atypical values of maintenance_work_mem < work_mem ?

I see your point, but AFAIK maintenance_work_mem was not retrofit like
this. It seems different. (Unless we add the -1 behavior, perhaps)

> I assumed that hash_mem would default to -1, which would mean "fall back to
> work_mem".  We'd then advise users to increase it if they have an issue in v13
> with performance of hashes spilled to disk.  (And maybe in other cases, too.)

Yeah, this kind of -1 behavior could make sense.

> I read the argument that hash tables are a better use of RAM than sorts.
> However it seems like setting the default to greater than work_mem is a
> separate change from providing the mechanism allowing the user to do so.  I guess
> the change in default is intended to mitigate the worst possible behavior
> change someone might experience in v13 hashing, and might be expected to
> improve "out of the box" performance.  I'm not opposed to it, but it's not an
> essential part of the patch.

That's true.

> In nodeHash.c, you missed an underscore:
> +        * Target in-memory hashtable size is hashmem kilobytes.

Got it; thanks.

--
Peter Geoghegan



Re: Default setting for enable_hashagg_disk (hash_mem)

Bruce Momjian
In reply to this post by Peter Geoghegan-4
On Thu, Jul  2, 2020 at 08:35:40PM -0700, Peter Geoghegan wrote:

> But the problem isn't really the hashaggs-that-spill patch itself.
> Rather, the problem is the way that work_mem is supposed to behave in
> general, and the impact that that has on hash aggregate now that it
> has finally been brought into line with every other kind of executor
> node. There just isn't much reason to think that we should give the
> same amount of memory to a groupagg + sort as a hash aggregate. The
> patch more or less broke an existing behavior that is itself
> officially broken. That is, the problem that we're trying to fix here
> is only a problem to the extent that the previous scheme isn't really
> operating as intended (because grouping estimates are inherently very
> hard). A revert doesn't seem like it helps anyone.
>
> I accept that the idea of inventing hash_mem to fix this problem now
> is unorthodox. In a certain sense it solves problems beyond the
> problems that we're theoretically obligated to solve now. But any
> "more conservative" approach that I can think of seems like a big
> mess.

Well, the bottom line is that we are designing features during beta.
People are supposed to be testing PG 13 behavior during beta, including
optimizer behavior.  We don't even have a user report yet of a
regression compared to PG 12, or one that can't be fixed by increasing
work_mem.

If we add a new behavior to PG 13, we then have the pre-PG 13 behavior,
the pre-patch behavior, and the post-patch behavior.  How are people
supposed to test all of that?  Add to that that some don't even feel we
need a new behavior, which is delaying any patch from being applied.

--
  Bruce Momjian  <[hidden email]>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Default setting for enable_hashagg_disk (hash_mem)

Justin Pryzby
On Fri, Jul 03, 2020 at 10:08:08AM -0400, Bruce Momjian wrote:

> On Thu, Jul  2, 2020 at 08:35:40PM -0700, Peter Geoghegan wrote:
> > But the problem isn't really the hashaggs-that-spill patch itself.
> > Rather, the problem is the way that work_mem is supposed to behave in
> > general, and the impact that that has on hash aggregate now that it
> > has finally been brought into line with every other kind of executor
> > node. There just isn't much reason to think that we should give the
> > same amount of memory to a groupagg + sort as a hash aggregate. The
> > patch more or less broke an existing behavior that is itself
> > officially broken. That is, the problem that we're trying to fix here
> > is only a problem to the extent that the previous scheme isn't really
> > operating as intended (because grouping estimates are inherently very
> > hard). A revert doesn't seem like it helps anyone.
> >
> > I accept that the idea of inventing hash_mem to fix this problem now
> > is unorthodox. In a certain sense it solves problems beyond the
> > problems that we're theoretically obligated to solve now. But any
> > "more conservative" approach that I can think of seems like a big
> > mess.
>
> Well, the bottom line is that we are designing features during beta.
> People are supposed to be testing PG 13 behavior during beta, including
> optimizer behavior.  We don't even have a user report yet of a
> regression compared to PG 12, or one that can't be fixed by increasing
> work_mem.
>
> If we add a new behavior to PG 13, we then have the pre-PG 13 behavior,
> the pre-patch behavior, and the post-patch behavior.  How are people
> supposed to test all of that?  Add to that that some don't even feel we
> need a new behavior, which is delaying any patch from being applied.

If we default hash_mem to -1, the post-patch behavior would be the same as
the pre-patch behavior.

Actually, another reason it should be -1 is simply to reduce the minimum
essential number of GUCs everyone has to change or review on a new install of
a dedicated or nontrivial instance: shared_buffers, max_wal_size,
checkpoint_timeout, eff_cache_size, work_mem.

I don't think anybody said it before, but now it occurs to me that one
advantage of making hash_mem a multiplier (I'm thinking of
hash_mem_scale_factor) rather than an absolute is that one wouldn't need to
remember to increase hash_mem every time they increase work_mem.  Otherwise,
this is kind of a foot-gun: hash_mem would default to 16MB, and people
experiencing poor performance would increase work_mem to 256MB like they've
been doing for decades, and see no effect.  Or someone would increase work_mem
from 4MB to 256MB, which exceeds the hash_mem default of 16MB, so then (if Peter
has his way) hash_mem is ignored.

Due to these behaviors, I'll retract my previous preference:
| "I feel it should same as work_mem, as it's written, and not a multiplier."

I think the better ideas are:
 - hash_mem=-1
 - hash_mem_scale_factor=1 ?

Maybe as a separate patch we'd set default hash_mem_scale_factor=4, possibly
only in master and not in v13.
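
To spell out the scale-factor variant (again just a sketch; the GUC name is
only a suggestion and the defaults are the ones discussed above):

    /* Sketch: a multiplier tracks work_mem automatically, avoiding the
     * foot-gun where raising work_mem alone leaves hashing unaffected. */
    static int    work_mem = 4096;              /* KB, stand-in for the real GUC */
    static double hash_mem_scale_factor = 1.0;  /* proposed v13 default */

    static long
    hash_mem_limit_kb(void)
    {
        return (long) (work_mem * hash_mem_scale_factor);
    }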

--
Justin



Re: Default setting for enable_hashagg_disk (hash_mem)

Alvaro Herrera-9
In reply to this post by Bruce Momjian
On 2020-Jul-03, Bruce Momjian wrote:

> Well, the bottom line is that we are designing features during beta.

Well, we're designing a way for users to interact with the new feature.
The feature itself is already in, and it works well in general terms.  I
expect that the new behavior is a win in the majority of cases, and the
problem being discussed here will only manifest as a regression in
corner cases.  (I don't have data to back this up, but if this weren't
the case we would have realized it much earlier).

It seems to me we're designing a solution to a problem that was found
during testing, which seems perfectly acceptable to me.  I don't see
grounds for reverting the behavior and I haven't seen anyone suggesting
that it would be an appropriate solution to the issue.

> If we add a new behavior to PG 13, we then have the pre-PG 13 behavior,
> the pre-patch behavior, and the post-patch behavior.  How are people
> supposed to test all of that?

They don't have to.  We tell them that we added some new tweak for a new
pg13 feature in beta3 and that's it.

> Add to that that some don't even feel we
> need a new behavior, which is delaying any patch from being applied.

If we don't need any new behavior, then we would just close the open
item and call the current state good, no?

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

