Default setting for enable_hashagg_disk


Re: Default setting for enable_hashagg_disk

Robert Haas
On Mon, Jun 22, 2020 at 11:06 AM Justin Pryzby <[hidden email]> wrote:
> This was addressed in 92c58fd94801dd5c81ee20e26c5bb71ad64552a8
> https://wiki.postgresql.org/index.php?title=PostgreSQL_13_Open_Items&diff=34994&oldid=34993

I mean, that's fine, but I am trying to make a more general point
about priorities. Getting the GUCs right is a lot less important than
getting the feature right.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Default setting for enable_hashagg_disk

Jeff Davis
In reply to this post by Robert Haas
On Mon, 2020-06-22 at 10:52 -0400, Robert Haas wrote:
> So I feel like the really important thing here is to fix the cases
> that don't come out well with default settings.

...with the caveat that perfection is not something to expect from our
planner.

>  If we can't do that,
> then the feature is half-baked and maybe should not have been
> committed in the first place.

HashAgg started out half-baked at the dawn of time, and stayed that way
through version 12. Disk-based HashAgg was designed to fix it.

Other major planner features generally offer a way to turn them off
(e.g. parallelism, JIT), and we don't call those half-baked.

I agree that the single GUC added in v13 (hashagg_avoid_disk_plan) is
weird because it's half of a disable switch. But it's not weird because
of my changes in v13; it's weird because the planner behavior in v12
was weird. I hope not many people need to set it, and I hope we can
remove it soon.

If you think we will never be able to remove the GUC, then we should
think a little harder about whether we really need it. I am open to
that discussion, but I don't think the presence of this GUC implies
that disk-based hashagg is half-baked.

Regards,
        Jeff Davis





Re: Default setting for enable_hashagg_disk

Jeff Davis
In reply to this post by Robert Haas
On Mon, 2020-06-22 at 11:17 -0400, Robert Haas wrote:
> I mean, that's fine, but I am trying to make a more general point
> about priorities. Getting the GUCs right is a lot less important than
> getting the feature right.

What about the feature are you worried that we're getting wrong?

Regards,
        Jeff Davis





Re: Default setting for enable_hashagg_disk

Robert Haas
In reply to this post by Jeff Davis
On Mon, Jun 22, 2020 at 1:30 PM Jeff Davis <[hidden email]> wrote:
> On Mon, 2020-06-22 at 10:52 -0400, Robert Haas wrote:
> > So I feel like the really important thing here is to fix the cases
> > that don't come out well with default settings.
>
> ...with the caveat that perfection is not something to expect from our
> planner.

+1.

> >  If we can't do that,
> > then the feature is half-baked and maybe should not have been
> > committed in the first place.
>
> HashAgg started out half-baked at the dawn of time, and stayed that way
> through version 12. Disk-based HashAgg was designed to fix it.
>
> Other major planner features generally offer a way to turn them off
> (e.g. parallelism, JIT), and we don't call those half-baked.

Sure, and I'm not calling this half-baked either, but there is a
difference. JIT and parallelism are discrete features to a far greater
extent than this is. I think we can explain to people the pros and
cons of those things and ask them to make an intelligent choice about
whether they want them. You can say things like "well, JIT is liable
to make your queries run faster once they get going, but it adds to
the startup time and creates a dependency on LLVM" and the user can
decide whether they want that or not. At least to me, something like
this isn't so easy to consider as a separate feature. As you say:

> I agree that the single GUC added in v13 (hashagg_avoid_disk_plan) is
> weird because it's half of a disable switch. But it's not weird because
> of my changes in v13; it's weird because the planner behavior in v12
> was weird. I hope not many people need to set it, and I hope we can
> remove it soon.

The weirdness is the problem here, at least for me. Generally, I don't
like GUCs of the form give_me_the_old_strange_behavior=true, because
they tend to be either unnecessary (because nobody wants the old
strange behavior) or hard to eliminate (because the new behavior
is also strange and is not categorically better).

> If you think we will never be able to remove the GUC, then we should
> think a little harder about whether we really need it. I am open to
> that discussion, but I don't think the presence of this GUC implies
> that disk-based hashagg is half-baked.

I don't think it necessarily implies that either. I do however have
some concerns about people using the GUC as a crutch. I am slightly
worried that this is going to have hard-to-fix problems and that we'll
be stuck with the GUC for that reason. Now if that is the case, is
removing the GUC any better? Maybe not. These decisions are hard, and
I am not trying to pretend like I have all the answers.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Default setting for enable_hashagg_disk

Jeff Davis
On Mon, 2020-06-22 at 15:28 -0400, Robert Haas wrote:
> The weirdness is the problem here, at least for me. Generally, I don't
> like GUCs of the form give_me_the_old_strange_behavior=true

I agree with all of that in general.

> I don't think it necessarily implies that either. I do however have
> some concerns about people using the GUC as a crutch.

Another way of looking at it is that the weird behavior is already
there in v12, so there are already users relying on this weird behavior
as a crutch for some other planner mistake. The question is whether we
want to:

(a) take the weird behavior away now as a consequence of implementing
disk-based HashAgg; or
(b) support the weird behavior forever; or
(c) introduce a GUC now to help transition away from the weird behavior

The danger with (c) is that it gives users more time to become more
reliant on the weird behavior; and worse, a GUC could be seen as an
endorsement of the weird behavior rather than a path to eliminating it.
So we could intend to do (c) and end up with (b). We can mitigate this
with documentation warnings, perhaps.

> I am slightly worried that this is going to have hard-to-fix
> problems and that we'll be stuck with the GUC for that reason.

Without the GUC, it's basically a normal cost-based decision, with all
of the good and bad that comes with that.

> Now if that is the case, is
> removing the GUC any better? Maybe not. These decisions are hard, and
> I am not trying to pretend like I have all the answers.

I agree that there is no easy answer.

My philosophy here is: if a user does experience a plan regression due
to my change, would it be reasonable to tell them that we don't have
any escape hatch or transition period at all? That would be a tough
sell for such a common plan type.

Regards,
        Jeff Davis





Re: Default setting for enable_hashagg_disk

David Rowley
On Tue, 23 Jun 2020 at 08:24, Jeff Davis <[hidden email]> wrote:

> Another way of looking at it is that the weird behavior is already
> there in v12, so there are already users relying on this weird behavior
> as a crutch for some other planner mistake. The question is whether we
> want to:
>
> (a) take the weird behavior away now as a consequence of implementing
> disk-based HashAgg; or
> (b) support the weird behavior forever; or
> (c) introduce a GUC now to help transition away from the weird behavior
>
> The danger with (c) is that it gives users more time to become more
> reliant on the weird behavior; and worse, a GUC could be seen as an
> endorsement of the weird behavior rather than a path to eliminating it.
> So we could intend to do (c) and end up with (b). We can mitigate this
> with documentation warnings, perhaps.

So, I have a few thoughts on this subject. I understand both problem
cases have been mentioned before on this thread, but I want to
reiterate the two problem cases that we really would rather people
didn't hit.

They are:

1. Statistics underestimation can cause hashagg to be selected. The
executor will spill to disk in PG13.  Users may find performance
suffers as previously the query may have just overshot work_mem
without causing any OOM issues. Their I/O performance might be
terrible.
2. We might now choose to hash aggregate where pre PG13, we didn't
choose that because the hash table was estimated to be bigger than
work_mem. Hash agg might not be the best plan for the job.

For #1. We know users are forced to run smaller work_mems than they
might like as they need to allow for that random moment where all
backends happen to be doing that 5-way hash join all at the same time.
It seems reasonable that someone might want the old behaviour. They
may well be sitting on a timebomb that's about to OOM, but it would be
sad if someone's upgrade to PG13 was blocked on this, especially if
it's just due to some query that runs once per month but needs to
perform quickly.

For #2. This seems like a very legitimate requirement to me.  A user
might be unhappy that PG13 now hashaggs where before it sorted and
group aggregated, not because there's some issue with hashagg
spilling, but because that causes the node above the agg to become a
Hash Join rather than a Merge Join, and that's bad for some existing
reason. Our planner doing the wrong thing based on a lack of,
inaccurate, or out-of-date statistics is not Jeff's fault. Having the
ability to switch off a certain planner feature is just following
along with what we do today for many other node types.

As for GUCs to try to help the group of users who, *I'm certain*, will
have problems with PG13's plan choice: I think the overloaded
enable_hashagg option is a really nice compromise. We don't really
have any other executor node type that has multiple GUCs controlling
its behaviour, so I believe it would be nice to keep it that way.

How about:

enable_hashagg = "on" -- enables hashagg allowing it to freely spill
to disk as it pleases.
enable_hashagg = "trynospill" -- Planner will only choose hash_agg if
it thinks it won't spill (pre PG13 planner behaviour)
enable_hashagg = "neverspill" -- executor will *never* spill to disk
and can still OOM (NOT RECOMMENDED, but does give pre PG13 planner and
executor behaviour)
enable_hashagg = "off" -- planner does not consider hash agg, ever.
Same as what PG12 did for this setting.

Now, it's a bit weird to have "neverspill" as this is controlling
what's done in the executor from a planner GUC.  Likely we can just
work around that by having a new "allowhashspill" bool field in the
"Agg" struct that's set by the planner, say during createplan, which
controls whether nodeAgg.c is allowed to spill or not.  That'll also
allow PREPAREd plans to continue to do what they had planned to do
already.
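
For illustration, a rough sketch of that mechanism might look like the
following (every name here is made up for the example; none of this is
committed code):

/* plannodes.h: Agg grows a flag that the planner sets */
typedef struct Agg
{
    Plan        plan;
    AggStrategy aggstrategy;
    /* ... existing fields ... */
    bool        allowhashspill; /* is nodeAgg.c allowed to spill? */
} Agg;

/* createplan.c: derive the flag from the (hypothetical) GUC value */
agg->allowhashspill = (hashagg_mode != HASHAGG_NEVERSPILL);

/* nodeAgg.c: test the plan flag rather than the GUC, so that PREPAREd
 * plans keep doing what they were planned to do */
if (hashtable_exceeds_limit(aggstate) && node->allowhashspill)
    hashagg_spill_init(...);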

The thing I like about doing it this way is that:

a) it does not add any new GUCs
b) it semi-hides the weird values that we really wish nobody would
ever have to set inside a GUC that people have become used to just
setting to "on" and "off".

The thing I don't quite like about this idea is:
a) I wish the planner was perfect and we didn't need to do this.
b) It's a bit weird to overload a GUC that has a very booleanish name
to not be bool.

However, I also think it's pretty lightweight to support this. I
imagine a dozen lines of docs and likely about half a dozen lines per
GUC option in the planner.

And in the future, when our planner is perfect*, we can easily just
remove the enum values from the GUC that we no longer want to support.

David

* Yes I know that will never happen.



Re: Default setting for enable_hashagg_disk

Justin Pryzby
On Wed, Jun 24, 2020 at 02:11:57PM +1200, David Rowley wrote:
> On Tue, 23 Jun 2020 at 08:24, Jeff Davis <[hidden email]> wrote:
> > Another way of looking at it is that the weird behavior is already
> > there in v12, so there are already users relying on this weird behavior
> > as a crutch for some other planner mistake. The question is whether we
> > want to:

Yea - "behavior change" is a scenario for which it's hard to anticipate well
all the range of consequences.

> How about:
>
> enable_hashagg = "on" -- enables hashagg allowing it to freely spill
> to disk as it pleases.
> enable_hashagg = "trynospill" -- Planner will only choose hash_agg if
> it thinks it won't spill (pre PG13 planner behaviour)
> enable_hashagg = "neverspill" -- executor will *never* spill to disk
> and can still OOM (NOT RECOMMENDED, but does give pre PG13 planner and
> executor behaviour)
> enable_hashagg = "off" -- planner does not consider hash agg, ever.
> Same as what PG12 did for this setting.

+1

I like that this allows the new behavior as an *option* one *can* use rather
than a "behavior change" which is imposed on users and which users then *have*
to accommodate in postgres.

--
Justin



Re: Default setting for enable_hashagg_disk

Bruce Momjian
In reply to this post by David Rowley
On Wed, Jun 24, 2020 at 02:11:57PM +1200, David Rowley wrote:

> On Tue, 23 Jun 2020 at 08:24, Jeff Davis <[hidden email]> wrote:
> > Another way of looking at it is that the weird behavior is already
> > there in v12, so there are already users relying on this weird behavior
> > as a crutch for some other planner mistake. The question is whether we
> > want to:
> >
> > (a) take the weird behavior away now as a consequence of implementing
> > disk-based HashAgg; or
> > (b) support the weird behavior forever; or
> > (c) introduce a GUC now to help transition away from the weird behavior
> >
> > The danger with (c) is that it gives users more time to become more
> > reliant on the weird behavior; and worse, a GUC could be seen as an
> > endorsement of the weird behavior rather than a path to eliminating it.
> > So we could intend to do (c) and end up with (b). We can mitigate this
> > with documentation warnings, perhaps.
>
> So, I have a few thoughts on this subject. I understand both problem
> cases have been mentioned before on this thread, but I want to
> reiterate the two problem cases that we really would rather people
> didn't hit.

I appreciated this summary since I wasn't fully following the issues.

> As for GUCs to try to help the group of users who, *I'm certain*, will
> have problems with PG13's plan choice: I think the overloaded
> enable_hashagg option is a really nice compromise. We don't really
> have any other executor node type that has multiple GUCs controlling
> its behaviour, so I believe it would be nice to keep it that way.

So, in trying to anticipate how users will be affected by an API change,
I try to look at similar cases where we already have this behavior and
at how users reacted. Looking at the available join methods, I think
we have one such case.  We currently support:

        * nested loop with sequential scan
        * nested loop with index scan
        * hash join
        * merge join

It would seem merge join has almost the same complexities as the new
hash join code, since it can spill to disk doing sorts for merge joins,
and adjusting work_mem is the only way to control that spill to disk.  I
don't remember anyone complaining about spills to disk during merge
join, so I am unclear why we would need such a control for hash join.

--
  Bruce Momjian  <[hidden email]>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Default setting for enable_hashagg_disk

David Rowley
On Wed, 24 Jun 2020 at 21:06, Bruce Momjian <[hidden email]> wrote:
> I don't remember anyone complaining about spills to disk during
> merge join, so I am unclear why we would need such a control for
> hash join.

Hash aggregate, you mean?  The reason is that upgrading to PG13 can
cause a performance regression for an underestimated ndistinct on the
GROUP BY clause and cause hash aggregate to spill to disk where it
previously did everything in RAM.  Sure, that behaviour was never
what we wanted to happen, and Jeff has fixed that now, but the fact
remains that this does happen in the real world quite often and people
often get away with it, likely because work_mem is generally set to
some very conservative value.  Of course, there's also a bunch of
people that have been bitten by OOM due to this too. The "neverspill"
option wouldn't be for those people.  Certainly, it's possible that we
just tell these people to increase work_mem for this query, that way
they can set it to something reasonable and still get spilling if it's
really needed to save them from OOM, but the problem there is that
it's not very easy to go and set work_mem for a certain query.

FWIW, I wish that I wasn't suggesting we do this, but I am because it
seems simple enough to implement and it removes a pretty big roadblock
that might exist for a small subset of people wishing to upgrade to
PG13. It seems lightweight enough to maintain, at least until we
invent some better management of how many executor nodes we can have
allocating work_mem at once.

The suggestion I made was just based on asking myself the following
set of questions:

Hash Aggregate has been able to overflow work_mem since day 1, and now
we've completely changed that fact in PG13: is that likely to upset
anyone?  If so, should we go to the trouble of giving those people a
way of getting the old behaviour back? If we do want to help those
people, what's the best way to make those options available to them in
a way that lets us remove the special options with the least pain in
some future version of PostgreSQL?

I'd certainly be interested in hearing how other people would answer
those questions.

David



Re: Default setting for enable_hashagg_disk

Justin Pryzby
In reply to this post by Bruce Momjian
On Wed, Jun 24, 2020 at 05:06:28AM -0400, Bruce Momjian wrote:

> On Wed, Jun 24, 2020 at 02:11:57PM +1200, David Rowley wrote:
> > On Tue, 23 Jun 2020 at 08:24, Jeff Davis <[hidden email]> wrote:
> > > Another way of looking at it is that the weird behavior is already
> > > there in v12, so there are already users relying on this weird behavior
> > > as a crutch for some other planner mistake. The question is whether we
> > > want to:
> > >
> > > (a) take the weird behavior away now as a consequence of implementing
> > > disk-based HashAgg; or
> > > (b) support the weird behavior forever; or
> > > (c) introduce a GUC now to help transition away from the weird behavior
> > >
> > > The danger with (c) is that it gives users more time to become more
> > > reliant on the weird behavior; and worse, a GUC could be seen as an
> > > endorsement of the weird behavior rather than a path to eliminating it.
> > > So we could intend to do (c) and end up with (b). We can mitigate this
> > > with documentation warnings, perhaps.
> >
> > So, I have a few thoughts on this subject. I understand both problem
> > cases have been mentioned before on this thread, but I want to
> > reiterate the two problem cases that we really would rather people
> > didn't hit.
>
> I appreciated this summary since I wasn't fully following the issues.
>
> > As for GUCs to try to help the group of users who, *I'm certain*, will
> > have problems with PG13's plan choice: I think the overloaded
> > enable_hashagg option is a really nice compromise. We don't really
> > have any other executor node type that has multiple GUCs controlling
> > its behaviour, so I believe it would be nice to keep it that way.
...
> It would seem merge join has almost the same complexities as the new
> hash join code, since it can spill to disk doing sorts for merge joins,
> and adjusting work_mem is the only way to control that spill to disk.  I
> don't remember anyone complaining about spills to disk during merge
> join, so I am unclear why we would need such a control for hash join.

It looks like merge join was new in 8.3.  I don't think that's a good analogy,
since the old behavior was still available with enable_mergejoin=off.

I think a better analogy would be if we now changed sort nodes beneath merge
join to use at most 0.5*work_mem, with no way of going back to using
1.0*work_mem.

--
Justin



Re: Default setting for enable_hashagg_disk

Bruce Momjian
On Wed, Jun 24, 2020 at 07:38:43AM -0500, Justin Pryzby wrote:
> On Wed, Jun 24, 2020 at 05:06:28AM -0400, Bruce Momjian wrote:
> > It would seem merge join has almost the same complexities as the new
> > hash join code, since it can spill to disk doing sorts for merge joins,
> > and adjusting work_mem is the only way to control that spill to disk.  I
> > don't remember anyone complaining about spills to disk during merge
> > join, so I am unclear why we would need such a control for hash join.
>
> It looks like merge join was new in 8.3.  I don't think that's a good analogy,
> since the old behavior was still available with enable_mergejoin=off.

Uh, we don't guarantee backward compatibility in the optimizer.  You can
turn off hashagg if you want.  That doesn't get you to PG 12 behavior,
but we don't guarantee that.

--
  Bruce Momjian  <[hidden email]>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Default setting for enable_hashagg_disk

Bruce Momjian
In reply to this post by David Rowley
On Thu, Jun 25, 2020 at 12:24:29AM +1200, David Rowley wrote:
> On Wed, 24 Jun 2020 at 21:06, Bruce Momjian <[hidden email]> wrote:
> > I don't remember anyone complaining about spills to disk during
> > merge join, so I am unclear why we would need such a control for
> > hash join.
>
> Hash aggregate, you mean?   The reason is that upgrading to PG13 can

Yes, sorry.

> cause a performance regression for an underestimated ndistinct on the
> GROUP BY clause and cause hash aggregate to spill to disk where it
> previously did everything in RAM.  Sure, that behaviour was never
> what we wanted to happen, and Jeff has fixed that now, but the fact
> remains that this does happen in the real world quite often and people
> often get away with it, likely because work_mem is generally set to
> some very conservative value.  Of course, there's also a bunch of
> people that have been bitten by OOM due to this too. The "neverspill"
> option wouldn't be for those people.  Certainly, it's possible that we
> just tell these people to increase work_mem for this query, that way
> they can set it to something reasonable and still get spilling if it's
> really needed to save them from OOM, but the problem there is that
> it's not very easy to go and set work_mem for a certain query.

Well, my point is that merge join works that way, and no one has needed
a knob to avoid mergejoin if it is going to spill to disk.  If they are
adjusting work_mem to prevent spill of merge join, they can do the same
for hash agg.  We just need to document this in the release notes.

--
  Bruce Momjian  <[hidden email]>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Default setting for enable_hashagg_disk

Tom Lane
In reply to this post by Justin Pryzby
Justin Pryzby <[hidden email]> writes:
> On Wed, Jun 24, 2020 at 05:06:28AM -0400, Bruce Momjian wrote:
>> It would seem merge join has almost the same complexities as the new
>> hash join code, since it can spill to disk doing sorts for merge joins,
>> and adjusting work_mem is the only way to control that spill to disk.  I
>> don't remember anyone complaining about spills to disk during merge
>> join, so I am unclear why we would need such a control for hash join.

> It looks like merge join was new in 8.3.  I don't think that's a good analogy,
> since the old behavior was still available with enable_mergejoin=off.

Uh, what?  A look into our git history shows immediately that
nodeMergejoin.c has been there since the initial code import in 1996.

I tend to agree with Bruce that it's not very obvious that we need
another GUC knob here ... especially not one as ugly as this.
I'm especially against the "neverspill" option, because that makes a
single GUC that affects both the planner and executor independently.

If we feel we need something to let people have the v12 behavior
back, let's have
(1) enable_hashagg on/off --- controls planner, same as it ever was
(2) enable_hashagg_spill on/off --- controls executor by disabling spill

But I'm not really convinced that we need (2).

                        regards, tom lane



Re: Default setting for enable_hashagg_disk

Tomas Vondra
On Wed, Jun 24, 2020 at 01:29:56PM -0400, Tom Lane wrote:

>Justin Pryzby <[hidden email]> writes:
>> On Wed, Jun 24, 2020 at 05:06:28AM -0400, Bruce Momjian wrote:
>>> It would seem merge join has almost the same complexities as the new
>>> hash join code, since it can spill to disk doing sorts for merge joins,
>>> and adjusting work_mem is the only way to control that spill to disk.  I
>>> don't remember anyone complaining about spills to disk during merge
>>> join, so I am unclear why we would need such a control for hash join.
>
>> It looks like merge join was new in 8.3.  I don't think that's a good analogy,
>> since the old behavior was still available with enable_mergejoin=off.
>
>Uh, what?  A look into our git history shows immediately that
>nodeMergejoin.c has been there since the initial code import in 1996.
>
>I tend to agree with Bruce that it's not very obvious that we need
>another GUC knob here ... especially not one as ugly as this.
>I'm especially against the "neverspill" option, because that makes a
>single GUC that affects both the planner and executor independently.
>
>If we feel we need something to let people have the v12 behavior
>back, let's have
>(1) enable_hashagg on/off --- controls planner, same as it ever was
>(2) enable_hashagg_spill on/off --- controls executor by disabling spill
>

What if a user specifies

   enable_hashagg = on
   enable_hashagg_spill = off

and the estimates say the hashagg would need to spill to disk? Should
that disable the hashagg (in which case the second GUC affects both
executor and planner) or run it (in which case we knowingly ignore
work_mem, which seems wrong)?


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Default setting for enable_hashagg_disk

Tom Lane
Tomas Vondra <[hidden email]> writes:
> On Wed, Jun 24, 2020 at 01:29:56PM -0400, Tom Lane wrote:
>> If we feel we need something to let people have the v12 behavior
>> back, let's have
>> (1) enable_hashagg on/off --- controls planner, same as it ever was
>> (2) enable_hashagg_spill on/off --- controls executor by disabling spill

> What if a user specifies
>    enable_hashagg = on
>    enable_hashagg_spill = off

It would probably be reasonable for the planner to behave as it did
pre-v13, that is, not choose hashagg if it estimates that work_mem
would be exceeded.  (So, okay, that means enable_hashagg_spill
affects both planner and executor ... but ISTM it's just one
behavior not two.)
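
In the spirit of the existing enable_* checks, that might look
something like this in the planner (the GUC name is as proposed above;
the rest is purely illustrative):

/* with spilling disabled, fall back to the pre-v13 rule of not
 * choosing hashagg when it's estimated to exceed work_mem */
if (!enable_hashagg_spill &&
    estimated_hashtable_size > work_mem * 1024L)
    return;     /* don't generate a hashagg path */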

                        regards, tom lane



Re: Default setting for enable_hashagg_disk

Andres Freund
In reply to this post by David Rowley
Hi,

On 2020-06-24 14:11:57 +1200, David Rowley wrote:
> 1. Statistics underestimation can cause hashagg to be selected. The
> executor will spill to disk in PG13.  Users may find performance
> suffers as previously the query may have just overshot work_mem
> without causing any OOM issues. Their I/O performance might be
> terrible.

> 2. We might now choose to hash aggregate where pre PG13, we didn't
> choose that because the hash table was estimated to be bigger than
> work_mem. Hash agg might not be the best plan for the job.

> For #1. We know users are forced to run smaller work_mems than they
> might like as they need to allow for that random moment where all
> backends happen to be doing that 5-way hash join all at the same time.
> It seems reasonable that someone might want the old behaviour. They
> may well be sitting on a timebomb that's about to OOM, but it would be
> sad if someone's upgrade to PG13 was blocked on this, especially if
> it's just due to some query that runs once per month but needs to
> perform quickly.

I'm quite concerned about this one. I think this isn't just going to hit
when the planner mis-estimates ndistinct, but also when transition
values use a bit more space. We'll now start spilling in cases where
the < v13 planner did everything right.

That's great for cases where we'd otherwise OOM, but for a lot of other
cases where there actually is more than sufficient RAM to overrun
work_mem by a single-digit factor, it can cause a pretty massive
increase of IO over < v13.


FWIW, my gut feeling is that we'll end up having to separate the
"execution time" spilling from using plain work_mem, because it'll
trigger spilling too often. E.g. if the plan isn't expected to spill,
only spill at 10 x work_mem or something like that.  Or we'll need
better management of temp file data when there's plenty of memory
available.


> For #2. This seems like a very legitimate requirement to me.  A user
> might be unhappy that PG13 now hashaggs where before it sorted and
> group aggregated, not because there's some issue with hashagg
> spilling, but because that causes the node above the agg to become a
> Hash Join rather than a Merge Join, and that's bad for some existing
> reason. Our planner doing the wrong thing based on a lack of,
> inaccurate, or out-of-date statistics is not Jeff's fault. Having the
> ability to switch off a certain planner feature is just following
> along with what we do today for many other node types.

This one concerns me a bit less, fwiw. There's a lot more "pressure" in
the planner to choose hash agg or sorted agg, compared to e.g. a bunch
of aggregate states taking up a bit more space (which we can't estimate
at all for many aggregates).


> As for GUCs to try to help the group of users who, *I'm certain*, will
> have problems with PG13's plan choice: I think the overloaded
> enable_hashagg option is a really nice compromise. We don't really
> have any other executor node type that has multiple GUCs controlling
> its behaviour, so I believe it would be nice to keep it that way.
>
> How about:
>
> enable_hashagg = "on" -- enables hashagg allowing it to freely spill
> to disk as it pleases.
> enable_hashagg = "trynospill" -- Planner will only choose hash_agg if
> it thinks it won't spill (pre PG13 planner behaviour)
> enable_hashagg = "neverspill" -- executor will *never* spill to disk
> and can still OOM (NOT RECOMMENDED, but does give pre PG13 planner and
> executor behaviour)
> enable_hashagg = "off" -- planner does not consider hash agg, ever.
> Same as what PG12 did for this setting.
>
> Now, it's a bit weird to have "neverspill" as this is controlling
> what's done in the executor from a planner GUC.  Likely we can just
> work around that by having a new "allowhashspill" bool field in the
> "Agg" struct that's set by the planner, say during createplan that
> controls if nodeAgg.c is allowed to spill or not.  That'll also allow
> PREPAREd plans to continue to do what they had planned to do already.
>
> The thing I like about doing it this way is that:
>
> a) it does not add any new GUCs
> b) it semi-hides the weird values that we really wish nobody would
> ever have to set inside a GUC that people have become used to just
> setting to "on" and "off".
>
> The thing I don't quite like about this idea is:
> a) I wish the planner was perfect and we didn't need to do this.
> b) It's a bit weird to overload a GUC that has a very booleanish name
> to not be bool.
>
> However, I also think it's pretty lightweight to support this. I
> imagine a dozen lines of docs and likely about half a dozen lines per
> GUC option in the planner.

That'd work for me, but I honestly don't particularly care about the
specific naming, as long as we provide users an escape hatch from the
increased amount of IO.


Greetings,

Andres Freund



Re: Default setting for enable_hashagg_disk

Andres Freund
In reply to this post by Bruce Momjian
Hi,

On 2020-06-24 13:12:03 -0400, Bruce Momjian wrote:
> Well, my point is that merge join works that way, and no one has needed
> a knob to avoid mergejoin if it is going to spill to disk.  If they are
> adjusting work_mem to prevent spill of merge join, they can do the same
> for hash agg.  We just need to document this in the release notes.

I don't think this is comparable. For starters, the IO indirectly
triggered by mergejoin actually leads to plenty of people just straight
out disabling it. For lots of workloads there's never a valid reason to
use a mergejoin (and often the planner will never choose one). Secondly,
the planner has better information about estimating the memory usage for
the to-be-sorted data than it has about the size of the transition
values. And lastly, there's a difference between a long-existing cause
for bad IO behaviour and one that suddenly kicks in after a major
version upgrade, to which there's no escape hatch (it's rarely realistic
to disable hash aggs, in contrast to merge joins).

Greetings,

Andres Freund



Re: Default setting for enable_hashagg_disk

Robert Haas
In reply to this post by Andres Freund
On Wed, Jun 24, 2020 at 3:14 PM Andres Freund <[hidden email]> wrote:
> FWIW, my gut feeling is that we'll end up having to separate the
> "execution time" spilling from using plain work_mem, because it'll
> trigger spilling too often. E.g. if the plan isn't expected to spill,
> only spill at 10 x work_mem or something like that.  Or we'll need
> better management of temp file data when there's plenty of memory
> available.

So, I don't think we can wire in a constant like 10x. That's really
unprincipled and I think it's a bad idea. What we could do, though, is
replace the existing Boolean-valued GUC with a new GUC that controls
the size at which the aggregate spills. The default could be -1,
meaning work_mem, but a user could configure a larger value if desired
(presumably, we would just treat a value smaller than work_mem as
work_mem, and document the same).
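
For instance (hashagg_spill_limit is a made-up name here, and this is
only a sketch of the idea, not a worked-out patch):

/* -1 means "spill at work_mem", i.e. the current behaviour */
int hashagg_spill_limit = -1;       /* in KB, like work_mem */

static Size
hashagg_spill_threshold(void)
{
    /* values below work_mem (including -1) are treated as work_mem */
    if (hashagg_spill_limit < work_mem)
        return (Size) work_mem * 1024L;
    return (Size) hashagg_spill_limit * 1024L;
}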

I think that's actually pretty appealing. Separating the memory we
plan to use from the memory we're willing to use before spilling seems
like a good idea in general, and I think we should probably also do it
in other places - like sorts.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Default setting for enable_hashagg_disk

Andres Freund
In reply to this post by Tom Lane
Hi,

On 2020-06-24 14:40:50 -0400, Tom Lane wrote:

> Tomas Vondra <[hidden email]> writes:
> > On Wed, Jun 24, 2020 at 01:29:56PM -0400, Tom Lane wrote:
> >> If we feel we need something to let people have the v12 behavior
> >> back, let's have
> >> (1) enable_hashagg on/off --- controls planner, same as it ever was
> >> (2) enable_hashagg_spill on/off --- controls executor by disabling spill
>
> > What if a user specifies
> >    enable_hashagg = on
> >    enable_hashagg_spill = off
>
> It would probably be reasonable for the planner to behave as it did
> pre-v13, that is, not choose hashagg if it estimates that work_mem
> would be exceeded.  (So, okay, that means enable_hashagg_spill
> affects both planner and executor ... but ISTM it's just one
> behavior not two.)

There's two different reasons for spilling in the executor right now:

1) The planner estimated that we'd need to spill, and that turns out to
   be true. There seems no reason to not spill in that case (as long as
   it's enabled/chosen in the planner).

2) The planner didn't think we'd need to spill, but we end up using more
   than work_mem memory.

nodeAgg.c already treats those separately:

void
hash_agg_set_limits(double hashentrysize, uint64 input_groups, int used_bits,
                                        Size *mem_limit, uint64 *ngroups_limit,
                                        int *num_partitions)
{
        int npartitions;
        Size partition_mem;

        /* if not expected to spill, use all of work_mem */
        if (input_groups * hashentrysize < work_mem * 1024L)
        {
                if (num_partitions != NULL)
                        *num_partitions = 0;
                *mem_limit = work_mem * 1024L;
                *ngroups_limit = *mem_limit / hashentrysize;
                return;
        }

We can't sensibly disable spilling when chosen at plan time, because
that'd lead to *more* OOMs than in v12.

ISTM that we should have one option that controls whether 1) is done,
and one that controls whether 2) is done. Even if the option for 2 is
off, we still should spill when the option for 1) chooses a spilling
plan.  I don't think it makes sense for one of those options to
influence the other implicitly.

So maybe enable_hashagg_spilling_plan for 1) and
hashagg_spill_on_exhaust for 2).
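
Roughly like this, i.e. the executor-side decision would consult the
plan first (the GUC names are as proposed above; the surrounding code,
including the planned_to_spill field, is purely illustrative):

/* 1) may the planner choose a plan that is expected to spill? */
bool enable_hashagg_spilling_plan = true;

/* 2) may the executor spill when memory is exhausted unexpectedly? */
bool hashagg_spill_on_exhaust = true;

/* nodeAgg.c, on exceeding the memory limit at execution time */
if (aggstate->planned_to_spill || hashagg_spill_on_exhaust)
    hashagg_spill_init(...);    /* a spilling plan is always honoured */
else
    ; /* keep filling the hash table, accepting the v12-style OOM risk */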

Greetings,

Andres Freund



Re: Default setting for enable_hashagg_disk

Andres Freund
In reply to this post by Robert Haas
Hi,

On 2020-06-24 15:28:47 -0400, Robert Haas wrote:

> On Wed, Jun 24, 2020 at 3:14 PM Andres Freund <[hidden email]> wrote:
> > FWIW, my gut feeling is that we'll end up having to separate the
> > "execution time" spilling from using plain work_mem, because it'll
> > trigger spilling too often. E.g. if the plan isn't expected to spill,
> > only spill at 10 x work_mem or something like that.  Or we'll need
> > better management of temp file data when there's plenty of memory
> > available.
>
> So, I don't think we can wire in a constant like 10x. That's really
> unprincipled and I think it's a bad idea. What we could do, though, is
> replace the existing Boolean-valued GUC with a new GUC that controls
> the size at which the aggregate spills. The default could be -1,
> meaning work_mem, but a user could configure a larger value if desired
> (presumably, we would just treat a value smaller than work_mem as
> work_mem, and document the same).

To be clear, I wasn't actually thinking of hard-coding 10x, but having a
config option that specifies a factor of work_mem. A factor seems better
because it'll work reasonably for different values of work_mem, whereas
a concrete size wouldn't.
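
Say (with hashagg_spill_factor as a hypothetical GUC name):

/* multiplier applied to work_mem for unplanned spills */
double hashagg_spill_factor = 1.0;

/* planned spills keep using plain work_mem; only the "didn't expect
 * to spill" threshold scales with the factor */
Size spill_threshold = (Size) (work_mem * 1024L * hashagg_spill_factor);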


> I think that's actually pretty appealing. Separating the memory we
> plan to use from the memory we're willing to use before spilling seems
> like a good idea in general, and I think we should probably also do it
> in other places - like sorts.

Indeed. And then perhaps we could eventually add some reporting /
monitoring infrastructure for the cases where plan-time and
execution-time memory estimates/usage widely differ.

Greetings,

Andres Freund

