accounting for memory used for BufFile during hash joins

accounting for memory used for BufFile during hash joins

Tomas Vondra-4
Hi,

I'm starting this thread mostly to keep track of patches developed in
response to issue [1] reported on pgsql-performance. The symptoms are
very simple - query performing a hash join ends up using much more
memory than expected (pretty much ignoring work_mem), and possibly
ending up with OOM.

The root cause is that hash join treats batches as pretty much free, but
that's not really true - we do allocate two BufFile structs per batch,
and each BufFile is ~8kB as it includes PGAlignedBuffer.

This is not ideal even if we happen to estimate everything correctly,
because for example with work_mem=4MB and nbatch=1024, it means we'll
use about 16MB (2*8kB*1024) for the BufFile structures alone, plus the
work_mem for hash table itself.

But it can easily explode when we under-estimate the hash side. In the
pgsql-performance message, the hash side (with the patches applied,
allowing the query to complete) looks like this:

  Hash (cost=2823846.37..2823846.37 rows=34619 width=930)
       (actual time=252946.367..252946.367 rows=113478127 loops=1)

So it's 3277x under-estimated. It starts with 16 batches, and ends up
adding more and more batches until it fails with 524288 of them (it gets
to that many batches because some of the values are very common and we
don't disable the growth earlier).

The OOM is not very surprising, because with 524288 batches it'd need
about 8GB of memory, and the system only has 8GB RAM installed.
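
Just to make the arithmetic explicit, here is a trivial standalone sketch
(plain C, not PostgreSQL code), assuming ~8kB per BufFile and two BufFiles
per batch as described above:

    #include <stdio.h>

    /* Rough BufFile overhead for a hash join: two BufFiles per batch
     * (inner + outer side), each ~8kB due to the embedded PGAlignedBuffer. */
    static double
    buffile_overhead_mb(long nbatch)
    {
        return nbatch * 2 * 8192.0 / (1024 * 1024);
    }

    int
    main(void)
    {
        printf("nbatch=1024   -> ~%.0f MB\n", buffile_overhead_mb(1024));    /* ~16 MB */
        printf("nbatch=524288 -> ~%.0f MB\n", buffile_overhead_mb(524288));  /* 8192 MB, i.e. ~8GB */
        return 0;
    }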

The two attached patches both account for the BufFile memory, but then
use very different strategies when the work_mem limit is reached.

The first patch realizes it's impossible to keep adding batches without
breaking the work_mem limit, because at some point the BufFiles alone will need
more memory than that. But it does not make sense to stop adding batches
entirely, because then the hash table could grow indefinitely.

So the patch abandons the idea of enforcing work_mem in this situation,
and instead attempts to minimize memory usage over time - it increases
the spaceAllowed in a way that ensures doubling the number of batches
actually reduces memory usage in the long run.
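
As a rough illustration of that idea (a hypothetical sketch of the general
rule, with made-up names, not the actual patch code): total memory is
roughly the hash table data (bounded by spaceAllowed) plus the BufFile
overhead, and doubling nbatch only pays off while the expected saving
exceeds the added overhead.

    #include <stddef.h>

    /*
     * Hypothetical sketch: doubling nbatch halves the per-batch hash table
     * data (saving roughly space_allowed/2) but adds another 2 * 8kB per
     * existing batch of BufFile overhead. Once the saving no longer exceeds
     * the added overhead, grow space_allowed instead of adding batches, so
     * that future doublings still reduce total memory use.
     */
    static void
    maybe_grow_space_allowed(size_t *space_allowed, long nbatch)
    {
        size_t added_overhead = (size_t) nbatch * 2 * 8192;   /* new BufFiles */

        if (*space_allowed / 2 <= added_overhead)
            *space_allowed = 4 * added_overhead;   /* keep the two roughly in balance */
    }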

The second patch tries to enforce work_mem more strictly. That would be
impossible if we were to keep all the BufFile structs in memory, so
instead it slices the batches into chunks that fit into work_mem, and
then uses a single "overflow" file for slices currently not in memory.
These extra slices can't be counted into work_mem, but we should need
just very few of them. For example with work_mem=4MB the slice is 128
batches, so we need 128x less overflow files (compared to per-batch).
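
For reference, the 128-batch slice follows directly from that budget,
assuming the in-memory BufFile arrays are capped at roughly work_mem/2
(a standalone arithmetic sketch):

    #include <stdio.h>

    int
    main(void)
    {
        long work_mem_kb = 4 * 1024;    /* work_mem = 4MB */
        long buffile_kb = 8;            /* ~8kB per BufFile */

        /* two BufFiles per batch (inner + outer), budget of work_mem/2 */
        long slice = (work_mem_kb / 2) / (2 * buffile_kb);

        printf("batches per slice: %ld\n", slice);   /* 128 */
        return 0;
    }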


Neither of those patches tweaks ExecChooseHashTableSize() to consider
memory needed for BufFiles while deciding how many batches will be
needed. That's something that probably needs to happen, but it would not
help with the underestimate issue.

I'm not entirely sure which of those approaches is the right one. The
first one is clearly just a "damage control" for cases where the hash
side turned out to be much larger than we expected. With good estimates
we probably would not have picked a hash join for those (that is, we
should have realized we can't stay within work_mem and prohibited the hash join).

The second patch however makes hash join viable for some of those cases,
and it seems to work pretty well (there are some numbers in the message
posted to the pgsql-performance thread). So I kinda like this second one.

It's all just PoC quality at this point, far from a committable state.


[1] https://www.postgresql.org/message-id/flat/bc138e9f-c89e-9147-5395-61d51a757b3b%40gusw.net


--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment: v2-simple-rebalance.patch (5K)
Attachment: v4-per-slice-overflow-file.patch (27K)

Re: accounting for memory used for BufFile during hash joins

Melanie Plageman


On Fri, May 3, 2019 at 5:34 PM Tomas Vondra <[hidden email]> wrote:

The second patch tries to enforce work_mem more strictly. That would be
impossible if we were to keep all the BufFile structs in memory, so
instead it slices the batches into chunks that fit into work_mem, and
then uses a single "overflow" file for slices currently not in memory.
These extra slices can't be counted into work_mem, but we should need
just very few of them. For example with work_mem=4MB the slice is 128
batches, so we need 128x less overflow files (compared to per-batch).

I want to see if I understand the implications of the per-slice-overflow patch
for execution of hashjoin:
For each bucket in the hashtable, when attempting to double the number of
batches, if the memory that the BufFile structs will occupy once this is done
will exceed the work_mem, split each batch into slices that fit into memory.
This means that, for each probe-side tuple hashing to that bucket, you have to
load every slice of each batch separately into memory to ensure correct results.
Is this right?
 

I'm not entirely sure which of those approaches is the right one. The
first one is clearly just a "damage control" for cases where the hash
side turned out to be much larger than we expected. With good estimates
we probably would not have picked a hash join for those (that is, we
should have realized we can't keep work_mem and prohibit hash join).

The second patch however makes hash join viable for some of those cases,
and it seems to work pretty well (there are some numbers in the message
posted to pgsql-performance thread). So I kinda like this second one.

So, my initial reaction after taking a look at the patches is that I prefer the
first approach--increasing the resize threshold. The second patch, the
per-slice-overflow patch, adds a major new mechanic to hashjoin in order to
address what is, based on my understanding, an edge case.

--
Melanie Plageman

Re: accounting for memory used for BufFile during hash joins

Thomas Munro-5
On Tue, May 7, 2019 at 9:58 AM Melanie Plageman
<[hidden email]> wrote:

> On Fri, May 3, 2019 at 5:34 PM Tomas Vondra <[hidden email]> wrote:
>> The second patch tries to enforce work_mem more strictly. That would be
>> impossible if we were to keep all the BufFile structs in memory, so
>> instead it slices the batches into chunks that fit into work_mem, and
>> then uses a single "overflow" file for slices currently not in memory.
>> These extra slices can't be counted into work_mem, but we should need
>> just very few of them. For example with work_mem=4MB the slice is 128
>> batches, so we need 128x less overflow files (compared to per-batch).
>>
> I want to see if I understand the implications of the per-slice-overflow patch
> for execution of hashjoin:
> For each bucket in the hashtable, when attempting to double the number of
> batches, if the memory that the BufFile structs will occupy once this is done
> will exceed the work_mem, split each batch into slices that fit into memory.
> This means that, for each probe-side tuple hashing to that bucket, you have to
> load every slice of each batch separately into memory to ensure correct results.
> Is this right?

Seems expensive for large numbers of slices -- you need to join the
outer batch against each inner slice.  But I wonder how we'd deal with
outer joins, as Tom Lane asked in another thread:

https://www.postgresql.org/message-id/12185.1488932980%40sss.pgh.pa.us

>> I'm not entirely sure which of those approaches is the right one. The
>> first one is clearly just a "damage control" for cases where the hash
>> side turned out to be much larger than we expected. With good estimates
>> we probably would not have picked a hash join for those (that is, we
>> should have realized we can't keep work_mem and prohibit hash join).
>>
>> The second patch however makes hash join viable for some of those cases,
>> and it seems to work pretty well (there are some numbers in the message
>> posted to pgsql-performance thread). So I kinda like this second one.
>>
> So, my initial reaction after taking a look at the patches is that I prefer the
> first approach--increasing the resize threshhold. The second patch, the
> per-slice-overflow patch, adds a major new mechanic to hashjoin in order to
> address what is, based on my understanding, an edge case.

Personally I'd like to make work_mem more reliable, even if it takes a
major new mechanism.

Stepping back a bit, I think there is something fishy about the way we
detect extreme skew.  Is that a factor in this case?  Right now we
wait until we have a batch that gets split into child batches
containing exactly 0% and 100% of the tuples before we give up.
Previously I had thought of that as merely a waste of time, but
clearly it's also a waste of unmetered memory.  Oops.

I think our extreme skew detector should go off sooner, because
otherwise if you have N nicely distributed unique keys and also M
duplicates of one bad egg key that'll never fit in memory, we keep
repartitioning until none of the N keys fall into the batch containing
the key for the M duplicates before we give up!  You can use
balls-into-bins maths to figure out the number, but I think that means
we expect to keep splitting until we have N * some_constant batches,
and that's just silly and liable to create massive numbers of
partitions proportional to N, even though we're trying to solve a
problem with M.  In another thread I suggested we should stop when
(say) 95% of the tuples go to one child batch.  I'm not sure how you
pick the number.
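
To make that concrete, a minimal sketch of such a relaxed check might look
like this (hypothetical code, just illustrating the 95% rule; today the
detector effectively requires an exact 0%/100% split):

    #include <stdbool.h>

    #define SKEW_LIMIT 0.95

    /*
     * Hypothetical relaxed skew check: give up on splitting as soon as one
     * child batch receives at least SKEW_LIMIT of the tuples, rather than
     * waiting for a 0%/100% split.
     */
    static bool
    split_is_too_skewed(long ntuples_child_a, long ntuples_child_b)
    {
        double total = (double) ntuples_child_a + ntuples_child_b;

        if (total <= 0)
            return false;

        return (ntuples_child_a / total >= SKEW_LIMIT) ||
               (ntuples_child_b / total >= SKEW_LIMIT);
    }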

Of course that doesn't solve the problem that we don't have a better
plan for dealing with the M duplicates -- it just avoids needless
batch explosions triggered by bad maths.  I think we need something
like Tomas's #2, or a way to switch to sort-merge, or some other
scheme.  I'm not sure how to compare the slice idea, which involves
processing outer tuples * inner slices with the sort-merge idea, which
involves sorting the inner and outer batch, plus the entirely new
concept of switching to another node at execution time.

I also wondered about reducing the buffer size of the BufFiles, but
that doesn't seem to be fixing the real problem.

--
Thomas Munro
https://enterprisedb.com



Re: accounting for memory used for BufFile during hash joins

Tomas Vondra-4
On Tue, May 07, 2019 at 01:48:40PM +1200, Thomas Munro wrote:

>On Tue, May 7, 2019 at 9:58 AM Melanie Plageman
><[hidden email]> wrote:
>> On Fri, May 3, 2019 at 5:34 PM Tomas Vondra <[hidden email]> wrote:
>>> The second patch tries to enforce work_mem more strictly. That would be
>>> impossible if we were to keep all the BufFile structs in memory, so
>>> instead it slices the batches into chunks that fit into work_mem, and
>>> then uses a single "overflow" file for slices currently not in memory.
>>> These extra slices can't be counted into work_mem, but we should need
>>> just very few of them. For example with work_mem=4MB the slice is 128
>>> batches, so we need 128x less overflow files (compared to per-batch).
>>>
>> I want to see if I understand the implications of the per-slice-overflow patch
>> for execution of hashjoin:
>> For each bucket in the hashtable, when attempting to double the number of
>> batches, if the memory that the BufFile structs will occupy once this is done
>> will exceed the work_mem, split each batch into slices that fit into memory.
>> This means that, for each probe-side tuple hashing to that bucket, you have to
>> load every slice of each batch separately into memory to ensure correct results.
>> Is this right?
>

>Seems expensive for large numbers of slices -- you need to join the
>outer batch against each inner slice.

Nope, that's not how it works. It's the array of batches that gets
sliced, not the batches themselves.

It does slightly increase the amount of data we need to shuffle between
the temp files, because we can't write the data directly to batches in
"future" slices. But that amplification is capped to ~2.2x (compared to
the ~1.4x in master) - I've shared some measurements in [1].

[1] https://www.postgresql.org/message-id/20190428141901.5dsbge2ka3rxmpk6%40development

>But I wonder how we'd deal with outer joins, as Tom Lane asked in
>another thread:
>
>https://www.postgresql.org/message-id/12185.1488932980%40sss.pgh.pa.us
>

That seems unrelated - we slice the array of batches, to keep memory
needed for BufFile under control. The hash table remains intact, so
there's no issue with outer joins.

>>> I'm not entirely sure which of those approaches is the right one. The
>>> first one is clearly just a "damage control" for cases where the hash
>>> side turned out to be much larger than we expected. With good estimates
>>> we probably would not have picked a hash join for those (that is, we
>>> should have realized we can't keep work_mem and prohibit hash join).
>>>
>>> The second patch however makes hash join viable for some of those cases,
>>> and it seems to work pretty well (there are some numbers in the message
>>> posted to pgsql-performance thread). So I kinda like this second one.
>>>
>> So, my initial reaction after taking a look at the patches is that I prefer the
>> first approach--increasing the resize threshhold. The second patch, the
>> per-slice-overflow patch, adds a major new mechanic to hashjoin in order to
>> address what is, based on my understanding, an edge case.
>
>Personally I'd like to make work_mem more reliable, even if it takes a
>major new mechanism.
>

Yeah, I share that attitude.

>Stepping back a bit, I think there is something fishy about the way we
>detect extreme skew.  Is that a factor in this case?  Right now we
>wait until we have a batch that gets split into child batches
>containing exactly 0% and 100% of the tuples before we give up.
>Previously I had thought of that as merely a waste of time, but
>clearly it's also a waste of unmetered memory.  Oops.
>

Yes, that was a factor in the reported query - the data set contained a
significant number of duplicate values (~10%) but it took a while to
disable growth because there always happened to be a couple of rows with a
different value.

>I think our extreme skew detector should go off sooner, because
>otherwise if you have N nicely distributed unique keys and also M
>duplicates of one bad egg key that'll never fit in memory, we keep
>repartitioning until none of the N keys fall into the batch containing
>the key for the M duplicates before we give up!  You can use
>balls-into-bins maths to figure out the number, but I think that means
>we expect to keep splitting until we have N * some_constant batches,
>and that's just silly and liable to create massive numbers of
>partitions proportional to N, even though we're trying to solve a
>problem with M.  In another thread I suggested we should stop when
>(say) 95% of the tuples go to one child batch.  I'm not sure how you
>pick the number.
>

I agree we should relax the 0%/100% split condition, and disable the
growth sooner. But I think we should also re-evaluate that decision
after a while - the data set may be correlated in some way, in which
case we may disable the growth prematurely. It may not reduce memory
usage now, but it may help in the future.

It's already an issue, but it would be even more likely if we disabled
growth e.g. with just 5%/95% splits.

FWIW I believe this is a mostly orthogonal issue to what's discussed in
this thread.

>Of course that doesn't solve the problem that we don't have a better
>plan for dealing with the M duplicates -- it just avoids a needless
>batch explosions triggered by bad maths.  I think we need something
>like Tomas's #2, or a way to switch to sort-merge, or some other
>scheme.  I'm not sure how to compare the slice idea, which involves
>processing outer tuples * inner slices with the sort-merge idea, which
>involves sorting the inner and outer batch, plus the entirely new
>concept of switching to another node at execution time.
>

Do we actually check how many duplicates are there during planning? I
wonder if we could penalize (or even disable) hashjoins when there are
too many duplicates to fit into work_mem. Of course, that's going to be
tricky with filtering, and so on.

Switching to some other algorithm during execution moves the goal posts
to the next galaxy, I'm afraid.

>I also wondered about reducing the buffer size of the BufFiles, but
>that doesn't seem to be fixing the real problem.
>

Yeah. It might help a bit, but it's very limited - even if you reduce
the buffer to say 1kB, it's just a factor of 8. And I'm not sure what
would be the impact on performance.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: accounting for memory used for BufFile during hash joins

Tom Lane-2
Tomas Vondra <[hidden email]> writes:
> Do we actually check how many duplicates are there during planning?

Certainly that's part of the planner's cost estimates ... but it's
only as good as the planner's statistical knowledge.

                        regards, tom lane



Re: accounting for memory used for BufFile during hash joins

Thomas Munro-5
In reply to this post by Tomas Vondra-4
On Tue, May 7, 2019 at 3:15 PM Tomas Vondra
<[hidden email]> wrote:
> On Tue, May 07, 2019 at 01:48:40PM +1200, Thomas Munro wrote:
> >Seems expensive for large numbers of slices -- you need to join the
> >outer batch against each inner slice.
>
> Nope, that's not how it works. It's the array of batches that gets
> sliced, not the batches themselves.

Sorry, I read only the description and not the code, and got confused
about that.  So, I see three separate but related problems:

A.  Broken escape valve:  sometimes we generate a huge number of
batches while trying to split up many duplicates, because of the
presence of other more uniformly distributed keys.  We could fix that
with (say) a 95% rule.
B.  Lack of good alternative execution strategy when the escape valve
is triggered.  A batch cannot be split effectively, but cannot fit in
work_mem, so for now we decide to ignore work_mem.
C.  Unmetered explosion of batches and thus BufFiles, probably usually
caused by problem A, but theoretically also due to a real need for
partitions.

> >But I wonder how we'd deal with outer joins, as Tom Lane asked in
> >another thread:
> >
> >https://www.postgresql.org/message-id/12185.1488932980%40sss.pgh.pa.us
>
> That seems unrelated - we slice the array of batches, to keep memory
> needed for BufFile under control. The hash table remains intact, so
> there's no issue with outer joins.

Right, sorry, my confusion.  I thought you were describing
https://en.wikipedia.org/wiki/Block_nested_loop.  (I actually think we
can make that work for left outer joins without too much fuss by
writing out a stream of match bits to a new temporary file.  Googling,
I see that MySQL originally didn't support BNL for outer joins and
then added some match flag propagation thing recently.)

> I agree we should relax the 0%/100% split condition, and disable the
> growth sooner. But I think we should also re-evaluate that decision
> after a while - the data set may be correlated in some way, in which
> case we may disable the growth prematurely. It may not reduce memory
> usage now, but it may help in the future.
>
> It's already an issue, but it would be even more likely if we disabled
> growth e.g. with just 5%/95% splits.
>
> FWIW I believe this is mostly orthogonal issue to what's discussed in
> this thread.

But isn't problem A the root cause of problem C, in most cases?  There
must also be "genuine" cases of problem C that would occur even if we
fix that, of course: someone has small work_mem, and data that can be
effectively partitioned to fit it, but it just takes a huge number of
partitions to do it.  So that we don't behave badly in those cases, I
agree with you 100%: we should fix the memory accounting to count
BufFile overheads as you are proposing, and then I guess ideally
switch to our alternative strategy (BNL or sort-merge or ...) when we
see that BufFiles are wasting too much work_mem and it's time to try
something else.  It seems you don't actually have one of those cases
here, though?

I think we should fix problem A.  Then handle problem C by accounting
for BufFiles, and figure out a way to switch to our alternative
strategy (currently: ignore work_mem), when we think that creating
more BufFiles will be futile (not sure exactly what the rule there
should be).  And then work on fixing B properly with a good strategy.
Here's a straw-man idea: we could adopt BNL, and then entirely remove
our repartitioning code.  If the planner's number of partitions turns
out to be not enough, we'll just handle it using BNL loops.

> Switching to some other algorithm during execution moves the goal posts
> to the next galaxy, I'm afraid.

The main problem I'm aware of with sort-merge join is: not all that is
hashable is sortable.  So BNL is actually the only solution I'm aware
of for problem B that doesn't involve changing a fundamental thing
about PostgreSQL's data type requirements.

--
Thomas Munro
https://enterprisedb.com



Re: accounting for memory used for BufFile during hash joins

Tomas Vondra-4
In reply to this post by Tom Lane-2
On Mon, May 06, 2019 at 11:18:28PM -0400, Tom Lane wrote:
>Tomas Vondra <[hidden email]> writes:
>> Do we actually check how many duplicates are there during planning?
>
>Certainly that's part of the planner's cost estimates ... but it's
>only as good as the planner's statistical knowledge.
>

I'm looking at the code, and the only place where I see code dealing with
MCVs (probably the best place for info about duplicate values) is
estimate_hash_bucketsize in final_cost_hashjoin. That's not quite what I
had in mind - I was thinking more about something along the lines of "see the
largest group of duplicate values, disable hash join if it can't fit into
work_mem at all."

Of course, if the input estimates are off, that may not work too well. It
would certainly not help the query failing with OOM, because that was a
case of a severe underestimate.

Or did you mean some other piece of code that I have missed?


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: accounting for memory used for BufFile during hash joins

Tomas Vondra-4
In reply to this post by Thomas Munro-5
On Tue, May 07, 2019 at 04:28:36PM +1200, Thomas Munro wrote:

>On Tue, May 7, 2019 at 3:15 PM Tomas Vondra
><[hidden email]> wrote:
>> On Tue, May 07, 2019 at 01:48:40PM +1200, Thomas Munro wrote:
>> >Seems expensive for large numbers of slices -- you need to join the
>> >outer batch against each inner slice.
>>
>> Nope, that's not how it works. It's the array of batches that gets
>> sliced, not the batches themselves.
>
>Sorry, I read only the description and not the code, and got confused
>about that.  So, I see three separate but related problems:
>
>A.  Broken escape valve:  sometimes we generate a huge number of
>batches while trying to split up many duplicates, because of the
>presence of other more uniformly distributed keys.  We could fix that
>with (say) a 95% rule.
>B.  Lack of good alternative execution strategy when the escape valve
>is triggered.  A batch cannot be split effectively, but cannot fit in
>work_mem, so for now we decide to ignore work_mem.
>C.  Unmetered explosion of batches and thus BufFiles, probably usually
>caused by problem A, but theoretically also due to a real need for
>partitions.
>

Right. I don't think a single solution addressing all those issues exists.
It's more likely we need multiple improvements.

>> >But I wonder how we'd deal with outer joins, as Tom Lane asked in
>> >another thread:
>> >
>> >https://www.postgresql.org/message-id/12185.1488932980%40sss.pgh.pa.us
>>
>> That seems unrelated - we slice the array of batches, to keep memory
>> needed for BufFile under control. The hash table remains intact, so
>> there's no issue with outer joins.
>
>Right, sorry, my confusion.  I thought you were describing
>https://en.wikipedia.org/wiki/Block_nested_loop.  (I actually think we
>can make that work for left outer joins without too much fuss by
>writing out a stream of match bits to a new temporary file.  Googling,
>I see that MySQL originally didn't support BNL for outer joins and
>then added some match flag propagation thing recently.)
>

Possibly. I'm not against implementing that, although I don't have a very
good idea what the benefits of BNL joins are (performance-wise). In any
case, I think that's entirely unrelated to hash joins.

>> I agree we should relax the 0%/100% split condition, and disable the
>> growth sooner. But I think we should also re-evaluate that decision
>> after a while - the data set may be correlated in some way, in which
>> case we may disable the growth prematurely. It may not reduce memory
>> usage now, but it may help in the future.
>>
>> It's already an issue, but it would be even more likely if we disabled
>> growth e.g. with just 5%/95% splits.
>>
>> FWIW I believe this is mostly orthogonal issue to what's discussed in
>> this thread.
>
>But isn't problem A the root cause of problem C, in most cases?  There
>must also be "genuine" cases of problem C that would occur even if we
>fix that, of course: someone has small work_mem, and data that can be
>effectively partitioned to fit it, but it just takes a huge number of
>partitions to do it.  So that we don't behave badly in those cases, I
>agree with you 100%: we should fix the memory accounting to count
>BufFile overheads as you are proposing, and then I guess ideally
>switch to our alternative strategy (BNL or sort-merge or ...) when we
>see that BufFiles are wasting to much work_mem and its time to try
>something else.  It seems you don't actually have one of those cases
>here, though?
>

Maybe. Or maybe not. I don't have enough data to make such judgements
about the causes in general. We have one query from pgsql-performance.
There might be more, but IMO that's probably a biased data set.

But even that reported query is not really a case of A causing C.
The outer side of the hash join was significantly underestimated (34619
vs. 113478127) due to highly-correlated conditions.

And in that case it's trivial to cause nbatch explosion even with perfect
data sets with no duplicates (so no escape valve failure).


>I think we should fix problem A.  Then handle problem C by accounting
>for BufFiles, and figure out a way to switch to our alternative
>strategy (currently: ignore work_mem), when we think that creating
>more BufFiles will be futile (not sure exactly what the rule there
>should be).  And then work on fixing B properly with a good strategy.
>Here's a straw-man idea: we could adopt BNL, and then entirely remove
>our repartitioning code.  If the planner's number of partitions turns
>out to be not enough, we'll just handle it using BNL loops.
>

Yeah, something like that.

I think we can fix A by relaxing the escape valve condition, and then
rechecking it once in a while. So we fill work_mem, realize it didn't
actually reduce the batch size significantly and disable nbatch growth.
But at the same time we increase the threshold to 2x work_mem, and after
reaching it we "consider" a nbatch increase.  That is, we walk the batch
and see how many tuples would move if we increased nbatch (that should be
fairly cheap) - if it helps, great, enable growth and split the batch. If
not, double the threshold again.  Rinse and repeat.
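
Something along these lines, roughly (a hypothetical sketch of the control
flow, with made-up names and a made-up struct; nmove would come from the
cheap walk of the batch):

    #include <stdbool.h>

    /* Hypothetical state, loosely modeled on HashJoinTableData */
    typedef struct GrowthState
    {
        bool    growEnabled;
        double  spaceUsed;      /* current batch size in bytes */
        double  threshold;      /* starts at 2 * work_mem, doubles on failure */
        double  ntuples;        /* tuples in the current batch */
    } GrowthState;

    /*
     * Once the batch exceeds the threshold, look at how many tuples would
     * move if nbatch doubled (nmove). If the split would actually help,
     * re-enable growth; otherwise double the threshold and retry later.
     */
    static bool
    consider_reenabling_growth(GrowthState *st, double nmove)
    {
        if (st->spaceUsed < st->threshold)
            return false;

        if (nmove > st->ntuples * 0.05)   /* arbitrary cutoff, for illustration */
        {
            st->growEnabled = true;       /* go ahead and split the batch */
            return true;
        }

        st->threshold *= 2;               /* rinse and repeat */
        return false;
    }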

For C, I think we can use either of the two approaches I proposed. I like
the second option better, as it actually enforces work_mem. The first
option kinda helped with A too, although in a different way, and I think the
solution I outlined in the previous paragraph will work better.

No opinion regarding the switch to BNL, at the moment.

>> Switching to some other algorithm during execution moves the goal posts
>> to the next galaxy, I'm afraid.
>
>The main problem I'm aware of with sort-merge join is: not all that is
>hashable is sortable.  So BNL is actually the only solution I'm aware
>of for problem B that doesn't involve changing a fundamental thing
>about PostgreSQL's data type requirements.
>

Sure, each of those algorithms has limitations. But I think that's mostly
irrelevant to the main issue - switching between algorithms mid-execution.
At that point some of the tuples might have already been sent to the
other nodes, and I have no idea how to "resume" the tuple stream short of
buffering everything locally until the join completes. And that would be
rather terrible, I guess.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: accounting for memory used for BufFile during hash joins

Tom Lane-2
In reply to this post by Tomas Vondra-4
Tomas Vondra <[hidden email]> writes:
> On Mon, May 06, 2019 at 11:18:28PM -0400, Tom Lane wrote:
>> Tomas Vondra <[hidden email]> writes:
>>> Do we actually check how many duplicates are there during planning?

>> Certainly that's part of the planner's cost estimates ... but it's
>> only as good as the planner's statistical knowledge.

> I'm looking at the code, and the only place where I see code dealing with
> MCVs (probably the best place for info about duplicate values) is
> estimate_hash_bucketsize in final_cost_hashjoin.

What I'm thinking of is this bit in final_cost_hashjoin:

    /*
     * If the bucket holding the inner MCV would exceed work_mem, we don't
     * want to hash unless there is really no other alternative, so apply
     * disable_cost.  (The executor normally copes with excessive memory usage
     * by splitting batches, but obviously it cannot separate equal values
     * that way, so it will be unable to drive the batch size below work_mem
     * when this is true.)
     */
    if (relation_byte_size(clamp_row_est(inner_path_rows * innermcvfreq),
                           inner_path->pathtarget->width) >
        (work_mem * 1024L))
        startup_cost += disable_cost;

It's certainly likely that that logic needs improvement in view of this
discussion --- I was just pushing back on the claim that we weren't
considering the issue at all.

                        regards, tom lane



Re: accounting for memory used for BufFile during hash joins

Tomas Vondra-4
On Tue, May 07, 2019 at 10:42:36AM -0400, Tom Lane wrote:

>Tomas Vondra <[hidden email]> writes:
>> On Mon, May 06, 2019 at 11:18:28PM -0400, Tom Lane wrote:
>>> Tomas Vondra <[hidden email]> writes:
>>>> Do we actually check how many duplicates are there during planning?
>
>>> Certainly that's part of the planner's cost estimates ... but it's
>>> only as good as the planner's statistical knowledge.
>
>> I'm looking at the code, and the only place where I see code dealing with
>> MCVs (probably the best place for info about duplicate values) is
>> estimate_hash_bucketsize in final_cost_hashjoin.
>
>What I'm thinking of is this bit in final_cost_hashjoin:
>
>    /*
>     * If the bucket holding the inner MCV would exceed work_mem, we don't
>     * want to hash unless there is really no other alternative, so apply
>     * disable_cost.  (The executor normally copes with excessive memory usage
>     * by splitting batches, but obviously it cannot separate equal values
>     * that way, so it will be unable to drive the batch size below work_mem
>     * when this is true.)
>     */
>    if (relation_byte_size(clamp_row_est(inner_path_rows * innermcvfreq),
>                           inner_path->pathtarget->width) >
>        (work_mem * 1024L))
>        startup_cost += disable_cost;
>
>It's certainly likely that that logic needs improvement in view of this
>discussion --- I was just pushing back on the claim that we weren't
>considering the issue at all.
>

Ah, this code is new in 11, and I was looking at code from 10 for some
reason. I don't think we can do much better than this, except perhaps
falling back to (1/ndistinct) when there's no MCV available.
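
Perhaps something along these lines, mirroring the check quoted above (a
hypothetical fragment, not actual or proposed code; it assumes ndistinct is
available from the same statistics lookup):

    /* Hypothetical fallback: with no MCV list, assume the largest group of
     * duplicates is roughly inner_path_rows / ndistinct, i.e. substitute
     * 1/ndistinct for the inner MCV frequency. */
    if (innermcvfreq <= 0 && ndistinct > 0)
        innermcvfreq = 1.0 / ndistinct;

    if (relation_byte_size(clamp_row_est(inner_path_rows * innermcvfreq),
                           inner_path->pathtarget->width) >
        (work_mem * 1024L))
        startup_cost += disable_cost;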


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: accounting for memory used for BufFile during hash joins

Melanie Plageman
In reply to this post by Tomas Vondra-4


On Mon, May 6, 2019 at 8:15 PM Tomas Vondra <[hidden email]> wrote:
Nope, that's not how it works. It's the array of batches that gets
sliced, not the batches themselves.

It does slightly increase the amount of data we need to shuffle between
the temp files, because we can't write the data directly to batches in
"future" slices. But that amplification is capped to ~2.2x (compared to
the ~1.4x in master) - I've shared some measurements in [1].

[1] https://www.postgresql.org/message-id/20190428141901.5dsbge2ka3rxmpk6%40development


Cool, I misunderstood. I looked at the code again today, and at the email
thread where you measured the "amplification".

In terms of how many times you write each tuple, is it accurate to say that a
tuple can now be spilled three times (in the worst case) whereas, before, it
could be spilled only twice?

1 - when building the inner side hashtable, tuple is spilled to a "slice" file
2 - (assuming the number of batches was increased) during execution, when a
tuple belonging to a later slice's spill file is found, it is re-spilled to that
slice's spill file
3 - during execution, when reading from its slice file, it is re-spilled (again)
to its batch's spill file

Is it correct that the max number of BufFile structs you will have is equal to
the number of slices + number of batches in a slice
because that is the max number of open BufFiles you would have at a time?
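
If that understanding is right, the arithmetic would be roughly this (a
standalone sketch with illustrative numbers; I'm not sure whether both
sides of the join need their own set, which would double it):

    #include <stdio.h>

    int
    main(void)
    {
        long nbatch = 16384;            /* illustrative total number of batches */
        long slice = 128;               /* batches per slice (work_mem = 4MB) */

        long nslices = nbatch / slice;  /* one overflow BufFile per slice */
        long per_side = nslices + slice;

        printf("open BufFiles, one side:   ~%ld\n", per_side);      /* 256 */
        printf("open BufFiles, both sides: ~%ld\n", 2 * per_side);  /* 512 */
        return 0;
    }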

By the way, applying v4 patch on master, in an assert build, I am tripping some
asserts -- starting with
Assert(!file->readOnly);
in BufFileWrite

One thing I was a little confused by was the nbatch_inmemory member of the
hashtable.  The comment in ExecChooseHashTableSize says that it is determining
the number of batches we can fit in memory.  I thought that the problem was the
amount of space taken up by the BufFile data structure itself--which is related
to the number of open BufFiles you need at a time. This comment in
ExecChooseHashTableSize makes it sound like you are talking about fitting more
than one batch of tuples into memory at a time. I was under the impression that
you could only fit one batch of tuples in memory at a time.

So, I was stepping through the code with work_mem set to the lower bound, and in
ExecHashIncreaseNumBatches, I got confused.
hashtable->nbatch_inmemory was 2 for me, and thus nbatch_tmp was 2,
so I didn't meet this condition:
if (nbatch_tmp > hashtable->nbatch_inmemory)
since I had just set nbatch_tmp using hashtable->nbatch_inmemory.
So, I didn't increase the number of slices, which is what I was expecting.
What happens when hashtable->nbatch_inmemory is equal to nbatch_tmp?

--
Melanie Plageman

Re: accounting for memory used for BufFile during hash joins

Melanie Plageman
In reply to this post by Tomas Vondra-4

On Tue, May 7, 2019 at 6:59 AM Tomas Vondra <[hidden email]> wrote:
On Tue, May 07, 2019 at 04:28:36PM +1200, Thomas Munro wrote:
>On Tue, May 7, 2019 at 3:15 PM Tomas Vondra
><[hidden email]> wrote:
>> On Tue, May 07, 2019 at 01:48:40PM +1200, Thomas Munro wrote:
>> Switching to some other algorithm during execution moves the goal posts
>> to the next galaxy, I'm afraid.
>
>The main problem I'm aware of with sort-merge join is: not all that is
>hashable is sortable.  So BNL is actually the only solution I'm aware
>of for problem B that doesn't involve changing a fundamental thing
>about PostgreSQL's data type requirements.
>

Sure, each of those algorithms has limitations. But I think that's mostly
irrelevant to the main issue - switching between algorithms mid-execution.
At that point some of the tuples might have been already sent sent to the
other nodes, and I have no idea how to "resume" the tuple stream short of
buffering everything locally until the join completes. And that would be
rather terrible, I guess.


What if you switched to NLJ on a batch-by-batch basis and did it before starting
execution of the join but after building the inner side of the hash table? That
way, no tuples would have been sent to other nodes yet.

--
Melanie Plageman

Re: accounting for memory used for BufFile during hash joins

Tomas Vondra-4
On Tue, May 07, 2019 at 05:43:56PM -0700, Melanie Plageman wrote:

>   On Tue, May 7, 2019 at 6:59 AM Tomas Vondra <[hidden email]>
>   wrote:
>
>     On Tue, May 07, 2019 at 04:28:36PM +1200, Thomas Munro wrote:
>     >On Tue, May 7, 2019 at 3:15 PM Tomas Vondra
>     ><[hidden email]> wrote:
>     >> On Tue, May 07, 2019 at 01:48:40PM +1200, Thomas Munro wrote:
>     >> Switching to some other algorithm during execution moves the goal
>     posts
>     >> to the next galaxy, I'm afraid.
>     >
>     >The main problem I'm aware of with sort-merge join is: not all that is
>     >hashable is sortable.  So BNL is actually the only solution I'm aware
>     >of for problem B that doesn't involve changing a fundamental thing
>     >about PostgreSQL's data type requirements.
>     >
>
>     Sure, each of those algorithms has limitations. But I think that's
>     mostly
>     irrelevant to the main issue - switching between algorithms
>     mid-execution.
>     At that point some of the tuples might have been already sent sent to
>     the
>     other nodes, and I have no idea how to "resume" the tuple stream short
>     of
>     buffering everything locally until the join completes. And that would be
>     rather terrible, I guess.
>
>   What if you switched to NLJ on a batch-by-batch basis and did it before
>   starting
>   execution of the join but after building the inner side of the hash
>   table.  That
>   way, no tuples will have been sent to other nodes yet.
>

Interesting idea! I think you're right that doing it on a per-batch basis
would solve that problem. Essentially, if all (or >95%) of the tuples
have the same hash value, we could switch to a special "degraded" mode
doing something like a NL. At that point the hash table benefits are
lost anyway, because all the tuples are in a single chain, so it's not
going to be much slower.
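
A very rough sketch of where that per-batch decision could sit
(hypothetical code, just to show the shape of the check; batch_bytes would
be the size of the inner batch once it has been built):

    #include <stdbool.h>

    /*
     * Hypothetical per-batch check, evaluated after building the inner side
     * of a batch but before joining any outer tuples against it: if the
     * batch can no longer be split (growth disabled) and does not fit into
     * work_mem, fall back to a "degraded" nested-loop style pass over that
     * batch instead of insisting on a hash table within work_mem.
     */
    static bool
    use_degraded_mode_for_batch(bool grow_enabled,
                                double batch_bytes,
                                double work_mem_bytes)
    {
        return !grow_enabled && batch_bytes > work_mem_bytes;
    }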

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: accounting for memory used for BufFile during hash joins

Tomas Vondra-4
In reply to this post by Melanie Plageman
On Tue, May 07, 2019 at 05:30:27PM -0700, Melanie Plageman wrote:

>   On Mon, May 6, 2019 at 8:15 PM Tomas Vondra <[hidden email]>
>   wrote:
>
>     Nope, that's not how it works. It's the array of batches that gets
>     sliced, not the batches themselves.
>
>     It does slightly increase the amount of data we need to shuffle between
>     the temp files, because we can't write the data directly to batches in
>     "future" slices. But that amplification is capped to ~2.2x (compared to
>     the ~1.4x in master) - I've shared some measurements in [1].
>
>     [1]
>     https://www.postgresql.org/message-id/20190428141901.5dsbge2ka3rxmpk6%40development
>
>   Cool, I misunderstood. I looked at the code again today, and, at the email
>   thread where you measured "amplification".
>
Oh! I hope you're not too disgusted by the code in that PoC patch ;-)

>   In terms of how many times you write each tuple, is it accurate to
>   say that a tuple can now be spilled three times (in the worst case)
>   whereas, before, it could be spilled only twice?
>
>   1 - when building the inner side hashtable, tuple is spilled to a "slice"
>   file
>   2 - (assuming the number of batches was increased) during execution, when
>   a tuple belonging to a later slice's spill file is found, it is re-spilled
>   to that slice's spill file
>   3 - during execution, when reading from its slice file, it is re-spilled
>   (again) to its batch's spill file
>
Yes, that's a mostly accurate understanding. Essentially this might add
one extra step of "reshuffling" from the per-slice to per-batch files.

>   Is it correct that the max number of BufFile structs you will have
>   is equal to the number of slices + number of batches in a slice
>   because that is the max number of open BufFiles you would have at a
>   time?

Yes. With the caveat that we need twice that number of BufFile structs,
because we need them on both sides of the join.

>   By the way, applying v4 patch on master, in an assert build, I am tripping
>   some
>   asserts -- starting with
>   Assert(!file->readOnly);
>   in BufFileWrite

Whoooops :-/

>   One thing I was a little confused by was the nbatch_inmemory member
>   of the hashtable.  The comment in ExecChooseHashTableSize says that
>   it is determining the number of batches we can fit in memory.  I
>   thought that the problem was the amount of space taken up by the
>   BufFile data structure itself--which is related to the number of
>   open BufFiles you need at a time. This comment in
>   ExecChooseHashTableSize makes it sound like you are talking about
>   fitting more than one batch of tuples into memory at a time. I was
>   under the impression that you could only fit one batch of tuples in
>   memory at a time.
I suppose you mean this chunk:

    /*
     * See how many batches we can fit into memory (driven mostly by size
     * of BufFile, with PGAlignedBlock being the largest part of that).
     * We need one BufFile for inner and outer side, so we count it twice
     * for each batch, and we stop once we exceed (work_mem/2).
     */
    while ((nbatch_inmemory * 2) * sizeof(PGAlignedBlock) * 2
           <= (work_mem * 1024L / 2))
        nbatch_inmemory *= 2;

Yeah, that comment is a bit confusing. What the code actually does is
computing the largest "slice" of batches for which we can keep the
BufFile structs in memory, without exceeding work_mem/2.

Maybe the nbatch_inmemory should be renamed to nbatch_slice, not sure.

>   So, I was stepping through the code with work_mem set to the lower
>   bound, and in ExecHashIncreaseNumBatches, I got confused.
>   hashtable->nbatch_inmemory was 2 for me, thus, nbatch_tmp was 2 so,
>   I didn't meet this condition if (nbatch_tmp >
>   hashtable->nbatch_inmemory) since I just set nbatch_tmp using
>   hashtable->nbatch_inmemory So, I didn't increase the number of
>   slices, which is what I was expecting.  What happens when
>   hashtable->nbatch_inmemory is equal to nbatch_tmp?
>

Ah, good catch. The condition you're referring to

    if (nbatch_tmp > hashtable->nbatch_inmemory)

should actually be

    if (nbatch > hashtable->nbatch_inmemory)

because the point is to initialize BufFile structs for the overflow
files, and we need to do that once we cross nbatch_inmemory.

And it turns out this actually causes the assert failures in the regression
tests you reported earlier. It failed to initialize the overflow files
in some cases, so the readOnly flag seemed to be set.

Attached is an updated patch, fixing this. I tried to clarify some of
the comments too, and I fixed another bug I found while running the
regression tests. It's still very much a crappy PoC code, though.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment: v4-per-slice-overflow-file-20190508.patch (28K)

Re: accounting for memory used for BufFile during hash joins

Melanie Plageman


On Wed, May 8, 2019 at 8:08 AM Tomas Vondra <[hidden email]> wrote:
On Tue, May 07, 2019 at 05:30:27PM -0700, Melanie Plageman wrote:
>   One thing I was a little confused by was the nbatch_inmemory member
>   of the hashtable.  The comment in ExecChooseHashTableSize says that
>   it is determining the number of batches we can fit in memory.  I
>   thought that the problem was the amount of space taken up by the
>   BufFile data structure itself--which is related to the number of
>   open BufFiles you need at a time. This comment in
>   ExecChooseHashTableSize makes it sound like you are talking about
>   fitting more than one batch of tuples into memory at a time. I was
>   under the impression that you could only fit one batch of tuples in
>   memory at a time.

I suppose you mean this chunk:

    /*
     * See how many batches we can fit into memory (driven mostly by size
     * of BufFile, with PGAlignedBlock being the largest part of that).
     * We need one BufFile for inner and outer side, so we count it twice
     * for each batch, and we stop once we exceed (work_mem/2).
     */
    while ((nbatch_inmemory * 2) * sizeof(PGAlignedBlock) * 2
           <= (work_mem * 1024L / 2))
        nbatch_inmemory *= 2;

Yeah, that comment is a bit confusing. What the code actually does is
computing the largest "slice" of batches for which we can keep the
BufFile structs in memory, without exceeding work_mem/2.

Maybe the nbatch_inmemory should be renamed to nbatch_slice, not sure.

I definitely would prefer to see hashtable->nbatch_inmemory renamed to
hashtable->nbatch_slice--or maybe hashtable->nbuff_inmemory?

I've been poking around the code for a while today, and, even though I
know that the nbatch_inmemory is referring to the buffiles that can
fit in memory, I keep forgetting and thinking it is referring to the
tuple data that can fit in memory.

It might be worth explicitly calling out somewhere in the comments
that overflow slices will only be created either when the number of
batches was underestimated as part of ExecHashIncreaseNumBatches and
the new number of batches exceeds the value for
hashtable->nbatch_inmemory or when creating the hashtable initially
and the number of batches exceeds the value for
hashtable->nbatch_inmemory (the name confuses this for me at hashtable
creation time especially) -- the number of actual buffiles that can be
managed in memory.
 

Attached is an updated patch, fixing this. I tried to clarify some of
the comments too, and I fixed another bug I found while running the
regression tests. It's still very much a crappy PoC code, though.


So, I ran the following example on master and with your patch.

drop table foo;
drop table bar;
create table foo(a int, b int);
create table bar(c int, d int);
insert into foo select i, i from generate_series(1,10000)i;
insert into bar select 1, 1 from generate_series(1,1000)i;
insert into bar select i%3, i%3 from generate_series(1000,10000)i;
insert into foo select 1,1 from generate_series(1,1000)i;
analyze foo; analyze bar;
set work_mem=64;

On master, explain analyze looked like this

postgres=# explain analyze verbose select * from foo, bar where a = c;
                                                        QUERY PLAN                                                        
--------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=339.50..53256.27 rows=4011001 width=16) (actual time=28.962..1048.442 rows=4008001 loops=1)
   Output: foo.a, foo.b, bar.c, bar.d
   Hash Cond: (bar.c = foo.a)
   ->  Seq Scan on public.bar  (cost=0.00..145.01 rows=10001 width=8) (actual time=0.030..1.777 rows=10001 loops=1)
         Output: bar.c, bar.d
   ->  Hash  (cost=159.00..159.00 rows=11000 width=8) (actual time=12.285..12.285 rows=11000 loops=1)
         Output: foo.a, foo.b
         Buckets: 2048 (originally 2048)  Batches: 64 (originally 16)  Memory Usage: 49kB
         ->  Seq Scan on public.foo  (cost=0.00..159.00 rows=11000 width=8) (actual time=0.023..3.786 rows=11000 loops=1)
               Output: foo.a, foo.b
 Planning Time: 0.435 ms
 Execution Time: 1206.904 ms
(12 rows)

and with your patch, it looked like this.

postgres=# explain analyze verbose select * from foo, bar where a = c;
                                                        QUERY PLAN                                                        
--------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=339.50..53256.27 rows=4011001 width=16) (actual time=28.256..1102.026 rows=4008001 loops=1)
   Output: foo.a, foo.b, bar.c, bar.d
   Hash Cond: (bar.c = foo.a)
   ->  Seq Scan on public.bar  (cost=0.00..145.01 rows=10001 width=8) (actual time=0.040..1.717 rows=10001 loops=1)
         Output: bar.c, bar.d
   ->  Hash  (cost=159.00..159.00 rows=11000 width=8) (actual time=12.327..12.327 rows=11000 loops=1)
         Output: foo.a, foo.b
         Buckets: 2048 (originally 2048)  Batches: 16384 (originally 16, in-memory 2)  Memory Usage: 131160kB
         ->  Seq Scan on public.foo  (cost=0.00..159.00 rows=11000 width=8) (actual time=0.029..3.569 rows=11000 loops=1)
               Output: foo.a, foo.b
 Planning Time: 0.260 ms
 Execution Time: 1264.995 ms
(12 rows)

I noticed that the number of batches is much higher with the patch,
and I was checking $PGDATA/base/pgsql_tmp and saw that the number of
temp files (the overflow files) present at any given time was quite high.

I would imagine that the desired behaviour is to keep memory usage
within work_mem.
In this example, the number of slices is about 8000, each of which
would have an overflow file. Is this the case you mention in the
comment in ExecChooseHashTableSize?

* We ignore (per-slice)
* overflow files, because those serve as "damage control" for cases
* when per-batch BufFiles would exceed work_mem. Given enough batches
* it's impossible to enforce work_mem strictly, because the overflow
* files alone will consume more memory.

--
Melanie Plageman

Re: accounting for memory used for BufFile during hash joins

Tomas Vondra-4
On Tue, May 21, 2019 at 05:38:50PM -0700, Melanie Plageman wrote:

>On Wed, May 8, 2019 at 8:08 AM Tomas Vondra <[hidden email]>
>wrote:
>
>> On Tue, May 07, 2019 at 05:30:27PM -0700, Melanie Plageman wrote:
>> >   One thing I was a little confused by was the nbatch_inmemory member
>> >   of the hashtable.  The comment in ExecChooseHashTableSize says that
>> >   it is determining the number of batches we can fit in memory.  I
>> >   thought that the problem was the amount of space taken up by the
>> >   BufFile data structure itself--which is related to the number of
>> >   open BufFiles you need at a time. This comment in
>> >   ExecChooseHashTableSize makes it sound like you are talking about
>> >   fitting more than one batch of tuples into memory at a time. I was
>> >   under the impression that you could only fit one batch of tuples in
>> >   memory at a time.
>>
>> I suppose you mean this chunk:
>>
>>     /*
>>      * See how many batches we can fit into memory (driven mostly by size
>>      * of BufFile, with PGAlignedBlock being the largest part of that).
>>      * We need one BufFile for inner and outer side, so we count it twice
>>      * for each batch, and we stop once we exceed (work_mem/2).
>>      */
>>     while ((nbatch_inmemory * 2) * sizeof(PGAlignedBlock) * 2
>>            <= (work_mem * 1024L / 2))
>>         nbatch_inmemory *= 2;
>>
>> Yeah, that comment is a bit confusing. What the code actually does is
>> computing the largest "slice" of batches for which we can keep the
>> BufFile structs in memory, without exceeding work_mem/2.
>>
>> Maybe the nbatch_inmemory should be renamed to nbatch_slice, not sure.
>>
>
>I definitely would prefer to see hashtable->nbatch_inmemory renamed to
>hashtable->nbatch_slice--or maybe hashtable->nbuff_inmemory?
>
>I've been poking around the code for awhile today, and, even though I
>know that the nbatch_inmemory is referring to the buffiles that can
>fit in memory, I keep forgetting and thinking it is referring to the
>tuple data that can fit in memory.
>

That's a fair point. I think nbatch_slice is a good name.

>It might be worth explicitly calling out somewhere in the comments
>that overflow slices will only be created either when the number of
>batches was underestimated as part of ExecHashIncreaseNumBatches and
>the new number of batches exceeds the value for
>hashtable->nbatch_inmemory or when creating the hashtable initially
>and the number of batches exceeds the value for
>hashtable->nbatch_inmemory (the name confuses this for me at hashtable
>creation time especially) -- the number of actual buffiles that can be
>managed in memory.
>

Yes, this definitely needs to be explained somewhere - possibly in a
comment at the beginning of nodeHash.c or something like that.

FWIW I wonder if this "slicing" would be useful even with correct
estimates. E.g. let's say we can fit 128 batches into work_mem, but we
expect to need 256 (and it's accurate). At that point it's probably too
aggressive to disable hash joins - a merge join is likely more expensive
than just using the slicing. But that should be a cost-based decision.

>
>>
>> Attached is an updated patch, fixing this. I tried to clarify some of
>> the comments too, and I fixed another bug I found while running the
>> regression tests. It's still very much a crappy PoC code, though.
>>
>>
>So, I ran the following example on master and with your patch.
>
>drop table foo;
>drop table bar;
>create table foo(a int, b int);
>create table bar(c int, d int);
>insert into foo select i, i from generate_series(1,10000)i;
>insert into bar select 1, 1 from generate_series(1,1000)i;
>insert into bar select i%3, i%3 from generate_series(1000,10000)i;
>insert into foo select 1,1 from generate_series(1,1000)i;
>analyze foo; analyze bar;
>set work_mem=64;
>
>On master, explain analyze looked like this
>
>postgres=# explain analyze verbose select * from foo, bar where a = c;
>                                                        QUERY PLAN
>
>--------------------------------------------------------------------------------------------------------------------------
> Hash Join  (cost=339.50..53256.27 rows=4011001 width=16) (actual
>time=28.962..1048.442 rows=4008001 loops=1)
>   Output: foo.a, foo.b, bar.c, bar.d
>   Hash Cond: (bar.c = foo.a)
>   ->  Seq Scan on public.bar  (cost=0.00..145.01 rows=10001 width=8)
>(actual time=0.030..1.777 rows=10001 loops=1)
>         Output: bar.c, bar.d
>   ->  Hash  (cost=159.00..159.00 rows=11000 width=8) (actual
>time=12.285..12.285 rows=11000 loops=1)
>         Output: foo.a, foo.b
>         Buckets: 2048 (originally 2048)  Batches: 64 (originally 16)
> Memory Usage: 49kB
>         ->  Seq Scan on public.foo  (cost=0.00..159.00 rows=11000 width=8)
>(actual time=0.023..3.786 rows=11000 loops=1)
>               Output: foo.a, foo.b
> Planning Time: 0.435 ms
> Execution Time: 1206.904 ms
>(12 rows)
>
>and with your patch, it looked like this.
>
>postgres=# explain analyze verbose select * from foo, bar where a = c;
>                                                        QUERY PLAN
>
>--------------------------------------------------------------------------------------------------------------------------
> Hash Join  (cost=339.50..53256.27 rows=4011001 width=16) (actual
>time=28.256..1102.026 rows=4008001 loops=1)
>   Output: foo.a, foo.b, bar.c, bar.d
>   Hash Cond: (bar.c = foo.a)
>   ->  Seq Scan on public.bar  (cost=0.00..145.01 rows=10001 width=8)
>(actual time=0.040..1.717 rows=10001 loops=1)
>         Output: bar.c, bar.d
>   ->  Hash  (cost=159.00..159.00 rows=11000 width=8) (actual
>time=12.327..12.327 rows=11000 loops=1)
>         Output: foo.a, foo.b
>         Buckets: 2048 (originally 2048)  Batches: 16384 (originally 16,
>in-memory 2)  Memory Usage: 131160kB
>         ->  Seq Scan on public.foo  (cost=0.00..159.00 rows=11000 width=8)
>(actual time=0.029..3.569 rows=11000 loops=1)
>               Output: foo.a, foo.b
> Planning Time: 0.260 ms
> Execution Time: 1264.995 ms
>(12 rows)
>
>I noticed that the number of batches is much higher with the patch,
>and, checking $PGDATA/base/pgsql_tmp, I saw that the number of temp
>files (the overflow files) present at any given time was quite high.
>
>I would imagine that the desired behaviour is to keep memory usage
>within work_mem.

There's definitely something fishy going on. I suspect it's either because
of the duplicate values (which might fit into 64kB on master, but not when
accounting for the BufFiles), or because the initial 16 batches can't
possibly fit into work_mem.

If you try with a larger work_mem, say 256kB, does that behave OK?
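
Something like this, i.e. the same query as above, just with a larger
work_mem (the exact value is a guess):

  set work_mem = '256kB';
  explain analyze verbose select * from foo, bar where a = c;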

>In this example, the number of slices is about 8000, each of which
>would have an overflow file. Is this the case you mention in the
>comment in ExecChooseHashTableSize ?
>
>* We ignore (per-slice)
>* overflow files, because those serve as "damage control" for cases
>* when per-batch BufFiles would exceed work_mem. Given enough batches
>* it's impossible to enforce work_mem strictly, because the overflow
>* files alone will consume more memory.
>

Yes. The explain output shows 16384 batches with 2 kept in memory, i.e.
~8192 slices, and each slice needs an ~8kB overflow BufFile - that's ~64MB.
Considering we need them on both sides of the join, that'd be ~128MB, which
is pretty much exactly the reported 131160kB.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: accounting for memory used for BufFile during hash joins

Hubert Zhang
In reply to this post by Tomas Vondra-4
On Sat, May 4, 2019 at 8:34 AM Tomas Vondra <[hidden email]> wrote:
>The root cause is that hash join treats batches as pretty much free, but
>that's not really true - we do allocate two BufFile structs per batch,
>and each BufFile is ~8kB as it includes PGAlignedBuffer.
>
>The OOM is not very surprising, because with 524288 batches it'd need
>about 8GB of memory, and the system only has 8GB RAM installed.
>
>The second patch tries to enforce work_mem more strictly. That would be
>impossible if we were to keep all the BufFile structs in memory, so
>instead it slices the batches into chunks that fit into work_mem, and
>then uses a single "overflow" file for slices currently not in memory.
>These extra slices can't be counted into work_mem, but we should need
>just very few of them. For example with work_mem=4MB the slice is 128
>batches, so we need 128x less overflow files (compared to per-batch).


Hi Tomas

I read your second patch, which uses overflow BufFiles to limit the number of per-batch BufFiles kept in memory.
It would solve the hash join OOM problem you discussed above: ~8kB per batch leads to the batch bloating problem.

I mentioned this in another thread:
https://www.postgresql.org/message-id/flat/CAB0yrekv%3D6_T_eUe2kOEvWUMwufcvfd15SFmCABtYFOkxCFdfA%40mail.gmail.com
There is another hashjoin OOM problem, caused by disabling batch splitting too early. PG uses the flag hashtable->growEnabled to decide whether to split batches. Once a split fails (all the tuples end up in just one of the two new batches), the growEnabled flag is turned off forever.

This is the opposite of the batch bloating problem: the join ends up with too few batches, and the in-memory hash table becomes too large to fit into memory.

Here is the tradeoff: each batch takes more than 8kB (the 8kB buffer makes sense for performance reasons), the in-memory hash table takes memory as well, and splitting batches may (but does not necessarily) reduce the in-memory hash table size while introducing more batches - and thus more memory, roughly 8kB per batch.
Can we conclude that splitting is worthwhile only if:
(the reduced memory of the in-memory hash table) - (8kB * number of new batches) > 0
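
In rough C, that rule would look something like this (just an illustrative
sketch, not code from either patch - the function name and the hard-coded
8kB are made up, and Size is the usual PostgreSQL typedef for size_t):

static bool
worth_splitting_batches(Size reduced_hashtable_mem, int nbatch_added)
{
    /* each newly added batch costs roughly 8kB for its BufFile buffer */
    Size        extra_buffile_mem = (Size) nbatch_added * 8192;

    /* split only if the hash table shrinks by more than the BufFiles grow */
    return reduced_hashtable_mem > extra_buffile_mem;
}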

So I'm considering combining our patch with your patch to fix the hash join OOM problem, regardless of whether the OOM comes from the in-memory hash table or from the per-batch BufFiles (8kB * number of batches).

nbatch_inmemory in your patch could also be derived using the rule above.

What's your opinion?

Thanks

Hubert Zhang

Re: accounting for memory used for BufFile during hash joins

Hubert Zhang
Hi Tomas,

Here is the patch. It should be compatible with your patch, and it focuses on deciding when to continue growing the number of batches.




--
Thanks

Hubert Zhang

0001-Allow-to-continue-to-split-batch-when-tuples-become-.patch (6K)

Re: accounting for memory used for BufFile during hash joins

Tomas Vondra-4
In reply to this post by Hubert Zhang
On Tue, May 28, 2019 at 03:40:01PM +0800, Hubert Zhang wrote:

>On Sat, May 4, 2019 at 8:34 AM Tomas Vondra <[hidden email]>
>wrote:
>
>Hi Tomas
>
>I read your second patch, which uses overflow BufFiles to limit the number
>of per-batch BufFiles kept in memory.
>It would solve the hash join OOM problem you discussed above: ~8kB per
>batch leads to the batch bloating problem.
>
>I mentioned in another thread:
>
>https://www.postgresql.org/message-id/flat/CAB0yrekv%3D6_T_eUe2kOEvWUMwufcvfd15SFmCABtYFOkxCFdfA%40mail.gmail.com
>There is another hashjoin OOM problem, caused by disabling batch splitting
>too early. PG uses the flag hashtable->growEnabled to decide whether to
>split batches. Once a split fails (all the tuples end up in just one of
>the two new batches), the growEnabled flag is turned off forever.
>
>This is the opposite of the batch bloating problem: the join ends up with
>too few batches, and the in-memory hash table becomes too large to fit
>into memory.
>

Yes. There are definitely multiple separate issues in the hashjoin code,
and the various improvements discussed in this (and other) threads usually
address just a subset of them. We need to figure out how to combine them,
or maybe devise some more generic solution.

So I think we need to take a step back, and figure out how to combine
these improvements - otherwise we might commit a fix for one issue, making
it much harder/impossible to improve the other issues.

The other important question is whether we see these cases as outliers
(and the solutions as last-resort-attempt-to-survive kind of fix) or more
widely applicable optimizations. I've seen some interesting speedups with
the overflow-batches patch, but my feeling is we should really treat it as
a last resort.

I had a chat about this with Thomas Munro yesterday. Unfortunately, some
beer was involved but I do vaguely remember he more or less convinced me
the BNL (block nested loop join) might be the right approach here. We
don't have any patch for that yet, though :-(
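
To illustrate the idea: a BNL consumes the outer side in blocks sized to
fit into memory and rescans the inner side once per block, so memory stays
bounded without creating any batch files at all. A toy example (purely
illustrative, nothing to do with the actual executor code):

#include <stdio.h>

#define BLOCK_SIZE 3    /* stands in for "how many outer rows fit in work_mem" */

int
main(void)
{
    int     outer[] = {1, 2, 3, 1, 2, 5, 7, 1};
    int     inner[] = {1, 2, 2, 9};
    int     nouter = 8;
    int     ninner = 4;

    for (int start = 0; start < nouter; start += BLOCK_SIZE)
    {
        int     end = (start + BLOCK_SIZE < nouter) ? start + BLOCK_SIZE : nouter;

        /* one full scan of the inner side per block of outer rows */
        for (int i = 0; i < ninner; i++)
            for (int o = start; o < end; o++)
                if (outer[o] == inner[i])
                    printf("match: outer=%d inner=%d\n", outer[o], inner[i]);
    }

    return 0;
}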

>Here is the tradeoff: each batch takes more than 8kB (the 8kB buffer makes
>sense for performance reasons), the in-memory hash table takes memory as
>well, and splitting batches may (but does not necessarily) reduce the
>in-memory hash table size while introducing more batches - and thus more
>memory, roughly 8kB per batch.
>Can we conclude that splitting is worthwhile only if:
>(the reduced memory of the in-memory hash table) - (8kB * number of new
>batches) > 0
>

Something like that, yes.

>So I'm considering combining our patch with your patch to fix the hash
>join OOM problem, regardless of whether the OOM comes from the in-memory
>hash table or from the per-batch BufFiles (8kB * number of batches).
>
>nbatch_inmemory in your patch could also be derived using the rule above.
>
>What's your opinion?
>

One of the issues with my "overflow batches" patch, pointed out to me by
Thomas yesterday, is that it only works for non-parallel hash joins - and
we don't know how to make it work in parallel mode :-(


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services