Avoiding hash join batch explosions with extreme skew and weird stats


Avoiding hash join batch explosions with extreme skew and weird stats

Thomas Munro-5
Hello,

As discussed elsewhere[1][2], our algorithm for deciding when to give
up on repartitioning (AKA increasing the number of batches) tends to
keep going until it has a number of batches that is a function of the
number of distinct well distributed keys.  I wanted to move this minor
issue away from Tomas Vondra's thread[2] since it's a mostly
independent problem.

SET max_parallel_workers_per_gather = 0;
SET synchronize_seqscans = off;
SET work_mem = '4MB';

CREATE TABLE r AS SELECT generate_series(1, 10000000)::int i;
ANALYZE r;

-- 1k uniform keys + 1m duplicates
CREATE TABLE s1k (i int);
INSERT INTO s1k SELECT generate_series(1, 1000)::int i;
ALTER TABLE s1k SET (autovacuum_enabled = off);
ANALYZE s1k;
INSERT INTO s1k SELECT 42 FROM generate_series(1, 1000000);

EXPLAIN ANALYZE SELECT COUNT(*) FROM r JOIN s1k USING (i);

  Buckets: 1048576 (originally 1048576)
  Batches: 4096 (originally 16)
  Memory Usage: 35157kB

-- 10k uniform keys + 1m duplicates
CREATE TABLE s10k (i int);
INSERT INTO s10k SELECT generate_series(1, 10000)::int i;
ALTER TABLE s10k SET (autovacuum_enabled = off);
ANALYZE s10k;
INSERT INTO s10k SELECT 42 FROM generate_series(1, 1000000);

EXPLAIN ANALYZE SELECT COUNT(*) FROM r JOIN s10k USING (i);

  Buckets: 131072 (originally 131072)
  Batches: 32768 (originally 16)
  Memory Usage: 35157kB

See how the number of batches is determined by the number of uniform
keys in r?  That's because the explosion unfolds until there is
*nothing left* but keys that hash to the same value in the problem
batch, which means those uniform keys have to keep spreading out until
there is something on the order of two batches per key.  The point is
that it's bounded only by input data (or eventually INT_MAX / 2 and
MaxAllocSize), and as Tomas has illuminated, batches eat unmetered
memory.  Ouch.

Here's a quick hack to show that a 95% cut-off fixes those examples.
I don't really know how to choose the number, but I suspect it should
be much closer to 100 than 50.  I think this is the easiest of three
fundamental problems that need to be solved in this area.  The others
are: accounting for per-partition overheads as Tomas pointed out, and
providing an actual fallback strategy that respects work_mem when
extreme skew is detected OR per-partition overheads dominate.  I plan
to experiment with nested loop hash join (or whatever you want to call
it: the thing where you join every arbitrary fragment of the hash
table against the outer batch, and somehow deal with outer match
flags) when time permits.
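
For illustration, here is a minimal standalone sketch of the kind of
cut-off check the patch implies (not the patch itself; the struct and
counters are illustrative stand-ins, and the real change would
presumably live at the end of ExecHashIncreaseNumBatches() in
nodeHash.c, which already tracks how many tuples stayed put after a
repartition):

#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-in for the relevant part of HashJoinTableData. */
typedef struct DemoHashTable
{
    bool        growEnabled;
} DemoHashTable;

#define SKEW_CUTOFF 0.95

/*
 * After repartitioning a batch, compare how many tuples hashed back
 * into the same batch against the total that were in memory; if almost
 * none moved, further doublings are very unlikely to relieve memory
 * pressure, so stop growing the number of batches.
 */
static void
check_repartition_progress(DemoHashTable *hashtable,
                           size_t ntuples_kept, size_t ntuples_total)
{
    if (ntuples_total > 0 &&
        (double) ntuples_kept / (double) ntuples_total >= SKEW_CUTOFF)
        hashtable->growEnabled = false;
}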

[1] https://www.postgresql.org/message-id/flat/CAG_%3D8kBoWY4AXwW%3DCj44xe13VZnYohV9Yr-_hvZdx2xpiipr9w%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/20190504003414.bulcbnge3rhwhcsh%40development

--
Thomas Munro
https://enterprisedb.com

Attachment: fix.patch (1K)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

Tomas Vondra-4
On Thu, May 16, 2019 at 01:22:31PM +1200, Thomas Munro wrote:

> ...
>
>Here's a quick hack to show that a 95% cut-off fixes those examples.
>I don't really know how to choose the number, but I suspect it should
>be much closer to 100 than 50.  I think this is the easiest of three
>fundamental problems that need to be solved in this area.  The others
>are: accounting for per-partition overheads as Tomas pointed out, and
>providing an actual fallback strategy that respects work_mem when
>extreme skew is detected OR per-partition overheads dominate.  I plan
>to experiment with nested loop hash join (or whatever you want to call
>it: the thing where you join every arbitrary fragment of the hash
>table against the outer batch, and somehow deal with outer match
>flags) when time permits.
>

I think this is a step in the right direction, but as I said on the other
thread(s), I think we should not disable growth forever, but recheck once
in a while. Otherwise we'll end up in a sad situation with non-uniform data
sets, as pointed out by Hubert Zhang in [1]. It's probably even truer with
this less strict logic, using 95% as a threshold (instead of 100%).

I kinda like the idea with increasing the spaceAllowed value. Essentially,
if we decide adding batches would be pointless, increasing the memory
budget is the only thing we can do anyway.

The problem however is that we only really look at a single bit - it may
be that doubling the batches would not help, but doing it twice would
actually reduce the memory usage. For example, assume there are 2 distinct
values in the batch, with hash values (in binary)

  101010000
  101010111

and assume the batch is currently determined by the shared 101010
prefix. Clearly, splitting batches is going to do nothing
until we get to the 000 vs. 111 parts.

At first I thought this was rather unlikely and we could ignore it, but I'm
not really sure about that - it may actually be pretty likely. We may get
to the 101010 batch with a sufficiently large data set, and then it's ~50%
probability the next bit is the same (assuming two distinct values). So
this may be quite an issue, I think.
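
To make the bit-consumption argument concrete, here is a small
standalone demo (not PostgreSQL code; the real
ExecHashGetBucketAndBatch() takes the batch bits from just above the
bucket bits, so this demo picks two values that agree in their low six
bits, which mirrors the shared-prefix example above):

#include <stdio.h>

/*
 * Simplified batch assignment: with nbatch a power of two, each
 * doubling of nbatch consults exactly one more hash bit.
 */
static unsigned
batchno(unsigned hashvalue, unsigned nbatch)
{
    return hashvalue & (nbatch - 1);
}

int
main(void)
{
    unsigned    a = 0x2a;   /* 0101010 */
    unsigned    b = 0x6a;   /* 1101010: differs only at bit 6 */

    for (unsigned nbatch = 2; nbatch <= 128; nbatch *= 2)
        printf("nbatch=%3u: a -> batch %3u, b -> batch %3u%s\n",
               nbatch, batchno(a, nbatch), batchno(b, nbatch),
               batchno(a, nbatch) == batchno(b, nbatch)
               ? "  (split was futile)" : "  (finally separated)");
    return 0;
}

Every doubling up to nbatch=64 leaves both values in the same batch,
so six consecutive splits achieve nothing.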

regards


[1] https://www.postgresql.org/message-id/CAB0yrekv%3D6_T_eUe2kOEvWUMwufcvfd15SFmCABtYFOkxCFdfA%40mail.gmail.com

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Avoiding hash join batch explosions with extreme skew and weird stats

Thomas Munro-5
On Fri, May 17, 2019 at 4:39 AM Tomas Vondra
<[hidden email]> wrote:
> I think this is a step in the right direction, but as I said on the other
> thread(s), I think we should not disable growth forever, but recheck once
> in a while. Otherwise we'll end up in a sad situation with non-uniform data
> sets, as pointed out by Hubert Zhang in [1]. It's probably even truer with
> this less strict logic, using 95% as a threshold (instead of 100%).
>
> I kinda like the idea with increasing the spaceAllowed value. Essentially,
> if we decide adding batches would be pointless, increasing the memory
> budget is the only thing we can do anyway.

But that's not OK, we need to fix THAT.

> The problem however is that we only really look at a single bit - it may
> be that doubling the batches would not help, but doing it twice would
> actually reduce the memory usage. For example, assume there are 2 distinct
> values in the batch, with hash values (in binary)

Yes, that's a good point, and not a case that we should ignore.  But
if we had a decent fall-back strategy that respected work_mem, we
wouldn't care so much if we get it wrong in a corner case.  I'm
arguing that we should use Grace partitioning as our primary
partitioning strategy, but fall back to looping (or possibly
sort-merging) for the current batch if Grace doesn't seem to be
working.  You'll always be able to find cases where if you'd just
tried one more round, Grace would work, but that seems acceptable to
me, because getting it wrong doesn't melt your computer, it just
probably takes longer.  Or maybe it doesn't.  How much longer would it
take to loop twice?  Erm, twice as long, and each loop makes actual
progress, unlike extra speculative Grace partition expansions, which
apply not just to the current batch but to all batches, might not
actually work, and *have* to be abandoned at some point.  The more I
think about it, the more I think that a loop-based escape valve, though
unpalatably quadratic, is probably OK because we're in a sink-or-swim
situation at this point, and our budget is work_mem, not work_time.

I'm concerned that we're trying to find ways to treat the symptoms,
allowing us to exceed work_mem but maybe not so much, instead of
focusing on the fundamental problem, which is that we don't yet have
an algorithm that is guaranteed to respect work_mem.

Admittedly I don't have a patch, just a bunch of handwaving.  One
reason I haven't attempted to write it is because although I know how
to do the non-parallel version using a BufFile full of match bits in
sync with the tuples for outer joins, I haven't figured out how to do
it for parallel-aware hash join, because then each loop over the outer
batch could see different tuples in each participant.  You could use
the match bit in HashJoinTuple header, but then you'd have to write
all the tuples out again, which is more IO than I want to do.  I'll
probably start another thread about that.

--
Thomas Munro
https://enterprisedb.com



Re: Avoiding hash join batch explosions with extreme skew and weird stats

Tomas Vondra-4
On Fri, May 17, 2019 at 10:21:56AM +1200, Thomas Munro wrote:

>On Fri, May 17, 2019 at 4:39 AM Tomas Vondra
><[hidden email]> wrote:
>> I think this is a step in the right direction, but as I said on the other
>> thread(s), I think we should not disable growth forever, but recheck once
>> in a while. Otherwise we'll end up in a sad situation with non-uniform data
>> sets, as pointed out by Hubert Zhang in [1]. It's probably even truer with
>> this less strict logic, using 95% as a threshold (instead of 100%).
>>
>> I kinda like the idea with increasing the spaceAllowed value. Essentially,
>> if we decide adding batches would be pointless, increasing the memory
>> budget is the only thing we can do anyway.
>
>But that's not OK, we need to fix THAT.
>

I agree increasing the budget is not ideal, although at the moment it's
the only thing we can do. If we can improve that, great.

>> The problem however is that we only really look at a single bit - it may
>> be that doubling the batches would not help, but doing it twice would
>> actually reduce the memory usage. For example, assume there are 2 distinct
>> values in the batch, with hash values (in binary)
>
>Yes, that's a good point, and not a case that we should ignore.  But
>if we had a decent fall-back strategy that respected work_mem, we
>wouldn't care so much if we get it wrong in a corner case.  I'm
>arguing that we should use Grace partitioning as our primary
>partitioning strategy, but fall back to looping (or possibly
>sort-merging) for the current batch if Grace doesn't seem to be
>working.  You'll always be able to find cases where if you'd just
>tried one more round, Grace would work, but that seems acceptable to
>me, because getting it wrong doesn't melt your computer, it just
>probably takes longer.  Or maybe it doesn't.  How much longer would it
>take to loop twice?  Erm, twice as long, and each loop makes actual
>progress, unlike extra speculative Grace partition expansions, which
>apply not just to the current batch but to all batches, might not
>actually work, and *have* to be abandoned at some point.  The more I
>think about it, the more I think that a loop-based escape valve, though
>unpalatably quadratic, is probably OK because we're in a sink-or-swim
>situation at this point, and our budget is work_mem, not work_time.
>

True.

>I'm concerned that we're trying to find ways to treat the symptoms,
>allowing us to exceed work_mem but maybe not so much, instead of
>focusing on the fundamental problem, which is that we don't yet have
>an algorithm that is guaranteed to respect work_mem.
>

Yes, that's a good point.

>Admittedly I don't have a patch, just a bunch of handwaving.  One
>reason I haven't attempted to write it is because although I know how
>to do the non-parallel version using a BufFile full of match bits in
>sync with the tuples for outer joins, I haven't figured out how to do
>it for parallel-aware hash join, because then each loop over the outer
>batch could see different tuples in each participant.  You could use
>the match bit in HashJoinTuple header, but then you'd have to write
>all the tuples out again, which is more IO than I want to do.  I'll
>probably start another thread about that.
>

That pesky parallelism ;-)


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Avoiding hash join batch explosions with extreme skew and weird stats

Tom Lane-2
In reply to this post by Thomas Munro-5
Thomas Munro <[hidden email]> writes:
> On Fri, May 17, 2019 at 4:39 AM Tomas Vondra
> <[hidden email]> wrote:
>> I kinda like the idea with increasing the spaceAllowed value. Essentially,
>> if we decide adding batches would be pointless, increasing the memory
>> budget is the only thing we can do anyway.

> But that's not OK, we need to fix THAT.

I don't think it's necessarily a good idea to suppose that we MUST
fit in work_mem come what may.  It's likely impossible to guarantee
that in all cases.  Even if we can, a query that runs for eons will
help nobody.

                        regards, tom lane



Re: Avoiding hash join batch explosions with extreme skew and weird stats

Tomas Vondra-4
On Thu, May 16, 2019 at 06:58:43PM -0400, Tom Lane wrote:

>Thomas Munro <[hidden email]> writes:
>> On Fri, May 17, 2019 at 4:39 AM Tomas Vondra
>> <[hidden email]> wrote:
>>> I kinda like the idea with increasing the spaceAllowed value. Essentially,
>>> if we decide adding batches would be pointless, increasing the memory
>>> budget is the only thing we can do anyway.
>
>> But that's not OK, we need to fix THAT.
>
>I don't think it's necessarily a good idea to suppose that we MUST
>fit in work_mem come what may.  It's likely impossible to guarantee
>that in all cases.  Even if we can, a query that runs for eons will
>help nobody.
>

I kinda agree with Thomas - arbitrarily increasing work_mem is something
we should not do unless absolutely necessary. If the query is slow, it's
up to the user to bump the value up, if deemed appropriate.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Avoiding hash join batch explosions with extreme skew and weird stats

Thomas Munro-5
On Fri, May 17, 2019 at 11:46 AM Tomas Vondra
<[hidden email]> wrote:

> On Thu, May 16, 2019 at 06:58:43PM -0400, Tom Lane wrote:
> >Thomas Munro <[hidden email]> writes:
> >> On Fri, May 17, 2019 at 4:39 AM Tomas Vondra
> >> <[hidden email]> wrote:
> >>> I kinda like the idea with increasing the spaceAllowed value. Essentially,
> >>> if we decide adding batches would be pointless, increasing the memory
> >>> budget is the only thing we can do anyway.
> >
> >> But that's not OK, we need to fix THAT.
> >
> >I don't think it's necessarily a good idea to suppose that we MUST
> >fit in work_mem come what may.  It's likely impossible to guarantee
> >that in all cases.  Even if we can, a query that runs for eons will
> >help nobody.
>
> I kinda agree with Thomas - arbitrarily increasing work_mem is something
> we should not do unless absolutely necessary. If the query is slow, it's
> up to the user to bump the value up, if deemed appropriate.

+1

I think we can guarantee that we can fit in work_mem with only one
exception: we have to allow work_mem to be exceeded when we otherwise
couldn't fit a single tuple.

Then the worst possible case with the looping algorithm is that we
degrade to loading just one inner tuple at a time into the hash table,
at which point we effectively have a nested loop join (except (1) it's
flipped around: for each tuple on the inner side, we scan the outer
side; and (2) we can handle full outer joins).  In any reasonable case
you'll have a decent amount of tuples at a time, so you won't have to
loop too many times so it's not really quadratic in the number of
tuples.  The realisation that it's a nested loop join in the extreme
case is probably why the MySQL people called it 'block nested loop
join' (and as far as I can tell from quick googling, it might be their
*primary* strategy for hash joins that don't fit in memory, not just a
secondary strategy after Grace fails, but I might be wrong about
that).  Unlike plain old single-tuple nested loop join, it works in
arbitrary sized blocks (the hash table).  What we would call a regular
hash join, they call a BNL that just happens to have only one loop.  I
think Grace is probably a better primary strategy, but loops are a
good fallback.

The reason I kept mentioning sort-merge in earlier threads is because
it'd be better in the worst cases.  Unfortunately it would be worse in
the best case (smallish numbers of loops) and I suspect many real
world cases.  It's hard to decide, so perhaps we should be happy that
sort-merge can't be considered currently because the join conditions
may not be merge-joinable.


--
Thomas Munro
https://enterprisedb.com



Re: Avoiding hash join batch explosions with extreme skew and weird stats

Melanie Plageman
In reply to this post by Thomas Munro-5


On Thu, May 16, 2019 at 3:22 PM Thomas Munro <[hidden email]> wrote:
> Admittedly I don't have a patch, just a bunch of handwaving.  One
> reason I haven't attempted to write it is because although I know how
> to do the non-parallel version using a BufFile full of match bits in
> sync with the tuples for outer joins, I haven't figured out how to do
> it for parallel-aware hash join, because then each loop over the outer
> batch could see different tuples in each participant.  You could use
> the match bit in HashJoinTuple header, but then you'd have to write
> all the tuples out again, which is more IO than I want to do.  I'll
> probably start another thread about that.


Could you explain more about the implementation you are suggesting?

Specifically, what do you mean "BufFile full of match bits in sync with the
tuples for outer joins?"

Is the implementation you are thinking of one which falls back to NLJ on a
batch-by-batch basis decided during the build phase?
If so, why do you need to keep track of the outer tuples seen?
If you are going to loop through the whole outer side for each tuple on the
inner side, it seems like you wouldn't need to.

Could you make an outer "batch" which is the whole of the outer relation? That
is, could you do something like: when hashing the inner side, if re-partitioning
is resulting in batches that will overflow spaceAllowed, could you set a flag on
that batch use_NLJ and when making batches for the outer side, make one "batch"
that has all the tuples from the outer side which the inner side batch which was
flagged will do NLJ with.

--
Melanie Plageman

Re: Avoiding hash join batch explosions with extreme skew and weird stats

Thomas Munro-5
On Sat, May 18, 2019 at 12:15 PM Melanie Plageman
<[hidden email]> wrote:

> On Thu, May 16, 2019 at 3:22 PM Thomas Munro <[hidden email]> wrote:
>> Admittedly I don't have a patch, just a bunch of handwaving.  One
>> reason I haven't attempted to write it is because although I know how
>> to do the non-parallel version using a BufFile full of match bits in
>> sync with the tuples for outer joins, I haven't figured out how to do
>> it for parallel-aware hash join, because then each loop over the outer
>> batch could see different tuples in each participant.  You could use
>> the match bit in HashJoinTuple header, but then you'd have to write
>> all the tuples out again, which is more IO than I want to do.  I'll
>> probably start another thread about that.
>
> Could you explain more about the implementation you are suggesting?
>
> Specifically, what do you mean "BufFile full of match bits in sync with the
> tuples for outer joins?"

First let me restate the PostgreSQL terminology for this stuff so I
don't get confused while talking about it:

* The inner side of the join = the right side = the side we use to
build a hash table.  Right and full joins emit inner tuples when there
is no matching tuple on the outer side.

* The outer side of the join = the left side = the side we use to
probe the hash table.  Left and full joins emit outer tuples when
there is no matching tuple on the inner side.

* Semi and anti joins emit exactly one instance of each outer tuple if
there is/isn't at least one match on the inner side.

We have a couple of relatively easy cases:

* Inner joins: for every outer tuple, we try to find a match in the
hash table, and if we find one we emit a tuple.  To add looping
support, if we run out of memory when loading the hash table we can
just proceed to probe the fragment we've managed to load so far, and
then rewind the outer batch, clear the hash table and load in the next
work_mem-sized fragment and do it again... rinse and repeat until
we've eventually processed the whole inner batch.  After we've
finished looping, we move on to the next batch.

* For right and full joins ("HJ_FILL_INNER"), we also need to emit an
inner tuple for every tuple that was loaded into the hash table but
never matched.  That's done using a flag HEAP_TUPLE_HAS_MATCH in the
header of the tuples of the hash table, and a scan through the whole
hash table at the end of each batch to look for unmatched tuples
(ExecScanHashTableForUnmatched()).  To add looping support, that just
has to be done at the end of every inner batch fragment, that is,
after every loop.

And now for the cases that need a new kind of match bit, as far as I can see:

* For left and full joins ("HJ_FILL_OUTER"), we also need to emit an
outer tuple for every tuple that didn't find a match in the hash
table.  Normally that is done while probing, without any need for
memory or match flags: if we don't find a match, we just spit out an
outer tuple immediately.  But that simple strategy won't work if the
hash table holds only part of the inner batch.  Since we'll be
rewinding and looping over the outer batch again for the next inner
batch fragment, we can't yet say if there will be a match in a later
loop.  But the later loops don't know on their own either.  So we need
some kind of cumulative memory between loops, and we only know which
outer tuples have a match after we've finished all loops.  So there
would need to be a new function ExecScanOuterBatchForUnmatched().

* For semi joins, we need to emit exactly one outer tuple whenever
there is one or more match on the inner side.  To add looping support,
we need to make sure that we don't emit an extra copy of the outer
tuple if there is a second match in another inner batch fragment.
Again, this implies some kind of memory between loops, so we can
suppress later matches.

* For anti joins, we need to emit an outer tuple whenever there is no
match.  To add looping support, we need to wait until we've seen all
the inner batch fragments before we know that a given outer tuple has
no match, perhaps with the same new function
ExecScanOuterBatchForUnmatched().

So, we need some kind of inter-loop memory, but we obviously don't
want to create another source of unmetered RAM gobbling.  So one idea
is a BufFile that has one bit per outer tuple in the batch.  In the
first loop, we just stream out the match results as we go, and then
somehow we OR the bitmap with the match results in subsequent loops.
After the last loop, we have a list of unmatched tuples -- just scan
it in lock-step with the outer batch and look for 0 bits.
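
As a standalone sketch of that idea (a plain byte array stands in for
the BufFile, and ExecScanOuterBatchForUnmatched() is the hypothetical
function named above):

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

/*
 * One bit per outer tuple, in outer-batch order.  Each loop over the
 * outer batch ORs its match results into the accumulated bitmap.
 */
typedef struct MatchBitmap
{
    unsigned char *bits;
    size_t      ntuples;
} MatchBitmap;

static void
record_match(MatchBitmap *bm, size_t tupleno)
{
    bm->bits[tupleno / 8] |= (unsigned char) (1 << (tupleno % 8));
}

static bool
has_match(const MatchBitmap *bm, size_t tupleno)
{
    return (bm->bits[tupleno / 8] >> (tupleno % 8)) & 1;
}

int
main(void)
{
    MatchBitmap bm = {calloc(2, 1), 10};    /* 10 outer tuples */

    record_match(&bm, 3);   /* matched in inner fragment #1 */
    record_match(&bm, 3);   /* matched again in fragment #2: idempotent */
    record_match(&bm, 7);

    /*
     * After the last loop, ExecScanOuterBatchForUnmatched() would do
     * the equivalent of: emit every outer tuple whose bit is still 0.
     */
    for (size_t i = 0; i < bm.ntuples; i++)
        if (!has_match(&bm, i))
            printf("emit outer tuple %zu with NULLs from the inner side\n", i);

    free(bm.bits);
    return 0;
}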

Unfortunately that bits-in-order scheme doesn't work for parallel
hash, where the SharedTuplestore tuples seen by each worker are
non-deterministic.  So perhaps in that case we could use the
HEAP_TUPLE_HAS_MATCH bit in the outer tuple header itself, and write
the whole outer batch back out each time through the loop.  That'd
keep the tuples and match bits together, but it seems like a lot of
IO...  Note that parallel hash doesn't support right/full joins today,
because of some complications about waiting and deadlocks that might
turn out to be relevant here too, and might be solvable (I should
probably write about that in another email), but left joins *are*
supported today and so would need to be desupported if we wanted to add
a loop-based escape valve but not deal with these problems.  That
doesn't seem acceptable, which is why I'm a bit stuck on this point,
and unfortunately it may be a while before I have time to tackle any
of that personally.

> Is the implementation you are thinking of one which falls back to NLJ on a
> batch-by-batch basis decided during the build phase?

Yeah.

> If so, why do you need to keep track of the outer tuples seen?
> If you are going to loop through the whole outer side for each tuple on the
> inner side, it seems like you wouldn't need to.

The idea is to loop through the whole outer batch for every
work_mem-sized inner batch fragment, not every tuple.  Though in
theory it could be as small as a single tuple.

> Could you make an outer "batch" which is the whole of the outer relation? That
> is, could you do something like: when hashing the inner side, if re-partitioning
> is resulting in batches that will overflow spaceAllowed, could you set a flag on
> that batch use_NLJ and when making batches for the outer side, make one "batch"
> that has all the tuples from the outer side which the inner side batch which was
> flagged will do NLJ with.

I didn't understand this... you always need to make one outer batch
corresponding to every inner batch.  The problem is the tricky
left/full/anti/semi join cases when joining against fragments holding
less than the full inner batch: we still need some way to implement
join logic that depends on knowing whether there is a match in *any*
of the inner fragments/loops.

About the question of when exactly to set the "use_NLJ" flag:  I had
originally been thinking of this only as a way to deal with the
extreme skew problem.  But in light of Tomas's complaints about
unmetered per-batch memory overheads, I had a new thought: it should
also be triggered whenever doubling the number of batches would halve
the amount of memory left for the hash table (after including the size
of all those BufFile objects in the computation as Tomas proposes).  I
think that might be exactly the right cut-off if you want to do
as much Grace partitioning as your work_mem can afford, and therefore
as little looping as possible to complete the join while respecting
work_mem.

--
Thomas Munro
https://enterprisedb.com



Re: Avoiding hash join batch explosions with extreme skew and weird stats

Tomas Vondra-4
On Mon, May 20, 2019 at 11:07:03AM +1200, Thomas Munro wrote:

>On Sat, May 18, 2019 at 12:15 PM Melanie Plageman
><[hidden email]> wrote:
>> On Thu, May 16, 2019 at 3:22 PM Thomas Munro <[hidden email]> wrote:
>>> Admittedly I don't have a patch, just a bunch of handwaving.  One
>>> reason I haven't attempted to write it is because although I know how
>>> to do the non-parallel version using a BufFile full of match bits in
>>> sync with the tuples for outer joins, I haven't figured out how to do
>>> it for parallel-aware hash join, because then each loop over the outer
>>> batch could see different tuples in each participant.  You could use
>>> the match bit in HashJoinTuple header, but then you'd have to write
>>> all the tuples out again, which is more IO than I want to do.  I'll
>>> probably start another thread about that.
>>
>> Could you explain more about the implementation you are suggesting?
>>
>> Specifically, what do you mean "BufFile full of match bits in sync with the
>> tuples for outer joins?"
>
>First let me restate the PostgreSQL terminology for this stuff so I
>don't get confused while talking about it:
>
>* The inner side of the join = the right side = the side we use to
>build a hash table.  Right and full joins emit inner tuples when there
>is no matching tuple on the outer side.
>
>* The outer side of the join = the left side = the side we use to
>probe the hash table.  Left and full joins emit outer tuples when
>there is no matching tuple on the inner side.
>
>* Semi and anti joins emit exactly one instance of each outer tuple if
>there is/isn't at least one match on the inner side.
>

I think you're conflating inner/outer side and left/right, or rather
assuming it's always left=inner and right=outer.

> ... snip ...
>
>> Could you make an outer "batch" which is the whole of the outer relation? That
>> is, could you do something like: when hashing the inner side, if re-partitioning
>> is resulting in batches that will overflow spaceAllowed, could you set a flag on
>> that batch use_NLJ and when making batches for the outer side, make one "batch"
>> that has all the tuples from the outer side which the inner side batch which was
>> flagged will do NLJ with.
>
>I didn't understand this... you always need to make one outer batch
>corresponding to every inner batch.  The problem is the tricky
>left/full/anti/semi join cases when joining against fragments holding
>less than the full inner batch: we still need some way to implement
>join logic that depends on knowing whether there is a match in *any*
>of the inner fragments/loops.
>
>About the question of when exactly to set the "use_NLJ" flag:  I had
>originally been thinking of this only as a way to deal with the
>extreme skew problem.  But in light of Tomas's complaints about
>unmetered per-batch memory overheads, I had a new thought: it should
>also be triggered whenever doubling the number of batches would halve
>the amount of memory left for the hash table (after including the size
>of all those BufFile objects in the computation as Tomas proposes).  I
>think that might be exactly the right cut-off if you want to do
>as much Grace partitioning as your work_mem can afford, and therefore
>as little looping as possible to complete the join while respecting
>work_mem.
>

Not sure what NLJ flag rule you propose, exactly.

Regarding the threshold value - once the space for BufFiles (and other
overhead) gets over work_mem/2, it does not make any sense to increase
the number of batches because then the work_mem would be entirely
occupied by BufFiles.

The WIP patches don't actually do exactly that though - they just check
if the incremented size would be over work_mem/2. I think we should
instead allow up to work_mem*2/3, i.e. stop adding batches after the
BufFiles start consuming more than work_mem/3 memory.

I think that's actually what you mean by "halving the amount of memory
left for the hash table" because that's what happens after reaching the
work_mem/3.

But I think that rule is irrelevant here, really, because this thread
was discussing cases where adding batches is futile due to skew, no? In
which case we should stop adding batches after reaching some % of tuples
not moving from the batch.

Or are you suggesting we should remove that rule, and instead rely on
this rule about halving the hash table space? That might work too, I
guess.

OTOH I'm not sure it's a good idea to handle both those cases the same
way - "overflow file" idea works pretty well for cases where the hash
table actually can be split into batches, and I'm afraid NLJ will be
much less efficient for those cases.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Avoiding hash join batch explosions with extreme skew and weird stats

Thomas Munro-5
On Mon, May 20, 2019 at 12:22 PM Tomas Vondra
<[hidden email]> wrote:

> On Mon, May 20, 2019 at 11:07:03AM +1200, Thomas Munro wrote:
> >First let me restate the PostgreSQL terminology for this stuff so I
> >don't get confused while talking about it:
> >
> >* The inner side of the join = the right side = the side we use to
> >build a hash table.  Right and full joins emit inner tuples when there
> >is no matching tuple on the outer side.
> >
> >* The outer side of the join = the left side = the side we use to
> >probe the hash table.  Left and full joins emit outer tuples when
> >there is no matching tuple on the inner side.
> >
> >* Semi and anti joins emit exactly one instance of each outer tuple if
> >there is/isn't at least one match on the inner side.
> >
>
> I think you're conflating inner/outer side and left/right, or rather
> assuming it's always left=inner and right=outer.

In PostgreSQL, it's always inner = right, outer = left.  You can see
that reflected in plannodes.h and elsewhere:

/* ----------------
 *      these are defined to avoid confusion problems with "left"
 *      and "right" and "inner" and "outer".  The convention is that
 *      the "left" plan is the "outer" plan and the "right" plan is
 *      the inner plan, but these make the code more readable.
 * ----------------
 */
#define innerPlan(node)                 (((Plan *)(node))->righttree)
#define outerPlan(node)                 (((Plan *)(node))->lefttree)

I'm not sure if you think it's not always like that: are you referring to
the fact that the planner can choose to reverse the join (compared to
the SQL LEFT|RIGHT JOIN that appeared in the query), creating an extra
layer of confusion?  In my email I was talking only about left and
right as seen by the executor.

> >About the question of when exactly to set the "use_NLJ" flag:  I had
> >originally been thinking of this only as a way to deal with the
> >extreme skew problem.  But in light of Tomas's complaints about
> >unmetered per-batch memory overheads, I had a new thought: it should
> >also be triggered whenever doubling the number of batches would halve
> >the amount of memory left for the hash table (after including the size
> >of all those BufFile objects in the computation as Tomas proposes).  I
> >think that might be exactly the right cut-off if you want to do
> >as much Grace partitioning as your work_mem can afford, and therefore
> >as little looping as possible to complete the join while respecting
> >work_mem.
> >
>
> Not sure what NLJ flag rule you propose, exactly.
>
> Regarding the threshold value - once the space for BufFiles (and other
> overhead) gets over work_mem/2, it does not make any sense to increase
> the number of batches because then the work_mem would be entirely
> occupied by BufFiles.
>
> The WIP patches don't actually do exactly that though - they just check
> if the incremented size would be over work_mem/2. I think we should
> instead allow up to work_mem*2/3, i.e. stop adding batches after the
> BufFiles start consuming more than work_mem/3 memory.
>
> I think that's actually what you mean by "halving the amount of memory
> left for the hash table" because that's what happens after reaching the
> work_mem/3.

Well, instead of an arbitrary number like work_mem/2 or work_mem *
2/3, I was trying to figure out the precise threshold beyond which it
doesn't make sense to expend more memory on BufFile objects, even if
the keys are uniformly distributed so that splitting batches halves
the expected tuple count per batch.  Let work_mem_for_hash_table =
work_mem - nbatch * sizeof(BufFile).  Whenever you increase nbatch,
work_mem_for_hash_table goes down, but it had better be more than half
what it was before, or we expect to run out of memory again (if the
batch didn't fit before, and we're now splitting it so that we'll try
to load only half of it, we'd better have more than half the budget
for the hash table that we had before).  Otherwise you'd be making
matters worse, and this process probably won't terminate.
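
That condition is easy to check numerically.  A standalone sketch (the
per-batch overhead constant is illustrative; a real calculation would
also count both the inner and outer file per batch):

#include <stdio.h>

#define BUFFILE_SIZE 8192L      /* illustrative per-batch overhead, bytes */

/*
 * Doubling nbatch only helps if the memory left for the hash table
 * stays above half of what it was before the doubling.
 */
static int
doubling_helps(long work_mem, long nbatch)
{
    long        before = work_mem - nbatch * BUFFILE_SIZE;
    long        after = work_mem - nbatch * 2 * BUFFILE_SIZE;

    return after > before / 2;
}

int
main(void)
{
    long        work_mem = 4L * 1024 * 1024;   /* 4MB, as in the examples */

    for (long nbatch = 16; nbatch <= 512; nbatch *= 2)
        printf("nbatch %4ld -> %4ld: %s\n", nbatch, nbatch * 2,
               doubling_helps(work_mem, nbatch)
               ? "worth trying" : "stop: hash table would (more than) halve");
    return 0;
}

With these numbers the answer flips once the BufFile overhead passes
roughly work_mem / 3, which agrees with the work_mem/3 threshold Tomas
describes above.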

> But I think that rule is irrelevant here, really, because this thread
> was discussing cases where adding batches is futile due to skew, no? In
> which case we should stop adding batches after reaching some % of tuples
> not moving from the batch.

Yeah, this thread started off just about the 95% thing, but veered off
course since these topics are tangled up.  Sorry.

> Or are you suggesting we should remove that rule, and instead rely on
> this rule about halving the hash table space? That might work too, I
> guess.

No, I suspect you need both rules.  We still want to detect extreme
skew as soon as possible, even though the other rule will eventually
fire; might as well do it sooner in clear-cut cases.

> OTOH I'm not sure it's a good idea to handle both those cases the same
> way - "overflow file" idea works pretty well for cases where the hash
> table actually can be split into batches, and I'm afraid NLJ will be
> much less efficient for those cases.

Yeah, you might be right about that, and everything I'm describing is
pure vapourware anyway.  But your overflow file scheme isn't exactly
free of IO-amplification and multiple-processing of input data
either... and I haven't yet grokked how it would work for parallel
hash.  Parallel hash generally doesn't have the
'throw-the-tuples-forward' concept, which is inherently based on
sequential in-order processing of batches.

--
Thomas Munro
https://enterprisedb.com



Re: Avoiding hash join batch explosions with extreme skew and weird stats

Andres Freund
Hi,

On 2019-05-20 13:25:52 +1200, Thomas Munro wrote:

> In PostgreSQL, it's always inner = right, outer = left.  You can see
> that reflected in plannodes.h and elsewhere:
>
> /* ----------------
>  *      these are defined to avoid confusion problems with "left"
>  *      and "right" and "inner" and "outer".  The convention is that
>  *      the "left" plan is the "outer" plan and the "right" plan is
>  *      the inner plan, but these make the code more readable.
>  * ----------------
>  */
> #define innerPlan(node)                 (((Plan *)(node))->righttree)
> #define outerPlan(node)                 (((Plan *)(node))->lefttree)

I really don't understand why we don't just rename those fields.

Greetings,

Andres Freund



Re: Avoiding hash join batch explosions with extreme skew and weird stats

Tomas Vondra-4
In reply to this post by Thomas Munro-5
On Mon, May 20, 2019 at 01:25:52PM +1200, Thomas Munro wrote:

>On Mon, May 20, 2019 at 12:22 PM Tomas Vondra
><[hidden email]> wrote:
>> On Mon, May 20, 2019 at 11:07:03AM +1200, Thomas Munro wrote:
>> >First let me restate the PostgreSQL terminology for this stuff so I
>> >don't get confused while talking about it:
>> >
>> >* The inner side of the join = the right side = the side we use to
>> >build a hash table.  Right and full joins emit inner tuples when there
>> >is no matching tuple on the outer side.
>> >
>> >* The outer side of the join = the left side = the side we use to
>> >probe the hash table.  Left and full joins emit outer tuples when
>> >there is no matching tuple on the inner side.
>> >
>> >* Semi and anti joins emit exactly one instance of each outer tuple if
>> >there is/isn't at least one match on the inner side.
>> >
>>
>> I think you're conflating inner/outer side and left/right, or rather
>> assuming it's always left=inner and right=outer.
>
>In PostgreSQL, it's always inner = right, outer = left.  You can see
>that reflected in plannodes.h and elsewhere:
>
>/* ----------------
> *      these are defined to avoid confusion problems with "left"
> *      and "right" and "inner" and "outer".  The convention is that
> *      the "left" plan is the "outer" plan and the "right" plan is
> *      the inner plan, but these make the code more readable.
> * ----------------
> */
>#define innerPlan(node)                 (((Plan *)(node))->righttree)
>#define outerPlan(node)                 (((Plan *)(node))->lefttree)
>
>I'm not sure if you think it's not always like that: are you referring to
>the fact that the planner can choose to reverse the join (compared to
>the SQL LEFT|RIGHT JOIN that appeared in the query), creating an extra
>layer of confusion?  In my email I was talking only about left and
>right as seen by the executor.
>

It might be my lack of understanding, but I'm not sure how we map
LEFT/RIGHT JOIN to left/righttree and inner/outer at plan level. My
assumption was that for "a LEFT JOIN b", both "a" and "b" can end up
as either the inner or the outer (sub)tree.

But I haven't checked so I may easily be wrong. Maybe the comment you
quoted clarifies that, not sure.

>> >About the question of when exactly to set the "use_NLJ" flag:  I had
>> >originally been thinking of this only as a way to deal with the
>> >extreme skew problem.  But in light of Tomas's complaints about
>> >unmetered per-batch memory overheads, I had a new thought: it should
>> >also be triggered whenever doubling the number of batches would halve
>> >the amount of memory left for the hash table (after including the size
>> >of all those BufFile objects in the computation as Tomas proposes).  I
>> >think that might be exactly the right cut-off if you want to do
>> >as much Grace partitioning as your work_mem can afford, and therefore
>> >as little looping as possible to complete the join while respecting
>> >work_mem.
>> >
>>
>> Not sure what NLJ flag rule you propose, exactly.
>>
>> Regarding the threshold value - once the space for BufFiles (and other
>> overhead) gets over work_mem/2, it does not make any sense to increase
>> the number of batches because then the work_mem would be entirely
>> occupied by BufFiles.
>>
>> The WIP patches don't actually do exactly that though - they just check
>> if the incremented size would be over work_mem/2. I think we should
>> instead allow up to work_mem*2/3, i.e. stop adding batches after the
>> BufFiles start consuming more than work_mem/3 memory.
>>
>> I think that's actually what you mean by "halving the amount of memory
>> left for the hash table" because that's what happens after reaching the
>> work_mem/3.
>
>Well, instead of an arbitrary number like work_mem/2 or work_mem *
>2/3, I was trying to figure out the precise threshold beyond which it
>doesn't make sense to expend more memory on BufFile objects, even if
>the keys are uniformly distributed so that splitting batches halves
>the expected tuple count per batch.  Let work_mem_for_hash_table =
>work_mem - nbatch * sizeof(BufFile).  Whenever you increase nbatch,
>work_mem_for_hash_table goes down, but it had better be more than half
>what it was before, or we expect to run out of memory again (if the
>batch didn't fit before, and we're now splitting it so that we'll try
>to load only half of it, we'd better have more than half the budget
>for the hash table that we had before).  Otherwise you'd be making
>matters worse, and this process probably won't terminate.
>

But the work_mem/3 does exactly that.

Let's say BufFiles need a bit less than work_mem/3. That means we have
a bit more than 2*work_mem/3 for the hash table. If you double the number
of batches, then you'll end up with a bit more than work_mem/3. That is,
we've not quite halved the hash table space.

If BufFiles need a bit more memory than work_mem/3, then after doubling
the number of batches we'll end up with less than half the initial hash
table space.

So I think work_mem/3 is the threshold we're looking for.

>> But I think that rule is irrelevant here, really, because this thread
>> was discussing cases where adding batches is futile due to skew, no? In
>> which case we should stop adding batches after reaching some % of tuples
>> not moving from the batch.
>
>Yeah, this thread started off just about the 95% thing, but veered off
>course since these topics are tangled up.  Sorry.
>
>> Or are you suggesting we should remove that rule, and instead rely on
>> this rule about halving the hash table space? That might work too, I
>> guess.
>
>No, I suspect you need both rules.  We still want to detect extreme
>skew as soon as possible, even though the other rule will eventually
>fire; might as well do it sooner in clear-cut cases.
>

Right, I agree. I think we need the 95% rule (or whatever) to handle the
cases with skew / many duplicates, and then the overflow files to handle
underestimates with uniform distribution (or some other solution).

>> OTOH I'm not sure it's a good idea to handle both those cases the same
>> way - "overflow file" idea works pretty well for cases where the hash
>> table actually can be split into batches, and I'm afraid NLJ will be
>> much less efficient for those cases.
>
>Yeah, you might be right about that, and everything I'm describing is
>pure vapourware anyway.  But your overflow file scheme isn't exactly
>free of IO-amplification and multiple-processing of input data
>either... and I haven't yet grokked how it would work for parallel
>hash.  Parallel hash generally doesn't have the
>'throw-the-tuples-forward' concept, which is inherently based on
>sequential in-order processing of batches.
>

Sure, let's do some math.

With the overflow scheme, the amplification is roughly ~2x (relative to
master), because we need to write data for most batches first into the
overflow file and then to the correct one. Master has write amplification
of about ~1.25x (due to the gradual increase of batches), so the "total"
amplification is ~2.5x.

For the NLJ, the amplification fully depends on what fraction of the hash
table fits into work_mem. For example when it needs to be split into 32
fragments, we have ~32x amplification. It might affect just some batches,
of course.

So I still think those approaches are complementary and we need both.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Avoiding hash join batch explosions with extreme skew and weird stats

Melanie Plageman
In reply to this post by Thomas Munro-5

On Sun, May 19, 2019 at 4:07 PM Thomas Munro <[hidden email]> wrote:
> ... snip ...
>
> So, we need some kind of inter-loop memory, but we obviously don't
> want to create another source of unmetered RAM gobbling.  So one idea
> is a BufFile that has one bit per outer tuple in the batch.  In the
> first loop, we just stream out the match results as we go, and then
> somehow we OR the bitmap with the match results in subsequent loops.
> After the last loop, we have a list of unmatched tuples -- just scan
> it in lock-step with the outer batch and look for 0 bits.

That makes sense. Thanks for the detailed explanation.
 

> Unfortunately that bits-in-order scheme doesn't work for parallel
> hash, where the SharedTuplestore tuples seen by each worker are
> non-deterministic.  So perhaps in that case we could use the
> HEAP_TUPLE_HAS_MATCH bit in the outer tuple header itself, and write
> the whole outer batch back out each time through the loop.  That'd
> keep the tuples and match bits together, but it seems like a lot of
> IO...

If you set the has_match flag in the tuple header itself, wouldn't you only
need to write the tuples from the outer batch back out that don't have
matches?
 
> > If so, why do you need to keep track of the outer tuples seen?
> > If you are going to loop through the whole outer side for each tuple on the
> > inner side, it seems like you wouldn't need to.
>
> The idea is to loop through the whole outer batch for every
> work_mem-sized inner batch fragment, not every tuple.  Though in
> theory it could be as small as a single tuple.

> > Could you make an outer "batch" which is the whole of the outer relation? That
> > is, could you do something like: when hashing the inner side, if re-partitioning
> > is resulting in batches that will overflow spaceAllowed, could you set a flag on
> > that batch use_NLJ and when making batches for the outer side, make one "batch"
> > that has all the tuples from the outer side which the inner side batch which was
> > flagged will do NLJ with.
>
> I didn't understand this... you always need to make one outer batch
> corresponding to every inner batch.  The problem is the tricky
> left/full/anti/semi join cases when joining against fragments holding
> less than the full inner batch: we still need some way to implement
> join logic that depends on knowing whether there is a match in *any*
> of the inner fragments/loops.

Sorry, my suggestion was inaccurate and unclear: I was basically suggesting
that once you have all batches created for outer and inner sides, for a
given inner side batch that does not fit in memory, for each outer tuple in
the corresponding outer batch file, load and join all of the chunks of the
inner batch file. That way, before you emit that tuple, you have checked
all of the corresponding inner batch.

Thinking about it now, I realize that that would be worse in all cases than
what you are thinking of -- joining the outer side batch with the inner
side batch chunk that fits in memory and marking the BufFile bit
representing that outer side tuple as "matched" and only emitting it with a
NULL from the inner side after all chunks have been processed.

--
Melanie Plageman

Re: Avoiding hash join batch explosions with extreme skew and weird stats

Melanie Plageman
In reply to this post by Thomas Munro-5


On Sun, May 19, 2019 at 4:07 PM Thomas Munro <[hidden email]> wrote:
On Sat, May 18, 2019 at 12:15 PM Melanie Plageman
<[hidden email]> wrote:
> On Thu, May 16, 2019 at 3:22 PM Thomas Munro <[hidden email]> wrote:
>> Admittedly I don't have a patch, just a bunch of handwaving.  One
>> reason I haven't attempted to write it is because although I know how
>> to do the non-parallel version using a BufFile full of match bits in
>> sync with the tuples for outer joins, I haven't figured out how to do
>> it for parallel-aware hash join, because then each loop over the outer
>> batch could see different tuples in each participant.  You could use
>> the match bit in HashJoinTuple header, but then you'd have to write
>> all the tuples out again, which is more IO than I want to do.  I'll
>> probably start another thread about that.
>
> Could you explain more about the implementation you are suggesting?
>
> Specifically, what do you mean "BufFile full of match bits in sync with the
> tuples for outer joins?"

First let me restate the PostgreSQL terminology for this stuff so I
don't get confused while talking about it:

* The inner side of the join = the right side = the side we use to
build a hash table.  Right and full joins emit inner tuples when there
is no matching tuple on the outer side.

* The outer side of the join = the left side = the side we use to
probe the hash table.  Left and full joins emit outer tuples when
there is no matching tuple on the inner side.

* Semi and anti joins emit exactly one instance of each outer tuple if
there is/isn't at least one match on the inner side.

We have a couple of relatively easy cases:

* Inner joins: for every outer tuple, we try to find a match in the
hash table, and if we find one we emit a tuple.  To add looping
support, if we run out of memory when loading the hash table we can
just proceed to probe the fragment we've managed to load so far, and
then rewind the outer batch, clear the hash table and load in the next
work_mem-sized fragment and do it again... rinse and repeat until
we've eventually processed the whole inner batch.  After we've
finished looping, we move on to the next batch.

* For right and full joins ("HJ_FILL_INNER"), we also need to emit an
inner tuple for every tuple that was loaded into the hash table but
never matched.  That's done using a flag HEAP_TUPLE_HAS_MATCH in the
header of the tuples of the hash table, and a scan through the whole
hash table at the end of each batch to look for unmatched tuples
(ExecScanHashTableForUnmatched()).  To add looping support, that just
has to be done at the end of every inner batch fragment, that is,
after every loop.

And now for the cases that need a new kind of match bit, as far as I can see:

* For left and full joins ("HJ_FILL_OUTER"), we also need to emit an
outer tuple for every tuple that didn't find a match in the hash
table.  Normally that is done while probing, without any need for
memory or match flags: if we don't find a match, we just spit out an
outer tuple immediately.  But that simple strategy won't work if the
hash table holds only part of the inner batch.  Since we'll be
rewinding and looping over the outer batch again for the next inner
batch fragment, we can't yet say if there will be a match in a later
loop.  But the later loops don't know on their own either.  So we need
some kind of cumulative memory between loops, and we only know which
outer tuples have a match after we've finished all loops.  So there
would need to be a new function ExecScanOuterBatchForUnmatched().

* For semi joins, we need to emit exactly one outer tuple whenever
there is at least one match on the inner side.  To add looping support,
we need to make sure that we don't emit an extra copy of the outer
tuple if there is a second match in another inner batch fragment.
Again, this implies some kind of memory between loops, so we can
suppress later matches.

* For anti joins, we need to emit an outer tuple whenever there is no
match.  To add looping support, we need to wait until we've seen all
the inner batch fragments before we know that a given outer tuple has
no match, perhaps with the same new function
ExecScanOuterBatchForUnmatched().

So, we need some kind of inter-loop memory, but we obviously don't
want to create another source of unmetered RAM gobbling.  So one idea
is a BufFile that has one bit per outer tuple in the batch.  In the
first loop, we just stream out the match results as we go, and then
somehow we OR the bitmap with the match results in subsequent loops.
After the last loop, we have a list of unmatched tuples -- just scan
it in lock-step with the outer batch and look for 0 bits.
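
Continuing the sketch above with the same made-up helpers: each loop
would set a bit whenever an outer tuple finds a match, the per-loop
bits would be OR'd into the cumulative bitmap, and a final pass would
emit the 0 bits.  (In reality the bits would presumably be streamed
through a BufFile rather than held in memory, as described above.)

#include <stddef.h>

extern void emit_outer_with_nulls(Tuple *outer);

/* OR one loop's match bits into the cumulative bitmap. */
static void
merge_match_bits(unsigned char *cumulative,
                 const unsigned char *loop_bits, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++)
        cumulative[i] |= loop_bits[i];
}

/* After the last loop: scan the outer batch in lock-step with the bits. */
static void
scan_outer_batch_for_unmatched(BatchFile *outer,
                               const unsigned char *cumulative)
{
    Tuple      *t;
    size_t      pos = 0;

    rewind_batch(outer);
    while ((t = next_tuple(outer)) != NULL)
    {
        if ((cumulative[pos / 8] & (1 << (pos % 8))) == 0)
            emit_outer_with_nulls(t);   /* never matched in any loop */
        pos++;
    }
}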

Unfortunately that bits-in-order scheme doesn't work for parallel
hash, where the SharedTuplestore tuples seen by each worker are
non-deterministic.  So perhaps in that case we could use the
HEAP_TUPLE_HAS_MATCH bit in the outer tuple header itself, and write
the whole outer batch back out each time through the loop.  That'd
keep the tuples and match bits together, but it seems like a lot of
IO...  Note that parallel hash doesn't support right/full joins today,
because of some complications about waiting and deadlocks that might
turn out to be relevant here too, and might be solvable (I should
probably write about that in another email), but left joins *are*
supported today, so they would need to be desupported if we wanted to
add a loop-based escape valve without dealing with these problems.  That
doesn't seem acceptable, which is why I'm a bit stuck on this point,
and unfortunately it may be a while before I have time to tackle any
of that personally.


There was an off-list discussion at PGCon last week about implementing
this hash looping strategy using the bitmap of match bits, and solving
the parallel hash join problem by encoding tuple-identifying information
in the bitmap.  Each worker could then indicate that an outer tuple had
a match while processing a given inner side chunk, and, at the end of
the scan of the outer side, the bitmaps would be OR'd together to
produce a single view of the unmatched tuples from that iteration.

I was talking to Jeff Davis about this on Saturday, and he felt that
there might be a way to solve the problem differently if we thought of
the left join case as performing an inner join and an antijoin
instead.

Riffing on this idea a bit, I started trying to write a patch that
would basically emit a tuple if it matches and write the tuple out to
a file if it does not match. Then, after iterating through the outer
batch the first time for the first inner chunk, any tuples which do
not yet have a match are the only ones which need to be joined against
the other inner chunks. Instead of iterating through the outer side
original batch file, use the unmatched outer tuples file to do the
join against the next chunk. Repeat this for all chunks.

Could we not do this and avoid using the match bit? In the worst case,
you would have to write out all the tuples on the outer side (if none
match) nchunks times, where a chunk is the work_mem-sized fragment of
the inner batch loaded into the hash table.
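
Roughly like this, reusing the made-up helpers from the sketch
upthread; create_batch_file() and write_tuple() are also invented:

extern BatchFile *create_batch_file(void);
extern void write_tuple(BatchFile *f, Tuple *t);

static void
probe_with_shrinking_outer(BatchFile *inner, BatchFile *outer,
                           HashTable *ht)
{
    BatchFile  *probe_input = outer;
    bool        more = true;

    while (more)
    {
        BatchFile  *still_unmatched = create_batch_file();
        Tuple      *t;

        more = load_inner_fragment(inner, ht);

        rewind_batch(probe_input);
        while ((t = next_tuple(probe_input)) != NULL)
        {
            if (!probe_and_emit(ht, t))
                write_tuple(still_unmatched, t);    /* no match yet */
        }

        reset_hash_table(ht);
        probe_input = still_unmatched;  /* only unmatched go around again */
    }

    /* Left join: emit whatever remains in probe_input, NULL-extended. */
}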

--
Melanie Plageman
Reply | Threaded
Open this post in threaded view
|

Re: Avoiding hash join batch explosions with extreme skew and weird stats

Robert Haas
On Mon, Jun 3, 2019 at 5:10 PM Melanie Plageman
<[hidden email]> wrote:

> I was talking to Jeff Davis about this on Saturday, and he felt that
> there might be a way to solve the problem differently if we thought of
> the left join case as performing an inner join and an antijoin
> instead.
>
> Riffing on this idea a bit, I started trying to write a patch that
> would basically emit a tuple if it matches and write the tuple out to
> a file if it does not match. Then, after iterating through the outer
> batch the first time for the first inner chunk, any tuples which do
> not yet have a match are the only ones which need to be joined against
> the other inner chunks. Instead of iterating through the outer side
> original batch file, use the unmatched outer tuples file to do the
> join against the next chunk. Repeat this for all chunks.

I'm not sure that I'm understanding this proposal correctly, but if I am
then I think it doesn't work in the case where a single outer row
matches rows in many different inner chunks.  When you "use the
unmatched outer tuples file to do the join against the next chunk,"
you deny any rows that have already matched the chance to produce
additional matches.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply | Threaded
Open this post in threaded view
|

Re: Avoiding hash join batch explosions with extreme skew and weird stats

Robert Haas
In reply to this post by Thomas Munro-5
On Sun, May 19, 2019 at 7:07 PM Thomas Munro <[hidden email]> wrote:
> Unfortunately that bits-in-order scheme doesn't work for parallel
> hash, where the SharedTuplestore tuples seen by each worker are
> non-deterministic.  So perhaps in that case we could use the
> HEAP_TUPLE_HAS_MATCH bit in the outer tuple header itself, and write
> the whole outer batch back out each time through the loop.  That'd
> keep the tuples and match bits together, but it seems like a lot of
> IO...

So, I think the case you're worried about here is something like:

Gather
-> Parallel Hash Left Join
  -> Parallel Seq Scan on a
  -> Parallel Hash
    -> Parallel Seq Scan on b

If I understand ExecParallelHashJoinPartitionOuter correctly, we're
going to hash all of a and put it into a set of batch files before we
even get started, so it's possible to identify precisely which tuple
we're talking about by just giving the batch number and the position
of the tuple within that batch.  So while it's true that the
individual workers can't use the number of tuples they've read to know
where they are in the SharedTuplestore, maybe the SharedTuplestore
could just tell them.  Then they could maintain a paged bitmap of the
tuples that they've matched to something, indexed by
position-within-the-tuplestore, and those bitmaps could be OR'd
together at the end.
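
The bitmap part might look something like this (everything here is
invented, including how the position gets reported):

#include <stddef.h>

typedef struct WorkerBitmap
{
    unsigned char *bits;        /* one bit per tuple in the batch */
    size_t         nbits;
} WorkerBitmap;

/* Called when the tuple at position 'pos' in the batch finds a match. */
static inline void
worker_bitmap_set(WorkerBitmap *bm, size_t pos)
{
    bm->bits[pos / 8] |= (unsigned char) (1 << (pos % 8));
}

/* At the end of the batch, OR all the workers' bitmaps together. */
static void
worker_bitmap_merge(WorkerBitmap *dst, const WorkerBitmap *src)
{
    size_t      nbytes = (dst->nbits + 7) / 8;

    for (size_t i = 0; i < nbytes; i++)
        dst->bits[i] |= src->bits[i];
}

The merged bitmap could then drive the unmatched scan exactly as in
the non-parallel case.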

Crazy idea, or...?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply | Threaded
Open this post in threaded view
|

Re: Avoiding hash join batch explosions with extreme skew and weird stats

Melanie Plageman
In reply to this post by Robert Haas


On Tue, Jun 4, 2019 at 5:43 AM Robert Haas <[hidden email]> wrote:
> On Mon, Jun 3, 2019 at 5:10 PM Melanie Plageman
> <[hidden email]> wrote:
>> I was talking to Jeff Davis about this on Saturday, and he felt that
>> there might be a way to solve the problem differently if we thought of
>> the left join case as performing an inner join and an antijoin
>> instead.
>>
>> Riffing on this idea a bit, I started trying to write a patch that
>> would basically emit a tuple if it matches and write the tuple out to
>> a file if it does not match. Then, after iterating through the outer
>> batch the first time for the first inner chunk, any tuples which do
>> not yet have a match are the only ones which need to be joined against
>> the other inner chunks. Instead of iterating through the outer side
>> original batch file, use the unmatched outer tuples file to do the
>> join against the next chunk. Repeat this for all chunks.
>
> I'm not sure that I'm understanding this proposal correctly, but if I am
> then I think it doesn't work in the case where a single outer row
> matches rows in many different inner chunks.  When you "use the
> unmatched outer tuples file to do the join against the next chunk,"
> you deny any rows that have already matched the chance to produce
> additional matches.


Oops! You are totally right.
I will amend the idea:
For each chunk on the inner side, loop through both the original batch
file and the unmatched outer tuples file created for the last chunk.
Emit any matches and write out any unmatched tuples to a new unmatched
outer tuples file.

I think, in the worst case, if no tuples from the outer have a match,
you end up writing out all of the outer tuples for each chunk on the
inner side. However, the match-bit-in-the-tuple-header solution would
require just as much writing.
Probably the bigger problem is that in this worst case you would also
need to read double the number of outer tuples for each inner chunk.

However, in the best case it seems like it would be better than the
match bit/write everything from the outer side out solution.
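
To put rough numbers on it: with, say, four inner chunks and an outer
batch of N tuples where nothing ever matches, this scheme reads about
7N tuples (the original batch for every chunk, plus the previous
unmatched file for every chunk after the first) and writes about 4N,
while the rewrite-the-batch match bit scheme reads and writes about 4N
each.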

--
Melanie Plageman
Reply | Threaded
Open this post in threaded view
|

Re: Avoiding hash join batch explosions with extreme skew and weird stats

Robert Haas
On Tue, Jun 4, 2019 at 2:47 PM Melanie Plageman
<[hidden email]> wrote:

> Oops! You are totally right.
> I will amend the idea:
> For each chunk on the inner side, loop through both the original batch
> file and the unmatched outer tuples file created for the last chunk.
> Emit any matches and write out any unmatched tuples to a new unmatched
> outer tuples file.
>
> I think, in the worst case, if no tuples from the outer have a match,
> you end up writing out all of the outer tuples for each chunk on the
> inner side. However, using the match bit in the tuple header solution
> would require this much writing.
> Probably the bigger problem is that in this worst case you would also
> need to read double the number of outer tuples for each inner chunk.
>
> However, in the best case it seems like it would be better than the
> match bit/write everything from the outer side out solution.

I guess so, but the downside of needing to read twice as many outer
tuples for each inner chunk seems pretty large.  It would be a lot
nicer if we could find a way to store the matched-bits someplace other
than where we are storing the tuples, what Thomas called a
bits-in-order scheme, because then the amount of additional read and
write I/O would be tiny -- one bit per tuple doesn't add up very fast.
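
For instance, a batch of 10 million outer tuples needs only
10,000,000 / 8 = 1.25MB or so of match bits per loop, whereas rewriting
the tuples themselves at even 40 bytes apiece would mean something like
400MB of writes per loop.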

In the scheme you propose here, I think that after you read the
original outer tuples for each chunk and the unmatched outer tuples
for each chunk, you'll have to match up the unmatched tuples to the
original tuples, probably by using memcmp() or something.  Otherwise,
when a new match occurs, you won't know which tuple should now not be
emitted into the new unmatched outer tuples file that you're going to
produce.  So I think what's going to happen is that you'll read the
original batch file, then read the unmatched tuples file and use that
to set or not set a bit on each tuple in memory, then do the real work
setting more bits, then write out a new unmatched-tuples file with the
tuples that still don't have the bit set.  So your unmatched tuple
file is basically a list of tuple identifiers in the least compact
form imaginable: the tuple is identified by the entire tuple contents.
That doesn't seem very appealing, although I expect that it would
still win for some queries.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply | Threaded
Open this post in threaded view
|

Re: Avoiding hash join batch explosions with extreme skew and weird stats

Melanie Plageman
In reply to this post by Robert Haas


On Tue, Jun 4, 2019 at 6:05 AM Robert Haas <[hidden email]> wrote:
> On Sun, May 19, 2019 at 7:07 PM Thomas Munro <[hidden email]> wrote:
>> Unfortunately that bits-in-order scheme doesn't work for parallel
>> hash, where the SharedTuplestore tuples seen by each worker are
>> non-deterministic.  So perhaps in that case we could use the
>> HEAP_TUPLE_HAS_MATCH bit in the outer tuple header itself, and write
>> the whole outer batch back out each time through the loop.  That'd
>> keep the tuples and match bits together, but it seems like a lot of
>> IO...
>
> So, I think the case you're worried about here is something like:
>
> Gather
> -> Parallel Hash Left Join
>   -> Parallel Seq Scan on a
>   -> Parallel Hash
>     -> Parallel Seq Scan on b
>
> If I understand ExecParallelHashJoinPartitionOuter correctly, we're
> going to hash all of a and put it into a set of batch files before we
> even get started, so it's possible to identify precisely which tuple
> we're talking about by just giving the batch number and the position
> of the tuple within that batch.  So while it's true that the
> individual workers can't use the number of tuples they've read to know
> where they are in the SharedTuplestore, maybe the SharedTuplestore
> could just tell them.  Then they could maintain a paged bitmap of the
> tuples that they've matched to something, indexed by
> position-within-the-tuplestore, and those bitmaps could be OR'd
> together at the end.
>
> Crazy idea, or...?


That idea does sound like it could work. Basically, a worker is given a
tuple and a bit index into its own bitmap ("process this tuple and, if
it matches, flip the bit at position 30"), right?

I need to spend some time understanding how SharedTuplestore works and
how workers get tuples, so what I'm saying might not make sense.

One question I have is, how would the OR'd together bitmap be
propagated to workers after the first chunk? That is, when there are
no tuples left in the outer batch, for a given inner chunk, would you
load the bitmaps from each worker into memory, OR them together, and
then write the updated bitmap back out so that each worker starts with
the updated bitmap?
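
Something like this is what I am imagining for the merge step, though
all of the shared-state details here are invented:

#include <stddef.h>

/*
 * After the last worker finishes probing a chunk, one elected worker
 * folds every per-worker bitmap into the shared cumulative bitmap;
 * workers then start the next chunk from the merged result, presumably
 * synchronized by a barrier.
 */
static void
merge_and_publish(unsigned char *shared_cumulative,
                  unsigned char **worker_bits, int nworkers,
                  size_t nbytes)
{
    for (int w = 0; w < nworkers; w++)
        for (size_t i = 0; i < nbytes; i++)
            shared_cumulative[i] |= worker_bits[w][i];
}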

--
Melanie Plageman