BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash


Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

Tom Lane-2
Thomas Munro <[hidden email]> writes:
> I think I see what's happening: we're running out of hash bits.

Ugh.

> Besides switching to 64 bit hashes so that we don't run out of
> information (clearly a good idea), what other options do we have?

If the input column is less than 64 bits wide, this is an illusory
solution, because there can't really be that many independent hash bits.
I think we need to have some constraint that keeps us from widening
the batch+bucket counts to more hash bits than we have.

                        regards, tom lane



Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

James Coleman
In reply to this post by Thomas Munro-5
On Sun, Nov 10, 2019 at 4:09 PM Thomas Munro <[hidden email]> wrote:

>
> I think I see what's happening: we're running out of hash bits.
>
> > Buckets: 4194304 (originally 4194304)  Batches: 32768 (originally 4096)  Memory Usage: 344448kB
>
> Here it's using the lower 22 bits for the bucket number, and started
> out using 12 bits for the batch (!), and increased that until it got
> to 15 (!!).  After using 22 bits for the bucket, there are only 10
> bits left, so all the tuples go into the lower 1024 batches.
>
> I'm not sure how exactly this leads to wildly varying numbers of
> repartitioning cycles (the above-quoted example did it 3 times, the
> version that crashed into MaxAllocSize did it ~10 times).
>
> Besides switching to 64 bit hashes so that we don't run out of
> information (clearly a good idea), what other options do we have?  (1)
> We could disable repartitioning (give up on work_mem) after we've run
> out of bits; this will eat more memory than it should.  (2) You could
> start stealing bucket bits; this will eat more CPU than it should,
> because you'd effectively have fewer active buckets (tuples will
> be concentrated on the bits you didn't steal).

Thanks for working through this!

But now we're at the end of my understanding of how hash tables and
joins are implemented in PG; is there a wiki page or design that might
give me some current design description of how the buckets and batches
work with the hash so I can keep following along?

James



Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

Tomas Vondra-4
In reply to this post by Tomas Vondra-4
On Sun, Nov 10, 2019 at 10:23:52PM +0100, Tomas Vondra wrote:

>On Mon, Nov 11, 2019 at 10:08:58AM +1300, Thomas Munro wrote:
>>I think I see what's happening: we're running out of hash bits.
>>
>>>Buckets: 4194304 (originally 4194304)  Batches: 32768 (originally 4096)  Memory Usage: 344448kB
>>
>>Here it's using the lower 22 bits for the bucket number, and started
>>out using 12 bits for the batch (!), and increased that until it got
>>to 15 (!!).  After using 22 bits for the bucket, there are only 10
>>bits left, so all the tuples go into the lower 1024 batches.
>>
>
>Ouch!
>
>>I'm not sure how exactly this leads to wildly varying numbers of
>>repartitioning cycles (the above-quoted example did it 3 times, the
>>version that crashed into MaxAllocSize did it ~10 times).
>>
>>Besides switching to 64 bit hashes so that we don't run out of
>>information (clearly a good idea), what other options do we have?  (1)
>>We could disable repartitioning (give up on work_mem) after we've run
>>out of bits; this will eat more memory than it should.  (2) You could
>>start stealing bucket bits; this will eat more CPU than it should,
>>because you'd effectively have fewer active buckets (tuples will
>>be concentrated on the bits you didn't steal).
>
>Can't we simply compute two hash values, using different seeds - one for
>bucket and the other for batch? Of course, that'll be more expensive.

Meh, I realized that's pretty much just a different way to get 64-bit
hashes (which is what you mentioned).

And I think you're right we should detect cases when we use all the bits,
and stop making it worse. Stealing bucket bits seems reasonable - we
can't stop adding batches, because that would mean we stop enforcing
work_mem. The question is how far that gets us - we enforce nbuckets to
be at least 1024 (i.e. 10 bits), which leaves 22 bits for nbatch. And in
one of the explains we've seen nbatch=2097152 (i.e. 21 bits).

Of course, this is bound to be (extremely) slow. The hashjoin changes in
9.5, which reduced the hash table load factor from 10 to 1, resulted in
speedups close to 3x. And if you go from 4194304 to 1024 buckets, those
will be loooooooong chains in each bucket :-(

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

Tomas Vondra-4
In reply to this post by Tom Lane-2
On Sun, Nov 10, 2019 at 04:34:09PM -0500, Tom Lane wrote:

>Thomas Munro <[hidden email]> writes:
>> I think I see what's happening: we're running out of hash bits.
>
>Ugh.
>
>> Besides switching to 64 bit hashes so that we don't run out of
>> information (clearly a good idea), what other options do we have?
>
>If the input column is less than 64 bits wide, this is an illusory
>solution, because there can't really be that many independent hash bits.
>I think we need to have some constraint that keeps us from widening
>the batch+bucket counts to more hash bits than we have.
>

That's not quite correct - it's a solution/improvement as long as the
values have more than 32 bits of information. I.e. it works even for
values that are less than 64 bits wide, as long as there's more than 32
bits of information (assuming the hash does not lose the extra bits).

In the case discussed in this thread we're dealing with a citext column,
with average length ~8 chars.  So it probably has less than 64 bits of
information, because it's likely just alphanumeric, or something like
that. But likely more than 32 bits.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

Thomas Munro-5
In reply to this post by James Coleman
On Mon, Nov 11, 2019 at 10:38 AM James Coleman <[hidden email]> wrote:
> But now we're at the end of my understanding of how hash tables and
> joins are implemented in PG; is there a wiki page or design that might
> give me some current design description of how the buckets and batches
> work with the hash so I can keep following along?

We have something like the so-called "Grace" hash join (with the
"hybrid" refinement, irrelevant for this discussion):

https://en.wikipedia.org/wiki/Hash_join#Grace_hash_join

Our word "batch" just means partition.  Most descriptions talk about
using two different hash functions for partition and bucket, but our
implementation uses a single hash function, and takes some of the bits
to choose the bucket and some of the bits to choose the batch.  That
happens here:

https://github.com/postgres/postgres/blob/master/src/backend/executor/nodeHash.c#L1872
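
In case it's useful without clicking through, that function boils down
to something like this (a simplified sketch from memory; see the link
above for the authoritative code):

    /*
     * Simplified sketch of ExecHashGetBucketAndBatch(): the low
     * log2(nbuckets) bits of the 32 bit hash choose the bucket, and
     * the bits immediately above them choose the batch.
     */
    void
    ExecHashGetBucketAndBatch(HashJoinTable hashtable, uint32 hashvalue,
                              int *bucketno, int *batchno)
    {
        uint32  nbuckets = (uint32) hashtable->nbuckets;
        uint32  nbatch = (uint32) hashtable->nbatch;

        if (nbatch > 1)
        {
            *bucketno = hashvalue & (nbuckets - 1);
            *batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);
        }
        else
        {
            *bucketno = hashvalue & (nbuckets - 1);
            *batchno = 0;
        }
    }

So in the example discussed earlier in this thread, the bucket number
consumes the low 22 bits and the batch number is taken from whatever is
left above them; once nbatch needs more than those remaining 10 bits,
the extra batch bits are simply zero, which is why everything lands in
the lower 1024 batches.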



Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

Andres Freund
In reply to this post by Tomas Vondra-4
Hi,

On 2019-11-10 22:50:17 +0100, Tomas Vondra wrote:
> On Sun, Nov 10, 2019 at 10:23:52PM +0100, Tomas Vondra wrote:
> > On Mon, Nov 11, 2019 at 10:08:58AM +1300, Thomas Munro wrote:
> > Can't we simply compute two hash values, using different seeds - one for
> > bucket and the other for batch? Of course, that'll be more expensive.
>
> Meh, I realized that's pretty much just a different way to get 64-bit
> hashes (which is what you mentioned).

I'm not sure it's really the same, given practical realities in
postgres. Right now the "extended" hash function supporting 64 bit hash
functions is optional. So we couldn't unconditionally rely on it being
present, even in master, unless we're prepared to declare it as
required from now on.

So computing two different hash values at the same time, by using a
different IV and a different combine function, doesn't seem like an
unreasonable approach.

Greetings,

Andres Freund



Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

Tomas Vondra-4
On Sun, Nov 10, 2019 at 02:46:31PM -0800, Andres Freund wrote:

>Hi,
>
>On 2019-11-10 22:50:17 +0100, Tomas Vondra wrote:
>> On Sun, Nov 10, 2019 at 10:23:52PM +0100, Tomas Vondra wrote:
>> > On Mon, Nov 11, 2019 at 10:08:58AM +1300, Thomas Munro wrote:
>> > Can't we simply compute two hash values, using different seeds - one for
>> > bucket and the other for batch? Of course, that'll be more expensive.
>>
>> Meh, I realized that's pretty much just a different way to get 64-bit
>> hashes (which is what you mentioned).
>
>I'm not sure it's really the same, given practical realities in
>postgres. Right now the "extended" hash function supporting 64 bit hash
>functions is optional. So we couldn't unconditionally rely on it being
>present, even in master, unless we're prepared to declare it as
>required from now on.
>
>So computing two different hash values at the same time, by using a
>different IV and a different combine function, doesn't seem like an
>unreasonable approach.
>

True. I was commenting on the theoretical fact that computing two 32-bit
hashes is close to computing a 64-bit hash, but you're right there are
implementation details that may make it more usable in our case.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

Thomas Munro-5
On Mon, Nov 11, 2019 at 12:44 PM Tomas Vondra
<[hidden email]> wrote:

> On Sun, Nov 10, 2019 at 02:46:31PM -0800, Andres Freund wrote:
> >On 2019-11-10 22:50:17 +0100, Tomas Vondra wrote:
> >> On Sun, Nov 10, 2019 at 10:23:52PM +0100, Tomas Vondra wrote:
> >> > On Mon, Nov 11, 2019 at 10:08:58AM +1300, Thomas Munro wrote:
> >> > Can't we simply compute two hash values, using different seeds - one for
> >> > bucket and the other for batch? Of course, that'll be more expensive.
> >>
> >> Meh, I realized that's pretty much just a different way to get 64-bit
> >> hashes (which is what you mentioned).
> >
> >I'm not sure it's really the same, given practical realities in
> >postgres. Right now the "extended" hash function supporting 64 bit hash
> >functions is optional. So we couldn't unconditionally rely on it being
> >present, even in master, unless we're prepared to declare it as
> >required from now on.
> >
> >So computing two different hash values at the same time, by using a
> >different IV and a different combine function, doesn't seem like an
> >unreasonable approach.
>
> True. I was commenting on the theoretical fact that computing two 32-bit
> hashes is close to computing a 64-bit hash, but you're right there are
> implementation details that may make it more usable in our case.
Here is a quick sketch of something like that, for discussion only.  I
figured that simply mixing the hash value we have with some arbitrary
bits afterwards would be just as good as having started with a
different IV, which leads to a very simple change without refactoring.
From quick experiments with unique keys (generate_series) I seem to
get approximately even sized partitions, and correct answers, but I
make no claim to strong hash-math-fu and haven't tested on very large
inputs.  Thoughts?

rehash-partition.patch (3K) Download Attachment

Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

Tomas Vondra-4
On Mon, Nov 11, 2019 at 01:33:41PM +1300, Thomas Munro wrote:

>On Mon, Nov 11, 2019 at 12:44 PM Tomas Vondra
><[hidden email]> wrote:
>> On Sun, Nov 10, 2019 at 02:46:31PM -0800, Andres Freund wrote:
>> >On 2019-11-10 22:50:17 +0100, Tomas Vondra wrote:
>> >> On Sun, Nov 10, 2019 at 10:23:52PM +0100, Tomas Vondra wrote:
>> >> > On Mon, Nov 11, 2019 at 10:08:58AM +1300, Thomas Munro wrote:
>> >> > Can't we simply compute two hash values, using different seeds - one for
>> >> > bucket and the other for batch? Of course, that'll be more expensive.
>> >>
>> >> Meh, I realized that's pretty much just a different way to get 64-bit
>> >> hashes (which is what you mentioned).
>> >
>> >I'm not sure it's really the same, given practical realities in
>> >postgres. Right now the "extended" hash function supporting 64 bit hash
>> >functions is optional. So we couldn't unconditionally rely on it being
>> >present, even in master, unless we're prepared to declare it as
>> >required from now on.
>> >
>> >So computing two different hash values at the same time, by using a
>> >different IV and a different combine function, doesn't seem like an
>> >unreasonable approach.
>>
>> True. I was commenting on the theoretical fact that computing two 32-bit
>> hashes is close to computing a 64-bit hash, but you're right there are
>> implementation details that may make it more usable in our case.
>
>Here is a quick sketch of something like that, for discussion only.  I
>figured that simply mixing the hash value we have with some arbitrary
>bits afterwards would be just as good as having started with a
>different IV, which leads to a very simple change without refactoring.
>From quick experiments with unique keys (generate_series) I seem to
>get approximately even sized partitions, and correct answers, but I
>make no claim to strong hash-math-fu and haven't tested on very large
>inputs.  Thoughts?

Hmmm, interesting. Using a hard-coded constant seems a bit weird,
although I don't have any precise argument why it's wrong yet.

Essentially, the patch does this:

   batchno = hash_combine(0x9e3779b9, hashvalue) & (nbatch - 1);
   bucketno = hashvalue & (nbuckets - 1);

but adding a constant does not 'propagate' any bits from the original
value to the hash. So it seems pretty much equivalent to

   batchno = hashvalue & (nbatch - 1);
   bucketno = hashvalue & (nbuckets - 1);

so we still operate with the same number of bits. It might solve this
particular case, but I doubt it's a good idea in general ...

I think we will have to actually compute two hashes, to get more than 32
bits. But yes, that'll be more invasive, and unlikely backpatchable.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

James Coleman
In reply to this post by Thomas Munro-5
On Sun, Nov 10, 2019 at 4:09 PM Thomas Munro <[hidden email]> wrote:
>
> I think I see what's happening: we're running out of hash bits.
>
> > Buckets: 4194304 (originally 4194304)  Batches: 32768 (originally 4096)  Memory Usage: 344448kB
>
> Here it's using the lower 22 bits for the bucket number, and started
> out using 12 bits for the batch (!), and increased that until it got
> to 15 (!!).  After using 22 bits for the bucket, there are only 10
> bits left, so all the tuples go into the lower 1024 batches.

Do we have this kind of problem with hash aggregates also? I've
noticed the temp disk usage pattern applies to both, and the buffers
stats show that to be the case, but unfortunately the hash aggregate
node doesn't report memory usage for its hash or buckets info. Given
it's not a join, maybe we only need buckets and not batches, but I
don't know this part of the code at all, so I'm just guessing
either way.

James



Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

Tomas Vondra-4
On Mon, Nov 11, 2019 at 09:14:43AM -0500, James Coleman wrote:

>On Sun, Nov 10, 2019 at 4:09 PM Thomas Munro <[hidden email]> wrote:
>>
>> I think I see what's happening: we're running out of hash bits.
>>
>> > Buckets: 4194304 (originally 4194304)  Batches: 32768 (originally 4096)  Memory Usage: 344448kB
>>
>> Here it's using the lower 22 bits for the bucket number, and started
>> out using 12 bits for the batch (!), and increased that until it got
>> to 15 (!!).  After using 22 bits for the bucket, there are only 10
>> bits left, so all the tuples go into the lower 1024 batches.
>
>Do we have this kind of problem with hash aggregates also? I've
>noticed the temp disk usage pattern applies to both, and the buffers
>stats show that to be the case, but unfortunately the hash aggregate
>node doesn't report memory usage for its hash or buckets info. Given
>it's not a join, maybe we only need buckets and not batches, but I
>don't know this part of the code at all, so I'm just guessing
>either way.
>

I don't think so, because hash aggregate does not spill to disk. The
trouble with hashjoins is that we need two independent indexes - bucket
and batch, and we only have a single 32-bit hash value. The hash agg is
currently unable to spill to disk, so it's not affected by this (but it
does have plenty of issues of its own).

There's work in progress aiming to add memory-bounded hash aggregate,
but I think the spilling is supposed to work very differently there.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

Thomas Munro-5
In reply to this post by Tomas Vondra-4
On Mon, Nov 11, 2019 at 11:46 PM Tomas Vondra
<[hidden email]> wrote:

> Essentially, the patch does this:
>
>    batchno = hash_combine(0x9e3779b9, hashvalue) & (nbatch - 1);
>    bucketno = hashvalue & (nbuckets - 1);
>
> but adding a constant does not 'propagate' any bits from the original
> value to the hash. So it seems pretty much equivalent to
>
>    batchno = hashvalue & (nbatch - 1);
>    bucketno = hashvalue & (nbuckets - 1);
>
> so we still operate with the same number of bits. It might solve this
> particular case, but I doubt it's a good idea in general ...

Yeah, I think you're generally right, but it's much better than the
specific coding batchno = hashvalue & (nbatch - 1).  With that recipe,
batch #42 finishes up with all its tuples crammed into bucket #42.
That's the worst case of the steal-the-bucket-bits idea mentioned
earlier.

After sleeping on it, I think the re-hash idea is equivalent to
stealing the minimum number of bucket bits.  A simple but subtly
broken version of that would be: take bucket number from the low end,
and batch number from the high end, and don't worry if they begin to
overlap because it's the best we can do:

    batchno = (hashvalue >> (hash_width - nbatch_width)) & (nbatch - 1);
    bucketno = hashvalue & (nbuckets - 1);

Tuples start concentrating in a subset of buckets when the bit ranges
begin to overlap, due to a correlation between batch and bucket, but
that seems to be true for the re-hash idea too.  You can't actually
use that exact coding, though, because of the way we increase nbatch
in serial hash joins: we're only allowed to add new bits to the high
end of the generated batch numbers so that every tuple either stays in
the current batch or is "thrown forward" to a higher-numbered batch
when nbatch increases.  So instead you could write it as:

    batchno = reverse_bits(hashvalue) & (nbatch - 1);
    bucketno = hashvalue & (nbuckets - 1);

I'm not sure if reversing bits can be done any more efficiently than
the cheap-and-cheerful hash_combine() code, though.  With the right
inlining it seems like nice straight-line no-jmp no-lookup-table code,
so maybe it's still worth considering if it has equivalent performance
for our purposes.
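
For concreteness, the version of bit reversal I have in mind is the
usual swap-progressively-larger-groups trick, roughly this (untested
sketch):

    /*
     * Untested sketch: reverse the bits of a 32 bit word by swapping
     * adjacent bits, then pairs, nibbles, bytes and finally halfwords.
     * Straight-line code, no branches or lookup tables.
     */
    static inline uint32
    reverse_bits_u32(uint32 x)
    {
        x = ((x & 0x55555555) << 1) | ((x >> 1) & 0x55555555);
        x = ((x & 0x33333333) << 2) | ((x >> 2) & 0x33333333);
        x = ((x & 0x0f0f0f0f) << 4) | ((x >> 4) & 0x0f0f0f0f);
        x = ((x & 0x00ff00ff) << 8) | ((x >> 8) & 0x00ff00ff);
        return (x << 16) | (x >> 16);
    }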

A tangential observation is that with any of these approaches you can
probably lift the current restriction on increasing nbucket when
nbatch > 1 (that prohibition is based on not wanting overlapping
bucket and batch bit ranges, but if we're deciding that's OK in
extreme cases and taking steps to minimise it in usual cases...)

> I think we will have to actually compute two hashes, to get more than 32
> bits. But yes, that'll be more invasive, and unlikely backpatchable.

Agreed.  I think if you started with a different IV *inside the hash
function*, in other words, if you effectively had two different hash
functions (which is the way Grace is usually described) so that the
two hash functions each get to see (potentially) more than 32 bits'
worth of input, it'd be about the same as having a single 64 bit hash
and using non-overlapping bit ranges for batch and bucket.  Nothing
like that seems plausible for back branches.
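
To make that concrete, with a hypothetical 64 bit hash the two numbers
could simply come from opposite ends of the value, something like this
sketch (hashvalue64 is imaginary here):

    /* Sketch only: with a 64 bit hash, the bucket and batch bit ranges
     * never have to overlap for any plausible nbuckets/nbatch. */
    *bucketno = (uint32) hashvalue64 & (nbuckets - 1);
    *batchno = (uint32) (hashvalue64 >> 32) & (nbatch - 1);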



Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

Andres Freund
In reply to this post by Tomas Vondra-4
Hi,

On 2019-11-11 11:46:05 +0100, Tomas Vondra wrote:

> On Mon, Nov 11, 2019 at 01:33:41PM +1300, Thomas Munro wrote:
> Hmmm, interesting. Using a hard-coded constant seems a bit weird,
> although I don't have any precise argument why it's wrong yet.
>
> Essentially, the patch does this:
>
>   batchno = hash_combine(0x9e3779b9, hashvalue) & (nbatch - 1);
>   bucketno = hashvalue & (nbuckets - 1);
>
> but adding a constant does not 'propagate' any bits from the original
> value to the hash. So it seems pretty much equivalent to

I was arguing over IM that we ought to also use a different mixing
function when computing the hash.


>   batchno = hashvalue & (nbatch - 1);
>   bucketno = hashvalue & (nbuckets - 1);

> so we still operate with the same number of bits. It might solve this
> particular case, but I doubt it's a good idea in general ...
>
> I think we will have to actually compute two hashes, to get more than 32
> bits. But yes, that'll be more invasive, and unlikely backpatchable.

I think it might also just generally not be acceptable, from a
performance POV. Computing the hashes is already a major bottleneck for
HJ. Penalizing everybody for these extreme cases doesn't strike me as
great. It might be ok to compute a 64bit hash, but I don't see how
computing two hashes in parallel wouldn't have significant overhead.


I don't really see what the problem is with using just a differently
mixed hash. There's not really a requirement for there to be a
tremendous amount of independence between the bits used for the bucket,
and the bits for the batch. We cannot just reuse exactly the same bits
for the bucket that we used for the hash, as otherwise we'd just raise
the number of bucket conflicts (as the whole hashtable will only contain
tuples with the reduced number of bits). But I don't see why that
implies a need for full bit independence.  As long as the mixing guarantees
that we use the bits for bucketing differently than the ones for
batches, I think we're good.  All that need is that each batch is is
roughly equally sized, and contains tuples with hashes that roughly
equally distributed - you could bitreverse the hash for batch selection,
and get that, no?


Greetings,

Andres Freund



Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

Thomas Munro-5
On Thu, Nov 14, 2019 at 7:48 AM Andres Freund <[hidden email]> wrote:

> I don't really see what the problem is with using just a differently
> mixed hash. There's not really a requirement for there to be a
> tremendous amount of independence between the bits used for the bucket,
> and the bits for the batch. We cannot just reuse exactly the same bits
> for the bucket that we used for the hash, as otherwise we'd just raise
> the number of bucket conflicts (as the whole hashtable will only contain
> tuples with the reduced number of bits). But I don't see why that
> implies a need for full bit independence.  As long as the mixing guarantees
> that we use the bits for bucketing differently than the ones for
> batches, I think we're good.  All that's needed is that each batch is
> roughly equally sized, and contains tuples with hashes that are roughly
> equally distributed - you could bitreverse the hash for batch selection,
> and get that, no?

Right, that's what I said earlier:

>     batchno = reverse_bits(hashvalue) & (nbatch - 1);
>     bucketno = hashvalue & (nbuckets - 1);

My theory is that batchno = hash_combine(<constant>, hashvalue) &
(nbatch - 1) has about the same distribution properties.  That is, it
works nicely up until you have more than about 2^32 keys, and then it
begins to break down due to lack of information, but the breakdown
manifests as more conflicts in buckets, instead of the failure to
repartition once you start reading off the end of the hash value (the
cause of the present bug report).  The bit reverse approach might be a
bit less mysterious, but when I looked at ways to implement that they
looked potentially slower than the xor, shifts and adds used by
hash_combine.  I didn't attempt to test that, though.
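
For reference, the 32-bit hash_combine() in question is the boost-style
mixer, approximately the following (quoting from memory; the exact
definition lives in the hash utilities header):

    /* Approximate definition: an xor plus shifts and an add, so a bit
     * more than just adding a constant, but cheap straight-line code. */
    static inline uint32
    hash_combine(uint32 a, uint32 b)
    {
        a ^= b + 0x9e3779b9 + (a << 6) + (a >> 2);
        return a;
    }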



Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

Thomas Munro-5
On Thu, Nov 14, 2019 at 9:20 AM Thomas Munro <[hidden email]> wrote:

> On Thu, Nov 14, 2019 at 7:48 AM Andres Freund <[hidden email]> wrote:
> > I don't really see what the problem is with using just a differently
> > mixed hash. There's not really a requirement for there to be a
> > tremendous amount of independence between the bits used for the bucket,
> > and the bits for the batch. We cannot just reuse exactly the same bits
> > for the bucket that we used for the hash, as otherwise we'd just raise
> > the number of bucket conflicts (as the whole hashtable will only contain
> > tuples with the reduced number of bits). But I don't see why that
> > implies a need for full bit independence.  As long as the mixing guarantees
> > that we use the bits for bucketing differently than the ones for
> > batches, I think we're good.  All that's needed is that each batch is
> > roughly equally sized, and contains tuples with hashes that are roughly
> > equally distributed - you could bitreverse the hash for batch selection,
> > and get that, no?
>
> Right, that's what I said earlier:
>
> >     batchno = reverse_bits(hashvalue) & (nbatch - 1);
> >     bucketno = hashvalue & (nbuckets - 1);
>
> My theory is that batchno = hash_combine(<constant>, hashvalue) &
> (nbatch - 1) has about the same distribution properties.  That is, it
> works nicely up until you have more than about 2^32 keys, and then it
> begins to break down due to lack of information, but the breakdown
> manifests as more conflicts in buckets, instead of the failure to
> repartition once you start reading off the end of the hash value (the
> cause of the present bug report).  The bit reverse approach might be a
> bit less mysterious, but when I looked at ways to implement that they
> looked potentially slower than the xor, shifts and adds used by
> hash_combine.  I didn't attempt to test that, though.

If you have plenty of disk space and time, you can repro the problem like so:

create unlogged table sevenb as select generate_series(1, 7000000000)::bigint i;
analyze sevenb;
set work_mem = '150MB';
explain analyze select count(*) from sevenb t1 join sevenb t2 using (i);

If you have max_parallel_workers_per_gather >= 6 it gives up the
strange idea of using a sort-merge join, otherwise you also need
enable_mergejoin = off to prevent that, which is curious.

You can see as soon as it starts executing the join that it's only
writing to 1023 partitions (that is, 1023 * nprocesses temporary files
with names like "i...of4096.p...", and then when one of those gets too
big, "i...of8192.p...", and so on until something goes wrong).  My
(slow) machine hasn't finished repartitioning yet... I'll check again
tomorrow, if it hasn't melted...  Patched, it writes to 4095 as it
should (there is no "i0of4096..."), their sizes are uniform, and the
query completes after a single repartitioning operation.  (Well it's
4096 with work_mem = 150MB and 2 workers, YMMV).

I'm planning to push something like this late this week (though I
think the constant might be better as 0 since it doesn't matter much
given the definition, and I need a back-patchable version that is open
coded or I need to back-patch hash_combine), unless someone has a
better idea.

Separately, I'm working on an idea for how to support extended hashing
for the master branch only.



Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

Thomas Munro-5
On Wed, Nov 20, 2019 at 1:05 PM Thomas Munro <[hidden email]> wrote:
> I'm planning to push something like this late this week (though I
> think the constant might be better as 0 since it doesn't matter much
> given the definition, and I need a back-patchable version that is open
> coded or I need to back-patch hash_combine), unless someone has a
> better idea.

Scratch that plan, it turned out to be a terrible idea.  Although it
solves the batch problem, it's much worse than I expected at bucket
distribution, so the performance is bad.  I think something like
that might work given better bit mixing, but that sounds expensive (or
maybe I'm just being really stupid here).  Anyway, back to the drawing
board.  I'm now thinking the bit stealing scheme (ie take bits from
the other end of the hash, and just let them start to overlap with
bucket bits if you have to when nbatch increases) is a better plan,
and it is so much easier to reason about, and still in the realm of a
handful of instructions.  Here are some numbers[1] showing the number
of non-empty buckets ("pop") and chain length distribution for
non-empty buckets, given 7 billion bigints as input.

method | nbatch | nbucket |    pop |  min |  max |  avg | stddev
-------+--------+---------+--------+------+------+------+--------
rehash |   4096 | 1048576 |    256 | 6212 | 7001 | 6674 |    110
steal  |   4096 | 1048576 | 659574 |    1 |   17 |  2.5 |    1.5
steal  |   4096 | 4194304 | 659574 |    1 |   17 |  2.5 |    1.5
steal  |   8192 | 4194304 | 329685 |    1 |   17 |  2.5 |    1.5

The rehash numbers are obviously catastrophic in this case, and that's
before we've even run out of hash bits.  The steal numbers show that
you just waste memory on empty buckets when hash bits get thin on the
ground (the last two lines in the table).  We could perhaps consider
automatically lowering nbucket in this case too, since you can't
really have more than 2^32 virtual buckets, so pretending you can just
wastes array memory.

So here's a draft steal-the-bucket-bits patch.  Yeah,
reverse_bits_u32() may be in the wrong header.  But is it too slow?
On my desktop I can call ExecGetBucketAndBatch() 353 million times per
second (~2.8ns), and unpatched gets me 656 million/sec (~1.5ns)
(though that includes a function call, and when Hash does it it's
inlined), but it's only slower when nbatch > 1 due to the condition.
To put that number into perspective, I can hash 10 million single-int
tuples from a prewarmed seq scan in 2.5s without batching or
parallelism, so that's 250ns per tuple.  This'd be +0.4% of that, and
I do see it in a few more samples with my profiler, but it's still
basically nothing, and lost in the noise with other noisy partitioning
overheads like IO.  Thoughts?

[1] Working: https://gist.github.com/macdice/b7eb7015846b8384972c2c7cfd6d2c22
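
For anyone skimming, the batch selection in the attached draft amounts
to roughly the following (my summary of the approach described above,
not the patch text itself):

    /* Rough shape of the steal-the-bucket-bits approach: buckets keep
     * the low bits, batches take bits from the high end via bit
     * reversal, and the two ranges only start to overlap once we have
     * genuinely run out of hash bits. */
    *bucketno = hashvalue & (nbuckets - 1);
    if (nbatch > 1)
        *batchno = reverse_bits_u32(hashvalue) & (nbatch - 1);
    else
        *batchno = 0;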

0001-Improve-batch-number-algorithm-for-large-hash-joins.patch (5K) Download Attachment

Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

Alvaro Herrera-9
Hello

I can make no useful comment on the correctness of the new bit
distribution.  But I can make two comments on this part:

On 2019-Nov-21, Thomas Munro wrote:

> So here's a draft steal-the-bucket-bits patch.  Yeah,
> reverse_bits_u32() may be in the wrong header.

I think pg_bitutils.h is the right place in master, but that file didn't
exist earlier.  I think access/hash.h is the right place in older
branches, which is where hash_any() used to be.

> But is it too slow?
> On my desktop I can call ExecGetBucketAndBatch() 353 million times per
> second (~2.8ns), and unpatched gets me 656 million/sec (~1.5ns)
> (though that includes a function call, and when Hash does it it's
> inlined), but it's only slower when nbatch > 1 due to the condition.

If the whole query takes hours (or even minutes) to run, then adding one
second of runtime is not going to change things in any noticeable way.
I'd rather have the extreme cases working and take additional 1.3ns per
output row, than not work at all.

> To put that number into perspective, I can hash 10 million single-int
> tuples from a prewarmed seq scan in 2.5s without batching or
> parallelism, so that's 250ns per tuple.  This'd be +0.4% of that, and
> I do see it in a few more samples with my profiler, but it's still
> basically nothing, and lost in the noise with other noisy partitioning
> overheads like IO.  Thoughts?

That summarizes it well for me: yes, it's a slowdown, yes it's barely
measurable.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

Tomas Vondra-4
On Tue, Dec 03, 2019 at 02:07:04PM -0300, Alvaro Herrera wrote:

>Hello
>
>I can make no useful comment on the correctness of the new bit
>distribution.  But I can make two comments on this part:
>
>On 2019-Nov-21, Thomas Munro wrote:
>
>> So here's a draft steal-the-bucket-bits patch.  Yeah,
>> reverse_bits_u32() may be in the wrong header.
>
>I think pg_bitutils.h is the right place in master, but that file didn't
>exist earlier.  I think access/hash.h is the right place in older
>branches, which is where hash_any() used to be.
>
>> But is it too slow?
>> On my desktop I can call ExecGetBucketAndBatch() 353 million times per
>> second (~2.8ns), and unpatched gets me 656 million/sec (~1.5ns)
>> (though that includes a function call, and when Hash does it it's
>> inlined), but it's only slower when nbatch > 1 due to the condition.
>
>If the whole query takes hours (or even minutes) to run, then adding one
>second of runtime is not going to change things in any noticeable way.
>I'd rather have the extreme cases working and take additional 1.3ns per
>output row, than not work at all.
>
>> To put that number into perspective, I can hash 10 million single-int
>> tuples from a prewarmed seq scan in 2.5s without batching or
>> parallelism, so that's 250ns per tuple.  This'd be +0.4% of that, and
>> I do see it in a few more samples with my profiler, but it's still
>> basically nothing, and lost in the noise with other noisy partitioning
>> overheads like IO.  Thoughts?
>
>That summarizes it well for me: yes, it's a slowdown, yes it's barely
>measurable.
>

OK, I did a bit of testing with this patch today, and I can confirm it
does indeed help quite a bit. I only used a smaller data set with 4.5B
rows, but it was enough to trigger the issue - it allocated ~1TB of temp
space, and then crashed.

The annoying thing is that it's the workers that crash, and the leader
failed to notice that, so it was waiting in WaitForParallelWorkersToExit
forever. Not sure what the issue is.

Anyway, with the patch applied, the query completed in ~2.2 hours,
after allocating ~235GB of temp files. So that's good, the patch clearly
improves the situation quite a bit.

As for the performance impact, I did this:

   create table dim (id int, val text);
   insert into dim select i, md5(i::text) from generate_series(1,1000000) s(i);

   create table fact (id int, val text);
   insert into fact select mod(i,1000000)+1, md5(i::text) from generate_series(1,25000000) s(i);

   set max_parallel_workers_per_gather = 0;
   select count(*) from fact join dim using (id);

So a perfectly regular join between 1M and 25M table. On my machine,
this takes ~8851ms on master and 8979ms with the patch (average of about
20 runs with minimal variability). That's ~1.4% regression, so a bit
more than the 0.4% mentioned before. Not a huge difference though, and
some of it might be due to different binary layout etc.

It's probably still a good idea to do this, although I wonder if we
might start doing this only when actually running out of bits (and use
the current algorithm by default). But I have no idea if that would be
any cheaper, and it would add complexity.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

Alvaro Herrera-9
On 2019-Dec-10, Tomas Vondra wrote:

> It's probably still a good idea to do this, although I wonder if we
> might start doing this only when actually running out of bits (and use
> the current algorithm by default). But I have no idea if that would be
> any cheaper, and it would add complexity.

I'm under the impression that this patch is proposed for backpatching,
and that in master we can simply increase the hash width.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

Tomas Vondra-4
On Tue, Dec 10, 2019 at 02:35:16PM -0300, Alvaro Herrera wrote:

>On 2019-Dec-10, Tomas Vondra wrote:
>
>> It's probably still a good idea to do this, although I wonder if we
>> might start doing this only when actually running out of bits (and use
>> the current algorithm by default). But I have no idea if that would be
>> any cheaper, and it would add complexity.
>
>I'm under the impression that this patch is proposed for backpatching,
>and that in master we can simply increase the hash width.
>

Possibly, although AFAIK we don't have that patch yet, and I don't
recall any measurements (so it might have overhead too, not sure). But
it's true additional complexity might complicate backpatching.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

