BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash


BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

PG Bug reporting form
The following bug has been logged on the website:

Bug reference:      16104
Logged by:          James Coleman
Email address:      [hidden email]
PostgreSQL version: 11.5
Operating system:   Debian
Description:        

We have a query that, after a recent logical migration to 11.5, ends up with
a parallel hash join (I don't think the query plan/query itself is important
here, but if needed after the rest of the explanation, I can try to redact
it for posting). The query results in this error:

ERROR:  invalid DSA memory alloc request size 1375731712

(the size changes sometimes significantly, but always over a GB)

At first glance it sounded eerily similar to this report which preceded the
final release of 11.0:
https://www.postgresql.org/message-id/flat/CAEepm%3D1x48j0P5gwDUXyo6c9xRx0t_57UjVaz6X98fEyN-mQ4A%40mail.gmail.com#465f3a61bea2719bc4a7102541326dde
but I confirmed that the patch for that bug was applied and is in 11.5 (and
earlier).

We managed to reproduce this on a replica, and so were able to attach gdb in
production to capture a backtrace:

#0  errfinish (dummy=dummy@entry=0) at
./build/../src/backend/utils/error/elog.c:423
#1  0x000055a7c0a00f79 in elog_finish (elevel=elevel@entry=20,
fmt=fmt@entry=0x55a7c0babc18 "invalid DSA memory alloc request size %zu") at
./build/../src/backend/utils/error/elog.c:1385
#2  0x000055a7c0a2308b in dsa_allocate_extended (area=0x55a7c1d6aa38,
size=1140850688, flags=flags@entry=4) at
./build/../src/backend/utils/mmgr/dsa.c:677
#3  0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches
(hashtable=hashtable@entry=0x55a7c1db2740, nbatch=nbatch@entry=2097152) at
./build/../src/backend/executor/nodeHash.c:2889
#4  0x000055a7c079e5f9 in ExecParallelHashIncreaseNumBatches
(hashtable=0x55a7c1db2740) at
./build/../src/backend/executor/nodeHash.c:1122
#5  0x000055a7c079ef6e in ExecParallelHashTuplePrealloc (size=56,
batchno=<optimized out>, hashtable=0x55a7c1db2740) at
./build/../src/backend/executor/nodeHash.c:3283
#6  ExecParallelHashTableInsert (hashtable=hashtable@entry=0x55a7c1db2740,
slot=slot@entry=0x55a7c1dadc90, hashvalue=<optimized out>) at
./build/../src/backend/executor/nodeHash.c:1716
#7  0x000055a7c079f17f in MultiExecParallelHash (node=0x55a7c1dacb78) at
./build/../src/backend/executor/nodeHash.c:288
#8  MultiExecHash (node=node@entry=0x55a7c1dacb78) at
./build/../src/backend/executor/nodeHash.c:112
#9  0x000055a7c078c40c in MultiExecProcNode (node=node@entry=0x55a7c1dacb78)
at ./build/../src/backend/executor/execProcnode.c:501
#10 0x000055a7c07a07d5 in ExecHashJoinImpl (parallel=true, pstate=<optimized
out>) at ./build/../src/backend/executor/nodeHashjoin.c:290
#11 ExecParallelHashJoin (pstate=<optimized out>) at
./build/../src/backend/executor/nodeHashjoin.c:581
#12 0x000055a7c078bdd9 in ExecProcNodeInstr (node=0x55a7c1d7b018) at
./build/../src/backend/executor/execProcnode.c:461
#13 0x000055a7c079f142 in ExecProcNode (node=0x55a7c1d7b018) at
./build/../src/include/executor/executor.h:251
#14 MultiExecParallelHash (node=0x55a7c1d759d0) at
./build/../src/backend/executor/nodeHash.c:281
#15 MultiExecHash (node=node@entry=0x55a7c1d759d0) at
./build/../src/backend/executor/nodeHash.c:112
#16 0x000055a7c078c40c in MultiExecProcNode (node=node@entry=0x55a7c1d759d0)
at ./build/../src/backend/executor/execProcnode.c:501
#17 0x000055a7c07a07d5 in ExecHashJoinImpl (parallel=true, pstate=<optimized
out>) at ./build/../src/backend/executor/nodeHashjoin.c:290
#18 ExecParallelHashJoin (pstate=<optimized out>) at
./build/../src/backend/executor/nodeHashjoin.c:581
#19 0x000055a7c078bdd9 in ExecProcNodeInstr (node=0x55a7c1d74e60) at
./build/../src/backend/executor/execProcnode.c:461
#20 0x000055a7c0784303 in ExecProcNode (node=0x55a7c1d74e60) at
./build/../src/include/executor/executor.h:251
#21 ExecutePlan (execute_once=<optimized out>, dest=0x55a7c1d0be00,
direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>,
operation=CMD_SELECT, use_parallel_mode=<optimized out>,
planstate=0x55a7c1d74e60, estate=0x55a7c1d74b70) at
./build/../src/backend/executor/execMain.c:1640
#22 standard_ExecutorRun (queryDesc=0x55a7c1d5dcd0, direction=<optimized
out>, count=0, execute_once=<optimized out>) at
./build/../src/backend/executor/execMain.c:369
#23 0x00007f4b8b9ace85 in pgss_ExecutorRun (queryDesc=0x55a7c1d5dcd0,
direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at
./build/../contrib/pg_stat_statements/pg_stat_statements.c:892
#24 0x000055a7c07893d1 in ParallelQueryMain (seg=0x55a7c1caa648,
toc=<optimized out>) at
./build/../src/backend/executor/execParallel.c:1401
#25 0x000055a7c064ee64 in ParallelWorkerMain (main_arg=<optimized out>) at
./build/../src/backend/access/transam/parallel.c:1409
#26 0x000055a7c08568ed in StartBackgroundWorker () at
./build/../src/backend/postmaster/bgworker.c:834
#27 0x000055a7c08637b5 in do_start_bgworker (rw=0x55a7c1c98200) at
./build/../src/backend/postmaster/postmaster.c:5722
#28 maybe_start_bgworkers () at
./build/../src/backend/postmaster/postmaster.c:5935
#29 0x000055a7c0864355 in sigusr1_handler (postgres_signal_arg=<optimized
out>) at ./build/../src/backend/postmaster/postmaster.c:5096
#30 <signal handler called>
#31 0x00007f4b915895e3 in select () from /lib/x86_64-linux-gnu/libc.so.6
#32 0x000055a7c05d8b5d in ServerLoop () at
./build/../src/backend/postmaster/postmaster.c:1671
#33 0x000055a7c08654f1 in PostmasterMain (argc=5, argv=0x55a7c1c73e50) at
./build/../src/backend/postmaster/postmaster.c:1380
#34 0x000055a7c05dac34 in main (argc=5, argv=0x55a7c1c73e50) at
./build/../src/backend/main/main.c:228

From what I can tell, src/backend/executor/nodeHash.c:2888 (looking at
the 11.5 release tag) is another entry point into the same kind of
problem that was guarded against in the patch I mentioned earlier,
i.e. another way parallel hash nodes can end up attempting to
allocate more memory than is allowed.

Thanks,
James Coleman


Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

Tomas Vondra
On Fri, Nov 08, 2019 at 09:52:16PM +0000, PG Bug reporting form wrote:

>The following bug has been logged on the website:
>
>Bug reference:      16104
>Logged by:          James Coleman
>Email address:      [hidden email]
>PostgreSQL version: 11.5
>Operating system:   Debian
>Description:
>
>We have a query that, after a recent logical migration to 11.5, ends up with
>a parallel hash join (I don't think the query plan/query itself is important
>here, but if needed after the rest of the explanation, I can try to redact
>it for posting). The query results in this error:
>
>ERROR:  invalid DSA memory alloc request size 1375731712
>
>(the size changes sometimes significantly, but always over a GB)
>
>At first glance it sounded eerily similar to this report which preceded the
>final release of 11.0:
>https://www.postgresql.org/message-id/flat/CAEepm%3D1x48j0P5gwDUXyo6c9xRx0t_57UjVaz6X98fEyN-mQ4A%40mail.gmail.com#465f3a61bea2719bc4a7102541326dde
>but I confirmed that the patch for that bug was applied and is in 11.5 (and
>earlier).
>
>We managed to reproduce this on a replica, and so were able to attach gdb in
>production to capture a backtrace:
>
>#0  errfinish (dummy=dummy@entry=0) at
>./build/../src/backend/utils/error/elog.c:423
>#1  0x000055a7c0a00f79 in elog_finish (elevel=elevel@entry=20,
>fmt=fmt@entry=0x55a7c0babc18 "invalid DSA memory alloc request size %zu") at
>./build/../src/backend/utils/error/elog.c:1385
>#2  0x000055a7c0a2308b in dsa_allocate_extended (area=0x55a7c1d6aa38,
>size=1140850688, flags=flags@entry=4) at
>./build/../src/backend/utils/mmgr/dsa.c:677
>#3  0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches
>(hashtable=hashtable@entry=0x55a7c1db2740, nbatch=nbatch@entry=2097152) at
>./build/../src/backend/executor/nodeHash.c:2889
> ...

I've briefly looked at this today, and I think the root cause is
somewhat similar to what is described in [1] where we simply increase
the number of batches in an effort to keep batch contents in work_mem,
but ignoring that each batch requires quite a bit of memory. So we end
up with a lot of batches where each is small enough to fit into
work_mem, but we need much more than work_mem to track the batches.

This seems to be about the same problem, except that instead of
forgetting about BufFile, the parallel hash join ignores this:

   pstate->batches =
     dsa_allocate0(hashtable->area,
                   EstimateParallelHashJoinBatch(hashtable) * nbatch);

Looking at the backtrace, you ended up with nbatch=2097152 and it failed
trying to allocate 1140850688 bytes, i.e. exactly 544B per batch.

The EstimateParallelHashJoinBatch combines a bunch of structs and does
depend on the number of participants, and by my rough calculation 544B
means ~13 participants. Any idea if that matches the number of parallel
workers for this query?
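
To put rough numbers on it (a back-of-the-envelope sketch; the ~1GB figure
mirrors the MaxAllocSize limit the error is complaining about, if I
remember right):

    #include <stdio.h>

    int
    main(void)
    {
        size_t  request = 1140850688;    /* failing dsa_allocate0() request, frame #2 */
        size_t  nbatch = 2097152;        /* nbatch from frame #3 */
        size_t  max_alloc = 0x3fffffff;  /* ~1GB; mirrors MaxAllocSize */

        printf("bytes per batch: %zu\n", request / nbatch);    /* 544 */
        printf("request exceeds the cap by %zu bytes\n",
               request - max_alloc);                           /* ~64MB over */
        return 0;
    }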

Can you try running the query with fewer workers per gather?

IMHO this has the same issue as the non-parallel case described in [1]
in that we don't really consider this memory when planning the query,
which means we may accept hash join for queries where we know the number
of batches will be too high during planning. We need to reject parallel
hash join in those cases.

Of course, I don't know if this is such a case - perhaps this is a case of
underestimation and we only learn the actual number while executing the
query, at which point it's too late to reject PHJ. I suppose the right
solution in that case would be some form of BNL, as discussed in [2], but
I'm not sure how far we are from having that, or if that works for PHJ.


regards

[1] https://www.postgresql.org/message-id/flat/20190504003414.bulcbnge3rhwhcsh@development

[2] https://www.postgresql.org/message-id/CA%2BhUKGKWWmf%3DWELLG%3DaUGbcugRaSQbtm0tKYiBut-B2rVKX63g%40mail.gmail.com

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

James Coleman
On Fri, Nov 8, 2019 at 6:30 PM Tomas Vondra
<[hidden email]> wrote:

>
> On Fri, Nov 08, 2019 at 09:52:16PM +0000, PG Bug reporting form wrote:
> >The following bug has been logged on the website:
> >
> >Bug reference:      16104
> >Logged by:          James Coleman
> >Email address:      [hidden email]
> >PostgreSQL version: 11.5
> >Operating system:   Debian
> >Description:
> >
> >We have a query that, after a recent logical migration to 11.5, ends up with
> >a parallel hash join (I don't think the query plan/query itself is important
> >here, but if needed after the rest of the explanation, I can try to redact
> >it for posting). The query results in this error:
> >
> >ERROR:  invalid DSA memory alloc request size 1375731712
> >
> >(the size changes sometimes significantly, but always over a GB)
> >
> >At first glance it sounded eerily similar to this report which preceded the
> >final release of 11.0:
> >https://www.postgresql.org/message-id/flat/CAEepm%3D1x48j0P5gwDUXyo6c9xRx0t_57UjVaz6X98fEyN-mQ4A%40mail.gmail.com#465f3a61bea2719bc4a7102541326dde
> >but I confirmed that the patch for that bug was applied and is in 11.5 (and
> >earlier).
> >
> >We managed to reproduce this on a replica, and so were able to attach gdb in
> >production to capture a backtrace:
> >
> >#0  errfinish (dummy=dummy@entry=0) at
> >./build/../src/backend/utils/error/elog.c:423
> >#1  0x000055a7c0a00f79 in elog_finish (elevel=elevel@entry=20,
> >fmt=fmt@entry=0x55a7c0babc18 "invalid DSA memory alloc request size %zu") at
> >./build/../src/backend/utils/error/elog.c:1385
> >#2  0x000055a7c0a2308b in dsa_allocate_extended (area=0x55a7c1d6aa38,
> >size=1140850688, flags=flags@entry=4) at
> >./build/../src/backend/utils/mmgr/dsa.c:677
> >#3  0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches
> >(hashtable=hashtable@entry=0x55a7c1db2740, nbatch=nbatch@entry=2097152) at
> >./build/../src/backend/executor/nodeHash.c:2889
> > ...
>
> I've briefly looked at this today, and I think the root cause is
> somewhat similar to what is described in [1] where we simply increase
> the number of batches in an effort to keep batch contents in work_mem,
> but ignoring that each batch requires quite a bit of memory. So we end
> up with a lot of batches where each is small enough to fit into
> work_mem, but we need much more than work_mem to track the batches.
>
> This seems to be about the same problem, except that instead of
> forgeting about BufFile, the parallel hash join ignores this:
>
>    pstate->batches =
>      dsa_allocate0(hashtable->area,
>                    EstimateParallelHashJoinBatch(hashtable) * nbatch);
>
> Looking at the backtrace, you ended up with nbatch=2097152 and it failed
> trying to allocate 1140850688 bytes, i.e. exactly 544B per batch.
>
> The EstimateParallelHashJoinBatch combines a bunch of structs and does
> depend on the number of participants, and by my rough calculation 544B
> means ~13 participants. Any idea if that matches the number of parallel
> workers for this query?

We currently have max_parallel_workers_per_gather = 6, which is where
the repro happened with the backtrace. We initially had the problem
with max_parallel_workers_per_gather = 2. I also just confirmed an
explain shows Workers Planned: 6.

> Can you try running the query with fewer workers per gather?

Since it failed initially with 2, I think we've effectively tested
this already (I assume that's a significant enough change to be
expected to solve it).

> IMHO this has the same issue as the non-parallel case described in [1]
> in that we don't really consider this memory when planning the query,
> which means we may accept hash join for queries where we know the number
> of batches will be too high during planning. We need to reject parallel
> hash join in those cases.
>
> Of course, I don't know if this is such case - perhaps this is a case of
> underestimation and we only learn the actual number while executing the
> query, at which point it's too late to reject PHJ. I suppose the right
> solution in that case would be some form of BNL, as discussed in [2] but
> I'm not sure how far are we from having that, or if that works for PHJ.
>
>
> regards
>
> [1] https://www.postgresql.org/message-id/flat/20190504003414.bulcbnge3rhwhcsh@development
>
> [2] https://www.postgresql.org/message-id/CA%2BhUKGKWWmf%3DWELLG%3DaUGbcugRaSQbtm0tKYiBut-B2rVKX63g%40mail.gmail.com
>
> --
> Tomas Vondra                  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

Thomas Munro
On Sat, Nov 9, 2019 at 1:23 PM James Coleman <[hidden email]> wrote:
> On Fri, Nov 8, 2019 at 6:30 PM Tomas Vondra <[hidden email]> wrote:
> > On Fri, Nov 08, 2019 at 09:52:16PM +0000, PG Bug reporting form wrote:
> > >ERROR:  invalid DSA memory alloc request size 1375731712

> > >#3  0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches
> > >(hashtable=hashtable@entry=0x55a7c1db2740, nbatch=nbatch@entry=2097152) at

> > I've briefly looked at this today, and I think the root cause is
> > somewhat similar to what is described in [1] where we simply increase
> > the number of batches in an effort to keep batch contents in work_mem,
> > but ignoring that each batch requires quite a bit of memory. So we end
> > up with a lot of batches where each is small enough to fit into
> > work_mem, but we need much more than work_mem to track the batches.

Yeah.  So even when this is fixed, the query is going to perform
*terribly*, opening and closing millions of files in random order to
stream tuples into, if this is a case where there really are tuples to
go to all partitions (and not just a case of extreme skew that our
extreme skew detector fails to detect because it only detects absolute
extreme skew).
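
To get a feel for the scale (a rough upper bound only; the participant
count is an assumption, and in practice only batch files that actually
receive tuples get created):

    #include <stdio.h>

    int
    main(void)
    {
        long    nbatch = 2097152;   /* from the backtrace */
        long    participants = 7;   /* assuming 6 workers + leader */
        long    sides = 2;          /* inner and outer side of each batch */

        /* each participant writes its own shared tuplestore file per batch and side */
        printf("up to ~%ld temp files\n", nbatch * participants * sides);
        return 0;
    }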

> > This seems to be about the same problem, except that instead of
> > forgeting about BufFile, the parallel hash join ignores this:
> >
> >    pstate->batches =
> >      dsa_allocate0(hashtable->area,
> >                    EstimateParallelHashJoinBatch(hashtable) * nbatch);

Yeah, I failed to consider that possibility.  I suppose it could be
avoided with something like this (not tested, I will find a repro for
this on Monday to convince myself that it's right):

@@ -1246,7 +1246,10 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
                                }

                                /* Don't keep growing if it's not helping or we'd overflow. */
-                               if (extreme_skew_detected || hashtable->nbatch >= INT_MAX / 2)
+                               if (extreme_skew_detected ||
+                                       hashtable->nbatch >= INT_MAX / 2 ||
+                                       !AllocSizeIsValid(EstimateParallelHashJoinBatch(hashtable) *
+                                                         hashtable->nbatch * 2))
                                        pstate->growth = PHJ_GROWTH_DISABLED;
                                else if (space_exhausted)
                                        pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
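
For context, AllocSizeIsValid() tests against the ordinary ~1GB request
limit; from memory, the relevant definitions in
src/include/utils/memutils.h look roughly like this:

    /* roughly as defined in src/include/utils/memutils.h */
    #define MaxAllocSize        ((Size) 0x3fffffff)     /* 1 gigabyte - 1 */

    #define AllocSizeIsValid(size)  ((Size) (size) <= MaxAllocSize)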

But James's query is still going to be terrible.

Do you know if it's extreme skew (many tuples with the same key, just
a few scattered around in other keys), or simply too much data for
your work_mem setting?



Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

James Coleman
On Fri, Nov 8, 2019 at 8:12 PM Thomas Munro <[hidden email]> wrote:

>
> On Sat, Nov 9, 2019 at 1:23 PM James Coleman <[hidden email]> wrote:
> > On Fri, Nov 8, 2019 at 6:30 PM Tomas Vondra <[hidden email]> wrote:
> > > On Fri, Nov 08, 2019 at 09:52:16PM +0000, PG Bug reporting form wrote:
> > > >ERROR:  invalid DSA memory alloc request size 1375731712
>
> > > >#3  0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches
> > > >(hashtable=hashtable@entry=0x55a7c1db2740, nbatch=nbatch@entry=2097152) at
>
> > > I've briefly looked at this today, and I think the root cause is
> > > somewhat similar to what is described in [1] where we simply increase
> > > the number of batches in an effort to keep batch contents in work_mem,
> > > but ignoring that each batch requires quite a bit of memory. So we end
> > > up with a lot of batches where each is small enough to fit into
> > > work_mem, but we need much more than work_mem to track the batches.
>
> Yeah.  So even when this is fixed, the query is going to perform
> *terribly*, opening and closing millions of files in random order to
> stream tuples into, if this is case where there really are tuples to
> go to all partitions (and not just a case of extreme skew that our
> extreme skew detector fails to detect because it only detects absolute
> extreme skew).

work_mem in the repro case is 500MB (the original failure was at
150MB). I realize that's too small for this query, though it's also
worth knowing that if I get rid of some other cluster-wide tunings
that shouldn't have been cluster-wide originally (modifications to
cpu_*_cost), the seq scan on a TB+ table feeding the hash turns into
an index scan and no hash (and performs much better).

I think this also correlates with us seeing a ~1TB spike in disk usage,
so your explanation of lots of "small" files would seem to be
consistent with that.

> > > This seems to be about the same problem, except that instead of
> > > forgeting about BufFile, the parallel hash join ignores this:
> > >
> > >    pstate->batches =
> > >      dsa_allocate0(hashtable->area,
> > >                    EstimateParallelHashJoinBatch(hashtable) * nbatch);
>
> Yeah, I failed to consider that possibility.  I suppose it could be
> avoided with something like this (not tested, I will find a repro for
> this on Monday to convince myself that it's right):
>
> @@ -1246,7 +1246,10 @@
> ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
>                                 }
>
>                                 /* Don't keep growing if it's not
> helping or we'd overflow. */
> -                               if (extreme_skew_detected ||
> hashtable->nbatch >= INT_MAX / 2)
> +                               if (extreme_skew_detected ||
> +                                       hashtable->nbatch >= INT_MAX / 2 ||
> +
> !AllocSizeIsValid(EstimateParallelHashJoinBatch(hashtable) *
> +
>    hashtable->nbatch * 2))
>                                         pstate->growth = PHJ_GROWTH_DISABLED;
>                                 else if (space_exhausted)
>                                         pstate->growth =
> PHJ_GROWTH_NEED_MORE_BATCHES;
>
> But James's query is still going to be terrible.
>
> Do you know if it's extreme skew (many tuples with the same key, just
> a few scattered around in other keys), or simply too much data for
> your work_mem setting?

Given my description earlier (seq scan on a very large table), I
assume it's likely the latter? If you think that's sufficiently
likely, I'll leave it at that; if not, I could do a calculation on
that key to see how distributed it is.

James



Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

Tomas Vondra
On Fri, Nov 08, 2019 at 09:10:13PM -0500, James Coleman wrote:

>On Fri, Nov 8, 2019 at 8:12 PM Thomas Munro <[hidden email]> wrote:
>>
>> On Sat, Nov 9, 2019 at 1:23 PM James Coleman <[hidden email]> wrote:
>> > On Fri, Nov 8, 2019 at 6:30 PM Tomas Vondra <[hidden email]> wrote:
>> > > On Fri, Nov 08, 2019 at 09:52:16PM +0000, PG Bug reporting form wrote:
>> > > >ERROR:  invalid DSA memory alloc request size 1375731712
>>
>> > > >#3  0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches
>> > > >(hashtable=hashtable@entry=0x55a7c1db2740, nbatch=nbatch@entry=2097152) at
>>
>> > > I've briefly looked at this today, and I think the root cause is
>> > > somewhat similar to what is described in [1] where we simply increase
>> > > the number of batches in an effort to keep batch contents in work_mem,
>> > > but ignoring that each batch requires quite a bit of memory. So we end
>> > > up with a lot of batches where each is small enough to fit into
>> > > work_mem, but we need much more than work_mem to track the batches.
>>
>> Yeah.  So even when this is fixed, the query is going to perform
>> *terribly*, opening and closing millions of files in random order to
>> stream tuples into, if this is case where there really are tuples to
>> go to all partitions (and not just a case of extreme skew that our
>> extreme skew detector fails to detect because it only detects absolute
>> extreme skew).
>
>work_mem in the repro case is 500MB (the original failure was at
>150MB). I realize that's too small for this query, though it's also
>worth knowing that if I get rid of some other cluster-wide tunings
>that shouldn't have been cluster-wide original (modifications to
>cpu_*_cost), the seq scan on a TB+ table feeding the hash turns into
>an index scan and no hash (and performs much better).
>

So is it a case of underestimation? I.e. does the Hash side expect much
less data (rows) than it gets during execution?

>I think this also correlates with us seeing ~TB spike in disk usage,
>so your explanation of the lots of "small" files would seem to be
>consistent with that.

That's consistent with the data. 500MB and nbatch=2097152 is exactly
1TB, and there'll be some additional overhead.

>
>> > > This seems to be about the same problem, except that instead of
>> > > forgeting about BufFile, the parallel hash join ignores this:
>> > >
>> > >    pstate->batches =
>> > >      dsa_allocate0(hashtable->area,
>> > >                    EstimateParallelHashJoinBatch(hashtable) * nbatch);
>>
>> Yeah, I failed to consider that possibility.  I suppose it could be
>> avoided with something like this (not tested, I will find a repro for
>> this on Monday to convince myself that it's right):
>>
>> @@ -1246,7 +1246,10 @@
>> ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
>>                                 }
>>
>>                                 /* Don't keep growing if it's not
>> helping or we'd overflow. */
>> -                               if (extreme_skew_detected ||
>> hashtable->nbatch >= INT_MAX / 2)
>> +                               if (extreme_skew_detected ||
>> +                                       hashtable->nbatch >= INT_MAX / 2 ||
>> +
>> !AllocSizeIsValid(EstimateParallelHashJoinBatch(hashtable) *
>> +
>>    hashtable->nbatch * 2))
>>                                         pstate->growth = PHJ_GROWTH_DISABLED;
>>                                 else if (space_exhausted)
>>                                         pstate->growth =
>> PHJ_GROWTH_NEED_MORE_BATCHES;
>>

Probably, but that's the execution part of it. I think we should also
consider this in ExecChooseHashTableSize, and just don't do PHJ when
it exceeds work_mem from the very beginning.
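
Something along these lines, perhaps (just a sketch to show the shape of
the check; the function name and the per-batch estimate are placeholders,
not a patch):

    /*
     * Hypothetical planner-side check (names and limits are placeholders):
     * if the shared per-batch bookkeeping alone would exceed work_mem, or
     * the ~1GB allocation cap, don't plan a parallel hash join with that
     * many batches in the first place.
     */
    static bool
    phj_batch_overhead_ok(size_t nbatch, size_t per_batch_bytes, size_t work_mem_bytes)
    {
        size_t  overhead = nbatch * per_batch_bytes;

        return overhead <= work_mem_bytes && overhead <= (size_t) 0x3fffffff;
    }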

Once we start executing we probably can't do much better than what you
proposed. In particular it doesn't make sense to limit the space by
work_mem, unless we also tweak that limit because then the batch size
increases arbitrarily.

I think we need to do something about this in PG13 - both while planning
(considering BufFile and SharedTuplestore), and during execution. The
planner part seems fairly simple and independent, and I might have said
before that I'll look into it.

For the executor I think we've agreed the "proper" solution is BNL or
something like that. Not sure how far we are from that, though; I
don't recall any recent updates (but maybe I just missed that, the
pgsql-hackers traffic is pretty insane). I wonder if we should get
something like the "rebalancing" I proposed before, which is not a 100%
fix but may at least reduce the negative impact.

>> But James's query is still going to be terrible.
>>
>> Do you know if it's extreme skew (many tuples with the same key, just
>> a few scattered around in other keys), or simply too much data for
>> your work_mem setting?
>
>Given my description earlier (seq scan on a very large table), I
>assume it's likely the latter? If you think that's sufficiently
>likely, I'll leave it at that, or if not I could do calculation on
>that key to see how distributed it is.
>

It depends on where exactly the seq scan is - is it under the Hash? If so,
how come we even pick a hash join in this case? Hard to say without seeing
the query plan.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

James Coleman
Just realized I accidentally replied only to Tomas.

On Sat, Nov 9, 2019 at 4:55 AM Tomas Vondra
<[hidden email]> wrote:

>
> On Fri, Nov 08, 2019 at 09:10:13PM -0500, James Coleman wrote:
> >On Fri, Nov 8, 2019 at 8:12 PM Thomas Munro <[hidden email]> wrote:
> >>
> >> On Sat, Nov 9, 2019 at 1:23 PM James Coleman <[hidden email]> wrote:
> >> > On Fri, Nov 8, 2019 at 6:30 PM Tomas Vondra <[hidden email]> wrote:
> >> > > On Fri, Nov 08, 2019 at 09:52:16PM +0000, PG Bug reporting form wrote:
> >> > > >ERROR:  invalid DSA memory alloc request size 1375731712
> >>
> >> > > >#3  0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches
> >> > > >(hashtable=hashtable@entry=0x55a7c1db2740, nbatch=nbatch@entry=2097152) at
> >>
> >> > > I've briefly looked at this today, and I think the root cause is
> >> > > somewhat similar to what is described in [1] where we simply increase
> >> > > the number of batches in an effort to keep batch contents in work_mem,
> >> > > but ignoring that each batch requires quite a bit of memory. So we end
> >> > > up with a lot of batches where each is small enough to fit into
> >> > > work_mem, but we need much more than work_mem to track the batches.
> >>
> >> Yeah.  So even when this is fixed, the query is going to perform
> >> *terribly*, opening and closing millions of files in random order to
> >> stream tuples into, if this is case where there really are tuples to
> >> go to all partitions (and not just a case of extreme skew that our
> >> extreme skew detector fails to detect because it only detects absolute
> >> extreme skew).
> >
> >work_mem in the repro case is 500MB (the original failure was at
> >150MB). I realize that's too small for this query, though it's also
> >worth knowing that if I get rid of some other cluster-wide tunings
> >that shouldn't have been cluster-wide original (modifications to
> >cpu_*_cost), the seq scan on a TB+ table feeding the hash turns into
> >an index scan and no hash (and performs much better).
> >
>
> So is it a case of underestimate? I.e. does the Hash side expect much
> less data (rows) than it gets during execution?

I've attached a redacted plan, but the estimate with workers planned =
6 is 986,975,281, which seems quite reasonable to me, as the current
table row count is 5,917,539,491.

> >I think this also correlates with us seeing ~TB spike in disk usage,
> >so your explanation of the lots of "small" files would seem to be
> >consistent with that.
>
> That's consistent with the data. 500MB and nbatch=2097152 is exactly
> 1TB, and there'll be some additional overhead.

Good, that helps make sense of that piece of the puzzle.

> >> > > This seems to be about the same problem, except that instead of
> >> > > forgeting about BufFile, the parallel hash join ignores this:
> >> > >
> >> > >    pstate->batches =
> >> > >      dsa_allocate0(hashtable->area,
> >> > >                    EstimateParallelHashJoinBatch(hashtable) * nbatch);
> >>
> >> Yeah, I failed to consider that possibility.  I suppose it could be
> >> avoided with something like this (not tested, I will find a repro for
> >> this on Monday to convince myself that it's right):
> >>
> >> @@ -1246,7 +1246,10 @@
> >> ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
> >>                                 }
> >>
> >>                                 /* Don't keep growing if it's not
> >> helping or we'd overflow. */
> >> -                               if (extreme_skew_detected ||
> >> hashtable->nbatch >= INT_MAX / 2)
> >> +                               if (extreme_skew_detected ||
> >> +                                       hashtable->nbatch >= INT_MAX / 2 ||
> >> +
> >> !AllocSizeIsValid(EstimateParallelHashJoinBatch(hashtable) *
> >> +
> >>    hashtable->nbatch * 2))
> >>                                         pstate->growth = PHJ_GROWTH_DISABLED;
> >>                                 else if (space_exhausted)
> >>                                         pstate->growth =
> >> PHJ_GROWTH_NEED_MORE_BATCHES;
> >>
>
> Probably, but that's the execution part of it. I think we should also
> consider this in ExecChooseHashTableSize, and just don't do PHJ when
> it exceeds work_mem from the very beginning.
>
> Once we start executing we probably can't do much better than what you
> proposed. In particular it doesn't make sense to limit the space by
> work_mem, unless we also tweak that limit because then the batch size
> increases arbitrarily.
>
> I think we need do something about this in PG13 - both while planning
> (considering BufFile and SharedTuplestore), and during execution. The
> planner part seems fairly simple and independent, and I might have said
> before I'll look into it.
>
> For the executor I think we've agreed the "proper" solution is BNL or
> something like that. Not sure how far are we from that, though, I
> don't recall any recent updates (but maybe I just missed that, the
> pgsql-hackers traffic is pretty insane). I wonder if we should get
> something like the "rebalancing" I proposed before, which is not a 100%
> fix but may at least reduce the negative impact.

Do you happen to have a link to those discussions? I'd be interested in
following along. I also can't say I know what BNL stands for.

> >> But James's query is still going to be terrible.
> >>
> >> Do you know if it's extreme skew (many tuples with the same key, just
> >> a few scattered around in other keys), or simply too much data for
> >> your work_mem setting?
> >
> >Given my description earlier (seq scan on a very large table), I
> >assume it's likely the latter? If you think that's sufficiently
> >likely, I'll leave it at that, or if not I could do calculation on
> >that key to see how distributed it is.
> >
>
> Depends where exactly is the seq scan - is it under the Hash? If yes,
> how come we even pick hash join in this case? Hard to say without seeing
> the query plan.

As noted earlier, I've attached a redacted plan here.

The seq scan is under the parallel hash. Note also that, as hinted at
earlier, this happens with cpu_tuple_cost and cpu_operator_cost both
set to 0.5, which has been a long-standing tweak on this cluster to
affect some plans that are otherwise painfully bad, but really should
be targeted at specific queries. With those reset, everything stays
the same except that the hash join turns into a nested loop, and
instead of a hash on the inner side of the join there's an index scan
(the join key is indexed on that table).

Because of the modified cpu_*_cost gucs we have two ways around this
specific query plan: either reset those gucs or set
enable_parallel_hash = off. But even if it means poor performance, it
seems to me that we wouldn't want to fail the query. I can confirm
that it is in fact capable of completing with this plan even with
work_mem = 150MB. It's slower...but "only" by 7-8x. That timing would
not have been particularly surprising in this batch system, and is also
much closer to "normal" for the non-parallel plan we used to have in
9.6. We were able to see this result on replicas while trying to find
out exactly how to reproduce the error (it seems sometimes it was
right under the boundary needed to raise the error).


James

[Attachment: error_plan_redacted.txt (6K)]

Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

Tomas Vondra
On Sat, Nov 09, 2019 at 08:23:23AM -0500, James Coleman wrote:

>Just realized I accidentally replied only to Tomas.
>
>On Sat, Nov 9, 2019 at 4:55 AM Tomas Vondra
><[hidden email]> wrote:
>>
>> On Fri, Nov 08, 2019 at 09:10:13PM -0500, James Coleman wrote:
>> >On Fri, Nov 8, 2019 at 8:12 PM Thomas Munro <[hidden email]> wrote:
>> >>
>> >> On Sat, Nov 9, 2019 at 1:23 PM James Coleman <[hidden email]> wrote:
>> >> > On Fri, Nov 8, 2019 at 6:30 PM Tomas Vondra <[hidden email]> wrote:
>> >> > > On Fri, Nov 08, 2019 at 09:52:16PM +0000, PG Bug reporting form wrote:
>> >> > > >ERROR:  invalid DSA memory alloc request size 1375731712
>> >>
>> >> > > >#3  0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches
>> >> > > >(hashtable=hashtable@entry=0x55a7c1db2740, nbatch=nbatch@entry=2097152) at
>> >>
>> >> > > I've briefly looked at this today, and I think the root cause is
>> >> > > somewhat similar to what is described in [1] where we simply increase
>> >> > > the number of batches in an effort to keep batch contents in work_mem,
>> >> > > but ignoring that each batch requires quite a bit of memory. So we end
>> >> > > up with a lot of batches where each is small enough to fit into
>> >> > > work_mem, but we need much more than work_mem to track the batches.
>> >>
>> >> Yeah.  So even when this is fixed, the query is going to perform
>> >> *terribly*, opening and closing millions of files in random order to
>> >> stream tuples into, if this is case where there really are tuples to
>> >> go to all partitions (and not just a case of extreme skew that our
>> >> extreme skew detector fails to detect because it only detects absolute
>> >> extreme skew).
>> >
>> >work_mem in the repro case is 500MB (the original failure was at
>> >150MB). I realize that's too small for this query, though it's also
>> >worth knowing that if I get rid of some other cluster-wide tunings
>> >that shouldn't have been cluster-wide original (modifications to
>> >cpu_*_cost), the seq scan on a TB+ table feeding the hash turns into
>> >an index scan and no hash (and performs much better).
>> >
>>
>> So is it a case of underestimate? I.e. does the Hash side expect much
>> less data (rows) than it gets during execution?
>
>I've attached a redacted plan, but the estimate with workers planned =
>6 is 986,975,281, which seems quite reasonable to be as the current
>table count is 5,917,539,491.
>

Hmmm, but the expected row width is only 16B, and with 6M rows that's
only about 90GB. So how come this needs 1TB temporary files? I'm sure
there's a bit of overhead, but 10X seems a bit much.

>> >I think this also correlates with us seeing ~TB spike in disk usage,
>> >so your explanation of the lots of "small" files would seem to be
>> >consistent with that.
>>
>> That's consistent with the data. 500MB and nbatch=2097152 is exactly
>> 1TB, and there'll be some additional overhead.
>
>Good, that helps make sense of that piece of the puzzle.
>
>> >> > > This seems to be about the same problem, except that instead of
>> >> > > forgeting about BufFile, the parallel hash join ignores this:
>> >> > >
>> >> > >    pstate->batches =
>> >> > >      dsa_allocate0(hashtable->area,
>> >> > >                    EstimateParallelHashJoinBatch(hashtable) * nbatch);
>> >>
>> >> Yeah, I failed to consider that possibility.  I suppose it could be
>> >> avoided with something like this (not tested, I will find a repro for
>> >> this on Monday to convince myself that it's right):
>> >>
>> >> @@ -1246,7 +1246,10 @@
>> >> ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
>> >>                                 }
>> >>
>> >>                                 /* Don't keep growing if it's not
>> >> helping or we'd overflow. */
>> >> -                               if (extreme_skew_detected ||
>> >> hashtable->nbatch >= INT_MAX / 2)
>> >> +                               if (extreme_skew_detected ||
>> >> +                                       hashtable->nbatch >= INT_MAX / 2 ||
>> >> +
>> >> !AllocSizeIsValid(EstimateParallelHashJoinBatch(hashtable) *
>> >> +
>> >>    hashtable->nbatch * 2))
>> >>                                         pstate->growth = PHJ_GROWTH_DISABLED;
>> >>                                 else if (space_exhausted)
>> >>                                         pstate->growth =
>> >> PHJ_GROWTH_NEED_MORE_BATCHES;
>> >>
>>
>> Probably, but that's the execution part of it. I think we should also
>> consider this in ExecChooseHashTableSize, and just don't do PHJ when
>> it exceeds work_mem from the very beginning.
>>
>> Once we start executing we probably can't do much better than what you
>> proposed. In particular it doesn't make sense to limit the space by
>> work_mem, unless we also tweak that limit because then the batch size
>> increases arbitrarily.
>>
>> I think we need do something about this in PG13 - both while planning
>> (considering BufFile and SharedTuplestore), and during execution. The
>> planner part seems fairly simple and independent, and I might have said
>> before I'll look into it.
>>
>> For the executor I think we've agreed the "proper" solution is BNL or
>> something like that. Not sure how far are we from that, though, I
>> don't recall any recent updates (but maybe I just missed that, the
>> pgsql-hackers traffic is pretty insane). I wonder if we should get
>> something like the "rebalancing" I proposed before, which is not a 100%
>> fix but may at least reduce the negative impact.
>
>Do you happen have a link to those discussions? I'd be interested in
>following along. I also can't say I know what BNL stands for.
>

I think the two main threads are these two:

1) accounting for memory used for BufFile during hash joins
https://www.postgresql.org/message-id/20190504003414.bulcbnge3rhwhcsh%40development

2) Avoiding hash join batch explosions with extreme skew and weird stats
https://www.postgresql.org/message-id/CA%2BhUKGKWWmf%3DWELLG%3DaUGbcugRaSQbtm0tKYiBut-B2rVKX63g%40mail.gmail.com

BNL means "block nested loop": we split the inner
relation into blocks, and then do a nested loop for (some of) them. This
works as a fallback strategy for hash join - when a batch is too large
to fit into work_mem, we can split it into blocks and do a BNL against
the outer relation.
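
As a toy illustration of the idea (not the executor code and not the
actual proposal in thread 2) above, just the shape of the fallback, with a
tiny fixed budget standing in for work_mem):

    #include <stdio.h>

    /*
     * Toy block nested loop join on integer keys: the inner side is consumed
     * in chunks that fit a fixed "memory" budget, and the outer side is
     * rescanned once per chunk.  A real implementation would hash each chunk
     * instead of scanning it linearly.
     */
    #define BUDGET 4                /* pretend work_mem holds only 4 inner rows */

    int
    main(void)
    {
        int     inner[] = {1, 1, 2, 3, 5, 8, 8, 8, 13};
        int     outer[] = {1, 8, 21};
        int     n_inner = sizeof(inner) / sizeof(inner[0]);
        int     n_outer = sizeof(outer) / sizeof(outer[0]);

        for (int start = 0; start < n_inner; start += BUDGET)
        {
            int     end = start + BUDGET < n_inner ? start + BUDGET : n_inner;

            /* rescan the outer side for this chunk of the inner side */
            for (int o = 0; o < n_outer; o++)
                for (int i = start; i < end; i++)
                    if (inner[i] == outer[o])
                        printf("match on key %d\n", outer[o]);
        }
        return 0;
    }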

>> >> But James's query is still going to be terrible.
>> >>
>> >> Do you know if it's extreme skew (many tuples with the same key, just
>> >> a few scattered around in other keys), or simply too much data for
>> >> your work_mem setting?
>> >
>> >Given my description earlier (seq scan on a very large table), I
>> >assume it's likely the latter? If you think that's sufficiently
>> >likely, I'll leave it at that, or if not I could do calculation on
>> >that key to see how distributed it is.
>> >
>>
>> Depends where exactly is the seq scan - is it under the Hash? If yes,
>> how come we even pick hash join in this case? Hard to say without seeing
>> the query plan.
>
>As noted earlier, I've attached a redacted plan here.
>
>The seq scan is under the parallel hash. Note that also as hinted at
>earlier, this happens with cpu_tuple_cost and cpu_operator_cost both
>set to 0.5, which has been a long-standing tweak on this cluster to
>affect some plans that are otherwise painfully bad, but really should
>be targeted at specific queries. With those reset, everything stays
>the same except that the hash join turns into a nested loop, and
>instead of a hash on the inner side of the join there's an index scan
>(the join key is indexed on that table).
>

I wonder what the bad plans look like and why this fixes them, but it
seems like a separate issue.

>Because of the modified cpu_*_cost gucs we have two ways around this
>specific query plan: either reset those gucs or set
>enable_parallel_hash = off. But even if it means poor performance, it
>seems to me that we wouldn't want to fail the query. I can confirm
>that it is in fact capable of completing with this plan even with
>work_mem = 150MB. It's slower...but "only" by 7-8x. That timing would
>have been some level of unsurprising in this batch system, and also
>much closer to "normal" for the non-parallel plan we used to have in
>9.6. We were able to see this result on replicas while trying to find
>out exactly how to reproduce the error (it seems sometimes it was
>right under the boundary needed to raise the error).
>

It's certainly true that completing a query is preferable to failing.
It does depend on when we can identify that that's the case - the later we
realize the issue, the harder it is to fix. If we notice that during
planning, we can simply disable the hash join, which then forces using a
different join method (and it may actually be quite fast). Once we start
executing, it's way harder.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

James Coleman
On Sat, Nov 9, 2019 at 1:27 PM Tomas Vondra
<[hidden email]> wrote:

>
> On Sat, Nov 09, 2019 at 08:23:23AM -0500, James Coleman wrote:
> >Just realized I accidentally replied only to Tomas.
> >
> >On Sat, Nov 9, 2019 at 4:55 AM Tomas Vondra
> ><[hidden email]> wrote:
> >>
> >> On Fri, Nov 08, 2019 at 09:10:13PM -0500, James Coleman wrote:
> >> >On Fri, Nov 8, 2019 at 8:12 PM Thomas Munro <[hidden email]> wrote:
> >> >>
> >> >> On Sat, Nov 9, 2019 at 1:23 PM James Coleman <[hidden email]> wrote:
> >> >> > On Fri, Nov 8, 2019 at 6:30 PM Tomas Vondra <[hidden email]> wrote:
> >> >> > > On Fri, Nov 08, 2019 at 09:52:16PM +0000, PG Bug reporting form wrote:
> >> >> > > >ERROR:  invalid DSA memory alloc request size 1375731712
> >> >>
> >> >> > > >#3  0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches
> >> >> > > >(hashtable=hashtable@entry=0x55a7c1db2740, nbatch=nbatch@entry=2097152) at
> >> >>
> >> >> > > I've briefly looked at this today, and I think the root cause is
> >> >> > > somewhat similar to what is described in [1] where we simply increase
> >> >> > > the number of batches in an effort to keep batch contents in work_mem,
> >> >> > > but ignoring that each batch requires quite a bit of memory. So we end
> >> >> > > up with a lot of batches where each is small enough to fit into
> >> >> > > work_mem, but we need much more than work_mem to track the batches.
> >> >>
> >> >> Yeah.  So even when this is fixed, the query is going to perform
> >> >> *terribly*, opening and closing millions of files in random order to
> >> >> stream tuples into, if this is case where there really are tuples to
> >> >> go to all partitions (and not just a case of extreme skew that our
> >> >> extreme skew detector fails to detect because it only detects absolute
> >> >> extreme skew).
> >> >
> >> >work_mem in the repro case is 500MB (the original failure was at
> >> >150MB). I realize that's too small for this query, though it's also
> >> >worth knowing that if I get rid of some other cluster-wide tunings
> >> >that shouldn't have been cluster-wide original (modifications to
> >> >cpu_*_cost), the seq scan on a TB+ table feeding the hash turns into
> >> >an index scan and no hash (and performs much better).
> >> >
> >>
> >> So is it a case of underestimate? I.e. does the Hash side expect much
> >> less data (rows) than it gets during execution?
> >
> >I've attached a redacted plan, but the estimate with workers planned =
> >6 is 986,975,281, which seems quite reasonable to be as the current
> >table count is 5,917,539,491.
> >
>
> Hmmm, but the expected row width is only 16B, and with 6M rows that's
> only about 90GB. So how come this needs 1TB temporary files? I'm sure
> there's a bit of overhead, but 10X seems a bit much.

I'm running the query again to confirm this is indeed the one that
triggers that, though I'm reasonably confident it is as I was able to
reproduce the disk spike on a barely used replica box.

I don't suppose there's any way for the row width to be wrong, or to not
include some other variable (e.g., variable-length fields)? The key
value here is of type citext, and is often 8 chars long, but sometimes
much longer (I sampled some that were 34 chars long, for example). I'm
not sure of the actual distribution of those lengths.

> >> >I think this also correlates with us seeing ~TB spike in disk usage,
> >> >so your explanation of the lots of "small" files would seem to be
> >> >consistent with that.
> >>
> >> That's consistent with the data. 500MB and nbatch=2097152 is exactly
> >> 1TB, and there'll be some additional overhead.
> >
> >Good, that helps make sense of that piece of the puzzle.
> >
> >> >> > > This seems to be about the same problem, except that instead of
> >> >> > > forgeting about BufFile, the parallel hash join ignores this:
> >> >> > >
> >> >> > >    pstate->batches =
> >> >> > >      dsa_allocate0(hashtable->area,
> >> >> > >                    EstimateParallelHashJoinBatch(hashtable) * nbatch);
> >> >>
> >> >> Yeah, I failed to consider that possibility.  I suppose it could be
> >> >> avoided with something like this (not tested, I will find a repro for
> >> >> this on Monday to convince myself that it's right):
> >> >>
> >> >> @@ -1246,7 +1246,10 @@
> >> >> ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
> >> >>                                 }
> >> >>
> >> >>                                 /* Don't keep growing if it's not
> >> >> helping or we'd overflow. */
> >> >> -                               if (extreme_skew_detected ||
> >> >> hashtable->nbatch >= INT_MAX / 2)
> >> >> +                               if (extreme_skew_detected ||
> >> >> +                                       hashtable->nbatch >= INT_MAX / 2 ||
> >> >> +
> >> >> !AllocSizeIsValid(EstimateParallelHashJoinBatch(hashtable) *
> >> >> +
> >> >>    hashtable->nbatch * 2))
> >> >>                                         pstate->growth = PHJ_GROWTH_DISABLED;
> >> >>                                 else if (space_exhausted)
> >> >>                                         pstate->growth =
> >> >> PHJ_GROWTH_NEED_MORE_BATCHES;
> >> >>
> >>
> >> Probably, but that's the execution part of it. I think we should also
> >> consider this in ExecChooseHashTableSize, and just don't do PHJ when
> >> it exceeds work_mem from the very beginning.
> >>
> >> Once we start executing we probably can't do much better than what you
> >> proposed. In particular it doesn't make sense to limit the space by
> >> work_mem, unless we also tweak that limit because then the batch size
> >> increases arbitrarily.
> >>
> >> I think we need do something about this in PG13 - both while planning
> >> (considering BufFile and SharedTuplestore), and during execution. The
> >> planner part seems fairly simple and independent, and I might have said
> >> before I'll look into it.
> >>
> >> For the executor I think we've agreed the "proper" solution is BNL or
> >> something like that. Not sure how far are we from that, though, I
> >> don't recall any recent updates (but maybe I just missed that, the
> >> pgsql-hackers traffic is pretty insane). I wonder if we should get
> >> something like the "rebalancing" I proposed before, which is not a 100%
> >> fix but may at least reduce the negative impact.
> >
> >Do you happen have a link to those discussions? I'd be interested in
> >following along. I also can't say I know what BNL stands for.
> >
>
> I think the two main threads are these two:
>
> 1) accounting for memory used for BufFile during hash joins
> https://www.postgresql.org/message-id/20190504003414.bulcbnge3rhwhcsh%40development
>
> 2) Avoiding hash join batch explosions with extreme skew and weird stats
> https://www.postgresql.org/message-id/CA%2BhUKGKWWmf%3DWELLG%3DaUGbcugRaSQbtm0tKYiBut-B2rVKX63g%40mail.gmail.com
>
> The BNL means "block nested loop" which means we split the inner
> relation into blocks, and then do nested loop for (some of) them. This
> works as a fallback strategy for hash join - when a batch is too large
> to fit into work_mem, we can split it into blocks and do a BNL against
> the outer relation.

Thanks.

> >> >> But James's query is still going to be terrible.
> >> >>
> >> >> Do you know if it's extreme skew (many tuples with the same key, just
> >> >> a few scattered around in other keys), or simply too much data for
> >> >> your work_mem setting?
> >> >
> >> >Given my description earlier (seq scan on a very large table), I
> >> >assume it's likely the latter? If you think that's sufficiently
> >> >likely, I'll leave it at that, or if not I could do calculation on
> >> >that key to see how distributed it is.
> >> >
> >>
> >> Depends where exactly is the seq scan - is it under the Hash? If yes,
> >> how come we even pick hash join in this case? Hard to say without seeing
> >> the query plan.
> >
> >As noted earlier, I've attached a redacted plan here.
> >
> >The seq scan is under the parallel hash. Note that also as hinted at
> >earlier, this happens with cpu_tuple_cost and cpu_operator_cost both
> >set to 0.5, which has been a long-standing tweak on this cluster to
> >affect some plans that are otherwise painfully bad, but really should
> >be targeted at specific queries. With those reset, everything stays
> >the same except that the hash join turns into a nested loop, and
> >instead of a hash on the inner side of the join there's an index scan
> >(the join key is indexed on that table).
> >
>
> I wonder what the bad plans look like and why this fixes them, but it
> seems like a separate issue.

I am going to look into those further, but it's been configured this
way for some time, so it's always a bit messy to find all of these
affected queries.

> >Because of the modified cpu_*_cost gucs we have two ways around this
> >specific query plan: either reset those gucs or set
> >enable_parallel_hash = off. But even if it means poor performance, it
> >seems to me that we wouldn't want to fail the query. I can confirm
> >that it is in fact capable of completing with this plan even with
> >work_mem = 150MB. It's slower...but "only" by 7-8x. That timing would
> >have been some level of unsurprising in this batch system, and also
> >much closer to "normal" for the non-parallel plan we used to have in
> >9.6. We were able to see this result on replicas while trying to find
> >out exactly how to reproduce the error (it seems sometimes it was
> >right under the boundary needed to raise the error).
> >
>
> It's certainly true that completing a query is preferrable to failing.
> It does depend when we can identify that's the case - the later we
> realize the issue, the harder it's to fix it. If we notice that during
> planning, we can simply disable the hash join, which then forces using a
> different join method (and it may actually be quite fast). Once we start
> executing, it's way harder.

In this case though the failure in some sense seems fairly artificial.
Aside from the query being slow, there doesn't appear to be any real
limitation on the query completing. The box in question has 768GB of
memory, so limiting this memory structure to 1GB seems artificial.

James



Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

Thomas Munro
On Sun, Nov 10, 2019 at 9:05 AM James Coleman <[hidden email]> wrote:
> In this case though the failure in some sense seems fairly artificial.
> Aside from the query being slow, there doesn't appear to be any real
> limitation on the query completing. The box in question has 768GB of
> memory, so limiting this memory structure to 1GB seems artificial.

Yeah, completely artificial.  We limit ourselves to MaxAllocSize, a
self-imposed limit that applies to palloc() and dsa_allocate(),
probably dating back to a time in history when anyone asking for more
than that must surely be out of their mind.  Then there are a few
places where we allow ourselves to go past that limit by passing in an
extra flag MCXT_ALLOC_HUGE or DSA_ALLOC_HUGE: from a quick grep,
that's the bitmap heap scan page table, and anything using
simplehash.h's default allocator (so I guess that includes hash agg;
that's interesting, we can do a 64GB-bucket-array hash agg but not
hash join).

As noted on https://wiki.postgresql.org/wiki/Hash_Join, there are a
couple of artificial constraints on hash joins: the limit on the
number of hash join buckets which comes entirely from using 'int' as a
data type for bucket numbers (an anachronism from pre-standard C or
maybe just 32 bit-ism: the correct type is surely size_t, which is by
definition big enough for any array that you could possibly address),
and the MaxAllocSize thing.  Easy to fix, of course.  I noticed that
when working on PHJ and decided to keep the same restriction for the
new parallel code paths, because (1) it seemed like a policy choice we
should make and then apply to both, and (2) it does provide some kind
of sanity check, though it's increasingly looking overdue to be
removed (in other words: previously I was only speculating about
people showing up with ~1TB RAM machines and running into these
ancient limits, but ... here we are).

For example, suppose we added the DSA_ALLOC_HUGE flag to the line that
is failing in your case.  Now it would be able to blow through that
1GB limit, but where would it stop?  Until we know how you're reaching
this state, it's hard to know whether it'd go to (say) 2GB, and then
work perfectly, job done, or whether it'd keep going until it ate all
your RAM and made your users really unhappy.
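
Concretely, since dsa_allocate0() is just dsa_allocate_extended() with
DSA_ALLOC_ZERO, the change would look roughly like this (shown only to
make the discussion concrete, not as a recommendation):

    /* current call in ExecParallelHashJoinSetUpBatches, capped at ~1GB: */
    pstate->batches =
        dsa_allocate0(hashtable->area,
                      EstimateParallelHashJoinBatch(hashtable) * nbatch);

    /* hypothetical variant that opts into the "huge" limit: */
    pstate->batches =
        dsa_allocate_extended(hashtable->area,
                              EstimateParallelHashJoinBatch(hashtable) * nbatch,
                              DSA_ALLOC_ZERO | DSA_ALLOC_HUGE);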

I think this must be a case of extreme skew, as complained about
in [1].  Let's see... you have ~6 billion rows, and you said the
planner knew that (it estimated about a billion, when there were 6
workers, so it's in the ball park).  You didn't say how many batches
the planner planned for.  Let's see if I can guess... 256 or 512?
That'd allow for 6 billion * 16 byte rows + some slop, chopped up into
a power-of-two number of partitions that fit inside 500MB.  Then at
execution time, they didn't fit, and we went into
repartition-until-it-fits mode.  At some point we tried to cut them
into ~2 million partitions and hit this error.  That'd be a paltry
3,000 rows per partition if the keys were uniformly distributed, but
it thinks that at least one partition is not fitting into 500MB.
Conclusion: unless this is a carefully crafted hash attack, there must
be one particular key that has more than 500MB worth of rows, but also a
smattering of other keys that fall into the same partition, that are
causing us to keep trying to repartition until it eventually squeezes
all of them in the same direction during a split (this requires
repeatedly splitting partitions until you reach one partition per
tuple!).  So I'm wondering if this would be fixed by, say, a 95%
threshold (instead of 100%) for the extreme skew detector, as I
proposed in a patch in the first email in that thread that later
veered off into the BNL discussion[1].  Unfortunately that patch only
deals with non-parallel HJ, but a PHJ version should be easy.  And if
not by 95%, I wonder what threshold would be needed for this case, and
what kind of maths to use to think about it.  If I wrote a patch like
[1] with PHJ support, would you be able to test it on a copy of your
workload?

[1] https://www.postgresql.org/message-id/CA%2BhUKGKWWmf%3DWELLG%3DaUGbcugRaSQbtm0tKYiBut-B2rVKX63g%40mail.gmail.com



Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

Thomas Munro-5
In reply to this post by Tomas Vondra-4
On Sun, Nov 10, 2019 at 7:27 AM Tomas Vondra
<[hidden email]> wrote:
> Hmmm, but the expected row width is only 16B, and with 6M rows that's
> only about 90GB. So how come this needs 1TB temporary files? I'm sure
> there's a bit of overhead, but 10X seems a bit much.

(s/6M/6B/) Yeah, that comes out to only ~90GB but ... PHJ doesn't
immediately unlink files from the previous generation when it
repartitions.  You need two generations' worth of files (old and
new) while repartitioning, but you don't need the grand-parent
generation.  I didn't think this was a problem because I didn't expect
to have to repartition many times (and there is a similar but
different kind of amplification in the non-parallel code).  If this
problem is due to the 100% extreme skew threshold causing us to go
berserk, then that 10X multiplier is of the right order, if you
imagine this thing started out with ~512 batches and got up to ~1M
batches before it blew a gasket.
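
(A rough model of that multiplier, assuming ~90GB of tuples per
generation, a start at 512 batches, and nothing unlinked along the way:)

    #include <stdio.h>

    int main(void)
    {
        double per_generation_gb = 90.0;  /* ~6B rows * 16 bytes, roughly */
        double total_gb = 0.0;
        int    generations = 0;
        long   nbatch;

        /* every repartitioning step rewrites all the tuples into new files */
        for (nbatch = 512; nbatch <= 1048576; nbatch *= 2)
        {
            total_gb += per_generation_gb;
            generations++;
        }

        /* prints: 12 generations, ~1080 GB of temp files */
        printf("%d generations, ~%.0f GB of temp files\n", generations, total_gb);
        return 0;
    }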



Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

James Coleman
In reply to this post by Thomas Munro-5
On Sat, Nov 9, 2019 at 4:54 PM Thomas Munro <[hidden email]> wrote:

>
> On Sun, Nov 10, 2019 at 9:05 AM James Coleman <[hidden email]> wrote:
> > In this case though the failure in some sense seems fairly artificial.
> > Aside from the query being slow, there doesn't appear to be any real
> > limitation on the query completing. The box in question has 768GB of
> > memory, so limiting this memory structure to 1GB seems artificial.
>
> Yeah, completely artificial.  We limit ourselves to MaxAllocSize, a
> self-imposed limit that applies to palloc() and dsa_allocate(),
> probably dating back to a time in history when anyone asking for more
> than that must surely be out of their mind.  Then there are a few
> places where we allow ourselves to go past that limit by passing in an
> extra flag MCXT_ALLOC_HUGE or DSA_ALLOC_HUGE: from a quick grep,
> that's the bitmap heap scan page table, and anything using
> simplehash.h's default allocator (so I guess that includes hash agg;
> that's interesting, we can do a 64GB-bucket-array hash agg but not
> hash join).
>
> As noted on https://wiki.postgresql.org/wiki/Hash_Join, there are a
> couple of artificial constraints on hash joins: the limit on the
> number of hash join buckets which comes entirely from using 'int' as a
> data type for bucket numbers (an anachronism from pre-standard C or
> maybe just 32 bit-ism: the correct type is surely size_t, which is by
> definition big enough for any array that you could possibly address),
> and the MaxAllocSize thing.  Easy to fix, of course.  I noticed that
> when working on PHJ and decided to keep the same restriction for the
> new parallel code paths, because (1) it seemed like a policy choice we
> should make and then apply to both, and (2) it does provide some kind
> of sanity check, though it's increasingly looking overdue to be
> removed (in other words: previously I was only speculating about
> people showing up with ~1TB RAM machines and running into these
> ancient limits, but ... here we are).

Heh. We've run with some clusters at 2TB of memory for many years, but
I also realize that's not all that common.

> For example, suppose we added the DSA_ALLOC_HUGE flag to the line that
> is failing in your case.  Now it would be able to blow through that
> 1GB limit, but where would it stop?  Until we know how you're reaching
> this state, it's hard to know whether it'd go to (say) 2GB, and then
> work perfectly, job done, or whether it'd keep going until it ate all
> your RAM and made your users really unhappy.

Yeah, I think it's definitely non-obvious exactly what to do. In
general I think preferring not failing queries is good, but on the
other hand that could result in OOM-killing a bunch of other things
too and/or failing other queries because they can't alloc.

> I think this must be a case of extreme skew, as complained about
> in[1].  Let's see... you have ~6 billion rows, and you said the
> planner knew that (it estimated about a billion, when there were 6
> workers, so it's in the ball park).  You didn't say how many batches
> the planner planned for.  Let's see if I can guess... 256 or 512?

I ran a successful one (this run happened to have max parallel workers
per gather set to 8, and used 8 workers) at 150MB work_mem, and the
parallel hash node had these stats:

Buckets: 4194304 (originally 4194304)  Batches: 32768 (originally
4096)  Memory Usage: 344448kB
Buffers: shared hit=38360 read=142422702, temp read=45766294 written=76912272

I'm not sure how to capture the batch information on a failing run,
since you only get it on ANALYZE, but the query doesn't complete so
you can't get that. I do notice in the original bug report stack trace
though that:

#3  0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches
(hashtable=hashtable@entry=0x55a7c1db2740, nbatch=nbatch@entry=2097152) at
./build/../src/backend/executor/nodeHash.c:2889

which I think tells us the number of batches?

That buffers-written figure is ~615GB, which corresponds to one blip on the
disk graphs for this run (that was around 600GB) and the HashAggregate
node at the top of the query (which is only grouping on
items.account_id; see below, but this has significant skew) had:

Buffers: shared hit=608075044 read=142422702, temp read=77222547
written=77284432

which corresponds to the second disk graph blip of similar size.

Side note about explain output: it seems that the hash aggregate node
doesn't output bucket stats. I've also noticed that the nested
loop joins appear to not output the join condition, but hash joins
output the hash condition, and if I remember correctly, merge joins
output the join condition. If I'm right about that...I guess it means
another thing on the list of potential patches. Seems like it'd make
understanding plans easier.

> That'd allow for 6 billion * 16 byte rows + some slop, chopped up into
> a power-of-two number of partitions that fit inside 500MB.  Then at
> execution time, they didn't fit, and we went into
> repartition-until-it-fits mode.  At some point we tried to cut them
> into ~2 million partitions and hit this error.  That'd be a paltry
> 3,000 rows per partition if the keys were uniformly distributed, but
> it thinks that at least one partition is not fitting into 500MB.
> Conclusion: unless this is a carefully crafted hash attack, there must
> be one particular key that has more than 500MB worth of rows, but also a
> smattering of other keys that fall into the same partition, that are
> causing us to keep trying to repartition until it eventually squeezes
> all of them in the same direction during a split (this requires
> repeatedly splitting partitions until you reach one partition per
> tuple!).  So I'm wondering if this would be fixed by, say, a 95%
> threshold (instead of 100%) for the extreme skew detector, as I
> proposed in a patch in the first email in that thread that later
> veered off into the BNL discussion[1].  Unfortunately that patch only
> deals with non-parallel HJ, but a PHJ version should be easy.  And if
> not by 95%, I wonder what threshold would be needed for this case, and
> what kind of maths to use to think about it.

So I should have run the earlier attached plan with VERBOSE, but
here's the interesting thing: the parallel hash node's seq scan node
outputs two columns: let's call them (from the redacted plan)
items.system_id and items.account_id. The first (system_id) is both
not null and unique; the second (account_id) definitely has massive
skew. I'm not very up-to-speed on how the hash building works, but I
would have (perhaps naïvely?) assumed that the first column being
unique would make the hash keys very likely not to collide in any
significantly skewed way. Am I missing something here?

> If I wrote a patch like
> [1] with PHJ support, would you be able to test it on a copy of your
> workload?

I think that would be a significant effort to pull off, but _maybe_
not entirely impossible. Certainly quite difficult though.

> [1] https://www.postgresql.org/message-id/CA%2BhUKGKWWmf%3DWELLG%3DaUGbcugRaSQbtm0tKYiBut-B2rVKX63g%40mail.gmail.com



Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

James Coleman
In reply to this post by Thomas Munro-5
On Sat, Nov 9, 2019 at 6:14 PM Thomas Munro <[hidden email]> wrote:

>
> On Sun, Nov 10, 2019 at 7:27 AM Tomas Vondra
> <[hidden email]> wrote:
> > Hmmm, but the expected row width is only 16B, and with 6M rows that's
> > only about 90GB. So how come this needs 1TB temporary files? I'm sure
> > there's a bit of overhead, but 10X seems a bit much.
>
> (s/6M/6B/) Yeah, that comes out to only ~90GB but ... PHJ doesn't
> immediately unlink files from the previous generation when it
> repartitions.  You need two generations' worth of files (old and
> new) while repartitioning, but you don't need the grand-parent
> generation.  I didn't think this was a problem because I didn't expect
> to have to repartition many times (and there is a similar but
> different kind of amplification in the non-parallel code).  If this
> problem is due to the 100% extreme skew threshold causing us to go
> berserk, then that 10X multiplier is of the right order, if you
> imagine this thing started out with ~512 batches and got up to ~1M
> batches before it blew a gasket.

Are you saying that it also doesn't unlink the grand-parent until the end?



Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

Thomas Munro-5
On Sun, Nov 10, 2019 at 3:25 PM James Coleman <[hidden email]> wrote:

> On Sat, Nov 9, 2019 at 6:14 PM Thomas Munro <[hidden email]> wrote:
> > On Sun, Nov 10, 2019 at 7:27 AM Tomas Vondra
> > <[hidden email]> wrote:
> > > Hmmm, but the expected row width is only 16B, and with 6M rows that's
> > > only about 90GB. So how come this needs 1TB temporary files? I'm sure
> > > there's a bit of overhead, but 10X seems a bit much.
> >
> > (s/6M/6B/) Yeah, that comes out to only ~90GB but ... PHJ doesn't
> > immediately unlink files from the previous generation when it
> > repartitions.  You need two generations' worth of files (old and
> > new) while repartitioning, but you don't need the grand-parent
> > generation.  I didn't think this was a problem because I didn't expect
> > to have to repartition many times (and there is a similar but
> > different kind of amplification in the non-parallel code).  If this
> > problem is due to the 100% extreme skew threshold causing us to go
> > berserk, then that 10X multiplier is of the right order, if you
> > imagine this thing started out with ~512 batches and got up to ~1M
> > batches before it blew a gasket.
>
> Are you saying that it also doesn't unlink the grand-parent until the end?

Yeah, it doesn't unlink anything until the end, which obviously isn't
great in this case.  Might be something to add to my list of patches
to write.



Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

Thomas Munro-5
In reply to this post by James Coleman
On Sun, Nov 10, 2019 at 3:25 PM James Coleman <[hidden email]> wrote:
> So I should have run the earlier attached plan with VERBOSE, but
> here's the interesting thing: the parallel hash node's seq scan node
> outputs two columns: let's call them (from the redacted plan)
> items.system_id and items.account_id. The first (system_id) is both
> not null and unique; the second (account_id) definitely has massive
> skew. I'm not very up-to-speed on how the hash building works, but I
> would have (perhaps naïvely?) assumed that the first column being
> unique would make the hash keys very likely not to collide in any
> significantly skewed way. Am I missing something here?

Hrm.  So the compound key is unique then.  I was assuming up until now
that it had duplicates.  The hashes of the individual keys are
combined (see ExecHashGetHashValue()), so assuming there is nothing
funky about the way citext gets hashed (and it's not jumping out at
me), your unique keys should give you uniform hash values and thus
partition size, and repartitioning should be an effective way of
reducing hash table size.  So now it sounds like you have a simple
case of underestimation, but now I'm confused about how you got a
344MB hash table with work_mem = 150MB:

Buckets: 4194304 (originally 4194304)  Batches: 32768 (originally
4096)  Memory Usage: 344448kB

And I'm confused about what was different when it wanted the crazy
number of batches.
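
(For reference, a standalone sketch of the combination step named above,
in the spirit of ExecHashGetHashValue(); the per-key hash values below are
made up, not the output of hash_any().)

    #include <stdio.h>
    #include <stdint.h>

    /* Rotate the running hash left one bit, then XOR in the next key's hash. */
    static uint32_t combine(uint32_t hashkey, uint32_t hkey)
    {
        hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
        return hashkey ^ hkey;
    }

    int main(void)
    {
        uint32_t hashkey = 0;

        hashkey = combine(hashkey, 0x12345678);   /* hash of first join key */
        hashkey = combine(hashkey, 0x9abcdef0);   /* hash of second join key */

        printf("combined hash: 0x%08x\n", (unsigned) hashkey);
        return 0;
    }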



Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

James Coleman
On Sat, Nov 9, 2019 at 10:44 PM Thomas Munro <[hidden email]> wrote:

>
> On Sun, Nov 10, 2019 at 3:25 PM James Coleman <[hidden email]> wrote:
> > So I should have run the earlier attached plan with VERBOSE, but
> > here's the interesting thing: the parallel hash node's seq scan node
> > outputs two columns: let's call them (from the redacted plan)
> > items.system_id and items.account_id. The first (system_id) is both
> > not null and unique; the second (account_id) definitely has massive
> > skew. I'm not very up-to-speed on how the hash building works, but I
> > would have (perhaps naïvely?) assumed that the first column being
> > unique would make the hash keys very likely not to collide in any
> > significantly skewed way. Am I missing something here?
>
> Hrm.  So the compound key is unique then.  I was assuming up until now
> that it had duplicates.  The hashes of the individual keys are
> combined (see ExecHashGetHashValue()), so assuming there is nothing
> funky about the way citext gets hashed (and it's not jumping out at
> me), your unique keys should give you uniform hash values and thus
> partition size, and repartitioning should be an effective way of
> reducing hash table size.  So now it sounds like you have a simple
> case of underestimation, but now I'm confused about how you got a
> 344MB hash table with work_mem = 150MB:

Looking at the source, citext's hash is a pretty standard call to
hash_any, so I don't see how that would lead to any oddities (barring
an intentional hash collision etc., but this is real data).

Do you have any theories about where the underestimation is happening?
It knows the number of rows reasonably well. The one thing I'm not
sure about yet is the row width = 16. The account_id is a bigint, so
we can assume it knows the size properly there. The system_id being
citext...I think the 8 bytes it has for that is probably a bit low on
average, but I'm not sure yet by how much (I'm going to run a query to
find out). Hmm. Query came back, and average length is just shy of
7.9...so the 16 byte row size is looking pretty good. So I'm at a loss
as to why/what it would be underestimating (does it know about
uniqueness/non-uniqueness? could that be a factor?).

I also don't know why it seems to regularly fail on the primary, but
not on the sync, unless we adjust the work_mem up. I've double-checked
all GUCs and the only differences are things related to replication
that you'd expect to be different on primary/replica.

I know there are some things that execute differently on replicas, so
I assume it's something like that, but I don't know what specifically
would cause this here.

> Buckets: 4194304 (originally 4194304)  Batches: 32768 (originally
> 4096)  Memory Usage: 344448kB
>
> And I'm confused about what was different when it wanted the crazy
> number of batches.

I'm not quite sure how to find out; if you have any ideas, I'd love to
hear them. The one thing I can think of to try is to slowly increase
work_mem (since setting it to 500MB reproduced the error on the
replica) and see if the bucket info starts to trend up.

James



Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

Tomas Vondra-4
On Sun, Nov 10, 2019 at 09:15:06AM -0500, James Coleman wrote:

>On Sat, Nov 9, 2019 at 10:44 PM Thomas Munro <[hidden email]> wrote:
>>
>> On Sun, Nov 10, 2019 at 3:25 PM James Coleman <[hidden email]> wrote:
>> > So I should have run the earlier attached plan with VERBOSE, but
>> > here's the interesting thing: the parallel hash node's seq scan node
>> > outputs two columns: let's call them (from the redacted plan)
>> > items.system_id and items.account_id. The first (system_id) is both
>> > not null and unique; the second (account_id) definitely has massive
>> > skew. I'm not very up-to-speed on how the hash building works, but I
>> > would have (perhaps naïvely?) assumed that the first column being
>> > unique would make the hash keys very likely not to collide in any
>> > significantly skewed way. Am I missing something here?
>>
>> Hrm.  So the compound key is unique then.  I was assuming up until now
>> that it had duplicates.  The hashes of the individual keys are
>> combined (see ExecHashGetHashValue()), so assuming there is nothing
>> funky about the way citext gets hashed (and it's not jumping out at
>> me), your unique keys should give you uniform hash values and thus
>> partition size, and repartitioning should be an effective way of
>> reducing hash table size.  So now it sounds like you have a simple
>> case of underestimation, but now I'm confused about how you got a
>> 344MB hash table with work_mem = 150MB:
>

Why would the compound key matter? The join only does this

   Hash Cond: (group_details.group_number = items.system_id)

so the only thing that really matters when it comes to skew is
system_id, i.e. the citext column. The other column is used by
aggregation, but that's irrelevant here, no?

>Looking at the source, citext's hash is a pretty standard call to
>hash_any, so I don't see how that would lead to any oddities (barring
>an intentional hash collision etc., but this is real data).
>
>Do you have any theories about where the underestimation is happening?
>It knows the number of rows reasonably well. The one thing I'm not
>sure about yet is the row width = 16. The account_id is a bigint, so
>we can assume it knows the size properly there. The system_id being
>citext...I think the 8 bytes it has for that is probably a bit low on
>average, but I'm not sure yet by how much (I'm going to run a query to
>find out). Hmm. Query came back, and average length is just shy of
>7.9...so the 16 byte row size is looking pretty good. So I'm at a loss
>as to why/what it would be underestimating (does it know about
>uniqueness/non-uniqueness? could that be a factor?).
>

I think the easiest thing we can do is to run a couple of queries
collecting useful stats, like

     -- the most frequent system_id values
     select system_id, count(*), sum(length(system_id)) from items
     group by system_id
     order by count(*) desc limit 100;

     -- the largest group by system_id
     select system_id, count(*), sum(length(system_id)) from items
     group by system_id
     order by sum(length(system_id))  desc limit 100;

That should tell us if there's something off.

>I also don't know why it seems to regularly fail on the primary, but
>not on the sync, unless we adjust the work_mem up. I've double-checked
>all GUCs and the only differences are things related to replication
>that you'd expect to be different on primary/replica.
>
>I know there are some things that execute differently on replicas, so
>I assume it's something like that, but I don't know what specifically
>would cause this here.
>

Hmm, I hadn't realized it fails on the primary more often. That's certainly
strange; I don't have a very good idea why that could be happening. Do
the query plans look the same? What does the EXPLAIN ANALYZE output look like?

>> Buckets: 4194304 (originally 4194304)  Batches: 32768 (originally
>> 4096)  Memory Usage: 344448kB
>>
>> And I'm confused about what was different when it wanted the crazy
>> number of batches.
>
>I'm not quite sure how to find out; if you have any ideas, I'd love to
>hear them. The one thing I can think of to try is to slowly increase
>work_mem (since setting it to 500MB reproduced the error on the
>replica) and see if the bucket info starts to trend up.
>

Chances are it actually happened to detect skew and disabled the growth,
hence the table grew above work_mem. We should probably add info about
this (skew, disabling growth) to EXPLAIN ANALYZE verbose mode.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

James Coleman
On Sun, Nov 10, 2019 at 12:29 PM Tomas Vondra
<[hidden email]> wrote:

>
> On Sun, Nov 10, 2019 at 09:15:06AM -0500, James Coleman wrote:
> >On Sat, Nov 9, 2019 at 10:44 PM Thomas Munro <[hidden email]> wrote:
> >>
> >> On Sun, Nov 10, 2019 at 3:25 PM James Coleman <[hidden email]> wrote:
> >> > So I should have run the earlier attached plan with VERBOSE, but
> >> > here's the interesting thing: the parallel hash node's seq scan node
> >> > outputs two columns: let's call them (from the redacted plan)
> >> > items.system_id and items.account_id. The first (system_id) is both
> >> > not null and unique; the second (account_id) definitely has massive
> >> > skew. I'm not very up-to-speed on how the hash building works, but I
> >> > would have (perhaps naïvely?) assumed that the first column being
> >> > unique would make the hash keys very likely not to collide in any
> >> > significantly skewed way. Am I missing something here?
> >>
> >> Hrm.  So the compound key is unique then.  I was assuming up until now
> >> that it had duplicates.  The hashes of the individual keys are
> >> combined (see ExecHashGetHashValue()), so assuming there is nothing
> >> funky about the way citext gets hashed (and it's not jumping out at
> >> me), your unique keys should give you uniform hash values and thus
> >> partition size, and repartitioning should be an effective way of
> >> reducing hash table size.  So now it sounds like you have a simple
> >> case of underestimation, but now I'm confused about how you got a
> >> 344MB hash table with work_mem = 150MB:
> >
>
> Why would the compound key matter? The join only does this
>
>    Hash Cond: (group_details.group_number = items.system_id)
>
> so the only thing that really matters when it comes to skew is
> system_id, i.e. the citext column. The other column is used by
> aggregation, but that's irrelevant here, no?

Yeah, that seems right. I think I got confused because account_id is
used in another join (well, technically a correlated subquery, but
effectively a join in implementation).

But even so, system_id is unique and not null. So the above discussion
of how uniqueness affects this still holds, right?

> >Looking at the source, citext's hash is a pretty standard call to
> >hash_any, so I don't see how that would lead to any oddities (barring
> >an intentional hash collision etc., but this is real data).
> >
> >Do you have any theories about where the underestimation is happening?
> >It knows the number of rows reasonably well. The one thing I'm not
> >sure about yet is the row width = 16. The account_id is a bigint, so
> >we can assume it knows the size properly there. The system_id being
> >citext...I think the 8 bytes it has for that is probably a bit low on
> >average, but I'm not sure yet by how much (I'm going to run a query to
> >find out). Hmm. Query came back, and average length is just shy of
> >7.9...so the 16 byte row size is looking pretty good. So I'm at a loss
> >as to why/what it would be underestimating (does it know about
> >uniqueness/non-uniqueness? could that be a factor?).
> >
>
> I think the easiest thing we can do is to run a couple of queries
> collecting useful stats, like
>
>      -- the most frequent system_id values
>      select system_id, count(*), sum(length(system_id)) from items
>      group by system_id
>      order by count(*) desc limit 100;
>
>      -- the largest group by system_id
>      select system_id, count(*), sum(length(system_id)) from items
>      group by system_id
>      order by sum(length(system_id))  desc limit 100;
>
> That should tell us if there's something off.

Since the column is unique, we don't need to do any grouping to get
stats; as noted earlier I queried and found that the average length
was just shy of 7.9 chars long (though I happen to know the length
varies at least between 7 and 34).

> >I also don't know why it seems to regularly fail on the primary, but
> >not on the sync, unless we adjust the work_mem up. I've double-checked
> >all GUCs and the only differences are things related to replication
> >that you'd expect to be different on primary/replica.
> >
> >I know there are some things that execute differently on replicas, so
> >I assume it's something like that, but I don't know what specifically
> >would cause this here.
> >
>
> Hmm, I hadn't realized it fails on the primary more often. That's certainly
> strange; I don't have a very good idea why that could be happening. Do
> the query plans look the same? What does the EXPLAIN ANALYZE output look like?

So far they all seem to look very similar. The other possibility is
that it's just that the replica is behind (it's not a sync replica),
but this has happened enough times on different days that that
explanation feels pretty unsatisfactory to me.

As mentioned in my previous email, I'm slowly checking against various
work_mem values to see if I can determine 1.) where it starts to only
fail on the primary and 2.) if the buckets stats start to grow before
the failure case.

So far I've seen it fail on both primary and sync at 500MB, 400MB,
300MB, and 200MB, and, interestingly, with today's data, succeed on
both at 150MB. That last point does imply data changing could have
something to do with the sync/replica difference I suppose. ...And, it
just failed on the replica at 150MB and succeeded on the primary at
150MB. So...probably data changing?

In all of those runs I haven't seen the bucket stats change.

> >> Buckets: 4194304 (originally 4194304)  Batches: 32768 (originally
> >> 4096)  Memory Usage: 344448kB
> >>
> >> And I'm confused about what was different when it wanted the crazy
> >> number of batches.
> >
> >I'm not quite sure how to find out; if you have any ideas, I'd love to
> >hear them. The one thing I can think of to try is to slowly increase
> >work_mem (since setting it to 500MB reproduced the error on the
> >replica) and see if the bucket info starts to trend up.
> >
>
> Chances are it actually happened to detect skew and disabled the growth,
> hence the table grew above work_mem. We should probably add info about
> this (skew, disabling growth) to EXPLAIN ANALYZE verbose mode.

Would skew be likely though if the column is unique?



Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

Thomas Munro-5
I think I see what's happening: we're running out of hash bits.

> Buckets: 4194304 (originally 4194304)  Batches: 32768 (originally 4096)  Memory Usage: 344448kB

Here it's using the lower 22 bits for the bucket number, and started
out using 12 bits for the batch (!), and increased that until it got
to 15 (!!).  After using 22 bits for the bucket, there are only 10
bits left, so all the tuples go into the lower 1024 batches.
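
(A simplified standalone model of that split, using the bucket and batch
counts from the quoted EXPLAIN output and modelled loosely on
ExecHashGetBucketAndBatch():)

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t hashvalue     = 0xdeadbeef;  /* any 32-bit hash value */
        int      log2_nbuckets = 22;          /* 4194304 buckets */
        uint32_t nbuckets      = 1U << log2_nbuckets;
        uint32_t nbatch        = 1U << 15;    /* 32768 batches */

        uint32_t bucketno = hashvalue & (nbuckets - 1);
        uint32_t batchno  = (hashvalue >> log2_nbuckets) & (nbatch - 1);

        /* Only 32 - 22 = 10 hash bits survive the shift, so batchno can
         * never exceed 1023, no matter how many batches we ask for. */
        printf("bucket %u, batch %u (highest reachable batch: %u)\n",
               (unsigned) bucketno, (unsigned) batchno,
               (unsigned) ((1U << (32 - log2_nbuckets)) - 1));
        return 0;
    }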

I'm not sure how exactly this leads to wildly varying numbers of
repartitioning cycles (the above-quoted example did it 3 times, the
version that crashed into MaxAllocSize did it ~10 times).

Besides switching to 64 bit hashes so that we don't run out of
information (clearly a good idea), what other options do we have?  (1)
We could disable repartitioning (give up on work_mem) after we've run
out of bits; this will eat more memory than it should.  (2) You could
start stealing bucket bits; this will eat more CPU than it should,
because you'd effectively have fewer active buckets (tuples will be
concentrated on the bits you didn't steal).



Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

Tomas Vondra-4
On Mon, Nov 11, 2019 at 10:08:58AM +1300, Thomas Munro wrote:
>I think I see what's happening: we're running out of hash bits.
>
>> Buckets: 4194304 (originally 4194304)  Batches: 32768 (originally 4096)  Memory Usage: 344448kB
>
>Here it's using the lower 22 bits for the bucket number, and started
>out using 12 bits for the batch (!), and increased that until it got
>to 15 (!!).  After using 22 bits for the bucket, there are only 10
>bits left, so all the tuples go into the lower 1024 batches.
>

Ouch!

>I'm not sure how exactly this leads to wildly varying numbers of
>repartitioning cycles (the above-quoted example did it 3 times, the
>version that crashed into MaxAllocSize did it ~10 times).
>
>Besides switching to 64 bit hashes so that we don't run out of
>information (clearly a good idea), what other options do we have?  (1)
>We could disable repartitioning (give up on work_mem) after we've run
>out of bits; this will eat more memory than it should.  (2) You could
>start stealing bucket bits; this will eat more CPU than it should,
>because you'd effectively have fewer active buckets (tuples will be
>concentrated on the bits you didn't steal).

Can't we simply compute two hash values, using different seeds - one for
bucket and the other for batch? Of course, that'll be more expensive.
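
(A toy standalone illustration of that idea: derive the bucket number and
the batch number from two independently seeded hashes of the same key, so
they stop competing for the same 32 bits. splitmix64 stands in here for a
seeded hash function; this is not a worked-out patch.)

    #include <stdio.h>
    #include <stdint.h>

    /* splitmix64, used only as a stand-in for a seeded hash of the join key. */
    static uint64_t mix(uint64_t x)
    {
        x += 0x9e3779b97f4a7c15ULL;
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
        x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
    }

    int main(void)
    {
        uint64_t key      = 123456789;    /* stand-in for a join key's bytes */
        uint32_t nbuckets = 1U << 22;
        uint32_t nbatch   = 1U << 21;     /* ~2M batches, as in the backtrace */

        /* Two different seeds give two independent hash values. */
        uint32_t bucketno = (uint32_t) mix(key ^ 0x5eed0001ULL) & (nbuckets - 1);
        uint32_t batchno  = (uint32_t) mix(key ^ 0x5eed0002ULL) & (nbatch - 1);

        printf("bucket %u, batch %u\n", (unsigned) bucketno, (unsigned) batchno);
        return 0;
    }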

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

