Avoiding hash join batch explosions with extreme skew and weird stats

41 messages

Hello,

As discussed elsewhere[1][2], our algorithm for deciding when to give up on repartitioning (AKA increasing the number of batches) tends to keep going until it has a number of batches that is a function of the number of distinct well-distributed keys.  I wanted to move this minor issue away from Tomas Vondra's thread[2] since it's a mostly independent problem.

SET max_parallel_workers_per_gather = 0;
SET synchronize_seqscans = off;
SET work_mem = '4MB';

CREATE TABLE r AS SELECT generate_series(1, 10000000)::int i;
ANALYZE r;

-- 1k uniform keys + 1m duplicates
CREATE TABLE s1k (i int);
INSERT INTO s1k SELECT generate_series(1, 1000)::int i;
ALTER TABLE s1k SET (autovacuum_enabled = off);
ANALYZE s1k;
INSERT INTO s1k SELECT 42 FROM generate_series(1, 1000000);

EXPLAIN ANALYZE SELECT COUNT(*) FROM r JOIN s1k USING (i);

  Buckets: 1048576 (originally 1048576)
  Batches: 4096 (originally 16)
  Memory Usage: 35157kB

-- 10k uniform keys + 1m duplicates
CREATE TABLE s10k (i int);
INSERT INTO s10k SELECT generate_series(1, 10000)::int i;
ALTER TABLE s10k SET (autovacuum_enabled = off);
ANALYZE s10k;
INSERT INTO s10k SELECT 42 FROM generate_series(1, 1000000);

EXPLAIN ANALYZE SELECT COUNT(*) FROM r JOIN s10k USING (i);

  Buckets: 131072 (originally 131072)
  Batches: 32768 (originally 16)
  Memory Usage: 35157kB

See how the number of batches is determined by the number of uniform keys in r?  That's because the explosion unfolds until there is *nothing left* but keys that hash to the same value in the problem batch, which means those uniform keys have to keep spreading out until there is something on the order of two batches per key.  The point is that it's bounded only by the input data (or eventually INT_MAX / 2 and MaxAllocSize), and as Tomas has illuminated, batches eat unmetered memory.  Ouch.

Here's a quick hack to show that a 95% cut-off fixes those examples.  I don't really know how to choose the number, but I suspect it should be much closer to 100 than 50.

I think this is the easiest of three fundamental problems that need to be solved in this area.  The others are: accounting for per-partition overheads as Tomas pointed out, and providing an actual fallback strategy that respects work_mem when extreme skew is detected OR per-partition overheads dominate.  I plan to experiment with nested loop hash join (or whatever you want to call it: the thing where you join every arbitrary fragment of the hash table against the outer batch, and somehow deal with outer match flags) when time permits.

[1] https://www.postgresql.org/message-id/flat/CAG_%3D8kBoWY4AXwW%3DCj44xe13VZnYohV9Yr-_hvZdx2xpiipr9w%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/20190504003414.bulcbnge3rhwhcsh%40development

--
Thomas Munro
https://enterprisedb.com

fix.patch (1K)
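To make the proposed cut-off concrete, here is a minimal Python sketch of the decision (not the actual patch, which modifies the executor's C code; the function name and threshold handling are invented for illustration). With nbatch a power of two, a tuple currently in batch `batchno` stays in that batch after a doubling iff `hash % (nbatch * 2) == batchno`, so we can ask what fraction of the batch a split would actually move:

```python
# Hypothetical sketch of a 95% cut-off for disabling batch growth.
SKEW_CUTOFF = 0.95  # give up on splitting if >= 95% of tuples would stay put

def should_disable_growth(tuple_hashes, nbatch, batchno):
    """After doubling nbatch, would almost everything stay in this batch?"""
    staying = sum(1 for h in tuple_hashes if h % (nbatch * 2) == batchno)
    return staying / len(tuple_hashes) >= SKEW_CUTOFF

# Batch 10 of 16, dominated by a million duplicates of hash value 42
# (42 % 16 == 10), plus 100 well-distributed keys: splitting is futile.
skewed = [42] * 1_000_000 + [10 + 16 * k for k in range(100)]
print(should_disable_growth(skewed, 16, 10))   # True: stop growing nbatch

# The same batch with only well-distributed keys: a split moves about half.
uniform = [10 + 16 * k for k in range(1000)]
print(should_disable_growth(uniform, 16, 10))  # False: splitting helps
```

In the duplicate-heavy case the split would strand ~99.99% of the tuples in place, so it crosses any plausible threshold; the open question above is only where between 50 and 100 the line should sit.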

Re: Avoiding hash join batch explosions with extreme skew and weird stats

 On Thu, May 16, 2019 at 01:22:31PM +1200, Thomas Munro wrote:
> ...
>
> Here's a quick hack to show that a 95% cut-off fixes those examples.
> I don't really know how to choose the number, but I suspect it should
> be much closer to 100 than 50.  I think this is the easiest of three
> fundamental problems that need to be solved in this area.  The others
> are: accounting for per-partition overheads as Tomas pointed out, and
> providing an actual fallback strategy that respects work_mem when
> extreme skew is detected OR per-partition overheads dominate.  I plan
> to experiment with nested loop hash join (or whatever you want to call
> it: the thing where you join every arbitrary fragment of the hash
> table against the outer batch, and somehow deal with outer match
> flags) when time permits.

I think this is a step in the right direction, but as I said on the other thread(s), I think we should not disable growth forever and should recheck once in a while. Otherwise we'll end up in a sad situation with non-uniform data sets, as pointed out by Hubert Zhang in [1]. It's probably even truer with this less strict logic, using 95% as a threshold (instead of 100%).

I kinda like the idea with increasing the spaceAllowed value. Essentially, if we decide adding batches would be pointless, increasing the memory budget is the only thing we can do anyway.

The problem however is that we only really look at a single bit - it may be that doubling the batches would not help, but doing it twice would actually reduce the memory usage. For example, assume there are 2 distinct values in the batch, with hash values (in binary)

  101010000
  101010111

and assume we have currently split batches up to the shared 101010 prefix. Clearly, splitting batches is going to do nothing until we get to the 000 vs. 111 parts. At first I thought this is rather unlikely and we can ignore that, but I'm not really sure about that - it may actually be pretty likely. We may get to the 101010 bucket with a sufficiently large data set, and then it's ~50% probability the next bit is the same (assuming two distinct values). So this may be quite an issue, I think.

regards

[1] https://www.postgresql.org/message-id/CAB0yrekv%3D6_T_eUe2kOEvWUMwufcvfd15SFmCABtYFOkxCFdfA%40mail.gmail.com

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
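The two-value example can be simulated directly. The sketch below is an abstraction: it consumes hash bits from the shared prefix first, as the example does, rather than mimicking the executor's actual batch-number masking. It counts how many rounds of doubling are needed before the two values land in different batches:

```python
# Tomas's example: two 9-bit hash values sharing a 6-bit prefix.
WIDTH = 9
a = 0b101010000
b = 0b101010111

def batch_of(h, nbits):
    # assign batches by the first `nbits` bits of the hash, so each
    # doubling of nbatch consumes one more bit of the shared prefix
    return h >> (WIDTH - nbits)

# The six prefix bits are identical, so six rounds of doubling split nothing:
assert all(batch_of(a, n) == batch_of(b, n) for n in range(1, 7))

# Only the seventh bit (the start of 000 vs. 111) finally separates them:
doublings = next(n for n in range(1, WIDTH + 1) if batch_of(a, n) != batch_of(b, n))
print(doublings)  # 7
```

So a rule that looks at only the next single bit would give up here even though three more doublings would eventually spread the two values out, which is exactly the concern raised above.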

Re: Avoiding hash join batch explosions with extreme skew and weird stats

 On Fri, May 17, 2019 at 4:39 AM Tomas Vondra <[hidden email]> wrote:
> I think this is a step in the right direction, but as I said on the other
> thread(s), I think we should not disable growth forever and recheck once
> in a while. Otherwise we'll end up in a sad situation with non-uniform data
> sets, as pointed out by Hubert Zhang in [1]. It's probably even truer with
> this less strict logic, using 95% as a threshold (instead of 100%).
>
> I kinda like the idea with increasing the spaceAllowed value. Essentially,
> if we decide adding batches would be pointless, increasing the memory
> budget is the only thing we can do anyway.

But that's not OK, we need to fix THAT.

> The problem however is that we only really look at a single bit - it may
> be that doubling the batches would not help, but doing it twice would
> actually reduce the memory usage. For example, assume there are 2 distinct
> values in the batch, with hash values (in binary)

Yes, that's a good point, and not a case that we should ignore.  But if we had a decent fall-back strategy that respected work_mem, we wouldn't care so much if we got it wrong in a corner case.  I'm arguing that we should use Grace partitioning as our primary partitioning strategy, but fall back to looping (or possibly sort-merging) for the current batch if Grace doesn't seem to be working.  You'll always be able to find cases where, if you'd just tried one more round, Grace would work, but that seems acceptable to me, because getting it wrong doesn't melt your computer, it just probably takes longer.  Or maybe it doesn't.  How much longer would it take to loop twice?  Erm, twice as long, but each loop makes actual progress, unlike extra speculative Grace partition expansions, which apply not just to the current batch but to all batches, might not actually work, and which you *have* to abandon at some point.

The more I think about it, the more I think that a loop-based escape valve, though unpalatably quadratic, is probably OK, because we're in a sink-or-swim situation at this point, and our budget is work_mem, not work_time.

I'm concerned that we're trying to find ways to treat the symptoms, allowing us to exceed work_mem but maybe not by so much, instead of focusing on the fundamental problem, which is that we don't yet have an algorithm that is guaranteed to respect work_mem.

Admittedly I don't have a patch, just a bunch of handwaving.  One reason I haven't attempted to write it is that, although I know how to do the non-parallel version using a BufFile full of match bits kept in sync with the tuples for outer joins, I haven't figured out how to do it for parallel-aware hash join, because then each loop over the outer batch could see different tuples in each participant.  You could use the match bit in the HashJoinTuple header, but then you'd have to write all the tuples out again, which is more IO than I want to do.  I'll probably start another thread about that.

--
Thomas Munro
https://enterprisedb.com
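For the non-parallel version described above, the match-bit idea can be sketched as follows (a hypothetical Python stand-in: a list of booleans plays the role of the BufFile of match bits, kept in sync with the outer batch by position; the real thing would operate on spilled tuple files):

```python
def hash_left_join_with_match_bits(inner_chunks, outer_batch, key=lambda t: t[0]):
    """Left join where the inner side is processed one work_mem-sized
    chunk at a time; match bits survive across loops over the outer batch."""
    match_bits = [False] * len(outer_batch)  # stand-in for the BufFile of bits
    out = []
    for chunk in inner_chunks:
        # build a hash table on just this fragment of the inner side
        table = {}
        for itup in chunk:
            table.setdefault(key(itup), []).append(itup)
        # loop over the whole outer batch for each inner fragment
        for i, otup in enumerate(outer_batch):
            for itup in table.get(key(otup), []):
                match_bits[i] = True
                out.append((otup, itup))
    # only after the final loop can unmatched outer tuples be null-extended
    out.extend((otup, None) for i, otup in enumerate(outer_batch)
               if not match_bits[i])
    return out

rows = hash_left_join_with_match_bits([[(1, 'a')], [(2, 'b')]],
                                      [(1, 'x'), (3, 'y')])
print(rows)  # [((1, 'x'), (1, 'a')), ((3, 'y'), None)]
```

The bits work here precisely because each loop sees the outer tuples in the same order; the parallel-aware problem raised above is that workers see a non-deterministic subset and order, so position-based bits no longer line up.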

Re: Avoiding hash join batch explosions with extreme skew and weird stats

 On Fri, May 17, 2019 at 10:21:56AM +1200, Thomas Munro wrote:
> On Fri, May 17, 2019 at 4:39 AM Tomas Vondra
> <[hidden email]> wrote:
>> I think this is a step in the right direction, but as I said on the other
>> thread(s), I think we should not disable growth forever and recheck once
>> in a while. Otherwise we'll end up in a sad situation with non-uniform data
>> sets, as pointed out by Hubert Zhang in [1]. It's probably even truer with
>> this less strict logic, using 95% as a threshold (instead of 100%).
>>
>> I kinda like the idea with increasing the spaceAllowed value. Essentially,
>> if we decide adding batches would be pointless, increasing the memory
>> budget is the only thing we can do anyway.
>
> But that's not OK, we need to fix THAT.

I agree increasing the budget is not ideal, although at the moment it's the only thing we can do. If we can improve that, great.

>> The problem however is that we only really look at a single bit - it may
>> be that doubling the batches would not help, but doing it twice would
>> actually reduce the memory usage. For example, assume there are 2 distinct
>> values in the batch, with hash values (in binary)
>
> Yes, that's a good point, and not a case that we should ignore.  But
> if we had a decent fall-back strategy that respected work_mem, we
> wouldn't care so much if we get it wrong in a corner case.  I'm
> arguing that we should use Grace partitioning as our primary
> partitioning strategy, but fall back to looping (or possibly
> sort-merging) for the current batch if Grace doesn't seem to be
> working.  You'll always be able to find cases where if you'd just
> tried one more round, Grace would work, but that seems acceptable to
> me, because getting it wrong doesn't melt your computer, it just
> probably takes longer.  Or maybe it doesn't.  How much longer would it
> take to loop twice?  Erm, twice as long, and each loop makes actual
> progress, unlike extra speculative Grace partition expansions which
> apply not just to the current batch but all batches, might not
> actually work, and you *have* to abandon at some point.  The more I
> think about it, the more I think that a loop-based escape valve, though
> unpalatably quadratic, is probably OK because we're in a sink-or-swim
> situation at this point, and our budget is work_mem, not work_time.

True.

> I'm concerned that we're trying to find ways to treat the symptoms,
> allowing us to exceed work_mem but maybe not so much, instead of
> focusing on the fundamental problem, which is that we don't yet have
> an algorithm that is guaranteed to respect work_mem.

Yes, that's a good point.

> Admittedly I don't have a patch, just a bunch of handwaving.  One
> reason I haven't attempted to write it is because although I know how
> to do the non-parallel version using a BufFile full of match bits in
> sync with the tuples for outer joins, I haven't figured out how to do
> it for parallel-aware hash join, because then each loop over the outer
> batch could see different tuples in each participant.  You could use
> the match bit in HashJoinTuple header, but then you'd have to write
> all the tuples out again, which is more IO than I want to do.  I'll
> probably start another thread about that.

That pesky parallelism ;-)

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Avoiding hash join batch explosions with extreme skew and weird stats

 Thomas Munro <[hidden email]> writes:
> On Fri, May 17, 2019 at 4:39 AM Tomas Vondra
> <[hidden email]> wrote:
>> I kinda like the idea with increasing the spaceAllowed value. Essentially,
>> if we decide adding batches would be pointless, increasing the memory
>> budget is the only thing we can do anyway.

> But that's not OK, we need to fix THAT.

I don't think it's necessarily a good idea to suppose that we MUST fit in work_mem come what may.  It's likely impossible to guarantee that in all cases.  Even if we can, a query that runs for eons will help nobody.

                        regards, tom lane

Re: Avoiding hash join batch explosions with extreme skew and weird stats

 On Thu, May 16, 2019 at 06:58:43PM -0400, Tom Lane wrote:
> Thomas Munro <[hidden email]> writes:
>> On Fri, May 17, 2019 at 4:39 AM Tomas Vondra
>> <[hidden email]> wrote:
>>> I kinda like the idea with increasing the spaceAllowed value. Essentially,
>>> if we decide adding batches would be pointless, increasing the memory
>>> budget is the only thing we can do anyway.
>
>> But that's not OK, we need to fix THAT.
>
> I don't think it's necessarily a good idea to suppose that we MUST
> fit in work_mem come what may.  It's likely impossible to guarantee
> that in all cases.  Even if we can, a query that runs for eons will
> help nobody.

I kinda agree with Thomas - arbitrarily increasing work_mem is something we should not do unless absolutely necessary. If the query is slow, it's up to the user to bump the value up, if deemed appropriate.

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Avoiding hash join batch explosions with extreme skew and weird stats

 On Fri, May 17, 2019 at 11:46 AM Tomas Vondra <[hidden email]> wrote:
> On Thu, May 16, 2019 at 06:58:43PM -0400, Tom Lane wrote:
>> I don't think it's necessarily a good idea to suppose that we MUST
>> fit in work_mem come what may.  It's likely impossible to guarantee
>> that in all cases.  Even if we can, a query that runs for eons will
>> help nobody.
>
> I kinda agree with Thomas - arbitrarily increasing work_mem is something
> we should not do unless absolutely necessary. If the query is slow, it's
> up to the user to bump the value up, if deemed appropriate.

+1

I think we can guarantee that we fit in work_mem with only one exception: we have to allow work_mem to be exceeded when we otherwise couldn't fit a single tuple.  Then the worst possible case with the looping algorithm is that we degrade to loading just one inner tuple at a time into the hash table, at which point we effectively have a nested loop join (except (1) it's flipped around: for each tuple on the inner side, we scan the outer side; and (2) we can handle full outer joins).  In any reasonable case you'll have a decent number of tuples at a time, so you won't have to loop too many times, so it's not really quadratic in the number of tuples.  The realisation that it's a nested loop join in the extreme case is probably why the MySQL people called it 'block nested loop join' (and as far as I can tell from quick googling, it might be their *primary* strategy for hash joins that don't fit in memory, not just a secondary strategy after Grace fails, but I might be wrong about that).

Unlike plain old single-tuple nested loop join, it works in arbitrarily sized blocks (the hash table).  What we would call a regular hash join, they call a BNL that just happens to have only one loop.  I think Grace is probably a better primary strategy, but loops are a good fallback.

The reason I kept mentioning sort-merge in earlier threads is that it'd be better in the worst cases.  Unfortunately it would be worse in the best case (smallish numbers of loops) and, I suspect, in many real-world cases.  It's hard to decide, so perhaps we should be happy that sort-merge can't currently be considered, because the join conditions may not be merge-joinable.

--
Thomas Munro
https://enterprisedb.com
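The looping idea in its inner-join form can be sketched like this (hypothetical Python; a list slice of `capacity` tuples stands in for "as many inner tuples as fit in work_mem"):

```python
def block_nested_loop_join(inner, outer, capacity, key=lambda t: t[0]):
    """Inner join: load as many inner tuples as fit (`capacity`) into a
    hash table, probe it with a full scan of the outer side, and repeat.
    capacity=1 degenerates to a (flipped-around) nested loop join, and a
    capacity covering the whole inner side is a regular hash join."""
    out = []
    for start in range(0, len(inner), capacity):
        # build a hash table on one block of the inner side
        table = {}
        for itup in inner[start:start + capacity]:
            table.setdefault(key(itup), []).append(itup)
        # probe with a full scan of the outer side
        for otup in outer:
            out.extend((otup, itup) for itup in table.get(key(otup), []))
    return out

inner = [(1, 'a'), (2, 'b'), (1, 'c')]
outer = [(1, 'x'), (3, 'y')]
rows = block_nested_loop_join(inner, outer, 2)
print(rows)  # [((1, 'x'), (1, 'a')), ((1, 'x'), (1, 'c'))]
```

The number of outer scans is ceil(len(inner) / capacity), which is why each loop makes guaranteed progress: the cost is linear in the loop count, unlike speculative repartitioning that may move nothing.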

Re: Avoiding hash join batch explosions with extreme skew and weird stats

 On Thu, May 16, 2019 at 3:22 PM Thomas Munro <[hidden email]> wrote:
> Admittedly I don't have a patch, just a bunch of handwaving.  One
> reason I haven't attempted to write it is because although I know how
> to do the non-parallel version using a BufFile full of match bits in
> sync with the tuples for outer joins, I haven't figured out how to do
> it for parallel-aware hash join, because then each loop over the outer
> batch could see different tuples in each participant.  You could use
> the match bit in HashJoinTuple header, but then you'd have to write
> all the tuples out again, which is more IO than I want to do.  I'll
> probably start another thread about that.

Could you explain more about the implementation you are suggesting? Specifically, what do you mean by "BufFile full of match bits in sync with the tuples for outer joins"?

Is the implementation you are thinking of one which falls back to NLJ on a batch-by-batch basis, decided during the build phase? If so, why do you need to keep track of the outer tuples seen? If you are going to loop through the whole outer side for each tuple on the inner side, it seems like you wouldn't need to.

Could you make an outer "batch" which is the whole of the outer relation? That is, could you do something like: when hashing the inner side, if re-partitioning is resulting in batches that will overflow spaceAllowed, could you set a flag, use_NLJ, on that batch, and when making batches for the outer side, make one "batch" that has all the tuples from the outer side which the flagged inner side batch will do NLJ with?

--
Melanie Plageman

Re: Avoiding hash join batch explosions with extreme skew and weird stats

 Hi,

On 2019-05-20 13:25:52 +1200, Thomas Munro wrote:
> In PostgreSQL, it's always inner = right, outer = left.  You can see
> that reflected in plannodes.h and elsewhere:
>
> /* ----------------
>  *      these are defined to avoid confusion problems with "left"
>  *      and "right" and "inner" and "outer".  The convention is that
>  *      the "left" plan is the "outer" plan and the "right" plan is
>  *      the inner plan, but these make the code more readable.
>  * ----------------
>  */
> #define innerPlan(node)                 (((Plan *)(node))->righttree)
> #define outerPlan(node)                 (((Plan *)(node))->lefttree)

I really don't understand why we don't just rename those fields.

Greetings,

Andres Freund

Re: Avoiding hash join batch explosions with extreme skew and weird stats

 On Mon, Jun 3, 2019 at 5:10 PM Melanie Plageman <[hidden email]> wrote:
> I was talking to Jeff Davis about this on Saturday, and, he felt that
> there might be a way to solve the problem differently if we thought of
> the left join case as performing an inner join and an antijoin
> instead.
>
> Riffing on this idea a bit, I started trying to write a patch that
> would basically emit a tuple if it matches and write the tuple out to
> a file if it does not match. Then, after iterating through the outer
> batch the first time for the first inner chunk, any tuples which do
> not yet have a match are the only ones which need to be joined against
> the other inner chunks. Instead of iterating through the outer side
> original batch file, use the unmatched outer tuples file to do the
> join against the next chunk. Repeat this for all chunks.

I'm not sure that I'm understanding this proposal correctly, but if I am, then I think it doesn't work in the case where a single outer row matches rows in many different inner chunks.  When you "use the unmatched outer tuples file to do the join against the next chunk," you deny any rows that have already matched the chance to produce additional matches.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Avoiding hash join batch explosions with extreme skew and weird stats

 On Sun, May 19, 2019 at 7:07 PM Thomas Munro <[hidden email]> wrote:
> Unfortunately that bits-in-order scheme doesn't work for parallel
> hash, where the SharedTuplestore tuples seen by each worker are
> non-deterministic.  So perhaps in that case we could use the
> HEAP_TUPLE_HAS_MATCH bit in the outer tuple header itself, and write
> the whole outer batch back out each time through the loop.  That'd
> keep the tuples and match bits together, but it seems like a lot of
> IO...

So, I think the case you're worried about here is something like:

Gather
  -> Parallel Hash Left Join
       -> Parallel Seq Scan on a
       -> Parallel Hash
            -> Parallel Seq Scan on b

If I understand ExecParallelHashJoinPartitionOuter correctly, we're going to hash all of a and put it into a set of batch files before we even get started, so it's possible to identify precisely which tuple we're talking about by just giving the batch number and the position of the tuple within that batch.  So while it's true that the individual workers can't use the number of tuples they've read to know where they are in the SharedTuplestore, maybe the SharedTuplestore could just tell them.  Then they could maintain a paged bitmap of the tuples that they've matched to something, indexed by position within the tuplestore, and those bitmaps could be OR'd together at the end.

Crazy idea, or...?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
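The OR'd-bitmap idea might look something like this toy sketch (hypothetical Python: real code would key the bit index off the SharedTuplestore position and page the bitmap; the helper names are invented):

```python
def make_bitmap(ntuples):
    # one match bit per outer tuple, packed eight to a byte
    return bytearray((ntuples + 7) // 8)

def set_match(bitmap, pos):
    bitmap[pos // 8] |= 1 << (pos % 8)

def is_match(bitmap, pos):
    return bool(bitmap[pos // 8] & (1 << (pos % 8)))

def or_bitmaps(bitmaps):
    # combine the per-worker match bitmaps at the end of the join
    combined = bytearray(len(bitmaps[0]))
    for bm in bitmaps:
        for i, byte in enumerate(bm):
            combined[i] |= byte
    return combined

ntuples = 10
worker_a, worker_b = make_bitmap(ntuples), make_bitmap(ntuples)
# each worker marks whichever outer tuples it happened to match,
# identified by position in the shared tuplestore, not by read order
for pos in (0, 5):
    set_match(worker_a, pos)
for pos in (5, 9):
    set_match(worker_b, pos)

combined = or_bitmaps([worker_a, worker_b])
unmatched = [pos for pos in range(ntuples) if not is_match(combined, pos)]
print(unmatched)  # positions needing null-extension in a left join
```

Because the bit index is a stable tuple position rather than a per-worker read count, it doesn't matter that the workers saw disjoint, non-deterministic subsets of the batch.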

Re: Avoiding hash join batch explosions with extreme skew and weird stats

 On Tue, Jun 4, 2019 at 5:43 AM Robert Haas <[hidden email]> wrote:
> On Mon, Jun 3, 2019 at 5:10 PM Melanie Plageman <[hidden email]> wrote:
>> Riffing on this idea a bit, I started trying to write a patch that
>> would basically emit a tuple if it matches and write the tuple out to
>> a file if it does not match. Then, after iterating through the outer
>> batch the first time for the first inner chunk, any tuples which do
>> not yet have a match are the only ones which need to be joined against
>> the other inner chunks. Instead of iterating through the outer side
>> original batch file, use the unmatched outer tuples file to do the
>> join against the next chunk. Repeat this for all chunks.
>
> I'm not sure that I'm understanding this proposal correctly, but if I am,
> then I think it doesn't work in the case where a single outer row
> matches rows in many different inner chunks.  When you "use the
> unmatched outer tuples file to do the join against the next chunk," you
> deny any rows that have already matched the chance to produce
> additional matches.

Oops! You are totally right. I will amend the idea:

For each chunk on the inner side, loop through both the original batch file and the unmatched outer tuples file created for the last chunk. Emit any matches and write out any unmatched tuples to a new unmatched outer tuples file.

I think, in the worst case, if no tuples from the outer have a match, you end up writing out all of the outer tuples for each chunk on the inner side. However, using the match bit in the tuple header solution would require this much writing. Probably the bigger problem is that in this worst case you would also need to read double the number of outer tuples for each inner chunk.

However, in the best case it seems like it would be better than the match-bit/write-everything-from-the-outer-side-out solution.

--
Melanie Plageman
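The amended proposal can be sketched as follows (hypothetical Python; as a simplification, outer tuples are identified by position in the batch, whereas the real unmatched file would store the tuples themselves, with the consequences that entails for re-matching them):

```python
def left_join_amended(inner_chunks, outer_batch, key=lambda t: t[0]):
    """Probe each inner chunk with the *original* outer batch, so a row
    can produce matches in several chunks, while a separate 'unmatched'
    set (standing in for the unmatched-outer-tuples file) tracks rows
    that have never matched any chunk so far."""
    out = []
    unmatched = set(range(len(outer_batch)))
    for chunk in inner_chunks:
        table = {}
        for itup in chunk:
            table.setdefault(key(itup), []).append(itup)
        still_unmatched = set()
        for i, otup in enumerate(outer_batch):
            hits = table.get(key(otup), [])
            out.extend((otup, itup) for itup in hits)
            if not hits and i in unmatched:
                still_unmatched.add(i)  # carried forward to the next chunk
        unmatched = still_unmatched  # a match anywhere removes the row for good
    # whatever never matched any chunk gets null-extended
    out.extend((outer_batch[i], None) for i in sorted(unmatched))
    return out

rows = left_join_amended([[(1, 'a')], [(1, 'b'), (2, 'c')]],
                         [(1, 'x'), (3, 'y')])
print(rows)  # [((1, 'x'), (1, 'a')), ((1, 'x'), (1, 'b')), ((3, 'y'), None)]
```

Note how the outer row with key 1 matches in both chunks, which is exactly the case the original (pre-amendment) formulation would have missed.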

Re: Avoiding hash join batch explosions with extreme skew and weird stats

 On Tue, Jun 4, 2019 at 2:47 PM Melanie Plageman <[hidden email]> wrote:
> Oops! You are totally right.
> I will amend the idea:
> For each chunk on the inner side, loop through both the original batch
> file and the unmatched outer tuples file created for the last chunk.
> Emit any matches and write out any unmatched tuples to a new unmatched
> outer tuples file.
>
> I think, in the worst case, if no tuples from the outer have a match,
> you end up writing out all of the outer tuples for each chunk on the
> inner side. However, using the match bit in the tuple header solution
> would require this much writing.
> Probably the bigger problem is that in this worst case you would also
> need to read double the number of outer tuples for each inner chunk.
>
> However, in the best case it seems like it would be better than the
> match bit/write everything from the outer side out solution.

I guess so, but the downside of needing to read twice as many outer tuples for each inner chunk seems pretty large.  It would be a lot nicer if we could find a way to store the matched-bits someplace other than where we are storing the tuples, what Thomas called a bits-in-order scheme, because then the amount of additional read and write I/O would be tiny -- one bit per tuple doesn't add up very fast.

In the scheme you propose here, I think that after you read the original outer tuples for each chunk and the unmatched outer tuples for each chunk, you'll have to match up the unmatched tuples to the original tuples, probably by using memcmp() or something.  Otherwise, when a new match occurs, you won't know which tuple should now not be emitted into the new unmatched outer tuples file that you're going to produce.  So I think what's going to happen is that you'll read the original batch file, then read the unmatched tuples file and use that to set or not set a bit on each tuple in memory, then do the real work setting more bits, then write out a new unmatched-tuples file with the tuples that still don't have the bit set.  So your unmatched tuple file is basically a list of tuple identifiers in the least compact form imaginable: the tuple is identified by its entire contents.  That doesn't seem very appealing, although I expect that it would still win for some queries.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company