Experimenting with hash join prefetch

Thomas Munro
Hello hackers,

I have a list of micro-optimisations and things to look into for hash
joins, which I've updated on the wiki[1].  Here's one that I was
inspired to poke at with a stick in a spare hour today.

Cache-oblivious hash joins cause a lot of TLB and cache misses.
Researchers tell us to reduce those using huge/super VM pages[2] and
cache prefetch instructions[3].  (There is another class of
cache-aware hash join algorithms that partition carefully up front to
avoid them; that's not us.)

Here is a totally contrived experiment that shows the effect quite
reproducibly here:

shared_buffers = '1GB'

create table t as select generate_series(1, 20000000)::int i;
set max_parallel_workers_per_gather = 0;
set work_mem = '2GB';
select pg_prewarm('t'::regclass);

select count(*) from t t1 join t t2 using (i);

-> 00:12.683

First, let's try to prefetch the hash bucket for the next tuple while
computing the hash for the current tuple, since we can see into the
future quite easily: we know the keys are sequential integers in this
contrived experiment.  In ExecHashGetHashValue():

+               /* Prefetch the bucket for the next key */
+               uint32 next_hash = hash_uint32(DatumGetInt32(keyval) + 1);
+               uint32 next_bucket = next_hash % hashtable->nbuckets;
+               __builtin_prefetch(&hashtable->buckets.unshared[next_bucket]);

select count(*) from t t1 join t t2 using (i);

-> 00:09.901

Huzzah!  Next up, let's try a two-stage prefetch pipeline for the
probing phase, seeing two steps ahead:

+               /* Prefetch the bucket for the tuple after next */
+               uint32 next_next_hash = hash_uint32(DatumGetInt32(keyval) + 2);
+               uint32 next_next_bucket = next_next_hash % hashtable->nbuckets;
+               __builtin_prefetch(&hashtable->buckets.unshared[next_next_bucket]);
+               if (outer_tuple)
+               {
+                       /* Prefetch the first tuple in the next bucket */
+                       uint32 next_hash = hash_uint32(DatumGetInt32(keyval) + 1);
+                       uint32 next_bucket = next_hash % hashtable->nbuckets;
+                       __builtin_prefetch(hashtable->buckets.unshared[next_bucket]);
+               }

-> 00:09.138

It's nice to see this effect, but it isn't really surprising: there is
no doubt about the value of prefetching random access data.
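
(If anyone wants to see the effect in isolation, outside PostgreSQL, here
is a minimal standalone C sketch -- not from any patch; N and DIST are
arbitrary -- that does random loads from a table much bigger than cache.
Compare timings with the __builtin_prefetch line commented out.)

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define N    (1 << 24)          /* 16M entries, much larger than L3 */
#define DIST 8                  /* how many iterations ahead to prefetch */

int
main(void)
{
    uint64_t   *table = malloc(N * sizeof(uint64_t));
    uint32_t   *idx = malloc(N * sizeof(uint32_t));
    uint64_t    sum = 0;

    for (size_t i = 0; i < N; i++)
    {
        table[i] = i;
        idx[i] = (uint32_t) (rand() % N);   /* random pattern; assumes RAND_MAX >= N */
    }

    for (size_t i = 0; i < N; i++)
    {
        /* hint the cache line we'll need DIST iterations from now */
        if (i + DIST < N)
            __builtin_prefetch(&table[idx[i + DIST]]);
        sum += table[idx[i]];
    }

    printf("%llu\n", (unsigned long long) sum);
    free(table);
    free(idx);
    return 0;
}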

I think we could probably do this for the build phase with the
existing tuple-at-a-time executor interface by doing the bucket
insertions from a queue that runs N tuples behind the one we're
currently loading and hashing.  Or something like that.  For the probe
phase (probably much more interesting) I think it'd involve extra
tuple copying, so that it could still access the last tuple while
pulling the next tuple to hash its key.  To avoid that we'd need a
facility for peeking at future tuples, or a proper first class batch
mode.
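
To make the build-side idea a bit more concrete, here's an untested sketch
of such an insertion queue (PIPELINE_DEPTH is a guess, and
ExecHashTableInsertHashed() is a made-up stand-in for the real
ExecHashTableInsert(), which works on a slot that is reused for each tuple
-- exactly the copying problem mentioned above):

#define PIPELINE_DEPTH 8

typedef struct PendingInsert
{
        HashJoinTuple   tuple;
        uint32          hashvalue;
} PendingInsert;

static PendingInsert pending[PIPELINE_DEPTH];
static int      pending_head;   /* oldest entry, also next slot to overwrite */
static int      pending_count;

static void
queue_insert(HashJoinTable hashtable, HashJoinTuple tuple, uint32 hashvalue)
{
        uint32          bucketno = hashvalue % hashtable->nbuckets;

        /* start pulling the new tuple's bucket header into cache now ... */
        __builtin_prefetch(&hashtable->buckets.unshared[bucketno]);

        /* ... and retire the oldest queued insertion, whose line should be hot */
        if (pending_count == PIPELINE_DEPTH)
        {
                ExecHashTableInsertHashed(hashtable,
                                          pending[pending_head].tuple,
                                          pending[pending_head].hashvalue);
                pending_count--;
        }
        pending[pending_head].tuple = tuple;
        pending[pending_head].hashvalue = hashvalue;
        pending_head = (pending_head + 1) % PIPELINE_DEPTH;
        pending_count++;
}

(A drain step at the end of the build phase is omitted.)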

[1] https://wiki.postgresql.org/wiki/Parallel_Hash
[2] https://15721.courses.cs.cmu.edu/spring2016/papers/balkesen-icde2013.pdf
(table VI)
[3] http://www.cs.cmu.edu/~chensm/papers/hashjoin_icde04.pdf

--
Thomas Munro
http://www.enterprisedb.com


Re: Experimenting with hash join prefetch

Andrey Borodin

Hi, Thomas!
> On 14 Oct 2018, at 9:18, Thomas Munro <[hidden email]> wrote:
>
> +               /* Prefetch the bucket for the next key */
> +               uint32 next_hash = hash_uint32(DatumGetInt32(keyval) + 1);
> +               uint32 next_bucket = next_hash % hashtable->nbuckets;
> +               __builtin_prefetch(&hashtable->buckets.unshared[next_bucket]);


+1, I also think that we should use __builtin_prefetch these days (years, actually).
Right after listening to Anastassia Ailamaki's (author of the referenced paper) talk at VLDB, I proposed doing the same for B-trees [0], but did not really pursue the idea.

[0] https://www.postgresql.org/message-id/3B774C9E-01E8-46A7-9642-7830DC1108F1%40yandex-team.ru

Re: Experimenting with hash join prefetch

Dmitry Dolgov
> On Sun, 14 Oct 2018 at 06:19, Thomas Munro <[hidden email]> wrote:
>
> Cache-oblivious hash joins cause a lot of TLB and cache misses.
> ...
> (There is another class of cache-aware hash join algorithms that partition
> carefully up front to avoid them; that's not us.)

Just out of curiosity, can you please elaborate on this part (with
references)? I've been thinking about this topic for a while, and I'm
wondering: if by another class you mean something like this [1], then even if
it's not us today, are there any issues that prevent us from experimenting in
this area?

[1]: https://www.cse.ust.hk/catalac/papers/coqp_tods08.pdf


Re: Experimenting with hash join prefetch

Thomas Munro
On Mon, Oct 15, 2018 at 12:16 AM Dmitry Dolgov <[hidden email]> wrote:

> > On Sun, 14 Oct 2018 at 06:19, Thomas Munro <[hidden email]> wrote:
> > Cache-oblivious hash joins cause a lot of TLB and cache misses.
> > ...
> > (There is another class of cache-aware hash join algorithms that partition
> > carefully up front to avoid them; that's not us.)
>
> Just out of curiosity, can you please elaborate more on this part (with
> references)? I'm thinking about this topic for a while, and I'm wondering, if
> by another class you mean something like this [1], then even if it's not us
> today, are there any issues that prevent from experimenting in this area?

Hmm, I didn't mean the term-of-art "cache-oblivious" (Wikipedia
definition: "an algorithm designed to take advantage of a CPU cache
without having the size of the cache ... as an explicit parameter");
I meant not "cache-conscious" (we don't do anything at all to reduce
cache misses, though obviously we could add prefetching to improve on
that).

The distinction I'm making is between "no partition" hash join (what
we have), where you don't have to do any work up front, but you pay
for a lot of cache misses during building and probing, and "partition"
hash join, notably "radix join" (what MonetDB has?), where you have a
partition phase before the build phase that aims to break the data up
into enough partitions so that the hash tables will fit better in
cache, making the later phases faster.  There seems to be some
disagreement about which is best -- passing through the data first is
expensive, but so are cache misses on every probe, and there are
claims that the winner depends on skew and tuple size.
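
(For a concrete picture of what the "partition" camp does, the
partitioning pass of a radix join looks roughly like the sketch below --
illustration only, nothing like this exists in PostgreSQL.  Tuples are
scattered by the low bits of their hash so that each partition's hash
table later fits in cache.)

#include <stdint.h>
#include <stdlib.h>

#define RADIX_BITS   6
#define NPARTITIONS  (1 << RADIX_BITS)

typedef struct Tuple
{
    uint32_t    hash;           /* precomputed hash of the join key */
    /* ... key and payload would follow ... */
} Tuple;

static void
radix_partition(const Tuple *tuples, size_t ntuples,
                Tuple **partitions, size_t *counts)
{
    size_t      hist[NPARTITIONS] = {0};

    /* first pass: histogram, so each partition can be sized exactly */
    for (size_t i = 0; i < ntuples; i++)
        hist[tuples[i].hash & (NPARTITIONS - 1)]++;

    for (int p = 0; p < NPARTITIONS; p++)
    {
        partitions[p] = malloc(hist[p] * sizeof(Tuple));
        counts[p] = 0;
    }

    /* second pass: scatter tuples to their partitions */
    for (size_t i = 0; i < ntuples; i++)
    {
        uint32_t    p = tuples[i].hash & (NPARTITIONS - 1);

        partitions[p][counts[p]++] = tuples[i];
    }
}

(Real radix joins do this in multiple passes with software-managed buffers
to bound TLB misses during the scatter; this one-pass version just shows
the shape.)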

Here's some of the stuff I read/watched about this subject:
https://15721.courses.cs.cmu.edu/spring2018/schedule.html#apr-04-2018
Add to that http://www.adms-conf.org/2017/camera-ready/Analyzing_In_Memory_Hash_Joins__Granularity_Matters.pdf.

Skimming very quickly through the paper you posted, yeah, I mean
exactly that stuff.  Specifically I was thinking of the radix join
mentioned in background section 2.3.  (I see also that the same
authors wrote a paper "Cache-Oblivious Hash Joins" which appears to
describe a refinement of radix that doesn't need to be parameterised
for cache size.)  Sure, we could always consider these ideas.  I am
not personally working on that; to me it looked very hard to build,
for a payoff that is far from certain!

(Note: we do of course have some kind of partitioning called
"batching" when work_mem runs out, but it's not a kind of partitioning
that cares about reducing cache misses, so if I understood correctly
it's still "no partition" as far as this discussion goes.)

--
Thomas Munro
http://www.enterprisedb.com


Re: Experimenting with hash join prefetch

Thomas Munro
On Sun, Oct 14, 2018 at 11:11 PM Andrey Borodin <[hidden email]> wrote:

> > On 14 Oct 2018, at 9:18, Thomas Munro <[hidden email]> wrote:
> >
> > +               /* Prefetch the bucket for the next key */
> > +               uint32 next_hash = hash_uint32(DatumGetInt32(keyval) + 1);
> > +               uint32 next_bucket = next_hash % hashtable->nbuckets;
> > +               __builtin_prefetch(&hashtable->buckets.unshared[next_bucket]);
>
>
> +1, I also think that we should use __builtin_prefetch these days (years, actually).
> Right after listening to Anastassia Ailamaki's (author of the referenced paper) talk at VLDB, I proposed doing the same for B-trees [0], but did not really pursue the idea.

The above was obviously just a hard-coded hack that "knew" the next
key would be n + 1.  I've been wondering how you might abstract real
look-ahead with the shiny new TupleTableSlot design.  Here's a concept
I came up with: ExecScrollSlot(slot, 1) -> bool, to try to look ahead
to the next tuple if possible.  I suppose there could be a kind of
scrolling that actually consumes tuples (for true batch-mode tuple
processing in tight inner loops, for example hash table build), and
scrolling that merely peeks ahead (what I'm describing so far).  I'm
keen to hear other ideas about how that could look, because I know
that "vector" and "batch" mode tuple processing are ideas that people
have been bouncing around for a while.  Flame away.
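
To make that concrete, the probe side might end up looking something like
this (entirely hypothetical -- ExecScrollSlot() doesn't exist, and this
glosses over how the peeked tuple's keys would be evaluated in the
expression context):

        /*
         * Peek at the next outer tuple, if the slot implementation can
         * see that far ahead, and prefetch the bucket its key hashes to.
         */
        if (ExecScrollSlot(outerslot, 1))
        {
                uint32          next_hash;

                if (ExecHashGetHashValue(hashtable, econtext, hashkeys,
                                         true, false, &next_hash))
                        __builtin_prefetch(&hashtable->buckets.unshared[next_hash % hashtable->nbuckets]);
        }

        /* ... then probe with the current tuple as usual ... */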

POC patch attached.  I never got around to figuring out why it breaks
anti-joins (hence some regression test failures) or filling out
various other important details (see commit message for a list), but I
figured it was better on the mailing list than hiding in a drawer,
anyway.

Here is an example of times for a trivial join on my laptop.  Note
that this is prefetching only the probing phase, not while building
which should also be possible.  I didn't get around to trying deeper
prefetch pipelines as discussed earlier, but those seemed to produce
diminishing returns with hardcoded tests like in the earlier message.

shared_buffers = '3GB'

create table r as select generate_series(1, 40000000)::int i;
create table s as select generate_series(1, 10000000)::int i;
analyze;

set max_parallel_workers_per_gather = 0;
set work_mem = '2GB';
select pg_prewarm('r'::regclass);
select pg_prewarm('s'::regclass);

select count(*) from r join s using (i);

Master:  00:14.230
Patched: 00:11.818


--
Thomas Munro
https://enterprisedb.com

0001-Experimental-ExecScrollSlot-for-hash-join-prefetch.patch (21K)

Re: Experimenting with hash join prefetch

Robert Haas
On Wed, Apr 10, 2019 at 2:10 AM Thomas Munro <[hidden email]> wrote:
> Here is an example of times for a trivial join on my laptop.  Note
> that this is prefetching only the probing phase, not while building
> which should also be possible.  I didn't get around to trying deeper
> prefetch pipelines as discussed earlier, but those seemed to produce
> diminishing returns with hardcoded tests like in the earlier message.

It would be interesting to see how this does with moderately-long text
keys, say 32 or 64 byte strings, and with actually-long text keys, say
several kB, and then with gigantic text keys, say several MB.  At some
point the key probably gets large enough that computing the hash value
for the next key evicts the current key from the relevant CPU cache,
and if I had to guess, at that point prefetching will become a loser.
But I don't know where that point is.  If it turns out for example
that this technique is a winner for pass-by-value datatypes and a
loser for pass-by-reference datatypes, or that it's a winner always,
or some sort of simple rule like that, awesome!  But if it turns out
that there's no simple rule that we can use to know whether it wins or
loses, then that might make things a little tricky.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Experimenting with hash join prefetch

Thomas Munro
On Fri, Apr 12, 2019 at 1:35 AM Robert Haas <[hidden email]> wrote:

> It would be interesting to see how this does with moderately-long text
> keys, say 32 or 64 byte strings, and with actually-long text keys, say
> several kB, and then with gigantic text keys, say several MB.  At some
> point the key probably gets large enough that computing the hash value
> for the next key evicts the current key from the relevant CPU cache,
> and if I had to guess, at that point prefetching will become a loser.
> But I don't know where that point is.  If it turns out for example
> that this technique is a winner for pass-by-value datatypes and a
> loser for pass-by-reference datatypes, or that it's a winner always,
> or some sort of simple rule like that, awesome!  But if it turns out
> that there's no simple rule that we can use to know whether it wins or
> loses, then that might make things a little tricky.

OK, I ran the attached script on an E5-2695 v3 @ 2.30GHz with 32KB of
L1, 256KB of L2 and 30MB of L3 cache.  I used shared_buffers=16GB and
prewarmed all tables.

The short version: very mixed results.  For small hash tables it
clearly hurts, for large ones it looks promising.  Much more digging
required to draw any conclusions.

It'd be nice to understand exactly how the hidden parameters work.
Hash bucket array size vs hardware cache sizes must be a factor.
Another is surely timing; there is an optimal time to begin
prefetching (too soon and your line might be replaced by the time you
need it, too late and you won't avoid stalling).  Here, I am doing a
ham-fisted prefetch of t+1's bucket header and not yet trying to
prefetch the tuple itself, but the Intel/CMU paper[1] of course shows
a deeper pipeline and talks about calibrating the prefetch distance
for the hardware.  As far as I can see, that's quite tricky in
general, and complicated by our executor design: the number of cycles
before the node runs again depends on the higher-level plan!
Previously I have heard icache-based arguments for why we should be
able to emit more than one tuple at a time, and I suppose this is a
new argument for that: it makes it actually plausible to calibrate the
prefetch distance for "software-pipelined prefetch" techniques.
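
In batch-mode terms, the thing you'd want to calibrate looks something
like this (hypothetical sketch, supposing the executor could hand hash
join an array of precomputed hashes at a time):

#define PREFETCH_DISTANCE 8             /* would need per-hardware tuning */

static void
probe_batch(HashJoinTable hashtable, const uint32 *hashes, int ntuples)
{
        for (int i = 0; i < ntuples; i++)
        {
                /* issue the prefetch PREFETCH_DISTANCE probes early */
                if (i + PREFETCH_DISTANCE < ntuples)
                {
                        uint32          h = hashes[i + PREFETCH_DISTANCE];

                        __builtin_prefetch(&hashtable->buckets.unshared[h % hashtable->nbuckets]);
                }

                /* ... probe the bucket for hashes[i] and emit matches ... */
        }
}

With a whole batch in hand, PREFETCH_DISTANCE is decoupled from whatever
the higher-level plan does between calls, which is the point made above.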

Anyway, here are the results for what they are worth:

Table r has 50,000,000 rows.  Table s was tested with 3 different
sizes.  Both tables have the same layout: key (various types and
sizes) + 0, 2 or 4 extra columns.  I ran each test 3 times, and
compared the worst (w) and median (m) times and computed the speed-up
provided by the patch.  The absolute times (not shown) were all in the
range 9-40 seconds depending on the parameters.

            s=1,000              s=100,000            s=10,000,000
            ==================== ==================== ====================
int         w=-10.86%, m=-11.03% w=-8.56%, m=-9.17%   w=+16.79%, m=+19.63%
 + 2 cols   w=-14.19%, m=-16.97% w=-6.59%, m=-7.89%   w=+15.88%, m=+16.81%
 + 4 cols   w=-17.14%, m=-14.34% w=-10.01%, m=-8.85%  w=+37.38%, m=+24.01%
text(8)     w=-12.42%, m=-12.32% w=-4.04%, m=-1.96%   w=+13.52%, m=+16.17%
 + 2 cols   w=-13.50%, m=-14.98% w=-4.29%, m=-3.40%   w=+15.98%, m=+19.45%
 + 4 cols   w=-11.53%, m=-9.61%  w=-3.70%, m=-5.91%   w=+46.66%, m=+51.06%
text(16)    w=-11.46%, m=-9.71%  w=-4.85%, m=-3.86%   w=+16.47%, m=+22.10%
 + 2 cols   w=-13.97%, m=-14.55% w=-7.08%, m=-6.07%   w=+20.50%, m=+21.77%
 + 4 cols   w=-9.72%, m=-11.31%  w=-1.03%, m=-2.55%   w=+8.25%, m=+12.21%
text(32)    w=-14.86%, m=-15.48% w=-9.36%, m=-9.27%   w=+19.86%, m=+15.34%
 + 2 cols   w=-12.35%, m=-11.71% w=-10.61%, m=-9.87%  w=+98.29%, m=+97.40%
 + 4 cols   w=-10.71%, m=-10.40% w=-2.42%, m=-1.25%   w=-8.34%, m=-10.44%
text(64)    w=-9.45%, m=-11.36%  w=-13.94%, m=-11.42% w=+9.44%, m=+9.57%
 + 2 cols   w=-12.47%, m=-13.17% w=-9.60%, m=-6.61%   w=-4.69%, m=+10.06%
 + 4 cols   w=-9.47%, m=-12.20%  w=-5.60%, m=-3.55%   w=-15.91%, m=-23.29%

I'd expect the right-hand column to get a further speed-up (or
slow-down) of around 1/5 the given numbers, if we also prefetch during
the build phase (= s/r).  Maybe a 2-stage pipeline would help, though
I'm starting to see the complexity of organising a perfectly primed
memory pipeline ... i.e. what this topic is really all about.

Well, that's enough learning-basic-facts-about-computers by trying to
whack PostgreSQL with database literature for now.  Looks like I'll
have to work quite a bit harder to make something useful out of this.
I think I see where some of the problems lie.  I think being able to
store multiple tuples in a slot (via TTS-implementation-specific
means, as we see with the heap scan in my patch, and as I think hash
join would want to do to emit multiple tuples before relinquishing
control) and then look at them via ExecScrollSlot() and perhaps also
consume them via ExecNextSlot() are promising ideas, but I've only
scratched the surface.
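
For the record, the interface I'm imagining amounts to something like this
(names and signatures made up, of course):

/*
 * Peek n tuples ahead without consuming anything; returns false if the
 * slot's implementation can't see that far into the future.
 */
extern bool ExecScrollSlot(TupleTableSlot *slot, int n);

/*
 * Advance to the next tuple buffered in the slot, consuming the current
 * one; returns false when the internal batch is exhausted and the caller
 * must go back to the child node for more.
 */
extern bool ExecNextSlot(TupleTableSlot *slot);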

[1] http://www.cs.cmu.edu/~chensm/papers/hashjoin_tods_preliminary.pdf


--
Thomas Munro
https://enterprisedb.com

make_prefetch_test.py (2K)
process_test_results.py (2K)