On Fri, Apr 12, 2019 at 1:35 AM Robert Haas <[hidden email]> wrote:

> It would be interesting to see how this does with moderately-long text
> keys, say 32 or 64 byte strings, and with actually-long text keys, say
> several kB, and then with gigantic text keys, say several MB. At some
> point the key probably gets large enough that computing the hash value
> for the next key evicts the current key from the relevant CPU cache,
> and if I had to guess, at that point prefetching will become a loser.
> But I don't know where that point is. If it turns out for example
> that this technique is a winner for pass-by-value datatypes and a
> loser for pass-by-reference datatypes, or that it's a winner always,
> or some sort of simple rule like that, awesome! But if it turns out
> that there's no simple rule that we can use to know whether it wins or
> loses, then that might make things a little tricky.

Ok, I ran the attached script on an E5-2695 v3 @ 2.30GHz with 32K of
L1, 256K of L2, 30M of L3. I used shared_buffers=16GB and prewarmed
all tables.

The short version: very mixed results. For small hash tables it
clearly hurts, for large ones it looks promising. Much more digging
required to draw any conclusions.

It'd be nice to understand exactly how the hidden parameters work.
Hash bucket array size vs hardware cache sizes must be a factor.
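
(Back-of-the-envelope, assuming roughly one 8-byte bucket pointer per
row with nbuckets rounded up to a power of two: at s=10,000,000 the
bucket array alone is around 2^24 * 8 bytes = 128MB, far bigger than
the 30MB L3, so most probes should miss cache and prefetching has
latency to hide; at s=100,000 it's about 1MB and fits comfortably in
L2+L3; at s=1,000 it's a few kB sitting in L1, where a prefetch is
pure overhead. That lines up with the sign flip between the columns
below.)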

Another is surely timing; there is an optimal time to begin
prefetching (too soon and your line might be replaced by the time you
need it, too late and you won't avoid stalling). Here, I am doing a
ham-fisted prefetch of t+1's bucket header and not yet trying to
prefetch the tuple itself, but the Intel/CMU paper[1] of course shows
a deeper pipeline and talks about calibrating the prefetch distance
for the hardware. As far as I can see, that's quite tricky in
general, and complicated by our executor design: the number of cycles
before the node runs again depends on the higher-level plan!
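
For concreteness, the ham-fisted version is morally equivalent to the
standalone sketch below. It is only a simplified model, not the
actual patch: the types, probe_bucket() and the precomputed hash array
are all illustrative, and in the real executor the hash for t+1 is
only known once that tuple has been pulled from the child node.

    #include <stddef.h>

    typedef struct Tuple
    {
        struct Tuple *next;     /* bucket chain link */
        unsigned int  hash;     /* hash of this tuple's key */
    } Tuple;

    typedef struct Bucket
    {
        Tuple *chain;           /* head of this bucket's tuple chain */
    } Bucket;

    typedef struct HashTable
    {
        Bucket      *buckets;
        unsigned int mask;      /* nbuckets - 1; nbuckets a power of two */
    } HashTable;

    static inline Bucket *
    lookup_bucket(HashTable *ht, unsigned int hash)
    {
        return &ht->buckets[hash & ht->mask];
    }

    /* Walk one bucket's chain, counting hash matches. */
    static size_t
    probe_bucket(Bucket *bucket, unsigned int hash)
    {
        size_t matches = 0;

        for (Tuple *t = bucket->chain; t != NULL; t = t->next)
        {
            if (t->hash == hash)
                matches++;      /* real code would emit a join tuple */
        }
        return matches;
    }

    size_t
    probe_all(HashTable *ht, const unsigned int *hashes, size_t n)
    {
        size_t matches = 0;

        for (size_t i = 0; i < n; i++)
        {
            /*
             * Ask the CPU to start loading tuple i+1's bucket header
             * (GCC/Clang builtin) while we do useful work on tuple i,
             * hopefully hiding the cache miss latency.
             */
            if (i + 1 < n)
                __builtin_prefetch(lookup_bucket(ht, hashes[i + 1]));

            matches += probe_bucket(lookup_bucket(ht, hashes[i]),
                                    hashes[i]);
        }
        return matches;
    }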

Previously I have heard icache-based arguments for why we should be
able to emit more than one tuple at a time, and I suppose this is a
new argument for that: it makes it actually plausible to calibrate the
prefetch distance for "software-pipelined prefetch" techniques.

Anyway, here are the results for what they are worth:

Table r has 50,000,000 rows. Table s was tested with 3 different
sizes. Both tables have the same layout: key (various types and
sizes) + 0, 2 or 4 extra columns. I ran each test 3 times, and
compared the worst (w) and median (m) times and computed the speed-up
provided by the patch. The absolute times (not shown) were all in the
range 9-40 seconds, depending on the parameters.

            s=1,000                 s=100,000               s=10,000,000
            ====================    ====================    ====================
int         w=-10.86%, m=-11.03%    w=-8.56%, m=-9.17%      w=+16.79%, m=+19.63%
 + 2 cols   w=-14.19%, m=-16.97%    w=-6.59%, m=-7.89%      w=+15.88%, m=+16.81%
 + 4 cols   w=-17.14%, m=-14.34%    w=-10.01%, m=-8.85%     w=+37.38%, m=+24.01%
text(8)     w=-12.42%, m=-12.32%    w=-4.04%, m=-1.96%      w=+13.52%, m=+16.17%
 + 2 cols   w=-13.50%, m=-14.98%    w=-4.29%, m=-3.40%      w=+15.98%, m=+19.45%
 + 4 cols   w=-11.53%, m=-9.61%     w=-3.70%, m=-5.91%      w=+46.66%, m=+51.06%
text(16)    w=-11.46%, m=-9.71%     w=-4.85%, m=-3.86%      w=+16.47%, m=+22.10%
 + 2 cols   w=-13.97%, m=-14.55%    w=-7.08%, m=-6.07%      w=+20.50%, m=+21.77%
 + 4 cols   w=-9.72%, m=-11.31%     w=-1.03%, m=-2.55%      w=+8.25%, m=+12.21%
text(32)    w=-14.86%, m=-15.48%    w=-9.36%, m=-9.27%      w=+19.86%, m=+15.34%
 + 2 cols   w=-12.35%, m=-11.71%    w=-10.61%, m=-9.87%     w=+98.29%, m=+97.40%
 + 4 cols   w=-10.71%, m=-10.40%    w=-2.42%, m=-1.25%      w=-8.34%, m=-10.44%
text(64)    w=-9.45%, m=-11.36%     w=-13.94%, m=-11.42%    w=+9.44%, m=+9.57%
 + 2 cols   w=-12.47%, m=-13.17%    w=-9.60%, m=-6.61%      w=-4.69%, m=+10.06%
 + 4 cols   w=-9.47%, m=-12.20%     w=-5.60%, m=-3.55%      w=-15.91%, m=-23.29%

I'd expect the right-hand column to get a further speed-up (or
slow-down) of around 1/5 the given numbers, if we also prefetch during
the build phase (= s/r). Maybe a 2-stage pipeline would help, though
I'm starting to see the complexity of organising a perfectly primed
memory pipeline ... ie what this topic is really all about.
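
For what it's worth, the kind of deeper pipeline the paper describes
might look roughly like this on the probe side, reusing the
illustrative types from the sketch above (the same trick would apply
to the build-side insert loop). DIST is a made-up placeholder;
choosing it is exactly the calibration problem described earlier.

    #define DIST 4              /* illustrative prefetch distance */

    size_t
    probe_all_pipelined(HashTable *ht, const unsigned int *hashes,
                        size_t n)
    {
        size_t matches = 0;

        for (size_t i = 0; i < n; i++)
        {
            /* Stage 1: begin fetching the bucket header 2*DIST ahead. */
            if (i + 2 * DIST < n)
                __builtin_prefetch(lookup_bucket(ht,
                                                 hashes[i + 2 * DIST]));

            /*
             * Stage 2: the header DIST ahead has had DIST iterations
             * to arrive, so reading its chain pointer should be cheap
             * now; prefetch the first tuple in that chain.
             */
            if (i + DIST < n)
            {
                Tuple *first = lookup_bucket(ht, hashes[i + DIST])->chain;

                if (first != NULL)
                    __builtin_prefetch(first);
            }

            /* Stage 3: probe tuple i; both lines should be resident. */
            matches += probe_bucket(lookup_bucket(ht, hashes[i]),
                                    hashes[i]);
        }
        return matches;
    }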

Well, that's enough learning-basic-facts-about-computers by trying to
whack PostgreSQL with database literature for now. Looks like I'll
have to work quite a bit harder to make something useful out of this.

I think I see where some of the problems lie. I think being able to
store multiple tuples in a slot (via TTS-implementation-specific
means, as we see with the heap scan in my patch, and as I think hash
join would want to do to emit multiple tuples before relinquishing
control) and then look at them via ExecScrollSlot() and perhaps also
consume them via ExecNextSlot() are promising ideas, but I've only
scratched the surface.
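
To sketch the shape such an interface might take (hypothetical
signatures for illustration only, not necessarily what the patch
actually defines):

    #include "executor/tuptable.h"

    /*
     * Hypothetical multi-tuple slot interface; these signatures are
     * guesses for discussion, not the patch's actual definitions.
     */

    /*
     * Peek at the tuple 'offset' positions ahead of the slot's current
     * position without consuming anything, or NULL if it isn't
     * buffered.  A prefetching node could use this to look far enough
     * ahead to compute hashes and issue prefetches at the calibrated
     * distance.
     */
    extern TupleTableSlot *ExecScrollSlot(TupleTableSlot *slot,
                                          int offset);

    /*
     * Advance to and return the next tuple buffered in the slot, or
     * NULL when the batch is exhausted and the producer must run
     * again.
     */
    extern TupleTableSlot *ExecNextSlot(TupleTableSlot *slot);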

[1] http://www.cs.cmu.edu/~chensm/papers/hashjoin_tods_preliminary.pdf

--
Thomas Munro
https://enterprisedb.com