Hybrid Hash/Nested Loop joins and caching results from subplans


Hybrid Hash/Nested Loop joins and caching results from subplans

David Rowley
Hackers,

Over on [1], Heikki mentioned the usefulness of caching results
from parameterized subplans so that they could be reused by
subsequent scans that have the same parameters as a previous scan.
On [2], I mentioned that parameterized nested loop joins could see
similar gains with such a cache. I suggested there that, instead of
adding code that only allows this to work for subplans, we add a new
node type that can handle the caching for us.  We can then inject
that node type in places where it seems beneficial.

I've attached a patch which implements this.  The new node type is
called "Result Cache".  I'm not particularly wedded to keeping that
name, but if I change it, I only want to do it once. I've got a few
other names in mind, but I don't feel strongly or confident enough in
them to go and do the renaming.

How the caching works:

First off, it's only good for plugging in on top of parameterized
nodes that are rescanned with different parameters. The cache itself
is a hash table built on the simplehash.h implementation.  The memory
consumption is limited to work_mem. The code maintains an LRU list, and
when we need to add new entries but don't have enough space to do so,
we free older items starting at the head of the LRU list.  When we
get a cache hit, we move that entry to the tail of the LRU list so that
it'll be the last to be evicted.
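
To make that concrete, here's a minimal sketch of the eviction scheme
using PostgreSQL's dlist API; the struct and function names are
illustrative only, not the patch's actual code:

#include "lib/ilist.h"

typedef struct CacheEntry
{
    dlist_node  lru_node;   /* position in the LRU list */
    Size        mem_used;   /* memory consumed by this entry */
    /* ... hash key and cached tuples ... */
} CacheEntry;

/*
 * Evict least-recently-used entries until 'needed' more bytes fit under
 * 'mem_limit' (i.e. work_mem).
 */
static void
cache_evict_for(dlist_head *lru_list, Size *mem_used, Size mem_limit,
                Size needed)
{
    while (*mem_used + needed > mem_limit && !dlist_is_empty(lru_list))
    {
        CacheEntry *victim = dlist_container(CacheEntry, lru_node,
                                             dlist_pop_head_node(lru_list));

        *mem_used -= victim->mem_used;
        /* ... also remove 'victim' from the hash table and free it ... */
    }
}

/* On a cache hit, move the entry to the tail so it's evicted last. */
static void
cache_touch(dlist_head *lru_list, CacheEntry *entry)
{
    dlist_delete(&entry->lru_node);
    dlist_push_tail(lru_list, &entry->lru_node);
}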

When should we cache:

For nested loop joins, the decision is made purely based on cost.  The
costing model looks at the expected number of calls, the distinct
value estimate and work_mem size. It then determines how many items
can be cached and then goes on to estimate an expected cache hit ratio
and also an eviction ratio. It adjusts the input costs according to
those ratios and adds some additional charges for caching and cache
lookups.
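
The rough shape of that calculation, purely as an illustration (the
patch's actual formulas are more elaborate, and this assumes parameter
values arrive in a uniformly random order):

/*
 * Illustrative only: estimate the cache hit ratio from the expected
 * number of lookups, the ndistinct estimate for the cache key, and the
 * number of entries we expect to fit in work_mem.
 */
static double
estimate_hit_ratio(double calls, double ndistinct, double est_entries)
{
    double      hit_ratio;

    if (calls <= 0.0 || ndistinct <= 0.0)
        return 0.0;

    if (est_entries >= ndistinct)
    {
        /* everything fits: only the first lookup per value is a miss */
        hit_ratio = (calls - ndistinct) / calls;
    }
    else
    {
        /*
         * Only a fraction of the distinct values can be held at once;
         * scale the best-case ratio down by that fraction.  The rest of
         * the lookups become misses, many of them causing evictions.
         */
        hit_ratio = (est_entries / ndistinct) * (calls - ndistinct) / calls;
    }

    return Max(hit_ratio, 0.0);
}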

For subplans, since we plan subplans before we're done planning the
outer plan, there's very little information to go on about the number
of times that the cache will be looked up. For now, I've coded things
so the cache is always used for EXPR_SUBLINK type subplans.  There may
be other types of subplan that could support caching too, but I've not
really gone through them all yet to determine which.  I certainly know
there are some that we can't cache for.

Why caching might be good:

With hash joins, it's sometimes not so great that we have to hash the
entire inner plan and only probe a very small number of values.  If we
were able to fill the hash table with only the values that are needed,
then a lot of time and memory could be saved.  Effectively, the
patch does exactly this with the combination of a parameterized nested
loop join with a Result Cache node above the inner scan.

For subplans, the gains can be even greater, because subplans are often
much more expensive to execute than what might appear on the inner side
of a parameterized nested loop join.

Current problems and some ways to make it better:

The patch does rely heavily on good ndistinct estimates. One
unfortunate problem is that if the planner has no statistics for
whatever it's trying to estimate for, it'll default to returning
DEFAULT_NUM_DISTINCT (200).  That may cause the Result Cache to appear
much more favourable than it should.  One way I can think of to work
around that would be to have another function similar to
estimate_num_groups() which accepts a default value which it will
return if it was unable to find statistics to use.  In this case, such
a function could just be called passing the number of input rows as
the default, which would make the costing code think each value is
unique, which would not be favourable for caching.  I've not done
anything like that in what I've attached here.  That solution would
also do nothing if the ndistinct estimate was available, but was just
incorrect, as it often is.
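
A sketch of that idea; estimate_num_groups_extended() here is a
hypothetical variant that reports whether it fell back to the default,
not an existing function:

static double
estimate_num_groups_or_default(PlannerInfo *root, List *groupExprs,
                               double input_rows, double default_estimate)
{
    bool        used_default;
    double      ndistinct;

    /* hypothetical variant reporting a fallback to DEFAULT_NUM_DISTINCT */
    ndistinct = estimate_num_groups_extended(root, groupExprs, input_rows,
                                             &used_default);

    /*
     * For Result Cache costing, the caller would pass input_rows as
     * default_estimate, making every value look unique and therefore
     * making caching look unattractive when stats are missing.
     */
    return used_default ? default_estimate : ndistinct;
}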

There are currently a few compiler warnings with the patch due to the
scope of the simplehash.h hash table. Because the scope is static
rather than extern, there are a load of unused-function warnings.  I'm
not sure yet of the best way to deal with this. I don't want to change the
scope to extern just to keep compilers quiet.

Also during cache_reduce_memory(), I'm performing a hash table lookup
followed by a hash table delete. I already have the entry to delete,
but there's no simplehash.h function that allows deletion by element
pointer, only by key. This wastes a hash table lookup. I'll likely
make an adjustment to the simplehash.h code to export the delete code
as a separate function to fix this.
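
In calling-pattern terms, and assuming a simplehash table generated with
SH_PREFIX resultcache, the fix would look something like the following,
where resultcache_delete_item() is the function the adjustment would
export:

static void
cache_remove_entry(resultcache_hash *hashtable, ResultCacheKey *key)
{
    ResultCacheEntry *entry = resultcache_lookup(hashtable, key);

    if (entry == NULL)
        return;

    /* ... free the entry's cached tuples, unlink it from the LRU list ... */

    /*
     * resultcache_delete(hashtable, key) would re-hash and re-probe here;
     * deleting via the pointer we already hold skips that second lookup.
     */
    resultcache_delete_item(hashtable, entry);
}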

Demo:

# explain (analyze, costs off) select relname,(select count(*) from
pg_class c2 where c1.relkind = c2.relkind) from pg_class c1;
                                       QUERY PLAN
----------------------------------------------------------------------------------------
 Seq Scan on pg_class c1 (actual time=0.069..0.470 rows=391 loops=1)
   SubPlan 1
     ->  Result Cache (actual time=0.001..0.001 rows=1 loops=391)
           Cache Key: c1.relkind
           Cache Hits: 387  Cache Misses: 4  Cache Evictions: 0  Cache Overflows: 0
           ->  Aggregate (actual time=0.062..0.062 rows=1 loops=4)
                 ->  Seq Scan on pg_class c2 (actual time=0.007..0.056 rows=98 loops=4)
                       Filter: (c1.relkind = relkind)
                       Rows Removed by Filter: 293
 Planning Time: 0.047 ms
 Execution Time: 0.536 ms
(11 rows)

# set enable_resultcache=0; -- disable result caching
SET
# explain (analyze, costs off) select relname,(select count(*) from
pg_class c2 where c1.relkind = c2.relkind) from pg_class c1;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Seq Scan on pg_class c1 (actual time=0.070..24.619 rows=391 loops=1)
   SubPlan 1
     ->  Aggregate (actual time=0.062..0.062 rows=1 loops=391)
           ->  Seq Scan on pg_class c2 (actual time=0.009..0.056 rows=120 loops=391)
                 Filter: (c1.relkind = relkind)
                 Rows Removed by Filter: 271
 Planning Time: 0.042 ms
 Execution Time: 24.653 ms
(8 rows)

-- Demo with parameterized nested loops
create table hundredk (hundredk int, tenk int, thousand int, hundred
int, ten int, one int);
insert into hundredk select x%100000,x%10000,x%1000,x%100,x%10,1 from
generate_Series(1,100000) x;
create table lookup (a int);
insert into lookup select x from generate_Series(1,100000)x,
generate_Series(1,100);
create index on lookup(a);
vacuum analyze lookup, hundredk;

# explain (analyze, costs off) select count(*) from hundredk hk inner
join lookup l on hk.thousand = l.a;
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Aggregate (actual time=1876.710..1876.710 rows=1 loops=1)
   ->  Nested Loop (actual time=0.013..1371.690 rows=9990000 loops=1)
         ->  Seq Scan on hundredk hk (actual time=0.005..8.451 rows=100000 loops=1)
         ->  Result Cache (actual time=0.000..0.005 rows=100 loops=100000)
               Cache Key: hk.thousand
               Cache Hits: 99000  Cache Misses: 1000  Cache Evictions: 0  Cache Overflows: 0
               ->  Index Only Scan using lookup_a_idx on lookup l (actual time=0.002..0.011 rows=100 loops=1000)
                     Index Cond: (a = hk.thousand)
                     Heap Fetches: 0
 Planning Time: 0.113 ms
 Execution Time: 1876.741 ms
(11 rows)

# set enable_resultcache=0;
SET
# explain (analyze, costs off) select count(*) from hundredk hk inner
join lookup l on hk.thousand = l.a;
                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Aggregate (actual time=2401.351..2401.352 rows=1 loops=1)
   ->  Merge Join (actual time=28.412..1890.905 rows=9990000 loops=1)
         Merge Cond: (l.a = hk.thousand)
         ->  Index Only Scan using lookup_a_idx on lookup l (actual time=0.005..10.170 rows=99901 loops=1)
               Heap Fetches: 0
         ->  Sort (actual time=28.388..576.783 rows=9990001 loops=1)
               Sort Key: hk.thousand
               Sort Method: quicksort  Memory: 7760kB
               ->  Seq Scan on hundredk hk (actual time=0.005..11.039 rows=100000 loops=1)
 Planning Time: 0.123 ms
 Execution Time: 2401.379 ms
(11 rows)

Cache Overflows:

You might have noticed "Cache Overflow" in the EXPLAIN ANALYZE output.
This happens if a single scan of the inner node exhausts the cache
memory. In this case, all the other entries will already have been
evicted in an attempt to make space for the current scan's tuples.
However, if we see an overflow then the size of the results from a
single scan alone must have exceeded work_mem.  There might be some
tweaking to do here, as it seems a shame that a single overly large
scan would flush the entire cache. I doubt it would be too hard to
limit the flushing to some percentage of work_mem, similar to how
large seq scans don't entirely flush shared_buffers.
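
Something of this shape, say, where the ratio and the helper functions
are all hypothetical:

#define CACHE_FLUSH_RATIO   0.25    /* illustrative value */

/*
 * Hypothetical tweak: free 'needed' bytes for the current scan's tuples,
 * but refuse to evict more than a fixed fraction of work_mem for any one
 * scan, so a single huge scan can't flush the whole cache.
 */
static bool
cache_make_space(ResultCacheState *rcstate, Size needed)
{
    Size        freed = 0;
    Size        flush_limit = (Size) (rcstate->mem_limit * CACHE_FLUSH_RATIO);

    while (rcstate->mem_used + needed > rcstate->mem_limit)
    {
        if (freed >= flush_limit || cache_is_empty(rcstate))
            return false;   /* give up; caller records an overflow */

        freed += evict_lru_entry(rcstate);  /* returns bytes released */
    }

    return true;
}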

Current Status:

I've spent quite a bit of time getting this working. I'd like to take
a serious go at making this happen for PG14.  For now, it all seems to
work. I have some concerns about bad statistics causing nested loop
joins to be favoured more than they were previously, due to the result
cache further lowering their cost when the cache hit ratio is thought
to be high.

For now, the node type is parallel_safe, but not parallel_aware. I can
see that a parallel_aware version would be useful, but I've not done
that here. Anything in that area will not be part of my initial
effort.  The unfortunate part is that the actual hit ratio will drop
with more parallel workers, since each worker's cache is separate.

Some tests show a 10x speedup on TPC-H Q2.

I'm interested in getting feedback on this before doing much further work on it.

Does it seem like something we might want for PG14?

David

[1] https://www.postgresql.org/message-id/daceb327-9a20-51f4-fe6c-60b898692305%40iki.fi
[2] https://www.postgresql.org/message-id/CAKJS1f8oNXQ-LqjK%3DBOFDmxLc_7s3uFr_g4qi7Ncrjig0JOCiA%40mail.gmail.com

Attachment: resultcache_2020-05-20.patch.bz2 (39K)

Re: Hybrid Hash/Nested Loop joins and caching results from subplans

Simon Riggs
On Wed, 20 May 2020 at 12:44, David Rowley <[hidden email]> wrote:
> Hackers,
>
> Over on [1], Heikki mentioned the usefulness of caching results
> from parameterized subplans so that they could be reused by
> subsequent scans that have the same parameters as a previous scan.
> On [2], I mentioned that parameterized nested loop joins could see
> similar gains with such a cache. I suggested there that, instead of
> adding code that only allows this to work for subplans, we add a new
> node type that can handle the caching for us.  We can then inject
> that node type in places where it seems beneficial.

Very cool
 
> I've attached a patch which implements this.  The new node type is
> called "Result Cache".  I'm not particularly wedded to keeping that
> name, but if I change it, I only want to do it once. I've got a few
> other names in mind, but I don't feel strongly or confident enough in
> them to go and do the renaming.
>
> How the caching works:
>
> First off, it's only good for plugging in on top of parameterized
> nodes that are rescanned with different parameters. The cache itself
> is a hash table built on the simplehash.h implementation.  The memory
> consumption is limited to work_mem. The code maintains an LRU list, and
> when we need to add new entries but don't have enough space to do so,
> we free older items starting at the head of the LRU list.  When we
> get a cache hit, we move that entry to the tail of the LRU list so that
> it'll be the last to be evicted.
>
> When should we cache:
>
> For nested loop joins, the decision is made purely based on cost.

I thought the main reason to do this was the case when the nested loop subplan was significantly underestimated and we realize during execution that we should have built a hash table. So including this based on cost alone seems to miss a trick.
 
> The patch does rely heavily on good ndistinct estimates.

Exactly. We know we seldom get those with many-way joins.

So +1 for adding this technique. My question is whether it should be added as an optional facility of a parameterised sub plan, rather than an always-needed full-strength node. That way the choice of whether to use it can happen at execution time once we notice that we've been called too many times.

--
Simon Riggs                http://www.2ndQuadrant.com/
Mission Critical Databases

Re: Hybrid Hash/Nested Loop joins and caching results from subplans

David Rowley
On Thu, 21 May 2020 at 00:56, Simon Riggs <[hidden email]> wrote:
> I thought the main reason to do this was the case when the nested loop subplan was significantly underestimated and we realize during execution that we should have built a hash table. So including this based on cost alone seems to miss a trick.

Isn't that mostly because the planner tends to choose a
non-parameterized nested loop when it thinks the outer side of the
join has just 1 row?  If so, I'd say that's a separate problem as
Result Cache only deals with parameterized nested loops.  Perhaps the
problem you mention could be fixed by adding some "uncertainty degree"
to the selectivity estimate function and have it return that along
with the selectivity.  We'd likely not want to choose an
unparameterized nested loop when the uncertainty level is high.
Multiplying together different selectivity estimates could raise the
uncertainty level by a magnitude.

For plans where the planner chooses to use a non-parameterized nested
loop due to having just 1 row on the outer side of the loop, it's
taking a huge risk. Putting the 1 row on the inner side of a hash join
would barely cost anything extra during execution.  Hashing 1 row is
pretty cheap, and performing a lookup on that hashed row is not much
more expensive than evaluating the qual of the nested loop; it really
just requires the additional hash function calls.  Having
the uncertainty degree I mentioned above would allow us to only have
the planner do that when the uncertainty degree indicates it's not
worth the risk.
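
Nothing like this exists in the planner today, but as a hand-wavy
sketch of what carrying such an uncertainty degree around might look
like:

/* Hypothetical: a selectivity paired with a degree of uncertainty */
typedef struct SelEstimate
{
    Selectivity sel;            /* estimated fraction of rows */
    double      uncertainty;    /* 0 = backed by stats; higher = riskier */
} SelEstimate;

/*
 * Combining estimates for ANDed quals compounds the uncertainty, so a
 * many-way join would accumulate a high value, and the join path code
 * could then refuse to build an unparameterized nested loop whenever
 * the outer side's row estimate carries too much uncertainty.
 */
static SelEstimate
sel_combine_and(SelEstimate a, SelEstimate b)
{
    SelEstimate result;

    result.sel = a.sel * b.sel;
    result.uncertainty = a.uncertainty + b.uncertainty + 1.0;
    return result;
}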

David



Re: Hybrid Hash/Nested Loop joins and caching results from subplans

Andy Fan
In reply to this post by Simon Riggs



> My question is whether it should be added as an optional facility of a parameterised sub plan, rather than an always-needed full-strength node. That way the choice of whether to use it can happen at execution time once we notice that we've been called too many times.

 
Actually, I am not sure what "parameterized sub plan" means (I treat it
as a SubPlan node), so please correct me if I misunderstand you :)  The
inner plan in a nested loop is not actually a SubPlan node, so if we bind
the facility to the SubPlan node, we may lose the chance to use it for
nested loops.  And when we consider the usage for nested loops, we can
consider the example below, where this feature will be more powerful.


select j1o.i, j2_v.sum_5
from j1 j1o
inner join lateral
(select im100, sum(im5) as sum_5
from j2
where j1o.im100 = im100
and j1o.i = 1
group by im100) j2_v
on true
where j1o.i = 1;

--
Best Regards
Andy Fan

Re: Hybrid Hash/Nested Loop joins and caching results from subplans

David Rowley
On Fri, 22 May 2020 at 12:12, Andy Fan <[hidden email]> wrote:
> Actually, I am not sure what "parameterized sub plan" means (I treat it
> as a SubPlan node), so please correct me if I misunderstand you :)  The
> inner plan in a nested loop is not actually a SubPlan node, so if we bind
> the facility to the SubPlan node, we may lose the chance to use it for
> nested loops.

A parameterized subplan would be a subquery that contains a column
reference to a query level above its own. The executor changes that
column reference into a parameter, and the subquery must be rescanned
each time the parameter's value changes.

> And when we consider the usage for nested loops, we can consider the
> example below, where this feature will be more powerful.

I didn't quite get the LATERAL support done in the version I sent. For
now, I'm not considering adding a Result Cache node if there are lateral
vars in any location other than the inner side of the nested loop join.
I think it'll just take a few lines to make it work, though.  I wanted
to get some feedback before going to too much more trouble to make all
cases work.

I've now added this patch to the first commitfest of PG14.

David



Re: Hybrid Hash/Nested Loop joins and caching results from subplans

Andy Fan
Today I tested the correctness & performance of this patch on the TPC-H
workload; the environment is set up based on [1]. Correctness was tested
by storing the results into another table before this feature was
introduced, then enabling the feature and comparing the results with the
original ones. No issue was found at this stage.

I also checked the performance gain for the TPC-H workload; in total, 4
out of the 22 queries use this new path, 3 of them via a subplan and 1
via a nestloop. All of the changes get a better result. You can check the
attachments for reference: normal.log is the data without this feature,
patched.log is the data with the feature. The data doesn't show the 10x
performance gain; I think that's mainly related to the data size.

At the code level, I mainly checked the nestloop path and
cost_resultcache_rescan, and everything looks good to me. I'd like to
check the other parts in the following days.

--
Best Regards
Andy Fan

Attachments:
  patched.log (105K)
  normal.log (103K)

Re: Hybrid Hash/Nested Loop joins and caching results from subplans

David Rowley
On Tue, 2 Jun 2020 at 21:05, Andy Fan <[hidden email]> wrote:
> Today I tested the correctness & performance of this patch on the TPC-H
> workload; the environment is set up based on [1]. Correctness was tested
> by storing the results into another table before this feature was
> introduced, then enabling the feature and comparing the results with the
> original ones. No issue was found at this stage.

Thank you for testing it out.

> I also checked the performance gain for the TPC-H workload; in total, 4
> out of the 22 queries use this new path, 3 of them via a subplan and 1
> via a nestloop. All of the changes get a better result. You can check the
> attachments for reference: normal.log is the data without this feature,
> patched.log is the data with the feature. The data doesn't show the 10x
> performance gain; I think that's mainly related to the data size.

Thanks for running those tests.  I had a quick look at the results and
I think to say that all 4 are better is not quite right. One is
actually a tiny bit slower and one is only faster due to a plan
change.  Here's my full analysis.

Q2 uses a result cache for the subplan and has about a 37.5% hit ratio
which reduces the execution time of the query down to 67% of the
original.
Q17 uses a result cache for the subplan and has about a 96.5% hit
ratio which reduces the execution time of the query down to 24% of the
original time.
Q18 uses a result cache for 2 x nested loop joins and has a 0% hit
ratio.  The execution time is reduced to 91% of the original time only
because the planner uses a different plan, which just happens to be
faster by chance.
Q20 uses a result cache for the subplan and has a 0% hit ratio.  The
execution time is 100.27% of the original time. There are 8620 cache
misses.
All other queries use the same plan with and without the patch.

> At the code level, I mainly checked the nestloop path and
> cost_resultcache_rescan, and everything looks good to me. I'd like to
> check the other parts in the following days.

Great.

> [1] https://ankane.org/tpc-h



Re: Hybrid Hash/Nested Loop joins and caching results from subplans

Andy Fan

> Thanks for running those tests.  I had a quick look at the results and
> I think to say that all 4 are better is not quite right. One is
> actually a tiny bit slower and one is only faster due to a plan
> change.


Yes..  Thanks for pointing it out. 


> Q18 uses a result cache for 2 x nested loop joins and has a 0% hit
> ratio.  The execution time is reduced to 91% of the original time only
> because the planner uses a different plan, which just happens to be
> faster by chance.
>
> Q20 uses a result cache for the subplan and has a 0% hit ratio.  The
> execution time is 100.27% of the original time. There are 8620 cache
> misses.

Looks like the case here is a statistics issue or a cost model issue.
I'd like to check more on that.  But before that, I've uploaded the
steps[1] I used in case you want to reproduce it locally.


--
Best Regards
Andy Fan

Re: Hybrid Hash/Nested Loop joins and caching results from subplans

Andy Fan


On Wed, Jun 3, 2020 at 10:36 AM Andy Fan <[hidden email]> wrote:

> > Thanks for running those tests.  I had a quick look at the results and
> > I think to say that all 4 are better is not quite right. One is
> > actually a tiny bit slower and one is only faster due to a plan
> > change.


> Yes..  Thanks for pointing it out.


> > Q18 uses a result cache for 2 x nested loop joins and has a 0% hit
> > ratio.  The execution time is reduced to 91% of the original time only
> > because the planner uses a different plan, which just happens to be
> > faster by chance.

This case should be caused by a wrong row estimate on the condition
o_orderkey in (select l_orderkey from lineitem group by l_orderkey having
sum(l_quantity) > 312).  The estimate is 123766 rows, but the actual
number is 10 rows.  This estimation is hard, and I don't think we should
address that issue in this patch.


> > Q20 uses a result cache for the subplan and has a 0% hit ratio.  The
> > execution time is 100.27% of the original time. There are 8620 cache
> > misses.


This is by design in the current implementation.

> For subplans, since we plan subplans before we're done planning the
> outer plan, there's very little information to go on about the number
> of times that the cache will be looked up. For now, I've coded things
> so the cache is always used for EXPR_SUBLINK type subplans.

I first tried to see if we could have a row estimate before the subplan
is created, and it looks very complex.  The subplan is created during
preprocess_qual_conditions; at that time, we haven't even created the
base RelOptInfo, to say nothing of the join rel, whose row estimation
happens much later.

Then I looked at whether we can delay the cache decision until we have
the row estimate; ExecInitSubPlan may be a candidate.  At that point we
can't add a new ResultCache node, but we could add a cost-based cache
function to the SubPlan node.  However, the number of distinct values for
the parameterized variable can't be calculated there, which I leave as an
open issue.

--
Best Regards
Andy Fan

Re: Hybrid Hash/Nested Loop joins and caching results from subplans

David Rowley
On Fri, 12 Jun 2020 at 16:10, Andy Fan <[hidden email]> wrote:

> I first tried to see if we could have a row estimate before the subplan
> is created, and it looks very complex.  The subplan is created during
> preprocess_qual_conditions; at that time, we haven't even created the
> base RelOptInfo, to say nothing of the join rel, whose row estimation
> happens much later.
>
> Then I looked at whether we can delay the cache decision until we have
> the row estimate; ExecInitSubPlan may be a candidate.  At that point we
> can't add a new ResultCache node, but we could add a cost-based cache
> function to the SubPlan node.  However, the number of distinct values for
> the parameterized variable can't be calculated there, which I leave as an
> open issue.

I don't really like the idea of stuffing this feature into some
existing node type.  Doing so would seem pretty magical when looking
at an EXPLAIN ANALYZE. There is of course overhead to pulling tuples
through an additional node in the plan, but if you use that as an
argument then there's some room to argue that we should only have 1
executor node type to get rid of that overhead.

Tom mentioned in [1] that he's reconsidering his original thoughts on
leaving the AlternativeSubPlan selection decision until execution
time.  If that were done late in planning, as Tom mentioned, then it
would be possible to give a more accurate cost to the Result Cache as
we'd have built the outer plan by that time and would be able to
estimate the number of distinct calls to the correlated subplan. As
that feature stands today, we'd be unable to delay making the decision
until execution time, as we don't have the required details to know how
many distinct calls there will be to the Result Cache node.

For now, I'm planning on changing things around a little in the Result
Cache node to allow faster deletions from the cache.  As of now, we
must perform 2 hash lookups to perform a single delete.  This is
because we must perform a lookup to fetch the entry using the key from
the MRU list, then an additional lookup in the hash delete code.  I plan
on changing the hash delete code to expose another function that
allows us to delete an item directly if we've already looked it up.
This should make a small reduction in the overheads of the node.
Perhaps if the overhead is very small (say < 1%) when the cache is of
no use then it might not be such a bad thing to just have a Result
Cache for correlated subplans regardless of estimates. With the TPCH
Q20 test, it appeared as if the overhead was 0.27% for that particular
subplan. A simpler subplan would execute more quickly, resulting in
the Result Cache overhead being a more significant portion of the
overall subquery execution. I'd need to perform a worst-case overhead
test to get an indication of what the percentage is.

David

[1] https://www.postgresql.org/message-id/1992952.1592785225@...



Re: Hybrid Hash/Nested Loop joins and caching results from subplans

David Rowley
On Tue, 30 Jun 2020 at 11:57, David Rowley <[hidden email]> wrote:

> For now, I'm planning on changing things around a little in the Result
> Cache node to allow faster deletions from the cache.  As of now, we
> must perform 2 hash lookups to perform a single delete.  This is
> because we must perform the lookup to fetch the entry from the MRU
> list key, then an additional lookup in the hash delete code.  I plan
> on changing the hash delete code to expose another function that
> allows us to delete an item directly if we've already looked it up.
> This should make a small reduction in the overheads of the node.
> Perhaps if the overhead is very small (say < 1%) when the cache is of
> no use then it might not be such a bad thing to just have a Result
> Cache for correlated subplans regardless of estimates. With the TPCH
> Q20 test, it appeared as if the overhead was 0.27% for that particular
> subplan. A more simple subplan would execute more quickly resulting
> the Result Cache overhead being a more significant portion of the
> overall subquery execution. I'd need to perform a worst-case overhead
> test to get an indication of what the percentage is.

I made the changes that I mentioned to speed up the cache deletes.  The
patch is now in 3 parts. The first two parts are additional work, and
the final part is the existing work with some small tweaks.

0001: Alters estimate_num_groups() to allow it to pass back a flags
variable to indicate if the estimate used DEFAULT_NUM_DISTINCT.  The
idea here is to try and avoid using a Result Cache for a Nested Loop
join when the statistics are likely to be unreliable. Because
DEFAULT_NUM_DISTINCT is 200, if we estimate that number of distinct
values then a Result Cache is likely to look highly favourable in some
situations where it very well may not be.  I've not given this patch a
huge amount of thought, but so far I don't see anything too
unreasonable about it. I'm prepared to be wrong about that though.

0002: Makes some adjustments to simplehash.h to expose a function which
allows direct deletion of a hash table element when we already have a
pointer to the bucket. I think this is a pretty good change as it
reuses more simplehash.h code than without the patch.

0003: The result cache code itself.  I've done another pass over this
version, fixed a few typos, and added a few comments.  I've not yet
added support for LATERAL joins; I plan to do that soon. For now, I
just wanted to get something out there, as I saw that the patch needed
to be rebased.

I did end up testing the overheads of having a Result Cache node on a
very simple subplan that'll never see a cache hit. The overhead is
quite a bit more than the 0.27% that we saw with TPCH Q20.

Using a query that gets zero cache hits:

$ cat bench.sql
select relname,(select oid from pg_class c2 where c1.oid = c2.oid)
from pg_Class c1 offset 1000000000;

enable_resultcache = on:

$ pgbench -n -f bench.sql -T 60 postgres
latency average = 0.474 ms
tps = 2110.431529 (including connections establishing)
tps = 2110.503284 (excluding connections establishing)

enable_resultcache = off:

$ pgbench -n -f bench.sql -T 60 postgres
latency average = 0.379 ms
tps = 2640.534303 (including connections establishing)
tps = 2640.620552 (excluding connections establishing)

Which is about a 25% overhead in this very simple case.  With more
complex subqueries that overhead will drop significantly, but for that
simple one, it does seem quite a bit too high to be adding a Result
Cache unconditionally for all correlated subqueries.  I think based on
that it's worth looking into the AlternativeSubPlan option that I
mentioned earlier.

I've attached the v2 patch series.

David

Attachments:
  v2-0001-Allow-estimate_num_groups-to-pass-back-further-de.patch (12K)
  v2-0002-Allow-users-of-simplehash.h-to-perform-direct-del.patch (6K)
  v2-0003-Add-Result-Cache-executor-node.patch (208K)

Re: Hybrid Hash/Nested Loop joins and caching results from subplans

David Rowley
On Thu, 2 Jul 2020 at 22:57, David Rowley <[hidden email]> wrote:
> I've attached the v2 patch series.

There was a bug in v2 that caused the caching not to work properly
when a unique join skipped to the next outer row after finding the
first match.  The cache was not correctly marked as complete in that
case. Normally we only mark a cache entry complete when we read the
scan to completion; unique joins are a special case where we can mark
it as complete early.
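
Roughly, the fixed logic looks like this sketch, compressed into a fill
loop for brevity and using hypothetical names (cache_fill_entry,
cache_store_tuple, singlerow); the real executor code hands tuples back
one at a time:

static void
cache_fill_entry(ResultCacheState *rcstate, CacheEntry *entry)
{
    for (;;)
    {
        TupleTableSlot *slot = ExecProcNode(outerPlanState(rcstate));

        if (TupIsNull(slot))
        {
            /* inner scan read to completion: the entry holds all tuples */
            entry->complete = true;
            break;
        }

        cache_store_tuple(entry, slot); /* hypothetical helper */

        /*
         * Unique join: the caller skips to the next outer row after this
         * first match, so no further tuples will ever be requested for
         * these parameters.  The v2 bug was failing to mark the entry
         * complete here.
         */
        if (rcstate->singlerow)
        {
            entry->complete = true;
            break;
        }
    }
}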

I've also made a few more changes to reduce the size of the
ResultCacheEntry struct, taking it from 40 bytes down to 24.  That
matters quite a bit when the cached tuple is very narrow.  Because we
can now fit more entries in the cache, one of the tests in
resultcache.out reports a 15% increase in cache hits.

I also improved the costing regarding the estimate of how many cache
entries we can fit in work_mem.  Previously I was not accounting for
the size of the cache data structures in memory; v2 only accounted for
the tuples themselves. It's important to count these because, if we
don't, the costing could think we can fit more entries than we actually
can, which would throw off the estimated number of cache evictions and
could result in preferring a result cache plan when we perhaps
shouldn't.
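
The corrected estimate has roughly this shape; the terms here are
illustrative rather than the patch's exact accounting:

static double
estimate_cache_entries(double tuples_per_scan, int inner_width,
                       Size key_bytes, Size work_mem_bytes)
{
    double      tuple_bytes;
    double      entry_bytes;

    /* bytes for the cached tuples themselves (all that v2 counted) */
    tuple_bytes = tuples_per_scan *
        (MAXALIGN(inner_width) + MAXALIGN(SizeofHeapTupleHeader));

    /* plus the cache's own bookkeeping for each entry */
    entry_bytes = tuple_bytes +
        sizeof(ResultCacheEntry) +  /* entry header and LRU links */
        sizeof(ResultCacheKey) +    /* key header */
        key_bytes;                  /* parameter datums stored in the key */

    return floor(work_mem_bytes / entry_bytes);
}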

I've attached v4.

I've also attached a bunch of benchmark results which were based on v3
of the patch. I didn't send out v3, but the results of v4 should be
almost the same for this test.  The script to run the benchmark is
contained in the resultcachebench.txt file.  The benchmark just mocks
up a "parts" table and a "sales" table.  The parts table has 1 million
rows in the 1 million test, as does the sales table. This goes up to
10 million and 100 million in the other two tests.  What varies with
each bar in the chart is the number of distinct parts in the sales
table. I just started with 1 part then doubled that up to ~1 million.
The unpatched version always uses a Hash Join, which is wasteful since
only a subset of parts are looked up.  In the 1 million test the
planner switches to using a Hash Join in the patched version at 65k
parts.  It waits until the 1 million distinct parts test to switch
over in the 10 million and 100 million tests. The hash join costs are
higher in that case due to multi-batching, which is why the crossover
point is higher on the larger scale tests. I used 256MB work_mem for
all tests. Looking closely at the 10 million test, you can see that
the hash join starts taking longer from 128 parts onward. The hash
table is the same each time here, so I can only suspect that the
slowdown between 64 and 128 parts is due to CPU cache thrashing when
getting the correct buckets from the overly large hash table. This is
not really visible in the patched version as the resultcache hash
table is much smaller.

David

Attachments:
  resultcachebench.txt (2K)
  v4-0002-Allow-users-of-simplehash.h-to-perform-direct-del.patch (6K)
  v4-0001-Allow-estimate_num_groups-to-pass-back-further-de.patch (12K)
  v4-0003-Add-Result-Cache-executor-node.patch (213K)
  rc_1million_bench.png (80K)
  rc_10million_bench.png (83K)
  rc_100million_bench.png (85K)

Re: Hybrid Hash/Nested Loop joins and caching results from subplans

David Rowley
On Wed, 8 Jul 2020 at 00:32, David Rowley <[hidden email]> wrote:
> I've attached v4.

Thomas pointed out to me earlier that, per the CFbot, v4 was
generating a new compiler warning.  Andres pointed out to me that I
could fix the warnings about the unused functions in simplehash.h by
changing their scope from static to static inline.

The attached v5 patch set fixes that.

David

Attachments:
  v5-0001-Allow-estimate_num_groups-to-pass-back-further-de.patch (12K)
  v5-0002-Allow-users-of-simplehash.h-to-perform-direct-del.patch (7K)
  v5-0003-Add-Result-Cache-executor-node.patch (213K)

Re: Hybrid Hash/Nested Loop joins and caching results from subplans

Andres Freund
In reply to this post by David Rowley
Hi,

On 2020-05-20 23:44:27 +1200, David Rowley wrote:
> I've attached a patch which implements this.  The new node type is
> called "Result Cache".  I'm not particularly wedded to keeping that
> name, but if I change it, I only want to do it once. I've got a few
> other names in mind, but I don't feel strongly or confident enough in
> them to go and do the renaming.

I'm not convinced it's a good idea to introduce a separate executor node
for this. There's a fair bit of overhead in them, and they will only be
below certain types of nodes afaict. It seems like it'd be better to
pull the required calls into the nodes that do parametrized scans of
subsidiary nodes. Have you considered that?

Greetings,

Andres Freund



Re: Hybrid Hash/Nested Loop joins and caching results from subplans

David Rowley
On Thu, 9 Jul 2020 at 04:53, Andres Freund <[hidden email]> wrote:

>
> On 2020-05-20 23:44:27 +1200, David Rowley wrote:
> > I've attached a patch which implements this.  The new node type is
> > called "Result Cache".  I'm not particularly wedded to keeping that
> > name, but if I change it, I only want to do it once. I've got a few
> > other names in mind, but I don't feel strongly or confident enough in
> > them to go and do the renaming.
>
> I'm not convinced it's a good idea to introduce a separate executor node
> for this. There's a fair bit of overhead in them, and they will only be
> below certain types of nodes afaict. It seems like it'd be better to
> pull the required calls into the nodes that do parametrized scans of
> subsidiary nodes. Have you considered that?

I see 41 different node types mentioned in ExecReScan().  I don't
really think it would be reasonable to change all those.

Here are a couple of examples, one with a Limit below the Result Cache
and one with a GroupAggregate.

postgres=# explain (costs off) select * from pg_Class c1 where relname
= (select relname from pg_Class c2 where c1.relname = c2.relname
offset 1 limit 1);
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Seq Scan on pg_class c1
   Filter: (relname = (SubPlan 1))
   SubPlan 1
     ->  Result Cache
           Cache Key: c1.relname
           ->  Limit
                 ->  Index Only Scan using pg_class_relname_nsp_index on pg_class c2
                       Index Cond: (relname = c1.relname)
(8 rows)


postgres=# explain (costs off) select * from pg_Class c1 where relname
= (select relname from pg_Class c2 where c1.relname = c2.relname group
by 1 having count(*) > 1);
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Seq Scan on pg_class c1
   Filter: (relname = (SubPlan 1))
   SubPlan 1
     ->  Result Cache
           Cache Key: c1.relname
           ->  GroupAggregate
                 Group Key: c2.relname
                 Filter: (count(*) > 1)
                 ->  Index Only Scan using pg_class_relname_nsp_index on pg_class c2
                       Index Cond: (relname = c1.relname)
(10 rows)

As for putting the logic somewhere like ExecReScan(), the first
paragraph in [1] contains my thoughts on that.

David

[1] https://www.postgresql.org/message-id/CAApHDvr-yx9DEJ1Lc9aAy8QZkgEZkTP=3hBRBe83Vwo=kAndcA@...



Re: Hybrid Hash/Nested Loop joins and caching results from subplans

David Rowley
In reply to this post by David Rowley
On Wed, 8 Jul 2020 at 15:37, David Rowley <[hidden email]> wrote:
> The attached v5 patch set fixes that.

I've attached an updated set of patches, rebased due to a recent conflict.

I'd like to push the 0002 patch quite soon, as I think it's an
improvement to simplehash.h regardless of whether we get Result Cache.
It reuses the SH_LOOKUP function for deletes. Also, if we ever get
around to giving up on a lookup when we get too far away from the
optimal bucket, then that logic would only need to appear in one
location rather than in two.

Andres, or anyone, any objections to me pushing 0002?

David

Attachments:
  v6-0002-Allow-users-of-simplehash.h-to-perform-direct-del.patch (7K)
  v6-0001-Allow-estimate_num_groups-to-pass-back-further-de.patch (12K)
  v6-0003-Add-Result-Cache-executor-node.patch (213K)

Re: Hybrid Hash/Nested Loop joins and caching results from subplans

Andres Freund
Hi,

On 2020-08-04 10:05:25 +1200, David Rowley wrote:
> I'd like to push the 0002 patch quite soon, as I think it's an
> improvement to simplehash.h regardless of whether we get Result Cache.
> It reuses the SH_LOOKUP function for deletes. Also, if we ever get
> around to giving up on a lookup when we get too far away from the
> optimal bucket, then that logic would only need to appear in one
> location rather than in two.

> Andres, or anyone, any objections to me pushing 0002?

I think it'd be good to add a warning that, unless one is very careful,
no other hashtable modifications are allowed between the lookup and the
modification. E.g. something like

    a = foobar_lookup(); foobar_insert(); foobar_delete_item(a);

will occasionally go wrong, since the insert can grow the table and
relocate entries, invalidating the pointer returned by the earlier
lookup...


> - /* TODO: return false; if distance too big */
> +/*
> + * Perform hash table lookup on 'key', delete the entry belonging to it and
> + * return true.  Returns false if no item could be found relating to 'key'.
> + */
> +SH_SCOPE bool
> +SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key)
> +{
> + SH_ELEMENT_TYPE *entry = SH_LOOKUP(tb, key);
>  
> - curelem = SH_NEXT(tb, curelem, startelem);
> + if (likely(entry != NULL))
> + {
> + /*
> + * Perform deletion and also the relocation of subsequent items which
> + * are not in their optimal position but can now be moved up.
> + */
> + SH_DELETE_ITEM(tb, entry);
> + return true;
>   }
> +
> + return false; /* Can't find 'key' */
>  }

You mentioned on IM that there's a slowdown with gcc. I wonder if this
could partially be responsible. Does SH_DELETE inline LOOKUP and
DELETE_ITEM? And does the generated code end up reloading entry-> or
tb-> members?

When the SH_SCOPE isn't static *, then IIRC gcc on unixes can't rely on
the called function actually being the function defined in the same
translation unit (unless -fno-semantic-interposition is specified).


Hm, but you said that this happens in tidbitmap.c, and there all
referenced functions are local statics. So that's not quite the
explanation I was thinking it was...


Hm. I also wonder whether, in the existing code, we unnecessarily end up
reloading tb->data a bunch of times, because we access ->data as
                SH_ELEMENT_TYPE *entry = &tb->data[curelem];

I think we should instead store tb->data in a local variable.
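
I.e. roughly this shape for the probe loop (a sketch; the real
SH_LOOKUP has more to it):

SH_SCOPE SH_ELEMENT_TYPE *
SH_LOOKUP(SH_TYPE *tb, SH_KEY_TYPE key)
{
    uint32      hash = SH_HASH_KEY(tb, key);
    uint32      startelem = SH_INITIAL_BUCKET(tb, hash);
    uint32      curelem = startelem;
    SH_ELEMENT_TYPE *data = tb->data;   /* load tb->data just once */

    for (;;)
    {
        SH_ELEMENT_TYPE *entry = &data[curelem];

        if (entry->status == SH_STATUS_EMPTY)
            return NULL;

        if (entry->status == SH_STATUS_IN_USE &&
            SH_COMPARE_KEYS(tb, hash, key, entry))
            return entry;

        /* with &tb->data[curelem] the compiler may reload tb->data here */
        curelem = SH_NEXT(tb, curelem, startelem);
    }
}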

Greetings,

Andres Freund



Re: Hybrid Hash/Nested Loop joins and caching results from subplans

Robert Haas
In reply to this post by David Rowley
On Wed, May 20, 2020 at 7:44 AM David Rowley <[hidden email]> wrote:
> I've attached a patch which implements this.  The new node type is
> called "Result Cache".  I'm not particularly wedded to keeping that
> name, but if I change it, I only want to do it once. I've got a few
> other names in mind, but I don't feel strongly or confident enough in
> them to go and do the renaming.

This is cool work; I am going to bikeshed on the name for a minute. I
don't think Result Cache is terrible, but I have two observations:

1. It might invite confusion with a feature of some other database
systems where they cache the results of entire queries and try to
reuse the entire result set.

2. The functionality reminds me a bit of a Materialize node, except
that instead of overflowing to disk, we throw away cache entries, and
instead of caching just one result, we potentially cache many.

I can't really think of a way to work Materialize into the node name
and I'm not sure it would be the right idea anyway. But I wonder if
maybe a name like "Parameterized Cache" would be better? That would
avoid confusion with any other use of the phrase "result cache"; also,
an experienced PostgreSQL user might be more likely to guess how a
"Parameterized Cache" is different from a "Materialize" node than they
would be if it were called a "Result Cache".

Just my $0.02,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Hybrid Hash/Nested Loop joins and caching results from subplans

Peter Geoghegan-4
In reply to this post by David Rowley
On Wed, May 20, 2020 at 4:44 AM David Rowley <[hidden email]> wrote:
> Does it seem like something we might want for PG14?

Minor terminology issue: "Hybrid Hash Join" is a specific hash join
algorithm which is unrelated to what you propose to do here. I hope
that confusion can be avoided, possibly by not using the word hybrid
in the name.

--
Peter Geoghegan



Re: Hybrid Hash/Nested Loop joins and caching results from subplans

David Rowley
In reply to this post by Robert Haas
On Thu, 6 Aug 2020 at 08:13, Robert Haas <[hidden email]> wrote:
>
> This is cool work; I am going to bikeshed on the name for a minute. I
> don't think Result Cache is terrible, but I have two observations:

Thanks

> 1. It might invite confusion with a feature of some other database
> systems where they cache the results of entire queries and try to
> reuse the entire result set.

Yeah. I think "Cache" is good to keep, but I'm pretty much in favour
of swapping "Result" for something else. It's a bit too close to the
"Result" node in name, but too distant for everything else.

> 2. The functionality reminds me a bit of a Materialize node, except
> that instead of overflowing to disk, we throw away cache entries, and
> instead of caching just one result, we potentially cache many.
>
> I can't really think of a way to work Materialize into the node name
> and I'm not sure it would be the right idea anyway. But I wonder if
> maybe a name like "Parameterized Cache" would be better?

Yeah, I think that name is better. The only downside as far as I can
see is the length of it.

I'll hold off a bit before doing any renaming, though, to see what
other people think. I just feel that bikeshedding on the name is
something that's going to take up quite a bit of time and effort with
this. I plan to rename it at most once.

Thanks for the comments

David

