Hybrid Hash/Nested Loop joins and caching results from subplans


Hybrid Hash/Nested Loop joins and caching results from subplans

David Rowley
Hackers,

Over on [1], Heikki mentioned the usefulness of caching results
from parameterized subplans so that they could be used again for
subsequent scans which have the same parameters as a previous scan.
On [2], I mentioned that parameterized nested loop joins could see
similar gains with such a cache. I suggested there that, instead of
adding code that only allows this to work for subplans, we add a new
node type that can handle the caching for us.  We can then just
inject that node type in places where it seems beneficial.

I've attached a patch which implements this.  The new node type is
called "Result Cache".  I'm not particularly wedded to keeping that
name, but if I change it, I only want to do it once. I've got a few
other names in mind, but I don't feel strongly or confident enough in
them to go and do the renaming.

How the caching works:

First off, it's only good for plugging in on top of parameterized
nodes that are rescanned with different parameters. The cache itself
is a hash table built with the simplehash.h implementation.  The memory
consumption is limited to work_mem. The code maintains an LRU list, and
when we need to add new entries but don't have enough space to do so,
we free off older items starting at the top of the LRU list.  When we
get a cache hit, we move that entry to the end of the LRU list so that
it'll be the last to be evicted.
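
To make the LRU discipline concrete, here's a minimal standalone C
sketch of just that behaviour.  It is not the patch's code: the real
cache keys a simplehash.h table on the parameter values and bounds the
memory by work_mem, whereas this toy uses a linear scan and a fixed
entry count purely to stay self-contained.

#include <stdio.h>
#include <stdlib.h>

#define CACHE_CAPACITY 4        /* stand-in for the work_mem limit */

typedef struct CacheEntry
{
    int         key;            /* parameter value for the scan */
    long        value;          /* stand-in for the cached result tuples */
    struct CacheEntry *prev;    /* LRU list links */
    struct CacheEntry *next;
} CacheEntry;

static CacheEntry *lru_head;    /* least recently used end */
static CacheEntry *lru_tail;    /* most recently used end */
static int  nentries;

static void
lru_unlink(CacheEntry *e)
{
    if (e->prev)
        e->prev->next = e->next;
    else
        lru_head = e->next;
    if (e->next)
        e->next->prev = e->prev;
    else
        lru_tail = e->prev;
}

static void
lru_append(CacheEntry *e)
{
    e->prev = lru_tail;
    e->next = NULL;
    if (lru_tail)
        lru_tail->next = e;
    else
        lru_head = e;
    lru_tail = e;
}

/* Return the cached value for key, computing and caching it on a miss. */
static long
cache_lookup(int key, long (*rescan) (int))
{
    CacheEntry *e;

    /* linear scan stands in for the hash table lookup */
    for (e = lru_head; e != NULL; e = e->next)
    {
        if (e->key == key)
        {
            lru_unlink(e);      /* cache hit: move to the MRU end */
            lru_append(e);
            return e->value;
        }
    }

    if (nentries >= CACHE_CAPACITY)
    {
        CacheEntry *victim = lru_head;  /* evict from the LRU end */

        lru_unlink(victim);
        free(victim);
        nentries--;
    }

    e = malloc(sizeof(CacheEntry));
    e->key = key;
    e->value = rescan(key);     /* cache miss: run the inner scan */
    lru_append(e);
    nentries++;
    return e->value;
}

static long
expensive_scan(int key)
{
    return (long) key * key;    /* pretend this is an expensive inner scan */
}

int
main(void)
{
    int         params[] = {1, 2, 1, 3, 4, 5, 2};
    int         i;

    for (i = 0; i < 7; i++)
        printf("%d -> %ld\n", params[i], cache_lookup(params[i], expensive_scan));
    return 0;
}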

When should we cache:

For nested loop joins, the decision is made purely based on cost.  The
costing model looks at the expected number of calls, the distinct
value estimate and work_mem size. It then determines how many items
can be cached and then goes on to estimate an expected cache hit ratio
and also an eviction ratio. It adjusts the input costs according to
those ratios and adds some additional charges for caching and cache
lookups.
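
The real arithmetic lives in the patch's costing code
(cost_resultcache_rescan); purely as an illustration of the shape of
the model, here's a rough standalone sketch.  The formulas, the
"lookup_cost"/"caching_cost" charges and the numbers in main() are all
invented, not what the patch actually charges.

#include <stdio.h>

static double
resultcache_cost_sketch(double calls,          /* expected rescans of the node */
                        double ndistinct,      /* estimated distinct parameter values */
                        double est_entries,    /* entries that fit within work_mem */
                        double inner_run_cost, /* cost of one uncached rescan */
                        double lookup_cost,    /* invented: cost of one cache probe */
                        double caching_cost)   /* invented: cost of storing one entry */
{
    double      hit_ratio;
    double      evict_ratio;
    double      misses;
    double      total;

    if (est_entries > ndistinct)
        est_entries = ndistinct;

    /* repeats of values that still fit in the cache become hits */
    hit_ratio = (calls - ndistinct) / calls * (est_entries / ndistinct);
    if (hit_ratio < 0.0)
        hit_ratio = 0.0;

    /* share of distinct values that won't fit and must be evicted */
    evict_ratio = 1.0 - est_entries / ndistinct;

    misses = calls * (1.0 - hit_ratio);
    total = misses * (inner_run_cost + caching_cost) + calls * lookup_cost;

    /* a fuller model would charge extra for evictions; here the ratio is
     * only reported to show it exists in the model */
    printf("hit_ratio=%.3f evict_ratio=%.3f cost=%.1f\n",
           hit_ratio, evict_ratio, total);
    return total;
}

int
main(void)
{
    /* 100000 calls over 1000 distinct values, all of which fit in memory */
    resultcache_cost_sketch(100000, 1000, 1000, 50.0, 0.1, 1.0);
    return 0;
}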

For subplans, since we plan subplans before we're done planning the
outer plan, there's very little information to go on about the number
of times that the cache will be looked up. For now, I've coded things
so the cache is always used for EXPR_SUBLINK type subplans.  There may
be other types of subplan that could support caching too, but I've not
really gone through them all yet to determine which.  I certainly know
there are some that we can't cache for.

Why caching might be good:

With hash joins, it's sometimes not so great that we have to hash the
entire inner plan and only probe a very small number of values.  If we
were able to only fill the hash table with values that are needed,
then a lot of time and memory could be saved.  Effectively, the
patch does exactly this with the combination of a parameterized nested
loop join with a Result Cache node above the inner scan.

For subplans, the gains can be greater because subplans are often much
more expensive to execute than what might go on the inside of a
parameterized nested loop join.

Current problems and some ways to make it better:

The patch does rely heavily on good ndistinct estimates. One
unfortunate problem is that if the planner has no statistics for
whatever it's trying to estimate, it'll default to returning
DEFAULT_NUM_DISTINCT (200).  That may cause the Result Cache to appear
much more favourable than it should.  One way I can think to work
around that would be to have another function similar to
estimate_num_groups() which accepts a default value which it will
return if it was unable to find statistics to use.  In this case, such
a function could just be called passing the number of input rows as
the default, which would make the costing code think each value is
unique, which would not be favourable for caching.  I've not done
anything like that in what I've attached here.  That solution would
also do nothing if the ndistinct estimate was available, but was just
incorrect, as it often is.
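
The shape of that idea, as a standalone sketch (the wrapper and its
arguments are invented for illustration; the real thing would sit next
to estimate_num_groups() and take the planner's data structures rather
than plain doubles):

#include <stdbool.h>
#include <stdio.h>

#define DEFAULT_NUM_DISTINCT 200.0      /* what the planner falls back to today */

/*
 * Invented-for-illustration wrapper: the caller supplies the fallback value
 * to use when no statistics are available, instead of silently receiving
 * DEFAULT_NUM_DISTINCT.
 */
static double
estimate_num_groups_default(bool have_stats, double stats_estimate,
                            double default_estimate)
{
    return have_stats ? stats_estimate : default_estimate;
}

int
main(void)
{
    double      input_rows = 100000.0;

    /* today: no stats means 200 distinct values, which flatters the cache */
    printf("default behaviour: %.0f\n", DEFAULT_NUM_DISTINCT);

    /* proposed: pass input_rows as the default, so every value looks unique
     * and the Result Cache costing sees little benefit from caching */
    printf("with caller default: %.0f\n",
           estimate_num_groups_default(false, 0.0, input_rows));
    return 0;
}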

There are currently a few compiler warnings with the patch due to the
scope of the simplehash.h hash table. Because the scope is static
rather than extern, there's a load of unused-function warnings.  Not
sure yet the best way to deal with this. I don't want to change the
scope to extern just to keep compilers quiet.

Also during cache_reduce_memory(), I'm performing a hash table lookup
followed by a hash table delete. I already have the entry to delete,
but there's no simplehash.h function that allows deletion by element
pointer, only by key. This wastes a hash table lookup. I'll likely
make an adjustment to the simplehash.h code to export the delete code
as a separate function to fix this.
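
To illustrate the wasted work, here's a toy standalone sketch (a plain
array stands in for the simplehash.h table, and the function names are
invented): deleting by key has to repeat the probe that already found
the entry, while deleting by element pointer goes straight to the
removal.

#include <stdio.h>

#define MAX_ENTRIES 8

typedef struct Entry
{
    int         key;
    int         value;
} Entry;

static Entry entries[MAX_ENTRIES];
static int  nentries;
static int  nprobes;            /* counts lookups so the saving is visible */

static Entry *
table_lookup(int key)
{
    int         i;

    nprobes++;
    for (i = 0; i < nentries; i++)
        if (entries[i].key == key)
            return &entries[i];
    return NULL;
}

/* delete when we already hold a pointer to the entry: no lookup needed */
static void
table_delete_item(Entry *e)
{
    *e = entries[--nentries];   /* move the last entry into the hole */
}

/* delete by key: must repeat the lookup the caller may already have done */
static void
table_delete(int key)
{
    Entry      *e = table_lookup(key);

    if (e != NULL)
        table_delete_item(e);
}

int
main(void)
{
    Entry      *e;

    entries[nentries++] = (Entry) {1, 10};
    entries[nentries++] = (Entry) {2, 20};

    /* today's pattern: we already found the entry, yet delete by key again */
    e = table_lookup(1);
    if (e != NULL)
        table_delete(e->key);

    /* with a delete-by-element function exposed: one probe only */
    e = table_lookup(2);
    if (e != NULL)
        table_delete_item(e);

    printf("probes: %d\n", nprobes);    /* 3: two for the first delete, one for the second */
    return 0;
}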

Demo:

# explain (analyze, costs off) select relname,(select count(*) from pg_class c2 where c1.relkind = c2.relkind) from pg_class c1;
                                       QUERY PLAN
----------------------------------------------------------------------------------------
 Seq Scan on pg_class c1 (actual time=0.069..0.470 rows=391 loops=1)
   SubPlan 1
     ->  Result Cache (actual time=0.001..0.001 rows=1 loops=391)
           Cache Key: c1.relkind
           Cache Hits: 387  Cache Misses: 4  Cache Evictions: 0  Cache Overflows: 0
           ->  Aggregate (actual time=0.062..0.062 rows=1 loops=4)
                 ->  Seq Scan on pg_class c2 (actual time=0.007..0.056 rows=98 loops=4)
                       Filter: (c1.relkind = relkind)
                       Rows Removed by Filter: 293
 Planning Time: 0.047 ms
 Execution Time: 0.536 ms
(11 rows)

# set enable_resultcache=0; -- disable result caching
SET
# explain (analyze, costs off) select relname,(select count(*) from pg_class c2 where c1.relkind = c2.relkind) from pg_class c1;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Seq Scan on pg_class c1 (actual time=0.070..24.619 rows=391 loops=1)
   SubPlan 1
     ->  Aggregate (actual time=0.062..0.062 rows=1 loops=391)
           ->  Seq Scan on pg_class c2 (actual time=0.009..0.056 rows=120 loops=391)
                 Filter: (c1.relkind = relkind)
                 Rows Removed by Filter: 271
 Planning Time: 0.042 ms
 Execution Time: 24.653 ms
(8 rows)

-- Demo with parameterized nested loops
create table hundredk (hundredk int, tenk int, thousand int, hundred int, ten int, one int);
insert into hundredk select x%100000,x%10000,x%1000,x%100,x%10,1 from generate_Series(1,100000) x;
create table lookup (a int);
insert into lookup select x from generate_Series(1,100000)x, generate_Series(1,100);
create index on lookup(a);
vacuum analyze lookup, hundredk;

# explain (analyze, costs off) select count(*) from hundredk hk inner join lookup l on hk.thousand = l.a;
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Aggregate (actual time=1876.710..1876.710 rows=1 loops=1)
   ->  Nested Loop (actual time=0.013..1371.690 rows=9990000 loops=1)
         ->  Seq Scan on hundredk hk (actual time=0.005..8.451 rows=100000 loops=1)
         ->  Result Cache (actual time=0.000..0.005 rows=100 loops=100000)
               Cache Key: hk.thousand
               Cache Hits: 99000  Cache Misses: 1000  Cache Evictions: 0  Cache Overflows: 0
               ->  Index Only Scan using lookup_a_idx on lookup l (actual time=0.002..0.011 rows=100 loops=1000)
                     Index Cond: (a = hk.thousand)
                     Heap Fetches: 0
 Planning Time: 0.113 ms
 Execution Time: 1876.741 ms
(11 rows)

# set enable_resultcache=0;
SET
# explain (analyze, costs off) select count(*) from hundredk hk inner join lookup l on hk.thousand = l.a;
                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Aggregate (actual time=2401.351..2401.352 rows=1 loops=1)
   ->  Merge Join (actual time=28.412..1890.905 rows=9990000 loops=1)
         Merge Cond: (l.a = hk.thousand)
         ->  Index Only Scan using lookup_a_idx on lookup l (actual time=0.005..10.170 rows=99901 loops=1)
               Heap Fetches: 0
         ->  Sort (actual time=28.388..576.783 rows=9990001 loops=1)
               Sort Key: hk.thousand
               Sort Method: quicksort  Memory: 7760kB
               ->  Seq Scan on hundredk hk (actual time=0.005..11.039 rows=100000 loops=1)
 Planning Time: 0.123 ms
 Execution Time: 2401.379 ms
(11 rows)

Cache Overflows:

You might have noticed "Cache Overflow" in the EXPLAIN ANALYZE output.
This happens if a single scan of the inner node exhausts the cache
memory. In this case, all the other entries will already have been
evicted in an attempt to make space for the current scan's tuples.
However, if we see an overflow then the size of the results from a
single scan alone must have exceeded work_mem.  There might be some
tweaking to do here as it seems a shame that a single overly large
scan would flush the entire cache. I doubt it would be too hard to
limit the flushing to some percentage of work_mem, similar to how
large seqscans don't entirely flush shared_buffers.
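
One possible shape for that tweak, as a standalone sketch (the helper,
its arguments and the 25% figure are invented, not something the patch
does): stop evicting once some fraction of the budget has been freed,
and simply don't cache the oversized scan.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Invented helper: decide whether to evict yet another cache entry to make
 * room for "needed" more bytes.  Give up, and treat the scan as an
 * overflow, once "max_flush_fraction" of the budget has been freed for it.
 */
static bool
should_evict_more(size_t used, size_t freed_so_far, size_t needed,
                  size_t budget, double max_flush_fraction)
{
    if (used + needed <= budget)
        return false;           /* the scan's tuples already fit */
    if ((double) freed_so_far >= (double) budget * max_flush_fraction)
        return false;           /* stop: don't flush the whole cache */
    return true;
}

int
main(void)
{
    size_t      work_mem = 4 * 1024 * 1024;     /* 4MB budget */
    size_t      big_scan = 10 * 1024 * 1024;    /* a scan that can never fit */

    /* with a 25% cap: keep evicting at 0.5MB freed, stop just past 1MB */
    printf("%d\n", should_evict_more(work_mem - 512 * 1024, 512 * 1024,
                                     big_scan, work_mem, 0.25));
    printf("%d\n", should_evict_more(work_mem - 1100 * 1024, 1100 * 1024,
                                     big_scan, work_mem, 0.25));
    return 0;
}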

Current Status:

I've spent quite a bit of time getting this working. I'd like to take
a serious go at making this happen for PG14.  For now, it all seems to
work. I have some concerns about bad statistics causing nested loop
joins to be favoured more than they were previously, due to the result
cache further lowering their cost when the cache hit ratio is
thought to be high.

For now, the node type is parallel_safe, but not parallel_aware. I can
see that a parallel_aware version would be useful, but I've not done
that here. Anything in that area will not be part of my initial
effort.  The unfortunate part about that is that the actual hit ratio
will drop with more parallel workers, since the caches of each worker
are separate.

Some tests show a 10x speedup on TPC-H Q2.

I'm interested in getting feedback on this before doing much further work on it.

Does it seem like something we might want for PG14?

David

[1] https://www.postgresql.org/message-id/daceb327-9a20-51f4-fe6c-60b898692305%40iki.fi
[2] https://www.postgresql.org/message-id/CAKJS1f8oNXQ-LqjK%3DBOFDmxLc_7s3uFr_g4qi7Ncrjig0JOCiA%40mail.gmail.com

resultcache_2020-05-20.patch.bz2 (39K)

Re: Hybrid Hash/Nested Loop joins and caching results from subplans

Simon Riggs
On Wed, 20 May 2020 at 12:44, David Rowley <[hidden email]> wrote:
> Hackers,
>
> Over on [1], Heikki mentioned the usefulness of caching results
> from parameterized subplans so that they could be used again for
> subsequent scans which have the same parameters as a previous scan.
> On [2], I mentioned that parameterized nested loop joins could see
> similar gains with such a cache. I suggested there that, instead of
> adding code that only allows this to work for subplans, we add a new
> node type that can handle the caching for us.  We can then just
> inject that node type in places where it seems beneficial.

Very cool
 
> I've attached a patch which implements this.  The new node type is
> called "Result Cache".  I'm not particularly wedded to keeping that
> name, but if I change it, I only want to do it once. I've got a few
> other names in mind, but I don't feel strongly or confident enough in
> them to go and do the renaming.
>
> How the caching works:
>
> First off, it's only good for plugging in on top of parameterized
> nodes that are rescanned with different parameters. The cache itself
> is a hash table built with the simplehash.h implementation.  The memory
> consumption is limited to work_mem. The code maintains an LRU list, and
> when we need to add new entries but don't have enough space to do so,
> we free off older items starting at the top of the LRU list.  When we
> get a cache hit, we move that entry to the end of the LRU list so that
> it'll be the last to be evicted.
>
> When should we cache:
>
> For nested loop joins, the decision is made purely based on cost.

I thought the main reason to do this was the case when the nested loop subplan was significantly underestimated and we realize during execution that we should have built a hash table. So including this based on cost alone seems to miss a trick.
 
> The patch does rely heavily on good ndistinct estimates.

Exactly. We know we seldom get those with many-way joins.

So +1 for adding this technique. My question is whether it should be added as an optional facility of a parameterised sub plan, rather than an always-needed full-strength node. That way the choice of whether to use it can happen at execution time once we notice that we've been called too many times.

--
Simon Riggs                http://www.2ndQuadrant.com/
Mission Critical Databases

Re: Hybrid Hash/Nested Loop joins and caching results from subplans

David Rowley
On Thu, 21 May 2020 at 00:56, Simon Riggs <[hidden email]> wrote:
> I thought the main reason to do this was the case when the nested loop subplan was significantly underestimated and we realize during execution that we should have built a hash table. So including this based on cost alone seems to miss a trick.

Isn't that mostly because the planner tends to choose a
non-parameterized nested loop when it thinks the outer side of the
join has just 1 row?  If so, I'd say that's a separate problem as
Result Cache only deals with parameterized nested loops.  Perhaps the
problem you mention could be fixed by adding some "uncertainty degree"
to the selectivity estimate function and have it return that along
with the selectivity.  We'd likely not want to choose an
unparameterized nested loop when the uncertainty level is high.
Multiplying several selectivity estimates together could raise the
uncertainty level by an order of magnitude.

For plans where the planner chooses to use a non-parameterized nested
loop due to having just 1 row on the outer side of the loop, it's
taking a huge risk. Putting the 1 row on the inner side of a hash join
would barely cost anything extra during execution.  Hashing 1 row is
pretty cheap, and performing a lookup on that hashed row is not much
more expensive than evaluating the qual of the nested loop; it really
just requires the additional hash function calls.  Having
the uncertainty degree I mentioned above would allow us to only have
the planner do that when the uncertainty degree indicates it's not
worth the risk.
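
That's only hand-waving at this point, but to make the idea concrete,
here's a small standalone sketch (all names invented) of what carrying
an uncertainty degree alongside a selectivity might look like, with
multiplication compounding the uncertainty:

#include <stdio.h>

/* Invented types: a selectivity together with how much we trust it. */
typedef struct Selectivity
{
    double      sel;            /* estimated fraction of rows */
    double      uncertainty;    /* 0 = from solid stats, grows as we guess */
} Selectivity;

/*
 * Combining two independent clause estimates multiplies the selectivities
 * and, in this sketch, compounds the uncertainty, so a row estimate built
 * from many multiplied guesses is flagged as risky.
 */
static Selectivity
clauselist_combine(Selectivity a, Selectivity b)
{
    Selectivity result;

    result.sel = a.sel * b.sel;
    result.uncertainty = a.uncertainty + b.uncertainty +
        a.uncertainty * b.uncertainty;
    return result;
}

int
main(void)
{
    Selectivity a = {0.01, 0.5};    /* default-ish guess: quite uncertain */
    Selectivity b = {0.05, 0.1};    /* backed by stats: fairly certain */
    Selectivity c = clauselist_combine(a, b);

    /* a planner could refuse an unparameterized nested loop when the
     * combined uncertainty crosses some threshold, even if sel * rows = 1 */
    printf("sel=%.4f uncertainty=%.2f\n", c.sel, c.uncertainty);
    return 0;
}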

David



Re: Hybrid Hash/Nested Loop joins and caching results from subplans

Andy Fan
In reply to this post by Simon Riggs



> My question is whether it should be added as an optional facility of a parameterised sub plan, rather than an always-needed full-strength node. That way the choice of whether to use it can happen at execution time once we notice that we've been called too many times.

 
Actually, I am not sure what "parameterized sub plan" means (I read it
as a SubPlan node), so please correct me if I misunderstand you :)  The
inner plan of a nested loop is not actually a SubPlan node, so if we
bind the facility to the SubPlan node, we may lose the chance to use it
for nested loops.  And when we consider the usage for nested loops, we
can consider the example below, where this feature will be more
powerful.


select j1o.i, j2_v.sum_5
from j1 j1o
inner join lateral
  (select im100, sum(im5) as sum_5
   from j2
   where j1o.im100 = im100
     and j1o.i = 1
   group by im100) j2_v on true
where j1o.i = 1;

--
Best Regards
Andy Fan

Re: Hybrid Hash/Nested Loop joins and caching results from subplans

David Rowley
On Fri, 22 May 2020 at 12:12, Andy Fan <[hidden email]> wrote:
> Actually, I am not sure what "parameterized sub plan" means (I read it
> as a SubPlan node), so please correct me if I misunderstand you :)  The
> inner plan of a nested loop is not actually a SubPlan node, so if we
> bind the facility to the SubPlan node, we may lose the chance to use it
> for nested loops.

A parameterized subplan would be a subquery that contains a column
reference to a query above its own level. The executor changes that
column reference into a parameter, and the subquery will need to be
rescanned each time the parameter's value changes.

> And when we consider the usage for nested loops, we can consider the
> example below, where this feature will be more powerful.

I didn't quite get the LATERAL support done in the version I
sent. For now, I'm not considering adding a Result Cache node if there
are lateral vars in any location other than the inner side of the
nested loop join.  I think it'll just be a few lines to make it work
though.  I wanted to get some feedback before going to too much more
trouble to make all cases work.

I've now added this patch to the first commitfest of PG14.

David



Re: Hybrid Hash/Nested Loop joins and caching results from subplans

Andy Fan
Today I tested the correctness & performance of this patch based on the
TPC-H workload; the environment was set up based on [1]. Correctness was
tested by storing the results into another table with this feature
disabled, then enabling the feature and comparing the results with the
original ones. No issues were found at this stage.

I also checked the performance gain for the TPC-H workload: in total, 4
out of the 22 queries use this new path; 3 of them are subplans and 1 is
a nested loop. All of the changes get a better result. You can check the
attachments for reference: normal.log is the data without this feature,
patched.log is the data with the feature. The data doesn't show the 10x
performance gain; I think that's mainly related to the data size.

At the code level, I mainly checked the nestloop path and
cost_resultcache_rescan; everything looks good to me. I'd like to check
the other parts in the following days.

[1] https://ankane.org/tpc-h
--
Best Regards
Andy Fan

patched.log (105K)
normal.log (103K)

Re: Hybrid Hash/Nested Loop joins and caching results from subplans

David Rowley
On Tue, 2 Jun 2020 at 21:05, Andy Fan <[hidden email]> wrote:
> Today I tested the correctness & performance of this patch based on the
> TPC-H workload; the environment was set up based on [1]. Correctness was
> tested by storing the results into another table with this feature
> disabled, then enabling the feature and comparing the results with the
> original ones. No issues were found at this stage.

Thank you for testing it out.

> I also checked the performance gain for the TPC-H workload: in total, 4
> out of the 22 queries use this new path; 3 of them are subplans and 1 is
> a nested loop. All of the changes get a better result. You can check the
> attachments for reference: normal.log is the data without this feature,
> patched.log is the data with the feature. The data doesn't show the 10x
> performance gain; I think that's mainly related to the data size.

Thanks for running those tests.  I had a quick look at the results and
I think to say that all 4 are better is not quite right. One is
actually a tiny bit slower and one is only faster due to a plan
change.  Here's my full analysis.

Q2 uses a result cache for the subplan and has about a 37.5% hit ratio
which reduces the execution time of the query down to 67% of the
original.
Q17 uses a result cache for the subplan and has about a 96.5% hit
ratio which reduces the execution time of the query down to 24% of the
original time.
Q18 uses a result cache for 2 x nested loop joins and has a 0% hit
ratio.  The execution time is reduced to 91% of the original time only
because the planner uses a different plan, which just happens to be
faster by chance.
Q20 uses a result cache for the subplan and has a 0% hit ratio.  The
execution time is 100.27% of the original time. There are 8620 cache
misses.
All other queries use the same plan with and without the patch.

> At the code level, I mainly checked the nestloop path and
> cost_resultcache_rescan; everything looks good to me. I'd like to check
> the other parts in the following days.

Great.

> [1] https://ankane.org/tpc-h



Re: Hybrid Hash/Nested Loop joins and caching results from subplans

Andy Fan

> Thanks for running those tests.  I had a quick look at the results and
> I think to say that all 4 are better is not quite right. One is
> actually a tiny bit slower and one is only faster due to a plan
> change.


Yes..  Thanks for pointing it out. 


> Q18 uses a result cache for 2 x nested loop joins and has a 0% hit
> ratio.  The execution time is reduced to 91% of the original time only
> because the planner uses a different plan, which just happens to be
> faster by chance.
> Q20 uses a result cache for the subplan and has a 0% hit ratio.  The
> execution time is 100.27% of the original time. There are 8620 cache
> misses.

Looks like the case here is a statistics issue or a cost model issue.
I'd like to check more about that.  But before that, I've uploaded the
steps[1] I used in case you want to reproduce it locally.


--
Best Regards
Andy Fan

Re: Hybrid Hash/Nested Loop joins and caching results from subplans

Andy Fan


On Wed, Jun 3, 2020 at 10:36 AM Andy Fan <[hidden email]> wrote:

>> Thanks for running those tests.  I had a quick look at the results and
>> I think to say that all 4 are better is not quite right. One is
>> actually a tiny bit slower and one is only faster due to a plan
>> change.


> Yes..  Thanks for pointing it out.


>> Q18 uses a result cache for 2 x nested loop joins and has a 0% hit
>> ratio.  The execution time is reduced to 91% of the original time only
>> because the planner uses a different plan, which just happens to be
>> faster by chance.

This case should be caused by a wrong row estimate on the condition
o_orderkey in (select l_orderkey from lineitem group by l_orderkey having
sum(l_quantity) > 312).  The estimate is 123766 rows, but the actual
number is 10 rows.  This estimation is hard and I don't think we should
address that issue in this patch.


>> Q20 uses a result cache for the subplan and has a 0% hit ratio.  The
>> execution time is 100.27% of the original time. There are 8620 cache
>> misses.


This is by design in the current implementation.

> For subplans, since we plan subplans before we're done planning the
> outer plan, there's very little information to go on about the number
> of times that the cache will be looked up. For now, I've coded things
> so the cache is always used for EXPR_SUBLINK type subplans.

I first tried to see if we can have a row estimate before the subplan
is created, and it looks very complex.  The subplan is created during
preprocess_qual_conditions; at that time we haven't even created the
base RelOptInfo, to say nothing of the join rel, whose row estimation
happens much later.

Then I looked at whether we can delay the cache decision until we have
the row estimate; ExecInitSubPlan may be a candidate.  At that point we
can't add a new ResultCache node, but we could add a cost-based caching
facility to the SubPlan node.  However, the number of distinct values
for the parameterized variable can't be calculated, which I still leave
as an open issue.

--
Best Regards
Andy Fan

Re: Hybrid Hash/Nested Loop joins and caching results from subplans

David Rowley
On Fri, 12 Jun 2020 at 16:10, Andy Fan <[hidden email]> wrote:

> I first tried to see if we can have a row estimate before the subplan
> is created, and it looks very complex.  The subplan is created during
> preprocess_qual_conditions; at that time we haven't even created the
> base RelOptInfo, to say nothing of the join rel, whose row estimation
> happens much later.
>
> Then I looked at whether we can delay the cache decision until we have
> the row estimate; ExecInitSubPlan may be a candidate.  At that point we
> can't add a new ResultCache node, but we could add a cost-based caching
> facility to the SubPlan node.  However, the number of distinct values
> for the parameterized variable can't be calculated, which I still leave
> as an open issue.

I don't really like the idea of stuffing this feature into some
existing node type.  Doing so would seem pretty magical when looking
at an EXPLAIN ANALYZE. There is of course overhead to pulling tuples
through an additional node in the plan, but if you use that as an
argument then there's some room to argue that we should only have 1
executor node type to get rid of that overhead.

Tom mentioned in [1] that he's reconsidering his original thoughts on
leaving the AlternativeSubPlan selection decision until execution
time.  If that were done late in planning, as Tom mentioned, then it
would be possible to give a more accurate cost to the Result Cache as
we'd have built the outer plan by that time and would be able to
estimate the number of distinct calls to the correlated subplan. As
that feature is today, we'd be unable to delay making the decision
until execution time, as we don't have the required details to know how
many distinct calls there will be to the Result Cache node.

For now, I'm planning on changing things around a little in the Result
Cache node to allow faster deletions from the cache.  As of now, we
must perform 2 hash lookups to perform a single delete.  This is
because we must perform a lookup to fetch the entry using the key
stored in the MRU list, then an additional lookup inside the hash
delete code.  I plan on changing the hash delete code to expose another
function that allows us to delete an item directly if we've already
looked it up.  This should make a small reduction in the overheads of
the node.  Perhaps if the overhead is very small (say < 1%) when the
cache is of no use then it might not be such a bad thing to just have a
Result Cache for correlated subplans regardless of estimates.  With the
TPCH Q20 test, it appeared as if the overhead was 0.27% for that
particular subplan.  A simpler subplan would execute more quickly,
resulting in the Result Cache overhead being a more significant portion
of the overall subquery execution.  I'd need to perform a worst-case
overhead test to get an indication of what the percentage is.

David

[1] https://www.postgresql.org/message-id/1992952.1592785225@...



Re: Hybrid Hash/Nested Loop joins and caching results from subplans

David Rowley
On Tue, 30 Jun 2020 at 11:57, David Rowley <[hidden email]> wrote:

> For now, I'm planning on changing things around a little in the Result
> Cache node to allow faster deletions from the cache.  As of now, we
> must perform 2 hash lookups to perform a single delete.  This is
> because we must perform a lookup to fetch the entry using the key
> stored in the MRU list, then an additional lookup inside the hash
> delete code.  I plan on changing the hash delete code to expose another
> function that allows us to delete an item directly if we've already
> looked it up.  This should make a small reduction in the overheads of
> the node.  Perhaps if the overhead is very small (say < 1%) when the
> cache is of no use then it might not be such a bad thing to just have a
> Result Cache for correlated subplans regardless of estimates.  With the
> TPCH Q20 test, it appeared as if the overhead was 0.27% for that
> particular subplan.  A simpler subplan would execute more quickly,
> resulting in the Result Cache overhead being a more significant portion
> of the overall subquery execution.  I'd need to perform a worst-case
> overhead test to get an indication of what the percentage is.

I made the changes that I mentioned to speed up the cache deletes.  The
patch is now in 3 parts. The first two parts are additional work and
the final part is the existing work with some small tweaks.

0001: Alters estimate_num_groups() to allow it to pass back a flags
variable to indicate if the estimate used DEFAULT_NUM_DISTINCT.  The
idea here is to try and avoid using a Result Cache for a Nested Loop
join when the statistics are likely to be unreliable. Because
DEFAULT_NUM_DISTINCT is 200, if we estimate that number of distinct
values then a Result Cache is likely to look highly favourable in some
situations where it very well may not be.  I've not given this patch a
huge amount of thought, but so far I don't see anything too
unreasonable about it. I'm prepared to be wrong about that though.

0002: Makes some adjustments to simplehash.h to expose a function which
allows direct deletion of a hash table element when we already have a
pointer to the bucket. I think this is a pretty good change as it
reuses more simplehash.h code than without the patch.

0003: The result cache code.  I've done another pass over this
version and fixed a few typos and added a few comments.  I've not yet
added support for LATERAL joins. I plan to do that soon. For now, I
just wanted to get something out there as I saw that the patch needed
rebasing.

I did end up testing the overheads of having a Result Cache node on a
very simple subplan that'll never see a cache hit. The overhead is
quite a bit more than the 0.27% that we saw with TPCH Q20.

Using a query that gets zero cache hits:

$ cat bench.sql
select relname,(select oid from pg_class c2 where c1.oid = c2.oid)
from pg_Class c1 offset 1000000000;

enable_resultcache = on:

$ pgbench -n -f bench.sql -T 60 postgres
latency average = 0.474 ms
tps = 2110.431529 (including connections establishing)
tps = 2110.503284 (excluding connections establishing)

enable_resultcache = off:

$ pgbench -n -f bench.sql -T 60 postgres
latency average = 0.379 ms
tps = 2640.534303 (including connections establishing)
tps = 2640.620552 (excluding connections establishing)

Which is about a 25% overhead in this very simple case.  With more
complex subqueries that overhead will drop significantly, but for that
simple one, it does seem quite a bit too high to be adding a Result
Cache unconditionally for all correlated subqueries.  I think based on
that it's worth looking into the AlternativeSubPlan option that I
mentioned earlier.

I've attached the v2 patch series.

David

v2-0001-Allow-estimate_num_groups-to-pass-back-further-de.patch (12K)
v2-0002-Allow-users-of-simplehash.h-to-perform-direct-del.patch (6K)
v2-0003-Add-Result-Cache-executor-node.patch (208K)