Hybrid Hash/Nested Loop joins and caching results from subplans

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Hybrid Hash/Nested Loop joins and caching results from subplans

David Rowley
Hackers,

Over on [1], Heikki mentioned about the usefulness of caching results
from parameterized subplans so that they could be used again for
subsequent scans which have the same parameters as a previous scan.
On [2], I mentioned that parameterized nested loop joins could see
similar gains with such a cache. I suggested there that instead of
adding code that only allows this to work for subplans, that instead,
we add a new node type that can handle the caching for us.  We can
then just inject that node type in places where it seems beneficial.

I've attached a patch which implements this.  The new node type is
called "Result Cache".  I'm not particularly wedded to keeping that
name, but if I change it, I only want to do it once. I've got a few
other names I mind, but I don't feel strongly or confident enough in
them to go and do the renaming.

How the caching works:

First off, it's only good for plugging in on top of parameterized
nodes that are rescanned with different parameters. The cache itself
uses a hash table using the simplehash.h implementation.  The memory
consumption is limited to work_mem. The code maintains an LRU list and
when we need to add new entries but don't have enough space to do so,
we free off older items starting at the top of the LRU list.  When we
get a cache hit, we move that entry to the end of the LRU list so that
it'll be the last to be evicted.

When should we cache:

For nested loop joins, the decision is made purely based on cost.  The
costing model looks at the expected number of calls, the distinct
value estimate and work_mem size. It then determines how many items
can be cached and then goes on to estimate an expected cache hit ratio
and also an eviction ratio. It adjusts the input costs according to
those ratios and adds some additional charges for caching and cache
lookups.

For subplans, since we plan subplans before we're done planning the
outer plan, there's very little information to go on about the number
of times that the cache will be looked up. For now, I've coded things
so the cache is always used for EXPR_SUBLINK type subplans.  There may
be other types of subplan that could support caching too, but I've not
really gone through them all yet to determine which.  I certainly know
there's some that we can't cache for.

Why caching might be good:

With hash joins, it's sometimes not so great that we have to hash the
entire inner plan and only probe a very small number of values.  If we
were able to only fill the hash table with values that are needed,
then then a lot of time and memory could be saved.  Effectively, the
patch does exactly this with the combination of a parameterized nested
loop join with a Result Cache node above the inner scan.

For subplans, the gains can be more because often subplans are much
more expensive to execute than what might go on the inside of a
parameterized nested loop join.

Current problems and some ways to make it better:

The patch does rely heavily on good ndistinct estimates. One
unfortunate problem is that if the planner has no statistics for
whatever it's trying to estimate for, it'll default to returning
DEFAULT_NUM_DISTINCT (200).  That may cause the Result Cache to appear
much more favourable than it should.  One way I can think to work
around that would be to have another function similar to
estimate_num_groups() which accepts a default value which it will
return if it was unable to find statistics to use.  In this case, such
a function could just be called passing the number of input rows as
the default, which would make the costing code think each value is
unique, which would not be favourable for caching.  I've not done
anything like that in what I've attached here.  That solution would
also do nothing if the ndistinct estimate was available, but was just
incorrect, as it often is.

There are currently a few compiler warnings with the patch due to the
scope of the simplehash.h hash table. Because the scope is static
rather than extern there's a load of unused function warnings.  Not
sure yet the best way to deal with this. I don't want to change the
scope to extern just to keep compilers quiet.

Also during cache_reduce_memory(), I'm performing a hash table lookup
followed by a hash table delete. I already have the entry to delete,
but there's no simplehash.h function that allows deletion by element
pointer, only by key. This wastes a hash table lookup. I'll likely
make an adjustment to the simplehash.h code to export the delete code
as a separate function to fix this.

Demo:

# explain (analyze, costs off) select relname,(select count(*) from
pg_class c2 where c1.relkind = c2.relkind) from pg_class c1;
                                       QUERY PLAN
----------------------------------------------------------------------------------------
 Seq Scan on pg_class c1 (actual time=0.069..0.470 rows=391 loops=1)
   SubPlan 1
     ->  Result Cache (actual time=0.001..0.001 rows=1 loops=391)
           Cache Key: c1.relkind
           Cache Hits: 387  Cache Misses: 4 Cache Evictions: 0  Cache
Overflows: 0
           ->  Aggregate (actual time=0.062..0.062 rows=1 loops=4)
                 ->  Seq Scan on pg_class c2 (actual time=0.007..0.056
rows=98 loops=4)
                       Filter: (c1.relkind = relkind)
                       Rows Removed by Filter: 293
 Planning Time: 0.047 ms
 Execution Time: 0.536 ms
(11 rows)

# set enable_resultcache=0; -- disable result caching
SET
# explain (analyze, costs off) select relname,(select count(*) from
pg_class c2 where c1.relkind = c2.relkind) from pg_class c1;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Seq Scan on pg_class c1 (actual time=0.070..24.619 rows=391 loops=1)
   SubPlan 1
     ->  Aggregate (actual time=0.062..0.062 rows=1 loops=391)
           ->  Seq Scan on pg_class c2 (actual time=0.009..0.056
rows=120 loops=391)
                 Filter: (c1.relkind = relkind)
                 Rows Removed by Filter: 271
 Planning Time: 0.042 ms
 Execution Time: 24.653 ms
(8 rows)

-- Demo with parameterized nested loops
create table hundredk (hundredk int, tenk int, thousand int, hundred
int, ten int, one int);
insert into hundredk select x%100000,x%10000,x%1000,x%100,x%10,1 from
generate_Series(1,100000) x;
create table lookup (a int);
insert into lookup select x from generate_Series(1,100000)x,
generate_Series(1,100);
create index on lookup(a);
vacuum analyze lookup, hundredk;

# explain (analyze, costs off) select count(*) from hundredk hk inner
join lookup l on hk.thousand = l.a;
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Aggregate (actual time=1876.710..1876.710 rows=1 loops=1)
   ->  Nested Loop (actual time=0.013..1371.690 rows=9990000 loops=1)
         ->  Seq Scan on hundredk hk (actual time=0.005..8.451
rows=100000 loops=1)
         ->  Result Cache (actual time=0.000..0.005 rows=100 loops=100000)
               Cache Key: hk.thousand
               Cache Hits: 99000  Cache Misses: 1000 Cache Evictions:
0  Cache Overflows: 0
               ->  Index Only Scan using lookup_a_idx on lookup l
(actual time=0.002..0.011 rows=100 loops=1000)
                     Index Cond: (a = hk.thousand)
                     Heap Fetches: 0
 Planning Time: 0.113 ms
 Execution Time: 1876.741 ms
(11 rows)

# set enable_resultcache=0;
SET
# explain (analyze, costs off) select count(*) from hundredk hk inner
join lookup l on hk.thousand = l.a;
                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Aggregate (actual time=2401.351..2401.352 rows=1 loops=1)
   ->  Merge Join (actual time=28.412..1890.905 rows=9990000 loops=1)
         Merge Cond: (l.a = hk.thousand)
         ->  Index Only Scan using lookup_a_idx on lookup l (actual
time=0.005..10.170 rows=99901 loops=1)
               Heap Fetches: 0
         ->  Sort (actual time=28.388..576.783 rows=9990001 loops=1)
               Sort Key: hk.thousand
               Sort Method: quicksort  Memory: 7760kB
               ->  Seq Scan on hundredk hk (actual time=0.005..11.039
rows=100000 loops=1)
 Planning Time: 0.123 ms
 Execution Time: 2401.379 ms
(11 rows)

Cache Overflows:

You might have noticed "Cache Overflow" in the EXPLAIN ANALYZE output.
This happens if a single scan of the inner node exhausts the cache
memory. In this case, all the other entries will already have been
evicted in an attempt to make space for the current scan's tuples.
However, if we see an overflow then the size of the results from a
single scan alone must have exceeded work_mem.  There might be some
tweaking to do here as it seems a shame that a single overly larger
scan would flush the entire cache. I doubt it would be too hard to
limit the flushing to some percentage of work_mem. Similar to how
large seqscans don't entirely flush shared_buffers.

Current Status:

I've spent quite a bit of time getting this working. I'd like to take
a serious go at making this happen for PG14.  For now, it all seems to
work. I have some concerns about bad statistics causing nested loop
joins to be favoured more than they were previously due to the result
cache further lowering the cost of them when the cache hit ratio is
thought to be high.

For now, the node type is parallel_safe, but not parallel_aware. I can
see that a parallel_aware version would be useful, but I've not done
that here. Anything in that area will not be part of my initial
effort.  The unfortunate part about that is the actual hit ratio will
drop with more parallel workers since the caches of each worker are
separate.

Some tests show a 10x speedup on TPC-H Q2.

I'm interested in getting feedback on this before doing much further work on it.

Does it seem like something we might want for PG14?

David

[1] https://www.postgresql.org/message-id/daceb327-9a20-51f4-fe6c-60b898692305%40iki.fi
[2] https://www.postgresql.org/message-id/CAKJS1f8oNXQ-LqjK%3DBOFDmxLc_7s3uFr_g4qi7Ncrjig0JOCiA%40mail.gmail.com

resultcache_2020-05-20.patch.bz2 (39K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Hybrid Hash/Nested Loop joins and caching results from subplans

Simon Riggs
On Wed, 20 May 2020 at 12:44, David Rowley <[hidden email]> wrote:
Hackers,

Over on [1], Heikki mentioned about the usefulness of caching results
from parameterized subplans so that they could be used again for
subsequent scans which have the same parameters as a previous scan.
On [2], I mentioned that parameterized nested loop joins could see
similar gains with such a cache. I suggested there that instead of
adding code that only allows this to work for subplans, that instead,
we add a new node type that can handle the caching for us.  We can
then just inject that node type in places where it seems beneficial.

Very cool
 
I've attached a patch which implements this.  The new node type is
called "Result Cache".  I'm not particularly wedded to keeping that
name, but if I change it, I only want to do it once. I've got a few
other names I mind, but I don't feel strongly or confident enough in
them to go and do the renaming.

How the caching works:

First off, it's only good for plugging in on top of parameterized
nodes that are rescanned with different parameters. The cache itself
uses a hash table using the simplehash.h implementation.  The memory
consumption is limited to work_mem. The code maintains an LRU list and
when we need to add new entries but don't have enough space to do so,
we free off older items starting at the top of the LRU list.  When we
get a cache hit, we move that entry to the end of the LRU list so that
it'll be the last to be evicted.

When should we cache:

For nested loop joins, the decision is made purely based on cost.

I thought the main reason to do this was the case when the nested loop subplan was significantly underestimated and we realize during execution that we should have built a hash table. So including this based on cost alone seems to miss a trick.
 
The patch does rely heavily on good ndistinct estimates. 

Exactly. We know we seldom get those with many-way joins.

So +1 for adding this technique. My question is whether it should be added as an optional facility of a parameterised sub plan, rather than an always-needed full-strength node. That way the choice of whether to use it can happen at execution time once we notice that we've been called too many times.

--
Simon Riggs                http://www.2ndQuadrant.com/
Mission Critical Databases
Reply | Threaded
Open this post in threaded view
|

Re: Hybrid Hash/Nested Loop joins and caching results from subplans

David Rowley
On Thu, 21 May 2020 at 00:56, Simon Riggs <[hidden email]> wrote:
> I thought the main reason to do this was the case when the nested loop subplan was significantly underestimated and we realize during execution that we should have built a hash table. So including this based on cost alone seems to miss a trick.

Isn't that mostly because the planner tends to choose a
non-parameterized nested loop when it thinks the outer side of the
join has just 1 row?  If so, I'd say that's a separate problem as
Result Cache only deals with parameterized nested loops.  Perhaps the
problem you mention could be fixed by adding some "uncertainty degree"
to the selectivity estimate function and have it return that along
with the selectivity.  We'd likely not want to choose an
unparameterized nested loop when the uncertainly level is high.
Multiplying the selectivity of different selectivity estimates could
raise the uncertainty level a magnitude.

For plans where the planner chooses to use a non-parameterized nested
loop due to having just 1 row on the outer side of the loop, it's
taking a huge risk. The cost of putting the 1 row on the inner side of
a hash join would bearly cost anything extra during execution.
Hashing 1 row is pretty cheap and performing a lookup on that hashed
row is not much more expensive than evaluating the qual of the nested
loop. Really just requires the additional hash function calls.  Having
the uncertainty degree I mentioned above would allow us to only have
the planner do that when the uncertainty degree indicates it's not
worth the risk.

David


Reply | Threaded
Open this post in threaded view
|

Re: Hybrid Hash/Nested Loop joins and caching results from subplans

Andy Fan
In reply to this post by Simon Riggs



 My question is whether it should be added as an optional facility of a parameterised sub plan, rather than an always-needed full-strength node. That way the choice of whether to use it can happen at execution time once we notice that we've been called too many times.

 
Actually I am not sure about what does the "parameterized sub plan" mean (I
treat is a SubPlan Node), so please correct me if I misunderstand you:)  Because
the inner plan in nest loop not a SubPlan node actually.  so if bind the
facility to SubPlan node, we may loss the chances for nest loop.  And when we
consider the usage for nest loop,  we can consider the below example, where this
feature will be more powerful.


select j1o.i, j2_v.sum_5
from j1 j1o
inner join lateral
(select im100, sum(im5) as sum_5
from j2
where j1o.im100 = im100
and j1o.i = 1
group by im100) j2_v
on true
where j1o.i = 1;

--
Best Regards
Andy Fan
Reply | Threaded
Open this post in threaded view
|

Re: Hybrid Hash/Nested Loop joins and caching results from subplans

David Rowley
On Fri, 22 May 2020 at 12:12, Andy Fan <[hidden email]> wrote:
> Actually I am not sure about what does the "parameterized sub plan" mean (I
> treat is a SubPlan Node), so please correct me if I misunderstand you:)  Because
> the inner plan in nest loop not a SubPlan node actually.  so if bind the
> facility to SubPlan node, we may loss the chances for nest loop.

A parameterized subplan would be a subquery that contains column
reference to a query above its own level. The executor changes that
column reference into a parameter and the subquery will need to be
rescanned each time the parameter's value changes.

> And when we
> consider the usage for nest loop,  we can consider the below example, where this
> feature will be more powerful.

I didn't quite get the LATERAL support quite done in the version I
sent. For now, I'm not considering adding a Result Cache node if there
are lateral vars in any location other than the inner side of the
nested loop join.  I think it'll just be a few lines to make it work
though.  I wanted to get some feedback before going to too much more
trouble to make all cases work.

I've now added this patch to the first commitfest of PG14.

David