Removing unneeded self joins

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
26 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Removing unneeded self joins

Alexander Kuzmenkov
Hi hackers,

There is a join optimization we don't do -- removing inner join of a
table with itself on a unique column. Such joins are generated by
various ORMs, so from time to time our customers ask us to look into
this. Most recently, it was discussed on the list in relation to an
article comparing the optimizations that some DBMS make [1].

I started to explore what can be done about this. Attached is a proof of
concept patch. It works for some simple cases:

create table tt(a int primary key, b text);
explain select p.* from tt p join (select * from tt where b ~~ 'a%') q
on p.a = q.a;

                       QUERY PLAN
──────────────────────────────────────────────────────
  Seq Scan on tt p  (cost=0.00..25.88 rows=6 width=36)
    Filter: (b ~~ 'a%'::text)

It also works for semi-joins like `explain select p.* from tt p where
exists (select * from tt where b ~~ 'a%' and a = p.a);`. This requires a
preparatory step of reducing unique semi joins to inner joins, and we
already do this (reduce_unique_semijoin).


What this patch tries to do is to remove these inner joins when a single
join is being planned (populate_joinrel_with_paths). The main entry
point is reduce_self_unique_join. First, it proves that both input
relations are uniquely constrained by the same index given the
particular join clauses. We already have a way to find such indexes
(relation_has_unique_index_for), so I was able to reuse this. What I'm
not sure about is how to properly remove the join after that. For now, I
just pretend that the join relation being built is the outer baserel,
add to it the restrictions from the inner relation, and then plan it as
usual. Maybe there is a less hacky way to do it? I've seen elsewhere a
suggestion to use an AppendPath for a similar purpose, but here we can't
just use the outer relation we've already planned because the
restriction list is different.

I'd be glad to hear your thoughts on this.

[1]
https://www.postgresql.org/message-id/flat/CAMjNa7cC4X9YR-vAJS-jSYCajhRDvJQnN7m2sLH1wLh-_Z2bsw%40mail.gmail.com#CAMjNa7cC4X9YR-vAJS-jSYCajhRDvJQnN7m2sLH1wLh-_Z2bsw@...

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


remove-self-join-v1.patch (18K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Removing unneeded self joins

Tom Lane-2
Alexander Kuzmenkov <[hidden email]> writes:
> There is a join optimization we don't do -- removing inner join of a
> table with itself on a unique column. Such joins are generated by
> various ORMs, so from time to time our customers ask us to look into
> this. Most recently, it was discussed on the list in relation to an
> article comparing the optimizations that some DBMS make [1].

This is the sort of thing that I always wonder why the customers don't
ask the ORM to stop generating such damfool queries.  Its *expensive*
for us to clean up after their stupidity; almost certainly, it would
take far fewer cycles, net, for them to be a bit smarter in the first
place.

                        regards, tom lane

Reply | Threaded
Open this post in threaded view
|

Re: Removing unneeded self joins

Robert Haas
On Wed, May 16, 2018 at 12:08 PM, Tom Lane <[hidden email]> wrote:

> Alexander Kuzmenkov <[hidden email]> writes:
>> There is a join optimization we don't do -- removing inner join of a
>> table with itself on a unique column. Such joins are generated by
>> various ORMs, so from time to time our customers ask us to look into
>> this. Most recently, it was discussed on the list in relation to an
>> article comparing the optimizations that some DBMS make [1].
>
> This is the sort of thing that I always wonder why the customers don't
> ask the ORM to stop generating such damfool queries.  Its *expensive*
> for us to clean up after their stupidity; almost certainly, it would
> take far fewer cycles, net, for them to be a bit smarter in the first
> place.

The trouble, of course, is that the customer didn't write the ORM,
likely has no idea how it works, and doesn't want to run a modified
version of it even if they do.  If the queries run faster on other
systems than they do on PostgreSQL, we get dinged -- not unjustly.

Also, I'm not sure that I believe that it's always easy to avoid
generating such queries.  I mean, this case is trivial so it's easy to
say, well, just rewrite the query.  But suppose that I have a fact
table over which I've created two views, each of which performs
various joins between the fact table and various lookup tables.  My
queries are such that I normally need the joins in just one of these
two views and not the other to fetch the information I care about.
But every once in a while I need to run a report that involves pulling
every column possible.  The obvious solution is to join the views on
the underlying table's primary key, but then you get this problem.  Of
course there's a workaround: define a third view that does both sets
of joins-to-lookup-tables.  But that starts to feel like you're
handholding the database; surely it's the database's job to optimize
queries, not the user's.

It's been about 10 years since I worked as a web developer, but I do
remember hitting this kind of problem from time to time and I'd really
like to see us do something about it.  I wish we could optimize away
inner joins, too, for similar reasons.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Reply | Threaded
Open this post in threaded view
|

Re: Removing unneeded self joins

Andres Freund
Hi,

On 2018-05-16 12:26:48 -0400, Robert Haas wrote:
> Also, I'm not sure that I believe that it's always easy to avoid
> generating such queries.

Yea. There's obviously plenty cases where ORMs just want to make the
database hurt. But especially when building a join between a number of
tables based on various fields, it's not going to be easy for the ORM to
figure out which ones can be safely omitted. It'd need similar
optimization as we'd have to do, without having the infrastructure core
PG has.  And then there's, as you say, views etc...

Greetings,

Andres Freund

Reply | Threaded
Open this post in threaded view
|

Re: Removing unneeded self joins

Simon Riggs
In reply to this post by Robert Haas
On 16 May 2018 at 11:26, Robert Haas <[hidden email]> wrote:

> On Wed, May 16, 2018 at 12:08 PM, Tom Lane <[hidden email]> wrote:
>> Alexander Kuzmenkov <[hidden email]> writes:
>>> There is a join optimization we don't do -- removing inner join of a
>>> table with itself on a unique column. Such joins are generated by
>>> various ORMs, so from time to time our customers ask us to look into
>>> this. Most recently, it was discussed on the list in relation to an
>>> article comparing the optimizations that some DBMS make [1].
>>
>> This is the sort of thing that I always wonder why the customers don't
>> ask the ORM to stop generating such damfool queries.  Its *expensive*
>> for us to clean up after their stupidity; almost certainly, it would
>> take far fewer cycles, net, for them to be a bit smarter in the first
>> place.
>
> The trouble, of course, is that the customer didn't write the ORM,
> likely has no idea how it works, and doesn't want to run a modified
> version of it even if they do.  If the queries run faster on other
> systems than they do on PostgreSQL, we get dinged -- not unjustly.
>
> Also, I'm not sure that I believe that it's always easy to avoid
> generating such queries.  I mean, this case is trivial so it's easy to
> say, well, just rewrite the query.  But suppose that I have a fact
> table over which I've created two views, each of which performs
> various joins between the fact table and various lookup tables.  My
> queries are such that I normally need the joins in just one of these
> two views and not the other to fetch the information I care about.
> But every once in a while I need to run a report that involves pulling
> every column possible.  The obvious solution is to join the views on
> the underlying table's primary key, but then you get this problem.  Of
> course there's a workaround: define a third view that does both sets
> of joins-to-lookup-tables.  But that starts to feel like you're
> handholding the database; surely it's the database's job to optimize
> queries, not the user's.
>
> It's been about 10 years since I worked as a web developer, but I do
> remember hitting this kind of problem from time to time and I'd really
> like to see us do something about it.  I wish we could optimize away
> inner joins, too, for similar reasons.

I agree with everything you say.

What I would add is that I've seen cases where the extra joins do NOT
hurt performance, so the extra CPU used to remove the join hurts more
than the benefit of removing it. Yes, we tried it.

More advanced optimizations should only be applied when we've assessed
that the likely run time is high enough to make it worth investing in
further optimization.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Removing unneeded self joins

Tom Lane-2
Simon Riggs <[hidden email]> writes:
> What I would add is that I've seen cases where the extra joins do NOT
> hurt performance, so the extra CPU used to remove the join hurts more
> than the benefit of removing it. Yes, we tried it.

Interesting.  The concern I had was more about the cost imposed on every
query to detect self-joins and try to prove them useless, even in queries
where no benefit ensues.  It's possible that we can get that down to the
point where it's negligible; but this says that even the successful-proof
case has to be very cheap.

                        regards, tom lane

Reply | Threaded
Open this post in threaded view
|

Re: Removing unneeded self joins

Jonathan S. Katz-3
In reply to this post by Andres Freund

> On May 16, 2018, at 1:58 PM, Andres Freund <[hidden email]> wrote:
>
> Hi,
>
> On 2018-05-16 12:26:48 -0400, Robert Haas wrote:
>> Also, I'm not sure that I believe that it's always easy to avoid
>> generating such queries.
>
> Yea. There's obviously plenty cases where ORMs just want to make the
> database hurt. But especially when building a join between a number of
> tables based on various fields, it's not going to be easy for the ORM to
> figure out which ones can be safely omitted. It'd need similar
> optimization as we'd have to do, without having the infrastructure core
> PG has.  And then there's, as you say, views etc…

Are there specific examples of what the ORM code is that generated
the SQL? I’m more curious to see what people are writing that
generates such code. As earlier mentioned we could always report back
to the specific ORM maintainer(s) such examples and see if they could
tweak.

Jonathan
Reply | Threaded
Open this post in threaded view
|

Re: Removing unneeded self joins

Simon Riggs
In reply to this post by Tom Lane-2
On 16 May 2018 at 15:10, Tom Lane <[hidden email]> wrote:

> Simon Riggs <[hidden email]> writes:
>> What I would add is that I've seen cases where the extra joins do NOT
>> hurt performance, so the extra CPU used to remove the join hurts more
>> than the benefit of removing it. Yes, we tried it.
>
> Interesting.  The concern I had was more about the cost imposed on every
> query to detect self-joins and try to prove them useless, even in queries
> where no benefit ensues.  It's possible that we can get that down to the
> point where it's negligible; but this says that even the successful-proof
> case has to be very cheap.

What I was advocating was an approach that varies according to the
query cost, so we don't waste time trying to tune the heck out of OLTP
queries, but for larger queries we might take a more considered
approach.

For advanced optimizations that are costly to check for, skip the
check if we are already below a cost threshold. The threshold would be
a heuristic that varies according to the cost of the check.

I realise that in this case we wouldn't know the full query cost until
we've done join planning, so we would need some lower bound estimate
to check whether its worth trying to remove joins.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Removing unneeded self joins

David Rowley-3
In reply to this post by Alexander Kuzmenkov
On 17 May 2018 at 03:43, Alexander Kuzmenkov <[hidden email]> wrote:
> I'd be glad to hear your thoughts on this.

(I only glanced at the patch)

I've thought and discussed this before on this list. I think the
arguments for and against it were much the same as you've received
already. If you trawl through the archives you'll see my argument for
matches quite closely to Robert regarding the nested-views. I
personally experienced this issue in my previous job, although it was
not with PostgreSQL.

I think it's worth doing this providing that we can fast-path out
quickly enough in cases where we can't possibly remove anything.
Likely the success of this patch depends on how quick that fast-path
is.

From my experience on join removals, I imagine all this can be done
just after the left join removal code has completed. I see your patch
does it much later, which I don't think is particularly great since
Paths have already been generated by that time. I think it makes sense
to do this as early as possible to save wasting planning work for
relations that will be removed.

I think all this can be done just after left joins are removed by
remove_useless_joins. You may like to move the code that exists in
that function today into a new static function named
remove_useless_left_joins, and put this new code in new static
function named remove_useless_self_joins:

1. Allocate an array root->simple_rel_array_size in size. Populate it
with a struct which is defined as struct { Index relid; Oid oid; }
2. Populate that array by looping over the simple_rel_array. Ignore
anything that's not a baserel with relkind = 'r'
3. qsort the array on Oid.
4. Make a pass over the array (up to its size - 1) looking for
elements where the current oid is the same as the next. Build a List
of RelIds containing all relids of Oids which are duplicated.
5. If no pairs. Abort.
6. Process each combination of pairs found in each Relids in the list
made in step 1. Probably start at the lowest relid.
7. For each pair:
  a. If there's a join condition, ensure all join OpExprs are equality
exprs with a mergejoinable opno (copy what left join removal check
with the opno used). Ensure Vars used in the OpExpr have the same
attrno on each side.
  b. For bonus points don't reject non-Vars used in the join
condition, but ensure they're equal and there are no non-immutable
functions inside.
  c. Ensure relation_has_unique_index_for returns true for the Vars
(and Exprs if doing b) used in the join condition.
  d. Choose the relation with the highest relid and rewrite the parse
changing the varno of all Vars to use the one of the relation with the
lowest relid.
  e. list_concat baserestictinfos from removed relation onto other relation.
  f. Check for Vars in the join condition that can contain NULLs and
lappend IS NOT NULLs into the baserestrictinfo. (Removing the join
could have removed NULL filtering)
  g. Mark highest relid relation as DEAD. (See what the left join
removal code does (you may need to do some extra work around
equivalence classes))


--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Removing unneeded self joins

David Rowley-3
In reply to this post by Simon Riggs
On 17 May 2018 at 08:44, Simon Riggs <[hidden email]> wrote:
> What I was advocating was an approach that varies according to the
> query cost, so we don't waste time trying to tune the heck out of OLTP
> queries, but for larger queries we might take a more considered
> approach.

That's tricky. If we do this, it should be done before Path
generation, so not much is known about the costs in those case.

Perhaps something can be done by looking at the number of relpages,
but I've no idea what that would be. Perhaps we need to see how costly
this operation is first before we try to think of ways to only apply
it conditionally?

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Removing unneeded self joins

Alexander Kuzmenkov
In reply to this post by David Rowley-3
David,

Many thanks for the detailed explanation. I'll try to code it up and
measure how much overhead it introduces.

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Reply | Threaded
Open this post in threaded view
|

Re: Removing unneeded self joins

Thomas Munro-3
In reply to this post by Alexander Kuzmenkov
On Thu, May 17, 2018 at 3:43 AM, Alexander Kuzmenkov
<[hidden email]> wrote:
> There is a join optimization we don't do -- removing inner join of a table
> with itself on a unique column. Such joins are generated by various ORMs, so
> from time to time our customers ask us to look into this. Most recently, it
> was discussed on the list in relation to an article comparing the
> optimizations that some DBMS make [1].
>
> ...
>
> I'd be glad to hear your thoughts on this.

+1

Some thoughts:

There might be some interesting corner cases involving self-joins in
UPDATE/DELETE statements, and also FOR UPDATE etc.  Those can result
in some surprising results in a self-join (one side is subject to EPQ
and the other isn't) which I think might be changed by your patch
(though I didn't try it or read the patch very closely).

IIUC in DB2 (the clear winner at join elimination in the article you
mentioned), you get these sorts of things by default (optimisation
level 5 includes it), but not if you SET CURRENT QUERY OPTIMIZATION =
3 as many articles recommend for OLTP work.  I think it's interesting
that they provide that knob rather than something automatic, and
interesting that there is one linear knob to classify your workload
rather than N knobs for N optimisations.

--
Thomas Munro
http://www.enterprisedb.com

Reply | Threaded
Open this post in threaded view
|

Re: Removing unneeded self joins

Andres Freund
In reply to this post by David Rowley-3
HI,

On 2018-05-17 08:48:58 +1200, David Rowley wrote:

> On 17 May 2018 at 08:44, Simon Riggs <[hidden email]> wrote:
> > What I was advocating was an approach that varies according to the
> > query cost, so we don't waste time trying to tune the heck out of OLTP
> > queries, but for larger queries we might take a more considered
> > approach.
>
> That's tricky. If we do this, it should be done before Path
> generation, so not much is known about the costs in those case.
>
> Perhaps something can be done by looking at the number of relpages,
> but I've no idea what that would be. Perhaps we need to see how costly
> this operation is first before we try to think of ways to only apply
> it conditionally?

I'm also not buying that this isn't a benefit in OLTP in general. Sure,
for a single query RTT costs are going to dominate, but if you use
prepared statements the costs are going to pay of over multiple
executions.  Even just avoiding initializing unnecessary executor nodes
shows up in profiles.

Greetings,

Andres Freund

Reply | Threaded
Open this post in threaded view
|

Re: Removing unneeded self joins

Tom Lane-2
In reply to this post by David Rowley-3
David Rowley <[hidden email]> writes:
> On 17 May 2018 at 08:44, Simon Riggs <[hidden email]> wrote:
>> What I was advocating was an approach that varies according to the
>> query cost, so we don't waste time trying to tune the heck out of OLTP
>> queries, but for larger queries we might take a more considered
>> approach.

> That's tricky. If we do this, it should be done before Path
> generation, so not much is known about the costs in those case.

Yeah.  It'd have to be a very heuristic thing that doesn't account
for much beyond the number of relations in the query, and maybe their
sizes --- although I don't think we even know the latter at the
point where join removal would be desirable.  (And note that one of
the desirable benefits of join removal is not having to find out the
sizes of removed rels ... so just swapping that around doesn't appeal.)

                        regards, tom lane

Reply | Threaded
Open this post in threaded view
|

Re: Removing unneeded self joins

David Rowley-3
On 17 May 2018 at 10:13, Tom Lane <[hidden email]> wrote:
> Yeah.  It'd have to be a very heuristic thing that doesn't account
> for much beyond the number of relations in the query, and maybe their
> sizes --- although I don't think we even know the latter at the
> point where join removal would be desirable.  (And note that one of
> the desirable benefits of join removal is not having to find out the
> sizes of removed rels ... so just swapping that around doesn't appeal.)

There's probably some argument for delaying obtaining the relation
size until after join removal and probably partition pruning too, but
it's currently done well before that in build_simple_rel, where the
RelOptInfo is built.


--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Removing unneeded self joins

Tom Lane-2
In reply to this post by Thomas Munro-3
Thomas Munro <[hidden email]> writes:
> IIUC in DB2 (the clear winner at join elimination in the article you
> mentioned), you get these sorts of things by default (optimisation
> level 5 includes it), but not if you SET CURRENT QUERY OPTIMIZATION =
> 3 as many articles recommend for OLTP work.  I think it's interesting
> that they provide that knob rather than something automatic, and
> interesting that there is one linear knob to classify your workload
> rather than N knobs for N optimisations.

There's a lot to be said for that type of approach, as opposed to trying
to drive it off some necessarily-very-inexact preliminary estimate of
query cost.  For example, the mere fact that you're joining giant tables
doesn't in itself suggest that extra efforts in query optimization will be
repaid.  (If anything, it seems more likely that the user would've avoided
silliness like useless self-joins in such a case.)

A different line of thought is that, to me, the most intellectually
defensible rationale for efforts like const-simplification and join
removal is that opportunities for those things can arise after view
expansion, even in queries where the original query text didn't seem
to contain anything extraneous.  (Robert and Andres alluded to this
upthread, but not very clearly.)  So maybe we could track how much
the query got changed during rewriting, and use that to drive the
planner's decisions about how hard to work later on.  But I'm not
very sure that this'd be superior to having a user-visible knob.

                        regards, tom lane

Reply | Threaded
Open this post in threaded view
|

Re: Removing unneeded self joins

Andres Freund
On 2018-05-16 18:37:11 -0400, Tom Lane wrote:

> Thomas Munro <[hidden email]> writes:
> > IIUC in DB2 (the clear winner at join elimination in the article you
> > mentioned), you get these sorts of things by default (optimisation
> > level 5 includes it), but not if you SET CURRENT QUERY OPTIMIZATION =
> > 3 as many articles recommend for OLTP work.  I think it's interesting
> > that they provide that knob rather than something automatic, and
> > interesting that there is one linear knob to classify your workload
> > rather than N knobs for N optimisations.
>
> There's a lot to be said for that type of approach, as opposed to trying
> to drive it off some necessarily-very-inexact preliminary estimate of
> query cost.  For example, the mere fact that you're joining giant tables
> doesn't in itself suggest that extra efforts in query optimization will be
> repaid.  (If anything, it seems more likely that the user would've avoided
> silliness like useless self-joins in such a case.)

For prepared statements we could also start making more expensive
optimizations after the first execution, when we know how long the query
took / how expensive it was (also, if we had a plan cache...).

Greetings,

Andres Freund

Reply | Threaded
Open this post in threaded view
|

Re: Removing unneeded self joins

David Rowley-3
In reply to this post by Tom Lane-2
On 17 May 2018 at 10:37, Tom Lane <[hidden email]> wrote:

> Thomas Munro <[hidden email]> writes:
>> IIUC in DB2 (the clear winner at join elimination in the article you
>> mentioned), you get these sorts of things by default (optimisation
>> level 5 includes it), but not if you SET CURRENT QUERY OPTIMIZATION =
>> 3 as many articles recommend for OLTP work.  I think it's interesting
>> that they provide that knob rather than something automatic, and
>> interesting that there is one linear knob to classify your workload
>> rather than N knobs for N optimisations.
>
> There's a lot to be said for that type of approach, as opposed to trying
> to drive it off some necessarily-very-inexact preliminary estimate of
> query cost.  For example, the mere fact that you're joining giant tables
> doesn't in itself suggest that extra efforts in query optimization will be
> repaid.  (If anything, it seems more likely that the user would've avoided
> silliness like useless self-joins in such a case.)
>
> A different line of thought is that, to me, the most intellectually
> defensible rationale for efforts like const-simplification and join
> removal is that opportunities for those things can arise after view
> expansion, even in queries where the original query text didn't seem
> to contain anything extraneous.  (Robert and Andres alluded to this
> upthread, but not very clearly.)  So maybe we could track how much
> the query got changed during rewriting, and use that to drive the
> planner's decisions about how hard to work later on.  But I'm not
> very sure that this'd be superior to having a user-visible knob.

This seems like a good line of thought.  Perhaps a knob is a good
first step, then maybe having the ability to set that knob to
"automatic" is something to aspire for later.

I don't think Alexander should work on this as part of this patch
though. Perhaps we can re-evaluate when Alexander posts some planner
benchmarks from the patch.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Removing unneeded self joins

Tom Lane-2
In reply to this post by David Rowley-3
David Rowley <[hidden email]> writes:
> On 17 May 2018 at 10:13, Tom Lane <[hidden email]> wrote:
>> Yeah.  It'd have to be a very heuristic thing that doesn't account
>> for much beyond the number of relations in the query, and maybe their
>> sizes --- although I don't think we even know the latter at the
>> point where join removal would be desirable.  (And note that one of
>> the desirable benefits of join removal is not having to find out the
>> sizes of removed rels ... so just swapping that around doesn't appeal.)

> There's probably some argument for delaying obtaining the relation
> size until after join removal and probably partition pruning too, but
> it's currently done well before that in build_simple_rel, where the
> RelOptInfo is built.

Yeah, but that's something we ought to fix someday; IMO it's an artifact
of having wedged in remove_useless_joins without doing the extensive
refactoring that'd be needed to do it at a more desirable time.  I don't
want to build user-visible behavior that's dependent on doing that wrong.

(But wait a second ... we could improve this without quite that much work:
instead of doing estimate_rel_size immediately during get_relation_info,
couldn't it be left until the set_base_rel_sizes pass?  Since
RelationGetNumberOfBlocks involves kernel calls, skipping it for removed
rels seems worth doing.)

                        regards, tom lane

Reply | Threaded
Open this post in threaded view
|

Re: Removing unneeded self joins

Andres Freund
On 2018-05-16 18:55:41 -0400, Tom Lane wrote:

> David Rowley <[hidden email]> writes:
> > On 17 May 2018 at 10:13, Tom Lane <[hidden email]> wrote:
> >> Yeah.  It'd have to be a very heuristic thing that doesn't account
> >> for much beyond the number of relations in the query, and maybe their
> >> sizes --- although I don't think we even know the latter at the
> >> point where join removal would be desirable.  (And note that one of
> >> the desirable benefits of join removal is not having to find out the
> >> sizes of removed rels ... so just swapping that around doesn't appeal.)
>
> > There's probably some argument for delaying obtaining the relation
> > size until after join removal and probably partition pruning too, but
> > it's currently done well before that in build_simple_rel, where the
> > RelOptInfo is built.
>
> Yeah, but that's something we ought to fix someday; IMO it's an artifact
> of having wedged in remove_useless_joins without doing the extensive
> refactoring that'd be needed to do it at a more desirable time.  I don't
> want to build user-visible behavior that's dependent on doing that wrong.

My patch that introduced a radix tree buffer mapping also keeps an
accurate relation size in memory, making it far cheaper to use. While I
depriorized the patchset for the moment (I'll post what I'm working on
first soon), that should address some of the cost till then.

Wonder if we shouldn't just cache an estimated relation size in the
relcache entry till then. For planning purposes we don't need to be
accurate, and usually activity that drastically expands relation size
will trigger relcache activity before long. Currently there's plenty
workloads where the lseeks(SEEK_END) show up pretty prominently.

Greetings,

Andres Freund

12
Previous Thread Next Thread