Rangejoin rebased

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

Rangejoin rebased

Jeff Davis-8
New rangejoin patch attached.

I had previously attempted to make this work well for multiple range
join keys, but this patch implements single rangejoin keys only, and
the rest of the rangejoin clauses are effectively just rechecks. I
believe it can be made effective for multiple rangejoin keys, but at
the cost of additional complexity which is neither justified nor
implemented at this point.

Regards,
     Jeff Davis

rangejoin20171229.diff (73K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Rangejoin rebased

Simon Riggs
On 29 December 2017 at 18:25, Jeff Davis <[hidden email]> wrote:
> New rangejoin patch attached.
>
> I had previously attempted to make this work well for multiple range
> join keys, but this patch implements single rangejoin keys only, and
> the rest of the rangejoin clauses are effectively just rechecks. I
> believe it can be made effective for multiple rangejoin keys, but at
> the cost of additional complexity which is neither justified nor
> implemented at this point.

For this to be useful, it needs to include some details of how to use
it when people have NOT used range datatypes in their tables.

If we can see some examples with StartDate and EndDate cast to a
tzrange, plus some "don't write it like this" anti-patterns that would
help.

We can then make it clear that this is a huge performance benefit for
these important cases.

Just to emphasise why we want this, it might be better for the EXPLAIN
to say "Time Range Join" when the ranges being joined are Time Ranges,
and for other cases to just say "Range Join". The use of the word
Merge doesn't help much there.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Rangejoin rebased

Jeff Davis-8
On Sat, Jan 6, 2018 at 10:38 AM, Simon Riggs <[hidden email]> wrote:
> For this to be useful, it needs to include some details of how to use
> it when people have NOT used range datatypes in their tables.

Good idea. I added an example that doesn't have range types in the base table.

> If we can see some examples with StartDate and EndDate cast to a
> tzrange, plus some "don't write it like this" anti-patterns that would
> help.

By "anti-patterns", I assume you mean implementing range intersection
by hand in SQL over non-range types. Such an example would be quite a
long query, and might add to the confusion -- it seems strange to
spend more text explaining what *not* to do than what to do.

I see what you are saying, but perhaps I'm just not coming up with the
right text to explain it succinctly in the manual, so for now I just
added one example. Let me know if you think it needs more.

> We can then make it clear that this is a huge performance benefit for
> these important cases.

Done.

> Just to emphasise why we want this, it might be better for the EXPLAIN
> to say "Time Range Join" when the ranges being joined are Time Ranges,
> and for other cases to just say "Range Join". The use of the word
> Merge doesn't help much there.

I don't care for special-casing the word "time" in there, because no
other type would benefit. It seems a little too magical. I also do
like leaving "merge" in there because it helps the user understand why
their inputs are being sorted.

Regards,
     Jeff Davis

rangejoin20180109.diff (76K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Rangejoin rebased

Simon Riggs
On 10 January 2018 at 04:24, Jeff Davis <[hidden email]> wrote:
> On Sat, Jan 6, 2018 at 10:38 AM, Simon Riggs <[hidden email]> wrote:
>> For this to be useful, it needs to include some details of how to use
>> it when people have NOT used range datatypes in their tables.
>
> Good idea. I added an example that doesn't have range types in the base table.

Cool, thanks

...


It would be useful to consider any related use cases.
Are there applications for range operators other than &&?

Do we optimize for TIMESTAMP <@ RANGE as well?

Does this link in nicely with partition-aware joins?
Does it allow partition exclusion if you join a daterange to a time
range partitioned table?

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Rangejoin rebased

Simon Riggs
In reply to this post by Jeff Davis-8
On 10 January 2018 at 04:24, Jeff Davis <[hidden email]> wrote:

> Done.

I think you need to make changes to other parts of the docs also, so
that it is clear what will now be possible

https://www.postgresql.org/docs/devel/static/using-explain.html
https://www.postgresql.org/docs/devel/static/xoper-optimization.html#id-1.8.3.17.9
https://www.postgresql.org/docs/devel/static/planner-optimizer.html

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Rangejoin rebased

Alexander Kuzmenkov
In reply to this post by Jeff Davis-8
Hi Jeff,

Just a quick comment -- I ran a slightly modified version of a query
from the regression tests, and got an assertion failure:

select i1, ir1, i2, ir2
   from (select * from rangejoin_left order by ir1 desc) as a1 inner
join (select * from rangejoin_right order by ir2 desc) as a2
     on (i1 = i2 and ir1 && ir2)
   order by ir1 desc, i1;
TRAP: FailedAssertion("!(!ssup->ssup_reverse)", File:
"/home/akuzmenkov/pgp-old/build/../postgrespro/src/backend/executor/nodeMergejoin.c",
Line: 492)

The sort order isn't right for the join, it seems. I remember having
similar troubles with my full merge join implementation. I tried
filtering unsuitable paths in try_mergejoin_path, but that was not quite
enough. The planner tries to use only one sort direction to limit the
number of path, so the path we need might not be there at all. This
optimization was added in commit 834ddc62, see right_merge_direction().
Sadly, I have no idea how to solve this.

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Reply | Threaded
Open this post in threaded view
|

Re: Rangejoin rebased

Robert Haas
In reply to this post by Jeff Davis-8
On Tue, Jan 9, 2018 at 11:24 PM, Jeff Davis <[hidden email]> wrote:
>> Just to emphasise why we want this, it might be better for the EXPLAIN
>> to say "Time Range Join" when the ranges being joined are Time Ranges,
>> and for other cases to just say "Range Join". The use of the word
>> Merge doesn't help much there.
>
> I don't care for special-casing the word "time" in there, because no
> other type would benefit. It seems a little too magical. I also do
> like leaving "merge" in there because it helps the user understand why
> their inputs are being sorted.

+1.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Reply | Threaded
Open this post in threaded view
|

Re: Rangejoin rebased

Jeff Davis-8
In reply to this post by Alexander Kuzmenkov
On Fri, Jan 12, 2018 at 11:02 AM, Alexander Kuzmenkov
<[hidden email]> wrote:
> The sort order isn't right for the join, it seems. I remember having similar
> troubles with my full merge join implementation. I tried filtering
> unsuitable paths in try_mergejoin_path, but that was not quite enough. The
> planner tries to use only one sort direction to limit the number of path, so
> the path we need might not be there at all. This optimization was added in
> commit 834ddc62, see right_merge_direction(). Sadly, I have no idea how to
> solve this.

Interesting problem, thank you. My first reaction was to hack
right_merge_direction() to recognize the range opfamily in the ASC
direction as always potentially useful. But what's happening is that,
when the inputs are already forced to be sorted DESC, as in your
example, there is still no appropriate pathkey. So the problem is
reproducible with no ORDER BY clause at all.

My proposed fix is to make an internal opfamily identical to the
external one, such that it's not recognized as part of the same EC,
and the planner won't try to eliminate it. It loses out on potential
optimizations, but those are mostly theoretical since the btree
opclass ordering for ranges is not very interesting to a user.

I notice that window functions seem to handle these cases better,
maybe that approach would work for your full join patch? I haven't
investigated that yet.

     Regards,
           Jeff Davis

Reply | Threaded
Open this post in threaded view
|

Re: Rangejoin rebased

Jeff Davis-8
In reply to this post by Simon Riggs
On Wed, Jan 10, 2018 at 7:49 AM, Simon Riggs <[hidden email]> wrote:
> Do we optimize for TIMESTAMP <@ RANGE as well?

Not currently. It requires a little extra complexity because empty
ranges will match anything and need special handling.

> Does this link in nicely with partition-aware joins?

Yes: if the partitioning is on a non-range column, and the join key
includes both the partition key and a range column, it can do
partition-wise joins.

It does not try to invent a concept of partitioning on a spatial key.

> Does it allow partition exclusion if you join a daterange to a time
> range partitioned table?

I'm a little unclear what you mean here. Are you talking about spatial
partitioning? Or are you talking about joining a daterange column to a
timestamptz column (I suppose using @>)? I think the answer to your
question is "no", but let me know if I am missing an important case.

    Regards,
         Jeff Davis

Previous Thread Next Thread