Rangejoin rebased

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
19 messages Options
Reply | Threaded
Open this post in threaded view
|

Rangejoin rebased

Jeff Davis-8
New rangejoin patch attached.

I had previously attempted to make this work well for multiple range
join keys, but this patch implements single rangejoin keys only, and
the rest of the rangejoin clauses are effectively just rechecks. I
believe it can be made effective for multiple rangejoin keys, but at
the cost of additional complexity which is neither justified nor
implemented at this point.

Regards,
     Jeff Davis

rangejoin20171229.diff (73K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Rangejoin rebased

Simon Riggs
On 29 December 2017 at 18:25, Jeff Davis <[hidden email]> wrote:
> New rangejoin patch attached.
>
> I had previously attempted to make this work well for multiple range
> join keys, but this patch implements single rangejoin keys only, and
> the rest of the rangejoin clauses are effectively just rechecks. I
> believe it can be made effective for multiple rangejoin keys, but at
> the cost of additional complexity which is neither justified nor
> implemented at this point.

For this to be useful, it needs to include some details of how to use
it when people have NOT used range datatypes in their tables.

If we can see some examples with StartDate and EndDate cast to a
tzrange, plus some "don't write it like this" anti-patterns that would
help.

We can then make it clear that this is a huge performance benefit for
these important cases.

Just to emphasise why we want this, it might be better for the EXPLAIN
to say "Time Range Join" when the ranges being joined are Time Ranges,
and for other cases to just say "Range Join". The use of the word
Merge doesn't help much there.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Rangejoin rebased

Jeff Davis-8
On Sat, Jan 6, 2018 at 10:38 AM, Simon Riggs <[hidden email]> wrote:
> For this to be useful, it needs to include some details of how to use
> it when people have NOT used range datatypes in their tables.

Good idea. I added an example that doesn't have range types in the base table.

> If we can see some examples with StartDate and EndDate cast to a
> tzrange, plus some "don't write it like this" anti-patterns that would
> help.

By "anti-patterns", I assume you mean implementing range intersection
by hand in SQL over non-range types. Such an example would be quite a
long query, and might add to the confusion -- it seems strange to
spend more text explaining what *not* to do than what to do.

I see what you are saying, but perhaps I'm just not coming up with the
right text to explain it succinctly in the manual, so for now I just
added one example. Let me know if you think it needs more.

> We can then make it clear that this is a huge performance benefit for
> these important cases.

Done.

> Just to emphasise why we want this, it might be better for the EXPLAIN
> to say "Time Range Join" when the ranges being joined are Time Ranges,
> and for other cases to just say "Range Join". The use of the word
> Merge doesn't help much there.

I don't care for special-casing the word "time" in there, because no
other type would benefit. It seems a little too magical. I also do
like leaving "merge" in there because it helps the user understand why
their inputs are being sorted.

Regards,
     Jeff Davis

rangejoin20180109.diff (76K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Rangejoin rebased

Simon Riggs
On 10 January 2018 at 04:24, Jeff Davis <[hidden email]> wrote:
> On Sat, Jan 6, 2018 at 10:38 AM, Simon Riggs <[hidden email]> wrote:
>> For this to be useful, it needs to include some details of how to use
>> it when people have NOT used range datatypes in their tables.
>
> Good idea. I added an example that doesn't have range types in the base table.

Cool, thanks

...


It would be useful to consider any related use cases.
Are there applications for range operators other than &&?

Do we optimize for TIMESTAMP <@ RANGE as well?

Does this link in nicely with partition-aware joins?
Does it allow partition exclusion if you join a daterange to a time
range partitioned table?

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Rangejoin rebased

Simon Riggs
In reply to this post by Jeff Davis-8
On 10 January 2018 at 04:24, Jeff Davis <[hidden email]> wrote:

> Done.

I think you need to make changes to other parts of the docs also, so
that it is clear what will now be possible

https://www.postgresql.org/docs/devel/static/using-explain.html
https://www.postgresql.org/docs/devel/static/xoper-optimization.html#id-1.8.3.17.9
https://www.postgresql.org/docs/devel/static/planner-optimizer.html

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Rangejoin rebased

Alexander Kuzmenkov
In reply to this post by Jeff Davis-8
Hi Jeff,

Just a quick comment -- I ran a slightly modified version of a query
from the regression tests, and got an assertion failure:

select i1, ir1, i2, ir2
   from (select * from rangejoin_left order by ir1 desc) as a1 inner
join (select * from rangejoin_right order by ir2 desc) as a2
     on (i1 = i2 and ir1 && ir2)
   order by ir1 desc, i1;
TRAP: FailedAssertion("!(!ssup->ssup_reverse)", File:
"/home/akuzmenkov/pgp-old/build/../postgrespro/src/backend/executor/nodeMergejoin.c",
Line: 492)

The sort order isn't right for the join, it seems. I remember having
similar troubles with my full merge join implementation. I tried
filtering unsuitable paths in try_mergejoin_path, but that was not quite
enough. The planner tries to use only one sort direction to limit the
number of path, so the path we need might not be there at all. This
optimization was added in commit 834ddc62, see right_merge_direction().
Sadly, I have no idea how to solve this.

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Reply | Threaded
Open this post in threaded view
|

Re: Rangejoin rebased

Robert Haas
In reply to this post by Jeff Davis-8
On Tue, Jan 9, 2018 at 11:24 PM, Jeff Davis <[hidden email]> wrote:
>> Just to emphasise why we want this, it might be better for the EXPLAIN
>> to say "Time Range Join" when the ranges being joined are Time Ranges,
>> and for other cases to just say "Range Join". The use of the word
>> Merge doesn't help much there.
>
> I don't care for special-casing the word "time" in there, because no
> other type would benefit. It seems a little too magical. I also do
> like leaving "merge" in there because it helps the user understand why
> their inputs are being sorted.

+1.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Reply | Threaded
Open this post in threaded view
|

Re: Rangejoin rebased

Jeff Davis-8
In reply to this post by Alexander Kuzmenkov
On Fri, Jan 12, 2018 at 11:02 AM, Alexander Kuzmenkov
<[hidden email]> wrote:
> The sort order isn't right for the join, it seems. I remember having similar
> troubles with my full merge join implementation. I tried filtering
> unsuitable paths in try_mergejoin_path, but that was not quite enough. The
> planner tries to use only one sort direction to limit the number of path, so
> the path we need might not be there at all. This optimization was added in
> commit 834ddc62, see right_merge_direction(). Sadly, I have no idea how to
> solve this.

Interesting problem, thank you. My first reaction was to hack
right_merge_direction() to recognize the range opfamily in the ASC
direction as always potentially useful. But what's happening is that,
when the inputs are already forced to be sorted DESC, as in your
example, there is still no appropriate pathkey. So the problem is
reproducible with no ORDER BY clause at all.

My proposed fix is to make an internal opfamily identical to the
external one, such that it's not recognized as part of the same EC,
and the planner won't try to eliminate it. It loses out on potential
optimizations, but those are mostly theoretical since the btree
opclass ordering for ranges is not very interesting to a user.

I notice that window functions seem to handle these cases better,
maybe that approach would work for your full join patch? I haven't
investigated that yet.

     Regards,
           Jeff Davis

Reply | Threaded
Open this post in threaded view
|

Re: Rangejoin rebased

Jeff Davis-8
In reply to this post by Simon Riggs
On Wed, Jan 10, 2018 at 7:49 AM, Simon Riggs <[hidden email]> wrote:
> Do we optimize for TIMESTAMP <@ RANGE as well?

Not currently. It requires a little extra complexity because empty
ranges will match anything and need special handling.

> Does this link in nicely with partition-aware joins?

Yes: if the partitioning is on a non-range column, and the join key
includes both the partition key and a range column, it can do
partition-wise joins.

It does not try to invent a concept of partitioning on a spatial key.

> Does it allow partition exclusion if you join a daterange to a time
> range partitioned table?

I'm a little unclear what you mean here. Are you talking about spatial
partitioning? Or are you talking about joining a daterange column to a
timestamptz column (I suppose using @>)? I think the answer to your
question is "no", but let me know if I am missing an important case.

    Regards,
         Jeff Davis

Reply | Threaded
Open this post in threaded view
|

Re: Rangejoin rebased

Simon Riggs
On 17 January 2018 at 05:49, Jeff Davis <[hidden email]> wrote:
> On Wed, Jan 10, 2018 at 7:49 AM, Simon Riggs <[hidden email]> wrote:
>> Do we optimize for TIMESTAMP <@ RANGE as well?
>
> Not currently. It requires a little extra complexity because empty
> ranges will match anything and need special handling.

TIMESTAMP <@ RANGE is arguably more important than RANGE && RANGE

Trying to cast timestamp to range to make that work is a bit hokey

If the problem is just empty ranges, it seems like we should do that here also.

I'd be happy with the optimization only working if ranges are provably
non-empty, e.g. CHECK (NOT isempty(col))
Or perhaps we need non-empty types: e.g. tsrangene

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Rangejoin rebased

Simon Riggs
On 19 January 2018 at 08:25, Simon Riggs <[hidden email]> wrote:
> On 17 January 2018 at 05:49, Jeff Davis <[hidden email]> wrote:
>> On Wed, Jan 10, 2018 at 7:49 AM, Simon Riggs <[hidden email]> wrote:
>>> Do we optimize for TIMESTAMP <@ RANGE as well?
>>
>> Not currently. It requires a little extra complexity because empty
>> ranges will match anything and need special handling.

err... that isn't correct. An empty range matches nothing, so can be
ignored in joins.

So probably best to explain some more, please.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Rangejoin rebased

Jeff Davis-8
On Fri, Jan 19, 2018 at 2:07 AM, Simon Riggs <[hidden email]> wrote:
> err... that isn't correct. An empty range matches nothing, so can be
> ignored in joins.
>
> So probably best to explain some more, please.

The semantics of R1 @> R2 will return true if R1 is a non-NULL range
and R2 is empty.

It's set semantics, and all sets contain the empty set.

But I understand @> is an important case so I am looking into it.

Regards,
    Jeff Davis

Reply | Threaded
Open this post in threaded view
|

Re: Rangejoin rebased

Simon Riggs
On 23 January 2018 at 05:08, Jeff Davis <[hidden email]> wrote:

> On Fri, Jan 19, 2018 at 2:07 AM, Simon Riggs <[hidden email]> wrote:
>> err... that isn't correct. An empty range matches nothing, so can be
>> ignored in joins.
>>
>> So probably best to explain some more, please.
>
> The semantics of R1 @> R2 will return true if R1 is a non-NULL range
> and R2 is empty.
>
> It's set semantics, and all sets contain the empty set.

Understood

> But I understand @> is an important case so I am looking into it.

Perhaps we are misunderstanding each other

TIMESTAMP <@ RANGE1 doesn't match if RANGE1 is empty
and that is the most important case

RANGE OP RANGE is important also. It would be useful for OP to be more
than just &&

It's certainly weird that R1 @> EMPTY is true, but R1 && EMPTY is not.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Rangejoin rebased

Jeff Davis-8
On Tue, Jan 23, 2018 at 2:04 AM, Simon Riggs <[hidden email]> wrote:
> Perhaps we are misunderstanding each other
>
> TIMESTAMP <@ RANGE1 doesn't match if RANGE1 is empty
> and that is the most important case

When <@ is supported, that case should be fine if range1 is on the
outer. The case I was concerned about is with a R1 <@ R2 join where R1
is on the inner side and could have empty ranges.

One option would be to try to force R2 to be on the inner. But that
doesn't quite solve other related issues, like if R2 has a few large
ranges that contain almost everything.

Right now I'm considering an approach where we use some counters to
determine that a few ranges are preventing us from moving the mark
forward, and then move those few ranges into a separate tuplestore so
that we can move the mark forward.

> RANGE OP RANGE is important also. It would be useful for OP to be more
> than just &&

I agree that contains/contained-by are useful; do you have other
operators in mind as well?

>
> It's certainly weird that R1 @> EMPTY is true, but R1 && EMPTY is not.

This was discussed back in 9.2, and there were no obviously better
semantics available. I chose to follow set semantics: X contains Y
means that Y is a subset of X; X overlaps Y means that the
intersection of X and Y is nonempty.

I understand it can be surprising, but other definitions can be surprising, too.

Regards,
     Jeff Davis

Reply | Threaded
Open this post in threaded view
|

Re: Rangejoin rebased

Alexander Kuzmenkov
In reply to this post by Jeff Davis-8
On 16.01.2018 10:49, Jeff Davis wrote:
> My proposed fix is to make an internal opfamily identical to the
> external one, such that it's not recognized as part of the same EC,
> and the planner won't try to eliminate it. It loses out on potential
> optimizations, but those are mostly theoretical since the btree
> opclass ordering for ranges is not very interesting to a user.
I think I figured out what to do with missing sort directions. We can
change select_outer_pathkeys_for_merge() to generate the pathkeys we
need. Also, find_mergeclauses_for_outer_pathkeys() has to be changed
too, so that it knows which pathkeys are compatible to which range join
clauses.

About the patch, do I understand it right that you are working on the
next version now?

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Reply | Threaded
Open this post in threaded view
|

Re: Rangejoin rebased

Robert Haas
On Fri, Mar 2, 2018 at 11:12 AM, Alexander Kuzmenkov
<[hidden email]> wrote:

> On 16.01.2018 10:49, Jeff Davis wrote:
>> My proposed fix is to make an internal opfamily identical to the
>> external one, such that it's not recognized as part of the same EC,
>> and the planner won't try to eliminate it. It loses out on potential
>> optimizations, but those are mostly theoretical since the btree
>> opclass ordering for ranges is not very interesting to a user.
>
> I think I figured out what to do with missing sort directions. We can change
> select_outer_pathkeys_for_merge() to generate the pathkeys we need. Also,
> find_mergeclauses_for_outer_pathkeys() has to be changed too, so that it
> knows which pathkeys are compatible to which range join clauses.
>
> About the patch, do I understand it right that you are working on the next
> version now?

I think we are quite clearly past the deadline to submit a new patch
for inclusion in v11 at this point.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Reply | Threaded
Open this post in threaded view
|

Re: Rangejoin rebased

David Steele
On 3/2/18 11:44 AM, Robert Haas wrote:

> On Fri, Mar 2, 2018 at 11:12 AM, Alexander Kuzmenkov
> <[hidden email]> wrote:
>> On 16.01.2018 10:49, Jeff Davis wrote:
>>> My proposed fix is to make an internal opfamily identical to the
>>> external one, such that it's not recognized as part of the same EC,
>>> and the planner won't try to eliminate it. It loses out on potential
>>> optimizations, but those are mostly theoretical since the btree
>>> opclass ordering for ranges is not very interesting to a user.
>>
>> I think I figured out what to do with missing sort directions. We can change
>> select_outer_pathkeys_for_merge() to generate the pathkeys we need. Also,
>> find_mergeclauses_for_outer_pathkeys() has to be changed too, so that it
>> knows which pathkeys are compatible to which range join clauses.
>>
>> About the patch, do I understand it right that you are working on the next
>> version now?
>
> I think we are quite clearly past the deadline to submit a new patch
> for inclusion in v11 at this point.

It seems that a new patch is needed here but none has been presented.
I've marked this Waiting on Author for the moment, but I really think it
should be marked Returned with Feedback and submitted to the next CF
when a new patch is ready.

Regards,
--
-David
[hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Rangejoin rebased

Andres Freund
Hi,

On 2018-03-28 14:18:42 -0400, David Steele wrote:
> It seems that a new patch is needed here but none has been presented.
> I've marked this Waiting on Author for the moment, but I really think it
> should be marked Returned with Feedback and submitted to the next CF
> when a new patch is ready.

I'd just do so now. There's not been any progress for months, and
there's been an update request weeks ago...

Greetings,

Andres Freund

Reply | Threaded
Open this post in threaded view
|

Re: Rangejoin rebased

David Steele
On 3/28/18 2:23 PM, Andres Freund wrote:
> On 2018-03-28 14:18:42 -0400, David Steele wrote:
>> It seems that a new patch is needed here but none has been presented.
>> I've marked this Waiting on Author for the moment, but I really think it
>> should be marked Returned with Feedback and submitted to the next CF
>> when a new patch is ready.
>
> I'd just do so now. There's not been any progress for months, and
> there's been an update request weeks ago...
Done.

--
-David
[hidden email]

Previous Thread Next Thread