range_agg

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
83 messages Options
12345
Reply | Threaded
Open this post in threaded view
|

range_agg

Paul A Jungwirth
Hello,

I wrote an extension to add a range_agg function with similar behavior
to existing *_agg functions, and I'm wondering if folks would like to
have it in core? Here is the repo: https://github.com/pjungwir/range_agg

I'm also working on a patch for temporal foreign keys, and having
range_agg would make the FK check easier and faster, which is why I'd
like to get it added. But also it just seems useful, like array_agg,
json_agg, etc.

One question is how to aggregate ranges that would leave gaps and/or
overlaps. So in my extension there is a one-param version that forbids
gaps & overlaps, but I let you permit them by passing extra parameters,
so the signature is:

     range_agg(r anyrange, permit_gaps boolean, permit_overlaps boolean)

Perhaps another useful choice would be to return NULL if a gap/overlap
is found, so that each param would have three choices instead of just
two: accept the inputs, raise an error, return a NULL.

What do people think? I plan to work on a patch regardless, so that I
can use it for temporal FKs, but I'd appreciate some feedback on the
"user interface".

Thanks,

--
Paul              ~{:-)
[hidden email]


Reply | Threaded
Open this post in threaded view
|

Re: range_agg

David Fetter
On Fri, May 03, 2019 at 03:56:41PM -0700, Paul Jungwirth wrote:

> Hello,
>
> I wrote an extension to add a range_agg function with similar behavior to
> existing *_agg functions, and I'm wondering if folks would like to have it
> in core? Here is the repo: https://github.com/pjungwir/range_agg
>
> I'm also working on a patch for temporal foreign keys, and having range_agg
> would make the FK check easier and faster, which is why I'd like to get it
> added. But also it just seems useful, like array_agg, json_agg, etc.
>
> One question is how to aggregate ranges that would leave gaps and/or
> overlaps.

This suggests two different ways to extend ranges over aggregation:
one which is a union of (in general) disjoint intervals, two others
are a union of intervals, each of which has a weight.  Please pardon
the ASCII art.

The aggregation of:

[1, 4)
  [2, 5)
             [8, 10)

could turn into:

{[1,5), [8, 10)} (union without weight)

{{[1,2),1}, {[2,4),2}, {[4,5),1}, {[8,10),1}} (strictly positive weights which don't (in general) cover the space)

{{[1,2),1}, {[2,4),2}, {[4,5),1}, {[5,8),0}, {[8,10),1}} (non-negative weights which guarantee the space is covered)

There is no principled reason to choose one over the others.

> What do people think? I plan to work on a patch regardless, so that I can
> use it for temporal FKs, but I'd appreciate some feedback on the "user
> interface".

I think the cases above, or at least the first two of them, should be
available. They could be called range_agg, weighted_range_agg, and
covering_range_agg.

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Reply | Threaded
Open this post in threaded view
|

Re: range_agg

Corey Huinker
In reply to this post by Paul A Jungwirth
One question is how to aggregate ranges that would leave gaps and/or
overlaps. So in my extension there is a one-param version that forbids
gaps & overlaps, but I let you permit them by passing extra parameters,
so the signature is:

Perhaps a third way would be to allow and preserve the gaps.

A while back I wrote an extension called disjoint_date_range for storing sets of dates where it was assumed that most dates would be contiguous. The basic idea was that The core datatype was an array of ranges of dates, and with every modification you'd unnest them all to their discrete elements and use a window function to identify "runs" of dates and recompose them into a canonical set. It was an efficient way of representing "Every day last year except for June 2nd and August 4th, when we closed business for special events."

For arrays of ranges the principle is the same but it'd get a bit more tricky, you'd have to order by low bound, use window functions to detect adjacency/overlap to identify your runs, and the generate the canonical minimum set of ranges in your array.
Reply | Threaded
Open this post in threaded view
|

Re: range_agg

Paul A Jungwirth
On 5/4/19 3:11 PM, Corey Huinker wrote:
>     One question is how to aggregate ranges that would leave gaps and/or
>     overlaps. So in my extension there is a one-param version that forbids
>     gaps & overlaps, but I let you permit them by passing extra parameters,
>     so the signature is:
>
>
> Perhaps a third way would be to allow and preserve the gaps.

Thanks for the feedback! I think this is what I'm doing already
(returning an array of ranges), but let me know if I'm misunderstanding.
My extension has these signatures:

     range_agg(anyrange) returning anyrange
     range_agg(anyrange, bool) returning anyarray
     range_agg(anyrange, bool, bool) returning anyarray.

The first variant raises an error if there are gaps or overlaps and
always returns a single range, but the other two return an array of ranges.

I was planning to use the same signatures for my patch to pg, unless
someone thinks they should be different. But I'm starting to wonder if
they shouldn't *all* return arrays. I have two concrete use-cases for
these functions and they both require the array-returning versions. Is
it helpful to have a version that always returns a single range? Or
should I make them all consistent?

Thanks,

--
Paul              ~{:-)
[hidden email]


Reply | Threaded
Open this post in threaded view
|

Re: range_agg

Paul A Jungwirth
In reply to this post by David Fetter
On 5/3/19 6:41 PM, David Fetter wrote:
> This suggests two different ways to extend ranges over aggregation:
> one which is a union of (in general) disjoint intervals, two others
> are a union of intervals, each of which has a weight.
> . . .
> I think the cases above, or at least the first two of them, should be
> available. They could be called range_agg, weighted_range_agg, and
> covering_range_agg.

Thanks David! I think these two new functions make sense. Before I
implement them too I wonder if anyone else has uses for them?

Thanks,

--
Paul              ~{:-)
[hidden email]


Reply | Threaded
Open this post in threaded view
|

Re: range_agg

David Fetter
On Mon, May 06, 2019 at 11:19:27AM -0700, Paul Jungwirth wrote:

> On 5/3/19 6:41 PM, David Fetter wrote:
> > This suggests two different ways to extend ranges over aggregation:
> > one which is a union of (in general) disjoint intervals, two others
> > are a union of intervals, each of which has a weight.
> > . . .
> > I think the cases above, or at least the first two of them, should be
> > available. They could be called range_agg, weighted_range_agg, and
> > covering_range_agg.
>
> Thanks David! I think these two new functions make sense. Before I implement
> them too I wonder if anyone else has uses for them?

I suspect that if you build it, the will come, "they" being anyone who
has to schedule coverage, check usage of a resource over time, etc. Is
this something you want help with at some level? Coding, testing,
promoting...

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Reply | Threaded
Open this post in threaded view
|

Re: range_agg

Paul A Jungwirth
> I suspect that if you build it, the will come, "they" being anyone who
> has to schedule coverage, check usage of a resource over time, etc. Is
> this something you want help with at some level? Coding, testing,
> promoting...

You might be right. :-) Most of this is done already, since it was
largely copy/paste from my extension plus figuring out how to register
built-in functions with the .dat files. I need to write some docs and do
some cleanup and I'll have a CF entry. And I'll probably go ahead and
add your two suggestions too.... Things I'd love help with:

- Getting more opinions about the functions' interface, either from you
or others, especially:
   - In the extension I have a boolean param to let you accept gaps or
raise an error, and another for overlaps. But what about
accepting/raising/returning null? How should the parameters expose that?
Maybe leave them as bools but accept true/false/null for
permit/raise/nullify respectively? That seems like a questionable UI,
but I'm not sure what would be better. Maybe someone with better taste
can weigh in. :-) I tried to find existing built-in functions that gave
a enumeration of options like that but couldn't find an existing example.
   - Also: what do you think of the question I asked in my reply to
Corey? Is it better to have *all* range_agg functions return an array of
ranges, or it is nicer to have a variant that always returns a single range?
- Getting it reviewed.
- Advice about sequencing it with respect to my temporal foreign keys
patch, where I'm planning to call range_agg to check an FK. E.g. should
my FK patch be a diff on top of the range_agg code? I assume they should
have separate CF entries though?

Oh and here's something specific:

- I gave oids to my new functions starting with 8000, because I thought
I saw some discussion about that recently, and the final committer will
correct the oids to the current n+1? But I can't find that discussion
anymore, so if that's the wrong approach let me know.

Thanks!

--
Paul              ~{:-)
[hidden email]


Reply | Threaded
Open this post in threaded view
|

Re: range_agg

Paul A Jungwirth
On Mon, May 6, 2019 at 4:21 PM Paul Jungwirth
<[hidden email]> wrote:
> I need to write some docs and do
> some cleanup and I'll have a CF entry.

Here is an initial patch. I'd love to have some feedback! :-)

One challenge was handling polymorphism, since I want to have this:

    anyrange[] range_agg(anyrange, bool, bool)

But there is no anyrange[]. I asked about this back when I wrote my
extension too:

https://www.postgresql.org/message-id/CA%2BrenyVOjb4xQZGjdCnA54-1nzVSY%2B47-h4nkM-EP5J%3D1z%3Db9w%40mail.gmail.com

Like then, I handled it by overloading the function with separate
signatures for each built-in range type:

    int4range[] range_agg(int4range, bool, bool);
    int8range[] range_agg(int8range, bool, bool);
    numrange[] range_agg(numrange, bool, bool);
    tsrange[] range_agg(tsrange, bool, bool);
    tstzrange[] range_agg(tstzrange, bool, bool);
    daterange[] range_agg(daterange, bool, bool);

(And users can still define a range_agg for their own custom range
types using similar instructions to those in my extension's README.)

The problem was the opr_sanity regression test, which rejects
functions that share the same C-function implementation (roughly). I
took a few stabs at changing my code to pass that test, e.g. defining
separate wrapper functions for everything like this:

    Datum
    int4range_agg_transfn(PG_FUNCTION_ARGS) {
          return range_agg_transfn(fcinfo);
    }

But that felt like it was getting ridiculous (8 range types *
transfn+finalfn * 1-bool and 2-bool variants), so in the end I just
added my functions to the "permitted" output in opr_sanity. Let me
know if I should avoid that though.

Also I would still appreciate some bikeshedding over the functions' UI
since I don't feel great about it.

If the overall approach seems okay, I'm still open to adding David's
suggestions for weighted_range_agg and covering_range_agg.

Thanks!
Paul

range_agg_v1.patch (67K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: range_agg

Paul A Jungwirth
On Wed, May 8, 2019 at 9:54 PM Paul A Jungwirth
<[hidden email]> wrote:
> Here is an initial patch. I'd love to have some feedback! :-)

Here is a v2 rebased off current master. No substantive changes, but
it does fix one trivial git conflict.

After talking with David in Ottawa and hearing a good use-case from
one other person for his proposed weighted_range_agg and
covering_range_agg, I think *will* add those to this patch, but if
anyone wants to offer feedback on my approach so far, I'd appreciate
that too.

Yours,
Paul

range_agg_v0002.patch (67K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: range_agg

Jeff Davis-8
In reply to this post by Paul A Jungwirth
On Fri, 2019-05-03 at 15:56 -0700, Paul Jungwirth wrote:
> Hello,
>
> I wrote an extension to add a range_agg function with similar
> behavior
> to existing *_agg functions, and I'm wondering if folks would like
> to
> have it in core? Here is the repo:
> https://github.com/pjungwir/range_agg

This seems like a very useful extension, thank you.

For getting into core though, it should be a more complete set of
related operations. The patch is implicitly introducing the concept of
a "multirange" (in this case, an array of ranges), but it's not making
the concept whole.

What else should return a multirange? This patch implements the union-
agg of ranges, but we also might want an intersection-agg of ranges
(that is, the set of ranges that are subranges of every input). Given
that there are other options here, the name "range_agg" is too generic
to mean union-agg specifically.

What can we do with a multirange? A lot of range operators still make
sense, like "contains" or "overlaps"; but "adjacent" doesn't quite
work. What about new operations like inverting a multirange to get the
gaps?

Do we want to continue with the array-of-ranges implementation of a
multirange, or do we want a first-class multirange concept that might
eliminate the boilerplate around defining all of these operations?

If we have a more complete set of operators here, the flags for
handling overlapping ranges and gaps will be unnecessary.

Regards,
        Jeff Davis




Reply | Threaded
Open this post in threaded view
|

Re: range_agg

Alvaro Herrera-9
In reply to this post by Paul A Jungwirth
I noticed that this patch has a // comment about it segfaulting.  Did
you ever figure that out?  Is the resulting code the one you intend as
final?

Did you make any inroads regarding Jeff Davis' suggestion about
implementing "multiranges"?  I wonder if that's going to end up being a
Pandora's box.

Stylistically, the code does not match pgindent's choices very closely.
I can return a diff to a pgindented version of your v0002 for your
perusal, if it would be useful for you to learn its style.  (A committer
would eventually run pgindent himself[1], but it's good to have
submissions be at least close to the final form.)  BTW, I suggest "git
format-patch -vN" to prepare files for submission.


[1] No female committers yet ... is this 2019?

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: range_agg

Pavel Stehule
In reply to this post by Jeff Davis-8
Hi

út 2. 7. 2019 v 0:38 odesílatel Jeff Davis <[hidden email]> napsal:
On Fri, 2019-05-03 at 15:56 -0700, Paul Jungwirth wrote:
> Hello,
>
> I wrote an extension to add a range_agg function with similar
> behavior
> to existing *_agg functions, and I'm wondering if folks would like
> to
> have it in core? Here is the repo:
> https://github.com/pjungwir/range_agg

This seems like a very useful extension, thank you.

For getting into core though, it should be a more complete set of
related operations. The patch is implicitly introducing the concept of
a "multirange" (in this case, an array of ranges), but it's not making
the concept whole.

What else should return a multirange? This patch implements the union-
agg of ranges, but we also might want an intersection-agg of ranges
(that is, the set of ranges that are subranges of every input). Given
that there are other options here, the name "range_agg" is too generic
to mean union-agg specifically.

What can we do with a multirange? A lot of range operators still make
sense, like "contains" or "overlaps"; but "adjacent" doesn't quite
work. What about new operations like inverting a multirange to get the
gaps?

Do we want to continue with the array-of-ranges implementation of a
multirange, or do we want a first-class multirange concept that might
eliminate the boilerplate around defining all of these operations?

If we have a more complete set of operators here, the flags for
handling overlapping ranges and gaps will be unnecessary.

I think so scope of this patch is correct. Introduction of set of ranges data type based on a array or not, should be different topic.

The question is naming - should be this agg function named "range_agg", and multi range agg "multirange_agg"? Personally, I have not a problem with range_agg, and I have not a problem so it is based on union operation. It is true so only result of union can be implemented as range simply. When I though about multi range result, then there are really large set of possible algorithms how to do some operations over two, three values. So personally, I am satisfied with scope of simple range_agg functions, because I see a benefits, and I don't think so this implementation block any more complex designs in the future. There is really big questions how to implement multi range, and now I think so special data type will be better than possible unordered arrays.

Regards

Pavel



Regards,
        Jeff Davis




Reply | Threaded
Open this post in threaded view
|

Re: range_agg

Pavel Stehule
In reply to this post by Alvaro Herrera-9


čt 4. 7. 2019 v 20:34 odesílatel Alvaro Herrera <[hidden email]> napsal:
I noticed that this patch has a // comment about it segfaulting.  Did
you ever figure that out?  Is the resulting code the one you intend as
final?

Did you make any inroads regarding Jeff Davis' suggestion about
implementing "multiranges"?  I wonder if that's going to end up being a
Pandora's box.

Introduction of new datatype can be better, because we can ensure so data are correctly ordered 


Stylistically, the code does not match pgindent's choices very closely.
I can return a diff to a pgindented version of your v0002 for your
perusal, if it would be useful for you to learn its style.  (A committer
would eventually run pgindent himself[1], but it's good to have
submissions be at least close to the final form.)  BTW, I suggest "git
format-patch -vN" to prepare files for submission.

The first issue is unstable regress tests - there is a problem with opr_sanity

SELECT p1.oid, p1.proname, p2.oid, p2.proname
FROM pg_proc AS p1, pg_proc AS p2
WHERE p1.oid < p2.oid AND
    p1.prosrc = p2.prosrc AND
    p1.prolang = 12 AND p2.prolang = 12 AND
    (p1.prokind != 'a' OR p2.prokind != 'a') AND
    (p1.prolang != p2.prolang OR
     p1.prokind != p2.prokind OR
     p1.prosecdef != p2.prosecdef OR
     p1.proleakproof != p2.proleakproof OR
     p1.proisstrict != p2.proisstrict OR
     p1.proretset != p2.proretset OR
     p1.provolatile != p2.provolatile OR
     p1.pronargs != p2.pronargs)
ORDER BY p1.oid, p2.oid; -- requires explicit ORDER BY now

+   rangeTypeId = get_fn_expr_argtype(fcinfo->flinfo, 1);
+   if (!type_is_range(rangeTypeId))
+   {
+       ereport(ERROR, (errmsg("range_agg must be called with a range")));
+   }

???

+                   r1Str = "lastRange";
+                   r2Str = "currentRange";
+                   // TODO: Why is this segfaulting?:
+                   // r1Str = DatumGetCString(DirectFunctionCall1(range_out, RangeTypePGetDatum(lastRange)));
+                   // r2Str = DatumGetCString(DirectFunctionCall1(range_out, RangeTypePGetDatum(currentRange)));
+                   ereport(ERROR, (errmsg("range_agg: gap detected between %s and %s", r1Str, r2Str)));
+               }

???

The patch doesn't respect Postgres formatting

+# Making 2- and 3-param range_agg polymorphic is difficult
+# because it would take an anyrange and return an anyrange[],
+# which doesn't exist.
+# As a workaround we define separate functions for each built-in range type.
+# This is what causes the mess in src/test/regress/expected/opr_sanity.out.
+{ oid => '8003', descr => 'aggregate transition function',

This is strange. Does it means so range_agg will not work with custom range types?

I am not sure about implementation. I think so accumulating all ranges, sorting and processing on the end can be memory and CPU expensive.

Regards

Pavel




[1] No female committers yet ... is this 2019?

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: range_agg

Jeff Davis-8
In reply to this post by Pavel Stehule
On Fri, 2019-07-05 at 07:58 +0200, Pavel Stehule wrote:
> The question is naming - should be this agg function named
> "range_agg", and multi range agg "multirange_agg"? Personally, I have
> not a problem with range_agg, and I have not a problem so it is based
> on union operation. It is true so only result of union can be
> implemented as range simply. When I though about multi range result,
> then there are really large set of possible algorithms how to do some
> operations over two, three values.

Hi Pavel,

Can you explain in more detail? Would an intersection-based aggregate
be useful? If so, and we implement it in the future, what would we call
it?

Regards,
        Jeff Davis




Reply | Threaded
Open this post in threaded view
|

Re: range_agg

Paul A Jungwirth
In reply to this post by Jeff Davis-8
On Mon, Jul 1, 2019 at 3:38 PM Jeff Davis <[hidden email]> wrote:

>
> For getting into core though, it should be a more complete set of
> related operations. The patch is implicitly introducing the concept of
> a "multirange" (in this case, an array of ranges), but it's not making
> the concept whole.
>
> What else should return a multirange? This patch implements the union-
> agg of ranges, but we also might want an intersection-agg of ranges
> (that is, the set of ranges that are subranges of every input). Given
> that there are other options here, the name "range_agg" is too generic
> to mean union-agg specifically.

Thanks for the review!

I like the idea of adding a multirange type that works like range
types, although I'm not sure I want to build it. :-)

My own motivations for the range_agg patch are for temporal databases,
where I have two use-cases: checking temporal foreign keys [1] and
doing a Snodgrass "coalesce" operation [2]. The FKs use-case is more
important. For coalesce I just immediately UNNEST the array, so a
multirange would sort of "get in the way". It's no big deal if I can
cast a multirange to an array, although for something that would run
on every INSERT/UPDATE/DELETE I'd like to understand what the cast
would cost us in performance. But coalesce isn't part of SQL:2011 and
should be optional behavior or just something in an extension. The FKs
use case matters to me a lot more, and I think a multirange would be
fine for that. Also a multirange would solve the generics problem
Pavel asked about. (I'll say more about that in a reply to him.)

I'm not convinced that an intersection aggregate function for
multiranges would be used by anyone, but I don't mind building one.
Every other *_agg function has an "additive" sense, not a
"subtractive" sense. json{,b}_object_agg are the closest since you
*could* imagine intersection semantics for those, but they are unions.
So in terms of *naming* I think using "range_agg" for union semantics
is natural and would fit expectations. (I'm not the first to name this
function range_agg btw: [3]).

But there is clearly more than one worthwhile way to aggregate ranges:

- David suggested weighted_range_agg and covering_range_agg. At PGCon
2019 someone else said he has had to build something that was
essentially weighted_range_agg. I can see it used for
scheduling/capacity/max-utilization problems.
- You suggested an intersection range_agg.
- At [4] there is a missing_ranges function that only gives the *gaps*
between the input ranges.

Nonetheless I still think I would call the union behavior simply
range_agg, and then use weighted_range_agg, covering_range_agg,
intersection_range_agg, and missing_range_agg for the rest (assuming
we built them all). I'm not going to insist on any of that, but it's
what feels most user-friendly to me.

> What can we do with a multirange? A lot of range operators still make
> sense, like "contains" or "overlaps"; but "adjacent" doesn't quite
> work. What about new operations like inverting a multirange to get the
> gaps?

I can think of a lot of cool operators for `multirange \bigoplus
multirange` or `multirange \bigoplus range` (commuting of course). And
I've certainly wanted `range + range` and `range - range` in the past,
which would both return a multirange.

> If we have a more complete set of operators here, the flags for
> handling overlapping ranges and gaps will be unnecessary.

Both of my use cases should permit overlaps & gaps (range_agg(r, true,
true)), so I'm actually pretty okay with dropping the flags entirely
and just giving a one-param function that behavior. Or defining a
strict_range_agg that offers more control. But also I don't understand
how richer *operators* make the flags for the *aggregate* unnecessary.

So I really like the idea of multiranges, but I'm reluctant to take it
on myself, especially since this patch is just a utility function for
my other temporal patches. But I don't want my rush to leave a blemish
in our standard library either. But I think what really persuades me
to add multiranges is making a range_agg that more easily supports
user-defined range types. So how about I start on it and see how it
goes? I expect I can follow the existing code for range types pretty
closely, so maybe it won't be too hard.

Another option would be to rename my function range_array_agg (or
something) so that we are leaving space for a multirange function in
the future. I don't love this idea myself but it would could a Plan B.
What do you think of that?

Regards,
Paul

[1] With range_agg I can make the temporal FK check use similar SQL to
the non-temporal FK check:

    SELECT 1
    FROM [ONLY] <pktable> x
    WHERE pkatt1 = $1 [AND ...]
    FOR KEY SHARE OF x

vs

    SELECT 1
    FROM (
        SELECT range_agg(r, true, true) AS r
        FROM  (
            SELECT pkperiodatt AS r
            FROM [ONLY] pktable x
            WHERE pkatt1 = $1 [AND ...]
            FOR KEY SHARE OF x
        ) x1
    ) x2
    WHERE $n <@ x2.r[1]

(The temporal version could be simplified a bit if FOR KEY SHARE ever
supports aggregate functions.)

[2] page number 159ff in https://www2.cs.arizona.edu/~rts/tdbbook.pdf
(section 6.5.1, starts on page 183 of the PDF)

[3] https://git.proteus-tech.com/open-source/django-postgres/blob/fa91cf9b43ce942e84b1a9be22f445f3515ca360/postgres/sql/range_agg.sql

[4] https://schinckel.net/2014/11/18/aggregating-ranges-in-postgres/


Reply | Threaded
Open this post in threaded view
|

Re: range_agg

Paul A Jungwirth
In reply to this post by Alvaro Herrera-9
On Thu, Jul 4, 2019 at 11:34 AM Alvaro Herrera <[hidden email]> wrote:
>
> I noticed that this patch has a // comment about it segfaulting.  Did
> you ever figure that out?  Is the resulting code the one you intend as
> final?

Thanks for the review! I haven't revisited it but I'll see if I can
track it down. I consider this a WIP patch, not something final. (I
don't think Postgres likes C++-style comments, so anything that is //
marks something I consider needs more work.)

> Stylistically, the code does not match pgindent's choices very closely.
> I can return a diff to a pgindented version of your v0002 for your
> perusal, if it would be useful for you to learn its style.

Sorry about that, and thank you for making it easier for me to learn
how to do it the right way. :-)

Paul


Reply | Threaded
Open this post in threaded view
|

Re: range_agg

Paul A Jungwirth
In reply to this post by Pavel Stehule
On Fri, Jul 5, 2019 at 4:31 AM Pavel Stehule <[hidden email]> wrote:
> The first issue is unstable regress tests - there is a problem with opr_sanity

I would prefer to avoid needing to add anything to opr_sanity really.
A multirange would let me achieve that I think. But otherwise I'll add
the ordering. Thanks!

> +   rangeTypeId = get_fn_expr_argtype(fcinfo->flinfo, 1);
> +   if (!type_is_range(rangeTypeId))
> +   {
> +       ereport(ERROR, (errmsg("range_agg must be called with a range")));
> +   }

I think this was left over from my attempts to support user-defined
ranges. I believe I can remove it now.

> +# Making 2- and 3-param range_agg polymorphic is difficult
> +# because it would take an anyrange and return an anyrange[],
> +# which doesn't exist.
> +# As a workaround we define separate functions for each built-in range type.
> +# This is what causes the mess in src/test/regress/expected/opr_sanity.out.
> +{ oid => '8003', descr => 'aggregate transition function',
>
> This is strange. Does it means so range_agg will not work with custom range types?

You would have to define your own range_agg with your own custom type
in the signature. I gave instructions about that in my range_agg
extension (https://github.com/pjungwir/range_agg), but it's not as
easy with range_agg in core because I don't think you can define a
custom function that is backed by a built-in C function. At least I
couldn't get that to work.

The biggest argument for a multirange to me is that we could have
anymultirange, with similar rules as anyarray. That way I could have
`range_agg(r anyrange) RETURNS anymultirange`. [1] is a conversation
where I asked about doing this before multiranges were suggested. Also
I'm aware of your own recent work on polymorphic types at [2] but I
haven't read it closely enough to see if it would help me here. Do you
think it applies?

> I am not sure about implementation. I think so accumulating all ranges, sorting and processing on the end can be memory and CPU expensive.

I did some research and couldn't find any algorithm that was better
than O(n log n), but if you're aware of any I'd like to know about it.
Assuming we can't improve on the complexity bounds, I think a sort +
iteration is desirable because it keeps things simple and benefits
from past & future work on the sorting code. I care more about
optimizing time here than RAM, especially since we'll use this in FK
checks, where the inputs will rarely be more than a few and probably
never in the millions.

We especially want to avoid an O(n^2) algorithm, which would be easy
to stumble into if we naively accumulated the result as we go. For
example if we had multiranges you could imagine an implementation that
just did `result + r` for all inputs r. But that would have the same
n^2 pattern as looping over strcat.

We could use a tree to keep things sorted and accumulate as we go, but
the implementation would be more complicated and I think slower.

Thanks for the review!

Paul

[1] https://www.postgresql.org/message-id/CA%2BrenyVOjb4xQZGjdCnA54-1nzVSY%2B47-h4nkM-EP5J%3D1z%3Db9w%40mail.gmail.com

[2] https://www.postgresql.org/message-id/CAFj8pRDna7VqNi8gR+Tt2Ktmz0cq5G93guc3Sbn_NVPLdXAkqA@...


Reply | Threaded
Open this post in threaded view
|

Re: range_agg

Paul A Jungwirth
In reply to this post by Jeff Davis-8
On Mon, Jul 1, 2019 at 3:38 PM Jeff Davis <[hidden email]> wrote:
>
> The patch is implicitly introducing the concept of
> a "multirange" (in this case, an array of ranges),

I meant to say before: this patch always returns a sorted array, and I
think a multirange should always act as if sorted when we stringify it
or cast it to an array. If you disagree let me know. :-)

You could imagine that when returning arrays we rely on the caller to
do the sorting (range_agg(r ORDER BY r)) and otherwise give wrong
results. But hopefully everyone agrees that would not be nice. :-) So
even the array-returning version should always return a sorted array I
think. (I'm not sure anything else is really coherent or at least easy
to describe.)

Paul


Reply | Threaded
Open this post in threaded view
|

Re: range_agg

David Fetter
In reply to this post by Paul A Jungwirth
On Fri, Jul 05, 2019 at 09:58:02AM -0700, Paul A Jungwirth wrote:

> On Mon, Jul 1, 2019 at 3:38 PM Jeff Davis <[hidden email]> wrote:
> >
> > For getting into core though, it should be a more complete set of
> > related operations. The patch is implicitly introducing the concept of
> > a "multirange" (in this case, an array of ranges), but it's not making
> > the concept whole.
> >
> > What else should return a multirange? This patch implements the union-
> > agg of ranges, but we also might want an intersection-agg of ranges
> > (that is, the set of ranges that are subranges of every input). Given
> > that there are other options here, the name "range_agg" is too generic
> > to mean union-agg specifically.
>
> Thanks for the review!
>
> I like the idea of adding a multirange type that works like range
> types, although I'm not sure I want to build it. :-)
>
> My own motivations for the range_agg patch are for temporal databases,
> where I have two use-cases: checking temporal foreign keys [1] and
> doing a Snodgrass "coalesce" operation [2]. The FKs use-case is more
> important. For coalesce I just immediately UNNEST the array, so a
> multirange would sort of "get in the way". It's no big deal if I can
> cast a multirange to an array, although for something that would run
> on every INSERT/UPDATE/DELETE I'd like to understand what the cast
> would cost us in performance. But coalesce isn't part of SQL:2011 and
> should be optional behavior or just something in an extension. The FKs
> use case matters to me a lot more, and I think a multirange would be
> fine for that. Also a multirange would solve the generics problem
> Pavel asked about. (I'll say more about that in a reply to him.)
>
> I'm not convinced that an intersection aggregate function for
> multiranges would be used by anyone, but I don't mind building one.
> Every other *_agg function has an "additive" sense, not a
> "subtractive" sense. json{,b}_object_agg are the closest since you
> *could* imagine intersection semantics for those, but they are unions.
> So in terms of *naming* I think using "range_agg" for union semantics
> is natural and would fit expectations. (I'm not the first to name this
> function range_agg btw: [3]).
>
> But there is clearly more than one worthwhile way to aggregate ranges:
>
> - David suggested weighted_range_agg and covering_range_agg. At PGCon

If I understand the cases correctly, the combination of covering_range
and multi_range types covers all cases. To recap, covering_range_agg
assigns a weight, possibly 0, to each non-overlapping sub-range.  A
cast from covering_range to multi_range would meld adjacent ranges
with non-zero weights into single ranges in O(N) (N sub-ranges) time
and some teensy amount of memory for tracking progress of adjacent
ranges of non-zero weight.  Gaps would just be multi_range consisting
of the sub-ranges of the covering_range with weight 0, and wouldn't
require any tracking.

What have I missed?

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Reply | Threaded
Open this post in threaded view
|

Re: range_agg

Paul A Jungwirth
On Fri, Jul 5, 2019 at 10:45 AM David Fetter <[hidden email]> wrote:
> If I understand the cases correctly, the combination of covering_range
> and multi_range types covers all cases. To recap, covering_range_agg
> assigns a weight, possibly 0, to each non-overlapping sub-range.  A
> cast from covering_range to multi_range would meld adjacent ranges
> with non-zero weights into single ranges in O(N) (N sub-ranges) time
> and some teensy amount of memory for tracking progress of adjacent
> ranges of non-zero weight.  Gaps would just be multi_range consisting
> of the sub-ranges of the covering_range with weight 0, and wouldn't
> require any tracking.

I take it that a multirange contains of *disjoint* ranges, so instead
of {[1,2), [2,3), [6,7)} you'd have {[1,3), [6,7)}. Jeff does that
match your expectation?

I just realized that since weighted_range_agg and covering_range_agg
return tuples of (anyrange, integer) (maybe other numeric types too?),
their elements are *not ranges*, so they couldn't return a multirange.
They would have to return an array of those tuples.

I agree that if you had a pre-sorted list of weighted ranges (with or
without zero-weight elements), you could convert it to a multirange in
O(n) very easily.

Paul


12345