[PATCH] distinct aggregates within a window function WIP

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

[PATCH] distinct aggregates within a window function WIP

Krasiyan Andreev
Hi hackers,

I want to propose to you an old patch for Postgres 11, off-site developed by Oliver Ford,
but I have permission from him to publish it and to continue it's development,
that allow distinct aggregates, like select sum(distinct nums) within a window function.

I have rebased it for current git master branch and have made necessary changes to it to work with Postgres 13devel.

It's a WIP, because it doesn't have tests yet (I will add them later) and also, it works for a int, float, and numeric types,
but probably distinct check can be rewritten for possible performance improvement,
with storing the distinct elements in a hash table which should give a performance improvement.

If you find the implementation of patch acceptable from committers perspective,
I will answer to all yours design and review notes and will try to go ahead with it,
also, I will add this patch to the March commit fest.

For example usage of a patch, if you have time series data, with current Postgres you will get an error:

postgres=# CREATE TABLE t_demo AS
postgres-#     SELECT   ordinality, day, date_part('week', day) AS week
postgres-#     FROM    generate_series('2020-01-02', '2020-01-15', '1 day'::interval)
postgres-#             WITH ORDINALITY AS day;
SELECT 14
postgres=# SELECT * FROM t_demo;
 ordinality |          day           | week
------------+------------------------+------
          1 | 2020-01-02 00:00:00+02 |    1
          2 | 2020-01-03 00:00:00+02 |    1
          3 | 2020-01-04 00:00:00+02 |    1
          4 | 2020-01-05 00:00:00+02 |    1
          5 | 2020-01-06 00:00:00+02 |    2
          6 | 2020-01-07 00:00:00+02 |    2
          7 | 2020-01-08 00:00:00+02 |    2
          8 | 2020-01-09 00:00:00+02 |    2
          9 | 2020-01-10 00:00:00+02 |    2
         10 | 2020-01-11 00:00:00+02 |    2
         11 | 2020-01-12 00:00:00+02 |    2
         12 | 2020-01-13 00:00:00+02 |    3
         13 | 2020-01-14 00:00:00+02 |    3
         14 | 2020-01-15 00:00:00+02 |    3
(14 rows)

postgres=# SELECT   *,
postgres-#         array_agg(DISTINCT week) OVER (ORDER BY day ROWS
postgres(#                                 BETWEEN 2 PRECEDING AND 2 FOLLOWING)
postgres-# FROM    t_demo;
ERROR:  DISTINCT is not implemented for window functions
LINE 2:         array_agg(DISTINCT week) OVER (ORDER BY day ROWS
                ^

So you will need to write something like this:

postgres=# SELECT  *, (SELECT array_agg(DISTINCT unnest) FROM unnest(x)) AS b
postgres-# FROM
postgres-# (
postgres(#         SELECT  *,
postgres(#                 array_agg(week) OVER (ORDER BY day ROWS
postgres(#                         BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS x
postgres(#         FROM    t_demo
postgres(# ) AS a;
 ordinality |          day           | week |      x      |   b  
------------+------------------------+------+-------------+-------
          1 | 2020-01-02 00:00:00+02 |    1 | {1,1,1}     | {1}
          2 | 2020-01-03 00:00:00+02 |    1 | {1,1,1,1}   | {1}
          3 | 2020-01-04 00:00:00+02 |    1 | {1,1,1,1,2} | {1,2}
          4 | 2020-01-05 00:00:00+02 |    1 | {1,1,1,2,2} | {1,2}
          5 | 2020-01-06 00:00:00+02 |    2 | {1,1,2,2,2} | {1,2}
          6 | 2020-01-07 00:00:00+02 |    2 | {1,2,2,2,2} | {1,2}
          7 | 2020-01-08 00:00:00+02 |    2 | {2,2,2,2,2} | {2}
          8 | 2020-01-09 00:00:00+02 |    2 | {2,2,2,2,2} | {2}
          9 | 2020-01-10 00:00:00+02 |    2 | {2,2,2,2,2} | {2}
         10 | 2020-01-11 00:00:00+02 |    2 | {2,2,2,2,3} | {2,3}
         11 | 2020-01-12 00:00:00+02 |    2 | {2,2,2,3,3} | {2,3}
         12 | 2020-01-13 00:00:00+02 |    3 | {2,2,3,3,3} | {2,3}
         13 | 2020-01-14 00:00:00+02 |    3 | {2,3,3,3}   | {2,3}
         14 | 2020-01-15 00:00:00+02 |    3 | {3,3,3}     | {3}
(14 rows)

With attached version, you will get the desired results:

postgres=# SELECT   *,
postgres-#         array_agg(DISTINCT week) OVER (ORDER BY day ROWS
postgres(#                                 BETWEEN 2 PRECEDING AND 2 FOLLOWING)
postgres-# FROM    t_demo;
 ordinality |          day           | week | array_agg
------------+------------------------+------+-----------
          1 | 2020-01-02 00:00:00+02 |    1 | {1}
          2 | 2020-01-03 00:00:00+02 |    1 | {1}
          3 | 2020-01-04 00:00:00+02 |    1 | {1,2}
          4 | 2020-01-05 00:00:00+02 |    1 | {1,2}
          5 | 2020-01-06 00:00:00+02 |    2 | {1,2}
          6 | 2020-01-07 00:00:00+02 |    2 | {1,2}
          7 | 2020-01-08 00:00:00+02 |    2 | {2}
          8 | 2020-01-09 00:00:00+02 |    2 | {2}
          9 | 2020-01-10 00:00:00+02 |    2 | {2}
         10 | 2020-01-11 00:00:00+02 |    2 | {2,3}
         11 | 2020-01-12 00:00:00+02 |    2 | {2,3}
         12 | 2020-01-13 00:00:00+02 |    3 | {2,3}
         13 | 2020-01-14 00:00:00+02 |    3 | {2,3}
         14 | 2020-01-15 00:00:00+02 |    3 | {3}
(14 rows)






pg13-distinct-window.patch (16K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] distinct aggregates within a window function WIP

Tom Lane-2
Krasiyan Andreev <[hidden email]> writes:
> I want to propose to you an old patch for Postgres 11, off-site developed
> by Oliver Ford,
> but I have permission from him to publish it and to continue it's
> development,
> that allow distinct aggregates, like select sum(distinct nums) within a
> window function.

I started to respond by asking whether that's well-defined, but
reading down further I see that that's not actually what the feature
is: what it is is attaching DISTINCT to a window function itself.
I'd still ask whether it's well-defined though, or even minimally
sensible.  Window functions are generally supposed to produce one
row per input row --- how does that square with the implicit row
merging of DISTINCT?  They're also typically row-order-sensitive
--- how does that work with DISTINCT?  Also, to the extent that
this is sensible, can't you get the same results already today
with appropriate use of window framing options?

> It's a WIP, because it doesn't have tests yet (I will add them later) and
> also, it works for a int, float, and numeric types,

As a rule of thumb, operations like this should not be coded to be
datatype-specific.  We threw out some features in the original window
function patch until they could be rewritten to not be limited to a
hard-coded set of data types (cf commit 0a459cec9), and I don't see
why we'd apply a lesser standard here.  Certainly DISTINCT for
aggregates has no such limitation.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] distinct aggregates within a window function WIP

Thomas Kellerer-4
Tom Lane schrieb am 13.01.2020 um 15:19:

> what it is is attaching DISTINCT to a window function itself.
> I'd still ask whether it's well-defined though, or even minimally
> sensible.  Window functions are generally supposed to produce one
> row per input row --- how does that square with the implicit row
> merging of DISTINCT?  They're also typically row-order-sensitive
> --- how does that work with DISTINCT?  Also, to the extent that
> this is sensible, can't you get the same results already today
> with appropriate use of window framing options?

I find the example using array_agg() and cumulative window functions a
bit confusing as well, but I think there are situations where having this
is really helpful, e.g.:

   count(distinct some_column) over (partition by something)

I know it's not an argument, but Oracle supports this and porting
queries like that from Oracle to Postgres isn't really fun.

Thomas





Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] distinct aggregates within a window function WIP

Krasiyan Andreev
In reply to this post by Tom Lane-2
I understand yours note about datatype-specific operations, so I need to think more generic about it.
About yours additional note, I think that it is not possible to get easy the same result with appropriate use of window framing options,
because "exclude ties" will not exclude "current row" itself, only peers of it. So, that is the only difference and reason of DISTINCT aggregate.
Maybe, if we can specify at the same time to "exclude ties" and "exclude current row" itself, there will not be need of DISTINCT, but right now
I think that nor "exclude ties" nor "exclude groups" or "exclude current row", can specify it, because they can't be nested or used at the same time.

На пн, 13.01.2020 г. в 16:19 Tom Lane <[hidden email]> написа:
Krasiyan Andreev <[hidden email]> writes:
> I want to propose to you an old patch for Postgres 11, off-site developed
> by Oliver Ford,
> but I have permission from him to publish it and to continue it's
> development,
> that allow distinct aggregates, like select sum(distinct nums) within a
> window function.

I started to respond by asking whether that's well-defined, but
reading down further I see that that's not actually what the feature
is: what it is is attaching DISTINCT to a window function itself.
I'd still ask whether it's well-defined though, or even minimally
sensible.  Window functions are generally supposed to produce one
row per input row --- how does that square with the implicit row
merging of DISTINCT?  They're also typically row-order-sensitive
--- how does that work with DISTINCT?  Also, to the extent that
this is sensible, can't you get the same results already today
with appropriate use of window framing options?

> It's a WIP, because it doesn't have tests yet (I will add them later) and
> also, it works for a int, float, and numeric types,

As a rule of thumb, operations like this should not be coded to be
datatype-specific.  We threw out some features in the original window
function patch until they could be rewritten to not be limited to a
hard-coded set of data types (cf commit 0a459cec9), and I don't see
why we'd apply a lesser standard here.  Certainly DISTINCT for
aggregates has no such limitation.

                        regards, tom lane
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] distinct aggregates within a window function WIP

Vik Fearing-4
In reply to this post by Tom Lane-2
On 13/01/2020 15:19, Tom Lane wrote:

> Krasiyan Andreev <[hidden email]> writes:
>> I want to propose to you an old patch for Postgres 11, off-site developed
>> by Oliver Ford,
>> but I have permission from him to publish it and to continue it's
>> development,
>> that allow distinct aggregates, like select sum(distinct nums) within a
>> window function.
> I started to respond by asking whether that's well-defined, but
> reading down further I see that that's not actually what the feature
> is: what it is is attaching DISTINCT to a window function itself.
> I'd still ask whether it's well-defined though, or even minimally
> sensible.  Window functions are generally supposed to produce one
> row per input row --- how does that square with the implicit row
> merging of DISTINCT?  They're also typically row-order-sensitive
> --- how does that work with DISTINCT?  


It's a little strange because the spec says:


<q>
If the window ordering clause or the window framing clause of the window
structure descriptor that describes the <window name or specification>
is present, then no <aggregate function> simply contained in <window
function> shall specify DISTINCT or <ordered set function>.
</q>


So it seems to be well defined if all you have is a partition.


But then it also says:


<q>
DENSE_RANK() OVER WNS is equivalent to the <window function>:
    COUNT (DISTINCT ROW ( VE 1 , ..., VE N ) )
    OVER (WNS1 RANGE UNBOUNDED PRECEDING)
</q>


And that kind of looks like a framing clause there.


> Also, to the extent that
> this is sensible, can't you get the same results already today
> with appropriate use of window framing options?


I don't see how.


I have sometimes wanted this feature so I am +1 on us getting at least a
minimal form of it.

--

Vik



Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] distinct aggregates within a window function WIP

Krasiyan Andreev
I have currently suspended development of this patch, based on it's review,
but I will continue development of the other Oliver Ford's work about adding support of respect/ignore nulls
for lag(),lead(),first_value(),last_value() and nth_value() and from first/last for nth_value() patch,
but I am not sure how to proceed with it's implementation and any feedback will be very helpful.

I have dropped support of from first/last for nth_value(), but also I reimplemented it in a different way,
by using negative number for the position argument, to be able to get the same frame in exact reverse order.
After that patch becomes much more simple and major concerns about precedence hack has gone,
but maybe it can be additionally simplified.

I have not renamed special bool type "ignorenulls", because it is probably not acceptable way for calling extra version
of window functions (but it makes things very easy and it can reuse frames), but I removed the other special bool type "fromlast".

Attached file is for PostgreSQL 13 (master git branch) and I will add it now to a March commit fest, to be able to track changes.
Everything works and patch is in very good shape, all tests are passed and also, I use it from some time for SQL analysis purposes
(because ignore nulls is one of the most needed feature in OLAP/BI area and Oracle, Amazon Redshift, even Informix have it).

After patch review and suggestions about what to do with special bool type and unreserved keywords, I will reimplement it, if needed.

На пн, 13.01.2020 г. в 18:19 Vik Fearing <[hidden email]> написа:
On 13/01/2020 15:19, Tom Lane wrote:
> Krasiyan Andreev <[hidden email]> writes:
>> I want to propose to you an old patch for Postgres 11, off-site developed
>> by Oliver Ford,
>> but I have permission from him to publish it and to continue it's
>> development,
>> that allow distinct aggregates, like select sum(distinct nums) within a
>> window function.
> I started to respond by asking whether that's well-defined, but
> reading down further I see that that's not actually what the feature
> is: what it is is attaching DISTINCT to a window function itself.
> I'd still ask whether it's well-defined though, or even minimally
> sensible.  Window functions are generally supposed to produce one
> row per input row --- how does that square with the implicit row
> merging of DISTINCT?  They're also typically row-order-sensitive
> --- how does that work with DISTINCT? 


It's a little strange because the spec says:


<q>
If the window ordering clause or the window framing clause of the window
structure descriptor that describes the <window name or specification>
is present, then no <aggregate function> simply contained in <window
function> shall specify DISTINCT or <ordered set function>.
</q>


So it seems to be well defined if all you have is a partition.


But then it also says:


<q>
DENSE_RANK() OVER WNS is equivalent to the <window function>:
    COUNT (DISTINCT ROW ( VE 1 , ..., VE N ) )
    OVER (WNS1 RANGE UNBOUNDED PRECEDING)
</q>


And that kind of looks like a framing clause there.


> Also, to the extent that
> this is sensible, can't you get the same results already today
> with appropriate use of window framing options?


I don't see how.


I have sometimes wanted this feature so I am +1 on us getting at least a
minimal form of it.

--

Vik


pg13_ignore_nulls.patch (70K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] distinct aggregates within a window function WIP

Surafel Temesgen

On Thu, Mar 5, 2020 at 4:17 AM Krasiyan Andreev <[hidden email]> wrote:
I have currently suspended development of this patch, based on it's review,
but I will continue development of the other Oliver Ford's work about adding support of respect/ignore nulls
for lag(),lead(),first_value(),last_value() and nth_value() and from first/last for nth_value() patch,
but I am not sure how to proceed with it's implementation and any feedback will be very helpful.


* I applied your patch on top of 58c47ccfff20b8c125903 . It applied cleanly , compiled, make check pass, but it have white space errors:

*Added functions on windowfuncs.c have no comments so it's not easily understandable.

* Regression test addition seems huge to me. Can you reduce that? You can use existing tables and fewer records.

* I don’t understand why this patch has to change makeBoolAConst? It already make “bool” constant node


regards

Surafel

 
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] distinct aggregates within a window function WIP

Krasiyan Andreev
Thank you very much.

I think that Vik Fearing's patch about "Implement <null treatment> for window functions" is much clear, better and has a chance to be committed.
For me it's not important which patch will go into PostgreSQL, because it's a much needed feature.

In mine patch, there is also a feature about using negative indexes, to be able to reverse order in exact same window frame for "FROM FIRST/FROM LAST",
but I am not sure, is such non-standard usage is acceptable (it's the same as some array functions in programming language), if it's acceptable, it can be easy ported to Vik's patch.

I am thinking also to concentrate on Vik's patch, if it has a clear design point of view, clear design, I can withdraw mine patch.



На ср, 16.09.2020 г. в 11:19 Surafel Temesgen <[hidden email]> написа:

On Thu, Mar 5, 2020 at 4:17 AM Krasiyan Andreev <[hidden email]> wrote:
I have currently suspended development of this patch, based on it's review,
but I will continue development of the other Oliver Ford's work about adding support of respect/ignore nulls
for lag(),lead(),first_value(),last_value() and nth_value() and from first/last for nth_value() patch,
but I am not sure how to proceed with it's implementation and any feedback will be very helpful.


* I applied your patch on top of 58c47ccfff20b8c125903 . It applied cleanly , compiled, make check pass, but it have white space errors:

*Added functions on windowfuncs.c have no comments so it's not easily understandable.

* Regression test addition seems huge to me. Can you reduce that? You can use existing tables and fewer records.

* I don’t understand why this patch has to change makeBoolAConst? It already make “bool” constant node


regards

Surafel

 
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] distinct aggregates within a window function WIP

Michael Paquier-2
On Wed, Sep 16, 2020 at 11:35:22AM +0300, Krasiyan Andreev wrote:
> I am thinking also to concentrate on Vik's patch, if it has a clear design
> point of view, clear design, I can withdraw mine patch.

Okay, I have done that then.
--
Michael

signature.asc (849 bytes) Download Attachment