FETCH FIRST clause WITH TIES option

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
37 messages Options
12
Reply | Threaded
Open this post in threaded view
|

FETCH FIRST clause WITH TIES option

Surafel Temesgen

hello ,

The WITH TIES keyword is sql standard that specifies any peers of retained rows

to retained in the result set too .which means according to ordering key the result set can includes additional rows which have ties on the last position, if there are any and It work with ORDER BY query

The attach patch is a finished implementation of it except it did not have a regression test because am not sure of changing the test data set for it and I can’t find fetch first clause regression test too.

Regards

Surafel


fetch_first_with_ties_v1.patch (34K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: FETCH FIRST clause WITH TIES option

Tomas Vondra-4
Hello Surafel,

On 10/26/2018 12:28 PM, Surafel Temesgen wrote:
> hello ,
>
> The WITH TIES keyword is sql standard that specifies any peers of
> retained rows to retained in the result set too .which means
> according to ordering key the result set can includes additional rows
> which have ties on the last position, if there are any and It work
> with ORDER BY query.
>

Thanks for the patch. I've looked at it today, and it seems mostly OK,
with a couple of minor issues. Most of it is code formatting and comment
wording issues, so I'm not going to go through them here - see the
attached 0002 patch (0001 is your patch, rebased to current master).

There's one place that I think is incorrect, and that's this bit from
ExecLimit:

    }else
    /*
     * Get next tuple from subplan, if any.
     */
    slot = ExecProcNode(outerPlan);
    if (TupIsNull(slot))
    {
        node->lstate = LIMIT_SUBPLANEOF;
        return NULL;
    }
    if (node->limitOption == WITH_TIES)
    {
        ExecCopySlot(node->last_slot, slot);
    }
    node->subSlot = slot;
    node->position++;

Note that the "else" guards only the very first line, not the whole
block. This seems wrong, i.e. there should be {} around the whole block.

I'm also suggesting to slightly change the create_limit_plan(), to keep
a single make_limit call. IMHO that makes it easier to understand,
although I admit it's fairly subjective.

One question is whether to make this work for LIMIT too, not just for
FETCH FIRST. I'm not sure how difficult would that be (it should be a
grammar-only tweak I guess), or if it's a good idea in general. But I
find it quite confusing that various comments say things like this:

  LimitOption  limitOption; /* LIMIT with ties or  exact number */

while in fact it does not work with LIMIT.

>
> The attach patch is a finished implementation of it except it did not
> have a regression test because am not sure of changing the test data set
> for it and I can’t find fetch first clause regression test too.
>

Well, that's not a very good reason not to have tests for this new
improvement. FWIW there are a couple of places in regression tests where
FETCH FIRST is used, see this:

  $ grep -i 'fetch first' src/test/regress/sql/*
  src/test/regress/sql/hs_standby_allowed.sql:fetch first from hsc;
  src/test/regress/sql/tablesample.sql:FETCH FIRST FROM tablesample_cur;
  src/test/regress/sql/tablesample.sql:FETCH FIRST FROM tablesample_cur;
  src/test/regress/sql/tidscan.sql:FETCH FIRST FROM c;

But those places don't seem like very good match for testing the new
stuff, because those are primarily testing something else. I suggest we
add the new tests into limit.sql.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

0002-review.patch (9K) Download Attachment
0001-rebased-patch.patch (27K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: FETCH FIRST clause WITH TIES option

Andrew Gierth
>>>>> "Tomas" == Tomas Vondra <[hidden email]> writes:

 > On 10/26/2018 12:28 PM, Surafel Temesgen wrote:
 >> hello ,
 >>
 >> The WITH TIES keyword is sql standard that specifies any peers of
 >> retained rows to retained in the result set too .which means
 >> according to ordering key the result set can includes additional rows
 >> which have ties on the last position, if there are any and It work
 >> with ORDER BY query.

 Tomas> Thanks for the patch. I've looked at it today, and it seems
 Tomas> mostly OK, with a couple of minor issues. Most of it is code
 Tomas> formatting and comment wording issues, so I'm not going to go
 Tomas> through them here - see the attached 0002 patch (0001 is your
 Tomas> patch, rebased to current master).

I still think that this is the wrong approach. Implementing WITH TIES
and PERCENT together using an implicit window function call kills two
birds with one very small stone (the only executor change needed would
be teaching LIMIT to be able to stop on a boolean condition), with
maximum reuse of existing facilities.

--
Andrew (irc:RhodiumToad)

Reply | Threaded
Open this post in threaded view
|

Re: FETCH FIRST clause WITH TIES option

Tomas Vondra-4
On 10/29/2018 04:17 PM, Andrew Gierth wrote:

>>>>>> "Tomas" == Tomas Vondra <[hidden email]> writes:
>
>   > On 10/26/2018 12:28 PM, Surafel Temesgen wrote:
>   >> hello ,
>   >>
>   >> The WITH TIES keyword is sql standard that specifies any peers of
>   >> retained rows to retained in the result set too .which means
>   >> according to ordering key the result set can includes additional rows
>   >> which have ties on the last position, if there are any and It work
>   >> with ORDER BY query.
>
>   Tomas> Thanks for the patch. I've looked at it today, and it seems
>   Tomas> mostly OK, with a couple of minor issues. Most of it is code
>   Tomas> formatting and comment wording issues, so I'm not going to go
>   Tomas> through them here - see the attached 0002 patch (0001 is your
>   Tomas> patch, rebased to current master).
>
> I still think that this is the wrong approach. Implementing WITH TIES
> and PERCENT together using an implicit window function call kills two
> birds with one very small stone (the only executor change needed would
> be teaching LIMIT to be able to stop on a boolean condition), with
> maximum reuse of existing facilities.
>

Hmmm, maybe. How would that work, exactly? Wouldn't that mean extra
overhead (the window functions are hardly free) and limitations? Perhaps
that was discussed in some other thread in the past?

FWIW, I doubt the patch can be much smaller/simpler - a significant part
of the new stuff is in gram.y and node read/out infrastructure, the
changes to LIMIT node are fairly minimal.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: FETCH FIRST clause WITH TIES option

Andrew Gierth
>>>>> "Tomas" == Tomas Vondra <[hidden email]> writes:

 >> I still think that this is the wrong approach. Implementing WITH
 >> TIES and PERCENT together using an implicit window function call
 >> kills two birds with one very small stone (the only executor change
 >> needed would be teaching LIMIT to be able to stop on a boolean
 >> condition), with maximum reuse of existing facilities.

 Tomas> Hmmm, maybe. How would that work, exactly? Wouldn't that mean
 Tomas> extra overhead (the window functions are hardly free) and
 Tomas> limitations?

They're not free, but then this case probably shouldn't be treated as a
particularly hot code path.

The basic idea is this: extend the Limit node to be able to stop on a
boolean expression, such that the first false value ends the result.

Then FETCH FIRST N WITH TIES becomes "stop when the expression
  rank() over (order by ...) <= N
is no longer true" (where the ... is the existing top level order by)

and FETCH FIRST N PERCENT is the same but with percent_rank (or
cume_dist, I'd have to check the exact semantics)

rank() doesn't need to read ahead in the input. percent_rank of course
does, but it's clearly impossible to implement PERCENT without doing
that.

Also, one could consider extending LIMIT to support arbitrary
expressions, with some syntax like LIMIT WHEN (condition) which would be
the general form.

 Tomas> FWIW, I doubt the patch can be much smaller/simpler - a
 Tomas> significant part of the new stuff is in gram.y and node read/out
 Tomas> infrastructure, the changes to LIMIT node are fairly minimal.

It's not a matter of patch size as such, but reuse of mechanisms rather
than adding new ones. Also doing WITH TIES as a special case in the
Limit node doesn't make adding PERCENT any easier at all, whereas the
above gets it basically for free.

--
Andrew (irc:RhodiumToad)

Reply | Threaded
Open this post in threaded view
|

Re: FETCH FIRST clause WITH TIES option

Tomas Vondra-4
On 10/29/2018 05:47 PM, Andrew Gierth wrote:

>>>>>> "Tomas" == Tomas Vondra <[hidden email]> writes:
>
>   >> I still think that this is the wrong approach. Implementing WITH
>   >> TIES and PERCENT together using an implicit window function call
>   >> kills two birds with one very small stone (the only executor change
>   >> needed would be teaching LIMIT to be able to stop on a boolean
>   >> condition), with maximum reuse of existing facilities.
>
>   Tomas> Hmmm, maybe. How would that work, exactly? Wouldn't that mean
>   Tomas> extra overhead (the window functions are hardly free) and
>   Tomas> limitations?
>
> They're not free, but then this case probably shouldn't be treated as a
> particularly hot code path.
>
> The basic idea is this: extend the Limit node to be able to stop on a
> boolean expression, such that the first false value ends the result.
>
> Then FETCH FIRST N WITH TIES becomes "stop when the expression
>    rank() over (order by ...) <= N
> is no longer true" (where the ... is the existing top level order by)
>
> and FETCH FIRST N PERCENT is the same but with percent_rank (or
> cume_dist, I'd have to check the exact semantics)
>
> rank() doesn't need to read ahead in the input. percent_rank of course
> does, but it's clearly impossible to implement PERCENT without doing
> that.
>

OK. I was under the impression the window function would need to see all
the input rows, but that seems not to be the case in general.

> Also, one could consider extending LIMIT to support arbitrary
> expressions, with some syntax like LIMIT WHEN (condition) which would be
> the general form.
>

Hmmm. I can't really recall needing such thing, but of course - that
does not prove it'd not be a good thing.

>   Tomas> FWIW, I doubt the patch can be much smaller/simpler - a
>   Tomas> significant part of the new stuff is in gram.y and node read/out
>   Tomas> infrastructure, the changes to LIMIT node are fairly minimal.
>
> It's not a matter of patch size as such, but reuse of mechanisms rather
> than adding new ones. Also doing WITH TIES as a special case in the
> Limit node doesn't make adding PERCENT any easier at all, whereas the
> above gets it basically for free.
>

Makes sense, I guess. At first I was thinking that this certainly does
not add more new mechanisms than allowing LIMIT to terminate on boolean
expression. But you're right about the WITH PERCENT part - using window
functions would make adding this much simpler.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: FETCH FIRST clause WITH TIES option

Robert Haas
In reply to this post by Andrew Gierth
On Mon, Oct 29, 2018 at 12:48 PM Andrew Gierth
<[hidden email]> wrote:
> Then FETCH FIRST N WITH TIES becomes "stop when the expression
>   rank() over (order by ...) <= N
> is no longer true" (where the ... is the existing top level order by)

Wouldn't that be wicked expensive compared to something hard-coded
that does the same thing?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Reply | Threaded
Open this post in threaded view
|

Re: FETCH FIRST clause WITH TIES option

Tomas Vondra-4
On 10/31/2018 06:17 PM, Robert Haas wrote:
> On Mon, Oct 29, 2018 at 12:48 PM Andrew Gierth
> <[hidden email]> wrote:
>> Then FETCH FIRST N WITH TIES becomes "stop when the expression
>>    rank() over (order by ...) <= N
>> is no longer true" (where the ... is the existing top level order by)
>
> Wouldn't that be wicked expensive compared to something hard-coded
> that does the same thing?
>

Not sure, but that was one of my concerns too, particularly for the
simple FETCH FIRST N WITH TIES case. But I think Andrew has a point it
would make it much easier to implement the PERCENT case.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: FETCH FIRST clause WITH TIES option

Surafel Temesgen
hi,

The attached patch include all the comment given by Tomas and i check sql standard about LIMIT and this feature

it did not say anything about it but I think its good idea to include it to LIMIT too and I will add it if we have consensus on it.

regards

surafel


fetch_first_with_ties_v2.patch (38K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: FETCH FIRST clause WITH TIES option

Surafel Temesgen
Attach is rebased patch against the current master
regards
Surafel

On Thu, Nov 1, 2018 at 2:28 PM Surafel Temesgen <[hidden email]> wrote:
hi,

The attached patch include all the comment given by Tomas and i check sql standard about LIMIT and this feature

it did not say anything about it but I think its good idea to include it to LIMIT too and I will add it if we have consensus on it.

regards

surafel


fetch_first_with_ties_v3.patch (38K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: FETCH FIRST clause WITH TIES option

Tomas Vondra-4
Hi,

On 11/24/18 10:28 AM, Surafel Temesgen wrote:

> Attach is rebased patch against the current master
> regards
> Surafel
>
> On Thu, Nov 1, 2018 at 2:28 PM Surafel Temesgen <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     hi,
>
>     The attached patch include all the comment given by Tomas and i
>     check sql standard about LIMIT and this feature
>
Unfortunately, it seems the "limit" regression tests fail for some
reason - the output mismatches the expected results for some reason. It
seems as if the WITH TIES code affects ordering of the results within
the group. See the attached file.

>     it did not say anything about it but I think its good idea to
>     include it to LIMIT too and I will add it if we have consensus on it.
>

Hmm, I'm not sure that's needed. I don't see an urgent need to do that
in v1 of the patch.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

regression.diffs (3K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: FETCH FIRST clause WITH TIES option

Surafel Temesgen


On Tue, Jan 1, 2019 at 8:38 PM Tomas Vondra <[hidden email]> wrote:
>
>     The attached patch include all the comment given by Tomas and i
>     check sql standard about LIMIT and this feature
>

Unfortunately, it seems the "limit" regression tests fail for some
reason - the output mismatches the expected results for some reason. It
seems as if the WITH TIES code affects ordering of the results within
the group. See the attached file.


Yes the reason is the order of returned row is not always the same. I remove other columns from the result set to get constant result

Regards

Surafel

 

fetch_first_with_ties_v4.patch (37K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: FETCH FIRST clause WITH TIES option

Tomas Vondra-4


On 1/2/19 11:51 AM, Surafel Temesgen wrote:

>
>
> On Tue, Jan 1, 2019 at 8:38 PM Tomas Vondra
> <[hidden email] <mailto:[hidden email]>> wrote:
>
>     >
>     >     The attached patch include all the comment given by Tomas and i
>     >     check sql standard about LIMIT and this feature
>     >
>
>     Unfortunately, it seems the "limit" regression tests fail for some
>     reason - the output mismatches the expected results for some reason. It
>     seems as if the WITH TIES code affects ordering of the results within
>     the group. See the attached file.
>
>
> Yes the reason is the order of returned row is not always the same. I
> remove other columns from the result set to get constant result
>

Thanks, that seems reasonable.

After looking at the "FETCH FIRST ... PERCENT" patch, I wonder if this
patch should tweak estimates in some way. Currently, the cardinality
estimate is the same as for plain LIMIT, using the requested number of
rows. But let's say there are very few large groups - that will
naturally increase the number of rows produced.

As an example, let's say the subplan produces 1M rows, and there are
1000 groups (when split according to the ORDER BY clause). Consider a
query with "FETCH FIRST 10 ROWS WITH TIES". AFAICS the current estimate
will be 10, but in fact we know that it's likely to produce ~1000 rows
(because that's the average group size).

So I think the patch should tweak the estimate to be

  limitCount + (avgGroupSize/2);

or perhaps

  Max(avgGroupSize, limitCount + (avgGroupSize/2))

The 1/2 is there because we don't know where the group starts.

Of course, using average group size like this is rather crude, but it's
about the best thing we can do. In principle, increasing the cardinality
estimate is the right thing to do.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: FETCH FIRST clause WITH TIES option

Surafel Temesgen


On Wed, Jan 2, 2019 at 6:19 PM Tomas Vondra <[hidden email]> wrote:
After looking at the "FETCH FIRST ... PERCENT" patch, I wonder if this
patch should tweak estimates in some way. Currently, the cardinality
estimate is the same as for plain LIMIT, using the requested number of
rows. But let's say there are very few large groups - that will
naturally increase the number of rows produced.

As an example, let's say the subplan produces 1M rows, and there are
1000 groups (when split according to the ORDER BY clause).

 

can we use ORDER BY column raw statistic in limit node reliably? because it seems to me it can be affected by other operation in a subplan like filter condition

regards

Surafel

Reply | Threaded
Open this post in threaded view
|

Re: FETCH FIRST clause WITH TIES option

Tomas Vondra-4


On 1/15/19 11:07 AM, Surafel Temesgen wrote:

>
>
> On Wed, Jan 2, 2019 at 6:19 PM Tomas Vondra
> <[hidden email] <mailto:[hidden email]>> wrote:
>
>     After looking at the "FETCH FIRST ... PERCENT" patch, I wonder if this
>     patch should tweak estimates in some way. Currently, the cardinality
>     estimate is the same as for plain LIMIT, using the requested number of
>     rows. But let's say there are very few large groups - that will
>     naturally increase the number of rows produced.
>
>     As an example, let's say the subplan produces 1M rows, and there are
>     1000 groups (when split according to the ORDER BY clause).
>
>
>  
>
> can we use ORDER BY column raw statistic in limit node reliably?
> because it seems to me it can be affected by other operation in a
> subplan like filter condition
>

What do you mean by "raw statistic"? Using stats from the underlying
table is not quite possible, because you might be operating on top of
join or something like that.

IMHO the best thing you can do is call estimate_num_groups() and combine
that with the number of input rows. That shall benefit from ndistinct
coefficients when available, etc. I've been thinking that considering
the unreliability of grouping estimates we should use a multiple of the
average size (because there may be much larger groups), but I think
that's quite unprecipled and I'd much rather try without it first.

But maybe we can do better when there really is a single table to
consider, in which case we might look at MCV lists and estimate the
largest group. That would give us a much better idea of the worst case.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: FETCH FIRST clause WITH TIES option

Surafel Temesgen


On Tue, Jan 15, 2019 at 2:51 PM Tomas Vondra <[hidden email]> wrote:
 
What do you mean by "raw statistic"?

I mean without further calculation to consider other operation

 

IMHO the best thing you can do is call estimate_num_groups() and combine
that with the number of input rows. That shall benefit from ndistinct
coefficients when available, etc. I've been thinking that considering
the unreliability of grouping estimates we should use a multiple of the
average size (because there may be much larger groups), but I think
that's quite unprecipled and I'd much rather try without it first.


thank you for clarifying.

The attache patch use your suggestion uptread

regards

Surafel


fetch_first_with_ties_v5.patch (38K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: FETCH FIRST clause WITH TIES option

Michael Paquier-2
On Wed, Jan 16, 2019 at 11:45:46AM +0300, Surafel Temesgen wrote:
> The attached patch use your suggestion uptread

This patch needs a rebase because of conflicts done recently for
pluggable storage, so moved to next CF, waiting on author.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: FETCH FIRST clause WITH TIES option

Surafel Temesgen


On Mon, Feb 4, 2019 at 8:29 AM Michael Paquier <[hidden email]> wrote:

This patch needs a rebase because of conflicts done recently for
pluggable storage, so moved to next CF, waiting on author.
--

Thank you . rebased against current master

regards
Surafel

fetch_first_with_ties_v6.patch (38K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Re: FETCH FIRST clause WITH TIES option

David Steele
On 2/4/19 2:37 PM, Surafel Temesgen wrote:

>
>
> On Mon, Feb 4, 2019 at 8:29 AM Michael Paquier <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>
>     This patch needs a rebase because of conflicts done recently for
>     pluggable storage, so moved to next CF, waiting on author.
>     --
>
>
> Thank you . rebased against current master

This patch no longer passes testing so marked Waiting on Author.

Regards,
--
-David
[hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Re: FETCH FIRST clause WITH TIES option

Surafel Temesgen


On Mon, Mar 25, 2019 at 11:56 AM David Steele <[hidden email]> wrote:
This patch no longer passes testing so marked Waiting on Author.


Thank  you for informing. Fixed

fetch_first_with_ties_v7.patch (39K) Download Attachment
12