[PATCH] Remove useless distinct clauses

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

[PATCH] Remove useless distinct clauses

Pierre Ducroquet
Hi

In a recent audit, I noticed that application developers have a tendency to
abuse the distinct clause. For instance they use an ORM and add a distinct at
the top level just because they don't know the cost it has, or they don't know
that using EXISTS is a better way to express their queries than doing JOINs
(or worse, they can't do better).

They thus have this kind of queries (considering tbl1 has a PK of course):
SELECT DISTINCT * FROM tbl1;
SELECT DISTINCT * FROM tbl1 ORDER BY a;
SELECT DISTINCT tbl1.* FROM tbl1
        JOIN tbl2 ON tbl2.a = tbl1.id;

These can be transformed into:
SELECT * FROM tbl1 ORDER BY *;
SELECT * FROM tbl1 ORDER BY a;
SELECT tbl1.* FROM tbl1 SEMI-JOIN tbl2 ON tbl2.a = tbl1.id ORDER BY tbl1.*;

The attached patch does that.
I added extra safeties in several place just to be sure I don't touch
something I can not handle, but I may have been very candid with the distinct
to sort transformation.
The cost of this optimization is quite low : for queries that don't have any
distinct, it's just one if. If there is a distinct, we iterate once through
every target, then we fetch the PK and iterate through the DISTINCT clause
fields. If it is possible to optimize, we then iterate through the JOINs.

Any comment on this would be more than welcome!

Regards

 Pierre


0001-Add-a-new-remove_useless_distinct-function-in-planne.patch (5K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Remove useless distinct clauses

Ashutosh Bapat-2
Hi Pierre,

On Fri, Jul 31, 2020 at 2:11 PM Pierre Ducroquet <[hidden email]> wrote:

>
> Hi
>
> In a recent audit, I noticed that application developers have a tendency to
> abuse the distinct clause. For instance they use an ORM and add a distinct at
> the top level just because they don't know the cost it has, or they don't know
> that using EXISTS is a better way to express their queries than doing JOINs
> (or worse, they can't do better).
>
> They thus have this kind of queries (considering tbl1 has a PK of course):
> SELECT DISTINCT * FROM tbl1;
> SELECT DISTINCT * FROM tbl1 ORDER BY a;
> SELECT DISTINCT tbl1.* FROM tbl1
>         JOIN tbl2 ON tbl2.a = tbl1.id;
>
> These can be transformed into:
> SELECT * FROM tbl1 ORDER BY *;

We don't need an ORDER BY here since there's primary key on tbl1 and
DISTINCT doesn't ensure ordered result.

> SELECT * FROM tbl1 ORDER BY a;
> SELECT tbl1.* FROM tbl1 SEMI-JOIN tbl2 ON tbl2.a = tbl1.id ORDER BY tbl1.*;
>
> The attached patch does that.
> I added extra safeties in several place just to be sure I don't touch
> something I can not handle, but I may have been very candid with the distinct
> to sort transformation.
> The cost of this optimization is quite low : for queries that don't have any
> distinct, it's just one if. If there is a distinct, we iterate once through
> every target, then we fetch the PK and iterate through the DISTINCT clause
> fields. If it is possible to optimize, we then iterate through the JOINs.

We are already discussing this feature at
https://www.postgresql.org/message-id/flat/CAKJS1f-wH83Fi2coEVNUWFxOGQ4BJRRTGqDMvidCoiR9WEwxsw%40mail.gmail.com#56a08b441cc61afaf85c6232c5d40a3f.
You are welcome to contribute your ideas/code/review on that thread.

--
Best Wishes,
Ashutosh Bapat