BUG #16887: Group by is faster than distinct

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

BUG #16887: Group by is faster than distinct

PG Bug reporting form
The following bug has been logged on the website:

Bug reference:      16887
Logged by:          XINYU LIU
Email address:      [hidden email]
PostgreSQL version: 12.3
Operating system:   Ubuntu 20.04
Description:        

Hello,

We have 2 TPC-H queries which fetch the same tuples but have significant
query execution time differences (3 times).

We are sharing a pair of TPC-H queries that exhibit this performance
difference:

First query:
SELECT DISTINCT d0.lineitem_l_shipmode
FROM            (
                       SELECT lineitem.l_shipmode AS lineitem_l_shipmode
                       FROM   lineitem
                       WHERE  lineitem.l_shipinstruct IS distinct
                       FROM   'COLLECT COD              ') AS d0
WHERE           d0.lineitem_l_shipmode IS DISTINCT
FROM            'MAIL'

Second query:
SELECT   d0.lineitem_l_shipmode
FROM     (
                SELECT lineitem.l_shipmode AS lineitem_l_shipmode
                FROM   lineitem
                WHERE  lineitem.l_shipinstruct IS distinct
                FROM   'COLLECT COD              ') AS d0
WHERE    d0.lineitem_l_shipmode IS DISTINCT
FROM     'MAIL'
GROUP BY d0.lineitem_l_shipmode

Actual Behavior
We executed both queries on the TPC-H benchmark of scale factor 5: the first
query takes over 20 seconds, while the second query only takes 6.5 seconds.
We think the time difference results from different plans selected.
Specifically, in the first (slow) query, the optimizer decides to not
parallelize the SCAN and GROUP operations.

Query Execution Plan
First query:
                                                       QUERY PLAN          
                                                   
------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=1060784.91..1060784.98 rows=7 width=11) (actual
time=20233.319..20233.320 rows=6 loops=1)
   Group Key: lineitem.l_shipmode
   ->  Seq Scan on lineitem  (cost=0.00..1012391.94 rows=19357187 width=11)
(actual time=0.074..13586.400 rows=19288709 loops=1)
         Filter: ((l_shipinstruct IS DISTINCT FROM 'COLLECT COD            
'::bpchar) AND (l_shipmode IS DISTINCT FROM 'MAIL'::bpchar))
         Rows Removed by Filter: 10711086
 Planning Time: 0.133 ms
 Execution Time: 20233.374 ms





Second query:
                                                                      QUERY
PLAN                                                                        

------------------------------------------------------------------------------------------------------------------------------------------------------------
 Group  (cost=771057.65..771059.32 rows=7 width=11) (actual
time=6583.724..6583.735 rows=6 loops=1)
   Group Key: lineitem.l_shipmode
   ->  Gather Merge  (cost=771057.65..771059.29 rows=14 width=11) (actual
time=6583.722..6585.134 rows=18 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=770057.63..770057.65 rows=7 width=11) (actual
time=6579.986..6579.986 rows=6 loops=3)
               Sort Key: lineitem.l_shipmode
               Sort Method: quicksort  Memory: 25kB
               Worker 0:  Sort Method: quicksort  Memory: 25kB
               Worker 1:  Sort Method: quicksort  Memory: 25kB
               ->  Partial HashAggregate  (cost=770057.46..770057.53 rows=7
width=11) (actual time=6579.945..6579.946 rows=6 loops=3)
                     Group Key: lineitem.l_shipmode
                     ->  Parallel Seq Scan on lineitem
(cost=0.00..749893.72 rows=8065495 width=11) (actual time=0.058..4510.171
rows=6429570 loops=3)
                           Filter: ((l_shipinstruct IS DISTINCT FROM
'COLLECT COD              '::bpchar) AND (l_shipmode IS DISTINCT FROM
'MAIL'::bpchar))
                           Rows Removed by Filter: 3570362
 Planning Time: 0.159 ms
 Execution Time: 6585.196 ms




Expected Behavior
Since these two queries are semantically equivalent, we were hoping that
PostgreSQL will evaluate them in roughly the same amount of time.

Test Environment
Ubuntu 20.04 machine "Linux panda 5.4.0-40-generic #44-Ubuntu SMP Tue Jun 23
00:01:04 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux"
PostgreSQL v12.3
Database: TPC-H benchmark (with scale factor 5)
The description of table lineitem is as follows:
tpch5=# \d+ lineitem;
                                             Table "public.lineitem"
     Column      |         Type          | Collation | Nullable | Default |
Storage  | Stats target | Description
-----------------+-----------------------+-----------+----------+---------+----------+--------------+-------------
 l_orderkey      | integer               |           | not null |         |
plain    |              |
 l_partkey       | integer               |           | not null |         |
plain    |              |
 l_suppkey       | integer               |           | not null |         |
plain    |              |
 l_linenumber    | integer               |           | not null |         |
plain    |              |
 l_quantity      | numeric(15,2)         |           | not null |         |
main     |              |
 l_extendedprice | numeric(15,2)         |           | not null |         |
main     |              |
 l_discount      | numeric(15,2)         |           | not null |         |
main     |              |
 l_tax           | numeric(15,2)         |           | not null |         |
main     |              |
 l_returnflag    | character(1)          |           | not null |         |
extended |              |
 l_linestatus    | character(1)          |           | not null |         |
extended |              |
 l_shipdate      | date                  |           | not null |         |
plain    |              |
 l_commitdate    | date                  |           | not null |         |
plain    |              |
 l_receiptdate   | date                  |           | not null |         |
plain    |              |
 l_shipinstruct  | character(25)         |           | not null |         |
extended |              |
 l_shipmode      | character(10)         |           | not null |         |
extended |              |
 l_comment       | character varying(44) |           | not null |         |
extended |              |
Indexes:
    "lineitem_pkey" PRIMARY KEY, btree (l_orderkey, l_linenumber)



Here are the steps for reproducing our observations:
       
1. Download the dataset from the link:
https://drive.google.com/file/d/13rFa1BNDi4e2RmXBn-yEQkcqt6lsBu1c/view?usp=sharing
2. Set up TPC-H benchmark
tar xzvf tpch5_postgresql.tar.gz
cd tpch5_postgresql
db=tpch5
createdb $db
psql -d $db < dss.ddl
for i in `ls *.tbl`
do
    echo $i
    name=`echo $i|cut -d'.' -f1`
    psql -d $db -c "COPY $name FROM '`pwd`/$i' DELIMITER '|' ENCODING
'LATIN1';"
done
psql -d $db < dss_postgres.ri
3. Execute the queries

Reply | Threaded
Open this post in threaded view
|

Re: BUG #16887: Group by is faster than distinct

David Rowley
On Tue, 23 Feb 2021 at 10:26, PG Bug reporting form
<[hidden email]> wrote:
> Actual Behavior
> We executed both queries on the TPC-H benchmark of scale factor 5: the first
> query takes over 20 seconds, while the second query only takes 6.5 seconds.
> We think the time difference results from different plans selected.
> Specifically, in the first (slow) query, the optimizer decides to not
> parallelize the SCAN and GROUP operations.


> Expected Behavior
> Since these two queries are semantically equivalent, we were hoping that
> PostgreSQL will evaluate them in roughly the same amount of time.

It makes sense that you'd expect this, however, we don't currently
generate parallel plans for DISTINCT queries.  So this is more of
something that's yet to be implemented rather than a bug.

When parallel aggregates were added in 9.6, it was quite late in the
cycle and I narrowed the scope to just GROUP BY.  DISTINCT was left
behind. I tried to pick that up again several years ago, but I was
encouraged to drop it in favour of other work.

David