expensive function in select list vs limit clause


expensive function in select list vs limit clause

Christian Mair
Hi,

I've found a (simple) situation where the planner does something I don't understand.

Below is a complete test case followed by output.

From the timings it appears that in the second explain analyze query a function
call in the select list (expensive()) is evaluated in the sequential scan node
*for each* row in big, despite the use of limit.

I would have expected expensive() to be evaluated only for the ten rows
in the result set. Hence the second explain analyze query shouldn't be more
expensive than the first one.

I trust Postgres' planner enough to suspect that it is right and that there
must be a reason for this :)

Could someone help me understand this behaviour?

Thanks & Bye,
Chris.



-- ***************************************************************************

select version();

-- setup: create a time wasting function and a table with 1M rows

create function expensive() returns double precision as
$$
     begin
         for i in 1 .. 15000 loop
             -- empty busy loop: just burn CPU time
         end loop;
         return random();
     end;
$$ language plpgsql;

create table big as select random() as r from generate_series(1, 1000000);
analyze big;

\timing on

-- benchmark expensive(): one call to expensive takes about 0.03 ms => OK

do $$ begin for i in 1 .. 1000 loop perform expensive(); end loop; end; $$;

-- find the ten smallest values in big: takes ~ 0.18s => OK

explain analyze select r from big order by r offset 0 limit 10;

-- now do the same, but add an expensive() column to the result:
-- takes ~ 29s => WHY?

explain analyze select r, expensive() from big order by r offset 0 limit 10;

-- clean up :)

\timing off
drop function expensive();
drop table big;

-- ***************************************************************************

                                                    version
--------------------------------------------------------------------------------------------------------------
  PostgreSQL 9.5.4 on x86_64-apple-darwin14.5.0, compiled by Apple LLVM version 7.0.2 (clang-700.1.81), 64-bit
(1 row)

Time: 0.386 ms
CREATE FUNCTION
Time: 0.456 ms
SELECT 1000000
Time: 466.814 ms
ANALYZE
Time: 73.770 ms
Timing is on.
DO
Time: 33.922 ms
                                                         QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
  Limit  (cost=36034.64..36034.67 rows=10 width=8) (actual time=182.361..182.363 rows=10 loops=1)
    ->  Sort  (cost=36034.64..38534.64 rows=1000000 width=8) (actual time=182.360..182.361 rows=10 loops=1)
          Sort Key: r
          Sort Method: top-N heapsort  Memory: 25kB
          ->  Seq Scan on big  (cost=0.00..14425.00 rows=1000000 width=8) (actual time=0.022..99.777 rows=1000000 loops=1)
  Planning time: 0.070 ms
  Execution time: 182.377 ms
(7 rows)

Time: 182.689 ms
                                                           QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
  Limit  (cost=286034.64..286034.67 rows=10 width=8) (actual time=28932.311..28932.314 rows=10 loops=1)
    ->  Sort  (cost=286034.64..288534.64 rows=1000000 width=8) (actual time=28932.309..28932.310 rows=10 loops=1)
          Sort Key: r
          Sort Method: top-N heapsort  Memory: 25kB
          ->  Seq Scan on big  (cost=0.00..264425.00 rows=1000000 width=8) (actual time=0.062..28822.520 rows=1000000 loops=1)
  Planning time: 0.038 ms
  Execution time: 28932.339 ms
(7 rows)

Time: 28932.908 ms
Timing is off.
DROP FUNCTION
DROP TABLE

-- ***************************************************************************



--
Sent via pgsql-general mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: expensive function in select list vs limit clause

Albe Laurenz
Chris Mair wrote:

> I've found a (simple) situation where the planner does something I don't understand.
>
> Below is a complete test case followed by output.
>
>  From the timings it appears that in the second explain analyze query a function
> call in the select list (expensive()) is evaluated in the sequential scan node
> *for each* row in big, despite the use of limit.
>
> I would have expected expensive() to be evaluated only for the ten rows
> in the result set. Hence the second explain analyze query shouldn't be more
> expensive than the first one.
>
> My trust in Postgres' planner goes so far as I feel the planner is right and there
> must be a reason for this :)
>
> Could someone help me understand this behaviour?
[...]
> create function expensive() returns double precision as
> $$
>      begin
>          for i in 1 .. 15000 loop
>          end loop;
>          return random();
>      end;
> $$ language 'plpgsql';

This is unrelated, but you should set COST for an expensive function
to help the planner.
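
A minimal sketch of that suggestion (the default function COST is 100, in
units of cpu_operator_cost; the value below is an illustrative guess, not a
measured figure):

    -- Tell the planner that expensive() is far costlier than an
    -- ordinary operator, so it tries to minimize the number of calls.
    ALTER FUNCTION expensive() COST 10000;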

[...]
> -- now do the same, but add an expensive() column to the result:
> -- takes ~ 29s => WHY?
>
> explain analyze select r, expensive() from big order by r offset 0 limit 10;
[...]

>                                                            QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------
>   Limit  (cost=286034.64..286034.67 rows=10 width=8) (actual time=28932.311..28932.314 rows=10 loops=1)
>     ->  Sort  (cost=286034.64..288534.64 rows=1000000 width=8) (actual time=28932.309..28932.310 rows=10 loops=1)
>           Sort Key: r
>           Sort Method: top-N heapsort  Memory: 25kB
>           ->  Seq Scan on big  (cost=0.00..264425.00 rows=1000000 width=8) (actual time=0.062..28822.520 rows=1000000 loops=1)
>   Planning time: 0.038 ms
>   Execution time: 28932.339 ms
> (7 rows)

ORDER BY can only be processed after all rows have been fetched, and that
includes computing the expensive result column.

You can easily avoid that by applying the LIMIT first:

  SELECT r, expensive()
  FROM (SELECT r
        FROM big
        ORDER BY r
        LIMIT 10
       ) AS sub;

I don't know how hard it would be to only fetch the necessary columns before
the ORDER BY and fetch the others after the LIMIT has been applied, but it
is probably nontrivial and would require processing time for *everybody*
who runs a query with ORDER BY to solve a rare problem that can easily be
worked around.
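
A CTE is an equivalent workaround on pre-9.6 releases, where a WITH query is
an optimization fence and is materialized before the outer SELECT runs (a
sketch along the same lines as the subselect above, not timed here):

    -- The LIMIT is applied inside the CTE, so expensive() is only
    -- evaluated for the ten surviving rows.
    WITH ten AS (
        SELECT r FROM big ORDER BY r LIMIT 10
    )
    SELECT r, expensive() FROM ten;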

Yours,
Laurenz Albe


Re: expensive function in select list vs limit clause

Tom Lane
In reply to this post by Christian Mair
Chris Mair <[hidden email]> writes:
>  From the timings it appears that in the second explain analyze query a function
> call in the select list (expensive()) is evaluated in the sequential scan node
> *for each* row in big, despite the use of limit.

According to the SQL standard, the SELECT list is evaluated before ORDER
BY, so if you need an explicit sort step the function is going to get
calculated first.  This is obviously necessary if the function is used
as the sort key, but otherwise it's possible to be smarter.  We were not
smarter before 9.6 though.  You might find this commit message informative:

https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=9118d03a8
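
To illustrate the distinction drawn above (a sketch against the test table
from the first post):

    -- The function is the sort key: it must run for every row,
    -- on any version, since the sort cannot happen without it.
    SELECT r FROM big ORDER BY expensive() LIMIT 10;

    -- The function is only in the output list: 9.6+ can postpone
    -- evaluating it until after the sort and LIMIT.
    SELECT r, expensive() FROM big ORDER BY r LIMIT 10;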

                        regards, tom lane



Re: expensive function in select list vs limit clause

Christian Mair

> https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=9118d03a8

Hi,

thanks!

I've just tested with 9.6 and the test runs fast with or without expensive().

So the above patch does indeed improve this case a lot!

Bye,
Chris.






Re: expensive function in select list vs limit clause

Christian Mair
In reply to this post by Albe Laurenz

>
> ORDER BY can only be processed after all rows have been fetched, this
> includes the expensive result column.
>
> You can easily avoid that by applying the LIMIT first:
>
>   SELECT r, expensive()
>   FROM (SELECT r
>         FROM big
>         ORDER BY r
>         LIMIT 10
>        ) AS sub;
>
> I don't know how hard it would be to only fetch the necessary columns before
> the ORDER BY and fetch the others after the LIMIT has been applied, but it
> is probably nontrivial and would require processing time for *everybody*
> who runs a query with ORDER BY to solve a rare problem that can easily be
> worked around.

Hi,

Tom Lane just pointed out that 9.6 is able to optimise this (at least
the synthetic example).

Anyway, my real problem was beautifully improved by the subselect trick!

Thanks a lot!

Bye,
Chris.





