A limit clause can cause a poor index choice

A limit clause can cause a poor index choice

Nick Cleaton
The attached script builds a 10G test table which demonstrates a
problem that we have in production with PostgreSQL 12.3-1.pgdg18.04+1
on Ubuntu Linux. Indexes:

test_orders_o_date_idx btree(o_date)
test_orders_customer_id_o_date_idx btree(customer_id, o_date)

We query for the most recent orders for sets of customers, and
sometimes none of those customers have any orders and the results are
empty:

explain analyze select * from test_orders where
customer_id=any(ARRAY[9993,9997,9912,9954,9100,9101,9102,9234,9500,9512])
order by o_date desc;

                                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=24848.96..24870.67 rows=8686 width=1839) (actual time=1.101..1.102 rows=0 loops=1)
   Sort Key: o_date DESC
   Sort Method: quicksort  Memory: 25kB
   ->  Index Scan using test_orders_customer_id_o_date_idx on test_orders  (cost=0.43..17361.20 rows=8686 width=1839) (actual time=1.047..1.047 rows=0 loops=1)
         Index Cond: (customer_id = ANY ('{9993,9997,9912,9954,9100,9101,9102,9234,9500,9512}'::integer[]))
 Planning Time: 3.821 ms
 Execution Time: 1.174 ms
(7 rows)

So far so good. But if we add a limit clause to the query then the
plan goes very wrong:

explain analyze select * from test_orders where
customer_id=any(ARRAY[9993,9997,9912,9954,9100,9101,9102,9234,9500,9512])
order by o_date desc limit 10;

                                                                   QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..1660.98 rows=10 width=1839) (actual time=4990.424..4990.424 rows=0 loops=1)
   ->  Index Scan Backward using test_orders_o_date_idx on test_orders  (cost=0.43..1442355.43 rows=8686 width=1839) (actual time=4990.423..4990.423 rows=0 loops=1)
         Filter: (customer_id = ANY ('{9993,9997,9912,9954,9100,9101,9102,9234,9500,9512}'::integer[]))
         Rows Removed by Filter: 5000000
 Planning Time: 0.063 ms
 Execution Time: 4990.435 ms


Is there something we can adjust to get it to prefer
test_orders_customer_id_o_date_idx even when there's a limit clause?

Attachment: testdata.py (1K)

Re: A limit clause can cause a poor index choice

Mohamed Wael Khobalatte
Hi Nick, 

I believe adding a second ordering column, id desc, will get your query to use the right index, and it shouldn't be functionally different from what you would expect.

```
select * from test_orders where
customer_id=any(ARRAY[9993,9997,9912,9954,9100,9101,9102,9234,9500,9512])
order by o_date desc, id desc limit 10;
```

I didn't look closely at your data to see why, though. I'll leave it to more experienced people to comment on why the planner misjudged your query so badly. What happens when you raise the limit? Say to 1000?

On Tue, May 19, 2020 at 3:00 PM Nick Cleaton <[hidden email]> wrote:

> The attached script builds a 10G test table which demonstrates a
> problem that we have in production with postgresql 12.3-1.pgdg18.04+1
> on ubuntu linux.

Re: A limit clause can cause a poor index choice

Michael Lewis
What does pg_stats say about column customer_id? Specifically, how many ndistinct, and what is the sum of the most-common-value frequencies? If you have 5000 distinct customer_id values, and the (default 100) most common values only cover 2% of the total rows, then the optimizer will assume that any given non-MCV customer_id will yield approx reltuples * .98 / ( 5000 - 100 ) rows. So if your table has 1 million rows, the estimate would be about 200 rows in the table per customer_id in your array.
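As a worked example of that arithmetic (the 1-million-row table, 5000 distinct values and 2% MCV coverage are the hypothetical figures from the paragraph above, not measurements from Nick's table):

```sql
-- Planner's per-value estimate for a value not in the MCV list:
--   rows ≈ reltuples * (1 - frac_MCV) / (n_distinct - n_mcv)
SELECT round(1000000 * (1 - 0.02) / (5000 - 100)) AS est_rows_per_value;
-- est_rows_per_value = 200
```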

Looking at your query plan, the optimizer expects rows=8686 for those customer_ids, and it knows you only want the 10 most recent. It made the right call based on the information it had.

Increase default_statistics_target, at least on that column, and see if you get a much much better plan. I don't know where I got this query from online, but here ya go. I'd be curious how frac_MCV in this changes when default_statistics_target is more like 250 or 500 and the table is analyzed again to reflect that change.


```
SELECT
    ( SELECT SUM (x) FROM UNNEST (most_common_freqs) x ) frac_MCV,
    tablename,
    attname,
    inherited,
    null_frac,
    n_distinct,
    array_length(most_common_vals,1) n_mcv,
    array_length(histogram_bounds,1) n_hist,
    correlation,
    *
FROM pg_stats
WHERE schemaname = 'public'
    AND tablename = 'test_orders'
    AND attname = 'customer_id'
ORDER BY 1;
```
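A minimal sketch of raising the target on just that column, as suggested above, rather than changing default_statistics_target globally (the value 500 is illustrative, not a recommendation):

```sql
-- Per-column statistics target; overrides default_statistics_target
ALTER TABLE test_orders ALTER COLUMN customer_id SET STATISTICS 500;
-- Re-gather statistics so the larger MCV list and histogram take effect
ANALYZE test_orders;
```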


Re: A limit clause can cause a poor index choice

Nick Cleaton
In reply to this post by Mohamed Wael Khobalatte
On Tue, 19 May 2020 at 21:56, Mohamed Wael Khobalatte <[hidden email]> wrote:

> I believe a second ordering, by id desc, will get your query to use the right index, and shouldn't be functionally different from what you would expect.

Thanks, that works nicely on our production table, even with much
larger sets of customer_id values.

> What happens when you raise the limit? Say to a 1000?

A limit of 1000 makes it choose the fast plan. A limit of 100 causes
it to choose the fast plan if I raise the stats target on that column
to 250 or above, otherwise not.



Re: A limit clause can cause a poor index choice

Nick Cleaton
In reply to this post by Michael Lewis
On Tue, 19 May 2020 at 22:15, Michael Lewis <[hidden email]> wrote:

> Increase default_statistics_target, at least on that column, and see if you get a much much better plan. I don't know where I got this query from online, but here ya go. I'd be curious how frac_MCV in this changes when default_statistics_target is more like 250 or 500 and the table is analyzed again to reflect that change.

It chooses the fast plan for a limit of 10 if the stats target approaches the number of distinct customer_id values, which is 6000 for this test table (the l10/l100/l1000 columns show whether the fast plan is chosen with a limit of 10, 100 and 1000 respectively):

 stats |  frac_mcv   | n_distinct | n_mcv | n_hist | correlation | l10 | l100 | l1000
-------+-------------+------------+-------+--------+-------------+-----+------+-------
    -1 | 0.015666666 |       5728 |    34 |    101 |  0.98172975 | f   | f    | t
   150 | 0.015022225 |       5821 |    38 |    151 |   0.9817175 | f   | f    | t
   250 |  0.04347998 |       5867 |   134 |    251 |  0.98155195 | f   | t    | t
   500 |  0.12606017 |       5932 |   483 |    501 |  0.98155344 | f   | t    | t
   750 |  0.18231618 |       5949 |   750 |    751 |  0.98166454 | f   | t    | t
  1000 |   0.2329197 |       5971 |  1000 |   1001 |   0.9816691 | f   | t    | t
  1500 |   0.3312785 |       5982 |  1500 |   1501 |    0.981609 | f   | t    | t
  3000 |   0.6179379 |       5989 |  3000 |   2989 |    0.981612 | f   | t    | t
  4000 |   0.8033856 |       5994 |  4000 |   1994 |   0.9816348 | f   | t    | t
  4500 |   0.8881603 |       5994 |  4500 |   1494 |  0.98160636 | f   | t    | t
  4800 |   0.9281193 |       5993 |  4800 |   1193 |   0.9816273 | f   | t    | t
  4900 |   0.9396781 |       5994 |  4900 |   1094 |   0.9816546 | f   | t    | t
  5000 |   0.9500147 |       5993 |  5000 |    993 |   0.9816481 | t   | t    | t
  6000 |    0.999714 |       5996 |  5923 |     73 |  0.98162216 | t   | t    | t
 10000 |  0.99995905 |       5998 |  5970 |     28 |  0.98164326 | t   | t    | t