surprising query optimisation


Chris Withers-2
Hi All,

We have an app that deals with a lot of queries, and we've been slowly
seeing performance issues emerge. We take a lot of free form queries
from users and stumbled upon a very surprising optimisation.

So, we have a 'state' column which is a 3 character string column with
an index on it. Despite being a string, this column is only used to
store one of three values: 'NEW', 'ACK', or 'RSV'.

One of our most common query clauses is "state!='RSV'", and we've found
that by substituting this clause with "state='ACK' or state='NEW'"
wherever it was used, we've dropped the postgres server's load average
from 20 down to 4 and the CPU usage from 60% in user space down to <5%.

This seems counter-intuitive to me, so thought I'd ask here. Why would
this be likely to make such a difference? We're currently on 9.4, is
this something that's likely to be different (better? worse?) if we got
all the way up to 10 or 11?

cheers,

Chris



Re: surprising query optimisation

Adrian Klaver-4
On 11/28/18 2:26 PM, Chris Withers wrote:

> Hi All,
>
> We have an app that deals with a lot of queries, and we've been slowly
> seeing performance issues emerge. We take a lot of free form queries
> from users and stumbled upon a very surprising optimisation.
>
> So, we have a 'state' column which is a 3 character string column with
> an index on it. Despite being a string, this column is only used to
> store one of three values: 'NEW', 'ACK', or 'RSV'.
>
> One of our most common queries clauses is "state!='RSV'" and we've found
> that by substituting this clause with "state='ACK' or state='NEW'"
> wherever it was used, we've dropped the postgres server's load average
> from 20 down to 4 and the CPU usage from 60% in user space down to <5%.
>
> This seems counter-intuitive to me, so thought I'd ask here. Why would

The way I see it, state = 'something' is a confined question; state !=
'something' is potentially unbounded.

Does EXPLAIN ANALYZE shed any light?
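Something along these lines for both forms would be a useful starting
point (just a sketch; "your_table" is a placeholder for whatever the
real query hits):

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM your_table WHERE state != 'RSV';

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM your_table WHERE state = 'ACK' OR state = 'NEW';

Comparing the row estimates and whether each one uses the index or a
sequential scan should show where the time is going.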

> this be likely to make such a difference? We're currently on 9.4, is
> this something that's likely to be different (better? worse?) if we got
> all the way up to 10 or 11?
>
> cheers,
>
> Chris
>
>


--
Adrian Klaver
[hidden email]


Re: surprising query optimisation

Gavin Flower-2
In reply to this post by Chris Withers-2
On 29/11/2018 11:26, Chris Withers wrote:

> Hi All,
>
> We have an app that deals with a lot of queries, and we've been slowly
> seeing performance issues emerge. We take a lot of free form queries
> from users and stumbled upon a very surprising optimisation.
>
> So, we have a 'state' column which is a 3 character string column with
> an index on it. Despite being a string, this column is only used to
> store one of three values: 'NEW', 'ACK', or 'RSV'.
>
> One of our most common queries clauses is "state!='RSV'" and we've
> found that by substituting this clause with "state='ACK' or
> state='NEW'" wherever it was used, we've dropped the postgres server's
> load average from 20 down to 4 and the CPU usage from 60% in user
> space down to <5%.
>
> This seems counter-intuitive to me, so thought I'd ask here. Why would
> this be likely to make such a difference? We're currently on 9.4, is
> this something that's likely to be different (better? worse?) if we
> got all the way up to 10 or 11?
>
> cheers,
>
> Chris
>
>
At a guess...

     "state!='RSV'"  ==> pg only has to check one value

and

     "state='ACK' or state='NEW'"   ==> pg has to check two values

so I would expect the '!=' to be faster.


Cheers,
Gavin



Re: surprising query optimisation

Stephen Frost
In reply to this post by Chris Withers-2
Greetings,

* Chris Withers ([hidden email]) wrote:
> We have an app that deals with a lot of queries, and we've been slowly
> seeing performance issues emerge. We take a lot of free form queries from
> users and stumbled upon a very surprising optimisation.
>
> So, we have a 'state' column which is a 3 character string column with an
> index on it. Despite being a string, this column is only used to store one
> of three values: 'NEW', 'ACK', or 'RSV'.

Sounds like a horrible field to have an index on.

> One of our most common queries clauses is "state!='RSV'" and we've found
> that by substituting this clause with "state='ACK' or state='NEW'" wherever
> it was used, we've dropped the postgres server's load average from 20 down
> to 4 and the CPU usage from 60% in user space down to <5%.

You've changed the question you're asking the database.  PG doesn't
*know* that there's only those three values, but it probably has a
pretty good guess about how many records are ACK and how many are NEW
thanks to those being in the MCV list.
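If you're curious, something like this should show what PG has in the
MCV list for that column (untested here; swap in your real table name):

    SELECT most_common_vals, most_common_freqs
    FROM pg_stats
    WHERE tablename = 'your_table'   -- placeholder
      AND attname = 'state';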

> This seems counter-intuitive to me, so thought I'd ask here. Why would this
> be likely to make such a difference? We're currently on 9.4, is this
> something that's likely to be different (better? worse?) if we got all the
> way up to 10 or 11?

When you change what you're asking, PG is going to change how it gives
you the answer and sometimes that'll be faster and other times it won't
be.

Really though, if you want something more than wild speculation, posting
the 'explain analyze' of each query along with the actual table
definitions and sizes and such would be the best way to get it.

I'd suggest you check out the wiki article written about this kind of
question:

https://wiki.postgresql.org/wiki/Slow_Query_Questions

Thanks!

Stephen


Re: surprising query optimisation

Chris Withers-2
On 28/11/2018 22:49, Stephen Frost wrote:

> * Chris Withers ([hidden email]) wrote:
>> We have an app that deals with a lot of queries, and we've been slowly
>> seeing performance issues emerge. We take a lot of free form queries from
>> users and stumbled upon a very surprising optimisation.
>>
>> So, we have a 'state' column which is a 3 character string column with an
>> index on it. Despite being a string, this column is only used to store one
>> of three values: 'NEW', 'ACK', or 'RSV'.
>
> Sounds like a horrible field to have an index on.

That's counter-intuitive for me. What leads you to say this and what
would you do/recommend instead?

> Really though, if you want something more than wild speculation, posting
> the 'explain analyze' of each query along with the actual table
> definitions and sizes and such would be the best way to get it.

table definition:

# \d alerts_alert
               Table "public.alerts_alert"
      Column      |           Type           | Modifiers
-----------------+--------------------------+-----------
  tags            | jsonb                    | not null
  id              | character varying(86)    | not null
  earliest_seen   | timestamp with time zone | not null
  latest_seen     | timestamp with time zone | not null
  created         | timestamp with time zone | not null
  modified        | timestamp with time zone | not null
  type            | character varying(300)   | not null
  state           | character varying(3)     | not null
  until           | timestamp with time zone |
  latest_note     | text                     | not null
  created_by_id   | integer                  | not null
  modified_by_id  | integer                  | not null
  owner_id        | integer                  |
  owning_group_id | integer                  | not null
  latest_new      | timestamp with time zone | not null
Indexes:
     "alerts_alert_pkey" PRIMARY KEY, btree (id)
     "alert_tags_index" gin (tags)
     "alerts_alert_1efacf1d" btree (latest_seen)
     "alerts_alert_3103a7d8" btree (until)
     "alerts_alert_599dcce2" btree (type)
     "alerts_alert_5e7b1936" btree (owner_id)
     "alerts_alert_9ae73c65" btree (modified)
     "alerts_alert_9ed39e2e" btree (state)
     "alerts_alert_b3da0983" btree (modified_by_id)
     "alerts_alert_c5151f5a" btree (earliest_seen)
     "alerts_alert_e2fa5388" btree (created)
     "alerts_alert_e93cb7eb" btree (created_by_id)
     "alerts_alert_efea2d76" btree (owning_group_id)
     "alerts_alert_id_13155e16_like" btree (id varchar_pattern_ops)
     "alerts_alert_latest_new_e8d1fbde_uniq" btree (latest_new)
     "alerts_alert_state_90ab480b_like" btree (state varchar_pattern_ops)
     "alerts_alert_type_3021f46f_like" btree (type varchar_pattern_ops)
Foreign-key constraints:
     "alerts_alert_created_by_id_520608c0_fk_alerts_user_id" FOREIGN KEY
(created_by_id) REFERENCES alerts_user(id) DEFERRABLE INITIALLY DEFERRED
     "alerts_alert_modified_by_id_6db4b04b_fk_alerts_user_id" FOREIGN
KEY (modified_by_id) REFERENCES alerts_user(id) DEFERRABLE INITIALLY
DEFERRED
     "alerts_alert_owner_id_0c00548a_fk_alerts_user_id" FOREIGN KEY
(owner_id) REFERENCES alerts_user(id) DEFERRABLE INITIALLY DEFERRED
     "alerts_alert_owning_group_id_a4869b66_fk_auth_group_id" FOREIGN
KEY (owning_group_id) REFERENCES auth_group(id) DEFERRABLE INITIALLY
DEFERRED
Referenced by:
     TABLE "alerts_alertevent" CONSTRAINT
"alerts_alertevent_alert_id_edd734b8_fk_alerts_alert_id" FOREIGN KEY
(alert_id) REFERENCES alerts_alert(id) DEFERRABLE INITIALLY DEFERRED

Row counts by state:

# select state, count(*) from alerts_alert group by 1 order by 1;
  state |  count
-------+---------
  ACK   |    1053
  NEW   |    1958
  RSV   | 1528623
(3 rows)

here's an example of the "bad" query plan:
https://explain.depesz.com/s/cDkp

here's an example with all the "state!='RSV'" clauses rewritten as I
described:
https://explain.depesz.com/s/B9Xi

> I'd suggest you check out the wiki article written about this kind of
> question:
>
> https://wiki.postgresql.org/wiki/Slow_Query_Questions

Thanks!

Chris


Re: surprising query optimisation

Stephen Frost
Greetings,

On Fri, Nov 30, 2018 at 07:52 Chris Withers <[hidden email]> wrote:
> Row counts by state:
>
> # select state, count(*) from alerts_alert group by 1 order by 1;
>   state |  count
> -------+---------
>   ACK   |    1053
>   NEW   |    1958
>   RSV   | 1528623
> (3 rows)
>
> here's an example of the "bad" query plan:
> https://explain.depesz.com/s/cDkp
>
> here's an example with all the "state!='RSV'" clauses rewritten as I
> described:
> https://explain.depesz.com/s/B9Xi

Have you tried a partial index on state!=‘RSV’?
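Roughly something like this (untested sketch, index name made up; check
with EXPLAIN that your common queries actually pick it up):

    CREATE INDEX CONCURRENTLY alerts_alert_state_not_rsv
        ON alerts_alert (state)
        WHERE state != 'RSV';

Queries whose WHERE clause implies the predicate (state != 'RSV', or
equality against 'ACK'/'NEW') should be able to use it, and it only
contains the ~3000 non-RSV rows instead of all 1.5M.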

Thanks,

Stephen


Re: surprising query optimisation

Chris Withers-2
On 30/11/2018 12:55, Stephen Frost wrote:
>      > I'd suggest you check out the wiki article written about this kind of
>      > question:
>      >
>      > https://wiki.postgresql.org/wiki/Slow_Query_Questions
>
>
> Have you tried a partial index on state!=‘RSV’?

The solution I originally posted, which we can do easily enough at our
query generation layer, is working perfectly, but this is good to know
for next time.

My post here is mainly to try and understand what's going on so I can
improve my general feel for how to use postgres at its best.

So, why was the query ending up being a big scan rather than some quick
lookups based on the index?

cheers,

Chris


Re: surprising query optimisation

Stephen Frost
Greetings,

On Fri, Nov 30, 2018 at 08:00 Chris Withers <[hidden email]> wrote:
> So, why was the query ending up being a big scan rather than some quick
> lookups based on the index?

I thought that was mentioned already, but at least part of the issue is
that PG can’t just search for the other values when it’s a != against
the index, because it wouldn’t know what values to search for...  PG
doesn’t know, with complete certainty, that there’s only 3 values.

The partial index is something you should want anyway: that index won’t
ever be used to search for RSV because it’s cheaper to just scan the
table if we are looking for those, and the index will be much, much
smaller without that common value being included.

Thanks!

Stephen

Re: surprising query optimisation

Stephen Frost
In reply to this post by Chris Withers-2
Greetings,

* Chris Withers ([hidden email]) wrote:

> On 28/11/2018 22:49, Stephen Frost wrote:
> >* Chris Withers ([hidden email]) wrote:
> >>We have an app that deals with a lot of queries, and we've been slowly
> >>seeing performance issues emerge. We take a lot of free form queries from
> >>users and stumbled upon a very surprising optimisation.
> >>
> >>So, we have a 'state' column which is a 3 character string column with an
> >>index on it. Despite being a string, this column is only used to store one
> >>of three values: 'NEW', 'ACK', or 'RSV'.
> >
> >Sounds like a horrible field to have an index on.
>
> That's counter-intuitive for me. What leads you to say this and what would
> you do/recommend instead?
For this, specifically, it's because you end up with exactly what you
have: a large index with tons of duplicate values.  Indexes are
particularly good when you have high-cardinality fields.  Now, if you
have a skewed index, where there's one popular value and a few much less
popular ones, then that's where you really want a partial index (as I
suggested earlier) so that queries against the non-popular value(s) are
able to use the index and the index is much smaller.

Of course, for this to work you need to set up the partial index
correctly and make sure that your queries end up using it.

Thanks!

Stephen


Re: surprising query optimisation

Gavin Flower-2
On 01/12/2018 04:33, Stephen Frost wrote:

> Greetings,
>
> * Chris Withers ([hidden email]) wrote:
>> On 28/11/2018 22:49, Stephen Frost wrote:
>>> * Chris Withers ([hidden email]) wrote:
>>>> We have an app that deals with a lot of queries, and we've been slowly
>>>> seeing performance issues emerge. We take a lot of free form queries from
>>>> users and stumbled upon a very surprising optimisation.
>>>>
>>>> So, we have a 'state' column which is a 3 character string column with an
>>>> index on it. Despite being a string, this column is only used to store one
>>>> of three values: 'NEW', 'ACK', or 'RSV'.
>>> Sounds like a horrible field to have an index on.
>> That's counter-intuitive for me. What leads you to say this and what would
>> you do/recommend instead?
> For this, specifically, it's because you end up with exactly what you
> have: a large index with tons of duplicate values.  Indexes are
> particularly good when you have high-cardinality fields.  Now, if you
> have a skewed index, where there's one popular value and a few much less
> popular ones, then that's where you really want a partial index (as I
> suggest earlier) so that queries against the non-popular value(s) is
> able to use the index and the index is much smaller.
>
> Of course, for this to work you need to set up the partial index
> correctly and make sure that your queries end up using it.
>
> Thanks!
>
> Stephen

An index simply tells pg which block to look at (assuming that the index
itself is not sufficient to satisfy the query), so if using an index
would still require that pg look at most blocks, then it's cheaper to
just scan the whole table rather than load the index and still look at
all the blocks that contain the table data.  I've oversimplified
slightly.

An index is best used when using it results in fewer blocks being read
from disk.

Also, the use of RAM is better when the size of the index is kept small.
For example, having an index on sex for nurses is a waste of time because
most nurses are female.  However, a partial index (as suggested by
Stephen, for your query) that includes only males could be useful if you
have queries looking for male nurses (this assumes male nurses are a very
small fraction of nurses, such that most data blocks don't have rows for
male nurses, and the planner knows this).
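For instance (a made-up sketch, not your schema):

    CREATE INDEX nurses_male_idx ON nurses (sex) WHERE sex = 'M';

Only the rows with sex = 'M' end up in the index, so it stays tiny and
is only considered for queries that match the predicate.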

I once optimised a very complex set of queries that made extensive use
of indexes.  However, with the knowledge I have today, I would have most
likely had fewer and smaller indexes.  As I now realize, some of my
indexes were probably counterproductive, especially as I'd given no
thought to how much RAM would be required to host the data plus
indexes!  Fortunately, while the objective was to run all those queries
within 24 hours, they actually only took a couple of hours.

Chris, I would strongly suggest you read up on the excellent
documentation pg has about indexes, but don't expect to understand it
all at one sitting...


Cheers,
Gavin



Re: surprising query optimisation

Chris Withers-2
In reply to this post by Stephen Frost
On 30/11/2018 15:33, Stephen Frost wrote:

> Greetings,
>
> * Chris Withers ([hidden email]) wrote:
>> On 28/11/2018 22:49, Stephen Frost wrote:
>>> * Chris Withers ([hidden email]) wrote:
>>>> We have an app that deals with a lot of queries, and we've been slowly
>>>> seeing performance issues emerge. We take a lot of free form queries from
>>>> users and stumbled upon a very surprising optimisation.
>>>>
>>>> So, we have a 'state' column which is a 3 character string column with an
>>>> index on it. Despite being a string, this column is only used to store one
>>>> of three values: 'NEW', 'ACK', or 'RSV'.
>>>
>>> Sounds like a horrible field to have an index on.
>>
>> That's counter-intuitive for me. What leads you to say this and what would
>> you do/recommend instead?
>
> For this, specifically, it's because you end up with exactly what you
> have: a large index with tons of duplicate values.  Indexes are
> particularly good when you have high-cardinality fields.  Now, if you
> have a skewed index, where there's one popular value and a few much less
> popular ones, then that's where you really want a partial index (as I
> suggest earlier) so that queries against the non-popular value(s) is
> able to use the index and the index is much smaller.

Interesting! In my head, for some reason, I'd always assumed a btree
index would break down a char field based on the characters within it.
Does that never happen?

If I changed this to be an enum field, would != still perform poorly or
can the query optimiser spot that it's an enum and just look for the
other options?
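For concreteness, I mean something like this (untested; presumably the
varchar_pattern_ops index on state would need dropping first):

    CREATE TYPE alert_state AS ENUM ('NEW', 'ACK', 'RSV');
    ALTER TABLE alerts_alert
        ALTER COLUMN state TYPE alert_state
        USING state::alert_state;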

cheers,

Chris


Re: surprising query optimisation

Chris Withers-2
In reply to this post by Gavin Flower-2
On 30/11/2018 22:10, Gavin Flower wrote:
>
> I once optimised a very complex set queries that made extensive use of
> indexes.  However, with the knowledge I have today, I would have most
> likely had fewer and smaller indexes.  As I now realize, that some of my
> indexes were probably counter productive, especially as I'd given no
> thought to how much RAM would be required to host the data plus
> indexes!  Fortunately, while the objective was to run all those queries
> within 24 hours, they actually only took a couple of hours.

So, interestingly, this box has 250GB memory in it, and even though I've
set effective_cache_size to 200GB, I only see 9G of memory being used.
How can I persuade postgres to keep more in memory?

cheers,

Chris


Re: surprising query optimisation

Thomas Kellerer
In reply to this post by Stephen Frost
Stephen Frost schrieb am 30.11.2018 um 14:05:
> PG doesn’t know, with complete certainty, that there’s only 3
> values.
 
Would the optimizer consult a check constraint ensuring that?



Re: surprising query optimisation

Thomas Kellerer
In reply to this post by Chris Withers-2
Chris Withers schrieb am 05.12.2018 um 12:42:
> So, interestingly, this box has 250GB memory in it, and even though
> I've set effective_cache_size to 200GB, I only see 9G of memory being
> used. How can I persuade postgres to keep more in memory?
effective_cache_size is a hint to the optimizer about how much data can be expected to be cached (by the filesystem).

Only shared_buffers (and work_mem) will actually allocate memory from the operating system.
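In postgresql.conf terms, roughly (numbers made up; what's sensible
depends on your workload):

    shared_buffers = 32GB          # memory postgres itself allocates and manages
    effective_cache_size = 200GB   # planner hint only, nothing is allocated
    work_mem = 64MB                # per sort/hash operation, per backend

The rest of the RAM still gets used, it just shows up as the operating
system's filesystem cache rather than inside the postgres processes.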

Thomas


Re: surprising query optimisation

Stephen Frost
In reply to this post by Thomas Kellerer
Greetings,

* Thomas Kellerer ([hidden email]) wrote:
> Stephen Frost schrieb am 30.11.2018 um 14:05:
> > PG doesn’t know, with complete certainty, that there’s only 3
> > values.
>  
> Would the optimizer consult a check constraint ensuring that?

Not today, I don't believe (haven't looked at the code though, to be
fair), and it seems like it'd be an awful lot of work for a rare
use-case that would be better served by just a partial index...

What we will do today is if you ask for a specific value and there's a
CHECK constraint which lists out specific values and that value isn't
among the set, then we'll just skip over the table (One-time filter:
false), thanks to constraint exclusion.
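For example, a quick sketch (not run here; note constraint_exclusion
defaults to 'partition', so it needs to be 'on' for it to apply to a
plain table):

    CREATE TABLE t (state varchar(3) NOT NULL
                    CHECK (state IN ('NEW', 'ACK', 'RSV')));
    SET constraint_exclusion = on;
    EXPLAIN SELECT * FROM t WHERE state = 'XXX';
    -- the plan is just a Result node with "One-Time Filter: false"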

Thanks!

Stephen


Re: surprising query optimisation

Stephen Frost
In reply to this post by Chris Withers-2
Greetings,

* Chris Withers ([hidden email]) wrote:

> On 30/11/2018 15:33, Stephen Frost wrote:
> >* Chris Withers ([hidden email]) wrote:
> >>On 28/11/2018 22:49, Stephen Frost wrote:
> >For this, specifically, it's because you end up with exactly what you
> >have: a large index with tons of duplicate values.  Indexes are
> >particularly good when you have high-cardinality fields.  Now, if you
> >have a skewed index, where there's one popular value and a few much less
> >popular ones, then that's where you really want a partial index (as I
> >suggest earlier) so that queries against the non-popular value(s) is
> >able to use the index and the index is much smaller.
>
> Interesting! In my head, for some reason, I'd always assumed a btree index
> would break down a char field based on the characters within it. Does that
> never happen?
Not sure what you mean by 'break down a char field'.

> If I changed this to be an enum field, would != still perform poorly or can
> the query optimiser spot that it's an enum and just look for the other
> options?

I don't believe we've got any kind of optimization like that today for
enums.

Thanks!

Stephen


Re: surprising query optimisation

Chris Withers-2
On 05/12/2018 14:38, Stephen Frost wrote:

> Greetings,
>
> * Chris Withers ([hidden email]) wrote:
>> On 30/11/2018 15:33, Stephen Frost wrote:
>>> * Chris Withers ([hidden email]) wrote:
>>>> On 28/11/2018 22:49, Stephen Frost wrote:
>>> For this, specifically, it's because you end up with exactly what you
>>> have: a large index with tons of duplicate values.  Indexes are
>>> particularly good when you have high-cardinality fields.  Now, if you
>>> have a skewed index, where there's one popular value and a few much less
>>> popular ones, then that's where you really want a partial index (as I
>>> suggest earlier) so that queries against the non-popular value(s) is
>>> able to use the index and the index is much smaller.
>>
>> Interesting! In my head, for some reason, I'd always assumed a btree index
>> would break down a char field based on the characters within it. Does that
>> never happen?
>
> Not sure what you mean by 'break down a char field'.

Rather than breaking into three buckets ('NEW', 'ACK', 'RSV'), I'd
assumed a more complicated hierarchy ('N', 'NE', 'A', 'AC', etc).

>> If I changed this to be an enum field, would != still perform poorly or can
>> the query optimiser spot that it's an enum and just look for the other
>> options?
>
> I don't believe we've got any kind of optimization like that today for
> enums.

Good to know. I see query optimisers as magic, and postgres often seems
to achieve magic results ;-)

Chris


Re: surprising query optimisation

Ron-2
On 12/05/2018 08:42 AM, Chris Withers wrote:

> On 05/12/2018 14:38, Stephen Frost wrote:
>> Greetings,
>>
>> * Chris Withers ([hidden email]) wrote:
>>> On 30/11/2018 15:33, Stephen Frost wrote:
>>>> * Chris Withers ([hidden email]) wrote:
>>>>> On 28/11/2018 22:49, Stephen Frost wrote:
>>>> For this, specifically, it's because you end up with exactly what you
>>>> have: a large index with tons of duplicate values.  Indexes are
>>>> particularly good when you have high-cardinality fields. Now, if you
>>>> have a skewed index, where there's one popular value and a few much less
>>>> popular ones, then that's where you really want a partial index (as I
>>>> suggest earlier) so that queries against the non-popular value(s) is
>>>> able to use the index and the index is much smaller.
>>>
>>> Interesting! In my head, for some reason, I'd always assumed a btree index
>>> would break down a char field based on the characters within it. Does that
>>> never happen?
>>
>> Not sure what you mean by 'break down a char field'.
>
> Rather than breaking into three buckets ('NEW', 'ACK', 'RSV'), a more
> complicated hierarchy ('N', 'NE', 'A', 'AC', etc).

The b-tree indexes on the legacy RDBMS which I still occasionally fiddle
with perform key prefix compression in a manner similar to what you refer
to, but otherwise that's not how b-trees work.

--
Angular momentum makes the world go 'round.


Re: surprising query optimisation

Stephen Frost
Greetings,

* Ron ([hidden email]) wrote:

> On 12/05/2018 08:42 AM, Chris Withers wrote:
> >On 05/12/2018 14:38, Stephen Frost wrote:
> >>>>* Chris Withers ([hidden email]) wrote:
> >>>Interesting! In my head, for some reason, I'd always assumed a btree index
> >>>would break down a char field based on the characters within it. Does that
> >>>never happen?
> >>
> >>Not sure what you mean by 'break down a char field'.
> >
> >Rather than breaking into three buckets ('NEW', 'ACK', 'RSV'), a more
> >complicated hierarchy ('N', 'NE', 'A', 'AC', etc).
>
> The b-tree indexes on legacy RDBMS which I still occasionally fiddle with
> performs key prefix compression in a manner similar to what you refer to,
> but otherwise that's not how b-trees work.
There's been some discussion of prefix compression in PostgreSQL.  Even
with that, though, it hardly seems sensible to have an index which has
tons of duplicates comprising most of the index, and a != would still
have to search the index to make sure there aren't any entries which
need to be returned..

Now, maybe once we get skipping scans where we would be able to skip
over a large chunk of the index because it's just tons of duplicates
without having to visit everything along the way, then maybe having this
inefficient index would "just" take up disk space, but why waste that
space?

Thanks!

Stephen
