citext LIKE search bug

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

citext LIKE search bug

Morris de Oryx
I haven't reported a Postgres bug until now, please point me in the right direction if this isn't the right procedure for new reports.

I'm using citext fields, and am working through how to build indexes that the planner recognizes. I found this paragraph in the PG 11 release notes:

"Allow creation of indexes that can be used by LIKE comparisons on citext columns (Alexey Chernyshov)
To do this, the index must be created using the citext_pattern_ops operator class."


As far as I can tell, this doesn't work, or at least not in my case with an expression index. I've got a table with about 8M rows where the citext column contains anything from 1-5000+ characters. Since that's too big for a B-tree entry in some cases, I've got an expression index.

CREATE INDEX record_changes_log_detail_old_value_ix_btree
    ON record_changes_log_detail 
    USING btree ((substring(old_value,1,1024)::citext) citext_pattern_ops);

The following = query *does* use the index:

set max_parallel_workers_per_gather = 0;
explain analyze
select * from record_changes_log_detail where substring(old_value,1,1024)::citext = 'Gold Kerrison Neuro';

The following LIKE query does *not* use the index:

set max_parallel_workers_per_gather = 0;
explain analyze
select * from record_changes_log_detail where substring(old_value,1,1024)::citext LIKE 'Gold Kerrison Neuro%';


Interestingly, if I build the index with text_pattern_ops, then these two queries *do* use the index:

select * from record_changes_log_detail where substring(old_value,1,1024) = 'Gold Kerrison Neuro';
select * from record_changes_log_detail where substring(old_value,1,1024) LIKE 'Gold Kerrison Neuro%';

So far as I can tell, the LIKE support for citext doesn't work. Or else I'm not understanding how to build the index or construct the query correctly.
Reply | Threaded
Open this post in threaded view
|

Fwd: citext LIKE search bug

Morris de Oryx
I haven't reported a Postgres bug until now, please point me in the right direction if this isn't the right procedure for new reports.

I'm using citext fields, and am working through how to build indexes that the planner recognizes. I found this paragraph in the PG 11 release notes:

"Allow creation of indexes that can be used by LIKE comparisons on citext columns (Alexey Chernyshov)
To do this, the index must be created using the citext_pattern_ops operator class."


As far as I can tell, this doesn't work, or at least not in my case with an expression index. I've got a table with about 8M rows where the citext column contains anything from 1-5000+ characters. Since that's too big for a B-tree entry in some cases, I've got an expression index.

CREATE INDEX record_changes_log_detail_old_value_ix_btree
    ON record_changes_log_detail 
    USING btree ((substring(old_value,1,1024)::citext) citext_pattern_ops);

The following = query *does* use the index:

set max_parallel_workers_per_gather = 0;
explain analyze
select * from record_changes_log_detail where substring(old_value,1,1024)::citext = 'Gold Kerrison Neuro';

The following LIKE query does *not* use the index:

set max_parallel_workers_per_gather = 0;
explain analyze
select * from record_changes_log_detail where substring(old_value,1,1024)::citext LIKE 'Gold Kerrison Neuro%';


Interestingly, if I build the index with text_pattern_ops, then these two queries *do* use the index:

select * from record_changes_log_detail where substring(old_value,1,1024) = 'Gold Kerrison Neuro';
select * from record_changes_log_detail where substring(old_value,1,1024) LIKE 'Gold Kerrison Neuro%';

So far as I can tell, the LIKE support for citext doesn't work. Or else I'm not understanding how to build the index or construct the query correctly.
Reply | Threaded
Open this post in threaded view
|

Re: citext LIKE search bug

Tom Lane-2
In reply to this post by Morris de Oryx
Morris de Oryx <[hidden email]> writes:
> I'm using citext fields, and am working through how to build indexes that
> the planner recognizes. I found this paragraph in the PG 11 release notes:

> "Allow creation of indexes that can be used by LIKE comparisons on citext
> columns (Alexey Chernyshov)
> To do this, the index must be created using the citext_pattern_ops operator
> class."
> https://www.postgresql.org/docs/11/release-11.html

> As far as I can tell, this doesn't work, or at least not in my case with an
> expression index.

Hm.  I found the original submission in the archives [1], and Alexey
explicitly *didn't* claim that that patch was sufficient to enable LIKE
index optimizations on citext.  Indeed it obviously isn't, since it just
created some weirdly-named operators without connecting them up to the
LIKE plumbing in any way.  So this seems like a case of the REL11 release
notes author (probably Bruce) not reading too closely, which doesn't seem
like entirely his fault since the commit message was totally content-free
about what the point of the patch was [2].  But anyway, right now it seems
to me that citext_pattern_ops has exactly zero value, which makes me
wonder why we committed it in advance of some use-case getting filled in.
It's not such a large patch that it had to get in to reduce its maintenance
overhead.

While I'm looking at it, I notice that the patch failed to honor the
scalarltsel-vs-scalarlesel, scalargtsel-vs-scalargesel distinction
that had been created not too long before.  Sigh.

Anyway it seems like the only near-term to-do item here is to correct
the v11 release notes to not claim that citext_pattern_ops does anything
useful.  Maybe we should just remove the entry altogether.

                        regards, tom lane

[1] https://www.postgresql.org/message-id/flat/d868ae6c-501c-a17c-c01b-f531d646172d%40postgrespro.ru
[2] https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=f24649976


Reply | Threaded
Open this post in threaded view
|

Re: citext LIKE search bug

Morris de Oryx
Thanks for the answer, "doc bug" works for me.

For what it's worth, the citext_pattern_ops option seems to provide case-blind = searches, which text_pattern_ops does not. Unless I got confused when I tested out the various permutations. If that's true, then citext_pattern_ops has value for me.

On Thu, Sep 19, 2019 at 2:19 PM Tom Lane <[hidden email]> wrote:
Morris de Oryx <[hidden email]> writes:
> I'm using citext fields, and am working through how to build indexes that
> the planner recognizes. I found this paragraph in the PG 11 release notes:

> "Allow creation of indexes that can be used by LIKE comparisons on citext
> columns (Alexey Chernyshov)
> To do this, the index must be created using the citext_pattern_ops operator
> class."
> https://www.postgresql.org/docs/11/release-11.html

> As far as I can tell, this doesn't work, or at least not in my case with an
> expression index.

Hm.  I found the original submission in the archives [1], and Alexey
explicitly *didn't* claim that that patch was sufficient to enable LIKE
index optimizations on citext.  Indeed it obviously isn't, since it just
created some weirdly-named operators without connecting them up to the
LIKE plumbing in any way.  So this seems like a case of the REL11 release
notes author (probably Bruce) not reading too closely, which doesn't seem
like entirely his fault since the commit message was totally content-free
about what the point of the patch was [2].  But anyway, right now it seems
to me that citext_pattern_ops has exactly zero value, which makes me
wonder why we committed it in advance of some use-case getting filled in.
It's not such a large patch that it had to get in to reduce its maintenance
overhead.

While I'm looking at it, I notice that the patch failed to honor the
scalarltsel-vs-scalarlesel, scalargtsel-vs-scalargesel distinction
that had been created not too long before.  Sigh.

Anyway it seems like the only near-term to-do item here is to correct
the v11 release notes to not claim that citext_pattern_ops does anything
useful.  Maybe we should just remove the entry altogether.

                        regards, tom lane

[1] https://www.postgresql.org/message-id/flat/d868ae6c-501c-a17c-c01b-f531d646172d%40postgrespro.ru
[2] https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=f24649976
Reply | Threaded
Open this post in threaded view
|

Re: citext LIKE search bug

Tom Lane-2
Morris de Oryx <[hidden email]> writes:
> For what it's worth, the citext_pattern_ops option seems to provide
> case-blind = searches, which text_pattern_ops does not. Unless I got
> confused when I tested out the various permutations. If that's true, then
> citext_pattern_ops has value for me.

The default citext index opclass should handle that case fine ...

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: citext LIKE search bug

Jeff Janes
In reply to this post by Morris de Oryx


On Thu, Sep 19, 2019 at 12:30 AM Morris de Oryx <[hidden email]> wrote:
Thanks for the answer, "doc bug" works for me.

For what it's worth, the citext_pattern_ops option seems to provide case-blind = searches, which text_pattern_ops does not.

That is correct, but the default operator for citext type already provides case-blind =.  The only improvement on that that  citext_pattern_ops could hope to achieve is fast case-blind prefix matching, which it fails to do. 

Note that you can get case-blind LIKE matching use ILIKE, and can accelerate it with a pg_trgm index.  However, if the only type of matching you want to accelerate is prefix matching (% only at the end of the pattner), then pg_trgm will be much less efficient than a fully-functioning  citext_pattern_ops would have been had it done what its name implies.  Still, it might be better than the other choices you currently have.

Cheers,

Jeff
Reply | Threaded
Open this post in threaded view
|

Re: citext LIKE search bug

Morris de Oryx
Thanks for the answer and clarification. Yes, I'm using trigram index now, and it's working well. I love n-grams for fuzzy matching, and the Postgres implementation is a thing of beauty. It is a bit funny in that you have to use LIKE to hint the planner to use the index on a = search.

I've been working with Postgres for a couple of years now, coming from another environment. In our case, Postgres is not a front-line system, it's a secondary system for aggregation, analysis and pushing things through to a visualization platform. Postgres is just awesome but, whenever you're shifting to a new context, it's easy to come in with bad priors. That's me and citext, all the way down.

Postgres has case-sensitive text searches, as you all know. I guess that becomes a bit more optional in PG 12 with case and/or diacritical-blind collations...but we're on PG 11 deployed on RDS, so we're unlikely to see PG 12 a lot before ~March 2020. We started on 9.5 or 9.6. What I was used to are case-blind searches with B-trees of up to the first 1024 bytes of character data (it's all in UTF-16) that work on = <> and starts_with% without any functions or castings. So, Bad Priors. In Postgres:

-- Searches on text are case-sensitive (I figured that out early.)

-- The planner only uses a function index if the query uses *exactly* the same query. This I found out *last week*.

It's that second bit that was the killer, and that makes me question the point of citext at all. For background, I dug through my notes and assembled a chronological list of days when users asked me for case-sensitive search in their database over the past 30+ years:

....

Exactly never. My goal was to make case-blind searches automatic and transparent...no special goo in the query:

select foo from bar where lower(text) = lower('the search string');

But, as it turns out, you lose the indexes unless you do exactly something like that with citext. Like if you use the clever (not my clever) idea of B-treeing a hash of the text:

select *
  from record_changes_log_detail
 where hashtext(old_value) = hashtext('Gold Kerrison Neuro') and -- The first clause uses the index to find records with matching hash codes.
       old_value = 'Gold Kerrison Neuro'; -- The second clause makes sure that the records are real matches.

That works perfectly for = searches, but there's a whole lot of complexity in the query syntax for someone used to this to do the same thing:

select *
  from record_changes_log_detail
 where old_value = 'Gold Kerrison Neuro';
 
Is there, or could there be, any way for the query planner to guess that an expression-based index should be searched other than recapitulating the entire expression in the query? For example, my sample data has about 8M rows with a citext field with old values (a sort of audit-log thing). The length of these values are typically very small. 6 records in the entire data set have long values. If I index the first 128 characters, I think that gives me the original value for 99.9999+% of the rows. So, something like this - or the same for text: 

DROP INDEX IF EXISTS record_changes_log_detail_old_value_ix_btree;
CREATE INDEX record_changes_log_detail_old_value_ix_btree
    ON record_changes_log_detail
    USING btree ((left(old_value,128)::citext) citext_pattern_ops);

For the index to be used today, the search needs to be

select * from record_changes_log_detail
where left(old_value,128)::citext = 'Gold Kerrison Neuro';

But here's the thing, Postgres _knows_ that the expression index is on the first 128 characters of the field. If the query value is within that value on an = search, then the B-tree of the substring could be used, and would return 100% accurate results.

Is that sort of query planner optimization conceivable in Postgres? I'd offer to help but, well, I'm absolutely unqualified. I knew a guy who gave up his PhD on mapping five-dimensional math into four-dimensional space. He said "I was smart enough to ask the question, but not even close to smart enough to answer it." That's me on my query planner idea/question.

Reply | Threaded
Open this post in threaded view
|

Re: citext LIKE search bug

Bruce Momjian
In reply to this post by Tom Lane-2
On Thu, Sep 19, 2019 at 12:19:34AM -0400, Tom Lane wrote:

> Morris de Oryx <[hidden email]> writes:
> > I'm using citext fields, and am working through how to build indexes that
> > the planner recognizes. I found this paragraph in the PG 11 release notes:
>
> > "Allow creation of indexes that can be used by LIKE comparisons on citext
> > columns (Alexey Chernyshov)
> > To do this, the index must be created using the citext_pattern_ops operator
> > class."
> > https://www.postgresql.org/docs/11/release-11.html
>
> > As far as I can tell, this doesn't work, or at least not in my case with an
> > expression index.
>
> Hm.  I found the original submission in the archives [1], and Alexey
> explicitly *didn't* claim that that patch was sufficient to enable LIKE
> index optimizations on citext.  Indeed it obviously isn't, since it just
> created some weirdly-named operators without connecting them up to the
> LIKE plumbing in any way.  So this seems like a case of the REL11 release
> notes author (probably Bruce) not reading too closely, which doesn't seem
> like entirely his fault since the commit message was totally content-free
> about what the point of the patch was [2].

I have no idea how I would have generated any release note item with
that commit or the commit diff.  I must have asked someone, but I don't
remember who.  The item was in my initial version of the relesae notes,
so it was not added later.  Sorry.

--
  Bruce Momjian  <[hidden email]>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +