Case Insensitive Comparison with Postgres 12

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

Case Insensitive Comparison with Postgres 12

Igal @ Lucee.org
I am trying to test a simple case insensitive comparison.  Most likely the collation that I chose is wrong, but I'm not sure how to choose the correct one (for English/US?).  Here is my snippet:

create collation case_insensitive(
    provider=icu, locale='en-US-x-icu', deterministic=false
);
select 'Abc' = 'abc' collate case_insensitive;

I expected true but am getting false.

Any thoughts?

Thanks,

Igal

Reply | Threaded
Open this post in threaded view
|

Re: Case Insensitive Comparison with Postgres 12

Morris de Oryx
As I understand it, custom collation are not applied globally. Meaning, you have to associate a collation with a column or en expression with COLLATE. 
Reply | Threaded
Open this post in threaded view
|

Re: Case Insensitive Comparison with Postgres 12

Pavel Křehula
In reply to this post by Igal @ Lucee.org
Hello,
use correct locale identifier, in your case it should be:
create collation "case_insensitive" (provider=icu, locale="en-US-u-ks-level2", deterministic = false);

See http://www.unicode.org/reports/tr35/tr35-collation.html#Setting_Options
for available options.

--
Pavel

Dne 09.10.2019 0:51:52, "Igal Sapir" <[hidden email]> napsal:

I am trying to test a simple case insensitive comparison.  Most likely the collation that I chose is wrong, but I'm not sure how to choose the correct one (for English/US?).  Here is my snippet:

create collation case_insensitive(
    provider=icu, locale='en-US-x-icu', deterministic=false
);
select 'Abc' = 'abc' collate case_insensitive;

I expected true but am getting false.

Any thoughts?

Thanks,

Igal

Reply | Threaded
Open this post in threaded view
|

Re: Case Insensitive Comparison with Postgres 12

Wim Bertels-3
In reply to this post by Igal @ Lucee.org
Using the datatype citext might be an alternative solution

Igal Sapir <[hidden email]> schreef op October 8, 2019 10:51:52 PM UTC:
I am trying to test a simple case insensitive comparison.  Most likely the collation that I chose is wrong, but I'm not sure how to choose the correct one (for English/US?).  Here is my snippet:

create collation case_insensitive(
    provider=icu, locale='en-US-x-icu', deterministic=false
);
select 'Abc' = 'abc' collate case_insensitive;

I expected true but am getting false.

Any thoughts?

Thanks,

Igal


--
Verstuurd vanaf een fairphone met K-9 Mail. Excuseer mijn beknoptheid.
Reply | Threaded
Open this post in threaded view
|

Re: Case Insensitive Comparison with Postgres 12

Laurenz Albe
In reply to this post by Igal @ Lucee.org
Igal Sapir wrote:

> I am trying to test a simple case insensitive comparison.  Most likely the
> collation that I chose is wrong, but I'm not sure how to choose the correct
> one (for English/US?).  Here is my snippet:
>
> create collation case_insensitive(
>     provider=icu, locale='en-US-x-icu', deterministic=false
> );
> select 'Abc' = 'abc' collate case_insensitive;
>
> I expected true but am getting false.
>
> Any thoughts?

Yes, the LOCALE is wrong. Use

create collation case_insensitive (
   provider=icu, locale='en-US-u-ks-level2', deterministic=false
);

The name of the locale defines it.

My blog post can give a simple introduction:
https://www.cybertec-postgresql.com/en/icu-collations-against-glibc-2-28-data-corruption/

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com



Reply | Threaded
Open this post in threaded view
|

Re: Case Insensitive Comparison with Postgres 12

Igal @ Lucee.org
On 10/9/2019 12:34 AM, Laurenz Albe wrote:

> Igal Sapir wrote:
>> I am trying to test a simple case insensitive comparison.  Most likely the
>> collation that I chose is wrong, but I'm not sure how to choose the correct
>> one (for English/US?).  Here is my snippet:
>>
>> create collation case_insensitive(
>>      provider=icu, locale='en-US-x-icu', deterministic=false
>> );
>> select 'Abc' = 'abc' collate case_insensitive;
>>
>> I expected true but am getting false.
>>
>> Any thoughts?
> Yes, the LOCALE is wrong. Use
>
> create collation case_insensitive (
>     provider=icu, locale='en-US-u-ks-level2', deterministic=false
> );
>
> The name of the locale defines it.
>
> My blog post can give a simple introduction:
> https://www.cybertec-postgresql.com/en/icu-collations-against-glibc-2-28-data-corruption/

Thank you all for replying.  I tried to use the locale suggested by both
Laurenz and Pavel, en-US-u-ks-level2, but I'm still getting false for a
simple comparison of 'Abc' = 'abc'.  I tried the locale both as a
'string' and as an "identifier":

 > select version();

version |
-------------------------------------------------------------------------------------------------------|
PostgreSQL 12.0 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5
20150623 (Red Hat 4.8.5-39), 64-bit|

 > drop collation if exists case_insensitive;

 > create collation case_insensitive (
    provider=icu, locale="en-US-u-ks-level2", deterministic=false
);

 > select 'Abc' = 'abc' collate case_insensitive as is_equal;

is_equal|
--------|
false   |

What am I doing wrong here?

Thanks,

Igal




Reply | Threaded
Open this post in threaded view
|

Re: Case Insensitive Comparison with Postgres 12

Thomas Kellerer
Igal @ Lucee.org schrieb am 10.10.2019 um 14:41:

> Thank you all for replying.  I tried to use the locale suggested by
> both Laurenz and Pavel, en-US-u-ks-level2, but I'm still getting
> false for a simple comparison of 'Abc' = 'abc'.  I tried the locale
> both as a 'string' and as an "identifier":
>
>> drop collation if exists case_insensitive;
>
>> create collation case_insensitive (
>    provider=icu, locale="en-US-u-ks-level2", deterministic=false
> );
>
>> select 'Abc' = 'abc' collate case_insensitive as is_equal;
>
> is_equal|
> --------|
> false   |
>
> What am I doing wrong here?

Check the version of libicu that your Linux is using.
That locale format requires version 54 or later.
(My up-to-date CentOS 7.7 for example is still using version 50 and the EDB Windows binaries include version 53).

In another thread about ICU problems, Daniel Verite explained that in more detail:

> With ICU 53 or older, instead of the locale above, we must use the old-style syntax:
>
>  locale = 'de-DE@colStrength=secondary'

In your case I guess, it should be

   locale = 'en-US@colStrength=secondary'

Thomas


Reply | Threaded
Open this post in threaded view
|

Re: Case Insensitive Comparison with Postgres 12

Igal @ Lucee.org
Thomas,

On 10/10/2019 6:22 AM, Thomas Kellerer wrote:

> Igal @ Lucee.org schrieb am 10.10.2019 um 14:41:
>> Thank you all for replying.  I tried to use the locale suggested by
>> both Laurenz and Pavel, en-US-u-ks-level2, but I'm still getting
>> false for a simple comparison of 'Abc' = 'abc'.  I tried the locale
>> both as a 'string' and as an "identifier":
>>
>>> drop collation if exists case_insensitive;
>>> create collation case_insensitive (
>>     provider=icu, locale="en-US-u-ks-level2", deterministic=false
>> );
>>
>>> select 'Abc' = 'abc' collate case_insensitive as is_equal;
>> is_equal|
>> --------|
>> false   |
>>
>> What am I doing wrong here?
> Check the version of libicu that your Linux is using.
> That locale format requires version 54 or later.
> (My up-to-date CentOS 7.7 for example is still using version 50 and the EDB Windows binaries include version 53).
>
> In another thread about ICU problems, Daniel Verite explained that in more detail:
>
>> With ICU 53 or older, instead of the locale above, we must use the old-style syntax:
>>
>>   locale = 'de-DE@colStrength=secondary'
> In your case I guess, it should be
>
>     locale = 'en-US@colStrength=secondary'

That works, thank you!

I also have CentOS installed on that machine: CentOS Linux release
7.7.1908 (Core), showing libicu Version 50.2 via `yum info libicu`.

Best,

Igal





Reply | Threaded
Open this post in threaded view
|

Re: Case Insensitive Comparison with Postgres 12

Igal @ Lucee.org
In reply to this post by Igal @ Lucee.org
On Fri, Oct 11, 2019 at 1:09 AM stan <[hidden email]> wrote:
On Thu, Oct 10, 2019 at 05:41:47AM -0700, Igal @ Lucee.org wrote:
> On 10/9/2019 12:34 AM, Laurenz Albe wrote:
> > Igal Sapir wrote:
> > > I am trying to test a simple case insensitive comparison.  Most likely the
> > > collation that I chose is wrong, but I'm not sure how to choose the correct
> > > one (for English/US?).  Here is my snippet:
> > >
> > > create collation case_insensitive(
> > >      provider=icu, locale='en-US-x-icu', deterministic=false
> > > );
> > > select 'Abc' = 'abc' collate case_insensitive;
> > >
> > > I expected true but am getting false.
> > >
> > > Any thoughts?
> > Yes, the LOCALE is wrong. Use
> >
> > create collation case_insensitive (
> >     provider=icu, locale='en-US-u-ks-level2', deterministic=false
> > );
> >
> > The name of the locale defines it.
> >
> > My blog post can give a simple introduction:
> > https://www.cybertec-postgresql.com/en/icu-collations-against-glibc-2-28-data-corruption/
>
> Thank you all for replying.?? I tried to use the locale suggested by both
> Laurenz and Pavel, en-US-u-ks-level2, but I'm still getting false for a
> simple comparison of 'Abc' = 'abc'.?? I tried the locale both as a 'string'
> and as an "identifier":
>
> > select version();
>
> version |
> -------------------------------------------------------------------------------------------------------|
> PostgreSQL 12.0 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623
> (Red Hat 4.8.5-39), 64-bit|
>
> > drop collation if exists case_insensitive;
>
> > create collation case_insensitive (
> ???? provider=icu, locale="en-US-u-ks-level2", deterministic=false
> );
>
> > select 'Abc' = 'abc' collate case_insensitive as is_equal;
>
> is_equal|
> --------|
> false???? |
>
> What am I doing wrong here?
>
Out of curiosity is there a eason not to use the citext type for th?


Using the collation seems like a much cleaner approach, and I trust ICU to do a better job at comparing strings according to language rules etc.

Igal

Reply | Threaded
Open this post in threaded view
|

Re: Case Insensitive Comparison with Postgres 12

Daniel Verite
        Igal Sapir wrote:

> > Out of curiosity is there a eason not to use the citext type for th?
> >
> >
> Using the collation seems like a much cleaner approach, and I trust ICU to
> do a better job at comparing strings according to language rules etc.

One notable difference between citext and case-insensitive collations
by ICU is that the latter recognizes canonically equivalent sequences
of codepoints [1] as equal, while the former does not.

For instance:

=# CREATE COLLATION ci (locale='und@colStrength=secondary',
      provider='icu', deterministic=false);

=# SELECT E'E\u0302TES'::citext = 'Êtes'::citext AS "citext-equal",
          E'E\u0302TES' = 'Êtes' collate "ci" AS "ci-equal";
 citext-equal | ci-equal
--------------+----------
 f      | t

Another significant difference is that building or rebuilding an index on a
text column with a CI collation appears to be way faster than with citext
(I've seen 10:1 ratios, but do your own tests).

On the minus side, substring matching with LIKE or other methods
is not possible with CI collations whereas it does work with citext.


[1] https://en.wikipedia.org/wiki/Unicode_equivalence


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite