ICU for global collation

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

ICU for global collation

Peter Eisentraut-6
Here is an initial patch to add the option to use ICU as the global
collation provider, a long-requested feature.

To activate, use something like

    initdb --collation-provider=icu --locale=...

A trick here is that since we need to also still set the normal POSIX
locales, the --locale value needs to be valid as both a POSIX locale and
a ICU locale.  If that doesn't work out, there is also a way to specify
it separately, e.g.,

    initdb --collation-provider=icu --locale=en_US.utf8 --icu-locale=en

This complexity is unfortunate, but I don't see a way around it right now.

There are also options for createdb and CREATE DATABASE to do this for a
particular database only.

Besides this, the implementation is quite small: When starting up a
database, we create an ICU collator object, store it in a global
variable, and then use it when appropriate.  All the ICU code for
creating and invoking those collators already exists of course.

For the version tracking, I use the pg_collation row for the "default"
collation.  Again, this mostly reuses existing code and concepts.

Nondeterministic collations are not supported for the global collation,
because then LIKE and regular expressions don't work and that breaks
some system views.  This needs some separate research.

To test, run the existing regression tests against a database
initialized with ICU.  Perhaps some options for pg_regress could
facilitate that.

I fear that the Localization chapter in the documentation will need a
bit of a rewrite after this, because the hitherto separately treated
concepts of locale and collation are fusing together.  I haven't done
that here yet, but that would be the plan for later.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

v1-0001-Add-option-to-use-ICU-as-global-collation-provide.patch (60K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: ICU for global collation

Andrey Borodin-2
Hi!


> 20 авг. 2019 г., в 19:21, Peter Eisentraut <[hidden email]> написал(а):
>
> Here is an initial patch to add the option to use ICU as the global
> collation provider, a long-requested feature.
>
> To activate, use something like
>
>    initdb --collation-provider=icu --locale=...
>
> A trick here is that since we need to also still set the normal POSIX
> locales, the --locale value needs to be valid as both a POSIX locale and
> a ICU locale.  If that doesn't work out, there is also a way to specify
> it separately, e.g.,
>
>    initdb --collation-provider=icu --locale=en_US.utf8 --icu-locale=en

Thanks! This is very awaited feature.

Seems like user cannot change locale for database if icu is already chosen?

postgres=# \l
                               List of databases
   Name    | Owner | Encoding | Collate | Ctype | Provider | Access privileges
-----------+-------+----------+---------+-------+----------+-------------------
 postgres  | x4mmm | UTF8     | ru_RU   | ru_RU | icu      |
 template0 | x4mmm | UTF8     | ru_RU   | ru_RU | icu      | =c/x4mmm         +
           |       |          |         |       |          | x4mmm=CTc/x4mmm
 template1 | x4mmm | UTF8     | ru_RU   | ru_RU | icu      | =c/x4mmm         +
           |       |          |         |       |          | x4mmm=CTc/x4mmm
(3 rows)

postgres=# create database a template template0 collation_provider icu lc_collate 'en_US.utf8';
CREATE DATABASE
postgres=# \c a
2019-08-21 11:43:40.379 +05 [41509] FATAL:  collations with different collate and ctype values are not supported by ICU
FATAL:  collations with different collate and ctype values are not supported by ICU
Previous connection kept

Am I missing something?

BTW, psql does not know about collation_provider.

Best regards, Andrey Borodin.

Reply | Threaded
Open this post in threaded view
|

Re: ICU for global collation

Peter Eisentraut-6
On 2019-08-21 08:56, Andrey Borodin wrote:
> postgres=# create database a template template0 collation_provider icu lc_collate 'en_US.utf8';
> CREATE DATABASE
> postgres=# \c a
> 2019-08-21 11:43:40.379 +05 [41509] FATAL:  collations with different collate and ctype values are not supported by ICU
> FATAL:  collations with different collate and ctype values are not supported by ICU

Try

create database a template template0 collation_provider icu locale
'en_US.utf8';

which sets both lc_collate and lc_ctype.  But 'en_US.utf8' is not a
valid ICU locale name.  Perhaps use 'en' or 'en-US'.

I'm making a note that we should prevent creating a database with a
faulty locale configuration in the first place instead of failing when
we're connecting.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: ICU for global collation

Andrey Borodin-2


> 21 авг. 2019 г., в 12:23, Peter Eisentraut <[hidden email]> написал(а):
>
> On 2019-08-21 08:56, Andrey Borodin wrote:
>> postgres=# create database a template template0 collation_provider icu lc_collate 'en_US.utf8';
>> CREATE DATABASE
>> postgres=# \c a
>> 2019-08-21 11:43:40.379 +05 [41509] FATAL:  collations with different collate and ctype values are not supported by ICU
>> FATAL:  collations with different collate and ctype values are not supported by ICU
>
> Try
>
> create database a template template0 collation_provider icu locale
> 'en_US.utf8';
>
> which sets both lc_collate and lc_ctype.  But 'en_US.utf8' is not a
> valid ICU locale name.  Perhaps use 'en' or 'en-US'.
>
> I'm making a note that we should prevent creating a database with a
> faulty locale configuration in the first place instead of failing when
> we're connecting.

Yes, the problem is input with lc_collate is accepted
postgres=# create database a template template0 collation_provider icu lc_collate 'en_US.utf8';
CREATE DATABASE
postgres=# \c a
2019-09-11 10:01:00.373 +05 [56878] FATAL:  collations with different collate and ctype values are not supported by ICU
FATAL:  collations with different collate and ctype values are not supported by ICU
Previous connection kept
postgres=# create database b template template0 collation_provider icu locale 'en_US.utf8';
CREATE DATABASE
postgres=# \c b
You are now connected to database "b" as user "x4mmm".

I get same output with 'en' or 'en-US'.


Also, cluster initialized --with-icu started on binaries without icu just fine.
And only after some time, I've got that messages "ERROR:  ICU is not supported in this build".
Is it expected behavior? Maybe we should refuse to start without icu?

Best regards, Andrey Borodin.

Reply | Threaded
Open this post in threaded view
|

Re: ICU for global collation

Daniel Verite
In reply to this post by Peter Eisentraut-6
 Hi,

When trying databases defined with ICU locales, I see that backends
that serve such databases seem to have their LC_CTYPE inherited from
the environment (as opposed to a per-database fixed value).

That's a problem for the backend code that depends on libc functions
that themselves depend on LC_CTYPE, such as the full text search parser
and dictionaries.

For instance, if you start the instance with a C locale
(LC_ALL=C pg_ctl...) , and tries to use FTS in an ICU UTF-8 database,
it doesn't work:

template1=# create database "fr-utf8"
  template 'template0' encoding UTF8
  locale 'fr'
  collation_provider 'icu';

template1=# \c fr-utf8
You are now connected to database "fr-utf8" as user "daniel".

fr-utf8=# show lc_ctype;
 lc_ctype
----------
 fr
(1 row)

fr-utf8=# select to_tsvector('été');
ERROR: invalid multibyte character for locale
HINT:  The server's LC_CTYPE locale is probably incompatible with the
database encoding.

If I peek into the "real" LC_CTYPE when connected to this database,
I can see it's "C":

fr-utf8=# create extension plperl;
CREATE EXTENSION

fr-utf8=# create function lc_ctype() returns text as '$ENV{LC_CTYPE};'
  language plperl;
CREATE FUNCTION

fr-utf8=# select lc_ctype();
 lc_ctype
----------
 C


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


Reply | Threaded
Open this post in threaded view
|

Re: ICU for global collation

Timmer, Marius
Hi everyone,

like the others before me we (the university of Münster) are happy to
see this feature as well. Thank you this.

When I applied the patch two weeks ago I run into the issue that initdb
did not recognize the new parameters (collation-provider and icu-locale)
but I guess it was caused by my own stupidity.

> When trying databases defined with ICU locales, I see that backends
> that serve such databases seem to have their LC_CTYPE inherited from
> the environment (as opposed to a per-database fixed value).
I am able to recreate the issue described by Daniel on my machine.

Now it works as expected. I just had to update the patch since commit
3f6b3be3 had modified two lines which resulted in conflicts. You find
the updated patch as attachement to this mail.


Best regards,

Marius Timmer




--
Westfälische Wilhelms-Universität Münster (WWU)
Zentrum für Informationsverarbeitung (ZIV)
Röntgenstraße 7-13
Besucheradresse: Einsteinstraße 60 - Raum 107
48149 Münster
+49 251 83 31158
[hidden email]
https://www.uni-muenster.de/ZIV

v1-0002-Add-option-to-use-ICU-as-global-collation-provide_rebased.patch (46K) Download Attachment
smime.p7s (8K) Download Attachment