ICU for global collation

Peter Eisentraut-6
Here is an initial patch to add the option to use ICU as the global
collation provider, a long-requested feature.

To activate, use something like

    initdb --collation-provider=icu --locale=...

A trick here is that since we need to also still set the normal POSIX
locales, the --locale value needs to be valid as both a POSIX locale and
an ICU locale.  If that doesn't work out, there is also a way to specify
it separately, e.g.,

    initdb --collation-provider=icu --locale=en_US.utf8 --icu-locale=en

This complexity is unfortunate, but I don't see a way around it right now.
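
To illustrate why one name rarely serves both purposes, here is a minimal sketch of deriving an ICU-style language tag from a POSIX locale name. The function name posix_to_icu_locale is hypothetical and is not part of the patch; ICU performs its own, far more thorough canonicalization, so this is only a rough picture of the mapping.

```python
def posix_to_icu_locale(posix_locale: str) -> str:
    """Derive a plausible ICU-style language tag from a POSIX locale name.

    Illustration only (hypothetical helper): real ICU canonicalization
    covers many more cases than this.
    """
    # Drop the codeset (".utf8") and any modifier ("@euro").
    base = posix_locale.split(".", 1)[0].split("@", 1)[0]
    if base in ("C", "POSIX", ""):
        raise ValueError(f"no obvious ICU equivalent for {posix_locale!r}")
    # POSIX separates language and territory with "_"; language tags use "-".
    return base.replace("_", "-")


print(posix_to_icu_locale("en_US.utf8"))  # en-US
print(posix_to_icu_locale("ru_RU"))       # ru-RU
```

Names such as "C" have no direct counterpart, which is one reason a separate --icu-locale option is useful.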

There are also options for createdb and CREATE DATABASE to do this for a
particular database only.

Besides this, the implementation is quite small: When starting up a
database, we create an ICU collator object, store it in a global
variable, and then use it when appropriate.  All the ICU code for
creating and invoking those collators already exists of course.

For the version tracking, I use the pg_collation row for the "default"
collation.  Again, this mostly reuses existing code and concepts.
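
The version-tracking idea can be sketched roughly as follows: compare the version string recorded for the default collation with the one the provider reports now, and warn when they differ. This is a simplified illustration, not the patch's code; the function name and the message wording are assumptions.

```python
from typing import Optional


def version_mismatch_warning(recorded: Optional[str],
                             actual: Optional[str]) -> Optional[str]:
    """Return a warning if the recorded and current versions differ.

    Hypothetical helper: 'recorded' stands for the version stored with the
    "default" collation, 'actual' for what the provider reports at startup.
    """
    # If either side has no version, there is nothing to compare.
    if recorded is None or actual is None:
        return None
    if recorded != actual:
        return (f"default collation was created with version {recorded}, "
                f"but the provider now reports version {actual}")
    return None


print(version_mismatch_warning("153.14", "153.14"))  # None
print(version_mismatch_warning("153.14", "153.88"))
```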

Nondeterministic collations are not supported for the global collation,
because then LIKE and regular expressions don't work and that breaks
some system views.  This needs some separate research.

To test, run the existing regression tests against a database
initialized with ICU.  Perhaps some options for pg_regress could
facilitate that.

I fear that the Localization chapter in the documentation will need a
bit of a rewrite after this, because the hitherto separately treated
concepts of locale and collation are fusing together.  I haven't done
that here yet, but that would be the plan for later.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment: v1-0001-Add-option-to-use-ICU-as-global-collation-provide.patch (60K)

Re: ICU for global collation

Andrey Borodin-2
Hi!


> On 20 Aug 2019, at 19:21, Peter Eisentraut <[hidden email]> wrote:
>
> Here is an initial patch to add the option to use ICU as the global
> collation provider, a long-requested feature.
>
> To activate, use something like
>
>    initdb --collation-provider=icu --locale=...
>
> A trick here is that since we need to also still set the normal POSIX
> locales, the --locale value needs to be valid as both a POSIX locale and
> an ICU locale.  If that doesn't work out, there is also a way to specify
> it separately, e.g.,
>
>    initdb --collation-provider=icu --locale=en_US.utf8 --icu-locale=en

Thanks! This is a long-awaited feature.

It seems the user cannot change the locale for a database if ICU is already chosen?

postgres=# \l
                               List of databases
   Name    | Owner | Encoding | Collate | Ctype | Provider | Access privileges
-----------+-------+----------+---------+-------+----------+-------------------
 postgres  | x4mmm | UTF8     | ru_RU   | ru_RU | icu      |
 template0 | x4mmm | UTF8     | ru_RU   | ru_RU | icu      | =c/x4mmm         +
           |       |          |         |       |          | x4mmm=CTc/x4mmm
 template1 | x4mmm | UTF8     | ru_RU   | ru_RU | icu      | =c/x4mmm         +
           |       |          |         |       |          | x4mmm=CTc/x4mmm
(3 rows)

postgres=# create database a template template0 collation_provider icu lc_collate 'en_US.utf8';
CREATE DATABASE
postgres=# \c a
2019-08-21 11:43:40.379 +05 [41509] FATAL:  collations with different collate and ctype values are not supported by ICU
FATAL:  collations with different collate and ctype values are not supported by ICU
Previous connection kept

Am I missing something?

BTW, psql does not know about collation_provider.

Best regards, Andrey Borodin.


Re: ICU for global collation

Peter Eisentraut-6
On 2019-08-21 08:56, Andrey Borodin wrote:
> postgres=# create database a template template0 collation_provider icu lc_collate 'en_US.utf8';
> CREATE DATABASE
> postgres=# \c a
> 2019-08-21 11:43:40.379 +05 [41509] FATAL:  collations with different collate and ctype values are not supported by ICU
> FATAL:  collations with different collate and ctype values are not supported by ICU

Try

create database a template template0 collation_provider icu locale
'en_US.utf8';

which sets both lc_collate and lc_ctype.  But 'en_US.utf8' is not a
valid ICU locale name.  Perhaps use 'en' or 'en-US'.

I'm making a note that we should prevent creating a database with a
faulty locale configuration in the first place instead of failing when
we're connecting.
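
As a rough sketch of what such an early check could look like, the following rejects the mismatched configuration at creation time rather than at connection time. The function name and parameters are hypothetical; only the error text is taken from the log output quoted above. It does not validate the ICU locale name itself.

```python
def check_database_locale(provider: str, lc_collate: str, lc_ctype: str) -> None:
    """Reject an ICU database whose collate and ctype settings differ.

    Hypothetical helper mirroring the connect-time failure shown above.
    """
    if provider == "icu" and lc_collate != lc_ctype:
        raise ValueError(
            "collations with different collate and ctype values "
            "are not supported by ICU"
        )


# Setting both values together (as "locale" does) passes the check.
check_database_locale("icu", "en_US.utf8", "en_US.utf8")

# Overriding only lc_collate while ctype is inherited as ru_RU is rejected.
try:
    check_database_locale("icu", "en_US.utf8", "ru_RU")
except ValueError as err:
    print(f"rejected: {err}")
```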

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: ICU for global collation

Andrey Borodin-2


> On 21 Aug 2019, at 12:23, Peter Eisentraut <[hidden email]> wrote:
>
> On 2019-08-21 08:56, Andrey Borodin wrote:
>> postgres=# create database a template template0 collation_provider icu lc_collate 'en_US.utf8';
>> CREATE DATABASE
>> postgres=# \c a
>> 2019-08-21 11:43:40.379 +05 [41509] FATAL:  collations with different collate and ctype values are not supported by ICU
>> FATAL:  collations with different collate and ctype values are not supported by ICU
>
> Try
>
> create database a template template0 collation_provider icu locale
> 'en_US.utf8';
>
> which sets both lc_collate and lc_ctype.  But 'en_US.utf8' is not a
> valid ICU locale name.  Perhaps use 'en' or 'en-US'.
>
> I'm making a note that we should prevent creating a database with a
> faulty locale configuration in the first place instead of failing when
> we're connecting.

Yes, the problem is that input with lc_collate is accepted:
postgres=# create database a template template0 collation_provider icu lc_collate 'en_US.utf8';
CREATE DATABASE
postgres=# \c a
2019-09-11 10:01:00.373 +05 [56878] FATAL:  collations with different collate and ctype values are not supported by ICU
FATAL:  collations with different collate and ctype values are not supported by ICU
Previous connection kept
postgres=# create database b template template0 collation_provider icu locale 'en_US.utf8';
CREATE DATABASE
postgres=# \c b
You are now connected to database "b" as user "x4mmm".

I get the same output with 'en' or 'en-US'.


Also, a cluster initialized with binaries built --with-icu started just fine on binaries built without ICU.
Only after some time did I get the message "ERROR:  ICU is not supported in this build".
Is this expected behavior? Maybe we should refuse to start without ICU?

Best regards, Andrey Borodin.