Can ICU be used for a database's default sort order?

classic Classic list List threaded Threaded
27 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Can ICU be used for a database's default sort order?

Tom Lane-2
I tried to arrange $subject via

create database icu encoding 'utf8' lc_ctype "en-US-x-icu" lc_collate "en-US-x-icu" template template0;

and got only

ERROR:  invalid locale name: "en-US-x-icu"

which is unsurprising after looking into the code, because createdb()
checks those parameters with check_locale() which only knows about
libc-defined locale names.

Is there some way I'm missing, or is this just a not-done-yet feature?

                        regards, tom lane


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Can ICU be used for a database's default sort order?

Peter Geoghegan-4
On Thu, Jun 22, 2017 at 7:10 PM, Tom Lane <[hidden email]> wrote:
> Is there some way I'm missing, or is this just a not-done-yet feature?

It's a not-done-yet feature.


--
Peter Geoghegan


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Can ICU be used for a database's default sort order?

Peter Eisentraut-6
On 6/22/17 23:10, Peter Geoghegan wrote:
> On Thu, Jun 22, 2017 at 7:10 PM, Tom Lane <[hidden email]> wrote:
>> Is there some way I'm missing, or is this just a not-done-yet feature?
>
> It's a not-done-yet feature.

It's something I hope to address soon.

The main definitional challenge is how to associate a pg_database entry
with a collation.

What we currently effectively do is duplicate the fields of pg_collation
in pg_database.  But I imagine over time we'll add more properties in
pg_collation, along with additional ALTER COLLATION commands etc., so
duplicating all of that would be a significant amount of code
complication and result in a puzzling user interface.

Ideally, I'd like to see CREATE DATABASE ... COLLATION "foo".  But the
problem is of course that collations are per-database objects.  Possible
solutions:

1) Associate by name only.  That is, you can create a database with any
COLLATION "foo" that you want, and it's only checked when you first
connect to or do anything in the database.

2) Create shared collations.  Then we'd need a way to manage having a
mix of shared and non-shared collations around.

There are significant pros and cons to all of these ideas.  Some people
I talked to appeared to prefer the shared collations approach.

Other ideas?

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Can ICU be used for a database's default sort order?

Peter Geoghegan-4
On Fri, Jun 23, 2017 at 11:32 AM, Peter Eisentraut
<[hidden email]> wrote:
> It's something I hope to address soon.

I hope you do. I think that we'd realize significant benefits by
having ICU become the defacto standard collation provider, that most
users get without even realizing it. As things stand, you have to make
a point of specifying an ICU collation as your per-column collation
within every CREATE TABLE. That's a significant barrier to adoption.

> 1) Associate by name only.  That is, you can create a database with any
> COLLATION "foo" that you want, and it's only checked when you first
> connect to or do anything in the database.
>
> 2) Create shared collations.  Then we'd need a way to manage having a
> mix of shared and non-shared collations around.
>
> There are significant pros and cons to all of these ideas.  Some people
> I talked to appeared to prefer the shared collations approach.

I strongly prefer the second approach. The only downside that occurs
to me is that that approach requires more code. Is there something
that I've missed?

--
Peter Geoghegan


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Can ICU be used for a database's default sort order?

Tom Lane-2
Peter Geoghegan <[hidden email]> writes:

> On Fri, Jun 23, 2017 at 11:32 AM, Peter Eisentraut
> <[hidden email]> wrote:
>> 1) Associate by name only.  That is, you can create a database with any
>> COLLATION "foo" that you want, and it's only checked when you first
>> connect to or do anything in the database.
>>
>> 2) Create shared collations.  Then we'd need a way to manage having a
>> mix of shared and non-shared collations around.
>>
>> There are significant pros and cons to all of these ideas.  Some people
>> I talked to appeared to prefer the shared collations approach.

> I strongly prefer the second approach. The only downside that occurs
> to me is that that approach requires more code. Is there something
> that I've missed?

I'm not very clear on how you'd bootstrap template1 into anything
other than C locale in the second approach.  With our existing
libc-based stuff, it's possible to define what the database's locale
is before there are any catalogs.  It's not apparent how to do that with
a collation-based solution.

In my mind, collations are just a SQL-syntax wrapper for locales that
are really defined one level down.  I think we'd be well advised to
carry that same approach into the database properties, because otherwise
we have circularities to deal with. So I'm imagining something more like

create database encoding 'utf8' lc_collate 'icu-en_US' lc_ctype ...

where lc_collate is just a string that we know how to interpret, the
same as now.

We could optionally reduce the amount of notation involved by merging the
lc_collate and lc_ctype parameters into one, say

create database encoding 'utf8' locale 'icu-en_US' ...

I'm not too clear on how this would play with other libc locale
functionality (lc_monetary and so on), but we'd have to deal with
that question anyway.

                        regards, tom lane


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Can ICU be used for a database's default sort order?

d0uble
In reply to this post by Peter Eisentraut-6
Hi.

23 июня 2017 г., в 21:32, Peter Eisentraut <[hidden email]> написал(а):

On 6/22/17 23:10, Peter Geoghegan wrote:
On Thu, Jun 22, 2017 at 7:10 PM, Tom Lane <[hidden email]> wrote:
Is there some way I'm missing, or is this just a not-done-yet feature?

It's a not-done-yet feature.

It's something I hope to address soon.

Will it work only for a particular database? Or for a whole cluster during initdb also? Any chance to get this done in 11?

--
May the force be with you…
https://simply.name

Reply | Threaded
Open this post in threaded view
|

Re: Can ICU be used for a database's default sort order?

Peter Eisentraut-6
On 1/31/18 11:48, Vladimir Borodin wrote:
> Will it work only for a particular database? Or for a whole cluster
> during initdb also? Any chance to get this done in 11?

I'm currently not working on it.

It's basically just a lot of leg work, and you need to come up with a
catalog representation.  Possible options have already been addressed in
earlier threads.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Can ICU be used for a database's default sort order?

Andrey Borodin-2
Hello!

> 1 февр. 2018 г., в 19:09, Peter Eisentraut <[hidden email]> написал(а):
>
> On 1/31/18 11:48, Vladimir Borodin wrote:
>> Will it work only for a particular database? Or for a whole cluster
>> during initdb also? Any chance to get this done in 11?
>
> I'm currently not working on it.
>
> It's basically just a lot of leg work, and you need to come up with a
> catalog representation.  Possible options have already been addressed in
> earlier threads.


I can try to do this before next CF. But ISTM that EDB and Postgres Pro already have various flavours of similar feature. Maybe they are planning to publish that?

Best regards, Andrey Borodin.
Reply | Threaded
Open this post in threaded view
|

Re: Can ICU be used for a database's default sort order?

Peter Geoghegan-4
On Fri, Feb 2, 2018 at 4:22 AM, Andrey Borodin <[hidden email]> wrote:
> I can try to do this before next CF. But ISTM that EDB and Postgres Pro already have various flavours of similar feature. Maybe they are planning to publish that?

I would definitely review that patch.


--
Peter Geoghegan

Reply | Threaded
Open this post in threaded view
|

Re: Can ICU be used for a database's default sort order?

Andrey Borodin-2
Hi!

> 2 февр. 2018 г., в 21:14, Peter Geoghegan <[hidden email]> написал(а):
>
> On Fri, Feb 2, 2018 at 4:22 AM, Andrey Borodin <[hidden email]> wrote:
>> I can try to do this before next CF. But ISTM that EDB and Postgres Pro already have various flavours of similar feature. Maybe they are planning to publish that?
>
> I would definitely review that patch.

I've contacted Postgres Professional. Marina Polyakova had kindly provided their patch.
The patch allows to use libc locale with ICU collation as default for cluster or database.

It seems that this patch brings important long-awaited feature and deserves to be included in last v11 commitfest.
Peter, everyone, do you agree with this? Or should we better adapt this work through v12 cycle?

I'm planning to provide review asap and do necessary changes if required (this was discussed with Marina and Postgres Professional).

Best regards, Andrey Borodin.

0001-ICU-as-default-collation-provider.txt (204K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Can ICU be used for a database's default sort order?

Andrey Borodin-2
Hi everyone!

> 10 февр. 2018 г., в 20:45, Andrey Borodin <[hidden email]> написал(а):
>
> I'm planning to provide review
>

So, I was looking into the patch.
The patch adds:
1. Ability to specify collation provider (with version) in --locale for initdb and createdb.
2. Changes to locale checks
3. Sets ICU as default collation provider. For example "ru_RU@icu.153.80.32.1" is default on my machine with patch
4. Tests and necessary changes to documentation

With patch I get correct ICU ordering by default
postgres=# select unnest(array['е','ё','ж']) order by 1;
 unnest
--------
 е
 ё
 ж
(3 rows)

While libc locale provides incorrect order (I also get same ordering by default without patch)

postgres=# select c from unnest(array['е','ё','ж']) c order by c collate "ru_RU";
 c
---
 е
 ж
 ё
(3 rows)


Unfortunately, neither "ru_RU@icu.153.80.32.1" (exposed by LC_COLLATE and other places) nor "ru_RU@icu" cannot be used by collate SQL clause.
Also, patch removes compatibility with MSVC 1800 (Visual Studio 2013) on Windows XP and Windows Server 2003. This is done to use newer locale-related functions in VS2013 build.

If the database was initialized with default locale without this patch, one cannot connect to it anymore
psql: FATAL:  could not find out the collation provider for datcollate "ru_RU.UTF-8" of database "postgres"
This problem is mentioned in commit message of the patch. I think that this problem should be addressed somehow.
What do you think?

Overall patch looks solid and thoughtful work and adds important functionality.

Best regards, Andrey Borodin.
Reply | Threaded
Open this post in threaded view
|

Re: Can ICU be used for a database's default sort order?

Michail Nikolaev
Hello.

Just want to inform:
I have run check,installcheck,plcheck,contribcheck,modulescheck,ecpgcheck,isolationcheck,upgradecheck tests on Windows 10, VC2017 with patch applied on top of 2a41507dab0f293ff241fe8ae326065998668af8 as Andrey asked me.

Everything is passing with and without $config->{icu} = 'D:\Dev\postgres\icu\';

Best regards, 
Michail.


пт, 16 февр. 2018 г. в 11:13, Andrey Borodin <[hidden email]>:
Hi everyone!

> 10 февр. 2018 г., в 20:45, Andrey Borodin <[hidden email]> написал(а):
>
> I'm planning to provide review
>

So, I was looking into the patch.
The patch adds:
1. Ability to specify collation provider (with version) in --locale for initdb and createdb.
2. Changes to locale checks
3. Sets ICU as default collation provider. For example "ru_RU@icu.153.80.32.1" is default on my machine with patch
4. Tests and necessary changes to documentation

With patch I get correct ICU ordering by default
postgres=# select unnest(array['е','ё','ж']) order by 1;
 unnest
--------
 е
 ё
 ж
(3 rows)

While libc locale provides incorrect order (I also get same ordering by default without patch)

postgres=# select c from unnest(array['е','ё','ж']) c order by c collate "ru_RU";
 c
---
 е
 ж
 ё
(3 rows)


Unfortunately, neither "ru_RU@icu.153.80.32.1" (exposed by LC_COLLATE and other places) nor "ru_RU@icu" cannot be used by collate SQL clause.
Also, patch removes compatibility with MSVC 1800 (Visual Studio 2013) on Windows XP and Windows Server 2003. This is done to use newer locale-related functions in VS2013 build.

If the database was initialized with default locale without this patch, one cannot connect to it anymore
psql: FATAL:  could not find out the collation provider for datcollate "ru_RU.UTF-8" of database "postgres"
This problem is mentioned in commit message of the patch. I think that this problem should be addressed somehow.
What do you think?

Overall patch looks solid and thoughtful work and adds important functionality.

Best regards, Andrey Borodin.
Reply | Threaded
Open this post in threaded view
|

Re: Can ICU be used for a database's default sort order?

Andres Freund
In reply to this post by Andrey Borodin-2
Hi,

On 2018-02-10 20:45:40 +0500, Andrey Borodin wrote:
> I've contacted Postgres Professional. Marina Polyakova had kindly provided their patch.
> The patch allows to use libc locale with ICU collation as default for cluster or database.
>
> It seems that this patch brings important long-awaited feature and deserves to be included in last v11 commitfest.
> Peter, everyone, do you agree with this? Or should we better adapt this work through v12 cycle?
>
> I'm planning to provide review asap and do necessary changes if required (this was discussed with Marina and Postgres Professional).

This patch was submitted for the last v11 commitfest, it's not a trivial
patch, and hasn't yet been reviewed. I'm afraid the policy is that large
patches shouldn't be submitted for the last commitfest...  Thus I think
this should be moved to the next one.

Greetings,

Andres Freund

Reply | Threaded
Open this post in threaded view
|

Re: Re: [HACKERS] Can ICU be used for a database's default sort order?

David Steele
On 3/2/18 1:14 AM, Andres Freund wrote:

>
> On 2018-02-10 20:45:40 +0500, Andrey Borodin wrote:
>> I've contacted Postgres Professional. Marina Polyakova had kindly provided their patch.
>> The patch allows to use libc locale with ICU collation as default for cluster or database.
>>
>> It seems that this patch brings important long-awaited feature and deserves to be included in last v11 commitfest.
>> Peter, everyone, do you agree with this? Or should we better adapt this work through v12 cycle?
>>
>> I'm planning to provide review asap and do necessary changes if required (this was discussed with Marina and Postgres Professional).
>
> This patch was submitted for the last v11 commitfest, it's not a trivial
> patch, and hasn't yet been reviewed. I'm afraid the policy is that large
> patches shouldn't be submitted for the last commitfest...  Thus I think
> this should be moved to the next one.

This patch has been moved to the next CF.

Regards,
--
-David
[hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Can ICU be used for a database's default sort order?

Daniel Verite
In reply to this post by Andrey Borodin-2
        Andrey Borodin wrote:

> Overall patch looks solid and thoughtful work and adds important
> functionality.

I tried the patch, with some minor changes to build with HEAD.

I was surprised by the interface, that is, the fact that a user is
not allowed to freely choose the ICU collation of a database, in
constrast with CREATE COLLATION.

AFAIU, when the "default collation provider" is ICU, CREATE DATABASE
still expects a libc locale in the lc_collate/lc_ctype arguments.
The code will automatically find an ICU equivalent by matching the
language, and it seems that the country is ignored?

So if we wanted a database with an ICU collation like, say,
"es@collation=traditional" or "es-u-co-trad" as expressed with a
BCP-47 tag, or anything that is not defined by only a language,
would it be possible? I have the impression it wouldn't.

This is not something that could be easily improved after the fact
because getting the ICU collation through a libc collation is a
user-interface choice.

I think users would rather be able to create a database with
something like:

CREATE DATABASE foo
  COLLPROVIDER='icu'
  LOCALE='icu_locale' |
      LC_COLLATE='icu_locale' | LC_CTYPE='icu_locale'
  ...
which would be in line with CREATE COLLATION.


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

Reply | Threaded
Open this post in threaded view
|

Re: Can ICU be used for a database's default sort order?

Daniel Verite
        I wrote:

> I tried the patch, with some minor changes to build with HEAD.

PFA a rebased version.
(for some reason http://cfbot.cputube.org/ did not pick up the initial patch;
there's an entry for it but the rightmost column is empty,
no link to patch and no green/red success/failure icons.
It seems it's the only entry in this state).


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

ICU-as-default-collation-provider-rebased-jul-2018.patch (254K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Can ICU be used for a database's default sort order?

Thomas Munro-3
In reply to this post by Peter Geoghegan-4
On Sat, Jun 24, 2017 at 10:55 AM Peter Geoghegan <[hidden email]> wrote:

> On Fri, Jun 23, 2017 at 11:32 AM, Peter Eisentraut
> <[hidden email]> wrote:
> > It's something I hope to address soon.
>
> I hope you do. I think that we'd realize significant benefits by
> having ICU become the defacto standard collation provider, that most
> users get without even realizing it. As things stand, you have to make
> a point of specifying an ICU collation as your per-column collation
> within every CREATE TABLE. That's a significant barrier to adoption.
>
> > 1) Associate by name only.  That is, you can create a database with any
> > COLLATION "foo" that you want, and it's only checked when you first
> > connect to or do anything in the database.
> >
> > 2) Create shared collations.  Then we'd need a way to manage having a
> > mix of shared and non-shared collations around.
> >
> > There are significant pros and cons to all of these ideas.  Some people
> > I talked to appeared to prefer the shared collations approach.
>
> I strongly prefer the second approach. The only downside that occurs
> to me is that that approach requires more code. Is there something
> that I've missed?

Sorry to join this thread late.  I was redirected here from another one[1].

I like the shared catalog idea, but here's one objection I thought
about: it makes it a bit harder to track whether you've sorted out all
your indexes after a version change.  Say collation fr_CA's version
changes according to the provider, so that it no longer matches the
stored collversion.  Now you'll need to be careful to connect to every
database in the cluster and run REINDEX, before you run ALTER
COLLATION "fr_CA" REFRESH VERSION to update the single shared
pg_collation row's collversion.  With the non-shared pg_collation
scheme we have currently, you'd need to refresh the collation row in
each database after reindexing the whole database, which is IMHO a bit
nicer (you track which databases you've dealt with as you go through
them).

In other words, using a shared catalog moves the "scope" of the
version tracking even further away from the ideal scope, and requires
humans to actually get the cleanup right, and it's extra confusing
because you can only be connected to one database at a time so there
is no "REINDEX MY CLUSTER" and no possibility of making a command that
reindexes dependent indexes and then refreshes the collation version.

The ideal scope would be to track all referenced collation versions on
every index, and only update them at CREATE INDEX or REINDEX time
(also, as discussed in some other thread, CHECK constraints and
partition keys might be invalidated and should in theory also carry
versions that can only be updated by running a hypothetical RECHECK or
REPARTITION command).  Then a shared pg_collation catalog would make
perfect sense, and there would be no need for it to have a collversion
column at all, or an ALTER COLLATION ... REFRESH VERSION command, and
therefore there would be no way to screw it up by REFRESHing the
VERSION without having really fixed the problem.

[1] https://www.postgresql.org/message-id/242e081c-aec8-a20a-510c-f4d0f183cebd%402ndquadrant.com

--
Thomas Munro
http://www.enterprisedb.com

Reply | Threaded
Open this post in threaded view
|

Re: Can ICU be used for a database's default sort order?

Michael Paquier-2
On Wed, Sep 12, 2018 at 01:06:12PM +1200, Thomas Munro wrote:

> The ideal scope would be to track all referenced collation versions on
> every index, and only update them at CREATE INDEX or REINDEX time
> (also, as discussed in some other thread, CHECK constraints and
> partition keys might be invalidated and should in theory also carry
> versions that can only be updated by running a hypothetical RECHECK or
> REPARTITION command).  Then a shared pg_collation catalog would make
> perfect sense, and there would be no need for it to have a collversion
> column at all, or an ALTER COLLATION ... REFRESH VERSION command, and
> therefore there would be no way to screw it up by REFRESHing the
> VERSION without having really fixed the problem.
Please note that the latest patch set does not apply, so this has been
switched to commit fest 2018-11, waiting on author for a rebase.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Can ICU be used for a database's default sort order?

Andrey Borodin-2
Hi!

> 2 окт. 2018 г., в 11:37, Michael Paquier <[hidden email]> написал(а):
>
> Please note that the latest patch set does not apply, so this has been
> switched to commit fest 2018-11, waiting on author for a rebase.

PFA rebased version. I've added LDFLAGS_INTERNAL += $(ICU_LIBS) in libpq, but I'm not entirely sure this is correct way to deal with complaints on ICU functions from libpq linking.


Best regards, Andrey Borodin.

0001-ICU-as-a-default-collation-provider-rebased-oct-2018.patch (278K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Can ICU be used for a database's default sort order?

Dmitry Dolgov
> On Tue, Oct 30, 2018 at 9:07 AM Andrey Borodin <[hidden email]> wrote:
>
> Hi!
>
> > 2 окт. 2018 г., в 11:37, Michael Paquier <[hidden email]> написал(а):
> >
> > Please note that the latest patch set does not apply, so this has been
> > switched to commit fest 2018-11, waiting on author for a rebase.
>
> PFA rebased version. I've added LDFLAGS_INTERNAL += $(ICU_LIBS) in libpq,

Thanks for providing the rebased version.

> but I'm not entirely sure this is correct way to deal with complaints on ICU
> functions from libpq linking.

Well, it was enough on my own Gentoo, where patch actually compiles without
errors and pass "make check". But for cfbot it doesn't compile, I'm not sure
why.

../../../src/interfaces/libpq/libpq.so: undefined reference to `ucol_open_52'
../../../src/interfaces/libpq/libpq.so: undefined reference to
`uloc_toLanguageTag_52'
../../../src/interfaces/libpq/libpq.so: undefined reference to `u_errorName_52'
../../../src/interfaces/libpq/libpq.so: undefined reference to
`ucol_getVersion_52'
../../../src/interfaces/libpq/libpq.so: undefined reference to
`u_versionToString_52'
../../../src/interfaces/libpq/libpq.so: undefined reference to
`uloc_setDefault_52'
../../../src/interfaces/libpq/libpq.so: undefined reference to
`uloc_getDefault_52'
../../../src/interfaces/libpq/libpq.so: undefined reference to `ucol_close_52'

As a side note, I'm a bit confused, who is the original author of the proposed
patch? If it's Marina, why she isn't involved in the discussion or even
mentioned in the patch itself?

12