ICU integration


Re: ICU integration

Craig Ringer-3
On 9 September 2016 at 00:19, Peter Eisentraut
<[hidden email]> wrote:

> On 9/8/16 11:16 AM, Tom Lane wrote:
>> This is a problem, if ICU won't guarantee cross-version compatibility,
>> because it destroys the argument that moving to ICU would offer us
>> collation behavior stability.
>
> It would offer a significant upgrade over the current situation.
>
> First, it offers stability inside the same version.  Whereas glibc might
> change a collation in a minor upgrade, ICU won't do that.  And the
> postgres binary is bound to a major version of ICU by the soname (which
> changes with every major release).  So this would avoid the situation
> that a simple OS update could break collations.

It also lets *users* and PostgreSQL-specific distributors bundle ICU
and get stable collations. We can't exactly bundle glibc.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: ICU integration

Tatsuo Ishii-3
> On 9 September 2016 at 00:19, Peter Eisentraut
> <[hidden email]> wrote:
>> On 9/8/16 11:16 AM, Tom Lane wrote:
>>> This is a problem, if ICU won't guarantee cross-version compatibility,
>>> because it destroys the argument that moving to ICU would offer us
>>> collation behavior stability.
>>
>> It would offer a significant upgrade over the current situation.
>>
>> First, it offers stability inside the same version.  Whereas glibc might
>> change a collation in a minor upgrade, ICU won't do that.  And the
>> postgres binary is bound to a major version of ICU by the soname (which
>> changes with every major release).  So this would avoid the situation
>> that a simple OS update could break collations.
>
> It also lets *users* and PostgreSQL-specific distributors bundle ICU
> and get stable collations. We can't exactly bundle glibc.

I would like to know the fate of community RPMs because if PostgreSQL
does not include ICU source, the effort of integrating ICU is totally
up to packagers.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: ICU integration

Craig Ringer-3
On 9 September 2016 at 08:51, Tatsuo Ishii <[hidden email]> wrote:

>> On 9 September 2016 at 00:19, Peter Eisentraut
>> <[hidden email]> wrote:
>>> On 9/8/16 11:16 AM, Tom Lane wrote:
>>>> This is a problem, if ICU won't guarantee cross-version compatibility,
>>>> because it destroys the argument that moving to ICU would offer us
>>>> collation behavior stability.
>>>
>>> It would offer a significant upgrade over the current situation.
>>>
>>> First, it offers stability inside the same version.  Whereas glibc might
>>> change a collation in a minor upgrade, ICU won't do that.  And the
>>> postgres binary is bound to a major version of ICU by the soname (which
>>> changes with every major release).  So this would avoid the situation
>>> that a simple OS update could break collations.
>>
>> It also lets *users* and PostgreSQL-specific distributors bundle ICU
>> and get stable collations. We can't exactly bundle glibc.
>
> I would like to know the fate of community RPMs because if PostgreSQL
> does not include ICU source, the effort of integrating ICU is totally
> up to packagers.

CC'ing Devrim.

Personally I would be pretty reluctant to package libicu when it's
already in RH/Fedora. If it were bundled in Pg's source tree and a
private copy was built/installed by the build system so it was part of
the main postgresql-server package that'd be different. I can't really
imagine that being acceptable for upstream development though.

RH and Debian would instantly rip it out and replace it with their
packaged ICU anyway.

Pity ICU doesn't offer versioned collations within a single install.
Though I can understand why they don't.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: ICU integration

Peter Geoghegan-3
On Thu, Sep 8, 2016 at 6:48 PM, Craig Ringer <[hidden email]> wrote:
> Pity ICU doesn't offer versioned collations within a single install.
> Though I can understand why they don't.

There are two separate issues with collator versioning. ICU can
probably be used in a way that clearly decouples these two issues,
which is very important. The first is that the rules of collations
change. The second is that the binary key that collators create (i.e.
the equivalent of strxfrm()) can change for various reasons that have
nothing to do with culture or natural languages -- purely technical
reasons. For example, they can add new optimizations to make
generating new binary keys faster. If there are bugs in how that
works, they can fix the bugs and increment the identifier [1], which
could allow Postgres to insist on a REINDEX (if abbreviated keys for
collated text were reenabled, although I don't think that problems
like that are limited to binary key generation).

So, to bring it back to that little program I wrote:

$ ./icu-coll-versions | head
Collator                                          | ICU Version | UCA Version
-----------------------------------------------------------------------------
Afrikaans                                         | 99-38-00-00 | 07-00-00-00
Afrikaans (Namibia)                               | 99-38-00-00 | 07-00-00-00
Afrikaans (South Africa)                          | 99-38-00-00 | 07-00-00-00
Aghem                                             | 99-38-00-00 | 07-00-00-00
Aghem (Cameroon)                                  | 99-38-00-00 | 07-00-00-00
Akan                                              | 99-38-00-00 | 07-00-00-00
Akan (Ghana)                                      | 99-38-00-00 | 07-00-00-00
Amharic                                           | 99-38-00-00 | 07-00-00-00

Here, what appears as "ICU version" has the identifier [1] baked in,
although this is undocumented (it also has any "custom tailorings"
that might be used, say if we had user defined customizations to
collations, as Firebird apparently does [2] [3]). I'm pretty sure that
UCA version relates to a version of the Unicode collation algorithm,
and its associated DUCET table (this is all subject to ISO
standardization). I gather that a particular collation is actually
comprised of a base UCA version (and DUCET table -- I think that ICU
sometimes calls this the "root"), with custom tailorings that a locale
provides for a given culture or country. These collators may in turn
be further "tailored" to get that fancy user defined customization
stuff.
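
(For reference, a listing like the one above can be produced with just the
documented ICU C API; here is a minimal sketch -- not the actual
icu-coll-versions program -- which prints raw locale IDs rather than display
names:)

#include <stdio.h>
#include <unicode/ucol.h>
#include <unicode/uversion.h>

int
main(void)
{
    int32_t     i;
    int32_t     n = ucol_countAvailable();

    printf("%-40s | %-11s | %s\n", "Collator", "ICU Version", "UCA Version");
    for (i = 0; i < n; i++)
    {
        const char *loc = ucol_getAvailable(i);
        UErrorCode  status = U_ZERO_ERROR;
        UCollator  *coll = ucol_open(loc, &status);
        UVersionInfo collver;
        UVersionInfo ucaver;
        char        collstr[U_MAX_VERSION_STRING_LENGTH];
        char        ucastr[U_MAX_VERSION_STRING_LENGTH];

        if (U_FAILURE(status))
            continue;

        ucol_getVersion(coll, collver);     /* collator version: rules plus relevant code */
        ucol_getUCAVersion(coll, ucaver);   /* underlying UCA (DUCET) version */
        u_versionToString(collver, collstr);
        u_versionToString(ucaver, ucastr);
        printf("%-40s | %-11s | %s\n", loc, collstr, ucastr);
        ucol_close(coll);
    }
    return 0;
}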

In principle, and assuming I haven't gotten something wrong, it ought
to be possible to unambiguously identify a collation based on a
matching UCA version (i.e. DUCET table), plus the collation tailorings
matching exactly, even across ICU versions that happen to be based on
the same UCA version (they only seem to put out a new UCA version
about once a year [4]).  It *might* be fine, practically speaking, to
assume that a collation with a matching iso-code and UCA version is
compatible forever and always across any ICU version. If not, it might
instead be feasible to write a custom fingerprinter for collation
tailorings that ran at initdb time. Maybe the tailorings, which are
abstract rules, could even be stored in system catalogs, so the only
thing that would need to match is ICU's UCA version (the "root" collators
must still match); replicas could then reconstruct the serialized
tailorings that comprise a collation as needed [5][6], since the
tailoring that a default collator for a locale uses isn't special,
technically speaking.

Of course, this is all pretty hand-wavy right now, and much more
research is needed. I am very intrigued by the idea of storing the
collators in the system catalogs wholesale, since ICU provides
facilities that make that seem possible. If a "serialized unicode set"
built from a collator's tailoring rules, or, alternatively, a collator
saved as a binary representation [7] were stored in the system
catalogs, perhaps it wouldn't matter as much that the stuff
distributed with different ICU versions didn't match, at least in
theory. It's unclear whether the system catalog representation would be
usable across a fair cross-section of ICU versions, but if it were, that
would be perfect. This also seems to be how Firebird-style
user-defined tailorings might be implemented anyway, and it seems very
appealing to add that as a light layer on top of how the base system
works, if at all possible.
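
For what it's worth, the calls involved look roughly like this (a sketch
only: error handling is elided, the preflighting convention is assumed to
apply to ucol_cloneBinary(), and whether such a binary image really is usable
across ICU versions is exactly the open question, not something this settles):

#include <stdlib.h>
#include <unicode/ucol.h>

/*
 * Sketch: flatten a collator to a binary image that could be stored in a
 * system catalog, and reconstitute it later with ucol_openBinary().
 */
static uint8_t *
serialize_collator(const UCollator *coll, int32_t *len)
{
    UErrorCode  status = U_ZERO_ERROR;
    int32_t     needed = ucol_cloneBinary(coll, NULL, 0, &status);
    uint8_t    *buf = malloc(needed);

    status = U_ZERO_ERROR;      /* the preflight call reports a buffer-overflow error */
    *len = ucol_cloneBinary(coll, buf, needed, &status);
    return U_SUCCESS(status) ? buf : NULL;
}

static UCollator *
deserialize_collator(const uint8_t *image, int32_t len, const UCollator *root)
{
    UErrorCode  status = U_ZERO_ERROR;

    /* "root" must be the base collator the image was tailored from */
    return ucol_openBinary(image, len, root, &status);
}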

[1] https://github.com/svn2github/libicu/blob/c43ec130ea0ee6cd565d87d70088e1d70d892f32/common/unicode/uvernum.h#L149
[2] http://www.firebirdsql.org/refdocs/langrefupd25-ddl-collation.html
[3] http://userguide.icu-project.org/collation/customization#TOC-Building-on-Existing-Locales
[4] http://unicode.org/reports/tr10/#Synch_14651_Table
[5] https://ssl.icu-project.org/apiref/icu4c/ucol_8h.html#a1982f184bca8adaa848144a1959ff235
[6] https://ssl.icu-project.org/apiref/icu4c/structUSerializedSet.html
[7] https://ssl.icu-project.org/apiref/icu4c/ucol_8h.html#a2719995a75ebed7aacc1419bb2b781db
--
Peter Geoghegan



Re: ICU integration

Devrim GÜNDÜZ
In reply to this post by Craig Ringer-3

Hi,

On Fri, 2016-09-09 at 09:48 +0800, Craig Ringer wrote:
> Personally I would be pretty reluctant to package libicu when it's
> already in RH/Fedora. If it were bundled in Pg's source tree and a
> private copy was built/installed by the build system so it was part of
> the main postgresql-server package that'd be different.

Agreed. I did not read the whole thread (yet), but if this is something like
tzdata, I would personally want to use the libicu supplied by the OS, like we do
for tzdata.

(That said, I just checked EDB's ICU support. We currently ship our own libicu
there, as a part of EPAS, but I don't know the reasoning/history behind that.)

Regards,
--
Devrim GÜNDÜZ
EnterpriseDB: http://www.enterprisedb.com
PostgreSQL Danışmanı/Consultant, Red Hat Certified Engineer
Twitter: @DevrimGunduz , @DevrimGunduzTR


Re: ICU integration

Magnus Hagander-2
In reply to this post by Peter Eisentraut-6
On Thu, Sep 8, 2016 at 6:19 PM, Peter Eisentraut <[hidden email]> wrote:
> On 9/8/16 11:16 AM, Tom Lane wrote:
>> This is a problem, if ICU won't guarantee cross-version compatibility,
>> because it destroys the argument that moving to ICU would offer us
>> collation behavior stability.
>
> It would offer a significant upgrade over the current situation.
>
> First, it offers stability inside the same version.  Whereas glibc might
> change a collation in a minor upgrade, ICU won't do that.  And the
> postgres binary is bound to a major version of ICU by the soname (which
> changes with every major release).  So this would avoid the situation
> that a simple OS update could break collations.
>
> Second, it offers a way to detect that something has changed.  With
> glibc, you don't know anything unless you read the source diffs.  With
> ICU, you can compare the collation version before and after and at least
> tell the user that they need to refresh indexes or whatever.


+1 on the importance of this last part.

We may not be able to handle it directly, but just being able to point out to the user that "this index is incorrect, you have to reindex" and then refuse to use the index until that has been done would be a *huge* improvement.  And it would definitely help solve an existing real-world problem, which is what can happen when you restore a physical backup onto a different version of an operating system at least. 

Sure, it would be even better if we could automatically *deal* with it. But failing in a loud and obvious way is a *lot* better than silently returning incorrect data...
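
For what it's worth, the check itself looks cheap. A minimal sketch
(hypothetical helper, not code from the actual patch) of comparing a recorded
collator version against what the currently linked ICU reports:

#include <stdbool.h>
#include <string.h>
#include <unicode/ucol.h>
#include <unicode/uversion.h>

/*
 * Sketch: compare the collator version recorded when an index was built
 * against the version reported by the ICU library we are running with now.
 * The caller would warn, or refuse to use the index, on a mismatch.
 */
static bool
collator_version_matches(const UCollator *coll, const UVersionInfo recorded)
{
    UVersionInfo current;

    ucol_getVersion(coll, current);
    return memcmp(current, recorded, sizeof(UVersionInfo)) == 0;
}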


Re: ICU integration

Craig Ringer-3
On 9 September 2016 at 16:21, Magnus Hagander <[hidden email]> wrote:

> On Thu, Sep 8, 2016 at 6:19 PM, Peter Eisentraut
> <[hidden email]> wrote:
>>
>> On 9/8/16 11:16 AM, Tom Lane wrote:
>> > This is a problem, if ICU won't guarantee cross-version compatibility,
>> > because it destroys the argument that moving to ICU would offer us
>> > collation behavior stability.
>>
>> It would offer a significant upgrade over the current situation.
>>
>> First, it offers stability inside the same version.  Whereas glibc might
>> change a collation in a minor upgrade, ICU won't do that.  And the
>> postgres binary is bound to a major version of ICU by the soname (which
>> changes with every major release).  So this would avoid the situation
>> that a simple OS update could break collations.
>>
>> Second, it offers a way to detect that something has changed.  With
>> glibc, you don't know anything unless you read the source diffs.  With
>> ICU, you can compare the collation version before and after and at least
>> tell the user that they need to refresh indexes or whatever.
>>
>
> +1 on the importance of this last part.
>
> We may not be able to handle it directly, but just being able to point out
> to the user that "this index is incorrect, you have to reindex" and then
> refuse to use the index until that has been done would be a *huge*
> improvement.  And it would definitely help solve an existing real-world
> problem, which is what can happen when you restore a physical backup onto a
> different version of an operating system at least.
>
> Sure, it would be even better if we could automatically *deal* with it. But
> failing in a loud and obvious way is a *lot* better than silently returning
> incorrect data...

Yep, I strongly agree. That's part of why I think this is well worth
doing even though it doesn't look like it'll give us a full solution
for stable collations.

It's likely also a step or three toward case-insensitive
locales/collations, which I'm sure many people would like. Though far
from the whole picture.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: ICU integration

Peter Eisentraut-6
In reply to this post by Peter Geoghegan-3
On 9/8/16 11:08 PM, Peter Geoghegan wrote:

> In principle, and assuming I haven't gotten something wrong, it ought
> to be possible to unambiguously identify a collation based on a
> matching UCA version (i.e. DUCET table), plus the collation tailorings
> matching exactly, even across ICU versions that happen to be based on
> the same UCA version (they only seem to put out a new UCA version
> about once a year [4]).  It *might* be fine, practically speaking, to
> assume that a collation with a matching iso-code and UCA version is
> compatible forever and always across any ICU version. If not, it might
> instead be feasible to write a custom fingerprinter for collation
> tailorings that ran at initdb time.

The documentation [0] states that the collation version covers a broad
range of things.  So I don't think these additional mechanisms you
describe are necessary.

[0]: http://userguide.icu-project.org/collation/architecture#TOC-Versioning

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: ICU integration

Peter Eisentraut-6
In reply to this post by Peter Geoghegan-3
On 9/8/16 8:07 PM, Peter Geoghegan wrote:
> Not exactly. Peter E. didn't seem to be aware that there is an ICU
> collator versioning concept (perhaps I misunderstood, though).

You appear to have missed the part about the collversion column that my
patch adds.  That's exactly the collator version of ICU.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: ICU integration

Dave Page-7
In reply to this post by Devrim GÜNDÜZ
On Fri, Sep 9, 2016 at 9:02 AM, Devrim Gündüz <[hidden email]> wrote:

>
> Hi,
>
> On Fri, 2016-09-09 at 09:48 +0800, Craig Ringer wrote:
>> Personally I would be pretty reluctant to package libicu when it's
>> already in RH/Fedora. If it were bundled in Pg's source tree and a
>> private copy was built/installed by the build system so it was part of
>> the main postgresql-server package that'd be different.
>
> Agreed. I did not read the whole thread (yet), but if this is something like
> tzdata, I would personally want to use the libuci supplied by OS, like we do
> for tzdata.
>
> (That said, just checked EDB's ICU support. We currently ship our own libicu
> there, as a part of EPAS, but I don't know the reasoning/history behind there.)

We needed a specific version that was newer than that shipped with
RHEL 6 (or in EPEL) iirc.

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: ICU integration

Tom Lane-2
Dave Page <[hidden email]> writes:
> We needed a specific version that was newer than that shipped with
> RHEL 6 (or in EPEL) iirc.

Sure hope that's not true of the currently-proposed patch :-(

                        regards, tom lane



Re: ICU integration

Dave Page-7
On Fri, Sep 9, 2016 at 2:27 PM, Tom Lane <[hidden email]> wrote:
> Dave Page <[hidden email]> writes:
>> We needed a specific version that was newer than that shipped with
>> RHEL 6 (or in EPEL) iirc.
>
> Sure hope that's not true of the currently-proposed patch :-(

Looking back at my old emails, apparently ICU 5.0 and later include
ucol_strcollUTF8() which avoids the need to convert UTF-8 characters
to 16 bit before sorting. RHEL 6 has the older 4.2 version of ICU.
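
To illustrate the difference, here is a rough sketch (a hypothetical helper,
not code from EDB or the patch; I'm assuming U_ICU_VERSION_MAJOR_NUM >= 50 is
the right test for the UTF-8 API): with a new enough ICU the UTF-8 strings go
straight to the collator, otherwise they must be converted to UTF-16 first.

#include <unicode/ucol.h>
#include <unicode/ustring.h>

static int
icu_compare_utf8(UCollator *coll,
                 const char *a, int32_t alen,
                 const char *b, int32_t blen)
{
    UErrorCode  status = U_ZERO_ERROR;

#if U_ICU_VERSION_MAJOR_NUM >= 50
    /* newer ICU: compare UTF-8 directly, no intermediate buffers */
    return ucol_strcollUTF8(coll, a, alen, b, blen, &status);
#else
    /* older ICU: convert both strings to UTF-16 (UChar) first */
    UChar       ua[1024];
    UChar       ub[1024];
    int32_t     ualen;
    int32_t     ublen;

    u_strFromUTF8(ua, 1024, &ualen, a, alen, &status);
    u_strFromUTF8(ub, 1024, &ublen, b, blen, &status);
    /* fixed-size buffers and missing error checks: sketch only */
    return ucol_strcoll(coll, ua, ualen, ub, ublen);
#endif
}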

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: ICU integration

Peter Geoghegan-3
On Fri, Sep 9, 2016 at 6:39 AM, Dave Page <[hidden email]> wrote:
> Looking back at my old emails, apparently ICU 5.0 and later include
> ucol_strcollUTF8() which avoids the need to convert UTF-8 characters
> to 16 bit before sorting. RHEL 6 has the older 4.2 version of ICU.

At the risk of stating the obvious, there is a reason why ICU
traditionally worked with UTF-16 natively. It's the same reason why
many OSes and application frameworks (e.g., Java) use UTF-16
internally, even though UTF-8 is much more popular on the web: there
are certain low-level optimizations possible with UTF-16 that are not
possible with UTF-8.

I'm not saying that it would be just as good if we were to not use the
UTF-8 optimized stuff that ICU now has. My point is that it's not
useful to prejudge whether or not performance will be acceptable based
on a factor like this, which is ultimately just an implementation
detail. The ICU patch either performs acceptably as a substitute for
something like glibc, or it does not.

--
Peter Geoghegan



Re: ICU integration

Bruce Momjian
In reply to this post by Peter Eisentraut-6
On Thu, Sep  8, 2016 at 12:19:39PM -0400, Peter Eisentraut wrote:

> On 9/8/16 11:16 AM, Tom Lane wrote:
> > This is a problem, if ICU won't guarantee cross-version compatibility,
> > because it destroys the argument that moving to ICU would offer us
> > collation behavior stability.
>
> It would offer a significant upgrade over the current situation.
>
> First, it offers stability inside the same version.  Whereas glibc might
> change a collation in a minor upgrade, ICU won't do that.  And the
> postgres binary is bound to a major version of ICU by the soname (which
> changes with every major release).  So this would avoid the situation
> that a simple OS update could break collations.

Uh, how do we know that ICU doesn't change collations in minor versions?
Couldn't we get into a case where the OS changes the ICU version or
collations more frequently than glibc does?  Seems that would be a
negative.

I don't see how having our binary bound to an ICU major version gives us
any benefit.

It seems we are still hostage to the OS version.

> Second, it offers a way to detect that something has changed.  With
> glibc, you don't know anything unless you read the source diffs.  With
> ICU, you can compare the collation version before and after and at least
> tell the user that they need to refresh indexes or whatever.

Yes, that is a clear win.

--
  Bruce Momjian  <[hidden email]>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +



Re: ICU integration

Petr Jelinek
In reply to this post by Peter Eisentraut-6
On 31/08/16 04:46, Peter Eisentraut wrote:
> Here is a patch I've been working on to allow the use of ICU for sorting
> and other locale things.
>

Hi, here is a first-pass review of the patch (somewhat high level at this point).

First, there was some discussion about ICU versioning and collation
versioning and support for multiple versions of ICU at the same time. My
opinion on this is that the approach you have chosen is good, at least
for a first iteration of the feature. We might want to support multiple
versions of the library in the future (I am not all that convinced of
that though), but we definitely don't need to at the moment. It is
important to have version checking against the data directory, which you
have done.

>
> What I have done is extend collation objects with a collprovider column
> that tells whether the collation is using POSIX (appropriate name?) or
> ICU facilities.  The pg_locale_t type is changed to a struct that
> contains the provider-specific locale handles.  Users of locale
> information are changed to look into that struct for the appropriate
> handle to use.
>

This sounds reasonable. I would prefer the POSIX provider to be named
something like SYSTEM though (think Windows).

> In initdb, I initialize the default collation set as before from the
> `locale -a` output, but also add all available ICU locales with a "%icu"
> appended (so "fr_FR%icu").  I suppose one could create a configuration
> option perhaps in initdb to change the default so that, say, "fr_FR"
> uses ICU and "fr_FR%posix" uses the old stuff.

I wonder if we should have both %icu and %posix unconditionally and then
have an option to pick one of them for the list of collations without the
suffix (defaulting to posix for now, but maybe not in the future).

>
> That all works well enough for named collations and for sorting.  The
> thread about the FreeBSD ICU patch discusses some details of how to best
> use the ICU APIs to do various aspects of the sorting, so I didn't focus
> on that too much.

That's something for follow-up patches IMHO.

> I'm not sure how well it will work to replace all the bits of LIKE and
> regular expressions with ICU API calls.  One problem is that ICU likes
> to do case folding as a whole string, not by character.  I need to do
> more research about that.  

Can't we use the regular expression api provided by ICU, or is that too
incompatible?

> Another problem, which was also previously
> discussed is that ICU does case folding in a locale-agnostic manner, so
> it does not consider things such as the Turkish special cases.  This is
> per Unicode standard modulo weasel wording, but it breaks existing tests
> at least.
>

I am quite sure it will break existing use cases as well. I don't know if
that's an issue if we keep multiple locale providers, though. It's
expected that you get different behavior from them.

>
> Where it gets really interesting is what to do with the database
> locales.  They just set the global process locale.  So in order to port
> that to ICU we'd need to check every implicit use of the process locale
> and tweak it.  We could add a datcollprovider column or something.  But
> we also rely on the datctype setting to validate the encoding of the
> database.  Maybe we wouldn't need that anymore, but it sounds risky.

I think we'll need to separate the process locale and the locale used
for string operations.

>
> We could have a datcollation column that by OID references a collation
> defined inside the database.  With a background worker, we can log into
> the database as it is being created and make adjustments, including
> defining or adjusting collation definitions.  This would open up
> interesting new possibilities.

I don't really follow this but it sounds icky.

>
> What is a way to go forward here?  What's a minimal useful feature that
> is future-proof?  Just allow named collations referencing ICU for now?

I think that's a minimal useful feature indeed; it will move us forward,
especially on systems where POSIX locales are not that well implemented,
but it will not be a world changer just yet.


> Is throwing out POSIX locales even for the process locale reasonable?
>

I don't think we can really do that; we'll need some kind of mapping
table between system locales and ICU locales (IIRC we build something
like that on Windows as well). Maybe that mapping can be used for the
datctype to process locale conversion (datctype would then be
provider-specific).

We also need to keep POSIX locale support anyway, otherwise we'd break
pg_upgrade, I think (at least to the point that a pg_upgrade after which
you have to reindex the whole database is much less feasible).

> Oh, that case folding code in formatting.c needs some refactoring.
> There are so many ifdefs there and it's repeated almost identically
> three times, it's crazy to work in that.

Yuck, yes, it wasn't pretty before and it's quite unreadable now; I think
this patch will have to do something about that.

The message about a collation provider version change could use some
improvement though, e.g. we could try to find affected indexes, and
there might be affected check constraints as well, etc. If nothing else
it deserves an explanation in the docs.

The ICU conversion functions could use some docs.

I wonder if some of the pieces of code with the following pattern:
    if (mylocale->provider == COLLPROVIDER_ICU)
        do lots of ICU stuff
    else
        call the POSIX function
should move the ICU code to separate functions; it would certainly
improve overall readability.
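
Something along these lines is what I have in mind (the names below are made
up for illustration; the real code would of course use pg_locale_t and the
existing comparison routines, so treat this only as a sketch of the shape):

#include <string.h>
#include <unicode/ucol.h>

typedef enum { PROVIDER_LIBC, PROVIDER_ICU } provider_kind;

typedef struct demo_locale
{
    provider_kind provider;
    UCollator  *icu_collator;   /* only valid when provider == PROVIDER_ICU */
} demo_locale;

/* all ICU-specific details confined to one function */
static int
compare_icu(demo_locale *loc, const char *a, int32_t alen,
            const char *b, int32_t blen)
{
    UErrorCode  status = U_ZERO_ERROR;

    return ucol_strcollUTF8(loc->icu_collator, a, alen, b, blen, &status);
}

/* stand-in for the existing strcoll()-based path */
static int
compare_libc(demo_locale *loc, const char *a, int32_t alen,
             const char *b, int32_t blen)
{
    (void) loc;
    (void) alen;
    (void) blen;
    return strcmp(a, b);        /* assumes NUL-terminated input, sketch only */
}

/* callers only ever see this thin dispatcher */
static int
compare_text(demo_locale *loc, const char *a, int32_t alen,
             const char *b, int32_t blen)
{
    if (loc->provider == PROVIDER_ICU)
        return compare_icu(loc, a, alen, b, blen);
    else
        return compare_libc(loc, a, alen, b, blen);
}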

--
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



Re: ICU integration

Alvaro Herrera-9
Petr Jelinek wrote:

> > I'm not sure how well it will work to replace all the bits of LIKE and
> > regular expressions with ICU API calls.  One problem is that ICU likes
> > to do case folding as a whole string, not by character.  I need to do
> > more research about that.  
>
> Can't we use the regular expression api provided by ICU, or is that too
> incompatible?

That probably won't fly very well -- it would be rather absurd to have
different regex behavior when ICU is compiled compared to when it's
not.  Perhaps down the road we could have an extension that provides
ICU-based regex operators, but most likely not in core.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: ICU integration

Tom Lane-2
Alvaro Herrera <[hidden email]> writes:
> Petr Jelinek wrote:
>>> I'm not sure how well it will work to replace all the bits of LIKE and
>>> regular expressions with ICU API calls.  One problem is that ICU likes
>>> to do case folding as a whole string, not by character.  I need to do
>>> more research about that.  

>> Can't we use the regular expression api provided by ICU, or is that too
>> incompatible?

> That probably won't fly very well -- it would be rather absurd to have
> different regex behavior when ICU is compiled compared to when it's
> not.  Perhaps down the road we could have an extension that provides
> ICU-based regex operators, but most likely not in core.

The regex stuff does not handle case-insensitivity by case-folding the
data string, anyway.  Rather, it does that by converting 'a' or 'A'
in the pattern to the equivalent of '[aA]'.  So performance for that
step is not terribly critical.

                        regards, tom lane



Re: ICU integration

Thomas Munro-3
In reply to this post by Peter Eisentraut-6
On Wed, Aug 31, 2016 at 2:46 PM, Peter Eisentraut
<[hidden email]> wrote:
> Here is a patch I've been working on to allow the use of ICU for sorting
> and other locale things.

This is very interesting work, and it's great to see some development
in this area.  I've been peripherally involved in more than one
collation-change-broke-my-data incident over the years.  I took the
patch for a quick spin today.  Here are a couple of initial
observations.

This patch adds pkg-config support to our configure script, in order
to produce the build options for ICU.  That's cool, and I'm a fan of
pkg-config, but it's an extra dependency that I just wanted to
highlight.  For example, MacOSX appears to ship with ICU but has no
pkg-config or ICU .pc files out of the box (erm, I can't even find the
headers, so maybe that copy of ICU is useless to everyone except
Apple).  The MacPorts ICU port ships with .pc files, so that's easy to
deal with, and I don't expect it to be difficult to get a working
pkg-config and meta-data installed alongside ICU on any platform that
doesn't already ship with them.  It may well be useful for configuring
other packages.  (There is also other cool stuff that would become
possible once ICU is optionally around, off topic.)

There is something missing from the configure script however: the
output of pkg-config is not making it into CFLAGS or LDFLAGS, so
building fails on FreeBSD and MacOSX where for example
<unicode/ucnv.h> doesn't live in the default search path.  I tried
very briefly to work out why, but autoconfuse defeated me.

You call the built-in strcoll/strxfrm collation provider "posix", and
although POSIX does define those functions, it only does so because it
inherits them from ISO C90.  As far as I can tell Windows provides
those functions because it conforms to the C spec, not the POSIX spec,
though I suppose you could argue that in that respect it DOES conform
to the POSIX spec...  Also, we already have a collation called
"POSIX".  Maybe we should avoid confusing people with a "posix"
provider and a "POSIX" collation?  But then I'm not sure what else to
call it; perhaps "system" as Petr Jelinek suggested [1], or "libc".

postgres=# select * from pg_collation where collname like 'en_NZ%';
┌──────────────────┬───────────────┬───────────┬──────────────┬──────────────┬──────────────────┬──────────────────┬─────────────┐
│     collname     │ collnamespace │ collowner │ collprovider │ collencoding │   collcollate    │    collctype     │ collversion │
├──────────────────┼───────────────┼───────────┼──────────────┼──────────────┼──────────────────┼──────────────────┼─────────────┤
│ en_NZ            │            11 │        10 │ p            │            6 │ en_NZ            │ en_NZ            │           0 │
│ en_NZ            │            11 │        10 │ p            │            8 │ en_NZ.ISO8859-1  │ en_NZ.ISO8859-1  │           0 │
│ en_NZ            │            11 │        10 │ p            │           16 │ en_NZ.ISO8859-15 │ en_NZ.ISO8859-15 │           0 │
│ en_NZ.ISO8859-1  │            11 │        10 │ p            │            8 │ en_NZ.ISO8859-1  │ en_NZ.ISO8859-1  │           0 │
│ en_NZ.ISO8859-15 │            11 │        10 │ p            │           16 │ en_NZ.ISO8859-15 │ en_NZ.ISO8859-15 │           0 │
│ en_NZ.UTF-8      │            11 │        10 │ p            │            6 │ en_NZ.UTF-8      │ en_NZ.UTF-8      │           0 │
│ en_NZ%icu        │            11 │        10 │ i            │           -1 │ en_NZ            │ en_NZ            │ -1724383232 │
└──────────────────┴───────────────┴───────────┴──────────────┴──────────────┴──────────────────┴──────────────────┴─────────────┘
(7 rows)

I notice that you use encoding -1, meaning "all".  I haven't fully
grokked what that really means but it appears that we won't be able to
create any new such collations after initdb using CREATE COLLATION, if
for example you update your ICU installation and now have a new
collation available.  When I tried dropping some and recreating them
they finished up with collencoding = 6.  Should the initdb-created
rows use 6 too?

+    ucol_getVersion(collator, versioninfo);
+    numversion = ntohl(*((uint32 *) versioninfo));
+
+    if (numversion != collform->collversion)
+        ereport(WARNING,
+                (errmsg("ICU collator version mismatch"),
+                 errdetail("The database was created using version 0x%08X, the library provides version 0x%08X.",
+                           (uint32) collform->collversion, (uint32) numversion),
+                 errhint("Rebuild affected indexes, or build PostgreSQL with the right version of ICU.")));

I played around with bumping version numbers artificially.  That gives
each session that accesses the collation one warning:

postgres=# select * from foo order by id;
WARNING:  ICU collator version mismatch
DETAIL:  The database was created using version 0x99380001, the library provides version 0x99380000.
HINT:  Rebuild affected indexes, or build PostgreSQL with the right version of ICU.
┌────┐
│ id │
├────┤
└────┘
(0 rows)

postgres=# select * from foo order by id;
┌────┐
│ id │
├────┤
└────┘
(0 rows)

The warning persists even after I reindex all affected tables (and
start a new backend), because you're only recording the collation version
at pg_collation level and then only setting it at initdb time, so that
HINT isn't much use for clearing the warning.  I think you'd need to
record a snapshot of the version used for each collation that was used
on *each index*, and update it whenever you CREATE INDEX or REINDEX
etc.  There can of course be more than one collation and thus version
involved, if you have columns with different collations, so I guess
you'd need a column to hold an array of version snapshots whose order
corresponds to pg_index.indcollation's.

I contemplated something like that for the built-in libc collations,
based on extensions or external scripts that would know how to
checksum the collation definition files for your particular operating
system[2] and track them per index (and per column) in a user table.
That's not material for a core feature but it did cause me to think
about this problem a little bit.  That was based on the idea that
you'd run a script periodically or at least after OS upgrades, and
either have it run the reindexes automatically or email you a set of
commands to run off hours.

Obviously you can't refuse to come up on mismatch, or you'll never be
able to REINDEX.  Automatically launched unscheduled REINDEXes in core
would be antisocial and never fly.  But a warning could easily go
unnoticed leading to corruption (not only reads, but also UNIQUE
CONSTRAINTs not enforced, yada yada yada).  I wonder what else we
could reasonably do...

A couple of thoughts about abbreviated keys:

#ifndef TRUST_STRXFRM
    if (!collate_c)
        abbreviate = false;
#endif

I think this macro should affect only strxfrm, and we should trust
ucol_getSortKey or disable it independently.  ICU's manual says
reassuring things like "Sort keys are most useful in databases" and
"Sort keys are generally only useful in databases or other
circumstances where function calls are extremely expensive".

It looks like varstr_abbrev_convert calls strxfrm unconditionally
(assuming TRUST_STRXFRM is defined).  <captain-obvious>This needs to
use ucol_getSortKey instead when appropriate.</>  It looks like it's a
bit more helpful than strxfrm about telling you the output buffer size
it wants, and it doesn't need nul termination, which is nice.
Unfortunately it is like strxfrm in that the output buffer's contents
are unspecified if it ran out of space.
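
To illustrate the sizing behaviour, a minimal sketch (not patch code; note
that the input must already be UChar/UTF-16): the return value is the full
key length even when the buffer was too small, so a retry is trivial.

#include <stdlib.h>
#include <string.h>
#include <unicode/ucol.h>

static uint8_t *
get_full_sort_key(const UCollator *coll, const UChar *s, int32_t slen,
                  int32_t *keylen)
{
    uint8_t     stackbuf[64];
    int32_t     needed;
    uint8_t    *key;

    /* returns the full key size even if stackbuf was too small */
    needed = ucol_getSortKey(coll, s, slen, stackbuf, sizeof(stackbuf));
    key = malloc(needed);
    if (needed <= (int32_t) sizeof(stackbuf))
        memcpy(key, stackbuf, needed);                  /* it fit on the first try */
    else
        ucol_getSortKey(coll, s, slen, key, needed);    /* retry with enough room */
    *keylen = needed;
    return key;
}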

+++ b/src/test/regress/expected/collate.icu.out
@@ -0,0 +1,1076 @@
+/*
+ * This test is for Linux/glibc systems and assumes that a full set of
+ * locales is installed.  It must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */

Should say ICU.

[1] https://www.postgresql.org/message-id/3113bd83-9b3a-a95b-cf24-4b5b1dc6666f%402ndquadrant.com
[2] https://github.com/macdice/check_pg_collations

--
Thomas Munro
http://www.enterprisedb.com



Re: ICU integration

Peter Geoghegan-3
On Fri, Sep 23, 2016 at 7:27 AM, Thomas Munro
<[hidden email]> wrote:

> A couple of thoughts about abbreviated keys:
>
> #ifndef TRUST_STRXFRM
>     if (!collate_c)
>         abbreviate = false;
> #endif
>
> I think this macro should affect only strxfrm, and we should trust
> ucol_getSortKey or disable it independently.  ICU's manual says
> reassuring things like "Sort keys are most useful in databases" and
> "Sort keys are generally only useful in databases or other
> circumstances where function calls are extremely expensive".

+1. Abbreviated keys are essential to get competitive performance
while sorting text, and the fact that ICU makes them safe to
reintroduce is a major advantage of adopting ICU. Perhaps we should
consider wrapping strxfrm() instead, though, so that other existing
callers of strxfrm() (I'm thinking of convert_string_datum()) almost
automatically do the right thing. In other words, maybe there should
be a pg_strxfrm() or something, with TRUST_STRXFRM changed to be
something that can dynamically resolve whether or not it's a collation
managed by a trusted collation provider (this could only be resolved
at runtime, which I think is fine).

> It looks like varstr_abbrev_convert calls strxfrm unconditionally
> (assuming TRUST_STRXFRM is defined).  <captain-obvious>This needs to
> use ucol_getSortKey instead when appropriate.</>  It looks like it's a
> bit more helpful than strxfrm about telling you the output buffer size
> it wants, and it doesn't need nul termination, which is nice.
> Unfortunately it is like strxfrm in that the output buffer's contents
> is unspecified if it ran out of space.

One can use the ucol_nextSortKeyPart() interface to just get the first
4/8 bytes of an abbreviated key, reducing the overhead somewhat, so
the output buffer size limitation is probably irrelevant. The ICU
documentation says something about this being useful for Radix sort,
but I suspect it's more often used to generate abbreviated keys.
Abbreviated keys were not my original idea. They're really just a
standard technique.
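
Roughly like this (a sketch, not code from the patch; the UTF-8 input is
consumed through a UCharIterator, so no up-front UChar conversion is needed):

#include <string.h>
#include <unicode/ucol.h>
#include <unicode/uiter.h>

/* fill "abbrev" with the first abbrev_len bytes of the sort key */
static void
abbreviated_key(const UCollator *coll, const char *utf8, int32_t len,
                uint8_t *abbrev, int32_t abbrev_len)
{
    UCharIterator iter;
    uint32_t    state[2] = {0, 0};      /* must be zeroed before the first call */
    UErrorCode  status = U_ZERO_ERROR;
    int32_t     got;

    memset(abbrev, 0, abbrev_len);
    uiter_setUTF8(&iter, utf8, len);    /* iterate the UTF-8 string directly */
    got = ucol_nextSortKeyPart(coll, &iter, state, abbrev, abbrev_len, &status);
    (void) got;                         /* got < abbrev_len means the whole key fit */
}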

--
Peter Geoghegan



Re: ICU integration

Thomas Munro-3
On Sat, Sep 24, 2016 at 10:13 PM, Peter Geoghegan <[hidden email]> wrote:

> On Fri, Sep 23, 2016 at 7:27 AM, Thomas Munro
> <[hidden email]> wrote:
>> It looks like varstr_abbrev_convert calls strxfrm unconditionally
>> (assuming TRUST_STRXFRM is defined).  <captain-obvious>This needs to
>> use ucol_getSortKey instead when appropriate.</>  It looks like it's a
>> bit more helpful than strxfrm about telling you the output buffer size
>> it wants, and it doesn't need nul termination, which is nice.
>> Unfortunately it is like strxfrm in that the output buffer's contents
>> is unspecified if it ran out of space.
>
> One can use the ucol_nextSortKeyPart() interface to just get the first
> 4/8 bytes of an abbreviated key, reducing the overhead somewhat, so
> the output buffer size limitation is probably irrelevant. The ICU
> documentation says something about this being useful for Radix sort,
> but I suspect it's more often used to generate abbreviated keys.
> Abbreviated keys were not my original idea. They're really just a
> standard technique.

Nice!  The other advantage of ucol_nextSortKeyPart is that you don't have to convert the whole string to UChar (UTF16) first, as I think you would need to with ucol_getSortKey, because the UCharIterator mechanism can read directly from a UTF8 string.

I see in the documentation that ucol_nextSortKeyPart and ucol_getSortKey don't have compatible output, and this caveat may be related to whether sort key compression is used.  I don't understand what sort of compression is involved but out of curiosity I asked ICU to spit out some sort keys from ucol_nextSortKeyPart so I could see their size.

As you say, we can ask it to stop at 4 or 8 bytes which is very convenient for our purposes, but here I asked for more to get the full output so I could see where the primary weight part ends.  The primary weight took one byte for the Latin letters I tried and two for the Japanese characters I tried (except 一 which was just 0xaa).

ucol_nextSortKeyPart(en_US, "a", ...) -> 29 01 05 01 05
ucol_nextSortKeyPart(en_US, "ab", ...) -> 29 2b 01 06 01 06
ucol_nextSortKeyPart(en_US, "abc", ...) -> 29 2b 2d 01 07 01 07
ucol_nextSortKeyPart(en_US, "abcd", ...) -> 29 2b 2d 2f 01 08 01 08
ucol_nextSortKeyPart(en_US, "A", ...) -> 29 01 05 01 dc
ucol_nextSortKeyPart(en_US, "AB", ...) -> 29 2b 01 06 01 dc dc
ucol_nextSortKeyPart(en_US, "ABC", ...) -> 29 2b 2d 01 07 01 dc dc dc
ucol_nextSortKeyPart(en_US, "ABCD", ...) -> 29 2b 2d 2f 01 08 01 dc dc dc dc
ucol_nextSortKeyPart(ja_JP, "一", ...) -> aa 01 05 01 05
ucol_nextSortKeyPart(ja_JP, "一二", ...) -> aa d0 0f 01 06 01 06
ucol_nextSortKeyPart(ja_JP, "一二三", ...) -> aa d0 0f cb b8 01 07 01 07
ucol_nextSortKeyPart(ja_JP, "一二三四", ...) -> aa d0 0f cb b8 cb d5 01 08 01 08
ucol_nextSortKeyPart(ja_JP, "日", ...) -> d0 18 01 05 01 05
ucol_nextSortKeyPart(ja_JP, "日本", ...) -> d0 18 d1 d0 01 06 01 06
ucol_nextSortKeyPart(fr_FR, "cote", ...) -> 2d 45 4f 31 01 08 01 08
ucol_nextSortKeyPart(fr_FR, "côte", ...) -> 2d 45 4f 31 01 44 8e 06 01 09
ucol_nextSortKeyPart(fr_FR, "coté", ...) -> 2d 45 4f 31 01 42 88 01 09
ucol_nextSortKeyPart(fr_FR, "côté", ...) -> 2d 45 4f 31 01 44 8e 44 88 01 0a
ucol_nextSortKeyPart(fr_CA, "cote", ...) -> 2d 45 4f 31 01 08 01 08
ucol_nextSortKeyPart(fr_CA, "côte", ...) -> 2d 45 4f 31 01 44 8e 06 01 09
ucol_nextSortKeyPart(fr_CA, "coté", ...) -> 2d 45 4f 31 01 88 08 01 09
ucol_nextSortKeyPart(fr_CA, "côté", ...) -> 2d 45 4f 31 01 88 44 8e 06 01 0a

I wonder how it manages to deal with fr_CA's reversed secondary weighting rule -- which requires you to consider diacritics in reverse order, apparently abandoned in France but still used in Canada -- using a fixed-size space for state between calls.

--
Thomas Munro
http://www.enterprisedb.com