ICU integration

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
83 messages Options
12345
Reply | Threaded
Open this post in threaded view
|

ICU integration

Peter Eisentraut-6
Here is a patch I've been working on to allow the use of ICU for sorting
and other locale things.

This is mostly complementary to the existing FreeBSD ICU patch, most
recently discussed in [0].  While that patch removes the POSIX locale
use and replaces it with ICU, my interest was on allowing the use of
both.  I think that is necessary for upgrading, compatibility, and maybe
because someone likes it.

What I have done is extend collation objects with a collprovider column
that tells whether the collation is using POSIX (appropriate name?) or
ICU facilities.  The pg_locale_t type is changed to a struct that
contains the provider-specific locale handles.  Users of locale
information are changed to look into that struct for the appropriate
handle to use.

In initdb, I initialize the default collation set as before from the
`locale -a` output, but also add all available ICU locales with a "%icu"
appended (so "fr_FR%icu").  I suppose one could create a configuration
option perhaps in initdb to change the default so that, say, "fr_FR"
uses ICU and "fr_FR%posix" uses the old stuff.

That all works well enough for named collations and for sorting.  The
thread about the FreeBSD ICU patch discusses some details of how to best
use the ICU APIs to do various aspects of the sorting, so I didn't focus
on that too much.  I took the existing collate.linux.utf8.sql test and
ported it to the ICU setup, and it passes except for the case noted below.

I'm not sure how well it will work to replace all the bits of LIKE and
regular expressions with ICU API calls.  One problem is that ICU likes
to do case folding as a whole string, not by character.  I need to do
more research about that.  Another problem, which was also previously
discussed is that ICU does case folding in a locale-agnostic manner, so
it does not consider things such as the Turkish special cases.  This is
per Unicode standard modulo weasel wording, but it breaks existing tests
at least.

So right now the entries in collcollate and collctype need to be valid
for ICU *and* POSIX for everything to work.

Also note that ICU locales are encoding-independent and don't support a
separate collcollate and collctype, so the existing catalog structure is
not optimal.

Where it gets really interesting is what to do with the database
locales.  They just set the global process locale.  So in order to port
that to ICU we'd need to check every implicit use of the process locale
and tweak it.  We could add a datcollprovider column or something.  But
we also rely on the datctype setting to validate the encoding of the
database.  Maybe we wouldn't need that anymore, but it sounds risky.

We could have a datcollation column that by OID references a collation
defined inside the database.  With a background worker, we can log into
the database as it is being created and make adjustments, including
defining or adjusting collation definitions.  This would open up
interesting new possibilities.

What is a way to go forward here?  What's a minimal useful feature that
is future-proof?  Just allow named collations referencing ICU for now?
Is throwing out POSIX locales even for the process locale reasonable?

Oh, that case folding code in formatting.c needs some refactoring.
There are so many ifdefs there and it's repeated almost identically
three times, it's crazy to work in that.


[0]:
https://www.postgresql.org/message-id/flat/789A2F56-0E42-409D-A840-6AF5110D6085%40pingpong.net


--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

icu-integration.patch (119K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Craig Ringer-3
On 31 August 2016 at 10:46, Peter Eisentraut
<[hidden email]> wrote:
> Here is a patch I've been working on to allow the use of ICU for sorting
> and other locale things.

Great to see you working on this. We've got some icky head-in-the-sand
issues in this area when it comes to OS upgrades, cross-OS-release
replication, etc, and having more control over our locale handling
would be nice. It should also get rid of the Windows-vs-*nix locale
naming wart.

Speaking of which, have you had a chance to try it on Windows yet? If
not, I'm happy to offer some pointers and a working test env.

How stable are the UCU locales? Most importantly, does ICU offer any
way to "pin" a locale version, so we can say "we want de_DE as it was
in ICU 4.6" and get consistent behaviour when the user sets up a
replica on some other system with ICU 4.8? Even if the German
government has changed its mind (again) about some details of the
language and 4.8 knows about the changes but 4.4 doesn't?

Otherwise we'll just have a new version of the same problem when it
comes to replication and upgrades. User upgrades ICU or replicates to
host with newer ICU and the data in indexes no longer correctly
reflects the runtime.

Even if ICU doesn't help solve this problem it's still valuable. I
just think it's something to think about.

We could always bundle a specific ICU version, but I know how well
that'd go down with distributors. They'd just ignore us and unbundle
it then complain about it. Not wholly without reason either; they
don't want to push security updates and bug fixes for bundled
libraries. Also, ICU isn't exactly a pocket-sized library we can stash
in some 3rdpty_libs/ dir .

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Peter Eisentraut-6
On 8/30/16 11:27 PM, Craig Ringer wrote:
> Speaking of which, have you had a chance to try it on Windows yet?

nope

> How stable are the UCU locales? Most importantly, does ICU offer any
> way to "pin" a locale version, so we can say "we want de_DE as it was
> in ICU 4.6" and get consistent behaviour when the user sets up a
> replica on some other system with ICU 4.8? Even if the German
> government has changed its mind (again) about some details of the
> language and 4.8 knows about the changes but 4.4 doesn't?

I forgot to mention this, but the patch adds a collversion column that
stores the collation version (provided by ICU).  And then when you
upgrade ICU to something incompatible you get

+           if (numversion != collform->collversion)
+               ereport(WARNING,
+                       (errmsg("ICU collator version mismatch"),
+                        errdetail("The database was created using
version 0x%08X, the library provides version 0x%08X.",
+                                  (uint32) collform->collversion,
(uint32) numversion),
+                        errhint("Rebuild affected indexes, or build
PostgreSQL with the right version of ICU.")));

So you still need to manage this carefully, but at least you have a
chance to learn about it.

Suggestions for refining this are welcome.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Michael Paquier
On Wed, Aug 31, 2016 at 1:12 PM, Peter Eisentraut
<[hidden email]> wrote:
> On 8/30/16 11:27 PM, Craig Ringer wrote:
>> Speaking of which, have you had a chance to try it on Windows yet?
>
> nope

+SELECT a, b FROM collate_test2 ORDER BY b;
+ a |  b
+---+-----
+ 1 | abc
+ 4 | ABC
+ 3 | bbc
+ 2 | äbc
+(4 rows)
Be careful with non-ASCII characters in the regression tests, remember
for example that where jacana was not happy:
https://www.postgresql.org/message-id/CAB7nPqROd2MXqy_5+cZJVhW0wHrrz6P8jV_RSbLcrXRTwLh7tQ@...
--
Michael


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Peter Geoghegan-3
In reply to this post by Peter Eisentraut-6
On Tue, Aug 30, 2016 at 7:46 PM, Peter Eisentraut
<[hidden email]> wrote:
> Here is a patch I've been working on to allow the use of ICU for sorting
> and other locale things.

I'm really happy that you're working on this. This is more important
than is widely appreciated, and very important overall.

In a world where ICU becomes the defacto standard (i.e. is used on
major platforms by default), what remaining barriers are there to
documenting and enforcing the binary compatibility of replicas? I may
be mistaken, but offhand I can't think of any. Being able to describe
exactly what works and what doesn't is very important. After all,
failure to adhere to "the rules" today, such as they are, can leave
you with a subtly broken replica. I'd like to make that scenario
mechanically impossible, by locking everything down.

> I'm not sure how well it will work to replace all the bits of LIKE and
> regular expressions with ICU API calls.  One problem is that ICU likes
> to do case folding as a whole string, not by character.  I need to do
> more research about that.

My guess is that there are cultural reasons why it wants to operate on
a whole string, at least in some cases.

> Also note that ICU locales are encoding-independent and don't support a
> separate collcollate and collctype, so the existing catalog structure is
> not optimal.

That makes more sense to me, personally. ICU very explicitly decouples
technical issues (like the representation of strxfrm() keys, and, I
gather, encoding) from cultural issues (the actual user-visible
behaviors). This allows us to use strxfrm()-style binary keys in
indexes directly, since they're versioned independently from their
underlying collation; they can add a new optimization to strxfrm()-key
generation to the next ICU version, and we can detect that and require
a REINDEX, even when the collation version itself (the user-visible
behaviors) are unchanged. I'm getting ahead of myself here, but that
does seem very useful.

The Unicode collation algorithm [1] that ICU is directly based on
knows plenty about the requirements of indexing. It contains guidance
about equivalence vs. equality that we learned the hard way in commit
656beff5, for example.

> Where it gets really interesting is what to do with the database
> locales.  They just set the global process locale.  So in order to port
> that to ICU we'd need to check every implicit use of the process locale
> and tweak it.  We could add a datcollprovider column or something.  But
> we also rely on the datctype setting to validate the encoding of the
> database.  Maybe we wouldn't need that anymore, but it sounds risky.

Not sure about that.

Whatever we come up with here needs to mesh well with the existing
conventions around collation versioning that ICU has, in the context
of various operating system packages in particular. We can arrange it
so that in practice, an ICU upgrade doesn't often break your indexes
due to a collation rule change; ICU is happy to have multiple versions
of a collation at a time, and you'll probably retain the old collation
version in ICU.

Even if your old collation version isn't available in a new ICU
release (which I think is unlikely in practice), or you downgrade ICU,
it might be possible to give guidance on how to download a "Collation
Resource Bundle" [2][3] that *does* have the right collation version,
which presumably satisfies the requirement immediately.

Firebird already uses ICU. Maybe we have something to learn from them
here. In particular, where do they (by which I mean the ICU version
that Firebird links to) get its collations from in practice? I think
that the CLDR Data collations were at one time not even distributed
with ICU source. It might be a matter of individual OS packagers of
ICU deciding what exact CLDR data to use, which may or may not be of
any significant consequence in practice.

[1] http://unicode.org/reports/tr10
[2] http://site.icu-project.org/design/size/collation
[3] http://userguide.icu-project.org/icudata
--
Peter Geoghegan


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Peter Geoghegan-3
In reply to this post by Peter Eisentraut-6
On Tue, Aug 30, 2016 at 7:46 PM, Peter Eisentraut
<[hidden email]> wrote:
> In initdb, I initialize the default collation set as before from the
> `locale -a` output, but also add all available ICU locales with a "%icu"
> appended (so "fr_FR%icu").  I suppose one could create a configuration
> option perhaps in initdb to change the default so that, say, "fr_FR"
> uses ICU and "fr_FR%posix" uses the old stuff.

I suspect that we'd be better off adding a mechanism for adding a new
collation after initdb runs, on a live production instance. Maybe that
part can come later.

--
Peter Geoghegan


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Doug Doole
In reply to this post by Peter Eisentraut-6
Hi all. I’m new to the PostgreSQL code and the mailing list, but I’ve had a lot of experience with using ICU in a different database product. So while I’m not up to speed on the code yet, I can offer some insights on using ICU.

> On Aug 30, 2016, at 9:12 PM, Peter Eisentraut <[hidden email]> wrote:
>> How stable are the UCU locales? Most importantly, does ICU offer any
>> way to "pin" a locale version, so we can say "we want de_DE as it was
>> in ICU 4.6" and get consistent behaviour when the user sets up a
>> replica on some other system with ICU 4.8? Even if the German
>> government has changed its mind (again) about some details of the
>> language and 4.8 knows about the changes but 4.4 doesn’t?

ICU explicitly does not provide stability in their locales and collations. We pushed them hard to provide this, but between changes to the CLDR data and changes to the ICU code it just wasn’t feasible for them to provide version to version stability.

What they do offer is a compile option when building ICU to version all their APIs. So instead of calling icu_foo() you’d call icu_foo46(). (Or something like this - it’s been a few years since I actually worked with the ICU code.) This ultimately allows you to load multiple versions of the ICU library into a single program and provide stability by calling the appropriate version of the library. (Unfortunately, the OS - at least my Linux box - only provides the generic version of ICU and not the version annotated APIs, which means a separate compile of ICU is needed.)

The catch with this is that it means you likely want to expose the version information. In another note it was suggested to use something like fr_FR%icu. If you want to pin it to a specific version of ICU, you’ll likely need something like fr_FR%icu46. (There’s nothing wrong with supporting fr_FR%icu to give users an easy way of saying “give me the latest and greatest”, but you’d probably want to harden it to a specific ICU version internally.)

> I forgot to mention this, but the patch adds a collversion column that
> stores the collation version (provided by ICU).  And then when you
> upgrade ICU to something incompatible you get
>
> +           if (numversion != collform->collversion)
> +               ereport(WARNING,
> +                       (errmsg("ICU collator version mismatch"),
> +                        errdetail("The database was created using
> version 0x%08X, the library provides version 0x%08X.",
> +                                  (uint32) collform->collversion,
> (uint32) numversion),
> +                        errhint("Rebuild affected indexes, or build
> PostgreSQL with the right version of ICU.")));
>
> So you still need to manage this carefully, but at least you have a
> chance to learn about it.

Indexes are the obvious place where collation comes into play, and are relatively easy to address. But consider all the places where string comparisons can be done. For example, check constraints and referential constraints can depend on string comparisons. If the collation rules change because of a new version of ICU, the database can become inconsistent and will need a lot more work than an index rebuild.

> Suggestions for refining this are welcome.
>
> --
> Peter Eisentraut              http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Doug Doole
Salesforce



--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Peter Eisentraut-6
In reply to this post by Michael Paquier
On 8/31/16 12:32 AM, Michael Paquier wrote:

> On Wed, Aug 31, 2016 at 1:12 PM, Peter Eisentraut
> <[hidden email]> wrote:
>> On 8/30/16 11:27 PM, Craig Ringer wrote:
>>> Speaking of which, have you had a chance to try it on Windows yet?
>>
>> nope
>
> +SELECT a, b FROM collate_test2 ORDER BY b;
> + a |  b
> +---+-----
> + 1 | abc
> + 4 | ABC
> + 3 | bbc
> + 2 | äbc
> +(4 rows)

It take that to mean, "it works".

> Be careful with non-ASCII characters in the regression tests, remember
> for example that where jacana was not happy:
> https://www.postgresql.org/message-id/CAB7nPqROd2MXqy_5+cZJVhW0wHrrz6P8jV_RSbLcrXRTwLh7tQ@...

That thread didn't tell me much, except that the client encoding didn't
handle some of the characters.  That doesn't seem specific to Windows.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Peter Eisentraut-6
In reply to this post by Doug Doole
On 8/31/16 4:24 PM, Doug Doole wrote:
> ICU explicitly does not provide stability in their locales and collations. We pushed them hard to provide this, but between changes to the CLDR data and changes to the ICU code it just wasn’t feasible for them to provide version to version stability.
>
> What they do offer is a compile option when building ICU to version all their APIs. So instead of calling icu_foo() you’d call icu_foo46(). (Or something like this - it’s been a few years since I actually worked with the ICU code.) This ultimately allows you to load multiple versions of the ICU library into a single program and provide stability by calling the appropriate version of the library. (Unfortunately, the OS - at least my Linux box - only provides the generic version of ICU and not the version annotated APIs, which means a separate compile of ICU is needed.)
>
> The catch with this is that it means you likely want to expose the version information. In another note it was suggested to use something like fr_FR%icu. If you want to pin it to a specific version of ICU, you’ll likely need something like fr_FR%icu46. (There’s nothing wrong with supporting fr_FR%icu to give users an easy way of saying “give me the latest and greatest”, but you’d probably want to harden it to a specific ICU version internally.)

There are multiple things going on.

Collations in ICU are versioned.  You can find out the version of the
collation you are currently using using an API call.  A collation
version does not change during the life of a single version of ICU.  But
it might well change in the next version of ICU, as bugs are fixed and
things are refined.  There is no way in the API to call for a collation
of a specific version, since there is only one version of a collation in
a specific installation of ICU.  So my implementation is that we store
the version of the collation in the catalog when we create the
collation, and if we later on find at run time that the collation is of
a different version, we warn about it.

The ICU ABI (not API) is also versioned.  The way that this is done is
that all functions are actually macros to a versioned symbol.  So
ucol_open() is actually a macro that expands to, say, ucol_open_57() in
ICU version 57.  (They also got rid of a dot in their versions a while
ago.)  It's basically hand-crafted symbol versioning.  That way, you can
link with multiple versions of ICU at the same time.  However, the
purpose of that, as I understand it, is so that plugins can have a
different version of ICU loaded than the main process or another plugin.
 In terms of postgres using the right version of ICU, it doesn't buy
anything beyond what the soname mechanism does.

>> +           if (numversion != collform->collversion)
>> +               ereport(WARNING,
>> +                       (errmsg("ICU collator version mismatch"),
>> +                        errdetail("The database was created using
>> version 0x%08X, the library provides version 0x%08X.",
>> +                                  (uint32) collform->collversion,
>> (uint32) numversion),
>> +                        errhint("Rebuild affected indexes, or build
>> PostgreSQL with the right version of ICU.")));
>>
>> So you still need to manage this carefully, but at least you have a
>> chance to learn about it.
>
> Indexes are the obvious place where collation comes into play, and are relatively easy to address. But consider all the places where string comparisons can be done. For example, check constraints and referential constraints can depend on string comparisons. If the collation rules change because of a new version of ICU, the database can become inconsistent and will need a lot more work than an index rebuild.

We can refine the guidance.  But indexes are the most important issue, I
think, because changing the sorting rules in the background makes data
silently disappear.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Peter Eisentraut-6
In reply to this post by Peter Geoghegan-3
On 8/31/16 1:32 PM, Peter Geoghegan wrote:
> ICU is happy to have multiple versions
> of a collation at a time, and you'll probably retain the old collation
> version in ICU.
>
> Even if your old collation version isn't available in a new ICU
> release (which I think is unlikely in practice)

I think this is wrong, or I have misunderstood the ICU documentation.
There is no way to "choose" a collation version.  You can only record
the one you have gotten and check that you get the same one next time.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Doug Doole
In reply to this post by Peter Eisentraut-6
> The ICU ABI (not API) is also versioned.  The way that this is done is
> that all functions are actually macros to a versioned symbol.  So
> ucol_open() is actually a macro that expands to, say, ucol_open_57() in
> ICU version 57.  (They also got rid of a dot in their versions a while
> ago.)  It's basically hand-crafted symbol versioning.  That way, you can
> link with multiple versions of ICU at the same time.  However, the
> purpose of that, as I understand it, is so that plugins can have a
> different version of ICU loaded than the main process or another plugin.
 > In terms of postgres using the right version of ICU, it doesn't buy
> anything beyond what the soname mechanism does.

You can access the versioned API as well, it's just not documented. (The ICU team does support this - we worked very closely with them when doing all this.) We exploited the versioned API when we learned that there is no guarantee of backwards compatibility in collations. You can't just change a collation under a user (at least that was our opinion) since it can cause all sorts of problems. Refreshing a collation (especially on the fly) is a lot more work than we were prepared to take on. So we exploited the versioned APIs.

We carried the ICU version numbers around on our collation and locale IDs (such as fr_FR%icu36) . The database would load multiple versions of the ICU library so that something created with ICU 3.6 would always be processed with ICU 3.6. This avoided the problems of trying to change the rules on the user. (We'd always intended to provide tooling to allow the user to move an existing object up to a newer version of ICU, but we never got around to doing it.) In the code, this meant we were explicitly calling the versioned API so that we could keep the calls straight. (Of course this was abstracted in a set of our own locale functions so that the rest of the engine was ignorant of the ICU library fun that was going on.)

> We can refine the guidance.  But indexes are the most important issue, I
> think, because changing the sorting rules in the background makes data
> silently disappear.

I'd say that collation is the most important issue, but collation impacts a lot more than indexes.

Unfortunately as part of changing companies I had to leave my "screwy stuff that has happened in collations" presentation behind so I don't have concrete examples to point to, but I can cook up illustrative examples:

- Suppose in ICU X.X, AA = Å but in ICU Y.Y AA != Å. Further suppose there was an RI constraint where the primary key used AA and the foreign key used Å. If ICU was updated, the RI constraint between the rows would break, leaving an orphaned foreign key.

- I can't remember the specific language but they had the collation rule that "CH" was treated as a distinct entity between C and D. This gave the order C < CG < CI < CZ < CH < D. Then they removed CH as special which gave C < CG < CH < CI < CZ < D. Suppose there was the constraint CHECK (COL BETWEEN 'C' AND 'CH'). Originally it would allow (almost) all strings that started with C. After the change it the constraint would block everything that started with CI - CZ leaving many rows that no longer qualify in the database.

It could be argued that these are edge cases and not likely to be hit. That's likely true for a lot of users. But for a user who hits this, their database is going to be left in a mess.

--
Doug Doole

On Tue, Sep 6, 2016 at 8:37 AM Peter Eisentraut <[hidden email]> wrote:
On 8/31/16 4:24 PM, Doug Doole wrote:
> ICU explicitly does not provide stability in their locales and collations. We pushed them hard to provide this, but between changes to the CLDR data and changes to the ICU code it just wasn’t feasible for them to provide version to version stability.
>
> What they do offer is a compile option when building ICU to version all their APIs. So instead of calling icu_foo() you’d call icu_foo46(). (Or something like this - it’s been a few years since I actually worked with the ICU code.) This ultimately allows you to load multiple versions of the ICU library into a single program and provide stability by calling the appropriate version of the library. (Unfortunately, the OS - at least my Linux box - only provides the generic version of ICU and not the version annotated APIs, which means a separate compile of ICU is needed.)
>
> The catch with this is that it means you likely want to expose the version information. In another note it was suggested to use something like fr_FR%icu. If you want to pin it to a specific version of ICU, you’ll likely need something like fr_FR%icu46. (There’s nothing wrong with supporting fr_FR%icu to give users an easy way of saying “give me the latest and greatest”, but you’d probably want to harden it to a specific ICU version internally.)

There are multiple things going on.

Collations in ICU are versioned.  You can find out the version of the
collation you are currently using using an API call.  A collation
version does not change during the life of a single version of ICU.  But
it might well change in the next version of ICU, as bugs are fixed and
things are refined.  There is no way in the API to call for a collation
of a specific version, since there is only one version of a collation in
a specific installation of ICU.  So my implementation is that we store
the version of the collation in the catalog when we create the
collation, and if we later on find at run time that the collation is of
a different version, we warn about it.

The ICU ABI (not API) is also versioned.  The way that this is done is
that all functions are actually macros to a versioned symbol.  So
ucol_open() is actually a macro that expands to, say, ucol_open_57() in
ICU version 57.  (They also got rid of a dot in their versions a while
ago.)  It's basically hand-crafted symbol versioning.  That way, you can
link with multiple versions of ICU at the same time.  However, the
purpose of that, as I understand it, is so that plugins can have a
different version of ICU loaded than the main process or another plugin.
 In terms of postgres using the right version of ICU, it doesn't buy
anything beyond what the soname mechanism does.

>> +           if (numversion != collform->collversion)
>> +               ereport(WARNING,
>> +                       (errmsg("ICU collator version mismatch"),
>> +                        errdetail("The database was created using
>> version 0x%08X, the library provides version 0x%08X.",
>> +                                  (uint32) collform->collversion,
>> (uint32) numversion),
>> +                        errhint("Rebuild affected indexes, or build
>> PostgreSQL with the right version of ICU.")));
>>
>> So you still need to manage this carefully, but at least you have a
>> chance to learn about it.
>
> Indexes are the obvious place where collation comes into play, and are relatively easy to address. But consider all the places where string comparisons can be done. For example, check constraints and referential constraints can depend on string comparisons. If the collation rules change because of a new version of ICU, the database can become inconsistent and will need a lot more work than an index rebuild.

We can refine the guidance.  But indexes are the most important issue, I
think, because changing the sorting rules in the background makes data
silently disappear.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Peter Geoghegan-3
On Tue, Sep 6, 2016 at 10:40 AM, Doug Doole <[hidden email]> wrote:

>> The ICU ABI (not API) is also versioned.  The way that this is done is
>> that all functions are actually macros to a versioned symbol.  So
>> ucol_open() is actually a macro that expands to, say, ucol_open_57() in
>> ICU version 57.  (They also got rid of a dot in their versions a while
>> ago.)  It's basically hand-crafted symbol versioning.  That way, you can
>> link with multiple versions of ICU at the same time.  However, the
>> purpose of that, as I understand it, is so that plugins can have a
>> different version of ICU loaded than the main process or another plugin.
>  > In terms of postgres using the right version of ICU, it doesn't buy
>> anything beyond what the soname mechanism does.
>
> You can access the versioned API as well, it's just not documented. (The ICU
> team does support this - we worked very closely with them when doing all
> this.) We exploited the versioned API when we learned that there is no
> guarantee of backwards compatibility in collations. You can't just change a
> collation under a user (at least that was our opinion) since it can cause
> all sorts of problems. Refreshing a collation (especially on the fly) is a
> lot more work than we were prepared to take on. So we exploited the
> versioned APIs.

I originally got some of this information from the ICU Doxygen site
for the C API, which isn't great documentation, but also isn't bad. I
admit that there are certainly gaps in my understanding of how to
bridge our requirements with versioning to what ICU can give us.


--
Peter Geoghegan


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Peter Geoghegan-3
In reply to this post by Doug Doole
On Tue, Sep 6, 2016 at 10:40 AM, Doug Doole <[hidden email]> wrote:
> - Suppose in ICU X.X, AA = Å but in ICU Y.Y AA != Å. Further suppose there
> was an RI constraint where the primary key used AA and the foreign key used
> Å. If ICU was updated, the RI constraint between the rows would break,
> leaving an orphaned foreign key.

This isn't a problem for Postgres, or at least wouldn't be right now,
because we don't have case insensitive collations. So, we use a
strcmp()/memcmp() tie-breaker when strcoll() indicates equality, while
also making the general notion of text equality actually mean binary
equality. In short, we are aware that cases like this exist. IIRC
Unicode Technical Standard #10 independently recommends that this
tie-breaker strategy is one way of dealing with problems like this, in
a pinch, though I think we came up with the idea independently of that
recommendation. This was in response to a bug report over 10 years
ago.

I would like to get case insensitive collations some day, and was
really hoping that ICU would help. That being said, the need for a
strcmp() tie-breaker makes that hard. Oh well.

--
Peter Geoghegan


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Peter Eisentraut-6
In reply to this post by Doug Doole
On 9/6/16 1:40 PM, Doug Doole wrote:
> We carried the ICU version numbers around on our collation and locale
> IDs (such as fr_FR%icu36) . The database would load multiple versions of
> the ICU library so that something created with ICU 3.6 would always be
> processed with ICU 3.6. This avoided the problems of trying to change
> the rules on the user. (We'd always intended to provide tooling to allow
> the user to move an existing object up to a newer version of ICU, but we
> never got around to doing it.) In the code, this meant we were
> explicitly calling the versioned API so that we could keep the calls
> straight.

I understand that in principle, but I don't see operating system
providers shipping a bunch of ICU versions to facilitate that.  They
will usually ship one.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Doug Doole
> I understand that in principle, but I don't see operating system
> providers shipping a bunch of ICU versions to facilitate that.  They
> will usually ship one.

Yep. If you want the protection I've described, you can't depend on the OS copy of ICU. You need to have multiple ICU libraries that are named/installed such that you can load the specific version you want. It also means that you can have dependencies on versions of ICU that are no longer supported. (In my previous project, we were shipping 3 copies of the ICU library, going back to 2.x. Needless to say, we didn't add support for every drop of ICU.)

On Wed, Sep 7, 2016 at 5:53 AM Peter Eisentraut <[hidden email]> wrote:
On 9/6/16 1:40 PM, Doug Doole wrote:
> We carried the ICU version numbers around on our collation and locale
> IDs (such as fr_FR%icu36) . The database would load multiple versions of
> the ICU library so that something created with ICU 3.6 would always be
> processed with ICU 3.6. This avoided the problems of trying to change
> the rules on the user. (We'd always intended to provide tooling to allow
> the user to move an existing object up to a newer version of ICU, but we
> never got around to doing it.) In the code, this meant we were
> explicitly calling the versioned API so that we could keep the calls
> straight.

I understand that in principle, but I don't see operating system
providers shipping a bunch of ICU versions to facilitate that.  They
will usually ship one.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Doug Doole
In reply to this post by Peter Geoghegan-3
This isn't a problem for Postgres, or at least wouldn't be right now,
because we don't have case insensitive collations.

I was wondering if Postgres might be that way. It does avoid the RI constraint problem, but there are still troubles with range based predicates. (My previous project wanted case/accent insensitive collations, so we got to deal with it all.)
 
So, we use a strcmp()/memcmp() tie-breaker when strcoll() indicates equality, while also making the general notion of text equality actually mean binary equality.

We used a similar tie breaker in places. (e.g. Index keys needed to be identical, not just equal. We also broke ties in sort to make its behaviour more deterministic.)

I would like to get case insensitive collations some day, and was
really hoping that ICU would help. That being said, the need for a
strcmp() tie-breaker makes that hard. Oh well.

Prior to adding ICU to my previous project, it had the assumption that equal meant identical as well. It turned out to be a lot easier to break this assumption than I expected, but that code base had religiously used its own string comparison function for user data - strcmp()/memcmp() was never called for user data. (I don't know if the same can be said for Postgres.) We found that very few places needed to be aware of values that were equal but not identical. (Index and sort were the big two.)

Hopefully Postgres will be the same.

--
Doug Doole
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Alvaro Herrera-9
In reply to this post by Doug Doole
> - I can't remember the specific language but they had the collation rule
> that "CH" was treated as a distinct entity between C and D. This gave the
> order C < CG < CI < CZ < CH < D. Then they removed CH as special which gave
> C < CG < CH < CI < CZ < D. Suppose there was the constraint CHECK (COL
> BETWEEN 'C' AND 'CH'). Originally it would allow (almost) all strings that
> started with C. After the change it the constraint would block everything
> that started with CI - CZ leaving many rows that no longer qualify in the
> database.

(This was probably Spanish.)

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Tom Lane-2
In reply to this post by Peter Eisentraut-6
Peter Eisentraut <[hidden email]> writes:

> On 9/6/16 1:40 PM, Doug Doole wrote:
>> We carried the ICU version numbers around on our collation and locale
>> IDs (such as fr_FR%icu36) . The database would load multiple versions of
>> the ICU library so that something created with ICU 3.6 would always be
>> processed with ICU 3.6. This avoided the problems of trying to change
>> the rules on the user. (We'd always intended to provide tooling to allow
>> the user to move an existing object up to a newer version of ICU, but we
>> never got around to doing it.) In the code, this meant we were
>> explicitly calling the versioned API so that we could keep the calls
>> straight.

> I understand that in principle, but I don't see operating system
> providers shipping a bunch of ICU versions to facilitate that.  They
> will usually ship one.

I agree with that estimate, and I would further venture that even if we
wanted to bundle ICU into our tarballs, distributors would rip it out
again on security grounds.  I am dead certain Red Hat would do so; less
sure that other vendors have similar policies, but it seems likely.
They don't want to have to fix security bugs in more than one place.

This is a problem, if ICU won't guarantee cross-version compatibility,
because it destroys the argument that moving to ICU would offer us
collation behavior stability.

                        regards, tom lane


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Peter Eisentraut-6
On 9/8/16 11:16 AM, Tom Lane wrote:
> This is a problem, if ICU won't guarantee cross-version compatibility,
> because it destroys the argument that moving to ICU would offer us
> collation behavior stability.

It would offer a significant upgrade over the current situation.

First, it offers stability inside the same version.  Whereas glibc might
change a collation in a minor upgrade, ICU won't do that.  And the
postgres binary is bound to a major version of ICU by the soname (which
changes with every major release).  So this would avoid the situation
that a simple OS update could break collations.

Second, it offers a way to detect that something has changed.  With
glibc, you don't know anything unless you read the source diffs.  With
ICU, you can compare the collation version before and after and at least
tell the user that they need to refresh indexes or whatever.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Peter Geoghegan-3
In reply to this post by Tom Lane-2
On Thu, Sep 8, 2016 at 8:16 AM, Tom Lane <[hidden email]> wrote:
>> I understand that in principle, but I don't see operating system
>> providers shipping a bunch of ICU versions to facilitate that.  They
>> will usually ship one.
>
> I agree with that estimate, and I would further venture that even if we
> wanted to bundle ICU into our tarballs, distributors would rip it out
> again on security grounds.

I agree that we're not going to bundle our own ICU. And, that
packagers have to be more or less on board with whatever plan we come
up with for any this to be of much practical value. The plan itself is
at least as important as the patch.

> This is a problem, if ICU won't guarantee cross-version compatibility,
> because it destroys the argument that moving to ICU would offer us
> collation behavior stability.

Not exactly. Peter E. didn't seem to be aware that there is an ICU
collator versioning concept (perhaps I misunderstood, though). It
might be that in practice, the locales are very stable, so it almost
doesn't matter that it's annoying when they change. Note that
"collators" are versioned in a sophisticated way, not locales.

You can build the attached simple C program to see the versions of
available collators from each locale, as follows:

$ gcc icu-test.c -licui18n -licuuc -o icu-coll-versions
$ ./icu-coll-versions | head -n 20
Collator                                          | ICU Version | UCA Version
-----------------------------------------------------------------------------
Afrikaans                                         | 99-38-00-00 | 07-00-00-00
Afrikaans (Namibia)                               | 99-38-00-00 | 07-00-00-00
Afrikaans (South Africa)                          | 99-38-00-00 | 07-00-00-00
Aghem                                             | 99-38-00-00 | 07-00-00-00
Aghem (Cameroon)                                  | 99-38-00-00 | 07-00-00-00
Akan                                              | 99-38-00-00 | 07-00-00-00
Akan (Ghana)                                      | 99-38-00-00 | 07-00-00-00
Amharic                                           | 99-38-00-00 | 07-00-00-00
Amharic (Ethiopia)                                | 99-38-00-00 | 07-00-00-00
Arabic                                            | 99-38-1B-01 | 07-00-00-00
Arabic (World)                                    | 99-38-1B-01 | 07-00-00-00
Arabic (United Arab Emirates)                     | 99-38-1B-01 | 07-00-00-00
Arabic (Bahrain)                                  | 99-38-1B-01 | 07-00-00-00
Arabic (Djibouti)                                 | 99-38-1B-01 | 07-00-00-00
Arabic (Algeria)                                  | 99-38-1B-01 | 07-00-00-00
Arabic (Egypt)                                    | 99-38-1B-01 | 07-00-00-00
Arabic (Western Sahara)                           | 99-38-1B-01 | 07-00-00-00
Arabic (Eritrea)                                  | 99-38-1B-01 | 07-00-00-00

I also attach a full list from my Ubuntu 16.04 laptop. I'll try to
find some other system to generate output from, to see how close it
matches what I happen to have here.

"ICU version" here is an opaque 32-bit integer [1]. I'd be interested
to see how much the output of this program differs from one major
version of ICU to the next. Collations will change. of course, but not
that often. It's not the end of the world if somebody has to REINDEX
when they change major OS version. It would be nice if everything just
continued to work with no further input from the user, but it's not
essential, assuming that collation are pretty stable in practice,
which I think they are. It is a total disaster if a mismatch in
collations is initially undetected, though.

Another issue that nobody has mentioned here, I think, is that the
glibc people just don't seem to care about our use-case (Carlos
O'Donnell basically said as much, during the strxfrm() debacle earlier
this year, but it wasn't limited to how we were relying on strxfrm()
at that time). Since it's almost certainly true that other major
database systems are critically reliant on ICU's strxfrm() agreeing
with strcoll (substitute ICU equivalent spellings), and issues beyond
that, it stands to reason that they take that stuff very seriously. It
would be really nice to get back abbreviated keys for collated text,
IMV. I think ICU gets us that. Even if we used ICU in exactly the same
way as we use the C standard library today, that general sense of
stability being critical that ICU has would still be a big advantage.
If ICU drops the ball on collation stability, or strxfrm() disagreeing
with strcoll(), it's a huge problem for lots of groups of people, not
just us.

[1] https://ssl.icu-project.org/apiref/icu4c/ucol_8h.html#af756972781ac556a62e48cbd509ea4a6
--
Peter Geoghegan


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

full-list.txt (71K) Download Attachment
icu-test.c (3K) Download Attachment
12345
Previous Thread Next Thread