ICU integration

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
83 messages Options
12345
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Peter Eisentraut-6
On 1/31/17 16:59, Thomas Munro wrote:
> I assume (but haven't checked) that ucol_nextSortKeyPart accesses only
> the start of the string via the UCharIterator passed in, unless you
> have the rare reverse-accent-sort feature enabled (maybe used only in
> fr_CA, it looks like it is required to scan the whole string looking
> for the last accent).  So I assume that uiter_setUTF8 and
> ucol_nextSortKeyPart would usually do a small fixed amount of work,
> whereas this patch's icu_to_uchar allocates space and converts the
> whole variable length string every time.

I have changed it to use ucol_nextSortKeyPart() when appropriate.

> That's about abbreviation, but I note that you can also compare
> strings using iterators with ucol_strcollIter, avoiding the need to
> allocate and transcode up front.  I have no idea whether that'd pay
> off.

These iterators don't actually transcode on the fly.  You can only set
up iterators for UTF-8 or UTF-16 strings.  And for UTF-8, we already
have a special code path using ucol_strcollUTF8(), which uses an
interator internally.

> Clearly when you upgrade your system from (say) Debian 8 to Debian 9
> and the ICU major version changes we expect to have to REINDEX, but
> does anyone know if such data changes are ever pulled into the minor
> version package upgrades you get from regular apt-get update of (say)
> a Debian 8 or CentOS 7 or FreeBSD 11 system?  In other words, do we
> expect collversion changes to happen basically any time in response to
> regular system updates, or only when you're doing a major upgrade of
> some kind, as the above-quoted documentation patch implies?

Stable operating systems shouldn't do major library upgrades, so I feel
pretty confident about this.

>
> +               collversion = ntohl(*((uint32 *) versioninfo));
>
> UVersionInfo is an array of four uint8_t.  That doesn't sound like
> something that needs to be bit-swizzled... is it really?  Even if it
> is arranged differently on different architectures, I'm not sure why
> you care since we only ever use it to compare for equality on the same
> system.  But aside from that, I don't love this cast to uint32.  I
> wonder if we should use u_versionToString to respect boundaries a bit
> better?

Makes sense, changed the column type to text.

> I have another motivation for wanting to model collation versions as
> strings: I have been contemplating a version check for system-provided
> collations too, piggy-backing on your work here.  Obviously PostgreSQL
> can't directly know anything about system collation versions, but I'm
> thinking of a GUC system_collation_version_command which you could
> optionally set to a script that knows how to inspect the local
> operating system.  For example, a package maintainer might be
> interested in writing such a script that knows how to ask the package
> system for a locale data version.  A brute force approach that works
> on many systems could be 'tar c /usr/share/local/*/LC_COLLATE | md5'.

That sounds like an idea worth pursuing.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Peter Geoghegan-4
On Wed, Feb 15, 2017 at 9:17 PM, Peter Eisentraut
<[hidden email]> wrote:
> I have changed it to use ucol_nextSortKeyPart() when appropriate.

I think it would be fine if the second last argument was
"sizeof(Datum)", not "Min(sizeof(Datum), sss->buflen2)" here:

> +               if (GetDatabaseEncoding() == PG_UTF8)
> +               {
> +                   UCharIterator iter;
> +                   uint32_t    state[2];
> +                   UErrorCode  status;
> +
> +                   uiter_setUTF8(&iter, sss->buf1, len);
> +                   state[0] = state[1] = 0;  /* won't need that again */
> +                   status = U_ZERO_ERROR;
> +                   bsize = ucol_nextSortKeyPart(sss->locale->info.icu.ucol,
> +                                                &iter,
> +                                                state,
> +                                                (uint8_t *) sss->buf2,
> +                                                Min(sizeof(Datum), sss->buflen2),
> +                                                &status);
> +                   if (U_FAILURE(status))
> +                       ereport(ERROR,
> +                               (errmsg("sort key generation failed: %s", u_errorName(status))));
> +               }

Does this really need to be in the strxfrm() loop at all? I don't see
why you need to ever retry here.

There is an issue of style here:

> -       /* Just like strcoll(), strxfrm() expects a NUL-terminated string */
>         memcpy(sss->buf1, authoritative_data, len);
> +       /* Just like strcoll(), strxfrm() expects a NUL-terminated string.
> +        * Not necessary for ICU, but doesn't hurt. */

I see that you have preserved strcoll() comparison caching (or should
I say ucol_strcollUTF8() comparison caching?), at the cost of having
to keep around a buffer which we must continue to copy every text
string into within varstr_abbrev_convert(). That was probably the
right trade-off. It doesn't hurt that it required minimal divergence
within new ICU code, too.

--
Peter Geoghegan


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Peter Geoghegan-4
In reply to this post by Peter Eisentraut-6
On Wed, Feb 15, 2017 at 9:17 PM, Peter Eisentraut
<[hidden email]> wrote:

>> Clearly when you upgrade your system from (say) Debian 8 to Debian 9
>> and the ICU major version changes we expect to have to REINDEX, but
>> does anyone know if such data changes are ever pulled into the minor
>> version package upgrades you get from regular apt-get update of (say)
>> a Debian 8 or CentOS 7 or FreeBSD 11 system?  In other words, do we
>> expect collversion changes to happen basically any time in response to
>> regular system updates, or only when you're doing a major upgrade of
>> some kind, as the above-quoted documentation patch implies?
>
> Stable operating systems shouldn't do major library upgrades, so I feel
> pretty confident about this.

There is one case where it *is* appropriate for a bugfix release of
ICU to bump collator version: When there was a bug in the collator
itself, leading to broken binary blobs [1]. We should make the user
REINDEX when this happens.

I think that ICU may well do this in point releases, which is actually
a good thing. So, it seems like a good idea to test that there is no
change, in a place like check_strxfrm_bug(). In my opinion, we should
LOG a complaint in the event of a mismatch rather than refusing to
start the server, since there probably isn't that much chance of the
user being harmed by the bug. The cost of not starting the server
properly until a REINDEX finishes or similar looks particularly
unappealing when one considers that the app was probably affected by
any corruption for weeks or months before the ICU update enabled its
detection.

[1] http://userguide.icu-project.org/collation/api#TOC-Sort-Key-Features
--
Peter Geoghegan


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Alvaro Herrera-9
In reply to this post by Peter Eisentraut-6
Peter Eisentraut wrote:

> - Naming of collations:  Are we happy with the "de%icu" naming?  I might
> have come up with that while reviewing the IPv6 zone index patch. ;-)
> An alternative might be "de$icu" for more Oracle vibe and avoiding the
> need for double quotes in some cases.  (But we have mixed-case names
> like "de_AT%icu", so further changes would be necessary to fully get rid
> of the need for quoting.)  A more radical alternative would be to
> install ICU locales in a different schema and use the search_path, or
> even have a separate search path setting for collations only.  Which
> leads to ...
>
> - Selecting default collation provider:  Maybe we want a setting, say in
> initdb, to determine which provider's collations get the "good" names?
> Maybe not necessary for this release, but something to think about.

I'm not sure I like "default collations" to depend on either search_path
or some hypothetical future setting.  That seems to lead to unproductive
discussions on mailing list questions,

User: hey, my sort doesn't work sanely
Pg person: okay, but what collation are you using?  Look for the collate
clause
User: Ah, it's de_DE
Pg person: oh, okay, but is it ICU's de_DE or the libc's?
User: ...

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Peter Eisentraut-6
In reply to this post by Peter Geoghegan-4
On 2/16/17 01:13, Peter Geoghegan wrote:
> On Wed, Feb 15, 2017 at 9:17 PM, Peter Eisentraut
> <[hidden email]> wrote:
>> I have changed it to use ucol_nextSortKeyPart() when appropriate.
>
> I think it would be fine if the second last argument was
> "sizeof(Datum)", not "Min(sizeof(Datum), sss->buflen2)" here:

I don't see anything that guarantees that the buffer length is always at
least sizeof(Datum).

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Bruce Momjian
In reply to this post by Peter Geoghegan-4
On Wed, Feb 15, 2017 at 10:35:34PM -0800, Peter Geoghegan wrote:

> On Wed, Feb 15, 2017 at 9:17 PM, Peter Eisentraut
> <[hidden email]> wrote:
> > Stable operating systems shouldn't do major library upgrades, so I feel
> > pretty confident about this.
>
> There is one case where it *is* appropriate for a bugfix release of
> ICU to bump collator version: When there was a bug in the collator
> itself, leading to broken binary blobs [1]. We should make the user
> REINDEX when this happens.
>
> I think that ICU may well do this in point releases, which is actually
> a good thing. So, it seems like a good idea to test that there is no
> change, in a place like check_strxfrm_bug(). In my opinion, we should
> LOG a complaint in the event of a mismatch rather than refusing to
> start the server, since there probably isn't that much chance of the
> user being harmed by the bug. The cost of not starting the server
> properly until a REINDEX finishes or similar looks particularly
> unappealing when one considers that the app was probably affected by
> any corruption for weeks or months before the ICU update enabled its
> detection.
> http://www.postgresql.org/mailpref/pgsql-hackers

I can't think of any cases where we warn of possible corruption only in
the server logs.

--
  Bruce Momjian  <[hidden email]>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Peter Geoghegan-4
On Mon, Feb 20, 2017 at 2:38 PM, Bruce Momjian <[hidden email]> wrote:
> I can't think of any cases where we warn of possible corruption only in
> the server logs.

It's just like any case where there was a bug in Postgres that could
have subtly broken index builds. We don't make the next point release
force users to REINDEX every possibly-affected index every time that
happens. Presumably this is because they aren't particularly likely to
be affected by any given bug, and because it's pretty infeasible for
us to do so anyway. There aren't always easy ways to detect the
problem, and users will never learn to deal with issues like this well
when it is by definition something that should never happen.

--
Peter Geoghegan


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Bruce Momjian
On Mon, Feb 20, 2017 at 02:54:22PM -0800, Peter Geoghegan wrote:

> On Mon, Feb 20, 2017 at 2:38 PM, Bruce Momjian <[hidden email]> wrote:
> > I can't think of any cases where we warn of possible corruption only in
> > the server logs.
>
> It's just like any case where there was a bug in Postgres that could
> have subtly broken index builds. We don't make the next point release
> force users to REINDEX every possibly-affected index every time that
> happens. Presumably this is because they aren't particularly likely to
> be affected by any given bug, and because it's pretty infeasible for
> us to do so anyway. There aren't always easy ways to detect the
> problem, and users will never learn to deal with issues like this well
> when it is by definition something that should never happen.

Well, the release notes are a clear-enough communication for a need to
reindex.  I don't see a LOG message as similar.  Don't we have other
cases where we have to warn administrators?  We can mark the indexes as
invalid but how would we report that?

--
  Bruce Momjian  <[hidden email]>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Peter Geoghegan-4
On Mon, Feb 20, 2017 at 3:19 PM, Bruce Momjian <[hidden email]> wrote:
> Well, the release notes are a clear-enough communication for a need to
> reindex.  I don't see a LOG message as similar.  Don't we have other
> cases where we have to warn administrators?  We can mark the indexes as
> invalid but how would we report that?

Marking all indexes as invalid would be an enormous overkill. We don't
even know for sure that the values the user has indexed happens to be
affected. In order for there to have been a bug in ICU in the first
place, the likelihood is that it only occurs in what are edge cases
for the collator.

ICU is a very popular library, used in software that I personally
interact with every day [1]. Any bugs like this should be exceptional.
They very clearly appreciate how sensitive software like Postgres is
to changes like this, which is why the versioning API exists.

[1] http://site.icu-project.org/#TOC-Who-Uses-ICU-
--
Peter Geoghegan


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Bruce Momjian
On Mon, Feb 20, 2017 at 03:29:07PM -0800, Peter Geoghegan wrote:

> Marking all indexes as invalid would be an enormous overkill. We don't
> even know for sure that the values the user has indexed happens to be
> affected. In order for there to have been a bug in ICU in the first
> place, the likelihood is that it only occurs in what are edge cases
> for the collator.
>
> ICU is a very popular library, used in software that I personally
> interact with every day [1]. Any bugs like this should be exceptional.
> They very clearly appreciate how sensitive software like Postgres is
> to changes like this, which is why the versioning API exists.
>
> [1] http://site.icu-project.org/#TOC-Who-Uses-ICU-

So we don't have any other cases where we warn about possible corruption
except this?

Also, I will go back to my previous concern, that while I like the fact
we can detect collation changes with ICU, we don't know if ICU
collations change more often than OS collations.

--
  Bruce Momjian  <[hidden email]>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Peter Geoghegan-4
On Mon, Feb 20, 2017 at 3:51 PM, Bruce Momjian <[hidden email]> wrote:
> So we don't have any other cases where we warn about possible corruption
> except this?

I'm not sure that I understand the distinction you're making.

> Also, I will go back to my previous concern, that while I like the fact
> we can detect collation changes with ICU, we don't know if ICU
> collations change more often than OS collations.

We do know that ICU collations can never change behaviorally in a
given stable release. Bug fixes are allowed in point releases, but
these never change the user-visible behavior of collations. That's
very clear, because an upstream Unciode UCA version is used by a given
major release of ICU. This upstream data describes the behavior of a
collation using a high-level declarative language, that non-programmer
experts in natural languages write.

ICU versions many different things, in fact. Importantly, it
explicitly decouples behavioral issues (user visible sort order -- UCA
version) from technical issues (collator implementation details). So,
my original point is that that could change, and if that happens we
ought to have a plan. But, it won't change unless it really has to.

--
Peter Geoghegan


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Peter Eisentraut-6
In reply to this post by Peter Eisentraut-6
The v4 patch didn't apply anymore, so here is a rebased patch.  No
changes in functionality.


On 2/16/17 00:10, Peter Eisentraut wrote:

> Updated and rebased patch.
>
> Significant changes:
>
> - Changed collversion to type text
>
> - Changed pg_locale_t to a union
>
> - Use ucol_getAvailable() instead of uloc_getAvailable(), so the set of
> initial collations is smaller now, because redundancies are eliminated.
>
> - Added keyword variants to predefined ICU collations (so you get
> "de_phonebook%icu", for example)  (So the initial set of collations is
> bigger now. :) )
>
> - Predefined ICU collations have a comment now, so \dOS+ is useful.
>
> - Use ucol_nextSortKeyPart() for abbreviated keys
>
> - Enhanced tests and documentation
>
> I believe all issues raised in reviews have been addressed.
>
> Discussion points:
>
> - Naming of collations:  Are we happy with the "de%icu" naming?  I might
> have come up with that while reviewing the IPv6 zone index patch. ;-)
> An alternative might be "de$icu" for more Oracle vibe and avoiding the
> need for double quotes in some cases.  (But we have mixed-case names
> like "de_AT%icu", so further changes would be necessary to fully get rid
> of the need for quoting.)  A more radical alternative would be to
> install ICU locales in a different schema and use the search_path, or
> even have a separate search path setting for collations only.  Which
> leads to ...
>
> - Selecting default collation provider:  Maybe we want a setting, say in
> initdb, to determine which provider's collations get the "good" names?
> Maybe not necessary for this release, but something to think about.
>
> - Currently (in this patch), we check a collation's version when it is
> first used.  But, say, after pg_upgrade, you might want to check all of
> them right away.  What might be a good interface for that?  (Possibly,
> we only have to check the ones actually in use, and we have dependency
> information for that.)
>
>
>
>

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

v5-0001-ICU-support.patch (211K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Andreas Karlsson
On 03/09/2017 10:13 PM, Peter Eisentraut wrote:
>> - Naming of collations:  Are we happy with the "de%icu" naming?  I might
>> have come up with that while reviewing the IPv6 zone index patch. ;-)
>> An alternative might be "de$icu" for more Oracle vibe and avoiding the
>> need for double quotes in some cases.  (But we have mixed-case names
>> like "de_AT%icu", so further changes would be necessary to fully get rid
>> of the need for quoting.)  A more radical alternative would be to
>> install ICU locales in a different schema and use the search_path, or
>> even have a separate search path setting for collations only.  Which
>> leads to ...

I do not like the schema based solution since search_path already gives
us enough headaches. As for the naming I am fine with the current scheme.

Would be nice with something we did not have to quote, but I do not like
using dollar signs since they are already use for other things.

>> - Selecting default collation provider:  Maybe we want a setting, say in
>> initdb, to determine which provider's collations get the "good" names?
>> Maybe not necessary for this release, but something to think about.

This does not seem necessary for the initial release.

>> - Currently (in this patch), we check a collation's version when it is
>> first used.  But, say, after pg_upgrade, you might want to check all of
>> them right away.  What might be a good interface for that?  (Possibly,
>> we only have to check the ones actually in use, and we have dependency
>> information for that.)

How about adding a SQL level function for checking this which can be
called by pg_upgrade?

= Review

Here is an initial review. I will try to find some time to do more
testing later this week.

This is a really useful feature given the poor quality of collation
support libc. Just that ICU versions the encodings is huge, and the
larger range of available collations with high configurability.
Additionally being able to use abbreviated keys again would be huge.

ICU also supports writing your own collations and modifying existing
ones, something which might be possible to expose in the future. In
general ICU offers a lot for users who care about the details of text
collation.

== Functional

- Generally things seem to work fine and as expected.

- I get a test failures in the default test suite due to not having the
tr_TR locale installed. I would assume that this would be pretty common
for hackers.

***************
*** 428,443 ****

   -- to_char
   SET lc_time TO 'tr_TR';
   SELECT to_char(date '2010-04-01', 'DD TMMON YYYY');
      to_char
   -------------
!  01 NIS 2010
   (1 row)

   SELECT to_char(date '2010-04-01', 'DD TMMON YYYY' COLLATE "tr%icu");
      to_char
   -------------
!  01 NİS 2010
   (1 row)

   -- backwards parsing
--- 428,444 ----

   -- to_char
   SET lc_time TO 'tr_TR';
+ ERROR:  invalid value for parameter "lc_time": "tr_TR"
   SELECT to_char(date '2010-04-01', 'DD TMMON YYYY');
      to_char
   -------------
!  01 APR 2010
   (1 row)

   SELECT to_char(date '2010-04-01', 'DD TMMON YYYY' COLLATE "tr%icu");
      to_char
   -------------
!  01 APR 2010
   (1 row)

   -- backwards parsing

======================================================================

- The code no longer compiles without HAVE_LOCALE_T.

- I do not like how it is not obvious which is the default version of
every locale. E.g. I believe that "sv%icu" is "sv_reformed%icu" and not
"sv_standard%icu" as one might expect. Is this inherent to ICU or
something we can work around?

- ICU talks about a new syntax for locale extensions (old:
de_DE@collation=phonebook, new: de_DE_u_co_phonebk) at this page
http://userguide.icu-project.org/collation/api. Is this something we
need to car about? It looks like we currently use the old format, and
while I personally prefer it I am not sure we should rely on an old syntax.

- I get an error when creating a new locale.

#CREATE COLLATION sv2 (LOCALE = 'sv');
ERROR:  could not create locale "sv": Success

# CREATE COLLATION sv2 (LOCALE = 'sv');
ERROR:  could not create locale "sv": Resource temporarily unavailable
Time: 1.109 ms

== Code review

- For the collprovider is it really correct to say that 'd' is the
default value as it does in catalogs.sgml?

- I do not know what the policy for formatting the documentation is, but
some of the paragraphs are in need of re-wrapping.

- Add a hint to "ERROR:  conflicting or redundant options". The error
message is pretty unclear.

- I am not a fan of this patch comparing the encoding with a -1 literal.
How about adding -1 as a value to the enum? See the example below for a
place where the patch compares with -1.

             ereport(NOTICE,
                 (errcode(ERRCODE_DUPLICATE_OBJECT),
-                errmsg("collation \"%s\" for encoding \"%s\" already
exists, skipping",
-                       collname, pg_encoding_to_char(collencoding))));
+                collencoding == -1
+                ? errmsg("collation \"%s\" already exists, skipping",
+                         collname)
+                : errmsg("collation \"%s\" for encoding \"%s\" already
exists, skipping",
+                         collname, pg_encoding_to_char(collencoding))));
             return InvalidOid;

- The patch adds "FIXME" in the below. Is that a left-over from
development or something which still needs doing?

     /*
      * Also forbid matching an any-encoding entry.  This test of course
is not
      * backed up by the unique index, but it's not a problem since we don't
-    * support adding any-encoding entries after initdb.
+    * support adding any-encoding entries after initdb. FIXME
      */

- Should functions like normalize_locale_name() be renamed to indicate
they relate to libc locales? I am leaning towards doing so but have not
looked closely at the task.

Andreas


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Peter Eisentraut-6
Updated patch attached.


On 3/14/17 22:15, Andreas Karlsson wrote:
> I do not like the schema based solution since search_path already gives
> us enough headaches. As for the naming I am fine with the current scheme.

Yeah, it seems we're going to settle on the suffix idea.  See below for
an idea that builds on that.

> - I get a test failures in the default test suite due to not having the
> tr_TR locale installed. I would assume that this would be pretty common
> for hackers.

I have removed that test.  It seems it's not possible to test that
portably without major contortions.

> - The code no longer compiles without HAVE_LOCALE_T.

Fixed that.

> - I do not like how it is not obvious which is the default version of
> every locale. E.g. I believe that "sv%icu" is "sv_reformed%icu" and not
> "sv_standard%icu" as one might expect. Is this inherent to ICU or
> something we can work around?

We get these keywords from ucol_getKeywordValuesForLocale(), which says
"Given a key and a locale, returns an array of string values in a
preferred order that would make a difference."  So all those are
supposedly different from each other.

> - ICU talks about a new syntax for locale extensions (old:
> de_DE@collation=phonebook, new: de_DE_u_co_phonebk) at this page
> http://userguide.icu-project.org/collation/api. Is this something we
> need to car about? It looks like we currently use the old format, and
> while I personally prefer it I am not sure we should rely on an old syntax.

Interesting.  I hadn't see this before, and the documentation is sparse.
 But it seems that this refers to BCP 47 language tags, which seem like
a good idea.

So what I have done is change all the predefined ICU collations to be
named after the BCP 47 scheme, with a "private use" -x-icu suffix
(instead of %icu).  The preserves the original idea but uses a standard
naming scheme.

I'm not terribly worried that they are going to remove the "old" locale
naming, but just to be forward looking, I have changed it so that the
collcollate entries are made using the "new" naming for ICU >=54.

> - I get an error when creating a new locale.
>
> #CREATE COLLATION sv2 (LOCALE = 'sv');
> ERROR:  could not create locale "sv": Success
>
> # CREATE COLLATION sv2 (LOCALE = 'sv');
> ERROR:  could not create locale "sv": Resource temporarily unavailable
> Time: 1.109 ms

Hmm, that's pretty straightforward code.  What is your operating system?
 What are the build options?  Does it work without this patch?

> - For the collprovider is it really correct to say that 'd' is the
> default value as it does in catalogs.sgml?

It doesn't say it's the default value, it says it uses the database
default.  This is all a bit confusing.  We have a collation object named
"default", which uses the locale set for the database.  That's been that
way for a while.  Now when introducing the collation providers, that
"default" collation gets its own collprovider category 'd'.  That is not
really used anywhere, but we have to put something there.

> - I do not know what the policy for formatting the documentation is, but
> some of the paragraphs are in need of re-wrapping.

Hmm, I don't see anything terribly bad.

> - Add a hint to "ERROR:  conflicting or redundant options". The error
> message is pretty unclear.

I don't see that in my patch.  Example?

> - I am not a fan of this patch comparing the encoding with a -1 literal.
> How about adding -1 as a value to the enum? See the example below for a
> place where the patch compares with -1.

That's existing practice.  Not a great practice, probably, but widespread.

> - The patch adds "FIXME" in the below. Is that a left-over from
> development or something which still needs doing?
>
>      /*
>       * Also forbid matching an any-encoding entry.  This test of course
> is not
>       * backed up by the unique index, but it's not a problem since we don't
> -    * support adding any-encoding entries after initdb.
> +    * support adding any-encoding entries after initdb. FIXME
>       */
I had mentioned that upthread.  It technically needs "doing" as you say,
but it's not clear how and it's not terribly important, arguably.

> - Should functions like normalize_locale_name() be renamed to indicate
> they relate to libc locales? I am leaning towards doing so but have not
> looked closely at the task.

Renamed.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

v6-0001-ICU-support.patch (215K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Andreas Karlsson
On 03/15/2017 05:33 PM, Peter Eisentraut wrote:
> Updated patch attached.

Ok, I applied to patch again and ran the tests. I get a test failure on
make check-world in the pg_dump tests but it can be fixed with the below.

diff --git a/src/bin/pg_dump/t/002_pg_dump.pl
b/src/bin/pg_dump/t/002_pg_dump.pl
index 3cac4a9ae0..7d9c90363b 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -2422,7 +2422,7 @@ qr/^\QINSERT INTO test_fifth_table (col1, col2,
col3, col4, col5) VALUES (NULL,
           'CREATE COLLATION test0 FROM "C";',
         regexp =>
           qr/^
-         \QCREATE COLLATION test0 (lc_collate = 'C', lc_ctype = 'C');\E/xm,
+         \QCREATE COLLATION test0 (provider = 'libc', locale = 'C');\E/xm,
         like => {
             binary_upgrade           => 1,
             clean                    => 1,

>> - I do not like how it is not obvious which is the default version of
>> every locale. E.g. I believe that "sv%icu" is "sv_reformed%icu" and not
>> "sv_standard%icu" as one might expect. Is this inherent to ICU or
>> something we can work around?
>
> We get these keywords from ucol_getKeywordValuesForLocale(), which says
> "Given a key and a locale, returns an array of string values in a
> preferred order that would make a difference."  So all those are
> supposedly different from each other.

I believe you are mistaken. The locale "sv" is just an alias for
"sv-u-standard" as far as I can tell. See the definition of the Swedish
locale
(http://source.icu-project.org/repos/icu/trunk/icu4c/source/data/coll/sv.txt)
and there are just three collations: reformed (default), standard,
search (plus eot and emoji which are inherited).

I am also quite annoyed at col-emoji and col-eor (European Ordering
Rules). They are defined at the root and inherited by all languages, but
no language modifies col-emoji for their needs which makes it a foot
gun. See the Danish sorting example below where at least I expected the
same order. For col-eor it makes a bit more sense since I would expect
the locales en_GB-u-col-eot and sv_SE-u-col-eot to collate exactly the same.

# SELECT * FROM (VALUES ('a'), ('k'), ('aa')) q (x) ORDER BY x COLLATE
"da-x-icu";
  x
----
  a
  k
  aa
(3 rows)

# SELECT * FROM (VALUES ('a'), ('k'), ('aa')) q (x) ORDER BY x COLLATE
"da-u-co-emoji-x-icu";
  x
----
  a
  aa
  k
(3 rows)

It seems ICU has made quite the mess here, and I am not sure to what
degree we need to handle it to avoid our users getting confused. I need
to think some of it, and would love input from others. Maybe the right
thing to do is to ignore the issue with col-emoji, but I would want to
do something about the default collations.

>> - ICU talks about a new syntax for locale extensions (old:
>> de_DE@collation=phonebook, new: de_DE_u_co_phonebk) at this page
>> http://userguide.icu-project.org/collation/api. Is this something we
>> need to car about? It looks like we currently use the old format, and
>> while I personally prefer it I am not sure we should rely on an old syntax.
>
> Interesting.  I hadn't see this before, and the documentation is sparse.
>  But it seems that this refers to BCP 47 language tags, which seem like
> a good idea.
>
> So what I have done is change all the predefined ICU collations to be
> named after the BCP 47 scheme, with a "private use" -x-icu suffix
> (instead of %icu).  The preserves the original idea but uses a standard
> naming scheme.
>
> I'm not terribly worried that they are going to remove the "old" locale
> naming, but just to be forward looking, I have changed it so that the
> collcollate entries are made using the "new" naming for ICU >=54.

Sounds good.

>> - I get an error when creating a new locale.
>>
>> #CREATE COLLATION sv2 (LOCALE = 'sv');
>> ERROR:  could not create locale "sv": Success
>>
>> # CREATE COLLATION sv2 (LOCALE = 'sv');
>> ERROR:  could not create locale "sv": Resource temporarily unavailable
>> Time: 1.109 ms
>
> Hmm, that's pretty straightforward code.  What is your operating system?
>  What are the build options?  Does it work without this patch?

This issue is unrelated to ICU. I had forgot to specify provider so the
eorrs are correct (even though that the value of the errno is weird).

>> - For the collprovider is it really correct to say that 'd' is the
>> default value as it does in catalogs.sgml?
>
> It doesn't say it's the default value, it says it uses the database
> default.  This is all a bit confusing.  We have a collation object named
> "default", which uses the locale set for the database.  That's been that
> way for a while.  Now when introducing the collation providers, that
> "default" collation gets its own collprovider category 'd'.  That is not
> really used anywhere, but we have to put something there.

Ah, I see now. Hm, that is a bit awkward but I cannot think of a cleaner
alternative.

>> - I do not know what the policy for formatting the documentation is, but
>> some of the paragraphs are in need of re-wrapping.
>
> Hmm, I don't see anything terribly bad.

Maybe it is just me who is sensitive to wrapping, but her is an example
which sticks out to me with its 98 character line.

+   <para>
+    A collation object provided by <literal>libc</literal> maps to a
combination
+    of <symbol>LC_COLLATE</symbol> and <symbol>LC_CTYPE</symbol>
settings.  (As
      the name would suggest, the main purpose of a collation is to set
      <symbol>LC_COLLATE</symbol>, which controls the sort order.  But
      it is rarely necessary in practice to have an
      <symbol>LC_CTYPE</symbol> setting that is different from
      <symbol>LC_COLLATE</symbol>, so it is more convenient to collect
      these under one concept than to create another infrastructure for
-    setting <symbol>LC_CTYPE</symbol> per expression.)  Also, a collation
+    setting <symbol>LC_CTYPE</symbol> per expression.)  Also, a
<literal>libc</literal> collation
      is tied to a character set encoding (see <xref linkend="multibyte">).
      The same collation name may exist for different encodings.
     </para>

>> - Add a hint to "ERROR:  conflicting or redundant options". The error
>> message is pretty unclear.
>
> I don't see that in my patch.  Example?

It is for the below. But after some thinking I am fine not fixing it.

# CREATE COLLATION svi (LOCALE = 'sv', LC_COLLATE = 'sv', PROVIDER =  icu);
ERROR:  conflicting or redundant options

>> - I am not a fan of this patch comparing the encoding with a -1 literal.
>> How about adding -1 as a value to the enum? See the example below for a
>> place where the patch compares with -1.
>
> That's existing practice.  Not a great practice, probably, but widespread.

Ok, if it is widespread in the code then let's keep using it. In a case
like this consistency is just as important.

>> - The patch adds "FIXME" in the below. Is that a left-over from
>> development or something which still needs doing?
>>
>>      /*
>>       * Also forbid matching an any-encoding entry.  This test of course
>> is not
>>       * backed up by the unique index, but it's not a problem since we don't
>> -    * support adding any-encoding entries after initdb.
>> +    * support adding any-encoding entries after initdb. FIXME
>>       */
>
> I had mentioned that upthread.  It technically needs "doing" as you say,
> but it's not clear how and it's not terribly important, arguably.

The comment is no longer true since for ICU we can do that (it is not an
issue though). At the very least this comment needs to be updated.

Andreas


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Peter Eisentraut-6
On 3/19/17 18:31, Andreas Karlsson wrote:
> Ok, I applied to patch again and ran the tests. I get a test failure on
> make check-world in the pg_dump tests but it can be fixed with the below.

Fixed.  This was a new pg_dump test that had been added in the meantime.

> I believe you are mistaken. The locale "sv" is just an alias for
> "sv-u-standard" as far as I can tell. See the definition of the Swedish
> locale
> (http://source.icu-project.org/repos/icu/trunk/icu4c/source/data/coll/sv.txt)
> and there are just three collations: reformed (default), standard,
> search (plus eot and emoji which are inherited).

I have filed bug <https://ssl.icu-project.org/trac/ticket/13050> about
this.  I don't think we can work around this by simply ignoring the
"standard" variant.  For example, the URL you show indicates that the
default variant for "sv" is "reformed", not "standard".

> I am also quite annoyed at col-emoji and col-eor (European Ordering
> Rules). They are defined at the root and inherited by all languages, but
> no language modifies col-emoji for their needs which makes it a foot
> gun. See the Danish sorting example below where at least I expected the
> same order. For col-eor it makes a bit more sense since I would expect
> the locales en_GB-u-col-eot and sv_SE-u-col-eot to collate exactly the same.

I think this is just the way the inheritance of locales works there.  If
the root has a variant and the more specific locale doesn't, then if you
specify the variant, it gets pulled from the root.  It is a bit dubious.
 Arguably, either the customizations of the specific locale should be
applied to all variants, or ucol_getKeywordValuesForLocale() shouldn't
return them, or both.  Perhaps another bug report.

>>> - I get an error when creating a new locale.
>>>
>>> #CREATE COLLATION sv2 (LOCALE = 'sv');
>>> ERROR:  could not create locale "sv": Success
>>>
>>> # CREATE COLLATION sv2 (LOCALE = 'sv');
>>> ERROR:  could not create locale "sv": Resource temporarily unavailable
>>> Time: 1.109 ms
>>
>> Hmm, that's pretty straightforward code.  What is your operating system?
>>  What are the build options?  Does it work without this patch?
>
> This issue is unrelated to ICU. I had forgot to specify provider so the
> eorrs are correct (even though that the value of the errno is weird).
Perhaps we can do something about this anyway.  What is our OS and version?

>>> - I do not know what the policy for formatting the documentation is, but
>>> some of the paragraphs are in need of re-wrapping.
>>
>> Hmm, I don't see anything terribly bad.
>
> Maybe it is just me who is sensitive to wrapping, but her is an example
> which sticks out to me with its 98 character line.

I see.  Fixed some of those.

>>> - The patch adds "FIXME" in the below. Is that a left-over from
>>> development or something which still needs doing?
>>>
>>>      /*
>>>       * Also forbid matching an any-encoding entry.  This test of course
>>> is not
>>>       * backed up by the unique index, but it's not a problem since we don't
>>> -    * support adding any-encoding entries after initdb.
>>> +    * support adding any-encoding entries after initdb. FIXME
>>>       */
>>
>> I had mentioned that upthread.  It technically needs "doing" as you say,
>> but it's not clear how and it's not terribly important, arguably.
>
> The comment is no longer true since for ICU we can do that (it is not an
> issue though). At the very least this comment needs to be updated.
Fixed that by some additional checks and using ShareRowExclusiveLock in
CREATE COLLATION to prevent concurrent changes.


New patch attached.

I have also implemented additional facilities to check collations
needing refresh after an upgrade (see ALTER COLLATION ref page), and
some necessary bits for that in pg_dump.  Some additional support in
pg_upgrade could be useful, but I think we don't need that until people
are upgrading *from* PG10.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

v7-0001-ICU-support.patch (229K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Andreas Karlsson
I am fine with this version of the patch. The issues I have with it,
which I mentioned earlier in this thread, seem to be issues with ICU
rather than with this patch. For example there seems to be no way for
ICU to validate the syntax of the BCP 47 locales (or ICU's old format).
But I think we will just have to accept the weirdness of how ICU handles
locales.

I think this patch is ready to be committed.

Found a typo in the documentation:

"The inspect the currently available locales" should be "To inspect the
currently available locales".

Andreas


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Peter Eisentraut-6
On 3/23/17 05:34, Andreas Karlsson wrote:

> I am fine with this version of the patch. The issues I have with it,
> which I mentioned earlier in this thread, seem to be issues with ICU
> rather than with this patch. For example there seems to be no way for
> ICU to validate the syntax of the BCP 47 locales (or ICU's old format).
> But I think we will just have to accept the weirdness of how ICU handles
> locales.
>
> I think this patch is ready to be committed.
>
> Found a typo in the documentation:
>
> "The inspect the currently available locales" should be "To inspect the
> currently available locales".

Committed.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Jeff Janes
On Thu, Mar 23, 2017 at 12:34 PM, Peter Eisentraut <[hidden email]> wrote:
On 3/23/17 05:34, Andreas Karlsson wrote:
> I am fine with this version of the patch. The issues I have with it,
> which I mentioned earlier in this thread, seem to be issues with ICU
> rather than with this patch. For example there seems to be no way for
> ICU to validate the syntax of the BCP 47 locales (or ICU's old format).
> But I think we will just have to accept the weirdness of how ICU handles
> locales.
>
> I think this patch is ready to be committed.
>
> Found a typo in the documentation:
>
> "The inspect the currently available locales" should be "To inspect the
> currently available locales".

Committed.

This has broken the C locale, and the build farm.

if (pg_database_encoding_max_length() > 1 || locale->provider == COLLPROVIDER_ICU)

segfaults because locale is null.  (locale_is_c is true)

Cheers,

Jeff
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Thomas Munro-3
In reply to this post by Peter Eisentraut-6
On Fri, Mar 24, 2017 at 8:34 AM, Peter Eisentraut
<[hidden email]> wrote:
> Committed.

On a macOS system using MacPorts, I have "icu" installed.  MacPorts is
a package manager that installs stuff under /opt/local.  I have
/opt/local/bin in $PATH so that pkg-config can be found.  I run
./configure with --with-icu but without mentioning any special paths
and it says:

checking whether to build with ICU support... yes
checking for pkg-config... /opt/local/bin/pkg-config
checking pkg-config is at least version 0.9.0... yes
checking for icu-uc icu-i18n... yes

... but then building fails, because there are no headers in the search path:

gcc -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -Wno-unused-command-line-argument -g -O2 -Wall -Werror
-I../../../src/include/snowball
-I../../../src/include/snowball/libstemmer -I../../../src/include   -c
-o dict_snowball.o dict_snowball.c -MMD -MP -MF .deps/dict_snowball.Po
...
../../../src/include/utils/pg_locale.h:19:10: fatal error:
'unicode/ucol.h' file not found

But pkg-config does know where to find those headers:

$ pkg-config --cflags icu-i18n
-I/opt/local/include

... and it's not wrong:

$ ls /opt/local/include/unicode/ucol.h
/opt/local/include/unicode/ucol.h

So I think there may be a bug in the configure script: I think it
should be putting pkg-config's --cflags output into one of the
CFLAGS-like variables.

Or am I misunderstanding what pkg-config is supposed to be doing for us here?

--
Thomas Munro
http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
12345