ICU integration

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
83 messages Options
12345
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Thomas Munro-3
On Fri, Sep 23, 2016 at 6:27 PM, Thomas Munro
<[hidden email]> wrote:

> On Wed, Aug 31, 2016 at 2:46 PM, Peter Eisentraut
> <[hidden email]> wrote:
>> Here is a patch I've been working on to allow the use of ICU for sorting
>> and other locale things.
>
> This is very interesting work, and it's great to see some development
> in this area.  I've been peripherally involved in more than one
> collation-change-broke-my-data incident over the years.  I took the
> patch for a quick spin today.  Here are a couple of initial
> observations.

This seems like a solid start, but there are unresolved questions
about both high level goals (versioning strategy etc) and also some
technical details with this WIP patch.  It looks like several people
have an interest and ideas in this area, but clearly there isn't going
to be a committable patch in the next 48 hours.  So I will set this to
'Returned with Feedback' for now.  If you think you'll have a new
patch for the next CF then it looks like you can still 'Move to Next
CF' from 'Returned with Feedback' state if appropriate.  Thanks!

--
Thomas Munro
http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Peter Eisentraut-6
In reply to this post by Thomas Munro-3
On 9/23/16 2:27 AM, Thomas Munro wrote:
> This patch adds pkg-config support to our configure script, in order
> to produce the build options for ICU.  That's cool, and I'm a fan of
> pkg-config, but it's an extra dependency that I just wanted to
> highlight.  For example MacOSX appears to ship with ICU but has is no
> pkg-config or ICU .pc files out of the box (erm, I can't even find the
> headers, so maybe that copy of ICU is useless to everyone except
> Apple).

The Homebrew package icu4c contains this note:

    OS X provides libicucore.dylib (but nothing else).

That probably explains what you are seeing.

> There is something missing from the configure script however: the
> output of pkg-config is not making it into CFLAGS or LDFLAGS, so
> building fails on FreeBSD and MacOSX where for example
> <unicode/ucnv.h> doesn't live in the default search path.

It sets ICU_CFLAGS and ICU_LIBS, but it seems I didn't put ICU_CFLAGS in
the backend makefiles.  So that should be easy to fix.

> You call the built-in strcoll/strxfrm collation provider "posix", and
> although POSIX does define those functions, it only does so because it
> inherits them from ISO C90.

POSIX defines strcoll_l() and such.  But I agree SYSTEM might be better.

> I notice that you use encoding -1, meaning "all".

The use of encoding -1 is existing behavior.

> I haven't fully
> grokked what that really means but it appears that we won't be able to
> create any new such collations after initdb using CREATE COLLATION, if
> for example you update your ICU installation and now have a new
> collation available.  When I tried dropping some and recreating them
> they finished up with collencoding = 6.  Should the initdb-created
> rows use 6 too?

Good observation.  That will need some fine-tuning.

> The warning persists even after I reindex all affected tables (and
> start a new backend), because you're only recording the collation at
> pg_collation level and then only setting it at initdb time, so that
> HINT isn't much use for clearing the warning.  I think you'd need to
> record a snapshot of the version used for each collation that was used
> on *each index*, and update it whenever you CREATE INDEX or REINDEX
> etc.  There can of course be more than one collation and thus version
> involved, if you have columns with different collations, so I guess
> you'd need an column to hold an array of version snapshots whose order
> corresponds to pg_index.indcollation's.

Hmm, yeah, that will need more work.  To be completely correct, I think,
we'd also need to record the version in each expression node, so that
check constraints of the form CHECK (x > 'abc') can be handled.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Thomas Munro-3
On Sat, Oct 1, 2016 at 3:30 AM, Peter Eisentraut
<[hidden email]> wrote:

> On 9/23/16 2:27 AM, Thomas Munro wrote:
>> The warning persists even after I reindex all affected tables (and
>> start a new backend), because you're only recording the collation at
>> pg_collation level and then only setting it at initdb time, so that
>> HINT isn't much use for clearing the warning.  I think you'd need to
>> record a snapshot of the version used for each collation that was used
>> on *each index*, and update it whenever you CREATE INDEX or REINDEX
>> etc.  There can of course be more than one collation and thus version
>> involved, if you have columns with different collations, so I guess
>> you'd need an column to hold an array of version snapshots whose order
>> corresponds to pg_index.indcollation's.
>
> Hmm, yeah, that will need more work.  To be completely correct, I think,
> we'd also need to record the version in each expression node, so that
> check constraints of the form CHECK (x > 'abc') can be handled.

Hmm.  That is quite a rabbit hole.  In theory you need to recheck such
a constraint, but it's not at all clear when you should recheck and
what you should do about it if it fails.  Similar for the future
PARTITION feature.

--
Thomas Munro
http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Peter Eisentraut-6
On 9/30/16 4:32 PM, Thomas Munro wrote:
>> Hmm, yeah, that will need more work.  To be completely correct, I think,
>> > we'd also need to record the version in each expression node, so that
>> > check constraints of the form CHECK (x > 'abc') can be handled.
> Hmm.  That is quite a rabbit hole.  In theory you need to recheck such
> a constraint, but it's not at all clear when you should recheck and
> what you should do about it if it fails.  Similar for the future
> PARTITION feature.

I think it's not worth dealing with this in that much detail at the
moment.  It's not like the collation will just randomly change under you
(unlike with glibc).  It would have to involve pg_upgrade, physical
replication, or a rebuilt installation.  So I think I will change the
message to something to the effect of "however you got here, you can't
do that".  We can develop some recipes and ideas on the side for how to
recover situations like that and then maybe integrate tooling for that
later.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Peter Eisentraut-6
In reply to this post by Peter Eisentraut-6
Updated patch attached.

The previous round of reviews showed that there was general agreement on
the approach.  So I have focused on filling in the gaps, added ICU
support to all the locale-using places, added documentation, fixed some
minor issues that have been pointed out.  Everything appears to work
correctly now, and the user-facing feature set is pretty well-rounded.

I don't have much experience with the abbreviated key stuff.  I have
filled in what I think should work, but it needs detailed review.

Similarly, some of the stuff in the regular expression code was hacked
in blindly.

One minor problem is that we now support adding any-encoding collation
entries, which violates some of the comments in CollationCreate().  See
FIXME there.  It doesn't seem worth major contortions to fix that; maybe
it just has to be documented better.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

v2-0001-ICU-support.patch (157K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Peter Geoghegan-3
On Tue, Dec 27, 2016 at 6:50 PM, Peter Eisentraut
<[hidden email]> wrote:
> I don't have much experience with the abbreviated key stuff.  I have
> filled in what I think should work, but it needs detailed review.

Thanks.

It occurs to me that the comparison caching stuff added by commit
0e57b4d8b needs to be considered here, too. When we had to copy the
string to a temp buffer anyway, in order to add the terminating NUL
byte expected by strcoll(), there was an opportunity to do caching of
comparisons at little additional cost. However, since ICU offers an
interface that you're using that doesn't require any NUL byte, there
is a new trade-off to be considered -- swallow the cost of copying
into our own temp buffer solely for the benefit of comparison caching,
or don't do comparison caching. (Note that glibc had a similar
comparison caching optimization itself at one point, built right into
strcoll(), but it was subsequently disabled.)

--
Peter Geoghegan


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Peter Eisentraut-6
On 1/7/17 10:01 PM, Peter Geoghegan wrote:

> It occurs to me that the comparison caching stuff added by commit
> 0e57b4d8b needs to be considered here, too. When we had to copy the
> string to a temp buffer anyway, in order to add the terminating NUL
> byte expected by strcoll(), there was an opportunity to do caching of
> comparisons at little additional cost. However, since ICU offers an
> interface that you're using that doesn't require any NUL byte, there
> is a new trade-off to be considered -- swallow the cost of copying
> into our own temp buffer solely for the benefit of comparison caching,
> or don't do comparison caching. (Note that glibc had a similar
> comparison caching optimization itself at one point, built right into
> strcoll(), but it was subsequently disabled.)

That might be worth looking into, but it seems a bit daunting to
construct a benchmark specifically for this, unless we have the one that
was originally used lying around somewhere.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Peter Geoghegan-3
On Mon, Jan 9, 2017 at 12:25 PM, Peter Eisentraut
<[hidden email]> wrote:
> On 1/7/17 10:01 PM, Peter Geoghegan wrote:
>> It occurs to me that the comparison caching stuff added by commit
>> 0e57b4d8b needs to be considered here, too.

> That might be worth looking into, but it seems a bit daunting to
> construct a benchmark specifically for this, unless we have the one that
> was originally used lying around somewhere.

The benchmark used when 0e57b4d8b went in only had to prove that there
was no measurable overhead when the optimization didn't help (It was
quite clear that it was worthwhile in good cases). I think that
comparison caching will continue to be about as effective as before in
good cases, but you don't do comparison caching anymore. That might be
fine, but let's be sure that that's the right trade-off.

To demonstrate the effectiveness of the patch, I used this cities sample data:

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/data/cities.dump

Test query: select country, province, count(*) from cities group by
rollup (country, province);

This was shown to be about 25% faster, although that was with
abbreviated keys (plus caching of abbreviated keys), and not just the
comparison caching optimization.

--
Peter Geoghegan


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Peter Geoghegan-3
In reply to this post by Peter Eisentraut-6
On Tue, Dec 27, 2016 at 6:50 PM, Peter Eisentraut
<[hidden email]> wrote:
> Updated patch attached.

Some more things I noticed following another quick read over the patch:

* I think it's worth looking into ucol_nextSortKeyPart(), and using
that as an alternative to ucol_getSortKey(). It doesn't seem any
harder, and when I tested it it was clearly faster. (I think that
ucol_nextSortKeyPart() is more or less intended to be used for
abbreviated keys.)

* I think that it's not okay that convert_string_datum() still uses
strxfrm() without considering if it's an ICU build. That's why I
raised the idea of a pg_strxfrm() wrapper at one point.

* Similarly, I think that check_strxfrm_bug() should have something
about ICU. It's looking for a particular bug in some very old version
of Solaris 8. At a minimum, check_strxfrm_bug() should now not run at
all (a broken OS strxfrm() shouldn't be a problem with ICU).
Otherwise, it could do some kind of testing on our pg_strxfrm()
wrapper (or similar).

--
Peter Geoghegan


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Pavel Stehule
In reply to this post by Peter Eisentraut-6
Hi

2016-12-28 3:50 GMT+01:00 Peter Eisentraut <[hidden email]>:
Updated patch attached.

The previous round of reviews showed that there was general agreement on
the approach.  So I have focused on filling in the gaps, added ICU
support to all the locale-using places, added documentation, fixed some
minor issues that have been pointed out.  Everything appears to work
correctly now, and the user-facing feature set is pretty well-rounded.

I don't have much experience with the abbreviated key stuff.  I have
filled in what I think should work, but it needs detailed review.

Similarly, some of the stuff in the regular expression code was hacked
in blindly.

One minor problem is that we now support adding any-encoding collation
entries, which violates some of the comments in CollationCreate().  See
FIXME there.  It doesn't seem worth major contortions to fix that; maybe
it just has to be documented better.

the regress test fails

 Program received signal SIGSEGV, Segmentation fault.
0x00000000007bbc2b in pattern_char_isalpha (locale_is_c=0 '\000', locale=0x1a73220, is_multibyte=1 '\001', c=97 'a') at selfuncs.c:5291
5291 return isalpha_l((unsigned char) c, locale->lt);


(gdb) bt
#0  0x00000000007bbc2b in pattern_char_isalpha (locale_is_c=0 '\000', locale=0x1a73220, is_multibyte=1 '\001', c=97 'a') at selfuncs.c:5291
#1  like_fixed_prefix (patt_const=<optimized out>, case_insensitive=<optimized out>, collation=<optimized out>, prefix_const=0x7ffc0963e800, 
    rest_selec=0x7ffc0963e808) at selfuncs.c:5389
#2  0x00000000007c1076 in patternsel (fcinfo=<optimized out>, ptype=ptype@entry=Pattern_Type_Like_IC, negate=negate@entry=0 '\000')
    at selfuncs.c:1228
#3  0x00000000007c1680 in iclikesel (fcinfo=<optimized out>) at selfuncs.c:1406
#4  0x000000000080db56 in OidFunctionCall4Coll (functionId=<optimized out>, collation=collation@entry=12886, arg1=arg1@entry=28299032, 
    arg2=arg2@entry=1627, arg3=arg3@entry=28300096, arg4=arg4@entry=0) at fmgr.c:1674
#5  0x000000000068e424 in restriction_selectivity (root=root@entry=0x1afcf18, operatorid=1627, args=0x1afd340, inputcollid=12886, 
    varRelid=varRelid@entry=0) at plancat.c:1583
#6  0x000000000065457e in clause_selectivity (root=0x1afcf18, clause=<optimized out>, varRelid=0, jointype=JOIN_INNER, sjinfo=0x0)
    at clausesel.c:657
#7  0x000000000065485c in clauselist_selectivity (root=root@entry=0x1afcf18, clauses=<optimized out>, varRelid=varRelid@entry=0, 
    jointype=jointype@entry=JOIN_INNER, sjinfo=sjinfo@entry=0x0) at clausesel.c:107
#8  0x00000000006599d4 in set_baserel_size_estimates (root=root@entry=0x1afcf18, rel=rel@entry=0x1afd500) at costsize.c:3771
#9  0x00000000006526e5 in set_plain_rel_size (rte=<optimized out>, rel=<optimized out>, root=<optimized out>) at allpaths.c:492
#10 set_rel_size (root=root@entry=0x1afcf18, rel=rel@entry=0x1afd500, rti=rti@entry=1, rte=0x1acfb68) at allpaths.c:352
#11 0x0000000000653ebd in set_base_rel_sizes (root=<optimized out>) at allpaths.c:272
#12 make_one_rel (root=root@entry=0x1afcf18, joinlist=joinlist@entry=0x1afd810) at allpaths.c:170
#13 0x0000000000670ad4 in query_planner (root=root@entry=0x1afcf18, tlist=tlist@entry=0x1afd1b0, 
    qp_callback=qp_callback@entry=0x6710c0 <standard_qp_callback>, qp_extra=qp_extra@entry=0x7ffc0963f020) at planmain.c:254
#14 0x00000000006727cc in grouping_planner (root=root@entry=0x1afcf18, inheritance_update=inheritance_update@entry=0 '\000', 
    tuple_fraction=<optimized out>, tuple_fraction@entry=0) at planner.c:1729
#15 0x00000000006752c6 in subquery_planner (glob=glob@entry=0x1afcbe8, parse=parse@entry=0x1acfa50, parent_root=parent_root@entry=0x0, 
    hasRecursion=hasRecursion@entry=0 '\000', tuple_fraction=tuple_fraction@entry=0) at planner.c:789
#16 0x000000000067619f in standard_planner (parse=0x1acfa50, cursorOptions=256, boundParams=0x0) at planner.c:301
#17 0x00000000007095bd in pg_plan_query (querytree=0x1acfa50, cursorOptions=256, boundParams=0x0) at postgres.c:798
#18 0x000000000070968e in pg_plan_queries (querytrees=<optimized out>, cursorOptions=cursorOptions@entry=256, boundParams=boundParams@entry=0x0)
    at postgres.c:864
#19 0x000000000070b1e7 in exec_simple_query (query_string=0x1ace8c8 "SELECT * FROM collate_test1 WHERE b ILIKE 'abc';") at postgres.c:1029
#20 PostgresMain (argc=<optimized out>, argv=argv@entry=0x1a78988, dbname=<optimized out>, username=<optimized out>) at postgres.c:4067
#21 0x000000000046fc7d in BackendRun (port=0x1a6e6e0) at postmaster.c:4300
#22 BackendStartup (port=0x1a6e6e0) at postmaster.c:3972

[root@localhost backend]# dnf info libicu
Last metadata expiration check: 1:50:20 ago on Sun Jan 15 09:58:29 2017.
Installed Packages
Name        : libicu
Arch        : x86_64
Epoch       : 0
Version     : 57.1
Release     : 4.fc25
Size        : 29 M
Repo        : @System

Regards

Pavel



--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Peter Eisentraut-6
On 1/15/17 5:53 AM, Pavel Stehule wrote:
> the regress test fails
>
>  Program received signal SIGSEGV, Segmentation fault.
> 0x00000000007bbc2b in pattern_char_isalpha (locale_is_c=0 '\000',
> locale=0x1a73220, is_multibyte=1 '\001', c=97 'a') at selfuncs.c:5291
> 5291return isalpha_l((unsigned char) c, locale->lt);

Here is an updated patch that fixes this crash and is rebased on top of
recent changes.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

v3-0001-ICU-support.patch (152K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Peter Eisentraut-6
In reply to this post by Peter Geoghegan-3
On 1/9/17 3:45 PM, Peter Geoghegan wrote:
> * I think it's worth looking into ucol_nextSortKeyPart(), and using
> that as an alternative to ucol_getSortKey(). It doesn't seem any
> harder, and when I tested it it was clearly faster. (I think that
> ucol_nextSortKeyPart() is more or less intended to be used for
> abbreviated keys.)

I will try to look into that.

> * I think that it's not okay that convert_string_datum() still uses
> strxfrm() without considering if it's an ICU build. That's why I
> raised the idea of a pg_strxfrm() wrapper at one point.

That code works in a locale-agnostic way now, which might be
questionable, but it's not the fault of this patch, I think.

> * Similarly, I think that check_strxfrm_bug() should have something
> about ICU. It's looking for a particular bug in some very old version
> of Solaris 8. At a minimum, check_strxfrm_bug() should now not run at
> all (a broken OS strxfrm() shouldn't be a problem with ICU).

But we'll still be using strxfrm for the database default locale, so we
need to keep that check.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Thomas Munro-3
In reply to this post by Peter Geoghegan-3
On Tue, Jan 10, 2017 at 9:45 AM, Peter Geoghegan <[hidden email]> wrote:
> * I think it's worth looking into ucol_nextSortKeyPart(), and using
> that as an alternative to ucol_getSortKey(). It doesn't seem any
> harder, and when I tested it it was clearly faster. (I think that
> ucol_nextSortKeyPart() is more or less intended to be used for
> abbreviated keys.)

+1

I assume (but haven't checked) that ucol_nextSortKeyPart accesses only
the start of the string via the UCharIterator passed in, unless you
have the rare reverse-accent-sort feature enabled (maybe used only in
fr_CA, it looks like it is required to scan the whole string looking
for the last accent).  So I assume that uiter_setUTF8 and
ucol_nextSortKeyPart would usually do a small fixed amount of work,
whereas this patch's icu_to_uchar allocates space and converts the
whole variable length string every time.

On the other hand, I see that this patch's icu_to_uchar can deal with
encodings other than UTF8.  At first glance, the UCharIterator
interface provides a way to make a UCharIterator that iterates over a
string of UChar or a string of UTF8, but I'm not sure how to make it
do character-by-character transcoding from arbitrary encodings via the
C API.  It seems like that should be possible using ucnv_getNextUChar
as the source of transcoded characters, but I'm not sure how to wire
those two things together (in C++ I think you'd subclass
icu::CharacterIterator).

That's about abbreviation, but I note that you can also compare
strings using iterators with ucol_strcollIter, avoiding the need to
allocate and transcode up front.  I have no idea whether that'd pay
off.

+   A change in collation definitions can lead to corrupt indexes and other
+   problems where the database system relies on stored objects having a
+   certain sort order.  Generally, this should be avoided, but it can happen
+   in legitimate circumstances, such as when
+   using <command>pg_upgrade</command> to upgrade to server binaries linked
+   with a newer version of ICU.  When this happens, all objects depending on
+   the collation should be rebuilt, for example,
+   using <command>REINDEX</command>.  When that is done, the collation version
+   can be refreshed using the command <literal>ALTER COLLATION ... REFRESH
+   VERSION</literal>.  This will update the system catalog to record the
+   current collator version and will make the warning go away.  Note that this
+   does not actually check whether all affected objects have been rebuilt

I think this is a pretty reasonable first approach to this problem.
It's a simple way to flag up a problem to the DBA, but leaves all the
responsibility for figuring out how to fix it to the DBA.  I think we
should considering going further in later patches (tracking the
version used at last rebuild per index etc as discussed, so that the
condition is cleared only by rebuilding the affected things).

(REPARTITION anyone?)

As far as I know there are two things moving: ICU code and CLDR data.
Here we can see the CLDR versions being pulled into ICU:

http://bugs.icu-project.org/trac/log/icu/trunk/source/data/locales?rev=39273

Clearly when you upgrade your system from (say) Debian 8 to Debian 9
and the ICU major version changes we expect to have to REINDEX, but
does anyone know if such data changes are ever pulled into the minor
version package upgrades you get from regular apt-get update of (say)
a Debian 8 or CentOS 7 or FreeBSD 11 system?  In other words, do we
expect collversion changes to happen basically any time in response to
regular system updates, or only when you're doing a major upgrade of
some kind, as the above-quoted documentation patch implies?

+               collversion = ntohl(*((uint32 *) versioninfo));

UVersionInfo is an array of four uint8_t.  That doesn't sound like
something that needs to be bit-swizzled... is it really?  Even if it
is arranged differently on different architectures, I'm not sure why
you care since we only ever use it to compare for equality on the same
system.  But aside from that, I don't love this cast to uint32.  I
wonder if we should use u_versionToString to respect boundaries a bit
better?

I have another motivation for wanting to model collation versions as
strings: I have been contemplating a version check for system-provided
collations too, piggy-backing on your work here.  Obviously PostgreSQL
can't directly know anything about system collation versions, but I'm
thinking of a GUC system_collation_version_command which you could
optionally set to a script that knows how to inspect the local
operating system.  For example, a package maintainer might be
interested in writing such a script that knows how to ask the package
system for a locale data version.  A brute force approach that works
on many systems could be 'tar c /usr/share/local/*/LC_COLLATE | md5'.
A string would provide more leeway than a measly int32.  That's
independent of this patch and you might hate the whole idea, but seems
to be the kind of thing you anticipated when you described collversion
as "[p]rovider-specific version of the collation".

--
Thomas Munro
http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Michael Paquier
In reply to this post by Peter Eisentraut-6
On Wed, Jan 25, 2017 at 2:44 AM, Peter Eisentraut
<[hidden email]> wrote:

> On 1/15/17 5:53 AM, Pavel Stehule wrote:
>> the regress test fails
>>
>>  Program received signal SIGSEGV, Segmentation fault.
>> 0x00000000007bbc2b in pattern_char_isalpha (locale_is_c=0 '\000',
>> locale=0x1a73220, is_multibyte=1 '\001', c=97 'a') at selfuncs.c:5291
>> 5291return isalpha_l((unsigned char) c, locale->lt);
>
> Here is an updated patch that fixes this crash and is rebased on top of
> recent changes.

Patch that still applies + no reviews = moved to CF 2017-03.
--
Michael


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Pavel Stehule
In reply to this post by Peter Eisentraut-6
Hi

2017-01-24 18:44 GMT+01:00 Peter Eisentraut <[hidden email]>:
On 1/15/17 5:53 AM, Pavel Stehule wrote:
> the regress test fails
>
>  Program received signal SIGSEGV, Segmentation fault.
> 0x00000000007bbc2b in pattern_char_isalpha (locale_is_c=0 '\000',
> locale=0x1a73220, is_multibyte=1 '\001', c=97 'a') at selfuncs.c:5291
> 5291return isalpha_l((unsigned char) c, locale->lt);

Here is an updated patch that fixes this crash and is rebased on top of
recent changes.

This patch is not possible to compile on today master

commands/collationcmds.o: In function `AlterCollation':
/home/pavel/src/postgresql/src/backend/commands/collationcmds.c:297: undefined reference to `CatalogUpdateIndexes'
collect2: error: ld returned 1 exit status

Regards

Pavel

 

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Tom Lane-2
Pavel Stehule <[hidden email]> writes:
> This patch is not possible to compile on today master
> commands/collationcmds.o: In function `AlterCollation':
> /home/pavel/src/postgresql/src/backend/commands/collationcmds.c:297:
> undefined reference to `CatalogUpdateIndexes'

Evidently collateral damage from 2f5c9d9c9.  But I'd suggest waiting
to fix it until you can also do s/simple_heap_delete/CatalogTupleDelete/
as I proposed in
https://www.postgresql.org/message-id/462.1485902736@...

I'll go make that happen right now, now that I realize there are patch(es)
waiting on it.

                        regards, tom lane


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Tom Lane-2
I wrote:
> Evidently collateral damage from 2f5c9d9c9.  But I'd suggest waiting
> to fix it until you can also do s/simple_heap_delete/CatalogTupleDelete/
> as I proposed in
> https://www.postgresql.org/message-id/462.1485902736@...
> I'll go make that happen right now, now that I realize there are patch(es)
> waiting on it.

Done.  There's some more loose ends but they won't affect very many
call sites, so you should be able to rebase now.

                        regards, tom lane


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Peter Eisentraut-6
In reply to this post by Peter Eisentraut-6
On 1/24/17 12:45 PM, Peter Eisentraut wrote:
> On 1/9/17 3:45 PM, Peter Geoghegan wrote:
>> * I think it's worth looking into ucol_nextSortKeyPart(), and using
>> that as an alternative to ucol_getSortKey(). It doesn't seem any
>> harder, and when I tested it it was clearly faster. (I think that
>> ucol_nextSortKeyPart() is more or less intended to be used for
>> abbreviated keys.)
> I will try to look into that.

I think I have this sorted out.  What I don't understand, however, is
why varstr_abbrev_convert() makes a point of looping until the result of
strxfrm() fits into the output buffer (buf2), when we only need 8 bytes,
and we throw away the rest later.  Wouldn't it work to just request 8 bytes?

If there is a problem with just requesting 8 bytes, then I'm wondering
how this would affect the ICU code branch.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Peter Geoghegan-4
On Thu, Feb 9, 2017 at 7:58 AM, Peter Eisentraut
<[hidden email]> wrote:
> I think I have this sorted out.  What I don't understand, however, is
> why varstr_abbrev_convert() makes a point of looping until the result of
> strxfrm() fits into the output buffer (buf2), when we only need 8 bytes,
> and we throw away the rest later.  Wouldn't it work to just request 8 bytes?

Maybe. We do that because strxfrm() is not required by the standard to
produce well defined contents for the buffer when the return value
indicates that it didn't fit entirely. This is a standard idiom, I
think.

> If there is a problem with just requesting 8 bytes, then I'm wondering
> how this would affect the ICU code branch.

This must be fine with ICU's ucol_nextSortKeyPart(), because it is
designed for the express purpose of producing only a few bytes of the
final blob at a time.

--
Peter Geoghegan


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: ICU integration

Peter Eisentraut-6
In reply to this post by Peter Eisentraut-6
Updated and rebased patch.

Significant changes:

- Changed collversion to type text

- Changed pg_locale_t to a union

- Use ucol_getAvailable() instead of uloc_getAvailable(), so the set of
initial collations is smaller now, because redundancies are eliminated.

- Added keyword variants to predefined ICU collations (so you get
"de_phonebook%icu", for example)  (So the initial set of collations is
bigger now. :) )

- Predefined ICU collations have a comment now, so \dOS+ is useful.

- Use ucol_nextSortKeyPart() for abbreviated keys

- Enhanced tests and documentation

I believe all issues raised in reviews have been addressed.

Discussion points:

- Naming of collations:  Are we happy with the "de%icu" naming?  I might
have come up with that while reviewing the IPv6 zone index patch. ;-)
An alternative might be "de$icu" for more Oracle vibe and avoiding the
need for double quotes in some cases.  (But we have mixed-case names
like "de_AT%icu", so further changes would be necessary to fully get rid
of the need for quoting.)  A more radical alternative would be to
install ICU locales in a different schema and use the search_path, or
even have a separate search path setting for collations only.  Which
leads to ...

- Selecting default collation provider:  Maybe we want a setting, say in
initdb, to determine which provider's collations get the "good" names?
Maybe not necessary for this release, but something to think about.

- Currently (in this patch), we check a collation's version when it is
first used.  But, say, after pg_upgrade, you might want to check all of
them right away.  What might be a good interface for that?  (Possibly,
we only have to check the ones actually in use, and we have dependency
information for that.)

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

v4-0001-ICU-support.patch (157K) Download Attachment
12345