Collation versioning

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
57 messages Options
123
Reply | Threaded
Open this post in threaded view
|

Collation versioning

Thomas Munro-3
Hello,

While reviewing the ICU versioning work a while back, I mentioned the
idea of using a user-supplied command to get a collversion string for
the libc collation provider.  I was reminded about that by recent news
about an upcoming glibc/CLDR resync that is likely to affect
PostgreSQL users (though, I guess, probably only when they do a major
OS upgrade).  Here's an experimental patch to try that idea out.  For
example, you might set it like this:

libc_collation_version_command = 'md5
/usr/share/locale/@LC_COLLATE@/LC_COLLATE | sed "s/.* = //"'

... or, on a Debian system using the locales package, like this:

libc_collation_version_command = 'dpkg -s locales | grep Version: |
sed "s/Version: //"'

Using the checksum approach, it works like this:

postgres=# alter collation "xx_XX" refresh version;
NOTICE:  changing version from b88d621596b7e61337e832f7841066a9 to
7b008442fbaf5dfe7a10fb3d82a634ab
ALTER COLLATION
postgres=# select * from pg_collation where collname = 'xx_XX';
-[ RECORD 1 ]-+---------------------------------
collname      | xx_XX
collnamespace | 2200
collowner     | 10
collprovider  | c
collencoding  | 6
collcollate   | en_US.UTF-8
collctype     | UTF-8
collversion   | 7b008442fbaf5dfe7a10fb3d82a634ab

When the collation definition changes you get the desired scary
warning on next attempt to use it in a fresh backend:

postgres=# select * from t order by v;
WARNING:  collation "xx_XX" has version mismatch
DETAIL:  The collation in the database was created using version
b88d621596b7e61337e832f7841066a9, but the operating system provides
version 7b008442fbaf5dfe7a10fb3d82a634ab.
HINT:  Rebuild all objects affected by this collation and run ALTER
COLLATION public."xx_XX" REFRESH VERSION, or build PostgreSQL with the
right library version.

The problem is that it isn't in effect at initdb time so if you add
that later it only affects new locales.  You'd need a way to do that
during init to capture the imported system locale versions, and that's
a really ugly string to have to pass into some initdb option.  Ugh.

Another approach would be to decide that we're willing to put
non-portable version extracting magic in pg_locale.c.   On a long
flight I hacked my libc to store a version string (based on CLDR
version or whatever) in its binary locale definitions and provide a
proper interface to ask for it, modelled on querylocale(3):

const char *querylocaleversion(int mask, locale_t locale);

Then the patch for pg_locale.c is trivial, see attached.  While I
could conceivably try to convince my local friendly OS to take such a
patch, the real question is how to deal with glibc.  Does anyone know
of a way to extract a version string from glibc using existing
interfaces?  I heard there was an undocumented way but I haven't been
able to find it -- probably because I was, erm, looking in the
documentation.

Or maybe this isn't worth bothering with, and we should just build out
the ICU support and then make it the default and be done with it.

In passing, here's a patch to add tab completion for ALTER COLLATION
... REFRESH VERSION.

--
Thomas Munro
http://www.enterprisedb.com

0001-Auto-complete-ALTER-COLLATION-.-REFRESH-VERSION.patch (1K) Download Attachment
0001-Add-libc_collation_version_command-GUC.patch (4K) Download Attachment
0001-Add-libc_collation_version_command-GUC.patch (4K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Collation versioning

Thomas Munro-3
On Tue, Sep 4, 2018 at 10:02 AM Thomas Munro
<[hidden email]> wrote:
> const char *querylocaleversion(int mask, locale_t locale);
>
> Then the patch for pg_locale.c is trivial, see attached.

Oops, here's that one, FWIW.

--
Thomas Munro
http://www.enterprisedb.com

0001-Use-querylocaleversion-3-if-available.patch (4K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Collation versioning

Christoph Berg-2
In reply to this post by Thomas Munro-3
Re: Thomas Munro 2018-09-04 <CAEepm=0uEQCpfq_+LYFBdArCe4Ot98t1aR4eYiYTe=[hidden email]>
> I was reminded about that by recent news
> about an upcoming glibc/CLDR resync that is likely to affect
> PostgreSQL users (though, I guess, probably only when they do a major
> OS upgrade).

Or replicating/restoring a database to a newer host.

> ... or, on a Debian system using the locales package, like this:
>
> libc_collation_version_command = 'dpkg -s locales | grep Version: |
> sed "s/Version: //"'

Ugh. This sounds horribly easy to get wrong on the user side. I could
of course put that preconfigured into the Debian packages, but that
would leave everyone not using any of the standard distro packagings
in the rain.

> Does anyone know
> of a way to extract a version string from glibc using existing
> interfaces?  I heard there was an undocumented way but I haven't been
> able to find it -- probably because I was, erm, looking in the
> documentation.

That sounds more robust. Googling around:

https://www.linuxquestions.org/questions/linux-software-2/how-to-check-glibc-version-263103/

#include <stdio.h>
#include <gnu/libc-version.h>
int main (void) { puts (gnu_get_libc_version ()); return 0; }

$ ./a.out
2.27

Hopefully that version info is fine-grained enough.

Christoph

Reply | Threaded
Open this post in threaded view
|

Re: Collation versioning

Thomas Munro-3
On Wed, Sep 5, 2018 at 3:35 AM Christoph Berg <[hidden email]> wrote:
> Re: Thomas Munro 2018-09-04 <CAEepm=0uEQCpfq_+LYFBdArCe4Ot98t1aR4eYiYTe=[hidden email]>
> > I was reminded about that by recent news
> > about an upcoming glibc/CLDR resync that is likely to affect
> > PostgreSQL users (though, I guess, probably only when they do a major
> > OS upgrade).
>
> Or replicating/restoring a database to a newer host.

Yeah.

> > Does anyone know
> > of a way to extract a version string from glibc using existing
> > interfaces?  I heard there was an undocumented way but I haven't been
> > able to find it -- probably because I was, erm, looking in the
> > documentation.
>
> That sounds more robust. Googling around:
>
> https://www.linuxquestions.org/questions/linux-software-2/how-to-check-glibc-version-263103/
>
> #include <stdio.h>
> #include <gnu/libc-version.h>
> int main (void) { puts (gnu_get_libc_version ()); return 0; }
>
> $ ./a.out
> 2.27
>
> Hopefully that version info is fine-grained enough.

Hmm.  I was looking for locale data version, not libc.so itself.  I
realise they come ultimately from the same source package, but are the
locale definitions and libc6 guaranteed to be updated at the same
time?  I see that the locales package depends on libc-bin >> 2.27 (no
upper bound), which in turn depends on libc6 >> 2.27, << 2.28.  So
perhaps you can have a system with locales 2.27 and libc6 2.28?  I
also wonder if there are some configurations where they could get out
of sync because of manual use of locale-gen or something like that.
Clearly most systems would have them in sync through apt-get upgrade
though, so maybe gnu_get_libc_version() would work well in practice?

--
Thomas Munro
http://www.enterprisedb.com

Reply | Threaded
Open this post in threaded view
|

Re: Collation versioning

Thomas Munro-3
On Wed, Sep 5, 2018 at 8:20 AM Thomas Munro
<[hidden email]> wrote:

> On Wed, Sep 5, 2018 at 3:35 AM Christoph Berg <[hidden email]> wrote:
> > int main (void) { puts (gnu_get_libc_version ()); return 0; }
> >
> > $ ./a.out
> > 2.27
>
> Hmm.  I was looking for locale data version, not libc.so itself.  I
> realise they come ultimately from the same source package, but are the
> locale definitions and libc6 guaranteed to be updated at the same
> time?

And even if they are, what if your cluster is still running and still
has the older libc.so.6 mapped in?  Newly forked backends will see new
locale data but gnu_get_libc_version() will return the old string.
(Pointed out off-list by Andres.)  Eventually you restart your cluster
and start seeing the error.

So, it's not ideal but perhaps worth considering on the grounds that
it's better than nothing?

--
Thomas Munro
http://www.enterprisedb.com

Reply | Threaded
Open this post in threaded view
|

Re: Collation versioning

Christoph Berg-2
Re: Thomas Munro 2018-09-05 <CAEepm=3a5BC7CwsXZo3V4fw6YuAMT2nJ1krwtqOatb=[hidden email]>
> > Hopefully that version info is fine-grained enough.
>
> Hmm.  I was looking for locale data version, not libc.so itself.  I
> realise they come ultimately from the same source package, but are the
> locale definitions and libc6 guaranteed to be updated at the same
> time?  I see that the locales package depends on libc-bin >> 2.27 (no
> upper bound), which in turn depends on libc6 >> 2.27, << 2.28.  So
> perhaps you can have a system with locales 2.27 and libc6 2.28?

No because libc6.deb "breaks" locales << 2.27:

Package: libc6
Source: glibc
Version: 2.27-5
Breaks: [...] locales (<< 2.27), locales-all (<< 2.27),

(I can't tell off-hand why this isn't just a stricter dependency in
locales.deb, but it's probably because this variant works better for
upgrades.)

> I also wonder if there are some configurations where they could get out
> of sync because of manual use of locale-gen or something like that.
> Clearly most systems would have them in sync through apt-get upgrade
> though, so maybe gnu_get_libc_version() would work well in practice?

I'd hope so. I'm more worried about breakage because of fixes applied
within one glibc version (2.27-5 vs 2.27-6), but I guess Debian's
glibc maintainers are clueful enough not to do that.


Re: Thomas Munro 2018-09-05 <CAEepm=0hoACQLFn8ro7jCO9-wTth2mXXS3K=[hidden email]>
> And even if they are, what if your cluster is still running and still
> has the older libc.so.6 mapped in?  Newly forked backends will see new
> locale data but gnu_get_libc_version() will return the old string.
> (Pointed out off-list by Andres.)  Eventually you restart your cluster
> and start seeing the error.

That problem isn't protected against by PG itself. I've seen clusters
that were upgraded on disk but not restarted yet where every plpgsql
invocation was throwing symbol errors. So I guess we don't have to try
harder for libc.

> So, it's not ideal but perhaps worth considering on the grounds that
> it's better than nothing?

Ack.

Christoph

Reply | Threaded
Open this post in threaded view
|

Re: Collation versioning

Thomas Munro-3
On Wed, Sep 5, 2018 at 12:10 PM Christoph Berg <[hidden email]> wrote:
> > So, it's not ideal but perhaps worth considering on the grounds that
> > it's better than nothing?
>
> Ack.

Ok, here's a little patch like that.

postgres=# select collname, collcollate, collversion from pg_collation
where collname = 'en_NZ';
 collname | collcollate | collversion
----------+-------------+-------------
 en_NZ    | en_NZ.utf8  | 2.24
(1 row)

--
Thomas Munro
http://www.enterprisedb.com

0001-Set-collversion-for-collations-that-come-from-glibc.patch (5K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Collation versioning

Peter Eisentraut-6
On 05/09/2018 23:18, Thomas Munro wrote:

> On Wed, Sep 5, 2018 at 12:10 PM Christoph Berg <[hidden email]> wrote:
>>> So, it's not ideal but perhaps worth considering on the grounds that
>>> it's better than nothing?
>>
>> Ack.
>
> Ok, here's a little patch like that.
>
> postgres=# select collname, collcollate, collversion from pg_collation
> where collname = 'en_NZ';
>  collname | collcollate | collversion
> ----------+-------------+-------------
>  en_NZ    | en_NZ.utf8  | 2.24
> (1 row)

But wouldn't that also have the effect that glibc updates that don't
change anything about the locales would trigger the version
incompatibility warning?

Also, note that this mechanism only applies to collation objects, not to
database-global locales.  So most users wouldn't be helped by this approach.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Collation versioning

Thomas Munro-3
On Thu, Sep 6, 2018 at 12:01 PM Peter Eisentraut
<[hidden email]> wrote:

> On 05/09/2018 23:18, Thomas Munro wrote:
> > On Wed, Sep 5, 2018 at 12:10 PM Christoph Berg <[hidden email]> wrote:
> >>> So, it's not ideal but perhaps worth considering on the grounds that
> >>> it's better than nothing?
> >>
> >> Ack.
> >
> > Ok, here's a little patch like that.
> >
> > postgres=# select collname, collcollate, collversion from pg_collation
> > where collname = 'en_NZ';
> >  collname | collcollate | collversion
> > ----------+-------------+-------------
> >  en_NZ    | en_NZ.utf8  | 2.24
> > (1 row)
>
> But wouldn't that also have the effect that glibc updates that don't
> change anything about the locales would trigger the version
> incompatibility warning?

Right.  And likewise, a glibc update that does change some locales but
not the locales that you are actually using will trigger false alarm
warnings. The same goes for the ICU provider, which appears to return
the same collversion for every collation, even though presumably some
don't change from one ICU version to the next.

I wonder if someone here knows how many "locales" packages have been
released over the lifetime of (say) the current Debian stable distro,
whether any LC_COLLATE files changed over those releases, and whether
libc6 had the same MAJOR.MINOR for the whole lifetime.  That is, even
though they might have been through 2.19-17+blah, 2.19-18+blah, ...
did they all report "2.19" and were the collations actually stable?
If that's the case, I think it'd be quite good: we'd only raise the
alarm after a big dist-upgrade Debian 8->9, or when doing streaming
replication from a Debian 8 box to a Debian 9 box.

> Also, note that this mechanism only applies to collation objects, not to
> database-global locales.  So most users wouldn't be helped by this approach.

Yeah, right, that would have to work for this to be useful.  I will
look into that.

--
Thomas Munro
http://www.enterprisedb.com

Reply | Threaded
Open this post in threaded view
|

Re: Collation versioning

Christoph Berg-2
Fwiw, I was doing some tests with LC_COLLATE last year:

https://github.com/ChristophBerg/lc_collate_testsuite

Iirc the outcome was that everything except de_DE.UTF-8 was stable.

Christoph

Reply | Threaded
Open this post in threaded view
|

Re: Collation versioning

Thomas Munro-3
In reply to this post by Thomas Munro-3
On Thu, Sep 6, 2018 at 5:36 PM Thomas Munro
<[hidden email]> wrote:
> On Thu, Sep 6, 2018 at 12:01 PM Peter Eisentraut
> <[hidden email]> wrote:
> > Also, note that this mechanism only applies to collation objects, not to
> > database-global locales.  So most users wouldn't be helped by this approach.
>
> Yeah, right, that would have to work for this to be useful.  I will
> look into that.

We could perform a check up front in (say) CheckMyDatabase(), or maybe
defer until the first string comparison.  The tricky question is where
to store it.

1.  We could add datcollversion to pg_database.

2.  We could remove datcollate and datctype and instead store a
collation OID.  I'm not sure what problems would come up, but for
starters it seems a bit weird to have a shared catalog pointing to
rows in a non-shared catalog.

The same question comes up if we want to support ICU as a database
level default.  Add datcollprovider, or point to a pg_collation row?

--
Thomas Munro
http://www.enterprisedb.com

Reply | Threaded
Open this post in threaded view
|

Re: Collation versioning

Peter Eisentraut-6
On 07/09/2018 23:34, Thomas Munro wrote:

> On Thu, Sep 6, 2018 at 5:36 PM Thomas Munro
> <[hidden email]> wrote:
>> On Thu, Sep 6, 2018 at 12:01 PM Peter Eisentraut
>> <[hidden email]> wrote:
>>> Also, note that this mechanism only applies to collation objects, not to
>>> database-global locales.  So most users wouldn't be helped by this approach.
>>
>> Yeah, right, that would have to work for this to be useful.  I will
>> look into that.
>
> We could perform a check up front in (say) CheckMyDatabase(), or maybe
> defer until the first string comparison.  The tricky question is where
> to store it.
>
> 1.  We could add datcollversion to pg_database.
>
> 2.  We could remove datcollate and datctype and instead store a
> collation OID.  I'm not sure what problems would come up, but for
> starters it seems a bit weird to have a shared catalog pointing to
> rows in a non-shared catalog.
>
> The same question comes up if we want to support ICU as a database
> level default.  Add datcollprovider, or point to a pg_collation row?

This was previously discussed here:
https://www.postgresql.org/message-id/f689322a-4fc5-10cc-4a60-81f1ff0166c9@...
-- without a conclusion.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Collation versioning

Christoph Berg-2
In reply to this post by Thomas Munro-3
Re: Thomas Munro 2018-09-07 <CAEepm=[hidden email]>
> 2.  We could remove datcollate and datctype and instead store a
> collation OID.  I'm not sure what problems would come up, but for
> starters it seems a bit weird to have a shared catalog pointing to
> rows in a non-shared catalog.

Naive idea: make that catalog shared? Collations are system-wide after
all.

Christoph

Reply | Threaded
Open this post in threaded view
|

Re: Collation versioning

Peter Eisentraut-6
On 12/09/2018 10:15, Christoph Berg wrote:
> Re: Thomas Munro 2018-09-07 <CAEepm=[hidden email]>
>> 2.  We could remove datcollate and datctype and instead store a
>> collation OID.  I'm not sure what problems would come up, but for
>> starters it seems a bit weird to have a shared catalog pointing to
>> rows in a non-shared catalog.
>
> Naive idea: make that catalog shared? Collations are system-wide after
> all.

By the same argument, extensions should be shared, but they are not.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Collation versioning

Christoph Berg-2
Re: Peter Eisentraut 2018-09-12 <[hidden email]>
> > Naive idea: make that catalog shared? Collations are system-wide after
> > all.
>
> By the same argument, extensions should be shared, but they are not.

But extensions put a lot of visible stuff into a database, whereas a
collation is just a line in some table that doesn't get into the way.

Christoph

Reply | Threaded
Open this post in threaded view
|

Re: Collation versioning

Peter Eisentraut-6
On 12/09/2018 13:25, Christoph Berg wrote:
> Re: Peter Eisentraut 2018-09-12 <[hidden email]>
>>> Naive idea: make that catalog shared? Collations are system-wide after
>>> all.
>>
>> By the same argument, extensions should be shared, but they are not.
>
> But extensions put a lot of visible stuff into a database, whereas a
> collation is just a line in some table that doesn't get into the way.

How about C functions?  They are just a system catalog representation of
something that exists on the OS.

Anyway, we also want to support application-specific collation
definitions, so that users can CREATE COLLATION
"my_specific_requirements" and use that that in their application, so
global collations wouldn't be appropriate for that.

Moreover, the fix for a collation version mismatch is, in the simplest
case, to go around and REINDEX everything.  Making the collation or
collation version global doesn't fix that.  It would actually make it
harder because you couldn't run ALTER COLLATION REFRESH VERSION until
after you have rebuilt all affected objects *in all databases*.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Collation versioning

Christoph Berg-2
Re: Peter Eisentraut 2018-09-13 <[hidden email]>
> Moreover, the fix for a collation version mismatch is, in the simplest
> case, to go around and REINDEX everything.  Making the collation or
> collation version global doesn't fix that.  It would actually make it
> harder because you couldn't run ALTER COLLATION REFRESH VERSION until
> after you have rebuilt all affected objects *in all databases*.

Btw, I think a "reindexdb --all --collation" (and the SQL per-database
equivalent) that only rebuilds indexes that are affected by collations
would be immensely useful to have.

Christoph

Reply | Threaded
Open this post in threaded view
|

Re: Collation versioning

Stephen Frost
Greetings,

* Christoph Berg ([hidden email]) wrote:

> Re: Peter Eisentraut 2018-09-13 <[hidden email]>
> > Moreover, the fix for a collation version mismatch is, in the simplest
> > case, to go around and REINDEX everything.  Making the collation or
> > collation version global doesn't fix that.  It would actually make it
> > harder because you couldn't run ALTER COLLATION REFRESH VERSION until
> > after you have rebuilt all affected objects *in all databases*.
>
> Btw, I think a "reindexdb --all --collation" (and the SQL per-database
> equivalent) that only rebuilds indexes that are affected by collations
> would be immensely useful to have.
As I was discussing w/ Peter G during PostgresOpen, we'd have to wait
until that reindexdb is complete before actually using anything in the
system and that's pretty painful.  While it sounds like it'd be a good
bit of work, it seems like we really need to have a way to support
multiple collation versions concurrently and to do that we'll need to
have the library underneath actually providing that to us.  Once we have
that, we can build new indexes concurrently and swap to them (or,
ideally, just issue REINDEX CONCURRENTLY once we support that..).

Until then, it seems like we really need to have a way to realize that a
given upgrade is going to require a big reindex, before actually doing
the reindex and suddenly discovering that we can't use a bunch of
indexes because they're out of date and extending the downtime for the
upgrade to be however long it takes to rebuild those indexes...

Thanks!

Stephen

signature.asc (836 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Collation versioning

Thomas Munro-3
In reply to this post by Peter Eisentraut-6
On Thu, Sep 13, 2018 at 7:03 PM Peter Eisentraut
<[hidden email]> wrote:

> On 12/09/2018 13:25, Christoph Berg wrote:
> > Re: Peter Eisentraut 2018-09-12 <[hidden email]>
> >>> Naive idea: make that catalog shared? Collations are system-wide after
> >>> all.
> >>
> >> By the same argument, extensions should be shared, but they are not.
> >
> > But extensions put a lot of visible stuff into a database, whereas a
> > collation is just a line in some table that doesn't get into the way.
>
> How about C functions?  They are just a system catalog representation of
> something that exists on the OS.
>
> Anyway, we also want to support application-specific collation
> definitions, so that users can CREATE COLLATION
> "my_specific_requirements" and use that that in their application, so
> global collations wouldn't be appropriate for that.
>
> Moreover, the fix for a collation version mismatch is, in the simplest
> case, to go around and REINDEX everything.  Making the collation or
> collation version global doesn't fix that.  It would actually make it
> harder because you couldn't run ALTER COLLATION REFRESH VERSION until
> after you have rebuilt all affected objects *in all databases*.

Here's one idea I came up with.  It involves a new kind of magic.  The
goals are:

1.  Support versioning for the libc provider, including for the
default collation.
2.  Support ICU for the default collation.
3.  Fix the tracking of when reindexes need to be rebuilt, so that you
can't get it wrong (as you're alluding to above).

Changes:

1.  Drop the datcollate and datctype columns from pg_database.
2.  In CheckMyDatabase() or elsewhere in backend initialisation, get
that information instead by loading the pg_collation row with OID =
DEFAULT_COLLATION_OID.
3.  Don't put COLLPROVIDER_DEFAULT into the default collation
collprovider column, instead give it a concrete provider value, ie
COLLPROVIDER_LIBC.
4.  After creating a new database, update that row as appropriate in
the new database (!).  Or find some other way to write a new table out
and switch it around, or something like that.  That is, if you say
CREATE DATABASE foo LC_COLLATE = 'xx_XX', COLLATION_PROVIDER = libc
then those values somehow get written into the default pg_collation
row in the *new* database (so at that point it's not a simple copy of
the template database).
5.  Drop the collversion column from pg_collation.  Get rid of the
REFRESH VERSION command.  Instead, add a new column indcollversion to
pg_index.  It needs to be an array of text (not sure if that is a
problem in a catalog), with elements that correspond to the elements
of indcollation.
6.  Do the check and log warnings when we first open each index.
7.  Update indcollversion at index creation and whenever we REINDEX.

I haven't actually tried any of this so I'm not sure if I'm missing
something other than the inherent difficulty of updating a row in a
table in a database you're not connected to...

--
Thomas Munro
http://www.enterprisedb.com

Reply | Threaded
Open this post in threaded view
|

Re: Collation versioning

Douglas Doole
On Sun, Sep 16, 2018 at 1:20 AM Thomas Munro <[hidden email]> wrote:
3.  Fix the tracking of when reindexes need to be rebuilt, so that you
can't get it wrong (as you're alluding to above).

I've mentioned this in the past, but didn't seem to get any traction, so I'll try it again ;-)

The focus on indexes when a collation changes is, in my opinion, the least of the problems. You definitely have to worry about indexes, but they can be easily rebuilt. What about other places where collation is hardened into the system, such as constraints?

For example, in ICU 4.6 the handling of accents changed for French. Previously accents were considered right-to-left but ICU 4.6 reversed this. So consider a constraint like CHECK COL < 'coté' (last letter is U+00E9, small letter e with acute). Prior to ICU 4.6 the value 'côte' (second letter is U+00F4, small letter o with circumflex) would have passed this constraint. With 4.6 or later it would be rejected because of the accent ordering change. As soon as the collation changes, this table becomes inconsistent and a reindex isn't going to help it. This becomes a data cleansing problem at this point (which sucks for the user because their data was clean immediately prior to the "upgrade").

There have similarly been cases where ICU changes have caused equal characters to become unequal and vice versa. (Unfortunately all my ICU notes with examples are at my previous employer.) Consider the effect on RI constraints. The primary key can be fixed with a reindex (although dealing with two existing values becoming equal is a pain). But then the user also has to deal with the foreign keys since they may now have foreign keys which have no match in the primary key.

And constraints problems are even easier than triggers. Consider a database with complex BI rules that are implemented through triggers that fire when values are/are not equal. If the equality of strings change, there could be bad data throughout the tables. (At least with constraints the inter-column dependencies are explicit in the catalogs. With triggers anything goes.)

All this collation stuff is great, and I know users want it, but it feels like were pushing them out of an airplane with a ripped parachute every time the collation libraries change. Maybe they'll land safely or maybe things will get very messy.

- Doug
Salesforce
123