Upgrading locale issues

classic Classic list List threaded Threaded
12 messages Options
Reply | Threaded
Open this post in threaded view
|

Upgrading locale issues

rihad
Hi. Today we run pg_ctl promote on a slave server (10.7) and started
using it as a master. The OS also got upgraded FreeBSD 10.4 -> FreeBSD
11.2. And you guessed it, most varchar indexes got corrupted because
system local changed in subtle ways. So I created the extension amcheck
and reindexed all bad indexes one by one. Is there any way to prevent
such things in the future? Will switching to ICU fix all such issues?
The problem with it is that ICU collations are absent in pg_collation,
initdb should be run to create them, but pg_basebackup only runs on an
empty base directory, so I couldn't run initdb + pg_basebackup to
prepare the replica server. I believe I can run the create collation
command manually, but what would it look like for en-x-icu?


CREATE COLLATION "en-ix-icu" (provider = icu, locale = 'en-x-icu');


is that it? But what about version etc?


         collname        | collnamespace | collowner | collprovider |
collencoding |    collcollate    | collctype     | collversion

  en-x-icu               |            11 |        10 | i           
|           -1 | en                | en                | 153.88


Thanks.


Reply | Threaded
Open this post in threaded view
|

Re: Upgrading locale issues

Peter Geoghegan-4
On Mon, Apr 29, 2019 at 7:45 AM rihad <[hidden email]> wrote:
> Hi. Today we run pg_ctl promote on a slave server (10.7) and started
> using it as a master. The OS also got upgraded FreeBSD 10.4 -> FreeBSD
> 11.2. And you guessed it, most varchar indexes got corrupted because
> system local changed in subtle ways. So I created the extension amcheck
> and reindexed all bad indexes one by one. Is there any way to prevent
> such things in the future? Will switching to ICU fix all such issues?

Not necessarily, but it will detect the incompatibility more or less
automatically, making it far more likely that the problem will be
caught before it does any harm. ICU versions collations, giving
Postgres a way to reason about their compatibility over time. The libc
collations are not versioned, though (at least not in any standard way
that Postgres can take advantage of).

> The problem with it is that ICU collations are absent in pg_collation,
> initdb should be run to create them, but pg_basebackup only runs on an
> empty base directory, so I couldn't run initdb + pg_basebackup to
> prepare the replica server. I believe I can run the create collation
> command manually, but what would it look like for en-x-icu?

It is safe to call pg_import_system_collations() directly, which is
all that initdb does. This is documented, so you wouldn't be relying
on a hack.

--
Peter Geoghegan


Reply | Threaded
Open this post in threaded view
|

Re: Upgrading locale issues

Thomas Munro-5
On Thu, May 2, 2019 at 8:26 AM Peter Geoghegan <[hidden email]> wrote:

> On Mon, Apr 29, 2019 at 7:45 AM rihad <[hidden email]> wrote:
> > Hi. Today we run pg_ctl promote on a slave server (10.7) and started
> > using it as a master. The OS also got upgraded FreeBSD 10.4 -> FreeBSD
> > 11.2. And you guessed it, most varchar indexes got corrupted because
> > system local changed in subtle ways. So I created the extension amcheck
> > and reindexed all bad indexes one by one. Is there any way to prevent
> > such things in the future? Will switching to ICU fix all such issues?
>
> Not necessarily, but it will detect the incompatibility more or less
> automatically, making it far more likely that the problem will be
> caught before it does any harm. ICU versions collations, giving
> Postgres a way to reason about their compatibility over time. The libc
> collations are not versioned, though (at least not in any standard way
> that Postgres can take advantage of).

As discussed over on -hackers[1], I think it's worth pursuing that
though.  FWIW I've proposed locale versioning for FreeBSD's libc[2].
The reason I haven't gone further with that yet even though the code
change has been accepted in principle by FreeBSD reviewers is because
I got stuck on the question of how exactly to model the versions.  If,
say, just Turkish changes, I don't want to be rebuilding my French
indexes, which means that I don't think you can use the CLDR version
string.  Frustratingly, you could probably do a good job of that by
just checksumming the collation definition files, but that seems a bit
too crude.

There is also the question of how PostgreSQL should model versions,
and as I've argued in [1], I think we should track them at the level
of database object dependencies.

I'm hoping to reopen this can of worms for PostgreSQL 13 (and the
corresponding support could in theory be in FreeBSD 13... coincidence,
or a sign!?)

> > The problem with it is that ICU collations are absent in pg_collation,
> > initdb should be run to create them, but pg_basebackup only runs on an
> > empty base directory, so I couldn't run initdb + pg_basebackup to
> > prepare the replica server. I believe I can run the create collation
> > command manually, but what would it look like for en-x-icu?
>
> It is safe to call pg_import_system_collations() directly, which is
> all that initdb does. This is documented, so you wouldn't be relying
> on a hack.

Unfortunately you can't use ICU collations as a database default yet
(though there was some WIP code[3]), so ICU only saves you from
versioning problems if you explicitly set collations for columns or
expressions, and even then the version tracking is currently just a
warning that you clear manually with a command, not a mechanism that
really tracks which database objects were last rebuilt/validated with
a given version.

[1] https://www.postgresql.org/message-id/flat/CAEepm%3D0uEQCpfq_%2BLYFBdArCe4Ot98t1aR4eYiYTe%3DyavQygiQ%40mail.gmail.com
[2] https://reviews.freebsd.org/D17166
[3] https://www.postgresql.org/message-id/flat/3366.1498183854%40sss.pgh.pa.us

--
Thomas Munro
https://enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: Upgrading locale issues

rihad
In reply to this post by Peter Geoghegan-4
On 05/02/2019 12:26 AM, Peter Geoghegan wrote:

> On Mon, Apr 29, 2019 at 7:45 AM rihad <[hidden email]> wrote:
>> Hi. Today we run pg_ctl promote on a slave server (10.7) and started
>> using it as a master. The OS also got upgraded FreeBSD 10.4 -> FreeBSD
>> 11.2. And you guessed it, most varchar indexes got corrupted because
>> system local changed in subtle ways. So I created the extension amcheck
>> and reindexed all bad indexes one by one. Is there any way to prevent
>> such things in the future? Will switching to ICU fix all such issues?
> Not necessarily, but it will detect the incompatibility more or less
> automatically, making it far more likely that the problem will be
> caught before it does any harm. ICU versions collations, giving
> Postgres a way to reason about their compatibility over time. The libc
> collations are not versioned, though (at least not in any standard way
> that Postgres can take advantage of).
>
>> The problem with it is that ICU collations are absent in pg_collation,
>> initdb should be run to create them, but pg_basebackup only runs on an
>> empty base directory, so I couldn't run initdb + pg_basebackup to
>> prepare the replica server. I believe I can run the create collation
>> command manually, but what would it look like for en-x-icu?
> It is safe to call pg_import_system_collations() directly, which is
> all that initdb does. This is documented, so you wouldn't be relying
> on a hack.
>
Thanks for the reply. Do you know what would a "decent" ICU collation be
to bind to a field's schema definition so it would mimic a UTF-8
encoding for a multilingual column? Maybe und-x-icu? We aren't as much
concerned about their sortability in most cases, we just want indexes to
better handle future PG/ICU upgrades. But what does und(efined) even
mean with respect to collations? With UTF-8 at least some default
collation is specified, like en_US.UTF-8. Will results be in a
completely undefined order as a result of ORDER BY "icu_und_column"?



Reply | Threaded
Open this post in threaded view
|

Re: Upgrading locale issues

Daniel Verite
        rihad wrote:

> Thanks for the reply. Do you know what would a "decent" ICU collation be
> to bind to a field's schema definition so it would mimic a UTF-8
> encoding for a multilingual column? Maybe und-x-icu? We aren't as much
> concerned about their sortability in most cases, we just want indexes to
> better handle future PG/ICU upgrades. But what does und(efined) even
> mean with respect to collations?

"undefined" in this context means unspecified language and
unspecified country or region. It implies that no language-specific
nor regional rule will be applied to compare strings.

Using C.UTF-8 as the collation for text fields to index may be the
best trade-off in your case. It should be immune to libc and ICU
upgrades.

With C.UTF-8, a string like 'BC' will sort before 'ab', and punctuation
and accents will also sort differently than with a linguistic-aware
collation.
If your applications care about that, it can be fixed by simply
adding COLLATE "default" to the ORDER BY clause of the queries that
are meant to present data to users.
COLLATE "default" means the collation of the database, which
presumably would be something like "language_REGION.UTF-8" in your
case. If you never specified it explicitly, it came from initdb which
itself got it from the environment of the server.


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


Reply | Threaded
Open this post in threaded view
|

Re: Upgrading locale issues

rihad
On 05/02/2019 05:36 PM, Daniel Verite wrote:

> rihad wrote:
>
>> Thanks for the reply. Do you know what would a "decent" ICU collation be
>> to bind to a field's schema definition so it would mimic a UTF-8
>> encoding for a multilingual column? Maybe und-x-icu? We aren't as much
>> concerned about their sortability in most cases, we just want indexes to
>> better handle future PG/ICU upgrades. But what does und(efined) even
>> mean with respect to collations?
> "undefined" in this context means unspecified language and
> unspecified country or region. It implies that no language-specific
> nor regional rule will be applied to compare strings.
>
> Using C.UTF-8 as the collation for text fields to index may be the
> best trade-off in your case. It should be immune to libc and ICU
> upgrades.
>
> With C.UTF-8, a string like 'BC' will sort before 'ab', and punctuation
> and accents will also sort differently than with a linguistic-aware
> collation.
Thanks, I'm a bit confused here. AFAIK indexes are used for at least two
things: for speed and for skipping the ORDER BY step (since btree
indexes are already sorted). Will such an "upgrade-immune" C.UTF-8 index
still work correctly for table lookups? And can the existing 
en_US.UTF-8 fields' definition be altered in place, without a
dump+restore? en_US.UTF-8 is the default encoding+locale+collation, it
isn't set explicitly for any of our string columns. I assume there's
some "catch-all" ordering taking place even for the C locale, so there
won't be any bizarre things like b coming before a, or generally for any
language, the second letter of its alphabet coming before the first?


> If your applications care about that, it can be fixed by simply
> adding COLLATE "default" to the ORDER BY clause of the queries that
> are meant to present data to users.
> COLLATE "default" means the collation of the database, which
> presumably would be something like "language_REGION.UTF-8" in your
> case. If you never specified it explicitly, it came from initdb which
> itself got it from the environment of the server.
>
>
> Best regards,




Reply | Threaded
Open this post in threaded view
|

Re: Upgrading locale issues

Daniel Verite
        rihad wrote:

> Thanks, I'm a bit confused here. AFAIK indexes are used for at least two
> things: for speed and for skipping the ORDER BY step (since btree
> indexes are already sorted). Will such an "upgrade-immune" C.UTF-8 index
> still work correctly for table lookups?

If the lookup is based on a equality test or a pattern match with LIKE
or a regexp, it makes no difference.  But the locale makes a
difference with inequality tests, such as < > or BETWEEN.

Around version 9.1 and in the pre-ICU days, Robert Haas wrote
this post that explained it pretty well, I think:
http://rhaas.blogspot.com/2012/03/perils-of-collation-aware-comparisons.html

Quote:

  If you happen to need the particular sorting behavior that
  collation-aware sorting and comparisons provide, then you may find
  this price worth paying, but I suspect there are a lot of people out
  there who are paying it more or less accidentally and don't really
  care very much about the underlying sorting behavior.  If, for
  example, all of your queries are based on equality, and you don't
  use greater-than or less-than tests, then it doesn't matter what
  collation is in use. You might as well use "C" instead of whatever
  your local default may be, because it's faster.

For non-English text, I would recommend C.UTF-8 over "C" because of
its better handling of Unicode characters. For instance:

=# select upper('été' collate "C"), upper('été' collate "C.UTF-8");
 upper | upper
-------+-------
 éTé   | ÉTÉ

The "price" of linguistic comparisons that Robert mentioned was about
performance, but the troubles we have with the lack of long-term
immutability of these locales are worth being added to that.

> And can the existing en_US.UTF-8 fields' definition be altered in
> place, without a dump+restore?

Changing the COLLATE clause of a column with ALTER TABLE does
implicitly rebuild an index on this column if there is one,
A dump+restore is not needed, nor an explicit REINDEX.

The dump+restore is needed in another scenario, where you would
decide to change the LC_COLLATE and LC_CTYPE of the database,
instead of doing it only for some columns.
This scenario makes perfect sense if the locale of the database
has been set implicitly and it uses linguistic sorts without
really needing them ("accidentally" as said in the post).


> en_US.UTF-8 is the default encoding+locale+collation, it
> isn't set explicitly for any of our string columns. I assume there's
> some "catch-all" ordering taking place even for the C locale, so there
> won't be any bizarre things like b coming before a, or generally for any
> language, the second letter of its alphabet coming before the first?

'b' < 'a' is never true for any locale AFAIK, it is 'B' < 'a' that is
true for some locales such as C or C.UTF-8.


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


Reply | Threaded
Open this post in threaded view
|

Re: Upgrading locale issues

rihad
On 05/03/2019 05:35 PM, Daniel Verite wrote:

> rihad wrote:
>
>> Thanks, I'm a bit confused here. AFAIK indexes are used for at least two
>> things: for speed and for skipping the ORDER BY step (since btree
>> indexes are already sorted). Will such an "upgrade-immune" C.UTF-8 index
>> still work correctly for table lookups?
> If the lookup is based on a equality test or a pattern match with LIKE
> or a regexp, it makes no difference.  But the locale makes a
> difference with inequality tests, such as < > or BETWEEN.
>
> Around version 9.1 and in the pre-ICU days, Robert Haas wrote
> this post that explained it pretty well, I think:
> http://rhaas.blogspot.com/2012/03/perils-of-collation-aware-comparisons.html
>
> Quote:
>
>    If you happen to need the particular sorting behavior that
>    collation-aware sorting and comparisons provide, then you may find
>    this price worth paying, but I suspect there are a lot of people out
>    there who are paying it more or less accidentally and don't really
>    care very much about the underlying sorting behavior.  If, for
>    example, all of your queries are based on equality, and you don't
>    use greater-than or less-than tests, then it doesn't matter what
>    collation is in use. You might as well use "C" instead of whatever
>    your local default may be, because it's faster.
>
> For non-English text, I would recommend C.UTF-8 over "C" because of
> its better handling of Unicode characters. For instance:
>
> =# select upper('été' collate "C"), upper('été' collate "C.UTF-8");
>   upper | upper
> -------+-------
>   éTé   | ÉTÉ
>
> The "price" of linguistic comparisons that Robert mentioned was about
> performance, but the troubles we have with the lack of long-term
> immutability of these locales are worth being added to that.
>
>> And can the existing en_US.UTF-8 fields' definition be altered in
>> place, without a dump+restore?
> Changing the COLLATE clause of a column with ALTER TABLE does
> implicitly rebuild an index on this column if there is one,
> A dump+restore is not needed, nor an explicit REINDEX.
>
> The dump+restore is needed in another scenario, where you would
> decide to change the LC_COLLATE and LC_CTYPE of the database,
> instead of doing it only for some columns.
> This scenario makes perfect sense if the locale of the database
> has been set implicitly and it uses linguistic sorts without
> really needing them ("accidentally" as said in the post).
>
>
>> en_US.UTF-8 is the default encoding+locale+collation, it
>> isn't set explicitly for any of our string columns. I assume there's
>> some "catch-all" ordering taking place even for the C locale, so there
>> won't be any bizarre things like b coming before a, or generally for any
>> language, the second letter of its alphabet coming before the first?
> 'b' < 'a' is never true for any locale AFAIK, it is 'B' < 'a' that is
> true for some locales such as C or C.UTF-8.
>
>
> Best regards,


Thanks a lot for sharing your insights!



Reply | Threaded
Open this post in threaded view
|

Re: Upgrading locale issues

rihad
In reply to this post by Daniel Verite
On 05/03/2019 05:35 PM, Daniel Verite wrote:
> For non-English text, I would recommend C.UTF-8 over "C" because of

BTW, there's no C.UTF-8 inside pg_collation, and running select
pg_import_system_collations('pg_catalog') doesn't bring it in, at least
not on FreeBSD 11.2.


Reply | Threaded
Open this post in threaded view
|

Re: Upgrading locale issues

Daniel Verite
        rihad wrote:

> On 05/03/2019 05:35 PM, Daniel Verite wrote:
> > For non-English text, I would recommend C.UTF-8 over "C" because of
>
> BTW, there's no C.UTF-8 inside pg_collation, and running select
> pg_import_system_collations('pg_catalog') doesn't bring it in, at least
> not on FreeBSD 11.2.

Yes, aside from "C", locales are quite system-dependent, unfortunately.

It looks like FreeBSD 13 does provide C.UTF-8:

https://unix.stackexchange.com/questions/485073/how-to-backport-freebsd-13-current-c-utf-8-locale-to-11-2-release/485077#485077


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


Reply | Threaded
Open this post in threaded view
|

Re: Upgrading locale issues

Peter Geoghegan-4
In reply to this post by Thomas Munro-5
On Wed, May 1, 2019 at 3:09 PM Thomas Munro <[hidden email]> wrote:
> As discussed over on -hackers[1], I think it's worth pursuing that
> though.  FWIW I've proposed locale versioning for FreeBSD's libc[2].
> The reason I haven't gone further with that yet even though the code
> change has been accepted in principle by FreeBSD reviewers is because
> I got stuck on the question of how exactly to model the versions.  If,
> say, just Turkish changes, I don't want to be rebuilding my French
> indexes, which means that I don't think you can use the CLDR version
> string.

The ICU versions can handle that, though. Importantly, ICU decouples
implementation details from actual versioning.

I must say that I am not enthused about the idea of trying to get libc
people of any variety on board. I don't have an objection to it if it
can work for FreeBSD, but I don't think it can scale. ICU is built
around a culture that takes our concerns seriously already, which is
what it boils down to. Also, we can imagine a package manager taking
it upon themselves to vendor their own ICU, with platform-specific
guarantees around stability. That seems like a nice option to have, at
least.

> There is also the question of how PostgreSQL should model versions,
> and as I've argued in [1], I think we should track them at the level
> of database object dependencies.

I think you're probably right about that.

> I'm hoping to reopen this can of worms for PostgreSQL 13 (and the
> corresponding support could in theory be in FreeBSD 13... coincidence,
> or a sign!?)

Maybe we should do what Oracle did, and call it PostgreSQL 18c
instead. Actually, any number that isn't of interest to numerologists
will do.

> Unfortunately you can't use ICU collations as a database default yet
> (though there was some WIP code[3]), so ICU only saves you from
> versioning problems if you explicitly set collations for columns or
> expressions, and even then the version tracking is currently just a
> warning that you clear manually with a command, not a mechanism that
> really tracks which database objects were last rebuilt/validated with
> a given version.

Yes, that does seem like a big remaining weakness.

--
Peter Geoghegan


Reply | Threaded
Open this post in threaded view
|

Re: Upgrading locale issues

Peter Eisentraut-6
In reply to this post by Daniel Verite
On 2019-05-03 15:35, Daniel Verite wrote:
> 'b' < 'a' is never true for any locale AFAIK,

But there is 'bb' < 'aa' in Danish. :-)

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services