Mixed Locales and Upgrading

Don Seiler
Good morning,

I have a few clusters that need to be upgraded from PG 9.6 or 10 to 12. Normally I'd just plan to pg_upgrade the lot and be good to go. However I've found that quite a few (including our biggest/busiest database) have mixed locales. In the case of the biggest/busiest database, the cluster was created with locale en_US (NOT en_US.UTF-8), and so the databases have encoding LATIN1. 
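
For reference, this is roughly how I've been spotting the mismatched settings per database (just a sketch; connection options omitted):

psql -U postgres -c "SELECT datname, pg_encoding_to_char(encoding) AS encoding, datcollate, datctype FROM pg_database ORDER BY datname;"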

However, that biggest/busiest database now has encoding UTF8 while still having ctype and collation of en_US. I've since found that when it was last upgraded, someone ran "update pg_database set encoding = pg_char_to_encoding('UTF8') where datname = 'test';" to change the encoding. In my testing, pg_upgrade breaks when trying to restore this, since UTF8 isn't compatible with the en_US locale in the CREATE DATABASE command that pg_restore issues:

command: "/usr/lib/postgresql/12/bin/pg_restore" --host /var/lib/postgresql --port 50432 --username postgres --create --exit-on-error --verbose --dbname template1 "pg_upgrade_dump_16385.custom" >> "pg_upgrade_dump_16385.log" 2>&1
pg_restore: connecting to database for restore
pg_restore: creating DATABASE "test"
pg_restore: while PROCESSING TOC:
pg_restore: from TOC entry 2164; 1262 16385 DATABASE test postgres
pg_restore: error: could not execute query: ERROR:  encoding "UTF8" does not match locale "en_US"
DETAIL:  The chosen LC_CTYPE setting requires encoding "LATIN1".
Command was: CREATE DATABASE "test" WITH TEMPLATE = template0 ENCODING = 'UTF8' LC_COLLATE = 'en_US' LC_CTYPE = 'en_US';


Is there a way around this while still using pg_upgrade? My understanding is that a full dump and restore into a new DB with everything set to en_US.UTF-8 is the only way to convert these (if I have to dump/restore, I wouldn't want to keep the mixed environment). Even with parallel jobs, I imagine that's a fair bit of downtime, but I'll have to wait until I can get a copy of prod data to test with to be sure.

Is logical replication an option here? Either maintaining the mixed environment or converting everything to en_US.UTF-8? I'm relatively new in this shop but I'm told they didn't mean to use en_US and there's no reason they wouldn't want to just use the standard/default UTF-8.

Thanks,
Don.

--
Don Seiler
www.seiler.us

Re: Mixed Locales and Upgrading

Tom Lane
Don Seiler <[hidden email]> writes:
> However this database has encoding UTF8 while still having ctype and
> collation of en_US. I've since found that when this was last upgraded, they
> ran "update pg_database set encoding = pg_char_to_encoding('UTF8') where
> datname = 'test';" to change the encoding.

Egad.

> In my testing, pg_upgrade breaks
> when trying to restore this since UTF8 isn't supported in en_US for the CREATE
> DATABASE command used during pg_restore:

Well, in principle you could likewise manually update pg_database's
datcollate and datctype columns to say "en_US.utf8".  However, there's
a much bigger problem here --- what steps if any did this cowboy take
to ensure that the data inside the database was valid UTF8?
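
Concretely, that would be something like this (an untested sketch; requires superuser, and only once you've verified the data really is valid UTF8):

psql -U postgres -d postgres -c "UPDATE pg_database SET datcollate = 'en_US.utf8', datctype = 'en_US.utf8' WHERE datname = 'test';"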

I don't think you should use pg_upgrade here at all.  A dump/restore
is really the only way to make sure that you have validly encoded data.

However, if it's only one database out of a bunch, you could do something
like

* pg_dump that one database;
* drop said database;
* pg_upgrade everything else;
* restore that one database from dump.
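
In shell terms, roughly (an untested sketch; paths, versions, ports and job counts are placeholders you would adjust):

pg_dump -Fd -j 4 -f /backups/test.dump test        # dump just the problem DB
dropdb test                                        # drop it from the old cluster
/usr/lib/postgresql/12/bin/pg_upgrade \
  -b /usr/lib/postgresql/10/bin -B /usr/lib/postgresql/12/bin \
  -d /var/lib/postgresql/10/main -D /var/lib/postgresql/12/main
# then, against the upgraded cluster:
createdb -T template0 -E UTF8 --lc-collate=en_US.utf8 --lc-ctype=en_US.utf8 test
pg_restore -j 4 -d test /backups/test.dump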

                        regards, tom lane



Re: Mixed Locales and Upgrading

Don Seiler
On Mon, Mar 16, 2020 at 10:28 AM Tom Lane <[hidden email]> wrote:

Egad.

My thoughts exactly.
 
Well, in principle you could likewise manually update pg_database's
datcollate and datctype columns to say "en_US.utf8".  However, there's
a much bigger problem here --- what steps if any did this cowboy take
to ensure that the data inside the database was valid UTF8?

No steps that I've seen from the chat history I've been able to search. I'm not sure if there was an (invalid) assumption that LATIN1 is a subset of UTF-8 or if it was done in a panic to get the import/update working years ago.
 
I don't think you should use pg_upgrade here at all.  A dump/restore
is really the only way to make sure that you have validly encoded data.

That is what I thought, and probably not what they'll want to hear given the downtime involved. Even with parallel dump/restore jobs, I imagine it will take quite a while (this first DB is almost 900GB).
 
However, if it's only one database out of a bunch, you could do something
like

* pg_dump that one database;
* drop said database;
* pg_upgrade everything else;
* restore that one database from dump.

In the case of this busy cluster, the layout is like this:

postgres=# \l
                                 List of databases
   Name    |  Owner   | Encoding |  Collate   |   Ctype    |   Access privileges
-----------+----------+----------+------------+------------+-----------------------
 foooo_all | postgres | UTF8     | en_US      | en_US      |
 postgres  | postgres | LATIN1   | en_US      | en_US      | =Tc/postgres         +
           |          |          |            |            | postgres=CTc/postgres
 template0 | postgres | LATIN1   | en_US      | en_US      | =c/postgres          +
           |          |          |            |            | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.utf8 | en_US.utf8 |
(4 rows)


So the template1 DB was also dropped and recreated with the en_US.utf8 collate and ctype, and then set to be a template again. But I believe that was well after foooo_all was changed (and so probably no need for the template1 change). In this case, since foooo_all is the only user database in the cluster, would it make sense to just create a new database as en_US.UTF-8 and then restore a dump of foooo_all into that pre-created en_US.UTF-8 DB?
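
Something like this is what I have in mind (just a sketch; names and options are placeholders):

createdb -T template0 -E UTF8 --lc-collate=en_US.utf8 --lc-ctype=en_US.utf8 foooo_all_new
pg_dump -Fc foooo_all | pg_restore -d foooo_all_new    # no --create, so the new locale settings stick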

We have a few other mixed environments similar to this as well. Some have postgres and both template DBs with this same UTF8/en_US/en_US configuration.

Is logical replication an option here? If the target DB were setup as en_US.UTF-8 across the board, would logical replication safely replicate and convert the data until we could then cut over?


Thanks,
Don.
--
Don Seiler
www.seiler.us

Re: Mixed Locales and Upgrading

Don Seiler
On Tue, Mar 17, 2020 at 8:06 AM Don Seiler <[hidden email]> wrote:
On Mon, Mar 16, 2020 at 10:28 AM Tom Lane <[hidden email]> wrote:

Well, in principle you could likewise manually update pg_database's
datcollate and datctype columns to say "en_US.utf8".  However, there's
a much bigger problem here --- what steps if any did this cowboy take
to ensure that the data inside the database was valid UTF8?

Is there a way to programmatically check for data that might be a problem now?

--
Don Seiler
www.seiler.us

Re: Mixed Locales and Upgrading

Tom Lane
Don Seiler <[hidden email]> writes:
> On Mon, Mar 16, 2020 at 10:28 AM Tom Lane <[hidden email]> wrote:
>> I don't think you should use pg_upgrade here at all.  A dump/restore
>> is really the only way to make sure that you have validly encoded data.

> That is what I thought, and probably not what they'll want to hear given
> the downtime involved. Even with parallel dump/restore jobs, I imagine it
> will take quite a while (this first DB is almost 900GB).

Yikes.  Well, if there aren't obvious operational problems, it might be
that the data is actually UTF8-clean, or almost entirely so.  Maybe you
could look at the problem as being one of validation.  In that case,
it'd be possible to consider not taking the production DB down, but just
doing a pg_dump from it and seeing if you can restore somewhere else.
If not, fix the broken data; repeat till clean.  After that you could
do pg_upgrade with a clear conscience.  I think you'll still end up
manually fixing the inconsistent datcollate/datctype settings though.
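
As a rough sketch of that validation loop (host, cluster and database names here are placeholders):

createdb -T template0 -E UTF8 --lc-collate=en_US.utf8 --lc-ctype=en_US.utf8 scratch_check   # on a scratch cluster
pg_dump -h prod-host -Fc foooo_all | pg_restore -d scratch_check 2> restore.log
grep 'invalid byte sequence' restore.log    # each hit points at data that needs fixing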

> Is logical replication an option here? If the target DB were setup as
> en_US.UTF-8 across the board, would logical replication safely replicate
> and convert the data until we could then cut over?

I think you need to make sure the data is clean first.  I doubt that
logical replication will magically fix any problems in data it's trying
to push over, and I also doubt that we have any really good answer to
what happens if a replication update fails due to bad data.

                        regards, tom lane



Re: Mixed Locales and Upgrading

Don Seiler
On Tue, Mar 17, 2020 at 8:56 AM Tom Lane <[hidden email]> wrote:

Yikes.  Well, if there aren't obvious operational problems, it might be
that the data is actually UTF8-clean, or almost entirely so.  Maybe you
could look at the problem as being one of validation.  In that case,
it'd be possible to consider not taking the production DB down, but just
doing a pg_dump from it and seeing if you can restore somewhere else.
If not, fix the broken data; repeat till clean.  After that you could
do pg_upgrade with a clear conscience.  I think you'll still end up
manually fixing the inconsistent datcollate/datctype settings though.

For this test, would we restore into an en_US.UTF-8/UTF8 database? Then, assuming no errors (or fixing any errors until clean), we change the datcollate/datctype settings in prod and proceed with pg_upgrade (obviously after testing all of that heavily)?

What are the ramifications of changing collation like that? Should we consider rebuilding indexes ASAP after that?

Don.

--
Don Seiler
www.seiler.us

Re: Mixed Locales and Upgrading

Tom Lane
Don Seiler <[hidden email]> writes:
> On Tue, Mar 17, 2020 at 8:56 AM Tom Lane <[hidden email]> wrote:
>> Yikes.  Well, if there aren't obvious operational problems, it might be
>> that the data is actually UTF8-clean, or almost entirely so.  Maybe you
>> could look at the problem as being one of validation.

> For this test, would we restore into an en_US.UTF-8/UTF8 database? Then,
> assuming no errors (or fixing any errors until clean), we change the
> datcollate/datctype settings in prod and proceed with pg_upgrade (obviously
> after testing all of that heavily)?

Yeah, that's the basic idea.

> What are the ramifications of changing collation like that? Should we
> consider rebuilding indexes ASAP after that?

Text indexes would definitely be at risk here.  I'm not really certain
how bad the problem would be.  Do you have a feeling for how much of
the data is 100% ASCII?  If you could be sure of that for any given
column, you wouldn't have to reindex indexes on that column.
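
A spot check for non-ASCII content in a given column might look like this (a sketch; table and column names are placeholders):

psql -d foooo_all -c "SELECT count(*) FROM some_table WHERE octet_length(convert_to(some_col, 'UTF8')) <> char_length(some_col);"
# a nonzero count means the column holds non-ASCII characters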

                        regards, tom lane



Re: Mixed Locales and Upgrading

Michael Paquier
On Tue, Mar 17, 2020 at 10:45:50AM -0400, Tom Lane wrote:
> Don Seiler <[hidden email]> writes:
>> What are the ramifications of changing collation like that? Should we
>> consider rebuilding indexes ASAP after that?
>
> Text indexes would definitely be at risk here.  I'm not really certain
> how bad the problem would be.  Do you have a feeling for how much of
> the data is 100% ASCII?  If you could be sure of that for any given
> column, you wouldn't have to reindex indexes on that column.

There is no way to know how many indexes would get broken without
having a look at them.  Anything ASCII-based should be no problem.
If you have any doubt, reindexing every index which includes text column
data is the best course of action in my opinion, because that's safe
even if it has a higher cost.
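
To find candidate indexes, a catalog query along these lines could help (a sketch; it only looks at plain text/varchar/char columns and would miss domains or types like citext):

psql -d foooo_all -c "
SELECT DISTINCT i.indexrelid::regclass AS index_name
  FROM pg_index i
  JOIN pg_attribute a ON a.attrelid = i.indexrelid
  JOIN pg_type t ON t.oid = a.atttypid
 WHERE t.typname IN ('text', 'varchar', 'bpchar');"
# then REINDEX INDEX each index reported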
--
Michael


Re: Mixed Locales and Upgrading

Don Seiler
On Tue, Mar 17, 2020 at 9:25 PM Michael Paquier <[hidden email]> wrote:

There is no way to know how many indexes would get broken without
having a look at them.  Anything ASCII-based should be no problem.
If you have any doubt, reindexing every index which includes text column
data is the best course of action in my opinion, because that's safe
even if it has a higher cost.

Here's the fun part. A lot of the tables use UUIDv4 strings for primary keys. However these are stored in text/varchar columns.

--
Don Seiler
www.seiler.us

Re: Mixed Locales and Upgrading

Don Seiler
On Sun, Mar 22, 2020 at 4:48 PM Don Seiler <[hidden email]> wrote:

Here's the fun part. A lot of the tables use UUIDv4 strings for primary keys. However these are stored in text/varchar columns.

Actually, would I need to re-index on text columns that we know contain UUID strings? UUID characters seem to be pretty basic alphanumeric ASCII characters. 

--
Don Seiler
www.seiler.us

Re: Mixed Locales and Upgrading

Tom Lane
Don Seiler <[hidden email]> writes:
> Actually, would I need to re-index on text columns that we know contain
> UUID strings? UUID characters seem to be pretty basic alphanumeric ASCII
> characters.

I think you're all right with respect to those, since they're the
same under any encoding.  It's columns containing non-ASCII characters
that you'd want to worry about reindexing.

                        regards, tom lane



Re: Mixed Locales and Upgrading

Don Seiler
On Mon, Mar 30, 2020 at 4:30 PM Tom Lane <[hidden email]> wrote:
Don Seiler <[hidden email]> writes:
> Actually, would I need to re-index on text columns that we know contain
> UUID strings? UUID characters seem to be pretty basic alphanumeric ASCII
> characters.

I think you're all right with respect to those, since they're the
same under any encoding.  It's columns containing non-ASCII characters
that you'd want to worry about reindexing.

That's what I was hoping to hear. Thanks!

Don.

--
Don Seiler
www.seiler.us

Re: Mixed Locales and Upgrading

Don Seiler
On Mon, Mar 30, 2020 at 4:39 PM Don Seiler <[hidden email]> wrote:
On Mon, Mar 30, 2020 at 4:30 PM Tom Lane <[hidden email]> wrote:
Don Seiler <[hidden email]> writes:
> Actually, would I need to re-index on text columns that we know contain
> UUID strings? UUID characters seem to be pretty basic alphanumeric ASCII
> characters.

I think you're all right with respect to those, since they're the
same under any encoding.  It's columns containing non-ASCII characters
that you'd want to worry about reindexing.

Follow-up question: the locale setting on the host would still be set to en_US (as would the postgres and template0 databases). Should I look to change that locale on the system to en_US.UTF-8, or even just for the postgres user that the DB cluster runs as? What are the ramifications of doing (or not doing) so?

Don. 

--
Don Seiler
www.seiler.us

Re: Mixed Locales and Upgrading

Peter Eisentraut
On 2020-04-07 18:41, Don Seiler wrote:
> Follow-up question, the locale setting on the host would still be set to
> en_US (as would the postgres and template0 databases). Should I look to
> change that locale on the system to en_US.UTF-8, or even just for the
> postgres user that the DB cluster runs as? What are the ramification for
> doing (or not doing) so?

I think the only place where this would really matter is that psql by
default sets the client encoding based on the current OS locale setting.
So if you don't change the setting to en_US.UTF-8, then you might get
encoding errors when selecting data that is not representable as LATIN1
or whatever.  However, if you change the setting, that doesn't mean your
terminal setup will actually display Unicode correctly.  You said you're
dealing with mostly ASCII-ish data anyway, so it will probably not make
a difference.
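
If it does become an issue, the client encoding can also be pinned explicitly rather than relying on the OS locale, for example:

export PGCLIENTENCODING=UTF8    # picked up by libpq clients such as psql and pg_dump
# or, inside a psql session:
#   SET client_encoding = 'UTF8';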

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Mixed Locales and Upgrading

Peter J. Holzer
On 2020-03-30 17:30:32 -0400, Tom Lane wrote:
> Don Seiler <[hidden email]> writes:
> > Actually, would I need to re-index on text columns that we know contain
> > UUID strings? UUID characters seem to be pretty basic alphanumeric ASCII
> > characters.
>
> I think you're all right with respect to those, since they're the
> same under any encoding.  It's columns containing non-ASCII characters
> that you'd want to worry about reindexing.

I'm not so sure whether all locales agree on whether to sort digits
before or after letters. However, there are only two possibilities and
that's easy to check.
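
For example, run something like this in the old database and again in an en_US.utf8 database and compare the ordering (just a sketch):

psql -d foooo_all -c "SELECT x FROM (VALUES ('0'), ('9'), ('A'), ('Z'), ('a'), ('z')) v(x) ORDER BY x;"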

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | [hidden email]         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
