Encoding/collation question

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

Encoding/collation question

Rich Shepard
My older databases have LATIN1 encoding and C collation; the newer ones have
UTF8 encoding and en_US.UTF-8 collation. A web search taught me that I can
change each old database by dumping it and restoring it with the desired
encoding and collation types. My question is whether the older types make
any difference in a single-user environment.

Regards,

Rich


Reply | Threaded
Open this post in threaded view
|

Re: Encoding/collation question

Tom Lane-2
Rich Shepard <[hidden email]> writes:
> My older databases have LATIN1 encoding and C collation; the newer ones have
> UTF8 encoding and en_US.UTF-8 collation. A web search taught me that I can
> change each old database by dumping it and restoring it with the desired
> encoding and collation types. My question is whether the older types make
> any difference in a single-user environment.

String comparisons in non-C collations tend to be a lot slower than
they are in C collation.  Whether this makes a noticeable difference
to you depends on your workload, but certainly we've seen performance
gripes that trace to that.

If your data doesn't require the larger character set of UTF8, then
using LATIN-any is going to offer some space savings (for non-ASCII
characters) plus minor performance benefits due to the lack of
variable-width characters.  This is less significant than the
collation issue, though, for most people.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: Encoding/collation question

Rich Shepard
On Wed, 11 Dec 2019, Tom Lane wrote:

> String comparisons in non-C collations tend to be a lot slower than they
> are in C collation. Whether this makes a noticeable difference to you
> depends on your workload, but certainly we've seen performance gripes that
> trace to that.

Tom,

How interesting.

> If your data doesn't require the larger character set of UTF8, then using
> LATIN-any is going to offer some space savings (for non-ASCII characters)
> plus minor performance benefits due to the lack of variable-width
> characters. This is less significant than the collation issue, though, for
> most people.

I doubt that my use will notice meaningful differences. Since there are only
two or three databases in UTF8 and its collation perhaps I'll convert those
to LATIN1 and C.

Thanks for the insights.

Regards,

Rich


Reply | Threaded
Open this post in threaded view
|

Re: Encoding/collation question

Andrew Gierth
>>>>> "Rich" == Rich Shepard <[hidden email]> writes:

 Rich> I doubt that my use will notice meaningful differences. Since
 Rich> there are only two or three databases in UTF8 and its collation
 Rich> perhaps I'll convert those to LATIN1 and C.

Note that it's perfectly fine to use UTF8 encoding and C collation (this
has the effect of sorting strings in Unicode codepoint order); this is
as fast for comparisons as LATIN1/C is.

For those cases where you need data to be sorted in a
culturally-meaningful order rather than in codepoint order, you can set
collations on specific columns or in individual queries.

--
Andrew (irc:RhodiumToad)


Reply | Threaded
Open this post in threaded view
|

Re: Encoding/collation question

Karsten Hilbert
On Thu, Dec 12, 2019 at 05:03:59AM +0000, Andrew Gierth wrote:

>  Rich> I doubt that my use will notice meaningful differences. Since
>  Rich> there are only two or three databases in UTF8 and its collation
>  Rich> perhaps I'll convert those to LATIN1 and C.
>
> Note that it's perfectly fine to use UTF8 encoding and C collation (this
> has the effect of sorting strings in Unicode codepoint order); this is
> as fast for comparisons as LATIN1/C is.
>
> For those cases where you need data to be sorted in a
> culturally-meaningful order rather than in codepoint order, you can set
> collations on specific columns or in individual queries.

Nice, thanks for pointing that out. One addition: while this
may seem like "the" magic bullet it should be noted that one
will need additional indexes for culturally-meaningful ORDER
BY sorts to be fast (while having a default non-C collation
one will get a by-default culturally-meaningful index for
that one non-C locale).

Question: is C collation expected to be future-proof /
rock-solid /stable -- like UTF8 for encoding choice -- or
could it end up like the SQL-ASCII encoding did: Yeah, we
support it, it's been in use a long time, it should work,
but, nah, one doesn't really want to choose it over UTF8 if
at all possible, or at least know *exactly* what one is doing
and BTW YMMV ?

Karsten
--
GPG  40BE 5B0E C98E 1713 AFA6  5BC0 3BEA AC80 7D4F C89B


Reply | Threaded
Open this post in threaded view
|

Re: Encoding/collation question

Tom Lane-2
Karsten Hilbert <[hidden email]> writes:
> Question: is C collation expected to be future-proof /
> rock-solid /stable -- like UTF8 for encoding choice -- or
> could it end up like the SQL-ASCII encoding did: Yeah, we
> support it, it's been in use a long time, it should work,
> but, nah, one doesn't really want to choose it over UTF8 if
> at all possible, or at least know *exactly* what one is doing
> and BTW YMMV ?

C collation basically devolves to strcmp/memcmp, which are as standard
and well-defined as can be.  If you're happy with the way it sorts
things then there's no reason not to use it.

It's actually all the *other* collations where you should worry about
their behavior being a moving target :-(.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: Encoding/collation question

Rich Shepard
In reply to this post by Andrew Gierth
On Thu, 12 Dec 2019, Andrew Gierth wrote:

> Note that it's perfectly fine to use UTF8 encoding and C collation (this
> has the effect of sorting strings in Unicode codepoint order); this is as
> fast for comparisons as LATIN1/C is.

Andrew,

This is really useful insight. I've not thought of the relationship of
encoding to collation (which I now know there isn't.)

> For those cases where you need data to be sorted in a
> culturally-meaningful order rather than in codepoint order, you can set
> collations on specific columns or in individual queries.

Not an issue for my work. :-)

Thanks very much,

Rich


Reply | Threaded
Open this post in threaded view
|

Re: Encoding/collation question

Karsten Hilbert
In reply to this post by Tom Lane-2
On Thu, Dec 12, 2019 at 08:35:53AM -0500, Tom Lane wrote:

> C collation basically devolves to strcmp/memcmp, which are as standard
> and well-defined as can be.  If you're happy with the way it sorts
> things then there's no reason not to use it.

So that's the collation to use when "technical" sorting is
required (say, when uniqueness does not depend on the notion
of culturally equivalent characters).

> It's actually all the *other* collations where you should worry about
> their behavior being a moving target :-(.

But then that is to be expected.

Thanks,
Karsten
--
GPG  40BE 5B0E C98E 1713 AFA6  5BC0 3BEA AC80 7D4F C89B