On 3/15/19 11:59 AM, Gunther wrote:
> This is not an issue for "hackers" nor "performance" in fact even for
> "general" it isn't really an issue.
As long as it's already been posted, may as well make it something
helpful to find in the archive.
> Understand charsets -- character set, code point, and encoding. Then
> understand how encoding and string literals and "escape sequences" in
> string literals might work.
Good advice for sure.
> Know that UNICODE today is the one standard, and there is no more need
I wasn't sure from the question whether the original poster was in
a position to choose the encoding of the database. Lots of things are
easier if it can be set to UTF-8 these days, but perhaps it's a legacy
Maybe a good start would be to go do
and then hit the internet and look up what that encoding (or those
encodings, if different) can and can't represent, and go from there.
It's worth knowing that, when the server encoding isn't UTF-8,
PostgreSQL will have the obvious limitations entailed by that,
but also some non-obvious ones that may be surprising, e.g. .
Many of us have faced character encoding issues because we are not in control of our input sources and made the common assumption that UTF-8 covers everything.
In my lab, as an example, some of our social media posts have included ZawGyi Burmese character sets rather than Unicode Burmese. (Because Myanmar developed technology in an environment closed off from the world, they made up their own non-standard character
set, which is still very common on mobile phones.) We had fully tested the app with Unicode Burmese, but honestly didn’t know ZawGyi was even a thing that we would see in our dataset. We’ve also had problems with non-Unicode word separators in Arabic.
What we’ve found to be helpful is to view the troublesome data in a hex editor and determine which non-standard characters may be causing the problem.
Some data conversion may be necessary before insertion. But the first step is knowing WHICH characters are causing the issue.
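
As a rough sketch of that first step, something like the following Python can stand in for the hex editor. The function names find_invalid_utf8 and describe_non_ascii are made up for this illustration, and it assumes the input is supposed to be UTF-8; it only reports where the suspicious bytes or characters are, it does not convert anything.

```python
import unicodedata


def find_invalid_utf8(raw):
    """Return (byte offset, hex of offending bytes) for each span that is not valid UTF-8.

    Hypothetical helper for illustration; assumes `raw` is a bytes object.
    """
    bad = []
    pos = 0
    while pos < len(raw):
        try:
            raw[pos:].decode("utf-8")
            break  # the rest of the input decodes cleanly
        except UnicodeDecodeError as exc:
            # exc.start / exc.end are relative to the slice we tried to decode
            start = pos + exc.start
            end = pos + exc.end
            bad.append((start, raw[start:end].hex()))
            pos = end  # skip the bad bytes and keep scanning
    return bad


def describe_non_ascii(text):
    """Return (index, character, code point, Unicode name) for each non-ASCII character."""
    report = []
    for i, ch in enumerate(text):
        if ord(ch) > 0x7F:
            name = unicodedata.name(ch, "<unnamed>")
            report.append((i, ch, f"U+{ord(ch):04X}", name))
    return report


if __name__ == "__main__":
    # 0xE9 is "é" in Latin-1, but it is not a valid UTF-8 sequence on its own
    print(find_invalid_utf8(b"caf\xe9 ok"))
    # A no-break space looks like an ordinary space until you list the code points
    print(describe_non_ascii("a\u00a0b"))
```

The second function is the more useful one for cases like ours, where the bytes are perfectly valid UTF-8 but the code points are not the ones the application expects.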
On 2019-03-17 15:01:40 +0000, Warner, Gary, Jr wrote:
> Many of us have faced character encoding issues because we are not in control
> of our input sources and made the common assumption that UTF-8 covers
UTF-8 covers "everything" in the sense that there is a round-trip from
each character in every commonly-used charset/encoding to Unicode and
The actual code may of course be different. For example, the € sign is
0xA4 in iso-8859-15, but U+20AC in Unicode. So you need an
And "commonly-used" means just that. Unicode covers a lot of character
sets, but it can't cover every character set ever invented (I invented
my own character sets when I was sixteen. Nobody except me ever used
them and they have long succumbed to bit rot).
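
To make the € example above concrete, here is a small Python sketch. The helper name transcode_to_utf8 is invented for this illustration; the codec names are the ones Python's standard library ships. It only shows that the same byte means different things depending on which charset you declare, and that the round trip works once the source encoding is known.

```python
def transcode_to_utf8(raw, source_encoding):
    """Decode `raw` using the declared source encoding and re-encode it as UTF-8.

    Hypothetical helper for illustration; the caller has to know the source encoding.
    """
    return raw.decode(source_encoding).encode("utf-8")


# Byte 0xA4 is the euro sign in iso-8859-15 ...
euro = b"\xa4".decode("iso8859-15")
assert euro == "\u20ac"

# ... but the generic currency sign (U+00A4) in iso-8859-1.
assert b"\xa4".decode("iso8859-1") == "\u00a4"

# In UTF-8 the euro sign takes three bytes.
assert transcode_to_utf8(b"\xa4", "iso8859-15") == b"\xe2\x82\xac"

# Round trip: Unicode back to iso-8859-15 yields the original byte.
assert euro.encode("iso8859-15") == b"\xa4"
```

Declaring the wrong source encoding does not raise an error here; it silently produces a different character, which is why knowing the real encoding of the input matters.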
> In my lab, as an example, some of our social media posts have included ZawGyi
> Burmese character sets rather than Unicode Burmese. (Because Myanmar developed
> technology in an environment closed off from the world, they made up their own
> non-standard character set, which is still very common on mobile phones.)
I'd be surprised if there were a character set which is "very common in
mobile phones", even in a relatively poor country like Myanmar. Does
ZawGyi actually include characters which aren't in Unicode, or are they
just encoded differently?