Operator "=" not unicode-safe?

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Operator "=" not unicode-safe?

Jörg Haustein
PostgreSQL version: 8.0.3
Operating system:   Linux (SuSE 9.1)

I have a UNICODE database, trying to compare two unicode strings (Ethiopic
characters). Client encoding is also UNICODE:
===================================================
testdb=> select 'በድሩ ሁሴን'='ሰይፉ ከበደ';
 ?column?
----------
 t
(1 row)

Clearly, it can be seen that they are not equal. The "LIKE" operator also
seems to think so:

testdb=> select 'በድሩ ሁሴን' LIKE 'ሰይፉ ከበደ';
 ?column?
----------
 f
(1 row)
===================================================

What is the problem here?
The behavior is the same with SQL_ASCII databases and the SQL_ASCII client
encoding.

Of course one could always overload the operator or just use LIKE. But
where it really matters is with queries using UNION, EXCEPT or INTERSECT:

==========================

testdb=> select a from a;
    a
---------
 በድሩ ሁሴን
 ሰይፉ ከበደ
(2 rows)

testdb=> select a from b;
    a
---------
 ሰይፉ ከበደ
(1 row)

testdb=> select a from a union select a from b;
    a
---------
 በድሩ ሁሴን
(1 row)

testdb=> select a from a except select a from b;
 a
---
(0 rows)

testdb=> select a from a intersect select a from b;
    a
---------
 በድሩ ሁሴን
(1 row)
==========================

What can I do?
With kind regards,

Jörg Haustein




---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
       subscribe-nomail command to [hidden email] so that your
       message can get through to the mailing list cleanly
Reply | Threaded
Open this post in threaded view
|

Re: Operator "=" not unicode-safe?

Tom Lane-2
=?UTF-8?B?SsO2cmcgSGF1c3RlaW4=?= <[hidden email]> writes:
> I have a UNICODE database, trying to compare two unicode strings (Ethiopic
> characters). Client encoding is also UNICODE:
> ===================================================
> testdb=> select '??? ???'='??? ???';
>  ?column?
> ----------
>  t
> (1 row)

> Clearly, it can be seen that they are not equal.

Sounds to me like you chose a locale that is expecting some non-Unicode
encoding.  "=" ultimately depends on the system's strcoll() routine,
and in many locales strcoll doesn't behave very sanely when handed data
that's illegal in whatever it thinks the encoding is.

Redo your initdb in a locale that is UTF-8 based, and make sure to keep
the database encoding UTF8.  The apparent flexibility to choose
different database encodings really only works if the underlying locale
is "C".

There is a warning about this in the docs, though perhaps not prominent
enough:
http://www.postgresql.org/docs/8.0/static/multibyte.html#AEN20633

                        regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
       subscribe-nomail command to [hidden email] so that your
       message can get through to the mailing list cleanly