BUG #1721: multiple bytes character string comparison error

BUG #1721: multiple bytes character string comparison error

Chii-Tung Liu

The following bug has been logged online:

Bug reference:      1721
Logged by:          Chii-Tung Liu
Email address:      [hidden email]
PostgreSQL version: 8.0.3
Operating system:   Windows XP SP2
Description:        multiple bytes character string comparison error
Details:

When comparing two UTF-8 encoded strings that contain Chinese characters, the
result is always TRUE.
1. Create a database named test with the encoding set to UNICODE:
CREATE DATABASE test
  WITH OWNER = postgres
       ENCODING = 'UNICODE'
       TABLESPACE = pg_default;
2. Insert data containing Chinese characters:
INSERT INTO node (title) VALUES ('1 中文');

3. SELECT title FROM node WHERE title > '1.1 ';
returns '1 中文'.

4. Both SELECT '1 中文' > '1.1' and SELECT '1.1' > '1 中文' return
FALSE.

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

               http://www.postgresql.org/docs/faq

Re: BUG #1721: multiple bytes character string comparison error

Tom Lane-2
"Chii-Tung Liu" <[hidden email]> writes:
> PostgreSQL version: 8.0.3
> Operating system:   Windows XP SP2

> When compare two UTF-8 encoded string that contains Chinese words, the
> result is always TRUE

Sorry, but UTF-8 encoding doesn't work properly on Windows (yet).
Use some other database encoding.

                        regards, tom lane


Re: BUG #1721: multiple bytes character string comparison

Kris Jurka


On Sun, 19 Jun 2005, Tom Lane wrote:

> "Chii-Tung Liu" <[hidden email]> writes:
> > PostgreSQL version: 8.0.3
> > Operating system:   Windows XP SP2
>
> > When compare two UTF-8 encoded string that contains Chinese words, the
> > result is always TRUE
>
> Sorry, but UTF-8 encoding doesn't work properly on Windows (yet).
> Use some other database encoding.
>

Shouldn't we forbid its creation then?  At least a strongly worded
warning?  We see these complaints too often.

Kris Jurka


Re: BUG #1721: multiple bytes character string comparison error

Tom Lane-2
Kris Jurka <[hidden email]> writes:
> On Sun, 19 Jun 2005, Tom Lane wrote:
>> Sorry, but UTF-8 encoding doesn't work properly on Windows (yet).
>> Use some other database encoding.

> Shouldn't we forbid its creation then?

There was serious discussion of that before the 8.0 release, but
we decided not to forbid it.  Check the archives; I don't recall
the reasoning at the moment.

> We see these complaints too often.

There are lots of complaints we see way too often ;-) ... but
distressingly, there are still only 24 hours in a day.

                        regards, tom lane


Re: BUG #1721: multiple bytes character string comparison

Tatsuo Ishii
In reply to this post by Chii-Tung Liu
> The following bug has been logged online:
>
> Bug reference:      1721
> Logged by:          Chii-Tung Liu
> Email address:      [hidden email]
> PostgreSQL version: 8.0.3
> Operating system:   Windows XP SP2
> Description:        mutiple bytes character string comaprison error
> Details:
>
> When compare two UTF-8 encoded string that contains Chinese words, the
> result is always TRUE
> 1. create a database test with encoding set to unicode
> CREATE DATABASE test
>   WITH OWNER = postgres
>        ENCODING = 'UNICODE'
>        TABLESPACE = pg_default;
> 2. insert data with Chinese words
> INSERT into node set title='1 中文'
>
> 3. SELECT title from node where title > '1.1 '
> would return '1 中文'
>
> 4. Both SELECT '1 中文' > '1.1' and  SELECT '1.1' > '1 中文' return
> FALSE

I think you need to use the C locale.
--
Tatsuo Ishii
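
A note for readers following the C locale suggestion: under the C locale,
string comparison is generally understood to be done byte by byte rather than
through the platform's collation routines. The Python sketch below is not
PostgreSQL code. It only imitates such a byte-wise comparison on the UTF-8
bytes of the two strings from the report, and the helper name
bytewise_compare is made up for illustration.

```python
def bytewise_compare(a: str, b: str) -> int:
    """Compare two strings by their UTF-8 bytes (an imitation of C-locale ordering).

    Returns -1 if a sorts first, 1 if b sorts first, 0 if they are equal.
    """
    a_bytes = a.encode("utf-8")
    b_bytes = b.encode("utf-8")
    # Python compares bytes objects lexicographically, byte by byte.
    return (a_bytes > b_bytes) - (a_bytes < b_bytes)


# The two strings from the bug report.
print(bytewise_compare("1 中文", "1.1"))  # -1
print(bytewise_compare("1.1", "1 中文"))  # 1
```

With a byte-wise comparison exactly one of the two orderings holds: the space
(0x20) after the first character sorts before the period (0x2E), so '1 中文'
sorts before '1.1'. The multibyte characters are never reached.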


Re: BUG #1721: multiple bytes character string comparison error

Bruce Momjian-2
In reply to this post by Tom Lane-2
Tom Lane wrote:

> Kris Jurka <[hidden email]> writes:
> > On Sun, 19 Jun 2005, Tom Lane wrote:
> >> Sorry, but UTF-8 encoding doesn't work properly on Windows (yet).
> >> Use some other database encoding.
>
> > Shouldn't we forbid its creation then?
>
> There was serious discussion of that before the 8.0 release, but
> we decided not to forbid it.  Check the archives; I don't recall
> the reasoning at the moment.

UTF8 encoding works with the C locale assuming you don't care about
ordering of the character set, e.g. Japanese.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  [hidden email]               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: BUG #1721: multiple bytes character string comparison error

John Hansen
In reply to this post by Tom Lane-2
>
> UTF8 encoding works with the C locale assuming you don't care
> about ordering of the character set, e.g. Japanese.
>

Has anyone with the ability to compile PostgreSQL on Windows tested the
ICU patch?

... John


Re: BUG #1721: multiple bytes character string comparison error

Magnus Hagander
In reply to this post by Tom Lane-2
> > UTF8 encoding works with the C locale assuming you don't care about
> > ordering of the character set, e.g. Japanese.
> >
>
> Has anyone with the ability to compile postgresql on windows
> tested the ICU patch?

Yes.
See http://archives.postgresql.org/pgsql-hackers/2005-05/msg00662.php


//Magnus


Re: BUG #1721: multiple bytes character string comparison

Tatsuo Ishii
In reply to this post by Bruce Momjian-2
> Tom Lane wrote:
> > Kris Jurka <[hidden email]> writes:
> > > On Sun, 19 Jun 2005, Tom Lane wrote:
> > >> Sorry, but UTF-8 encoding doesn't work properly on Windows (yet).
> > >> Use some other database encoding.
> >
> > > Shouldn't we forbid its creation then?
> >
> > There was serious discussion of that before the 8.0 release, but
> > we decided not to forbid it.  Check the archives; I don't recall
> > the reasoning at the moment.
>
> UTF8 encoding works with the C locale assuming you don't care about
> ordering of the character set, e.g. Japanese.

No, sometimes Japanese needs character ordering too, and I think this is
not a Windows-only problem. The real problem is that Unicode defines
character order in a totally random manner, because Chinese/Japanese/Korean
Kanji characters are "unified" in Unicode. To solve the problem, we can
convert UTF8 to EUC_JP using CONVERT. See the archives for more details.

Or you can use a Unicode locale, but only if your platform's locale database
is not broken and you use only a single locale.
--
Tatsuo Ishii
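
To illustrate why converting to EUC_JP changes the ordering: the Python sketch
below is not the PostgreSQL CONVERT function. It only sorts the same strings
by the bytes they have in two different encodings. The helper name
sort_by_encoding and the two sample characters are assumptions chosen for
illustration.

```python
def sort_by_encoding(words, encoding):
    """Sort strings by the raw bytes they have in the given encoding."""
    return sorted(words, key=lambda word: word.encode(encoding))


# Two sample kanji picked for illustration (not taken from the thread).
kanji = ["一", "亜"]

# Ordering by UTF-8 bytes follows Unicode code point order.
utf8_order = sort_by_encoding(kanji, "utf-8")

# Ordering by EUC-JP bytes follows the JIS character set layout instead.
eucjp_order = sort_by_encoding(kanji, "euc_jp")

print(utf8_order)
print(eucjp_order)
```

The two orders differ for this pair, which is the effect the CONVERT
workaround relies on: the byte order of the target encoding, not the Unicode
code point order, decides how the strings sort.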
