BUG #15805: Problem with lower function for greek sigma (Σ) letter

classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

BUG #15805: Problem with lower function for greek sigma (Σ) letter

apt.postgresql.org Repository Update
The following bug has been logged on the website:

Bug reference:      15805
Logged by:          Sergey kuznetsov
Email address:      [hidden email]
PostgreSQL version: 9.6.10
Operating system:   linux
Description:        

I see unexpected behaviour of lower function for greek sigma letter Σ.
According to wikipedia (https://en.wikipedia.org/wiki/Sigma) if this letter
is in final word position, it should be ς in lowercase, and not σ. But
PotgreSQL lower function returns σ, for example lower('ΔΗΜΟΤΕΣ ΦΑΙΣΤΟΥ') =
"δημοτεσ φαιστου" instead of "δημοτες φαιστου". So if I try to pass this
lower string from another part of a system (java code, for example) Postgre
will not return this row cause it differs from java-generated one.

Thanks,
Sergey Kuznetsov

Reply | Threaded
Open this post in threaded view
|

Re: BUG #15805: Problem with lower function for greek sigma (Σ) letter

Daniel Gustafsson
> On 15 May 2019, at 09:57, PG Bug reporting form <[hidden email]> wrote:

>
> The following bug has been logged on the website:
>
> Bug reference:      15805
> Logged by:          Sergey kuznetsov
> Email address:      [hidden email]
> PostgreSQL version: 9.6.10
> Operating system:   linux
> Description:        
>
> I see unexpected behaviour of lower function for greek sigma letter Σ.
> According to wikipedia (https://en.wikipedia.org/wiki/Sigma) if this letter
> is in final word position, it should be ς in lowercase, and not σ. But
> PotgreSQL lower function returns σ, for example lower('ΔΗΜΟΤΕΣ ΦΑΙΣΤΟΥ') =
> "δημοτεσ φαιστου" instead of "δημοτες φαιστου". So if I try to pass this
> lower string from another part of a system (java code, for example) Postgre
> will not return this row cause it differs from java-generated one.
This is indeed a bug, and a rare occurrence since AFAICT from ISO 30112 and
googling there is only a single case of word-final lowercasing which is this
sigma. The attached patch takes a stab at fixing this.

cheers ./daniel


sigma-wordfinal.patch (3K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: BUG #15805: Problem with lower function for greek sigma (Σ) letter

Alvaro Herrera-9
On 2019-May-15, Daniel Gustafsson wrote:

> > I see unexpected behaviour of lower function for greek sigma letter Σ.
> > According to wikipedia (https://en.wikipedia.org/wiki/Sigma) if this letter
> > is in final word position, it should be ς in lowercase, and not σ. But
> > PotgreSQL lower function returns σ, for example lower('ΔΗΜΟΤΕΣ ΦΑΙΣΤΟΥ') =
> > "δημοτεσ φαιστου" instead of "δημοτες φαιστου". So if I try to pass this
> > lower string from another part of a system (java code, for example) Postgre
> > will not return this row cause it differs from java-generated one.
>
> This is indeed a bug, and a rare occurrence since AFAICT from ISO 30112 and
> googling there is only a single case of word-final lowercasing which is this
> sigma. The attached patch takes a stab at fixing this.

Ummm ... isn't this a counterexample?
https://hebrew4christians.com/Grammar/Unit_One/Final_Forms/final_forms.html

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: BUG #15805: Problem with lower function for greek sigma (Σ) letter

Tom Lane-2
Alvaro Herrera <[hidden email]> writes:
> On 2019-May-15, Daniel Gustafsson wrote:
>> This is indeed a bug, and a rare occurrence since AFAICT from ISO 30112 and
>> googling there is only a single case of word-final lowercasing which is this
>> sigma. The attached patch takes a stab at fixing this.

> Ummm ... isn't this a counterexample?
> https://hebrew4christians.com/Grammar/Unit_One/Final_Forms/final_forms.html

I do not think the patch as given is acceptable in any case:

1. assumes without any evidence whatsoever that the system's wide-character
representation is Unicode code points;

2. assumes without checking that the locale is one that would allow this
conversion (counterexample: C locale);

3. unreasonable hard-coded assumption about what the "not a word character"
condition is.

It's possible that 1 and 2 could be finessed by checking both that the
original character is Σ and the new one is σ (in Unicode).  We'd still
theoretically be taking a risk of the wrong substitution if the wchar
representation is not Unicode, but the odds seem fairly small.  As for
point 3, why aren't you using iswalpha() on the next character?

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: BUG #15805: Problem with lower function for greek sigma (Σ) letter

Daniel Verite
In reply to this post by apt.postgresql.org Repository Update
        PG Bug reporting form wrote:

> lower('ΔΗΜΟΤΕΣ ΦΑΙΣΤΟΥ') =
> "δημοτεσ φαιστου" instead of "δημοτες φαιστου"

With PostgreSQL version 10 or newer, you could use an ICU
locale. lower() would produce the expected result:

psql (11.3 (Debian 11.3-1.pgdg90+1))

=> select lower('ΔΗΜΟΤΕΣ ΦΑΙΣΤΟΥ' collate "und-x-icu");
      lower
-----------------
 δημοτες φαιστου
(1 row)

This case looks comparable to the case of the german ß (sharp s),
which should be upcased into 'SS', but the locales backed by libc
don't do that:

=> select upper(''Ich muß');
  upper  
---------
 ICH MUß

For that exemple as well, an ICU locale produces a correct
result with regard to linguistic rules:

=> select upper('Ich muß' collate "und-x-icu");
  upper  
----------
 ICH MUSS

The libc library provides an API with character-by-character
case conversions (tolower/toupper), which is too limited
to deal with the above cases, and PostgreSQL basically
just uses this API.


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


Reply | Threaded
Open this post in threaded view
|

Re: BUG #15805: Problem with lower function for greek sigma (Σ) letter

Tom Lane-2
"Daniel Verite" <[hidden email]> writes:
> PG Bug reporting form wrote:
>> lower('ΔΗΜΟΤΕΣ ΦΑΙΣΤΟΥ') =
>> "δημοτεσ φαιστου" instead of "δημοτες φαιστου"

> With PostgreSQL version 10 or newer, you could use an ICU
> locale. lower() would produce the expected result:

Oh, if using ICU already fixes this, I think we might as well just
say that you have to use ICU if you want the right behavior for such
cases.

> The libc library provides an API with character-by-character
> case conversions (tolower/toupper), which is too limited
> to deal with the above cases, and PostgreSQL basically
> just uses this API.

Right.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: BUG #15805: Problem with lower function for greek sigma (Σ) letter

Daniel Gustafsson
In reply to this post by Alvaro Herrera-9
> On 15 May 2019, at 14:20, Alvaro Herrera <[hidden email]> wrote:
>
> On 2019-May-15, Daniel Gustafsson wrote:
>
>>> I see unexpected behaviour of lower function for greek sigma letter Σ.
>>> According to wikipedia (https://en.wikipedia.org/wiki/Sigma) if this letter
>>> is in final word position, it should be ς in lowercase, and not σ. But
>>> PotgreSQL lower function returns σ, for example lower('ΔΗΜΟΤΕΣ ΦΑΙΣΤΟΥ') =
>>> "δημοτεσ φαιστου" instead of "δημοτες φαιστου". So if I try to pass this
>>> lower string from another part of a system (java code, for example) Postgre
>>> will not return this row cause it differs from java-generated one.
>>
>> This is indeed a bug, and a rare occurrence since AFAICT from ISO 30112 and
>> googling there is only a single case of word-final lowercasing which is this
>> sigma. The attached patch takes a stab at fixing this.
>
> Ummm ... isn't this a counterexample?
> https://hebrew4christians.com/Grammar/Unit_One/Final_Forms/final_forms.html

Hebrew doesn’t have case, so it doesn’t apply in this case.

cheers ./daniel

Reply | Threaded
Open this post in threaded view
|

Re: BUG #15805: Problem with lower function for greek sigma (Σ) letter

Alvaro Herrera-9
On 2019-May-15, Daniel Gustafsson wrote:

> > On 15 May 2019, at 14:20, Alvaro Herrera <[hidden email]> wrote:

> > Ummm ... isn't this a counterexample?
> > https://hebrew4christians.com/Grammar/Unit_One/Final_Forms/final_forms.html
>
> Hebrew doesn’t have case, so it doesn’t apply in this case.

Yeah, I realized that afterwards.  Sorry for the noise.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: BUG #15805: Problem with lower function for greek sigma (Σ) letter

Daniel Gustafsson
In reply to this post by Tom Lane-2
> On 15 May 2019, at 17:46, Tom Lane <[hidden email]> wrote:

>
> "Daniel Verite" <[hidden email]> writes:
>> PG Bug reporting form wrote:
>>> lower('ΔΗΜΟΤΕΣ ΦΑΙΣΤΟΥ') =
>>> "δημοτεσ φαιστου" instead of "δημοτες φαιστου"
>
>> With PostgreSQL version 10 or newer, you could use an ICU
>> locale. lower() would produce the expected result:
>
> Oh, if using ICU already fixes this, I think we might as well just
> say that you have to use ICU if you want the right behavior for such
> cases.
Seems reasonable.  Maybe it warrants a mention in the docs on the string
function page since it may suprise users?

cheers ./daniel


casefolding.diff (557 bytes) Download Attachment