BUG #15548: Unaccent does not remove combining diacritical characters

PG Doc comments form
The following bug has been logged on the website:

Bug reference:      15548
Logged by:          Hugh Ranalli
Email address:      [hidden email]
PostgreSQL version: 11.1
Operating system:   Ubuntu 18.04
Description:        

Apparently Unicode has two ways of accenting a character: as a single
precomposed code point, which represents the base character and the accent
together, or as a "combining diacritical mark"
(https://en.wikipedia.org/wiki/Combining_Diacritical_Marks), in which case
the mark applies itself to the preceding character. For example, A followed
by U+0300 displays À. However, unaccent is not removing these accents.
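
A minimal Python sketch of the two representations, using only the standard library `unicodedata` module. The variable names are illustrative; nothing here touches PostgreSQL itself.

```python
import unicodedata

precomposed = "\u00c0"   # LATIN CAPITAL LETTER A WITH GRAVE: one code point
decomposed = "A\u0300"   # 'A' followed by COMBINING GRAVE ACCENT: two code points

# Both display as À, but they are different strings.
print(precomposed == decomposed)          # False
print(len(precomposed), len(decomposed))  # 1 2

# Normalization converts between the two forms.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True

# The accent itself is a nonspacing mark (general category Mn).
print(unicodedata.category("\u0300"))     # Mn
```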

SELECT unaccent(U&'A\0300'); should result in 'A', but instead results in
'À'. I'm running PostgreSQL 11.1, installed from the PostgreSQL
repositories. I've read bug report #13440, and have tried with both the
installed unaccent.rules as well as a new set generated by the
generate_unaccent_rules.py distributed with the 11.1 source code:
  wget http://unicode.org/Public/7.0.0/ucd/UnicodeData.txt
  wget https://www.unicode.org/repos/cldr/trunk/common/transforms/Latin-ASCII.xml
  python generate_unaccent_rules.py --unicode-data-file UnicodeData.txt --latin-ascii-file Latin-ASCII.xml > unaccent.rules

I see there have been some updates to generate_unaccent_rules.py to handle
Greek and Vietnamese characters, but neither of them seems to address this
issue. I'm happy to contribute a patch to handle these cases, but of course
wanted to check first whether this is desired behaviour, or whether I am
misunderstanding something somewhere.

Thank you,
Hugh Ranalli


Re: BUG #15548: Unaccent does not remove combining diacritical characters

Daniel Verite
        PG Bug reporting form wrote:

> Apparently Unicode has two ways of accenting a character: as a separate code
> point, which represents the base character and the accent, or as a
> "combining diacritical mark"
> (https://en.wikipedia.org/wiki/Combining_Diacritical_Marks)

Yes. See also https://en.wikipedia.org/wiki/Unicode_equivalence

In general, PostgreSQL leaves it to applications to normalize
Unicode strings so that they are all in the same canonical form,
either composed or decomposed.

> the mark applies itself to the preceding character. For example, A
> followed by U+0300 displays À. However, unaccent is not removing
> these accents.

Short of having the input normalized by the application, ISTM that the
best solution would be to provide functions to do it in Postgres, so
you'd just write for example:
    unaccent(unicode_NFC(string))

Otherwise unaccent.rules can be customized. You may add replacements
for letter+diacritical sequences that are missing for the languages
you have to deal with. But doing it in general for all diacriticals
multiplied by all base characters seems unrealistic.
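
A rough Python sketch of the normalize-then-unaccent idea. The `RULES` table is a hypothetical, tiny stand-in for unaccent.rules, and both function names are invented for illustration. `unicode_NFC` above is only a proposed name, so Python's `unicodedata.normalize` stands in for it here.

```python
import unicodedata

# Hypothetical stand-in for unaccent.rules: precomposed letters only.
RULES = {"\u00c0": "A", "\u00e9": "e", "\u00e8": "e", "\u00ee": "i"}

def unaccent_precomposed(text):
    """Replace each character that has a rule; leave everything else alone."""
    return "".join(RULES.get(ch, ch) for ch in text)

def unaccent_after_nfc(text):
    """Compose first, so letter+mark sequences become single code points."""
    return unaccent_precomposed(unicodedata.normalize("NFC", text))

decomposed = "A\u0300"
print(unaccent_precomposed(decomposed) == decomposed)  # True: the mark survives
print(unaccent_after_nfc(decomposed))                  # A

# NFC only helps when a precomposed form exists; x + U+0300 has none,
# so the combining mark is still there afterwards.
print(unaccent_after_nfc("x\u0300") == "x\u0300")      # True
```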


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


Re: BUG #15548: Unaccent does not remove combining diacritical characters

Tom Lane-2
"Daniel Verite" <[hidden email]> writes:
> PG Bug reporting form wrote:
>> ... For example, A
>> followed by U+0300 displays À. However, unaccent is not removing
>> these accents.

> Short of having the input normalized by the application, ISTM that the
> best solution would be to provide functions to do it in Postgres, so
> you'd just write for example:
>     unaccent(unicode_NFC(string))

That might be worthwhile, but it seems independent of this issue.

> Otherwise unaccent.rules can be customized. You may add replacements
> for letter+diacritical sequences that are missing for the languages
> you have to deal with. But doing it in general for all diacriticals
> multiplied by all base characters seems unrealistic.

Hm, I thought the OP's proposal was just to make unaccent drop
combining diacriticals independently of context, which'd avoid the
combinatorial-growth problem.
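
A small Python sketch of dropping combining diacriticals regardless of the preceding character. It assumes only the Combining Diacritical Marks block (U+0300 through U+036F) is of interest; `drop_combining` is an illustrative name, not part of unaccent.

```python
# Combining Diacritical Marks block: U+0300 through U+036F.
COMBINING_MARKS = range(0x0300, 0x0370)

# str.translate deletes any code point that is mapped to None.
DROP_TABLE = {cp: None for cp in COMBINING_MARKS}

def drop_combining(text):
    """Delete combining marks wherever they occur, whatever precedes them."""
    return text.translate(DROP_TABLE)

print(len(DROP_TABLE))                             # 112
print(drop_combining("A\u0300"))                   # A
print(drop_combining("un cafe\u0301 cre\u0300me"))  # un cafe creme
```

The table has one entry per mark, not one per base-letter and mark pair, which is why the combinatorial growth does not arise. Precomposed characters such as U+00C0 are untouched by this step and still need their own rules.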

                        regards, tom lane


Re: BUG #15548: Unaccent does not remove combining diacritical characters

Daniel Verite
        Tom Lane wrote:

> Hm, I thought the OP's proposal was just to make unaccent drop
> combining diacriticals independently of context, which'd avoid the
> combinatorial-growth problem.

In that case, this could be achieved by simply appending the
diacriticals themselves to unaccent.rules, since replacement of a
string by an empty string is already supported as a rule.
It doesn't seem like the current file has any of these, but from
https://www.postgresql.org/docs/11/unaccent.html :

 "Alternatively, if only one character is given on a line, instances
 of that character are deleted; this is useful in languages where
 accents are represented by separate characters"

Incidentally we may want to improve this bit of doc to mention
explicitly the Unicode decomposed forms as a use case for
removing characters. In fact I wonder if that's not what it's
already trying to express, but confusing "languages" with "forms".
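
A simplified Python sketch of how such rules could behave, based only on the documented format quoted above: a line with two fields replaces the first with the second, and a line with a single character deletes that character. This is not the unaccent implementation; it assumes every rule source is a single code point, and `parse_rules` and `apply_rules` are illustrative names.

```python
def parse_rules(lines):
    """Map each source character to its replacement ('' means delete)."""
    rules = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue                      # skip blank lines
        source = fields[0]
        target = fields[1] if len(fields) > 1 else ""  # lone character: delete
        rules[source] = target
    return rules

def apply_rules(text, rules):
    """Apply single-character rules; characters without a rule pass through."""
    return "".join(rules.get(ch, ch) for ch in text)

# Two ordinary rules followed by two deletion rules for combining marks.
sample_lines = ["\u00c0\tA", "\u00e9\te", "\u0300", "\u0301"]
rules = parse_rules(sample_lines)

print(apply_rules("A\u0300 caf\u00e9 cre\u0301me", rules))  # A cafe creme
```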


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


Re: BUG #15548: Unaccent does not remove combining diacritical characters

Hugh Ranalli


On Thu, 13 Dec 2018, 11:26 Daniel Verite <[hidden email]> wrote:
>         Tom Lane wrote:
>
> > Hm, I thought the OP's proposal was just to make unaccent drop
> > combining diacriticals independently of context, which'd avoid the
> > combinatorial-growth problem.

That's what I was thinking. Given that the accent is separate from the characters, simply dropping it should result in the correct unaccented character.

> In that case, this could be achieved by simply appending the
> diacriticals themselves to unaccent.rules, since replacement of a
> string by an empty string is already supported as a rule.
> It doesn't seem like the current file has any of these, but from
> https://www.postgresql.org/docs/11/unaccent.html :
>
>  "Alternatively, if only one character is given on a line, instances
>  of that character are deleted; this is useful in languages where
>  accents are represented by separate characters"

Yes, I had read that in the docs, and that's the approach I planned to take. I'll go ahead and develop a patch, then.

Best wishes,
Hugh

Re: BUG #15548: Unaccent does not remove combining diacritical characters

Hugh Ranalli
I've attached a patch that removes combining diacriticals. As with Latin and Greek letters, it uses ranges to restrict its activity.

I have not submitted a patch for unaccent.rules, as it seems that a rules file generated from generate_unaccent_rules.py will actually remove a large number of rules (even before my changes), such as replacing the copyright symbol © with (C), as well as other accented characters. It's probably worth asking if the shipped unaccent.rules should correspond to what the shipped generation utility produces, or not. I was surprised to see that it didn't.

Please let me know if you see anything I need to change.

Best wishes,
Hugh

--
Hugh Ranalli
Principal Consultant
White Horse Technology Consulting
e: [hidden email]
c: +01-416-994-7957
w: www.whtc.ca



remove-combining-diacritical-accents-in-unaccent.rules.patch (3K) Download Attachment

Re: BUG #15548: Unaccent does not remove combining diacritical characters

Tom Lane-2
Hugh Ranalli <[hidden email]> writes:
> I've attached a patch removes combining diacriticals. As with Latin and
> Greek letters, it uses ranges to restrict its activity.

Cool.  Please add it to the current CF so we don't forget about it:
https://commitfest.postgresql.org/21/

> I have not submitted a patch for unaccent.rules, as it seems that a rules
> file generated from generate_unaccent_rules.py will actually remove a large
> number of rules (even before my changes), such as replacing the copyright
> symbol © with (C), as well as other accented characters. It's probably
> worth asking if the shipped unaccent.rules should correspond to what the
> shipped generation utility produces, or not. I was surprised to see that it
> didn't.

Me too -- seems like that bears looking into.  Perhaps the script's
results are platform dependent -- what were you testing on?

                        regards, tom lane


Re: BUG #15548: Unaccent does not remove combining diacritical characters

Hugh Ranalli


On Fri, 14 Dec 2018 at 17:50, Tom Lane <[hidden email]> wrote:
> Cool.  Please add it to the current CF so we don't forget about it:
> https://commitfest.postgresql.org/21/

Done.

> Me too -- seems like that bears looking into.  Perhaps the script's
> results are platform dependent -- what were you testing on?

I'm on Linux Mint 17, which is based on Ubuntu 14.04. But I don't think that's it. The program's decisions come from the two data files, the Unicode data set and the Latin-ASCII transliteration file. The script uses categories (ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html#General%20Category) to identify letters (and now combining marks) and if they are in range, performs a substitution. It then uses the transliteration file to find rules for particular character substitutions (for example, that file seems to handle the copyright symbol substitution). I don't see anything platform dependent in there.

In looking more closely, I also see that the script isn't generating ligatures, even though it should: although the program can generate them, none of the ligatures are in the ranges defined in PLAIN_LETTER_RANGES, and so they are skipped.

This could probably be handled by adding the ligature ranges to the defined ranges. Symbol types could be added to the types it looks at, and perhaps the codepoint ranges collapsed into one list, as the IDs are unique across all categories. I don't think we'd want to just rely on ranges, as that could include control characters, punctuation, etc. 
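
A rough Python sketch of combining a general-category test with a code point range, as described above. It is not taken from generate_unaccent_rules.py: it uses Python's built-in `unicodedata` tables rather than parsing UnicodeData.txt, and `COMBINING_MARK_RANGES`, `is_mark`, `in_ranges`, and `mark_rules` are illustrative names.

```python
import unicodedata

# Assumed range for illustration: the Combining Diacritical Marks block only.
COMBINING_MARK_RANGES = ((0x0300, 0x036F),)

def is_mark(codepoint):
    """True for nonspacing (Mn) or spacing (Mc) combining marks."""
    return unicodedata.category(chr(codepoint)) in ("Mn", "Mc")

def in_ranges(codepoint, ranges):
    """True if the code point falls inside any (low, high) inclusive range."""
    return any(low <= codepoint <= high for low, high in ranges)

def mark_rules(ranges=COMBINING_MARK_RANGES):
    """Yield one deletion rule (a lone character) per in-range mark."""
    for low, high in ranges:
        for codepoint in range(low, high + 1):
            if is_mark(codepoint):
                yield chr(codepoint)

rules = list(mark_rules())
print("\u0300" in rules)                            # True
print(in_ranges(ord("A"), COMBINING_MARK_RANGES))   # False
```

Checking the category as well as the range is what keeps control characters, punctuation, and other unrelated code points out of the output if a range is ever widened.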

There are a number of other characters that appear in unaccent.rules that aren't generated by the script. I've attached a diff of the output of generate_unaccent_rules (using the version before my changes, to simplify matters) and unaccent.rules. Unfortunately, I don't know how to interpret most of these characters.

I suppose it's valid to ask if changing © to (C) is even something an "unaccent" function should do. Given that it's in the existing rules file, should it be supported as an existing behaviour?

Sorry for more questions than answers. ;-)


unaccent.diff (7K) Download Attachment

Re: BUG #15548: Unaccent does not remove combining diacritical characters

Tom Lane-2
Hugh Ranalli <[hidden email]> writes:
> On Fri, 14 Dec 2018 at 17:50, Tom Lane <[hidden email]> wrote:
>> Me too -- seems like that bears looking into.  Perhaps the script's
>> results are platform dependent -- what were you testing on?

> I'm on Linux Mint 17, which is based on Ubuntu 14.04. But I don't think
> that's it. The program's decisions come from the two data files, the
> Unicode data set and the Latin-ASCII transliteration file. The script uses
> categories (
> ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html#General%20Category)
> to identify letters (and now combining marks) and if they are in range,
> performs a substitution. It then uses the transliteration file to find
> rules for particular character substitutions (for example, that file seems
> to handle the copyright symbol substitution). I don't see anything platform
> dependent in there.

Hm.  Something funny is going on here.  When I fetch the two reference
files from the URLs cited in the script, and do

python2 generate_unaccent_rules.py --unicode-data-file UnicodeData.txt --latin-ascii-file Latin-ASCII.xml >newrules

I get something that's bit-for-bit the same as what's in unaccent.rules.
So there's clearly a platform difference between here and there.

I'm using Python 2.6.6, which is what ships with RHEL6; have not tried
it on anything newer.

                        regards, tom lane


Re: BUG #15548: Unaccent does not remove combining diacritical characters

Tom Lane-2
I wrote:
> ... I get something that's bit-for-bit the same as what's in unaccent.rules.
> So there's clearly a platform difference between here and there.
> I'm using Python 2.6.6, which is what ships with RHEL6; have not tried
> it on anything newer.

A few minutes later on a Fedora 28 box: python 2.7.15 also gives me the
expected results, while python 3.6.6 fails with "SyntaxError: invalid
syntax".

So updating that script to also work with python3 might be a worthwhile
TODO item.  But I'm at a loss to explain why you get different results.

                        regards, tom lane


Re: BUG #15548: Unaccent does not remove combining diacritical characters

Hugh Ranalli
In reply to this post by Tom Lane-2

On Sat, 15 Dec 2018 at 13:44, Tom Lane <[hidden email]> wrote:
> Hm.  Something funny is going on here.  When I fetch the two reference
> files from the URLs cited in the script, and do
>
> python2 generate_unaccent_rules.py --unicode-data-file UnicodeData.txt --latin-ascii-file Latin-ASCII.xml >newrules
>
> I get something that's bit-for-bit the same as what's in unaccent.rules.
> So there's clearly a platform difference between here and there.
>
> I'm using Python 2.6.6, which is what ships with RHEL6; have not tried
> it on anything newer.

Well, that's embarrassing. When I looked I couldn't see anything that looked platform specific. I'm on Python 2.7.6, which shipped with Mint 17. We use other versions of 2.7 on our production platforms. I'll take another look, and check the URLs I am using.


Re: BUG #15548: Unaccent does not remove combining diacritical characters

Hugh Ranalli

On Sat, 15 Dec 2018 at 14:05, Hugh Ranalli <[hidden email]> wrote:
> On Sat, 15 Dec 2018 at 13:44, Tom Lane <[hidden email]> wrote:
> > Hm.  Something funny is going on here.  When I fetch the two reference
> > files from the URLs cited in the script, and do
> >
> > python2 generate_unaccent_rules.py --unicode-data-file UnicodeData.txt --latin-ascii-file Latin-ASCII.xml >newrules
> >
> > I get something that's bit-for-bit the same as what's in unaccent.rules.
> > So there's clearly a platform difference between here and there.
> >
> > I'm using Python 2.6.6, which is what ships with RHEL6; have not tried
> > it on anything newer.
>
> Well, that's embarrassing. When I looked I couldn't see anything that looked platform specific. I'm on Python 2.7.6, which shipped with Mint 17. We use other versions of 2.7 on our production platforms. I'll take another look, and check the URLs I am using.

The problem is that I downloaded the latest version of the Latin-ASCII transliteration file (r34 rather than the r28 specified in the URL). Over 3 years ago (in r29, of course) they changed the file format (https://unicode.org/cldr/trac/ticket/5873), with the result that parse_cldr_latin_ascii_transliterator loads an empty rule set. I'd be happy to either a) support both formats, or b) support just the newest and update the URL. Option b) is cleaner, and I can't imagine why anyone would want to use an older rule set (then again, struggling with Unicode always makes my head hurt; I am not an expert on it). Thoughts?


Re: BUG #15548: Unaccent does not remove combining diacritical characters

Tom Lane-2
Hugh Ranalli <[hidden email]> writes:
> The problem is that I downloaded the latest version of the Latin-ASCII
> transliteration file (r34 rather than the r28 specified in the URL). Over 3
> years ago (in r29, of course) they changed the file format (
> https://unicode.org/cldr/trac/ticket/5873) so that
> parse_cldr_latin_ascii_transliterator loads an empty rules set.

Ah-hah.

> I'd be
> happy to either a) support both formats, or b), support just the newest and
> update the URL. Option b) is cleaner, and I can't imagine why anyone would
> want to use an older rule set (then again, struggling with Unicode always
> makes my head hurt; I am not an expert on it). Thoughts?

(b) seems sufficient to me, but perhaps someone else has a different
opinion.

Whichever we do, I think it should be a separate patch from the feature
addition for combining diacriticals, just to keep the commit history
clear.

                        regards, tom lane


Re: BUG #15548: Unaccent does not remove combining diacritical characters

Thomas Munro-3
On Sun, Dec 16, 2018 at 8:20 AM Tom Lane <[hidden email]> wrote:

> Hugh Ranalli <[hidden email]> writes:
> > The problem is that I downloaded the latest version of the Latin-ASCII
> > transliteration file (r34 rather than the r28 specified in the URL). Over 3
> > years ago (in r29, of course) they changed the file format (
> > https://unicode.org/cldr/trac/ticket/5873) so that
> > parse_cldr_latin_ascii_transliterator loads an empty rules set.
>
> Ah-hah.
>
> > I'd be
> > happy to either a) support both formats, or b), support just the newest and
> > update the URL. Option b) is cleaner, and I can't imagine why anyone would
> > want to use an older rule set (then again, struggling with Unicode always
> > makes my head hurt; I am not an expert on it). Thoughts?
>
> (b) seems sufficient to me, but perhaps someone else has a different
> opinion.
>
> Whichever we do, I think it should be a separate patch from the feature
> addition for combining diacriticals, just to keep the commit history
> clear.

+1 for updating to the latest file from time to time.  After
http://unicode.org/cldr/trac/ticket/11383 makes it into a new release,
our special_cases() function will have just the two Cyrillic
characters, which should almost certainly be handled by adding
Cyrillic to the ranges we handle via the usual code path, and DEGREE
CELSIUS and DEGREE FAHRENHEIT.  Those degree signs could possibly be
extracted from UnicodeData.txt (or we could just forget about them), and
then we could drop special_cases().

--
Thomas Munro
http://www.enterprisedb.com


Re: BUG #15548: Unaccent does not remove combining diacritical characters

Hugh Ranalli

On Sat, 15 Dec 2018 at 21:26, Thomas Munro <[hidden email]> wrote:
> +1 for updating to the latest file from time to time.  After
> http://unicode.org/cldr/trac/ticket/11383 makes it into a new release,
> our special_cases() function will have just the two Cyrillic
> characters, which should almost certainly be handled by adding
> Cyrillic to the ranges we handle via the usual code path, and DEGREE
> CELSIUS and DEGREE FAHRENHEIT.  Those degree signs could possibly be
> extracted from UnicodeData.txt (or we could just forget about them), and
> then we could drop special_cases().

Well, when I modified the code to handle the new version of the transliteration file, I discovered that was sufficient to handle the old version as well. That's not the way things usually go, but I'll take it. ;-)

I've attached two patches, one to update generate_unaccent_rules.py, and another that updates unaccent.rules from the v34 transliteration file. I'll be happy to add these to the CF. Does anyone need to review them and give me approval before I do so?

Best wishes,
Hugh 

Re: BUG #15548: Unaccent does not remove combining diacritical characters

Tom Lane-2
Hugh Ranalli <[hidden email]> writes:
> I've attached two patches, one to update generate_unaccent_rules.py, and
> another that updates unaccent.rules from the v34 transliteration file.

I think you forgot the patches?

> I'll
> be happy to add these to the CF. Does anyone need to review them and give
> me approval before I do so?

Nope.

                        regards, tom lane


Re: BUG #15548: Unaccent does not remove combining diacritical characters

Hugh Ranalli
On Mon, 17 Dec 2018 at 15:31, Tom Lane <[hidden email]> wrote:
> Hugh Ranalli <[hidden email]> writes:
> > I've attached two patches, one to update generate_unaccent_rules.py, and
> > another that updates unaccent.rules from the v34 transliteration file.
>
> I think you forgot the patches?

Sigh, yes I did. That's what I get for trying to get this sent out before heading to an appointment. Patches attached and will add to CF. Let me know if you see anything amiss.

Hugh

unaccent.rules-update-to-Latin-ASCII-CDLR-v34.patch (462 bytes) Download Attachment
generate_unaccent_rules-handle-all-Latin-ASCII-versions.patch (1K) Download Attachment

Re: BUG #15548: Unaccent does not remove combining diacritical characters

Thomas Munro-3
On Tue, Dec 18, 2018 at 12:03 PM Hugh Ranalli <[hidden email]> wrote:
> On Mon, 17 Dec 2018 at 15:31, Tom Lane <[hidden email]> wrote:
>> Hugh Ranalli <[hidden email]> writes:
>> > I've attached two patches, one to update generate_unaccent_rules.py, and
>> > another that updates unaccent.rules from the v34 transliteration file.
>>
>> I think you forgot the patches?
>
>
> Sigh, yes I did. That's what I get for trying to get this sent out before heading to an appointment. Patches attached and will add to CF. Let me know if you see anything amiss.

+ʹ    '
+ʺ    "
+ʻ    '
+ʼ    '
+ʽ    '
+˂    <
+˃    >
+˄    ^
+ˆ    ^
+ˈ    '
+ˋ    `
+ː    :
+˖    +
+˗    -
+˜    ~

I don't think this is quite right.  Those don't seem to be the
combining codepoints[1], and in any case they are being replaced with
ASCII characters, whereas I thought we wanted to replace them with
nothing at all.  Here is my attempt to come up with a test case using
combining characters:

  select unaccent('un café crème s''il vous plaît');

It's not stripping the accents.  I've attached that in a file for
reference so you can run it with psql -f x.sql, and you can see that
it's using combining code points (code points 0301, 0300, 0302 which
come out as cc81, cc80, cc82 in UTF-8) like so:

$ xxd x.sql
00000000: 7365 6c65 6374 2075 6e61 6363 656e 7428  select unaccent(
00000010: 2775 6e20 6361 6665 cc81 2063 7265 cc80  'un cafe.. cre..
00000020: 6d65 2073 2727 696c 2076 6f75 7320 706c  me s''il vous pl
00000030: 6169 cc82 7427 293b 0a0a                 ai..t');..

(To come up with that I used the trick of typing ":%!xxd" and then
when finished ":%!xxd -r", to turn vim into a hex editor.)
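
The byte sequences in the dump can be cross-checked with a few lines of Python; this is only a verification aid, and the `utf8` dictionary is an illustrative name.

```python
# UTF-8 encodings of the three combining marks used in the test string.
utf8 = {mark: mark.encode("utf-8") for mark in ("\u0301", "\u0300", "\u0302")}

for mark, encoded in utf8.items():
    print("U+%04X -> %s" % (ord(mark), encoded.hex()))
# U+0301 -> cc81
# U+0300 -> cc80
# U+0302 -> cc82

# For comparison, a precomposed é (U+00E9) is a different two-byte sequence.
print("\u00e9".encode("utf-8").hex())  # c3a9
```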

[1] https://en.wikipedia.org/wiki/Combining_Diacritical_Marks

--
Thomas Munro
http://www.enterprisedb.com

x.sql (82 bytes) Download Attachment

Re: BUG #15548: Unaccent does not remove combining diacritical characters

Thomas Munro-3
On Tue, Dec 18, 2018 at 3:05 PM Thomas Munro
<[hidden email]> wrote:

> On Tue, Dec 18, 2018 at 12:03 PM Hugh Ranalli <[hidden email]> wrote:
> +ʹ    '
> +ʺ    "
> +ʻ    '
> +ʼ    '
> +ʽ    '
> +˂    <
> +˃    >
> +˄    ^
> +ˆ    ^
> +ˈ    '
> +ˋ    `
> +ː    :
> +˖    +
> +˗    -
> +˜    ~
>
> I don't think this is quite right.  Those don't seem to be the
> combining codepoints[1], and in any case they are being replaced with
> ASCII characters, whereas I thought we wanted to replace them with
> nothing at all.  Here is my attempt to come up with a test case using
> combining characters:
>
>   select unaccent('un café crème s''il vous plaît');

Oh, I see now that that was just the v34 ASCII transliteration update,
and perhaps the diacritic stripping will be posted separately.

--
Thomas Munro
http://www.enterprisedb.com


Re: BUG #15548: Unaccent does not remove combining diacritical characters

Michael Paquier-2
In reply to this post by Thomas Munro-3
On Tue, Dec 18, 2018 at 03:05:00PM +1100, Thomas Munro wrote:

> I don't think this is quite right.  Those don't seem to be the
> combining codepoints[1], and in any case they are being replaced with
> ASCII characters, whereas I thought we wanted to replace them with
> nothing at all.  Here is my attempt to come up with a test case using
> combining characters:
>
>   select unaccent('un café crème s''il vous plaît');
>
> It's not stripping the accents.  I've attached that in a file for
> reference so you can run it with psql -f x.sql, and you can see that
> it's using combining code points (code points 0301, 0300, 0302 which
> come out as cc81, cc80, cc82 in UTF-8) like so:

Could you also add some tests in contrib/unaccent/sql/unaccent.sql at
the same time?  That would make it easy to check the extent of the
patches proposed on this thread.
--
Michael
