Re: BUG #15548: Unaccent does not remove combining diacritical characters

raam narayana
Hi,

After the latest commit on the master branch, I tried to test the Python script. I still see that the output from the script is completely different from the contents of the unaccent.rules file. Am I missing anything? My testing consisted of the following steps.

Downloaded the following files:

http://unicode.org/Public/8.0.0/ucd/UnicodeData.txt

http://unicode.org/cldr/trac/export/14746/tags/release-34/common/transforms/Latin-ASCII.xml

Ran the Python script as follows:

python generate_unaccent_rules.py --unicode-data-file UnicodeData.txt --latin-ascii-file Latin-ASCII.xml > unaccent.rules

I am using Python 3.7.1 on Windows 10.

The new status of this patch is: Needs review

Thomas Munro-3
On Mon, Feb 11, 2019 at 7:07 AM raam narayana <[hidden email]> wrote:

> After the latest commit on the master branch, I tried to test the Python script. I still see that the output from the script is completely different from the contents of the unaccent.rules file. Am I missing anything? My testing consisted of the following steps.
>
> Downloaded the following files:
>
> http://unicode.org/Public/8.0.0/ucd/UnicodeData.txt
>
> http://unicode.org/cldr/trac/export/14746/tags/release-34/common/transforms/Latin-ASCII.xml
>
> Ran the Python script as follows:
>
> python generate_unaccent_rules.py --unicode-data-file UnicodeData.txt --latin-ascii-file Latin-ASCII.xml > unaccent.rules
>
> I am using Python 3.7.1 on Windows 10.
>
> The new status of this patch is: Needs review

Hi Raam,

How does it differ?  Can you please share the output you get?  I used
Python 2.7 on a Mac, exactly those input files, and my output matched
Hugh's.

--
Thomas Munro
http://www.enterprisedb.com

Hugh Ranalli
On Sun, 10 Feb 2019 at 15:07, raam narayana <[hidden email]> wrote:
> Hi,
>
> After the latest commit on the master branch, I tried to test the Python script. I still see that the output from the script is completely different from the contents of the unaccent.rules file. Am I missing anything? My testing consisted of the following steps.
>
> Downloaded the following files:
>
> http://unicode.org/Public/8.0.0/ucd/UnicodeData.txt
>
> http://unicode.org/cldr/trac/export/14746/tags/release-34/common/transforms/Latin-ASCII.xml
>
> Ran the Python script as follows:
>
> python generate_unaccent_rules.py --unicode-data-file UnicodeData.txt --latin-ascii-file Latin-ASCII.xml > unaccent.rules
>
> I am using Python 3.7.1 on Windows 10.
>
> The new status of this patch is: Needs review

Hi Raam,
I just ran generate_unaccent_rules.py in two environments, using the data files given above:
  - Python 3.4.3 on Linux Mint 17.3 (equivalent to Ubuntu 14.04)
  - Python 3.6.7 on Ubuntu 18.04

In both cases, the output was identical to that generated by the program under Python 2.7. So yes, more information would help. Unfortunately I don't have a Windows Python environment readily available, but could set one up if I had to.

Thanks,
Hugh

raam narayana
Hi Hugh,

I tested the script with Python 2.7 and it works perfectly. The problem occurs with Python 3.7 (and maybe only on Windows, since you did not hit the issue), where I got the following error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u0100' in position 0: character maps to <undefined>

I went through the Python script and found that the stdout encoding is set to UTF-8 only if the Python version is <= 2.

I have made the same change for Python 3 as well. Please find the patch attached. Let me know if it makes sense.
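
For illustration only, here is a minimal sketch of the kind of change described above. It is not the attached patch, whose contents are not shown in this thread. The helper name make_utf8_stdout and its stream parameter are hypothetical, introduced here so the idea can be tried in isolation; the approach assumed is wrapping standard output in a UTF-8 writer on both major Python versions.

```python
import codecs
import sys


def make_utf8_stdout(stream=None):
    """Return a writer that encodes text as UTF-8, ignoring the OS default.

    `stream` is a hypothetical parameter added for testing; when omitted,
    the real standard output is wrapped.
    """
    if stream is None:
        stream = sys.stdout
    if sys.version_info[0] <= 2:
        # Python 2: sys.stdout accepts encoded bytes directly.
        return codecs.getwriter('utf8')(stream)
    # Python 3: sys.stdout is a text stream, so write the encoded bytes
    # to its underlying binary buffer instead.
    return codecs.getwriter('utf8')(stream.buffer)


if __name__ == '__main__':
    out = make_utf8_stdout()
    # Illustrative sample line only: U+0100, a tab, then "A".
    out.write(u'\u0100\tA\n')
```

With a wrapper like this, the bytes written no longer depend on the code page that the platform picks for stdout.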

Regards,
Ram.

--
Cheers
Ram 4.0

Attachment: generate_unaccent_rules-remove-combining-diacritical-accents-03.patch (746 bytes)

Michael Paquier-2
On Tue, Feb 12, 2019 at 02:27:31AM +0530, Ramanarayana wrote:

> I tested the script with Python 2.7 and it works perfectly. The problem
> occurs with Python 3.7 (and maybe only on Windows, since you did not hit
> the issue), where I got the following error:
>
> UnicodeEncodeError: 'charmap' codec can't encode character '\u0100' in
> position 0: character maps to <undefined>
>
> I went through the Python script and found that the stdout encoding is set
> to UTF-8 only if the Python version is <= 2.
>
> I have made the same change for Python 3 as well. Please find the
> patch attached. Let me know if it makes sense.

Isn't that because the default encoding on Windows is cp1252, UTF-16 or
some such? FWIW, on Debian SID with Python 3.7, I get the correct output,
and no diffs on HEAD.  Perhaps it would make sense to use open() on the
different files with encoding='utf-8' to avoid any kind of problems?
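
As a rough sketch of the open() suggestion above: the helper name read_utf8_lines is hypothetical and is not taken from generate_unaccent_rules.py. It uses io.open, which accepts an encoding argument on both Python 2 and Python 3, so the input is decoded as UTF-8 whatever the platform default is.

```python
import io


def read_utf8_lines(path):
    """Yield the lines of `path`, decoded as UTF-8 regardless of the locale."""
    # io.open takes `encoding` on Python 2 as well; on Python 3 it is the
    # same function as the built-in open().
    with io.open(path, mode='r', encoding='utf-8') as f:
        for line in f:
            yield line.rstrip('\n')
```

The same explicit encoding could be passed wherever the script opens UnicodeData.txt or Latin-ASCII.xml, assuming it reads them as text files.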
--
Michael

raam narayana
Hi Michael,
The issue was that the Python script worked with Python 2 but not with Python 3 on Windows. This is because the script writes its final output to stdout, and the stdout encoding is set to UTF-8 only for Python 2, not for Python 3. If no encoding is set for stdout, Python takes the encoding from the operating system. The default encoding on Linux and on Windows may differ, hence this issue.
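
To make the cause concrete, here is a small sketch that reproduces the failure on any platform by simulating a text stream that uses a Windows code page. The function name can_write is hypothetical, and cp1252 is assumed as the example code page because it is the one mentioned earlier in the thread; the actual default on a given machine may differ.

```python
import io
import locale
import sys


def can_write(text, encoding):
    """Return True if `text` can be written to a text stream using `encoding`."""
    stream = io.TextIOWrapper(io.BytesIO(), encoding=encoding)
    try:
        stream.write(text)
        stream.flush()
    except UnicodeEncodeError:
        # The character has no mapping in this encoding.
        return False
    return True


if __name__ == '__main__':
    # What this interpreter would use for stdout and for files by default.
    print(sys.stdout.encoding, locale.getpreferredencoding())
    # U+0100 is the character named in the error message above.
    print(can_write(u'\u0100', 'cp1252'))
    print(can_write(u'\u0100', 'utf-8'))
```

Running the script with output redirected to a file, as in the original command line, shows which encoding the interpreter chose on that system.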
Regards,
Ram.

--
Cheers
Ram 4.0

Hugh Ranalli
On Tue, 12 Feb 2019 at 08:54, Ramanarayana <[hidden email]> wrote:
> Hi Michael,
> The issue was that the Python script worked with Python 2 but not with Python 3 on Windows. This is because the script writes its final output to stdout, and the stdout encoding is set to UTF-8 only for Python 2, not for Python 3. If no encoding is set for stdout, Python takes the encoding from the operating system. The default encoding on Linux and on Windows may differ, hence this issue.
> Regards,
> Ram.

I can't look at this today, but will fire up Windows and Python tomorrow, look at Ram's patch, and see what is going on. I'll also look at how we open the input files, to see if we should supply an encoding. Those input files only make sense as UTF-8 anyway.

Ram, thanks for catching this issue.

Hugh