Issue Supporting Emojis in Full Text Search on Ubuntu

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Issue Supporting Emojis in Full Text Search on Ubuntu

Jordan Hurwich
We rely on the Postgres tsvector implementation to enable full text search in our app, but we're having some issues with getting the parser to recognize emoji characters (like "😀" <U+1F600>) as anything other than "blank"/"Space symbols" on Ubuntu per ts_debug(). Notably the characters are recognized as "word"/"Word, all letters" characters on Mac; and non-english, non-emoji characters (like "我" <U+6211>) are recognized as "word" characters on both Mac and Ubuntu.

We greatly appreciate your feedback, debug details below and happy to provide more as requested,
Jordan
pulsasensors.com, jhurwich@

Platform:
- AWS Ubuntu 18.04.2 LTS vs MacOS 10.15.5
- postgres (PostgreSQL) 11.5


* ts_debug() differs on MacOS and Ubuntu *

We have not modified the 'english' text search configuration on either instance, however the query "SELECT * FROM ts_debug('english', '😀');" returns different results on MacOS 10.15.5 and our Ubuntu instance:
- on MacOS:
db=#  select * from  ts_debug('english', '😀');
 alias |    description    | token |  dictionaries  |  dictionary  | lexemes
-------+-------------------+-------+----------------+--------------+---------
 word  | Word, all letters | 😀     | {english_stem} | english_stem | {😀}


 - on Ubuntu:
db=# SELECT * from ts_debug('english','😀');
 alias |  description  | token | dictionaries | dictionary | lexemes
-------+---------------+-------+--------------+------------+---------
 blank | Space symbols | 😀     | {}           |            |


Notably non-english, non-emoji characters like '我' behave as desired on both instances, with the same result on both MacOS and Ubuntu for "SELECT * FROM ts_debug('english', '我');":
db=# SELECT * FROM ts_debug('english', '我');
 alias |    description    | token |  dictionaries  |  dictionary  | lexemes
-------+-------------------+-------+----------------+--------------+---------
 word  | Word, all letters | 我    | {english_stem} | english_stem | {我}



* pg_database *

There are minor differences between MacOS and Ubuntu in pg_database as follows, however modifications to set datcollate and datctype to 'C' on Ubuntu or the more specific 'en_US.UTF-8' have not changed the result for ts_debug(). See row for 'testdb01':
- on Mac:
db=# select datname, encoding, datcollate, datctype, datistemplate from pg_database;
    datname     | encoding | datcollate | datctype | datistemplate
----------------+----------+------------+----------+---------------
 postgres       |        6 | C          | C        | f
 template0      |        6 | C          | C        | t
 template1      |        6 | C          | C        | t
 testdb01       |        6 | C          | C        | f


- on Ubuntu:
db=# select datname, encoding, datcollate, datctype, datistemplate from pg_database;
  datname  | encoding | datcollate  |  datctype   | datistemplate
-----------+----------+-------------+-------------+---------------
 postgres  |        6 | C.UTF-8     | C.UTF-8     | f
 template0 |        6 | C.UTF-8     | C.UTF-8     | t
 template1 |        6 | en_US.UTF-8 | en_US.UTF-8 | t
 testdb01  |        6 | en_US.UTF-8 | en_US.UTF-8 | f



* locale *

The result of `$ locale` on both instances is similar, included below for Ubuntu. Though `$ locale -a` varies considerably, on MacOS dozens of items are returned while only 4 entries are returned on Ubuntu, included below:

- on Ubuntu
$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=


- on Ubuntu
$ locale -a
C
C.UTF-8
en_US.utf8
POSIX


* Postgres installation *

On Mac, Postgres was installed and is managed by Homebrew via the "postgresql@11" formula.
On Ubuntu, Postgres was installed from source at https://ftp.postgresql.org/pub/source/v11.5/postgresql-11.5.tar.bz2.
Reply | Threaded
Open this post in threaded view
|

Re: Issue Supporting Emojis in Full Text Search on Ubuntu

Tom Lane-2
Jordan Hurwich <[hidden email]> writes:
> We rely on the Postgres tsvector implementation to enable full text search
> in our app, but we're having some issues with getting the parser to
> recognize emoji characters (like "😀" <U+1F600>) as anything other than
> "blank"/"Space symbols" on Ubuntu per ts_debug(). Notably the characters
> are recognized as "word"/"Word, all letters" characters on Mac; and
> non-english, non-emoji characters (like "我" <U+6211>) are recognized as
> "word" characters on both Mac and Ubuntu.

With a non-C lc_ctype setting, which characters are considered to be
letters is entirely up to the locale code supplied by the platform's libc.
I duplicate your result that U+1F600 is not considered a letter by glibc,
at least not with lc_ctype = en_US.utf8.  (Perhaps there are other locale
settings that would consider it a letter?  Not my expertise though.)

However, with lc_ctype set to C, you should get a platform-independent
result that any non-ASCII character is a letter.

macOS' utf8-based locales are almost completely broken, unfortunately,
so it's hard to predict what they will do except that it'll probably be
wrong.

> We have not modified the 'english' text search configuration on either
> instance, however the query "SELECT * FROM ts_debug('english', '😀');"
> returns different results on MacOS 10.15.5 and our Ubuntu instance:

That's unsurprising in itself, per the above.  There's no standardization
worth mentioning in this area.

> There are minor differences between MacOS and Ubuntu in pg_database as
> follows, however modifications to set datcollate and datctype to 'C' on
> Ubuntu or the more specific 'en_US.UTF-8' have not changed the result for
> ts_debug().

lc_ctype = C should have done what you want (at least for this specific
symbol), so I wonder whether you did the test right.  You gave few details
about how you tested this.

> db=# select datname, encoding, datcollate, datctype, datistemplate from
> pg_database;
>     datname     | encoding | datcollate | datctype | datistemplate
> ----------------+----------+------------+----------+---------------
>  postgres       |        6 | C          | C        | f
>  template0      |        6 | C          | C        | t
>  template1      |        6 | C          | C        | t
>  testdb01       |        6 | C          | C        | f

Hm, how are you getting that result when your prompt says you're
connected to database "db"?

> The result of `$ locale` on both instances is similar, included below for
> Ubuntu. Though `$ locale -a` varies considerably, on MacOS dozens of items
> are returned while only 4 entries are returned on Ubuntu, included below:
> - on Ubuntu
> $ locale -a
> C
> C.UTF-8
> en_US.utf8
> POSIX

This just indicates that you didn't install the package(s) that provide
alternative locales.  On my RHEL box, "locale -a" reports 865 entries,
and I'm pretty sure I don't have all the odder ones.  I do not know,
unfortunately, how Ubuntu/Debian divvy this stuff up into packages.

Anyway, the bottom line is that if you need platform-independent results
then setting lc_ctype to "C" is what to do.  It will not be very bright
about non-ASCII characters, but at least the stupidity will be uniform.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: Issue Supporting Emojis in Full Text Search on Ubuntu

Jordan Hurwich
Thank you very much Tom! 

You were completely right to question our previous manipulation of the pg_database entry, and after revisiting that we were able to address the issue by updating datcollate and datctype to "C" - ts_debug() now recognizes emoji characters as "word" characters and text search functionality is behaving as expected.

Thank you once again for taking the time to understand and figure out our issue,
Jordan

On Tue, Jun 30, 2020 at 5:27 PM Tom Lane <[hidden email]> wrote:
Jordan Hurwich <[hidden email]> writes:
> We rely on the Postgres tsvector implementation to enable full text search
> in our app, but we're having some issues with getting the parser to
> recognize emoji characters (like "😀" <U+1F600>) as anything other than
> "blank"/"Space symbols" on Ubuntu per ts_debug(). Notably the characters
> are recognized as "word"/"Word, all letters" characters on Mac; and
> non-english, non-emoji characters (like "我" <U+6211>) are recognized as
> "word" characters on both Mac and Ubuntu.

With a non-C lc_ctype setting, which characters are considered to be
letters is entirely up to the locale code supplied by the platform's libc.
I duplicate your result that U+1F600 is not considered a letter by glibc,
at least not with lc_ctype = en_US.utf8.  (Perhaps there are other locale
settings that would consider it a letter?  Not my expertise though.)

However, with lc_ctype set to C, you should get a platform-independent
result that any non-ASCII character is a letter.

macOS' utf8-based locales are almost completely broken, unfortunately,
so it's hard to predict what they will do except that it'll probably be
wrong.

> We have not modified the 'english' text search configuration on either
> instance, however the query "SELECT * FROM ts_debug('english', '😀');"
> returns different results on MacOS 10.15.5 and our Ubuntu instance:

That's unsurprising in itself, per the above.  There's no standardization
worth mentioning in this area.

> There are minor differences between MacOS and Ubuntu in pg_database as
> follows, however modifications to set datcollate and datctype to 'C' on
> Ubuntu or the more specific 'en_US.UTF-8' have not changed the result for
> ts_debug().

lc_ctype = C should have done what you want (at least for this specific
symbol), so I wonder whether you did the test right.  You gave few details
about how you tested this.

> db=# select datname, encoding, datcollate, datctype, datistemplate from
> pg_database;
>     datname     | encoding | datcollate | datctype | datistemplate
> ----------------+----------+------------+----------+---------------
>  postgres       |        6 | C          | C        | f
>  template0      |        6 | C          | C        | t
>  template1      |        6 | C          | C        | t
>  testdb01       |        6 | C          | C        | f

Hm, how are you getting that result when your prompt says you're
connected to database "db"?

> The result of `$ locale` on both instances is similar, included below for
> Ubuntu. Though `$ locale -a` varies considerably, on MacOS dozens of items
> are returned while only 4 entries are returned on Ubuntu, included below:
> - on Ubuntu
> $ locale -a
> C
> C.UTF-8
> en_US.utf8
> POSIX

This just indicates that you didn't install the package(s) that provide
alternative locales.  On my RHEL box, "locale -a" reports 865 entries,
and I'm pretty sure I don't have all the odder ones.  I do not know,
unfortunately, how Ubuntu/Debian divvy this stuff up into packages.

Anyway, the bottom line is that if you need platform-independent results
then setting lc_ctype to "C" is what to do.  It will not be very bright
about non-ASCII characters, but at least the stupidity will be uniform.

                        regards, tom lane
Reply | Threaded
Open this post in threaded view
|

Re: Issue Supporting Emojis in Full Text Search on Ubuntu

Bzzzz
In reply to this post by Tom Lane-2
On Tue, 30 Jun 2020 20:27:23 -0400
Tom Lane <[hidden email]> wrote:

> With a non-C lc_ctype setting, which characters are considered to be
> letters is entirely up to the locale code supplied by the platform's
> libc. I duplicate your result that U+1F600 is not considered a letter
> by glibc, at least not with lc_ctype = en_US.utf8.  (Perhaps there are
> other locale settings that would consider it a letter?  Not my
> expertise though.)

FWIW, it works ferpectly with a french locale (on Debian buster):

test=# select datname, encoding, datcollate, datctype, datistemplate from
test-# pg_database;
    datname    | encoding | datcollate  |  datctype   | datistemplate
---------------+----------+-------------+-------------+---------------
 postgres      |        6 | fr_FR.utf8  | fr_FR.utf8  | f
 template1     |        6 | fr_FR.utf8  | fr_FR.utf8  | t
 template0     |        6 | fr_FR.utf8  | fr_FR.utf8  | t
 coshonet      |        6 | fr_FR.UTF-8 | en_US.UTF-8 | f
 ejabberd_test |        6 | fr_FR.UTF-8 | en_US.UTF-8 | f
 fudforum      |        6 | fr_FR.UTF-8 | en_US.UTF-8 | f
 postfixadmin  |        6 | fr_FR.UTF-8 | en_US.UTF-8 | f
 test          |        6 | fr_FR.UTF-8 | en_US.UTF-8 | f
 testjy        |        6 | fr_FR.UTF-8 | en_US.UTF-8 | f
 zotonic       |        6 | fr_FR.UTF-8 | fr_FR.UTF-8 | f
(10 rows)

test=# create table tst_ctrlu_chars(id varchar(12) primary key, a text);

test=# insert into tst_ctrlu_chars values ('ü', 'aaaaaaaaaaa'),
('ë','bbbbbbbbbb'),('ê','cccccccccc'),('û','dddddddd'),('😀','EEEEEEEE'),('🙉','FFFFFFFFF');
INSERT 0 6

test=# SELECT * FROM tst_ctrlu_chars WHERE id='😀';
 id |    a    
----+----------
 😀  | EEEEEEEE
(1 row)


My ¢0.2
       
Jiff


Reply | Threaded
Open this post in threaded view
|

Re: Issue Supporting Emojis in Full Text Search on Ubuntu

Laurenz Albe
In reply to this post by Jordan Hurwich
On Wed, 2020-07-01 at 14:56 -0700, Jordan Hurwich wrote:
> You were completely right to question our previous manipulation of the pg_database entry,
> and after revisiting that we were able to address the issue by updating datcollate and
> datctype to "C" - ts_debug() now recognizes emoji characters as "word" characters and
> text search functionality is behaving as expected.

If you manually modified "pg_database", you probably broke your database.

At the very least, REINDEX all indexes that contain strings.  I don't know
if there could be other problems.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com