Full text search bug ('russian' regconfig)

egocenter

Hello!

Text search doesn't work correctly when the text and the query are the SAME string (russian dictionary config).
As you can see in the example, to_tsvector produces different lexemes than the tsquery for identical text:

tsv = 'дан':1 'магазин':2 'нужн':3 'посеща':4 'точн':5
tsq = 'нужн' & 'точн' & 'дан' & 'посещаем' & 'магазин'

SELECT
        (web_query_and @@ ts_title)::INTEGER AS full_title_entries, -- returns 0, expected 1
        (web_query_and @@ 'зачем нужны точные данные о посещаемости магазинов?')::INTEGER AS full_title_entries2,
        *
FROM
        (SELECT
                to_tsvector('russian', STRIP(to_tsvector('russian', 'зачем нужны точные данные о посещаемости магазинов?'))::TEXT ) AS ts_title,
                websearch_to_tsquery('russian', REPLACE('зачем нужны точные данные о посещаемости магазинов?', '- ' , '')) AS web_query_and
        ) AS main;

--
Best regards,
Roman



Re: Full text search bug ('russian' regconfig)

Artur Zakirov
Hello

On 2/19/2020 5:35 PM, egocenter wrote:
> Text search doesn't work correct with the EQUAL string in text and query (russian dictionary config),
> as you can see in example ts_vector receives different from ts_query lexemes for identical text:
>
> tsv = 'дан':1 'магазин':2 'нужн':3 'посеща':4 'точн':5
> tsq = 'нужн' & 'точн' & 'дан' & 'посещаем' & 'магазин'

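That is because you call to_tsvector() twice. The 'russian' configuration
uses a Snowball dictionary, which applies a stemming algorithm to cut off
word endings, so the second call stems the already-stemmed lexemes again
('посещаем' becomes 'посеща'). Your query works if to_tsvector() isn't
called twice on the same text: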

=# SELECT
   web_query_and @@ ts_title,
   web_query_and @@ 'зачем нужны точные данные о посещаемости магазинов',
   *
FROM
   (SELECT
     to_tsvector('russian', 'зачем нужны точные данные о посещаемости магазинов') AS ts_title,
     websearch_to_tsquery('russian', 'зачем нужны точные данные о посещаемости магазинов?') AS web_query_and
   ) AS main;

It gives 'true' for the first column.
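
The effect can be checked on the single affected word. The following is a minimal sketch, assuming the stock russian_stem Snowball dictionary that ships with PostgreSQL; the two input words are taken from the lexemes quoted above:

```sql
-- Stemming is not idempotent: running the stemmer on its own output
-- can shorten the lexeme a second time.
SELECT
    ts_lexize('russian_stem', 'посещаемости') AS first_pass,   -- the original word
    ts_lexize('russian_stem', 'посещаем')     AS second_pass;  -- the already-stemmed lexeme
```

Going by the tsv and tsq values quoted above, first_pass should contain 'посещаем' and second_pass 'посеща', which is why the query lexeme no longer matches the twice-stemmed vector.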

--
Artur


Re: Full text search bug ('russian' regconfig)

egocenter

Hello, Artur!

Thanks for the answer.
OK, though it's strange that only one word is affected this way (as if two lexemes exist for one word)...

*I call to_tsvector twice to eliminate duplicate words.
In the example below, ts_title = 'histori':2 'watcom':1,3,
and ts_rank_cd counts 2 entries for 'город - watcom'.

I need to count UNIQUE word entries, but it seems there is no way to do that with the standard functionality.
(I see 2 options: a custom ts_rank function, OR editing the tsvector to keep only the first position for 'watcom':
ts_title = 'histori':2 'watcom':1).

If you have any ideas about this, I would highly appreciate it! Thanks in advance.

---------
SELECT
   round((ts_rank_cd(ts_title, web_query_or)/0.1)::NUMERIC, 0) AS title_entries_count, -- 2, but should be 1
   *
FROM
   (SELECT
     to_tsvector('russian', 'watcom history | watcom') AS ts_title,
     websearch_to_tsquery('russian', REPLACE('город - watcom', '- ' , '')) AS web_query_and, -- the dash is removed so that it is not converted into negation (!)
     REPLACE(websearch_to_tsquery('russian', REPLACE('город - watcom', '- ' , ''))::TEXT, '&', '|')::tsquery AS web_query_or
   ) AS main;
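
One possible way to keep the deduplication without stemming twice is to run the second pass with the built-in 'simple' configuration, which lowercases tokens but does not stem them. This is only a sketch, not something confirmed in this thread: it assumes the lexemes are plain words that the parser will not split again, and it writes the OR query directly with to_tsquery instead of the REPLACE trick.

```sql
SELECT
   ts_title,
   web_query_or,
   -- same 0.1 scaling as in the query above
   round((ts_rank_cd(ts_title, web_query_or) / 0.1)::NUMERIC, 0) AS title_entries_count
FROM
   (SELECT
     -- first pass: stem with 'russian'; strip() drops positions and duplicates;
     -- second pass: 'simple' assigns fresh positions without stemming again
     to_tsvector('simple', strip(to_tsvector('russian', 'watcom history | watcom'))::TEXT) AS ts_title,
     -- assumption: a hand-written OR query in place of the websearch_to_tsquery + REPLACE combination
     to_tsquery('russian', 'город | watcom') AS web_query_or
   ) AS main;
```

With this, each lexeme in ts_title should carry exactly one position, so 'watcom' is counted once. Note that the positions no longer reflect the original word order, because strip() returns the lexemes sorted.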
