BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term

PG Bug reporting form
The following bug has been logged on the website:

Bug reference:      16865
Logged by:          Dimitri Nüscheler
Email address:      [hidden email]
PostgreSQL version: 13.1
Operating system:   Debian
Description:        

Hello

I'm writing a small search engine application that also supports negation
(exclude results that contain terms starting with string).

After upgrading from PostgreSQL 12.5 to 13.1 the negation within a tsquery
when matched to a tsvector using the @@ operator no longer works properly,
so I started bisecting the commit history and tracked it down to this commit
(see the query and expected and actual result further below):

> commit 2f2007fbb255be178aca586780967f43885203a7 (HEAD, refs/bisect/bad)
> Author: Tom Lane <[hidden email]>
> Date:   Fri Jul 24 15:26:51 2020 -0400
>
>    Fix assorted bugs by changing TS_execute's callback API to ternary
logic.
> ...

I'm still working on creating a reproducible test-case without having to
share company data. I'm also trying to understand the code as a fun
exercise.

I can at least share some of the queries and result data:

DDL:
CREATE TABLE IF NOT EXISTS sherlock_catalog (
uri varchar,
description varchar NOT NULL,
metadata varchar NOT NULL,
textsearch tsvector GENERATED ALWAYS AS (to_tsvector('english',
sherlock_normalize(uri || ' ' || description || ' ' || metadata))) STORED,
last_seen timestamptz,
PRIMARY KEY (uri)
);
CREATE OR REPLACE FUNCTION sherlock_normalize(str varchar) RETURNS varchar
AS $$
BEGIN
    RETURN lower(regexp_replace(regexp_replace(regexp_replace(str,
'[^a-zA-Z0-9]+', ' ', 'g'),'([a-z])([A-Z])','\1
\2','g'),'([A-Z][A-Z0-9])([a-z])','\1 \2','g'));
END;
$$
LANGUAGE plpgsql IMMUTABLE RETURNS NULL ON NULL INPUT;

Query:
SELECT * FROM sherlock_catalog WHERE textsearch @@ '''full'':* &
''suppli'':* & !''discontinu'':*'::tsquery AND metadata LIKE
'%Discontinued%' LIMIT 2;

Result: 2 rows

Expected result: 0 rows, because the textsearch vector contains:
'discontinu':86

Plan:
 Limit  (cost=130.61..12789.24 rows=1 width=136)
   ->  Bitmap Heap Scan on sherlock_catalog  (cost=130.61..12789.24 rows=1
width=136)
         Recheck Cond: (textsearch @@ '''full'':* & ''suppli'':* &
!''discontinu'':*'::tsquery)
         Filter: ((metadata)::text ~~ '%Discontinued%'::text)
         ->  Bitmap Index Scan on sherlock_catalog_textsearch
(cost=0.00..130.61 rows=3548 width=0)
               Index Cond: (textsearch @@ '''full'':* & ''suppli'':* &
!''discontinu'':*'::tsquery)

The generated plan is structurally the same for PostgreSQL 12.5 respectively
versions before that commit, but if I alter the plan using (SET enable_*scan
 = off) so that the planner comes up with a sequential scan, I will get the
expected results.

Some other results (count):

Count of the same query:
user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':*
& ''suppli'':* & !''discontinu'':*'::tsquery AND metadata LIKE
'%Discontinu%';
 count
-------
  4962
(1 Zeile)


Negation, but not a prefix search:
user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':*
& ''suppli'':* & !''discontinu'''::tsquery AND metadata LIKE
'%Discontinu%';
 count
-------
     0
(1 Zeile)


Without negation:
user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':*
& ''suppli'':* & ''discontinu'':*'::tsquery AND metadata LIKE
'%Discontinu%';
 count
-------
 13127
(1 Zeile)

Without the "discontinu" term at all

user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':*
& ''suppli'':*'::tsquery AND metadata LIKE '%Discontinu%';
 count
-------
 13127
(1 Zeile)


So it seems like the negated query manages to filter out some data, but not
all - as if it failed to recheck and definitely determine the TS_YES
respectively TS_NO answer from an in-precise TS_MAYBE answer from an
unprecise index-based answer (without position information?)? if I even
understand this remotely correctly, I'm new to this.
I'll try to find out more and to prepare shareable data that reproduces the
problem, but I also wonder if I manage to dive into the code and understand
something about it :-)

Kind Regards
Dimitri Nüscheler

Reply | Threaded
Open this post in threaded view
|

Re: BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term

Dimitri Nüscheler
Sorry, I forgot to post the DDL for the GIN INDEX:
CREATE INDEX IF NOT EXISTS sherlock_catalog_textsearch ON sherlock_catalog USING GIN (textsearch);

I also uploaded the scripts I used to bisect the issue here, but I didn't include the data file yet. The file "steps" explains the steps.:

I could share the data file with a few individuals until I find a way to anonymize or reduce the data set. My attempts so far, resulted in the bug no longer showing.

Regards
Dimitri

Am So., 14. Feb. 2021 um 00:23 Uhr schrieb PG Bug reporting form <[hidden email]>:
The following bug has been logged on the website:

Bug reference:      16865
Logged by:          Dimitri Nüscheler
Email address:      [hidden email]
PostgreSQL version: 13.1
Operating system:   Debian
Description:       

Hello

I'm writing a small search engine application that also supports negation
(exclude results that contain terms starting with string).

After upgrading from PostgreSQL 12.5 to 13.1 the negation within a tsquery
when matched to a tsvector using the @@ operator no longer works properly,
so I started bisecting the commit history and tracked it down to this commit
(see the query and expected and actual result further below):

> commit 2f2007fbb255be178aca586780967f43885203a7 (HEAD, refs/bisect/bad)
> Author: Tom Lane <[hidden email]>
> Date:   Fri Jul 24 15:26:51 2020 -0400
>
>    Fix assorted bugs by changing TS_execute's callback API to ternary
logic.
> ...

I'm still working on creating a reproducible test-case without having to
share company data. I'm also trying to understand the code as a fun
exercise.

I can at least share some of the queries and result data:

DDL:
CREATE TABLE IF NOT EXISTS sherlock_catalog (
uri varchar,
description varchar NOT NULL,
metadata varchar NOT NULL,
textsearch tsvector GENERATED ALWAYS AS (to_tsvector('english',
sherlock_normalize(uri || ' ' || description || ' ' || metadata))) STORED,
last_seen timestamptz,
PRIMARY KEY (uri)
);
CREATE OR REPLACE FUNCTION sherlock_normalize(str varchar) RETURNS varchar
AS $$
BEGIN
    RETURN lower(regexp_replace(regexp_replace(regexp_replace(str,
'[^a-zA-Z0-9]+', ' ', 'g'),'([a-z])([A-Z])','\1
\2','g'),'([A-Z][A-Z0-9])([a-z])','\1 \2','g'));
END;
$$
LANGUAGE plpgsql IMMUTABLE RETURNS NULL ON NULL INPUT;

Query:
SELECT * FROM sherlock_catalog WHERE textsearch @@ '''full'':* &
''suppli'':* & !''discontinu'':*'::tsquery AND metadata LIKE
'%Discontinued%' LIMIT 2;

Result: 2 rows

Expected result: 0 rows, because the textsearch vector contains:
'discontinu':86

Plan:
 Limit  (cost=130.61..12789.24 rows=1 width=136)
   ->  Bitmap Heap Scan on sherlock_catalog  (cost=130.61..12789.24 rows=1
width=136)
         Recheck Cond: (textsearch @@ '''full'':* & ''suppli'':* &
!''discontinu'':*'::tsquery)
         Filter: ((metadata)::text ~~ '%Discontinued%'::text)
         ->  Bitmap Index Scan on sherlock_catalog_textsearch
(cost=0.00..130.61 rows=3548 width=0)
               Index Cond: (textsearch @@ '''full'':* & ''suppli'':* &
!''discontinu'':*'::tsquery)

The generated plan is structurally the same for PostgreSQL 12.5 respectively
versions before that commit, but if I alter the plan using (SET enable_*scan
 = off) so that the planner comes up with a sequential scan, I will get the
expected results.

Some other results (count):

Count of the same query:
user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':*
& ''suppli'':* & !''discontinu'':*'::tsquery AND metadata LIKE
'%Discontinu%';
 count
-------
  4962
(1 Zeile)


Negation, but not a prefix search:
user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':*
& ''suppli'':* & !''discontinu'''::tsquery AND metadata LIKE
'%Discontinu%';
 count
-------
     0
(1 Zeile)


Without negation:
user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':*
& ''suppli'':* & ''discontinu'':*'::tsquery AND metadata LIKE
'%Discontinu%';
 count
-------
 13127
(1 Zeile)

Without the "discontinu" term at all

user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':*
& ''suppli'':*'::tsquery AND metadata LIKE '%Discontinu%';
 count
-------
 13127
(1 Zeile)


So it seems like the negated query manages to filter out some data, but not
all - as if it failed to recheck and definitely determine the TS_YES
respectively TS_NO answer from an in-precise TS_MAYBE answer from an
unprecise index-based answer (without position information?)? if I even
understand this remotely correctly, I'm new to this.
I'll try to find out more and to prepare shareable data that reproduces the
problem, but I also wonder if I manage to dive into the code and understand
something about it :-)

Kind Regards
Dimitri Nüscheler

Reply | Threaded
Open this post in threaded view
|

Re: BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term

Dimitri Nüscheler
It seems the issue is not related to negation, because I also get these results:
user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':* & ''suppli'':* & ''discontinu'':*'::tsquery AND metadata NOT LIKE '%iscontinu%';
 count
-------
  4499
(1 Zeile)

These results don't contain "discontinu" in the tsvector.

Am So., 14. Feb. 2021 um 08:41 Uhr schrieb Dimitri Nüscheler <[hidden email]>:
Sorry, I forgot to post the DDL for the GIN INDEX:
CREATE INDEX IF NOT EXISTS sherlock_catalog_textsearch ON sherlock_catalog USING GIN (textsearch);

I also uploaded the scripts I used to bisect the issue here, but I didn't include the data file yet. The file "steps" explains the steps.:

I could share the data file with a few individuals until I find a way to anonymize or reduce the data set. My attempts so far, resulted in the bug no longer showing.

Regards
Dimitri

Am So., 14. Feb. 2021 um 00:23 Uhr schrieb PG Bug reporting form <[hidden email]>:
The following bug has been logged on the website:

Bug reference:      16865
Logged by:          Dimitri Nüscheler
Email address:      [hidden email]
PostgreSQL version: 13.1
Operating system:   Debian
Description:       

Hello

I'm writing a small search engine application that also supports negation
(exclude results that contain terms starting with string).

After upgrading from PostgreSQL 12.5 to 13.1 the negation within a tsquery
when matched to a tsvector using the @@ operator no longer works properly,
so I started bisecting the commit history and tracked it down to this commit
(see the query and expected and actual result further below):

> commit 2f2007fbb255be178aca586780967f43885203a7 (HEAD, refs/bisect/bad)
> Author: Tom Lane <[hidden email]>
> Date:   Fri Jul 24 15:26:51 2020 -0400
>
>    Fix assorted bugs by changing TS_execute's callback API to ternary
logic.
> ...

I'm still working on creating a reproducible test-case without having to
share company data. I'm also trying to understand the code as a fun
exercise.

I can at least share some of the queries and result data:

DDL:
CREATE TABLE IF NOT EXISTS sherlock_catalog (
uri varchar,
description varchar NOT NULL,
metadata varchar NOT NULL,
textsearch tsvector GENERATED ALWAYS AS (to_tsvector('english',
sherlock_normalize(uri || ' ' || description || ' ' || metadata))) STORED,
last_seen timestamptz,
PRIMARY KEY (uri)
);
CREATE OR REPLACE FUNCTION sherlock_normalize(str varchar) RETURNS varchar
AS $$
BEGIN
    RETURN lower(regexp_replace(regexp_replace(regexp_replace(str,
'[^a-zA-Z0-9]+', ' ', 'g'),'([a-z])([A-Z])','\1
\2','g'),'([A-Z][A-Z0-9])([a-z])','\1 \2','g'));
END;
$$
LANGUAGE plpgsql IMMUTABLE RETURNS NULL ON NULL INPUT;

Query:
SELECT * FROM sherlock_catalog WHERE textsearch @@ '''full'':* &
''suppli'':* & !''discontinu'':*'::tsquery AND metadata LIKE
'%Discontinued%' LIMIT 2;

Result: 2 rows

Expected result: 0 rows, because the textsearch vector contains:
'discontinu':86

Plan:
 Limit  (cost=130.61..12789.24 rows=1 width=136)
   ->  Bitmap Heap Scan on sherlock_catalog  (cost=130.61..12789.24 rows=1
width=136)
         Recheck Cond: (textsearch @@ '''full'':* & ''suppli'':* &
!''discontinu'':*'::tsquery)
         Filter: ((metadata)::text ~~ '%Discontinued%'::text)
         ->  Bitmap Index Scan on sherlock_catalog_textsearch
(cost=0.00..130.61 rows=3548 width=0)
               Index Cond: (textsearch @@ '''full'':* & ''suppli'':* &
!''discontinu'':*'::tsquery)

The generated plan is structurally the same for PostgreSQL 12.5 respectively
versions before that commit, but if I alter the plan using (SET enable_*scan
 = off) so that the planner comes up with a sequential scan, I will get the
expected results.

Some other results (count):

Count of the same query:
user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':*
& ''suppli'':* & !''discontinu'':*'::tsquery AND metadata LIKE
'%Discontinu%';
 count
-------
  4962
(1 Zeile)


Negation, but not a prefix search:
user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':*
& ''suppli'':* & !''discontinu'''::tsquery AND metadata LIKE
'%Discontinu%';
 count
-------
     0
(1 Zeile)


Without negation:
user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':*
& ''suppli'':* & ''discontinu'':*'::tsquery AND metadata LIKE
'%Discontinu%';
 count
-------
 13127
(1 Zeile)

Without the "discontinu" term at all

user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':*
& ''suppli'':*'::tsquery AND metadata LIKE '%Discontinu%';
 count
-------
 13127
(1 Zeile)


So it seems like the negated query manages to filter out some data, but not
all - as if it failed to recheck and definitely determine the TS_YES
respectively TS_NO answer from an in-precise TS_MAYBE answer from an
unprecise index-based answer (without position information?)? if I even
understand this remotely correctly, I'm new to this.
I'll try to find out more and to prepare shareable data that reproduces the
problem, but I also wonder if I manage to dive into the code and understand
something about it :-)

Kind Regards
Dimitri Nüscheler

Reply | Threaded
Open this post in threaded view
|

Re: BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term

Dimitri Nüscheler
Meanwhile I managed to anonymize the data. I put it in this archive https://www.violetsky.ch/postgres-issue-anonymized.tar.gz
It's called anonymous.csv and the scripts to reproduce it are also included (See the file "steps").

The query that reproduces the issue is now a bit different with the data anonymized:

SELECT * FROM sherlock_catalog WHERE textsearch @@ '''glonoin'':* & !''urinarium'':*'::tsquery AND metadata LIKE '%urinarium%' LIMIT 2;

You will be able to see that the textsearch tsvector contains "urinarium" when it shouldn't.
I also checked that the commit right before (25244b8972a34b838c4033fe9efc1d31cba9d0e3) does not have the issue and returns 0 rows instead only.


Am So., 14. Feb. 2021 um 11:17 Uhr schrieb Dimitri Nüscheler <[hidden email]>:
It seems the issue is not related to negation, because I also get these results:
user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':* & ''suppli'':* & ''discontinu'':*'::tsquery AND metadata NOT LIKE '%iscontinu%';
 count
-------
  4499
(1 Zeile)

These results don't contain "discontinu" in the tsvector.

Am So., 14. Feb. 2021 um 08:41 Uhr schrieb Dimitri Nüscheler <[hidden email]>:
Sorry, I forgot to post the DDL for the GIN INDEX:
CREATE INDEX IF NOT EXISTS sherlock_catalog_textsearch ON sherlock_catalog USING GIN (textsearch);

I also uploaded the scripts I used to bisect the issue here, but I didn't include the data file yet. The file "steps" explains the steps.:

I could share the data file with a few individuals until I find a way to anonymize or reduce the data set. My attempts so far, resulted in the bug no longer showing.

Regards
Dimitri

Am So., 14. Feb. 2021 um 00:23 Uhr schrieb PG Bug reporting form <[hidden email]>:
The following bug has been logged on the website:

Bug reference:      16865
Logged by:          Dimitri Nüscheler
Email address:      [hidden email]
PostgreSQL version: 13.1
Operating system:   Debian
Description:       

Hello

I'm writing a small search engine application that also supports negation
(exclude results that contain terms starting with string).

After upgrading from PostgreSQL 12.5 to 13.1 the negation within a tsquery
when matched to a tsvector using the @@ operator no longer works properly,
so I started bisecting the commit history and tracked it down to this commit
(see the query and expected and actual result further below):

> commit 2f2007fbb255be178aca586780967f43885203a7 (HEAD, refs/bisect/bad)
> Author: Tom Lane <[hidden email]>
> Date:   Fri Jul 24 15:26:51 2020 -0400
>
>    Fix assorted bugs by changing TS_execute's callback API to ternary
logic.
> ...

I'm still working on creating a reproducible test-case without having to
share company data. I'm also trying to understand the code as a fun
exercise.

I can at least share some of the queries and result data:

DDL:
CREATE TABLE IF NOT EXISTS sherlock_catalog (
uri varchar,
description varchar NOT NULL,
metadata varchar NOT NULL,
textsearch tsvector GENERATED ALWAYS AS (to_tsvector('english',
sherlock_normalize(uri || ' ' || description || ' ' || metadata))) STORED,
last_seen timestamptz,
PRIMARY KEY (uri)
);
CREATE OR REPLACE FUNCTION sherlock_normalize(str varchar) RETURNS varchar
AS $$
BEGIN
    RETURN lower(regexp_replace(regexp_replace(regexp_replace(str,
'[^a-zA-Z0-9]+', ' ', 'g'),'([a-z])([A-Z])','\1
\2','g'),'([A-Z][A-Z0-9])([a-z])','\1 \2','g'));
END;
$$
LANGUAGE plpgsql IMMUTABLE RETURNS NULL ON NULL INPUT;

Query:
SELECT * FROM sherlock_catalog WHERE textsearch @@ '''full'':* &
''suppli'':* & !''discontinu'':*'::tsquery AND metadata LIKE
'%Discontinued%' LIMIT 2;

Result: 2 rows

Expected result: 0 rows, because the textsearch vector contains:
'discontinu':86

Plan:
 Limit  (cost=130.61..12789.24 rows=1 width=136)
   ->  Bitmap Heap Scan on sherlock_catalog  (cost=130.61..12789.24 rows=1
width=136)
         Recheck Cond: (textsearch @@ '''full'':* & ''suppli'':* &
!''discontinu'':*'::tsquery)
         Filter: ((metadata)::text ~~ '%Discontinued%'::text)
         ->  Bitmap Index Scan on sherlock_catalog_textsearch
(cost=0.00..130.61 rows=3548 width=0)
               Index Cond: (textsearch @@ '''full'':* & ''suppli'':* &
!''discontinu'':*'::tsquery)

The generated plan is structurally the same for PostgreSQL 12.5 respectively
versions before that commit, but if I alter the plan using (SET enable_*scan
 = off) so that the planner comes up with a sequential scan, I will get the
expected results.

Some other results (count):

Count of the same query:
user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':*
& ''suppli'':* & !''discontinu'':*'::tsquery AND metadata LIKE
'%Discontinu%';
 count
-------
  4962
(1 Zeile)


Negation, but not a prefix search:
user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':*
& ''suppli'':* & !''discontinu'''::tsquery AND metadata LIKE
'%Discontinu%';
 count
-------
     0
(1 Zeile)


Without negation:
user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':*
& ''suppli'':* & ''discontinu'':*'::tsquery AND metadata LIKE
'%Discontinu%';
 count
-------
 13127
(1 Zeile)

Without the "discontinu" term at all

user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':*
& ''suppli'':*'::tsquery AND metadata LIKE '%Discontinu%';
 count
-------
 13127
(1 Zeile)


So it seems like the negated query manages to filter out some data, but not
all - as if it failed to recheck and definitely determine the TS_YES
respectively TS_NO answer from an in-precise TS_MAYBE answer from an
unprecise index-based answer (without position information?)? if I even
understand this remotely correctly, I'm new to this.
I'll try to find out more and to prepare shareable data that reproduces the
problem, but I also wonder if I manage to dive into the code and understand
something about it :-)

Kind Regards
Dimitri Nüscheler

Reply | Threaded
Open this post in threaded view
|

Re: BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term

Tom Lane-2
=?UTF-8?Q?Dimitri_N=C3=BCscheler?= <[hidden email]> writes:
> Meanwhile I managed to anonymize the data. I put it in this archive
> https://www.violetsky.ch/postgres-issue-anonymized.tar.gz

Thanks.  I've reproduced the issue and it boils down to doing the
wrong thing when there are GIN_MAYBE values in the input data for
checkcondition_gin, specifically when the bitmap holding the original
GIN index results has become lossy.  We correctly get a TS_MAYBE
result out of the calculation in TS_execute_recurse, but then
TS_execute figures it can throw that detail away and just return
TS_YES.  I think that this problem existed before 2f2007fbb, but
we were masking it by always forcing rechecks.

Anyway, this seems to put the final nail in the coffin of the idea
that it's sufficient for TS_execute to return a boolean.  At least
the tsginidx.c callers really need to see the 3-way result.  Hence
I propose the attached patch.  It fixes the given test case, but
I wonder if you can try it and see if you see any remaining problems.

I wasted quite a bit of time trying to devise a test case that is
short enough to be reasonable to include in our regression tests,
without success ... you need quite a bit of data to make the GIN
bitmap become lossy.  So no test case here, but I'm not super
comfortable with that.

                        regards, tom lane


diff --git a/src/backend/utils/adt/tsginidx.c b/src/backend/utils/adt/tsginidx.c
index 6c913baaba..8eed376708 100644
--- a/src/backend/utils/adt/tsginidx.c
+++ b/src/backend/utils/adt/tsginidx.c
@@ -246,10 +246,22 @@ gin_tsquery_consistent(PG_FUNCTION_ARGS)
  gcv.map_item_operand = (int *) (extra_data[0]);
  gcv.need_recheck = recheck;
 
- res = TS_execute(GETQUERY(query),
- &gcv,
- TS_EXEC_PHRASE_NO_POS,
- checkcondition_gin);
+ switch (TS_execute_ternary(GETQUERY(query),
+   &gcv,
+   TS_EXEC_PHRASE_NO_POS,
+   checkcondition_gin))
+ {
+ case TS_NO:
+ res = false;
+ break;
+ case TS_YES:
+ res = true;
+ break;
+ case TS_MAYBE:
+ res = true;
+ *recheck = true;
+ break;
+ }
  }
 
  PG_RETURN_BOOL(res);
@@ -284,11 +296,12 @@ gin_tsquery_triconsistent(PG_FUNCTION_ARGS)
  gcv.map_item_operand = (int *) (extra_data[0]);
  gcv.need_recheck = &recheck;
 
- if (TS_execute(GETQUERY(query),
-   &gcv,
-   TS_EXEC_PHRASE_NO_POS,
-   checkcondition_gin))
- res = recheck ? GIN_MAYBE : GIN_TRUE;
+ res = TS_execute_ternary(GETQUERY(query),
+ &gcv,
+ TS_EXEC_PHRASE_NO_POS,
+ checkcondition_gin);
+ if (res == GIN_TRUE && recheck)
+ res = GIN_MAYBE;
  }
 
  PG_RETURN_GIN_TERNARY_VALUE(res);
diff --git a/src/backend/utils/adt/tsvector_op.c b/src/backend/utils/adt/tsvector_op.c
index 2939fb5c21..9236ebcc8f 100644
--- a/src/backend/utils/adt/tsvector_op.c
+++ b/src/backend/utils/adt/tsvector_op.c
@@ -1854,6 +1854,18 @@ TS_execute(QueryItem *curitem, void *arg, uint32 flags,
  return TS_execute_recurse(curitem, arg, flags, chkcond) != TS_NO;
 }
 
+/*
+ * Evaluate tsquery boolean expression.
+ *
+ * This is the same as TS_execute except that TS_MAYBE is returned as-is.
+ */
+TSTernaryValue
+TS_execute_ternary(QueryItem *curitem, void *arg, uint32 flags,
+   TSExecuteCallback chkcond)
+{
+ return TS_execute_recurse(curitem, arg, flags, chkcond);
+}
+
 /*
  * TS_execute recursion for operators above any phrase operator.  Here we do
  * not need to worry about lexeme positions.  As soon as we hit an OP_PHRASE
diff --git a/src/include/tsearch/ts_utils.h b/src/include/tsearch/ts_utils.h
index 69a9ba8524..4266560151 100644
--- a/src/include/tsearch/ts_utils.h
+++ b/src/include/tsearch/ts_utils.h
@@ -199,6 +199,9 @@ typedef TSTernaryValue (*TSExecuteCallback) (void *arg, QueryOperand *val,
 
 extern bool TS_execute(QueryItem *curitem, void *arg, uint32 flags,
    TSExecuteCallback chkcond);
+extern TSTernaryValue TS_execute_ternary(QueryItem *curitem, void *arg,
+ uint32 flags,
+ TSExecuteCallback chkcond);
 extern bool tsquery_requires_match(QueryItem *curitem);
 
 /*
Reply | Threaded
Open this post in threaded view
|

Re: BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term

Tom Lane-2
I wrote:
> Anyway, this seems to put the final nail in the coffin of the idea
> that it's sufficient for TS_execute to return a boolean.  At least
> the tsginidx.c callers really need to see the 3-way result.  Hence
> I propose the attached patch.  It fixes the given test case, but
> I wonder if you can try it and see if you see any remaining problems.

After further reflection, it seems like now that we've done that,
we should go all the way and replace the out-of-band recheck result
from checkcondition_gin with a MAYBE result, as attached.

AFAICT the previous patch didn't leave any user-visible bug,
but there is an inefficiency in using the out-of-band signaling.
Consider a query like

        (a & b:B) | c

and suppose that some index entry has "b" and "c" but not "a".
With the code as it stands, when checkcondition_gin checks for a
match to "b:B" it will find one, and then because of the weight
restriction it will return TS_MAYBE, and also set need_recheck.
Then TS_execute_recurse will combine the TS_MAYBE with the TS_NO
for "a" to produce TS_NO, and finally combine that with TS_YES
for "c" to get TS_YES.  This is a correct result: the row
unquestionably matches the query.  But since we set the
need_recheck flag, we'll force a heap visit and recheck anyway.
That's bogus, and we don't need to accept it anymore now that
the data return path is fully ternary-clean.  Returning TS_MAYBE
is sufficient to do all the right things, so I propose the
attached v2.

                        regards, tom lane


diff --git a/src/backend/utils/adt/tsginidx.c b/src/backend/utils/adt/tsginidx.c
index 6c913baaba..78de531eb8 100644
--- a/src/backend/utils/adt/tsginidx.c
+++ b/src/backend/utils/adt/tsginidx.c
@@ -175,7 +175,6 @@ typedef struct
  QueryItem  *first_item;
  GinTernaryValue *check;
  int   *map_item_operand;
- bool   *need_recheck;
 } GinChkVal;
 
 /*
@@ -184,27 +183,24 @@ typedef struct
 static TSTernaryValue
 checkcondition_gin(void *checkval, QueryOperand *val, ExecPhraseData *data)
 {
+ GinTernaryValue result;
  GinChkVal  *gcv = (GinChkVal *) checkval;
  int j;
 
- /*
- * if any val requiring a weight is used or caller needs position
- * information then set recheck flag
- */
- if (val->weight != 0 || data != NULL)
- *(gcv->need_recheck) = true;
-
  /* convert item's number to corresponding entry's (operand's) number */
  j = gcv->map_item_operand[((QueryItem *) val) - gcv->first_item];
 
+ /* determine presence of current entry in indexed value */
+ result = gcv->check[j];
+
  /*
- * return presence of current entry in indexed value; but TRUE becomes
- * MAYBE in the presence of a query requiring recheck
+ * if any val requiring a weight is used or caller needs position
+ * information then we must recheck, so replace TRUE with MAYBE.
  */
- if (gcv->check[j] == GIN_TRUE)
+ if (result == GIN_TRUE)
  {
  if (val->weight != 0 || data != NULL)
- return TS_MAYBE;
+ result = GIN_MAYBE;
  }
 
  /*
@@ -212,7 +208,7 @@ checkcondition_gin(void *checkval, QueryOperand *val, ExecPhraseData *data)
  * assignments.  We could use a switch statement to map the values if that
  * ever stops being true, but it seems unlikely to happen.
  */
- return (TSTernaryValue) gcv->check[j];
+ return (TSTernaryValue) result;
 }
 
 Datum
@@ -244,12 +240,23 @@ gin_tsquery_consistent(PG_FUNCTION_ARGS)
  "sizes of GinTernaryValue and bool are not equal");
  gcv.check = (GinTernaryValue *) check;
  gcv.map_item_operand = (int *) (extra_data[0]);
- gcv.need_recheck = recheck;
 
- res = TS_execute(GETQUERY(query),
- &gcv,
- TS_EXEC_PHRASE_NO_POS,
- checkcondition_gin);
+ switch (TS_execute_ternary(GETQUERY(query),
+   &gcv,
+   TS_EXEC_PHRASE_NO_POS,
+   checkcondition_gin))
+ {
+ case TS_NO:
+ res = false;
+ break;
+ case TS_YES:
+ res = true;
+ break;
+ case TS_MAYBE:
+ res = true;
+ *recheck = true;
+ break;
+ }
  }
 
  PG_RETURN_BOOL(res);
@@ -266,10 +273,6 @@ gin_tsquery_triconsistent(PG_FUNCTION_ARGS)
  /* int32 nkeys = PG_GETARG_INT32(3); */
  Pointer    *extra_data = (Pointer *) PG_GETARG_POINTER(4);
  GinTernaryValue res = GIN_FALSE;
- bool recheck;
-
- /* Initially assume query doesn't require recheck */
- recheck = false;
 
  if (query->size > 0)
  {
@@ -282,13 +285,11 @@ gin_tsquery_triconsistent(PG_FUNCTION_ARGS)
  gcv.first_item = GETQUERY(query);
  gcv.check = check;
  gcv.map_item_operand = (int *) (extra_data[0]);
- gcv.need_recheck = &recheck;
 
- if (TS_execute(GETQUERY(query),
-   &gcv,
-   TS_EXEC_PHRASE_NO_POS,
-   checkcondition_gin))
- res = recheck ? GIN_MAYBE : GIN_TRUE;
+ res = TS_execute_ternary(GETQUERY(query),
+ &gcv,
+ TS_EXEC_PHRASE_NO_POS,
+ checkcondition_gin);
  }
 
  PG_RETURN_GIN_TERNARY_VALUE(res);
diff --git a/src/backend/utils/adt/tsvector_op.c b/src/backend/utils/adt/tsvector_op.c
index 2939fb5c21..9236ebcc8f 100644
--- a/src/backend/utils/adt/tsvector_op.c
+++ b/src/backend/utils/adt/tsvector_op.c
@@ -1854,6 +1854,18 @@ TS_execute(QueryItem *curitem, void *arg, uint32 flags,
  return TS_execute_recurse(curitem, arg, flags, chkcond) != TS_NO;
 }
 
+/*
+ * Evaluate tsquery boolean expression.
+ *
+ * This is the same as TS_execute except that TS_MAYBE is returned as-is.
+ */
+TSTernaryValue
+TS_execute_ternary(QueryItem *curitem, void *arg, uint32 flags,
+   TSExecuteCallback chkcond)
+{
+ return TS_execute_recurse(curitem, arg, flags, chkcond);
+}
+
 /*
  * TS_execute recursion for operators above any phrase operator.  Here we do
  * not need to worry about lexeme positions.  As soon as we hit an OP_PHRASE
diff --git a/src/include/tsearch/ts_utils.h b/src/include/tsearch/ts_utils.h
index 69a9ba8524..4266560151 100644
--- a/src/include/tsearch/ts_utils.h
+++ b/src/include/tsearch/ts_utils.h
@@ -199,6 +199,9 @@ typedef TSTernaryValue (*TSExecuteCallback) (void *arg, QueryOperand *val,
 
 extern bool TS_execute(QueryItem *curitem, void *arg, uint32 flags,
    TSExecuteCallback chkcond);
+extern TSTernaryValue TS_execute_ternary(QueryItem *curitem, void *arg,
+ uint32 flags,
+ TSExecuteCallback chkcond);
 extern bool tsquery_requires_match(QueryItem *curitem);
 
 /*
Reply | Threaded
Open this post in threaded view
|

Re: BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term

Dimitri Nüscheler
Hi Tom

Wow, this is going fast. I applied the second patch and it fixes my original test case on my private data.
Thank you also for the explanations. I'm happy it's possible to get insight into PostgreSQL interna so quickly :-)

Let me know if there's more I can do (testing particular cases or patch variants etc).

Kind regards
Dimitri

Am Di., 16. Feb. 2021 um 01:27 Uhr schrieb Tom Lane <[hidden email]>:
I wrote:
> Anyway, this seems to put the final nail in the coffin of the idea
> that it's sufficient for TS_execute to return a boolean.  At least
> the tsginidx.c callers really need to see the 3-way result.  Hence
> I propose the attached patch.  It fixes the given test case, but
> I wonder if you can try it and see if you see any remaining problems.

After further reflection, it seems like now that we've done that,
we should go all the way and replace the out-of-band recheck result
from checkcondition_gin with a MAYBE result, as attached.

AFAICT the previous patch didn't leave any user-visible bug,
but there is an inefficiency in using the out-of-band signaling.
Consider a query like

        (a & b:B) | c

and suppose that some index entry has "b" and "c" but not "a".
With the code as it stands, when checkcondition_gin checks for a
match to "b:B" it will find one, and then because of the weight
restriction it will return TS_MAYBE, and also set need_recheck.
Then TS_execute_recurse will combine the TS_MAYBE with the TS_NO
for "a" to produce TS_NO, and finally combine that with TS_YES
for "c" to get TS_YES.  This is a correct result: the row
unquestionably matches the query.  But since we set the
need_recheck flag, we'll force a heap visit and recheck anyway.
That's bogus, and we don't need to accept it anymore now that
the data return path is fully ternary-clean.  Returning TS_MAYBE
is sufficient to do all the right things, so I propose the
attached v2.

                        regards, tom lane

Reply | Threaded
Open this post in threaded view
|

Re: BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term

Tom Lane-2
=?UTF-8?Q?Dimitri_N=C3=BCscheler?= <[hidden email]> writes:
> Wow, this is going fast. I applied the second patch and it fixes my
> original test case on my private data.

Pushed, thanks for testing!

                        regards, tom lane