BUG #16828: duplicate results when using ** recursive expression in JSON path

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

BUG #16828: duplicate results when using ** recursive expression in JSON path

PG Bug reporting form
The following bug has been logged on the website:

Bug reference:      16828
Logged by:          Thomas Kellerer
Email address:      [hidden email]
PostgreSQL version: 13.1
Operating system:   Windows 10, CentOS 8
Description:        

The following query

  select jsonb_path_query_array(col, '$.**.itemName')
  from (
    values ('{"items": [{"itemName": "a", "items": [{"itemName":
"b"}]}]}'::jsonb)
  ) as t(col)

returns ["a", "a", "b", "b"] but should return only ["a", "b"] as each value
only appears once in the JSON value.

This also happens on Postgres 12.x

Reply | Threaded
Open this post in threaded view
|

Re: BUG #16828: duplicate results when using ** recursive expression in JSON path

Tom Lane-2
PG Bug reporting form <[hidden email]> writes:
> The following query

>   select jsonb_path_query_array(col, '$.**.itemName')
>   from (
>     values ('{"items": [{"itemName": "a", "items": [{"itemName":
> "b"}]}]}'::jsonb)
>   ) as t(col)

> returns ["a", "a", "b", "b"] but should return only ["a", "b"] as each value
> only appears once in the JSON value.

jsonpath_exec.c is a muddle of undocumented code, but I think
I found where the duplication is occurring: in executeAnyItem(),
the first occurrence comes out while recursing down from the
executeItemOptUnwrapTarget call at line 1419 (as of HEAD), and then
the duplicate comes out while recursing down from the executeAnyItem
call at line 1439.  So I'd say that that logic for
"ignoreStructuralErrors" needs a rethink.  (BTW, why is half of this
code relying on cxt->ignoreStructuralErrors while the other half
receives a passed-down flag?  That seems like a recipe for bugs,
especially with the lack of commentary about the reason for it.)

Alexander, git blame says all this code is your fault ...

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: BUG #16828: duplicate results when using ** recursive expression in JSON path

Alexander Korotkov-4
On Sun, Jan 17, 2021 at 10:42 PM Tom Lane <[hidden email]> wrote:

> PG Bug reporting form <[hidden email]> writes:
> > The following query
>
> >   select jsonb_path_query_array(col, '$.**.itemName')
> >   from (
> >     values ('{"items": [{"itemName": "a", "items": [{"itemName":
> > "b"}]}]}'::jsonb)
> >   ) as t(col)
>
> > returns ["a", "a", "b", "b"] but should return only ["a", "b"] as each value
> > only appears once in the JSON value.
>
> jsonpath_exec.c is a muddle of undocumented code, but I think
> I found where the duplication is occurring: in executeAnyItem(),
> the first occurrence comes out while recursing down from the
> executeItemOptUnwrapTarget call at line 1419 (as of HEAD), and then
> the duplicate comes out while recursing down from the executeAnyItem
> call at line 1439.  So I'd say that that logic for
> "ignoreStructuralErrors" needs a rethink.  (BTW, why is half of this
> code relying on cxt->ignoreStructuralErrors while the other half
> receives a passed-down flag?  That seems like a recipe for bugs,
> especially with the lack of commentary about the reason for it.)

I think that's not just a bug in the code, but a high-level problem.

** operator enumerates all the subdocuments.

# select * from jsonb_path_query('{"items": [{"itemName": "a", "items": [{"itemName": "b"}]}]}'::jsonb, '$.**');
                       jsonb_path_query
--------------------------------------------------------------
 {"items": [{"items": [{"itemName": "b"}], "itemName": "a"}]}
 [{"items": [{"itemName": "b"}], "itemName": "a"}]
 {"items": [{"itemName": "b"}], "itemName": "a"}
 [{"itemName": "b"}]
 {"itemName": "b"}
 "b"
 "a"
(7 rows)

After that, .itemName accessor is applied.  But in the lax mode this accessor automatically unwraps the arrays.

# select subres, jsonb_path_query_array(subres, '$.itemName') from jsonb_path_query('{"items": [{"itemName": "a", "items": [{"itemName": "b"}]}]}'::jsonb, '$.**') subres;
                            subres                            | jsonb_path_query_array
--------------------------------------------------------------+------------------------
 {"items": [{"items": [{"itemName": "b"}], "itemName": "a"}]} | []
 [{"items": [{"itemName": "b"}], "itemName": "a"}]            | ["a"]
 {"items": [{"itemName": "b"}], "itemName": "a"}              | ["a"]
 [{"itemName": "b"}]                                          | ["b"]
 {"itemName": "b"}                                            | ["b"]
 "b"                                                          | []
 "a"                                                          | []
(7 rows)

So, everything works as designed, but the design is probably wrong.  The issue here is that ** is our extension of the standard, and it works strangely in lax mode.  But in strict mode everything looks OK.

#  select jsonb_path_query_array('{"items": [{"itemName": "a", "items": [{"itemName": "b"}]}]}'::jsonb, 'strict $.**.itemName');
 jsonb_path_query_array
------------------------
 ["a", "b"]
(1 row)

Should we consider everything that comes after ** always in strict mode?

> Alexander, git blame says all this code is your fault ...

I do agree all faults in this code are mine.  I do agree there are faults in this code.  But I don't think *all* this code is fault :)

I'll plan my time to improve the clarity of this code.

------
Regards,
Alexander Korotkov