BUG #16046: xpath returns CDATA tag along with the value in postgres 12

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

BUG #16046: xpath returns CDATA tag along with the value in postgres 12

apt.postgresql.org Repository Update
The following bug has been logged on the website:

Bug reference:      16046
Logged by:          Mohammad Mostafa Kamal
Email address:      [hidden email]
PostgreSQL version: 12.0
Operating system:   Windows 7
Description:        

While using xpath to extract text from a CDATA section of xml, it returns
CDATA tag along with the value.

Query: SELECT
unnest(xpath('//cname/aname/text()','<cname><aname><![CDATA[select
5]]></aname></cname>'::xml))

Output - pg11: select 5

Output - pg12:  <![CDATA[select 5]]>

Reply | Threaded
Open this post in threaded view
|

Re: BUG #16046: xpath returns CDATA tag along with the value in postgres 12

Tom Lane-2
PG Bug reporting form <[hidden email]> writes:
> While using xpath to extract text from a CDATA section of xml, it returns
> CDATA tag along with the value.
>
> Query: SELECT unnest(xpath('//cname/aname/text()','<cname><aname><![CDATA[select 5]]></aname></cname>'::xml))
>
> Output - pg11: select 5
>
> Output - pg12: <![CDATA[select 5]]>

I haven't actually bisected to make sure, but I imagine this is a
consequence of commit 251cf2e27bec98274e8bb002608680bdc211319e.
What's not entirely clear to me is whether it's an intentional
effect, or a bug.  Authors, any comments?

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: BUG #16046: xpath returns CDATA tag along with the value in postgres 12

Chapman Flack
On 10/24/19 17:38, Tom Lane wrote:
>> Query: SELECT unnest(xpath('//cname/aname/text()','<cname><aname><![CDATA[select 5]]></aname></cname>'::xml))
>>
>> Output - pg11: select 5
>>
>> Output - pg12: <![CDATA[select 5]]>
>
> ... What's not entirely clear to me is whether it's an intentional
> effect, or a bug.  Authors, any comments?

Hmm. I would say the pg12 behavior is "not wrong". But it's unexpected.
xpath's return type is xml (well, array of), so the result must have a
form that can escape any characters mistakable for markup. In this example,
there aren't any, but once tweaked so there are:

SELECT
 unnest(xpath('//cname/aname/text()',
  '<cname><aname><![CDATA[select 5 & 6 <yahoo!>]]></aname></cname>'::xml));

pg12: <![CDATA[select 5 & 6 <yahoo!>]]>

the necessity is clear.

The other valid option would be to return, not CDATA, but a regular
text node, which would look like straight text if there were no special
characters in it, and would otherwise have every such character individually
escaped. That's what I get from pg11:

SELECT
 unnest(xpath('//cname/aname/text()',
  '<cname><aname><![CDATA[select 5 & 6 <yahoo!>]]></aname></cname>'::xml));

pg11: select 5 &amp; 6 &lt;yahoo!&gt;

So what pg11 is doing is also "not wrong" (in this respect, anyway).
And looks "more natural", in the case where the value has no characters
that need escaping.

Which may or may not be a good thing. Perhaps it could lead the unwary
in some cases to think such a query is giving a directly usable
text string back, which will be harmless until the one time a value
with escaping comes back. (The no-surprises way to get back a directly
usable text string, if that's what's wanted, would be with XMLTABLE
and an output column of text type.)

Oddly, what pg12 is doing seems to be influenced by the form of escaping
used in the input:

SELECT
 unnest(xpath('//cname/aname/text()',
  '<cname><aname><![CDATA[select 5 & 6 <yahoo!>]]></aname></cname>'::xml));
              unnest
-----------------------------------
 <![CDATA[select 5 & 6 <yahoo!>]]>

SELECT
 unnest(xpath('//cname/aname/text()',
  '<cname><aname>select 5 &amp; 6 &lt;yahoo!&gt;</aname></cname>'::xml));
             unnest
---------------------------------
 select 5 &amp; 6 &lt;yahoo!&gt;


Either form of result is correct, and having it respect the form that was
used in the input might even be delightfully smart.

I haven't looked in the code just now to see if it is intentionally being
delightfully smart, or more simplistic-and-lucky.

Regards,
-Chap


Reply | Threaded
Open this post in threaded view
|

Re: BUG #16046: xpath returns CDATA tag along with the value in postgres 12

Chapman Flack
On 10/24/19 19:14, Chapman Flack wrote:
>  <![CDATA[select 5 & 6 <yahoo!>]]>
>  select 5 &amp; 6 &lt;yahoo!&gt;
>
> Either form of result is correct, and having it respect the form that was
> used in the input might even be delightfully smart.
>
> I haven't looked in the code just now to see if it is intentionally being
> delightfully smart, or more simplistic-and-lucky.

It appears to be probably unintentional-but-ok: libxml tags a CDATA
section differently (XML_CDATA_SECTION_NODE) than a text node
(XML_TEXT_NODE), so a CDATA node falls into the catch-all branch of

if (cur->type != XML_ATTRIBUTE_NODE && cur->type != XML_TEXT_NODE)

where it gets dumped faithfully by xmlNodeDump(), with the upshot
that the result always will be well-formed XML, and will respect whether
the input was supplied as CDATA or not. We could choose to say that's
what we meant it to do all along.

Regards,
-Chap


Reply | Threaded
Open this post in threaded view
|

Re: BUG #16046: xpath returns CDATA tag along with the value in postgres 12

Chapman Flack
In reply to this post by Chapman Flack
On 10/24/19 19:14, Chapman Flack wrote:

> Which may or may not be a good thing. Perhaps it could lead the unwary
> in some cases to think such a query is giving a directly usable
> text string back, which will be harmless until the one time a value
> with escaping comes back. (The no-surprises way to get back a directly
> usable text string, if that's what's wanted, would be with XMLTABLE
> and an output column of text type.)

If the original reporter is happy with getting xml back, instead of
raw text, but only wants the form of escaping to be the same as it was
in pg11, the xpath //cname/aname/text() can just be rewritten as
string(//cname/aname), which will lose any CDATA-ness and consistently
produce the form where individual characters are escaped, in 11 or 12.

I think.

Regards,
-Chap


Reply | Threaded
Open this post in threaded view
|

Re: BUG #16046: xpath returns CDATA tag along with the value in postgres 12

Markus Winand
In reply to this post by Chapman Flack

On 25 Oct 2019, at 01:34, Chapman Flack <[hidden email]> wrote:

On 10/24/19 19:14, Chapman Flack wrote:
<![CDATA[select 5 & 6 <yahoo!>]]>
select 5 &amp; 6 &lt;yahoo!&gt;

Either form of result is correct, and having it respect the form that was
used in the input might even be delightfully smart.

I haven't looked in the code just now to see if it is intentionally being
delightfully smart, or more simplistic-and-lucky.

It appears to be probably unintentional-but-ok: libxml tags a CDATA
section differently (XML_CDATA_SECTION_NODE) than a text node
(XML_TEXT_NODE), so a CDATA node falls into the catch-all branch of

if (cur->type != XML_ATTRIBUTE_NODE && cur->type != XML_TEXT_NODE)

This was intentional. Earlier versions of the patch had the CDATA explicitly listed[0]. I suggested to reverse the logic later [1].

My concern at that time was about xmltable but I think the current behavior is fine for xpath too. If you pick fragment(s) out of an XML document the most reasonable thing to do is to return that part(s) of the XML as it actually appears in the input.

 We could choose to say that's
what we meant it to do all along.

+1

-markus


Reply | Threaded
Open this post in threaded view
|

Re: BUG #16046: xpath returns CDATA tag along with the value in postgres 12

Tom Lane-2
In reply to this post by Chapman Flack
Chapman Flack <[hidden email]> writes:
> On 10/24/19 19:14, Chapman Flack wrote:
>> Which may or may not be a good thing. Perhaps it could lead the unwary
>> in some cases to think such a query is giving a directly usable
>> text string back, which will be harmless until the one time a value
>> with escaping comes back. (The no-surprises way to get back a directly
>> usable text string, if that's what's wanted, would be with XMLTABLE
>> and an output column of text type.)

> If the original reporter is happy with getting xml back, instead of
> raw text, but only wants the form of escaping to be the same as it was
> in pg11, the xpath //cname/aname/text() can just be rewritten as
> string(//cname/aname), which will lose any CDATA-ness and consistently
> produce the form where individual characters are escaped, in 11 or 12.

> I think.

Yeah, that does work the same for me in 11 and HEAD:

regression=# SELECT unnest(xpath('string(//cname/aname)','<cname><aname><![CDATA[select 5>4]]></aname></cname>'::xml));
    unnest    
---------------
 select 5&gt;4
(1 row)

So I'm inclined to write this off as an intended, or at least not
unwanted, change.  However, this complaint does reinforce Alvaro's
original choice not to back-patch 251cf2e27.

The v12 release notes don't call this out as a compatibility issue,
nor mention anything about CDATA handling:

  * Fix assorted bugs in XML functions (Pavel Stehule, Markus Winand,
    Chapman Flack)

    Specifically, in XMLTABLE, xpath(), and xmlexists(), fix some cases
    where nothing was output for a node, or an unexpected error was
    thrown, or necessary escaping of XML special characters was omitted.

Should we extend that item (if so, what to say exactly?) and/or move
it into the "Observe the following incompatibilities" list?

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: BUG #16046: xpath returns CDATA tag along with the value in postgres 12

Pavel Stehule


pá 25. 10. 2019 v 19:28 odesílatel Tom Lane <[hidden email]> napsal:
Chapman Flack <[hidden email]> writes:
> On 10/24/19 19:14, Chapman Flack wrote:
>> Which may or may not be a good thing. Perhaps it could lead the unwary
>> in some cases to think such a query is giving a directly usable
>> text string back, which will be harmless until the one time a value
>> with escaping comes back. (The no-surprises way to get back a directly
>> usable text string, if that's what's wanted, would be with XMLTABLE
>> and an output column of text type.)

> If the original reporter is happy with getting xml back, instead of
> raw text, but only wants the form of escaping to be the same as it was
> in pg11, the xpath //cname/aname/text() can just be rewritten as
> string(//cname/aname), which will lose any CDATA-ness and consistently
> produce the form where individual characters are escaped, in 11 or 12.

> I think.

Yeah, that does work the same for me in 11 and HEAD:

regression=# SELECT unnest(xpath('string(//cname/aname)','<cname><aname><![CDATA[select 5>4]]></aname></cname>'::xml));
    unnest     
---------------
 select 5&gt;4
(1 row)

So I'm inclined to write this off as an intended, or at least not
unwanted, change.  However, this complaint does reinforce Alvaro's
original choice not to back-patch 251cf2e27.

The v12 release notes don't call this out as a compatibility issue,
nor mention anything about CDATA handling:

  * Fix assorted bugs in XML functions (Pavel Stehule, Markus Winand,
    Chapman Flack)

    Specifically, in XMLTABLE, xpath(), and xmlexists(), fix some cases
    where nothing was output for a node, or an unexpected error was
    thrown, or necessary escaping of XML special characters was omitted.

Should we extend that item (if so, what to say exactly?) and/or move
it into the "Observe the following incompatibilities" list?

I tested XMLTABLE function, and it is work how it is expected. About xpath function - I don't know - there is a agreement so result is correct, and I can say so form of result of Postgres 12 is maybe little bit better, because you will get raw result without any transformation, what is sense of CDATA tag. Unfortunately, I agree, so this behave can surprise for developers and it's clear incompatible behave. Trimming CDATA tag or using string function is easy.

Somewhere should be strong warning about change of processing of CDATA nodes - originally these nodes was converted to text and escaped, now they returned in raw format - without any conversions. The example with "string" function should be used there.

Regards

Pavel


                        regards, tom lane
Reply | Threaded
Open this post in threaded view
|

Re: BUG #16046: xpath returns CDATA tag along with the value in postgres 12

Mohammad Mostafa Kamal
Thank you all so very much for your valuable comments.

So we can be take it for granted that, the behavior of CDATA will remain as it is currently now in pg12 and we need change our queries accordingly to handle this from now on.

Thanks again,
Mostafa

From: Pavel Stehule <[hidden email]>
Sent: Tuesday, November 5, 2019 10:28 PM
To: Tom Lane <[hidden email]>
Cc: Chapman Flack <[hidden email]>; [hidden email] <[hidden email]>; PostgreSQL mailing lists <[hidden email]>; Markus Winand <[hidden email]>
Subject: Re: BUG #16046: xpath returns CDATA tag along with the value in postgres 12
 


pá 25. 10. 2019 v 19:28 odesílatel Tom Lane <[hidden email]> napsal:
Chapman Flack <[hidden email]> writes:
> On 10/24/19 19:14, Chapman Flack wrote:
>> Which may or may not be a good thing. Perhaps it could lead the unwary
>> in some cases to think such a query is giving a directly usable
>> text string back, which will be harmless until the one time a value
>> with escaping comes back. (The no-surprises way to get back a directly
>> usable text string, if that's what's wanted, would be with XMLTABLE
>> and an output column of text type.)

> If the original reporter is happy with getting xml back, instead of
> raw text, but only wants the form of escaping to be the same as it was
> in pg11, the xpath //cname/aname/text() can just be rewritten as
> string(//cname/aname), which will lose any CDATA-ness and consistently
> produce the form where individual characters are escaped, in 11 or 12.

> I think.

Yeah, that does work the same for me in 11 and HEAD:

regression=# SELECT unnest(xpath('string(//cname/aname)','<cname><aname><![CDATA[select 5>4]]></aname></cname>'::xml));
    unnest     
---------------
 select 5&gt;4
(1 row)

So I'm inclined to write this off as an intended, or at least not
unwanted, change.  However, this complaint does reinforce Alvaro's
original choice not to back-patch 251cf2e27.

The v12 release notes don't call this out as a compatibility issue,
nor mention anything about CDATA handling:

  * Fix assorted bugs in XML functions (Pavel Stehule, Markus Winand,
    Chapman Flack)

    Specifically, in XMLTABLE, xpath(), and xmlexists(), fix some cases
    where nothing was output for a node, or an unexpected error was
    thrown, or necessary escaping of XML special characters was omitted.

Should we extend that item (if so, what to say exactly?) and/or move
it into the "Observe the following incompatibilities" list?

I tested XMLTABLE function, and it is work how it is expected. About xpath function - I don't know - there is a agreement so result is correct, and I can say so form of result of Postgres 12 is maybe little bit better, because you will get raw result without any transformation, what is sense of CDATA tag. Unfortunately, I agree, so this behave can surprise for developers and it's clear incompatible behave. Trimming CDATA tag or using string function is easy.

Somewhere should be strong warning about change of processing of CDATA nodes - originally these nodes was converted to text and escaped, now they returned in raw format - without any conversions. The example with "string" function should be used there.

Regards

Pavel


                        regards, tom lane