Issue: Deprecation of the XML2 module 'xml_is_well_formed' function

Mike Berrow-2
We need to make extensive use of the 'xml_is_well_formed' function provided by the XML2 module.

Yet the documentation says that the xml2 module will be deprecated since "XML syntax checking and XPath queries"
is covered by the XML-related functionality based on the SQL/XML standard in the core server from PostgreSQL 8.3 onwards.

However, the core function XMLPARSE does not provide equivalent functionality since when it detects an invalid XML document,
it throws an error rather than returning a truth value (which is what we need and currently have with the 'xml_is_well_formed' function).

For example:

select xml_is_well_formed('<br></br2>');
 xml_is_well_formed
--------------------
 f
(1 row)

select XMLPARSE( DOCUMENT '<br></br2>' );
ERROR:  invalid XML document
DETAIL:  Entity: line 1: parser error : expected '>'
<br></br2>
        ^
Entity: line 1: parser error : Extra content at the end of the document
<br></br2>
        ^

Is there some way to use the new core XML functionality to simply return a truth value
in the way that we need?

Thanks,
-- Mike Berrow




Re: Issue: Deprecation of the XML2 module 'xml_is_well_formed' function

Mike Rylander
On Mon, Jun 28, 2010 at 11:08 AM, Mike Berrow <[hidden email]> wrote:

> We need to make extensive use of the 'xml_is_well_formed' function provided
> by the XML2 module.
> Yet the documentation says that the xml2 module will be deprecated since
> "XML syntax checking and XPath queries"
> is covered by the XML-related functionality based on the SQL/XML standard in
> the core server from PostgreSQL 8.3 onwards.
> However, the core function XMLPARSE does not provide equivalent
> functionality since when it detects an invalid XML document,
> it throws an error rather than returning a truth value (which is what we
> need and currently have with the 'xml_is_well_formed' function).
> For example:
> select xml_is_well_formed('<br></br2>');
>  xml_is_well_formed
> --------------------
>  f
> (1 row)
> select XMLPARSE( DOCUMENT '<br></br2>' );
> ERROR:  invalid XML document
> DETAIL:  Entity: line 1: parser error : expected '>'
> <br></br2>
>         ^
> Entity: line 1: parser error : Extra content at the end of the document
> <br></br2>
>         ^
> Is there some way to use the new, core XML functionality to simply return a
> truth value
> in the way that we need?.

You could do something like this (untested):

CREATE OR REPLACE FUNCTION my_xml_is_valid ( x TEXT ) RETURNS BOOL AS $$
BEGIN
  PERFORM XMLPARSE( DOCUMENT x::XML );
  RETURN TRUE;
EXCEPTION WHEN OTHERS THEN
  RETURN FALSE;
END;
$$ LANGUAGE PLPGSQL;

--
Mike Rylander
 | VP, Research and Design
 | Equinox Software, Inc. / The Evergreen Experts
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  [hidden email]
 | web:  http://www.esilibrary.com

--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: Issue: Deprecation of the XML2 module 'xml_is_well_formed' function

David Fetter
In reply to this post by Mike Berrow-2
On Mon, Jun 28, 2010 at 08:08:53AM -0700, Mike Berrow wrote:

> We need to make extensive use of the 'xml_is_well_formed' function provided
> by the XML2 module.
>
> Yet the documentation says that the xml2 module will be deprecated since
> "XML syntax checking and XPath queries"
> is covered by the XML-related functionality based on the SQL/XML standard in
> the core server from PostgreSQL 8.3 onwards.
>
> However, the core function XMLPARSE does not provide equivalent
> functionality since when it detects an invalid XML document,
> it throws an error rather than returning a truth value (which is what we
> need and currently have with the 'xml_is_well_formed' function).
>
> For example:
>
> select xml_is_well_formed('<br></br2>');
>  xml_is_well_formed
> --------------------
>  f
> (1 row)
>
> select XMLPARSE( DOCUMENT '<br></br2>' );
> ERROR:  invalid XML document
> DETAIL:  Entity: line 1: parser error : expected '>'
> <br></br2>
>         ^
> Entity: line 1: parser error : Extra content at the end of the document
> <br></br2>
>         ^
>
> Is there some way to use the new, core XML functionality to simply
> return a truth value in the way that we need?.

Here's a PL/pgsql wrapper for it.  You could create a similar wrapper
for other commands.

CREATE OR REPLACE FUNCTION xml_is_well_formed(in_putative_xml TEXT)
RETURNS BOOLEAN
STRICT /* Leave this line here if you want RETURNS NULL ON NULL INPUT behavior. */
LANGUAGE plpgsql
AS $$
BEGIN
    /* XMLPARSE raises invalid_xml_document when the text is not a well-formed document. */
    PERFORM XMLPARSE(DOCUMENT in_putative_xml);
    RETURN true;
EXCEPTION
    WHEN invalid_xml_document THEN
        RETURN false;
END;
$$;
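
A quick check of such a wrapper might look like the following. This is only a sketch: it assumes the PL/pgSQL function above has been created successfully in the current database, and the column aliases are made up for illustration. The two non-NULL inputs are a balanced variant of, and the exact failing string from, the original post.

```sql
-- Assumes the PL/pgSQL xml_is_well_formed(text) wrapper above exists.
SELECT xml_is_well_formed('<br></br>')  AS balanced,    -- parses as a document
       xml_is_well_formed('<br></br2>') AS mismatched,  -- failing input from the original post
       xml_is_well_formed(NULL)         AS null_input;  -- STRICT: the body is never run
```

With STRICT declared, the NULL argument should produce NULL rather than false, which matters if the result feeds a WHERE clause.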

While tracking this down, I didn't see a way to get SQLSTATE or the
corresponding condition name via psql.  Is this an oversight?  A bug,
perhaps?

Cheers,
David.
--
David Fetter <[hidden email]> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: [hidden email]
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: Issue: Deprecation of the XML2 module 'xml_is_well_formed' function

Robert Haas
In reply to this post by Mike Rylander
On Mon, Jun 28, 2010 at 11:42 AM, Mike Rylander <[hidden email]> wrote:

> You could do something like this (untested):
>
> CREATE OR REPLACE FUNCTION my_xml_is_valid ( x TEXT ) RETURNS BOOL AS $$
> BEGIN
>  PERFORM XMLPARSE( DOCUMENT x::XML );
>  RETURN TRUE;
> EXCEPTION WHEN OTHERS THEN
>  RETURN FALSE;
> END;
> $$ LANGUAGE PLPGSQL;

This might perform significantly worse, though: exception handling ain't cheap.

It's not a bad workaround, but I think the OP has a point.
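
For context, the PL/pgSQL documentation notes that a block containing an EXCEPTION clause is significantly more expensive to enter and exit than one without. A rough way to gauge that cost in psql is sketched below. It assumes the my_xml_is_valid function from the quoted message exists; the row count of 100000 is arbitrary, and no particular timings are implied.

```sql
-- \timing is a psql meta-command, not SQL.
\timing on

-- Baseline: parse well-formed documents directly, with no exception block.
SELECT count(*)
FROM generate_series(1, 100000) AS g(i)
WHERE XMLPARSE(DOCUMENT '<a>' || g.i::text || '</a>') IS NOT NULL;

-- Same inputs through the wrapper; every call enters and leaves
-- a block that has an EXCEPTION clause.
SELECT count(*)
FROM generate_series(1, 100000) AS g(i)
WHERE my_xml_is_valid('<a>' || g.i::text || '</a>');

\timing off
```

Comparing the two reported times gives a first impression of the overhead on a given machine; it is not a rigorous benchmark.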

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: Issue: Deprecation of the XML2 module 'xml_is_well_formed' function

Mike Fowler-3
Robert Haas wrote:

> On Mon, Jun 28, 2010 at 11:42 AM, Mike Rylander <[hidden email]> wrote:
>  
>> You could do something like this (untested):
>>
>> CREATE OR REPLACE FUNCTION my_xml_is_valid ( x TEXT ) RETURNS BOOL AS $$
>> BEGIN
>>  PERFORM XMLPARSE( DOCUMENT x::XML );
>>  RETURN TRUE;
>> EXCEPTION WHEN OTHERS THEN
>>  RETURN FALSE;
>> END;
>> $$ LANGUAGE PLPGSQL;
>>    
>
> This might perform significantly worse, though: exception handling ain't cheap.
>
> It's not a bad workaround, but I think the OP has a point.
>
>  
Should the IS DOCUMENT predicate support this? At the moment you get the
following:

template1=# SELECT
'<towns><town>Bidford-on-Avon</town><town>Cwmbran</town><town>Bristol</town></towns>'
IS DOCUMENT;
 ?column?
----------
 t
(1 row)

template1=# SELECT
'<towns><town>Bidford-on-Avon</town><town>Cwmbran</town><town>Bristol</town></towns'
IS DOCUMENT;
ERROR:  invalid XML content
LINE 1: SELECT '<towns><town>Bidford-on-Avon</town><town>Cwmbran</to...
               ^
DETAIL:  Entity: line 1: parser error : expected '>'
owns><town>Bidford-on-Avon</town><town>Cwmbran</town><town>Bristol</town></towns
                                                                                ^
Entity: line 1: parser error : chunk is not well balanced
owns><town>Bidford-on-Avon</town><town>Cwmbran</town><town>Bristol</town></towns
                                                                                ^
I would've hoped the second would've returned 'f' rather than failing.
I've had a glance at the XML/SQL standard and I don't see anything in
the detail of the predicate (8.2) that would specifically prohibit us
from changing this behavior, unless the common rule  'Parsing a string
as an XML value' (10.16) must always be in force. I'm no standard
expert, but IMHO this would be an acceptable change to improve
usability. What do others think?

Regards,

--
Mike Fowler
Registered Linux user: 379787

"I could be a genius if I just put my mind to it, and I,
I could do anything, if only I could get 'round to it"
-PULP 'Glory Days'



Re: Issue: Deprecation of the XML2 module 'xml_is_well_formed' function

Alvaro Herrera-7
In reply to this post by David Fetter
Excerpts from David Fetter's message of lun jun 28 12:00:47 -0400 2010:

> While tracking this down, I didn't see a way to get SQLSTATE or the
> corresponding condition name via psql.  Is this an oversight?  A bug,
> perhaps?

IIRC, in psql you can use
\set VERBOSITY verbose
to get the SQLSTATE.

I don't think you can get the condition name that way, though.
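
As a sketch of how that looks in a psql session (VERBOSITY is a psql variable, so it is changed with \set), the failing statement from the original post can be rerun in verbose mode. The mapping from code to condition name comes from the "PostgreSQL Error Codes" appendix of the documentation, not from psql itself.

```sql
-- psql meta-commands; run interactively.
\set VERBOSITY verbose

-- The ERROR line should now carry the five-character SQLSTATE.
-- For this failure that is 2200M, which the error codes appendix
-- lists as invalid_xml_document.
SELECT XMLPARSE(DOCUMENT '<br></br2>');

\set VERBOSITY default
```

The condition name found this way is the one to use in a PL/pgSQL WHEN clause.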


Re: Issue: Deprecation of the XML2 module 'xml_is_well_formed' function

Mike Fowler-3
In reply to this post by Mike Fowler-3
Quoting Mike Fowler <[hidden email]>:

> Should the IS DOCUMENT predicate support this? At the moment you get
> the following:
>
> template1=# SELECT
> '<towns><town>Bidford-on-Avon</town><town>Cwmbran</town><town>Bristol</town></towns>'  
> IS
> DOCUMENT;
> ?column?
> ----------
> t
> (1 row)
>
> template1=# SELECT
> '<towns><town>Bidford-on-Avon</town><town>Cwmbran</town><town>Bristol</town></towns'  
> IS
> DOCUMENT;
> ERROR:  invalid XML content
> LINE 1: SELECT '<towns><town>Bidford-on-Avon</town><town>Cwmbran</to...
>               ^
> DETAIL:  Entity: line 1: parser error : expected '>'
> owns><town>Bidford-on-Avon</town><town>Cwmbran</town><town>Bristol</town></towns
>
>       ^
> Entity: line 1: parser error : chunk is not well balanced
> owns><town>Bidford-on-Avon</town><town>Cwmbran</town><town>Bristol</town></towns
>
>       ^
> I would've hoped the second would've returned 'f' rather than failing.
> I've had a glance at the XML/SQL standard and I don't see anything in
> the detail of the predicate (8.2) that would specifically prohibit us
> from changing this behavior, unless the common rule  'Parsing a string
> as an XML value' (10.16) must always be in force. I'm no standard
> expert, but IMHO this would be an acceptable change to improve
> usability. What do others think?

Right, I've answered my own question whilst sitting in the open source
coding session at CHAR(10). Yes, IS DOCUMENT should return false for a
document that is not well formed, and indeed it is coded to do so. However,
the conversion to the xml type, which happens before the underlying
xml_is_document function is even called, fails and raises an error. I'll
work on a patch to resolve this behavior so that IS DOCUMENT will
give you the missing 'xml_is_well_formed' functionality.

Regards,

--
Mike Fowler
Registered Linux user: 379787



Re: Issue: Deprecation of the XML2 module 'xml_is_well_formed' function

Robert Haas
On Thu, Jul 1, 2010 at 12:25 PM, Mike Fowler <[hidden email]> wrote:

> Quoting Mike Fowler <[hidden email]>:
>
>> Should the IS DOCUMENT predicate support this? At the moment you get
>> the following:
>>
>> template1=# SELECT
>>
>> '<towns><town>Bidford-on-Avon</town><town>Cwmbran</town><town>Bristol</town></towns>'
>>  IS
>> DOCUMENT;
>> ?column?
>> ----------
>> t
>> (1 row)
>>
>> template1=# SELECT
>>
>> '<towns><town>Bidford-on-Avon</town><town>Cwmbran</town><town>Bristol</town></towns'
>>  IS
>> DOCUMENT;
>> ERROR:  invalid XML content
>> LINE 1: SELECT '<towns><town>Bidford-on-Avon</town><town>Cwmbran</to...
>>              ^
>> DETAIL:  Entity: line 1: parser error : expected '>'
>>
>> owns><town>Bidford-on-Avon</town><town>Cwmbran</town><town>Bristol</town></towns
>>
>>      ^
>> Entity: line 1: parser error : chunk is not well balanced
>>
>> owns><town>Bidford-on-Avon</town><town>Cwmbran</town><town>Bristol</town></towns
>>
>>      ^
>> I would've hoped the second would've returned 'f' rather than failing.
>> I've had a glance at the XML/SQL standard and I don't see anything in
>> the detail of the predicate (8.2) that would specifically prohibit us
>> from changing this behavior, unless the common rule  'Parsing a string
>> as an XML value' (10.16) must always be in force. I'm no standard
>> expert, but IMHO this would be an acceptable change to improve
>> usability. What do others think?
>
> Right, I've answered my own question whilst sitting in the open source
> coding session at CHAR(10). Yes, IS DOCUMENT should return false for a
> non-well formed document, and indeed is coded to do such. However, the
> conversion to the xml type which happens before the underlying
> xml_is_document function is even called fails and exceptions out. I'll work
> on a patch to resolve this behavior such that IS DOCUMENT will give you the
> missing 'xml_is_well_formed' function.

I think the point of "IS DOCUMENT" is to distinguish a document:

<foo>some stuff<bar/><baz/></foo>

from a document fragment:

<bar/><baz/>

A document is allowed only one toplevel tag.

It'd be nice, I think, to have a function that tells you whether
something is legal XML without throwing an error if it isn't, but I
suspect that should be a separate function, rather than trying to jam
it into "IS DOCUMENT".

http://developer.postgresql.org/pgdocs/postgres/functions-xml.html#AEN15187
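
The distinction can be seen directly with xml literals. A minimal sketch, assuming a server built with libxml support and the default xmloption setting of CONTENT, so that the fragment is accepted as an xml value in the first place:

```sql
-- One top-level element: a document.
SELECT xml '<foo>some stuff<bar/><baz/></foo>' IS DOCUMENT AS single_root;

-- Two top-level elements: well-formed content, but not a document.
SELECT xml '<bar/><baz/>' IS DOCUMENT AS fragment;
```

The first query should return true and the second false. Both inputs are well formed, which is why the predicate cannot double as a well-formedness test: malformed text fails earlier, during conversion to xml.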

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: Issue: Deprecation of the XML2 module 'xml_is_well_formed' function

Mike Fowler-3
Quoting Robert Haas <[hidden email]>:

>
> I think the point of "IS DOCUMENT" is to distinguish a document:
>
> <foo>some stuff<bar/><baz/></foo>
>
> from a document fragment:
>
> <bar/><baz/>
>
> A document is allowed only one toplevel tag.
>
> It'd be nice, I think, to have a function that tells you whether
> something is legal XML without throwing an error if it isn't, but I
> suspect that should be a separate function, rather than trying to jam
> it into "IS DOCUMENT".
>
> http://developer.postgresql.org/pgdocs/postgres/functions-xml.html#AEN15187
>

I've submitted a patch to the bug report I filed yesterday that  
implements this. The way I read the standard (and I'm only reading a  
draft and I'm no expert) I don't see that it mandates that IS DOCUMENT  
returns false when IS CONTENT would return true. So if IS CONTENT were  
to be implemented, to determine that you have something that is  
malformed you could say:

val IS NOT DOCUMENT AND val IS NOT CONTENT

I think having the direct predicate support would be useful for  
columns of text where you know that some, though possibly not all,  
text values are valid XML.

--
Mike Fowler
Registered Linux user: 379787



Re: Issue: Deprecation of the XML2 module 'xml_is_well_formed' function

Peter Eisentraut-2
On fre, 2010-07-02 at 14:07 +0100, Mike Fowler wrote:
> So if IS CONTENT were  
> to be implemented, to determine that you have something that is  
> malformed

But that's not what IS CONTENT does.  "Content" still needs to be
well-formed.



Re: Issue: Deprecation of the XML2 module 'xml_is_well_formed' function

Mike Fowler-3
Quoting Peter Eisentraut <[hidden email]>:

> On fre, 2010-07-02 at 14:07 +0100, Mike Fowler wrote:
>> So if IS CONTENT were
>> to be implemented, to determine that you have something that is
>> malformed
>
> But that's not what IS CONTENT does.  "Content" still needs to be
> well-formed.
>

What I was hoping to achieve was to determine that something wasn't a  
document and wasn't content, however as you pointed out on the bugs  
thread the value must be XML. My mistake was not checking that I had  
followed the definitions all the way back to the root. What I will do  
instead is implement the xml_is_well_formed function and get a patch  
out in the next day or two.

Thank you Robert and Peter for tolerating my stumbles through the standard.

Regards,

--
Mike Fowler
Registered Linux user: 379787



Re: Issue: Deprecation of the XML2 module 'xml_is_well_formed' function

Peter Eisentraut-2
On lör, 2010-07-03 at 09:26 +0100, Mike Fowler wrote:
> What I will do  
> instead is implement the xml_is_well_formed function and get a patch  
> out in the next day or two.

That sounds very useful.



[PATCH] Re: Issue: Deprecation of the XML2 module 'xml_is_well_formed' function

Mike Fowler-3
Peter Eisentraut wrote:
> On lör, 2010-07-03 at 09:26 +0100, Mike Fowler wrote:
>  
>> What I will do  
>> instead is implement the xml_is_well_formed function and get a patch  
>> out in the next day or two.
>>    
>
> That sounds very useful.
>  
Here's the patch to add the 'xml_is_well_formed' function. Paraphrasing
the SGML the syntax is:

|xml_is_well_formed|(/text/)

The function |xml_is_well_formed| evaluates whether the /text/ is well
formed XML content, returning a boolean. I've done some tests (included
in the patch) with tables containing a mixture of well formed documents
and content and the function is happily returning the expected result.
Combining with IS (NOT) DOCUMENT is working nicely for pulling out
content or documents from a table of text.

Unless I missed something in the original correspondence, I think this
patch will solve the issue.
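
One caveat when combining the function with a cast in a WHERE clause: PostgreSQL does not promise left-to-right evaluation of AND operands, so the cast may in principle run on a row that the check would have rejected. The documentation suggests CASE where evaluation order matters. The sketch below assumes a server where xml_is_well_formed(text) with the content semantics of this patch is available; the table and column names follow the patch's documentation example.

```sql
-- Sample data mirroring the patch's documentation example.
CREATE TEMP TABLE mixed (data text);
INSERT INTO mixed VALUES
    ('<foo>bar</foo>'),
    ('<foo>bar</foo'),
    ('<foo>bar</foo><bar>foo</bar>'),
    ('<foo>bar</foo><bar>foo</bar');

-- CASE evaluates the cast only for rows that passed the check.
SELECT count(*)
FROM mixed
WHERE CASE WHEN xml_is_well_formed(data)
           THEN data::xml IS DOCUMENT
           ELSE false
      END;
```

This should count the same single row as the plain AND form shown in the patch, without relying on the order in which the planner happens to evaluate the two conditions.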

Regards,

--
Mike Fowler
Registered Linux user: 379787


*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
***************
*** 8554,8562 **** SELECT xmlagg(x) FROM (SELECT * FROM test ORDER BY y DESC) AS tab;
  ]]></screen>
      </para>
     </sect3>
 
     <sect3>
!     <title>XML Predicates</title>
 
      <indexterm>
       <primary>IS DOCUMENT</primary>
--- 8554,8566 ----
  ]]></screen>
      </para>
     </sect3>
+   </sect2>
+
+   <sect2>
+    <title>XML Predicates</title>
 
     <sect3>
!     <title>IS DOCUMENT</title>
 
      <indexterm>
       <primary>IS DOCUMENT</primary>
***************
*** 8574,8579 **** SELECT xmlagg(x) FROM (SELECT * FROM test ORDER BY y DESC) AS tab;
--- 8578,8653 ----
       between documents and content fragments.
      </para>
     </sect3>
+
+    <sect3>
+     <title>xml_is_well_formed</title>
+
+     <indexterm>
+      <primary>xml_is_well_formed</primary>
+      <secondary>well formed</secondary>
+     </indexterm>
+
+ <synopsis>
+ <function>xml_is_well_formed</function>(<replaceable>text</replaceable>)
+ </synopsis>
+
+     <para>
+      The function <function>xml_is_well_formed</function> evaluates whether
+      the <replaceable>text</replaceable> is well formed XML content, returning
+      a boolean.
+     </para>
+     <para>
+     Example:
+ <screen><![CDATA[
+ SELECT xml_is_well_formed('<foo>bar</foo>');
+  xml_is_well_formed
+ --------------------
+  t
+ (1 row)
+
+ SELECT xml_is_well_formed('<foo>bar</foo');
+  xml_is_well_formed
+ --------------------
+  f
+ (1 row)
+ ]]></screen>
+     </para>
+     <para>
+     This function can be combined with the IS DOCUMENT predicate to prevent
+     invalid XML content errors from occurring in queries. For example, given a
+     table that may have rows with invalid XML mixed in with rows of valid
+     XML, <function>xml_is_well_formed</function> can be used to filter out all
+     the invalid rows.
+     </para>
+     <para>
+     Example:
+ <screen><![CDATA[
+ SELECT * FROM mixed;
+              data
+ ------------------------------
+  <foo>bar</foo>
+  <foo>bar</foo
+  <foo>bar</foo><bar>foo</bar>
+  <foo>bar</foo><bar>foo</bar
+ (4 rows)
+
+ SELECT COUNT(data) FROM mixed WHERE data::xml IS DOCUMENT;
+ ERROR:  invalid XML content
+ DETAIL:  Entity: line 1: parser error : expected '>'
+ <foo>bar</foo
+              ^
+ Entity: line 1: parser error : chunk is not well balanced
+ <foo>bar</foo
+              ^
+
+ SELECT COUNT(data) FROM mixed WHERE xml_is_well_formed(data) AND data::xml IS DOCUMENT;
+  count
+ -------
+      1
+ (1 row)
+ ]]></screen>
+     </para>
+    </sect3>
    </sect2>
 
    <sect2 id="functions-xml-processing">
*** a/src/backend/utils/adt/xml.c
--- b/src/backend/utils/adt/xml.c
***************
*** 3293,3298 **** xml_xmlnodetoxmltype(xmlNodePtr cur)
--- 3293,3365 ----
  }
  #endif
 
+ Datum
+ xml_is_well_formed(PG_FUNCTION_ARGS)
+ {
+ #ifdef USE_LIBXML
+     text *data = PG_GETARG_TEXT_P(0);
+     bool result;
+     int res_code;
+     int32 len;
+     const xmlChar *string;
+     xmlParserCtxtPtr ctxt;
+     xmlDocPtr doc = NULL;
+
+     len = VARSIZE(data) - VARHDRSZ;
+     string = xml_text2xmlChar(data);
+
+     /* Start up libxml and its parser (no-ops if already done) */
+     pg_xml_init();
+     xmlInitParser();
+
+     ctxt = xmlNewParserCtxt();
+     if (ctxt == NULL)
+         xml_ereport(ERROR, ERRCODE_OUT_OF_MEMORY,
+                     "could not allocate parser context");
+
+     PG_TRY();
+     {
+         size_t count;
+         xmlChar    *version = NULL;
+         int standalone = -1;
+
+         res_code = parse_xml_decl(string, &count, &version, NULL, &standalone);
+         if (res_code != 0)
+             xml_ereport_by_code(ERROR, ERRCODE_INVALID_XML_CONTENT,
+                                 "invalid XML content: invalid XML declaration",
+                                 res_code);
+
+         doc = xmlNewDoc(version);
+         doc->encoding = xmlStrdup((const xmlChar *) "UTF-8");
+         doc->standalone = 1;
+
+         res_code = xmlParseBalancedChunkMemory(doc, NULL, NULL, 0, string + count, NULL);
+
+         result = !res_code;
+     }
+     PG_CATCH();
+     {
+         if (doc)
+             xmlFreeDoc(doc);
+         if (ctxt)
+             xmlFreeParserCtxt(ctxt);
+
+         PG_RE_THROW();
+     }
+     PG_END_TRY();
+
+     if (doc)
+         xmlFreeDoc(doc);
+     if (ctxt)
+         xmlFreeParserCtxt(ctxt);
+
+     return result;
+ #else
+     NO_XML_SUPPORT();
+     return 0;
+ #endif
+ }
+
 
  /*
   * Evaluate XPath expression and return array of XML values.
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
***************
*** 4385,4390 **** DESCR("evaluate XPath expression, with namespaces support");
--- 4385,4393 ----
  DATA(insert OID = 2932 (  xpath PGNSP PGUID 14 1 0 0 f f f t f i 2 0 143 "25 142" _null_ _null_ _null_ _null_ "select pg_catalog.xpath($1, $2, ''{}''::pg_catalog.text[])" _null_ _null_ _null_ ));
  DESCR("evaluate XPath expression");
 
+ DATA(insert OID = 3037 (  xml_is_well_formed PGNSP PGUID 12 1 0 0 f f f t f i 1 0 16 "25" _null_ _null_ _null_ _null_ xml_is_well_formed _null_ _null_ _null_ ));
+ DESCR("determine if a text fragment is well formed XML");
+
  /* uuid */
  DATA(insert OID = 2952 (  uuid_in   PGNSP PGUID 12 1 0 0 f f f t f i 1 0 2950 "2275" _null_ _null_ _null_ _null_ uuid_in _null_ _null_ _null_ ));
  DESCR("I/O");
*** a/src/include/utils/xml.h
--- b/src/include/utils/xml.h
***************
*** 46,51 **** extern Datum query_to_xmlschema(PG_FUNCTION_ARGS);
--- 46,52 ----
  extern Datum cursor_to_xmlschema(PG_FUNCTION_ARGS);
  extern Datum table_to_xml_and_xmlschema(PG_FUNCTION_ARGS);
  extern Datum query_to_xml_and_xmlschema(PG_FUNCTION_ARGS);
+ extern Datum xml_is_well_formed(PG_FUNCTION_ARGS);
 
  extern Datum schema_to_xml(PG_FUNCTION_ARGS);
  extern Datum schema_to_xmlschema(PG_FUNCTION_ARGS);
*** a/src/test/regress/expected/xml.out
--- b/src/test/regress/expected/xml.out
***************
*** 502,504 **** SELECT xpath('//b', '<a>one <b>two</b> three <b>etc</b></a>');
--- 502,565 ----
   {<b>two</b>,<b>etc</b>}
  (1 row)
 
+ -- Test xml_is_well_formed
+ SELECT xml_is_well_formed('<>');
+  xml_is_well_formed
+ --------------------
+  f
+ (1 row)
+
+ SELECT xml_is_well_formed('abc');
+  xml_is_well_formed
+ --------------------
+  t
+ (1 row)
+
+ SELECT xml_is_well_formed('<abc/>');
+  xml_is_well_formed
+ --------------------
+  t
+ (1 row)
+
+ SELECT xml_is_well_formed('<foo>bar</foo>');
+  xml_is_well_formed
+ --------------------
+  t
+ (1 row)
+
+ SELECT xml_is_well_formed('<foo>bar</foo');
+  xml_is_well_formed
+ --------------------
+  f
+ (1 row)
+
+ SELECT xml_is_well_formed('<foo><bar>baz</foo>');
+  xml_is_well_formed
+ --------------------
+  f
+ (1 row)
+
+ SELECT xml_is_well_formed('<local:data xmlns:local="http://127.0.0.1"><local:piece id="1">number one</local:piece><local:piece id="2" /></local:data>');
+  xml_is_well_formed
+ --------------------
+  t
+ (1 row)
+
+ SELECT xml_is_well_formed('<foo>bar</foo>') AND '<foo>bar</foo>' IS DOCUMENT;
+  ?column?
+ ----------
+  t
+ (1 row)
+
+ SELECT xml_is_well_formed('<foo>bar</foo>baz') AND '<foo>bar</foo>baz' IS NOT DOCUMENT;
+  ?column?
+ ----------
+  t
+ (1 row)
+
+ SELECT xml_is_well_formed('<foo>bar</foo><bar>foo</bar>') AND '<foo>bar</foo><bar>foo</bar>' IS NOT DOCUMENT;
+  ?column?
+ ----------
+  t
+ (1 row)
+
*** a/src/test/regress/sql/xml.sql
--- b/src/test/regress/sql/xml.sql
***************
*** 163,165 **** SELECT xpath('', '<!-- error -->');
--- 163,179 ----
  SELECT xpath('//text()', '<local:data xmlns:local="http://127.0.0.1"><local:piece id="1">number one</local:piece><local:piece id="2" /></local:data>');
  SELECT xpath('//loc:piece/@id', '<local:data xmlns:local="http://127.0.0.1"><local:piece id="1">number one</local:piece><local:piece id="2" /></local:data>', ARRAY[ARRAY['loc', 'http://127.0.0.1']]);
  SELECT xpath('//b', '<a>one <b>two</b> three <b>etc</b></a>');
+
+ -- Test xml_is_well_formed
+
+ SELECT xml_is_well_formed('<>');
+ SELECT xml_is_well_formed('abc');
+ SELECT xml_is_well_formed('<abc/>');
+ SELECT xml_is_well_formed('<foo>bar</foo>');
+ SELECT xml_is_well_formed('<foo>bar</foo');
+ SELECT xml_is_well_formed('<foo><bar>baz</foo>');
+ SELECT xml_is_well_formed('<local:data xmlns:local="http://127.0.0.1"><local:piece id="1">number one</local:piece><local:piece id="2" /></local:data>');
+ SELECT xml_is_well_formed('<foo>bar</foo>') AND '<foo>bar</foo>' IS DOCUMENT;
+ SELECT xml_is_well_formed('<foo>bar</foo>baz') AND '<foo>bar</foo>baz' IS NOT DOCUMENT;
+ SELECT xml_is_well_formed('<foo>bar</foo><bar>foo</bar>') AND '<foo>bar</foo><bar>foo</bar>' IS NOT DOCUMENT;
+



Re: [PATCH] Re: Issue: Deprecation of the XML2 module 'xml_is_well_formed' function

Peter Eisentraut-2
On ons, 2010-07-07 at 16:37 +0100, Mike Fowler wrote:
> Here's the patch to add the 'xml_is_well_formed' function.

I suppose we should remove the function from contrib/xml2 at the same
time.



Re: [PATCH] Re: Issue: Deprecation of the XML2 module 'xml_is_well_formed' function

Robert Haas
On Fri, Jul 9, 2010 at 4:06 PM, Peter Eisentraut <[hidden email]> wrote:
> On ons, 2010-07-07 at 16:37 +0100, Mike Fowler wrote:
>> Here's the patch to add the 'xml_is_well_formed' function.
>
> I suppose we should remove the function from contrib/xml2 at the same
> time.

Yep.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: [PATCH] Re: Issue: Deprecation of the XML2 module 'xml_is_well_formed' function

Mike Fowler-3
Robert Haas wrote:

> On Fri, Jul 9, 2010 at 4:06 PM, Peter Eisentraut <[hidden email]> wrote:
>  
>> On ons, 2010-07-07 at 16:37 +0100, Mike Fowler wrote:
>>    
>>> Here's the patch to add the 'xml_is_well_formed' function.
>>>      
>> I suppose we should remove the function from contrib/xml2 at the same
>> time.
>>    
>
> Yep
Revised patch deleting the contrib/xml2 version of the function attached.

Regards,

--
Mike Fowler
Registered Linux user: 379787


*** a/contrib/xml2/xpath.c
--- b/contrib/xml2/xpath.c
***************
*** 27,33 **** PG_MODULE_MAGIC;
 
  /* externally accessible functions */
 
- Datum xml_is_well_formed(PG_FUNCTION_ARGS);
  Datum xml_encode_special_chars(PG_FUNCTION_ARGS);
  Datum xpath_nodeset(PG_FUNCTION_ARGS);
  Datum xpath_string(PG_FUNCTION_ARGS);
--- 27,32 ----
***************
*** 70,97 **** pgxml_parser_init(void)
  xmlLoadExtDtdDefaultValue = 1;
  }
 
-
- /* Returns true if document is well-formed */
-
- PG_FUNCTION_INFO_V1(xml_is_well_formed);
-
- Datum
- xml_is_well_formed(PG_FUNCTION_ARGS)
- {
-     text   *t = PG_GETARG_TEXT_P(0); /* document buffer */
-     int32 docsize = VARSIZE(t) - VARHDRSZ;
-     xmlDocPtr doctree;
-
-     pgxml_parser_init();
-
-     doctree = xmlParseMemory((char *) VARDATA(t), docsize);
-     if (doctree == NULL)
-         PG_RETURN_BOOL(false); /* i.e. not well-formed */
-     xmlFreeDoc(doctree);
-     PG_RETURN_BOOL(true);
- }
-
-
  /* Encodes special characters (<, >, &, " and \r) as XML entities */
 
  PG_FUNCTION_INFO_V1(xml_encode_special_chars);
--- 69,74 ----
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
***************
*** 8554,8562 **** SELECT xmlagg(x) FROM (SELECT * FROM test ORDER BY y DESC) AS tab;
  ]]></screen>
      </para>
     </sect3>
 
     <sect3>
!     <title>XML Predicates</title>
 
      <indexterm>
       <primary>IS DOCUMENT</primary>
--- 8554,8566 ----
  ]]></screen>
      </para>
     </sect3>
+   </sect2>
+
+   <sect2>
+    <title>XML Predicates</title>
 
     <sect3>
!     <title>IS DOCUMENT</title>
 
      <indexterm>
       <primary>IS DOCUMENT</primary>
***************
*** 8574,8579 **** SELECT xmlagg(x) FROM (SELECT * FROM test ORDER BY y DESC) AS tab;
--- 8578,8653 ----
       between documents and content fragments.
      </para>
     </sect3>
+
+    <sect3>
+     <title>xml_is_well_formed</title>
+
+     <indexterm>
+      <primary>xml_is_well_formed</primary>
+      <secondary>well formed</secondary>
+     </indexterm>
+
+ <synopsis>
+ <function>xml_is_well_formed</function>(<replaceable>text</replaceable>)
+ </synopsis>
+
+     <para>
+      The function <function>xml_is_well_formed</function> evaluates whether
+      the <replaceable>text</replaceable> is well formed XML content, returning
+      a boolean.
+     </para>
+     <para>
+     Example:
+ <screen><![CDATA[
+ SELECT xml_is_well_formed('<foo>bar</foo>');
+  xml_is_well_formed
+ --------------------
+  t
+ (1 row)
+
+ SELECT xml_is_well_formed('<foo>bar</foo');
+  xml_is_well_formed
+ --------------------
+  f
+ (1 row)
+ ]]></screen>
+     </para>
+     <para>
+     This function can be combined with the IS DOCUMENT predicate to prevent
+     invalid XML content errors from occurring in queries. For example, given a
+     table that may have rows with invalid XML mixed in with rows of valid
+     XML, <function>xml_is_well_formed</function> can be used to filter out all
+     the invalid rows.
+     </para>
+     <para>
+     Example:
+ <screen><![CDATA[
+ SELECT * FROM mixed;
+              data
+ ------------------------------
+  <foo>bar</foo>
+  <foo>bar</foo
+  <foo>bar</foo><bar>foo</bar>
+  <foo>bar</foo><bar>foo</bar
+ (4 rows)
+
+ SELECT COUNT(data) FROM mixed WHERE data::xml IS DOCUMENT;
+ ERROR:  invalid XML content
+ DETAIL:  Entity: line 1: parser error : expected '>'
+ <foo>bar</foo
+              ^
+ Entity: line 1: parser error : chunk is not well balanced
+ <foo>bar</foo
+              ^
+
+ SELECT COUNT(data) FROM mixed WHERE xml_is_well_formed(data) AND data::xml IS DOCUMENT;
+  count
+ -------
+      1
+ (1 row)
+ ]]></screen>
+     </para>
+    </sect3>
    </sect2>
 
    <sect2 id="functions-xml-processing">
*** a/src/backend/utils/adt/xml.c
--- b/src/backend/utils/adt/xml.c
***************
*** 3293,3298 **** xml_xmlnodetoxmltype(xmlNodePtr cur)
--- 3293,3365 ----
  }
  #endif
 
+ Datum
+ xml_is_well_formed(PG_FUNCTION_ARGS)
+ {
+ #ifdef USE_LIBXML
+ text *data = PG_GETARG_TEXT_P(0);
+ bool result;
+ int res_code;
+ int32 len;
+ const xmlChar *string;
+ xmlParserCtxtPtr ctxt;
+ xmlDocPtr doc = NULL;
+
+ len = VARSIZE(data) - VARHDRSZ;
+ string = xml_text2xmlChar(data);
+
+ /* Start up libxml and its parser (no-ops if already done) */
+ pg_xml_init();
+ xmlInitParser();
+
+ ctxt = xmlNewParserCtxt();
+ if (ctxt == NULL)
+ xml_ereport(ERROR, ERRCODE_OUT_OF_MEMORY,
+ "could not allocate parser context");
+
+ PG_TRY();
+ {
+ size_t count;
+ xmlChar    *version = NULL;
+ int standalone = -1;
+
+ res_code = parse_xml_decl(string, &count, &version, NULL, &standalone);
+ if (res_code != 0)
+ xml_ereport_by_code(ERROR, ERRCODE_INVALID_XML_CONTENT,
+  "invalid XML content: invalid XML declaration",
+ res_code);
+
+ doc = xmlNewDoc(version);
+ doc->encoding = xmlStrdup((const xmlChar *) "UTF-8");
+ doc->standalone = 1;
+
+ res_code = xmlParseBalancedChunkMemory(doc, NULL, NULL, 0, string + count, NULL);
+
+ result = !res_code;
+ }
+ PG_CATCH();
+ {
+ if (doc)
+ xmlFreeDoc(doc);
+ if (ctxt)
+ xmlFreeParserCtxt(ctxt);
+
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+
+ if (doc)
+ xmlFreeDoc(doc);
+ if (ctxt)
+ xmlFreeParserCtxt(ctxt);
+
+ PG_RETURN_BOOL(result);
+ #else
+ NO_XML_SUPPORT();
+ return 0;
+ #endif
+ }
+
 
  /*
   * Evaluate XPath expression and return array of XML values.
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
***************
*** 4385,4390 **** DESCR("evaluate XPath expression, with namespaces support");
--- 4385,4393 ----
  DATA(insert OID = 2932 (  xpath PGNSP PGUID 14 1 0 0 f f f t f i 2 0 143 "25 142" _null_ _null_ _null_ _null_ "select pg_catalog.xpath($1, $2, ''{}''::pg_catalog.text[])" _null_ _null_ _null_ ));
  DESCR("evaluate XPath expression");
 
+ DATA(insert OID = 3037 (  xml_is_well_formed PGNSP PGUID 12 1 0 0 f f f t f i 1 0 16 "25" _null_ _null_ _null_ _null_ xml_is_well_formed _null_ _null_ _null_ ));
+ DESCR("determine if a text fragment is well formed XML");
+
  /* uuid */
  DATA(insert OID = 2952 (  uuid_in   PGNSP PGUID 12 1 0 0 f f f t f i 1 0 2950 "2275" _null_ _null_ _null_ _null_ uuid_in _null_ _null_ _null_ ));
  DESCR("I/O");
*** a/src/include/utils/xml.h
--- b/src/include/utils/xml.h
***************
*** 46,51 **** extern Datum query_to_xmlschema(PG_FUNCTION_ARGS);
--- 46,52 ----
  extern Datum cursor_to_xmlschema(PG_FUNCTION_ARGS);
  extern Datum table_to_xml_and_xmlschema(PG_FUNCTION_ARGS);
  extern Datum query_to_xml_and_xmlschema(PG_FUNCTION_ARGS);
+ extern Datum xml_is_well_formed(PG_FUNCTION_ARGS);
 
  extern Datum schema_to_xml(PG_FUNCTION_ARGS);
  extern Datum schema_to_xmlschema(PG_FUNCTION_ARGS);
*** a/src/test/regress/expected/xml.out
--- b/src/test/regress/expected/xml.out
***************
*** 502,504 **** SELECT xpath('//b', '<a>one <b>two</b> three <b>etc</b></a>');
--- 502,565 ----
   {<b>two</b>,<b>etc</b>}
  (1 row)
 
+ -- Test xml_is_well_formed
+ SELECT xml_is_well_formed('<>');
+  xml_is_well_formed
+ --------------------
+  f
+ (1 row)
+
+ SELECT xml_is_well_formed('abc');
+  xml_is_well_formed
+ --------------------
+  t
+ (1 row)
+
+ SELECT xml_is_well_formed('<abc/>');
+  xml_is_well_formed
+ --------------------
+  t
+ (1 row)
+
+ SELECT xml_is_well_formed('<foo>bar</foo>');
+  xml_is_well_formed
+ --------------------
+  t
+ (1 row)
+
+ SELECT xml_is_well_formed('<foo>bar</foo');
+  xml_is_well_formed
+ --------------------
+  f
+ (1 row)
+
+ SELECT xml_is_well_formed('<foo><bar>baz</foo>');
+  xml_is_well_formed
+ --------------------
+  f
+ (1 row)
+
+ SELECT xml_is_well_formed('<local:data xmlns:local="http://127.0.0.1"><local:piece id="1">number one</local:piece><local:piece id="2" /></local:data>');
+  xml_is_well_formed
+ --------------------
+  t
+ (1 row)
+
+ SELECT xml_is_well_formed('<foo>bar</foo>') AND '<foo>bar</foo>' IS DOCUMENT;
+  ?column?
+ ----------
+  t
+ (1 row)
+
+ SELECT xml_is_well_formed('<foo>bar</foo>baz') AND '<foo>bar</foo>baz' IS NOT DOCUMENT;
+  ?column?
+ ----------
+  t
+ (1 row)
+
+ SELECT xml_is_well_formed('<foo>bar</foo><bar>foo</bar>') AND '<foo>bar</foo><bar>foo</bar>' IS NOT DOCUMENT;
+  ?column?
+ ----------
+  t
+ (1 row)
+
*** a/src/test/regress/sql/xml.sql
--- b/src/test/regress/sql/xml.sql
***************
*** 163,165 **** SELECT xpath('', '<!-- error -->');
--- 163,179 ----
  SELECT xpath('//text()', '<local:data xmlns:local="http://127.0.0.1"><local:piece id="1">number one</local:piece><local:piece id="2" /></local:data>');
  SELECT xpath('//loc:piece/@id', '<local:data xmlns:local="http://127.0.0.1"><local:piece id="1">number one</local:piece><local:piece id="2" /></local:data>', ARRAY[ARRAY['loc', 'http://127.0.0.1']]);
  SELECT xpath('//b', '<a>one <b>two</b> three <b>etc</b></a>');
+
+ -- Test xml_is_well_formed
+
+ SELECT xml_is_well_formed('<>');
+ SELECT xml_is_well_formed('abc');
+ SELECT xml_is_well_formed('<abc/>');
+ SELECT xml_is_well_formed('<foo>bar</foo>');
+ SELECT xml_is_well_formed('<foo>bar</foo');
+ SELECT xml_is_well_formed('<foo><bar>baz</foo>');
+ SELECT xml_is_well_formed('<local:data xmlns:local="http://127.0.0.1"><local:piece id="1">number one</local:piece><local:piece id="2" /></local:data>');
+ SELECT xml_is_well_formed('<foo>bar</foo>') AND '<foo>bar</foo>' IS DOCUMENT;
+ SELECT xml_is_well_formed('<foo>bar</foo>baz') AND '<foo>bar</foo>baz' IS NOT DOCUMENT;
+ SELECT xml_is_well_formed('<foo>bar</foo><bar>foo</bar>') AND '<foo>bar</foo><bar>foo</bar>' IS NOT DOCUMENT;
+


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [PATCH] Re: Issue: Deprecation of the XML2 module 'xml_is_well_formed' function

Thom Brown
On 10 July 2010 14:12, Mike Fowler <[hidden email]> wrote:

> Robert Haas wrote:
>>
>> On Fri, Jul 9, 2010 at 4:06 PM, Peter Eisentraut <[hidden email]> wrote:
>>
>>>
>>> On ons, 2010-07-07 at 16:37 +0100, Mike Fowler wrote:
>>>
>>>>
>>>> Here's the patch to add the 'xml_is_well_formed' function.
>>>>
>>>
>>> I suppose we should remove the function from contrib/xml2 at the same
>>> time.
>>>
>>
>> Yep
>
> Revised patch deleting the contrib/xml2 version of the function attached.
>
> Regards,
>
> --
> Mike Fowler
> Registered Linux user: 379787
>

Would a test for mismatched or undefined namespaces be necessary?

For example:

Mismatched namespace:
<pg:foo xmlns:pg="http://postgresql.org/stuff">bar</my:foo>

Undefined namespace when used in conjunction with IS DOCUMENT:
<pg:foo xmlns:my="http://postgresql.org/stuff">bar</pg:foo>

Also, having a look at the following example from the patch:
SELECT xml_is_well_formed('<local:data
xmlns:local="http://127.0.0.1";><local:piece id="1">number
one</local:piece><local:piece id="2" /></local:data>');
 xml_is_well_formed
--------------------
 t
(1 row)

Just wondering about that semi-colon after the namespace definition.

Thom


Re: [PATCH] Re: Issue: Deprecation of the XML2 module 'xml_is_well_formed' function

Mike Fowler-3
Thom Brown wrote:

> Would a test for mismatched or undefined namespaces be necessary?
>
> For example:
>
> Mismatched namespace:
> <pg:foo xmlns:pg="http://postgresql.org/stuff">bar</my:foo>
>
> Undefined namespace when used in conjunction with IS DOCUMENT:
> <pg:foo xmlns:my="http://postgresql.org/stuff">bar</pg:foo>
>  
Thanks for looking at my patch, Thom. I hadn't thought of that particular
scenario, and even though I didn't specifically code for it, the
underlying libxml call does correctly reject the mismatched namespace:

template1=# SELECT xml_is_well_formed('<pg:foo xmlns:pg="http://postgresql.org/stuff">bar</my:foo>');
 xml_is_well_formed
--------------------
 f
(1 row)
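
A rough sketch, not libxml's actual code, of why this case needs no namespace-specific logic: in XML 1.0 an end tag must carry the same name as its start tag, prefix included, so pg:foo closed by my:foo fails a plain name comparison. The function name tags_balanced and the size limits are made up for illustration, and the sketch ignores comments, CDATA sections, processing instructions and entities.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

#define MAX_DEPTH 32   /* arbitrary limits for this sketch */
#define MAX_NAME  64

/*
 * Toy check (hypothetical helper, not part of the patch): do start and
 * end tags pair up by their full, prefixed name?
 */
static bool
tags_balanced(const char *s)
{
    char stack[MAX_DEPTH][MAX_NAME];
    int  depth = 0;

    while (*s)
    {
        char   name[MAX_NAME];
        size_t n = 0;
        bool   closing;
        bool   self_closing = false;
        char   quote = 0;

        if (*s != '<') { s++; continue; }   /* skip character data */
        s++;
        closing = (*s == '/');
        if (closing) s++;

        /* read the qualified name, e.g. "pg:foo" */
        while (*s && *s != '>' && *s != '/' && !isspace((unsigned char) *s))
        {
            if (n == MAX_NAME - 1) return false;   /* name too long for the sketch */
            name[n++] = *s++;
        }
        name[n] = '\0';
        if (n == 0) return false;                  /* "<>" or "</>" */

        /* skip attributes, honouring quoted values; note a trailing '/' */
        while (*s && (quote || *s != '>'))
        {
            if (quote) { if (*s == quote) quote = 0; }
            else if (*s == '"' || *s == '\'') quote = *s;
            self_closing = (!quote && *s == '/');
            s++;
        }
        if (*s != '>') return false;               /* tag never terminated */
        s++;

        if (closing)
        {
            /* the end tag must match the most recent open start tag exactly */
            if (depth == 0 || strcmp(stack[depth - 1], name) != 0)
                return false;
            depth--;
        }
        else if (!self_closing)
        {
            if (depth == MAX_DEPTH) return false;
            strcpy(stack[depth++], name);
        }
    }
    return depth == 0;
}
```

Called on the two strings used in this thread, tags_balanced() returns false for the pg:foo/my:foo pair and true when both tags read pg:foo. It says nothing about whether a prefix is actually declared, which is a separate namespace-level check.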



In the attached patch I've added the example to the SGML documentation
and the regression tests.

> Also, having a look at the following example from the patch:
> SELECT xml_is_well_formed('<local:data
> xmlns:local="http://127.0.0.1";><local:piece id="1">number
> one</local:piece><local:piece id="2" /></local:data>');
>  xml_is_well_formed
> --------------------
>  t
> (1 row)
>
> Just wondering about that semi-colon after the namespace definition.
>
> Thom
>  
The semi-colon is not supposed to be there, and I'm not sure where it has
come from. In Thunderbird I see the email with my patch as an
attachment; having downloaded and viewed the file, there are no instances
of a " followed by a ;. However, if I look at the message on the archive at
http://archives.postgresql.org/message-id/4C3871C2.8000605@...
I can see that every URL that ends with a " has a ; following it. Should I
be escaping the " in the patch file in some way, or is this just an artifact
of the archive's HTML rendering of the patch?

Regards,

--
Mike Fowler
Registered Linux user: 379787


*** a/contrib/xml2/xpath.c
--- b/contrib/xml2/xpath.c
***************
*** 27,33 **** PG_MODULE_MAGIC;
 
  /* externally accessible functions */
 
- Datum xml_is_well_formed(PG_FUNCTION_ARGS);
  Datum xml_encode_special_chars(PG_FUNCTION_ARGS);
  Datum xpath_nodeset(PG_FUNCTION_ARGS);
  Datum xpath_string(PG_FUNCTION_ARGS);
--- 27,32 ----
***************
*** 70,97 **** pgxml_parser_init(void)
  xmlLoadExtDtdDefaultValue = 1;
  }
 
-
- /* Returns true if document is well-formed */
-
- PG_FUNCTION_INFO_V1(xml_is_well_formed);
-
- Datum
- xml_is_well_formed(PG_FUNCTION_ARGS)
- {
- text   *t = PG_GETARG_TEXT_P(0); /* document buffer */
- int32 docsize = VARSIZE(t) - VARHDRSZ;
- xmlDocPtr doctree;
-
- pgxml_parser_init();
-
- doctree = xmlParseMemory((char *) VARDATA(t), docsize);
- if (doctree == NULL)
- PG_RETURN_BOOL(false); /* i.e. not well-formed */
- xmlFreeDoc(doctree);
- PG_RETURN_BOOL(true);
- }
-
-
  /* Encodes special characters (<, >, &, " and \r) as XML entities */
 
  PG_FUNCTION_INFO_V1(xml_encode_special_chars);
--- 69,74 ----
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
***************
*** 8554,8562 **** SELECT xmlagg(x) FROM (SELECT * FROM test ORDER BY y DESC) AS tab;
  ]]></screen>
      </para>
     </sect3>
 
     <sect3>
!     <title>XML Predicates</title>
 
      <indexterm>
       <primary>IS DOCUMENT</primary>
--- 8554,8566 ----
  ]]></screen>
      </para>
     </sect3>
+   </sect2>
+
+   <sect2>
+    <title>XML Predicates</title>
 
     <sect3>
!     <title>IS DOCUMENT</title>
 
      <indexterm>
       <primary>IS DOCUMENT</primary>
***************
*** 8574,8579 **** SELECT xmlagg(x) FROM (SELECT * FROM test ORDER BY y DESC) AS tab;
--- 8578,8675 ----
       between documents and content fragments.
      </para>
     </sect3>
+
+    <sect3>
+     <title>xml_is_well_formed</title>
+
+     <indexterm>
+      <primary>xml_is_well_formed</primary>
+      <secondary>well formed</secondary>
+     </indexterm>
+
+ <synopsis>
+ <function>xml_is_well_formed</function>(<replaceable>text</replaceable>)
+ </synopsis>
+
+     <para>
+      The function <function>xml_is_well_formed</function> evaluates whether
+      the <replaceable>text</replaceable> is well formed XML content, returning
+      a boolean.
+     </para>
+     <para>
+     Example:
+ <screen><![CDATA[
+ SELECT xml_is_well_formed('<foo>bar</foo>');
+  xml_is_well_formed
+ --------------------
+  t
+ (1 row)
+
+ SELECT xml_is_well_formed('<foo>bar</foo');
+  xml_is_well_formed
+ --------------------
+  f
+ (1 row)
+
+ SELECT xml_is_well_formed('<foo><bar>stuff</foo>');
+  xml_is_well_formed
+ --------------------
+  f
+ (1 row)
+ ]]></screen>
+     </para>
+     <para>
+     In addition to the structure checks, the function ensures that namespaces are correctly matched.
+ <screen><![CDATA[
+ SELECT xml_is_well_formed('<pg:foo xmlns:pg="http://postgresql.org/stuff">bar</my:foo>');
+  xml_is_well_formed
+ --------------------
+  f
+ (1 row)
+
+ SELECT xml_is_well_formed('<pg:foo xmlns:pg="http://postgresql.org/stuff">bar</pg:foo>');
+  xml_is_well_formed
+ --------------------
+  t
+ (1 row)
+ ]]></screen>
+     </para>
+     <para>
+     This function can be combined with the IS DOCUMENT predicate to prevent
+     invalid XML content errors from occurring in queries. For example, given a
+     table that may have rows with invalid XML mixed in with rows of valid
+     XML, <function>xml_is_well_formed</function> can be used to filter out all
+     the invalid rows.
+     </para>
+     <para>
+     Example:
+ <screen><![CDATA[
+ SELECT * FROM mixed;
+              data
+ ------------------------------
+  <foo>bar</foo>
+  <foo>bar</foo
+  <foo>bar</foo><bar>foo</bar>
+  <foo>bar</foo><bar>foo</bar
+ (4 rows)
+
+ SELECT COUNT(data) FROM mixed WHERE data::xml IS DOCUMENT;
+ ERROR:  invalid XML content
+ DETAIL:  Entity: line 1: parser error : expected '>'
+ <foo>bar</foo
+              ^
+ Entity: line 1: parser error : chunk is not well balanced
+ <foo>bar</foo
+              ^
+
+ SELECT COUNT(data) FROM mixed WHERE xml_is_well_formed(data) AND data::xml IS DOCUMENT;
+  count
+ -------
+      1
+ (1 row)
+ ]]></screen>
+     </para>
+    </sect3>
    </sect2>
 
    <sect2 id="functions-xml-processing">
*** a/src/backend/utils/adt/xml.c
--- b/src/backend/utils/adt/xml.c
***************
*** 3293,3298 **** xml_xmlnodetoxmltype(xmlNodePtr cur)
--- 3293,3365 ----
  }
  #endif
 
+ Datum
+ xml_is_well_formed(PG_FUNCTION_ARGS)
+ {
+ #ifdef USE_LIBXML
+ text *data = PG_GETARG_TEXT_P(0);
+ bool result;
+ int res_code;
+ int32 len;
+ const xmlChar *string;
+ xmlParserCtxtPtr ctxt;
+ xmlDocPtr doc = NULL;
+
+ len = VARSIZE(data) - VARHDRSZ;
+ string = xml_text2xmlChar(data);
+
+ /* Start up libxml and its parser (no-ops if already done) */
+ pg_xml_init();
+ xmlInitParser();
+
+ ctxt = xmlNewParserCtxt();
+ if (ctxt == NULL)
+ xml_ereport(ERROR, ERRCODE_OUT_OF_MEMORY,
+ "could not allocate parser context");
+
+ PG_TRY();
+ {
+ size_t count;
+ xmlChar    *version = NULL;
+ int standalone = -1;
+
+ res_code = parse_xml_decl(string, &count, &version, NULL, &standalone);
+ if (res_code != 0)
+ xml_ereport_by_code(ERROR, ERRCODE_INVALID_XML_CONTENT,
+  "invalid XML content: invalid XML declaration",
+ res_code);
+
+ doc = xmlNewDoc(version);
+ doc->encoding = xmlStrdup((const xmlChar *) "UTF-8");
+ doc->standalone = 1;
+
+ res_code = xmlParseBalancedChunkMemory(doc, NULL, NULL, 0, string + count, NULL);
+
+ result = !res_code;
+ }
+ PG_CATCH();
+ {
+ if (doc)
+ xmlFreeDoc(doc);
+ if (ctxt)
+ xmlFreeParserCtxt(ctxt);
+
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+
+ if (doc)
+ xmlFreeDoc(doc);
+ if (ctxt)
+ xmlFreeParserCtxt(ctxt);
+
+ PG_RETURN_BOOL(result);
+ #else
+ NO_XML_SUPPORT();
+ return 0;
+ #endif
+ }
+
 
  /*
   * Evaluate XPath expression and return array of XML values.
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
***************
*** 4385,4390 **** DESCR("evaluate XPath expression, with namespaces support");
--- 4385,4393 ----
  DATA(insert OID = 2932 (  xpath PGNSP PGUID 14 1 0 0 f f f t f i 2 0 143 "25 142" _null_ _null_ _null_ _null_ "select pg_catalog.xpath($1, $2, ''{}''::pg_catalog.text[])" _null_ _null_ _null_ ));
  DESCR("evaluate XPath expression");
 
+ DATA(insert OID = 3037 (  xml_is_well_formed PGNSP PGUID 12 1 0 0 f f f t f i 1 0 16 "25" _null_ _null_ _null_ _null_ xml_is_well_formed _null_ _null_ _null_ ));
+ DESCR("determine if a text fragment is well formed XML");
+
  /* uuid */
  DATA(insert OID = 2952 (  uuid_in   PGNSP PGUID 12 1 0 0 f f f t f i 1 0 2950 "2275" _null_ _null_ _null_ _null_ uuid_in _null_ _null_ _null_ ));
  DESCR("I/O");
*** a/src/include/utils/xml.h
--- b/src/include/utils/xml.h
***************
*** 46,51 **** extern Datum query_to_xmlschema(PG_FUNCTION_ARGS);
--- 46,52 ----
  extern Datum cursor_to_xmlschema(PG_FUNCTION_ARGS);
  extern Datum table_to_xml_and_xmlschema(PG_FUNCTION_ARGS);
  extern Datum query_to_xml_and_xmlschema(PG_FUNCTION_ARGS);
+ extern Datum xml_is_well_formed(PG_FUNCTION_ARGS);
 
  extern Datum schema_to_xml(PG_FUNCTION_ARGS);
  extern Datum schema_to_xmlschema(PG_FUNCTION_ARGS);
*** a/src/test/regress/expected/xml.out
--- b/src/test/regress/expected/xml.out
***************
*** 502,504 **** SELECT xpath('//b', '<a>one <b>two</b> three <b>etc</b></a>');
--- 502,577 ----
   {<b>two</b>,<b>etc</b>}
  (1 row)
 
+ -- Test xml_is_well_formed
+ SELECT xml_is_well_formed('<>');
+  xml_is_well_formed
+ --------------------
+  f
+ (1 row)
+
+ SELECT xml_is_well_formed('abc');
+  xml_is_well_formed
+ --------------------
+  t
+ (1 row)
+
+ SELECT xml_is_well_formed('<abc/>');
+  xml_is_well_formed
+ --------------------
+  t
+ (1 row)
+
+ SELECT xml_is_well_formed('<foo>bar</foo>');
+  xml_is_well_formed
+ --------------------
+  t
+ (1 row)
+
+ SELECT xml_is_well_formed('<foo>bar</foo');
+  xml_is_well_formed
+ --------------------
+  f
+ (1 row)
+
+ SELECT xml_is_well_formed('<foo><bar>baz</foo>');
+  xml_is_well_formed
+ --------------------
+  f
+ (1 row)
+
+ SELECT xml_is_well_formed('<local:data xmlns:local="http://127.0.0.1"><local:piece id="1">number one</local:piece><local:piece id="2" /></local:data>');
+  xml_is_well_formed
+ --------------------
+  t
+ (1 row)
+
+ SELECT xml_is_well_formed('<foo>bar</foo>') AND '<foo>bar</foo>' IS DOCUMENT;
+  ?column?
+ ----------
+  t
+ (1 row)
+
+ SELECT xml_is_well_formed('<foo>bar</foo>baz') AND '<foo>bar</foo>baz' IS NOT DOCUMENT;
+  ?column?
+ ----------
+  t
+ (1 row)
+
+ SELECT xml_is_well_formed('<foo>bar</foo><bar>foo</bar>') AND '<foo>bar</foo><bar>foo</bar>' IS NOT DOCUMENT;
+  ?column?
+ ----------
+  t
+ (1 row)
+
+ SELECT xml_is_well_formed('<pg:foo xmlns:pg="http://postgresql.org/stuff">bar</my:foo>');
+  xml_is_well_formed
+ --------------------
+  f
+ (1 row)
+
+ SELECT xml_is_well_formed('<pg:foo xmlns:pg="http://postgresql.org/stuff">bar</pg:foo>');
+  xml_is_well_formed
+ --------------------
+  t
+ (1 row)
+
*** a/src/test/regress/sql/xml.sql
--- b/src/test/regress/sql/xml.sql
***************
*** 163,165 **** SELECT xpath('', '<!-- error -->');
--- 163,180 ----
  SELECT xpath('//text()', '<local:data xmlns:local="http://127.0.0.1"><local:piece id="1">number one</local:piece><local:piece id="2" /></local:data>');
  SELECT xpath('//loc:piece/@id', '<local:data xmlns:local="http://127.0.0.1"><local:piece id="1">number one</local:piece><local:piece id="2" /></local:data>', ARRAY[ARRAY['loc', 'http://127.0.0.1']]);
  SELECT xpath('//b', '<a>one <b>two</b> three <b>etc</b></a>');
+
+ -- Test xml_is_well_formed
+
+ SELECT xml_is_well_formed('<>');
+ SELECT xml_is_well_formed('abc');
+ SELECT xml_is_well_formed('<abc/>');
+ SELECT xml_is_well_formed('<foo>bar</foo>');
+ SELECT xml_is_well_formed('<foo>bar</foo');
+ SELECT xml_is_well_formed('<foo><bar>baz</foo>');
+ SELECT xml_is_well_formed('<local:data xmlns:local="http://127.0.0.1"><local:piece id="1">number one</local:piece><local:piece id="2" /></local:data>');
+ SELECT xml_is_well_formed('<foo>bar</foo>') AND '<foo>bar</foo>' IS DOCUMENT;
+ SELECT xml_is_well_formed('<foo>bar</foo>baz') AND '<foo>bar</foo>baz' IS NOT DOCUMENT;
+ SELECT xml_is_well_formed('<foo>bar</foo><bar>foo</bar>') AND '<foo>bar</foo><bar>foo</bar>' IS NOT DOCUMENT;
+ SELECT xml_is_well_formed('<pg:foo xmlns:pg="http://postgresql.org/stuff">bar</my:foo>');
+ SELECT xml_is_well_formed('<pg:foo xmlns:pg="http://postgresql.org/stuff">bar</pg:foo>');



Re: [PATCH] Re: Issue: Deprecation of the XML2 module 'xml_is_well_formed' function

Thom Brown
On 12 July 2010 13:07, Mike Fowler <[hidden email]> wrote:

> Thom Brown wrote:
>>
>> Just wondering about that semi-colon after the namespace definition.
>>
>> Thom
>>
>
> The semi-colon is not supposed to be there, and I'm not sure where it has
> come from. In Thunderbird I see the email with my patch as an attachment;
> having downloaded and viewed the file, there are no instances of a "
> followed by a ;. However, if I look at the message on the archive at
> http://archives.postgresql.org/message-id/4C3871C2.8000605@... I
> can see that every URL that ends with a " has a ; following it. Should I be
> escaping the " in the patch file in some way, or is this just an artifact of
> the archive's HTML rendering of the patch?

Yeah, I guess it's a parsing issue related to the archive viewer.  I
arrived there from the commitfest page and should have really looked
directly at the patch.  No problem there then I guess.

Thanks for the work you've done on this. :)

Thom
