Strange output of XML attribute values

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Strange output of XML attribute values

Andrew Marynchuk (Андрей Маринчук)
This problem is quite old, but it leads to the inability to use XML generation functions in PostgreSQL database for some cases, or at least requires to perform subsequent parsing and regenerating XML by an external utility. It reproduces in PostgreSQL 12.4, compiled by Visual C++ build 1914, 64-bit (windows 10), but I've seen the same problem in 9.6 build from CentOS yum package.

How to reproduce:
Just execute the query (actually the xmlelement call is enough to reproduce the proble):
select xmlserialize(document xmlroot(xmlelement(name "ЭлементВКириллице", xmlattributes('ЗначениеВКириллице' as "АтрибутВКириллице"), 'ТекстВКириллице'), version '1.0', standalone yes) as text);

Expected result:
<?xml version="1.0" standalone="yes"?><ЭлементВКириллице АтрибутВКириллице="ЗначениеВКириллице">ТекстВКириллице</ЭлементВКириллице>

Actual result:
<?xml version="1.0" standalone="yes"?><ЭлементВКириллице АтрибутВКириллице="&#x417;&#x43D;&#x430;&#x447;&#x435;&#x43D;&#x438;&#x435;&#x412;&#x41A;&#x438;&#x440;&#x438;&#x43B;&#x43B;&#x438;&#x446;&#x435;">ТекстВКириллице</ЭлементВКириллице>

This example uses cyrillic letters, but it could be any non-ASCII character.
According to the discussion, this problem arises because PostgreSQL does not provides libxml2 an information of document encoding due to the lack of xmlTextWriterStartDocument call, so libxml2 has no idea that encoding is UTF-8 and non-ASCII characters could be written without converting to &#x...;-sequences.

In the modern world, UTF-8 encoding is used everywhere and such unnecessary character converting looks strange. Current workaround is passing generated content to the pl/python function which parses and writes back the xml (xml.dom.minidom.parseString(...).toxml()).
Reply | Threaded
Open this post in threaded view
|

Re: Strange output of XML attribute values

Pavel Stehule


st 16. 9. 2020 v 14:11 odesílatel Andrew Marynchuk (Андрей Маринчук) <[hidden email]> napsal:
This problem is quite old, but it leads to the inability to use XML generation functions in PostgreSQL database for some cases, or at least requires to perform subsequent parsing and regenerating XML by an external utility. It reproduces in PostgreSQL 12.4, compiled by Visual C++ build 1914, 64-bit (windows 10), but I've seen the same problem in 9.6 build from CentOS yum package.

How to reproduce:
Just execute the query (actually the xmlelement call is enough to reproduce the proble):
select xmlserialize(document xmlroot(xmlelement(name "ЭлементВКириллице", xmlattributes('ЗначениеВКириллице' as "АтрибутВКириллице"), 'ТекстВКириллице'), version '1.0', standalone yes) as text);

Expected result:
<?xml version="1.0" standalone="yes"?><ЭлементВКириллице АтрибутВКириллице="ЗначениеВКириллице">ТекстВКириллице</ЭлементВКириллице>

Actual result:
<?xml version="1.0" standalone="yes"?><ЭлементВКириллице АтрибутВКириллице="&#x417;&#x43D;&#x430;&#x447;&#x435;&#x43D;&#x438;&#x435;&#x412;&#x41A;&#x438;&#x440;&#x438;&#x43B;&#x43B;&#x438;&#x446;&#x435;">ТекстВКириллице</ЭлементВКириллице>

This example uses cyrillic letters, but it could be any non-ASCII character.
According to the discussion, this problem arises because PostgreSQL does not provides libxml2 an information of document encoding due to the lack of xmlTextWriterStartDocument call, so libxml2 has no idea that encoding is UTF-8 and non-ASCII characters could be written without converting to &#x...;-sequences.

I don't think it is true. The url encoding is done only in attributes, and only when the output encoding will be utf8. When you try to use 8bit encoding with Azbuka support, it will be ok.
 

In the modern world, UTF-8 encoding is used everywhere and such unnecessary character converting looks strange. Current workaround is passing generated content to the pl/python function which parses and writes back the xml (xml.dom.minidom.parseString(...).toxml()).

If I remember some discussion about this topic, the problem is XML standard, that requires url encoding in attribute values.

I reported this issue 10 (maybe 15) years ago to libxml2 developers, and It was rejected. Maybe libxml2 supports too old XML standards. I don't know - this library is years in frozen state, but there is no replacement.

It is a libxml2 problem - , and there it should be reported and fixed. It is not possible to fix this issue on Postgres' side.

Regards

Pavel

Reply | Threaded
Open this post in threaded view
|

Re: Strange output of XML attribute values

Pavel Stehule


st 16. 9. 2020 v 14:50 odesílatel Pavel Stehule <[hidden email]> napsal:


st 16. 9. 2020 v 14:11 odesílatel Andrew Marynchuk (Андрей Маринчук) <[hidden email]> napsal:
This problem is quite old, but it leads to the inability to use XML generation functions in PostgreSQL database for some cases, or at least requires to perform subsequent parsing and regenerating XML by an external utility. It reproduces in PostgreSQL 12.4, compiled by Visual C++ build 1914, 64-bit (windows 10), but I've seen the same problem in 9.6 build from CentOS yum package.

How to reproduce:
Just execute the query (actually the xmlelement call is enough to reproduce the proble):
select xmlserialize(document xmlroot(xmlelement(name "ЭлементВКириллице", xmlattributes('ЗначениеВКириллице' as "АтрибутВКириллице"), 'ТекстВКириллице'), version '1.0', standalone yes) as text);

Expected result:
<?xml version="1.0" standalone="yes"?><ЭлементВКириллице АтрибутВКириллице="ЗначениеВКириллице">ТекстВКириллице</ЭлементВКириллице>

Actual result:
<?xml version="1.0" standalone="yes"?><ЭлементВКириллице АтрибутВКириллице="&#x417;&#x43D;&#x430;&#x447;&#x435;&#x43D;&#x438;&#x435;&#x412;&#x41A;&#x438;&#x440;&#x438;&#x43B;&#x43B;&#x438;&#x446;&#x435;">ТекстВКириллице</ЭлементВКириллице>

This example uses cyrillic letters, but it could be any non-ASCII character.
According to the discussion, this problem arises because PostgreSQL does not provides libxml2 an information of document encoding due to the lack of xmlTextWriterStartDocument call, so libxml2 has no idea that encoding is UTF-8 and non-ASCII characters could be written without converting to &#x...;-sequences.

I don't think it is true. The url encoding is done only in attributes, and only when the output encoding will be utf8. When you try to use 8bit encoding with Azbuka support, it will be ok.
 

In the modern world, UTF-8 encoding is used everywhere and such unnecessary character converting looks strange. Current workaround is passing generated content to the pl/python function which parses and writes back the xml (xml.dom.minidom.parseString(...).toxml()).

If I remember some discussion about this topic, the problem is XML standard, that requires url encoding in attribute values.

I reported this issue 10 (maybe 15) years ago to libxml2 developers, and It was rejected. Maybe libxml2 supports too old XML standards. I don't know - this library is years in frozen state, but there is no replacement.

It is a libxml2 problem - , and there it should be reported and fixed. It is not possible to fix this issue on Postgres' side.

 xmlTextWriterWriteAttribute does this unwanted encoding, xmlTextWriterWriteRaw doesn't do this

postgres=# select xmlelement(name "aho", xmlattributes('žlutý kůň' as x), 'žlutý kǔň');
NOTICE:  >>>>>žlutý kůň
NOTICE:  **** >>>>>žlutý kǔň
┌───────────────────────────────────────────────────────────┐
│                        xmlelement                         │
╞═══════════════════════════════════════════════════════════╡
│ <aho x="&#x17E;lut&#xFD; k&#x16F;&#x148;">žlutý kǔň</aho> │
└───────────────────────────────────────────────────────────┘
(1 row)

So somewhere there will be necessary information for understanding this issue.

Regards

Pavel

Regards

Pavel

Reply | Threaded
Open this post in threaded view
|

Re: Strange output of XML attribute values

Andrew Marynchuk (Андрей Маринчук)
In reply to this post by Pavel Stehule


ср, 16 сент. 2020 г. в 15:51, Pavel Stehule <[hidden email]>:


st 16. 9. 2020 v 14:11 odesílatel Andrew Marynchuk (Андрей Маринчук) <[hidden email]> napsal:
This problem is quite old, but it leads to the inability to use XML generation functions in PostgreSQL database for some cases, or at least requires to perform subsequent parsing and regenerating XML by an external utility. It reproduces in PostgreSQL 12.4, compiled by Visual C++ build 1914, 64-bit (windows 10), but I've seen the same problem in 9.6 build from CentOS yum package.

How to reproduce:
Just execute the query (actually the xmlelement call is enough to reproduce the proble):
select xmlserialize(document xmlroot(xmlelement(name "ЭлементВКириллице", xmlattributes('ЗначениеВКириллице' as "АтрибутВКириллице"), 'ТекстВКириллице'), version '1.0', standalone yes) as text);

Expected result:
<?xml version="1.0" standalone="yes"?><ЭлементВКириллице АтрибутВКириллице="ЗначениеВКириллице">ТекстВКириллице</ЭлементВКириллице>

Actual result:
<?xml version="1.0" standalone="yes"?><ЭлементВКириллице АтрибутВКириллице="&#x417;&#x43D;&#x430;&#x447;&#x435;&#x43D;&#x438;&#x435;&#x412;&#x41A;&#x438;&#x440;&#x438;&#x43B;&#x43B;&#x438;&#x446;&#x435;">ТекстВКириллице</ЭлементВКириллице>

This example uses cyrillic letters, but it could be any non-ASCII character.
According to the discussion, this problem arises because PostgreSQL does not provides libxml2 an information of document encoding due to the lack of xmlTextWriterStartDocument call, so libxml2 has no idea that encoding is UTF-8 and non-ASCII characters could be written without converting to &#x...;-sequences.

I don't think it is true. The url encoding is done only in attributes, and only when the output encoding will be utf8. When you try to use 8bit encoding with Azbuka support, it will be ok.

I've made some investigation on libxml2 sources.
Seems like urlencoding is done in xmlBufAttrSerializeTxtContent function from xmlsave.c after the condition is met. The key part is  (doc == NULL) || (doc->encoding == NULL).
The xmlTextWriterWriteString is the only function calls xmlBufAttrSerializeTxtContent in xmlwriter.c and only for XML_TEXTWRITER_ATTRIBUTE state, so urlencoding is really done only for attributes, but it could be turned off by providing the xmlwriter with encoding.
But there is no api function except xmlTextWriterStartDocument to modify the (xmlTextWriterPtr argument) -> doc -> encoding field.

In other words, the libxml2 library does not know the codepage and assumes it as ASCII, so it turns on the urlencoding of all attributes. I don't have an idea why all other textual document parts are passed as is.Seems like PostgreSQL never specifies the codepage (I haven't seen a encoding declaration in document prolog for documents generated in PostgreSQL). It's reasonable because charset is not defined for varchar type and xml converted to varchar is some abstract text which could be converted to any codepage.


 

In the modern world, UTF-8 encoding is used everywhere and such unnecessary character converting looks strange. Current workaround is passing generated content to the pl/python function which parses and writes back the xml (xml.dom.minidom.parseString(...).toxml()).

If I remember some discussion about this topic, the problem is XML standard, that requires url encoding in attribute values.

I reported this issue 10 (maybe 15) years ago to libxml2 developers, and It was rejected. Maybe libxml2 supports too old XML standards. I don't know - this library is years in frozen state, but there is no replacement.

I traced the description of attributes from the current specification through previous versions till the draft version published in 1996, but didn't find such a requirement. Attribute value defined as '"' ([^<&"] | Reference)* '"' |  "'" ([^<&'] | Reference)* "'", so only three characters are disallowed in attributes.




Reply | Threaded
Open this post in threaded view
|

Re: Strange output of XML attribute values

Pavel Stehule


st 16. 9. 2020 v 22:23 odesílatel Andrew Marynchuk (Андрей Маринчук) <[hidden email]> napsal:


ср, 16 сент. 2020 г. в 15:51, Pavel Stehule <[hidden email]>:


st 16. 9. 2020 v 14:11 odesílatel Andrew Marynchuk (Андрей Маринчук) <[hidden email]> napsal:
This problem is quite old, but it leads to the inability to use XML generation functions in PostgreSQL database for some cases, or at least requires to perform subsequent parsing and regenerating XML by an external utility. It reproduces in PostgreSQL 12.4, compiled by Visual C++ build 1914, 64-bit (windows 10), but I've seen the same problem in 9.6 build from CentOS yum package.

How to reproduce:
Just execute the query (actually the xmlelement call is enough to reproduce the proble):
select xmlserialize(document xmlroot(xmlelement(name "ЭлементВКириллице", xmlattributes('ЗначениеВКириллице' as "АтрибутВКириллице"), 'ТекстВКириллице'), version '1.0', standalone yes) as text);

Expected result:
<?xml version="1.0" standalone="yes"?><ЭлементВКириллице АтрибутВКириллице="ЗначениеВКириллице">ТекстВКириллице</ЭлементВКириллице>

Actual result:
<?xml version="1.0" standalone="yes"?><ЭлементВКириллице АтрибутВКириллице="&#x417;&#x43D;&#x430;&#x447;&#x435;&#x43D;&#x438;&#x435;&#x412;&#x41A;&#x438;&#x440;&#x438;&#x43B;&#x43B;&#x438;&#x446;&#x435;">ТекстВКириллице</ЭлементВКириллице>

This example uses cyrillic letters, but it could be any non-ASCII character.
According to the discussion, this problem arises because PostgreSQL does not provides libxml2 an information of document encoding due to the lack of xmlTextWriterStartDocument call, so libxml2 has no idea that encoding is UTF-8 and non-ASCII characters could be written without converting to &#x...;-sequences.

I don't think it is true. The url encoding is done only in attributes, and only when the output encoding will be utf8. When you try to use 8bit encoding with Azbuka support, it will be ok.

I've made some investigation on libxml2 sources.
Seems like urlencoding is done in xmlBufAttrSerializeTxtContent function from xmlsave.c after the condition is met. The key part is  (doc == NULL) || (doc->encoding == NULL).
The xmlTextWriterWriteString is the only function calls xmlBufAttrSerializeTxtContent in xmlwriter.c and only for XML_TEXTWRITER_ATTRIBUTE state, so urlencoding is really done only for attributes, but it could be turned off by providing the xmlwriter with encoding.
But there is no api function except xmlTextWriterStartDocument to modify the (xmlTextWriterPtr argument) -> doc -> encoding field.

In other words, the libxml2 library does not know the codepage and assumes it as ASCII, so it turns on the urlencoding of all attributes. I don't have an idea why all other textual document parts are passed as is.Seems like PostgreSQL never specifies the codepage (I haven't seen a encoding declaration in document prolog for documents generated in PostgreSQL). It's reasonable because charset is not defined for varchar type and xml converted to varchar is some abstract text which could be converted to any codepage.

xml is usually dumped in client side encoding

<--><-->xmlTextWriterStartElement(writer, (xmlChar *) xexpr->name);

<--><-->forboth(arg, named_arg_strings, narg, xexpr->arg_names)
<--><-->{
<--><--><-->char<-->   *str = (char *) lfirst(arg);
<--><--><-->char<-->   *argname = strVal(lfirst(narg));

<--><--><-->if (str)
<--><--><--><-->xmlTextWriterWriteAttribute(writer,
<--><--><--><--><--><--><--><--><--><--><-->(xmlChar *) argname,
<--><--><--><--><--><--><--><--><--><--><-->(xmlChar *) str);
<--><-->}

<--><-->foreach(arg, arg_strings)
<--><-->{
<--><--><-->char<-->   *str = (char *) lfirst(arg);

<--><--><-->xmlTextWriterWriteRaw(writer, (xmlChar *) str);
<--><-->}

<--><-->xmlTextWriterEndElement(writer);

There is a code from xml.c. Is there some possibility how to specify encoding?




 

In the modern world, UTF-8 encoding is used everywhere and such unnecessary character converting looks strange. Current workaround is passing generated content to the pl/python function which parses and writes back the xml (xml.dom.minidom.parseString(...).toxml()).

If I remember some discussion about this topic, the problem is XML standard, that requires url encoding in attribute values.

I reported this issue 10 (maybe 15) years ago to libxml2 developers, and It was rejected. Maybe libxml2 supports too old XML standards. I don't know - this library is years in frozen state, but there is no replacement.

I traced the description of attributes from the current specification through previous versions till the draft version published in 1996, but didn't find such a requirement. Attribute value defined as '"' ([^<&"] | Reference)* '"' |  "'" ([^<&'] | Reference)* "'", so only three characters are disallowed in attributes.

I tried to find this topic, but without success. Maybe my memory doesn't serve well, but I think I solved a similar issue, and I had to generate a document explicitly with latin2 encoding to fix this issue.

But it is in agreement with your description - if we generate xml without specified encoding (we expect UTF8), then url encoding is active.

If you can play with it - all related codes for postgres side are in xml.c source code file.