BUG #15273: Lexer bug with UESCAPE

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

BUG #15273: Lexer bug with UESCAPE

PG Bug reporting form
The following bug has been logged on the website:

Bug reference:      15273
Logged by:          Yaroslav Schekin
Email address:      [hidden email]
PostgreSQL version: 10.4
Operating system:   Any
Description:        

Hello.

Compare this:
> SELECT U&'a' UESCAPE 'x';
----------
 a
(1 row)
To this:
> SELECT U&'a' /*c1*/ UESCAPE /*c2*/ 'x';

ERROR:  syntax error at or near "'x'"
LINE 1: SELECT U&'a' /*c1*/ UESCAPE /*c2*/ 'x';
                                           ^
I think the former is a bug, as, per ISO SQL, a comment is equivalent to
whitespace (with newline), and therefore, should be ignored here.

(Thanks a lot to RhodiumToad who not only initially found and documented
https://wiki.postgresql.org/wiki/PostgreSQL_vs_SQL_Standard#Lexing_of_split_string_literals,
but also discussed it with me on IRC and conducted the investigation of
relevant SQL standards.)

--
WBR, Yaroslav Schekin.

Reply | Threaded
Open this post in threaded view
|

Re: BUG #15273: Lexer bug with UESCAPE

Tom Lane-2
=?utf-8?q?PG_Bug_reporting_form?= <[hidden email]> writes:
> SELECT U&'a' /*c1*/ UESCAPE /*c2*/ 'x';
> ERROR:  syntax error at or near "'x'"
> LINE 1: SELECT U&'a' /*c1*/ UESCAPE /*c2*/ 'x';

> I think the former is a bug, as, per ISO SQL, a comment is equivalent to
> whitespace (with newline), and therefore, should be ignored here.

I'd classify this as "won't fix".  It'd require pretty significant bloat
in the lexer rules to make it happen, and it doesn't really seem worth it.

Also, I'm going to push back on the claim that allowing comments there
is required by the SQL spec.  The relevant rules in SQL:2011 are

<Unicode character string literal> ::=
  [ <introducer> <character set specification> ]
      U <ampersand> <quote> [ <Unicode representation>... ] <quote>
      [ { <separator> <quote> [ <Unicode representation>... ] <quote> }... ]
      <Unicode escape specifier>

<Unicode escape specifier> ::=
  [ UESCAPE <quote> <Unicode escape character> <quote> ]

I do not see any principled way of arguing that these rules require
comments to be allowed adjacent to UESCAPE without also claiming
that they must be allowed between, say, the initial 'U' and the
ampersand.  The only place these rules allow a <separator> is
between segments of a multiline literal.  It looks to me like an
extension that we even allow whitespace around UESCAPE.

                        regards, tom lane

Reply | Threaded
Open this post in threaded view
|

Re: BUG #15273: Lexer bug with UESCAPE

Andrew Gierth
>>>>> "Tom" == Tom Lane <[hidden email]> writes:

 Tom> Also, I'm going to push back on the claim that allowing comments
 Tom> there is required by the SQL spec. The relevant rules in SQL:2011
 Tom> are

 Tom> <Unicode character string literal> ::=
 Tom>   [ <introducer> <character set specification> ]
 Tom>       U <ampersand> <quote> [ <Unicode representation>... ] <quote>
 Tom>       [ { <separator> <quote> [ <Unicode representation>... ] <quote> }... ]
 Tom>       <Unicode escape specifier>

 Tom> <Unicode escape specifier> ::=
 Tom>   [ UESCAPE <quote> <Unicode escape character> <quote> ]

 Tom> I do not see any principled way of arguing that these rules
 Tom> require comments to be allowed adjacent to UESCAPE without also
 Tom> claiming that they must be allowed between, say, the initial 'U'
 Tom> and the ampersand.

These are the rules that (as far as I can see) apply to that case:

5.2 <token> and <separator>

<separator> ::=
  { <comment> | <white space> }...

  7) Any <token> may be followed by a <separator>.

5.3 <literal>

  11) In a <Unicode character string literal>, there shall be no
      <separator> between the "U" and the <ampersand> nor between the
      <ampersand> and the <quote>.

 Tom> The only place these rules allow a <separator> is between segments
 Tom> of a multiline literal. It looks to me like an extension that we
 Tom> even allow whitespace around UESCAPE.

I think that that use of <separator> is only to indicate that a
<separator> there is _required_, rather than optional as it usually is
after tokens, and that the special rule about requiring newlines also
applies only to that specific use of <separator>.

If the whole <Unicode character string literal> is regarded as being a
single token, and therefore rule 5.2.7 above didn't apply around the
UESCAPE, then there would be no reason to write rule 5.3.11 forbidding
separators within the U&' part.

(In the case of X'...', there's rule 5.2.5, which as I see it would
prevent a space after the X, but that rule explicitly does not apply to
the U& cases.)

As a related issue, we don't allow comments within the <separator> that
splits a multiline literal, even though the spec certainly allows those
(arguably, since the spec defines that comments are equivalent to
newlines, "select 'foo' /**/ 'bar';" should be legal too).

I've put up a summary of all these at
https://wiki.postgresql.org/wiki/PostgreSQL_vs_SQL_Standard#Lexing_of_string_literals_and_comments

(under the assumption that the whole issue is filed under WONTFIX at
least for the time being)

--
Andrew (irc:RhodiumToad)

Reply | Threaded
Open this post in threaded view
|

Re: BUG #15273: Lexer bug with UESCAPE

Tom Lane-2
Andrew Gierth <[hidden email]> writes:
> "Tom" == Tom Lane <[hidden email]> writes:
>  Tom> Also, I'm going to push back on the claim that allowing comments
>  Tom> there is required by the SQL spec.

> These are the rules that (as far as I can see) apply to that case:
> 5.2 <token> and <separator>
>   7) Any <token> may be followed by a <separator>.

Right, but that only gets the result you claim if you suppose that a
<Unicode character string literal> consists of more than one <token>.
I don't think so, because the start of the section states

<token> ::=
    <nondelimiter token>
  | <delimiter token>

<nondelimiter token> ::=
    <regular identifier>
  | <key word>
  | <unsigned numeric literal>
  | <national character string literal>
  | <binary string literal>
  | <large object length token>
  | <Unicode delimited identifier>
  | <Unicode character string literal>
  | <SQL language identifier>

and then reading down, we have

<Unicode character string literal> ::=
  [ <introducer> <character set specification> ]
      U <ampersand> <quote> [ <Unicode representation>... ] <quote>
      [ { <separator> <quote> [ <Unicode representation>... ] <quote> }... ]
      <Unicode escape specifier>

There isn't anything here that equates <token> with any sub-part
of a <Unicode character string literal>.  Unless you want to argue
that <ampersand> and <quote> can be derived from <delimiter token>,
but if you take that path you're left to explain why whitespace
at the start of the literal contents is data and not a <separator>.

Of course, there's certainly an argument to be made that the intent
is that the U&'...' part be one token and then UESCAPE and the escape
string are two more, but the SQL committee just can't specify their
way out of a paper bag.  We knew that already.

Anyway, as I said before, I can't see that we would want to fix this
by extending the existing implementation --- you'd need a bunch more
exclusive lexer states which would be a pain in the rear, and possibly
a performance problem too.  I do wonder though why Peter did it like that.
You could imagine returning three tokens to the grammar and letting the
grammar merge them, which'd make the lexer aspect of this far simpler
and perhaps not complicate the grammar too much.

Another thing I noticed about the existing implementation is that it's
very unfriendly if you write an invalid escape specifier:

postgres=# select U&'foo' uescape 'bar';
ERROR:  syntax error at or near "'bar'"
LINE 1: select U&'foo' uescape 'bar';
                               ^

It'd be much nicer to say something along the line of "Unicode escape
specifier must be a single character", but shoehorning that into the
lexer-based implementation would be a giant pain.

I'm not excited enough about any of this to spend more time on it,
but maybe somebody else is.

                        regards, tom lane

Reply | Threaded
Open this post in threaded view
|

Re: BUG #15273: Lexer bug with UESCAPE

Tom Lane-2
Oh, and one other thing:

regression=# select 'foo'
regression-# 'bar';
 ?column?
----------
 foobar
(1 row)

regression=# select 'foo' -- baz
'bar';
 ?column?
----------
 foobar
(1 row)

regression=# select 'foo' /* baz */
'bar';
ERROR:  syntax error at or near "'bar'"
LINE 2: 'bar';
        ^

So we already have the exact same restriction in all multi-line forms
of literal.  That's far older than the UESCAPE business, and nobody's
complained about it that I can recall.

                        regards, tom lane

Reply | Threaded
Open this post in threaded view
|

Re: BUG #15273: Lexer bug with UESCAPE

Bruce Momjian
On Wed, Jul 11, 2018 at 05:15:15PM -0400, Tom Lane wrote:

> Oh, and one other thing:
>
> regression=# select 'foo'
> regression-# 'bar';
>  ?column?
> ----------
>  foobar
> (1 row)
>
> regression=# select 'foo' -- baz
> 'bar';
>  ?column?
> ----------
>  foobar
> (1 row)
>
> regression=# select 'foo' /* baz */
> 'bar';
> ERROR:  syntax error at or near "'bar'"
> LINE 2: 'bar';
>         ^
>
> So we already have the exact same restriction in all multi-line forms
> of literal.  That's far older than the UESCAPE business, and nobody's
> complained about it that I can recall.

Do we have any comment in the lexer C files about this so we don't
revisit the issue?

--
  Bruce Momjian  <[hidden email]>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +