SQL-spec incompatibilities in similar_escape() and related stuff

classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

SQL-spec incompatibilities in similar_escape() and related stuff

Tom Lane-2
Over in pgsql-bugs [1], we're looking into some bugs associated with
mistranslation of SQL-spec regexes to POSIX regexes.  However, while
poking at that I couldn't help noticing that there are more ways in
which we're not following the letter of the SQL spec in this area:

* As Andrew noted, somewhere between SQL99 and SQL:2008, the committee
just up and changed the syntax of <regular expression substring function>.
SQL99 has

        <regular expression substring function> ::=
              SUBSTRING <left paren> <character value expression>
                                     FROM <character value expression>
                                     FOR <escape character> <right paren>

but in recent versions it's

        <regular expression substring function> ::=
              SUBSTRING <left paren> <character value expression>
                                     SIMILAR <character value expression>
                                     ESCAPE <escape character> <right paren>

I am, frankly, inclined to ignore this as a bad idea.  We do have
SIMILAR and ESCAPE as keywords already, but they're type_func_name_keyword
and unreserved_keyword respectively.  To support this syntax, I'm pretty
sure we'd have to make them both fully reserved.  That seems likely to
break existing applications, and I don't think it's worth it.  But it's
probably something to discuss.

* Our function similar_escape() is not documented, but it underlies
three things in the grammar:

        a SIMILAR TO b
        Translated as "a ~ similar_escape(b, null)"

        a SIMILAR TO b ESCAPE e
        Translated as "a ~ similar_escape(b, e)"

        substring(a, b, e)
        This is a SQL function expanding to
        select pg_catalog.substring($1, pg_catalog.similar_escape($2, $3))

To support the first usage, similar_escape is non-strict, and it takes
a NULL second argument to mean '\'.  This is already a SQL spec violation,
because as far as I can tell from the spec, if you don't write an ESCAPE
clause then there is *no* escape character; there certainly is not a
default of '\'.  However, we document this behavior, so I don't know if
we want to change it.

This behavior also causes spec compatibility problems in the second
syntax, because "a SIMILAR TO b ESCAPE NULL" is treated as though
it were "ESCAPE '\'", which is again a spec violation: the result
should be null.

And, just to add icing on the cake, it causes performance problems
in the third syntax.  3-argument substring is labeled proisstrict,
which is correct behavior per spec (the result is NULL if any of
the three arguments are null).  But because similar_escape is not
strict, the planner fails to inline the SQL function, reasoning
(quite accurately) that doing so would change the behavior for
null inputs.  This costs us something like 4x performance compared
to the underlying 2-argument POSIX-regex substring() function.

I'm not sure what we want to do here, but we probably ought to do
something, because right now substring() and SIMILAR TO aren't even
in agreement between themselves let alone with the SQL spec.  We
could either move towards making all these constructs strict in
accordance with the spec (and possibly breaking some existing
applications), or we could make substring(a, b, e) not strict so
that it inherits similar_escape's idea of what to do for e = NULL.

* similar_escape considers a zero-length escape string to mean
"no escape character".  This is contrary to spec which clearly
says that a zero-length escape string is an error condition
(just as more-than-one-character is an error condition).  It's
also undocumented.  Should we tighten that up to conform to spec,
or document it as an extension?

* Per spec, escape-double-quote must appear exactly twice in
the second argument of substring(a, b, e), while it's not valid
in SIMILAR TO.  similar_escape() doesn't enforce this, and it
can't do so as long as we are using the same pattern conversion
function for both constructs.  However, we could do better than
we're doing:

* If there are zero occurrences, then what you get from substring()
is the whole input string if it matches, as if escape-double-quote
had appeared at each end of the string.  I think this is fine, but
we ought to document it.

* If there are an odd number of occurrences, similar_escape() doesn't
complain, but you'll get this from the regex engine:
    ERROR:  invalid regular expression: parentheses () not balanced
The fact of an error isn't a problem, but the error message is pretty
confusing considering that what the user wrote was not parentheses.
I think similar_escape() ought to throw its own error with an on-point
message.

* If there are more than two pairs of escape-double-quote, you get
some behavior that's completely not per spec --- the patterns
between the additional pairs still contribute to whether there's an
overall match, but they don't affect the result substring.  I'm
inclined to think we ought to throw an error for this, too.

* The spec is much tighter than we are concerning what's a legal
escape sequence.  This is partly intentional on our part, I think;
notably, you can get at POSIX-regex escapes like "\d", which the
SQL spec doesn't provide.  Although I think this is intentional,
it's not documented.  I'm not sure if we want to tighten that
up or document what we allow ... thoughts?


I am not eager to change any of this in released branches, but
I think there's a good case for doing something about these
points in HEAD.

                        regards, tom lane

[1] https://postgr.es/m/5bb27a41-350d-37bf-901e-9d26f5592dd0@...


Reply | Threaded
Open this post in threaded view
|

Re: SQL-spec incompatibilities in similar_escape() and related stuff

Andrew Gierth
>>>>> "Tom" == Tom Lane <[hidden email]> writes:

 Tom> but in recent versions it's

 Tom>         <regular expression substring function> ::=
 Tom>               SUBSTRING <left paren> <character value expression>
 Tom>                                      SIMILAR <character value expression>
 Tom>                                      ESCAPE <escape character> <right paren>

 Tom> I am, frankly, inclined to ignore this as a bad idea. We do have
 Tom> SIMILAR and ESCAPE as keywords already, but they're
 Tom> type_func_name_keyword and unreserved_keyword respectively. To
 Tom> support this syntax, I'm pretty sure we'd have to make them both
 Tom> fully reserved.

I only did a quick trial but it doesn't seem to require reserving them
more strictly - just adding the obvious productions to the grammar
doesn't introduce any conflicts.

 Tom> * Our function similar_escape() is not documented, but it
 Tom> underlies three things in the grammar:

 Tom> a SIMILAR TO b
 Tom> Translated as "a ~ similar_escape(b, null)"

 Tom> a SIMILAR TO b ESCAPE e
 Tom> Translated as "a ~ similar_escape(b, e)"

 Tom> substring(a, b, e)
 Tom> This is a SQL function expanding to
 Tom> select pg_catalog.substring($1, pg_catalog.similar_escape($2, $3))

 Tom> To support the first usage, similar_escape is non-strict, and it
 Tom> takes a NULL second argument to mean '\'. This is already a SQL
 Tom> spec violation, because as far as I can tell from the spec, if you
 Tom> don't write an ESCAPE clause then there is *no* escape character;
 Tom> there certainly is not a default of '\'. However, we document this
 Tom> behavior, so I don't know if we want to change it.

This is the same spec violation that we also have for LIKE, which also
is supposed to have no escape character in the absense of an explicit
ESCAPE clause.

--
Andrew (irc:RhodiumToad)


Reply | Threaded
Open this post in threaded view
|

Re: SQL-spec incompatibilities in similar_escape() and related stuff

Andrew Gierth
>>>>> "Andrew" == Andrew Gierth <[hidden email]> writes:

 Tom> I am, frankly, inclined to ignore this as a bad idea. We do have
 Tom> SIMILAR and ESCAPE as keywords already, but they're
 Tom> type_func_name_keyword and unreserved_keyword respectively. To
 Tom> support this syntax, I'm pretty sure we'd have to make them both
 Tom> fully reserved.

 Andrew> I only did a quick trial but it doesn't seem to require
 Andrew> reserving them more strictly - just adding the obvious
 Andrew> productions to the grammar doesn't introduce any conflicts.

Digging deeper, that's because both SIMILAR and ESCAPE have been
assigned precedence. Ambiguities that exist include:

   ... COLNAME ! SIMILAR ( ...

which could be COLNAME postfix-op SIMILAR a_expr, or COLNAME infix-op
function-call. Postfix operators strike again... we really should kill
those off.

The ESCAPE part could in theory be ambiguous if the SIMILAR expression
ends in a ... SIMILAR TO xxx operator, since then we wouldn't know
whether to attach the ESCAPE to that or keep it as part of the function
syntax. But I think this is probably a non-issue. More significant is
that ... COLNAME ! ESCAPE ... again has postfix- vs. infix-operator
ambiguities.

--
Andrew (irc:RhodiumToad)


Reply | Threaded
Open this post in threaded view
|

Re: SQL-spec incompatibilities in similar_escape() and related stuff

Andrew Gierth
>>>>> "Andrew" == Andrew Gierth <[hidden email]> writes:

 Andrew> The ESCAPE part could in theory be ambiguous if the SIMILAR
 Andrew> expression ends in a ... SIMILAR TO xxx operator, since then we
 Andrew> wouldn't know whether to attach the ESCAPE to that or keep it
 Andrew> as part of the function syntax. But I think this is probably a
 Andrew> non-issue. More significant is that ... COLNAME ! ESCAPE ...
 Andrew> again has postfix- vs. infix-operator ambiguities.

And this ambiguity shows up already in other contexts:

select 'foo' similar to 'f' || escape escape escape from (values ('oo')) v(escape);
psql: ERROR:  syntax error at or near "escape"
LINE 1: select 'foo' similar to 'f' || escape escape escape from (va...

select 'foo' similar to 'f' || escape escape from (values ('oo')) v(escape);
psql: ERROR:  operator does not exist: unknown ||
LINE 1: select 'foo' similar to 'f' || escape escape from (values ('...

I guess this happens because ESCAPE has precedence below POSTFIXOP, so
the ('f' ||) gets reduced in preference to shifting in the first ESCAPE
token.

--
Andrew (irc:RhodiumToad)


Reply | Threaded
Open this post in threaded view
|

Re: SQL-spec incompatibilities in similar_escape() and related stuff

Tom Lane-2
Andrew Gierth <[hidden email]> writes:
> "Andrew" == Andrew Gierth <[hidden email]> writes:
>  Andrew> The ESCAPE part could in theory be ambiguous if the SIMILAR
>  Andrew> expression ends in a ... SIMILAR TO xxx operator, since then we
>  Andrew> wouldn't know whether to attach the ESCAPE to that or keep it
>  Andrew> as part of the function syntax. But I think this is probably a
>  Andrew> non-issue. More significant is that ... COLNAME ! ESCAPE ...
>  Andrew> again has postfix- vs. infix-operator ambiguities.

> And this ambiguity shows up already in other contexts:

> select 'foo' similar to 'f' || escape escape escape from (values ('oo')) v(escape);
> psql: ERROR:  syntax error at or near "escape"
> LINE 1: select 'foo' similar to 'f' || escape escape escape from (va...


Hmm.  Oddly, you can't fix it by adding parens:

regression=# select 'foo' similar to ('f' || escape) escape escape from (values ('oo')) v(escape);
psql: ERROR:  syntax error at or near "escape"
LINE 1: select 'foo' similar to ('f' || escape) escape escape from (...
                                        ^

Since "escape" is an unreserved word, I'd have expected that to work.
Odd.

The big picture here is that fixing grammar ambiguities by adding
precedence is a dangerous business :-(

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: SQL-spec incompatibilities in similar_escape() and related stuff

Andrew Gierth
>>>>> "Tom" == Tom Lane <[hidden email]> writes:

 Tom> Hmm.  Oddly, you can't fix it by adding parens:

 Tom> regression=# select 'foo' similar to ('f' || escape) escape escape from (values ('oo')) v(escape);
 Tom> psql: ERROR:  syntax error at or near "escape"
 Tom> LINE 1: select 'foo' similar to ('f' || escape) escape escape from (...
 Tom>                                         ^

 Tom> Since "escape" is an unreserved word, I'd have expected that to
 Tom> work. Odd.

Simpler cases fail too:

select 'f' || escape from (values ('o')) v(escape);
psql: ERROR:  syntax error at or near "escape"

select 1 + escape from (values (1)) v(escape);  -- works
select 1 & escape from (values (1)) v(escape);  -- fails

in short ESCAPE can't follow any generic operator, because its lower
precedence forces the operator to be reduced as a postfix op instead.

 Tom> The big picture here is that fixing grammar ambiguities by adding
 Tom> precedence is a dangerous business :-(

Yeah. But the alternative is usually reserving words more strictly,
which has its own issues :-(

Or we could kill off postfix operators...

--
Andrew (irc:RhodiumToad)


Reply | Threaded
Open this post in threaded view
|

Re: SQL-spec incompatibilities in similar_escape() and related stuff

Tom Lane-2
In reply to this post by Andrew Gierth
[ backing up to a different sub-discussion ]

Andrew Gierth <[hidden email]> writes:
> "Tom" == Tom Lane <[hidden email]> writes:
>  Tom> To support the first usage, similar_escape is non-strict, and it
>  Tom> takes a NULL second argument to mean '\'. This is already a SQL
>  Tom> spec violation, because as far as I can tell from the spec, if you
>  Tom> don't write an ESCAPE clause then there is *no* escape character;
>  Tom> there certainly is not a default of '\'. However, we document this
>  Tom> behavior, so I don't know if we want to change it.

> This is the same spec violation that we also have for LIKE, which also
> is supposed to have no escape character in the absense of an explicit
> ESCAPE clause.

Right.  After further thought, I propose that what we ought to do is
unify LIKE, SIMILAR TO, and 3-arg SUBSTRING on a single set of behaviors
for the ESCAPE argument:

1. They are strict, ie a NULL value for the escape string produces a
NULL result.  This is per spec, and we don't document anything different,
and nobody would really expect something different.  (But see below
about keeping similar_escape() as a legacy compatibility function.)

2. Omitting the ESCAPE option (not possible for SUBSTRING) results in a
default of '\'.  This is not per spec, but we've long documented it this
way, and frankly I'd say that it's a far more useful default than the
spec's behavior of "there is no escape character".  I propose that we
just document that this is not-per-spec and move on.

3. Interpret an empty ESCAPE string as meaning "there is no escape
character".  This is not per spec either (the spec would have us
throw an error) but it's our historical behavior, and it seems like
a saner approach than the way the spec wants to do it --- in particular,
there's no way to get that behavior in 3-arg SUBSTRING if we don't allow
this.

So only point 1 represents an actual behavioral change from what we've
been doing; the other two just require doc clarifications.

Now, I don't have any problem with changing what happens when somebody
actually writes "a LIKE b ESCAPE NULL"; it seems fairly unlikely that
anyone would expect that to yield a non-null result.  However, we do
have a problem with the fact that the implementation is partially
exposed:

regression=# create view v1 as select f1 similar to 'x*' from text_tbl;
CREATE VIEW
regression=# \d+ v1
...
View definition:
 SELECT text_tbl.f1 ~ similar_escape('x*'::text, NULL::text)
   FROM text_tbl;

If we just change similar_escape() to be strict, then this view will
stop working, which is a bit hard on users who did not write anything
non-spec-compliant.

I propose therefore that we leave similar_escape in place with its
current behavior, as a compatibility measure for cases like this.
Intead, invent two new strict functions, say
        similar_to_escape(pattern)
        similar_to_escape(pattern, escape)
and change the parser and the implementation of SUBSTRING() to
rely on these going forward.

The net effect will be to make explicit "ESCAPE NULL" spec-compliant,
and to get rid of the performance problem from inlining failure for
substring().  All else is just doc clarifications.

Comments?

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: SQL-spec incompatibilities in similar_escape() and related stuff

Robert Haas
In reply to this post by Andrew Gierth
On Mon, May 13, 2019 at 2:39 PM Andrew Gierth
<[hidden email]> wrote:
> Or we could kill off postfix operators...

/me helps Andrew hijack the thread.

We wouldn't even have to go that far.  We could just restrict it to a
specific list of operators that are hard-coded into the lexer and
parser, like say only '!'.

Even if we killed postfix operators completely, the number of users
who would be affected would probably be minimal, because the only
postfix operator we ship is for factorial, and realistically, that's
not exactly a critical thing for most users, especially considering
that our implementation is pretty slow.  But the number of people
using out-of-core postfix operators has got to be really tiny --
unless, maybe, there's some really popular extension like PostGIS that
uses them.

I think it's pretty clear that the theoretical beauty of being able to
handle postfix operators is not worth the tangible cost they impose on
our parser.  We're losing more users as a result of SQL that other
systems can accept and we cannot than we are gaining by being able to
support user-defined postfix operators.  The latter is not exactly a
mainstream need.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply | Threaded
Open this post in threaded view
|

Re: SQL-spec incompatibilities in similar_escape() and related stuff

Tom Lane-2
Robert Haas <[hidden email]> writes:
> I think it's pretty clear that the theoretical beauty of being able to
> handle postfix operators is not worth the tangible cost they impose on
> our parser.  We're losing more users as a result of SQL that other
> systems can accept and we cannot than we are gaining by being able to
> support user-defined postfix operators.

I suppose it's possible to make such an argument, but you haven't
actually made one --- just asserted something without providing
evidence.

If we can lay out some concrete gains that justify zapping postfix
operators, I'd be willing to do it.  I agree that it would likely
hurt few users ... but we need to be able to explain to those few
why we broke it.  And show that the benefits outweigh the cost.

                        regards, tom lane