PostgreSQL 9.3.5 substring(text from pattern for escape) bug

classic Classic list List threaded Threaded
12 messages Options
Reply | Threaded
Open this post in threaded view
|

PostgreSQL 9.3.5 substring(text from pattern for escape) bug

Robert Schreiber
I believe I have come across a bug in the substring(text from pattern for escape) function.

What I am attempting to do is take a string like 'QMy NameQ' , strip off leading/and/or/trailing Qs and return 'My Name'.

The substring() call that I have coded is substring(xxx from 'Q?#"%#"Q?' FOR '#')

My understand of this is:
Q?     -- matches 0 or 1 occurrences of Q (the leading one, if present)
#"     -- starts data collection
%    -- matches any number of characters
#"   -- ends data collection
Q?  -- matches 0 or 1 occurrences of Q (the trialing one, if present)

What appears to be happening is that the Q? on the tail is being matched, but the Q is also being included in the collected data.

The attached PSQL test.sql file creates a table, populates it with 7 test cases with my expected results, and then executes substring() against it.

There is an attached test_output.txt file is the PSQL output that illustrates the problem. The column "Error" is true when the actual value disagrees with what I expected.

Robert Schreiber
410-392-9553

PostgreSQL 9.3.5, compiled by Visual C++ build 1600, 64-bit
Running under Windows 10 64-bit.

test.sql (1K) Download Attachment
test_output.txt (2K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: PostgreSQL 9.3.5 substring(text from pattern for escape) bug

Daniel Verite
        Robert Schreiber wrote:

> My understand of this is:
> Q?     -- matches 0 or 1 occurrences of Q (the leading one, if present)
> #"     -- starts data collection
> %    -- matches any number of characters
> #"   -- ends data collection
> Q?  -- matches 0 or 1 occurrences of Q (the trialing one, if present)
>
> What appears to be happening is that the _Q? on the tail_ is being
> matched, but _the Q is also being included in the collected data.__

The expression as a whole is being matched, otherwise this form of
substring() would return NULL, as told in
https://www.postgresql.org/docs/current/functions-matching.html :

  "As with SIMILAR TO, the specified pattern must match the entire
   data string,  or else the function fails and returns null. "

Considering your example,  where 'My Q NameQ' is the string and
'Q?#"%#"Q?' the regexp. The engine can either see the ending
'Q' as part of the  % match, or as a match for the 'Q?' that
follows. Both cases are valid matches.

You seem to expect that % must be non-greedy and let the final Q?
match 1 Q instead of 0, but there doesn't appear to be anything
in the doc  that supports this interpretation.
In fact, it mentions that "%" is comparable to ".*" in POSIX
regular expressions, and the latter _is_ greedy.


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


Reply | Threaded
Open this post in threaded view
|

Re: PostgreSQL 9.3.5 substring(text from pattern for escape) bug

Tom Lane-2
"Daniel Verite" <[hidden email]> writes:
> Robert Schreiber wrote:
>> What appears to be happening is that the _Q? on the tail_ is being
>> matched, but _the Q is also being included in the collected data.__

> You seem to expect that % must be non-greedy and let the final Q?
> match 1 Q instead of 0, but there doesn't appear to be anything
> in the doc  that supports this interpretation.
> In fact, it mentions that "%" is comparable to ".*" in POSIX
> regular expressions, and the latter _is_ greedy.

Right.  You could get the behavior you want using a non-greedy quantifier,
but you'd have to use the POSIX regexp functions, not substring().

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: PostgreSQL 9.3.5 substring(text from pattern for escape) bug

Andrew Gierth
>>>>> "Tom" == Tom Lane <[hidden email]> writes:

 >>> What appears to be happening is that the _Q? on the tail_ is being
 >>> matched, but _the Q is also being included in the collected data.__

 >> You seem to expect that % must be non-greedy and let the final Q?
 >> match 1 Q instead of 0, but there doesn't appear to be anything in
 >> the doc that supports this interpretation. In fact, it mentions that
 >> "%" is comparable to ".*" in POSIX regular expressions, and the
 >> latter _is_ greedy.

 Tom> Right. You could get the behavior you want using a non-greedy
 Tom> quantifier, but you'd have to use the POSIX regexp functions, not
 Tom> substring().

I looked up the spec on this point. As far as I can see, we're not
following it, but neither does the spec do what the OP wanted; in fact
the result should have included the _leading_ Q as well as the trailing
one.

The relevant part of SQL2016 seems to be this:

6.32 <string value function>

General Rules

6) If <regular expression substring function> is specified, then:

[...rules that split the pattern into 'R1#"R2#"R3' omitted...]

  h) Otherwise, the result S of the <regular expression substring
     function> is computed as follows:
 
    i) Let S1 be the shortest initial substring of C such that there is
       a substring S23 of C such that the value of the following <search
       condition> is True:

          'C' = 'S1' || 'S23' AND
          'S1' SIMILAR TO 'R1' ESCAPE 'E' AND
          'S23' SIMILAR TO '(R2R3)' ESCAPE 'E'

    ii) Let S3 be the shortest final substring of S23 such that there is
        a substring S2 of S23 such that the value of the following
        <search condition> is True:

          'S23' = 'S2' || 'S3' AND
          'S2' SIMILAR TO 'R2' ESCAPE 'E' AND
          'S3' SIMILAR TO 'R3' ESCAPE 'E'

    iii) The result of the <regular expression substring function> is S2.

--
Andrew (irc:RhodiumToad)


Reply | Threaded
Open this post in threaded view
|

Re: PostgreSQL 9.3.5 substring(text from pattern for escape) bug

Tom Lane-2
Andrew Gierth <[hidden email]> writes:
> I looked up the spec on this point. As far as I can see, we're not
> following it, but neither does the spec do what the OP wanted; in fact
> the result should have included the _leading_ Q as well as the trailing
> one.

Huh, interesting.  So we should be translating the initial substring
to a non-greedy pattern.  I believe Spencer's engine can handle that
by sticking (?:...){1,1}? around it.

Come to think of it, we probably need to be putting (?:...) around
the trailing substring as well.  I suspect what we're doing today
produces non-spec results if "|" appears in the trailing part.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: PostgreSQL 9.3.5 substring(text from pattern for escape) bug

Andrew Gierth
>>>>> "Tom" == Tom Lane <[hidden email]> writes:

 > Andrew Gierth <[hidden email]> writes:
 >> I looked up the spec on this point. As far as I can see, we're not
 >> following it, but neither does the spec do what the OP wanted; in
 >> fact the result should have included the _leading_ Q as well as the
 >> trailing one.

 Tom> Huh, interesting. So we should be translating the initial
 Tom> substring to a non-greedy pattern. I believe Spencer's engine can
 Tom> handle that by sticking (?:...){1,1}? around it.

 Tom> Come to think of it, we probably need to be putting (?:...) around
 Tom> the trailing substring as well. I suspect what we're doing today
 Tom> produces non-spec results if "|" appears in the trailing part.

Digging into it more:

SUBSTRING(x FROM 'expr' FOR 'escape') is from sql92/sql99 and is gone by
sql2008, replaced by SUBSTRING(x SIMILAR 'expr' ESCAPE 'escape'). sql99
defines the matching rule using different language, but with the same
actual effect (requiring shortest matches for the leading and trailing
strings).

Your suggested fix doesn't seem to work. If the leading/trailing
substrings do not have | or parens in then it seems to work to wrap them
in (?:(?:)??...), thanks to the rule that the first quantified atom in a
subexpression sets the whole subexpression's greediness, but handling |
or parens correctly seems harder.

Are there any other dbs that implement this feature that we can compare
against?

--
Andrew (irc:RhodiumToad)


Reply | Threaded
Open this post in threaded view
|

Re: PostgreSQL 9.3.5 substring(text from pattern for escape) bug

Tom Lane-2
Andrew Gierth <[hidden email]> writes:
> "Tom" == Tom Lane <[hidden email]> writes:
>  Tom> Huh, interesting. So we should be translating the initial
>  Tom> substring to a non-greedy pattern. I believe Spencer's engine can
>  Tom> handle that by sticking (?:...){1,1}? around it.

> Your suggested fix doesn't seem to work. If the leading/trailing
> substrings do not have | or parens in then it seems to work to wrap them
> in (?:(?:)??...), thanks to the rule that the first quantified atom in a
> subexpression sets the whole subexpression's greediness, but handling |
> or parens correctly seems harder.

[ pokes at that... ]  Huh.  That's a bug, which AFAICS is aboriginal in
Henry's code: it optimizes away a {1,1} quantifier without regard to
whether the quantifier is attempting to impose a different greediness
preference than its argument would have naturally.  The attached
seems to fix it.

                        regards, tom lane


diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c
index eb1f3d5..8cd7d56 100644
--- a/src/backend/regex/regcomp.c
+++ b/src/backend/regex/regcomp.c
@@ -1155,7 +1155,10 @@ parseqatom(struct vars *v,
  /* rest of branch can be strung starting from atom->end */
  s2 = atom->end;
  }
- else if (m == 1 && n == 1)
+ else if (m == 1 && n == 1 &&
+ (qprefer == 0 ||
+  (atom->flags & (LONGER | SHORTER | MIXED)) == 0 ||
+  qprefer == (atom->flags & (LONGER | SHORTER | MIXED))))
  {
  /* no/vacuous quantifier:  done */
  EMPTYARC(s, atom->begin); /* empty prefix */
diff --git a/src/test/regress/expected/regex.out b/src/test/regress/expected/regex.out
index c0bfa8a..f372003 100644
--- a/src/test/regress/expected/regex.out
+++ b/src/test/regress/expected/regex.out
@@ -492,6 +492,55 @@ select regexp_matches('foo/bar/baz',
  {foo,bar,baz}
 (1 row)
 
+-- Test that greediness can be overridden by outer quantifier
+select regexp_matches('llmmmfff', '^(l*)(.*)(f*)$');
+ regexp_matches
+----------------
+ {ll,mmmfff,""}
+(1 row)
+
+select regexp_matches('llmmmfff', '^(l*){1,1}(.*)(f*)$');
+ regexp_matches
+----------------
+ {ll,mmmfff,""}
+(1 row)
+
+select regexp_matches('llmmmfff', '^(l*){1,1}?(.*)(f*)$');
+  regexp_matches  
+------------------
+ {"",llmmmfff,""}
+(1 row)
+
+select regexp_matches('llmmmfff', '^(l*){1,1}?(.*){1,1}?(f*)$');
+ regexp_matches
+----------------
+ {"",llmmm,fff}
+(1 row)
+
+select regexp_matches('llmmmfff', '^(l*?)(.*)(f*)$');
+  regexp_matches  
+------------------
+ {"",llmmmfff,""}
+(1 row)
+
+select regexp_matches('llmmmfff', '^(l*?){1,1}(.*)(f*)$');
+ regexp_matches
+----------------
+ {ll,mmmfff,""}
+(1 row)
+
+select regexp_matches('llmmmfff', '^(l*?){1,1}?(.*)(f*)$');
+  regexp_matches  
+------------------
+ {"",llmmmfff,""}
+(1 row)
+
+select regexp_matches('llmmmfff', '^(l*?){1,1}?(.*){1,1}?(f*)$');
+ regexp_matches
+----------------
+ {"",llmmm,fff}
+(1 row)
+
 -- Test for infinite loop in cfindloop with zero-length possible match
 -- but no actual match (can only happen in the presence of backrefs)
 select 'a' ~ '$()|^\1';
diff --git a/src/test/regress/sql/regex.sql b/src/test/regress/sql/regex.sql
index 1361b62..a174224 100644
--- a/src/test/regress/sql/regex.sql
+++ b/src/test/regress/sql/regex.sql
@@ -118,6 +118,16 @@ select regexp_matches('Programmer', '(\w)(.*?\1)', 'g');
 select regexp_matches('foo/bar/baz',
                       '^([^/]+?)(?:/([^/]+?))(?:/([^/]+?))?$', '');
 
+-- Test that greediness can be overridden by outer quantifier
+select regexp_matches('llmmmfff', '^(l*)(.*)(f*)$');
+select regexp_matches('llmmmfff', '^(l*){1,1}(.*)(f*)$');
+select regexp_matches('llmmmfff', '^(l*){1,1}?(.*)(f*)$');
+select regexp_matches('llmmmfff', '^(l*){1,1}?(.*){1,1}?(f*)$');
+select regexp_matches('llmmmfff', '^(l*?)(.*)(f*)$');
+select regexp_matches('llmmmfff', '^(l*?){1,1}(.*)(f*)$');
+select regexp_matches('llmmmfff', '^(l*?){1,1}?(.*)(f*)$');
+select regexp_matches('llmmmfff', '^(l*?){1,1}?(.*){1,1}?(f*)$');
+
 -- Test for infinite loop in cfindloop with zero-length possible match
 -- but no actual match (can only happen in the presence of backrefs)
 select 'a' ~ '$()|^\1';
Reply | Threaded
Open this post in threaded view
|

Re: PostgreSQL 9.3.5 substring(text from pattern for escape) bug

Robert Schreiber
In reply to this post by Andrew Gierth
Just to be clear here. It seems to me that I am right in that the leading/trailing Q should NOT be returned...

From Section 9.7.2 in the manual:

The substring function with three parameters, substring(string from pattern for escape-character), provides extraction of a substring that matches an SQL regular expression pattern. As with SIMILAR TO, the specified pattern must match the entire data string, or else the function fails and returns null. To indicate the part of the pattern that should be returned on success, the pattern must contain two occurrences of the escape character followed by a double quote ("). The text matching the portion of the pattern between these markers is returned.

In my mind I see this:

QMy Q NAmeQ  is interpreted as Q/My Q Name/Q rather than Q/My Q Name Q//.

bob



On 5/12/2019 12:27 AM, Andrew Gierth wrote:
"Tom" == Tom Lane [hidden email] writes:
 > Andrew Gierth [hidden email] writes:
 >> I looked up the spec on this point. As far as I can see, we're not
 >> following it, but neither does the spec do what the OP wanted; in
 >> fact the result should have included the _leading_ Q as well as the
 >> trailing one.

 Tom> Huh, interesting. So we should be translating the initial
 Tom> substring to a non-greedy pattern. I believe Spencer's engine can
 Tom> handle that by sticking (?:...){1,1}? around it.

 Tom> Come to think of it, we probably need to be putting (?:...) around
 Tom> the trailing substring as well. I suspect what we're doing today
 Tom> produces non-spec results if "|" appears in the trailing part.

Digging into it more:

SUBSTRING(x FROM 'expr' FOR 'escape') is from sql92/sql99 and is gone by
sql2008, replaced by SUBSTRING(x SIMILAR 'expr' ESCAPE 'escape'). sql99
defines the matching rule using different language, but with the same
actual effect (requiring shortest matches for the leading and trailing
strings).

Your suggested fix doesn't seem to work. If the leading/trailing
substrings do not have | or parens in then it seems to work to wrap them
in (?:(?:)??...), thanks to the rule that the first quantified atom in a
subexpression sets the whole subexpression's greediness, but handling |
or parens correctly seems harder.

Are there any other dbs that implement this feature that we can compare
against?


Reply | Threaded
Open this post in threaded view
|

Re: PostgreSQL 9.3.5 substring(text from pattern for escape) bug

Tom Lane-2
Robert Schreiber <[hidden email]> writes:
> Just to be clear here. It seems to me that I am right in that the
> leading/trailing Q should NOT be returned...

Nope.  The SQL spec's pretty clear that when the first and third
sub-expressions of the pattern can match variable amounts of text,
they're supposed to be matched to the minimum possible amount of text,
which is nothing for a pattern like "Q?".  It is a bug that we're
not doing that as the spec says, but once we fix that, it still won't
act as you're hoping.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: PostgreSQL 9.3.5 substring(text from pattern for escape) bug

Tom Lane-2
In reply to this post by Tom Lane-2
Here's a proposed patch for dealing with the non-spec greediness behavior.
It also fixes misbehavior with | in the pattern, as shown in the new
test cases (well, if you try them against HEAD, you'll see the
misbehavior).

I haven't done anything about most of the SQL-spec incompatibilities
mentioned in the other thread, but I did change the code to throw error
for more than two escape-double-quote separators, mainly because not
doing so would've required more complexity in the result string
allocation logic.

Also, rather than throw error for a single separator, I just let the
code ignore the case, because with this implementation a fairly reasonable
behavior falls out: it acts the same as if the last pattern section were
empty.  I'm a little worried that that's turning an implementation
artifact into a feature, but it seems to line up well with our historical
behavior for no separators (namely, we act as though the first and last
sections were both empty).

I've included doc changes addressing these specific behaviors, but
not done anything about the other omissions in the docs.

Comments?  Should we try to get more than this done for v12, or just
leave it for some other day?

                        regards, tom lane


diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index da0f305..bc2275c 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -4296,19 +4296,45 @@ cast(-44 as bit(12))           <lineannotation>111111010100</lineannotation>
    </para>
 
    <para>
-    The <function>substring</function> function with three parameters,
-    <function>substring(<replaceable>string</replaceable> from
-    <replaceable>pattern</replaceable> for
-    <replaceable>escape-character</replaceable>)</function>, provides
-    extraction of a substring that matches an SQL
-    regular expression pattern.  As with <literal>SIMILAR TO</literal>, the
+    The <function>substring</function> function with three parameters
+    provides extraction of a substring that matches an SQL
+    regular expression pattern.  The function can be written according
+    to SQL99 syntax:
+<synopsis>
+substring(<replaceable>string</replaceable> from <replaceable>pattern</replaceable> for <replaceable>escape-character</replaceable>)
+</synopsis>
+    or as a plain three-argument function:
+<synopsis>
+substring(<replaceable>string</replaceable>, <replaceable>pattern</replaceable>, <replaceable>escape-character</replaceable>)
+</synopsis>
+    As with <literal>SIMILAR TO</literal>, the
     specified pattern must match the entire data string, or else the
     function fails and returns null.  To indicate the part of the
-    pattern that should be returned on success, the pattern must contain
+    pattern for which the matching data sub-string is of interest,
+    the pattern should contain
     two occurrences of the escape character followed by a double quote
     (<literal>"</literal>). <!-- " font-lock sanity -->
     The text matching the portion of the pattern
-    between these markers is returned.
+    between these separators is returned when the match is successful.
+   </para>
+
+   <para>
+    The escape-double-quote separators actually
+    divide <function>substring</function>'s pattern into three independent
+    regular expressions; for example, a vertical bar (<literal>|</literal>)
+    in any of the three sections affects only that section.  Also, the first
+    and third of these regular expressions are defined to match the smallest
+    possible amount of text, not the largest, when there is any ambiguity
+    about how much of the data string matches which pattern.  (In POSIX
+    parlance, the first and third regular expressions are forced to be
+    non-greedy.)
+   </para>
+
+   <para>
+    As an extension to the SQL standard, <productname>PostgreSQL</productname>
+    allows there to be just one escape-double-quote separator, in which case
+    the third regular expression is taken as empty; or no separators, in which
+    case the first and third regular expressions are taken as empty.
    </para>
 
    <para>
diff --git a/src/backend/utils/adt/regexp.c b/src/backend/utils/adt/regexp.c
index da13a87..ab44846 100644
--- a/src/backend/utils/adt/regexp.c
+++ b/src/backend/utils/adt/regexp.c
@@ -708,20 +708,42 @@ similar_escape(PG_FUNCTION_ARGS)
  * We surround the transformed input string with
  * ^(?: ... )$
  * which requires some explanation.  We need "^" and "$" to force
- * the pattern to match the entire input string as per SQL99 spec.
+ * the pattern to match the entire input string as per the SQL spec.
  * The "(?:" and ")" are a non-capturing set of parens; we have to have
  * parens in case the string contains "|", else the "^" and "$" will
  * be bound into the first and last alternatives which is not what we
  * want, and the parens must be non capturing because we don't want them
  * to count when selecting output for SUBSTRING.
+ *
+ * When the pattern is divided into three parts by escape-double-quotes,
+ * what we emit is
+ * ^(?:part1){1,1}?(part2){1,1}(?:part3)$
+ * which requires even more explanation.  The "{1,1}?" on part1 makes it
+ * non-greedy so that it will match the smallest possible amount of text
+ * not the largest, as required by SQL.  The plain parens around part2
+ * are capturing parens so that that part is what controls the result of
+ * SUBSTRING.  The "{1,1}" forces part2 to be greedy, so that it matches
+ * the largest possible amount of text; hence part3 must match the
+ * smallest amount of text, as required by SQL.  We don't need an explicit
+ * greediness marker on part3.  Note that this also confines the effects
+ * of any "|" characters to the respective part, which is what we want.
+ *
+ * The SQL spec says that SUBSTRING's pattern must contain exactly two
+ * escape-double-quotes, but we only complain if there's more than two.
+ * With none, we act as though part1 and part3 are empty; with one, we
+ * act as though part3 is empty.  Both behaviors fall out of omitting
+ * the relevant part separators in the above expansion.  If the result
+ * of this function is used in a plain regexp match (SIMILAR TO), the
+ * escape-double-quotes have no effect on the match behavior.
  *----------
  */
 
  /*
- * We need room for the prefix/postfix plus as many as 3 output bytes per
- * input byte; since the input is at most 1GB this can't overflow
+ * We need room for the prefix/postfix and part separators, plus as many
+ * as 3 output bytes per input byte; since the input is at most 1GB this
+ * can't overflow size_t.
  */
- result = (text *) palloc(VARHDRSZ + 6 + 3 * plen);
+ result = (text *) palloc(VARHDRSZ + 23 + 3 * (size_t) plen);
  r = VARDATA(result);
 
  *r++ = '^';
@@ -760,7 +782,7 @@ similar_escape(PG_FUNCTION_ARGS)
  }
  else if (e && elen == mblen && memcmp(e, p, mblen) == 0)
  {
- /* SQL99 escape character; do not send to output */
+ /* SQL escape character; do not send to output */
  afterescape = true;
  }
  else
@@ -784,10 +806,45 @@ similar_escape(PG_FUNCTION_ARGS)
  /* fast path */
  if (afterescape)
  {
- if (pchar == '"' && !incharclass) /* for SUBSTRING patterns */
- *r++ = ((nquotes++ % 2) == 0) ? '(' : ')';
+ if (pchar == '"' && !incharclass) /* escape-double-quote? */
+ {
+ /* emit appropriate part separator, per notes above */
+ if (nquotes == 0)
+ {
+ *r++ = ')';
+ *r++ = '{';
+ *r++ = '1';
+ *r++ = ',';
+ *r++ = '1';
+ *r++ = '}';
+ *r++ = '?';
+ *r++ = '(';
+ }
+ else if (nquotes == 1)
+ {
+ *r++ = ')';
+ *r++ = '{';
+ *r++ = '1';
+ *r++ = ',';
+ *r++ = '1';
+ *r++ = '}';
+ *r++ = '(';
+ *r++ = '?';
+ *r++ = ':';
+ }
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_USE_OF_ESCAPE_CHARACTER),
+ errmsg("SQL regular expression may not contain more than two escape-double-quote separators")));
+ nquotes++;
+ }
  else
  {
+ /*
+ * We allow any character at all to be escaped; notably, this
+ * allows access to POSIX character-class escapes such as
+ * "\d".  The SQL spec is considerably more restrictive.
+ */
  *r++ = '\\';
  *r++ = pchar;
  }
@@ -795,7 +852,7 @@ similar_escape(PG_FUNCTION_ARGS)
  }
  else if (e && pchar == *e)
  {
- /* SQL99 escape character; do not send to output */
+ /* SQL escape character; do not send to output */
  afterescape = true;
  }
  else if (incharclass)
diff --git a/src/test/regress/expected/strings.out b/src/test/regress/expected/strings.out
index 7c7f872..24570be 100644
--- a/src/test/regress/expected/strings.out
+++ b/src/test/regress/expected/strings.out
@@ -313,7 +313,7 @@ SELECT SUBSTRING('1234567890' FROM 4 FOR 3) = '456' AS "456";
  t
 (1 row)
 
--- T581 regular expression substring (with SQL99's bizarre regexp syntax)
+-- T581 regular expression substring (with SQL's bizarre regexp syntax)
 SELECT SUBSTRING('abcdefg' FROM 'a#"(b_d)#"%' FOR '#') AS "bcd";
  bcd
 -----
@@ -328,13 +328,13 @@ SELECT SUBSTRING('abcdefg' FROM '#"(b_d)#"%' FOR '#') IS NULL AS "True";
 (1 row)
 
 -- Null inputs should return NULL
-SELECT SUBSTRING('abcdefg' FROM '(b|c)' FOR NULL) IS NULL AS "True";
+SELECT SUBSTRING('abcdefg' FROM '%' FOR NULL) IS NULL AS "True";
  True
 ------
  t
 (1 row)
 
-SELECT SUBSTRING(NULL FROM '(b|c)' FOR '#') IS NULL AS "True";
+SELECT SUBSTRING(NULL FROM '%' FOR '#') IS NULL AS "True";
  True
 ------
  t
@@ -346,8 +346,57 @@ SELECT SUBSTRING('abcdefg' FROM NULL FOR '#') IS NULL AS "True";
  t
 (1 row)
 
--- PostgreSQL extension to allow omitting the escape character;
--- here the regexp is taken as Posix syntax
+-- The first and last parts should act non-greedy
+SELECT SUBSTRING('abcdefg' FROM 'a#"%#"g' FOR '#') AS "bcdef";
+ bcdef
+-------
+ bcdef
+(1 row)
+
+SELECT SUBSTRING('abcdefg' FROM 'a*#"%#"g*' FOR '#') AS "abcdefg";
+ abcdefg
+---------
+ abcdefg
+(1 row)
+
+-- Vertical bar in any part affects only that part
+SELECT SUBSTRING('abcdefg' FROM 'a|b#"%#"g' FOR '#') AS "bcdef";
+ bcdef
+-------
+ bcdef
+(1 row)
+
+SELECT SUBSTRING('abcdefg' FROM 'a#"%#"x|g' FOR '#') AS "bcdef";
+ bcdef
+-------
+ bcdef
+(1 row)
+
+SELECT SUBSTRING('abcdefg' FROM 'a#"%|ab#"g' FOR '#') AS "bcdef";
+ bcdef
+-------
+ bcdef
+(1 row)
+
+-- Can't have more than two part separators
+SELECT SUBSTRING('abcdefg' FROM 'a*#"%#"g*#"x' FOR '#') AS "error";
+ERROR:  SQL regular expression may not contain more than two escape-double-quote separators
+CONTEXT:  SQL function "substring" statement 1
+-- Postgres extension: with 0 or 1 separator, assume parts 1 and 3 are empty
+SELECT SUBSTRING('abcdefg' FROM 'a#"%g' FOR '#') AS "bcdefg";
+ bcdefg
+--------
+ bcdefg
+(1 row)
+
+SELECT SUBSTRING('abcdefg' FROM 'a%g' FOR '#') AS "abcdefg";
+ abcdefg
+---------
+ abcdefg
+(1 row)
+
+-- substring() with just two arguments is not allowed by SQL spec;
+-- we accept it, but we interpret the pattern as a POSIX regexp not SQL
 SELECT SUBSTRING('abcdefg' FROM 'c.e') AS "cde";
  cde
 -----
diff --git a/src/test/regress/sql/strings.sql b/src/test/regress/sql/strings.sql
index 9cd2bc5..5744c9f 100644
--- a/src/test/regress/sql/strings.sql
+++ b/src/test/regress/sql/strings.sql
@@ -110,19 +110,35 @@ SELECT SUBSTRING('1234567890' FROM 3) = '34567890' AS "34567890";
 
 SELECT SUBSTRING('1234567890' FROM 4 FOR 3) = '456' AS "456";
 
--- T581 regular expression substring (with SQL99's bizarre regexp syntax)
+-- T581 regular expression substring (with SQL's bizarre regexp syntax)
 SELECT SUBSTRING('abcdefg' FROM 'a#"(b_d)#"%' FOR '#') AS "bcd";
 
 -- No match should return NULL
 SELECT SUBSTRING('abcdefg' FROM '#"(b_d)#"%' FOR '#') IS NULL AS "True";
 
 -- Null inputs should return NULL
-SELECT SUBSTRING('abcdefg' FROM '(b|c)' FOR NULL) IS NULL AS "True";
-SELECT SUBSTRING(NULL FROM '(b|c)' FOR '#') IS NULL AS "True";
+SELECT SUBSTRING('abcdefg' FROM '%' FOR NULL) IS NULL AS "True";
+SELECT SUBSTRING(NULL FROM '%' FOR '#') IS NULL AS "True";
 SELECT SUBSTRING('abcdefg' FROM NULL FOR '#') IS NULL AS "True";
 
--- PostgreSQL extension to allow omitting the escape character;
--- here the regexp is taken as Posix syntax
+-- The first and last parts should act non-greedy
+SELECT SUBSTRING('abcdefg' FROM 'a#"%#"g' FOR '#') AS "bcdef";
+SELECT SUBSTRING('abcdefg' FROM 'a*#"%#"g*' FOR '#') AS "abcdefg";
+
+-- Vertical bar in any part affects only that part
+SELECT SUBSTRING('abcdefg' FROM 'a|b#"%#"g' FOR '#') AS "bcdef";
+SELECT SUBSTRING('abcdefg' FROM 'a#"%#"x|g' FOR '#') AS "bcdef";
+SELECT SUBSTRING('abcdefg' FROM 'a#"%|ab#"g' FOR '#') AS "bcdef";
+
+-- Can't have more than two part separators
+SELECT SUBSTRING('abcdefg' FROM 'a*#"%#"g*#"x' FOR '#') AS "error";
+
+-- Postgres extension: with 0 or 1 separator, assume parts 1 and 3 are empty
+SELECT SUBSTRING('abcdefg' FROM 'a#"%g' FOR '#') AS "bcdefg";
+SELECT SUBSTRING('abcdefg' FROM 'a%g' FOR '#') AS "abcdefg";
+
+-- substring() with just two arguments is not allowed by SQL spec;
+-- we accept it, but we interpret the pattern as a POSIX regexp not SQL
 SELECT SUBSTRING('abcdefg' FROM 'c.e') AS "cde";
 
 -- With a parenthesized subexpression, return only what matches the subexpr
Reply | Threaded
Open this post in threaded view
|

Re: PostgreSQL 9.3.5 substring(text from pattern for escape) bug

Andrew Gierth
>>>>> "Tom" == Tom Lane <[hidden email]> writes:

 Tom> Here's a proposed patch for dealing with the non-spec greediness
 Tom> behavior.

Looks good to me so far.

--
Andrew (irc:RhodiumToad)


Reply | Threaded
Open this post in threaded view
|

Re: PostgreSQL 9.3.5 substring(text from pattern for escape) bug

Tom Lane-2
Andrew Gierth <[hidden email]> writes:
> "Tom" == Tom Lane <[hidden email]> writes:
>  Tom> Here's a proposed patch for dealing with the non-spec greediness
>  Tom> behavior.

> Looks good to me so far.

Thanks for looking!  Pushed now.

                        regards, tom lane