BUG #16133: Regexp quantifier issues

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

BUG #16133: Regexp quantifier issues

PG Doc comments form
The following bug has been logged on the website:

Bug reference:      16133
Logged by:          Andrew Gierth
Email address:      [hidden email]
PostgreSQL version: 12.1
Operating system:   any
Description:        

(This started out as an irc discussion on #tcl that spilled over to
#postgresql:)

SELECT regexp_match('aaa', '(a*)*');
 regexp_match
--------------
 {aaa}
(1 row)

SELECT regexp_match('aaa', '(a*)+');
 regexp_match
--------------
 {""}
(1 row)

What seems to be happening here is that in the + case, the engine is doing
one more match, matching (a*) against an empty string at the end of the
input, unlike the * case where the last match of (a*) is against the whole
string. This seems to violate the rules for determining where subexpression
captures line up. (And certainly there is no justification for the + vs. *
quantifier to make any difference here.)

There are a large number of similar cases, but this seems to be the common
factor to all of them so far.

Reply | Threaded
Open this post in threaded view
|

Re: BUG #16133: Regexp quantifier issues

Tom Lane-2
PG Bug reporting form <[hidden email]> writes:
> SELECT regexp_match('aaa', '(a*)*');
>  regexp_match
> --------------
>  {aaa}
> (1 row)

> SELECT regexp_match('aaa', '(a*)+');
>  regexp_match
> --------------
>  {""}
> (1 row)

> What seems to be happening here is that in the + case, the engine is doing
> one more match, matching (a*) against an empty string at the end of the
> input, unlike the * case where the last match of (a*) is against the whole
> string. This seems to violate the rules for determining where subexpression
> captures line up. (And certainly there is no justification for the + vs. *
> quantifier to make any difference here.)

[ shrug... ]  This excites me not in the least.  The engine is entirely
within its rights to match "a*" to an empty substring.

I do see a documentation issue, which is that we don't specify what
happens when capturing parens are within a repetition operator ---
which of the matches gets captured?

Digging around in the code, it looks like one might choose to cast
blame on regcomp.c, around line 1170:

         * If there's no backrefs involved, we can turn x{m,n} into
         * x{m-1,n-1}x, with capturing parens in only the second x.  This is
         * valid because we only care about capturing matches from the final
         * iteration of the quantifier.  It's a win because we can implement
         * the backref-free left side as a plain DFA node, since we don't
         * really care where its submatches are.

That is, we're effectively converting "(a*)+" to "a*(a*)*", whereupon
it's obvious why you're getting the result you do.

This comment only dates back to 173e29aa5, but apparently the behavior
of capturing the last instance is old; at the time I wrote
   
    Cases where a back-reference is part of a larger subexpression that
    is quantified have never worked in Spencer's regex engine, because
    he used a compile-time transformation that neglected the need to
    check the back-reference match in iterations before the last one.
    (That was okay for capturing parens, and we still do it if the
    regex has *only* capturing parens ... but it's not okay for backrefs.)

Perhaps capturing in the first iteration would provide less surprising
semantics, but I'm not sure that anyone would love us for making such a
change, even if it's feasible code- and performance-wise.

                        regards, tom lane