benchmarking Flex practices

benchmarking Flex practices

John Naylor-2
I decided to do some experiments with how we use Flex. The main
takeaway is that backtracking, which we removed in 2005, doesn't seem
to matter anymore for the core scanner. Also, state table size is of
marginal importance.

Using the information_schema Flex+Bison microbenchmark from Tom [1], I
tested removing most of the "fail" rules designed to avoid
backtracking ("decimalfail" is needed by PL/pgSQL). Below are the best
times (most runs within 1%), followed by postgres binary size. The
numbers are with Flex 2.5.35 on MacOS, no asserts or debugging
symbols.
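
For concreteness, the "fail" rules in question pair each quote-ending
pattern with an error pattern so the scanner never needs to back up.
The relevant definitions in scan.l look like this:

quote           '
quotestop       {quote}{whitespace}*
quotecontinue   {quote}{whitespace_with_newline}{quote}
quotefail       {quote}{whitespace}*"-"

Here {quotefail} exists only so that a quote followed by whitespace
and a single "-" (which might or might not start a comment) doesn't
force the lexer to back up.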

HEAD:
1.53s
7139132 bytes

HEAD minus "fail" rules (patch attached):
1.53s
6971204 bytes

Surprisingly, it has the same performance and a much smaller binary.
The size difference comes about because the width of the elements of
the yy_transition array is chosen based on the number of elements in
the array. Since there are now fewer than INT16_MAX state transitions,
the struct members shrink from 32 bits:

struct yy_trans_info
{
    flex_int32_t yy_verify;
    flex_int32_t yy_nxt;
};
static yyconst struct yy_trans_info yy_transition[37045] = ...

to 16 bits:

struct yy_trans_info
{
    flex_int16_t yy_verify;
    flex_int16_t yy_nxt;
};
static yyconst struct yy_trans_info yy_transition[31763] = ...
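
To see why the element width (rather than just the element count)
might matter, note that with -CF the generated scanner walks
yy_transition directly in its innermost loop, with yy_current_state
being a pointer into the array. Roughly paraphrased from flex's
generated code (not verbatim):

/* sketch of flex's -CF inner matching loop */
const struct yy_trans_info *trans;
YY_CHAR c;

for (c = YY_SC_TO_UI(*yy_cp);
     (trans = &yy_current_state[c])->yy_verify == c;
     c = YY_SC_TO_UI(*++yy_cp))
    yy_current_state += trans->yy_nxt;  /* yy_nxt is a relative offset */

Halving the width of yy_verify/yy_nxt halves the number of cache lines
this loop touches for a given set of transitions.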

To test if array size was the deciding factor, I tried bloating it by
essentially undoing commit a5ff502fcea. Doing so produced an array
with 62583 elements and 32-bit members, so nearly quadruple in size,
and it was still not much slower than HEAD:

HEAD minus "fail" rules, minus %xusend/%xuiend:
1.56s
7343932 bytes

While at it, I repeated the benchmark with different Flex flags:

HEAD, plus -Cf:
1.60s
6995788 bytes

HEAD, minus "fail" rules, plus -Cf:
1.59s
6979396 bytes

HEAD, plus -Cfe:
1.65s
6868804 bytes

So the Flex manual's recommendation to use -CF still holds true. It's
worth noting that using perfect hashing for keyword lookup (20%
faster) had a much bigger effect than switching from -Cfe to -CF (7%
faster).
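
For reference, the keyword-lookup change mentioned above replaced a
binary search over the keyword list with a generated perfect hash
function. A schematic of the idea (the names below are made up for
illustration, not the actual code):

#include <string.h>

#define NUM_KEYWORDS 440    /* order of magnitude only */
extern const char *const keyword_names[NUM_KEYWORDS];
extern int perfect_hash(const char *key, size_t len);  /* generated */

/*
 * The generated hash maps every keyword to a distinct slot, so one
 * hash plus one strcmp decides membership -- no search loop at all.
 */
static int
keyword_lookup(const char *word, size_t len)
{
    int h = perfect_hash(word, len);

    if (h < 0 || h >= NUM_KEYWORDS)
        return -1;              /* cannot be a keyword */
    return strcmp(keyword_names[h], word) == 0 ? h : -1;
}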

It would be nice to have confirmation to make sure I didn't err
somewhere, and to try a more real-world benchmark. (Also for the
moment I only have Linux on a virtual machine.) The regression tests
pass, but some comments are now wrong. If it's confirmed that
backtracking doesn't matter for recent Flex/hardware, disregarding it
would make maintenance of our scanners a bit easier.

[1] https://www.postgresql.org/message-id/14616.1558560331%40sss.pgh.pa.us

--
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment: remove-scanner-fail-rules.patch (8K)

Re: benchmarking Flex practices

Tom Lane-2
John Naylor <[hidden email]> writes:
> I decided to do some experiments with how we use Flex. The main
> takeaway is that backtracking, which we removed in 2005, doesn't seem
> to matter anymore for the core scanner. Also, state table size is of
> marginal importance.

Huh.  That's really interesting, because removing backtracking was a
demonstrable, significant win when we did it [1].  I wonder what has
changed?  I'd be prepared to believe that today's machines are more
sensitive to the amount of cache space eaten by the tables --- but that
idea seems contradicted by your result that the table size isn't
important.  (I'm wishing I'd documented the test case I used in 2005...)

> The size difference is because the size of the elements of the
> yy_transition array is constrained by the number of elements in the
> array. Since there are now fewer than INT16_MAX state transitions, the
> struct members go from 32 bit:
> static yyconst struct yy_trans_info yy_transition[37045] = ...
> to 16 bit:
> static yyconst struct yy_trans_info yy_transition[31763] = ...

Hm.  Smaller binary is definitely nice, but 31763 is close enough to
32768 that I'd have little faith in the optimization surviving for long.
Is there any way we could buy back some more transitions?

> It would be nice to have confirmation to make sure I didn't err
> somewhere, and to try a more real-world benchmark.

I don't see much wrong with using information_schema.sql as a parser/lexer
benchmark case.  We should try to confirm the results on other platforms
though.

                        regards, tom lane

[1] https://www.postgresql.org/message-id/8652.1116865895@...



Re: benchmarking Flex practices

Andres Freund
Hi,

On 2019-06-20 10:52:54 -0400, Tom Lane wrote:
> John Naylor <[hidden email]> writes:
> > It would be nice to have confirmation to make sure I didn't err
> > somewhere, and to try a more real-world benchmark.
>
> I don't see much wrong with using information_schema.sql as a parser/lexer
> benchmark case.  We should try to confirm the results on other platforms
> though.

Might be worth also testing with a more repetitive testcase to measure
both cache locality and branch prediction. I assume that with
information_schema there's enough variability that these effects play a
smaller role. And there are plenty of real-world cases where a *lot* of
very similar statements are parsed over and over. I'd probably just
measure the statements pgbench generates or such.

Greetings,

Andres Freund



Re: benchmarking Flex practices

John Naylor-2
On Fri, Jun 21, 2019 at 12:02 AM Andres Freund <[hidden email]> wrote:
> Might be worth also testing with a more repetitive testcase to measure
> both cache locality and branch prediction. I assume that with
> information_schema there's enough variability that these effects play a
> smaller role. And there's plenty real-world cases where there's a *lot*
> of very similar statements being parsed over and over. I'd probably just
> measure the statements pgbench generates or such.

I tried benchmarking with a query string containing just

BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = 1;
SELECT abalance FROM pgbench_accounts WHERE aid = 1;
UPDATE pgbench_tellers SET tbalance = tbalance + 1 WHERE tid = 1;
UPDATE pgbench_branches SET bbalance = bbalance + 1 WHERE bid = 1;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (1, 1, 1, 1, CURRENT_TIMESTAMP);
END;

repeated about 500 times. With this, backtracking is about 3% slower:

HEAD:
1.15s

patch:
1.19s

patch + huge array:
1.19s

That's possibly significant enough to be evidence for your assumption,
as well as to persuade us to keep things as they are.
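
For anyone wanting to reproduce this: the driver from [1] amounts to a
SQL-callable function that runs raw_parser() in a loop, so that
parsing dominates the profile. A minimal sketch, with the function
name and argument handling being my own invention:

#include "postgres.h"
#include "fmgr.h"
#include "parser/parser.h"
#include "utils/builtins.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(parse_loop);

/* Parse the given query string n times, discarding the results. */
Datum
parse_loop(PG_FUNCTION_ARGS)
{
    char   *query = text_to_cstring(PG_GETARG_TEXT_PP(0));
    int     n = PG_GETARG_INT32(1);

    for (int i = 0; i < n; i++)
        (void) raw_parser(query);

    PG_RETURN_VOID();
}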

On Thu, Jun 20, 2019 at 10:52 PM Tom Lane <[hidden email]> wrote:
> Huh.  That's really interesting, because removing backtracking was a
> demonstrable, significant win when we did it [1].  I wonder what has
> changed?  I'd be prepared to believe that today's machines are more
> sensitive to the amount of cache space eaten by the tables --- but that
> idea seems contradicted by your result that the table size isn't
> important.  (I'm wishing I'd documented the test case I used in 2005...)

It's possible that the code path taken when backtracking is better
predicted than it was 15 years ago, but my uneducated hunch is that
over those 15 years our Bison grammar has gotten much worse in cache
misses and branch prediction than the scanner has. That, plus the
recent keyword lookup optimization, might have caused parsing to
become completely dominated by Bison. If that's the case, the 3%
slowdown above could be a significant portion of scanning in
isolation.

> Hm.  Smaller binary is definitely nice, but 31763 is close enough to
> 32768 that I'd have little faith in the optimization surviving for long.
> Is there any way we could buy back some more transitions?

I tried quickly ripping out the unicode escape support entirely. It
builds with warnings, but the point is to just get the size -- that
produced an array with only 28428 elements, and that's keeping all the
no-backup rules intact. This might be unworkable and/or ugly, but I
wonder if it's possible to pull unicode escape handling into the
parsing stage, with "UESCAPE" being a keyword token that we have to
peek ahead to check for. I'll look for other rules that could be more
easily optimized, but I'm not terribly optimistic.
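
To sketch what pulling it into the parsing stage might look like --
this is a purely hypothetical grammar shape, not from any patch in
this thread -- the lexer would return the string body undecoded plus a
UESCAPE keyword token, and a production along these lines would do the
de-escaping:

/* hypothetical: USCONST is the raw U&'...' body; udeescape() is
 * likewise my invention, and $3 would need checking for length 1 */
UnicodeString:
        USCONST                  { $$ = udeescape($1, '\\'); }
        | USCONST UESCAPE SCONST { $$ = udeescape($1, $3[0]); }
    ;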

--
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: benchmarking Flex practices

John Naylor-2
I wrote:

> I'll look for other rules that could be more
> easily optimized, but I'm not terribly optimistic.

I found a possible other way to bring the size of the transition table
under 32k entries while keeping the existing no-backup rules in place:
Replace the "quotecontinue" rule with a new state. In the attached
draft patch, when Flex encounters a quote while inside any kind of
quoted string, it saves the current state and enters %xqs (think
'quotestop'). If it then sees {whitespace_with_newline}{quote}, it
reenters the previous state and continues to slurp the string,
otherwise, it throws back everything and returns the string it just
exited. Doing it this way is a bit uglier, but with some extra
commentary it might not be too bad.
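
In flex terms, the shape is roughly this (condensed from the attached
patch, with action bodies abbreviated):

<xb,xh,xq,xe,xus>{quote} {
        /* saw a quote; remember which string state we were in and
         * look ahead for a possible continuation */
        state_before = YYSTATE;
        BEGIN(xqs);
    }
<xqs>{whitespace_with_newline}{quote} {
        /* continuation found: resume scanning the same literal */
        BEGIN(state_before);
    }
<xqs>{other} |
<xqs><<EOF>> {
        /* no continuation: throw back the lookahead and finish the
         * literal according to the state we came from */
        yyless(0);
        /* ... BEGIN(INITIAL); return BCONST/XCONST/SCONST ... */
    }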

The array is now 30883 entries. That's still a bit close for comfort,
but shrinks the binary by 171kB on Linux x86-64 with Flex 2.6.4. The
bad news is I have these baffling backup states in my new rules:

State #133 is non-accepting -
 associated rule line numbers:
551 554 564
 out-transitions: [ \000-\377 ]
 jam-transitions: EOF []

State #162 is non-accepting -
 associated rule line numbers:
551 554 564
 out-transitions: [ \000-\377 ]
 jam-transitions: EOF []

2 backing up (non-accepting) states.

I already explicitly handle EOF, so I don't know what it's trying to
tell me. If it can be fixed while keeping the array size, I'll do
performance tests.

--
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment: v1-lexer-redo-quote-continuation.patch (7K)

Re: benchmarking Flex practices

John Naylor-2
I wrote:

> > I'll look for other rules that could be more
> > easily optimized, but I'm not terribly optimistic.
>
> I found a possible other way to bring the size of the transition table
> under 32k entries while keeping the existing no-backup rules in place:
> Replace the "quotecontinue" rule with a new state. In the attached
> draft patch, when Flex encounters a quote while inside any kind of
> quoted string, it saves the current state and enters %xqs (think
> 'quotestop'). If it then sees {whitespace_with_newline}{quote}, it
> reenters the previous state and continues to slurp the string,
> otherwise, it throws back everything and returns the string it just
> exited. Doing it this way is a bit uglier, but with some extra
> commentary it might not be too bad.
I had an epiphany and managed to get rid of the backup states.
Regression tests pass. The array is down to 30367 entries and the
binary is smaller by 172kB on Linux x86-64. Performance is identical
to master on both tests mentioned upthread. I'll clean this up and add
it to the commitfest.

--
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment: v2-lexer-redo-quote-continuation.patch (7K)

Re: benchmarking Flex practices

John Naylor-2
I wrote:

> > I found a possible other way to bring the size of the transition table
> > under 32k entries while keeping the existing no-backup rules in place:
> > Replace the "quotecontinue" rule with a new state. In the attached
> > draft patch, when Flex encounters a quote while inside any kind of
> > quoted string, it saves the current state and enters %xqs (think
> > 'quotestop'). If it then sees {whitespace_with_newline}{quote}, it
> > reenters the previous state and continues to slurp the string,
> > otherwise, it throws back everything and returns the string it just
> > exited. Doing it this way is a bit uglier, but with some extra
> > commentary it might not be too bad.
>
> I had an epiphany and managed to get rid of the backup states.
> Regression tests pass. The array is down to 30367 entries and the
> binary is smaller by 172kB on Linux x86-64. Performance is identical
> to master on both tests mentioned upthread. I'll clean this up and add
> it to the commitfest.
For the commitfest:

0001 is a small patch to remove some unneeded generality from the
current rules. This lowers the number of elements in the yy_transition
array from 37045 to 36201.

0002 is a cleaned-up version of the above, bringing the size down to 29521.

I haven't changed psqlscan.l or pgc.l, in case this approach is
changed or rejected.

With the two together, the binary is about 175kB smaller than on HEAD.

I also couldn't resist playing around with the idea upthread of
handling unicode escapes in parser.c, which further reduces the number
of states to 21068 and leaves some headroom for future additions
without going back to 32-bit types in the transition array. It mostly
works, but it's quite ugly and breaks the token position handling for
unicode escape syntax errors, so it's not in a state to share.

--
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment: v3-0001-Remove-some-unneeded-generality-from-the-core-Fle.patch (2K)
Attachment: v3-0002-Replace-the-Flex-quotestop-rules-with-a-new-exclu.patch (7K)

Re: benchmarking Flex practices

Tom Lane-2
John Naylor <[hidden email]> writes:
> 0001 is a small patch to remove some unneeded generality from the
> current rules. This lowers the number of elements in the yy_transition
> array from 37045 to 36201.

I don't particularly like 0001.  The two bits like this

-whitespace ({space}+|{comment})
+whitespace ({space}|{comment})

seem likely to create performance problems for runs of whitespace, in that
the lexer will now have to execute the associated action once per space
character not just once for the whole run.  Those actions are empty, but
I don't think flex optimizes for that, and it's really flex's per-action
overhead that I'm worried about.  Note the comment in the "Performance"
section of the flex manual:

    Another area where the user can increase a scanner's performance (and
    one that's easier to implement) arises from the fact that the longer
    the tokens matched, the faster the scanner will run.  This is because
    with long tokens the processing of most input characters takes place
    in the (short) inner scanning loop, and does not often have to go
    through the additional work of setting up the scanning environment
    (e.g., `yytext') for the action.

There are a bunch of higher-order productions that use "{whitespace}*",
which is surely a bit redundant given the contents of {whitespace}.
But maybe we could address that by replacing "{whitespace}*" with
"{opt_whitespace}" defined as

opt_whitespace ({space}*|{comment})

Not sure what impact if any that'd have on table size, but I'm quite sure
that {whitespace} was defined with an eye to avoiding unnecessary
lexer action cycles.
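
For reference, the whitespace-related definitions in scan.l are
approximately as follows (several of them appear verbatim in the patch
hunks quoted later in this thread):

space                   [ \t\n\r\f]
horiz_space             [ \t\f]
newline                 [\n\r]
non_newline             [^\n\r]
comment                 ("--"{non_newline}*)
whitespace              ({space}+|{comment})
special_whitespace      ({space}+|{comment}{newline})
horiz_whitespace        ({horiz_space}|{comment})
whitespace_with_newline ({horiz_whitespace}*{newline}{special_whitespace}*)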

As for the other two bits that are like

-<xe>. {
- /* This is only needed for \ just before EOF */
+<xe>\\ {

my recollection is that those productions are defined that way to avoid a
flex warning about not all possible input characters being accounted for
in the <xe> (resp. <xdolq>) state.  Maybe that warning is
flex-version-dependent, or maybe this was just a worry and not something
that actually produced a warning ... but I'm hesitant to change it.
If we ever did get to flex's default action, that action is to echo the
current input character to stdout, which would be Very Bad.

As far as I can see, the point of 0002 is to have just one set of
flex rules for the various variants of quotecontinue processing.
That sounds OK, though I'm a bit surprised it makes this much difference
in the table size. I would suggest that "state_before" needs a less
generic name (maybe "state_before_xqs"?) and more than no comment.
Possibly more to the point, it's not okay to have static state variables
in the core scanner, so that variable needs to be kept in yyextra.
(Don't remember offhand whether it's any more acceptable in the other
scanners.)

                        regards, tom lane



Re: benchmarking Flex practices

John Naylor-2
On Wed, Jul 3, 2019 at 5:35 AM Tom Lane <[hidden email]> wrote:

>
> John Naylor <[hidden email]> writes:
> > 0001 is a small patch to remove some unneeded generality from the
> > current rules. This lowers the number of elements in the yy_transition
> > array from 37045 to 36201.
>
> I don't particularly like 0001.  The two bits like this
>
> -whitespace             ({space}+|{comment})
> +whitespace             ({space}|{comment})
>
> seem likely to create performance problems for runs of whitespace, in that
> the lexer will now have to execute the associated action once per space
> character not just once for the whole run.

Okay.

> There are a bunch of higher-order productions that use "{whitespace}*",
> which is surely a bit redundant given the contents of {whitespace}.
> But maybe we could address that by replacing "{whitespace}*" with
> "{opt_whitespace}" defined as
>
> opt_whitespace          ({space}*|{comment})
>
> Not sure what impact if any that'd have on table size, but I'm quite sure
> that {whitespace} was defined with an eye to avoiding unnecessary
> lexer action cycles.

It turns out that {opt_whitespace} as defined above is not equivalent
to {whitespace}*, since the former is either a single comment or a
single run of zero or more whitespace chars (if I understand
correctly). Using {opt_whitespace} for the UESCAPE rules on top of
v3-0002, the regression tests pass, but queries like this fail with a
syntax error:

# select U&'d!0061t!+000061' uescape  --comment
'!';

There was in fact a substantial size reduction, though, so for
curiosity's sake I tried just replacing {whitespace}* with {space}* in
the UESCAPE rules, and the table shrank from 30367 (that's with 0002
only) to 24661.

> As for the other two bits that are like
>
> -<xe>.                  {
> -                                       /* This is only needed for \ just before EOF */
> +<xe>\\                 {
>
> my recollection is that those productions are defined that way to avoid a
> flex warning about not all possible input characters being accounted for
> in the <xe> (resp. <xdolq>) state.  Maybe that warning is
> flex-version-dependent, or maybe this was just a worry and not something
> that actually produced a warning ... but I'm hesitant to change it.
> If we ever did get to flex's default action, that action is to echo the
> current input character to stdout, which would be Very Bad.

FWIW, I tried Flex 2.5.35 and 2.6.4: neither produced warnings as-is,
but I did get a warning when I deleted either of those two rules. I'll
leave those changes out for now, since they were only good for ~500
fewer elements in the transition array.

> As far as I can see, the point of 0002 is to have just one set of
> flex rules for the various variants of quotecontinue processing.
> That sounds OK, though I'm a bit surprised it makes this much difference
> in the table size. I would suggest that "state_before" needs a less
> generic name (maybe "state_before_xqs"?) and more than no comment.
> Possibly more to the point, it's not okay to have static state variables
> in the core scanner, so that variable needs to be kept in yyextra.
> (Don't remember offhand whether it's any more acceptable in the other
> scanners.)

Ah yes, I got this idea from the ECPG scanner, which is not reentrant. Will fix.

--
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: benchmarking Flex practices

John Naylor-2
On Wed, Jul 3, 2019 at 5:35 AM Tom Lane <[hidden email]> wrote:
>
> As far as I can see, the point of 0002 is to have just one set of
> flex rules for the various variants of quotecontinue processing.
> That sounds OK, though I'm a bit surprised it makes this much difference
> in the table size. I would suggest that "state_before" needs a less
> generic name (maybe "state_before_xqs"?) and more than no comment.
> Possibly more to the point, it's not okay to have static state variables
> in the core scanner, so that variable needs to be kept in yyextra.

v4-0001 is basically the same as v3-0002, with the state variable in
yyextra. Since follow-on patches use it as well, I've named it
state_before_quote_stop. I failed to come up with a nicer short name.
With this applied, the transition table is reduced from 37045 to
30367. Since that's uncomfortably close to the 32k limit for 16 bit
members, I hacked away further at UESCAPE bloat.

0002 unifies xusend and xuiend by saving the state of xui as well.
This actually causes a performance regression, but it's more of a
refactoring patch to avoid having to create two additional start
conditions in 0003 (of course it could be done that way if desired,
but the savings won't be as great). In any case, the table is now down
to 26074.

0003 creates a separate start condition so that UESCAPE and the
expected quoted character after it are detected in separate states.
This allows us to use standard whitespace skipping techniques and also
to greatly simplify the uescapefail rule. The final size of the table
is 23696. Removing UESCAPE entirely results in 21860, so this is
likely about as compact as this feature can get.

Performance is very similar to HEAD. Parsing the information schema
might be a hair faster and pgbench-like queries with simple strings a
hair slower, but the difference seems within the noise of variation.
Parsing strings with UESCAPE likewise seems about the same.

--
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment: v4-0001-Replace-the-Flex-quotestop-rules-with-a-new-exclu.patch (8K)
Attachment: v4-0002-Unify-xuiend-and-xusend-into-a-single-start-condi.patch (7K)
Attachment: v4-0003-Use-separate-start-conditions-for-both-UESCAPE-an.patch (4K)

Re: benchmarking Flex practices

Tom Lane-2
John Naylor <[hidden email]> writes:
> [ v4 patches for trimming lexer table size ]

I reviewed this and it looks pretty solid.  One gripe I have is
that I think it's best to limit backup-prevention tokens such as
quotecontinuefail so that they match only exact prefixes of their
"success" tokens.  This seems clearer to me, and in at least some cases
it can save a few flex states.  The attached v5 patch does it like that
and gets us down to 22331 states (from 23696).  In some places it looks
like you did that to avoid writing an explicit "{other}" match rule for
an exclusive state, but I think it's better for readability and
separation of concerns to go ahead and have those explicit rules
(and it seems to make no difference table-size-wise).

I also made some cosmetic changes (mostly improving comments) and
smashed the patch series down to 1 patch, because I preferred to
review it that way and we're not really going to commit these
separately.

I did a little bit of portability testing, to the extent of verifying
that the oldest and newest Flex versions I have handy (2.5.33 and 2.6.4)
agree on the table size change and get through regression tests.  So
I think we should be good from that end.

We still need to propagate these changes into the psql and ecpg lexers,
but I assume you were waiting to agree on the core patch before touching
those.  If you're good with the changes I made here, have at it.

                        regards, tom lane


diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index e1cae85..899da09 100644
--- a/src/backend/parser/scan.l
+++ b/src/backend/parser/scan.l
@@ -168,12 +168,14 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
  *  <xd> delimited identifiers (double-quoted identifiers)
  *  <xh> hexadecimal numeric string
  *  <xq> standard quoted strings
+ *  <xqs> quote stop (detect continued strings)
  *  <xe> extended quoted strings (support backslash escape sequences)
  *  <xdolq> $foo$ quoted strings
  *  <xui> quoted identifier with Unicode escapes
- *  <xuiend> end of a quoted identifier with Unicode escapes, UESCAPE can follow
  *  <xus> quoted string with Unicode escapes
- *  <xusend> end of a quoted string with Unicode escapes, UESCAPE can follow
+ *  <xuend> end of a quoted string or identifier with Unicode escapes,
+ *    UESCAPE can follow
+ *  <xuchar> expecting escape character literal after UESCAPE
  *  <xeu> Unicode surrogate pair in extended quoted string
  *
  * Remember to add an <<EOF>> case whenever you add a new exclusive state!
@@ -185,12 +187,13 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
 %x xd
 %x xh
 %x xq
+%x xqs
 %x xe
 %x xdolq
 %x xui
-%x xuiend
 %x xus
-%x xusend
+%x xuend
+%x xuchar
 %x xeu
 
 /*
@@ -231,19 +234,18 @@ special_whitespace ({space}+|{comment}{newline})
 horiz_whitespace ({horiz_space}|{comment})
 whitespace_with_newline ({horiz_whitespace}*{newline}{special_whitespace}*)
 
+quote '
+/* If we see {quote} then {quotecontinue}, the quoted string continues */
+quotecontinue {whitespace_with_newline}{quote}
+
 /*
- * To ensure that {quotecontinue} can be scanned without having to back up
- * if the full pattern isn't matched, we include trailing whitespace in
- * {quotestop}.  This matches all cases where {quotecontinue} fails to match,
- * except for {quote} followed by whitespace and just one "-" (not two,
- * which would start a {comment}).  To cover that we have {quotefail}.
- * The actions for {quotestop} and {quotefail} must throw back characters
- * beyond the quote proper.
+ * {quotecontinuefail} is needed to avoid lexer backup when we fail to match
+ * {quotecontinue}.  It might seem that this could just be {whitespace}*,
+ * but if there's a dash after {whitespace_with_newline}, it must be consumed
+ * to see if there's another dash --- which would start a {comment} and thus
+ * allow continuation of the {quotecontinue} token.
  */
-quote '
-quotestop {quote}{whitespace}*
-quotecontinue {quote}{whitespace_with_newline}{quote}
-quotefail {quote}{whitespace}*"-"
+quotecontinuefail {whitespace}*"-"?
 
 /* Bit string
  * It is tempting to scan the string for only those characters
@@ -304,10 +306,15 @@ xdstop {dquote}
 xddouble {dquote}{dquote}
 xdinside [^"]+
 
-/* Unicode escapes */
-uescape [uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']{quote}
+/* Optional UESCAPE after a quoted string or identifier with Unicode escapes */
+uescape [uU][eE][sS][cC][aA][pP][eE]
+/* error rule to avoid backup */
+uescapefail [uU][eE][sS][cC][aA][pP]|[uU][eE][sS][cC][aA]|[uU][eE][sS][cC]|[uU][eE][sS]|[uU][eE]|[uU]
+
+/* escape character literal */
+uescchar {quote}[^']{quote}
 /* error rule to avoid backup */
-uescapefail [uU][eE][sS][cC][aA][pP][eE]{whitespace}*"-"|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*|[uU][eE][sS][cC][aA][pP]|[uU][eE][sS][cC][aA]|[uU][eE][sS][cC]|[uU][eE][sS]|[uU][eE]|[uU]
+uesccharfail {quote}[^']|{quote}
 
 /* Quoted identifier with Unicode escapes */
 xuistart [uU]&{dquote}
@@ -315,10 +322,6 @@ xuistart [uU]&{dquote}
 /* Quoted string with Unicode escapes */
 xusstart [uU]&{quote}
 
-/* Optional UESCAPE after a quoted string or identifier with Unicode escapes. */
-xustop1 {uescapefail}?
-xustop2 {uescape}
-
 /* error rule to avoid backup */
 xufailed [uU]&
 
@@ -476,21 +479,10 @@ other .
  startlit();
  addlitchar('b', yyscanner);
  }
-<xb>{quotestop} |
-<xb>{quotefail} {
- yyless(1);
- BEGIN(INITIAL);
- yylval->str = litbufdup(yyscanner);
- return BCONST;
- }
 <xh>{xhinside} |
 <xb>{xbinside} {
  addlit(yytext, yyleng, yyscanner);
  }
-<xh>{quotecontinue} |
-<xb>{quotecontinue} {
- /* ignore */
- }
 <xb><<EOF>> { yyerror("unterminated bit string literal"); }
 
 {xhstart} {
@@ -505,13 +497,6 @@ other .
  startlit();
  addlitchar('x', yyscanner);
  }
-<xh>{quotestop} |
-<xh>{quotefail} {
- yyless(1);
- BEGIN(INITIAL);
- yylval->str = litbufdup(yyscanner);
- return XCONST;
- }
 <xh><<EOF>> { yyerror("unterminated hexadecimal string literal"); }
 
 {xnstart} {
@@ -568,53 +553,71 @@ other .
  BEGIN(xus);
  startlit();
  }
-<xq,xe>{quotestop} |
-<xq,xe>{quotefail} {
- yyless(1);
- BEGIN(INITIAL);
+
+<xb,xh,xq,xe,xus>{quote} {
  /*
- * check that the data remains valid if it might have been
- * made invalid by unescaping any chars.
+ * When we are scanning a quoted string and see an end
+ * quote, we must look ahead for a possible continuation.
+ * If we don't see one, we know the end quote was in fact
+ * the end of the string.  To reduce the lexer table size,
+ * we use a single "xqs" state to do the lookahead for all
+ * types of strings.
  */
- if (yyextra->saw_non_ascii)
- pg_verifymbstr(yyextra->literalbuf,
-   yyextra->literallen,
-   false);
- yylval->str = litbufdup(yyscanner);
- return SCONST;
- }
-<xus>{quotestop} |
-<xus>{quotefail} {
- /* throw back all but the quote */
- yyless(1);
- /* xusend state looks for possible UESCAPE */
- BEGIN(xusend);
+ yyextra->state_before_quote_stop = YYSTATE;
+ BEGIN(xqs);
  }
-<xusend>{whitespace} {
- /* stay in xusend state over whitespace */
+<xqs>{quotecontinue} {
+ /*
+ * Found a quote continuation, so return to the in-quote
+ * state and continue scanning the literal.
+ */
+ BEGIN(yyextra->state_before_quote_stop);
  }
-<xusend><<EOF>> |
-<xusend>{other} |
-<xusend>{xustop1} {
- /* no UESCAPE after the quote, throw back everything */
+<xqs>{quotecontinuefail} |
+<xqs>{other} |
+<xqs><<EOF>> {
+ /*
+ * Failed to see a quote continuation.  Throw back
+ * everything after the end quote, and handle the string
+ * according to the state we were in previously.
+ */
  yyless(0);
- BEGIN(INITIAL);
- yylval->str = litbuf_udeescape('\\', yyscanner);
- return SCONST;
- }
-<xusend>{xustop2} {
- /* found UESCAPE after the end quote */
- BEGIN(INITIAL);
- if (!check_uescapechar(yytext[yyleng - 2]))
+
+ switch (yyextra->state_before_quote_stop)
  {
- SET_YYLLOC();
- ADVANCE_YYLLOC(yyleng - 2);
- yyerror("invalid Unicode escape character");
+ case xb:
+ BEGIN(INITIAL);
+ yylval->str = litbufdup(yyscanner);
+ return BCONST;
+ case xh:
+ BEGIN(INITIAL);
+ yylval->str = litbufdup(yyscanner);
+ return XCONST;
+ case xe:
+ /* fallthrough */
+ case xq:
+ BEGIN(INITIAL);
+
+ /*
+ * Check that the data remains valid if it
+ * might have been made invalid by unescaping
+ * any chars.
+ */
+ if (yyextra->saw_non_ascii)
+ pg_verifymbstr(yyextra->literalbuf,
+   yyextra->literallen,
+   false);
+ yylval->str = litbufdup(yyscanner);
+ return SCONST;
+ case xus:
+ /* xuend state looks for possible UESCAPE */
+ BEGIN(xuend);
+ break;
+ default:
+ yyerror("unhandled previous state in xqs");
  }
- yylval->str = litbuf_udeescape(yytext[yyleng - 2],
-   yyscanner);
- return SCONST;
  }
+
 <xq,xe,xus>{xqdouble} {
  addlitchar('\'', yyscanner);
  }
@@ -693,9 +696,6 @@ other .
  if (c == '\0' || IS_HIGHBIT_SET(c))
  yyextra->saw_non_ascii = true;
  }
-<xq,xe,xus>{quotecontinue} {
- /* ignore */
- }
 <xe>. {
  /* This is only needed for \ just before EOF */
  addlitchar(yytext[0], yyscanner);
@@ -770,53 +770,89 @@ other .
  return IDENT;
  }
 <xui>{dquote} {
- yyless(1);
- /* xuiend state looks for possible UESCAPE */
- BEGIN(xuiend);
+ /* xuend state looks for possible UESCAPE */
+ yyextra->state_before_quote_stop = YYSTATE;
+ BEGIN(xuend);
  }
-<xuiend>{whitespace} {
- /* stay in xuiend state over whitespace */
+
+<xuend,xuchar>{whitespace} {
+ /* stay in xuend or xuchar state over whitespace */
  }
-<xuiend><<EOF>> |
-<xuiend>{other} |
-<xuiend>{xustop1} {
+<xuend>{uescapefail} |
+<xuend>{other} |
+<xuend><<EOF>> {
  /* no UESCAPE after the quote, throw back everything */
- char   *ident;
- int identlen;
-
  yyless(0);
 
- BEGIN(INITIAL);
- if (yyextra->literallen == 0)
- yyerror("zero-length delimited identifier");
- ident = litbuf_udeescape('\\', yyscanner);
- identlen = strlen(ident);
- if (identlen >= NAMEDATALEN)
- truncate_identifier(ident, identlen, true);
- yylval->str = ident;
- return IDENT;
+ if (yyextra->state_before_quote_stop == xus)
+ {
+ BEGIN(INITIAL);
+ yylval->str = litbuf_udeescape('\\', yyscanner);
+ return SCONST;
+ }
+ else if (yyextra->state_before_quote_stop == xui)
+ {
+ char   *ident;
+ int identlen;
+
+ BEGIN(INITIAL);
+ if (yyextra->literallen == 0)
+ yyerror("zero-length delimited identifier");
+ ident = litbuf_udeescape('\\', yyscanner);
+ identlen = strlen(ident);
+ if (identlen >= NAMEDATALEN)
+ truncate_identifier(ident, identlen, true);
+ yylval->str = ident;
+ return IDENT;
+ }
+ else
+ yyerror("unhandled previous state in xuend");
  }
-<xuiend>{xustop2} {
+<xuend>{uescape} {
  /* found UESCAPE after the end quote */
- char   *ident;
- int identlen;
-
- BEGIN(INITIAL);
- if (yyextra->literallen == 0)
- yyerror("zero-length delimited identifier");
+ BEGIN(xuchar);
+ }
+<xuchar>{uescchar} {
+ /* found escape character literal after UESCAPE */
  if (!check_uescapechar(yytext[yyleng - 2]))
  {
  SET_YYLLOC();
  ADVANCE_YYLLOC(yyleng - 2);
  yyerror("invalid Unicode escape character");
  }
- ident = litbuf_udeescape(yytext[yyleng - 2], yyscanner);
- identlen = strlen(ident);
- if (identlen >= NAMEDATALEN)
- truncate_identifier(ident, identlen, true);
- yylval->str = ident;
- return IDENT;
+
+ if (yyextra->state_before_quote_stop == xus)
+ {
+ BEGIN(INITIAL);
+ yylval->str = litbuf_udeescape(yytext[yyleng - 2],
+   yyscanner);
+ return SCONST;
+ }
+ else if (yyextra->state_before_quote_stop == xui)
+ {
+ char   *ident;
+ int identlen;
+
+ BEGIN(INITIAL);
+ if (yyextra->literallen == 0)
+ yyerror("zero-length delimited identifier");
+ ident = litbuf_udeescape(yytext[yyleng - 2], yyscanner);
+ identlen = strlen(ident);
+ if (identlen >= NAMEDATALEN)
+ truncate_identifier(ident, identlen, true);
+ yylval->str = ident;
+ return IDENT;
+ }
+ else
+ yyerror("unhandled previous state in xuchar");
+ }
+<xuchar>{uesccharfail} |
+<xuchar>{other} |
+<xuchar><<EOF>> {
+ SET_YYLLOC();
+ yyerror("missing or invalid Unicode escape character");
  }
+
 <xd,xui>{xddouble} {
  addlitchar('"', yyscanner);
  }
diff --git a/src/include/parser/scanner.h b/src/include/parser/scanner.h
index 731a2bd..72c2a28 100644
--- a/src/include/parser/scanner.h
+++ b/src/include/parser/scanner.h
@@ -99,6 +99,7 @@ typedef struct core_yy_extra_type
  int literallen; /* actual current string length */
  int literalalloc; /* current allocated buffer size */
 
+ int state_before_quote_stop; /* start cond. before end quote */
  int xcdepth; /* depth of nesting in slash-star comments */
  char   *dolqstart; /* current $foo$ quote start string */
 

Re: benchmarking Flex practices

John Naylor-2
On Wed, Jul 10, 2019 at 3:15 AM Tom Lane <[hidden email]> wrote:

>
> John Naylor <[hidden email]> writes:
> > [ v4 patches for trimming lexer table size ]
>
> I reviewed this and it looks pretty solid.  One gripe I have is
> that I think it's best to limit backup-prevention tokens such as
> quotecontinuefail so that they match only exact prefixes of their
> "success" tokens.  This seems clearer to me, and in at least some cases
> it can save a few flex states.  The attached v5 patch does it like that
> and gets us down to 22331 states (from 23696).  In some places it looks
> like you did that to avoid writing an explicit "{other}" match rule for
> an exclusive state, but I think it's better for readability and
> separation of concerns to go ahead and have those explicit rules
> (and it seems to make no difference table-size-wise).
Looks good to me.

> We still need to propagate these changes into the psql and ecpg lexers,
> but I assume you were waiting to agree on the core patch before touching
> those.  If you're good with the changes I made here, have at it.

I just made a couple additional cosmetic adjustments that made sense
when diff'ing with the other scanners. Make check-world passes. Some
notes:

The pre-existing ecpg var "state_before" was a bit confusing when
combined with the new var "state_before_quote_stop", and the former is
also used with C-comments, so I decided to go with
"state_before_lit_start" and "state_before_lit_stop". Even though
comments aren't literals, it's less of a stretch than referring to
quotes. To keep things consistent, I went with the latter var in psql
and core.

To get the regression tests to pass, I had to add this:

 psql_scan_in_quote(PsqlScanState state)
 {
- return state->start_state != INITIAL;
+ return state->start_state != INITIAL &&
+ state->start_state != xqs;
 }

...otherwise with parens we sometimes don't get the right prompt and
we get empty lines echoed. Adding xuend and xuchar here didn't seem to
make a difference. There might be something subtle I'm missing, so I
thought I'd mention it.

With the unicode escape rules brought over, the diff to the ecpg
scanner is much cleaner now. The diff for C-comment rules were still
pretty messy in comparison, so I made an attempt to clean that up in
0002. A bit off-topic, but I thought I should offer that while it was
fresh in my head.

--
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment: v6-0001-Reduce-the-number-of-states-in-the-core-scanner-t.patch (48K)
Attachment: v6-0002-Merge-ECPG-scanner-states-for-C-style-comments.patch (6K)

Re: benchmarking Flex practices

Tom Lane-2
John Naylor <[hidden email]> writes:
> The pre-existing ecpg var "state_before" was a bit confusing when
> combined with the new var "state_before_quote_stop", and the former is
> also used with C-comments, so I decided to go with
> "state_before_lit_start" and "state_before_lit_stop". Even though
> comments aren't literals, it's less of a stretch than referring to
> quotes. To keep things consistent, I went with the latter var in psql
> and core.

Hm, what do you think of "state_before_str_stop" instead?  It seems
to me that both "quote" and "lit" are pretty specific terms, so
maybe we need something a bit vaguer.

> To get the regression tests to pass, I had to add this:
>  psql_scan_in_quote(PsqlScanState state)
>  {
> - return state->start_state != INITIAL;
> + return state->start_state != INITIAL &&
> + state->start_state != xqs;
>  }
> ...otherwise with parens we sometimes don't get the right prompt and
> we get empty lines echoed. Adding xuend and xuchar here didn't seem to
> make a difference. There might be something subtle I'm missing, so I
> thought I'd mention it.

I think you would see a difference if the regression tests had any cases
with blank lines between a Unicode string/ident and the associated
UESCAPE and escape-character literal.

While poking at that, I also came across this unhappiness:

regression=# select u&'foo' uescape 'bogus';
regression'#

that is, psql thinks we're still in a literal at this point.  That's
because the uesccharfail rule eats "'b" and then we go to INITIAL
state, so that consuming the last "'" puts us back in a string state.
The backend would have thrown an error before parsing as far as the
incomplete literal, so it doesn't care (or probably not, anyway),
but that's not an option for psql.

My first reaction as to how to fix this was to rip the xuend and
xuchar states out of psql, and let it just lex UESCAPE as an
identifier and the escape-character literal like any other literal.
psql doesn't need to account for the escape character's effect on
the meaning of the Unicode literal, so it doesn't have any need to
lex the sequence as one big token.  I think the same is true of ecpg
though I've not looked really closely.

However, my second reaction was that maybe you were on to something
upthread when you speculated about postponing de-escaping of
Unicode literals into the grammar.  If we did it like that then
we would not need to have this difference between the backend and
frontend lexers, and we'd not have to worry about what
psql_scan_in_quote should do about the whitespace before and after
UESCAPE, either.

So I'm feeling like maybe we should experiment to see what that
solution looks like, before we commit to going in this direction.
What do you think?


> With the unicode escape rules brought over, the diff to the ecpg
> scanner is much cleaner now. The diff for C-comment rules were still
> pretty messy in comparison, so I made an attempt to clean that up in
> 0002. A bit off-topic, but I thought I should offer that while it was
> fresh in my head.

I didn't really review this, but it looked like a fairly plausible
change of the same ilk, ie combine rules by adding memory of the
previous start state.

                        regards, tom lane