pgbench's expression parsing & negative numbers

classic Classic list List threaded Threaded
17 messages Options
Reply | Threaded
Open this post in threaded view
|

pgbench's expression parsing & negative numbers

Andres Freund
Hi,

working on overflow correctness in pg I noticed that pgbench isn't quite
there.  I assume it won't matter terribly often, but the way it parses
integers makes it incorrect for, at least, the negativemost number.

It directly parses the integer input using:
{digit}+ {
                                        yylval->ival = strtoint64(yytext);
                                        return INTEGER_CONST;
                                }

but that unfortunately means that the sign is no included in the call to
strtoint64. Which in turn means you can't correctly parse PG_INT64_MIN,
because that's not representable as a positive number on a two's
complement machine (i.e. everywhere).  Preliminary testing shows that
that can relatively easily fixed by just prefixing [-+]? to the relevant
exprscan.l rules.  But it might also not be a bad idea to simply defer
parsing of the number until exprparse.y has its hand on the number?

There's plenty other things wrong with overflow handling too, especially
evalFunc() basically just doesn't care.  I'll come up with a patch for
that sometime soon.

A third complaint I have is that the tap tests are pretty hard to parse
for humans.

Greetings,

Andres Freund

Reply | Threaded
Open this post in threaded view
|

Re: pgbench's expression parsing & negative numbers

Fabien COELHO-3

Hello Andres,

> working on overflow correctness in pg I noticed that pgbench isn't quite
> there.  I assume it won't matter terribly often, but the way it parses
> integers makes it incorrect for, at least, the negativemost number.
>
> It directly parses the integer input using:
> {digit}+ {
> yylval->ival = strtoint64(yytext);
> return INTEGER_CONST;
> }
>
> but that unfortunately means that the sign is no included in the call to
> strtoint64. Which in turn means you can't correctly parse PG_INT64_MIN,

Indeed. For a benchmarking script which generates a pseudo random load I
do not see as a big issue, but maybe I'm missing some use case.

> because that's not representable as a positive number on a two's
> complement machine (i.e. everywhere).  Preliminary testing shows that
> that can relatively easily fixed by just prefixing [-+]? to the relevant
> exprscan.l rules.

Beware that it must handle the unary/binary plus/minus case consistently:

   "-123" vs "a -123" vs "a + -123"

> But it might also not be a bad idea to simply defer parsing of the
> number until exprparse.y has its hand on the number?

It is usually simpler to let the lexer handle constants.

> There's plenty other things wrong with overflow handling too, especially
> evalFunc() basically just doesn't care.

Sure.

There are some overflow checking with div and double to int cast, which
were added because of previous complaints, but which are not very useful
to me.

ISTM that it is a voluntary feature, which handles int64 operations
with a modulo overflow like C. Generating exceptions on such overflows
does not look very attractive to me.

>  I'll come up with a patch for that sometime soon.

I wish you could provide some motivation: why does it matter much to a
benchmarking script to handle overflows with more case?

> A third complaint I have is that the tap tests are pretty hard to parse
> for humans.

Probably.

Perl is pretty hard to humans to start with:-)

There is a lot of code factorization to cram many tests together, so that
one test is more or less one line and there are hundreds of them. Not sure
it would help if it was expanded.

There are a lot of regular expressions to check outputs, which does not
help.

I wanted to have the pgbench scripts outside but this has been rejected,
so the tested scripts themselves are in the middle of the perl code, which
I think has degraded the readability significantly.

--
Fabien.

Reply | Threaded
Open this post in threaded view
|

Re: pgbench's expression parsing & negative numbers

Andres Freund
On 2017-12-12 07:21:21 +0100, Fabien COELHO wrote:

>
> Hello Andres,
>
> > working on overflow correctness in pg I noticed that pgbench isn't quite
> > there.  I assume it won't matter terribly often, but the way it parses
> > integers makes it incorrect for, at least, the negativemost number.
> >
> > It directly parses the integer input using:
> > {digit}+ {
> > yylval->ival = strtoint64(yytext);
> > return INTEGER_CONST;
> > }
> >
> > but that unfortunately means that the sign is no included in the call to
> > strtoint64. Which in turn means you can't correctly parse PG_INT64_MIN,
>
> Indeed. For a benchmarking script which generates a pseudo random load I do
> not see as a big issue, but maybe I'm missing some use case.

IDK, behaving correctly seems like a good idea. Also far from impossible
that that code gets propagated further.  Doesn't seem like an entirely
crazy idea that somebody actually wants to benchmark boundary cases,
which might be slower and such.

> > because that's not representable as a positive number on a two's
> > complement machine (i.e. everywhere).  Preliminary testing shows that
> > that can relatively easily fixed by just prefixing [-+]? to the relevant
> > exprscan.l rules.
>
> Beware that it must handle the unary/binary plus/minus case consistently:
>
>   "-123" vs "a -123" vs "a + -123"

Which is a lot easier to solve on the parser side of things...


> > But it might also not be a bad idea to simply defer parsing of the
> > number until exprparse.y has its hand on the number?
>
> It is usually simpler to let the lexer handle constants.


> > There's plenty other things wrong with overflow handling too, especially
> > evalFunc() basically just doesn't care.
>
> Sure.
>
> There are some overflow checking with div and double to int cast, which were
> added because of previous complaints, but which are not very useful to me.

I think handling it inconsistently is the worst of all worlds.


> ISTM that it is a voluntary feature, which handles int64 operations with a
> modulo overflow like C. Generating exceptions on such overflows does not
> look very attractive to me.

No, it's not really voluntary. Signed overflow isn't actually defined
behaviour in C. We kinda make it so on some platforms, but that's not
really a good idea.


> >  I'll come up with a patch for that sometime soon.
>
> I wish you could provide some motivation: why does it matter much to a
> benchmarking script to handle overflows with more case?

a) I want to be able to compile without -fwrapv in the near
   future. Which means you can't have signed overflows, because they're
   undefined behaviour in C.
b) I want to be able to compile with -ftrapv /
   -fsanitize=signed-integer-overflow, to be sure code is actually
   correct. Currently pgbench crashes with that.
c) Randomly handling things in some but not all places is a bad idea.


> > A third complaint I have is that the tap tests are pretty hard to parse
> > for humans.
>
> Probably.
>
> Perl is pretty hard to humans to start with:-)

Meh.


> There is a lot of code factorization to cram many tests together, so that
> one test is more or less one line and there are hundreds of them. Not sure
> it would help if it was expanded.

I don't think expanding it is really a problem, I think they're just
largely not well documented, formatted. With some effort the tests could
be a lot easier to read.


Greetings,

Andres Freund

Reply | Threaded
Open this post in threaded view
|

Re: pgbench's expression parsing & negative numbers

Fabien COELHO-3

Hello Andres,

>> There are some overflow checking with div and double to int cast, which were
>> added because of previous complaints, but which are not very useful to me.
>
> I think handling it inconsistently is the worst of all worlds.

Hmmm... I cannot say that inconsistency is a good thing, that would not
be consistent:-)

My 0.02€:

  - I do not think that updating pgbench arithmetic for managing integer
    overflows is worth Andres Freund time. My guess is that most
    script would not trigger client-side overflows, so the change would
    be a no-op in practice.

  - I think that pgbench has more important defects/missing features, to
    be fixed more urgently. Given the time patches spend in the cf queue,
    obviously committers disagree on this one:-)

Now ISTM that it does not harm anything to add such a feature, so fine
with me. Maybe the global compiler option removal is worth the effort.

--
Fabien.
Reply | Threaded
Open this post in threaded view
|

Re: pgbench's expression parsing & negative numbers

Andres Freund
Hi,

On 2017-12-14 10:41:04 +0100, Fabien COELHO wrote:
>  - I do not think that updating pgbench arithmetic for managing integer
>    overflows is worth Andres Freund time. My guess is that most
>    script would not trigger client-side overflows, so the change would
>    be a no-op in practice.

It might not be if you view it in isolation (although I'm not
convinced). The problem is that it has cost beyond pgbench. Due to
pgbench's overflow handling I can't run make check on a build that has
-ftrapv, which found several bugs already.

I'd be happy if somebody else would tackle the issue, but I don't quite
see it happening...

Greetings,

Andres Freund

Reply | Threaded
Open this post in threaded view
|

Re: pgbench's expression parsing & negative numbers

Fabien COELHO-3

Hello,

>>  - I do not think that updating pgbench arithmetic for managing integer
>>    overflows is worth Andres Freund time. My guess is that most
>>    script would not trigger client-side overflows, so the change would
>>    be a no-op in practice.
>
> It might not be if you view it in isolation (although I'm not
> convinced). The problem is that it has cost beyond pgbench. Due to
> pgbench's overflow handling

Lack of?

> I can't run make check on a build that has -ftrapv, which found several
> bugs already.

Hmmm. You suggest that integer overflows do occur when running pgbench.

Indeed, this tap test case: "\set maxint debug(:minint - 1)"

Otherwise, some stat counters may overflow on very long runs? Although
overflowing a int64 takes some time...

> I'd be happy if somebody else would tackle the issue, but I don't quite
> see it happening...

I must admit that it is not very high on my may-do list. I'll review it if
it appears, though.

--
Fabien.

Reply | Threaded
Open this post in threaded view
|

Re: pgbench's expression parsing & negative numbers

Fabien COELHO-3
In reply to this post by Andres Freund

Hello Andres,

> working on overflow correctness in pg I noticed that pgbench isn't quite
> there.

Indeed.

> I assume it won't matter terribly often, but the way it parses integers
> makes it incorrect for, at least, the negativemost number. [...] but
> that unfortunately means that the sign is no included in the call to
> strtoint64.

The strtoint64 function is indeed a mess. It does not really report errors
(more precisely, an error message is printed out, but the execution goes
on nevertheless...).

> Which in turn means you can't correctly parse PG_INT64_MIN,
> because that's not representable as a positive number on a two's
> complement machine (i.e. everywhere).  Preliminary testing shows that
> that can relatively easily fixed by just prefixing [-+]? to the relevant
> exprscan.l rules.

I'm not sure I get it, because then "1 -2" would be scanned as "int
signed_int", which requires to add some fine rules to the parser to handle
that as an addition, and I would be unclear about the priority handling,
eg "1 -2 * 3". But I agree that it can be made to work with transfering
the conversion to the parser and maybe some careful tweaking there.

> But it might also not be a bad idea to simply defer
> parsing of the number until exprparse.y has its hand on the number?
>
> There's plenty other things wrong with overflow handling too, especially
> evalFunc() basically just doesn't care.

Indeed.

Some overflow issues are easy to fix with the existing pg_*_s64_overflow
macros that you provided. It should also handle double2int64 overflows.

I have tried to trigger errors with a -fsanitize=signed-integer-overflow
compilation, without much success, but ISTM that you suggested that
pgbench does not work with that in another thread. Could you be more
precise?

> I'll come up with a patch for that sometime soon.

ISTM that you have not sent any patch on the subject, otherwise I would
have reviewed it. Maybe I could do one some time later, unless you think
that I should not.

--
Fabien.

Reply | Threaded
Open this post in threaded view
|

Re: pgbench's expression parsing & negative numbers

Fabien COELHO-3

Hello Andres,

>> I'll come up with a patch for that sometime soon.
>
> ISTM that you have not sent any patch on the subject, otherwise I would
> have reviewed it. Maybe I could do one some time later, unless you think
> that I should not.

Here is a patch which detects pgbench overflows on int & double constants,
and on integer operators.

--
Fabien.

pgbench-overflow-2.patch (17K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: pgbench's expression parsing & negative numbers

Fabien COELHO-3

>>> I'll come up with a patch for that sometime soon.
>>
>> ISTM that you have not sent any patch on the subject, otherwise I would
>> have reviewed it. Maybe I could do one some time later, unless you think
>> that I should not.
>
> Here is a patch which detects pgbench overflows on int & double constants,
> and on integer operators.

... it but forgot to handle parsing min int, which was the initial focus
of this thread.

This patch does that as well by handling it as the special case between
lexer & parser (the issue being that 9223372036854775808 cannot be lexed
as an standard integer, as it is too large, and -9223372036854775808 is
really two tokens, so must be managed from the parser).

--
Fabien.

pgbench-overflow-3.patch (19K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: pgbench's expression parsing & negative numbers

Ibrar Ahmed-4
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            not tested

Patch does not apply cleanly on the master branch, anyways I managed that.  Patch work according to specs, and no issue found.
The only minor nit is that you can keep the full comments of function strtoint64

/*
 * If not errorOK, an error message is printed out.
 * If errorOK is true, just return "false" for bad input.
 */
Reply | Threaded
Open this post in threaded view
|

Re: pgbench's expression parsing & negative numbers

Fabien COELHO-3

Hello,

> The following review has been posted through the commitfest application:
> make installcheck-world:  not tested
> Implements feature:       tested, passed
> Spec compliant:           tested, passed
> Documentation:            not tested
>
> Patch does not apply cleanly on the master branch, anyways I managed that.  Patch work according to specs, and no issue found.
> The only minor nit is that you can keep the full comments of function strtoint64
>
> /*
> * If not errorOK, an error message is printed out.
> * If errorOK is true, just return "false" for bad input.
> */
Thanks for the review.

Attached is a v4, with improved comments on strtoint64 as you requested.
I also added 2 "unlikely" compiler directives.

--
Fabien.

pgbench-overflow-4.patch (20K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: pgbench's expression parsing & negative numbers

Ibrar Ahmed-4
Patch is good to go from my side.

The new status of this patch is: Ready for Committer
Reply | Threaded
Open this post in threaded view
|

Re: pgbench's expression parsing & negative numbers

Fabien COELHO-3

Hello Ibrar,

> The new status of this patch is: Ready for Committer

Ok, thanks. Let's see what committers think about it.

Andres, are you still interested in overflow detection and handling?

--
Fabien.

Reply | Threaded
Open this post in threaded view
|

Re: pgbench's expression parsing & negative numbers

Andres Freund
Hi,

On 2018-08-16 17:28:06 +0200, Fabien COELHO wrote:
> > The new status of this patch is: Ready for Committer
>
> Ok, thanks. Let's see what committers think about it.
>
> Andres, are you still interested in overflow detection and handling?

Yes.  I'll try to look at it at some point not too far away.

- Andres

Reply | Threaded
Open this post in threaded view
|

Re: pgbench's expression parsing & negative numbers

Andres Freund
In reply to this post by Fabien COELHO-3
Hi,

On 2018-08-10 10:24:29 +0200, Fabien COELHO wrote:

> diff --git a/src/bin/pgbench/exprscan.l b/src/bin/pgbench/exprscan.l
> index 5c1bd88128..e8faea3857 100644
> --- a/src/bin/pgbench/exprscan.l
> +++ b/src/bin/pgbench/exprscan.l
> @@ -191,16 +191,26 @@ notnull [Nn][Oo][Tt][Nn][Uu][Ll][Ll]
>   yylval->bval = false;
>   return BOOLEAN_CONST;
>   }
> +"9223372036854775808" {
> + /* yuk: special handling for minint */
> + return MAXINT_PLUS_ONE_CONST;
> + }

Yikes, do we really need this?  If we dealt with - in the lexer, we
shouldn't need it, no?


>  /*
>   * strtoint64 -- convert a string to 64-bit integer
>   *
> - * This function is a modified version of scanint8() from
> + * This function is a slightly modified version of scanint8() from
>   * src/backend/utils/adt/int8.c.
> + *
> + * The function returns whether the conversion worked, and if so
> + * "*result" is set to the result.
> + *
> + * If not errorOK, an error message is also printed out on errors.
>   */
> -int64
> -strtoint64(const char *str)
> +bool
> +strtoint64(const char *str, bool errorOK, int64 *result)
>  {
>   const char *ptr = str;
> - int64 result = 0;
> - int sign = 1;
> + int64 tmp = 0;
> + bool neg = false;
>  
>   /*
>   * Do our own scan, rather than relying on sscanf which might be broken
>   * for long long.
> + *
> + * As INT64_MIN can't be stored as a positive 64 bit integer, accumulate
> + * value as a negative number.
>   */
>  
>   /* skip leading spaces */
> @@ -650,46 +660,80 @@ strtoint64(const char *str)
>   if (*ptr == '-')
>   {
>   ptr++;
> -
> - /*
> - * Do an explicit check for INT64_MIN.  Ugly though this is, it's
> - * cleaner than trying to get the loop below to handle it portably.
> - */
> - if (strncmp(ptr, "9223372036854775808", 19) == 0)
> - {
> - result = PG_INT64_MIN;
> - ptr += 19;
> - goto gotdigits;
> - }
> - sign = -1;
> + neg = true;
>   }
>   else if (*ptr == '+')
>   ptr++;
>  
>   /* require at least one digit */
> - if (!isdigit((unsigned char) *ptr))
> - fprintf(stderr, "invalid input syntax for integer: \"%s\"\n", str);
> + if (unlikely(!isdigit((unsigned char) *ptr)))
> + goto invalid_syntax;
>  
>   /* process digits */
>   while (*ptr && isdigit((unsigned char) *ptr))
>   {
> - int64 tmp = result * 10 + (*ptr++ - '0');
> + int8 digit = (*ptr++ - '0');
>  
> - if ((tmp / 10) != result) /* overflow? */
> - fprintf(stderr, "value \"%s\" is out of range for type bigint\n", str);
> - result = tmp;
> + if (unlikely(pg_mul_s64_overflow(tmp, 10, &tmp)) ||
> + unlikely(pg_sub_s64_overflow(tmp, digit, &tmp)))
> + goto out_of_range;
>   }
>  
> -gotdigits:
> -
>   /* allow trailing whitespace, but not other trailing chars */
>   while (*ptr != '\0' && isspace((unsigned char) *ptr))
>   ptr++;
>  
> - if (*ptr != '\0')
> - fprintf(stderr, "invalid input syntax for integer: \"%s\"\n", str);
> + if (unlikely(*ptr != '\0'))
> + goto invalid_syntax;
>  
> - return ((sign < 0) ? -result : result);
> + if (!neg)
> + {
> + if (unlikely(tmp == PG_INT64_MIN))
> + goto out_of_range;
> + tmp = -tmp;
> + }
> +
> + *result = tmp;
> + return true;
> +
> +out_of_range:
> + if (!errorOK)
> + fprintf(stderr,
> + "value \"%s\" is out of range for type bigint\n", str);
> + return false;
> +
> +invalid_syntax:
> + /* this can't happen here, function called on int-looking strings. */

This comment doesn't strike me as a good idea, it's almost guaranteed to
be outdated at some point.

> + if (!errorOK)
> + fprintf(stderr,
> + "invalid input syntax for type bigint: \"%s\"\n",str);
> + return false;
> +}


Duplicating these routines is pretty ugly.  I really wish we had more
infrastructure to avoid this (in particularly a portable way to report
an error and jump out).  But probably out of scope for this patches?

> +/* convert string to double, detecting overflows/underflows */
> +bool
> +strtodouble(const char *str, bool errorOK, double *dv)
> +{
> + char *end;
> +
> + *dv = strtod(str, &end);
> +
> + if (unlikely(errno != 0))
> + {
> + if (!errorOK)
> + fprintf(stderr,
> + "value \"%s\" is out of range for type double\n", str);
> + return false;
> + }
> +
> + if (unlikely(end == str || *end != '\0'))
> + {
> + if (!errorOK)
> + fprintf(stderr,
> + "invalid input syntax for type double: \"%s\"\n",str);
> + return false;
> + }
> + return true;
>  }

Not sure I see much point in the unlikelys here, contrasting to the ones
in the backend (and the copy for the frontend) it's extremely unlikley
anything like this will ever be close to a bottleneck.  In general, I'd
strongly suggest not placing unlikelys in non-critical codepaths -
they're way too often wrongly estimated.

Greetings,

Andres Freund

Reply | Threaded
Open this post in threaded view
|

Re: pgbench's expression parsing & negative numbers

Fabien COELHO-3

Hello Andres,

>> +"9223372036854775808" {
>> + /* yuk: special handling for minint */
>> + return MAXINT_PLUS_ONE_CONST;
>> + }
>
> Yikes, do we really need this?  If we dealt with - in the lexer, we
> shouldn't need it, no?

I'm not sure how to handle unary minus constant in the lexer, given that
the same '-' character is also used as minus binary operator.

The proposed solution is functional and has a reduce overall impact (one
lexer and one parser rules added), so it looks good to me.

Probably other approaches are possible.

>> + /* this can't happen here, function called on int-looking strings. */
>
> This comment doesn't strike me as a good idea, it's almost guaranteed to
> be outdated at some point.

It is valid now, and it can be removed anyway.

> [...] Duplicating these routines is pretty ugly.

Sure.

> I really wish we had more infrastructure to avoid this (in particularly
> a portable way to report an error and jump out).  But probably out of
> scope for this patches?

Yes.

Devising an appropriate client-side error handling/reporting
infrastructure is a non trivial tasks, and would cost more than a few
duplicate functions. "fprintf(stderr + return/exit" currently does the job
with minimal hassle. I do not think that this patch is the right one to
change how error are handle in postgres client applications.

>> + if (unlikely(end == str || *end != '\0'))

> Not sure I see much point in the unlikelys here, contrasting to the ones
> in the backend (and the copy for the frontend) it's extremely unlikley
> anything like this will ever be close to a bottleneck.  In general, I'd
> strongly suggest not placing unlikelys in non-critical codepaths -
> they're way too often wrongly estimated.

I put "unlikely" where I really thought it made sense. I do not know when
they would have an actual performance impact, but I appreciate that they
convey information to the reader of the code, telling that this path is
expected not to be taken.

It can be removed anyway if you do not like it.

--
Fabien.

Reply | Threaded
Open this post in threaded view
|

Re: pgbench's expression parsing & negative numbers

Andres Freund
Hi,

On 2018-09-26 22:38:41 +0200, Fabien COELHO wrote:

> > > +"9223372036854775808" {
> > > + /* yuk: special handling for minint */
> > > + return MAXINT_PLUS_ONE_CONST;
> > > + }
> >
> > Yikes, do we really need this?  If we dealt with - in the lexer, we
> > shouldn't need it, no?
>
> I'm not sure how to handle unary minus constant in the lexer, given that the
> same '-' character is also used as minus binary operator.

True. I think the nicer fix than what you've done here is to instead
simply always store the number, as coming from the lexer, as the
negative number.  We do similarly in a number of places.  I've gone with
yours for now.

> > > + /* this can't happen here, function called on int-looking strings. */
> >
> > This comment doesn't strike me as a good idea, it's almost guaranteed to
> > be outdated at some point.
>
> It is valid now, and it can be removed anyway.

Removed



> > > + if (unlikely(end == str || *end != '\0'))
>
> > Not sure I see much point in the unlikelys here, contrasting to the ones
> > in the backend (and the copy for the frontend) it's extremely unlikley
> > anything like this will ever be close to a bottleneck.  In general, I'd
> > strongly suggest not placing unlikelys in non-critical codepaths -
> > they're way too often wrongly estimated.
>
> I put "unlikely" where I really thought it made sense. I do not know when
> they would have an actual performance impact, but I appreciate that they
> convey information to the reader of the code, telling that this path is
> expected not to be taken.

I removed them.

Pushed, thanks for the patch!  I only did some very minor comment
changes, reset errno to 0 before strtod, removed a redundant
multiplication in PGBENCH_MUL.

FWIW, after this, and fixing the prerequisite general code paths, the
pgbench tests pass without -fsanitize=signed-integer-overflow errors.

Greetings,

Andres Freund