Fix number skipping in to_number

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
18 messages Options
Reply | Threaded
Open this post in threaded view
|

Fix number skipping in to_number

Oliver Ford
Prevents an issue where numbers can be skipped in the to_number()
function when the format mask contains a "G" or a "," but the input
string doesn't contain a separator. This resolves the TODO item "Fix
to_number() handling for values not matching the format string".

== Change ==

Currently, if the format mask in to_number() has a "G" or a ",", it
will assume that the input string contains a separator character in
the same place. If however a number is there instead, that number will
be silently skipped and not appear in the output. So we get:

select to_number('34,50','999,99');
 to_number
-----------
       340
(1 row)

This patch checks the input string when it encounters a "G" or "," in
the format mask. If the separator character is found, the code moves
over it as normal. If it's not found, then the code no longer
increments the relevant pointer so as not to skip the character. After
the patch, we get the correct result:

select to_number('34,50','999,99');
 to_number
-----------
      3450
(1 row)

This is in line with Oracle's result.

== Rationale ==

This patch is a small change, which leaves PostgreSQL behavior
different from Oracle behavior in related cases. Oracle's
implementation seems to read from right-to-left, and raises an
"ORA-01722: invalid number" error if there are digits in the input
string which don't have corresponding characters in the format mask. I
have chosen not to throw such errors, because there are use cases for
only returning part of a number string. For example, the following is
an error on Oracle:

select to_number('123,000', '999G') from dual;
Error report -
SQL Error: ORA-01722: invalid number

But if you wanted to only take the characters before the comma, and
discard the thousands part, you can do so on PostgreSQL with:

select to_number('123,000', '999G');
 to_number
-----------
       123
(1 row)

This is the current behavior. Which is why I think it makes more sense
to do what PostgreSQL currently does and read from left-to-right. The
only change, as mentioned above, is that the current behavior can skip
a digit:

select to_number('123456', '999G999');
 to_number
-----------
   12356
(1 row)

After the patch, this returns all the digits:

select to_number('123456', '999G999');
 to_number
-----------
  123456
(1 row)

== Testing ==

Tested on Windows with MinGW using the latest checkout from master.
Added regression tests to check for this new behavior. All existing
tests pass.


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

0001-apply-number-v1.patch (4K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Fix number skipping in to_number

Thomas Munro-3
On Thu, Aug 10, 2017 at 10:21 PM, Oliver Ford <[hidden email]> wrote:
> Prevents an issue where numbers can be skipped in the to_number()
> function when the format mask contains a "G" or a "," but the input
> string doesn't contain a separator. This resolves the TODO item "Fix
> to_number() handling for values not matching the format string".

From the TODO I found the previous discussion here:

https://www.postgresql.org/message-id/flat/be46a4f30909212157o71dc82bep7e074f9fa7eb1d14%40mail.gmail.com

There were some votes against changing it and some votes for.  I don't
see what's good about the current behaviour.  It's hard (though not
impossible) to imagine that someone is depending on that weirdness,
and it's harder to believe that anyone wants that behaviour knowingly.
For people migrating from Oracle or writing application portable to
both, the current behaviour sucks.  It's not a SQL standard feature,
it's a be-like-Oracle feature, so it should be like Oracle.  Aside
from that, by the principle of least astonishment alone I think it's
pretty clear that "a separator goes here" shouldn't mean "eat a
digit".

So +1 from me.

> == Change ==
>
> Currently, if the format mask in to_number() has a "G" or a ",", it
> will assume that the input string contains a separator character in
> the same place. If however a number is there instead, that number will
> be silently skipped and not appear in the output. So we get:
>
> select to_number('34,50','999,99');
>  to_number
> -----------
>        340
> (1 row)
>
> This patch checks the input string when it encounters a "G" or "," in
> the format mask. If the separator character is found, the code moves
> over it as normal. If it's not found, then the code no longer
> increments the relevant pointer so as not to skip the character. After
> the patch, we get the correct result:
>
> select to_number('34,50','999,99');
>  to_number
> -----------
>       3450
> (1 row)
>
> This is in line with Oracle's result.

In other words the separators are completely ignored.  Presumably the
only reason you'd have them at all is so that you could use the same
format string in to_number() and to_char() calls to do round-trip
conversions.  That seems reasonable to me.

> == Rationale ==
>
> This patch is a small change, which leaves PostgreSQL behavior
> different from Oracle behavior in related cases. Oracle's
> implementation seems to read from right-to-left, and raises an
> "ORA-01722: invalid number" error if there are digits in the input
> string which don't have corresponding characters in the format mask. I
> have chosen not to throw such errors, because there are use cases for
> only returning part of a number string. For example, the following is
> an error on Oracle:
>
> select to_number('123,000', '999G') from dual;
> Error report -
> SQL Error: ORA-01722: invalid number
>
> But if you wanted to only take the characters before the comma, and
> discard the thousands part, you can do so on PostgreSQL with:
>
> select to_number('123,000', '999G');
>  to_number
> -----------
>        123
> (1 row)

I can see that it's hard to change this one, because it's more likely
to break existing applications that were written by people who didn't
think of the format string as limiting the maximum size of the number.
I think someone could argue for changing that too, but I agree with
your choice for this patch.

> == Testing ==
>
> Tested on Windows with MinGW using the latest checkout from master.
> Added regression tests to check for this new behavior. All existing
> tests pass.

The tests you added pass for me but the int8 test now fails with the
following (this is from regression.diff after running
'PG_REGRESS_DIFF_OPTS="-dU10" make check').  It looks like some new
whitespace is appearing on the right in the output of to_char().  I
didn't try to understand why.

@@ -453,34 +453,34 @@
 ------------------+------------------
  4567890123456789 | 4567890123456789
 (1 row)

 -- TO_CHAR()
 --
 SELECT '' AS to_char_1, to_char(q1, '9G999G999G999G999G999'),
to_char(q2, '9,999,999,999,999,999')
        FROM INT8_TBL;
  to_char_1 |        to_char         |        to_char
 -----------+------------------------+------------------------
-           |                    123 |                    456
+           |                    123 |               456
            |                    123 |  4,567,890,123,456,789
-           |  4,567,890,123,456,789 |                    123
+           |  4,567,890,123,456,789 |               123
            |  4,567,890,123,456,789 |  4,567,890,123,456,789
            |  4,567,890,123,456,789 | -4,567,890,123,456,789
 (5 rows)

 SELECT '' AS to_char_2, to_char(q1, '9G999G999G999G999G999D999G999'),
to_char(q2, '9,999,999,999,999,999.999,999')
        FROM INT8_TBL;
  to_char_2 |            to_char             |            to_char
 -----------+--------------------------------+--------------------------------
-           |                    123.000,000 |                    456.000,000
+           |                    123.000,000 |               456.000,000
            |                    123.000,000 |  4,567,890,123,456,789.000,000
-           |  4,567,890,123,456,789.000,000 |                    123.000,000
+           |  4,567,890,123,456,789.000,000 |               123.000,000
            |  4,567,890,123,456,789.000,000 |  4,567,890,123,456,789.000,000
            |  4,567,890,123,456,789.000,000 | -4,567,890,123,456,789.000,000
 (5 rows)

 SELECT '' AS to_char_3, to_char( (q1 * -1), '9999999999999999PR'),
to_char( (q2 * -1), '9999999999999999.999PR')
        FROM INT8_TBL;
  to_char_3 |      to_char       |        to_char
 -----------+--------------------+------------------------
            |              <123> |              <456.000>
            |              <123> | <4567890123456789.000>

======================================================================

One superficial comment after first glimpse at the patch:

+    if(!strncmp(Np->inout_p, Np->L_thousands_sep, separator_len))

I believe the usual coding around here would be if (strncmp(...) == 0)

--
Thomas Munro
http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Fix number skipping in to_number

Oliver Ford
> The tests you added pass for me but the int8 test now fails with the

> following (this is from regression.diff after running
> 'PG_REGRESS_DIFF_OPTS="-dU10" make check').  It looks like some new
> whitespace is appearing on the right in the output of to_char().  I
> didn't try to understand why.
>
> @@ -453,34 +453,34 @@
>  ------------------+------------------
>   4567890123456789 | 4567890123456789
>  (1 row)
>
>  -- TO_CHAR()
>  --
>  SELECT '' AS to_char_1, to_char(q1, '9G999G999G999G999G999'),
> to_char(q2, '9,999,999,999,999,999')
>         FROM INT8_TBL;
>   to_char_1 |        to_char         |        to_char
>  -----------+------------------------+------------------------
> -           |                    123 |                    456
> +           |                    123 |               456
>             |                    123 |  4,567,890,123,456,789
> -           |  4,567,890,123,456,789 |                    123
> +           |  4,567,890,123,456,789 |               123
>             |  4,567,890,123,456,789 |  4,567,890,123,456,789
>             |  4,567,890,123,456,789 | -4,567,890,123,456,789
>  (5 rows)
>
>  SELECT '' AS to_char_2, to_char(q1, '9G999G999G999G999G999D999G999'),
> to_char(q2, '9,999,999,999,999,999.999,999')
>         FROM INT8_TBL;
>   to_char_2 |            to_char             |            to_char
>  -----------+--------------------------------+--------------------------------
> -           |                    123.000,000 |                    456.000,000
> +           |                    123.000,000 |               456.000,000
>             |                    123.000,000 |  4,567,890,123,456,789.000,000
> -           |  4,567,890,123,456,789.000,000 |                    123.000,000
> +           |  4,567,890,123,456,789.000,000 |               123.000,000
>             |  4,567,890,123,456,789.000,000 |  4,567,890,123,456,789.000,000
>             |  4,567,890,123,456,789.000,000 | -4,567,890,123,456,789.000,000
>  (5 rows)
>
>  SELECT '' AS to_char_3, to_char( (q1 * -1), '9999999999999999PR'),
> to_char( (q2 * -1), '9999999999999999.999PR')
>         FROM INT8_TBL;
>   to_char_3 |      to_char       |        to_char
>  -----------+--------------------+------------------------
>             |              <123> |              <456.000>
>             |              <123> | <4567890123456789.000>
>
That's strange, I can't replicate that issue on my Windows build. I've
tried with 'PG_REGRESS_DIFF_OPTS="-dU10" make check' and all the tests
(including int8) pass fine. The spacing in the results is perfectly
normal. I wonder if something else on your build is causing this? I've
also tried several "make check" options for different locales
mentioned in the docs and they pass fine.

>
> One superficial comment after first glimpse at the patch:
>
> +    if(!strncmp(Np->inout_p, Np->L_thousands_sep, separator_len))
>
> I believe the usual coding around here would be if (strncmp(...) == 0)
>

Yes you're right, that is the coding standard. I've changed it to that
in the attached v2.


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

0001-apply-number-v2.patch (4K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Fix number skipping in to_number

Thomas Munro-3
On Thu, Aug 17, 2017 at 9:48 PM, Oliver Ford <[hidden email]> wrote:

>> The tests you added pass for me but the int8 test now fails with the
>> following (this is from regression.diff after running
>> 'PG_REGRESS_DIFF_OPTS="-dU10" make check').  It looks like some new
>> whitespace is appearing on the right in the output of to_char().  I
>> didn't try to understand why.
>>
>> @@ -453,34 +453,34 @@
>>  ------------------+------------------
>>   4567890123456789 | 4567890123456789
>>  (1 row)
>>
>>  -- TO_CHAR()
>>  --
>>  SELECT '' AS to_char_1, to_char(q1, '9G999G999G999G999G999'),
>> to_char(q2, '9,999,999,999,999,999')
>>         FROM INT8_TBL;
>>   to_char_1 |        to_char         |        to_char
>>  -----------+------------------------+------------------------
>> -           |                    123 |                    456
>> +           |                    123 |               456
>>             |                    123 |  4,567,890,123,456,789
>> -           |  4,567,890,123,456,789 |                    123
>> +           |  4,567,890,123,456,789 |               123
>>             |  4,567,890,123,456,789 |  4,567,890,123,456,789
>>             |  4,567,890,123,456,789 | -4,567,890,123,456,789
>>  (5 rows)
>>
>>  SELECT '' AS to_char_2, to_char(q1, '9G999G999G999G999G999D999G999'),
>> to_char(q2, '9,999,999,999,999,999.999,999')
>>         FROM INT8_TBL;
>>   to_char_2 |            to_char             |            to_char
>>  -----------+--------------------------------+--------------------------------
>> -           |                    123.000,000 |                    456.000,000
>> +           |                    123.000,000 |               456.000,000
>>             |                    123.000,000 |  4,567,890,123,456,789.000,000
>> -           |  4,567,890,123,456,789.000,000 |                    123.000,000
>> +           |  4,567,890,123,456,789.000,000 |               123.000,000
>>             |  4,567,890,123,456,789.000,000 |  4,567,890,123,456,789.000,000
>>             |  4,567,890,123,456,789.000,000 | -4,567,890,123,456,789.000,000
>>  (5 rows)
>>
>>  SELECT '' AS to_char_3, to_char( (q1 * -1), '9999999999999999PR'),
>> to_char( (q2 * -1), '9999999999999999.999PR')
>>         FROM INT8_TBL;
>>   to_char_3 |      to_char       |        to_char
>>  -----------+--------------------+------------------------
>>             |              <123> |              <456.000>
>>             |              <123> | <4567890123456789.000>
>>
>
> That's strange, I can't replicate that issue on my Windows build. I've
> tried with 'PG_REGRESS_DIFF_OPTS="-dU10" make check' and all the tests
> (including int8) pass fine. The spacing in the results is perfectly
> normal. I wonder if something else on your build is causing this? I've
> also tried several "make check" options for different locales
> mentioned in the docs and they pass fine.

Hmm.  Just in case my macOS laptop (CC=Apple's clang,
LANG=en_NZ.UTF-8) was unduly affected by cosmic rays I tried on a
couple of nearby servers running FreeBSD 11.1 (CC=clang, LANG=<unset>)
and CentOS 7.3 (CC=gcc, LANG=en_US.utf-8) and got the same result:
int8 has lost some whitespace in to_char output.

It looks a bit like it has lost a leading space for every ',' that was
present in the format string but which didn't finish up getting output
(because the number was too small).  It looks a bit like that doesn't
happen for 'G', so let's compare the NUM_COMMA and NUM_G cases.  Ahh..
I'm not sure, but I think you might have a close-brace in the wrong
place here:

@@ -4879,6 +4883,10 @@ NUM_processor(FormatNode *node, NUMDesc *Num,
char *inout,
                                                                continue;
                                                }
                                        }    <--- this guy here might
need to move after "noadd = true"?
+                                               if
(strncmp(Np->inout_p, Np->L_thousands_sep, separator_len) == 0)
+                                                       Np->inout_p +=
separator_len - 1;
+                                               else
+                                                       noadd = true;
                                        break;

                                case NUM_G:

With that change the test passes for me.  But why do we get different results?

--
Thomas Munro
http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Fix number skipping in to_number

Oliver Ford
On Thu, Aug 17, 2017 at 11:55 AM, Thomas Munro
<[hidden email]> wrote:

> Hmm.  Just in case my macOS laptop (CC=Apple's clang,
> LANG=en_NZ.UTF-8) was unduly affected by cosmic rays I tried on a
> couple of nearby servers running FreeBSD 11.1 (CC=clang, LANG=<unset>)
> and CentOS 7.3 (CC=gcc, LANG=en_US.utf-8) and got the same result:
> int8 has lost some whitespace in to_char output.
>
> It looks a bit like it has lost a leading space for every ',' that was
> present in the format string but which didn't finish up getting output
> (because the number was too small).  It looks a bit like that doesn't
> happen for 'G', so let's compare the NUM_COMMA and NUM_G cases.  Ahh..
> I'm not sure, but I think you might have a close-brace in the wrong
> place here:
>
> @@ -4879,6 +4883,10 @@ NUM_processor(FormatNode *node, NUMDesc *Num,
> char *inout,
>                                                                 continue;
>                                                 }
>                                         }    <--- this guy here might
> need to move after "noadd = true"?
> +                                               if
> (strncmp(Np->inout_p, Np->L_thousands_sep, separator_len) == 0)
> +                                                       Np->inout_p +=
> separator_len - 1;
> +                                               else
> +                                                       noadd = true;
>                                         break;
>
>                                 case NUM_G:
>
> With that change the test passes for me.  But why do we get different results?
>
> --
> Thomas Munro
> http://www.enterprisedb.com
Ok I've made that change in the attached v3. I'm not sure as I'm on
en_US.UTF-8 locale too. Maybe something Windows specific?


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

0001-apply-number-v3.patch (4K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Fix number skipping in to_number

Nathan Wagner
On Thu, Aug 17, 2017 at 12:33:02PM +0100, Oliver Ford wrote:
 
> Ok I've made that change in the attached v3. I'm not sure as I'm on
> en_US.UTF-8 locale too. Maybe something Windows specific?

This patch applies against master (8485a25a), compiles, and
passes a make check.

I tested both on my mac laptop, and my linux server.

If we want this patch, I'd say it's ready for committer.  We may want
(and I can't believe I'm saying this) more discussion as to exactly what
the strategy for to_number() (and friends) is.  Do we want to duplicate
Oracle's functionality, or do we want a similar function to do similar
things, without necessarily having a goal of identical behavior to
oracle?

For myself, I pretty much never use the to_date, to_number, or
to_timestamp functions except when porting oracle code.  I do use the
to_char functions on occasion.  If strftime were available, I probably
wouldn't use them.

I would commit this patch and update the TODO with a goal of making
to_number as Oracle compatible as is reasonable.

--
nw


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Fix number skipping in to_number

Oliver Ford


On Monday, 25 September 2017, Nathan Wagner <[hidden email]> wrote:
On Thu, Aug 17, 2017 at 12:33:02PM +0100, Oliver Ford wrote:

> Ok I've made that change in the attached v3. I'm not sure as I'm on
> en_US.UTF-8 locale too. Maybe something Windows specific?

This patch applies against master (8485a25a), compiles, and
passes a make check.

I tested both on my mac laptop, and my linux server.

If we want this patch, I'd say it's ready for committer.  We may want
(and I can't believe I'm saying this) more discussion as to exactly what
the strategy for to_number() (and friends) is.  Do we want to duplicate
Oracle's functionality, or do we want a similar function to do similar
things, without necessarily having a goal of identical behavior to
oracle?

For myself, I pretty much never use the to_date, to_number, or
to_timestamp functions except when porting oracle code.  I do use the
to_char functions on occasion.  If strftime were available, I probably
wouldn't use them.

I would commit this patch and update the TODO with a goal of making
to_number as Oracle compatible as is reasonable.


Thanks for your review. The issue is that Oracle throws errors on many more input cases than Postgres does, so making it exactly like Oracle could break a lot of existing users. E.g. to_number ('123,000', '999') returns '123' on Postgres, but throws an error on Oracle. So making it exactly Oracle-like could break existing users who might rely on the current behavior.

My view is that we shouldn't deliberately introduce errors in order to be exactly like Oracle if we don't currently and there's a sane use case for current behavior. Do you have any examples of results that are different between Oracle and Postgres, and you think the Oracle result makes more sense?
Reply | Threaded
Open this post in threaded view
|

Re: Fix number skipping in to_number

Nathan Wagner
On Mon, Sep 25, 2017 at 07:52:19PM +0100, Oliver Ford wrote:

> Thanks for your review. The issue is that Oracle throws errors on many
> more input cases than Postgres does, so making it exactly like Oracle
> could break a lot of existing users. E.g. to_number ('123,000', '999')
> returns '123' on Postgres, but throws an error on Oracle. So making it
> exactly Oracle-like could break existing users who might rely on the
> current behavior.

I wouldn't use to_number for anything other than oracle compatibility,
and then just so I didn't have to wade through all the ported oracle
code.  I would use a regex or some such to clean up the number, and then
cast the result.  For an input string of '123,000' I might do a
translate('123,000', ',', '')::integer or perhaps use regexp_replace().
Either way I would want a more positive decision as to what was valid or
not, based on the input.

> My view is that we shouldn't deliberately introduce errors in order to be
> exactly like Oracle if we don't currently and there's a sane use case for
> current behavior. Do you have any examples of results that are different
> between Oracle and Postgres, and you think the Oracle result makes more
> sense?

Not really, other than I think an error report might make more sense.
'123,000' doesn't really match the format of '999'.  If anything it
seems like we're guessing rather than validating input.  It is
surprising (to me at least) that

to_char(to_number('123,000', '999'), '999')

doesn't give us the original input (in the sense that identical formats
don't preserve the original string).  So I'm not sure the current
behavior is a sane use case, but perhaps more people are using
to_number() to get *some* numeric result, rather than for wanting it to
be like oracle.  I would generally prefer to throw an exception instead
of getting a number I wasn't expecting, but I can see cases where that
might not be the case.

--
nw


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Fix number skipping in to_number

Daniel Gustafsson
In reply to this post by Nathan Wagner
> On 25 Sep 2017, at 02:52, Nathan Wagner <[hidden email]> wrote:
>
> On Thu, Aug 17, 2017 at 12:33:02PM +0100, Oliver Ford wrote:
>
>> Ok I've made that change in the attached v3. I'm not sure as I'm on
>> en_US.UTF-8 locale too. Maybe something Windows specific?
>
> This patch applies against master (8485a25a), compiles, and
> passes a make check.
>
> I tested both on my mac laptop, and my linux server.
>
> If we want this patch, I'd say it's ready for committer.  We may want
> (and I can't believe I'm saying this) more discussion as to exactly what
> the strategy for to_number() (and friends) is.

Based on this review, I am moving this patch to the next commitfest and will
update it to Ready for committer.  There might, as you say, be more discussion
needed, but due to the lack of extensive such so far I’ll move it for now.

cheers ./daniel

--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Fix number skipping in to_number

Tom Lane-2
In reply to this post by Oliver Ford
Oliver Ford <[hidden email]> writes:
> [ 0001-apply-number-v3.patch ]

I looked at this patch briefly and have a couple of comments:

* It seems entirely wrong to be matching to L_thousands_sep in the
NUM_COMMA case; that format code is by definition not locale aware,
so it should be matching to plain ',' independently of locale.

* Don't we need to fix the NUM_L (currency symbol) case in the
same manner?  (The NUM_D and NUM_S cases are handled in
NUM_numpart_from_char and seem ok at a quick glance.)

* I'm not in love with the noadd flag.  Other places in this
switch that want to skip the final increment do it with
"continue", and I think this should do likewise.

                        regards, tom lane


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Fix number skipping in to_number

Oliver Ford
On Sun, Nov 12, 2017 at 7:00 PM, Tom Lane <[hidden email]> wrote:
> Oliver Ford <[hidden email]> writes:
>> [ 0001-apply-number-v3.patch ]
>
> I looked at this patch briefly and have a couple of comments:
>
> * It seems entirely wrong to be matching to L_thousands_sep in the
> NUM_COMMA case; that format code is by definition not locale aware,
> so it should be matching to plain ',' independently of locale.

Ok I've changed it to just search for a comma.

> * Don't we need to fix the NUM_L (currency symbol) case in the
> same manner?  (The NUM_D and NUM_S cases are handled in
> NUM_numpart_from_char and seem ok at a quick glance.)

Yes you get the same skipping if you do:

select to_number('12','L99');
 to_number
-----------
         2

However, this case is not as easy to fix as you can't do a simple
string comparison like with the group separator. The currency symbol
for the locale can be " " but if we do a comparison, it won't match if
the symbol specified is "$" or "£" (so will end up missing characters
at the end of the supplied string). Could we apply the attached patch
and then put fixing it for currency on the TODO list?

> * I'm not in love with the noadd flag.  Other places in this
> switch that want to skip the final increment do it with
> "continue", and I think this should do likewise.

I've changed to continue in the latest patch. All existing tests pass.

0001-apply-number-v4.patch (4K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Fix number skipping in to_number

Tom Lane-2
Oliver Ford <[hidden email]> writes:
> On Sun, Nov 12, 2017 at 7:00 PM, Tom Lane <[hidden email]> wrote:
>> * Don't we need to fix the NUM_L (currency symbol) case in the
>> same manner?  (The NUM_D and NUM_S cases are handled in
>> NUM_numpart_from_char and seem ok at a quick glance.)

> Yes you get the same skipping if you do:

> select to_number('12','L99');
>  to_number
> -----------
>          2

> However, this case is not as easy to fix as you can't do a simple
> string comparison like with the group separator. The currency symbol
> for the locale can be " " but if we do a comparison, it won't match if
> the symbol specified is "$" or "£" (so will end up missing characters
> at the end of the supplied string). Could we apply the attached patch
> and then put fixing it for currency on the TODO list?

I don't follow your concern?  If "$" is not the correct currency
symbol for the locale, we shouldn't accept it as a match to an L format.
Your patch is tightening what we will accept as a match to a G format,
so I don't see why you're concerned about backward compatibility in
one case but not the other.

                        regards, tom lane

Reply | Threaded
Open this post in threaded view
|

Re: Fix number skipping in to_number

Oliver Ford


On Monday, 13 November 2017, Tom Lane <[hidden email]> wrote:
Oliver Ford <<a href="javascript:;" onclick="_e(event, &#39;cvml&#39;, &#39;ojford@gmail.com&#39;)">ojford@...> writes:
> On Sun, Nov 12, 2017 at 7:00 PM, Tom Lane <<a href="javascript:;" onclick="_e(event, &#39;cvml&#39;, &#39;tgl@sss.pgh.pa.us&#39;)">tgl@...> wrote:
>> * Don't we need to fix the NUM_L (currency symbol) case in the
>> same manner?  (The NUM_D and NUM_S cases are handled in
>> NUM_numpart_from_char and seem ok at a quick glance.)

> Yes you get the same skipping if you do:

> select to_number('12','L99');
>  to_number
> -----------
>          2

> However, this case is not as easy to fix as you can't do a simple
> string comparison like with the group separator. The currency symbol
> for the locale can be " " but if we do a comparison, it won't match if
> the symbol specified is "$" or "£" (so will end up missing characters
> at the end of the supplied string). Could we apply the attached patch
> and then put fixing it for currency on the TODO list?

I don't follow your concern?  If "$" is not the correct currency
symbol for the locale, we shouldn't accept it as a match to an L format.
Your patch is tightening what we will accept as a match to a G format,
so I don't see why you're concerned about backward compatibility in
one case but not the other.

                        regards, tom lane

It's a guess as to the likely use case. I would imagine that people are likely to use a currency symbol different from the locale, but unlikely to use a different group separator. Others might have a different opinion though.
Reply | Threaded
Open this post in threaded view
|

Re: Fix number skipping in to_number

Tom Lane-2
Oliver Ford <[hidden email]> writes:
> On Monday, 13 November 2017, Tom Lane <[hidden email]> wrote:
>> I don't follow your concern?  If "$" is not the correct currency
>> symbol for the locale, we shouldn't accept it as a match to an L format.
>> Your patch is tightening what we will accept as a match to a G format,
>> so I don't see why you're concerned about backward compatibility in
>> one case but not the other.

> It's a guess as to the likely use case. I would imagine that people are
> likely to use a currency symbol different from the locale, but unlikely to
> use a different group separator. Others might have a different opinion
> though.

Well, if they use a currency symbol different from the locale's, they're
in trouble anyway because the number of bytes might be different.  In most
encodings, symbols other than "$" are probably not 1-byte characters.

At the very least I think we need to constrain it enough that it not
swallow a fractional character.

                        regards, tom lane

Reply | Threaded
Open this post in threaded view
|

Re: Fix number skipping in to_number

Tom Lane-2
I wrote:
> Oliver Ford <[hidden email]> writes:
>> On Monday, 13 November 2017, Tom Lane <[hidden email]> wrote:
>>> I don't follow your concern?  If "$" is not the correct currency
>>> symbol for the locale, we shouldn't accept it as a match to an L format.
>>> Your patch is tightening what we will accept as a match to a G format,
>>> so I don't see why you're concerned about backward compatibility in
>>> one case but not the other.

>> It's a guess as to the likely use case. I would imagine that people are
>> likely to use a currency symbol different from the locale, but unlikely to
>> use a different group separator. Others might have a different opinion
>> though.

> Well, if they use a currency symbol different from the locale's, they're
> in trouble anyway because the number of bytes might be different.  In most
> encodings, symbols other than "$" are probably not 1-byte characters.
> At the very least I think we need to constrain it enough that it not
> swallow a fractional character.

After more testing I understood your concern about L_currency_symbol:
in C locale that's " ", not "$" as I naively imagined.  Americans,
at least, would be pretty unhappy if "$1234.56" suddenly stopped matching
"L9999.99".  So it seems like we can't institute a strict matching rule.
However, it is certainly not good that this happens:

regression=# select to_number('1234.56', 'L9999.99');
 to_number
-----------
    234.56
(1 row)

To me that seems just as bad as having ',' or 'G' eat a digit.

After some reflection I propose that the rule that we want is:

* ',' and 'G' consume input only if it exactly matches the expected
separator.

* Other non-data template patterns consume a number of input characters
equal to the number of characters they'd produce in output, *except* that
these patterns will not consume data characters (digits, signs, decimal
point, comma).

I think that while we are at it we should take some measures to ensure
that "character" in this definition means "character", not "byte".
It is not good that a euro currency symbol might consume an
encoding-dependent number of input characters.

That leads me to the attached patch.  There is more that could be done
here --- in particular, I'd like to see the character-not-byte-count
rule extended to literal text.  But that seems like fit material for
a different patch.

Also, I noticed that in your form of the patch, the strncmp() could read
past the end of the string, possibly resulting in a crash.  So I made it
use the AMOUNT_TEST infrastructure from NUM_numpart_from_char to avoid that.

One other note: I realized that it was only pure luck that your regression
test cases worked in locales where 'G' is decimal point --- they still
gave the same answer, but through a totally different interpretation of
the input.  That did not seem like a good idea, so I adjusted the
regression test to force C locale for the to_number() tests.  I wish we
could use some other locale here, but then it likely wouldn't be portable
to Windows.

                        regards, tom lane


diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index f901567..35a845c 100644
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
*************** SELECT regexp_match('abc01234xyz', '(?:(
*** 5850,5856 ****
      data based on the given value.  Any text that is not a template pattern is
      simply copied verbatim.  Similarly, in an input template string (for the
      other functions), template patterns identify the values to be supplied by
!     the input data string.
     </para>
 
    <para>
--- 5850,5859 ----
      data based on the given value.  Any text that is not a template pattern is
      simply copied verbatim.  Similarly, in an input template string (for the
      other functions), template patterns identify the values to be supplied by
!     the input data string.  If there are characters in the template string
!     that are not template patterns, the corresponding characters in the input
!     data string are simply skipped over (whether or not they are equal to the
!     template string characters).
     </para>
 
    <para>
*************** SELECT regexp_match('abc01234xyz', '(?:(
*** 6176,6188 ****
         Ordinary text is allowed in <function>to_char</function>
         templates and will be output literally.  You can put a substring
         in double quotes to force it to be interpreted as literal text
!        even if it contains pattern key words.  For example, in
         <literal>'"Hello Year "YYYY'</literal>, the <literal>YYYY</literal>
         will be replaced by the year data, but the single <literal>Y</literal> in <literal>Year</literal>
!        will not be.  In <function>to_date</function>, <function>to_number</function>,
!        and <function>to_timestamp</function>, double-quoted strings skip the number of
!        input characters contained in the string, e.g. <literal>"XX"</literal>
!        skips two input characters.
        </para>
       </listitem>
 
--- 6179,6193 ----
         Ordinary text is allowed in <function>to_char</function>
         templates and will be output literally.  You can put a substring
         in double quotes to force it to be interpreted as literal text
!        even if it contains template patterns.  For example, in
         <literal>'"Hello Year "YYYY'</literal>, the <literal>YYYY</literal>
         will be replaced by the year data, but the single <literal>Y</literal> in <literal>Year</literal>
!        will not be.
!        In <function>to_date</function>, <function>to_number</function>,
!        and <function>to_timestamp</function>, literal text and double-quoted
!        strings result in skipping the number of characters contained in the
!        string; for example <literal>"XX"</literal> skips two input characters
!        (whether or not they are <literal>XX</literal>).
        </para>
       </listitem>
 
*************** SELECT regexp_match('abc01234xyz', '(?:(
*** 6485,6490 ****
--- 6490,6506 ----
 
       <listitem>
        <para>
+        In <function>to_number</function>, if non-data template patterns such
+        as <literal>L</literal> or <literal>TH</literal> are used, the
+        corresponding number of input characters are skipped, whether or not
+        they match the template pattern, unless they are data characters
+        (that is, digits, sign, decimal point, or comma).  For
+        example, <literal>TH</literal> would skip two non-data characters.
+       </para>
+      </listitem>
+
+      <listitem>
+       <para>
         <literal>V</literal> with <function>to_char</function>
         multiplies the input values by
         <literal>10^<replaceable>n</replaceable></literal>, where
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index 50254f2..5afc293 100644
*** a/src/backend/utils/adt/formatting.c
--- b/src/backend/utils/adt/formatting.c
*************** static char *get_last_relevant_decnum(ch
*** 988,994 ****
  static void NUM_numpart_from_char(NUMProc *Np, int id, int input_len);
  static void NUM_numpart_to_char(NUMProc *Np, int id);
  static char *NUM_processor(FormatNode *node, NUMDesc *Num, char *inout,
!  char *number, int from_char_input_len, int to_char_out_pre_spaces,
   int sign, bool is_to_char, Oid collid);
  static DCHCacheEntry *DCH_cache_getnew(const char *str);
  static DCHCacheEntry *DCH_cache_search(const char *str);
--- 988,994 ----
  static void NUM_numpart_from_char(NUMProc *Np, int id, int input_len);
  static void NUM_numpart_to_char(NUMProc *Np, int id);
  static char *NUM_processor(FormatNode *node, NUMDesc *Num, char *inout,
!  char *number, int input_len, int to_char_out_pre_spaces,
   int sign, bool is_to_char, Oid collid);
  static DCHCacheEntry *DCH_cache_getnew(const char *str);
  static DCHCacheEntry *DCH_cache_search(const char *str);
*************** get_last_relevant_decnum(char *num)
*** 4232,4237 ****
--- 4232,4245 ----
  return result;
  }
 
+ /*
+  * These macros are used in NUM_processor() and its subsidiary routines.
+  * OVERLOAD_TEST: true if we've reached end of input string
+  * AMOUNT_TEST(s): true if at least s characters remain in string
+  */
+ #define OVERLOAD_TEST (Np->inout_p >= Np->inout + input_len)
+ #define AMOUNT_TEST(s) (Np->inout_p <= Np->inout + (input_len - (s)))
+
  /* ----------
   * Number extraction for TO_NUMBER()
   * ----------
*************** NUM_numpart_from_char(NUMProc *Np, int i
*** 4246,4254 ****
  (id == NUM_0 || id == NUM_9) ? "NUM_0/9" : id == NUM_DEC ? "NUM_DEC" : "???");
  #endif
 
- #define OVERLOAD_TEST (Np->inout_p >= Np->inout + input_len)
- #define AMOUNT_TEST(_s) (input_len-(Np->inout_p-Np->inout) >= _s)
-
  if (OVERLOAD_TEST)
  return;
 
--- 4254,4259 ----
*************** NUM_numpart_to_char(NUMProc *Np, int id)
*** 4641,4654 ****
  ++Np->num_curr;
  }
 
  static char *
  NUM_processor(FormatNode *node, NUMDesc *Num, char *inout,
!  char *number, int from_char_input_len, int to_char_out_pre_spaces,
   int sign, bool is_to_char, Oid collid)
  {
  FormatNode *n;
  NUMProc _Np,
    *Np = &_Np;
 
  MemSet(Np, 0, sizeof(NUMProc));
 
--- 4646,4677 ----
  ++Np->num_curr;
  }
 
+ /*
+  * Skip over "n" input characters, but only if they aren't numeric data
+  */
+ static void
+ NUM_eat_non_data_chars(NUMProc *Np, int n, int input_len)
+ {
+ while (n-- > 0)
+ {
+ if (OVERLOAD_TEST)
+ break; /* end of input */
+ if (strchr("0123456789.,+-", *Np->inout_p) != NULL)
+ break; /* it's a data character */
+ Np->inout_p += pg_mblen(Np->inout_p);
+ }
+ }
+
  static char *
  NUM_processor(FormatNode *node, NUMDesc *Num, char *inout,
!  char *number, int input_len, int to_char_out_pre_spaces,
   int sign, bool is_to_char, Oid collid)
  {
  FormatNode *n;
  NUMProc _Np,
    *Np = &_Np;
+ const char *pattern;
+ int pattern_len;
 
  MemSet(Np, 0, sizeof(NUMProc));
 
*************** NUM_processor(FormatNode *node, NUMDesc
*** 4816,4824 ****
  if (!Np->is_to_char)
  {
  /*
! * Check non-string inout end
  */
! if (Np->inout_p >= Np->inout + from_char_input_len)
  break;
  }
 
--- 4839,4849 ----
  if (!Np->is_to_char)
  {
  /*
! * Check at least one character remains to be scanned.  (In
! * actions below, must use AMOUNT_TEST if we want to read more
! * characters than that.)
  */
! if (OVERLOAD_TEST)
  break;
  }
 
*************** NUM_processor(FormatNode *node, NUMDesc
*** 4828,4839 ****
  if (n->type == NODE_TYPE_ACTION)
  {
  /*
! * Create/reading digit/zero/blank/sing
  *
  * 'NUM_S' note: The locale sign is anchored to number and we
  * read/write it when we work with first or last number
! * (NUM_0/NUM_9). This is reason why NUM_S missing in follow
! * switch().
  */
  switch (n->key->id)
  {
--- 4853,4868 ----
  if (n->type == NODE_TYPE_ACTION)
  {
  /*
! * Create/read digit/zero/blank/sign/special-case
  *
  * 'NUM_S' note: The locale sign is anchored to number and we
  * read/write it when we work with first or last number
! * (NUM_0/NUM_9).  This is why NUM_S is missing in switch().
! *
! * Notice the "Np->inout_p++" at the bottom of the loop.  This is
! * why most of the actions advance inout_p one less than you might
! * expect.  In cases where we don't want that increment to happen,
! * a switch case ends with "continue" not "break".
  */
  switch (n->key->id)
  {
*************** NUM_processor(FormatNode *node, NUMDesc
*** 4848,4854 ****
  }
  else
  {
! NUM_numpart_from_char(Np, n->key->id, from_char_input_len);
  break; /* switch() case: */
  }
 
--- 4877,4883 ----
  }
  else
  {
! NUM_numpart_from_char(Np, n->key->id, input_len);
  break; /* switch() case: */
  }
 
*************** NUM_processor(FormatNode *node, NUMDesc
*** 4872,4881 ****
--- 4901,4914 ----
  if (IS_FILLMODE(Np->Num))
  continue;
  }
+ if (*Np->inout_p != ',')
+ continue;
  }
  break;
 
  case NUM_G:
+ pattern = Np->L_thousands_sep;
+ pattern_len = strlen(pattern);
  if (Np->is_to_char)
  {
  if (!Np->num_in)
*************** NUM_processor(FormatNode *node, NUMDesc
*** 4884,4899 ****
  continue;
  else
  {
! int x = strlen(Np->L_thousands_sep);
!
! memset(Np->inout_p, ' ', x);
! Np->inout_p += x - 1;
  }
  }
  else
  {
! strcpy(Np->inout_p, Np->L_thousands_sep);
! Np->inout_p += strlen(Np->inout_p) - 1;
  }
  }
  else
--- 4917,4932 ----
  continue;
  else
  {
! /* just in case there are MB chars */
! pattern_len = pg_mbstrlen(pattern);
! memset(Np->inout_p, ' ', pattern_len);
! Np->inout_p += pattern_len - 1;
  }
  }
  else
  {
! strcpy(Np->inout_p, pattern);
! Np->inout_p += pattern_len - 1;
  }
  }
  else
*************** NUM_processor(FormatNode *node, NUMDesc
*** 4903,4920 ****
  if (IS_FILLMODE(Np->Num))
  continue;
  }
! Np->inout_p += strlen(Np->L_thousands_sep) - 1;
  }
  break;
 
  case NUM_L:
  if (Np->is_to_char)
  {
! strcpy(Np->inout_p, Np->L_currency_symbol);
! Np->inout_p += strlen(Np->inout_p) - 1;
  }
  else
! Np->inout_p += strlen(Np->L_currency_symbol) - 1;
  break;
 
  case NUM_RN:
--- 4936,4968 ----
  if (IS_FILLMODE(Np->Num))
  continue;
  }
!
! /*
! * Because L_thousands_sep typically contains data
! * characters (either '.' or ','), we can't use
! * NUM_eat_non_data_chars here.  Instead skip only if
! * the input matches L_thousands_sep.
! */
! if (AMOUNT_TEST(pattern_len) &&
! strncmp(Np->inout_p, pattern, pattern_len) == 0)
! Np->inout_p += pattern_len - 1;
! else
! continue;
  }
  break;
 
  case NUM_L:
+ pattern = Np->L_currency_symbol;
  if (Np->is_to_char)
  {
! strcpy(Np->inout_p, pattern);
! Np->inout_p += strlen(pattern) - 1;
  }
  else
! {
! NUM_eat_non_data_chars(Np, pg_mbstrlen(pattern), input_len);
! continue;
! }
  break;
 
  case NUM_RN:
*************** NUM_processor(FormatNode *node, NUMDesc
*** 4949,4956 ****
  continue;
 
  if (Np->is_to_char)
  strcpy(Np->inout_p, get_th(Np->number, TH_LOWER));
! Np->inout_p += 1;
  break;
 
  case NUM_TH:
--- 4997,5012 ----
  continue;
 
  if (Np->is_to_char)
+ {
  strcpy(Np->inout_p, get_th(Np->number, TH_LOWER));
! Np->inout_p += 1;
! }
! else
! {
! /* All variants of 'th' occupy 2 characters */
! NUM_eat_non_data_chars(Np, 2, input_len);
! continue;
! }
  break;
 
  case NUM_TH:
*************** NUM_processor(FormatNode *node, NUMDesc
*** 4959,4966 ****
  continue;
 
  if (Np->is_to_char)
  strcpy(Np->inout_p, get_th(Np->number, TH_UPPER));
! Np->inout_p += 1;
  break;
 
  case NUM_MI:
--- 5015,5030 ----
  continue;
 
  if (Np->is_to_char)
+ {
  strcpy(Np->inout_p, get_th(Np->number, TH_UPPER));
! Np->inout_p += 1;
! }
! else
! {
! /* All variants of 'TH' occupy 2 characters */
! NUM_eat_non_data_chars(Np, 2, input_len);
! continue;
! }
  break;
 
  case NUM_MI:
*************** NUM_processor(FormatNode *node, NUMDesc
*** 4977,4982 ****
--- 5041,5051 ----
  {
  if (*Np->inout_p == '-')
  *Np->number = '-';
+ else
+ {
+ NUM_eat_non_data_chars(Np, 1, input_len);
+ continue;
+ }
  }
  break;
 
*************** NUM_processor(FormatNode *node, NUMDesc
*** 4994,5016 ****
  {
  if (*Np->inout_p == '+')
  *Np->number = '+';
  }
  break;
 
  case NUM_SG:
  if (Np->is_to_char)
  *Np->inout_p = Np->sign;
-
  else
  {
  if (*Np->inout_p == '-')
  *Np->number = '-';
  else if (*Np->inout_p == '+')
  *Np->number = '+';
  }
  break;
 
-
  default:
  continue;
  break;
--- 5063,5093 ----
  {
  if (*Np->inout_p == '+')
  *Np->number = '+';
+ else
+ {
+ NUM_eat_non_data_chars(Np, 1, input_len);
+ continue;
+ }
  }
  break;
 
  case NUM_SG:
  if (Np->is_to_char)
  *Np->inout_p = Np->sign;
  else
  {
  if (*Np->inout_p == '-')
  *Np->number = '-';
  else if (*Np->inout_p == '+')
  *Np->number = '+';
+ else
+ {
+ NUM_eat_non_data_chars(Np, 1, input_len);
+ continue;
+ }
  }
  break;
 
  default:
  continue;
  break;
*************** NUM_processor(FormatNode *node, NUMDesc
*** 5019,5025 ****
  else
  {
  /*
! * Remove to output char from input in TO_CHAR
  */
  if (Np->is_to_char)
  *Np->inout_p = n->character;
--- 5096,5107 ----
  else
  {
  /*
! * In TO_CHAR, non-pattern characters in the format are copied to
! * the output.  In TO_NUMBER, we skip one input character for each
! * non-pattern format character, whether or not it matches the
! * format character.  (Currently, that's actually implemented as
! * skipping one input byte per non-pattern format byte, which is
! * wrong...)
  */
  if (Np->is_to_char)
  *Np->inout_p = n->character;
diff --git a/src/test/regress/expected/numeric.out b/src/test/regress/expected/numeric.out
index 7e55b0e..a96bfc0 100644
*** a/src/test/regress/expected/numeric.out
--- b/src/test/regress/expected/numeric.out
*************** SELECT '' AS to_char_26, to_char('100'::
*** 1219,1224 ****
--- 1219,1225 ----
 
  -- TO_NUMBER()
  --
+ SET lc_numeric = 'C';
  SELECT '' AS to_number_1,  to_number('-34,338,492', '99G999G999');
   to_number_1 | to_number
  -------------+-----------
*************** SELECT '' AS to_number_13, to_number(' .
*** 1297,1302 ****
--- 1298,1358 ----
                |     -0.01
  (1 row)
 
+ SELECT '' AS to_number_14, to_number('34,50','999,99');
+  to_number_14 | to_number
+ --------------+-----------
+               |      3450
+ (1 row)
+
+ SELECT '' AS to_number_15, to_number('123,000','999G');
+  to_number_15 | to_number
+ --------------+-----------
+               |       123
+ (1 row)
+
+ SELECT '' AS to_number_16, to_number('123456','999G999');
+  to_number_16 | to_number
+ --------------+-----------
+               |    123456
+ (1 row)
+
+ SELECT '' AS to_number_17, to_number('$1234.56','L9,999.99');
+  to_number_17 | to_number
+ --------------+-----------
+               |   1234.56
+ (1 row)
+
+ SELECT '' AS to_number_18, to_number('$1234.56','L99,999.99');
+  to_number_18 | to_number
+ --------------+-----------
+               |   1234.56
+ (1 row)
+
+ SELECT '' AS to_number_19, to_number('$1,234.56','L99,999.99');
+  to_number_19 | to_number
+ --------------+-----------
+               |   1234.56
+ (1 row)
+
+ SELECT '' AS to_number_20, to_number('1234.56','L99,999.99');
+  to_number_20 | to_number
+ --------------+-----------
+               |   1234.56
+ (1 row)
+
+ SELECT '' AS to_number_21, to_number('1,234.56','L99,999.99');
+  to_number_21 | to_number
+ --------------+-----------
+               |   1234.56
+ (1 row)
+
+ SELECT '' AS to_number_22, to_number('42nd', '99th');
+  to_number_22 | to_number
+ --------------+-----------
+               |        42
+ (1 row)
+
+ RESET lc_numeric;
  --
  -- Input syntax
  --
diff --git a/src/test/regress/sql/numeric.sql b/src/test/regress/sql/numeric.sql
index 9675b6e..321c7bd 100644
*** a/src/test/regress/sql/numeric.sql
--- b/src/test/regress/sql/numeric.sql
*************** SELECT '' AS to_char_26, to_char('100'::
*** 788,793 ****
--- 788,794 ----
 
  -- TO_NUMBER()
  --
+ SET lc_numeric = 'C';
  SELECT '' AS to_number_1,  to_number('-34,338,492', '99G999G999');
  SELECT '' AS to_number_2,  to_number('-34,338,492.654,878', '99G999G999D999G999');
  SELECT '' AS to_number_3,  to_number('<564646.654564>', '999999.999999PR');
*************** SELECT '' AS to_number_10, to_number('0'
*** 801,806 ****
--- 802,817 ----
  SELECT '' AS to_number_11, to_number('.-01', 'S99.99');
  SELECT '' AS to_number_12, to_number('.01-', '99.99S');
  SELECT '' AS to_number_13, to_number(' . 0 1-', ' 9 9 . 9 9 S');
+ SELECT '' AS to_number_14, to_number('34,50','999,99');
+ SELECT '' AS to_number_15, to_number('123,000','999G');
+ SELECT '' AS to_number_16, to_number('123456','999G999');
+ SELECT '' AS to_number_17, to_number('$1234.56','L9,999.99');
+ SELECT '' AS to_number_18, to_number('$1234.56','L99,999.99');
+ SELECT '' AS to_number_19, to_number('$1,234.56','L99,999.99');
+ SELECT '' AS to_number_20, to_number('1234.56','L99,999.99');
+ SELECT '' AS to_number_21, to_number('1,234.56','L99,999.99');
+ SELECT '' AS to_number_22, to_number('42nd', '99th');
+ RESET lc_numeric;
 
  --
  -- Input syntax
Reply | Threaded
Open this post in threaded view
|

Re: Fix number skipping in to_number

Tom Lane-2
I wrote:
> That leads me to the attached patch.  There is more that could be done
> here --- in particular, I'd like to see the character-not-byte-count
> rule extended to literal text.  But that seems like fit material for
> a different patch.

Hearing no further comments, I pushed that patch.

                        regards, tom lane

Reply | Threaded
Open this post in threaded view
|

Re: Fix number skipping in to_number

Tom Lane-2
In reply to this post by Tom Lane-2
I wrote:
> That leads me to the attached patch.  There is more that could be done
> here --- in particular, I'd like to see the character-not-byte-count
> rule extended to literal text.  But that seems like fit material for
> a different patch.

Attached is a patch that makes formatting.c more multibyte-aware;
it now handles multibyte characters as single NODE_TYPE_CHAR format
nodes, rather than one node per byte.  This doesn't really have much
impact on the output (to_char) side, but on the input side, it
greatly simplifies treating such characters as single characters
rather than multiple ones.  An example is that (in UTF8 encoding)
previously we got

u8=# select to_number('$12.34', '€99D99');
 to_number
-----------
      0.34
(1 row)

because the literal euro sign is 3 bytes long and was thought to be
3 literal characters.  Now we get the expected result

u8=# select to_number('$12.34', '€99D99');
 to_number
-----------
     12.34
(1 row)

Aside from skipping 1 input character (not byte) per literal format
character, I fixed the SKIP_THth macro, allowing to_date/to_timestamp to
also follow the rule of skipping whole characters not bytes for non-data
format patterns.  There might be some other places that need similar
adjustments, but I couldn't find any.

Not sure about whether/how to add regression tests for this; it's really
impossible to add specific tests in an ASCII-only test file.  Maybe we
could put a test or two into collate.linux.utf8.sql, but it seems a bit
off topic for that, and I think that test file hardly gets run anyway.

Note this needs to be applied over the patch I posted at
https://postgr.es/m/3626.1510949486@...
I intend to commit that fairly soon, but it's not in right now.

                        regards, tom lane


diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index cb0dbf7..ec97de0 100644
*** a/src/backend/utils/adt/formatting.c
--- b/src/backend/utils/adt/formatting.c
*************** typedef enum
*** 151,158 ****
  FROM_CHAR_DATE_ISOWEEK /* ISO 8601 week date */
  } FromCharDateMode;
 
- typedef struct FormatNode FormatNode;
-
  typedef struct
  {
  const char *name;
--- 151,156 ----
*************** typedef struct
*** 162,174 ****
  FromCharDateMode date_mode;
  } KeyWord;
 
! struct FormatNode
  {
! int type; /* node type */
! const KeyWord *key; /* if node type is KEYWORD */
! char character; /* if node type is CHAR */
! int suffix; /* keyword suffix */
! };
 
  #define NODE_TYPE_END 1
  #define NODE_TYPE_ACTION 2
--- 160,172 ----
  FromCharDateMode date_mode;
  } KeyWord;
 
! typedef struct
  {
! int type; /* NODE_TYPE_XXX, see below */
! const KeyWord *key; /* if type is ACTION */
! char character[MAX_MULTIBYTE_CHAR_LEN + 1]; /* if type is CHAR */
! int suffix; /* keyword prefix/suffix code, if any */
! } FormatNode;
 
  #define NODE_TYPE_END 1
  #define NODE_TYPE_ACTION 2
*************** parse_format(FormatNode *node, const cha
*** 1282,1293 ****
  }
  else if (*str)
  {
  /*
  * Process double-quoted literal string, if any
  */
  if (*str == '"')
  {
! while (*(++str))
  {
  if (*str == '"')
  {
--- 1280,1294 ----
  }
  else if (*str)
  {
+ int chlen;
+
  /*
  * Process double-quoted literal string, if any
  */
  if (*str == '"')
  {
! str++;
! while (*str)
  {
  if (*str == '"')
  {
*************** parse_format(FormatNode *node, const cha
*** 1297,1307 ****
  /* backslash quotes the next character, if any */
  if (*str == '\\' && *(str + 1))
  str++;
  n->type = NODE_TYPE_CHAR;
! n->character = *str;
  n->key = NULL;
  n->suffix = 0;
  n++;
  }
  }
  else
--- 1298,1311 ----
  /* backslash quotes the next character, if any */
  if (*str == '\\' && *(str + 1))
  str++;
+ chlen = pg_mblen(str);
  n->type = NODE_TYPE_CHAR;
! memcpy(n->character, str, chlen);
! n->character[chlen] = '\0';
  n->key = NULL;
  n->suffix = 0;
  n++;
+ str += chlen;
  }
  }
  else
*************** parse_format(FormatNode *node, const cha
*** 1312,1323 ****
  */
  if (*str == '\\' && *(str + 1) == '"')
  str++;
  n->type = NODE_TYPE_CHAR;
! n->character = *str;
  n->key = NULL;
  n->suffix = 0;
  n++;
! str++;
  }
  }
  }
--- 1316,1329 ----
  */
  if (*str == '\\' && *(str + 1) == '"')
  str++;
+ chlen = pg_mblen(str);
  n->type = NODE_TYPE_CHAR;
! memcpy(n->character, str, chlen);
! n->character[chlen] = '\0';
  n->key = NULL;
  n->suffix = 0;
  n++;
! str += chlen;
  }
  }
  }
*************** dump_node(FormatNode *node, int max)
*** 1349,1355 ****
  elog(DEBUG_elog_output, "%d:\t NODE_TYPE_ACTION '%s'\t(%s,%s)",
  a, n->key->name, DUMP_THth(n->suffix), DUMP_FM(n->suffix));
  else if (n->type == NODE_TYPE_CHAR)
! elog(DEBUG_elog_output, "%d:\t NODE_TYPE_CHAR '%c'", a, n->character);
  else if (n->type == NODE_TYPE_END)
  {
  elog(DEBUG_elog_output, "%d:\t NODE_TYPE_END", a);
--- 1355,1362 ----
  elog(DEBUG_elog_output, "%d:\t NODE_TYPE_ACTION '%s'\t(%s,%s)",
  a, n->key->name, DUMP_THth(n->suffix), DUMP_FM(n->suffix));
  else if (n->type == NODE_TYPE_CHAR)
! elog(DEBUG_elog_output, "%d:\t NODE_TYPE_CHAR '%s'",
! a, n->character);
  else if (n->type == NODE_TYPE_END)
  {
  elog(DEBUG_elog_output, "%d:\t NODE_TYPE_END", a);
*************** asc_toupper_z(const char *buff)
*** 2008,2015 ****
  do { \
  if (S_THth(_suf)) \
  { \
! if (*(ptr)) (ptr)++; \
! if (*(ptr)) (ptr)++; \
  } \
  } while (0)
 
--- 2015,2022 ----
  do { \
  if (S_THth(_suf)) \
  { \
! if (*(ptr)) (ptr) += pg_mblen(ptr); \
! if (*(ptr)) (ptr) += pg_mblen(ptr); \
  } \
  } while (0)
 
*************** is_next_separator(FormatNode *n)
*** 2076,2082 ****
 
  return true;
  }
! else if (isdigit((unsigned char) n->character))
  return false;
 
  return true; /* some non-digit input (separator) */
--- 2083,2090 ----
 
  return true;
  }
! else if (n->character[1] == '\0' &&
! isdigit((unsigned char) n->character[0]))
  return false;
 
  return true; /* some non-digit input (separator) */
*************** DCH_to_char(FormatNode *node, bool is_in
*** 2405,2412 ****
  {
  if (n->type != NODE_TYPE_ACTION)
  {
! *s = n->character;
! s++;
  continue;
  }
 
--- 2413,2420 ----
  {
  if (n->type != NODE_TYPE_ACTION)
  {
! strcpy(s, n->character);
! s += strlen(s);
  continue;
  }
 
*************** DCH_from_char(FormatNode *node, char *in
*** 2974,2980 ****
  * we don't insist that the consumed character match the format's
  * character.
  */
! s++;
  continue;
  }
 
--- 2982,2988 ----
  * we don't insist that the consumed character match the format's
  * character.
  */
! s += pg_mblen(s);
  continue;
  }
 
*************** get_last_relevant_decnum(char *num)
*** 4217,4223 ****
  /*
   * These macros are used in NUM_processor() and its subsidiary routines.
   * OVERLOAD_TEST: true if we've reached end of input string
!  * AMOUNT_TEST(s): true if at least s characters remain in string
   */
  #define OVERLOAD_TEST (Np->inout_p >= Np->inout + input_len)
  #define AMOUNT_TEST(s) (Np->inout_p <= Np->inout + (input_len - (s)))
--- 4225,4231 ----
  /*
   * These macros are used in NUM_processor() and its subsidiary routines.
   * OVERLOAD_TEST: true if we've reached end of input string
!  * AMOUNT_TEST(s): true if at least s bytes remain in string
   */
  #define OVERLOAD_TEST (Np->inout_p >= Np->inout + input_len)
  #define AMOUNT_TEST(s) (Np->inout_p <= Np->inout + (input_len - (s)))
*************** NUM_processor(FormatNode *node, NUMDesc
*** 4821,4829 ****
  if (!Np->is_to_char)
  {
  /*
! * Check at least one character remains to be scanned.  (In
! * actions below, must use AMOUNT_TEST if we want to read more
! * characters than that.)
  */
  if (OVERLOAD_TEST)
  break;
--- 4829,4837 ----
  if (!Np->is_to_char)
  {
  /*
! * Check at least one byte remains to be scanned.  (In actions
! * below, must use AMOUNT_TEST if we want to read more bytes than
! * that.)
  */
  if (OVERLOAD_TEST)
  break;
*************** NUM_processor(FormatNode *node, NUMDesc
*** 5081,5092 ****
  * In TO_CHAR, non-pattern characters in the format are copied to
  * the output.  In TO_NUMBER, we skip one input character for each
  * non-pattern format character, whether or not it matches the
! * format character.  (Currently, that's actually implemented as
! * skipping one input byte per non-pattern format byte, which is
! * wrong...)
  */
  if (Np->is_to_char)
! *Np->inout_p = n->character;
  }
  Np->inout_p++;
  }
--- 5089,5106 ----
  * In TO_CHAR, non-pattern characters in the format are copied to
  * the output.  In TO_NUMBER, we skip one input character for each
  * non-pattern format character, whether or not it matches the
! * format character.
  */
  if (Np->is_to_char)
! {
! strcpy(Np->inout_p, n->character);
! Np->inout_p += strlen(Np->inout_p);
! }
! else
! {
! Np->inout_p += pg_mblen(Np->inout_p);
! }
! continue;
  }
  Np->inout_p++;
  }
Reply | Threaded
Open this post in threaded view
|

Re: Fix number skipping in to_number

Ioseph Kim-2
In reply to this post by Oliver Ford

Hi.

I'm checking release note for version 11.


in that.

"L and TH now only consume characters that are not digits, positive/negative signs, decimal points, or commas."


postgres@postgres=# select to_number('1234', '+9999');
 to_number
-----------
       234

Is this right?


Regards, ioseph.




2017년 08월 10일 19:21에 Oliver Ford 이(가) 쓴 글:
Prevents an issue where numbers can be skipped in the to_number()
function when the format mask contains a "G" or a "," but the input
string doesn't contain a separator. This resolves the TODO item "Fix
to_number() handling for values not matching the format string".

== Change ==

Currently, if the format mask in to_number() has a "G" or a ",", it
will assume that the input string contains a separator character in
the same place. If however a number is there instead, that number will
be silently skipped and not appear in the output. So we get:

select to_number('34,50','999,99');
 to_number
-----------
       340
(1 row)

This patch checks the input string when it encounters a "G" or "," in
the format mask. If the separator character is found, the code moves
over it as normal. If it's not found, then the code no longer
increments the relevant pointer so as not to skip the character. After
the patch, we get the correct result:

select to_number('34,50','999,99');
 to_number
-----------
      3450
(1 row)

This is in line with Oracle's result.

== Rationale ==

This patch is a small change, which leaves PostgreSQL behavior
different from Oracle behavior in related cases. Oracle's
implementation seems to read from right-to-left, and raises an
"ORA-01722: invalid number" error if there are digits in the input
string which don't have corresponding characters in the format mask. I
have chosen not to throw such errors, because there are use cases for
only returning part of a number string. For example, the following is
an error on Oracle:

select to_number('123,000', '999G') from dual;
Error report -
SQL Error: ORA-01722: invalid number

But if you wanted to only take the characters before the comma, and
discard the thousands part, you can do so on PostgreSQL with:

select to_number('123,000', '999G');
 to_number
-----------
       123
(1 row)

This is the current behavior. Which is why I think it makes more sense
to do what PostgreSQL currently does and read from left-to-right. The
only change, as mentioned above, is that the current behavior can skip
a digit:

select to_number('123456', '999G999');
 to_number
-----------
   12356
(1 row)

After the patch, this returns all the digits:

select to_number('123456', '999G999');
 to_number
-----------
  123456
(1 row)

== Testing ==

Tested on Windows with MinGW using the latest checkout from master.
Added regression tests to check for this new behavior. All existing
tests pass.