A bug when use get_bit() function for a long bytea string


A bug when use get_bit() function for a long bytea string

Movead Li
Hello hackers,

I found an issue with the get_bit() and set_bit() functions; here it is:
################################
postgres=# select get_bit(pg_read_binary_file('/home/movead/temp/file_seek/f_512M'), 0);
2020-03-12 10:05:23.296 CST [10549] ERROR:  index 0 out of valid range, 0..-1
2020-03-12 10:05:23.296 CST [10549] STATEMENT:  select get_bit(pg_read_binary_file('/home/movead/temp/file_seek/f_512M'), 0);
ERROR:  index 0 out of valid range, 0..-1
postgres=# select set_bit(pg_read_binary_file('/home/movead/temp/file_seek/f_512M'), 0,1);
2020-03-12 10:05:27.959 CST [10549] ERROR:  index 0 out of valid range, 0..-1
2020-03-12 10:05:27.959 CST [10549] STATEMENT:  select set_bit(pg_read_binary_file('/home/movead/temp/file_seek/f_512M'), 0,1);
ERROR:  index 0 out of valid range, 0..-1
postgres=#
################################
PostgreSQL can handle bytea values of nearly 1GB, but here it reports an
error at 512MB. I looked into it and found the cause in byteaSetBit() and
byteaGetBit(): they use an 'int32 len' to hold the bit count of a long
bytea value, and 512MB * 8 bits obviously overflows an int32.
So I fixed it and tested it successfully, as below.
################################
postgres=# select get_bit(set_bit(pg_read_binary_file('/home/movead/temp/file_seek/f_512M'), 0,1),0);
 get_bit
---------
       1
(1 row)

postgres=# select get_bit(set_bit(pg_read_binary_file('/home/movead/temp/file_seek/f_512M'), 0,0),0);
 get_bit
---------
       0
(1 row)
################################


I also checked whether anything else related to bytea has this issue; several places have the same problem:
1. byteaout(): when formatting bytea in escape format, the 'len' variable should be int64, or it may hold an overflowed number.
2. esc_enc_len(): same as above, the 'len' variable should be int64, and the return type should change to int64. Since esc_enc_len() has the same call structure as pg_base64_enc_len() and hex_enc_len(), I want to change the return types of those two functions as well. The opposite function, esc_dec_len(), seems fine.
3. binary_encode() and binary_decode(): these use an 'int32 resultlen' to receive an 'unsigned int' function return, which seems uncomfortable.
I fixed everything mentioned above; the patch is attached.
What do you think?



Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca

Attachment: long_bytea_string_bug_fix.patch (4K)

Re: A bug when use get_bit() function for a long bytea string

Ashutosh Bapat-2
Hi

On Thu, Mar 12, 2020 at 9:21 AM [hidden email] <[hidden email]> wrote:

>
> Hello hackers,
>
> I found an issue about get_bit() and set_bit() function,here it is:
> ################################
> postgres=# select get_bit(pg_read_binary_file('/home/movead/temp/file_seek/f_512M'), 0);
> 2020-03-12 10:05:23.296 CST [10549] ERROR:  index 0 out of valid range, 0..-1
> 2020-03-12 10:05:23.296 CST [10549] STATEMENT:  select get_bit(pg_read_binary_file('/home/movead/temp/file_seek/f_512M'), 0);
> ERROR:  index 0 out of valid range, 0..-1
> postgres=# select set_bit(pg_read_binary_file('/home/movead/temp/file_seek/f_512M'), 0,1);
> 2020-03-12 10:05:27.959 CST [10549] ERROR:  index 0 out of valid range, 0..-1
> 2020-03-12 10:05:27.959 CST [10549] STATEMENT:  select set_bit(pg_read_binary_file('/home/movead/temp/file_seek/f_512M'), 0,1);
> ERROR:  index 0 out of valid range, 0..-1
> postgres=#
> ################################
> PostgreSQL can handle bytea values of nearly 1GB, but here it reports an
> error at 512MB. I looked into it and found the cause in byteaSetBit() and
> byteaGetBit(): they use an 'int32 len' to hold the bit count of a long
> bytea value, and 512MB * 8 bits obviously overflows an int32.
> So I fixed it and tested it successfully, as below.
> ################################

Thanks for the bug report and the analysis. The analysis looks correct.

> postgres=# select get_bit(set_bit(pg_read_binary_file('/home/movead/temp/file_seek/f_512M'), 0,1),0);
>  get_bit
> ---------
>        1
> (1 row)
>
> postgres=# select get_bit(set_bit(pg_read_binary_file('/home/movead/temp/file_seek/f_512M'), 0,0),0);
>  get_bit
> ---------
>        0
> (1 row)
> ################################
>
>
> I also checked whether anything else related to bytea has this issue; several places have the same problem:
> 1. byteaout(): when formatting bytea in escape format, the 'len' variable should be int64, or it may hold an overflowed number.
> 2. esc_enc_len(): same as above, the 'len' variable should be int64, and the return type should change to int64. Since esc_enc_len() has the same call structure as pg_base64_enc_len() and hex_enc_len(), I want to change the return types of those two functions as well. The opposite function, esc_dec_len(), seems fine.
> 3. binary_encode() and binary_decode(): these use an 'int32 resultlen' to receive an 'unsigned int' function return, which seems uncomfortable.
> I fixed everything mentioned above; the patch is attached.
> What do you think?

Why have you used Size? Shouldn't we use int64?

Also in the change
@@ -3458,15 +3458,15 @@ byteaGetBit(PG_FUNCTION_ARGS)
     int32        n = PG_GETARG_INT32(1);
     int            byteNo,
                 bitNo;
-    int            len;
+    Size        len;

If get_bit()/set_bit() accept the second argument as int32, it can not
be used to set bits whose number does not fit 32 bits. I think we need
to change the type of the second argument as well.

Also, I think declaring len to be int is fine since 1GB would fit in an
int; what does not fit is len * 8. When performing that calculation we
have to widen the operands. So, instead of changing the datatype of
len, it might be better to perform the calculation as (int64) len * 8.
If we use int64, we could also use INT64_FORMAT instead of %ld.

Since this is a bug it shouldn't wait another commitfest, but still
add this patch to the commitfest so that it's not forgotten.

--
Best Wishes,
Ashutosh Bapat



Re: Re: A bug when use get_bit() function for a long bytea string

Movead Li
Thanks for the reply.

>Why have you used size? Shouldn't we use int64?
Yes, thanks for the point, I have changed the patch.
 
>If get_bit()/set_bit() accept the second argument as int32, it can not
>be used to set bits whose number does not fit 32 bits. I think we need
>to change the type of the second argument as well.
Because int32 can cover the length of bytea that PostgreSQL supports,
and I have decided to follow your next point (not using a 64-bit int
for len), I think the second argument can stay int32.

>Also, I think declaring len to be int is fine since 1G would fit an
>int, but what does not fit is len * 8, when performing that
>calculation, we have to widen the result. So, instead of changing the
>datatype of len, it might be better to perform the calculation as
>(int64)len * 8. If we use int64, we could also use INT64_FORMAT
>instead of using %ld.
Have followed and changed the patch.
 
>Since this is a bug it shouldn't wait another commitfest, but still
>add this patch to the commitfest so that it's not forgotten.
Will do.


Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca

Attachment: long_bytea_string_bug_fix_ver2.patch (6K)

Re: Re: A bug when use get_bit() function for a long bytea string

Ashutosh Bapat-3


On Fri, 13 Mar 2020 at 08:48, [hidden email] <[hidden email]> wrote:
Thanks for the reply.

>Why have you used size? Shouldn't we use int64?
Yes, thanks for the point, I have changed the patch.
 

Thanks for the patch.
 
>If get_bit()/set_bit() accept the second argument as int32, it can not
>be used to set bits whose number does not fit 32 bits. I think we need
>to change the type of the second argument as well.
Because int32 can cover the length of bytea that PostgreSQL supports,

I think the second argument indicates the bit position, which would be max bytea length * 8. If max bytea length covers whole int32, the second argument needs to be wider i.e. int64.

Some more comments on the patch
 struct pg_encoding
 {
- unsigned (*encode_len) (const char *data, unsigned dlen);
+ int64 (*encode_len) (const char *data, unsigned dlen);
  unsigned (*decode_len) (const char *data, unsigned dlen);
  unsigned (*encode) (const char *data, unsigned dlen, char *res);
  unsigned (*decode) (const char *data, unsigned dlen, char *res);

Why not use return type of int64 for rest of the functions here as well?

  res = enc->encode(VARDATA_ANY(data), datalen, VARDATA(result));
 
  /* Make this FATAL 'cause we've trodden on memory ... */
- if (res > resultlen)
+ if ((int64)res > resultlen)

If we change the return type of all those functions to int64, we won't need this cast.

Right now we are using int64 because bytea can be 1GB, but what if we increase
that limit tomorrow, will int64 be sufficient? That may be unlikely in the near
future, but nevertheless a possibility. Should we then define a new datatype
which resolves to int64 right now but really depends upon the bytea length. I
am not suggesting that we have to do it right now, but may be something to
think about.
 
 hex_enc_len(const char *src, unsigned srclen)
 {
- return srclen << 1;
+ return (int64)(srclen << 1);

Why not convert srclen to int64 as well? That will also change the pg_encoding
member definitions. But if the encoded length requires int64 to fit the possible
values, the same would be true for the source lengths. Why can't the source
length also be int64?

If still we want the cast, I think it should be ((int64)srclen << 1) rather
than casting the result.
 
  /* 3 bytes will be converted to 4, linefeed after 76 chars */
- return (srclen + 2) * 4 / 3 + srclen / (76 * 3 / 4);
+ return (int64)((srclen + 2) * 4 / 3 + srclen / (76 * 3 / 4));
similar comments as above.
 
 SELECT set_bit(B'0101011000100100', 16, 1); -- fail
 ERROR:  bit index 16 out of valid range (0..15)
+SELECT get_bit(
+       set_bit((repeat('Postgres', 512 * 1024 * 1024 / 8))::bytea, 0, 0)
+       ,0);
+ get_bit
+---------
+       0
+(1 row)

It might help to add a test where we could pass the second argument something
greater than 1G. But it may be difficult to write such a test case.

--
Best Wishes,
Ashutosh

Re: A bug when use get_bit() function for a long bytea string

Movead Li

Hello thanks for the detailed review,

>I think the second argument indicates the bit position, which would be max bytea length * 8. If max bytea length covers whole int32, the second argument >needs to be wider i.e. int64.
Yes, it makes sense; followed.

> Some more comments on the patch
> struct pg_encoding

> {
>- unsigned (*encode_len) (const char *data, unsigned dlen);
>+ int64 (*encode_len) (const char *data, unsigned dlen);
 unsigned (*decode_len) (const char *data, unsigned dlen);
 unsigned (*encode) (const char *data, unsigned dlen, char *res);
>  unsigned (*decode) (const char *data, unsigned dlen, char *res);
> Why not use return type of int64 for rest of the functions here as well?
>  res = enc->encode(VARDATA_ANY(data), datalen, VARDATA(result));
>  /* Make this FATAL 'cause we've trodden on memory ... */
>- if (res > resultlen)
>+ if ((int64)res > resultlen)
>
>if we change return type of all those functions to int64, we won't need this cast.
I changed the 'encode' function, which needs an int64 return type, but kept
the other functions in 'pg_encoding', because I see no necessary reason to change them.

>Right now we are using int64 because bytea can be 1GB, but what if we increase
>that limit tomorrow, will int64 be sufficient? That may be unlikely in the near
>future, but nevertheless a possibility. Should we then define a new datatype
>which resolves to int64 right now but really depends upon the bytea length. I
>am not suggesting that we have to do it right now, but may be something to
>think about.
I decided to use int64 because if we really want to increase the limit, the
same change would be needed for 'text' and 'varchar', which have the same
limit, so it may need a more general structure. Hence I gave up that idea.

> hex_enc_len(const char *src, unsigned srclen)
> {
>- return srclen << 1;
>+ return (int64)(srclen << 1);
>
>why not to convert srclen also to int64. That will also change the pg_encoding
>member definitions. But if encoded length requires int64 to fit the possibly
>values, same would be true for the source lengths. Why can't the source length
>also be int64?
>If still we want the cast, I think it should be ((int64)srclen << 1) rather
>than casting the result.
I prefer the '((int64)srclen << 1)' way.

>  /* 3 bytes will be converted to 4, linefeed after 76 chars */
>- return (srclen + 2) * 4 / 3 + srclen / (76 * 3 / 4);
>+ return (int64)((srclen + 2) * 4 / 3 + srclen / (76 * 3 / 4));
>similar comments as above.
 Followed.


>It might help to add a test where we could pass the second argument something
>greater than 1G. But it may be difficult to write such a test case.
Added two test cases.



Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca

Attachment: long_bytea_string_bug_fix_ver3.patch (12K)

Re: A bug when use get_bit() function for a long bytea string

Ashutosh Bapat-3


On Wed, 18 Mar 2020 at 08:18, [hidden email] <[hidden email]> wrote:

Hello thanks for the detailed review,

>I think the second argument indicates the bit position, which would be max bytea length * 8. If max bytea length covers whole int32, the second argument >needs to be wider i.e. int64.
Yes, it makes sense; followed.

I think we need a similar change in byteaGetBit() and byteaSetBit() as well.
 

> Some more comments on the patch
> struct pg_encoding

> {
>- unsigned (*encode_len) (const char *data, unsigned dlen);
>+ int64 (*encode_len) (const char *data, unsigned dlen);
 unsigned (*decode_len) (const char *data, unsigned dlen);
 unsigned (*encode) (const char *data, unsigned dlen, char *res);
>  unsigned (*decode) (const char *data, unsigned dlen, char *res);
> Why not use return type of int64 for rest of the functions here as well?
>  res = enc->encode(VARDATA_ANY(data), datalen, VARDATA(result));
>  /* Make this FATAL 'cause we've trodden on memory ... */
>- if (res > resultlen)
>+ if ((int64)res > resultlen)
>
>if we change return type of all those functions to int64, we won't need this cast.
I change the 'encode' function, it needs an int64 return type, but keep other 
functions in 'pg_encoding', because I think it of no necessary reason.

Ok, let's leave it for a committer to decide. 


>Right now we are using int64 because bytea can be 1GB, but what if we increase
>that limit tomorrow, will int64 be sufficient? That may be unlikely in the near
>future, but nevertheless a possibility. Should we then define a new datatype
>which resolves to int64 right now but really depends upon the bytea length. I
>am not suggesting that we have to do it right now, but may be something to
>think about.
I decide to use int64 because if we really want to increase the limit,  it should be
the same change with 'text', 'varchar' which have the same limit. So it may need
a more general struct. Hence I give up the idea.

Hmm, Let's see what a committer says.

Some more review comments.
+   int64       res,resultlen;

We need those on separate lines, possibly.

+   byteNo = (int32)(n / BITS_PER_BYTE);
Does it hurt to have byteNo as int64 so as to avoid a cast? Otherwise, please
add a comment explaining the reason for the cast. The comment applies at other
places where this change appears.

-       int         len;
+       int64       len;
Why do we need this change?
        int         i;
 


>It might help to add a test where we could pass the second argument something
>greater than 1G. But it may be difficult to write such a test case.
Add two test cases.
 
+
+select get_bit(
+        set_bit((repeat('Postgres', 512 * 1024 * 1024 / 8))::bytea, 1024 * 1024 * 1024 + 1, 0)
+       ,1024 * 1024 * 1024 + 1);

This bit position is still within int4.
postgres=# select pg_column_size(1024 * 1024 * 1024 + 1);
 pg_column_size
----------------
              4  
(1 row)

You want something like
postgres=# select pg_column_size(512::bigint * 1024 * 1024 * 8);
 pg_column_size
----------------
              8  
(1 row)

--
Best Wishes,
Ashutosh

Re: A bug when use get_bit() function for a long bytea string

Tom Lane-2
Ashutosh Bapat <[hidden email]> writes:
> On Wed, 18 Mar 2020 at 08:18, [hidden email] <[hidden email]>
> wrote:
>> if we change return type of all those functions to int64, we won't need
>> this cast.
>> I change the 'encode' function, it needs an int64 return type, but keep
>> other
>> functions in 'pg_encoding', because I think it of no necessary reason.

> Ok, let's leave it for a committer to decide.

If I'm grasping the purpose of these correctly, wouldn't Size or size_t
be a more appropriate type?  And I definitely agree with changing all
of these APIs at once, if they're all dealing with the same kind of
value.

                        regards, tom lane



Re: A bug when use get_bit() function for a long bytea string

Ashutosh Bapat-3


On Thu, 26 Mar 2020 at 19:39, Tom Lane <[hidden email]> wrote:
Ashutosh Bapat <[hidden email]> writes:
> On Wed, 18 Mar 2020 at 08:18, [hidden email] <[hidden email]>
> wrote:
>> if we change return type of all those functions to int64, we won't need
>> this cast.
>> I change the 'encode' function, it needs an int64 return type, but keep
>> other
>> functions in 'pg_encoding', because I think it of no necessary reason.

> Ok, let's leave it for a committer to decide.

If I'm grasping the purpose of these correctly, wouldn't Size or size_t
be a more appropriate type? 

Andy had used Size in his earlier patch. But I didn't understand the reason behind it and Andy didn't give any reason. From the patch and the code around the changes some kind of int (so int64) looked better. But if there's a valid reason for using Size, I am fine with it too. Do we have a SQL datatype corresponding to Size?

--
Best Wishes,
Ashutosh

Re: A bug when use get_bit() function for a long bytea string

Daniel Verite
In reply to this post by Ashutosh Bapat-3
        Ashutosh Bapat wrote:

> I think we need a similar change in byteaGetBit() and byteaSetBit()
> as well.

get_bit() and set_bit() as SQL functions take an int4 as the "offset"
argument representing the bit number, meaning that the maximum value
that can be passed is 2^31-1.
But the maximum theoretical size of a bytea value being 1 gigabyte or
2^30 bytes, the real maximum bit number in a bytea equals 2^33-1
(2^33=8*2^30), which doesn't fit into an "int4".  As a result, the
part of a bytea beyond the first 256MB is inaccessible to get_bit()
and set_bit().

So aside from the integer overflow bug, isn't there the issue that the
"offset" argument of get_bit() and set_bit() should have been an
int8 in the first place?


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite



Re: A bug when use get_bit() function for a long bytea string

Tom Lane-2
"Daniel Verite" <[hidden email]> writes:
> So aside from the integer overflow bug, isn't there the issue that the
> "offset" argument of get_bit() and set_bit() should have been an
> int8 in the first place?

Good point, but a fix for that wouldn't be back-patchable.

It does suggest that we should just make all the internal logic use int8
for these values (as the solution to the overflow issue), and then in
HEAD only, adjust the function signatures so that int8 can be passed in.

                        regards, tom lane



Re: A bug when use get_bit() function for a long bytea string

Movead Li
In reply to this post by Ashutosh Bapat-3
>I think the second argument indicates the bit position, which would be max bytea length * 8. If max bytea length covers whole int32, the second argument >needs to be wider i.e. int64.
Yes, it makes sense; followed.

>I think we need a similar change in byteaGetBit() and byteaSetBit() as well.
Sorry, I think it's my mistake; it is the two functions above that should be changed.


> Some more comments on the patch
> struct pg_encoding

> {
>- unsigned (*encode_len) (const char *data, unsigned dlen);
>+ int64 (*encode_len) (const char *data, unsigned dlen);
 unsigned (*decode_len) (const char *data, unsigned dlen);
 unsigned (*encode) (const char *data, unsigned dlen, char *res);
>  unsigned (*decode) (const char *data, unsigned dlen, char *res);
> Why not use return type of int64 for rest of the functions here as well?
>  res = enc->encode(VARDATA_ANY(data), datalen, VARDATA(result));
>  /* Make this FATAL 'cause we've trodden on memory ... */
>- if (res > resultlen)
>+ if ((int64)res > resultlen)
>
>if we change return type of all those functions to int64, we won't need this cast.
I change the 'encode' function, it needs an int64 return type, but keep other 
functions in 'pg_encoding', because I think it of no necessary reason.

>Ok, let's leave it for a committer to decide. 
Well, I changed all of them this time, because Tom Lane supports this in the next mail.

 
>Some more review comments.
>+   int64       res,resultlen;
>We need those on separate lines, possibly.
Done
>+   byteNo = (int32)(n / BITS_PER_BYTE);
>Does it hurt to have byteNo as int64 so as to avoid a cast. Otherwise, please
>add a comment explaining the reason for the cast. The comment applies at other
>places where this change appears.
>-       int         len;
>+       int64       len;
>Why do we need this change?
>        int         i;
It is my mistake, as described above; bitgetbit()/bitsetbit() should not have been changed.



>It might help to add a test where we could pass the second argument something
>greater than 1G. But it may be difficult to write such a test case.
Add two test cases.
 
>+
>+select get_bit(
>+        set_bit((repeat('Postgres', 512 * 1024 * 1024 / 8))::bytea, 1024 * 1024 * 1024 + 1, 0)
>+       ,1024 * 1024 * 1024 + 1);

>This bit position is still within int4.
>postgres=# select pg_column_size(1024 * 1024 * 1024 + 1);
> pg_column_size
>----------------
>              4  
>(1 row)

>You want something like
>postgres=# select pg_column_size(512::bigint * 1024 * 1024 * 8);
> pg_column_size
>----------------
>              8  
>(1 row)
I intended to test a size larger than 1GB; I think your suggestion is better, and I followed it.



Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca


Attachment: long_bytea_string_bug_fix_ver4.patch (13K)

Re: A bug when use get_bit() function for a long bytea string

Movead Li
In reply to this post by Ashutosh Bapat-3
I want to resend the mail, because the last one was in the wrong format and hard to read.

In addition, I think 'Size' and 'size_t' depend on the platform; they may not
work on 32-bit systems. So I chose 'int64' after Ashutosh's review.

>>I think the second argument indicates the bit position, which would be max bytea length * 8. If max bytea length covers whole int32, the second argument >needs to be wider i.e. int64.
>Yes, it makes sence and followed.

>>I think we need a similar change in byteaGetBit() and byteaSetBit() as well.
Sorry, I think it's my mistake; it is the two functions above that should be changed.


>>Some more comments on the patch
>> struct pg_encoding
>> {
>>- unsigned (*encode_len) (const char *data, unsigned dlen);
>>+ int64 (*encode_len) (const char *data, unsigned dlen);
>>  unsigned (*decode_len) (const char *data, unsigned dlen);
>>  unsigned (*encode) (const char *data, unsigned dlen, char *res);
>>  unsigned (*decode) (const char *data, unsigned dlen, char *res);
>> Why not use return type of int64 for rest of the functions here as well?
>>  res = enc->encode(VARDATA_ANY(data), datalen, VARDATA(result));
>>  /* Make this FATAL 'cause we've trodden on memory ... */
>>- if (res > resultlen)
>>+ if ((int64)res > resultlen)
>>
>>if we change return type of all those functions to int64, we won't need this cast.
>I change the 'encode' function, it needs an int64 return type, but keep other 
>functions in 'pg_encoding', because I think it of no necessary reason.

>>Ok, let's leave it for a committer to decide. 
Well, I changed all of them this time, because Tom Lane supports this in the next mail.

 
>Some more review comments.
>+   int64       res,resultlen;
>We need those on separate lines, possibly.
Done

>+   byteNo = (int32)(n / BITS_PER_BYTE);
>Does it hurt to have byteNo as int64 so as to avoid a cast. Otherwise, please
>add a comment explaining the reason for the cast. The comment applies at other
>places where this change appears.
>-       int         len;
>+       int64       len;
>Why do we need this change?
>        int         i;
It is my mistake, as described above; bitgetbit()/bitsetbit() should not have been changed.



>>It might help to add a test where we could pass the second argument something
>>greater than 1G. But it may be difficult to write such a test case.
>Add two test cases.
 
>+
>+select get_bit(
>+        set_bit((repeat('Postgres', 512 * 1024 * 1024 / 8))::bytea, 1024 * 1024 * 1024 + 1, 0)
>+       ,1024 * 1024 * 1024 + 1);

>This bit position is still within int4.
>postgres=# select pg_column_size(1024 * 1024 * 1024 + 1); 
> pg_column_size
>----------------
>              4   
>(1 row)

>You want something like
>postgres=# select pg_column_size(512::bigint * 1024 * 1024 * 8); 
> pg_column_size
>----------------
>              8   
>(1 row)
I intended to test a size larger than 1GB; I think your suggestion is better, and I followed it.




Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca


Re: A bug when use get_bit() function for a long bytea string

Ashutosh Bapat-3
Thanks for the changes,
+ int64 res,resultlen;

It's better to have them on separate lines.

-unsigned
+int64
 hex_decode(const char *src, unsigned len, char *dst)

Do we want to explicitly cast the return value to int64? Will the build crib on some platform if we don't? I don't know of such a platform, but my knowledge in this area is not great.

+ byteNo = (int)(n / 8);
+ bitNo = (int)(n % 8);
some comment explaining why this downcasting is safe here?

-  proname => 'get_bit', prorettype => 'int4', proargtypes => 'bytea int4',
+  proname => 'get_bit', prorettype => 'int4', proargtypes => 'bytea int8',
   prosrc => 'byteaGetBit' },
 { oid => '724', descr => 'set bit',
-  proname => 'set_bit', prorettype => 'bytea', proargtypes => 'bytea int4 int4',
+  proname => 'set_bit', prorettype => 'bytea', proargtypes => 'bytea int8 int4',
   prosrc => 'byteaSetBit' },

Shouldn't we have similar changes for following entries as well?
{ oid => '3032', descr => 'get bit',
  proname => 'get_bit', prorettype => 'int4', proargtypes => 'bit int4',
  prosrc => 'bitgetbit' },
{ oid => '3033', descr => 'set bit',
  proname => 'set_bit', prorettype => 'bit', proargtypes => 'bit int4 int4',
  prosrc => 'bitsetbit' },

The tests you have added are for the bytea variants, which ultimately call byteaGetBit()/byteaSetBit(). But I think we also need tests for the bit variants, which will ultimately call the bitgetbit and bitsetbit functions.

Once you address these comments, I think the patch is good for a committer. So please mark the commitfest entry as such when you post the next version of patch.

On Sat, 28 Mar 2020 at 14:40, [hidden email] <[hidden email]> wrote:



--
Best Wishes,
Ashutosh

Re: A bug when use get_bit() function for a long bytea string

Movead Li

>+ int64 res,resultlen;
>It's better to have them on separate lines.
Sorry for that, done.

>-unsigned
>+int64
> hex_decode(const char *src, unsigned len, char *dst)
>Do we want to explicitly cast the return value to int64? Will build on some platform crib if not done so?
>I don't know of such a platform but my knowledge in this area is not great.
I think the current change makes sure nothing goes wrong.

>+ byteNo = (int)(n / 8);
>+ bitNo = (int)(n % 8);
>some comment explaining why this downcasting is safe here?
Done

>-  proname => 'get_bit', prorettype => 'int4', proargtypes => 'bytea int4',
>+  proname => 'get_bit', prorettype => 'int4', proargtypes => 'bytea int8',
>   prosrc => 'byteaGetBit' },
>{ oid => '724', descr => 'set bit',
>-  proname => 'set_bit', prorettype => 'bytea', proargtypes => 'bytea int4 int4',
>+  proname => 'set_bit', prorettype => 'bytea', proargtypes => 'bytea int8 int4',
>   prosrc => 'byteaSetBit' },
>Shouldn't we have similar changes for following entries as well?
>{ oid => '3032', descr => 'get bit',
> proname => 'get_bit', prorettype => 'int4', proargtypes => 'bit int4',
>  prosrc => 'bitgetbit' },
>{ oid => '3033', descr => 'set bit',
>  proname => 'set_bit', prorettype => 'bit', proargtypes => 'bit int4 int4',
>  prosrc => 'bitsetbit' },
Because 'bitgetbit' and 'bitsetbit' do not need to multiply the length by 8
to compute the bit count, I don't think they need to be changed.

>The tests you have added are for bytea variant which ultimately calles byteaGet/SetBit(). 
>But I think we also need tests for bit variants which will ultimately call bitgetbit and bitsetbit functions.
As above, there is no need to touch 'bitgetbit' and 'bitsetbit'.

>Once you address these comments, I think the patch is good for a committer. 
>So please mark the commitfest entry as such when you post the next version of patch.
Thanks again for the detailed review; the updated patch is attached.


Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca

Attachment: long_bytea_string_bug_fix_ver5.patch (14K)

Re: A bug when use get_bit() function for a long bytea string

Tom Lane-2
"[hidden email]" <[hidden email]> writes:
> [ long_bytea_string_bug_fix_ver5.patch ]

I don't think this has really solved the overflow hazards.  For example,
in binary_encode we've got

        resultlen = enc->encode_len(VARDATA_ANY(data), datalen);
        result = palloc(VARHDRSZ + resultlen);

and all you've done about that is changed resultlen from int to int64.
On a 64-bit machine, sure, palloc will be able to detect if the
result exceeds what can be allocated --- but on a 32-bit machine
it'd be possible for the size_t argument to overflow altogether.
(Or if you want to argue that it can't overflow because no encoder
expands the data more than 4X, then we don't need to be making this
change at all.)

I don't think there's really any way to do that safely without an
explicit check before we call palloc.

I also still find the proposed signatures for the encoding-specific
functions to be just plain weird:

- unsigned (*encode_len) (const char *data, unsigned dlen);
- unsigned (*decode_len) (const char *data, unsigned dlen);
- unsigned (*encode) (const char *data, unsigned dlen, char *res);
- unsigned (*decode) (const char *data, unsigned dlen, char *res);
+ int64 (*encode_len) (const char *data, unsigned dlen);
+ int64 (*decode_len) (const char *data, unsigned dlen);
+ int64 (*encode) (const char *data, unsigned dlen, char *res);
+ int64 (*decode) (const char *data, unsigned dlen, char *res);

Why did you change the outputs from unsigned to signed?  Why didn't
you change the dlen inputs?  I grant that we know that the input
can't exceed 1GB in Postgres' usage, but these signatures are just
randomly inconsistent, and you didn't even add a comment explaining
why.  Personally I think I'd make them like

        uint64 (*encode_len) (const char *data, size_t dlen);

which makes it explicit that the dlen argument describes the length
of a chunk of allocated memory, while the result might exceed that.

Lastly, there is a component of this that can be back-patched and
a component that can't --- we do not change system catalog contents
in released branches.  Cramming both parts into the same patch
is forcing the committer to pull them apart, which is kind of
unfriendly.

                        regards, tom lane



Re: A bug when use get_bit() function for a long bytea string

Movead Li

>I don't think this has really solved the overflow hazards.  For example,
>in binary_encode we've got
 
>resultlen = enc->encode_len(VARDATA_ANY(data), datalen);
>result = palloc(VARHDRSZ + resultlen);
 
>and all you've done about that is changed resultlen from int to int64.
>On a 64-bit machine, sure, palloc will be able to detect if the
>result exceeds what can be allocated --- but on a 32-bit machine
>it'd be possible for the size_t argument to overflow altogether.
>(Or if you want to argue that it can't overflow because no encoder
>expands the data more than 4X, then we don't need to be making this
>change at all.)
 
>I don't think there's really any way to do that safely without an
>explicit check before we call palloc.
I am sorry, I do not quite understand this, especially what 'size_t' means here.
I changed resultlen from int to int64 so that the error report shows a correct
value rather than '-1' or some other strange number.


 
>Why did you change the outputs from unsigned to signed?  Why didn't
>you change the dlen inputs?  I grant that we know that the input
>can't exceed 1GB in Postgres' usage, but these signatures are just
>randomly inconsistent, and you didn't even add a comment explaining
>why.  Personally I think I'd make them like
>uint64 (*encode_len) (const char *data, size_t dlen);
>which makes it explicit that the dlen argument describes the length
>of a chunk of allocated memory, while the result might exceed that.
I think that makes sense, and I have followed it.
 
>Lastly, there is a component of this that can be back-patched and
>a component that can't --- we do not change system catalog contents
>in released branches.  Cramming both parts into the same patch
>is forcing the committer to pull them apart, which is kind of
>unfriendly.
Sorry about that. Attached is the changed patch for PG13; the ones for
older branches will follow soon.


Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca

Attachment: long_bytea_string_bug_fix_ver6.patch (14K)

Re: A bug when use get_bit() function for a long bytea string

Movead Li

>Sorry about that, attached is the changed patch for PG13, and the one
>for older branches will send sooner.
A little update to the patch, and patches for all stable branches are now available.


Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca

Attachment: long_bytea_string_bug_fix_ver6.zip (14K)

Re: A bug when use get_bit() function for a long bytea string

Daniel Verite
        [hidden email] wrote:

> A little update to the patch, and patches for all stable branches are now available.

Some comments about the set_bit/get_bit parts.
I'm reading long_bytea_string_bug_fix_ver6_pg10.patch, but that
applies probably to the other files meant for the existing releases
(I think you could get away with only one patch for backpatching
and one patch for v13, and committers will sort out how
to deal with release branches).

 byteaSetBit(PG_FUNCTION_ARGS)
 {
        bytea   *res = PG_GETARG_BYTEA_P_COPY(0);
- int32 n = PG_GETARG_INT32(1);
+ int64 n = PG_GETARG_INT64(1);
        int32 newBit = PG_GETARG_INT32(2);

The 2nd argument is 32-bit, not 64. PG_GETARG_INT32(1) must be used.

+ errmsg("index "INT64_FORMAT" out of valid range, 0.."INT64_FORMAT,
+ n, (int64)len * 8 - 1)));

The cast to int64 avoids the overflow, but it may also produce a
result that does not reflect the actual range, which is limited to
2^31-1, again because the bit number is a signed 32-bit value.

I believe the formula for the upper limit in the 32-bit case is:
  (len <= PG_INT32_MAX / 8) ? (len*8 - 1) : PG_INT32_MAX;

These functions could benefit from a comment mentioning that
they cannot reach the full extent of a bytea, because of the size limit
on the bit number.

--- a/src/test/regress/expected/bit.out
+++ b/src/test/regress/expected/bit.out
@@ -656,6 +656,40 @@ SELECT set_bit(B'0101011000100100', 15, 1);
 
 SELECT set_bit(B'0101011000100100', 16, 1); -- fail
ERROR:  bit index 16 out of valid range (0..15)
+SELECT get_bit(
+ set_bit((repeat('Postgres', 512 * 1024 * 1024 / 8))::bytea, 0, 0)
+ ,0);
+ get_bit
+---------
+ 0
+(1 row)
+
+SELECT get_bit(
+ set_bit((repeat('Postgres', 512 * 1024 * 1024 / 8))::bytea, 0, 1)
+ ,0);
+ get_bit
+---------
+ 1
+(1 row)
+

These 2 tests need to allocate big chunks of contiguous memory, so they
might fail for lack of memory on tiny machines, and even when not failing,
they're pretty slow to run. Are they worth the trouble?

+select get_bit(
+ set_bit((repeat('Postgres', 512 * 1024 * 1024 / 8))::bytea,
+ 512::bigint * 1024 * 1024 * 8 - 1, 0)
+ ,512::bigint * 1024 * 1024 * 8 - 1);
+ get_bit
+---------
+ 0
+(1 row)
+
+select get_bit(
+ set_bit((repeat('Postgres', 512 * 1024 * 1024 / 8))::bytea,
+ 512::bigint * 1024 * 1024 * 8 - 1, 1)
+ ,512::bigint * 1024 * 1024 * 8 - 1);
+ get_bit
+---------
+ 1
+(1 row)

These 2 tests are supposed to fail in existing releases because set_bit()
and get_bit() don't take a bigint as the 2nd argument.
Also, the same comment as above on how much they allocate.


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite



Re: A bug when use get_bit() function for a long bytea string

Tom Lane-2
"Daniel Verite" <[hidden email]> writes:
> These 2 tests need to allocate big chunks of contiguous memory, so they
> might fail for lack of memory on tiny machines, and even when not failing,
> they're pretty slow to run. Are they worth the trouble?

Yeah, I'd noticed those on previous readings of the patch.  They'd almost
certainly fail on some of our older/smaller buildfarm members, so they're
not getting committed, even if they didn't require multiple seconds apiece
to run (even on a machine with plenty of memory).  It's useful to have
them for initial testing though.

It'd be great if there was a way to test get_bit/set_bit on large
indexes without materializing a couple of multi-hundred-MB objects.
Can't think of one offhand though.

                        regards, tom lane



Re: A bug when use get_bit() function for a long bytea string

Dagfinn Ilmari Mannsåker
Tom Lane <[hidden email]> writes:

> "Daniel Verite" <[hidden email]> writes:
>> These 2 tests need to allocate big chunks of contiguous memory, so they
>> might fail for lack of memory on tiny machines, and even when not failing,
>> they're pretty slow to run. Are they worth the trouble?
>
> Yeah, I'd noticed those on previous readings of the patch.  They'd almost
> certainly fail on some of our older/smaller buildfarm members, so they're
> not getting committed, even if they didn't require multiple seconds apiece
> to run (even on a machine with plenty of memory).  It's useful to have
> them for initial testing though.

Perl's test suite has a similar issue with tests for handling of huge
strings, hashes, arrays, regexes etc.  We've taken the approach of
checking the environment variable PERL_TEST_MEMORY and skipping tests
that need more than that many gigabytes.  We currently have tests that
check for values from 1 all the way up to 96 GiB.

This would be trivial to do in the Postgres TAP tests, and something
similar might be feasible in pg_regress too?

> It'd be great if there was a way to test get_bit/set_bit on large
> indexes without materializing a couple of multi-hundred-MB objects.
> Can't think of one offhand though.

For this usecase it might make sense to express the limit in megabytes,
and have a policy for how much memory tests can assume without explicit
opt-in from the developer or buildfarm animal.

- ilmari
--
"The surreality of the universe tends towards a maximum" -- Skud's Law
"Never formulate a law or axiom that you're not prepared to live with
 the consequences of."                              -- Skud's Meta-Law

