Comment in ginpostinglist.c doesn't match code


Comment in ginpostinglist.c doesn't match code

Heikki Linnakangas
Hi,

While merging Greenplum with 9.4, we ran into problems with the GIN
posting list encoding, because Greenplum sometimes uses ItemPointers
with offset numbers up to 32768. The GIN posting list code was written
with the assumption that the maximum is MaxHeapTuplesPerPage, and it
uses only 11 bits for the offset number. The fix was simple, we just
modified ginpostinglist.c to use full 16 bits instead of 11.

However, I noticed that this comment in ginpostinglist.c that explains
the encoding doesn't match the code:

>  * These 43-bit integers are encoded using varbyte encoding. In each byte,
>  * the 7 low bits contain data, while the highest bit is a continuation bit.
>  * When the continuation bit is set, the next byte is part of the same
>  * integer, otherwise this is the last byte of this integer.  43 bits fit
>  * conveniently in at most 6 bytes when varbyte encoded (the 6th byte does
>  * not need a continuation bit, because we know the max size to be 43 bits):
>  *
>  * 0XXXXXXX
>  * 1XXXXXXX 0XXXXYYY
>  * 1XXXXXXX 1XXXXYYY 0YYYYYYY
>  * 1XXXXXXX 1XXXXYYY 1YYYYYYY 0YYYYYYY
>  * 1XXXXXXX 1XXXXYYY 1YYYYYYY 1YYYYYYY 0YYYYYYY
>  * 1XXXXXXX 1XXXXYYY 1YYYYYYY 1YYYYYYY 1YYYYYYY YYYYYYYY
>  *
>  * X = bits used for offset number
>  * Y = bits used for block number
The code doesn't actually give the 6th byte any special treatment. If
the input integer has the 43rd bit set, the encoding function will put a
continuation bit on the 6th byte, and generate a 7th byte. And the
decoding function will correctly decode that, too. So to my surprise,
the implementation actually works for integers up to 49 bits wide. However,
there is an overflow check in the encoding function that assumes max 6
bytes per integer. That needs to be fixed, along with the comment.

Fitting any item pointer into 6 bytes was an important property when
this was written, because in the old pre-9.4 format, posting lists were
stored as plain arrays, with 6 bytes per item pointer. The maximum of 6 bytes per
integer in the new format guaranteed that we could convert any page from
the old format to the new format, after pg_upgrade, so that the new
format was never larger than the old format. But I don't think we need
to worry much about that anymore. Luckily, no one has run into this
while trying to upgrade. It would require having a 9.3 cluster with a
table larger than 16 TB (with 8k block size), with a GIN index on it,
and a posting list with TIDs more than 2^31 blocks distance, on a full
page. So, not a problem in practice.

In summary, the comment in ginpostinglist.c is wrong, and the overflow
check needs to be fixed. Patch attached.

The patch also includes a little unit test module to test this without
creating a 16 TB table. A whole new test module seems a bit like
overkill just for this, but clearly we were missing test coverage here.
And it will come in handy if we want to invent a new, better posting list
format in the future. Thoughts on whether to include the test module or not?

- Heikki

0001-Fix-overflow-check-and-comment-in-GIN-posting-list-e.patch (13K)

Re: Comment in ginpostinglist.c doesn't match code

Ashwin Agrawal
On Thu, Aug 22, 2019 at 1:14 AM Heikki Linnakangas <[hidden email]> wrote:

The patch also includes a little unit test module to test this without
creating a 16 TB table. A whole new test module seems a bit like
overkill just for this, but clearly we were missing test coverage here.
And it will come in handy if we want to invent a new, better posting list
format in the future. Thoughts on whether to include the test module or not?

I like the test; importantly, it adds missing coverage. Also, it really simplifies the validation effort if a change is needed in this area anytime in the future. So, +1 for keeping it.

Re: Comment in ginpostinglist.c doesn't match code

Masahiko Sawada
On Fri, Aug 23, 2019 at 7:05 AM Ashwin Agrawal <[hidden email]> wrote:

>
> On Thu, Aug 22, 2019 at 1:14 AM Heikki Linnakangas <[hidden email]> wrote:
>>
>>
>> The patch also includes a little unit test module to test this without
>> creating a 16 TB table. A whole new test module seems a bit like
>> overkill just for this, but clearly we were missing test coverage here.
>> And it will come in handy if we want to invent a new, better posting list
>> format in the future. Thoughts on whether to include the test module or not?
>
>
> I like the test; importantly, it adds missing coverage. Also, it really simplifies the validation effort if a change is needed in this area anytime in the future. So, +1 for keeping it.

I'd +1 too. It's valuable to test hard-to-reproduce cases. I often want
to do such unit tests at a cheaper cost, though.

BTW, it's not related to this patch, but I'm confused about where the
"17 bits" in the following paragraph in ginpostinglist.c comes from. If we
use only 43 bits out of a 64-bit unsigned integer, we have 21 bits left.

 * For encoding purposes, item pointers are represented as 64-bit unsigned
 * integers. The lowest 11 bits represent the offset number, and the next
 * lowest 32 bits are the block number. That leaves 17 bits unused, i.e.
 * only 43 low bits are used.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Comment in ginpostinglist.c doesn't match code

Alexander Korotkov-4
In reply to this post by Heikki Linnakangas
On Thu, Aug 22, 2019 at 11:14 AM Heikki Linnakangas <[hidden email]> wrote:

> While merging Greenplum with 9.4, we ran into problems with the GIN
> posting list encoding, because Greenplum sometimes uses ItemPointers
> with offset numbers up to 32768. The GIN posting list code was written
> with the assumption that the maximum is MaxHeapTuplesPerPage, and it
> uses only 11 bits for the offset number. The fix was simple, we just
> modified ginpostinglist.c to use full 16 bits instead of 11.
>
> However, I noticed that this comment in ginpostinglist.c that explains
> the encoding doesn't match the code:
>
> >  * These 43-bit integers are encoded using varbyte encoding. In each byte,
> >  * the 7 low bits contain data, while the highest bit is a continuation bit.
> >  * When the continuation bit is set, the next byte is part of the same
> >  * integer, otherwise this is the last byte of this integer.  43 bits fit
> >  * conveniently in at most 6 bytes when varbyte encoded (the 6th byte does
> >  * not need a continuation bit, because we know the max size to be 43 bits):
> >  *
> >  * 0XXXXXXX
> >  * 1XXXXXXX 0XXXXYYY
> >  * 1XXXXXXX 1XXXXYYY 0YYYYYYY
> >  * 1XXXXXXX 1XXXXYYY 1YYYYYYY 0YYYYYYY
> >  * 1XXXXXXX 1XXXXYYY 1YYYYYYY 1YYYYYYY 0YYYYYYY
> >  * 1XXXXXXX 1XXXXYYY 1YYYYYYY 1YYYYYYY 1YYYYYYY YYYYYYYY
> >  *
> >  * X = bits used for offset number
> >  * Y = bits used for block number
>
> The code doesn't actually give the 6th byte any special treatment. If
> the input integer has the 43rd bit set, the encoding function will put a
> continuation bit on the 6th byte, and generate a 7th byte. And the
> decoding function will correctly decode that, too. So to my surprise,
> the implementation actually works for integers up to 49 bits wide. However,
> there is an overflow check in the encoding function that assumes max 6
> bytes per integer. That needs to be fixed, along with the comment.

Good catch!

> Fitting any item pointer into 6 bytes was an important property when
> this was written, because in the old pre-9.4 format, posting lists were
> stored as plain arrays, with 6 bytes per item pointer. The maximum of 6 bytes per
> integer in the new format guaranteed that we could convert any page from
> the old format to the new format, after pg_upgrade, so that the new
> format was never larger than the old format. But I don't think we need
> to worry much about that anymore. Luckily, no one has run into this
> while trying to upgrade. It would require having a 9.3 cluster with a
> table larger than 16 TB (with 8k block size), with a GIN index on it,
> and a posting list with TIDs more than 2^31 blocks distance, on a full
> page. So, not a problem in practice.

I agree, this doesn't seem to be a problem, given that nobody reported
it before the end of support for 9.3. It seems that nobody tried to
upgrade such a huge table with a GIN index.

> In summary, the comment in ginpostinglist.c is wrong, and the overflow
> check needs to be fixed. Patch attached.
>
> The patch also includes a little unit test module to test this without
> creating a 16 TB table. A whole new test module seems a bit like
> overkill just for this, but clearly we were missing test coverage here.
> And it will come in handy if we want to invent a new, better posting list
> format in the future. Thoughts on whether to include the test module or not?

To me, the test module looks valuable. Many times I've felt uneasy about
our test suite, because it doesn't cover many situations that naturally
happen only on large datasets. It's good that you've found a way to
reproduce this particular case without creating an enormous table.

I haven't reviewed the module yet, but +1 for the idea.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Comment in ginpostinglist.c doesn't match code

Heikki Linnakangas
In reply to this post by Masahiko Sawada
On 23/08/2019 11:44, Masahiko Sawada wrote:

> On Fri, Aug 23, 2019 at 7:05 AM Ashwin Agrawal <[hidden email]> wrote:
>>
>> On Thu, Aug 22, 2019 at 1:14 AM Heikki Linnakangas <[hidden email]> wrote:
>>> The patch also includes a little unit test module to test this without
>>> creating a 16 TB table. A whole new test module seems a bit like
>>> overkill just for this, but clearly we were missing test coverage here.
>>> And it will come in handy if we want to invent a new, better posting list
>>> format in the future. Thoughts on whether to include the test module or not?
>>
>>
>> I like the test; importantly, it adds missing coverage. Also, it really simplifies the validation effort if a change is needed in this area anytime in the future. So, +1 for keeping it.
>
> I'd +1 too. It's valuable to test hard-to-reproduce cases. I often want
> to do such unit tests at a cheaper cost, though.

Ok, including it seems to be the consensus.

> BTW, it's not related to this patch, but I'm confused about where the
> "17 bits" in the following paragraph in ginpostinglist.c comes from. If we
> use only 43 bits out of a 64-bit unsigned integer, we have 21 bits left.
>
>   * For encoding purposes, item pointers are represented as 64-bit unsigned
>   * integers. The lowest 11 bits represent the offset number, and the next
>   * lowest 32 bits are the block number. That leaves 17 bits unused, i.e.
>   * only 43 low bits are used.

Huh, you're right. I think it should be "leaves 21 bits unused". I
suspect the patch used 15 bits for the offset number early in the
development, which would've left 17 bits unused, and when it was later
changed to use only 11 bits, that comment was neglected.

Fixed that comment, too, and pushed. Thanks!

- Heikki