Custom compression methods


Re: Custom compression methods

Tomas Vondra-4

On 11/20/2017 04:21 PM, Евгений Шишкин wrote:

>
>
>> On Nov 20, 2017, at 18:18, Tomas Vondra <[hidden email]
>> <mailto:[hidden email]>> wrote:
>>
>>
>> I don't think we need to do anything smart here - it should behave just
>> like dropping a data type, for example. That is, error out if there are
>> columns using the compression method (without CASCADE), and drop all the
>> columns (with CASCADE).
>
> What about instead of dropping column we leave data uncompressed?
>

That requires you to go through the data and rewrite the whole table.
And I'm not aware of a DROP command doing that; instead they just drop
the dependent objects (e.g. DROP TYPE, ...). So per PLOS the DROP
COMPRESSION METHOD command should do that too.

But I'm wondering if ALTER COLUMN ... SET NOT COMPRESSED should do that
(currently it only disables compression for new data).


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Custom compression methods

Evgeniy Shishkin


> On Nov 20, 2017, at 18:29, Tomas Vondra <[hidden email]> wrote:
>
>>
>> What about instead of dropping column we leave data uncompressed?
>>
>
> That requires you to go through the data and rewrite the whole table.
> And I'm not aware of a DROP command doing that, instead they just drop
> the dependent objects (e.g. DROP TYPE, ...). So per PLOS the DROP
> COMPRESSION METHOD command should do that too.

Well, there is not much you can do with DROP TYPE. But I'd argue that compression
is different. We do not drop data in the case of DROP STATISTICS or DROP INDEX.

At least there should be a way to easily alter compression method then.

Re: Custom compression methods

Ildus Kurbangaliev
In reply to this post by Tomas Vondra-4
On Mon, 20 Nov 2017 16:29:11 +0100
Tomas Vondra <[hidden email]> wrote:

> On 11/20/2017 04:21 PM, Евгений Шишкин wrote:
> >
> >  
> >> On Nov 20, 2017, at 18:18, Tomas Vondra
> >> <[hidden email]
> >> <mailto:[hidden email]>> wrote:
> >>
> >>
> >> I don't think we need to do anything smart here - it should behave
> >> just like dropping a data type, for example. That is, error out if
> >> there are columns using the compression method (without CASCADE),
> >> and drop all the columns (with CASCADE).  
> >
> > What about instead of dropping column we leave data uncompressed?
> >  
>
> That requires you to go through the data and rewrite the whole table.
> And I'm not aware of a DROP command doing that, instead they just drop
> the dependent objects (e.g. DROP TYPE, ...). So per PLOS the DROP
> COMPRESSION METHOD command should do that too.
>
> But I'm wondering if ALTER COLUMN ... SET NOT COMPRESSED should do
> that (currently it only disables compression for new data).

If the table is big, decompression could take an eternity. That's why I
decided only to disable it; the data can still be decompressed using the
stored compression options.

My idea was to keep compression options forever, since there will not
be many of them in one database. Still, that requires that the extension
is not removed.

I will try to find a way to recompress the data first in case it moves
to another table.

--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: Custom compression methods

Tomas Vondra-4
In reply to this post by Evgeniy Shishkin


On 11/20/2017 04:43 PM, Евгений Шишкин wrote:

>
>
>> On Nov 20, 2017, at 18:29, Tomas Vondra <[hidden email]> wrote:
>>
>>>
>>> What about instead of dropping column we leave data uncompressed?
>>>
>>
>> That requires you to go through the data and rewrite the whole table.
>> And I'm not aware of a DROP command doing that, instead they just drop
>> the dependent objects (e.g. DROP TYPE, ...). So per PLOS the DROP
>> COMPRESSION METHOD command should do that too.
>
> Well, there is no much you can do with DROP TYPE. But i'd argue that compression
> is different. We do not drop data in case of DROP STATISTICS or DROP INDEX.
>

But those DROP commands do not 'invalidate' data in the heap, so there's
no reason to drop the columns.

> At least there should be a way to easily alter compression method then.
>

+1


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Custom compression methods

Ildus Kurbangaliev
In reply to this post by Tomas Vondra-4
On Mon, 20 Nov 2017 00:04:53 +0100
Tomas Vondra <[hidden email]> wrote:

> I did a review of this today, and I think there are some things that
> need improvement / fixing.
>
> Firstly, some basic comments from just eye-balling the diff, then some
> bugs I discovered after writing an extension adding lz4.
>
> 1) formatRelOptions/freeRelOptions are no longer needed (I see Ildar
> already pointer that out)

I removed freeRelOptions, but formatRelOptions is used in another place.

>
> 2) There's unnecessary whitespace (extra newlines) on a couple of
> places, which is needlessly increasing the size of the patch. Small
> difference, but annoying.

Cleaned up.

>
> 3) tuptoaster.c
>
> Why do you change 'info' from int32 to uint32? Seems unnecessary.

That's because I use the highest bit, which makes the number negative for
int32. I use right shifting to get that bit, and a right shift on a
negative value gives a negative result.
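
To make the signedness issue concrete, a minimal illustration (not code
from the patch):

    uint32 info   = ((uint32) 1 << 31) | 0x1234;    /* high flag bit + payload */
    uint32 flag_u = info >> 31;                      /* 1, as intended */
    int32  flag_s = ((int32) info) >> 31;            /* -1 on typical platforms:
                                                      * the shift sign-extends */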

>
> Adding new 'att' variable in toast_insert_or_update is confusing, as
> there already is 'att' in the very next loop. Technically it's
> correct, but I'd bet it'll lead to some WTF?! moments later. I
> propose to just use TupleDescAttr(tupleDesc,i) on the two places
> where it matters, around line 808.
>
> There are no comments for init_compression_options_htab and
> get_compression_options_info, so that needs to be fixed. Moreover, the
> names are confusing because what we really get is not just 'options'
> but the compression routines too.

Removed the extra 'att' and added comments.

>
> 4) gen_db_file_maps probably shouldn't do the fprints, right?
>
> 5) not sure why you modify src/tools/pgindent/exclude_file_patterns

My bad, removed these lines.

>
> 6) I'm rather confused by AttributeCompression vs. ColumnCompression.
> I mean, attribute==column, right? Of course, one is for data from
> parser, the other one is for internal info. But can we make the
> naming clearer?

For now I have renamed AttributeCompression to CompressionOptions; not
sure that's a good name, but at least it is less confusing.

>
> 7) The docs in general are somewhat unsatisfactory, TBH. For example
> the ColumnCompression has no comments, unlike everything else in
> parsenodes. Similarly for the SGML docs - I suggest to expand them to
> resemble FDW docs
> (https://www.postgresql.org/docs/10/static/fdwhandler.html) which
> also follows the handler/routines pattern.

I've added more comments. I think I'll add more documentation if the
committers approve the current syntax.

>
> 8) One of the unclear things if why we even need 'drop' routing. It
> seems that if it's defined DropAttributeCompression does something.
> But what should it do? I suppose dropping the options should be done
> using dependencies (just like we drop columns in this case).
>
> BTW why does DropAttributeCompression mess with att->attisdropped in
> this way? That seems a bit odd.

The 'drop' routine could be useful. An extension could do something
related to the attribute, like removing extra tables or something else.
The compression options will not be removed after unlinking the
compression method from a column, because compressed data may still be
stored in that column.

That 'attisdropped' part has been removed.

>
> 9) configure routines that only check if (options != NIL) and then
> error out (like tsvector_configure) seem a bit unnecessary. Just
> allow it to be NULL in CompressionMethodRoutine, and throw an error
> if options is not NIL for such compression method.

Good idea, done.

>
> 10) toast_compress_datum still does this:
>
>     if (!ac && (valsize < PGLZ_strategy_default->min_input_size ||
>                 valsize > PGLZ_strategy_default->max_input_size))
>
> which seems rather pglz-specific (the naming is a hint). Why shouldn't
> this be specific to compression, exposed either as min/max constants,
> or wrapped in another routine - size_is_valid() or something like
> that?
I agree; I've moved it to the next block, which is specific to pglz.
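
For illustration, a minimal sketch of the kind of wrapper suggested above
(the name size_is_valid is only the reviewer's placeholder, not code from
the patch):

    static bool
    size_is_valid(Size valsize)
    {
        return valsize >= (Size) PGLZ_strategy_default->min_input_size &&
               valsize <= (Size) PGLZ_strategy_default->max_input_size;
    }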

>
> 11) The comments in toast_compress_datum probably need updating, as it
> still references to pglz specifically. I guess the new compression
> methods do matter too.

Done.

>
> 12) get_compression_options_info organizes the compression info into a
> hash table by OID. The hash table implementation assumes the hash key
> is at the beginning of the entry, but AttributeCompression is defined
> like this:
>
>     typedef struct
>     {
>         CompressionMethodRoutine *routine;
>         List  *options;
>         Oid cmoptoid;
>     } AttributeCompression;
>
> Which means get_compression_options_info is busted, will never lookup
> anything, and the hash table will grow by adding more and more entries
> into the same bucket. Of course, this has extremely negative impact on
> performance (pretty much arbitrarily bad, depending on how many
> entries you've already added to the hash table).
>
> Moving the OID to the beginning of the struct fixes the issue.
Yeah, I fixed it before, but somehow managed not to include it in the
patch.
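
For clarity, a sketch of the corrected layout (hash key first), using the
CompressionOptions name mentioned above:

    typedef struct
    {
        Oid         cmoptoid;   /* hash key - must be the first field */
        CompressionMethodRoutine *routine;
        List       *options;
    } CompressionOptions;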

>
> 13) When writing the experimental extension, I was extremely confused
> about the regular varlena headers, custom compression headers, etc. In
> the end I stole the code from tsvector.c and whacked it a bit until it
> worked, but I wouldn't dare to claim I understand how it works.
>
> This needs to be documented somewhere. For example postgres.h has a
> bunch of paragraphs about varlena headers, so perhaps it should be
> there? I see the patch tweaks some of the constants, but does not
> update the comment at all.
This point is good; I'm not sure what this documentation should look
like. I've just assumed that people should have a deep understanding of
varlenas if they're going to compress them. But now it's easy to make a
mistake there. Maybe I should add some functions that help to construct
varlenas with different headers. I like the way jsonb is constructed: it
uses StringInfo and there are a few helper functions (reserveFromBuffer,
appendToBuffer and others). Maybe they should not be static.

>
> Perhaps it would be useful to provide some additional macros making
> access to custom-compressed varlena values easier. Or perhaps the
> VARSIZE_ANY / VARSIZE_ANY_EXHDR / VARDATA_ANY already support that?
> This part is not very clear to me.

These macros will work; custom compressed varlenas behave like the old
compressed varlenas.

> > Still it's a problem if the user used for example `SELECT
> > <compressed_column> INTO * FROM *` because postgres will copy
> > compressed tuples, and there will not be any dependencies between
> > destination and the options.
> >  
>
> This seems like a rather fatal design flaw, though. I'd say we need to
> force recompression of the data, in such cases. Otherwise all the
> dependency tracking is rather pointless.

Fixed this problem too. I've added recompression for datums that use
custom compression.


--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment: custom_compression_methods_v5.patch (257K)

Re: Custom compression methods

Tomas Vondra-4
Hi,

On 11/21/2017 03:47 PM, Ildus Kurbangaliev wrote:

> On Mon, 20 Nov 2017 00:04:53 +0100
> Tomas Vondra <[hidden email]> wrote:
>
> ...
>
>> 6) I'm rather confused by AttributeCompression vs.
>> ColumnCompression. I mean, attribute==column, right? Of course, one
>> is for data from parser, the other one is for internal info. But
>> can we make the naming clearer?
>
> For now I have renamed AttributeCompression to CompressionOptions,
> not sure that's a good name but at least it gives less confusion.
>

I propose to use either

   CompressionMethodOptions (and CompressionMethodRoutine)

or

   CompressionOptions (and CompressionRoutine)

>>
>> 7) The docs in general are somewhat unsatisfactory, TBH. For example
>> the ColumnCompression has no comments, unlike everything else in
>> parsenodes. Similarly for the SGML docs - I suggest to expand them to
>> resemble FDW docs
>> (https://www.postgresql.org/docs/10/static/fdwhandler.html) which
>> also follows the handler/routines pattern.
>
> I've added more comments. I think I'll add more documentation if the
> committers will approve current syntax.
>

OK. Haven't reviewed this yet.

>>
>> 8) One of the unclear things if why we even need 'drop' routing. It
>> seems that if it's defined DropAttributeCompression does something.
>> But what should it do? I suppose dropping the options should be done
>> using dependencies (just like we drop columns in this case).
>>
>> BTW why does DropAttributeCompression mess with att->attisdropped in
>> this way? That seems a bit odd.
>
> 'drop' routine could be useful. An extension could do something
> related with the attribute, like remove extra tables or something
> else. The compression options will not be removed after unlinking
> compression method from a column because there is still be stored
> compressed data in that column.
>

OK. So something like a "global" dictionary used for the column, or
something like that? Sure, seems useful, and I've been thinking about
that, but I think we badly need some extension using it, even if in a
very simple way. Firstly, we need a "how to" example; secondly, we need
some way to test it.

>>
>> 13) When writing the experimental extension, I was extremely
>> confused about the regular varlena headers, custom compression
>> headers, etc. In the end I stole the code from tsvector.c and
>> whacked it a bit until it worked, but I wouldn't dare to claim I
>> understand how it works.
>>
>> This needs to be documented somewhere. For example postgres.h has
>> a bunch of paragraphs about varlena headers, so perhaps it should
>> be there? I see the patch tweaks some of the constants, but does
>> not update the comment at all.
>
> This point is good, I'm not sure how this documentation should look
> like. I've just assumed that people should have deep undestanding of
> varlenas if they're going to compress them. But now it's easy to
> make mistake there. Maybe I should add some functions that help to
> construct varlena, with different headers. I like the way is how
> jsonb is constructed. It uses StringInfo and there are few helper
> functions (reserveFromBuffer, appendToBuffer and others). Maybe they
> should be not static.
>

Not sure. My main problem was not understanding how this affects the
varlena header, etc. And I had no idea where to look.

>>
>> Perhaps it would be useful to provide some additional macros
>> making access to custom-compressed varlena values easier. Or
>> perhaps the VARSIZE_ANY / VARSIZE_ANY_EXHDR / VARDATA_ANY already
>> support that? This part is not very clear to me.
>
> These macros will work, custom compressed varlenas behave like old
> compressed varlenas.
>

OK. But then I don't understand why tsvector.c does things like

    VARSIZE(data) - VARHDRSZ_CUSTOM_COMPRESSED - arrsize
    VARRAWSIZE_4B_C(data) - arrsize

instead of

    VARSIZE_ANY_EXHDR(data) - arrsize
    VARSIZE_ANY(data) - arrsize

Seems somewhat confusing.

>>> Still it's a problem if the user used for example `SELECT
>>> <compressed_column> INTO * FROM *` because postgres will copy
>>> compressed tuples, and there will not be any dependencies
>>> between destination and the options.
>>>  
>>
>> This seems like a rather fatal design flaw, though. I'd say we need
>> to force recompression of the data, in such cases. Otherwise all
>> the dependency tracking is rather pointless.
>
> Fixed this problem too. I've added recompression for datum that use
> custom compression.
>

Hmmm, it still doesn't work for me. See this:

    test=# create extension pg_lz4 ;
    CREATE EXTENSION
    test=# create table t_lz4 (v text compressed lz4);
    CREATE TABLE
    test=# create table t_pglz (v text);
    CREATE TABLE
    test=# insert into t_lz4 select repeat(md5(1::text),300);
    INSERT 0 1
    test=# insert into t_pglz select * from t_lz4;
    INSERT 0 1
    test=# drop extension pg_lz4 cascade;
    NOTICE:  drop cascades to 2 other objects
    DETAIL:  drop cascades to compression options for lz4
    drop cascades to table t_lz4 column v
    DROP EXTENSION
    test=# \c test
    You are now connected to database "test" as user "user".
    test=# insert into t_lz4 select repeat(md5(1::text),300);^C
    test=# select * from t_pglz ;
    ERROR:  cache lookup failed for compression options 16419

That suggests no recompression happened.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Custom compression methods

Ildus Kurbangaliev
On Tue, 21 Nov 2017 18:47:49 +0100
Tomas Vondra <[hidden email]> wrote:

 
>
> I propose to use either
>
>    CompressionMethodOptions (and CompressionMethodRoutine)
>
> or
>
>    CompressionOptions (and CompressionRoutine)

Sounds good, thanks.

>
> OK. But then I don't understand why tsvector.c does things like
>
>     VARSIZE(data) - VARHDRSZ_CUSTOM_COMPRESSED - arrsize
>     VARRAWSIZE_4B_C(data) - arrsize
>
> instead of
>
>     VARSIZE_ANY_EXHDR(data) - arrsize
>     VARSIZE_ANY(data) - arrsize
>
> Seems somewhat confusing.
>

VARRAWSIZE_4B_C returns the original size of the data, before compression
(from va_rawsize in current postgres, and from va_info in my patch), not
the size of the already compressed data, so you can't use VARSIZE_ANY here.

VARSIZE_ANY_EXHDR in current postgres returns VARSIZE - VARHDRSZ,
regardless of whether the varlena is compressed or not, so I just kept
this behavior for custom compressed varlenas too. If you look into
tuptoaster.c you will also see lines like 'VARSIZE(attr) -
TOAST_COMPRESS_HDRSZ'. So I think if VARSIZE_ANY_EXHDR is to subtract
different header sizes, then it should subtract them for the usual
compressed varlenas too.

> >  
>
> Hmmm, it still doesn't work for me. See this:
>
>     test=# create extension pg_lz4 ;
>     CREATE EXTENSION
>     test=# create table t_lz4 (v text compressed lz4);
>     CREATE TABLE
>     test=# create table t_pglz (v text);
>     CREATE TABLE
>     test=# insert into t_lz4 select repeat(md5(1::text),300);
>     INSERT 0 1
>     test=# insert into t_pglz select * from t_lz4;
>     INSERT 0 1
>     test=# drop extension pg_lz4 cascade;
>     NOTICE:  drop cascades to 2 other objects
>     DETAIL:  drop cascades to compression options for lz4
>     drop cascades to table t_lz4 column v
>     DROP EXTENSION
>     test=# \c test
>     You are now connected to database "test" as user "user".
>     test=# insert into t_lz4 select repeat(md5(1::text),300);^C
>     test=# select * from t_pglz ;
>     ERROR:  cache lookup failed for compression options 16419
>
> That suggests no recompression happened.

I will check that. Is your extension published somewhere?



Re: Custom compression methods

Tomas Vondra-4

On 11/21/2017 09:28 PM, Ildus K wrote:

>> Hmmm, it still doesn't work for me. See this:
>>
>>     test=# create extension pg_lz4 ;
>>     CREATE EXTENSION
>>     test=# create table t_lz4 (v text compressed lz4);
>>     CREATE TABLE
>>     test=# create table t_pglz (v text);
>>     CREATE TABLE
>>     test=# insert into t_lz4 select repeat(md5(1::text),300);
>>     INSERT 0 1
>>     test=# insert into t_pglz select * from t_lz4;
>>     INSERT 0 1
>>     test=# drop extension pg_lz4 cascade;
>>     NOTICE:  drop cascades to 2 other objects
>>     DETAIL:  drop cascades to compression options for lz4
>>     drop cascades to table t_lz4 column v
>>     DROP EXTENSION
>>     test=# \c test
>>     You are now connected to database "test" as user "user".
>>     test=# insert into t_lz4 select repeat(md5(1::text),300);^C
>>     test=# select * from t_pglz ;
>>     ERROR:  cache lookup failed for compression options 16419
>>
>> That suggests no recompression happened.
>
> I will check that. Is your extension published somewhere?
>
No, it was just an experiment, so I've only attached it to the initial
review. Attached is an updated version, with a fix or two.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment: pg_lz4.tgz (2K)

Re: Custom compression methods

Ildus Kurbangaliev
In reply to this post by Tomas Vondra-4
On Tue, 21 Nov 2017 18:47:49 +0100
Tomas Vondra <[hidden email]> wrote:

> >  
>
> Hmmm, it still doesn't work for me. See this:
>
>     test=# create extension pg_lz4 ;
>     CREATE EXTENSION
>     test=# create table t_lz4 (v text compressed lz4);
>     CREATE TABLE
>     test=# create table t_pglz (v text);
>     CREATE TABLE
>     test=# insert into t_lz4 select repeat(md5(1::text),300);
>     INSERT 0 1
>     test=# insert into t_pglz select * from t_lz4;
>     INSERT 0 1
>     test=# drop extension pg_lz4 cascade;
>     NOTICE:  drop cascades to 2 other objects
>     DETAIL:  drop cascades to compression options for lz4
>     drop cascades to table t_lz4 column v
>     DROP EXTENSION
>     test=# \c test
>     You are now connected to database "test" as user "user".
>     test=# insert into t_lz4 select repeat(md5(1::text),300);^C
>     test=# select * from t_pglz ;
>     ERROR:  cache lookup failed for compression options 16419
>
> That suggests no recompression happened.
Should be fixed in the attached patch. I've changed your extension a
little bit according to the changes in the new patch (also attached).

Also, I renamed a few functions, added more comments, and simplified the
code related to DefineRelation (thanks to Ildar Musin's suggestion).

--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment: custom_compression_methods_v6.patch (262K)
Attachment: pg_lz4.tar.gz (1K)

Re: Custom compression methods

Tomas Vondra-4
Hi,

On 11/23/2017 10:38 AM, Ildus Kurbangaliev wrote:

> On Tue, 21 Nov 2017 18:47:49 +0100
> Tomas Vondra <[hidden email]> wrote:
>
>>>  
>>
>> Hmmm, it still doesn't work for me. See this:
>>
>>     test=# create extension pg_lz4 ;
>>     CREATE EXTENSION
>>     test=# create table t_lz4 (v text compressed lz4);
>>     CREATE TABLE
>>     test=# create table t_pglz (v text);
>>     CREATE TABLE
>>     test=# insert into t_lz4 select repeat(md5(1::text),300);
>>     INSERT 0 1
>>     test=# insert into t_pglz select * from t_lz4;
>>     INSERT 0 1
>>     test=# drop extension pg_lz4 cascade;
>>     NOTICE:  drop cascades to 2 other objects
>>     DETAIL:  drop cascades to compression options for lz4
>>     drop cascades to table t_lz4 column v
>>     DROP EXTENSION
>>     test=# \c test
>>     You are now connected to database "test" as user "user".
>>     test=# insert into t_lz4 select repeat(md5(1::text),300);^C
>>     test=# select * from t_pglz ;
>>     ERROR:  cache lookup failed for compression options 16419
>>
>> That suggests no recompression happened.
>
> Should be fixed in the attached patch. I've changed your extension a
> little bit according changes in the new patch (also in attachments).
>

Hmm, this seems to have fixed it, but only in one direction. Consider this:

    create table t_pglz (v text);
    create table t_lz4 (v text compressed lz4);

    insert into t_pglz select repeat(md5(i::text),300)
    from generate_series(1,100000) s(i);

    insert into t_lz4 select repeat(md5(i::text),300)
    from generate_series(1,100000) s(i);

    \d+

     Schema |  Name  | Type  | Owner | Size  | Description
    --------+--------+-------+-------+-------+-------------
     public | t_lz4  | table | user  | 12 MB |
     public | t_pglz | table | user  | 18 MB |
    (2 rows)

    truncate t_pglz;
    insert into t_pglz select * from t_lz4;

    \d+

     Schema |  Name  | Type  | Owner | Size  | Description
    --------+--------+-------+-------+-------+-------------
     public | t_lz4  | table | user  | 12 MB |
     public | t_pglz | table | user  | 18 MB |
    (2 rows)

which is fine. But in the other direction, this happens

    truncate t_lz4;
    insert into t_lz4 select * from t_pglz;

     \d+
                       List of relations
     Schema |  Name  | Type  | Owner | Size  | Description
    --------+--------+-------+-------+-------+-------------
     public | t_lz4  | table | user  | 18 MB |
     public | t_pglz | table | user  | 18 MB |
    (2 rows)

which means the data is still pglz-compressed. That's rather strange, I
guess, and it should compress the data using the compression method set
for the target table instead.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Custom compression methods

Ildus Kurbangaliev
On Thu, 23 Nov 2017 21:54:32 +0100
Tomas Vondra <[hidden email]> wrote:
 

>
> Hmm, this seems to have fixed it, but only in one direction. Consider
> this:
>
>     create table t_pglz (v text);
>     create table t_lz4 (v text compressed lz4);
>
>     insert into t_pglz select repeat(md5(i::text),300)
>     from generate_series(1,100000) s(i);
>
>     insert into t_lz4 select repeat(md5(i::text),300)
>     from generate_series(1,100000) s(i);
>
>     \d+
>
>      Schema |  Name  | Type  | Owner | Size  | Description
>     --------+--------+-------+-------+-------+-------------
>      public | t_lz4  | table | user  | 12 MB |
>      public | t_pglz | table | user  | 18 MB |
>     (2 rows)
>
>     truncate t_pglz;
>     insert into t_pglz select * from t_lz4;
>
>     \d+
>
>      Schema |  Name  | Type  | Owner | Size  | Description
>     --------+--------+-------+-------+-------+-------------
>      public | t_lz4  | table | user  | 12 MB |
>      public | t_pglz | table | user  | 18 MB |
>     (2 rows)
>
> which is fine. But in the other direction, this happens
>
>     truncate t_lz4;
>     insert into t_lz4 select * from t_pglz;
>
>      \d+
>                        List of relations
>      Schema |  Name  | Type  | Owner | Size  | Description
>     --------+--------+-------+-------+-------+-------------
>      public | t_lz4  | table | user  | 18 MB |
>      public | t_pglz | table | user  | 18 MB |
>     (2 rows)
>
> which means the data is still pglz-compressed. That's rather strange,
> I guess, and it should compress the data using the compression method
> set for the target table instead.

That's actually an interesting issue. It happens because if the tuple fits
into a page, then postgres just moves it as is. I've only added
recompression when it has custom compressed datums, to keep dependencies
right. But look:

  create table t1(a text);
  create table t2(a text);
  alter table t2 alter column a set storage external;
  insert into t1 select repeat(md5(i::text),300) from
    generate_series(1,100000) s(i);
  \d+

                      List of relations
   Schema | Name | Type  | Owner |    Size    | Description
  --------+------+-------+-------+------------+-------------
   public | t1   | table | ildus | 18 MB      |
   public | t2   | table | ildus | 8192 bytes |
  (2 rows)

  insert into t2 select * from t1;

  \d+

                    List of relations
   Schema | Name | Type  | Owner | Size  | Description
  --------+------+-------+-------+-------+-------------
   public | t1   | table | ildus | 18 MB |
   public | t2   | table | ildus | 18 MB |
  (2 rows)

That means there are now compressed datums in the column whose storage is
specified as external. I'm not sure whether that's a bug or a feature.
Let's insert them the usual way:

  delete from t2;
  insert into t2 select repeat(md5(i::text),300) from
    generate_series(1,100000) s(i);
  \d+

                     List of relations
   Schema | Name | Type  | Owner |  Size   | Description
  --------+------+-------+-------+---------+-------------
   public | t1   | table | ildus | 18 MB   |
   public | t2   | table | ildus | 1011 MB |

Maybe there should be a more general solution, like a comparison of
attribute properties?

--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: Custom compression methods

Tomas Vondra-4
Hi,

On 11/24/2017 10:38 AM, Ildus Kurbangaliev wrote:
> ...
> That means compressed datums now in the column with storage
> specified as external. I'm not sure that's a bug or a feature.
>

Interesting. Never realized it behaves like this. Not sure if it's
intentional or not (i.e. bug vs. feature).

> Lets insert them usual way:
>
>   delete from t2;
>   insert into t2 select repeat(md5(i::text),300) from
>     generate_series(1,100000) s(i);
>   \d+
>
>                      List of relations
>    Schema | Name | Type  | Owner |  Size   | Description
>   --------+------+-------+-------+---------+-------------
>    public | t1   | table | ildus | 18 MB   |
>    public | t2   | table | ildus | 1011 MB |
>
> Maybe there should be more common solution like comparison of
> attribute properties?
>

Maybe; I'm not sure what the right solution is. I just know that if we
allow inserting data into arbitrary tables without recompression, we may
end up with data that can't be decompressed.

I agree that the behavior with extended storage is somewhat similar, but
the important distinction is that while that is surprising, the data is
still accessible.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Custom compression methods

Tomas Vondra-4
Hi,

I ran into another issue - after inserting some data into a table with a
tsvector column (without any compression defined), I can no longer read
the data.

This is what I get in the console:

db=# select max(md5(body_tsvector::text)) from messages;
ERROR:  cache lookup failed for compression options 6432

and the stack trace looks like this:

Breakpoint 1, get_cached_compression_options (cmoptoid=6432) at
tuptoaster.c:2563
2563 elog(ERROR, "cache lookup failed for compression options %u",
cmoptoid);
(gdb) bt
#0  get_cached_compression_options (cmoptoid=6432) at tuptoaster.c:2563
#1  0x00000000004bf3da in toast_decompress_datum (attr=0x2b44148) at
tuptoaster.c:2390
#2  0x00000000004c0c1e in heap_tuple_untoast_attr (attr=0x2b44148) at
tuptoaster.c:225
#3  0x000000000083f976 in pg_detoast_datum (datum=<optimized out>) at
fmgr.c:1829
#4  0x00000000008072de in tsvectorout (fcinfo=0x2b41e00) at tsvector.c:315
#5  0x00000000005fae00 in ExecInterpExpr (state=0x2b414b8,
econtext=0x2b25ab0, isnull=<optimized out>) at execExprInterp.c:1131
#6  0x000000000060bdf4 in ExecEvalExprSwitchContext
(isNull=0x7fffffe9bd37 "", econtext=0x2b25ab0, state=0x2b414b8) at
../../../src/include/executor/executor.h:299

It seems the VARATT_IS_CUSTOM_COMPRESSED incorrectly identifies the
value as custom-compressed for some reason.

Not sure why, but the tsvector column is populated by a trigger that
simply does

    NEW.body_tsvector
        := to_tsvector('english', strip_replies(NEW.body_plain));

If needed, the complete tool is here:

    https://bitbucket.org/tvondra/archie


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Custom compression methods

Ildus Kurbangaliev
On Sat, 25 Nov 2017 06:40:00 +0100
Tomas Vondra <[hidden email]> wrote:

> Hi,
>
> I ran into another issue - after inserting some data into a table
> with a tsvector column (without any compression defined), I can no
> longer read the data.
>
> This is what I get in the console:
>
> db=# select max(md5(body_tsvector::text)) from messages;
> ERROR:  cache lookup failed for compression options 6432
>
> and the stack trace looks like this:
>
> Breakpoint 1, get_cached_compression_options (cmoptoid=6432) at
> tuptoaster.c:2563
> 2563 elog(ERROR, "cache lookup failed for
> compression options %u", cmoptoid);
> (gdb) bt
> #0  get_cached_compression_options (cmoptoid=6432) at
> tuptoaster.c:2563 #1  0x00000000004bf3da in toast_decompress_datum
> (attr=0x2b44148) at tuptoaster.c:2390
> #2  0x00000000004c0c1e in heap_tuple_untoast_attr (attr=0x2b44148) at
> tuptoaster.c:225
> #3  0x000000000083f976 in pg_detoast_datum (datum=<optimized out>) at
> fmgr.c:1829
> #4  0x00000000008072de in tsvectorout (fcinfo=0x2b41e00) at
> tsvector.c:315 #5  0x00000000005fae00 in ExecInterpExpr
> (state=0x2b414b8, econtext=0x2b25ab0, isnull=<optimized out>) at
> execExprInterp.c:1131 #6  0x000000000060bdf4 in
> ExecEvalExprSwitchContext (isNull=0x7fffffe9bd37 "",
> econtext=0x2b25ab0, state=0x2b414b8)
> at ../../../src/include/executor/executor.h:299
>
> It seems the VARATT_IS_CUSTOM_COMPRESSED incorrectly identifies the
> value as custom-compressed for some reason.
>
> Not sure why, but the tsvector column is populated by a trigger that
> simply does
>
>     NEW.body_tsvector
>         := to_tsvector('english', strip_replies(NEW.body_plain));
>
> If needed, the complete tool is here:
>
>     https://bitbucket.org/tvondra/archie
>

Hi. This looks like a serious bug, but I couldn't reproduce it yet. Did
you upgrade some old database, or did this bug happen after inserting all
the data into a new database? I tried using your 'archie' tool to download
the mailing lists and insert them into a database, but couldn't catch any
errors.

--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: Custom compression methods

Tomas Vondra-4
Hi,

On 11/27/2017 04:52 PM, Ildus Kurbangaliev wrote:
> ...
>
> Hi. This looks like a serious bug, but I couldn't reproduce it yet.
> Did you upgrade some old database or this bug happened after
> insertion of all data to new database? I tried using your 'archie'
> tool to download mailing lists and insert them to database, but
> couldn't catch any errors.
>

I can trigger it pretty reliably with these steps:

    git checkout f65d21b258085bdc8ef2cc282ab1ff12da9c595c
    patch -p1 < ~/custom_compression_methods_v6.patch
    ./configure --enable-debug --enable-cassert \
     CFLAGS="-fno-omit-frame-pointer -O0 -DRANDOMIZE_ALLOCATED_MEMORY" \
     --prefix=/home/postgres/pg-compress
    make -s clean && make -s -j4 install
    cd contrib/
    make -s clean && make -s -j4 install

    export PATH=/home/postgres/pg-compress/bin:$PATH
    pg_ctl -D /mnt/raid/pg-compress init
    pg_ctl -D /mnt/raid/pg-compress -l compress.log start
    createdb archie
    cd ~/archie/sql/
    psql archie < create.sql

    ~/archie/bin/load.py --workers 4 --db archie */* > load.log 2>&1


I guess the trick might be -DRANDOMIZE_ALLOCATED_MEMORY (I first tried
without it, and it seemed to work fine). If that's the case, I bet there
is a palloc that should have been palloc0, or something like that.
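
To make the suspected bug class concrete, a minimal illustration ('Foo' is
a hypothetical struct, not something from the patch):

    Foo *a = palloc(sizeof(Foo));   /* bytes are undefined; RANDOMIZE_ALLOCATED_MEMORY
                                     * fills them with garbage, exposing any
                                     * field the code forgot to initialize */
    Foo *b = palloc0(sizeof(Foo));  /* bytes are zero-initialized */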

If you still can't reproduce that, I may give you access to this machine
so that you can debug it there.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Custom compression methods

Ildus Kurbangaliev
On Mon, 27 Nov 2017 18:20:12 +0100
Tomas Vondra <[hidden email]> wrote:

> I guess the trick might be -DRANDOMIZE_ALLOCATED_MEMORY (I first tried
> without it, and it seemed working fine). If that's the case, I bet
> there is a palloc that should have been palloc0, or something like
> that.

Thanks, that was it. I've been able to reproduce this bug. The attached
patch should fix it, and I've also added recompression when tuples are
moved to a relation with a compressed attribute.

--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment: custom_compression_methods_v7.patch (266K)

Re: Custom compression methods

Tomas Vondra-4
On 11/28/2017 02:29 PM, Ildus Kurbangaliev wrote:

> On Mon, 27 Nov 2017 18:20:12 +0100
> Tomas Vondra <[hidden email]> wrote:
>
>> I guess the trick might be -DRANDOMIZE_ALLOCATED_MEMORY (I first
>> tried without it, and it seemed working fine). If that's the case,
>> I bet there is a palloc that should have been palloc0, or something
>> like that.
>
> Thanks, that was it. I've been able to reproduce this bug. The
> attached patch should fix this bug and I've also added recompression
> when tuples moved to the relation with the compressed attribute.
>
I've done many tests with fulltext search on the mail archive, using
different compression algorithms, and this time it worked fine. So I can
confirm v7 fixes the issue.

Let me elaborate a bit about the benchmarking I did. I realize the patch
is meant to provide only an "API" for custom compression methods, and so
benchmarking of existing general-purpose algorithms (replacing the
built-in pglz) may seem a bit irrelevant. But I'll draw some conclusions
from that, so please bear with me. Or just skip the next section.

------------------ benchmark / start ------------------

I was curious how much better we could do than the built-in compression,
so I've whipped together a bunch of extensions for a few common
general-purpose compression algorithms (lz4, gz, bz2, zstd, brotli and
snappy), loaded the community mailing list archives using "archie" [1]
and ran a bunch of real-world full-text queries on it. I've used
"default" (or "medium") compression levels for all algorithms.

For the loads, the results look like this:

            seconds     size
    -------------------------
    pglz       1631     9786
    zstd       1844     7102
    lz4        1582     9537
    bz2        2382     7670
    gz         1703     7067
    snappy     1587    12288
    brotli    10973     6180

According to those results the algorithms seem quite comparable, with
the exception of snappy and brotli. Snappy supposedly aims for fast
compression rather than compression ratio, but here it's about as fast as
the other algorithms while its compression ratio is almost 2x worse.
Brotli is much slower, although it gets a better compression ratio.

For the queries, I ran about 33k of real-world queries (executed on the
community mailing lists in the past). Firstly, a simple

  -- unsorted
  SELECT COUNT(id) FROM messages WHERE body_tsvector @@ $1::tsquery

and then

  -- sorted
  SELECT id FROM messages WHERE body_tsvector @@ $1::tsquery
   ORDER BY ts_rank(body_tsvector, $1::tsquery) DESC LIMIT 100;

Attached are 4 different charts, plotting pglz on x-axis and the other
algorithms on y-axis (so below diagonal => new algorithm is faster,
above diagonal => pglz is faster). I did this on two different machines,
one with only 8GB of RAM (so the dataset does not fit) and one much
larger (so everything fits into RAM).

I'm actually surprised how well the built-in pglz compression fares,
both on compression ratio and (de)compression speed. There is a bit of
noise for the fastest queries, where the alternative algorithms perform
better in a non-trivial number of cases.

I suspect those cases may be due to not implementing anything like
PGLZ_strategy_default->min_comp_rate (requiring 25% size reduction), but
I'm not sure about it.
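
For context, a sketch (not from the patch) of how pglz itself applies that
strategy; 'source', 'slen' and 'dest' are placeholders here:

    int32 clen = pglz_compress(source, slen, dest, PGLZ_strategy_default);

    if (clen < 0)
    {
        /* pglz_compress() returns -1 when compression is not worthwhile,
         * e.g. when the default min_comp_rate (25% size reduction) is not
         * achieved, and the value is then stored uncompressed. */
    }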

For more expensive queries, pglz pretty much wins. Of course, increasing
the compression level might change the results a bit, but it will also
make the data loads slower.

------------------ benchmark / end ------------------

While the results may look different for other datasets, my conclusion
is that it's unlikely we'll find another general-purpose algorithm
beating pglz (enough for people to switch to it, as they'll need to
worry about testing, deployment of extensions, etc.).

That doesn't necessarily mean supporting custom compression algorithms
is pointless, of course, but I think people will be much more interested
in exploiting known features of the data (instead of treating the values
as opaque arrays of bytes).

For example, I see the patch implements a special compression method for
tsvector values (used in the tests), exploiting knowledge of their
internal structure. I haven't tested whether that is an improvement
(either in compression/decompression speed or compression ratio), though.

I can imagine other interesting use cases - for example, values in JSONB
columns often use the same "schema" (keys, nesting, ...), so I can
imagine building a "dictionary" of JSON keys for the whole column ...

Ildus, is this a use case you've been aiming for, or were you aiming to
use the new API in a different way?

I wonder if the patch can be improved to handle this use case better.
For example, it requires knowledge of the actual data type, instead of
treating it as an opaque varlena / byte array. I see tsvector compression
does that by checking the typeid in the handler.

But that fails, for example, with this:

    db=# create domain x as tsvector;
    CREATE DOMAIN
    db=# create table t (a x compressed ts1);
    ERROR:  unexpected type 28198672 for tsvector compression handler

which means it's a few bricks shy of properly supporting domains. But I
wonder if this should instead be specified in CREATE COMPRESSION METHOD.
I mean, something like

    CREATE COMPRESSION METHOD ts1 HANDLER tsvector_compression_handler
    TYPE tsvector;

When the type is not specified, it applies to all varlena values;
otherwise only to that type. Also, why not allow setting the compression
as the default method for a data type, e.g.

    CREATE COMPRESSION METHOD ts1 HANDLER tsvector_compression_handler
    TYPE tsvector DEFAULT;

would automatically add 'COMPRESSED ts1' to all tsvector columns in new
CREATE TABLE commands.

BTW do you expect the tsvector compression to be generally useful, or is
it meant to be used only by the tests? If generally useful, perhaps it
should be created in pg_compression by default. If only for tests, maybe
it should be implemented in an extension in contrib (thus also serving
as an example of how to implement new methods).

I haven't thought about the JSONB use case very much, but I suppose that
could be done using the configure/drop methods. I mean, allocating the
dictionary somewhere (e.g. in a table created by an extension?). The
configure method gets the Form_pg_attribute record, so that should be
enough I guess.

But the patch is not testing those two methods at all, which seems like
something that needs to be addressed before commit. I don't expect a
full-fledged JSONB compression extension, but something simple that
actually exercises those methods in a meaningful way.

Similarly for the compression options - we need to test that the WITH
part is handled correctly (tsvector does not provide configure method).

Which reminds me I'm confused by pg_compression_opt. Consider this:

    CREATE COMPRESSION METHOD ts1 HANDLER tsvector_compression_handler;
    CREATE TABLE t (a tsvector COMPRESSED ts1);

    db=# select * from pg_compression_opt ;
     cmoptoid | cmname |          cmhandler           | cmoptions
    ----------+--------+------------------------------+-----------
     28198689 | ts1    | tsvector_compression_handler |
    (1 row)

    DROP TABLE t;

    db=# select * from pg_compression_opt ;
     cmoptoid | cmname |          cmhandler           | cmoptions
    ----------+--------+------------------------------+-----------
     28198689 | ts1    | tsvector_compression_handler |
    (1 row)

    db=# DROP COMPRESSION METHOD ts1;
    ERROR:  cannot drop compression method ts1 because other objects
            depend on it
    DETAIL:  compression options for ts1 depends on compression method
             ts1
    HINT:  Use DROP ... CASCADE to drop the dependent objects too.

I believe the pg_compression_opt is actually linked to pg_attribute, in
which case it should include (attrelid,attnum), and should be dropped
when the table is dropped.

I suppose it was done this way to work around the lack of recompression
(i.e. the compressed value might have ended up in another table), but
that is no longer true.

A few more comments:

1) The patch makes optionListToArray (in foreigncmds.c) non-static, but
it's not used anywhere. So this seems like a change that is no longer
necessary.

2) I see we add two un-reserved keywords in gram.y - COMPRESSION and
COMPRESSED. Perhaps COMPRESSION would be enough? I mean, we could do

    CREATE TABLE t (c TEXT COMPRESSION cm1);
    ALTER ... SET COMPRESSION name ...
    ALTER ... SET COMPRESSION none;

Although I agree the "SET COMPRESSION none" is a bit strange.

3) heap_prepare_insert uses this chunk of code

+  else if (HeapTupleHasExternal(tup)
+    || RelationGetDescr(relation)->tdflags & TD_ATTR_CUSTOM_COMPRESSED
+    || HeapTupleHasCustomCompressed(tup)
+    || tup->t_len > TOAST_TUPLE_THRESHOLD)

Shouldn't that be rather

+  else if (HeapTupleHasExternal(tup)
+    || (RelationGetDescr(relation)->tdflags & TD_ATTR_CUSTOM_COMPRESSED
+        && HeapTupleHasCustomCompressed(tup))
+    || tup->t_len > TOAST_TUPLE_THRESHOLD)



regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment: data-sorted-exceeds-ram.png (296K)
Attachment: data-sorted-fits-into-ram.png (295K)
Attachment: data-unsorted-exceeds-ram.png (132K)
Attachment: data-unsorted-fits-into-ram.png (284K)

Re: Custom compression methods

Michael Paquier
On Thu, Nov 30, 2017 at 8:30 AM, Tomas Vondra
<[hidden email]> wrote:

> On 11/28/2017 02:29 PM, Ildus Kurbangaliev wrote:
>> On Mon, 27 Nov 2017 18:20:12 +0100
>> Tomas Vondra <[hidden email]> wrote:
>>
>>> I guess the trick might be -DRANDOMIZE_ALLOCATED_MEMORY (I first
>>> tried without it, and it seemed working fine). If that's the case,
>>> I bet there is a palloc that should have been palloc0, or something
>>> like that.
>>
>> Thanks, that was it. I've been able to reproduce this bug. The
>> attached patch should fix this bug and I've also added recompression
>> when tuples moved to the relation with the compressed attribute.
>>
>
> I've done many tests with fulltext search on the mail archive, using
> different compression algorithm, and this time it worked fine. So I can
> confirm v7 fixes the issue.

Moved to next CF.
--
Michael


Re: Custom compression methods

Ildus Kurbangaliev
In reply to this post by Tomas Vondra-4
On Thu, 30 Nov 2017 00:30:37 +0100
Tomas Vondra <[hidden email]> wrote:

> While the results may look differently for other datasets, my
> conclusion is that it's unlikely we'll find another general-purpose
> algorithm beating pglz (enough for people to switch to it, as they'll
> need to worry about testing, deployment of extensions etc).
>
> That doesn't necessarily mean supporting custom compression algorithms
> is pointless, of course, but I think people will be much more
> interested in exploiting known features of the data (instead of
> treating the values as opaque arrays of bytes).
>
> For example, I see the patch implements a special compression method
> for tsvector values (used in the tests), exploiting from knowledge of
> internal structure. I haven't tested if that is an improvement (either
> in compression/decompression speed or compression ratio), though.
>
> I can imagine other interesting use cases - for example values in
> JSONB columns often use the same "schema" (keys, nesting, ...), so
> can I imagine building a "dictionary" of JSON keys for the whole
> column ...
>
> Ildus, is this a use case you've been aiming for, or were you aiming
> to use the new API in a different way?

Thank you for such a good overview. I agree that pglz is pretty good as a
general compression method, and there's no point in changing it, at
least for now.

I see a few useful use cases for compression methods: special
compression methods for int[] and timestamp[] for time series, and yes,
dictionaries for jsonb, for which I have already created an extension
(https://github.com/postgrespro/jsonbd). It's working and giving
promising results.

>
> I wonder if the patch can be improved to handle this use case better.
> For example, it requires knowledge the actual data type, instead of
> treating it as opaque varlena / byte array. I see tsvector compression
> does that by checking typeid in the handler.
>
> But that fails for example with this example
>
>     db=# create domain x as tsvector;
>     CREATE DOMAIN
>     db=# create table t (a x compressed ts1);
>     ERROR:  unexpected type 28198672 for tsvector compression handler
>
> which means it's a few brick shy to properly support domains. But I
> wonder if this should be instead specified in CREATE COMPRESSION
> METHOD instead. I mean, something like
>
>     CREATE COMPRESSION METHOD ts1 HANDLER tsvector_compression_handler
>     TYPE tsvector;
>
> When type is no specified, it applies to all varlena values. Otherwise
> only to that type. Also, why not to allow setting the compression as
> the default method for a data type, e.g.
>
>     CREATE COMPRESSION METHOD ts1 HANDLER tsvector_compression_handler
>     TYPE tsvector DEFAULT;
>
> would automatically add 'COMPRESSED ts1' to all tsvector columns in
> new CREATE TABLE commands.

The initial version of the patch contained ALTER syntax that changes the
compression method for whole types, but I decided to remove that
functionality for now because the patch is already quite complex; it
could be added later as a separate patch.

Syntax was:
ALTER TYPE <type> SET COMPRESSION <cm>;

Specifying the supported type for the compression method is a good idea.
Maybe the following syntax would be better?

CREATE COMPRESSION METHOD ts1 FOR tsvector HANDLER
tsvector_compression_handler;

>
> BTW do you expect the tsvector compression to be generally useful, or
> is it meant to be used only by the tests? If generally useful,
> perhaps it should be created in pg_compression by default. If only
> for tests, maybe it should be implemented in an extension in contrib
> (thus also serving as example how to implement new methods).
>
> I haven't thought about the JSONB use case very much, but I suppose
> that could be done using the configure/drop methods. I mean,
> allocating the dictionary somewhere (e.g. in a table created by an
> extension?). The configure method gets the Form_pg_attribute record,
> so that should be enough I guess.
>
> But the patch is not testing those two methods at all, which seems
> like something that needs to be addresses before commit. I don't
> expect a full-fledged JSONB compression extension, but something
> simple that actually exercises those methods in a meaningful way.

I will move tsvector_compression_handler to a separate extension in
the next version. I added it more as an example, but it could also be
used to achieve better compression for tsvectors. Tests on the mailing
lists database ('archie' tables):

usual compression:

maillists=# select body_tsvector, subject_tsvector into t1 from messages;
SELECT 1114213
maillists=# select pg_size_pretty(pg_total_relation_size('t1'));
 pg_size_pretty
----------------
 1637 MB
(1 row)

tsvector_compression_handler:
maillists=# select pg_size_pretty(pg_total_relation_size('t2'));
 pg_size_pretty
----------------
 1521 MB
(1 row)

lz4:
maillists=# select pg_size_pretty(pg_total_relation_size('t3'));
 pg_size_pretty
----------------
 1487 MB
(1 row)

I'm not attached to tsvector_compression_handler; I think if there is
some example that can use all the features, then
tsvector_compression_handler could be replaced with it. My extension
for jsonb dictionaries is big enough and I'm not ready to try to include
it in the patch. I do see the use of the 'drop' method, since there
should be a way for an extension to clean up its resources, but I don't
see a simple enough usage for it in tests. Maybe just dummy methods for
'drop' and 'configure' will be enough for testing purposes.

>
> Similarly for the compression options - we need to test that the WITH
> part is handled correctly (tsvector does not provide configure
> method).

I could add some options to tsvector_compression_handler, like options
that change pglz_compress parameters.

>
> Which reminds me I'm confused by pg_compression_opt. Consider this:
>
>     CREATE COMPRESSION METHOD ts1 HANDLER
> tsvector_compression_handler; CREATE TABLE t (a tsvector COMPRESSED
> ts1);
>
>     db=# select * from pg_compression_opt ;
>      cmoptoid | cmname |          cmhandler           | cmoptions
>     ----------+--------+------------------------------+-----------
>      28198689 | ts1    | tsvector_compression_handler |
>     (1 row)
>
>     DROP TABLE t;
>
>     db=# select * from pg_compression_opt ;
>      cmoptoid | cmname |          cmhandler           | cmoptions
>     ----------+--------+------------------------------+-----------
>      28198689 | ts1    | tsvector_compression_handler |
>     (1 row)
>
>     db=# DROP COMPRESSION METHOD ts1;
>     ERROR:  cannot drop compression method ts1 because other objects
>             depend on it
>     DETAIL:  compression options for ts1 depends on compression method
>              ts1
>     HINT:  Use DROP ... CASCADE to drop the dependent objects too.
>
> I believe the pg_compression_opt is actually linked to pg_attribute,
> in which case it should include (attrelid,attnum), and should be
> dropped when the table is dropped.
>
> I suppose it was done this way to work around the lack of
> recompression (i.e. the compressed value might have ended in other
> table), but that is no longer true.

Good point. Since there is recompression now, the options could be
safely removed when the table is dropped. It will complicate pg_upgrade,
but I think this is solvable.

>
> A few more comments:
>
> 1) The patch makes optionListToArray (in foreigncmds.c) non-static,
> but it's not used anywhere. So this seems like change that is no
> longer necessary.

I use this function in CreateCompressionOptions.

>
> 2) I see we add two un-reserved keywords in gram.y - COMPRESSION and
> COMPRESSED. Perhaps COMPRESSION would be enough? I mean, we could do
>
>     CREATE TABLE t (c TEXT COMPRESSION cm1);
>     ALTER ... SET COMPRESSION name ...
>     ALTER ... SET COMPRESSION none;
>
> Although I agree the "SET COMPRESSION none" is a bit strange.

I agree; I've already changed the syntax for the next version of the
patch. It's COMPRESSION instead of COMPRESSED and DROP COMPRESSION
instead of SET NOT COMPRESSED. One word less in the grammar, and it
looks nicer.

>
> 3) heap_prepare_insert uses this chunk of code
>
> +  else if (HeapTupleHasExternal(tup)
> +    || RelationGetDescr(relation)->tdflags &
> TD_ATTR_CUSTOM_COMPRESSED
> +    || HeapTupleHasCustomCompressed(tup)
> +    || tup->t_len > TOAST_TUPLE_THRESHOLD)
>
> Shouldn't that be rather
>
> +  else if (HeapTupleHasExternal(tup)
> +    || (RelationGetDescr(relation)->tdflags &
> TD_ATTR_CUSTOM_COMPRESSED
> +        && HeapTupleHasCustomCompressed(tup))
> +    || tup->t_len > TOAST_TUPLE_THRESHOLD)

These conditions are used for opposite directions.
HeapTupleHasCustomCompressed(tup) is true if the tuple has compressed
datums inside and we need to decompress them first, and the
TD_ATTR_CUSTOM_COMPRESSED flag means that the relation we're putting the
data into has columns with custom compression, so maybe we need to
compress datums in the current tuple.
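
For readers following the patch, here is the same condition with each
clause annotated according to this explanation (the comments are added
here for clarity and are not part of the patch):

    else if (HeapTupleHasExternal(tup)
        /* target relation has custom-compressed columns:
         * incoming datums may need to be compressed */
        || RelationGetDescr(relation)->tdflags & TD_ATTR_CUSTOM_COMPRESSED
        /* tuple already carries custom-compressed datums:
         * they may need to be decompressed (or recompressed) first */
        || HeapTupleHasCustomCompressed(tup)
        || tup->t_len > TOAST_TUPLE_THRESHOLD)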

--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: Custom compression methods

Tomas Vondra-4


On 11/30/2017 04:20 PM, Ildus Kurbangaliev wrote:

> On Thu, 30 Nov 2017 00:30:37 +0100
> Tomas Vondra <[hidden email]> wrote:
>
> ...
>
>> I can imagine other interesting use cases - for example values in
>> JSONB columns often use the same "schema" (keys, nesting, ...), so I
>> can imagine building a "dictionary" of JSON keys for the whole
>> column ...
>>
>> Ildus, is this a use case you've been aiming for, or were you aiming
>> to use the new API in a different way?
>
> Thank you for such a good overview. I agree that pglz is pretty good as
> a general compression method, and there's no point in changing it, at
> least for now.
>
> I see a few useful use cases for compression methods: special
> compression methods for int[] and timestamp[] for time series and, yes,
> dictionaries for jsonb, for which I have already created an
> extension (https://github.com/postgrespro/jsonbd). It's working and
> giving promising results.
>

I understand the reluctance to put everything into core, particularly
for complex patches that evolve quickly. Also, not having to put
everything into core is kinda why we have extensions.

But perhaps some of the simpler cases would be good candidates for core,
making it possible to test the feature?

>>
>> I wonder if the patch can be improved to handle this use case better.
>> For example, it requires knowledge of the actual data type, instead of
>> treating it as opaque varlena / byte array. I see tsvector compression
>> does that by checking typeid in the handler.
>>
>> But that fails for example with this example
>>
>>     db=# create domain x as tsvector;
>>     CREATE DOMAIN
>>     db=# create table t (a x compressed ts1);
>>     ERROR:  unexpected type 28198672 for tsvector compression handler
>>
>> which means it's a few bricks shy of properly supporting domains. But I
>> wonder if this should be instead specified in CREATE COMPRESSION
>> METHOD instead. I mean, something like
>>
>>     CREATE COMPRESSION METHOD ts1 HANDLER tsvector_compression_handler
>>     TYPE tsvector;
>>
>> When the type is not specified, it applies to all varlena values. Otherwise
>> only to that type. Also, why not allow setting the compression as
>> the default method for a data type, e.g.
>>
>>     CREATE COMPRESSION METHOD ts1 HANDLER tsvector_compression_handler
>>     TYPE tsvector DEFAULT;
>>
>> would automatically add 'COMPRESSED ts1' to all tsvector columns in
>> new CREATE TABLE commands.
>
> The initial version of the patch contained ALTER syntax that changes the
> compression method for whole types, but I decided to remove that
> functionality for now because the patch is already quite complex,
> and it could be added later as a separate patch.
>
> Syntax was:
> ALTER TYPE <type> SET COMPRESSION <cm>;
>
> Specifying the supported type for the compression method is a good idea.
> Maybe the following syntax would be better?
>
> CREATE COMPRESSION METHOD ts1 FOR tsvector HANDLER
> tsvector_compression_handler;
>

Understood. Good to know you've considered it, and I agree it doesn't
need to be there from the start (which makes the patch simpler).

>>
>> BTW do you expect the tsvector compression to be generally useful, or
>> is it meant to be used only by the tests? If generally useful,
>> perhaps it should be created in pg_compression by default. If only
>> for tests, maybe it should be implemented in an extension in contrib
>> (thus also serving as example how to implement new methods).
>>
>> I haven't thought about the JSONB use case very much, but I suppose
>> that could be done using the configure/drop methods. I mean,
>> allocating the dictionary somewhere (e.g. in a table created by an
>> extension?). The configure method gets the Form_pg_attribute record,
>> so that should be enough I guess.
>>
>> But the patch is not testing those two methods at all, which seems
>> like something that needs to be addressed before commit. I don't
>> expect a full-fledged JSONB compression extension, but something
>> simple that actually exercises those methods in a meaningful way.
>
> I will move tsvector_compression_handler to a separate extension in
> the next version. I added it more as an example, but it could also be
> used to achieve better compression for tsvectors. Tests on the
> maillists database ('archie' tables):
>
> usual compression:
>
> maillists=# select body_tsvector, subject_tsvector into t1 from messages;
> SELECT 1114213
> maillists=# select pg_size_pretty(pg_total_relation_size('t1'));
>  pg_size_pretty
> ----------------
>  1637 MB
> (1 row)
>
> tsvector_compression_handler:
> maillists=# select pg_size_pretty(pg_total_relation_size('t2'));
>  pg_size_pretty
> ----------------
>  1521 MB
> (1 row)
>
> lz4:
> maillists=# select pg_size_pretty(pg_total_relation_size('t3'));
>  pg_size_pretty
> ----------------
>  1487 MB
> (1 row)
>
> I'm not attached to tsvector_compression_handler; I think if there
> were some example that exercises all the features, then
> tsvector_compression_handler could be replaced with it.
>

OK. I think it's a nice use case (and nice gains on the compression
ratio), demonstrating the datatype-aware compression. The question is
why shouldn't this be built into the datatypes directly?

That would certainly be possible for tsvector, although it wouldn't be
as transparent (the datatype code would have to support it explicitly).

I'm a bit torn on this. The custom compression method patch makes the
compression mostly transparent for the datatype code (by adding an extra
"compression" header). But it's coupled to the datatype quite strongly
as it requires knowledge of the data type internals.

It's a bit less coupled for "generic" datatypes (e.g. arrays of other
types), where it may add important information (e.g. that the array
represents a chunk of timeseries data, which the array code can't
possibly know).
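
Just to illustrate what I mean (ts_delta and its handler are made up for
the example; they are not part of the patch):

    -- a hypothetical method that treats float8[] values as ordered
    -- timeseries samples and delta-encodes them, something the generic
    -- array code cannot assume on its own
    CREATE COMPRESSION METHOD ts_delta HANDLER ts_delta_handler;
    CREATE TABLE metrics (readings float8[] COMPRESSED ts_delta);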

>
> My extension for jsonb dictionaries is big enough that I'm not ready
> to try to include it in the patch. I do see the use of the 'drop'
> method, since there should be a way for an extension to clean up its
> resources, but I don't see a simple enough usage for it in tests.
> Maybe just dummy methods for 'drop' and 'configure' will be enough
> for testing purposes.
>

OK.

>>
>> Similarly for the compression options - we need to test that the WITH
>> part is handled correctly (tsvector does not provide configure
>> method).
>
> I could add some options to tsvector_compression_handler, like options
> that change pglz_compress parameters.
>

+1 for doing that
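
For example (the option names are just a guess at how pglz_compress
strategy fields might be exposed, and I'm assuming the WITH clause
attaches at the column level):

    CREATE TABLE t (
        a tsvector COMPRESSED ts1 WITH (min_comp_rate = 25, first_success_by = 1024)
    );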

>>
>> Which reminds me I'm confused by pg_compression_opt. Consider this:
>>
>>     CREATE COMPRESSION METHOD ts1 HANDLER tsvector_compression_handler;
>>     CREATE TABLE t (a tsvector COMPRESSED ts1);
>>
>>     db=# select * from pg_compression_opt ;
>>      cmoptoid | cmname |          cmhandler           | cmoptions
>>     ----------+--------+------------------------------+-----------
>>      28198689 | ts1    | tsvector_compression_handler |
>>     (1 row)
>>
>>     DROP TABLE t;
>>
>>     db=# select * from pg_compression_opt ;
>>      cmoptoid | cmname |          cmhandler           | cmoptions
>>     ----------+--------+------------------------------+-----------
>>      28198689 | ts1    | tsvector_compression_handler |
>>     (1 row)
>>
>>     db=# DROP COMPRESSION METHOD ts1;
>>     ERROR:  cannot drop compression method ts1 because other objects
>>             depend on it
>>     DETAIL:  compression options for ts1 depends on compression method
>>              ts1
>>     HINT:  Use DROP ... CASCADE to drop the dependent objects too.
>>
>> I believe the pg_compression_opt is actually linked to pg_attribute,
>> in which case it should include (attrelid,attnum), and should be
>> dropped when the table is dropped.
>>
>> I suppose it was done this way to work around the lack of
>> recompression (i.e. the compressed value might have ended up in another
>> table), but that is no longer true.
>
> Good point. Since there is recompression now, the options can be safely
> removed when the table is dropped. It will complicate pg_upgrade, but I
> think this is solvable.
>

+1 to do that. I've never dealt with pg_upgrade, but I suppose this
shouldn't be more complicated than for custom data types, right?

>>
>> A few more comments:
>>
>> 1) The patch makes optionListToArray (in foreigncmds.c) non-static,
>> but it's not used anywhere. So this seems like change that is no
>> longer necessary.
>
> I use this function in CreateCompressionOptions.
>

Ah, my mistake. I only did 'git grep', which does not search new files
(those not yet added to git). It seems a bit strange to have the
function in foreigncmds.c, though, now that we use it outside of FDWs.

>>
>> 2) I see we add two un-reserved keywords in gram.y - COMPRESSION and
>> COMPRESSED. Perhaps COMPRESSION would be enough? I mean, we could do
>>
>>     CREATE TABLE t (c TEXT COMPRESSION cm1);
>>     ALTER ... SET COMPRESSION name ...
>>     ALTER ... SET COMPRESSION none;
>>
>> Although I agree the "SET COMPRESSION none" is a bit strange.
>
> I agree, I've already changed the syntax for the next version of the
> patch. It's COMPRESSION instead of COMPRESSED and DROP COMPRESSION
> instead of SET NOT COMPRESSED. That's one word less in the grammar, and
> it looks nicer.
>

I'm not sure DROP COMPRESSION is a good idea. It implies that the data
will be uncompressed, but I assume it merely switches to the built-in
compression (pglz), right? Although "SET COMPRESSION none" has the same
issue ...

BTW, when you do DROP COMPRESSION (or whatever syntax we end up using),
will that remove the dependencies on the compression method? I haven't
tried, so maybe it does.

>>
>> 3) heap_prepare_insert uses this chunk of code
>>
>> +  else if (HeapTupleHasExternal(tup)
>> +    || RelationGetDescr(relation)->tdflags & TD_ATTR_CUSTOM_COMPRESSED
>> +    || HeapTupleHasCustomCompressed(tup)
>> +    || tup->t_len > TOAST_TUPLE_THRESHOLD)
>>
>> Shouldn't that be rather
>>
>> +  else if (HeapTupleHasExternal(tup)
>> +    || (RelationGetDescr(relation)->tdflags & TD_ATTR_CUSTOM_COMPRESSED
>> +        && HeapTupleHasCustomCompressed(tup))
>> +    || tup->t_len > TOAST_TUPLE_THRESHOLD)
>
> These conditions are used for opposite directions.
> HeapTupleHasCustomCompressed(tup) is true if the tuple has compressed
> datums inside and we need to decompress them first, while the
> TD_ATTR_CUSTOM_COMPRESSED flag means that the relation we're putting the
> data into has columns with custom compression, so we may need to
> compress datums in the current tuple.
>

Ah, right, now it makes sense. Thanks for explaining.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
