Custom compression methods

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
181 messages Options
1 ... 678910
Reply | Threaded
Open this post in threaded view
|

Re: Custom compression methods

Tomas Vondra-4
On Mon, Oct 05, 2020 at 11:17:28AM +0530, Dilip Kumar wrote:

>On Mon, Oct 5, 2020 at 3:37 AM Tomas Vondra
><[hidden email]> wrote:
>
>Thanks, Tomas for your feedback.
>
>> 9) attcompression ...
>>
>> The main issue I see is what the patch does with attcompression. Instead
>> of just using it to store a the compression method, it's also used to
>> store the preserved compression methods. And using NameData to store
>> this seems wrong too - if we really want to store this info, the correct
>> way is either using text[] or inventing charvector or similar.
>
>The reason for using the NameData is the get it in the fixed part of
>the data structure.
>

Why do we need that? It's possible to have varlena fields with direct
access (see pg_index.indkey for example). Adding NameData just to make
it fixed-length means we're always adding 64B even if we just need a
single byte, which means ~30% overhead for the FormData_pg_attribute.
That seems a bit unnecessary, and might be an issue with many attributes
(e.g. with many temp tables, etc.).

>> But to me this seems very much like a misuse of attcompression to track
>> dependencies on compression methods, necessary because we don't have a
>> separate catalog listing compression methods. If we had that, I think we
>> could simply add dependencies between attributes and that catalog.
>
>Basically, up to this patch, we are having only built-in compression
>methods and those can not be dropped so we don't need any dependency
>at all.  We just want to know what is the current compression method
>and what is the preserve compression methods supported for this
>attribute.  Maybe we can do it better instead of using the NameData
>but I don't think it makes sense to add a separate catalog?
>

Sure, I understand what the goal was - all I'm saying is that it looks
very much like a workaround needed because we don't have the catalog.

I don't quite understand how could we support custom compression methods
without listing them in some sort of catalog?

>> Moreover, having the catalog would allow adding compression methods
>> (from extensions etc) instead of just having a list of hard-coded
>> compression methods. Which seems like a strange limitation, considering
>> this thread is called "custom compression methods".
>
>I think I forgot to mention while submitting the previous patch that
>the next patch I am planning to submit is, Support creating the custom
>compression methods wherein we can use pg_am catalog to insert the new
>compression method.  And for dependency handling, we can create an
>attribute dependency on the pg_am row.  Basically, we will create the
>attribute dependency on the current compression method AM as well as
>on the preserved compression methods AM.   As part of this, we will
>add two build-in AMs for zlib and pglz, and the attcompression field
>will be converted to the oid_vector (first OID will be of the current
>compression method, followed by the preserved compression method's
>oids).
>

Hmmm, ok. Not sure pg_am is the right place - compression methods don't
quite match what I though AMs are about, but maybe it's just my fault.

FWIW it seems a bit strange to first do the attcompression magic and
then add the catalog later - I think we should start with the catalog
right away. The advantage is that if we end up committing only some of
the patches in this cycle, we already have all the infrastructure etc.
We can reorder that later, though.

>> 10) compression parameters?
>>
>> I wonder if we could/should allow parameters, like compression level
>> (and maybe other stuff, depending on the compression method). PG13
>> allowed that for opclasses, so perhaps we should allow it here too.
>
>Yes, that is also in the plan.   For doing this we are planning to add
>an extra column in the pg_attribute which will store the compression
>options for the current compression method. The original patch was
>creating an extra catalog pg_column_compression, therein it maintains
>the oid of the compression method as well as the compression options.
>The advantage of creating an extra catalog is that we can keep the
>compression options for the preserved compression methods also so that
>we can support the options which can be used for decompressing the
>data as well.  Whereas if we want to avoid this extra catalog then we
>can not use that compression option for decompressing.  But most of
>the options e.g. compression level are just for the compressing so it
>is enough to store for the current compression method only.  What's
>your thoughts?
>

Not sure. My assumption was we'd end up with a new catalog, but maybe
stashing it into pg_attribute is fine. I was really thinking about two
kinds of options - compression level, and some sort of column-level
dictionary. Compression level is not necessary for decompression, but
the dictionary ID would be needed. (I think the global dictionary was
one of the use cases, aimed at JSON compression.)

But I don't think stashing it in pg_attribute means we couldn't use it
for decompression - we'd just need to keep an array of options, one for
each compression method. Keeping it in a separate new catalog might be
cleaner, and I'm not sure how large the configuration might be.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: Custom compression methods

Dilip Kumar-2
On Mon, Oct 5, 2020 at 5:53 PM Tomas Vondra
<[hidden email]> wrote:

>
> On Mon, Oct 05, 2020 at 11:17:28AM +0530, Dilip Kumar wrote:
> >On Mon, Oct 5, 2020 at 3:37 AM Tomas Vondra
> ><[hidden email]> wrote:
> >
> >Thanks, Tomas for your feedback.
> >
> >> 9) attcompression ...
> >>
> >> The main issue I see is what the patch does with attcompression. Instead
> >> of just using it to store a the compression method, it's also used to
> >> store the preserved compression methods. And using NameData to store
> >> this seems wrong too - if we really want to store this info, the correct
> >> way is either using text[] or inventing charvector or similar.
> >
> >The reason for using the NameData is the get it in the fixed part of
> >the data structure.
> >
>
> Why do we need that? It's possible to have varlena fields with direct
> access (see pg_index.indkey for example).

I see.  While making it NameData I was thinking whether we have an
option to direct access the varlena. Thanks for pointing me there. I
will change this.

 Adding NameData just to make
> it fixed-length means we're always adding 64B even if we just need a
> single byte, which means ~30% overhead for the FormData_pg_attribute.
> That seems a bit unnecessary, and might be an issue with many attributes
> (e.g. with many temp tables, etc.).

You are right.  Even I did not like to keep 64B for this, so I will change it.

>
> >> But to me this seems very much like a misuse of attcompression to track
> >> dependencies on compression methods, necessary because we don't have a
> >> separate catalog listing compression methods. If we had that, I think we
> >> could simply add dependencies between attributes and that catalog.
> >
> >Basically, up to this patch, we are having only built-in compression
> >methods and those can not be dropped so we don't need any dependency
> >at all.  We just want to know what is the current compression method
> >and what is the preserve compression methods supported for this
> >attribute.  Maybe we can do it better instead of using the NameData
> >but I don't think it makes sense to add a separate catalog?
> >
>
> Sure, I understand what the goal was - all I'm saying is that it looks
> very much like a workaround needed because we don't have the catalog.
>
> I don't quite understand how could we support custom compression methods
> without listing them in some sort of catalog?

Yeah for supporting custom compression we need some catalog.

> >> Moreover, having the catalog would allow adding compression methods
> >> (from extensions etc) instead of just having a list of hard-coded
> >> compression methods. Which seems like a strange limitation, considering
> >> this thread is called "custom compression methods".
> >
> >I think I forgot to mention while submitting the previous patch that
> >the next patch I am planning to submit is, Support creating the custom
> >compression methods wherein we can use pg_am catalog to insert the new
> >compression method.  And for dependency handling, we can create an
> >attribute dependency on the pg_am row.  Basically, we will create the
> >attribute dependency on the current compression method AM as well as
> >on the preserved compression methods AM.   As part of this, we will
> >add two build-in AMs for zlib and pglz, and the attcompression field
> >will be converted to the oid_vector (first OID will be of the current
> >compression method, followed by the preserved compression method's
> >oids).
> >
>
> Hmmm, ok. Not sure pg_am is the right place - compression methods don't
> quite match what I though AMs are about, but maybe it's just my fault.
>
> FWIW it seems a bit strange to first do the attcompression magic and
> then add the catalog later - I think we should start with the catalog
> right away. The advantage is that if we end up committing only some of
> the patches in this cycle, we already have all the infrastructure etc.
> We can reorder that later, though.

Hmm,  yeah we can do this way as well that first create a new catalog
table and add entries for these two built-in methods and the
attcompression can store the oid vector.  But if we only commit the
build-in compression methods part then does it make sense to create an
extra catalog or adding these build-in methods to the existing catalog
(if we plan to use pg_am).  Then in attcompression instead of using
one byte for each preserve compression method, we need to use oid.  So
from Robert's mail[1], it appeared to me that he wants that the
build-in compression methods part should be independently committable
and if we think from that perspective then adding a catalog doesn't
make much sense.  But if we are planning to commit the custom method
also then it makes more sense to directly start with the catalog
because that way it will be easy to expand without much refactoring.

[1] https://www.postgresql.org/message-id/CA%2BTgmobSDVgUage9qQ5P_%3DF_9jaMkCgyKxUQGtFQU7oN4kX-AA%40mail.gmail.com

> >> 10) compression parameters?
> >>
> >> I wonder if we could/should allow parameters, like compression level
> >> (and maybe other stuff, depending on the compression method). PG13
> >> allowed that for opclasses, so perhaps we should allow it here too.
> >
> >Yes, that is also in the plan.   For doing this we are planning to add
> >an extra column in the pg_attribute which will store the compression
> >options for the current compression method. The original patch was
> >creating an extra catalog pg_column_compression, therein it maintains
> >the oid of the compression method as well as the compression options.
> >The advantage of creating an extra catalog is that we can keep the
> >compression options for the preserved compression methods also so that
> >we can support the options which can be used for decompressing the
> >data as well.  Whereas if we want to avoid this extra catalog then we
> >can not use that compression option for decompressing.  But most of
> >the options e.g. compression level are just for the compressing so it
> >is enough to store for the current compression method only.  What's
> >your thoughts?
> >
>
> Not sure. My assumption was we'd end up with a new catalog, but maybe
> stashing it into pg_attribute is fine. I was really thinking about two
> kinds of options - compression level, and some sort of column-level
> dictionary. Compression level is not necessary for decompression, but
> the dictionary ID would be needed. (I think the global dictionary was
> one of the use cases, aimed at JSON compression.)

Ok

> But I don't think stashing it in pg_attribute means we couldn't use it
> for decompression - we'd just need to keep an array of options, one for
> each compression method.

Yeah, we can do that.

Keeping it in a separate new catalog might be
> cleaner, and I'm not sure how large the configuration might be.

Yeah in that case it will be better to store in a separate catalog,
because sometimes if multiple attributes are using the same
compression method with the same options then we can store the same
oid in attcompression instead of duplicating the option field.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: Custom compression methods

Tomas Vondra-4
On Mon, Oct 05, 2020 at 07:57:41PM +0530, Dilip Kumar wrote:

>On Mon, Oct 5, 2020 at 5:53 PM Tomas Vondra
><[hidden email]> wrote:
>>
>> On Mon, Oct 05, 2020 at 11:17:28AM +0530, Dilip Kumar wrote:
>> >On Mon, Oct 5, 2020 at 3:37 AM Tomas Vondra
>> ><[hidden email]> wrote:
>> >
>> >Thanks, Tomas for your feedback.
>> >
>> >> 9) attcompression ...
>> >>
>> >> The main issue I see is what the patch does with attcompression. Instead
>> >> of just using it to store a the compression method, it's also used to
>> >> store the preserved compression methods. And using NameData to store
>> >> this seems wrong too - if we really want to store this info, the correct
>> >> way is either using text[] or inventing charvector or similar.
>> >
>> >The reason for using the NameData is the get it in the fixed part of
>> >the data structure.
>> >
>>
>> Why do we need that? It's possible to have varlena fields with direct
>> access (see pg_index.indkey for example).
>
>I see.  While making it NameData I was thinking whether we have an
>option to direct access the varlena. Thanks for pointing me there. I
>will change this.
>
> Adding NameData just to make
>> it fixed-length means we're always adding 64B even if we just need a
>> single byte, which means ~30% overhead for the FormData_pg_attribute.
>> That seems a bit unnecessary, and might be an issue with many attributes
>> (e.g. with many temp tables, etc.).
>
>You are right.  Even I did not like to keep 64B for this, so I will change it.
>
>>
>> >> But to me this seems very much like a misuse of attcompression to track
>> >> dependencies on compression methods, necessary because we don't have a
>> >> separate catalog listing compression methods. If we had that, I think we
>> >> could simply add dependencies between attributes and that catalog.
>> >
>> >Basically, up to this patch, we are having only built-in compression
>> >methods and those can not be dropped so we don't need any dependency
>> >at all.  We just want to know what is the current compression method
>> >and what is the preserve compression methods supported for this
>> >attribute.  Maybe we can do it better instead of using the NameData
>> >but I don't think it makes sense to add a separate catalog?
>> >
>>
>> Sure, I understand what the goal was - all I'm saying is that it looks
>> very much like a workaround needed because we don't have the catalog.
>>
>> I don't quite understand how could we support custom compression methods
>> without listing them in some sort of catalog?
>
>Yeah for supporting custom compression we need some catalog.
>
>> >> Moreover, having the catalog would allow adding compression methods
>> >> (from extensions etc) instead of just having a list of hard-coded
>> >> compression methods. Which seems like a strange limitation, considering
>> >> this thread is called "custom compression methods".
>> >
>> >I think I forgot to mention while submitting the previous patch that
>> >the next patch I am planning to submit is, Support creating the custom
>> >compression methods wherein we can use pg_am catalog to insert the new
>> >compression method.  And for dependency handling, we can create an
>> >attribute dependency on the pg_am row.  Basically, we will create the
>> >attribute dependency on the current compression method AM as well as
>> >on the preserved compression methods AM.   As part of this, we will
>> >add two build-in AMs for zlib and pglz, and the attcompression field
>> >will be converted to the oid_vector (first OID will be of the current
>> >compression method, followed by the preserved compression method's
>> >oids).
>> >
>>
>> Hmmm, ok. Not sure pg_am is the right place - compression methods don't
>> quite match what I though AMs are about, but maybe it's just my fault.
>>
>> FWIW it seems a bit strange to first do the attcompression magic and
>> then add the catalog later - I think we should start with the catalog
>> right away. The advantage is that if we end up committing only some of
>> the patches in this cycle, we already have all the infrastructure etc.
>> We can reorder that later, though.
>
>Hmm,  yeah we can do this way as well that first create a new catalog
>table and add entries for these two built-in methods and the
>attcompression can store the oid vector.  But if we only commit the
>build-in compression methods part then does it make sense to create an
>extra catalog or adding these build-in methods to the existing catalog
>(if we plan to use pg_am).  Then in attcompression instead of using
>one byte for each preserve compression method, we need to use oid.  So
>from Robert's mail[1], it appeared to me that he wants that the
>build-in compression methods part should be independently committable
>and if we think from that perspective then adding a catalog doesn't
>make much sense.  But if we are planning to commit the custom method
>also then it makes more sense to directly start with the catalog
>because that way it will be easy to expand without much refactoring.
>
>[1] https://www.postgresql.org/message-id/CA%2BTgmobSDVgUage9qQ5P_%3DF_9jaMkCgyKxUQGtFQU7oN4kX-AA%40mail.gmail.com
>

Hmmm. Maybe I'm missing something subtle, but I think that plan can be
interpreted in various ways - it does not really say whether the initial
list of built-in methods should be in some C array, or already in a proper
catalog.

All I'm saying is it seems a bit weird to first implement dependencies
based on strange (mis)use of attcompression attribute, and then replace
it with a proper catalog. My understanding is those patches are expected
to be committable one by one, but the attcompression approach seems a
bit too hacky to me - not sure I'd want to commit that ...


>> >> 10) compression parameters?
>> >>
>> >> I wonder if we could/should allow parameters, like compression level
>> >> (and maybe other stuff, depending on the compression method). PG13
>> >> allowed that for opclasses, so perhaps we should allow it here too.
>> >
>> >Yes, that is also in the plan.   For doing this we are planning to add
>> >an extra column in the pg_attribute which will store the compression
>> >options for the current compression method. The original patch was
>> >creating an extra catalog pg_column_compression, therein it maintains
>> >the oid of the compression method as well as the compression options.
>> >The advantage of creating an extra catalog is that we can keep the
>> >compression options for the preserved compression methods also so that
>> >we can support the options which can be used for decompressing the
>> >data as well.  Whereas if we want to avoid this extra catalog then we
>> >can not use that compression option for decompressing.  But most of
>> >the options e.g. compression level are just for the compressing so it
>> >is enough to store for the current compression method only.  What's
>> >your thoughts?
>> >
>>
>> Not sure. My assumption was we'd end up with a new catalog, but maybe
>> stashing it into pg_attribute is fine. I was really thinking about two
>> kinds of options - compression level, and some sort of column-level
>> dictionary. Compression level is not necessary for decompression, but
>> the dictionary ID would be needed. (I think the global dictionary was
>> one of the use cases, aimed at JSON compression.)
>
>Ok
>
>> But I don't think stashing it in pg_attribute means we couldn't use it
>> for decompression - we'd just need to keep an array of options, one for
>> each compression method.
>
>Yeah, we can do that.
>
>Keeping it in a separate new catalog might be
>> cleaner, and I'm not sure how large the configuration might be.
>
>Yeah in that case it will be better to store in a separate catalog,
>because sometimes if multiple attributes are using the same
>compression method with the same options then we can store the same
>oid in attcompression instead of duplicating the option field.
>

I doubt deduplicating the options like this is (sharing options between
columns) is really worth it, as it means extra complexity e.g. during
ALTER TABLE ... SET COMPRESSION. I don't think we do that for other
catalogs, so why should we do it here?

Ultimately I think it's a question of how large we expect the options to
be, and how flexible it needs to be.

For example, what happens if the user does this:

   ALTER ... SET COMPRESSION my_compression WITH (options1) PRESERVE;
   ALTER ... SET COMPRESSION pglz PRESERVE;
   ALTER ... SET COMPRESSION my_compression WITH (options2) PRESERVE;

I believe it's enough to keep just the last value, but maybe I'm wrong
and we need to keep the whole history?

The use case I'm thinking about is the column-level JSON compression,
where one of the options identifies the dictionary. OTOH I'm not sure
this is the right way to track this info - we need to know which options
were compressed with which options, i.e. it needs to be encoded in each
value directly. It'd also require changes to the PRESERVE handling
because it'd be necessary to identify which options to preserve ...

So maybe this is either nonsense or something we don't want to support,
and we should only allow one option for each compression method.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: Custom compression methods

Dilip Kumar-2
On Mon, Oct 5, 2020 at 9:34 PM Tomas Vondra
<[hidden email]> wrote:

>
> On Mon, Oct 05, 2020 at 07:57:41PM +0530, Dilip Kumar wrote:
> >On Mon, Oct 5, 2020 at 5:53 PM Tomas Vondra
> ><[hidden email]> wrote:
> >>
> >> On Mon, Oct 05, 2020 at 11:17:28AM +0530, Dilip Kumar wrote:
> >> >On Mon, Oct 5, 2020 at 3:37 AM Tomas Vondra
> >> ><[hidden email]> wrote:
> >> >
> >> >Thanks, Tomas for your feedback.
> >> >
> >> >> 9) attcompression ...
> >> >>
> >> >> The main issue I see is what the patch does with attcompression. Instead
> >> >> of just using it to store a the compression method, it's also used to
> >> >> store the preserved compression methods. And using NameData to store
> >> >> this seems wrong too - if we really want to store this info, the correct
> >> >> way is either using text[] or inventing charvector or similar.
> >> >
> >> >The reason for using the NameData is the get it in the fixed part of
> >> >the data structure.
> >> >
> >>
> >> Why do we need that? It's possible to have varlena fields with direct
> >> access (see pg_index.indkey for example).
> >
> >I see.  While making it NameData I was thinking whether we have an
> >option to direct access the varlena. Thanks for pointing me there. I
> >will change this.
> >
> > Adding NameData just to make
> >> it fixed-length means we're always adding 64B even if we just need a
> >> single byte, which means ~30% overhead for the FormData_pg_attribute.
> >> That seems a bit unnecessary, and might be an issue with many attributes
> >> (e.g. with many temp tables, etc.).
> >
> >You are right.  Even I did not like to keep 64B for this, so I will change it.
> >
> >>
> >> >> But to me this seems very much like a misuse of attcompression to track
> >> >> dependencies on compression methods, necessary because we don't have a
> >> >> separate catalog listing compression methods. If we had that, I think we
> >> >> could simply add dependencies between attributes and that catalog.
> >> >
> >> >Basically, up to this patch, we are having only built-in compression
> >> >methods and those can not be dropped so we don't need any dependency
> >> >at all.  We just want to know what is the current compression method
> >> >and what is the preserve compression methods supported for this
> >> >attribute.  Maybe we can do it better instead of using the NameData
> >> >but I don't think it makes sense to add a separate catalog?
> >> >
> >>
> >> Sure, I understand what the goal was - all I'm saying is that it looks
> >> very much like a workaround needed because we don't have the catalog.
> >>
> >> I don't quite understand how could we support custom compression methods
> >> without listing them in some sort of catalog?
> >
> >Yeah for supporting custom compression we need some catalog.
> >
> >> >> Moreover, having the catalog would allow adding compression methods
> >> >> (from extensions etc) instead of just having a list of hard-coded
> >> >> compression methods. Which seems like a strange limitation, considering
> >> >> this thread is called "custom compression methods".
> >> >
> >> >I think I forgot to mention while submitting the previous patch that
> >> >the next patch I am planning to submit is, Support creating the custom
> >> >compression methods wherein we can use pg_am catalog to insert the new
> >> >compression method.  And for dependency handling, we can create an
> >> >attribute dependency on the pg_am row.  Basically, we will create the
> >> >attribute dependency on the current compression method AM as well as
> >> >on the preserved compression methods AM.   As part of this, we will
> >> >add two build-in AMs for zlib and pglz, and the attcompression field
> >> >will be converted to the oid_vector (first OID will be of the current
> >> >compression method, followed by the preserved compression method's
> >> >oids).
> >> >
> >>
> >> Hmmm, ok. Not sure pg_am is the right place - compression methods don't
> >> quite match what I though AMs are about, but maybe it's just my fault.
> >>
> >> FWIW it seems a bit strange to first do the attcompression magic and
> >> then add the catalog later - I think we should start with the catalog
> >> right away. The advantage is that if we end up committing only some of
> >> the patches in this cycle, we already have all the infrastructure etc.
> >> We can reorder that later, though.
> >
> >Hmm,  yeah we can do this way as well that first create a new catalog
> >table and add entries for these two built-in methods and the
> >attcompression can store the oid vector.  But if we only commit the
> >build-in compression methods part then does it make sense to create an
> >extra catalog or adding these build-in methods to the existing catalog
> >(if we plan to use pg_am).  Then in attcompression instead of using
> >one byte for each preserve compression method, we need to use oid.  So
> >from Robert's mail[1], it appeared to me that he wants that the
> >build-in compression methods part should be independently committable
> >and if we think from that perspective then adding a catalog doesn't
> >make much sense.  But if we are planning to commit the custom method
> >also then it makes more sense to directly start with the catalog
> >because that way it will be easy to expand without much refactoring.
> >
> >[1] https://www.postgresql.org/message-id/CA%2BTgmobSDVgUage9qQ5P_%3DF_9jaMkCgyKxUQGtFQU7oN4kX-AA%40mail.gmail.com
> >
>
> Hmmm. Maybe I'm missing something subtle, but I think that plan can be
> interpreted in various ways - it does not really say whether the initial
> list of built-in methods should be in some C array, or already in a proper
> catalog.
>
> All I'm saying is it seems a bit weird to first implement dependencies
> based on strange (mis)use of attcompression attribute, and then replace
> it with a proper catalog. My understanding is those patches are expected
> to be committable one by one, but the attcompression approach seems a
> bit too hacky to me - not sure I'd want to commit that ...

Okay, I will change this.  So I will make create a new catalog
pg_compression and add the entry for two built-in compression methods
from the very first patch.

> >> >> 10) compression parameters?
> >> >>
> >> >> I wonder if we could/should allow parameters, like compression level
> >> >> (and maybe other stuff, depending on the compression method). PG13
> >> >> allowed that for opclasses, so perhaps we should allow it here too.
> >> >
> >> >Yes, that is also in the plan.   For doing this we are planning to add
> >> >an extra column in the pg_attribute which will store the compression
> >> >options for the current compression method. The original patch was
> >> >creating an extra catalog pg_column_compression, therein it maintains
> >> >the oid of the compression method as well as the compression options.
> >> >The advantage of creating an extra catalog is that we can keep the
> >> >compression options for the preserved compression methods also so that
> >> >we can support the options which can be used for decompressing the
> >> >data as well.  Whereas if we want to avoid this extra catalog then we
> >> >can not use that compression option for decompressing.  But most of
> >> >the options e.g. compression level are just for the compressing so it
> >> >is enough to store for the current compression method only.  What's
> >> >your thoughts?
> >> >
> >>
> >> Not sure. My assumption was we'd end up with a new catalog, but maybe
> >> stashing it into pg_attribute is fine. I was really thinking about two
> >> kinds of options - compression level, and some sort of column-level
> >> dictionary. Compression level is not necessary for decompression, but
> >> the dictionary ID would be needed. (I think the global dictionary was
> >> one of the use cases, aimed at JSON compression.)
> >
> >Ok
> >
> >> But I don't think stashing it in pg_attribute means we couldn't use it
> >> for decompression - we'd just need to keep an array of options, one for
> >> each compression method.
> >
> >Yeah, we can do that.
> >
> >Keeping it in a separate new catalog might be
> >> cleaner, and I'm not sure how large the configuration might be.
> >
> >Yeah in that case it will be better to store in a separate catalog,
> >because sometimes if multiple attributes are using the same
> >compression method with the same options then we can store the same
> >oid in attcompression instead of duplicating the option field.
> >
>
> I doubt deduplicating the options like this is (sharing options between
> columns) is really worth it, as it means extra complexity e.g. during
> ALTER TABLE ... SET COMPRESSION. I don't think we do that for other
> catalogs, so why should we do it here?

Yeah, valid point.

>
> Ultimately I think it's a question of how large we expect the options to
> be, and how flexible it needs to be.
>
> For example, what happens if the user does this:
>
>    ALTER ... SET COMPRESSION my_compression WITH (options1) PRESERVE;
>    ALTER ... SET COMPRESSION pglz PRESERVE;
>    ALTER ... SET COMPRESSION my_compression WITH (options2) PRESERVE;
>
> I believe it's enough to keep just the last value, but maybe I'm wrong
> and we need to keep the whole history?

Currently, the syntax is like ALTER ... SET COMPRESSION my_compression
WITH (options1) PRESERVE (old_compression1, old_compression2..).  But
I think if the user just gives
PRESERVE without a list then we should just preserve the latest one.

> The use case I'm thinking about is the column-level JSON compression,
> where one of the options identifies the dictionary. OTOH I'm not sure
> this is the right way to track this info - we need to know which options
> were compressed with which options, i.e. it needs to be encoded in each
> value directly. It'd also require changes to the PRESERVE handling
> because it'd be necessary to identify which options to preserve ...
>
> So maybe this is either nonsense or something we don't want to support,
> and we should only allow one option for each compression method.

Yeah, it is a bit confusing to add the same compression method with
different compression options,  then in the preserve list, we will
have to allow the option as well along with the compression method to
know which compression method with what options we want to preserve.

And also as you mentioned that in rows we need to know the option as
well.  I think for solving this anyways for the custom compression
methods we will have to store the OID of the compression method in the
toast header so we can provide an intermediate catalog which will
create a new row for each combination of compression method + option
and the toast header can store the OID of that row so that we know
with which compression method + option it was compressed with.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: Custom compression methods

Tomas Vondra-4
On Tue, Oct 06, 2020 at 11:00:55AM +0530, Dilip Kumar wrote:

>On Mon, Oct 5, 2020 at 9:34 PM Tomas Vondra
><[hidden email]> wrote:
>>
>> On Mon, Oct 05, 2020 at 07:57:41PM +0530, Dilip Kumar wrote:
>> >On Mon, Oct 5, 2020 at 5:53 PM Tomas Vondra
>> ><[hidden email]> wrote:
>> >>
>> >> On Mon, Oct 05, 2020 at 11:17:28AM +0530, Dilip Kumar wrote:
>> >> >On Mon, Oct 5, 2020 at 3:37 AM Tomas Vondra
>> >> ><[hidden email]> wrote:
>> >> >
>> >> >Thanks, Tomas for your feedback.
>> >> >
>> >> >> 9) attcompression ...
>> >> >>
>> >> >> The main issue I see is what the patch does with attcompression. Instead
>> >> >> of just using it to store a the compression method, it's also used to
>> >> >> store the preserved compression methods. And using NameData to store
>> >> >> this seems wrong too - if we really want to store this info, the correct
>> >> >> way is either using text[] or inventing charvector or similar.
>> >> >
>> >> >The reason for using the NameData is the get it in the fixed part of
>> >> >the data structure.
>> >> >
>> >>
>> >> Why do we need that? It's possible to have varlena fields with direct
>> >> access (see pg_index.indkey for example).
>> >
>> >I see.  While making it NameData I was thinking whether we have an
>> >option to direct access the varlena. Thanks for pointing me there. I
>> >will change this.
>> >
>> > Adding NameData just to make
>> >> it fixed-length means we're always adding 64B even if we just need a
>> >> single byte, which means ~30% overhead for the FormData_pg_attribute.
>> >> That seems a bit unnecessary, and might be an issue with many attributes
>> >> (e.g. with many temp tables, etc.).
>> >
>> >You are right.  Even I did not like to keep 64B for this, so I will change it.
>> >
>> >>
>> >> >> But to me this seems very much like a misuse of attcompression to track
>> >> >> dependencies on compression methods, necessary because we don't have a
>> >> >> separate catalog listing compression methods. If we had that, I think we
>> >> >> could simply add dependencies between attributes and that catalog.
>> >> >
>> >> >Basically, up to this patch, we are having only built-in compression
>> >> >methods and those can not be dropped so we don't need any dependency
>> >> >at all.  We just want to know what is the current compression method
>> >> >and what is the preserve compression methods supported for this
>> >> >attribute.  Maybe we can do it better instead of using the NameData
>> >> >but I don't think it makes sense to add a separate catalog?
>> >> >
>> >>
>> >> Sure, I understand what the goal was - all I'm saying is that it looks
>> >> very much like a workaround needed because we don't have the catalog.
>> >>
>> >> I don't quite understand how could we support custom compression methods
>> >> without listing them in some sort of catalog?
>> >
>> >Yeah for supporting custom compression we need some catalog.
>> >
>> >> >> Moreover, having the catalog would allow adding compression methods
>> >> >> (from extensions etc) instead of just having a list of hard-coded
>> >> >> compression methods. Which seems like a strange limitation, considering
>> >> >> this thread is called "custom compression methods".
>> >> >
>> >> >I think I forgot to mention while submitting the previous patch that
>> >> >the next patch I am planning to submit is, Support creating the custom
>> >> >compression methods wherein we can use pg_am catalog to insert the new
>> >> >compression method.  And for dependency handling, we can create an
>> >> >attribute dependency on the pg_am row.  Basically, we will create the
>> >> >attribute dependency on the current compression method AM as well as
>> >> >on the preserved compression methods AM.   As part of this, we will
>> >> >add two build-in AMs for zlib and pglz, and the attcompression field
>> >> >will be converted to the oid_vector (first OID will be of the current
>> >> >compression method, followed by the preserved compression method's
>> >> >oids).
>> >> >
>> >>
>> >> Hmmm, ok. Not sure pg_am is the right place - compression methods don't
>> >> quite match what I though AMs are about, but maybe it's just my fault.
>> >>
>> >> FWIW it seems a bit strange to first do the attcompression magic and
>> >> then add the catalog later - I think we should start with the catalog
>> >> right away. The advantage is that if we end up committing only some of
>> >> the patches in this cycle, we already have all the infrastructure etc.
>> >> We can reorder that later, though.
>> >
>> >Hmm,  yeah we can do this way as well that first create a new catalog
>> >table and add entries for these two built-in methods and the
>> >attcompression can store the oid vector.  But if we only commit the
>> >build-in compression methods part then does it make sense to create an
>> >extra catalog or adding these build-in methods to the existing catalog
>> >(if we plan to use pg_am).  Then in attcompression instead of using
>> >one byte for each preserve compression method, we need to use oid.  So
>> >from Robert's mail[1], it appeared to me that he wants that the
>> >build-in compression methods part should be independently committable
>> >and if we think from that perspective then adding a catalog doesn't
>> >make much sense.  But if we are planning to commit the custom method
>> >also then it makes more sense to directly start with the catalog
>> >because that way it will be easy to expand without much refactoring.
>> >
>> >[1] https://www.postgresql.org/message-id/CA%2BTgmobSDVgUage9qQ5P_%3DF_9jaMkCgyKxUQGtFQU7oN4kX-AA%40mail.gmail.com
>> >
>>
>> Hmmm. Maybe I'm missing something subtle, but I think that plan can be
>> interpreted in various ways - it does not really say whether the initial
>> list of built-in methods should be in some C array, or already in a proper
>> catalog.
>>
>> All I'm saying is it seems a bit weird to first implement dependencies
>> based on strange (mis)use of attcompression attribute, and then replace
>> it with a proper catalog. My understanding is those patches are expected
>> to be committable one by one, but the attcompression approach seems a
>> bit too hacky to me - not sure I'd want to commit that ...
>
>Okay, I will change this.  So I will make create a new catalog
>pg_compression and add the entry for two built-in compression methods
>from the very first patch.
>

OK.

>> >> >> 10) compression parameters?
>> >> >>
>> >> >> I wonder if we could/should allow parameters, like compression level
>> >> >> (and maybe other stuff, depending on the compression method). PG13
>> >> >> allowed that for opclasses, so perhaps we should allow it here too.
>> >> >
>> >> >Yes, that is also in the plan.   For doing this we are planning to add
>> >> >an extra column in the pg_attribute which will store the compression
>> >> >options for the current compression method. The original patch was
>> >> >creating an extra catalog pg_column_compression, therein it maintains
>> >> >the oid of the compression method as well as the compression options.
>> >> >The advantage of creating an extra catalog is that we can keep the
>> >> >compression options for the preserved compression methods also so that
>> >> >we can support the options which can be used for decompressing the
>> >> >data as well.  Whereas if we want to avoid this extra catalog then we
>> >> >can not use that compression option for decompressing.  But most of
>> >> >the options e.g. compression level are just for the compressing so it
>> >> >is enough to store for the current compression method only.  What's
>> >> >your thoughts?
>> >> >
>> >>
>> >> Not sure. My assumption was we'd end up with a new catalog, but maybe
>> >> stashing it into pg_attribute is fine. I was really thinking about two
>> >> kinds of options - compression level, and some sort of column-level
>> >> dictionary. Compression level is not necessary for decompression, but
>> >> the dictionary ID would be needed. (I think the global dictionary was
>> >> one of the use cases, aimed at JSON compression.)
>> >
>> >Ok
>> >
>> >> But I don't think stashing it in pg_attribute means we couldn't use it
>> >> for decompression - we'd just need to keep an array of options, one for
>> >> each compression method.
>> >
>> >Yeah, we can do that.
>> >
>> >Keeping it in a separate new catalog might be
>> >> cleaner, and I'm not sure how large the configuration might be.
>> >
>> >Yeah in that case it will be better to store in a separate catalog,
>> >because sometimes if multiple attributes are using the same
>> >compression method with the same options then we can store the same
>> >oid in attcompression instead of duplicating the option field.
>> >
>>
>> I doubt deduplicating the options like this is (sharing options between
>> columns) is really worth it, as it means extra complexity e.g. during
>> ALTER TABLE ... SET COMPRESSION. I don't think we do that for other
>> catalogs, so why should we do it here?
>
>Yeah, valid point.
>
>>
>> Ultimately I think it's a question of how large we expect the options to
>> be, and how flexible it needs to be.
>>
>> For example, what happens if the user does this:
>>
>>    ALTER ... SET COMPRESSION my_compression WITH (options1) PRESERVE;
>>    ALTER ... SET COMPRESSION pglz PRESERVE;
>>    ALTER ... SET COMPRESSION my_compression WITH (options2) PRESERVE;
>>
>> I believe it's enough to keep just the last value, but maybe I'm wrong
>> and we need to keep the whole history?
>
>Currently, the syntax is like ALTER ... SET COMPRESSION my_compression
>WITH (options1) PRESERVE (old_compression1, old_compression2..).  But I
>think if the user just gives PRESERVE without a list then we should
>just preserve the latest one.
>

Hmmm. Not sure that's very convenient. I'd expect the most common use
case for PRESERVE being "I want to change compression for new data,
without rewrite". If PRESERVE by default preserves the latest one, that
pretty much forces users to always list all methods. I suggest
iterpreting it as "preserve everything" instead.

Another option would be to require either a list of methods, or some
keyword defining what to preserve. Like for example

     ... PRESERVE (m1, m2, ...)
     ... PRESERVE ALL
     ... PRESERVE LAST

Does that make sense?

>> The use case I'm thinking about is the column-level JSON compression,
>> where one of the options identifies the dictionary. OTOH I'm not sure
>> this is the right way to track this info - we need to know which options
>> were compressed with which options, i.e. it needs to be encoded in each
>> value directly. It'd also require changes to the PRESERVE handling
>> because it'd be necessary to identify which options to preserve ...
>>
>> So maybe this is either nonsense or something we don't want to support,
>> and we should only allow one option for each compression method.
>
>Yeah, it is a bit confusing to add the same compression method with
>different compression options,  then in the preserve list, we will
>have to allow the option as well along with the compression method to
>know which compression method with what options we want to preserve.
>
>And also as you mentioned that in rows we need to know the option as
>well.  I think for solving this anyways for the custom compression
>methods we will have to store the OID of the compression method in the
>toast header so we can provide an intermediate catalog which will
>create a new row for each combination of compression method + option
>and the toast header can store the OID of that row so that we know
>with which compression method + option it was compressed with.
>

I agree. After thinking about this a bit more, I think we should just
keep the last options for each compression method. If we need to allow
multiple options for some future compression method, we can improve
this, but until then it'd be an over-engineering. Let's do the simplest
possible thing here.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: Custom compression methods

Dilip Kumar-2
On Tue, Oct 6, 2020 at 10:21 PM Tomas Vondra
<[hidden email]> wrote:

>
> On Tue, Oct 06, 2020 at 11:00:55AM +0530, Dilip Kumar wrote:
> >On Mon, Oct 5, 2020 at 9:34 PM Tomas Vondra
> ><[hidden email]> wrote:
> >>
> >> On Mon, Oct 05, 2020 at 07:57:41PM +0530, Dilip Kumar wrote:
> >> >On Mon, Oct 5, 2020 at 5:53 PM Tomas Vondra
> >> ><[hidden email]> wrote:
> >> >>
> >> >> On Mon, Oct 05, 2020 at 11:17:28AM +0530, Dilip Kumar wrote:
> >> >> >On Mon, Oct 5, 2020 at 3:37 AM Tomas Vondra
> >> >> ><[hidden email]> wrote:
> >> >> >
> >> >> >Thanks, Tomas for your feedback.
> >> >> >
> >> >> >> 9) attcompression ...
> >> >> >>
> >> >> >> The main issue I see is what the patch does with attcompression. Instead
> >> >> >> of just using it to store a the compression method, it's also used to
> >> >> >> store the preserved compression methods. And using NameData to store
> >> >> >> this seems wrong too - if we really want to store this info, the correct
> >> >> >> way is either using text[] or inventing charvector or similar.
> >> >> >
> >> >> >The reason for using the NameData is the get it in the fixed part of
> >> >> >the data structure.
> >> >> >
> >> >>
> >> >> Why do we need that? It's possible to have varlena fields with direct
> >> >> access (see pg_index.indkey for example).
> >> >
> >> >I see.  While making it NameData I was thinking whether we have an
> >> >option to direct access the varlena. Thanks for pointing me there. I
> >> >will change this.
> >> >
> >> > Adding NameData just to make
> >> >> it fixed-length means we're always adding 64B even if we just need a
> >> >> single byte, which means ~30% overhead for the FormData_pg_attribute.
> >> >> That seems a bit unnecessary, and might be an issue with many attributes
> >> >> (e.g. with many temp tables, etc.).
> >> >
> >> >You are right.  Even I did not like to keep 64B for this, so I will change it.
> >> >
> >> >>
> >> >> >> But to me this seems very much like a misuse of attcompression to track
> >> >> >> dependencies on compression methods, necessary because we don't have a
> >> >> >> separate catalog listing compression methods. If we had that, I think we
> >> >> >> could simply add dependencies between attributes and that catalog.
> >> >> >
> >> >> >Basically, up to this patch, we are having only built-in compression
> >> >> >methods and those can not be dropped so we don't need any dependency
> >> >> >at all.  We just want to know what is the current compression method
> >> >> >and what is the preserve compression methods supported for this
> >> >> >attribute.  Maybe we can do it better instead of using the NameData
> >> >> >but I don't think it makes sense to add a separate catalog?
> >> >> >
> >> >>
> >> >> Sure, I understand what the goal was - all I'm saying is that it looks
> >> >> very much like a workaround needed because we don't have the catalog.
> >> >>
> >> >> I don't quite understand how could we support custom compression methods
> >> >> without listing them in some sort of catalog?
> >> >
> >> >Yeah for supporting custom compression we need some catalog.
> >> >
> >> >> >> Moreover, having the catalog would allow adding compression methods
> >> >> >> (from extensions etc) instead of just having a list of hard-coded
> >> >> >> compression methods. Which seems like a strange limitation, considering
> >> >> >> this thread is called "custom compression methods".
> >> >> >
> >> >> >I think I forgot to mention while submitting the previous patch that
> >> >> >the next patch I am planning to submit is, Support creating the custom
> >> >> >compression methods wherein we can use pg_am catalog to insert the new
> >> >> >compression method.  And for dependency handling, we can create an
> >> >> >attribute dependency on the pg_am row.  Basically, we will create the
> >> >> >attribute dependency on the current compression method AM as well as
> >> >> >on the preserved compression methods AM.   As part of this, we will
> >> >> >add two build-in AMs for zlib and pglz, and the attcompression field
> >> >> >will be converted to the oid_vector (first OID will be of the current
> >> >> >compression method, followed by the preserved compression method's
> >> >> >oids).
> >> >> >
> >> >>
> >> >> Hmmm, ok. Not sure pg_am is the right place - compression methods don't
> >> >> quite match what I though AMs are about, but maybe it's just my fault.
> >> >>
> >> >> FWIW it seems a bit strange to first do the attcompression magic and
> >> >> then add the catalog later - I think we should start with the catalog
> >> >> right away. The advantage is that if we end up committing only some of
> >> >> the patches in this cycle, we already have all the infrastructure etc.
> >> >> We can reorder that later, though.
> >> >
> >> >Hmm,  yeah we can do this way as well that first create a new catalog
> >> >table and add entries for these two built-in methods and the
> >> >attcompression can store the oid vector.  But if we only commit the
> >> >build-in compression methods part then does it make sense to create an
> >> >extra catalog or adding these build-in methods to the existing catalog
> >> >(if we plan to use pg_am).  Then in attcompression instead of using
> >> >one byte for each preserve compression method, we need to use oid.  So
> >> >from Robert's mail[1], it appeared to me that he wants that the
> >> >build-in compression methods part should be independently committable
> >> >and if we think from that perspective then adding a catalog doesn't
> >> >make much sense.  But if we are planning to commit the custom method
> >> >also then it makes more sense to directly start with the catalog
> >> >because that way it will be easy to expand without much refactoring.
> >> >
> >> >[1] https://www.postgresql.org/message-id/CA%2BTgmobSDVgUage9qQ5P_%3DF_9jaMkCgyKxUQGtFQU7oN4kX-AA%40mail.gmail.com
> >> >
> >>
> >> Hmmm. Maybe I'm missing something subtle, but I think that plan can be
> >> interpreted in various ways - it does not really say whether the initial
> >> list of built-in methods should be in some C array, or already in a proper
> >> catalog.
> >>
> >> All I'm saying is it seems a bit weird to first implement dependencies
> >> based on strange (mis)use of attcompression attribute, and then replace
> >> it with a proper catalog. My understanding is those patches are expected
> >> to be committable one by one, but the attcompression approach seems a
> >> bit too hacky to me - not sure I'd want to commit that ...
> >
> >Okay, I will change this.  So I will make create a new catalog
> >pg_compression and add the entry for two built-in compression methods
> >from the very first patch.
> >
>
> OK.
>
> >> >> >> 10) compression parameters?
> >> >> >>
> >> >> >> I wonder if we could/should allow parameters, like compression level
> >> >> >> (and maybe other stuff, depending on the compression method). PG13
> >> >> >> allowed that for opclasses, so perhaps we should allow it here too.
> >> >> >
> >> >> >Yes, that is also in the plan.   For doing this we are planning to add
> >> >> >an extra column in the pg_attribute which will store the compression
> >> >> >options for the current compression method. The original patch was
> >> >> >creating an extra catalog pg_column_compression, therein it maintains
> >> >> >the oid of the compression method as well as the compression options.
> >> >> >The advantage of creating an extra catalog is that we can keep the
> >> >> >compression options for the preserved compression methods also so that
> >> >> >we can support the options which can be used for decompressing the
> >> >> >data as well.  Whereas if we want to avoid this extra catalog then we
> >> >> >can not use that compression option for decompressing.  But most of
> >> >> >the options e.g. compression level are just for the compressing so it
> >> >> >is enough to store for the current compression method only.  What's
> >> >> >your thoughts?
> >> >> >
> >> >>
> >> >> Not sure. My assumption was we'd end up with a new catalog, but maybe
> >> >> stashing it into pg_attribute is fine. I was really thinking about two
> >> >> kinds of options - compression level, and some sort of column-level
> >> >> dictionary. Compression level is not necessary for decompression, but
> >> >> the dictionary ID would be needed. (I think the global dictionary was
> >> >> one of the use cases, aimed at JSON compression.)
> >> >
> >> >Ok
> >> >
> >> >> But I don't think stashing it in pg_attribute means we couldn't use it
> >> >> for decompression - we'd just need to keep an array of options, one for
> >> >> each compression method.
> >> >
> >> >Yeah, we can do that.
> >> >
> >> >Keeping it in a separate new catalog might be
> >> >> cleaner, and I'm not sure how large the configuration might be.
> >> >
> >> >Yeah in that case it will be better to store in a separate catalog,
> >> >because sometimes if multiple attributes are using the same
> >> >compression method with the same options then we can store the same
> >> >oid in attcompression instead of duplicating the option field.
> >> >
> >>
> >> I doubt deduplicating the options like this is (sharing options between
> >> columns) is really worth it, as it means extra complexity e.g. during
> >> ALTER TABLE ... SET COMPRESSION. I don't think we do that for other
> >> catalogs, so why should we do it here?
> >
> >Yeah, valid point.
> >
> >>
> >> Ultimately I think it's a question of how large we expect the options to
> >> be, and how flexible it needs to be.
> >>
> >> For example, what happens if the user does this:
> >>
> >>    ALTER ... SET COMPRESSION my_compression WITH (options1) PRESERVE;
> >>    ALTER ... SET COMPRESSION pglz PRESERVE;
> >>    ALTER ... SET COMPRESSION my_compression WITH (options2) PRESERVE;
> >>
> >> I believe it's enough to keep just the last value, but maybe I'm wrong
> >> and we need to keep the whole history?
> >
> >Currently, the syntax is like ALTER ... SET COMPRESSION my_compression
> >WITH (options1) PRESERVE (old_compression1, old_compression2..).  But I
> >think if the user just gives PRESERVE without a list then we should
> >just preserve the latest one.
> >
>
> Hmmm. Not sure that's very convenient. I'd expect the most common use
> case for PRESERVE being "I want to change compression for new data,
> without rewrite". If PRESERVE by default preserves the latest one, that
> pretty much forces users to always list all methods. I suggest
> iterpreting it as "preserve everything" instead.
>
> Another option would be to require either a list of methods, or some
> keyword defining what to preserve. Like for example
>
>      ... PRESERVE (m1, m2, ...)
>      ... PRESERVE ALL
>      ... PRESERVE LAST
>
> Does that make sense?

Yeah, this makes sense to me.

>
> >> The use case I'm thinking about is the column-level JSON compression,
> >> where one of the options identifies the dictionary. OTOH I'm not sure
> >> this is the right way to track this info - we need to know which options
> >> were compressed with which options, i.e. it needs to be encoded in each
> >> value directly. It'd also require changes to the PRESERVE handling
> >> because it'd be necessary to identify which options to preserve ...
> >>
> >> So maybe this is either nonsense or something we don't want to support,
> >> and we should only allow one option for each compression method.
> >
> >Yeah, it is a bit confusing to add the same compression method with
> >different compression options,  then in the preserve list, we will
> >have to allow the option as well along with the compression method to
> >know which compression method with what options we want to preserve.
> >
> >And also as you mentioned that in rows we need to know the option as
> >well.  I think for solving this anyways for the custom compression
> >methods we will have to store the OID of the compression method in the
> >toast header so we can provide an intermediate catalog which will
> >create a new row for each combination of compression method + option
> >and the toast header can store the OID of that row so that we know
> >with which compression method + option it was compressed with.
> >
>
> I agree. After thinking about this a bit more, I think we should just
> keep the last options for each compression method. If we need to allow
> multiple options for some future compression method, we can improve
> this, but until then it'd be an over-engineering. Let's do the simplest
> possible thing here.

Okay.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: Custom compression methods

Dilip Kumar-2
On Wed, Oct 7, 2020 at 10:26 AM Dilip Kumar <[hidden email]> wrote:

>
> On Tue, Oct 6, 2020 at 10:21 PM Tomas Vondra
> <[hidden email]> wrote:
> >
> > On Tue, Oct 06, 2020 at 11:00:55AM +0530, Dilip Kumar wrote:
> > >On Mon, Oct 5, 2020 at 9:34 PM Tomas Vondra
> > ><[hidden email]> wrote:
> > >>
> > >> On Mon, Oct 05, 2020 at 07:57:41PM +0530, Dilip Kumar wrote:
> > >> >On Mon, Oct 5, 2020 at 5:53 PM Tomas Vondra
> > >> ><[hidden email]> wrote:
> > >> >>
> > >> >> On Mon, Oct 05, 2020 at 11:17:28AM +0530, Dilip Kumar wrote:
> > >> >> >On Mon, Oct 5, 2020 at 3:37 AM Tomas Vondra
> > >> >> ><[hidden email]> wrote:
> > >> >> >
> > >> >> >Thanks, Tomas for your feedback.
> > >> >> >
> > >> >> >> 9) attcompression ...
> > >> >> >>
> > >> >> >> The main issue I see is what the patch does with attcompression. Instead
> > >> >> >> of just using it to store a the compression method, it's also used to
> > >> >> >> store the preserved compression methods. And using NameData to store
> > >> >> >> this seems wrong too - if we really want to store this info, the correct
> > >> >> >> way is either using text[] or inventing charvector or similar.
> > >> >> >
> > >> >> >The reason for using the NameData is the get it in the fixed part of
> > >> >> >the data structure.
> > >> >> >
> > >> >>
> > >> >> Why do we need that? It's possible to have varlena fields with direct
> > >> >> access (see pg_index.indkey for example).
> > >> >
> > >> >I see.  While making it NameData I was thinking whether we have an
> > >> >option to direct access the varlena. Thanks for pointing me there. I
> > >> >will change this.
> > >> >
> > >> > Adding NameData just to make
> > >> >> it fixed-length means we're always adding 64B even if we just need a
> > >> >> single byte, which means ~30% overhead for the FormData_pg_attribute.
> > >> >> That seems a bit unnecessary, and might be an issue with many attributes
> > >> >> (e.g. with many temp tables, etc.).
> > >> >
> > >> >You are right.  Even I did not like to keep 64B for this, so I will change it.
> > >> >
> > >> >>
> > >> >> >> But to me this seems very much like a misuse of attcompression to track
> > >> >> >> dependencies on compression methods, necessary because we don't have a
> > >> >> >> separate catalog listing compression methods. If we had that, I think we
> > >> >> >> could simply add dependencies between attributes and that catalog.
> > >> >> >
> > >> >> >Basically, up to this patch, we are having only built-in compression
> > >> >> >methods and those can not be dropped so we don't need any dependency
> > >> >> >at all.  We just want to know what is the current compression method
> > >> >> >and what is the preserve compression methods supported for this
> > >> >> >attribute.  Maybe we can do it better instead of using the NameData
> > >> >> >but I don't think it makes sense to add a separate catalog?
> > >> >> >
> > >> >>
> > >> >> Sure, I understand what the goal was - all I'm saying is that it looks
> > >> >> very much like a workaround needed because we don't have the catalog.
> > >> >>
> > >> >> I don't quite understand how could we support custom compression methods
> > >> >> without listing them in some sort of catalog?
> > >> >
> > >> >Yeah for supporting custom compression we need some catalog.
> > >> >
> > >> >> >> Moreover, having the catalog would allow adding compression methods
> > >> >> >> (from extensions etc) instead of just having a list of hard-coded
> > >> >> >> compression methods. Which seems like a strange limitation, considering
> > >> >> >> this thread is called "custom compression methods".
> > >> >> >
> > >> >> >I think I forgot to mention while submitting the previous patch that
> > >> >> >the next patch I am planning to submit is, Support creating the custom
> > >> >> >compression methods wherein we can use pg_am catalog to insert the new
> > >> >> >compression method.  And for dependency handling, we can create an
> > >> >> >attribute dependency on the pg_am row.  Basically, we will create the
> > >> >> >attribute dependency on the current compression method AM as well as
> > >> >> >on the preserved compression methods AM.   As part of this, we will
> > >> >> >add two build-in AMs for zlib and pglz, and the attcompression field
> > >> >> >will be converted to the oid_vector (first OID will be of the current
> > >> >> >compression method, followed by the preserved compression method's
> > >> >> >oids).
> > >> >> >
> > >> >>
> > >> >> Hmmm, ok. Not sure pg_am is the right place - compression methods don't
> > >> >> quite match what I though AMs are about, but maybe it's just my fault.
> > >> >>
> > >> >> FWIW it seems a bit strange to first do the attcompression magic and
> > >> >> then add the catalog later - I think we should start with the catalog
> > >> >> right away. The advantage is that if we end up committing only some of
> > >> >> the patches in this cycle, we already have all the infrastructure etc.
> > >> >> We can reorder that later, though.
> > >> >
> > >> >Hmm,  yeah we can do this way as well that first create a new catalog
> > >> >table and add entries for these two built-in methods and the
> > >> >attcompression can store the oid vector.  But if we only commit the
> > >> >build-in compression methods part then does it make sense to create an
> > >> >extra catalog or adding these build-in methods to the existing catalog
> > >> >(if we plan to use pg_am).  Then in attcompression instead of using
> > >> >one byte for each preserve compression method, we need to use oid.  So
> > >> >from Robert's mail[1], it appeared to me that he wants that the
> > >> >build-in compression methods part should be independently committable
> > >> >and if we think from that perspective then adding a catalog doesn't
> > >> >make much sense.  But if we are planning to commit the custom method
> > >> >also then it makes more sense to directly start with the catalog
> > >> >because that way it will be easy to expand without much refactoring.
> > >> >
> > >> >[1] https://www.postgresql.org/message-id/CA%2BTgmobSDVgUage9qQ5P_%3DF_9jaMkCgyKxUQGtFQU7oN4kX-AA%40mail.gmail.com
> > >> >
> > >>
> > >> Hmmm. Maybe I'm missing something subtle, but I think that plan can be
> > >> interpreted in various ways - it does not really say whether the initial
> > >> list of built-in methods should be in some C array, or already in a proper
> > >> catalog.
> > >>
> > >> All I'm saying is it seems a bit weird to first implement dependencies
> > >> based on strange (mis)use of attcompression attribute, and then replace
> > >> it with a proper catalog. My understanding is those patches are expected
> > >> to be committable one by one, but the attcompression approach seems a
> > >> bit too hacky to me - not sure I'd want to commit that ...
> > >
> > >Okay, I will change this.  So I will make create a new catalog
> > >pg_compression and add the entry for two built-in compression methods
> > >from the very first patch.
> > >
> >
> > OK.
I have changed the first 2 patches, basically, now we are providing a
new catalog pg_compression and the pg_attribute is storing the oid of
the compression method.  The patches still need some cleanup and there
is also one open comment that for index we should use its table
compression.

I am still working on the preserve patch.  For preserving the
compression method I am planning to convert the attcompression field
to the oidvector so that we can store the oid of the preserve method
also.  I am not sure whether we can access this oidvector as a fixed
part of the FormData_pg_attribute or not.  The reason is that for
building the tuple descriptor, we need to give the size of the fixed
part (#define ATTRIBUTE_FIXED_PART_SIZE \
(offsetof(FormData_pg_attribute,attcompression) + sizeof(Oid))).  But
if we convert this to the oidvector then we don't know the size of the
fixed part.  Am I missing something?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

v4-0002-alter-table-set-compression.patch (16K) Download Attachment
v4-0001-Built-in-compression-method.patch (271K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Custom compression methods

Dilip Kumar-2
On Wed, Oct 7, 2020 at 5:00 PM Dilip Kumar <[hidden email]> wrote:

>
> On Wed, Oct 7, 2020 at 10:26 AM Dilip Kumar <[hidden email]> wrote:
> >
> > On Tue, Oct 6, 2020 at 10:21 PM Tomas Vondra
> > <[hidden email]> wrote:
> > >
> > > On Tue, Oct 06, 2020 at 11:00:55AM +0530, Dilip Kumar wrote:
> > > >On Mon, Oct 5, 2020 at 9:34 PM Tomas Vondra
> > > ><[hidden email]> wrote:
> > > >>
> > > >> On Mon, Oct 05, 2020 at 07:57:41PM +0530, Dilip Kumar wrote:
> > > >> >On Mon, Oct 5, 2020 at 5:53 PM Tomas Vondra
> > > >> ><[hidden email]> wrote:
> > > >> >>
> > > >> >> On Mon, Oct 05, 2020 at 11:17:28AM +0530, Dilip Kumar wrote:
> > > >> >> >On Mon, Oct 5, 2020 at 3:37 AM Tomas Vondra
> > > >> >> ><[hidden email]> wrote:
> > > >> >> >
> > > >> >> >Thanks, Tomas for your feedback.
> > > >> >> >
> > > >> >> >> 9) attcompression ...
> > > >> >> >>
> > > >> >> >> The main issue I see is what the patch does with attcompression. Instead
> > > >> >> >> of just using it to store a the compression method, it's also used to
> > > >> >> >> store the preserved compression methods. And using NameData to store
> > > >> >> >> this seems wrong too - if we really want to store this info, the correct
> > > >> >> >> way is either using text[] or inventing charvector or similar.
> > > >> >> >
> > > >> >> >The reason for using the NameData is the get it in the fixed part of
> > > >> >> >the data structure.
> > > >> >> >
> > > >> >>
> > > >> >> Why do we need that? It's possible to have varlena fields with direct
> > > >> >> access (see pg_index.indkey for example).
> > > >> >
> > > >> >I see.  While making it NameData I was thinking whether we have an
> > > >> >option to direct access the varlena. Thanks for pointing me there. I
> > > >> >will change this.
> > > >> >
> > > >> > Adding NameData just to make
> > > >> >> it fixed-length means we're always adding 64B even if we just need a
> > > >> >> single byte, which means ~30% overhead for the FormData_pg_attribute.
> > > >> >> That seems a bit unnecessary, and might be an issue with many attributes
> > > >> >> (e.g. with many temp tables, etc.).
> > > >> >
> > > >> >You are right.  Even I did not like to keep 64B for this, so I will change it.
> > > >> >
> > > >> >>
> > > >> >> >> But to me this seems very much like a misuse of attcompression to track
> > > >> >> >> dependencies on compression methods, necessary because we don't have a
> > > >> >> >> separate catalog listing compression methods. If we had that, I think we
> > > >> >> >> could simply add dependencies between attributes and that catalog.
> > > >> >> >
> > > >> >> >Basically, up to this patch, we are having only built-in compression
> > > >> >> >methods and those can not be dropped so we don't need any dependency
> > > >> >> >at all.  We just want to know what is the current compression method
> > > >> >> >and what is the preserve compression methods supported for this
> > > >> >> >attribute.  Maybe we can do it better instead of using the NameData
> > > >> >> >but I don't think it makes sense to add a separate catalog?
> > > >> >> >
> > > >> >>
> > > >> >> Sure, I understand what the goal was - all I'm saying is that it looks
> > > >> >> very much like a workaround needed because we don't have the catalog.
> > > >> >>
> > > >> >> I don't quite understand how could we support custom compression methods
> > > >> >> without listing them in some sort of catalog?
> > > >> >
> > > >> >Yeah for supporting custom compression we need some catalog.
> > > >> >
> > > >> >> >> Moreover, having the catalog would allow adding compression methods
> > > >> >> >> (from extensions etc) instead of just having a list of hard-coded
> > > >> >> >> compression methods. Which seems like a strange limitation, considering
> > > >> >> >> this thread is called "custom compression methods".
> > > >> >> >
> > > >> >> >I think I forgot to mention while submitting the previous patch that
> > > >> >> >the next patch I am planning to submit is, Support creating the custom
> > > >> >> >compression methods wherein we can use pg_am catalog to insert the new
> > > >> >> >compression method.  And for dependency handling, we can create an
> > > >> >> >attribute dependency on the pg_am row.  Basically, we will create the
> > > >> >> >attribute dependency on the current compression method AM as well as
> > > >> >> >on the preserved compression methods AM.   As part of this, we will
> > > >> >> >add two build-in AMs for zlib and pglz, and the attcompression field
> > > >> >> >will be converted to the oid_vector (first OID will be of the current
> > > >> >> >compression method, followed by the preserved compression method's
> > > >> >> >oids).
> > > >> >> >
> > > >> >>
> > > >> >> Hmmm, ok. Not sure pg_am is the right place - compression methods don't
> > > >> >> quite match what I though AMs are about, but maybe it's just my fault.
> > > >> >>
> > > >> >> FWIW it seems a bit strange to first do the attcompression magic and
> > > >> >> then add the catalog later - I think we should start with the catalog
> > > >> >> right away. The advantage is that if we end up committing only some of
> > > >> >> the patches in this cycle, we already have all the infrastructure etc.
> > > >> >> We can reorder that later, though.
> > > >> >
> > > >> >Hmm,  yeah we can do this way as well that first create a new catalog
> > > >> >table and add entries for these two built-in methods and the
> > > >> >attcompression can store the oid vector.  But if we only commit the
> > > >> >build-in compression methods part then does it make sense to create an
> > > >> >extra catalog or adding these build-in methods to the existing catalog
> > > >> >(if we plan to use pg_am).  Then in attcompression instead of using
> > > >> >one byte for each preserve compression method, we need to use oid.  So
> > > >> >from Robert's mail[1], it appeared to me that he wants that the
> > > >> >build-in compression methods part should be independently committable
> > > >> >and if we think from that perspective then adding a catalog doesn't
> > > >> >make much sense.  But if we are planning to commit the custom method
> > > >> >also then it makes more sense to directly start with the catalog
> > > >> >because that way it will be easy to expand without much refactoring.
> > > >> >
> > > >> >[1] https://www.postgresql.org/message-id/CA%2BTgmobSDVgUage9qQ5P_%3DF_9jaMkCgyKxUQGtFQU7oN4kX-AA%40mail.gmail.com
> > > >> >
> > > >>
> > > >> Hmmm. Maybe I'm missing something subtle, but I think that plan can be
> > > >> interpreted in various ways - it does not really say whether the initial
> > > >> list of built-in methods should be in some C array, or already in a proper
> > > >> catalog.
> > > >>
> > > >> All I'm saying is it seems a bit weird to first implement dependencies
> > > >> based on strange (mis)use of attcompression attribute, and then replace
> > > >> it with a proper catalog. My understanding is those patches are expected
> > > >> to be committable one by one, but the attcompression approach seems a
> > > >> bit too hacky to me - not sure I'd want to commit that ...
> > > >
> > > >Okay, I will change this.  So I will make create a new catalog
> > > >pg_compression and add the entry for two built-in compression methods
> > > >from the very first patch.
> > > >
> > >
> > > OK.
>
> I have changed the first 2 patches, basically, now we are providing a
> new catalog pg_compression and the pg_attribute is storing the oid of
> the compression method.  The patches still need some cleanup and there
> is also one open comment that for index we should use its table
> compression.
>
> I am still working on the preserve patch.  For preserving the
> compression method I am planning to convert the attcompression field
> to the oidvector so that we can store the oid of the preserve method
> also.  I am not sure whether we can access this oidvector as a fixed
> part of the FormData_pg_attribute or not.  The reason is that for
> building the tuple descriptor, we need to give the size of the fixed
> part (#define ATTRIBUTE_FIXED_PART_SIZE \
> (offsetof(FormData_pg_attribute,attcompression) + sizeof(Oid))).  But
> if we convert this to the oidvector then we don't know the size of the
> fixed part.  Am I missing something?

I could think of two solutions here
Sol1.
Make the first oid of the oidvector as part of the fixed size, like below
#define ATTRIBUTE_FIXED_PART_SIZE \
(offsetof(FormData_pg_attribute, attcompression) + OidVectorSize(1))

Sol2:
Keep attcompression as oid only and for the preserve list, adds
another field in the variable part which will be of type oidvector.  I
think most of the time we need to access the current compression
method and with this solution, we will be able to access that as part
of the tuple desc.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: Custom compression methods

Dilip Kumar-2
In reply to this post by Dilip Kumar-2
On Wed, Oct 7, 2020 at 5:00 PM Dilip Kumar <[hidden email]> wrote:

>
> On Wed, Oct 7, 2020 at 10:26 AM Dilip Kumar <[hidden email]> wrote:
> >
> > On Tue, Oct 6, 2020 at 10:21 PM Tomas Vondra
> > <[hidden email]> wrote:
> > >
> > > On Tue, Oct 06, 2020 at 11:00:55AM +0530, Dilip Kumar wrote:
> > > >On Mon, Oct 5, 2020 at 9:34 PM Tomas Vondra
> > > ><[hidden email]> wrote:
> > > >>
> > > >> On Mon, Oct 05, 2020 at 07:57:41PM +0530, Dilip Kumar wrote:
> > > >> >On Mon, Oct 5, 2020 at 5:53 PM Tomas Vondra
> > > >> ><[hidden email]> wrote:
> > > >> >>
> > > >> >> On Mon, Oct 05, 2020 at 11:17:28AM +0530, Dilip Kumar wrote:
> > > >> >> >On Mon, Oct 5, 2020 at 3:37 AM Tomas Vondra
> > > >> >> ><[hidden email]> wrote:
> > > >> >> >
> > > >> >> >Thanks, Tomas for your feedback.
> > > >> >> >
> > > >> >> >> 9) attcompression ...
> > > >> >> >>
> > > >> >> >> The main issue I see is what the patch does with attcompression. Instead
> > > >> >> >> of just using it to store a the compression method, it's also used to
> > > >> >> >> store the preserved compression methods. And using NameData to store
> > > >> >> >> this seems wrong too - if we really want to store this info, the correct
> > > >> >> >> way is either using text[] or inventing charvector or similar.
> > > >> >> >
> > > >> >> >The reason for using the NameData is the get it in the fixed part of
> > > >> >> >the data structure.
> > > >> >> >
> > > >> >>
> > > >> >> Why do we need that? It's possible to have varlena fields with direct
> > > >> >> access (see pg_index.indkey for example).
> > > >> >
> > > >> >I see.  While making it NameData I was thinking whether we have an
> > > >> >option to direct access the varlena. Thanks for pointing me there. I
> > > >> >will change this.
> > > >> >
> > > >> > Adding NameData just to make
> > > >> >> it fixed-length means we're always adding 64B even if we just need a
> > > >> >> single byte, which means ~30% overhead for the FormData_pg_attribute.
> > > >> >> That seems a bit unnecessary, and might be an issue with many attributes
> > > >> >> (e.g. with many temp tables, etc.).
> > > >> >
> > > >> >You are right.  Even I did not like to keep 64B for this, so I will change it.
> > > >> >
> > > >> >>
> > > >> >> >> But to me this seems very much like a misuse of attcompression to track
> > > >> >> >> dependencies on compression methods, necessary because we don't have a
> > > >> >> >> separate catalog listing compression methods. If we had that, I think we
> > > >> >> >> could simply add dependencies between attributes and that catalog.
> > > >> >> >
> > > >> >> >Basically, up to this patch, we are having only built-in compression
> > > >> >> >methods and those can not be dropped so we don't need any dependency
> > > >> >> >at all.  We just want to know what is the current compression method
> > > >> >> >and what is the preserve compression methods supported for this
> > > >> >> >attribute.  Maybe we can do it better instead of using the NameData
> > > >> >> >but I don't think it makes sense to add a separate catalog?
> > > >> >> >
> > > >> >>
> > > >> >> Sure, I understand what the goal was - all I'm saying is that it looks
> > > >> >> very much like a workaround needed because we don't have the catalog.
> > > >> >>
> > > >> >> I don't quite understand how could we support custom compression methods
> > > >> >> without listing them in some sort of catalog?
> > > >> >
> > > >> >Yeah for supporting custom compression we need some catalog.
> > > >> >
> > > >> >> >> Moreover, having the catalog would allow adding compression methods
> > > >> >> >> (from extensions etc) instead of just having a list of hard-coded
> > > >> >> >> compression methods. Which seems like a strange limitation, considering
> > > >> >> >> this thread is called "custom compression methods".
> > > >> >> >
> > > >> >> >I think I forgot to mention while submitting the previous patch that
> > > >> >> >the next patch I am planning to submit is, Support creating the custom
> > > >> >> >compression methods wherein we can use pg_am catalog to insert the new
> > > >> >> >compression method.  And for dependency handling, we can create an
> > > >> >> >attribute dependency on the pg_am row.  Basically, we will create the
> > > >> >> >attribute dependency on the current compression method AM as well as
> > > >> >> >on the preserved compression methods AM.   As part of this, we will
> > > >> >> >add two build-in AMs for zlib and pglz, and the attcompression field
> > > >> >> >will be converted to the oid_vector (first OID will be of the current
> > > >> >> >compression method, followed by the preserved compression method's
> > > >> >> >oids).
> > > >> >> >
> > > >> >>
> > > >> >> Hmmm, ok. Not sure pg_am is the right place - compression methods don't
> > > >> >> quite match what I though AMs are about, but maybe it's just my fault.
> > > >> >>
> > > >> >> FWIW it seems a bit strange to first do the attcompression magic and
> > > >> >> then add the catalog later - I think we should start with the catalog
> > > >> >> right away. The advantage is that if we end up committing only some of
> > > >> >> the patches in this cycle, we already have all the infrastructure etc.
> > > >> >> We can reorder that later, though.
> > > >> >
> > > >> >Hmm,  yeah we can do this way as well that first create a new catalog
> > > >> >table and add entries for these two built-in methods and the
> > > >> >attcompression can store the oid vector.  But if we only commit the
> > > >> >build-in compression methods part then does it make sense to create an
> > > >> >extra catalog or adding these build-in methods to the existing catalog
> > > >> >(if we plan to use pg_am).  Then in attcompression instead of using
> > > >> >one byte for each preserve compression method, we need to use oid.  So
> > > >> >from Robert's mail[1], it appeared to me that he wants that the
> > > >> >build-in compression methods part should be independently committable
> > > >> >and if we think from that perspective then adding a catalog doesn't
> > > >> >make much sense.  But if we are planning to commit the custom method
> > > >> >also then it makes more sense to directly start with the catalog
> > > >> >because that way it will be easy to expand without much refactoring.
> > > >> >
> > > >> >[1] https://www.postgresql.org/message-id/CA%2BTgmobSDVgUage9qQ5P_%3DF_9jaMkCgyKxUQGtFQU7oN4kX-AA%40mail.gmail.com
> > > >> >
> > > >>
> > > >> Hmmm. Maybe I'm missing something subtle, but I think that plan can be
> > > >> interpreted in various ways - it does not really say whether the initial
> > > >> list of built-in methods should be in some C array, or already in a proper
> > > >> catalog.
> > > >>
> > > >> All I'm saying is it seems a bit weird to first implement dependencies
> > > >> based on strange (mis)use of attcompression attribute, and then replace
> > > >> it with a proper catalog. My understanding is those patches are expected
> > > >> to be committable one by one, but the attcompression approach seems a
> > > >> bit too hacky to me - not sure I'd want to commit that ...
> > > >
> > > >Okay, I will change this.  So I will make create a new catalog
> > > >pg_compression and add the entry for two built-in compression methods
> > > >from the very first patch.
> > > >
> > >
> > > OK.
>
> I have changed the first 2 patches, basically, now we are providing a
> new catalog pg_compression and the pg_attribute is storing the oid of
> the compression method.  The patches still need some cleanup and there
> is also one open comment that for index we should use its table
> compression.
There was some unwanted code in the previous patch so attaching the
updated patches.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

v5-0002-alter-table-set-compression.patch (16K) Download Attachment
v5-0001-Built-in-compression-method.patch (271K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Custom compression methods

Tomas Vondra-4
In reply to this post by Dilip Kumar-2
On Thu, Oct 08, 2020 at 02:38:27PM +0530, Dilip Kumar wrote:

>On Wed, Oct 7, 2020 at 5:00 PM Dilip Kumar <[hidden email]> wrote:
>>
>> On Wed, Oct 7, 2020 at 10:26 AM Dilip Kumar <[hidden email]> wrote:
>> >
>> > On Tue, Oct 6, 2020 at 10:21 PM Tomas Vondra
>> > <[hidden email]> wrote:
>> > >
>> > > On Tue, Oct 06, 2020 at 11:00:55AM +0530, Dilip Kumar wrote:
>> > > >On Mon, Oct 5, 2020 at 9:34 PM Tomas Vondra
>> > > ><[hidden email]> wrote:
>> > > >>
>> > > >> On Mon, Oct 05, 2020 at 07:57:41PM +0530, Dilip Kumar wrote:
>> > > >> >On Mon, Oct 5, 2020 at 5:53 PM Tomas Vondra
>> > > >> ><[hidden email]> wrote:
>> > > >> >>
>> > > >> >> On Mon, Oct 05, 2020 at 11:17:28AM +0530, Dilip Kumar wrote:
>> > > >> >> >On Mon, Oct 5, 2020 at 3:37 AM Tomas Vondra
>> > > >> >> ><[hidden email]> wrote:
>> > > >> >> >
>> > > >> >> >Thanks, Tomas for your feedback.
>> > > >> >> >
>> > > >> >> >> 9) attcompression ...
>> > > >> >> >>
>> > > >> >> >> The main issue I see is what the patch does with attcompression. Instead
>> > > >> >> >> of just using it to store a the compression method, it's also used to
>> > > >> >> >> store the preserved compression methods. And using NameData to store
>> > > >> >> >> this seems wrong too - if we really want to store this info, the correct
>> > > >> >> >> way is either using text[] or inventing charvector or similar.
>> > > >> >> >
>> > > >> >> >The reason for using the NameData is the get it in the fixed part of
>> > > >> >> >the data structure.
>> > > >> >> >
>> > > >> >>
>> > > >> >> Why do we need that? It's possible to have varlena fields with direct
>> > > >> >> access (see pg_index.indkey for example).
>> > > >> >
>> > > >> >I see.  While making it NameData I was thinking whether we have an
>> > > >> >option to direct access the varlena. Thanks for pointing me there. I
>> > > >> >will change this.
>> > > >> >
>> > > >> > Adding NameData just to make
>> > > >> >> it fixed-length means we're always adding 64B even if we just need a
>> > > >> >> single byte, which means ~30% overhead for the FormData_pg_attribute.
>> > > >> >> That seems a bit unnecessary, and might be an issue with many attributes
>> > > >> >> (e.g. with many temp tables, etc.).
>> > > >> >
>> > > >> >You are right.  Even I did not like to keep 64B for this, so I will change it.
>> > > >> >
>> > > >> >>
>> > > >> >> >> But to me this seems very much like a misuse of attcompression to track
>> > > >> >> >> dependencies on compression methods, necessary because we don't have a
>> > > >> >> >> separate catalog listing compression methods. If we had that, I think we
>> > > >> >> >> could simply add dependencies between attributes and that catalog.
>> > > >> >> >
>> > > >> >> >Basically, up to this patch, we are having only built-in compression
>> > > >> >> >methods and those can not be dropped so we don't need any dependency
>> > > >> >> >at all.  We just want to know what is the current compression method
>> > > >> >> >and what is the preserve compression methods supported for this
>> > > >> >> >attribute.  Maybe we can do it better instead of using the NameData
>> > > >> >> >but I don't think it makes sense to add a separate catalog?
>> > > >> >> >
>> > > >> >>
>> > > >> >> Sure, I understand what the goal was - all I'm saying is that it looks
>> > > >> >> very much like a workaround needed because we don't have the catalog.
>> > > >> >>
>> > > >> >> I don't quite understand how could we support custom compression methods
>> > > >> >> without listing them in some sort of catalog?
>> > > >> >
>> > > >> >Yeah for supporting custom compression we need some catalog.
>> > > >> >
>> > > >> >> >> Moreover, having the catalog would allow adding compression methods
>> > > >> >> >> (from extensions etc) instead of just having a list of hard-coded
>> > > >> >> >> compression methods. Which seems like a strange limitation, considering
>> > > >> >> >> this thread is called "custom compression methods".
>> > > >> >> >
>> > > >> >> >I think I forgot to mention while submitting the previous patch that
>> > > >> >> >the next patch I am planning to submit is, Support creating the custom
>> > > >> >> >compression methods wherein we can use pg_am catalog to insert the new
>> > > >> >> >compression method.  And for dependency handling, we can create an
>> > > >> >> >attribute dependency on the pg_am row.  Basically, we will create the
>> > > >> >> >attribute dependency on the current compression method AM as well as
>> > > >> >> >on the preserved compression methods AM.   As part of this, we will
>> > > >> >> >add two build-in AMs for zlib and pglz, and the attcompression field
>> > > >> >> >will be converted to the oid_vector (first OID will be of the current
>> > > >> >> >compression method, followed by the preserved compression method's
>> > > >> >> >oids).
>> > > >> >> >
>> > > >> >>
>> > > >> >> Hmmm, ok. Not sure pg_am is the right place - compression methods don't
>> > > >> >> quite match what I though AMs are about, but maybe it's just my fault.
>> > > >> >>
>> > > >> >> FWIW it seems a bit strange to first do the attcompression magic and
>> > > >> >> then add the catalog later - I think we should start with the catalog
>> > > >> >> right away. The advantage is that if we end up committing only some of
>> > > >> >> the patches in this cycle, we already have all the infrastructure etc.
>> > > >> >> We can reorder that later, though.
>> > > >> >
>> > > >> >Hmm,  yeah we can do this way as well that first create a new catalog
>> > > >> >table and add entries for these two built-in methods and the
>> > > >> >attcompression can store the oid vector.  But if we only commit the
>> > > >> >build-in compression methods part then does it make sense to create an
>> > > >> >extra catalog or adding these build-in methods to the existing catalog
>> > > >> >(if we plan to use pg_am).  Then in attcompression instead of using
>> > > >> >one byte for each preserve compression method, we need to use oid.  So
>> > > >> >from Robert's mail[1], it appeared to me that he wants that the
>> > > >> >build-in compression methods part should be independently committable
>> > > >> >and if we think from that perspective then adding a catalog doesn't
>> > > >> >make much sense.  But if we are planning to commit the custom method
>> > > >> >also then it makes more sense to directly start with the catalog
>> > > >> >because that way it will be easy to expand without much refactoring.
>> > > >> >
>> > > >> >[1] https://www.postgresql.org/message-id/CA%2BTgmobSDVgUage9qQ5P_%3DF_9jaMkCgyKxUQGtFQU7oN4kX-AA%40mail.gmail.com
>> > > >> >
>> > > >>
>> > > >> Hmmm. Maybe I'm missing something subtle, but I think that plan can be
>> > > >> interpreted in various ways - it does not really say whether the initial
>> > > >> list of built-in methods should be in some C array, or already in a proper
>> > > >> catalog.
>> > > >>
>> > > >> All I'm saying is it seems a bit weird to first implement dependencies
>> > > >> based on strange (mis)use of attcompression attribute, and then replace
>> > > >> it with a proper catalog. My understanding is those patches are expected
>> > > >> to be committable one by one, but the attcompression approach seems a
>> > > >> bit too hacky to me - not sure I'd want to commit that ...
>> > > >
>> > > >Okay, I will change this.  So I will make create a new catalog
>> > > >pg_compression and add the entry for two built-in compression methods
>> > > >from the very first patch.
>> > > >
>> > >
>> > > OK.
>>
>> I have changed the first 2 patches, basically, now we are providing a
>> new catalog pg_compression and the pg_attribute is storing the oid of
>> the compression method.  The patches still need some cleanup and there
>> is also one open comment that for index we should use its table
>> compression.
>>
>> I am still working on the preserve patch.  For preserving the
>> compression method I am planning to convert the attcompression field
>> to the oidvector so that we can store the oid of the preserve method
>> also.  I am not sure whether we can access this oidvector as a fixed
>> part of the FormData_pg_attribute or not.  The reason is that for
>> building the tuple descriptor, we need to give the size of the fixed
>> part (#define ATTRIBUTE_FIXED_PART_SIZE \
>> (offsetof(FormData_pg_attribute,attcompression) + sizeof(Oid))).  But
>> if we convert this to the oidvector then we don't know the size of the
>> fixed part.  Am I missing something?
>
>I could think of two solutions here
>Sol1.
>Make the first oid of the oidvector as part of the fixed size, like below
>#define ATTRIBUTE_FIXED_PART_SIZE \
>(offsetof(FormData_pg_attribute, attcompression) + OidVectorSize(1))
>
>Sol2:
>Keep attcompression as oid only and for the preserve list, adds
>another field in the variable part which will be of type oidvector.  I
>think most of the time we need to access the current compression
>method and with this solution, we will be able to access that as part
>of the tuple desc.
>

And is the oidvector actually needed? If we have the extra catalog,
can't we track this simply using the regular dependencies? So we'd have
the attcompression OID of the current compression method, and the
preserved values would be tracked in pg_depend.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: Custom compression methods

Dilip Kumar-2
On Fri, Oct 9, 2020 at 3:24 AM Tomas Vondra
<[hidden email]> wrote:

>
> On Thu, Oct 08, 2020 at 02:38:27PM +0530, Dilip Kumar wrote:
> >On Wed, Oct 7, 2020 at 5:00 PM Dilip Kumar <[hidden email]> wrote:
> >>
> >> On Wed, Oct 7, 2020 at 10:26 AM Dilip Kumar <[hidden email]> wrote:
> >> >
> >> > On Tue, Oct 6, 2020 at 10:21 PM Tomas Vondra
> >> > <[hidden email]> wrote:
> >> > >
> >> > > On Tue, Oct 06, 2020 at 11:00:55AM +0530, Dilip Kumar wrote:
> >> > > >On Mon, Oct 5, 2020 at 9:34 PM Tomas Vondra
> >> > > ><[hidden email]> wrote:
> >> > > >>
> >> > > >> On Mon, Oct 05, 2020 at 07:57:41PM +0530, Dilip Kumar wrote:
> >> > > >> >On Mon, Oct 5, 2020 at 5:53 PM Tomas Vondra
> >> > > >> ><[hidden email]> wrote:
> >> > > >> >>
> >> > > >> >> On Mon, Oct 05, 2020 at 11:17:28AM +0530, Dilip Kumar wrote:
> >> > > >> >> >On Mon, Oct 5, 2020 at 3:37 AM Tomas Vondra
> >> > > >> >> ><[hidden email]> wrote:
> >> > > >> >> >
> >> > > >> >> >Thanks, Tomas for your feedback.
> >> > > >> >> >
> >> > > >> >> >> 9) attcompression ...
> >> > > >> >> >>
> >> > > >> >> >> The main issue I see is what the patch does with attcompression. Instead
> >> > > >> >> >> of just using it to store a the compression method, it's also used to
> >> > > >> >> >> store the preserved compression methods. And using NameData to store
> >> > > >> >> >> this seems wrong too - if we really want to store this info, the correct
> >> > > >> >> >> way is either using text[] or inventing charvector or similar.
> >> > > >> >> >
> >> > > >> >> >The reason for using the NameData is the get it in the fixed part of
> >> > > >> >> >the data structure.
> >> > > >> >> >
> >> > > >> >>
> >> > > >> >> Why do we need that? It's possible to have varlena fields with direct
> >> > > >> >> access (see pg_index.indkey for example).
> >> > > >> >
> >> > > >> >I see.  While making it NameData I was thinking whether we have an
> >> > > >> >option to direct access the varlena. Thanks for pointing me there. I
> >> > > >> >will change this.
> >> > > >> >
> >> > > >> > Adding NameData just to make
> >> > > >> >> it fixed-length means we're always adding 64B even if we just need a
> >> > > >> >> single byte, which means ~30% overhead for the FormData_pg_attribute.
> >> > > >> >> That seems a bit unnecessary, and might be an issue with many attributes
> >> > > >> >> (e.g. with many temp tables, etc.).
> >> > > >> >
> >> > > >> >You are right.  Even I did not like to keep 64B for this, so I will change it.
> >> > > >> >
> >> > > >> >>
> >> > > >> >> >> But to me this seems very much like a misuse of attcompression to track
> >> > > >> >> >> dependencies on compression methods, necessary because we don't have a
> >> > > >> >> >> separate catalog listing compression methods. If we had that, I think we
> >> > > >> >> >> could simply add dependencies between attributes and that catalog.
> >> > > >> >> >
> >> > > >> >> >Basically, up to this patch, we are having only built-in compression
> >> > > >> >> >methods and those can not be dropped so we don't need any dependency
> >> > > >> >> >at all.  We just want to know what is the current compression method
> >> > > >> >> >and what is the preserve compression methods supported for this
> >> > > >> >> >attribute.  Maybe we can do it better instead of using the NameData
> >> > > >> >> >but I don't think it makes sense to add a separate catalog?
> >> > > >> >> >
> >> > > >> >>
> >> > > >> >> Sure, I understand what the goal was - all I'm saying is that it looks
> >> > > >> >> very much like a workaround needed because we don't have the catalog.
> >> > > >> >>
> >> > > >> >> I don't quite understand how could we support custom compression methods
> >> > > >> >> without listing them in some sort of catalog?
> >> > > >> >
> >> > > >> >Yeah for supporting custom compression we need some catalog.
> >> > > >> >
> >> > > >> >> >> Moreover, having the catalog would allow adding compression methods
> >> > > >> >> >> (from extensions etc) instead of just having a list of hard-coded
> >> > > >> >> >> compression methods. Which seems like a strange limitation, considering
> >> > > >> >> >> this thread is called "custom compression methods".
> >> > > >> >> >
> >> > > >> >> >I think I forgot to mention while submitting the previous patch that
> >> > > >> >> >the next patch I am planning to submit is, Support creating the custom
> >> > > >> >> >compression methods wherein we can use pg_am catalog to insert the new
> >> > > >> >> >compression method.  And for dependency handling, we can create an
> >> > > >> >> >attribute dependency on the pg_am row.  Basically, we will create the
> >> > > >> >> >attribute dependency on the current compression method AM as well as
> >> > > >> >> >on the preserved compression methods AM.   As part of this, we will
> >> > > >> >> >add two build-in AMs for zlib and pglz, and the attcompression field
> >> > > >> >> >will be converted to the oid_vector (first OID will be of the current
> >> > > >> >> >compression method, followed by the preserved compression method's
> >> > > >> >> >oids).
> >> > > >> >> >
> >> > > >> >>
> >> > > >> >> Hmmm, ok. Not sure pg_am is the right place - compression methods don't
> >> > > >> >> quite match what I though AMs are about, but maybe it's just my fault.
> >> > > >> >>
> >> > > >> >> FWIW it seems a bit strange to first do the attcompression magic and
> >> > > >> >> then add the catalog later - I think we should start with the catalog
> >> > > >> >> right away. The advantage is that if we end up committing only some of
> >> > > >> >> the patches in this cycle, we already have all the infrastructure etc.
> >> > > >> >> We can reorder that later, though.
> >> > > >> >
> >> > > >> >Hmm,  yeah we can do this way as well that first create a new catalog
> >> > > >> >table and add entries for these two built-in methods and the
> >> > > >> >attcompression can store the oid vector.  But if we only commit the
> >> > > >> >build-in compression methods part then does it make sense to create an
> >> > > >> >extra catalog or adding these build-in methods to the existing catalog
> >> > > >> >(if we plan to use pg_am).  Then in attcompression instead of using
> >> > > >> >one byte for each preserve compression method, we need to use oid.  So
> >> > > >> >from Robert's mail[1], it appeared to me that he wants that the
> >> > > >> >build-in compression methods part should be independently committable
> >> > > >> >and if we think from that perspective then adding a catalog doesn't
> >> > > >> >make much sense.  But if we are planning to commit the custom method
> >> > > >> >also then it makes more sense to directly start with the catalog
> >> > > >> >because that way it will be easy to expand without much refactoring.
> >> > > >> >
> >> > > >> >[1] https://www.postgresql.org/message-id/CA%2BTgmobSDVgUage9qQ5P_%3DF_9jaMkCgyKxUQGtFQU7oN4kX-AA%40mail.gmail.com
> >> > > >> >
> >> > > >>
> >> > > >> Hmmm. Maybe I'm missing something subtle, but I think that plan can be
> >> > > >> interpreted in various ways - it does not really say whether the initial
> >> > > >> list of built-in methods should be in some C array, or already in a proper
> >> > > >> catalog.
> >> > > >>
> >> > > >> All I'm saying is it seems a bit weird to first implement dependencies
> >> > > >> based on strange (mis)use of attcompression attribute, and then replace
> >> > > >> it with a proper catalog. My understanding is those patches are expected
> >> > > >> to be committable one by one, but the attcompression approach seems a
> >> > > >> bit too hacky to me - not sure I'd want to commit that ...
> >> > > >
> >> > > >Okay, I will change this.  So I will make create a new catalog
> >> > > >pg_compression and add the entry for two built-in compression methods
> >> > > >from the very first patch.
> >> > > >
> >> > >
> >> > > OK.
> >>
> >> I have changed the first 2 patches, basically, now we are providing a
> >> new catalog pg_compression and the pg_attribute is storing the oid of
> >> the compression method.  The patches still need some cleanup and there
> >> is also one open comment that for index we should use its table
> >> compression.
> >>
> >> I am still working on the preserve patch.  For preserving the
> >> compression method I am planning to convert the attcompression field
> >> to the oidvector so that we can store the oid of the preserve method
> >> also.  I am not sure whether we can access this oidvector as a fixed
> >> part of the FormData_pg_attribute or not.  The reason is that for
> >> building the tuple descriptor, we need to give the size of the fixed
> >> part (#define ATTRIBUTE_FIXED_PART_SIZE \
> >> (offsetof(FormData_pg_attribute,attcompression) + sizeof(Oid))).  But
> >> if we convert this to the oidvector then we don't know the size of the
> >> fixed part.  Am I missing something?
> >
> >I could think of two solutions here
> >Sol1.
> >Make the first oid of the oidvector as part of the fixed size, like below
> >#define ATTRIBUTE_FIXED_PART_SIZE \
> >(offsetof(FormData_pg_attribute, attcompression) + OidVectorSize(1))
> >
> >Sol2:
> >Keep attcompression as oid only and for the preserve list, adds
> >another field in the variable part which will be of type oidvector.  I
> >think most of the time we need to access the current compression
> >method and with this solution, we will be able to access that as part
> >of the tuple desc.
> >
>
> And is the oidvector actually needed? If we have the extra catalog,
> can't we track this simply using the regular dependencies? So we'd have
> the attcompression OID of the current compression method, and the
> preserved values would be tracked in pg_depend.

Right, we can do that as well.  Actually, the preserved list need to
be accessed only in case of ALTER TABLE SET COMPRESSION and INSERT
INTO SELECT * FROM queries.  So in such cases, I think it is okay to
get the preserved compression oids from pg_depends.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: Custom compression methods

Dilip Kumar-2
On Fri, Oct 9, 2020 at 3:01 PM Dilip Kumar <[hidden email]> wrote:

>
> On Fri, Oct 9, 2020 at 3:24 AM Tomas Vondra
> <[hidden email]> wrote:
> >
> > On Thu, Oct 08, 2020 at 02:38:27PM +0530, Dilip Kumar wrote:
> > >On Wed, Oct 7, 2020 at 5:00 PM Dilip Kumar <[hidden email]> wrote:
> > >>
> > >> On Wed, Oct 7, 2020 at 10:26 AM Dilip Kumar <[hidden email]> wrote:
> > >> >
> > >> > On Tue, Oct 6, 2020 at 10:21 PM Tomas Vondra
> > >> > <[hidden email]> wrote:
> > >> > >
> > >> > > On Tue, Oct 06, 2020 at 11:00:55AM +0530, Dilip Kumar wrote:
> > >> > > >On Mon, Oct 5, 2020 at 9:34 PM Tomas Vondra
> > >> > > ><[hidden email]> wrote:
> > >> > > >>
> > >> > > >> On Mon, Oct 05, 2020 at 07:57:41PM +0530, Dilip Kumar wrote:
> > >> > > >> >On Mon, Oct 5, 2020 at 5:53 PM Tomas Vondra
> > >> > > >> ><[hidden email]> wrote:
> > >> > > >> >>
> > >> > > >> >> On Mon, Oct 05, 2020 at 11:17:28AM +0530, Dilip Kumar wrote:
> > >> > > >> >> >On Mon, Oct 5, 2020 at 3:37 AM Tomas Vondra
> > >> > > >> >> ><[hidden email]> wrote:
> > >> > > >> >> >
> > >> > > >> >> >Thanks, Tomas for your feedback.
> > >> > > >> >> >
> > >> > > >> >> >> 9) attcompression ...
> > >> > > >> >> >>
> > >> > > >> >> >> The main issue I see is what the patch does with attcompression. Instead
> > >> > > >> >> >> of just using it to store a the compression method, it's also used to
> > >> > > >> >> >> store the preserved compression methods. And using NameData to store
> > >> > > >> >> >> this seems wrong too - if we really want to store this info, the correct
> > >> > > >> >> >> way is either using text[] or inventing charvector or similar.
> > >> > > >> >> >
> > >> > > >> >> >The reason for using the NameData is the get it in the fixed part of
> > >> > > >> >> >the data structure.
> > >> > > >> >> >
> > >> > > >> >>
> > >> > > >> >> Why do we need that? It's possible to have varlena fields with direct
> > >> > > >> >> access (see pg_index.indkey for example).
> > >> > > >> >
> > >> > > >> >I see.  While making it NameData I was thinking whether we have an
> > >> > > >> >option to direct access the varlena. Thanks for pointing me there. I
> > >> > > >> >will change this.
> > >> > > >> >
> > >> > > >> > Adding NameData just to make
> > >> > > >> >> it fixed-length means we're always adding 64B even if we just need a
> > >> > > >> >> single byte, which means ~30% overhead for the FormData_pg_attribute.
> > >> > > >> >> That seems a bit unnecessary, and might be an issue with many attributes
> > >> > > >> >> (e.g. with many temp tables, etc.).
> > >> > > >> >
> > >> > > >> >You are right.  Even I did not like to keep 64B for this, so I will change it.
> > >> > > >> >
> > >> > > >> >>
> > >> > > >> >> >> But to me this seems very much like a misuse of attcompression to track
> > >> > > >> >> >> dependencies on compression methods, necessary because we don't have a
> > >> > > >> >> >> separate catalog listing compression methods. If we had that, I think we
> > >> > > >> >> >> could simply add dependencies between attributes and that catalog.
> > >> > > >> >> >
> > >> > > >> >> >Basically, up to this patch, we are having only built-in compression
> > >> > > >> >> >methods and those can not be dropped so we don't need any dependency
> > >> > > >> >> >at all.  We just want to know what is the current compression method
> > >> > > >> >> >and what is the preserve compression methods supported for this
> > >> > > >> >> >attribute.  Maybe we can do it better instead of using the NameData
> > >> > > >> >> >but I don't think it makes sense to add a separate catalog?
> > >> > > >> >> >
> > >> > > >> >>
> > >> > > >> >> Sure, I understand what the goal was - all I'm saying is that it looks
> > >> > > >> >> very much like a workaround needed because we don't have the catalog.
> > >> > > >> >>
> > >> > > >> >> I don't quite understand how could we support custom compression methods
> > >> > > >> >> without listing them in some sort of catalog?
> > >> > > >> >
> > >> > > >> >Yeah for supporting custom compression we need some catalog.
> > >> > > >> >
> > >> > > >> >> >> Moreover, having the catalog would allow adding compression methods
> > >> > > >> >> >> (from extensions etc) instead of just having a list of hard-coded
> > >> > > >> >> >> compression methods. Which seems like a strange limitation, considering
> > >> > > >> >> >> this thread is called "custom compression methods".
> > >> > > >> >> >
> > >> > > >> >> >I think I forgot to mention while submitting the previous patch that
> > >> > > >> >> >the next patch I am planning to submit is, Support creating the custom
> > >> > > >> >> >compression methods wherein we can use pg_am catalog to insert the new
> > >> > > >> >> >compression method.  And for dependency handling, we can create an
> > >> > > >> >> >attribute dependency on the pg_am row.  Basically, we will create the
> > >> > > >> >> >attribute dependency on the current compression method AM as well as
> > >> > > >> >> >on the preserved compression methods AM.   As part of this, we will
> > >> > > >> >> >add two build-in AMs for zlib and pglz, and the attcompression field
> > >> > > >> >> >will be converted to the oid_vector (first OID will be of the current
> > >> > > >> >> >compression method, followed by the preserved compression method's
> > >> > > >> >> >oids).
> > >> > > >> >> >
> > >> > > >> >>
> > >> > > >> >> Hmmm, ok. Not sure pg_am is the right place - compression methods don't
> > >> > > >> >> quite match what I though AMs are about, but maybe it's just my fault.
> > >> > > >> >>
> > >> > > >> >> FWIW it seems a bit strange to first do the attcompression magic and
> > >> > > >> >> then add the catalog later - I think we should start with the catalog
> > >> > > >> >> right away. The advantage is that if we end up committing only some of
> > >> > > >> >> the patches in this cycle, we already have all the infrastructure etc.
> > >> > > >> >> We can reorder that later, though.
> > >> > > >> >
> > >> > > >> >Hmm,  yeah we can do this way as well that first create a new catalog
> > >> > > >> >table and add entries for these two built-in methods and the
> > >> > > >> >attcompression can store the oid vector.  But if we only commit the
> > >> > > >> >build-in compression methods part then does it make sense to create an
> > >> > > >> >extra catalog or adding these build-in methods to the existing catalog
> > >> > > >> >(if we plan to use pg_am).  Then in attcompression instead of using
> > >> > > >> >one byte for each preserve compression method, we need to use oid.  So
> > >> > > >> >from Robert's mail[1], it appeared to me that he wants that the
> > >> > > >> >build-in compression methods part should be independently committable
> > >> > > >> >and if we think from that perspective then adding a catalog doesn't
> > >> > > >> >make much sense.  But if we are planning to commit the custom method
> > >> > > >> >also then it makes more sense to directly start with the catalog
> > >> > > >> >because that way it will be easy to expand without much refactoring.
> > >> > > >> >
> > >> > > >> >[1] https://www.postgresql.org/message-id/CA%2BTgmobSDVgUage9qQ5P_%3DF_9jaMkCgyKxUQGtFQU7oN4kX-AA%40mail.gmail.com
> > >> > > >> >
> > >> > > >>
> > >> > > >> Hmmm. Maybe I'm missing something subtle, but I think that plan can be
> > >> > > >> interpreted in various ways - it does not really say whether the initial
> > >> > > >> list of built-in methods should be in some C array, or already in a proper
> > >> > > >> catalog.
> > >> > > >>
> > >> > > >> All I'm saying is it seems a bit weird to first implement dependencies
> > >> > > >> based on strange (mis)use of attcompression attribute, and then replace
> > >> > > >> it with a proper catalog. My understanding is those patches are expected
> > >> > > >> to be committable one by one, but the attcompression approach seems a
> > >> > > >> bit too hacky to me - not sure I'd want to commit that ...
> > >> > > >
> > >> > > >Okay, I will change this.  So I will make create a new catalog
> > >> > > >pg_compression and add the entry for two built-in compression methods
> > >> > > >from the very first patch.
> > >> > > >
> > >> > >
> > >> > > OK.
> > >>
> > >> I have changed the first 2 patches, basically, now we are providing a
> > >> new catalog pg_compression and the pg_attribute is storing the oid of
> > >> the compression method.  The patches still need some cleanup and there
> > >> is also one open comment that for index we should use its table
> > >> compression.
> > >>
> > >> I am still working on the preserve patch.  For preserving the
> > >> compression method I am planning to convert the attcompression field
> > >> to the oidvector so that we can store the oid of the preserve method
> > >> also.  I am not sure whether we can access this oidvector as a fixed
> > >> part of the FormData_pg_attribute or not.  The reason is that for
> > >> building the tuple descriptor, we need to give the size of the fixed
> > >> part (#define ATTRIBUTE_FIXED_PART_SIZE \
> > >> (offsetof(FormData_pg_attribute,attcompression) + sizeof(Oid))).  But
> > >> if we convert this to the oidvector then we don't know the size of the
> > >> fixed part.  Am I missing something?
> > >
> > >I could think of two solutions here
> > >Sol1.
> > >Make the first oid of the oidvector as part of the fixed size, like below
> > >#define ATTRIBUTE_FIXED_PART_SIZE \
> > >(offsetof(FormData_pg_attribute, attcompression) + OidVectorSize(1))
> > >
> > >Sol2:
> > >Keep attcompression as oid only and for the preserve list, adds
> > >another field in the variable part which will be of type oidvector.  I
> > >think most of the time we need to access the current compression
> > >method and with this solution, we will be able to access that as part
> > >of the tuple desc.
> > >
> >
> > And is the oidvector actually needed? If we have the extra catalog,
> > can't we track this simply using the regular dependencies? So we'd have
> > the attcompression OID of the current compression method, and the
> > preserved values would be tracked in pg_depend.
>
> Right, we can do that as well.  Actually, the preserved list need to
> be accessed only in case of ALTER TABLE SET COMPRESSION and INSERT
> INTO SELECT * FROM queries.  So in such cases, I think it is okay to
> get the preserved compression oids from pg_depends.
I have worked on this patch, so as discussed now I am maintaining the
preserved compression methods using dependency.  Still PRESERVE ALL
syntax is not supported, I will work on that part.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

v6-0001-Built-in-compression-method.patch (271K) Download Attachment
v6-0003-Add-support-for-PRESERVE.patch (54K) Download Attachment
v6-0002-alter-table-set-compression.patch (16K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Custom compression methods

Tomas Vondra-4
On Mon, Oct 12, 2020 at 02:28:43PM +0530, Dilip Kumar wrote:
>
>> ...
>
>I have worked on this patch, so as discussed now I am maintaining the
>preserved compression methods using dependency.  Still PRESERVE ALL
>syntax is not supported, I will work on that part.
>

Cool, I'll take a look. What's your opinion on doing it this way? Do you
think it's cleaner / more elegant, or is it something contrary to what
the dependencies are meant to do?


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: Custom compression methods

Dilip Kumar-2
On Mon, Oct 12, 2020 at 7:32 PM Tomas Vondra
<[hidden email]> wrote:

>
> On Mon, Oct 12, 2020 at 02:28:43PM +0530, Dilip Kumar wrote:
> >
> >> ...
> >
> >I have worked on this patch, so as discussed now I am maintaining the
> >preserved compression methods using dependency.  Still PRESERVE ALL
> >syntax is not supported, I will work on that part.
> >
>
> Cool, I'll take a look. What's your opinion on doing it this way? Do you
> think it's cleaner / more elegant, or is it something contrary to what
> the dependencies are meant to do?

I think this looks much cleaner.  Moreover, I feel that once we start
supporting the custom compression methods then we anyway have to
maintain the dependency so using that for finding the preserved
compression method is good option.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: Custom compression methods

Dilip Kumar-2
On Tue, Oct 13, 2020 at 10:30 AM Dilip Kumar <[hidden email]> wrote:

>
> On Mon, Oct 12, 2020 at 7:32 PM Tomas Vondra
> <[hidden email]> wrote:
> >
> > On Mon, Oct 12, 2020 at 02:28:43PM +0530, Dilip Kumar wrote:
> > >
> > >> ...
> > >
> > >I have worked on this patch, so as discussed now I am maintaining the
> > >preserved compression methods using dependency.  Still PRESERVE ALL
> > >syntax is not supported, I will work on that part.
> > >
> >
> > Cool, I'll take a look. What's your opinion on doing it this way? Do you
> > think it's cleaner / more elegant, or is it something contrary to what
> > the dependencies are meant to do?
>
> I think this looks much cleaner.  Moreover, I feel that once we start
> supporting the custom compression methods then we anyway have to
> maintain the dependency so using that for finding the preserved
> compression method is good option.
I have also implemented the next set of patches.
0004 -> Provide a way to create custom compression methods
0005 -> Extention to implement lz4 as a custom compression method.

A pending list of items:
1. Provide support for handling the compression option
- As discussed up thread I will store the compression option of the
latest compression method in a new field in pg_atrribute table
2. As of now I have kept zlib as the second built-in option and lz4 as
a custom compression extension.  In Offlist discussion with Robert, he
suggested that we should keep lz4 as the built-in method and we can
move zlib as an extension because lz4 is faster than zlib so better to
keep that as the built-in method.  So in the next version, I will
change that.  Any different opinion on this?
3. Improve the documentation, especially for create_compression_method.
4. By default support table compression method for the index.
5. Support the PRESERVE ALL option so that we can preserve all
existing lists of compression methods without providing the whole
list.
6. Cleanup of 0004 and 0005 as they are still WIP.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

v7-0002-alter-table-set-compression.patch (16K) Download Attachment
v7-0001-Built-in-compression-method.patch (271K) Download Attachment
v7-0003-Add-support-for-PRESERVE.patch (54K) Download Attachment
v7-0005-new-compression-method-extension-for-lz4_WIP.patch (12K) Download Attachment
v7-0004-Create-custom-compression-methods_WIP.patch (51K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Custom compression methods

Dilip Kumar-2
On Sat, Oct 17, 2020 at 11:34 AM Dilip Kumar <[hidden email]> wrote:

>
> On Tue, Oct 13, 2020 at 10:30 AM Dilip Kumar <[hidden email]> wrote:
> >
> > On Mon, Oct 12, 2020 at 7:32 PM Tomas Vondra
> > <[hidden email]> wrote:
> > >
> > > On Mon, Oct 12, 2020 at 02:28:43PM +0530, Dilip Kumar wrote:
> > > >
> > > >> ...
> > > >
> > > >I have worked on this patch, so as discussed now I am maintaining the
> > > >preserved compression methods using dependency.  Still PRESERVE ALL
> > > >syntax is not supported, I will work on that part.
> > > >
> > >
> > > Cool, I'll take a look. What's your opinion on doing it this way? Do you
> > > think it's cleaner / more elegant, or is it something contrary to what
> > > the dependencies are meant to do?
> >
> > I think this looks much cleaner.  Moreover, I feel that once we start
> > supporting the custom compression methods then we anyway have to
> > maintain the dependency so using that for finding the preserved
> > compression method is good option.
>
> I have also implemented the next set of patches.
> 0004 -> Provide a way to create custom compression methods
> 0005 -> Extention to implement lz4 as a custom compression method.
In the updated version I have worked on some of the listed items

> A pending list of items:
> 1. Provide support for handling the compression option
> - As discussed up thread I will store the compression option of the
> latest compression method in a new field in pg_atrribute table
> 2. As of now I have kept zlib as the second built-in option and lz4 as
> a custom compression extension.  In Offlist discussion with Robert, he
> suggested that we should keep lz4 as the built-in method and we can
> move zlib as an extension because lz4 is faster than zlib so better to
> keep that as the built-in method.  So in the next version, I will
> change that.  Any different opinion on this?
Done

> 3. Improve the documentation, especially for create_compression_method.
> 4. By default support table compression method for the index.

Done

> 5. Support the PRESERVE ALL option so that we can preserve all
> existing lists of compression methods without providing the whole
> list.

1,3,5 points are still pending.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

v8-0002-alter-table-set-compression.patch (17K) Download Attachment
v8-0001-Built-in-compression-method.patch (275K) Download Attachment
v8-0004-Create-custom-compression-methods.patch (53K) Download Attachment
v8-0003-Add-support-for-PRESERVE.patch (54K) Download Attachment
v8-0005-new-compression-method-extension-for-zlib.patch (13K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Custom compression methods

Robert Haas
In reply to this post by Tomas Vondra-4
On Thu, Oct 8, 2020 at 5:54 PM Tomas Vondra
<[hidden email]> wrote:
> And is the oidvector actually needed? If we have the extra catalog,
> can't we track this simply using the regular dependencies? So we'd have
> the attcompression OID of the current compression method, and the
> preserved values would be tracked in pg_depend.

If we go that route, we have to be sure that no such dependencies can
exist for any other reason. Otherwise, there would be confusion about
whether the dependency was there because values of that type were
being preserved in the table, or whether it was for the hypothetical
other reason. Now, admittedly, I can't quite think how that would
happen. For example, if the attribute default expression somehow
embedded a reference to a compression AM, that wouldn't cause this
problem, because the dependency would be on the attribute default
rather than the attribute itself. So maybe it's fine.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply | Threaded
Open this post in threaded view
|

Re: Custom compression methods

Tomas Vondra-4
In reply to this post by Dilip Kumar-2
On Wed, Oct 21, 2020 at 01:59:50PM +0530, Dilip Kumar wrote:

>On Sat, Oct 17, 2020 at 11:34 AM Dilip Kumar <[hidden email]> wrote:
>>
>> On Tue, Oct 13, 2020 at 10:30 AM Dilip Kumar <[hidden email]> wrote:
>> >
>> > On Mon, Oct 12, 2020 at 7:32 PM Tomas Vondra
>> > <[hidden email]> wrote:
>> > >
>> > > On Mon, Oct 12, 2020 at 02:28:43PM +0530, Dilip Kumar wrote:
>> > > >
>> > > >> ...
>> > > >
>> > > >I have worked on this patch, so as discussed now I am maintaining the
>> > > >preserved compression methods using dependency.  Still PRESERVE ALL
>> > > >syntax is not supported, I will work on that part.
>> > > >
>> > >
>> > > Cool, I'll take a look. What's your opinion on doing it this way? Do you
>> > > think it's cleaner / more elegant, or is it something contrary to what
>> > > the dependencies are meant to do?
>> >
>> > I think this looks much cleaner.  Moreover, I feel that once we start
>> > supporting the custom compression methods then we anyway have to
>> > maintain the dependency so using that for finding the preserved
>> > compression method is good option.
>>
>> I have also implemented the next set of patches.
>> 0004 -> Provide a way to create custom compression methods
>> 0005 -> Extention to implement lz4 as a custom compression method.
>
>In the updated version I have worked on some of the listed items
>> A pending list of items:
>> 1. Provide support for handling the compression option
>> - As discussed up thread I will store the compression option of the
>> latest compression method in a new field in pg_atrribute table
>> 2. As of now I have kept zlib as the second built-in option and lz4 as
>> a custom compression extension.  In Offlist discussion with Robert, he
>> suggested that we should keep lz4 as the built-in method and we can
>> move zlib as an extension because lz4 is faster than zlib so better to
>> keep that as the built-in method.  So in the next version, I will
>> change that.  Any different opinion on this?
>
>Done
>
>> 3. Improve the documentation, especially for create_compression_method.
>> 4. By default support table compression method for the index.
>
>Done
>
>> 5. Support the PRESERVE ALL option so that we can preserve all
>> existing lists of compression methods without providing the whole
>> list.
>
>1,3,5 points are still pending.
>

Thanks. I took a quick look at the patches and I think it seems fine. I
have one question, though - toast_compress_datum contains this code:


     /* Call the actual compression function */
     tmp = cmroutine->cmcompress((const struct varlena *) value);
     if (!tmp)
         return PointerGetDatum(NULL);


Shouldn't this really throw an error instead? I mean, if the compression
library returns NULL, isn't that an error?


regards

>--
>Regards,
>Dilip Kumar
>EnterpriseDB: http://www.enterprisedb.com








Reply | Threaded
Open this post in threaded view
|

Re: Custom compression methods

Dilip Kumar-2
On Thu, Oct 22, 2020 at 2:11 AM Tomas Vondra
<[hidden email]> wrote:

>
> On Wed, Oct 21, 2020 at 01:59:50PM +0530, Dilip Kumar wrote:
> >On Sat, Oct 17, 2020 at 11:34 AM Dilip Kumar <[hidden email]> wrote:
> >>
> >> On Tue, Oct 13, 2020 at 10:30 AM Dilip Kumar <[hidden email]> wrote:
> >> >
> >> > On Mon, Oct 12, 2020 at 7:32 PM Tomas Vondra
> >> > <[hidden email]> wrote:
> >> > >
> >> > > On Mon, Oct 12, 2020 at 02:28:43PM +0530, Dilip Kumar wrote:
> >> > > >
> >> > > >> ...
> >> > > >
> >> > > >I have worked on this patch, so as discussed now I am maintaining the
> >> > > >preserved compression methods using dependency.  Still PRESERVE ALL
> >> > > >syntax is not supported, I will work on that part.
> >> > > >
> >> > >
> >> > > Cool, I'll take a look. What's your opinion on doing it this way? Do you
> >> > > think it's cleaner / more elegant, or is it something contrary to what
> >> > > the dependencies are meant to do?
> >> >
> >> > I think this looks much cleaner.  Moreover, I feel that once we start
> >> > supporting the custom compression methods then we anyway have to
> >> > maintain the dependency so using that for finding the preserved
> >> > compression method is good option.
> >>
> >> I have also implemented the next set of patches.
> >> 0004 -> Provide a way to create custom compression methods
> >> 0005 -> Extention to implement lz4 as a custom compression method.
> >
> >In the updated version I have worked on some of the listed items
> >> A pending list of items:
> >> 1. Provide support for handling the compression option
> >> - As discussed up thread I will store the compression option of the
> >> latest compression method in a new field in pg_atrribute table
> >> 2. As of now I have kept zlib as the second built-in option and lz4 as
> >> a custom compression extension.  In Offlist discussion with Robert, he
> >> suggested that we should keep lz4 as the built-in method and we can
> >> move zlib as an extension because lz4 is faster than zlib so better to
> >> keep that as the built-in method.  So in the next version, I will
> >> change that.  Any different opinion on this?
> >
> >Done
> >
> >> 3. Improve the documentation, especially for create_compression_method.
> >> 4. By default support table compression method for the index.
> >
> >Done
> >
> >> 5. Support the PRESERVE ALL option so that we can preserve all
> >> existing lists of compression methods without providing the whole
> >> list.
> >
> >1,3,5 points are still pending.
> >
>
> Thanks. I took a quick look at the patches and I think it seems fine. I
> have one question, though - toast_compress_datum contains this code:
>
>
>      /* Call the actual compression function */
>      tmp = cmroutine->cmcompress((const struct varlena *) value);
>      if (!tmp)
>          return PointerGetDatum(NULL);
>
>
> Shouldn't this really throw an error instead? I mean, if the compression
> library returns NULL, isn't that an error?

I don't think that we can throw an error here because pglz_compress
might return -1 if it finds that it can not reduce the size of the
data and we consider such data as "incompressible data" and return
NULL.  In such a case the caller will try to compress another
attribute of the tuple.  I think we can handle such cases in the
specific handler functions.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: Custom compression methods

Dilip Kumar-2
In reply to this post by Robert Haas
On Wed, Oct 21, 2020 at 8:51 PM Robert Haas <[hidden email]> wrote:

>
> On Thu, Oct 8, 2020 at 5:54 PM Tomas Vondra
> <[hidden email]> wrote:
> > And is the oidvector actually needed? If we have the extra catalog,
> > can't we track this simply using the regular dependencies? So we'd have
> > the attcompression OID of the current compression method, and the
> > preserved values would be tracked in pg_depend.
>
> If we go that route, we have to be sure that no such dependencies can
> exist for any other reason. Otherwise, there would be confusion about
> whether the dependency was there because values of that type were
> being preserved in the table, or whether it was for the hypothetical
> other reason. Now, admittedly, I can't quite think how that would
> happen. For example, if the attribute default expression somehow
> embedded a reference to a compression AM, that wouldn't cause this
> problem, because the dependency would be on the attribute default
> rather than the attribute itself. So maybe it's fine.

Yeah, and moreover in the new patchset, we are storing the compression
methods in the new catalog 'pg_compression' instead of merging with
the pg_am.  So I think only for the preserve purpose we will maintain
the attribute -> pg_compression dependency so it should be fine.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


1 ... 678910