Custom compression methods


Re: Custom compression methods

Álvaro Herrera
Tomas Vondra wrote:

> On 11/30/2017 04:20 PM, Ildus Kurbangaliev wrote:

> > CREATE COMPRESSION METHOD ts1 FOR tsvector HANDLER
> > tsvector_compression_handler;
>
> Understood. Good to know you've considered it, and I agree it doesn't
> need to be there from the start (which makes the patch simpler).

Just passing by, but wouldn't this fit in the ACCESS METHOD group of
commands?  So this could be simplified down to
CREATE ACCESS METHOD ts1 TYPE COMPRESSION
we have that for indexes and there are patches flying for heap storage,
sequences, etc.  I think that's simpler than trying to invent all new
commands here.  Then (in a future patch) you can use ALTER TYPE to
define compression for that type, or even add a column-level option to
reference a specific compression method.
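
Purely as a sketch (none of this syntax exists today; the HANDLER clause just
follows the existing CREATE ACCESS METHOD grammar, and the handler name is
borrowed from the example quoted above), that could look like:

    CREATE ACCESS METHOD ts1 TYPE COMPRESSION
        HANDLER tsvector_compression_handler;

    -- and, in a later patch, something along the lines of:
    ALTER TYPE tsvector SET COMPRESSION ts1;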

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Custom compression methods

Tomas Vondra-4


On 11/30/2017 09:51 PM, Alvaro Herrera wrote:

> Tomas Vondra wrote:
>
>> On 11/30/2017 04:20 PM, Ildus Kurbangaliev wrote:
>
>>> CREATE COMPRESSION METHOD ts1 FOR tsvector HANDLER
>>> tsvector_compression_handler;
>>
>> Understood. Good to know you've considered it, and I agree it doesn't
>> need to be there from the start (which makes the patch simpler).
>
> Just passing by, but wouldn't this fit in the ACCESS METHOD group of
> commands?  So this could be simplified down to
> CREATE ACCESS METHOD ts1 TYPE COMPRESSION
> we have that for indexes and there are patches flying for heap storage,
> sequences, etc.  I think that's simpler than trying to invent all new
> commands here.  Then (in a future patch) you can use ALTER TYPE to
> define compression for that type, or even add a column-level option to
> reference a specific compression method.
>

I think that would conflate two very different concepts. In my mind,
access methods define how rows are stored. Compression methods are an
orthogonal concept, e.g. you can compress a value (using a custom
compression algorithm) and store it in an index (using whatever access
method it's using). So not only do access methods operate on rows (while
compression operates on varlena values), but you can also combine those two
things together. I don't see how you could do that if both are defined
as "access methods" ...

Furthermore, the "TYPE" in CREATE COMPRESSION method was meant to
restrict the compression algorithm to a particular data type (so, if it
relies on tsvector, you can't apply it to text columns). Which is very
different from "TYPE COMPRESSION" in CREATE ACCESS METHOD.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Custom compression methods

Robert Haas
In reply to this post by Tomas Vondra-4
On Thu, Nov 30, 2017 at 2:47 PM, Tomas Vondra
<[hidden email]> wrote:
> OK. I think it's a nice use case (and nice gains on the compression
> ratio), demonstrating the datatype-aware compression. The question is
> why shouldn't this be built into the datatypes directly?

Tomas, thanks for running benchmarks of this.  I was surprised to see
how little improvement there was from other modern compression
methods, although lz4 did appear to be a modest win on both size and
speed.  But I share your intuition that a lot of the interesting work
is in datatype-specific compression algorithms.  I have noticed in a
number of papers that I've read that teaching other parts of the
system to operate directly on the compressed data, especially for
column stores, is a critical performance optimization; of course, that
only makes sense if the compression is datatype-specific.  I don't
know exactly what that means for the design of this patch, though.

As a general point, no matter which way you go, you have to somehow
deal with on-disk compatibility.  If you want to build in compression
to the datatype itself, you need to find at least one bit someplace to
mark the fact that you applied built-in compression.  If you want to
build it in as a separate facility, you need to denote the compression
used someplace else.  I haven't looked at how this patch does it, but
the proposal in the past has been to add a value to vartag_external.
One nice thing about the latter method is that it can be used for any
data type generically, regardless of how much bit-space is available
in the data type representation itself.  It's realistically hard to
think of a data-type that has no bit space available anywhere but is
still subject to data-type specific compression; bytea definitionally
has no bit space but also can't benefit from special-purpose
compression, whereas even something like text could be handled by
starting the varlena with a NUL byte to indicate compressed data
following.  However, you'd have to come up with a different trick for
each data type.  Piggybacking on the TOAST machinery avoids that.  It
also implies that we only try to compress values that are "big", which
is probably desirable if we're talking about a kind of compression
that makes comprehending the value slower. Not all types of
compression do, cf. commit 145343534c153d1e6c3cff1fa1855787684d9a38,
and for those that don't it probably makes more sense to just build it
into the data type.

All of that is a somewhat separate question from whether we should
have CREATE / DROP COMPRESSION, though (or Alvaro's proposal of using
the ACCESS METHOD stuff instead).  Even if we agree that piggybacking
on TOAST is a good way to implement pluggable compression methods, it
doesn't follow that the compression method is something that should be
attached to the datatype from the outside; it could be built into it
in a deep way.  For example, "packed" varlenas (1-byte header) are a
form of compression, and the default functions for detoasting always
produced unpacked values, but the operators for the text data type
know how to operate on the packed representation.  That's sort of a
trivial example, but it might well be that there are other cases where
we can do something similar.  Maybe jsonb, for example, can compress
data in such a way that some of the jsonb functions can operate
directly on the compressed representation -- perhaps the number of
keys is easily visible, for example, or maybe more.  In this view of
the world, each data type should get to define its own compression
method (or methods) but they are hard-wired into the datatype and you
can't add more later, or if you do, you lose the advantages of the
hard-wired stuff.

BTW, another related concept that comes up a lot in discussions of
this area is that we could do a lot better compression of columns if
we had some place to store a per-column dictionary.  I don't really
know how to make that work.  We could have a catalog someplace that
stores an opaque blob for each column configured to use a compression
method, and let the compression method store whatever it likes in
there.  That's probably fine if you are compressing the whole table at
once and the blob is static thereafter.  But if you want to update
that blob as you see new column values there seem to be almost
insurmountable problems.
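
To make that concrete, such a catalog might look roughly like the sketch
below -- purely illustrative, every name in it is made up:

    -- hypothetical catalog, not something that exists or is being proposed verbatim
    CREATE TABLE pg_compression_dictionary (
        dictrelid   oid    NOT NULL,   -- table the column belongs to
        dictattnum  int2   NOT NULL,   -- attribute number of the column
        dictblob    bytea              -- opaque blob owned by the compression method
    );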

To be clear, I'm not trying to load this patch down with a requirement
to solve every problem in the universe.  On the other hand, I think it
would be easy to beat a patch like this into shape in a fairly
mechanical way and then commit-and-forget.  That might be leaving a
lot of money on the table; I'm glad you are thinking about the bigger
picture and hope that my thoughts here somehow contribute.

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Custom compression methods

Tomas Vondra-4
On 12/01/2017 03:23 PM, Robert Haas wrote:

> On Thu, Nov 30, 2017 at 2:47 PM, Tomas Vondra
> <[hidden email]> wrote:
>> OK. I think it's a nice use case (and nice gains on the compression
>> ratio), demonstrating the datatype-aware compression. The question is
>> why shouldn't this be built into the datatypes directly?
>
> Tomas, thanks for running benchmarks of this.  I was surprised to see
> how little improvement there was from other modern compression
> methods, although lz4 did appear to be a modest win on both size and
> speed.  But I share your intuition that a lot of the interesting work
> is in datatype-specific compression algorithms.  I have noticed in a
> number of papers that I've read that teaching other parts of the
> system to operate directly on the compressed data, especially for
> column stores, is a critical performance optimization; of course, that
> only makes sense if the compression is datatype-specific.  I don't
> know exactly what that means for the design of this patch, though.
>

It has very little impact on this patch, as it has nothing to do with
columnar storage. That is, each value is compressed independently.

Column stores exploit the fact that they get a vector of values,
compressed in some data-aware way. E.g. some form of RLE or dictionary
compression, which allows them to evaluate expressions on the compressed
vector. But that's irrelevant here, we only get row-by-row execution.

Note: The idea to build a dictionary for the whole jsonb column (which
this patch should allow) does not make it "columnar compression" in the
"column store" way. The executor will still get the decompressed value.

> As a general point, no matter which way you go, you have to somehow
> deal with on-disk compatibility.  If you want to build in compression
> to the datatype itself, you need to find at least one bit someplace to
> mark the fact that you applied built-in compression.  If you want to
> build it in as a separate facility, you need to denote the compression
> used someplace else.  I haven't looked at how this patch does it, but
> the proposal in the past has been to add a value to vartag_external.

AFAICS the patch does that by setting a bit in the varlena header, and
then adding the OID of the compression method after the varlena header. So
you get (varlena header + OID + data).

This has good and bad consequences.

Good: It's transparent for the datatype, so it does not have to worry
about the custom compression at all (and it may change arbitrarily).

Bad: It's transparent for the datatype, so it can't operate directly on
the compressed representation.

I don't think this is an argument against the patch, though. If the
datatype can support intelligent compression (and execution without
decompression), it has to be done in the datatype anyway.

> One nice thing about the latter method is that it can be used for any
> data type generically, regardless of how much bit-space is available
> in the data type representation itself.  It's realistically hard to
> think of a data-type that has no bit space available anywhere but is
> still subject to data-type specific compression; bytea definitionally
> has no bit space but also can't benefit from special-purpose
> compression, whereas even something like text could be handled by
> starting the varlena with a NUL byte to indicate compressed data
> following.  However, you'd have to come up with a different trick for
> each data type.  Piggybacking on the TOAST machinery avoids that.  It
> also implies that we only try to compress values that are "big", which
> is probably desirable if we're talking about a kind of compression
> that makes comprehending the value slower. Not all types of
> compression do, cf. commit 145343534c153d1e6c3cff1fa1855787684d9a38,
> and for those that don't it probably makes more sense to just build it
> into the data type.
>
> All of that is a somewhat separate question from whether we should
> have CREATE / DROP COMPRESSION, though (or Alvaro's proposal of using
> the ACCESS METHOD stuff instead).  Even if we agree that piggybacking
> on TOAST is a good way to implement pluggable compression methods, it
> doesn't follow that the compression method is something that should be
> attached to the datatype from the outside; it could be built into it
> in a deep way.  For example, "packed" varlenas (1-byte header) are a
> form of compression, and the default functions for detoasting always
> produced unpacked values, but the operators for the text data type
> know how to operate on the packed representation.  That's sort of a
> trivial example, but it might well be that there are other cases where
> we can do something similar.  Maybe jsonb, for example, can compress
> data in such a way that some of the jsonb functions can operate
> directly on the compressed representation -- perhaps the number of
> keys is easily visible, for example, or maybe more.  In this view of
> the world, each data type should get to define its own compression
> method (or methods) but they are hard-wired into the datatype and you
> can't add more later, or if you do, you lose the advantages of the
> hard-wired stuff.
>

I agree with these thoughts in general, but I'm not quite sure what is
your conclusion regarding the patch.

The patch allows us to define custom compression methods that are
entirely transparent for the datatype machinery, i.e. they allow compression
even for data types that did not consider compression at all. That seems
valuable to me.

Of course, if the same compression logic can be built into the datatype
itself, it may allow additional benefits (like execution on compressed
data directly).

I don't see these two approaches as conflicting.

>
> BTW, another related concept that comes up a lot in discussions of
> this area is that we could do a lot better compression of columns if
> we had some place to store a per-column dictionary.  I don't really
> know how to make that work.  We could have a catalog someplace that
> stores an opaque blob for each column configured to use a compression
> method, and let the compression method store whatever it likes in
> there.  That's probably fine if you are compressing the whole table at
> once and the blob is static thereafter.  But if you want to update
> that blob as you see new column values there seem to be almost
> insurmountable problems.
>

Well, that's kinda the idea behind the configure/drop methods in the
compression handler, and Ildus already did implement such dictionary
compression for the jsonb data type, see:

    https://github.com/postgrespro/jsonbd

Essentially that stores the dictionary in a table, managed by a bunch of
background workers.

>
> To be clear, I'm not trying to load this patch down with a requirement
> to solve every problem in the universe.  On the other hand, I think it
> would be easy to beat a patch like this into shape in a fairly
> mechanical way and then commit-and-forget.  That might be leaving a
> lot of money on the table; I'm glad you are thinking about the bigger
> picture and hope that my thoughts here somehow contribute.
>

Thanks ;-)


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Custom compression methods

Robert Haas
On Fri, Dec 1, 2017 at 10:18 AM, Tomas Vondra
<[hidden email]> wrote:
> It has very little impact on this patch, as it has nothing to do with
> columnar storage. That is, each value is compressed independently.

I understand that this patch is not about columnar storage, but I
think the idea that we may want to operate on the compressed data
directly is not only applicable to that case.

> I agree with these thoughts in general, but I'm not quite sure what is
> your conclusion regarding the patch.

I have not reached one.  Sometimes I like to discuss problems before
deciding what I think.  :-)

It does seem to me that the patch may be aiming at a relatively narrow
target in a fairly large problem space, but I don't know whether to
label that as short-sightedness or prudent incrementalism.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Custom compression methods

Álvaro Herrera
In reply to this post by Tomas Vondra-4
Tomas Vondra wrote:

> On 11/30/2017 09:51 PM, Alvaro Herrera wrote:

> > Just passing by, but wouldn't this fit in the ACCESS METHOD group of
> > commands?  So this could be simplified down to
> > CREATE ACCESS METHOD ts1 TYPE COMPRESSION
> > we have that for indexes and there are patches flying for heap storage,
> > sequences, etc.
>
> I think that would conflate two very different concepts. In my mind,
> access methods define how rows are stored.

In mine, they define how things are accessed (i.e. more general than
what you're thinking).  We *currently* use them to store rows [in
indexes], but there is no reason why we couldn't expand that.

So we group access methods in "types"; the current type we have is for
indexes, and methods in that type define how indexes are accessed.  This
new type would indicate how values would be compressed.  I disagree that
there is no parallel there.

I'm trying to avoid pointless proliferation of narrowly defined DDL
commands.

> Furthermore, the "TYPE" in CREATE COMPRESSION method was meant to
> restrict the compression algorithm to a particular data type (so, if it
> relies on tsvector, you can't apply it to text columns).

Yes, of course.  I'm saying that the "datatype" property of a
compression access method would be declared somewhere else, not in the
TYPE clause of the CREATE ACCESS METHOD command.  Perhaps it makes sense
to declare that a certain compression access method is good only for a
certain data type, and then you can put that in the options clause,
"CREATE ACCESS METHOD hyperz TYPE COMPRESSION WITH (type = tsvector)".
But many compression access methods would be general in nature and so
could be used for many datatypes (say, snappy).

To me it makes sense to say "let's create this method which is for data
compression" (CREATE ACCESS METHOD hyperz TYPE COMPRESSION) followed by
either "let's use this new compression method for the type tsvector"
(ALTER TYPE tsvector SET COMPRESSION hyperz) or "let's use this new
compression method for the column tc" (ALTER TABLE ALTER COLUMN tc SET
COMPRESSION hyperz).
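
Spelled out as a sequence (all of this is hypothetical syntax, and the table
name "t" is invented for the example):

    CREATE ACCESS METHOD hyperz TYPE COMPRESSION
        WITH (type = tsvector);                           -- optional datatype restriction

    ALTER TYPE tsvector SET COMPRESSION hyperz;           -- per-type setting
    ALTER TABLE t ALTER COLUMN tc SET COMPRESSION hyperz; -- per-column setting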

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Custom compression methods

Alvaro Herrera-9
In reply to this post by Ildus Kurbangaliev
Ildus Kurbangaliev wrote:

> If the table is big, decompression could take an eternity. That's why I
> decided only to disable it; the data could still be decompressed using the
> compression options.
>
> My idea was to keep compression options forever, since there will not
> be many of them in one database. Still, that requires that the extension is
> not removed.
>
> I will try to find a way to recompress the data first in case it moves
> to another table.

I think what you should do is add a dependency between a column that
compresses using a method, and that method.  So the method cannot be
dropped and leave compressed data behind.  Since the method is part of
the extension, the extension cannot be dropped either.  If you ALTER
the column so that it uses another compression method, then the table is
rewritten and the dependency is removed; once you do that for all the
columns that use the compression method, the compression method can be
dropped.

Maybe our dependency code needs to be extended in order to support this.
I think the current logic would drop the column if you were to do "DROP
COMPRESSION .. CASCADE", but I'm not sure we'd see that as a feature.
I'd rather have DROP COMPRESSION always fail instead until no columns
use it.  Let's hear other's opinions on this bit though.
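
In concrete terms, and with hypothetical syntax and object names (the exact
spelling of the commands is still open), the behavior I have in mind is:

    ALTER TABLE t ALTER COLUMN tc SET COMPRESSION hyperz;  -- records a dependency on hyperz
    DROP COMPRESSION METHOD hyperz;                        -- fails: column tc still uses it
    ALTER TABLE t ALTER COLUMN tc SET COMPRESSION pglz;    -- table rewrite, dependency removed
    DROP COMPRESSION METHOD hyperz;                        -- now allowed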

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Custom compression methods

Tomas Vondra-4


On 12/01/2017 08:48 PM, Alvaro Herrera wrote:

> Ildus Kurbangaliev wrote:
>
>> If the table is big, decompression could take an eternity. That's why I
>> decided only to disable it; the data could still be decompressed using the
>> compression options.
>>
>> My idea was to keep compression options forever, since there will not
>> be many of them in one database. Still, that requires that the extension is
>> not removed.
>>
>> I will try to find a way to recompress the data first in case it moves
>> to another table.
>
> I think what you should do is add a dependency between a column that
> compresses using a method, and that method.  So the method cannot be
> dropped and leave compressed data behind.  Since the method is part of
> the extension, the extension cannot be dropped either.  If you ALTER
> the column so that it uses another compression method, then the table is
> rewritten and the dependency is removed; once you do that for all the
> columns that use the compression method, the compression method can be
> dropped.
>

+1 to do the rewrite, just like for other similar ALTER TABLE commands

>
> Maybe our dependency code needs to be extended in order to support this.
> I think the current logic would drop the column if you were to do "DROP
> COMPRESSION .. CASCADE", but I'm not sure we'd see that as a feature.
> I'd rather have DROP COMPRESSION always fail instead until no columns
> use it.  Let's hear other's opinions on this bit though.
>

Why should this behave differently compared to data types? Seems quite
against POLA, if you ask me ...

If you want to remove the compression, you can do the SET NOT COMPRESSED
(or whatever syntax we end up using), and then DROP COMPRESSION METHOD.
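
That is, something along these lines (sketch only; the table and method names
are made up):

    ALTER TABLE t ALTER COLUMN tc SET NOT COMPRESSED;  -- rewrites the table, decompressing the data
    DROP COMPRESSION METHOD hyperz;                    -- nothing references it anymore, so it succeeds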


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Custom compression methods

Tomas Vondra-4
In reply to this post by Robert Haas
On 12/01/2017 08:20 PM, Robert Haas wrote:
> On Fri, Dec 1, 2017 at 10:18 AM, Tomas Vondra
> <[hidden email]> wrote:
>> It has very little impact on this patch, as it has nothing to do with
>> columnar storage. That is, each value is compressed independently.
>
> I understand that this patch is not about columnar storage, but I
> think the idea that we may want to operate on the compressed data
> directly is not only applicable to that case.
>

Yeah. To clarify, my point was that column stores benefit from
compressing many values at once, and then operating on this compressed
vector. That is not what this patch is doing (or can do), of course.

But I certainly do agree that if the compression can be integrated into
the data type, allowing processing on compressed representation, then
that will beat whatever this patch is doing, of course ...

>>
>> I agree with these thoughts in general, but I'm not quite sure
>> what is your conclusion regarding the patch.
>
> I have not reached one. Sometimes I like to discuss problems before
> deciding what I think. :-)
>

That's lame! Let's make decisions without discussion ;-)

>
> It does seem to me that the patch may be aiming at a relatively narrow
> target in a fairly large problem space, but I don't know whether to
> label that as short-sightedness or prudent incrementalism.
>

I don't know either. I don't think people will start switching their
text columns to lz4 just because they can, or because they get 4% space
reduction compared to pglz.

But the ability to build per-column dictionaries seems quite powerful, I
guess. And I don't think that can be easily built directly into JSONB,
because we don't have a way to provide information about the column
(i.e. how would you fetch the correct dictionary?).


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Custom compression methods

Robert Haas
In reply to this post by Álvaro Herrera
On Fri, Dec 1, 2017 at 2:38 PM, Alvaro Herrera <[hidden email]> wrote:
> In mine, they define how things are accessed (i.e. more general than
> what you're thinking).  We *currently* use them to store rows [in
> indexes], but there is no reason why we couldn't expand that.
>
> So we group access methods in "types"; the current type we have is for
> indexes, and methods in that type define how indexes are accessed.  This
> new type would indicate how values would be compressed.  I disagree that
> there is no parallel there.

+1.

> I'm trying to avoid pointless proliferation of narrowly defined DDL
> commands.

I also think that's an important goal.

> Yes, of course.  I'm saying that the "datatype" property of a
> compression access method would be declared somewhere else, not in the
> TYPE clause of the CREATE ACCESS METHOD command.  Perhaps it makes sense
> to declare that a certain compression access method is good only for a
> certain data type, and then you can put that in the options clause,
> "CREATE ACCESS METHOD hyperz TYPE COMPRESSION WITH (type = tsvector)".
> But many compression access methods would be general in nature and so
> could be used for many datatypes (say, snappy).
>
> To me it makes sense to say "let's create this method which is for data
> compression" (CREATE ACCESS METHOD hyperz TYPE COMPRESSION) followed by
> either "let's use this new compression method for the type tsvector"
> (ALTER TYPE tsvector SET COMPRESSION hyperz) or "let's use this new
> compression method for the column tc" (ALTER TABLE ALTER COLUMN tc SET
> COMPRESSION hyperz).

+1 to this, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Custom compression methods

Robert Haas
In reply to this post by Tomas Vondra-4
On Fri, Dec 1, 2017 at 4:06 PM, Tomas Vondra
<[hidden email]> wrote:
>>> I agree with these thoughts in general, but I'm not quite sure
>>> what is your conclusion regarding the patch.
>>
>> I have not reached one. Sometimes I like to discuss problems before
>> deciding what I think. :-)
>
> That's lame! Let's make decisions without discussion ;-)

Oh, right.  What was I thinking?

>> It does seem to me that the patch may be aiming at a relatively narrow
>> target in a fairly large problem space, but I don't know whether to
>> label that as short-sightedness or prudent incrementalism.
>
> I don't know either. I don't think people will start switching their
> text columns to lz4 just because they can, or because they get 4% space
> reduction compared to pglz.

Honestly, if we can give everybody a 4% space reduction by switching
to lz4, I think that's totally worth doing -- but let's not make
people choose it, let's make it the default going forward, and keep
pglz support around so we don't break pg_upgrade compatibility (and so
people can continue to choose it if for some reason it works better in
their use case).  That kind of improvement is nothing special in a
specific workload, but TOAST is a pretty general-purpose mechanism.  I
have become, through a few bitter experiences, a strong believer in
the value of trying to reduce our on-disk footprint, and knocking 4%
off the size of every TOAST table in the world does not sound
worthless to me -- even though context-aware compression can doubtless
do a lot better.

> But the ability to build per-column dictionaries seems quite powerful, I
> guess. And I don't think that can be easily built directly into JSONB,
> because we don't have a way to provide information about the column
> (i.e. how would you fetch the correct dictionary?).

That's definitely a problem, but I think we should mull it over a bit
more before giving up.  I have a few thoughts, but the part of my life
that doesn't happen on the PostgreSQL mailing list precludes
expounding on them right this minute.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Custom compression methods

Tomas Vondra-4
In reply to this post by Álvaro Herrera

On 12/01/2017 08:38 PM, Alvaro Herrera wrote:

> Tomas Vondra wrote:
>
>> On 11/30/2017 09:51 PM, Alvaro Herrera wrote:
>
>>> Just passing by, but wouldn't this fit in the ACCESS METHOD group of
>>> commands?  So this could be simplified down to
>>> CREATE ACCESS METHOD ts1 TYPE COMPRESSION
>>> we have that for indexes and there are patches flying for heap storage,
>>> sequences, etc.
>>
>> I think that would conflate two very different concepts. In my mind,
>> access methods define how rows are stored.
>
> In mine, they define how things are accessed (i.e. more general than
> what you're thinking).  We *currently* use them to store rows [in
> indexes], but there is no reason why we couldn't expand that.
>

Not sure I follow. My argument was not so much about whether the rows
are stored as rows or in some other (columnar) format, but that access
methods deal with "tuples" (i.e. rows in the "logical" sense). I assume
that even if we end up implementing other access method types, they will
still be tuple-based.

OTOH compression methods (at least as introduced by this patch) operate
on individual values, and have very little to do with access to the
value (in a sense it's a transparent thing).

>
> So we group access methods in "types"; the current type we have is for
> indexes, and methods in that type define how indexes are accessed.  This
> new type would indicate how values would be compressed.  I disagree that
> there is no parallel there.
>
> I'm trying to avoid pointless proliferation of narrowly defined DDL
> commands.
>

Of course, the opposite case is using the same DDL for very different
concepts (although I understand you don't see it that way).

But in fairness, I don't really care if we call this COMPRESSION METHOD
or ACCESS METHOD or DARTH VADER ...

>> Furthermore, the "TYPE" in CREATE COMPRESSION method was meant to
>> restrict the compression algorithm to a particular data type (so, if it
>> relies on tsvector, you can't apply it to text columns).
>
> Yes, of course.  I'm saying that the "datatype" property of a
> compression access method would be declared somewhere else, not in the
> TYPE clause of the CREATE ACCESS METHOD command.  Perhaps it makes sense
> to declare that a certain compression access method is good only for a
> certain data type, and then you can put that in the options clause,
> "CREATE ACCESS METHOD hyperz TYPE COMPRESSION WITH (type = tsvector)".
> But many compression access methods would be general in nature and so
> could be used for many datatypes (say, snappy).
>
> To me it makes sense to say "let's create this method which is for data
> compression" (CREATE ACCESS METHOD hyperz TYPE COMPRESSION) followed by
> either "let's use this new compression method for the type tsvector"
> (ALTER TYPE tsvector SET COMPRESSION hyperz) or "let's use this new
> compression method for the column tc" (ALTER TABLE ALTER COLUMN tc SET
> COMPRESSION hyperz).
>

The WITH syntax does not seem particularly pretty to me, TBH. I'd be
much happier with "TYPE tsvector" and leaving WITH for the options
specific to each compression method.

FWIW I think syntax is the least critical part of this patch. It's ~1%
of the patch, and the gram.y additions are rather trivial.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Custom compression methods

Andres Freund
In reply to this post by Robert Haas
On 2017-12-01 16:14:58 -0500, Robert Haas wrote:

> Honestly, if we can give everybody a 4% space reduction by switching
> to lz4, I think that's totally worth doing -- but let's not make
> people choose it, let's make it the default going forward, and keep
> pglz support around so we don't break pg_upgrade compatibility (and so
> people can continue to choose it if for some reason it works better in
> their use case).  That kind of improvement is nothing special in a
> specific workload, but TOAST is a pretty general-purpose mechanism.  I
> have become, through a few bitter experiences, a strong believer in
> the value of trying to reduce our on-disk footprint, and knocking 4%
> off the size of every TOAST table in the world does not sound
> worthless to me -- even though context-aware compression can doubtless
> do a lot better.

+1. It's also a lot faster, and I've seen way too many workloads with
50%+ time spent in pglz.

Greetings,

Andres Freund


Re: Custom compression methods

Álvaro Herrera
In reply to this post by Tomas Vondra-4
Tomas Vondra wrote:

> On 12/01/2017 08:48 PM, Alvaro Herrera wrote:

> > Maybe our dependency code needs to be extended in order to support this.
> > I think the current logic would drop the column if you were to do "DROP
> > COMPRESSION .. CASCADE", but I'm not sure we'd see that as a feature.
> > I'd rather have DROP COMPRESSION always fail instead until no columns
> > use it.  Let's hear other's opinions on this bit though.
>
> Why should this behave differently compared to data types? Seems quite
> against POLA, if you ask me ...

OK, DROP TYPE sounds good enough precedent, so +1 on that.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Custom compression methods

Ildus Kurbangaliev
In reply to this post by Álvaro Herrera
On Fri, 1 Dec 2017 16:38:42 -0300
Alvaro Herrera <[hidden email]> wrote:

>
> To me it makes sense to say "let's create this method which is for
> data compression" (CREATE ACCESS METHOD hyperz TYPE COMPRESSION)
> followed by either "let's use this new compression method for the
> type tsvector" (ALTER TYPE tsvector SET COMPRESSION hyperz) or "let's
> use this new compression method for the column tc" (ALTER TABLE ALTER
> COLUMN tc SET COMPRESSION hyperz).
>

Hi, I think if CREATE ACCESS METHOD can be used for compression, then it
could be nicer than CREATE COMPRESSION METHOD. I just don't know whether
compression fits as an access method or not. Anyway, it's easy to change
the syntax and I don't mind doing it, if that is necessary for the patch
to be committed.

--
----
Regards,
Ildus Kurbangaliev


Re: Custom compression methods

Tomas Vondra-4
In reply to this post by Andres Freund
On 12/01/2017 10:52 PM, Andres Freund wrote:

> On 2017-12-01 16:14:58 -0500, Robert Haas wrote:
>> Honestly, if we can give everybody a 4% space reduction by
>> switching to lz4, I think that's totally worth doing -- but let's
>> not make people choose it, let's make it the default going forward,
>> and keep pglz support around so we don't break pg_upgrade
>> compatibility (and so people can continue to choose it if for some
>> reason it works better in their use case). That kind of improvement
>> is nothing special in a specific workload, but TOAST is a pretty
>> general-purpose mechanism. I have become, through a few bitter
>> experiences, a strong believer in the value of trying to reduce our
>> on-disk footprint, and knocking 4% off the size of every TOAST
>> table in the world does not sound worthless to me -- even though
>> context-aware compression can doubtless do a lot better.
>
> +1. It's also a lot faster, and I've seen way too many workloads
> with 50%+ time spent in pglz.
>

TBH the 4% figure is something I mostly made up (I'm fake news!). On the
mailing list archive (which I believe is pretty compressible) I observed
something like 2.5% size reduction with lz4 compared to pglz, at least
with the compression levels I've used ...

Other algorithms (e.g. zstd) got significantly better compression (25%)
compared to pglz, but in exchange for longer compression. I'm sure we
could lower compression level to make it faster, but that will of course
hurt the compression ratio.

I don't think switching to a different compression algorithm is a way
forward - it was proposed and explored repeatedly in the past, and every
time it failed for a number of reasons, most of which are still valid.


Firstly, it's going to be quite hard (or perhaps impossible) to find an
algorithm that is "universally better" than pglz. Some algorithms do
work better for text documents, some for binary blobs, etc. I don't
think there's a win-win option.

Sure, there are workloads where pglz performs poorly (I've seen such
cases too), but IMHO that's more an argument for the custom compression
method approach. pglz gives you good default compression in most cases,
and you can change it for columns where it matters, and where a
different space/time trade-off makes sense.


Secondly, all the previous attempts ran into some legal issues, i.e.
licensing and/or patents. Maybe the situation changed since then (no
idea, haven't looked into that), but in the past the "pluggable"
approach was proposed as a way to address this.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Custom compression methods

konstantin knizhnik

On Dec 2, 2017, at 6:04 PM, Tomas Vondra wrote:

> On 12/01/2017 10:52 PM, Andres Freund wrote:
>> On 2017-12-01 16:14:58 -0500, Robert Haas wrote:
>>> Honestly, if we can give everybody a 4% space reduction by
>>> switching to lz4, I think that's totally worth doing -- but let's
>>> not make people choose it, let's make it the default going forward,
>>> and keep pglz support around so we don't break pg_upgrade
>>> compatibility (and so people can continue to choose it if for some
>>> reason it works better in their use case). That kind of improvement
>>> is nothing special in a specific workload, but TOAST is a pretty
>>> general-purpose mechanism. I have become, through a few bitter
>>> experiences, a strong believer in the value of trying to reduce our
>>> on-disk footprint, and knocking 4% off the size of every TOAST
>>> table in the world does not sound worthless to me -- even though
>>> context-aware compression can doubtless do a lot better.
>>
>> +1. It's also a lot faster, and I've seen way too many workloads
>> with 50%+ time spent in pglz.
>
> TBH the 4% figure is something I mostly made up (I'm fake news!). On the
> mailing list archive (which I believe is pretty compressible) I observed
> something like 2.5% size reduction with lz4 compared to pglz, at least
> with the compression levels I've used ...
>
> Other algorithms (e.g. zstd) got significantly better compression (25%)
> compared to pglz, but in exchange for longer compression. I'm sure we
> could lower compression level to make it faster, but that will of course
> hurt the compression ratio.
>
> I don't think switching to a different compression algorithm is a way
> forward - it was proposed and explored repeatedly in the past, and every
> time it failed for a number of reasons, most of which are still valid.
>
> Firstly, it's going to be quite hard (or perhaps impossible) to find an
> algorithm that is "universally better" than pglz. Some algorithms do
> work better for text documents, some for binary blobs, etc. I don't
> think there's a win-win option.
>
> Sure, there are workloads where pglz performs poorly (I've seen such
> cases too), but IMHO that's more an argument for the custom compression
> method approach. pglz gives you good default compression in most cases,
> and you can change it for columns where it matters, and where a
> different space/time trade-off makes sense.
>
> Secondly, all the previous attempts ran into some legal issues, i.e.
> licensing and/or patents. Maybe the situation changed since then (no
> idea, haven't looked into that), but in the past the "pluggable"
> approach was proposed as a way to address this.



Maybe it will be interesting for you to see the following results of applying page-level compression (CFS in PgPro-EE) to pgbench data:

Configuration           Size (Gb)   Time (sec)
vanilla postgres            15.31           92
zlib (default level)         2.37          284
zlib (best speed)            2.43          191
postgres internal lz         3.89          214
lz4                          4.12           95
snappy (google)              5.18           99
lzfse (apple)                2.80         1099
zstd (facebook)              1.69          125

All algorithms (except zlib) were used with the best-speed option: using a
better compression level usually has a fairly small impact on the compression
ratio (<30%), but can significantly increase the time (several times).
Certainly pgbench is not the best candidate for testing compression
algorithms: it generates a lot of artificial and redundant data. But we
measured it also on real customers' data, and zstd still seems to be the best
compression method: it provides good compression with the smallest CPU
overhead.


Re: Custom compression methods

Andres Freund
In reply to this post by Tomas Vondra-4
Hi,

On 2017-12-02 16:04:52 +0100, Tomas Vondra wrote:
> Firstly, it's going to be quite hard (or perhaps impossible) to find an
> algorithm that is "universally better" than pglz. Some algorithms do
> work better for text documents, some for binary blobs, etc. I don't
> think there's a win-win option.

lz4 is pretty much there.

> Secondly, all the previous attempts ran into some legal issues, i.e.
> licensing and/or patents. Maybe the situation changed since then (no
> idea, haven't looked into that), but in the past the "pluggable"
> approach was proposed as a way to address this.

Those were pretty bogus.  I think we're not doing our users a favor if
they have to download some external projects, then fiddle with things,
just to avoid a compression algorithm that's been known bad for at
least 5+ years.  If we have a decent algorithm in-core *and* then allow
extensibility, that's one thing, but keeping the bad one and telling forks
"please take our users with this code we give you" is ...

Greetings,

Andres Freund


Re: Custom compression methods

Tomas Vondra-4
In reply to this post by konstantin knizhnik

On 12/02/2017 09:24 PM, konstantin knizhnik wrote:

>
> On Dec 2, 2017, at 6:04 PM, Tomas Vondra wrote:
>
>> On 12/01/2017 10:52 PM, Andres Freund wrote:
>> ...
>>
>> Other algorithms (e.g. zstd) got significantly better compression (25%)
>> compared to pglz, but in exchange for longer compression. I'm sure we
>> could lower compression level to make it faster, but that will of course
>> hurt the compression ratio.
>>
>> I don't think switching to a different compression algorithm is a way
>> forward - it was proposed and explored repeatedly in the past, and every
>> time it failed for a number of reasons, most of which are still valid.
>>
>>
>> Firstly, it's going to be quite hard (or perhaps impossible) to
>> find an algorithm that is "universally better" than pglz. Some
>> algorithms do work better for text documents, some for binary
>> blobs, etc. I don't think there's a win-win option.
>>
>> Sure, there are workloads where pglz performs poorly (I've seen
>> such cases too), but IMHO that's more an argument for the custom
>> compression method approach. pglz gives you good default
>> compression in most cases, and you can change it for columns where
>> it matters, and where a different space/time trade-off makes
>> sense.
>>
>>
>> Secondly, all the previous attempts ran into some legal issues, i.e.
>> licensing and/or patents. Maybe the situation changed since then (no
>> idea, haven't looked into that), but in the past the "pluggable"
>> approach was proposed as a way to address this.
>>
>>
>
> Maybe it will be interesting for you to see the following results
> of applying page-level compression (CFS in PgPro-EE) to pgbench
> data:
>

I don't follow. If I understand what CFS does correctly (and I'm mostly
guessing here, because I haven't seen the code published anywhere, and I
assume it's proprietary), it essentially compresses whole 8kB blocks.

I don't know whether it reorganizes the data into columnar format first in
some way (to make it more "columnar" and thus more compressible), which would
make it somewhat similar to page-level compression in Oracle.

But it's clearly a very different approach from what the patch aims to
improve (compressing individual varlena values).

>
> All algorithms (except zlib) were used with the best-speed option: using
> a better compression level usually has a fairly small impact on the
> compression ratio (<30%), but can significantly increase the time
> (several times). Certainly pgbench is not the best candidate for
> testing compression algorithms: it generates a lot of artificial and
> redundant data. But we measured it also on real customers' data, and
> zstd still seems to be the best compression method: it provides good
> compression with the smallest CPU overhead.
>

I think this really depends on the dataset, and drawing conclusions
based on a single test is somewhat crazy. Especially when it's synthetic
pgbench data with lots of inherent redundancy - sequential IDs, ...

My takeaway from the results is rather that page-level compression may
be very beneficial in some cases, although I wonder how much of that can
be gained by simply using a compressed filesystem (thus making it
transparent to PostgreSQL).


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Custom compression methods

Tomas Vondra-4
In reply to this post by Andres Freund
On 12/02/2017 09:38 PM, Andres Freund wrote:

> Hi,
>
> On 2017-12-02 16:04:52 +0100, Tomas Vondra wrote:
>> Firstly, it's going to be quite hard (or perhaps impossible) to find an
>> algorithm that is "universally better" than pglz. Some algorithms do
>> work better for text documents, some for binary blobs, etc. I don't
>> think there's a win-win option.
>
> lz4 is pretty much there.
>

That's a matter of opinion, I guess. It's a solid compression algorithm,
that's for sure ...

>> Secondly, all the previous attempts ran into some legal issues, i.e.
>> licensing and/or patents. Maybe the situation changed since then (no
>> idea, haven't looked into that), but in the past the "pluggable"
>> approach was proposed as a way to address this.
>
> Those were pretty bogus.

IANAL so I don't dare to judge the bogusness of such claims. I assume if
we made it optional (e.g. a configure/initdb option), it'd be much less of
an issue. Of course, that has disadvantages too (because when you
compile/init with one algorithm, and then find something else would work
better for your data, you have to start from scratch).

>
> I think we're not doing our users a favor if they have to download
> some external projects, then fiddle with things, just to avoid
> a compression algorithm that's been known bad for at least 5+ years.
> If we have a decent algorithm in-core *and* then allow extensibility,
> that's one thing, but keeping the bad one and telling forks "please take
> our users with this code we give you" is ...
>

I don't understand what exactly is your issue with external projects,
TBH. I think extensibility is one of the great strengths of Postgres.
It's not all rainbows and unicorns, of course, and it has costs too.

FWIW I don't think pglz is a "known bad" algorithm. Perhaps there are
cases where other algorithms (e.g. lz4) are running circles around it,
particularly when it comes to decompression speed, but I wouldn't say
it's "known bad".

Not sure which forks you're talking about ...

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
