[Proposal] Page Compression for OLTP


[Proposal] Page Compression for OLTP

chenhj
Hello hackers!

This email proposes implementing OLTP-oriented page compression in PostgreSQL.

There are currently some alternative ways to achieve compression, such as using a file system that supports compression.
However, that approach depends on a specific file system, is not suitable for all deployment environments, and increases the complexity of deployment and maintenance.

I hope this compression work can meet the following goals:
1. In most scenarios, the compressed size of the table is less than 50% of the original table
2. Mainly oriented to OLTP scenarios, with a relatively small performance impact on workloads with frequent reads and writes
3. Does not rely on special external software or hardware that is difficult to obtain
4. Friendly to application developers and database administrators
5. The changes required to PostgreSQL are small


I have noticed some previous discussion and work related to compression, but none of it meets the above goals.
For example:
1. Use sparse files [1]
   This is also the implementation method of MySQL 5.7's transparent page compression. However, sparse files may generate a lot of fragmentation inside the file system, and the "compressed" data files will be restored to their original size after physical backup and restore, unless the backup and recovery tools also support sparse files.
2. Use the table access method interface (TAM) (pg_cryogen, zedstore) [2] [3]
   Neither storage engine is geared towards OLTP scenarios. It is better to make relatively small modifications to the existing code of the heap engine (mainly by modifying md.c and fd.c).


The methods proposed by Postgres Pro Enterprise CFS [4] and by Nikolay P [5] are close to my needs.
However, I would like to propose a slightly different implementation plan, one that does not require space reclamation. Any suggestions are welcome.

# Premise assumption
1. Most of the pages in the compressed table can be compressed to less than 50% of the original size
   As long as an algorithm with a relatively high compression ratio (such as zlib or zstd) is used, this first point should be easy to meet, unless the table stores already-compressed data such as images.
2. The compression ratios of most pages in the same table are relatively close


# Page storage

Configure 3 files for storing compressed data for each segment of each main fork. The main fork segment file (for example: 123456.2) still exists, but its size is 0.

- Compressed data file (for example: 123456.2.cd)
   Used to store the compressed pages. The block size of this file is configurable at the table level, but it can only be 1/2, 1/4 or 1/8 of BLOCKSZ.
- Compress overflow address file (for example: 123456.2.coa)
   When a page cannot be compressed to less than the compressed block size, this file stores the addresses of the overflow blocks.
- Compress overflow data file (for example: 123456.2.cod)
   When a page cannot be compressed to less than the compressed block size, this file stores the overflow blocks.
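The address arithmetic implied by this layout is simple, since every block owns exactly one fixed-size slot per file. A minimal sketch (the helper names are illustrative, not from the proposal):

```c
#include <stdint.h>

/* Byte offset of block `blkno` inside the compressed data file (.cd):
 * every block owns exactly one slot of cblock_size bytes, where
 * cblock_size is the table-level compressed block size
 * (BLOCKSZ/2, /4 or /8). */
static inline int64_t
cd_offset(uint32_t blkno, uint32_t cblock_size)
{
    return (int64_t) blkno * cblock_size;
}

/* Byte offset of an overflow block inside the compress overflow data
 * file (.cod). The address file (.coa) stores overflow block numbers
 * + 1, so 0 means "no overflow block assigned"; subtract 1 before
 * computing the offset. */
static inline int64_t
cod_offset(uint32_t stored_ovflno, uint32_t cblock_size)
{
    return (int64_t) (stored_ovflno - 1) * cblock_size;
}
```

With a 4KB compressed block size, block 3's first half lives at offset 12288 of the .cd file, and an overflow entry stored as "1" maps to offset 0 of the .cod file.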

The following is an example when the compressed block size is 4K, which is 1/2 of BLOCKSZ.

## Scenario 1: The compressed size of the original page (including the header of the compressed page) is less than or equal to the compressed block size (4KB)

Compressed data files(123456.2.cd):
 0       1       2
 +=======+=======+=======+
 | data0 | data1 | data2 |  
 +=======+=======+=======+
->|   4K  |<-


## Scenario 2: The compressed size of the original page (including the header of the compressed page) is larger than the compressed block size (4KB)

If the compressed size of the original page (page 3 below) is greater than 4KB, it will not be compressed. The first 4KB of the original page is stored in the compressed data file, and the last 4KB is stored in the compress overflow data file.

Compressed data files(123456.2.cd):

 0       1       2       3
 +=======+=======+=======+=========+
 | data0 | data1 | data2 | data3_1 |
 +=======+=======+=======+=========+
                       ->| 1st 4K  |<-

Compress overflow address file(123456.2.coa):
The compress overflow address file stores, for each block, the block number of the compress overflow block assigned to it, plus 1.
The compressed block size and the number of allocated blocks in the compress overflow data file are stored in the head of the compress overflow address file.

         0       1       2       3
 +=======+=======+=======+=======+=======+
 | head  |       |       |       |   1   |
 +=======+=======+=======+=======+===|===+
                                     |
                                     |
Compress overflow data file:          |
      _______________________________|
     |
 0   |    1       2       3
 +===|=====+=========+==========+=========+
 | data3_2 |         |          |         |
 +=========+=========+==========+=========+
->| 2nd 4K  |<-


If the compressed block size is 1/4 or 1/8 of BLOCKSZ, each block that fails to compress may require multiple compress overflow blocks.
The following is an example when the compressed block size is 2K, which is 1/4 of BLOCKSZ.

## Scenario 3: The compressed size of the original page (including the header of the compressed page) is larger than 2KB(compressed page block size) but less than 6KB (BLOCKSZ - compressed page block size )

In this case, data files will store compressed data, and at least 2KB storage space can be saved.

Compressed data files(123456.2.cd):

 0       1       2       3
 +=======+=======+=======+=========+
 | data0 | data1 | data2 | data3_1 |
 +=======+=======+=======+=========+
                       ->| 1st 2K  |<-

Compress overflow address file(123456.2.coa):

         0       1       2       3
 +=======+=======+=======+=======+=======+
 | head  |       |       |       | 1,2   |
 +=======+=======+=======+=======+===|===+
                                     |
                                     |
Compress overflow data file :         |
      _______________________________|
     |
 0   |     1         2          3
 +===|=====+=========+==========+=========+
 | data3_2 | data3_3 |          |         |
 +=========+=========+==========+=========+
 | 2nd 2K  | 3rd 2K  |          |         |


## Scenario 4: The compressed size of the original page (including the header of the compressed page) is larger than 6KB (BLOCKSZ - compressed page block size )

In this case, the data files store the original data (8KB), the same as in Scenario 2.

Compressed data files(123456.2.cd):

 0       1       2       3
 +=======+=======+=======+=========+
 | data0 | data1 | data2 | data3_1 |
 +=======+=======+=======+=========+
                       ->| 1st 2K  |<-

Compress overflow address file(123456.2.coa):

         0       1       2       3
 +=======+=======+=======+=======+=======+
 | head  |       |       |       | 1,2,3 |
 +=======+=======+=======+=======+===|===+
                                     |
                                     |
Compress overflow data file :         |
      _______________________________|
     |
 0   |     1         2          3
 +===|=====+=========+==========+=========+
 | data3_2 | data3_3 | data3_4  |         |
 +=========+=========+==========+=========+
 | 2nd 2K  | 3rd 2K  | 4th 2K   |         |


# How to distinguish between compressed and uncompressed blocks in compressed data files?

PostgreSQL heap pages have a uniform header. At first, I considered adding compression-related flags to this header.
However, there is a problem: when the compressed size of a page changes so that it switches from compressed format to uncompressed format, or vice versa,
the header of the original page must be modified, which requires not only recalculating the checksum but also updating the buffer.

However, I noticed that the first 4 bytes of each page are the high part of pg_lsn.
Therefore, an oversized LSN value that cannot appear in a real environment can be used as a magic number marking compressed pages.
The envisaged compressed header is defined as follows:

typedef struct
{
   uint32   magic;      /* compressed page magic number, must be 0xfffffffe */
   uint8    algorithm;  /* 1=pglz, 2=zstd ... */
   uint8    flag;       /* reserved */
   uint16   size;       /* size after compression */
} PageCompressHead;
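A reader could use this magic value to decide whether a block fetched from the compressed data file needs decompression. A minimal sketch under the assumption above (the struct mirrors the definition in this email; a real page's first 4 bytes are the high part of its LSN, which never reaches 0xfffffffe in practice):

```c
#include <stdint.h>
#include <string.h>

#define PAGE_COMPRESS_MAGIC 0xfffffffeU

typedef struct
{
    uint32_t magic;     /* must be 0xfffffffe for compressed pages */
    uint8_t  algorithm; /* 1=pglz, 2=zstd ... */
    uint8_t  flag;      /* reserved */
    uint16_t size;      /* size after compression */
} PageCompressHead;

/* Return 1 if the buffer read from the compressed data file holds a
 * compressed page, 0 if it holds (part of) an uncompressed page. */
static int
is_compressed_block(const char *buf)
{
    PageCompressHead head;

    /* copy out the header to avoid unaligned access */
    memcpy(&head, buf, sizeof(head));
    return head.magic == PAGE_COMPRESS_MAGIC;
}
```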


# How to manage block space in compress overflow data files?

Once overflow block x in the compress overflow data file is allocated to block a, it will always belong to block a, even if block a later compresses to a smaller size and overflow block x is no longer used.

This approach simplifies the space management of compress overflow blocks, but fragmentation may occur, especially when the compressed block size is 1/4 or 1/8 of BLOCKSZ.
However, fragmentation only appears when the compressed size of the same block changes greatly and frequently.

Consider the following situation. If only one record is inserted into a page before it is written to disk, the compression ratio will be very high, and only one compressed block is required.
As new rows are written later, the number of required compressed blocks grows to 2, 3, 4, ... These overflow blocks are not allocated at once, so they are likely not sequential in the compress overflow data file, resulting in more fragmentation.

We can avoid this problem with a table-level setting for the number of pre-allocated compressed blocks.
When the number of compressed blocks required by a compressed page is less than this value, space is allocated according to the pre-allocation setting.

And no matter how severe the fragmentation, the total space occupied by the compressed table cannot exceed that of the original uncompressed table.
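The pre-allocation policy can be sketched as a single allocation rule (a hypothetical helper, not code from the proposal):

```c
#include <stdint.h>

/* Compute how many compressed blocks to allocate for a page whose
 * compressed size is `compressed_size`, given the compressed block
 * size and the table-level pre-allocation setting. */
static uint32_t
blocks_to_allocate(uint32_t compressed_size, uint32_t cblock_size,
                   uint32_t prealloc_blocks)
{
    /* round up to whole compressed blocks */
    uint32_t needed = (compressed_size + cblock_size - 1) / cblock_size;

    /* never allocate fewer than the pre-allocation setting, so a page
     * that grows later can reuse contiguous, already-owned blocks */
    return needed > prealloc_blocks ? needed : prealloc_blocks;
}
```

For example, with a 2KB compressed block size and pre-allocation set to 2, a page that compresses to 100 bytes still gets 2 blocks, while a page that compresses to 5000 bytes gets 3.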

# Impact on other parts of PostgreSQL?
1. pg_basebackup / pg_checksum needs to handle checksum verification according to the new compression format
2. pg_rewind needs to handle the copying of data blocks according to the new compression format

# Problems
This solution simplifies compressed storage space management, but it also has the following drawbacks:
1. The space saved by compression is limited by the compressed block size.
  For example, when the compressed block size is set to 4KB, at most 50% of the space can be saved.
  For insert-only tables, the compressed block size can be set to BLOCKSZ / 8 to alleviate this problem; but for write-heavy scenarios, this easily generates fragmentation and increases the number of IOs.
2. When accessing a page that compresses into one compressed block, only one IO is required; but accessing a page that does not requires multiple IOs.
  Generally it is 3 IOs; since the address file is very small and should almost always be in memory, it is effectively 2 IOs not counting the address file.


I think the above issues are a necessary trade-off. Any suggestions are welcome.

# references
[1] https://www.postgresql.org/message-id/flat/op.ux8if71gcigqcu%40soyouz
[2] https://www.postgresql.org/message-id/CAONYFtNDNghfatnrhOJMOT=BXNbEiobHFABt2sx_cn2=5t=1_Q@...
[3] https://www.postgresql.org/message-id/CALfoeiuF-m5jg51mJUPm5GN8u396o5sA2AF5N97vTRAEDYac7w%40mail.gmail.com
[4] https://postgrespro.com/docs/enterprise/9.6/cfs
[5] https://www.postgresql.org/message-id/flat/11996861554042351%40iva4-dd95b404a60b.qloud-c.yandex.net


Best Regards
Chen Huajun

Re: [Proposal] Page Compression for OLTP

Fabien COELHO-3

Hello,

My 0.02€, some of which may just show some misunderstanding on my part:

  - you have clearly given quite a few thoughts about the what and how…
    which makes your message an interesting read.

  - Could this be proposed as some kind of extension, provided that enough
    hooks are available? ISTM that foreign tables and/or alternative
    storage engine (aka ACCESS METHOD) provide convenient APIs which could
    fit the need for these? Or are they not appropriate? You seem to
    suggest that there are not.

    If not, what could be done to improve API to allow what you are seeking
    to do? Maybe you need a somehow lower-level programmable API which does
    not exist already, or at least is not exported already, but could be
    specified and implemented with limited effort? Basically you would like
    to read/write pg pages to somewhere, and then there is the syncing
    issue to consider. Maybe such a "page storage" API could provide
    benefit for some specialized hardware, eg persistent memory stores,
    so there would be more reason to define it anyway? I think it might
    be valuable to give it some thoughts.

  - Could you maybe elaborate on how your plan differs from [4] and [5]?

  - Have you consider keeping page headers and compressing tuple data
    only?

  - I'm not sure there is a point in going below the underlying file
    system blocksize, quite often 4 KiB? Or maybe yes? Or is there
    a benefit to aim at 1/4 even if most pages overflow?

  - ISTM that your approach entails 3 "files". Could it be done with 2?
    I'd suggest that the possible overflow pointers (coa) could be part of
    the headers so that when reading the 3.1 page, then the header would
    tell where to find the overflow 3.2, without requiring an additional
    independent structure with very small data in it, most of it zeros.
    Possibly this is not possible, because it would require some available
    space in standard headers when the page is not compressible, and
    there is not enough. Maybe creating a little room for that in
    existing headers (4 bytes could be enough?) would be a good compromise.
    Hmmm. Maybe the approach I suggest would only work for 1/2 compression,
    but not for other target ratios, but I think it could be made to work
    if the pointer can entail several blocks in the overflow table.

  - If one page is split in 3 parts, could it create problems on syncing
    if 1/3 or 2/3 of the parts get written? But maybe that is manageable
    with WAL, as it would note that the page was not synced, and that is
    enough for replay.

  - I'm unclear how you would manage the 2 representations of a page in
    memory. I'm afraid that a 8 KiB page compressed to 4 KiB would
    basically take 12 KiB, i.e. reduce the available memory for caching
    purposes. Hmmm. The current status is that a written page probably
    takes 16 KiB, once in shared buffers and once in the system caches,
    so it would be an improvement anyway.

  - Maybe the compressed and overflow table could become bloated somehow,
    which would require a vacuuming implementation and add to the
    complexity of the implementation?

  - External tools should be available to allow page inspection, eg for
    debugging purposes.

--
Fabien.

Re: [Proposal] Page Compression for OLTP

chenhj
Sorry, There may be a problem with the display format of the previous mail. So resend it
----------------------------------------------------------------------------------------------------

At 2020-05-21 15:04:55, "Fabien COELHO" <[hidden email]> wrote:

>
>Hello,
>
>My 0.02€, some of which may just show some misunderstanding on my part:
>
>  - Could this be proposed as some kind of extension, provided that enough
>    hooks are available? ISTM that foreign tables and/or alternative
>    storage engine (aka ACCESS METHOD) provide convenient APIs which could
>    fit the need for these? Or are they not appropriate? You seem to
>    suggest that there are not.
>
>    If not, what could be done to improve API to allow what you are seeking
>    to do? Maybe you need a somehow lower-level programmable API which does
>    not exist already, or at least is not exported already, but could be
>    specified and implemented with limited effort? Basically you would like
>    to read/write pg pages to somewhere, and then there is the syncing
>    issue to consider. Maybe such a "page storage" API could provide
>    benefit for some specialized hardware, eg persistent memory stores,
>    so there would be more reason to define it anyway? I think it might
>    be valuable to give it some thoughts.

Thank you for giving so many comments.
In my opinion, developing a foreign table or a new storage engine requires a lot of extra work beyond compression itself.
A similar explanation was mentioned in Nikolay P's email.

The "page storage" API may be a good choice, and I will consider it, but I have not yet figured out how to implement it.

>  - Could you maybe elaborate on how your plan differs from [4] and [5]?

My solution is similar to CFS, and it is also embedded in the file access layer (fd.c, md.c) to realize the mapping from block number to the corresponding file and location where compressed data is stored.

However, the most important difference is that I hope to avoid the need for GC through the design of the page layout.

https://www.postgresql.org/message-id/flat/11996861554042351%40iva4-dd95b404a60b.qloud-c.yandex.net

>> The most difficult thing in CFS development is certainly
>> defragmentation. In CFS it is done using background garbage collection,
>> by one or one
>> GC worker processes. The main challenges were to minimize its
>> interaction with normal work of the system, make it fault tolerant and
>> prevent unlimited growth of data segments.

>> CFS is not introducing its own storage manager, it is mostly embedded in
>> existed Postgres file access layer (fd.c, md.c). It allows to reused
>> code responsible for mapping relations and file descriptors cache. As it
>> was recently discussed in hackers, it may be good idea to separate the
>> questions "how to map blocks to filenames and offsets" and "how to
>> actually perform IO". In this it will be easier to implement compressed
>> storage manager.


>  - Have you consider keeping page headers and compressing tuple data
>    only?

In that case, we must add some additional information in the page header to identify whether this is a compressed page or an uncompressed page.
When a compressed page becomes an uncompressed page, or vice versa, an uncompressed page becomes a compressed page, the original page header must be modified.
This is unacceptable because it requires modifying the shared buffer and recalculating the checksum.

However, it should be feasible to put this flag in the compressed address file.
The problem with this is that even if a page only occupies one compressed block, the address file must still be read, going from 1 IO to 2 IOs.
Since the address file is very small and accessing it is basically a memory access, this cost may not be as large as I had imagined.

>  - I'm not sure there is a point in going below the underlying file
>    system blocksize, quite often 4 KiB? Or maybe yes? Or is there
>    a benefit to aim at 1/4 even if most pages overflow?

My solution is mainly optimized for scenarios where the original page can be compressed into a single compressed block.
Storing the original page across multiple compressed blocks suits scenarios that are not particularly performance-sensitive but care more about the compression ratio, such as cold data.

In addition, users can also choose to compile PostgreSQL with 16KB or 32KB BLOCKSZ.

>  - ISTM that your approach entails 3 "files". Could it be done with 2?
>    I'd suggest that the possible overflow pointers (coa) could be part of
>    the headers so that when reading the 3.1 page, then the header would
>    tell where to find the overflow 3.2, without requiring an additional
>    independent structure with very small data in it, most of it zeros.
>    Possibly this is not possible, because it would require some available
>    space in standard headers when the is page is not compressible, and
>    there is not enough. Maybe creating a little room for that in
>    existing headers (4 bytes could be enough?) would be a good compromise.
>    Hmmm. Maybe the approach I suggest would only work for 1/2 compression,
>    but not for other target ratios, but I think it could be made to work
>    if the pointer can entail several blocks in the overflow table.

My solution is optimized for scenarios where the original page can be compressed into a single compressed block.
In this scenario, only 1 IO is required for reading and writing, and there is no need to access the overflow address file and overflow data file.

Your suggestion reminded me that the performance difference may not be as big as I thought (testing and comparison are required). If I give up the pursuit of "only one IO", the file layout can be simplified.

For example, it can be simplified to the following form with only two files (the example below uses a compressed block size of 4KB).

# Page storage(Plan B)

Use the compress address file to store the compressed block pointer, and the Compress data file stores the compressed block data.

compress address file:
 
        0       1       2       3
+=======+=======+=======+=======+=======+
| head  |  1    |    2  | 3,4   |   5   |
+=======+=======+=======+=======+=======+

The compress address file saves the following information for each page:

- Compressed size (when the size is 0, it means uncompressed format)
- Block numbers occupied in the compress data file

By the way, I want to access the compress address file through mmap, just like snapfs
https://github.com/postgrespro/snapfs/blob/pg_snap/src/backend/storage/file/snapfs.c

Compress data file:

0         1         2          3         4
+=========+=========+==========+=========+=========+
| data1   | data2   | data3_1  | data3_2 | data4   |
+=========+=========+==========+=========+=========+
|    4K   |


# Page storage(Plan C)

Further, since the size of the compress address file is fixed, the address file and the data file can also be combined into one file:

        0       1       2     123071    0         1         2
+=======+=======+=======+     +=======+=========+=========+
| head  |  1    |    2  | ... |       | data1   | data2   | ...  
+=======+=======+=======+     +=======+=========+=========+
  head  |              address        |          data          |

If the difference in performance is so negligible, maybe Plan C is a better solution. (Are there any other problems?)

>
>  - Maybe the compressed and overflow table could become bloated somehow,
>    which would require the vaccuuming implementation and add to the
>    complexity of the implementation?
>

Vacuuming is what I try to avoid.

As I explained in the first email, even without vacuum, bloat should not become a serious problem.

>>However, the fragment will only appear in the scene where the size of the same block is frequently changed greatly after compression.
>>...
>>And no matter how severe the fragmentation, the total space occupied by the compressed table cannot be larger than the original table before compression.

Best Regards
Chen Huajun

Re: [Proposal] Page Compression for OLTP

chenhj
Hi hackers,


> # Page storage(Plan C)
>
> Further, since the size of the compress address file is fixed, the above address file and data file can also be combined into one file
>
>         0       1       2     123071    0         1         2
> +=======+=======+=======+     +=======+=========+=========+
> | head  |  1    |    2  | ... |       | data1   | data2   | ...  
> +=======+=======+=======+     +=======+=========+=========+
>   head  |              address        |          data          |

I made a prototype according to the above storage method. Any suggestions are welcome.

# Page compress file storage related definitions

/*
 * Layout of a page compress file:
 *
 * - PageCompressHeader
 * - PageCompressAddr[]
 * - chunks of PageCompressData
 */
typedef struct PageCompressHeader
{
     pg_atomic_uint32  nblocks;          /* number of total blocks in this segment */
     pg_atomic_uint32  allocated_chunks; /* number of total allocated chunks in data area */
     uint16            chunk_size;       /* size of each chunk, must be 1/2, 1/4 or 1/8 of BLCKSZ */
     uint8             algorithm;        /* compress algorithm, 1=pglz, 2=lz4 */
} PageCompressHeader;

typedef struct PageCompressAddr
{
     uint8             nchunks;          /* number of chunks for this block */
     uint8             allocated_chunks; /* number of allocated chunks for this block */

     /* variable-length field: 1-based chunk numbers for this block;
      * the size of the array must be 2, 4 or 8 */
     pc_chunk_number_t chunknos[FLEXIBLE_ARRAY_MEMBER];
} PageCompressAddr;

typedef struct PageCompressData
{
     char    page_header[SizeOfPageHeaderData]; /* page header */
     uint32  size;                              /* size of compressed data */
     char    data[FLEXIBLE_ARRAY_MEMBER];       /* compressed page, except for the page header */
} PageCompressData;


# Usage

Whether to use compression is set through the storage parameters of tables and indexes:

- compress_type
 Sets whether to compress and which compression algorithm to use. Supported values: none, pglz, zstd.

- compress_chunk_size

 A chunk is the smallest unit of storage space allocated for a compressed page.
 The chunk size can only be 1/2, 1/4 or 1/8 of BLCKSZ.

- compress_prealloc_chunks

  The number of chunks pre-allocated for each page. The maximum allowed value is BLCKSZ/compress_chunk_size - 1.
  If a compressed page needs fewer than `compress_prealloc_chunks` chunks, `compress_prealloc_chunks` chunks are
  still allocated, to avoid storage fragmentation when the page later needs more space.


# Sample

## requirement

- zstd

## build

./configure --with-zstd
make
make install

## create compressed table and index

create table tb1(id int,c1 text);
create table tb1_zstd(id int,c1 text) with(compress_type=zstd,compress_chunk_size=1024);
create table tb1_zstd_4(id int,c1 text) with(compress_type=zstd,compress_chunk_size=1024,compress_prealloc_chunks=4);

create index tb1_idx_id on tb1(id);
create index tb1_idx_id_zstd on tb1(id) with(compress_type=zstd,compress_chunk_size=1024);
create index tb1_idx_id_zstd_4 on tb1(id) with(compress_type=zstd,compress_chunk_size=1024,compress_prealloc_chunks=4);

create index tb1_idx_c1 on tb1(c1);
create index tb1_idx_c1_zstd on tb1(c1) with(compress_type=zstd,compress_chunk_size=1024);
create index tb1_idx_c1_zstd_4 on tb1(c1) with(compress_type=zstd,compress_chunk_size=1024,compress_prealloc_chunks=4);

insert into tb1 select generate_series(1,1000000),md5(random()::text);
insert into tb1_zstd select generate_series(1,1000000),md5(random()::text);
insert into tb1_zstd_4 select generate_series(1,1000000),md5(random()::text);

## show size of table and index

postgres=# \d+
                            List of relations
Schema |    Name    | Type  |  Owner   | Persistence | Size  | Description
--------+------------+-------+----------+-------------+-------+-------------
public | tb1        | table | postgres | permanent   | 65 MB |
public | tb1_zstd   | table | postgres | permanent   | 37 MB |
public | tb1_zstd_4 | table | postgres | permanent   | 37 MB |
(3 rows)

postgres=# \di+
                                    List of relations
Schema |       Name        | Type  |  Owner   | Table | Persistence | Size  | Description
--------+-------------------+-------+----------+-------+-------------+-------+-------------
public | tb1_idx_c1        | index | postgres | tb1   | permanent   | 73 MB |
public | tb1_idx_c1_zstd   | index | postgres | tb1   | permanent   | 36 MB |
public | tb1_idx_c1_zstd_4 | index | postgres | tb1   | permanent   | 41 MB |
public | tb1_idx_id        | index | postgres | tb1   | permanent   | 21 MB |
public | tb1_idx_id_zstd   | index | postgres | tb1   | permanent   | 13 MB |
public | tb1_idx_id_zstd_4 | index | postgres | tb1   | permanent   | 15 MB |
(6 rows)


# pgbench performance testing (TPC-B)

Compress the pgbench_accounts table and its primary key index.
The compression parameters are (compress_type=zstd, compress_chunk_size=1024).
Then compare the performance difference between the original table and the compressed table.

test command:

   pgbench -i -s 1000
   pgbench -n -T 300 -c 16 -j 16 db1

tps comparison:

   original table  : 20081
   compressed table: 19984


Comparison of storage space:

                       original    compressed(before benchmark)   compressed(after benchmark*)
pgbench_accounts        13 GB       1660 MB                        1711 MB
pgbench_accounts_pkey   2142 MB     738 MB                         816 MB

*note: After the benchmark, some compressed pages need 2 chunks to store their data


# TODO list

1. support setting of compress level
2. support ALTER TABLE/INDEX xx set(...)
3. support checksum in pg_basebackup, pg_checksum and replication
4. support pg_rewind
5. information output for compressed pages' metadata


# Problem

When compress_chunk_size=1024, about 4MB of space is needed to store the addresses,
so small files become larger after compression.

The solution under consideration is as follows:
Split the addresses and the data of compressed pages into two files, and allocate disk space for the address file on demand as well, extending it by at least one BLCKSZ at a time.


Best Regards
Chen Huajun

page_compress_prototype.patch (92K) Download Attachment