pglz performance

pglz performance

Andrey Borodin-2
Hi hackers!

I was reviewing Paul Ramsey's TOAST patch[0] and noticed that there is a lot of room for improvement in the performance of pglz compression and decompression.

Together with Vladimir, we started investigating ways to speed up byte copying and eventually created a test suite[1] to measure compression and decompression performance.
This is an extension with a single function, test_pglz(), which runs tests across different:
1. Data payloads
2. Compression implementations
3. Decompression implementations

Currently we mostly test decompression improvements against two WAL segments and one data file taken from a pgbench-generated database. Any suggestions for more relevant data payloads are very welcome.
My laptop tests show that our decompression implementation [2] can be 15% to 50% faster.
I've also noticed that compression is extremely slow, about 30 times slower than decompression. I believe we can do something about it.

We focus only on speeding up the existing codec, without considering other compression algorithms.

Any comments are much appreciated.

Most important questions are:
1. What are relevant data sets?
2. What are relevant CPUs? I have only Xeon-based servers and a few laptops/desktops with Intel CPUs
3. If compression is 30 times slower, should we focus on compression rather than decompression?

Best regards, Andrey Borodin.


[0] https://www.postgresql.org/message-id/flat/CANP8%2BjKcGj-JYzEawS%2BCUZnfeGKq4T5LswcswMP4GUHeZEP1ag%40mail.gmail.com
[1] https://github.com/x4m/test_pglz
[2] https://www.postgresql.org/message-id/C2D8E5D5-3E83-469B-8751-1C7877C2A5F2%40yandex-team.ru

Re: pglz performance

Michael Paquier-2
On Mon, May 13, 2019 at 07:45:59AM +0500, Andrey Borodin wrote:
> I was reviewing Paul Ramsey's TOAST patch[0] and noticed that there
> is a big room for improvement in performance of pglz compression and
> decompression.

Yes, I believe so too.  pglz is a huge CPU-consumer when it comes to
compression compared to more modern algos like lz4.

> With Vladimir we started to investigate ways to boost byte copying
> and eventually created test suite[1] to investigate performance of
> compression and decompression.  This is an extension with single
> function test_pglz() which performs tests for different:
> 1. Data payloads
> 2. Compression implementations
> 3. Decompression implementations

Cool.  I got something rather similar in my wallet of plugins:
https://github.com/michaelpq/pg_plugins/tree/master/compress_test
This is something I worked on mainly for FPW compression in WAL.

> Currently we test mostly decompression improvements against two WALs
> and one data file taken from pgbench-generated database. Any
> suggestion on more relevant data payloads are very welcome.

Text strings made of random data and variable length?  For any test of
this kind I think that it is good to focus on the performance of the
low-level calls, even going as far as a simple C wrapper on top of the
pglz APIs to test only the performance and not have extra PG-related
overhead like palloc() which can be a barrier.  Focusing on strings of
lengths of 1kB up to 16kB may be an idea of size, and it is important
to keep the same uncompressed strings for performance comparison.
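
One way to get variable-length strings that stay identical across runs is a fixed-seed generator. The following is a minimal sketch, not part of either test suite: the names `next_u32` and `fill_random_text`, the xorshift32 generator and the printable-ASCII alphabet are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MIN_LEN 1024            /* 1kB lower bound from the suggestion above */
#define MAX_LEN (16 * 1024)     /* 16kB upper bound */

/* xorshift32: tiny deterministic PRNG; the state must never be zero. */
uint32_t
next_u32(uint32_t *state)
{
    uint32_t x = *state;

    assert(x != 0);
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/*
 * Fill buf (capacity MAX_LEN) with printable ASCII of a pseudo-random
 * length in [MIN_LEN, MAX_LEN] and return that length.  The buffer is not
 * NUL-terminated.  The same starting state always yields the same string.
 */
size_t
fill_random_text(uint32_t *state, char *buf)
{
    size_t len = MIN_LEN + next_u32(state) % (MAX_LEN - MIN_LEN + 1);
    size_t i;

    for (i = 0; i < len; i++)
        buf[i] = (char) (' ' + next_u32(state) % 95);   /* 0x20..0x7E */
    return len;
}
```

Seeding every benchmark run with the same value reproduces the same inputs without storing them. Uniformly random characters contain few long repeats, so strings like these probably exercise mostly the literal path of the codec.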

> My laptop tests show that our decompression implementation [2] can
> be from 15% to 50% faster.  Also I've noted that compression is
> extremely slow, ~30 times slower than decompression. I believe we
> can do something about it.

That's nice.

> We focus only on boosting existing codec without any considerations
> of other compression algorithms.

There is this as well.  A couple of algorithms have a license
compatible with Postgres, but it may be more simple to just improve
pglz.  A 10%~20% improvement is something worth doing.

> Most important questions are:
> 1. What are relevant data sets?
> 2. What are relevant CPUs? I have only XEON-based servers and few
> laptops\desktops with intel CPUs
> 3. If compression is 30 times slower, should we better focus on
> compression instead of decompression?

Decompression can matter a lot for mostly-read workloads and
compression can become a bottleneck for heavy-insert loads, so
improving compression or decompression should be two separate
problems, not two problems linked.  Any improvement in one or the
other, or even both, is nice to have.
--
Michael

Re: pglz performance

Andrey Borodin-2


> On 13 May 2019, at 12:14, Michael Paquier <[hidden email]> wrote:
>
>> Currently we test mostly decompression improvements against two WALs
>> and one data file taken from pgbench-generated database. Any
>> suggestion on more relevant data payloads are very welcome.
>
> Text strings made of random data and variable length?
Like a text corpus?

>  For any test of
> this kind I think that it is good to focus on the performance of the
> low-level calls, even going as far as a simple C wrapper on top of the
> pglz APIs to test only the performance and not have extra PG-related
> overhead like palloc() which can be a barrier.
Our test_pglz extension measures only the time spent in the codec itself: it does a warm-up run first, and all allocations are done before measurement.

>  Focusing on strings of
> lengths of 1kB up to 16kB may be an idea of size, and it is important
> to keep the same uncompressed strings for performance comparison.
We intentionally avoid generated data, so the test files are committed to the git repo.
We also check that the decompressed data matches the original input. All tests are run 5 times.
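
The measurement discipline described above (warm-up run, no allocation inside the timed region, output verified, 5 repetitions) can be sketched roughly as follows. This is not the test_pglz code: the `decompress_fn` signature, `bench_ns_per_byte` and the stand-in `copy_codec` are hypothetical, and `clock()` reports CPU time, which is an assumption about what one wants to measure.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

#define BENCH_RUNS 5

/* Hypothetical codec signature: returns bytes written, or -1 on failure. */
typedef long (*decompress_fn) (const unsigned char *src, size_t srclen,
                               unsigned char *dst, size_t dstcap);

/* Stand-in "decompressor" that just copies, used only to drive the harness. */
long
copy_codec(const unsigned char *src, size_t srclen,
           unsigned char *dst, size_t dstcap)
{
    if (srclen > dstcap)
        return -1;
    memcpy(dst, src, srclen);
    return (long) srclen;
}

/*
 * Average CPU nanoseconds per output byte over BENCH_RUNS runs, or -1.0 if
 * the output does not match `expected`.  The caller allocates `scratch`
 * (at least rawlen bytes), so nothing is allocated while timing.
 */
double
bench_ns_per_byte(decompress_fn fn,
                  const unsigned char *compressed, size_t clen,
                  const unsigned char *expected, size_t rawlen,
                  unsigned char *scratch)
{
    clock_t start, stop;
    int     i;

    assert(fn != NULL && scratch != NULL);
    if (rawlen == 0)
        return -1.0;

    /* Warm-up run, which doubles as the correctness check. */
    if (fn(compressed, clen, scratch, rawlen) != (long) rawlen ||
        memcmp(scratch, expected, rawlen) != 0)
        return -1.0;

    start = clock();
    for (i = 0; i < BENCH_RUNS; i++)
        (void) fn(compressed, clen, scratch, rawlen);
    stop = clock();

    return (double) (stop - start) / CLOCKS_PER_SEC * 1e9
        / ((double) rawlen * BENCH_RUNS);
}
```

For payloads this harness finishes in less than one clock tick, the result is 0; real payloads need to be large enough, or repeated often enough, for the timer resolution not to dominate.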

We use a PG extension only to simplify deploying the benchmarks to our PG clusters.


Here are some test results.

Currently we test on 4 payloads:
1. WAL from cluster initialization
2. 2 WALs from pgbench -i -s 10
3. Data file taken from pgbench -i -s 10

We use these decompressors:
1. pglz_decompress_vanilla - taken from the PG source code
2. pglz_decompress_hacked - uses sliced memcpy to imitate byte-by-byte pglz decompression
3. pglz_decompress_hacked4, pglz_decompress_hacked8, pglz_decompress_hackedX - use memcpy only if the match is at least X bytes long. If this approach is used, we need to determine the best X.
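
To make the three strategies concrete, here is a minimal sketch of how a single back-reference (a match of `len` bytes located `off` bytes behind the write pointer) could be copied in each case. It is an illustration under assumptions, not the attached patches: the function names are hypothetical, and the doubling loop is one plausible reading of "sliced memcpy".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Byte-by-byte copy, the reference behaviour.  Source and destination
 * overlap when off < len, which is how an LZ stream encodes a repeated run.
 */
unsigned char *
copy_match_bytewise(unsigned char *dp, size_t off, size_t len)
{
    assert(off > 0);
    while (len-- > 0)
    {
        *dp = *(dp - off);
        dp++;
    }
    return dp;
}

/*
 * Sliced memcpy: copy non-overlapping chunks.  After each chunk the region
 * already written is twice as long and still periodic, so the chunk size
 * (and the look-back distance) can double.
 */
unsigned char *
copy_match_sliced(unsigned char *dp, size_t off, size_t len)
{
    assert(off > 0);
    while (off <= len)
    {
        memcpy(dp, dp - off, off);
        len -= off;
        dp += off;
        off += off;
    }
    memcpy(dp, dp - off, len);  /* remaining len < off: no overlap */
    return dp + len;
}

/* Threshold variant: use memcpy only for matches of at least min_len bytes. */
unsigned char *
copy_match_threshold(unsigned char *dp, size_t off, size_t len,
                     size_t min_len)
{
    return (len >= min_len) ? copy_match_sliced(dp, off, len)
                            : copy_match_bytewise(dp, off, len);
}
```

All three produce the same bytes; they differ only in how many `memcpy` calls and loop iterations a given match costs, which is what the benchmark compares.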

I used three platforms:
1. Server XEONE5-2660 SM/SYS1027RN3RF/10S2.5/1U/2P (2*INTEL XEON E5-2660/16*DDR3ECCREG/10*SAS-2.5) Under Ubuntu 14, PG 9.6.
2. Desktop Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz Ubuntu 18, PG 12devel
3. Laptop MB Pro 15 2015 2.2 GHz Core i7 (I7-4770HQ) MacOS, PG 12devel
Owners of AMD and ARM devices are welcome.

Server results (less is better):
NOTICE:  00000: Time to decompress one byte in ns:
NOTICE:  00000: Payload 000000010000000000000001
NOTICE:  00000: Decompressor pglz_decompress_hacked result 0.647235
NOTICE:  00000: Decompressor pglz_decompress_hacked4 result 0.671029
NOTICE:  00000: Decompressor pglz_decompress_hacked8 result 0.699949
NOTICE:  00000: Decompressor pglz_decompress_hacked16 result 0.739586
NOTICE:  00000: Decompressor pglz_decompress_hacked32 result 0.787926
NOTICE:  00000: Decompressor pglz_decompress_vanilla result 1.147282
NOTICE:  00000: Payload 000000010000000000000006
NOTICE:  00000: Decompressor pglz_decompress_hacked result 0.201774
NOTICE:  00000: Decompressor pglz_decompress_hacked4 result 0.211859
NOTICE:  00000: Decompressor pglz_decompress_hacked8 result 0.212610
NOTICE:  00000: Decompressor pglz_decompress_hacked16 result 0.214601
NOTICE:  00000: Decompressor pglz_decompress_hacked32 result 0.221813
NOTICE:  00000: Decompressor pglz_decompress_vanilla result 0.706005
NOTICE:  00000: Payload 000000010000000000000008
NOTICE:  00000: Decompressor pglz_decompress_hacked result 1.370132
NOTICE:  00000: Decompressor pglz_decompress_hacked4 result 1.388991
NOTICE:  00000: Decompressor pglz_decompress_hacked8 result 1.388502
NOTICE:  00000: Decompressor pglz_decompress_hacked16 result 1.529455
NOTICE:  00000: Decompressor pglz_decompress_hacked32 result 1.520813
NOTICE:  00000: Decompressor pglz_decompress_vanilla result 1.433527
NOTICE:  00000: Payload 16398
NOTICE:  00000: Decompressor pglz_decompress_hacked result 0.606943
NOTICE:  00000: Decompressor pglz_decompress_hacked4 result 0.623044
NOTICE:  00000: Decompressor pglz_decompress_hacked8 result 0.624118
NOTICE:  00000: Decompressor pglz_decompress_hacked16 result 0.620987
NOTICE:  00000: Decompressor pglz_decompress_hacked32 result 0.621183
NOTICE:  00000: Decompressor pglz_decompress_vanilla result 1.365318

Comment: pglz_decompress_hacked is the fastest on every payload. In most cases it is about 2x faster than the current implementation;
on 000000010000000000000008 it is only marginally better. pglz_decompress_hacked8 is a few percent slower than pglz_decompress_hacked.

Desktop results:
NOTICE:  Time to decompress one byte in ns:
NOTICE:  Payload 000000010000000000000001
NOTICE:  Decompressor pglz_decompress_hacked result 0.396454
NOTICE:  Decompressor pglz_decompress_hacked4 result 0.429249
NOTICE:  Decompressor pglz_decompress_hacked8 result 0.436413
NOTICE:  Decompressor pglz_decompress_hacked16 result 0.478077
NOTICE:  Decompressor pglz_decompress_hacked32 result 0.491488
NOTICE:  Decompressor pglz_decompress_vanilla result 0.695527
NOTICE:  Payload 000000010000000000000006
NOTICE:  Decompressor pglz_decompress_hacked result 0.110710
NOTICE:  Decompressor pglz_decompress_hacked4 result 0.115669
NOTICE:  Decompressor pglz_decompress_hacked8 result 0.127637
NOTICE:  Decompressor pglz_decompress_hacked16 result 0.120544
NOTICE:  Decompressor pglz_decompress_hacked32 result 0.117981
NOTICE:  Decompressor pglz_decompress_vanilla result 0.399446
NOTICE:  Payload 000000010000000000000008
NOTICE:  Decompressor pglz_decompress_hacked result 0.647402
NOTICE:  Decompressor pglz_decompress_hacked4 result 0.691891
NOTICE:  Decompressor pglz_decompress_hacked8 result 0.693834
NOTICE:  Decompressor pglz_decompress_hacked16 result 0.776815
NOTICE:  Decompressor pglz_decompress_hacked32 result 0.777960
NOTICE:  Decompressor pglz_decompress_vanilla result 0.721192
NOTICE:  Payload 16398
NOTICE:  Decompressor pglz_decompress_hacked result 0.337654
NOTICE:  Decompressor pglz_decompress_hacked4 result 0.355452
NOTICE:  Decompressor pglz_decompress_hacked8 result 0.351224
NOTICE:  Decompressor pglz_decompress_hacked16 result 0.362548
NOTICE:  Decompressor pglz_decompress_hacked32 result 0.356456
NOTICE:  Decompressor pglz_decompress_vanilla result 0.837042

Comment: the same pattern as in the Server results.

Laptop results:
NOTICE:  Time to decompress one byte in ns:
NOTICE:  Payload 000000010000000000000001
NOTICE:  Decompressor pglz_decompress_hacked result 0.661469
NOTICE:  Decompressor pglz_decompress_hacked4 result 0.638366
NOTICE:  Decompressor pglz_decompress_hacked8 result 0.664377
NOTICE:  Decompressor pglz_decompress_hacked16 result 0.696135
NOTICE:  Decompressor pglz_decompress_hacked32 result 0.634825
NOTICE:  Decompressor pglz_decompress_vanilla result 0.676560
NOTICE:  Payload 000000010000000000000006
NOTICE:  Decompressor pglz_decompress_hacked result 0.213921
NOTICE:  Decompressor pglz_decompress_hacked4 result 0.224864
NOTICE:  Decompressor pglz_decompress_hacked8 result 0.229394
NOTICE:  Decompressor pglz_decompress_hacked16 result 0.218141
NOTICE:  Decompressor pglz_decompress_hacked32 result 0.220954
NOTICE:  Decompressor pglz_decompress_vanilla result 0.242412
NOTICE:  Payload 000000010000000000000008
NOTICE:  Decompressor pglz_decompress_hacked result 1.053417
NOTICE:  Decompressor pglz_decompress_hacked4 result 1.063704
NOTICE:  Decompressor pglz_decompress_hacked8 result 1.007211
NOTICE:  Decompressor pglz_decompress_hacked16 result 1.145089
NOTICE:  Decompressor pglz_decompress_hacked32 result 1.079702
NOTICE:  Decompressor pglz_decompress_vanilla result 1.051557
NOTICE:  Payload 16398
NOTICE:  Decompressor pglz_decompress_hacked result 0.251690
NOTICE:  Decompressor pglz_decompress_hacked4 result 0.268125
NOTICE:  Decompressor pglz_decompress_hacked8 result 0.269248
NOTICE:  Decompressor pglz_decompress_hacked16 result 0.277880
NOTICE:  Decompressor pglz_decompress_hacked32 result 0.270290
NOTICE:  Decompressor pglz_decompress_vanilla result 0.705652

Comment: decompression time on WAL segments is statistically indistinguishable between the hacked and original versions. Hacked decompression of the data file is more than 2x faster.

We are going to run these tests on Cascade Lake processors too.

Best regards, Andrey Borodin.

Re: pglz performance

Andrey Borodin-2


> On 15 May 2019, at 15:06, Andrey Borodin <[hidden email]> wrote:
>
> Owners of AMD and ARM devices are welcome.

The Yandex hardware R&D guys gave me an ARM server and a Power9 server. They are looking for AMD and some newer Intel boxes.

Meanwhile I made some enhancements to the test suite:
1. I've added a Shakespeare payload: a concatenation of the works of this prominent poet.
2. For each payload we compute a "sliced time": the time to decompress the payload when it is sliced into 2 KB or 8 KB pieces.
3. For each decompressor we compute a "score": (sum of the times to decompress each payload, whole and sliced into 2 KB and 8 KB pieces) * 5 runs
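
The slicing and scoring described in the list can be sketched as below. This is an assumption-laden outline, not the extension's code: `time_fn` is a hypothetical callback that would compress one piece up front and return the time to decompress it, and `cost_by_length` is a stand-in used only to make the arithmetic checkable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: time to decompress one independently
 * compressed piece of `len` bytes. */
typedef double (*time_fn) (const unsigned char *data, size_t len);

/* Stand-in cost model for illustration: time proportional to length. */
double
cost_by_length(const unsigned char *data, size_t len)
{
    (void) data;
    return (double) len;
}

/* Sum time_one() over consecutive pieces of at most slice_size bytes. */
double
time_sliced(const unsigned char *data, size_t len, size_t slice_size,
            time_fn time_one)
{
    double total = 0.0;
    size_t pos;

    assert(slice_size > 0);
    for (pos = 0; pos < len; pos += slice_size)
    {
        size_t n = (len - pos < slice_size) ? len - pos : slice_size;

        total += time_one(data + pos, n);
    }
    return total;
}

/* One payload's contribution to the score: whole + 2 KB + 8 KB slices. */
double
payload_score(const unsigned char *data, size_t len, time_fn time_one)
{
    return time_one(data, len)
        + time_sliced(data, len, 2048, time_one)
        + time_sliced(data, len, 8192, time_one);
}
```

A decompressor's score would then be the sum of `payload_score` over all payloads, accumulated over the 5 runs. Slicing matters because small pieces change the mix of short and long matches the decompressor sees.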

I've attached the full test logs; here are the results for the different platforms.

Intel Server
NOTICE:  00000: Decompressor pglz_decompress_hacked result 10.346763
NOTICE:  00000: Decompressor pglz_decompress_hacked8 result 11.192078
NOTICE:  00000: Decompressor pglz_decompress_hacked16 result 11.957727
NOTICE:  00000: Decompressor pglz_decompress_vanilla result 14.262256

ARM Server
NOTICE:  Decompressor pglz_decompress_hacked result 12.966668
NOTICE:  Decompressor pglz_decompress_hacked8 result 13.004935
NOTICE:  Decompressor pglz_decompress_hacked16 result 13.043015
NOTICE:  Decompressor pglz_decompress_vanilla result 18.239242

Power9 Server
NOTICE:  Decompressor pglz_decompress_hacked result 10.992974
NOTICE:  Decompressor pglz_decompress_hacked8 result 11.747443
NOTICE:  Decompressor pglz_decompress_hacked16 result 11.026342
NOTICE:  Decompressor pglz_decompress_vanilla result 16.375315

Intel laptop
NOTICE:  Decompressor pglz_decompress_hacked result 9.445808
NOTICE:  Decompressor pglz_decompress_hacked8 result 9.105360
NOTICE:  Decompressor pglz_decompress_hacked16 result 9.621833
NOTICE:  Decompressor pglz_decompress_vanilla result 10.661968

From these results pglz_decompress_hacked looks best.

Best regards, Andrey Borodin.


pglz_benchmarks.txt (24K) Download Attachment
Re: pglz performance

Michael Paquier-2
On Thu, May 16, 2019 at 10:13:22PM +0500, Andrey Borodin wrote:

> Meanwhile I made some enhancements to test suit:
> Intel Server
> NOTICE:  00000: Decompressor pglz_decompress_hacked result 10.346763
> NOTICE:  00000: Decompressor pglz_decompress_hacked8 result 11.192078
> NOTICE:  00000: Decompressor pglz_decompress_hacked16 result 11.957727
> NOTICE:  00000: Decompressor pglz_decompress_vanilla result 14.262256
>
> ARM Server
> NOTICE:  Decompressor pglz_decompress_hacked result 12.966668
> NOTICE:  Decompressor pglz_decompress_hacked8 result 13.004935
> NOTICE:  Decompressor pglz_decompress_hacked16 result 13.043015
> NOTICE:  Decompressor pglz_decompress_vanilla result 18.239242
>
> Power9 Server
> NOTICE:  Decompressor pglz_decompress_hacked result 10.992974
> NOTICE:  Decompressor pglz_decompress_hacked8 result 11.747443
> NOTICE:  Decompressor pglz_decompress_hacked16 result 11.026342
> NOTICE:  Decompressor pglz_decompress_vanilla result 16.375315
>
> Intel laptop
> NOTICE:  Decompressor pglz_decompress_hacked result 9.445808
> NOTICE:  Decompressor pglz_decompress_hacked8 result 9.105360
> NOTICE:  Decompressor pglz_decompress_hacked16 result 9.621833
> NOTICE:  Decompressor pglz_decompress_vanilla result 10.661968
>
> From these results pglz_decompress_hacked looks best.
That's nice.

From the numbers you are presenting here, all of them are much better
than the original, and there is not much difference between any of the
patched versions.  Having a 20%~30% improvement with a patch is very
nice.

After that comes the simplicity and the future maintainability of what
is proposed.  I am not much into accepting a patch which has a 1%~2%
impact for some hardware and makes pglz much more complex and harder
to understand.  But I am really eager to see a patch with at least a
10% improvement which remains simple, even more if it simplifies the
logic used in pglz.
--
Michael

Re: pglz performance

Andrey Borodin-2


> On 17 May 2019, at 6:44, Michael Paquier <[hidden email]> wrote:
>
> That's nice.
>
> From the numbers you are presenting here, all of them are much better
> than the original, and there is not much difference between any of the
> patched versions.  Having a 20%~30% improvement with a patch is very
> nice.
>
> After that comes the simplicity and the future maintainability of what
> is proposed.  I am not much into accepting a patch which has a 1%~2%
> impact for some hardwares and makes pglz much more complex and harder
> to understand.  But I am really eager to see a patch with at least a
> 10% improvement which remains simple, even more if it simplifies the
> logic used in pglz.
Here are patches for both winning versions. I'll add them to the commitfest.
My gut feeling is that pglz_decompress_hacked8 should be better, but on most architectures the benchmarks show the opposite.


Best regards, Andrey Borodin.




pglz_decompress_hacked8.diff (1K) Download Attachment
pglz_decompress_hacked.diff (912 bytes) Download Attachment
Re: pglz performance

Gasper Zejn-2
In reply to this post by Andrey Borodin-2

On 16. 05. 19 19:13, Andrey Borodin wrote:
>
>> On 15 May 2019, at 15:06, Andrey Borodin <[hidden email]> wrote:
>>
>> Owners of AMD and ARM devices are welcome.

I've tested according to instructions at the test repo
https://github.com/x4m/test_pglz

Test_pglz is at a97f63b and postgres at 6ba500.

Hardware is desktop AMD Ryzen 5 2600, 32GB RAM

Decompressor score (sum of all times):

NOTICE:  Decompressor pglz_decompress_hacked result 6.988909
NOTICE:  Decompressor pglz_decompress_hacked8 result 7.562619
NOTICE:  Decompressor pglz_decompress_hacked16 result 8.316957
NOTICE:  Decompressor pglz_decompress_vanilla result 10.725826


Attached is the full test run, if needed.

Kind regards,

Gasper

> Yandex hardware RND guys gave me ARM server and Power9 server. They are looking for AMD and some new Intel boxes.
>
> Meanwhile I made some enhancements to test suit:
> 1. I've added Shakespeare payload: concatenation of works of this prominent poet.
> 2. For each payload compute "sliced time" - time to decompress payload if it was sliced by 2Kb pieces or 8Kb pieces.
> 3. For each decompressor we compute "score": (sum of time to decompress each payload, each payload sliced by 2Kb and 8Kb) * 5 times
>
> I've attached full test logs, meanwhile here's results for different platforms.
>
> Intel Server
> NOTICE:  00000: Decompressor pglz_decompress_hacked result 10.346763
> NOTICE:  00000: Decompressor pglz_decompress_hacked8 result 11.192078
> NOTICE:  00000: Decompressor pglz_decompress_hacked16 result 11.957727
> NOTICE:  00000: Decompressor pglz_decompress_vanilla result 14.262256
>
> ARM Server
> NOTICE:  Decompressor pglz_decompress_hacked result 12.966668
> NOTICE:  Decompressor pglz_decompress_hacked8 result 13.004935
> NOTICE:  Decompressor pglz_decompress_hacked16 result 13.043015
> NOTICE:  Decompressor pglz_decompress_vanilla result 18.239242
>
> Power9 Server
> NOTICE:  Decompressor pglz_decompress_hacked result 10.992974
> NOTICE:  Decompressor pglz_decompress_hacked8 result 11.747443
> NOTICE:  Decompressor pglz_decompress_hacked16 result 11.026342
> NOTICE:  Decompressor pglz_decompress_vanilla result 16.375315
>
> Intel laptop
> NOTICE:  Decompressor pglz_decompress_hacked result 9.445808
> NOTICE:  Decompressor pglz_decompress_hacked8 result 9.105360
> NOTICE:  Decompressor pglz_decompress_hacked16 result 9.621833
> NOTICE:  Decompressor pglz_decompress_vanilla result 10.661968
>
> From these results pglz_decompress_hacked looks best.
>
> Best regards, Andrey Borodin.
>

pglz_benchmarks_amd.txt (113K) Download Attachment
Re: pglz performance

Andrey Borodin-2


> On 17 May 2019, at 18:40, Gasper Zejn <[hidden email]> wrote:
>
> I've tested according to instructions at the test repo
> https://github.com/x4m/test_pglz
>
> Test_pglz is at a97f63b and postgres at 6ba500.
>
> Hardware is desktop AMD Ryzen 5 2600, 32GB RAM
>
> Decompressor score (summ of all times):
>
> NOTICE:  Decompressor pglz_decompress_hacked result 6.988909
> NOTICE:  Decompressor pglz_decompress_hacked8 result 7.562619
> NOTICE:  Decompressor pglz_decompress_hacked16 result 8.316957
> NOTICE:  Decompressor pglz_decompress_vanilla result 10.725826

Thanks, Gasper! Basically we observe the same result here: the hacked version takes about 0.65 of the vanilla time.

It is very good that we have independent scores.
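
As an arithmetic check of the 0.65 figure, the ratios below divide the pglz_decompress_hacked score by the pglz_decompress_vanilla score for each platform reported in this thread. The inputs are the NOTICE lines quoted above, and the results are rounded to three decimals.

```latex
\begin{align*}
\text{AMD Ryzen 5 2600:} \quad & 6.988909 / 10.725826 \approx 0.652 \\
\text{Power9 server:}    \quad & 10.992974 / 16.375315 \approx 0.671 \\
\text{ARM server:}       \quad & 12.966668 / 18.239242 \approx 0.711 \\
\text{Intel server:}     \quad & 10.346763 / 14.262256 \approx 0.725 \\
\text{Intel laptop:}     \quad & 9.445808 / 10.661968 \approx 0.886
\end{align*}
```

So the AMD and Power9 ratios are close to 0.65, the ARM and Intel servers are nearer 0.71 to 0.73, and the laptop gains the least.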

I'm still not entirely sure that the score is fair. On payload 000000010000000000000008, vanilla decompression is sometimes slower than hacked by a few percent, and this is especially visible on AMD. Degradation for 000000010000000000000008 sliced into 8 KB pieces reaches 10%.

I think this is because 000000010000000000000008 has the highest entropy. It is almost random, and matches are very short, but present.
000000010000000000000008: Entropy = 4.360546 bits per byte.
000000010000000000000006: Entropy = 1.450059 bits per byte.
000000010000000000000001: Entropy = 2.944235 bits per byte.
shakespeare.txt:          Entropy = 3.603659 bits per byte.
16398:                    Entropy = 1.897640 bits per byte.
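
For reference, these figures can be read as follows, assuming they are the order-0 Shannon entropy of the byte distribution (the message does not say how they were computed). Here $n_b$ is the number of occurrences of byte value $b$ in a payload of $N$ bytes, and terms with $p_b = 0$ count as zero.

```latex
H = -\sum_{b=0}^{255} p_b \log_2 p_b,
\qquad p_b = \frac{n_b}{N},
\qquad 0 \le H \le 8 \ \text{bits per byte}
```

On this scale, uniformly random bytes would score 8, so 4.36 is the highest value among these payloads rather than close to the maximum. Order-0 entropy also ignores repeated sequences, which are exactly what an LZ-style matcher exploits, so it is only a rough proxy for how many matches a payload contains.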

Best regards, Andrey Borodin.