[Patch] Optimize dropping of relation buffers using dlist


[Patch] Optimize dropping of relation buffers using dlist

k.jamison@fujitsu.com

Hi,

Currently, we need to scan the WHOLE of shared buffers when VACUUM truncates
off any empty pages at the end of a transaction, or when a relation is
TRUNCATEd. In our customer's case, we periodically truncate thousands of
tables, and it's possible to TRUNCATE a single table per transaction. This can
become problematic later on during recovery, which can take longer, especially
when a sudden failover happens after those TRUNCATEs and we have to scan a
large shared buffer pool. In the performance test below, it took almost 12.5
minutes for recovery to complete with 100GB of shared buffers. But we want to
keep failover very short (within 10 seconds).

Previously, I made an improvement that sped up the truncation of relation
forks by reducing three scans of shared buffers to one. [1] This time, the aim
of this patch is to further speed up the invalidation of pages by linking the
cached pages of the target relation in a doubly linked list and just traversing
it, instead of scanning the whole of shared buffers. In DropRelFileNodeBuffers,
we then only visit the target buffers of the relation that need to be
invalidated. The win from this patch is significant: failover and recovery
completed in roughly 3 seconds.
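
For context, this is roughly the scan that master performs today in
DropRelFileNodeBuffers (a simplified sketch of the existing code, not the
patch): every buffer header is visited, so the cost grows with shared_buffers
rather than with the number of pages the relation actually has.

/* Simplified sketch of the existing full scan on master. */
for (i = 0; i < NBuffers; i++)
{
    BufferDesc *bufHdr = GetBufferDescriptor(i);
    uint32      buf_state;

    /* cheap unlocked pre-check to skip buffers of other relations */
    if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
        continue;

    buf_state = LockBufHdr(bufHdr);
    if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
        bufHdr->tag.forkNum == forkNum &&
        bufHdr->tag.blockNum >= firstDelBlock)
        InvalidateBuffer(bufHdr);   /* releases the buffer header spinlock */
    else
        UnlockBufHdr(bufHdr, buf_state);
}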

 

I performed tests similar to what I did for the speedup of truncates of
relation forks [1][2], but this time using 100GB shared_buffers.

[Machine spec used in testing]
Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
CPU: 16, Number of cores per socket: 8
RHEL6.5, Memory: 256GB++

[Test]
1. (Master) Create tables (ex. 10,000 tables). Insert data into the tables.
2. (Master) DELETE FROM table (ex. all rows of 10,000 tables).
   (Standby) To test with failover, pause the WAL replay on the standby server.
   (SELECT pg_wal_replay_pause();)
3. (M) psql -c "\timing on" (measures total execution time of SQL queries)
4. (M) VACUUM (whole DB)
5. (M) Stop the primary server: pg_ctl stop -D $PGDATA -w
6. (S) Resume WAL replay and promote the standby. [2]

 

[Results]

A. HEAD (origin/master branch)
A1. Vacuum execution on primary server
    Time: 730932.408 ms (12:10.932) ~12 min 11 s
A2. Vacuum + failover (WAL recovery on standby)
    waiting for server to promote...........................
    .................................... stopped waiting
    pg_ctl: server did not promote in time
    2019/10/25_12:13:09.692─┐
    2019/10/25_12:25:43.576─┘
    --> Total: 12 min 34 s

B. PATCH
B1. Vacuum execution on primary server
    Time: 6518.333 ms (6.518 s)
B2. Vacuum + failover (WAL recovery on standby)
    2019/10/25_14:17:21.822
    waiting for server to promote...... done
    server promoted
    2019/10/25_14:17:24.827
    2019/10/25_14:17:24.833
    --> Total: 3.011 s

 

[Other Notes]
One possible disadvantage is that the number of relations varies, yet we
allocate as many relation structures as there are shared buffers. I tried to
reduce memory use during the hash table lookup operation by using a fixed-size
array (100) as a threshold on the number of target buffers to invalidate per
pass.

When CachedBufLookup() scans the buffers in the dlist, I made sure the list is
traversed at most twice. First, we scan the dlist of cached buffers of the
relation and store the target buffers in buf_id_array. Non-target buffers are
removed from the dlist and added to a temporary dlist. After reaching the end
of the main dlist, we append the temporary dlist to the tail of the main dlist.
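
Roughly, that lookup pass looks like the sketch below (simplified and
illustrative: the identifiers are not necessarily the ones in the patch;
nodes[] stands for a per-buffer array of int prev/next links, and
CachedBufferEnt holds the head of a relation's list):

/* Illustrative sketch of the two-pass lookup described above. */
int
CachedBufLookup(CachedBufferEnt *ent, BlockNumber firstDelBlock,
                int *buf_id_array, int max_ids)
{
    CachedBufferEnt temp;           /* temporary list for non-target buffers */
    int             nbufs = 0;
    int             cur = ent->head;

    temp.head = CACHEDBLOCK_END_OF_LIST;

    while (cur != CACHEDBLOCK_END_OF_LIST && nbufs < max_ids)
    {
        int next = nodes[cur].next;

        if (IsTargetBlock(cur, firstDelBlock))
            buf_id_array[nbufs++] = cur;    /* caller invalidates (and unlinks) these */
        else
        {
            cb_dlist_delete(ent, cur);      /* unlink non-target buffer */
            cb_dlist_push_tail(&temp, cur); /* park it on the temp list */
        }
        cur = next;
    }

    /* re-attach the surviving (non-target) buffers to the main list */
    cb_dlist_combine(ent, &temp);

    return nbufs;
}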

I also ran a pgbench test, and this patch did not cause overhead to normal DB
access performance.

 

Another point I'd like feedback on is the use of new dlist operations for this
cached buffer list. I did not use the existing Postgres dlist implementation
(ilist.h) in this patch because I want to save as much memory as possible,
especially when NBuffers becomes large. Both dlist_node and dlist_head are 16
bytes, whereas the two int links used in this patch take 8 bytes.

 

Hope to hear your feedback and comments.

Thanks in advance,
Kirk Jamison

 

[1] https://www.postgresql.org/message-id/flat/D09B13F772D2274BB348A310EE3027C64E2067%40g01jpexmbkw24

[2] https://www.postgresql.org/message-id/D09B13F772D2274BB348A310EE3027C6502672%40g01jpexmbkw24


v1-Optimize-dropping-of-relation-buffers-using-dlist.patch (27K)

RE: [Patch] Optimize dropping of relation buffers using dlist

k.jamison@fujitsu.com

Hi,

 

> Another one that I'd need feedback of is the use of new dlist operations

> for this cached buffer list. I did not use in this patch the existing

> Postgres dlist architecture (ilist.h) because I want to save memory space

> as much as possible especially when NBuffers become large. Both dlist_node

> & dlist_head are 16 bytes. OTOH, two int pointers for this patch is 8 bytes.

 

In cb_dlist_combine(), the code block below can impact performance, especially
when the doubly linked list is long (IOW, many cached buffers):

              /* Point to the tail of main dlist */
              while (curr_main->next != CACHEDBLOCK_END_OF_LIST)
                            curr_main = cb_dlist_next(curr_main);

Attached is an improved version of the previous patch, which adds tail-pointer
information in order to speed up the operation above. I stored the tail in the
prev pointer of the head entry (maybe not a typical approach). A more typical
one would be adding a tail field (int tail) to CachedBufferEnt, but I didn't do
that because, as I mentioned in the previous email, I want to avoid using more
memory as much as possible.
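
Roughly, the change makes the append step O(1) instead of a walk to the end of
the main list. A sketch of the idea (illustrative only, not the patch's exact
code; nodes[] stands for the per-buffer array of int prev/next links, and
CachedBufferEnt's head field is the buf_id of the first entry):

/*
 * Illustrative sketch: the prev field of the first entry stores the buf_id of
 * the tail, so combining two lists needs no walk to find the main list's end.
 */
static void
cb_dlist_combine(CachedBufferEnt *main, CachedBufferEnt *temp)
{
    int main_tail, temp_tail;

    if (temp->head == CACHEDBLOCK_END_OF_LIST)
        return;                                 /* nothing to append */

    if (main->head == CACHEDBLOCK_END_OF_LIST)
    {
        main->head = temp->head;                /* main list was empty */
        return;
    }

    main_tail = nodes[main->head].prev;         /* tail cached in head's prev */
    temp_tail = nodes[temp->head].prev;

    nodes[main_tail].next = temp->head;         /* splice temp after main tail */
    nodes[temp->head].prev = main_tail;
    nodes[main->head].prev = temp_tail;         /* head now records new tail */
    nodes[temp_tail].next = CACHEDBLOCK_END_OF_LIST;
}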

The patch worked as intended and passed the tests.

Any thoughts?

Regards,
Kirk Jamison


v2-Optimize-dropping-of-relation-buffers-using-dlist.patch (28K)

Re: [Patch] Optimize dropping of relation buffers using dlist

Tomas Vondra-4
Hi Kirk,

On Tue, Nov 05, 2019 at 09:58:22AM +0000, [hidden email] wrote:

>Hi,
>
>
>> Another one that I'd need feedback of is the use of new dlist operations
>
>> for this cached buffer list. I did not use in this patch the existing
>
>> Postgres dlist architecture (ilist.h) because I want to save memory space
>
>> as much as possible especially when NBuffers become large. Both dlist_node
>
>> & dlist_head are 16 bytes. OTOH, two int pointers for this patch is 8 bytes.
>
>In cb_dlist_combine(), the code block below can impact performance
>especially for cases when the doubly linked list is long (IOW, many cached buffers).
>              /* Point to the tail of main dlist */
>              while (curr_main->next != CACHEDBLOCK_END_OF_LIST)
>                            curr_main = cb_dlist_next(curr_main);
>
>Attached is an improved version of the previous patch, which adds a pointer
>information of the TAIL field in order to speed up the abovementioned operation.
>I stored the tail field in the prev pointer of the head entry (maybe not a typical
>approach). A more typical one is by adding a tail field (int tail) to CachedBufferEnt,
>but I didn’t do that because as I mentioned in previous email I want to avoid
>using more memory as much as possible.
>The patch worked as intended and passed the tests.
>
>Any thoughts?
>

A couple of comments based on briefly looking at the patch.

1) I don't think you should / need to expose most of the new stuff in
    buf_internals.h. It's only used from buf_internals.c and having all
    the various cb_dlist_* functions in .h seems strange.

2) This adds another hashtable maintenance to BufferAlloc etc. but
    you've only done tests / benchmarks for the case this optimizes. I
    think we need to see a benchmark for a workload that allocates and
    invalidates a lot of buffers. A pgbench run with a workload that fits
    into RAM but not into shared buffers would be interesting.

3) I see this triggered a failure on cputube, in the commit_ts TAP test.
    That's a bit strange, someone should investigate I guess.
   
    https://travis-ci.org/postgresql-cfbot/postgresql/builds/607563900

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [Patch] Optimize dropping of relation buffers using dlist

Robert Haas
On Tue, Nov 5, 2019 at 10:34 AM Tomas Vondra
<[hidden email]> wrote:
> 2) This adds another hashtable maintenance to BufferAlloc etc. but
>     you've only done tests / benchmark for the case this optimizes. I
>     think we need to see a benchmark for workload that allocates and
>     invalidates lot of buffers. A pgbench with a workload that fits into
>     RAM but not into shared buffers would be interesting.

Yeah, it seems pretty hard to believe that this won't be bad for some
workloads. Not only do you have the overhead of the hash table
operations, but you also have locking overhead around that. A whole
new set of LWLocks where you have to take and release one of them
every time you allocate or invalidate a buffer seems likely to cause a
pretty substantial contention problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



RE: [Patch] Optimize dropping of relation buffers using dlist

k.jamison@fujitsu.com
On Thurs, November 7, 2019 1:27 AM (GMT+9), Robert Haas wrote:

> On Tue, Nov 5, 2019 at 10:34 AM Tomas Vondra <[hidden email]>
> wrote:
> > 2) This adds another hashtable maintenance to BufferAlloc etc. but
> >     you've only done tests / benchmark for the case this optimizes. I
> >     think we need to see a benchmark for workload that allocates and
> >     invalidates lot of buffers. A pgbench with a workload that fits into
> >     RAM but not into shared buffers would be interesting.
>
> Yeah, it seems pretty hard to believe that this won't be bad for some workloads.
> Not only do you have the overhead of the hash table operations, but you also
> have locking overhead around that. A whole new set of LWLocks where you have
> to take and release one of them every time you allocate or invalidate a buffer
> seems likely to cause a pretty substantial contention problem.
I'm sorry for the late reply. Thank you, Tomas and Robert, for checking this patch.
Attached is the v3 of the patch.
- I moved the unnecessary items from buf_internals.h to cached_buf.c, since most
  of those items are only used in that file.
- Fixed the bug of v2. It seems to pass both the regression and TAP tests now.

Thanks for the advice on the benchmark test. Please refer below for the tests and results.

[Machine spec]
CPU: 16, Number of cores per socket: 8
RHEL6.5, Memory: 240GB

scale: 3125 (about 46GB DB size)
shared_buffers = 8GB

[workload that fits into RAM but not into shared buffers]
pgbench -i -s 3125 cachetest
pgbench -c 16 -j 8 -T 600 cachetest

[Patched]
scaling factor: 3125
query mode: simple
number of clients: 16
number of threads: 8
duration: 600 s
number of transactions actually processed: 8815123
latency average = 1.089 ms
tps = 14691.436343 (including connections establishing)
tps = 14691.482714 (excluding connections establishing)

[Master/Unpatched]
...
number of transactions actually processed: 8852327
latency average = 1.084 ms
tps = 14753.814648 (including connections establishing)
tps = 14753.861589 (excluding connections establishing)


My patch caused a little overhead of about 0.42-0.46%, which I think is small.
Kindly let me know your opinions/comments about the patch or tests, etc.

Thanks,
Kirk Jamison

v3-Optimize-dropping-of-relation-buffers-using-dlist.patch (27K)

Re: [Patch] Optimize dropping of relation buffers using dlist

Tomas Vondra-4
On Tue, Nov 12, 2019 at 10:49:49AM +0000, [hidden email] wrote:

>On Thurs, November 7, 2019 1:27 AM (GMT+9), Robert Haas wrote:
>> On Tue, Nov 5, 2019 at 10:34 AM Tomas Vondra <[hidden email]>
>> wrote:
>> > 2) This adds another hashtable maintenance to BufferAlloc etc. but
>> >     you've only done tests / benchmark for the case this optimizes. I
>> >     think we need to see a benchmark for workload that allocates and
>> >     invalidates lot of buffers. A pgbench with a workload that fits into
>> >     RAM but not into shared buffers would be interesting.
>>
>> Yeah, it seems pretty hard to believe that this won't be bad for some workloads.
>> Not only do you have the overhead of the hash table operations, but you also
>> have locking overhead around that. A whole new set of LWLocks where you have
>> to take and release one of them every time you allocate or invalidate a buffer
>> seems likely to cause a pretty substantial contention problem.
>
>I'm sorry for the late reply. Thank you Tomas and Robert for checking this patch.
>Attached is the v3 of the patch.
>- I moved the unnecessary items from buf_internals.h to cached_buf.c since most of
>  of those items are only used in that file.
>- Fixed the bug of v2. Seems to pass both RT and TAP test now
>
>Thanks for the advice on benchmark test. Please refer below for test and results.
>
>[Machine spec]
>CPU: 16, Number of cores per socket: 8
>RHEL6.5, Memory: 240GB
>
>scale: 3125 (about 46GB DB size)
>shared_buffers = 8GB
>
>[workload that fits into RAM but not into shared buffers]
>pgbench -i -s 3125 cachetest
>pgbench -c 16 -j 8 -T 600 cachetest
>
>[Patched]
>scaling factor: 3125
>query mode: simple
>number of clients: 16
>number of threads: 8
>duration: 600 s
>number of transactions actually processed: 8815123
>latency average = 1.089 ms
>tps = 14691.436343 (including connections establishing)
>tps = 14691.482714 (excluding connections establishing)
>
>[Master/Unpatched]
>...
>number of transactions actually processed: 8852327
>latency average = 1.084 ms
>tps = 14753.814648 (including connections establishing)
>tps = 14753.861589 (excluding connections establishing)
>
>
>My patch caused a little overhead of about 0.42-0.46%, which I think is small.
>Kindly let me know your opinions/comments about the patch or tests, etc.
>

Now try measuring that with a read-only workload, with prepared
statements. I've tried that on a machine with 16 cores, doing

   # 16 clients
   pgbench -n -S -j 16 -c 16 -M prepared -T 60 test

   # 1 client
   pgbench -n -S -c 1 -M prepared -T 60 test

and the average from 30 runs of each looks like this:

    # clients      master         patched         %
   ---------------------------------------------------------
    1              29690          27833           93.7%
    16            300935         283383           94.1%

That's a quite significant regression, considering it's optimizing an
operation that is expected to be pretty rare (people are generally not
dropping objects as often as they query them).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



RE: [Patch] Optimize dropping of relation buffers using dlist

k.jamison@fujitsu.com
On Wed, Nov 13, 2019 4:20AM (GMT +9), Tomas Vondra wrote:

> On Tue, Nov 12, 2019 at 10:49:49AM +0000, [hidden email] wrote:
> >On Thurs, November 7, 2019 1:27 AM (GMT+9), Robert Haas wrote:
> >> On Tue, Nov 5, 2019 at 10:34 AM Tomas Vondra
> >> <[hidden email]>
> >> wrote:
> >> > 2) This adds another hashtable maintenance to BufferAlloc etc. but
> >> >     you've only done tests / benchmark for the case this optimizes. I
> >> >     think we need to see a benchmark for workload that allocates and
> >> >     invalidates lot of buffers. A pgbench with a workload that fits into
> >> >     RAM but not into shared buffers would be interesting.
> >>
> >> Yeah, it seems pretty hard to believe that this won't be bad for some
> workloads.
> >> Not only do you have the overhead of the hash table operations, but
> >> you also have locking overhead around that. A whole new set of
> >> LWLocks where you have to take and release one of them every time you
> >> allocate or invalidate a buffer seems likely to cause a pretty substantial
> contention problem.
> >
> >I'm sorry for the late reply. Thank you Tomas and Robert for checking this
> patch.
> >Attached is the v3 of the patch.
> >- I moved the unnecessary items from buf_internals.h to cached_buf.c
> >since most of
> >  of those items are only used in that file.
> >- Fixed the bug of v2. Seems to pass both RT and TAP test now
> >
> >Thanks for the advice on benchmark test. Please refer below for test and
> results.
> >
> >[Machine spec]
> >CPU: 16, Number of cores per socket: 8
> >RHEL6.5, Memory: 240GB
> >
> >scale: 3125 (about 46GB DB size)
> >shared_buffers = 8GB
> >
> >[workload that fits into RAM but not into shared buffers] pgbench -i -s
> >3125 cachetest pgbench -c 16 -j 8 -T 600 cachetest
> >
> >[Patched]
> >scaling factor: 3125
> >query mode: simple
> >number of clients: 16
> >number of threads: 8
> >duration: 600 s
> >number of transactions actually processed: 8815123 latency average =
> >1.089 ms tps = 14691.436343 (including connections establishing) tps =
> >14691.482714 (excluding connections establishing)
> >
> >[Master/Unpatched]
> >...
> >number of transactions actually processed: 8852327 latency average =
> >1.084 ms tps = 14753.814648 (including connections establishing) tps =
> >14753.861589 (excluding connections establishing)
> >
> >
> >My patch caused a little overhead of about 0.42-0.46%, which I think is small.
> >Kindly let me know your opinions/comments about the patch or tests, etc.
> >
>
> Now try measuring that with a read-only workload, with prepared statements.
> I've tried that on a machine with 16 cores, doing
>
>    # 16 clients
>    pgbench -n -S -j 16 -c 16 -M prepared -T 60 test
>
>    # 1 client
>    pgbench -n -S -c 1 -M prepared -T 60 test
>
> and average from 30 runs of each looks like this:
>
>     # clients      master         patched         %
>    ---------------------------------------------------------
>     1              29690          27833           93.7%
>     16            300935         283383           94.1%
>
> That's quite significant regression, considering it's optimizing an
> operation that is expected to be pretty rare (people are generally not
> dropping dropping objects as often as they query them).
I updated the patch to reduce the lock contention of the new LWLock, with
tunable definitions in the code. Also, instead of using only the rnode as the
hash key, I added the modulo of the block number:
#define NUM_MAP_PARTITIONS_FOR_REL 128 /* relation-level */
#define NUM_MAP_PARTITIONS_IN_REL 4 /* block-level */
#define NUM_MAP_PARTITIONS \
        (NUM_MAP_PARTITIONS_FOR_REL * NUM_MAP_PARTITIONS_IN_REL)
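
For illustration, the partition for a given buffer could be chosen along these
lines (a rough sketch; the helper name and the exact hashing of the RelFileNode
are assumptions, not necessarily what the patch does):

/* Illustrative sketch: pick one of NUM_MAP_PARTITIONS lock partitions
 * from the relation and the block number. */
static inline int
CachedBufMapPartition(const RelFileNode *rnode, BlockNumber blockNum)
{
    /* any reasonable hash of the RelFileNode fields would do here */
    uint32 relhash = rnode->relNode ^ (rnode->dbNode << 8) ^ (rnode->spcNode << 16);

    return (relhash % NUM_MAP_PARTITIONS_FOR_REL) * NUM_MAP_PARTITIONS_IN_REL +
           (blockNum % NUM_MAP_PARTITIONS_IN_REL);
}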

I ran the read-only benchmark again; the regression currently sits at 3.10%
(reduced from v3's 6%).

Average of 10 runs, 16 clients
read-only, prepared query mode

[Master]
num of txn processed: 11,950,983.67
latency average = 0.080 ms
tps = 199,182.24
tps = 199,189.54

[V4 Patch]
num of txn processed: 11,580,256.36
latency average = 0.083 ms
tps = 193,003.52
tps = 193,010.76


I checked the wait event statistics (non-impactful events omitted)
and got the results below.
I reset the stats before running the pgbench script,
then captured the stats right after the run.

[Master]
 wait_event_type |      wait_event       |  calls   | microsec
-----------------+-----------------------+----------+----------
 Client          | ClientRead            |   25116  | 49552452
 IO              | DataFileRead          | 14467109 | 92113056
 LWLock          | buffer_mapping        |   204618 |  1364779

[Patch V4]
 wait_event_type |      wait_event       |  calls   | microsec
-----------------+-----------------------+----------+----------
 Client          | ClientRead            |  111393  | 68773946
 IO              | DataFileRead          | 14186773 | 90399833
 LWLock          | buffer_mapping        |   463844 |  4025198
 LWLock          | cached_buf_tranche_id |    83390 |   336080

It seems the buffer_mapping LWLock wait time is roughly 3x higher.
However, I'd like to continue working on this patch into the next commitfest,
and further reduce its impact on read-only workloads.


Regards,
Kirk Jamison

v4-Optimize-dropping-of-relation-buffers-using-dlist.patch (29K)

RE: [Patch] Optimize dropping of relation buffers using dlist

k.jamison@fujitsu.com
Hi,

I have updated the patch (v5).
I tried to reduce the lock waiting times by using a spinlock
when inserting/deleting buffers in the new hash table, and
an exclusive lock when looking up the buffers to be dropped.
In summary, instead of scanning the whole buffer pool in
shared buffers, we just traverse the doubly linked list of cached
buffers for the target relation and block.
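
As a rough illustration of that division of labor (a sketch only; apart from
the SpinLock and LWLock primitives, the identifiers below are made up for the
illustration and are not the patch's actual names):

/* Buffer allocation/eviction path: link or unlink a single buffer. */
SpinLockAcquire(&entry->mutex);              /* short, per-entry spinlock */
cb_dlist_push_tail(entry, buf_id);           /* or cb_dlist_delete(entry, buf_id) */
SpinLockRelease(&entry->mutex);

/* DropRelFileNodeBuffers path: walk the relation's cached-buffer list. */
LWLockAcquire(partition_lock, LW_EXCLUSIVE);
nbufs = CachedBufLookup(entry, firstDelBlock, buf_id_array,
                        lengthof(buf_id_array));
LWLockRelease(partition_lock);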

In order to understand how this patch affects performance,
I also measured the cache hit rates in addition to
benchmarking the database with various shared_buffers settings.

Using the same machine specs, I used the default pgbench script
for a read-only workload with prepared statements,
and executed about 15 runs for each shared_buffers size.
  pgbench -i -s 3200 test  //(about 48GB db size)
  pgbench -S -n -M prepared -c 16 -j 16 -T 60 test

[TPS Regression]
 shbuf | tps(master)  | tps(patch)   | %reg
-------+--------------+--------------+-------
  5GB  | 195,737.23   | 191,422.23   |  2.23
 10GB  | 197,067.93   | 194,011.66   |  1.55
 20GB  | 200,241.18   | 200,425.29   | -0.09
 40GB  | 208,772.81   | 209,807.38   | -0.50
 50GB  | 215,684.33   | 218,955.43   | -1.52

[CACHE HIT RATE]
 shbuf |  master  |  patch
-------+----------+----------
 10GB  | 0.141536 | 0.141485
 20GB  | 0.330088 | 0.329894
 30GB  | 0.573383 | 0.573377
 40GB  | 0.819499 | 0.819264
 50GB  | 0.999237 | 0.999577

For this workload, the regression increases for shared_buffers
sizes below 20GB. However, the cache hit rate for both master
and patch is only 32% at 20GB shared_buffers. Therefore, I think
we can consider this kind of workload with a small shared_buffers
size a "special case", because in terms of performance tuning we
want the database to have as high a cache hit rate as possible
(99.9%, or let's say 80% is acceptable). In this workload, the
ideal shared_buffers size would be around 40GB or more to hit that
acceptable cache hit rate.
Looking at this patch's performance results, if we are within the
acceptable cache hit rate there is at least no regression, and the
results also show almost the same tps compared to master.

Your feedback about the patch and tests is welcome.

Regards,
Kirk Jamison

v5-Optimize-dropping-of-relation-buffers-using-dlist.patch (32K)

RE: [Patch] Optimize dropping of relation buffers using dlist

k.jamison@fujitsu.com
Hi,

I have rebased the patch to keep the CFbot happy.
Apparently, in the previous patch there was a possibility of an infinite loop
when allocating buffers, so I fixed that part and also removed some whitespace.

Kindly check the attached V6 patch.
Any thoughts on this?

Regards,
Kirk Jamison

v6-Optimize-dropping-of-relation-buffers-using-dlist.patch (32K)

Re: [Patch] Optimize dropping of relation buffers using dlist

Robert Haas
On Tue, Feb 4, 2020 at 4:57 AM [hidden email]
<[hidden email]> wrote:
> Kindly check the attached V6 patch.
> Any thoughts on this?

Unfortunately, I don't have time for detailed review of this. I am
suspicious that there are substantial performance regressions that you
just haven't found yet. I would not take the position that this is a
completely hopeless approach, or anything like that, but neither would
I conclude that the tests shown so far are anywhere near enough to be
confident that there are no problems.

Also, systems with very large shared_buffers settings are becoming
more common, and probably will continue to become more common, so I
don't think we can dismiss that as an edge case any more. People don't
want to run with an 8GB cache on a 1TB server.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



RE: [Patch] Optimize dropping of relation buffers using dlist

k.jamison@fujitsu.com
Hi,

I know this might already be late at the end of the CommitFest, but attached
is the latest version of the patch. The previous version only included the
buffer invalidation improvement for VACUUM. The new patch adds the same
routine for TRUNCATE WAL replay.

In summary, this patch aims to improve the buffer invalidation process
of VACUUM and TRUNCATE. Although it may not be a common use
case, our customer uses these commands a lot. Recovery and WAL
replay of these commands can take time depending on the size of
the database buffers. So this patch optimizes that using the newly added
auxiliary cache and doubly linked list in shared memory, so that
we don't need to scan the whole of shared buffers anymore.

As for performance and how the patch affects read-only workloads:
using pgbench, I've kept the overhead to a minimum, less than 1%.
I'll post follow-up results.

Although the additional hash table utilizes shared memory, there's
a significant performance gain for both TRUNCATE and VACUUM
from execution to recovery.

Regards,
Kirk Jamison

v7-Optimize-dropping-of-relation-buffers-using-dlist.patch (38K)

RE: [Patch] Optimize dropping of relation buffers using dlist

k.jamison@fujitsu.com
On Wednesday, March 25, 2020 3:25 PM, Kirk Jamison wrote:
> As for performance and how the patch affects read-only workloads:
> using pgbench, I've kept the overhead to a minimum, less than 1%.
> I'll post follow-up results.

Here are the follow-up results.
I executed tests similar to those at the top of the thread.
I hope the performance test results shown below will suffice.
If not, I'd appreciate any feedback with regard to the tests or the patch itself.

A. VACUUM execution + Failover test
- 100GB shared_buffers

1. 1000 tables (18MB)
1.1. Execution Time
- [MASTER] 77755.218 ms (01:17.755)
- [PATCH] Execution Time:   2147.914 ms (00:02.148)
1.2. Failover Time (Recovery WAL Replay):
- [MASTER] 01:37.084 (1 min 37.084 s)
- [PATCH] 1627 ms (1.627 s)

2. 10000 tables (110MB)
2.1. Execution Time
- [MASTER] 844174.572 ms (14:04.175) ~14 min 4.175 s
- [PATCH] 75678.559 ms (01:15.679) ~1 min 15.679 s

2.2. Failover Time:
- [MASTER] est. 14 min++
    (I didn't measure it because recovery takes
    about as long as the execution time.)
- [PATCH] 01:25.559 (1 min 25.559 s)

Significant performance results for VACUUM.


B. TPS Regression for READ-ONLY workload
(PREPARED QUERY MODE, NO VACUUM)

# [16 Clients]
- pgbench -n -S -j 16 -c 16 -M prepared -T 60 cachetest

|shbuf  |Master     |Patch      |% reg  |
|-------|-----------|-----------|-------|
|128MB  | 77,416.76 | 77,162.78 | 0.33% |
|1GB    | 81,941.30 | 81,812.05 | 0.16% |
|2GB    | 84,273.69 | 84,356.38 |-0.10% |
|100GB  | 83,807.30 | 83,924.68 |-0.14% |

# [1 Client]
- pgbench -n -S -c 1 -M prepared -T 60 cachetest

|shbuf  |Master     |Patch      |% reg  |
|-------|-----------|-----------|-------|
|128MB  | 12,044.54 | 12,037.13 | 0.06% |
|1GB    | 12,736.57 | 12,774.77 |-0.30% |
|2GB    | 12,948.98 | 13,159.90 |-1.63% |
|100GB  | 12,982.98 | 13,064.04 |-0.62% |

Both were run 10 times, and the average tps and % regression are
shown above. In some cases only minimal overhead was caused by
the patch; in the other cases it has higher tps compared to master.

If it does not make it into this CF, I hope to receive feedback in the future
on how to proceed. Thanks in advance!

Regards,
Kirk Jamison

RE: [Patch] Optimize dropping of relation buffers using dlist

k.jamison@fujitsu.com
Hi,

Since the last posted version of the patch fails to apply, attached is a rebased version.
Performance results and some benefits and challenges were written upthread.
I'd appreciate your feedback/comments.

Regards,
Kirk Jamison

v8-Optimize-dropping-of-relation-buffers-using-dlist.patch (38K)

Re: [Patch] Optimize dropping of relation buffers using dlist

konstantin knizhnik


On 17.06.2020 09:14, [hidden email] wrote:
> Hi,
>
> Since the last posted version of the patch fails, attached is a rebased version.
> Written upthread were performance results and some benefits and challenges.
> I'd appreciate your feedback/comments.
>
> Regards,
> Kirk Jamison
As far as I understand, this patch can provide a significant performance
improvement only in the case of recovery of truncates of a large number of
tables. You have added a shared hash of relation buffers, and it certainly
adds some extra overhead. According to your latest results this overhead is
quite small. But it will be hard to prove that there will be no noticeable
regression under some workloads.

I wonder if you have considered the case of a local hash (maintained only
during recovery)?
If there is after-crash recovery, then there will be no concurrent
access to shared buffers and this hash will be up-to-date.
In case of a hot-standby replica we can use some simple invalidation (just
one flag or counter which indicates that the buffer cache was updated).
This hash can also be constructed on demand when DropRelFileNodeBuffers
is called the first time (so we have to scan all buffers once, but subsequent
drop operations will be fast).

I have not thought much about it, but it seems to me that since this
problem only affects recovery, we do not need a shared hash for it.
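
For illustration, such a recovery-only local hash could look roughly like this
(a sketch only; the type and function names are made up, while HASHCTL and
hash_create are the usual dynahash API):

/* Sketch: a backend-local hash, built by the startup process during recovery,
 * mapping a relation to the buffers it occupies. */
typedef struct RecoveryRelBufEntry
{
    RelFileNode rnode;          /* hash key */
    int         first_buf;      /* head of a list of buf_ids, -1 if empty */
} RecoveryRelBufEntry;

static HTAB *recovery_rel_bufs = NULL;

static void
init_recovery_rel_bufs(void)
{
    HASHCTL ctl;

    memset(&ctl, 0, sizeof(ctl));
    ctl.keysize = sizeof(RelFileNode);
    ctl.entrysize = sizeof(RecoveryRelBufEntry);

    /* local (not shared) hash: only the startup process touches it */
    recovery_rel_bufs = hash_create("recovery relation buffers",
                                    1024, &ctl, HASH_ELEM | HASH_BLOBS);
}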




RE: [Patch] Optimize dropping of relation buffers using dlist

k.jamison@fujitsu.com
On Wednesday, July 29, 2020 4:55 PM, Konstantin Knizhnik wrote:
> On 17.06.2020 09:14, [hidden email] wrote:
> > Hi,
> >
> > Since the last posted version of the patch fails, attached is a rebased version.
> > Written upthread were performance results and some benefits and challenges.
> > I'd appreciate your feedback/comments.
> >
> > Regards,
> > Kirk Jamison

> As far as i understand this patch can provide significant improvement of
> performance only in case of recovery  of truncates of large number of tables. You
> have added shared hash of relation buffers and certainly if adds some extra
> overhead. According to your latest results this overhead is quite small. But it will
> be hard to prove that there will be no noticeable regression at some workloads.

Thank you for taking a look at this.

Yes, one of the aims is to speed up recovery of truncations, but at the same time the
patch also improves the execution of autovacuum, VACUUM, and relation/index truncation.
I showed pgbench results above for different types of workloads,
but I am not sure whether those are convincing enough...

> I wonder if you have considered case of local hash (maintained only during
> recovery)?
> If there is after-crash recovery, then there will be no concurrent access to shared
> buffers and this hash will be up-to-date.
> in case of hot-standby replica we can use some simple invalidation (just one flag
> or counter which indicates that buffer cache was updated).
> This hash also can be constructed on demand when DropRelFileNodeBuffers is
> called first time (so w have to scan all buffers once, but subsequent drop
> operation will be fast.
>
> i have not thought much about it, but it seems to me that as far as this problem
> only affects recovery, we do not need shared hash for it.
>

The idea of the patch is to mark the relation buffers to be dropped after scanning
the whole of shared buffers, store them in a dlist maintained in shared memory, and
traverse that dlist on the next scan instead of scanning all of shared buffers again.
But I understand the point that this is expensive and may cause overhead; that is why
I defined a macro to limit the number of target buffers handled per lookup pass, for
cases where the lookup cost can be problematic (i.e., a relation with very many pages).

#define BUF_ID_ARRAY_SIZE 100
int buf_id_array[BUF_ID_ARRAY_SIZE];
int forknum_indexes[BUF_ID_ARRAY_SIZE];

/* In DropRelFileNodeBuffers: process the relation's cached buffers in
 * batches of at most BUF_ID_ARRAY_SIZE entries. */
do
{
    nbufs = CachedBlockLookup(..., forknum_indexes, buf_id_array, lengthof(buf_id_array));
    for (i = 0; i < nbufs; i++)
    {
        ...
    }
} while (nbufs == lengthof(buf_id_array));


Perhaps the patch adds complexity, so we may want to keep it simpler or commit it piece by piece?
I will look further into your suggestion of maintaining a local hash only during recovery.
Thank you for the suggestion.

Regards,
Kirk Jamison

Re: [Patch] Optimize dropping of relation buffers using dlist

knizhnik
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           not tested
Documentation:            not tested

I have tested this patch under various workloads and hardware (including a Power2 server with 384 virtual cores)
and didn't find a performance regression.

The new status of this patch is: Ready for Committer

RE: [Patch] Optimize dropping of relation buffers using dlist

k.jamison@fujitsu.com
On Friday, July 31, 2020 2:37 AM, Konstantin Knizhnik wrote:

> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:       tested, passed
> Spec compliant:           not tested
> Documentation:            not tested
>
> I have tested this patch at various workloads and hardware (including Power2
> server with 384 virtual cores) and didn't find performance regression.
>
> The new status of this patch is: Ready for Committer

Thank you very much, Konstantin, for testing the patch with different workloads.
I wonder if I need to modify some documentation.
I'll leave the final review to the committers as well.

Regards,
Kirk Jamison

Re: [Patch] Optimize dropping of relation buffers using dlist

Tom Lane-2
In reply to this post by Robert Haas
Robert Haas <[hidden email]> writes:
> Unfortunately, I don't have time for detailed review of this. I am
> suspicious that there are substantial performance regressions that you
> just haven't found yet. I would not take the position that this is a
> completely hopeless approach, or anything like that, but neither would
> I conclude that the tests shown so far are anywhere near enough to be
> confident that there are no problems.

I took a quick look through the v8 patch, since it's marked RFC, and
my feeling is about the same as Robert's: it is just about impossible
to believe that doubling (or more) the amount of hashtable manipulation
involved in allocating a buffer won't hurt common workloads.  The
offered pgbench results don't reassure me; we've so often found that
pgbench fails to expose performance problems, except maybe when it's
used just so.

But aside from that, I noted a number of things I didn't like a bit:

* The amount of new shared memory this needs seems several orders
of magnitude higher than what I'd call acceptable: according to my
measurements it's over 10KB per shared buffer!  Most of that is going
into the CachedBufTableLock data structure, which seems fundamentally
misdesigned --- how could we be needing a lock per map partition *per
buffer*?  For comparison, the space used by buf_table.c is about 56
bytes per shared buffer; I think this needs to stay at least within
hailing distance of there.

* It is fairly suspicious that the new data structure is manipulated
while holding per-partition locks for the existing buffer hashtable.
At best that seems bad for concurrency, and at worst it could result
in deadlocks, because I doubt we can assume that the new hash table
has partition boundaries identical to the old one.

* More generally, it seems like really poor design that this has been
written completely independently of the existing buffer hash table.
Can't we get any benefit by merging them somehow?

* I do not like much of anything in the code details.  "CachedBuf"
is as unhelpful as could be as a data structure identifier --- what
exactly is not "cached" about shared buffers already?  "CombinedLock"
is not too helpful either, nor could I find any documentation explaining
why you need to invent new locking technology in the first place.
At best, CombinedLockAcquireSpinLock seems like a brute-force approach
to an undocumented problem.

* The commentary overall is far too sparse to be of any value ---
basically, any reader will have to reverse-engineer your entire design.
That's not how we do things around here.  There should be either a README,
or a long file header comment, explaining what's going on, how the data
structure is organized, and what the locking requirements are.
See src/backend/storage/buffer/README for the sort of documentation
that I think this needs.

Even if I were convinced that there's no performance gotchas,
I wouldn't commit this in anything like its current form.

Robert again:
> Also, systems with very large shared_buffers settings are becoming
> more common, and probably will continue to become more common, so I
> don't think we can dismiss that as an edge case any more. People don't
> want to run with an 8GB cache on a 1TB server.

I do agree that it'd be great to improve this area.  Just not convinced
that this is how.

                        regards, tom lane



Re: [Patch] Optimize dropping of relation buffers using dlist

Andres Freund
Hi,

On 2020-07-31 13:39:37 -0400, Tom Lane wrote:

> Robert Haas <[hidden email]> writes:
> > Unfortunately, I don't have time for detailed review of this. I am
> > suspicious that there are substantial performance regressions that you
> > just haven't found yet. I would not take the position that this is a
> > completely hopeless approach, or anything like that, but neither would
> > I conclude that the tests shown so far are anywhere near enough to be
> > confident that there are no problems.
>
> I took a quick look through the v8 patch, since it's marked RFC, and
> my feeling is about the same as Robert's: it is just about impossible
> to believe that doubling (or more) the amount of hashtable manipulation
> involved in allocating a buffer won't hurt common workloads.  The
> offered pgbench results don't reassure me; we've so often found that
> pgbench fails to expose performance problems, except maybe when it's
> used just so.

Indeed. The buffer mapping hashtable already is visible as a major
bottleneck in a number of workloads. Even in readonly pgbench if s_b is
large enough (so the hashtable is larger than the cache). Not to speak
of things like a cached sequential scan with a cheap qual and wide rows.


> Robert again:
> > Also, systems with very large shared_buffers settings are becoming
> > more common, and probably will continue to become more common, so I
> > don't think we can dismiss that as an edge case any more. People don't
> > want to run with an 8GB cache on a 1TB server.
>
> I do agree that it'd be great to improve this area.  Just not convinced
> that this is how.

Wonder if the temporary fix is just to do explicit hashtable probes for
all pages iff the size of the relation is < s_b / 500 or so. That'll
address the case where small tables are frequently dropped - and
dropping large relations is more expensive from the OS and data loading
perspective, so it's not gonna happen as often.
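
Something along those lines, roughly (an untested sketch, assuming we have a
trustworthy block count nblocks for the relation):

/*
 * Sketch of the idea above: when the relation is small relative to
 * shared_buffers, probe the buffer mapping hash table per block instead of
 * scanning every buffer.
 */
if (nblocks < NBuffers / 500)               /* threshold is just a guess */
{
    BlockNumber blk;

    for (blk = firstDelBlock; blk < nblocks; blk++)
    {
        BufferTag   tag;
        uint32      hash;
        LWLock     *partitionLock;
        int         buf_id;
        BufferDesc *bufHdr;
        uint32      buf_state;

        INIT_BUFFERTAG(tag, rnode.node, forkNum, blk);
        hash = BufTableHashCode(&tag);
        partitionLock = BufMappingPartitionLock(hash);

        LWLockAcquire(partitionLock, LW_SHARED);
        buf_id = BufTableLookup(&tag, hash);
        LWLockRelease(partitionLock);

        if (buf_id < 0)
            continue;                       /* block not in shared buffers */

        /* re-check the tag under the header lock before invalidating */
        bufHdr = GetBufferDescriptor(buf_id);
        buf_state = LockBufHdr(bufHdr);
        if (BUFFERTAGS_EQUAL(bufHdr->tag, tag))
            InvalidateBuffer(bufHdr);       /* releases the spinlock */
        else
            UnlockBufHdr(bufHdr, buf_state);
    }
}
else
{
    /* fall back to the existing scan of all of shared buffers */
}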

Greetings,

Andres Freund



Re: [Patch] Optimize dropping of relation buffers using dlist

Tom Lane-2
Andres Freund <[hidden email]> writes:
> Indeed. The buffer mapping hashtable already is visible as a major
> bottleneck in a number of workloads. Even in readonly pgbench if s_b is
> large enough (so the hashtable is larger than the cache). Not to speak
> of things like a cached sequential scan with a cheap qual and wide rows.

To be fair, the added overhead is in buffer allocation not buffer lookup.
So it shouldn't add cost to fully-cached cases.  As Tomas noted upthread,
the potential trouble spot is where the working set is bigger than shared
buffers but still fits in RAM (so there's no actual I/O needed, but we do
still have to shuffle buffers a lot).

> Wonder if the temporary fix is just to do explicit hashtable probes for
> all pages iff the size of the relation is < s_b / 500 or so. That'll
> address the case where small tables are frequently dropped - and
> dropping large relations is more expensive from the OS and data loading
> perspective, so it's not gonna happen as often.

Oooh, interesting idea.  We'd need a reliable idea of how long the
relation is (preferably without adding an lseek call), but maybe
that's do-able.

                        regards, tom lane

