Hi,

Currently, we need to scan the WHOLE of shared buffers when VACUUM truncates
off any empty pages at the end of a transaction, or when a relation is
TRUNCATEd. In our customer's case, thousands of tables are truncated
periodically, possibly a single table per transaction. This can be
problematic later on during recovery, which can take longer, especially when
a sudden failover happens after those TRUNCATEs and we have to scan a large
shared buffer pool. In the performance test below, recovery took almost 12.5
minutes with 100GB of shared buffers, but we want to keep failover very short
(within 10 seconds).

Previously, I made an improvement that sped up the truncation of relation
forks from three scans of shared buffers to one. [1]
This time, the aim of this patch is to further speed up the invalidation of
pages by linking the cached pages of the target relation in a doubly-linked
list and traversing that list instead of scanning the whole of shared
buffers. In DropRelFileNodeBuffers, we just look up the target buffers to
invalidate for the relation.

There is a significant win with this patch: we were able to complete failover
and recovery in roughly 3 seconds. I performed tests similar to the ones for
the speedup of truncates of relation forks [1][2], but this time with 100GB
shared_buffers.

[Machine spec used in testing]
Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
CPU: 16, Number of cores per socket: 8
RHEL6.5, Memory: 256GB++

[Test]
1. (Master) Create tables (ex. 10,000 tables). Insert data into the tables.
2. (Master) DELETE FROM TABLE (ex. all rows of 10,000 tables)
   (Standby) To test failover, pause WAL replay on the standby server.
   (SELECT pg_wal_replay_pause();)
3. (M) psql -c "\timing on" (measures total execution of SQL queries)
4. (M) VACUUM (whole db)
5. (M) Stop the primary server: pg_ctl stop -D $PGDATA -w
6. (S) Resume WAL replay and promote the standby. [2]

[Results]
A. HEAD (origin/master branch)
A1. Vacuum execution on primary server
    Time: 730932.408 ms (12:10.932) ~12min 11s
A2. Vacuum + failover (WAL recovery on standby)
    waiting for server to promote...........................
    .................................... stopped waiting
    pg_ctl: server did not promote in time
    2019/10/25_12:13:09.692
    2019/10/25_12:25:43.576
    --> Total: 12min 34s

B. PATCH
B1. Vacuum execution on primary/master
    Time: 6.518333 s = 6518.333 ms
B2. Vacuum + failover (WAL recovery on standby)
    2019/10/25_14:17:21.822
    waiting for server to promote...... done
    server promoted
    2019/10/25_14:17:24.827
    2019/10/25_14:17:24.833
    --> Total: 3.011s

[Other Notes]
Maybe one disadvantage is that the number of relations is variable, yet we
allocate the same number of relation structures as the size of shared
buffers. I tried to reduce memory use during the hash table lookup operation
by having a fixed-size array (100) as a threshold on the number of target
buffers to invalidate per lookup.
When doing CachedBufLookup() to scan the dlist for target buffers, I made
sure to keep the number of scans low (2x at most). First, we scan the dlist
of cached buffers of the relation and store the target buffers in
buf_id_array. Non-target buffers are removed from the main dlist but added to
a temporary dlist. After reaching the end of the main dlist, we append the
temporary dlist to the tail of the main dlist.

I also performed a pgbench buffer test, and this patch did not cause overhead
to normal DB access performance.

Another thing I'd need feedback on is the use of new dlist operations for
this cached buffer list. I did not use the existing Postgres dlist
architecture (ilist.h) in this patch because I want to save as much memory as
possible, especially when NBuffers becomes large. Both dlist_node and
dlist_head are 16 bytes; OTOH, the two int pointers used in this patch are
8 bytes.

Hope to hear your feedback and comments.

Thanks in advance,
Kirk Jamison

[1] https://www.postgresql.org/message-id/flat/D09B13F772D2274BB348A310EE3027C64E2067%40g01jpexmbkw24
[2] https://www.postgresql.org/message-id/D09B13F772D2274BB348A310EE3027C6502672%40g01jpexmbkw24
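For reference, the memory comparison above comes down to pointer-sized list
links versus buffer-index links. A minimal sketch (the struct name and the
sentinel value below are illustrative, not the patch's actual definitions):

    /* Illustrative sketch only -- not the patch's actual code. */

    /* ilist.h's dlist_node holds two 8-byte pointers on 64-bit platforms
     * (16 bytes per buffer); linking buffers by their integer buf_id
     * instead needs only two 4-byte ints (8 bytes per buffer). */
    #define CACHEDBLOCK_END_OF_LIST   (-1)      /* sentinel; value assumed */

    typedef struct CachedBlockLink
    {
        int         prev;   /* buf_id of previous cached block, or sentinel */
        int         next;   /* buf_id of next cached block, or sentinel */
    } CachedBlockLink;

Because shared buffers are addressed by an integer buf_id, 32-bit indexes are
enough to link them, which is where the 8-bytes-per-buffer figure comes from.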
Hi,

> Another thing I'd need feedback on is the use of new dlist operations for
> this cached buffer list. I did not use the existing Postgres dlist
> architecture (ilist.h) in this patch because I want to save as much memory
> as possible, especially when NBuffers becomes large. Both dlist_node and
> dlist_head are 16 bytes; OTOH, the two int pointers used in this patch are
> 8 bytes.

In cb_dlist_combine(), the code block below can impact performance,
especially when the doubly-linked list is long (IOW, many cached buffers):

    /* Point to the tail of main dlist */
    while (curr_main->next != CACHEDBLOCK_END_OF_LIST)
        curr_main = cb_dlist_next(curr_main);

Attached is an improved version of the previous patch, which adds TAIL
pointer information in order to speed up the abovementioned operation. I
stored the tail field in the prev pointer of the head entry (maybe not a
typical approach). A more typical one would be to add a tail field (int tail)
to CachedBufferEnt, but I didn't do that because, as I mentioned in the
previous email, I want to avoid using more memory as much as possible. The
patch works as intended and passes the tests.

Any thoughts?

Regards,
Kirk Jamison
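To make the trick concrete, here is a minimal sketch of an O(1) append when
the head entry's otherwise-unused prev slot caches the tail's buf_id. The
function name and the initialization assumptions are illustrative, not taken
from the patch; it builds on the link layout sketched earlier:

    /* Sketch: links[] is the per-buffer link array; links[head_id].prev is
     * assumed to always hold the current tail's buf_id (the head itself
     * when the list is otherwise empty). */
    static void
    cb_dlist_push_tail(CachedBlockLink *links, int head_id, int new_id)
    {
        int         tail_id = links[head_id].prev;  /* O(1) tail lookup */

        links[tail_id].next = new_id;
        links[new_id].prev = tail_id;
        links[new_id].next = CACHEDBLOCK_END_OF_LIST;
        links[head_id].prev = new_id;               /* new_id becomes the tail */
    }

With the tail cached this way, cb_dlist_combine() no longer needs the while
loop quoted above to walk to the tail of the main dlist.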
Hi Kirk,
On Tue, Nov 05, 2019 at 09:58:22AM +0000, [hidden email] wrote:
>Hi,
>
>> Another thing I'd need feedback on is the use of new dlist operations for
>> this cached buffer list. I did not use the existing Postgres dlist
>> architecture (ilist.h) in this patch because I want to save as much memory
>> as possible, especially when NBuffers becomes large. Both dlist_node and
>> dlist_head are 16 bytes; OTOH, the two int pointers used in this patch are
>> 8 bytes.
>
>In cb_dlist_combine(), the code block below can impact performance,
>especially when the doubly-linked list is long (IOW, many cached buffers):
>
>    /* Point to the tail of main dlist */
>    while (curr_main->next != CACHEDBLOCK_END_OF_LIST)
>        curr_main = cb_dlist_next(curr_main);
>
>Attached is an improved version of the previous patch, which adds TAIL
>pointer information in order to speed up the abovementioned operation. I
>stored the tail field in the prev pointer of the head entry (maybe not a
>typical approach). A more typical one would be to add a tail field (int
>tail) to CachedBufferEnt, but I didn't do that because, as I mentioned in
>the previous email, I want to avoid using more memory as much as possible.
>The patch works as intended and passes the tests.
>
>Any thoughts?
>

A couple of comments based on briefly looking at the patch.

1) I don't think you should / need to expose most of the new stuff in
buf_internals.h. It's only used from buf_internals.c, and having all the
various cb_dlist_* functions in the .h seems strange.

2) This adds another hashtable maintenance cost to BufferAlloc etc., but
you've only done tests / benchmarks for the case this optimizes. I think we
need to see a benchmark for a workload that allocates and invalidates a lot
of buffers. A pgbench with a workload that fits into RAM but not into shared
buffers would be interesting.

3) I see this triggered a failure on cputube, in the commit_ts TAP test.
That's a bit strange; someone should investigate, I guess.

https://travis-ci.org/postgresql-cfbot/postgresql/builds/607563900

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Nov 5, 2019 at 10:34 AM Tomas Vondra
<[hidden email]> wrote:
> 2) This adds another hashtable maintenance cost to BufferAlloc etc., but
> you've only done tests / benchmarks for the case this optimizes. I think we
> need to see a benchmark for a workload that allocates and invalidates a lot
> of buffers. A pgbench with a workload that fits into RAM but not into
> shared buffers would be interesting.

Yeah, it seems pretty hard to believe that this won't be bad for some
workloads. Not only do you have the overhead of the hash table operations,
but you also have locking overhead around that. A whole new set of LWLocks
where you have to take and release one of them every time you allocate or
invalidate a buffer seems likely to cause a pretty substantial contention
problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thurs, November 7, 2019 1:27 AM (GMT+9), Robert Haas wrote:
> On Tue, Nov 5, 2019 at 10:34 AM Tomas Vondra <[hidden email]>
> wrote:
> > 2) This adds another hashtable maintenance cost to BufferAlloc etc., but
> > you've only done tests / benchmarks for the case this optimizes. I think
> > we need to see a benchmark for a workload that allocates and invalidates
> > a lot of buffers. A pgbench with a workload that fits into RAM but not
> > into shared buffers would be interesting.
>
> Yeah, it seems pretty hard to believe that this won't be bad for some
> workloads. Not only do you have the overhead of the hash table operations,
> but you also have locking overhead around that. A whole new set of LWLocks
> where you have to take and release one of them every time you allocate or
> invalidate a buffer seems likely to cause a pretty substantial contention
> problem.

I'm sorry for the late reply. Thank you Tomas and Robert for checking this patch.
Attached is the v3 of the patch.
- I moved the unnecessary items from buf_internals.h to cached_buf.c, since
  most of those items are only used in that file.
- Fixed the bug of v2. It seems to pass both the RT and TAP tests now.

Thanks for the advice on the benchmark test. Please refer below for the test
and results.

[Machine spec]
CPU: 16, Number of cores per socket: 8
RHEL6.5, Memory: 240GB

scale: 3125 (about 46GB DB size)
shared_buffers = 8GB

[workload that fits into RAM but not into shared buffers]
pgbench -i -s 3125 cachetest
pgbench -c 16 -j 8 -T 600 cachetest

[Patched]
scaling factor: 3125
query mode: simple
number of clients: 16
number of threads: 8
duration: 600 s
number of transactions actually processed: 8815123
latency average = 1.089 ms
tps = 14691.436343 (including connections establishing)
tps = 14691.482714 (excluding connections establishing)

[Master/Unpatched]
...
number of transactions actually processed: 8852327
latency average = 1.084 ms
tps = 14753.814648 (including connections establishing)
tps = 14753.861589 (excluding connections establishing)

My patch caused a small overhead of about 0.42-0.46%, which I think is small.
Kindly let me know your opinions/comments about the patch or tests, etc.

Thanks,
Kirk Jamison
On Tue, Nov 12, 2019 at 10:49:49AM +0000, [hidden email] wrote:
>On Thurs, November 7, 2019 1:27 AM (GMT+9), Robert Haas wrote:
>> [...]
>
>I'm sorry for the late reply. Thank you Tomas and Robert for checking this patch.
>Attached is the v3 of the patch.
>- I moved the unnecessary items from buf_internals.h to cached_buf.c, since
>  most of those items are only used in that file.
>- Fixed the bug of v2. It seems to pass both the RT and TAP tests now.
>
>Thanks for the advice on the benchmark test. Please refer below for the test
>and results.
>
>[Machine spec]
>CPU: 16, Number of cores per socket: 8
>RHEL6.5, Memory: 240GB
>
>scale: 3125 (about 46GB DB size)
>shared_buffers = 8GB
>
>[workload that fits into RAM but not into shared buffers]
>pgbench -i -s 3125 cachetest
>pgbench -c 16 -j 8 -T 600 cachetest
>
>[Patched]
>scaling factor: 3125
>query mode: simple
>number of clients: 16
>number of threads: 8
>duration: 600 s
>number of transactions actually processed: 8815123
>latency average = 1.089 ms
>tps = 14691.436343 (including connections establishing)
>tps = 14691.482714 (excluding connections establishing)
>
>[Master/Unpatched]
>...
>number of transactions actually processed: 8852327
>latency average = 1.084 ms
>tps = 14753.814648 (including connections establishing)
>tps = 14753.861589 (excluding connections establishing)
>
>My patch caused a small overhead of about 0.42-0.46%, which I think is small.
>Kindly let me know your opinions/comments about the patch or tests, etc.
>

Now try measuring that with a read-only workload, with prepared statements.
I've tried that on a machine with 16 cores, doing

# 16 clients
pgbench -n -S -j 16 -c 16 -M prepared -T 60 test

# 1 client
pgbench -n -S -c 1 -M prepared -T 60 test

and the average from 30 runs of each looks like this:

    # clients      master     patched        %
    -----------------------------------------------
        1           29690       27833      93.7%
       16          300935      283383      94.1%

That's quite a significant regression, considering it's optimizing an
operation that is expected to be pretty rare (people are generally not
dropping objects as often as they query them).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Nov 13, 2019 4:20AM (GMT +9), Tomas Vondra wrote:
> On Tue, Nov 12, 2019 at 10:49:49AM +0000, [hidden email] wrote:
> >On Thurs, November 7, 2019 1:27 AM (GMT+9), Robert Haas wrote:
> >> [...]
> >
> >[...]
> >
> >My patch caused a small overhead of about 0.42-0.46%, which I think is small.
> >Kindly let me know your opinions/comments about the patch or tests, etc.
> >
>
> Now try measuring that with a read-only workload, with prepared statements.
> I've tried that on a machine with 16 cores, doing
>
> # 16 clients
> pgbench -n -S -j 16 -c 16 -M prepared -T 60 test
>
> # 1 client
> pgbench -n -S -c 1 -M prepared -T 60 test
>
> and the average from 30 runs of each looks like this:
>
>     # clients      master     patched        %
>     -----------------------------------------------
>         1           29690       27833      93.7%
>        16          300935      283383      94.1%
>
> That's quite a significant regression, considering it's optimizing an
> operation that is expected to be pretty rare (people are generally not
> dropping objects as often as they query them).

Attached is the v4 of the patch, with tunable definitions in the code.
Instead of using only the rnode as the hash key, I also added the modulo of
the block number:

    #define NUM_MAP_PARTITIONS_FOR_REL  128  /* relation-level */
    #define NUM_MAP_PARTITIONS_IN_REL     4  /* block-level */
    #define NUM_MAP_PARTITIONS \
        (NUM_MAP_PARTITIONS_FOR_REL * NUM_MAP_PARTITIONS_IN_REL)

I executed the benchmark for the read-only workload again, and the regression
currently sits at 3.10% (reduced from v3's 6%).
Average of 10 runs, 16 clients, read-only, prepared query mode

[Master]
num of txn processed: 11,950,983.67
latency average = 0.080 ms
tps = 199,182.24
tps = 199,189.54

[V4 Patch]
num of txn processed: 11,580,256.36
latency average = 0.083 ms
tps = 193,003.52
tps = 193,010.76

I checked the wait event statistics (non-impactful events omitted) and got
the following. I reset the stats before running the pgbench script, then
showed the stats right after the run.

[Master]
 wait_event_type |      wait_event       |  calls   | microsec
-----------------+-----------------------+----------+----------
 Client          | ClientRead            |    25116 | 49552452
 IO              | DataFileRead          | 14467109 | 92113056
 LWLock          | buffer_mapping        |   204618 |  1364779

[Patch V4]
 wait_event_type |      wait_event       |  calls   | microsec
-----------------+-----------------------+----------+----------
 Client          | ClientRead            |   111393 | 68773946
 IO              | DataFileRead          | 14186773 | 90399833
 LWLock          | buffer_mapping        |   463844 |  4025198
 LWLock          | cached_buf_tranche_id |    83390 |   336080

It seems the buffer_mapping LWLock wait time is about 3x that of master.
I'd like to continue working on this patch for the next commitfest, and
further reduce its impact on read-only workloads.

Regards,
Kirk Jamison
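For readers following along, the partitioning described above might look
roughly like the sketch below. This is a hypothetical helper; the patch's
actual key layout and choice of hash function may differ. The relation
identity picks one of the 128 relation-level partitions, and the block number
modulo 4 picks the sub-partition within it:

    /* Illustrative sketch -- not the patch's actual code.  hash_bytes() is
     * used here as an example of a stable hash over the RelFileNode. */
    static inline uint32
    CachedBufMapPartition(const RelFileNode *rnode, BlockNumber blockNum)
    {
        uint32      rel_part;
        uint32      blk_part;

        rel_part = hash_bytes((const unsigned char *) rnode,
                              sizeof(RelFileNode)) % NUM_MAP_PARTITIONS_FOR_REL;
        blk_part = blockNum % NUM_MAP_PARTITIONS_IN_REL;

        return rel_part * NUM_MAP_PARTITIONS_IN_REL + blk_part;
    }

Spreading a relation's blocks over several partitions this way reduces the
chance that every buffer allocation for a hot relation contends on the same
partition lock.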
Hi,
I have updated the patch (v5). I tried to reduce the lock waiting times by
using a spinlock when inserting/deleting buffers in the new hash table, and
an exclusive lock only when looking up the buffers to be dropped. In summary,
instead of scanning the whole buffer pool in shared buffers, we just traverse
the doubly-linked list of cached buffers for the target relation and block.

In order to understand how this patch affects performance, I also measured
the cache hit rates in addition to benchmarking the db with various shared
buffer size settings.

Using the same machine spec, I used the default pgbench script for a
read-only workload with prepared statements, and executed about 15 runs for
varying shared buffer sizes.

pgbench -i -s 3200 test    // (about 48GB db size)
pgbench -S -n -M prepared -c 16 -j 16 -T 60 test

[TPS Regression]
 shbuf  | tps(master)  | tps(patch)   | %reg
--------+--------------+--------------+-------
 5GB    |  195,737.23  |  191,422.23  |  2.23
 10GB   |  197,067.93  |  194,011.66  |  1.55
 20GB   |  200,241.18  |  200,425.29  | -0.09
 40GB   |  208,772.81  |  209,807.38  | -0.50
 50GB   |  215,684.33  |  218,955.43  | -1.52

[CACHE HIT RATE]
 shbuf  | master    | patch
--------+-----------+-----------
 10GB   | 0.141536  | 0.141485
 20GB   | 0.330088  | 0.329894
 30GB   | 0.573383  | 0.573377
 40GB   | 0.819499  | 0.819264
 50GB   | 0.999237  | 0.999577

For this workload, the regression grows below a 20GB shared_buffers setting.
However, at 20GB the cache hit rate is only 32% for both master and patch.
I therefore think we can treat this kind of workload with a small
shared_buffers size as a special case, because in terms of db performance
tuning we want the db to have as high a cache hit rate as possible (99.9%, or
let's say 80% is acceptable). In this workload, the ideal shared_buffers size
would be around 40GB or more to reach that acceptable hit rate. Looking at
the patch's performance results, within the acceptable cache hit rate range
there is at least no regression, and the results show almost the same tps as
master.

Your feedback about the patch and tests is welcome.

Regards,
Kirk Jamison
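The v5 locking split described above might be sketched as follows. The struct
and function names are placeholders, not the patch's actual identifiers, and
the sketch builds on the link layout shown earlier in the thread:

    typedef struct CachedBufPartition
    {
        slock_t     mutex;      /* guards link insert/delete (hot path) */
        LWLock     *lock;       /* taken exclusively only when dropping */
    } CachedBufPartition;

    /* Hot path (BufferAlloc / invalidating a single buffer): keep the
     * critical section tiny and take only the spinlock. */
    static void
    cached_buf_link(CachedBufPartition *p, CachedBlockLink *links,
                    int head_id, int buf_id)
    {
        SpinLockAcquire(&p->mutex);
        cb_dlist_push_tail(links, head_id, buf_id);
        SpinLockRelease(&p->mutex);
    }

The drop path (DropRelFileNodeBuffers) would instead wrap its walk of the
relation's list in LWLockAcquire(p->lock, LW_EXCLUSIVE) /
LWLockRelease(p->lock); since drops are comparatively rare, the heavier lock
matters less there.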
Hi,
I have rebased the patch to keep the CFbot happy. Apparently, the previous
patch had a possible infinite loop when allocating buffers, so I fixed that
part and also removed some whitespace.

Kindly check the attached v6 patch.
Any thoughts on this?

Regards,
Kirk Jamison
On Tue, Feb 4, 2020 at 4:57 AM [hidden email]
<[hidden email]> wrote:
> Kindly check the attached v6 patch.
> Any thoughts on this?

Unfortunately, I don't have time for a detailed review of this. I am
suspicious that there are substantial performance regressions that you just
haven't found yet. I would not take the position that this is a completely
hopeless approach, or anything like that, but neither would I conclude that
the tests shown so far are anywhere near enough to be confident that there
are no problems.

Also, systems with very large shared_buffers settings are becoming more
common, and probably will continue to become more common, so I don't think we
can dismiss that as an edge case any more. People don't want to run with an
8GB cache on a 1TB server.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
I know this might already be late in the CommitFest, but attached is the
latest version of the patch. The previous version only included the buffer
invalidation improvement for VACUUM; the new patch adds the same routine for
TRUNCATE WAL replay.

In summary, this patch aims to improve the buffer invalidation process of
VACUUM and TRUNCATE. Although it may not be a common use case, our customer
uses these commands a lot. Recovery and WAL replay of these commands can take
time depending on the size of the database buffers, so this patch optimizes
that using the newly-added auxiliary cache and doubly-linked list in shared
memory, so that we don't need to scan the shared buffers anymore.

As for performance and how it affects read-only workloads: using pgbench,
I've kept the overhead to a minimum, less than 1%. I'll post follow-up
results. Although the additional hash table utilizes shared memory, there's a
significant performance gain for both TRUNCATE and VACUUM, from execution to
recovery.

Regards,
Kirk Jamison
On Wednesday, March 25, 2020 3:25 PM, Kirk Jamison wrote:
> As for performance and how it affects read-only workloads: using pgbench,
> I've kept the overhead to a minimum, less than 1%. I'll post follow-up
> results.

Here are the follow-up results. I executed tests similar to those at the top
of the thread. I hope the performance test results shown below suffice; if
not, I'd appreciate any feedback with regards to the tests or the patch
itself.

A. VACUUM execution + failover test - 100GB shared_buffers

1. 1000 tables (18MB)
   1.1. Execution time
        - [MASTER] 77755.218 ms (01:17.755)
        - [PATCH]  2147.914 ms (00:02.148)
   1.2. Failover time (recovery WAL replay)
        - [MASTER] 01:37.084 (1 min 37.884 s)
        - [PATCH]  1627 ms (1.627 s)

2. 10000 tables (110MB)
   2.1. Execution time
        - [MASTER] 844174.572 ms (14:04.175) ~14 min 4.175 s
        - [PATCH]  75678.559 ms (01:15.679) ~1 min 15.679 s
   2.2. Failover time
        - [MASTER] est. 14 min++ (I didn't measure anymore, because recovery
          takes as much time as the execution.)
        - [PATCH]  01:25.559 (1 min 25.559 s)

Significant performance results for VACUUM.

B. TPS regression for READ-ONLY workload (prepared query mode, no vacuum)

# [16 clients] - pgbench -n -S -j 16 -c 16 -M prepared -T 60 cachetest
 shbuf  | Master     | Patch      | %reg
--------+------------+------------+--------
 128MB  | 77,416.76  | 77,162.78  |  0.33%
 1GB    | 81,941.30  | 81,812.05  |  0.16%
 2GB    | 84,273.69  | 84,356.38  | -0.10%
 100GB  | 83,807.30  | 83,924.68  | -0.14%

# [1 client] - pgbench -n -S -c 1 -M prepared -T 60 cachetest
 shbuf  | Master     | Patch      | %reg
--------+------------+------------+--------
 128MB  | 12,044.54  | 12,037.13  |  0.06%
 1GB    | 12,736.57  | 12,774.77  | -0.30%
 2GB    | 12,948.98  | 13,159.90  | -1.63%
 100GB  | 12,982.98  | 13,064.04  | -0.62%

Both were run 10 times; the average tps and % regression are shown above. In
some cases only minimal overhead was caused by the patch; in the other cases
it has higher tps compared to master.

If it does not make it into this CF, I hope to receive feedback in the future
on how to proceed. Thanks in advance!

Regards,
Kirk Jamison
Hi,
Since the last posted version of the patch fails to apply, attached is a
rebased version. Performance results and some benefits and challenges were
written upthread. I'd appreciate your feedback/comments.

Regards,
Kirk Jamison
On 17.06.2020 09:14, [hidden email] wrote:
> Hi,
>
> Since the last posted version of the patch fails to apply, attached is a
> rebased version. Performance results and some benefits and challenges were
> written upthread. I'd appreciate your feedback/comments.
>
> Regards,
> Kirk Jamison

As far as I understand, this patch can provide a significant improvement of
performance only in the case of recovery of truncates of a large number of
tables. You have added a shared hash of relation buffers, and it certainly
adds some extra overhead. According to your latest results this overhead is
quite small, but it will be hard to prove that there will be no noticeable
regression at some workloads.

I wonder if you have considered the case of a local hash (maintained only
during recovery)? If there is after-crash recovery, then there will be no
concurrent access to shared buffers and this hash will be up-to-date. In case
of a hot-standby replica we can use some simple invalidation (just one flag
or counter which indicates that the buffer cache was updated). This hash can
also be constructed on demand when DropRelFileNodeBuffers is called the first
time (so we have to scan all buffers once, but subsequent drop operations
will be fast).

I have not thought much about it, but it seems to me that, as far as this
problem only affects recovery, we do not need a shared hash for it.
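A minimal sketch of the recovery-only alternative suggested above might look
like the following. The names are hypothetical and the body is simplified (it
ignores buffer validity checks and header locking, which crash recovery's
single-process nature makes less of a concern):

    /* Backend-local map from RelFileNode to the buf_ids holding that
     * relation's pages, built lazily the first time DropRelFileNodeBuffers
     * runs during recovery. */
    typedef struct RecoveryBufMapEnt
    {
        RelFileNode rnode;          /* hash key */
        int         first_buf_id;   /* head of a local list of buf_ids */
    } RecoveryBufMapEnt;

    static HTAB *recovery_buf_map = NULL;

    static void
    build_recovery_buf_map(void)
    {
        HASHCTL     ctl;
        int         i;

        memset(&ctl, 0, sizeof(ctl));
        ctl.keysize = sizeof(RelFileNode);
        ctl.entrysize = sizeof(RecoveryBufMapEnt);
        recovery_buf_map = hash_create("recovery buffer map", 1024, &ctl,
                                       HASH_ELEM | HASH_BLOBS);

        /* one full scan of shared buffers; later drops consult the map */
        for (i = 0; i < NBuffers; i++)
        {
            BufferDesc *buf = GetBufferDescriptor(i);
            RecoveryBufMapEnt *ent;
            bool        found;

            ent = hash_search(recovery_buf_map, &buf->tag.rnode,
                              HASH_ENTER, &found);
            if (!found)
                ent->first_buf_id = i;  /* further buf_ids would be chained
                                         * in a local list (omitted here) */
        }
    }

Being backend-local, such a map adds no shared memory and no locking on the
normal buffer allocation path, which is the main concern raised upthread.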
On Wednesday, July 29, 2020 4:55 PM, Konstantin Knizhnik wrote:
> On 17.06.2020 09:14, [hidden email] wrote:
> > Hi,
> >
> > Since the last posted version of the patch fails to apply, attached is a
> > rebased version. Performance results and some benefits and challenges
> > were written upthread. I'd appreciate your feedback/comments.
> >
> > Regards,
> > Kirk Jamison
>
> As far as I understand, this patch can provide a significant improvement of
> performance only in the case of recovery of truncates of a large number of
> tables. You have added a shared hash of relation buffers, and it certainly
> adds some extra overhead. According to your latest results this overhead is
> quite small, but it will be hard to prove that there will be no noticeable
> regression at some workloads.

Thank you for taking a look at this.
Yes, one of the aims is to speed up recovery of truncations, but at the same
time the patch also improves autovacuum, vacuum, and relation truncate index
executions. I showed results of pgbench runs above for different types of
workloads, but I am not sure if those are validating enough...

> I wonder if you have considered the case of a local hash (maintained only
> during recovery)? If there is after-crash recovery, then there will be no
> concurrent access to shared buffers and this hash will be up-to-date. In
> case of a hot-standby replica we can use some simple invalidation (just one
> flag or counter which indicates that the buffer cache was updated). This
> hash can also be constructed on demand when DropRelFileNodeBuffers is
> called the first time (so we have to scan all buffers once, but subsequent
> drop operations will be fast).
>
> I have not thought much about it, but it seems to me that, as far as this
> problem only affects recovery, we do not need a shared hash for it.

The idea of the patch is to mark the relation buffers to be dropped after
scanning the whole shared buffers, store them in a dlist maintained in shared
memory, and traverse that dlist on the next scan. But I understand the point
that it is expensive and may cause overhead; that is why I tried to define a
macro to limit the number of pages that we process per lookup, for cases
where the lookup cost can be problematic (i.e., too many pages in the
relation):

    #define BUF_ID_ARRAY_SIZE 100

    int     buf_id_array[BUF_ID_ARRAY_SIZE];
    int     forknum_indexes[BUF_ID_ARRAY_SIZE];

    /* In DropRelFileNodeBuffers */
    do
    {
        /* collect up to BUF_ID_ARRAY_SIZE target buffers per call */
        nbufs = CachedBlockLookup(..., forknum_indexes, buf_id_array,
                                  lengthof(buf_id_array));

        for (i = 0; i < nbufs; i++)
        {
            /* ... invalidate each collected buffer ... */
        }
    } while (nbufs == lengthof(buf_id_array));

Perhaps the patch affects complexity more than we want, so we may want to
keep it simpler, or commit it piece by piece? I will look further into your
suggestion of maintaining a local hash only during recovery. Thank you for
the suggestion.

Regards,
Kirk Jamison
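To clarify the batching contract used by the do-while loop above, here is a
simplified sketch of what the lookup does. It is not the patch's actual
CachedBlockLookup, which also filters by fork/block and unlinks the target
buffers from the dlist; this version only shows why returning exactly 'max'
entries tells the caller to loop again:

    /* Illustrative sketch, building on the link layout shown earlier. */
    static int
    CachedBlockLookupSketch(CachedBlockLink *links, int head_id,
                            int *buf_id_array, int max)
    {
        int         nbufs = 0;
        int         cur = links[head_id].next;

        while (cur != CACHEDBLOCK_END_OF_LIST && nbufs < max)
        {
            buf_id_array[nbufs++] = cur;    /* remember this buffer */
            cur = links[cur].next;
        }
        return nbufs;       /* < max means we reached the end of the list */
    }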
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           not tested
Documentation:            not tested

I have tested this patch at various workloads and hardware (including a
Power2 server with 384 virtual cores) and didn't find a performance
regression.

The new status of this patch is: Ready for Committer
On Friday, July 31, 2020 2:37 AM, Konstantin Knizhnik wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:       tested, passed
> Spec compliant:           not tested
> Documentation:            not tested
>
> I have tested this patch at various workloads and hardware (including a
> Power2 server with 384 virtual cores) and didn't find a performance
> regression.
>
> The new status of this patch is: Ready for Committer

Thank you very much, Konstantin, for testing the patch for different
workloads. I wonder if I need to modify some documentation. I'll leave the
final review to the committer(s) as well.

Regards,
Kirk Jamison
Robert Haas <[hidden email]> writes:
> Unfortunately, I don't have time for detailed review of this. I am
> suspicious that there are substantial performance regressions that you
> just haven't found yet. I would not take the position that this is a
> completely hopeless approach, or anything like that, but neither would
> I conclude that the tests shown so far are anywhere near enough to be
> confident that there are no problems.

I took a quick look through the v8 patch, since it's marked RFC, and my
feeling is about the same as Robert's: it is just about impossible to believe
that doubling (or more) the amount of hashtable manipulation involved in
allocating a buffer won't hurt common workloads. The offered pgbench results
don't reassure me; we've so often found that pgbench fails to expose
performance problems, except maybe when it's used just so.

But aside from that, I noted a number of things I didn't like a bit:

* The amount of new shared memory this needs seems several orders of
magnitude higher than what I'd call acceptable: according to my measurements
it's over 10KB per shared buffer! Most of that is going into the
CachedBufTableLock data structure, which seems fundamentally misdesigned ---
how could we be needing a lock per map partition *per buffer*? For
comparison, the space used by buf_table.c is about 56 bytes per shared
buffer; I think this needs to stay at least within hailing distance of there.

* It is fairly suspicious that the new data structure is manipulated while
holding per-partition locks for the existing buffer hashtable. At best that
seems bad for concurrency, and at worst it could result in deadlocks, because
I doubt we can assume that the new hash table has partition boundaries
identical to the old one.

* More generally, it seems like really poor design that this has been written
completely independently of the existing buffer hash table. Can't we get any
benefit by merging them somehow?

* I do not like much of anything in the code details. "CachedBuf" is as
unhelpful as could be as a data structure identifier --- what exactly is not
"cached" about shared buffers already? "CombinedLock" is not too helpful
either, nor could I find any documentation explaining why you need to invent
new locking technology in the first place. At best,
CombinedLockAcquireSpinLock seems like a brute-force approach to an
undocumented problem.

* The commentary overall is far too sparse to be of any value --- basically,
any reader will have to reverse-engineer your entire design. That's not how
we do things around here. There should be either a README, or a long file
header comment, explaining what's going on, how the data structure is
organized, and what the locking requirements are. See
src/backend/storage/buffer/README for the sort of documentation that I think
this needs.

Even if I were convinced that there's no performance gotchas, I wouldn't
commit this in anything like its current form.

Robert again:
> Also, systems with very large shared_buffers settings are becoming
> more common, and probably will continue to become more common, so I
> don't think we can dismiss that as an edge case any more. People don't
> want to run with an 8GB cache on a 1TB server.

I do agree that it'd be great to improve this area. Just not convinced that
this is how.

            regards, tom lane
Hi,
On 2020-07-31 13:39:37 -0400, Tom Lane wrote:
> Robert Haas <[hidden email]> writes:
> > Unfortunately, I don't have time for detailed review of this. I am
> > suspicious that there are substantial performance regressions that you
> > just haven't found yet. I would not take the position that this is a
> > completely hopeless approach, or anything like that, but neither would
> > I conclude that the tests shown so far are anywhere near enough to be
> > confident that there are no problems.
>
> I took a quick look through the v8 patch, since it's marked RFC, and my
> feeling is about the same as Robert's: it is just about impossible to
> believe that doubling (or more) the amount of hashtable manipulation
> involved in allocating a buffer won't hurt common workloads. The offered
> pgbench results don't reassure me; we've so often found that pgbench fails
> to expose performance problems, except maybe when it's used just so.

Indeed. The buffer mapping hashtable already is visible as a major bottleneck
in a number of workloads. Even in read-only pgbench if s_b is large enough
(so the hashtable is larger than the cache). Not to speak of things like a
cached sequential scan with a cheap qual and wide rows.

> Robert again:
> > Also, systems with very large shared_buffers settings are becoming
> > more common, and probably will continue to become more common, so I
> > don't think we can dismiss that as an edge case any more. People don't
> > want to run with an 8GB cache on a 1TB server.
>
> I do agree that it'd be great to improve this area. Just not convinced
> that this is how.

Wonder if the temporary fix is just to do explicit hashtable probes for all
pages iff the size of the relation is < s_b / 500 or so. That'll address the
case where small tables are frequently dropped - and dropping large relations
is more expensive from the OS and data loading perspective, so it's not gonna
happen as often.

Greetings,

Andres Freund
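A sketch of what that fallback could look like inside bufmgr.c is shown
below. This is illustrative only: the helper name, parameters, and the 1/500
threshold are assumptions drawn from the suggestion above, not settled
design. The caller would take this path only when the relation's block count
is known and below NBuffers / 500, and otherwise fall back to the usual full
scan of shared buffers:

    static void
    DropRelBuffersByProbe(RelFileNodeBackend rnode, ForkNumber forkNum,
                          BlockNumber firstDelBlock, BlockNumber nblocks)
    {
        BlockNumber blk;

        for (blk = firstDelBlock; blk < nblocks; blk++)
        {
            BufferTag   tag;
            uint32      hash;
            LWLock     *partitionLock;
            int         buf_id;
            BufferDesc *bufHdr;
            uint32      buf_state;

            /* probe the buffer mapping hash for this specific block */
            INIT_BUFFERTAG(tag, rnode.node, forkNum, blk);
            hash = BufTableHashCode(&tag);
            partitionLock = BufMappingPartitionLock(hash);

            LWLockAcquire(partitionLock, LW_SHARED);
            buf_id = BufTableLookup(&tag, hash);
            LWLockRelease(partitionLock);

            if (buf_id < 0)
                continue;       /* block not cached */

            /* re-check the tag under the header spinlock before dropping */
            bufHdr = GetBufferDescriptor(buf_id);
            buf_state = LockBufHdr(bufHdr);
            if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
                bufHdr->tag.forkNum == forkNum &&
                bufHdr->tag.blockNum == blk)
                InvalidateBuffer(bufHdr);   /* releases the header spinlock */
            else
                UnlockBufHdr(bufHdr, buf_state);
        }
    }

The attraction is that it reuses the existing buffer mapping hash and adds no
new shared state or locks on the allocation path; the cost is proportional to
the (small) relation being dropped rather than to NBuffers.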
Andres Freund <[hidden email]> writes:
> Indeed. The buffer mapping hashtable already is visible as a major
> bottleneck in a number of workloads. Even in read-only pgbench if s_b is
> large enough (so the hashtable is larger than the cache). Not to speak of
> things like a cached sequential scan with a cheap qual and wide rows.

To be fair, the added overhead is in buffer allocation not buffer lookup. So
it shouldn't add cost to fully-cached cases. As Tomas noted upthread, the
potential trouble spot is where the working set is bigger than shared buffers
but still fits in RAM (so there's no actual I/O needed, but we do still have
to shuffle buffers a lot).

> Wonder if the temporary fix is just to do explicit hashtable probes for all
> pages iff the size of the relation is < s_b / 500 or so. That'll address
> the case where small tables are frequently dropped - and dropping large
> relations is more expensive from the OS and data loading perspective, so
> it's not gonna happen as often.

Oooh, interesting idea. We'd need a reliable idea of how long the relation is
(preferably without adding an lseek call), but maybe that's do-able.

            regards, tom lane