[Patch] Optimize dropping of relation buffers using dlist


Re: [Patch] Optimize dropping of relation buffers using dlist

Andres Freund
Hi,

On 2020-07-31 15:50:04 -0400, Tom Lane wrote:

> Andres Freund <[hidden email]> writes:
> > Indeed. The buffer mapping hashtable already is visible as a major
> > bottleneck in a number of workloads. Even in readonly pgbench if s_b is
> > large enough (so the hashtable is larger than the cache). Not to speak
> > of things like a cached sequential scan with a cheap qual and wide rows.
>
> To be fair, the added overhead is in buffer allocation not buffer lookup.
> So it shouldn't add cost to fully-cached cases.  As Tomas noted upthread,
> the potential trouble spot is where the working set is bigger than shared
> buffers but still fits in RAM (so there's no actual I/O needed, but we do
> still have to shuffle buffers a lot).

Oh, right, not sure what I was thinking.


> > Wonder if the temporary fix is just to do explicit hashtable probes for
> > all pages iff the size of the relation is < s_b / 500 or so. That'll
> > address the case where small tables are frequently dropped - and
> > dropping large relations is more expensive from the OS and data loading
> > perspective, so it's not gonna happen as often.
>
> Oooh, interesting idea.  We'd need a reliable idea of how long the
> relation is (preferably without adding an lseek call), but maybe
> that's do-able.

IIRC we already do smgrnblocks nearby, when doing the truncation (to
figure out which segments we need to remove). Perhaps we can arrange to
combine the two? The layering probably makes that somewhat ugly :(

We could also just use pg_class.relpages. It'll probably mostly be
accurate enough?

Or we could just cache the result of the last smgrnblocks call...


One of the cases where this type of strategy is most interesting to me is
the partial truncations that autovacuum does... There we even know the
range of tables ahead of time.
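A standalone toy model may make the threshold idea concrete. Everything below (the `toy_*` names, table sizes, the open-addressing table itself) is invented for illustration and is not PostgreSQL code: relations smaller than the cutoff are dropped with one buffer-mapping probe per page, larger ones by the existing full scan of all buffer headers.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NBUFFERS   1024                     /* stand-in for NBuffers (s_b) */
#define TOY_TABLE_SIZE 2048                     /* power of two */
#define DROP_THRESHOLD (TOY_NBUFFERS / 500)     /* the "s_b / 500" cutoff */

/* one buffer-mapping entry: (relid, blockno) -> buf_id */
typedef struct {
    uint32_t relid;
    uint32_t blockno;
    int      buf_id;
    int      state;                             /* 0 empty, 1 used, 2 tombstone */
} ToyEntry;

static ToyEntry toy_table[TOY_TABLE_SIZE];

/* per-buffer tags, so the fallback path has headers to scan */
static struct { uint32_t relid; uint32_t blockno; int valid; } toy_buf[TOY_NBUFFERS];

static uint32_t toy_hash(uint32_t relid, uint32_t blockno)
{
    return (relid * 2654435761u) ^ (blockno * 2246822519u);
}

static void toy_insert(uint32_t relid, uint32_t blockno, int buf_id)
{
    uint32_t i = toy_hash(relid, blockno) & (TOY_TABLE_SIZE - 1);

    while (toy_table[i].state == 1)             /* linear probing */
        i = (i + 1) & (TOY_TABLE_SIZE - 1);
    toy_table[i] = (ToyEntry) {relid, blockno, buf_id, 1};
    toy_buf[buf_id].relid = relid;
    toy_buf[buf_id].blockno = blockno;
    toy_buf[buf_id].valid = 1;
}

static int toy_lookup(uint32_t relid, uint32_t blockno)
{
    uint32_t i = toy_hash(relid, blockno) & (TOY_TABLE_SIZE - 1);

    while (toy_table[i].state != 0) {           /* stop at a never-used slot */
        if (toy_table[i].state == 1 &&
            toy_table[i].relid == relid && toy_table[i].blockno == blockno)
            return toy_table[i].buf_id;
        i = (i + 1) & (TOY_TABLE_SIZE - 1);
    }
    return -1;
}

static void toy_delete(uint32_t relid, uint32_t blockno)
{
    uint32_t i = toy_hash(relid, blockno) & (TOY_TABLE_SIZE - 1);

    while (toy_table[i].state != 0) {
        if (toy_table[i].state == 1 &&
            toy_table[i].relid == relid && toy_table[i].blockno == blockno) {
            toy_table[i].state = 2;             /* leave a tombstone */
            return;
        }
        i = (i + 1) & (TOY_TABLE_SIZE - 1);
    }
}

/* Drop every buffer of a relation known to have nblocks blocks. */
static int toy_drop_rel_buffers(uint32_t relid, uint32_t nblocks)
{
    int dropped = 0;

    if (nblocks < DROP_THRESHOLD) {
        /* small relation: one hashtable probe per page */
        for (uint32_t b = 0; b < nblocks; b++) {
            int id = toy_lookup(relid, b);

            if (id >= 0) {
                toy_buf[id].valid = 0;
                toy_delete(relid, b);
                dropped++;
            }
        }
    } else {
        /* large relation: fall back to scanning every buffer header */
        for (int i = 0; i < TOY_NBUFFERS; i++) {
            if (toy_buf[i].valid && toy_buf[i].relid == relid) {
                toy_delete(relid, toy_buf[i].blockno);
                toy_buf[i].valid = 0;
                dropped++;
            }
        }
    }
    return dropped;
}
```

Note that the probe path is only correct if `nblocks` is exact, which is precisely the "reliable idea of how long the relation is" requirement above.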

Greetings,

Andres Freund



RE: [Patch] Optimize dropping of relation buffers using dlist

k.jamison@fujitsu.com
On Saturday, August 1, 2020 5:24 AM, Andres Freund wrote:

Hi,
Thank you for your constructive review and comments.
Sorry for the late reply.

> IIRC we already do smgrnblocks nearby, when doing the truncation (to figure out
> which segments we need to remove). Perhaps we can arrange to combine the
> two? The layering probably makes that somewhat ugly :(
>
> We could also just use pg_class.relpages. It'll probably mostly be accurate
> enough?
>
> Or we could just cache the result of the last smgrnblocks call...
>
>
> One of the cases where this type of strategy is most interesting to me is the partial
> truncations that autovacuum does... There we even know the range of tables
> ahead of time.

Konstantin tested it on various workloads and saw no regression.
But I understand the sentiment about the added overhead in BufferAlloc.
Regarding the case where the patch would potentially affect workloads that fit into
RAM but not into shared buffers, could one of Andres' suggested ideas above address
that, in addition to this patch's possible shared-invalidation fix? Could that settle
the added overhead in BufferAlloc() as a temporary fix?
Thomas Munro is also working on caching relation sizes [1]; maybe that way we
could get the latest known relation size. Currently, that's possible only during
recovery, via smgrnblocks.

Tom Lane wrote:

> But aside from that, I noted a number of things I didn't like a bit:
>
> * The amount of new shared memory this needs seems several orders of
> magnitude higher than what I'd call acceptable: according to my measurements
> it's over 10KB per shared buffer!  Most of that is going into the
> CachedBufTableLock data structure, which seems fundamentally misdesigned ---
> how could we be needing a lock per map partition *per buffer*?  For comparison,
> the space used by buf_table.c is about 56 bytes per shared buffer; I think this
> needs to stay at least within hailing distance of there.
>
> * It is fairly suspicious that the new data structure is manipulated while holding
> per-partition locks for the existing buffer hashtable.
> At best that seems bad for concurrency, and at worst it could result in deadlocks,
> because I doubt we can assume that the new hash table has partition boundaries
> identical to the old one.
>
> * More generally, it seems like really poor design that this has been written
> completely independently of the existing buffer hash table.
> Can't we get any benefit by merging them somehow?
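For scale, the arithmetic behind the first complaint can be written out as a standalone snippet. The 10KB and 56-byte per-buffer figures are those quoted above; the 8GB shared_buffers value in the usage below is an assumed example, not from the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Total shared-memory overhead for a given per-buffer cost, assuming the
 * standard 8KB page size (so NBuffers = shared_buffers / 8KB). */
static uint64_t overhead_bytes(uint64_t shared_buffers, uint64_t per_buffer)
{
    uint64_t nbuffers = shared_buffers / 8192;
    return nbuffers * per_buffer;
}
```

With 8GB of shared buffers (1,048,576 buffers), 10KB per buffer comes to roughly 10GB of bookkeeping, versus about 56MB for buf_table.c's 56 bytes per buffer, which is the "several orders of magnitude" gap.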

The original aim is just to shorten the recovery process; the speedups of the
vacuum and truncate processes are an added bonus.
Given that we don't yet have a shared-invalidation mechanism in place, such as the
(complex) radix-tree buffer mapping, I thought a patch like mine could be an
alternative approach. So I want to improve the patch further.
I hope you can help me clarify the direction, so that I can avoid drifting away
from what the community wants:
 1. Improve both normal operations and the recovery process
 2. Improve the recovery process only

For 1, the current patch aims at that, but further design improvement is needed.
It would be ideal to modify the BufferDesc, but it cannot be expanded any more
because that would exceed the CPU cache line size. So I added new data structures
(hash table, dlist, lock) instead of modifying the existing ones.
The new hash table is kept consistent with the old one by using the same
RelFileNode in the key and by locking when inserting into, deleting from, and
looking up the buffer table. As for the partition locking, I added it to reduce
lock contention.
Tomas Vondra reported a regression, mainly due to the buffer mapping locks, in V4
and earlier patch versions. So from V5 on, a spinlock is used when
inserting/deleting buffers, to prevent modification while a concurrent lookup is
happening, and an LWLock is acquired for lookups.
If we go in this direction, I hope to address Tom's comments in the next patch
version. I admit that this patch needs rework on its shared-memory consumption
and a clearer design/approach, i.e. how it affects the existing buffer
allocation and invalidation process, the lock mechanism, etc.

If we're going for 2, Konstantin suggested an idea in the previous email:

> I wonder if you have considered the case of a local hash (maintained only during recovery)?
> If there is after-crash recovery, then there will be no concurrent
> access to shared buffers and this hash will be up-to-date.
> In case of a hot-standby replica we can use some simple invalidation (just
> one flag or counter which indicates that the buffer cache was updated).
> This hash also can be constructed on demand when DropRelFileNodeBuffers
> is called the first time (so we have to scan all buffers once, but subsequent
> drop operations will be fast).

I'm examining this, but I am not sure I have understood it correctly. Please
correct me if I'm wrong.
I think the above suggests that the postgres startup process use a local hash
table to keep track of the buffers of relations. Since there may be other
read-only sessions that read from disk, evict cached blocks, and modify
shared_buffers, the flag would be updated. We could do this during recovery,
then release it as recovery completes.

I haven't looked deeply into the source code yet, but maybe we can modify the
REDO loop (the main redo do-while loop) in StartupXLOG() once the read-only
connections are consistent.
It would also be beneficial to construct this local hash when
DropRelFileNodeBuffers() is called for the first time, so the whole of shared
buffers is scanned initially; then, as you mentioned, subsequent dropping will
be fast (similar behavior to what the patch does).
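A single-process toy sketch of that on-demand construction follows. All names are invented, and there is no invalidation handling at all, which would only be safe in crash recovery where the startup process has shared buffers to itself: the first drop scans every buffer header once and indexes them by relid, and later drops just consult the index.

```c
#include <assert.h>
#include <stdint.h>

#define NBUF     16
#define MAX_RELS 8

static struct { uint32_t relid; int valid; } buf[NBUF];

/* relid -> list of buffer ids, built lazily by one full scan */
static struct { uint32_t relid; int bufs[NBUF]; int n; } relidx[MAX_RELS];
static int nrels = 0;
static int idx_built = 0;

static void note_buffer(int id)
{
    for (int r = 0; r < nrels; r++)
        if (relidx[r].relid == buf[id].relid) {
            relidx[r].bufs[relidx[r].n++] = id;
            return;
        }
    relidx[nrels].relid = buf[id].relid;
    relidx[nrels].bufs[0] = id;
    relidx[nrels].n = 1;
    nrels++;
}

/* First call scans all NBUF headers once; later calls are O(buffers of rel). */
static int drop_rel_buffers(uint32_t relid)
{
    int dropped = 0;

    if (!idx_built) {
        for (int i = 0; i < NBUF; i++)
            if (buf[i].valid)
                note_buffer(i);
        idx_built = 1;
    }
    for (int r = 0; r < nrels; r++)
        if (relidx[r].relid == relid) {
            for (int j = 0; j < relidx[r].n; j++)
                if (buf[relidx[r].bufs[j]].valid) {
                    buf[relidx[r].bufs[j]].valid = 0;
                    dropped++;
                }
            relidx[r].n = 0;
            break;
        }
    return dropped;
}
```

The hot-standby case would additionally need the flag/counter invalidation mentioned in the quote, since concurrent backends can change which buffers belong to which relation.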

Do you think this is feasible to be implemented? Or should we explore another approach?

I'd really appreciate your ideas, feedback, suggestions, and advice.
Thank you again for the review.

Regards
Kirk Jamison

[1] https://www.postgresql.org/message-id/CA%2BhUKGKEW7-9pq%2Bs2_4Q-Fcpr9cc7_0b3pkno5qzPKC3y2nOPA%40mail.gmail.com



Re: [Patch] Optimize dropping of relation buffers using dlist

Tomas Vondra-4
On Thu, Aug 06, 2020 at 01:23:31AM +0000, [hidden email] wrote:

>Konstantin tested it on various workloads and saw no regression.
Unfortunately Konstantin did not share any details about what workloads
he tested, with what config, etc. But I find the "no regression" hypothesis
rather hard to believe, because we're adding a non-trivial amount of code
to a place that can be quite hot.

And I can trivially reproduce a measurable (and significant) regression
using a very simple pgbench read-only test, with an amount of data that
exceeds shared buffers but fits into RAM.

The following numbers are from a x86_64 machine with 16 cores (32 w HT),
64GB of RAM, and 8GB shared buffers, using pgbench scale 1000 (so 16GB,
i.e. twice the SB size).

With simple "pgbench -S" tests (warmup and then 15 x 1-minute runs with
1, 8 and 16 clients - see the attached script for details) I see this:

                1 client    8 clients    16 clients
     ----------------------------------------------
     master        38249       236336        368591
     patched       35853       217259        349248
                     -6%          -8%           -5%

This is the average of the runs, but the conclusions for the medians are
almost exactly the same.

>But I understand the sentiment on the added overhead on BufferAlloc.
>Regarding the case where the patch would potentially affect workloads
>that fit into RAM but not into shared buffers, could one of Andres'
>suggested idea/s above address that, in addition to this patch's
>possible shared invalidation fix? Could that settle the added overhead
>in BufferAlloc() as temporary fix?

Not sure.

>Thomas Munro is also working on caching relation sizes [1], maybe that
>way we could get the latest known relation size. Currently, it's
>possible only during recovery in smgrnblocks.

It's not clear to me how knowing the relation size would help reduce
the overhead of this patch.

Can't we somehow identify the cases where this optimization might help and
only actually enable it in those cases? Like in recovery, with a lot
of truncates, or something like that.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments: run.sh (680 bytes), master.csv (693 bytes), patched.csv (693 bytes)

Re: [Patch] Optimize dropping of relation buffers using dlist

akapila
In reply to this post by k.jamison@fujitsu.com
On Thu, Aug 6, 2020 at 6:53 AM [hidden email]
<[hidden email]> wrote:

>
> Konstantin tested it on various workloads and saw no regression.
> But I understand the sentiment on the added overhead on BufferAlloc.
> Regarding the case where the patch would potentially affect workloads that fit into
> RAM but not into shared buffers, could one of Andres' suggested idea/s above address
> that, in addition to this patch's possible shared invalidation fix? Could that settle
> the added overhead in BufferAlloc() as temporary fix?
>

Yes, I think so, because as far as I can understand he is suggesting
making changes only in the Truncate/Vacuum code path. Basically, I
think you need to change DropRelFileNodeBuffers or similar functions.
There shouldn't be any change in BufferAlloc or the common code
path, so there is no question of a regression in such cases. I am not
sure what you have in mind for this, but feel free to clarify your
understanding before implementation.

> The original aim is to just shorten the recovery process, and eventually the speedup
> on both vacuum and truncate process are just added bonus.
> Given that we don't have a shared invalidation mechanism in place yet like radix tree
> buffer mapping which is complex, I thought a patch like mine could be an alternative
> approach to that. So I want to improve the patch further.
> I hope you can help me clarify the direction, so that I can avoid going farther away
> from what the community wants.
>  1. Both normal operations and recovery process
>  2. Improve recovery process only
>

I feel Andres's suggestion will help in both cases.

>
> I'm examining this, but I am not sure if I got the correct understanding. Please correct
> me if I'm wrong.
> I think above is a suggestion wherein the postgres startup process uses local hash table
> to keep track of the buffers of relations. Since there may be other read-only sessions which
> read from disk, evict cached blocks, and modify the shared_buffers, the flag would be updated.
> We could do it during recovery, then release it as recovery completes.
>
> I haven't looked deeply yet into the source code but we maybe can modify the REDO
> (main redo do-while loop) in StartupXLOG() once the read-only connections are consistent.
> It would also be beneficial to construct this local hash when DropRefFileNodeBuffers()
> is called for the first time, so the whole share buffers is scanned initially, then as
> you mentioned subsequent dropping will be fast. (similar behavior to what the patch does)
>
> Do you think this is feasible to be implemented? Or should we explore another approach?
>

I think we should try what Andres is suggesting, as that seems like a
promising idea and can address most of the common problems in this
area, but if you feel otherwise, then do let us know.

--
With Regards,
Amit Kapila.



Re: [Patch] Optimize dropping of relation buffers using dlist

akapila
In reply to this post by Tomas Vondra-4
On Fri, Aug 7, 2020 at 3:03 AM Tomas Vondra
<[hidden email]> wrote:

>
> >But I understand the sentiment on the added overhead on BufferAlloc.
> >Regarding the case where the patch would potentially affect workloads
> >that fit into RAM but not into shared buffers, could one of Andres'
> >suggested idea/s above address that, in addition to this patch's
> >possible shared invalidation fix? Could that settle the added overhead
> >in BufferAlloc() as temporary fix?
>
> Not sure.
>
> >Thomas Munro is also working on caching relation sizes [1], maybe that
> >way we could get the latest known relation size. Currently, it's
> >possible only during recovery in smgrnblocks.
>
> It's not clear to me how would knowing the relation size help reducing
> the overhead of this patch?
>

AFAICU the idea is to directly call BufTableLookup (similar to what we
do in BufferAlloc) to find the buf_id in
DropRelFileNodeBuffers and then invalidate the required buffers. And
we need to do this when the size of the relation is less than some
threshold. So I think the crux would be reliably getting the number of
blocks; the relation-size cache stuff could help with that.

--
With Regards,
Amit Kapila.



Re: [Patch] Optimize dropping of relation buffers using dlist

akapila
In reply to this post by Andres Freund
On Sat, Aug 1, 2020 at 1:53 AM Andres Freund <[hidden email]> wrote:

> We could also just use pg_class.relpages. It'll probably mostly be
> accurate enough?
>

Don't we need an accurate 'number of blocks' if we want to invalidate
all the buffers? Basically, I think we need to perform BufTableLookup
for all the blocks in the relation and then invalidate all the buffers found.

--
With Regards,
Amit Kapila.



Re: [Patch] Optimize dropping of relation buffers using dlist

Tom Lane-2
Amit Kapila <[hidden email]> writes:
> On Sat, Aug 1, 2020 at 1:53 AM Andres Freund <[hidden email]> wrote:
>> We could also just use pg_class.relpages. It'll probably mostly be
>> accurate enough?

> Don't we need the accurate 'number of blocks' if we want to invalidate
> all the buffers? Basically, I think we need to perform BufTableLookup
> for all the blocks in the relation and then Invalidate all buffers.

Yeah, there is no room for "good enough" here.  If a dirty buffer remains
in the system, the checkpointer will eventually try to flush it, and fail
(because there's no file to write it to), and then checkpointing will be
stuck.  So we cannot afford to risk missing any buffers.
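This failure mode can be shown in a few lines of standalone C (all names below are invented for illustration; this is not PostgreSQL code): a probe-style drop that trusts an underestimated block count, e.g. a stale pg_class.relpages, strands a dirty buffer that a checkpointer would later try, and fail, to flush.

```c
#include <assert.h>
#include <stdint.h>

#define NBUF 8

/* minimal buffer header: which page it caches and whether it needs writing */
typedef struct { uint32_t relid; uint32_t blockno; int valid; int dirty; } BufHdr;
static BufHdr buf[NBUF];

static void cache_block(int id, uint32_t relid, uint32_t blockno, int dirty)
{
    buf[id] = (BufHdr) {relid, blockno, 1, dirty};
}

/* A probe-style drop that trusts an estimated block count: it only
 * considers blocks 0 .. nblocks_estimate-1 of the relation. */
static void drop_by_probes(uint32_t relid, uint32_t nblocks_estimate)
{
    for (int i = 0; i < NBUF; i++)
        if (buf[i].valid && buf[i].relid == relid &&
            buf[i].blockno < nblocks_estimate)
            buf[i].valid = 0;
}

/* What a checkpointer would later try (and fail) to flush for relid. */
static int leftover_dirty(uint32_t relid)
{
    int n = 0;
    for (int i = 0; i < NBUF; i++)
        if (buf[i].valid && buf[i].dirty && buf[i].relid == relid)
            n++;
    return n;
}
```

With the relation cached through block 4 but the estimate saying 4 blocks, the dirty block 4 survives the drop; only an exact count clears everything.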

                        regards, tom lane



Re: [Patch] Optimize dropping of relation buffers using dlist

konstantin knizhnik
In reply to this post by Tomas Vondra-4


On 07.08.2020 00:33, Tomas Vondra wrote:
>
> Unfortunately Konstantin did not share any details about what workloads
> he tested, what config etc. But I find the "no regression" hypothesis
> rather hard to believe, because we're adding non-trivial amount of code
> to a place that can be quite hot.

Sorry that I have not explained my test scenarios.
Since Postgres is a pgbench-oriented database :) I have also used pgbench:
the read-only case and the skip-some-updates case.
For this patch the most critical factor is the number of buffer allocations,
so I used a small enough database (scale=100), but shared buffers were set
to 1GB.
As a result, all data is cached in memory (in the file system cache), but
there is intensive swapping at the Postgres buffer manager level.
I have tested it both with a relatively small (100) and a large (1000)
number of clients.
I repeated these tests on my notebook (quad-core, 16GB RAM, SSD) and on an IBM
Power2 server with about 380 virtual cores and about 1TB of memory.
In the last case the results vary very much (I think because of the NUMA
architecture), but I failed to find any noticeable regression in the patched
version.


But I have to agree that adding a parallel hash (in addition to the existing
buffer manager hash) is not such a good idea.
This cache really does become a bottleneck quite frequently.
My explanation for why I have not observed a noticeable regression is
that this patch uses almost the same lock partitioning scheme
as the existing one, so it does not add many new conflicts. Maybe in the case of
the Power2 server, the overhead of NUMA is much higher than other factors
(although a shared hash is one of the main things that suffers from a NUMA
architecture).
But in principle I agree that having two independent caches may decrease
speed by up to two times (or even more).

I hope that everybody will agree that this problem is really critical.
It is certainly not the most common case to have hundreds of
relations which are frequently truncated. But having quadratic complexity
in the drop function is not acceptable from my point of view.
And it is not only a recovery-specific problem; this is why a solution with
a local cache is not enough.

I do not know a good solution to the problem. Just some thoughts:
- We can somehow combine the locking used for the main buffer manager cache (by
relid/blockno) and the cache for relid. That would eliminate the double-locking
overhead.
- We can use something like a sorted tree (like std::map) instead of a hash;
it would allow locating blocks both by relid/blockno and by relid only.
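A standalone sketch of the second thought, using a sorted array in place of a real tree (all names invented): keeping the mapping ordered by (relid, blockno) makes every relation's buffers a contiguous range, so a drop needs one binary search plus a short scan instead of visiting every buffer header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* mapping entries kept sorted by (relid, blockno) */
typedef struct { uint32_t relid; uint32_t blockno; int buf_id; } MapEnt;

/* first index whose relid >= target (classic lower bound) */
static size_t rel_lower_bound(const MapEnt *m, size_t n, uint32_t relid)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (m[mid].relid < relid)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Visit all buffers of one relation in O(log n + k); a real drop path
 * would invalidate each one, but here we just count them. */
static int count_rel_buffers(const MapEnt *m, size_t n, uint32_t relid)
{
    int k = 0;

    for (size_t i = rel_lower_bound(m, n, relid);
         i < n && m[i].relid == relid; i++)
        k++;
    return k;
}
```

The trade-off, which a real tree such as std::map would also carry, is that insertions and deletions in the ordered structure are more expensive than in a hash table, which matters on the hot BufferAlloc path.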



RE: [Patch] Optimize dropping of relation buffers using dlist

k.jamison@fujitsu.com
In reply to this post by akapila
On Friday, August 7, 2020 12:38 PM, Amit Kapila wrote:
Hi,

> On Thu, Aug 6, 2020 at 6:53 AM [hidden email] <[hidden email]>
> wrote:
> >
> > On Saturday, August 1, 2020 5:24 AM, Andres Freund wrote:
> >
> > Hi,
> > Thank you for your constructive review and comments.
> > Sorry for the late reply.
> >
> > Konstantin tested it on various workloads and saw no regression.
> > But I understand the sentiment on the added overhead on BufferAlloc.
> > Regarding the case where the patch would potentially affect workloads
> > that fit into RAM but not into shared buffers, could one of Andres'
> > suggested idea/s above address that, in addition to this patch's
> > possible shared invalidation fix? Could that settle the added overhead in
> BufferAlloc() as temporary fix?
> >
>
> Yes, I think so. Because as far as I can understand he is suggesting to do changes
> only in the Truncate/Vacuum code path. Basically, I think you need to change
> DropRelFileNodeBuffers or similar functions.
> There shouldn't be any change in the BufferAlloc or the common code path, so
> there is no question of regression in such cases. I am not sure what you have in
> mind for this but feel free to clarify your understanding before implementation.
>
> > The original aim was just to shorten the recovery process, and the
> > speedups for the vacuum and truncate processes are an added bonus.
> > Given that we don't have a shared invalidation mechanism in place yet,
> > like the (complex) radix-tree buffer mapping, I thought a patch like
> > mine could be an alternative approach to that. So I want to improve
> > the patch further.
> > I hope you can help me clarify the direction, so that I can avoid
> > going farther away from what the community wants.
> >  1. Improve both normal operations and the recovery process
> >  2. Improve the recovery process only
> >
>
> I feel Andres's suggestion will help in both cases.
>
> > > I wonder if you have considered the case of a local hash (maintained
> > > only during recovery)?
> > > If there is after-crash recovery, then there will be no concurrent
> > > access to shared buffers and this hash will be up-to-date.
> > > In case of a hot-standby replica we can use some simple invalidation
> > > (just one flag or counter which indicates that the buffer cache was
> > > updated).
> > > This hash also can be constructed on demand when
> > > DropRelFileNodeBuffers is called the first time (so we have to scan
> > > all buffers once, but subsequent drop operations will be fast).
> >
> > I'm examining this, but I am not sure if I got the correct
> > understanding. Please correct me if I'm wrong.
> > I think the above is a suggestion wherein the postgres startup process
> > uses a local hash table to keep track of the buffers of relations. Since
> > there may be other read-only sessions which read from disk, evict cached
> > blocks, and modify the shared_buffers, the flag would be updated.
> > We could do it during recovery, then release it as recovery completes.
> >
> > I haven't looked deeply into the source code yet, but maybe we can
> > modify the REDO loop (the main redo do-while loop) in StartupXLOG()
> > once the read-only connections are consistent.
> > It would also be beneficial to construct this local hash when
> > DropRelFileNodeBuffers() is called for the first time, so the whole of
> > shared buffers is scanned initially, and then, as you mentioned,
> > subsequent dropping will be fast (similar behavior to what the patch
> > does).
> >
> > Do you think this is feasible to implement? Or should we explore
> > another approach?
> >
>
> I think we should try what Andres is suggesting as that seems like a promising
> idea and can address most of the common problems in this area but if you feel
> otherwise, then do let us know.
>
> --
> With Regards,
> Amit Kapila.

Hi, thank you for the review.
I just wanted to confirm so that I can hopefully cover it in the patch revision.
Basically, we don't want the added overhead in BufferAlloc(), so I'll add
a way to get both the last known relation size and nblocks, and modify the
operations for dropping relation buffers, based on the comments
and suggestions of the reviewers. Hopefully I can also provide performance
test results by the next CF.

Regards,
Kirk Jamison
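For what it's worth, the on-demand local hash idea discussed above might look roughly like this toy model (all names hypothetical, not PostgreSQL source; it also ignores the hot-standby invalidation problem Konstantin mentions):

```python
# Toy model: during recovery, build a local relation -> buffer-ids index
# on the first DropRelFileNodeBuffers call, then serve later drops from
# the index. One O(NBuffers) scan up front, O(buffers of rel) afterwards.

class RecoveryLocalIndex:
    def __init__(self, shared_buffers):
        # shared_buffers: mapping buffer_id -> (rel, blockno)
        self.shared_buffers = shared_buffers
        self.index = None           # built lazily on the first drop
        self.full_scans = 0         # count the O(NBuffers) passes

    def _build(self):
        # Single full pass over all buffers to group buffer ids by relation.
        self.full_scans += 1
        self.index = {}
        for buf_id, (rel, _blockno) in self.shared_buffers.items():
            self.index.setdefault(rel, set()).add(buf_id)

    def drop_rel_buffers(self, rel):
        if self.index is None:
            self._build()
        victims = self.index.pop(rel, set())
        for buf_id in victims:
            del self.shared_buffers[buf_id]
        return len(victims)
```

The second and later drops never rescan shared buffers, which is exactly the amortization being discussed; a real implementation would need the flag/counter invalidation once hot-standby backends can modify the buffer pool concurrently.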

Re: [Patch] Optimize dropping of relation buffers using dlist

Tomas Vondra-4
In reply to this post by konstantin knizhnik
On Fri, Aug 07, 2020 at 10:08:23AM +0300, Konstantin Knizhnik wrote:

>
>
>On 07.08.2020 00:33, Tomas Vondra wrote:
>>
>>Unfortunately Konstantin did not share any details about what workloads
>>he tested, what config etc. But I find the "no regression" hypothesis
>>rather hard to believe, because we're adding non-trivial amount of code
>>to a place that can be quite hot.
>
>Sorry that I have not explained my test scenarios.
>Since Postgres is a pgbench-oriented database :) I have also used pgbench:
>the read-only case and the simple-update case.
>For this patch the most critical factor is the number of buffer allocations,
>so I used a small enough database (scale=100), but shared buffers were set
>to 1GB.
>As a result, all data is cached in memory (in the file system cache), but
>there is intensive swapping at the Postgres buffer manager level.
>I have tested it both with relatively small (100) and large (1000)
>numbers of clients.
>
>I repeated these tests on my notebook (quad-core, 16GB RAM, SSD) and an IBM
>Power2 server with about 380 virtual cores and about 1TB of memory.
>In the last case the results vary very much (I think because of the NUMA
>architecture), but I failed to find any noticeable regression in the
>patched version.
>

IMO using such high numbers of clients is pointless - it's perfectly
fine to test just a single client, and the 'basic overhead' should be
visible. It might have some impact on concurrency, but I think that's
just a secondary effect. In fact, I wouldn't be surprised if high
client counts made it harder to observe the overhead, due to
concurrency problems (I doubt you have a laptop with this many cores).

Another thing you might try doing is using taskset to attach processes
to particular CPU cores, and also make sure there's no undesirable
influence from CPU power management etc. Laptops are very problematic in
this regard, but even servers can have that enabled in BIOS.

>
>But I have to agree that adding a parallel hash (in addition to the
>existing buffer manager hash) is not such a good idea.
>This cache really quite frequently becomes a bottleneck.
>My explanation of why I have not observed any noticeable regression
>was that this patch uses almost the same lock partitioning schema
>as the one already used, so it adds not so many new conflicts. Maybe in
>the case of the Power2 server, the overhead of NUMA is much higher than
>other factors (although a shared hash is one of the main things suffering
>from the NUMA architecture).
>But in principle I agree that having two independent caches may
>decrease speed up to two times (or even more).
>
>I hope that everybody will agree that this problem is really critical.
>It is certainly not the most common case when there are hundreds of
>relations which are frequently truncated. But having quadratic
>complexity in the drop function is not acceptable from my point of view.
>And it is not only a recovery-specific problem, which is why a solution
>with a local cache is not enough.
>

Well, ultimately it's a balancing act - we need to consider the risk of
regressions vs. how common the improved scenario is. I've seen multiple
applications that e.g. drop many relations (after all, that's why I
optimized that in 9.3), so it's not an entirely bogus case.

>I do not know a good solution to the problem. Just some thoughts.
>- We can somehow combine the locking used for the main buffer manager
>cache (by relid/blockno) and the cache for relid. It would eliminate
>the double-locking overhead.
>- We can use something like a sorted tree (like std::map) instead of a
>hash - it would allow locating blocks both by relid/blockno and by
>relid only.
>

I don't know. I think the ultimate problem here is that we're adding
code to a fairly hot codepath - it does not matter whether it's a hash, a
list, a std::map or something else, I think. All of that has overhead.

That's the beauty of Andres' proposal to just loop over the blocks of
the relation and evict them one by one - that adds absolutely nothing to
BufferAlloc.
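As a pure illustration of that contrast (hypothetical names, not PostgreSQL source; a dict keyed by (rel, blockno) stands in for the buffer mapping table, and the small-relation threshold is an assumed tuning constant):

```python
# Toy model of dropping a relation's buffers: a full O(NBuffers) scan
# vs. Andres's targeted O(nblocks) probes for small relations.

NBUFFERS = 16384              # assumed shared_buffers size, for illustration
THRESHOLD = NBUFFERS // 512   # assumed cutoff for "small" relations

def drop_rel_buffers_scan(buf_table, rel):
    """Current behaviour: scan everything, O(NBuffers) per drop."""
    victims = [key for key in buf_table if key[0] == rel]
    for key in victims:
        del buf_table[key]
    return len(victims)

def drop_rel_buffers_lookup(buf_table, rel, nblocks):
    """Proposed behaviour for small relations: one probe per block."""
    dropped = 0
    for blockno in range(nblocks):
        if buf_table.pop((rel, blockno), None) is not None:
            dropped += 1
    return dropped

def drop_rel_buffers(buf_table, rel, nblocks):
    # Targeted probes only when the relation is much smaller than
    # shared_buffers; fall back to the full scan otherwise.
    if nblocks < THRESHOLD:
        return drop_rel_buffers_lookup(buf_table, rel, nblocks)
    return drop_rel_buffers_scan(buf_table, rel)
```

The point of the sketch is that neither path touches anything on the buffer-allocation side; all the new work lives in the drop path.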

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [Patch] Optimize dropping of relation buffers using dlist

Robert Haas
In reply to this post by Tom Lane-2
On Fri, Aug 7, 2020 at 12:03 AM Tom Lane <[hidden email]> wrote:
> Yeah, there is no room for "good enough" here.  If a dirty buffer remains
> in the system, the checkpointer will eventually try to flush it, and fail
> (because there's no file to write it to), and then checkpointing will be
> stuck.  So we cannot afford to risk missing any buffers.

This comment suggests another possible approach to the problem, which
is to just make a note someplace in shared memory when we drop a
relation. If we later find any of its buffers, we drop them without
writing them out. This is not altogether simple, because (1) we don't
have infinite room in shared memory to accumulate such notes and (2)
it's not impossible for the OID counter to wrap around and permit the
creation of a new relation with the same OID, which would be a problem
if the previous note is still around.

But this might be solvable. Suppose we create a shared hash table
keyed by <dboid, relfilenode> with room for 1 entry per 1000 shared
buffers. When you drop a relation, you insert into the hash table.
Periodically you "clean" the hash table by marking all the entries,
scanning shared buffers to remove any matches, and then deleting all
the marked entries. This should be done periodically in the
background, but if you try to drop a relation and find the hash table
full, or you try to create a relation and find the OID of your new
relation in the hash table, then you have to clean synchronously.

Right now, the cost of dropping K relations with N shared buffers is
O(KN). But with this approach, you only have to incur the O(N)
overhead of scanning shared_buffers when the hash table fills up, and
the hash table size is proportional to N, so the amortized complexity
is O(K); that is, dropping relations takes time proportional to the
number of relations being dropped, but NOT proportional to the size of
shared_buffers, because as shared_buffers grows the hash table gets
proportionally bigger, so that scans don't need to be done as
frequently.
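A toy model of that scheme (all names hypothetical; a list of (rel, blockno) pairs stands in for shared_buffers, and the 1-per-1000 sizing follows the proposal above):

```python
class DroppedRelTable:
    """Toy model of the amortized scheme: remember dropped relations and
    clean shared_buffers in one pass only when the table fills up."""

    def __init__(self, nbuffers):
        self.capacity = max(1, nbuffers // 1000)  # 1 entry per 1000 buffers
        self.dropped = set()
        self.scans = 0                            # count O(NBuffers) scans

    def drop_relation(self, rel, shared_buffers):
        if len(self.dropped) >= self.capacity:
            self.clean(shared_buffers)            # synchronous clean on full
        self.dropped.add(rel)

    def buffer_is_dead(self, rel):
        # At eviction/flush time: buffers of dropped relations are
        # discarded without being written out.
        return rel in self.dropped

    def clean(self, shared_buffers):
        # Mark, scan, delete: one O(NBuffers) pass removes the buffers of
        # every relation dropped since the last clean.
        self.scans += 1
        marked = set(self.dropped)
        shared_buffers[:] = [b for b in shared_buffers if b[0] not in marked]
        self.dropped -= marked
```

Dropping many relations in a burst triggers only an occasional full scan, which is the O(K) amortized behaviour described above; the periodic background clean is omitted here for brevity.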

Andres's approach (retail hash table lookups just for blocks less than
the relation size, rather than a full scan) is going to help most with
small relations, whereas this approach helps with relations of any
size, but if you're trying to drop a lot of relations, they're
probably small, and if they are large, scanning shared buffers may not
be the dominant cost, since unlinking the files also takes time. Also,
this approach might turn out to slow down buffer eviction too much.
That could maybe be mitigated by having some kind of cheap fast-path
that gets used when the hash table is empty (like an atomic flag that
indicates whether a hash table probe is needed), and then trying hard
to keep it empty most of the time (e.g. by aggressive background
cleaning, or by ruling that after some number of hash table lookups
the next process to evict a buffer is forced to perform a cleanup).
But you'd probably have to try it to see how well you can do.
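A minimal sketch of that fast path (hypothetical names; a plain bool guarded by a lock stands in for the atomic flag, since the point is just the cheap unlocked check on the hot path):

```python
import threading

class DropFlag:
    """Toy model of the cheap fast path: a single flag checked on every
    eviction; the hash-table probe happens only while the flag is set."""

    def __init__(self):
        self._nonempty = False          # stands in for an atomic flag
        self._lock = threading.Lock()
        self._dropped = set()

    def note_drop(self, rel):
        with self._lock:
            self._dropped.add(rel)
            self._nonempty = True

    def is_dropped(self, rel):
        if not self._nonempty:          # fast path: no probe, no lock
            return False
        with self._lock:                # slow path: actual table probe
            return rel in self._dropped

    def clear(self):
        # Background (or forced) cleaning empties the table and restores
        # the fast path for subsequent evictions.
        with self._lock:
            self._dropped.clear()
            self._nonempty = False
```

Keeping the table empty most of the time, as suggested above, means `is_dropped` almost always returns via the flag check alone.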

It's also possible to combine the two approaches. Small relations
could use Andres's approach while larger ones could use this approach;
or you could insert both large and small relations into this hash
table but use different strategies for cleaning out shared_buffers
depending on the relation size (which could also be preserved in the
hash table).

Just brainstorming here...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [Patch] Optimize dropping of relation buffers using dlist

Tom Lane-2
Robert Haas <[hidden email]> writes:
> On Fri, Aug 7, 2020 at 12:03 AM Tom Lane <[hidden email]> wrote:
>> Yeah, there is no room for "good enough" here.  If a dirty buffer remains
>> in the system, the checkpointer will eventually try to flush it, and fail
>> (because there's no file to write it to), and then checkpointing will be
>> stuck.  So we cannot afford to risk missing any buffers.

> This comment suggests another possible approach to the problem, which
> is to just make a note someplace in shared memory when we drop a
> relation. If we later find any of its buffers, we drop them without
> writing them out. This is not altogether simple, because (1) we don't
> have infinite room in shared memory to accumulate such notes and (2)
> it's not impossible for the OID counter to wrap around and permit the
> creation of a new relation with the same OID, which would be a problem
> if the previous note is still around.

Interesting idea indeed.

As for (1), maybe we don't need to keep the info in shmem.  I'll just
point out that the checkpointer has *already got* a complete list of all
recently-dropped relations, cf pendingUnlinks in sync.c.  So you could
imagine looking aside at that to discover that a dirty buffer belongs to a
recently-dropped relation.  pendingUnlinks would need to be converted to a
hashtable to make searches cheap, and it's not very clear what to do in
backends that haven't got access to that table, but maybe we could just
accept that backends that are forced to flush dirty buffers might do some
useless writes in such cases.
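As a toy model of that lookaside (names assumed; pendingUnlinks here is just a list of dropped relations, not the real sync.c structure):

```python
# Toy model: the checkpointer consults a set built from the
# recently-dropped list before flushing, so dirty buffers belonging to
# dropped relations are skipped rather than written to a missing file.

def checkpoint_flush(dirty_buffers, pending_unlinks, write):
    """Flush dirty buffers, skipping relations queued for unlink.

    dirty_buffers: iterable of (rel, blockno, page) tuples
    pending_unlinks: relations dropped since the last checkpoint
    write: callback performing the actual I/O
    """
    dropped = set(pending_unlinks)      # hashtable for O(1) membership
    skipped = 0
    for rel, blockno, page in dirty_buffers:
        if rel in dropped:
            skipped += 1                # relation is gone; don't write
            continue
        write(rel, blockno, page)
    return skipped
```

This only illustrates the checkpointer side; as noted above, backends without access to the table would need some other answer.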

As for (2), the reason why we have that list is that the physical unlink
doesn't happen till after the next checkpoint.  So with some messing
around here, we could probably guarantee that every buffer belonging
to the relation has been scanned and deleted before the file unlink
happens --- and then, even if the OID counter has wrapped around, the
OID won't be reassigned to a new relation before that happens.

In short, it seems like maybe we could shove the responsibility for
cleaning up dropped relations' buffers onto the checkpointer without
too much added cost.  A possible problem with this is that recycling
of those buffers will happen much more slowly than it does today,
but maybe that's okay?

                        regards, tom lane



Re: [Patch] Optimize dropping of relation buffers using dlist

Robert Haas
On Fri, Aug 7, 2020 at 12:09 PM Tom Lane <[hidden email]> wrote:
> As for (1), maybe we don't need to keep the info in shmem.  I'll just
> point out that the checkpointer has *already got* a complete list of all
> recently-dropped relations, cf pendingUnlinks in sync.c.  So you could
> imagine looking aside at that to discover that a dirty buffer belongs to a
> recently-dropped relation.  pendingUnlinks would need to be converted to a
> hashtable to make searches cheap, and it's not very clear what to do in
> backends that haven't got access to that table, but maybe we could just
> accept that backends that are forced to flush dirty buffers might do some
> useless writes in such cases.

I don't see how that can work. It's not that the writes are useless;
it's that they will fail outright because the file doesn't exist.

> As for (2), the reason why we have that list is that the physical unlink
> doesn't happen till after the next checkpoint.  So with some messing
> around here, we could probably guarantee that every buffer belonging
> to the relation has been scanned and deleted before the file unlink
> happens --- and then, even if the OID counter has wrapped around, the
> OID won't be reassigned to a new relation before that happens.

This seems right to me, though.

> In short, it seems like maybe we could shove the responsibility for
> cleaning up dropped relations' buffers onto the checkpointer without
> too much added cost.  A possible problem with this is that recycling
> of those buffers will happen much more slowly than it does today,
> but maybe that's okay?

I suspect it's going to be advantageous to try to make cleaning up
dropped buffers quick in normal cases and allow it to fall behind only
when someone is dropping a lot of relations in quick succession, so
that buffer eviction remains cheap in normal cases. I hadn't thought
about the possible negative performance consequences of failing to put
buffers on the free list, but that's another reason to try to make it
fast.

My viewpoint on this is - I have yet to see anybody really get hosed
because they drop one relation and that causes a full scan of
shared_buffers. I mean, it's slightly expensive, but computers are
fast. Whatever. What hoses people is dropping a lot of relations in
quick succession, either by spamming DROP TABLE commands or by running
something like DROP SCHEMA, and then suddenly they're scanning
shared_buffers over and over again, and their standby is doing the
same thing, and now it hurts. The problem on the standby is actually
worse than the problem on the primary, because the primary can do
other things while one process sits there and thinks about
shared_buffers for a long time, but the standby can't, because the
startup process is all there is.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [Patch] Optimize dropping of relation buffers using dlist

Tom Lane-2
Robert Haas <[hidden email]> writes:
> On Fri, Aug 7, 2020 at 12:09 PM Tom Lane <[hidden email]> wrote:
>> ... it's not very clear what to do in
>> backends that haven't got access to that table, but maybe we could just
>> accept that backends that are forced to flush dirty buffers might do some
>> useless writes in such cases.

> I don't see how that can work. It's not that the writes are useless;
> it's that they will fail outright because the file doesn't exist.

At least in the case of segment zero, the file will still exist.  It'll
have been truncated to zero length, and if the filesystem is stupid about
holes in files then maybe a write to a high block number would consume
excessive disk space, but does anyone still care about such filesystems?
I don't remember at the moment how we handle higher segments, but likely
we could make them still exist too, postponing all the unlinks till after
checkpoint.  Or we could just have the backends give up on recycling a
particular buffer if they can't write it (which is the response to an I/O
failure already, I hope).

> My viewpoint on this is - I have yet to see anybody really get hosed
> because they drop one relation and that causes a full scan of
> shared_buffers. I mean, it's slightly expensive, but computers are
> fast. Whatever. What hoses people is dropping a lot of relations in
> quick succession, either by spamming DROP TABLE commands or by running
> something like DROP SCHEMA, and then suddenly they're scanning
> shared_buffers over and over again, and their standby is doing the
> same thing, and now it hurts.

Yeah, trying to amortize the cost across multiple drops seems like
what we really want here.  I'm starting to think about a "relation
dropper" background process, which would be somewhat like the checkpointer
but it wouldn't have any interest in actually doing buffer I/O.
We'd send relation drop commands to it, and it would scan all of shared
buffers and flush related buffers, and then finally do the file truncates
or unlinks.  Amortization would happen by considering multiple target
relations during any one scan over shared buffers.  I'm not very clear
on how this would relate to the checkpointer's handling of relation
drops, but it could be worked out; if we were lucky maybe the checkpointer
could stop worrying about that.

                        regards, tom lane



Re: [Patch] Optimize dropping of relation buffers using dlist

Robert Haas
On Fri, Aug 7, 2020 at 12:52 PM Tom Lane <[hidden email]> wrote:
> At least in the case of segment zero, the file will still exist.  It'll
> have been truncated to zero length, and if the filesystem is stupid about
> holes in files then maybe a write to a high block number would consume
> excessive disk space, but does anyone still care about such filesystems?
> I don't remember at the moment how we handle higher segments, but likely
> we could make them still exist too, postponing all the unlinks till after
> checkpoint.  Or we could just have the backends give up on recycling a
> particular buffer if they can't write it (which is the response to an I/O
> failure already, I hope).

None of this sounds very appealing. Postponing the unlinks means
postponing recovery of the space at the OS level, which I think will
be noticeable and undesirable for users. The other notions all seem to
involve treating as valid on-disk states we currently treat as
invalid, and our sanity checks in this area are already far too weak.
And all you're buying for it is putting a hash table that would
otherwise be shared memory into backend-private memory, which seems
like quite a minor gain. Having that information visible to everybody
seems a lot cleaner.

> Yeah, trying to amortize the cost across multiple drops seems like
> what we really want here.  I'm starting to think about a "relation
> dropper" background process, which would be somewhat like the checkpointer
> but it wouldn't have any interest in actually doing buffer I/O.
> We'd send relation drop commands to it, and it would scan all of shared
> buffers and flush related buffers, and then finally do the file truncates
> or unlinks.  Amortization would happen by considering multiple target
> relations during any one scan over shared buffers.  I'm not very clear
> on how this would relate to the checkpointer's handling of relation
> drops, but it could be worked out; if we were lucky maybe the checkpointer
> could stop worrying about that.

I considered that, too, but it might be overkill. I think that one
scan of shared_buffers every now and then might be cheap enough that
we could just not worry too much about which process gets stuck doing
it. So for example if the number of buffers allocated since the hash
table ended up non-empty reaches NBuffers, the process wanting to do
the next eviction gets handed the job of cleaning it out. Or maybe the
background writer could help; it's not like it does much anyway, zing.
It's possible that a dedicated process is the right solution, but we
might want to at least poke a bit at other alternatives.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [Patch] Optimize dropping of relation buffers using dlist

akapila
In reply to this post by Tom Lane-2
On Fri, Aug 7, 2020 at 9:33 AM Tom Lane <[hidden email]> wrote:

>
> Amit Kapila <[hidden email]> writes:
> > On Sat, Aug 1, 2020 at 1:53 AM Andres Freund <[hidden email]> wrote:
> >> We could also just use pg_class.relpages. It'll probably mostly be
> >> accurate enough?
>
> > Don't we need the accurate 'number of blocks' if we want to invalidate
> > all the buffers? Basically, I think we need to perform BufTableLookup
> > for all the blocks in the relation and then Invalidate all buffers.
>
> Yeah, there is no room for "good enough" here.  If a dirty buffer remains
> in the system, the checkpointer will eventually try to flush it, and fail
> (because there's no file to write it to), and then checkpointing will be
> stuck.  So we cannot afford to risk missing any buffers.
>

Right, this reminds me of the discussion we had last time on this
topic where we decided that we can't even rely on using smgrnblocks to
find the exact number of blocks because lseek might lie about the EOF
position [1]. So, we anyway need some mechanism to push the
information related to the "to be truncated or dropped relations" to
the background worker (checkpointer and or others) to avoid flush
issues. But maybe it is better to push the responsibility of
invalidating the buffers for truncated/dropped relations to the
background process. However, I feel that in cases where the relation
size is greater than the number of shared buffers there might not be
much benefit in pushing this operation to the background, unless there
are already a few other relation entries (for dropped relations) so
that the cost of scanning the buffers can be amortized.

[1] - https://www.postgresql.org/message-id/16664.1435414204%40sss.pgh.pa.us

--
With Regards,
Amit Kapila.



Re: [Patch] Optimize dropping of relation buffers using dlist

akapila
In reply to this post by Robert Haas
On Fri, Aug 7, 2020 at 11:03 PM Robert Haas <[hidden email]> wrote:
>
> On Fri, Aug 7, 2020 at 12:52 PM Tom Lane <[hidden email]> wrote:
> > At least in the case of segment zero, the file will still exist.  It'll
> > have been truncated to zero length, and if the filesystem is stupid about
> > holes in files then maybe a write to a high block number would consume
> > excessive disk space, but does anyone still care about such filesystems?
> > I don't remember at the moment how we handle higher segments,
> >

We do unlink them and register the request to forget the Fsync
requests for those. See mdunlinkfork.

> > but likely
> > we could make them still exist too, postponing all the unlinks till after
> > checkpoint.  Or we could just have the backends give up on recycling a
> > particular buffer if they can't write it (which is the response to an I/O
> > failure already, I hope).
> >

Note that we don't often try to flush buffers from the backend. We
first try to forward the request to the checkpointer's queue, and only if
the queue is full does the backend try to flush the buffer itself, so even
if we decide to give up flushing such a buffer (where we get an error) via
the backend, it shouldn't impact very many cases. I am not sure, but if we
can somehow reliably distinguish this type of error from any other I/O
failure, then we can probably give up on flushing this buffer and continue,
or maybe just retry pushing the request to the checkpointer.

>
> None of this sounds very appealing. Postponing the unlinks means
> postponing recovery of the space at the OS level, which I think will
> be noticeable and undesirable for users. The other notions all seem to
> involve treating as valid on-disk states we currently treat as
> invalid, and our sanity checks in this area are already far too weak.
> And all you're buying for it is putting a hash table that would
> otherwise be shared memory into backend-private memory, which seems
> like quite a minor gain. Having that information visible to everybody
> seems a lot cleaner.
>

One more benefit of giving this responsibility to a single process
like the checkpointer is that we can avoid unlinking the relation until we
have scanned all the buffers corresponding to it. Now, surely keeping it in
shared memory and allowing other processes to work on it has other merits,
namely that such buffers might get invalidated faster, but I am not sure
we can then retain the benefit of the other approach, which is to perform
all such invalidation of buffers before unlinking the relation's first
segment.

--
With Regards,
Amit Kapila.

