[Patch] Optimize dropping of relation buffers using dlist

Re: [Patch] Optimize dropping of relation buffers using dlist

Kyotaro Horiguchi-4
I'd like to make a subtle correction.

At Wed, 02 Sep 2020 10:31:22 +0900 (JST), Kyotaro Horiguchi <[hidden email]> wrote in
> By the way
>
> > #define BUF_DROP_THRESHOLD 500 /* NBuffers divided by 2 */
>
> NBuffers is not a constant. Even if we wanted to set the macro as
> described in the comment, we should have used (NBuffers/2) instead of
> "500". But I suppose you might wanted to use (NBuffders / 500) as Tom
> suggested upthread.  And the name of the macro seems too generic. I

It was Andres, not Tom, who made the suggestion. Sorry for the mistake.

> think more specific names like BUF_DROP_FULLSCAN_THRESHOLD would be
> better.
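
For illustration only (this is not the patch text), the suggested definition, assuming the (NBuffers / 500) reading and the more specific name, would look something like:

/* threshold (in blocks) above which we fall back to a full buffer-pool scan */
#define BUF_DROP_FULLSCAN_THRESHOLD (NBuffers / 500)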

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



Re: [Patch] Optimize dropping of relation buffers using dlist

akapila
In reply to this post by Kyotaro Horiguchi-4
On Wed, Sep 2, 2020 at 7:01 AM Kyotaro Horiguchi
<[hidden email]> wrote:

>
> Hello.
>
> At Tue, 1 Sep 2020 13:02:28 +0000, "[hidden email]" <[hidden email]> wrote in
> > On Tuesday, August 18, 2020 3:05 PM (GMT+9), Amit Kapila wrote:
> > > Today, again thinking about this point it occurred to me that during recovery
> > > we can reliably find the relation size and after Thomas's recent commit
> > > c5315f4f44 (Cache smgrnblocks() results in recovery), we might not need to
> > > even incur the cost of lseek. Why don't we fix this first for 'recovery' (by
> > > following something on the lines of what Andres suggested) and then later
> > > once we have a generic mechanism for "caching the relation size" [1], we can
> > > do it for non-recovery paths.
> > > I think that will at least address the reported use case with some minimal
> > > changes.
> > >
> > > [1] -
> > > https://www.postgresql.org/message-id/CAEepm%3D3SSw-Ty1DFcK%3D1r
> > > U-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com
>
> Isn't a relation always locked asscess-exclusively, at truncation
> time?  If so, isn't even the result of lseek reliable enough?
>

Even if the relation is locked, background processes like the checkpointer
can still touch the relation, which might cause problems. Consider a
case where we extend the relation but haven't yet flushed the newly added
pages. During the truncate operation, the checkpointer can still flush
those pages, which can cause trouble for the truncate. But I think such
cases won't cause a problem in the recovery path.

--
With Regards,
Amit Kapila.



Re: [Patch] Optimize dropping of relation buffers using dlist

Tom Lane-2
Amit Kapila <[hidden email]> writes:
> Even if the relation is locked, background processes like checkpointer
> can still touch the relation which might cause problems. Consider a
> case where we extend the relation but didn't flush the newly added
> pages. Now during truncate operation, checkpointer can still flush
> those pages which can cause trouble for truncate. But, I think in the
> recovery path such cases won't cause a problem.

I wouldn't count on that staying true ...

https://www.postgresql.org/message-id/CA+hUKGJ8NRsqgkZEnsnRc2MFROBV-jCnacbYvtpptK2A9YYp9Q@...

                        regards, tom lane



RE: [Patch] Optimize dropping of relation buffers using dlist

k.jamison@fujitsu.com
In reply to this post by Kyotaro Horiguchi-4
On Wednesday, September 2, 2020 10:31 AM, Kyotaro Horiguchi wrote:

> Hello.
>
> At Tue, 1 Sep 2020 13:02:28 +0000, "[hidden email]"
> <[hidden email]> wrote in
> > On Tuesday, August 18, 2020 3:05 PM (GMT+9), Amit Kapila wrote:
> > > Today, again thinking about this point it occurred to me that during
> > > recovery we can reliably find the relation size and after Thomas's
> > > recent commit
> > > c5315f4f44 (Cache smgrnblocks() results in recovery), we might not
> > > need to even incur the cost of lseek. Why don't we fix this first
> > > for 'recovery' (by following something on the lines of what Andres
> > > suggested) and then later once we have a generic mechanism for
> > > "caching the relation size" [1], we can do it for non-recovery paths.
> > > I think that will at least address the reported use case with some
> > > minimal changes.
> > >
> > > [1] -
> > >
> https://www.postgresql.org/message-id/CAEepm%3D3SSw-Ty1DFcK%3D1r
> > > U-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com
>
> Isn't a relation always locked asscess-exclusively, at truncation time?  If so,
> isn't even the result of lseek reliable enough? And if we don't care the cost of
> lseek, we can do the same optimization also for non-recovery paths. Since
> anyway we perform actual file-truncation just after so I think the cost of lseek
> is negligible here.
The reason for that is that, when I read the comment in smgrnblocks in smgr.c,
I thought smgrnblocks can only be used reliably during recovery here
to ensure that we have the correct size.
Please correct me if my understanding is wrong, and I'll fix the patch.

         * For now, we only use cached values in recovery due to lack of a shared
         * invalidation mechanism for changes in file size.
         */
        if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
                return reln->smgr_cached_nblocks[forknum];

> > Attached is an updated V9 version with minimal code changes only and
> > avoids the previous overhead in the BufferAlloc. This time, I only
> > updated the recovery path as suggested by Amit, and followed Andres'
> > suggestion of referring to the cached blocks in smgrnblocks.
> > The layering is kinda tricky so the logic may be wrong. But as of now,
> > it passes the regression tests. I'll follow up with the performance results.
> > It seems there's regression for smaller shared_buffers. I'll update if I find
> bugs.
> > But I'd also appreciate your reviews in case I missed something.
>
> BUF_DROP_THRESHOLD seems to be misued. IIUC it defines the maximum
> number of file pages that we make relation-targetted search for buffers.
> Otherwise we scan through all buffers. On the other hand the latest patch just
> leaves all buffers for relation forks longer than the threshold.
Right, I missed the condition for that part. Fixed in the latest version.
 
> I think we should determine whether to do targetted-scan or full-scan based
> on the ratio of (expectedly maximum) total number of pages for all (specified)
> forks in a relation against total number of buffers.
       

> By the way
>
> > #define BUF_DROP_THRESHOLD 500 /* NBuffers divided by 2 */
>
> NBuffers is not a constant. Even if we wanted to set the macro as described
> in the comment, we should have used (NBuffers/2) instead of "500". But I
> suppose you might wanted to use (NBuffders / 500) as Tom suggested
> upthread.  And the name of the macro seems too generic. I think more
> specific names like BUF_DROP_FULLSCAN_THRESHOLD would be better.
Fixed.

Thank you for the review!
Attached is the v10 of the patch.

Best regards,
Kirk Jamison

Attachment: v10-Speedup-dropping-of-relation-buffers-during-recovery.patch (10K)

Re: [Patch] Optimize dropping of relation buffers using dlist

akapila
In reply to this post by Tom Lane-2
On Wed, Sep 2, 2020 at 9:17 AM Tom Lane <[hidden email]> wrote:

>
> Amit Kapila <[hidden email]> writes:
> > Even if the relation is locked, background processes like checkpointer
> > can still touch the relation which might cause problems. Consider a
> > case where we extend the relation but didn't flush the newly added
> > pages. Now during truncate operation, checkpointer can still flush
> > those pages which can cause trouble for truncate. But, I think in the
> > recovery path such cases won't cause a problem.
>
> I wouldn't count on that staying true ...
>
> https://www.postgresql.org/message-id/CA+hUKGJ8NRsqgkZEnsnRc2MFROBV-jCnacbYvtpptK2A9YYp9Q@...
>

I don't think that proposal will matter after commit c5315f4f44
because we are caching the size/blocks for recovery while doing extend
(smgrextend). In the above scenario, we would have cached the blocks,
which will be used at a later point in time.

--
With Regards,
Amit Kapila.



RE: [Patch] Optimize dropping of relation buffers using dlist

k.jamison@fujitsu.com
On Wednesday, September 2, 2020 5:49 PM, Amit Kapila wrote:

> On Wed, Sep 2, 2020 at 9:17 AM Tom Lane <[hidden email]> wrote:
> >
> > Amit Kapila <[hidden email]> writes:
> > > Even if the relation is locked, background processes like
> > > checkpointer can still touch the relation which might cause
> > > problems. Consider a case where we extend the relation but didn't
> > > flush the newly added pages. Now during truncate operation,
> > > checkpointer can still flush those pages which can cause trouble for
> > > truncate. But, I think in the recovery path such cases won't cause a
> problem.
> >
> > I wouldn't count on that staying true ...
> >
> >
> https://www.postgresql.org/message-id/CA+hUKGJ8NRsqgkZEnsnRc2MFR
> OBV-jC
> > [hidden email]
> >
>
> I don't think that proposal will matter after commit c5315f4f44 because we are
> caching the size/blocks for recovery while doing extend (smgrextend). In the
> above scenario, we would have cached the blocks which will be used at later
> point of time.
>
Hi,

I'm guessing we can still pursue this idea of improving the recovery path first.

I'm working on an updated patch version, because the CFBot is telling me
that postgres fails to build (one of the recovery TAP tests fails).
I'm still working on refactoring my patch, but I have yet to find a proper solution at the moment.
So I'm going to continue my investigation.

Attached is an updated WIP patch.
I'd appreciate if you could take a look at the patch as well.

In addition, also attached are the regression logs for the failure and the other logs
accompanying it: wal_optimize_node_minimal and wal_optimize_node_replica.

The failure stated in my session was:
t/018_wal_optimize.pl ................ 18/34 Bailout called.
Further testing stopped:  pg_ctl start failed
FAILED--Further testing stopped: pg_ctl start failed

Best regards,
Kirk Jamison

Attachments:
v11-Speedup-dropping-of-relation-buffers-during-recovery.patch (10K)
regress_log_018_wal_optimize (32K)
018_wal_optimize_node_minimal.log (54K)
018_wal_optimize_node_replica.log (6K)

Re: [Patch] Optimize dropping of relation buffers using dlist

akapila
On Mon, Sep 7, 2020 at 1:33 PM [hidden email]
<[hidden email]> wrote:

>
> On Wednesday, September 2, 2020 5:49 PM, Amit Kapila wrote:
> > On Wed, Sep 2, 2020 at 9:17 AM Tom Lane <[hidden email]> wrote:
> > >
> > > Amit Kapila <[hidden email]> writes:
> > > > Even if the relation is locked, background processes like
> > > > checkpointer can still touch the relation which might cause
> > > > problems. Consider a case where we extend the relation but didn't
> > > > flush the newly added pages. Now during truncate operation,
> > > > checkpointer can still flush those pages which can cause trouble for
> > > > truncate. But, I think in the recovery path such cases won't cause a
> > problem.
> > >
> > > I wouldn't count on that staying true ...
> > >
> > >
> > https://www.postgresql.org/message-id/CA+hUKGJ8NRsqgkZEnsnRc2MFR
> > OBV-jC
> > > [hidden email]
> > >
> >
> > I don't think that proposal will matter after commit c5315f4f44 because we are
> > caching the size/blocks for recovery while doing extend (smgrextend). In the
> > above scenario, we would have cached the blocks which will be used at later
> > point of time.
> >
>
> I'm guessing we can still pursue this idea of improving the recovery path first.
>

I think so.

> I'm working on an updated patch version, because the CFBot's telling
> that postgres fails to build (one of the recovery TAP tests fails).
> I'm still working on refactoring my patch, but have yet to find a proper solution at the moment.
> So I'm going to continue my investigation.
>
> Attached is an updated WIP patch.
> I'd appreciate if you could take a look at the patch as well.
>

So, I see the below log as one of the problems:
2020-09-07 06:20:33.918 UTC [10914] LOG:  redo starts at 0/15FFEC0
2020-09-07 06:20:33.919 UTC [10914] FATAL:  unexpected data beyond EOF
in block 1 of relation base/13743/24581

This indicates that we missed invalidating some buffer which should
have been invalidated. If you are able to reproduce this locally, then
I suggest first writing a simple patch without the threshold check;
basically, in recovery, always try to use the new way to invalidate the
buffer. That will reduce the scope of the code that can create a
problem. Let us know if the problem still exists and share
the logs. BTW, I think I see one problem in the code:

if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
+ bufHdr->tag.forkNum == forkNum[j] &&
+ bufHdr->tag.blockNum >= firstDelBlock[j])

Here, I think you need to use 'i', not 'j', for forkNum and
firstDelBlock, as those are arrays w.r.t. forks. That might fix the
problem, but I am not sure as I haven't tried to reproduce it.
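
For illustration, the corrected check would look something like this (a sketch of the suggestion, not the patch itself):

if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
    bufHdr->tag.forkNum == forkNum[i] &&
    bufHdr->tag.blockNum >= firstDelBlock[i])
    InvalidateBuffer(bufHdr); /* releases spinlock */
else
    UnlockBufHdr(bufHdr, buf_state);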

--
With Regards,
Amit Kapila.



RE: [Patch] Optimize dropping of relation buffers using dlist

tsunakawa.takay@fujitsu.com
        From: Amit Kapila <[hidden email]>
> if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
> + bufHdr->tag.forkNum == forkNum[j] &&
> + bufHdr->tag.blockNum >= firstDelBlock[j])
>
> Here, I think you need to use 'i' not 'j' for forkNum and
> firstDelBlock as those are arrays w.r.t forks. That might fix the
> problem but I am not sure as I haven't tried to reproduce it.


(1)
+ INIT_BUFFERTAG(newTag, rnode.node, forkNum[j], firstDelBlock[j]);

And you need to use i here, too.

I advise you to suspect every character, every word, and every sentence.  I've found many bugs in other people's code this way.  I'm afraid you're just following the code flow.


(2)
+ LWLockAcquire(newPartitionLock, LW_SHARED);
+ buf_id = BufTableLookup(&newTag, newHash);
+ LWLockRelease(newPartitionLock);
+
+ bufHdr = GetBufferDescriptor(buf_id);

Check the result of BufTableLookup() and do nothing if the block is not in the shared buffers.
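
For example, an illustrative sketch of that check (not the patch text) would be:

LWLockAcquire(newPartitionLock, LW_SHARED);
buf_id = BufTableLookup(&newTag, newHash);
LWLockRelease(newPartitionLock);

/* nothing to do if the block is not currently in shared buffers */
if (buf_id < 0)
    continue;

bufHdr = GetBufferDescriptor(buf_id);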


(3)
+ else
+ {
+ for (j = BUF_DROP_FULLSCAN_THRESHOLD; j < NBuffers; j++)
+ {

What's the meaning of this loop?  I don't understand the start condition.  Should j be initialized to 0?


(4)
+#define BUF_DROP_FULLSCAN_THRESHOLD (NBuffers / 2)

Wasn't it 500 instead of 2?  Anyway, I think we need to discuss this threshold later.


(5)
+ if (((int)nblocks) < BUF_DROP_FULLSCAN_THRESHOLD)

It's better to define BUF_DROP_FULLSCAN_THRESHOLD as a uint32 value instead of casting the type here, as these values are block counts.


Regards
Takayuki Tsunakawa

 

RE: [Patch] Optimize dropping of relation buffers using dlist

tsunakawa.takay@fujitsu.com
From: [hidden email] <[hidden email]>
> (1)
> + INIT_BUFFERTAG(newTag, rnode.node, forkNum[j], firstDelBlock[j]);
>
> And you need to use i here, too.

I remember the books "Code Complete" and/or "Readable Code" suggest using meaningful loop variable names like fork_num and block_count to prevent this type of mistake.


Regards
Takayuki Tsunakawa


 

RE: [Patch] Optimize dropping of relation buffers using dlist

k.jamison@fujitsu.com
In reply to this post by akapila
On Tuesday, September 8, 2020 1:02 PM, Amit Kapila wrote:
Hello,

> On Mon, Sep 7, 2020 at 1:33 PM [hidden email]
> <[hidden email]> wrote:
> >
> > On Wednesday, September 2, 2020 5:49 PM, Amit Kapila wrote:
> > > On Wed, Sep 2, 2020 at 9:17 AM Tom Lane <[hidden email]> wrote:
> > > >
> > > > Amit Kapila <[hidden email]> writes:
> > > > > Even if the relation is locked, background processes like
> > > > > checkpointer can still touch the relation which might cause
> > > > > problems. Consider a case where we extend the relation but
> > > > > didn't flush the newly added pages. Now during truncate
> > > > > operation, checkpointer can still flush those pages which can
> > > > > cause trouble for truncate. But, I think in the recovery path
> > > > > such cases won't cause a
> > > problem.
> > > >
> > > > I wouldn't count on that staying true ...
> > > >
> > > >
> > >
> https://www.postgresql.org/message-id/CA+hUKGJ8NRsqgkZEnsnRc2MFR
> > > OBV-jC
> > > > [hidden email]
> > > >
> > >
> > > I don't think that proposal will matter after commit c5315f4f44
> > > because we are caching the size/blocks for recovery while doing
> > > extend (smgrextend). In the above scenario, we would have cached the
> > > blocks which will be used at later point of time.
> > >
> >
> > I'm guessing we can still pursue this idea of improving the recovery path
> first.
> >
>
> I think so.
Alright, so I've updated the patch which passes the regression and TAP tests.
It compiles and builds as intended.

> > I'm working on an updated patch version, because the CFBot's telling
> > that postgres fails to build (one of the recovery TAP tests fails).
> > I'm still working on refactoring my patch, but have yet to find a proper
> solution at the moment.
> > So I'm going to continue my investigation.
> >
> > Attached is an updated WIP patch.
> > I'd appreciate if you could take a look at the patch as well.
> >
>
> So, I see the below log as one of the problems:
> 2020-09-07 06:20:33.918 UTC [10914] LOG:  redo starts at 0/15FFEC0
> 2020-09-07 06:20:33.919 UTC [10914] FATAL:  unexpected data beyond EOF
> in block 1 of relation base/13743/24581
>
> This indicates that we missed invalidating some buffer which should have
> been invalidated. If you are able to reproduce this locally then I suggest to first
> write a simple patch without the check of the threshold, basically in recovery
> always try to use the new way to invalidate the buffer. That will reduce the
> scope of the code that can create a problem. Let us know if the problem still
> exists and share the logs. BTW, I think I see one problem in the code:
>
> if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
> + bufHdr->tag.forkNum == forkNum[j] &&
> + bufHdr->tag.blockNum >= firstDelBlock[j])
>
> Here, I think you need to use 'i' not 'j' for forkNum and
> firstDelBlock as those are arrays w.r.t forks. That might fix the
> problem but I am not sure as I haven't tried to reproduce it.
Thanks for the advice. Right, that seems to be the cause of the error,
and fixing that (using the fork index) solved the case.
I also followed Tsunakawa-san's advice of using more meaningful iterator
names instead of "i" and "j" for readability.

I also added a new function (DropRelFileNodeBuffersOfFork) for the case where
the relation fork is bigger than the threshold, i.e. if (nblocks > BUF_DROP_FULLSCAN_THRESHOLD).
Perhaps there's a better name for that function.
However, as expected in the previous discussions, this is a bit slower than the
standard buffer invalidation process, because the whole of shared buffers is scanned nforks times.
Currently, I set the threshold to (NBuffers / 500).
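
Roughly, the dispatch described above looks like the following (an illustrative sketch, not the exact patch text):

nblocks = smgrnblocks(smgr_reln, forkNum[fork_num]);

if (nblocks < BUF_DROP_FULLSCAN_THRESHOLD)
{
    /* look up each target block in the buffer mapping table and
     * invalidate it individually */
}
else
{
    /* scan the whole of shared buffers for this fork */
    DropRelFileNodeBuffersOfFork(rnode.node, forkNum[fork_num],
                                 firstDelBlock[fork_num]);
}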

Feedback on the patch/testing are very much welcome.

Best regards,
Kirk Jamison


Attachment: v12-Speedup-dropping-of-relation-buffers-during-recovery.patch (12K)

RE: [Patch] Optimize dropping of relation buffers using dlist

k.jamison@fujitsu.com
Hi,

> BTW, I think I see one problem in the code:
> >
> > if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
> > + bufHdr->tag.forkNum == forkNum[j] &&
> > + bufHdr->tag.blockNum >= firstDelBlock[j])
> >
> > Here, I think you need to use 'i' not 'j' for forkNum and
> > firstDelBlock as those are arrays w.r.t forks. That might fix the
> > problem but I am not sure as I haven't tried to reproduce it.
>
> Thanks for advice. Right, that seems to be the cause of error, and fixing that
> (using fork) solved the case.
> I also followed the advice of Tsunakawa-san of using more meaningful
> iterator Instead of using "i" & "j" for readability.
>
> I also added a new function when relation fork is bigger than the threshold
>     If (nblocks > BUF_DROP_FULLSCAN_THRESHOLD)
> (DropRelFileNodeBuffersOfFork) Perhaps there's a better name for that
> function.
> However, as expected in the previous discussions, this is a bit slower than the
> standard buffer invalidation process, because the whole shared buffers are
> scanned nfork times.
> Currently, I set the threshold to (NBuffers / 500)
I made a mistake in v12. I replaced firstDelBlock[fork_num] with firstDelBlock[block_num]
in the for-loop over block_num, because we want to process the current block of the per-block loop.

OTOH, I used firstDelBlock[fork_num] when the relation fork is bigger than the threshold,
or if the cached blocks of small relations were already invalidated.

I'm not sure whether the logic is correct, so I'd appreciate feedback and comments/advice.

Regards,
Kirk Jamison

Attachment: v13-Speedup-dropping-of-relation-buffers-during-recovery.patch (13K)

Re: [Patch] Optimize dropping of relation buffers using dlist

Kyotaro Horiguchi-4
In reply to this post by akapila
At Wed, 2 Sep 2020 08:18:06 +0530, Amit Kapila <[hidden email]> wrote in

> On Wed, Sep 2, 2020 at 7:01 AM Kyotaro Horiguchi
> <[hidden email]> wrote:
> > Isn't a relation always locked asscess-exclusively, at truncation
> > time?  If so, isn't even the result of lseek reliable enough?
> >
>
> Even if the relation is locked, background processes like checkpointer
> can still touch the relation which might cause problems. Consider a
> case where we extend the relation but didn't flush the newly added
> pages. Now during truncate operation, checkpointer can still flush
> those pages which can cause trouble for truncate. But, I think in the
> recovery path such cases won't cause a problem.

I reconsidered this and still have a doubt.

Does this mean lseek(SEEK_END) doesn't count blocks that are
write(2)'d (by smgrextend) but not yet flushed? (I don't think so,
for clarity.) The nblocks cache is added just to reduce the number of
lseek()s and is expected to always have the same value as what lseek()
is expected to return. The reason it is reliable only during recovery
is that the cache is not shared, but the startup process is the only
process that changes the relation size during recovery.

If any other process can extend the relation while smgrtruncate is
running, the current DropRelFileNodeBuffers has the chance that a new
buffer for the extended area is allocated at a buffer location the
function has already passed by, which is a disaster.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



RE: [Patch] Optimize dropping of relation buffers using dlist

tsunakawa.takay@fujitsu.com
In reply to this post by k.jamison@fujitsu.com
The code doesn't seem to be working correctly.


(1)
+ for (block_num = 0; block_num <= nblocks; block_num++)

should be

+ for (block_num = firstDelBlock[fork_num]; block_num < nblocks; block_num++)

because:

* You only want to invalidate blocks >= firstDelBlock[fork_num], don't you?
* The relation's block number ranges from 0 to nblocks - 1.


(2)
+ INIT_BUFFERTAG(newTag, rnode.node, forkNum[fork_num],
+   firstDelBlock[block_num]);

Replace firstDelBlock[fork_num] with block_num, because you want to process the current block of the per-block loop.  Your code accesses memory out of the bounds of the array and doesn't invalidate any buffer.


(3)
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
+ bufHdr->tag.forkNum == forkNum[fork_num] &&
+ bufHdr->tag.blockNum >= firstDelBlock[block_num])
+ InvalidateBuffer(bufHdr); /* releases spinlock */
+ else
+ UnlockBufHdr(bufHdr, buf_state);

Replace
bufHdr->tag.blockNum >= firstDelBlock[fork_num]
with
bufHdr->tag.blockNum == block_num
because you want to check if the found buffer is for the current block of the loop.


(4)
+ /*
+ * We've invalidated the nblocks already. Scan the shared buffers
+ * for each fork.
+ */
+ if (block_num > nblocks)
+ {
+ DropRelFileNodeBuffersOfFork(rnode.node, forkNum[fork_num],
+ firstDelBlock[fork_num]);
+ }

This part is unnecessary.  This invalidates all buffers that (2) failed to process, so the regression test succeeds.
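
Putting corrections (1) through (3) together, the per-block loop would look roughly like this (an illustrative sketch that keeps the patch's variable names; it is not the patch itself):

for (block_num = firstDelBlock[fork_num]; block_num < nblocks; block_num++)
{
    /* look up the buffer for the current block of this fork */
    INIT_BUFFERTAG(newTag, rnode.node, forkNum[fork_num], block_num);
    newHash = BufTableHashCode(&newTag);
    newPartitionLock = BufMappingPartitionLock(newHash);

    LWLockAcquire(newPartitionLock, LW_SHARED);
    buf_id = BufTableLookup(&newTag, newHash);
    LWLockRelease(newPartitionLock);

    if (buf_id < 0)
        continue;            /* block is not in shared buffers */

    bufHdr = GetBufferDescriptor(buf_id);
    buf_state = LockBufHdr(bufHdr);

    if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
        bufHdr->tag.forkNum == forkNum[fork_num] &&
        bufHdr->tag.blockNum == block_num)
        InvalidateBuffer(bufHdr);    /* releases spinlock */
    else
        UnlockBufHdr(bufHdr, buf_state);
}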


Regards
Takayuki Tsunakawa


Re: [Patch] Optimize dropping of relation buffers using dlist

Kyotaro Horiguchi-4
In reply to this post by k.jamison@fujitsu.com
Thanks for the new version, Jamison.

At Tue, 15 Sep 2020 11:11:26 +0000, "[hidden email]" <[hidden email]> wrote in

> Hi,
>
> > BTW, I think I see one problem in the code:
> > >
> > > if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
> > > + bufHdr->tag.forkNum == forkNum[j] &&
> > > + bufHdr->tag.blockNum >= firstDelBlock[j])
> > >
> > > Here, I think you need to use 'i' not 'j' for forkNum and
> > > firstDelBlock as those are arrays w.r.t forks. That might fix the
> > > problem but I am not sure as I haven't tried to reproduce it.
> >
> > Thanks for advice. Right, that seems to be the cause of error, and fixing that
> > (using fork) solved the case.
> > I also followed the advice of Tsunakawa-san of using more meaningful
> > iterator Instead of using "i" & "j" for readability.

(FWIW, I prefer short conventional names for short-term iterator variables.)


master> * XXX currently it sequentially searches the buffer pool, should be
master> * changed to more clever ways of searching.  However, this routine
master> * is used only in code paths that aren't very performance-critical,
master> * and we shouldn't slow down the hot paths to make it faster ...

This comment needs a rewrite.


+ for (fork_num = 0; fork_num < nforks; fork_num++)
  {
  if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
- bufHdr->tag.forkNum == forkNum[j] &&
- bufHdr->tag.blockNum >= firstDelBlock[j])
+ bufHdr->tag.forkNum == forkNum[fork_num] &&
+ bufHdr->tag.blockNum >= firstDelBlock[fork_num])

fork_num is not actually a fork number, but the index of forkNum[].
It should be fork_idx (or just i, which I prefer..).

- for (j = 0; j < nforks; j++)
- DropRelFileNodeLocalBuffers(rnode.node, forkNum[j],
- firstDelBlock[j]);
+ for (fork_num = 0; fork_num < nforks; fork_num++)
+ DropRelFileNodeLocalBuffers(rnode.node, forkNum[fork_num],
+ firstDelBlock[fork_num]);

I think we don't need to include the irrelevant refactoring in this
patch. (And I think j is better there.)

+ * We only speedup this path during recovery, because that's the only
+ * timing when we can get a valid cached value of blocks for relation.
+ * See comment in smgrnblocks() in smgr.c. Otherwise, proceed to usual
+ * buffer invalidation process (scanning of whole shared buffers).

We need an explanation of why we do this optimization only for the
recovery case.

+ /* Get the number of blocks for the supplied relation's fork */
+ nblocks = smgrnblocks(smgr_reln, forkNum[fork_num]);
+ Assert(BlockNumberIsValid(nblocks));
+
+ if (nblocks < BUF_DROP_FULLSCAN_THRESHOLD)

As mentioned upthread, the criterion for whether we do full-scan or
lookup-drop is how large a portion of NBuffers this relation-drop is
going to invalidate.  So the nblocks above should be the sum of the numbers
of blocks to be truncated (not just the total number of blocks) of all
designated forks.  Then, once we decide to use the lookup-drop method, we
do that for all forks.
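
For example, a sketch of that criterion (the variable name nBlocksToInvalidate is hypothetical, for illustration only):

BlockNumber nBlocksToInvalidate = 0;

for (i = 0; i < nforks; i++)
{
    BlockNumber nblocks = smgrnblocks(smgr_reln, forkNum[i]);

    /* only the blocks from firstDelBlock[i] onward are dropped */
    nBlocksToInvalidate += nblocks - firstDelBlock[i];
}

if (nBlocksToInvalidate < BUF_DROP_FULLSCAN_THRESHOLD)
{
    /* use the lookup-drop method, for all forks */
}
else
{
    /* fall back to scanning the whole of shared buffers */
}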

+ for (block_num = 0; block_num <= nblocks; block_num++)
+ {

block_num is quite confusing with nblocks, at least for
me (:p). Like fork_num, I prefer that it be just j or iblk or
something else that isn't confusing with nblocks.  By the way, the
loop runs nblocks + 1 times, which seems wrong.  We can start the loop
from firstDelBlock[fork_num] instead of 0, and that makes the later check
against firstDelBlock[] unnecessary.

+ /* create a tag with respect to the block so we can lookup the buffer */
+ INIT_BUFFERTAG(newTag, rnode.node, forkNum[fork_num],
+   firstDelBlock[block_num]);

Mmm, it is wrong that the tag is initialized using
firstDelBlock[block_num]. Why isn't it just block_num?


+ if (buf_id < 0)
+ {
+ LWLockRelease(newPartitionLock);
+ continue;
+ }
+ LWLockRelease(newPartitionLock);

We don't need two separate LWLockRelease()'s there.

+   /*
+    * We can make this a tad faster by prechecking the buffer tag before
+    * we attempt to lock the buffer; this saves a lot of lock
...
+    */
+   if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+   continue;

In the original code, this is performed in order to avoid taking a
lock on bufHdr for irrelevant buffers. We have identified the buffer
by looking it up using the rnode, so I think we don't need this
check. Note that we are doing the same check after lock acquisition.

+     else
+     UnlockBufHdr(bufHdr, buf_state);
+    }
+    /*
+     * We've invalidated the nblocks already. Scan the shared buffers
+     * for each fork.
+     */
+    if (block_num > nblocks)
+    {
+     DropRelFileNodeBuffersOfFork(rnode.node, forkNum[fork_num],
+     firstDelBlock[fork_num]);
+    }

Mmm? block_num is always larger than nblocks there. And the function
call runs a whole-NBuffers scan for the just-processed fork. What is
the point of this code?


> > I also added a new function when relation fork is bigger than the threshold
> >     If (nblocks > BUF_DROP_FULLSCAN_THRESHOLD)
> > (DropRelFileNodeBuffersOfFork) Perhaps there's a better name for that
> > function.
> > However, as expected in the previous discussions, this is a bit slower than the
> > standard buffer invalidation process, because the whole shared buffers are
> > scanned nfork times.
> > Currently, I set the threshold to (NBuffers / 500)
>
> I made a mistake in the v12. I replaced the firstDelBlock[fork_num] with firstDelBlock[block_num],
> In the for-loop code block of block_num, because we want to process the current block of per-block loop
> OTOH, I used the firstDelBlock[fork_num] when relation fork is bigger than the threshold,
> or if the cached blocks of small relations were already invalidated.

Really? I believe that firstDelBlock is an array that has only nforks elements.

> The logic could be either correct or wrong, so I'd appreciate feedback and comments/advice.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



Re: [Patch] Optimize dropping of relation buffers using dlist

Kyotaro Horiguchi-4
At Wed, 16 Sep 2020 11:56:29 +0900 (JST), Kyotaro Horiguchi <[hidden email]> wrote in
(Oops! Some of my comments duplicate Tsunakawa-san's, sorry.)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



Re: [Patch] Optimize dropping of relation buffers using dlist

akapila
In reply to this post by Kyotaro Horiguchi-4
On Wed, Sep 16, 2020 at 7:46 AM Kyotaro Horiguchi
<[hidden email]> wrote:

>
> At Wed, 2 Sep 2020 08:18:06 +0530, Amit Kapila <[hidden email]> wrote in
> > On Wed, Sep 2, 2020 at 7:01 AM Kyotaro Horiguchi
> > <[hidden email]> wrote:
> > > Isn't a relation always locked asscess-exclusively, at truncation
> > > time?  If so, isn't even the result of lseek reliable enough?
> > >
> >
> > Even if the relation is locked, background processes like checkpointer
> > can still touch the relation which might cause problems. Consider a
> > case where we extend the relation but didn't flush the newly added
> > pages. Now during truncate operation, checkpointer can still flush
> > those pages which can cause trouble for truncate. But, I think in the
> > recovery path such cases won't cause a problem.
>
> I reconsided on this and still have a doubt.
>
> Is this means lseek(SEEK_END) doesn't count blocks that are
> write(2)'ed (by smgrextend) but not yet flushed? (I don't think so,
> for clarity.) The nblocks cache is added just to reduce the number of
> lseek()s and expected to always have the same value with what lseek()
> is expected to return.
>

See the comments in ReadBuffer_common(), which indicate such a possibility
("Unfortunately, we have also seen this case occurring because of
buggy Linux kernels that sometimes return an lseek(SEEK_END) result
that doesn't account for a recent write."). Also, refer to my previous
email [1] on this, and to another email linked in that email which has a
discussion on this point.

> The reason it is reliable only during recovery
> is that the cache is not shared but the startup process is the only
> process that changes the relation size during recovery.
>

Yes, that is why we are planning to do this optimization for the recovery path.

> If any other process can extend the relation while smgrtruncate is
> running, the current DropRelFileNodeBuffers should have the chance
> that a new buffer for extended area is allocated at a buffer location
> where the function already have passed by, which is a disaster.
>

The relation might have been extended before smgrtruncate, but the newly
added pages can be flushed by the checkpointer during smgrtruncate.

[1] - https://www.postgresql.org/message-id/CAA4eK1LH2uQWznwtonD%2Bnch76kqzemdTQAnfB06z_LXa6NTFtQ%40mail.gmail.com

--
With Regards,
Amit Kapila.



Re: [Patch] Optimize dropping of relation buffers using dlist

Kyotaro Horiguchi-4
At Wed, 16 Sep 2020 08:33:06 +0530, Amit Kapila <[hidden email]> wrote in

> On Wed, Sep 16, 2020 at 7:46 AM Kyotaro Horiguchi
> <[hidden email]> wrote:
> > Is this means lseek(SEEK_END) doesn't count blocks that are
> > write(2)'ed (by smgrextend) but not yet flushed? (I don't think so,
> > for clarity.) The nblocks cache is added just to reduce the number of
> > lseek()s and expected to always have the same value with what lseek()
> > is expected to return.
> >
>
> See comments in ReadBuffer_common() which indicates such a possibility
> ("Unfortunately, we have also seen this case occurring because of
> buggy Linux kernels that sometimes return an lseek(SEEK_END) result
> that doesn't account for a recent write."). Also, refer my previous
> email [1] on this and another email link in that email which has a
> discussion on this point.
>
> > The reason it is reliable only during recovery
> > is that the cache is not shared but the startup process is the only
> > process that changes the relation size during recovery.
> >
>
> Yes, that is why we are planning to do this optimization for recovery path.
>
> > If any other process can extend the relation while smgrtruncate is
> > running, the current DropRelFileNodeBuffers should have the chance
> > that a new buffer for extended area is allocated at a buffer location
> > where the function already have passed by, which is a disaster.
> >
>
> The relation might have extended before smgrtruncate but the newly
> added pages can be flushed by checkpointer during smgrtruncate.
>
> [1] - https://www.postgresql.org/message-id/CAA4eK1LH2uQWznwtonD%2Bnch76kqzemdTQAnfB06z_LXa6NTFtQ%40mail.gmail.com

Ah! Now I understand! The reason we can rely on the cache is that the
cached value is *not* what lseek returned but how far we intended to
extend. Thank you for the explanation.

By the way, I'm not sure it actually happens, but if one smgrextend
call extended the relation by two or more blocks, the cache is
invalidated and a subsequent smgrnblocks returns lseek()'s result. Don't
we need to guarantee that the cache is valid during recovery?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



Re: [Patch] Optimize dropping of relation buffers using dlist

akapila
On Wed, Sep 16, 2020 at 9:02 AM Kyotaro Horiguchi
<[hidden email]> wrote:

>
> At Wed, 16 Sep 2020 08:33:06 +0530, Amit Kapila <[hidden email]> wrote in
> > On Wed, Sep 16, 2020 at 7:46 AM Kyotaro Horiguchi
> > <[hidden email]> wrote:
> > > Is this means lseek(SEEK_END) doesn't count blocks that are
> > > write(2)'ed (by smgrextend) but not yet flushed? (I don't think so,
> > > for clarity.) The nblocks cache is added just to reduce the number of
> > > lseek()s and expected to always have the same value with what lseek()
> > > is expected to return.
> > >
> >
> > See comments in ReadBuffer_common() which indicates such a possibility
> > ("Unfortunately, we have also seen this case occurring because of
> > buggy Linux kernels that sometimes return an lseek(SEEK_END) result
> > that doesn't account for a recent write."). Also, refer my previous
> > email [1] on this and another email link in that email which has a
> > discussion on this point.
> >
> > > The reason it is reliable only during recovery
> > > is that the cache is not shared but the startup process is the only
> > > process that changes the relation size during recovery.
> > >
> >
> > Yes, that is why we are planning to do this optimization for recovery path.
> >
> > > If any other process can extend the relation while smgrtruncate is
> > > running, the current DropRelFileNodeBuffers should have the chance
> > > that a new buffer for extended area is allocated at a buffer location
> > > where the function already have passed by, which is a disaster.
> > >
> >
> > The relation might have extended before smgrtruncate but the newly
> > added pages can be flushed by checkpointer during smgrtruncate.
> >
> > [1] - https://www.postgresql.org/message-id/CAA4eK1LH2uQWznwtonD%2Bnch76kqzemdTQAnfB06z_LXa6NTFtQ%40mail.gmail.com
>
> Ah! I understood that! The reason we can rely on the cahce is that the
> cached value is *not* what lseek returned but how far we intended to
> extend. Thank you for the explanation.
>
> By the way I'm not sure that actually happens, but if one smgrextend
> call exnteded the relation by two or more blocks, the cache is
> invalidated and succeeding smgrnblocks returns lseek()'s result.
>

Can you think of any such case? I think in recovery we use
XLogReadBufferExtended->ReadBufferWithoutRelcache for reading the page,
which seems to extend the relation page by page, but there could be some
case where that is not true. One idea is to run the regression tests and
add an Assert to see if we are extending by more than a block during recovery.

> Don't
> we need to guarantee the cache to be valid while recovery?
>

One possibility could be that we somehow detect that the value we are
using is the cached one and, if so, only then do this optimization.


--
With Regards,
Amit Kapila.



Re: [Patch] Optimize dropping of relation buffers using dlist

Kyotaro Horiguchi-4
At Wed, 16 Sep 2020 10:05:32 +0530, Amit Kapila <[hidden email]> wrote in

> On Wed, Sep 16, 2020 at 9:02 AM Kyotaro Horiguchi
> <[hidden email]> wrote:
> >
> > At Wed, 16 Sep 2020 08:33:06 +0530, Amit Kapila <[hidden email]> wrote in
> > > On Wed, Sep 16, 2020 at 7:46 AM Kyotaro Horiguchi
> > > <[hidden email]> wrote:
> > By the way I'm not sure that actually happens, but if one smgrextend
> > call exnteded the relation by two or more blocks, the cache is
> > invalidated and succeeding smgrnblocks returns lseek()'s result.
> >
>
> Can you think of any such case? I think in recovery we use
> XLogReadBufferExtended->ReadBufferWithoutRelcache for reading the page
> which seems to be extending page-by-page but there could be some case
> where that is not true. One idea is to run regressions and add an
> Assert to see if we are extending more than a block during recovery.

I agree with you. Actually, XLogReadBufferExtended is the only point that
reads a page during recovery, and it seems to call ReadBufferWithoutRelcache
page by page up to the target page. The only case I found where the
cache is invalidated is ALTER TABLE SET TABLESPACE with
wal_level=minimal, and that is not during recovery; smgrextend is called
without smgrnblocks having been called at that time.

Considering that the behavior of lseek can be a problem only just after
extending a file, an assertion in smgrextend seems to be
enough, although I'm not confident in the diagnosis.

--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -474,7 +474,14 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
  if (reln->smgr_cached_nblocks[forknum] == blocknum)
  reln->smgr_cached_nblocks[forknum] = blocknum + 1;
  else
+ {
+ /*
+ * DropRelFileNodeBuffers relies on the behavior that the nblocks cache
+ * won't be invalidated by file extension while recovering.
+ */
+ Assert(!InRecovery);
  reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
+ }
 }

> > Don't
> > we need to guarantee the cache to be valid while recovery?
> >
>
> One possibility could be that we somehow detect that the value we are
> using is cached one and if so then only do this optimization.

I basically like this direction, but I'm not sure the additional
parameter for smgrnblocks is acceptable.

On the other hand, it might be a better design that
DropRelFileNodeBuffers gives up the optimization when
smgrnblocks(,, must_accurate = true) returns InvalidBlockNumber.


@@ -544,9 +551,12 @@ smgrwriteback(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 /*
  * smgrnblocks() -- Calculate the number of blocks in the
  * supplied relation.
+ *
+ * Returns InvalidBlockNumber if must_accurate is true and smgr_cached_nblocks
+ * is not available.
  */
 BlockNumber
-smgrnblocks(SMgrRelation reln, ForkNumber forknum)
+smgrnblocks(SMgrRelation reln, ForkNumber forknum, bool must_accurate)
 {
  BlockNumber result;
 
@@ -561,6 +571,17 @@ smgrnblocks(SMgrRelation reln, ForkNumber forknum)
 
  reln->smgr_cached_nblocks[forknum] = result;
 
+ /*
+ * We cannot believe the result from smgr_nblocks is always accurate
+ * because lseek of buggy Linux kernels doesn't account for a recent
+ * write. However, we can rely on the result from lseek while recovering
+ * because the first call to this function does not happen just after a file
+ * extension. Subsequent calls return the cached nblocks,
+ * which should be accurate during recovery.
+ */
+ if (!InRecovery && must_accurate)
+ return InvalidBlockNumber;
+
  return result;
 }
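
A hypothetical caller side of this proposal might then look like the following (illustrative only):

nblocks = smgrnblocks(smgr_reln, forkNum[i], true /* must_accurate */);

if (nblocks == InvalidBlockNumber)
{
    /* no accurate (cached) size available; give up the optimization
     * and scan the whole of shared buffers as before */
}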


regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



Re: [Patch] Optimize dropping of relation buffers using dlist

akapila
On Wed, Sep 16, 2020 at 2:02 PM Kyotaro Horiguchi
<[hidden email]> wrote:

>
> At Wed, 16 Sep 2020 10:05:32 +0530, Amit Kapila <[hidden email]> wrote in
> > On Wed, Sep 16, 2020 at 9:02 AM Kyotaro Horiguchi
> > <[hidden email]> wrote:
> > >
> > > At Wed, 16 Sep 2020 08:33:06 +0530, Amit Kapila <[hidden email]> wrote in
> > > > On Wed, Sep 16, 2020 at 7:46 AM Kyotaro Horiguchi
> > > > <[hidden email]> wrote:
> > > By the way I'm not sure that actually happens, but if one smgrextend
> > > call exnteded the relation by two or more blocks, the cache is
> > > invalidated and succeeding smgrnblocks returns lseek()'s result.
> > >
> >
> > Can you think of any such case? I think in recovery we use
> > XLogReadBufferExtended->ReadBufferWithoutRelcache for reading the page
> > which seems to be extending page-by-page but there could be some case
> > where that is not true. One idea is to run regressions and add an
> > Assert to see if we are extending more than a block during recovery.
>
> I agree with you. Actually XLogReadBufferExtended is the only point to
> read a page while recovery and seems calling ReadBufferWithoutRelcache
> page by page up to the target page. The only case I found where the
> cache is invalidated is ALTER TABLE SET TABLESPACE while
> wal_level=minimal and not during recovery. smgrextend is called
> without smgrnblocks called at the time.
>
> Considering that the behavior of lseek can be a problem only just after
> extending a file, an assertion in smgrextend seems to be
> enough. Although, I'm not confident on the diagnosis.
>
> --- a/src/backend/storage/smgr/smgr.c
> +++ b/src/backend/storage/smgr/smgr.c
> @@ -474,7 +474,14 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
>         if (reln->smgr_cached_nblocks[forknum] == blocknum)
>                 reln->smgr_cached_nblocks[forknum] = blocknum + 1;
>         else
> +       {
> +               /*
> +                * DropRelFileNodeBuffers relies on the behavior that nblocks cache
> +                * won't be invalidated by file extension while recoverying.
> +                */
> +               Assert(!InRecovery);
>                 reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
> +       }
>  }
>

Yeah, I have something like this in mind. I am not very sure at this
stage that we want to commit this, but for verification purposes,
running the regression tests with it is a good idea.

> > > Don't
> > > we need to guarantee the cache to be valid while recovery?
> > >
> >
> > One possibility could be that we somehow detect that the value we are
> > using is cached one and if so then only do this optimization.
>
> I basically like this direction.  But I'm not sure the additional
> parameter for smgrnblocks is acceptable.
>
> But on the contrary, it might be a better design that
> DropRelFileNodeBuffers gives up the optimization when
> smgrnblocks(,,must_accurate = true) returns InvalidBlockNumber.
>

I haven't thought about the best way to achieve this. Let us
see if Tsunakawa-San or Kirk-San has other ideas on this.

--
With Regards,
Amit Kapila.

