[Patch] Optimize dropping of relation buffers using dlist

RE: [Patch] Optimize dropping of relation buffers using dlist

k.jamison@fujitsu.com
On Wednesday, September 16, 2020 5:32 PM, Kyotaro Horiguchi wrote:

> At Wed, 16 Sep 2020 10:05:32 +0530, Amit Kapila <[hidden email]>
> wrote in
> > On Wed, Sep 16, 2020 at 9:02 AM Kyotaro Horiguchi
> > <[hidden email]> wrote:
> > >
> > > At Wed, 16 Sep 2020 08:33:06 +0530, Amit Kapila
> > > <[hidden email]> wrote in
> > > > On Wed, Sep 16, 2020 at 7:46 AM Kyotaro Horiguchi
> > > > <[hidden email]> wrote:
> > > By the way, I'm not sure that actually happens, but if one smgrextend
> > > call extended the relation by two or more blocks, the cache is
> > > invalidated and the succeeding smgrnblocks call returns lseek()'s result.
> > >
> >
> > Can you think of any such case? I think in recovery we use
> > XLogReadBufferExtended->ReadBufferWithoutRelcache for reading the
> page
> > which seems to be extending page-by-page but there could be some case
> > where that is not true. One idea is to run regressions and add an
> > Assert to see if we are extending more than a block during recovery.
>
> I agree with you. Actually, XLogReadBufferExtended is the only place where a
> page is read during recovery, and it seems to call ReadBufferWithoutRelcache
> page by page up to the target page. The only case I found where the cache is
> invalidated is ALTER TABLE SET TABLESPACE while wal_level=minimal and
> not during recovery. smgrextend is called without smgrnblocks being called at
> that time.
>
> Considering that the behavior of lseek can be a problem only just after
> extending a file, an assertion in smgrextend seems to be enough, although
> I'm not confident in this diagnosis.
>
> --- a/src/backend/storage/smgr/smgr.c
> +++ b/src/backend/storage/smgr/smgr.c
> @@ -474,7 +474,14 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
>   if (reln->smgr_cached_nblocks[forknum] == blocknum)
>   reln->smgr_cached_nblocks[forknum] = blocknum + 1;
>   else
> + {
> + /*
> +  * DropRelFileNodeBuffers relies on the behavior that the nblocks
> +  * cache won't be invalidated by file extension while recovering.
> +  */
> + Assert(!InRecovery);
>   reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
> + }
>  }
>
> > > Don't
> > > we need to guarantee the cache to be valid during recovery?
> > >
> >
> > One possibility could be that we somehow detect that the value we are
> > using is cached one and if so then only do this optimization.
>
> I basically like this direction.  But I'm not sure the additional parameter for
> smgrnblocks is acceptable.
>
> But on the contrary, it might be a better design that DropRelFileNodeBuffers
> gives up the optimization when smgrnblocks(,,must_accurate = true) returns
> InvalidBlockNumber.
>
Thank you for your thoughtful reviews and discussions, Horiguchi-san, Tsunakawa-san, and Amit-san.
Apologies for my carelessness; I've addressed the bugs in the previous version.
1. Getting the total number of blocks for all the specified forks
2. Hashtable probing conditions

I added the suggested assert in smgrextend for the XLogReadBufferExtended case,
and I thought that would be enough. I think modifying smgrnblocks to take a new
parameter would complicate the source code because a number of functions call it.
So I thought that putting BlockNumberIsValid(nblocks) in the condition would suffice;
otherwise, we do a full scan of the buffer pool.

+                       if ((nblocks / (uint32)NBuffers) < BUF_DROP_FULLSCAN_THRESHOLD &&
+                               BlockNumberIsValid(nblocks))
+                       {
+                               ... /* lookup and drop the target buffers */
+                       }
+                       else
+                       {
+                               /* full scan */

Attached is v14 of the patch. It compiles and passes the tests.
I hope for your continued reviews and feedback. Thank you very much.

Regards,
Kirk Jamison


Attachment: v14-Speedup-dropping-of-relation-buffers-during-recovery.patch (10K)

RE: [Patch] Optimize dropping of relation buffers using dlist

tsunakawa.takay@fujitsu.com
I looked at v14.


(1)
+ /* Get the total number of blocks for the supplied relation's fork */
+ for (j = 0; j < nforks; j++)
+ {
+ BlockNumber block = smgrnblocks(smgr_reln, forkNum[j]);
+ nblocks += block;
+ }

Why do you sum all forks?


(2)
+ if ((nblocks / (uint32)NBuffers) < BUF_DROP_FULLSCAN_THRESHOLD &&
+ BlockNumberIsValid(nblocks))
+ {

The division by NBuffers is not necessary, because both sides of "<" are numbers of blocks.
Why is the BlockNumberIsValid(nblocks) call needed?


(3)
  if (reln->smgr_cached_nblocks[forknum] == blocknum)
  reln->smgr_cached_nblocks[forknum] = blocknum + 1;
  else
+ {
+ /*
+ * DropRelFileNodeBuffers relies on the behavior that cached nblocks
+ * won't be invalidated by file extension while recovering.
+ */
+ Assert(!InRecovery);
  reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
+ }

I think this change is not directly related to this patch and can be a separate patch, but I want to leave the decision up to a committer.


Regards
Takayuki Tsunakawa





RE: [Patch] Optimize dropping of relation buffers using dlist

tsunakawa.takay@fujitsu.com
In reply to this post by akapila
From: Amit Kapila <[hidden email]>

> > > > Don't
> > > > we need to guarantee the cache to be valid during recovery?
> > > >
> > >
> > > One possibility could be that we somehow detect that the value we
> > > are using is cached one and if so then only do this optimization.
> >
> > I basically like this direction.  But I'm not sure the additional
> > parameter for smgrnblocks is acceptable.
> >
> > But on the contrary, it might be a better design that
> > DropRelFileNodeBuffers gives up the optimization when
> > smgrnblocks(,,must_accurate = true) returns InvalidBlockNumber.
> >
>
> I haven't thought about what is the best way to achieve this. Let us see if
> Tsunakawa-San or Kirk-San has other ideas on this?

I see no need to add an argument to smgrnblocks(), as it returns the correct cached or measured value.



Regards
Takayuki Tsunakawa



RE: [Patch] Optimize dropping of relation buffers using dlist

k.jamison@fujitsu.com
In reply to this post by tsunakawa.takay@fujitsu.com
On Wednesday, September 23, 2020 11:26 AM, Tsunakawa, Takayuki wrote:

> I looked at v14.
Thank you for checking it!
 

> (1)
> + /* Get the total number of blocks for the supplied relation's fork */
> + for (j = 0; j < nforks; j++)
> + {
> + BlockNumber block = smgrnblocks(smgr_reln, forkNum[j]);
> + nblocks += block;
> + }
>
> Why do you sum all forks?

I revised the patch based on my understanding of Horiguchi-san's comment,
but I could be wrong.
Quoting:

"
+ /* Get the number of blocks for the supplied relation's fork */
+ nblocks = smgrnblocks(smgr_reln, forkNum[fork_num]);
+ Assert(BlockNumberIsValid(nblocks));
+
+ if (nblocks < BUF_DROP_FULLSCAN_THRESHOLD)

As mentioned upthread, the criteria whether we do full-scan or
lookup-drop is how large portion of NBUFFERS this relation-drop can be
going to invalidate.  So the nblocks above should be the sum of number
of blocks to be truncated (not just the total number of blocks) of all
designated forks.  Then once we decided to do lookup-drop method, we
do that for all forks."

> (2)
> + if ((nblocks / (uint32)NBuffers) < BUF_DROP_FULLSCAN_THRESHOLD &&
> + BlockNumberIsValid(nblocks))
> + {
>
> The division by NBuffers is not necessary, because both sides of "<" are
> numbers of blocks.

Again, I based it on my understanding of the comment above,
so nblocks is the sum of all blocks to be truncated for all forks.


> Why is the BlockNumberIsValid(nblocks) call needed?

I thought we need to ensure that nblocks is not invalid, so I also added that check.

> (3)
>   if (reln->smgr_cached_nblocks[forknum] == blocknum)
>   reln->smgr_cached_nblocks[forknum] = blocknum + 1;
>   else
> + {
> + /*
> +  * DropRelFileNodeBuffers relies on the behavior that cached nblocks
> +  * won't be invalidated by file extension while recovering.
> +  */
> + Assert(!InRecovery);
>   reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
> + }
>
> I think this change is not directly related to this patch and can be a separate
> patch, but I want to leave the decision up to a committer.
>
Noted. Once we've clarified the above comments, I'll put it in a separate patch if necessary.

Thank you very much for the reviews.

Best regards,
Kirk Jamison





Re: [Patch] Optimize dropping of relation buffers using dlist

akapila
In reply to this post by tsunakawa.takay@fujitsu.com
On Wed, Sep 23, 2020 at 7:56 AM [hidden email]
<[hidden email]> wrote:

>
> (3)
>         if (reln->smgr_cached_nblocks[forknum] == blocknum)
>                 reln->smgr_cached_nblocks[forknum] = blocknum + 1;
>         else
> +       {
> +               /*
> +                * DropRelFileNodeBuffers relies on the behavior that cached nblocks
> +                * won't be invalidated by file extension while recovering.
> +                */
> +               Assert(!InRecovery);
>                 reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
> +       }
>
> I think this change is not directly related to this patch and can be a separate patch, but I want to leave the decision up to a committer.
>

We have added this mainly for testing purposes; basically, this assertion
should not fail during the regression tests. We can keep it in a separate
patch but need to ensure that. If it fails, then we can't rely on the
caching behaviour during recovery, which is actually required for the
correctness of the patch.


--
With Regards,
Amit Kapila.



Re: [Patch] Optimize dropping of relation buffers using dlist

akapila
In reply to this post by tsunakawa.takay@fujitsu.com
On Wed, Sep 23, 2020 at 8:04 AM [hidden email]
<[hidden email]> wrote:

>
> From: Amit Kapila <[hidden email]>
> > > > > Don't
> > > > > we need to guarantee the cache to be valid during recovery?
> > > > >
> > > >
> > > > One possibility could be that we somehow detect that the value we
> > > > are using is cached one and if so then only do this optimization.
> > >
> > > I basically like this direction.  But I'm not sure the additional
> > > parameter for smgrnblocks is acceptable.
> > >
> > > But on the contrary, it might be a better design that
> > > DropRelFileNodeBuffers gives up the optimization when
> > > smgrnblocks(,,must_accurate = true) returns InvalidBlockNumber.
> > >
> >
> > I haven't thought about what is the best way to achieve this. Let us see if
> > Tsunakawa-San or Kirk-San has other ideas on this?
>
> I see no need to add an argument to smgrnblocks(), as it returns the correct cached or measured value.
>

The idea is that we can't use this optimization if the value is not
cached because we can't rely on lseek behavior. See all the discussion
between Horiguchi-San and me in the thread above. So, how would you
ensure that if we don't use Kirk-San's proposal?

--
With Regards,
Amit Kapila.



RE: [Patch] Optimize dropping of relation buffers using dlist

tsunakawa.takay@fujitsu.com
In reply to this post by k.jamison@fujitsu.com
From: Jamison, Kirk/ジャミソン カーク <[hidden email]>

> I revised the patch based on my understanding of Horiguchi-san's comment,
> but I could be wrong.
> Quoting:
>
> "
> + /* Get the number of blocks for the supplied relation's
> fork */
> + nblocks = smgrnblocks(smgr_reln,
> forkNum[fork_num]);
> + Assert(BlockNumberIsValid(nblocks));
> +
> + if (nblocks < BUF_DROP_FULLSCAN_THRESHOLD)
>
> As mentioned upthread, the criteria whether we do full-scan or
> lookup-drop is how large portion of NBUFFERS this relation-drop can be
> going to invalidate.  So the nblocks above should be the sum of number
> of blocks to be truncated (not just the total number of blocks) of all
> designated forks.  Then once we decided to do lookup-drop method, we
> do that for all forks."

One takeaway from Horiguchi-san's comment is to use the number of blocks to invalidate for comparison, instead of all blocks in the fork.  That is, use

nblocks = smgrnblocks(fork) - firstDelBlock[fork];

Does this make sense?

What do you think is the reason for summing up all forks?  I didn't understand why.  Typically, FSM and VM forks are very small.  If the main fork is larger than NBuffers / 500, then v14 scans the entire shared buffers for the FSM and VM forks as well as the main fork, resulting in three scans in total.

Also, if you want to judge the criteria based on the total blocks of all forks, the following if should be placed outside the for loop, right?  Because this if condition doesn't change inside the for loop.

+ if ((nblocks / (uint32)NBuffers) < BUF_DROP_FULLSCAN_THRESHOLD &&
+ BlockNumberIsValid(nblocks))
+ {
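For illustration, here is a minimal sketch of that structure (variable names such as nBlocksToInvalidate are assumptions, not code from the posted patch): accumulate the per-fork counts inside the loop and make the full-scan decision once, after it.

    BlockNumber nBlocksToInvalidate = 0;
    int         j;

    /* Sum, over all forks, the number of blocks this drop will invalidate. */
    for (j = 0; j < nforks; j++)
        nBlocksToInvalidate += smgrnblocks(smgr_reln, forkNum[j]) - firstDelBlock[j];

    /* The condition doesn't vary per fork, so test it once, outside the loop. */
    if (nBlocksToInvalidate < BUF_DROP_FULLSCAN_THRESHOLD)
    {
        /* look up and drop only the affected buffers, fork by fork */
    }
    else
    {
        /* fall back to one full scan of all NBuffers descriptors */
    }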



> > (2)
> > + if ((nblocks / (uint32)NBuffers) < BUF_DROP_FULLSCAN_THRESHOLD &&
> > + BlockNumberIsValid(nblocks))
> > + {
> >
> > The division by NBuffers is not necessary, because both sides of "<" are
> > numbers of blocks.
>
> Again, I based it on my understanding of the comment above,
> so nblocks is the sum of all blocks to be truncated for all forks.

But the left expression of "<" is a ratio, while the right-hand side is a block count.  Two different units are compared.


> > Why is the BlockNumberIsValid(nblocks) call needed?
>
> I thought we need to ensure that nblocks is not invalid, so I also added that check.

When is it invalid?  smgrnblocks() seems to always return a valid block number.  Am I looking at different source code (I saw HEAD)?




Regards
Takayuki Tsunakawa




RE: [Patch] Optimize dropping of relation buffers using dlist

tsunakawa.takay@fujitsu.com
In reply to this post by akapila
From: Amit Kapila <[hidden email]>
> The idea is that we can't use this optimization if the value is not
> cached because we can't rely on lseek behavior. See all the discussion
> between Horiguchi-San and me in the thread above. So, how would you
> ensure that if we don't use Kirk-San's proposal?

Hmm, buggy Linux kernel...  (How long should we be worried about the bug?)

According to Horiguchi-san's suggestion below, it's during normal operation, not during recovery, that we should be careful, right?  Then, can we use the current smgrnblocks() as is?

+ /*
+ * We cannot believe that the result from smgr_nblocks is always accurate,
+ * because lseek on buggy Linux kernels doesn't account for a recent
+ * write. However, we can rely on the result from lseek while recovering,
+ * because the first call to this function does not happen just after a
+ * file extension. Subsequent calls return the cached nblocks, which
+ * should be accurate during recovery.
+ */
+ if (!InRecovery && must_accurate)
+ return InvalidBlockNumber;
+
  return result;
}

If smgrnblocks() could return a value smaller than the actual file size by one block even during recovery, how about always adding one to its return value in DropRelFileNodeBuffers()?  When smgrnblocks() actually returned the correct value, the extra block is simply not found in the shared buffers, so DropRelFileNodeBuffers() does no harm.
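As an illustration of this defensive idea only (a sketch assuming the per-fork loop variables discussed later in the thread; not the posted patch):

    /*
     * Hypothetical defensive variant: scan one block past the reported size
     * in case lseek() under-reported the file size by one block.
     */
    nblocks = smgrnblocks(smgr_reln, forkNum[j]) + 1;
    for (cur_blk = firstDelBlock[j]; cur_blk < nblocks; cur_blk++)
    {
        /*
         * Probe the buffer mapping table for this block; a miss on the
         * extra block is harmless.
         */
    }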

Or, add a new function like smgrnblocks_precise() to avoid adding an argument to smgrnblocks()?


Regards
Takayuki Tsunakawa



RE: [Patch] Optimize dropping of relation buffers using dlist

k.jamison@fujitsu.com
In reply to this post by tsunakawa.takay@fujitsu.com
On Wednesday, September 23, 2020 2:37 PM, Tsunakawa, Takayuki wrote:

> > I revised the patch based on my understanding of Horiguchi-san's
> > comment, but I could be wrong.
> > Quoting:
> >
> > "
> > + /* Get the number of blocks for the supplied relation's fork */
> > + nblocks = smgrnblocks(smgr_reln, forkNum[fork_num]);
> > + Assert(BlockNumberIsValid(nblocks));
> > +
> > + if (nblocks < BUF_DROP_FULLSCAN_THRESHOLD)
> >
> > As mentioned upthread, the criteria whether we do full-scan or
> > lookup-drop is how large portion of NBUFFERS this relation-drop can be
> > going to invalidate.  So the nblocks above should be the sum of number
> > of blocks to be truncated (not just the total number of blocks) of all
> > designated forks.  Then once we decided to do lookup-drop method, we
> > do that for all forks."
>
> One takeaway from Horiguchi-san's comment is to use the number of blocks
> to invalidate for comparison, instead of all blocks in the fork.  That is, use
>
> nblocks = smgrnblocks(fork) - firstDelBlock[fork];
>
> Does this make sense?
Hmm. OK, I think I overthought it and misunderstood what it meant.
I'll debug again using ereport just to check that the values and behavior are correct.
Your comment about the v14 patch made me realize that it reverted to the previous,
slower version where we scan NBuffers for each fork. Thank you for explaining it.

> What do you think is the reason for summing up all forks?  I didn't
> understand why.  Typically, FSM and VM forks are very small.  If the main
> fork is larger than NBuffers / 500, then v14 scans the entire shared buffers for
> the FSM and VM forks as well as the main fork, resulting in three scans in
> total.
>
> Also, if you want to judge the criteria based on the total blocks of all forks, the
> following if should be placed outside the for loop, right?  Because this if
> condition doesn't change inside the for loop.
>
> + if ((nblocks / (uint32)NBuffers) < BUF_DROP_FULLSCAN_THRESHOLD &&
> + BlockNumberIsValid(nblocks))
> + {
>
>
>
> > > (2)
> > > + if ((nblocks / (uint32)NBuffers) < BUF_DROP_FULLSCAN_THRESHOLD &&
> > > + BlockNumberIsValid(nblocks))
> > > + {
> > >
> > > The division by NBuffers is not necessary, because both sides of "<"
> > > are numbers of blocks.
> >
> > Again, I based it on my understanding of the comment above, so
> > nblocks is the sum of all blocks to be truncated for all forks.
>
> But the left expression of "<" is a ratio, while the right-hand side is a block
> count.  Two different units are compared.
>
Right. Makes sense. Fixed.

> > > Why is the BlockNumberIsValid(nblocks) call needed?
> >
> > I thought we need to ensure that nblocks is not invalid, so I also
> > added
>
> When is it invalid?  smgrnblocks() seems to always return a valid block
> number.  Am I looking at different source code (I saw HEAD)?

It's based on the discussion upthread about guaranteeing that the cache is valid during recovery,
and that we don't want to proceed with the optimization in case nblocks is invalid.
It may not be needed, so I already removed it, because the correct direction is ensuring that
smgrnblocks returns the precise value.
Considering the test case that Horiguchi-san suggested (attached as a separate patch),
maybe there's no need to indicate it in the loop condition.
For now, I haven't modified the design of smgrnblocks (or created a new function);
I just updated the patches based on the recent comments.

Thank you very much again for the reviews.

Best regards,
Kirk Jamison

Attachment: v15-Speedup-dropping-of-relation-buffers-during-recovery.patch (10K)
Attachment: v1-Prevent-smgrextend-from-invalidating-blocks-during-recovery.patch (1K)

Re: [Patch] Optimize dropping of relation buffers using dlist

akapila
In reply to this post by tsunakawa.takay@fujitsu.com
On Wed, Sep 23, 2020 at 12:00 PM [hidden email]
<[hidden email]> wrote:

>
> From: Amit Kapila <[hidden email]>
> > The idea is that we can't use this optimization if the value is not
> > cached because we can't rely on lseek behavior. See all the discussion
> > between Horiguchi-San and me in the thread above. So, how would you
> > ensure that if we don't use Kirk-San's proposal?
>
> Hmm, buggy Linux kernel...  (How long should we be worried about the bug?)
>
> According to Horiguchi-san's suggestion below, it's during normal operation, not during recovery, that we should be careful, right?
>

No, during recovery also we need to be careful. We need to ensure that
we use the cached value during recovery and that the cached value is
always up-to-date. We can't rely on lseek, and I have provided a
scenario upthread [1] where such behavior can cause a problem; see also
the response from Tom Lane on why the same can be true for recovery as
well.

The basic approach we are trying to pursue here is to rely on the
cached value of 'number of blocks' (as that always gives the correct
value, and even if there is a problem it will be our bug; we don't need
to rely on the OS for the correct value, and it is better w.r.t.
performance as well). It is currently only possible during recovery, so
we are using it in the recovery path; later, once Thomas's patch to
cache it for non-recovery cases is also done, we can use it for
non-recovery cases as well.

[1] - https://www.postgresql.org/message-id/CAA4eK1LqaJvT%3DbFOpc4i5Haq4oaVQ6wPbAcg64-Kt1qzp_MZYA%40mail.gmail.com

--
With Regards,
Amit Kapila.



RE: [Patch] Optimize dropping of relation buffers using dlist

tsunakawa.takay@fujitsu.com
In reply to this post by k.jamison@fujitsu.com
In v15:

(1)
+ for (cur_blk = firstDelBlock[j]; cur_blk < nblocks; cur_blk++)

The right side of "cur_blk <" should not be nblocks, because nblocks is no longer the total block count of the relation fork.


(2)
+ BlockNumber nblocks;
+ nblocks = smgrnblocks(smgr_reln, forkNum[j]) - firstDelBlock[j];

You should either:

* Combine the two lines into one: BlockNumber nblocks = ...;

or

* Put an empty line between the two lines to separate declarations and execution statements.


After correcting these, I think you can check the recovery performance.



Regards
Takayuki Tsunakawa





RE: [Patch] Optimize dropping of relation buffers using dlist

k.jamison@fujitsu.com
On Thursday, September 24, 2020 1:27 PM, Tsunakawa-san wrote:

> (1)
> + for (cur_blk = firstDelBlock[j]; cur_blk < nblocks; cur_blk++)
>
> The right side of "cur_blk <" should not be nblocks, because nblocks is
> no longer the total block count of the relation fork.

Right. Fixed. It should be the total number of blocks of the relation.

> (2)
> + BlockNumber nblocks;
> + nblocks = smgrnblocks(smgr_reln, forkNum[j]) - firstDelBlock[j];
>
> You should either:
>
> * Combine the two lines into one: BlockNumber nblocks = ...;
>
> or
>
> * Put an empty line between the two lines to separate declarations and
> execution statements.
Right. I separated them in the updated patch. And to prevent confusion,
nTotalBlocks and nBlocksToInvalidate are used instead of nblocks.

/* Get the total number of blocks for the supplied relation's fork */
nTotalBlocks = smgrnblocks(smgr_reln, forkNum[j]);

/* Get the total number of blocks to be invalidated for the specified fork */
nBlocksToInvalidate = nTotalBlocks - firstDelBlock[j];
 

> After correcting these, I think you can check the recovery performance.

I'll send performance measurement results in the next email. Thanks a lot for the reviews!

Regards,
Kirk Jamison

Attachment: v16-Optimize-DropRelFileNodeBuffers-during-recovery.patch (11K)
Attachment: v1-Prevent-invalidating-blocks-in-smgrextend-during-recovery.patch (1K)

Re: [Patch] Optimize dropping of relation buffers using dlist

Kyotaro Horiguchi
In reply to this post by tsunakawa.takay@fujitsu.com
Hello.

At Wed, 23 Sep 2020 05:37:24 +0000, "[hidden email]" <[hidden email]> wrote in
> From: Jamison, Kirk/ジャミソン カーク <[hidden email]>

# Wow. I'm surprised to read it..

> > I revised the patch based on my understanding of Horiguchi-san's comment,
> > but I could be wrong.
> > Quoting:
> >
> > "
> > + /* Get the number of blocks for the supplied relation's
> > fork */
> > + nblocks = smgrnblocks(smgr_reln,
> > forkNum[fork_num]);
> > + Assert(BlockNumberIsValid(nblocks));
> > +
> > + if (nblocks < BUF_DROP_FULLSCAN_THRESHOLD)
> >
> > As mentioned upthread, the criteria whether we do full-scan or
> > lookup-drop is how large portion of NBUFFERS this relation-drop can be
> > going to invalidate.  So the nblocks above should be the sum of number
> > of blocks to be truncated (not just the total number of blocks) of all
> > designated forks.  Then once we decided to do lookup-drop method, we
> > do that for all forks."
>
> One takeaway from Horiguchi-san's comment is to use the number of blocks to invalidate for comparison, instead of all blocks in the fork.  That is, use
>
> nblocks = smgrnblocks(fork) - firstDelBlock[fork];
>
> Does this make sense?
>
> What do you think is the reason for summing up all forks?  I didn't understand why.  Typically, FSM and VM forks are very small.  If the main fork is larger than NBuffers / 500, then v14 scans the entire shared buffers for the FSM and VM forks as well as the main fork, resulting in three scans in total.

I thought of summing up smgrnblocks(fork) - firstDelBlock[fork] over all
forks. I don't mind omitting the non-main forks, but a comment to explain
the reasoning would be needed.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



RE: [Patch] Optimize dropping of relation buffers using dlist

k.jamison@fujitsu.com
In reply to this post by k.jamison@fujitsu.com
Hi.

> I'll send performance measurement results in the next email. Thanks a lot for
> the reviews!

Below are the performance measurement results.
I was only able to use a low-spec machine:
4 vCPUs, 8 GB memory, RHEL, xfs filesystem.

[Failover/Recovery Test]
1. (Master) Create tables (ex. 10,000 tables) and insert data into them.
2. (M) DELETE FROM each table (ex. all rows of the 10,000 tables).
3. (Standby) To test failover, pause WAL replay on the standby server
(SELECT pg_wal_replay_pause();).
4. (M) psql -c "\timing on" (measures total execution of SQL queries)
5. (M) VACUUM (whole db)
6. (M) After vacuum finishes, stop the primary server: pg_ctl stop -w -mi
7. (S) Resume WAL replay and promote the standby.
Because it's difficult to measure recovery time directly, I used the attached
script (resume.sh), which prints a timestamp before and after promotion. It
basically does the following:
- "SELECT pg_wal_replay_resume();" is executed and WAL application resumes.
- "pg_ctl promote" promotes the standby.
- The time for "SELECT pg_is_in_recovery();" to change from "t" to "f" is measured.

[Results]
Recovery/Failover performance (in seconds). 3 trial runs.

| shared_buffers | master | patch  | %reg    |
|----------------|--------|--------|---------|
| 128MB          | 32.406 | 33.785 | 4.08%   |
| 1GB            | 36.188 | 32.747 | -10.51% |
| 2GB            | 41.996 | 32.88  | -27.73% |

There's a small regression with the default shared_buffers (128MB),
but with large NBuffers the recovery time is now almost constant,
so performance is improved. IOW, we enter the optimized path most of the time
during recovery.

I also ran a similar benchmark to what Tomas did [1]:
simple "pgbench -S" tests (warmup and then 15 x 1-minute runs with
1, 8 and 16 clients), but I'm not sure if my machine is reliable enough to
produce meaningful results for 8 clients and more.

| #          | master (tps) | patch (tps) | %reg   |
|------------|--------------|-------------|--------|
| 1 client   | 1676.937825 | 1707.018029 | -1.79% |
| 8 clients  | 7706.835401 | 7529.089044 | 2.31%  |
| 16 clients | 9823.65254  | 9991.184206 | -1.71% |


If additional performance measurements are needed, kindly advise me.
Thank you in advance.

[1] https://www.postgresql.org/message-id/flat/20200806213334.3bzadeirly3mdtzl%40development#473168a61e229de40eaf36326232f86c

Best regards,
Kirk Jamison

Attachment: resume.sh (694 bytes)
Attachment: run.sh (826 bytes)

RE: [Patch] Optimize dropping of relation buffers using dlist

tsunakawa.takay@fujitsu.com
In reply to this post by akapila
From: Amit Kapila <[hidden email]>

> No, during recovery also we need to be careful. We need to ensure that
> we use the cached value during recovery and that the cached value is
> always up-to-date. We can't rely on lseek, and I have provided a
> scenario upthread [1] where such behavior can cause a problem; see also
> the response from Tom Lane on why the same can be true for recovery as
> well.
>
> The basic approach we are trying to pursue here is to rely on the
> cached value of 'number of blocks' (as that always gives the correct
> value, and even if there is a problem it will be our bug; we don't need
> to rely on the OS for the correct value, and it is better w.r.t.
> performance as well). It is currently only possible during recovery, so
> we are using it in the recovery path; later, once Thomas's patch to
> cache it for non-recovery cases is also done, we can use it for
> non-recovery cases as well.

Although I may still be confused, I understood that Kirk-san's patch should:

* Still focus on speeding up the replay of TRUNCATE during recovery.

* During recovery, DropRelFileNodeBuffers() gets the cached size of the relation fork.  If it is cached, trust it and optimize the buffer invalidation.  If it's not cached, we can't trust the return value of smgrnblocks() because it's the lseek(END) return value, so we avoid the optimization.

* Then, add a new function, say, smgrnblocks_cached() that simply returns the cached block count, and DropRelFileNodeBuffers() uses it instead of smgrnblocks().



Regards
Takayuki Tsunakawa



RE: [Patch] Optimize dropping of relation buffers using dlist

tsunakawa.takay@fujitsu.com
In reply to this post by k.jamison@fujitsu.com
From: Jamison, Kirk/ジャミソン カーク <[hidden email]>
> [Results]
> Recovery/Failover performance (in seconds). 3 trial runs.
>
> | shared_buffers | master | patch  | %reg    |
> |----------------|--------|--------|---------|
> | 128MB          | 32.406 | 33.785 | 4.08%   |
> | 1GB            | 36.188 | 32.747 | -10.51% |
> | 2GB            | 41.996 | 32.88  | -27.73% |

Thanks for sharing the good results.  We want to know if we can get results as significant as you gained before with hundreds of GBs of shared buffers, don't we?


> I also ran a similar benchmark to what Tomas did [1]: simple
> "pgbench -S" tests (warmup and then 15 x 1-minute runs with 1, 8 and 16
> clients), but I'm not sure if my machine is reliable enough to produce
> meaningful results for 8 clients and more.

Let me confirm just in case.  Your patch should not affect pgbench performance, but you measured it.  Is there anything you're concerned about?


Regards
Takayuki Tsunakawa





RE: [Patch] Optimize dropping of relation buffers using dlist

k.jamison@fujitsu.com
On Friday, September 25, 2020 6:02 PM, Tsunakawa-san wrote:

> From: Jamison, Kirk/ジャミソン カーク <[hidden email]>
> > [Results]
> > Recovery/Failover performance (in seconds). 3 trial runs.
> >
> > | shared_buffers | master | patch  | %reg    |
> > |----------------|--------|--------|---------|
> > | 128MB          | 32.406 | 33.785 | 4.08%   |
> > | 1GB            | 36.188 | 32.747 | -10.51% |
> > | 2GB            | 41.996 | 32.88  | -27.73% |
>
> Thanks for sharing the good results.  We want to know if we can get
> results as significant as you gained before with hundreds of GBs of shared
> buffers, don't we?

Yes, but I don't have a high-spec machine I can use at the moment.
I'll try to get one by next week. Or, if someone would like to reproduce the
test on an available higher-spec machine, it would be much appreciated.
The test case is upthread [1].

> > I also ran a similar benchmark to what Tomas did [1]: simple
> > "pgbench -S" tests (warmup and then 15 x 1-minute runs with 1, 8 and
> > 16 clients), but I'm not sure if my machine is reliable enough to
> > produce meaningful results for 8 clients and more.
>
> Let me confirm just in case.  Your patch should not affect pgbench
> performance, but you measured it.  Is there anything you're concerned
> about?
>

Not really. In the previous emails, the concern was the BufferAlloc
overhead, but we don't have it in the latest patch. I posted the
benchmark results just in case somebody asks about them.

[1] https://www.postgresql.org/message-id/OSBPR01MB2341683DEDE0E7A8D045036FEF360%40OSBPR01MB2341.jpnprd01.prod.outlook.com

Regards,
Kirk Jamison



Re: [Patch] Optimize dropping of relation buffers using dlist

akapila
In reply to this post by tsunakawa.takay@fujitsu.com
On Fri, Sep 25, 2020 at 2:25 PM [hidden email]
<[hidden email]> wrote:

>
> From: Amit Kapila <[hidden email]>
> > No, during recovery also we need to be careful. We need to ensure that
> > we use the cached value during recovery and that the cached value is
> > always up-to-date. We can't rely on lseek, and I have provided a
> > scenario upthread [1] where such behavior can cause a problem; see also
> > the response from Tom Lane on why the same can be true for recovery as
> > well.
> >
> > The basic approach we are trying to pursue here is to rely on the
> > cached value of 'number of blocks' (as that always gives the correct
> > value, and even if there is a problem it will be our bug; we don't need
> > to rely on the OS for the correct value, and it is better w.r.t.
> > performance as well). It is currently only possible during recovery, so
> > we are using it in the recovery path; later, once Thomas's patch to
> > cache it for non-recovery cases is also done, we can use it for
> > non-recovery cases as well.
>
> Although I may still be confused, I understood that Kirk-san's patch should:
>
> * Still focus on speeding up the replay of TRUNCATE during recovery.
>
> * During recovery, DropRelFileNodeBuffers() gets the cached size of the relation fork.  If it is cached, trust it and optimize the buffer invalidation.  If it's not cached, we can't trust the return value of smgrnblocks() because it's the lseek(END) return value, so we avoid the optimization.
>

I agree with the above two points.

> * Then, add a new function, say, smgrnblocks_cached() that simply returns the cached block count, and DropRelFileNodeBuffers() uses it instead of smgrnblocks().
>

I am not sure if it is worth adding a new function for this. Why not
simply add a boolean parameter to smgrnblocks instead? BTW, AFAICS,
the latest patch doesn't have code to address this point.

--
With Regards,
Amit Kapila.



Re: [Patch] Optimize dropping of relation buffers using dlist

akapila
In reply to this post by k.jamison@fujitsu.com
On Fri, Sep 25, 2020 at 1:49 PM [hidden email]
<[hidden email]> wrote:

>
> Hi.
>
> > I'll send performance measurement results in the next email. Thanks a lot for
> > the reviews!
>
> Below are the performance measurement results.
> I was only able to use low-spec machine:
> CPU 4v, Memory 8GB, RHEL, xfs filesystem.
>
> [Failover/Recovery Test]
> 1. (Master) Create tables (ex. 10,000 tables) and insert data into them.
> 2. (M) DELETE FROM each table (ex. all rows of the 10,000 tables).
> 3. (Standby) To test failover, pause WAL replay on the standby server
> (SELECT pg_wal_replay_pause();).
> 4. (M) psql -c "\timing on" (measures total execution of SQL queries)
> 5. (M) VACUUM (whole db)
> 6. (M) After vacuum finishes, stop the primary server: pg_ctl stop -w -mi
> 7. (S) Resume WAL replay and promote the standby.
> Because it's difficult to measure recovery time directly, I used the attached
> script (resume.sh), which prints a timestamp before and after promotion. It
> basically does the following:
> - "SELECT pg_wal_replay_resume();" is executed and WAL application resumes.
> - "pg_ctl promote" promotes the standby.
> - The time for "SELECT pg_is_in_recovery();" to change from "t" to "f" is measured.
>
> [Results]
> Recovery/Failover performance (in seconds). 3 trial runs.
>
> | shared_buffers | master | patch  | %reg    |
> |----------------|--------|--------|---------|
> | 128MB          | 32.406 | 33.785 | 4.08%   |
> | 1GB            | 36.188 | 32.747 | -10.51% |
> | 2GB            | 41.996 | 32.88  | -27.73% |
>
> There's a small regression with the default shared_buffers (128MB),
>

I feel we should try to address this. Basically, we can see the
smallest value of shared buffers above which the new algorithm is
beneficial and try to use that as the threshold for doing this
optimization. I don't think this optimization is beneficial
for a small value of shared_buffers.
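As an illustration only, such a guard might look like the following (MIN_NBUFFERS_FOR_OPT is a hypothetical constant whose value would come from benchmarking; it is not in the posted patch):

    /*
     * Hypothetical guard: use the lookup-drop path only when shared_buffers
     * is large enough for the optimization to pay off.
     */
    if (NBuffers >= MIN_NBUFFERS_FOR_OPT &&
        nBlocksToInvalidate < BUF_DROP_FULLSCAN_THRESHOLD)
    {
        /* optimized lookup-drop */
    }
    else
    {
        /* full buffer-pool scan */
    }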

> but with large NBuffers the recovery time is now almost constant, so
> performance is improved. IOW, we enter the optimized path most of the time
> during recovery.
>

Yeah, that is good to see. We can probably try to check with a much
larger value of shared buffers.

--
With Regards,
Amit Kapila.



RE: [Patch] Optimize dropping of relation buffers using dlist

tsunakawa.takay@fujitsu.com
In reply to this post by akapila
From: Amit Kapila <[hidden email]>
> I agree with the above two points.

Thank you.  I'm relieved to know I didn't misunderstand.


> > * Then, add a new function, say, smgrnblocks_cached() that simply returns
> > the cached block count, and DropRelFileNodeBuffers() uses it instead of
> > smgrnblocks().
> >
>
> I am not sure if it is worth adding a new function for this. Why not simply
> add a boolean parameter to smgrnblocks instead?


One reason is that adding an argument requires modifying existing call sites (10 plus a few).  Another is that, although this may be a matter of taste, a function call with a bare true/false is sometimes not easy to understand.  One such example is find_XXX(some_args, true/false), where the true/false represents missing_ok.  Another example is as follows; I often wonder, "what's the meaning of this false, and that true?"

    if (!InstallXLogFileSegment(&destsegno, tmppath, false, 0, false))
        elog(ERROR, "InstallXLogFileSegment should not have failed");

Fortunately, the new function is very short and doesn't duplicate much code.  The function is a simple getter, and the function name can convey the meaning directly (if the name is good).
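For illustration, a minimal sketch of such a getter (an assumed shape, not code from the posted patch; it relies on smgr_cached_nblocks[] holding InvalidBlockNumber when nothing is cached):

    /*
     * Hypothetical sketch: return the cached block count for the fork, or
     * InvalidBlockNumber if no value is cached. The caller can then decide
     * whether to give up the optimization.
     */
    BlockNumber
    smgrnblocks_cached(SMgrRelation reln, ForkNumber forknum)
    {
        return reln->smgr_cached_nblocks[forknum];
    }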


> BTW, AFAICS, the latest patch
> doesn't have code to address this point.

Kirk-san, can you address this?  I don't mind much if you add an argument or a new function.



Regards
Takayuki Tsunakawa
