[PATCH] Speedup truncates of relation forks

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
35 messages Options
12
Reply | Threaded
Open this post in threaded view
|

[PATCH] Speedup truncates of relation forks

Jamison, Kirk

Hi all,

 

Attached is a patch to speed up the performance of truncates of relations.

This is also my first time to contribute my own patch,

and I'd gladly appreciate your feedback and advice.

 

A.     Summary

 

Whenever we truncate relations, it scans the shared buffers thrice

(one per fork) which can be time-consuming. This patch improves

the performance of relation truncates by initially marking the

pages-to-be-truncated of relation forks, then simultaneously

truncating them, resulting to an improved performance in VACUUM,

autovacuum operations and their recovery performance.

 

B.     Patch Details

The following functions were modified:

 

1.      FreeSpaceMapTruncateRel() and visibilitymap_truncate()

a.      CURRENT HEAD: These functions truncate the FSM pages and unused VM pages.

b.      PATCH: Both functions only mark the pages to truncate and return a block number.

-        We used to call smgrtruncate() in these functions, but these are now moved inside the RelationTruncate() and smgr_redo().

-        The tentative renaming of the functions are: MarkFreeSpaceMapTruncateRel() and visibilitymap_mark_truncate(). Feel free to suggest better names.

 

2.      RelationTruncate()

a.      HEAD: Truncate FSM and VM first, then write WAL, and lastly truncate main fork.

b.      PATCH: Now we mark FSM and VM pages first, write WAL, mark MAIN fork pages, then truncate all forks (MAIN, FSM, VM) simultaneously.

 

3.      smgr_redo()

a.      HEAD: Truncate main fork and the relation during XLOG replay, create fake rel cache for FSM and VM, truncate FSM, truncate VM, then free fake rel cache.

b.      PATCH: Mark main fork dirty buffers, create fake rel cache, mark fsm and vm buffers, truncate marked pages of relation forks simultaneously, truncate relation during XLOG replay, then free fake rel cache.

 

4.      smgrtruncate(), DropRelFileNodeBuffers()

-        input arguments are changed to array of forknum and block numbers, int nforks (size of forkNum array)

-        truncates the pages of relation forks simultaneously

 

5.      smgrdounlinkfork()

I modified the function because it calls DropRelFileNodeBuffers. However, this is a dead code that can be removed.

I did not remove it for now because that's not for me but the community to decide.

 

C.     Performance Test

 

I setup a synchronous streaming replication between a master-standby.

 

In postgresql.conf:

autovacuum = off

wal_level = replica

max_wal_senders = 5

wal_keep_segments = 16

max_locks_per_transaction = 10000

#shared_buffers = 8GB

#shared_buffers = 24GB

 

Objective: Measure VACUUM execution time; varying shared_buffers size.

 

1. Create table (ex. 10,000 tables). Insert data to tables.

2. DELETE FROM TABLE (ex. all rows of 10,000 tables)

3. psql -c "\timing on" (measures total execution of SQL queries)

4. VACUUM (whole db)

 

If you want to test with large number of relations,

you may use the stored functions I used here:

http://bit.ly/reltruncates

 

D.     Results

 

HEAD results

1) 128MB shared_buffers = 48.885 seconds

2) 8GB shared_buffers = 5 min 30.695 s

3) 24GB shared_buffers = 14 min 13.598 s

 

PATCH results

1) 128MB shared_buffers = 42.736 s

2) 8GB shared_buffers = 2 min 26.464 s

3) 24GB shared_buffers = 5 min 35.848 s

 

The performance significantly improved compared to HEAD,

especially for large shared buffers.

 

---

Would appreciate to hear your thoughts, comments, advice.

Thank you in advance.

 

 

Regards,

Kirk Jamison


v1-0001-Speedup-truncate-of-relation-forks.patch (30K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Speedup truncates of relation forks

Adrien Nayrat-2
On 6/11/19 9:34 AM, Jamison, Kirk wrote:
> Hi all,
>
> Attached is a patch to speed up the performance of truncates of relations.
>

Thanks for working on this!

>
> *C.     **Performance Test*
>
> I setup a synchronous streaming replication between a master-standby.
>
> In postgresql.conf:
> autovacuum = off
> wal_level = replica
> max_wal_senders = 5
> wal_keep_segments = 16
> max_locks_per_transaction = 10000
> #shared_buffers = 8GB
> #shared_buffers = 24GB
>
> Objective: Measure VACUUM execution time; varying shared_buffers size.
>
> 1. Create table (ex. 10,000 tables). Insert data to tables.
> 2. DELETE FROM TABLE (ex. all rows of 10,000 tables)
> 3. psql -c "\timing on" (measures total execution of SQL queries)
> 4. VACUUM (whole db)
>
> If you want to test with large number of relations,
>
> you may use the stored functions I used here:
> http://bit.ly/reltruncates
You should post these functions in this thread for the archives ;)

>
> *D.     **Results*
>
> HEAD results
>
> 1) 128MB shared_buffers = 48.885 seconds
> 2) 8GB shared_buffers = 5 min 30.695 s
> 3) 24GB shared_buffers = 14 min 13.598 s
>
> PATCH results
>
> 1) 128MB shared_buffers = 42.736 s
> 2) 8GB shared_buffers = 2 min 26.464 s
> 3) 24GB shared_buffers = 5 min 35.848 s
>
> The performance significantly improved compared to HEAD,
> especially for large shared buffers.
>
From a user POW, the main issue with relation truncation is that it can block
queries on standby server during truncation replay.

It could be interesting if you can test this case and give results of your path.
Maybe by performing read queries on standby server and counting wait_event with
pg_wait_sampling?

Regards,

--
Adrien



signature.asc (499 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Speedup truncates of relation forks

Tomas Vondra-4
In reply to this post by Jamison, Kirk
On Tue, Jun 11, 2019 at 07:34:35AM +0000, Jamison, Kirk wrote:
>Hi all,
>
>Attached is a patch to speed up the performance of truncates of relations.
>This is also my first time to contribute my own patch,
>and I'd gladly appreciate your feedback and advice.
>

Thanks for the patch. Please add it to the commitfest app, so that we
don't forget about it: https://commitfest.postgresql.org/23/

>
>A.     Summary
>
>Whenever we truncate relations, it scans the shared buffers thrice
>(one per fork) which can be time-consuming. This patch improves
>the performance of relation truncates by initially marking the
>pages-to-be-truncated of relation forks, then simultaneously
>truncating them, resulting to an improved performance in VACUUM,
>autovacuum operations and their recovery performance.
>

OK, so essentially the whole point is to scan the buffers only once, for
all forks at the same time (instead of three times).

>
>B.     Patch Details
>The following functions were modified:
>
>
>1.      FreeSpaceMapTruncateRel() and visibilitymap_truncate()
>
>a.      CURRENT HEAD: These functions truncate the FSM pages and unused VM pages.
>
>b.      PATCH: Both functions only mark the pages to truncate and return a block number.
>
>-        We used to call smgrtruncate() in these functions, but these are now moved inside the RelationTruncate() and smgr_redo().
>
>-        The tentative renaming of the functions are: MarkFreeSpaceMapTruncateRel() and visibilitymap_mark_truncate(). Feel free to suggest better names.
>
>
>2.      RelationTruncate()
>
>a.      HEAD: Truncate FSM and VM first, then write WAL, and lastly truncate main fork.
>
>b.      PATCH: Now we mark FSM and VM pages first, write WAL, mark MAIN fork pages, then truncate all forks (MAIN, FSM, VM) simultaneously.
>
>
>3.      smgr_redo()
>
>a.      HEAD: Truncate main fork and the relation during XLOG replay, create fake rel cache for FSM and VM, truncate FSM, truncate VM, then free fake rel cache.
>
>b.      PATCH: Mark main fork dirty buffers, create fake rel cache, mark fsm and vm buffers, truncate marked pages of relation forks simultaneously, truncate relation during XLOG replay, then free fake rel cache.
>
>
>4.      smgrtruncate(), DropRelFileNodeBuffers()
>
>-        input arguments are changed to array of forknum and block numbers, int nforks (size of forkNum array)
>
>-        truncates the pages of relation forks simultaneously
>
>
>5.      smgrdounlinkfork()
>I modified the function because it calls DropRelFileNodeBuffers. However, this is a dead code that can be removed.
>I did not remove it for now because that's not for me but the community to decide.
>

You really don't need to extract the changes like this - such changes
are generally obvious from the diff.

You only need to explain things that are not obvious from the code
itself, e.g. non-trivial design decisions, etc.

>
>C.     Performance Test
>
>I setup a synchronous streaming replication between a master-standby.
>
>In postgresql.conf:
>autovacuum = off
>wal_level = replica
>max_wal_senders = 5
>wal_keep_segments = 16
>max_locks_per_transaction = 10000
>#shared_buffers = 8GB
>#shared_buffers = 24GB
>
>Objective: Measure VACUUM execution time; varying shared_buffers size.
>
>1. Create table (ex. 10,000 tables). Insert data to tables.
>2. DELETE FROM TABLE (ex. all rows of 10,000 tables)
>3. psql -c "\timing on" (measures total execution of SQL queries)
>4. VACUUM (whole db)
>
>If you want to test with large number of relations,
>you may use the stored functions I used here:
>http://bit.ly/reltruncates
>
>
>D.     Results
>
>HEAD results
>1) 128MB shared_buffers = 48.885 seconds
>2) 8GB shared_buffers = 5 min 30.695 s
>3) 24GB shared_buffers = 14 min 13.598 s
>
>PATCH results
>1) 128MB shared_buffers = 42.736 s
>2) 8GB shared_buffers = 2 min 26.464 s
>3) 24GB shared_buffers = 5 min 35.848 s
>
>The performance significantly improved compared to HEAD,
>especially for large shared buffers.
>

Right, that seems nice. And it matches the expected 1:3 speedup, at
least for the larger shared_buffers cases.

Years ago I've implemented an optimization for many DROP TABLE commands
in a single transaction - instead of scanning buffers for each relation,
the code now accumulates a small number of relations into an array, and
then does a bsearch for each buffer.

Would something like that be applicable/useful here? That is, if we do
multiple TRUNCATE commands in a single transaction, can we optimize it
like this?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Speedup truncates of relation forks

Alvaro Herrera-9
On 2019-Jun-12, Tomas Vondra wrote:

> Years ago I've implemented an optimization for many DROP TABLE commands
> in a single transaction - instead of scanning buffers for each relation,
> the code now accumulates a small number of relations into an array, and
> then does a bsearch for each buffer.

commit 279628a0a7cf582f7dfb68e25b7b76183dd8ff2f:
    Accelerate end-of-transaction dropping of relations
   
    When relations are dropped, at end of transaction we need to remove the
    files and clean the buffer pool of buffers containing pages of those
    relations.  Previously we would scan the buffer pool once per relation
    to clean up buffers.  When there are many relations to drop, the
    repeated scans make this process slow; so we now instead pass a list of
    relations to drop and scan the pool once, checking each buffer against
    the passed list.  When the number of relations is larger than a
    threshold (which as of this patch is being set to 20 relations) we sort
    the array before starting, and bsearch the array; when it's smaller, we
    simply scan the array linearly each time, because that's faster.  The
    exact optimal threshold value depends on many factors, but the
    difference is not likely to be significant enough to justify making it
    user-settable.
   
    This has been measured to be a significant win (a 15x win when dropping
    100,000 relations; an extreme case, but reportedly a real one).
   
    Author: Tomas Vondra, some tweaks by me
    Reviewed by: Robert Haas, Shigeru Hanada, Andres Freund, Álvaro Herrera


--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

RE: [PATCH] Speedup truncates of relation forks

Tsunakawa, Takayuki
In reply to this post by Tomas Vondra-4
From: Tomas Vondra [mailto:[hidden email]]
> Years ago I've implemented an optimization for many DROP TABLE commands
> in a single transaction - instead of scanning buffers for each relation,
> the code now accumulates a small number of relations into an array, and
> then does a bsearch for each buffer.
>
> Would something like that be applicable/useful here? That is, if we do
> multiple TRUNCATE commands in a single transaction, can we optimize it
> like this?

Unfortunately not.  VACUUM and autovacuum handles each table in a different transaction.

BTW, what we really want to do is to keep the failover time within 10 seconds.  The customer periodically TRUNCATEs tens of thousands of tables.  If failover unluckily happens immediately after those TRUNCATEs, the recovery on the standby could take much longer.  But your past improvement seems likely to prevent that problem, if the customer TRUNCATEs tables in the same transaction.

On the other hand, it's now highly possible that the customer can only TRUNCATE a single table in a transaction, thus run as many transactions as the TRUNCATEd tables.  So, we also want to speed up each TRUNCATE by touching only the buffers for the table, not scanning the whole shared buffers.  Andres proposed one method that uses a radix tree, but we don't have an idea how to do it yet.

Speeding up each TRUNCATE and its recovery is a different topic.  The patch proposed here is one possible improvement to shorten the failover time.


Regards
Takayuki Tsunakawa






Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Speedup truncates of relation forks

Masahiko Sawada
On Wed, Jun 12, 2019 at 12:25 PM Tsunakawa, Takayuki
<[hidden email]> wrote:

>
> From: Tomas Vondra [mailto:[hidden email]]
> > Years ago I've implemented an optimization for many DROP TABLE commands
> > in a single transaction - instead of scanning buffers for each relation,
> > the code now accumulates a small number of relations into an array, and
> > then does a bsearch for each buffer.
> >
> > Would something like that be applicable/useful here? That is, if we do
> > multiple TRUNCATE commands in a single transaction, can we optimize it
> > like this?
>
> Unfortunately not.  VACUUM and autovacuum handles each table in a different transaction.

We do RelationTruncate() also when we truncate heaps that are created
in the current transactions or has a new relfilenodes in the current
transaction. So I think there is a room for optimization Thomas
suggested, although I'm not sure it's a popular use case.

I've not look at this patch deeply but in DropRelFileNodeBuffer I
think we can get the min value of all firstDelBlock and use it as the
lower bound of block number that we're interested in. That way we can
skip checking the array during scanning the buffer pool.

-extern void smgrdounlinkfork(SMgrRelation reln, ForkNumber forknum,
bool isRedo);
+extern void smgrdounlinkfork(SMgrRelation reln, ForkNumber *forknum,
+                                                        bool isRedo,
int nforks);
-extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
-                                                BlockNumber nblocks);
+extern void smgrtruncate(SMgrRelation reln, ForkNumber *forknum,
+                                                BlockNumber *nblocks,
int nforks);

Don't we use each elements of nblocks for each fork? That is, each
fork uses an element at its fork number in the nblocks array and sets
InvalidBlockNumber for invalid slots, instead of passing the valid
number of elements. That way the following code that exist at many places,

    blocks[nforks] = visibilitymap_mark_truncate(rel, nblocks);
   if (BlockNumberIsValid(blocks[nforks]))
   {
       forks[nforks] = VISIBILITYMAP_FORKNUM;
       nforks++;
   }

would become

    blocks[VISIBILITYMAP_FORKNUM] = visibilitymap_mark_truncate(rel, nblocks);

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Reply | Threaded
Open this post in threaded view
|

RE: [PATCH] Speedup truncates of relation forks

Jamison, Kirk
In reply to this post by Adrien Nayrat-2
On Tuesday, June 11, 2019 7:23 PM (GMT+9), Adrien Nayrat wrote:

> > Attached is a patch to speed up the performance of truncates of relations.
>
> Thanks for working on this!

Thank you also for taking a look at my thread.

> > If you want to test with large number of relations,
> > you may use the stored functions I used here:
> > http://bit.ly/reltruncates
>
> You should post these functions in this thread for the archives ;)
This is noted. Pasting it below:

create or replace function create_tables(numtabs int)
returns void as $$
declare query_string text;
begin
  for i in 1..numtabs loop
    query_string := 'create table tab_' || i::text || ' (a int);';
    execute query_string;
  end loop;
end;
$$ language plpgsql;

create or replace function delfrom_tables(numtabs int)
returns void as $$
declare query_string text;
begin
  for i in 1..numtabs loop
    query_string := 'delete from tab_' || i::text;
    execute query_string;
  end loop;
end;
$$ language plpgsql;

create or replace function insert_tables(numtabs int)
returns void as $$
declare query_string text;
begin
  for i in 1..numtabs loop
    query_string := 'insert into tab_' || i::text || ' VALUES (5);' ;
    execute query_string;
  end loop;
end;
$$ language plpgsql;


> From a user POW, the main issue with relation truncation is that it can block
> queries on standby server during truncation replay.
>
> It could be interesting if you can test this case and give results of your
> path.
> Maybe by performing read queries on standby server and counting wait_event
> with pg_wait_sampling?

Thanks for the suggestion. I tried using the extension pg_wait_sampling,
But I wasn't sure that I could replicate the problem of blocked queries on standby server.
Could you advise?
Here's what I did for now, similar to my previous test with hot standby setup,
but with additional read queries of wait events on standby server.

128MB shared_buffers
SELECT create_tables(10000);
SELECT insert_tables(10000);
SELECT delfrom_tables(10000);

[Before VACUUM]
Standby: SELECT the following view from pg_stat_waitaccum

wait_event_type |   wait_event    | calls | microsec
-----------------+-----------------+-------+----------
 Client          | ClientRead      |     2 | 20887759
 IO              | DataFileRead    |   175 |     2788
 IO              | RelationMapRead |     4 |       26
 IO              | SLRURead        |     2 |       38

Primary: Execute VACUUM (induces relation truncates)

[After VACUUM]
Standby:
 wait_event_type |   wait_event    | calls | microsec
-----------------+-----------------+-------+----------
 Client          | ClientRead      |     7 | 77662067
 IO              | DataFileRead    |   284 |     4523
 IO              | RelationMapRead |    10 |       51
 IO              | SLRURead        |     3 |       57

Regards,
Kirk Jamison
Reply | Threaded
Open this post in threaded view
|

RE: [PATCH] Speedup truncates of relation forks

Tsunakawa, Takayuki
In reply to this post by Masahiko Sawada
From: Masahiko Sawada [mailto:[hidden email]]
> We do RelationTruncate() also when we truncate heaps that are created
> in the current transactions or has a new relfilenodes in the current
> transaction. So I think there is a room for optimization Thomas
> suggested, although I'm not sure it's a popular use case.

Right, and I don't think of a use case that motivates the opmitizaion, too.


> I've not look at this patch deeply but in DropRelFileNodeBuffer I
> think we can get the min value of all firstDelBlock and use it as the
> lower bound of block number that we're interested in. That way we can
> skip checking the array during scanning the buffer pool.

That sounds reasonable, although I haven't examined the code, either.


> Don't we use each elements of nblocks for each fork? That is, each
> fork uses an element at its fork number in the nblocks array and sets
> InvalidBlockNumber for invalid slots, instead of passing the valid
> number of elements. That way the following code that exist at many places,

I think the current patch tries to reduce the loop count in DropRelFileNodeBuffers() by passing the number of target forks.


Regards
Takayuki Tsunakawa


 
Reply | Threaded
Open this post in threaded view
|

RE: [PATCH] Speedup truncates of relation forks

Jamison, Kirk
In reply to this post by Masahiko Sawada
On Wednesday, June 12, 2019 4:29 PM (GMT+9), Masahiko Sawada wrote:

> On Wed, Jun 12, 2019 at 12:25 PM Tsunakawa, Takayuki
> <[hidden email]> wrote:
> >
> > From: Tomas Vondra [mailto:[hidden email]]
> > > Years ago I've implemented an optimization for many DROP TABLE
> > > commands in a single transaction - instead of scanning buffers for
> > > each relation, the code now accumulates a small number of relations
> > > into an array, and then does a bsearch for each buffer.
> > >
> > > Would something like that be applicable/useful here? That is, if we
> > > do multiple TRUNCATE commands in a single transaction, can we
> > > optimize it like this?
> >
> > Unfortunately not.  VACUUM and autovacuum handles each table in a different
> transaction.
>
> We do RelationTruncate() also when we truncate heaps that are created in the
> current transactions or has a new relfilenodes in the current transaction.
> So I think there is a room for optimization Thomas suggested, although I'm
> not sure it's a popular use case.

I couldn't think of a use case too.

> I've not look at this patch deeply but in DropRelFileNodeBuffer I think we
> can get the min value of all firstDelBlock and use it as the lower bound of
> block number that we're interested in. That way we can skip checking the array
> during scanning the buffer pool.

I'll take note of this suggestion.
Could you help me expound more on this idea, skipping the internal loop by
comparing the min and buffer descriptor (bufHdr)?

In the current patch, I've implemented the following in DropRelFileNodeBuffers:
        for (i = 0; i < NBuffers; i++)
        {
                ...
                buf_state = LockBufHdr(bufHdr);
                for (k = 0; k < nforks; k++)
                {
                        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
                                bufHdr->tag.forkNum == forkNum[k] &&
                                bufHdr->tag.blockNum >= firstDelBlock[k])
                        {
                                InvalidateBuffer(bufHdr); /* releases spinlock */
                                break;
                        }

> Don't we use each elements of nblocks for each fork? That is, each fork uses
> an element at its fork number in the nblocks array and sets InvalidBlockNumber
> for invalid slots, instead of passing the valid number of elements. That way
> the following code that exist at many places,
>
>     blocks[nforks] = visibilitymap_mark_truncate(rel, nblocks);
>    if (BlockNumberIsValid(blocks[nforks]))
>    {
>        forks[nforks] = VISIBILITYMAP_FORKNUM;
>        nforks++;
>    }
>
> would become
>
>     blocks[VISIBILITYMAP_FORKNUM] = visibilitymap_mark_truncate(rel,
> nblocks);

In the patch, we want to truncate all forks' blocks simultaneously, so
we optimize the invalidation of buffers and reduce the number of loops
using those values.
The suggestion above would have to remove the forks array and its
forksize (nforks), is it correct? But I think we’d need the fork array
and nforks to execute the truncation all at once.
If I'm missing something, I'd really appreciate your further comments.

--
Thank you everyone for taking a look at my thread.
I've also already added this patch to the CommitFest app.

Regards,
Kirk Jamison
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Speedup truncates of relation forks

Masahiko Sawada
On Thu, Jun 13, 2019 at 6:30 PM Jamison, Kirk <[hidden email]> wrote:

>
> On Wednesday, June 12, 2019 4:29 PM (GMT+9), Masahiko Sawada wrote:
> > On Wed, Jun 12, 2019 at 12:25 PM Tsunakawa, Takayuki
> > <[hidden email]> wrote:
> > >
> > > From: Tomas Vondra [mailto:[hidden email]]
> > > > Years ago I've implemented an optimization for many DROP TABLE
> > > > commands in a single transaction - instead of scanning buffers for
> > > > each relation, the code now accumulates a small number of relations
> > > > into an array, and then does a bsearch for each buffer.
> > > >
> > > > Would something like that be applicable/useful here? That is, if we
> > > > do multiple TRUNCATE commands in a single transaction, can we
> > > > optimize it like this?
> > >
> > > Unfortunately not.  VACUUM and autovacuum handles each table in a different
> > transaction.
> >
> > We do RelationTruncate() also when we truncate heaps that are created in the
> > current transactions or has a new relfilenodes in the current transaction.
> > So I think there is a room for optimization Thomas suggested, although I'm
> > not sure it's a popular use case.
>
> I couldn't think of a use case too.
>
> > I've not look at this patch deeply but in DropRelFileNodeBuffer I think we
> > can get the min value of all firstDelBlock and use it as the lower bound of
> > block number that we're interested in. That way we can skip checking the array
> > during scanning the buffer pool.
>
> I'll take note of this suggestion.
> Could you help me expound more on this idea, skipping the internal loop by
> comparing the min and buffer descriptor (bufHdr)?
>

Yes. For example,

    BlockNumber minBlock = InvalidBlockNumber;
(snip)
    /* Get lower bound block number we're interested in */
    for (i = 0; i < nforks; i++)
    {
        if (!BlockNumberIsValid(minBlock) ||
            minBlock > firstDelBlock[i])
            minBlock = firstDelBlock[i];
    }

    for (i = 0; i < NBuffers; i++)
    {
(snip)
        buf_state = LockBufHdr(bufHdr);

        /* check with the lower bound and skip the loop */
        if (bufHdr->tag.blockNum < minBlock)
        {
            UnlockBufHdr(bufHdr, buf_state);
            continue;
        }

        for (k = 0; k < nforks; k++)
        {
            if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
                bufHdr->tag.forkNum == forkNum[k] &&
                bufHdr->tag.blockNum >= firstDelBlock[k])

But since we acquire the buffer header lock after all and the number
of the internal loops is small (at most 3 for now)  the benefit will
not be big.

> In the current patch, I've implemented the following in DropRelFileNodeBuffers:
>         for (i = 0; i < NBuffers; i++)
>         {
>                 ...
>                 buf_state = LockBufHdr(bufHdr);
>                 for (k = 0; k < nforks; k++)
>                 {
>                         if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
>                                 bufHdr->tag.forkNum == forkNum[k] &&
>                                 bufHdr->tag.blockNum >= firstDelBlock[k])
>                         {
>                                 InvalidateBuffer(bufHdr); /* releases spinlock */
>                                 break;
>                         }
>
> > Don't we use each elements of nblocks for each fork? That is, each fork uses
> > an element at its fork number in the nblocks array and sets InvalidBlockNumber
> > for invalid slots, instead of passing the valid number of elements. That way
> > the following code that exist at many places,
> >
> >     blocks[nforks] = visibilitymap_mark_truncate(rel, nblocks);
> >    if (BlockNumberIsValid(blocks[nforks]))
> >    {
> >        forks[nforks] = VISIBILITYMAP_FORKNUM;
> >        nforks++;
> >    }
> >
> > would become
> >
> >     blocks[VISIBILITYMAP_FORKNUM] = visibilitymap_mark_truncate(rel,
> > nblocks);
>
> In the patch, we want to truncate all forks' blocks simultaneously, so
> we optimize the invalidation of buffers and reduce the number of loops
> using those values.
> The suggestion above would have to remove the forks array and its
> forksize (nforks), is it correct? But I think we’d need the fork array
> and nforks to execute the truncation all at once.

I meant that each forks can use the its forknumber'th element of
firstDelBlock[]. For example, if firstDelBlock = {1000,
InvalidBlockNumber, 20, InvalidBlockNumber}, we can invalid buffers
pertaining both greater than block number 1000 of main and greater
than block number 20 of vm. Since firstDelBlock[FSM_FORKNUM] ==
InvalidBlockNumber we don't invalid buffers of fsm.

As Tsunakawa-san mentioned, since your approach would reduce the loop
count your idea might be better than mine which always takes 4 loop
counts.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Reply | Threaded
Open this post in threaded view
|

RE: [PATCH] Speedup truncates of relation forks

Tsunakawa, Takayuki
From: Masahiko Sawada [mailto:[hidden email]]

>     for (i = 0; i < NBuffers; i++)
>     {
> (snip)
>         buf_state = LockBufHdr(bufHdr);
>
>         /* check with the lower bound and skip the loop */
>         if (bufHdr->tag.blockNum < minBlock)
>         {
>             UnlockBufHdr(bufHdr, buf_state);
>             continue;
>         }
>
>         for (k = 0; k < nforks; k++)
>         {
>             if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
>                 bufHdr->tag.forkNum == forkNum[k] &&
>                 bufHdr->tag.blockNum >= firstDelBlock[k])
>
> But since we acquire the buffer header lock after all and the number
> of the internal loops is small (at most 3 for now)  the benefit will
> not be big.

Yeah, so I think we can just compare the block number without locking the buffer header here.


Regards
Takayuki Tsunakawa

Reply | Threaded
Open this post in threaded view
|

RE: [PATCH] Speedup truncates of relation forks

Jamison, Kirk
In reply to this post by Masahiko Sawada
Hi Sawada-san,

On Thursday, June 13, 2019 8:01 PM, Masahiko Sawada wrote:

> On Thu, Jun 13, 2019 at 6:30 PM Jamison, Kirk <[hidden email]>
> wrote:
> >
> > On Wednesday, June 12, 2019 4:29 PM (GMT+9), Masahiko Sawada wrote:
> > > ...
> > > I've not look at this patch deeply but in DropRelFileNodeBuffer I
> > > think we can get the min value of all firstDelBlock and use it as
> > > the lower bound of block number that we're interested in. That way
> > > we can skip checking the array during scanning the buffer pool.
> >
> > I'll take note of this suggestion.
> > Could you help me expound more on this idea, skipping the internal
> > loop by comparing the min and buffer descriptor (bufHdr)?
> >
>
> Yes. For example,
>
>     BlockNumber minBlock = InvalidBlockNumber;
> (snip)
>     /* Get lower bound block number we're interested in */
>     for (i = 0; i < nforks; i++)
>     {
>         if (!BlockNumberIsValid(minBlock) ||
>             minBlock > firstDelBlock[i])
>             minBlock = firstDelBlock[i];
>     }
>
>     for (i = 0; i < NBuffers; i++)
>     {
> (snip)
>         buf_state = LockBufHdr(bufHdr);
>
>         /* check with the lower bound and skip the loop */
>         if (bufHdr->tag.blockNum < minBlock)
>         {
>             UnlockBufHdr(bufHdr, buf_state);
>             continue;
>         }
>
>         for (k = 0; k < nforks; k++)
>         {
>             if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
>                 bufHdr->tag.forkNum == forkNum[k] &&
>                 bufHdr->tag.blockNum >= firstDelBlock[k])
>
> But since we acquire the buffer header lock after all and the number of the
> internal loops is small (at most 3 for now)  the benefit will not be big.

Thank you very much for your kind and detailed explanation.
I'll still consider your suggestions in the next patch and optimize it more
so that we could possibly not need to acquire the LockBufHdr anymore.


> > > Don't we use each elements of nblocks for each fork? That is, each
> > > fork uses an element at its fork number in the nblocks array and
> > > sets InvalidBlockNumber for invalid slots, instead of passing the
> > > valid number of elements. That way the following code that exist at
> > > many places,
> > >
> > >     blocks[nforks] = visibilitymap_mark_truncate(rel, nblocks);
> > >    if (BlockNumberIsValid(blocks[nforks]))
> > >    {
> > >        forks[nforks] = VISIBILITYMAP_FORKNUM;
> > >        nforks++;
> > >    }
> > >
> > > would become
> > >
> > >     blocks[VISIBILITYMAP_FORKNUM] = visibilitymap_mark_truncate(rel,
> > > nblocks);
> >
> > In the patch, we want to truncate all forks' blocks simultaneously, so
> > we optimize the invalidation of buffers and reduce the number of loops
> > using those values.
> > The suggestion above would have to remove the forks array and its
> > forksize (nforks), is it correct? But I think we’d need the fork array
> > and nforks to execute the truncation all at once.
>
> I meant that each forks can use the its forknumber'th element of
> firstDelBlock[]. For example, if firstDelBlock = {1000, InvalidBlockNumber,
> 20, InvalidBlockNumber}, we can invalid buffers pertaining both greater than
> block number 1000 of main and greater than block number 20 of vm. Since
> firstDelBlock[FSM_FORKNUM] == InvalidBlockNumber we don't invalid buffers
> of fsm.
>
> As Tsunakawa-san mentioned, since your approach would reduce the loop count
> your idea might be better than mine which always takes 4 loop counts.

Understood. Thank you again for the kind and detailed explanations.
I'll reconsider these approaches.

Regards,
Kirk Jamison
Reply | Threaded
Open this post in threaded view
|

RE: [PATCH] Speedup truncates of relation forks

Jamison, Kirk
Hi all,

Attached is the v2 of the patch. I added the optimization that Sawada-san
suggested for DropRelFileNodeBuffers, although I did not acquire the lock
when comparing the minBlock and target block.

There's actually a comment written in the source code that we could
pre-check buffer tag for forkNum and blockNum, but given that FSM and VM
blocks are small compared to main fork's, the additional benefit of doing so
would be small.

>* We could check forkNum and blockNum as well as the rnode, but the
>* incremental win from doing so seems small.

I personally think it's alright not to include the suggested pre-checking.
If that's the case, we can just follow the patch v1 version.

Thoughts?

Comments and reviews from other parts of the patch are also very much welcome.

Regards,
Kirk Jamison

v2-0001-Speedup-truncates-of-relation-forks.patch (31K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Speedup truncates of relation forks

Adrien Nayrat-2
In reply to this post by Jamison, Kirk
On 6/12/19 10:29 AM, Jamison, Kirk wrote:

>
>> From a user POW, the main issue with relation truncation is that it can block
>> queries on standby server during truncation replay.
>>
>> It could be interesting if you can test this case and give results of your
>> path.
>> Maybe by performing read queries on standby server and counting wait_event
>> with pg_wait_sampling?
>
> Thanks for the suggestion. I tried using the extension pg_wait_sampling,
> But I wasn't sure that I could replicate the problem of blocked queries on standby server.
> Could you advise?
> Here's what I did for now, similar to my previous test with hot standby setup,
> but with additional read queries of wait events on standby server.
>
> 128MB shared_buffers
> SELECT create_tables(10000);
> SELECT insert_tables(10000);
> SELECT delfrom_tables(10000);
>
> [Before VACUUM]
> Standby: SELECT the following view from pg_stat_waitaccum
>
> wait_event_type |   wait_event    | calls | microsec
> -----------------+-----------------+-------+----------
>  Client          | ClientRead      |     2 | 20887759
>  IO              | DataFileRead    |   175 |     2788
>  IO              | RelationMapRead |     4 |       26
>  IO              | SLRURead        |     2 |       38
>
> Primary: Execute VACUUM (induces relation truncates)
>
> [After VACUUM]
> Standby:
>  wait_event_type |   wait_event    | calls | microsec
> -----------------+-----------------+-------+----------
>  Client          | ClientRead      |     7 | 77662067
>  IO              | DataFileRead    |   284 |     4523
>  IO              | RelationMapRead |    10 |       51
>  IO              | SLRURead        |     3 |       57
>
(Sorry for the delay, I forgot to answer you)

As far as I remember, you should see "relation" wait events (type lock) on
standby server. This is due to startup process acquiring AccessExclusiveLock for
the truncation and other backend waiting to acquire a lock to read the table.

On primary server, vacuum is able to cancel truncation:

/*
 * We need full exclusive lock on the relation in order to do
 * truncation. If we can't get it, give up rather than waiting --- we
 * don't want to block other backends, and we don't want to deadlock
 * (which is quite possible considering we already hold a lower-grade
 * lock).
 */
vacrelstats->lock_waiter_detected = false;
lock_retry = 0;
while (true)
{
    if (ConditionalLockRelation(onerel, AccessExclusiveLock))
        break;

    /*
     * Check for interrupts while trying to (re-)acquire the exclusive
     * lock.
     */
    CHECK_FOR_INTERRUPTS();

    if (++lock_retry > (VACUUM_TRUNCATE_LOCK_TIMEOUT /
                        VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL))
    {
        /*
         * We failed to establish the lock in the specified number of
         * retries. This means we give up truncating.
         */
        vacrelstats->lock_waiter_detected = true;
        ereport(elevel,
                (errmsg("\"%s\": stopping truncate due to conflicting lock request",
                        RelationGetRelationName(onerel))));
        return;
    }

    pg_usleep(VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL * 1000L);
}


To maximize chances to reproduce we can use big shared_buffers. But I am afraid
it is not easy to perform reproducible tests to compare results. Unfortunately I
don't have servers to perform tests.

Regards,


signature.asc (499 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

RE: [PATCH] Speedup truncates of relation forks

Jamison, Kirk
On Wednesday, June 26, 2019 6:10 PM(GMT+9), Adrien Nayrat wrote:
> As far as I remember, you should see "relation" wait events (type lock) on
> standby server. This is due to startup process acquiring AccessExclusiveLock
> for the truncation and other backend waiting to acquire a lock to read the
> table.

Hi Adrien, thank you for taking time to reply.

I understand that RelationTruncate() can block read-only queries on
standby during redo. However, it's difficult for me to reproduce the
test case where I need to catch that wait for relation lock, because
one has to execute SELECT within the few milliseconds of redoing the
truncation of one table.

Instead, I just measured the whole recovery time, smgr_redo(),
to show the recovery improvement compared to head. Please refer below.

[Recovery Test]
I used the same stored functions and configurations in the previous email
& created "test" db.

$ createdb test
$ psql -d test

1. [Primary] Create 10,000 relations.
        test=# SELECT create_tables(10000);

2. [P] Insert one row in each table.
        test=# SELECT insert_tables(10000);

3. [P] Delete row of each table.
        test=# SELECT delfrom_tables(10000);

4. [Standby] WAL application is stopped at Standby server.
        test=# SELECT pg_wal_replay_pause();

5. [P] VACUUM is executed at Primary side, and measure its execution time.
        test=# \timing on
        test=# VACUUM;

        Alternatively, you may use:
        $ time psql -d test -c 'VACUUM;'
        (Note: WAL has not replayed on standby because it's been paused.)

6. [P] Wait until VACUUM has finished execution. Then, stop primary server.
        test=# pg_ctl stop -w

7. [S] Resume WAL replay, then promote standby (failover).
I used a shell script to execute recovery & promote standby server
because it's kinda difficult to measure recovery time. Please refer to the script below.
- "SELECT pg_wal_replay_resume();" is executed and the WAL application is resumed.
- "pg_ctl promote" to promote standby.
- The time difference of "select pg_is_in_recovery();" from "t" to "f" is measured.

shell script:

PGDT=/path_to_storage_directory/

if [ "$1" = "resume" ]; then
        psql -c "SELECT pg_wal_replay_resume();" test
        date +%Y/%m/%d_%H:%M:%S.%3N
        pg_ctl promote -D ${PGDT}
        set +x
        date +%Y/%m/%d_%H:%M:%S.%3N
        while [ 1 ]
        do
                RS=`psql -Atc "select pg_is_in_recovery();" test`
                if [ ${RS} = "f" ]; then
                        break
                fi
        done
        date +%Y/%m/%d_%H:%M:%S.%3N
        set -x
        exit 0
fi


[Test Results]
shared_buffers = 24GB

1. HEAD
(wal replay resumed)
2019/07/01_08:48:50.326
server promoted
2019/07/01_08:49:50.482
2019/07/01_09:02:41.051

 Recovery Time:
 13 min 50.725 s -> Time difference from WAL replay to complete recovery
 12 min 50.569 s -> Time difference of "select pg_is_in_recovery();" from "t" to "f"

2. PATCH
(wal replay resumed)
2019/07/01_07:34:26.766
server promoted
2019/07/01_07:34:57.790
2019/07/01_07:34:57.809

 Recovery Time:
 31.043 s -> Time difference from WAL replay to complete recovery
 00.019 s -> Time difference of "select pg_is_in_recovery();" from "t" to "f"
 
[Conclusion]
The recovery time significantly improved compared to head
from 13 minutes to 30 seconds.

Any thoughts?
I'd really appreciate your comments/feedback about the patch and/or test.


Regards,
Kirk Jamison
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Speedup truncates of relation forks

Masahiko Sawada
In reply to this post by Jamison, Kirk
On Mon, Jun 17, 2019 at 5:01 PM Jamison, Kirk <[hidden email]> wrote:

>
> Hi all,
>
> Attached is the v2 of the patch. I added the optimization that Sawada-san
> suggested for DropRelFileNodeBuffers, although I did not acquire the lock
> when comparing the minBlock and target block.
>
> There's actually a comment written in the source code that we could
> pre-check buffer tag for forkNum and blockNum, but given that FSM and VM
> blocks are small compared to main fork's, the additional benefit of doing so
> would be small.
>
> >* We could check forkNum and blockNum as well as the rnode, but the
> >* incremental win from doing so seems small.
>
> I personally think it's alright not to include the suggested pre-checking.
> If that's the case, we can just follow the patch v1 version.
>
> Thoughts?
>
> Comments and reviews from other parts of the patch are also very much welcome.
>

Thank you for updating the patch. Here is the review comments for v2 patch.

---
- *     visibilitymap_truncate - truncate the visibility map
+ *     visibilitymap_mark_truncate - mark the about-to-be-truncated VM
+ *
+ * Formerly, this function truncates VM relation forks. Instead, this just
+ * marks the dirty buffers.
  *
  * The caller must hold AccessExclusiveLock on the relation, to ensure that
  * other backends receive the smgr invalidation event that this function sends
  * before they access the VM again.
  *

I don't think we should describe about the previous behavior here.
Rather we need to describe what visibilitymap_mark_truncate does and
what it returns to the caller.

I'm not sure that visibilitymap_mark_truncate function name is
appropriate here since it actually truncate map bits, not only
marking. Perhaps we can still use visibilitymap_truncate or change to
visibilitymap_truncate_prepare, or something? Anyway, this function
truncates only tail bits in the last remaining map page and we can
have a rule that the caller must call smgrtruncate() later to actually
truncate pages.

The comment of second paragraph is now out of date since this function
no longer sends smgr invalidation message.

Is it worth to leave the current visibilitymap_truncate function as a
shortcut function, instead of replacing? That way we don't need to
change pg_truncate_visibility_map function.

The same comments are true for MarkFreeSpaceMapTruncateRel.

---
+       ForkNumber      forks[MAX_FORKNUM];
+       BlockNumber     blocks[MAX_FORKNUM];
+       BlockNumber     new_nfsmblocks = InvalidBlockNumber;    /* FSM blocks */
+       BlockNumber     newnblocks = InvalidBlockNumber;        /* VM blocks */
+       int             nforks = 0;

I think that we can have new_nfsmblocks and new_nvmblocks and wipe out
the comments.

---
-       /* Truncate the FSM first if it exists */
+       /*
+        * We used to truncate FSM and VM forks here. Now we only mark the
+        * dirty buffers of all forks about-to-be-truncated if they exist.
+        */
+

Again, I think we need the description of current behavior rather than
the history, except the case where the history is important.

---
-               /*
-                * Make an XLOG entry reporting the file truncation.
-                */
+               /* Make an XLOG entry reporting the file truncation */

Unnecessary change.

---
+       /*
+        * We might as well update the local smgr_fsm_nblocks and
smgr_vm_nblocks
+        * setting. smgrtruncate sent an smgr cache inval message,
which will cause
+        * other backends to invalidate their copy of smgr_fsm_nblocks and
+        * smgr_vm_nblocks, and this one too at the next command
boundary. But this
+        * ensures it isn't outright wrong until then.
+        */
+       if (rel->rd_smgr)
+       {
+               rel->rd_smgr->smgr_fsm_nblocks = new_nfsmblocks;
+               rel->rd_smgr->smgr_vm_nblocks = newnblocks;
+       }

new_nfsmblocks and newnblocks could be InvalidBlockNumber when the
forks are already enough small. So the above code sets incorrect
values to smgr_{fsm,vm}_nblocks.

Also, I wonder if we can do the above code in smgrtruncate. Otherwise
we always need to set smgr_{fsm,vm}_nblocks after smgrtruncate, which
is inconvenient.

---
+               /* Update the local smgr_fsm_nblocks and
smgr_vm_nblocks setting */
+               if (rel->rd_smgr)
+               {
+                       rel->rd_smgr->smgr_fsm_nblocks = new_nfsmblocks;
+                       rel->rd_smgr->smgr_vm_nblocks = newnblocks;
+               }

The save as above. And we need to set smgr_{fsm,vm}_nblocks in spite
of freeing the fake relcache soon?

---
+       /* Get the lower bound of target block number we're interested in */
+       for (i = 0; i < nforks; i++)
+       {
+               if (!BlockNumberIsValid(minBlock) ||
+                       minBlock > firstDelBlock[i])
+                       minBlock = firstDelBlock[i];
+       }

Maybe we can declare 'i' in the for statement (i.e. for (int i = 0;
...)) at every outer loops in this functions. And in the inner loop we
can use 'j'.

---
-DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber forkNum,
-                                          BlockNumber firstDelBlock)
+DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
+                                          BlockNumber *firstDelBlock,
int nforks)

I think it's better to declare *forkNum and nforks side by side for
readability. That is, we can have it as follows.

DropRelFileNodeBuffers (RelFileNodeBackend rnode, ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock)


---
-smgrdounlinkfork(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+smgrdounlinkfork(SMgrRelation reln, ForkNumber *forknum, bool isRedo,
int nforks)

Same as above. The order of reln, *forknum, nforks, isRedo would be better.

---
@@ -383,6 +383,10 @@ pg_truncate_visibility_map(PG_FUNCTION_ARGS)
 {
        Oid                     relid = PG_GETARG_OID(0);
        Relation        rel;
+       ForkNumber      forks[MAX_FORKNUM];
+       BlockNumber     blocks[MAX_FORKNUM];
+       BlockNumber     newnblocks = InvalidBlockNumber;
+       int             nforks = 0;

Why do we need the array of forks and blocks here? I think it's enough
to have one fork and one block number.

---
The comment of smgrdounlinkfork function needs to be updated. We now
can remove multiple forks.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Speedup truncates of relation forks

Adrien Nayrat-2
In reply to this post by Jamison, Kirk
On 7/1/19 12:55 PM, Jamison, Kirk wrote:

> On Wednesday, June 26, 2019 6:10 PM(GMT+9), Adrien Nayrat wrote:
>> As far as I remember, you should see "relation" wait events (type lock) on
>> standby server. This is due to startup process acquiring AccessExclusiveLock
>> for the truncation and other backend waiting to acquire a lock to read the
>> table.
>
> Hi Adrien, thank you for taking time to reply.
>
> I understand that RelationTruncate() can block read-only queries on
> standby during redo. However, it's difficult for me to reproduce the
> test case where I need to catch that wait for relation lock, because
> one has to execute SELECT within the few milliseconds of redoing the
> truncation of one table.
Yes, that why your test by measuring vacuum execution time is better as it is
more reproductible.

>
> Instead, I just measured the whole recovery time, smgr_redo(),
> to show the recovery improvement compared to head. Please refer below.
>
> [Recovery Test]
> I used the same stored functions and configurations in the previous email
> & created "test" db.
>
> $ createdb test
> $ psql -d test
>
> 1. [Primary] Create 10,000 relations.
> test=# SELECT create_tables(10000);
>
> 2. [P] Insert one row in each table.
> test=# SELECT insert_tables(10000);
>
> 3. [P] Delete row of each table.
> test=# SELECT delfrom_tables(10000);
>
> 4. [Standby] WAL application is stopped at Standby server.
> test=# SELECT pg_wal_replay_pause();
>
> 5. [P] VACUUM is executed at Primary side, and measure its execution time.
> test=# \timing on
> test=# VACUUM;
>
> Alternatively, you may use:
> $ time psql -d test -c 'VACUUM;'
> (Note: WAL has not replayed on standby because it's been paused.)
>
> 6. [P] Wait until VACUUM has finished execution. Then, stop primary server.
> test=# pg_ctl stop -w
>
> 7. [S] Resume WAL replay, then promote standby (failover).
> I used a shell script to execute recovery & promote standby server
> because it's kinda difficult to measure recovery time. Please refer to the script below.
> - "SELECT pg_wal_replay_resume();" is executed and the WAL application is resumed.
> - "pg_ctl promote" to promote standby.
> - The time difference of "select pg_is_in_recovery();" from "t" to "f" is measured.
>
> shell script:
>
> PGDT=/path_to_storage_directory/
>
> if [ "$1" = "resume" ]; then
> psql -c "SELECT pg_wal_replay_resume();" test
> date +%Y/%m/%d_%H:%M:%S.%3N
> pg_ctl promote -D ${PGDT}
> set +x
> date +%Y/%m/%d_%H:%M:%S.%3N
> while [ 1 ]
> do
> RS=`psql -Atc "select pg_is_in_recovery();" test`
> if [ ${RS} = "f" ]; then
> break
> fi
> done
> date +%Y/%m/%d_%H:%M:%S.%3N
> set -x
> exit 0
> fi
>
>
> [Test Results]
> shared_buffers = 24GB
>
> 1. HEAD
> (wal replay resumed)
> 2019/07/01_08:48:50.326
> server promoted
> 2019/07/01_08:49:50.482
> 2019/07/01_09:02:41.051
>
>  Recovery Time:
>  13 min 50.725 s -> Time difference from WAL replay to complete recovery
>  12 min 50.569 s -> Time difference of "select pg_is_in_recovery();" from "t" to "f"
>
> 2. PATCH
> (wal replay resumed)
> 2019/07/01_07:34:26.766
> server promoted
> 2019/07/01_07:34:57.790
> 2019/07/01_07:34:57.809
>
>  Recovery Time:
>  31.043 s -> Time difference from WAL replay to complete recovery
>  00.019 s -> Time difference of "select pg_is_in_recovery();" from "t" to "f"
>  
> [Conclusion]
> The recovery time significantly improved compared to head
> from 13 minutes to 30 seconds.
>
> Any thoughts?
> I'd really appreciate your comments/feedback about the patch and/or test.
>
>
Thanks for the time you spend on this test, it is a huge win!
Although creating 10k tables and deleting tuples is not a common use case, it is
still good to know how your patch performs.
I will try to look deeper in your patch, but my knowledge on postgres internal
are limited :)

--
Adrien



signature.asc (499 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

RE: [PATCH] Speedup truncates of relation forks

Jamison, Kirk
In reply to this post by Masahiko Sawada
On Tuesday, July 2, 2019 4:59 PM (GMT+9), Masahiko Sawada wrote:
> Thank you for updating the patch. Here is the review comments for v2 patch.

Thank you so much for review!
I indicated the addressed parts below and attached the updated patch.

---
visibilitymap.c: visibilitymap_truncate()

> I don't think we should describe about the previous behavior here.
> Rather we need to describe what visibilitymap_mark_truncate does and what
> it returns to the caller.
>
> I'm not sure that visibilitymap_mark_truncate function name is appropriate
> here since it actually truncate map bits, not only marking. Perhaps we can
> still use visibilitymap_truncate or change to
> visibilitymap_truncate_prepare, or something? Anyway, this function
> truncates only tail bits in the last remaining map page and we can have a
> rule that the caller must call smgrtruncate() later to actually truncate
> pages.
>
> The comment of second paragraph is now out of date since this function no
> longer sends smgr invalidation message.
(1) I updated function name to "visibilitymap_truncate_prepare()" as suggested.
I think that parameter name fits, unless other reviewers suggest a better name.
I also updated its description more accurately: describing current behavior,
caller must eventually call smgrtruncate() to actually truncate vm pages,
and removed the outdated description.


> Is it worth to leave the current visibilitymap_truncate function as a shortcut
> function, instead of replacing? That way we don't need to change
> pg_truncate_visibility_map function.

(2) Yeah, it's kinda displeasing that I had to add lines in pg_truncate_visibility_map.
By any chance, re: shortcut function, you meant to retain the function
visibilitymap_truncate() and just add another visibilitymap_truncate_prepare(),
isn't it? I'm not sure if it's worth the additional lines of adding
another function in visibilitymap.c, that's why I just updated the function itself
which just adds 10 lines to pg_truncate_visibility_map anyway.
Hmm. If it's not wrong to do it this way, then I will retain this change.
OTOH, if pg_visibility.c *must* not be modified, then I'll follow your advice.


----
pg_visibility.c: pg_truncate_visibility_map()

> @@ -383,6 +383,10 @@ pg_truncate_visibility_map(PG_FUNCTION_ARGS)
>  {
>         Oid                     relid = PG_GETARG_OID(0);
>         Relation        rel;
> +       ForkNumber      forks[MAX_FORKNUM];
> +       BlockNumber     blocks[MAX_FORKNUM];
> +       BlockNumber     newnblocks = InvalidBlockNumber;
> +       int             nforks = 0;
>
> Why do we need the array of forks and blocks here? I think it's enough to
> have one fork and one block number.
(3) Thanks for the catch. Updated.


----
freespace.c: MarkFreeSpaceMapTruncateRel()

> The same comments are true for MarkFreeSpaceMapTruncateRel.

> +       BlockNumber     new_nfsmblocks = InvalidBlockNumber;    /* FSM
> blocks */
> +       BlockNumber     newnblocks = InvalidBlockNumber;        /* VM
> blocks */
> +       int             nforks = 0;
>
> I think that we can have new_nfsmblocks and new_nvmblocks and wipe out the
> comments.

(4) I updated the description accordingly, describing only the current behavior.
The caller must eventually call smgrtruncate() to actually truncate fsm pages.
I also removed the outdated description and irrelevant comments.


------
storage.c: RelationTruncate()

> +        * We might as well update the local smgr_fsm_nblocks and
> smgr_vm_nblocks
> +        * setting. smgrtruncate sent an smgr cache inval message,
> which will cause
> +        * other backends to invalidate their copy of smgr_fsm_nblocks and
> +        * smgr_vm_nblocks, and this one too at the next command
> boundary. But this
> +        * ensures it isn't outright wrong until then.
> +        */
> +       if (rel->rd_smgr)
> +       {
> +               rel->rd_smgr->smgr_fsm_nblocks = new_nfsmblocks;
> +               rel->rd_smgr->smgr_vm_nblocks = newnblocks;
> +       }
>
> new_nfsmblocks and newnblocks could be InvalidBlockNumber when the forks are
> already enough small. So the above code sets incorrect values to
> smgr_{fsm,vm}_nblocks.
>
> Also, I wonder if we can do the above code in smgrtruncate. Otherwise we always
> need to set smgr_{fsm,vm}_nblocks after smgrtruncate, which is inconvenient.
(5)
In my patch, did you mean that there's a possibility that these values
were already set to InvalidBlockNumber even before I did the setting, is it?
I'm not sure if IIUC, the point of the above code is to make sure that
smgr_{fsm,vm}_nblocks are not InvalidBlockNumber until the next command
boundary, and while we don't reach that boundary yet, we make sure
these values are valid within that window. Is my understanding correct?
Maybe following your advice that putting it inside the smgrtruncate loop
will make these values correct.
For example, below?

void smgrtruncate
{
        ...
        CacheInvalidateSmgr(reln->smgr_rnode);

        /* Do the truncation */
        for (i = 0; i < nforks; i++)
        {
                smgrsw[reln->smgr_which].smgr_truncate(reln, forknum[i], nblocks[i]);

                if (forknum[i] == FSM_FORKNUM)
                        reln->smgr_fsm_nblocks = nblocks[i];
                if (forknum[i] == VISIBILITYMAP_FORKNUM)
                        reln->smgr_vm_nblocks = nblocks[i];
        }

Another problem I have is where I should call FreeSpaceMapVacuumRange to
account for truncation of fsm pages. I also realized the upper bound
new_nfsmblocks might be incorrect in this case.
This is the cause why regression test fails in my updated patch...
+ * Update upper-level FSM pages to account for the truncation.
+ * This is important because the just-truncated pages were likely
+ * marked as all-free, and would be preferentially selected.
+ */
+ FreeSpaceMapVacuumRange(rel->rd_smgr, new_nfsmblocks, InvalidBlockNumber);


-----------
storage.c: smgr_redo()

> +               /* Update the local smgr_fsm_nblocks and
> smgr_vm_nblocks setting */
> +               if (rel->rd_smgr)
> +               {
> +                       rel->rd_smgr->smgr_fsm_nblocks = new_nfsmblocks;
> +                       rel->rd_smgr->smgr_vm_nblocks = newnblocks;
> +               }
>
> The save as above. And we need to set smgr_{fsm,vm}_nblocks in spite of freeing
> the fake relcache soon?
(6) You're right. It's unnecessary in this case.
If I also put the smgr_{fsm,vm}_nblocks setting inside the smgrtruncate
as you suggested above, it will still be set after truncation? Hmm.
Perhaps it's ok, because in the current source code it also does the setting
whenever we call visibilitymap_truncate, FreeSpaceMapTruncateRel during redo.


-----------
bufmgr.c: DropRelFileNodeBuffers()

> +       /* Get the lower bound of target block number we're interested in
> */
> +       for (i = 0; i < nforks; i++)
> +       {
> +               if (!BlockNumberIsValid(minBlock) ||
> +                       minBlock > firstDelBlock[i])
> +                       minBlock = firstDelBlock[i];
> +       }
>
> Maybe we can declare 'i' in the for statement (i.e. for (int i = 0;
> ...)) at every outer loops in this functions. And in the inner loop we can
> use 'j'.
(7) Agree. Updated.

> I think it's better to declare *forkNum and nforks side by side for readability.
> That is, we can have it as follows.
>
> DropRelFileNodeBuffers (RelFileNodeBackend rnode, ForkNumber *forkNum, int
> nforks, BlockNumber *firstDelBlock)

(8) Agree. I updated DropRelFileNodeBuffers, smgrtruncate and
smgrdounlinkfork accordingly.

---------
smgr.c: smgrdounlinkfork()

> -smgrdounlinkfork(SMgrRelation reln, ForkNumber forknum, bool isRedo)
> +smgrdounlinkfork(SMgrRelation reln, ForkNumber *forknum, bool isRedo,
> int nforks)
>
> Same as above. The order of reln, *forknum, nforks, isRedo would be better.
>
> The comment of smgrdounlinkfork function needs to be updated. We now can
> remove multiple forks.

(9) Agree. Updated accordingly.


I updated the patch based from comments,
but it still fails the regression test as indicated in (5) above.
Kindly verify if I correctly addressed the other parts as what you intended.

Thanks again for the review!
I'll update the patch again after further comments.

Regards,
Kirk Jamison

v3-0001-Speedup-truncates-of-relation-forks.patch (31K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

RE: [PATCH] Speedup truncates of relation forks

Jamison, Kirk
Hi,

> I updated the patch based from comments, but it still fails the regression
> test as indicated in (5) above.
> Kindly verify if I correctly addressed the other parts as what you intended.
>
> Thanks again for the review!
> I'll update the patch again after further comments.

I updated the patch which is similar to V3 of the patch,
but addressing my problem in (5) in the previous email regarding FreeSpaceMapVacuumRange.
It seems to pass the regression test now. Kindly check for validation.
Thank you!

Regards,
Kirk Jamison

v4-0001-Speedup-truncates-of-relation-forks.patch (31K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Speedup truncates of relation forks

Thomas Munro-5
On Fri, Jul 5, 2019 at 3:03 PM Jamison, Kirk <[hidden email]> wrote:
> I updated the patch which is similar to V3 of the patch,
> but addressing my problem in (5) in the previous email regarding FreeSpaceMapVacuumRange.
> It seems to pass the regression test now. Kindly check for validation.

Hi Kirk,

FYI there are a couple of compiler errors reported:

Windows compiler:

contrib/pg_visibility/pg_visibility.c(400): error C2143: syntax error
: missing ')' before '{'
[C:\projects\postgresql\pg_visibility.vcxproj]

GCC:

storage.c: In function ‘RelationTruncate’:
storage.c:238:14: error: variable ‘newnblocks’ set but not used
[-Werror=unused-but-set-variable]
  BlockNumber newnblocks = InvalidBlockNumber;
              ^
storage.c:237:14: error: variable ‘new_nfsmblocks’ set but not used
[-Werror=unused-but-set-variable]
  BlockNumber new_nfsmblocks = InvalidBlockNumber;
              ^
storage.c: In function ‘smgr_redo’:
storage.c:634:15: error: variable ‘newnblocks’ set but not used
[-Werror=unused-but-set-variable]
   BlockNumber newnblocks = InvalidBlockNumber;
               ^
storage.c:633:15: error: variable ‘new_nfsmblocks’ set but not used
[-Werror=unused-but-set-variable]
   BlockNumber new_nfsmblocks = InvalidBlockNumber;
               ^

--
Thomas Munro
https://enterprisedb.com


12