Disable WAL logging to speed up data loading

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
80 messages Options
1234
Reply | Threaded
Open this post in threaded view
|

Disable WAL logging to speed up data loading

tsunakawa.takay@fujitsu.com
Hello,


We'd like to propose a feature to disable WAL to speed up data loading.  This was inspired by a feature added in the latest MySQL.  I wish you won't fear this feature...


BACKGROUND
========================================

This branches off from [1] as mentioned therein.  Briefly speaking, a customer wants to shorten the time for nightly loading of data into their data warehouse as much as possible to be prepared for using the data warehouse for new things.

Currently, they are using Oracle's SQL*Loader with its no-logging feature.  They want a similar feature to migrate to Postgres.  Other than the data loading performance, they don't want to be concerned about the storage for large volumes of WAL.

In [1], we thought about something like Oracle's per-table no-logging feature, but it seems difficult (or at least not easy.)  Meanwhile, I found another feature added in the latest MySQL 8.0.21 [2].  This proposal follows it almost directly.  That satisfies the customer request.

As an aside, it's also conceivable that in the near future, users could see the WAL bottleneck (WAL buffer or disk) when they utilize the parallel COPY that is being developed in the community.


FUNCTIONAL SPECIFICATION
========================================

Add a new value 'none' to the server configuration parameter wal_level.  With this setting:

* No WAL is emitted.

* The server refuses to start (pg_ctl start fails) after an abnormal shutdown due to power outage, pg_ctl's immediate shutdown, etc, showing a straightforward message like MySQL.

* Features like continuous archiving, pg_basebackup, and streaming/logical replication that requires wal_level >= replica are not available.

* The user can use all features again if you shut down the server successfully after data loading and reset wal_level to a value other than none.  He needs to take a base backup or rebuild the replication standby after restarting the server.


In addition to the cosmetic modifications to the manual articles that refer to wal_level, add a clause or paragraphs to the following sections to let users know the availability of this feature.

14.4. Populating a Database
18.6.1. Upgrading Data via pg_dumpall


PROGRAM DESIGN (main point only)
========================================

As in the bootstrap mode (during initdb), when wal_level = none, XLogInsert() does nothing and just returns a fixed value, which is the tail of the last shutdown checkpoint WAL record.  As a result, the value is set to the relation page header's LSN field.

In addition, it might be worth having XLogBeginInsert() and XLogRec...() to check wal_level and just return.  I don't expect much from this, but it may be interesting to give it a try to see the squeezed performance.

StartupXLOG() checks the wal_level setting in pg_control and quits the startup with ereport(FATAL) accordingly.


[1]
Implement UNLOGGED clause for COPY FROM
https://www.postgresql.org/message-id/OSBPR01MB488887C0BDC5129C65DFC5E5ED640@...

[2]
Disabling Redo Logging
https://dev.mysql.com/doc/refman/8.0/en/innodb-redo-log.html#innodb-disable-redo-logging


Regards
Takayuki Tsunakawa



Reply | Threaded
Open this post in threaded view
|

Re: Disable WAL logging to speed up data loading

amul sul
On Tue, Sep 29, 2020 at 1:58 PM [hidden email]
<[hidden email]> wrote:
>
> Hello,
>
>
> We'd like to propose a feature to disable WAL to speed up data loading.  This was inspired by a feature added in the latest MySQL.  I wish you won't fear this feature...
>

TWIMW, pg_bulkload contrib module[1], also does the same for the
faster data loading.

Regards,
Amul

1] http://ossc-db.github.io/pg_bulkload/pg_bulkload.html


Reply | Threaded
Open this post in threaded view
|

Re: Disable WAL logging to speed up data loading

Ashutosh Bapat-2
In reply to this post by tsunakawa.takay@fujitsu.com
Can they use a database with all unlogged tables?

On Tue, Sep 29, 2020 at 1:58 PM [hidden email]
<[hidden email]> wrote:

>
> Hello,
>
>
> We'd like to propose a feature to disable WAL to speed up data loading.  This was inspired by a feature added in the latest MySQL.  I wish you won't fear this feature...
>
>
> BACKGROUND
> ========================================
>
> This branches off from [1] as mentioned therein.  Briefly speaking, a customer wants to shorten the time for nightly loading of data into their data warehouse as much as possible to be prepared for using the data warehouse for new things.
>
> Currently, they are using Oracle's SQL*Loader with its no-logging feature.  They want a similar feature to migrate to Postgres.  Other than the data loading performance, they don't want to be concerned about the storage for large volumes of WAL.
>
> In [1], we thought about something like Oracle's per-table no-logging feature, but it seems difficult (or at least not easy.)  Meanwhile, I found another feature added in the latest MySQL 8.0.21 [2].  This proposal follows it almost directly.  That satisfies the customer request.
>
> As an aside, it's also conceivable that in the near future, users could see the WAL bottleneck (WAL buffer or disk) when they utilize the parallel COPY that is being developed in the community.
>
>
> FUNCTIONAL SPECIFICATION
> ========================================
>
> Add a new value 'none' to the server configuration parameter wal_level.  With this setting:
>
> * No WAL is emitted.
>
> * The server refuses to start (pg_ctl start fails) after an abnormal shutdown due to power outage, pg_ctl's immediate shutdown, etc, showing a straightforward message like MySQL.
>
> * Features like continuous archiving, pg_basebackup, and streaming/logical replication that requires wal_level >= replica are not available.
>
> * The user can use all features again if you shut down the server successfully after data loading and reset wal_level to a value other than none.  He needs to take a base backup or rebuild the replication standby after restarting the server.
>
>
> In addition to the cosmetic modifications to the manual articles that refer to wal_level, add a clause or paragraphs to the following sections to let users know the availability of this feature.
>
> 14.4. Populating a Database
> 18.6.1. Upgrading Data via pg_dumpall
>
>
> PROGRAM DESIGN (main point only)
> ========================================
>
> As in the bootstrap mode (during initdb), when wal_level = none, XLogInsert() does nothing and just returns a fixed value, which is the tail of the last shutdown checkpoint WAL record.  As a result, the value is set to the relation page header's LSN field.
>
> In addition, it might be worth having XLogBeginInsert() and XLogRec...() to check wal_level and just return.  I don't expect much from this, but it may be interesting to give it a try to see the squeezed performance.
>
> StartupXLOG() checks the wal_level setting in pg_control and quits the startup with ereport(FATAL) accordingly.
>
>
> [1]
> Implement UNLOGGED clause for COPY FROM
> https://www.postgresql.org/message-id/OSBPR01MB488887C0BDC5129C65DFC5E5ED640@...
>
> [2]
> Disabling Redo Logging
> https://dev.mysql.com/doc/refman/8.0/en/innodb-redo-log.html#innodb-disable-redo-logging
>
>
> Regards
> Takayuki Tsunakawa
>
>
>


--
Best Wishes,
Ashutosh Bapat


Reply | Threaded
Open this post in threaded view
|

RE: Disable WAL logging to speed up data loading

osumi.takamichi@fujitsu.com
In reply to this post by amul sul
Hi, Amul

> > We'd like to propose a feature to disable WAL to speed up data loading.  This
> was inspired by a feature added in the latest MySQL.  I wish you won't fear
> this feature...
> >
>
> TWIMW, pg_bulkload contrib module[1], also does the same for the faster data
> loading.

Both features are helpful to make the data loading faster,
but those are different.

There are at least two major merits as their differences to use wal_level='none'.
The first one happens when user upgrades major version by pg_dumpall.
Imagine a case that one user gets a logical backup of whole cluster by pg_dumpall.
The output file contains many COPY commands in it to recreate and upgrade the cluster.

Setting wal_level='none' can be easily used to boost the speed to remake the
cluster of a newer version by setting the wal_level.
OTOH, pg_bulkload can't treat this flow of upgrade in an easy way.
It requires to plan and write the detail of control files or commands for each table manually.
This wouldn't be easy for users.

Secondly,
to use pg_bulkload requires us to use a wrapper command of pg_ctl
like "pg_bulkload -r", which is prepared only for the feature.
This command is used when pg_bulkload is crashed.
The document recommends not to use pg_ctl directly in such a case.
Like this, paying attention to such usages or minor differences of usage is troublesome
while running a long operation of service, without support of the OSS community.

What did you think ?

Regards,
        Takamichi Osumi
Reply | Threaded
Open this post in threaded view
|

RE: Disable WAL logging to speed up data loading

osumi.takamichi@fujitsu.com
In reply to this post by Ashutosh Bapat-2
Hello, Ashutosh

> Can they use a database with all unlogged tables?
Unfortunately, no. They want to switch a cluster condition to "without WAL logging"
only when they execute night bulk loading for their data warehouse.
In other words, they would like to keep their other usual operations with WAL.
In addition, using all tables as unlogged brings about
the risk to lose data warehouse's data caused by an unexpected server crash or power outage.

Regards,
        Takamichi Osumi
Reply | Threaded
Open this post in threaded view
|

Re: Disable WAL logging to speed up data loading

Fujii Masao-4


On 2020/09/30 12:10, [hidden email] wrote:
> Hello, Ashutosh
>
>> Can they use a database with all unlogged tables?
> Unfortunately, no. They want to switch a cluster condition to "without WAL logging"
> only when they execute night bulk loading for their data warehouse.
> In other words, they would like to keep their other usual operations with WAL.
> In addition, using all tables as unlogged brings about
> the risk to lose data warehouse's data caused by an unexpected server crash or power outage.

But the same issue can happen even in the proposed approach
because Tsunakawa-san explains as follows?

> The server refuses to start (pg_ctl start fails) after an abnormal shutdown due to power outage, pg_ctl's immediate shutdown, etc, showing a straightforward message like MySQL.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION


Reply | Threaded
Open this post in threaded view
|

Re: Disable WAL logging to speed up data loading

Kyotaro Horiguchi-4
At Thu, 1 Oct 2020 10:01:56 +0900, Fujii Masao <[hidden email]> wrote in

>
>
> On 2020/09/30 12:10, [hidden email] wrote:
> > Hello, Ashutosh
> >
> >> Can they use a database with all unlogged tables?
> > Unfortunately, no. They want to switch a cluster condition to "without
> > WAL logging"
> > only when they execute night bulk loading for their data warehouse.
> > In other words, they would like to keep their other usual operations
> > with WAL.
> > In addition, using all tables as unlogged brings about
> > the risk to lose data warehouse's data caused by an unexpected server
> > crash or power outage.
>
> But the same issue can happen even in the proposed approach
> because Tsunakawa-san explains as follows?
>
> > The server refuses to start (pg_ctl start fails) after an abnormal
> > shutdown due to power outage, pg_ctl's immediate shutdown, etc,
> > showing a straightforward message like MySQL.

Exactly. Crash safety cannot be a rationale for this feature.

The proposed feature seems to give an advantage to a massive loading
onto a already very large table.  Currently turning a large logged
table into an unlogged one runs a table copy that takes a long
time. After the bulk-loading, to put back the table huge amount of WAL
is emitted and a table-copy takes a long time.  If we had the
wal_level=none feature, the preparation time is not needed and the
put-back time can be offloaded to online backing-up.  The crash-unsafe
period is inevitable but would be shorten by using storage snapshots.
On the other hand, a massive loading to an empty table is efficiently
performed by wal_level=minimal.

As I mentioned in another thread, if ALTER TABLE LOGGED/UNLOGGED
doesn't need a table copy (consequently no insertion WAL are emitted),
it would work. I'm not which is better, or easier to realize.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


Reply | Threaded
Open this post in threaded view
|

RE: Disable WAL logging to speed up data loading

osumi.takamichi@fujitsu.com
In reply to this post by Fujii Masao-4
Hello.

> >> Can they use a database with all unlogged tables?
> > Unfortunately, no. They want to switch a cluster condition to "without WAL
> logging"
> > only when they execute night bulk loading for their data warehouse.
> > In other words, they would like to keep their other usual operations with WAL.
> > In addition, using all tables as unlogged brings about the risk to
> > lose data warehouse's data caused by an unexpected server crash or power
> outage.
>
> But the same issue can happen even in the proposed approach because
> Tsunakawa-san explains as follows?
Sorry, my last expression about "to lose data" was not precise.

When unlogged tables are used and the server is crashed unexpectedly,
all data after the last backup is truncated without WAL, right ?
In this case, sequential commands for unlogged tables disappear.
On the other hand, wal_level='none' can be changed to 'minimal' for example.
Therefore, such other sequential operations after the last backup
could be stored in WAL file, when a user applies such a change of wal_level.
I meant this is helpful but didn't this make sense ?

> > The server refuses to start (pg_ctl start fails) after an abnormal shutdown due
> to power outage, pg_ctl's immediate shutdown, etc, showing a straightforward
> message like MySQL.
This is the behavior during wal_level='none'.

> * The user can use all features again if you shut down the server successfully
> after data loading and reset wal_level to a value other than none.  He needs to
> take a base backup or rebuild the replication standby after restarting the server.
And, we expect we can change the wal_level even after using wal_level='none'
and can use other features again like I mentioned above. What did you think ?

Regards,
        Takamichi Osumi
Reply | Threaded
Open this post in threaded view
|

Re: Disable WAL logging to speed up data loading

Fujii Masao-4


On 2020/10/01 12:04, [hidden email] wrote:

> Hello.
>
>>>> Can they use a database with all unlogged tables?
>>> Unfortunately, no. They want to switch a cluster condition to "without WAL
>> logging"
>>> only when they execute night bulk loading for their data warehouse.
>>> In other words, they would like to keep their other usual operations with WAL.
>>> In addition, using all tables as unlogged brings about the risk to
>>> lose data warehouse's data caused by an unexpected server crash or power
>> outage.
>>
>> But the same issue can happen even in the proposed approach because
>> Tsunakawa-san explains as follows?
> Sorry, my last expression about "to lose data" was not precise.
>
> When unlogged tables are used and the server is crashed unexpectedly,
> all data after the last backup is truncated without WAL, right ?
> In this case, sequential commands for unlogged tables disappear.

To avoid this issue we can mark the table as logged. You may be worried about
that ALTER TABLE SET LOGGED/UNLOGGED would take a long time because
the table needs to be rewriitten. One idea for that is to improve that command
so that it skips the table rewrite if wal_level=minimal.
Of course, also you can change wal_level after marking the table as unlogged.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION


Reply | Threaded
Open this post in threaded view
|

Re: Disable WAL logging to speed up data loading

Kyotaro Horiguchi-4
At Thu, 1 Oct 2020 12:58:19 +0900, Fujii Masao <[hidden email]> wrote in
> the table needs to be rewriitten. One idea for that is to improve that
> command
> so that it skips the table rewrite if wal_level=minimal.
> Of course, also you can change wal_level after marking the table as
> unlogged.

tablecmd.c:

> * There are two reasons for requiring a rewrite when changing
> * persistence: on one hand, we need to ensure that the buffers
> * belonging to each of the two relations are marked with or without
> * BM_PERMANENT properly.  On the other hand, since rewriting creates
> * and assigns a new relfilenode, we automatically create or drop an
> * init fork for the relation as appropriate.

According to this comment, perhaps we can do that at least for
wal_level=minimal.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


Reply | Threaded
Open this post in threaded view
|

RE: Disable WAL logging to speed up data loading

osumi.takamichi@fujitsu.com
Hi, Horiguchi-San and Fujii-San.


Thank you so much both of you.
> > the table needs to be rewriitten. One idea for that is to improve that
> > command so that it skips the table rewrite if wal_level=minimal.
> > Of course, also you can change wal_level after marking the table as
> > unlogged.
>
> tablecmd.c:
The idea is really interesting.
I didn't come up with getting rid of the whole copy of
the ALTER TABLE UNLOGGED/LOGGED commands
only when wal_level='minimal'.

> > * There are two reasons for requiring a rewrite when changing
> > * persistence: on one hand, we need to ensure that the buffers
> > * belonging to each of the two relations are marked with or without
> > * BM_PERMANENT properly.  On the other hand, since rewriting creates
> > * and assigns a new relfilenode, we automatically create or drop an
> > * init fork for the relation as appropriate.
Thanks for sharing concrete comments in the source code.

> According to this comment, perhaps we can do that at least for
> wal_level=minimal.
When I compare the 2 ideas,
one of the benefits of this ALTER TABLE 's improvement
is that we can't avoid the downtime
while that of wal_level='none' provides an easy and faster
major version up via output file of pg_dumpall.
Both ideas have good points.
However, actually to modify ALTER TABLE's copy
looks far more difficult than wal_level='none' and
beyond my current ability.
So, I'd like to go forward with the direction of wal_level='none'.
Did you have strong objections for this direction ?


Regards,
        Takamichi Osumi


Reply | Threaded
Open this post in threaded view
|

Re: Disable WAL logging to speed up data loading

Kyotaro Horiguchi-4
At Thu, 1 Oct 2020 08:14:42 +0000, "[hidden email]" <[hidden email]> wrote in

> Hi, Horiguchi-San and Fujii-San.
>
>
> Thank you so much both of you.
> > > the table needs to be rewriitten. One idea for that is to improve that
> > > command so that it skips the table rewrite if wal_level=minimal.
> > > Of course, also you can change wal_level after marking the table as
> > > unlogged.
> >
> > tablecmd.c:
> The idea is really interesting.
> I didn't come up with getting rid of the whole copy of
> the ALTER TABLE UNLOGGED/LOGGED commands
> only when wal_level='minimal'.
>
> > > * There are two reasons for requiring a rewrite when changing
> > > * persistence: on one hand, we need to ensure that the buffers
> > > * belonging to each of the two relations are marked with or without
> > > * BM_PERMANENT properly.  On the other hand, since rewriting creates
> > > * and assigns a new relfilenode, we automatically create or drop an
> > > * init fork for the relation as appropriate.
> Thanks for sharing concrete comments in the source code.
>
> > According to this comment, perhaps we can do that at least for
> > wal_level=minimal.
> When I compare the 2 ideas,
> one of the benefits of this ALTER TABLE 's improvement
> is that we can't avoid the downtime
> while that of wal_level='none' provides an easy and faster
> major version up via output file of pg_dumpall.
The speedup has already been achieved with higher durability by
wal_level=minimal in that case.  Or maybe you should consider using
pg_upgrade instead.  Even inducing the time to take a backup copy of
the whole cluster, running pg_upgrade would be far faster than
pg_dumpall then loading.

> Both ideas have good points.
> However, actually to modify ALTER TABLE's copy
> looks far more difficult than wal_level='none' and
> beyond my current ability.
> So, I'd like to go forward with the direction of wal_level='none'.
> Did you have strong objections for this direction ?

For fuel(?) of the discussion, I tried a very-quick PoC for in-place
ALTER TABLE SET LOGGED/UNLOGGED and resulted as attached. After some
trials of several ways, I drifted to the following way after poking
several ways.

1. Flip BM_PERMANENT of active buffers
2. adding/removing init fork
3. sync files,
4. Flip pg_class.relpersistence.

It always skips table copy in the SET UNLOGGED case, and only when
wal_level=minimal in the SET LOGGED case.  Crash recovery seems
working by some brief testing by hand.

Of course, I haven't performed intensive test on it.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index dcaea7135f..d8a1c2a7e7 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -613,6 +613,26 @@ heapam_relation_set_new_filenode(Relation rel,
  smgrclose(srel);
 }
 
+static void
+heapam_relation_set_persistence(Relation rel, char persistence)
+{
+ Assert(rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT ||
+   rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED);
+
+ Assert (rel->rd_rel->relpersistence != persistence);
+
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ {
+ Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
+   rel->rd_rel->relkind == RELKIND_MATVIEW ||
+   rel->rd_rel->relkind == RELKIND_TOASTVALUE);
+ RelationCreateInitFork(rel->rd_node);
+ }
+ else
+ RelationDropInitFork(rel->rd_node);
+}
+
+
 static void
 heapam_relation_nontransactional_truncate(Relation rel)
 {
@@ -2540,6 +2560,7 @@ static const TableAmRoutine heapam_methods = {
  .compute_xid_horizon_for_tuples = heap_compute_xid_horizon_for_tuples,
 
  .relation_set_new_filenode = heapam_relation_set_new_filenode,
+ .relation_set_persistence = heapam_relation_set_persistence,
  .relation_nontransactional_truncate = heapam_relation_nontransactional_truncate,
  .relation_copy_data = heapam_relation_copy_data,
  .relation_copy_for_cluster = heapam_relation_copy_for_cluster,
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index a7c0cb1bc3..8397002613 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,14 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
  xlrec->blkno, xlrec->flags);
  pfree(path);
  }
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+ char   *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+ appendStringInfoString(buf, path);
+ pfree(path);
+ }
 }
 
 const char *
@@ -55,6 +63,9 @@ smgr_identify(uint8 info)
  case XLOG_SMGR_TRUNCATE:
  id = "TRUNCATE";
  break;
+ case XLOG_SMGR_UNLINK:
+ id = "UNLINK";
+ break;
  }
 
  return id;
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index dbbd3aa31f..7f695e6d8b 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -60,6 +60,8 @@ int wal_skip_threshold = 2048; /* in kilobytes */
 typedef struct PendingRelDelete
 {
  RelFileNode relnode; /* relation that may need to be deleted */
+ bool deleteinitfork; /* delete only init fork if true */
+ bool createinitfork; /* create init fork if true */
  BackendId backend; /* InvalidBackendId if not a temp rel */
  bool atCommit; /* T=delete at commit; F=delete at abort */
  int nestLevel; /* xact nesting level of request */
@@ -153,6 +155,8 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
  pending = (PendingRelDelete *)
  MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
  pending->relnode = rnode;
+ pending->deleteinitfork = false;
+ pending->createinitfork = false;
  pending->backend = backend;
  pending->atCommit = false; /* delete if abort */
  pending->nestLevel = GetCurrentTransactionNestLevel();
@@ -168,6 +172,53 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
  return srel;
 }
 
+void
+RelationCreateInitFork(RelFileNode rnode)
+{
+ PendingRelDelete *pending;
+ SMgrRelation srel;
+
+ srel = smgropen(rnode, InvalidBackendId);
+ smgrcreate(srel, INIT_FORKNUM, false);
+ log_smgrcreate(&rnode, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+
+ /* Add the relation to the list of stuff to delete at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->deleteinitfork = true;
+ pending->createinitfork = false;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false; /* delete if abort */
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+}
+
+void
+RelationDropInitFork(RelFileNode rnode)
+{
+ PendingRelDelete *pending;
+ SMgrRelation srel;
+
+ srel = smgropen(rnode, InvalidBackendId);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ log_smgrunlink(&rnode, INIT_FORKNUM);
+
+ /* Add the relation to the list of stuff to delete at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->deleteinitfork = false;
+ pending->createinitfork = true;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false; /* create if abort */
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+}
+
 /*
  * Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
  */
@@ -187,6 +238,25 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
  XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
 }
 
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum)
+{
+ xl_smgr_unlink xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file unlink.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
 /*
  * RelationDropStorage
  * Schedule unlinking of physical storage at transaction commit.
@@ -200,6 +270,8 @@ RelationDropStorage(Relation rel)
  pending = (PendingRelDelete *)
  MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
  pending->relnode = rel->rd_node;
+ pending->createinitfork = false;
+ pending->deleteinitfork = false;
  pending->backend = rel->rd_backend;
  pending->atCommit = true; /* delete if commit */
  pending->nestLevel = GetCurrentTransactionNestLevel();
@@ -625,19 +697,34 @@ smgrDoPendingDeletes(bool isCommit)
 
  srel = smgropen(pending->relnode, pending->backend);
 
- /* allocate the initial array, or extend it, if needed */
- if (maxrels == 0)
+ /* XXXX */
+ if (pending->createinitfork)
  {
- maxrels = 8;
- srels = palloc(sizeof(SMgrRelation) * maxrels);
+ smgrcreate(srel, INIT_FORKNUM, false);
+ log_smgrcreate(&pending->relnode, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
  }
- else if (maxrels <= nrels)
+ else if (pending->deleteinitfork)
  {
- maxrels *= 2;
- srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ log_smgrunlink(&pending->relnode, INIT_FORKNUM);
  }
+ else
+ {
+ /* allocate the initial array, or extend it, if needed */
+ if (maxrels == 0)
+ {
+ maxrels = 8;
+ srels = palloc(sizeof(SMgrRelation) * maxrels);
+ }
+ else if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
 
- srels[nrels++] = srel;
+ srels[nrels++] = srel;
+ }
  }
  /* must explicitly free the list entry */
  pfree(pending);
@@ -916,6 +1003,14 @@ smgr_redo(XLogReaderState *record)
  reln = smgropen(xlrec->rnode, InvalidBackendId);
  smgrcreate(reln, xlrec->forkNum, true);
  }
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ smgrunlink(reln, xlrec->forkNum, true);
+ }
  else if (info == XLOG_SMGR_TRUNCATE)
  {
  xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index e0ac4e05e5..fdcdb1dbb1 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4897,6 +4897,89 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
  return newcmd;
 }
 
+static bool
+try_inplace_persistence_change(AlteredTableInfo *tab, char persistence,
+   LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ int i;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * Under the following condition, we need to call ATRewriteTable, but
+ * cannot be false in the AT_REWRITE_ALTER_PERSISTENCE case.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+   tab->newvals == NULL && !tab->verify_new_notnull);
+
+ /*
+ * In the case of SET ULOGGED, the resulting relation does not need WALs,
+ * so the storage can be used as-is. In the opposite case, the table is
+ * required to be restored from WAL, so rewrite the relation.  However,
+ * when wal_level = minimal, we can omit WALs by immediately syncing the
+ * storage.
+ */
+ if (tab->newrelpersistence == RELPERSISTENCE_PERMANENT || XLogIsNeeded())
+ return false;
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform im-place persistnce change");
+
+ RelationOpenSmgr(rel);
+ SetRelFileNodeBuffersPersistence(rel->rd_smgr->smgr_rnode,
+ persistence == RELPERSISTENCE_PERMANENT);
+ table_relation_set_persistence(rel, persistence);
+
+ /*
+ * This relation is now WAL-logged. Sync all files immediately to establish
+ * the initial state.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT)
+ {
+ for (i = 0 ; i < MAX_FORKNUM ; i++)
+ {
+ if (smgrexists(rel->rd_smgr, i))
+ smgrimmedsync(rel->rd_smgr, i);
+ }
+ }
+
+ table_close(rel, lockmode);
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+ tuple = SearchSysCacheCopy1(RELOID,
+ ObjectIdGetDatum(RelationGetRelid(rel)));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u",
+ RelationGetRelid(rel));
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+ table_close(classRel, RowExclusiveLock);
+
+ return true;
+}
+
 /*
  * ATRewriteTables: ALTER TABLE phase 3
  */
@@ -5017,45 +5100,51 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
  tab->relid,
  tab->rewrite);
 
- /*
- * Create transient table that will receive the modified data.
- *
- * Ensure it is marked correctly as logged or unlogged.  We have
- * to do this here so that buffers for the new relfilenode will
- * have the right persistence set, and at the same time ensure
- * that the original filenode's buffers will get read in with the
- * correct setting (i.e. the original one).  Otherwise a rollback
- * after the rewrite would possibly result with buffers for the
- * original filenode having the wrong persistence setting.
- *
- * NB: This relies on swap_relation_files() also swapping the
- * persistence. That wouldn't work for pg_class, but that can't be
- * unlogged anyway.
- */
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, persistence,
-   lockmode);
+ if (tab->rewrite != AT_REWRITE_ALTER_PERSISTENCE ||
+ !try_inplace_persistence_change(tab, persistence, lockmode))
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged.  We
+ * have to do this here so that buffers for the new relfilenode
+ * will have the right persistence set, and at the same time
+ * ensure that the original filenode's buffers will get read in
+ * with the correct setting (i.e. the original one).  Otherwise
+ * a rollback after the rewrite would possibly result with
+ * buffers for the original filenode having the wrong
+ * persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that can't
+ * be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, persistence,
+   lockmode);
 
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
 
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap.  We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table.  Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
+ /*
+ * Swap the physical files of the old and new heaps, then
+ * rebuild indexes and discard the old heap.  We can use
+ * RecentXmin for the table's new relfrozenxid because we
+ * rewrote all the tuples in ATRewriteTable, so no older Xid
+ * remains in the table.  Also, we never try to swap toast
+ * tables by content, since we have no interest in letting this
+ * code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
+ }
  }
  else
  {
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e549fa1d30..7fd6cae12f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3031,6 +3031,40 @@ DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
  }
 }
 
+void
+SetRelFileNodeBuffersPersistence(RelFileNodeBackend rnode, bool permanent)
+{
+ int i;
+
+ Assert (!RelFileNodeBackendIsTemp(rnode));
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ continue;
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ {
+ if (permanent)
+ {
+ Assert ((buf_state &  BM_PERMANENT) == 0);
+ buf_state |= BM_PERMANENT;
+ }
+ else
+ {
+ Assert ((buf_state &  BM_PERMANENT) != 0);
+ buf_state &= ~BM_PERMANENT;
+ }
+ }
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+}
+
 /* ---------------------------------------------------------------------
  * DropRelFileNodesAllBuffers
  *
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index dcc09df0c7..5eb9e97b3d 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -645,6 +645,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
  smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
 }
 
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo);
+}
+
 /*
  * AtEOXact_SMgr
  *
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 387eb34a61..245a8f0fdd 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -451,6 +451,8 @@ typedef struct TableAmRoutine
   TransactionId *freezeXid,
   MultiXactId *minmulti);
 
+ void (*relation_set_persistence) (Relation rel, char persistence);
+
  /*
  * This callback needs to remove all contents from `rel`'s current
  * relfilenode. No provisions for transactional behaviour need to be made.
@@ -1404,6 +1406,12 @@ table_relation_set_new_filenode(Relation rel,
    freezeXid, minmulti);
 }
 
+static inline void
+table_relation_set_persistence(Relation rel, char persistence)
+{
+ rel->rd_tableam->relation_set_persistence(rel, persistence);
+}
+
 /*
  * Remove all table contents from `rel`, in a non-transactional manner.
  * Non-transactional meaning that there's no need to support rollbacks. This
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 30c38e0ca6..f4fd61f639 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -23,6 +23,8 @@
 extern int wal_skip_threshold;
 
 extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
+extern void RelationCreateInitFork(RelFileNode rel);
+extern void RelationDropInitFork(RelFileNode rel);
 extern void RelationDropStorage(Relation rel);
 extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
 extern void RelationPreTruncate(Relation rel);
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index 7b21cab2e0..73ad2ae89e 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -29,6 +29,7 @@
 /* XLOG gives us high 4 bits */
 #define XLOG_SMGR_CREATE 0x10
 #define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_UNLINK 0x30
 
 typedef struct xl_smgr_create
 {
@@ -36,6 +37,12 @@ typedef struct xl_smgr_create
  ForkNumber forkNum;
 } xl_smgr_create;
 
+typedef struct xl_smgr_unlink
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+} xl_smgr_unlink;
+
 /* flags for xl_smgr_truncate */
 #define SMGR_TRUNCATE_HEAP 0x0001
 #define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +58,7 @@ typedef struct xl_smgr_truncate
 } xl_smgr_truncate;
 
 extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum);
 
 extern void smgr_redo(XLogReaderState *record);
 extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ee91b8fa26..98967eeff9 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -205,6 +205,8 @@ extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels)
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
    int nforks, BlockNumber *firstDelBlock);
+extern void SetRelFileNodeBuffersPersistence(RelFileNodeBackend rnode,
+ bool permanent);
 extern void DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes);
 extern void DropDatabaseBuffers(Oid dbid);
 
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index f28a842401..5d74631006 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -86,6 +86,7 @@ extern void smgrclose(SMgrRelation reln);
 extern void smgrcloseall(void);
 extern void smgrclosenode(RelFileNodeBackend rnode);
 extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern void smgrdosyncall(SMgrRelation *rels, int nrels);
 extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
 extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
Reply | Threaded
Open this post in threaded view
|

Re: Disable WAL logging to speed up data loading

Fujii Masao-4


On 2020/10/02 10:06, Kyotaro Horiguchi wrote:

> At Thu, 1 Oct 2020 08:14:42 +0000, "[hidden email]" <[hidden email]> wrote in
>> Hi, Horiguchi-San and Fujii-San.
>>
>>
>> Thank you so much both of you.
>>>> the table needs to be rewriitten. One idea for that is to improve that
>>>> command so that it skips the table rewrite if wal_level=minimal.
>>>> Of course, also you can change wal_level after marking the table as
>>>> unlogged.
>>>
>>> tablecmd.c:
>> The idea is really interesting.
>> I didn't come up with getting rid of the whole copy of
>> the ALTER TABLE UNLOGGED/LOGGED commands
>> only when wal_level='minimal'.
>>
>>>> * There are two reasons for requiring a rewrite when changing
>>>> * persistence: on one hand, we need to ensure that the buffers
>>>> * belonging to each of the two relations are marked with or without
>>>> * BM_PERMANENT properly.  On the other hand, since rewriting creates
>>>> * and assigns a new relfilenode, we automatically create or drop an
>>>> * init fork for the relation as appropriate.
>> Thanks for sharing concrete comments in the source code.
>>
>>> According to this comment, perhaps we can do that at least for
>>> wal_level=minimal.
>> When I compare the 2 ideas,
>> one of the benefits of this ALTER TABLE 's improvement
>> is that we can't avoid the downtime
>> while that of wal_level='none' provides an easy and faster
>> major version up via output file of pg_dumpall.
>
> The speedup has already been achieved with higher durability by
> wal_level=minimal in that case.

I was thinking the same, i.e., wal_level=minimal + wal_skip_threshold would
speed up that initial data loading.


>  Or maybe you should consider using
> pg_upgrade instead.  Even inducing the time to take a backup copy of
> the whole cluster, running pg_upgrade would be far faster than
> pg_dumpall then loading.
>
>> Both ideas have good points.
>> However, actually to modify ALTER TABLE's copy
>> looks far more difficult than wal_level='none' and
>> beyond my current ability.
>> So, I'd like to go forward with the direction of wal_level='none'.
>> Did you have strong objections for this direction ?

No, I have no strong objection against your trial. But I was thinking
that it's not so easy to design and implement wal_level=none.
For example, there are some functions and commands depending on
the existence of WAL, like pg_switch_wal(), PREPARE TRANSACTION
and COMMIT PREPARED. Probably you need to define how they should
work in wal_level=none, e.g., emit an error.


> For fuel(?) of the discussion, I tried a very-quick PoC for in-place
> ALTER TABLE SET LOGGED/UNLOGGED and resulted as attached. After some
> trials of several ways, I drifted to the following way after poking
> several ways.

Nice!


> 1. Flip BM_PERMANENT of active buffers
> 2. adding/removing init fork
> 3. sync files,
> 4. Flip pg_class.relpersistence.
>
> It always skips table copy in the SET UNLOGGED case,

Even in wal_level != minimal?
What happens in the standby side when SET UNLOGGED is executed without
the table rewrite in the primary? The table data should be truncated
in the standby?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION


Reply | Threaded
Open this post in threaded view
|

Re: Disable WAL logging to speed up data loading

Kyotaro Horiguchi-4
At Fri, 2 Oct 2020 10:56:21 +0900, Fujii Masao <[hidden email]> wrote in

>
> On 2020/10/02 10:06, Kyotaro Horiguchi wrote:
> > At Thu, 1 Oct 2020 08:14:42 +0000, "[hidden email]"
> > <[hidden email]> wrote in
> >> When I compare the 2 ideas,
> >> one of the benefits of this ALTER TABLE 's improvement
> >> is that we can't avoid the downtime
> >> while that of wal_level='none' provides an easy and faster
> >> major version up via output file of pg_dumpall.
> > The speedup has already been achieved with higher durability by
> > wal_level=minimal in that case.
>
> I was thinking the same, i.e., wal_level=minimal + wal_skip_threshold
> would
> speed up that initial data loading.

Yeah.

> >  Or maybe you should consider using
> > pg_upgrade instead.  Even inducing the time to take a backup copy of
> > the whole cluster, running pg_upgrade would be far faster than
> > pg_dumpall then loading.
> >
> >> Both ideas have good points.
> >> However, actually to modify ALTER TABLE's copy
> >> looks far more difficult than wal_level='none' and
> >> beyond my current ability.
> >> So, I'd like to go forward with the direction of wal_level='none'.
> >> Did you have strong objections for this direction ?
>
> No, I have no strong objection against your trial. But I was thinking
> that it's not so easy to design and implement wal_level=none.
> For example, there are some functions and commands depending on
> the existence of WAL, like pg_switch_wal(), PREPARE TRANSACTION
> and COMMIT PREPARED. Probably you need to define how they should
> work in wal_level=none, e.g., emit an error.
>
>
> > For fuel(?) of the discussion, I tried a very-quick PoC for in-place
> > ALTER TABLE SET LOGGED/UNLOGGED and resulted as attached. After some
> > trials of several ways, I drifted to the following way after poking
> > several ways.
>
> Nice!
>
>
> > 1. Flip BM_PERMANENT of active buffers
> > 2. adding/removing init fork
> > 3. sync files,
> > 4. Flip pg_class.relpersistence.
> > It always skips table copy in the SET UNLOGGED case,
>
> Even in wal_level != minimal?
> What happens in the standby side when SET UNLOGGED is executed without
> the table rewrite in the primary? The table data should be truncated
> in the standby?

A table turned into unlogged on the primary is also turned into
unlogged on the standby and it is inaccessible on the standby. Maybe
the storage is dropped on both patched and unpatched versoins.

After the table is again turned into logged, the content is
transferred via WAL records generated from the insertions into the new
storage and it rebuilds the same storage on the standby on both
patched and unpatched.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


Reply | Threaded
Open this post in threaded view
|

Re: Disable WAL logging to speed up data loading

Kyotaro Horiguchi-4
Sorry for the slippery brain...

At Fri, 02 Oct 2020 13:38:22 +0900 (JST), Kyotaro Horiguchi <[hidden email]> wrote in

> At Fri, 2 Oct 2020 10:56:21 +0900, Fujii Masao <[hidden email]> wrote in
> >
> > On 2020/10/02 10:06, Kyotaro Horiguchi wrote:
> > > 1. Flip BM_PERMANENT of active buffers
> > > 2. adding/removing init fork
> > > 3. sync files,
> > > 4. Flip pg_class.relpersistence.
> > > It always skips table copy in the SET UNLOGGED case,
> >
> > Even in wal_level != minimal?
> > What happens in the standby side when SET UNLOGGED is executed without
> > the table rewrite in the primary? The table data should be truncated
> > in the standby?
>
> A table turned into unlogged on the primary is also turned into
> unlogged on the standby and it is inaccessible on the standby.

> Maybe the storage is dropped on both patched and unpatched versoins.

Maybe the storage dropped on unpatched and left alone on patched.

> After the table is again turned into logged, the content is
> transferred via WAL records generated from the insertions into the new
> storage and it rebuilds the same storage on the standby on both
> patched and unpatched.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


Reply | Threaded
Open this post in threaded view
|

RE: Disable WAL logging to speed up data loading

tsunakawa.takay@fujitsu.com
In reply to this post by Fujii Masao-4
From: Fujii Masao <[hidden email]>
> > The speedup has already been achieved with higher durability by
> > wal_level=minimal in that case.
>
> I was thinking the same, i.e., wal_level=minimal + wal_skip_threshold would
> speed up that initial data loading.

First of all, thank you Horiguchi-san for trying to improve ALTER TABLE SET UNLOGGED/LOGGED.  That should also be appealing.

At the same time, as I said before, both features have good points.  TBH, as a user, I'm kind of attracted by MySQL's approach because of its simplicity for users (although DBMS developers may be worried about this and that.)  What tempts me is that I can just switch on the feature with a single configuration parameter, and continue to use existing SQL scripts and other data integration software without knowing what tables those load data into.  In the same context, I don't have to add or delete ALTER TABLE statements when I have to change the set of tables to be loaded.  For the same reason, I'm also interested in Oracle's another feature ALTER TABLESPACE LOGGING/NOLOGGING.

BTW, does ALTER TABLE LOGGED/UNLOGGED on a partitioned table get the change to its all partitions?  It would be a bit tedious to add/delete ALTER TABLE LOGGED/UNLOGGED when I add/drop a partition.

Regarding data migration, data movement is not limited only to major upgrades.  It will be convenient to speed up the migration of the entire database cluster into a new instance for testing and new deployment.  (I'm not sure about recent pg_upgrade, but pg_upgrade sometimes cannot upgrade too older versions.)

To conclude, I hope both features will be realized, and wish we won't fall in a situation where the words fly such as "Mine is enough. Yours is risky and not necessary."

With that said, I think we may as well separate the thread some time later for CF entry.  Otherwise, we will have trouble in finding the latest patch from the CF entry.


> No, I have no strong objection against your trial. But I was thinking
> that it's not so easy to design and implement wal_level=none.
> For example, there are some functions and commands depending on
> the existence of WAL, like pg_switch_wal(), PREPARE TRANSACTION
> and COMMIT PREPARED. Probably you need to define how they should
> work in wal_level=none, e.g., emit an error.

        Yeah, we thought pg_switch_wal() may need some treatment.  We'll check PREPARE and COMMIT PREPARED as well.  I'd appreciate it if you share what you notice at any time.  It is possible that we should emit WAL records of some resource managers, like the bootstrap mode emits WAL only for RM_XLOG_ID.


Regards
Takayuki Tsunakawa





Reply | Threaded
Open this post in threaded view
|

Re: Disable WAL logging to speed up data loading

Kyotaro Horiguchi-4
In reply to this post by Kyotaro Horiguchi-4
At Fri, 02 Oct 2020 13:51:35 +0900 (JST), Kyotaro Horiguchi <[hidden email]> wrote in
> Sorry for the slippery brain...  ^2

> At Fri, 02 Oct 2020 13:38:22 +0900 (JST), Kyotaro Horiguchi <[hidden email]> wrote in
> > At Fri, 2 Oct 2020 10:56:21 +0900, Fujii Masao <[hidden email]> wrote in
> > >
> > > On 2020/10/02 10:06, Kyotaro Horiguchi wrote:
> > > > 1. Flip BM_PERMANENT of active buffers
> > > > 2. adding/removing init fork
> > > > 3. sync files,
> > > > 4. Flip pg_class.relpersistence.
> > > > It always skips table copy in the SET UNLOGGED case,
> > >
> > > Even in wal_level != minimal?
> > > What happens in the standby side when SET UNLOGGED is executed without
> > > the table rewrite in the primary? The table data should be truncated
> > > in the standby?
> >
> > A table turned into unlogged on the primary is also turned into
> > unlogged on the standby and it is inaccessible on the standby.

> Maybe the storage dropped on unpatched

Maybe the old storage is replaced with an empty stroage on unpatched.

> and left alone on patched.

> > After the table is again turned into logged, the content is
> > transferred via WAL records generated from the insertions into the new
> > storage and it rebuilds the same storage on the standby on both
> > patched and unpatched.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


Reply | Threaded
Open this post in threaded view
|

RE: Disable WAL logging to speed up data loading

osumi.takamichi@fujitsu.com
In reply to this post by tsunakawa.takay@fujitsu.com
Hi,


I wrote and attached the first patch to disable WAL logging.
This patch passes the regression test of check-world already
and is formatted by pgindent.

Also, I conducted 3 types of performance tests to
clarify the rough benefits of the patch.

I compared two wal_levels both 'minimal' and 'none'.
For both levels, I measured
(1) cluster's restore from pg_dumpall output,
(2) COPY initial data to a defined table as initial data loading, and
(3) COPY data to a defined table with tuples, 3 times each for all cases.
After that, calculated the average and the ratio of the loading speed.
The conclusion is that wal_level=none cuts about 20% of the loading speed
compared to 'minimal' in the three cases above. For sure, we could tune the configuration of
postgresql.conf further but roughly it's meaningful to share the result, I think.
'shared_buffers' was set to 40% of RAM while 'maintenance_work_mem'
was set to 20%. I set max_wal_senders = 0 this time.

The input data was generated from pgbench with 1000 scale factor.
It's about 9.3GB. For the table definition or
the initial data for appended data loading test case,
I used pgbench to set up the schema as well.
Sharing other scenario to measure is welcome.

I need to say that the current patch doesn't take the change of wal_level='none'
from/to other ones into account fully or other commands like ones for two phase commit yet.
Sorry for that.

Also, to return the value of last shut down LSN in XLogInsert(),
it acquires lock of control file every time, which was not good clearly.
I tried to replace that code like having a LSN cache as one variable
but I was not sure where I should put the function to set the initial value of the LSN.
After LocalProcessControlFile() in PostmasterMain() was not the right place.
I'd be happy if someone gives me an advice about it.


Best,
        Takamichi Osumi

disable_WAL_logging_v01.patch (12K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Disable WAL logging to speed up data loading

Laurenz Albe
On Wed, 2020-10-28 at 04:11 +0000, [hidden email] wrote:
> I wrote and attached the first patch to disable WAL logging.
> This patch passes the regression test of check-world already
> and is formatted by pgindent.

Without reading the code, I have my doubts about that feature.

While it clearly will improve performance, it opens the door to
data loss.  People will use it to speed up their data loads and
then be unhappy if they cannot use their backups to recover from
a problem.

What happens if you try to do archive recovery across a time where
wal_level was "none"?  Will the recovery process fail, as it should,
or will you end up with data corruption?

We already have a performance-related footgun in the shape of
fsync = off.  Do we want to add another one?

Yours,
Laurenz Albe



Reply | Threaded
Open this post in threaded view
|

RE: Disable WAL logging to speed up data loading

tsunakawa.takay@fujitsu.com
In reply to this post by osumi.takamichi@fujitsu.com
From: Osumi, Takamichi/大墨 昂道 <[hidden email]>
> I wrote and attached the first patch to disable WAL logging.
> This patch passes the regression test of check-world already and is formatted

I think make check-world uses the default setting for wal_level.  You need to set wal_level = none and do make installcheck-world.


> I compared two wal_levels both 'minimal' and 'none'.
> For both levels, I measured
> (1) cluster's restore from pg_dumpall output,
> (2) COPY initial data to a defined table as initial data loading, and
> (3) COPY data to a defined table with tuples, 3 times each for all cases.
> After that, calculated the average and the ratio of the loading speed.
> The conclusion is that wal_level=none cuts about 20% of the loading speed
> compared to 'minimal' in the three cases above.

Hmm.  I wonder why pg_dumpall's performance increases by as much as 20%.  On the contrary to my previous hope, pg_dumpall uses COPY to restore data, so it doesn't emit WAL when wal_level = minimal.  (Is it brought by the difference of whether DDL's WAL is emitted or not?)


> Sharing other scenario to measure is welcome.

How about running multiple concurrent data loading sessions when adding data to existing tables with data, so that WAL is the bottleneck?  That's the use case of the target customer, isn't it?



> The input data was generated from pgbench with 1000 scale factor.
> It's about 9.3GB. For the table definition or the initial data for appended data

IIRC, I thought the scale factor of 1,000 is 1.5 GB.  What displayed 9.3 GB?  SELECT pg_database_size() or something?


Below are review comments:


(1)
@@ -449,6 +449,13 @@ XLogInsert(RmgrId rmid, uint8 info)
  return EndPos;
  }
 
+ /* Issues WAL only for transaction end and check point */
+ if (wal_level == WAL_LEVEL_NONE && rmid != RM_XLOG_ID)
+ {
+ XLogResetInsertion();
+ return GetLatestCheckPointLSN();
+ }
+
  do

This does not emit transaction completion WAL records.  Their RM ID is RM_XACT_ID.  Also, RM_XLOG_ID includes other kinds of WAL records than the checkpoint WAL record.  Correct the comment accordingly.  I don't have a good idea on how to represent the RM_XLOG_ID,, but the following might help:

[rmgrlist.h]
/* symbol name, textual name, redo, desc, identify, startup, cleanup */
PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL, NULL)
PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL, NULL)


(2)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
@@ -2591,10 +2591,10 @@ include_dir 'conf.d'
         data to support WAL archiving and replication, including running
         read-only queries on a standby server. <literal>minimal</literal> removes all
         logging except the information required to recover from a crash or
-        immediate shutdown.  Finally,
+        immediate shutdown.  <literal>none</literal> generates no WAL in any case. Finally,

According to the previous code, "none" emits some kinds of WAL records.  So I think we need to come up with a good name and/or description.



(3)
         <literal>logical</literal> adds information necessary to support logical
-        decoding.  Each level includes the information logged at all lower
-        levels.  This parameter can only be set at server start.
+        decoding.  Each level except for <literal>none</literal> includes the
+        information logged at all lower levels.  This parameter can only be set at server start.
        </para>

Why is this change necessary?


(4)
+        On the other hand, an unexpected crash of the server makes the database cluster
+        inconsistent. For that reason, before utilizing this level, get a full backup of the cluster and
+        backup of the entire operations that are done under the condition that
+        <varname>wal_level</varname> is <literal>none</literal>.

This gives the impression that the user can start the database server and see inconsistent data.  The reality is that the database server does not start, isn't it?


(5)
@@ -1751,7 +1752,8 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
     Aside from avoiding the time for the archiver or WAL sender to process the
     WAL data, doing this will actually make certain commands faster, because
     they do not to write WAL at all if <varname>wal_level</varname>
-    is <literal>minimal</literal> and the current subtransaction (or top-level
+    is either <literal>minimal</literal> or <literal>minimal</literal>
+    and the current subtransaction (or top-level
     transaction) created or truncated the table or index they change.  (They
     can guarantee crash safety more cheaply by doing
     an <function>fsync</function> at the end than by writing WAL.)

This is not correct.  In minimal, some conditions need to hold true for WAL to not be generated as described above.  OTOH, wal_level = none does not generate WAL unconditionally.


(6)
14.4.9. Some Notes about pg_dump
...
If using WAL archiving or streaming replication, consider disabling them during the restore. To do that, set archive_mode to off, wal_level to minimal, and max_wal_senders to zero before loading the dump. Afterwards, set them back to the right values and take a fresh base backup.

Why don't you refer to wal_level = none here as well?


(7)
@@ -918,10 +918,13 @@ PostmasterMain(int argc, char *argv[])
  ReservedBackends, MaxConnections);
  ExitPostmaster(1);
  }
- if (XLogArchiveMode > ARCHIVE_MODE_OFF && wal_level == WAL_LEVEL_MINIMAL)
+ if ((XLogArchiveMode > ARCHIVE_MODE_OFF && wal_level == WAL_LEVEL_NONE) ||
+ (XLogArchiveMode > ARCHIVE_MODE_OFF && wal_level == WAL_LEVEL_MINIMAL))
  ereport(ERROR,
- (errmsg("WAL archival cannot be enabled when wal_level is \"minimal\"")));
- if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
+ (errmsg("WAL archival cannot be enabled when wal_level is \"%s\"",
+ wal_level == WAL_LEVEL_MINIMAL ? "minimal" : "none")));
+ if ((max_wal_senders > 0 && wal_level == WAL_LEVEL_NONE) ||
+ (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL))
  ereport(ERROR,
  (errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));

This style of writing conitions is redundant.  You can just change the code to wal_level <= WAL_LEVEL_MINIMAL.
Also, the first message can be '"minimal" or "none"' like the second one.


(8)
@@ -989,6 +992,12 @@ PostmasterMain(int argc, char *argv[])
  LocalProcessControlFile(false);
 
  /*
+ * Check some conditions specific to wal_level='none' and ensures the
+ * database isn't inconsistent.
+ */
+ SafelyFinishedNoneLevel();
+
+ /*

This check should be moved to around the beginning of StartupXLOG().  PostmasterMain() is called in multi-user mode.  It's not called in single-user mode (postgres --single).
The new function is not necessary.  StartupXLOG() can see the control file contents directly.



(9)
+ /*
+ * Detect if we previously crashed under wal_level='none' or not.
+ */
+ unexpected_shutdown = ControlFile->state != DB_SHUTDOWNED &&
+ ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY;
+ if ((ControlFile->wal_level == WAL_LEVEL_NONE && unexpected_shutdown))
+ {
+ ereport(ERROR,
+ (errmsg("Server was unexpectedly shut down when WAL logging was disabled"),
+ errhint("It looks like you need to deploy a new cluster from your full backup again.")));
+ }
+}

Refine the message according to the following message guideline.  For example, the primary message has to start with a lower case letter.

https://www.postgresql.org/docs/devel/source.html



(10)
You need to edit this to add none:
src/backend/utils/misc/postgresql.conf.sample


(11)
src/include/access/xlogdefs.h
src/backend/access/transam/varsup.c

Consider modifying the comments in these files that refer to wal_level.  Maybe wal_level <= minimal is enough?


(12)
src/include/utils/rel.h

Modify the RelationNeedsWAL() so that it returns false when wal_level = none.  Adding wal_level != WAL_LEVEL_NONE is would probably be okay.


(13)
@@ -161,7 +161,8 @@ extern int XLogArchiveMode;
 /* WAL levels */
 typedef enum WalLevel
 {
- WAL_LEVEL_MINIMAL = 0,
+ WAL_LEVEL_NONE = 0,
+ WAL_LEVEL_MINIMAL,
  WAL_LEVEL_REPLICA,
  WAL_LEVEL_LOGICAL
 } WalLevel;

I'm a bit concerned about if we can change the values of existing symbols, because wal_level setting is stored in pg_control file.  Having a quick look at source code, there seems to be no problem.  However, I'm not sure about pg_upgrade.  Can you try, for example,

1. Create the database cluster with an older version, say, PG 13.
2. Start and stop the database server.
3. Run pg_controldata and see that it prints replica for the wal_level value.
4. Upgrade the database cluster with pg_upgrade.
5. Run pg_controldata and see the wal_level value.

If you change the values of existing symbols like your patch, you need to bump PG_CONTROL_VERSION.
If you find we can't tchange the existing values, you can probably set WAL_LEVEL_NONE to -1.


Regards
Takayuki Tsunakawa



1234