standby recovery fails (tablespace related) (tentative patch and discussion)

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
29 messages Options
12
Reply | Threaded
Open this post in threaded view
|

standby recovery fails (tablespace related) (tentative patch and discussion)

Paul Guo-2
Hello postgres hackers,

Recently my colleagues and I encountered an issue: a standby can not recover after an unclean shutdown and it's related to tablespace.
The issue is that the standby re-replay some xlog that needs tablespace directories (e.g. create a database with tablespace),
but the tablespace directories has already been removed in the previous replay. 

In details, the standby normally finishes replaying for the below operations, but due to unclean shutdown, the redo lsn
is not updated in pg_control and is still kept a value before the 'create db with tabspace' xlog, however since the tablespace
directories were removed so it reports error when repay the database create wal.

create db with tablespace
drop database
drop tablespace.

Here is the log on the standby.
2019-04-17 14:52:14.926 CST [23029] LOG:  starting PostgreSQL 12devel on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-4), 64-bit
2019-04-17 14:52:14.927 CST [23029] LOG:  listening on IPv4 address "192.168.35.130", port 5432
2019-04-17 14:52:14.929 CST [23029] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2019-04-17 14:52:14.943 CST [23030] LOG:  database system was interrupted while in recovery at log time 2019-04-17 14:48:27 CST
2019-04-17 14:52:14.943 CST [23030] HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2019-04-17 14:52:14.949 CST [23030] LOG:  entering standby mode                
2019-04-17 14:52:14.950 CST [23030] LOG:  redo starts at 0/30105B8              
2019-04-17 14:52:14.951 CST [23030] FATAL:  could not create directory "pg_tblspc/65546/PG_12_201904072/65547": No such file or directory
2019-04-17 14:52:14.951 CST [23030] CONTEXT:  WAL redo at 0/3011650 for Database/CREATE: copy dir 1663/1 to 65546/65547
2019-04-17 14:52:14.951 CST [23029] LOG:  startup process (PID 23030) exited with exit code 1
2019-04-17 14:52:14.951 CST [23029] LOG:  terminating any other active server processes
2019-04-17 14:52:14.953 CST [23029] LOG:  database system is shut down          

Steps to reprodce:

1. setup a master and standby.
2. On both side, run: mkdir /tmp/some_isolation2_pg_basebackup_tablespace

3. Run SQLs:
drop tablespace if exists some_isolation2_pg_basebackup_tablespace; 
create tablespace some_isolation2_pg_basebackup_tablespace location '/tmp/some_isolation2_pg_basebackup_tablespace';

3. Clean shutdown and restart both postgres instances.

4. Run the following SQLs:

drop database if exists some_database_with_tablespace; 
create database some_database_with_tablespace tablespace some_isolation2_pg_basebackup_tablespace; 
drop database some_database_with_tablespace;
drop tablespace some_isolation2_pg_basebackup_tablespace; 
\! pkill -9 postgres; ssh host70 pkill -9 postgres

Note immediate shutdown via pg_ctl should also be able to reproduce and the above steps probably does not 100% reproduce.

I created an initial patch for this issue (see the attachment). The idea is re-creating those directories recursively. The above issue exists in dbase_redo(),
but TablespaceCreateDbspace (for relation file create redo) is probably buggy also so I modified that function also. Even there is no bug
in that function, it seems that using simple pg_mkdir_p() is cleaner. Note reading TablespaceCreateDbspace(), I found it seems that this issue
has already be thought though insufficient but frankly this solution (directory recreation) seems to be not perfect given actually this should
have been the responsibility of tablespace creation (also tablespace creation does more like symlink creation, etc). Also, I'm not sure whether
we need to use invalid page mechanism (see xlogutils.c).

Another solution is that, actually, we create a checkpoint when createdb/movedb/dropdb/droptablespace, maybe we should enforce to create
restartpoint on standby for such special kind of checkpoint wal - that means we need to set a flag in checkpoing wal and let checkpoint redo
code to create restartpoint if that flag is set. This solution seems to be safer.

Thanks,
Paul


0001-Recursively-create-tablespace-directories-if-those-a.patch (5K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Asim R P
On Wed, Apr 17, 2019 at 1:27 PM Paul Guo <[hidden email]> wrote:
>
> create db with tablespace
> drop database
> drop tablespace.

Essentially, that sequence of operations causes crash recovery to fail
if the "drop tablespace" transaction was committed before crashing.
This is a bug in crash recovery in general and should be reproducible
without configuring a standby.  Is that right?

Your patch creates missing directories in the destination.  Don't we
need to create the tablespace symlink under pg_tblspc/?  I would
prefer extending the invalid page mechanism to deal with this, as
suggested by Ashwin off-list.  It will allow us to avoid creating
directories and files only to remove them shortly afterwards when the
drop database and drop tablespace records are replayed.

Asim


Reply | Threaded
Open this post in threaded view
|

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Paul Guo-2
Please see my replies inline. Thanks.

On Fri, Apr 19, 2019 at 12:38 PM Asim R P <[hidden email]> wrote:
On Wed, Apr 17, 2019 at 1:27 PM Paul Guo <[hidden email]> wrote:
>
> create db with tablespace
> drop database
> drop tablespace.

Essentially, that sequence of operations causes crash recovery to fail
if the "drop tablespace" transaction was committed before crashing.
This is a bug in crash recovery in general and should be reproducible
without configuring a standby.  Is that right?

No. In general, checkpoint is done for drop_db/create_db/drop_tablespace on master.
That makes the file/directory update-to-date if I understand the related code correctly.
For standby, checkpoint redo does not ensure that.
 

Your patch creates missing directories in the destination.  Don't we
need to create the tablespace symlink under pg_tblspc/?  I would

 'create db with tablespace' redo log does not include the tablespace real directory information.
Yes, we could add in it into the xlog, but that seems to be an overdesign.
 
prefer extending the invalid page mechanism to deal with this, as
suggested by Ashwin off-list.  It will allow us to avoid creating 
directories and files only to remove them shortly afterwards when the
drop database and drop tablespace records are replayed.

 
'invalid page' mechanism seems to be more proper for missing pages of a file. For
missing directories, we could, of course, hack to use that (e.g. reading any page of
a relfile in that database) to make sure the tablespace create code (without symlink)
safer (It assumes those directories will be deleted soon).

More feedback about all of the previous discussed solutions is welcome.
Reply | Threaded
Open this post in threaded view
|

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Kyotaro HORIGUCHI-2
Hello.

At Mon, 22 Apr 2019 12:36:43 +0800, Paul Guo <[hidden email]> wrote in <CAEET0ZGpUrMGUzfyzVF9FuSq+zb=[hidden email]>

> Please see my replies inline. Thanks.
>
> On Fri, Apr 19, 2019 at 12:38 PM Asim R P <[hidden email]> wrote:
>
> > On Wed, Apr 17, 2019 at 1:27 PM Paul Guo <[hidden email]> wrote:
> > >
> > > create db with tablespace
> > > drop database
> > > drop tablespace.
> >
> > Essentially, that sequence of operations causes crash recovery to fail
> > if the "drop tablespace" transaction was committed before crashing.
> > This is a bug in crash recovery in general and should be reproducible
> > without configuring a standby.  Is that right?
> >
>
> No. In general, checkpoint is done for drop_db/create_db/drop_tablespace on
> master.
> That makes the file/directory update-to-date if I understand the related
> code correctly.
> For standby, checkpoint redo does not ensure that.
That's right partly. As you must have seen, fast shutdown forces
restartpoint for the last checkpoint and it prevents the problem
from happening. Anyway it seems to be a problem.

> > Your patch creates missing directories in the destination.  Don't we
> > need to create the tablespace symlink under pg_tblspc/?  I would
> >
>
>  'create db with tablespace' redo log does not include the tablespace real
> directory information.
> Yes, we could add in it into the xlog, but that seems to be an overdesign.

But I don't think creating directory that is to be removed just
after is a wanted solution. The directory most likely to be be
removed just after.

> > prefer extending the invalid page mechanism to deal with this, as
> > suggested by Ashwin off-list.  It will allow us to avoid creating
>
> directories and files only to remove them shortly afterwards when the
> > drop database and drop tablespace records are replayed.
> >
> >
> 'invalid page' mechanism seems to be more proper for missing pages of a
> file. For
> missing directories, we could, of course, hack to use that (e.g. reading
> any page of
> a relfile in that database) to make sure the tablespace create code
> (without symlink)
> safer (It assumes those directories will be deleted soon).
>
> More feedback about all of the previous discussed solutions is welcome.
It doesn't seem to me that the invalid page mechanism is
applicable in straightforward way, because it doesn't consider
simple file copy.

Drop failure is ignored any time. I suppose we can ignore the
error to continue recovering as far as recovery have not reached
consistency. The attached would work *at least* your case, but I
haven't checked this covers all places where need the same
treatment.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 10a663bae6..0bc63f48da 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -522,6 +522,44 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
  return buffer;
 }
 
+/*
+ * XLogMakePGDirectory
+ *
+ * There is a possibility that WAL replay causes a creation of the same
+ * directory left by the previous crash. Issuing ERROR prevents the caller
+ * from continuing recovery.
+ *
+ * To prevent that case, this function issues WARNING instead of ERROR on
+ * error if consistency is not reached yet.
+ */
+int
+XLogMakePGDirectory(const char *directoryName)
+{
+ int ret;
+
+ ret = MakePGDirectory(directoryName);
+
+ if (ret != 0)
+ {
+ int elevel = ERROR;
+
+ /*
+ * We might get error trying to create existing directory that is to
+ * be removed just after.  Don't issue ERROR in the case. Recovery
+ * will stop if we again failed after reaching consistency.
+ */
+ if (InRecovery && !reachedConsistency)
+ elevel = WARNING;
+
+ ereport(elevel,
+ (errcode_for_file_access(),
+ errmsg("could not create directory \"%s\": %m", directoryName)));
+ return ret;
+ }
+
+ return 0;
+}
+
 /*
  * Struct actually returned by XLogFakeRelcacheEntry, though the declared
  * return type is Relation.
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index 30f6200a86..0216270dd3 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -22,11 +22,11 @@
 #include <unistd.h>
 #include <sys/stat.h>
 
+#include "access/xlogutils.h"
 #include "storage/copydir.h"
 #include "storage/fd.h"
 #include "miscadmin.h"
 #include "pgstat.h"
-
 /*
  * copydir: copy a directory
  *
@@ -41,10 +41,12 @@ copydir(char *fromdir, char *todir, bool recurse)
  char fromfile[MAXPGPATH * 2];
  char tofile[MAXPGPATH * 2];
 
- if (MakePGDirectory(todir) != 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not create directory \"%s\": %m", todir)));
+ /*
+ * We might have to skip copydir to continue recovery. See the function
+ * for details.
+ */
+ if (XLogMakePGDirectory(todir) != 0)
+ return;
 
  xldir = AllocateDir(fromdir);
 
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 0ab5ba62f5..46a7596315 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -43,6 +43,7 @@ extern XLogRedoAction XLogReadBufferForRedoExtended(XLogReaderState *record,
 
 extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
    BlockNumber blkno, ReadBufferMode mode);
+extern int XLogMakePGDirectory(const char *directoryName);
 
 extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
 extern void FreeFakeRelcacheEntry(Relation fakerel);
Reply | Threaded
Open this post in threaded view
|

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Kyotaro HORIGUCHI-2
Oops! The comment in the previous patch is wrong.

At Mon, 22 Apr 2019 16:15:13 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <[hidden email]> wrote in <[hidden email]>

> At Mon, 22 Apr 2019 12:36:43 +0800, Paul Guo <[hidden email]> wrote in <CAEET0ZGpUrMGUzfyzVF9FuSq+zb=[hidden email]>
> > Please see my replies inline. Thanks.
> >
> > On Fri, Apr 19, 2019 at 12:38 PM Asim R P <[hidden email]> wrote:
> >
> > > On Wed, Apr 17, 2019 at 1:27 PM Paul Guo <[hidden email]> wrote:
> > > >
> > > > create db with tablespace
> > > > drop database
> > > > drop tablespace.
> > >
> > > Essentially, that sequence of operations causes crash recovery to fail
> > > if the "drop tablespace" transaction was committed before crashing.
> > > This is a bug in crash recovery in general and should be reproducible
> > > without configuring a standby.  Is that right?
> > >
> >
> > No. In general, checkpoint is done for drop_db/create_db/drop_tablespace on
> > master.
> > That makes the file/directory update-to-date if I understand the related
> > code correctly.
> > For standby, checkpoint redo does not ensure that.
>
> That's right partly. As you must have seen, fast shutdown forces
> restartpoint for the last checkpoint and it prevents the problem
> from happening. Anyway it seems to be a problem.
>
> > > Your patch creates missing directories in the destination.  Don't we
> > > need to create the tablespace symlink under pg_tblspc/?  I would
> > >
> >
> >  'create db with tablespace' redo log does not include the tablespace real
> > directory information.
> > Yes, we could add in it into the xlog, but that seems to be an overdesign.
>
> But I don't think creating directory that is to be removed just
> after is a wanted solution. The directory most likely to be be
> removed just after.
>
> > > prefer extending the invalid page mechanism to deal with this, as
> > > suggested by Ashwin off-list.  It will allow us to avoid creating
> >
> > directories and files only to remove them shortly afterwards when the
> > > drop database and drop tablespace records are replayed.
> > >
> > >
> > 'invalid page' mechanism seems to be more proper for missing pages of a
> > file. For
> > missing directories, we could, of course, hack to use that (e.g. reading
> > any page of
> > a relfile in that database) to make sure the tablespace create code
> > (without symlink)
> > safer (It assumes those directories will be deleted soon).
> >
> > More feedback about all of the previous discussed solutions is welcome.
>
> It doesn't seem to me that the invalid page mechanism is
> applicable in straightforward way, because it doesn't consider
> simple file copy.
>
> Drop failure is ignored any time. I suppose we can ignore the
> error to continue recovering as far as recovery have not reached
> consistency. The attached would work *at least* your case, but I
> haven't checked this covers all places where need the same
> treatment.
The comment for the new function XLogMakePGDirectory is wrong:

+ * There is a possibility that WAL replay causes a creation of the same
+ * directory left by the previous crash. Issuing ERROR prevents the caller
+ * from continuing recovery.

The correct one is:

+ * There is a possibility that WAL replay causes an error by creation of a
+ * directory under a directory removed before the previous crash. Issuing
+ * ERROR prevents the caller from continuing recovery.

It is fixed in the attached.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 10a663bae6..01331f0da9 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -522,6 +522,44 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
  return buffer;
 }
 
+/*
+ * XLogMakePGDirectory
+ *
+ * There is a possibility that WAL replay causes an error by creation of a
+ * directory under a directory removed before the previous crash. Issuing
+ * ERROR prevents the caller from continuing recovery.
+ *
+ * To prevent that case, this function issues WARNING instead of ERROR on
+ * error if consistency is not reached yet.
+ */
+int
+XLogMakePGDirectory(const char *directoryName)
+{
+ int ret;
+
+ ret = MakePGDirectory(directoryName);
+
+ if (ret != 0)
+ {
+ int elevel = ERROR;
+
+ /*
+ * We might get error trying to create existing directory that is to
+ * be removed just after.  Don't issue ERROR in the case. Recovery
+ * will stop if we again failed after reaching consistency.
+ */
+ if (InRecovery && !reachedConsistency)
+ elevel = WARNING;
+
+ ereport(elevel,
+ (errcode_for_file_access(),
+ errmsg("could not create directory \"%s\": %m", directoryName)));
+ return ret;
+ }
+
+ return 0;
+}
+
 /*
  * Struct actually returned by XLogFakeRelcacheEntry, though the declared
  * return type is Relation.
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index 30f6200a86..0216270dd3 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -22,11 +22,11 @@
 #include <unistd.h>
 #include <sys/stat.h>
 
+#include "access/xlogutils.h"
 #include "storage/copydir.h"
 #include "storage/fd.h"
 #include "miscadmin.h"
 #include "pgstat.h"
-
 /*
  * copydir: copy a directory
  *
@@ -41,10 +41,12 @@ copydir(char *fromdir, char *todir, bool recurse)
  char fromfile[MAXPGPATH * 2];
  char tofile[MAXPGPATH * 2];
 
- if (MakePGDirectory(todir) != 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not create directory \"%s\": %m", todir)));
+ /*
+ * We might have to skip copydir to continue recovery. See the function
+ * for details.
+ */
+ if (XLogMakePGDirectory(todir) != 0)
+ return;
 
  xldir = AllocateDir(fromdir);
 
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 0ab5ba62f5..46a7596315 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -43,6 +43,7 @@ extern XLogRedoAction XLogReadBufferForRedoExtended(XLogReaderState *record,
 
 extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
    BlockNumber blkno, ReadBufferMode mode);
+extern int XLogMakePGDirectory(const char *directoryName);
 
 extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
 extern void FreeFakeRelcacheEntry(Relation fakerel);
Reply | Threaded
Open this post in threaded view
|

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Kyotaro HORIGUCHI-2
At Mon, 22 Apr 2019 16:40:27 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <[hidden email]> wrote in <[hidden email]>

> At Mon, 22 Apr 2019 16:15:13 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <[hidden email]> wrote in <[hidden email]>
> > At Mon, 22 Apr 2019 12:36:43 +0800, Paul Guo <[hidden email]> wrote in <CAEET0ZGpUrMGUzfyzVF9FuSq+zb=[hidden email]>
> > > Please see my replies inline. Thanks.
> > >
> > > On Fri, Apr 19, 2019 at 12:38 PM Asim R P <[hidden email]> wrote:
> > >
> > > > On Wed, Apr 17, 2019 at 1:27 PM Paul Guo <[hidden email]> wrote:
> > > > >
> > > > > create db with tablespace
> > > > > drop database
> > > > > drop tablespace.
> > > >
> > > > Essentially, that sequence of operations causes crash recovery to fail
> > > > if the "drop tablespace" transaction was committed before crashing.
> > > > This is a bug in crash recovery in general and should be reproducible
> > > > without configuring a standby.  Is that right?
> > > >
> > >
> > > No. In general, checkpoint is done for drop_db/create_db/drop_tablespace on
> > > master.
> > > That makes the file/directory update-to-date if I understand the related
> > > code correctly.
> > > For standby, checkpoint redo does not ensure that.
The attached exercises this sequence, needing some changes in
PostgresNode.pm and RecursiveCopy.pm to allow tablespaces.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

From dbe6306a730f94a5bd8beaf0e534c28ebdd815d4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <[hidden email]>
Date: Mon, 22 Apr 2019 20:10:20 +0900
Subject: [PATCH 1/2] Allow TAP test to excecise tablespace.

To perform tablespace related checks, this patch lets
PostgresNode::backup have a new parameter "tablespace_mapping", and
make init_from_backup handle capable to handle a backup created using
tablespace_mapping.
---
 src/test/perl/PostgresNode.pm  | 10 ++++++++--
 src/test/perl/RecursiveCopy.pm | 33 +++++++++++++++++++++++++++++----
 2 files changed, 37 insertions(+), 6 deletions(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 76874141c5..59a939821d 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -540,13 +540,19 @@ target server since it isn't done by default.
 
 sub backup
 {
- my ($self, $backup_name) = @_;
+ my ($self, $backup_name, %params) = @_;
  my $backup_path = $self->backup_dir . '/' . $backup_name;
  my $name        = $self->name;
+ my @rest = ();
+
+ if (defined $params{tablespace_mapping})
+ {
+ push(@rest, "--tablespace-mapping=$params{tablespace_mapping}");
+ }
 
  print "# Taking pg_basebackup $backup_name from node \"$name\"\n";
  TestLib::system_or_bail('pg_basebackup', '-D', $backup_path, '-h',
- $self->host, '-p', $self->port, '--no-sync');
+ $self->host, '-p', $self->port, '--no-sync', @rest);
  print "# Backup finished\n";
  return;
 }
diff --git a/src/test/perl/RecursiveCopy.pm b/src/test/perl/RecursiveCopy.pm
index baf5d0ac63..c912ce412d 100644
--- a/src/test/perl/RecursiveCopy.pm
+++ b/src/test/perl/RecursiveCopy.pm
@@ -22,6 +22,7 @@ use warnings;
 use Carp;
 use File::Basename;
 use File::Copy;
+use TestLib;
 
 =pod
 
@@ -97,14 +98,38 @@ sub _copypath_recurse
  # invoke the filter and skip all further operation if it returns false
  return 1 unless &$filterfn($curr_path);
 
- # Check for symlink -- needed only on source dir
- # (note: this will fall through quietly if file is already gone)
- croak "Cannot operate on symlink \"$srcpath\"" if -l $srcpath;
-
  # Abort if destination path already exists.  Should we allow directories
  # to exist already?
  croak "Destination path \"$destpath\" already exists" if -e $destpath;
 
+ # Check for symlink -- needed only on source dir
+ # (note: this will fall through quietly if file is already gone)
+ if (-l $srcpath)
+ {
+ croak "Cannot operate on symlink \"$srcpath\""
+  if ($srcpath !~ /\/(pg_tblspc\/[0-9]+)$/);
+
+ # We have mapped tablespaces. Copy them individually
+ my $linkname = $1;
+ my $tmpdir = TestLib::tempdir;
+ my $dstrealdir = TestLib::real_dir($tmpdir);
+ my $srcrealdir = readlink($srcpath);
+
+ opendir(my $dh, $srcrealdir);
+ while (readdir $dh)
+ {
+ next if (/^\.\.?$/);
+ my $spath = "$srcrealdir/$_";
+ my $dpath = "$dstrealdir/$_";
+
+ copypath($spath, $dpath);
+ }
+ closedir $dh;
+
+ symlink $dstrealdir, $destpath;
+ return 1;
+ }
+
  # If this source path is a file, simply copy it to destination with the
  # same name and we're done.
  if (-f $srcpath)
--
2.16.3


From 382910fbe3738c9098c0568cdc992928f471c7c5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <[hidden email]>
Date: Mon, 22 Apr 2019 20:10:25 +0900
Subject: [PATCH 2/2] Add check for recovery failure caused by tablespace.

Removal of a tablespace on master can cause recovery failure on
standby. This patch adds the check for the case.
---
 src/test/recovery/t/011_crash_recovery.pl | 52 ++++++++++++++++++++++++++++++-
 1 file changed, 51 insertions(+), 1 deletion(-)

diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 5dc52412ca..d1eb9edccf 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -15,7 +15,7 @@ if ($Config{osname} eq 'MSWin32')
 }
 else
 {
- plan tests => 3;
+ plan tests => 4;
 }
 
 my $node = get_new_node('master');
@@ -66,3 +66,53 @@ is($node->safe_psql('postgres', qq[SELECT txid_status('$xid');]),
  'aborted', 'xid is aborted after crash');
 
 $tx->kill_kill;
+
+
+# Ensure that tablespace removal doesn't cause error while recoverying
+# the preceding create datbase or objects.
+
+my $node_master = get_new_node('master2');
+$node_master->init(allows_streaming => 1);
+$node_master->start;
+
+# Create tablespace
+my $tspDir_master = TestLib::tempdir;
+my $realTSDir_master = TestLib::real_dir($tspDir_master);
+$node_master->safe_psql('postgres', "CREATE TABLESPACE ts1 LOCATION '$realTSDir_master'");
+
+my $tspDir_standby = TestLib::tempdir;
+my $realTSDir_standby = TestLib::real_dir($tspDir_standby);
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_master->backup($backup_name,
+ tablespace_mapping =>
+   "$realTSDir_master=$realTSDir_standby");
+my $node_standby = get_new_node('standby');
+$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure connection is made
+$node_master->poll_query_until(
+ 'postgres', 'SELECT count(*) = 1 FROM pg_stat_replication');
+
+# Make sure to perform restartpoint after tablespace creation
+$node_master->wait_for_catchup($node_standby, 'replay',
+   $node_master->lsn('replay'));
+$node_standby->safe_psql('postgres', 'CHECKPOINT');
+
+# Do immediate shutdown just after a sequence of CREAT DATABASE / DROP
+# DATABASE / DROP TABLESPACE. This leaves a CREATE DATBASE WAL record
+# that is to be applied to already-removed tablespace.
+$node_master->safe_psql('postgres',
+ q[CREATE DATABASE db1 WITH TABLESPACE ts1;
+  DROP DATABASE db1;
+  DROP TABLESPACE ts1;]);
+$node_master->wait_for_catchup($node_standby, 'replay',
+   $node_master->lsn('replay'));
+$node_standby->stop('immediate');
+
+# Should restart ignoring directory creation error.
+is($node_standby->start(fail_ok => 1), 1);
+
+
--
2.16.3


From 5e3a9b682e6ec2b6cb4e019409112687bd598ebc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <[hidden email]>
Date: Mon, 22 Apr 2019 20:59:15 +0900
Subject: [PATCH 3/3] Fix failure of standby startup caused by tablespace
 removal

When standby restarts after a crash after drop of a tablespace,
there's a possibility that recovery fails trying an object-creation in
already removed tablespace directory. Allow recovery to continue by
ignoring the error if not reaching consistency point.
---
 src/backend/access/transam/xlogutils.c | 34 ++++++++++++++++++++++++++++++++++
 src/backend/storage/file/copydir.c     | 12 +++++++-----
 src/include/access/xlogutils.h         |  1 +
 3 files changed, 42 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 10a663bae6..75cdb882cd 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -522,6 +522,40 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
  return buffer;
 }
 
+/*
+ * XLogMakePGDirectory
+ *
+ * There is a possibility that WAL replay causes an error by creation of a
+ * directory under a directory removed before the previous crash. Issuing
+ * ERROR prevents the caller from continuing recovery.
+ *
+ * To prevent that case, this function issues WARNING instead of ERROR on
+ * error if consistency is not reached yet.
+ */
+int
+XLogMakePGDirectory(const char *directoryName)
+{
+ int ret;
+
+ ret = MakePGDirectory(directoryName);
+
+ if (ret != 0)
+ {
+ int elevel = ERROR;
+
+ /* Don't issue ERROR for this failure before reaching consistency. */
+ if (InRecovery && !reachedConsistency)
+ elevel = WARNING;
+
+ ereport(elevel,
+ (errcode_for_file_access(),
+ errmsg("could not create directory \"%s\": %m", directoryName)));
+ return ret;
+ }
+
+ return 0;
+}
+
 /*
  * Struct actually returned by XLogFakeRelcacheEntry, though the declared
  * return type is Relation.
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index 30f6200a86..0216270dd3 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -22,11 +22,11 @@
 #include <unistd.h>
 #include <sys/stat.h>
 
+#include "access/xlogutils.h"
 #include "storage/copydir.h"
 #include "storage/fd.h"
 #include "miscadmin.h"
 #include "pgstat.h"
-
 /*
  * copydir: copy a directory
  *
@@ -41,10 +41,12 @@ copydir(char *fromdir, char *todir, bool recurse)
  char fromfile[MAXPGPATH * 2];
  char tofile[MAXPGPATH * 2];
 
- if (MakePGDirectory(todir) != 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not create directory \"%s\": %m", todir)));
+ /*
+ * We might have to skip copydir to continue recovery. See the function
+ * for details.
+ */
+ if (XLogMakePGDirectory(todir) != 0)
+ return;
 
  xldir = AllocateDir(fromdir);
 
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 0ab5ba62f5..46a7596315 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -43,6 +43,7 @@ extern XLogRedoAction XLogReadBufferForRedoExtended(XLogReaderState *record,
 
 extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
    BlockNumber blkno, ReadBufferMode mode);
+extern int XLogMakePGDirectory(const char *directoryName);
 
 extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
 extern void FreeFakeRelcacheEntry(Relation fakerel);
--
2.16.3

Reply | Threaded
Open this post in threaded view
|

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Michael Paquier-2
On Mon, Apr 22, 2019 at 09:19:33PM +0900, Kyotaro HORIGUCHI wrote:
> The attached exercises this sequence, needing some changes in
> PostgresNode.pm and RecursiveCopy.pm to allow tablespaces.

+    # Check for symlink -- needed only on source dir
+    # (note: this will fall through quietly if file is already gone)
+    if (-l $srcpath)
+    {
+        croak "Cannot operate on symlink \"$srcpath\""
+          if ($srcpath !~ /\/(pg_tblspc\/[0-9]+)$/);
+
+        # We have mapped tablespaces. Copy them individually
+        my $linkname = $1;
+        my $tmpdir = TestLib::tempdir;
+        my $dstrealdir = TestLib::real_dir($tmpdir);
+        my $srcrealdir = readlink($srcpath);
+
+        opendir(my $dh, $srcrealdir);
+        while (readdir $dh)
+        {
+            next if (/^\.\.?$/);
+            my $spath = "$srcrealdir/$_";
+            my $dpath = "$dstrealdir/$_";
+
+            copypath($spath, $dpath);
+        }
+        closedir $dh;
+
+        symlink $dstrealdir, $destpath;
+        return 1;
+    }

The same stuff is proposed here:
https://www.postgresql.org/message-id/CAGRcZQUxd9YOfifOKXOfJ+Fp3JdpoeKCzt+zH_PRMNaaDaExdQ@...

So there is a lot of demand for making the recursive copy more skilled
at handling symlinks for tablespace tests, and I'd like to propose to
do something among those lines for the tests on HEAD, presumably for
v12 and not v13 as we are talking about a bug fix here?  I am not sure
yet which one of the proposals is better than the other though.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Kyotaro HORIGUCHI-2
At Tue, 23 Apr 2019 11:34:38 +0900, Michael Paquier <[hidden email]> wrote in <[hidden email]>

> On Mon, Apr 22, 2019 at 09:19:33PM +0900, Kyotaro HORIGUCHI wrote:
> > The attached exercises this sequence, needing some changes in
> > PostgresNode.pm and RecursiveCopy.pm to allow tablespaces.
>
> +    # Check for symlink -- needed only on source dir
> +    # (note: this will fall through quietly if file is already gone)
> +    if (-l $srcpath)
> +    {
> +        croak "Cannot operate on symlink \"$srcpath\""
> +          if ($srcpath !~ /\/(pg_tblspc\/[0-9]+)$/);
> +
> +        # We have mapped tablespaces. Copy them individually
> +        my $linkname = $1;
> +        my $tmpdir = TestLib::tempdir;
> +        my $dstrealdir = TestLib::real_dir($tmpdir);
> +        my $srcrealdir = readlink($srcpath);
> +
> +        opendir(my $dh, $srcrealdir);
> +        while (readdir $dh)
> +        {
> +            next if (/^\.\.?$/);
> +            my $spath = "$srcrealdir/$_";
> +            my $dpath = "$dstrealdir/$_";
> +
> +            copypath($spath, $dpath);
> +        }
> +        closedir $dh;
> +
> +        symlink $dstrealdir, $destpath;
> +        return 1;
> +    }
>
> The same stuff is proposed here:
> https://www.postgresql.org/message-id/CAGRcZQUxd9YOfifOKXOfJ+Fp3JdpoeKCzt+zH_PRMNaaDaExdQ@...
>
> So there is a lot of demand for making the recursive copy more skilled
> at handling symlinks for tablespace tests, and I'd like to propose to
> do something among those lines for the tests on HEAD, presumably for
> v12 and not v13 as we are talking about a bug fix here?  I am not sure
> yet which one of the proposals is better than the other though.

TBH I like that (my one cieted above) not so much. However, I
prefer to have v12 if this is a bug and to be fixed in
v12. Otherwise we won't add a test for this later:p

Anyway I'll visit there. Thanks.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



Reply | Threaded
Open this post in threaded view
|

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Paul Guo-2
In reply to this post by Kyotaro HORIGUCHI-2
Hi Kyotaro, ignoring the MakePGDirectory() failure will fix this database create redo error, but I suspect some other kind of redo, which depends on the files under the directory (they are not copied since the directory is not created) and also cannot be covered by the invalid page mechanism, could fail. Thanks.

On Mon, Apr 22, 2019 at 3:40 PM Kyotaro HORIGUCHI <[hidden email]> wrote:
Oops! The comment in the previous patch is wrong.

At Mon, 22 Apr 2019 16:15:13 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <[hidden email]> wrote in <[hidden email]>
> At Mon, 22 Apr 2019 12:36:43 +0800, Paul Guo <[hidden email]> wrote in <CAEET0ZGpUrMGUzfyzVF9FuSq+zb=[hidden email]>
> > Please see my replies inline. Thanks.
> >
> > On Fri, Apr 19, 2019 at 12:38 PM Asim R P <[hidden email]> wrote:
> >
> > > On Wed, Apr 17, 2019 at 1:27 PM Paul Guo <[hidden email]> wrote:
> > > >
> > > > create db with tablespace
> > > > drop database
> > > > drop tablespace.
> > >
> > > Essentially, that sequence of operations causes crash recovery to fail
> > > if the "drop tablespace" transaction was committed before crashing.
> > > This is a bug in crash recovery in general and should be reproducible
> > > without configuring a standby.  Is that right?
> > >
> >
> > No. In general, checkpoint is done for drop_db/create_db/drop_tablespace on
> > master.
> > That makes the file/directory update-to-date if I understand the related
> > code correctly.
> > For standby, checkpoint redo does not ensure that.
>
> That's right partly. As you must have seen, fast shutdown forces
> restartpoint for the last checkpoint and it prevents the problem
> from happening. Anyway it seems to be a problem.
>
> > > Your patch creates missing directories in the destination.  Don't we
> > > need to create the tablespace symlink under pg_tblspc/?  I would
> > >
> >
> >  'create db with tablespace' redo log does not include the tablespace real
> > directory information.
> > Yes, we could add in it into the xlog, but that seems to be an overdesign.
>
> But I don't think creating directory that is to be removed just
> after is a wanted solution. The directory most likely to be be
> removed just after.
>
> > > prefer extending the invalid page mechanism to deal with this, as
> > > suggested by Ashwin off-list.  It will allow us to avoid creating
> >
> > directories and files only to remove them shortly afterwards when the
> > > drop database and drop tablespace records are replayed.
> > >
> > >
> > 'invalid page' mechanism seems to be more proper for missing pages of a
> > file. For
> > missing directories, we could, of course, hack to use that (e.g. reading
> > any page of
> > a relfile in that database) to make sure the tablespace create code
> > (without symlink)
> > safer (It assumes those directories will be deleted soon).
> >
> > More feedback about all of the previous discussed solutions is welcome.
>
> It doesn't seem to me that the invalid page mechanism is
> applicable in straightforward way, because it doesn't consider
> simple file copy.
>
> Drop failure is ignored any time. I suppose we can ignore the
> error to continue recovering as far as recovery have not reached
> consistency. The attached would work *at least* your case, but I
> haven't checked this covers all places where need the same
> treatment.

The comment for the new function XLogMakePGDirectory is wrong:

+ * There is a possibility that WAL replay causes a creation of the same
+ * directory left by the previous crash. Issuing ERROR prevents the caller
+ * from continuing recovery.

The correct one is:

+ * There is a possibility that WAL replay causes an error by creation of a
+ * directory under a directory removed before the previous crash. Issuing
+ * ERROR prevents the caller from continuing recovery.

It is fixed in the attached.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Reply | Threaded
Open this post in threaded view
|

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Kyotaro HORIGUCHI-2
Hello.

At Tue, 23 Apr 2019 13:31:58 +0800, Paul Guo <[hidden email]> wrote in <[hidden email]>
> Hi Kyotaro, ignoring the MakePGDirectory() failure will fix this database
> create redo error, but I suspect some other kind of redo, which depends on
> the files under the directory (they are not copied since the directory is
> not created) and also cannot be covered by the invalid page mechanism,
> could fail. Thanks.

If recovery starts from just after tablespace creation, that's
simple. The Symlink to the removed tablespace is already removed
in the case. Hence server innocently create files directly under
pg_tblspc, not in the tablespace. Finally all files that were
supposed to be created in the removed tablespace are removed
later in recovery.

If recovery starts from recalling page in a file that have been
in the tablespace, XLogReadBufferExtended creates one (perhaps
directly in pg_tblspc as described above) and the files are
removed later in recoery the same way to above. This case doen't
cause FATAL/PANIC during recovery even in master.

[hidden email]
| * Create the target file if it doesn't already exist.  This lets us cope
| * if the replay sequence contains writes to a relation that is later
| * deleted.  (The original coding of this routine would instead suppress
| * the writes, but that seems like it risks losing valuable data if the
| * filesystem loses an inode during a crash.  Better to write the data
| * until we are actually told to delete the file.)

So buffered access cannot be a problem for the reason above. The
remaining possible issue is non-buffered access to files in
removed tablespaces. This is what I mentioned upthread:

me> but I haven't checked this covers all places where need the same
me> treatment.

--
Kyotaro Horiguchi
NTT Open Source Software Center



Reply | Threaded
Open this post in threaded view
|

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Kyotaro HORIGUCHI-2
Mmm. I posted to wrong thread. Sorry.

At Tue, 23 Apr 2019 16:39:49 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <[hidden email]> wrote in <[hidden email]>

> At Tue, 23 Apr 2019 13:31:58 +0800, Paul Guo <[hidden email]> wrote in <[hidden email]>
> > Hi Kyotaro, ignoring the MakePGDirectory() failure will fix this database
> > create redo error, but I suspect some other kind of redo, which depends on
> > the files under the directory (they are not copied since the directory is
> > not created) and also cannot be covered by the invalid page mechanism,
> > could fail. Thanks.
>
> If recovery starts from just after tablespace creation, that's
> simple. The Symlink to the removed tablespace is already removed
> in the case. Hence server innocently create files directly under
> pg_tblspc, not in the tablespace. Finally all files that were
> supposed to be created in the removed tablespace are removed
> later in recovery.
>
> If recovery starts from recalling page in a file that have been
> in the tablespace, XLogReadBufferExtended creates one (perhaps
> directly in pg_tblspc as described above) and the files are
> removed later in recoery the same way to above. This case doen't
> cause FATAL/PANIC during recovery even in master.
>
> [hidden email]
> | * Create the target file if it doesn't already exist.  This lets us cope
> | * if the replay sequence contains writes to a relation that is later
> | * deleted.  (The original coding of this routine would instead suppress
> | * the writes, but that seems like it risks losing valuable data if the
> | * filesystem loses an inode during a crash.  Better to write the data
> | * until we are actually told to delete the file.)
>
> So buffered access cannot be a problem for the reason above. The
> remaining possible issue is non-buffered access to files in
> removed tablespaces. This is what I mentioned upthread:
>
> me> but I haven't checked this covers all places where need the same
> me> treatment.
RM_DBASE_ID is fixed by the patch.

XLOG/XACT/CLOG/MULTIXACT/RELMAP/STANDBY/COMMIT_TS/REPLORIGIN/LOGICALMSG:
  - are not relevant.

HEAP/HEAP2/BTREE/HASH/GIN/GIST/SEQ/SPGIST/BRIN/GENERIC:
  - Resources works on buffer is not affected.

SMGR:
  - Both CREATE and TRUNCATE seems fine.

TBLSPC:
  - We don't nest tablespace directories. No Problem.

I don't find a similar case.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center





From bc97e195f21af5d715d85424efc21fcbe8bb902c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <[hidden email]>
Date: Mon, 22 Apr 2019 20:59:15 +0900
Subject: [PATCH 4/5] Fix failure of standby startup caused by tablespace
 removal

When standby restarts after a crash after drop of a tablespace,
there's a possibility that recovery fails trying an object-creation in
already removed tablespace directory. Allow recovery to continue by
ignoring the error if not reaching consistency point.
---
 src/backend/access/transam/xlogutils.c | 34 ++++++++++++++++++++++++++++++++++
 src/backend/commands/tablespace.c      | 12 ++++++------
 src/backend/storage/file/copydir.c     | 12 +++++++-----
 src/include/access/xlogutils.h         |  1 +
 4 files changed, 48 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 10a663bae6..75cdb882cd 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -522,6 +522,40 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
  return buffer;
 }
 
+/*
+ * XLogMakePGDirectory
+ *
+ * There is a possibility that WAL replay causes an error by creation of a
+ * directory under a directory removed before the previous crash. Issuing
+ * ERROR prevents the caller from continuing recovery.
+ *
+ * To prevent that case, this function issues WARNING instead of ERROR on
+ * error if consistency is not reached yet.
+ */
+int
+XLogMakePGDirectory(const char *directoryName)
+{
+ int ret;
+
+ ret = MakePGDirectory(directoryName);
+
+ if (ret != 0)
+ {
+ int elevel = ERROR;
+
+ /* Don't issue ERROR for this failure before reaching consistency. */
+ if (InRecovery && !reachedConsistency)
+ elevel = WARNING;
+
+ ereport(elevel,
+ (errcode_for_file_access(),
+ errmsg("could not create directory \"%s\": %m", directoryName)));
+ return ret;
+ }
+
+ return 0;
+}
+
 /*
  * Struct actually returned by XLogFakeRelcacheEntry, though the declared
  * return type is Relation.
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 66a70871e6..c9fb2af015 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -303,12 +303,6 @@ CreateTableSpace(CreateTableSpaceStmt *stmt)
  (errcode(ERRCODE_INVALID_NAME),
  errmsg("tablespace location cannot contain single quotes")));
 
- /* Reject tablespaces in the data directory. */
- if (is_in_data_directory(location))
- ereport(ERROR,
- (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
- errmsg("tablespace location must not be inside the data directory")));
-
  /*
  * Check that location isn't too long. Remember that we're going to append
  * 'PG_XXX/<dboid>/<relid>_<fork>.<nnn>'.  In the relative path case, we
@@ -323,6 +317,12 @@ CreateTableSpace(CreateTableSpaceStmt *stmt)
  errmsg("tablespace location \"%s\" is too long",
  location)));
 
+ /* Reject tablespaces in the data directory. */
+ if (is_in_data_directory(location))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("tablespace location must not be inside the data directory")));
+
  /*
  * Disallow creation of tablespaces named "pg_xxx"; we reserve this
  * namespace for system purposes.
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index 30f6200a86..0216270dd3 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -22,11 +22,11 @@
 #include <unistd.h>
 #include <sys/stat.h>
 
+#include "access/xlogutils.h"
 #include "storage/copydir.h"
 #include "storage/fd.h"
 #include "miscadmin.h"
 #include "pgstat.h"
-
 /*
  * copydir: copy a directory
  *
@@ -41,10 +41,12 @@ copydir(char *fromdir, char *todir, bool recurse)
  char fromfile[MAXPGPATH * 2];
  char tofile[MAXPGPATH * 2];
 
- if (MakePGDirectory(todir) != 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not create directory \"%s\": %m", todir)));
+ /*
+ * We might have to skip copydir to continue recovery. See the function
+ * for details.
+ */
+ if (XLogMakePGDirectory(todir) != 0)
+ return;
 
  xldir = AllocateDir(fromdir);
 
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 0ab5ba62f5..46a7596315 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -43,6 +43,7 @@ extern XLogRedoAction XLogReadBufferForRedoExtended(XLogReaderState *record,
 
 extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
    BlockNumber blkno, ReadBufferMode mode);
+extern int XLogMakePGDirectory(const char *directoryName);
 
 extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
 extern void FreeFakeRelcacheEntry(Relation fakerel);
--
2.16.3

Reply | Threaded
Open this post in threaded view
|

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Paul Guo-2

On Wed, Apr 24, 2019 at 4:14 PM Kyotaro HORIGUCHI <[hidden email]> wrote:
Mmm. I posted to wrong thread. Sorry.

At Tue, 23 Apr 2019 16:39:49 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <[hidden email]> wrote in <[hidden email]>
> At Tue, 23 Apr 2019 13:31:58 +0800, Paul Guo <[hidden email]> wrote in <[hidden email]>
> > Hi Kyotaro, ignoring the MakePGDirectory() failure will fix this database
> > create redo error, but I suspect some other kind of redo, which depends on
> > the files under the directory (they are not copied since the directory is
> > not created) and also cannot be covered by the invalid page mechanism,
> > could fail. Thanks.
>
> If recovery starts from just after tablespace creation, that's
> simple. The Symlink to the removed tablespace is already removed
> in the case. Hence server innocently create files directly under
> pg_tblspc, not in the tablespace. Finally all files that were
> supposed to be created in the removed tablespace are removed
> later in recovery.
>
> If recovery starts from recalling page in a file that have been
> in the tablespace, XLogReadBufferExtended creates one (perhaps
> directly in pg_tblspc as described above) and the files are
> removed later in recoery the same way to above. This case doen't
> cause FATAL/PANIC during recovery even in master.
>
> [hidden email]
> | * Create the target file if it doesn't already exist.  This lets us cope
> | * if the replay sequence contains writes to a relation that is later
> | * deleted.  (The original coding of this routine would instead suppress
> | * the writes, but that seems like it risks losing valuable data if the
> | * filesystem loses an inode during a crash.  Better to write the data
> | * until we are actually told to delete the file.)
>
> So buffered access cannot be a problem for the reason above. The
> remaining possible issue is non-buffered access to files in
> removed tablespaces. This is what I mentioned upthread:
>
> me> but I haven't checked this covers all places where need the same
> me> treatment.

RM_DBASE_ID is fixed by the patch.

XLOG/XACT/CLOG/MULTIXACT/RELMAP/STANDBY/COMMIT_TS/REPLORIGIN/LOGICALMSG:
  - are not relevant.

HEAP/HEAP2/BTREE/HASH/GIN/GIST/SEQ/SPGIST/BRIN/GENERIC:
  - Resources works on buffer is not affected.

SMGR:
  - Both CREATE and TRUNCATE seems fine.

TBLSPC:
  - We don't nest tablespace directories. No Problem.

I don't find a similar case.

I took some time in digging into the related code. It seems that ignoring if the dst directory cannot be created directly
should be fine since smgr redo code creates tablespace path finally by calling TablespaceCreateDbspace().
What's more, I found some more issues.

1) The below error message is actually misleading.

2019-04-17 14:52:14.951 CST [23030] FATAL:  could not create directory "pg_tblspc/65546/PG_12_201904072/65547": No such file or directory
2019-04-17 14:52:14.951 CST [23030] CONTEXT:  WAL redo at 0/3011650 for Database/CREATE: copy dir 1663/1 to 65546/65547

That should be due to dbase_desc(). It could be simply fixed following the code logic in GetDatabasePath().

2) It seems that src directory could be missing then dbase_redo()->copydir() could error out. For example,

\!rm -rf /tmp/tbspace1  
\!mkdir /tmp/tbspace1
\!rm -rf /tmp/tbspace2
\!mkdir /tmp/tbspace2
create tablespace tbs1 location '/tmp/tbspace1'; 
create tablespace tbs2 location '/tmp/tbspace2'; 
create database db1 tablespace tbs1;
alter database db1 set tablespace tbs2;
drop tablespace tbs1;

Let's say, the standby finishes all replay but redo lsn on pg_control is still the point at 'alter database', and then
kill postgres, then in theory when startup, dbase_redo()->copydir() will ERROR since 'drop tablespace tbs1'
has removed the directories (and symlink) of tbs1. Below simple code change could fix that.

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 9707afabd9..7d755c759e 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -2114,6 +2114,15 @@ dbase_redo(XLogReaderState *record)
         */
        FlushDatabaseBuffers(xlrec->src_db_id);

+       /*
+        * It is possible that the source directory is missing if
+        * we are re-replaying the xlog while subsequent xlogs
+        * drop the tablespace in previous replaying. For this
+        * we just skip.
+        */
+       if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)))
+           return;
+
        /*
         * Copy this subdirectory to the new location
         *

If we want to fix the issue by ignoring the dst path create failure, I do not think we should do
that in copydir() since copydir() seems to be a common function. I'm not sure whether it is
used by some extensions or not. If no maybe we should move the dst patch create logic
out of copydir().

Also I'd suggest we should use pg_mkdir_p() in TablespaceCreateDbspace() to replace
the code block includes a lot of get_parent_directory(), MakePGDirectory(), etc even it
is not fixing a bug since pg_mkdir_p() code change seems to be more graceful and simpler.
 
Whatever ignore mkdir failure or mkdir_p, I found that these steps seem to be error-prone
along with postgre evolving since they are hard to test and also we are not easy to think out
various potential bad cases. Is it possible that we should do real checkpoint (flush & update
redo lsn) when seeing checkpoint xlogs for these operations? This will slow down standby
but master also does this and also these operations are not usual, espeically it seems that it
does not slow down wal receiving usually?



Reply | Threaded
Open this post in threaded view
|

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Paul Guo-2
I updated the original patch to

1) skip copydir() if either src path or dst parent path is missing in dbase_redo(). Both missing cases seem to be possible. For the src path missing case, mkdir_p() is meaningless. It seems that moving the directory existence check step to dbase_redo() has less impact on other code.

2) Fixed dbase_desc(). Now the xlog output looks correct.

rmgr: Database    len (rec/tot):     42/    42, tx:        486, lsn: 0/016386A8, prev 0/01638630, desc: CREATE copy dir base/1 to pg_tblspc/16384/PG_12_201904281/16386

rmgr: Database    len (rec/tot):     34/    34, tx:        487, lsn: 0/01638EB8, prev 0/01638E40, desc: DROP dir pg_tblspc/16384/PG_12_201904281/16386

I'm not familiar with the TAP test details previously. I learned a lot about how to test such case from Kyotaro's patch series.👍

On Sun, Apr 28, 2019 at 3:33 PM Paul Guo <[hidden email]> wrote:

On Wed, Apr 24, 2019 at 4:14 PM Kyotaro HORIGUCHI <[hidden email]> wrote:
Mmm. I posted to wrong thread. Sorry.

At Tue, 23 Apr 2019 16:39:49 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <[hidden email]> wrote in <[hidden email]>
> At Tue, 23 Apr 2019 13:31:58 +0800, Paul Guo <[hidden email]> wrote in <[hidden email]>
> > Hi Kyotaro, ignoring the MakePGDirectory() failure will fix this database
> > create redo error, but I suspect some other kind of redo, which depends on
> > the files under the directory (they are not copied since the directory is
> > not created) and also cannot be covered by the invalid page mechanism,
> > could fail. Thanks.
>
> If recovery starts from just after tablespace creation, that's
> simple. The Symlink to the removed tablespace is already removed
> in the case. Hence server innocently create files directly under
> pg_tblspc, not in the tablespace. Finally all files that were
> supposed to be created in the removed tablespace are removed
> later in recovery.
>
> If recovery starts from recalling page in a file that have been
> in the tablespace, XLogReadBufferExtended creates one (perhaps
> directly in pg_tblspc as described above) and the files are
> removed later in recoery the same way to above. This case doen't
> cause FATAL/PANIC during recovery even in master.
>
> [hidden email]
> | * Create the target file if it doesn't already exist.  This lets us cope
> | * if the replay sequence contains writes to a relation that is later
> | * deleted.  (The original coding of this routine would instead suppress
> | * the writes, but that seems like it risks losing valuable data if the
> | * filesystem loses an inode during a crash.  Better to write the data
> | * until we are actually told to delete the file.)
>
> So buffered access cannot be a problem for the reason above. The
> remaining possible issue is non-buffered access to files in
> removed tablespaces. This is what I mentioned upthread:
>
> me> but I haven't checked this covers all places where need the same
> me> treatment.

RM_DBASE_ID is fixed by the patch.

XLOG/XACT/CLOG/MULTIXACT/RELMAP/STANDBY/COMMIT_TS/REPLORIGIN/LOGICALMSG:
  - are not relevant.

HEAP/HEAP2/BTREE/HASH/GIN/GIST/SEQ/SPGIST/BRIN/GENERIC:
  - Resources works on buffer is not affected.

SMGR:
  - Both CREATE and TRUNCATE seems fine.

TBLSPC:
  - We don't nest tablespace directories. No Problem.

I don't find a similar case.

I took some time in digging into the related code. It seems that ignoring if the dst directory cannot be created directly
should be fine since smgr redo code creates tablespace path finally by calling TablespaceCreateDbspace().
What's more, I found some more issues.

1) The below error message is actually misleading.

2019-04-17 14:52:14.951 CST [23030] FATAL:  could not create directory "pg_tblspc/65546/PG_12_201904072/65547": No such file or directory
2019-04-17 14:52:14.951 CST [23030] CONTEXT:  WAL redo at 0/3011650 for Database/CREATE: copy dir 1663/1 to 65546/65547

That should be due to dbase_desc(). It could be simply fixed following the code logic in GetDatabasePath().

2) It seems that src directory could be missing then dbase_redo()->copydir() could error out. For example,

\!rm -rf /tmp/tbspace1  
\!mkdir /tmp/tbspace1
\!rm -rf /tmp/tbspace2
\!mkdir /tmp/tbspace2
create tablespace tbs1 location '/tmp/tbspace1'; 
create tablespace tbs2 location '/tmp/tbspace2'; 
create database db1 tablespace tbs1;
alter database db1 set tablespace tbs2;
drop tablespace tbs1;

Let's say, the standby finishes all replay but redo lsn on pg_control is still the point at 'alter database', and then
kill postgres, then in theory when startup, dbase_redo()->copydir() will ERROR since 'drop tablespace tbs1'
has removed the directories (and symlink) of tbs1. Below simple code change could fix that.

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 9707afabd9..7d755c759e 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -2114,6 +2114,15 @@ dbase_redo(XLogReaderState *record)
         */
        FlushDatabaseBuffers(xlrec->src_db_id);

+       /*
+        * It is possible that the source directory is missing if
+        * we are re-replaying the xlog while subsequent xlogs
+        * drop the tablespace in previous replaying. For this
+        * we just skip.
+        */
+       if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)))
+           return;
+
        /*
         * Copy this subdirectory to the new location
         *

If we want to fix the issue by ignoring the dst path create failure, I do not think we should do
that in copydir() since copydir() seems to be a common function. I'm not sure whether it is
used by some extensions or not. If no maybe we should move the dst patch create logic
out of copydir().

Also I'd suggest we should use pg_mkdir_p() in TablespaceCreateDbspace() to replace
the code block includes a lot of get_parent_directory(), MakePGDirectory(), etc even it
is not fixing a bug since pg_mkdir_p() code change seems to be more graceful and simpler.
 
Whatever ignore mkdir failure or mkdir_p, I found that these steps seem to be error-prone
along with postgre evolving since they are hard to test and also we are not easy to think out
various potential bad cases. Is it possible that we should do real checkpoint (flush & update
redo lsn) when seeing checkpoint xlogs for these operations? This will slow down standby
but master also does this and also these operations are not usual, espeically it seems that it
does not slow down wal receiving usually?




v2-0001-skip-copydir-if-either-src-directory-or-dst-direc.patch (7K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Kyotaro HORIGUCHI-2
Hi.

At Tue, 30 Apr 2019 14:33:47 +0800, Paul Guo <[hidden email]> wrote in <[hidden email]>
> I updated the original patch to

It's reasonable not to touch copydir.

> 1) skip copydir() if either src path or dst parent path is missing in
> dbase_redo(). Both missing cases seem to be possible. For the src path
> missing case, mkdir_p() is meaningless. It seems that moving the directory
> existence check step to dbase_redo() has less impact on other code.

Nice catch.


+      if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+      {

This patch is allowing missing source and destination directory
even in consistent state. I don't think it is safe.



+        ereport(WARNING,
+            (errmsg("directory \"%s\" for copydir() does not exists."
+                "It is possibly expected. Skip copydir().",
+                parent_path)));

This message seems unfriendly to users, or it seems like an elog
message. How about something like this. The same can be said for
the source directory.

| WARNING:  skipped creating database directory: "%s"
| DETAIL:  The tabelspace %u may have been removed just before crash.

# I'm not confident in this at all:(

> 2) Fixed dbase_desc(). Now the xlog output looks correct.
>
> rmgr: Database    len (rec/tot):     42/    42, tx:        486, lsn:
> 0/016386A8, prev 0/01638630, desc: CREATE copy dir base/1 to
> pg_tblspc/16384/PG_12_201904281/16386
>
> rmgr: Database    len (rec/tot):     34/    34, tx:        487, lsn:
> 0/01638EB8, prev 0/01638E40, desc: DROP dir
> pg_tblspc/16384/PG_12_201904281/16386

WAL records don't convey such information. The previous
description seems right to me.

> I'm not familiar with the TAP test details previously. I learned a lot
> about how to test such case from Kyotaro's patch series.👍

Yeah, good to hear.

> On Sun, Apr 28, 2019 at 3:33 PM Paul Guo <[hidden email]> wrote:
> > If we want to fix the issue by ignoring the dst path create failure, I do
> > not think we should do
> > that in copydir() since copydir() seems to be a common function. I'm not
> > sure whether it is
> > used by some extensions or not. If no maybe we should move the dst patch
> > create logic
> > out of copydir().

Agreed to this.

> > Also I'd suggest we should use pg_mkdir_p() in TablespaceCreateDbspace()
> > to replace
> > the code block includes a lot of
> > get_parent_directory(), MakePGDirectory(), etc even it
> > is not fixing a bug since pg_mkdir_p() code change seems to be more
> > graceful and simpler.

But I don't agree to this. pg_mkdir_p goes above two-parents up,
which would be unwanted here.

> > Whatever ignore mkdir failure or mkdir_p, I found that these steps seem to
> > be error-prone
> > along with postgre evolving since they are hard to test and also we are
> > not easy to think out
> > various potential bad cases. Is it possible that we should do real
> > checkpoint (flush & update
> > redo lsn) when seeing checkpoint xlogs for these operations? This will
> > slow down standby
> > but master also does this and also these operations are not usual,
> > espeically it seems that it
> > does not slow down wal receiving usually?

That dramatically slows recovery (not replication) if databases
are created and deleted frequently. That wouldn't be acceptable.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Reply | Threaded
Open this post in threaded view
|

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Paul Guo-2

Thanks for the reply.

On Tue, May 7, 2019 at 2:47 PM Kyotaro HORIGUCHI <[hidden email]> wrote:

+      if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+      {

This patch is allowing missing source and destination directory
even in consistent state. I don't think it is safe.

I do not understand this. Can you elaborate?
 



+        ereport(WARNING,
+            (errmsg("directory \"%s\" for copydir() does not exists."
+                "It is possibly expected. Skip copydir().",
+                parent_path)));

This message seems unfriendly to users, or it seems like an elog
message. How about something like this. The same can be said for
the source directory.

| WARNING:  skipped creating database directory: "%s"
| DETAIL:  The tabelspace %u may have been removed just before crash.

Yeah. Looks better.
 

# I'm not confident in this at all:(

> 2) Fixed dbase_desc(). Now the xlog output looks correct.
>
> rmgr: Database    len (rec/tot):     42/    42, tx:        486, lsn:
> 0/016386A8, prev 0/01638630, desc: CREATE copy dir base/1 to
> pg_tblspc/16384/PG_12_201904281/16386
>
> rmgr: Database    len (rec/tot):     34/    34, tx:        487, lsn:
> 0/01638EB8, prev 0/01638E40, desc: DROP dir
> pg_tblspc/16384/PG_12_201904281/16386

WAL records don't convey such information. The previous
description seems right to me.

2019-04-17 14:52:14.951 CST [23030] CONTEXT:  WAL redo at 0/3011650 for Database/CREATE: copy dir 1663/1 to 65546/65547
The directories are definitely wrong and misleading.


> > Also I'd suggest we should use pg_mkdir_p() in TablespaceCreateDbspace()
> > to replace
> > the code block includes a lot of
> > get_parent_directory(), MakePGDirectory(), etc even it
> > is not fixing a bug since pg_mkdir_p() code change seems to be more
> > graceful and simpler.

But I don't agree to this. pg_mkdir_p goes above two-parents up,
which would be unwanted here.

I do not understand this also. pg_mkdir_p() is similar to 'mkdir -p'.
This change just makes the code concise. Though in theory the change is not needed.
 
> > Whatever ignore mkdir failure or mkdir_p, I found that these steps seem to
> > be error-prone
> > along with postgre evolving since they are hard to test and also we are
> > not easy to think out
> > various potential bad cases. Is it possible that we should do real
> > checkpoint (flush & update
> > redo lsn) when seeing checkpoint xlogs for these operations? This will
> > slow down standby
> > but master also does this and also these operations are not usual,
> > espeically it seems that it
> > does not slow down wal receiving usually?

That dramatically slows recovery (not replication) if databases
are created and deleted frequently. That wouldn't be acceptable.

This behavior is rare and seems to have the same impact on master & standby from checkpoint/restartpoint.
We do not worry about master so we should not worry about standby also.
Reply | Threaded
Open this post in threaded view
|

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Kyotaro HORIGUCHI-2
Hello.

At Mon, 13 May 2019 17:37:50 +0800, Paul Guo <[hidden email]> wrote in <[hidden email]>

> Thanks for the reply.
>
> On Tue, May 7, 2019 at 2:47 PM Kyotaro HORIGUCHI <
> [hidden email]> wrote:
>
> >
> > +      if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
> > +      {
> >
> > This patch is allowing missing source and destination directory
> > even in consistent state. I don't think it is safe.
> >
>
> I do not understand this. Can you elaborate?

Suppose we were recoverying based on a backup at LSN1 targeting
to LSN3 then it crashed at LSN2, where LSN1 < LSN2 <= LSN3. LSN2
is called as "consistency point", before where the database is
not consistent. It's because we are applying WAL recored older
than those that were already applied in the second trial. The
same can be said for crash recovery, where LSN1 is the latest
checkpoint ('s redo LSN) and LSN2=LSN3 is the crashed LSN.

Creation of an existing directory or dropping of a non-existent
directory are apparently inconsistent or "broken" so we should
stop recovery when seeing such WAL records while database is in
consistent state.

> > +        ereport(WARNING,
> > +            (errmsg("directory \"%s\" for copydir() does not exists."
> > +                "It is possibly expected. Skip copydir().",
> > +                parent_path)));
> >
> > This message seems unfriendly to users, or it seems like an elog
> > message. How about something like this. The same can be said for
> > the source directory.
> >
> > | WARNING:  skipped creating database directory: "%s"
> > | DETAIL:  The tabelspace %u may have been removed just before crash.
> >
>
> Yeah. Looks better.
>
>
> >
> > # I'm not confident in this at all:(
> >
> > > 2) Fixed dbase_desc(). Now the xlog output looks correct.
> > >
> > > rmgr: Database    len (rec/tot):     42/    42, tx:        486, lsn:
> > > 0/016386A8, prev 0/01638630, desc: CREATE copy dir base/1 to
> > > pg_tblspc/16384/PG_12_201904281/16386
> > >
> > > rmgr: Database    len (rec/tot):     34/    34, tx:        487, lsn:
> > > 0/01638EB8, prev 0/01638E40, desc: DROP dir
> > > pg_tblspc/16384/PG_12_201904281/16386
> >
> > WAL records don't convey such information. The previous
> > description seems right to me.
> >
>
> 2019-04-17 14:52:14.951 CST [23030] CONTEXT:  WAL redo at 0/3011650 for
> Database/CREATE: copy dir 1663/1 to 65546/65547
> The directories are definitely wrong and misleading.

The original description is right in the light of how the server
recognizes. The record exactly says that "copy dir 1663/1 to
65546/65547" and the latter path is converted in filesystem layer
via a symlink.


> > > > Also I'd suggest we should use pg_mkdir_p() in
> > TablespaceCreateDbspace()
> > > > to replace
> > > > the code block includes a lot of
> > > > get_parent_directory(), MakePGDirectory(), etc even it
> > > > is not fixing a bug since pg_mkdir_p() code change seems to be more
> > > > graceful and simpler.
> >
> > But I don't agree to this. pg_mkdir_p goes above two-parents up,
> > which would be unwanted here.
> >
> > I do not understand this also. pg_mkdir_p() is similar to 'mkdir -p'.
> This change just makes the code concise. Though in theory the change is not
> needed.

We don't want to create tablespace direcotory after concurrent
DROPing, as the comment just above is saying:

|  * Acquire TablespaceCreateLock to ensure that no DROP TABLESPACE
|  * or TablespaceCreateDbspace is running concurrently.

If the concurrent DROP TABLESPACE destroyed the grand parent
directory, we mustn't create it again.

> > > > Whatever ignore mkdir failure or mkdir_p, I found that these steps
> > seem to
> > > > be error-prone
> > > > along with postgre evolving since they are hard to test and also we are
> > > > not easy to think out
> > > > various potential bad cases. Is it possible that we should do real
> > > > checkpoint (flush & update
> > > > redo lsn) when seeing checkpoint xlogs for these operations? This will
> > > > slow down standby
> > > > but master also does this and also these operations are not usual,
> > > > espeically it seems that it
> > > > does not slow down wal receiving usually?
> >
> > That dramatically slows recovery (not replication) if databases
> > are created and deleted frequently. That wouldn't be acceptable.
> >
>
> This behavior is rare and seems to have the same impact on master & standby
> from checkpoint/restartpoint.
> We do not worry about master so we should not worry about standby also.

I didn't mention replication. I said that that slows recovery,
which is not governed by master's speed.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



Reply | Threaded
Open this post in threaded view
|

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Paul Guo-2


On Tue, May 14, 2019 at 11:06 AM Kyotaro HORIGUCHI <[hidden email]> wrote:
Hello.

At Mon, 13 May 2019 17:37:50 +0800, Paul Guo <[hidden email]> wrote in <[hidden email]>
> Thanks for the reply.
>
> On Tue, May 7, 2019 at 2:47 PM Kyotaro HORIGUCHI <
> [hidden email]> wrote:
>
> >
> > +      if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
> > +      {
> >
> > This patch is allowing missing source and destination directory
> > even in consistent state. I don't think it is safe.
> >
>
> I do not understand this. Can you elaborate?

Suppose we were recoverying based on a backup at LSN1 targeting
to LSN3 then it crashed at LSN2, where LSN1 < LSN2 <= LSN3. LSN2
is called as "consistency point", before where the database is
not consistent. It's because we are applying WAL recored older
than those that were already applied in the second trial. The
same can be said for crash recovery, where LSN1 is the latest
checkpoint ('s redo LSN) and LSN2=LSN3 is the crashed LSN.

Creation of an existing directory or dropping of a non-existent
directory are apparently inconsistent or "broken" so we should
stop recovery when seeing such WAL records while database is in
consistent state.

This seems to be hard to detect. I thought using invalid_page mechanism long ago,
but it seems to be hard to fully detect a dropped tablespace.

> > > 2) Fixed dbase_desc(). Now the xlog output looks correct.
> > >
> > > rmgr: Database    len (rec/tot):     42/    42, tx:        486, lsn:
> > > 0/016386A8, prev 0/01638630, desc: CREATE copy dir base/1 to
> > > pg_tblspc/16384/PG_12_201904281/16386
> > >
> > > rmgr: Database    len (rec/tot):     34/    34, tx:        487, lsn:
> > > 0/01638EB8, prev 0/01638E40, desc: DROP dir
> > > pg_tblspc/16384/PG_12_201904281/16386
> >
> > WAL records don't convey such information. The previous
> > description seems right to me.
> >
>
> 2019-04-17 14:52:14.951 CST [23030] CONTEXT:  WAL redo at 0/3011650 for
> Database/CREATE: copy dir 1663/1 to 65546/65547
> The directories are definitely wrong and misleading.

The original description is right in the light of how the server
recognizes. The record exactly says that "copy dir 1663/1 to
65546/65547" and the latter path is converted in filesystem layer
via a symlink.

In either $PG_DATA/pg_tblspc or symlinked real tablespace directory,
there is an additional directory like PG_12_201905221 between
tablespace oid and database oid. See the directory layout as below,
so the directory info in xlog dump output was not correct.

$ ls -lh data/pg_tblspc/                                                        

total 0                                                                        

lrwxrwxrwx. 1 gpadmin gpadmin 6 May 27 17:23 16384 -> /tmp/2                    

$ ls -lh /tmp/2                                                                

total 0                                                                        

drwx------. 3 gpadmin gpadmin 18 May 27 17:24 PG_12_201905221            



> > > > Also I'd suggest we should use pg_mkdir_p() in
> > TablespaceCreateDbspace()
> > > > to replace
> > > > the code block includes a lot of
> > > > get_parent_directory(), MakePGDirectory(), etc even it
> > > > is not fixing a bug since pg_mkdir_p() code change seems to be more
> > > > graceful and simpler.
> >
> > But I don't agree to this. pg_mkdir_p goes above two-parents up,
> > which would be unwanted here.
> >
> > I do not understand this also. pg_mkdir_p() is similar to 'mkdir -p'.
> This change just makes the code concise. Though in theory the change is not
> needed.

We don't want to create tablespace direcotory after concurrent
DROPing, as the comment just above is saying:

|  * Acquire TablespaceCreateLock to ensure that no DROP TABLESPACE
|  * or TablespaceCreateDbspace is running concurrently.

If the concurrent DROP TABLESPACE destroyed the grand parent
directory, we mustn't create it again.

Yes, this is a good reason to keep the original code. Thanks.
 
By the way, based on your previous test patch I added another test which could easily detect
the missing src directory case.
 
Reply | Threaded
Open this post in threaded view
|

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Paul Guo-2


On Mon, May 27, 2019 at 9:39 PM Paul Guo <[hidden email]> wrote:


On Tue, May 14, 2019 at 11:06 AM Kyotaro HORIGUCHI <[hidden email]> wrote:
Hello.

At Mon, 13 May 2019 17:37:50 +0800, Paul Guo <[hidden email]> wrote in <[hidden email]>
> Thanks for the reply.
>
> On Tue, May 7, 2019 at 2:47 PM Kyotaro HORIGUCHI <
> [hidden email]> wrote:
>
> >
> > +      if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
> > +      {
> >
> > This patch is allowing missing source and destination directory
> > even in consistent state. I don't think it is safe.
> >
>
> I do not understand this. Can you elaborate?

Suppose we were recoverying based on a backup at LSN1 targeting
to LSN3 then it crashed at LSN2, where LSN1 < LSN2 <= LSN3. LSN2
is called as "consistency point", before where the database is
not consistent. It's because we are applying WAL recored older
than those that were already applied in the second trial. The
same can be said for crash recovery, where LSN1 is the latest
checkpoint ('s redo LSN) and LSN2=LSN3 is the crashed LSN.

Creation of an existing directory or dropping of a non-existent
directory are apparently inconsistent or "broken" so we should
stop recovery when seeing such WAL records while database is in
consistent state.

This seems to be hard to detect. I thought using invalid_page mechanism long ago,
but it seems to be hard to fully detect a dropped tablespace.

> > > 2) Fixed dbase_desc(). Now the xlog output looks correct.
> > >
> > > rmgr: Database    len (rec/tot):     42/    42, tx:        486, lsn:
> > > 0/016386A8, prev 0/01638630, desc: CREATE copy dir base/1 to
> > > pg_tblspc/16384/PG_12_201904281/16386
> > >
> > > rmgr: Database    len (rec/tot):     34/    34, tx:        487, lsn:
> > > 0/01638EB8, prev 0/01638E40, desc: DROP dir
> > > pg_tblspc/16384/PG_12_201904281/16386
> >
> > WAL records don't convey such information. The previous
> > description seems right to me.
> >
>
> 2019-04-17 14:52:14.951 CST [23030] CONTEXT:  WAL redo at 0/3011650 for
> Database/CREATE: copy dir 1663/1 to 65546/65547
> The directories are definitely wrong and misleading.

The original description is right in the light of how the server
recognizes. The record exactly says that "copy dir 1663/1 to
65546/65547" and the latter path is converted in filesystem layer
via a symlink.

In either $PG_DATA/pg_tblspc or symlinked real tablespace directory,
there is an additional directory like PG_12_201905221 between
tablespace oid and database oid. See the directory layout as below,
so the directory info in xlog dump output was not correct.

$ ls -lh data/pg_tblspc/                                                        

total 0                                                                        

lrwxrwxrwx. 1 gpadmin gpadmin 6 May 27 17:23 16384 -> /tmp/2                    

$ ls -lh /tmp/2                                                                

total 0                                                                        

drwx------. 3 gpadmin gpadmin 18 May 27 17:24 PG_12_201905221            



> > > > Also I'd suggest we should use pg_mkdir_p() in
> > TablespaceCreateDbspace()
> > > > to replace
> > > > the code block includes a lot of
> > > > get_parent_directory(), MakePGDirectory(), etc even it
> > > > is not fixing a bug since pg_mkdir_p() code change seems to be more
> > > > graceful and simpler.
> >
> > But I don't agree to this. pg_mkdir_p goes above two-parents up,
> > which would be unwanted here.
> >
> > I do not understand this also. pg_mkdir_p() is similar to 'mkdir -p'.
> This change just makes the code concise. Though in theory the change is not
> needed.

We don't want to create tablespace direcotory after concurrent
DROPing, as the comment just above is saying:

|  * Acquire TablespaceCreateLock to ensure that no DROP TABLESPACE
|  * or TablespaceCreateDbspace is running concurrently.

If the concurrent DROP TABLESPACE destroyed the grand parent
directory, we mustn't create it again.

Yes, this is a good reason to keep the original code. Thanks.
 
By the way, based on your previous test patch I added another test which could easily detect
the missing src directory case.
 

I updated the patch to v3. In this version, we skip the error if copydir fails due to missing src/dst directory,
but to make sure the ignoring is legal, I add a simple log/forget mechanism (Using List) similar to the xlog invalid page
checking mechanism. Two tap tests are included. One is actually from a previous patch by Kyotaro in this
email thread and another is added by me. In addition, dbase_desc() is fixed to make the message accurate.
 
Thanks.

v3-0001-skip-copydir-if-either-src-directory-or-dst-direc.patch (20K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Thomas Munro-5
On Wed, Jun 19, 2019 at 7:22 PM Paul Guo <[hidden email]> wrote:
> I updated the patch to v3. In this version, we skip the error if copydir fails due to missing src/dst directory,
> but to make sure the ignoring is legal, I add a simple log/forget mechanism (Using List) similar to the xlog invalid page
> checking mechanism. Two tap tests are included. One is actually from a previous patch by Kyotaro in this
> email thread and another is added by me. In addition, dbase_desc() is fixed to make the message accurate.

Hello Paul,

FYI t/011_crash_recovery.pl is failing consistently on Travis CI with
this patch applied:

https://travis-ci.org/postgresql-cfbot/postgresql/builds/555368907

--
Thomas Munro
https://enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Paul Guo-2


On Mon, Jul 8, 2019 at 11:16 AM Thomas Munro <[hidden email]> wrote:
On Wed, Jun 19, 2019 at 7:22 PM Paul Guo <[hidden email]> wrote:
> I updated the patch to v3. In this version, we skip the error if copydir fails due to missing src/dst directory,
> but to make sure the ignoring is legal, I add a simple log/forget mechanism (Using List) similar to the xlog invalid page
> checking mechanism. Two tap tests are included. One is actually from a previous patch by Kyotaro in this
> email thread and another is added by me. In addition, dbase_desc() is fixed to make the message accurate.

Hello Paul,

FYI t/011_crash_recovery.pl is failing consistently on Travis CI with
this patch applied:

https://urldefense.proofpoint.com/v2/url?u=https-3A__travis-2Dci.org_postgresql-2Dcfbot_postgresql_builds_555368907&d=DwIBaQ&c=lnl9vOaLMzsy2niBC8-h_K-7QJuNJEsFrzdndhuJ3Sw&r=Usi0ex6Ch92MsB5QQDgYFw&m=ABylo8AVfubiiYVbCBSgmNnHEMJhMqGXx5c0hkug7Vw&s=5h4m_JhrZwZqsRsu1CHCD3W2eBl14mT8jWLFsj2-bJ4&e=



This failure is because the previous v3 patch does not align with a recent patch

commit 660a2b19038b2f6b9f6bcb2c3297a47d5e3557a8                                
Author: Noah Misch <[hidden email]>                                          
Date:   Fri Jun 21 20:34:23 2019 -0700                                                   
    Consolidate methods for translating a Perl path to a Windows path.          

My patch uses TestLib::real_dir which is now replaced with TestLib::perl2host in the above commit.

I've updated the patch to v4 to make my code align. Now the test passes in my local environment.

Please see the attached v4 patch.

Thanks.

v4-0001-skip-copydir-if-either-src-directory-or-dst-direc.patch (20K) Download Attachment
12