Stronger safeguard for archive recovery not to miss data

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
55 messages Options
123
Reply | Threaded
Open this post in threaded view
|

RE: Stronger safeguard for archive recovery not to miss data

osumi.takamichi@fujitsu.com
On Monday, April 5, 2021 11:49 PM [hidden email] <[hidden email]>

> On Mon Apr 5, 2021 12:35 PM Fujii Masao <[hidden email]>
> wrote:
> > >>> By the way, when I build postgres with this patch and
> > >>> enable-coverage option, the results of RT becomes unstable. Does
> > >>> someone know the
> > >> reason ?
> > >>> When it fails, I get stderr like below
> > >>
> > >> I have no idea about this. Does this happen even without the patch?
> > > Unfortunately, no. I get this only with --enable-coverage and with
> > > my patch, althought regression tests have passed with this patch.
> > > OSS HEAD doesn't produce the stderr even with --enable-coverage.
> >
> > Could you check whether the latest patch still causes this issue or not?
> > If it still causes, could you check which part (the change of xlog.c
> > or the addition of regression test) caused the issue?
> v07 reproduces the phenomena, even with make coverage-clean between
> tests.
> The possibility is not high though.
>
> We cannot do the regression test separately from xlog.c because it uses the
> new error message of xlog.c.
> Applying only the TAP test should fail because we get an warning not error.
>
> Therefore, I took the changes of xlog.c only and I'm doing the RT in a loop
> now. If we can get the stderr again, then we can guess xlog.c is the cause,
> right ?
>
> I think I can report the result tomorrow.
> Just in case, I'm running the RT for OSS HEAD in parallel...
> although I cannot reproduce it with it at all.
I really apologie that this OSS HEAD reproduced that stderr with success of RT.
I executed check-world in parallel with -j option so
the reason should be what Tsunakawa-san told us.
Its probability is pretty low.
I'm so sorry for making noises loudly.
Therefore, I don't have any concern left.


Best Regards,
        Takamichi Osumi

Reply | Threaded
Open this post in threaded view
|

RE: Stronger safeguard for archive recovery not to miss data

osumi.takamichi@fujitsu.com
On Tuesday, April 6, 2021 8:32 AM Osumi, Takamichi/大墨 昂道 <[hidden email]>

> On Monday, April 5, 2021 11:49 PM [hidden email]
> <[hidden email]>
> > On Mon Apr 5, 2021 12:35 PM Fujii Masao <[hidden email]>
> > wrote:
> > > >>> By the way, when I build postgres with this patch and
> > > >>> enable-coverage option, the results of RT becomes unstable. Does
> > > >>> someone know the
> > > >> reason ?
> > > >>> When it fails, I get stderr like below
> > > >>
> > > >> I have no idea about this. Does this happen even without the patch?
> > > > Unfortunately, no. I get this only with --enable-coverage and with
> > > > my patch, althought regression tests have passed with this patch.
> > > > OSS HEAD doesn't produce the stderr even with --enable-coverage.
> > >
> > > Could you check whether the latest patch still causes this issue or not?
> > > If it still causes, could you check which part (the change of xlog.c
> > > or the addition of regression test) caused the issue?
> > v07 reproduces the phenomena, even with make coverage-clean between
> > tests.
> > The possibility is not high though.
> >
> > We cannot do the regression test separately from xlog.c because it
> > uses the new error message of xlog.c.
> > Applying only the TAP test should fail because we get an warning not error.
> >
> > Therefore, I took the changes of xlog.c only and I'm doing the RT in a
> > loop now. If we can get the stderr again, then we can guess xlog.c is
> > the cause, right ?
> >
> > I think I can report the result tomorrow.
> > Just in case, I'm running the RT for OSS HEAD in parallel...
> > although I cannot reproduce it with it at all.
> I really apologie that this OSS HEAD reproduced that stderr with success of
> RT.
> I executed check-world in parallel with -j option so the reason should be what
> Tsunakawa-san told us.
> Its probability is pretty low.
> I'm so sorry for making noises loudly.
> Therefore, I don't have any concern left.
This is *not* due to the patch but for future analysis.
The phenomena happens with a very little possibility, and in other case,
with --enable-coverage and make check-world causes an error like below.
I used gcc 8.

#   Failed test 'pg_ctl start: no stderr'
#   at t/001_start_stop.pl line 48.
#          got: 'profiling:/home/(path/to/oss/head)/src/backend/utils/adt/regproc.gcda:Merge mismatch for function 24
# '
#     expected: ''
# Looks like you failed 1 test of 24.
make[2]: *** [Makefile:50: check] Error 1
make[1]: *** [Makefile:43: check-pg_ctl-recurse] Error 2
make[1]: *** Waiting for unfinished jobs....
make: *** [GNUmakefile:71: check-world-src/bin-recurse] Error 2
make: *** Waiting for unfinished jobs....

The steps I used are
$ git clone and cd to OSS HEAD
$ ./configure --enable-coverage --enable-cassert --enable-debug --enable-tap-tests --with-icu CFLAGS=-O0 --prefix=/where/to/put/binary
$ make -j4 2> make.log
$ make check-world -j4 2> make_check_world.log

Best Regards,
        Takamichi Osumi

Reply | Threaded
Open this post in threaded view
|

Re: Stronger safeguard for archive recovery not to miss data

Fujii Masao-4
In reply to this post by osumi.takamichi@fujitsu.com


On 2021/04/05 23:49, [hidden email] wrote:
> Some minor comments.

Thanks for the review!

> (1)
>
> +       # Confirm that the archive recovery fails with an error
> +       my $logfile = slurp_file($recovery_node->logfile());
> +       ok( $logfile =~
> +               qr/FATAL:  WAL was generated with wal_level=minimal, cannot continue recovering/,
> +               "$node_text ends with an error because it finds WAL generated with wal_level=minimal");
>
> How about a comment
> "Confirm that the archive recovery fails with an expected error" ?
Sounds good to me. Fixed.


> (2)
>
> +# Test for  archive recovery
> +test_recovery_wal_level_minimal('archive_recovery', 'archive recovery', 0);
> We have two spaces in one comment. Should be fixed.

Yes, fixed.


> (3)
>
> I thought the function name 'test_recovery_wal_level_minimal'
> is a little bit weird and can be changed.
> How about something like execute_recovery_scenario,
> test_recovery_safeguard or test_archive_recovery_safeguard ?

I prefer the original name, so I didn't change this.
And we can rename it later if necessary.

On 2021/04/05 23:54, [hidden email] wrote:
>> This makes me think that we should document this risk.... Thought?
> +1. We should notify the risk when user changes
> the wal_level higher than minimal to minimal
> to invoke a carefulness of user for such kind of operation.

I removed the HINT message "or recover to the point in ..." and
added the following note into the docs.

     Note that changing <varname>wal_level</varname> to
     <literal>minimal</literal> makes any base backups taken before
     unavailable for archive recovery and standby server, which may
     lead to database loss.

Attached is the updated version of the patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

stronger_safeguard_for_archive_recovery_v08.patch (8K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

RE: Stronger safeguard for archive recovery not to miss data

osumi.takamichi@fujitsu.com
On Tuesday, April 6, 2021 9:41 AM Fujii Masao <[hidden email]>

> On 2021/04/05 23:54, [hidden email] wrote:
> >> This makes me think that we should document this risk.... Thought?
> > +1. We should notify the risk when user changes
> > the wal_level higher than minimal to minimal to invoke a carefulness
> > of user for such kind of operation.
>
> I removed the HINT message "or recover to the point in ..." and added the
> following note into the docs.
>
>      Note that changing <varname>wal_level</varname> to
>      <literal>minimal</literal> makes any base backups taken before
>      unavailable for archive recovery and standby server, which may
>      lead to database loss.
Thank you for updating the patch. Let's make the sentence more strict.

My suggestion for this explanation is
"In order to prevent database corruption, changing
wal_level to minimal from higher level in the middle of
WAL archiving requires careful attention. It makes any base backups
taken before the operation unavailable for archive recovery
and standby server. Also, it may lead to whole database loss when
archive recovery fails with an error for that change.
Take a new base backup immediately after making wal_level back to higher level."

Then, we can be consistent with our new hint message,
"Use a backup taken after setting wal_level to higher than minimal.".

Is it better to add something similar to "Take an offline backup when you stop the server
and change the wal_level" around the end of this part as another option for safeguard, also?
For the performance technique part, what we need to explain is same.

Another minor thing I felt we need to do might be to add double quotes to wrap minimal in errhint.
Other errhints do so when we use it in a sentence.

There is no more additional comment from me !

Best Regards,
        Takamichi Osumi

Reply | Threaded
Open this post in threaded view
|

Re: Stronger safeguard for archive recovery not to miss data

Kyotaro Horiguchi-4
At Tue, 6 Apr 2021 04:11:35 +0000, "[hidden email]" <[hidden email]> wrote in

> On Tuesday, April 6, 2021 9:41 AM Fujii Masao <[hidden email]>
> > On 2021/04/05 23:54, [hidden email] wrote:
> > >> This makes me think that we should document this risk.... Thought?
> > > +1. We should notify the risk when user changes
> > > the wal_level higher than minimal to minimal to invoke a carefulness
> > > of user for such kind of operation.
> >
> > I removed the HINT message "or recover to the point in ..." and added the
> > following note into the docs.
> >
> >      Note that changing <varname>wal_level</varname> to
> >      <literal>minimal</literal> makes any base backups taken before
> >      unavailable for archive recovery and standby server, which may
> >      lead to database loss.
> Thank you for updating the patch. Let's make the sentence more strict.
>
> My suggestion for this explanation is
> "In order to prevent database corruption, changing
> wal_level to minimal from higher level in the middle of
> WAL archiving requires careful attention. It makes any base backups
> taken before the operation unavailable for archive recovery
> and standby server. Also, it may lead to whole database loss when
> archive recovery fails with an error for that change.
> Take a new base backup immediately after making wal_level back to higher level."

The first sentense looks like somewhat nanny-ish.  The database is not
corrupt at the time of this error. We just lose updates after the last
read segment at this point.  As Fujii-san said, we can continue
recoverying using crash recovery and we will reach having a corrupt
database after that.

About the last sentense, I prefer more flat wording, such as "You need
to take a new base backup..."

> Then, we can be consistent with our new hint message,
> "Use a backup taken after setting wal_level to higher than minimal.".
>
> Is it better to add something similar to "Take an offline backup when you stop the server
> and change the wal_level" around the end of this part as another option for safeguard, also?

Backup policy is completely a matter of DBAs.  If flipping wal_level
alone highly causes unstartable corruption,,, I think it is a bug.

> For the performance technique part, what we need to explain is same.

Might be good, but in simpler wording.

> Another minor thing I felt we need to do might be to add double quotes to wrap minimal in errhint.

Since the error about hot_standby has gone, either will do for me.

> Other errhints do so when we use it in a sentence.
>
> There is no more additional comment from me !

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


Reply | Threaded
Open this post in threaded view
|

RE: Stronger safeguard for archive recovery not to miss data

osumi.takamichi@fujitsu.com
On Tuesday, April 6, 2021 3:24 PM Kyotaro Horiguchi <[hidden email]> wrote:

> At Tue, 6 Apr 2021 04:11:35 +0000, "[hidden email]"
> <[hidden email]> wrote in
> > On Tuesday, April 6, 2021 9:41 AM Fujii Masao
> > <[hidden email]>
> > > On 2021/04/05 23:54, [hidden email] wrote:
> > > >> This makes me think that we should document this risk.... Thought?
> > > > +1. We should notify the risk when user changes
> > > > the wal_level higher than minimal to minimal to invoke a
> > > > carefulness of user for such kind of operation.
> > >
> > > I removed the HINT message "or recover to the point in ..." and
> > > added the following note into the docs.
> > >
> > >      Note that changing <varname>wal_level</varname> to
> > >      <literal>minimal</literal> makes any base backups taken before
> > >      unavailable for archive recovery and standby server, which may
> > >      lead to database loss.
> > Thank you for updating the patch. Let's make the sentence more strict.
> >
> > My suggestion for this explanation is
> > "In order to prevent database corruption, changing wal_level to
> > minimal from higher level in the middle of WAL archiving requires
> > careful attention. It makes any base backups taken before the
> > operation unavailable for archive recovery and standby server. Also,
> > it may lead to whole database loss when archive recovery fails with an
> > error for that change.
> > Take a new base backup immediately after making wal_level back to higher
> level."
>
> The first sentense looks like somewhat nanny-ish.  The database is not
> corrupt at the time of this error.
Yes. Excuse me for misleading sentence.
I just wanted to write why the error was introduced,
but it was not necessary.
We should remove and fix the first part of the sentence.

> We just lose updates after the last read
> segment at this point.  As Fujii-san said, we can continue recoverying using
> crash recovery and we will reach having a corrupt database after that.
OK. Thank you for explanation.


> About the last sentence, I prefer more flat wording, such as "You need to take
> a new base backup..."
Either is fine to me.

> > Then, we can be consistent with our new hint message, "Use a backup
> > taken after setting wal_level to higher than minimal.".
>
> > Is it better to add something similar to "Take an offline backup when
> > you stop the server and change the wal_level" around the end of this part as
> another option for safeguard, also?
>
> Backup policy is completely a matter of DBAs.
OK. No problem. No need to add it.

> If flipping wal_level alone
> highly causes unstartable corruption,,, I think it is a bug.
> > For the performance technique part, what we need to explain is same.
>
> Might be good, but in simpler wording.
Yeah, I agree.
 
> > Another minor thing I felt we need to do might be to add double quotes to
> wrap minimal in errhint.
>
> Since the error about hot_standby has gone, either will do for me.
Thanks for sharing your thoughts.


Best Regards,
        Takamichi Osumi



Reply | Threaded
Open this post in threaded view
|

Re: Stronger safeguard for archive recovery not to miss data

Fujii Masao-4


On 2021/04/06 15:59, [hidden email] wrote:
> I just wanted to write why the error was introduced,
> but it was not necessary.
> We should remove and fix the first part of the sentence.

So the consensus is almost the same as the latest patch?
If they are not so different, I'm thinking to push the latest version at first.
Then we can improve the docs if required.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION


Reply | Threaded
Open this post in threaded view
|

Re: Stronger safeguard for archive recovery not to miss data

David Steele
On 4/6/21 7:13 AM, Fujii Masao wrote:

>
> On 2021/04/06 15:59, [hidden email] wrote:
>> I just wanted to write why the error was introduced,
>> but it was not necessary.
>> We should remove and fix the first part of the sentence.
>
> So the consensus is almost the same as the latest patch?
> If they are not so different, I'm thinking to push the latest version at
> first.
> Then we can improve the docs if required.

+1

--
-David
[hidden email]


Reply | Threaded
Open this post in threaded view
|

RE: Stronger safeguard for archive recovery not to miss data

osumi.takamichi@fujitsu.com
On Tuesday, April 6, 2021 8:44 PM David Steele <[hidden email]> wrote:

> On 4/6/21 7:13 AM, Fujii Masao wrote:
> >
> > On 2021/04/06 15:59, [hidden email] wrote:
> >> I just wanted to write why the error was introduced, but it was not
> >> necessary.
> >> We should remove and fix the first part of the sentence.
> >
> > So the consensus is almost the same as the latest patch?
> > If they are not so different, I'm thinking to push the latest version
> > at first.
> > Then we can improve the docs if required.
>
> +1
Yes, please. What I suggested is almost same as your idea.
Thank you for your confirmation.


Best Regards,
        Takamichi Osumi



Reply | Threaded
Open this post in threaded view
|

Re: Stronger safeguard for archive recovery not to miss data

Fujii Masao-4
In reply to this post by David Steele


On 2021/04/06 20:44, David Steele wrote:

> On 4/6/21 7:13 AM, Fujii Masao wrote:
>>
>> On 2021/04/06 15:59, [hidden email] wrote:
>>> I just wanted to write why the error was introduced,
>>> but it was not necessary.
>>> We should remove and fix the first part of the sentence.
>>
>> So the consensus is almost the same as the latest patch?
>> If they are not so different, I'm thinking to push the latest version atfirst.
>> Then we can improve the docs if required.
>
> +1

Thanks! So I pushed the patch.


On 2021/04/06 20:48, [hidden email] wrote:
> Yes, please. What I suggested is almost same as your idea.

Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION


Reply | Threaded
Open this post in threaded view
|

Re: Stronger safeguard for archive recovery not to miss data

Fujii Masao-4


On 2021/04/06 23:01, Fujii Masao wrote:

>
>
> On 2021/04/06 20:44, David Steele wrote:
>> On 4/6/21 7:13 AM, Fujii Masao wrote:
>>>
>>> On 2021/04/06 15:59, [hidden email] wrote:
>>>> I just wanted to write why the error was introduced,
>>>> but it was not necessary.
>>>> We should remove and fix the first part of the sentence.
>>>
>>> So the consensus is almost the same as the latest patch?
>>> If they are not so different, I'm thinking to push the latest version atfirst.
>>> Then we can improve the docs if required.
>>
>> +1
>
> Thanks! So I pushed the patch.
The buildfarm members "drongo" and "fairywren" reported that the regression test (024_archive_recovery.pl) commit 9de9294b0c added failed with the following error. ISTM the cause of this failure is that the test calls $node->init() without "allows_streaming => 1" and which doesn't add pg_hba.conf entry for TCP/IP connection from pg_basebackup. So I think that the attached patch needs to be applied.

     pg_basebackup: error: connection to server at "127.0.0.1", port 52316 failed: FATAL:  no pg_hba.conf entry for replication connection from host "127.0.0.1", user "pgrunner", no encryption
     Bail out!  system pg_basebackup failed

I guess that only "drongo" and "fairywren" reported this issue because they are running on Windows and pg_basebackup uses TCP/IP connection. OTOH I guess that other buildfarm members running on non-Windows use Unix-domain for pg_basebackup and pg_hba.conf entry for that exists by default. So ISTM they reported no such issue.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

bugfix.patch (764 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Stronger safeguard for archive recovery not to miss data

Tom Lane-2
In reply to this post by Fujii Masao-4
Fujii Masao <[hidden email]> writes:
> Thanks! So I pushed the patch.

Seems like the test case added by this commit is not
working on Windows.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: Stronger safeguard for archive recovery not to miss data

Fujii Masao-4
In reply to this post by Fujii Masao-4


On 2021/04/07 3:03, Fujii Masao wrote:

>
>
> On 2021/04/06 23:01, Fujii Masao wrote:
>>
>>
>> On 2021/04/06 20:44, David Steele wrote:
>>> On 4/6/21 7:13 AM, Fujii Masao wrote:
>>>>
>>>> On 2021/04/06 15:59, [hidden email] wrote:
>>>>> I just wanted to write why the error was introduced,
>>>>> but it was not necessary.
>>>>> We should remove and fix the first part of the sentence.
>>>>
>>>> So the consensus is almost the same as the latest patch?
>>>> If they are not so different, I'm thinking to push the latest version atfirst.
>>>> Then we can improve the docs if required.
>>>
>>> +1
>>
>> Thanks! So I pushed the patch.
>
> The buildfarm members "drongo" and "fairywren" reported that the regression test (024_archive_recovery.pl) commit 9de9294b0c added failed with the following error. ISTM the cause of this failure is that the test calls $node->init() without "allows_streaming => 1" and which doesn't add pg_hba.conf entry for TCP/IP connection from pg_basebackup. So I think that the attached patch needs to be applied.

Pushed.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION


Reply | Threaded
Open this post in threaded view
|

Re: Stronger safeguard for archive recovery not to miss data

Fujii Masao-4
In reply to this post by Tom Lane-2


On 2021/04/07 5:01, Tom Lane wrote:
> Fujii Masao <[hidden email]> writes:
>> Thanks! So I pushed the patch.
>
> Seems like the test case added by this commit is not
> working on Windows.

Thanks for the report! My analysis is [1], and I pushed the proposed patch.

[1]
https://postgr.es/m/3cc3909d-f779-7a74-c201-f1f7f62c7497@...

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION


Reply | Threaded
Open this post in threaded view
|

RE: Stronger safeguard for archive recovery not to miss data

osumi.takamichi@fujitsu.com
On Wednesday, April 7, 2021 7:50 AM Fujii Masao <[hidden email]> wrote:

> On 2021/04/07 5:01, Tom Lane wrote:
> > Fujii Masao <[hidden email]> writes:
> >> Thanks! So I pushed the patch.
> >
> > Seems like the test case added by this commit is not working on
> > Windows.
>
> Thanks for the report! My analysis is [1], and I pushed the proposed patch.
>
> [1]
> https://postgr.es/m/3cc3909d-f779-7a74-c201-f1f7f62c7497@...
> m
Fujii-san, Tom,
thank you so much for your quick analysis and modification.
I appreciate those.


Best Regards,
        Takamichi Osumi

123