PATCH: standby crashed when replay block which truncated in standby but failed to truncate in master node


Thunder
Steps to reproduce this issue:
1. Create a table
    create table gist_point_tbl(id int4, p point);
    create index gist_pointidx on gist_point_tbl using gist(p);
2. Insert data
    insert into gist_point_tbl (id, p) select g, point(g*10, g*10) from generate_series(1, 1000000) g;
3. Delete data
     delete from gist_point_tbl;
4. Vacuum table
    vacuum gist_point_tbl;
    -- Send SIGINT to the vacuum process after the WAL record of the truncation is flushed but before the truncation is finished
    -- We will receive the error message "ERROR:  canceling statement due to user request"
5. Vacuum table again
    vacuum gist_point_tbl;
    -- The standby node crashes with "PANIC:  WAL contains references to invalid pages"

The standby node succeeds in replaying the truncation record, but the master never actually truncates the file; the standby then crashes once the master writes to blocks that were already truncated on the standby.
I try to fix this issue by preventing query-cancel interrupts during truncation.

diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 5df4382b7e..04b696ae01 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -26,6 +26,7 @@
 #include "access/xlogutils.h"
 #include "catalog/storage.h"
 #include "catalog/storage_xlog.h"
+#include "miscadmin.h"
 #include "storage/freespace.h"
 #include "storage/smgr.h"
 #include "utils/memutils.h"
@@ -248,6 +249,14 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
        if (vm)
                visibilitymap_truncate(rel, nblocks);

+       /*
+        * If the master flushes the WAL record of the truncation and then
+        * receives a SIGINT cancelling the transaction before the physical
+        * truncation happens, a standby that has replayed the record will
+        * crash when the master later writes to the blocks that were
+        * truncated only on the standby.  So prevent query cancel
+        * interrupts across this window.
+        */
+       HOLD_CANCEL_INTERRUPTS();
+
        /*
         * We WAL-log the truncation before actually truncating, which means
         * trouble if the truncation fails. If we then crash, the WAL replay
@@ -288,6 +297,8 @@ RelationTruncate(Relation rel, BlockNumber nblocks)

        /* Do the real work */
        smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
+
+       RESUME_CANCEL_INTERRUPTS();
 }


 


Re: PATCH: standby crashed when replay block which truncated in standby but failed to truncate in master node

Thunder
Is this an issue? 
Can we fix like this?
Thanks!





At 2019-09-22 00:38:03, "Thunder" <[hidden email]> wrote:
> [...]


Re: PATCH: standby crashed when replay block which truncated in standby but failed to truncate in master node

Tomas Vondra-4
On Mon, Sep 23, 2019 at 03:48:50PM +0800, Thunder wrote:
>Is this an issue?
>Can we fix like this?
>Thanks!
>

I do think it is a valid issue. No opinion on the fix yet, though.
The report was sent on Saturday, so patience ;-)

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: standby crashed when replay block which truncated in standby but failed to truncate in master node

Michael Paquier-2
On Mon, Sep 23, 2019 at 01:45:14PM +0200, Tomas Vondra wrote:
> On Mon, Sep 23, 2019 at 03:48:50PM +0800, Thunder wrote:
>> Is this an issue?
>> Can we fix like this?
>> Thanks!
>>
>
> I do think it is a valid issue. No opinion on the fix yet, though.
> The report was sent on saturday, so patience ;-)

And for some others it was even a longer weekend.  Anyway, the problem
can be reproduced if you apply the attached which introduces a failure
point, and then if you run the following commands:
create table aa as select 1;
delete from aa;
\! touch /tmp/truncate_flag
vacuum aa;
\! rm /tmp/truncate_flag
vacuum aa; -- panic on standby

This also points out that there are other things to worry about than
interruptions, as for example DropRelFileNodeLocalBuffers() could lead
to an ERROR, and this happens before the physical truncation is done
but after the WAL record is replayed on the standby, so any failures
happening at the truncation phase before the work is done would be a
problem.  However we are talking about failures which should not
happen and these are elog() calls.  It would be tempting to add a
critical section here, but we could still have problems if we have a
failure after the WAL record has been flushed, which means that it
would be replayed on the standby, and the surrounding comments are
clear about that.  In short, as a matter of safety I'd like to think
that what you are suggesting is rather acceptable (aka hold interrupts
before the WAL record is written and release them after the physical
truncate), so that truncation avoids the failures that can be avoided.

Do others have thoughts to share on the matter?
--
Michael

truncate_failure_point.patch (525 bytes)

Re: PATCH: standby crashed when replay block which truncated in standby but failed to truncate in master node

Kyotaro Horiguchi-4
Hello.

At Tue, 24 Sep 2019 10:40:19 +0900, Michael Paquier <[hidden email]> wrote in <[hidden email]>

> On Mon, Sep 23, 2019 at 01:45:14PM +0200, Tomas Vondra wrote:
> > On Mon, Sep 23, 2019 at 03:48:50PM +0800, Thunder wrote:
> >> Is this an issue?
> >> Can we fix like this?
> >> Thanks!
> >>
> >
> > I do think it is a valid issue. No opinion on the fix yet, though.
> > The report was sent on saturday, so patience ;-)
>
> And for some others it was even a longer weekend.  Anyway, the problem
> can be reproduced if you apply the attached which introduces a failure
> point, and then if you run the following commands:
> create table aa as select 1;
> delete from aa;
> \! touch /tmp/truncate_flag
> vacuum aa;
> \! rm /tmp/truncate_flag
> vacuum aa; -- panic on standby
>
> This also points out that there are other things to worry about than
> interruptions, as for example DropRelFileNodeLocalBuffers() could lead
> to an ERROR, and this happens before the physical truncation is done
> but after the WAL record is replayed on the standby, so any failures
> happening at the truncation phase before the work is done would be a

Indeed.

> problem.  However we are talking about failures which should not
> happen and these are elog() calls.  It would be tempting to add a
> critical section here, but we could still have problems if we have a
> failure after the WAL record has been flushed, which means that it
> would be replayed on the standby, and the surrounding comments are

Agreed.

> clear about that.  In short, as a matter of safety I'd like to think
> that what you are suggesting is rather acceptable (aka hold interrupts
> before the WAL record is written and release them after the physical
> truncate), so that truncation avoids the failures that can be avoided.
>
> Do others have thoughts to share on the matter?

Agreed with the concept, but does the patch work as described? It
seems that a query cancel can't fire during the held-off
section, since there is no CHECK_FOR_INTERRUPTS() there.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



Re: PATCH: standby crashed when replay block which truncated in standby but failed to truncate in master node

Kyotaro Horiguchi-4
At Tue, 24 Sep 2019 12:46:19 +0900 (Tokyo Standard Time), Kyotaro Horiguchi <[hidden email]> wrote in <[hidden email]>

> > clear about that.  In short, as a matter of safety I'd like to think
> > that what you are suggesting is rather acceptable (aka hold interrupts
> > before the WAL record is written and release them after the physical
> > truncate), so that truncation avoids the failures that can be avoided.
> >
> > Do others have thoughts to share on the matter?
>
> Agreed with the concept, but does the patch work as described? It
> seems that a query cancel can't fire during the held-off
> section, since there is no CHECK_FOR_INTERRUPTS() there.

Of course I found no *explicit* ones. But I found an
ereport(DEBUG1) in register_dirty_segment, so it will work at
least for the case where the fsync request queue is full.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



Re: PATCH: standby crashed when replay block which truncated in standby but failed to truncate in master node

Michael Paquier-2
On Tue, Sep 24, 2019 at 02:48:16PM +0900, Kyotaro Horiguchi wrote:
> Of course I found no *explicit* ones. But I found an
> ereport(DEBUG1) in register_dirty_segment, so it will work at
> least for the case where the fsync request queue is full.

Exactly.  I have not checked the patch in detail, but I think that
we should not rely on the assumption that no code path in this area
ends up calling CHECK_FOR_INTERRUPTS(), as smgrtruncate() does much
more than just the physical segment truncation.
--
Michael


Re: PATCH: standby crashed when replay block which truncated in standby but failed to truncate in master node

Kyotaro Horiguchi-4
At Wed, 25 Sep 2019 12:24:03 +0900, Michael Paquier <[hidden email]> wrote in <[hidden email]>
> On Tue, Sep 24, 2019 at 02:48:16PM +0900, Kyotaro Horiguchi wrote:
> > Of course I found no *explicit* ones. But I found an
> > ereport(DEBUG1) in register_dirty_segment, so it will work at
> > least for the case where the fsync request queue is full.
>
> Exactly.  I have not checked the patch in detail, but I think that
> we should not rely on the assumption that no code path in this area
> ends up calling CHECK_FOR_INTERRUPTS(), as smgrtruncate() does much
> more than just the physical segment truncation.

Agreed on that point. I just doubted that it really fixes the
author's problem, and confirmed that it can.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



Re: PATCH: standby crashed when replay block which truncated in standby but failed to truncate in master node

Fujii Masao-2
In reply to this post by Michael Paquier-2
On Tue, Sep 24, 2019 at 10:41 AM Michael Paquier <[hidden email]> wrote:

> [...]
> This also points out that there are other things to worry about than
> interruptions, as for example DropRelFileNodeLocalBuffers() could lead
> to an ERROR, and this happens before the physical truncation is done
> but after the WAL record is replayed on the standby, so any failures
> happening at the truncation phase before the work is done would be a
> problem.  However we are talking about failures which should not
> happen and these are elog() calls.  It would be tempting to add a
> critical section here, but we could still have problems if we have a
> failure after the WAL record has been flushed, which means that it
> would be replayed on the standby, and the surrounding comments are
> clear about that.

Could you elaborate on what problem adding a critical section there would cause?

Regards,

--
Fujii Masao



Re: PATCH: standby crashed when replay block which truncated in standby but failed to truncate in master node

Michael Paquier-2
On Thu, Sep 26, 2019 at 01:13:56AM +0900, Fujii Masao wrote:

> On Tue, Sep 24, 2019 at 10:41 AM Michael Paquier <[hidden email]> wrote:
>> This also points out that there are other things to worry about than
>> interruptions, as for example DropRelFileNodeLocalBuffers() could lead
>> to an ERROR, and this happens before the physical truncation is done
>> but after the WAL record is replayed on the standby, so any failures
>> happening at the truncation phase before the work is done would be a
>> problem.  However we are talking about failures which should not
>> happen and these are elog() calls.  It would be tempting to add a
>> critical section here, but we could still have problems if we have a
>> failure after the WAL record has been flushed, which means that it
>> would be replayed on the standby, and the surrounding comments are
>> clear about that.
>
> Could you elaborate on what problem adding a critical section there would cause?
Wrapping the call of smgrtruncate() within RelationTruncate() to use a
critical section would make things worse from the user perspective on
the primary, no?  If the physical truncation fails, we would still
fail WAL replay on the standby, but instead of generating an ERROR in
the session of the user attempting the TRUNCATE, the whole primary
would be taken down.
--
Michael


Re: PATCH: standby crashed when replay block which truncated in standby but failed to truncate in master node

Fujii Masao-2
On Fri, Sep 27, 2019 at 3:14 PM Michael Paquier <[hidden email]> wrote:

>
> On Thu, Sep 26, 2019 at 01:13:56AM +0900, Fujii Masao wrote:
> [...]
> > Could you elaborate on what problem adding a critical section there would cause?
>
> Wrapping the call of smgrtruncate() within RelationTruncate() to use a
> critical section would make things worse from the user perspective on
> the primary, no?  If the physical truncation fails, we would still
> fail WAL replay on the standby, but instead of generating an ERROR in
> the session of the user attempting the TRUNCATE, the whole primary
> would be taken down.

Thanks for elaborating that! Understood.

But this can cause subsequent recovery to always fail with an
invalid-pages error and the server not to start up. This is bad. So, to
alleviate the situation, I'm thinking it would be worth adding a
developer parameter like ignore_invalid_pages. When this parameter is
set to true, the startup process always ignores invalid-pages errors.
Thoughts?
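If such a parameter were adopted, using it would look like any other server-start GUC; this sketch assumes the name and semantics proposed here (a developer-only escape hatch, off by default), not a released option:

```ini
# postgresql.conf -- hypothetical developer option from this proposal;
# can only be set at server start, and should only be used to get a
# damaged cluster through recovery.
ignore_invalid_pages = on
```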

Regards,

--
Fujii Masao



Re: PATCH: standby crashed when replay block which truncated in standby but failed to truncate in master node

Michael Paquier-2
On Thu, Oct 03, 2019 at 01:49:34PM +0900, Fujii Masao wrote:
> But this can cause subsequent recovery to always fail with an
> invalid-pages error and the server not to start up. This is bad. So, to
> alleviate the situation, I'm thinking it would be worth adding a
> developer parameter like ignore_invalid_pages. When this parameter is
> set to true, the startup process always ignores invalid-pages errors.
> Thoughts?

That could be helpful.
--
Michael


Re: PATCH: standby crashed when replay block which truncated in standby but failed to truncate in master node

Fujii Masao-2
On Thu, Oct 3, 2019 at 1:57 PM Michael Paquier <[hidden email]> wrote:
>
> On Thu, Oct 03, 2019 at 01:49:34PM +0900, Fujii Masao wrote:
> > But this can cause subsequent recovery to always fail with invalid-pages error
> > and the server not to start up. This is bad. So, to allviate the situation,
> > I'm thinking it would be worth adding something like igore_invalid_pages
> > developer parameter. When this parameter is set to true, the startup process
> > always ignores invalid-pages errors. Thought?
>
> That could be helpful.

So the attached patch adds a new developer GUC, "ignore_invalid_pages".
Setting ignore_invalid_pages to true causes the system
to ignore the failure (but still report a warning) and continue recovery.

I will add this to next CommitFest.

Regards,

--
Fujii Masao

ignore_invalid_pages_v1.patch (5K)

Re: PATCH: standby crashed when replay block which truncated in standby but failed to truncate in master node

Michael Paquier-2
On Thu, Oct 03, 2019 at 05:54:40PM +0900, Fujii Masao wrote:

> [...]
> So the attached patch adds a new developer GUC, "ignore_invalid_pages".
> Setting ignore_invalid_pages to true causes the system
> to ignore the failure (but still report a warning) and continue recovery.
>
> I will add this to next CommitFest.
No actual objections against this patch from me as a dev option.

+        Detection of WAL records having references to invalid pages during
+        recovery causes <productname>PostgreSQL</productname> to report
+        an error, aborting the recovery. Setting
Well, that's not really an error.  This triggers a PANIC, aka crashes
the server.  And in this case the actual problem is that you may not
be able to move on with recovery when restarting the server again,
unless luck is on your side, because you would continuously face
it.

+        recovery. This behavior may <emphasis>cause crashes, data loss,
+         propagate or hide corruption, or other serious problems</emphasis>.
Nit: indentation on the second line here.

+        However, it may allow you to get past the error, finish the recovery,
+        and cause the server to start up.
For consistency here I would suggest the second part of the sentence
to be "TO finish recovery, and TO cause the server to start up".

+        The default setting is off, and it can only be set at server start.
Nit^2: Missing a <literal> markup for "off"?
--
Michael


Re: PATCH: standby crashed when replay block which truncated in standby but failed to truncate in master node

Fujii Masao-2
On Fri, Nov 29, 2019 at 11:39 AM Michael Paquier <[hidden email]> wrote:

> [...]
>
> No actual objections against this patch from me as a dev option.
Thanks for the review! Attached is the updated version of the patch.

> +        Detection of WAL records having references to invalid pages during
> +        recovery causes <productname>PostgreSQL</productname> to report
> +        an error, aborting the recovery. Setting
> Well, that's not really an error.  This triggers a PANIC, aka crashes
> the server.  And in this case the actual problem is that you may not
> be able to move on with recovery when restarting the server again,
> except if luck is on your side because you would continuously face
> it..

So you're thinking that "report an error" should be changed to
"trigger a PANIC"? Personally "report an error" sounds OK to me, because
a PANIC is a kind of error. But if that misleads people,
I will change the sentence.

> +        recovery. This behavior may <emphasis>cause crashes, data loss,
> +         propagate or hide corruption, or other serious problems</emphasis>.
> Nit: indentation on the second line here.

Yes, I fixed that.

> +        However, it may allow you to get past the error, finish the recovery,
> +        and cause the server to start up.
> For consistency here I would suggest the second part of the sentence
> to be "TO finish recovery, and TO cause the server to start up".

Yes, I fixed that.

> +        The default setting is off, and it can only be set at server start.
> Nit^2: Missing a <literal> markup for "off"?

Yes, I fixed that.

Regards,

--
Fujii Masao

ignore_invalid_pages_v2.patch (5K)

Re: PATCH: standby crashed when replay block which truncated in standby but failed to truncate in master node

Michael Paquier-2
On Mon, Dec 16, 2019 at 12:22:18PM +0900, Fujii Masao wrote:

> > +        Detection of WAL records having references to invalid pages during
> > +        recovery causes <productname>PostgreSQL</productname> to report
> > +        an error, aborting the recovery. Setting
> > Well, that's not really an error.  This triggers a PANIC, aka crashes
> > the server.  And in this case the actual problem is that you may not
> > be able to move on with recovery when restarting the server again,
> > except if luck is on your side because you would continuously face
> > it..
>
> So you're thinking that "report an error" should be changed to
> "trigger a PANIC"? Personally "report an error" sounds ok because
> PANIC is one of "error", I think. But if that misleads people,
> I will change the sentence.
In the context of recovery, an ERROR is promoted to a FATAL, but
here we are talking about something that bypasses the crash of the
server.  So this could bring confusion.  I think that the
documentation should be crystal clear about that, with two aspects
outlined when the parameter is disabled, somewhat like data_sync_retry
actually:
- A PANIC-level error is triggered.
- It crashes the cluster.
--
Michael


Re: PATCH: standby crashed when replay block which truncated in standby but failed to truncate in master node

Fujii Masao-2
On Tue, Dec 17, 2019 at 2:19 PM Michael Paquier <[hidden email]> wrote:

>
> [...]
>
> In the context of recovery, an ERROR is promoted to a FATAL, but
> here we are talking about something that bypasses the crash of the
> server.  So this could bring confusion.  I think that the
> documentation should be crystal clear about that, with two aspects
> outlined when the parameter is disabled, somewhat like data_sync_retry
> actually:
> - A PANIC-level error is triggered.
> - It crashes the cluster.
OK, I updated the patch that way.
Attached is the updated version of the patch.

Regards,

--
Fujii Masao

ignore_invalid_pages_v3.patch (5K)

Re: PATCH: standby crashed when replay block which truncated in standby but failed to truncate in master node

Michael Paquier-2
On Thu, Jan 16, 2020 at 11:17:36PM +0900, Fujii Masao wrote:
> OK, I updated the patch that way.
> Attached is the updated version of the patch.

Thanks.  I have a few tweaks to propose to the docs.

+        raise a PANIC-level error, aborting the recovery. Setting
Instead of "PANIC-level error", I would just use "PANIC error", and
instead of "aborting the recovery" just "crashing the server".

+        causes the system to ignore those WAL records
WAL records are not ignored, but errors caused by incorrect page
references in those WAL records are.  The current phrasing sounds like
the WAL records are not applied.

Another thing that I just recalled.  Do you think that it would be
better to mention that invalid page references can only be seen after
reaching the consistent point during recovery?  The information given
looks enough, but I was just wondering if that's worth documenting or
not.
--
Michael


Re: PATCH: standby crashed when replay block which truncated in standby but failed to truncate in master node

Fujii Masao-2
On Fri, Jan 17, 2020 at 1:47 PM Michael Paquier <[hidden email]> wrote:
>
> On Thu, Jan 16, 2020 at 11:17:36PM +0900, Fujii Masao wrote:
> > OK, I updated the patch that way.
> > Attached is the updated version of the patch.
>
> Thanks.  I have a few tweaks to propose to the docs.
>
> +        raise a PANIC-level error, aborting the recovery. Setting
> Instead of "PANIC-level error", I would just use "PANIC error", and

I have no strong opinion about this, but I used "PANIC-level error"
because the description for data_sync_retry has already used it.

> instead of "aborting the recovery" just "crashing the server".

PANIC implies server crash, so IMO "crashing the server" is
a bit redundant, and "aborting the recovery" is better because
"continue the recovery" is used later.

> +        causes the system to ignore those WAL records
> WAL records are not ignored, but errors caused by incorrect page
> references in those WAL records are.  The current phrasing sounds like
> the WAL records are not applied.

So, what about

---------------
causes the system to ignore invalid page references in WAL records
(but still report a warning), and continue the recovery.
---------------

> Another thing that I just recalled.  Do you think that it would be
> better to mention that invalid page references can only be seen after
> reaching the consistent point during recovery?  The information given
> looks enough, but I was just wondering if that's worth documenting or
> not.

ISTM that this is not information that users need to understand...

Regards,

--
Fujii Masao



Re: PATCH: standby crashed when replay block which truncated in standby but failed to truncate in master node

Michael Paquier-2
On Fri, Jan 17, 2020 at 07:36:51PM +0900, Fujii Masao wrote:
> On Fri, Jan 17, 2020 at 1:47 PM Michael Paquier <[hidden email]> wrote:
>> Thanks.  I have few tweaks to propose to the docs.
>>
>> +        raise a PANIC-level error, aborting the recovery. Setting
>> Instead of "PANIC-level error", I would just use "PANIC error", and
>
> I have no strong opinion about this, but I used "PANIC-level error"
> because the description for data_sync_retry has already used it.

Okay.  Fine with what you think is good.

>> instead of "aborting the recovery" just "crashing the server".
>
> PANIC implies server crash, so IMO "crashing the server" is
> a bit redundant, and "aborting the recovery" is better because
> "continue the recovery" is used later.

Okay.  I see your point here.

> So, what about
>
> ---------------
> causes the system to ignore invalid page references in WAL records
> (but still report a warning), and continue the recovery.
> ---------------

And that sounds good to me.  Switching the patch to ready for
committer.
--
Michael
