Assertion failure when the non-exclusive pg_stop_backup aborted.

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
28 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Assertion failure when the non-exclusive pg_stop_backup aborted.

Masahiko Sawada
Hi,

I got an assert failure when executing pg_terminate_backend to the
process that waiting for WAL to be archived in non-exclusive backup
mode.

TRAP: FailedAssertion("!(XLogCtl->Insert.nonExclusiveBackups > 0)",
File: "xlog.c", Line: 11123)

The one reproducing procedure is,
1. Start postgres with enabling the WAL archive
2. Execute pg_start_backup()
3. Revoke the access privileges of archive directory. (e.g., chown
root:root /path/to/archive)
4. Execute pg_stop_backup() and hangs
5. Execute pg_terminate_backend() to the process that is waiting for
WAL to be archived.
Got the assertion failure.

Perhaps we can reproduce it using pg_basebackup as well.

I think the cause of this is that do_pg_abort_backup can be called
even after we decremented XLogCtl->Insert.nonExclusiveBackups in
do_pg_stop_backup(). That is, do_pg_abort_backup can be called with
XLogCtl->Insert.nonExclusiveBackups = 0 when the transaction aborts
after processed the nonExclusiveBackups (e.g, during waiting for WAL
to be archived)
The bug can happen in PostgreSQL 9.1 or higher that non-exclusive
backup has been introduced, so we should back-patch to the all
supported versions.

I changed do_pg_abort_backup() so that it decrements
nonExclusiveBackups only if > 0. Feedback is very welcome.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

fix_do_pg_abort_backup.patch (1K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Assertion failure when the non-exclusive pg_stop_backup aborted.

Michael Paquier
On Thu, Sep 21, 2017 at 1:07 AM, Masahiko Sawada <[hidden email]> wrote:
> The bug can happen in PostgreSQL 9.1 or higher that non-exclusive
> backup has been introduced, so we should back-patch to the all
> supported versions.

There is a typo here right? Non-exclusive backups have been introduced
in 9.6. Why would a back-patch further down be needed?

> I changed do_pg_abort_backup() so that it decrements
> nonExclusiveBackups only if > 0. Feedback is very welcome.

+-  Assert(XLogCtl->Insert.nonExclusiveBackups >= 0);
+   if (XLogCtl->Insert.nonExclusiveBackups > 0)
+       XLogCtl->Insert.nonExclusiveBackups--;
Hm, no, I don't agree. I think that instead you should just leave
do_pg_abort_backup() immediately if sessionBackupState is set to
SESSION_BACKUP_NONE. This variable is the link between the global
counters and the session stopping the backup so I don't think that we
should touch this assertion of this counter. I think that this method
would be safe as well for backup start as pg_start_backup_callback
takes care of any cleanup. Also because the counters are incremented
before entering in the PG_ENSURE_ERROR_CLEANUP block, and
sessionBackupState is updated just after leaving the block.

About the patch:
+-  Assert(XLogCtl->Insert.nonExclusiveBackups >= 0);
There is a typo on this line.

Adding that to the next CF would be a good idea so as we don't forget
about it. Feel free to add me as reviewer.
--
Michael


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Assertion failure when the non-exclusive pg_stop_backup aborted.

Masahiko Sawada
On Thu, Sep 21, 2017 at 2:25 PM, Michael Paquier
<[hidden email]> wrote:
> On Thu, Sep 21, 2017 at 1:07 AM, Masahiko Sawada <[hidden email]> wrote:
>> The bug can happen in PostgreSQL 9.1 or higher that non-exclusive
>> backup has been introduced, so we should back-patch to the all
>> supported versions.
>
> There is a typo here right? Non-exclusive backups have been introduced
> in 9.6. Why would a back-patch further down be needed?

I think the non-exclusive backups infrastructure has been introduced
in 9.1 for pg_basebackup. I've not checked yet that it can be
reproduced using pg_basebackup in PG9.1 but I thought it could happen
as far as I checked the code.

>
>> I changed do_pg_abort_backup() so that it decrements
>> nonExclusiveBackups only if > 0. Feedback is very welcome.
>
> +-  Assert(XLogCtl->Insert.nonExclusiveBackups >= 0);
> +   if (XLogCtl->Insert.nonExclusiveBackups > 0)
> +       XLogCtl->Insert.nonExclusiveBackups--;
> Hm, no, I don't agree. I think that instead you should just leave
> do_pg_abort_backup() immediately if sessionBackupState is set to
> SESSION_BACKUP_NONE. This variable is the link between the global
> counters and the session stopping the backup so I don't think that we
> should touch this assertion of this counter. I think that this method
> would be safe as well for backup start as pg_start_backup_callback
> takes care of any cleanup. Also because the counters are incremented
> before entering in the PG_ENSURE_ERROR_CLEANUP block, and
> sessionBackupState is updated just after leaving the block.

I think that the assertion failure still can happen if the process
aborts after decremented the counter and before setting to
SESSION_BACKUP_NONE. Am I missing something?

>
> About the patch:
> +-  Assert(XLogCtl->Insert.nonExclusiveBackups >= 0);
> There is a typo on this line.
>
> Adding that to the next CF would be a good idea so as we don't forget
> about it. Feel free to add me as reviewer.

Thank you!

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Assertion failure when the non-exclusive pg_stop_backup aborted.

Michael Paquier
On Thu, Sep 21, 2017 at 4:40 PM, Masahiko Sawada <[hidden email]> wrote:

> On Thu, Sep 21, 2017 at 2:25 PM, Michael Paquier
> <[hidden email]> wrote:
>> On Thu, Sep 21, 2017 at 1:07 AM, Masahiko Sawada <[hidden email]> wrote:
>>> The bug can happen in PostgreSQL 9.1 or higher that non-exclusive
>>> backup has been introduced, so we should back-patch to the all
>>> supported versions.
>>
>> There is a typo here right? Non-exclusive backups have been introduced
>> in 9.6. Why would a back-patch further down be needed?
>
> I think the non-exclusive backups infrastructure has been introduced
> in 9.1 for pg_basebackup. I've not checked yet that it can be
> reproduced using pg_basebackup in PG9.1 but I thought it could happen
> as far as I checked the code.

Yep, but the deficiency is caused by the use before_shmem_exit() in
the SQL-level functions present in 9.6~ which make the cleanup happen
potentially twice. If you look at the code of basebackup.c, you would
notice that the cleanups of the counters only happen within the
PG_ENSURE_ERROR_CLEANUP() blocks, so a backpatch to 9.6 should be
enough.

>> +-  Assert(XLogCtl->Insert.nonExclusiveBackups >= 0);
>> +   if (XLogCtl->Insert.nonExclusiveBackups > 0)
>> +       XLogCtl->Insert.nonExclusiveBackups--;
>> Hm, no, I don't agree. I think that instead you should just leave
>> do_pg_abort_backup() immediately if sessionBackupState is set to
>> SESSION_BACKUP_NONE. This variable is the link between the global
>> counters and the session stopping the backup so I don't think that we
>> should touch this assertion of this counter. I think that this method
>> would be safe as well for backup start as pg_start_backup_callback
>> takes care of any cleanup. Also because the counters are incremented
>> before entering in the PG_ENSURE_ERROR_CLEANUP block, and
>> sessionBackupState is updated just after leaving the block.
>
> I think that the assertion failure still can happen if the process
> aborts after decremented the counter and before setting to
> SESSION_BACKUP_NONE. Am I missing something?

The process would stop at the next CHECK_FOR_INTERRUPTS() and trigger
the cleanup at this moment. So this happens when waiting for the
archives to be done, and the session flag is set to NONE at this
point.
--
Michael


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Assertion failure when the non-exclusive pg_stop_backup aborted.

Michael Paquier
On Thu, Sep 21, 2017 at 5:56 PM, Michael Paquier
<[hidden email]> wrote:

> On Thu, Sep 21, 2017 at 4:40 PM, Masahiko Sawada <[hidden email]> wrote:
>> On Thu, Sep 21, 2017 at 2:25 PM, Michael Paquier
>> <[hidden email]> wrote:
>>> +-  Assert(XLogCtl->Insert.nonExclusiveBackups >= 0);
>>> +   if (XLogCtl->Insert.nonExclusiveBackups > 0)
>>> +       XLogCtl->Insert.nonExclusiveBackups--;
>>> Hm, no, I don't agree. I think that instead you should just leave
>>> do_pg_abort_backup() immediately if sessionBackupState is set to
>>> SESSION_BACKUP_NONE. This variable is the link between the global
>>> counters and the session stopping the backup so I don't think that we
>>> should touch this assertion of this counter. I think that this method
>>> would be safe as well for backup start as pg_start_backup_callback
>>> takes care of any cleanup. Also because the counters are incremented
>>> before entering in the PG_ENSURE_ERROR_CLEANUP block, and
>>> sessionBackupState is updated just after leaving the block.
>>
>> I think that the assertion failure still can happen if the process
>> aborts after decremented the counter and before setting to
>> SESSION_BACKUP_NONE. Am I missing something?
>
> The process would stop at the next CHECK_FOR_INTERRUPTS() and trigger
> the cleanup at this moment. So this happens when waiting for the
> archives to be done, and the session flag is set to NONE at this
> point.

And actually, with two non-exclusive backups taken in parallel, it is
still possible to fail on another assertion in do_pg_stop_backup()
even with your patch. Imagine the following:
1) Two backups are taken, counter for non-exclusive backups is set at 2.
2) One backup is stopped, then interrupted. This causes the counter to
be decremented twice, once in do_pg_stop_backup, and once when
aborting. Counter is at 0, switching as well forcePageWrites to
false..
3) The second backup stops, a new assertion failure is triggered.
Without assertions the counter would get at -1.
--
Michael


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Assertion failure when the non-exclusive pg_stop_backup aborted.

Craig Ringer-3
In reply to this post by Michael Paquier
On 21 September 2017 at 16:56, Michael Paquier <[hidden email]> wrote:
On Thu, Sep 21, 2017 at 4:40 PM, Masahiko Sawada <[hidden email]> wrote:
> On Thu, Sep 21, 2017 at 2:25 PM, Michael Paquier
> <[hidden email]> wrote:
>> On Thu, Sep 21, 2017 at 1:07 AM, Masahiko Sawada <[hidden email]> wrote:
>>> The bug can happen in PostgreSQL 9.1 or higher that non-exclusive
>>> backup has been introduced, so we should back-patch to the all
>>> supported versions.
>>
>> There is a typo here right? Non-exclusive backups have been introduced
>> in 9.6. Why would a back-patch further down be needed?
>
> I think the non-exclusive backups infrastructure has been introduced
> in 9.1 for pg_basebackup. I've not checked yet that it can be
> reproduced using pg_basebackup in PG9.1 but I thought it could happen
> as far as I checked the code.

Yep, but the deficiency is caused by the use before_shmem_exit() in
the SQL-level functions present in 9.6~ which make the cleanup happen
potentially twice. If you look at the code of basebackup.c, you would
notice that the cleanups of the counters only happen within the
PG_ENSURE_ERROR_CLEANUP() blocks, so a backpatch to 9.6 should be
enough.

>> +-  Assert(XLogCtl->Insert.nonExclusiveBackups >= 0);
>> +   if (XLogCtl->Insert.nonExclusiveBackups > 0)
>> +       XLogCtl->Insert.nonExclusiveBackups--;
>> Hm, no, I don't agree. I think that instead you should just leave
>> do_pg_abort_backup() immediately if sessionBackupState is set to
>> SESSION_BACKUP_NONE. This variable is the link between the global
>> counters and the session stopping the backup so I don't think that we
>> should touch this assertion of this counter. I think that this method
>> would be safe as well for backup start as pg_start_backup_callback
>> takes care of any cleanup. Also because the counters are incremented
>> before entering in the PG_ENSURE_ERROR_CLEANUP block, and
>> sessionBackupState is updated just after leaving the block.
>
> I think that the assertion failure still can happen if the process
> aborts after decremented the counter and before setting to
> SESSION_BACKUP_NONE. Am I missing something?

The process would stop at the next CHECK_FOR_INTERRUPTS() and trigger
the cleanup at this moment. So this happens when waiting for the
archives to be done, and the session flag is set to NONE at this
point.

Another one to watch out for is that elog(...) and ereport(...) invoke CHECK_FOR_INTERRUPTS. That's given me exciting surprises before when combined with assertion checking and various exit cleanup hooks.

--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers



--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Reply | Threaded
Open this post in threaded view
|

Re: Assertion failure when the non-exclusive pg_stop_backup aborted.

Michael Paquier
On Fri, Sep 22, 2017 at 10:41 AM, Craig Ringer <[hidden email]> wrote:
> Another one to watch out for is that elog(...) and ereport(...) invoke
> CHECK_FOR_INTERRUPTS. That's given me exciting surprises before when
> combined with assertion checking and various exit cleanup hooks.

Ahah. Good point. In this case LWLockWaitForVar() is one thing to
worry about if LWLock tracing is enabled because it can log things
before holding the existing interrupts. This can be avoided easily in
the case of this issue by updating sessionBackupState before calling
WALInsertLockRelease and while holding the WAL insert lock. Surely
this meritates a comment in the patch we want here.
--
Michael


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Assertion failure when the non-exclusive pg_stop_backup aborted.

Masahiko Sawada
In reply to this post by Michael Paquier
On Thu, Sep 21, 2017 at 5:56 PM, Michael Paquier
<[hidden email]> wrote:
>
> Yep, but the deficiency is caused by the use before_shmem_exit() in
> the SQL-level functions present in 9.6~ which make the cleanup happen
> potentially twice. If you look at the code of basebackup.c, you would
> notice that the cleanups of the counters only happen within the
> PG_ENSURE_ERROR_CLEANUP() blocks, so a backpatch to 9.6 should be
> enough.

Thank you for explaining. I understood and agreed..

On Fri, Sep 22, 2017 at 8:02 AM, Michael Paquier
<[hidden email]> wrote:

> On Thu, Sep 21, 2017 at 5:56 PM, Michael Paquier
> <[hidden email]> wrote:
>> On Thu, Sep 21, 2017 at 4:40 PM, Masahiko Sawada <[hidden email]> wrote:
>>> On Thu, Sep 21, 2017 at 2:25 PM, Michael Paquier
>>> <[hidden email]> wrote:
>>>> +-  Assert(XLogCtl->Insert.nonExclusiveBackups >= 0);
>>>> +   if (XLogCtl->Insert.nonExclusiveBackups > 0)
>>>> +       XLogCtl->Insert.nonExclusiveBackups--;
>>>> Hm, no, I don't agree. I think that instead you should just leave
>>>> do_pg_abort_backup() immediately if sessionBackupState is set to
>>>> SESSION_BACKUP_NONE. This variable is the link between the global
>>>> counters and the session stopping the backup so I don't think that we
>>>> should touch this assertion of this counter. I think that this method
>>>> would be safe as well for backup start as pg_start_backup_callback
>>>> takes care of any cleanup. Also because the counters are incremented
>>>> before entering in the PG_ENSURE_ERROR_CLEANUP block, and
>>>> sessionBackupState is updated just after leaving the block.
>>>
>>> I think that the assertion failure still can happen if the process
>>> aborts after decremented the counter and before setting to
>>> SESSION_BACKUP_NONE. Am I missing something?
>>
>> The process would stop at the next CHECK_FOR_INTERRUPTS() and trigger
>> the cleanup at this moment. So this happens when waiting for the
>> archives to be done, and the session flag is set to NONE at this
>> point.
>
> And actually, with two non-exclusive backups taken in parallel, it is
> still possible to fail on another assertion in do_pg_stop_backup()
> even with your patch. Imagine the following:
> 1) Two backups are taken, counter for non-exclusive backups is set at 2.
> 2) One backup is stopped, then interrupted. This causes the counter to
> be decremented twice, once in do_pg_stop_backup, and once when
> aborting. Counter is at 0, switching as well forcePageWrites to
> false..
> 3) The second backup stops, a new assertion failure is triggered.
> Without assertions the counter would get at -1.
You're right. I updated the patch so that it exits do_pg_abort_backup
if the state is NONE and setting the state to NONE in
do_pg_stop_backup before releasing the WAL insert lock.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

fix_do_pg_abort_backup_v2.patch (1K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Assertion failure when the non-exclusive pg_stop_backup aborted.

Michael Paquier
On Fri, Sep 22, 2017 at 11:34 AM, Masahiko Sawada <[hidden email]> wrote:
> You're right. I updated the patch so that it exits do_pg_abort_backup
> if the state is NONE and setting the state to NONE in
> do_pg_stop_backup before releasing the WAL insert lock.

-    /* Clean up session-level lock */
+    /*
+     * Clean up session-level lock. To avoid interrupting before changing
+     * the backup state by LWLockWaitForVar we change it while holding the
+     * WAL insert lock.
+     */
     sessionBackupState = SESSION_BACKUP_NONE;
You could just mention directly CHECK_FOR_INTERRUPTS here.

+    /* Quick exit if we have done the backup */
+    if (XLogCtl->Insert.exclusiveBackupState == EXCLUSIVE_BACKUP_NONE)
+        return;

The patch contents look fine to me. I have also tested things in
depths but could not find any problems. I also looked again at the
backup start code, trying to see any flows between the moment the
session backup lock is changed and the moment the callback to do
backup cleanup is registered but I have not found any issues. I am
marking this as ready for committer.
--
Michael


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Assertion failure when the non-exclusive pg_stop_backup aborted.

Masahiko Sawada
On Fri, Sep 22, 2017 at 3:00 PM, Michael Paquier
<[hidden email]> wrote:

> On Fri, Sep 22, 2017 at 11:34 AM, Masahiko Sawada <[hidden email]> wrote:
>> You're right. I updated the patch so that it exits do_pg_abort_backup
>> if the state is NONE and setting the state to NONE in
>> do_pg_stop_backup before releasing the WAL insert lock.
>
> -    /* Clean up session-level lock */
> +    /*
> +     * Clean up session-level lock. To avoid interrupting before changing
> +     * the backup state by LWLockWaitForVar we change it while holding the
> +     * WAL insert lock.
> +     */
>      sessionBackupState = SESSION_BACKUP_NONE;
> You could just mention directly CHECK_FOR_INTERRUPTS here.

Thank you for the reviewing.
I have a question. Since WALInsertLockRelease seems not to call
LWLockWaitForVar I thought you wanted to mean LWLockReleaseClearVar
instead, is that right?

> +    /* Quick exit if we have done the backup */
> +    if (XLogCtl->Insert.exclusiveBackupState == EXCLUSIVE_BACKUP_NONE)
> +        return;
>
> The patch contents look fine to me. I have also tested things in
> depths but could not find any problems. I also looked again at the
> backup start code, trying to see any flows between the moment the
> session backup lock is changed and the moment the callback to do
> backup cleanup is registered but I have not found any issues. I am
> marking this as ready for committer.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Assertion failure when the non-exclusive pg_stop_backup aborted.

Michael Paquier
On Fri, Sep 22, 2017 at 3:24 PM, Masahiko Sawada <[hidden email]> wrote:
> I have a question. Since WALInsertLockRelease seems not to call
> LWLockWaitForVar I thought you wanted to mean LWLockReleaseClearVar
> instead, is that right?

This looks like a copy-pasto from my side. Thanks for spotting it.
--
Michael


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Assertion failure when the non-exclusive pg_stop_backup aborted.

Masahiko Sawada
On Fri, Sep 22, 2017 at 3:46 PM, Michael Paquier
<[hidden email]> wrote:
> On Fri, Sep 22, 2017 at 3:24 PM, Masahiko Sawada <[hidden email]> wrote:
>> I have a question. Since WALInsertLockRelease seems not to call
>> LWLockWaitForVar I thought you wanted to mean LWLockReleaseClearVar
>> instead, is that right?
>
> This looks like a copy-pasto from my side. Thanks for spotting it.

Okay, attached the latest version patch.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

fix_do_pg_abort_backup_v3.patch (1K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Assertion failure when the non-exclusive pg_stop_backup aborted.

Fujii Masao-2
On Fri, Sep 22, 2017 at 4:42 PM, Masahiko Sawada <[hidden email]> wrote:

> On Fri, Sep 22, 2017 at 3:46 PM, Michael Paquier
> <[hidden email]> wrote:
>> On Fri, Sep 22, 2017 at 3:24 PM, Masahiko Sawada <[hidden email]> wrote:
>>> I have a question. Since WALInsertLockRelease seems not to call
>>> LWLockWaitForVar I thought you wanted to mean LWLockReleaseClearVar
>>> instead, is that right?
>>
>> This looks like a copy-pasto from my side. Thanks for spotting it.
>
> Okay, attached the latest version patch.

+ /* Quick exit if we have done the backup */
+ if (XLogCtl->Insert.exclusiveBackupState == EXCLUSIVE_BACKUP_NONE)
+ return;

This quick exit seems to cause another problem. Please imagine the
case where there is no exclusive backup running, a backend starts
non-exclusive backup and is terminated before calling pg_stop_backup().
In this case, do_pg_abort_backup() should decrement
XLogCtl->Insert.nonExclusiveBackups, but, with the patch, because of
the above quick exit, do_pg_abort_backup() fails to decrement that.

Regards,

--
Fujii Masao

Reply | Threaded
Open this post in threaded view
|

Re: Assertion failure when the non-exclusive pg_stop_backup aborted.

Masahiko Sawada

Thank you for the reviewing!

On Nov 15, 2017 2:59 AM, "Fujii Masao" <[hidden email]> wrote:
On Fri, Sep 22, 2017 at 4:42 PM, Masahiko Sawada <[hidden email]> wrote:
> On Fri, Sep 22, 2017 at 3:46 PM, Michael Paquier
> <[hidden email]> wrote:
>> On Fri, Sep 22, 2017 at 3:24 PM, Masahiko Sawada <[hidden email]> wrote:
>>> I have a question. Since WALInsertLockRelease seems not to call
>>> LWLockWaitForVar I thought you wanted to mean LWLockReleaseClearVar
>>> instead, is that right?
>>
>> This looks like a copy-pasto from my side. Thanks for spotting it.
>
> Okay, attached the latest version patch.

+ /* Quick exit if we have done the backup */
+ if (XLogCtl->Insert.exclusiveBackupState == EXCLUSIVE_BACKUP_NONE)
+ return;

This quick exit seems to cause another problem. Please imagine the
case where there is no exclusive backup running, a backend starts
non-exclusive backup and is terminated before calling pg_stop_backup().
In this case, do_pg_abort_backup() should decrement
XLogCtl->Insert.nonExclusiveBackups, but, with the patch, because of
the above quick exit, do_pg_abort_backup() fails to decrement that.

Hmm, I think that in this case because  exclusiveBackupState is not EXCLUSIVE_BACKUP_NONE, nonExclusiveBackups is surely decremented. Am I missing something?

Regards,

--
Masahiko Sawada
Reply | Threaded
Open this post in threaded view
|

Re: Assertion failure when the non-exclusive pg_stop_backup aborted.

Michael Paquier
On Wed, Nov 15, 2017 at 9:06 AM, Masahiko Sawada <[hidden email]> wrote:

>> On Nov 15, 2017 2:59 AM, "Fujii Masao" <[hidden email]> wrote:
>> + /* Quick exit if we have done the backup */
>> + if (XLogCtl->Insert.exclusiveBackupState == EXCLUSIVE_BACKUP_NONE)
>> + return;
>>
>> This quick exit seems to cause another problem. Please imagine the
>> case where there is no exclusive backup running, a backend starts
>> non-exclusive backup and is terminated before calling pg_stop_backup().
>> In this case, do_pg_abort_backup() should decrement
>> XLogCtl->Insert.nonExclusiveBackups, but, with the patch, because of
>> the above quick exit, do_pg_abort_backup() fails to decrement that.
>
> Hmm, I think that in this case because  exclusiveBackupState is not
> EXCLUSIVE_BACKUP_NONE, nonExclusiveBackups is surely decremented. Am I
> missing something?
Nah. Fujii-san is right here as exclusiveBackupState is never updated
for non-exclusive backups. You need an extra check on
sessionBackupState to make sure that even for non-exclusive backups
the counter is correctly decremented if a non-exclusive session lock
is hold. For an exclusive backup, the session lock can be either
SESSION_BACKUP_EXCLUSIVE if an exclusive backup is stopped on the same
session as the start phase, or SESSION_BACKUP_NONE if the exclusive
backup is stopped from a different session. So you'd basically need
that:
+   /*
+    * Quick exit if we have done the exclusive backup or if session is
+    * not keeping around a started non-exclusive backup.
+    */
+   if (XLogCtl->Insert.exclusiveBackupState == EXCLUSIVE_BACKUP_NONE &&
+       sessionBackupState != SESSION_BACKUP_NON_EXCLUSIVE)
+       return;

At the same time it would be safer at startup phase to update
sessionBackupState when holding the WAL insert lock to prevent other
CHECK_FOR_INTERRUPTS hazards. Suggestion of patch attached.
--
Michael

fix_do_pg_abort_backup_v4.patch (2K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Assertion failure when the non-exclusive pg_stop_backup aborted.

Masahiko Sawada
On Wed, Nov 15, 2017 at 10:05 AM, Michael Paquier
<[hidden email]> wrote:

> On Wed, Nov 15, 2017 at 9:06 AM, Masahiko Sawada <[hidden email]> wrote:
>>> On Nov 15, 2017 2:59 AM, "Fujii Masao" <[hidden email]> wrote:
>>> + /* Quick exit if we have done the backup */
>>> + if (XLogCtl->Insert.exclusiveBackupState == EXCLUSIVE_BACKUP_NONE)
>>> + return;
>>>
>>> This quick exit seems to cause another problem. Please imagine the
>>> case where there is no exclusive backup running, a backend starts
>>> non-exclusive backup and is terminated before calling pg_stop_backup().
>>> In this case, do_pg_abort_backup() should decrement
>>> XLogCtl->Insert.nonExclusiveBackups, but, with the patch, because of
>>> the above quick exit, do_pg_abort_backup() fails to decrement that.
>>
>> Hmm, I think that in this case because  exclusiveBackupState is not
>> EXCLUSIVE_BACKUP_NONE, nonExclusiveBackups is surely decremented. Am I
>> missing something?
>
> Nah. Fujii-san is right here as exclusiveBackupState is never updated
> for non-exclusive backups.

Thank you, I was wrong.

> You need an extra check on
> sessionBackupState to make sure that even for non-exclusive backups
> the counter is correctly decremented if a non-exclusive session lock
> is hold. For an exclusive backup, the session lock can be either
> SESSION_BACKUP_EXCLUSIVE if an exclusive backup is stopped on the same
> session as the start phase, or SESSION_BACKUP_NONE if the exclusive
> backup is stopped from a different session. So you'd basically need
> that:
> +   /*
> +    * Quick exit if we have done the exclusive backup or if session is
> +    * not keeping around a started non-exclusive backup.
> +    */
> +   if (XLogCtl->Insert.exclusiveBackupState == EXCLUSIVE_BACKUP_NONE &&
> +       sessionBackupState != SESSION_BACKUP_NON_EXCLUSIVE)
> +       return;
>
> At the same time it would be safer at startup phase to update
> sessionBackupState when holding the WAL insert lock to prevent other
> CHECK_FOR_INTERRUPTS hazards. Suggestion of patch attached.

I think we need to check only sessionBackupState and don't need to
check XLogCtl->Insert.exclusiveBackupState in do_pg_abort_backup(). We
can quickly return if sessionBackupState !=
SESSION_BACKUP_NON_EXCLUSIVE. In your suggestion, I think we can still
get an assertion failure when pg_stop_backup(false) waiting for
archiving is terminated while concurrent an exclusive backup is in
progress.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Reply | Threaded
Open this post in threaded view
|

Re: Assertion failure when the non-exclusive pg_stop_backup aborted.

Michael Paquier
On Wed, Nov 15, 2017 at 12:12 PM, Masahiko Sawada <[hidden email]> wrote:
> I think we need to check only sessionBackupState and don't need to
> check XLogCtl->Insert.exclusiveBackupState in do_pg_abort_backup(). We
> can quickly return if sessionBackupState !=
> SESSION_BACKUP_NON_EXCLUSIVE. In your suggestion, I think we can still
> get an assertion failure when pg_stop_backup(false) waiting for
> archiving is terminated while concurrent an exclusive backup is in
> progress.

I have just gone through the thread once again, and noticed that it is
actually what I suggested upthread:
https://www.postgresql.org/message-id/CAB7nPqTm5CDrR5Y7yyfKy+PVDZ6dWS_jKG1KStaN5m95gAMTFQ@...
But your v2 posted here did not do that so it is incorrect from the start:
https://www.postgresql.org/message-id/CAD21AoA+isXYL1_ZXMnk9xJhYEL5h6rxJtTovLi7fumqfmCYgg@...

We both got a bit confused here. As do_pg_abort_backup() is only used
for non-exclusive backups (including those taken through the
replication protocol), going through the session lock for checks is
fine. Could you update your patch accordingly please?
--
Michael

Reply | Threaded
Open this post in threaded view
|

Re: Assertion failure when the non-exclusive pg_stop_backup aborted.

Masahiko Sawada
On Wed, Nov 15, 2017 at 2:38 PM, Michael Paquier
<[hidden email]> wrote:

> On Wed, Nov 15, 2017 at 12:12 PM, Masahiko Sawada <[hidden email]> wrote:
>> I think we need to check only sessionBackupState and don't need to
>> check XLogCtl->Insert.exclusiveBackupState in do_pg_abort_backup(). We
>> can quickly return if sessionBackupState !=
>> SESSION_BACKUP_NON_EXCLUSIVE. In your suggestion, I think we can still
>> get an assertion failure when pg_stop_backup(false) waiting for
>> archiving is terminated while concurrent an exclusive backup is in
>> progress.
>
> I have just gone through the thread once again, and noticed that it is
> actually what I suggested upthread:
> https://www.postgresql.org/message-id/CAB7nPqTm5CDrR5Y7yyfKy+PVDZ6dWS_jKG1KStaN5m95gAMTFQ@...
> But your v2 posted here did not do that so it is incorrect from the start:
> https://www.postgresql.org/message-id/CAD21AoA+isXYL1_ZXMnk9xJhYEL5h6rxJtTovLi7fumqfmCYgg@...

Sorry, it's my fault. I didn't mean it but I forgot.

> We both got a bit confused here. As do_pg_abort_backup() is only used
> for non-exclusive backups (including those taken through the
> replication protocol), going through the session lock for checks is
> fine. Could you update your patch accordingly please?

One question is, since we need to check only the session lock I think
that the following change is not necessary. Even if calling
CHECK_FOR_INTERRUPTS after set sessionBackupState =
SESSION_BACKUP_EXCLUSIVE; we never call do_pg_abort_backup(). Is that
right?

@@ -10636,8 +10636,14 @@ do_pg_start_backup(const char *backupidstr,
bool fast, TimeLineID *starttli_p,
        {
                WALInsertLockAcquireExclusive();
                XLogCtl->Insert.exclusiveBackupState =
EXCLUSIVE_BACKUP_IN_PROGRESS;
-               WALInsertLockRelease();
+
+               /*
+                * Clean up session-level lock. To avoid calling
CHECK_FOR_INTERRUPTS
+                * by LWLockReleaseClearVar before changing the backup
state we change
+                * it while holding the WAL insert lock.
+                */
                sessionBackupState = SESSION_BACKUP_EXCLUSIVE;
+               WALInsertLockRelease();
        }
        else
                sessionBackupState = SESSION_BACKUP_NON_EXCLUSIVE;

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Reply | Threaded
Open this post in threaded view
|

Re: Assertion failure when the non-exclusive pg_stop_backup aborted.

Michael Paquier
On Wed, Nov 15, 2017 at 5:21 PM, Masahiko Sawada <[hidden email]> wrote:

> On Wed, Nov 15, 2017 at 2:38 PM, Michael Paquier
> <[hidden email]> wrote:
>> On Wed, Nov 15, 2017 at 12:12 PM, Masahiko Sawada <[hidden email]> wrote:
>>> I think we need to check only sessionBackupState and don't need to
>>> check XLogCtl->Insert.exclusiveBackupState in do_pg_abort_backup(). We
>>> can quickly return if sessionBackupState !=
>>> SESSION_BACKUP_NON_EXCLUSIVE. In your suggestion, I think we can still
>>> get an assertion failure when pg_stop_backup(false) waiting for
>>> archiving is terminated while concurrent an exclusive backup is in
>>> progress.
>>
>> I have just gone through the thread once again, and noticed that it is
>> actually what I suggested upthread:
>> https://www.postgresql.org/message-id/CAB7nPqTm5CDrR5Y7yyfKy+PVDZ6dWS_jKG1KStaN5m95gAMTFQ@...
>> But your v2 posted here did not do that so it is incorrect from the start:
>> https://www.postgresql.org/message-id/CAD21AoA+isXYL1_ZXMnk9xJhYEL5h6rxJtTovLi7fumqfmCYgg@...
>
> Sorry, it's my fault. I didn't mean it but I forgot.

My review was wrong as well :)

>> We both got a bit confused here. As do_pg_abort_backup() is only used
>> for non-exclusive backups (including those taken through the
>> replication protocol), going through the session lock for checks is
>> fine. Could you update your patch accordingly please?
>
> One question is, since we need to check only the session lock I think
> that the following change is not necessary. Even if calling
> CHECK_FOR_INTERRUPTS after set sessionBackupState =
> SESSION_BACKUP_EXCLUSIVE; we never call do_pg_abort_backup(). Is that
> right?

Yeah, this one is not strictly necessary for this bug, but it seems to
me that it would be a good idea for robustness wiht interrupts to be
consistent with the stop phase when updating the session lock.
--
Michael

Reply | Threaded
Open this post in threaded view
|

Re: Assertion failure when the non-exclusive pg_stop_backup aborted.

Masahiko Sawada
On Wed, Nov 15, 2017 at 5:39 PM, Michael Paquier
<[hidden email]> wrote:

> On Wed, Nov 15, 2017 at 5:21 PM, Masahiko Sawada <[hidden email]> wrote:
>> On Wed, Nov 15, 2017 at 2:38 PM, Michael Paquier
>> <[hidden email]> wrote:
>>> On Wed, Nov 15, 2017 at 12:12 PM, Masahiko Sawada <[hidden email]> wrote:
>>>> I think we need to check only sessionBackupState and don't need to
>>>> check XLogCtl->Insert.exclusiveBackupState in do_pg_abort_backup(). We
>>>> can quickly return if sessionBackupState !=
>>>> SESSION_BACKUP_NON_EXCLUSIVE. In your suggestion, I think we can still
>>>> get an assertion failure when pg_stop_backup(false) waiting for
>>>> archiving is terminated while concurrent an exclusive backup is in
>>>> progress.
>>>
>>> I have just gone through the thread once again, and noticed that it is
>>> actually what I suggested upthread:
>>> https://www.postgresql.org/message-id/CAB7nPqTm5CDrR5Y7yyfKy+PVDZ6dWS_jKG1KStaN5m95gAMTFQ@...
>>> But your v2 posted here did not do that so it is incorrect from the start:
>>> https://www.postgresql.org/message-id/CAD21AoA+isXYL1_ZXMnk9xJhYEL5h6rxJtTovLi7fumqfmCYgg@...
>>
>> Sorry, it's my fault. I didn't mean it but I forgot.
>
> My review was wrong as well :)
>
>>> We both got a bit confused here. As do_pg_abort_backup() is only used
>>> for non-exclusive backups (including those taken through the
>>> replication protocol), going through the session lock for checks is
>>> fine. Could you update your patch accordingly please?
>>
>> One question is, since we need to check only the session lock I think
>> that the following change is not necessary. Even if calling
>> CHECK_FOR_INTERRUPTS after set sessionBackupState =
>> SESSION_BACKUP_EXCLUSIVE; we never call do_pg_abort_backup(). Is that
>> right?
>
> Yeah, this one is not strictly necessary for this bug, but it seems to
> me that it would be a good idea for robustness wiht interrupts to be
> consistent with the stop phase when updating the session lock.
Agreed. Attached the updated patch, please review it.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

fix_do_pg_abort_backup_v5.patch (2K) Download Attachment
12
Previous Thread Next Thread