[PATCH] Fix for infinite signal loop in parallel scan


[PATCH] Fix for infinite signal loop in parallel scan

Chris Travers-7
Hi;

Attached is the patch we are fully testing at Adjust.  There are a few non-obvious aspects of the code around where the patch hits.  I have run make check on Linux and MacOS, and make check-world on Linux (check-world fails on MacOS on all versions and all branches due to ecpg failures).  Automatic tests are difficult because this is a rare race condition which is difficult (or possibly impossible) to create automatically.  Our current approach to testing is to force the issue using a debugger.  Any ideas on how to reproduce this automatically are appreciated, but even on our heavily loaded systems only a very small portion of queries hit this case (the issue happens about once a week for us).

The main problem is that under certain rare circumstances we see recovery conflicts going into loops, sending SIGUSR1 to the parallel query, which retries a posix_fallocate call.  The call gets interrupted by the signal perpetually and replication cannot proceed.

The patch attempts to handle cases where we are trying to cancel the query or terminate the process the same way we would handle an ENOSPC in the resize operation, namely to break out of the loop, do the relevant cleanup, and then raise the relevant errors.

There is a very serious complication in this area, however: dsm_impl_op takes an elevel parameter which determines whether or not it is safe to throw errors.  This elevel is then passed to ereport inside the function, and this function also performs the resize operation itself.  Since calling CHECK_FOR_INTERRUPTS() is not safe when errors must not be raised, I do this conditionally, only if elevel is greater than or equal to ERROR.

Currently all code paths reaching this point use elevel ERROR, but the calling function supports semantics where elevel could be lower, and a caller might have additional cleanup to do.  So I don't think one can just add an unconditional CHECK_FOR_INTERRUPTS() there: even though that would work for now, it could create all kinds of subtle bugs in the future.

So what the patch does is check whether we are trying to end the query or the backend, and it does not retry the resize operation if either of those conditions is true.  In that case it sets errno to EINTR and returns failure.

If the resize operation failed, the only caller then checks whether elevel is greater than or equal to ERROR and whether errno is set to EINTR.  If so, it checks for signals.  If those do not abort the query, we fall through, pass ereport the supplied elevel as we would have otherwise, and return false to the caller.

In current calls to this function, this means that interrupts are handled right after cleanup of the dynamic shared memory, and these then abort the query or exit the backend.  Future callers could specify a WARNING elevel if they have extra cleanup to do, in which case signals would not be checked and the same warning found today would be found in the log.  At the next CHECK_FOR_INTERRUPTS() call, the appropriate errors would be raised.
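Put together, the loop-exit and caller-side behavior described above might be sketched roughly like this (a simplified illustration, not the actual dsm_impl.c source; the pending flags, the ERROR level value, and CHECK_FOR_INTERRUPTS() are stand-ins for the real PostgreSQL definitions):

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Stand-ins for PostgreSQL globals and macros (hypothetical values). */
static volatile int ProcDiePending = 0;
static volatile int QueryCancelPending = 0;
#define ERROR 20
#define CHECK_FOR_INTERRUPTS() \
	do { if (ProcDiePending || QueryCancelPending) _exit(1); } while (0)

/* Resize: retry posix_fallocate() on EINTR, but stop retrying once a
 * cancel or die request is pending so the backend can reach cleanup. */
static int
resize_with_retry(int fd, off_t size)
{
	int			rc;

	do
	{
		rc = posix_fallocate(fd, 0, size);
	} while (rc == EINTR && !(ProcDiePending || QueryCancelPending));

	if (rc != 0)
	{
		errno = rc;				/* posix_fallocate returns the error code */
		return -1;
	}
	return 0;
}

/* Caller: on failure, service interrupts only when raising is safe,
 * then report at the supplied elevel and return false (0). */
static int
allocate_segment(int fd, off_t size, int elevel)
{
	if (resize_with_retry(fd, size) != 0)
	{
		if (errno == EINTR && elevel >= ERROR)
			CHECK_FOR_INTERRUPTS();		/* may not return */
		fprintf(stderr, "could not resize shared memory segment\n");
		return 0;
	}
	return 1;
}
```

The key point the sketch shows is the ordering: the loop merely stops retrying, and interrupts are only serviced by the caller, after cleanup would have run, and only when elevel permits raising an error.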

In the past it had been suggested we use PG_TRY() and PG_CATCH(), but given that it is not consistent whether we may raise an error or whether we MUST raise an error, I don't see how that approach can work.  As far as I can see, we MUST raise an error in the appropriate spot if and only if elevel is set to a sufficient level.

Backporting this is pretty trivial.  We expect to confirm this ourselves on both master and 10.  I can prepare backports fairly quickly.

Is there any feedback on this approach before I add it to the next commitfest?

--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin


Attachment: racecond.patch (3K)

Re: [PATCH] Fix for infinite signal loop in parallel scan

Andres Freund
Hi,

On 2018-09-07 17:57:05 +0200, Chris Travers wrote:
> Attached is the patch we are fully testing at Adjust.

For the future, please don't start a separate thread from the bug report
/ original discussion.  It makes it much harder to follow.

Greetings,

Andres Freund


Re: [PATCH] Fix for infinite signal loop in parallel scan

Thomas Munro-3
In reply to this post by Chris Travers-7
On Sat, Sep 8, 2018 at 3:57 AM Chris Travers <[hidden email]> wrote:
> Attached is the patch we are fully testing at Adjust.

Thanks!

> I have run make check on Linux and MacOS, and make check-world on Linux (check-world fails on MacOS on all versions and all branches due to ecpg failures).

FWIW it's entirely possible to get make check-world passing on a Mac.
Maybe post the problem you're seeing to a new thread?

> ...

> In the past it had been suggested we do PG_TRY(); and PG_CATCH(), but given that it is not consistent whether we can raise an error or whether we MUST raise an error, I don't see how this approach can work.  As far as I can see, we MUST raise an error in the appropriate spot if and only if elevel is set to a sufficient level.

Yeah, your way looks a bit nicer than something involving PG_TRY().

> Is there any feedback on this approach before I add it to the next commitfest?

Please go ahead and add it.  Being a bug fix, we'll commit it sooner
than the open commitfest anyway, but it's useful to have it in there.

+ if (errno == EINTR && elevel >= ERROR)
+ CHECK_FOR_INTERRUPTS();

I think we might as well just CHECK_FOR_INTERRUPTS() unconditionally.
In this branch elevel is always ERROR as you noted, and the code
around there is confusing enough already.

+ } while (rc == EINTR && !(ProcDiePending || QueryCancelPending));

There seems to be a precedent for checking QueryCancelPending directly
to break out of a loop in regcomp.c and syncrep.c.  So this seems OK.
Hmm, I wonder if there should be an INTERRUPT_PENDING() macro that
hides those details, but allows you to break out of a loop and then do
some cleanup before CHECK_FOR_INTERRUPTS().
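A sketch of what that hypothetical macro could look like (INTERRUPT_PENDING() does not exist in PostgreSQL; the flag variables here are local stand-ins for the real globals):

```c
/* Hypothetical INTERRUPT_PENDING() macro: test whether a cancel or die
 * request is pending without servicing it, so a loop can break out, run
 * its cleanup, and only then call CHECK_FOR_INTERRUPTS(). */
static volatile int ProcDiePending = 0;
static volatile int QueryCancelPending = 0;

#define INTERRUPT_PENDING()		(ProcDiePending || QueryCancelPending)
```

A retry loop could then read `while (rc == EINTR && !INTERRUPT_PENDING());`, with the cleanup and a CHECK_FOR_INTERRUPTS() call placed after the loop.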

--
Thomas Munro
http://www.enterprisedb.com


Re: [PATCH] Fix for infinite signal loop in parallel scan

Alex Kliukin
In reply to this post by Chris Travers-7


> On 7. Sep 2018, at 17:57, Chris Travers <[hidden email]> wrote:
>
> Hi;
>
> Attached is the patch we are fully testing at Adjust.  There are a few non-obvious aspects of the code around where the patch hits.  I have run make check on Linux and MacOS, and make check-world on Linux (check-world fails on MacOS on all versions and all branches due to ecpg failures).  Automatic tests are difficult because this is a rare race condition which is difficult (or possibly impossible) to create automatically.  Our current approach to testing is to force the issue using a debugger.  Any ideas on how to reproduce this automatically are appreciated, but even on our heavily loaded systems only a very small portion of queries hit this case (the issue happens about once a week for us).


I did some manual testing on it, setting breakpoints on
ResolveRecoveryConflictWithLock in the startup process on a streaming replica
(configured with a very low max_standby_streaming_delay, i.e. 100ms) and on the
posix_fallocate call in a normal backend on the same replica. At that point I
also instructed gdb not to stop upon receiving SIGUSR1 (handle SIGUSR1 nonstop)
and resumed execution of both the backend and the startup process.

Then I simulated a conflict by creating a rather big (3GB) table on the master,
doing some updates on it, and then running an aggregate on the replica backend
(i.e. 'select count(1) from test' with 'force_parallel_mode = true') where I had
set the breakpoint. The aggregate and force_parallel_mode ensured that
the query was executed as a parallel one, leading to allocation of the DSM
segment.

Once the backend reached the posix_fallocate call and was waiting on the
breakpoint, I called 'vacuum full test' on the master, which led to a conflict
on the replica running 'select from test' (in the vast majority of cases),
triggering the breakpoint in ResolveRecoveryConflictWithLock in the startup
process, since the startup process tried to cancel the conflicting backend. At
that point I continued execution of the startup process (which would loop in
CancelVirtualTransaction, sending SIGUSR1 to the backend while the backend
waited to be resumed from the breakpoint). Right after that, I changed the size
parameter on the backend to something that would make posix_fallocate run for a
bit longer, typically ten to a hundred MB. Once the backend process was resumed,
it started receiving SIGUSR1 from the startup process, resulting in
posix_fallocate exiting with EINTR.

With the patch applied, the posix_fallocate loop terminated right away (because
the QueryCancelPending flag was set to true) and the backend went through the
cleanup, showing an ERROR about cancellation due to the conflict with recovery.
Without the patch, it looped indefinitely in dsm_impl_posix_resize while
the startup process looped forever trying to send SIGUSR1.

One thing I’m wondering is whether we could achieve the same by just blocking
SIGUSR1 for the duration of the posix_fallocate call?

Cheers,
Oleksii Kliukin

Re: [PATCH] Fix for infinite signal loop in parallel scan

Chris Travers-7
In reply to this post by Thomas Munro-3
First, I have attached a cleaned-up revision (pgindent, removing a dangling comment, etc.).

On Fri, Sep 14, 2018 at 1:16 PM Thomas Munro <[hidden email]> wrote:
On Sat, Sep 8, 2018 at 3:57 AM Chris Travers <[hidden email]> wrote:
> Attached is the patch we are fully testing at Adjust.

Thanks!

> I have run make check on Linux and MacOS, and make check-world on Linux (check-world fails on MacOS on all versions and all branches due to ecpg failures).

FWIW it's entirely possible to get make check-world passing on a Mac.
Maybe post the problem you're seeing to a new thread?

Will do. 

> ...

> In the past it had been suggested we do PG_TRY(); and PG_CATCH(), but given that it is not consistent whether we can raise an error or whether we MUST raise an error, I don't see how this approach can work.  As far as I can see, we MUST raise an error in the appropriate spot if and only if elevel is set to a sufficient level.

Yeah, your way looks a bit nicer than something involving PG_TRY().

> Is there any feedback on this approach before I add it to the next commitfest?

Please go ahead and add it.  Being a bug fix, we'll commit it sooner
than the open commitfest anyway, but it's useful to have it in there.

+ if (errno == EINTR && elevel >= ERROR)
+ CHECK_FOR_INTERRUPTS();

I think we might as well just CHECK_FOR_INTERRUPTS() unconditionally.
In this branch elevel is always ERROR as you noted, and the code
around there is confusing enough already.

The reason I didn't do that is future safety, and the off chance that someone could do something strange with extensions today that might use this in an unsafe way.  While I cannot think of any use case for calling this in an unsafe way, I know for a fact that one might write extensions, background workers, etc. that use this API in ways not found in our current code.  For something that could be back-patched, I really want it to work, as much as possible, in a way that does not break even exotic extensions.

Longer term, I would like to see if I can help refactor this code so that responsibility for error handling is clearer and such problems cannot exist.  But that's not something to back-patch.

+ } while (rc == EINTR && !(ProcDiePending || QueryCancelPending));

There seems to be a precedent for checking QueryCancelPending directly
to break out of a loop in regcomp.c and syncrep.c.  So this seems OK.

Yeah, I checked.
 
Hmm, I wonder if there should be an INTERRUPT_PENDING() macro that
hides those details, but allows you to break out of a loop and then do
some cleanup before CHECK_FOR_INTERRUPTS().

That is a good idea. 

--
Thomas Munro
http://www.enterprisedb.com


--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin


Attachment: racecond.patch (3K)

Re: [PATCH] Fix for infinite signal loop in parallel scan

Chris Travers-7
In reply to this post by Alex Kliukin


On Mon, Sep 17, 2018 at 2:59 PM Oleksii Kliukin <[hidden email]> wrote:


> On 7. Sep 2018, at 17:57, Chris Travers <[hidden email]> wrote:
>
> Hi;
>
> Attached is the patch we are fully testing at Adjust.  There are a few non-obvious aspects of the code around where the patch hits.  I have run make check on Linux and MacOS, and make check-world on Linux (check-world fails on MacOS on all versions and all branches due to ecpg failures).  Automatic tests are difficult because this is a rare race condition which is difficult (or possibly impossible) to create automatically.  Our current approach to testing is to force the issue using a debugger.  Any ideas on how to reproduce this automatically are appreciated, but even on our heavily loaded systems only a very small portion of queries hit this case (the issue happens about once a week for us).


I did some manual testing on it, setting breakpoints on
ResolveRecoveryConflictWithLock in the startup process on a streaming replica
(configured with a very low max_standby_streaming_delay, i.e. 100ms) and on the
posix_fallocate call in a normal backend on the same replica. At that point I
also instructed gdb not to stop upon receiving SIGUSR1 (handle SIGUSR1 nonstop)
and resumed execution of both the backend and the startup process.

Then I simulated a conflict by creating a rather big (3GB) table on the master,
doing some updates on it, and then running an aggregate on the replica backend
(i.e. 'select count(1) from test' with 'force_parallel_mode = true') where I had
set the breakpoint. The aggregate and force_parallel_mode ensured that
the query was executed as a parallel one, leading to allocation of the DSM
segment.

Once the backend reached the posix_fallocate call and was waiting on the
breakpoint, I called 'vacuum full test' on the master, which led to a conflict
on the replica running 'select from test' (in the vast majority of cases),
triggering the breakpoint in ResolveRecoveryConflictWithLock in the startup
process, since the startup process tried to cancel the conflicting backend. At
that point I continued execution of the startup process (which would loop in
CancelVirtualTransaction, sending SIGUSR1 to the backend while the backend
waited to be resumed from the breakpoint). Right after that, I changed the size
parameter on the backend to something that would make posix_fallocate run for a
bit longer, typically ten to a hundred MB. Once the backend process was resumed,
it started receiving SIGUSR1 from the startup process, resulting in
posix_fallocate exiting with EINTR.

With the patch applied, the posix_fallocate loop terminated right away (because
the QueryCancelPending flag was set to true) and the backend went through the
cleanup, showing an ERROR about cancellation due to the conflict with recovery.
Without the patch, it looped indefinitely in dsm_impl_posix_resize while
the startup process looped forever trying to send SIGUSR1.

One thing I’m wondering is whether we could achieve the same by just blocking
SIGUSR1 for the duration of the posix_fallocate call?

Many thanks!

If we were to do that, I would say we should mask all signals we can mask during the call.

I don't have a problem going down that road instead if people think it is better.

As a note, we have a patched version of PostgreSQL 10.5 ready to deploy to a system affected by this and expect that to be done this week.
 

Cheers,
Oleksii Kliukin


--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin


Re: [PATCH] Fix for infinite signal loop in parallel scan

Thomas Munro-3
On Tue, Sep 18, 2018 at 1:15 AM Chris Travers <[hidden email]> wrote:
> On Mon, Sep 17, 2018 at 2:59 PM Oleksii Kliukin <[hidden email]> wrote:
>> With the patch applied, the posix_fallocate loop terminated right away (because
>> of QueryCancelPending flag set to true) and the backend went through the
>> cleanup, showing an ERROR of cancelling due to the conflict with recovery.
>> Without the patch, it looped indefinitely in the dsm_impl_posix_resize, while
>> the startup process were looping forever, trying to send SIGUSR1.

Thanks for testing!

>> One thing I’m wondering is whether we could do the same by just blocking SIGUSR1
>> for the duration of posix_fallocate?
>
> If we were to do that, I would say we should mask all signals we can mask during the call.
>
> I don't have a problem going down that road instead if people think it is better.

We discussed that when adding posix_fallocate() and decided that
retrying is better:

https://www.postgresql.org/message-id/20170628230458.n5ehizmvhoerr5yq%40alap3.anarazel.de

Here is a patch that I propose to commit and back-patch to 9.4.  I
just wrote a suitable commit message, edited the comments lightly and
fixed some whitespace.

--
Thomas Munro
http://www.enterprisedb.com

Attachment: 0001-Allow-DSM-allocation-to-be-interrupted.patch (3K)

Re: [PATCH] Fix for infinite signal loop in parallel scan

Alex Kliukin


> On 18. Sep 2018, at 03:18, Thomas Munro <[hidden email]> wrote:
>
> On Tue, Sep 18, 2018 at 1:15 AM Chris Travers <[hidden email]> wrote:
>> On Mon, Sep 17, 2018 at 2:59 PM Oleksii Kliukin <[hidden email]> wrote:
>>> With the patch applied, the posix_fallocate loop terminated right away (because
>>> of QueryCancelPending flag set to true) and the backend went through the
>>> cleanup, showing an ERROR of cancelling due to the conflict with recovery.
>>> Without the patch, it looped indefinitely in the dsm_impl_posix_resize, while
>>> the startup process were looping forever, trying to send SIGUSR1.
>
> Thanks for testing!
>
>>> One thing I’m wondering is whether we could do the same by just blocking SIGUSR1
>>> for the duration of posix_fallocate?
>>
>> If we were to do that, I would say we should mask all signals we can mask during the call.
>>
>> I don't have a problem going down that road instead if people think it is better.
>
> We discussed that when adding posix_fallocate() and decided that
> retrying is better:
>
> https://www.postgresql.org/message-id/20170628230458.n5ehizmvhoerr5yq%40alap3.anarazel.de
>
> Here is a patch that I propose to commit and back-patch to 9.4.  I
> just wrote a suitable commit message, edited the comments lightly and
> fixed some whitespace.

Thanks!

Apart from the fact that the reviewer's name is “Murat Kabilov” and
not “Murak Kabilov”, the back-patch looks good to me.

Cheers,
Oleksii Kliukin

Re: [PATCH] Fix for infinite signal loop in parallel scan

Thomas Munro-3
On Tue, Sep 18, 2018 at 9:25 PM Oleksii Kliukin <[hidden email]> wrote:
> > On 18. Sep 2018, at 03:18, Thomas Munro <[hidden email]> wrote:
> > Here is a patch that I propose to commit and back-patch to 9.4.  I
> > just wrote a suitable commit message, edited the comments lightly and
> > fixed some whitespace.
>
> Thanks!
>
> Apart from the fact that the reviewer's name is “Murat Kabilov” and
> not “Murak Kabilov” the back-patch looks good to me.

Oops, fixed.  Pushed.  Thanks all for the report, patch and reviews.

--
Thomas Munro
http://www.enterprisedb.com