potential stuck lock in SaveSlotToPath()

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
16 messages Options
Reply | Threaded
Open this post in threaded view
|

potential stuck lock in SaveSlotToPath()

Peter Eisentraut-6
When SaveSlotToPath() is called with elevel=LOG, the early exits don't
release the slot's io_in_progress_lock.  Fix attached.

This could result in a walsender being stuck on the lock forever.  A
possible way to get into this situation is if the offending code paths
are triggered in a low disk space situation.  (This is how it was found;
maybe there are other ways.)

Pavan Deolasee and Craig Ringer worked on this issue.  I'm forwarding it
on their behalf.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

0001-Drop-slot-s-LWLock-before-returning-from-SaveSlotToP.patch (3K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: potential stuck lock in SaveSlotToPath()

Andres Freund
Hi,

On 2020-03-18 16:46:23 +0100, Peter Eisentraut wrote:
> When SaveSlotToPath() is called with elevel=LOG, the early exits don't
> release the slot's io_in_progress_lock.  Fix attached.

I'm a bit confused as to why we we ever call it with elevel = LOG
(i.e. why we have the elevel parameter at all). That seems to have been
there from the start, so it's either me or Robert that's to blame. But I
can't immediately see a reason for it?

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: potential stuck lock in SaveSlotToPath()

Alvaro Herrera-9
On 2020-Mar-18, Andres Freund wrote:

> Hi,
>
> On 2020-03-18 16:46:23 +0100, Peter Eisentraut wrote:
> > When SaveSlotToPath() is called with elevel=LOG, the early exits don't
> > release the slot's io_in_progress_lock.  Fix attached.
>
> I'm a bit confused as to why we we ever call it with elevel = LOG
> (i.e. why we have the elevel parameter at all). That seems to have been
> there from the start, so it's either me or Robert that's to blame. But I
> can't immediately see a reason for it?

I guess you didn't want failure to save a slot be a reason to abort a
checkpoint.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: potential stuck lock in SaveSlotToPath()

Thomas Munro-5
In reply to this post by Peter Eisentraut-6
On Thu, Mar 19, 2020 at 4:46 AM Peter Eisentraut
<[hidden email]> wrote:
> [patch]

+         * releaseing even in that case.

Typo.


Reply | Threaded
Open this post in threaded view
|

Re: potential stuck lock in SaveSlotToPath()

Andres Freund
In reply to this post by Alvaro Herrera-9
Hi,

On 2020-03-18 16:54:19 -0300, Alvaro Herrera wrote:

> On 2020-Mar-18, Andres Freund wrote:
> > On 2020-03-18 16:46:23 +0100, Peter Eisentraut wrote:
> > > When SaveSlotToPath() is called with elevel=LOG, the early exits don't
> > > release the slot's io_in_progress_lock.  Fix attached.
> >
> > I'm a bit confused as to why we we ever call it with elevel = LOG
> > (i.e. why we have the elevel parameter at all). That seems to have been
> > there from the start, so it's either me or Robert that's to blame. But I
> > can't immediately see a reason for it?
>
> I guess you didn't want failure to save a slot be a reason to abort a
> checkpoint.

I don't see a valid reason for that though - if anything it's dangerous,
because we're not persistently saving the slot. It should fail the
checkpoint imo. Robert, do you have an idea?

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: potential stuck lock in SaveSlotToPath()

Robert Haas
On Wed, Mar 18, 2020 at 4:25 PM Andres Freund <[hidden email]> wrote:
> I don't see a valid reason for that though - if anything it's dangerous,
> because we're not persistently saving the slot. It should fail the
> checkpoint imo. Robert, do you have an idea?

Well, the comment atop SaveSlotToPath says:

 * This needn't actually be part of a checkpoint, but it's a convenient
 * location.

And I agree with that.

Incidentally, the wait-event handling in SaveSlotToPath() doesn't look
right for the early-exit cases either.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply | Threaded
Open this post in threaded view
|

Re: potential stuck lock in SaveSlotToPath()

Peter Eisentraut-6
On 2020-03-19 16:38, Robert Haas wrote:
> Incidentally, the wait-event handling in SaveSlotToPath() doesn't look
> right for the early-exit cases either.

There appear to be appropriate pgstat_report_wait_end() calls.  What are
you seeing?

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: potential stuck lock in SaveSlotToPath()

Robert Haas
On Fri, Mar 20, 2020 at 11:32 AM Peter Eisentraut
<[hidden email]> wrote:
> On 2020-03-19 16:38, Robert Haas wrote:
> > Incidentally, the wait-event handling in SaveSlotToPath() doesn't look
> > right for the early-exit cases either.
>
> There appear to be appropriate pgstat_report_wait_end() calls.  What are
> you seeing?

Oh, you're right. I think I got confused because the rename() and
close() don't have that, but those don't have a wait event set either.
Sorry for the noise.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply | Threaded
Open this post in threaded view
|

Re: potential stuck lock in SaveSlotToPath()

Peter Eisentraut-6
On 2020-03-20 16:38, Robert Haas wrote:

> On Fri, Mar 20, 2020 at 11:32 AM Peter Eisentraut
> <[hidden email]> wrote:
>> On 2020-03-19 16:38, Robert Haas wrote:
>>> Incidentally, the wait-event handling in SaveSlotToPath() doesn't look
>>> right for the early-exit cases either.
>>
>> There appear to be appropriate pgstat_report_wait_end() calls.  What are
>> you seeing?
>
> Oh, you're right. I think I got confused because the rename() and
> close() don't have that, but those don't have a wait event set either.
> Sorry for the noise.

Any concerns about applying and backpatching the patch I posted?

The talk about reorganizing this code doesn't seem very concrete at the
moment and would probably not be backpatch material anyway.


--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: potential stuck lock in SaveSlotToPath()

Alvaro Herrera-9
On 2020-Mar-25, Peter Eisentraut wrote:

> On 2020-03-20 16:38, Robert Haas wrote:
> > On Fri, Mar 20, 2020 at 11:32 AM Peter Eisentraut
> > <[hidden email]> wrote:
> > > On 2020-03-19 16:38, Robert Haas wrote:
> > > > Incidentally, the wait-event handling in SaveSlotToPath() doesn't look
> > > > right for the early-exit cases either.
> > >
> > > There appear to be appropriate pgstat_report_wait_end() calls.  What are
> > > you seeing?
> >
> > Oh, you're right. I think I got confused because the rename() and
> > close() don't have that, but those don't have a wait event set either.
> > Sorry for the noise.
>
> Any concerns about applying and backpatching the patch I posted?

It looks a straight bug fix to me, I agree it should be back-patched.

> The talk about reorganizing this code doesn't seem very concrete at the
> moment and would probably not be backpatch material anyway.

Agreed on both counts.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: potential stuck lock in SaveSlotToPath()

Robert Haas
In reply to this post by Peter Eisentraut-6
On Wed, Mar 25, 2020 at 6:13 AM Peter Eisentraut
<[hidden email]> wrote:
> Any concerns about applying and backpatching the patch I posted?

Not from me.

> The talk about reorganizing this code doesn't seem very concrete at the
> moment and would probably not be backpatch material anyway.

+1.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply | Threaded
Open this post in threaded view
|

Re: potential stuck lock in SaveSlotToPath()

Peter Eisentraut-6
On 2020-03-25 17:56, Robert Haas wrote:

> On Wed, Mar 25, 2020 at 6:13 AM Peter Eisentraut
> <[hidden email]> wrote:
>> Any concerns about applying and backpatching the patch I posted?
>
> Not from me.
>
>> The talk about reorganizing this code doesn't seem very concrete at the
>> moment and would probably not be backpatch material anyway.
>
> +1.

committed and backpatched

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: potential stuck lock in SaveSlotToPath()

Michael Paquier-2
On Thu, Mar 26, 2020 at 02:16:05PM +0100, Peter Eisentraut wrote:
> committed and backpatched

The patch committed does that in three places:
    /* rename to permanent file, fsync file and directory */
    if (rename(tmppath, path) != 0)
    {
+       LWLockRelease(&slot->io_in_progress_lock);
        ereport(elevel,
                (errcode_for_file_access(),
                 errmsg("could not rename file \"%s\" to \"%s\": %m",

But why do you assume that LWLockRelease() never changes errno?  It
seems to me that you should save errno before calling LWLockRelease(),
and then restore it back before using %m in the log message, no?  See
for example the case where trace_lwlocks is set.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: potential stuck lock in SaveSlotToPath()

Peter Eisentraut-6
On 2020-03-27 08:48, Michael Paquier wrote:

> On Thu, Mar 26, 2020 at 02:16:05PM +0100, Peter Eisentraut wrote:
>> committed and backpatched
>
> The patch committed does that in three places:
>      /* rename to permanent file, fsync file and directory */
>      if (rename(tmppath, path) != 0)
>      {
> +       LWLockRelease(&slot->io_in_progress_lock);
> ereport(elevel,
>                  (errcode_for_file_access(),
>                   errmsg("could not rename file \"%s\" to \"%s\": %m",
>
> But why do you assume that LWLockRelease() never changes errno?  It
> seems to me that you should save errno before calling LWLockRelease(),
> and then restore it back before using %m in the log message, no?  See
> for example the case where trace_lwlocks is set.
Good catch.  How about the attached patch?

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

0001-Save-errno-across-LWLockRelease-calls.patch (2K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: potential stuck lock in SaveSlotToPath()

Michael Paquier-2
On Wed, Apr 01, 2020 at 04:26:25PM +0200, Peter Eisentraut wrote:
> Good catch.  How about the attached patch?

WFM.  Another trick would be to call LWLockRelease() after generating
the log, but I find your patch more consistent with the surroundings.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: potential stuck lock in SaveSlotToPath()

Peter Eisentraut-6
On 2020-04-02 08:21, Michael Paquier wrote:
> On Wed, Apr 01, 2020 at 04:26:25PM +0200, Peter Eisentraut wrote:
>> Good catch.  How about the attached patch?
>
> WFM.  Another trick would be to call LWLockRelease() after generating
> the log, but I find your patch more consistent with the surroundings.

done

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services