could not stat promote trigger file leads to shutdown

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
15 messages Options
Reply | Threaded
Open this post in threaded view
|

could not stat promote trigger file leads to shutdown

Peter Eisentraut-6
I have seen the error

     could not stat promote trigger file "...": Permission denied

because of a misconfiguration (for example, setting promote_trigger_file
to point into a directory to which you don't have appropriate read or
execute access).

The problem is that because this happens in the startup process, the
ERROR is turned into a FATAL and the whole instance shuts down.  That
seems like a harsh penalty.  Would it be better to turn this ERROR into
a WARNING?

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: could not stat promote trigger file leads to shutdown

Fujii Masao-2
On Thu, Nov 14, 2019 at 10:58 PM Peter Eisentraut
<[hidden email]> wrote:

>
> I have seen the error
>
>      could not stat promote trigger file "...": Permission denied
>
> because of a misconfiguration (for example, setting promote_trigger_file
> to point into a directory to which you don't have appropriate read or
> execute access).
>
> The problem is that because this happens in the startup process, the
> ERROR is turned into a FATAL and the whole instance shuts down.  That
> seems like a harsh penalty.  Would it be better to turn this ERROR into
> a WARNING?

+1

Regards,

--
Fujii Masao


Reply | Threaded
Open this post in threaded view
|

Re: could not stat promote trigger file leads to shutdown

Tom Lane-2
In reply to this post by Peter Eisentraut-6
Peter Eisentraut <[hidden email]> writes:
> I have seen the error
>      could not stat promote trigger file "...": Permission denied
> because of a misconfiguration (for example, setting promote_trigger_file
> to point into a directory to which you don't have appropriate read or
> execute access).

> The problem is that because this happens in the startup process, the
> ERROR is turned into a FATAL and the whole instance shuts down.  That
> seems like a harsh penalty.  Would it be better to turn this ERROR into
> a WARNING?

It is harsh, but I suspect if we just logged the complaint, we'd get
bug reports about "Postgres isn't reacting to my trigger file",
because people don't read the postmaster log unless forced to.
Is there some more-visible way to report the problem, short of
shutting down?

(BTW, from this perspective, WARNING is especially bad because it
might not get logged at all.  Better to use LOG.)

One thought is to try to detect the misconfiguration at postmaster
start --- better to fail at startup than sometime later.  But I'm
not sure how reliably we could do that.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: could not stat promote trigger file leads to shutdown

Michael Paquier-2
On Thu, Nov 14, 2019 at 10:38:30AM -0500, Tom Lane wrote:
> It is harsh, but I suspect if we just logged the complaint, we'd get
> bug reports about "Postgres isn't reacting to my trigger file",
> because people don't read the postmaster log unless forced to.
> Is there some more-visible way to report the problem, short of
> shutting down?
>
> (BTW, from this perspective, WARNING is especially bad because it
> might not get logged at all.  Better to use LOG.)

Neither am I comfortable with that.

> One thought is to try to detect the misconfiguration at postmaster
> start --- better to fail at startup than sometime later.  But I'm
> not sure how reliably we could do that.

I think that we could do something close to the area where
RemovePromoteSignalFiles() gets called.  Why not simply checking the
path defined by PromoteTriggerFile() at startup time then?  I take it
that the only thing we should not complain about is stat() returning
ENOENT when looking at the promote file defined.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: could not stat promote trigger file leads to shutdown

Fujii Masao-2
On Fri, Nov 15, 2019 at 10:49 AM Michael Paquier <[hidden email]> wrote:

>
> On Thu, Nov 14, 2019 at 10:38:30AM -0500, Tom Lane wrote:
> > It is harsh, but I suspect if we just logged the complaint, we'd get
> > bug reports about "Postgres isn't reacting to my trigger file",
> > because people don't read the postmaster log unless forced to.
> > Is there some more-visible way to report the problem, short of
> > shutting down?
> >
> > (BTW, from this perspective, WARNING is especially bad because it
> > might not get logged at all.  Better to use LOG.)
>
> Neither am I comfortable with that.

I always wonder why WARNING was defined that way.
I think that users usually pay attention to the word "WARNING"
rather than "LOG".

> > One thought is to try to detect the misconfiguration at postmaster
> > start --- better to fail at startup than sometime later.  But I'm
> > not sure how reliably we could do that.
>
> I think that we could do something close to the area where
> RemovePromoteSignalFiles() gets called.  Why not simply checking the
> path defined by PromoteTriggerFile() at startup time then?  I take it
> that the only thing we should not complain about is stat() returning
> ENOENT when looking at the promote file defined.

promote_trigger_file is declared as PGC_SIGHUP,
so such check would be necessary even while the standby is running.
Which can cause the server to fail after the startup.

Regards,

--
Fujii Masao


Reply | Threaded
Open this post in threaded view
|

Re: could not stat promote trigger file leads to shutdown

Peter Eisentraut-6
On 2019-11-15 03:14, Fujii Masao wrote:

>>> One thought is to try to detect the misconfiguration at postmaster
>>> start --- better to fail at startup than sometime later.  But I'm
>>> not sure how reliably we could do that.
>> I think that we could do something close to the area where
>> RemovePromoteSignalFiles() gets called.  Why not simply checking the
>> path defined by PromoteTriggerFile() at startup time then?  I take it
>> that the only thing we should not complain about is stat() returning
>> ENOENT when looking at the promote file defined.
> promote_trigger_file is declared as PGC_SIGHUP,
> so such check would be necessary even while the standby is running.
> Which can cause the server to fail after the startup.

Let me illustrate a scenario in a more lively way:

Say you want to set up promote_trigger_file to point to a file outside
of the data directory, maybe because you want to integrate it with some
external tooling.  So you go into your configuration and set

     promote_trigger_file = '/srv/foobar/trigger'

and reload the server.  Everything is happy.  The fact that the
directory /srv/foobar/ does not exist at this point is completely ignored.

Now you become root and run

     mkdir /srv/foobar

and, depending circumstances such as root's umask or the permissions of
/srv, your PostgreSQL server crashes immediately.  That can't be good.

Also imagine the above steps being run by a configuration management system.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: could not stat promote trigger file leads to shutdown

Tom Lane-2
In reply to this post by Fujii Masao-2
Fujii Masao <[hidden email]> writes:
> On Fri, Nov 15, 2019 at 10:49 AM Michael Paquier <[hidden email]> wrote:
>> On Thu, Nov 14, 2019 at 10:38:30AM -0500, Tom Lane wrote:
>>> (BTW, from this perspective, WARNING is especially bad because it
>>>> might not get logged at all.  Better to use LOG.)

>> Neither am I comfortable with that.

> I always wonder why WARNING was defined that way.
> I think that users usually pay attention to the word "WARNING"
> rather than "LOG".

The issue really is "what are we warning about".  The way things
are set up basically assumes that WARNING is for complaining about
user-issued commands that might not be doing what the user wants.
Which is a legitimate use-case, but it doesn't necessarily mean
something that's very important to put in the postmaster log.

What we're seeing, in these repeated proposals to use WARNING in
some background process that doesn't run user commands, is that
there is also a use-case for "more-significant-than-usual log
messages".  Maybe we need a new elevel category for that.
SYSTEM_WARNING or LOG_WARNING, perhaps?

>> I think that we could do something close to the area where
>> RemovePromoteSignalFiles() gets called.  Why not simply checking the
>> path defined by PromoteTriggerFile() at startup time then?  I take it
>> that the only thing we should not complain about is stat() returning
>> ENOENT when looking at the promote file defined.

> promote_trigger_file is declared as PGC_SIGHUP,
> so such check would be necessary even while the standby is running.
> Which can cause the server to fail after the startup.

No, it'd be just the same as any other GUC: if we make such a test
in the check hook, then we'd fail for a bad value at startup, but
at SIGHUP we'd just reject the new setting.  I think this might be
a workable answer to Peter's concern.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: could not stat promote trigger file leads to shutdown

Sergei Kornilov
Hello

> Maybe we need a new elevel category for that.
> SYSTEM_WARNING or LOG_WARNING, perhaps?

I think a separate levels for user warnings and system warnings (and errors) would be great for log analytics. Error due to user typo in query is not the same as cache lookup error (for example).

regards, Sergei


Reply | Threaded
Open this post in threaded view
|

Re: could not stat promote trigger file leads to shutdown

Tom Lane-2
In reply to this post by Peter Eisentraut-6
Peter Eisentraut <[hidden email]> writes:

> Say you want to set up promote_trigger_file to point to a file outside
> of the data directory, maybe because you want to integrate it with some
> external tooling.  So you go into your configuration and set
>      promote_trigger_file = '/srv/foobar/trigger'
> and reload the server.  Everything is happy.  The fact that the
> directory /srv/foobar/ does not exist at this point is completely ignored.
> Now you become root and run
>      mkdir /srv/foobar
> and, depending circumstances such as root's umask or the permissions of
> /srv, your PostgreSQL server crashes immediately.  That can't be good.

No, it's not good, but the proposed fix of s/ERROR/LOG/ simply delays
the problem till later, ie when you try to promote the server nothing
happens.  That's not good either.  (To be clear: I'm not necessarily
against that change, I just don't think it's a sufficient response.)

If we add a GUC-check-hook test, then the problem of misconfiguration
is reduced to the previously unsolved problem that we have crappy
feedback for erroneous on-the-fly configuration changes.  So it's
still unsolved, but at least we've got one unsolved problem not two.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: could not stat promote trigger file leads to shutdown

Alvaro Herrera-9
On 2019-Nov-15, Tom Lane wrote:

> If we add a GUC-check-hook test, then the problem of misconfiguration
> is reduced to the previously unsolved problem that we have crappy
> feedback for erroneous on-the-fly configuration changes.  So it's
> still unsolved, but at least we've got one unsolved problem not two.

I am now against this kind of behavior, because nowadays it is common
to have external orchestrating systems stopping and starting postmaster
on their own volition.

If this kind of misconfiguration causes postmaster refuse to start, it
can effectively become a service-shutdown scenario which requires the
DBA to go temporarily mad.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: could not stat promote trigger file leads to shutdown

Tom Lane-2
Alvaro Herrera <[hidden email]> writes:
> On 2019-Nov-15, Tom Lane wrote:
>> If we add a GUC-check-hook test, then the problem of misconfiguration
>> is reduced to the previously unsolved problem that we have crappy
>> feedback for erroneous on-the-fly configuration changes.  So it's
>> still unsolved, but at least we've got one unsolved problem not two.

> I am now against this kind of behavior, because nowadays it is common
> to have external orchestrating systems stopping and starting postmaster
> on their own volition.

> If this kind of misconfiguration causes postmaster refuse to start, it
> can effectively become a service-shutdown scenario which requires the
> DBA to go temporarily mad.

By that argument, postgresql.conf could contain complete garbage
and the postmaster should still start.  I'm not willing to say
that an "external orchestrating system" doesn't need to take
responsibility for putting valid info into the config file.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: could not stat promote trigger file leads to shutdown

Peter Eisentraut-6
In reply to this post by Tom Lane-2
On 2019-11-15 19:23, Tom Lane wrote:

> Peter Eisentraut <[hidden email]> writes:
>> Say you want to set up promote_trigger_file to point to a file outside
>> of the data directory, maybe because you want to integrate it with some
>> external tooling.  So you go into your configuration and set
>>       promote_trigger_file = '/srv/foobar/trigger'
>> and reload the server.  Everything is happy.  The fact that the
>> directory /srv/foobar/ does not exist at this point is completely ignored.
>> Now you become root and run
>>       mkdir /srv/foobar
>> and, depending circumstances such as root's umask or the permissions of
>> /srv, your PostgreSQL server crashes immediately.  That can't be good.
>
> No, it's not good, but the proposed fix of s/ERROR/LOG/ simply delays
> the problem till later, ie when you try to promote the server nothing
> happens.  That's not good either.  (To be clear: I'm not necessarily
> against that change, I just don't think it's a sufficient response.)
>
> If we add a GUC-check-hook test, then the problem of misconfiguration
> is reduced to the previously unsolved problem that we have crappy
> feedback for erroneous on-the-fly configuration changes.  So it's
> still unsolved, but at least we've got one unsolved problem not two.

AFAICT, a GUC check hook wouldn't actually be able to address the
specific scenario I described.  At the time the GUC is set, the
containing the directory of the trigger file does not exist yet.  This
is currently not an error.  The problem only happens if after the GUC is
set the containing directory appears and is not readable.

I notice that we use LOG level if an SSL key or certificate file is not
accessible on reload, so that seems somewhat similar.

We don't have any GUC check hooks on other file system location string
settings that ensure accessibility or presence of the file.  Although I
do notice that we use check_canonical_path() in some places and not
others for mysterious and undocumented reasons.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: could not stat promote trigger file leads to shutdown

Tom Lane-2
Peter Eisentraut <[hidden email]> writes:
> On 2019-11-15 19:23, Tom Lane wrote:
>> If we add a GUC-check-hook test, then the problem of misconfiguration
>> is reduced to the previously unsolved problem that we have crappy
>> feedback for erroneous on-the-fly configuration changes.  So it's
>> still unsolved, but at least we've got one unsolved problem not two.

> AFAICT, a GUC check hook wouldn't actually be able to address the
> specific scenario I described.  At the time the GUC is set, the
> containing the directory of the trigger file does not exist yet.  This
> is currently not an error.  The problem only happens if after the GUC is
> set the containing directory appears and is not readable.

True, if the hook just consists of trying fopen() and checking the
errno.  Would it be feasible to insist that the containing directory
exist and be readable?  We have enough infrastructure that that
should only take a few lines of code, so the question is whether
or not that's a nicer behavior than we have now.

If we had this to do over, I'd argue that we misdesigned trigger
files: they should be required to exist always, and triggering
depends on file contents (eg empty vs. not) not existence.  That
would make it far easier to check for configuration mistakes
at startup.  But I suppose it's too late now.

> We don't have any GUC check hooks on other file system location string
> settings that ensure accessibility or presence of the file.

Right, but I'm suggesting we should add that where appropriate.
Basically the complaint here is that the system lacks checks
that the given configuration settings are workable, and we ought
to add such.

> Although I
> do notice that we use check_canonical_path() in some places and not
> others for mysterious and undocumented reasons.

Probably only that some patch authors didn't know about it :-(

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: could not stat promote trigger file leads to shutdown

Peter Eisentraut-6
On 2019-11-20 16:21, Tom Lane wrote:

>> AFAICT, a GUC check hook wouldn't actually be able to address the
>> specific scenario I described.  At the time the GUC is set, the
>> containing the directory of the trigger file does not exist yet.  This
>> is currently not an error.  The problem only happens if after the GUC is
>> set the containing directory appears and is not readable.
> True, if the hook just consists of trying fopen() and checking the
> errno.  Would it be feasible to insist that the containing directory
> exist and be readable?  We have enough infrastructure that that
> should only take a few lines of code, so the question is whether
> or not that's a nicer behavior than we have now.

Is it possible to do this in a mostly bullet-proof way?  Just because
the directory exists and looks pretty good otherwise, doesn't mean we
can read a file created in it later in a way that doesn't fall afoul of
the existing error checks.  There could be something like SELinux
lurking, for example.

Maybe some initial checking would be useful, but I think we still need
to downgrade the error check at use time a bit to not crash in the cases
that we miss.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: could not stat promote trigger file leads to shutdown

Kyotaro Horiguchi-4
At Wed, 4 Dec 2019 11:52:33 +0100, Peter Eisentraut <[hidden email]> wrote in

> On 2019-11-20 16:21, Tom Lane wrote:
> >> AFAICT, a GUC check hook wouldn't actually be able to address the
> >> specific scenario I described.  At the time the GUC is set, the
> >> containing the directory of the trigger file does not exist yet.  This
> >> is currently not an error.  The problem only happens if after the GUC
> >> is
> >> set the containing directory appears and is not readable.
> > True, if the hook just consists of trying fopen() and checking the
> > errno.  Would it be feasible to insist that the containing directory
> > exist and be readable?  We have enough infrastructure that that
> > should only take a few lines of code, so the question is whether
> > or not that's a nicer behavior than we have now.
>
> Is it possible to do this in a mostly bullet-proof way?  Just because
> the directory exists and looks pretty good otherwise, doesn't mean we
> can read a file created in it later in a way that doesn't fall afoul
> of the existing error checks.  There could be something like SELinux
> lurking, for example.
>
> Maybe some initial checking would be useful, but I think we still need
> to downgrade the error check at use time a bit to not crash in the
> cases that we miss.

+1. Any GUC variables that points to outer, or externally-modifiable
resources, including directories, files, commands can face that kind
of problem. For example a bogus value for archive_command doesn't
preveint server from starting. I understand that the reason is that we
don't have a reliable means to check-up the command before we actually
execute it, but server can (or should) continue running even if it
fails. I think this issue falls into that category.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center