Nicer error when connecting to standby with hot_standby=off

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

Nicer error when connecting to standby with hot_standby=off

James Coleman
I recently noticed while setting up a test environment that attempting to connect to a standby running without hot_standby=on results in a fairly generic error (I believe "the database system is starting up"). I don't have my test setup running right now, so can't confirm with a repro case at the moment, but with a little bit of spelunking I noticed that error text only shows up in src/backend/postmaster/postmaster.c when port->canAcceptConnections has the value CAC_STARTUP.

Ideally the error message would include something along the lines of "The server is running as a standby but cannot accept connections with hot_standby=off".

I wanted to get some initial feedback on the idea before writing a patch: does that seem like a reasonable change? Is it actually plausible to distinguish between this state and "still recovering" (i.e., when starting up a hot standby but initial recovery hasn't completed so it legitimately can't accept connections yet)? If so, should we include the possibility if hot_standby isn't on, just in case?

The enum value CAC_STARTUP is defined in src/include/libpq/libpq-be.h, which makes me wonder if changing this value would result in a wire protocol change/something the client wants to know about? If so, I assume it's not reasonable to change the value, but would it still be reasonable to change the error text?

Thanks,
James Coleman
Reply | Threaded
Open this post in threaded view
|

Re: Nicer error when connecting to standby with hot_standby=off

Andres Freund
Hi,

On 2020-03-08 20:12:21 -0400, James Coleman wrote:

> I recently noticed while setting up a test environment that attempting to
> connect to a standby running without hot_standby=on results in a fairly
> generic error (I believe "the database system is starting up"). I don't
> have my test setup running right now, so can't confirm with a repro case at
> the moment, but with a little bit of spelunking I noticed that error text
> only shows up in src/backend/postmaster/postmaster.c when
> port->canAcceptConnections has the value CAC_STARTUP.
>
> Ideally the error message would include something along the lines of "The
> server is running as a standby but cannot accept connections with
> hot_standby=off".

Yea, something roughly like that would be good.


> I wanted to get some initial feedback on the idea before writing a patch:
> does that seem like a reasonable change? Is it actually plausible to
> distinguish between this state and "still recovering" (i.e., when starting
> up a hot standby but initial recovery hasn't completed so it legitimately
> can't accept connections yet)? If so, should we include the possibility if
> hot_standby isn't on, just in case?

Yes, it is feasible to distinguish those cases. And we should, if we're
going to change things around.


> The enum value CAC_STARTUP is defined in src/include/libpq/libpq-be.h,
> which makes me wonder if changing this value would result in a wire
> protocol change/something the client wants to know about? If so, I assume
> it's not reasonable to change the value, but would it still be reasonable
> to change the error text?

The value shouldn't be visible to clients in any way. While not obvious
from the name, there's this comment at the top of the header:

 *  Note that this is backend-internal and is NOT exported to clients.
 *  Structs that need to be client-visible are in pqcomm.h.


Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: Nicer error when connecting to standby with hot_standby=off

James Coleman
On Mon, Mar 9, 2020 at 6:28 PM Andres Freund <[hidden email]> wrote:

>
> Hi,
>
> On 2020-03-08 20:12:21 -0400, James Coleman wrote:
> > I recently noticed while setting up a test environment that attempting to
> > connect to a standby running without hot_standby=on results in a fairly
> > generic error (I believe "the database system is starting up"). I don't
> > have my test setup running right now, so can't confirm with a repro case at
> > the moment, but with a little bit of spelunking I noticed that error text
> > only shows up in src/backend/postmaster/postmaster.c when
> > port->canAcceptConnections has the value CAC_STARTUP.
> >
> > Ideally the error message would include something along the lines of "The
> > server is running as a standby but cannot accept connections with
> > hot_standby=off".
>
> Yea, something roughly like that would be good.

Awesome, thanks for the early feedback!

> > I wanted to get some initial feedback on the idea before writing a patch:
> > does that seem like a reasonable change? Is it actually plausible to
> > distinguish between this state and "still recovering" (i.e., when starting
> > up a hot standby but initial recovery hasn't completed so it legitimately
> > can't accept connections yet)? If so, should we include the possibility if
> > hot_standby isn't on, just in case?
>
> Yes, it is feasible to distinguish those cases. And we should, if we're
> going to change things around.

I'll look into this hopefully soon, but it's helpful to know that it's
possible. Is it basically along the lines of checking to see if the
LSN is past the minimum recovery point?

> > The enum value CAC_STARTUP is defined in src/include/libpq/libpq-be.h,
> > which makes me wonder if changing this value would result in a wire
> > protocol change/something the client wants to know about? If so, I assume
> > it's not reasonable to change the value, but would it still be reasonable
> > to change the error text?
>
> The value shouldn't be visible to clients in any way. While not obvious
> from the name, there's this comment at the top of the header:
>
>  *        Note that this is backend-internal and is NOT exported to clients.
>  *        Structs that need to be client-visible are in pqcomm.h.

This is also helpful.

Thanks,
James


Reply | Threaded
Open this post in threaded view
|

Re: Nicer error when connecting to standby with hot_standby=off

Andres Freund
Hi,

On 2020-03-09 18:40:32 -0400, James Coleman wrote:

> On Mon, Mar 9, 2020 at 6:28 PM Andres Freund <[hidden email]> wrote:
> > > I wanted to get some initial feedback on the idea before writing a patch:
> > > does that seem like a reasonable change? Is it actually plausible to
> > > distinguish between this state and "still recovering" (i.e., when starting
> > > up a hot standby but initial recovery hasn't completed so it legitimately
> > > can't accept connections yet)? If so, should we include the possibility if
> > > hot_standby isn't on, just in case?
> >
> > Yes, it is feasible to distinguish those cases. And we should, if we're
> > going to change things around.
>
> I'll look into this hopefully soon, but it's helpful to know that it's
> possible. Is it basically along the lines of checking to see if the
> LSN is past the minimum recovery point?

No, I don't think that's the right approach. IIRC the startup process
(i.e. the one doing the WAL replay) signals postmaster once consistency
has been achieved. So you can just use that state.

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: Nicer error when connecting to standby with hot_standby=off

James Coleman
On Mon, Mar 9, 2020 at 8:06 PM Andres Freund <[hidden email]> wrote:

>
> Hi,
>
> On 2020-03-09 18:40:32 -0400, James Coleman wrote:
> > On Mon, Mar 9, 2020 at 6:28 PM Andres Freund <[hidden email]> wrote:
> > > > I wanted to get some initial feedback on the idea before writing a patch:
> > > > does that seem like a reasonable change? Is it actually plausible to
> > > > distinguish between this state and "still recovering" (i.e., when starting
> > > > up a hot standby but initial recovery hasn't completed so it legitimately
> > > > can't accept connections yet)? If so, should we include the possibility if
> > > > hot_standby isn't on, just in case?
> > >
> > > Yes, it is feasible to distinguish those cases. And we should, if we're
> > > going to change things around.
> >
> > I'll look into this hopefully soon, but it's helpful to know that it's
> > possible. Is it basically along the lines of checking to see if the
> > LSN is past the minimum recovery point?
>
> No, I don't think that's the right approach. IIRC the startup process
> (i.e. the one doing the WAL replay) signals postmaster once consistency
> has been achieved. So you can just use that state.
I've taken that approach in the attached patch (I'd expected to wait
until later to work on this...but it seemed pretty small so I ended up
hacking on it this evening).

I don't have tests included: I tried intentionally breaking the
existing behavior (returning no error when hot_standby=off), but
running make check-world (including tap tests) didn't find any
breakages. I can look into that more deeply at some point, but if you
happen to know a place we test similar things, then I'd be happy to
hear it.

One other question: how is error message translation handled? I
haven't added entries to the relevant files, but also I'm obviously
not qualified to write them.

Thanks,
James

v1-0001-Improve-standby-connection-denied-error-message.patch (3K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Nicer error when connecting to standby with hot_standby=off

David Zhang
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:       tested, passed
Spec compliant:           not tested
Documentation:            not tested

I applied the patch to the latest master branch and run a test below. The error messages have been separated. Below is the test steps.

### setup primary server
initdb -D /tmp/primary/data
mkdir /tmp/archive_dir
echo "archive_mode='on'" >> /tmp/primary/data/postgresql.conf
echo "archive_command='cp %p /tmp/archive_dir/%f'" >> /tmp/primary/data/postgresql.conf
pg_ctl -D /tmp/primary/data -l /tmp/primary-logs start

### setup host standby server
pg_basebackup -p 5432 -w -R -D /tmp/hotstandby
echo "primary_conninfo='host=127.0.0.1 port=5432 user=pgdev'" >> /tmp/hotstandby/postgresql.conf
echo "restore_command='cp /tmp/archive_dir/%f %p'" >> /tmp/hotstandby/postgresql.conf
echo "hot_standby = off" >> /tmp/hotstandby/postgresql.conf
pg_ctl -D /tmp/hotstandby -l /tmp/hotstandby-logs -o "-p 5433" start

### keep trying to connect to hot standby server in order to get the error messages in different stages.
while true; do echo "`date`"; psql postgres -p 5433 -c "SELECT txid_current_snapshot();" sleep 0.2; done

### before the patch
psql: error: could not connect to server: FATAL:  the database system is starting up
...

### after the patch, got different messages, one message indicates hot_standby is off
psql: error: could not connect to server: FATAL:  the database system is starting up
...
psql: error: could not connect to server: FATAL:  the database system is up, but hot_standby is off
...
Reply | Threaded
Open this post in threaded view
|

Re: Nicer error when connecting to standby with hot_standby=off

James Coleman
On Thu, Apr 2, 2020 at 5:53 PM David Zhang <[hidden email]> wrote:

>
> The following review has been posted through the commitfest application:
> make installcheck-world:  not tested
> Implements feature:       tested, passed
> Spec compliant:           not tested
> Documentation:            not tested
>
> I applied the patch to the latest master branch and run a test below. The error messages have been separated. Below is the test steps.
>
> ### setup primary server
> initdb -D /tmp/primary/data
> mkdir /tmp/archive_dir
> echo "archive_mode='on'" >> /tmp/primary/data/postgresql.conf
> echo "archive_command='cp %p /tmp/archive_dir/%f'" >> /tmp/primary/data/postgresql.conf
> pg_ctl -D /tmp/primary/data -l /tmp/primary-logs start
>
> ### setup host standby server
> pg_basebackup -p 5432 -w -R -D /tmp/hotstandby
> echo "primary_conninfo='host=127.0.0.1 port=5432 user=pgdev'" >> /tmp/hotstandby/postgresql.conf
> echo "restore_command='cp /tmp/archive_dir/%f %p'" >> /tmp/hotstandby/postgresql.conf
> echo "hot_standby = off" >> /tmp/hotstandby/postgresql.conf
> pg_ctl -D /tmp/hotstandby -l /tmp/hotstandby-logs -o "-p 5433" start
>
> ### keep trying to connect to hot standby server in order to get the error messages in different stages.
> while true; do echo "`date`"; psql postgres -p 5433 -c "SELECT txid_current_snapshot();" sleep 0.2; done
>
> ### before the patch
> psql: error: could not connect to server: FATAL:  the database system is starting up
> ...
>
> ### after the patch, got different messages, one message indicates hot_standby is off
> psql: error: could not connect to server: FATAL:  the database system is starting up
> ...
> psql: error: could not connect to server: FATAL:  the database system is up, but hot_standby is off
> ...

Thanks for the review and testing!

James


Reply | Threaded
Open this post in threaded view
|

Re: Nicer error when connecting to standby with hot_standby=off

Fujii Masao-4


On 2020/04/03 22:49, James Coleman wrote:

> On Thu, Apr 2, 2020 at 5:53 PM David Zhang <[hidden email]> wrote:
>>
>> The following review has been posted through the commitfest application:
>> make installcheck-world:  not tested
>> Implements feature:       tested, passed
>> Spec compliant:           not tested
>> Documentation:            not tested
>>
>> I applied the patch to the latest master branch and run a test below. The error messages have been separated. Below is the test steps.
>>
>> ### setup primary server
>> initdb -D /tmp/primary/data
>> mkdir /tmp/archive_dir
>> echo "archive_mode='on'" >> /tmp/primary/data/postgresql.conf
>> echo "archive_command='cp %p /tmp/archive_dir/%f'" >> /tmp/primary/data/postgresql.conf
>> pg_ctl -D /tmp/primary/data -l /tmp/primary-logs start
>>
>> ### setup host standby server
>> pg_basebackup -p 5432 -w -R -D /tmp/hotstandby
>> echo "primary_conninfo='host=127.0.0.1 port=5432 user=pgdev'" >> /tmp/hotstandby/postgresql.conf
>> echo "restore_command='cp /tmp/archive_dir/%f %p'" >> /tmp/hotstandby/postgresql.conf
>> echo "hot_standby = off" >> /tmp/hotstandby/postgresql.conf
>> pg_ctl -D /tmp/hotstandby -l /tmp/hotstandby-logs -o "-p 5433" start
>>
>> ### keep trying to connect to hot standby server in order to get the error messages in different stages.
>> while true; do echo "`date`"; psql postgres -p 5433 -c "SELECT txid_current_snapshot();" sleep 0.2; done
>>
>> ### before the patch
>> psql: error: could not connect to server: FATAL:  the database system is starting up
>> ...
>>
>> ### after the patch, got different messages, one message indicates hot_standby is off
>> psql: error: could not connect to server: FATAL:  the database system is starting up
>> ...
>> psql: error: could not connect to server: FATAL:  the database system is up, but hot_standby is off
>> ...
>
> Thanks for the review and testing!

Thanks for the patch! Here is the comment from me.

+ else if (!FatalError && pmState == PM_RECOVERY)
+ return CAC_STANDBY; /* connection disallowed on non-hot standby */

Even if hot_standby is enabled, pmState seems to indicate PM_RECOVERY
until recovery has reached a consistent state. No? So, if my understanding
is right, "FATAL:  the database system is up, but hot_standby is off" can
be logged even when hot_standby is on. Which sounds very confusing.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION


Reply | Threaded
Open this post in threaded view
|

Re: Nicer error when connecting to standby with hot_standby=off

James Coleman
On Wed, Jul 29, 2020 at 11:24 AM Fujii Masao
<[hidden email]> wrote:

>
>
>
> On 2020/04/03 22:49, James Coleman wrote:
> > On Thu, Apr 2, 2020 at 5:53 PM David Zhang <[hidden email]> wrote:
> >>
> >> The following review has been posted through the commitfest application:
> >> make installcheck-world:  not tested
> >> Implements feature:       tested, passed
> >> Spec compliant:           not tested
> >> Documentation:            not tested
> >>
> >> I applied the patch to the latest master branch and run a test below. The error messages have been separated. Below is the test steps.
> >>
> >> ### setup primary server
> >> initdb -D /tmp/primary/data
> >> mkdir /tmp/archive_dir
> >> echo "archive_mode='on'" >> /tmp/primary/data/postgresql.conf
> >> echo "archive_command='cp %p /tmp/archive_dir/%f'" >> /tmp/primary/data/postgresql.conf
> >> pg_ctl -D /tmp/primary/data -l /tmp/primary-logs start
> >>
> >> ### setup host standby server
> >> pg_basebackup -p 5432 -w -R -D /tmp/hotstandby
> >> echo "primary_conninfo='host=127.0.0.1 port=5432 user=pgdev'" >> /tmp/hotstandby/postgresql.conf
> >> echo "restore_command='cp /tmp/archive_dir/%f %p'" >> /tmp/hotstandby/postgresql.conf
> >> echo "hot_standby = off" >> /tmp/hotstandby/postgresql.conf
> >> pg_ctl -D /tmp/hotstandby -l /tmp/hotstandby-logs -o "-p 5433" start
> >>
> >> ### keep trying to connect to hot standby server in order to get the error messages in different stages.
> >> while true; do echo "`date`"; psql postgres -p 5433 -c "SELECT txid_current_snapshot();" sleep 0.2; done
> >>
> >> ### before the patch
> >> psql: error: could not connect to server: FATAL:  the database system is starting up
> >> ...
> >>
> >> ### after the patch, got different messages, one message indicates hot_standby is off
> >> psql: error: could not connect to server: FATAL:  the database system is starting up
> >> ...
> >> psql: error: could not connect to server: FATAL:  the database system is up, but hot_standby is off
> >> ...
> >
> > Thanks for the review and testing!
>
> Thanks for the patch! Here is the comment from me.
>
> +               else if (!FatalError && pmState == PM_RECOVERY)
> +                       return CAC_STANDBY; /* connection disallowed on non-hot standby */
>
> Even if hot_standby is enabled, pmState seems to indicate PM_RECOVERY
> until recovery has reached a consistent state. No? So, if my understanding
> is right, "FATAL:  the database system is up, but hot_standby is off" can
> be logged even when hot_standby is on. Which sounds very confusing.
That's a good point. I've attached a corrected version.

I still don't have a good idea for how to add a test for this change.
If a test for this warranted, I'd be interested in any ideas.

James

v2-0001-Improve-standby-connection-denied-error-message.patch (3K) Download Attachment