allow online change primary_conninfo

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
77 messages Options
1234
Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Andres Freund
Hi,

On 2019-02-03 15:33:38 +0300, Sergei Kornilov wrote:

> +/*
> + * Actual processing SIGHUP signal
> + */
> +static void
> +ProcessWalRcvSigHup(void)
> +{
> + ProcessConfigFile(PGC_SIGHUP);
> +
> + /*
> + * If primary_conninfo has been changed while walreceiver is running,
> + * shut down walreceiver so that a new walreceiver is started and
> + * initiates replication with the new connection information.
> + */
> + if (strcmp(current_conninfo, PrimaryConnInfo) != 0)
> + ereport(FATAL,
> + (errcode(ERRCODE_ADMIN_SHUTDOWN),
> + errmsg("terminating walreceiver process due to change of primary_conninfo"),
> + errdetail("In a moment starts streaming WAL with new configuration.")));
> +
> + /*
> + * And the same for primary_slot_name.
> + */
> + if (strcmp(current_slotname, PrimarySlotName) != 0)
> + ereport(FATAL,
> + (errcode(ERRCODE_ADMIN_SHUTDOWN),
> + errmsg("terminating walreceiver process due to change of primary_slot_name"),
> + errdetail("In a moment starts streaming WAL with new configuration.")));
> +
> + XLogWalRcvSendHSFeedback(true);
> +}

I don't quite think this is the right design. IMO the startup process
should signal the walreceiver to shut down, and the wal receiver should
never look at the config. Otherwise there's chances for knowledge of
pg.conf to differ between the processes.

Greetings,

Andres Freund

Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Sergei Kornilov
Hi

> I don't quite think this is the right design. IMO the startup process
> should signal the walreceiver to shut down, and the wal receiver should
> never look at the config.

Ok, i can rewrite such way. Please check attached version.

> Otherwise there's chances for knowledge of
> pg.conf to differ between the processes.

I still not understood why this:
- is not issue for startup process with all primary_conninfo logic
- but issue for walreceiver with Michael Paquier patch; therefore without any primary_conninfo change logic in startup.
In both cases primary_conninfo change logic is only in one process.

regards, Sergei

reload_walreceiver_conninfo_v005.patch (8K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Michael Paquier-2
On Sat, Feb 16, 2019 at 02:50:35PM +0300, Sergei Kornilov wrote:

> + if ((strcmp(conninfo, PrimaryConnInfo) != 0 ||
> + strcmp(slotname, PrimarySlotName) != 0) &&
> + WalRcvRunning())
> + {
> + ereport(LOG,
> + (errmsg("terminating walreceiver process due to change of %s",
> + strcmp(conninfo, PrimaryConnInfo) != 0 ? "primary_conninfo" : "primary_slot_name"),
> + errdetail("In a moment starts streaming WAL with new configuration.")));
> +
> + ShutdownWalRcv();
> + lastSourceFailed = true;
> + }
If you do that, the startup process would try to jump to a different
source to fetch WAL if archiving is enabled.  Is that really what we
want?  It would be nice to have one message for primary_conninfo being
updated, and one for primary_slot_name so as if both are changed at
the same time the user gets the correct information.

"In a moment starts streaming WAL with new configuration." sounds
strange.  It would be more natural to have something like "The WAL
receiver is going to be restarted with the new configuration", just a
suggestion.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Sergei Kornilov
Hi

> If you do that, the startup process would try to jump to a different
> source to fetch WAL if archiving is enabled. Is that really what we
> want?

Yes, similar behavior with exit walreceiver by any reason. Same as in all previous patch versions.
(not sure this can be changed in some small and elegant way)

> It would be nice to have one message for primary_conninfo being
> updated, and one for primary_slot_name so as if both are changed at
> the same time the user gets the correct information.

Hm, maybe even better would be separate message if both settings are changed? Doing this in attached patch.

> "In a moment starts streaming WAL with new configuration." sounds
> strange. It would be more natural to have something like "The WAL
> receiver is going to be restarted with the new configuration", just a
> suggestion.

This phrase was proposed here: https://www.postgresql.org/message-id/20181213.184233.215171075.horiguchi.kyotaro%40lab.ntt.co.jp
My english is poor, so i added message just as proposed. I replaced to your variant.

regards, Sergei

reload_walreceiver_conninfo_v006.patch (9K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Andres Freund
In reply to this post by Sergei Kornilov
Hi,

On 2019-02-16 14:50:35 +0300, Sergei Kornilov wrote:

> > I don't quite think this is the right design. IMO the startup process
> > should signal the walreceiver to shut down, and the wal receiver should
> > never look at the config.
>
> Ok, i can rewrite such way. Please check attached version.
>
> > Otherwise there's chances for knowledge of
> > pg.conf to differ between the processes.
>
> I still not understood why this:
> - is not issue for startup process with all primary_conninfo logic
> - but issue for walreceiver with Michael Paquier patch; therefore without any primary_conninfo change logic in startup.
> In both cases primary_conninfo change logic is only in one process.
>
> regards, Sergei

For one, it's not ok to just let the startup process think that the
walreceiver failed - that'll make it change source of WAL to the next
available method. Something we definitely do not want, as
restore_command is very commonly slower.

But just as importantly, I think, is that we ought to automate
cluster-wide tasks like failing over more inside postgres. And that's
much harder if different parts of PG, say the startup process and
walreceiver, have different assumptions about the current configuration.



> +void
> +ProcessStartupSigHup(void)
> +{
> + char   *conninfo = pstrdup(PrimaryConnInfo);
> + char   *slotname = pstrdup(PrimarySlotName);
> +
> + ProcessConfigFile(PGC_SIGHUP);
> +
> + /*
> + * we need shutdown running walreceiver if replication settings was
> + * changed. walreceiver will be started with new settings in next
> + * WaitForWALToBecomeAvailable iteration
> + */
> + if ((strcmp(conninfo, PrimaryConnInfo) != 0 ||
> + strcmp(slotname, PrimarySlotName) != 0) &&
> + WalRcvRunning())
> + {
> + ereport(LOG,
> + (errmsg("terminating walreceiver process due to change of %s",
> + strcmp(conninfo, PrimaryConnInfo) != 0 ? "primary_conninfo" : "primary_slot_name"),
> + errdetail("In a moment starts streaming WAL with new configuration.")));
> +
> + ShutdownWalRcv();
> + lastSourceFailed = true;
> + }
> +
> + pfree(conninfo);
> + pfree(slotname);
> +}

That'd still switch to a different method, something we do not want...


Greetings,

Andres Freund

Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Sergei Kornilov
Hi

> For one, it's not ok to just let the startup process think that the
> walreceiver failed - that'll make it change source of WAL to the next
> available method. Something we definitely do not want, as
> restore_command is very commonly slower.

Ok. This was not mentioned before Michael response yesterday. restore_command is much faster compared to database restart, also switch to a different method was proposed few years ago by Simon Riggs in original change recovery.conf proposal ( https://www.postgresql.org/message-id/flat/CANP8+jLO5fmfudbB1b1iw3pTdOK1HBM=xMTaRfOa5zpDVcqzew@.../ ). I assumed we can start with this. Sorry for your wasted time.

> That'd still switch to a different method, something we do not want...

Ok, do not want means we do not want. Will try change behavior.

regards, Sergei

Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Michael Paquier-2
On Mon, Feb 18, 2019 at 12:06:05AM +0300, Sergei Kornilov wrote:
> Ok. This was not mentioned before Michael response
> yesterday. restore_command is much faster compared to database
> restart [...]

The startup process would not switch back to streaming with archiving
enabled until it has consumed all the segments available.  I recall
David Steele mentioning me that a perl command invocation can take
easily 100ms.  On a system which generates more than 10 segments per
segment, you can guess what happens..  I think that this argument is a
non-starter if we want to make the WAL receiver a maximum available.

>> That'd still switch to a different method, something we do not want...
>
> Ok, do not want means we do not want. Will try change behavior.

Thanks Sergei for considering it.  The code paths touched are
sensitive, so we had better be careful here.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Sergei Kornilov
In reply to this post by Andres Freund
Hello

This might be not the right way, but I can't think of a better way to not switch to a different method than split of lastSourceFailed processing and starting new source. Something like refactoring in first attached patch. I move RequestXLogStreaming logic from lastSourceFailed processing and add new pendingRestartSource indicator.
Second patch is mostly the same as previous version but uses new pendingRestartSource mechanism instead of lastSourceFailed.

Thanks in advance!

regards, Sergei

0001_refactor_WaitForWALToBecomeAvailable_to_graceful_restart_source.patch (6K) Download Attachment
0002_restart_stream_source_on_conf_change.patch (8K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Michael Paquier-2
On Sat, Mar 02, 2019 at 01:49:51PM +0300, Sergei Kornilov wrote:
> This might be not the right way, but I can't think of a better way
> to not switch to a different method than split of lastSourceFailed
> processing and starting new source. Something like refactoring in
> first attached patch. I move RequestXLogStreaming logic from
> lastSourceFailed processing and add new pendingRestartSource
> indicator.

OK.  This needs a rather close lookup, and I am not actually sure that
you even need pendingRestartSource.  Please let me a couple of days
before coming back to you on 0001.

> Second patch is mostly the same as previous version but uses new
> pendingRestartSource mechanism instead of lastSourceFailed.

This, on the other hand, looks like the implementation we are looking
for based on the discussions which happened until now to have the
startup process handle the shutdown and restart of the WAL receiver.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Re: allow online change primary_conninfo

David Steele
Hi Michael,

On 3/4/19 7:19 AM, Michael Paquier wrote:

> On Sat, Mar 02, 2019 at 01:49:51PM +0300, Sergei Kornilov wrote:
>> This might be not the right way, but I can't think of a better way
>> to not switch to a different method than split of lastSourceFailed
>> processing and starting new source. Something like refactoring in
>> first attached patch. I move RequestXLogStreaming logic from
>> lastSourceFailed processing and add new pendingRestartSource
>> indicator.
>
> OK.  This needs a rather close lookup, and I am not actually sure that
> you even need pendingRestartSource.  Please let me a couple of days
> before coming back to you on 0001.
>
>> Second patch is mostly the same as previous version but uses new
>> pendingRestartSource mechanism instead of lastSourceFailed.
>
> This, on the other hand, looks like the implementation we are looking
> for based on the discussions which happened until now to have the
> startup process handle the shutdown and restart of the WAL receiver.

Do you know when you'll have a chance to revisit this?

Regards,
--
-David
[hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Andres Freund
Hi,

On 2019-03-25 12:32:21 +0400, David Steele wrote:

> On 3/4/19 7:19 AM, Michael Paquier wrote:
> > On Sat, Mar 02, 2019 at 01:49:51PM +0300, Sergei Kornilov wrote:
> > > This might be not the right way, but I can't think of a better way
> > > to not switch to a different method than split of lastSourceFailed
> > > processing and starting new source. Something like refactoring in
> > > first attached patch. I move RequestXLogStreaming logic from
> > > lastSourceFailed processing and add new pendingRestartSource
> > > indicator.
> >
> > OK.  This needs a rather close lookup, and I am not actually sure that
> > you even need pendingRestartSource.  Please let me a couple of days
> > before coming back to you on 0001.
> >
> > > Second patch is mostly the same as previous version but uses new
> > > pendingRestartSource mechanism instead of lastSourceFailed.
> >
> > This, on the other hand, looks like the implementation we are looking
> > for based on the discussions which happened until now to have the
> > startup process handle the shutdown and restart of the WAL receiver.
>
> Do you know when you'll have a chance to revisit this?

I think we unfortunately got to mark this as returned with
feedback. I've not done so, but just switched the entry to waiting on
author.

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Sergei Kornilov
Hi

> I think we unfortunately got to mark this as returned with
> feedback. I've not done so, but just switched the entry to waiting on
> author.

Why returned with feedback? Why waiting on author? I didn't receive a feedback for latest published patch version. What can I do as author? Patch still applied (thanks cf bot)
Obviously too late for pg12, but why can not be target pg13 and therefore be moved to next CF?

regards, Sergei


Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Andres Freund
Hi,

On 2019-04-03 23:49:54 +0300, Sergei Kornilov wrote:
> > I think we unfortunately got to mark this as returned with
> > feedback. I've not done so, but just switched the entry to waiting on
> > author.
>
> Why returned with feedback? Why waiting on author? I didn't receive a
> feedback for latest published patch version. What can I do as author?
> Patch still applied (thanks cf bot) Obviously too late for pg12, but
> why can not be target pg13 and therefore be moved to next CF?

Well, my impression was that the patch didn't yet really address the
feedback. And thus should have been marked as waiting on author for a
while.

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Sergei Kornilov
Hi

>>  > I think we unfortunately got to mark this as returned with
>>  > feedback. I've not done so, but just switched the entry to waiting on
>>  > author.
>>
>>  Why returned with feedback? Why waiting on author? I didn't receive a
>>  feedback for latest published patch version. What can I do as author?
>>  Patch still applied (thanks cf bot) Obviously too late for pg12, but
>>  why can not be target pg13 and therefore be moved to next CF?
>
> Well, my impression was that the patch didn't yet really address the
> feedback. And thus should have been marked as waiting on author for a
> while.

Not agree. Latest patch version perform walreceiver restart without switch to a different method as discussed. Here is no race condition between startup process and walreceiver because conninfo passed via WalRcvData struct as currently. I miss something important?
Michael Paquier had no possibility to review latest implementation, but did not say this is totally wrong, just asked wait a rather close lookup.

Of cource we can close this cf entry. I would be happy if someone else post proper implementation. And I can rework my implementation again, but I don’t know how the correct implementation should look or why latest implementation is wrong.

regards, Sergei


Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Michael Paquier-2
On Thu, Apr 04, 2019 at 12:27:55AM +0300, Sergei Kornilov wrote:
> Not agree. Latest patch version perform walreceiver restart without
> switch to a different method as discussed. Here is no race condition
> between startup process and walreceiver because conninfo passed via
> WalRcvData struct as currently. I miss something important?
> Michael Paquier had no possibility to review latest implementation,
> but did not say this is totally wrong, just asked wait a rather
> close lookup.

I agree with Sergei's statement here.  He has sent a patch for review,
which I have looked up a bit, but not enough to provide a full review
unfortunately, and this even if I was listed as a reviewer of it.  So
if somebody is to blame here, that's me.

> Of cource we can close this cf entry. I would be happy if someone
> else post proper implementation. And I can rework my implementation
> again, but I don’t know how the correct implementation should look
> or why latest implementation is wrong.

Moving this patch to next CF is fine.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Andres Freund
On 2019-04-04 11:06:05 +0900, Michael Paquier wrote:

> On Thu, Apr 04, 2019 at 12:27:55AM +0300, Sergei Kornilov wrote:
> > Not agree. Latest patch version perform walreceiver restart without
> > switch to a different method as discussed. Here is no race condition
> > between startup process and walreceiver because conninfo passed via
> > WalRcvData struct as currently. I miss something important?
> > Michael Paquier had no possibility to review latest implementation,
> > but did not say this is totally wrong, just asked wait a rather
> > close lookup.
>
> I agree with Sergei's statement here.  He has sent a patch for review,
> which I have looked up a bit, but not enough to provide a full review
> unfortunately, and this even if I was listed as a reviewer of it.  So
> if somebody is to blame here, that's me.

> > Of cource we can close this cf entry. I would be happy if someone
> > else post proper implementation. And I can rework my implementation
> > again, but I don’t know how the correct implementation should look
> > or why latest implementation is wrong.
>
> Moving this patch to next CF is fine.

Cool, sorry for the misunderstanding then.


Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Sergei Kornilov
In reply to this post by Michael Paquier-2
Hello

Updated version attached. Merge conflict was about tests count in 001_stream_rep.pl. Nothing else was changed. My approach can be still incorrect, any redesign ideas are welcome. Thanks in advance!

regards, Sergei

0002_v2_restart_stream_source_on_conf_change.patch (8K) Download Attachment
0001_v2_refactor_WaitForWALToBecomeAvailable_to_graceful_restart_source.patch (6K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Michael Paquier-2
On Mon, Jul 01, 2019 at 02:33:39PM +0300, Sergei Kornilov wrote:
> Updated version attached. Merge conflict was about tests count in
> 001_stream_rep.pl. Nothing else was changed. My approach can be
> still incorrect, any redesign ideas are welcome. Thanks in advance!

It has been some time, and I am finally catching up with this patch.

+         <para>
+          WAL receiver will be restarted after <varname>primary_slot_name</varname>
+          was changed.
          </para>
The sentence sounds strange.  Here is a suggestion:
The WAL receiver is restarted after an update of primary_slot_name (or
primary_conninfo).

The comment at the top of the call of ProcessStartupSigHup() in
HandleStartupProcInterrupts() needs to be updated as it mentions a
configuration file re-read, but that's not the case anymore.

pendingRestartSource's name does not represent what it does, as it is
only associated with the restart of a WAL receiver when in streaming
state, and that's a no-op for the archive mode and the local mode.

So, the patch splits the steps taken when checking for a WAL source by
adding an extra step after the failure handling that you are calling
the restart step.  When a failure happens for the stream mode
(shutdown of WAL receiver, promotion. etc), there is a switch to the
archive mode, and nothing is changed in this case in your patch.  So
when shutting down the WAL receiver after a parameter change, what
happens is that the startup process waits for retrieve_retry_interval
before moving back to the archive mode.  Only after scanning again the
archives do we restart a new WAL receiver.  However, if the restart of
the WAL receiver is planned because of an update of primary_conninfo
(or slot), shouldn't the follow-up mode be XLOG_FROM_STREAM without
waiting for wal_retrieve_retry_interval ms for extra WAL to become
available?
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Thomas Munro-5
On Tue, Jul 30, 2019 at 6:42 PM Michael Paquier <[hidden email]> wrote:
> On Mon, Jul 01, 2019 at 02:33:39PM +0300, Sergei Kornilov wrote:
> > Updated version attached. Merge conflict was about tests count in
> > 001_stream_rep.pl. Nothing else was changed. My approach can be
> > still incorrect, any redesign ideas are welcome. Thanks in advance!
>
> [review]

Unfortunately this comes right that the end of the CF, but the good
news is that there is another one just around the corner.  Moved to
September.

--
Thomas Munro
https://enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Michael Paquier-2
On Thu, Aug 01, 2019 at 09:06:59PM +1200, Thomas Munro wrote:
> Unfortunately this comes right that the end of the CF, but the good
> news is that there is another one just around the corner.  Moved to
> September.

Moving it to the next CF is fine, thanks.  The author had no time to
reply since my last lookup.
--
Michael

signature.asc (849 bytes) Download Attachment
1234