allow online change primary_conninfo

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
77 messages Options
1234
Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Sergei Kornilov
Hello

> It has been some time, and I am finally catching up with this patch.

Thank you!

> + <para>
> + WAL receiver will be restarted after <varname>primary_slot_name</varname>
> + was changed.
>           </para>
> The sentence sounds strange. Here is a suggestion:
> The WAL receiver is restarted after an update of primary_slot_name (or
> primary_conninfo).
>
> The comment at the top of the call of ProcessStartupSigHup() in
> HandleStartupProcInterrupts() needs to be updated as it mentions a
> configuration file re-read, but that's not the case anymore.

I agree. But I want to know if this is an acceptable idea in general, before spending a few more months polishing the text on something we don’t want.

> pendingRestartSource's name does not represent what it does, as it is
> only associated with the restart of a WAL receiver when in streaming
> state, and that's a no-op for the archive mode and the local mode.

Currently we have only one source with such operation, so I can name variable pendingRestartWalreceiver and replace corresponding switch to compact condition. If someday we need additional actions for another wal source - we can replace this code. Looks like some form of premature optimization from me

> However, if the restart of
> the WAL receiver is planned because of an update of primary_conninfo
> (or slot), shouldn't the follow-up mode be XLOG_FROM_STREAM without
> waiting for wal_retrieve_retry_interval ms for extra WAL to become
> available?

Hmm. Standby with log_min_messages = debug2 and default wal_retrieve_retry_interval = 5s gives me:

2019-08-01 12:38:31.255 MSK 15707 @ from  [vxid: txid:0] [] DEBUG:  sendtime 2019-08-01 12:38:31.254905+03 receipttime 2019-08-01 12:38:31.254986+03 replication apply delay 0 ms transfer latency 0 ms
2019-08-01 12:38:31.255 MSK 15707 @ from  [vxid: txid:0] [] DEBUG:  sending write 0/3023FA8 flush 0/3023F38 apply 0/3023F38
2019-08-01 12:38:31.262 MSK 15707 @ from  [vxid: txid:0] [] DEBUG:  sending write 0/3023FA8 flush 0/3023FA8 apply 0/3023F38
2019-08-01 12:38:31.262 MSK 15707 @ from  [vxid: txid:0] [] DEBUG:  sending write 0/3023FA8 flush 0/3023FA8 apply 0/3023FA8
2019-08-01 12:38:31.457 MSK 15699 @ from  [vxid: txid:0] [] LOG:  received SIGHUP, reloading configuration files
2019-08-01 12:38:31.458 MSK 15699 @ from  [vxid: txid:0] [] LOG:  parameter "primary_conninfo" changed to " host='/tmp' port=5555"
2019-08-01 12:38:31.459 MSK 15700 @ from  [vxid:1/0 txid:0] [] LOG:  The WAL receiver is going to be restarted due to change of primary_conninfo
2019-08-01 12:38:31.459 MSK 15703 @ from  [vxid: txid:0] [] DEBUG:  checkpointer updated shared memory configuration values
2019-08-01 12:38:31.459 MSK 15707 @ from  [vxid: txid:0] [] FATAL:  terminating walreceiver process due to administrator command
2019-08-01 12:38:31.463 MSK 15724 @ from  [vxid: txid:0] [] LOG:  started streaming WAL from primary at 0/3000000 on timeline 1
2019-08-01 12:38:31.464 MSK 15724 @ from  [vxid: txid:0] [] DEBUG:  sendtime 2019-08-01 12:38:31.463954+03 receipttime 2019-08-01 12:38:31.464228+03 replication apply delay 0 ms transfer latency 0 ms
2019-08-01 12:38:31.464 MSK 15724 @ from  [vxid: txid:0] [] DEBUG:  sendtime 2019-08-01 12:38:31.464068+03 receipttime 2019-08-01 12:38:31.464375+03 replication apply delay 0 ms transfer latency 0 ms
2019-08-01 12:38:31.464 MSK 15724 @ from  [vxid: txid:0] [] DEBUG:  sending write 0/3023FA8 flush 0/3023FA8 apply 0/3023FA8

No source switch, no wal_retrieve_retry_interval waiting. (wal writes on primary by \watch 0.3 query)
RequestXLogStreaming set walRcvState to WALRCV_STARTING - correct state for WalRcvStreaming and therefore we have no lastSourceFailed.

regards, Sergei


Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Sergei Kornilov
In reply to this post by Michael Paquier-2
Hello

Updated patch attached. (also I merged into one file)

> + <para>
> + WAL receiver will be restarted after <varname>primary_slot_name</varname>
> + was changed.
>           </para>
> The sentence sounds strange. Here is a suggestion:
> The WAL receiver is restarted after an update of primary_slot_name (or
> primary_conninfo).

Changed.

> The comment at the top of the call of ProcessStartupSigHup() in
> HandleStartupProcInterrupts() needs to be updated as it mentions a
> configuration file re-read, but that's not the case anymore.

Changed to "Process any requests or signals received recently." like in other places, e.g. syslogger

> pendingRestartSource's name does not represent what it does, as it is
> only associated with the restart of a WAL receiver when in streaming
> state, and that's a no-op for the archive mode and the local mode.

I renamed to pendingWalRcvRestart and replaced switch with simple condition.

> So when shutting down the WAL receiver after a parameter change, what
> happens is that the startup process waits for retrieve_retry_interval
> before moving back to the archive mode. Only after scanning again the
> archives do we restart a new WAL receiver.

As I answered earlier, here is no switch to archive or wal_retrieve_retry_interval waiting in my patch. I recheck on current revision too:

2019-08-28 12:16:27.295 MSK 11180 @ from  [vxid: txid:0] [] DEBUG:  sending write 0/30346C8 flush 0/30346C8 apply 0/30346C8
2019-08-28 12:16:27.493 MSK 11172 @ from  [vxid: txid:0] [] LOG:  received SIGHUP, reloading configuration files
2019-08-28 12:16:27.494 MSK 11172 @ from  [vxid: txid:0] [] LOG:  parameter "primary_conninfo" changed to "host='/tmp' port=5555 sslmode=disable sslcompression=0 gssencmode=disable target_session_attrs=any"
2019-08-28 12:16:27.496 MSK 11173 @ from  [vxid:1/0 txid:0] [] LOG:  The WAL receiver is going to be restarted due to change of primary_conninfo
2019-08-28 12:16:27.496 MSK 11176 @ from  [vxid: txid:0] [] DEBUG:  checkpointer updated shared memory configuration values
2019-08-28 12:16:27.496 MSK 11180 @ from  [vxid: txid:0] [] FATAL:  terminating walreceiver process due to administrator command
2019-08-28 12:16:27.500 MSK 11335 @ from  [vxid: txid:0] [] LOG:  started streaming WAL from primary at 0/3000000 on timeline 1
2019-08-28 12:16:27.500 MSK 11335 @ from  [vxid: txid:0] [] DEBUG:  sendtime 2019-08-28 12:16:27.50037+03 receipttime 2019-08-28 12:16:27.500821+03 replication apply delay 0 ms transfer latency 0 ms

No "DEBUG:  switched WAL source from stream to archive after failure" messages, no time difference (wal_retrieve_retry_interval = 5s).

regards, Sergei

v3_allow_reload_walreceiver_conninfo.patch (14K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Fujii Masao-2
On Wed, Aug 28, 2019 at 6:50 PM Sergei Kornilov <[hidden email]> wrote:
>
> Hello
>
> Updated patch attached. (also I merged into one file)

Thanks for updating the patch! Here are some comments from me.

    #primary_conninfo = '' # connection string to sending server
    # (change requires restart)
    #primary_slot_name = '' # replication slot on sending server
    # (change requires restart)

ISTM that you need to update the above parts in postgresql.conf.sample.


    /* we don't expect primary_conninfo to change */

ISTM that you need to update the above comment in walreceiver.c.


+         <para>
+          The WAL receiver is restarted after an update of
<varname>primary_conninfo</varname>.
+         </para>

If primary_conninfo is set to an empty string, walreceiver just shuts down,
not restarts. The above description in the doc is not always true.
So I'm thinking something like the following description is better.
Thought?

    If primary_conninfo is changed while WAL receiver is running,
    the WAL receiver shuts down and then restarts with new setting,
    except when primary_conninfo is an empty string.


There is the case where walreceiver doesn't shut down immediately
after the change of primary_conninfo. If the change happens while
the startup process in paused state (e.g., by pg_wal_replay_pause(),
recovery_min_apply_delay, recovery conflict, etc), the startup
process tries to terminate walreceiver after it gets out of such state.
Shouldn't this fact be documented as a note?

Regards,

--
Fujii Masao


Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Sergei Kornilov
Hello

Thank you for review!

> ISTM that you need to update the above parts in postgresql.conf.sample.

Good catch, I forgot about conf sample.

> ISTM that you need to update the above comment in walreceiver.c.

Changed

> If primary_conninfo is set to an empty string, walreceiver just shuts down,
> not restarts. The above description in the doc is not always true.
> So I'm thinking something like the following description is better.
> Thought?
>
>     If primary_conninfo is changed while WAL receiver is running,
>     the WAL receiver shuts down and then restarts with new setting,
>     except when primary_conninfo is an empty string.

Ok, changed.
I leave primary_slot_name description as before.

> There is the case where walreceiver doesn't shut down immediately
> after the change of primary_conninfo. If the change happens while
> the startup process in paused state (e.g., by pg_wal_replay_pause(),
> recovery_min_apply_delay, recovery conflict, etc), the startup
> process tries to terminate walreceiver after it gets out of such state.
> Shouldn't this fact be documented as a note?

Hmm. Is somewhere documented that walreceiver will receive WAL while the startup process in paused state?
(didn't add such note in current version)

regards, Sergei

v4_allow_reload_walreceiver_conninfo.patch (16K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Kyotaro Horiguchi-4
In reply to this post by Sergei Kornilov
Hello. I looked this and have some comments.

At Wed, 28 Aug 2019 12:49:46 +0300, Sergei Kornilov <[hidden email]> wrote in <[hidden email]>
sk> Hello
sk>
sk> Updated patch attached. (also I merged into one file)
sk>
sk> > + <para>
sk> > + WAL receiver will be restarted after <varname>primary_slot_name</varname>
sk> > + was changed.
sk> >           </para>
sk> > The sentence sounds strange. Here is a suggestion:
sk> > The WAL receiver is restarted after an update of primary_slot_name (or
sk> > primary_conninfo).
sk>
sk> Changed.

-          This parameter can only be set at server start.
+          This parameter can only be set in the <filename>postgresql.conf</filename>
+          file or on the server command line.

I'm not sure it's good to change the context of this
description. This was mentioning that changing of this parameter
requires server (re)start. So if we want to be on the same
context after rewriting, it would be like "This parameter can be
set any time and causes WAL receiver restart with the new setting
if the server is in standby mode."

And If I'm not missing something, I don't find an (explict)
paramter of postmaster for setting primary_conninfo.



sk> > The comment at the top of the call of ProcessStartupSigHup() in
sk> > HandleStartupProcInterrupts() needs to be updated as it mentions a
sk> > configuration file re-read, but that's not the case anymore.
sk>
sk> Changed to "Process any requests or signals received recently." like in other places, e.g. syslogger
sk>
sk> > pendingRestartSource's name does not represent what it does, as it is
sk> > only associated with the restart of a WAL receiver when in streaming
sk> > state, and that's a no-op for the archive mode and the local mode.
sk>
sk> I renamed to pendingWalRcvRestart and replaced switch with simple condition.
sk>
sk> > So when shutting down the WAL receiver after a parameter change, what
sk> > happens is that the startup process waits for retrieve_retry_interval
sk> > before moving back to the archive mode. Only after scanning again the
sk> > archives do we restart a new WAL receiver.
sk>
sk> As I answered earlier, here is no switch to archive or wal_retrieve_retry_interval waiting in my patch. I recheck on current revision too:

@@ -11787,48 +11793,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
  if (!StandbyMode)
  return false;
 
-                    /*
-                     * If primary_conninfo is set, launch walreceiver to try
-                     * to stream the missing WAL.
-                     *
-                     * If fetching_ckpt is true, RecPtr points to the initial
-                     * checkpoint location. In that case, we use RedoStartLSN

Mentioning code, the movement of the code block makes the
surrounding swtich almost useless and the structure and the
result looks somewhat strange.

Couldn't we do the same thing by just skipping the wait and
setting lastSourceFailed to true in the case of intentional
walreceiver restart?

The attached is an incomplete (fraction) patch to sketch what is
in my mind. With something like this and making
ProcessStartupSigHup kill walreceiver would work.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6876537b62..61b235f8f7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -11881,10 +11881,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
  * obtaining the requested WAL. We're going to loop back
  * and retry from the archive, but if it hasn't been long
  * since last attempt, sleep wal_retrieve_retry_interval
- * milliseconds to avoid busy-waiting.
+ * milliseconds to avoid busy-waiting. We don't wait if
+ * explicitly requested to restart.
  */
  now = GetCurrentTimestamp();
- if (!TimestampDifferenceExceeds(last_fail_time, now,
+ if (!pendingWalRcvRestart &&
+ !TimestampDifferenceExceeds(last_fail_time, now,
  wal_retrieve_retry_interval))
  {
  long secs,
@@ -11905,6 +11907,17 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
  }
  last_fail_time = now;
  currentSource = XLOG_FROM_ARCHIVE;
+
+ /*
+ * If wal receiver is requested to restart, we skip the
+ * next XLOG_FROM_ARCHIVE to immediately starting it.
+ */
+ if (pendingWalRcvRestart)
+ {
+ lastSourceFailed = true;
+ continue;
+ }
+
  break;
 
  default:
Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Sergei Kornilov
Hello

Thank you for review!

> - This parameter can only be set at server start.
> + This parameter can only be set in the <filename>postgresql.conf</filename>
> + file or on the server command line.
>
> I'm not sure it's good to change the context of this
> description. This was mentioning that changing of this parameter
> requires server (re)start. So if we want to be on the same
> context after rewriting, it would be like "This parameter can be
> set any time and causes WAL receiver restart with the new setting
> if the server is in standby mode."
Was written such way after this review: https://www.postgresql.org/message-id/20181125214313.lydvmrraqjfrb3s2%40alap3.anarazel.de

> And If I'm not missing something, I don't find an (explict)
> paramter of postmaster for setting primary_conninfo.

Well, we have common -c option: -c name=value

> Couldn't we do the same thing by just skipping the wait and
> setting lastSourceFailed to true in the case of intentional
> walreceiver restart?

Yes, it's possible. Let's see... Done in attached variant.
We need check pendingWalRcvRestart before rescanLatestTimeLine lines.

regards, Sergei

v5_allow_reload_walreceiver_conninfo.patch (15K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Andres Freund
Hi,

On 2019-09-19 17:46:06 +0300, Sergei Kornilov wrote:

>           <para>
> -          This parameter can only be set at server start.
> +          This parameter can only be set in the <filename>postgresql.conf</filename>
> +          file or on the server command line.
>            This setting has no effect if the server is not in standby mode.
>           </para>
> +         <para>
> +          If <varname>primary_conninfo</varname> is changed while WAL receiver is running,
> +          the WAL receiver shuts down and then restarts with new setting,
> +          except when primary_conninfo is an empty string.
> +         </para>

From the sentence structure it's not clear that "except when
primary_conninfo is an empty string" only applies to "and then restarts
with new setting".


> +/*
> + * Need for restart running WalReceiver due the configuration change.
> + * Suitable only for XLOG_FROM_STREAM source
> + */
> +static bool pendingWalRcvRestart = false;

s/due the/due to a/ (or even "due to the").


> @@ -11862,6 +11869,17 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
>   if (WalRcvStreaming())
>   ShutdownWalRcv();
>  
> + /*
> + * If wal receiver is requested to restart, we skip the
> + * next XLOG_FROM_ARCHIVE to immediately starting it.
> + */
> + if (pendingWalRcvRestart)
> + {
> + lastSourceFailed = true;
> + currentSource = XLOG_FROM_ARCHIVE;
> + continue;
> + }

I can't parse that comment. What does "skipping to starting" mean? I
assume it's just about avoiding wal_retrieve_retry_interval, but I think
the comment ought to be rephrased.

Also, should we really check this before scanning for new timelines?

Why is it the right thing to change to XLOG_FROM_ARCHIVE when we're just
restarting walreceiver? The server might unnecessarily get stuck in
archive based recovery for a long time this way? It seems to me that
we'd need to actually trigger RequestXLogStreaming() in this case.

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Sergei Kornilov
Hello

Thank you for review! Can you please also check v4 version? v5 implements design suggested by Kyotaro Horiguchi-san, while v4 has another design. Which one do you prefer? Or are both wrong?

> I can't parse that comment. What does "skipping to starting" mean? I
> assume it's just about avoiding wal_retrieve_retry_interval, but I think
> the comment ought to be rephrased.

Design idea is to rewrite current state from working XLOG_FROM_STREAM to failed XLOG_FROM_ARCHIVE (without actually try this method on this iteration) and immediately go to next iteration to advance the state machine. Next iteration after failed archive recovery is walreceiver. So walreceiver will be stopped just before this lines and started on next iteration. Walreceiver will be restarted, we do not call restore_command

> Also, should we really check this before scanning for new timelines?

Hmm, on the next day... No, this is not really necessary.

> Why is it the right thing to change to XLOG_FROM_ARCHIVE when we're just
> restarting walreceiver? The server might unnecessarily get stuck in
> archive based recovery for a long time this way? It seems to me that
> we'd need to actually trigger RequestXLogStreaming() in this case.

I hope I clarified this in design idea description above.

Thank you!

regards, Sergei


Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Kyotaro Horiguchi-4
Hello.

At Sat, 21 Sep 2019 13:06:25 +0300, Sergei Kornilov <[hidden email]> wrote in <[hidden email]>
> Hello
>
> Thank you for review! Can you please also check v4 version? v5 implements design suggested by Kyotaro Horiguchi-san, while v4 has another design. Which one do you prefer? Or are both wrong?
>
> > I can't parse that comment. What does "skipping to starting" mean? I
> > assume it's just about avoiding wal_retrieve_retry_interval, but I think
> > the comment ought to be rephrased.
>
> Design idea is to rewrite current state from working XLOG_FROM_STREAM to failed XLOG_FROM_ARCHIVE (without actually try this method on this iteration) and immediately go to next iteration to advance the state machine. Next iteration after failed archive recovery is walreceiver. So walreceiver will be stopped just before this lines and started on next iteration. Walreceiver will be restarted, we do not call restore_command

Sorry, it's my bad. It meant "If wal receiver is requested to
restart, change state to XLOG_FROM_STREAM immediately skipping
the next XLOG_FROM_ARCHIVE state.".

> > Also, should we really check this before scanning for new timelines?
>
> Hmm, on the next day... No, this is not really necessary.

No problem when timeline is not changed when explicit restart of
wal receiver.  But if it happened just after the standby received
new hisotry file, we suffer an extra restart of wal receiver. It
seems better that we check that in the case.

> > Why is it the right thing to change to XLOG_FROM_ARCHIVE when we're just
> > restarting walreceiver? The server might unnecessarily get stuck in
> > archive based recovery for a long time this way? It seems to me that
> > we'd need to actually trigger RequestXLogStreaming() in this case.
>
> I hope I clarified this in design idea description above.

My suggestion was just to pretend that the next XLOG_FROM_STREAM
failed, but the outcome actually looks wierd as Andres commented.

I think that comes from the structure of the state machine. WAL
receiver is started not at the beginning of XLOG_FROM_STREAM
state but at the end of XLOG_FROM_ARCHIVE. So we need to switch
once to XLOG_FROM_ARCHIVE in order to start wal receiver.

So, I'd like to propose to move the stuff to the second switch().
(See the attached incomplete patch.) This is rather similar to
Sergei's previous proposal, but the structure of the state
machine is kept.

Some other comments are below.

In ProcessStartupSigHup, conninfo and slotname don't need to be
retained until the end of the function.

The log message in the function seems to be too detailed.  On the
other hand, if we changed primary_conninfo to '' (stop) or vise
versa (start), the message (restart) looks strange.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6c69eb6dd7..dba0042db6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -11755,12 +11755,15 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
  for (;;)
  {
  int oldSource = currentSource;
+ bool start_wal_receiver = false;
 
  /*
- * First check if we failed to read from the current source, and
+ * First check if we failed to read from the current source or if
+ * we want to restart wal receiver, and
  * advance the state machine if so. The failure to read might've
  * happened outside this function, e.g when a CRC check fails on a
  * record, or within this loop.
+ * Restart wal receiver if explicitly requested, too.
  */
  if (lastSourceFailed)
  {
@@ -11788,53 +11791,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
  if (!StandbyMode)
  return false;
 
- /*
- * If primary_conninfo is set, launch walreceiver to try
- * to stream the missing WAL.
- *
- * If fetching_ckpt is true, RecPtr points to the initial
- * checkpoint location. In that case, we use RedoStartLSN
- * as the streaming start position instead of RecPtr, so
- * that when we later jump backwards to start redo at
- * RedoStartLSN, we will have the logs streamed already.
- */
- if (PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0)
- {
- XLogRecPtr ptr;
- TimeLineID tli;
-
- if (fetching_ckpt)
- {
- ptr = RedoStartLSN;
- tli = ControlFile->checkPointCopy.ThisTimeLineID;
- }
- else
- {
- ptr = RecPtr;
-
- /*
- * Use the record begin position to determine the
- * TLI, rather than the position we're reading.
- */
- tli = tliOfPointInHistory(tliRecPtr, expectedTLEs);
-
- if (curFileTLI > 0 && tli < curFileTLI)
- elog(ERROR, "according to history file, WAL location %X/%X belongs to timeline %u, but previous recovered WAL file came from timeline %u",
- (uint32) (tliRecPtr >> 32),
- (uint32) tliRecPtr,
- tli, curFileTLI);
- }
- curFileTLI = tli;
- RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
- PrimarySlotName);
- receivedUpto = 0;
- }
-
  /*
  * Move to XLOG_FROM_STREAM state in either case. We'll
  * get immediate failure if we didn't launch walreceiver,
  * and move on to the next state.
  */
+ start_wal_receiver = true;
  currentSource = XLOG_FROM_STREAM;
  break;
 
@@ -11864,8 +11826,8 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
  ShutdownWalRcv();
 
  /*
- * Before we sleep, re-scan for possible new timelines if
- * we were requested to recover to the latest timeline.
+ * Re-scan for possible new timelines if we were requested
+ * to recover to the latest timeline.
  */
  if (recoveryTargetTimeLineGoal == RECOVERY_TARGET_TIMELINE_LATEST)
  {
@@ -11929,8 +11891,8 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
  lastSourceFailed ? "failure" : "success");
 
  /*
- * We've now handled possible failure. Try to read from the chosen
- * source.
+ * We've now handled possible failure and configuration change. Try to
+ * read from the chosen source.
  */
  lastSourceFailed = false;
 
@@ -11969,9 +11931,71 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
  bool havedata;
 
  /*
- * Check if WAL receiver is still active.
+ * shutdown wal receiver if restart is requested.
  */
- if (!WalRcvStreaming())
+ if (!start_wal_receiver && pendingWalRcvRestart)
+ {
+ if (WalRcvRunning())
+ ShutdownWalRcv();
+
+ /*
+ * Re-scan for possible new timelines if we were
+ * requested to recover to the latest timeline.
+ */
+ if (recoveryTargetTimeLineGoal ==
+ RECOVERY_TARGET_TIMELINE_LATEST)
+ rescanLatestTimeLine();
+
+ start_wal_receiver = true;
+ }
+ pendingWalRcvRestart = false;
+
+ /*
+ * Launch walreceiver if needed.
+ *
+ * If fetching_ckpt is true, RecPtr points to the initial
+ * checkpoint location. In that case, we use RedoStartLSN
+ * as the streaming start position instead of RecPtr, so
+ * that when we later jump backwards to start redo at
+ * RedoStartLSN, we will have the logs streamed already.
+ */
+ if (start_wal_receiver &&
+ PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0)
+ {
+ XLogRecPtr ptr;
+ TimeLineID tli;
+
+ if (fetching_ckpt)
+ {
+ ptr = RedoStartLSN;
+ tli = ControlFile->checkPointCopy.ThisTimeLineID;
+ }
+ else
+ {
+ ptr = RecPtr;
+
+ /*
+ * Use the record begin position to determine the
+ * TLI, rather than the position we're reading.
+ */
+ tli = tliOfPointInHistory(tliRecPtr, expectedTLEs);
+
+ if (curFileTLI > 0 && tli < curFileTLI)
+ elog(ERROR, "according to history file, WAL location %X/%X belongs to timeline %u, but previous recovered WAL file came from timeline %u",
+ (uint32) (tliRecPtr >> 32),
+ (uint32) tliRecPtr,
+ tli, curFileTLI);
+ }
+ curFileTLI = tli;
+ RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
+ PrimarySlotName);
+ receivedUpto = 0;
+ }
+
+ /*
+ * Else, check if WAL receiver is still active.
+ */
+ else if (!WalRcvStreaming())
  {
  lastSourceFailed = true;
  break;
Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Sergei Kornilov
Hello

> So, I'd like to propose to move the stuff to the second switch().
> (See the attached incomplete patch.) This is rather similar to
> Sergei's previous proposal, but the structure of the state
> machine is kept.

Very similar to my v4 proposal (also move RequestXLogStreaming call), but closer to currentSource reading. No objections from me, attached patch is changed this way.
I renamed start_wal_receiver to startWalReceiver - this style looks more consistent to near code.

> + /*
> + * Else, check if WAL receiver is still active.
> + */
> + else if (!WalRcvStreaming())

I think we still need wait WalRcvStreaming after RequestXLogStreaming call. So I remove else branch and leave separate condition.

> In ProcessStartupSigHup, conninfo and slotname don't need to be
> retained until the end of the function.

Agreed, I move pfree

> The log message in the function seems to be too detailed. On the
> other hand, if we changed primary_conninfo to '' (stop) or vise
> versa (start), the message (restart) looks strange.

I have no strong opinion here. These messages was changed many times during this thread lifetime, can be changed anytime. I think this is not issue since we have no consensus about overall design.
Write detailed messages was proposed here: https://www.postgresql.org/message-id/20190216151025.GJ2240%40paquier.xyz

> or vise versa (start)

I explicitly check currentSource and WalRcvRunning, so we have no such messages if user had no walreceiver before.

regards, Sergei

v6_allow_reload_walreceiver_conninfo.patch (17K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Tomas Vondra-4
Hi,

I'm a bit confused by the status of this patch - it's marked as Waiting
on Author (since last week), but that status was set by Sergei Kornilov
who is also the patch author. And there have been no messages since the
beginning of 2019-11 CF.

So what's the current status of this patch?


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Sergei Kornilov
Hello

"Waiting on Author" is the right status. I found logical conflict with recently added wal_receiver_create_temp_slot GUC and my tests are currently broken. Will fix it and post new version.

PS: also, I surprised why it's ok for wal_receiver_create_temp_slot to be PGC_SIGHUP and ignore change of this setting until walreceiver will reconnect by unrelated reason. I means walreceiver does nothing special on SIGHUP. In common case change of wal_receiver_create_temp_slot setting will have effect only during restart of walreceiver process. And therefore we will switch to archive recovery. But such design was strongly rejected for my patch year ago.

regards, Sergei


Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Michael Paquier-2
On Tue, Jan 21, 2020 at 06:03:18PM +0300, Sergei Kornilov wrote:
> PS: also, I surprised why it's ok for wal_receiver_create_temp_slot
> to be PGC_SIGHUP and ignore change of this setting until walreceiver
> will reconnect by unrelated reason. I means walreceiver does nothing
> special on SIGHUP. In common case change of
> wal_receiver_create_temp_slot setting will have effect only during
> restart of walreceiver process. And therefore we will switch to
> archive recovery. But such design was strongly rejected for my patch
> year ago.

[ Looks at 3297308... ]
Yeah, you are right.  I was not paying much attention but something
does not stick here.  My understanding is that we should have the WAL
receiver receive the value it needs to use from the startup process
(aka via RequestXLogStreaming from xlog.c), and that we ought to make
this new parameter PGC_POSTMASTER instead of PGC_SIGHUP.  HEAD is
inconsistent here.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Sergei Kornilov
Hello

> Yeah, you are right. I was not paying much attention but something
> does not stick here. My understanding is that we should have the WAL
> receiver receive the value it needs to use from the startup process
> (aka via RequestXLogStreaming from xlog.c), and that we ought to make
> this new parameter PGC_POSTMASTER instead of PGC_SIGHUP. HEAD is
> inconsistent here.

Thank you!

I attached two patches:
- first changes wal_receiver_create_temp_slot to PGC_POSTMASTER and moved the logic to RequestXLogStreaming
- second is based on last published v6 version of main patch. It changes wal_receiver_create_temp_slot back to PGC_SIGHUP along with primary_conninfo and primary_slot_name and will restart walreceiver if need.

Regarding the main patch: we have several possibilities for moving RequestXLogStreaming call. We need to choose one.
And, of course, changes in the text...

regards, Sergei

0002_v7_allow_reload_walreceiver_conninfo.patch (20K) Download Attachment
0001_v7_move_temp_slot_logic_to_startup.patch (8K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Sergei Kornilov
Hello
Small rebase due to merge conflict of the tests. No functional changes since v7.

PS: also it is end of current CF, I will mark patch entry as moved to the next CF.

0002_v8_allow_reload_walreceiver_conninfo.patch (20K) Download Attachment
0001_v8_move_temp_slot_logic_to_startup.patch (8K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Alvaro Herrera-9
In reply to this post by Michael Paquier-2
On 2020-Jan-22, Michael Paquier wrote:

> On Tue, Jan 21, 2020 at 06:03:18PM +0300, Sergei Kornilov wrote:
> > PS: also, I surprised why it's ok for wal_receiver_create_temp_slot
> > to be PGC_SIGHUP and ignore change of this setting until walreceiver
> > will reconnect by unrelated reason. I means walreceiver does nothing
> > special on SIGHUP. In common case change of
> > wal_receiver_create_temp_slot setting will have effect only during
> > restart of walreceiver process. And therefore we will switch to
> > archive recovery. But such design was strongly rejected for my patch
> > year ago.
>
> [ Looks at 3297308... ]
> Yeah, you are right.  I was not paying much attention but something
> does not stick here.  My understanding is that we should have the WAL
> receiver receive the value it needs to use from the startup process
> (aka via RequestXLogStreaming from xlog.c), and that we ought to make
> this new parameter PGC_POSTMASTER instead of PGC_SIGHUP.  HEAD is
> inconsistent here.

I'm CCing Peter as committer of 329730827848.

What are the downsides of changing wal_receiver_create_temp_slot to
PGC_POSTMASTER?  It seems pretty nasty to requires a full server
restart.  Maybe we can signal all walreceivers at that point so that
they restart with the correct setting?  That's much less problematic, I
would think.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Alvaro Herrera-9
On 2020-Mar-13, Alvaro Herrera wrote:

> What are the downsides of changing wal_receiver_create_temp_slot to
> PGC_POSTMASTER?  It seems pretty nasty to requires a full server
> restart.  Maybe we can signal all walreceivers at that point so that
> they restart with the correct setting?  That's much less problematic, I
> would think.

... oh, that's exactly what 0002 does.  So patch 0001 is only about
making a temporary change to the create_temp_slot to be consistent with
existing policy, before changing the policy.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Michael Paquier-2
On Fri, Mar 13, 2020 at 11:17:38AM -0300, Alvaro Herrera wrote:
> ... oh, that's exactly what 0002 does.  So patch 0001 is only about
> making a temporary change to the create_temp_slot to be consistent with
> existing policy, before changing the policy.

Yes.  In my opinion, patch 0002 should not change the GUC mode of
wal_receiver_create_temp_slot as the discussion here is about
primary_conninfo, even if both may share some logic regarding WAL
receiver shutdown and its restart triggered by the startup process.

Patch 0001 has actually been presented on this thread first:
https://www.postgresql.org/message-id/753391579708726@...
And there is an independent patch registered in this CF:
https://commitfest.postgresql.org/27/2456/

Should we add patch 0001 as an open item for v13 as there is a risk of
forgetting this issue?  I have created a wiki blank page a couple of
weeks back:
https://wiki.postgresql.org/wiki/PostgreSQL_13_Open_Items
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Sergei Kornilov
Hello

Sorry for late replies.

> Yes. In my opinion, patch 0002 should not change the GUC mode of
> wal_receiver_create_temp_slot as the discussion here is about
> primary_conninfo, even if both may share some logic regarding WAL
> receiver shutdown and its restart triggered by the startup process.

Ok, I removed related changes from main patch. Along with minor merge conflict.

> Patch 0001 has actually been presented on this thread first:
> https://www.postgresql.org/message-id/753391579708726@...
> And there is an independent patch registered in this CF:
> https://commitfest.postgresql.org/27/2456/

Yep, 0001 is separate patch. I will post copy of this patch here to make cfbot works.  Main patch 0002 requires resetting of is_temp_slot in RequestXLogStreaming to works properly.

> Should we add patch 0001 as an open item for v13 as there is a risk of
> forgetting this issue?

I think yes.

Well, it seems better to move this patch to next commitfest?

regards, Sergei

0001-v3-change-wal_receiver_create_temp_slot.patch (12K) Download Attachment
0002_v9_allow_reload_walreceiver_conninfo.patch (18K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: allow online change primary_conninfo

Alvaro Herrera-9
On 2020-Mar-17, Sergei Kornilov wrote:

> Well, it seems better to move this patch to next commitfest?

What?  You want to make wal_receiver_create_temp_slot unchangeable and
default to off in pg13, and delay the patch that fixes those things to
pg14?  That makes no sense to me.  Please keep them both here so that we
can get things to a usable state.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


1234