Logical replication keepalive flood


Logical replication keepalive flood

Abbas Butt
Hi,
I have observed the following behavior with PostgreSQL 13.3.

The WAL sender process sends approximately 500 keepalive messages per second to pg_recvlogical.
These keepalive messages are totally unnecessary.
Keepalives should be sent only if there is no network traffic and a certain time (half of wal_sender_timeout) passes.
These keepalive messages not only choke the network but also impact the performance of the receiver,
because the receiver has to process the received message and then decide whether to reply to it or not.
The receiver remains busy doing this activity 500 times a second.

On investigation it turned out that the following code fragment in function WalSndWaitForWal in file walsender.c is responsible for sending these frequent keepalives:

if (MyWalSnd->flush < sentPtr &&
    MyWalSnd->write < sentPtr &&
    !waiting_for_ping_response)
        WalSndKeepalive(false);

waiting_for_ping_response is normally false, and flush and write will always be less than sentPtr (the receiver's reported LSNs cannot get ahead of the server's send pointer).
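To illustrate what the receiver has to do for every one of these messages (this sketch is an illustration, not code from PostgreSQL or pg_recvlogical): each keepalive arrives inside a CopyData message whose payload is, per the streaming replication protocol, a 'k' byte followed by an 8-byte WAL-end LSN, an 8-byte send timestamp and a 1-byte reply-request flag, all big-endian. A minimal standalone decoder over a made-up sample payload:

/*
 * Sketch only: decode a primary keepalive ('k') payload as documented in
 * the streaming replication protocol.  The sample bytes are invented.
 */
#include <stdint.h>
#include <stdio.h>

static uint64_t
read_be64(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

int
main(void)
{
    /* 'k', then walEnd, sendTime, replyRequested (values made up) */
    const unsigned char msg[18] = {
        'k',
        0x00, 0x00, 0x00, 0x00, 0x01, 0x6B, 0x3D, 0x50,   /* walEnd = 0/16B3D50 */
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,   /* sendTime           */
        0x00                                               /* no reply requested */
    };

    if (msg[0] == 'k')
    {
        uint64_t wal_end = read_be64(&msg[1]);
        uint64_t send_time = read_be64(&msg[9]);
        int      reply_requested = msg[17];

        /* the receiver does at least this much for every keepalive it gets */
        printf("keepalive: wal_end=%X/%X reply_requested=%d send_time=%llu\n",
               (unsigned) (wal_end >> 32), (unsigned) wal_end,
               reply_requested, (unsigned long long) send_time);
    }
    return 0;
}

pg_recvlogical has to do at least this much 500 times a second, and then decide whether the flag or its own status interval obliges it to reply.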

Here are the steps to reproduce:
1. Start the database server.
2. Setup pgbench tables.
  ./pgbench -i -s 50 -h 192.168.5.140 -p 7654 -U abbas postgres
3. Create a logical replication slot.
   SELECT * FROM pg_create_logical_replication_slot('my_slot', 'test_decoding');
4. Start pg_recvlogical.
  ./pg_recvlogical --slot=my_slot --verbose -d postgres -h 192.168.5.140 -p 7654 -U abbas --start -f -
5. Run pgbench
  ./pgbench -U abbas -h 192.168.5.140 -p 7654  -c 2 -j 2 -T 1200 -n postgres
6. Observe the network traffic to find the keepalive flood (one way to do this is shown below).
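For step 6, one possible way to watch the traffic (any packet capture tool will do; this command is a suggestion, not part of the original report):

  sudo tcpdump -i any -nn tcp port 7654

The keepalives show up as a constant stream of small server-to-client packets interleaved with the decoded changes.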

Alternatively, modify the above code fragment as follows to see approximately 500 keepalive log messages per second:

if (MyWalSnd->flush < sentPtr &&
    MyWalSnd->write < sentPtr &&
    !waiting_for_ping_response)
{
    elog(LOG, "[Keepalive]  wrt ptr %X/%X  snt ptr %X/%X ",
               (uint32) (MyWalSnd->write >> 32),
               (uint32) MyWalSnd->write,
               (uint32) (sentPtr >> 32),
               (uint32) sentPtr);
    WalSndKeepalive(false);
}

Opinions?

--
Abbas
Senior Architect

Re: Logical replication keepalive flood

Kyotaro Horiguchi
At Sat, 5 Jun 2021 16:08:00 +0500, Abbas Butt <[hidden email]> wrote in

> Hi,
> I have observed the following behavior with PostgreSQL 13.3.
>
> The WAL sender process sends approximately 500 keepalive messages per
> second to pg_recvlogical.
> These keepalive messages are totally un-necessary.
> Keepalives should be sent only if there is no network traffic and a certain
> time (half of wal_sender_timeout) passes.
> These keepalive messages not only choke the network but also impact the
> performance of the receiver,
> because the receiver has to process the received message and then decide
> whether to reply to it or not.
> The receiver remains busy doing this activity 500 times a second.
I can reproduce the problem.

> On investigation it is revealed that the following code fragment in
> function WalSndWaitForWal in file walsender.c is responsible for sending
> these frequent keepalives:
>
> if (MyWalSnd->flush < sentPtr &&
>     MyWalSnd->write < sentPtr &&
>     !waiting_for_ping_response)
>         WalSndKeepalive(false);

The immediate cause is that pg_recvlogical doesn't send a reply before
sleeping. Currently it sends replies at 10-second intervals.

So the attached first patch stops the flood.

That said, I don't think it is intended that the logical walsender sends
keep-alive packets with such a high frequency.  It happens because the
walsender doesn't actually wait at all: it waits on WL_SOCKET_WRITEABLE
because the keep-alive packet queued just before is always pending.

So, as in the attached second patch, we should try to flush out the keep-alive
packets if possible before checking pq_is_send_pending().

Either patch alone "fixes" the issue, but I think each of them is reasonable
by itself.

Any thoughts, suggestions and/or opinions?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index 5efec160e8..4497ff1071 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -362,6 +362,10 @@ StreamLogicalLog(void)
  goto error;
  }
 
+ /* send a reply for all writes so far */
+ if (!flushAndSendFeedback(conn, &now))
+ goto error;
+
  FD_ZERO(&input_mask);
  FD_SET(PQsocket(conn), &input_mask);
 

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 109c723f4e..fcea56d1c1 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1469,6 +1469,9 @@ WalSndWaitForWal(XLogRecPtr loc)
  /* Send keepalive if the time has come */
  WalSndKeepaliveIfNecessary();
 
+ /* We may have queued a keep alive packet. flush it before sleeping. */
+ pq_flush_if_writable();
+
  /*
  * Sleep until something happens or we time out.  Also wait for the
  * socket becoming writable, if there's still pending output.

Re: Logical replication keepalive flood

Amit Kapila
On Mon, Jun 7, 2021 at 12:54 PM Kyotaro Horiguchi
<[hidden email]> wrote:

>
> At Sat, 5 Jun 2021 16:08:00 +0500, Abbas Butt <[hidden email]> wrote in
> > Hi,
> > I have observed the following behavior with PostgreSQL 13.3.
> >
> > The WAL sender process sends approximately 500 keepalive messages per
> > second to pg_recvlogical.
> > These keepalive messages are totally un-necessary.
> > Keepalives should be sent only if there is no network traffic and a certain
> > time (half of wal_sender_timeout) passes.
> > These keepalive messages not only choke the network but also impact the
> > performance of the receiver,
> > because the receiver has to process the received message and then decide
> > whether to reply to it or not.
> > The receiver remains busy doing this activity 500 times a second.
>
> I can reproduce the problem.
>
> > On investigation it is revealed that the following code fragment in
> > function WalSndWaitForWal in file walsender.c is responsible for sending
> > these frequent keepalives:
> >
> > if (MyWalSnd->flush < sentPtr &&
> >     MyWalSnd->write < sentPtr &&
> >     !waiting_for_ping_response)
> >         WalSndKeepalive(false);
>
> The immediate cause is pg_recvlogical doesn't send a reply before
> sleeping. Currently it sends replies every 10 seconds intervals.
>

Yeah, but one can use the -s option to send it at shorter intervals.
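For example, re-using the pg_recvlogical invocation from the reproduction steps with a one-second status interval (the interval value here is only an illustration):

  ./pg_recvlogical --slot=my_slot --verbose -d postgres -h 192.168.5.140 -p 7654 -U abbas --start -s 1 -f -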

> So the attached first patch stops the flood.
>

I am not sure sending feedback every time before sleeping is a good idea;
this might lead to unnecessarily sending more messages. Can we try using a
one-second interval with the -s option to see how it behaves? As a point of
comparison, the similar logic in workers.c uses wal_receiver_timeout to send
such an update message rather than sending it every time before sleeping.

> That said, I don't think it is not intended that logical walsender
> sends keep-alive packets with such a high frequency.  It happens
> because walsender actually doesn't wait at all because it waits on
> WL_SOCKET_WRITEABLE because the keep-alive packet inserted just before
> is always pending.
>
> So as the attached second, we should try to flush out the keep-alive
> packets if possible before checking pg_is_send_pending().
>

/* Send keepalive if the time has come */
  WalSndKeepaliveIfNecessary();

+ /* We may have queued a keep alive packet. flush it before sleeping. */
+ pq_flush_if_writable();

We already call pq_flush_if_writable() from WalSndKeepaliveIfNecessary
after sending the keep-alive message, so not sure how this helps?

--
With Regards,
Amit Kapila.



Re: Logical replication keepalive flood

Abbas Butt


On Mon, Jun 7, 2021 at 3:13 PM Amit Kapila <[hidden email]> wrote:
On Mon, Jun 7, 2021 at 12:54 PM Kyotaro Horiguchi
<[hidden email]> wrote:
>
> At Sat, 5 Jun 2021 16:08:00 +0500, Abbas Butt <[hidden email]> wrote in
> > Hi,
> > I have observed the following behavior with PostgreSQL 13.3.
> >
> > The WAL sender process sends approximately 500 keepalive messages per
> > second to pg_recvlogical.
> > These keepalive messages are totally un-necessary.
> > Keepalives should be sent only if there is no network traffic and a certain
> > time (half of wal_sender_timeout) passes.
> > These keepalive messages not only choke the network but also impact the
> > performance of the receiver,
> > because the receiver has to process the received message and then decide
> > whether to reply to it or not.
> > The receiver remains busy doing this activity 500 times a second.
>
> I can reproduce the problem.
>
> > On investigation it is revealed that the following code fragment in
> > function WalSndWaitForWal in file walsender.c is responsible for sending
> > these frequent keepalives:
> >
> > if (MyWalSnd->flush < sentPtr &&
> >     MyWalSnd->write < sentPtr &&
> >     !waiting_for_ping_response)
> >         WalSndKeepalive(false);
>
> The immediate cause is pg_recvlogical doesn't send a reply before
> sleeping. Currently it sends replies every 10 seconds intervals.
>

Yeah, but one can use -s option to send it at lesser intervals.

That option affects pg_recvlogical; it will not stop the server from sending keepalives too frequently.
By default the status interval is 10 secs, yet we are still getting 500 keepalives a second from the server.
 

> So the attached first patch stops the flood.
>

I am not sure sending feedback every time before sleep is a good idea,
this might lead to unnecessarily sending more messages. Can we try by
using one-second interval with -s option to see how it behaves? As a
matter of comparison the similar logic in workers.c uses
wal_receiver_timeout to send such an update message rather than
sending it every time before sleep.

> That said, I don't think it is not intended that logical walsender
> sends keep-alive packets with such a high frequency.  It happens
> because walsender actually doesn't wait at all because it waits on
> WL_SOCKET_WRITEABLE because the keep-alive packet inserted just before
> is always pending.
>
> So as the attached second, we should try to flush out the keep-alive
> packets if possible before checking pg_is_send_pending().
>

/* Send keepalive if the time has come */
  WalSndKeepaliveIfNecessary();

+ /* We may have queued a keep alive packet. flush it before sleeping. */
+ pq_flush_if_writable();

We already call pq_flush_if_writable() from WalSndKeepaliveIfNecessary
after sending the keep-alive message, so not sure how this helps?

--
With Regards,
Amit Kapila.


--
Abbas
Senior Architect

Re: Logical replication keepalive flood

Kyotaro Horiguchi
At Mon, 7 Jun 2021 15:26:05 +0500, Abbas Butt <[hidden email]> wrote in

> On Mon, Jun 7, 2021 at 3:13 PM Amit Kapila <[hidden email]> wrote:
> > > The immediate cause is pg_recvlogical doesn't send a reply before
> > > sleeping. Currently it sends replies every 10 seconds intervals.
> > >
> >
> > Yeah, but one can use -s option to send it at lesser intervals.
> >
>
> That option can impact pg_recvlogical, it will not impact the server
> sending keepalives too frequently.
> By default the status interval is 10 secs, still we are getting 500
> keepalives a second from the server.
>
> > > So the attached first patch stops the flood.
> > >
> >
> > I am not sure sending feedback every time before sleep is a good idea,
> > this might lead to unnecessarily sending more messages. Can we try by
> > using one-second interval with -s option to see how it behaves? As a
> > matter of comparison the similar logic in workers.c uses
> > wal_receiver_timeout to send such an update message rather than
> > sending it every time before sleep.

The logical walreceiver sends feedback when walrcv_receive() doesn't
receive a byte.  If it's not good for pg_recvlogical to do the same
thing, do we need to improve the logical walsender's behavior as well?

> > > That said, I don't think it is not intended that logical walsender
> > > sends keep-alive packets with such a high frequency.  It happens
> > > because walsender actually doesn't wait at all because it waits on
> > > WL_SOCKET_WRITEABLE because the keep-alive packet inserted just before
> > > is always pending.
> > >
> > > So as the attached second, we should try to flush out the keep-alive
> > > packets if possible before checking pg_is_send_pending().
> > >
> >
> > /* Send keepalive if the time has come */
> >   WalSndKeepaliveIfNecessary();
> >
> > + /* We may have queued a keep alive packet. flush it before sleeping. */
> > + pq_flush_if_writable();
> >
> > We already call pq_flush_if_writable() from WalSndKeepaliveIfNecessary
> > after sending the keep-alive message, so not sure how this helps?

No. WalSndKeepaliveIfNecessary calls it only when the walsender has not
received a reply message for a long time. So the keepalive sent by the
direct call to WalSndKeepalive() from WalSndWaitForWal is not flushed
out in most cases, which causes the flood.

I rechecked all callers of WalSndKeepalive().

WalSndKeepalive()
+- *WalSndWaitForWal
+- ProcessStandbyReplyMessage
|+- ProcessStandbyMessage
| +- ProcessRepliesIfAny
|  +- $WalSndWriteData
|  +- *WalSndWaitForWal
|  +- WalSndLoop
|    (calls pq_flush_if_writable() after sending the packet, but the
|     keepalive packet prevents the following stream data from being sent
|     since the pending keepalive packet makes pq_is_send_pending()
|     (falsely) return true.)
+- WalSndDone
 +- *WalSndLoop
+- WalSndKeepaliveIfNecessary
   (calls pq_flush_if_writable(), but only right after it has itself called WalSndKeepalive())

The callers prefixed by '*' above mistakenly conclude that some of the data
they sent is still pending even when the only pending bytes are those of the
keepalive packet. Of course the keepalive packets should be sent *before*
sleeping, but the unsent keepalive packet prevents those callers from
sleeping; they then immediately send another keepalive packet and repeat
this until the condition changes. (The caller prefixed by "$" also enters a
sleep before flushing, but doesn't repeat sending keepalives.)

Those callers are forgetting that a keepalive packet may be queued but not
flushed after calling WalSndKeepalive.  So a more sensible fix would be
calling pq_flush_if_writable() in WalSndKeepalive?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Logical replication keepalive flood

Kyotaro Horiguchi
At Tue, 08 Jun 2021 10:05:36 +0900 (JST), Kyotaro Horiguchi <[hidden email]> wrote in

> At Mon, 7 Jun 2021 15:26:05 +0500, Abbas Butt <[hidden email]> wrote in
> > On Mon, Jun 7, 2021 at 3:13 PM Amit Kapila <[hidden email]> wrote:
> > > I am not sure sending feedback every time before sleep is a good idea,
> > > this might lead to unnecessarily sending more messages. Can we try by
> > > using one-second interval with -s option to see how it behaves? As a
> > > matter of comparison the similar logic in workers.c uses
> > > wal_receiver_timeout to send such an update message rather than
> > > sending it every time before sleep.
>
> Logical walreceiver sends a feedback when walrcv_eceive() doesn't
> receive a byte.  If its' not good that pg_recvlogical does the same
> thing, do we need to improve logical walsender's behavior as well?

To be clear, only the change on the walsender side can stop the
flood.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Logical replication keepalive flood

Abbas Butt
Hi Kyotaro,
I have tried to test your patches. Unfortunately, even after applying the patches
the WAL sender is still sending keepalive messages too frequently.
In my opinion the fix is to make sure that wal_sender_timeout/2 has passed before sending
the keepalive message in the code fragment I had shared earlier.
In other words we should replace the call to
WalSndKeepalive(false);
with
WalSndKeepaliveIfNecessary(false);

Do you agree with the suggested fix?

On Tue, Jun 8, 2021 at 10:09 AM Kyotaro Horiguchi <[hidden email]> wrote:
At Tue, 08 Jun 2021 10:05:36 +0900 (JST), Kyotaro Horiguchi <[hidden email]> wrote in
> At Mon, 7 Jun 2021 15:26:05 +0500, Abbas Butt <[hidden email]> wrote in
> > On Mon, Jun 7, 2021 at 3:13 PM Amit Kapila <[hidden email]> wrote:
> > > I am not sure sending feedback every time before sleep is a good idea,
> > > this might lead to unnecessarily sending more messages. Can we try by
> > > using one-second interval with -s option to see how it behaves? As a
> > > matter of comparison the similar logic in workers.c uses
> > > wal_receiver_timeout to send such an update message rather than
> > > sending it every time before sleep.
>
> Logical walreceiver sends a feedback when walrcv_eceive() doesn't
> receive a byte.  If its' not good that pg_recvlogical does the same
> thing, do we need to improve logical walsender's behavior as well?

For the clarity, only the change in the walsender side can stop the
flood.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


--
Abbas
Senior Architect

Re: Logical replication keepalive flood

Kyotaro Horiguchi
Hi.

On 2021/06/08 21:21, Abbas Butt wrote:
> Hi Kyotaro,
> I have tried to test your patches. Unfortunately even after applying the
> patches
> the WAL Sender is still sending too frequent keepalive messages.

Sorry for the bogus patch.  I must have seen something impossible.

The keep-alive packet is immediately flushed explicitly, so Amit is right
that no additional pq_flush_if_writable() is needed.

> In my opinion the fix is to make sure that wal_sender_timeout/2 has passed
> before sending
> the keepalive message in the code fragment I had shared earlier.
> In  other words we should replace the call to
> WalSndKeepalive(false);
> with
> WalSndKeepaliveIfNecessary(false);
>
> Do you agree with the suggested fix?

I'm afraid not. The same thing is already done unconditionally just after that point.

The issue - if it actually is one - is that we send a keep-alive packet just
before a quite short sleep.

We really want to send it only if the sleep gets long, but we cannot predict
that before entering the sleep.

Let me think a little more on this..

regards.





Re: Logical replication keepalive flood

Kyotaro Horiguchi
At Wed, 9 Jun 2021 11:21:55 +0900, Kyotaro Horiguchi <[hidden email]> wrote in
> The issue - if actually it is - we send a keep-alive packet before a
> quite short sleep.
>
> We really want to send it if the sleep gets long but we cannot predict
> that before entering a sleep.
>
> Let me think a little more on this..

After some investigation, I found out that the keepalives are almost always
sent after XLogSendLogical requests the *next* record. In
most cases that record is not yet inserted at the time of the request, but
it is inserted very soon afterwards (within single-digit milliseconds). It
doesn't seem to be expected that this happens with such high frequency when
XLogSendLogical is keeping up with the bleeding edge of the WAL.

It is completely unpredictable when the next record comes, so we
cannot decide whether to send a keepalive or not at the current
timing.

Since what we want is to send a keepalive when we have had nothing to send
for a while, sending keepalives at fixed intervals while the loop is busy is
a rather different thing.

As a possible solution, the attached patch splits the sleep into two
pieces. If the first sleep reaches its timeout, we send a keepalive and then
sleep for the remaining time. The first timeout is quite arbitrary, but a
keepalive rate of at most 4 Hz doesn't look so bad to me.

Is it acceptable?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 109c723f4e..49b3c0d4e2 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -105,6 +105,9 @@
  */
 #define MAX_SEND_SIZE (XLOG_BLCKSZ * 16)
 
+/* Minimum idle time for sending an idle-time keepalive in milliseconds */
+#define KEEPALIVE_TIMEOUT 250
+
 /* Array of WalSnds in shared memory */
 WalSndCtlData *WalSndCtl = NULL;
 
@@ -244,7 +247,7 @@ static void WalSndKeepalive(bool requestReply);
 static void WalSndKeepaliveIfNecessary(void);
 static void WalSndCheckTimeOut(void);
 static long WalSndComputeSleeptime(TimestampTz now);
-static void WalSndWait(uint32 socket_events, long timeout, uint32 wait_event);
+static int WalSndWait(uint32 socket_events, long timeout, uint32 wait_event);
 static void WalSndPrepareWrite(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static void WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static void WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid);
@@ -1428,19 +1431,6 @@ WalSndWaitForWal(XLogRecPtr loc)
  if (got_STOPPING)
  break;
 
- /*
- * We only send regular messages to the client for full decoded
- * transactions, but a synchronous replication and walsender shutdown
- * possibly are waiting for a later location. So, before sleeping, we
- * send a ping containing the flush location. If the receiver is
- * otherwise idle, this keepalive will trigger a reply. Processing the
- * reply will update these MyWalSnd locations.
- */
- if (MyWalSnd->flush < sentPtr &&
- MyWalSnd->write < sentPtr &&
- !waiting_for_ping_response)
- WalSndKeepalive(false);
-
  /* check whether we're done */
  if (loc <= RecentFlushPtr)
  break;
@@ -1483,6 +1473,39 @@ WalSndWaitForWal(XLogRecPtr loc)
  if (pq_is_send_pending())
  wakeEvents |= WL_SOCKET_WRITEABLE;
 
+ /*
+ * We only send regular messages to the client for full decoded
+ * transactions, but a synchronous replication and walsender shutdown
+ * possibly are waiting for a later location. So, before sleeping, we
+ * send a ping containing the flush location. If the receiver is
+ * otherwise idle, this keepalive will trigger a reply. Processing the
+ * reply will update these MyWalSnd locations. If the sleep is shorter
+ * than KEEPALIVE_TIMEOUT milliseconds, we skip sending a keepalive to
+ * prevent it from getting too-frequent.
+ */
+ if (MyWalSnd->flush < sentPtr &&
+ MyWalSnd->write < sentPtr &&
+ !waiting_for_ping_response)
+ {
+ if (sleeptime > KEEPALIVE_TIMEOUT)
+ {
+ int r;
+
+ r = WalSndWait(wakeEvents, KEEPALIVE_TIMEOUT,
+   WAIT_EVENT_WAL_SENDER_WAIT_WAL);
+
+ if (r != 0)
+ continue;
+
+ sleeptime -= KEEPALIVE_TIMEOUT;
+ }
+
+ WalSndKeepalive(false);
+
+ if (pq_flush_if_writable() != 0)
+ WalSndShutdown();
+ }
+
  WalSndWait(wakeEvents, sleeptime, WAIT_EVENT_WAL_SENDER_WAIT_WAL);
  }
 
@@ -3136,15 +3159,18 @@ WalSndWakeup(void)
  * composed of optional WL_SOCKET_WRITEABLE and WL_SOCKET_READABLE flags.  Exit
  * on postmaster death.
  */
-static void
+static int
 WalSndWait(uint32 socket_events, long timeout, uint32 wait_event)
 {
  WaitEvent event;
+ int ret;
 
  ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetSocketPos, socket_events, NULL);
- if (WaitEventSetWait(FeBeWaitSet, timeout, &event, 1, wait_event) == 1 &&
- (event.events & WL_POSTMASTER_DEATH))
+ ret = WaitEventSetWait(FeBeWaitSet, timeout, &event, 1, wait_event);
+ if (ret == 1 && (event.events & WL_POSTMASTER_DEATH))
  proc_exit(1);
+
+ return ret;
 }
 
 /*

Re: Logical replication keepalive flood

Amit Kapila
On Wed, Jun 9, 2021 at 1:47 PM Kyotaro Horiguchi
<[hidden email]> wrote:

>
> At Wed, 9 Jun 2021 11:21:55 +0900, Kyotaro Horiguchi <[hidden email]> wrote in
> > The issue - if actually it is - we send a keep-alive packet before a
> > quite short sleep.
> >
> > We really want to send it if the sleep gets long but we cannot predict
> > that before entering a sleep.
> >
> > Let me think a little more on this..
>
> After some investigation, I find out that the keepalives are sent
> almost always after XLogSendLogical requests for the *next* record.
>

Are these keepalive messages sent at the same frequency even for
subscribers? Basically, I wanted to check whether, if we have logical
replication set up between two nodes, we also get this flood of keep-alive
messages. If not, then why is it different in the case of
pg_recvlogical? Is it possible that the write/flush location is not
updated at the pace at which we expect? Please see commit 41d5f8ad73,
which seems to be talking about a similar problem.

--
With Regards,
Amit Kapila.



Re: Logical replication keepalive flood

Abbas Butt
Hi,

On Wed, Jun 9, 2021 at 2:30 PM Amit Kapila <[hidden email]> wrote:
On Wed, Jun 9, 2021 at 1:47 PM Kyotaro Horiguchi
<[hidden email]> wrote:
>
> At Wed, 9 Jun 2021 11:21:55 +0900, Kyotaro Horiguchi <[hidden email]> wrote in
> > The issue - if actually it is - we send a keep-alive packet before a
> > quite short sleep.
> >
> > We really want to send it if the sleep gets long but we cannot predict
> > that before entering a sleep.
> >
> > Let me think a little more on this..
>
> After some investigation, I find out that the keepalives are sent
> almost always after XLogSendLogical requests for the *next* record.
>

Are these keepalive messages sent at the same frequency even for
subscribers?

Yes, I have tested it with one publisher and one subscriber.
The moment I start a pgbench session I can see keepalive messages being sent, and replied to by the subscriber, at the same frequency.
 
Basically, I wanted to check whether, if we have logical
replication set up between two nodes, we also get this flood of keep-alive
messages?

Yes we do.
 
If not, then why is it different in the case of
pg_recvlogical?

It isn't different; the WAL sender behaviour is the same in both cases.
 
Is it possible that the write/flush location is not
updated at the pace at which we expect?

Well, it is async replication. The receiver can choose to update LSNs at its own will, say at a 10-minute interval.
It should only impact the size of WAL retained by the server.
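One way to see how far the reported positions actually trail what the walsender has sent (an illustrative query, not something that was run in this thread) is:

  SELECT application_name,
         pg_wal_lsn_diff(sent_lsn, write_lsn) AS write_lag_bytes,
         pg_wal_lsn_diff(sent_lsn, flush_lsn) AS flush_lag_bytes
  FROM pg_stat_replication;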

Please see commit 41d5f8ad73
which seems to be talking about a similar problem.

That commit does not address this problem.
 

--
With Regards,
Amit Kapila.


--
Abbas
Senior Architect

Re: Logical replication keepalive flood

Kyotaro Horiguchi
At Wed, 9 Jun 2021 17:32:25 +0500, Abbas Butt <[hidden email]> wrote in

>
> On Wed, Jun 9, 2021 at 2:30 PM Amit Kapila <[hidden email]> wrote:
> > Does these keepalive messages are sent at the same frequency even for
> > subscribers?
>
> Yes, I have tested it with one publisher and one subscriber.
> The moment I start pgbench session I can see keepalive messages sent and
> replied by the subscriber with same frequency.
>
> > Basically, I wanted to check if we have logical
> > replication set up between 2 nodes then do we send these keep-alive
> > messages flood?
>
> Yes we do.
>
> > If not, then why is it different in the case of
> > pg_recvlogical?
>
> Nothing, the WAL sender behaviour is same in both cases.
>
>
> > Is it possible that the write/flush location is not
> > updated at the pace at which we expect?
Yes. MyWalSnd->flush/write are updated quite frequently, but
MyWalSnd->write is still behind sentPtr by anywhere from thousands of bytes
up to just under one block (1 block = 8192 bytes). (Flush lags are larger
than write lags, of course.)

I counted how many times keepalives are sent for each request length
passed to logical_read_xlog_page() during a 10-second pgbench run
replicating pgbench_history, using the attached change.

size:  sent /notsent/ calls: write lag/ flush lag
   8:     3 /     6 /     3:    5960 /  348962
  16:     1 /     2 /     1:     520 /  201096
  24:  2425 /  4852 /  2461:    5259 /  293569
  98:     2 /     0 /    54:       5 /    1050
 187:     2 /     0 /    94:       0 /    1060
4432:     1 /     0 /     1: 410473592 / 410473592
7617:     2 /     0 /    27:     317 /   17133
8280:     1 /     2 /     4:     390 /     390

Where,

size    is requested data length to logical_read_xlog_page()

sent    is the number of keepalives sent in the loop in WalSndWaitForWal

notsent is the number of runs of the loop in WalSndWaitForWal without
                sending a keepalive

calls   is the number of calls to WalSndWaitForWal

write lag is the number of bytes MyWalSnd->write is behind sentPtr at the
        first run of the loop, per call to logical_read_xlog_page.

flush lag is the same as the above, but for MyWalSnd->flush.

Maybe the line for size=4432 is the very first fetch of WAL.

So these numbers show that WalSndWaitForWal is called almost exclusively
when starting to fetch a record, and in that case the function runs the
loop three times and sends one keepalive on average.

> Well, it is async replication. The receiver can choose to update LSNs at
> its own will, say after 10 mins interval.
> It should only impact the size of WAL retained by the server.
>
> Please see commit 41d5f8ad73
> > which seems to be talking about a similar problem.
> >
>
> That commit does not address this problem.

Yeah, at least for me, WalSndWaitForWal sends about one keepalive per call.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 42738eb940..ee78116e79 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -571,6 +571,7 @@ err:
  * We fetch the page from a reader-local cache if we know we have the required
  * data and if there hasn't been any error since caching the data.
  */
+int hogestate = -1;
 static int
 ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 {
@@ -605,6 +606,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
  {
  XLogRecPtr targetSegmentPtr = pageptr - targetPageOff;
 
+ hogestate = pageptr + XLOG_BLCKSZ - state->currRecPtr;
  readLen = state->routine.page_read(state, targetSegmentPtr, XLOG_BLCKSZ,
    state->currRecPtr,
    state->readBuf);
@@ -623,6 +625,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
  * First, read the requested data length, but at least a short page header
  * so that we can validate it.
  */
+ hogestate = pageptr + Max(reqLen, SizeOfXLogShortPHD) - state->currRecPtr;
  readLen = state->routine.page_read(state, pageptr, Max(reqLen, SizeOfXLogShortPHD),
    state->currRecPtr,
    state->readBuf);
@@ -642,6 +645,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
  /* still not enough */
  if (readLen < XLogPageHeaderSize(hdr))
  {
+ hogestate = pageptr + XLogPageHeaderSize(hdr) - state->currRecPtr;
  readLen = state->routine.page_read(state, pageptr, XLogPageHeaderSize(hdr),
    state->currRecPtr,
    state->readBuf);
@@ -649,6 +653,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
  goto err;
  }
 
+ hogestate = -1;
  /*
  * Now that we know we have the full header, validate it.
  */
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 109c723f4e..0de10c4a31 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1363,17 +1363,45 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
  * if we detect a shutdown request (either from postmaster or client)
  * we will return early, so caller must always check.
  */
+unsigned long counts[32768][3] = {0};
+unsigned long lagw[32768] = {0};
+unsigned long lagf[32768] = {0};
+
+void
+PrintCounts(void)
+{
+ int i = 0;
+ for (i = 0 ; i < 32768 ; i++)
+ {
+ if (counts[i][0] + counts[i][1] + counts[i][2] > 0)
+ {
+ unsigned long wl = 0, fl = 0;
+ if (counts[i][1] > 0)
+ {
+ wl = lagw[i] / counts[i][0];
+ fl = lagf[i] / counts[i][0];
+
+ ereport(LOG, (errmsg ("[%5d]: %5lu / %5lu / %5lu: %5lu  %5lu",
+  i, counts[i][1], counts[i][2], counts[i][0], wl, fl), errhidestmt(true)));
+ }
+ }
+ }
+}
+
 static XLogRecPtr
 WalSndWaitForWal(XLogRecPtr loc)
 {
  int wakeEvents;
  static XLogRecPtr RecentFlushPtr = InvalidXLogRecPtr;
+ extern int hogestate;
+ bool lagtaken = false;
 
  /*
  * Fast path to avoid acquiring the spinlock in case we already know we
  * have enough WAL available. This is particularly interesting if we're
  * far behind.
  */
+ counts[hogestate][0]++;
  if (RecentFlushPtr != InvalidXLogRecPtr &&
  loc <= RecentFlushPtr)
  return RecentFlushPtr;
@@ -1439,7 +1467,39 @@ WalSndWaitForWal(XLogRecPtr loc)
  if (MyWalSnd->flush < sentPtr &&
  MyWalSnd->write < sentPtr &&
  !waiting_for_ping_response)
+ {
+ if (hogestate >= 0)
+ {
+ counts[hogestate][1]++;
+ if (!lagtaken)
+ {
+ lagf[hogestate] += sentPtr - MyWalSnd->flush;
+ lagw[hogestate] += sentPtr - MyWalSnd->write;
+ lagtaken = true;
+ }
+ }
+// ereport(LOG, (errmsg ("KA[%lu/%lu/%lu]: %X/%X %X/%X %X/%X %d: %ld",
+//  ka, na, ka + na,
+//  LSN_FORMAT_ARGS(MyWalSnd->flush),
+//  LSN_FORMAT_ARGS(MyWalSnd->write),
+//  LSN_FORMAT_ARGS(sentPtr),
+//  waiting_for_ping_response,
+//  sentPtr - MyWalSnd->write)));
  WalSndKeepalive(false);
+ }
+ else
+ {
+ if (hogestate >= 0)
+ counts[hogestate][2]++;
+
+// ereport(LOG, (errmsg ("kap[%lu/%lu/%lu]: %X/%X %X/%X %X/%X %d: %ld",
+//  ka, na, ka + na,
+//  LSN_FORMAT_ARGS(MyWalSnd->flush),
+//  LSN_FORMAT_ARGS(MyWalSnd->write),
+//  LSN_FORMAT_ARGS(sentPtr),
+//  waiting_for_ping_response,
+//  sentPtr - MyWalSnd->write)));
+ }
 
  /* check whether we're done */
  if (loc <= RecentFlushPtr)

Re: Logical replication keepalive flood

Kyotaro Horiguchi
At Thu, 10 Jun 2021 15:00:16 +0900 (JST), Kyotaro Horiguchi <[hidden email]> wrote in

> At Wed, 9 Jun 2021 17:32:25 +0500, Abbas Butt <[hidden email]> wrote in
> >
> > On Wed, Jun 9, 2021 at 2:30 PM Amit Kapila <[hidden email]> wrote:
> > > Is it possible that the write/flush location is not
> > > updated at the pace at which we expect?
>
> Yes. MyWalSnd->flush/write are updated far frequently but still
> MyWalSnd->write is behind sentPtr by from thousands of bytes up to
> less than 1 block (1block = 8192 bytes). (Flush lags are larger than
> write lags, of course.)
For more clarity, I changed the previous patch a bit and re-took the numbers.

Total records: 19476
  8:     2 /     4 /     2:    4648 /  302472
 16:     5 /    10 /     5:    5427 /  139872
 24:  3006 /  6015 /  3028:    4739 /  267215
187:     2 /     0 /    50:       1 /     398

During a 10-second run of pgbench, the walsender reads 19476 records
and calls logical_read_xlog_page() 3028 times; the mean write lag is 4739
bytes and the mean flush lag is 267215 bytes (really?). As a result, most of
the record fetches cause a keepalive. (The WAL contains many FPIs.)
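For reference, the FPI share can be confirmed with pg_waldump's statistics mode over the segments covering the run (the segment name below is just a placeholder):

  pg_waldump --stats=record $PGDATA/pg_wal/000000010000000000000010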

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 42738eb940..ee78116e79 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -571,6 +571,7 @@ err:
  * We fetch the page from a reader-local cache if we know we have the required
  * data and if there hasn't been any error since caching the data.
  */
+int hogestate = -1;
 static int
 ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 {
@@ -605,6 +606,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
  {
  XLogRecPtr targetSegmentPtr = pageptr - targetPageOff;
 
+ hogestate = pageptr + XLOG_BLCKSZ - state->currRecPtr;
  readLen = state->routine.page_read(state, targetSegmentPtr, XLOG_BLCKSZ,
    state->currRecPtr,
    state->readBuf);
@@ -623,6 +625,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
  * First, read the requested data length, but at least a short page header
  * so that we can validate it.
  */
+ hogestate = pageptr + Max(reqLen, SizeOfXLogShortPHD) - state->currRecPtr;
  readLen = state->routine.page_read(state, pageptr, Max(reqLen, SizeOfXLogShortPHD),
    state->currRecPtr,
    state->readBuf);
@@ -642,6 +645,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
  /* still not enough */
  if (readLen < XLogPageHeaderSize(hdr))
  {
+ hogestate = pageptr + XLogPageHeaderSize(hdr) - state->currRecPtr;
  readLen = state->routine.page_read(state, pageptr, XLogPageHeaderSize(hdr),
    state->currRecPtr,
    state->readBuf);
@@ -649,6 +653,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
  goto err;
  }
 
+ hogestate = -1;
  /*
  * Now that we know we have the full header, validate it.
  */
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 109c723f4e..62f5f09fee 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1363,17 +1363,49 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
  * if we detect a shutdown request (either from postmaster or client)
  * we will return early, so caller must always check.
  */
+unsigned long counts[32768][3] = {0};
+unsigned long lagw[32768] = {0};
+unsigned long lagf[32768] = {0};
+unsigned long nrec = 0;
+void
+PrintCounts(void)
+{
+ int i = 0;
+ ereport(LOG, (errmsg ("Total records: %lu", nrec), errhidestmt(true)));
+ nrec = 0;
+
+ for (i = 0 ; i < 32768 ; i++)
+ {
+ if (counts[i][0] + counts[i][1] + counts[i][2] > 0)
+ {
+ unsigned long wl = 0, fl = 0;
+ if (counts[i][1] > 0)
+ {
+ wl = lagw[i] / counts[i][0];
+ fl = lagf[i] / counts[i][0];
+
+ ereport(LOG, (errmsg ("%5d: %5lu / %5lu / %5lu: %7lu / %7lu",
+  i, counts[i][1], counts[i][2], counts[i][0], wl, fl), errhidestmt(true)));
+ }
+ counts[i][0] = counts[i][1] = counts[i][2] = lagw[i] = lagf[i] = 0;
+ }
+ }
+}
+
 static XLogRecPtr
 WalSndWaitForWal(XLogRecPtr loc)
 {
  int wakeEvents;
  static XLogRecPtr RecentFlushPtr = InvalidXLogRecPtr;
+ extern int hogestate;
+ bool lagtaken = false;
 
  /*
  * Fast path to avoid acquiring the spinlock in case we already know we
  * have enough WAL available. This is particularly interesting if we're
  * far behind.
  */
+ counts[hogestate][0]++;
  if (RecentFlushPtr != InvalidXLogRecPtr &&
  loc <= RecentFlushPtr)
  return RecentFlushPtr;
@@ -1439,7 +1471,39 @@ WalSndWaitForWal(XLogRecPtr loc)
  if (MyWalSnd->flush < sentPtr &&
  MyWalSnd->write < sentPtr &&
  !waiting_for_ping_response)
+ {
+ if (hogestate >= 0)
+ {
+ counts[hogestate][1]++;
+ if (!lagtaken)
+ {
+ lagf[hogestate] += sentPtr - MyWalSnd->flush;
+ lagw[hogestate] += sentPtr - MyWalSnd->write;
+ lagtaken = true;
+ }
+ }
+// ereport(LOG, (errmsg ("KA[%lu/%lu/%lu]: %X/%X %X/%X %X/%X %d: %ld",
+//  ka, na, ka + na,
+//  LSN_FORMAT_ARGS(MyWalSnd->flush),
+//  LSN_FORMAT_ARGS(MyWalSnd->write),
+//  LSN_FORMAT_ARGS(sentPtr),
+//  waiting_for_ping_response,
+//  sentPtr - MyWalSnd->write)));
  WalSndKeepalive(false);
+ }
+ else
+ {
+ if (hogestate >= 0)
+ counts[hogestate][2]++;
+
+// ereport(LOG, (errmsg ("kap[%lu/%lu/%lu]: %X/%X %X/%X %X/%X %d: %ld",
+//  ka, na, ka + na,
+//  LSN_FORMAT_ARGS(MyWalSnd->flush),
+//  LSN_FORMAT_ARGS(MyWalSnd->write),
+//  LSN_FORMAT_ARGS(sentPtr),
+//  waiting_for_ping_response,
+//  sentPtr - MyWalSnd->write)));
+ }
 
  /* check whether we're done */
  if (loc <= RecentFlushPtr)
@@ -2843,6 +2907,7 @@ XLogSendLogical(void)
 {
  XLogRecord *record;
  char   *errm;
+ extern unsigned long nrec;
 
  /*
  * We'll use the current flush point to determine whether we've caught up.
@@ -2860,6 +2925,7 @@ XLogSendLogical(void)
  */
  WalSndCaughtUp = false;
 
+ nrec++;
  record = XLogReadRecord(logical_decoding_ctx->reader, &errm);
 
  /* xlog record was invalid */

Re: Logical replication keepalive flood

Amit Kapila
On Thu, Jun 10, 2021 at 11:42 AM Kyotaro Horiguchi
<[hidden email]> wrote:

>
> At Thu, 10 Jun 2021 15:00:16 +0900 (JST), Kyotaro Horiguchi <[hidden email]> wrote in
> > At Wed, 9 Jun 2021 17:32:25 +0500, Abbas Butt <[hidden email]> wrote in
> > >
> > > On Wed, Jun 9, 2021 at 2:30 PM Amit Kapila <[hidden email]> wrote:
> > > > Is it possible that the write/flush location is not
> > > > updated at the pace at which we expect?
> >
> > Yes. MyWalSnd->flush/write are updated far frequently but still
> > MyWalSnd->write is behind sentPtr by from thousands of bytes up to
> > less than 1 block (1block = 8192 bytes). (Flush lags are larger than
> > write lags, of course.)
>
> For more clarity, I changed the previous patch a bit and retook numbers.
>
> Total records: 19476
>   8:     2 /     4 /     2:    4648 /  302472
>  16:     5 /    10 /     5:    5427 /  139872
>  24:  3006 /  6015 /  3028:    4739 /  267215
> 187:     2 /     0 /    50:       1 /     398
>
> While a 10 seconds run of pgbench, it walsender reads 19476 records
> and calls logical_read_xlog_page() 3028 times, and the mean of write
> lag is 4739 bytes and flush lag is 267215 bytes (really?), as the
> result most of the record fetch causes a keep alive. (The WAL contains
> many FPIs).
>
Good analysis. I think this analysis has shown that the walsender is
sending messages at top speed as soon as they are generated. So, I am
wondering why there is any need to wait/sleep at all in such a workload. One
possibility that occurred to me is that RecentFlushPtr is not updated and/or
we are not checking it aggressively enough. To investigate along those
lines, can you check the behavior with the attached patch? This is just a
quick hack patch to test whether we really need to wait for WAL by checking
for it a bit more aggressively.

--
With Regards,
Amit Kapila.

Attachment: walsnd_check_wait_required_1.patch (1K)

Re: Logical replication keepalive flood

Kyotaro Horiguchi
At Thu, 10 Jun 2021 12:18:00 +0530, Amit Kapila <[hidden email]> wrote in
> Good analysis. I think this analysis has shown that walsender is
> sending messages at top speed as soon as they are generated. So, I am
> wondering why there is any need to wait/sleep in such a workload. One
> possibility that occurred to me RecentFlushPtr is not updated and or
> we are not checking it aggressively. To investigate on that lines, can
> you check the behavior with the attached patch? This is just a quick
> hack patch to test whether we need to really wait for WAL a bit
> aggressively.

Yeah, in any case the comment at the call site of WalSndKeepalive suggests
that exiting the function *after* that point is somewhat wrong.

> * possibly are waiting for a later location. So, before sleeping, we
> * send a ping containing the flush location. If the receiver is

But nothing changed when I moved the keepalive check to after the exit
check (loc <= RecentFlushPtr).

And the patch also doesn't change the situation much. The average
number of loops is reduced from 3 to 2 per call, but the ratio between
total records and keepalives doesn't change.

previous : A=#total-rec = 19476, B=#keepalive=3006, B/A = 0.154
this time: A=#total-rec = 13208, B=#keepalive=1988, B/A = 0.151

Total records: 13208
reqsz: #sent/ #!sent/ #call: wr lag  / fl lag
   8:     4 /     4 /     4:    6448 /  268148
  16:     1 /     1 /     1:    8688 /  387320
  24:  1988 /  1987 /  1999:    6357 /  226163
 195:     1 /     0 /    20:     408 /    1647
7477:     2 /     0 /   244:      68 /     847
8225:     1 /     1 /     1:    7208 /    7208

So I checked how many bytes RecentFlushPtr is behind the requested loc
when it has not advanced far enough.

Total records: 15128
reqsz:  #sent/ #!sent/ #call: wr lag  / fl lag  / RecentFlushPtr lag
    8:     2 /     2 /     2:     520 /   60640 /       8
   16:     1 /     1 /     1:    8664 /   89336 /      16
   24:  2290 /  2274 /  2302:    5677 /  230583 /      23
  187:     1 /     0 /    40:       1 /    6118 /       1
 7577:     1 /     0 /    69:     120 /    3733 /      65
 8177:     1 /     1 /     1:    8288 /    8288 /    2673

So it's not a matter of the RecentFlushPtr check. Almost always, when
WalSndWakeupRequest feels the need to send a keepalive, the function is
called before the record has begun to be written.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Logical replication keepalive flood

Amit Kapila
On Fri, Jun 11, 2021 at 7:07 AM Kyotaro Horiguchi
<[hidden email]> wrote:

>
> At Thu, 10 Jun 2021 12:18:00 +0530, Amit Kapila <[hidden email]> wrote in
> > Good analysis. I think this analysis has shown that walsender is
> > sending messages at top speed as soon as they are generated. So, I am
> > wondering why there is any need to wait/sleep in such a workload. One
> > possibility that occurred to me RecentFlushPtr is not updated and or
> > we are not checking it aggressively. To investigate on that lines, can
> > you check the behavior with the attached patch? This is just a quick
> > hack patch to test whether we need to really wait for WAL a bit
> > aggressively.
>
> Yeah, anyway the comment for the caller site of WalSndKeepalive tells
> that exiting out of the function *after* there is somewhat wrong.
>
> > * possibly are waiting for a later location. So, before sleeping, we
> > * send a ping containing the flush location. If the receiver is
>
> But I nothing changed by moving the keepalive check to after the exit
> check. (loc <= RecentFlushPtr).
>
> And the patch also doesn't change the situation so much. The average
> number of loops is reduced from 3 to 2 per call but the ratio between
> total records and keepalives doesn't change.
>
> previsous: A=#total-rec = 19476, B=#keepalive=3006, B/A = 0.154
> this time: A=#total-rec = 13208, B=#keepalive=1988, B/A = 0.151
>
> Total records: 13208
> reqsz: #sent/ #!sent/ #call: wr lag  / fl lag
>    8:     4 /     4 /     4:    6448 /  268148
>   16:     1 /     1 /     1:    8688 /  387320
>   24:  1988 /  1987 /  1999:    6357 /  226163
>  195:     1 /     0 /    20:     408 /    1647
> 7477:     2 /     0 /   244:      68 /     847
> 8225:     1 /     1 /     1:    7208 /    7208
>
> So I checked how many bytes RecentFlushPtr is behind requested loc if
> it is not advanced enough.
>
> Total records: 15128
> reqsz:  #sent/ #!sent/ #call: wr lag  / fl lag  / RecentFlushPtr lag
>     8:     2 /     2 /     2:     520 /   60640 /       8
>    16:     1 /     1 /     1:    8664 /   89336 /      16
>    24:  2290 /  2274 /  2302:    5677 /  230583 /      23
>   187:     1 /     0 /    40:       1 /    6118 /       1
>  7577:     1 /     0 /    69:     120 /    3733 /      65
>  8177:     1 /     1 /     1:    8288 /    8288 /    2673
>

Does this data indicate that when the request size is 187 or 7577,
even though we have called WalSndWaitForWal() 40 and 69 times
respectively, a keepalive is sent just once? Why should such behavior
depend upon the request size?

> So it's not a matter of RecentFlushPtr check. (Almost) Always when
> WalSndWakeupRequest feels to need to send a keepalive, the function is
> called before the record begins to be written.
>

I think we always wake up the walsender after we have flushed the WAL via
WalSndWakeupProcessRequests(). I think the reason we are seeing
keepalives here is that we always send one before sleeping. So it
seems that each time we try to read a new page, we call WalSndWaitForWal,
which sends at least one keepalive message. I am not sure what an
appropriate way to reduce the frequency of these keepalive messages would
be. Andres might have some ideas?

--
With Regards,
Amit Kapila.