On Mon, Jun 7, 2021 at 12:54 PM Kyotaro Horiguchi
<[hidden email]> wrote:
> At Sat, 5 Jun 2021 16:08:00 +0500, Abbas Butt <[hidden email]> wrote in
> > Hi,
> > I have observed the following behavior with PostgreSQL 13.3.
> > The WAL sender process sends approximately 500 keepalive messages per
> > second to pg_recvlogical.
> > These keepalive messages are totally un-necessary.
> > Keepalives should be sent only if there is no network traffic and a certain
> > time (half of wal_sender_timeout) passes.
> > These keepalive messages not only choke the network but also impact the
> > performance of the receiver,
> > because the receiver has to process the received message and then decide
> > whether to reply to it or not.
> > The receiver remains busy doing this activity 500 times a second.
> I can reproduce the problem.
> > On investigation it is revealed that the following code fragment in
> > function WalSndWaitForWal in file walsender.c is responsible for sending
> > these frequent keepalives:
> > if (MyWalSnd->flush < sentPtr &&
> > MyWalSnd->write < sentPtr &&
> > !waiting_for_ping_response)
> > WalSndKeepalive(false);
> The immediate cause is pg_recvlogical doesn't send a reply before
> sleeping. Currently it sends replies every 10 seconds intervals.
Yeah, but one can use -s option to send it at lesser intervals.
> So the attached first patch stops the flood.
I am not sure sending feedback every time before sleep is a good idea,
this might lead to unnecessarily sending more messages. Can we try by
using one-second interval with -s option to see how it behaves? As a
matter of comparison the similar logic in workers.c uses
wal_receiver_timeout to send such an update message rather than
sending it every time before sleep.
> That said, I don't think it is not intended that logical walsender
> sends keep-alive packets with such a high frequency. It happens
> because walsender actually doesn't wait at all because it waits on
> WL_SOCKET_WRITEABLE because the keep-alive packet inserted just before
> is always pending.
> So as the attached second, we should try to flush out the keep-alive
> packets if possible before checking pg_is_send_pending().
/* Send keepalive if the time has come */
+ /* We may have queued a keep alive packet. flush it before sleeping. */
We already call pq_flush_if_writable() from WalSndKeepaliveIfNecessary
after sending the keep-alive message, so not sure how this helps?
On Wed, Jun 9, 2021 at 1:47 PM Kyotaro Horiguchi
<[hidden email]> wrote:
> At Wed, 9 Jun 2021 11:21:55 +0900, Kyotaro Horiguchi <[hidden email]> wrote in
> > The issue - if actually it is - we send a keep-alive packet before a
> > quite short sleep.
> > We really want to send it if the sleep gets long but we cannot predict
> > that before entering a sleep.
> > Let me think a little more on this..
> After some investigation, I find out that the keepalives are sent
> almost always after XLogSendLogical requests for the *next* record.
Does these keepalive messages are sent at the same frequency even for
subscribers? Basically, I wanted to check if we have logical
replication set up between 2 nodes then do we send these keep-alive
messages flood? If not, then why is it different in the case of
pg_recvlogical? Is it possible that the write/flush location is not
updated at the pace at which we expect? Please see commit 41d5f8ad73
which seems to be talking about a similar problem.
On Thu, Jun 10, 2021 at 11:42 AM Kyotaro Horiguchi
<[hidden email]> wrote:
> At Thu, 10 Jun 2021 15:00:16 +0900 (JST), Kyotaro Horiguchi <[hidden email]> wrote in
> > At Wed, 9 Jun 2021 17:32:25 +0500, Abbas Butt <[hidden email]> wrote in
> > >
> > > On Wed, Jun 9, 2021 at 2:30 PM Amit Kapila <[hidden email]> wrote:
> > > > Is it possible that the write/flush location is not
> > > > updated at the pace at which we expect?
> > Yes. MyWalSnd->flush/write are updated far frequently but still
> > MyWalSnd->write is behind sentPtr by from thousands of bytes up to
> > less than 1 block (1block = 8192 bytes). (Flush lags are larger than
> > write lags, of course.)
> For more clarity, I changed the previous patch a bit and retook numbers.
> Total records: 19476
> 8: 2 / 4 / 2: 4648 / 302472
> 16: 5 / 10 / 5: 5427 / 139872
> 24: 3006 / 6015 / 3028: 4739 / 267215
> 187: 2 / 0 / 50: 1 / 398
> While a 10 seconds run of pgbench, it walsender reads 19476 records
> and calls logical_read_xlog_page() 3028 times, and the mean of write
> lag is 4739 bytes and flush lag is 267215 bytes (really?), as the
> result most of the record fetch causes a keep alive. (The WAL contains
> many FPIs).
Good analysis. I think this analysis has shown that walsender is
sending messages at top speed as soon as they are generated. So, I am
wondering why there is any need to wait/sleep in such a workload. One
possibility that occurred to me RecentFlushPtr is not updated and or
we are not checking it aggressively. To investigate on that lines, can
you check the behavior with the attached patch? This is just a quick
hack patch to test whether we need to really wait for WAL a bit
On Fri, Jun 11, 2021 at 7:07 AM Kyotaro Horiguchi
<[hidden email]> wrote:
> At Thu, 10 Jun 2021 12:18:00 +0530, Amit Kapila <[hidden email]> wrote in
> > Good analysis. I think this analysis has shown that walsender is
> > sending messages at top speed as soon as they are generated. So, I am
> > wondering why there is any need to wait/sleep in such a workload. One
> > possibility that occurred to me RecentFlushPtr is not updated and or
> > we are not checking it aggressively. To investigate on that lines, can
> > you check the behavior with the attached patch? This is just a quick
> > hack patch to test whether we need to really wait for WAL a bit
> > aggressively.
> Yeah, anyway the comment for the caller site of WalSndKeepalive tells
> that exiting out of the function *after* there is somewhat wrong.
> > * possibly are waiting for a later location. So, before sleeping, we
> > * send a ping containing the flush location. If the receiver is
> But I nothing changed by moving the keepalive check to after the exit
> check. (loc <= RecentFlushPtr).
> And the patch also doesn't change the situation so much. The average
> number of loops is reduced from 3 to 2 per call but the ratio between
> total records and keepalives doesn't change.
> previsous: A=#total-rec = 19476, B=#keepalive=3006, B/A = 0.154
> this time: A=#total-rec = 13208, B=#keepalive=1988, B/A = 0.151
> Total records: 13208
> reqsz: #sent/ #!sent/ #call: wr lag / fl lag
> 8: 4 / 4 / 4: 6448 / 268148
> 16: 1 / 1 / 1: 8688 / 387320
> 24: 1988 / 1987 / 1999: 6357 / 226163
> 195: 1 / 0 / 20: 408 / 1647
> 7477: 2 / 0 / 244: 68 / 847
> 8225: 1 / 1 / 1: 7208 / 7208
> So I checked how many bytes RecentFlushPtr is behind requested loc if
> it is not advanced enough.
> Total records: 15128
> reqsz: #sent/ #!sent/ #call: wr lag / fl lag / RecentFlushPtr lag
> 8: 2 / 2 / 2: 520 / 60640 / 8
> 16: 1 / 1 / 1: 8664 / 89336 / 16
> 24: 2290 / 2274 / 2302: 5677 / 230583 / 23
> 187: 1 / 0 / 40: 1 / 6118 / 1
> 7577: 1 / 0 / 69: 120 / 3733 / 65
> 8177: 1 / 1 / 1: 8288 / 8288 / 2673
Does this data indicate that when the request_size is 187 or 7577,
even though we have called WalSndWaitForWal() 40 and 69 times
respectively but keepalive is sent just once? Why such a behavior
should depend upon request size?
> So it's not a matter of RecentFlushPtr check. (Almost) Always when
> WalSndWakeupRequest feels to need to send a keepalive, the function is
> called before the record begins to be written.
I think we always wake up walsender after we have flushed the WAL via
WalSndWakeupProcessRequests(). I think here the reason why we are
seeing keepalives is that we always send it before sleeping. So, it
seems each time we try to read a new page, we call WalSndWaitForWal
which sends at least one keepalive message. I am not sure what is an
appropriate way to reduce the frequency of these keepalive messages.
Andres might have some ideas?