finding out what's generating WALs

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

finding out what's generating WALs

Chris Withers-2
Hi All,

With a 9.4 cluster, what's the best way to find out what's generating
the most WAL?

I'm looking after a multi-tenant PG 9.4 cluster, and we've started
getting alerts for the number of WALs on the server.
It'd be great to understand what's generating all that WAL and what's
likely to be causing any problems.\

More generally, what's number of WALs is "too much"? check_postgres.pl
when used in nagios format only appears to be able to alert on absolute
thresholds, does this always make sense? What's a good threshold to
alert on?

cheers,

Chris

Reply | Threaded
Open this post in threaded view
|

Re: finding out what's generating WALs

Achilleas Mantzios
On 11/12/18 4:00 μ.μ., Chris Withers wrote:
> Hi All,
>
> With a 9.4 cluster, what's the best way to find out what's generating the most WAL?
>
> I'm looking after a multi-tenant PG 9.4 cluster, and we've started getting alerts for the number of WALs on the server.
> It'd be great to understand what's generating all that WAL and what's likely to be causing any problems.\
>

One way is to keep snapshots of pg_stat_user_tables and then try to identify spikes based on the various _tup fields.
Another way is to take a look in your archive (where you keep your archived wals), try to identify a period where excessive wals were generated and then use
https://www.postgresql.org/docs/11/pgwaldump.html to see what's in there.

> More generally, what's number of WALs is "too much"? check_postgres.pl when used in nagios format only appears to be able to alert on absolute thresholds, does this always make sense? What's a good
> threshold to alert on?
>

Regarding you wals in pg_wal,  a good threshold could be anything more than a e.g. 10% increase from wal_keep_segments with a trend to go up. If this number goes up chances are something bad is happening.

> cheers,
>
> Chris
>


--
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt


Reply | Threaded
Open this post in threaded view
|

why would postgres be throttling a streaming replication slot's sending?

Chris Withers-2
On 11/12/2018 14:48, Achilleas Mantzios wrote:

> On 11/12/18 4:00 μ.μ., Chris Withers wrote:
>>
>> I'm looking after a multi-tenant PG 9.4 cluster, and we've started
>> getting alerts for the number of WALs on the server.
>> It'd be great to understand what's generating all that WAL and what's
>> likely to be causing any problems.\
>
> Regarding you wals in pg_wal,  a good threshold could be anything more
> than a e.g. 10% increase from wal_keep_segments with a trend to go up.
> If this number goes up chances are something bad is happening.

wal_keep_segments is 0 here, so I went hunting in pg_replication_slots
and found that it's the barman slot that's behind on the WAL.

strace'ing on the barman receiving side shows it waiting on the socket,
so no problem on that side.

on the sending side, it's a little confusing, since the server is
basically idle, no cpu or disk activity, and yet strace'ing the sending
process attached to the other end of the socket shows time being spent
waiting on a poll which, while it includes the socket being sent to,
also includes a bunch of pipes. I've attached a chunk of the output
below in the hope that someone on this list could offer an explanation
as to what might cause the WAL to be trickling over this port rather
than catching up as fast as it can?

cheers,

Chris

strace output for the streaming_barman slot servicing the barman
replication slot:

--- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=3577, si_uid=26} ---
write(4, "\0", 1)                       = 1
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
read(3, "\0", 16)                       = 1
read(6, 0x7fffffffdcd7, 1)              = -1 EAGAIN (Resource
temporarily unavailable)
recvfrom(10, 0x7fffffffdcaf, 1, 0, NULL, NULL) = -1 EAGAIN (Resource
temporarily unavailable)
read(5,
"\3\1\0\0\7\326\355\2\343\0\0\0@\n\0\0\0\17\1\230\313\4\0\0)\347\372l\0\0\0\0"...,
4272) = 4272
sendto(10, "d\0\0\20\315w\0\0\4\313\230\1\17P\0\0\4\313\230\1
\0\0\2\37\321\343\345\306\20\3\1"..., 4302, 0, NULL, 0) = 4302
read(3, 0x7fffffffdc90, 16)             = -1 EAGAIN (Resource
temporarily unavailable)
poll([{fd=10, events=POLLIN}, {fd=3, events=POLLIN}, {fd=6,
events=POLLIN}], 3, 26076) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=3577, si_uid=26} ---
write(4, "\0", 1)                       = 1
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
read(3, "\0", 16)                       = 1
read(6, 0x7fffffffdcd7, 1)              = -1 EAGAIN (Resource
temporarily unavailable)
recvfrom(10, 0x7fffffffdcaf, 1, 0, NULL, NULL) = -1 EAGAIN (Resource
temporarily unavailable)
read(5, "~\320\5\0\3\0\0\0\0
\1\230\313\4\0\0\3\0\0\0\0\0\0\0\323\316\314\0\0\0\0\0"..., 8192) = 8192
sendto(10, "d\0\0 \35w\0\0\4\313\230\1
\0\0\0\4\313\230\1@\0\0\2\37\321\343\361\376\226~\320"..., 8222, 0,
NULL, 0) = 8222
read(3, 0x7fffffffdc90, 16)             = -1 EAGAIN (Resource
temporarily unavailable)
poll([{fd=10, events=POLLIN}, {fd=3, events=POLLIN}, {fd=6,
events=POLLIN}], 3, 25275) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=3577, si_uid=26} ---

Interesting numbers:
- process 3577 is the wal writer process
- fd 10 is the socket, 3 and 6 are pipes
- time is visibly spent sitting on that poll
- the network is only seeing 2Mbit/sec, which is nothing

cheers,

Chris

Reply | Threaded
Open this post in threaded view
|

Re: why would postgres be throttling a streaming replication slot's sending?

Achilleas Mantzios
On 12/12/18 3:45 μ.μ., Chris Withers wrote:

> On 11/12/2018 14:48, Achilleas Mantzios wrote:
>> On 11/12/18 4:00 μ.μ., Chris Withers wrote:
>>>
>>> I'm looking after a multi-tenant PG 9.4 cluster, and we've started getting alerts for the number of WALs on the server.
>>> It'd be great to understand what's generating all that WAL and what's likely to be causing any problems.\
>>
>> Regarding you wals in pg_wal,  a good threshold could be anything more than a e.g. 10% increase from wal_keep_segments with a trend to go up. If this number goes up chances are something bad is
>> happening.
>
> wal_keep_segments is 0 here, so I went hunting in pg_replication_slots and found that it's the barman slot that's behind on the WAL.
>

Then your number WALs depends on the traffic generated after the previous checkpoint, the status of archiving and IMO the status of replication slots. The doc :
https://www.postgresql.org/docs/11/runtime-config-replication.html doesn't specifically imply this last one, but it is documented here : https://www.postgresql.org/docs/10/wal-configuration.html , and
it has happened to me. If a replication client with replication_slot dies, WALs accumulate till it wakes up again. (case in pgsql 10)

> strace'ing on the barman receiving side shows it waiting on the socket, so no problem on that side.
>
> on the sending side, it's a little confusing, since the server is basically idle, no cpu or disk activity, and yet strace'ing the sending process attached to the other end of the socket shows time
> being spent waiting on a poll which, while it includes the socket being sent to, also includes a bunch of pipes. I've attached a chunk of the output below in the hope that someone on this list could
> offer an explanation as to what might cause the WAL to be trickling over this port rather than catching up as fast as it can?

pg_stat_replication is what you should use for monitoring. IIRC you wont see some cols in the row of the barman.
Also you could query pg_replication_slots as :

select pg_wal_lsn_diff(pg_current_wal_lsn(),confirmed_flush_lsn),pg_wal_lsn_diff(pg_current_wal_lsn(),restart_lsn) from pg_replication_slots where active_pid=<the pid reported above by
pg_stat_replication>;
  pg_wal_lsn_diff | pg_wal_lsn_diff
-----------------+-----------------
              312 |         1917736
(1 row)

In the first column you should see diff values (in bytes) close to zero. In the second you should see the length (window) of your replication slot in bytes, which means that this converted to 16MB
should indicate an excess of the number of wals due to the replication slot in addition to the other factors (archiving, checkpoints). In my system it never reaches the size of a single wal.

>
> cheers,
>
> Chris
>
> strace output for the streaming_barman slot servicing the barman replication slot:

IMO no need to go deep to gdb and stack traces unless you are sure there is a problem further down.

>
> --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=3577, si_uid=26} ---
> write(4, "\0", 1)                       = 1
> rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
> read(3, "\0", 16)                       = 1
> read(6, 0x7fffffffdcd7, 1)              = -1 EAGAIN (Resource temporarily unavailable)
> recvfrom(10, 0x7fffffffdcaf, 1, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
> read(5, "\3\1\0\0\7\326\355\2\343\0\0\0@\n\0\0\0\17\1\230\313\4\0\0)\347\372l\0\0\0\0"..., 4272) = 4272
> sendto(10, "d\0\0\20\315w\0\0\4\313\230\1\17P\0\0\4\313\230\1 \0\0\2\37\321\343\345\306\20\3\1"..., 4302, 0, NULL, 0) = 4302
> read(3, 0x7fffffffdc90, 16)             = -1 EAGAIN (Resource temporarily unavailable)
> poll([{fd=10, events=POLLIN}, {fd=3, events=POLLIN}, {fd=6, events=POLLIN}], 3, 26076) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
> --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=3577, si_uid=26} ---
> write(4, "\0", 1)                       = 1
> rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
> read(3, "\0", 16)                       = 1
> read(6, 0x7fffffffdcd7, 1)              = -1 EAGAIN (Resource temporarily unavailable)
> recvfrom(10, 0x7fffffffdcaf, 1, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
> read(5, "~\320\5\0\3\0\0\0\0 \1\230\313\4\0\0\3\0\0\0\0\0\0\0\323\316\314\0\0\0\0\0"..., 8192) = 8192
> sendto(10, "d\0\0 \35w\0\0\4\313\230\1 \0\0\0\4\313\230\1@\0\0\2\37\321\343\361\376\226~\320"..., 8222, 0, NULL, 0) = 8222
> read(3, 0x7fffffffdc90, 16)             = -1 EAGAIN (Resource temporarily unavailable)
> poll([{fd=10, events=POLLIN}, {fd=3, events=POLLIN}, {fd=6, events=POLLIN}], 3, 25275) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
> --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=3577, si_uid=26} ---
>
> Interesting numbers:
> - process 3577 is the wal writer process
> - fd 10 is the socket, 3 and 6 are pipes
> - time is visibly spent sitting on that poll
> - the network is only seeing 2Mbit/sec, which is nothing
>
> cheers,
>
> Chris
>


--
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt