BUG? Slave don't reconnect to the master

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

BUG? Slave don't reconnect to the master

Олег Самойлов
Hi all.

I found some strange behaviour of postgres, which I recognise as a bug. First of all, let me explain situation.

I created a "test bed" (not sure how to call it right), to test high availability clusters based on Pacemaker and PostgreSQL. The test bed consist of 12 virtual machines (on VirtualBox) runing on a MacBook Pro and formed 4 HA clusters with different structure. And all 4 HA cluster constantly tested in loop: simulated failures with different nature, waited for rising fall-over, fixing, and so on. For simplicity I'll explain only one HA cluster. This is 3 virtual machines, with master on one, and sync and async slaves on other. The PostgreSQL service is provided by float IPs pointed to working master and slaves. Slaves are connected to the master float IP too. When the pacemaker detects a failure, for instance, on the master, it promote a master on other node with lowest latency WAL and switches float IPs, so the third node keeping be a sync slave. My company decided to open this project as an open source, now I am finishing formality.

Almost works fine, but sometimes, rather rare, I detected that a slave don't reconnect to the new master after a failure. First case is PostgreSQL-STOP, when I `kill` by STOP signal postgres on the master to simulate freeze. The slave don't reconnect to the new master with errors in log:

18:02:56.236 [3154] FATAL:  terminating walreceiver due to timeout
18:02:56.237 [1421] LOG:  record with incorrect prev-link 0/1600DDE8 at 0/1A00DE10

What is strange that error about incorrect WAL is risen  after the termination of connection. Well, this can be workarouned by turning off wal receiver timeout. Now PostgreSQL-STOP works fine, but the problem is still exists with other test. ForkBomb simulates an out of memory situation. In this case a slave sometimes don't reconnect to the new master too, with errors in log:

10:09:43.99 [1417] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
10:09:43.992 [1413] LOG:  invalid record length at 0/D8014278: wanted 24, got 0

The last error message (last row in log) was observed different, btw.

What I expect as right behaviour. The PostgreSQL slave must reconnect to the master IP (float IP) after the wal_retrieve_retry_interval.

Reply | Threaded
Open this post in threaded view
|

Re: BUG? Slave don't reconnect to the master

Jehan-Guillaume de Rorthais
On Tue, 18 Aug 2020 13:48:41 +0300
Олег Самойлов <[hidden email]> wrote:

> Hi all.
>
> I found some strange behaviour of postgres, which I recognise as a bug. First
> of all, let me explain situation.
>
> I created a "test bed" (not sure how to call it right), to test high
> availability clusters based on Pacemaker and PostgreSQL. The test bed consist
> of 12 virtual machines (on VirtualBox) runing on a MacBook Pro and formed 4
> HA clusters with different structure. And all 4 HA cluster constantly tested
> in loop: simulated failures with different nature, waited for rising
> fall-over, fixing, and so on. For simplicity I'll explain only one HA
> cluster.
> This is 3 virtual machines, with master on one, and sync and async
> slaves on other. The PostgreSQL service is provided by float IPs pointed to
> working master and slaves. Slaves are connected to the master float IP too.
> When the pacemaker detects a failure, for instance, on the master, it promote
> a master on other node with lowest latency WAL and switches float IPs, so the
> third node keeping be a sync slave. My company decided to open this project
> as an open source, now I am finishing formality.

As the maintainer of PAF[1], I'm looking forward to discover it :)
Do not hesitate to ping me offlist as well in regard with Pacemaker and
resource agents.

> Almost works fine, but sometimes, rather rare, I detected that a slave don't
> reconnect to the new master after a failure. First case is PostgreSQL-STOP,
> when I `kill` by STOP signal postgres on the master to simulate freeze. The
> slave don't reconnect to the new master with errors in log:
>
> 18:02:56.236 [3154] FATAL:  terminating walreceiver due to timeout
> 18:02:56.237 [1421] LOG:  record with incorrect prev-link 0/1600DDE8 at
> 0/1A00DE10

Do you have more logs from both side of the replication?
How do you build your standbys?

> What is strange that error about incorrect WAL is risen  after the
> termination of connection.

This is because the first message comes from the walreceiver itself (3154),
which receive and write WAL, and the other one comes from the startup process
(1421) which wait and replay WAL.

> Well, this can be workarouned by turning off wal
> receiver timeout. Now PostgreSQL-STOP works fine, but the problem is still
> exists with other test. ForkBomb simulates an out of memory situation. In
> this case a slave sometimes don't reconnect to the new master too, with
> errors in log:
>
> 10:09:43.99 [1417] FATAL:  could not receive data from WAL stream: server
> closed the connection unexpectedly This probably means the server terminated
> abnormally before or while processing the request.
> 10:09:43.992 [1413] LOG:  invalid record length at 0/D8014278: wanted 24, got
> 0

I suspect the problem is somewhere else. The first message here is probably
related to your primary being fenced, the second one is normal. After your
IP moved to the recently promoted primary, your standby are supposed to
reconnect with no problem.

> The last error message (last row in log) was observed different, btw.
>
> What I expect as right behaviour. The PostgreSQL slave must reconnect to the
> master IP (float IP) after the wal_retrieve_retry_interval.

In my own experience with PAF, it just works like what you describe.


Regards,

[1] https://clusterlabs.github.io/PAF/


Reply | Threaded
Open this post in threaded view
|

Re: BUG? Slave don't reconnect to the master

Олег Самойлов


> 19 авг. 2020 г., в 16:07, Jehan-Guillaume de Rorthais <[hidden email]> написал(а):
>
> On Tue, 18 Aug 2020 13:48:41 +0300
> Олег Самойлов <[hidden email]> wrote:
>
>> Hi all.
>>
>> I found some strange behaviour of postgres, which I recognise as a bug. First
>> of all, let me explain situation.
>>
>> I created a "test bed" (not sure how to call it right), to test high
>> availability clusters based on Pacemaker and PostgreSQL. The test bed consist
>> of 12 virtual machines (on VirtualBox) runing on a MacBook Pro and formed 4
>> HA clusters with different structure. And all 4 HA cluster constantly tested
>> in loop: simulated failures with different nature, waited for rising
>> fall-over, fixing, and so on. For simplicity I'll explain only one HA
>> cluster.
>> This is 3 virtual machines, with master on one, and sync and async
>> slaves on other. The PostgreSQL service is provided by float IPs pointed to
>> working master and slaves. Slaves are connected to the master float IP too.
>> When the pacemaker detects a failure, for instance, on the master, it promote
>> a master on other node with lowest latency WAL and switches float IPs, so the
>> third node keeping be a sync slave. My company decided to open this project
>> as an open source, now I am finishing formality.
>
> As the maintainer of PAF[1], I'm looking forward to discover it :)
> Do not hesitate to ping me offlist as well in regard with Pacemaker and
> resource agents.

I am glad to see you here. Great thanks for PAF module.

>
>> Almost works fine, but sometimes, rather rare, I detected that a slave don't
>> reconnect to the new master after a failure. First case is PostgreSQL-STOP,
>> when I `kill` by STOP signal postgres on the master to simulate freeze. The
>> slave don't reconnect to the new master with errors in log:
>>
>> 18:02:56.236 [3154] FATAL:  terminating walreceiver due to timeout
>> 18:02:56.237 [1421] LOG:  record with incorrect prev-link 0/1600DDE8 at
>> 0/1A00DE10
>
> Do you have more logs from both side of the replication?

There was nothing special. Just errors from previous tests.

> How do you build your standbys?
Okey, I'll show you all configs. Linux Centos 7, PostgreSQL 11 from the official postgres repository.
# rpm -q corosync
corosync-2.4.5-4.el7.x86_64
# rpm -q pacemaker
pacemaker-1.1.21-4.el7.x86_64
# rpm -q sbd
sbd-1.4.0-15.el7.x86_64

The cluster designed for three datacenter with good connection between in one city , so it must survive a crush of one datacenter. So stonith is not used, instead I use quorum and sbd based watchdog. Name of nodes: tuchanka3a, tuchanka3b, tuchanka3c. Name for float ip: krogan3 for the master; krogan3s1, krogan3s2 for slaves.

postgresql common conf (not default options):

ident_file = '/var/lib/pgsql/pg_ident.conf'
hba_file = '/var/lib/pgsql/pg_hba.conf'
listen_addresses = '*'
log_filename = 'postgresql.%F.log'
wal_keep_segments = 1
restart_after_crash = off
wal_receiver_timeout=0 # in case PostgreSQL-STOP wal_receiver_timeout is default 60s, of cause
shared_buffers = 32MB
max_wal_size=80MB

postgresql conf specific for nodes:
for tuchanka3a:

$ cat krogan3.conf
synchronous_commit = remote_write
synchronous_standby_names = 'ANY 1 (tuchanka3b,tuchanka3c)'

$ cat krogan3.paf
primary_conninfo = 'host=krogan3 user=replicant application_name=tuchanka3a sslmode=disable'
recovery_target_timeline = 'latest'
standby_mode = 'on'

for tucanka3b:
$ cat krogan3.conf
synchronous_commit = remote_write
synchronous_standby_names = 'ANY 1 (tuchanka3a,tuchanka3c)'

$ cat krogan3.paf
primary_conninfo = 'host=krogan3 user=replicant application_name=tuchanka3b sslmode=disable'
recovery_target_timeline = 'latest'
standby_mode = 'on'

for tuchanka3c:
$ cat krogan3.conf
synchronous_commit = remote_write
synchronous_standby_names = 'ANY 1 (tuchanka3a,tuchanka3b)'

$ cat krogan3.paf
primary_conninfo = 'host=krogan3 user=replicant application_name=tuchanka3c sslmode=disable'
recovery_target_timeline = 'latest'
standby_mode = 'on'

pacemaker config, specific for this cluster:

pcs cluster cib original_cib.xml
cp original_cib.xml configured_cib.xml
pcs -f configured_cib.xml resource create krogan3DB ocf:heartbeat:pgsqlms bindir=/usr/pgsql-11/bin pgdata=/var/lib/pgsql/krogan3 recovery_template=/var/lib/pgsql/krogan3.paf op monitor interval=15 timeout=10 monitor interval=16 role=Master timeout=20 monitor interval=17 role=Slave timeout=10 meta master notify=true resource-stickiness=10
pcs -f configured_cib.xml resource create krogan3IP ocf:heartbeat:IPaddr2 nic=eth0 cidr_netmask=24 ip=192.168.89.35
pcs -f configured_cib.xml resource create krogan3s1IP ocf:heartbeat:IPaddr2 nic=eth0 cidr_netmask=24 ip=192.168.89.36
pcs -f configured_cib.xml resource create krogan3s2IP ocf:heartbeat:IPaddr2 nic=eth0 cidr_netmask=24 ip=192.168.89.37
pcs -f configured_cib.xml constraint colocation add krogan3IP with master krogan3DB-master INFINITY
pcs -f configured_cib.xml constraint order promote krogan3DB-master then start krogan3IP symmetrical=false
pcs -f configured_cib.xml constraint order demote krogan3DB-master then stop krogan3IP symmetrical=false kind=Optional
pcs -f configured_cib.xml constraint location krogan3s1IP rule score=-INFINITY master-krogan3DB lt integer 0
pcs -f configured_cib.xml constraint location krogan3s2IP rule score=-INFINITY master-krogan3DB lt integer 0
pcs -f configured_cib.xml constraint colocation add krogan3s1IP with slave krogan3DB-master INFINITY
pcs -f configured_cib.xml constraint colocation add krogan3s2IP with slave krogan3DB-master INFINITY
pcs -f configured_cib.xml constraint colocation add krogan3s1IP with krogan3s2IP -1000
pcs -f configured_cib.xml constraint order start krogan3DB-master then start krogan3s1IP
pcs -f configured_cib.xml constraint order start krogan3DB-master then start krogan3s2IP
pcs cluster cib-push configured_cib.xml --wait diff-against=original_cib.xml

Test PostgreSQL-STOP (wal_receiver_timeout is default 60s):
killall -STOP postgres

PostgreSQL logs:
From old master, tuchanka 3a:
2020-08-20 10:03:35.633 MSK [1359] LOG:  received promote request
2020-08-20 10:03:35.633 MSK [1507] FATAL:  terminating walreceiver process due to administrator command
2020-08-20 10:03:35.634 MSK [1359] LOG:  redo is not required
2020-08-20 10:03:35.636 MSK [1359] LOG:  selected new timeline ID: 3
2020-08-20 10:03:35.703 MSK [1359] LOG:  archive recovery complete
2020-08-20 10:03:35.712 MSK [1357] LOG:  database system is ready to accept connections
2020-08-20 10:03:36.704 MSK [1640] LOG:  standby "tuchanka3c" is now a candidate for quorum synchronous standby
2020-08-20 10:03:36.818 MSK [1641] LOG:  standby "tuchanka3b" is now a candidate for quorum synchronous standby
2020-08-20 10:05:03.399 MSK [1641] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:05:31.829 MSK [2564] LOG:  standby "tuchanka3b" is now a candidate for quorum synchronous standby
2020-08-20 10:07:07.183 MSK [1640] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:08:01.798 MSK [3813] LOG:  standby "tuchanka3c" is now a candidate for quorum synchronous standby
2020-08-20 10:09:36.798 MSK [3813] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:10:29.449 MSK [4975] LOG:  standby "tuchanka3c" is now a candidate for quorum synchronous standby
2020-08-20 10:12:05.088 MSK [2564] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:12:28.589 MSK [5977] LOG:  standby "tuchanka3b" is now a candidate for quorum synchronous standby
2020-08-20 10:14:15.063 MSK [5977] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:14:37.883 MSK [7071] LOG:  standby "tuchanka3b" is now a candidate for quorum synchronous standby
2020-08-20 10:16:29.821 MSK [7071] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:16:59.105 MSK [8236] LOG:  standby "tuchanka3b" is now a candidate for quorum synchronous standby
2020-08-20 10:18:39.647 MSK [4975] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:20:27.047 MSK [9855] LOG:  standby "tuchanka3c" is now a candidate for quorum synchronous standby
2020-08-20 10:22:13.899 MSK [8236] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:22:37.731 MSK [10933] LOG:  standby "tuchanka3b" is now a candidate for quorum synchronous standby
2020-08-20 10:24:28.857 MSK [10933] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:24:57.420 MSK [12084] LOG:  standby "tuchanka3b" is now a candidate for quorum synchronous standby

All errors are from former test, which was applied to slaves. (The host for tests choosen randomly).

New master tuchanka3b, some logs are copied from the old master on database restoration after previous test by pg_basebackup. I separated them by blank line.

2020-08-20 10:03:35.633 MSK [1359] LOG:  received promote request
2020-08-20 10:03:35.633 MSK [1507] FATAL:  terminating walreceiver process due to administrator command
2020-08-20 10:03:35.634 MSK [1359] LOG:  redo is not required
2020-08-20 10:03:35.636 MSK [1359] LOG:  selected new timeline ID: 3
2020-08-20 10:03:35.703 MSK [1359] LOG:  archive recovery complete
2020-08-20 10:03:35.712 MSK [1357] LOG:  database system is ready to accept connections
2020-08-20 10:03:36.704 MSK [1640] LOG:  standby "tuchanka3c" is now a candidate for quorum synchronous standby
2020-08-20 10:03:36.818 MSK [1641] LOG:  standby "tuchanka3b" is now a candidate for quorum synchronous standby
2020-08-20 10:05:03.399 MSK [1641] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:05:31.829 MSK [2564] LOG:  standby "tuchanka3b" is now a candidate for quorum synchronous standby
2020-08-20 10:07:07.183 MSK [1640] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:08:01.798 MSK [3813] LOG:  standby "tuchanka3c" is now a candidate for quorum synchronous standby
2020-08-20 10:09:36.798 MSK [3813] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:10:29.449 MSK [4975] LOG:  standby "tuchanka3c" is now a candidate for quorum synchronous standby
2020-08-20 10:12:05.088 MSK [2564] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:12:28.589 MSK [5977] LOG:  standby "tuchanka3b" is now a candidate for quorum synchronous standby
2020-08-20 10:14:15.063 MSK [5977] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:14:37.883 MSK [7071] LOG:  standby "tuchanka3b" is now a candidate for quorum synchronous standby
2020-08-20 10:16:29.821 MSK [7071] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:16:59.105 MSK [8236] LOG:  standby "tuchanka3b" is now a candidate for quorum synchronous standby
2020-08-20 10:18:39.647 MSK [4975] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:20:27.047 MSK [9855] LOG:  standby "tuchanka3c" is now a candidate for quorum synchronous standby
2020-08-20 10:22:13.899 MSK [8236] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:22:37.731 MSK [10933] LOG:  standby "tuchanka3b" is now a candidate for quorum synchronous standby
2020-08-20 10:24:28.857 MSK [10933] LOG:  terminating walsender process due to replication timeout

2020-08-20 10:24:55.888 MSK [1407] LOG:  database system was interrupted; last known up at 2020-08-20 10:24:49 MSK
2020-08-20 10:24:55.906 MSK [1407] LOG:  entering standby mode
2020-08-20 10:24:55.908 MSK [1407] LOG:  redo starts at 0/15000028
2020-08-20 10:24:55.909 MSK [1407] LOG:  consistent recovery state reached at 0/15002300
2020-08-20 10:24:55.910 MSK [1405] LOG:  database system is ready to accept read only connections
2020-08-20 10:24:55.928 MSK [1411] LOG:  started streaming WAL from primary at 0/16000000 on timeline 3
2020-08-20 10:26:37.308 MSK [1411] FATAL:  terminating walreceiver due to timeout
2020-08-20 10:26:37.308 MSK [1407] LOG:  invalid record length at 0/1600C4D8: wanted 24, got 0
2020-08-20 10:30:55.965 MSK [1407] LOG:  received promote request
2020-08-20 10:30:55.965 MSK [2424] FATAL:  terminating walreceiver process due to administrator command
2020-08-20 10:30:55.966 MSK [1407] LOG:  redo done at 0/1600C4B0
2020-08-20 10:30:55.966 MSK [1407] LOG:  last completed transaction was at log time 2020-08-20 10:25:38.764299+03
2020-08-20 10:30:55.968 MSK [1407] LOG:  selected new timeline ID: 4
2020-08-20 10:30:56.001 MSK [1407] LOG:  archive recovery complete
2020-08-20 10:30:56.005 MSK [1405] LOG:  database system is ready to accept connections

The slave with didn't reconnected replication, tuchanka3c. Also I separated logs copied from the old master by a blank line:

2020-08-20 10:03:35.633 MSK [1359] LOG:  received promote request
2020-08-20 10:03:35.633 MSK [1507] FATAL:  terminating walreceiver process due to administrator command
2020-08-20 10:03:35.634 MSK [1359] LOG:  redo is not required
2020-08-20 10:03:35.636 MSK [1359] LOG:  selected new timeline ID: 3
2020-08-20 10:03:35.703 MSK [1359] LOG:  archive recovery complete
2020-08-20 10:03:35.712 MSK [1357] LOG:  database system is ready to accept connections
2020-08-20 10:03:36.704 MSK [1640] LOG:  standby "tuchanka3c" is now a candidate for quorum synchronous standby
2020-08-20 10:03:36.818 MSK [1641] LOG:  standby "tuchanka3b" is now a candidate for quorum synchronous standby
2020-08-20 10:05:03.399 MSK [1641] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:05:31.829 MSK [2564] LOG:  standby "tuchanka3b" is now a candidate for quorum synchronous standby
2020-08-20 10:07:07.183 MSK [1640] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:08:01.798 MSK [3813] LOG:  standby "tuchanka3c" is now a candidate for quorum synchronous standby
2020-08-20 10:09:36.798 MSK [3813] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:10:29.449 MSK [4975] LOG:  standby "tuchanka3c" is now a candidate for quorum synchronous standby
2020-08-20 10:12:05.088 MSK [2564] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:12:28.589 MSK [5977] LOG:  standby "tuchanka3b" is now a candidate for quorum synchronous standby
2020-08-20 10:14:15.063 MSK [5977] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:14:37.883 MSK [7071] LOG:  standby "tuchanka3b" is now a candidate for quorum synchronous standby
2020-08-20 10:16:29.821 MSK [7071] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:16:59.105 MSK [8236] LOG:  standby "tuchanka3b" is now a candidate for quorum synchronous standby
2020-08-20 10:18:39.647 MSK [4975] LOG:  terminating walsender process due to replication timeout

2020-08-20 10:20:25.168 MSK [1408] LOG:  database system was interrupted; last known up at 2020-08-20 10:20:19 MSK
2020-08-20 10:20:25.180 MSK [1408] LOG:  entering standby mode
2020-08-20 10:20:25.181 MSK [1408] LOG:  redo starts at 0/11000098
2020-08-20 10:20:25.183 MSK [1408] LOG:  consistent recovery state reached at 0/11000A68
2020-08-20 10:20:25.183 MSK [1406] LOG:  database system is ready to accept read only connections
2020-08-20 10:20:25.193 MSK [1412] LOG:  started streaming WAL from primary at 0/12000000 on timeline 3
2020-08-20 10:25:05.370 MSK [1955] LOG:  could not send data to client: Connection reset by peer
2020-08-20 10:25:05.370 MSK [1955] CONTEXT:  PL/pgSQL function beat4tmux(text) line 4 at RAISE
2020-08-20 10:25:05.370 MSK [1955] STATEMENT:  call beat4tmux('3')
2020-08-20 10:25:05.370 MSK [1955] FATAL:  connection to client lost
2020-08-20 10:25:05.370 MSK [1955] CONTEXT:  PL/pgSQL function beat4tmux(text) line 4 at RAISE
2020-08-20 10:25:05.370 MSK [1955] STATEMENT:  call beat4tmux('3')
2020-08-20 10:26:38.655 MSK [1412] FATAL:  terminating walreceiver due to timeout
2020-08-20 10:26:38.655 MSK [1408] LOG:  record with incorrect prev-link 0/1200C4B0 at 0/1600C4D8

The function beat4tmux used for health monitoring.

The output from mon:

Online: [ tuchanka3b tuchanka3c ]
OFFLINE: [ tuchanka3a ]

Full list of resources:

 Master/Slave Set: krogan3DB-master [krogan3DB]
     Masters: [ tuchanka3b ]
     Slaves: [ tuchanka3c ]
     Stopped: [ tuchanka3a ]
krogan3IP (ocf::heartbeat:IPaddr2): Started tuchanka3b
krogan3s1IP     (ocf::heartbeat:IPaddr2): Stopped
krogan3s2IP     (ocf::heartbeat:IPaddr2): Stopped

Node Attributes:
* Node tuchanka3b:
    + master-krogan3DB                  : 1001
* Node tuchanka3c:
    + master-krogan3DB                  : -1000

Migration Summary:
* Node tuchanka3b:
* Node tuchanka3c:

Pending Fencing Actions:
* reboot of tuchanka3a pending: client=crmd.1242, origin=tuchanka3a

>> Well, this can be workarouned by turning off wal_receiver_timeout.

>> Now PostgreSQL-STOP works fine, but the problem is still
>> exists with other test. ForkBomb simulates an out of memory situation. In
>> this case a slave sometimes don't reconnect to the new master too, with
>> errors in log:
>>
>> 10:09:43.99 [1417] FATAL:  could not receive data from WAL stream: server
>> closed the connection unexpectedly This probably means the server terminated
>> abnormally before or while processing the request.
>> 10:09:43.992 [1413] LOG:  invalid record length at 0/D8014278: wanted 24, got
>> 0
>
> I suspect the problem is somewhere else. The first message here is probably
> related to your primary being fenced, the second one is normal. After your
> IP moved to the recently promoted primary, your standby are supposed to
> reconnect with no problem.

I believe that problem with the ForkBomb has the same nature. The case with PostgreSQL-STOP and default wal_receiver_timeout can be easily repeated, it rises always by `killall -STOP postgres` on the master, as I can see. But the case with wal_receiver_timeout=0 and ForkBomb happens rather rare, not always.



Reply | Threaded
Open this post in threaded view
|

Re: BUG? Slave don't reconnect to the master

Jehan-Guillaume de Rorthais
On Thu, 20 Aug 2020 15:16:10 +0300
Олег Самойлов <[hidden email]> wrote:
[...]

> >> Almost works fine, but sometimes, rather rare, I detected that a slave
> >> don't reconnect to the new master after a failure. First case is
> >> PostgreSQL-STOP, when I `kill` by STOP signal postgres on the master to
> >> simulate freeze. The slave don't reconnect to the new master with errors
> >> in log:
> >>
> >> 18:02:56.236 [3154] FATAL:  terminating walreceiver due to timeout
> >> 18:02:56.237 [1421] LOG:  record with incorrect prev-link 0/1600DDE8 at
> >> 0/1A00DE10  
> >
> > Do you have more logs from both side of the replication?  
>
> There was nothing special. Just errors from previous tests.

OK

> > How do you build your standbys?  
> Okey, I'll show you all configs. Linux Centos 7, PostgreSQL 11 from the
> official postgres repository. # rpm -q corosync
> corosync-2.4.5-4.el7.x86_64
> # rpm -q pacemaker
> pacemaker-1.1.21-4.el7.x86_64
> # rpm -q sbd
> sbd-1.4.0-15.el7.x86_64
>
> The cluster designed for three datacenter with good connection between in one
> city , so it must survive a crush of one datacenter. So stonith is not used,
> instead I use quorum and sbd based watchdog. Name of nodes: tuchanka3a,
> tuchanka3b, tuchanka3c. Name for float ip: krogan3 for the master; krogan3s1,
> krogan3s2 for slaves.
>
> postgresql common conf (not default options):
>
> ident_file = '/var/lib/pgsql/pg_ident.conf'
> hba_file = '/var/lib/pgsql/pg_hba.conf'
> listen_addresses = '*'
> log_filename = 'postgresql.%F.log'
> wal_keep_segments = 1
> restart_after_crash = off
> wal_receiver_timeout=0 # in case PostgreSQL-STOP wal_receiver_timeout is
> default 60s, of cause shared_buffers = 32MB
> max_wal_size=80MB

Based on setup per node, you can probably add
'synchronous_commit=remote_write' in the common conf.

> [...]
> pacemaker config, specific for this cluster:
> [...]

why did you add "monitor interval=15"? No harm, but it is redundant with
"monitor interval=16 role=Master" and "monitor interval=17 role=Slave".

By the way, nice catch to exclude RO IP when master score is
negative!

> Test PostgreSQL-STOP (wal_receiver_timeout is default 60s):
> killall -STOP postgres
>
> PostgreSQL logs:
> From old master, tuchanka 3a:
> [...]

OK

> All errors are from former test, which was applied to slaves. (The host for
> tests choosen randomly).
>
> New master tuchanka3b, some logs are copied from the old master on database
> restoration after previous test by pg_basebackup. I separated them by blank
> line.

You should probably consider putting your logs outside of your PGDATA, or
even better: send your PostgreSQL/Pacemaker logs to a dedicated syslog node
outside. Adding the hostname in the prefix of each log line might probably help
a lot.

> [...]
> 10:24:55.906 LOG:  entering standby mode
> 10:24:55.908 LOG:  redo starts at 0/15000028
> 10:24:55.909 LOG:  consistent recovery state reached at 0/15002300
> 10:24:55.910 LOG:  database system is ready to accept read only connections
> 10:24:55.928 LOG:  started streaming WAL from primary at 0/16000000 on tl 3
> 10:26:37.308 FATAL:  terminating walreceiver due to timeout

Timeout because of SIGSTOP on primary here.

> 10:26:37.308 LOG:  invalid record length at 0/1600C4D8: wanted 24, got 0
> 10:30:55.965 LOG:  received promote request

Promotion from Pacemaker here.

What happened during more than 4 minutes between the timeout and the promotion?

> 10:30:55.965 FATAL:  terminating walreceiver process due to administrator cmd
> 10:30:55.966 LOG:  redo done at 0/1600C4B0
> 10:30:55.966 LOG:  last completed transaction was at log time 10:25:38.76429
> 10:30:55.968 LOG:  selected new timeline ID: 4
> 10:30:56.001 LOG:  archive recovery complete
> 10:30:56.005 LOG:  database system is ready to accept connections

> The slave with didn't reconnected replication, tuchanka3c. Also I separated
> logs copied from the old master by a blank line:
>
> [...]
>
> 10:20:25.168 LOG:  database system was interrupted; last known up at 10:20:19
> 10:20:25.180 LOG:  entering standby mode
> 10:20:25.181 LOG:  redo starts at 0/11000098
> 10:20:25.183 LOG:  consistent recovery state reached at 0/11000A68
> 10:20:25.183 LOG:  database system is ready to accept read only connections
> 10:20:25.193 LOG:  started streaming WAL from primary at 0/12000000 on tl 3
> 10:25:05.370 LOG:  could not send data to client: Connection reset by peer
> 10:26:38.655 FATAL:  terminating walreceiver due to timeout
> 10:26:38.655 LOG:  record with incorrect prev-link 0/1200C4B0 at 0/1600C4D8

This message appear before the effective promotion of tuchanka3b. Do you have
logs about what happen *after* the promotion?

Reading at this error, it seems like record at 0/1600C4D8 references the
previous one in WAL 0/12000000. So the file referenced as 0/16 have either
corrupted data or was 0/12 being recycled, but not zeroed correctly, as v11
always do no matter what (no wal_init_zero there).

That's why I'm wondering how you built your standbys, from scratch?

[...]
>
> >> Well, this can be workarouned by turning off wal_receiver_timeout.  

Just because the walreceiver keeps waiting on its socket and don't pay
attention to data on disk I suppose.

> >> Now PostgreSQL-STOP works fine, but the problem is still
> >> exists with other test. ForkBomb simulates an out of memory situation. In
> >> this case a slave sometimes don't reconnect to the new master too, with
> >> errors in log:
> >>
> >> 10:09:43.99 [1417] FATAL:  could not receive data from WAL stream: server
> >> closed the connection unexpectedly This probably means the server
> >> terminated abnormally before or while processing the request.
> >> 10:09:43.992 [1413] LOG:  invalid record length at 0/D8014278: wanted 24,
> >> got 0  
> >
> > I suspect the problem is somewhere else. The first message here is probably
> > related to your primary being fenced, the second one is normal. After your
> > IP moved to the recently promoted primary, your standby are supposed to
> > reconnect with no problem.  
>
> I believe that problem with the ForkBomb has the same nature. The case with
> PostgreSQL-STOP and default wal_receiver_timeout can be easily repeated, it
> rises always by `killall -STOP postgres` on the master, as I can see. But the
> case with wal_receiver_timeout=0 and ForkBomb happens rather rare, not always.

because of resources pressure, I suppose the OOM killer behavior is not
predictable. Sometime it will kill your walsender, sometime not. And sometime,
the watchdog itself is supposed to kill the node before anyone else.

Regards,


Reply | Threaded
Open this post in threaded view
|

Re: BUG? Slave don't reconnect to the master

Олег Самойлов


> 21 авг. 2020 г., в 17:26, Jehan-Guillaume de Rorthais <[hidden email]> написал(а):
>
> On Thu, 20 Aug 2020 15:16:10 +0300
> Based on setup per node, you can probably add
> 'synchronous_commit=remote_write' in the common conf.

Nope. I set 'synchronous_commit=remote_write' only for 3 and 4 node clusters. Two node clusters don't have this setting (use default), because they don't have the synchronous commit.

>
>> [...]
>> pacemaker config, specific for this cluster:
>> [...]
>
> why did you add "monitor interval=15"? No harm, but it is redundant with
> "monitor interval=16 role=Master" and "monitor interval=17 role=Slave".

I can't remember clearly. :) Look what happens without it.

+ pcs -f configured_cib.xml resource create krogan2DB ocf:heartbeat:pgsqlms bindir=/usr/pgsql-11/bin pgdata=/var/lib/pgsql/krogan2 recovery_template=/var/lib/pgsql/krogan2.paf meta master notify=true resource-stickiness=10
Warning: changing a monitor operation interval from 15 to 16 to make the operation unique
Warning: changing a monitor operation interval from 16 to 17 to make the operation unique

So trivial monitor always exists by default with interval 15.

My real command
pcs -f configured_cib.xml resource create krogan2DB ocf:heartbeat:pgsqlms bindir=/usr/pgsql-11/bin pgdata=/var/lib/pgsql/krogan2 recovery_template=/var/lib/pgsql/krogan2.paf op monitor interval=15 timeout=10 monitor interval=16 role=Master timeout=15 monitor interval=17 role=Slave timeout=10 meta master notify=true resource-stickiness=10

Looked like I needed to add all this to change "timeout" parameter for the monitor operations and I needed for interval parameter to point on the specific monitor operation. Looked like the default timeout 10 was not enough for the "master".

> By the way, nice catch to exclude RO IP when master score is
> negative.
>
> You should probably consider putting your logs outside of your PGDATA, or
> even better: send your PostgreSQL/Pacemaker logs to a dedicated syslog node
> outside. Adding the hostname in the prefix of each log line might probably help
> a lot.

I have put this into my ToDo notebook.

>> [...]
>> 10:24:55.906 LOG:  entering standby mode
>> 10:24:55.908 LOG:  redo starts at 0/15000028
>> 10:24:55.909 LOG:  consistent recovery state reached at 0/15002300
>> 10:24:55.910 LOG:  database system is ready to accept read only connections
>> 10:24:55.928 LOG:  started streaming WAL from primary at 0/16000000 on tl 3
>> 10:26:37.308 FATAL:  terminating walreceiver due to timeout
>
> Timeout because of SIGSTOP on primary here.

Sure

>> 10:26:37.308 LOG:  invalid record length at 0/1600C4D8: wanted 24, got 0
>> 10:30:55.965 LOG:  received promote request
>
> Promotion from Pacemaker here.

Yep

> What happened during more than 4 minutes between the timeout and the promotion?

It's one of the problem, which you may improve. :) The pacemaker reaction is the longest for STOP signal test, usually near 5 minutes. The pacemaker tried to make different things (for instance "demote") and wait for different timeouts.

>
>> 10:30:55.965 FATAL:  terminating walreceiver process due to administrator cmd
>> 10:30:55.966 LOG:  redo done at 0/1600C4B0
>> 10:30:55.966 LOG:  last completed transaction was at log time 10:25:38.76429
>> 10:30:55.968 LOG:  selected new timeline ID: 4
>> 10:30:56.001 LOG:  archive recovery complete
>> 10:30:56.005 LOG:  database system is ready to accept connections
>
>> The slave with didn't reconnected replication, tuchanka3c. Also I separated
>> logs copied from the old master by a blank line:
>>
>> [...]
>>
>> 10:20:25.168 LOG:  database system was interrupted; last known up at 10:20:19
>> 10:20:25.180 LOG:  entering standby mode
>> 10:20:25.181 LOG:  redo starts at 0/11000098
>> 10:20:25.183 LOG:  consistent recovery state reached at 0/11000A68
>> 10:20:25.183 LOG:  database system is ready to accept read only connections
>> 10:20:25.193 LOG:  started streaming WAL from primary at 0/12000000 on tl 3
>> 10:25:05.370 LOG:  could not send data to client: Connection reset by peer
>> 10:26:38.655 FATAL:  terminating walreceiver due to timeout
>> 10:26:38.655 LOG:  record with incorrect prev-link 0/1200C4B0 at 0/1600C4D8
>
> This message appear before the effective promotion of tuchanka3b. Do you have
> logs about what happen *after* the promotion?

This is end of the slave log. Nothing. Just absent replication.

> Reading at this error, it seems like record at 0/1600C4D8 references the
> previous one in WAL 0/12000000. So the file referenced as 0/16 have either
> corrupted data or was 0/12 being recycled, but not zeroed correctly, as v11
> always do no matter what (no wal_init_zero there).

Okey, may be in v12 it will be fixed.

> That's why I'm wondering how you built your standbys, from scratch?

By special scripts. :) This project already on GitHub and I am waiting for the final solution of my boss to open it. And it will take some time to translate README to English. After this I'll link the repository here.

>
> [...]
>>
>>>> Well, this can be workarouned by turning off wal_receiver_timeout.  
>
> Just because the walreceiver keeps waiting on its socket and don't pay
> attention to data on disk I suppose.

I suppose because it don't pay attention on the absents of the data, and don't finish the connection by timeout. But it may finish it in other way by other reason to reconnect to a new master later.

>
>>>> Now PostgreSQL-STOP works fine, but the problem is still
>>>> exists with other test. ForkBomb simulates an out of memory situation. In
>>>> this case a slave sometimes don't reconnect to the new master too, with
>>>> errors in log:
>>>>
>>>> 10:09:43.99 [1417] FATAL:  could not receive data from WAL stream: server
>>>> closed the connection unexpectedly This probably means the server
>>>> terminated abnormally before or while processing the request.
>>>> 10:09:43.992 [1413] LOG:  invalid record length at 0/D8014278: wanted 24,
>>>> got 0  
>>>
>>> I suspect the problem is somewhere else. The first message here is probably
>>> related to your primary being fenced, the second one is normal. After your
>>> IP moved to the recently promoted primary, your standby are supposed to
>>> reconnect with no problem.  
>>
>> I believe that problem with the ForkBomb has the same nature. The case with
>> PostgreSQL-STOP and default wal_receiver_timeout can be easily repeated, it
>> rises always by `killall -STOP postgres` on the master, as I can see. But the
>> case with wal_receiver_timeout=0 and ForkBomb happens rather rare, not always.
>
> because of resources pressure, I suppose the OOM killer behavior is not
> predictable. Sometime it will kill your walsender, sometime not. And sometime,
> the watchdog itself is supposed to kill the node before anyone else.

I think will be better first of all to fix PostgreSQL STOP test, because it is easily repeatable. And after this I'll test the case with ForkBomb, one day will be enough.

Reply | Threaded
Open this post in threaded view
|

Re: BUG? Slave don't reconnect to the master

Jehan-Guillaume de Rorthais
Hi,

Sorry for the late answer, I've been busy.

On Mon, 24 Aug 2020 18:45:42 +0300
Олег Самойлов <[hidden email]> wrote:

> > 21 авг. 2020 г., в 17:26, Jehan-Guillaume de Rorthais <[hidden email]>
> > написал(а):
> >
> > On Thu, 20 Aug 2020 15:16:10 +0300
> > Based on setup per node, you can probably add
> > 'synchronous_commit=remote_write' in the common conf.  
>
> Nope. I set 'synchronous_commit=remote_write' only for 3 and 4 node clusters.
> [...]

Then I suppose your previous message had an error as it shows three
nodes tuchanka3a, tuchanka3b and tuchanka3c (no 4th node), all with remote_write
in krogan3.conf. But anyway.

> >> [...]
> >> pacemaker config, specific for this cluster:
> >> [...]  
> >
> > why did you add "monitor interval=15"? No harm, but it is redundant with
> > "monitor interval=16 role=Master" and "monitor interval=17 role=Slave".  
>
> I can't remember clearly. :) Look what happens without it.
>
> + pcs -f configured_cib.xml resource create krogan2DB ocf:heartbeat:pgsqlms
> bindir=/usr/pgsql-11/bin pgdata=/var/lib/pgsql/krogan2
> recovery_template=/var/lib/pgsql/krogan2.paf meta master notify=true
> resource-stickiness=10
> Warning: changing a monitor operation interval from 15 to 16 to make the
> operation unique
> Warning: changing a monitor operation interval from 16 to 17 to make the
> operation unique

Something fishy here. This command lack op monitor settings. Pacemaker don't
add any default monitor operation with default interval if you don't give one
at resource creation.

If you create such a resource with no monitoring, the cluster will start/stop
it when needed, but will NOT check for its health. See:

https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-resource-monitoring.html

> So trivial monitor always exists by default with interval 15.

nope.

> My real command
> pcs -f configured_cib.xml resource create krogan2DB ocf:heartbeat:pgsqlms
> bindir=/usr/pgsql-11/bin pgdata=/var/lib/pgsql/krogan2
> recovery_template=/var/lib/pgsql/krogan2.paf op monitor interval=15
> timeout=10 monitor interval=16 role=Master timeout=15 monitor interval=17
> role=Slave timeout=10 meta master notify=true resource-stickiness=10
>
> Looked like I needed to add all this to change "timeout" parameter for the
> monitor operations and I needed for interval parameter to point on the
> specific monitor operation.

OK, I understand now. If you want to edit an existing resource, use "pcs
resource update". Make sure read the pcs manual about how to use it to
edit/remove/add operations on a resource.

> Looked like the default timeout 10 was not enough for the "master".

It's written in PAF doc. See:
https://clusterlabs.github.io/PAF/configuration.html#resource-agent-actions

Do not hesitate to report or submit some enhancements to the doc if needed.

> [...]
> >> [...]
> >> 10:24:55.906 LOG:  entering standby mode
> >> 10:24:55.908 LOG:  redo starts at 0/15000028
> >> 10:24:55.909 LOG:  consistent recovery state reached at 0/15002300
> >> 10:24:55.910 LOG:  database system is ready to accept read only connections
> >> 10:24:55.928 LOG:  started streaming WAL from primary at 0/16000000 on tl 3
> >> 10:26:37.308 FATAL:  terminating walreceiver due to timeout  
> >
> > Timeout because of SIGSTOP on primary here.  
>
> Sure
>
> >> 10:26:37.308 LOG:  invalid record length at 0/1600C4D8: wanted 24, got 0
> >> 10:30:55.965 LOG:  received promote request  
> >
> > Promotion from Pacemaker here.  
>
> Yep
>
> > What happened during more than 4 minutes between the timeout and the
> > promotion?  
>
> It's one of the problem, which you may improve. :) The pacemaker reaction is
> the longest for STOP signal test, usually near 5 minutes. The pacemaker tried
> to make different things (for instance "demote") and wait for different
> timeouts.

Oh, understood, obviously, I should have thought about that. Well, I will not
be able to improve anything here. You might want to adjust the various operation
timeouts and lower the migration-threshold.

> >> 10:30:55.965 FATAL:  terminating walreceiver process dpue to administrator
> >> cmd 10:30:55.966 LOG:  redo done at 0/1600C4B0
> >> 10:30:55.966 LOG:  last completed transaction was at log time
> >> 10:25:38.76429 10:30:55.968 LOG:  selected new timeline ID: 4
> >> 10:30:56.001 LOG:  archive recovery complete
> >> 10:30:56.005 LOG:  database system is ready to accept connections  
> >  
> >> The slave with didn't reconnected replication, tuchanka3c. Also I separated
> >> logs copied from the old master by a blank line:
> >>
> >> [...]
> >>
> >> 10:20:25.168 LOG:  database system was interrupted; last known up at
> >> 10:20:19 10:20:25.180 LOG:  entering standby mode
> >> 10:20:25.181 LOG:  redo starts at 0/11000098
> >> 10:20:25.183 LOG:  consistent recovery state reached at 0/11000A68
> >> 10:20:25.183 LOG:  database system is ready to accept read only connections
> >> 10:20:25.193 LOG:  started streaming WAL from primary at 0/12000000 on tl 3
> >> 10:25:05.370 LOG:  could not send data to client: Connection reset by peer
> >> 10:26:38.655 FATAL:  terminating walreceiver due to timeout
> >> 10:26:38.655 LOG:  record with incorrect prev-link 0/1200C4B0 at
> >> 0/1600C4D8  
> >
> > This message appear before the effective promotion of tuchanka3b. Do you
> > have logs about what happen *after* the promotion?  
>
> This is end of the slave log. Nothing. Just absent replication.

This is unusual. Could you log some more details about replication
tryouts to your PostgreSQL logs? Set log_replication_commands and lower
log_min_messages to debug ?

> > Reading at this error, it seems like record at 0/1600C4D8 references the
> > previous one in WAL 0/12000000. So the file referenced as 0/16 have either
> > corrupted data or was 0/12 being recycled, but not zeroed correctly, as v11
> > always do no matter what (no wal_init_zero there).  
>
> Okey, may be in v12 it will be fixed.

No, I don't think so. This is not related with a bug in v11 -vs- v12, but v12
might behave a bit differently depending on the value of wal_init_zero.

> > That's why I'm wondering how you built your standbys, from scratch?  
>
> By special scripts. :) This project already on GitHub and I am waiting for
> the final solution of my boss to open it. And it will take some time to
> translate README to English. After this I'll link the repository here.

I'll give it a look and try to reproduce if I find some time.


Regards,


Reply | Threaded
Open this post in threaded view
|

Re: BUG? Slave don't reconnect to the master

Олег Самойлов
Hi

> On 3 Sep 2020, at 14:39, Jehan-Guillaume de Rorthais <[hidden email]> wrote:
> On Mon, 24 Aug 2020 18:45:42 +0300
> Олег Самойлов <[hidden email]> wrote:
>
>>> 21 авг. 2020 г., в 17:26, Jehan-Guillaume de Rorthais <[hidden email]>
>>> написал(а):
>>>
>>> On Thu, 20 Aug 2020 15:16:10 +0300
>>> Based on setup per node, you can probably add
>>> 'synchronous_commit=remote_write' in the common conf.  
>>
>> Nope. I set 'synchronous_commit=remote_write' only for 3 and 4 node clusters.
>> [...]
>
> Then I suppose your previous message had an error as it shows three
> nodes tuchanka3a, tuchanka3b and tuchanka3c (no 4th node), all with remote_write
> in krogan3.conf. But anyway.
I tested 4 different types of clusters. The cluster 1 and 2 has two nodes and thus don't reveal this bug. The cluster 3 and 4 has 3 and 4 nodes and thus this bug is observed. I used the cluster 3 as example.

>
>>>> [...]
>>>> pacemaker config, specific for this cluster:
>>>> [...]  
>>>
>>> why did you add "monitor interval=15"? No harm, but it is redundant with
>>> "monitor interval=16 role=Master" and "monitor interval=17 role=Slave".  
>>
>> I can't remember clearly. :) Look what happens without it.
>>
>> + pcs -f configured_cib.xml resource create krogan2DB ocf:heartbeat:pgsqlms
>> bindir=/usr/pgsql-11/bin pgdata=/var/lib/pgsql/krogan2
>> recovery_template=/var/lib/pgsql/krogan2.paf meta master notify=true
>> resource-stickiness=10
>> Warning: changing a monitor operation interval from 15 to 16 to make the
>> operation unique
>> Warning: changing a monitor operation interval from 16 to 17 to make the
>> operation unique
>
> Something fishy here. This command lack op monitor settings. Pacemaker don't
> add any default monitor operation with default interval if you don't give one
> at resource creation.
>
> If you create such a resource with no monitoring, the cluster will start/stop
> it when needed, but will NOT check for its health. See:
>
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-resource-monitoring.html
May be. But keep in mind, that I uses `pcs`, I do not edit the xml file directly. And I use too old pacemaker, the default package of CentOS 7 is
pacemaker-1.1.21-4.el7.x86_64, while you link of documentation is for Pacemaker 2.0.

But never mind, this does not concern the discussed bug.

>
>> So trivial monitor always exists by default with interval 15.
>
> nope.

This is not true for CentOS 7. I removed my monitor options, for this example.

pcs cluster cib original_cib.xml
cp original_cib.xml configured_cib.xml
pcs -f configured_cib.xml resource create krogan3DB ocf:heartbeat:pgsqlms bindir=/usr/pgsql-11/bin pgdata=/var/lib/pgsql/krogan3 recovery_template=/var/lib/pgsql/krogan3.paf meta master notify=true resource-stickiness=10
pcs -f configured_cib.xml resource create krogan3IP ocf:heartbeat:IPaddr2 nic=eth0 cidr_netmask=24 ip=192.168.89.35
pcs -f configured_cib.xml resource create krogan3s1IP ocf:heartbeat:IPaddr2 nic=eth0 cidr_netmask=24 ip=192.168.89.36
pcs -f configured_cib.xml resource create krogan3s2IP ocf:heartbeat:IPaddr2 nic=eth0 cidr_netmask=24 ip=192.168.89.37
pcs -f configured_cib.xml constraint colocation add krogan3IP with master krogan3DB-master INFINITY
pcs -f configured_cib.xml constraint order promote krogan3DB-master then start krogan3IP symmetrical=false
pcs -f configured_cib.xml constraint order demote krogan3DB-master then stop krogan3IP symmetrical=false kind=Optional
pcs -f configured_cib.xml constraint location krogan3s1IP rule score=-INFINITY master-krogan3DB lt integer 0
pcs -f configured_cib.xml constraint location krogan3s2IP rule score=-INFINITY master-krogan3DB lt integer 0
pcs -f configured_cib.xml constraint colocation add krogan3s1IP with slave krogan3DB-master INFINITY
pcs -f configured_cib.xml constraint colocation add krogan3s2IP with slave krogan3DB-master INFINITY
pcs -f configured_cib.xml constraint colocation add krogan3s1IP with krogan3s2IP -1000
pcs -f configured_cib.xml constraint order start krogan3DB-master then start krogan3s1IP
pcs -f configured_cib.xml constraint order start krogan3DB-master then start krogan3s2IP
pcs cluster cib-push configured_cib.xml --wait diff-against=original_cib.xml

13:44:27 j0 root@tuchanka3a:~
# pcs resource show krogan3DB-master
 Master: krogan3DB-master
  Meta Attrs: notify=true resource-stickiness=10
  Resource: krogan3DB (class=ocf provider=heartbeat type=pgsqlms)
   Attributes: bindir=/usr/pgsql-11/bin pgdata=/var/lib/pgsql/krogan3 recovery_template=/var/lib/pgsql/krogan3.paf
   Operations: demote interval=0s timeout=120 (krogan3DB-demote-interval-0s)
               methods interval=0s timeout=5 (krogan3DB-methods-interval-0s)
               monitor interval=15 timeout=10 (krogan3DB-monitor-interval-15)
               monitor interval=16 role=Master timeout=10 (krogan3DB-monitor-interval-16)
               monitor interval=17 role=Slave timeout=10 (krogan3DB-monitor-interval-17)
               notify interval=0s timeout=60 (krogan3DB-notify-interval-0s)
               promote interval=0s timeout=30 (krogan3DB-promote-interval-0s)
               reload interval=0s timeout=20 (krogan3DB-reload-interval-0s)
               start interval=0s timeout=60 (krogan3DB-start-interval-0s)
               stop interval=0s timeout=60 (krogan3DB-stop-interval-0s)

>
>> My real command
>> pcs -f configured_cib.xml resource create krogan2DB ocf:heartbeat:pgsqlms
>> bindir=/usr/pgsql-11/bin pgdata=/var/lib/pgsql/krogan2
>> recovery_template=/var/lib/pgsql/krogan2.paf op monitor interval=15
>> timeout=10 monitor interval=16 role=Master timeout=15 monitor interval=17
>> role=Slave timeout=10 meta master notify=true resource-stickiness=10
>>
>> Looked like I needed to add all this to change "timeout" parameter for the
>> monitor operations and I needed for interval parameter to point on the
>> specific monitor operation.
>
> OK, I understand now. If you want to edit an existing resource, use "pcs
> resource update". Make sure read the pcs manual about how to use it to
> edit/remove/add operations on a resource.
This is not so easy. To edit existed resource I must to know the "interval" of this resource, but in this case I am not sure what the interval will be for the monitor operation of the master role. :) Because
>>
>> Warning: changing a monitor operation interval from 15 to 16 to make the
>> operation unique
>> Warning: changing a monitor operation interval from 16 to 17 to make the
>> operation unique

I am not sure in what order and what it will be. Thats why I configured as I configured. This just works.

>
>> Looked like the default timeout 10 was not enough for the "master".
>
> It's written in PAF doc. See:
> https://clusterlabs.github.io/PAF/configuration.html#resource-agent-actions
>
> Do not hesitate to report or submit some enhancements to the doc if needed.

May be the documentation was improved. Thanks that you have pointed me on that. After moving to CentOS 8 I will check with recommended parameters according to the documentation.

>> It's one of the problem, which you may improve. :) The pacemaker reaction is
>> the longest for STOP signal test, usually near 5 minutes. The pacemaker tried
>> to make different things (for instance "demote") and wait for different
>> timeouts.
>
> Oh, understood, obviously, I should have thought about that. Well, I will not
> be able to improve anything here. You might want to adjust the various operation
> timeouts and lower the migration-threshold.

Yep. My test bed is rather slow, so I work only with the default timeouts. And this is not a problem for my goal. The optimisation of the timeouts must be done on the production server later.

>
>>>> 10:30:55.965 FATAL:  terminating walreceiver process dpue to administrator
>>>> cmd 10:30:55.966 LOG:  redo done at 0/1600C4B0
>>>> 10:30:55.966 LOG:  last completed transaction was at log time
>>>> 10:25:38.76429 10:30:55.968 LOG:  selected new timeline ID: 4
>>>> 10:30:56.001 LOG:  archive recovery complete
>>>> 10:30:56.005 LOG:  database system is ready to accept connections  
>>>
>>>> The slave with didn't reconnected replication, tuchanka3c. Also I separated
>>>> logs copied from the old master by a blank line:
>>>>
>>>> [...]
>>>>
>>>> 10:20:25.168 LOG:  database system was interrupted; last known up at
>>>> 10:20:19 10:20:25.180 LOG:  entering standby mode
>>>> 10:20:25.181 LOG:  redo starts at 0/11000098
>>>> 10:20:25.183 LOG:  consistent recovery state reached at 0/11000A68
>>>> 10:20:25.183 LOG:  database system is ready to accept read only connections
>>>> 10:20:25.193 LOG:  started streaming WAL from primary at 0/12000000 on tl 3
>>>> 10:25:05.370 LOG:  could not send data to client: Connection reset by peer
>>>> 10:26:38.655 FATAL:  terminating walreceiver due to timeout
>>>> 10:26:38.655 LOG:  record with incorrect prev-link 0/1200C4B0 at
>>>> 0/1600C4D8  
>>>
>>> This message appear before the effective promotion of tuchanka3b. Do you
>>> have logs about what happen *after* the promotion?  
>>
>> This is end of the slave log. Nothing. Just absent replication.
>
> This is unusual. Could you log some more details about replication
> tryouts to your PostgreSQL logs? Set log_replication_commands and lower
> log_min_messages to debug ?
Sure, this is PostgreSQL logs for the cluster tuchanka3.
Tuchanka3a is an old (failed) master.


Tuchanka3b is a new (promoted) master.


Tuchanka3c is a slave that has lost replication.


>>> That's why I'm wondering how you built your standbys, from scratch?  
>>
>> By special scripts. :) This project already on GitHub and I am waiting for
>> the final solution of my boss to open it. And it will take some time to
>> translate README to English. After this I'll link the repository here.
>
> I'll give it a look and try to reproduce if I find some time.

Heh, may be you are the only man who may find this project useful. :) I hope you have got my last email with description how to reveal this bug in the test bed and the screenshot of the result. Here is a new updated patch for this:

diff --git a/test/failure1 b/test/failure1
index d81b9c8..9cc5dc6 100755
--- a/test/failure1
+++ b/test/failure1
@@ -105,15 +105,15 @@ function OutOfSpace {
 readonly -f OutOfSpace

 # Here can be commented out unnessesary tests
-break_node+=('Reset')
-break_node+=('PowerOff')
-break_node+=('ShutDown')
-break_node+=('UnLink')
-break_node+=('Postgres-KILL')
+#break_node+=('Reset')
+#break_node+=('PowerOff')
+#break_node+=('ShutDown')
+#break_node+=('UnLink')
+#break_node+=('Postgres-KILL')
 break_node+=('Postgres-STOP')
-break_node+=('SBD-STOP')
-break_node+=('ForkBomb')
-break_node+=('OutOfSpace')
+#break_node+=('SBD-STOP')
+#break_node+=('ForkBomb')
+#break_node+=('OutOfSpace')
 readonly -a break_node

 # Setup tmux panes
@@ -137,9 +137,9 @@ function test_node {
        local f=$1
        local h unbroken time
        # random node from the cluster
-       h=$(random_word ${cluster_vms[$c]})
+       # h=$(random_word ${cluster_vms[$c]})
        # Can be used to test only the first node in the cluster
-       # h=$(first_word ${cluster_vms[$c]})
+       h=$(first_word ${cluster_vms[$c]})
        for unbroken in ${cluster_vms[$c]}
        do
                if [ $unbroken -ne $h ]
diff --git a/upload/common/postgresql.conf b/upload/common/postgresql.conf
index 1079e4b..a31b23d 100644
--- a/upload/common/postgresql.conf
+++ b/upload/common/postgresql.conf
@@ -29,11 +29,11 @@ restart_after_crash = off           # пусть решает pacemaker pgsqlms
 # Disable wal_receiver_timeout, because with default wal_receiver_timeout there is a bug
 # in PostgreSQL, which breaks recconection of the replication on PostgreSQL-STOP test.
 # https://www.postgresql.org/message-id/60590EC6-4062-4F25-A49C-3948ED2A7D47%40ya.ru
-wal_receiver_timeout=0
+#wal_receiver_timeout=0
 # экономлю ОЗУ виртуалок
 shared_buffers = 32MB
 # экономлю на виртуальных винчестерах, поставил равным min_wal_size
 max_wal_size=80MB
 # This 2 options are to debug PostgreSQL replication
-#log_replication_commands = on
-#log_min_messages = debug
+log_replication_commands = on
+log_min_messages = debug

tuchanka3a.log.xz (13K) Download Attachment
tuchanka3b.log.xz (16K) Download Attachment
tuchanka3c.log.xz (26K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: BUG? Slave don't reconnect to the master

Jehan-Guillaume de Rorthais
On Mon, 7 Sep 2020 23:46:17 +0300
Олег Самойлов <[hidden email]> wrote:

> [...]  
> >>> why did you add "monitor interval=15"? No harm, but it is redundant with
> >>> "monitor interval=16 role=Master" and "monitor interval=17
> >>> role=Slave".    
> >>
> >> I can't remember clearly. :) Look what happens without it.
> >>
> >> + pcs -f configured_cib.xml resource create krogan2DB ocf:heartbeat:pgsqlms
> >> bindir=/usr/pgsql-11/bin pgdata=/var/lib/pgsql/krogan2
> >> recovery_template=/var/lib/pgsql/krogan2.paf meta master notify=true
> >> resource-stickiness=10
> >> Warning: changing a monitor operation interval from 15 to 16 to make the
> >> operation unique
> >> Warning: changing a monitor operation interval from 16 to 17 to make the
> >> operation unique  
> >
> > Something fishy here. This command lack op monitor settings. Pacemaker don't
> > add any default monitor operation with default interval if you don't give
> > one at resource creation.
> >
> > If you create such a resource with no monitoring, the cluster will
> > start/stop it when needed, but will NOT check for its health. See:
> >
> > https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-resource-monitoring.html 
>
> May be. But keep in mind, that I uses `pcs`, I do not edit the xml file
> directly. And I use too old pacemaker, the default package of CentOS 7 is
> pacemaker-1.1.21-4.el7.x86_64, while you link of documentation is for
> Pacemaker 2.0.

It's the same behavior between both 2.0 and 1.1, but ...(see bellow)

> >> So trivial monitor always exists by default with interval 15.  
> >
> > nope.  
>
> This is not true for CentOS 7. I removed my monitor options, for this example.
>
> pcs cluster cib original_cib.xml
> cp original_cib.xml configured_cib.xml
> pcs -f configured_cib.xml resource create krogan3DB ocf:heartbeat:pgsqlms
> bindir=/usr/pgsql-11/bin pgdata=/var/lib/pgsql/krogan3
> recovery_template=/var/lib/pgsql/krogan3.paf meta master notify=true
> resource-stickiness=10

I tried your command, and indeed, pcs creates the missing monitor operation
with a default interval of 15. This is surprising, it's the first time I cross
these warning messages. Thanks for this information, I wasn't aware of this pcs
behavior.

But anyway, it's not recommended to create your resources without specifying
interval and timeout for each operations. See PAF docs. Just create the two
monitor operations related to both roles and you'll not have these warnings.

> > [...]
> > OK, I understand now. If you want to edit an existing resource, use "pcs
> > resource update". Make sure read the pcs manual about how to use it to
> > edit/remove/add operations on a resource.  
>
> This is not so easy. To edit existed resource I must to know the "interval"
> of this resource, but in this case I am not sure what the interval will be
> for the monitor operation of the master role. :) Because
> >>
> >> Warning: changing a monitor operation interval from 15 to 16 to make the
> >> operation unique
> >> Warning: changing a monitor operation interval from 16 to 17 to make the
> >> operation unique  
>
> I am not sure in what order and what it will be. Thats why I configured as I
> configured. This just works.

Now we know where these warnings comes from, you have a solution (set both
of them explicitly)

> >> Looked like the default timeout 10 was not enough for the "master".  
> >
> > It's written in PAF doc. See:
> > https://clusterlabs.github.io/PAF/configuration.html#resource-agent-actions
> >
> > Do not hesitate to report or submit some enhancements to the doc if
> > needed.  
>
> May be the documentation was improved. Thanks that you have pointed me on
> that. After moving to CentOS 8 I will check with recommended parameters
> according to the documentation.

You can do it right now with CentOS 7. They are the same.

> > [...]
> >>>> 10:30:55.965 FATAL:  terminating walreceiver process dpue to
> >>>> administrator cmd 10:30:55.966 LOG:  redo done at 0/1600C4B0
> >>>> 10:30:55.966 LOG:  last completed transaction was at log time
> >>>> 10:25:38.76429 10:30:55.968 LOG:  selected new timeline ID: 4
> >>>> 10:30:56.001 LOG:  archive recovery complete
> >>>> 10:30:56.005 LOG:  database system is ready to accept connections    
> >>>  
> >>>> The slave with didn't reconnected replication, tuchanka3c. Also I
> >>>> separated logs copied from the old master by a blank line:
> >>>>
> >>>> [...]
> >>>>
> >>>> 10:20:25.168 LOG:  database system was interrupted; last known up at
> >>>> 10:20:19 10:20:25.180 LOG:  entering standby mode
> >>>> 10:20:25.181 LOG:  redo starts at 0/11000098
> >>>> 10:20:25.183 LOG:  consistent recovery state reached at 0/11000A68
> >>>> 10:20:25.183 LOG:  database system is ready to accept read only
> >>>> connections 10:20:25.193 LOG:  started streaming WAL from primary at
> >>>> 0/12000000 on tl 3 10:25:05.370 LOG:  could not send data to client:
> >>>> Connection reset by peer 10:26:38.655 FATAL:  terminating walreceiver
> >>>> due to timeout 10:26:38.655 LOG:  record with incorrect prev-link
> >>>> 0/1200C4B0 at 0/1600C4D8    
> >>>
> >>> This message appear before the effective promotion of tuchanka3b. Do you
> >>> have logs about what happen *after* the promotion?    
> >>
> >> This is end of the slave log. Nothing. Just absent replication.  
> >
> > This is unusual. Could you log some more details about replication
> > tryouts to your PostgreSQL logs? Set log_replication_commands and lower
> > log_min_messages to debug ?  
>
> Sure, this is PostgreSQL logs for the cluster tuchanka3.
> Tuchanka3a is an old (failed) master.

According to your logs:

20:29:41 tuchanka3a: freeze
20:30:39 tuchanka3c: wal receiver timeout (default 60s timeout)
20:30:39 tuchanka3c: switched to archives, and error'ed (expected)
20:30:39 tuchanka3c: switched to stream again (expected)
                     no more news from this new wal receiver
20:34:21 tuchanka3b: promoted

I'm not sure where your floating IP is located at 20:30:39, but I suppose it
is still on tuchanka3a as the wal receiver don't hit any connection error and
tuchanka3b is not promoted yet.

So at this point, I suppose the wal receiver is stuck in libpqrcv_connect
waiting for frozen tuchanka3a to answer, with no connection timeout. You might
track tcp sockets on tuchanka3a to confirm this.

To avoid such a wait, try to add eg. connect_timeout=2 to your primary_conninfo
parameter. See:
https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-PARAMKEYWORDS

Regards,