Fwd: Postgres HA - pacemaker RA do not support auto failback

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Fwd: Postgres HA - pacemaker RA do not support auto failback

Shital A


Hello,

Postgres version : 9.6
OS:Rhel 7.6

We are working on HA setup for postgres cluster of two nodes in active-passive mode. 

Installed: 
Pacemaker 1.1.19
Corosync 2.4.3

The pacemaker agent with this installation doesn't support automatic failback. What I mean by that is explained below:
1. Cluster is setup like A - B with A as master. 
2. Kill services on A, node B will come up as master.
3. node A is ready to join the cluster, we have to delete the lock file it creates on any one of the node and execute the cleanup command to get the node back as standby

Step 3 is manual so HA is not achieved in real sense. 

Please help to check:
1. Is there any version of the resouce agent which supports automatic failback? To avoid generation of lock file and deleting it.

2. If there is no such support, what checks should be added in pgsql RA to achieve 

Please suggest.
Thanks.

Reply | Threaded
Open this post in threaded view
|

Re: Fwd: Postgres HA - pacemaker RA do not support auto failback

Adrien Nayrat-2
On 8/12/19 2:57 PM, Shital A wrote:

>
>
> Hello,
>
> Postgres version : 9.6
> OS:Rhel 7.6
>
> We are working on HA setup for postgres cluster of two nodes in active-passive
> mode. 
>
> Installed: 
> Pacemaker 1.1.19
> Corosync 2.4.3
>
> The pacemaker agent with this installation doesn't support automatic failback.
> What I mean by that is explained below:
> 1. Cluster is setup like A - B with A as master. 
> 2. Kill services on A, node B will come up as master.
> 3. node A is ready to join the cluster, we have to delete the lock file it
> creates on any one of the node and execute the cleanup command to get the node
> back as standby
>
> Step 3 is manual so HA is not achieved in real sense. 
>
> Please help to check:
> 1. Is there any version of the resouce agent which supports automatic failback?
> To avoid generation of lock file and deleting it.
>
> 2. If there is no such support, what checks should be added in pgsql RA to achieve 
>
> Please suggest.
> Thanks.
>
Hello,

Which RA did you use? AFAIK there is two RA :
- pgsql : https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/pgsql
- PAF : http://clusterlabs.github.io/PAF/ (Which is different from pgsql RA :
http://clusterlabs.github.io/PAF/FAQ.html#why-new-ra-for-postgresql)

If I am right, PAF should handle the case when there no need to perform pgrewind
(if the old master is not too advanced in the transaction log).

Regards,

--
Adrien NAYRAT
https://blog.anayrat.info


signature.asc (499 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Fwd: Postgres HA - pacemaker RA do not support auto failback

Shital A


On Mon, 12 Aug 2019, 18:54 Adrien Nayrat, <[hidden email]> wrote:
On 8/12/19 2:57 PM, Shital A wrote:
>
>
> Hello,
>
> Postgres version : 9.6
> OS:Rhel 7.6
>
> We are working on HA setup for postgres cluster of two nodes in active-passive
> mode. 
>
> Installed: 
> Pacemaker 1.1.19
> Corosync 2.4.3
>
> The pacemaker agent with this installation doesn't support automatic failback.
> What I mean by that is explained below:
> 1. Cluster is setup like A - B with A as master. 
> 2. Kill services on A, node B will come up as master.
> 3. node A is ready to join the cluster, we have to delete the lock file it
> creates on any one of the node and execute the cleanup command to get the node
> back as standby
>
> Step 3 is manual so HA is not achieved in real sense. 
>
> Please help to check:
> 1. Is there any version of the resouce agent which supports automatic failback?
> To avoid generation of lock file and deleting it.
>
> 2. If there is no such support, what checks should be added in pgsql RA to achieve 
>
> Please suggest.
> Thanks.
>

Hello,

Which RA did you use? AFAIK there is two RA :
- pgsql : https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/pgsql
- PAF : http://clusterlabs.github.io/PAF/ (Which is different from pgsql RA :
http://clusterlabs.github.io/PAF/FAQ.html#why-new-ra-for-postgresql)

If I am right, PAF should handle the case when there no need to perform pgrewind
(if the old master is not too advanced in the transaction log).

Regards,

--
Adrien NAYRAT
https://blog.anayrat.info



Hello

We are using the pgsql RA installed with pacemaker. 

Is PAF recommended over pgsql?

Has anyone changed pgsql to handle the lock file and other cases that might not have been handled? 

Please advise.

Thanks.
Reply | Threaded
Open this post in threaded view
|

Re: Fwd: Postgres HA - pacemaker RA do not support auto failback

Shital A


On Tue, 13 Aug 2019, 11:50 Shital A, <[hidden email]> wrote:


On Mon, 12 Aug 2019, 18:54 Adrien Nayrat, <[hidden email]> wrote:
On 8/12/19 2:57 PM, Shital A wrote:
>
>
> Hello,
>
> Postgres version : 9.6
> OS:Rhel 7.6
>
> We are working on HA setup for postgres cluster of two nodes in active-passive
> mode. 
>
> Installed: 
> Pacemaker 1.1.19
> Corosync 2.4.3
>
> The pacemaker agent with this installation doesn't support automatic failback.
> What I mean by that is explained below:
> 1. Cluster is setup like A - B with A as master. 
> 2. Kill services on A, node B will come up as master.
> 3. node A is ready to join the cluster, we have to delete the lock file it
> creates on any one of the node and execute the cleanup command to get the node
> back as standby
>
> Step 3 is manual so HA is not achieved in real sense. 
>
> Please help to check:
> 1. Is there any version of the resouce agent which supports automatic failback?
> To avoid generation of lock file and deleting it.
>
> 2. If there is no such support, what checks should be added in pgsql RA to achieve 
>
> Please suggest.
> Thanks.
>

Hello,

Which RA did you use? AFAIK there is two RA :
- pgsql : https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/pgsql
- PAF : http://clusterlabs.github.io/PAF/ (Which is different from pgsql RA :
http://clusterlabs.github.io/PAF/FAQ.html#why-new-ra-for-postgresql)

If I am right, PAF should handle the case when there no need to perform pgrewind
(if the old master is not too advanced in the transaction log).

Regards,

--
Adrien NAYRAT
https://blog.anayrat.info



Hello

We are using the pgsql RA installed with pacemaker. 

Is PAF recommended over pgsql?

Has anyone changed pgsql to handle the lock file and other cases that might not have been handled? 

Please advise.

Thanks.


Looks like the previous email bounced, received undelivered message. Sending again. We are using the pgsql RA installed with pacemaker and seems like it prevents failback by creating a lock file. We want to implement auto failback and need pointers on how to implement. 

Thanks.


Reply | Threaded
Open this post in threaded view
|

Re: Postgres HA - pacemaker RA do not support auto failback

Jehan-Guillaume (ioguix) de Rorthais
On Tue, 13 Aug 2019 16:38:44 +0530
Shital A <[hidden email]> wrote:

> On Tue, 13 Aug 2019, 11:50 Shital A, <[hidden email]> wrote:
>
> > On Mon, 12 Aug 2019, 18:54 Adrien Nayrat, <[hidden email]>
> > wrote:
> >  
> >> On 8/12/19 2:57 PM, Shital A wrote:  
> >> > Postgres version : 9.6
> >> > OS:Rhel 7.6
> >> >
> >> > We are working on HA setup for postgres cluster of two nodes in  
> >> > active-passive mode.
> >> >
> >> > Installed:
> >> > Pacemaker 1.1.19
> >> > Corosync 2.4.3
> >> >
> >> > The pacemaker agent with this installation doesn't support automatic  
> >> > failback.  
> >> > What I mean by that is explained below:
> >> > 1. Cluster is setup like A - B with A as master.
> >> > 2. Kill services on A, node B will come up as master.
> >> > 3. node A is ready to join the cluster, we have to delete the lock file  
> >> > it creates on any one of the node and execute the cleanup command to
> >> > get the node back as standby
> >> >
> >> > Step 3 is manual so HA is not achieved in real sense.
> >> >
> >> > Please help to check:
> >> > 1. Is there any version of the resouce agent which supports automatic  
> >> > failback?  
> >> > To avoid generation of lock file and deleting it.
> >> >
> >> > 2. If there is no such support, what checks should be added in pgsql RA  
> >> > to achieve
> >>
> >> Which RA did you use? AFAIK there is two RA :
> >> - pgsql :
> >> https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/pgsql
> >> - PAF : http://clusterlabs.github.io/PAF/ (Which is different from pgsql
> >> RA :
> >> http://clusterlabs.github.io/PAF/FAQ.html#why-new-ra-for-postgresql)
> >>
> >> If I am right, PAF should handle the case when there no need to perform
> >> pgrewind
> >> (if the old master is not too advanced in the transaction log).

PAF only make sure controlled failover (aka switchover) performs safely so the
old master hook back as a standby correctly.

Should a failure occurs with an automatic failover, PAF has no dark magic to
automagicaly failoback the failing master as a standby.

> > We are using the pgsql RA installed with pacemaker.
> >
> > Is PAF recommended over pgsql?

Yes, as much as the historical RA.

> > Has anyone changed pgsql to handle the lock file and other cases that
> > might not have been handled?

I don't know. However, I do think it would be either presumptuous to climb this
road today.

> Looks like the previous email bounced, received undelivered message.
> Sending again. We are using the pgsql RA installed with pacemaker and seems
> like it prevents failback by creating a lock file. We want to implement
> auto failback and need pointers on how to implement.

The lock file is here because PgSQL has no safety belt against data corruption
when an old master hook on a new one while its LSN was further away from
the timeline fork. You must take care of this by yourself.

Moreover, should a failover occurs, you should probably check WHY it occurred
and fix things before performing a blindly failback of a potential wounded
instance. A wounded instance can hurt your cluster in many different and
inconfortable ways.

Last, as Adrien wrote, you can manually use pg_rewind to revert the old master
to the timeline fork, as far as you enabled checksums or 'wal_log_hints=on' and
kept enough WAL around.

Another option is just to make sure you are able to build a new standby as was
fast as possible.

But remember: if your data is important, use quorum, fencing and watchdog.

Good luck.