Before you start troubleshooting cluster resource issues, you should be familiar with the cluster architecture. In my last article I shared the steps to configure an HA Cluster and create a cluster resource, so here I will share some tips to troubleshoot a cluster resource. You may often find that a resource has ended up in the Stopped state, so in this article I will walk you through the commands and logs to check while troubleshooting a cluster resource.
Hints - Troubleshoot Cluster Resource
- First read the logs to find out why the resource has failed: /var/log/messages on Red Hat, or journalctl on recent distributions.
- If a resource fails to start, the cluster will retry it a couple of times. Once a certain threshold is reached, the fail counter for that resource is set to INFINITY and the cluster stops trying. This prevents the cluster from trying to start the resource indefinitely.
- Even after you have found and fixed the problem, there is a chance the resource will not start automatically. To fix this you will need to reset the resource fail counter, for example with the commands shown below.
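If you just want a quick checklist, the commands below are a minimal sketch of the checks described above (my-resource is only a placeholder name, not a resource from this article):

[root@node1 ~]# grep -i error /var/log/messages        # traditional syslog on RHEL / CentOS 7
[root@node1 ~]# journalctl -u pacemaker                # on journald-based systems
[root@node1 ~]# crm_failcount -G -r my-resource        # show the current fail count
[root@node1 ~]# pcs resource cleanup my-resource       # reset the fail counter so the cluster retries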
Some Examples
To illustrate this, I will create a resource with some deliberately incorrect information:
[root@node1 ~]# pcs resource create apache-ip ocf:heartbeat:IPaddr2 ip=10.0.0.50 cidr_netmask=24
Next let us check the status
[root@node1 ~]# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: node2.example.com (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Sat Oct 27 12:29:40 2018
Last change: Sat Oct 27 12:29:35 2018 by root via cibadmin on node1.example.com

3 nodes configured
3 resources configured

Online: [ node1.example.com node2.example.com node3.example.com ]

Full list of resources:

 Resource Group: ipgroup
     ip1        (ocf::heartbeat:IPaddr2):       Started node2.example.com
     ip2        (ocf::heartbeat:IPaddr2):       Started node2.example.com
 apache-ip      (ocf::heartbeat:IPaddr2):       Stopped

Failed Actions:
* apache-ip_start_0 on node3.example.com 'unknown error' (1): call=55, status=complete, exitreason='Unable to find nic or netmask.',
    last-rc-change='Sat Oct 27 12:29:35 2018', queued=0ms, exec=44ms
* apache-ip_start_0 on node2.example.com 'unknown error' (1): call=67, status=complete, exitreason='Unable to find nic or netmask.',
    last-rc-change='Sat Oct 27 12:29:36 2018', queued=0ms, exec=35ms
* apache-ip_start_0 on node1.example.com 'unknown error' (1): call=72, status=complete, exitreason='Unable to find nic or netmask.',
    last-rc-change='Sat Oct 27 12:29:35 2018', queued=0ms, exec=42ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
So as you can see, the resource apache-ip is in Stopped state, and the Failed Actions section gives us a hint: "Unable to find nic or netmask".
Inside /var/log/messages we see more information about the error:
Oct 27 12:29:35 node1 IPaddr2(apache-ip)[24881]: ERROR: Unable to find nic or netmask.
Oct 27 12:29:35 node1 IPaddr2(apache-ip)[24881]: ERROR: [findif] failed
Oct 27 12:29:35 node1 lrmd[3697]: notice: apache-ip_start_0:24881:stderr [ ocf-exit-reason:Unable to find nic or netmask. ]
Oct 27 12:29:35 node1 crmd[3700]: notice: Result of start operation for apache-ip on node1.example.com: 1 (unknown error)
Oct 27 12:29:35 node1 crmd[3700]: notice: node1.example.com-apache-ip_start_0:72 [ ocf-exit-reason:Unable to find nic or netmask.\n ]
Here findif (find interface) is unable to find a matching interface. Next, lrmd, the Local Resource Manager Daemon, reports that it could not start the resource locally; it does not know exactly why the failure happened, but it gives a hint that it may be related to the nic or netmask. It then passes this on to crmd, the Cluster Resource Management Daemon, which synchronizes the information to the rest of the cluster.
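On a busy node, /var/log/messages can be noisy; if needed, you can narrow the output down to the resource agent and the Pacemaker daemons with a simple grep (the pattern below just matches the names from this example):

[root@node1 ~]# grep -E "apache-ip|lrmd|crmd" /var/log/messages
[root@node1 ~]# journalctl -u pacemaker | grep apache-ip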
Try a debug-start for the failed resource to get more information on the cause of the failure:
[root@node1 ~]# pcs resource debug-start apache-ip
Operation start for apache-ip (ocf:heartbeat:IPaddr2) returned: 'unknown error' (1)
> stderr: ocf-exit-reason:Unable to find nic or netmask.
> stderr: ERROR: [findif] failed
You can check the fail count value using the command below:
[root@node1 ~]# crm_failcount -G -r apache-ip
scope=status name=fail-count-apache-ip value=2
pcs resource failcount show may or may not work here, because at the time of writing this article there was a bug in pcs due to which it failed to show any failure count.

Now, the problem here is that my interface is on the 10.0.2.0/24 subnet, while I have assigned an address from the 10.0.0.0/24 subnet to the apache-ip resource.
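Before touching the cluster configuration, you can confirm which subnets are actually configured on the node with a plain ip command (the output will of course differ on your system):

[root@node1 ~]# ip -4 addr show        # check that the resource IP falls inside an existing subnet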
So we need to correct this, for which we will update our cluster xml file.
[root@node1 ~]# pcs cluster edit
Error: $EDITOR environment variable is not set
Since we do not have an EDITOR variable set, it prompts us with an error. Let us use vim as our EDITOR:
[root@node1 ~]# export EDITOR=vim
And re-attempt the edit
[root@node1 ~]# pcs cluster edit
Here, as you can see, the ip value is 10.0.0.50, while it needs to be 10.0.2.50:
<primitive class="ocf" id="apache-ip" provider="heartbeat" type="IPaddr2">
<instance_attributes id="apache-ip-instance_attributes">
<nvpair id="apache-ip-instance_attributes-cidr_netmask" name="cidr_netmask" value="24"/>
<nvpair id="apache-ip-instance_attributes-ip" name="ip" value="10.0.0.50"/>
</instance_attributes>
<operations>
<op id="apache-ip-monitor-interval-10s" interval="10s" name="monitor" timeout="20s"/>
<op id="apache-ip-start-interval-0s" interval="0s" name="start" timeout="20s"/>
<op id="apache-ip-stop-interval-0s" interval="0s" name="stop" timeout="20s"/>
</operations>
</primitive>
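If you prefer not to open the full cluster XML, the attributes of a single resource can also be viewed from the pcs command line (a read-only check; the exact sub-command differs between pcs versions):

[root@node1 ~]# pcs resource show apache-ip      # pcs 0.9.x on RHEL / CentOS 7
[root@node1 ~]# pcs resource config apache-ip    # newer pcs 0.10.x releases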
So let us fix the ip value for our apache-ip resource:
[root@node1 ~]# pcs resource update apache-ip ip=10.0.2.50
And re-validate the cluster xml using "pcs cluster edit"
<primitive class="ocf" id="ip1" provider="heartbeat" type="IPaddr2">
<instance_attributes id="ip1-instance_attributes">
<nvpair id="ip1-instance_attributes-cidr_netmask" name="cidr_netmask" value="24"/>
<nvpair id="ip1-instance_attributes-ip" name="ip" value="10.0.2.50"/>
</instance_attributes>
<operations>
<op id="ip1-monitor-interval-10s" interval="10s" name="monitor" timeout="20s"/>
<op id="ip1-start-interval-0s" interval="0s" name="start" timeout="20s"/>
<op id="ip1-stop-interval-0s" interval="0s" name="stop" timeout="20s"/>
</operations>
</primitive>
As you can see, the ip value is now correct for the apache-ip resource.
Once you have done the changes, check the cluster status again
[root@node1 ~]# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: node2.example.com (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Sat Oct 27 12:55:14 2018
Last change: Sat Oct 27 12:54:27 2018 by root via cibadmin on node1.example.com

3 nodes configured
3 resources configured

Online: [ node1.example.com node2.example.com node3.example.com ]

Full list of resources:

 Resource Group: ipgroup
     ip1        (ocf::heartbeat:IPaddr2):       Started node2.example.com
     ip2        (ocf::heartbeat:IPaddr2):       Started node2.example.com
 apache-ip      (ocf::heartbeat:IPaddr2):       Started node1.example.com

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
In our case the resource started automatically, but if that does not happen for you, clear the fail count for the resource:
[root@node1 ~]# pcs resource cleanup apache-ip
Cleaned up apache-ip on node3.example.com
Cleaned up apache-ip on node2.example.com
Cleaned up apache-ip on node1.example.com
After that the resource should come up automatically, provided all the configuration is correct.
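To verify the recovery, you can re-check the fail count and the resource state with the same tools used earlier (output omitted here):

[root@node1 ~]# crm_failcount -G -r apache-ip    # should report a fail count of 0 after cleanup
[root@node1 ~]# pcs status resources             # confirm the resource shows as Started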
Lastly, I hope these hints for troubleshooting a cluster resource on a Red Hat / CentOS 7 cluster were helpful. Let me know your suggestions and feedback in the comment section.