Before you start troubleshooting cluster resource issues, you should be familiar with the cluster architecture. In my last article I shared the steps to configure an HA Cluster and create a cluster resource, so here I will share some tips to troubleshoot a cluster resource. You may often find that a resource has ended up in the Stopped state, so in this article I will walk you through the commands and logs to check while troubleshooting a cluster resource.
Hints - Troubleshoot Cluster Resource
- First read the logs to find out why the resource has failed: /var/log/messages on Red Hat, or journalctl on recent distributions.
- If a resource fails to start, the cluster will retry it a couple of times. Once a certain threshold is reached, the fail counter for that resource is set to INFINITY and the cluster stops trying. This prevents the cluster from trying to start the resource indefinitely.
- Even after you have found and fixed the problem, there is a chance the resource will not start automatically. To fix this you will need to reset the resource fail counter, for example with the commands shown below.
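If you just want a quick checklist, the commands below are a minimal sketch of the checks described above (my-resource is only a placeholder name, not a resource from this article):

[root@node1 ~]# grep -i error /var/log/messages        # traditional syslog on RHEL / CentOS 7
[root@node1 ~]# journalctl -u pacemaker                # on journald-based systems
[root@node1 ~]# crm_failcount -G -r my-resource        # show the current fail count
[root@node1 ~]# pcs resource cleanup my-resource       # reset the fail counter so the cluster retries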
Some Examples
To illustrate this, I will create a resource with some deliberately incorrect information:
[root@node1 ~]# pcs resource create apache-ip ocf:heartbeat:IPaddr2 ip=10.0.0.50 cidr_netmask=24
Next let us check the status
[root@node1 ~]# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: node2.example.com (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Sat Oct 27 12:29:40 2018
Last change: Sat Oct 27 12:29:35 2018 by root via cibadmin on node1.example.com

3 nodes configured
3 resources configured

Online: [ node1.example.com node2.example.com node3.example.com ]

Full list of resources:

 Resource Group: ipgroup
     ip1        (ocf::heartbeat:IPaddr2):       Started node2.example.com
     ip2        (ocf::heartbeat:IPaddr2):       Started node2.example.com
 apache-ip      (ocf::heartbeat:IPaddr2):       Stopped

Failed Actions:
* apache-ip_start_0 on node3.example.com 'unknown error' (1): call=55, status=complete, exitreason='Unable to find nic or netmask.',
    last-rc-change='Sat Oct 27 12:29:35 2018', queued=0ms, exec=44ms
* apache-ip_start_0 on node2.example.com 'unknown error' (1): call=67, status=complete, exitreason='Unable to find nic or netmask.',
    last-rc-change='Sat Oct 27 12:29:36 2018', queued=0ms, exec=35ms
* apache-ip_start_0 on node1.example.com 'unknown error' (1): call=72, status=complete, exitreason='Unable to find nic or netmask.',
    last-rc-change='Sat Oct 27 12:29:35 2018', queued=0ms, exec=42ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
So as you can see, the resource apache-ip is in Stopped state, and the Failed Actions section gives us a hint: "Unable to find nic or netmask".
Inside /var/log/messages we see more information about the error:
Oct 27 12:29:35 node1 IPaddr2(apache-ip)[24881]: ERROR: Unable to find nic or netmask.
Oct 27 12:29:35 node1 IPaddr2(apache-ip)[24881]: ERROR: [findif] failed
Oct 27 12:29:35 node1 lrmd[3697]: notice: apache-ip_start_0:24881:stderr [ ocf-exit-reason:Unable to find nic or netmask. ]
Oct 27 12:29:35 node1 crmd[3700]: notice: Result of start operation for apache-ip on node1.example.com: 1 (unknown error)
Oct 27 12:29:35 node1 crmd[3700]: notice: node1.example.com-apache-ip_start_0:72 [ ocf-exit-reason:Unable to find nic or netmask.\n ]
Here findif (find interface) is unable to find a matching interface. Next, lrmd, the Local Resource Manager Daemon, reports that it could not start the resource locally; it does not know exactly why the failure happened, but it gives a hint that it may be related to the nic or netmask. It then passes this on to crmd, the Cluster Resource Management Daemon, which synchronizes the information to the rest of the cluster.
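On a busy node, /var/log/messages can be noisy; if needed, you can narrow the output down to the resource agent and the Pacemaker daemons with a simple grep (the pattern below just matches the names from this example):

[root@node1 ~]# grep -E "apache-ip|lrmd|crmd" /var/log/messages
[root@node1 ~]# journalctl -u pacemaker | grep apache-ip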
Try a debug-start for the failed resource to get more information on the cause of the failure:
[root@node1 ~]# pcs resource debug-start apache-ip
Operation start for apache-ip (ocf:heartbeat:IPaddr2) returned: 'unknown error' (1)
> stderr: ocf-exit-reason:Unable to find nic or netmask.
> stderr: ERROR: [findif] failed
You can check the fail count value using the command below:
[root@node1 ~]# crm_failcount -G -r apache-ip
scope=status name=fail-count-apache-ip value=2
pcs resource failcount show may or may not work here, because at the time of writing this article there was a bug in pcs due to which it failed to show any failure count.

Now, the problem here is that my interface is on the 10.0.2.0/24 subnet, while I have assigned an address from the 10.0.0.0/24 subnet to the apache-ip resource.
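Before touching the cluster configuration, you can confirm which subnets are actually configured on the node with a plain ip command (the output will of course differ on your system):

[root@node1 ~]# ip -4 addr show        # check that the resource IP falls inside an existing subnet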
So we need to correct this, for which we will update our cluster xml file.
[root@node1 ~]# pcs cluster edit
Error: $EDITOR environment variable is not set
Since we do not have an EDITOR variable set, it prompts us with an error. Let us use vim as our EDITOR:
[root@node1 ~]# export EDITOR=vim
And re-attempt the edit
[root@node1 ~]# pcs cluster edit
Here, as you can see, the ip value is 10.0.0.50, while it needs to be 10.0.2.50:
<primitive class="ocf" id="apache-ip" provider="heartbeat" type="IPaddr2">
<instance_attributes id="apache-ip-instance_attributes">
<nvpair id="apache-ip-instance_attributes-cidr_netmask" name="cidr_netmask" value="24"/>
<nvpair id="apache-ip-instance_attributes-ip" name="ip" value="10.0.0.50"/>
</instance_attributes>
<operations>
<op id="apache-ip-monitor-interval-10s" interval="10s" name="monitor" timeout="20s"/>
<op id="apache-ip-start-interval-0s" interval="0s" name="start" timeout="20s"/>
<op id="apache-ip-stop-interval-0s" interval="0s" name="stop" timeout="20s"/>
</operations>
</primitive>
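If you prefer not to open the full cluster XML, the attributes of a single resource can also be viewed from the pcs command line (a read-only check; the exact sub-command differs between pcs versions):

[root@node1 ~]# pcs resource show apache-ip      # pcs 0.9.x on RHEL / CentOS 7
[root@node1 ~]# pcs resource config apache-ip    # newer pcs 0.10.x releases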
So let us fix the ip value for our apache-ip resource:
[root@node1 ~]# pcs resource update apache-ip ip=10.0.2.50
And re-validate the cluster xml using "pcs cluster edit"
<primitive class="ocf" id="ip1" provider="heartbeat" type="IPaddr2">
<instance_attributes id="ip1-instance_attributes">
<nvpair id="ip1-instance_attributes-cidr_netmask" name="cidr_netmask" value="24"/>
<nvpair id="ip1-instance_attributes-ip" name="ip" value="10.0.2.50"/>
</instance_attributes>
<operations>
<op id="ip1-monitor-interval-10s" interval="10s" name="monitor" timeout="20s"/>
<op id="ip1-start-interval-0s" interval="0s" name="start" timeout="20s"/>
<op id="ip1-stop-interval-0s" interval="0s" name="stop" timeout="20s"/>
</operations>
</primitive>
As you can see, the ip value is now correct for the apache-ip resource.
Once you have done the changes, check the cluster status again
[root@node1 ~]# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: node2.example.com (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Sat Oct 27 12:55:14 2018
Last change: Sat Oct 27 12:54:27 2018 by root via cibadmin on node1.example.com

3 nodes configured
3 resources configured

Online: [ node1.example.com node2.example.com node3.example.com ]

Full list of resources:

 Resource Group: ipgroup
     ip1        (ocf::heartbeat:IPaddr2):       Started node2.example.com
     ip2        (ocf::heartbeat:IPaddr2):       Started node2.example.com
 apache-ip      (ocf::heartbeat:IPaddr2):       Started node1.example.com

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
In our case the resource started automatically, but if that does not happen for you, clear the fail count for the resource:
[root@node1 ~]# pcs resource cleanup apache-ip
Cleaned up apache-ip on node3.example.com
Cleaned up apache-ip on node2.example.com
Cleaned up apache-ip on node1.example.com
After that the resource should come up automatically, provided all the configuration is correct.
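To verify the recovery, you can re-check the fail count and the resource state with the same tools used earlier (output omitted here):

[root@node1 ~]# crm_failcount -G -r apache-ip    # should report a fail count of 0 after cleanup
[root@node1 ~]# pcs status resources             # confirm the resource shows as Started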
Lastly, I hope these hints for troubleshooting a cluster resource on a Red Hat / CentOS 7 cluster were helpful. Let me know your suggestions and feedback in the comment section.