How to clean up Failed Actions from pcs status of a cluster

In this article I will share the commands to clean up failed actions from the pcs status output of a High Availability Pacemaker cluster.

When a resource fails to start in the cluster, the failure is logged under failed actions in pcs status. These entries continue to appear in the pcs status output even after the resource has started successfully.

In that case we can clean the failed actions from pcs status.

 

Issue: Cleanup failed action messages from pcs status

Below is a sample output from pcs status on my KVM High Availability cluster. It shows two types of "Failed Actions":

  • Failed Resource Actions
  • Failed Fencing Actions

To check the cluster status:

[root@centos8-2 ~]# pcs status
Cluster name: ha-cluster
Stack: corosync
Current DC: centos8-3 (version 2.0.2-3.el8_1.2-744a30d655) - partition with quorum
Last updated: Sat May  2 14:38:27 2020
Last change: Sat May  2 14:38:23 2020 by root via cibadmin on centos8-2

3 nodes configured
4 resources configured

Online: [ centos8-2 centos8-3 centos8-4 ]

Full list of resources:

 fence-centos8-3        (stonith:fence_xvm):    Started centos8-3
 fence-centos8-2        (stonith:fence_xvm):    Started centos8-2
 ClusterIP      (ocf::heartbeat:IPaddr2):       Started centos8-4
 fence-centos8-4        (stonith:fence_xvm):    Started centos8-3

Failed Resource Actions:
* fence-centos8-2_start_0 on centos8-4 'OCF_TIMEOUT' (198): call=122, status=Timed Out, exitreason='',
    last-rc-change='Sat May  2 14:36:16 2020', queued=1ms, exec=20012ms
* fence-centos8-4_start_0 on centos8-4 'OCF_TIMEOUT' (198): call=124, status=Timed Out, exitreason='',
    last-rc-change='Sat May  2 14:36:36 2020', queued=0ms, exec=20011ms

Failed Fencing Actions:
* reboot of centos8-2 failed: delegate=, client=pacemaker-controld.1548, origin=centos8-3,
    last-failed='Sat May  2 14:37:17 2020'
* reboot of centos8-4 failed: delegate=, client=pacemaker-controld.1548, origin=centos8-3,
    last-failed='Fri May  1 20:57:41 2020'

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

My cluster resource and fencing resources have now started successfully, so I don't need to keep these failed action messages.

The commands to clean up failed actions for resources and for fencing are different.

 

Solution: Cleanup Failed Actions for Resource

To clean up the messages listed under "Failed Resource Actions", use pcs resource cleanup <resource>. You can get the resource name from the Failed Resource Actions output.

Below is the relevant output from my pcs status:

* fence-centos8-2_start_0 on centos8-4 'OCF_TIMEOUT' (198): call=122, status=Timed Out, exitreason='',
    last-rc-change='Sat May  2 14:36:16 2020', queued=1ms, exec=20012ms
* fence-centos8-4_start_0 on centos8-4 'OCF_TIMEOUT' (198): call=124, status=Timed Out, exitreason='',
    last-rc-change='Sat May  2 14:36:36 2020', queued=0ms, exec=20011ms

Here the resource names are fence-centos8-2 and fence-centos8-4 (the part before _start_0 in each entry), which you can also check using "pcs resource status".
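When the list is long, the names can be pulled out of the status text instead of being read by eye. The following is only a minimal sketch: the function name failed_resource_names is made up for this example, and the pattern assumes the entry format shown in this article (name_operation_interval on node). It would need adjusting for other Pacemaker versions or for operations such as migrate_to whose names contain an underscore.

```shell
# Print the unique resource names found in "Failed Resource Actions" entries.
# Reads pcs status text on stdin. Assumes entries formatted like:
#   * fence-centos8-2_start_0 on centos8-4 'OCF_TIMEOUT' (198): ...
# failed_resource_names is a hypothetical helper, not a pcs command.
failed_resource_names() {
    sed -n 's/^\* \(.*\)_[a-z]*_[0-9]* on .*/\1/p' | sort -u
}

# Demo on the two entries from this article.
failed_resource_names <<'EOF'
* fence-centos8-2_start_0 on centos8-4 'OCF_TIMEOUT' (198): call=122, status=Timed Out, exitreason='',
    last-rc-change='Sat May  2 14:36:16 2020', queued=1ms, exec=20012ms
* fence-centos8-4_start_0 on centos8-4 'OCF_TIMEOUT' (198): call=124, status=Timed Out, exitreason='',
    last-rc-change='Sat May  2 14:36:36 2020', queued=0ms, exec=20011ms
EOF
```

On a live cluster the same function could be fed with: pcs status | failed_resource_names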

So to clean up the failed action messages for the fence-centos8-2 resource, use:

[root@centos8-2 ~]# pcs resource cleanup fence-centos8-2
Cleaned up fence-centos8-2 on centos8-4
Cleaned up fence-centos8-2 on centos8-3
Cleaned up fence-centos8-2 on centos8-2
Waiting for 1 reply from the controller. OK

Similarly, to clean up the failed action messages for the fence-centos8-4 resource:

[root@centos8-2 ~]# pcs resource cleanup fence-centos8-4
Cleaned up fence-centos8-4 on centos8-4
Cleaned up fence-centos8-4 on centos8-3
Cleaned up fence-centos8-4 on centos8-2
Waiting for 1 reply from the controller. OK
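If several resources need the same treatment, the command can be wrapped in a small loop. This is a sketch under assumptions: cleanup_resources and the DRY_RUN variable are names invented for this example. With DRY_RUN=1 the commands are only printed, so they can be reviewed before anything is run against the cluster.

```shell
# Run "pcs resource cleanup" once per resource name passed as an argument.
# cleanup_resources and DRY_RUN are hypothetical names used only in this sketch.
# With DRY_RUN=1 the commands are printed instead of executed.
cleanup_resources() {
    for res in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "pcs resource cleanup $res"
        else
            pcs resource cleanup "$res"
        fi
    done
}

# Preview the commands for the two resources from this article.
DRY_RUN=1
cleanup_resources fence-centos8-2 fence-centos8-4
```

Setting DRY_RUN=0 (or leaving it unset) would run the real commands, so it is worth checking the printed list first.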

After performing the cleanup, check the cluster status:

[root@centos8-2 ~]# pcs status
Cluster name: ha-cluster
Stack: corosync
Current DC: centos8-3 (version 2.0.2-3.el8_1.2-744a30d655) - partition with quorum
Last updated: Sat May  2 14:39:19 2020
Last change: Sat May  2 14:39:17 2020 by hacluster via crmd on centos8-4

3 nodes configured
4 resources configured

Online: [ centos8-2 centos8-3 centos8-4 ]

Full list of resources:

 fence-centos8-3        (stonith:fence_xvm):    Started centos8-3
 fence-centos8-2        (stonith:fence_xvm):    Started centos8-2
 ClusterIP      (ocf::heartbeat:IPaddr2):       Started centos8-4
 fence-centos8-4        (stonith:fence_xvm):    Started centos8-3

Failed Fencing Actions:
* reboot of centos8-2 failed: delegate=, client=pacemaker-controld.1548, origin=centos8-3,
    last-failed='Sat May  2 14:37:17 2020'
* reboot of centos8-4 failed: delegate=, client=pacemaker-controld.1548, origin=centos8-3,
    last-failed='Fri May  1 20:57:41 2020'

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Now we don't have any Failed Resource Actions. Next we will clean up the failed action messages for fencing.

 

Solution: Cleanup Failed Actions for Fencing

The pcs status still shows failed action messages for fencing. To clean them up we will use "pcs stonith history cleanup <node>", where <node> is the name of the node that was the fencing target.

Before we perform the cleanup, we can check the complete history of Failed Fencing Actions using "pcs stonith history show <node>"

[root@centos8-2 ~]# pcs stonith history show centos8-2
We failed reboot node centos8-2 on behalf of pacemaker-controld.1548 from centos8-3 at Sat May  2 14:36:57 2020
We failed reboot node centos8-2 on behalf of pacemaker-controld.1548 from centos8-3 at Sat May  2 14:36:37 2020
We failed reboot node centos8-2 on behalf of pacemaker-controld.1548 from centos8-3 at Sat May  2 14:36:17 2020
We failed reboot node centos8-2 on behalf of pacemaker-controld.1548 from centos8-3 at Sat May  2 14:37:16 2020
We failed reboot node centos8-2 on behalf of pacemaker-controld.1548 from centos8-3 at Sat May  2 14:37:17 2020
0 events found

We can get the node name from this message output of pcs status ("reboot of <node> failed"):

* reboot of centos8-2 failed: delegate=, client=pacemaker-controld.1548, origin=centos8-3,
    last-failed='Sat May  2 14:37:17 2020'
* reboot of centos8-4 failed: delegate=, client=pacemaker-controld.1548, origin=centos8-3,
    last-failed='Fri May  1 20:57:41 2020'

To clean up the failed fencing action messages for each node, use:

[root@centos8-2 ~]# pcs stonith history cleanup centos8-2
cleaning up fencing-history for node centos8-2
0 events found

[root@centos8-2 ~]# pcs stonith history cleanup centos8-4
cleaning up fencing-history for node centos8-4
0 events found
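With more nodes, the cleanup commands can be generated from the "Failed Fencing Actions" entries rather than typed one by one. The sketch below assumes the entry format shown in this article ("* reboot of <node> failed: ..."). The function name fencing_cleanup_commands is made up for this example, and it only prints the commands so they can be reviewed first.

```shell
# Print one "pcs stonith history cleanup <node>" command per node that appears
# in a "Failed Fencing Actions" entry. Reads pcs status text on stdin.
# fencing_cleanup_commands is a hypothetical helper, not a pcs command.
fencing_cleanup_commands() {
    sed -n 's/^\* [a-z]* of \([^ ]*\) failed:.*/\1/p' | sort -u |
    while read -r node; do
        echo "pcs stonith history cleanup $node"
    done
}

# Demo on the two entries from this article.
fencing_cleanup_commands <<'EOF'
* reboot of centos8-2 failed: delegate=, client=pacemaker-controld.1548, origin=centos8-3,
    last-failed='Sat May  2 14:37:17 2020'
* reboot of centos8-4 failed: delegate=, client=pacemaker-controld.1548, origin=centos8-3,
    last-failed='Fri May  1 20:57:41 2020'
EOF
```

On a live cluster the input could come from pcs status, and the printed lines could be run by hand once they look right.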

Now check the Pacemaker cluster status using pcs status:

[root@centos8-2 ~]# pcs status
Cluster name: ha-cluster
Stack: corosync
Current DC: centos8-3 (version 2.0.2-3.el8_1.2-744a30d655) - partition with quorum
Last updated: Sat May  2 14:41:05 2020
Last change: Sat May  2 14:39:17 2020 by hacluster via crmd on centos8-4

3 nodes configured
4 resources configured

Online: [ centos8-2 centos8-3 centos8-4 ]

Full list of resources:

 fence-centos8-3        (stonith:fence_xvm):    Started centos8-3
 fence-centos8-2        (stonith:fence_xvm):    Started centos8-2
 ClusterIP      (ocf::heartbeat:IPaddr2):       Started centos8-4
 fence-centos8-4        (stonith:fence_xvm):    Started centos8-3

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

So we don't have any more failed action messages.

NOTE:

This only cleans up previously recorded failures. If pcs status continues to show new ones, failures are still occurring and you must first debug the actual root cause.
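One way to notice recurring failures is to re-check the status some time after the cleanup. The following is a minimal sketch: has_failed_actions is a name invented for this example, and it simply looks for the two section headings shown earlier in this article.

```shell
# Succeed (exit status 0) if the pcs status text on stdin still contains a
# "Failed Resource Actions:" or "Failed Fencing Actions:" section.
# has_failed_actions is a hypothetical helper, not a pcs command.
has_failed_actions() {
    grep -Eq '^Failed (Resource|Fencing) Actions:'
}

# Demo on a one-line sample instead of a live cluster.
if printf 'Failed Fencing Actions:\n' | has_failed_actions; then
    echo "failures are still being reported"
fi
```

On a cluster node the check could be written as: pcs status | has_failed_actions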

 

Lastly, I hope the steps in this article to clean up failed action messages in a Pacemaker cluster on Linux were helpful. Let me know your suggestions and feedback using the comment section.

 

