Before starting with the GFS2 file system setup on a Red Hat or CentOS cluster, you must be familiar with:
⇒ What is a Cluster, its architecture and types?
⇒ What are Cluster resources and constraints?
⇒ How to set up a Red Hat or CentOS 7 Cluster?
⇒ If you only have two nodes in your cluster, then you need to follow some additional steps to set up a two-node cluster.
⇒ If your requirement is to share an ext4 or XFS based file system, then you can also share LVM across the cluster without a GFS2 file system.
⇒ The GFS2 file system requires shared storage, so if none is available you must manually create shared storage using an iSCSI target (targetcli) on a RHEL or CentOS Linux machine.
I had written a much older article on setting up a cluster with the GFS2 file system on RHEL 6, but those steps are not valid for RHEL/CentOS 7, so refer to that article only if you are using CentOS/RHEL 6.
I am using Linux server virtual machines created with Oracle VirtualBox, running on a Windows 10 laptop, for the demonstration in this article. I had configured my shared storage using an iSCSI target (targetcli) in my previous article, so I will use the same storage target for this cluster setup. You can follow my earlier articles if you do not have a cluster setup ready with you.
In this article we will create multiple cluster resources and order their start-up sequence using constraints, as it is very important that these resources start in a pre-defined order or else they will fail to start.
So let us start with the steps to configure a GFS2 file system on a Red Hat or CentOS 7 Cluster.
Why do we need a cluster file system?
- In some cases, it makes sense to use a cluster-aware file system.
- The purpose of a cluster-aware file system is to allow multiple nodes to write to the file system simultaneously.
- The default cluster-aware file system on SUSE Linux Enterprise Server is OCFS2, and on Red Hat it is Global File System 2 (GFS2).
- The file system does this by immediately synchronizing caches between the nodes on which the file system resource is running, which means that every node always has the current state of exactly what is happening on the file system.
- Typically, you’ll need them in active/active scenarios, where multiple instances of the same resource are running on multiple nodes and are all active.
- You don’t have to create a cluster file system if you only want to run one instance of a resource at a time.
Any disadvantages of using a cluster file system?
Apart from the benefits, there are also disadvantages to using a cluster file system. The most important disadvantage is that the cache has to be synchronized between all nodes involved. This makes a cluster file system slower than a stand-alone file system in many cases, especially those that involve a lot of metadata operations. And because cluster file systems provide much stronger coupling between the nodes, it becomes harder for the cluster to prevent faults from spreading.
It is often believed that a cluster file system provides an advantage in failover times compared to a local-node file system, because it is already mounted. However, this is not true: the file system is still paused until fencing/STONITH and journal recovery for the failed node have completed, and this freezes the clustered file system on all nodes. It is actually a set of independent local file systems that provides higher availability! Clustered file systems should be used where they are required, but only after careful planning.
Prerequisites to set up the GFS2 file system
Below are the mandatory requirements on your cluster before you start working on the GFS2 file system:
- CLVM (Clustered Logical Volume manager)
- DLM (Distributed Lock Manager)
It is important that your cluster setup is configured with fencing/STONITH.
We have enabled fencing here on our cluster. You can enable it using "pcs property set stonith-enabled=true".
[root@node1 ~]# pcs property show
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: mycluster
 dc-version: 1.1.18-11.el7_5.3-2b07d5c5a9
 have-watchdog: false
 last-lrm-refresh: 1546059766
 no-quorum-policy: freeze
 stonith-enabled: true
Below you can see the cluster status; here I have three fencing devices configured
[root@node1 ~]# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: node1.example.com (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Sat Dec 29 10:33:16 2018
Last change: Sat Dec 29 10:33:01 2018 by root via cibadmin on node1.example.com
3 nodes configured
3 resources configured
Online: [ node1.example.com node2.example.com node3.example.com ]
Full list of resources:
fence-vm1 (stonith:fence_xvm): Started node2.example.com
fence-vm2 (stonith:fence_xvm): Started node1.example.com
fence-vm3 (stonith:fence_xvm): Started node3.example.com
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
[root@node1 ~]# pcs stonith show
 fence-vm1 (stonith:fence_xvm): Started node2.example.com
 fence-vm2 (stonith:fence_xvm): Started node2.example.com
 fence-vm3 (stonith:fence_xvm): Started node2.example.com
Install gfs2-utils, lvm2-cluster and dlm on all your cluster nodes if not already installed:
# yum -y install gfs2-utils lvm2-cluster dlm
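If you want to confirm that the packages are present on every node before continuing, a quick optional check (the exact versions reported will depend on your repositories):
# rpm -q gfs2-utils lvm2-cluster dlm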
Change the pcs property no-quorum-policy to freeze. This property is necessary because it means that cluster nodes will do nothing after losing quorum, which is required for GFS2:
# pcs property set no-quorum-policy=freeze
If you leave the default setting of stop, then once quorum is lost the nodes will try to stop all resources, but a mounted GFS2 file system cannot be stopped properly without quorum, and the failed stop will result in fencing of the entire cluster.
Configure DLM Resource
The Distributed Lock Manager (DLM), managed through the controld resource agent, is a mandatory part of the cluster. If it fails a monitor check after starting, the nodes on which it failed need to be fenced to keep the cluster clean. This is also necessary to make sure nothing goes wrong in combination with the no-quorum-policy, which is set to freeze.
As with the GFS2 file system itself, these resources have to be started on all nodes that require access to the file system. Pacemaker provides the clone resource for this purpose; clone resources can be used for any resource that has to be active on multiple nodes simultaneously.
[root@node1 ~]# pcs resource create dlm ocf:pacemaker:controld op monitor interval=30s on-fail=fence clone interleave=true ordered=true
Check the pcs cluster status
[root@node1 ~]# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: node1.example.com (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Sat Dec 29 10:57:58 2018
Last change: Sat Dec 29 10:57:52 2018 by root via cibadmin on node1.example.com
3 nodes configured
6 resources configured
Online: [ node1.example.com node2.example.com node3.example.com ]
Full list of resources:
Clone Set: dlm-clone [dlm]
Started: [ node1.example.com node2.example.com node3.example.com ]
fence-vm1 (stonith:fence_xvm): Started node2.example.com
fence-vm2 (stonith:fence_xvm): Started node2.example.com
fence-vm3 (stonith:fence_xvm): Started node2.example.com
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
So our dlm and dlm-clone resources have started properly on all our cluster nodes.
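Optionally, you can also query DLM directly; the dlm package provides the dlm_tool utility, and the command below prints dlm_controld's view of the cluster membership and lockspaces (the output will vary with your setup):
[root@node1 ~]# dlm_tool status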
Configure CLVMD resource
- If multiple nodes of the cluster require simultaneous read/write access to LVM volumes in an active/active system, then you must use CLVMD.
- CLVMD provides a system for coordinating activation of and changes to LVM volumes across nodes of a cluster concurrently.
- CLVMD's clustered-locking service provides protection to LVM metadata as various nodes of the cluster interact with volumes and make changes to their layout.
To enable clustered locking, set locking_type=3 in /etc/lvm/lvm.conf
[root@node1 ~]# grep locking_type /etc/lvm/lvm.conf | egrep -v '#'
locking_type = 3
This is also why HA-LVM and CLVMD are not compatible with each other: HA-LVM requires locking_type set to 1 while CLVMD requires locking_type set to 3.
You can dynamically change this by using the below command
# lvmconf --enable-cluster
Disable and stop the lvm2-lvmetad service:
# systemctl disable lvm2-lvmetad --now
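Since the lvmetad metadata caching daemon is not compatible with clustered locking, lvmconf --enable-cluster should also have turned it off in lvm.conf; as an optional check (the value shown assumes an otherwise default RHEL/CentOS 7 configuration):
[root@node1 ~]# grep use_lvmetad /etc/lvm/lvm.conf | egrep -v '#'
use_lvmetad = 0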
Next create the clvmd resource:
[root@node1 ~]# pcs resource create clvmd ocf:heartbeat:clvm op monitor interval=30s on-fail=fence clone interleave=true ordered=true
Validate the resource status:
[root@node1 ~]# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: node1.example.com (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Sat Dec 29 10:57:58 2018
Last change: Sat Dec 29 10:57:52 2018 by root via cibadmin on node1.example.com
3 nodes configured
9 resources configured
Online: [ node1.example.com node2.example.com node3.example.com ]
Full list of resources:
Clone Set: dlm-clone [dlm]
Started: [ node1.example.com node2.example.com node3.example.com ]
Clone Set: clvmd-clone [clvmd]
Started: [ node1.example.com node2.example.com node3.example.com ]
fence-vm1 (stonith:fence_xvm): Started node2.example.com
fence-vm2 (stonith:fence_xvm): Started node2.example.com
fence-vm3 (stonith:fence_xvm): Started node2.example.com
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
Change resource start-up order
We also need an ordering constraint and a colocation constraint. The ordering constraint ensures that dlm-clone is started before clvmd-clone, while the colocation constraint makes sure that the clvmd clone is always kept together with the dlm clone.
[root@node1 ~]# pcs constraint order start dlm-clone then clvmd-clone
Adding dlm-clone clvmd-clone (kind: Mandatory) (Options: first-action=start then-action=start)
[root@node1 ~]# pcs constraint colocation add clvmd-clone with dlm-clone
From my previous article I am using an iSCSI target on all of my cluster nodes, which I will use to set up my cluster file system (GFS2). After connecting to my storage node, I have /dev/sdc available on all my cluster nodes.
[root@node2 ~]# ls -l /dev/sd*
brw-rw---- 1 root disk 8, 0 Dec 29 09:47 /dev/sda
brw-rw---- 1 root disk 8, 1 Dec 29 09:47 /dev/sda1
brw-rw---- 1 root disk 8, 2 Dec 29 09:47 /dev/sda2
brw-rw---- 1 root disk 8, 16 Dec 29 09:47 /dev/sdb
brw-rw---- 1 root disk 8, 17 Dec 29 09:47 /dev/sdb1
brw-rw---- 1 root disk 8, 32 Dec 29 10:30 /dev/sdc
I will set up a logical volume on /dev/sdc on one of my cluster nodes; the same configuration will automatically get synced to all the other cluster nodes.
[root@node1 ~]# pvcreate /dev/sdc
  Physical volume "/dev/sdc" successfully created.
[root@node1 ~]# vgcreate -Ay -cy --shared vgclvm /dev/sdc
  Clustered volume group "vgclvm" successfully created
Here
- -A|--autobackup y|n : Specifies if metadata should be backed up automatically after a change.
- -c|--clustered y|n : Create a clustered VG using clvmd if LVM is compiled with cluster support. This allows multiple hosts to share a VG on shared devices. clvmd and a lock manager must be configured and running.
Display the available volume groups
[root@node1 ~]# vgs
VG #PV #LV #SN Attr VSize VFree
centos 2 2 0 wz--n- <17.52g 1020.00m
vgclvm 1 0 0 wz--nc 992.00m 992.00m
Create a new logical volume using our shared volume group:
[root@node1 ~]# lvcreate -l 100%FREE -n lvcluster vgclvm
  Logical volume "lvcluster" created.
Create a GFS2 file system on our logical volume.
[root@node1 ~]# mkfs.gfs2 -j3 -p lock_dlm -t mycluster:gfs2fs /dev/vgclvm/lvcluster
/dev/vgclvm/lvcluster is a symbolic link to /dev/dm-2
This will destroy any data on /dev/dm-2
Are you sure you want to proceed? [y/n] y
Discarding device contents (may take a while on large devices): Done
Adding journals: Done
Building resource groups: Done
Creating quota file: Done
Writing superblock and syncing: Done
Device: /dev/vgclvm/lvcluster
Block size: 4096
Device size: 0.97 GB (253952 blocks)
Filesystem size: 0.97 GB (253951 blocks)
Journals: 3
Journal size: 8MB
Resource groups: 7
Locking protocol: "lock_dlm"
Lock table: "mycluster:gfs2fs"
UUID: da1e5aa6-51a3-4512-ba79-3e325455007e
Here
- -t clustername:fsname : is used to specify the name of the locking table
- -j nn : specifies how many journals (one per node) are used
- -J : allows specification of the journal size. If not specified, a journal has a default size of 128 MB. The minimum size is 8 MB (not recommended)
In the command, clustername must be the Pacemaker cluster name; I have used mycluster, which is my cluster's name.
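Note that the number of journals limits how many nodes can mount the GFS2 file system at the same time. If you later add a node to the cluster, you can grow the journal count with gfs2_jadd; a quick sketch, to be run on any node where the file system is mounted:
# add one more journal to the GFS2 file system mounted on /clusterfs
[root@node1 ~]# gfs2_jadd -j 1 /clusterfs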
Create mount point and validate
Now our logical volume is created successfully. Next let us create a mount point for our filesystem.
Manually create this mount point on all the cluster nodes
# mkdir /clusterfs
Before we create a resource for GFS2, let us manually validate that our filesystem on lvcluster is working properly.
[root@node1 ~]# mount /dev/vgclvm/lvcluster /clusterfs/
Validate the same
[root@node2 ~]# mount | grep clusterfs
/dev/mapper/vgclvm-lvcluster on /clusterfs type gfs2 (rw,noatime)
So it looks like the logical volume got mounted successfully.
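Since the mount will be handled by a cluster resource in the next step, you may want to unmount this manual test mount again before creating the resource (shown here on the node where it was mounted):
[root@node1 ~]# umount /clusterfs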
Create GFS2FS cluster resource
Now we can create a gfs2fs resource for our GFS2 file system.
[root@node1 ~]# pcs resource create gfs2fs Filesystem device="/dev/vgclvm/lvcluster" directory="/clusterfs" fstype=gfs2 options=noatime op monitor interval=10s on-fail=fence clone interleave=true
Assumed agent name 'ocf:heartbeat:Filesystem' (deduced from 'Filesystem')
Validate the cluster status
[root@node1 ~]# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: node1.example.com (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Sat Dec 29 10:58:08 2018
Last change: Sat Dec 29 10:57:52 2018 by root via cibadmin on node1.example.com
3 nodes configured
12 resources configured
Online: [ node1.example.com node2.example.com node3.example.com ]
Full list of resources:
Clone Set: dlm-clone [dlm]
Started: [ node1.example.com node2.example.com node3.example.com ]
Clone Set: clvmd-clone [clvmd]
Started: [ node1.example.com node2.example.com node3.example.com ]
fence-vm1 (stonith:fence_xvm): Started node2.example.com
fence-vm2 (stonith:fence_xvm): Started node2.example.com
fence-vm3 (stonith:fence_xvm): Started node2.example.com
Clone Set: gfs2fs-clone [gfs2fs]
Started: [ node1.example.com node2.example.com node3.example.com ]
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
So our gfs2fs resource is started automatically on all our cluster nodes.
Now arrange the resource start-up order for GFS2 and CLVMD so that after a node reboot the services are started in the proper order, or else they will fail to start:
[root@node1 ~]# pcs constraint order start clvmd-clone then gfs2fs-clone
Adding clvmd-clone gfs2fs-clone (kind: Mandatory) (Options: first-action=start then-action=start)
[root@node1 ~]# pcs constraint colocation add gfs2fs-clone with clvmd-clone
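Optionally, you can list all the constraints to confirm that the ordering and colocation rules are in place. The output should look roughly like the below; the exact formatting depends on your pcs version:
[root@node1 ~]# pcs constraint
Location Constraints:
Ordering Constraints:
  start dlm-clone then start clvmd-clone (kind:Mandatory)
  start clvmd-clone then start gfs2fs-clone (kind:Mandatory)
Colocation Constraints:
  clvmd-clone with dlm-clone (score:INFINITY)
  gfs2fs-clone with clvmd-clone (score:INFINITY)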
Validate our Cluster with GFS2 file system
Now that our resources are running properly on our cluster nodes, let us create a file on one of the cluster nodes.
[root@node1 ~]# cd /clusterfs/
[root@node1 clusterfs]# touch file
Now connect to any other cluster node, and this file should exist there as well
[root@node2 ~]# ls /clusterfs/
file
So our Cluster with GFS2 file system configuration is working as expected.
One of our environments has been set up as a two-node RHEL 7.9 GFS2 cluster. While manual backups take place on a shared volume of 500 GB (from both nodes), the read/write traffic increases to 45 MB/sec and 250 IOPS. This is almost impacting other servers in the network. Any suggestions please?
This is one of the common problems faced with shared storage. You may use cgroups to implement a maximum limit on block I/O, based on how your backup process is executed, so that all of the available block I/O is not used up.
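As a very rough sketch only, using the cgroup v1 blkio controller that RHEL 7 uses by default; the cgroup name, the device major:minor numbers (8:32), the 10 MB/s limit and the <backup-pid> placeholder below are all assumptions you would need to adapt to your environment:
# create a cgroup for the backup job
mkdir /sys/fs/cgroup/blkio/backup
# cap writes to ~10 MB/s on the shared LUN (replace 8:32 with your device's major:minor)
echo "8:32 10485760" > /sys/fs/cgroup/blkio/backup/blkio.throttle.write_bps_device
# move the running backup process into the cgroup (replace <backup-pid>)
echo <backup-pid> > /sys/fs/cgroup/blkio/backup/cgroup.procs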
Thank you for the guidance here. I forwarded it to the GPFS cluster team; hopefully they will be able to find a way to update the configuration using the URL you provided.
OK no worries, I’ll see what I can find. This post def helps too: https://www.golinuxcloud.com/create-cluster-resource-in-ha-cluster-examples/
Was thinking I could create a CRON job to run every 10 seconds to confirm process health, then if it fails, put the cluster node in standby.
One last thing on clusters: nearly every post I have found on clustering is for CentOS; my company initially tried pcs on Ubuntu but found it unstable. Is there anything about CentOS that makes it better for clustering?
A cron job would be a nice hack, but I would also suggest asking in a larger forum such as Stack Overflow. Maybe there are more ways to achieve what you are trying to do.
I have used Pacemaker clusters in the past with Red Hat. You may already know that Red Hat creates its own versions of open source software such as Pacemaker, Kubernetes and OpenStack, with the added advantage of support. This is why we see it being used mostly on CentOS, as CentOS was a downstream project of Red Hat, so we could expect stable code in CentOS since it had already been tested in Red Hat. Now that CentOS is becoming an upstream project we can't say that any more; end users may have to wait longer to get a fix with CentOS Stream.
Thanks so much for this post and this incredible site. I have found everything I needed! I was able to set up a proper two-node cluster with fencing, STONITH and a GFS2 shared disk from scratch on CentOS 7.9 by following these 4 posts:
1) https://www.golinuxcloud.com/ste-by-step-configure-high-availability-cluster-centos-7/
2) https://www.golinuxcloud.com/how-to-install-configure-two-node-cluster-linux-centos-7/
3) https://www.golinuxcloud.com/what-is-fencing-configure-kvm-cluster-fencing/#Setup_KVM_HA_Cluster
4) https://www.golinuxcloud.com/configure-gfs2-setup-cluster-linux-rhel-centos-7/
I am creating an HA implementation of Apache ActiveMQ, evaluating and testing both Classic and Artemis. So far shared storage and failover work great. I had a couple of questions:
– Can you add a service/process as a Cluster Resource so if the process fails it will stonith the bad node and move all resources including IP to the healthy node?
– Do you have a donate section bc I am def using an ad blocker 😀
Thank you for your kind words and I am glad these articles helped.
Which resource are you asking about? The answer would depend on that.
I have created a buymeacoffee page, so you can make any donations here 🙂
I think I saw a post about adding Apache as a cluster resource, and was wondering if this could apply to other processes. If possible, I would like to have the AMQ process be a cluster resource. On the off chance that the process fails on node1 but everything else about node1 is healthy, the vIP would stay with node1 but node2 would be promoted from passive to active.
In Windows clustering, if the monitored process fails, the whole node is essentially stonith’d and the 2nd node becomes active with Cluster IP & with the process. Just wondering if the same is possible with pcs.
I am not sure if that is possible, because if a cluster resource fails for some reason, that wouldn't necessarily mean the respective node is also faulty, so the node would continue to be active.
But I am afraid it has been quite some time since I worked on clusters, so I can't recall whether any such configuration is possible with Pacemaker. You will have to test and try.
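For reference only, a systemd-based resource is the usual way to put such a process under cluster control. The sketch below assumes an activemq systemd unit exists on every node, and the resource name and monitor interval are placeholders; whether a monitor failure should drive a full node failover is something you would still need to test, for example with the on-fail option of the monitor operation (Pacemaker documents on-fail=standby as moving all resources off the node where the monitor failed):
# rough sketch, not tested with ActiveMQ; the 'activemq' unit name is an assumption
[root@node1 ~]# pcs resource create amq systemd:activemq op monitor interval=30s on-fail=standby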
🙁
As per Indian regulations, only registered Indian businesses (i.e. sole proprietorships, limited liability partnerships and companies, but not individuals) can accept international payments. More info here: https://stripe.com/docs/india-exports
If GFS2 fails or gets corrupted, is there any way to fail over to another GFS2?
Do you mean the filesystem fails or the cluster node? Due to high availability, a cluster node failure will cause GFS2 to be mounted on the next available cluster node.
Hi,
After creating the GFS2 cluster with pcs, do I need to add an entry in /etc/fstab for a permanent mount?
That should not be required, because here the filesystem mount is handled by the cluster resource rather than by the system through /etc/fstab.
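If you want to see where the mount definition lives, it is kept in the cluster configuration itself rather than in /etc/fstab; for example, using the resource name created earlier in this article:
[root@node1 ~]# pcs resource show gfs2fs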
I am getting the error "dlm: close connection to node 1 & 2" and then the GFS2 shared partition hangs on both nodes.
Please help me.
The error is "task gfs2_quotad:3349 blocked for more than 120 seconds".
The shared partition is hanging, please help me.
You may have to check the syslog to understand the cause of this behaviour. It may be due to fencing, but I can't be sure; /var/log/messages can give you more insight.
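For example, something along these lines on both nodes may help narrow it down (the search pattern is only a suggestion):
[root@node1 ~]# grep -iE 'fence|dlm|gfs2' /var/log/messages | tail -50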
I am using HP iLO 4 as the fencing device and it is running on both nodes.
I am using HP iLO 4 as the fencing device
I am afraid I am not very familiar with iLO 4 fencing.
It shows "task gfs2_quotad:3349 blocked for more than 120 seconds" and the shared partition hangs on both nodes.
Thanks. Well explained. It helped me clarify some confusion.
Amazingly clear, educational and really helpful documents!!!
Great job!
Regards, John