Before starting with the GFS2 file system setup on a Red Hat or CentOS cluster, you must be familiar with
⇒ What is a cluster, its architecture and types?
⇒ What are cluster resources and constraints?
⇒ How to set up a Red Hat or CentOS 7 cluster?
⇒ If you only have two nodes in your cluster, you need to follow some additional steps to set up a two-node cluster.
⇒ If your requirement is to share an ext4 or XFS based file system, you can also share LVM across the cluster without a GFS2 file system.
⇒ GFS2 requires shared storage, so if none is available you must manually create shared storage using an iSCSI target (targetcli) on a RHEL or CentOS Linux machine.
I had written a very old article on setting up a cluster with the GFS2 file system on RHEL 6, but those steps are not valid for RHEL/CentOS 7, so if you are using CentOS/RHEL 6 you can refer to that article.
I am using Oracle VirtualBox, running on a Windows 10 laptop, for the demonstration in this article. I had configured my shared storage using an iSCSI target (targetcli) in my previous article, so I will use the same storage target for this cluster setup. You can follow my older articles if you do not have a cluster setup ready with you.
In this article we will create multiple cluster resources and order their start-up sequence using constraints, as it is very important that these resources start in a pre-defined order or else they will fail to start.
So let us start with the steps to configure a GFS2 file system on a Red Hat or CentOS 7 cluster
Why do we need a cluster file system?
- In some cases, it makes sense to use a cluster-aware file system.
- The purpose of a cluster-aware file system is to allow multiple nodes to write to the file system simultaneously.
- The default cluster-aware file system on the SUSE Linux Enterprise Server is OCFS2, and on Red Hat, it is Global File System (GFS) 2.
- The file system does this by immediately synchronizing caches between the nodes on which the file system resource is running, which means that every node always sees the current state of what is happening on the file system.
- Typically, you’ll need them in active/active scenarios, where multiple instances of the same resource are running on multiple nodes and are all active.
- You don't have to create a cluster file system if you only want to run one instance of a resource at a time.
Any disadvantages of using a cluster file system?
Apart from the benefits, there are also disadvantages to using a cluster file system. The most important one is that the cache has to be synchronized between all nodes involved. This makes a cluster file system slower than a stand-alone file system in many cases, especially those that involve a lot of metadata operations. Because they also provide much stronger coupling between the nodes, it becomes harder for the cluster to prevent faults from spreading.
It is often believed that a cluster file system provides an advantage over failover times, as compared to a local node file system, because it is already mounted. However, this is not true; the file system is still paused until fencing/STONITH and journal recovery for the failed node have completed. This will freeze the clustered file system on all nodes. It is actually a set of independent local file systems that provides higher availability! Clustered file systems should be used where they are required, but only after careful planning.
Prerequisites to set up the GFS2 file system
Below are the mandatory requirements for your cluster before you start working on the GFS2 file system
- CLVM (Clustered Logical Volume manager)
- DLM (Distributed Lock Manager)
It is important that your cluster setup is configured with fencing/STONITH.
We have enabled fencing here on our cluster. You can enable it using "pcs property set stonith-enabled=true"
[root@node1 ~]# pcs property show
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: mycluster
 dc-version: 1.1.18-11.el7_5.3-2b07d5c5a9
 have-watchdog: false
 last-lrm-refresh: 1546059766
 no-quorum-policy: freeze
 stonith-enabled: true
Below you can see the cluster status; here I have three fencing devices configured
[root@node1 ~]# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: node1.example.com (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Sat Dec 29 10:33:16 2018
Last change: Sat Dec 29 10:33:01 2018 by root via cibadmin on node1.example.com
3 nodes configured
3 resources configured
Online: [ node1.example.com node2.example.com node3.example.com ]
Full list of resources:
fence-vm1 (stonith:fence_xvm): Started node2.example.com
fence-vm2 (stonith:fence_xvm): Started node1.example.com
fence-vm3 (stonith:fence_xvm): Started node3.example.com
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
[root@node1 ~]# pcs stonith show
 fence-vm1 (stonith:fence_xvm): Started node2.example.com
 fence-vm2 (stonith:fence_xvm): Started node2.example.com
 fence-vm3 (stonith:fence_xvm): Started node2.example.com
Install gfs2-utils, lvm2-cluster and dlm on all your cluster nodes if not already installed
# yum -y install gfs2-utils lvm2-cluster dlm
Change the pcs property no-quorum-policy to freeze. This property is necessary because it means that cluster nodes will do nothing after losing quorum, and this is required for GFS2
# pcs property set no-quorum-policy=freeze
If you leave the default setting of stop, a mounted GFS2 file system cannot be stopped cleanly once quorum is lost, which will result in fencing of the entire cluster.
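You can cross-check that the property took effect (a hedged example; pcs property show accepts a property name on RHEL/CentOS 7, and the output format may vary slightly between pcs versions):
[root@node1 ~]# pcs property show no-quorum-policy
Cluster Properties:
 no-quorum-policy: freeze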
Configure DLM Resource
The Distributed Lock Manager (DLM), managed by the controld resource agent, is a mandatory part of the cluster. If it fails a monitor test after starting, the node on which it fails needs to be fenced to keep the cluster clean, and that is necessary to make sure nothing bad happens in combination with the no-quorum-policy, which is set to freeze.
As with the GFS2 file system itself, these resources have to be started on all nodes that require access to the file system. Pacemaker provides the clone resource for this purpose. Clone resources can be used for any resource that has to be active on multiple nodes simultaneously.
[root@node1 ~]# pcs resource create dlm ocf:pacemaker:controld op monitor interval=30s on-fail=fence clone interleave=true ordered=true
Check the pcs cluster status
[root@node1 ~]# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: node1.example.com (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Sat Dec 29 10:57:58 2018
Last change: Sat Dec 29 10:57:52 2018 by root via cibadmin on node1.example.com
3 nodes configured
6 resources configured
Online: [ node1.example.com node2.example.com node3.example.com ]
Full list of resources:
Clone Set: dlm-clone [dlm]
Started: [ node1.example.com node2.example.com node3.example.com ]
fence-vm1 (stonith:fence_xvm): Started node2.example.com
fence-vm2 (stonith:fence_xvm): Started node2.example.com
fence-vm3 (stonith:fence_xvm): Started node2.example.com
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
So our dlm and dlm-clone resources have started properly on all our cluster nodes.
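If you want to double-check the clone options we passed, you can inspect the resource. This is a hedged example; on RHEL/CentOS 7 the sub-command is pcs resource show <resource> (replaced by pcs resource config in later releases), and the output below is only approximate:
[root@node1 ~]# pcs resource show dlm-clone
 Clone: dlm-clone
  Meta Attrs: interleave=true ordered=true
  Resource: dlm (class=ocf provider=pacemaker type=controld)
   Operations: monitor interval=30s on-fail=fence (dlm-monitor-interval-30s)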
Configure CLVMD resource
- If multiple nodes of the cluster require simultaneous read/write access to LVM volumes in an active/active system, then you must use CLVMD.
- CLVMD provides a system for coordinating activation of and changes to LVM volumes across nodes of a cluster concurrently.
- CLVMD's clustered-locking service provides protection to LVM metadata as various nodes of the cluster interact with volumes and make changes to their layout.
To enable clustered locking, set locking_type=3 in lvm.conf
[root@node1 ~]# grep locking_type /etc/lvm/lvm.conf | egrep -v '#'
locking_type = 3
This is also the reason HA-LVM (halvm) and CLVMD are not compatible: HA-LVM requires locking_type set to 1, while CLVMD requires locking_type set to 3.
You can dynamically change this by using the below command
# lvmconf --enable-cluster
Disable and stop the lvm2-lvmetad service
# systemctl disable lvm2-lvmetad --now
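lvmconf --enable-cluster also turns off use_lvmetad in lvm.conf, because lvmetad is not cluster-aware. A quick hedged check, assuming the default configuration file location:
[root@node1 ~]# grep use_lvmetad /etc/lvm/lvm.conf | egrep -v '#'
use_lvmetad = 0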
Next create the clvmd resource
[root@node1 ~]# pcs resource create clvmd ocf:heartbeat:clvm op monitor interval=30s on-fail=fence clone interleave=true ordered=true
Validate the resource status
[root@node1 ~]# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: node1.example.com (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Sat Dec 29 10:57:58 2018
Last change: Sat Dec 29 10:57:52 2018 by root via cibadmin on node1.example.com
3 nodes configured
9 resources configured
Online: [ node1.example.com node2.example.com node3.example.com ]
Full list of resources:
Clone Set: dlm-clone [dlm]
Started: [ node1.example.com node2.example.com node3.example.com ]
Clone Set: clvmd-clone [clvmd]
Started: [ node1.example.com node2.example.com node3.example.com ]
fence-vm1 (stonith:fence_xvm): Started node2.example.com
fence-vm2 (stonith:fence_xvm): Started node2.example.com
fence-vm3 (stonith:fence_xvm): Started node2.example.com
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
Change resource start up order
Now we need ordering and colocation constraints as well. The ordering constraint makes sure dlm-clone is started before clvmd-clone, and the colocation constraint makes sure the clvmd clone is always kept together with the dlm clone.
[root@node1 ~]# pcs constraint order start dlm-clone then clvmd-clone
Adding dlm-clone clvmd-clone (kind: Mandatory) (Options: first-action=start then-action=start)
[root@node1 ~]# pcs constraint colocation add clvmd-clone with dlm-clone
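You can list the constraints we just created to confirm them; the output below is roughly what pcs prints on RHEL/CentOS 7:
[root@node1 ~]# pcs constraint
Location Constraints:
Ordering Constraints:
  start dlm-clone then start clvmd-clone (kind:Mandatory)
Colocation Constraints:
  clvmd-clone with dlm-clone (score:INFINITY)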
Set up shared storage on cluster nodes
From my previous article I am using an iSCSI target on all of my cluster nodes, which I will use to set up my cluster file system (GFS2). So after connecting to my storage node, I have /dev/sdc available on all my cluster nodes.
[root@node2 ~]# ls -l /dev/sd*
brw-rw---- 1 root disk 8, 0 Dec 29 09:47 /dev/sda
brw-rw---- 1 root disk 8, 1 Dec 29 09:47 /dev/sda1
brw-rw---- 1 root disk 8, 2 Dec 29 09:47 /dev/sda2
brw-rw---- 1 root disk 8, 16 Dec 29 09:47 /dev/sdb
brw-rw---- 1 root disk 8, 17 Dec 29 09:47 /dev/sdb1
brw-rw---- 1 root disk 8, 32 Dec 29 10:30 /dev/sdc
I will set up a logical volume on /dev/sdc on one of my cluster nodes. The same configuration will automatically get synced to all other cluster nodes
[root@node1 ~]# pvcreate /dev/sdc
 Physical volume "/dev/sdc" successfully created.
[root@node1 ~]# vgcreate -Ay -cy --shared vgclvm /dev/sdc
 Clustered volume group "vgclvm" successfully created
Here
- -A|--autobackup y|n : Specifies if metadata should be backed up automatically after a change.
- -c|--clustered y|n : Creates a clustered VG using clvmd if LVM is compiled with cluster support. This allows multiple hosts to share a VG on shared devices. clvmd and a lock manager must be configured and running.
Display the available volume groups
[root@node1 ~]# vgs
VG #PV #LV #SN Attr VSize VFree
centos 2 2 0 wz--n- <17.52g 1020.00m
vgclvm 1 0 0 wz--nc 992.00m 992.00m
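Note that the Attr column for vgclvm ends in c, which marks the volume group as clustered. If you prefer to check only that, you can limit vgs to the fields of interest (a hedged example):
[root@node1 ~]# vgs -o vg_name,vg_attr vgclvm
 VG     Attr
 vgclvm wz--nc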
Create a new logical volume using our shared volume group
[root@node1 ~]# lvcreate -l 100%FREE -n lvcluster vgclvm
 Logical volume "lvcluster" created.
Create a GFS2 file system on our logical volume.
[root@node1 ~]# mkfs.gfs2 -j3 -p lock_dlm -t mycluster:gfs2fs /dev/vgclvm/lvcluster
/dev/vgclvm/lvcluster is a symbolic link to /dev/dm-2
This will destroy any data on /dev/dm-2
Are you sure you want to proceed? [y/n] y
Discarding device contents (may take a while on large devices): Done
Adding journals: Done
Building resource groups: Done
Creating quota file: Done
Writing superblock and syncing: Done
Device: /dev/vgclvm/lvcluster
Block size: 4096
Device size: 0.97 GB (253952 blocks)
Filesystem size: 0.97 GB (253951 blocks)
Journals: 3
Journal size: 8MB
Resource groups: 7
Locking protocol: "lock_dlm"
Lock table: "mycluster:gfs2fs"
UUID: da1e5aa6-51a3-4512-ba79-3e325455007e
Here
- -t clustername:fsname : is used to specify the name of the locking table
- -j nn : specifies how many journals (one per node that will mount the file system) are created
- -J : allows specification of the journal size. If not specified, a journal has a default size of 128 MB. The minimum size is 8 MB (not recommended)
In the command, clustername must be the Pacemaker cluster name; I have used mycluster, which is my cluster name.
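If you later add a fourth node to the cluster, the file system will need one more journal than we created with -j3. A hedged example; gfs2_jadd works on the mounted file system, so run it after the mount point created below is in place:
[root@node1 ~]# gfs2_jadd -j 1 /clusterfs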
Create mount point and validate
Now our logical volume is created successfully. Next, let us create a mount point for our file system.
Manually create this mount point on all the cluster nodes
# mkdir /clusterfs
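Since the mount point must exist on every node, you can also create it in one shot over SSH. This is only a sketch and assumes password-less root SSH between the cluster nodes:
[root@node1 ~]# for node in node1 node2 node3; do ssh $node "mkdir -p /clusterfs"; done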
Before we create a resource for GFS2, let us manually validate that our file system on lvcluster is working properly.
[root@node1 ~]# mount /dev/vgclvm/lvcluster /clusterfs/
Validate the same
[root@node2 ~]# mount | grep clusterfs
/dev/mapper/vgclvm-lvcluster on /clusterfs type gfs2 (rw,noatime)
So the GFS2 file system on our logical volume got mounted successfully.
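Since Pacemaker will manage this mount through the resource we create next, it may be cleaner to unmount the manual test mount first (a hedged suggestion rather than a strict requirement):
[root@node1 ~]# umount /clusterfs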
Create GFS2FS cluster resource
Now we can create a resource named gfs2fs for our GFS2 file system.
[root@node1 ~]# pcs resource create gfs2fs Filesystem device="/dev/vgclvm/lvcluster" directory="/clusterfs" fstype=gfs2 options=noatime op monitor interval=10s on-fail=fence clone interleave=true
Assumed agent name 'ocf:heartbeat:Filesystem' (deduced from 'Filesystem')
Validate the cluster status
[root@node1 ~]# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: node1.example.com (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Sat Dec 29 10:58:08 2018
Last change: Sat Dec 29 10:57:52 2018 by root via cibadmin on node1.example.com
3 nodes configured
12 resources configured
Online: [ node1.example.com node2.example.com node3.example.com ]
Full list of resources:
Clone Set: dlm-clone [dlm]
Started: [ node1.example.com node2.example.com node3.example.com ]
Clone Set: clvmd-clone [clvmd]
Started: [ node1.example.com node2.example.com node3.example.com ]
fence-vm1 (stonith:fence_xvm): Started node2.example.com
fence-vm2 (stonith:fence_xvm): Started node2.example.com
fence-vm3 (stonith:fence_xvm): Started node2.example.com
Clone Set: gfs2fs-clone [gfs2fs]
Started: [ node1.example.com node2.example.com node3.example.com ]
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
So our gfs2fs resource has started automatically on all our cluster nodes.
Now arrange the resource start-up order for GFS2 and CLVMD so that after a node reboot the services start in the proper order, or else they will fail to start
[root@node1 ~]# pcs constraint order start clvmd-clone then gfs2fs-clone
Adding clvmd-clone gfs2fs-clone (kind: Mandatory) (Options: first-action=start then-action=start)
[root@node1 ~]# pcs constraint colocation add gfs2fs-clone with clvmd-clone
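To confirm the start-up order survives a node restart, you can restart the cluster stack on one node and watch the resources come back in the defined sequence. A hedged example using the node names from this article:
[root@node1 ~]# pcs cluster stop node3.example.com
[root@node1 ~]# pcs cluster start node3.example.com
[root@node1 ~]# pcs status resources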
Validate our Cluster with GFS2 file system
Now that our resources are running properly on all cluster nodes, let us create a file on one of the cluster nodes.
[root@node1 ~]# cd /clusterfs/
[root@node1 clusterfs]# touch file
Now connect to any other cluster node, and this file should exist there as well
[root@node2 ~]# ls /clusterfs/
file
So our Cluster with GFS2 file system configuration is working as expected.
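Because GFS2 allows simultaneous writes, you can also write to the same file from two different nodes and read the combined result from a third. The file name here is only illustrative:
[root@node1 ~]# echo "written from node1" >> /clusterfs/shared.log
[root@node2 ~]# echo "written from node2" >> /clusterfs/shared.log
[root@node3 ~]# cat /clusterfs/shared.log
written from node1
written from node2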
I am getting the error "task gfs2_quotad:3349 blocked for more than 120 seconds" and the shared partition is hanging on both nodes. Please help me.
You may have to check the syslog to understand the cause of this behaviour. It may be due to fencing but I can't be sure. /var/log/messages can give you more insight.
I am using HP iLO 4 as the fencing device and it is running on both nodes.
I am afraid I am not very familiar with iLO4 fencing.
I am also getting the error "dlm: close connection to node 1 & 2", after which the GFS2 shared partition hangs on both nodes. Please help me.
Hi,
After creating the GFS2 cluster with PCS, do I need to add an entry in fstab for a permanent mount?
That should not be required, because here the file system is mounted by the cluster resource instead of by the system at boot.
If GFS2 fails or gets corrupted, is there any way to fail over to another GFS2?
Do you mean the file system fails or the cluster node? Due to high availability, a cluster node failure will cause GFS2 to mount on the next available cluster node