IBM Spectrum Scale
The General Parallel File System (GPFS) is a high-performance clustered parallel file system developed by IBM, recently renamed IBM Spectrum Scale. GPFS is parallel in the sense that data is broken into blocks and striped across multiple disks, so it can be read and written in parallel. GPFS also offers many enterprise features such as mirroring, high availability, replication, and disaster recovery.
GPFS can be purchased as software or as an appliance. As software, there are three editions (Express, Standard, and Advanced), depending on the features needed. As an appliance, options include GSS from Lenovo, ESS from IBM, Seagate ClusterStor, DDN, and NEC.
Models of Deployment
There are basically three deployment models: the Shared Storage model (SAN), the Client-Server model (SAN or NAS), and the Shared-Nothing Cluster model. The latter is especially suitable for Big Data, since IBM also provides a Hadoop plugin that lets Hadoop use GPFS instead of HDFS.
GPFS Entities
There are three basic entities in the GPFS world: the NSD (or GPFS) client, the NSD (or GPFS) server, and the NSDs themselves, which stands for Network Shared Disks. NSDs are the disks where data and metadata are stored; they only have to be given a cluster-wide unique name.
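Once NSDs have been defined (as shown later in the NSD creation step), they can be listed together with their serving nodes using mmlsnsd, for example:
mmlsnsd        # list all NSDs, the file system they belong to, and their NSD servers
mmlsnsd -X     # extended output: local device name, device type, and serving nodes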
Install Notes
Note that the same GPFS packages must be installed on all nodes of a GPFS cluster. After installation, the license has to be set per node, depending on whether the node is a GPFS server or client node. Note also that the storage side consists of metadata disks and data disks; NSD servers therefore serve both metadata and data requests.
The NSD servers can replicate metadata and data (up to 3 copies) if configured. Replication is based on failure groups, so at least two failure groups are required for this configuration. When configured, an active-active failover mechanism is used between failure groups.
The GPFS daemon is a multi-threaded user-mode daemon. However, a special kernel extension is also needed; it makes GPFS appear to applications as just another file system, using the Virtual File System (VFS) concept.
GPFS Node Architecture
A GPFS node has the Linux kernel, the GPFS portability layer on top of it, the GPFS kernel extension on top of that, and the GPFS daemon in userland.
- GPFS portability layer: a loadable kernel module that enables communication between the Linux kernel and the GPFS daemon. This kernel module must be compiled after GPFS installation.
- GPFS kernel extension: It provides the interfaces to the kernel’s virtual file system (VFS) in order to add the GPFS file system. So the kernel thinks of GPFS as another local file-system like ext3 or xfs.
- GPFS daemon: The GPFS daemon performs all I/O and buffer management for GPFS.
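One way to confirm that the kernel extension is in place, once GPFS has been installed and started, is to look for the GPFS kernel modules; the module names below (mmfs26, mmfslinux, tracedev) are typical but may differ between releases:
lsmod | grep -E 'mmfs|tracedev'   # the GPFS kernel modules should appear once the daemon is running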
GPFS Cluster Configuration
The GPFS cluster configuration file is stored in /var/mmfs/gen/mmsdrfs; it contains information such as the list of nodes, available disks, file systems, and other cluster configuration. There are two ways to store the configuration file: on designated servers, or on all quorum nodes. To store it on servers, specify a primary and a secondary configuration server, each of which holds a copy of the file. Any change in the configuration then requires both the primary and secondary server to be available. To use this mode, run the following command
mmchcluster {[--ccr-disable] [-p PrimaryServer] [-s SecondaryServer]}
To store a copy of the configuration file on all quorum nodes, also known as the cluster configuration repository (CCR), use the following command
mmchcluster --ccr-enable
In the CCR case, any change in the configuration requires a majority of quorum nodes to be available.
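Which mode is in use can be checked with mmlscluster; depending on the release, its output either lists the primary and secondary configuration servers or reports CCR as the repository type:
mmlscluster    # shows cluster name, id, repository type or configuration servers, and member nodes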
Install dependencies
GPFS requires the following packages to be installed first: 1. Development Tools 2. kernel-devel 3. kernel-headers
yum -y groupinstall "Development Tools"
yum -y install kernel-devel-$(uname -r) kernel-headers-$(uname -r)
Installation Steps
- Install the Standard edition (or any edition you like, depending on the set of features required)
- Add /usr/lpp/mmfs/bin/ to your PATH (or add it to .bashrc and source it)
- Build the portability layer on all nodes, using one of the following two methods.
Method 1:
/usr/lpp/mmfs/bin/mmbuildgpl --build-package
Method 2:
cd /usr/lpp/mmfs/src
make Autoconfig (or: make LINUX_DISTRIBUTION=REDHAT_AS_LINUX Autoconfig)
make World
make InstallImages
make rpm (Red Hat distributions only)
- Make sure all nodes can properly resolve the host names and IP addresses of all other NSD servers and clients, that password-less authentication based on SSH keys works among all nodes (including localhost), and that the firewall, iptables, and SELinux are all disabled; a minimal sketch of these checks is given below.
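A minimal sketch of these prerequisites, assuming the four example hosts used later in the nodelist (excelero-a through excelero-d), root access, and a RHEL/CentOS system; adjust host names, addresses, and user to your environment:
# /etc/hosts entries on every node (example addresses)
192.168.10.11 excelero-a
192.168.10.12 excelero-b
192.168.10.13 excelero-c
192.168.10.14 excelero-d
# password-less SSH among all nodes, including localhost
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
for h in excelero-a excelero-b excelero-c excelero-d; do ssh-copy-id root@$h; done
# disable firewall and SELinux
systemctl disable --now firewalld
setenforce 0   # and set SELINUX=disabled in /etc/selinux/config to make it permanent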
GPFS Cluster Creation and Configuration
- GPFS Cluster Creation
mmcrcluster -N node1:manager-quorum,node2:manager-quorum,… --ccr-enable -r /usr/bin/ssh -R /usr/bin/scp -C BostonGPFSCluster
Notes:
- Node roles: manager and quorum. The defaults are client and non-quorum.
Manager: indicates whether a node is part of the node pool from which file system managers and token managers can be selected. Manager and quorum roles require GPFS server licenses.
- -R: Remote Copy Program
- -r: Remote Shell Command
- -C: Cluster Name
- --ccr-enable: Store the configuration on all quorum nodes
- -N: nodes or nodelist file
cat boston.nodelist
excelero-a:quorum
excelero-b:quorum
excelero-c:quorum
excelero-d:quorum
Quorum operates on the principle of majority rule: a majority of the nodes in the cluster must be successfully communicating before any node can mount and access a file system. This keeps any nodes that are cut off from the cluster from writing data to the file system.
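With such a nodelist file, the same cluster can equivalently be created by passing the file to -N instead of listing the nodes on the command line:
mmcrcluster -N boston.nodelist --ccr-enable -r /usr/bin/ssh -R /usr/bin/scp -C BostonGPFSCluster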
GPFS Cluster Configuration
After creating the GPFS cluster in the last step, set license mode for each node
mmchlicense server --accept -N node1,node2,node3,…
mmchlicense client --accept -N node1,node2,node3,…
List the licenses assigned to each node
mmlslicense
Startup the Cluster
mmstartup -a
where -a indicates that we want to start up GPFS on all nodes of the cluster. This step takes a while depending on the number of nodes, especially the quorum nodes, since GPFS checks the quorum at this stage before it starts up. You can check whether the cluster is still arbitrating (still trying to form a quorum), up, or down using
mmgetstate -a
To check the cluster configuration, use the following command
mmlscluster
To check the interface used for cluster communication, use
mmdiag --network
Next, create the network shared disks (NSDs): cluster-wide names for the disks used by GPFS. The input to the mmcrnsd command below consists of a file containing NSD stanzas describing the properties of the disks to be created.
mmcrnsd -F StanzaFile [-v {yes | no}]
The NSD stanza file should have the following format:
%nsd: device=DiskName
nsd=NsdName
servers=ServerList
usage={dataOnly | metadataOnly | dataAndMetadata | descOnly | localCache}
failureGroup=FailureGroup
pool=StoragePool
Notes:
- The server list can be omitted in the following cases:
- For IBM Spectrum Scale RAID, a server list is not allowed. The servers are determined from the underlying vdisk definition
- For SAN configurations where the disks are SAN-attached to all nodes in the cluster, a server list is optional.
- usage
- dataAndMetadata: Indicates that the disk contains both data and metadata. This is the default for disks in the system pool.
- dataOnly: Indicates that the disk contains data and does not contain metadata. This is the default for disks in storage pools other than the system pool.
- metadataOnly: Indicates that the disk contains metadata and does not contain data.
- localCache: Indicates that the disk is to be used as a local read-only cache device.
- Failure Groups: failureGroup=FailureGroup indicates a set of disks that share a common point of failure that could cause them all to become simultaneously unavailable
- A number identifying the failure group to which this disk belongs.
- All disks that are either attached to the same adapter or virtual shared disk server have a common point of failure and should therefore be placed in the same failure group.
Note that failure group is different from replication factor. GPFS maintains each instance of replicated data and metadata on disks in different failure groups. A replication factor of 2 in GPFS means that each block of a replicated file is in 2 failure groups. A failure group contains one or more NSDs. Each storage pool in a GPFS file system contains one or more failure groups. Failure groups are defined by the administrator and can be changed at any time for a fully replicated file system; i.e. any single failure group can fail and the data remains online.
- Pool: pool specifies the name of the storage pool that the NSD is assigned to. One of these storage pools is the required "system" storage pool; the other internal storage pools are optional user storage pools. An example stanza file illustrating these attributes follows these notes.
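As an illustration of the notes above, a hypothetical stanza file (device, NSD, and server names are examples only) could place metadata in the system pool and data in a user pool called dataPool, spread over two failure groups:
cat nsd.stanza
%nsd: device=/dev/sdb
nsd=meta01
servers=excelero-a,excelero-b
usage=metadataOnly
failureGroup=1
pool=system
%nsd: device=/dev/sdc
nsd=data01
servers=excelero-b,excelero-a
usage=dataOnly
failureGroup=2
pool=dataPool
mmcrnsd -F nsd.stanza
mmlsnsd
The dataPool name here matches the pool referenced by the placement policy at the end of this document.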
Next, Create the file system using the following command
mmcrfs gpfs-fsname -F stanza.nsd -j cluster -A no -B 1m -M 1 -m 1 -R 1 -r 1 -n 4 -T /scratch --metadata-block-size 128k
where
- -j Specifies the default block allocation map type
- cluster
- scatter
- -A Indicates when the file system is to be mounted
- -B Block size
- -M Maximum metadata replicas
- -m Default metadata replicas
- -R Maximum data replicas
- -r Default data replicas
- -n Estimated number of nodes that will mount the file system
- -T Mount point
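Once created, the file system attributes (block size, replication settings, mount point, and so on) can be verified with mmlsfs:
mmlsfs gpfs-fsname      # list all attributes of the file system
mmlsfs gpfs-fsname -B   # for example, show only the block size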
Next, mount the file system using
mmmount gpfs-fsname
# to unmount the file system
mmumount gpfs-fsname
Next, create a policy file and change the policy so that storage for data is allocated from a different pool, and hence different NSDs, than the metadata
cat gpfs.policy
RULE 'default' SET POOL 'dataPool'
mmchpolicy gpfs-fsname gpfs.policy -I yes
where
- -I yes|test
- yes: the policy is validated and immediately activated
- test: the policy is validated but not activated
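The policy currently installed for a file system can be displayed afterwards with mmlspolicy:
mmlspolicy gpfs-fsname -L    # show the rules of the currently installed policy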