IBM Spectrum Scale
The General Parallel File System (GPFS) is a high-performance clustered parallel file system developed by IBM, recently renamed IBM Spectrum Scale. GPFS is parallel in the sense that data is broken into blocks and striped across multiple disks, so that it can be read and written in parallel. GPFS offers many enterprise features such as mirroring, high availability, replication, and disaster recovery.
GPFS can be bought as software or as an appliance. As software, there are three editions (Express, Standard, and Advanced) depending on the features needed. Appliance offerings include GSS from Lenovo, ESS from IBM, Seagate ClusterStor, DDN, and NEC.
Models of Deployment
There are basically three deployment models: the Shared Storage model (SAN), the Client-Server model (SAN or NAS), and the Shared-Nothing Cluster model. The last is particularly suitable for Big Data, especially since IBM also provides a Hadoop plugin so that GPFS can be used instead of HDFS.
GPFS Entities
There are three basic entities in the GPFS world: the first is the NSD (or GPFS) client, the second is the NSD (or GPFS) server, and the third is the NSDs, which stands for Network Shared Disks. NSDs are the disks where data and metadata are stored; they only have to be given a cluster-wide unique name.
Install Notes
It should be noted that the same GPFS packages must be installed on all nodes of a GPFS cluster. After the installation, the license has to be set depending on whether the node is a GPFS server or client node. Please note that the storage side consists of metadata disks and data disks; NSD servers therefore serve both metadata and data requests.
The NSD servers can replicate metadata and data (up to 3 copies) if configured. Replication is based on failure groups; at least two failure groups are required for this configuration. When configured, an active-active failover mechanism is used between failure groups.
The GPFS daemon is a multi-threaded user-mode daemon. However, a special kernel extension is needed which makes GPFS appear to applications as just another file system, using the virtual file system (VFS) concept.
GPFS Node Architecture
A GPFS node has a Linux kernel, the GPFS portability layer on top of it, the GPFS kernel extension on top of that, and the GPFS daemon in userland.
- GPFS portability layer: A layer (loadable kernel module) which enables communication between the Linux kernel and the GPFS daemon. This kernel module must be compiled after GPFS installation.
- GPFS kernel extension: It provides the interfaces to the kernel’s virtual file system (VFS) in order to add the GPFS file system. So the kernel thinks of GPFS as another local file-system like ext3 or xfs.
- GPFS daemon: The GPFS daemon performs all I/O and buffer management for GPFS.
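As an illustrative check, the portability layer and kernel extension appear as loadable kernel modules once they have been built and GPFS is running. The module names below are the ones GPFS typically installs and should be treated as an assumption for your release:
# List the GPFS kernel modules (typical names: mmfslinux = portability layer, mmfs26 = kernel extension, tracedev = tracing support)
lsmod | egrep 'mmfs|tracedev'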
GPFS Cluster Configuration
The GPFS cluster configuration file is stored in /var/mmfs/gen/mmsdrfs; it contains information such as the list of nodes, available disks, file systems, and other cluster settings. There are two ways to store the configuration file: the first is on dedicated configuration servers, the second is on all quorum nodes. To store it on servers, one has to specify the primary and secondary server, and a copy of the file is kept on each; any change to the configuration then requires both the primary and secondary server to be available. To this end, use the following command
mmchcluster {[--ccr-disable] [-p PrimaryServer] [-s SecondaryServer]}
To store a copy of the configuration file on all quorum nodes, also known as the cluster configuration repository (CCR), use the following command
mmchcluster --ccr-enable
In the CCR case, any changes in the configuration require the majority of quorum nodes to be available.
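As a usage sketch (gpfs-node1 and gpfs-node2 are hypothetical host names), switching between the two repository modes follows the syntax above:
# Server-based repository: designate primary and secondary configuration servers
mmchcluster --ccr-disable -p gpfs-node1 -s gpfs-node2
# CCR-based repository: keep the configuration on all quorum nodes
mmchcluster --ccr-enable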
Install dependencies
GPFS requires the following packages to be installed first:
1. Development Tools
2. kernel-devel
3. kernel-headers
yum -y groupinstall "Development Tools"
yum -y install kernel-devel-$(uname -r) kernel-headers-$(uname -r)
Installation Steps
- Install the Standard edition (or any other edition, depending on the set of features required)
- Add /usr/lpp/mmfs/bin/ to your PATH (export it, or add it to .bashrc and source it)
- Build the portability layer on all nodes.
Method 1:
/usr/lpp/mmfs/bin/mmbuildgpl --build-package
Method 2:
cd /usr/lpp/mmfs/src
make Autoconfig (make LINUX_DISTRIBUTION=REDHAT_AS_LINUX Autoconfig)
make World
make InstallImages
make rpm (Red Hat distributions only)
- Make sure all nodes can properly resolve the names and IP addresses of all other NSD servers and clients
- Set up password-less SSH authentication (based on SSH keys) among all nodes, including localhost
- Disable the firewall, iptables, and SELinux
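A minimal sketch of these prerequisites on a RHEL-style node (host names and addresses are hypothetical; adapt to your environment):
# Name resolution: every node must resolve every other node (or use DNS)
echo "192.168.1.11 gpfs-node1" >> /etc/hosts
# Password-less SSH based on keys, including to localhost
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
ssh-copy-id root@gpfs-node1
ssh-copy-id root@localhost
# Disable the firewall and SELinux
systemctl disable --now firewalld
setenforce 0
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config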
GPFS Cluster Creation and Configuration
- GPFS Cluster Creation
mmcrcluster -N node1:manager-quorum,node2:manager-quorum,.. --ccr-enable -r /usr/bin/ssh -R /usr/bin/scp -C BostonGPFSCluster
Notes:
- Node roles: manager, quorum. The defaults are client and non-quorum.
Manager: indicates whether a node is part of the node pool from which file system managers and token managers can be selected. The manager and quorum roles require GPFS server licenses.
- -R: Remote Copy Program
- -r: Remote shell command
- -C: Cluster Name
- --ccr-enable: Store the configuration on all quorum nodes
- -N: nodes or nodelist file
cat boston.nodelist
excelero-a:quorum
excelero-b:quorum
excelero-c:quorum
excelero-d:quorum
Quorum operates on the principle of majority rule: a majority of the quorum nodes in the cluster must be successfully communicating before any node can mount and access a file system. This keeps nodes that are cut off from the cluster from writing data to the file system.
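For illustration, the same cluster could be created from the node list file shown above rather than an inline list; the cluster name is only an example:
# Create the cluster from a node file
mmcrcluster -N boston.nodelist --ccr-enable -r /usr/bin/ssh -R /usr/bin/scp -C BostonGPFSCluster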
GPFS Cluster Configuration
After creating the GPFS cluster in the last step, set license mode for each node
mmchlicense server --accept -N node1,node2,node3,…
mmchlicense client --accept -N node1,node2,node3,…
List the licenses assigned to each node
mmlslicense
Startup the Cluster
mmstartup -a
where -a indicates that we want to start up GPFS on all nodes. This step takes a while depending on the number of nodes, especially the quorum nodes, since GPFS checks the quorum at this stage before it starts up. You can check whether the cluster is still arbitrating (still trying to form a quorum), up, or down using
mmgetstate -a
To check the cluster configuration, use the following command
mmlscluster
To check the interface used for cluster communication, use
mmdiag --network
Next, create the network shared disks (NSDs): cluster-wide names for the disks used by GPFS. The input to this command consists of a file containing NSD stanzas describing the properties of the disks to be created.
mmcrnsd -F StanzaFile [-v {yes | no}]
The NSD stanza file should have the following format:
%nsd: device=DiskName
nsd=NsdName
servers=ServerList
usage={dataOnly | metadataOnly | dataAndMetadata | descOnly | localCache}
failureGroup=FailureGroup
pool=StoragePool
Notes:
- The server list can be omitted in the following cases:
- For IBM Spectrum Scale RAID, a server list is not allowed; the servers are determined from the underlying vdisk definition.
- For SAN configurations where the disks are SAN-attached to all nodes in the cluster, a server list is optional.
- usage
- dataAndMetadata: Indicates that the disk contains both data and metadata. This is the default for disks in the system pool.
- dataOnly: Indicates that the disk contains data and does not contain metadata. This is the default for disks in storage pools other than the system pool.
- metadataOnly: Indicates that the disk contains metadata and does not contain data.
- localCache: Indicates that the disk is to be used as a local read-only cache device.
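As a concrete sketch (device paths, NSD names, and server assignments are hypothetical), a stanza file describing two NSDs in two failure groups, so that data and metadata can be replicated across them, could look like this:
%nsd: device=/dev/sdb
nsd=nsd01
servers=excelero-a,excelero-b
usage=dataAndMetadata
failureGroup=1
pool=system
%nsd: device=/dev/sdc
nsd=nsd02
servers=excelero-c,excelero-d
usage=dataAndMetadata
failureGroup=2
pool=system
The NSDs would then be created from this file and can be listed afterwards:
mmcrnsd -F boston.stanza
mmlsnsd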
Volume Management
By default Quobyte creates one volume configuration called BASE, which serves as the default configuration and can be used as a template that other volume configurations inherit from.
Viewing Volume Configurations
Configurations can be viewed through the API or from the web console.
- API
[root@q01 ~]# qmgmt volume config export BASE
configuration_name: "BASE"
volume_metadata_configuration {
placement_settings {
required_device_tags {
}
forbidden_device_tags {
}
prefer_client_local_device: false
optimize_for_mapreduce: false
}
replication_factor: 1
}
default_config {
file_layout {
stripe_width: 1
replication_factor: 1
block_size_bytes: 524288
object_size_bytes: 8388608
segment_size_bytes: 10737418240
crc_method: CRC_32_ISCSI
}
placement {
required_device_tags {
}
forbidden_device_tags {
}
prefer_client_local_device: false
optimize_for_mapreduce: false
}
io_policy {
cache_size_in_objects: 10
enable_async_writebacks: true
enable_client_checksum_verification: true
enable_client_checksum_computation: true
sync_writes: AS_REQUESTED
direct_io: AS_REQUESTED
OBSOLETE_implicit_locking: false
lost_lock_behavior: IO_ERROR
OBSOLETE_keep_page_cache: false
implicit_locking_mode: NO_LOCKING
enable_direct_writebacks: false
notify_dataservice_on_close: false
keep_page_cache_mode: USE_HEURISTIC
rpc_retry_mode: RETRY_FOREVER
lock_scope: GLOBAL
}
}
snapshot_configuration {
snapshot_interval_s: 0
snapshot_lifetime_s: 0
}
metadata_cache_configuration {
cache_ttl_ms: 10000
negative_cache_ttl_ms: 10000
enable_write_back_cache: false
}
- Web console
Log in to the web console and navigate to "Volume Configuration". Select BASE to view the configuration.
Editing Volume Configuration
- API
qmgmt volume config edit BASE
This will open the configuration in your default editor (the value of the EDITOR environment variable is used if it is set).
- Web console
Navigate to 'Volume Configurations' and tick the box beside BASE. Then select 'edit' from the drop down menu.
Creating Volume Configurations
- API
To create a new configuration, use the same command you would use to edit one, but with a configuration name that doesn't exist yet. For example, to create a new configuration called 3x_replication, run the following
qmgmt volume config edit 3x_replication
This will open an empty file in a text editor.
To avoid specifying every setting manually, it is advisable to inherit from an existing configuration and use it as a template. For example, to use the BASE configuration as a template for 3x_replication, add the following
base_configuration: "BASE"
Now individual parameters can be set, and any setting that isn't defined will inherit its value from the BASE configuration. The options below were set in the 3x_replication configuration
[root@q01 ~]# qmgmt volume config export 3x_replication
configuration_name: "3x_replication"
base_configuration: "BASE"
volume_metadata_configuration {
replication_factor: 3
}
default_config {
placement {
required_device_tags {
tags: "hdd"
}
forbidden_device_tags {
}
prefer_client_local_device: false
optimize_for_mapreduce: false
}
}
This will create three replicas of data and metadata, and will only place data on devices tagged with "hdd". The use of tags allows finer control over which data is placed on which devices; in this example all data is placed on HDDs and not on SSD storage.
- Web console
The web console only allows new sub-configurations to be created, i.e. configurations that inherit from another. To create a new sub-configuration, navigate to 'Volume Configurations' and tick the box next to BASE. Then from the drop-down menu select 'Add new sub-configuration'.
Creating Volumes
Volumes are created either from the CLI or through the web console.
- CLI
The generic command used to create volumes is
qmgmt volume create <volume name> <user> <group> <volume configuration>
In the test bed, three volumes were created using different volume configurations. They were created by running the following commands
qmgmt volume create home_vol root root 3x_replication
qmgmt volume create scratch_vol root root ssd_performance
qmgmt volume create archive_vol root root 8+3_erasure
Mounting Volumes
Volumes can be mounted on any server on which the quobyte-client package is installed. The CLI tool mount.quobyte is used to mount Quobyte volumes. The command takes a list of registry servers and the volume to mount, as well as the directory to mount the volume to. So to mount the home_vol above to /home
mount.quobyte q01:7861,q02:7861,q03:7861,q04:7861/home_vol /home
This can be repeated to mount any other volumes
mount.quobyte q01:7861,q02:7861,q03:7861,q04:7861/scratch_vol /scratch
mount.quobyte q01:7861,q02:7861,q03:7861,q04:7861/archive_vol /archive
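A quick, illustrative way to verify the mounts using standard Linux tools (the mount-point directories must exist before running mount.quobyte):
# Create the mount points if they do not already exist
mkdir -p /home /scratch /archive
# After mounting, confirm the volumes are attached
mount | grep quobyte
df -h /home /scratch /archive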