[[File:ibm-gpfs.jpg|centre|250px]]
  
=IBM Spectrum Scale=
  
The General Parallel File System (GPFS) is a high-performance, clustered parallel file system developed by IBM; it was recently renamed IBM Spectrum Scale. GPFS is parallel in the sense that data is broken into blocks and striped across multiple disks, so it can be read and written in parallel. GPFS also offers many enterprise features such as mirroring, high availability, replication, and disaster recovery.
  
GPFS can be bought as software or as an appliance. As software, it comes in three editions (Express, Standard, and Advanced), depending on the features needed. Appliance options include GSS from Lenovo, ESS from IBM, Seagate ClusterStor, and offerings from DDN and NEC.
===Models of Deployment===
There are three deployment models: the Shared Storage model (SAN), the Client-Server model (SAN or NAS), and the Shared-Nothing Cluster model. The latter is especially suitable for Big Data, because IBM also provides a Hadoop plugin that lets Hadoop use GPFS instead of HDFS.
  
===GPFS Entities===
There are three basic entities in the GPFS world: the NSD (or GPFS) client, the NSD (or GPFS) server, and the NSDs themselves, where NSD stands for Network Shared Disk. NSDs are the disks where data and metadata are stored; each one only has to be given a cluster-wide unique name.
  
===Install Notes===
The same GPFS packages must be installed on all nodes of a GPFS cluster. After installation, the license has to be set on each node according to whether it is a GPFS server or client node. Note that the storage side consists of metadata disks and data disks, so NSD servers serve both metadata and data requests.
If configured, the NSD servers can replicate metadata and data (up to three copies). Replication is based on failure groups, which must be defined for this configuration. When replication is configured, an active-active failover mechanism is used between failure groups.
The GPFS daemon is a multi-threaded user-mode daemon. However, a special kernel extension is needed to make GPFS appear to applications as just another file system, using the Virtual File System (VFS) concept.
  
===GPFS Node Architecture===
A GPFS node consists of the Linux kernel, the GPFS portability layer on top of it, the GPFS kernel extension on top of that, and the GPFS daemon in userland.
  
# GPFS portability layer: a loadable kernel module that enables communication between the Linux kernel and the GPFS daemon. This module must be compiled after GPFS installation.
# GPFS kernel extension: provides the interfaces to the kernel's virtual file system (VFS) in order to add GPFS as a file system, so the kernel treats GPFS like any other local file system such as ext3 or XFS.
# GPFS daemon: performs all I/O and buffer management for GPFS.
===GPFS Cluster Configuration File===
The GPFS cluster configuration file is stored in /var/mmfs/gen/mmsdrfs. It contains information such as the list of nodes, the available disks, the file systems, and other cluster settings. There are two ways to store the configuration file: on dedicated configuration servers, or on all quorum nodes. For the first option, a primary and a secondary configuration server are specified and each holds a copy of the file; any change to the configuration then requires both servers to be available. To set this up, use the following command:
  
<syntaxhighlight>mmchcluster {[--ccr-disable] [-p PrimaryServer] [-s SecondaryServer]}</syntaxhighlight>
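For example, with two hypothetical nodes gpfs01 and gpfs02 (placeholder names) acting as the primary and secondary configuration servers, the call might look like this:

<syntaxhighlight>
# gpfs01 and gpfs02 are placeholder hostnames for this example
mmchcluster --ccr-disable -p gpfs01 -s gpfs02
</syntaxhighlight>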
To instead store a copy of the configuration file on all quorum nodes, known as the Cluster Configuration Repository (CCR), use the following command:
<syntaxhighlight>mmchcluster --ccr-enable</syntaxhighlight>
  
In the CCR case, any change to the configuration requires a majority of the quorum nodes to be available.
===Install dependencies===
 
GPFS requires the following packages to be installed beforehand:
# Development Tools
# kernel-devel
# kernel-headers
<syntaxhighlight>yum -y groupinstall "Development Tools"
yum -y install kernel-devel-$(uname -r) kernel-headers-$(uname -r)</syntaxhighlight>
 
 
===Installation Steps===
  
* Install the Standard Edition (or whichever edition you need, depending on the required feature set).
* Add /usr/lpp/mmfs/bin/ to your PATH (for example, export it in ~/.bashrc and source the file).
* Build the portability layer on all nodes, using either of the two methods below.
Method 1:
<syntaxhighlight> /usr/lpp/mmfs/bin/mmbuildgpl --build-package </syntaxhighlight>
Method 2:
<syntaxhighlight>cd /usr/lpp/mmfs/src
make Autoconfig      # on Red Hat: make LINUX_DISTRIBUTION=REDHAT_AS_LINUX Autoconfig
make World
make InstallImages
make rpm             # Red Hat based distributions only
</syntaxhighlight>
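On Red Hat based systems the build produces a portability-layer RPM (typically named gpfs.gplbin-<kernel version>); assuming the default rpmbuild output location, it can then be installed on the remaining nodes that run the same kernel level, for example:

<syntaxhighlight>
# Path and package name are assumptions based on the default rpmbuild layout
rpm -ivh /root/rpmbuild/RPMS/x86_64/gpfs.gplbin-*.rpm
</syntaxhighlight>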
* Make sure all nodes can correctly resolve the hostnames and IP addresses of all other NSD servers and clients.
* Set up password-less, key-based SSH authentication among all nodes, including localhost.
* Disable the firewall (iptables/firewalld) and SELinux on all nodes (a minimal sketch of these steps follows below).
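A minimal sketch of these prerequisites, reusing the example node names from the cluster below (adjust to your environment):

<syntaxhighlight>
# Make sure /etc/hosts (or DNS) contains entries for all nodes

# Key-based, password-less SSH from every node to every node (including localhost)
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
for host in excelero-a excelero-b excelero-c excelero-d localhost; do
    ssh-copy-id root@$host
done

# Disable firewall and SELinux (the SELinux change becomes permanent after a reboot)
systemctl disable --now firewalld
setenforce 0
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
</syntaxhighlight>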
  
===GPFS Cluster Creation and Configuration===
* GPFS Cluster Creation
<syntaxhighlight>mmcrcluster -N node1:manager-quorum,node2:manager-quorum,... --ccr-enable -r /usr/bin/ssh -R /usr/bin/scp -C BostonGPFSCluster</syntaxhighlight>
Notes:
* Node roles: manager and quorum (the defaults are client and non-quorum). Manager indicates whether a node is part of the node pool from which file system managers and token managers can be selected. The manager and quorum roles require GPFS server licenses.
* -R: remote file copy program
* -r: remote shell command
* -C: cluster name
* --ccr-enable: store the configuration on all quorum nodes
* -N: nodes, or a node list file such as the following:
<syntaxhighlight>cat boston.nodelist
excelero-a:quorum
excelero-b:quorum
excelero-c:quorum
excelero-d:quorum</syntaxhighlight>
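Putting this together, the example cluster could then be created directly from the node list file (illustrative; the file and cluster names are taken from the examples above):

<syntaxhighlight>mmcrcluster -N boston.nodelist --ccr-enable -r /usr/bin/ssh -R /usr/bin/scp -C BostonGPFSCluster</syntaxhighlight>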
  
Quorum operates on the principle of majority rule. A majority of the nodes in the cluster must be successfully communicating before any node can mount and access a file system. This keeps any nodes that are cut off from the cluster from writing data to the file system.
=GPFS Cluster Configuration=
  
After creating the GPFS cluster in the previous step, set the license mode for each node:
<syntaxhighlight>mmchlicense server --accept -N node1,node2,node3,…
mmchlicense client --accept -N node1,node2,node3,…</syntaxhighlight>
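For the example cluster above, where all four nodes are quorum nodes and therefore need server licenses, this might look like:

<syntaxhighlight>mmchlicense server --accept -N excelero-a,excelero-b,excelero-c,excelero-d</syntaxhighlight>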
  
List the licenses assigned to each node:
<syntaxhighlight>mmlslicense</syntaxhighlight>
 
  
Start up the cluster:
<syntaxhighlight>mmstartup -a</syntaxhighlight>
where -a indicates that GPFS should be started on all nodes. This step can take a while depending on the number of nodes, especially the quorum nodes, since GPFS checks quorum at this stage before starting up. You can check whether the cluster is still arbitrating (trying to form a quorum), up, or down using
<syntaxhighlight>mmgetstate -a</syntaxhighlight>
To check the cluster configuration, use the following command:
<syntaxhighlight>mmlscluster</syntaxhighlight>
To check the interface used for cluster communication, use:
<syntaxhighlight>mmdiag --network</syntaxhighlight>
  
Next, create the Network Shared Disks (NSDs), the cluster-wide names for the disks used by GPFS. The input to the mmcrnsd command is a file containing NSD stanzas that describe the properties of the disks to be created:
<syntaxhighlight>mmcrnsd -F StanzaFile [-v {yes | no}]</syntaxhighlight>
The NSD stanza file should have the following format:
 %nsd: device=DiskName
 nsd=NsdName
 servers=ServerList
 usage={dataOnly | metadataOnly | dataAndMetadata | descOnly | localCache}
 failureGroup=FailureGroup
 pool=StoragePool
  
Notes:
* The server list can be omitted in the following cases:
# For IBM Spectrum Scale RAID, a server list is not allowed; the servers are determined from the underlying vdisk definition.
# For SAN configurations where the disks are SAN-attached to all nodes in the cluster, a server list is optional.
* usage:
#  dataAndMetadata: Indicates that the disk contains both data and metadata. This is the default for disks in the system pool.
#  dataOnly: Indicates that the disk contains data and does not contain metadata. This is the default for disks in storage pools other than the system pool.
#  metadataOnly: Indicates that the disk contains metadata and does not contain data.
#  localCache: Indicates that the disk is to be used as a local read-only cache device.
  
* Failure groups: failureGroup=FailureGroup indicates a set of disks that share a common point of failure, one that could cause them all to become simultaneously unavailable.
# A number identifying the failure group to which this disk belongs.
# All disks that are attached to the same adapter or virtual shared disk server have a common point of failure and should therefore be placed in the same failure group.
  
Note that a failure group is not the same as the replication factor. GPFS keeps each instance of replicated data and metadata on disks in different failure groups; a replication factor of 2 means that each block of a replicated file exists in two failure groups. A failure group contains one or more NSDs, and each storage pool in a GPFS file system contains one or more failure groups.
Failure groups are defined by the administrator and can be changed at any time. For a fully replicated file system, any single failure group can fail and the data remains online.
  
* Pool: specifies the name of the storage pool that the NSD is assigned to. One of these pools is the required "system" storage pool; the other internal storage pools are optional user storage pools.
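As an illustration, a stanza file for two NSD servers with separate metadata and data disks might look like the following; the device paths, NSD names, failure group numbers and pool names are hypothetical, not values taken from this cluster:

<syntaxhighlight>
# nsd.stanza -- hypothetical example
%nsd: device=/dev/sdb
  nsd=meta_nsd_1
  servers=excelero-a,excelero-b
  usage=metadataOnly
  failureGroup=1
  pool=system

%nsd: device=/dev/sdc
  nsd=data_nsd_1
  servers=excelero-a,excelero-b
  usage=dataOnly
  failureGroup=1
  pool=dataPool

%nsd: device=/dev/sdc
  nsd=data_nsd_2
  servers=excelero-b,excelero-a
  usage=dataOnly
  failureGroup=2
  pool=dataPool
</syntaxhighlight>

The file is then passed to mmcrnsd with -F, and the resulting NSDs can be listed with mmlsnsd.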
  
Next, create the file system using the following command:
<syntaxhighlight>mmcrfs gpfs-fsname -F stanza.nsd -j cluster -A no -B 1m -M 1 -m 1 -R 1 -r 1 -n 4 -T /scratch --metadata-block-size 128k</syntaxhighlight>
where
* -j: specifies the default block allocation map type, either
# cluster
# scatter
* -A: indicates when the file system is to be mounted
* -B: block size
* -M: maximum metadata replicas
* -m: default metadata replicas
* -R: maximum data replicas
* -r: default data replicas
* -n: estimated number of nodes that will mount the file system
* -T: mount point
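For comparison, a two-way replicated variant (the file system name and mount point below are placeholders) could be created with settings along these lines:

<syntaxhighlight>mmcrfs gpfs-repl -F stanza.nsd -j scatter -A yes -B 1m -M 2 -m 2 -R 2 -r 2 -n 4 -T /gpfs/repl --metadata-block-size 128k</syntaxhighlight>

With -m 2 and -r 2, each metadata and data block is written to two different failure groups, which is why at least two failure groups must be defined in the NSD stanza file.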
  
Next, mount the file system using:
<syntaxhighlight>mmmount gpfs-fsname
# to unmount the file system
mmumount gpfs-fsname</syntaxhighlight>
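To check on which nodes the file system is currently mounted, the mmlsmount command (with -L to list the nodes) can be used:

<syntaxhighlight>mmlsmount gpfs-fsname -L</syntaxhighlight>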
  
Next, create a policy file and install it so that storage for data is allocated from a different pool (and hence different NSDs) than the metadata:
<syntaxhighlight>cat gpfs.policy
RULE 'default' SET POOL 'dataPool'
mmchpolicy gpfs-fsname gpfs.policy -I yes
 
</syntaxhighlight>
 
where
* -I yes|no
# yes: the policy is validated and immediately activated
# no: the policy is validated but not activated
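The policy currently in effect can be reviewed afterwards with mmlspolicy, for example:

<syntaxhighlight>mmlspolicy gpfs-fsname -L</syntaxhighlight>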
  
[[ Installation ]]
  
[[ GPFS_NVMesh | NVMesh ]]
