GPFS NVMesh
This page covers the use of GPFS with NVMesh.
= Setup =

=== Making NVMesh volumes visible to GPFS ===
By default GPFS can only see block devices of certain types. Known disk types currently are:
<syntaxhighlight>
powerdisk - EMC power path disk
vpath     - IBM virtual path disk
dmm       - Device-Mapper Multipath (DMM)
dlmfdrv   - Hitachi dlm
hdisk     - AIX hard disk
lv        - AIX logical volume. Historical usage only.
            Not allowed as a new device to mmcrnsd.
gpt       - GPFS partition on Windows disk
generic   - Device having no unique failover or multipathing
            characteristic (predominantly Linux devices).
dasd      - DASD device (for Linux on z Systems)
</syntaxhighlight>
To list all the currently known devices, run the following command.
<syntaxhighlight>
[root@excelero-a ~]# mmdevdiscover
sdb generic
sdb1 generic
sdb2 generic
sda generic
sda1 generic
sda2 generic
dm-0 dmm
dm-1 dmm
dm-2 dmm
dm-3 dmm
dm-4 dmm
dm-5 dmm
</syntaxhighlight>
To use NVMesh block devices with GPFS, an additional known disk type needs to be added. The mmdevdiscover script has a built-in hook for running an arbitrary user script during execution, and we use this to add NVMesh devices to the list of known drives. Below is a simple bash script that finds all attached NVMesh volumes and labels them as generic Linux devices.
<syntaxhighlight>
[root@excelero-a ~]# cat /var/mmfs/etc/nsddevices
#!/bin/bash
# Report every NVMesh volume to GPFS as a generic Linux device
if [[ -d /dev/nvmesh ]]; then
    cd /dev && for dev in $(ls nvmesh/); do
        echo nvmesh/$dev generic
    done
fi
# A non-zero exit tells mmdevdiscover to continue with its built-in discovery
exit 1
</syntaxhighlight>
Note that the file /var/mmfs/etc/nsddevices needs to be created on all systems (both servers and clients) and must be executable.
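One way to distribute the script is a simple loop over the cluster nodes; the sketch below assumes passwordless ssh and uses the node names from this test setup purely for illustration.
<syntaxhighlight>
# Node names are illustrative; substitute your own cluster's nodes
for host in excelero-a excelero-b excelero-c excelero-d dgx-1; do
    scp /var/mmfs/etc/nsddevices ${host}:/var/mmfs/etc/nsddevices
    ssh ${host} chmod +x /var/mmfs/etc/nsddevices
done
</syntaxhighlight>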
After adding /var/mmfs/etc/nsddevices, confirm that the NVMesh volumes are visible in the output of mmdevdiscover.
<syntaxhighlight>
[root@excelero-a ~]# mmdevdiscover
nvmesh/nv01 generic
nvmesh/nv01p1 generic
nvmesh/nv02 generic
nvmesh/nv02p1 generic
nvmesh/nv03 generic
nvmesh/nv03p1 generic
nvmesh/nv04 generic
nvmesh/nv04p1 generic
.
.
</syntaxhighlight>

=== NSD Creation ===
As the NVMesh block devices are available on all servers, we can set up GPFS in a direct-attached (share-all) configuration. In this configuration, all block devices used as NSDs appear as local devices on each server. This is the optimal configuration for GPFS, as all network traffic occurs at the block level, which removes the need for GPFS to share devices over its own protocol.
To create the NSDs, create a stanza file with an entry for each NSD. This only needs to be done on one server; GPFS will sync the configuration across the cluster. Since all block devices are attached to every server and client, we do not need to specify any servers in the stanza file.
<syntaxhighlight>
[root@excelero-a ~]# cat nsd.stanza
%nsd:
  nsd=nsd01
  device=/dev/nvmesh/nv01
  usage=dataAndMetadata
%nsd:
  nsd=nsd02
  device=/dev/nvmesh/nv02
  usage=dataAndMetadata
%nsd:
  nsd=nsd03
  device=/dev/nvmesh/nv03
  usage=dataAndMetadata
%nsd:
  nsd=nsd04
  device=/dev/nvmesh/nv04
  usage=dataAndMetadata
</syntaxhighlight>
This stanza file specifies 4 NSDs to be created.
Create the NSDs using the mmcrnsd command, as shown below. Once created, confirm that the NSDs are mapped to each server and client, and that GPFS sees them as directly attached storage.
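The only required argument is the stanza file; mmcrnsd accepts further options (see its man page), but for the stanza file above a minimal invocation is
<syntaxhighlight>
mmcrnsd -F nsd.stanza
</syntaxhighlight>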
<syntaxhighlight>
[root@excelero-a ~]# mmlsnsd -M

 Disk name  NSD volume ID     Device            Node name         Remarks
---------------------------------------------------------------------------------------
 nsd01      C0A800015B156123  /dev/nvmesh/nv01  dgx-1.admin
 nsd01      C0A800015B156123  /dev/nvmesh/nv01  excelero-a.admin
 nsd01      C0A800015B156123  /dev/nvmesh/nv01  excelero-b.admin
 nsd01      C0A800015B156123  /dev/nvmesh/nv01  excelero-c.admin
 nsd01      C0A800015B156123  /dev/nvmesh/nv01  excelero-d.admin
 nsd02      C0A800015B156124  /dev/nvmesh/nv02  dgx-1.admin
 nsd02      C0A800015B156124  /dev/nvmesh/nv02  excelero-a.admin
 nsd02      C0A800015B156124  /dev/nvmesh/nv02  excelero-b.admin
 nsd02      C0A800015B156124  /dev/nvmesh/nv02  excelero-c.admin
 nsd02      C0A800015B156124  /dev/nvmesh/nv02  excelero-d.admin
 nsd03      C0A800015B156125  /dev/nvmesh/nv03  dgx-1.admin
 nsd03      C0A800015B156125  /dev/nvmesh/nv03  excelero-a.admin
 nsd03      C0A800015B156125  /dev/nvmesh/nv03  excelero-b.admin
 nsd03      C0A800015B156125  /dev/nvmesh/nv03  excelero-c.admin
 nsd03      C0A800015B156125  /dev/nvmesh/nv03  excelero-d.admin
 nsd04      C0A800015B156126  /dev/nvmesh/nv04  dgx-1.admin
 nsd04      C0A800015B156126  /dev/nvmesh/nv04  excelero-a.admin
 nsd04      C0A800015B156126  /dev/nvmesh/nv04  excelero-b.admin
 nsd04      C0A800015B156126  /dev/nvmesh/nv04  excelero-c.admin
 nsd04      C0A800015B156126  /dev/nvmesh/nv04  excelero-d.admin

[root@excelero-a ~]# mmlsnsd -L

 File system   Disk name  NSD volume ID     NSD servers
---------------------------------------------------------------------------------------------
 gpfs1         nsd01      C0A800015B156123  (directly attached)
 gpfs1         nsd02      C0A800015B156124  (directly attached)
 gpfs1         nsd03      C0A800015B156125  (directly attached)
 gpfs1         nsd04      C0A800015B156126  (directly attached)
</syntaxhighlight>
The output of mmlsnsd -L should be checked on every system as a sanity check of the configuration.
=== File system creation ===

Once all NSDs are created, use the mmcrfs command to create a file system. The minimum invocation of this command is of the form
<syntaxhighlight>
mmcrfs fs_name -F nsd_stanza_file
</syntaxhighlight>
It is worth reading the mmcrfs man page to get an idea of what options are available, as some of them cannot be changed after the file system is created. Some common options include:
<syntaxhighlight>
-A  Auto-mount the file system when GPFS daemon starts
-B  File system block size
-j  Block allocation map (scatter is recommended for flash storage)
-m  Default metadata replication factor
-M  Maximum metadata replication factor
-r  Default data replication factor
-R  Maximum data replication factor
-n  Estimated number of nodes that will mount the file system
</syntaxhighlight>
A good baseline that has been shown to work is
<syntaxhighlight>
mmcrfs gpfs1 -F nsd.stanza -A no -B 4m -D posix -j scatter -m 1 -M 1 -r 1 -R 1 -n 1 -E no -k posix -S yes
</syntaxhighlight>
After the file system is created, mount it on all nodes using the mmmount command.
<syntaxhighlight>
# Replace gpfs1 with the name of the file system
mmmount gpfs1 -a
</syntaxhighlight>
If at this point any client fails to mount the file system and reports a stale file handle, it is most likely because that client is not recognising the NVMesh volume as a valid target. Recheck that the /var/mmfs/etc/nsddevices script was added to the failing client and that its contents are correct. Check the output of mmdevdiscover on the client to confirm that the block devices are visible, and then try remounting the GPFS file system locally using
<syntaxhighlight>
mmmount gpfs1
</syntaxhighlight>
= Optimisations and Performance Tuning =

=== Multiple NSDs ===
During testing it was found that having one large NVMesh volume striped across all servers limited throughput. To get the most throughput possible, it is recommended to create a separate NVMesh volume for each Excelero server and let GPFS stripe across them. In our test configuration that meant that each Excelero server exported a volume consisting of 4 NVMe drives in RAID 0, and each of these volumes was presented to GPFS as a separate NSD.
=== GPFS Tuning ===
Since Excelero is responsible for sharing all of the block devices, the required GPFS tuning is reduced to a small set of parameters.

GPFS is a smart file system that tries to auto-tune itself for the best possible performance, but we can override some defaults to help the tuning algorithm find optimal values. It is important to note that any values set in GPFS are treated as guidelines rather than hard limits; GPFS may change them based on the tuning of other parameters.
The first thing to look at is caching and prefetching. With NVMesh we want to avoid any caching or prefetching of files. The parameter 'prefetchAggressiveness' determines the prefetching behaviour of GPFS. By default it has a value of 2, which means GPFS prefetches a file if the first access occurs at offset zero in the file, or if the second access is sequential. To tell GPFS not to prefetch any files, set 'prefetchAggressiveness=0'.

GPFS also tries to limit the amount of IO going to each server, to avoid overloading servers and causing IO requests to queue. This limit is controlled by 'maxMBpS'. The recommendation is to set it to twice the network rate, up to its maximum of 100000MB/s. This is not a hard limit, so even if this maximum value is lower than what the network or servers are actually capable of, it won't affect the overall throughput.

Along with the maximum bandwidth, GPFS tries to estimate the expected throughput based on the number of LUNs a server has attached. This won't work with NVMesh, because each LUN that GPFS sees in reality consists of multiple drives. It is recommended to set 'ignorePrefetchLUNCount=yes', which instructs GPFS not to rely on the LUN count when estimating throughput.
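All three settings can be applied cluster-wide with mmchconfig. A minimal sketch, using the 100000MB/s cap mentioned above for maxMBpS (substitute twice your own network rate) and the -i flag so the change is immediate and permanent:
<syntaxhighlight>
# Disable prefetching, raise the per-node throughput hint,
# and stop GPFS estimating throughput from the LUN count
mmchconfig prefetchAggressiveness=0,maxMBpS=100000,ignorePrefetchLUNCount=yes -i
</syntaxhighlight>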
GPFS uses multiple threads to handle IO requests in parallel, organised into IO queues; each queue is dedicated to processing either small or large IO requests. By default a large IO request is any IO larger than 64k. We can control the total number of queues, the ratio of small to large queues, and the number of threads in each queue to fine-tune the system for specific workloads. The following parameters control this:
<syntaxhighlight>
nsdSmallThreadRatio: Ratio of small IO queues to large IO queues
nsdThreadsPerQueue:  Number of threads in each IO queue
nsdMaxWorkerThreads: Total number of NSD worker threads
</syntaxhighlight>
For reference, IBM recommend the following as a guideline for a general-use system:
<syntaxhighlight>
nsdSmallThreadRatio=1
nsdThreadsPerQueue=12
nsdMaxWorkerThreads=480
</syntaxhighlight>
This configuration results in a total of 40 queues (480 threads / 12 threads per queue), 20 dedicated to handling small IO requests and 20 to large ones.
For an NVMesh system that is optimised for large IO and throughput, the following configuration was used:
<syntaxhighlight>
nsdSmallThreadRatio=0
nsdThreadsPerQueue=24
nsdMaxWorkerThreads=2040
</syntaxhighlight>
This provides 85 queues (2040 / 24), each with 24 threads, all dedicated to handling large IO requests. This is good for a system that needs to deliver high throughput, but it sacrifices IOPS and small IO performance.
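Applied with mmchconfig, the throughput-optimised configuration above would look like the sketch below; in our experience the NSD queue parameters only take effect once the GPFS daemon is restarted, so no -i flag is used.
<syntaxhighlight>
mmchconfig nsdSmallThreadRatio=0,nsdThreadsPerQueue=24,nsdMaxWorkerThreads=2040
# Restart GPFS on all nodes for the queue settings to take effect
mmshutdown -a && mmstartup -a
</syntaxhighlight>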
Unfortunately there are no real guidelines on how these should be set, and it is not always predictable how changing them will affect the file system, so it is necessary to test performance after altering them to ensure that the desired performance levels are still being met.
The final tunable parameter is workerThreads. This sets the total number of threads that the GPFS daemon should use, and changing it will change the value of several other parameters. The maximum value is 8192, and any value between 4096 and 8192 proved to perform well during testing. If it is set too high, GPFS may auto-tune it to a lower value to better suit the other parameters. For reference, the best benchmark result in testing was achieved with
<syntaxhighlight>
workerThreads=6141
</syntaxhighlight>
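Because GPFS may silently retune these values, it is worth checking what is actually in effect after a change; a quick sketch:
<syntaxhighlight>
mmlsconfig workerThreads                   # configured value
mmdiag --config | grep -i workerThreads   # value in effect on the local daemon
</syntaxhighlight>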
=== NVMesh Client Tuning ===
NVMesh allows for some tuning through kernel module parameters. Despite GPFS being configured not to prefetch, it will more than likely still perform some form of read-ahead, which causes a drop in performance. The max_ios_per_cpu module parameter can be used to throttle IO requests, which effectively makes it impossible for GPFS to prefetch.

By default NVMesh allows each CPU to queue 64 IO requests at any one time. At first glance this looks beneficial, but it ultimately degrades performance, as it enables GPFS to read ahead, which causes caching. Ideally we want NVMesh processing blocks at the same rate as GPFS, preventing GPFS from even attempting to prefetch data. NVMesh recommend setting max_ios_per_cpu to 8 as a base figure. In testing we found that this is still too high, particularly on a client with a high core count. It is a good idea to start with a value of 1 and increment from there until performance starts to drop off again. In a test configuration with a single DGX-1 as the client, optimal performance was achieved with a value of 3.
This only needs to be set on NVMesh clients. It can be set on the fly with
<syntaxhighlight>
echo 3 > /sys/module/nvmeibc/parameters/max_ios_per_cpu
</syntaxhighlight>
which allows for easy retesting, as the client does not need to be restarted.

Once a good value is found, it can be set permanently (substituting the value found during testing) by running
<syntaxhighlight>
echo "options nvmeibc max_ios_per_cpu=3" >> /etc/modprobe.d/nvmesh.conf
</syntaxhighlight>
on each client. The modprobe option takes effect the next time the nvmeibc module is loaded.
=== Benchmark ===

The following is an example fio job file that was used to benchmark IO from a single DGX-1 client, along with the achieved throughput.
<syntaxhighlight>
[global]
ioengine=libaio
direct=1
iodepth=1
invalidate=1
time_based
runtime=300
norandommap
randrepeat=0
log_avg_msec=1000
group_reporting

[gpfs1-read]
rw=read
blocksize=4m
size=2T
filename=/gpfs/gpfs1/fio
numjobs=128
stonewall
</syntaxhighlight>

<syntaxhighlight>
gpfs1-read: (g=0): rw=read, bs=4M-4M/4M-4M/4M-4M, ioengine=libaio, iodepth=1
...
fio-2.2.10
Starting 128 processes
Jobs: 128 (f=128): [R(128)] [100.0% done] [44100MB/0KB/0KB /s] [11.3K/0/0 iops] [eta 00m:00s]
gpfs1-read: (groupid=0, jobs=128): err= 0: pid=58735: Tue Jun 5 10:47:22 2018
read : io=12869GB, bw=43923MB/s, iops=10980, runt=300014msec
slat (usec): min=92, max=686135, avg=1035.58, stdev=3522.99
clat (usec): min=2, max=80858, avg=10616.33, stdev=4816.44
lat (msec): min=1, max=692, avg=11.65, stdev= 5.90
clat percentiles (usec):
| 1.00th=[ 3824], 5.00th=[ 5344], 10.00th=[ 6176], 20.00th=[ 7264],
| 30.00th=[ 8096], 40.00th=[ 8768], 50.00th=[ 9536], 60.00th=[10432],
| 70.00th=[11456], 80.00th=[13120], 90.00th=[16064], 95.00th=[19584],
| 99.00th=[28800], 99.50th=[33024], 99.90th=[44288], 99.95th=[48896],
| 99.99th=[58624]
bw (KB /s): min= 5818, max=498714, per=0.78%, avg=351823.75, stdev=28459.03
lat (usec) : 4=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.06%, 4=1.16%, 10=53.67%, 20=40.53%, 50=4.54%
lat (msec) : 100=0.04%
cpu : usr=0.07%, sys=2.75%, ctx=28334724, majf=0, minf=132935
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=3294347/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: io=12869GB, aggrb=43923MB/s, minb=43923MB/s, maxb=43923MB/s, mint=300014msec, maxt=300014msec
</syntaxhighlight>