Setup HPC HA (Platform HPC 3.0)

System Setup
  • Detailed instructions are in the platform_hpl_install.pdf file
  • Only tested with HA enabled after system installation, not during it

In short:

  1. Install a cluster OS
  2. Install Platform HPC on the OS (don't enable HA)
  3. Install a cluster NFS server
  4. Install the failover nodes
  5. Set up HA
Install OS
  • Note:
    • System must have an FQDN
    • Configure both eth0 and eth1
    • Stick with the installation defaults (didn't add server packages)
    • Stop iptables (service iptables stop)
    • Disable SELinux (on first boot)
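
The FQDN requirement in the notes above can be checked before starting the installer. The following is a minimal sketch; the function name is_fqdn and the domain example.com are illustrative assumptions, not part of Platform HPC, and the check is purely syntactic (it does not consult DNS).

```shell
# is_fqdn NAME: succeed if NAME looks fully qualified, i.e. it has at
# least two non-empty labels. A single trailing dot is tolerated.
# Hypothetical helper, syntactic check only.
is_fqdn() {
    name=${1%.}                       # drop one trailing dot, if any
    case "$name" in
        ""|.*|*.|*..*) return 1 ;;    # empty name or empty label
        *.*)           return 0 ;;    # e.g. pcmha.example.com
        *)             return 1 ;;    # short name such as pcmha
    esac
}

# Example use on the head node (commented out so that sourcing this
# snippet has no side effects):
#   is_fqdn "$(hostname -f)" || echo "hostname is not fully qualified" >&2
```

If the check fails, fix the hostname before running the installer.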


Suggested Partition Table

Size  Mounted on
50G   /           # space for /opt
6G    /var
50G   /depot      # space for repos and updates
REST  /data       # local scratch on all nodes
Install Platform HPC
  • Copy over the hpc30-xxx.iso file, the OS-DVD.iso file and the license file
mount -o loop hpc30-1234.rhel.iso /mnt
/mnt/pcm-installer
  • The questions are fairly straightforward. NOTE: say no to HA at this stage
  • Reboot the headnode after installation
Add NFS Server Group
  • Use a combination of ngedit / netedit (if additional networking is required)
  • Use CFM to create the exports file
# file: /etc/cfm/nfs-centos-5.6-x86_64/etc/exports
/data/home 172.20.0.0/255.255.0.0(rw,async,no_root_squash)
/data/app 172.20.0.0/255.255.0.0(rw,async,no_root_squash)
  • Add a post script to enable NFS on boot
chkconfig --level 345 nfs on
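
The export entries above all follow the same pattern, so they can be generated rather than typed. This is a small sketch; the helper name export_line is an assumption made for illustration, and the options are copied from the listing above.

```shell
# export_line DIR NETWORK NETMASK: print one exports entry using the
# same options as the listing above. Hypothetical helper.
export_line() {
    printf '%s %s/%s(rw,async,no_root_squash)\n' "$1" "$2" "$3"
}

# Print the two entries used on this page; redirect the output into the
# CFM-managed exports file for the NFS node group as needed.
for dir in /data/home /data/app; do
    export_line "$dir" 172.20.0.0 255.255.0.0
done
```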
Enable HA on headnode
  • Install hpc-ha-1.0.2.rpm on headnode
hpc-ha-tool setup 
# provide the necessary details: 
#   virtual external IP address
#   virtual internal IP address
#   NFS home directory location, e.g. nas000:/data/home
#   NFS application directory location, e.g. nas000:/data/app
#
# See example below:
[root@hpcha1 ~]# hpc-ha-tool setup
Do you wish to enable HPC HA (y/n) [n] y
Please input virtual IP address for network 172.28.0.0/255.255.0.0
    172.28.10.67
Please input virtual IP address for network 172.20.0.0/255.255.0.0
    172.20.0.5
Please input a NFS path for setting up /home directory:
    nas000:/data/home
Please input a NFS path for setting up APP directory:
    nas000:/data/app

Generating configuration files...
succeed!

Syncing up configurations to HPC nodes...
done!
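
The Problems section of this page notes that IPs have to be above the headnode IP. Reading that as "numerically higher", the addresses can be compared before answering the prompts. This is a sketch under that assumption; ip_to_int and ip_above are hypothetical helpers, and 172.20.0.1 in the usage comment is a made-up headnode address.

```shell
# ip_to_int A.B.C.D: print the dotted-quad address as one integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1                 # split the address into four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# ip_above A B: succeed if address A is numerically higher than B.
ip_above() {
    [ "$(ip_to_int "$1")" -gt "$(ip_to_int "$2")" ]
}

# Example (172.20.0.1 is an assumed headnode IP, not from this page):
#   ip_above 172.20.0.5 172.20.0.1 || echo "virtual IP is not above headnode IP" >&2
```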
  • Install the failover headnode using addhost (an option for the failover node should now be present)
  • Apply the license to the failover headnode (default name: installer001)
  • Then turn on HA and test it
# turn auto HA on
kusu-failmode -m auto

# LSF also had to be set up for failover
/etc/rc.kusu.d/S11lsf-genconfig
Verify HA is working
[root@pcmha ~]# kusu-failinfo 
Installer node is currently set to: pcmha [Online]
Failover node is currently set to: installer001 [Online]
Failover mode is currently set to: Auto
KusuInstaller services currently running on: pcmha
[root@pcmha ~]# hpc-ha-tool status
Testing whether HPC HA enabled ... ok
Testing HPC HA configures ... ok
Testing failover backup node ... ok
Testing heartbeat status ... ok
Testing pacemaker status ... ok
Testing HPC database ... ok
Testing float IP addresses ... ok
Testing NFS mount points ... ok
Testing failover mode ... ok
Testing Kusu resource status ... ok
Testing isf-ac daemon status ... ok
Testing LSF daemon status ... ok

HPC HA is ready.
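
When the status check is run from a script, it helps to pick out any test that did not pass. The sketch below filters the output shown above; it assumes a failing test ends its "Testing ..." line with something other than "ok", which this page does not show, and failed_checks is a hypothetical name.

```shell
# failed_checks: read `hpc-ha-tool status` output on stdin and print
# every "Testing ..." line whose last word is not "ok".
# Returns 1 if any such line was found, 0 otherwise.
failed_checks() {
    awk '/^Testing / && $NF != "ok" { print; bad = 1 } END { exit bad + 0 }'
}

# Example use on the head node (commented out; needs hpc-ha-tool):
#   hpc-ha-tool status | failed_checks || echo "HA checks failed" >&2
```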
Test HA is working
kusu-failto
Problems!
  • IPs have to be above the headnode IP
  • Disk partitions seem a bit funky on the failover node
  • kusu-failto reported a failure:
[root@pcmha ~]# kusu-failto   
Are you sure you wish to failover from node 'pcmha' to node 'installer001'? [<y/N>]: y
Installer Services running on 'pcmha'
Syncing and configuring database...
Starting kusu. This may take a while...
   Starting initial network configuration                  [  OK  ] 
   Generating hosts, hosts.equiv, and resolv.conf          [  OK  ] 
   Config mail mechanism for kusu                          [  OK  ] 
   Setting up SSH host file                                [  OK  ] 
   Setting up user skel files                              [  OK  ] 
   Setting up network routes                               [  OK  ] 
   Setting up syslog on PCM installer                      [  OK  ] 
   Running S11lsf-genconfig                                [  OK  ] 
   Increasing ulimit memlock                               [  OK  ] 
   Setting npm service for HPC HA                          [  OK  ] 
   Running S70SetupPCMGUI.sh                               [  OK  ] 
   Post actions when failover                              [  OK  ] 
   Setting up fstab for home directories                   [  OK  ] 
   Synchronizing System configuration files                [FAILED] 
   Checking compatibility of OFED Kernel module            [  OK  ] 
   Starting initial configuration procedure                [  OK  ] 

Installer Services now running on 'installer001'
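
The single failed step is easy to miss in the listing above. A small filter can pull out only the steps marked [FAILED] from saved kusu-failto output. This is a sketch based on the output format shown on this page; failed_steps is a hypothetical name.

```shell
# failed_steps: read kusu-failto output on stdin and print the name of
# every step whose result column is [FAILED], one per line.
failed_steps() {
    sed -n 's/^ *\(.*[^ ]\) *\[FAILED\] *$/\1/p'
}

# Example use with output saved to a file (file name is illustrative):
#   failed_steps < failto.log
```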

[root@installer001 ~]# bhosts 
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV 
compute000         closed          -      1      0      0      0      0      0
installer001       ok              -     12      0      0      0      0      0
pcmha              closed          -      1      0      0      0      0      0

[root@pcmha kusu]# bhosts 
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV 
compute000         ok              -     12      0      0      0      0      0
installer001       closed          -      1      0      0      0      0      0
pcmha              ok              -     12      0      0      0      0      0
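
The two bhosts listings above disagree about which hosts are closed, depending on the node they were run from. To compare them quickly, the closed hosts can be extracted from each listing. This is a sketch that assumes the default bhosts column layout shown above; closed_hosts is a hypothetical name.

```shell
# closed_hosts: read `bhosts` output on stdin and print the name of
# every host whose STATUS column starts with "closed".
closed_hosts() {
    awk 'NR > 1 && $2 ~ /^closed/ { print $1 }'
}

# Example use on either headnode (commented out; needs LSF):
#   bhosts | closed_hosts
```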

==> kusu-nodeheartbeatd.log <==
2011-07-13 16:19:19 ERROR Failed to report 'run' operation state to installer.