Genomics England

Dragen nodes

General info

  • also known as "Edico" nodes
  • no RAID in Dragen nodes (only one disk)
  • a local NVMe drive in each node (used as a cache for data)
  • boot interface on enp134s0f0


Dragen migration to P2 (Helix)

First, wait for GEL to physically move Dragen boxes from P1 to P2.

To proceed you will need the following:

  • iDRAC IP addresses of nodes accessible from P2
  • provisioning network interfaces connected to the public304 provisioning network
  • a confirmation of whether the node has to go to the dev or prod cluster

When you get the iDRAC address and credentials, log in to https://<idrac-address> and change the boot order so that PXE/network boot from a 10G card is first on the list. Also, write down the MAC address of the first PCI network interface (the one this node will be booting from).
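
If IPMI over LAN is enabled on the iDRAC, the boot device can also be set from a shell instead of the web UI. This is only a hedged alternative; the address and credentials below are placeholders:

$ ipmitool -I lanplus -H <idrac-address> -U <idrac-user> -P <idrac-password> chassis bootdev pxe options=persistent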

When the rest of the requirements are fulfilled, you can add the Dragen nodes to Ironic:

# openstack image list
...
| 36a1c5fc-ff1c-40dc-9b87-75197e73257a | ironic-deploy_kernel                                | active |
| e73198d9-605b-45a8-84cb-c828599e59ca | ironic-deploy_ramdisk                               | active |
...
# openstack network list
...
| ab24b469-e07d-44ca-8bee-5c24d6c455e4 | public304                                          | a00b07c1-0a1b-4a36-91cb-5e0d51ec9258 |
...
# openstack baremetal node create --name edico-dragen016 --driver idrac --driver-info drac_address=10.6.6.44 --driver-info drac_username=<idrac-user> --driver-info drac_password=<idrac-password> --driver-info cleaning_network=ab24b469-e07d-44ca-8bee-5c24d6c455e4 --driver-info provisioning_network=ab24b469-e07d-44ca-8bee-5c24d6c455e4 --driver-info deploy_kernel=36a1c5fc-ff1c-40dc-9b87-75197e73257a --driver-info deploy_ramdisk=e73198d9-605b-45a8-84cb-c828599e59ca --resource-class baremetal --network-interface flat
# openstack baremetal node list | grep edico
| ca6bd4f3-54f2-4877-8fba-db86f691a849 | edico-dragen016 | None                                 | None        | enroll             | False       |
# openstack baremetal port create <10g-interface-mac> --node ca6bd4f3-54f2-4877-8fba-db86f691a849
# openstack baremetal node manage edico-dragen016
# openstack baremetal node provide edico-dragen016
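
To confirm the node has finished enrolling and cleaning, check its provision state before moving on (it should eventually report "available"):

# openstack baremetal node show edico-dragen016 -f value -c provision_state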

Then provision the newly added node with an operating system:

# openstack server create --image centos7-1907-dhcp-on-enp134s0f0-raid --flavor baremetal.small --security-group ping-and-ssh --key-name mykey --network public304 <t/p>hpgridzdragXXX
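
Deployment on baremetal can take a while; you can poll the instance until its status goes to ACTIVE, for example:

# openstack server show <t/p>hpgridzdragXXX -f value -c status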

After provisioning, log into the instance as the centos user and add this public key to root's ~/.ssh/authorized_keys for passwordless SSH from the HPC controller:

ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIBDEKTyKSRBpHcjgG16LF5mav11lEwbot1lmTPjvZPr6 cluster key
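
A minimal sketch of that step, run as the centos user on the new instance (using the cluster key shown above):

$ sudo mkdir -p /root/.ssh && sudo chmod 700 /root/.ssh
$ echo 'ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIBDEKTyKSRBpHcjgG16LF5mav11lEwbot1lmTPjvZPr6 cluster key' | sudo tee -a /root/.ssh/authorized_keys
$ sudo chmod 600 /root/.ssh/authorized_keys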

The next step is to run the Dragen deployment using the Dragen role (https://gitlab.vscaler.com/mkarpiarz/ansible-dragen-role) and the trinityX playbooks residing on the HPC controller (vcontroller) in the main environment.

For this, you will first have to add the new baremetal instance to DNS. From GEL's controller001, get into the HPC controller and run the following commands:

# ssh vc
$ cd /opt/vScaler/site/
$ vim /etc/hosts

Here, add the IP and the name of the new Dragen node in the "Dragen boxes" section.

$ vim hosts

Here, add only the name of the node to the dragen_dev (dev cluster) or dragen (production cluster) group. Then run this playbook to update the DNS server:

$ ansible-playbook -vv controller.yml -t bind
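
For reference, the edits in the two files above boil down to one line each. The IP address, hostname and inventory format (INI-style assumed) below are illustrative only; use [dragen] instead of [dragen_dev] for a production node:

# /etc/hosts -- "Dragen boxes" section
10.0.0.123    <new-dragen-node>

# hosts -- Ansible inventory
[dragen_dev]
<new-dragen-node>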

Ping the new node from the controller using its name to confirm the new DNS entry.

With the node added to DNS, run the Dragen deployment with this command:

$ ansible-playbook -vv dragen.yml

Then run the compute playbook to add mounts, Slurm workers, etc. on the Dragen nodes:

$ ansible-playbook -vv static-compute.yml -l dragen,dragen_dev

Also, remember to install Datadog on the new node by running this playbook:

$ ansible-playbook -vv datadog-agent-install.yaml

Limit this command to the name of the new node to speed up the execution of this playbook.
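
For example (the node name is a placeholder):

$ ansible-playbook -vv datadog-agent-install.yaml -l <new-node-name>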

Set up SSSD for AD

Finally, set up SSSD (also a manual step). Copy /etc/krb5.keytab, /etc/sssd/sssd.conf and /etc/pki/ca-trust/source/anchors/cluster-ca.crt from the HPC controller (or any compute node) and put them in the same locations on the Dragen node.
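
A hedged sketch of the copy step, run as root from the HPC controller (the node name is a placeholder; the target directories are created first in case they do not exist yet):

# ssh root@<new-dragen-node> mkdir -p /etc/sssd /etc/pki/ca-trust/source/anchors
# scp /etc/krb5.keytab root@<new-dragen-node>:/etc/krb5.keytab
# scp /etc/sssd/sssd.conf root@<new-dragen-node>:/etc/sssd/sssd.conf
# scp /etc/pki/ca-trust/source/anchors/cluster-ca.crt root@<new-dragen-node>:/etc/pki/ca-trust/source/anchors/cluster-ca.crt

Then run (as root) the following commands on the Dragen node: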

# yum install sssd realmd -y
# chmod 600 /etc/sssd/sssd.conf
# update-ca-trust
# chown root:root /etc/krb5.keytab
# restorecon krb5.keytab
# systemctl restart sssd
# systemctl enable sssd
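
Once sssd is up, a quick sanity check is to resolve a known AD account on the node (the username here is a placeholder):

# id <ad-username>
# getent passwd <ad-username>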

Add node to LSF

Next, set up LSF on the node; this is currently a manual procedure and there is no playbook for it. First, set up and run the installer:

# mkdir -p /hpc/lsfadmin/lsf
# mount -t nfs corwekanfs.int.corp.gel.ac:/hpc/lsfadmin/lsf /hpc/lsfadmin/lsf
# ln -s /hpc/lsfadmin/lsf /usr/share/lsf
# ln -s /usr/share/lsf/conf/profile.lsf /etc/profile.d/lsf.sh
# /usr/share/lsf/10.1/install/hostsetup --top=/usr/share/lsf --boot=y

Log out and in again (or start a new bash session) so you can run LSF commands without having to specify full paths, then check the status of the LSF daemons:

# lsf_daemons status
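
Alternatively, you can source the LSF profile in the current shell instead of re-logging; this uses the symlink created above:

# . /etc/profile.d/lsf.sh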

Then log into one of the LSF masters (dev or prod, depending on which cluster the node has to go to) and add your new node to either /usr/share/lsf/conf/lsf.cluster.cluster (for prod) or /usr/share/lsf/conf/lsf.cluster.dev (for dev). Make sure that the name of the node in LSF matches the hostname of the node.
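
The safest approach is to copy an existing worker's line in the Host section and change only the hostname. As a rough, hedged illustration of what such an entry can look like (column values are site-specific):

Begin   Host
HOSTNAME           model   type    server  r1m     mem     swp     RESOURCES
...
<new-dragen-node>  !       !       1       3.5     ()      ()      ()
End     Host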

Next, reconfigure LSF:

# lsadmin reconfig
# badmin mbdrestart

Check the status of your node in LSF by running:

# bhosts -w

In case of problems, all LSF logs are stored in /usr/share/lsf/log/.
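
The daemons write per-host log files there, so something like this is a good place to start (file names vary with the LSF version and the hostname):

# ls /usr/share/lsf/log/ | grep $(hostname)
# tail -n 50 /usr/share/lsf/log/sbatchd.log.$(hostname)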

Finally, log into your node again and run the following to start the LSF processes and confirm the node is in the cluster and available:

# systemctl start lsfd
# systemctl enable lsfd
# lsf_daemons start
# lsf_daemons status
# bhosts -w