Genomics England
Dragen nodes
General info
- also known as "Edico" nodes
- no RAID in Dragen nodes (only one disk)
- a local NVMe drive in each (used as cache for data)
- boot interface on enp134s0f0
Dragen migration to P2 (Helix)
First, wait for GEL to physically move Dragen boxes from P1 to P2.
To proceed you will need the following:
- iDRAC IP addresses of nodes accessible from P2
- provisioning network interfaces connected to the public304 provisioning network
- a confirmation of whether the node has to go to the dev or prod cluster
When you get the iDRAC address and credentials, log in to https://<idrac-address> and change the boot order so that PXE/network boot from a 10G card is first on the list. Also, write down the MAC address of the first PCI network interface (the one this node will be booting from).
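If remote racadm is available, the boot NIC's MAC address can also be read from the CLI instead of the web UI. This is an optional sketch; the FQDD of the 10G card (NIC.Slot.1-1-1 here) is hypothetical and will differ per machine:
$ racadm -r <idrac-address> -u <idrac-user> -p <idrac-password> hwinventory NIC
$ racadm -r <idrac-address> -u <idrac-user> -p <idrac-password> hwinventory NIC.Slot.1-1-1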
When the rest of the requirements are fulfilled, you can add Dragen nodes to Ironic:
# openstack image list
...
| 36a1c5fc-ff1c-40dc-9b87-75197e73257a | ironic-deploy_kernel | active |
| e73198d9-605b-45a8-84cb-c828599e59ca | ironic-deploy_ramdisk | active |
...
# openstack network list
...
| ab24b469-e07d-44ca-8bee-5c24d6c455e4 | public304 | a00b07c1-0a1b-4a36-91cb-5e0d51ec9258 |
...
# openstack baremetal node create --name edico-dragen016 --driver idrac --driver-info drac_address=10.6.6.44 --driver-info drac_username=<idrac-user> --driver-info drac_password=<idrac-password> --driver-info cleaning_network=ab24b469-e07d-44ca-8bee-5c24d6c455e4 --driver-info provisioning_network=ab24b469-e07d-44ca-8bee-5c24d6c455e4 --driver-info deploy_kernel=36a1c5fc-ff1c-40dc-9b87-75197e73257a --driver-info deploy_ramdisk=e73198d9-605b-45a8-84cb-c828599e59ca --resource-class baremetal --network-interface flat
# openstack baremetal node list | grep edico
| ca6bd4f3-54f2-4877-8fba-db86f691a849 | edico-dragen016 | None | None | enroll | False |
# openstack baremetal port create <10g-interface-mac> --node ca6bd4f3-54f2-4877-8fba-db86f691a849
# openstack baremetal node manage edico-dragen016
# openstack baremetal node provide edico-dragen016
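Between the manage and provide steps you can poll the node's provision state with the standard Ironic CLI; it should move from enroll through manageable and cleaning to available:
# openstack baremetal node show edico-dragen016 -f value -c provision_state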
Then provision the newly added node with an operating system:
# openstack server create --image centos7-1907-dhcp-on-enp134s0f0-raid --flavor baremetal.small --security-group ping-and-ssh --key-name mykey --network public304 <t/p>hpgridzdragXXX
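Deployment takes a while; you can watch the instance status with the standard server commands until it reaches ACTIVE (instance name as used above):
# openstack server show <t/p>hpgridzdragXXX -f value -c status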
After provisioning, log into the instance as the centos user and add this public key to root's ~/.ssh/authorized_keys for passwordless SSH from the HPC controller:
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIBDEKTyKSRBpHcjgG16LF5mav11lEwbot1lmTPjvZPr6 cluster key
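A minimal way to do this, assuming you can already SSH in as centos with the key from the server create step (node address hypothetical):
$ ssh centos@<node-address>
$ sudo mkdir -p /root/.ssh && sudo chmod 700 /root/.ssh
$ echo 'ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIBDEKTyKSRBpHcjgG16LF5mav11lEwbot1lmTPjvZPr6 cluster key' | sudo tee -a /root/.ssh/authorized_keys
$ sudo chmod 600 /root/.ssh/authorized_keys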
The next step is to run Dragen deployment using the Dragen role (https://gitlab.vscaler.com/mkarpiarz/ansible-dragen-role) and trinityX playbooks residing on the HPC controller (vcontroller) in the main environment.
For this, you will have to add the new baremetal instance to DNS first. From GEL's controller001, get into the HPC controller and run the following commands:
# ssh vc
$ cd /opt/vScaler/site/
$ vim /etc/hosts
Here, add the IP and the name of the new Dragen node in the "Dragen boxes" section.
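For example, an entry in that section could look like this (IP and hostname hypothetical):
10.141.255.16   hpgridzdrag016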
$ vim hosts
Here, add only the name of the node to the dragen_dev (dev cluster) or dragen (production cluster) group.
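For example, for a production node the relevant group could look like this (hostname hypothetical):
[dragen]
hpgridzdrag016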
And then run this playbook to update the DNS server:
$ ansible-playbook -vv controller.yml -t bind
Ping the new node from the controller using its name to confirm the new DNS entry.
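For example (hostname hypothetical):
$ ping -c 3 hpgridzdrag016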
With the node added to DNS, run the Dragen deployment with this command:
$ ansible-playbook -vv dragen.yml
Then run the compute playbook to add mounts, Slurm workers, etc. on Dragen nodes:
$ ansible-playbook -vv static-compute.yml -l dragen,dragen_dev
Also, remember to install Datadog on the new node by running this playbook:
$ ansible-playbook -vv datadog-agent-install.yaml
Limit this command to the name of the new node (with -l, as shown below) to speed up the playbook run.
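For example (hostname hypothetical):
$ ansible-playbook -vv datadog-agent-install.yaml -l hpgridzdrag016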
Set up SSSD for AD
Finally, set up SSSD (also a manual step). Copy /etc/krb5.keytab, /etc/sssd/sssd.conf and /etc/pki/ca-trust/source/anchors/cluster-ca.crt from the HPC controller (or any compute node) and put them in the same location on the Dragen node. Then run (as root) the following commands:
# yum install sssd realmd -y
# chmod 600 /etc/sssd/sssd.conf
# update-ca-trust
# chown root:root /etc/krb5.keytab
# restorecon /etc/krb5.keytab
# systemctl restart sssd
# systemctl enable sssd
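To confirm the node can now resolve AD accounts, look up a known user (username hypothetical):
# id jbloggs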
Add node to LSF
Next, set up LSF on the node; this is currently a manual procedure and there is no playbook for it. First, set up and run the installer:
# mkdir -p /hpc/lsfadmin/lsf
# mount -t nfs corwekanfs.int.corp.gel.ac:/hpc/lsfadmin/lsf /hpc/lsfadmin/lsf
# ln -s /hpc/lsfadmin/lsf /usr/share/lsf
# ln -s /usr/share/lsf/conf/profile.lsf /etc/profile.d/lsf.sh
# /usr/share/lsf/10.1/install/hostsetup --top=/usr/share/lsf --boot=y
Log out and in again (or start a new bash session) so you can run LSF commands without having to specify full paths, then check the status of the LSF daemons:
# lsf_daemons status
Then log into one of the LSF masters (dev or prod, depending on which cluster the node goes to) and add your new node to either /usr/share/lsf/conf/lsf.cluster.cluster (for prod) or /usr/share/lsf/conf/lsf.cluster.dev (for dev). Make sure that the name of the node in LSF matches the hostname of the node.
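For reference, an entry in the Host section of the cluster file might look like the sketch below; the hostname is hypothetical and the column layout should be copied from the existing lines in the file rather than from here:
Begin   Host
HOSTNAME         model   type   server  r1m   mem   swp   RESOURCES
...
hpgridzdrag016   !       !      1       3.5   ()    ()    ()
End     Host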
Next, reconfigure LSF:
# lsadmin reconfig
# badmin mbdrestart
Check the status of your node in LSF by running:
# bhosts -w
In case of problems, all LSF logs are stored in /usr/share/lsf/log/.
Finally, log into your node again and run the following to start LSF processes and confirm the node is in the cluster and is available:
# systemctl start lsfd
# systemctl enable lsfd
# lsf_daemons start
# lsf_daemons status
# bhosts -w
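If all went well, bhosts should list the new node with status ok; a hypothetical example line:
HOST_NAME        STATUS   JL/U   MAX   NJOBS   RUN   SSUSP   USUSP   RSV
hpgridzdrag016   ok          -    24       0     0       0       0     0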