OpenOnDemand
Setting up Open OnDemand
What is it?
Open OnDemand is a web portal that runs web applications to add ease-of-use functionality to any compute cluster.
The deployment is now pretty much handled by the stackhpc/osc playbooks, BUT I'm adding some extra notes to document how to do stuff by hand to tweak an existing setup.
Switch OOD to use FreeIPA and ondemand-dex
On the portal node (here login-1), the first step is to install the ondemand-dex package:
dnf install -y ondemand-dex
Now we edit the /etc/ood/config/ood_portal.yml file.
Remove or comment out all httpd-auth lines like these:
Stop httpd from listening on port 80 and use certbot to create a trusted SSL cert that renews automatically
First up, make sure you ARE allowing port 80 through the firewall and security groups.
Next, comment out all lines in /etc/httpd/conf.d/welcome.conf; it should look like this. Do not delete the file or it will come back from the dead when you update Apache.
[root@login-1 certs]# cat /etc/httpd/conf.d/welcome.conf
#
# This configuration file enables the default "Welcome" page if there
# is no default index page present for the root URL. To disable the
# Welcome page, comment out all the lines below.
#
# NOTE: if this file is removed, it will be restored on upgrades.
#
#<LocationMatch "^/+$">
#    Options -Indexes
#    ErrorDocument 403 /.noindex.html
#</LocationMatch>
#
#<Directory /usr/share/httpd/noindex>
#    AllowOverride None
#    Require all granted
#</Directory>
#
#Alias /.noindex.html /usr/share/httpd/noindex/index.html
#Alias /poweredby.png /usr/share/httpd/icons/apache_pb3.png
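Commenting the file out by hand is fine, but the step can also be scripted. The following is a minimal sketch, assuming a sed that accepts `-i.bak` (GNU and BSD sed both do); `comment_out_all` is a hypothetical helper name, not something shipped with httpd or OOD.

```shell
# Prefix every non-blank line that is not already a comment with '#'.
# A backup of the original is kept next to the file as <file>.bak.
comment_out_all() {
    sed -i.bak -e '/^[[:space:]]*#/!s/^\(.\)/#\1/' "$1"
}

# Example (run as root on the portal node):
#   comment_out_all /etc/httpd/conf.d/welcome.conf
```

Lines that are already comments are left alone, so running it twice does no harm.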
Comment out Listen 80 in /etc/httpd/conf/httpd.conf; on mine it's around line 45.
Restart httpd.
Check it's not listening with systemctl status and ss -t | grep -e ":http" -e ":80"
[root@login-1 certs]# grep "Listen 80" /etc/httpd/conf/httpd.conf -n
45:#Listen 80
[root@login-1 tls]# systemctl restart httpd
[root@login-1 tls]# systemctl status httpd
● httpd.service - The Apache HTTP Server
Loaded: loaded (/usr/lib/systemd/system/httpd.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/httpd.service.d
└─ood-portal.conf, ood.conf
Active: active (running) since Tue 2023-11-07 12:41:18 GMT; 8s ago
Docs: man:httpd.service(8)
Process: 98483 ExecStartPre=/opt/ood/ood-portal-generator/sbin/update_ood_portal --rpm (code=exited, status=0/SUCCESS)
Main PID: 98499 (httpd)
Status: "Started, listening on: port 443, port 81"
Tasks: 213 (limit: 100627)
Memory: 44.2M
CGroup: /system.slice/httpd.service
├─98499 /usr/sbin/httpd -DFOREGROUND
├─98500 /usr/sbin/httpd -DFOREGROUND
├─98501 /usr/sbin/httpd -DFOREGROUND
├─98502 /usr/sbin/httpd -DFOREGROUND
└─98503 /usr/sbin/httpd -DFOREGROUND
Nov 07 12:41:18 login-1.cluster.internal systemd[1]: Starting The Apache HTTP Server...
Nov 07 12:41:18 login-1.cluster.internal update_ood_portal[98483]: No change in Apache config.
Nov 07 12:41:18 login-1.cluster.internal update_ood_portal[98483]: No change in the Dex config.
Nov 07 12:41:18 login-1.cluster.internal httpd[98499]: [Tue Nov 07 12:41:18.946520 2023] [so:warn] [pid 98499:tid 140163328375104] AH01574: module status_module is already loaded, skipping
Nov 07 12:41:18 login-1.cluster.internal systemd[1]: Started The Apache HTTP Server.
Nov 07 12:41:18 login-1.cluster.internal httpd[98499]: Server configured, listening on: port 443, port 81
[root@login-1 certs]# ss -t | grep -e :http -e :80
LAST-ACK 0 1 10.0.3.240:59462 74.125.193.99:https
ESTAB 0 0 10.0.3.240:53178 92.122.160.95:https
ESTAB 0 0 10.0.3.240:43826 104.82.170.85:https
ESTAB 0 0 [::ffff:10.0.3.240]:https [::ffff:82.1.125.18]:56555
Notice that httpd reports it is only listening on 443 (https) and 81 (I'm not so sure it really is on 81, but hey). This is good.
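Since `ss -t` lists established connections rather than listeners, a more direct check is to look at `ss -tln`, which prints listening TCP sockets with numeric ports. A small sketch of that check follows; `port_listening` is a hypothetical helper that reads `ss -tln` output on stdin, and it assumes the usual column layout where the fourth column is the local address and port.

```shell
# Succeeds (exit 0) if the given port shows up as a LISTEN socket
# in `ss -tln` output supplied on stdin, fails (exit 1) otherwise.
port_listening() {
    awk -v p="$1" '
        $1 == "LISTEN" {
            n = split($4, a, ":")      # port is the last ":"-separated piece
            if (a[n] == p) found = 1
        }
        END { exit !found }
    '
}

# Example:
#   ss -tln | port_listening 80 && echo "something still listens on port 80"
```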
Now we can install certbot with dnf install -y certbot and run it against our domain, which you have to point at the public IP: certbot certonly -d <public FQDN>. In the following example I am doing a dry run as I have run this before, but you should remove the --dry-run option.
[root@login-1 certs]# certbot certonly -d nuig-cluster.define-technology.com --dry-run
Saving debug log to /var/log/letsencrypt/letsencrypt.log
How would you like to authenticate with the ACME CA?
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1: Spin up a temporary webserver (standalone)
2: Place files in webroot directory (webroot)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Select the appropriate number [1-2] then [enter] (press 'c' to cancel): 1
Simulating renewal of an existing certificate for nuig-cluster.define-technology.com
The dry run was successful.
mine didn't add the renew job to cron or a timer so I did the following to auto renew the certs
cat >/etc/cron.daily/certbot-renew <<EOF
#!/bin/bash
/usr/bin/certbot renew &>/dev/null
EOF
chmod a+x /etc/cron.daily/certbot-renew
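One thing the cron job does not do is tell Apache about a renewed certificate, and httpd only re-reads certificate files on a restart or reload. A possible variant, sketched here rather than tested on the cluster, uses certbot's `--deploy-hook` option, which runs a command only when a certificate was actually renewed. `write_renew_job` is a hypothetical helper that takes the target path as its argument.

```shell
# Write a daily renewal script to the path given as $1 and make it executable.
write_renew_job() {
    cat > "$1" <<'EOF'
#!/bin/bash
# Renew certificates that are close to expiry. The deploy hook only fires
# when a certificate was actually renewed, so httpd is not reloaded needlessly.
/usr/bin/certbot renew --quiet --deploy-hook "systemctl reload httpd"
EOF
    chmod a+x "$1"
}

# Example (run as root):
#   write_renew_job /etc/cron.daily/certbot-renew
```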
Now you should have the certs in the /etc/letsencrypt/live/<fqdn>/ directory. I tried putting this path directly into the Apache config files but it fails; what does work is to link to these. I put the links into /etc/pki/tls/certs and /etc/pki/tls/private.
ln -s /etc/letsencrypt/live/nuig-cluster.define-technology.com/fullchain.pem /etc/pki/tls/certs/nuig-cluster.define-technology.com.crt
ln -s /etc/letsencrypt/live/nuig-cluster.define-technology.com/privkey.pem /etc/pki/tls/private/nuig-cluster.define-technology.com.key
[root@login-1 certs]# ll /etc/pki/tls/private/nuig-cluster.define-technology.com.key /etc/pki/tls/certs/nuig-cluster.define-technology.com.crt
lrwxrwxrwx. 1 root root 70 Nov 7 12:42 /etc/pki/tls/certs/nuig-cluster.define-technology.com.crt -> /etc/letsencrypt/live/nuig-cluster.define-technology.com/fullchain.pem
lrwxrwxrwx. 1 root root 68 Nov 7 12:40 /etc/pki/tls/private/nuig-cluster.define-technology.com.key -> /etc/letsencrypt/live/nuig-cluster.define-technology.com/privkey.pem
Finally, modify /etc/ood/config/ood_portal.yml, run /opt/ood/ood-portal-generator/sbin/update_ood_portal, and restart httpd and ondemand-dex.
We need to edit the ssl section. I made mine:
cat /etc/ood/config/ood_portal.yml
<...snip... />
# List of SSL Apache directives
# Example:
#     ssl:
#       - 'SSLCertificateFile "/etc/pki/tls/certs/www.example.com.crt"'
#       - 'SSLCertificateKeyFile "/etc/pki/tls/private/www.example.com.key"'
# Default: null (no SSL support)
ssl:
  #- 'SSLCertificateFile /etc/pki/tls/certs/localhost.crt'
  #- 'SSLCertificateKeyFile /etc/pki/tls/private/localhost.key'
  - 'SSLCertificateFile /etc/pki/tls/certs/nuig-cluster.define-technology.com.crt'
  - 'SSLCertificateKeyFile /etc/pki/tls/private/nuig-cluster.define-technology.com.key'
  - 'SSLProtocol all -TLSv1.1 -TLSv1 -SSLv2 -SSLv3'
  - 'SSLCipherSuite ALL:+HIGH:!ADH:!EXP:!SSLv2:!SSLv3:!MEDIUM:!LOW:!NULL:!aNULL'
  - 'SSLHonorCipherOrder On'
  - 'SSLCompression off'
  - 'SSLSessionTickets Off'
<...snip... />
/opt/ood/ood-portal-generator/sbin/update_ood_portal
systemctl restart httpd ondemand-dex
systemctl status httpd ondemand-dex
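Before restarting httpd it can save some head-scratching to confirm that every certificate path named in the ssl section actually resolves, since a dangling symlink will stop Apache from starting. The sketch below is one way to do that; `check_ssl_paths` is a hypothetical helper, and it assumes the list entries are written on one line each in the same quoted style as above, with no spaces in the paths.

```shell
# Check that every uncommented SSLCertificateFile / SSLCertificateKeyFile
# entry in the portal YAML ($1) points at something that exists.
# [ -e ] follows symlinks, so a dangling link counts as missing.
check_ssl_paths() {
    status=0
    for path in $(grep -E "^[[:space:]]*- 'SSLCertificate(Key)?File " "$1" \
                  | tr -d "'\"" | awk '{ print $3 }'); do
        if [ ! -e "$path" ]; then
            echo "missing: $path" >&2
            status=1
        fi
    done
    return "$status"
}

# Example:
#   check_ssl_paths /etc/ood/config/ood_portal.yml && systemctl restart httpd ondemand-dex
```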
i.e. I commented out the old SSL cert lines from the stackhpc appliance (we will need to override this in the playbooks later) and put in the new links.
everything below this line is ancient but STILL useful
Open OnDemand provides the following apps out of the box, which I shall go through the configuration of using my trusty virtual TrinityX cluster on my file server.
- Home directory browser with editor
- Active Job Viewer
- Job creator
- Cluster Shell
- Remote Visualisation
here are screenshots of each of the working modules:
Install RPMs on Portal node
yum install centos-release-scl
yum install https://yum.osc.edu/ondemand/1.5/ondemand-release-web-1.5-1.el7.noarch.rpm
yum install ondemand
systemctl start httpd24-httpd
Configure The Portal Node
Configure Authentication to use LDAP and allow remote access to nodes using reverse proxy
Back up the original blank config and configure basic LDAP authentication against the cluster LDAP for now. If you want something more advanced then read the OOD docs. Note: fill in the LDAP URL according to the cluster config on your system. You can normally find this in the sssd.conf file, along with the search DN, which goes before the ?uid below. In a default TrinityX cluster the URL will be <controller hostname>.cluster:636/ou=People,dc=local?uid. The .cluster is the internal domain name and cannot be removed, as that would fail the SSL host checks.
Note that the last three lines configure the reverse proxy settings. The most important is the host_regex line. If a host does not match this then the proxy will not connect to the node; this is to stop your portal server being used to redirect traffic to nodes you do not control. In TrinityX we use the internal TLD domain suffix .cluster, so '[\w.-]+\.cluster' limits access to these servers. If you wanted to be more specific then you could use something like node\d+\.cluster or (rvis|interactive)\d+\.cluster to limit access to more specific hosts in the cluster.
Some LDAP servers require uncommenting the AuthLDAPGroupAttribute and AuthLDAPGroupAttributeIsDN lines.
cp /etc/ood/config/ood_portal.yml /etc/ood/config/ood_portal.yml.orig
cat >> /etc/ood/config/ood_portal.yml << EOF
auth:
  - 'AuthType Basic'
  - 'AuthName "private"'
  - 'AuthBasicProvider ldap'
  - 'AuthLDAPURL "ldaps://trinityx.cluster:636/ou=People,dc=local?uid"'
  # - 'AuthLDAPGroupAttribute memberUid'
  # - 'AuthLDAPGroupAttributeIsDN off'
  - 'RequestHeader unset Authorization'
  - 'Require valid-user'
host_regex: '[\w.-]+\.cluster'
node_uri: '/node'
rnode_uri: '/rnode'
EOF
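It is worth trying a candidate host_regex against a few hostnames before putting it in the config. The sketch below does this with grep. Note the assumptions: `host_allowed` is a hypothetical helper, grep -E does not understand `\d` (and `\w` is a GNU extension), so the patterns are rewritten with POSIX classes, and the whole hostname is required to match, which is how I understand the proxy rule to behave.

```shell
# Succeeds if hostname $1 matches extended regex $2 in full (-x = whole line).
host_allowed() {
    printf '%s\n' "$1" | grep -Eqx -- "$2"
}

# POSIX equivalents of the patterns discussed above:
#   [\w.-]+\.cluster                -> [[:alnum:]_.-]+\.cluster
#   node\d+\.cluster                -> node[0-9]+\.cluster
#   (rvis|interactive)\d+\.cluster  -> (rvis|interactive)[0-9]+\.cluster
#
# Example:
#   host_allowed node001.cluster 'node[0-9]+\.cluster' && echo allowed
```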
Put a comment in /opt/rh/httpd24/root/etc/httpd/conf.modules.d/01-ldap.conf and explicitly set LDAPLibraryDebug to 0 (off), because for some stupid reason failing LDAP authentications will NOT result in an error in /var/log/httpd24/error_log.
If you see an internal server error after login and nothing in the logs, then set LDAPLibraryDebug to 1 and it may point you in the right direction (for me it was the missing .cluster).
cat >> /opt/rh/httpd24/root/etc/httpd/conf.modules.d/01-ldap.conf << EOF
# change the following LDAPLibraryDebug line to 1 if you get 500 (internal server) errors after login
LDAPLibraryDebug 0
EOF
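Another way to narrow down LDAP problems is to try the same URL outside Apache with ldapsearch (from the openldap-clients package). The sketch below only builds and prints the command so you can inspect it before running it. `ldap_check_cmd` is a hypothetical helper, and it assumes the simple URL form used above, scheme://host:port/base?attribute, with no scope or filter parts after the attribute.

```shell
# Print an ldapsearch command equivalent to an AuthLDAPURL ($1) for user ($2).
ldap_check_cmd() {
    url=$1
    user=$2
    scheme=${url%%://*}      # e.g. ldaps
    rest=${url#*://}         # host:port/base?attr
    hostport=${rest%%/*}     # e.g. trinityx.cluster:636
    path=${rest#*/}          # base?attr
    base=${path%%\?*}        # e.g. ou=People,dc=local
    attr=${path#*\?}         # e.g. uid
    printf 'ldapsearch -x -H %s://%s -b "%s" "(%s=%s)"\n' \
        "$scheme" "$hostport" "$base" "$attr" "$user"
}

# Example:
#   ldap_check_cmd 'ldaps://trinityx.cluster:636/ou=People,dc=local?uid' myuser
```

If the printed command fails with a certificate or hostname error when run, the same problem is probably what Apache is hitting silently.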
Now we need to run the portal config generator and restart the service
/opt/ood/ood-portal-generator/sbin/update_ood_portal
systemctl try-restart httpd24-httpd.service httpd24-htcacheclean.service
If you access the website now it should prompt you for a login, and you should be able to log in as a normal user with your cluster credentials. It will then display the most basic portal page, which will look bare compared to the above screenshots.
Configure Slurm and Remote Desktop sessions
Configure Slurm and remote access plugins
Let's fix that now and tell OOD how to use Slurm and create remote visualisation settings. I will cover VirtualGL for 3D acceleration using cluster GPU nodes another time. If you can install nvidia drivers that *REALLY WORK* then VGL is easy.
In the following the login host should be set to a LOGIN node. I am using the controller node here as I have no login node. I have added my user to the admins group in LDAP to allow me to login to the controller node. You could also disable the filter restricting non-root logins to the controller node if you wish. If you do not then you will get a rather unhelpful Authentication failed message when you use the cluster shell functions.
Note that the following uses shared apps and modules to make virtualgl and websockify work so that you do not need to add them to the nodes and re-provision them.
See the /etc/ood/config/clusters.d/my_cluster.yml file? You can call this anything you like, such as the cluster name, and it should correspond to the title: "My Cluster" section. You can have multiple clusters. While I am mentioning this, see the cluster: "cluster" in the job section? This is the Slurm cluster name, in case this has been customised from the default.
mkdir -p /etc/ood/config/clusters.d/
cat > /etc/ood/config/clusters.d/my_cluster.yml << EOF
---
v2:
  metadata:
    title: "My Cluster"
  login:
    host: "trinityx.cluster"
  job:
    adapter: "slurm"
    cluster: "cluster"
    bin: "/usr/sbin/"
    conf: "/etc/slurm/slurm.conf"
    bin_overrides:
      sbatch: "/usr/bin/sbatch"
      squeue: "/usr/bin/squeue"
      scontrol: "/usr/bin/scontrol"
      scancel: "/usr/bin/scancel"
  batch_connect:
    basic:
      script_wrapper: |
        module purge
        %s
    vnc:
      script_wrapper: |
        module purge
        module add turbovnc websockify
        export WEBSOCKIFY_CMD="websockify"
        %s
EOF
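If the paths under bin_overrides are wrong, job submission from the portal will fail, so it is worth checking them on the portal node first. A minimal sketch follows; `check_slurm_bins` is a hypothetical helper, and the paths in the example are simply the ones from the config above.

```shell
# Report any argument that is not an executable file; non-zero exit if any is missing.
check_slurm_bins() {
    missing=0
    for bin in "$@"; do
        if [ ! -x "$bin" ]; then
            echo "missing or not executable: $bin" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example:
#   check_slurm_bins /usr/bin/sbatch /usr/bin/squeue /usr/bin/scontrol /usr/bin/scancel
```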
Configure remote Desktop form
All the services are now configured ready for the bc_desktop module. Note we have NOT installed TurboVNC or websockify yet; we will get there later. Let's sort out the bc_desktop form first. There are lots of config options here to allow you to hide fields or set sensible defaults. Basically, if you set a value inline in the YAML, e.g. desktop: mate, the option will be hardcoded and NOT shown to the user. If you set it using the value: option like this:
desktop:
  value: "xfce"
then it will be set as a default and the user can modify it. You can set things to null to hide them if they are not needed. You can use the label: override to change the label on the form, and you may add markdown-formatted help descriptions to fields with the help: override.
Do NOT bother configuring the bc_vnc_resolution. The default is to resize the remote desktop based on the size of the window at the client end dynamically.
If you must try the MATE desktop, YMMV: when I used it I had problems with dconf being unable to write into /var/run/$UID/ as it did not exist for the user. I could have created a Slurm prolog to do this as root, but:
- I prefer XFCE which doesn't need this.
- XFCE takes up less disk space on my tiny VM config.
- I am lazy: once something works and I know I can fix it the other way, I will wait for someone to really need it before I do.
mkdir /etc/ood/config/apps/bc_desktop -p
cat > /etc/ood/config/apps/bc_desktop/my_cluster.yml << EOF
---
title: "Remote Desktop"
cluster: my_cluster
attributes:
  bc_account:
    help: "this should be left blank most of the time"
  desktop: "xfce"
  bc_vnc_idle:
    value: 180
    label: "Idle timeout"
    help: This is the time you have to connect to a session before it is automatically terminated
  node_type: null
form:
  - bc_vnc_idle
  - desktop
  - bc_account
  - bc_num_hours
  - bc_num_slots
  - node_type
  - bc_queue
  - bc_vnc_resolution
  - bc_email_on_started
EOF
Now we reconfigure the portal and restart the webserver to make our changes take effect
/opt/ood/ood-portal-generator/sbin/update_ood_portal
systemctl try-restart httpd24-httpd.service
Configure the compute node image
luna chroot compute
yum groupinstall xfce
yum install numpy
systemctl disable gdm
exit
luna osimage pack compute
Configure Modules for TurboVNC and websockify
TurboVNC
I cheat: I install TurboVNC locally from the RPM downloaded from the TurboVNC website, copy the /opt/TurboVNC directory into the shared apps folder, and then use the following module for the computes.
yum install turbovnc-2.2.2.x86_64.rpm
mkdir -p /trinity/shared/apps/TurboVNC
cp -r /opt/TurboVNC /trinity/shared/apps/TurboVNC/2.2.2
cat /trinity/shared/modules/tr17.10/x86_64/compiler/turbovnc/2.2.2
#%Module
#
# @name: TurboVNC
# @version: 2.2.2
# @packaging: BIOS IT
#
# Customize the output of `module help` command
# ---------------------------------------------
proc ModulesHelp { } {
    # name is set at file level below, so it must be declared global here
    global name
    puts stderr "\tAdds $name to your environment variables"
    puts stderr "\t\t\$PATH, \$MANPATH"
}
# Customize the output of `module whatis` command
# -----------------------------------------------
module-whatis "loads the [module-info name] environment"
# Define internal modulefile variables (Tcl script use only)
# ----------------------------------------------------------
set name TurboVNC
set version 2.2.2
set prefix /trinity/shared/apps/$name/$version
# Check if the path exists before modifying environment
# -----------------------------------------------------
if {![file exists $prefix]} {
    puts stderr "\t[module-info name] Load Error: $prefix does not exist"
    break
    exit 1
}
# Update common variables in the environment
# ------------------------------------------
prepend-path PATH $prefix/bin
prepend-path MANPATH $prefix/man
setenv TURBOVNC_DIR $prefix
Websockify
I install this on the controller, setting PYTHONPATH and using the --home= option so that it ends up in the shared apps folder. The setup.py will try to install numpy on the node, and on CentOS 7 this breaks; pre-empt that by installing numpy from the RPM FIRST.
yum install numpy
git clone https://github.com/novnc/websockify
cd websockify
mkdir -p /trinity/shared/apps/websockify/0.8.0
PYTHONPATH=/trinity/shared/apps/websockify/0.8.0/lib/python python ./setup.py install --home=/trinity/shared/apps/websockify/0.8.0
cat /trinity/shared/modules/tr17.10/x86_64/libraries/websockify/0.8.0
#%Module
#
# @name: websockify
# @version: 0.8.0
# @packaging: BIOS-IT
#
# Customize the output of `module help` command
# ---------------------------------------------
proc ModulesHelp { } {
    puts stderr "\tAdds websockify to your environment variables"
    puts stderr "\t\t\$PATH, \$MANPATH"
}
# Customize the output of `module whatis` command
# -----------------------------------------------
module-whatis "loads the [module-info name] environment"
# Define internal modulefile variables (Tcl script use only)
# ----------------------------------------------------------
set name websockify
set version 0.8.0
set prefix /trinity/shared/apps/$name/$version/
# Check if the path exists before modifying environment
# -----------------------------------------------------
if {![file exists $prefix]} {
    puts stderr "\t[module-info name] Load Error: $prefix does not exist"
    break
    exit 1
}
# Update common variables in the environment
# ------------------------------------------
prepend-path PATH $prefix/bin
prepend-path LD_LIBRARY_PATH $prefix/lib
prepend-path LIBRARY_PATH $prefix/lib
prepend-path PYTHONPATH $prefix/lib/python
prepend-path MANPATH $prefix/share/man
setenv WEBSOCKIFY_DIR $prefix





