OpenOnDemand

From Define Wiki
Revision as of 14:33, 7 November 2023 by Antony (talk | contribs) (editded the ssl cert part)

Setting up Open OnDemand

What is it?

Open OnDemand is a web portal that runs web applications to make any compute cluster easier to use.

The deployment is now pretty much handled by the stackhpc/osc playbooks, BUT I'm adding some extra notes to document how to do stuff by hand to tweak an existing setup.

Switch OOD to use FreeIPA and ondemand-dex

On the portal node (here login-1), the first step is to install the ondemand-dex package:

dnf install -y ondemand-dex

Now we edit the /etc/ood/config/ood_portal.yml file.

Remove or comment out all the httpd auth lines (the Apache directives listed under auth:).


Stop httpd from listening on port 80 and use certbot to create a trusted SSL cert that renews automatically

First up, make sure you ARE allowing port 80 through the firewall and security groups; certbot's standalone mode needs it to answer the ACME challenge.

Next, comment out all lines in /etc/httpd/conf.d/welcome.conf; it should look like this. Do not delete it or it will come back from the dead when you update Apache.

[root@login-1 certs]# cat /etc/httpd/conf.d/welcome.conf
#
# This configuration file enables the default "Welcome" page if there
# is no default index page present for the root URL.  To disable the
# Welcome page, comment out all the lines below.
#
# NOTE: if this file is removed, it will be restored on upgrades.
#
#<LocationMatch "^/+$">
#    Options -Indexes
#    ErrorDocument 403 /.noindex.html
#</LocationMatch>
#
#<Directory /usr/share/httpd/noindex>
#    AllowOverride None
#    Require all granted
#</Directory>
#
#Alias /.noindex.html /usr/share/httpd/noindex/index.html
#Alias /poweredby.png /usr/share/httpd/icons/apache_pb3.png
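
Commenting the file out by hand is fiddly, so here is a rough sketch of doing it with sed. The helper name comment_out_all is made up for this page (it is not part of httpd or OOD), and it assumes GNU sed for the -i option.

```shell
# comment_out_all FILE
# Prefix every non-blank line that is not already a comment with '#'.
# Hypothetical helper for illustration; assumes GNU sed (-i).
comment_out_all() {
    # Keep a backup; httpd only includes conf.d/*.conf, so *.bak is ignored.
    cp -p "$1" "$1.bak"
    # Skip lines that already start with '#', then prefix lines that
    # contain at least one non-space character.
    sed -i -e '/^[[:space:]]*#/!s/^\(.*[^[:space:]].*\)$/#\1/' "$1"
}
```

Usage would be comment_out_all /etc/httpd/conf.d/welcome.conf. Running it a second time changes nothing, because every remaining line is already a comment or blank.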

Comment out Listen 80 in /etc/httpd/conf/httpd.conf; on mine it's around line 45.
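
If you would rather script this edit, something like the following should do it. This is only a sketch: disable_listen_80 is a name I made up, and it assumes GNU sed and that the directive is written as a plain "Listen 80" on its own line.

```shell
# disable_listen_80 FILE
# Comment out an active "Listen 80" directive. Other Listen lines
# (8080, 443, ...) and already-commented lines are left alone.
# Hypothetical helper for illustration; assumes GNU sed (-i).
disable_listen_80() {
    sed -i -e 's/^\([[:space:]]*Listen[[:space:]]\{1,\}80[[:space:]]*\)$/#\1/' "$1"
}
```

Usage would be disable_listen_80 /etc/httpd/conf/httpd.conf. It is probably worth running apachectl configtest before restarting httpd.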

restart httpd

Check it's no longer listening on port 80 with systemctl status httpd and ss -t | grep -e ":http" -e ":80"

[root@login-1 certs]# grep "Listen 80" /etc/httpd/conf/httpd.conf -n
45:#Listen 80
[root@login-1 tls]# systemctl restart httpd
[root@login-1 tls]# systemctl status httpd
● httpd.service - The Apache HTTP Server
   Loaded: loaded (/usr/lib/systemd/system/httpd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/httpd.service.d
           └─ood-portal.conf, ood.conf
   Active: active (running) since Tue 2023-11-07 12:41:18 GMT; 8s ago
     Docs: man:httpd.service(8)
  Process: 98483 ExecStartPre=/opt/ood/ood-portal-generator/sbin/update_ood_portal --rpm (code=exited, status=0/SUCCESS)
 Main PID: 98499 (httpd)
   Status: "Started, listening on: port 443, port 81"
    Tasks: 213 (limit: 100627)
   Memory: 44.2M
   CGroup: /system.slice/httpd.service
           ├─98499 /usr/sbin/httpd -DFOREGROUND
           ├─98500 /usr/sbin/httpd -DFOREGROUND
           ├─98501 /usr/sbin/httpd -DFOREGROUND
           ├─98502 /usr/sbin/httpd -DFOREGROUND
           └─98503 /usr/sbin/httpd -DFOREGROUND

Nov 07 12:41:18 login-1.cluster.internal systemd[1]: Starting The Apache HTTP Server...
Nov 07 12:41:18 login-1.cluster.internal update_ood_portal[98483]: No change in Apache config.
Nov 07 12:41:18 login-1.cluster.internal update_ood_portal[98483]: No change in the Dex config.
Nov 07 12:41:18 login-1.cluster.internal httpd[98499]: [Tue Nov 07 12:41:18.946520 2023] [so:warn] [pid 98499:tid 140163328375104] AH01574: module status_module is already loaded, skipping
Nov 07 12:41:18 login-1.cluster.internal systemd[1]: Started The Apache HTTP Server.
Nov 07 12:41:18 login-1.cluster.internal httpd[98499]: Server configured, listening on: port 443, port 81
[root@login-1 certs]# ss -t | grep -e :http -e :80
LAST-ACK 0      1               10.0.3.240:59462          74.125.193.99:https
ESTAB    0      0               10.0.3.240:53178          92.122.160.95:https
ESTAB    0      0               10.0.3.240:43826          104.82.170.85:https
ESTAB    0      0      [::ffff:10.0.3.240]:https   [::ffff:82.1.125.18]:56555

Notice that it's only listening on 443 (https) and 81 (I'm not so sure it really is listening on 81, but hey). This is good.
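
One caveat on the check above: ss -t on its own only lists established connections, while ss -tln lists the listening sockets. Here is a small sketch for pulling the listening ports out of that output. The function names are my own, and it assumes the usual ss column layout (State, Recv-Q, Send-Q, Local Address:Port, ...).

```shell
# listening_ports: read `ss -tln` output on stdin and print the unique
# local ports of sockets in LISTEN state, sorted numerically.
listening_ports() {
    awk '$1 == "LISTEN" { n = split($4, a, ":"); print a[n] }' | sort -un
}

# port_is_listening PORT: succeed if PORT is among the listening ports
# found in the `ss -tln` output on stdin.
port_is_listening() {
    listening_ports | grep -qx "$1"
}
```

For example, ss -tln | port_is_listening 80 && echo "httpd is still on port 80" should print nothing once the Listen 80 line is commented out and httpd has been restarted.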

Now we can install certbot (dnf install -y certbot) and run it against our domain, which you have to point at the public IP: certbot certonly -d <public FQDN>. In the following example I am doing a dry run, as I have run this before, but you should remove the --dry-run flag.

[root@login-1 certs]# certbot certonly -d nuig-cluster.define-technology.com --dry-run
Saving debug log to /var/log/letsencrypt/letsencrypt.log

How would you like to authenticate with the ACME CA?
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1: Spin up a temporary webserver (standalone)
2: Place files in webroot directory (webroot)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Select the appropriate number [1-2] then [enter] (press 'c' to cancel): 1
Simulating renewal of an existing certificate for nuig-cluster.define-technology.com
The dry run was successful.

mine didn't add the renew job to cron or a timer so I did the following to auto renew the certs

cat >/etc/cron.daily/certbot-renew <<EOF
#!/bin/bash
/usr/bin/certbot renew &>/dev/null
EOF
chmod a+x /etc/cron.daily/certbot-renew
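
httpd only reads the certificate files when it starts or reloads, so a renewed cert is not served until then. Below is a variant of the cron job that reloads httpd after a successful renewal, using certbot's --deploy-hook option. Treat it as a sketch: write_renew_job is a made-up helper name and I have not run this exact script on the cluster above.

```shell
# write_renew_job PATH
# Write a daily renewal script to PATH and make it executable.
# The deploy hook only runs when a certificate was actually renewed.
# Hypothetical helper for illustration.
write_renew_job() {
    cat > "$1" <<'EOF'
#!/bin/bash
/usr/bin/certbot renew --quiet --deploy-hook "systemctl reload httpd" >/dev/null 2>&1
EOF
    chmod a+x "$1"
}
```

Usage would be write_renew_job /etc/cron.daily/certbot-renew, which replaces the script created above.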

Now you should have the certs in the /etc/letsencrypt/live/<fqdn>/ directory. I tried putting this path directly into the Apache config files but it fails; what does work is to link to these. I put these links into /etc/pki/tls/certs and /etc/pki/tls/private.

ln -s /etc/letsencrypt/live/nuig-cluster.define-technology.com/fullchain.pem  /etc/pki/tls/certs/nuig-cluster.define-technology.com.crt
ln -s /etc/letsencrypt/live/nuig-cluster.define-technology.com/privkey.pem /etc/pki/tls/private/nuig-cluster.define-technology.com.key

[root@login-1 certs]# ll /etc/pki/tls/private/nuig-cluster.define-technology.com.key  /etc/pki/tls/certs/nuig-cluster.define-technology.com.crt
lrwxrwxrwx. 1 root root 70 Nov  7 12:42 /etc/pki/tls/certs/nuig-cluster.define-technology.com.crt -> /etc/letsencrypt/live/nuig-cluster.define-technology.com/fullchain.pem
lrwxrwxrwx. 1 root root 68 Nov  7 12:40 /etc/pki/tls/private/nuig-cluster.define-technology.com.key -> /etc/letsencrypt/live/nuig-cluster.define-technology.com/privkey.pem
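
The two ln commands can be wrapped up so the same thing works for any domain. This is a sketch only: link_letsencrypt_cert and its optional directory arguments are my own invention, with defaults matching the paths used above.

```shell
# link_letsencrypt_cert FQDN [LE_ROOT] [PKI_ROOT]
# Link the Let's Encrypt chain and key for FQDN into the pki tree.
# LE_ROOT defaults to /etc/letsencrypt, PKI_ROOT to /etc/pki/tls.
# Hypothetical helper for illustration.
link_letsencrypt_cert() {
    fqdn=$1
    le_root=${2:-/etc/letsencrypt}
    pki_root=${3:-/etc/pki/tls}
    live="$le_root/live/$fqdn"
    # Refuse to create dangling links if certbot has not issued the cert yet.
    [ -e "$live/fullchain.pem" ] && [ -e "$live/privkey.pem" ] || return 1
    # -f replaces an existing link, -n avoids following an old one.
    ln -sfn "$live/fullchain.pem" "$pki_root/certs/$fqdn.crt"
    ln -sfn "$live/privkey.pem" "$pki_root/private/$fqdn.key"
}
```

Usage would be link_letsencrypt_cert nuig-cluster.define-technology.com; it returns non-zero without touching anything if the live directory is missing.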

Finally, modify /etc/ood/config/ood_portal.yml, run /opt/ood/ood-portal-generator/sbin/update_ood_portal, and restart httpd and ondemand-dex.

We need to edit the ssl section. I made mine:

cat /etc/ood/config/ood_portal.yml

<...snip... />

# List of SSL Apache directives
# Example:
#     ssl:
#       - 'SSLCertificateFile "/etc/pki/tls/certs/www.example.com.crt"'
#       - 'SSLCertificateKeyFile "/etc/pki/tls/private/www.example.com.key"'
# Default: null (no SSL support)
ssl:
#- 'SSLCertificateFile /etc/pki/tls/certs/localhost.crt'
#- 'SSLCertificateKeyFile /etc/pki/tls/private/localhost.key'
- 'SSLCertificateFile /etc/pki/tls/certs/nuig-cluster.define-technology.com.crt'
- 'SSLCertificateKeyFile /etc/pki/tls/private/nuig-cluster.define-technology.com.key'
- 'SSLProtocol all -TLSv1.1 -TLSv1 -SSLv2 -SSLv3'
- 'SSLCipherSuite ALL:+HIGH:!ADH:!EXP:!SSLv2:!SSLv3:!MEDIUM:!LOW:!NULL:!aNULL'
- 'SSLHonorCipherOrder On'
- 'SSLCompression off'
- 'SSLSessionTickets Off'

<...snip... />

/opt/ood/ood-portal-generator/sbin/update_ood_portal

systemctl restart httpd ondemand-dex

systemctl status httpd ondemand-dex

i.e. I commented out the old SSL cert lines from the stackhpc appliance (we will need to override this in the playbooks later) and put in the new links.


Everything below this line is ancient but STILL useful.

[Screenshot: Dashboard.png]

It provides the following apps out of the box which I shall go through the configuration of using my trusty virtual TrinityX cluster on my file server.

  1. Home directory browser with editor
  2. Active Job Viewer
  3. Job creator
  4. Cluster Shell
  5. Remote Visualisation

here are screenshots of each of the working modules:

OOnDemandFilesBrowser.png OOnDemandActiveJobViewer.png OOnDemandJobCreator.png OOnDemandWebTerminal.png OOnDemandRemoteDesktop1.png OOnDemandRemoteDesktop2.png

Install RPMs on Portal node

yum install centos-release-scl
yum install https://yum.osc.edu/ondemand/1.5/ondemand-release-web-1.5-1.el7.noarch.rpm
yum install ondemand
systemctl start httpd24-httpd

Configure The Portal Node

Configure Authentication to use LDAP and allow remote access to nodes using reverse proxy

Back up the original blank config and configure basic LDAP authentication against the cluster LDAP for now. If you want something more advanced then read the OOD docs. Note: fill in the LDAP URL according to the cluster config on your system. You can normally find this in the sssd.conf file, along with the search DN, which goes before the ?uid below. In a default TrinityX cluster the URL will be <controller hostname>.cluster:636/ou=People,dc=local?uid. The .cluster is the internal domain name and cannot be removed, or the SSL hostname checks will fail.
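
To make the shape of that URL explicit, here is a tiny sketch that assembles it from its parts. ldap_auth_url is a made-up helper; the host and base DN are the TrinityX defaults quoted above, so substitute whatever your sssd.conf says (on my understanding these are the ldap_uri and ldap_search_base settings).

```shell
# ldap_auth_url HOST BASE_DN [ATTR]
# Print an ldaps:// URL in the form AuthLDAPURL expects.
# ATTR is the attribute to match the login name against (default: uid).
# Hypothetical helper for illustration.
ldap_auth_url() {
    printf 'ldaps://%s:636/%s?%s\n' "$1" "$2" "${3:-uid}"
}
```

For example, ldap_auth_url trinityx.cluster 'ou=People,dc=local' gives the URL used in the config below.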

Note that the last three lines configure the reverse proxy settings. The most important is the host_regex line. If a host does not match this then the proxy will not connect to the node; this stops your portal server being used to redirect traffic to nodes you do not control. In TrinityX we use the internal TLD domain suffix .cluster, so '[\w.-]+\.cluster' limits access to these servers. If you wanted to be more specific then you could use something like node\d+\.cluster or (rvis|interactive)\d+\.cluster to limit access to specific hosts in the cluster.
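
Before committing a host_regex it can help to try a few hostnames against it. The sketch below uses grep -E with a POSIX translation of '[\w.-]+\.cluster' (\w becomes [[:alnum:]_]) and anchors it at both ends, on the assumption that OOD matches the whole hostname. host_allowed is my own helper name, not an OOD tool.

```shell
# host_allowed HOSTNAME
# Succeed if HOSTNAME matches a POSIX ERE approximation of the
# host_regex '[\w.-]+\.cluster', anchored to the whole name.
# Hypothetical helper for illustration.
host_allowed() {
    printf '%s\n' "$1" | grep -Eq '^[[:alnum:]_.-]+\.cluster$'
}
```

So host_allowed node001.cluster succeeds, while host_allowed evil.example.com fails.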

some LDAP servers require uncommenting the AuthLDAPGroupAttribute and AuthLDAPGroupAttributeIsDN lines

cp /etc/ood/config/ood_portal.yml /etc/ood/config/ood_portal.yml.orig
cat >> /etc/ood/config/ood_portal.yml << EOF
auth:
  - 'AuthType Basic'
  - 'AuthName "private"'
  - 'AuthBasicProvider ldap'
  - 'AuthLDAPURL "ldaps://trinityx.cluster:636/ou=People,dc=local?uid"'
#  - 'AuthLDAPGroupAttribute memberUid'
#  - 'AuthLDAPGroupAttributeIsDN off'
  - 'RequestHeader unset Authorization'
  - 'Require valid-user'
host_regex: '[\w.-]+\.cluster'
node_uri: '/node'
rnode_uri: '/rnode'
EOF

Put a comment in /opt/rh/httpd24/root/etc/httpd/conf.modules.d/01-ldap.conf and explicitly set LDAPLibraryDebug to 0 (off), as for some stupid reason failing LDAP authentications will NOT result in an error in /var/log/httpd24/error_log. If you see an internal server error after login and nothing in the logs then set LDAPLibraryDebug to 1 and it may point you in the right direction (for me it was the missing .cluster).

cat >> /opt/rh/httpd24/root/etc/httpd/conf.modules.d/01-ldap.conf << EOF
# change the following LDAPLibraryDebug line to 1 if you get 500 (internal server) errors after login
LDAPLibraryDebug 0
EOF

Now we need to run the portal config generator and restart the service

/opt/ood/ood-portal-generator/sbin/update_ood_portal
systemctl try-restart httpd24-httpd.service httpd24-htcacheclean.service

If you access the website now it should prompt you for a login, and you should be able to log in as a normal user with your cluster credentials. It will then display the most basic portal page, which will look bare compared to the above screenshots.

Configure Slurm and Remote Desktop sessions

Configure Slurm and remote access plugins

Let's fix that now and tell OOD how to use Slurm and create remote visualisation settings. I will cover VirtualGL for 3D acceleration using cluster GPU nodes another time. If you can install NVIDIA drivers that *REALLY WORK* then VGL is easy.

In the following the login host should be set to a LOGIN node. I am using the controller node here as I have no login node. I have added my user to the admins group in LDAP to allow me to login to the controller node. You could also disable the filter restricting non-root logins to the controller node if you wish. If you do not then you will get a rather unhelpful Authentication failed message when you use the cluster shell functions.

Note that the following uses shared apps and modules to make TurboVNC and websockify work, so that you do not need to add them to the nodes and re-provision them.

See the /etc/ood/config/clusters.d/my_cluster.yml file? You can call this anything you like, such as the cluster name, and it should correspond to the title: "My Cluster" section. You can have multiple clusters. While I am mentioning this, see the cluster: "cluster" in the job section? This is the Slurm cluster name, in case this has been customised from the default.

mkdir -p /etc/ood/config/clusters.d/
cat > /etc/ood/config/clusters.d/my_cluster.yml << EOF
---
v2:
  metadata:
    title: "My Cluster"
  login:
    host: "trinityx.cluster"
  job:
    adapter: "slurm"
    cluster: "cluster"
    bin: "/usr/sbin/"
    conf: "/etc/slurm/slurm.conf"
    bin_overrides:
      sbatch: "/usr/bin/sbatch"
      squeue: "/usr/bin/squeue"
      scontrol: "/usr/bin/scontrol"
      scancel: "/usr/bin/scancel"
  batch_connect:
    basic:
      script_wrapper: |
        module purge
        %s
    vnc:
      script_wrapper: |
        module purge
        module add turbovnc websockify
        export WEBSOCKIFY_CMD="websockify"
        %s
EOF
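
As far as I understand it, OOD takes the cluster ID from the file name without the .yml extension (so my_cluster.yml becomes my_cluster), and that ID is what the cluster: setting in the app configs refers to. A quick sketch for listing the IDs a portal would see; list_cluster_ids is a made-up helper name.

```shell
# list_cluster_ids [DIR]
# Print one cluster ID per *.yml file in DIR
# (default: /etc/ood/config/clusters.d).
# Hypothetical helper for illustration.
list_cluster_ids() {
    for f in "${1:-/etc/ood/config/clusters.d}"/*.yml; do
        # An unmatched glob stays literal; skip it.
        [ -e "$f" ] || continue
        basename "$f" .yml
    done
}
```

With only the file above in place, list_cluster_ids should print my_cluster.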

Configure remote Desktop form

All the services are now configured ready for the bc_desktop module. Note we have NOT installed TurboVNC or websockify yet; we will get there later. Let's sort out the bc_desktop form first. There are lots of config options here to allow you to hide fields or set sensible defaults. Basically, if you set a value inline in the YAML, e.g. desktop: mate, the option will be hardcoded and NOT show up to the user. If you set it using the value: option like this:

  desktop:
    value: "xfce"

then it will be set as a default and the user can modify it. You can set things to null to hide them if they are not needed. You can use the label override to change the label on the form, and you may add markdown formatted help descriptions to fields with the help: override.

Do NOT bother configuring the bc_vnc_resolution. The default is to resize the remote desktop based on the size of the window at the client end dynamically.

If you must, try the MATE desktop. YMMV, but when I used it I had problems with dconf being unable to write into /var/run/$UID/ as it did not exist for the user. I could have created a Slurm prolog to do this as root but:

  1. I prefer XFCE which doesn't need this.
  2. XFCE takes up less disk space on my tiny VM config.
  3. I am lazy: once something works and I know I can fix it the other way, I will wait for someone to really need it before I do.

mkdir -p /etc/ood/config/apps/bc_desktop
cat > /etc/ood/config/apps/bc_desktop/my_cluster.yml << EOF
---
title: "Remote Desktop"
cluster: my_cluster
attributes:
  bc_account:
    help: "this should be left blank most of the time"
  desktop: "xfce"
  bc_vnc_idle:
    value: 180
    label: "Idle timeout"
    help: This is the time you have to connect to a session before it is automatically terminated
  node_type: null
form:
  - bc_vnc_idle
  - desktop
  - bc_account
  - bc_num_hours
  - bc_num_slots
  - node_type
  - bc_queue
  - bc_vnc_resolution
  - bc_email_on_started
EOF

Now we reconfigure the portal and restart the webserver to make our changes take effect

/opt/ood/ood-portal-generator/sbin/update_ood_portal
systemctl try-restart httpd24-httpd.service

Configure the compute node image

luna chroot compute
yum groupinstall xfce
yum install numpy
systemctl disable gdm
exit
luna osimage pack compute

Configure Modules for TurboVNC and websockify

TurboVNC

I cheat and install TurboVNC locally from the RPM downloaded from the TurboVNC website, then copy the /opt/TurboVNC directory into the shared apps folder and use the following module for the computes.

yum install turbovnc-2.2.2.x86_64.rpm
# copy into a versioned directory so it matches the prefix in the module below
mkdir -p /trinity/shared/apps/TurboVNC
cp -r /opt/TurboVNC /trinity/shared/apps/TurboVNC/2.2.2


cat /trinity/shared/modules/tr17.10/x86_64/compiler/turbovnc/2.2.2
#%Module
#
# @name:    TurboVNC
# @version:  2.2.2
# @packaging: BIOS IT
#

# Customize the output of `module help` command
# ---------------------------------------------
proc ModulesHelp { } {
   global name
   puts stderr "\tAdds $name to your environment variables"
   puts stderr "\t\t\$PATH, \$MANPATH"
}

# Customize the output of `module whatis` command
# -----------------------------------------------
module-whatis   "loads the [module-info name] environment"

# Define internal modulefile variables (Tcl script use only)
# ----------------------------------------------------------
set   name      TurboVNC
set   version   2.2.2
set   prefix    /trinity/shared/apps/$name/$version

# Check if the path exists before modifying environment
# -----------------------------------------------------
if {![file exists $prefix]} {
   puts stderr "\t[module-info name] Load Error: $prefix does not exist"
   break
   exit 1
}

# Update common variables in the environment
# ------------------------------------------
prepend-path   PATH              $prefix/bin

prepend-path   MANPATH           $prefix/man

setenv         TURBOVNC_DIR      $prefix

Websockify

I install this on the controller, setting PYTHONPATH and using the --home= option so that it ends up in the shared apps folder. The setup.py will try to install numpy on the node and on CentOS 7 this breaks; pre-empt this by installing numpy from RPM FIRST.

yum install numpy
git clone https://github.com/novnc/websockify
cd websockify
mkdir -p /trinity/shared/apps/websockify/0.8.0
PYTHONPATH=/trinity/shared/apps/websockify/0.8.0/lib/python python ./setup.py install --home=/trinity/shared/apps/websockify/0.8.0

cat /trinity/shared/modules/tr17.10/x86_64/libraries/websockify/0.8.0
#%Module
#
# @name:    websockify
# @version:  0.8.0
# @packaging: BIOS-IT
#

# Customize the output of `module help` command
# ---------------------------------------------
proc ModulesHelp { } {
   puts stderr "\tAdds websockify to your environment variables"
   puts stderr "\t\t\$PATH, \$MANPATH"
}

# Customize the output of `module whatis` command
# -----------------------------------------------
module-whatis   "loads the [module-info name] environment"

# Define internal modulefile variables (Tcl script use only)
# ----------------------------------------------------------
set   name      websockify
set   version   0.8.0
set   prefix    /trinity/shared/apps/$name/$version/

# Check if the path exists before modifying environment
# -----------------------------------------------------
if {![file exists $prefix]} {
   puts stderr "\t[module-info name] Load Error: $prefix does not exist"
   break
   exit 1
}

# Update common variables in the environment
# ------------------------------------------
prepend-path   PATH              $prefix/bin

prepend-path   LD_LIBRARY_PATH   $prefix/lib
prepend-path   LIBRARY_PATH      $prefix/lib
prepend-path   PYTHONPATH        $prefix/lib/python

prepend-path   MANPATH           $prefix/share/man

setenv         WEBSOCKIFY_DIR         $prefix
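
Once both modules are in place it is worth checking, on a compute node, that the commands the vnc script_wrapper relies on are actually on the PATH. A rough sketch follows; missing_tools is a made-up helper, and vncserver and websockify are the command names I would expect from the installs above.

```shell
# missing_tools CMD...
# Print each named command that cannot be found on the current PATH.
# Prints nothing if everything is available.
# Hypothetical helper for illustration.
missing_tools() {
    for t in "$@"; do
        command -v "$t" >/dev/null 2>&1 || printf '%s\n' "$t"
    done
}
```

Usage would be module add turbovnc websockify followed by missing_tools vncserver websockify; any name it prints points at a module whose prefix or PATH entry needs another look.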