DeepOps on vScaler (OpenStack) POC
The Ubuntu driver and all the components required for a simple Kubernetes-with-GPU deployment (1 master, 1 worker) are available in the ireland.south1 region. Below is a description of how to use them.
First, as an admin, share the DGX image with your project:
[root@lhc-headnode ~]# source /etc/kolla/admin-openrc.sh
[root@lhc-headnode ~]# openstack image add project ubuntu-software-config-dgx <your-project-name>
[root@lhc-headnode ~]# openstack image member list ubuntu-software-config-dgx
The last command will show the image in the "pending" state. Log in as your regular user and accept the image either through Horizon or with the command:
$ openstack image set --accept <uuid-of-the-image>
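If you prefer to script this step, you can look up the UUID by name first. A minimal sketch (accept_shared_image is a hypothetical helper name, not part of the CLI):

```shell
# Hypothetical helper: look up the shared image's UUID by name,
# then accept it -- works around the CLI's inability to accept by name.
accept_shared_image() {
  local name=$1
  local id
  id=$(openstack image list --name "$name" -f value -c ID)
  openstack image set --accept "$id"
}
```

Usage: accept_shared_image ubuntu-software-config-dgx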
(the openstack CLI, as of version 3.16.1, doesn't allow accepting images by name, hence the UUID)
All of this is done this way for now because the vGPU driver and settings for connecting to our license server are baked into the image.
Next, head to Container Infra -> Cluster Templates in the dashboard, where you will see the "deepops-poc-template" template. This template was created using the following command:
[root@lhc-headnode ~]# openstack coe cluster template create --coe kubernetes --image ubuntu-software-config-dgx --external-network public1 --flavor g1.large.1xk80 --master-flavor m1.medium --docker-storage-driver overlay --public --floating-ip-disabled deepops-poc-template
Create a cluster off of the template by clicking on "Create Cluster" next to the template and specifying the name of the cluster and your keypair (leave all other parameters with their defaults). You can also create a cluster with these commands:
$ openstack coe cluster template list
$ openstack coe cluster create --cluster-template deepops-poc-template --keypair <your-keypair> <cluster-name>
$ openstack coe cluster list
Now, head to Orchestration -> Stacks and wait for the stack associated with the cluster to transition into the "Create Complete" state. When this is done, check the list of your instances in Compute -> Instances and find the master node of the cluster. Assign a floating IP to it and SSH into it.
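If you are working from the CLI instead of Horizon, the same wait can be scripted by polling Magnum. A sketch (wait_for_cluster is a hypothetical helper; the 30-second interval is arbitrary):

```shell
# Poll the cluster status until Magnum reports success or failure.
wait_for_cluster() {
  local name=$1 status
  while true; do
    status=$(openstack coe cluster show "$name" -f value -c status)
    case "$status" in
      CREATE_COMPLETE) echo "cluster $name is ready"; return 0 ;;
      *FAILED*)        echo "cluster $name failed: $status"; return 1 ;;
      *)               sleep 30 ;;
    esac
  done
}
```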
Confirm Kubernetes works by launching a test pod:
kubectl get nodes
git clone https://github.com/NVIDIA/deepops
cd deepops/
git checkout a0479e14b5bd0a4595a98ed86fd129ba8ec9a75d
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.12/nvidia-device-plugin.yml
kubectl apply -f tests/gpu-test-job.yml
kubectl get pods -o wide
Wait for the pod to become ready and run these commands to confirm the pod can access the vGPU:
kubectl describe pod gpu-pod
kubectl exec -ti gpu-pod -- nvidia-smi
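The wait-then-check can also be scripted. A sketch, assuming the pod name gpu-pod from the test job (the helper name and the timeout are arbitrary):

```shell
# Poll until the pod is Running, then run nvidia-smi inside it.
wait_and_check_gpu() {
  local pod=${1:-gpu-pod} phase
  for _ in $(seq 1 60); do
    phase=$(kubectl get pod "$pod" -o jsonpath='{.status.phase}')
    if [ "$phase" = "Running" ]; then
      kubectl exec "$pod" -- nvidia-smi
      return 0
    fi
    sleep 10
  done
  echo "timed out waiting for $pod" >&2
  return 1
}
```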
Enable monitoring (Prometheus + Grafana)
Description based on the README of https://github.com/NVIDIA/deepops.
WARNING: The commands below will only work in a setup with a single master and one or more worker nodes.
On the master with the DeepOps repo cloned, first prepare the config:
cd deepops/
cp -r config.example/ config
Edit config/kube.yml and change kube_version to whatever kubectl version --short reports as the Server version.
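This edit can be automated by parsing the server version out of kubectl and rewriting the line in place. A sketch (set_kube_version is a hypothetical helper; it assumes kube_version sits at the start of a line in the file):

```shell
# Rewrite kube_version in the given config file to match the running server.
set_kube_version() {
  local cfg=$1 v
  v=$(kubectl version --short | awk '/^Server Version/ {print $3}')
  sed -i "s/^kube_version:.*/kube_version: '${v}'/" "$cfg"
}
```

Usage: set_kube_version config/kube.yml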
Then continue:
kubectl taint nodes --all node-role.kubernetes.io/master-
MASTER=$(kubectl get nodes -l node-role.kubernetes.io/master= -o jsonpath='{.items[].metadata.name}')
kubectl label node $MASTER node-role.kubernetes.io/master=true --overwrite
./scripts/helm_install_linux.sh
kubectl create sa tiller --namespace kube-system
kubectl create clusterrolebinding tiller --clusterrole cluster-admin --serviceaccount=kube-system:tiller
helm init --service-account tiller --node-selectors node-role.kubernetes.io/master=true
Wait for the tiller pod to become ready (check on it with kubectl get pods -n kube-system | grep tiller) and then proceed:
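Instead of polling by hand, kubectl can block until the deployment is rolled out (tiller-deploy is the deployment name that helm init creates; the wrapper function is just for illustration):

```shell
# Block until the tiller deployment reports all replicas ready.
wait_for_tiller() {
  kubectl rollout status deployment/tiller-deploy -n kube-system
}
```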
helm repo add coreos https://s3-eu-west-1.amazonaws.com/coreos-charts/stable/
helm install coreos/prometheus-operator --name prometheus-operator --namespace monitoring --values config/prometheus-operator.yml
kubectl create configmap kube-prometheus-grafana-gpu --from-file=config/gpu-dashboard.json -n monitoring
helm install coreos/kube-prometheus --name kube-prometheus --namespace monitoring --values config/kube-prometheus.yml
Label all your GPU nodes with the command:
kubectl label nodes <gpu-node-name> hardware-type=NVIDIAGPU
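With several workers you can label them in one pass by checking which nodes advertise an nvidia.com/gpu resource. A sketch (label_gpu_nodes is a hypothetical helper and assumes the device plugin is already running):

```shell
# Label every node whose capacity includes nvidia.com/gpu.
label_gpu_nodes() {
  local n gpus
  for n in $(kubectl get nodes -o name); do
    n=${n#node/}
    gpus=$(kubectl get node "$n" -o jsonpath='{.status.capacity.nvidia\.com/gpu}')
    if [ -n "$gpus" ] && [ "$gpus" != "0" ]; then
      kubectl label node "$n" hardware-type=NVIDIAGPU --overwrite
    fi
  done
}
```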
and deploy the DCGM exporter:
kubectl create -f services/dcgm-exporter.yml
Finally, head to http://<your-floating-ip>:30200 and select the "GPU Nodes" dashboard to see graphs with metrics from your GPU nodes.