DeepOps on OpenStack POC


DeepOps on vScaler POC

The Ubuntu image with the vGPU driver and all the components required for a simple Kubernetes-with-GPU deployment (1 master, 1 worker) are available in the ireland.south1 region. Below is a description of how to use them.

First, as an admin, share the DGX image with your project:

[root@lhc-headnode ~]# source /etc/kolla/admin-openrc.sh
[root@lhc-headnode ~]# openstack image add project ubuntu-software-config-dgx <your-project-name>
[root@lhc-headnode ~]# openstack image member list ubuntu-software-config-dgx

The last command will show the image membership in the "pending" state. Log in as your regular user and accept the image either through Horizon or with the command:

$ openstack image set --accept ubuntu-software-config-dgx
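
As the regular user, you can confirm the image is now visible to your project:

$ openstack image list | grep ubuntu-software-config-dgx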

All of this is done this way for now because the vGPU driver and settings for connecting to our license server are baked into the image.

Next, head to Container Infra -> Cluster Templates in the dashboard, where you will see the "deepops-poc-template" template. This template was created with the following command:

[root@lhc-headnode ~]# openstack coe cluster template create --coe kubernetes --image ubuntu-software-config-dgx --external-network public1 --flavor g1.large.1xk80 --master-flavor m1.medium --docker-storage-driver overlay --public --floating-ip-disabled deepops-poc-template

Create a cluster from the template by clicking "Create Cluster" next to it and specifying the cluster name and your keypair (leave all other parameters at their defaults). Alternatively, create the cluster from the command line:

$ openstack coe cluster template list
$ openstack coe cluster create --cluster-template deepops-poc-template --keypair <your-keypair> <cluster-name>
$ openstack coe cluster list

Now, head to Orchestration -> Stacks and wait for the stack associated with the cluster to transition into the "Create Complete" state. When this is done, check the list of your instances in Compute -> Instances and find the master node of the cluster. Assign a floating IP to it and SSH into it.
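
If you prefer the CLI, attaching the floating IP looks roughly like this (public1 is the external network from the cluster template; the ubuntu login user is an assumption, adjust it to whatever your image uses):

$ openstack floating ip create public1
$ openstack server add floating ip <master-instance-name> <floating-ip>
$ ssh ubuntu@<floating-ip>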

Confirm Kubernetes works by launching a test pod:

# check that all nodes registered and are Ready
kubectl get nodes
# clone the DeepOps repo, which ships the GPU test job manifest
git clone https://github.com/NVIDIA/deepops
cd deepops/
# deploy the NVIDIA device plugin so Kubernetes can schedule GPU resources
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.12/nvidia-device-plugin.yml
# launch the test pod and watch where it gets scheduled
kubectl apply -f tests/gpu-test-job.yml
kubectl get pods -o wide

Wait for the pod to become ready and run these commands to confirm the pod can access the vGPU:

kubectl describe pod gpu-pod
kubectl exec -ti gpu-pod -- nvidia-smi
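
For reference, a GPU test job boils down to a pod that requests one nvidia.com/gpu resource from the device plugin. A minimal sketch (the actual tests/gpu-test-job.yml shipped in the DeepOps repo may differ):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:9.0-base      # any CUDA-enabled image will do
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1            # request a single (v)GPU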

Enable monitoring (Prometheus + Grafana)

This description is based on the README of https://github.com/NVIDIA/deepops.

WARNING: The commands below will only work in a setup with a single master and one or more worker nodes.

On the master with the DeepOps repo cloned, first prepare the config:

cd deepops/
cp -r config.example/ config

Edit config/kube.yml and change kube_version to match the Server version reported by kubectl version --short.
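
For example, assuming the server reports v1.11.5 (substitute your own version) and that kube_version appears as a top-level line in the file:

kubectl version --short
# Client Version: v1.11.5
# Server Version: v1.11.5
sed -i 's/^kube_version:.*/kube_version: v1.11.5/' config/kube.yml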

Then continue:

# allow workloads to be scheduled on the master (tiller and the monitoring stack will run there)
kubectl taint nodes --all node-role.kubernetes.io/master-
# record the master's name and label it so helm can pin tiller to it
MASTER=$(kubectl get nodes -l node-role.kubernetes.io/master= -o jsonpath='{.items[0].metadata.name}')
kubectl label node $MASTER node-role.kubernetes.io/master=true --overwrite
# install the helm client, then give tiller cluster-admin rights and deploy it on the master
./scripts/helm_install_linux.sh
kubectl create sa tiller --namespace kube-system
kubectl create clusterrolebinding tiller --clusterrole cluster-admin --serviceaccount=kube-system:tiller
helm init --service-account tiller --node-selectors node-role.kubernetes.io/master=true

Wait for the tiller pod to become ready (check on it with kubectl get pods -n kube-system | grep tiller) and then proceed:

# register the CoreOS chart repository and install the Prometheus operator
helm repo add coreos https://s3-eu-west-1.amazonaws.com/coreos-charts/stable/
helm install coreos/prometheus-operator --name prometheus-operator --namespace monitoring --values config/prometheus-operator.yml
# load the GPU dashboard into a configmap for Grafana, then install the monitoring stack itself
kubectl create configmap kube-prometheus-grafana-gpu --from-file=config/gpu-dashboard.json -n monitoring
helm install coreos/kube-prometheus --name kube-prometheus --namespace monitoring --values config/kube-prometheus.yml
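
Before moving on, check that all pods in the monitoring namespace come up:

kubectl get pods -n monitoring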

Label all your GPU nodes with the command:

kubectl label nodes <gpu-node-name> hardware-type=NVIDIAGPU

and deploy the exporter:

kubectl create -f services/dcgm-exporter.yml
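
The exporter runs as one pod per labelled node; you can confirm they started with (the dcgm pod name prefix is an assumption based on the manifest name):

kubectl get pods -o wide | grep dcgm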

Finally, head to http://<your-floating-ip>:30200 and select the "GPU Nodes" dashboard to see graphs with metrics from your GPU nodes.
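
If nothing answers on port 30200, check which NodePort the Grafana service actually received (the exact service name depends on the chart version):

kubectl get svc -n monitoring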
