DeepOps on OpenStack POC

DeepOps on vScaler POC

The Ubuntu driver and all the components required for a simple Kubernetes deployment with GPUs (1 master, 1 worker) are available in the ireland.south1 region. Below is a description of how to use it.

First, as an admin, share the DGX image with your project:

[root@lhc-headnode ~]# source /etc/kolla/admin-openrc.sh
[root@lhc-headnode ~]# openstack image add project ubuntu-software-config-dgx <your-project-name>
[root@lhc-headnode ~]# openstack image member list ubuntu-software-config-dgx

The last command will show the image in the "pending" state. Log in as your regular user and accept the image either through Horizon or with the command:

$ openstack image set --accept <uuid-of-the-image>

(openstack CLI version 3.16.1 doesn't allow accepting images by name)
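
The image UUID can be looked up as the admin (or as any user the image is already visible to):

[root@lhc-headnode ~]# openstack image show ubuntu-software-config-dgx -f value -c id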

All of this is done this way for now because the vGPU driver and settings for connecting to our license server are baked into the image.

Next, head to Container Infra -> Cluster Templates in the dashboard, where you can see the "deepops-poc-template" template. This template was created using the following command:

[root@lhc-headnode ~]# openstack coe cluster template create --coe kubernetes --image ubuntu-software-config-dgx --external-network public1 --flavor g1.large.1xk80 --master-flavor m1.medium --docker-storage-driver overlay --public --floating-ip-disabled deepops-poc-template

Create a cluster from the template by clicking "Create Cluster" next to it and specifying the cluster name and your keypair (leave all other parameters at their defaults). You can also create a cluster with these commands:

$ openstack coe cluster template list
$ openstack coe cluster create --cluster-template deepops-poc-template --keypair <your-keypair> <cluster-name>
$ openstack coe cluster list

Now, head to Orchestration -> Stacks and wait for the stack associated with the cluster to transition into the "Create Complete" state. When this is done, check the list of your instances in Compute -> Instances and find the master node of the cluster. Assign a floating IP to it and SSH into it.
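
The same steps can be done from the CLI. This is a rough sketch: the public1 network name comes from the cluster template above, and the ubuntu login user is an assumption for this Ubuntu-based image:

$ openstack coe cluster show <cluster-name> -c status
$ openstack server list
$ openstack floating ip create public1
$ openstack server add floating ip <master-instance> <floating-ip>
$ ssh -i <path-to-keypair-private-key> ubuntu@<floating-ip>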

Confirm Kubernetes works by launching a test pod:

kubectl get nodes
git clone https://github.com/NVIDIA/deepops
cd deepops/
git checkout a0479e14b5bd0a4595a98ed86fd129ba8ec9a75d
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.12/nvidia-device-plugin.yml
kubectl apply -f tests/gpu-test-job.yml
kubectl get pods -o wide

Wait for the pod to become ready and run these commands to confirm the pod can access the vGPU:

kubectl describe pod gpu-pod
kubectl exec -ti gpu-pod -- nvidia-smi
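
If you are scripting these steps, a polling loop in the same style as the tiller wait further down can block until the test pod is Running (this assumes the test job creates a pod named gpu-pod, as used above):

while [ -z "$(kubectl get pods | grep gpu-pod | grep Running)" ]
do
  echo $(kubectl get pods | grep gpu-pod)
  sleep 2
done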

Enable monitoring (Prometheus + Grafana)

This description is based on the README of https://github.com/NVIDIA/deepops.

WARNING: The commands below will only work in a setup with a single master and one or more worker nodes.

On the master with the DeepOps repo cloned, first prepare the config:

cd deepops/
cp -r config.example/ config
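# set kube_version in the DeepOps config to the version the running cluster reports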
KUBE_VERSION=$(kubectl version --short | grep -i server | cut -d ":" -f 2)
sed -i.bak "s/kube_version:.*$/kube_version:${KUBE_VERSION}/" config/kube.yml
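# remove the scheduling taints so pods can be placed on every node (the trailing true keeps the step from failing if a taint is already absent)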
kubectl taint nodes --all node-role.kubernetes.io/master- || true
kubectl taint nodes --all kubeadmNode- || true
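# this POC has a single master; label it so the node selector used for tiller below can match it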
MASTER=$(kubectl get nodes -l node-role.kubernetes.io/master= -o jsonpath='{.items[*].metadata.name}')
kubectl label node $MASTER node-role.kubernetes.io/master=true --overwrite
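# install the helm client; ~/.local/bin is added to PATH so a locally installed binary is picked up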
./scripts/helm_install_linux.sh
export PATH=$PATH:~/.local/bin
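# set up a tiller service account with cluster-admin and initialize helm, pinning tiller to the master node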
kubectl create sa tiller --namespace kube-system
kubectl create clusterrolebinding tiller --clusterrole cluster-admin --serviceaccount=kube-system:tiller
helm init --service-account tiller --node-selectors node-role.kubernetes.io/master=true

Wait for the tiller pod to become ready. Here is a quick loop that polls until it is running:

while [ -z "$( kubectl get pods -n kube-system | grep tiller | grep Running)" ]
do
  echo $(kubectl get pods -n kube-system | grep tiller)
  sleep 2
done

Then proceed:

helm repo add coreos https://s3-eu-west-1.amazonaws.com/coreos-charts/stable/
helm install coreos/prometheus-operator --name prometheus-operator --namespace monitoring --values config/prometheus-operator.yml
kubectl create configmap kube-prometheus-grafana-gpu --from-file=config/gpu-dashboard.json -n monitoring
helm install coreos/kube-prometheus --name kube-prometheus --namespace monitoring --values config/kube-prometheus.yml
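
Before moving on, you can check that the monitoring pods are coming up:

kubectl get pods -n monitoring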

Label all your GPU nodes with these commands:

MINIONS=$(kubectl get nodes -l node-role.kubernetes.io/node= -o jsonpath='{.items[*].metadata.name}')
for minion in $MINIONS
do
  kubectl label nodes $minion hardware-type=NVIDIAGPU
done
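
You can verify the label was applied (-L adds a column with that label's value):

kubectl get nodes -L hardware-type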

Then deploy the exporter:

kubectl create -f services/dcgm-exporter.yml

Finally, head to http://<your-floating-ip>:30200 and select the "GPU Nodes" dashboard to see graphs with metrics from your GPU nodes.
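
If the dashboard does not come up, check the Grafana service and its NodePort (the exact service name depends on the chart version):

kubectl get svc -n monitoring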

Resources