Add the GPU Operator to a Rancher-deployed k8s environment

From Define Wiki

Set up Helm

# install Helm (macOS, via Homebrew)
brew install helm

Add the NVIDIA Helm repo

helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update


Install the GPU operator

david@Davids-MacBook-Air-2 ~ % helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator
NAME: gpu-operator-1676571811
LAST DEPLOYED: Thu Feb 16 18:23:33 2023
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None

Check the status after the install

david@Davids-MacBook-Air-2 ~ % kubectl get pods -n gpu-operator
NAME                                                              READY   STATUS      RESTARTS         AGE
gpu-feature-discovery-rxh9p                                       1/1     Running     0                13h
gpu-operator-1676571811-node-feature-discovery-master-5d45zf949   1/1     Running     0                13h
gpu-operator-1676571811-node-feature-discovery-worker-zkqhn       1/1     Running     0                13h
gpu-operator-6c4c6f484-k97n9                                      1/1     Running     0                13h
nvidia-container-toolkit-daemonset-snzzv                          1/1     Running     0                13h
nvidia-cuda-validator-vwldd                                       0/1     Completed   0                7h10m
nvidia-dcgm-exporter-bmbwn                                        1/1     Running     0                13h
nvidia-device-plugin-daemonset-9jvxm                              1/1     Running     0                13h
nvidia-device-plugin-validator-dlblr                              0/1     Completed   0                7h10m
nvidia-driver-daemonset-lppgj                                     1/1     Running     32 (7h19m ago)   13h
nvidia-mig-manager-lrx6m                                          1/1     Running     0                13h
nvidia-operator-validator-5dtlz                                   1/1     Running     0                13h
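
With this many pods, a small filter can make problems easier to spot. The following is a minimal sketch, not part of the GPU Operator tooling: the function name unhealthy_pods is hypothetical, and it only looks at the STATUS column, so a pod that is Running but not yet ready (for example 0/1) would not be reported.

```shell
# Print the name and status of every pod that is neither Running nor Completed.
# Reads `kubectl get pods` table output on stdin; prints nothing if all is well.
# (unhealthy_pods is a hypothetical helper name, not a kubectl feature.)
unhealthy_pods() {
  # NR > 1 skips the header row; NF >= 3 skips blank or short lines.
  awk 'NR > 1 && NF >= 3 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Usage (assumes kubectl is pointed at the cluster used above):
#   kubectl get pods -n gpu-operator | unhealthy_pods
```

For the listing above the filter would print nothing, since every pod is either Running or Completed.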

Run a GPU test job (nvidia-smi)

david@Davids-MacBook-Air-2 ~ % kubectl run gpu-test \
--rm -t -i \
--restart=Never \
--image=nvcr.io/nvidia/cuda:10.1-base-ubuntu18.04 nvidia-smi
Fri Feb 17 08:09:04 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:00:05.0 Off |                    0 |
| N/A   26C    P0    50W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
pod "gpu-test" deleted
david@Davids-MacBook-Air-2 ~ %
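
The kubectl run command above does not ask the scheduler for a GPU. As a variation, the sketch below writes a pod manifest that requests one GPU through the nvidia.com/gpu resource advertised by the NVIDIA device plugin. The file name gpu-test-pod.yaml and the pod name gpu-test-limits are assumptions made for illustration; the image is the same one used in the test above.

```shell
# Write a pod manifest that requests one GPU and runs nvidia-smi once.
# File name and pod name are hypothetical; adjust them to your environment.
cat > gpu-test-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-limits
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:10.1-base-ubuntu18.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# Then, against the cluster:
#   kubectl apply -f gpu-test-pod.yaml
#   kubectl logs gpu-test-limits
#   kubectl delete pod gpu-test-limits
```

If the device plugin is not advertising GPUs on any node, a pod like this should stay Pending, which makes it a useful check of the scheduling path as well as the driver.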