VScaler: vGPU Configuration
Prerequisites
This guide assumes you are using CentOS 7 and OpenStack Train or later.
First, download the latest NVIDIA GRID software for RHEL/Linux KVM from https://nvidia.flexnetoperations.com/control/nvda/download?agree=Accept&element=10189877. This will give you an archive named NVIDIA-GRID-Linux-KVM-<driver-version>.zip, for example NVIDIA-GRID-Linux-KVM-450.89-452.57.zip.
Inside the archive you will find these 2 scripts:
- NVIDIA-Linux-x86_64-<driver-version>-vgpu-kvm.run -- the driver installer for hypervisors;
- NVIDIA-Linux-x86_64-<driver-version>-grid.run -- the driver installer for guests (VMs).
Hypervisor config
Transfer the first .run file to your hypervisor.
Enable IOMMU in grub by adding "intel_iommu=on iommu=pt nouveau.blacklist=1" (for Intel-based CPUs) or "amd_iommu=on iommu=pt nouveau.blacklist=1" (for AMD-based CPUs) to GRUB_CMDLINE_LINUX in /etc/default/grub. Run grub2-mkconfig -o /boot/grub2/grub.cfg to update your current grub config.
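For reference, the relevant line in /etc/default/grub on an Intel host would end up looking something like the example below; the other kernel parameters shown are typical CentOS 7 defaults and will differ on your system:

GRUB_CMDLINE_LINUX="crashkernel=auto rhgb quiet intel_iommu=on iommu=pt nouveau.blacklist=1"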
Then install the GRID driver:
# yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r) gcc -y
...
# bash NVIDIA-Linux-x86_64-450.89-vgpu-kvm.run
...
Reboot to BIOS and make sure both Virtualisation (typically in the CPU section) and SR-IOV/IOMMU (typically in the PCI section) are enabled. Then boot up the operating system.
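Once the host is back up, it is worth sanity-checking that the IOMMU was actually enabled; on Intel systems the kernel log typically contains a "DMAR: IOMMU enabled" line, and on AMD systems AMD-Vi messages (the exact wording varies by kernel version):

dmesg | grep -i -e DMAR -e IOMMU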
You should now see vfio kernel modules and mdev devices:
# lsmod | grep vfio
nvidia_vgpu_vfio       54216  0
nvidia              19705995  10 nvidia_vgpu_vfio
vfio_mdev              12841  0
mdev                   20756  2 vfio_mdev,nvidia_vgpu_vfio
vfio_iommu_type1       22440  0
vfio                   32657  3 vfio_mdev,nvidia_vgpu_vfio,vfio_iommu_type1
# ls /sys/class/mdev_bus/*/mdev_supported_types
/sys/class/mdev_bus/0000:01:00.0/mdev_supported_types:
nvidia-256  nvidia-259  nvidia-262  nvidia-344  nvidia-347  nvidia-437  nvidia-440  nvidia-443
nvidia-257  nvidia-260  nvidia-263  nvidia-345  nvidia-435  nvidia-438  nvidia-441  nvidia-444
nvidia-258  nvidia-261  nvidia-343  nvidia-346  nvidia-436  nvidia-439  nvidia-442
...
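It is also worth confirming that the NVIDIA vGPU manager services started and that the host driver can see the physical GPU (the service names below are the ones used later in the Old Notes section and may vary slightly between GRID releases):

nvidia-smi
systemctl status nvidia-vgpud.service nvidia-vgpu-mgr.service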
OpenStack config
The first order of business is finding out which mdev device type corresponds to the vGPU profile you want to use. You can list the mappings with the following script:
#!/bin/bash
PCI_ADDR=0000:01:00.0
for devtype in $(ls -1 "/sys/class/mdev_bus/${PCI_ADDR}/mdev_supported_types")
do
profile=$(cat "/sys/class/mdev_bus/${PCI_ADDR}/mdev_supported_types/$devtype/name")
echo "$devtype -> $profile"
done
where PCI_ADDR is one of the PCI addresses listed in /sys/class/mdev_bus/ (you can cross-check these against the bus addresses printed by lspci | grep -i nvidia).
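Each supported type also exposes how many more vGPUs of that type can still be created on the physical GPU, which is a quick way to confirm capacity before picking a profile. A minimal sketch, assuming the same PCI address as above:

#!/bin/bash
PCI_ADDR=0000:01:00.0
# available_instances is a standard mdev sysfs attribute exposed per supported type
for devtype in /sys/class/mdev_bus/${PCI_ADDR}/mdev_supported_types/*
do
  echo "$(basename "$devtype"): $(cat "$devtype/available_instances") instance(s) available"
done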
Knowing the mdev device type, you can then add the following kolla-ansible overrides:
/etc/kolla/config/nova.conf
[DEFAULT]
# A workaround for a race condition when launching multiple vGPU instances at once
max_concurrent_builds = 1
/etc/kolla/config/nova/nova-compute.conf
[devices]
enabled_vgpu_types = <selected-mdev-device-type>
# For example:
#enabled_vgpu_types = nvidia-261
Run kolla-ansible reconfigure -t nova to apply these changes.
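After reconfiguring, nova-compute should report a VGPU resource class to Placement. If you have the osc-placement CLI plugin installed, you can verify this with something along these lines (the resource provider name and UUID are specific to your deployment):

openstack resource provider list
openstack resource provider inventory list <resource-provider-uuid>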
After the reconfigure is done, you should be able to launch an instance with a vGPU by running the following OpenStack commands:
openstack flavor create --public m1.small.vgpu --vcpus 1 --ram 2048 --disk 10 --property "resources:VGPU=1"
openstack server create --image <centos7-cloud-image> --flavor m1.small.vgpu --key-name <your-keypair> --security-group <extra-security-groups> --network <internal-network> demo_vgpu
Wait for the instance to become active and SSH into it.
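At this point a mediated device should have been created on the hypervisor for the instance. You can check this on the compute host; the second command requires the vGPU manager and may not be available on every GRID release:

ls /sys/bus/mdev/devices/
nvidia-smi vgpu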
Guest VM config
On your guest, you should be able to see the vGPU as a PCI device:
$ lspci | grep -i nvidia
00:05.0 VGA compatible controller: NVIDIA Corporation TU102GL (rev a1)
Transfer the NVIDIA-Linux-x86_64-<driver-version>-grid.run file to the instance, install the driver and set up the license config:
NOTE: You may need to run yum update and reboot first if your image is old.
sudo yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r) gcc -y
sudo bash ~/NVIDIA-Linux-x86_64-450.89-grid.run --silent --accept-license
tail -1 /var/log/nvidia-installer.log
sudo tee /etc/nvidia/gridd.conf << EOF
ServerAddress=185.93.31.35
ServerPort=7070
FeatureType=1
EOF
Then reboot the instance and when it's back up, run tests to confirm the vGPU can now be used:
lsmod | grep nvidia
nvidia-smi
systemctl status nvidia-gridd
sudo yum install wget -y
wget https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-repo-rhel7-10.1.243-1.x86_64.rpm
sudo rpm -i cuda-repo-rhel7-10.1.243-1.x86_64.rpm
sudo yum install cuda-toolkit-10-1 cuda-samples-10-1 -y
cd /usr/local/cuda/samples/1_Utilities/bandwidthTest/
sudo make clean && sudo make
./bandwidthTest
The last command should output Result = PASS. If it does not, check the system logs for problems with the GRID license daemon:
sudo grep gridd /var/log/messages | tail
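You can also query the licensing state directly from the driver; on a correctly licensed guest, nvidia-smi -q includes a licensing section (the exact field names differ between GRID releases):

nvidia-smi -q | grep -i -A 2 licen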
Windows Server instances
For Windows-based instances, follow these instructions: https://docs.nvidia.com/grid/latest/grid-licensing-user-guide/index.html#licensing-grid-vgpu-windows
In short: install the Windows driver included in the same archive as the Linux vGPU/GRID drivers, then open the NVIDIA Control Panel and enter the IP address of your license server in the "Primary License Server" field. The text under "License Edition" should change to:
"Your system is licensed for Quadro Virtual Data Center Workstation."
Manual mdev device attachment
Instead of using OpenStack, you can manually attach a vGPU to a libvirt VM by first creating the mdev device:
# uuidgen > /sys/class/mdev_bus/<device>/mdev_supported_types/<type>/create
The created device then shows up as a libvirt node device, whose XML (as reported by virsh nodedev-dumpxml) looks like this; a sketch of actually attaching the device to a VM follows after the XML:
<device>
<name>mdev_4b20d080_1b54_4048_85b3_a6a62d165c01</name>
<path>/sys/devices/pci0000:00/0000:00:02.0/4b20d080-1b54-4048-85b3-a6a62d165c01</path>
<parent>pci_0000_06_00_0</parent>
<driver>
<name>vfio_mdev</name>
</driver>
<capability type='mdev'>
<type id='nvidia-11'/>
<iommuGroup number='12'/>
</capability>
</device>
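To hand the mediated device to a guest, reference its UUID from a <hostdev> element in the domain XML. Below is a minimal sketch assuming a hypothetical domain named demo-vm; the <device> and <type> placeholders are the same ones used in the create step above:

# Create the mdev on the hypervisor, keeping the UUID so we can reference it later.
UUID=$(uuidgen)
echo "$UUID" > /sys/class/mdev_bus/<device>/mdev_supported_types/<type>/create

# Confirm libvirt can see the new mediated device.
virsh nodedev-list --cap mdev

# Reference the mdev UUID from a <hostdev> element and add it to the domain's
# persistent config (the vGPU will be present the next time the domain starts).
cat > vgpu-hostdev.xml << EOF
<hostdev mode='subsystem' type='mdev' model='vfio-pci'>
  <source>
    <address uuid='${UUID}'/>
  </source>
</hostdev>
EOF
virsh attach-device demo-vm vgpu-hostdev.xml --config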
References
- https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#install-vgpu-package-generic-linux-kvm
- https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#virtual-gpu-types-grid-reference
- https://libvirt.org/drvnodedev.html#MDEV
Old Notes
1. yum install kernel kernel-devel -y
2. reboot
3. Download latest NVIDIA GRID software/drivers for RHEL KVM from here https://nvidia.flexnetoperations.com/control/nvda/download?agree=Accept&element=10189877
4. yum install gcc glibc -y
5. rpm -iv NVIDIA-vGPU-rhel-7.5-390.72.x86_64.rpm
6. Reboot
7. Check with:
a. lsmod | grep vfio
b. nvidia-smi
8. cp /usr/lib/nvidia/systemd/nvidia-vgpu* /usr/lib/systemd/system/
9. systemctl start nvidia-vgpu-mgr.service
10. systemctl enable nvidia-vgpu-mgr.service
11. systemctl start nvidia-vgpud.service
12. systemctl enable nvidia-vgpud.service
13. Check the /sys/class/mdev_bus/0000\:05\:00.0/mdev_supported_types/ directories and select one of the supported device types, e.g. nvidia-101
14. Create UUIDs and vGPU devices with them FOR EACH PHYSICAL GPU:
a. uuidgen
b. echo "af88fbf2-0110-4669-ab84-d747e9a9c19c" > /sys/class/mdev_bus/0000\:05\:00.0/mdev_supported_types/nvidia-101/create
15. Disable ECC on the GPUs on the host
16. Add the following to nova.conf of the gpu nodes:
[devices]
enabled_vgpu_types = nvidia-84
17. Add the following to nova.conf of the controller nodes:
[scheduler]
driver = filter_scheduler
[filter_scheduler]
available_filters = nova.scheduler.filters.all_filters
enabled_filters = AvailabilityZoneFilter, ComputeFilter, ComputeCapabilitiesFilter, ImagePropertiesFilter, ServerGroupAntiAffinityFilter, ServerGroupAffinityFilter
18. Do a kolla-ansible reconfigure to apply the above settings
19. Create a flavor with the following property:
a. --property "resources:VGPU=1"