= VScaler: vGPU Configuration =
== Prerequisites ==

This guide assumes you are using CentOS 7 and OpenStack Train or later.

First of all, download the latest NVIDIA GRID software for RHEL/Linux KVM from https://nvidia.flexnetoperations.com/control/nvda/download?agree=Accept&element=10189877. This should give you an archive file named <code>NVIDIA-GRID-Linux-KVM-<driver-version>.zip</code>, for example <code>NVIDIA-GRID-Linux-KVM-450.89-452.57.zip</code>.

Inside the archive you will find these two installers:

* <code>NVIDIA-Linux-x86_64-<driver-version>-vgpu-kvm.run</code> -- this is the driver installer for '''hypervisors''';
* <code>NVIDIA-Linux-x86_64-<driver-version>-grid.run</code> -- this is the driver installer for '''guests (VMs)'''.
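
To get at the installers, simply unzip the archive; a quick sketch (the filenames assume the example 450.89-452.57 release above and will vary with the driver version):

<nowiki>
$ unzip NVIDIA-GRID-Linux-KVM-450.89-452.57.zip
$ ls *.run
# both .run installers should be listed here
</nowiki>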

== Hypervisor config ==

Transfer the first .run file to your hypervisor.

Enable IOMMU in grub by adding <code>intel_iommu=on iommu=pt nouveau.blacklist=1</code> (for Intel-based CPUs) or <code>amd_iommu=on iommu=pt nouveau.blacklist=1</code> (for AMD-based CPUs) to <code>GRUB_CMDLINE_LINUX</code> in <code>/etc/default/grub</code>. Run <code>grub2-mkconfig -o /boot/grub2/grub.cfg</code> to update your current grub config (on UEFI hosts the generated config may instead live at <code>/boot/efi/EFI/centos/grub.cfg</code>).
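
For reference, the edited line might then look like this (the non-IOMMU parameters shown are just typical CentOS 7 defaults, not requirements), and after the reboot described below you can confirm the settings took effect:

<nowiki>
# grep GRUB_CMDLINE_LINUX= /etc/default/grub
GRUB_CMDLINE_LINUX="crashkernel=auto rhgb quiet intel_iommu=on iommu=pt nouveau.blacklist=1"

# After rebooting:
# cat /proc/cmdline                    # the new parameters should be listed here
# dmesg | grep -i -e DMAR -e IOMMU     # look for messages indicating the IOMMU is enabled
</nowiki>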

Then install the GRID driver:

<nowiki>
# yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r) gcc -y
...
# bash NVIDIA-Linux-x86_64-450.89-vgpu-kvm.run
...
</nowiki>

Reboot to BIOS and make sure both Virtualisation (typically in the CPU section) and SR-IOV/IOMMU (typically in the PCI section) are enabled. Then boot up the operating system.

You should now see vfio kernel modules and mdev devices:
<nowiki>
# lsmod | grep vfio
nvidia_vgpu_vfio       54216  0
nvidia              19705995  10 nvidia_vgpu_vfio
vfio_mdev              12841  0
mdev                   20756  2 vfio_mdev,nvidia_vgpu_vfio
vfio_iommu_type1       22440  0
vfio                   32657  3 vfio_mdev,nvidia_vgpu_vfio,vfio_iommu_type1
# ls /sys/class/mdev_bus/*/mdev_supported_types
/sys/class/mdev_bus/0000:01:00.0/mdev_supported_types:
nvidia-256  nvidia-259  nvidia-262  nvidia-344  nvidia-347  nvidia-437  nvidia-440  nvidia-443
nvidia-257  nvidia-260  nvidia-263  nvidia-345  nvidia-435  nvidia-438  nvidia-441  nvidia-444
nvidia-258  nvidia-261  nvidia-343  nvidia-346  nvidia-436  nvidia-439  nvidia-442
...
</nowiki>
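
It is also worth confirming that the driver sees the physical GPU and that the NVIDIA vGPU host services are running (the same <code>nvidia-vgpud</code> and <code>nvidia-vgpu-mgr</code> services mentioned in the Old Notes below); a quick sketch:

<nowiki>
# nvidia-smi
...
# systemctl status nvidia-vgpud nvidia-vgpu-mgr
...
</nowiki>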

== OpenStack config ==

The first order of business is finding out which mdev device type matches the vGPU profile you want to use. For this you can use the following script:

<nowiki>
#!/bin/bash

# Print the human-readable vGPU profile name for each supported mdev device type
PCI_ADDR=0000:01:00.0

for devtype in $(ls -1 "/sys/class/mdev_bus/${PCI_ADDR}/mdev_supported_types")
do
  profile=$(cat "/sys/class/mdev_bus/${PCI_ADDR}/mdev_supported_types/$devtype/name")
  echo "$devtype -> $profile"
done
</nowiki>

where <code>PCI_ADDR</code> is one of the PCI addresses in <code>/sys/class/mdev_bus/</code> (you can cross-check these against the bus addresses output by <code>lspci | grep -i nvidia</code>).

Knowing the mdev device type, you can then add the following <code>kolla-ansible</code> overrides:

<code>/etc/kolla/config/nova.conf</code>
<nowiki>
[DEFAULT]
# A workaround for a race condition when launching multiple vGPU instances at once
max_concurrent_builds = 1
</nowiki>

<code>/etc/kolla/config/nova/nova-compute.conf</code>
<nowiki>
[devices]
enabled_vgpu_types = <selected-mdev-device-type>
# For example:
#enabled_vgpu_types = nvidia-261
</nowiki>

Run <code>kolla-ansible reconfigure -t nova</code> to apply these changes.
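
Before launching anything, it is worth checking that the Placement service now exposes VGPU inventory for the compute node. A sketch, assuming the <code>osc-placement</code> CLI plugin is installed; provider names and UUIDs will differ on your deployment:

<nowiki>
openstack resource provider list
# On Train and later the VGPU inventory usually sits on a nested provider named after the GPU's PCI address
openstack resource provider inventory list <resource-provider-uuid>
# Or simply ask Placement whether a vGPU could be allocated anywhere:
openstack allocation candidate list --resource VGPU=1
</nowiki>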

You should then be able to launch an instance with a vGPU by running the following OpenStack commands:

<nowiki>
openstack flavor create --public m1.small.vgpu --vcpus 1 --ram 2048 --disk 10 --property "resources:VGPU=1"
openstack server create --image <centos7-cloud-image> --flavor m1.small.vgpu --key-name <your-keypair> --security-group <extra-security-groups> --network <internal-network> demo_vgpu
</nowiki>

Wait for the instance to become active and SSH into it.
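
On the hypervisor side you can also verify that Nova created a mediated device for the instance; a quick sketch (the <code>nvidia-smi vgpu</code> subcommand is provided by the host vGPU driver):

<nowiki>
# ls /sys/bus/mdev/devices/
# nvidia-smi vgpu
</nowiki>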

== Guest VM config ==

On your guest, you should be able to see the vGPU as a PCI device:

<nowiki>
$ lspci | grep -i nvidia
00:05.0 VGA compatible controller: NVIDIA Corporation TU102GL (rev a1)
</nowiki>

Transfer the <code>NVIDIA-Linux-x86_64-<driver-version>-grid.run</code> file to the instance, install the driver and set up the license config:

NOTE: You may need to run <code>yum update</code> and restart first if your image is old.

<nowiki>
# Build tools and headers matching the running kernel are needed to build the driver module
sudo yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r) gcc -y
sudo bash ~/NVIDIA-Linux-x86_64-450.89-grid.run --silent --accept-license
# Confirm the installer finished successfully
tail -1 /var/log/nvidia-installer.log
# Point the guest at the GRID license server (FeatureType=1 requests a vGPU license)
sudo tee /etc/nvidia/gridd.conf << EOF
ServerAddress=185.93.31.35
ServerPort=7070
FeatureType=1
EOF
</nowiki>

Then reboot the instance and when it's back up, run tests to confirm the vGPU can now be used:

<nowiki>
lsmod | grep nvidia
nvidia-smi
systemctl status nvidia-gridd
# Build and run a CUDA sample to confirm the vGPU is actually usable for compute
sudo yum install wget -y
wget https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-repo-rhel7-10.1.243-1.x86_64.rpm
sudo rpm -i cuda-repo-rhel7-10.1.243-1.x86_64.rpm
sudo yum install cuda-toolkit-10-1 cuda-samples-10-1 -y
cd /usr/local/cuda/samples/1_Utilities/bandwidthTest/
sudo make clean && sudo make
./bandwidthTest
</nowiki>

The last command should output <code>Result = PASS</code>. If it doesn't, check the system logs for problems with the GRID license daemon:

<nowiki>
sudo grep gridd /var/log/messages | tail
</nowiki>
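
You can also query the license state reported by the guest driver directly; a sketch (the exact section names in the <code>nvidia-smi -q</code> output vary between driver versions):

<nowiki>
nvidia-smi -q | grep -i -A2 license
</nowiki>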

=== Windows Server instances ===

For Windows-based instances, follow these instructions:
https://docs.nvidia.com/grid/latest/grid-licensing-user-guide/index.html#licensing-grid-vgpu-windows

In short: install the Windows driver included in the archive alongside the Linux vGPU/GRID drivers, then open <code>NVIDIA Control Panel</code> and add the IP of your license server in the "Primary License Server" field. The text under "License Edition" should change to:

"Your system is licensed for Quadro Virtual Data Center Workstation."

== Manual mdev device attachment ==

Instead of using OpenStack, you can manually attach a vGPU to a libvirt VM. First create the mdev device, keeping a note of the generated UUID (you will need it for the domain config below):

<nowiki>
# UUID=$(uuidgen)
# echo "$UUID" > /sys/class/mdev_bus/<device>/mdev_supported_types/<type>/create
</nowiki>

libvirt then exposes the new mdev as a node device, which <code>virsh nodedev-dumpxml</code> describes like so:

<nowiki>
<device>
  <name>mdev_4b20d080_1b54_4048_85b3_a6a62d165c01</name>
  <path>/sys/devices/pci0000:00/0000:00:02.0/4b20d080-1b54-4048-85b3-a6a62d165c01</path>
  <parent>pci_0000_06_00_0</parent>
  <driver>
    <name>vfio_mdev</name>
  </driver>
  <capability type='mdev'>
    <type id='nvidia-11'/>
    <iommuGroup number='12'/>
  </capability>
</device>
</nowiki>
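
To actually hand this mdev to a guest, reference its UUID from a <code><hostdev></code> element in the VM's domain XML. A minimal sketch, using the example UUID above:

<nowiki>
<hostdev mode='subsystem' type='mdev' model='vfio-pci'>
  <source>
    <address uuid='4b20d080-1b54-4048-85b3-a6a62d165c01'/>
  </source>
</hostdev>
</nowiki>

You can add this with <code>virsh edit <domain></code>, or save it to a file and apply it with <code>virsh attach-device <domain> <file> --config</code>; see the libvirt mdev documentation in the references below for details.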

== References ==

# https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#install-vgpu-package-generic-linux-kvm
# https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#virtual-gpu-types-grid-reference
# https://libvirt.org/drvnodedev.html#MDEV

== Old Notes ==

<pre>
1. Yum install kernel kernel-devel -y

2. reboot

3. Download latest NVIDIA GRID software/drivers for RHEL KVM from here https://nvidia.flexnetoperations.com/control/nvda/download?agree=Accept&element=10189877

4. Yum install gcc glibc -y

5. Rpm -iv NVIDIA-vGPU-rhel-7.5-390.72.x86_64.rpm

6. Reboot

7. Check with:
	a. lsmod | grep vfio
	b. Nvidia-smi

8. cp /usr/lib/nvidia/systemd/nvidia-vgpu* /usr/lib/systemd/system/

9. Systemctl start nvidia-vgpu-mgr.service

10. Systemctl enable nvidia-vgpu-mgr.service

11. systemctl start nvidia-vgpud.service

12. systemctl enable nvidia-vgpud.service

13. Check the /sys/class/mdev_bus/0000\:05\:00.0/mdev_supported_types/ directories and select one of the supported devices, eg nvidia-101

14. Create uuids and vgpu devices with them FOR EACH PHYSICAL GPU:
	a. uuidgen
	b. echo "af88fbf2-0110-4669-ab84-d747e9a9c19c" > /sys/class/mdev_bus/0000\:05\:00.0/mdev_supported_types/nvidia-101/create

15. Disable ECC on the GPUs on the host

16. Add the following to nova.conf of the gpu nodes:
	[devices]
	enabled_vgpu_types = nvidia-84

17. Add the following to nova.conf of the controller nodes:
	[scheduler]
	driver = filter_scheduler

	[filter_scheduler]
	available_filters = nova.scheduler.filters.all_filters
	enabled_filters = AvailabilityZoneFilter, ComputeFilter, ComputeCapabilitiesFilter, ImagePropertiesFilter, ServerGroupAntiAffinityFilter, ServerGroupAffinityFilter

18. Do a kolla-ansible reconfigure to apply the above settings

19. Create a flavor with the following property:
	a. --property "resources:VGPU=1"
</pre>
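
A note on step 15 above: on the host, ECC can be turned off for all GPUs with <code>nvidia-smi</code>; a sketch (the change only takes effect after the GPUs are reset or the host is rebooted):

<nowiki>
# nvidia-smi -e 0
# reboot
</nowiki>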
