= VScaler: vGPU Configuration =
== Prerequisites ==

This guide assumes you are using CentOS 7 and OpenStack Train or later.

First of all, download the latest NVIDIA GRID software for RHEL/Linux KVM from https://nvidia.flexnetoperations.com/control/nvda/download?agree=Accept&element=10189877. This should give you an archive file named <code>NVIDIA-GRID-Linux-KVM-<driver-version>.zip</code>, for example <code>NVIDIA-GRID-Linux-KVM-450.89-452.57.zip</code>.

Inside the archive you will find these two installers:

* <code>NVIDIA-Linux-x86_64-<driver-version>-vgpu-kvm.run</code> -- this is the driver installer for '''hypervisors''';
* <code>NVIDIA-Linux-x86_64-<driver-version>-grid.run</code> -- this is the driver installer for '''guests (VMs)'''.
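
To get at the installers, simply unzip the archive; a quick sketch (the filenames assume the example 450.89-452.57 release above and will vary with the driver version):

<nowiki>
$ unzip NVIDIA-GRID-Linux-KVM-450.89-452.57.zip
$ ls *.run
# both .run installers should be listed here
</nowiki>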

== Hypervisor config ==

Transfer the first .run file to your hypervisor.

Enable IOMMU in grub by adding <code>intel_iommu=on iommu=pt nouveau.blacklist=1</code> (for Intel-based CPUs) or <code>amd_iommu=on iommu=pt nouveau.blacklist=1</code> (for AMD-based CPUs) to <code>GRUB_CMDLINE_LINUX</code> in <code>/etc/default/grub</code>. Run <code>grub2-mkconfig -o /boot/grub2/grub.cfg</code> to update your current grub config (on UEFI hosts the generated config may instead live at <code>/boot/efi/EFI/centos/grub.cfg</code>).
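
For reference, the edited line might then look like this (the non-IOMMU parameters shown are just typical CentOS 7 defaults, not requirements), and after the reboot described below you can confirm the settings took effect:

<nowiki>
# grep GRUB_CMDLINE_LINUX= /etc/default/grub
GRUB_CMDLINE_LINUX="crashkernel=auto rhgb quiet intel_iommu=on iommu=pt nouveau.blacklist=1"

# After rebooting:
# cat /proc/cmdline                    # the new parameters should be listed here
# dmesg | grep -i -e DMAR -e IOMMU     # look for messages indicating the IOMMU is enabled
</nowiki>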

Then install the GRID driver:

<nowiki>
# yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r) gcc -y
...
# bash NVIDIA-Linux-x86_64-450.89-vgpu-kvm.run
...
</nowiki>

Reboot to BIOS and make sure both Virtualisation (typically in the CPU section) and SR-IOV/IOMMU (typically in the PCI section) are enabled. Then boot up the operating system.

You should now see vfio kernel modules and mdev devices:
<nowiki>
# lsmod | grep vfio
nvidia_vgpu_vfio       54216  0
nvidia              19705995  10 nvidia_vgpu_vfio
vfio_mdev              12841  0
mdev                   20756  2 vfio_mdev,nvidia_vgpu_vfio
vfio_iommu_type1       22440  0
vfio                   32657  3 vfio_mdev,nvidia_vgpu_vfio,vfio_iommu_type1
# ls /sys/class/mdev_bus/*/mdev_supported_types
/sys/class/mdev_bus/0000:01:00.0/mdev_supported_types:
nvidia-256  nvidia-259  nvidia-262  nvidia-344  nvidia-347  nvidia-437  nvidia-440  nvidia-443
nvidia-257  nvidia-260  nvidia-263  nvidia-345  nvidia-435  nvidia-438  nvidia-441  nvidia-444
nvidia-258  nvidia-261  nvidia-343  nvidia-346  nvidia-436  nvidia-439  nvidia-442
...
</nowiki>
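
It is also worth confirming that the driver sees the physical GPU and that the NVIDIA vGPU host services are running (the same <code>nvidia-vgpud</code> and <code>nvidia-vgpu-mgr</code> services mentioned in the Old Notes below); a quick sketch:

<nowiki>
# nvidia-smi
...
# systemctl status nvidia-vgpud nvidia-vgpu-mgr
...
</nowiki>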

== OpenStack config ==

The first order of business is finding out which mdev device type matches the vGPU profile you want to use. For this you can use the following script:

<nowiki>
#!/bin/bash

# Print the human-readable vGPU profile name for each supported mdev device type
PCI_ADDR=0000:01:00.0

for devtype in $(ls -1 "/sys/class/mdev_bus/${PCI_ADDR}/mdev_supported_types")
do
  profile=$(cat "/sys/class/mdev_bus/${PCI_ADDR}/mdev_supported_types/$devtype/name")
  echo "$devtype -> $profile"
done
</nowiki>

where <code>PCI_ADDR</code> is one of the PCI addresses in <code>/sys/class/mdev_bus/</code> (you can cross-check these against the bus addresses output by <code>lspci | grep -i nvidia</code>).

Knowing the mdev device type, you can then add the following <code>kolla-ansible</code> overrides:

<code>/etc/kolla/config/nova.conf</code>
<nowiki>
[DEFAULT]
# A workaround for a race condition when launching multiple vGPU instances at once
max_concurrent_builds = 1
</nowiki>

<code>/etc/kolla/config/nova/nova-compute.conf</code>
<nowiki>
[devices]
enabled_vgpu_types = <selected-mdev-device-type>
# For example:
#enabled_vgpu_types = nvidia-261
</nowiki>

Run <code>kolla-ansible reconfigure -t nova</code> to apply these changes.
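
Before launching anything, it is worth checking that the Placement service now exposes VGPU inventory for the compute node. A sketch, assuming the <code>osc-placement</code> CLI plugin is installed; provider names and UUIDs will differ on your deployment:

<nowiki>
openstack resource provider list
# On Train and later the VGPU inventory usually sits on a nested provider named after the GPU's PCI address
openstack resource provider inventory list <resource-provider-uuid>
# Or simply ask Placement whether a vGPU could be allocated anywhere:
openstack allocation candidate list --resource VGPU=1
</nowiki>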

You should then be able to launch an instance with a vGPU by running the following OpenStack commands:

<nowiki>
openstack flavor create --public m1.small.vgpu --vcpus 1 --ram 2048 --disk 10 --property "resources:VGPU=1"
openstack server create --image <centos7-cloud-image> --flavor m1.small.vgpu --key-name <your-keypair> --security-group <extra-security-groups> --network <internal-network> demo_vgpu
</nowiki>

Wait for the instance to become active and SSH into it.
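
On the hypervisor side you can also verify that Nova created a mediated device for the instance; a quick sketch (the <code>nvidia-smi vgpu</code> subcommand is provided by the host vGPU driver):

<nowiki>
# ls /sys/bus/mdev/devices/
# nvidia-smi vgpu
</nowiki>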

== Guest VM config ==

On your guest, you should be able to see the vGPU as a PCI device:

<nowiki>
$ lspci | grep -i nvidia
00:05.0 VGA compatible controller: NVIDIA Corporation TU102GL (rev a1)
</nowiki>

Transfer the <code>NVIDIA-Linux-x86_64-<driver-version>-grid.run</code> file to the instance, install the driver and set up the license config:

NOTE: You may need to run <code>yum update</code> and restart first if your image is old.

<nowiki>
# Build tools and headers matching the running kernel are needed to build the driver module
sudo yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r) gcc -y
sudo bash ~/NVIDIA-Linux-x86_64-450.89-grid.run --silent --accept-license
# Confirm the installer finished successfully
tail -1 /var/log/nvidia-installer.log
# Point the guest at the GRID license server (FeatureType=1 requests a vGPU license)
sudo tee /etc/nvidia/gridd.conf << EOF
ServerAddress=185.93.31.35
ServerPort=7070
FeatureType=1
EOF
</nowiki>

Then reboot the instance and when it's back up, run tests to confirm the vGPU can now be used:

<nowiki>
lsmod | grep nvidia
nvidia-smi
systemctl status nvidia-gridd
# Build and run a CUDA sample to confirm the vGPU is actually usable for compute
sudo yum install wget -y
wget https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-repo-rhel7-10.1.243-1.x86_64.rpm
sudo rpm -i cuda-repo-rhel7-10.1.243-1.x86_64.rpm
sudo yum install cuda-toolkit-10-1 cuda-samples-10-1 -y
cd /usr/local/cuda/samples/1_Utilities/bandwidthTest/
sudo make clean && sudo make
./bandwidthTest
</nowiki>

The last command should output <code>Result = PASS</code>. If it doesn't, check the system logs for problems with the GRID license daemon:

<nowiki>
sudo grep gridd /var/log/messages | tail
</nowiki>
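
You can also query the license state reported by the guest driver directly; a sketch (the exact section names in the <code>nvidia-smi -q</code> output vary between driver versions):

<nowiki>
nvidia-smi -q | grep -i -A2 license
</nowiki>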

=== Windows Server instances ===

For Windows-based instances, follow these instructions:
https://docs.nvidia.com/grid/latest/grid-licensing-user-guide/index.html#licensing-grid-vgpu-windows

In short: install the Windows driver included in the archive alongside the Linux vGPU/GRID drivers, then open <code>NVIDIA Control Panel</code> and add the IP of your license server in the "Primary License Server" field. The text under "License Edition" should change to:

"Your system is licensed for Quadro Virtual Data Center Workstation."

== Manual mdev device attachment ==

Instead of using OpenStack, you can manually attach a vGPU to a libvirt VM. First create the mdev device, keeping a note of the generated UUID (you will need it for the domain config below):

<nowiki>
# UUID=$(uuidgen)
# echo "$UUID" > /sys/class/mdev_bus/<device>/mdev_supported_types/<type>/create
</nowiki>

libvirt then exposes the new mdev as a node device, which <code>virsh nodedev-dumpxml</code> describes like so:

<nowiki>
<device>
  <name>mdev_4b20d080_1b54_4048_85b3_a6a62d165c01</name>
  <path>/sys/devices/pci0000:00/0000:00:02.0/4b20d080-1b54-4048-85b3-a6a62d165c01</path>
  <parent>pci_0000_06_00_0</parent>
  <driver>
    <name>vfio_mdev</name>
  </driver>
  <capability type='mdev'>
    <type id='nvidia-11'/>
    <iommuGroup number='12'/>
  </capability>
</device>
</nowiki>
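
To actually hand this mdev to a guest, reference its UUID from a <code><hostdev></code> element in the VM's domain XML. A minimal sketch, using the example UUID above:

<nowiki>
<hostdev mode='subsystem' type='mdev' model='vfio-pci'>
  <source>
    <address uuid='4b20d080-1b54-4048-85b3-a6a62d165c01'/>
  </source>
</hostdev>
</nowiki>

You can add this with <code>virsh edit <domain></code>, or save it to a file and apply it with <code>virsh attach-device <domain> <file> --config</code>; see the libvirt mdev documentation in the references below for details.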

== References ==

# https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#install-vgpu-package-generic-linux-kvm
# https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#virtual-gpu-types-grid-reference
# https://libvirt.org/drvnodedev.html#MDEV

== Old Notes ==

<pre>
1. Yum install kernel kernel-devel -y

2. reboot

3. Download latest NVIDIA GRID software/drivers for RHEL KVM from here https://nvidia.flexnetoperations.com/control/nvda/download?agree=Accept&element=10189877

4. Yum install gcc glibc -y

5. Rpm -iv NVIDIA-vGPU-rhel-7.5-390.72.x86_64.rpm

6. Reboot

7. Check with:
	a. lsmod | grep vfio
	b. Nvidia-smi

8. cp /usr/lib/nvidia/systemd/nvidia-vgpu* /usr/lib/systemd/system/

9. Systemctl start nvidia-vgpu-mgr.service

10. Systemctl enable nvidia-vgpu-mgr.service

11. systemctl start nvidia-vgpud.service

12. systemctl enable nvidia-vgpud.service

13. Check the /sys/class/mdev_bus/0000\:05\:00.0/mdev_supported_types/ directories and select one of the supported devices, eg nvidia-101

14. Create uuids and vgpu devices with them FOR EACH PHYSICAL GPU:
	a. uuidgen
	b. echo "af88fbf2-0110-4669-ab84-d747e9a9c19c" > /sys/class/mdev_bus/0000\:05\:00.0/mdev_supported_types/nvidia-101/create

15. Disable ECC on the GPUs on the host

16. Add the following to nova.conf of the gpu nodes:
	[devices]
	enabled_vgpu_types = nvidia-84

17. Add the following to nova.conf of the controller nodes:
	[scheduler]
	driver = filter_scheduler

	[filter_scheduler]
	available_filters = nova.scheduler.filters.all_filters
	enabled_filters = AvailabilityZoneFilter, ComputeFilter, ComputeCapabilitiesFilter, ImagePropertiesFilter, ServerGroupAntiAffinityFilter, ServerGroupAffinityFilter

18. Do a kolla-ansible reconfigure to apply the above settings

19. Create a flavor with the following property:
	a. --property "resources:VGPU=1"
</pre>
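
A note on step 15 above: on the host, ECC can be turned off for all GPUs with <code>nvidia-smi</code>; a sketch (the change only takes effect after the GPUs are reset or the host is rebooted):

<nowiki>
# nvidia-smi -e 0
# reboot
</nowiki>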
