GPU: Nvidia-healthmon
Installation
https://developer.nvidia.com/tesla-deployment-kit
- Download and untar.
- Move to the nvidia-healthmon folder and run.
Update the config.ini file to match your system.
Example file:
[global]
devices.tesla.count = 3
drivers.blacklist = nouveau
[Tesla K20m]
pci.gen = 2
pci.width = 16
temperature.warn = 9
Basic Usage
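Before the first run, it can help to check that config.ini parses and that the numeric keys really hold numbers. The following is only a minimal sketch: it assumes the file is plain INI that Python's standard `configparser` can read, and the helper names `load_healthmon_config` and `check_config` are hypothetical and not part of nvidia-healthmon. The sample text reuses the keys from the example file above.

```python
import configparser

# Sample text mirroring the example config above (assumed to be plain INI).
SAMPLE = """\
[global]
devices.tesla.count = 3
drivers.blacklist = nouveau

[Tesla K20m]
pci.gen = 2
pci.width = 16
"""


def load_healthmon_config(text):
    """Hypothetical helper: parse INI text into {section: {key: value}}."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


def check_config(cfg):
    """Hypothetical helper: return a list of human-readable problems."""
    problems = []
    count = cfg.get("global", {}).get("devices.tesla.count")
    if count is None or not count.isdigit():
        problems.append("global: devices.tesla.count should be an integer")
    for section, keys in cfg.items():
        if section == "global":
            continue
        # Per-device sections: only the PCI keys shown in the example are checked.
        for key in ("pci.gen", "pci.width"):
            if key in keys and not keys[key].isdigit():
                problems.append(f"{section}: {key} should be an integer")
    return problems


if __name__ == "__main__":
    for problem in check_config(load_healthmon_config(SAMPLE)):
        print(problem)
```

An empty result only means the file is syntactically plausible. Whether the values match the installed hardware is what nvidia-healthmon itself verifies.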
./nvidia-healthmon
./nvidia-healthmon -c config.file
./nvidia-healthmon --extended
[-h | --help]: Print usage
[-H | --verbose-help]: Print detailed usage
[-v | --verbose]: Enable verbose output
[-V | --version]: Prints the version number
[-q | --quick]: Execute a subset of tests
[-e | --extended]: Execute the complete test suite
[-i | --id]: Target a specific GPU
[-L | --list-devices]: List all the GPUs attached
[-c | --config]: Path to the configuration file
[-l | --log-file]: Path to the output log file
Example extended verbose output
Loading Config: SUCCESS
Global Tests
Black-Listed Drivers: SUCCESS
Load NVML: SUCCESS
Load CUDA: SUCCESS
NVML Sanity: SUCCESS
Tesla Devices Count: SUCCESS
Global Test Results: 5 success, 0 errors, 0 warnings, 0 did not run
-----------------------------------------------------------
GPU 0000:02:00.0 #0 : Tesla K20m (Serial: 0325212005895)
NVML Sanity: SUCCESS
InfoROM: SUCCESS
GEMINI InfoROM
This GPU does not share a board with another GPU chip.
Result: SKIPPED
ECC: SUCCESS
CUDA Sanity
GPU: Tesla K20m
Compute Capability: 3.5
Amount of Memory: 5032706048 bytes
ECC: Enabled
Number of SMs: 13
Core Clock: 705 MHz
Watchdog Timeout: Disabled
Compute Mode: Default
Result: SUCCESS
PCIe Maximum Link Generation: SUCCESS
PCIe Maximum Link Width: SUCCESS
PCI Bandwidth: SKIPPED
Memory
Allocated 4901464656 bytes (97.3%)
Result: SUCCESS
Device Results: 7 success, 0 errors, 0 warnings, 2 did not run
System Results: 12 success, 0 errors, 0 warnings, 2 did not run
One or more tests didn't run.
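When the tool runs from a cron job or a scheduler prolog, the summary lines are usually the part worth checking automatically. The sketch below pulls the counts out of the "... Results:" lines. It assumes the wording matches the sample output above; the function names `parse_summaries` and `is_healthy` are hypothetical, and the regular expression would need adjusting if another version of the tool phrases these lines differently.

```python
import re

# Matches lines such as:
#   "System Results: 12 success, 0 errors, 0 warnings, 2 did not run"
SUMMARY_RE = re.compile(
    r"^(?P<scope>Global Test|Device|System) Results: "
    r"(?P<success>\d+) success, (?P<errors>\d+) errors?, "
    r"(?P<warnings>\d+) warnings?, (?P<skipped>\d+) did not run$"
)


def parse_summaries(output):
    """Return one dict per summary line found in the captured output."""
    summaries = []
    for line in output.splitlines():
        match = SUMMARY_RE.match(line.strip())
        if match:
            entry = {"scope": match.group("scope")}
            for key in ("success", "errors", "warnings", "skipped"):
                entry[key] = int(match.group(key))
            summaries.append(entry)
    return summaries


def is_healthy(summaries):
    """True if a System summary exists and reports no errors or warnings."""
    system = [s for s in summaries if s["scope"] == "System"]
    return bool(system) and all(
        s["errors"] == 0 and s["warnings"] == 0 for s in system
    )


if __name__ == "__main__":
    sample = (
        "Global Test Results: 5 success, 0 errors, 0 warnings, 0 did not run\n"
        "Device Results: 7 success, 0 errors, 0 warnings, 2 did not run\n"
        "System Results: 12 success, 0 errors, 0 warnings, 2 did not run\n"
    )
    print(is_healthy(parse_summaries(sample)))
```

Skipped tests are deliberately not treated as failures here, since the sample run above skips two tests on an otherwise healthy GPU. A stricter policy could also require the skipped count to be zero.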