Difference between revisions of "GPU: Nvidia-healthmon"

From Define Wiki

Revision as of 13:50, 10 May 2013

Installation

https://developer.nvidia.com/tesla-deployment-kit

  1. Download and untar.
  2. Move into the nvidia-healthmon folder and run it.


Update the config.ini file to match your system

Example file:

[global]
devices.tesla.count = 3
drivers.blacklist = nouveau
[Tesla K20m]
pci.gen = 2
pci.width = 16
temperature.warn = 9
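
Before running the tool it can help to confirm that the file parses and that the expected GPU count is what you intended. The following is a minimal sketch using Python's standard configparser module. It assumes the file follows ordinary INI conventions; nvidia-healthmon's own parser may be stricter or more lenient, and the helper name load_healthmon_config is hypothetical.

```python
import configparser

# Example config text copied verbatim from the section above.
EXAMPLE_CONFIG = """\
[global]
devices.tesla.count = 3
drivers.blacklist = nouveau
[Tesla K20m]
pci.gen = 2
pci.width = 16
temperature.warn = 9
"""


def load_healthmon_config(text):
    """Parse INI text into a plain {section: {key: value}} dictionary.

    Hypothetical helper: this only mirrors the file layout shown above
    and does not validate keys against what nvidia-healthmon accepts.
    """
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


config = load_healthmon_config(EXAMPLE_CONFIG)

# Values are read as strings, so convert before comparing numerically.
expected_gpus = int(config["global"]["devices.tesla.count"])
print("Expected Tesla devices:", expected_gpus)
print("Per-device sections:", [s for s in config if s != "global"])
```

To check a real file, read its contents and pass the text to load_healthmon_config in place of EXAMPLE_CONFIG.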

Basic Usage

./nvidia-healthmon
    [-h | --help]: Print usage
    [-H | --verbose-help]: Print detailed usage
    [-v | --verbose]: Enable verbose output
    [-V | --version]: Prints the version number
    [-q | --quick]: Execute a subset of tests
    [-e | --extended]: Execute the complete test suite
    [-i | --id]: Target a specific GPU
    [-L | --list-devices]: List all the GPUs attached
    [-c | --config]: Path to the configuration file
    [-l | --log-file]: Path to the output log file
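
If the tool is driven from a script, the options above can be assembled into an argument list. This is a rough sketch only: the function build_healthmon_command and its parameter names are hypothetical, and it assumes that -i, -c and -l each take their value as the following argument, as the "-i 0" invocation in the example output suggests for -i.

```python
def build_healthmon_command(binary="./nvidia-healthmon", extended=False,
                            quick=False, verbose=False, gpu_id=None,
                            config_path=None, log_path=None):
    """Return an argv list using only the flags listed in the usage text."""
    command = [binary]
    if extended:
        command.append("-e")          # complete test suite
    if quick:
        command.append("-q")          # subset of tests
    if verbose:
        command.append("-v")          # verbose output
    if gpu_id is not None:
        command += ["-i", str(gpu_id)]    # target a specific GPU
    if config_path is not None:
        command += ["-c", config_path]    # assumed: path follows the flag
    if log_path is not None:
        command += ["-l", log_path]       # assumed: path follows the flag
    return command


command = build_healthmon_command(extended=True, verbose=True, gpu_id=0)
print(" ".join(command))  # prints: ./nvidia-healthmon -e -v -i 0
```

The resulting list can be passed to subprocess.run on a machine where the binary is present.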

Example extended verbose output:

[root@compute022 nvidia-healthmon]# ./nvidia-healthmon -e -v -i 0

Loading Config: SUCCESS
Global Tests
   Black-Listed Drivers: SUCCESS
   Load NVML: SUCCESS
   Load CUDA: SUCCESS
   NVML Sanity: SUCCESS
   Tesla Devices Count: SKIPPED
   Global Test Results: 4 success, 0 errors, 0 warnings, 1 did not run

-----------------------------------------------------------

GPU 0000:02:00.0 #0 : Tesla K20m (Serial: 0325212005895)
   NVML Sanity: SUCCESS
   InfoROM: SUCCESS
   GEMINI InfoROM
      This GPU does not share a board with another GPU chip.
      Result: SKIPPED
   ECC: SUCCESS
   CUDA Sanity
      GPU: Tesla K20m
      Compute Capability: 3.5
      Amount of Memory: 5032706048 bytes
      ECC: Enabled
      Number of SMs: 13
      Core Clock: 705 MHz
      Watchdog Timeout: Disabled
      Compute Mode: Default
      Result: SUCCESS
   PCIe Maximum Link Generation: SKIPPED
   PCIe Maximum Link Width: SKIPPED
   PCI Bandwidth: SKIPPED
   Memory
      Allocated 4901464656 bytes (97.3%)
      Result: SUCCESS
   Device Results: 5 success, 0 errors, 0 warnings, 4 did not run

System Results: 9 success, 0 errors, 0 warnings, 5 did not run
One or more tests didn't run.
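
For monitoring, the summary lines in output like the above can be extracted with a regular expression. The sketch below assumes the exact wording seen in this sample ("N success, N errors, N warnings, N did not run"); other versions of the tool may word these lines differently, and the names SUMMARY_RE and parse_summaries are hypothetical.

```python
import re

# Summary lines copied from the sample output above.
SAMPLE_OUTPUT = """\
   Global Test Results: 4 success, 0 errors, 0 warnings, 1 did not run
   Device Results: 5 success, 0 errors, 0 warnings, 4 did not run
System Results: 9 success, 0 errors, 0 warnings, 5 did not run
"""

# Assumed pattern, based only on the wording in the sample output.
SUMMARY_RE = re.compile(
    r"^\s*(?P<scope>Global Test|Device|System) Results: "
    r"(?P<success>\d+) success, (?P<errors>\d+) errors, "
    r"(?P<warnings>\d+) warnings, (?P<skipped>\d+) did not run\s*$"
)


def parse_summaries(output):
    """Return one dict per summary line found in the tool's output."""
    summaries = []
    for line in output.splitlines():
        match = SUMMARY_RE.match(line)
        if match:
            counts = {key: int(value)
                      for key, value in match.groupdict().items()
                      if key != "scope"}
            counts["scope"] = match.group("scope")
            summaries.append(counts)
    return summaries


summaries = parse_summaries(SAMPLE_OUTPUT)
system = next(s for s in summaries if s["scope"] == "System")
healthy = system["errors"] == 0 and system["warnings"] == 0
print("System healthy:", healthy, "| tests skipped:", system["skipped"])
```

In this sample the system totals equal the sum of the global and device counts (4 + 5 successes, 1 + 4 tests that did not run), which a script could also use as a consistency check.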