Difference between revisions of "GPU: Nvidia-healthmon"


Latest revision as of 13:57, 10 May 2013

Installation

https://developer.nvidia.com/tesla-deployment-kit

  1. Download and untar.
  2. Move to the nvidia-healthmon folder and run.
  • To install in another directory
  1. Copy both the binary and the config file to the same location; otherwise you must pass the config file path with the -c flag.


Update the config.ini file to match your system.

Example file:

[global]
devices.tesla.count = 3
drivers.blacklist = nouveau
[Tesla K20m]
pci.gen = 2
pci.width = 16
temperature.warn = 9
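
Before running the tool, it can help to sanity-check an edited config file. The following is a minimal sketch, assuming the file is plain INI syntax that Python's standard configparser can read; nvidia-healthmon's own parser may accept or reject things this one does not. The function name load_healthmon_config is hypothetical, and the embedded text is copied from the example above.

```python
import configparser

# Example config text, copied from the example file above.
EXAMPLE = """\
[global]
devices.tesla.count = 3
drivers.blacklist = nouveau
[Tesla K20m]
pci.gen = 2
pci.width = 16
temperature.warn = 9
"""


def load_healthmon_config(text):
    """Parse INI-style text into {section: {key: value}} (hypothetical helper).

    Assumption: the config is INI-compatible. Values are kept as strings.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


config = load_healthmon_config(EXAMPLE)
print("Expected Tesla devices:", config["global"]["devices.tesla.count"])
print("Expected PCIe width for Tesla K20m:", config["Tesla K20m"]["pci.width"])
```

Section names such as "Tesla K20m" keep their case, while configparser lowercases key names by default; the keys in the example are already lowercase, so they come through unchanged.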

Basic Usage

./nvidia-healthmon
./nvidia-healthmon -c config.file
./nvidia-healthmon --extended
    [-h | --help]: Print usage
    [-H | --verbose-help]: Print detailed usage
    [-v | --verbose]: Enable verbose output
    [-V | --version]: Print the version number
    [-q | --quick]: Execute a subset of tests
    [-e | --extended]: Execute the complete test suite
    [-i | --id]: Target a specific GPU
    [-L | --list-devices]: List all the GPUs attached
    [-c | --config]: Path to the configuration file
    [-l | --log-file]: Path to the output log file
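
When the tool is driven from a script, the flags listed above can be assembled programmatically. This is a rough sketch that only builds the argument list and does not run anything; build_command and its parameter names are hypothetical, and only flags from the list above are used.

```python
def build_command(binary="./nvidia-healthmon", extended=False, config=None,
                  verbose=False, gpu_id=None, log_file=None):
    """Build an nvidia-healthmon argument list (hypothetical helper).

    Only flags documented in the usage list above are emitted.
    """
    cmd = [binary]
    if extended:
        cmd.append("--extended")        # run the complete test suite
    if config is not None:
        cmd += ["-c", str(config)]      # path to the configuration file
    if verbose:
        cmd.append("-v")                # verbose output
    if gpu_id is not None:
        cmd += ["-i", str(gpu_id)]      # target a specific GPU
    if log_file is not None:
        cmd += ["-l", str(log_file)]    # path to the output log file
    return cmd


cmd = build_command(extended=True, config="K20.conf", verbose=True, gpu_id=0)
print(" ".join(cmd))
```

The resulting list could be passed to subprocess.run(cmd, capture_output=True, text=True) on a machine where the binary is present.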

Example extended verbose output

[root@compute022 nvidia-healthmon]# ./nvidia-healthmon --extended -c K20.conf -v -i 0


Loading Config: SUCCESS
Global Tests
   Black-Listed Drivers: SUCCESS
   Load NVML: SUCCESS
   Load CUDA: SUCCESS
   NVML Sanity: SUCCESS
   Tesla Devices Count: SUCCESS
   Global Test Results: 5 success, 0 errors, 0 warnings, 0 did not run

-----------------------------------------------------------

GPU 0000:02:00.0 #0 : Tesla K20m (Serial: 0325212005895)
   NVML Sanity: SUCCESS
   InfoROM: SUCCESS
   GEMINI InfoROM
      This GPU does not share a board with another GPU chip.
      Result: SKIPPED
   ECC: SUCCESS
   CUDA Sanity
      GPU: Tesla K20m
      Compute Capability: 3.5
      Amount of Memory: 5032706048 bytes
      ECC: Enabled
      Number of SMs: 13
      Core Clock: 705 MHz
      Watchdog Timeout: Disabled
      Compute Mode: Default
      Result: SUCCESS
   PCIe Maximum Link Generation: SUCCESS
   PCIe Maximum Link Width: SUCCESS
   PCI Bandwidth: SKIPPED
   Memory
      Allocated 4901464656 bytes (97.3%)
      Result: SUCCESS
   Device Results: 7 success, 0 errors, 0 warnings, 2 did not run

System Results: 12 success, 0 errors, 0 warnings, 2 did not run
One or more tests didn't run.
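
For monitoring across many nodes, the "Results:" summary lines in output like the above can be extracted with a small parser. This is a minimal sketch, assuming the summary lines always follow the format shown in this example run; other versions of the tool may print them differently. Summary and parse_summaries are hypothetical names.

```python
import re
from typing import List, NamedTuple


class Summary(NamedTuple):
    scope: str        # e.g. "Global Test", "Device", "System"
    success: int
    errors: int
    warnings: int
    did_not_run: int


# Matches lines such as:
#   "Device Results: 7 success, 0 errors, 0 warnings, 2 did not run"
_SUMMARY = re.compile(
    r"^\s*(?P<scope>.+?) Results: (?P<success>\d+) success, "
    r"(?P<errors>\d+) errors, (?P<warnings>\d+) warnings, "
    r"(?P<did_not_run>\d+) did not run\s*$"
)


def parse_summaries(output: str) -> List[Summary]:
    """Return one Summary per 'Results:' line found in the tool output."""
    summaries = []
    for line in output.splitlines():
        match = _SUMMARY.match(line)
        if match:
            summaries.append(Summary(
                match["scope"],
                int(match["success"]),
                int(match["errors"]),
                int(match["warnings"]),
                int(match["did_not_run"]),
            ))
    return summaries


# Summary lines copied from the example output above.
SAMPLE = """\
   Global Test Results: 5 success, 0 errors, 0 warnings, 0 did not run
   Device Results: 7 success, 0 errors, 0 warnings, 2 did not run
System Results: 12 success, 0 errors, 0 warnings, 2 did not run
One or more tests didn't run.
"""

for summary in parse_summaries(SAMPLE):
    print(summary)
```

A wrapper could then flag a node whenever any summary has a nonzero errors count, and treat did_not_run as informational.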