Intel: ICR Cluster Checker and Certification

From Define Wiki

Latest revision as of 17:32, 4 October 2016

THIS IS FOR AN OLDER VERSION OF CLUSTER CHECKER

Certification

Certification requires three runs of the cluster checker, with a checksum generation step before the third run.

Run Intel Cluster Checker in Compliance Mode (as regular user)

cluster-check <xmlfile> --verbose 5 --compliance=1.2

Note that the file_tree test is allowed to fail (and usually does, on /usr/sbin/iconvconfig.i686).

Run Intel Cluster Checker in Wellness Mode (as regular user)

cluster-check <xmlfile> --verbose 5 --level N --exclude copy_exactly

Here N is 4 if the cluster has TCP/IP over Ethernet only; otherwise N is 5 (InfiniBand).
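The choice of N can be scripted. The following is a minimal sketch, assuming a hypothetical helper called wellness_level that takes the fabric type as its argument. Neither the helper nor the ethernet/infiniband keywords are part of Cluster Checker; they are made up for illustration.

```shell
# Hypothetical helper (not part of Cluster Checker): map a fabric name
# to the --level value described above.
wellness_level() {
    case "$1" in
        ethernet)   printf '%s\n' 4 ;;   # TCP/IP over Ethernet only
        infiniband) printf '%s\n' 5 ;;   # InfiniBand present
        *)          printf 'unknown fabric: %s\n' "$1" >&2; return 1 ;;
    esac
}

# Possible usage:
#   cluster-check <xmlfile> --verbose 5 --level "$(wellness_level infiniband)" --exclude copy_exactly
```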

Generate Filesystem Checksums

Generate checksums of the file system for both the head node and one compute node (any one will do). On each of these nodes, run:

node_checksum

This generates a file list with md5sums in /tmp. Copy these files over to the directory specified in the <xmlfile>, and you are ready for the third run.
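The exact format that node_checksum writes is not shown on this page. As a rough illustration of what a file list with md5sums looks like, the sketch below builds one with the standard find and md5sum tools. It is not the node_checksum implementation, and its output should not be assumed to be a drop-in replacement. The function name checksum_list is hypothetical.

```shell
# Illustration only: list every regular file under a directory together
# with its MD5 checksum, sorted by path. This is NOT node_checksum.
checksum_list() {
    # md5sum prints "<checksum>  <path>" for each file it is given.
    find "$1" -type f -exec md5sum {} + | sort -k 2
}

# Possible usage (the directory is just an example):
#   checksum_list /usr/sbin > /tmp/example_checksums.txt
```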

Run Intel Cluster Checker in Wellness Mode (as root)

cluster-check <xmlfile> --verbose 5 --level 4 --include_only copy_exactly --include_only dmidecode --include_only hdparm

The copy_exactly test may fail. If it does, comment out the files it failed on in the COMPUTE_NODE checksum list.

Once all of the above are complete, submit all the output files using the Seven Steps PDF submission form (on PDD).
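The three runs above can be collected into a checklist. The following is a minimal sketch, assuming a hypothetical function certification_commands that only prints the commands, using exactly the options shown in this section. It does not run anything, so the checksum step and the switch to root for the third run are still done by hand.

```shell
# Hypothetical helper: print the three certification runs for a given
# XML file and wellness level. Nothing is executed.
certification_commands() {
    xmlfile=$1
    level=$2   # 4 for Ethernet only, 5 for InfiniBand
    # Run 1: compliance mode, as a regular user.
    printf '%s\n' "cluster-check $xmlfile --verbose 5 --compliance=1.2"
    # Run 2: wellness mode, as a regular user.
    printf '%s\n' "cluster-check $xmlfile --verbose 5 --level $level --exclude copy_exactly"
    # Run 3: wellness mode, as root, after the checksums have been generated.
    printf '%s\n' "cluster-check $xmlfile --verbose 5 --level 4 --include_only copy_exactly --include_only dmidecode --include_only hdparm"
}

# Possible usage:
#   certification_commands cluster.xml 5
```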


Sample XML File

The following is a sample XML file used for cluster checker:

<source lang=xml>
<cluster>
  <nodefile>./nodes</nodefile>
  <version_id>fb22930909234f188515f6002a21c5bf</version_id>
  <test>
    <clock_granularity>
      <granularity>2</granularity>
    </clock_granularity>
    <clock_sync>
      <deviation>300</deviation>
    </clock_sync>
    <core_frequency>
      <threshold>5</threshold>
    </core_frequency>
    <e1000>
      <options>options e1000 InterruptThrottleRate=0,0 TxIntDelay=0,64 RxAbsIntDelay=0,128 TxAbsIntDelay=0,64</options>
    </e1000>
    <gcc>
      <path>/usr/bin</path>
    </gcc>
    <hpcc>
      <thread-number>4</thread-number>
      <process-number>2</process-number>
      <fabric>
        <bandwidth>0.01</bandwidth>
        <device>sock</device>
        <dgemm>30</dgemm>
        <fft>1.4</fft>
        <hpl>0.06</hpl>
        <latency>75</latency>
        <ptrans>0.20</ptrans>
        <randomaccess>0.005</randomaccess>
        <stream>2.75</stream>
      </fabric>
      <fabric>
        <bandwidth>0.20</bandwidth>
        <device>rdssm:OpenIB-cma</device>
        <dgemm>30</dgemm>
        <fft>1.4</fft>
        <hpl>0.20</hpl>
        <latency>20</latency>
        <ptrans>0.20</ptrans>
        <randomaccess>0.025</randomaccess>
        <stream>2.75</stream>
      </fabric>
    </hpcc>
    <imb_collective_intel_mpi>
      <benchmark>barrier</benchmark>
      <fabric>
        <device>sock</device>
      </fabric>
      <fabric>
        <device>rdssm:OpenIB-cma</device>
      </fabric>
    </imb_collective_intel_mpi>
    <imb_pingpong_intel_mpi>
      <fabric>
        <device>sock</device>
        <bandwidth>100</bandwidth>
        <latency>45</latency>
      </fabric>
      <fabric>
        <device>rdssm:OpenIB-cma</device>
        <bandwidth>900</bandwidth>
        <latency>5.0</latency>
      </fabric>
    </imb_pingpong_intel_mpi>
    <intel_mpi>
      <device>shm</device>
      <process-number>4</process-number>
    </intel_mpi>
    <intel_mpi_internode>
      <fabric>
        <device>sock</device>
      </fabric>
      <fabric>
        <device>rdssm:OpenIB-cma</device>
      </fabric>
      <process-number>4</process-number>
    </intel_mpi_internode>
    <intel_mpi_rt>
      <device>shm</device>
      <process-number>4</process-number>
    </intel_mpi_rt>
    <intel_mpi_rt_internode>
      <fabric>
        <device>sock</device>
      </fabric>
      <fabric>
        <device>rdma:OpenIB-cma</device>
      </fabric>
      <process-number>4</process-number>
    </intel_mpi_rt_internode>
    <intel_mpi_testsuite>
      <fabric>
        <device>sock</device>
      </fabric>
      <fabric>
        <device>rdssm:OpenIB-cma</device>
      </fabric>
    </intel_mpi_testsuite>
    <memory_bandwidth_stream>
      <threads>ALL</threads>
      <bandwidth>9500</bandwidth>
    </memory_bandwidth_stream>
    <mflops_intel_mkl>
      <k>112</k>
      <m>5000</m>
      <n>5000</n>
      <mflops>75000</mflops>
    </mflops_intel_mkl>
    <nfs_mounts>
      <filesystem>autofs</filesystem>
      <filesystem>nfs</filesystem>
    </nfs_mounts>
    <openib>
      <memlock>2000000</memlock>
    </openib>
    <perl>
      <path>/usr/bin</path>
    </perl>
    <portal>
      <portal-name>portal</portal-name>
    </portal>
    <process_check>
      <elapsed_time>3600</elapsed_time>
      <exempt_uids>400</exempt_uids>
      <percent_cpu>5</percent_cpu>
      <percent_memory>1</percent_memory>
      <zombie_allowed_elapsed_time>1</zombie_allowed_elapsed_time>
    </process_check>
    <python>
      <path>/usr/bin</path>
    </python>
    <stray_uids>
      <dir>/tmp</dir>
    </stray_uids>
    <subnet_manager>
      <command>opensm</command>
    </subnet_manager>
    <system_memory>
      <physical_threshold>100</physical_threshold>
      <swap_threshold>100</swap_threshold>
    </system_memory>
    <uid_sync>
      <mingid>500</mingid>
      <minuid>500</minuid>
    </uid_sync>
    <kernel_parameters>
      <exclude>kernel.domainname</exclude>
      <exclude>net.ipv4.conf.</exclude>
      <exclude>net.ipv4.neigh.</exclude>
      <exclude>net.ipv4.netfilter.</exclude>
      <exclude>net.ipv6.conf.</exclude>
      <exclude>net.ipv6.neigh.</exclude>
      <exclude>fs.nfs.</exclude>
      <exclude>dev.cdrom.</exclude>
    </kernel_parameters>
    <dmidecode>
      <!-- <exclude>Memory Device (0x1100): Asset Tag</exclude> -->
      <!-- <exclude>System Event Log (0x0011): Change Token</exclude> -->
    </dmidecode>
    <kernel_modules>
      <exclude>joydev</exclude>
      <exclude>ehci_hcd</exclude>
      <exclude>ohci_hcd</exclude>
      <exclude>scsi_mod</exclude>
      <exclude>sr_mod</exclude>
      <exclude>i2c_core</exclude>
      <exclude>i2c_dev</exclude>
      <exclude>serio_raw</exclude>
      <exclude>ipv6</exclude>
      <exclude>uhci_hcd</exclude>
      <exclude>usb_storage</exclude>
      <exclude>cdrom</exclude>
    </kernel_modules>
    <hdparm>
      <cache-read>10000</cache-read>
      <device>/dev/sda1</device>
      <device-read>1500</device-read>
    </hdparm>
    <copy_exactly>
      <compute_node>COMPUTE_NODE_COPY_EXACTLY</compute_node>
      <head_node>HEAD_NODE_COPY_EXACTLY</head_node>
    </copy_exactly>
  </test>
</cluster>

</source>

Running Performance Tests

  • All of these tests are useful as production validation tools

Check the total system memory is consistent

cluster-check <xmlfile> --include_only system_memory

Check the CPU model, frequency and stepping are consistent

cluster-check <xmlfile> --include_only cpuinfo

Run Component Performance Checks

cluster-check <xmlfile> --include_only hdparm \
                        --include_only memory_bandwidth_stream \
                        --include_only mflops_intel_mkl \
                        --include_only imb_pingpong_intel_mpi
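
When the list of tests changes often, the repeated --include_only options can be generated rather than typed. The following is a minimal sketch, assuming a hypothetical helper named include_only_args; it is not part of Cluster Checker and only builds the option string.

```shell
# Hypothetical helper: turn a list of test names into repeated
# --include_only options, as used in the command above.
include_only_args() {
    args=""
    for name in "$@"; do
        args="$args --include_only $name"
    done
    # Strip the leading space before printing.
    printf '%s\n' "${args# }"
}

# Possible usage (left unquoted on purpose so the shell splits the options):
#   cluster-check <xmlfile> $(include_only_args hdparm memory_bandwidth_stream)
```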

CPU Performance Check

cluster-check <xmlfile> --include_only mflops_intel_mkl

Memory Bandwidth Check

cluster-check <xmlfile> --include_only memory_bandwidth_stream