Intel: ICR Cluster Checker and Certification
Certification
Certification requires three runs of the cluster checker.
Run Intel Cluster Checker in Compliance Mode (as regular user)
cluster-check <xmlfile> --verbose 5 --compliance=1.1
Note that the file_tree test is allowed to fail (and usually does, with /usr/sbin/iconvconfig.i686).
Run Intel Cluster Checker in Wellness Mode (as regular user)
cluster-check <xmlfile> --verbose 5 --level N --exclude copy_exactly
Here N is 4 if the cluster has TCP/IP over Ethernet only; otherwise N is 5 (InfiniBand).
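The choice of level can be captured in a small helper. The following is a minimal sketch and not part of Intel Cluster Checker: the function name `wellness_command`, the `has_infiniband` flag and the file name `cluster.xml` are hypothetical, and only the options already shown above are used.

```python
import shlex


def wellness_command(xmlfile, has_infiniband):
    """Build the regular-user Wellness Mode command line shown above.

    Hypothetical helper: level 4 for TCP/IP over Ethernet only,
    level 5 when the cluster also has InfiniBand.
    """
    level = 5 if has_infiniband else 4
    return [
        "cluster-check", xmlfile,
        "--verbose", "5",
        "--level", str(level),
        "--exclude", "copy_exactly",
    ]


if __name__ == "__main__":
    # Print the command so it can be reviewed before it is run by hand.
    print(shlex.join(wellness_command("cluster.xml", has_infiniband=False)))
```

Returning the arguments as a list keeps the command safe to pass to `subprocess.run` without a shell, should the run ever be scripted.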
Generate Filesystem Checksums
Generate checksums of the file system for both the head node and one compute node (any one will do). On each of these nodes, run:
node_checksum
This generates a file list with md5sums in /tmp. Copy these files to the directory specified in the <xmlfile>, and you are ready for the third run (Wellness Mode as root).
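To make clear what a file list with md5sums contains, here is a minimal Python sketch. It is not the `node_checksum` tool and does not claim to reproduce its output format; the function name `md5_file_list` and the `md5  path` line layout are assumptions made for illustration only.

```python
import hashlib
import os


def md5_file_list(root):
    """Return 'md5hex  path' lines for regular files under root, sorted by path."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks, sockets and device nodes
            digest = hashlib.md5()
            with open(path, "rb") as handle:
                # Read in 1 MiB chunks so large files do not fill memory.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            entries.append((path, digest.hexdigest()))
    return [f"{md5}  {path}" for path, md5 in sorted(entries)]
```

Comparing two such lists, one per node, shows which files differ between nodes, which is the kind of comparison the copy_exactly test relies on.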
Run Intel Cluster Checker in Wellness Mode (as root)
cluster-check <xmlfile> --verbose 5 --level 4 --include_only copy_exactly --include_only dmidecode --include_only hdparm
The copy_exactly test may fail. If it does, comment out the files it failed on in the COMPUTE_NODE checksum list.
Once all of the above runs are complete, submit all of the output files using the Seven Steps PDF submission form (on PDD).
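Commenting out many entries by hand is error-prone, so the edit can be scripted. The sketch below rests on two assumptions that should be checked against a real checksum list first: that `#` starts a comment, and that each line ends with the file path separated by whitespace. The function name `comment_out` is hypothetical.

```python
def comment_out(lines, failed_paths, marker="#"):
    """Prefix with marker every checksum line whose path is in failed_paths.

    Assumes each line ends with the file path, separated by whitespace,
    so paths that contain spaces are not handled.
    """
    failed = set(failed_paths)
    result = []
    for line in lines:
        fields = line.split()
        already_commented = line.lstrip().startswith(marker)
        if fields and not already_commented and fields[-1] in failed:
            result.append(marker + " " + line)
        else:
            result.append(line)
    return result
```

Working on a copy of the checksum list and comparing it with the original before rerunning the test keeps the change easy to review.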
Sample XML File
The following is a sample XML file for the cluster checker:
<cluster>
<nodefile>./nodes</nodefile>
<version_id>fb22930909234f188515f6002a21c5bf</version_id>
<test>
<clock_granularity>
<granularity>2</granularity>
</clock_granularity>
<clock_sync>
<deviation>300</deviation>
</clock_sync>
<core_frequency>
<threshold>5</threshold>
</core_frequency>
<e1000>
<options>options e1000 InterruptThrottleRate=0,0 TxIntDelay=0,64 RxAbsIntDelay=0,128 TxAbsIntDelay=0,64</options>
</e1000>
<gcc>
<path>/usr/bin</path>
</gcc>
<hpcc>
<thread-number>4</thread-number>
<process-number>2</process-number>
<fabric>
<bandwidth>0.01</bandwidth>
<device>sock</device>
<dgemm>30</dgemm>
<fft>1.4</fft>
<hpl>0.06</hpl>
<latency>75</latency>
<ptrans>0.20</ptrans>
<randomaccess>0.005</randomaccess>
<stream>2.75</stream>
</fabric>
<fabric>
<bandwidth>0.20</bandwidth>
<device>rdssm:OpenIB-cma</device>
<dgemm>30</dgemm>
<fft>1.4</fft>
<hpl>0.20</hpl>
<latency>20</latency>
<ptrans>0.20</ptrans>
<randomaccess>0.025</randomaccess>
<stream>2.75</stream>
</fabric>
</hpcc>
<imb_collective_intel_mpi>
<benchmark>barrier</benchmark>
<fabric>
<device>sock</device>
</fabric>
<fabric>
<device>rdssm:OpenIB-cma</device>
</fabric>
</imb_collective_intel_mpi>
<imb_pingpong_intel_mpi>
<fabric>
<device>sock</device>
<bandwidth>100</bandwidth>
<latency>45</latency>
</fabric>
<fabric>
<device>rdssm:OpenIB-cma</device>
<bandwidth>900</bandwidth>
<latency>5.0</latency>
</fabric>
</imb_pingpong_intel_mpi>
<intel_mpi>
<device>shm</device>
<process-number>4</process-number>
</intel_mpi>
<intel_mpi_internode>
<fabric>
<device>sock</device>
</fabric>
<fabric>
<device>rdssm:OpenIB-cma</device>
</fabric>
<process-number>4</process-number>
</intel_mpi_internode>
<intel_mpi_rt>
<device>shm</device>
<process-number>4</process-number>
</intel_mpi_rt>
<intel_mpi_rt_internode>
<fabric>
<device>sock</device>
</fabric>
<fabric>
<device>rdma:OpenIB-cma</device>
</fabric>
<process-number>4</process-number>
</intel_mpi_rt_internode>
<intel_mpi_testsuite>
<fabric>
<device>sock</device>
</fabric>
<fabric>
<device>rdssm:OpenIB-cma</device>
</fabric>
</intel_mpi_testsuite>
<memory_bandwidth_stream>
<threads>ALL</threads>
<bandwidth>9500</bandwidth>
</memory_bandwidth_stream>
<mflops_intel_mkl>
<k>112</k>
<m>5000</m>
<n>5000</n>
<mflops>75000</mflops>
</mflops_intel_mkl>
<nfs_mounts>
<filesystem>autofs</filesystem>
<filesystem>nfs</filesystem>
</nfs_mounts>
<openib>
<memlock>2000000</memlock>
</openib>
<perl>
<path>/usr/bin</path>
</perl>
<portal>
<portal-name>portal</portal-name>
</portal>
<process_check>
<elapsed_time>3600</elapsed_time>
<exempt_uids>400</exempt_uids>
<percent_cpu>5</percent_cpu>
<percent_memory>1</percent_memory>
<zombie_allowed_elapsed_time>1</zombie_allowed_elapsed_time>
</process_check>
<python>
<path>/usr/bin</path>
</python>
<stray_uids>
<dir>/tmp</dir>
</stray_uids>
<subnet_manager>
<command>opensm</command>
</subnet_manager>
<system_memory>
<physical_threshold>100</physical_threshold>
<swap_threshold>100</swap_threshold>
</system_memory>
<uid_sync>
<mingid>500</mingid>
<minuid>500</minuid>
</uid_sync>
<kernel_parameters>
<exclude>kernel.domainname</exclude>
<exclude>net.ipv4.conf.</exclude>
<exclude>net.ipv4.neigh.</exclude>
<exclude>net.ipv4.netfilter.</exclude>
<exclude>net.ipv6.conf.</exclude>
<exclude>net.ipv6.neigh.</exclude>
<exclude>fs.nfs.</exclude>
<exclude>dev.cdrom.</exclude>
</kernel_parameters>
<dmidecode>
<!-- <exclude>Memory Device (0x1100): Asset Tag</exclude> -->
<!-- <exclude>System Event Log (0x0011): Change Token</exclude> -->
</dmidecode>
<kernel_modules>
<exclude>joydev</exclude>
<exclude>ehci_hcd</exclude>
<exclude>ohci_hcd</exclude>
<exclude>scsi_mod</exclude>
<exclude>sr_mod</exclude>
<exclude>i2c_core</exclude>
<exclude>i2c_dev</exclude>
<exclude>serio_raw</exclude>
<exclude>ipv6</exclude>
<exclude>uhci_hcd</exclude>
<exclude>usb_storage</exclude>
<exclude>cdrom</exclude>
</kernel_modules>
<hdparm>
<cache-read>10000</cache-read>
<device>/dev/sda1</device>
<device-read>1500</device-read>
</hdparm>
<copy_exactly>
<compute_node>COMPUTE_NODE_COPY_EXACTLY</compute_node>
<head_node>HEAD_NODE_COPY_EXACTLY</head_node>
</copy_exactly>
</test>
</cluster>
Running Performance Tests
- All of these tests are also useful as production validation tools
Check the total system memory is consistent
cluster-check <xmlfile> --include_only system_memory
Check the CPU model, frequency and stepping are consistent
cluster-check <xmlfile> --include_only cpuinfo
Run Component Performance Checks
cluster-check <xmlfile> --include_only hdparm \
--include_only memory_bandwidth_stream \
--include_only mflops_intel_mkl \
--include_only imb_pingpong_intel_mpi
CPU Performance Check
cluster-check <xmlfile> --include_only mflops_intel_mkl
Memory Bandwidth Check
cluster-check <xmlfile> --include_only memory_bandwidth_stream
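The commands in this section all follow one pattern: the XML file followed by one `--include_only` option per test. As a minimal sketch, assuming a hypothetical helper named `include_only_command` and a hypothetical file name `cluster.xml`, the command line can be assembled from a list of test names:

```python
import shlex


def include_only_command(xmlfile, tests):
    """Build a cluster-check command that runs only the named tests."""
    args = ["cluster-check", xmlfile]
    for name in tests:
        # One --include_only option per test, as in the commands above.
        args += ["--include_only", name]
    return args


if __name__ == "__main__":
    component_checks = [
        "hdparm",
        "memory_bandwidth_stream",
        "mflops_intel_mkl",
        "imb_pingpong_intel_mpi",
    ]
    print(shlex.join(include_only_command("cluster.xml", component_checks)))
```

Keeping the test names in a list makes it easy to reuse the same set for periodic validation of a production cluster.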