Difference between revisions of "Benchmarking: Stream (Memory Bandwidth)"
Revision as of 21:55, 21 July 2013
STREAM: The STREAM benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels.
- Note: Ensure power-saving features are disabled. Maximum clock speed is needed to prevent fluctuations in performance:
- /etc/init.d/cpuspeed stop
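On systems where the cpuspeed service is not present, one way to confirm that frequency scaling is not throttling the cores is to inspect the cpufreq governor. The following is a minimal sketch, assuming a Linux system that exposes the standard cpufreq sysfs files; the helper name list_governors is made up for illustration, and the directory may be absent on some machines (for example, inside virtual machines).

```shell
#!/bin/sh
# Hypothetical helper: list the distinct cpufreq governors in use.
# Assumption: governors are exposed under the standard Linux sysfs path
# /sys/devices/system/cpu/cpuN/cpufreq/scaling_governor.
list_governors() {
    root=${1:-/sys/devices/system/cpu}
    # Errors are suppressed so a machine without cpufreq prints nothing.
    cat "$root"/cpu[0-9]*/cpufreq/scaling_governor 2>/dev/null | sort -u
}

list_governors
```

A single line reading "performance" suggests every core is pinned to its highest frequency; any other governor may allow the clock speed to vary during the run.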
Get the source
- Main STREAM website: http://www.cs.virginia.edu/stream/
- Pull the latest copy of STREAM from:
# (v 5.10 at the time of edit)
wget http://www.cs.virginia.edu/stream/FTP/Code/stream.c
Compile
- Either the Intel compiler (ICC) or GCC can be used to build STREAM
- Ensure you build for multi-threaded runs (-fopenmp for gcc, -openmp for icc)
- For large array sizes, include -mcmodel=medium
- Best performance was observed with Intel ICC
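To judge whether an array size counts as "large", it can help to estimate how much static memory STREAM will allocate. The sketch below assumes the default build of stream.c, which uses three arrays of 8-byte doubles, and ignores the small OFFSET padding; the helper name stream_footprint_mib is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: estimate STREAM's static array footprint in MiB.
# Assumption: three arrays of 8-byte doubles (the default STREAM_TYPE),
# with the OFFSET padding ignored.
stream_footprint_mib() {
    n=$1
    echo $(( n * 8 * 3 / 1024 / 1024 ))
}

# Array size used in the Open64 build on this page.
stream_footprint_mib 30000000
```

For STREAM_ARRAY_SIZE=30000000 this prints 686, i.e. roughly 0.7 GB. As a rule of thumb, -mcmodel=medium typically becomes necessary once the static data approaches 2 GB.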
Intel
icc -O3 -static -openmp stream.c -o stream_icc
GCC
- GCC typically gave the worst performance in the limited tests we performed (better optimisation flags are probably required, but these are not well documented for STREAM)
gcc -O3 -fopenmp stream.c -o stream_gcc
Open64
- The flags below are optimisations for the AMD 6300 architecture
opencc -march=bdver1 -mp -Ofast -LNO:simd=2 -WOPT:sib=on \
-LNO:prefetch=2:pf2=0 -CG:use_prefetchnta=on -LNO:prefetch_ahead=4 -DSTREAM_ARRAY_SIZE=30000000 \
-DNTIMES=30 -DOFFSET=1840 stream.c -o stream_occ
Run
- Vary the number of threads by setting: export OMP_NUM_THREADS=32
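Varying the thread count by hand gets tedious, so a small loop can run the benchmark at several counts and report one figure per run. This is a minimal sketch, not a definitive harness: the helper names triad_rate and sweep_threads are made up for illustration, the binary path and thread counts are assumptions, and it relies on STREAM printing a results line beginning with "Triad:" whose second column is the best rate in MB/s.

```shell
#!/bin/sh
# Hypothetical helper: extract the Triad "Best Rate MB/s" value
# from STREAM output read on stdin.
triad_rate() {
    awk '$1 == "Triad:" { print $2 }'
}

# Hypothetical helper: run a STREAM binary once per thread count
# and print "<threads> <triad MB/s>" for each run.
sweep_threads() {
    bin=$1; shift
    for t in "$@"; do
        rate=$(OMP_NUM_THREADS=$t "$bin" | triad_rate)
        echo "$t $rate"
    done
}

# Example usage (binary name and counts are assumptions):
#   sweep_threads ./stream_icc 1 2 4 8 16 32
```

Comparing the reported rates across thread counts shows where memory bandwidth stops scaling on a given system.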
Intel
export OMP_NUM_THREADS=16
export KMP_AFFINITY=compact
./stream_icc
GCC
export GOMP_CPU_AFFINITY="0 1 2 ..."
./stream_gcc
Open64
- Below is the recommended configuration for the AMD 6300 architecture
- Peak memory bandwidth is achieved when STREAM is run on three cores of each NUMA node. For example, the following run achieves STREAM results about 5% better than a run using all cores on the same system.
# assuming 32 core system
export O64_OMP_AFFINITY="TRUE"
export O64_OMP_AFFINITY_MAP="2,4,6,10,12,14,18,20,22,26,28,30"
export OMP_NUM_THREADS=12
./stream_occ
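The affinity map above follows a regular pattern: cores 2, 4 and 6 within each group of eight. The sketch below generates such a map for other layouts. It assumes four NUMA nodes of eight cores each, numbered contiguously per node (node 0 holds cores 0 to 7, and so on); that numbering is an assumption and should be checked on the target machine, for example with numactl --hardware. The helper name affinity_map is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: build a comma-separated core list that selects
# the given core offsets within every NUMA node.
# Assumption: cores are numbered contiguously per node.
# Usage: affinity_map <nodes> <cores_per_node> <offset>...
affinity_map() {
    nodes=$1; cores_per_node=$2; shift 2
    map=""
    node=0
    while [ "$node" -lt "$nodes" ]; do
        for off in "$@"; do
            # Append with a comma separator except before the first entry.
            map="${map:+$map,}$(( node * cores_per_node + off ))"
        done
        node=$(( node + 1 ))
    done
    echo "$map"
}

# Three cores (offsets 2, 4, 6) on each of 4 nodes with 8 cores per node.
affinity_map 4 8 2 4 6
# prints 2,4,6,10,12,14,18,20,22,26,28,30
```

The output could then be assigned to O64_OMP_AFFINITY_MAP, with OMP_NUM_THREADS set to the number of entries in the list.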