Benchmarking: Stream (Memory Bandwidth)
The STREAM benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels.

* Note: Ensure power-saving features are disabled; we need the maximum clock speed to prevent fluctuations in performance (see the governor sketch below):
* /etc/init.d/cpuspeed stop
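The cpuspeed init script only exists on some distributions; below is a minimal sketch for checking and pinning the frequency governor directly, assuming the cpufreq sysfs interface and the cpupower tool are available.
<syntaxhighlight>
# Check the current frequency governor on every core (assumes the cpufreq sysfs interface)
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Pin the 'performance' governor so clocks stay at maximum for the run (assumes cpupower is installed)
cpupower frequency-set -g performance
</syntaxhighlight>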
== Get the source ==
* Main STREAM website: http://www.cs.virginia.edu/stream/
* Pull the latest copy of STREAM from:
<syntaxhighlight>
# (v5.10 at the time of edit)
wget http://www.cs.virginia.edu/stream/FTP/Code/stream.c
</syntaxhighlight>

== Compile ==
* You can build with either Intel ICC or GCC (an Open64 build is also covered below).
* Ensure you build for multi-threaded runs: -fopenmp (gcc) or -openmp (icc).
* For large array sizes, include -mcmodel=medium (see the sizing sketch after this list).
* We noticed the best performance using Intel ICC.
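As a rough guide, the STREAM source recommends making each array several times larger than the combined last-level caches of the machine. The build line below is only a sizing sketch; the 100-million-element array size, the NTIMES value and the stream_icc_large output name are illustrative values to adjust for your own box.
<syntaxhighlight>
# Sizing sketch: 100M elements x 8 bytes x 3 arrays ~ 2.4 GB of static data,
# which is past the 2 GB small-model limit and therefore needs -mcmodel=medium
icc -O3 -openmp -mcmodel=medium -DSTREAM_ARRAY_SIZE=100000000 -DNTIMES=20 stream.c -o stream_icc_large
</syntaxhighlight>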
=== Intel ===
<syntaxhighlight>
icc -O3 -static -openmp stream.c -o stream_icc
</syntaxhighlight>

=== GCC ===
* GCC typically gave the worst performance in the limited tests we performed; better optimisation flags would probably help, but they are not well documented for STREAM (see the sketch after the build line below).
<syntaxhighlight>
gcc -O3 -fopenmp stream.c -o stream_gcc
</syntaxhighlight>
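If you want to experiment with gcc tuning, the obvious candidates are flags that target the build host's CPU and enable software prefetching. We have not benchmarked these, so treat the line below (and the stream_gcc_tuned name) as a sketch only.
<syntaxhighlight>
# Untested tuning sketch - targets the local CPU and adds software prefetching
gcc -O3 -fopenmp -march=native -mtune=native -fprefetch-loop-arrays stream.c -o stream_gcc_tuned
</syntaxhighlight>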
=== Open64 ===
* Below are optimisations for the AMD 6300 architecture:
<syntaxhighlight>
opencc -march=bdver1 -mp -Ofast -LNO:simd=2 -WOPT:sib=on \
  -LNO:prefetch=2:pf2=0 -CG:use_prefetchnta=on -LNO:prefetch_ahead=4 -DSTREAM_ARRAY_SIZE=30000000 \
  -DNTIMES=30 -DOFFSET=1840 stream.c -o stream_occ
</syntaxhighlight>

== Run ==
* Vary the number of threads used via, for example, export OMP_NUM_THREADS=32 (see the sweep sketch below).
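To see how bandwidth scales with thread count, a quick sweep loop is handy. This is a sketch that assumes the stream_icc binary built above; adjust the thread counts to your core count.
<syntaxhighlight>
# Sweep thread counts and print only the Triad rate from each run
for t in 1 2 4 8 16; do
    echo "${t} threads:"
    OMP_NUM_THREADS=${t} KMP_AFFINITY=scatter ./stream_icc | grep '^Triad'
done
</syntaxhighlight>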
=== Intel ===
<syntaxhighlight>
export OMP_NUM_THREADS=16
export KMP_AFFINITY=compact
./stream_icc
</syntaxhighlight>

=== GCC ===
<syntaxhighlight>
export GOMP_CPU_AFFINITY="0 1 2 ..."
./stream_gcc
</syntaxhighlight>
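GOMP_CPU_AFFINITY also accepts ranges and strides, which avoids listing every core ID by hand; the core counts below are illustrative.
<syntaxhighlight>
# Cores 0-15 as a range
export GOMP_CPU_AFFINITY="0-15"
# Every second core (stride of 2), e.g. one thread per physical core when HT is enabled
export GOMP_CPU_AFFINITY="0-31:2"
./stream_gcc
</syntaxhighlight>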
=== Open64 ===
* Below is the recommended best configuration for the AMD 6300 architecture.
* Peak memory bandwidth is achieved when STREAM is run on three cores of each NUMA node. For example, the following run shows that the same system can achieve STREAM results about 5% better than when using all cores.
<syntaxhighlight>
# assuming a 32 core system
export O64_OMP_AFFINITY="TRUE"
export O64_OMP_AFFINITY_MAP="2,4,6,10,12,14,18,20,22,26,28,30"
export OMP_NUM_THREADS=12
./stream
</syntaxhighlight>
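To work out which core IDs belong to which NUMA node (and hence what to put into O64_OMP_AFFINITY_MAP), lscpu or numactl will print the topology; the exact output varies by system.
<syntaxhighlight>
# Show the NUMA nodes and the core IDs attached to each one
numactl --hardware

# The same information from lscpu (see the "NUMA nodeX CPU(s)" lines)
lscpu
</syntaxhighlight>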
== Results ==

=== Dual Socket E5-2600 Server ===
<syntaxhighlight>
# HT Disabled
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           63992.2     0.079985     0.075009     0.083602
Scale:          67067.1     0.072370     0.071570     0.073986
Add:            65718.0     0.110574     0.109559     0.111694
Triad:          66606.8     0.108982     0.108097     0.111754
</syntaxhighlight>

=== Dual Socket AMD 6380 Server ===
<syntaxhighlight>
# built using open64 as above
OMP_NUM_THREADS=16 O64_OMP_AFFINITY=true O64_OMP_AFFINITY_MAP=$(seq -s, 0 2 31) ./stream_occ
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           63467.9     0.007673     0.007563     0.007783
Scale:          66527.9     0.007532     0.007215     0.007711
Add:            62947.3     0.011611     0.011438     0.011769
Triad:          62544.5     0.011718     0.011512     0.011887
</syntaxhighlight>

=== Numascale 4x AMD 6380 Dual-Socket Servers ===
<syntaxhighlight>
# build
opencc -Ofast -march=bdver2 -mp -mcmodel=medium -LNO:simd=2 -WOPT:sib=on -LNO:prefetch=2:pf2=0 \
  -CG:use_prefetchnta=on -LNO:prefetch_ahead=4 -DSTREAM_ARRAY_SIZE=1500000000ULL stream.c -o stream_c.exe.open64.33g

# run (on all 128 cores)
OMP_NUM_THREADS=128 O64_OMP_AFFINITY=true O64_OMP_AFFINITY_MAP=$(seq -s, 0 1 127) ./stream_c.exe.open64.33g
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:          213159.1     0.118355     0.112592     0.143757
Scale:         211838.0     0.126541     0.113294     0.153623
Add:           199420.7     0.206908     0.180523     0.310094
Triad:         200751.6     0.189908     0.179326     0.209648

# run on every second core (one per memory controller)
OMP_NUM_THREADS=64 O64_OMP_AFFINITY=true O64_OMP_AFFINITY_MAP=$(seq -s, 0 2 127) ./stream_c.exe.open64.33g
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:          221879.3     0.110662     0.108167     0.113518
Scale:         234276.7     0.108583     0.102443     0.121964
Add:           213480.2     0.171661     0.168634     0.180814
Triad:         215921.8     0.170627     0.166727     0.177991
</syntaxhighlight>

== Observations ==
Placement/distribution of jobs that do not use all cores is very important for bandwidth performance.
<syntaxhighlight>
# Use KMP_AFFINITY (Intel icc only) to 'compact' jobs on to adjacent cores or 'scatter' them across the system
# System below is a dual E5-2670 box with 64GB RAM, HT off, 16 cores

# Using all 16 cores:
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           63992.2     0.079985     0.075009     0.083602
Scale:          67067.1     0.072370     0.071570     0.073986
Add:            65718.0     0.110574     0.109559     0.111694
Triad:          66606.8     0.108982     0.108097     0.111754

# Using 8 cores (on one socket) - limited to ~50% of the bandwidth
OMP_NUM_THREADS=8 KMP_AFFINITY=compact ./stream_icc
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           31929.5     0.154730     0.150331     0.158771
Scale:          32842.5     0.148586     0.146152     0.150800
Add:            32240.0     0.224286     0.223325     0.225280
Triad:          32340.8     0.223632     0.222629     0.228462

# Using 8 cores (spread across two sockets) - you get the maximum bandwidth available
OMP_NUM_THREADS=8 KMP_AFFINITY=scatter ./stream_icc
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           58487.3     0.082912     0.082069     0.084016
Scale:          56235.1     0.085526     0.085356     0.085717
Add:            63344.1     0.115197     0.113665     0.116755
Triad:          64233.5     0.112643     0.112091     0.114209
</syntaxhighlight>
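For binaries built without icc's KMP_AFFINITY support, the same placement experiments can be reproduced externally with numactl. This is a minimal sketch assuming a two-socket machine whose sockets are NUMA nodes 0 and 1, using the stream_gcc binary built above.
<syntaxhighlight>
# Confine 8 threads and their memory to one socket - expect roughly half the bandwidth
OMP_NUM_THREADS=8 numactl --cpunodebind=0 --membind=0 ./stream_gcc

# Use both sockets and interleave pages across both memory controllers
OMP_NUM_THREADS=8 numactl --cpunodebind=0,1 --interleave=all ./stream_gcc
</syntaxhighlight>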