Benchmarking: Stream (Memory Bandwidth)

From Define Wiki

STREAM: The STREAM benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels.

  • Note: Ensure power saving features are disabled; we need maximum clock speed to prevent fluctuations in performance:
    • /etc/init.d/cpuspeed stop
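On newer systemd-based distributions the cpuspeed init script no longer exists; a minimal sketch of the equivalent step, assuming the standard cpufreq sysfs layout, is to pin the scaling governor to performance (run as root):

```shell
# Pin every CPU's cpufreq governor to 'performance'.
# Paths assume the standard sysfs cpufreq layout - check your distro.
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    if [ -w "$g" ]; then echo performance > "$g"; fi
done
echo "governor update attempted"
```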

Get the source

  # (v 5.10 at the time of edit)
  wget http://www.cs.virginia.edu/stream/FTP/Code/stream.c

Compile

  • Either the Intel (icc) or GNU (gcc) compiler can be used to build STREAM
  • Ensure you build for multi-threaded runs (-fopenmp for gcc, -openmp for icc)
  • For large array sizes, include -mcmodel=medium
  • In our limited tests, Intel icc gave the best performance
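The -mcmodel=medium requirement comes from STREAM's three static double arrays: the footprint is 3 × 8 bytes × STREAM_ARRAY_SIZE, and once that exceeds the ~2 GiB limit of the default small code model the flag is needed. A quick sketch of the arithmetic (the array size below is just an example value):

```shell
# STREAM allocates three static double arrays (a, b, c).
# Footprint = 3 arrays * 8 bytes * STREAM_ARRAY_SIZE.
N=30000000   # example -DSTREAM_ARRAY_SIZE value
awk -v n="$N" 'BEGIN { printf "%.1f MiB\n", 3 * 8 * n / 1048576 }'
```

At this array size the footprint is well under 2 GiB; the 1.5-billion-element build later on this page (~33.5 GiB, hence the "33g" binary name) is the kind of build that needs -mcmodel=medium.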

Intel

  icc -O3 -static -openmp stream.c -o stream_icc

GCC

  • GCC typically gave the worst performance in our limited tests (better optimisation flags probably exist, but they are not well documented for STREAM)
  gcc -O3 -fopenmp stream.c -o stream_gcc

Open 64

  • Below are optimisations for the AMD 6300 series architecture
  opencc -march=bdver1 -mp -Ofast -LNO:simd=2 -WOPT:sib=on  \
     -LNO:prefetch=2:pf2=0 -CG:use_prefetchnta=on -LNO:prefetch_ahead=4 -DSTREAM_ARRAY_SIZE=30000000 \
     -DNTIMES=30 -DOFFSET=1840 stream.c -o stream_occ

Run

  • Vary the number of threads with, for example: export OMP_NUM_THREADS=32
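A minimal sketch of a thread-count sweep, assuming the stream_icc binary built above (the script only echoes the commands so the sweep is visible; pipe the output to sh to actually run it):

```shell
# Print one bandwidth-sweep command line per thread count.
# ./stream_icc is the binary name from the Intel build above.
for t in 1 2 4 8 16 32; do
    echo "env OMP_NUM_THREADS=$t ./stream_icc"
done
```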

Intel

  export OMP_NUM_THREADS=16
  export KMP_AFFINITY=compact
  ./stream_icc

GCC

  export GOMP_CPU_AFFINITY="0 1 2 ..."
  ./stream_gcc

Open64

  • Below are the recommended settings for the AMD 6300 series architecture
  • Peak memory bandwidth is achieved when STREAM runs on three cores of each NUMA node. For example, the following run shows the same system achieving STREAM results about 5% better than when using all cores.
  # assuming 32 core system
  export O64_OMP_AFFINITY="TRUE"
  export O64_OMP_AFFINITY_MAP="2,4,6,10,12,14,18,20,22,26,28,30"
  export OMP_NUM_THREADS=12
  ./stream_occ
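The affinity map above can be generated rather than hard-coded. A sketch, assuming the 32-core box has four 8-core NUMA nodes starting at cores 0, 8, 16 and 24 (verify the topology with numactl --hardware):

```shell
# Three cores per 8-core NUMA node: base+2, base+4, base+6 for each node base.
map=$(for base in 0 8 16 24; do seq $((base+2)) 2 $((base+6)); done | paste -sd, -)
echo "$map"   # → 2,4,6,10,12,14,18,20,22,26,28,30
```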

Results

Dual Socket E5-2600 Server

  # HT Disabled
  Function    Best Rate MB/s  Avg time     Min time     Max time
  Copy:           63992.2     0.079985     0.075009     0.083602
  Scale:          67067.1     0.072370     0.071570     0.073986
  Add:            65718.0     0.110574     0.109559     0.111694
  Triad:          66606.8     0.108982     0.108097     0.111754

Dual Socket AMD 6380 Server

  # built using open64 as above 
  OMP_NUM_THREADS=16 O64_OMP_AFFINITY=true O64_OMP_AFFINITY_MAP=$(seq -s, 0 2 31) ./stream_occ 
  
  Function    Best Rate MB/s  Avg time     Min time     Max time
  Copy:           63467.9     0.007673     0.007563     0.007783
  Scale:          66527.9     0.007532     0.007215     0.007711
  Add:            62947.3     0.011611     0.011438     0.011769
  Triad:          62544.5     0.011718     0.011512     0.011887

Numascale 4x AMD 6380 Dual-Socket Servers

  # build
  opencc -Ofast -march=bdver2 -mp -mcmodel=medium -LNO:simd=2 -WOPT:sib=on -LNO:prefetch=2:pf2=0 \
              -CG:use_prefetchnta=on -LNO:prefetch_ahead=4 -DSTREAM_ARRAY_SIZE=1500000000ULL stream.c -o stream_c.exe.open64.33g
  # run (on all 128 cores)
  OMP_NUM_THREADS=128 O64_OMP_AFFINITY=true O64_OMP_AFFINITY_MAP=$(seq -s, 0 1 127) ./stream_c.exe.open64.33g

  Function    Best Rate MB/s  Avg time     Min time     Max time
  Copy:          213159.1     0.118355     0.112592     0.143757
  Scale:         211838.0     0.126541     0.113294     0.153623
  Add:           199420.7     0.206908     0.180523     0.310094
  Triad:         200751.6     0.189908     0.179326     0.209648

  # run on every second core (one per memory controller)
  OMP_NUM_THREADS=64 O64_OMP_AFFINITY=true O64_OMP_AFFINITY_MAP=$(seq -s, 0 2 127) ./stream_c.exe.open64.33g 
  
  Function    Best Rate MB/s  Avg time     Min time     Max time
  Copy:          221879.3     0.110662     0.108167     0.113518
  Scale:         234276.7     0.108583     0.102443     0.121964
  Add:           213480.2     0.171661     0.168634     0.180814
  Triad:         215921.8     0.170627     0.166727     0.177991

Observations

Placement/distribution of jobs is very important for bandwidth performance when not all cores are used

  # Use KMP_AFFINITY (Intel icc only) to 'compact' jobs on to adjacent cores or 'scatter' to spread them across the system
  # System below is a Dual E5-2670 box with 64GB RAM, HT off, 16 cores

  # Using all 16 cores:
  Function    Best Rate MB/s  Avg time     Min time     Max time
  Copy:           63992.2     0.079985     0.075009     0.083602
  Scale:          67067.1     0.072370     0.071570     0.073986
  Add:            65718.0     0.110574     0.109559     0.111694
  Triad:          66606.8     0.108982     0.108097     0.111754
  
  # Using 8 cores (on one socket) - limited to ~50% of the total BW
  OMP_NUM_THREADS=8 KMP_AFFINITY=compact ./stream_icc
  Function    Best Rate MB/s  Avg time     Min time     Max time
  Copy:           31929.5     0.154730     0.150331     0.158771
  Scale:          32842.5     0.148586     0.146152     0.150800
  Add:            32240.0     0.224286     0.223325     0.225280
  Triad:          32340.8     0.223632     0.222629     0.228462

  # Using 8 cores (spread across two sockets) - you get the maximum BW available
  OMP_NUM_THREADS=8 KMP_AFFINITY=scatter ./stream_icc
  Function    Best Rate MB/s  Avg time     Min time     Max time
  Copy:           58487.3     0.082912     0.082069     0.084016
  Scale:          56235.1     0.085526     0.085356     0.085717
  Add:            63344.1     0.115197     0.113665     0.116755
  Triad:          64233.5     0.112643     0.112091     0.114209
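When comparing runs like these, the Triad line can be pulled out of STREAM's output programmatically. A sketch using sample output (the here-doc stands in for a real ./stream_icc run; replace it with the actual command):

```shell
# Extract the Triad best rate (MB/s) from STREAM's results table.
# The here-doc mimics real output; swap it for e.g. `./stream_icc`.
triad=$(awk '/^Triad:/ { print $2 }' <<'EOF'
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           58487.3     0.082912     0.082069     0.084016
Triad:          64233.5     0.112643     0.112091     0.114209
EOF
)
echo "$triad"   # → 64233.5
```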