Benchmarking: Stream (Memory Bandwidth)

The STREAM benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels.

Table 1: STREAM kernels (bytes/iteration assume 8-byte double-precision array elements)

Name     Kernel                   Bytes/Iteration    FLOPS/Iteration
COPY     a(i) = b(i)              16                 0
SCALE    a(i) = q*b(i)            16                 1
SUM      a(i) = b(i) + c(i)       24                 1
TRIAD    a(i) = b(i) + q*c(i)     24                 2
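
STREAM converts each kernel's best (minimum) time into the reported bandwidth using the bytes per iteration from Table 1:

  rate (MB/s) = 1.0e-6 * (Bytes/Iteration) * STREAM_ARRAY_SIZE / min_time

As a sanity check against the results further down: the Copy figure for the dual E5-2660 v2 system (79878.5 MB/s) is 16 bytes * 95,000,000 elements / 0.019029 s, i.e. roughly 79,878 decimal MB/s.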

The Copy benchmark measures the transfer rate in the absence of arithmetic. This should be one of the fastest memory operations, and it represents a common one: reading one value from memory, b(i), and writing it to another location, a(i).


The Scale benchmark adds a simple arithmetic operation to the Copy kernel, which starts to simulate real application behaviour. The operation reads b(i) from memory, multiplies it by the scalar q, and writes the result to a(i). It's a simple scalar operation, but more complex operations are built from it, so the performance of this simple test can be used as an indicator of the performance of more complex operations.


The third benchmark, Sum (reported as "Add" in STREAM's output), adds a third operand. It was originally written to exercise the multiple load/store ports of vector machines, back when vector machines were in vogue, but it remains very useful today because of the deep pipelines some processors possess. Rather than touching two arrays per iteration, this micro-benchmark reads two values, b(i) and c(i), and writes their sum to a(i). For larger arrays, this will quickly fill a processor pipeline, so you can test the memory bandwidth while the pipeline is filling or the performance once the pipeline is full. Moreover, this benchmark starts to approximate what some applications perform in real computations.


The fourth benchmark in Stream, the Triad benchmark, allows chained, overlapped, or fused multiply-add operations. It builds on the Sum benchmark by multiplying one of the fetched array values by a scalar. Given that fused multiply-add (FMA) is an important operation in many basic computations, such as dot products, matrix multiplication, polynomial evaluation, Newton's method for evaluating functions, and many DSP operations, this benchmark can be directly associated with application performance. FMA now has dedicated hardware instructions on most processors. Consequently, feeding such hardware units with data can be extremely important - hence the usefulness of the Triad memory bandwidth benchmark.
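
For reference, the four kernels boil down to loops like the following. This is a minimal sketch using the notation of Table 1, not the real benchmark code; stream.c adds timing and validation, and uses different source/destination arrays per kernel:

  /* Simplified STREAM kernels; not the timed originals from stream.c */
  #define N 10000000                /* STREAM_ARRAY_SIZE */
  static double a[N], b[N], c[N];

  void kernels(void)
  {
      const double q = 3.0;         /* the scalar */
      #pragma omp parallel for
      for (long i = 0; i < N; i++) a[i] = b[i];            /* COPY  */
      #pragma omp parallel for
      for (long i = 0; i < N; i++) a[i] = q * b[i];        /* SCALE */
      #pragma omp parallel for
      for (long i = 0; i < N; i++) a[i] = b[i] + c[i];     /* SUM (Add) */
      #pragma omp parallel for
      for (long i = 0; i < N; i++) a[i] = b[i] + q * c[i]; /* TRIAD */
  }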


There are two variables, or definitions, in the code that you should pay attention to. The first is STREAM_ARRAY_SIZE, the number of array elements used to run the benchmarks. In the current version it is set to 10,000,000, which the code states should be good enough for caches of up to 20MB. The Stream FAQ recommends a problem size such that each array is at least four times the sum of all the caches (L1, L2, and L3). You can either change the code to reflect the array sizes you want, or set the variable when compiling the code.


The second variable you might want to change is NTIMES, the number of times each benchmark is run. By default, Stream reports the "best" result over all iterations after the first, so always set NTIMES to at least 2 (the default is 10). This variable can also be set at compile time without changing the code, as shown below.
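
Both variables can be set on the compiler command line. As a hypothetical example, on a dual-socket machine with 20MB of L3 per socket, each array should be at least 4 x 40MB = 160MB, which at 8 bytes per element is 20,000,000 elements:

  # Hypothetical sizing: 2 sockets x 20MB L3 = 40MB total cache,
  # so each array >= 4 x 40MB = 160MB = 20,000,000 doubles
  gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=20000000 -DNTIMES=20 stream.c -o stream_gcc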


  • Note: Ensure power-saving features are disabled; we want the maximum clock speed to prevent fluctuations in performance:
    • /etc/init.d/cpuspeed stop
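  • On distributions that use cpufreq governors rather than the cpuspeed service, a rough equivalent (assuming the cpupower utility is installed):
    • cpupower frequency-set -g performance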

Get the source

  # (v 5.10 at the time of edit)
  wget http://www.cs.virginia.edu/stream/FTP/Code/stream.c

Compile

  • You can use either Intel ICC or GCC to build/compile
  • Ensure you build for multi-threaded runs (-fopenmp for gcc, -openmp for icc)
  • For large array sizes, include -mcmodel=medium
  • We saw the best performance with Intel ICC

Intel

  icc -O3 -static -openmp stream.c -o stream_icc

GCC

  • GCC typically gave the worst performance in the limited tests we performed (it probably needs better optimisation flags, but these are not well documented for STREAM)
  gcc -O3 -fopenmp stream.c -o stream_gcc
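  • One untested possibility (an assumption we have not benchmarked) is to let GCC tune for the build host:
  gcc -O3 -fopenmp -march=native stream.c -o stream_gcc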

Open64

  • Below are optimisations for the AMD 6300-series architecture:
  opencc -march=bdver1 -mp -Ofast -LNO:simd=2 -WOPT:sib=on  \
     -LNO:prefetch=2:pf2=0 -CG:use_prefetchnta=on -LNO:prefetch_ahead=4 -DSTREAM_ARRAY_SIZE=30000000 \
     -DNTIMES=30 -DOFFSET=1840 stream.c -o stream_occ

Run

  • Vary the number of threads with, for example: export OMP_NUM_THREADS=32

Intel

  export OMP_NUM_THREADS=16
  export KMP_AFFINITY=compact
  ./stream_icc

GCC

  export GOMP_CPU_AFFINITY="0 1 2 ..."
  ./stream_gcc
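  • GOMP_CPU_AFFINITY also accepts ranges, so on a 16-core box the list can be written as:
  export GOMP_CPU_AFFINITY="0-15"
  ./stream_gcc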

Open64

  • Below is the recommended best configuration for the AMD 6300 architecture
  • Peak memory bandwidth is achieved when STREAM runs on three cores of each NUMA node. For example, the following run achieves roughly 5% better STREAM results on the same system than using all cores.
  # assuming a 32-core system (four NUMA nodes, three cores per node)
  export O64_OMP_AFFINITY="TRUE"
  export O64_OMP_AFFINITY_MAP="2,4,6,10,12,14,18,20,22,26,28,30"
  export OMP_NUM_THREADS=12
  ./stream_occ

Results

Dual Socket E5-2660 v2 @ 2.20GHz

# HT Disabled
export OMP_NUM_THREADS=20
icc -O3 -openmp -DSTREAM_ARRAY_SIZE=95000000 -mcmodel=medium stream.c -o stream_icc
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           79878.5     0.019151     0.019029     0.019254
Scale:          87765.1     0.018336     0.017319     0.025528
Add:            92615.5     0.024757     0.024618     0.025459
Triad:          92773.6     0.024712     0.024576     0.024968
-------------------------------------------------------------

Dual Socket E5-2670 Server

  # HT Disabled
  Function    Best Rate MB/s  Avg time     Min time     Max time
  Copy:           63992.2     0.079985     0.075009     0.083602
  Scale:          67067.1     0.072370     0.071570     0.073986
  Add:            65718.0     0.110574     0.109559     0.111694
  Triad:          66606.8     0.108982     0.108097     0.111754

Dual Socket AMD 6380 Server

  # built using open64 as above 
  OMP_NUM_THREADS=16 O64_OMP_AFFINITY=true O64_OMP_AFFINITY_MAP=$(seq -s, 0 2 31) ./stream_occ 
  
  Function    Best Rate MB/s  Avg time     Min time     Max time
  Copy:           63467.9     0.007673     0.007563     0.007783
  Scale:          66527.9     0.007532     0.007215     0.007711
  Add:            62947.3     0.011611     0.011438     0.011769
  Triad:          62544.5     0.011718     0.011512     0.011887

Numascale 4x AMD 6380 Dual-Socket Servers

  # build
  opencc -Ofast -march=bdver2 -mp -mcmodel=medium -LNO:simd=2 -WOPT:sib=on -LNO:prefetch=2:pf2=0 \
              -CG:use_prefetchnta=on -LNO:prefetch_ahead=4 -DSTREAM_ARRAY_SIZE=1500000000ULL stream.c -o stream_c.exe.open64.33g
  # run (on all 128 cores)
  OMP_NUM_THREADS=128 O64_OMP_AFFINITY=true O64_OMP_AFFINITY_MAP=$(seq -s, 0 1 127) ./stream_c.exe.open64.33g

  Function    Best Rate MB/s  Avg time     Min time     Max time
  Copy:          213159.1     0.118355     0.112592     0.143757
  Scale:         211838.0     0.126541     0.113294     0.153623
  Add:           199420.7     0.206908     0.180523     0.310094
  Triad:         200751.6     0.189908     0.179326     0.209648

  # run on every second core (one per memory controller)
  OMP_NUM_THREADS=64 O64_OMP_AFFINITY=true O64_OMP_AFFINITY_MAP=$(seq -s, 0 2 127) ./stream_c.exe.open64.33g 
  
  Function    Best Rate MB/s  Avg time     Min time     Max time
  Copy:          221879.3     0.110662     0.108167     0.113518
  Scale:         234276.7     0.108583     0.102443     0.121964
  Add:           213480.2     0.171661     0.168634     0.180814
  Triad:         215921.8     0.170627     0.166727     0.177991

Single Calxeda ECX-1000 @ 1.4GHz

# build
gcc -O3 -fopenmp /root/stream.c -o /root/stream_gcc

# run (on all 4 cores)
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            1696.5     0.094708     0.094311     0.094899
Scale:           1696.8     0.095611     0.094293     0.096759
Add:             1993.1     0.120893     0.120413     0.121506
Triad:           1865.7     0.130471     0.128638     0.132437

Single Calxeda ECX-2000

# build
gcc -O3 -fopenmp /root/stream.c -o /root/stream_gcc

# run (on all 4 cores)
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            3247.7     0.049727     0.049266     0.050480
Scale:           4490.5     0.035814     0.035631     0.036157
Add:             4051.3     0.059573     0.059240     0.059767
Triad:           3847.4     0.062560     0.062380     0.062768

Observations

Placement/distribution of threads when not using all cores is very important for bandwidth performance

  # Use KMP_AFFINITY (Intel icc only) to 'compact' threads onto adjacent cores or 'scatter' them across the system
  # System below is a dual E5-2670 box with 64GB RAM, HT off, 16 cores

  # Using all 16 cores:
  Function    Best Rate MB/s  Avg time     Min time     Max time
  Copy:           63992.2     0.079985     0.075009     0.083602
  Scale:          67067.1     0.072370     0.071570     0.073986
  Add:            65718.0     0.110574     0.109559     0.111694
  Triad:          66606.8     0.108982     0.108097     0.111754
  
  # Using 8 cores (on one socket) - limited to roughly 50% of the bandwidth (one socket's memory channels)
  OMP_NUM_THREADS=8 KMP_AFFINITY=compact ./stream_icc
  Function    Best Rate MB/s  Avg time     Min time     Max time
  Copy:           31929.5     0.154730     0.150331     0.158771
  Scale:          32842.5     0.148586     0.146152     0.150800
  Add:            32240.0     0.224286     0.223325     0.225280
  Triad:          32340.8     0.223632     0.222629     0.228462

  # Using 8 cores (spread across two sockets) - you get close to the max BW available
  OMP_NUM_THREADS=8 KMP_AFFINITY=scatter ./stream_icc
  Function    Best Rate MB/s  Avg time     Min time     Max time
  Copy:           58487.3     0.082912     0.082069     0.084016
  Scale:          56235.1     0.085526     0.085356     0.085717
  Add:            63344.1     0.115197     0.113665     0.116755
  Triad:          64233.5     0.112643     0.112091     0.114209