Benchmarking: Stream (Memory Bandwidth)
The STREAM benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels.
<syntaxhighlight>
Table 1: Stream Benchmarks

Name     Kernel                  Bytes/Iteration   FLOPS/Iteration
COPY     a(i) = b(i)             16                0
SCALE    a(i) = q*b(i)           16                1
SUM      a(i) = b(i) + c(i)      24                1
TRIAD    a(i) = b(i) + q*c(i)    24                2
</syntaxhighlight>

The Copy benchmark measures the transfer rate in the absence of arithmetic. This should be one of the fastest memory operations, but it also represents a common one: per iteration it reads one value, b(i), and writes one value, a(i), with no computation in between.
The Scale benchmark adds a simple arithmetic operation to the Copy benchmark. This starts to simulate real application operations. Per iteration it reads b(i), multiplies it by a constant q, and writes the result to a(i). It's a simple scalar operation, but more complex operations are built from it, so the performance of this simple test can be used as an indicator of the performance of more complex operations.
The third benchmark, the Sum benchmark, adds a third operand and was originally written to test multiple load/store ports on the vector machines that were in vogue at the time. However, this benchmark is still very useful today because of the deep pipelines that some processors possess. Rather than moving two values through memory per iteration, this micro-benchmark moves three: it reads b(i) and c(i) and writes a(i). For larger arrays, this will quickly fill a processor pipeline, so you can test the memory bandwidth while the pipeline is filling, or the performance once the pipeline is full. Moreover, this benchmark starts to approximate what some applications perform in real computations.
The fourth benchmark in Stream, the Triad benchmark, exercises chained, overlapped, or fused multiply-add operations. It builds on the Sum benchmark by adding an arithmetic operation to one of the fetched array values. Given that fused multiply-add (FMA) operations are important in many basic computations, such as dot products, matrix multiplication, polynomial evaluation, Newton's method for evaluating functions, and many DSP operations, this benchmark can be directly associated with application performance. FMA now has dedicated instructions on most processors and is usually executed in hardware. Consequently, feeding such hardware operations with data can be extremely important – hence the usefulness of the Triad memory bandwidth benchmark.
There are two variables or definitions in the code that you should pay attention to. The first is STREAM_ARRAY_SIZE. This is the number of array elements used to run the benchmarks. In the current version, it is set to 10,000,000, which the code states should be good enough for caches up to 20MB. The Stream FAQ recommends you use a problem size such that each array is four times the sum of the caches (L1, L2, and L3). You can either change the code to reflect the array sizes you want, or you can set the variable when compiling the code.
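As a rough sketch of the FAQ rule above, the last-level cache size can be queried on Linux and turned into a minimum array size; note that SOCKETS is a placeholder you must set for your machine, and that L1/L2 are ignored here (L3 dominates), so round up generously:
<syntaxhighlight>
# Sketch: derive a minimum STREAM_ARRAY_SIZE from the last-level cache,
# following the "4x the sum of the caches" rule. Assumes Linux/glibc getconf;
# SOCKETS must be set for your system.
SOCKETS=2
L3=$(getconf LEVEL3_CACHE_SIZE)        # per-socket L3 size in bytes
ELEMENTS=$(( L3 * SOCKETS * 4 / 8 ))   # 8 bytes per double-precision element
echo "compile with -DSTREAM_ARRAY_SIZE=${ELEMENTS} (or larger)"
</syntaxhighlight>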
The second variable you might want to change is NTIMES, the number of times each benchmark is run. Stream reports the "best" result across all iterations after the first, so NTIMES must always be at least 2 (the default is 10). This variable can also be set at compile time without changing the code.
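For example, both definitions can be overridden on the compile line (the values below are purely illustrative):
<syntaxhighlight>
# Override both settings at compile time without editing stream.c
gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=120000000 -DNTIMES=20 \
    -mcmodel=medium stream.c -o stream_gcc
</syntaxhighlight>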
* '''Note''': Ensure power saving features are disabled; we need maximum clock speed to prevent fluctuations in performance:
<syntaxhighlight>
/etc/init.d/cpuspeed stop
</syntaxhighlight>
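On newer distributions without the cpuspeed service, the same effect can usually be achieved through the cpufreq governor (a sketch; the exact tooling varies by distribution):
<syntaxhighlight>
# Sketch for cpufreq-based systems: pin the governor to 'performance'
cpupower frequency-set -g performance
# or, directly via sysfs (as root):
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"
done
</syntaxhighlight>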
== Get the source ==
* Main STREAM website: http://www.cs.virginia.edu/stream/
* Pull the latest copy of STREAM from:
<syntaxhighlight>
# (v 5.10 at the time of edit)
wget http://www.cs.virginia.edu/stream/FTP/Code/stream.c
</syntaxhighlight>

== Compile ==
* Either Intel icc or GCC can be used to build/compile
* Ensure you build for multi-threaded runs: -fopenmp (gcc), -openmp (icc)
* For large array sizes, include -mcmodel=medium
* Best performance was observed with Intel icc
=== Intel ===
<syntaxhighlight>
icc -O3 -static -openmp stream.c -o stream_icc
</syntaxhighlight>

=== GCC ===
* GCC typically gave the worst performance in the limited tests we performed (better optimisation flags are probably required, but these are not well documented for STREAM)
<syntaxhighlight>
gcc -O3 -fopenmp stream.c -o stream_gcc
</syntaxhighlight>

=== Open 64 ===
* Below are optimisations for the AMD 6300 architecture
<syntaxhighlight>
opencc -march=bdver1 -mp -Ofast -LNO:simd=2 -WOPT:sib=on \
 -LNO:prefetch=2:pf2=0 -CG:use_prefetchnta=on -LNO:prefetch_ahead=4 -DSTREAM_ARRAY_SIZE=30000000 \
 -DNTIMES=30 -DOFFSET=1840 stream.c -o stream_occ
</syntaxhighlight>

== Run ==
* Vary the number of threads by setting, for example: export OMP_NUM_THREADS=32
=== Intel ===
<syntaxhighlight>
export OMP_NUM_THREADS=16
export KMP_AFFINITY=compact
./stream_icc
</syntaxhighlight>

=== GCC ===
<syntaxhighlight>
export GOMP_CPU_AFFINITY="0 1 2 ..."
./stream_gcc
</syntaxhighlight>
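The core list above is abbreviated; GOMP_CPU_AFFINITY also accepts ranges with a stride, so the full list does not have to be typed out. A sketch, assuming a 16-core box:
<syntaxhighlight>
# Sketch, assuming a 16-core box: pin 8 threads to every second core.
# GOMP_CPU_AFFINITY accepts explicit lists and strided ranges alike.
export GOMP_CPU_AFFINITY="0-15:2"   # equivalent to "0 2 4 6 8 10 12 14"
OMP_NUM_THREADS=8 ./stream_gcc
</syntaxhighlight>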
=== Open64 ===
* Below is the recommended best practice for the AMD 6300 architecture
* Peak memory bandwidth is achieved when STREAM is run on three cores of each NUMA node. For example, the following run shows that the same system can achieve around 5% higher STREAM bandwidth than when using all cores.
<syntaxhighlight>
# assuming 32 core system
export O64_OMP_AFFINITY="TRUE"
export O64_OMP_AFFINITY_MAP="2,4,6,10,12,14,18,20,22,26,28,30"
export OMP_NUM_THREADS=12
./stream
</syntaxhighlight>
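To work out which core IDs belong to which NUMA node when building such an affinity map, the topology can be inspected first (assumes the numactl package is installed):
<syntaxhighlight>
# Sketch: inspect the NUMA layout before choosing an affinity map
numactl --hardware     # lists nodes and the CPU IDs that belong to each
lscpu | grep -i numa   # quick summary of NUMA node / CPU mapping
</syntaxhighlight>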
== Results ==

=== Dual Socket E5-2660 v2 @ 2.20GHz ===
<syntaxhighlight>
# HT Disabled
export OMP_NUM_THREADS=20
icc -O3 -openmp -DSTREAM_ARRAY_SIZE=95000000 -mcmodel=medium stream.c -o stream_icc
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 79878.5 0.019151 0.019029 0.019254
Scale: 87765.1 0.018336 0.017319 0.025528
Add: 92615.5 0.024757 0.024618 0.025459
Triad: 92773.6 0.024712 0.024576 0.024968
-------------------------------------------------------------
</syntaxhighlight>

=== Dual Socket E5-2600 Server ===
<syntaxhighlight>
# HT Disabled
Function Best Rate MB/s Avg time Min time Max time
Copy: 63992.2 0.079985 0.075009 0.083602
Scale: 67067.1 0.072370 0.071570 0.073986
Add: 65718.0 0.110574 0.109559 0.111694
Triad: 66606.8 0.108982 0.108097 0.111754
</syntaxhighlight>

=== Dual Socket AMD 6380 Server ===
<syntaxhighlight>
# built using open64 as above
OMP_NUM_THREADS=16 O64_OMP_AFFINITY=true O64_OMP_AFFINITY_MAP=$(seq -s, 0 2 31) ./stream_occ
Function Best Rate MB/s Avg time Min time Max time
Copy: 63467.9 0.007673 0.007563 0.007783
Scale: 66527.9 0.007532 0.007215 0.007711
Add: 62947.3 0.011611 0.011438 0.011769
Triad: 62544.5 0.011718 0.011512 0.011887
</syntaxhighlight>

=== Numascale 4x AMD 6380 Dual-Socket Servers ===
<syntaxhighlight>
# build
opencc -Ofast -march=bdver2 -mp -mcmodel=medium -LNO:simd=2 -WOPT:sib=on -LNO:prefetch=2:pf2=0 \
-CG:use_prefetchnta=on -LNO:prefetch_ahead=4 -DSTREAM_ARRAY_SIZE=1500000000ULL stream.c -o stream_c.exe.open64.33g
# run (on all 128 cores)
OMP_NUM_THREADS=128 O64_OMP_AFFINITY=true O64_OMP_AFFINITY_MAP=$(seq -s, 0 1 127) ./stream_c.exe.open64.33g
Function Best Rate MB/s Avg time Min time Max time
Copy: 213159.1 0.118355 0.112592 0.143757
Scale: 211838.0 0.126541 0.113294 0.153623
Add: 199420.7 0.206908 0.180523 0.310094
Triad: 200751.6 0.189908 0.179326 0.209648
# run on every second core (one per memory controller)
OMP_NUM_THREADS=64 O64_OMP_AFFINITY=true O64_OMP_AFFINITY_MAP=$(seq -s, 0 2 127) ./stream_c.exe.open64.33g
Function Best Rate MB/s Avg time Min time Max time
Copy: 221879.3 0.110662 0.108167 0.113518
Scale: 234276.7 0.108583 0.102443 0.121964
Add: 213480.2 0.171661 0.168634 0.180814
Triad: 215921.8 0.170627 0.166727 0.177991
</syntaxhighlight>

=== Single Calxeda ECX-1000 @ 1.4GHz ===
<syntaxhighlight>
# build
gcc -O3 -fopenmp /root/stream.c -o /root/stream_gcc
# run (on all 4 cores)
Function Best Rate MB/s Avg time Min time Max time
Copy: 1696.5 0.094708 0.094311 0.094899
Scale: 1696.8 0.095611 0.094293 0.096759
Add: 1993.1 0.120893 0.120413 0.121506
Triad: 1865.7 0.130471 0.128638 0.132437
</syntaxhighlight>

=== Single Calxeda ECX-2000 ===
<syntaxhighlight>
# build
gcc -O3 -fopenmp /root/stream.c -o /root/stream_gcc
# run (on all 4 cores)
Function Best Rate MB/s Avg time Min time Max time
Copy: 3247.7 0.049727 0.049266 0.050480
Scale: 4490.5 0.035814 0.035631 0.036157
Add: 4051.3 0.059573 0.059240 0.059767
Triad: 3847.4 0.062560 0.062380 0.062768
</syntaxhighlight>

=== Haswell E5-2640v3 DDR4-1866MHz ===
<syntaxhighlight>
# HT Disabled / service cpuspeed stop
[root@localhost stream]# icc -O3 -openmp -DSTREAM_ARRAY_SIZE=95000000 -mcmodel=medium stream.c -o stream_icc_dp
[root@localhost stream]# OMP_NUM_THREADS=8 KMP_AFFINITY=scatter ./stream_icc_dp
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 95000000 (elements), Offset = 0 (elements)
Memory per array = 724.8 MiB (= 0.7 GiB).
Total memory required = 2174.4 MiB (= 2.1 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 8
Number of Threads counted = 8
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 16039 microseconds.
(= 16039 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 81209.4 0.018779 0.018717 0.018873
Scale: 79819.5 0.019124 0.019043 0.019198
Add: 83415.6 0.027439 0.027333 0.027526
Triad: 83294.3 0.027480 0.027373 0.027533
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
</syntaxhighlight>

=== Haswell E5-2650v3 DDR4-2133MHz ===
<syntaxhighlight>
# HT on
[root@e5-2650v3-nodee stream]# icc -O3 -openmp -DSTREAM_ARRAY_SIZE=4500000000 -mcmodel=medium stream.c -o stream_icc_dp
[root@e5-2650v3-nodee stream]# OMP_NUM_THREADS=20 KMP_AFFINITY=scatter ./stream_icc_dp
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 4500000000 (elements), Offset = 0 (elements)
Memory per array = 34332.3 MiB (= 33.5 GiB).
Total memory required = 102996.8 MiB (= 100.6 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 20
Number of Threads counted = 20
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 714150 microseconds.
(= 714150 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 107295.5 0.671414 0.671044 0.672320
Scale: 107487.7 0.670575 0.669844 0.671884
Add: 110500.1 0.978171 0.977375 0.981087
Triad: 111482.3 0.969603 0.968764 0.972643
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
</syntaxhighlight>

=== Quad Socket E5-4640v2 ===
* With 128GB RAM, HT enabled but running half the cores (80 cores total)
* Memory speed at 1866MHz
<syntaxhighlight>
[user@lz4 stream]$ icc -O3 -openmp -DSTREAM_ARRAY_SIZE=4500000000 -mcmodel=medium stream.c -o stream_icc
[user@lz4 stream]$ OMP_NUM_THREADS=40 KMP_AFFINITY=scatter ./stream_icc
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 4500000000 (elements), Offset = 0 (elements)
Memory per array = 34332.3 MiB (= 33.5 GiB).
Total memory required = 102996.8 MiB (= 100.6 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 40
Number of Threads counted = 40
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 460634 microseconds.
(= 460634 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 145430.8 0.495683 0.495081 0.498299
Scale: 141597.3 0.509096 0.508484 0.510163
Add: 161922.9 0.668773 0.666984 0.672804
Triad: 162028.6 0.668336 0.666549 0.672009
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
</syntaxhighlight>

== Observations ==
Placement/distribution of threads is very important for bandwidth performance when not all cores are used.
<syntaxhighlight>
# Use KMP_AFFINITY (Intel icc only) to 'compact' jobs onto adjacent cores or 'scatter' to spread them across the system
# System below is a Dual E5-2670 box with 64GB RAM, HT off, 16 cores
# Using all 16 cores:
Function Best Rate MB/s Avg time Min time Max time
Copy: 63992.2 0.079985 0.075009 0.083602
Scale: 67067.1 0.072370 0.071570 0.073986
Add: 65718.0 0.110574 0.109559 0.111694
Triad: 66606.8 0.108982 0.108097 0.111754
# Using 8 cores (on one socket) - limited to 50% of the BW
OMP_NUM_THREADS=8 KMP_AFFINITY=compact ./stream_icc
Function Best Rate MB/s Avg time Min time Max time
Copy: 31929.5 0.154730 0.150331 0.158771
Scale: 32842.5 0.148586 0.146152 0.150800
Add: 32240.0 0.224286 0.223325 0.225280
Triad: 32340.8 0.223632 0.222629 0.228462
# Using 8 cores (spread across two sockets) - You get the max BW available
OMP_NUM_THREADS=8 KMP_AFFINITY=scatter ./stream_icc
Function Best Rate MB/s Avg time Min time Max time
Copy: 58487.3 0.082912 0.082069 0.084016
Scale: 56235.1 0.085526 0.085356 0.085717
Add: 63344.1 0.115197 0.113665 0.116755
Triad: 64233.5 0.112643 0.112091 0.114209
</syntaxhighlight>
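The same compact-versus-scatter experiment can be reproduced without KMP_AFFINITY, e.g. for gcc-built binaries, by binding through numactl (a sketch; assumes numactl is installed and a two-socket layout with NUMA nodes 0 and 1):
<syntaxhighlight>
# Sketch: reproduce compact vs scatter placement with numactl

# 'compact' equivalent: all 8 threads and their memory on socket 0
OMP_NUM_THREADS=8 numactl --cpunodebind=0 --membind=0 ./stream_gcc

# 'scatter'-like equivalent: threads on both sockets, memory interleaved
OMP_NUM_THREADS=8 numactl --interleave=all ./stream_gcc
</syntaxhighlight>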
== Dual Skylake 6132 ==
* Only logical cores used for the test (2 x 14 = 28 cores)
<syntaxhighlight>
# 12 sticks of 16GB 2666MHz
# /opt/intel/bin/icc -O3 -qopenmp -parallel -AVX -DSTREAM_ARRAY_SIZE=800000000 -mcmodel=medium stream.c -o stream_icc_AVX
# for j in 28; do for i in {1..100}; do OMP_NUM_THREADS=$j ./stream_icc_AVX | grep "Copy\|Scale\|Add\|Triad\|counted" | tee -a skylake_6132_12x16gb_stream_icc_AVX_28core.txt ;done ;done
# grep the best results: cat skylake_6132_12x16gb_stream_icc_AVX_28core.txt | sort -k 4 | grep -i copy
Copy: 140025.6 0.092183 0.091412 0.092748
Scale: 141014.3 0.091679 0.090771 0.092523
Add: 137117.5 0.141334 0.140026 0.142912
Triad: 141462.5 0.136485 0.135725 0.137369
# 6 sticks of 32GB 2666MHz
# /opt/intel/bin/icc -O3 -qopenmp -parallel -AVX -DSTREAM_ARRAY_SIZE=800000000 -mcmodel=medium stream.c -o stream_icc_AVX
# for j in 28; do for i in {1..100}; do OMP_NUM_THREADS=$j ./stream_icc_AVX | grep "Copy\|Scale\|Add\|Triad\|counted" | tee -a skylake_6132_6x32gb_stream_icc_AVX_28core.txt ;done ;done
# grep the best results: cat skylake_6132_6x32gb_stream_icc_AVX_28core.txt | sort -k 4 | grep -i copy
Copy: 94531.2 0.136154 0.135405 0.137249
Scale: 94785.4 0.135705 0.135042 0.136687
Add: 97745.7 0.198002 0.196428 0.199466
Triad: 100247.4 0.192429 0.191526 0.192869
</syntaxhighlight>

== Dual Skylake 5120 ==
* Only logical cores used for the test (2 x 14 = 28 cores)
<syntaxhighlight>
# 12 sticks of 16GB 2666MHz (running at 2400MHz)
# /opt/intel/bin/icc -O3 -qopenmp -parallel -AVX -DSTREAM_ARRAY_SIZE=800000000 -mcmodel=medium stream.c -o stream_icc_AVX
# for j in 28; do for i in {1..100}; do OMP_NUM_THREADS=$j ./stream_icc_AVX | grep "Copy\|Scale\|Add\|Triad\|counted" | tee -a skylake_5120_12x16gb_stream_icc_AVX_28core.txt ;done ;done
# grep the best results: cat skylake_5120_12x16gb_stream_icc_AVX_28core.txt | sort -k 4 | grep -i copy
Copy: 140025.6 0.092183 0.091412 0.092748
Scale: 141014.3 0.091679 0.090771 0.092523
Add: 137117.5 0.141334 0.140026 0.142912
Triad: 141462.5 0.136485 0.135725 0.137369
# 6 sticks of 32GB 2666MHz (running at 2400MHz)
# /opt/intel/bin/icc -O3 -qopenmp -parallel -AVX -DSTREAM_ARRAY_SIZE=800000000 -mcmodel=medium stream.c -o stream_icc_AVX
# for j in 28; do for i in {1..100}; do OMP_NUM_THREADS=$j ./stream_icc_AVX | grep "Copy\|Scale\|Add\|Triad\|counted" | tee -a skylake_5120_6x32gb_stream_icc_AVX_28core.txt ;done ;done
# grep the best results: cat skylake_5120_6x32gb_stream_icc_AVX_28core.txt | sort -k 4 | grep -i copy
Copy: 94531.2 0.136154 0.135405 0.137249
Scale: 94785.4 0.135705 0.135042 0.136687
Add: 97745.7 0.198002 0.196428 0.199466
Triad: 100247.4 0.192429 0.191526 0.192869
</syntaxhighlight>
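To pull out the single best rate for one kernel across the 100 runs, the log can also be filtered and then sorted numerically on the rate column (a sketch, using the log file names above):
<syntaxhighlight>
# Sketch: best Copy rate across all runs in a log (rate is column 2)
grep -i '^Copy' skylake_5120_12x16gb_stream_icc_AVX_28core.txt | sort -n -k2 | tail -1
</syntaxhighlight>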