Benchmarking: Stream (Memory Bandwidth)
The STREAM benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels.
<syntaxhighlight>
Table 1: Stream Benchmarks

Name     Kernel                  Bytes/Iteration   FLOPS/Iteration
COPY     a(i) = b(i)             16                0
SCALE    a(i) = q*b(i)           16                1
SUM      a(i) = b(i) + c(i)      24                1
TRIAD    a(i) = b(i) + q*c(i)    24                2
</syntaxhighlight>

The Copy benchmark measures the transfer rate in the absence of arithmetic. This should be one of the fastest memory operations, but it also represents a common one: per iteration it reads one value, b(i), and writes one value, a(i), with no computation in between.
The Scale benchmark adds a simple arithmetic operation to the Copy benchmark. This starts to simulate real application operations. Per iteration it reads b(i), multiplies it by a constant q, and writes the result to a(i). It's a simple scalar operation, but more complex operations are built from it, so the performance of this simple test can be used as an indicator of the performance of more complex operations.
The third benchmark, the Sum benchmark, adds a third operand and was originally written to test multiple load/store ports on the vector machines that were in vogue at the time. However, this benchmark is still very useful today because of the deep pipelines that some processors possess. Rather than moving two values through memory per iteration, this micro-benchmark moves three: it reads b(i) and c(i) and writes a(i). For larger arrays, this will quickly fill a processor pipeline, so you can test the memory bandwidth while the pipeline is filling, or the performance once the pipeline is full. Moreover, this benchmark starts to approximate what some applications perform in real computations.
The fourth benchmark in Stream, the Triad benchmark, exercises chained, overlapped, or fused multiply-add operations. It builds on the Sum benchmark by adding an arithmetic operation to one of the fetched array values. Given that fused multiply-add (FMA) operations are important in many basic computations, such as dot products, matrix multiplication, polynomial evaluation, Newton's method for evaluating functions, and many DSP operations, this benchmark can be directly associated with application performance. FMA now has dedicated instructions on most processors and is usually executed in hardware. Consequently, feeding such hardware operations with data can be extremely important – hence the usefulness of the Triad memory bandwidth benchmark.
There are two variables or definitions in the code that you should pay attention to. The first is STREAM_ARRAY_SIZE. This is the number of array elements used to run the benchmarks. In the current version, it is set to 10,000,000, which the code states should be good enough for caches up to 20MB. The Stream FAQ recommends you use a problem size such that each array is four times the sum of the caches (L1, L2, and L3). You can either change the code to reflect the array sizes you want, or you can set the variable when compiling the code.
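As a rough sketch of the FAQ rule above, the last-level cache size can be queried on Linux and turned into a minimum array size; note that SOCKETS is a placeholder you must set for your machine, and that L1/L2 are ignored here (L3 dominates), so round up generously:
<syntaxhighlight>
# Sketch: derive a minimum STREAM_ARRAY_SIZE from the last-level cache,
# following the "4x the sum of the caches" rule. Assumes Linux/glibc getconf;
# SOCKETS must be set for your system.
SOCKETS=2
L3=$(getconf LEVEL3_CACHE_SIZE)        # per-socket L3 size in bytes
ELEMENTS=$(( L3 * SOCKETS * 4 / 8 ))   # 8 bytes per double-precision element
echo "compile with -DSTREAM_ARRAY_SIZE=${ELEMENTS} (or larger)"
</syntaxhighlight>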
The second variable you might want to change is NTIMES, the number of times each benchmark is run. Stream reports the "best" result across all iterations after the first, so NTIMES must always be at least 2 (the default is 10). This variable can also be set at compile time without changing the code.
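For example, both definitions can be overridden on the compile line (the values below are purely illustrative):
<syntaxhighlight>
# Override both settings at compile time without editing stream.c
gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=120000000 -DNTIMES=20 \
    -mcmodel=medium stream.c -o stream_gcc
</syntaxhighlight>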
* '''Note''': Ensure power saving features are disabled; we need maximum clock speed to prevent fluctuations in performance:
<syntaxhighlight>
/etc/init.d/cpuspeed stop
</syntaxhighlight>
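On newer distributions without the cpuspeed service, the same effect can usually be achieved through the cpufreq governor (a sketch; the exact tooling varies by distribution):
<syntaxhighlight>
# Sketch for cpufreq-based systems: pin the governor to 'performance'
cpupower frequency-set -g performance
# or, directly via sysfs (as root):
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"
done
</syntaxhighlight>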
== Get the source ==
* Main STREAM website: http://www.cs.virginia.edu/stream/
* Pull the latest copy of STREAM from:
<syntaxhighlight>
# (v 5.10 at the time of edit)
wget http://www.cs.virginia.edu/stream/FTP/Code/stream.c
</syntaxhighlight>

== Compile ==
* Either Intel icc or GCC can be used to build/compile
* Ensure you build for multi-threaded runs: -fopenmp (gcc), -openmp (icc)
* For large array sizes, include -mcmodel=medium
* Best performance was observed with Intel icc
=== Intel ===
<syntaxhighlight>
icc -O3 -static -openmp stream.c -o stream_icc
</syntaxhighlight>

=== GCC ===
* GCC typically gave the worst performance in the limited tests we performed (better optimisation flags are probably required, but these are not well documented for STREAM)
<syntaxhighlight>
gcc -O3 -fopenmp stream.c -o stream_gcc
</syntaxhighlight>

=== Open 64 ===
* Below are optimisations for the AMD 6300 architecture
<syntaxhighlight>
opencc -march=bdver1 -mp -Ofast -LNO:simd=2 -WOPT:sib=on \
 -LNO:prefetch=2:pf2=0 -CG:use_prefetchnta=on -LNO:prefetch_ahead=4 -DSTREAM_ARRAY_SIZE=30000000 \
 -DNTIMES=30 -DOFFSET=1840 stream.c -o stream_occ
</syntaxhighlight>

== Run ==
* Vary the number of threads by setting, for example: export OMP_NUM_THREADS=32
=== Intel ===
<syntaxhighlight>
export OMP_NUM_THREADS=16
export KMP_AFFINITY=compact
./stream_icc
</syntaxhighlight>

=== GCC ===
<syntaxhighlight>
export GOMP_CPU_AFFINITY="0 1 2 ..."
./stream_gcc
</syntaxhighlight>
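The core list above is abbreviated; GOMP_CPU_AFFINITY also accepts ranges with a stride, so the full list does not have to be typed out. A sketch, assuming a 16-core box:
<syntaxhighlight>
# Sketch, assuming a 16-core box: pin 8 threads to every second core.
# GOMP_CPU_AFFINITY accepts explicit lists and strided ranges alike.
export GOMP_CPU_AFFINITY="0-15:2"   # equivalent to "0 2 4 6 8 10 12 14"
OMP_NUM_THREADS=8 ./stream_gcc
</syntaxhighlight>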
=== Open64 ===
* Below is the recommended best practice for the AMD 6300 architecture
* Peak memory bandwidth is achieved when STREAM is run on three cores of each NUMA node. For example, the following run shows that the same system can achieve around 5% higher STREAM bandwidth than when using all cores.
<syntaxhighlight>
# assuming 32 core system
export O64_OMP_AFFINITY="TRUE"
export O64_OMP_AFFINITY_MAP="2,4,6,10,12,14,18,20,22,26,28,30"
export OMP_NUM_THREADS=12
./stream
</syntaxhighlight>
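To work out which core IDs belong to which NUMA node when building such an affinity map, the topology can be inspected first (assumes the numactl package is installed):
<syntaxhighlight>
# Sketch: inspect the NUMA layout before choosing an affinity map
numactl --hardware     # lists nodes and the CPU IDs that belong to each
lscpu | grep -i numa   # quick summary of NUMA node / CPU mapping
</syntaxhighlight>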
== Results ==

=== Dual Socket E5-2660 v2 @ 2.20GHz ===
<syntaxhighlight>
# HT Disabled
export OMP_NUM_THREADS=20
icc -O3 -openmp -DSTREAM_ARRAY_SIZE=95000000 -mcmodel=medium stream.c -o stream_icc
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 79878.5 0.019151 0.019029 0.019254
Scale: 87765.1 0.018336 0.017319 0.025528
Add: 92615.5 0.024757 0.024618 0.025459
Triad: 92773.6 0.024712 0.024576 0.024968
-------------------------------------------------------------
</syntaxhighlight>

=== Dual Socket E5-2600 Server ===
<syntaxhighlight>
# HT Disabled
Function Best Rate MB/s Avg time Min time Max time
Copy: 63992.2 0.079985 0.075009 0.083602
Scale: 67067.1 0.072370 0.071570 0.073986
Add: 65718.0 0.110574 0.109559 0.111694
Triad: 66606.8 0.108982 0.108097 0.111754
</syntaxhighlight>

=== Dual Socket AMD 6380 Server ===
<syntaxhighlight>
# built using open64 as above
OMP_NUM_THREADS=16 O64_OMP_AFFINITY=true O64_OMP_AFFINITY_MAP=$(seq -s, 0 2 31) ./stream_occ
Function Best Rate MB/s Avg time Min time Max time
Copy: 63467.9 0.007673 0.007563 0.007783
Scale: 66527.9 0.007532 0.007215 0.007711
Add: 62947.3 0.011611 0.011438 0.011769
Triad: 62544.5 0.011718 0.011512 0.011887
</syntaxhighlight>

=== Numascale 4x AMD 6380 Dual-Socket Servers ===
<syntaxhighlight>
# build
opencc -Ofast -march=bdver2 -mp -mcmodel=medium -LNO:simd=2 -WOPT:sib=on -LNO:prefetch=2:pf2=0 \
-CG:use_prefetchnta=on -LNO:prefetch_ahead=4 -DSTREAM_ARRAY_SIZE=1500000000ULL stream.c -o stream_c.exe.open64.33g
# run (on all 128 cores)
OMP_NUM_THREADS=128 O64_OMP_AFFINITY=true O64_OMP_AFFINITY_MAP=$(seq -s, 0 1 127) ./stream_c.exe.open64.33g
Function Best Rate MB/s Avg time Min time Max time
Copy: 213159.1 0.118355 0.112592 0.143757
Scale: 211838.0 0.126541 0.113294 0.153623
Add: 199420.7 0.206908 0.180523 0.310094
Triad: 200751.6 0.189908 0.179326 0.209648
# run on every second core (one per memory controller)
OMP_NUM_THREADS=64 O64_OMP_AFFINITY=true O64_OMP_AFFINITY_MAP=$(seq -s, 0 2 127) ./stream_c.exe.open64.33g
Function Best Rate MB/s Avg time Min time Max time
Copy: 221879.3 0.110662 0.108167 0.113518
Scale: 234276.7 0.108583 0.102443 0.121964
Add: 213480.2 0.171661 0.168634 0.180814
Triad: 215921.8 0.170627 0.166727 0.177991
</syntaxhighlight>

=== Single Calxeda ECX-1000 @ 1.4GHz ===
<syntaxhighlight>
# build
gcc -O3 -fopenmp /root/stream.c -o /root/stream_gcc
# run (on all 4 cores)
Function Best Rate MB/s Avg time Min time Max time
Copy: 1696.5 0.094708 0.094311 0.094899
Scale: 1696.8 0.095611 0.094293 0.096759
Add: 1993.1 0.120893 0.120413 0.121506
Triad: 1865.7 0.130471 0.128638 0.132437
</syntaxhighlight>

=== Single Calxeda ECX-2000 ===
<syntaxhighlight>
# build
gcc -O3 -fopenmp /root/stream.c -o /root/stream_gcc
# run (on all 4 cores)
Function Best Rate MB/s Avg time Min time Max time
Copy: 3247.7 0.049727 0.049266 0.050480
Scale: 4490.5 0.035814 0.035631 0.036157
Add: 4051.3 0.059573 0.059240 0.059767
Triad: 3847.4 0.062560 0.062380 0.062768
</syntaxhighlight>

=== Haswell E5-2640v3 DDR4-1866MHz ===
<syntaxhighlight>
# HT Disabled / service cpuspeed stop
[root@localhost stream]# icc -O3 -openmp -DSTREAM_ARRAY_SIZE=95000000 -mcmodel=medium stream.c -o stream_icc_dp
[root@localhost stream]# OMP_NUM_THREADS=8 KMP_AFFINITY=scatter ./stream_icc_dp
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 95000000 (elements), Offset = 0 (elements)
Memory per array = 724.8 MiB (= 0.7 GiB).
Total memory required = 2174.4 MiB (= 2.1 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 8
Number of Threads counted = 8
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 16039 microseconds.
(= 16039 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 81209.4 0.018779 0.018717 0.018873
Scale: 79819.5 0.019124 0.019043 0.019198
Add: 83415.6 0.027439 0.027333 0.027526
Triad: 83294.3 0.027480 0.027373 0.027533
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
</syntaxhighlight>

=== Haswell E5-2650v3 DDR4-2133MHz ===
<syntaxhighlight>
# HT on
[root@e5-2650v3-nodee stream]# icc -O3 -openmp -DSTREAM_ARRAY_SIZE=4500000000 -mcmodel=medium stream.c -o stream_icc_dp
[root@e5-2650v3-nodee stream]# OMP_NUM_THREADS=20 KMP_AFFINITY=scatter ./stream_icc_dp
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 4500000000 (elements), Offset = 0 (elements)
Memory per array = 34332.3 MiB (= 33.5 GiB).
Total memory required = 102996.8 MiB (= 100.6 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 20
Number of Threads counted = 20
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 714150 microseconds.
(= 714150 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 107295.5 0.671414 0.671044 0.672320
Scale: 107487.7 0.670575 0.669844 0.671884
Add: 110500.1 0.978171 0.977375 0.981087
Triad: 111482.3 0.969603 0.968764 0.972643
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
</syntaxhighlight>

=== Quad Socket E5-4640v2 ===
* With 128GB RAM, HT enabled but running half the cores (80 cores total)
* Memory speed at 1866MHz
<syntaxhighlight>
[user@lz4 stream]$ icc -O3 -openmp -DSTREAM_ARRAY_SIZE=4500000000 -mcmodel=medium stream.c -o stream_icc
[user@lz4 stream]$ OMP_NUM_THREADS=40 KMP_AFFINITY=scatter ./stream_icc
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 4500000000 (elements), Offset = 0 (elements)
Memory per array = 34332.3 MiB (= 33.5 GiB).
Total memory required = 102996.8 MiB (= 100.6 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 40
Number of Threads counted = 40
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 460634 microseconds.
(= 460634 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 145430.8 0.495683 0.495081 0.498299
Scale: 141597.3 0.509096 0.508484 0.510163
Add: 161922.9 0.668773 0.666984 0.672804
Triad: 162028.6 0.668336 0.666549 0.672009
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
</syntaxhighlight>

== Observations ==
Placement/distribution of threads is very important for bandwidth performance when not all cores are used.
<syntaxhighlight>
# Use KMP_AFFINITY (Intel icc only) to 'compact' jobs onto adjacent cores or 'scatter' to spread them across the system
# System below is a Dual E5-2670 box with 64GB RAM, HT off, 16 cores
# Using all 16 cores:
Function Best Rate MB/s Avg time Min time Max time
Copy: 63992.2 0.079985 0.075009 0.083602
Scale: 67067.1 0.072370 0.071570 0.073986
Add: 65718.0 0.110574 0.109559 0.111694
Triad: 66606.8 0.108982 0.108097 0.111754
# Using 8 cores (on one socket) - limited to 50% of the BW
OMP_NUM_THREADS=8 KMP_AFFINITY=compact ./stream_icc
Function Best Rate MB/s Avg time Min time Max time
Copy: 31929.5 0.154730 0.150331 0.158771
Scale: 32842.5 0.148586 0.146152 0.150800
Add: 32240.0 0.224286 0.223325 0.225280
Triad: 32340.8 0.223632 0.222629 0.228462
# Using 8 cores (spread across two sockets) - You get the max BW available
OMP_NUM_THREADS=8 KMP_AFFINITY=scatter ./stream_icc
Function Best Rate MB/s Avg time Min time Max time
Copy: 58487.3 0.082912 0.082069 0.084016
Scale: 56235.1 0.085526 0.085356 0.085717
Add: 63344.1 0.115197 0.113665 0.116755
Triad: 64233.5 0.112643 0.112091 0.114209
</syntaxhighlight>
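The same compact-versus-scatter experiment can be reproduced without KMP_AFFINITY, e.g. for gcc-built binaries, by binding through numactl (a sketch; assumes numactl is installed and a two-socket layout with NUMA nodes 0 and 1):
<syntaxhighlight>
# Sketch: reproduce compact vs scatter placement with numactl

# 'compact' equivalent: all 8 threads and their memory on socket 0
OMP_NUM_THREADS=8 numactl --cpunodebind=0 --membind=0 ./stream_gcc

# 'scatter'-like equivalent: threads on both sockets, memory interleaved
OMP_NUM_THREADS=8 numactl --interleave=all ./stream_gcc
</syntaxhighlight>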
== Dual Skylake 6132 ==
* Only logical cores used for the test (2 x 14 = 28 cores)
<syntaxhighlight>
# 12 sticks of 16GB 2666MHz
# /opt/intel/bin/icc -O3 -qopenmp -parallel -AVX -DSTREAM_ARRAY_SIZE=800000000 -mcmodel=medium stream.c -o stream_icc_AVX
# for j in 28; do for i in {1..100}; do OMP_NUM_THREADS=$j ./stream_icc_AVX | grep "Copy\|Scale\|Add\|Triad\|counted" | tee -a skylake_6132_12x16gb_stream_icc_AVX_28core.txt ;done ;done
# grep the best results: cat skylake_6132_12x16gb_stream_icc_AVX_28core.txt | sort -k 4 | grep -i copy
Copy: 140025.6 0.092183 0.091412 0.092748
Scale: 141014.3 0.091679 0.090771 0.092523
Add: 137117.5 0.141334 0.140026 0.142912
Triad: 141462.5 0.136485 0.135725 0.137369
# 6 sticks of 32GB 2666MHz
# /opt/intel/bin/icc -O3 -qopenmp -parallel -AVX -DSTREAM_ARRAY_SIZE=800000000 -mcmodel=medium stream.c -o stream_icc_AVX
# for j in 28; do for i in {1..100}; do OMP_NUM_THREADS=$j ./stream_icc_AVX | grep "Copy\|Scale\|Add\|Triad\|counted" | tee -a skylake_6132_6x32gb_stream_icc_AVX_28core.txt ;done ;done
# grep the best results: cat skylake_6132_6x32gb_stream_icc_AVX_28core.txt | sort -k 4 | grep -i copy
Copy: 94531.2 0.136154 0.135405 0.137249
Scale: 94785.4 0.135705 0.135042 0.136687
Add: 97745.7 0.198002 0.196428 0.199466
Triad: 100247.4 0.192429 0.191526 0.192869
</syntaxhighlight>

== Dual Skylake 5120 ==
* Only logical cores used for the test (2 x 14 = 28 cores)
<syntaxhighlight>
# 12 sticks of 16GB 2666MHz (running at 2400MHz)
# /opt/intel/bin/icc -O3 -qopenmp -parallel -AVX -DSTREAM_ARRAY_SIZE=800000000 -mcmodel=medium stream.c -o stream_icc_AVX
# for j in 28; do for i in {1..100}; do OMP_NUM_THREADS=$j ./stream_icc_AVX | grep "Copy\|Scale\|Add\|Triad\|counted" | tee -a skylake_5120_12x16gb_stream_icc_AVX_28core.txt ;done ;done
# grep the best results: cat skylake_5120_12x16gb_stream_icc_AVX_28core.txt | sort -k 4 | grep -i copy
Copy: 140025.6 0.092183 0.091412 0.092748
Scale: 141014.3 0.091679 0.090771 0.092523
Add: 137117.5 0.141334 0.140026 0.142912
Triad: 141462.5 0.136485 0.135725 0.137369
# 6 sticks of 32GB 2666MHz (running at 2400MHz)
# /opt/intel/bin/icc -O3 -qopenmp -parallel -AVX -DSTREAM_ARRAY_SIZE=800000000 -mcmodel=medium stream.c -o stream_icc_AVX
# for j in 28; do for i in {1..100}; do OMP_NUM_THREADS=$j ./stream_icc_AVX | grep "Copy\|Scale\|Add\|Triad\|counted" | tee -a skylake_5120_6x32gb_stream_icc_AVX_28core.txt ;done ;done
# grep the best results: cat skylake_5120_6x32gb_stream_icc_AVX_28core.txt | sort -k 4 | grep -i copy
Copy: 94531.2 0.136154 0.135405 0.137249
Scale: 94785.4 0.135705 0.135042 0.136687
Add: 97745.7 0.198002 0.196428 0.199466
Triad: 100247.4 0.192429 0.191526 0.192869
</syntaxhighlight>
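To pull out the single best rate for one kernel across the 100 runs, the log can also be filtered and then sorted numerically on the rate column (a sketch, using the log file names above):
<syntaxhighlight>
# Sketch: best Copy rate across all runs in a log (rate is column 2)
grep -i '^Copy' skylake_5120_12x16gb_stream_icc_AVX_28core.txt | sort -n -k2 | tail -1
</syntaxhighlight>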