Difference between revisions of "Benchmarks:AI"

(Created page with "== Training == ==== TensorFlow ==== TensorFlow provides scripts to run training benchmarks with different models. The scripts are hosted on GitHub [https://github.com/tensorfl...")
 
 
(11 intermediate revisions by one other user not shown)
Line 1: Line 1:
== Training ==
+
= Glossary =
==== TensorFlow ====
+
=== Common Terms ===
TensorFlow provides scripts to run training benchmarks with different models. The scripts are hosted on GitHub [https://github.com/tensorflow/benchmarks here].
+
* '''Epochs'''
 +
The number of times that the learning algorithm works through the entire training dataset. Runtime scales linearly with this value, so doubling the number of epochs doubles the training time. For benchmark purposes the final accuracy of the trained model isn't important, so only a small number of epochs is needed. For 16 V100 GPUs, 10-20 epochs provide sufficient work to determine realistic performance expectations.
 +
* '''Batch Size'''
 +
The number of samples used per iteration of the algorithm. This value can be increased in line with the amount of available GPU memory, and it also depends on the model being used. For example, ResNet50 is a relatively small network, and on a 32GB V100 the batch size can be increased to 512, whereas the same GPU can only use a batch size of 256 when running ResNet152 because that network is about 3 times larger.
  
It is recommended to run the scripts using nvidia-docker2 and the TensorFlow docker image obtained from [https://ngc.nvidia.com NGC].
+
= Benchmarks =
 
+
*[[TensorFlow Benchmarking | TensorFlow Benchmarking using ResNet (on GPU/v100)]]
To simplify the setup, I have created a Dockerfile that pulls the image and downloads the scripts. To use it, first create a directory to hold your Dockerfiles.
 
 
 
mkdir ~/Dockerfiles
 
 
 
Then create a file named tf_bench in this directory and add the following:
 
 
 
FROM nvcr.io/nvidia/tensorflow:18.10-py3
 
RUN apt-get update && apt-get install -y git && git clone -b cnn_tf_v1.10_compatible https://github.com/tensorflow/benchmarks
 
ENTRYPOINT bash
 
 
 
To build the image run
 
 
 
docker build -f ~/Dockerfiles/tf_bench ~/Dockerfiles
 

Latest revision as of 14:41, 31 January 2019

Glossary

Common Terms

  • Epochs

The number of times that the learning algorithm works through the entire training dataset. Runtime scales linearly with this value, so doubling the number of epochs doubles the training time. For benchmark purposes the final accuracy of the trained model isn't important, so only a small number of epochs is needed. For 16 V100 GPUs, 10-20 epochs provide sufficient work to determine realistic performance expectations.

  • Batch Size

The number of samples used per iteration of the algorithm. This value can be increased in line with the amount of available GPU memory, and it also depends on the model being used. For example, ResNet50 is a relatively small network, and on a 32GB V100 the batch size can be increased to 512, whereas the same GPU can only use a batch size of 256 when running ResNet152 because that network is about 3 times larger.
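
The relationship between the two terms above can be sketched with a little arithmetic. This is only an illustration under stated assumptions: the dataset size is that of the ImageNet-1k training set (the page does not say which dataset is used), the batch sizes are treated as per-GPU values in a data-parallel run, the throughput figure is a made-up placeholder, and the function names are hypothetical rather than part of the TensorFlow benchmark scripts.

```python
import math


def steps_per_epoch(dataset_size: int, per_gpu_batch: int, num_gpus: int) -> int:
    """Iterations needed to pass over the whole dataset once.

    Assumes data parallelism, so each iteration consumes
    per_gpu_batch * num_gpus samples; a final partial batch still
    counts as one iteration.
    """
    global_batch = per_gpu_batch * num_gpus
    return math.ceil(dataset_size / global_batch)


def estimated_runtime_s(epochs: int, dataset_size: int, images_per_sec: float) -> float:
    """Rough training time in seconds: total samples processed / throughput."""
    return epochs * dataset_size / images_per_sec


if __name__ == "__main__":
    dataset_size = 1_281_167   # assumption: ImageNet-1k training images
    num_gpus = 16              # matches the 16 x V100 example in the text
    throughput = 10_000.0      # hypothetical images/sec, not a measured value

    # Larger per-GPU batches mean fewer iterations per epoch.
    for per_gpu_batch in (512, 256):
        steps = steps_per_epoch(dataset_size, per_gpu_batch, num_gpus)
        print(f"batch {per_gpu_batch}/GPU: {steps} steps per epoch")

    # Runtime grows linearly with the number of epochs.
    for epochs in (10, 20):
        seconds = estimated_runtime_s(epochs, dataset_size, throughput)
        print(f"{epochs} epochs: about {seconds:.0f} s")
```

Substituting a throughput measured on the target system gives a quick estimate of how long a 10-20 epoch benchmark run is likely to take.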

Benchmarks