Difference between revisions of "Benchmarks:AI"

Revision as of 12:38, 31 January 2019

Training

TensorFlow

TensorFlow provides scripts to run training benchmarks with different models. The scripts are hosted on GitHub at https://github.com/tensorflow/benchmarks.

It is recommended to run the scripts using nvidia-docker2 and the TensorFlow Docker image obtained from NGC.

To simplify the setup, I have created a Dockerfile that pulls the image and downloads the scripts. To use it, first create a directory to hold your Dockerfiles.

mkdir ~/Dockerfiles

Then create a file named tf_bench in this directory and add the following:

FROM nvcr.io/nvidia/tensorflow:18.10-py3
RUN apt-get update && apt-get install -y git && git clone -b cnn_tf_v1.10_compatible https://github.com/tensorflow/benchmarks
ENTRYPOINT bash

To build the image run

docker build -f ~/Dockerfiles/tf_bench -t tf_bench .

The best way to run the container is in interactive mode, as this allows multiple runs to be performed in quick succession. To start the container, run:

docker run --runtime=nvidia -it tf_bench

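The benchmark scripts are located in /workspace/benchmarks/scripts/tf_cnn_benchmarks, because git clone places the repository in a benchmarks directory under the image's working directory (/workspace). Change to this directory before running the commands below.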
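The exact location depends on the working directory in effect when the Dockerfile's git clone step ran, so it can be worth confirming it inside the container. The following is a minimal sketch, not part of the benchmark suite: locate_bench is a hypothetical helper name, and the default search root /workspace is an assumption about the NGC image.

```shell
#!/bin/sh
# Print the directory that contains tf_cnn_benchmarks.py under a search root.
# The default root (/workspace) is an assumption about the image layout.
locate_bench() {
    root="${1:-/workspace}"
    # Take the first match only; errors from unreadable directories are hidden.
    script="$(find "$root" -type f -name tf_cnn_benchmarks.py 2>/dev/null | head -n 1)"
    # Return non-zero if nothing was found.
    [ -n "$script" ] || return 1
    dirname "$script"
}
```

Inside the container, `cd "$(locate_bench)"` would then switch to the script directory, provided the search finds a match.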

To run the benchmark using synthetic data execute

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 --variable_update=parameter_server

The benchmark can be run with different models. Currently supported models include resnet50, resnet152, inception3 and vgg16.
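To compare the models listed above in one session, the synthetic-data command can be wrapped in a loop. This is a sketch under stated assumptions: bench_cmd, run_sweep, the DRY_RUN switch and the bench_<model>.log file names are all hypothetical, and the flags are simply copied from the single-GPU example. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Models to sweep, taken from the list above.
MODELS="resnet50 resnet152 inception3 vgg16"

# Build the benchmark command line for one model ($1 = model name).
bench_cmd() {
    printf 'python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=%s --variable_update=parameter_server\n' "$1"
}

# Print each command (DRY_RUN=1, the default) or execute it and keep a log.
run_sweep() {
    for model in $MODELS; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            bench_cmd "$model"
        else
            # Word splitting of the command string is intentional here.
            $(bench_cmd "$model") 2>&1 | tee "bench_${model}.log"
        fi
    done
}

run_sweep
```

Setting DRY_RUN=0 from the script directory inside the container would execute the runs one after another and write one log file per model.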

The trained model can be saved by providing a checkpoint directory with the --train_dir flag. For example, to train ResNet-152 for 10 epochs on 8 GPUs and save the trained model, use:

python tf_cnn_benchmarks.py --num_gpus=8 --batch_size=256 --model=resnet152 --variable_update=parameter_server --train_dir=/workspace/ckpt_dir --num_epochs=10

The saved model can then be used to run inference benchmarks.
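Files written inside a container are lost when the container is removed, so one option is to bind-mount a host directory onto the checkpoint path before training. The sketch below makes several assumptions: the host directory $HOME/tf_bench_ckpt and the helper names docker_run_cmd and has_checkpoint are hypothetical, and the check relies on TensorFlow writing an index file named checkpoint into the training directory.

```shell
#!/bin/sh
# Host directory that will receive the checkpoints (hypothetical location).
HOST_CKPT_DIR="${HOST_CKPT_DIR:-$HOME/tf_bench_ckpt}"

# Print a docker run command that mounts the host directory onto the
# --train_dir path used above, so checkpoints outlive the container.
docker_run_cmd() {
    printf 'docker run --runtime=nvidia -it -v %s:/workspace/ckpt_dir tf_bench\n' "$HOST_CKPT_DIR"
}

# Succeed if the given directory holds TensorFlow's "checkpoint" index file.
has_checkpoint() {
    [ -f "$1/checkpoint" ]
}

docker_run_cmd
```

After a training run, `has_checkpoint "$HOST_CKPT_DIR" && echo saved` on the host gives a quick indication that a checkpoint was written.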
