Difference between revisions of "Benchmarks:AI"

From Define Wiki

Revision as of 12:38, 31 January 2019

Training

TensorFlow

TensorFlow provides scripts to run training benchmarks with different models. The scripts are hosted on GitHub at https://github.com/tensorflow/benchmarks.

It is recommended to run the scripts using nvidia-docker2 and the TensorFlow docker image obtained from NGC.

To simplify the setup, I have created a Dockerfile that pulls the image and downloads the scripts. To use it, first create a directory to hold your Dockerfiles.

mkdir ~/Dockerfiles

Then create a file named tf_bench in this directory (the name used by the build command) and add the following

FROM nvcr.io/nvidia/tensorflow:18.10-py3
RUN apt-get update && apt-get install -y git && git clone -b cnn_tf_v1.10_compatible https://github.com/tensorflow/benchmarks
ENTRYPOINT bash

To build the image run

docker build -f ~/Dockerfiles/tf_bench -t tf_bench .

The container is best run in interactive mode, as this allows multiple runs to be performed in quick succession. To start the container run

docker run --runtime=nvidia -it tf_bench

The repository is cloned into the image's working directory, so the benchmark scripts are located in /workspace/benchmarks/scripts/tf_cnn_benchmarks.

To run the benchmark using synthetic data execute

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 --variable_update=parameter_server
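When comparing several runs it can help to pull the headline throughput out of the log. The following is a minimal sketch that assumes the script ends its output with a summary line of the form "total images/sec: <number>"; the function name extract_total_images_per_sec is hypothetical, and the pattern should be adjusted if your version of the script logs differently.

```shell
#!/usr/bin/env bash
# Sketch: read benchmark output on stdin and print only the throughput value.
# Assumption: the summary line starts with "total images/sec: ".
extract_total_images_per_sec() {
  awk -F': ' '/^total images\/sec/ { print $2 }'
}

# Usage (inside the container, from the tf_cnn_benchmarks directory):
#   python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 \
#     --variable_update=parameter_server 2>&1 | tee run.log | extract_total_images_per_sec
```

Piping through tee keeps the full log in run.log, so nothing is lost if the summary line turns out to have a different format.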

The benchmark can be run with different models by changing the --model flag. Currently supported models include resnet50, resnet152, inception3 and vgg16.
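To benchmark each of the listed models in turn, the single-GPU command can be wrapped in a loop. This is only a sketch: RUN_PREFIX and run_all_models are hypothetical names introduced here, and the script defaults to a dry run that prints the commands rather than executing them.

```shell
#!/usr/bin/env bash
# Sketch: run the synthetic-data benchmark once per model.
# RUN_PREFIX is a hypothetical knob: it defaults to "echo" (dry run, commands
# are only printed). Set RUN_PREFIX= (empty) to execute the benchmarks.
RUN_PREFIX="${RUN_PREFIX-echo}"

run_all_models() {
  for model in resnet50 resnet152 inception3 vgg16; do
    # Same flags as the single-model command; only --model changes.
    $RUN_PREFIX python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 \
      --model="$model" --variable_update=parameter_server
  done
}

run_all_models
```

Run it from the directory containing tf_cnn_benchmarks.py. A batch size that fits one model may not fit another in GPU memory, so --batch_size may need adjusting per model.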

The trained model can be saved by providing a checkpoint directory with the --train_dir flag. For example, to train ResNet-152 for 10 epochs on 8 GPUs and save the trained model, use

python tf_cnn_benchmarks.py --num_gpus=8 --batch_size=256 --model=resnet152 --variable_update=parameter_server --train_dir=/workspace/ckpt_dir --num_epochs=10

The saved model can then be used to run other benchmarks for inference.
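Before reusing the saved model it is worth confirming that training actually wrote a checkpoint. The sketch below relies on TensorFlow's saver writing an index file named checkpoint into the --train_dir directory; has_checkpoint and CKPT_DIR are hypothetical names used only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: check that a checkpoint exists in the directory passed to --train_dir.
# TensorFlow's saver keeps an index file called "checkpoint" alongside the
# model.ckpt-* files.
has_checkpoint() {
  [ -f "$1/checkpoint" ]
}

# CKPT_DIR defaults to the directory used in the training example.
CKPT_DIR="${CKPT_DIR:-/workspace/ckpt_dir}"
if has_checkpoint "$CKPT_DIR"; then
  echo "checkpoint found in $CKPT_DIR"
else
  echo "no checkpoint in $CKPT_DIR"
fi
```

Note that /workspace/ckpt_dir lives inside the container, so the checkpoint is lost when the container is removed unless the directory is copied out or mounted from the host.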