TensorFlow Benchmarking

Benchmark Training Performance

Building the docker image

TensorFlow provides scripts to run training benchmarks with different models. The scripts are hosted on GitHub at https://github.com/tensorflow/benchmarks.

It is recommended to run the scripts using nvidia-docker2 and the TensorFlow docker image obtained from NGC.
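
The image used in the Dockerfile below can also be pulled ahead of time from NGC (this assumes you have access to nvcr.io and, if your setup requires it, have run docker login nvcr.io):

docker pull nvcr.io/nvidia/tensorflow:18.10-py3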

To simplify the setup I have created a Dockerfile to pull the image and download the scripts. To use it, first create a directory to hold your Dockerfiles.

mkdir ~/Dockerfiles

Then create a file named tf_bench in this directory and add the following

# NGC TensorFlow image (run with nvidia-docker2)
FROM nvcr.io/nvidia/tensorflow:18.10-py3
# Install git and fetch the TF 1.10-compatible benchmark scripts
RUN apt-get update && apt-get install -y git && git clone -b cnn_tf_v1.10_compatible https://github.com/tensorflow/benchmarks
ENTRYPOINT bash

To build the image run

docker build -f ~/Dockerfiles/tf_bench -t tf_bench .

The best way to run the container is in interactive mode as this allows multiple runs to be performed in quick succession. To start the container run

docker run --runtime=nvidia -it tf_bench

The benchmark scripts are located in /workspace/benchmarks/scripts/tf_cnn_benchmarks.
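
The remaining commands on this page assume you have changed into that directory first (adjust the path if the repository was cloned elsewhere):

cd /workspace/benchmarks/scripts/tf_cnn_benchmarks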

Benchmarking using synthetic data

To run the benchmark using synthetic data execute

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 --variable_update=parameter_server

The benchmark can be run with different models. Currently supported models include resnet50, resnet152, inception3 and vgg16.
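
For example, to benchmark Inception v3 instead of ResNet-50, only the --model flag (and, if desired, the batch size) needs to change:

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=inception3 --variable_update=parameter_server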

The trained model can be saved by providing a checkpoint directory using the --train_dir flag. For example, to train a ResNet-152 model for 10 epochs on 8 GPUs and save the trained model, use

python tf_cnn_benchmarks.py --num_gpus=8 --batch_size=256 --model=resnet152 --variable_update=parameter_server --train_dir=/workspace/ckpt_dir --num_epochs=10

The saved model can then be used to perform other benchmarks, such as inference.
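
As a rough sketch only (this assumes the script's --eval flag, which should read the checkpoints written to the directory given by --train_dir), the saved ResNet-152 model from above could be evaluated with something like:

python tf_cnn_benchmarks.py --eval=True --num_gpus=1 --batch_size=32 \
    --model=resnet152 --train_dir=/workspace/ckpt_dir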

Benchmarking using real data

If you have a sample dataset that you want to perform the benchmarks with, the data will first need to be made available to the TensorFlow docker container. The best way to do this is to use a bind mount to mount the directory containing the data to some directory inside the docker container. This mount point needs to exist in the container, so the easiest way to create one is to modify the RUN line in the Dockerfile to add a mkdir command.

FROM nvcr.io/nvidia/tensorflow:18.10-py3
# Same as before, but also create the mount point for the dataset
RUN apt-get update && apt-get install -y git && git clone -b cnn_tf_v1.10_compatible https://github.com/tensorflow/benchmarks && mkdir /workspace/data
ENTRYPOINT bash

Then rebuild the image using the docker build command and start a new container, adding the -v flag to bind mount your data directory to the data directory inside the container.
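
The rebuild uses the same build command as before:

docker build -f ~/Dockerfiles/tf_bench -t tf_bench .

The container is then started with the bind mount: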

docker run --runtime=nvidia -v <path_to_my_data>:/workspace/data:shared -it tf_bench

Then to run the benchmarks, use the same commands as for synthetic data but provide two additional flags to indicate where the data is and what format it is in. The data format can vary between datasets; for example, for the commonly used IMAGENET2012 dataset the data format would be NCHW.

A sample command to run the benchmark using real data and save the trained model could be

python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 \
--model=resnet50 --optimizer=momentum --variable_update=replicated \
--nodistortions --gradient_repacking=8 --num_gpus=8 \
--num_epochs=10 --weight_decay=1e-4 --data_dir=/workspace/data --use_fp16 \
--train_dir=${CKPT_DIR}

where CKPT_DIR is the directory to save the model in.
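
For example, assuming the checkpoints should live inside the container's workspace (the exact path is just an illustration):

export CKPT_DIR=/workspace/ckpt_dir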

Inference/Classification Benchmarking

Run the Inference Benchmark

This benchmark measures how quickly the network can classify images using a trained model. The parameters are largely the same as for training; use the --forward_only=True flag to run in inference mode.

python tf_cnn_benchmarks.py --forward_only=True --batch_size=256 \
            --model=resnet50 --num_epochs=10 --distortions=True --display_every 10 \
            --num_gpus=16 --data_dir=./test_data/fake_tf_record_data/ --data_name=imagenet


Script to iterate through batch sizes and numbers of GPUs

#!/bin/bash

# Sweep the inference benchmark over models, batch sizes and GPU counts,
# saving the output of each run to a log file under logs/
mkdir -p logs

for MODEL in resnet50 resnet152
do
        for BATCH_SIZE in 32 64 128 256
        do
                for NUM_GPUS in 1 2 4 8 16
                do
                        echo "Now running $MODEL with Batch Size: $BATCH_SIZE on: $NUM_GPUS GPUs"
                        # Print the exact command, then run it and capture its output for this model/batch/GPU combination
                        echo "python tf_cnn_benchmarks.py --forward_only=True --batch_size=${BATCH_SIZE} --model=${MODEL} --num_epochs=5 --distortions=True --display_every 10 --num_gpus=${NUM_GPUS} --data_dir=./test_data/fake_tf_record_data/ --data_name=imagenet | tee logs/tf_cnn_${MODEL}-batch${BATCH_SIZE}_gpu${NUM_GPUS}.log"
                        python tf_cnn_benchmarks.py --forward_only=True --batch_size=${BATCH_SIZE} --model=${MODEL} --num_epochs=5 --distortions=True --display_every 10 --num_gpus=${NUM_GPUS} --data_dir=./test_data/fake_tf_record_data/ --data_name=imagenet | tee logs/tf_cnn_${MODEL}-batch${BATCH_SIZE}_gpu${NUM_GPUS}.log
                done
        done
done
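
To use it, save the script in the tf_cnn_benchmarks directory (the filename run_inference_sweep.sh below is just an example), make it executable and run it:

chmod +x run_inference_sweep.sh
./run_inference_sweep.sh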