|
|
| (7 intermediate revisions by one other user not shown) |
| Line 1: |
Line 1: |
| − | == Training == | + | = Glossary = |
| − | === TensorFlow === | + | === Common Terms === |
| − | ==== Building the docker image ====
| + | * '''Epochs''' |
| − | TensorFlow provides scripts to run training benchmarks with different models. The scripts are hosted on GitHub [https://github.com/tensorflow/benchmarks here].
| + | The number of times that the learning algorithm works through the entire training dataset. Increasing this has a linear effect on runtime, so doubling the number of epochs doubles the training time. For benchmark purposes the final accuracy of the trained model isn't important, so only a small number of epochs is needed. For 16 V100 GPUs, 10-20 epochs provide sufficient work to determine realistic performance expectations. |
| | + | * '''Batch Size''' |
| | + | The number of samples used per iteration of the algorithm. This value can be increased in line with the amount of available GPU memory, and it also depends on the model being used. For example, ResNet50 is a small network and on a 32GB V100 the batch size can be increased to 512, whereas the same GPU could only use a batch size of 256 when running ResNet152 because the network is 3 times larger. |
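The relationship between epochs, batch size and the amount of work can be sketched with a little arithmetic. The following is a minimal illustration, not part of the benchmark scripts: the function name and the dataset size of 1,280,000 samples are hypothetical, and it assumes that the batch size is applied per GPU and that a final partial batch still counts as one iteration.

```python
import math


def training_iterations(num_samples, batch_size, num_gpus, num_epochs):
    """Estimate the total number of iterations for a run.

    Assumptions (not taken from the benchmark scripts): batch_size is
    per GPU, and a trailing partial batch counts as a full iteration.
    """
    global_batch = batch_size * num_gpus
    steps_per_epoch = math.ceil(num_samples / global_batch)
    return steps_per_epoch * num_epochs


if __name__ == "__main__":
    # Hypothetical dataset size chosen to keep the arithmetic simple.
    samples = 1_280_000
    for epochs in (10, 20):
        total = training_iterations(samples, batch_size=256, num_gpus=8,
                                    num_epochs=epochs)
        print(f"{epochs} epochs -> {total} iterations")
```

Under these assumptions, doubling the number of epochs doubles the iteration count, which matches the linear effect on runtime described above. A larger batch size lowers the number of iterations per epoch but not the number of samples processed.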
| | | | |
| − | It is recommended to run the scripts using nvidia-docker2 and the TensorFlow docker image obtained from [https://ngc.nvidia.com NGC].
| + | = Benchmarks = |
| − | | + | *[[TensorFlow Benchmarking | TensorFlow Benchmarking using ResNet (on GPU/v100)]] |
| − | To simplify the setup I have created a Dockerfile that pulls the image and downloads the scripts. To use it, first create a directory to hold your Dockerfiles.
| |
| − | | |
| − | <syntaxhighlight>
| |
| − | mkdir ~/Dockerfiles
| |
| − | </syntaxhighlight>
| |
| − | | |
| − | Then create a file named tf_bench in this directory and add the following
| |
| − | | |
| − | <syntaxhighlight>
| |
| − | FROM nvcr.io/nvidia/tensorflow:18.10-py3
| |
| − | RUN apt-get update && apt-get install -y git && git clone -b cnn_tf_v1.10_compatible https://github.com/tensorflow/benchmarks
| |
| − | ENTRYPOINT bash
| |
| − | </syntaxhighlight>
| |
| − | | |
| − | To build the image run
| |
| − | | |
| − | <syntaxhighlight>
| |
| − | docker build -f ~/Dockerfiles/tf_bench -t tf_bench .
| |
| − | </syntaxhighlight>
| |
| − | | |
| − | The best way to run the container is in interactive mode as this allows multiple runs to be performed in quick succession. To start the container run
| |
| − | | |
| − | <syntaxhighlight>
| |
| − | docker run --runtime=nvidia -it tf_bench
| |
| − | </syntaxhighlight>
| |
| − | | |
| − | The benchmark scripts are located in /workspace/scripts/tf_cnn_benchmarks.
| |
| − | | |
| − | ==== Benchmarking using synthetic data ====
| |
| − | | |
| − | To run the benchmark using synthetic data execute
| |
| − | | |
| − | <syntaxhighlight>
| |
| − | python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 --variable_update=parameter_server
| |
| − | </syntaxhighlight>
| |
| − | | |
| − | The benchmark can be run with different models. Currently supported models are resnet50, resnet152, inception3 and vgg16.
| |
| − | | |
| − | The trained model can be saved by providing a checkpoint directory with the --train_dir flag. For example, to train
| |
| − | ResNet-152 for 10 epochs on 8 GPUs and save the trained model, use
| |
| − | | |
| − | <syntaxhighlight>
| |
| − | python tf_cnn_benchmarks.py --num_gpus=8 --batch_size=256 --model=resnet152 --variable_update=parameter_server --train_dir=/workspace/ckpt_dir --num_epochs=10
| |
| − | </syntaxhighlight>
| |
| − | | |
| − | The saved model can then be used to perform other benchmarks for inferencing.
| |
| − | | |
| − | ==== Benchmarking using real data ====
| |
| − | | |
| − | If you have a sample dataset that you want to run the benchmarks with, the data first needs to be made available to the TensorFlow docker container. The best way to do
| |
| − | this is to use a bind mount to map the directory containing the data to a directory inside the docker container. The mount point in the container needs to exist, so the simplest
| |
| − | way to create one is to modify the RUN line in the Dockerfile to add a mkdir command.
| |
| − | | |
| − | <syntaxhighlight>
| |
| − | FROM nvcr.io/nvidia/tensorflow:18.10-py3
| |
| − | RUN apt-get update && apt-get install -y git && git clone -b cnn_tf_v1.10_compatible https://github.com/tensorflow/benchmarks && mkdir /workspace/data
| |
| − | ENTRYPOINT bash
| |
| − | </syntaxhighlight>
| |
| − | | |
| − | Then rebuild the image using the docker build command and start a new container adding the flag to bind mount your data directory to the data directory inside the container.
| |
| − | | |
| − | <syntaxhighlight>
| |
| − | docker run --runtime=nvidia -v <path_to_my_data>:/workspace/data:shared -it tf_bench
| |
| − | </syntaxhighlight>
| |
| − | | |
| − | Then, to run the benchmarks, use the same commands as for synthetic data but provide two additional flags to indicate where the data is and what format it is in. The data format can
| |
| − | vary between datasets; for example, for the commonly used IMAGENET2012 dataset the data format would be NCHW.
| |
| − | | |
| − | A sample command to run the benchmark using real data and saving the trained model could be
| |
| − | | |
| − | <syntaxhighlight>
| |
| − | python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 \
| |
| − | --model=resnet50 --optimizer=momentum --variable_update=replicated \
| |
| − | --nodistortions --gradient_repacking=8 --num_gpus=8 \
| |
| − | --num_epochs=10 --weight_decay=1e-4 --data_dir=/workspace/data --use_fp16 \
| |
| − | --train_dir=${CKPT_DIR}
| |
| − | </syntaxhighlight>
| |
| − | | |
| − | where CKPT_DIR is the directory to save the model in.
| |
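When several runs with different models, GPU counts and batch sizes are needed, it can help to generate the command lines rather than type them by hand. The following is a minimal sketch of that idea; the helper name `build_benchmark_command` is hypothetical and is not part of the TensorFlow benchmark scripts. It only uses flags that appear in the commands above, and it assumes the NCHW data format whenever a data directory is supplied.

```python
import shlex


def build_benchmark_command(model, num_gpus, batch_size, num_epochs=None,
                            train_dir=None, data_dir=None,
                            variable_update="parameter_server"):
    """Return the argument list for one tf_cnn_benchmarks.py run.

    Hypothetical helper: it only assembles flags already shown in the
    commands on this page and does not validate them.
    """
    argv = [
        "python", "tf_cnn_benchmarks.py",
        f"--num_gpus={num_gpus}",
        f"--batch_size={batch_size}",
        f"--model={model}",
        f"--variable_update={variable_update}",
    ]
    if num_epochs is not None:
        argv.append(f"--num_epochs={num_epochs}")
    if data_dir is not None:
        # Assumption: real data is read in NCHW format, as in the example above.
        argv.extend([f"--data_dir={data_dir}", "--data_format=NCHW"])
    if train_dir is not None:
        argv.append(f"--train_dir={train_dir}")
    return argv


if __name__ == "__main__":
    # Print one command per model so it can be pasted into the container shell.
    for model in ("resnet50", "resnet152"):
        cmd = build_benchmark_command(model, num_gpus=8, batch_size=256,
                                      num_epochs=10,
                                      train_dir="/workspace/ckpt_dir")
        print(shlex.join(cmd))
```

The list form can also be passed to `subprocess.run` inside the container if the runs are to be launched from Python rather than pasted into the shell.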