move DeepSpeech example to a separate repo
Showing 11 changed files with 6 additions and 540 deletions.
@@ -1,204 +1,8 @@
# DeepSpeech test application

The distributed DeepSpeech training application that had been used as a test application and requirement provider for TensorHive is now in a [separate repository](https://github.com/roscisz/dnn_training_benchmarks/tree/master/TensorFlowV1_DeepSpeech_ldc93s1).

# DeepSpeech benchmarks

In this example we provide experimental results, and the steps to reproduce them, for benchmarking the training performance of [Baidu's Deep Speech](https://arxiv.org/abs/1412.5567) recurrent neural network for automatic speech recognition.

The example is based on [Project DeepSpeech by Mozilla](https://github.com/mozilla/DeepSpeech), an open-source TensorFlow implementation that supports distributed training using Distributed TensorFlow.
## Table of contents
- [x] [Installation instructions](#installation)
- [ ] [Instructions for running the benchmarks](#running-the-benchmarks)
  - [x] [Manually](#manually)
  - [x] [Using run-cluster.sh](#run-clustersh)
  - [x] [Using Docker](#docker)
  - [x] [Using Kubernetes](#kubernetes)
  - [ ] [Using TensorHive](#tensorhive)
- [ ] [Experimental results](#experimental-results):
  - [x] [Batch size influence on training performance on various GPUs](#batch-size)
  - [x] [Scalability on multiple GPUs](#multigpu-scalability)
  - [ ] Scalability in a distributed setting
## Installation

In this section we describe the installation steps for DeepSpeech that we used in our setup. For detailed instructions for running the DeepSpeech training, see the [DeepSpeech project site](https://github.com/mozilla/DeepSpeech).
### Prerequisites

* GNU/Linux
* Python 3, Pip3, git, wget
* CUDA 9.0 with cuDNN 7
* TensorFlow 1.6.0
* Python packages: pandas, python_speech_features, pyxdg, progressbar2
The environment can be set up, for example, using nvidia-docker as follows:

```bash
nvidia-docker pull nvidia/cuda:9.0-cudnn7-devel
nvidia-docker run -it nvidia/cuda:9.0-cudnn7-devel
apt-get update
apt-get install -y python3 python3-pip git wget
pip3 install 'tensorflow-gpu==1.6.0' pandas python_speech_features pyxdg progressbar2 scipy
```
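Optionally, a quick sanity check can confirm that the GPUs are visible inside the container and that the expected TensorFlow version is installed, for example:

```bash
# GPUs visible inside the container
nvidia-smi
# Should print 1.6.0
python3 -c "import tensorflow as tf; print(tf.__version__)"
```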
### Installing DeepSpeech

**Clone the proper version of DeepSpeech**
```bash
git clone https://github.com/mozilla/DeepSpeech.git
cd DeepSpeech/
git reset --hard e00bfd0f413912855eb2312bc1efe3bd2b023b25
```
Note: if you have git-lfs installed, you can disable it for the benchmarks using the environment variable GIT_LFS_SKIP_SMUDGE=1.
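For example, with git-lfs installed, the clone step can be run as follows (the remaining steps stay unchanged):

```bash
# Skip downloading LFS-tracked files, which are not needed for the benchmarks
GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/mozilla/DeepSpeech.git
```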
**Download native libraries**
```bash
python3 util/taskcluster.py --arch gpu --target native_client/ --branch=v0.2.0
```
**Download small dataset**
```bash
python3 bin/import_ldc93s1.py ldc93s1
```
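After the import script finishes, the `ldc93s1` directory should contain the CSV file referenced by the training commands below, which can be verified with, for example:

```bash
# The benchmark commands below expect this file to exist
ls ldc93s1/ldc93s1.csv
```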
### Applying the benchmarking patch

```bash
wget https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/deepspeech/deepspeech_benchmarking.patch
git apply deepspeech_benchmarking.patch
```
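Optionally, `git apply --check` can be used first to verify that the patch applies cleanly to the checked-out revision without modifying the working tree:

```bash
# Dry run: reports conflicts without applying anything
git apply --check deepspeech_benchmarking.patch
```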
## Running the benchmarks

In this section we describe the steps to reproduce the [experimental results](#experimental-results), assuming that the DeepSpeech training program is installed and the benchmarking patch is applied.
### Manually

To run the benchmark, specify the number of "global steps" to be benchmarked using the `benchmark_steps` parameter:

```bash
LD_LIBRARY_PATH=native_client/ CUDA_VISIBLE_DEVICES=0 python3 ./DeepSpeech.py --train_files=ldc93s1/ldc93s1.csv --dev_files=ldc93s1/ldc93s1.csv --test_files=ldc93s1/ldc93s1.csv --log_level=3 --benchmark_steps=10
```
**Testing batch size on one GPU**

To check the performance for various batch sizes, modify the `train_batch_size` parameter. For example, to use a batch size of 128, run:

```bash
LD_LIBRARY_PATH=native_client/ CUDA_VISIBLE_DEVICES=0 python3 ./DeepSpeech.py --train_files=ldc93s1/ldc93s1.csv --dev_files=ldc93s1/ldc93s1.csv --test_files=ldc93s1/ldc93s1.csv --log_level=3 --benchmark_steps=10 --train_batch_size=128
```
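To compare several batch sizes in one go, the command above can be wrapped in a simple shell loop; the batch-size values below are only illustrative:

```bash
# Sweep a few batch sizes, benchmarking 10 global steps for each
for bs in 32 64 128 256; do
    echo "=== train_batch_size=${bs} ==="
    LD_LIBRARY_PATH=native_client/ CUDA_VISIBLE_DEVICES=0 python3 ./DeepSpeech.py \
        --train_files=ldc93s1/ldc93s1.csv --dev_files=ldc93s1/ldc93s1.csv --test_files=ldc93s1/ldc93s1.csv \
        --log_level=3 --benchmark_steps=10 --train_batch_size=${bs}
done
```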
**Testing scalability on many GPUs**

***In-graph replication***

To check the performance of parallel training on multiple GPUs, modify the CUDA_VISIBLE_DEVICES environment variable. For example, to use GPUs 1 and 2, set CUDA_VISIBLE_DEVICES=1,2, and to use all GPUs in a 4-GPU system, set CUDA_VISIBLE_DEVICES=0,1,2,3. The in-graph replication method for data-parallel, synchronized training implemented in Mozilla DeepSpeech will be used, as in the example below.
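For example, a benchmark run on all four GPUs of a 4-GPU machine could look as follows (the remaining flags are the same as in the single-GPU examples above):

```bash
# In-graph replication across GPUs 0, 1, 2 and 3
LD_LIBRARY_PATH=native_client/ CUDA_VISIBLE_DEVICES=0,1,2,3 python3 ./DeepSpeech.py \
    --train_files=ldc93s1/ldc93s1.csv --dev_files=ldc93s1/ldc93s1.csv --test_files=ldc93s1/ldc93s1.csv \
    --log_level=3 --benchmark_steps=10 --train_batch_size=128
```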
### run-cluster.sh

The Mozilla DeepSpeech implementation supports distributed training using Distributed TensorFlow with a parameter server and worker processes. The `bin/run-cluster.sh` script is helpful for configuring and running these processes on a single machine. The script accepts an argument in the p:w:g format, where p denotes the number of parameter servers, w denotes the number of worker processes and g denotes the number of GPUs used by each worker process.

For example, running our benchmark on the distributed training application with 1 parameter server and 4 workers using 1 GPU each requires executing the following command:
```bash
LD_LIBRARY_PATH=native_client/ bin/run-cluster.sh 1:4:1 --script="python3 DeepSpeech.py" --train_files=ldc93s1/ldc93s1.csv --dev_files=ldc93s1/ldc93s1.csv --test_files=ldc93s1/ldc93s1.csv --train_batch_size=64 --epoch=1000 --benchmark_warmup_steps=10 --benchmark_steps=10 --log_level=3 --noshow_progressbar
```
It should be noted that distributed training introduces a startup overhead, so increasing the number of warmup steps may be necessary to collect reliable results.
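As an illustration only, a configuration with 1 parameter server and 2 workers using 2 GPUs each, combined with a longer warmup period (the value 50 is just an illustrative choice), could be launched like this:

```bash
# 1 parameter server, 2 workers, 2 GPUs per worker; more warmup steps before measuring
LD_LIBRARY_PATH=native_client/ bin/run-cluster.sh 1:2:2 --script="python3 DeepSpeech.py" \
    --train_files=ldc93s1/ldc93s1.csv --dev_files=ldc93s1/ldc93s1.csv --test_files=ldc93s1/ldc93s1.csv \
    --train_batch_size=64 --epoch=1000 --benchmark_warmup_steps=50 --benchmark_steps=10 \
    --log_level=3 --noshow_progressbar
```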
### Docker

We provide a Dockerfile that allows building and running the benchmark as a Docker image:

```bash
docker build -t deepspeech .
```
If the image needs to be shared between distributed machines, the repository name has to be included in the image tag and the image has to be pushed to a Docker repository:

```bash
docker build -t <repositoryname>/deepspeech .
docker push <repositoryname>/deepspeech
```
Now, the benchmark can be executed in a Docker container, so that no dependencies need to be installed on the host machine:

```bash
docker run deepspeech python3 ./DeepSpeech.py --train_files=ldc93s1/ldc93s1.csv --dev_files=ldc93s1/ldc93s1.csv --test_files=ldc93s1/ldc93s1.csv --log_level=3 --benchmark_steps=10 --train_batch_size=128
```
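If the default Docker runtime does not expose the GPUs, the same command can be started through nvidia-docker (already used in the prerequisites section); GPU selection then works as in the manual runs. A possible invocation:

```bash
# Run the benchmark container with GPU access, pinned to GPU 0
nvidia-docker run -e CUDA_VISIBLE_DEVICES=0 deepspeech python3 ./DeepSpeech.py \
    --train_files=ldc93s1/ldc93s1.csv --dev_files=ldc93s1/ldc93s1.csv --test_files=ldc93s1/ldc93s1.csv \
    --log_level=3 --benchmark_steps=10 --train_batch_size=128
```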
### Kubernetes

In order to enqueue the benchmark in a Kubernetes installation [(for example microk8s)](https://gist.github.com/PiotrowskiD/07a57ad0f21e2b4de78454d02b34865c), create an adequate deployment file (an example is provided in ds.yaml) and create the resource:

```bash
kubectl create -f ds.yaml
```
The status of the resulting Pod, its detailed description and its logs can be fetched as follows:

```bash
kubectl get pod
kubectl describe pod ds
kubectl logs ds
```
Unfortunately, Kubernetes does not take into account other processes using the GPUs, which leads to conflicts if somebody else runs their jobs manually. Because the CUDA_VISIBLE_DEVICES environment variable is used inside the container, it can only choose from the GPUs that Kubernetes assigns to the container. So, to deploy the training to GPU number 3, the GPU limit has to be set to 4 and CUDA_VISIBLE_DEVICES to "3". This applies when the GPUs are on a single node; if they are on different nodes, [node labels and selectors](https://kubernetes.io/docs/tasks/configure-pod-container/assign-pods-nodes/) can be used.
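A minimal sketch of the node-label approach (the node name and the label are placeholders to be adapted): label the node that hosts the desired GPUs, then reference the label through a `nodeSelector` in ds.yaml.

```bash
# Label the node hosting the desired GPUs (placeholder node name and label)
kubectl label nodes <node-name> gpu-host=dgx-station

# Confirm the label before referencing it via nodeSelector in ds.yaml
kubectl get nodes --show-labels
```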
### TensorHive

## Experimental results
### Batch size

As expected, performance increases proportionally to batch size, up to a limit depending on GPU memory capacity:

![batch_size_v100](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/deepspeech/img/batch_size_v100.png)
![batch_size_gtx1060](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/deepspeech/img/batch_size_gtx1060.png)
Although using Distributed TensorFlow with parameter servers allows distributed training on multiple nodes, it should be noted that communication with the parameter server introduces significant overhead compared with the in-graph replication method:

![batch_size_v100_distributed](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/deepspeech/img/batch_size_v100_distributed.png)
### MultiGPU scalability

The following results show performance on an NVIDIA® DGX Station™, depending on the choice of utilized GPUs. The results are marked with the IDs of the used GPUs; for example, '013' means that CUDA_VISIBLE_DEVICES was set to 0,1,3.

![multigpu_128](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/deepspeech/img/multigpu_128.png)
Interestingly, in the cases utilizing two GPUs, it matters which particular GPUs are used. For example, combining GPUs 0 and 1, or GPUs 2 and 3, results in worse performance than the other pairings. This is probably connected with the interconnects between the GPUs.

The overhead of the Distributed TensorFlow implementation is also visible in the multi-GPU setup:

![multigpu_128_distributed](https://raw.githubusercontent.com/roscisz/TensorHive/master/examples/deepspeech/img/multigpu_128_distributed.png)
The results show that in the investigated setup it is better to run multiple processes each utilizing a single GPU than to run one process utilizing multiple GPUs. See the [TensorHive examples directory](https://github.com/roscisz/TensorHive/tree/master/examples) for examples of how the TensorHive task execution module can be used for various training applications.