Commit
Merge pull request #240 from roscisz/develop
r0.3.2
roscisz authored Mar 5, 2020
2 parents c614970 + 9368edd commit fbd0fdf
Showing 44 changed files with 2,218 additions and 887 deletions.
21 changes: 12 additions & 9 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
TensorHive
===
![](https://img.shields.io/badge/release-v0.3.1-brightgreen.svg?style=popout-square)
![](https://img.shields.io/badge/pypi-v0.3.1-brightgreen.svg?style=popout-square)
![](https://img.shields.io/badge/release-v0.3.2-brightgreen.svg?style=popout-square)
![](https://img.shields.io/badge/pypi-v0.3.2-brightgreen.svg?style=popout-square)
![](https://img.shields.io/badge/Issues%20and%20PRs-welcome-yellow.svg?style=popout-square)
![](https://img.shields.io/badge/platform-Linux-blue.svg?style=popout-square)
![](https://img.shields.io/badge/hardware-Nvidia-green.svg?style=popout-square)
@@ -150,7 +150,8 @@ Features
#### Core
- [x] :mag_right: Monitor metrics on each host
- [x] :tm: Nvidia GPUs
- [ ] :pager: CPU, RAM, HDD
- [x] :pager: CPU, RAM
- [ ] :open_file_folder: HDD
- [x] :customs: Protection of reserved resources
- [x] :warning: Send warning messages to terminal of users who violate the rules
- [x] :mailbox_with_no_mail: Send e-mail warnings
@@ -224,19 +225,21 @@ This diagram will help you to grasp the rough concept of the system.

Contribution and feedback
------------------------
**The project is still in an early beta version**, so there will be some inconveniences; please be patient and keep an eye on upcoming updates.

We'd :heart: to collect your observations, issues and pull requests!

Feel free to **report any configuration problems; we will help you**.

We plan to develop examples of running distributed DNN training applications
in `Task nursery` along with templates for TF_CONFIG and PyTorch, deadline - March 2020 :shipit:, so stay tuned!
We are working on user groups for differentiated GPU access control,
grouping tasks into jobs, and a process-killing reservation violation handler,
deadline - July 2020 :shipit:, so stay tuned!

If you consider becoming a contributor, please look at issues labeled as
[**good-first-issue**](https://github.com/roscisz/TensorHive/issues?q=is%3Aissue+is%3Aopen+label%3Agood-first-issue)
and
[**help wanted**](https://github.com/roscisz/TensorHive/issues?q=is%3Aissue+is%3Aopen+label%3A%22help+wanted%22).

Credits
-------


TensorHive has been greatly supported within a joint project between [**VoiceLab.ai**](https://voicelab.ai) and
[**Gdańsk University of Technology**](https://pg.edu.pl/) titled: "Exploration and selection of methods
for parallelization of neural network training using multiple GPUs".
112 changes: 112 additions & 0 deletions examples/TF_CONFIG/README.md
@@ -0,0 +1,112 @@
# Using TensorHive for running distributed trainings using TF_CONFIG

This example shows how to use the TensorHive `task nursery` module to
conveniently orchestrate distributed trainings configured using
the TF_CONFIG environment variable. This
[MSG-GAN training application](https://github.com/roscisz/dnn_training_benchmarks/tree/master/TensorFlowV2_MSG-GAN_Fashion-MNIST)
was used for the example.

## Running the training without TensorHive

To run the training manually, a separate `python train.py` process has to be
started on each node, with parameter values set as follows.

**TF_CONFIG**

The TF_CONFIG environment variable has to be appropriately configured depending
on the set of nodes taking part in the computations.
For example, training on the two nodes gl01 and gl02 would require the following
TF_CONFIG settings:

gl01:
```bash
TF_CONFIG='{"cluster":{"worker":["gl01:2222", "gl02:2222"]}, "task":{"type": "worker", "index": 0}}'
```

gl02:
```bash
TF_CONFIG='{"cluster":{"worker":["gl01:2222", "gl02:2222"]}, "task":{"type": "worker", "index": 1}}'
```
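Since the two settings differ only in the task index, TF_CONFIG can be derived on each node from the shared worker list. A minimal sketch, assuming every node uses port 2222 and the same worker ordering:

```bash
# Sketch: build TF_CONFIG for this node from the shared worker list.
# Only INDEX differs between nodes (0 on gl01, 1 on gl02).
WORKERS='"gl01:2222", "gl02:2222"'
INDEX=0
export TF_CONFIG="{\"cluster\":{\"worker\":[${WORKERS}]}, \"task\":{\"type\": \"worker\", \"index\": ${INDEX}}}"
echo "$TF_CONFIG"
```

On gl02 only `INDEX` changes to 1; the cluster description stays identical everywhere.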

**Other environment variables**

Depending on the environment, some other environment variables may have to be configured.
For example, because our TensorFlow compilation uses a custom MPI library, the LD_LIBRARY_PATH environment
variable has to be set for each process to /usr/mpi/gcc/openmpi-4.0.0rc5/lib/.

**Choosing the appropriate Python version**

In some cases, a specific Python binary has to be used for the training.
For example, in our environment the training uses a Python binary from a
virtual environment, so it has to be specified as follows:

```
/home/roy/venv/p37avxmpitf2/bin/python
```

**Summary**

Finally, the full commands required to start the training in our example environment are as follows:

gl01:

```bash
export TF_CONFIG='{"cluster":{"worker":["gl01:2222", "gl02:2222"]}, "task":{"type": "worker", "index": 0}}'
export LD_LIBRARY_PATH='/usr/mpi/gcc/openmpi-4.0.0rc5/lib/'
/home/roy/venv/p37avxmpitf2/bin/python /home/roy/dnn_training_benchmarks/TensorFlowV2_MSG-GAN_Fashion-MNIST/train.py
```

gl02:

```bash
export TF_CONFIG='{"cluster":{"worker":["gl01:2222", "gl02:2222"]}, "task":{"type": "worker", "index": 1}}'
export LD_LIBRARY_PATH='/usr/mpi/gcc/openmpi-4.0.0rc5/lib/'
/home/roy/venv/p37avxmpitf2/bin/python /home/roy/dnn_training_benchmarks/TensorFlowV2_MSG-GAN_Fashion-MNIST/train.py
```


## Running the training with TensorHive

The TensorHive `task nursery` module allows convenient orchestration of distributed trainings.
It is available in the Tasks Overview view. The `CREATE TASKS FROM TEMPLATE` button lets you
conveniently configure tasks supporting a specific framework or distribution method. In this
example we choose the TensorFlow - TF_CONFIG template and click `GO TO TASK CREATOR`:

![choose_template](https://github.com/roscisz/TensorHive/tree/master/examples/TF_CONFIG/img/choose_template.png)

In the task creator, we set the Command to
```
/home/roy/venv/p37avxmpitf2/bin/python /home/roy/dnn_training_benchmarks/TensorFlowV2_MSG-GAN_Fashion-MNIST/train.py
```

To add the LD_LIBRARY_PATH environment variable, we enter the parameter name,
select Static (the same value for all processes) and click `ADD AS ENV VARIABLE TO ALL TASKS`:

![env_var](https://github.com/roscisz/TensorHive/tree/master/examples/TF_CONFIG/img/env_var.png)

Then, set the appropriate value of the environment variable (/usr/mpi/gcc/openmpi-4.0.0rc5/lib/).

The task creator also lets you conveniently specify other command-line arguments. For example,
to set the batch size, we enter the parameter name `--batch_size`, again select Static, click
`ADD AS PARAMETER TO ALL TASKS` and set its value (32 in our case).
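Putting these together, the per-process command assembled by the task creator should look roughly like the manual one above, with the new parameter appended. A sketch (echoed rather than executed here, since the paths are specific to our environment; TF_CONFIG and CUDA_VISIBLE_DEVICES are added per process by TensorHive itself):

```bash
# Rough shape of the command for one process; TF_CONFIG and
# CUDA_VISIBLE_DEVICES are filled in per process by TensorHive.
export LD_LIBRARY_PATH='/usr/mpi/gcc/openmpi-4.0.0rc5/lib/'
CMD="/home/roy/venv/p37avxmpitf2/bin/python /home/roy/dnn_training_benchmarks/TensorFlowV2_MSG-GAN_Fashion-MNIST/train.py --batch_size 32"
echo "$CMD"
```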

Select the required hostname and resource (CPU/GPU_N) for the specified training process. The resultant
command that will be executed by TensorHive on the selected node will be displayed above the process specification:

![single_process](https://github.com/roscisz/TensorHive/tree/master/examples/TF_CONFIG/img/single_process.png)

Note that the TF_CONFIG and CUDA_VISIBLE_DEVICES variables are configured automatically. Now, use
the `ADD TASK` button to duplicate the processes and modify the required target hosts to create
your training processes. For example, this screenshot shows the configuration for training on 4
hosts: gl01, gl02, gl03, gl04:

![multi_process](https://github.com/roscisz/TensorHive/tree/master/examples/TF_CONFIG/img/multi_process.png)

After clicking the `CREATE ALL TASKS` button, the processes will be available in the process list for future actions.
To run the processes, select them and use the `Spawn selected tasks` button. If TensorHive is configured properly,
the task status should change to `running`:

![running](https://github.com/roscisz/TensorHive/tree/master/examples/TF_CONFIG/img/running.png)

Note that the process PID will be displayed in the `pid` column. The Task overview can
be used to schedule, spawn, stop, kill and edit the tasks, and to view logs from their execution.
Binary file added examples/TF_CONFIG/img/choose_template.png
Binary file added examples/TF_CONFIG/img/env_var.png
Binary file added examples/TF_CONFIG/img/multi_process.png
Binary file added examples/TF_CONFIG/img/running.png
Binary file added examples/TF_CONFIG/img/single_process.png
6 changes: 3 additions & 3 deletions setup.py
@@ -14,12 +14,12 @@
'tensorhive = tensorhive.__main__:main'
],
},
description='Lightweight computing resource management tool for executing distributed TensorFlow programs',
author='Pawel Rosciszewski, Michal Martyniak, Filip Schodowski, Tomasz Menet',
description='A user-friendly GPU management tool for distributed machine learning workloads',
author='Pawel Rosciszewski, Michal Martyniak, Filip Schodowski',
author_email='[email protected]',
url='https://github.com/roscisz/TensorHive',
download_url='https://github.com/roscisz/TensorHive/archive/{}.tar.gz'.format(tensorhive.__version__),
keywords='distributed machine learning tensorflow resource management',
keywords='reservation monitoring machine learning distributed tensorflow pytorch',
install_requires=[
'parallel-ssh==1.9.1',
'passlib==1.7.1',
2 changes: 1 addition & 1 deletion tensorhive/__init__.py
@@ -1 +1 @@
__version__ = '0.3.1'
__version__ = '0.3.2'
40 changes: 38 additions & 2 deletions tensorhive/api/api_specification.yml
@@ -571,6 +571,29 @@ paths:
          description: {{RESPONSES['general']['auth_error']}}
      security:
        - Bearer: []
  /nodes/{hostname}/cpu/metrics:
    get:
      tags:
        - nodes
      summary: Get node's CPU metric data
      description: Puts null if some data is unavailable
      operationId: tensorhive.controllers.nodes.cpu_controller.get_metrics
      parameters:
        - $ref: '#/parameters/hostnameParam'
        - $ref: '#/parameters/cpuMetricTypeQuery'
      responses:
        200:
          description: {{RESPONSES['general']['ok']}}
          schema:
            $ref: '#/definitions/CPUMetrics'
        401:
          description: {{RESPONSES['general']['unauthorized']}}
        404:
          description: {{RESPONSES['nodes']['hostname']['not_found']}}
        422:
          description: {{RESPONSES['general']['auth_error']}}
      security:
        - Bearer: []
  /nodes/{hostname}/gpu/processes:
    get:
      tags:
@@ -1400,7 +1423,7 @@ definitions:
    type: object
    example:
      <GPU_UUID (All metrics case)>:
        gpu_util:
        utilization:
          unit: '%'
          value: 95
        power:
@@ -1409,6 +1432,8 @@
      <GPU_UUID (Specific metric case)>:
        unit: '%'
        value: 95
  CPUMetrics:
    type: object
  GPUProcesses:
    type: object
    example:
@@ -1437,10 +1462,21 @@ parameters:
      - mem_free
      - mem_used
      - mem_total
      - gpu_util
      - utilization
      - mem_util
      - temp
      - power
  cpuMetricTypeQuery:
    description: Metric type. If not present, queries for all metrics
    in: query
    name: metric_type
    required: false
    type: string
    enum:
      - mem_free
      - mem_used
      - mem_total
      - utilization
securityDefinitions:
  Bearer:
    type: apiKey
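The new CPU metrics endpoint can then be queried like the existing GPU one. A hypothetical sketch, where the API root in `TENSORHIVE_API` and the JWT in `TOKEN` are placeholders for your deployment, not values taken from this repository:

```bash
# Build the request URL for CPU utilization on node gl01.
# TENSORHIVE_API and TOKEN are deployment-specific placeholders.
TENSORHIVE_API=${TENSORHIVE_API:-http://localhost:1111}
URL="${TENSORHIVE_API}/nodes/gl01/cpu/metrics?metric_type=utilization"
# curl -s -H "Authorization: Bearer ${TOKEN}" "$URL"
echo "$URL"
```

Omitting `metric_type` returns all CPU metrics, per the spec above.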