Commit
Merge pull request #240 from roscisz/develop
r0.3.2
roscisz authored Mar 5, 2020
2 parents c614970 + 9368edd commit fbd0fdf
Showing 44 changed files with 2,218 additions and 887 deletions.
21 changes: 12 additions & 9 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
TensorHive
===
![](https://img.shields.io/badge/release-v0.3.1-brightgreen.svg?style=popout-square)
![](https://img.shields.io/badge/pypi-v0.3.1-brightgreen.svg?style=popout-square)
![](https://img.shields.io/badge/release-v0.3.2-brightgreen.svg?style=popout-square)
![](https://img.shields.io/badge/pypi-v0.3.2-brightgreen.svg?style=popout-square)
![](https://img.shields.io/badge/Issues%20and%20PRs-welcome-yellow.svg?style=popout-square)
![](https://img.shields.io/badge/platform-Linux-blue.svg?style=popout-square)
![](https://img.shields.io/badge/hardware-Nvidia-green.svg?style=popout-square)
@@ -150,7 +150,8 @@ Features
#### Core
- [x] :mag_right: Monitor metrics on each host
- [x] :tm: Nvidia GPUs
- [ ] :pager: CPU, RAM, HDD
- [x] :pager: CPU, RAM
- [ ] :open_file_folder: HDD
- [x] :customs: Protection of reserved resources
- [x] :warning: Send warning messages to terminal of users who violate the rules
- [x] :mailbox_with_no_mail: Send e-mail warnings
@@ -224,19 +225,21 @@ This diagram will help you to grasp the rough concept of the system.

Contribution and feedback
------------------------
**The project is still in an early beta version**, so there will be some inconveniences; please be patient and keep an eye on upcoming updates.

We'd :heart: to collect your observations, issues and pull requests!

Feel free to **report any configuration problems; we will help you**.

We plan to develop examples of running distributed DNN training applications
in `Task nursery` along with templates for TF_CONFIG and PyTorch, deadline - March 2020 :shipit:, so stay tuned!
We are working on user groups for differentiated GPU access control,
grouping tasks into jobs, and a process-killing reservation violation handler,
deadline - July 2020 :shipit:, so stay tuned!

If you consider becoming a contributor, please look at issues labeled as
[**good-first-issue**](https://github.com/roscisz/TensorHive/issues?q=is%3Aissue+is%3Aopen+label%3Agood-first-issue)
and
[**help wanted**](https://github.com/roscisz/TensorHive/issues?q=is%3Aissue+is%3Aopen+label%3A%22help+wanted%22).

Credits
-------


TensorHive has been greatly supported within a joint project between [**VoiceLab.ai**](https://voicelab.ai) and
[**Gdańsk University of Technology**](https://pg.edu.pl/) titled: "Exploration and selection of methods
for parallelization of neural network training using multiple GPUs".
112 changes: 112 additions & 0 deletions examples/TF_CONFIG/README.md
@@ -0,0 +1,112 @@
# Using TensorHive for running distributed trainings using TF_CONFIG

This example shows how to use the TensorHive `task nursery` module to
conveniently orchestrate distributed trainings configured using
the TF_CONFIG environment variable. This
[MSG-GAN training application](https://github.com/roscisz/dnn_training_benchmarks/tree/master/TensorFlowV2_MSG-GAN_Fashion-MNIST)
was used for the example.

## Running the training without TensorHive

To run the training manually, a separate `python train.py` process has to be
started on each node, with parameter values set as follows.

**TF_CONFIG**

The TF_CONFIG environment variable has to be appropriately configured depending
on the set of nodes taking part in the computations.
For example, training on the two nodes gl01 and gl02 would require the following
TF_CONFIG settings:

gl01:
```bash
TF_CONFIG='{"cluster":{"worker":["gl01:2222", "gl02:2222"]}, "task":{"type": "worker", "index": 0}}'
```

gl02:
```bash
TF_CONFIG='{"cluster":{"worker":["gl01:2222", "gl02:2222"]}, "task":{"type": "worker", "index": 1}}'
```
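Since the two settings differ only in the task index, TF_CONFIG can be derived on each node from the shared worker list. A minimal sketch, assuming every node uses port 2222 and the same worker ordering:

```bash
# Sketch: build TF_CONFIG for this node from the shared worker list.
# Only INDEX differs between nodes (0 on gl01, 1 on gl02).
WORKERS='"gl01:2222", "gl02:2222"'
INDEX=0
export TF_CONFIG="{\"cluster\":{\"worker\":[${WORKERS}]}, \"task\":{\"type\": \"worker\", \"index\": ${INDEX}}}"
echo "$TF_CONFIG"
```

On gl02 only `INDEX` changes to 1; the cluster description stays identical everywhere.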

**Other environment variables**

Depending on the environment, some other environment variables may have to be configured.
For example, because our TensorFlow compilation uses a custom MPI library, the LD_LIBRARY_PATH environment
variable has to be set for each process to /usr/mpi/gcc/openmpi-4.0.0rc5/lib/.

**Choosing the appropriate Python version**

In some cases, a specific Python binary has to be used for the training.
For example, in our environment the training uses a Python binary from a
virtual environment, so it has to be specified as follows:

```
/home/roy/venv/p37avxmpitf2/bin/python
```

**Summary**

Finally, the full commands required to start the training in our example environment are as follows:

gl01:

```bash
export TF_CONFIG='{"cluster":{"worker":["gl01:2222", "gl02:2222"]}, "task":{"type": "worker", "index": 0}}'
export LD_LIBRARY_PATH='/usr/mpi/gcc/openmpi-4.0.0rc5/lib/'
/home/roy/venv/p37avxmpitf2/bin/python /home/roy/dnn_training_benchmarks/TensorFlowV2_MSG-GAN_Fashion-MNIST/train.py
```

gl02:

```bash
export TF_CONFIG='{"cluster":{"worker":["gl01:2222", "gl02:2222"]}, "task":{"type": "worker", "index": 1}}'
export LD_LIBRARY_PATH='/usr/mpi/gcc/openmpi-4.0.0rc5/lib/'
/home/roy/venv/p37avxmpitf2/bin/python /home/roy/dnn_training_benchmarks/TensorFlowV2_MSG-GAN_Fashion-MNIST/train.py
```


## Running the training with TensorHive

The TensorHive `task nursery` module allows convenient orchestration of distributed trainings.
It is available in the Tasks Overview view. The `CREATE TASKS FROM TEMPLATE` button lets you
conveniently configure tasks supporting a specific framework or distribution method. In this
example we choose the TensorFlow - TF_CONFIG template and click `GO TO TASK CREATOR`:

![choose_template](https://github.com/roscisz/TensorHive/tree/master/examples/TF_CONFIG/img/choose_template.png)

In the task creator, we set the Command to
```
/home/roy/venv/p37avxmpitf2/bin/python /home/roy/dnn_training_benchmarks/TensorFlowV2_MSG-GAN_Fashion-MNIST/train.py
```

To add the LD_LIBRARY_PATH environment variable, we enter the parameter name,
select Static (the same value for all processes) and click `ADD AS ENV VARIABLE TO ALL TASKS`:

![env_var](https://github.com/roscisz/TensorHive/tree/master/examples/TF_CONFIG/img/env_var.png)

Then, set the appropriate value of the environment variable (/usr/mpi/gcc/openmpi-4.0.0rc5/lib/).

The task creator also lets you conveniently specify other command-line arguments. For example,
to set the batch size, we enter the parameter name `--batch_size`, again select Static, click
`ADD AS PARAMETER TO ALL TASKS` and set its value (32 in our case).
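Putting these together, the per-process command assembled by the task creator should look roughly like the manual one above, with the new parameter appended. A sketch (echoed rather than executed here, since the paths are specific to our environment; TF_CONFIG and CUDA_VISIBLE_DEVICES are added per process by TensorHive itself):

```bash
# Rough shape of the command for one process; TF_CONFIG and
# CUDA_VISIBLE_DEVICES are filled in per process by TensorHive.
export LD_LIBRARY_PATH='/usr/mpi/gcc/openmpi-4.0.0rc5/lib/'
CMD="/home/roy/venv/p37avxmpitf2/bin/python /home/roy/dnn_training_benchmarks/TensorFlowV2_MSG-GAN_Fashion-MNIST/train.py --batch_size 32"
echo "$CMD"
```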

Select the required hostname and resource (CPU/GPU_N) for the specified training process. The resultant
command that will be executed by TensorHive on the selected node will be displayed above the process specification:

![single_process](https://github.com/roscisz/TensorHive/tree/master/examples/TF_CONFIG/img/single_process.png)

Note that the TF_CONFIG and CUDA_VISIBLE_DEVICES variables are configured automatically. Now, use
the `ADD TASK` button to duplicate the processes and modify the required target hosts to create
your training processes. For example, this screenshot shows the configuration for training on 4
hosts: gl01, gl02, gl03, gl04:

![multi_process](https://github.com/roscisz/TensorHive/tree/master/examples/TF_CONFIG/img/multi_process.png)

After clicking the `CREATE ALL TASKS` button, the processes will be available in the process list for future actions.
To run the processes, select them and use the `Spawn selected tasks` button. If TensorHive is configured properly,
the task status should change to `running`:

![running](https://github.com/roscisz/TensorHive/tree/master/examples/TF_CONFIG/img/running.png)

Note that the process PID will be displayed in the `pid` column. The Task overview can
be used to schedule, spawn, stop, kill and edit the tasks, and to view logs from their execution.
Binary file added examples/TF_CONFIG/img/choose_template.png
Binary file added examples/TF_CONFIG/img/env_var.png
Binary file added examples/TF_CONFIG/img/multi_process.png
Binary file added examples/TF_CONFIG/img/running.png
Binary file added examples/TF_CONFIG/img/single_process.png
6 changes: 3 additions & 3 deletions setup.py
@@ -14,12 +14,12 @@
'tensorhive = tensorhive.__main__:main'
],
},
description='Lightweight computing resource management tool for executing distributed TensorFlow programs',
author='Pawel Rosciszewski, Michal Martyniak, Filip Schodowski, Tomasz Menet',
description='A user-friendly GPU management tool for distributed machine learning workloads',
author='Pawel Rosciszewski, Michal Martyniak, Filip Schodowski',
author_email='[email protected]',
url='https://github.com/roscisz/TensorHive',
download_url='https://github.com/roscisz/TensorHive/archive/{}.tar.gz'.format(tensorhive.__version__),
keywords='distributed machine learning tensorflow resource management',
keywords='reservation monitoring machine learning distributed tensorflow pytorch',
install_requires=[
'parallel-ssh==1.9.1',
'passlib==1.7.1',
2 changes: 1 addition & 1 deletion tensorhive/__init__.py
@@ -1 +1 @@
__version__ = '0.3.1'
__version__ = '0.3.2'
40 changes: 38 additions & 2 deletions tensorhive/api/api_specification.yml
@@ -571,6 +571,29 @@ paths:
          description: {{RESPONSES['general']['auth_error']}}
      security:
        - Bearer: []
  /nodes/{hostname}/cpu/metrics:
    get:
      tags:
        - nodes
      summary: Get node's CPU metric data
      description: Puts null if some data is unavailable
      operationId: tensorhive.controllers.nodes.cpu_controller.get_metrics
      parameters:
        - $ref: '#/parameters/hostnameParam'
        - $ref: '#/parameters/cpuMetricTypeQuery'
      responses:
        200:
          description: {{RESPONSES['general']['ok']}}
          schema:
            $ref: '#/definitions/CPUMetrics'
        401:
          description: {{RESPONSES['general']['unauthorized']}}
        404:
          description: {{RESPONSES['nodes']['hostname']['not_found']}}
        422:
          description: {{RESPONSES['general']['auth_error']}}
      security:
        - Bearer: []
  /nodes/{hostname}/gpu/processes:
    get:
      tags:
@@ -1400,7 +1423,7 @@ definitions:
    type: object
    example:
      <GPU_UUID (All metrics case)>:
        gpu_util:
        utilization:
          unit: '%'
          value: 95
        power:
@@ -1409,6 +1432,8 @@
      <GPU_UUID (Specific metric case)>:
        unit: '%'
        value: 95
  CPUMetrics:
    type: object
  GPUProcesses:
    type: object
    example:
@@ -1437,10 +1462,21 @@ parameters:
      - mem_free
      - mem_used
      - mem_total
      - gpu_util
      - utilization
      - mem_util
      - temp
      - power
  cpuMetricTypeQuery:
    description: Metric type. If not present, queries for all metrics
    in: query
    name: metric_type
    required: false
    type: string
    enum:
      - mem_free
      - mem_used
      - mem_total
      - utilization
securityDefinitions:
  Bearer:
    type: apiKey
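The new CPU metrics endpoint can then be queried like the existing GPU one. A hypothetical sketch, where the API root in `TENSORHIVE_API` and the JWT in `TOKEN` are placeholders for your deployment, not values taken from this repository:

```bash
# Build the request URL for CPU utilization on node gl01.
# TENSORHIVE_API and TOKEN are deployment-specific placeholders.
TENSORHIVE_API=${TENSORHIVE_API:-http://localhost:1111}
URL="${TENSORHIVE_API}/nodes/gl01/cpu/metrics?metric_type=utilization"
# curl -s -H "Authorization: Bearer ${TOKEN}" "$URL"
echo "$URL"
```

Omitting `metric_type` returns all CPU metrics, per the spec above.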