Ginsburg has the following GPU resources:
- 18 GPU nodes with two Nvidia RTX 8000 GPU modules each
- 4 GPU nodes with two Nvidia V100S GPU modules each, for running computationally expensive tasks such as training ML models
However, using these resources with the latest ML libraries (TensorFlow, PyTorch, etc.) and data science tools (like Jupyter notebooks) requires setting up a proper computational environment.
Here we list the installation instructions for getting started with GPU-enabled ML on Ginsburg.
First, we need to install Miniconda.
Note 1: There is already a conda installation on Ginsburg, but because it is controlled by root we do not have the freedom needed to customize it, so we install our own.
Note 2: We install Miniconda in the scratch directory (e.g. /burg/abernathey/users/db3157/) because Ginsburg limits the number of files in the home directory, and the large number of files generated by conda environments fills it up quickly.
cd <personal scratch dir>
mkdir -p ./miniconda3   # the -O path below requires this directory to exist
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ./miniconda3/miniconda.sh
bash ./miniconda3/miniconda.sh -b -u -p ./miniconda3
rm -rf ./miniconda3/miniconda.sh
./miniconda3/bin/conda init bash
./miniconda3/bin/conda init zsh
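After running conda init, reload your shell configuration (or log out and back in) so that the new conda is on your PATH, and confirm the install (assuming a bash login shell):
source ~/.bashrc
conda --version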
Conda's dependency solver has become extremely slow in recent years; mamba solves this. Install mamba with:
conda install mamba -n base -c conda-forge
In practice, mamba works just like conda: we simply replace the word conda with mamba in our commands.
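For example, the following two commands do the same thing (the package name here is just an illustration):
conda install -n base -c conda-forge numpy
mamba install -n base -c conda-forge numpy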
We will install Jupyter and JupyterLab in the base environment, which will be used to run notebooks.
mamba install jupyter jupyterlab -n base
Next, create a new environment with the dependencies that are needed. Here we do this using an environment.yml file with the following contents:
name: tf_gpu_ml
channels:
- conda-forge
dependencies:
- python
- numpy
- scipy
- jupyter
- jupyterlab
- xarray
- tensorflow-gpu
This environment can then be created with:
mamba env create --file environment.yml
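To confirm that the environment was created and TensorFlow is importable, a quick sanity check (the exact version printed depends on what conda-forge resolved):
conda activate tf_gpu_ml
python -c "import tensorflow as tf; print(tf.__version__)"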
Note 1: More details on many steps can be found in https://github.com/ocean-transport/guides/blob/master/Setting_up_conda_on_clusters.md
Note 2: You might need to use pip if you want the absolute newest version of TensorFlow, but a similar procedure would apply.
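For example, inside the activated environment something like the following would pull the latest release from PyPI (note that the pip build may expect a different CUDA/cuDNN setup than the conda-forge one, so check compatibility first):
conda activate tf_gpu_ml
pip install --upgrade tensorflow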
Often on HPC systems, we need to explicitly register the new environment (like the one created above) as a Jupyter kernel. We can do this as follows.
First activate the new environment:
conda activate tf_gpu_ml
Then create the kernelspec:
jupyter kernelspec install-self --user
Rename the Python kernel you just created so that it does not override the default kernel.
$ cd ~/.local/share/jupyter/kernels
$ vim python3/kernel.json
Change the display_name, for example to:
"display_name": "tf_gpu_ml",
Then rename the directory as well:
mv python3 tf_gpu_ml
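Note: the install-self subcommand is deprecated and may be unavailable on newer Jupyter versions; if so, the following should achieve the same result, and it sets the kernel name and display name in one step so the renaming above is not needed:
conda activate tf_gpu_ml
python -m ipykernel install --user --name tf_gpu_ml --display-name tf_gpu_ml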
Now the new kernel should be ready to use in Jupyter, so next we will cover how to run Jupyter notebooks on the cluster.
Note: More details on this can be found in https://github.com/ocean-transport/guides/blob/master/adding_jupyter_kernels.md
This can be done by first requesting an interactive node with GPU support:
srun --pty -t 60:00 -A abernathey --gres=gpu:1 /bin/bash
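Once the interactive session starts on a GPU node, you can optionally check that a GPU was actually allocated (assuming the NVIDIA driver utilities are available on the node, which is typical for GPU nodes):
nvidia-smi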
Then we mostly follow the steps from the Ginsburg documentation to start and access Jupyter notebooks.
unset XDG_RUNTIME_DIR
hostname -i
This outputs the node's IP address, something like 10.43.4.206.
jupyter notebook --no-browser --ip=10.43.4.206
This command will start Jupyter and tell you the address where it is running, e.g. http://10.43.4.206:8888/.
You can then access this notebook server from your local machine by running
ssh -L 8080:10.43.4.206:8888 [email protected]
and then opening localhost:8080 in a browser.
Note that by default this will open the classic Jupyter Notebook interface; you can switch to JupyterLab by changing the word tree to lab in the URL.
You can check whether GPU support was properly activated by running the following commands in Python or a Jupyter notebook:
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
This should report that 1 GPU is available, since we requested one GPU (--gres=gpu:1) in the srun command above.
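If you want to go one step further, a small matrix multiplication can confirm that operations actually run on the GPU. This is a minimal sketch assuming TensorFlow 2.x; with device placement logging enabled, the console output should show the ops being placed on GPU:0.
import tensorflow as tf
tf.debugging.set_log_device_placement(True)  # log which device each op runs on
with tf.device('/GPU:0'):
    x = tf.random.normal([1000, 1000])
    print(tf.reduce_sum(tf.matmul(x, x)).numpy())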