Enable NVIDIA CUDA Support in Scale
Scale now utilizes nvidia-docker2 for all GPU-dependent jobs. In Scale 7.0.0+, parallel execution of GPU jobs on multi-GPU cluster nodes is fully supported.
Prerequisites:
- NVIDIA kernel driver for your GPU
- docker-ce 18.09.2 or greater
The following steps assume a RHEL/CentOS environment; instructions for other distributions can be found in the NVIDIA CUDA repository. Please note that these steps must be performed on ALL Mesos agents with GPUs within the cluster to ensure uniform support for GPU acceleration. All steps are run as root.
The first step is to install nvidia-docker2:
# If you have nvidia-docker 1.0 installed: we need to remove it and all existing GPU containers
docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
sudo yum remove nvidia-docker
# Add the package repositories
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | \
sudo tee /etc/yum.repos.d/nvidia-docker.repo
# Install nvidia-docker2 and reload the Docker daemon configuration
sudo yum install -y nvidia-docker2
sudo pkill -SIGHUP dockerd
Then we test the NVIDIA runtime locally with the following command:
# Test nvidia-smi with the latest official CUDA image
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
Now we can verify functionality in Scale by adding a GPU-enabled job type and running it.
Note: the response will include the job-type id assigned to this job type.
POST https://your-scale-url/api/v6/job-types/
{
"docker_image": "nvidia/cuda:9.0-base",
"manifest":
{
"seedVersion": "1.0.0",
"job": {
"name": "gpu-test",
"jobVersion": "1.0.0",
"packageVersion": "1.0.0",
"title": "GPU Test 2",
"description": "stresses GPU",
"maintainer": {
"name": "AIS",
"organization": "Applied Information Sciences",
"email": ""
},
"timeout": 3600,
"interface": {
"command": "nvidia-smi"
},
"resources":
{
"scalar": [
{ "name": "gpus", "value": 1.0 }
]
}
}
}
}
Finally, we can run the job in Scale, using the job-type id returned from the previous call (15 is an example):
POST https://your-scale-url/api/v6/jobs
{
"job_type_id": 15
}
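The two REST calls above can be sketched end-to-end with curl. This is a hedged illustration: your-scale-url and the gpu-test-job-type.json filename are placeholders, the live curl calls are commented out, and the stubbed response stands in for the real API reply (whose exact field layout may differ in your Scale version):

```shell
SCALE_URL="https://your-scale-url"

# 1) Create the job type (save the manifest above as gpu-test-job-type.json):
# response=$(curl -s -X POST "$SCALE_URL/api/v6/job-types/" \
#            -H "Content-Type: application/json" -d @gpu-test-job-type.json)

# Stubbed response standing in for the live reply:
response='{"id": 15, "name": "gpu-test"}'

# Pull the assigned job-type id out of the response:
job_type_id=$(echo "$response" | sed -n 's/.*"id": *\([0-9]*\).*/\1/p')

# 2) Run a job of that type:
# curl -s -X POST "$SCALE_URL/api/v6/jobs" \
#      -H "Content-Type: application/json" \
#      -d "{\"job_type_id\": $job_type_id}"
echo "job_type_id=$job_type_id"
```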
If you are developing a Docker container for use in Scale but wish to test it before deploying, you can mimic the way Scale invokes Docker for GPU jobs simply by adding --runtime=nvidia to your docker run command:
docker run --runtime=nvidia someorg/mygpudockercontainer:latest
Legacy Instructions (nvidia-docker 1.0, Scale versions prior to 7.0.0)
NVIDIA provides a shim to enable GPU-accelerated CUDA applications to run within Docker. It consists of an NVIDIA volume plugin and a CLI wrapper around the standard Docker CLI. With this CLI and the Docker base images provided by NVIDIA, it is possible to run GPU-accelerated algorithms within Scale.
In these older versions, parallel execution of GPU jobs on multi-GPU cluster nodes is not yet supported. The scheduler makes a greedy allocation of all offered node GPU resources to the first matched job in the queue. This ensures no GPU contention, as the legacy GPU support does not isolate GPUs in the running containers.
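The greedy behavior described above can be illustrated with a short shell sketch. The job names and GPU count are hypothetical, and this is not Scale's actual scheduler code; it only shows that the first matched job takes the entire offer while later jobs wait:

```shell
# Hypothetical illustration of greedy GPU allocation: a single offer of
# 4 GPUs goes entirely to the first queued job; the rest must wait.
offered_gpus=4
for job in job-a job-b job-c; do
  if [ "$offered_gpus" -gt 0 ]; then
    echo "$job: allocated $offered_gpus GPU(s)"
    offered_gpus=0   # the first match consumes the whole offer
  else
    echo "$job: waiting for the next offer"
  fi
done
```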
- NVIDIA kernel driver for your GPU (for example, the GRID K520 driver for a g2.2xlarge AWS instance)
The following steps assume a RHEL/CentOS environment; instructions for other distributions can be found in the NVIDIA CUDA repository. Please note that these steps must be performed on ALL Mesos agents with GPUs within the cluster to ensure uniform support for GPU acceleration. All steps are run as root.
# Install and start nvidia-docker 1.0
yum install -y https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker-1.0.1-1.x86_64.rpm
systemctl enable nvidia-docker
systemctl start nvidia-docker
# Point the DC/OS Mesos agent at the nvidia-docker wrapper and restart it
echo "MESOS_DOCKER=nvidia-docker" >> /opt/mesosphere/etc/mesos-slave-common
systemctl restart dcos-mesos-slave
You can verify that nvidia-docker will be used with the following command (nothing is printed if nvidia-docker isn't active):
ps -ef | grep mesos-docker-executor | grep nvidia-docker
You can verify the support is actually working within Docker using the following command:
nvidia-docker run -it nvidia/cuda:7.0-cudnn4-runtime nvidia-smi
You can verify that the support is being recognized by Marathon by launching a service with the following configuration (set instances to the number of nodes with GPUs):
{
"id": "/gpu-test",
"cmd": "nvidia-smi && sleep 1500",
"container": {
"type": "DOCKER",
"docker": {
"image": "nvidia/cuda:7.0-cudnn4-runtime"
}
},
"cpus": 0.1,
"disk": 0,
"instances": 1,
"mem": 128,
"gpus": 0,
"requirePorts": false,
"portDefinitions": [],
"networks": [],
"healthChecks": [],
"fetch": [],
"constraints": [
[
"hostname",
"UNIQUE"
]
]
}
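A sketch of deploying the app definition above through Marathon's REST API (POST /v2/apps is Marathon's standard app-creation endpoint). The cluster URL and the gpu-test.json filename are placeholders, and the live call is commented out:

```shell
# Hypothetical deployment sketch: save the Marathon app definition above
# as gpu-test.json, then POST it to Marathon's app-creation endpoint.
MARATHON_URL="https://dcos/service/marathon"
APP_ID="gpu-test"
# Uncomment to deploy against a live cluster:
# curl -s -X POST "$MARATHON_URL/v2/apps" \
#      -H "Content-Type: application/json" \
#      -d @gpu-test.json
# Afterwards, check on the deployment:
# curl -s "$MARATHON_URL/v2/apps/$APP_ID"
echo "Would POST gpu-test.json to $MARATHON_URL/v2/apps (app id: /$APP_ID)"
```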
Finally, we can create a GPU enabled job type and test it in Scale with the following two service calls:
POST https://dcos/service/scale/api/v6/job-types/
{
"docker_image": "nvidia/cuda:7.0-cudnn4-runtime",
"manifest":
{
"seedVersion": "1.0.0",
"job": {
"name": "gpu-test",
"jobVersion": "1.0.0",
"packageVersion": "1.0.0",
"title": "GPU Test",
"description": "Validates GPU accessibility by executing nvidia-smi",
"maintainer": {
"name": "Jonathan Meyer",
"organization": "Applied Information Sciences",
"email": "[email protected]"
},
"timeout": 3600,
"interface": {
"command": "nvidia-smi"
},
"resources":
{
"scalar": [
{ "name": "gpus", "value": 1.0 }
]
}
}
}
}
POST https://dcos/service/scale/api/v6/queue/new-job/
{
"job_type_id": 11
}