Enable NVIDIA CUDA Support in Scale

Raul.A edited this page Aug 1, 2019 · 6 revisions

NVIDIA CUDA support with Mesos Docker Containerizer

NVIDIA Support (Scale 7.0.0+)

Scale now uses nvidia-docker2 for all GPU-dependent jobs. As of Scale 7.0.0, parallel execution of GPU jobs on multi-GPU cluster nodes is fully supported.

Prerequisites

  • NVIDIA Kernel Driver for your GPU.
  • docker-ce 18.09.2 or greater

Installation

The following steps assume a RHEL/CentOS environment; instructions for other distros can be found in the NVIDIA CUDA repo. Please note that these steps must be performed on ALL Mesos agents with GPUs in the cluster to ensure uniform support for GPU acceleration. All steps are run as root.

The first step is to install nvidia-docker2:

# If you have nvidia-docker 1.0 installed: we need to remove it and all existing GPU containers
docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
sudo yum remove nvidia-docker

# Add the package repositories
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | \
  sudo tee /etc/yum.repos.d/nvidia-docker.repo
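The repo URL above is keyed off the distribution string. As a sketch of what `$distribution` resolves to, the following mimics a CentOS 7 host with a mocked os-release file instead of reading the real /etc/os-release:

```shell
# Mocked os-release standing in for /etc/os-release on a CentOS 7 host
os_release=$(mktemp)
cat > "$os_release" <<'EOF'
ID="centos"
VERSION_ID="7"
EOF
# Same expansion the install step uses: ID and VERSION_ID concatenated
distribution=$(. "$os_release"; echo $ID$VERSION_ID)
echo "$distribution"   # centos7
rm -f "$os_release"
```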

# Install nvidia-docker2 and reload the Docker daemon configuration
sudo yum install -y nvidia-docker2
sudo pkill -SIGHUP dockerd
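After the reload, nvidia-docker2 should have registered an "nvidia" runtime in /etc/docker/daemon.json. This sketch demonstrates the check against a sample of that file rather than the live daemon config:

```shell
# Sample daemon.json content as written by the nvidia-docker2 package
daemon_json='{"runtimes": {"nvidia": {"path": "nvidia-container-runtime", "runtimeArgs": []}}}'
# Confirm the "nvidia" runtime key is present
runtime_ok=$(printf '%s' "$daemon_json" | python3 -c 'import json,sys; print("nvidia" in json.load(sys.stdin).get("runtimes", {}))')
echo "$runtime_ok"   # True
```

On a real node, run the same check against /etc/docker/daemon.json itself.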

Test

Then we test nvidia-docker locally with the following command:

# Test nvidia-smi with the latest official CUDA image
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi

Now we can verify functionality in Scale by adding a GPU-enabled job type and running it.

Note: The response should list the job-type id assigned to this job.

POST https://your-scale-url/api/v6/job-types/

{
    "docker_image": "nvidia/cuda:9.0-base",
    "manifest": {
        "seedVersion": "1.0.0",
        "job": {
            "name": "gpu-test",
            "jobVersion": "1.0.0",
            "packageVersion": "1.0.0",
            "title": "GPU Test 2",
            "description": "stresses GPU",
            "maintainer": {
                "name": "AIS",
                "organization": "Applied Information Sciences",
                "email": ""
            },
            "timeout": 3600,
            "interface": {
                "command": "nvidia-smi"
            },
            "resources": {
                "scalar": [
                    { "name": "gpus", "value": 1.0 }
                ]
            }
        }
    }
}
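The response to this POST carries the new job-type id needed for the next call. A hedged sketch of pulling it out — the response body below is a trimmed stand-in, not real Scale output:

```shell
# Trimmed stand-in for the job-types POST response body
response='{"id": 15, "name": "gpu-test", "version": "1.0.0"}'
# Extract the id field for use in the job-submission call
job_type_id=$(printf '%s' "$response" | python3 -c 'import json,sys; print(json.load(sys.stdin)["id"])')
echo "$job_type_id"   # 15
```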

Finally, we can run the job in Scale.

POST https://your-scale-url/api/v6/jobs

{
    "job_type_id": 15
}
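The two calls above can be scripted. This sketch only builds and validates the request body; the curl call itself is commented out since it needs a live Scale instance, and `your-scale-url` and the id value 15 are placeholders:

```shell
# Example job-type id as returned by the job-types POST
job_type_id=15
# Compose the submission body
payload=$(printf '{"job_type_id": %d}' "$job_type_id")
# Sanity-check that the body is valid JSON
printf '%s' "$payload" | python3 -m json.tool >/dev/null && echo "payload ok"
# curl -s -X POST -H 'Content-Type: application/json' \
#   -d "$payload" "https://your-scale-url/api/v6/jobs"
```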

Development

If you are developing a Docker container for use in Scale but wish to test it before deploying, you can mimic the way Scale calls Docker for GPU jobs simply by adding --runtime=nvidia to your docker command.

docker run --runtime=nvidia someorg/mygpudockercontainer:latest

Legacy NVIDIA Support (Scale < 7.0.0)

NVIDIA provides a shim to enable GPU-accelerated CUDA applications to run within Docker. It consists of an NVIDIA volume plugin and a CLI wrapper around the standard Docker CLI. With this CLI and the Docker base images provided by NVIDIA, it is possible to run GPU-accelerated algorithms within Scale.

Note

The Scale team is working towards fully supporting parallel execution of GPU jobs on multi-GPU cluster nodes. The scheduler currently makes a greedy allocation of all offered node GPU resources to the first matched job in the queue. This ensures no GPU contention, as the current GPU support does not isolate GPUs within running containers.

Prerequisites

  • NVIDIA Kernel Driver for your GPU (for the AWS g2.2xlarge instance type, this is the GRID K520).

Installation

The following steps assume a RHEL/CentOS environment; instructions for other distros can be found in the NVIDIA CUDA repo. Please note that these steps must be performed on ALL Mesos agents with GPUs in the cluster to ensure uniform support for GPU acceleration. All steps are run as root.

yum install -y https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker-1.0.1-1.x86_64.rpm
systemctl enable nvidia-docker
systemctl start nvidia-docker
echo "MESOS_DOCKER=nvidia-docker" >> /opt/mesosphere/etc/mesos-slave-common
systemctl restart dcos-mesos-slave
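The MESOS_DOCKER line can be sanity-checked before restarting the agent. This sketch runs the same append-and-grep against a temp file standing in for the real /opt/mesosphere/etc/mesos-slave-common:

```shell
# Temp stand-in for /opt/mesosphere/etc/mesos-slave-common
cfg=$(mktemp)
echo "MESOS_DOCKER=nvidia-docker" >> "$cfg"
# Count exact matches of the expected setting
result=$(grep -c '^MESOS_DOCKER=nvidia-docker$' "$cfg")
echo "$result"   # 1
rm -f "$cfg"
```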

Test

You can verify that nvidia-docker will be used with the following command (nothing is shown if nvidia-docker isn't active):

ps -ef | grep mesos-docker-executor | grep nvidia-docker

You can verify the support is actually working within Docker using the following command:

nvidia-docker run -it nvidia/cuda:7.0-cudnn4-runtime nvidia-smi

You can verify that the support is being recognized by Marathon by launching a service with the following configuration (set instances to count of nodes with GPUs):

{
  "id": "/gpu-test",
  "cmd": "nvidia-smi && sleep 1500",
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "nvidia/cuda:7.0-cudnn4-runtime"
    }
  },
  "cpus": 0.1,
  "disk": 0,
  "instances": 1,
  "mem": 128,
  "gpus": 0,
  "requirePorts": false,
  "portDefinitions": [],
  "networks": [],
  "healthChecks": [],
  "fetch": [],
  "constraints": [
    [
      "hostname",
      "UNIQUE"
    ]
  ]
}

Finally, we can create a GPU enabled job type and test it in Scale with the following two service calls:

POST https://dcos/service/scale/api/v6/job-types/

{
    "docker_image": "nvidia/cuda:7.0-cudnn4-runtime",
    "manifest": {
        "seedVersion": "1.0.0",
        "job": {
            "name": "gpu-test",
            "jobVersion": "1.0.0",
            "packageVersion": "1.0.0",
            "title": "GPU Test",
            "description": "Validates GPU accessibility by executing nvidia-smi",
            "maintainer": {
                "name": "Jonathan Meyer",
                "organization": "Applied Information Sciences",
                "email": "[email protected]"
            },
            "timeout": 3600,
            "interface": {
                "command": "nvidia-smi"
            },
            "resources": {
                "scalar": [
                    { "name": "gpus", "value": 1.0 }
                ]
            }
        }
    }
}

POST https://dcos/service/scale/api/v6/queue/new-job/

{
    "job_type_id": 11
}