
Commit

Document multi-node with Dask on Databricks (#297)
Co-authored-by: Jacob Tomlinson <[email protected]>
3 people authored Nov 21, 2023
1 parent 8f9d5e2 commit 596580e
Showing 7 changed files with 103 additions and 18 deletions.
@@ -130,7 +130,7 @@
"\n",
"This notebook is designed to run on a single node with multiple GPUs, you can get multi-GPU VMs from [AWS](https://docs.rapids.ai/deployment/stable/cloud/aws/ec2-multi.html), [GCP](https://docs.rapids.ai/deployment/stable/cloud/gcp/dataproc.html), [Azure](https://docs.rapids.ai/deployment/stable/cloud/azure/azure-vm-multi.html), [IBM](https://docs.rapids.ai/deployment/stable/cloud/ibm/virtual-server.html) and more.\n",
"\n",
"We start a [local cluster](../../../tools/dask-cuda.md) and keep it ready for running distributed tasks with dask.\n",
"We start a [local cluster](../../../source/tools/dask-cuda.md) and keep it ready for running distributed tasks with dask.\n",
"\n",
"Below, [LocalCUDACluster](https://github.com/rapidsai/dask-cuda) launches one Dask worker for each GPU in the current systems. It's developed as a part of the RAPIDS project.\n",
"Learn More:\n",
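For reference, the local-cluster setup described in this notebook typically looks like the following. This is a minimal sketch of the usual `dask-cuda` pattern, not necessarily the notebook's exact cell:

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Launch one Dask worker per GPU available on this node
cluster = LocalCUDACluster()
client = Client(cluster)
```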
2 changes: 1 addition & 1 deletion source/guides/scheduler-gpu-requirements.md
@@ -27,7 +27,7 @@ If your workload doesn't trigger any edge-cases and you're not using the high-le

### Known edge cases

When calling [`client.submit`](dask:distributed.Client.submit) and passing data directly to a function the whole graph is serialized and sent to the scheduler. In order for the scheduler to figure out what to do with it the graph is deserialized. If the data uses GPUs this can cause the scheduler to import RAPIDS libraries, attempt to instantiate a CUDA context and populate the data into GPU memory. If those libraries are missing and/or there are no GPUs this will cause the scheduler to fail.
When calling [`client.submit`](https://docs.dask.org/en/latest/futures.html#distributed.Client.submit) and passing data directly to a function the whole graph is serialized and sent to the scheduler. In order for the scheduler to figure out what to do with it the graph is deserialized. If the data uses GPUs this can cause the scheduler to import RAPIDS libraries, attempt to instantiate a CUDA context and populate the data into GPU memory. If those libraries are missing and/or there are no GPUs this will cause the scheduler to fail.
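A minimal sketch of the pattern this paragraph warns about; the scheduler address is a hypothetical placeholder:

```python
import cudf
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # hypothetical scheduler address

gdf = cudf.DataFrame({"a": [1, 2, 3]})

# The GPU-backed DataFrame is embedded in the task graph, so the scheduler
# must deserialize it -- and therefore needs cuDF and a GPU available.
future = client.submit(len, gdf)
```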

Many Dask collections also have a meta object which represents the overall collection but without any data. For example a Dask Dataframe has a meta Pandas Dataframe which has the same meta properties and is used during scheduling. If the underlying data is instead a cuDF Dataframe then the meta object will be too, which is deserialized on the scheduler.
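For instance, a Dask-cuDF collection carries a cuDF meta object that the scheduler deserializes during scheduling (a small illustration, assuming `dask_cudf` is installed):

```python
import cudf
import dask_cudf

gdf = cudf.DataFrame({"a": [1, 2, 3]})
ddf = dask_cudf.from_cudf(gdf, npartitions=2)

# The meta object is an empty cuDF DataFrame mirroring the collection's schema
print(type(ddf._meta))
```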

Binary file added source/images/databricks-dask-cudf-example.png
Binary file added source/images/databricks-dask-init-script.png
Binary file added source/images/databricks-mnmg-dask-client.png
Binary file added source/images/databricks-worker-driver-node.png
117 changes: 101 additions & 16 deletions source/platforms/databricks.md
@@ -1,12 +1,18 @@
# Databricks

## Databricks Notebooks
You can install RAPIDS on Databricks in a few different ways:

You can install RAPIDS libraries into a Databricks GPU Notebook environment.
1. Accelerate machine learning workflows in a single-node GPU notebook environment
2. Spark users can install [RAPIDS Accelerator for Apache Spark 3.x on Databricks](https://docs.nvidia.com/spark-rapids/user-guide/latest/getting-started/databricks.html)
3. Install Dask alongside Spark and then use libraries like `dask-cudf` for multi-node workloads

### Launch a single-node Databricks cluster
## Single-node GPU Notebook environment

Navigate to the **All Purpose Compute** tab of the **Compute** section in Databricks and select **Create Compute**.
(launch-databricks-cluster)=

### Launch cluster

To get started with a single-node Databricks cluster, navigate to the **All Purpose Compute** tab of the **Compute** section in Databricks and select **Create Compute**. Name your cluster and choose "Single node".

![Screenshot of the Databricks compute page](../images/databricks-create-compute.png)

@@ -18,19 +24,15 @@ Then expand the **Advanced Options** section and open the **Docker** tab. Select

![Screenshot of setting the custom container](../images/databricks-custom-container.png)

Once you have done this the GPU nodes should be available in the **Node type** dropdown.
Once you have completed this step, the "GPU accelerated" nodes should be available in the **Node type** dropdown.

![Screenshot of selecting a g4dn.xlarge node type](../images/databricks-choose-gpu-node.png)

```{warning}
It is also possible to use the Databricks ML GPU Runtime to enable GPU nodes, however at the time of writing the newest version (13.3 LTS ML Beta) contains an older version of `tensorflow` and `protobuf` which is not compatible with RAPIDS. So using a custom container with the latest Databricks GPU container images is recommended.
```

Select **Create Compute**.
Select **Create Compute**

### Install RAPIDS in your notebook
### Install RAPIDS

Once your cluster has started create a new notebook or open an existing one.
Once your cluster has started, you can create a new notebook or open an existing one from the `/Workspace` directory, then attach it to your running cluster.

````{warning}
At the time of writing the `databricksruntime/gpu-pytorch:cuda11.8` image does not contain the full `cuda-toolkit` so if you selected that one you will need to install that before installing RAPIDS.
@@ -44,15 +46,15 @@ At the time of writing the `databricksruntime/gpu-pytorch:cuda11.8` image does n
````

At the top of your notebook run any of the following `pip` install commands to install your preferred RAPIDS libraries.
At the top of your notebook run any of the following pip install commands to install your preferred RAPIDS libraries.

```text
!pip install cudf-cu11 dask-cudf-cu11 --extra-index-url=https://pypi.nvidia.com
!pip install cuml-cu11 --extra-index-url=https://pypi.nvidia.com
!pip install cugraph-cu11 --extra-index-url=https://pypi.nvidia.com
```

### Test Rapids
### Test RAPIDS

```python
import cudf
@@ -63,9 +65,92 @@ gdf
0 1 4
1 2 5
2 3 6
```
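If you also installed `cuml-cu11`, a similarly small check can confirm it works; this is a hedged sketch rather than part of the original walkthrough:

```python
import cudf
from cuml.cluster import KMeans

gdf = cudf.DataFrame(
    {"x": [1.0, 1.1, 5.0, 5.1], "y": [1.0, 0.9, 5.0, 5.2]}
)

# Fit a small KMeans model directly on GPU data
km = KMeans(n_clusters=2).fit(gdf)
print(km.cluster_centers_)
```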

## Multi-node Dask cluster

Dask now has a [dask-databricks](https://github.com/jacobtomlinson/dask-databricks) CLI tool (via [`conda`](https://github.com/conda-forge/dask-databricks-feedstock) and [`pip`](https://pypi.org/project/dask-databricks/)) to simplify the Dask cluster startup process within Databricks.

### Create init-script

To get started, you must first configure an [initialization script](https://docs.databricks.com/en/init-scripts/index.html) to install `dask`, `dask-databricks`, the RAPIDS libraries, and any other dependencies for your project.

Databricks recommends using [cluster-scoped](https://docs.databricks.com/en/init-scripts/cluster-scoped.html) init scripts stored in the workspace files.

Navigate to the top-left **Workspace** tab, click on your **Home** directory, then select **Add** > **File** from the menu. Create an `init.sh` script with the following contents:

```bash
#!/bin/bash
set -e

# The Databricks Python directory isn't on the path in
# databricksruntime/gpu-tensorflow:cuda11.8 for some reason
export PATH="/databricks/python/bin:$PATH"

# Install RAPIDS (cudf & dask-cudf) and dask-databricks
/databricks/python/bin/pip install --extra-index-url=https://pypi.nvidia.com \
cudf-cu11 \
dask[complete] \
dask-cudf-cu11 \
dask-cuda=={rapids_version} \
dask-databricks

# Start the Dask cluster with CUDA workers
dask databricks run --cuda

```

**Note**: By default, the `dask databricks run` command launches a Dask scheduler on the driver node and standard workers on the remaining nodes.

To launch a Dask cluster with GPU workers, you must pass the `--cuda` flag, as in the script above.

### Launch Dask cluster

Once your script is ready, follow the same [instructions](launch-databricks-cluster) to launch a **Multi-node** Databricks cluster.

After the Docker setup in **Advanced Options**, switch to the **Init Scripts** tab and add the file path to the init script in your Workspace directory, starting with `/Users/<user-name>/<script-name>.sh`.

You can also configure cluster log delivery in the **Logging** tab, which writes the init script logs to DBFS in a subdirectory called `dbfs:/cluster-logs/<cluster-id>/init_scripts/`. Refer to the [Databricks docs](https://docs.databricks.com/en/init-scripts/logs.html) for more information.
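To peek at those logs from a notebook, you can list the directory with the built-in `dbutils` utility (a quick sketch; substitute your actual cluster ID for the placeholder):

```python
# List init script log files for this cluster
display(dbutils.fs.ls("dbfs:/cluster-logs/<cluster-id>/init_scripts/"))
```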

![Screenshot of init script](../images/databricks-dask-init-script.png)

Now you should be able to select a "GPU-Accelerated" instance for both **Worker** and **Driver** nodes.

![Screenshot of driver worker node](../images/databricks-worker-driver-node.png)

### Connect to Client

To test RAPIDS, connect to the Dask client and submit some tasks.

```python
import dask_databricks


client = dask_databricks.get_client()
client
```

The **[Dask dashboard](https://docs.dask.org/en/latest/dashboard.html)** provides a web-based UI with visualizations and real-time information about the Dask cluster's status, such as task progress and resource utilization.

The Dask dashboard server starts automatically when the scheduler is created and is hosted on port `8087` by default.

To access it, follow the provided URL link to the dashboard status endpoint from within Databricks.
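The dashboard URL is also exposed programmatically on the client, which can be handy if the rendered link is not available (a minimal sketch):

```python
# Print the dashboard URL reported by the scheduler
print(client.dashboard_link)
```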

![Screenshot of dask-client.png](../images/databricks-mnmg-dask-client.png)

```python
import cudf
import dask


df = dask.datasets.timeseries().map_partitions(cudf.from_pandas)
df.x.mean().compute()
```
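The same collection supports further distributed operations; for example, a grouped aggregation runs across all of the GPU workers (a small follow-on sketch using the same `df`):

```python
# Group by the "name" column and average "x" across the GPU workers
df.groupby("name").x.mean().compute()
```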

## Databricks Spark
![Screenshot of dask-cudf-example.png](../images/databricks-dask-cudf-example.png)

### Clean up

You can also use the RAPIDS Accelerator for Apache Spark 3.x on Databricks. See the [Spark RAPIDS documentation](https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-databricks.html) for more information.
```python
client.close()
```
