
Commit

Document multi-node with Dask on Databricks (#297)
Co-authored-by: Jacob Tomlinson <[email protected]>
3 people authored Nov 21, 2023
1 parent 8f9d5e2 commit 596580e
Showing 7 changed files with 103 additions and 18 deletions.
@@ -130,7 +130,7 @@
"\n",
"This notebook is designed to run on a single node with multiple GPUs, you can get multi-GPU VMs from [AWS](https://docs.rapids.ai/deployment/stable/cloud/aws/ec2-multi.html), [GCP](https://docs.rapids.ai/deployment/stable/cloud/gcp/dataproc.html), [Azure](https://docs.rapids.ai/deployment/stable/cloud/azure/azure-vm-multi.html), [IBM](https://docs.rapids.ai/deployment/stable/cloud/ibm/virtual-server.html) and more.\n",
"\n",
"We start a [local cluster](../../../tools/dask-cuda.md) and keep it ready for running distributed tasks with dask.\n",
"We start a [local cluster](../../../source/tools/dask-cuda.md) and keep it ready for running distributed tasks with dask.\n",
"\n",
"Below, [LocalCUDACluster](https://github.com/rapidsai/dask-cuda) launches one Dask worker for each GPU in the current systems. It's developed as a part of the RAPIDS project.\n",
"Learn More:\n",
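For reference, the local-cluster setup described in this notebook typically looks like the following. This is a minimal sketch of the usual `dask-cuda` pattern, not necessarily the notebook's exact cell:

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Launch one Dask worker per GPU available on this node
cluster = LocalCUDACluster()
client = Client(cluster)
```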
2 changes: 1 addition & 1 deletion source/guides/scheduler-gpu-requirements.md
@@ -27,7 +27,7 @@ If your workload doesn't trigger any edge-cases and you're not using the high-le

### Known edge cases

When calling [`client.submit`](dask:distributed.Client.submit) and passing data directly to a function the whole graph is serialized and sent to the scheduler. In order for the scheduler to figure out what to do with it the graph is deserialized. If the data uses GPUs this can cause the scheduler to import RAPIDS libraries, attempt to instantiate a CUDA context and populate the data into GPU memory. If those libraries are missing and/or there are no GPUs this will cause the scheduler to fail.
When calling [`client.submit`](https://docs.dask.org/en/latest/futures.html#distributed.Client.submit) and passing data directly to a function the whole graph is serialized and sent to the scheduler. In order for the scheduler to figure out what to do with it the graph is deserialized. If the data uses GPUs this can cause the scheduler to import RAPIDS libraries, attempt to instantiate a CUDA context and populate the data into GPU memory. If those libraries are missing and/or there are no GPUs this will cause the scheduler to fail.
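A minimal sketch of the pattern this paragraph warns about; the scheduler address is a hypothetical placeholder:

```python
import cudf
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # hypothetical scheduler address

gdf = cudf.DataFrame({"a": [1, 2, 3]})

# The GPU-backed DataFrame is embedded in the task graph, so the scheduler
# must deserialize it -- and therefore needs cuDF and a GPU available.
future = client.submit(len, gdf)
```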

Many Dask collections also have a meta object which represents the overall collection but without any data. For example a Dask Dataframe has a meta Pandas Dataframe which has the same meta properties and is used during scheduling. If the underlying data is instead a cuDF Dataframe then the meta object will be too, which is deserialized on the scheduler.
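For instance, a Dask-cuDF collection carries a cuDF meta object that the scheduler deserializes during scheduling (a small illustration, assuming `dask_cudf` is installed):

```python
import cudf
import dask_cudf

gdf = cudf.DataFrame({"a": [1, 2, 3]})
ddf = dask_cudf.from_cudf(gdf, npartitions=2)

# The meta object is an empty cuDF DataFrame mirroring the collection's schema
print(type(ddf._meta))
```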

Binary file added source/images/databricks-dask-cudf-example.png
Binary file added source/images/databricks-dask-init-script.png
Binary file added source/images/databricks-mnmg-dask-client.png
Binary file added source/images/databricks-worker-driver-node.png
117 changes: 101 additions & 16 deletions source/platforms/databricks.md
@@ -1,12 +1,18 @@
# Databricks

## Databricks Notebooks
You can install RAPIDS on Databricks in a few different ways:

You can install RAPIDS libraries into a Databricks GPU Notebook environment.
1. Accelerate machine learning workflows in a single-node GPU notebook environment
2. Spark users can install [RAPIDS Accelerator for Apache Spark 3.x on Databricks](https://docs.nvidia.com/spark-rapids/user-guide/latest/getting-started/databricks.html)
3. Install Dask alongside Spark and then use libraries like `dask-cudf` for multi-node workloads

### Launch a single-node Databricks cluster
## Single-node GPU Notebook environment

Navigate to the **All Purpose Compute** tab of the **Compute** section in Databricks and select **Create Compute**.
(launch-databricks-cluster)=

### Launch cluster

To get started with a single-node Databricks cluster, navigate to the **All Purpose Compute** tab of the **Compute** section in Databricks and select **Create Compute**. Name your cluster and choose "Single node".

![Screenshot of the Databricks compute page](../images/databricks-create-compute.png)

@@ -18,19 +24,15 @@ Then expand the **Advanced Options** section and open the **Docker** tab. Select

![Screenshot of setting the custom container](../images/databricks-custom-container.png)

Once you have done this the GPU nodes should be available in the **Node type** dropdown.
Once you have completed this step, the "GPU accelerated" nodes should be available in the **Node type** dropdown.

![Screenshot of selecting a g4dn.xlarge node type](../images/databricks-choose-gpu-node.png)

```{warning}
It is also possible to use the Databricks ML GPU Runtime to enable GPU nodes, however at the time of writing the newest version (13.3 LTS ML Beta) contains an older version of `tensorflow` and `protobuf` which is not compatible with RAPIDS. So using a custom container with the latest Databricks GPU container images is recommended.
```

Select **Create Compute**.
Select **Create Compute**

### Install RAPIDS in your notebook
### Install RAPIDS

Once your cluster has started create a new notebook or open an existing one.
Once your cluster has started, you can create a new notebook or open an existing one from the `/Workspace` directory, then attach it to your running cluster.

````{warning}
At the time of writing the `databricksruntime/gpu-pytorch:cuda11.8` image does not contain the full `cuda-toolkit` so if you selected that one you will need to install that before installing RAPIDS.
@@ -44,15 +46,15 @@ At the time of writing the `databricksruntime/gpu-pytorch:cuda11.8` image does n
````

At the top of your notebook run any of the following `pip` install commands to install your preferred RAPIDS libraries.
At the top of your notebook run any of the following pip install commands to install your preferred RAPIDS libraries.

```text
!pip install cudf-cu11 dask-cudf-cu11 --extra-index-url=https://pypi.nvidia.com
!pip install cuml-cu11 --extra-index-url=https://pypi.nvidia.com
!pip install cugraph-cu11 --extra-index-url=https://pypi.nvidia.com
```

### Test Rapids
### Test RAPIDS

```python
import cudf
@@ -63,9 +65,92 @@ gdf
0 1 4
1 2 5
2 3 6
```
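If you also installed `cuml-cu11`, a similarly small check can confirm it works; this is a hedged sketch rather than part of the original walkthrough:

```python
import cudf
from cuml.cluster import KMeans

gdf = cudf.DataFrame(
    {"x": [1.0, 1.1, 5.0, 5.1], "y": [1.0, 0.9, 5.0, 5.2]}
)

# Fit a small KMeans model directly on GPU data
km = KMeans(n_clusters=2).fit(gdf)
print(km.cluster_centers_)
```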

## Multi-node Dask cluster

Dask now has a [dask-databricks](https://github.com/jacobtomlinson/dask-databricks) CLI tool (via [`conda`](https://github.com/conda-forge/dask-databricks-feedstock) and [`pip`](https://pypi.org/project/dask-databricks/)) to simplify the Dask cluster startup process within Databricks.

### Create init-script

To get started, you must first configure an [initialization script](https://docs.databricks.com/en/init-scripts/index.html) to install `dask`, `dask-databricks`, the RAPIDS libraries, and any other dependencies for your project.

Databricks recommends using [cluster-scoped](https://docs.databricks.com/en/init-scripts/cluster-scoped.html) init scripts stored in the workspace files.

Navigate to the top-left **Workspace** tab, click on your **Home** directory, then select **Add** > **File** from the menu. Create an `init.sh` script with the following contents:

```bash
#!/bin/bash
set -e

# The Databricks Python directory isn't on the path in
# databricksruntime/gpu-tensorflow:cuda11.8 for some reason
export PATH="/databricks/python/bin:$PATH"

# Install RAPIDS (cudf & dask-cudf) and dask-databricks
/databricks/python/bin/pip install --extra-index-url=https://pypi.nvidia.com \
cudf-cu11 \
dask[complete] \
dask-cudf-cu11 \
dask-cuda=={rapids_version} \
dask-databricks

# Start the Dask cluster with CUDA workers
dask databricks run --cuda

```

**Note**: By default, the `dask databricks run` command launches a Dask scheduler on the driver node and standard workers on the remaining nodes.

To launch a Dask cluster with GPU workers, you must pass the `--cuda` flag, as in the script above.

### Launch Dask cluster

Once your script is ready, follow the same [instructions](launch-databricks-cluster) to launch a **Multi-node** Databricks cluster.

After the Docker setup in **Advanced Options**, switch to the **Init Scripts** tab and add the file path to the init script in your Workspace directory, starting with `/Users/<user-name>/<script-name>.sh`.

You can also configure cluster log delivery in the **Logging** tab, which writes the init script logs to DBFS in a subdirectory called `dbfs:/cluster-logs/<cluster-id>/init_scripts/`. Refer to the [Databricks docs](https://docs.databricks.com/en/init-scripts/logs.html) for more information.
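To peek at those logs from a notebook, you can list the directory with the built-in `dbutils` utility (a quick sketch; substitute your actual cluster ID for the placeholder):

```python
# List init script log files for this cluster
display(dbutils.fs.ls("dbfs:/cluster-logs/<cluster-id>/init_scripts/"))
```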

![Screenshot of init script](../images/databricks-dask-init-script.png)

Now you should be able to select a "GPU-Accelerated" instance for both **Worker** and **Driver** nodes.

![Screenshot of driver worker node](../images/databricks-worker-driver-node.png)

### Connect to Client

To test RAPIDS, connect to the Dask client and submit some tasks.

```python
import dask_databricks


client = dask_databricks.get_client()
client
```

The **[Dask dashboard](https://docs.dask.org/en/latest/dashboard.html)** provides a web-based UI with visualizations and real-time information about the Dask cluster's status, such as task progress and resource utilization.

The Dask dashboard server starts automatically when the scheduler is created and is hosted on port `8087` by default.

To access it, follow the provided URL link to the dashboard status endpoint from within Databricks.
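The dashboard URL is also exposed programmatically on the client, which can be handy if the rendered link is not available (a minimal sketch):

```python
# Print the dashboard URL reported by the scheduler
print(client.dashboard_link)
```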

![Screenshot of dask-client.png](../images/databricks-mnmg-dask-client.png)

```python
import cudf
import dask


df = dask.datasets.timeseries().map_partitions(cudf.from_pandas)
df.x.mean().compute()
```
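The same collection supports further distributed operations; for example, a grouped aggregation runs across all of the GPU workers (a small follow-on sketch using the same `df`):

```python
# Group by the "name" column and average "x" across the GPU workers
df.groupby("name").x.mean().compute()
```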

## Databricks Spark
![Screenshot of dask-cudf-example.png](../images/databricks-dask-cudf-example.png)

### Clean up

You can also use the RAPIDS Accelerator for Apache Spark 3.x on Databricks. See the [Spark RAPIDS documentation](https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-databricks.html) for more information.
```python
client.close()
```
