
clean up instructions for multi-node dask cluster
skirui-source committed Nov 15, 2023
1 parent 33c9dd8 commit f851e9f
Showing 1 changed file, source/platforms/databricks.md, with 40 additions and 23 deletions.
# Databricks

You can install RAPIDS on Databricks in a few different ways:

1. Accelerate machine learning workflows in a single-node GPU notebook environment
2. Install the RAPIDS Accelerator for Apache Spark 3.x on a Databricks Spark cluster
3. Install Dask alongside Spark and then use libraries like `dask-cudf` for multi-node workloads

## Single-node GPU Notebook environment

To get started with a single-node Databricks cluster, navigate to the **All Purpose Compute** tab of the **Compute** section in Databricks and select **Create Compute**. Name your cluster and choose "Single node" or "Multi-node" (if desired).
### Launch cluster

To get started with a single-node Databricks cluster, navigate to the **All Purpose Compute** tab of the **Compute** section in Databricks and select **Create Compute**. Name your cluster and choose "Single node".

![Screenshot of the Databricks compute page](../images/databricks-create-compute.png)

Once you have completed, the "GPU accelerated" nodes should be available in the **Worker type** and **Driver type** dropdown.

Select **Create Compute**
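
If you prefer to script this step, the same cluster can be created through the Databricks REST API. The sketch below is illustrative rather than official RAPIDS guidance: the workspace URL, token handling, runtime version, and node type are assumptions you should replace with values valid for your workspace.

```python
import os

import requests

# Placeholder values -- substitute your workspace URL, a GPU-enabled
# Databricks ML runtime, and a GPU node type available in your region.
HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = os.environ["DATABRICKS_TOKEN"]

cluster_spec = {
    "cluster_name": "rapids-single-node",
    "spark_version": "13.3.x-gpu-ml-scala2.12",
    "node_type_id": "g4dn.xlarge",
    "num_workers": 0,  # a single-node cluster runs the driver only
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}

response = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
response.raise_for_status()
print(response.json()["cluster_id"])
```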

#### Install RAPIDS in your notebook
### Install RAPIDS

Once your cluster has started, you can create a new notebook or open an existing one from the `/Workspace` directory, then attach it to your running cluster.

````{warning}
At the time of writing, the `databricksruntime/gpu-pytorch:cuda11.8` image does not contain the full `cuda-toolkit`, so if you selected that one you will need to install it before installing RAPIDS.
````

At the top of your notebook, run any of the following pip install commands to install your preferred RAPIDS libraries:

```
!pip install cugraph-cu11 --extra-index-url=https://pypi.nvidia.com
```

### Test RAPIDS

```python
import cudf
gdf = cudf.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
gdf
   a  b
0  1  4
1  2  5
2  3  6
```


## Multi-node Dask cluster


### Create init script

We now provide a [dask-databricks](https://pypi.org/project/dask-databricks/) CLI tool that simplifies the Dask cluster startup process in Databricks. Once it is installed, running `dask databricks run` launches a Dask scheduler on the driver node and workers on the remaining nodes within a few minutes.

To get started, you must first create an [initialization script](https://docs.databricks.com/en/init-scripts/index.html) to install Dask, RAPIDS, and other dependencies.

Databricks recommends storing all cluster-scoped init scripts using workspace files. Each user has a Home directory configured under the `/Users` directory in the workspace.


Navigate to your home directory in the UI and select **Create** > **File** from the menu to create an `init.sh` script with contents:

```bash
#!/bin/bash
export PATH="/databricks/python/bin:$PATH"

# Install RAPIDS (cudf & dask-cudf) and dask-databricks
/databricks/python/bin/pip install --extra-index-url=https://pypi.nvidia.com \
cudf-cu11 \
dask[complete] \
dask-cudf-cu11 \
dask-cuda=={rapids_version} \
dask-databricks

# Start the Dask cluster with CUDA workers
dask databricks run --cuda

```

**Note**: To launch Dask CUDA workers, you must pass the `--cuda` flag when running the command; otherwise, the script will launch standard Dask workers by default.

### Launch Dask cluster

Once your script is ready, follow the instructions in the **Launch cluster** section above, making sure to select **Multi node** instead.

Under **Advanced Options**, switch to the **Init Scripts** tab and add the file path to the init script you created in your Workspace directory, starting with `/Users`.

You can also configure cluster log delivery in the **Logging** tab, which will write the init script logs to DBFS in a subdirectory called `dbfs:/cluster-logs/<cluster-id>/init_scripts/`. Refer to the [Databricks documentation](https://docs.databricks.com/en/init-scripts/logs.html) for more information.
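
As a quick sanity check, you can list the delivered logs from a notebook once the cluster is running. This is just a sketch using the built-in `dbutils` helper; replace `<cluster-id>` with your cluster's actual ID.

```python
# List init script logs delivered to DBFS (dbutils is available in
# Databricks notebooks without an import).
log_dir = "dbfs:/cluster-logs/<cluster-id>/init_scripts/"
for entry in dbutils.fs.ls(log_dir):
    print(entry.path)
```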

### Connect to client

To test RAPIDS, connect to the Dask client and submit tasks:

```python
import cudf
import dask
import dask_databricks
from dask.distributed import Client

# Connect to the Dask cluster started by the init script
cluster = dask_databricks.DatabricksCluster()
client = Client(cluster)

df = dask.datasets.timeseries().map_partitions(cudf.from_pandas)
print(df.x.mean().compute())
```
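
Before submitting heavier work, it can be useful to confirm that the expected workers have joined and to grab the dashboard URL. These are standard `distributed.Client` attributes, shown here as a quick sanity check:

```python
# The scheduler runs on the driver node; one worker should join per
# remaining node in the cluster.
print(client.dashboard_link)
print(len(client.scheduler_info()["workers"]), "workers connected")
```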

### Clean up

```python
client.close()
cluster.close()
```

## Spark-RAPIDS Cluster

You can also use the RAPIDS Accelerator for Apache Spark 3.x on Databricks. See the [Spark RAPIDS documentation](https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-databricks.html) for more information.
