
clean up instructions for multi-node dask cluster
skirui-source committed Nov 15, 2023
1 parent 33c9dd8 commit f851e9f
Showing 1 changed file, source/platforms/databricks.md, with 40 additions and 23 deletions.
# Databricks

You can install RAPIDS on Databricks in a few different ways:

1. Accelerate machine learning workflows in a single-node GPU notebook environment
2. Install the RAPIDS Accelerator for Apache Spark 3.x on a Databricks Spark cluster
3. Install Dask alongside Spark and then use libraries like `dask-cudf` for multi-node workloads

## Single-node GPU Notebook environment

To get started with a single-node Databricks cluster, navigate to the **All Purpose Compute** tab of the **Compute** section in Databricks and select **Create Compute**. Name your cluster and choose "Single node" or "Multi-node" (if desired).
### Launch cluster

To get started with a single-node Databricks cluster, navigate to the **All Purpose Compute** tab of the **Compute** section in Databricks and select **Create Compute**. Name your cluster and choose "Single node".

![Screenshot of the Databricks compute page](../images/databricks-create-compute.png)

Once you have completed, the "GPU accelerated" nodes should be available in the **Worker type** and **Driver type** dropdown.

Select **Create Compute**
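
If you prefer to script this step, the same cluster can be created through the Databricks REST API. The sketch below is illustrative rather than official RAPIDS guidance: the workspace URL, token handling, runtime version, and node type are assumptions you should replace with values valid for your workspace.

```python
import os

import requests

# Placeholder values -- substitute your workspace URL, a GPU-enabled
# Databricks ML runtime, and a GPU node type available in your region.
HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = os.environ["DATABRICKS_TOKEN"]

cluster_spec = {
    "cluster_name": "rapids-single-node",
    "spark_version": "13.3.x-gpu-ml-scala2.12",
    "node_type_id": "g4dn.xlarge",
    "num_workers": 0,  # a single-node cluster runs the driver only
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}

response = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
response.raise_for_status()
print(response.json()["cluster_id"])
```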

#### Install RAPIDS in your notebook
### Install RAPIDS

Once your cluster has started, you can create a new notebook or open an existing one from the `/Workspace` directory, then attach it to your running cluster.

````{warning}
At the time of writing, the `databricksruntime/gpu-pytorch:cuda11.8` image does not contain the full `cuda-toolkit`, so if you selected that one you will need to install it before installing RAPIDS.
````

At the top of your notebook, run any of the following pip install commands to install your preferred RAPIDS libraries:

```
!pip install cugraph-cu11 --extra-index-url=https://pypi.nvidia.com
```

### Test RAPIDS

```python
import cudf
gdf = cudf.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
gdf
   a  b
0  1  4
1  2  5
2  3  6
```


## Multi-node Dask cluster


### Create init script

We now provide a [dask-databricks](https://pypi.org/project/dask-databricks/) CLI tool that simplifies the Dask cluster startup process in Databricks. Once it is installed, running `dask databricks run` launches a Dask scheduler on the driver node and workers on the remaining nodes within a few minutes.

To get started, you must first create an [initialization script](https://docs.databricks.com/en/init-scripts/index.html) to install Dask, RAPIDS, and other dependencies.

Databricks recommends storing all cluster-scoped init scripts using workspace files. Each user has a Home directory configured under the `/Users` directory in the workspace.


Navigate to your home directory in the UI and select **Create** > **File** from the menu to create an `init.sh` script with contents:

```bash
#!/bin/bash
export PATH="/databricks/python/bin:$PATH"

# Install RAPIDS (cudf & dask-cudf) and dask-databricks
/databricks/python/bin/pip install --extra-index-url=https://pypi.nvidia.com \
cudf-cu11 \
dask[complete] \
dask-cudf-cu11 \
dask-cuda=={rapids_version} \
dask-databricks

# Start the Dask cluster with CUDA workers
dask databricks run --cuda

```

**Note**: To launch Dask CUDA workers, you must pass the `--cuda` flag when running the command; otherwise, the script will launch standard Dask workers by default.

### Launch Dask cluster

Once your script is ready, follow the instructions in the **Launch cluster** section above, making sure to select **Multi node** instead.

Under **Advanced Options**, switch to the **Init Scripts** tab and add the file path to the init script you created in your Workspace directory, starting with `/Users`.

You can also configure cluster log delivery in the **Logging** tab, which will write the init script logs to DBFS in a subdirectory called `dbfs:/cluster-logs/<cluster-id>/init_scripts/`. Refer to the [Databricks documentation](https://docs.databricks.com/en/init-scripts/logs.html) for more information.
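
As a quick sanity check, you can list the delivered logs from a notebook once the cluster is running. This is just a sketch using the built-in `dbutils` helper; replace `<cluster-id>` with your cluster's actual ID.

```python
# List init script logs delivered to DBFS (dbutils is available in
# Databricks notebooks without an import).
log_dir = "dbfs:/cluster-logs/<cluster-id>/init_scripts/"
for entry in dbutils.fs.ls(log_dir):
    print(entry.path)
```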

### Connect to client

To test RAPIDS, connect to the Dask client and submit tasks:

```python
import cudf
import dask
import dask_databricks
from dask.distributed import Client

# Connect to the Dask cluster started by the init script
cluster = dask_databricks.DatabricksCluster()
client = Client(cluster)

df = dask.datasets.timeseries().map_partitions(cudf.from_pandas)
print(df.x.mean().compute())
```
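
Before submitting heavier work, it can be useful to confirm that the expected workers have joined and to grab the dashboard URL. These are standard `distributed.Client` attributes, shown here as a quick sanity check:

```python
# The scheduler runs on the driver node; one worker should join per
# remaining node in the cluster.
print(client.dashboard_link)
print(len(client.scheduler_info()["workers"]), "workers connected")
```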

### Clean up

```python
client.close()
cluster.close()
```

## Spark-RAPIDS Cluster

You can also use the RAPIDS Accelerator for Apache Spark 3.x on Databricks. See the [Spark RAPIDS documentation](https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-databricks.html) for more information.
