diff --git a/source/platforms/databricks.md b/source/platforms/databricks.md
index 5538ed7e..f97af3b2 100644
--- a/source/platforms/databricks.md
+++ b/source/platforms/databricks.md
@@ -1,12 +1,16 @@
 # Databricks

-You can install RAPIDS on Databricks in a few different ways. You can accelerate your machine learning workflows in a Databricks GPU notebook environment using a single-node or multi-node cluster. Spark users can install Spark-RAPIDS or alternatively install Dask alongside Spark and then use libraries like `dask-cudf` for multi-node workloads.
+You can install RAPIDS on Databricks in a few different ways:

-## Launch Databricks cluster
+1. Accelerate machine learning workflows in a single-node GPU notebook environment
+2. Install the RAPIDS Accelerator for Apache Spark 3.x to accelerate Spark workloads
+3. Install Dask alongside Spark and then use libraries like `dask-cudf` for multi-node workloads

-### Single-node compute node
+## Single-node GPU notebook environment

-To get started with a single-node Databricks cluster, navigate to the **All Purpose Compute** tab of the **Compute** section in Databricks and select **Create Compute**. Name your cluster and choose "Single node" or "Multi-node" (if desired).
+### Launch cluster
+
+To get started with a single-node Databricks cluster, navigate to the **All Purpose Compute** tab of the **Compute** section in Databricks and select **Create Compute**. Name your cluster and choose "Single node".

 ![Screenshot of the Databricks compute page](../images/databricks-create-compute.png)
@@ -22,9 +26,9 @@ Once you have completed, the "GPU accelerated" nodes should be available in the
 Select **Create Compute**

-#### Install RAPIDS in your notebook
+### Install RAPIDS

-Once your cluster has started, you can create a new notebook or open an existing one from the `/Workspace` directory and attach it to your running cluster.
+Once your cluster has started, you can create a new notebook or open an existing one from the `/Workspace` directory, then attach it to your running cluster.

 ````{warning}
 At the time of writing the `databricksruntime/gpu-pytorch:cuda11.8` image does not contain the full `cuda-toolkit` so if you selected that one you will need to install that before installing RAPIDS.
 ````
@@ -46,7 +50,7 @@ At the top of your notebook run any of the following pip install commands to ins
 !pip install cugraph-cu11 --extra-index-url=https://pypi.nvidia.com
 ```

-#### Test RAPIDS
+### Test RAPIDS

 ```python
 import cudf
@@ -59,23 +63,17 @@ gdf
    2  3  6
 ```

-### Multi-node compute node with Dask RAPIDS
-
-Switch to the **Init Scripts** tab and add the file path to the init script in your Workspace directory starting with `/Users`.
+## Multi-node Dask cluster

-You can also configure cluster log delivery in the **Logging** tab, which will write the init script logs to DBFS in a subdirectory called `dbfs:/cluster-logs/<cluster-id>/init_scripts/`. Refer to [docs](https://docs.databricks.com/en/init-scripts/logs.html) for more information.
-
-Once you have completed, the "GPU accelerated" nodes should be available in the **Worker type** and **Driver type** dropdown.
+### Create init-script

-![Screenshot of selecting a g4dn.xlarge node type](../images/databricks-ML-runtime.png)
+We now provide a [dask-databricks](https://pypi.org/project/dask-databricks/) CLI tool that simplifies the Dask cluster startup process on Databricks. Once it is installed with `pip install dask-databricks`, running `dask databricks run` launches a Dask scheduler on the driver node and workers on the remaining nodes within a few minutes.

-Select **Create Compute**.
+To get started, you must first create an [initialization script](https://docs.databricks.com/en/init-scripts/index.html) to install Dask, RAPIDS, and other dependencies.

-## DASK Rapids in Databricks MNMG Cluster
+Databricks recommends storing all cluster-scoped init scripts using workspace files. Each user has a Home directory configured under the `/Users` directory in the workspace.

-You can launch Dask RAPIDS cluster on a multi-node GPU Databricks cluster. To do this, you must first create an [initialization script](https://docs.databricks.com/en/init-scripts/index.html) to install Dask before launching the Databricks cluster.
-
-Databricks recommends storing all cluster-scoped init scripts using workspace files. Each user has a Home directory configured under the `/Users` directory in the workspace. Navigate to your home directory in the UI and select **Create** > **File** from the menu, create an `init.sh` script with contents:
+Navigate to your home directory in the UI and select **Create** > **File** from the menu to create an `init.sh` script with contents:

 ```bash
 #!/bin/bash
@@ -87,9 +85,9 @@
 export PATH="/databricks/python/bin:$PATH"

 # Install RAPIDS (cudf & dask-cudf) and dask-databricks
 /databricks/python/bin/pip install --extra-index-url=https://pypi.nvidia.com \
-  cudf-cu11 \
+  cudf-cu11 `# installs cudf` \
   dask[complete] \
-  dask-cudf-cu11 \
+  dask-cudf-cu11 `# installs dask-cudf` \
   dask-cuda=={rapids_version} \
   dask-databricks
@@ -98,7 +96,19 @@
 dask databricks run --cuda
 ```

-Connect to the dask client and submit tasks.
+**Note**: To launch Dask CUDA workers, you must pass the `--cuda` flag when running the command; otherwise the script will launch standard Dask workers by default.
+
+### Launch Dask cluster
+
+Once your script is ready, follow the instructions in the **Launch cluster** section above, making sure to select **Multi node** instead of **Single node**.
+
+Under **Advanced Options**, switch to the **Init Scripts** tab and add the file path to the init script you created in your Workspace directory starting with `/Users`.
+
+You can also configure cluster log delivery in the **Logging** tab, which will write the init script logs to DBFS in a subdirectory called `dbfs:/cluster-logs/<cluster-id>/init_scripts/`. Refer to [docs](https://docs.databricks.com/en/init-scripts/logs.html) for more information.
+
+### Connect to client
+
+To test RAPIDS, connect to the Dask client and submit tasks.

 ```python
 import dask_databricks
@@ -113,6 +123,13 @@ df = dask.datasets.timeseries().map_partitions(cudf.from_pandas)
 print(df.x.mean().compute())
 ```
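+
+As an additional, illustrative check you can load your own data into a GPU-backed Dask DataFrame with `dask_cudf`, which the init script installed above. The path and `key` column below are placeholders for your dataset:
+
+```python
+import dask_cudf
+
+# Placeholder path: point this at your own CSV files on DBFS.
+ddf = dask_cudf.read_csv("/dbfs/FileStore/my-data/*.csv")
+
+# The aggregation runs on the GPU workers launched by dask-databricks.
+print(ddf.groupby("key").size().compute())
+```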

-## Databricks Spark
+### Clean up
+
+```python
+client.close()
+cluster.close()
+```
+
+## Spark-RAPIDS cluster

 You can also use the RAPIDS Accelerator for Apache Spark 3.x on Databricks. See the [Spark RAPIDS documentation](https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-databricks.html) for more information.
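+
+For orientation, enabling the accelerator is mostly a matter of Spark configuration, set under **Advanced Options** > **Spark** when creating the GPU cluster. A minimal sketch with illustrative values follows; see the guide above for settings matched to your cluster size and runtime version:
+
+```text
+spark.plugins com.nvidia.spark.SQLPlugin
+spark.task.resource.gpu.amount 0.1
+spark.rapids.sql.concurrentGpuTasks 2
+```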