This repository holds a sample Scala project demonstrating:
- unit testing of individual methods containing Spark DataFrame transformations (in the `com.stuartdb.unittestexample.aggregationFuncsTest` class); and
- integration testing of a basic pipeline, which might consist of calls to many such methods (see the example class `com.stuartdb.unittestexample.pipelineTest`).
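As an illustration of the first point, a unit test in this style might look like the sketch below. The `totalByCategory` method, the column names, and the test data are hypothetical stand-ins rather than code taken from this repository.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.sum
import org.scalatest.FunSuite

// Hypothetical transformation under test; the real methods in
// com.stuartdb.unittestexample will differ.
object AggregationFuncs {
  def totalByCategory(df: DataFrame): DataFrame =
    df.groupBy("category").agg(sum("amount").as("total"))
}

class AggregationFuncsSketchTest extends FunSuite {
  // With Databricks Connect configured, getOrCreate() returns a session
  // backed by the remote cluster.
  private val spark = SparkSession.builder().getOrCreate()
  import spark.implicits._

  test("totalByCategory sums amounts per category") {
    val input = Seq(("a", 1L), ("a", 2L), ("b", 5L)).toDF("category", "amount")

    val result = AggregationFuncs.totalByCategory(input)
      .collect()
      .map(row => row.getString(0) -> row.getLong(1))
      .toMap

    assert(result === Map("a" -> 3L, "b" -> 5L))
  }
}
```

The integration test follows the same pattern at the pipeline level: feed in a small input (or the sample dataset), run the chained transformations end to end, and assert on the output.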
The objective is to demonstrate how tests and code can be written in a developer's local environment and executed remotely.
The project also contains a `Dockerfile` and a `buildspec.yml` file, which would be required to create an AWS CodeBuild CI project.
This could be configured to trigger build and test stages every time code is checked in to a given branch of your pipeline's repository or, perhaps more likely, whenever a pull request is raised.
Creating and configuring a pipeline such as this is left as an exercise for the curious reader.
To work through the example you will need:
- A compatible IDE (IntelliJ, VSCode, Eclipse)
- Java JDK 8, Scala 2.11.12, SBT 1.3.2
- Conda or Miniconda (the latter is recommended for simplicity)
- Databricks Connect
- The input dataset used for the example classes (stored in `/data` in this repository)
- Docker, if you want to experiment with performing build and test stages in a remote CI service such as AWS CodeBuild or CircleCI.
- Access to AWS CodeBuild if you want to import the repository as a new project.
The following steps assume a compatible IDE with Java 8 and SBT 1.3.2 is already available on your local machine.
- Stand up a new cluster running Databricks Runtime 6.0 Conda Beta in your Databricks workspace. You do not need to install any additional libraries or perform any advanced configuration.
- If you haven't already, mount an S3 bucket to the Databricks File System (DBFS) using the instructions in the Databricks documentation.
- Clone this repository to your local machine and copy the test datasets in `/datasets` to your S3 bucket. Check that you can see these files from DBFS.
- Create a new Conda environment according to the instructions in the Databricks Connect documentation.
- Activate the `dbconnect` conda environment by running `conda activate dbconnect`.
- Install Databricks Connect and run `databricks-connect configure` to enable access to a cluster running in your workspace.
- Import the sbt project into your IDE. This should automatically download the correct version of Scala and the relevant dependencies (scalatest and deequ).
- Determine the path to the Databricks Connect Spark JARs by running `databricks-connect get-jar-dir`. Update the value of `spark_libs` in the `build.sbt` file to reflect this path (see the `build.sbt` sketch after this list).
- If your IDE has integration support for scalatest, you may be able to run the tests directly from the editor (no special configuration options are required).
- Otherwise, create a run task / configuration that calls `sbt test` (to run all tests) or `sbt "testOnly com.stuartdb.unittestexample.aggregationFuncsTest"` to run just a single test class.
- Again, if your IDE has integration support for Scala classes then you can run the pipeline class directly, providing DBFS input and output paths as arguments to the `main()` method (a sketch of such an entry point follows this list).
- Otherwise, create a run task / configuration that calls `sbt "run /dbfs/path/to/input/data /dbfs/path/for/output"`.
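To make the `build.sbt` step above concrete, the relevant fragment might look like the sketch below. The path is only an example of what `databricks-connect get-jar-dir` could print, and the use of `unmanagedBase` is an assumption about how `spark_libs` is consumed; this repository's actual `build.sbt` may wire it up differently.

```scala
// Sketch of the relevant part of build.sbt (assumed layout, not the exact file).
// spark_libs must point at the directory printed by `databricks-connect get-jar-dir`.
val spark_libs = "/home/<you>/miniconda3/envs/dbconnect/lib/python3.7/site-packages/pyspark/jars"

// Add every JAR in that directory to the classpath as unmanaged dependencies.
unmanagedBase := new java.io.File(spark_libs)
```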
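For reference, a pipeline entry point matching the run command above might look like the following sketch. The object name, file formats, and pass-through transformation are hypothetical; only the pattern of taking the DBFS input and output paths as `main()` arguments comes from the steps above.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical entry point; the real pipeline class in
// com.stuartdb.unittestexample will differ in detail.
object PipelineSketch {
  def main(args: Array[String]): Unit = {
    // args(0): DBFS path to the input data; args(1): DBFS path for the output,
    // exactly as passed via sbt "run <input> <output>".
    val Array(inputPath, outputPath) = args

    val spark = SparkSession.builder().getOrCreate()

    // Read the input, apply the (unit-tested) transformation methods, write the result.
    val input = spark.read.option("header", "true").csv(inputPath)
    val result = input // the real pipeline would chain its transformation methods here
    result.write.mode("overwrite").parquet(outputPath)
  }
}
```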