This repository holds a sample Scala project demonstrating:
- unit testing of individual methods containing Spark DataFrame transformations (in the `com.stuartdb.unittestexample.aggregationFuncsTest` class); and
- integration testing of a basic pipeline, which might consist of calls to many such methods (see the example class `com.stuartdb.unittestexample.pipelineTest`).
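As an illustration of the first point, a unit test in this style might look like the sketch below. The `totalByCategory` method, the column names, and the test data are hypothetical stand-ins rather than code taken from this repository.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.sum
import org.scalatest.FunSuite

// Hypothetical transformation under test; the real methods in
// com.stuartdb.unittestexample will differ.
object AggregationFuncs {
  def totalByCategory(df: DataFrame): DataFrame =
    df.groupBy("category").agg(sum("amount").as("total"))
}

class AggregationFuncsSketchTest extends FunSuite {
  // With Databricks Connect configured, getOrCreate() returns a session
  // backed by the remote cluster.
  private val spark = SparkSession.builder().getOrCreate()
  import spark.implicits._

  test("totalByCategory sums amounts per category") {
    val input = Seq(("a", 1L), ("a", 2L), ("b", 5L)).toDF("category", "amount")

    val result = AggregationFuncs.totalByCategory(input)
      .collect()
      .map(row => row.getString(0) -> row.getLong(1))
      .toMap

    assert(result === Map("a" -> 3L, "b" -> 5L))
  }
}
```

The integration test follows the same pattern at the pipeline level: feed in a small input (or the sample dataset), run the chained transformations end to end, and assert on the output.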
The objective is to demonstrate how tests and code can be written in a developer's local environment and executed remotely.
The project also contains a `Dockerfile` and a `buildspec.yml` file, which would be required to create an AWS CodeBuild CI project.
This could be configured to trigger build and test stages every time code is checked in to a given branch of your pipeline's repository or, perhaps more likely, whenever a pull request is raised.
Creating and configuring a pipeline such as this is left as an exercise for the curious reader.
To work through the example you will need:
- A compatible IDE (IntelliJ, VSCode, Eclipse)
- Java JDK 8, Scala 2.11.12, SBT 1.3.2
- Conda or Miniconda (the latter is recommended for simplicity)
- Databricks Connect
- The input dataset used for the example classes (stored in `/data` in this repository)
- Docker, if you want to experiment with performing build and test stages in a remote CI service such as AWS CodeBuild or CircleCI.
- Access to AWS CodeBuild if you want to import the repository as a new project.
The following steps assume a compatible IDE with Java 8 and SBT 1.3.2 is already available on your local machine.
- Stand up a new cluster running Databricks Runtime 6.0 Conda Beta in your Databricks workspace. You do not need to install any additional libraries or perform any advanced configuration.
- If you haven't already, mount an S3 bucket to the Databricks File System (DBFS) using the instructions in the Databricks documentation.
- Clone this repository to your local machine and copy the test datasets in `/datasets` to your S3 bucket. Check that you can see these files from DBFS.
- Create a new Conda environment according to the instructions in the Databricks Connect documentation.
- Activate the `dbconnect` conda environment by running `conda activate dbconnect`.
- Install Databricks Connect and run `databricks-connect configure` to enable access to a cluster running in your workspace.
- Import the sbt project into your IDE. This should automatically download the correct version of Scala and the relevant dependencies (scalatest and deequ).
- Determine the path to the Databricks Connect Spark JARs by running `databricks-connect get-jar-dir`. Update the value of `spark_libs` in the `build.sbt` file to reflect this path (see the `build.sbt` sketch after this list).
- If your IDE has integration support for scalatest, you may be able to run the tests directly from the editor (no special configuration options are required).
- Otherwise, create a run task / configuration that calls `sbt test` (to run all tests) or `sbt "testOnly com.stuartdb.unittestexample.aggregationFuncsTest"` to run just a single test class.
- Again, if your IDE has integration support for Scala classes then you can run the pipeline class directly, providing DBFS input and output paths as arguments to the `main()` method (a sketch of such an entry point follows this list).
- Otherwise, create a run task / configuration that calls `sbt "run /dbfs/path/to/input/data /dbfs/path/for/output"`.
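To make the `build.sbt` step above concrete, the relevant fragment might look like the sketch below. The path is only an example of what `databricks-connect get-jar-dir` could print, and the use of `unmanagedBase` is an assumption about how `spark_libs` is consumed; this repository's actual `build.sbt` may wire it up differently.

```scala
// Sketch of the relevant part of build.sbt (assumed layout, not the exact file).
// spark_libs must point at the directory printed by `databricks-connect get-jar-dir`.
val spark_libs = "/home/<you>/miniconda3/envs/dbconnect/lib/python3.7/site-packages/pyspark/jars"

// Add every JAR in that directory to the classpath as unmanaged dependencies.
unmanagedBase := new java.io.File(spark_libs)
```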
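For reference, a pipeline entry point matching the run command above might look like the following sketch. The object name, file formats, and pass-through transformation are hypothetical; only the pattern of taking the DBFS input and output paths as `main()` arguments comes from the steps above.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical entry point; the real pipeline class in
// com.stuartdb.unittestexample will differ in detail.
object PipelineSketch {
  def main(args: Array[String]): Unit = {
    // args(0): DBFS path to the input data; args(1): DBFS path for the output,
    // exactly as passed via sbt "run <input> <output>".
    val Array(inputPath, outputPath) = args

    val spark = SparkSession.builder().getOrCreate()

    // Read the input, apply the (unit-tested) transformation methods, write the result.
    val input = spark.read.option("header", "true").csv(inputPath)
    val result = input // the real pipeline would chain its transformation methods here
    result.write.mode("overwrite").parquet(outputPath)
  }
}
```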