RTDIP Pipelines Documentation (#132)
* Setup Pipelines Documentation

Signed-off-by: GBBBAS <[email protected]>

* Documentation Updates

Signed-off-by: GBBBAS <[email protected]>

---------

Signed-off-by: GBBBAS <[email protected]>
GBBBAS authored Apr 3, 2023
1 parent b45447c commit 8997747
Showing 19 changed files with 566 additions and 31 deletions.
2 changes: 1 addition & 1 deletion docs/blog/rtdip_ingestion_pipelines.md
@@ -188,7 +188,7 @@ Edge components are designed to provide a lightweight, low latency, low resource

|Edge Type|Azure IoT Edge|AWS Greengrass|Target|
|---------|--------------|--------------|------|
| OPC Publisher|:heavy_check_mark:||Q3-Q4 2023|
| OPC CloudPublisher|:heavy_check_mark:||Q3-Q4 2023|
| Greengrass OPC UA||:heavy_check_mark:|Q4 2023|

## Conclusion
60 changes: 49 additions & 11 deletions docs/getting-started/installation.md
@@ -38,7 +38,7 @@ Installing the RTDIP can be done using a package installer, such as [Pip](https:
=== "Conda"

Check which version of Conda is installed with the following command:

conda --version

If necessary, upgrade Conda as follows:
@@ -64,27 +64,61 @@ If you plan to use pyodbc, Microsoft Visual C++ 14.0 or greater is required. Get
#### Turbodbc
To use the turbodbc python library, follow the [Turbodbc Getting Started](https://turbodbc.readthedocs.io/en/latest/pages/getting_started.html) section and ensure that [Boost](https://turbodbc.readthedocs.io/en/latest/pages/getting_started.html) is installed correctly.
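
As an optional sanity check (a suggestion rather than an official installation step), the short snippet below imports both libraries and lists the ODBC drivers visible to pyodbc; if the imports succeed, the native prerequisites described above are in place.

```python
# Optional sanity check: confirm pyodbc and turbodbc import cleanly and list ODBC drivers.
from importlib.metadata import version

import pyodbc
import turbodbc

print("pyodbc", version("pyodbc"), "- ODBC drivers found:", pyodbc.drivers())
print("turbodbc", version("turbodbc"), "imported successfully")
```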

### Java
If you are planning to use the RTDIP Pipelines in your own environment that leverages [pyspark](https://spark.apache.org/docs/latest/api/python/getting_started/install.html) for a component, Java 8 or later is a [prerequisite](https://spark.apache.org/docs/latest/api/python/getting_started/install.html#dependencies). See below for suggestions to install Java in your development environment.

=== "Conda"
A fairly simple option is to use the conda **openjdk** package to install Java into your python virtual environment. An example of a conda **environment.yml** file to achieve this is below.

```yaml
name: rtdip-sdk
channels:
  - conda-forge
  - defaults
dependencies:
  - python==3.10
  - pip==23.0.1
  - openjdk==11.0.15
  - pip:
      - rtdip-sdk
```

!!! note "Pypi"
This package is not available from Pypi.

=== "Java"
Follow the official Java JDK installation documentation [here.](https://docs.oracle.com/en/java/javase/11/install/overview-jdk-installation.html)

- [Windows](https://docs.oracle.com/en/java/javase/11/install/installation-jdk-microsoft-windows-platforms.html)
- [Mac OS](https://docs.oracle.com/en/java/javase/11/install/installation-jdk-macos.html)
- [Linux](https://docs.oracle.com/en/java/javase/11/install/installation-jdk-linux-platforms.html)

!!! note "Windows"
Windows requires an additional installation of a file called **winutils.exe**. Please see this [repo](https://github.com/steveloughran/winutils) for more information.
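
Whichever installation route you choose, the quick check below (a suggestion, not an official installation step) confirms that a Java runtime is visible to the environment that will run pyspark components.

```python
# Check that a Java runtime is available on the PATH for pyspark to use.
import subprocess

result = subprocess.run(["java", "-version"], capture_output=True, text=True)
# `java -version` prints its version banner to stderr, not stdout.
print(result.stderr or result.stdout)
```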

## Installing the RTDIP SDK

RTDIP SDK is a PyPI package that can be found [here](https://pypi.org/project/rtdip-sdk/). On this page you can find the **project description**, **release history**, **statistics**, **project links** and **maintainers**.

Features of the SDK can be installed using different extras statements when installing the **rtdip-sdk** package:

=== "Queries"
When installing the package only for querying data, simply specify the following in your preferred python package installer:

rtdip-sdk

=== "Pipelines"
RTDIP SDK can be installed to include the packages required to build, execute and deploy pipelines. Specify the following extra **[pipelines]** when installing RTDIP SDK so that the required python packages are included during installation.

rtdip-sdk[pipelines]

=== "Pipelines + Pyspark"
RTDIP SDK can also execute pyspark functions as a part of the pipelines functionality. Specify the following extra **[pipelines,pyspark]** when installing RTDIP SDK so that the required pyspark python packages are included during installation.

rtdip-sdk[pipelines,pyspark]

!!! note "Java"
Ensure that Java is installed prior to installing the rtdip-sdk with the **[pipelines,pyspark]**. See [here](#java) for more information.

The following provides examples of how to install the RTDIP SDK package with Pip, Conda or Micromamba. Refer to the section above for any extras to include when installing the RTDIP SDK.

@@ -101,16 +135,18 @@ The following provides examples of how to install the RTDIP SDK package with Pip
=== "Conda"

To create an environment, you will need to create an **environment.yml** file with the following:


```yaml
name: rtdip-sdk
channels:
  - conda-forge
  - defaults
dependencies:
  - python==3.10
  - pip==23.0.1
  - pip:
      - rtdip-sdk
```

Run the following command:

@@ -124,15 +160,17 @@

To create an environment, you will need to create an **environment.yml** file with the following:

```yaml
name: rtdip-sdk
channels:
  - conda-forge
  - defaults
dependencies:
  - python==3.10
  - pip==23.0.1
  - pip:
      - rtdip-sdk
```

Run the following command:

2 changes: 1 addition & 1 deletion docs/integration/power-bi.md
@@ -5,7 +5,7 @@
Microsoft Power BI is a business analytics service that provides interactive visualizations with self-service business intelligence capabilities
that enable end users to create reports and dashboards by themselves without having to depend on information technology staff or database administrators.

<center>![Power BI Databricks](images/databricks_powerbi.png){width=50%}</center>

When you use Azure Databricks as a data source with Power BI, you can bring the advantages of Azure Databricks performance and technology beyond data scientists and data engineers to all business users.

25 changes: 25 additions & 0 deletions docs/sdk/authentication/databricks.md
@@ -0,0 +1,25 @@
# Databricks

Databricks supports authentication using Personal Access Tokens (PAT); more information about this authentication method is available [here.](https://docs.databricks.com/dev-tools/api/latest/authentication.html)

## Authentication

To generate a Databricks PAT Token, follow this [guide](https://docs.databricks.com/dev-tools/api/latest/authentication.html#generate-a-personal-access-token) and ensure that the token is stored securely and is never used directly in code.

Your Databricks PAT token can be used in the RTDIP SDK to authenticate with any Databricks Workspace or Databricks SQL Warehouse; simply provide it in the `access_token` fields wherever tokens are required in the RTDIP SDK.
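
For example, one way to keep the token out of source code is to read it from an environment variable at runtime; the variable name `DATABRICKS_PAT` below is purely illustrative.

```python
import os

# Read the Databricks PAT from an environment variable rather than hardcoding it.
# The variable name DATABRICKS_PAT is illustrative; use whatever your environment provides.
access_token = os.environ["DATABRICKS_PAT"]
```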

## Example

Below is an example of using a Databricks PAT Token for authenticating with a Databricks SQL Warehouse.

```python
from rtdip_sdk.odbc import db_sql_connector

server_hostname = "server_hostname"
http_path = "http_path"
access_token = "dbapi......."

connection = db_sql_connector.DatabricksSQLConnection(server_hostname, http_path, access_token)
```

Replace **server_hostname** and **http_path** with your own information and specify your Databricks PAT token for the **access_token**.
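
If the returned connection follows a PEP 249 style interface, which is an assumption here rather than something this page confirms, running a query might look like the sketch below; check the SDK code reference for the exact methods.

```python
# Sketch only: continues the example above and assumes the connection
# provides cursor(), execute(), fetchall() and close().
cursor = connection.cursor()
cursor.execute("SELECT 1")
print(cursor.fetchall())
cursor.close()
connection.close()
```
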
5 changes: 0 additions & 5 deletions docs/sdk/overview.md
@@ -11,8 +11,3 @@
## Installation

To get started with the RTDIP SDK, follow these [installation instructions.](../getting-started/installation.md)





89 changes: 89 additions & 0 deletions docs/sdk/pipelines/components.md
@@ -0,0 +1,89 @@
# Pipeline Components

## Overview

The Real Time Data Ingestion Pipeline Framework supports the following component types:

- Sources - connectors to source systems
- Transformers - perform transformations on data, including data cleansing, data enrichment, data aggregation, data masking, data encryption, data decryption, data validation, data conversion, data normalization, data de-normalization, data partitioning, etc.
- Destinations - connectors to sink/destination systems
- Utilities - components that perform utility functions such as logging, error handling, data object creation, authentication, maintenance, etc.
- Secrets - components that facilitate access to secret stores where sensitive information such as passwords, connection strings and keys is stored

## Component Types

|Python|Apache Spark|Databricks|
|---------------------------|----------------------|--------------------------------------------------|
|![python](images/python.png)|![pyspark](images/apachespark.png)|![databricks](images/databricks_horizontal.png)|

Component Types determine system requirements to execute the component:

- Python - components that are written in python and can be executed on a python runtime
- Pyspark - components that are written in pyspark and can be executed on an open source Apache Spark runtime
- Databricks - components that require a Databricks runtime

!!! note "Note"
RTDIP is continuously adding more to this list. For detailed information on timelines, read this [blog post](../../blog/rtdip_ingestion_pipelines.md) and check back on this page regularly.

### Sources

Sources are components that connect to source systems and extract data from them. These will typically be real time data sources, but batch components are also supported, as batch sources remain important and necessary sources of time series data in a number of real world circumstances.

|Source Type|Python|Apache Spark|Databricks|Azure|AWS|
|---------------------------|----------------------|--------------------|----------------------|----------------------|---------|
|[Delta](../code-reference/pipelines/sources/spark/delta.md)||:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:|
|[Delta Sharing](../code-reference/pipelines/sources/spark/delta_sharing.md)||:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:|
|[Autoloader](../code-reference/pipelines/sources/spark/autoloader.md)|||:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:|
|[Eventhub](../code-reference/pipelines/sources/spark/eventhub.md)||:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:||

!!! note "Note"
This list will dynamically change as the framework is further developed and new components are added.

### Transformers

Transformers are components that perform transformations on data. These target certain data models and the common transformations that source or destination components require to be performed on data before it can be ingested or consumed.

|Transformer Type|Python|Apache Spark|Databricks|Azure|AWS|
|---------------------------|----------------------|--------------------|----------------------|----------------------|---------|
|[Eventhub Body](../code-reference/pipelines/transformers/spark/eventhub.md)||:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:||

!!! note "Note"
This list will dynamically change as the framework is further developed and new components are added.

### Destinations

Destinations are components that connect to sink/destination systems and write data to them.

|Destination Type|Python|Apache Spark|Databricks|Azure|AWS|
|---------------------------|----------------------|--------------------|----------------------|----------------------|---------|
|[Delta Append](../code-reference/pipelines/destinations/spark/delta.md)||:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:|
|[Eventhub](../code-reference/pipelines/destinations/spark/eventhub.md)||:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:||

!!! note "Note"
This list will dynamically change as the framework is further developed and new components are added.

### Utilities

Utilities are components that perform utility functions such as logging, error handling, data object creation, authentication and maintenance, and are normally components that can be executed either as part of a pipeline or standalone.

|Utility Type|Python|Apache Spark|Databricks|Azure|AWS|
|---------------------------|----------------------|--------------------|----------------------|----------------------|---------|
|[Delta Table Create](../code-reference/pipelines/utilities/spark/delta_table_create.md)||:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:|

!!! note "Note"
This list will dynamically change as the framework is further developed and new components are added.

### Secrets

Secrets are components that perform functions to interact with secret stores to manage sensitive information such as passwords, keys and certificates.

|Secret Type|Python|Apache Spark|Databricks|Azure|AWS|
|---------------------------|----------------------|--------------------|----------------------|----------------------|---------|
|[Databricks Secret Scopes](../code-reference/pipelines/secrets/databricks.md)|||:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:|

!!! note "Note"
This list will dynamically change as the framework is further developed and new components are added.

## Conclusion

Components can be used to build RTDIP Pipelines, which are described in more detail [here.](jobs.md)
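
As a purely conceptual sketch of how these component types fit together, the classes below are hypothetical stand-ins and not classes from the RTDIP SDK itself; data is read from a source, transformed and written to a destination.

```python
# Conceptual illustration only: hypothetical stand-ins for the component types above,
# not classes from the RTDIP SDK itself.
class ExampleSource:
    """Source: connects to a source system and extracts data."""
    def read(self) -> list[dict]:
        return [{"tag": "sensor_1", "value": "42"}]

class ExampleTransformer:
    """Transformer: reshapes data before it reaches a destination."""
    def transform(self, rows: list[dict]) -> list[dict]:
        return [{**row, "value": float(row["value"])} for row in rows]

class ExampleDestination:
    """Destination: writes data to a sink system."""
    def write(self, rows: list[dict]) -> None:
        for row in rows:
            print(row)

# Wire the components together: source -> transformer -> destination.
ExampleDestination().write(ExampleTransformer().transform(ExampleSource().read()))
```
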
42 changes: 42 additions & 0 deletions docs/sdk/pipelines/deploy/apache-airflow.md
@@ -0,0 +1,42 @@
# Apache Airflow

## Databricks Provider

Apache Airflow can orchestrate an RTDIP Pipeline that has been deployed as a Databricks Job. For further information on how to deploy an RTDIP Pipeline as a Databricks Job, please see [here.](databricks.md)

Databricks has also provided more information about running Databricks jobs from Apache Airflow [here.](https://docs.databricks.com/workflows/jobs/how-to/use-airflow-with-jobs.html)

### Prerequisites

1. An Apache Airflow instance must be running.
1. Authentication between Apache Airflow and Databricks must be [configured.](https://docs.databricks.com/workflows/jobs/how-to/use-airflow-with-jobs.html#create-a-databricks-personal-access-token-for-airflow)
1. The python packages `apache-airflow` and `apache-airflow-providers-databricks` must be installed.
1. You have created an [RTDIP Pipeline and deployed it to Databricks.](databricks.md)


### Example

The `JOB_ID` in the example below can be obtained from the Databricks Job.

```python
from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator
from airflow.utils.dates import days_ago
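
# JOB_ID identifies the Databricks Job created when the RTDIP Pipeline was deployed.
# The value below is a placeholder; replace it with your own Job ID.
JOB_ID = "<databricks-job-id>"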

default_args = {
'owner': 'airflow'
}

with DAG('databricks_dag',
start_date = days_ago(2),
schedule_interval = None,
default_args = default_args
) as dag:

opr_run_now = DatabricksRunNowOperator(
task_id = 'run_now',
databricks_conn_id = 'databricks_default',
job_id = JOB_ID
)
```
