RTDIP Pipelines Documentation (#132)
* Setup Pipelines Documentation

Signed-off-by: GBBBAS <[email protected]>

* Documentation Updates

Signed-off-by: GBBBAS <[email protected]>

---------

Signed-off-by: GBBBAS <[email protected]>
GBBBAS authored Apr 3, 2023
1 parent b45447c commit 8997747
Showing 19 changed files with 566 additions and 31 deletions.
2 changes: 1 addition & 1 deletion docs/blog/rtdip_ingestion_pipelines.md
@@ -188,7 +188,7 @@ Edge components are designed to provide a lightweight, low latency, low resource

|Edge Type|Azure IoT Edge|AWS Greengrass|Target|
|---------|--------------|--------------|------|
| OPC Publisher|:heavy_check_mark:||Q3-Q4 2023|
| OPC CloudPublisher|:heavy_check_mark:||Q3-Q4 2023|
| Greengrass OPC UA||:heavy_check_mark:|Q4 2023|

## Conclusion
60 changes: 49 additions & 11 deletions docs/getting-started/installation.md
@@ -38,7 +38,7 @@ Installing the RTDIP can be done using a package installer, such as [Pip](https:
=== "Conda"

Check which version of Conda is installed with the following command:

conda --version

If necessary, upgrade Conda as follows:
@@ -64,27 +64,61 @@ If you plan to use pyodbc, Microsoft Visual C++ 14.0 or greater is required. Get
#### Turbodbc
To use the turbodbc python library, follow the [Turbodbc Getting Started](https://turbodbc.readthedocs.io/en/latest/pages/getting_started.html) section and ensure that [Boost](https://turbodbc.readthedocs.io/en/latest/pages/getting_started.html) is installed correctly.
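
As an optional sanity check (a suggestion rather than an official installation step), the short snippet below imports both libraries and lists the ODBC drivers visible to pyodbc; if the imports succeed, the native prerequisites described above are in place.

```python
# Optional sanity check: confirm pyodbc and turbodbc import cleanly and list ODBC drivers.
from importlib.metadata import version

import pyodbc
import turbodbc

print("pyodbc", version("pyodbc"), "- ODBC drivers found:", pyodbc.drivers())
print("turbodbc", version("turbodbc"), "imported successfully")
```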

### Java
If you are planning to use the RTDIP Pipelines in your own environment that leverages [pyspark](https://spark.apache.org/docs/latest/api/python/getting_started/install.html) for a component, Java 8 or later is a [prerequisite](https://spark.apache.org/docs/latest/api/python/getting_started/install.html#dependencies). See below for suggestions to install Java in your development environment.

=== "Conda"
A fairly simple option is to use the conda **openjdk** package to install Java into your python virtual environment. An example of a conda **environment.yml** file to achieve this is below.

```yaml
name: rtdip-sdk
channels:
  - conda-forge
  - defaults
dependencies:
  - python==3.10
  - pip==23.0.1
  - openjdk==11.0.15
  - pip:
      - rtdip-sdk
```

!!! note "Pypi"
This package is not available from Pypi.

=== "Java"
Follow the official Java JDK installation documentation [here.](https://docs.oracle.com/en/java/javase/11/install/overview-jdk-installation.html)

- [Windows](https://docs.oracle.com/en/java/javase/11/install/installation-jdk-microsoft-windows-platforms.html)
- [Mac OS](https://docs.oracle.com/en/java/javase/11/install/installation-jdk-macos.html)
- [Linux](https://docs.oracle.com/en/java/javase/11/install/installation-jdk-linux-platforms.html)

!!! note "Windows"
Windows requires an additional installation of a file called **winutils.exe**. Please see this [repo](https://github.com/steveloughran/winutils) for more information.
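
Whichever installation route you choose, the quick check below (a suggestion, not an official installation step) confirms that a Java runtime is visible to the environment that will run pyspark components.

```python
# Check that a Java runtime is available on the PATH for pyspark to use.
import subprocess

result = subprocess.run(["java", "-version"], capture_output=True, text=True)
# `java -version` prints its version banner to stderr, not stdout.
print(result.stderr or result.stdout)
```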

## Installing the RTDIP SDK

RTDIP SDK is a PyPI package that can be found [here](https://pypi.org/project/rtdip-sdk/). On this page you can find the **project description**, **release history**, **statistics**, **project links** and **maintainers**.

Features of the SDK can be installed using different extras statements when installing the **rtdip-sdk** package:

=== "Queries"
When installing the package only for querying data, simply specify the following in your preferred python package installer:

rtdip-sdk

=== "Pipelines"
RTDIP SDK can be installed to include the packages required to build, execute and deploy pipelines. Specify the following extra **[pipelines]** when installing RTDIP SDK so that the required python packages are included during installation.

rtdip-sdk[pipelines]

=== "Pipelines + Pyspark"
RTDIP SDK can also execute pyspark functions as a part of the pipelines functionality. Specify the following extra **[pipelines,pyspark]** when installing RTDIP SDK so that the required pyspark python packages are included during installation.

rtdip-sdk[pipelines,pyspark]

!!! note "Java"
Ensure that Java is installed prior to installing the rtdip-sdk with the **[pipelines,pyspark]**. See [here](#java) for more information.

The following provides examples of how to install the RTDIP SDK package with Pip, Conda or Micromamba. Refer to the section above for any extras to include when installing the RTDIP SDK.

@@ -101,16 +135,18 @@ The following provides examples of how to install the RTDIP SDK package with Pip
=== "Conda"

To create an environment, you will need to create an **environment.yml** file with the following:


```yaml
name: rtdip-sdk
channels:
  - conda-forge
  - defaults
dependencies:
  - python==3.10
  - pip==23.0.1
  - pip:
      - rtdip-sdk
```

Run the following command:

@@ -124,15 +160,17 @@

To create an environment, you will need to create an **environment.yml** file with the following:

```yaml
name: rtdip-sdk
channels:
  - conda-forge
  - defaults
dependencies:
  - python==3.10
  - pip==23.0.1
  - pip:
      - rtdip-sdk
```

Run the following command:

2 changes: 1 addition & 1 deletion docs/integration/power-bi.md
@@ -5,7 +5,7 @@
Microsoft Power BI is a business analytics service that provides interactive visualizations with self-service business intelligence capabilities
that enable end users to create reports and dashboards by themselves without having to depend on information technology staff or database administrators.

<center>![Power BI Databricks](images/databricks_powerbi.png){width=50%}</center>

When you use Azure Databricks as a data source with Power BI, you can bring the advantages of Azure Databricks performance and technology beyond data scientists and data engineers to all business users.

25 changes: 25 additions & 0 deletions docs/sdk/authentication/databricks.md
@@ -0,0 +1,25 @@
# Databricks

Databricks supports authentication using Personal Access Tokens (PAT); more information about this authentication method is available [here.](https://docs.databricks.com/dev-tools/api/latest/authentication.html)

## Authentication

To generate a Databricks PAT Token, follow this [guide](https://docs.databricks.com/dev-tools/api/latest/authentication.html#generate-a-personal-access-token) and ensure that the token is stored securely and is never used directly in code.

Your Databricks PAT token can be used in the RTDIP SDK to authenticate with any Databricks Workspace or Databricks SQL Warehouse; simply provide it in the `access_token` fields wherever tokens are required in the RTDIP SDK.
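
For example, one way to keep the token out of source code is to read it from an environment variable at runtime; the variable name `DATABRICKS_PAT` below is purely illustrative.

```python
import os

# Read the Databricks PAT from an environment variable rather than hardcoding it.
# The variable name DATABRICKS_PAT is illustrative; use whatever your environment provides.
access_token = os.environ["DATABRICKS_PAT"]
```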

## Example

Below is an example of using a Databricks PAT Token for authenticating with a Databricks SQL Warehouse.

```python
from rtdip_sdk.odbc import db_sql_connector

server_hostname = "server_hostname"
http_path = "http_path"
access_token = "dbapi......."

connection = db_sql_connector.DatabricksSQLConnection(server_hostname, http_path, access_token)
```

Replace **server_hostname** and **http_path** with your own information and specify your Databricks PAT token for the **access_token**.
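
If the returned connection follows a PEP 249 style interface, which is an assumption here rather than something this page confirms, running a query might look like the sketch below; check the SDK code reference for the exact methods.

```python
# Sketch only: continues the example above and assumes the connection
# provides cursor(), execute(), fetchall() and close().
cursor = connection.cursor()
cursor.execute("SELECT 1")
print(cursor.fetchall())
cursor.close()
connection.close()
```
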
5 changes: 0 additions & 5 deletions docs/sdk/overview.md
@@ -11,8 +11,3 @@
## Installation

To get started with the RTDIP SDK, follow these [installation instructions.](../getting-started/installation.md)





89 changes: 89 additions & 0 deletions docs/sdk/pipelines/components.md
@@ -0,0 +1,89 @@
# Pipeline Components

## Overview

The Real Time Data Ingestion Pipeline Framework supports the following component types:

- Sources - connectors to source systems
- Transformers - perform transformations on data, including data cleansing, data enrichment, data aggregation, data masking, data encryption, data decryption, data validation, data conversion, data normalization, data de-normalization, data partitioning, etc.
- Destinations - connectors to sink/destination systems
- Utilities - components that perform utility functions such as logging, error handling, data object creation, authentication, maintenance, etc.
- Secrets - components that facilitate access to secret stores where sensitive information such as passwords, connection strings and keys is stored

## Component Types

|Python|Apache Spark|Databricks|
|---------------------------|----------------------|--------------------------------------------------|
|![python](images/python.png)|![pyspark](images/apachespark.png)|![databricks](images/databricks_horizontal.png)|

Component Types determine system requirements to execute the component:

- Python - components that are written in python and can be executed on a python runtime
- Pyspark - components that are written in pyspark and can be executed on an open source Apache Spark runtime
- Databricks - components that require a Databricks runtime

!!! note "Note"
RTDIP is continuously adding more to this list. For detailed information on timelines, read this [blog post](../../blog/rtdip_ingestion_pipelines.md) and check back on this page regularly.

### Sources

Sources are components that connect to source systems and extract data from them. These will typically be real time data sources, but batch components are also supported, as batch sources remain important and necessary sources of time series data in a number of real world circumstances.

|Source Type|Python|Apache Spark|Databricks|Azure|AWS|
|---------------------------|----------------------|--------------------|----------------------|----------------------|---------|
|[Delta](../code-reference/pipelines/sources/spark/delta.md)||:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:|
|[Delta Sharing](../code-reference/pipelines/sources/spark/delta_sharing.md)||:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:|
|[Autoloader](../code-reference/pipelines/sources/spark/autoloader.md)|||:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:|
|[Eventhub](../code-reference/pipelines/sources/spark/eventhub.md)||:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:||

!!! note "Note"
This list will dynamically change as the framework is further developed and new components are added.

### Transformers

Transformers are components that perform transformations on data. These target certain data models and the common transformations that source or destination components require to be performed on data before it can be ingested or consumed.

|Transformer Type|Python|Apache Spark|Databricks|Azure|AWS|
|---------------------------|----------------------|--------------------|----------------------|----------------------|---------|
|[Eventhub Body](../code-reference/pipelines/transformers/spark/eventhub.md)||:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:||

!!! note "Note"
This list will dynamically change as the framework is further developed and new components are added.

### Destinations

Destinations are components that connect to sink/destination systems and write data to them.

|Destination Type|Python|Apache Spark|Databricks|Azure|AWS|
|---------------------------|----------------------|--------------------|----------------------|----------------------|---------|
|[Delta Append](../code-reference/pipelines/destinations/spark/delta.md)||:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:|
|[Eventhub](../code-reference/pipelines/destinations/spark/eventhub.md)||:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:||

!!! note "Note"
This list will dynamically change as the framework is further developed and new components are added.

### Utilities

Utilities are components that perform utility functions such as logging, error handling, data object creation, authentication and maintenance, and are normally components that can be executed either as part of a pipeline or standalone.

|Utility Type|Python|Apache Spark|Databricks|Azure|AWS|
|---------------------------|----------------------|--------------------|----------------------|----------------------|---------|
|[Delta Table Create](../code-reference/pipelines/utilities/spark/delta_table_create.md)||:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:|

!!! note "Note"
This list will dynamically change as the framework is further developed and new components are added.

### Secrets

Secrets are components that perform functions to interact with secret stores to manage sensitive information such as passwords, keys and certificates.

|Secret Type|Python|Apache Spark|Databricks|Azure|AWS|
|---------------------------|----------------------|--------------------|----------------------|----------------------|---------|
|[Databricks Secret Scopes](../code-reference/pipelines/secrets/databricks.md)|||:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:|

!!! note "Note"
This list will dynamically change as the framework is further developed and new components are added.

## Conclusion

Components can be used to build RTDIP Pipelines, which are described in more detail [here.](jobs.md)
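
As a purely conceptual sketch of how these component types fit together, the classes below are hypothetical stand-ins and not classes from the RTDIP SDK itself; data is read from a source, transformed and written to a destination.

```python
# Conceptual illustration only: hypothetical stand-ins for the component types above,
# not classes from the RTDIP SDK itself.
class ExampleSource:
    """Source: connects to a source system and extracts data."""
    def read(self) -> list[dict]:
        return [{"tag": "sensor_1", "value": "42"}]

class ExampleTransformer:
    """Transformer: reshapes data before it reaches a destination."""
    def transform(self, rows: list[dict]) -> list[dict]:
        return [{**row, "value": float(row["value"])} for row in rows]

class ExampleDestination:
    """Destination: writes data to a sink system."""
    def write(self, rows: list[dict]) -> None:
        for row in rows:
            print(row)

# Wire the components together: source -> transformer -> destination.
ExampleDestination().write(ExampleTransformer().transform(ExampleSource().read()))
```
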
42 changes: 42 additions & 0 deletions docs/sdk/pipelines/deploy/apache-airflow.md
@@ -0,0 +1,42 @@
# Apache Airflow

## Databricks Provider

Apache Airflow can orchestrate an RTDIP Pipeline that has been deployed as a Databricks Job. For further information on how to deploy an RTDIP Pipeline as a Databricks Job, please see [here.](databricks.md)

Databricks has also provided more information about running Databricks jobs from Apache Airflow [here.](https://docs.databricks.com/workflows/jobs/how-to/use-airflow-with-jobs.html)

### Prerequisites

1. An Apache Airflow instance must be running.
1. Authentication between Apache Airflow and Databricks must be [configured.](https://docs.databricks.com/workflows/jobs/how-to/use-airflow-with-jobs.html#create-a-databricks-personal-access-token-for-airflow)
1. The python packages `apache-airflow` and `apache-airflow-providers-databricks` must be installed.
1. You have created an [RTDIP Pipeline and deployed it to Databricks.](databricks.md)


### Example

The `JOB_ID` in the example below can be obtained from the Databricks Job.

```python
from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator
from airflow.utils.dates import days_ago
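
# JOB_ID identifies the Databricks Job created when the RTDIP Pipeline was deployed.
# The value below is a placeholder; replace it with your own Job ID.
JOB_ID = "<databricks-job-id>"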

default_args = {
'owner': 'airflow'
}

with DAG('databricks_dag',
start_date = days_ago(2),
schedule_interval = None,
default_args = default_args
) as dag:

opr_run_now = DatabricksRunNowOperator(
task_id = 'run_now',
databricks_conn_id = 'databricks_default',
job_id = JOB_ID
)
```
