RTDIP Pipelines Documentation (#132)
* Setup Pipelines Documentation

  Signed-off-by: GBBBAS <[email protected]>

* Documentation Updates

  Signed-off-by: GBBBAS <[email protected]>
Showing 19 changed files with 566 additions and 31 deletions.
# Databricks

Databricks supports authentication using Personal Access Tokens (PAT). Information about this authentication method is available [here.](https://docs.databricks.com/dev-tools/api/latest/authentication.html)

## Authentication

To generate a Databricks PAT Token, follow this [guide](https://docs.databricks.com/dev-tools/api/latest/authentication.html#generate-a-personal-access-token) and ensure that the token is stored securely and is never used directly in code.

Your Databricks PAT Token can be used in the RTDIP SDK to authenticate with any Databricks Workspace or Databricks SQL Warehouse. Simply provide it in the `access_token` fields where tokens are required in the RTDIP SDK.

## Example

Below is an example of using a Databricks PAT Token to authenticate with a Databricks SQL Warehouse.

```python
from rtdip_sdk.odbc import db_sql_connector

server_hostname = "server_hostname"
http_path = "http_path"
access_token = "dbapi......."

connection = db_sql_connector.DatabricksSQLConnection(server_hostname, http_path, access_token)
```

Replace **server_hostname** and **http_path** with your own information and specify your Databricks PAT Token as the **access_token**.
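
Because the token should never be hard-coded, a common pattern is to supply it through an environment variable. Below is a minimal sketch of that pattern; the variable name `DATABRICKS_PAT` is purely illustrative and is assumed to have been set in your environment.

```python
import os

from rtdip_sdk.odbc import db_sql_connector

# Read the PAT from an environment variable instead of embedding it in code.
# "DATABRICKS_PAT" is an illustrative name - use whatever your environment defines.
access_token = os.environ["DATABRICKS_PAT"]

connection = db_sql_connector.DatabricksSQLConnection(
    "server_hostname",  # replace with your SQL Warehouse hostname
    "http_path",        # replace with your SQL Warehouse HTTP path
    access_token,
)
```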
# Pipeline Components

## Overview

The Real Time Data Ingestion Pipeline Framework supports the following component types:

- Sources - connectors to source systems
- Transformers - perform transformations on data, including data cleansing, data enrichment, data aggregation, data masking, data encryption, data decryption, data validation, data conversion, data normalization, data de-normalization, data partitioning etc.
- Destinations - connectors to sink/destination systems
- Utilities - components that perform utility functions such as logging, error handling, data object creation, authentication, maintenance etc.
- Secrets - components that facilitate access to secret stores where sensitive information such as passwords, connection strings and keys is stored

## Component Types

|Python|Apache Spark|Databricks|
|---------------------------|----------------------|--------------------------------------------------|
|![python](images/python.png)|![pyspark](images/apachespark.png)|![databricks](images/databricks_horizontal.png)|

Component Types determine the system requirements to execute the component:

- Python - components that are written in Python and can be executed on a Python runtime
- Pyspark - components that are written in PySpark and can be executed on an open source Apache Spark runtime
- Databricks - components that require a Databricks runtime

!!! note "Note"
    RTDIP is continuously adding more to this list. For detailed information on timelines, read this [blog post](../../blog/rtdip_ingestion_pipelines.md) and check back on this page regularly.
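
Each component reports which of these runtimes it needs. As a rough illustration only - the module path and the `system_type()` method shown here are assumptions, so consult the component code reference pages for the exact interface - a pipeline author could inspect the runtime requirement like this:

```python
# Illustrative sketch - module path and method name are assumptions;
# see the component code reference pages for the actual interface.
from rtdip_sdk.pipelines.sources.spark.delta import SparkDeltaSource

# Each component is expected to report the runtime it requires before it is executed.
print(SparkDeltaSource.system_type())  # e.g. a PySpark system type
```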

### Sources

Sources are components that connect to source systems and extract data from them. These will typically be real time data sources, but batch components are also supported, as batch data remains an important and necessary source of time series data in a number of real world circumstances.

|Source Type|Python|Apache Spark|Databricks|Azure|AWS|
|---------------------------|----------------------|--------------------|----------------------|----------------------|---------|
|[Delta](../code-reference/pipelines/sources/spark/delta.md)||:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:|
|[Delta Sharing](../code-reference/pipelines/sources/spark/delta_sharing.md)||:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:|
|[Autoloader](../code-reference/pipelines/sources/spark/autoloader.md)|||:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:|
|[Eventhub](../code-reference/pipelines/sources/spark/eventhub.md)||:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:||

!!! note "Note"
    This list will dynamically change as the framework is further developed and new components are added.
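
For illustration, below is a minimal sketch of reading a batch of data from a Delta source. The class name `SparkDeltaSource` and its `read_batch` method follow the pattern described in the linked code reference, but the constructor arguments shown are assumptions - check the Delta source reference page for the definitive signature.

```python
# Sketch only - constructor arguments are assumptions; see the Delta source code reference.
from pyspark.sql import SparkSession
from rtdip_sdk.pipelines.sources.spark.delta import SparkDeltaSource

spark = SparkSession.builder.getOrCreate()

# Read a batch of data from an existing Delta table.
source = SparkDeltaSource(spark=spark, options={}, table_name="my_delta_table")
df = source.read_batch()
```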

### Transformers

Transformers are components that perform transformations on data. They target particular data models and the common transformations that source or destination components require before data can be ingested or consumed.

|Transformer Type|Python|Apache Spark|Databricks|Azure|AWS|
|---------------------------|----------------------|--------------------|----------------------|----------------------|---------|
|[Eventhub Body](../code-reference/pipelines/transformers/spark/eventhub.md)||:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:||

!!! note "Note"
    This list will dynamically change as the framework is further developed and new components are added.
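
Transformers follow the same construct-then-execute pattern as sources. The sketch below converts the binary body of Eventhub messages to strings; the class name `EventhubBodyBinaryToString` and its `transform` method are assumptions based on the Eventhub Body transformer linked above - consult its code reference page for the actual interface.

```python
# Sketch only - class name and method are assumptions; see the Eventhub Body
# transformer code reference for the actual interface.
from rtdip_sdk.pipelines.transformers.spark.eventhub import EventhubBodyBinaryToString

# "eventhub_df" is a DataFrame produced by an Eventhub source component.
transformer = EventhubBodyBinaryToString(data=eventhub_df)
transformed_df = transformer.transform()
```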

### Destinations

Destinations are components that connect to sink/destination systems and write data to them.

|Destination Type|Python|Apache Spark|Databricks|Azure|AWS|
|---------------------------|----------------------|--------------------|----------------------|----------------------|---------|
|[Delta Append](../code-reference/pipelines/destinations/spark/delta.md)||:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:|
|[Eventhub](../code-reference/pipelines/destinations/spark/eventhub.md)||:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:||

!!! note "Note"
    This list will dynamically change as the framework is further developed and new components are added.
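
Destinations mirror sources: construct the component with the data to write, then call its write method. The sketch below appends a DataFrame to a Delta table; the class name `SparkDeltaDestination` and its argument names are assumptions - the Delta Append code reference page has the definitive signature.

```python
# Sketch only - class and argument names are assumptions; see the Delta Append code reference.
from rtdip_sdk.pipelines.destinations.spark.delta import SparkDeltaDestination

# "df" is a DataFrame produced by upstream source/transformer components.
destination = SparkDeltaDestination(data=df, options={}, destination="my_delta_table")
destination.write_batch()
```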

### Utilities

Utilities are components that perform utility functions such as logging, error handling, data object creation, authentication and maintenance. They can normally be executed as part of a pipeline or standalone.

|Utility Type|Python|Apache Spark|Databricks|Azure|AWS|
|---------------------------|----------------------|--------------------|----------------------|----------------------|---------|
|[Delta Table Create](../code-reference/pipelines/utilities/spark/delta_table_create.md)||:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:|

!!! note "Note"
    This list will dynamically change as the framework is further developed and new components are added.
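
Utilities are typically executed directly rather than feeding data downstream. As a rough sketch of the Delta Table Create utility listed above - with the class name, column definition format and `execute` method all being assumptions, so check the linked code reference - creating a table might look like this:

```python
# Sketch only - class name, arguments and method are assumptions;
# see the Delta Table Create code reference for the actual interface.
from pyspark.sql import SparkSession
from rtdip_sdk.pipelines.utilities.spark.delta_table_create import DeltaTableCreateUtility

spark = SparkSession.builder.getOrCreate()

# Create (if it does not already exist) a Delta table for time series data.
table_create = DeltaTableCreateUtility(
    spark=spark,
    table_name="my_delta_table",
    columns=[
        {"name": "EventTime", "type": "timestamp"},
        {"name": "Value", "type": "float"},
    ],
)
table_create.execute()
```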

### Secrets

Secrets are components that interact with secret stores to manage sensitive information such as passwords, keys and certificates.

|Secret Type|Python|Apache Spark|Databricks|Azure|AWS|
|---------------------------|----------------------|--------------------|----------------------|----------------------|---------|
|[Databricks Secret Scopes](../code-reference/pipelines/secrets/databricks.md)|||:heavy_check_mark:|:heavy_check_mark:|:heavy_check_mark:|

!!! note "Note"
    This list will dynamically change as the framework is further developed and new components are added.
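
The Databricks Secret Scopes component listed above provides access to secrets held in Databricks secret scopes. On a Databricks runtime the underlying retrieval is the standard `dbutils.secrets` API, as in the sketch below; the scope and key names are placeholders, and the RTDIP component that wraps this is documented in the linked code reference.

```python
# On a Databricks cluster, dbutils is available in the notebook/job context.
# "rtdip-secrets" and "databricks-pat" are placeholder scope/key names.
access_token = dbutils.secrets.get(scope="rtdip-secrets", key="databricks-pat")
```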

## Conclusion

Components can be used to build RTDIP Pipelines, which are described in more detail [here.](jobs.md)
# Apache Airflow

## Databricks Provider

Apache Airflow can orchestrate an RTDIP Pipeline that has been deployed as a Databricks Job. For further information on how to deploy an RTDIP Pipeline as a Databricks Job, please see [here.](databricks.md)

Databricks has also provided more information about running Databricks Jobs from Apache Airflow [here.](https://docs.databricks.com/workflows/jobs/how-to/use-airflow-with-jobs.html)

### Prerequisites

1. An Apache Airflow instance must be running.
1. Authentication between Apache Airflow and Databricks must be [configured.](https://docs.databricks.com/workflows/jobs/how-to/use-airflow-with-jobs.html#create-a-databricks-personal-access-token-for-airflow)
1. The Python packages `apache-airflow` and `apache-airflow-providers-databricks` must be installed.
1. You have created an [RTDIP Pipeline and deployed it to Databricks.](databricks.md)

### Example

The `JOB_ID` in the example below can be obtained from the Databricks Job.

```python
from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator
from airflow.utils.dates import days_ago

# Replace with the Job ID of the Databricks Job that runs your RTDIP Pipeline.
JOB_ID = "job_id"

default_args = {
    'owner': 'airflow'
}

with DAG('databricks_dag',
    start_date = days_ago(2),
    schedule_interval = None,
    default_args = default_args
    ) as dag:

    # Triggers an immediate run of the existing Databricks Job.
    opr_run_now = DatabricksRunNowOperator(
        task_id = 'run_now',
        databricks_conn_id = 'databricks_default',
        job_id = JOB_ID
    )
```