This repository demonstrates dbtvault on BigQuery, leveraging the dbt tutorial.
The official dbtvault documentation has its own tutorials, but running them requires Snowflake.
On the other hand, dbt's getting started guide is based on BigQuery and uses dbt's public dataset in the dbt-tutorial project.
This demo is for people who are more familiar with BigQuery and want to try dbtvault against the same dataset as the dbt tutorial.
Before trying dbtvault, it is a good idea to go through the dbt tutorial first, because BigQuery connectivity and data will be all set up after completing it.
If you are new to dbt itself, it is worth going through the official docs as well.
As described in the dbt tutorial, installation can be done by following its instructions.
In this repository, a Python pip based installation is available using the included resources. Please use Python 3.8 or above.
- Set up a venv: `python -m venv venv`
- Activate it: `source ./venv/bin/activate`
- Install dependencies: `pip install -r requirements.txt`
BigQuery access is required before running this demo.
- Create a dataset in BigQuery
- Create a service account with BigQuery write access
  - If you are unsure which permissions are needed, grant BigQuery Admin in your test project
- Download a JSON key for the service account
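
As a reference, here is a minimal sketch of this setup using the `bq` and `gcloud` CLIs. The dataset name, service account name (`dbt-demo`), and key path are placeholders, not names this demo requires:

```sh
# Create a dataset (US location, matching the profile below)
bq --location=US mk --dataset GCP_PROJECT_ID:YOUR_DATASET

# Create a service account and grant BigQuery Admin on the test project
gcloud iam service-accounts create dbt-demo --project=GCP_PROJECT_ID
gcloud projects add-iam-policy-binding GCP_PROJECT_ID \
  --member="serviceAccount:dbt-demo@GCP_PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/bigquery.admin"

# Download a JSON key for the service account
gcloud iam service-accounts keys create ./dbt-demo-key.json \
  --iam-account=dbt-demo@GCP_PROJECT_ID.iam.gserviceaccount.com
```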
A dbt profile must be set up so that dbt can connect to BigQuery.
The profile is created when a new project is initialized via the `dbt init` command. If you would like to try dbtvault by just cloning this repository, add the following config to your profile in `~/.dbt/profiles.yml`:
```yaml
dbtvault_bigquery_demo:
  outputs:
    dev:
      dataset: YOUR_DATASET
      job_execution_timeout_seconds: 300
      job_retries: 1
      keyfile: PATH_TO_YOUR_SA_KEY
      location: US
      method: service-account
      priority: interactive
      project: GCP_PROJECT_ID
      threads: 1
      type: bigquery
  target: dev
```
Replace the following placeholders:
- `YOUR_DATASET`: your dataset name
- `PATH_TO_YOUR_SA_KEY`: the path to your service account's JSON key
- `GCP_PROJECT_ID`: your GCP project ID
Once the profile is set up, test connectivity to BigQuery with the following command:
```sh
dbt debug
```
Now we are ready to play with dbtvault in BigQuery.
You can follow the docs to get familiar with dbtvault. Although the tutorial is only executable on Snowflake as of May 2022, the documentation is still useful for learning the basic syntax and usage of dbtvault.
To use dbtvault in a dbt project, add a `packages.yml` with the following contents and run `dbt deps`. Check the latest version here:
```yaml
packages:
  - package: Datavault-UK/dbtvault
    version: 0.8.3
```
After completing the dbt tutorial, the `models` dir should contain `stg_*.sql` files.
This demo uses the public data of the dbt-tutorial project as its data source and builds a raw staging layer on top of it. Rename the `stg_*.sql` files to `raw_stg_*.sql` and move them to the `models/raw_stage` dir.
Files modified:
- `models/raw_stage/raw_stg_customers.sql`
- `models/raw_stage/raw_stg_orders.sql`
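
The raw staging models can stay as they were in the tutorial. For reference, a minimal sketch of `raw_stg_customers.sql`, assuming the column names used by the dbt tutorial's jaffle_shop data:

```sql
-- Raw staging view over the public dbt-tutorial dataset.
-- Column names follow the dbt tutorial; adjust if your tutorial models differ.
select
    id as customer_id,
    first_name,
    last_name
from `dbt-tutorial`.jaffle_shop.customers
```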
Then, add the following settings to the `models` section of `dbt_project.yml`:
```yaml
models:
  dbtvault-bigquery-demo:
    raw_stage:
      tags:
        - 'raw'
      +materialized: view
```
dbtvault uses a hashed staging layer as an intermediate layer to prepare for constructing the vault. The hashed staging layer contains the original data plus derived data such as hash keys and Data Vault specific fields like the record source and the load date.
Create a `stage` dir under `models` and create the `stg_*.sql` files there.
Example: `stg_customer.sql`
```sql
{%- set yaml_metadata -%}
source_model: 'raw_stg_customers'
derived_columns:
  RECORD_SOURCE: '!customers'
  EFFECTIVE_FROM: CURRENT_DATE()
  LOAD_DATE: CURRENT_DATE()
hashed_columns:
  CUSTOMER_PK:
    - 'customer_id'
  CUSTOMER_HASHDIFF:
    is_hashdiff: true
    columns:
      - 'customer_id'
      - 'first_name'
      - 'last_name'
{%- endset -%}

{% set metadata_dict = fromyaml(yaml_metadata) %}
{% set source_model = metadata_dict['source_model'] %}
{% set derived_columns = metadata_dict['derived_columns'] %}
{% set hashed_columns = metadata_dict['hashed_columns'] %}

{{ dbtvault.stage(include_source_columns=true,
                  source_model=source_model,
                  derived_columns=derived_columns,
                  hashed_columns=hashed_columns,
                  ranked_columns=none) }}
```
In this demo, we create hubs, links, and satellites as the vault.
Create a `vault` dir under `models` and put the related contents there.
Hubs are one of the core building blocks of Data Vault. Create a `hubs` dir under `models/vault` and put the SQL files for hubs there.
Example: `h_customer.sql`
```sql
{%- set source_model = ["stg_customer"] -%}
{%- set src_pk = "CUSTOMER_PK" -%}
{%- set src_nk = "CUSTOMER_ID" -%}
{%- set src_ldts = "LOAD_DATE" -%}
{%- set src_source = "RECORD_SOURCE" -%}

{{ dbtvault.hub(src_pk=src_pk, src_nk=src_nk, src_ldts=src_ldts,
                src_source=src_source, source_model=source_model) }}
```
Links represent associations between business objects, connecting hubs. Create a `link` dir under `models/vault` and put the SQL files for links there.
Example: `l_customer_order.sql`
```sql
{%- set source_model = "stg_order" -%}
{%- set src_pk = "LINK_ORDER_CUSTOMER_PK" -%}
{%- set src_fk = ["ORDER_PK", "CUSTOMER_PK"] -%}
{%- set src_ldts = "LOAD_DATE" -%}
{%- set src_source = "RECORD_SOURCE" -%}

{{ dbtvault.link(src_pk=src_pk, src_fk=src_fk, src_ldts=src_ldts,
                 src_source=src_source, source_model=source_model) }}
```
In this example, the link's data is extracted from `stg_order.sql`, as sketched below.
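
`stg_order.sql` is not shown above; here is a minimal sketch, assuming the tutorial's orders data exposes `order_id` and `customer_id`, and that the stage hashes the two foreign keys plus the combined link key:

```sql
{%- set yaml_metadata -%}
source_model: 'raw_stg_orders'
derived_columns:
  RECORD_SOURCE: '!orders'
  EFFECTIVE_FROM: CURRENT_DATE()
  LOAD_DATE: CURRENT_DATE()
hashed_columns:
  ORDER_PK:
    - 'order_id'
  CUSTOMER_PK:
    - 'customer_id'
  LINK_ORDER_CUSTOMER_PK:
    - 'order_id'
    - 'customer_id'
{%- endset -%}

{% set metadata_dict = fromyaml(yaml_metadata) %}

{{ dbtvault.stage(include_source_columns=true,
                  source_model=metadata_dict['source_model'],
                  derived_columns=metadata_dict['derived_columns'],
                  hashed_columns=metadata_dict['hashed_columns'],
                  ranked_columns=none) }}
```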
Satellites hold the detailed payload data of their parent hub or link. Create a `satellites` dir under `models/vault` and put the SQL files for satellites there.
Example: `s_customer.sql`
```sql
{%- set source_model = "stg_customer" -%}
{%- set src_pk = "CUSTOMER_PK" -%}
{%- set src_hashdiff = "CUSTOMER_HASHDIFF" -%}
{%- set src_payload = ["CUSTOMER_ID", "FIRST_NAME", "LAST_NAME"] -%}
{%- set src_eff = "EFFECTIVE_FROM" -%}
{%- set src_ldts = "LOAD_DATE" -%}
{%- set src_source = "RECORD_SOURCE" -%}

{{ dbtvault.sat(src_pk=src_pk, src_hashdiff=src_hashdiff,
                src_payload=src_payload, src_eff=src_eff,
                src_ldts=src_ldts, src_source=src_source,
                source_model=source_model) }}
```
After adding the vault related settings, update the `models` section of the project YAML as shown below. In this demo, the vault layer is persisted as tables so that the contents are easier to check in the BigQuery preview; it can be materialized as views depending on your preference.
```yaml
models:
  dbtvault-bigquery-demo:
    # Config indicated by + applies to all files under the corresponding directory
    raw_stage:
      +materialized: view
    stage:
      +materialized: view
    vault:
      +materialized: table
```
Finally, run `dbt run` and check BigQuery after completion!
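
For example, you can preview the created tables in the BigQuery console, or query a hub directly (`YOUR_DATASET` stands for the dataset configured in your profile):

```sql
-- Inspect the customer hub; replace YOUR_DATASET with your dataset name.
select *
from YOUR_DATASET.h_customer
order by LOAD_DATE
limit 10;
```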
- Learn more about dbt in the docs
- Learn more about dbtvault in the docs
- Check out the blog for the latest news on dbt's development and best practices