- General guidelines
- Project structure
- Profiles
- Projects
- Packages
- dbt Sources
- Models
- Seeds
- Macros
- YAML
- Credits
Follow our SQL style guide.
Learn it, know it, live it. 🏄
Consider the official dbt best practices.
One of which is to use coding conventions and a style guide. 👍
Optimize primarily for readability, maintainability, and robustness rather than for fewer lines of code.
Newlines are cheap; people's time is expensive.
Very long lines are harder to read, especially in situations where space may be limited like on smaller screens or in side-by-side version control diffs.
It's more readable, and easier to keep consistent.
It contains database credentials, so file permissions should be set to restrict access to it.
To avoid naming conflicts when we have multiple profiles.
dbt's built-in generate_schema_name_for_env
macro assumes production targets are named prod
.
Developers should target a dev environment by default.
Projects should have a directory structure per the following example.
.
├── README.md
├── analysis
├── dbt_project.yml
├── macros
├── models
│ ├── marts
│ │ ├── core
│ │ │ ├── monthly_product_revenue.sql
│ │ │ ├── monthly_product_revenue.yml
│ │ │ ├── payments.sql
│ │ │ ├── payments.yml
│ │ │ ├── customers.sql
│ │ │ ├── customers.yml
│ │ │ └── intermediate
│ │ │ ├── int_customers.sql
│ │ │ ├── int_customers.yml
│ │ │ ├── int_payments.sql
│ │ │ └── int_payments.yml
│ │ ├── marketing
│ │ │ ├── page_hits.sql
│ │ │ ├── page_hits.yml
│ │ │ ├── sessions.sql
│ │ │ ├── sessions.yml
│ │ │ ├── users.sql
│ │ │ ├── users.yml
│ │ │ └── intermediate
│ │ │ ├── int_sessions.sql
│ │ │ ├── int_sessions.yml
│ │ │ ├── int_users.sql
│ │ │ └── int_users.yml
│ │ └── etc.
│ ├── sources
│ │ ├── google_analytics
│ │ │ ├── source_google_analytics.yml
│ │ │ ├── source_google_analytics__ga_campaign.sql
│ │ │ └── source_google_analytics__ga_orders.sql
│ │ ├── stripe
│ │ │ ├── source_stripe.yml
│ │ │ ├── source_stripe__users.sql
│ │ │ ├── source_stripe__payments.sql
│ │ │ └── source_stripe__refunds.sql
│ │ └── etc.
│ └── utils
├── packages.yml
├── seeds
└── snapshots
The marts directory should contain a directory for each business-centric mart required. Marts model business processes. For some projects, there may be only a core mart. Mart directories contain schema definitions and tests in correspondingly named .yml
files for each model.
Intermediate (int_
) models are used to provide an additional layer of business modelling before final mart models. They are useful when mart model is becoming unwieldy in its complexity, or to create an abstraction for use by several mart models. Intermediate models are not to be dependend on outside of the dbt project.
The sources directory should contain a directory for each set of source-centric models. These models create an abstraction on top of source data for easier use in marts. Source models are not to be confused with dbt sources - source models are a layer above dbt sources. A <source_name>.yml
file defining the dbt sources should exist within each source-specific directory.
Source models should be named source_<source name>__<table name>
(two underscores between the source name and table name because those names may contain underscores themselves).
- Source models are the first layer of modelling on top of source data.
- They are normally materialized as views.
- All dbt sources should have a corresponding source model, though not necessarily the other way around:
- Source models can build on top of other source models, e.g. in Shopify to create an order_items source table by unnesting the order table's order_items field.
- Source models should select from dbt sources, not directly from the source tables themselves.
- Source models should alias source columns as necessary to conform to our naming conventions.
- Source models should perform data type corrections and simple extractions of nested scalar data.
- In source models, timezones should be cast to the timezone of the client. Where there are multiple timezones for the client, select the timezone in which most of the reporting stakeholders are based.
For consistency (especially with version control):
- Files should be encoded as UTF-8.
- Indentation should always be done with space characters, never with tab characters.
- Files should have Unix-style endlines (i.e. LF, not CRLF).
- All trailing whitespace should be stripped from files.
- Files should always end with a newline.
If you use Atom install the editorconfig
package (support is also available for other editors).
root = true
[*]
indent_style = space
indent_size = 4
end_of_line = lf
charset = utf-8
trim_trailing_whitespace = true
insert_final_newline = true
[*.yml]
indent_size = 2
For example:
- dbt's compilation artifacts, downloaded package dependencies, and logs.
.DS_Store
files on Macs.desktop.ini
files on Windows.
target/
dbt_modules/
logs/
.*
!.editorconfig
!.git?*
desktop.ini
Concrete materializations should be opt-in.
models:
materialized: ephemeral
When targeting Redshift, enable late binding views globally.
This prevents views from getting dropped prematurely when their source models get rebuilt.
models:
bind: false # Materialize all views as late-binding.
It shouldn't be necessary (models are enabled by default), and it will override any enabled
config settings within packages which can cause problems.
models:
enabled: true # Bad
Lists such as accepted values and source table names should be in alphabetical order, except where an order is implied (e.g. the order of columns in a model).
sources:
- name: shopify
database: stitch
tables:
- addresses
- customers
- orders
- products
- shipments
Use packages from dbt Hub whenever possible.
This allows dbt to handle duplicate dependencies.
# Good
packages:
- package: fishtown-analytics/dbt_utils
# Bad
packages:
- git: https://github.com/fishtown-analytics/dbt-utils.git
This allows bug fixes to be picked up automatically while avoiding any unexpected functional changes.
# Good
packages:
- package: fishtown-analytics/dbt_utils
version: [">=0.6.0", "<0.7.0"]
# Bad
packages:
- package: fishtown-analytics/dbt_utils
version: 0.6.5
# Bad
packages:
- git: https://github.com/fishtown-analytics/dbt-utils.git
revision: master
Each dbt source should be defined in its own schema.yml
file named <source name>.yml
.
Sources have some nice features:
- Showing the source table schemas in the docs.
- Being able to easily run all models that select from a particular source.
- Testing source columns.
- Calculating the freshness of source data.
Sources should have the description
and loader
properties completed to give context on the source to future developers.
- Model names should be plural (tables are collections of multiple things).
- Source models should be named
source_<source name>__<table name>
(two underscores between the source name and table name because those names may contain underscores themselves). - Intermediate model names should be prefixed with
int_
. - Bridge model names should be prefixed with
bridge_
. - Aggregate model names should be prefixed with the grain of the aggregation (e.g.
daily_
,monthly_
). - Don't use fact/dimension prefixes like
fact_
anddim_
.- Models are easier to find when they are sorted alphabetically by their modelled event/entity without a
fact_
ordim_
prefix. - Facts and dimensions are not intrinsic properties of models, but are instead determined by the way the model and its columns are used.
- We create models optimized for modern columnar databases, which combine denormalized factual and dimensional data to reduce the need for joins.
- Models are easier to find when they are sorted alphabetically by their modelled event/entity without a
This makes joins easier and more performant.
- Instead of having a composite (multi-column) primary key, consider creating a surrogate primary key.
- Aggregate and bridge models are exceptions, as they often don't need normal primary keys.
Use comments to label column groups with a blank line between each group. This provides a quickly interpretable structure for columns.
Our standard output column groups (where applicable, always in this order):
Primary key
Foreign keys and IDs
Timestamps
- Any other groups as makes sense. A common catch-all is
Status and properties
. Other good examples areMetrics
andCosts and quantities
.
/* Good */
select
/* Primary key */
id
/* Foreign keys and IDs */
, customer_id
, region_id
/* Timestamps */
, created_at
/* Status and properties */
, status
, total_amount
from orders
/* Bad */
select
id
, created_at
, customer_id
, region_id
, status
, total_amount
from orders
For example, sort and dist keys for Redshift:
{{
config(
materialized='table'
, sort='created_at'
, dist='order_id'
)
}}
/* Good */
{{ ref('customers') }}
/* Bad */
{{ref('customers')}}
/* Good */
{% if where_clause is not none %}
/* Bad */
{%if where_clause is not none%}
Like in Python.
/* Good */
{{ foo('bar', baz=123) }}
/* Bad */
{{ foo('bar', baz = 123) }}
- In production, custom schemas are used verbatim (not concatenated with the target schema name).
- In all other environments, custom schemas are ignored (all models go in the single target schema).
/* Put this in `macros/generate_schema_name.sql`. */
{% macro generate_schema_name(custom_schema_name, node) -%}
{{ generate_schema_name_for_env(custom_schema_name, node) }}
{%- endmacro %}
Materialized source models and staging models should go in a separate schema named <production target schema>_staging
.
Such models should not be used directly in normal analysis/reporting, and putting them in a separate "staging" schema helps enforce that.
Models containing especially sensitive data should go in a separate schema named <production target schema>_sensitive
.
Permissions should be set for the "sensitive" schema to restrict access as appropriate.
Documentation and tests for each model should be placed in a schema .yml
file alongside the model file named <model name>.yml
.
Having a separate schema .yml
file for each model has several benefits:
- Makes it easier to find the documentation and tests for a model.
- Clearly shows which models have documentation/tests and which don't.
- Helps avoid version control merge conflicts.
Final models and source models should be documented. Intermediate models do not require documentation.
Testing and documenting models should be an integral part of their development.
This makes it easier to correlate between the model and the schema .yml
file.
Also, if comments are used in the model to label groupings of columns, then put matching comments in the schema .yml
file.
Markdown is easier to read and edit.
One exception is line breaks, where using <br>
is preferable to hassling with Markdown's line break syntax of ending a line with two or more spaces (which is almost impossible to notice and extremely easy to accidentally break).
# Good
models:
- name: orders
description: |
**Overview:** Summary data for orders.
<br>**Data sources:** `orders`, `order_line_items`
# Bad
models:
- name: orders
description: |
<b>Overview:</b> Summary data for orders.
<br><b>Data sources:</b> <code>orders</code>, <code>order_line_items</code>
unique
andnot_null
tests should be applied to the primary key.relationships
tests should be applied to the foreign keys.
The file name is used as the table name.
If seed data needs to be used directly in analysis/reporting then a model should be created to select from the seed data and apply any formatting or other safeguards, and then that model can be materialized in the default target schema.
It appears dbt always does Excel-like data type detection and auto-formatting, and even overriding the column types doesn't prevent that. For example, when overriding a seed column for 5-digit zip codes to be varchar(5)
a value in the file like 00123
will still end up being 123
in the resulting seed table.
So until dbt fixes the issue with auto-formatting seed data you may need to add a dummy row with non-numeric values to trick dbt into not auto-formatting such data.
This makes it easier to locate macros in the project.
One exception is adapter-specific macros, which should go in the same file as their associated multiple-dispatch macro.
- Indents should be 2 spaces.
- List items should be indented.
- Use a new line to separate list items that are dictionaries where appropriate.
models:
- name: events
columns:
- name: event_id
description: "This is a unique identifier for the event."
tests:
- unique
- not_null
- name: event_time
description: "When the event occurred in UTC (eg. 2018-01-01 12:00:00)."
tests:
- not_null
- name: user_id
description: "The ID of the user who recorded the event."
tests:
- not_null
- relationships:
to: ref('users')
field: id
Otherwise it's too easy to accidentally break the YAML syntax by unwittingly using a character sequence YAML considers special.
# Good
models:
- name: foo
columns:
- name: bar
description: "A colon: followed by a space and a #hashmark following a space are allowed in double-quoted flow scalars."
- name: baz
description: |
A colon: followed by a space
and a #hashmark following a space
are allowed in block scalars.
# Bad
models:
- name: foo
columns:
- name: bar
description: A colon: followed by a space and a #hashmark following a space are invalid in plain flow scalars.
- name: baz
description:
A colon: followed by a space
and a #hashmark following a space
are invalid in plain flow scalars.
These coding conventions were inspired in part by: