Skip to content

Commit

Permalink
Created new section called Parallel batch execution (#6589)
Browse files Browse the repository at this point in the history
## What are you changing in this pull request and why?

I have created this PR following this Git issue:
#6550 for Parallel
batch execution

## Checklist
- [ ] I have reviewed the [Content style
guide](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/content-style-guide.md)
so my content adheres to these guidelines.
- [ ] The topic I'm writing about is for specific dbt version(s) and I
have versioned it according to the [version a whole
page](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/single-sourcing-content.md#adding-a-new-version)
and/or [version a block of
content](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/single-sourcing-content.md#versioning-blocks-of-content)
guidelines.
- [ ] I have added checklist item(s) to this list for anything anything
that needs to happen before this PR is merged, such as "needs technical
review" or "change base branch."
- [ ] The content in this PR requires a dbt release note, so I added one
to the [release notes
page](https://docs.getdbt.com/docs/dbt-versions/dbt-cloud-release-notes).
<!--
PRE-RELEASE VERSION OF dbt (if so, uncomment):
- [ ] Add a note to the prerelease version [Migration
Guide](https://github.com/dbt-labs/docs.getdbt.com/tree/current/website/docs/docs/dbt-versions/core-upgrade)
-->
<!-- 
ADDING OR REMOVING PAGES (if so, uncomment):
- [ ] Add/remove page in `website/sidebars.js`
- [ ] Provide a unique filename for new pages
- [ ] Add an entry for deleted pages in `website/vercel.json`
- [ ] Run link testing locally with `npm run build` to update the links
that point to deleted pages
-->

<!-- vercel-deployment-preview -->
---
🚀 Deployment available! Here are the direct links to the updated files:


-
https://docs-getdbt-com-git-nfiann-rbip-dbt-labs.vercel.app/docs/build/incremental-microbatch
-
https://docs-getdbt-com-git-nfiann-rbip-dbt-labs.vercel.app/docs/dbt-versions/core-upgrade/06-upgrading-to-v1.9
-
https://docs-getdbt-com-git-nfiann-rbip-dbt-labs.vercel.app/reference/resource-properties/concurrent_batches

<!-- end-vercel-deployment-preview -->

---------

Co-authored-by: Mirna Wong <[email protected]>
Co-authored-by: Quigley Malcolm <[email protected]>
Co-authored-by: Leona B. Campbell <[email protected]>
  • Loading branch information
4 people authored Dec 7, 2024
1 parent 2c538a6 commit 0cb8e68
Show file tree
Hide file tree
Showing 4 changed files with 225 additions and 8 deletions.
140 changes: 132 additions & 8 deletions website/docs/docs/build/incremental-microbatch.md
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,8 @@ Incremental models in dbt are a [materialization](/docs/build/materializations)
Microbatch is an incremental strategy designed for large time-series datasets:
- It relies solely on a time column ([`event_time`](/reference/resource-configs/event-time)) to define time-based ranges for filtering. Set the `event_time` column for your microbatch model and its direct parents (upstream models). Note, this is different to `partition_by`, which groups rows into partitions.
- It complements, rather than replaces, existing incremental strategies by focusing on efficiency and simplicity in batch processing.
- Unlike traditional incremental strategies, microbatch doesn't require implementing complex conditional logic for [backfilling](#backfills).
- Unlike traditional incremental strategies, microbatch enables you to [reprocess failed batches](/docs/build/incremental-microbatch#retry), auto-detect [parallel batch execution](#parallel-batch-execution), and eliminate the need to implement complex conditional logic for [backfilling](#backfills).

- Note, microbatch might not be the best strategy for all use cases. Consider other strategies for use cases such as not having a reliable `event_time` column or if you want more control over the incremental logic. Read more in [How `microbatch` compares to other incremental strategies](#how-microbatch-compares-to-other-incremental-strategies).

### How microbatch works
Expand Down Expand Up @@ -179,12 +180,15 @@ It does not matter whether the table already contains data for that day. Given t

Several configurations are relevant to microbatch models, and some are required:

| Config | Type | Description | Default |
|----------|------|---------------|---------|
| [`event_time`](/reference/resource-configs/event-time) | Column (required) | The column indicating "at what time did the row occur." Required for your microbatch model and any direct parents that should be filtered. | N/A |
| [`begin`](/reference/resource-configs/begin) | Date (required) | The "beginning of time" for the microbatch model. This is the starting point for any initial or full-refresh builds. For example, a daily-grain microbatch model run on `2024-10-01` with `begin = '2023-10-01` will process 366 batches (it's a leap year!) plus the batch for "today." | N/A |
| [`batch_size`](/reference/resource-configs/batch-size) | String (required) | The granularity of your batches. Supported values are `hour`, `day`, `month`, and `year` | N/A |
| [`lookback`](/reference/resource-configs/lookback) | Integer (optional) | Process X batches prior to the latest bookmark to capture late-arriving records. | `1` |

| Config | Description | Default | Type | Required |
|----------|---------------|---------|------|---------|
| [`event_time`](/reference/resource-configs/event-time) | The column indicating "at what time did the row occur." Required for your microbatch model and any direct parents that should be filtered. | N/A | Column | Required |
| `begin` | The "beginning of time" for the microbatch model. This is the starting point for any initial or full-refresh builds. For example, a daily-grain microbatch model run on `2024-10-01` with `begin = '2023-10-01` will process 366 batches (it's a leap year!) plus the batch for "today." | N/A | Date | Required |
| `batch_size` | The granularity of your batches. Supported values are `hour`, `day`, `month`, and `year` | N/A | String | Required |
| `lookback` | Process X batches prior to the latest bookmark to capture late-arriving records. | `1` | Integer | Optional |
| `concurrent_batches` | An override for whether batches run concurrently (at the same time) or sequentially (one after the other). | `None` | Boolean | Optional |


<Lightbox src="/img/docs/building-a-dbt-project/microbatch/event_time.png" title="The event_time column configures the real-world time of this record"/>

Expand Down Expand Up @@ -280,7 +284,127 @@ For now, dbt assumes that all values supplied are in UTC:

While we may consider adding support for custom time zones in the future, we also believe that defining these values in UTC makes everyone's lives easier.

## How `microbatch` compares to other incremental strategies?
## Parallel batch execution

The microbatch strategy offers the benefit of updating a model in smaller, more manageable batches. Depending on your use case, configuring your microbatch models to run in parallel offers faster processing, in comparison to running batches sequentially.

Parallel batch execution means that multiple batches are processed at the same time, instead of one after the other (sequentially) for faster processing of your microbatch models.

dbt automatically detects whether a batch can be run in parallel in most cases, which means you don’t need to configure this setting. However, the `concurrent_batches` config is available as an override (not a gate), allowing you to specify whether batches should or shouldn’t be run in parallel in specific cases.

For example, if you have a microbatch model with 12 batches, you can execute those batches to run in parallel. Specifically they'll run in parallel limited by the number of [available threads](/docs/running-a-dbt-project/using-threads).

### Prerequisites

To enable parallel execution, you must:

- Use a supported adapter:
- Snowflake
- Databricks
- More adapters coming soon!
- We'll be continuing to test and add concurrency support for adapters. This means that some adapters might get concurrency support _after_ the 1.9 initial release.

- Meet [additional conditions](#how-parallel-batch-execution-works) described in the following section.

### How parallel batch execution works

A batch can only run in parallel if all of these conditions are met:

| Condition | Parallel execution | Sequential execution|
| ---------------| :------------------: | :----------: |
| **Not** the first batch | ✅ | - |
| **Not** the last batch | ✅ | - |
| [Adapter supports](#prerequisites) parallel batches | ✅ | - |


After checking for the conditions in the previous table &mdash; and if `concurrent_batches` value isn't set, dbt will intelligently auto-detect if the model invokes the [`{{ this }}`](/reference/dbt-jinja-functions/this) Jinja function. If it references `{{ this }}`, the batches will run sequentially since `{{ this }}` represents the database of the current model and referencing the same relation causes conflict.

Otherwise, if `{{ this }}` isn't detected (and other conditions are met), the batches will run in parallel, which can be overriden when you [set a value for `concurrent_batches`](/reference/resource-properties/concurrent_batches).

### Parallel or sequential execution

Choosing between parallel batch execution and sequential processing depends on the specific requirements of your use case.

- Parallel batch execution is faster but requires logic independent of batch execution order. For example, if you're developing a data pipeline for a system that processes user transactions in batches, each batch is executed in parallel for better performance. However, the logic used to process each transaction shouldn't depend on the order of how batches are executed or completed.
- Sequential processing is slower but essential for calculations like [cumulative metrics](/docs/build/cumulative) in microbatch models. It processes data in the correct order, allowing each step to build on the previous one.

<!-- You can override the check for `this` by setting `concurrent_batches` to either `True` or `False`. If set to `False`, the batch will be run sequentially. If set to `True` the batch will be run in parallel (assuming [1], [2], and [3])
To override the `this` check, use the `concurrent_batches` configuration:


<File name='dbt_project.yml'>

```yaml
models:
+concurrent_batches: True
```

</File>

or:

<File name='models/my_model.sql'>

```sql
{{
config(
materialized='incremental',
concurrent_batches=True,
incremental_strategy='microbatch'
...
)
}}
select ...
```

</File>
-->

### Configure `concurrent_batches`

By default, dbt auto-detects whether batches can run in parallel for microbatch models, and this works correctly in most cases. However, you can override dbt's detection by setting the [`concurrent_batches` config](/reference/resource-properties/concurrent_batches) in your `dbt_project.yml` or model `.sql` file to specify parallel or sequential execution, given you meet all the [conditions](#prerequisites):

<Tabs>
<TabItem value="yaml" label="dbt_project.yml">

<File name='dbt_project.yml'>

```yaml
models:
+concurrent_batches: true # value set to true to run batches in parallel
```

</File>
</TabItem>

<TabItem value="sql" label="my_model.sql">

<File name='models/my_model.sql'>

```sql
{{
config(
materialized='incremental',
incremental_strategy='microbatch',
event_time='session_start',
begin='2020-01-01',
batch_size='day
concurrent_batches=true, # value set to true to run batches in parallel
...
)
}}
select ...
```
</File>
</TabItem>
</Tabs>

## How microbatch compares to other incremental strategies

As data warehouses roll out new operations for concurrently replacing/upserting data partitions, we may find that the new operation for the data warehouse is more efficient than what the adapter uses for microbatch. In such instances, we reserve the right the update the default operation for microbatch, so long as it works as intended/documented for models that fit the microbatch paradigm.

Most incremental models rely on the end user (you) to explicitly tell dbt what "new" means, in the context of each model, by writing a filter in an `{% if is_incremental() %}` conditional block. You are responsible for crafting this SQL in a way that queries [`{{ this }}`](/reference/dbt-jinja-functions/this) to check when the most recent record was last loaded, with an optional look-back window for late-arriving records.

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -49,6 +49,8 @@ Starting in Core 1.9, you can use the new [microbatch strategy](/docs/build/incr
- Simplified query design: Write your model query for a single batch of data. dbt will use your `event_time``lookback`, and `batch_size` configurations to automatically generate the necessary filters for you, making the process more streamlined and reducing the need for you to manage these details.
- Independent batch processing: dbt automatically breaks down the data to load into smaller batches based on the specified `batch_size` and processes each batch independently, improving efficiency and reducing the risk of query timeouts. If some of your batches fail, you can use `dbt retry` to load only the failed batches.
- Targeted reprocessing: To load a *specific* batch or batches, you can use the CLI arguments `--event-time-start` and `--event-time-end`.
- [Automatic parallel batch execution](/docs/build/incremental-microbatch#parallel-batch-execution): Process multiple batches at the same time, instead of one after the other (sequentially) for faster processing of your microbatch models. dbt intelligently auto-detects if your batches can run in parallel, while also allowing you to manually override parallel execution with the `concurrent_batches` config.


Currently microbatch is supported on these adapters with more to come:
* postgres
Expand Down
90 changes: 90 additions & 0 deletions website/docs/reference/resource-properties/concurrent_batches.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,90 @@
---
title: "concurrent_batches"
resource_types: [models]
datatype: model_name
description: "Learn about concurrent_batches in dbt."
---

:::note

Available in dbt Core v1.9+ or the [dbt Cloud "Latest" release tracks](/docs/dbt-versions/cloud-release-tracks).

:::

<Tabs>
<TabItem value="Project file">


<File name='dbt_project.yml'>

```yaml
models:
+concurrent_batches: true
```
</File>
</TabItem>
<TabItem value="sql file">
<File name='models/my_model.sql'>
```sql
{{
config(
materialized='incremental',
concurrent_batches=true,
incremental_strategy='microbatch'
...
)
}}
select ...
```

</File>

</TabItem>
</Tabs>

## Definition

`concurrent_batches` is an override which allows you to decide whether or not you want to run batches in parallel or sequentially (one at a time).

For more information, refer to [how batch execution works](/docs/build/incremental-microbatch#how-parallel-batch-execution-works).
## Example

By default, dbt auto-detects whether batches can run in parallel for microbatch models. However, you can override dbt's detection by setting the `concurrent_batches` config to `false` in your `dbt_project.yml` or model `.sql` file to specify parallel or sequential execution, given you meet these conditions:
* You've configured a microbatch incremental strategy.
* You're working with cumulative metrics or any logic that depends on batch order.

Set `concurrent_batches` config to `false` to ensure batches are processed sequentially. For example:

<File name='dbt_project.yml'>

```yaml
models:
my_project:
cumulative_metrics_model:
+concurrent_batches: false
```
</File>
<File name='models/my_model.sql'>
```sql
{{
config(
materialized='incremental',
incremental_strategy='microbatch'
concurrent_batches=false
)
}}
select ...

```
</File>


1 change: 1 addition & 0 deletions website/sidebars.js
Original file line number Diff line number Diff line change
Expand Up @@ -956,6 +956,7 @@ const sidebarSettings = {
"reference/resource-configs/materialized",
"reference/resource-configs/on_configuration_change",
"reference/resource-configs/sql_header",
"reference/resource-properties/concurrent_batches",
],
},
{
Expand Down

0 comments on commit 0cb8e68

Please sign in to comment.