Describe the bug
I set up the LOG_BASED replication method from Postgres to BigQuery. The pipeline does an initial full snapshot when the destination database is empty, but subsequent runs also perform a full snapshot. I can tell because the values of _sdc_extracted_at and _sdc_batched_at are updated on all rows every time the pipeline runs, and the pipeline execution time is the same as for the initial run.
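(A quick way to confirm this from the BigQuery side is to compare the spread of _sdc_extracted_at after several runs. A minimal sketch, assuming the google-cloud-bigquery client library is installed and authenticated; the project and table names are placeholders taken from the configs below.)

# Minimal sketch: check whether every row was re-extracted by the latest run.
from google.cloud import bigquery

client = bigquery.Client(project="xxxxxxxxxx")  # project_id from the target config

query = """
    SELECT
      MIN(_sdc_extracted_at) AS oldest_extract,
      MAX(_sdc_extracted_at) AS newest_extract,
      COUNT(*) AS row_count
    FROM `staging_raw.answers`
"""

row = next(iter(client.query(query).result()))
# If oldest_extract is (almost) equal to newest_extract after several runs,
# every row is being re-extracted each run, i.e. a full snapshot rather than
# incremental LOG_BASED replication.
print(row.oldest_extract, row.newest_extract, row.row_count)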
Postgres Tap
# ------------------------------------------------------------------------------
# General Properties
# ------------------------------------------------------------------------------
id: "postgres_staging" # Unique identifier of the tap
name: "Staging DB" # Name of the tap
type: "tap-postgres" # !! THIS SHOULD NOT CHANGE !!
owner: "[email protected]" # Data owner to contact
#send_alert: False # Optional: Disable all configured alerts on this tap
# ------------------------------------------------------------------------------
# Source (Tap) - PostgreSQL connection details
# ------------------------------------------------------------------------------
db_conn:
  host: "xxx.xxxxxxxxx.us-east-2.rds.amazonaws.com"  # PostgreSQL host
  port: 5432                        # PostgreSQL port
  user: "datastream"                # PostgreSQL user
  password: "xxxxxxxxxxxxx"         # Plain string or vault encrypted
  dbname: "staging"                 # PostgreSQL database name
  filter_schemas: "public"          # Optional: Scan only the required schemas
                                    # to improve the performance of
                                    # data extraction
  #max_run_seconds                  # Optional: Stop running the tap after certain
                                    # number of seconds
                                    # Default: 43200
  logical_poll_total_seconds: 180   # Optional: Stop running the tap when no data
                                    # received from wal after certain number of seconds
                                    # Default: 10800
  #break_at_end_lsn:                # Optional: Stop running the tap if the newly received lsn
                                    # is after the max lsn that was detected when the tap started
                                    # Default: true
  #ssl: "true"                      # Optional: Using SSL via postgres sslmode 'require' option.
                                    # If the server does not accept SSL connections or the client
                                    # certificate is not recognized the connection will fail
# ------------------------------------------------------------------------------
# Destination (Target) - Target properties
# Connection details should be in the relevant target YAML file
# ------------------------------------------------------------------------------
target: "staging_raw_bq" # ID of the target connector where the data will be loaded
batch_size_rows: 20000 # Batch size for the stream to optimise load performance
stream_buffer_size: 0 # In-memory buffer size (MB) between taps and targets for asynchronous data pipes
# ------------------------------------------------------------------------------
# Source to target Schema mapping
# ------------------------------------------------------------------------------
schemas:
  - source_schema: "public"        # Source schema in postgres with tables
    target_schema: "staging_raw"   # Target schema in the destination Data Warehouse
    tables:
      - table_name: "answers"
        replication_method: "LOG_BASED"
      - table_name: "people"
        replication_method: "LOG_BASED"
      - table_name: "organizations"
        replication_method: "LOG_BASED"
      - table_name: "questions"
        replication_method: "LOG_BASED"
      - table_name: "settings"
        replication_method: "LOG_BASED"
      - table_name: "tags"
        replication_method: "LOG_BASED"
BigQuery Target
# ------------------------------------------------------------------------------
# General Properties
# ------------------------------------------------------------------------------
id: "staging_raw_bq" # Unique identifier of the target
name: "Raw Staging Bigquery" # Name of the target
type: "target-bigquery" # !! THIS SHOULD NOT CHANGE !!
# ------------------------------------------------------------------------------
# Target - Data Warehouse connection details
# ------------------------------------------------------------------------------
db_conn:
  project_id: "xxxxxxxxxx"    # Bigquery project name
  dataset_id: "staging_raw"   # Bigquery dataset name
  location: "us-east4"        # Bigquery location of the dataset
Expected behavior
I expect that, after the initial full snapshot of every table, only changed data (inserted, updated and deleted rows) is ingested and processed by the pipeline.
Screenshots
[Screenshot: Postgres replication slot information]
[Screenshot: BigQuery data excerpt 1]
[Screenshot: BigQuery data excerpt 2]
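(Since the screenshots are not visible here, the slot state can also be inspected directly. A minimal sketch, assuming psycopg2 is installed; the connection values mirror the tap config above. If confirmed_flush_lsn never advances between runs, the tap is not consuming the slot incrementally.)

# Minimal sketch: inspect the logical replication slot used by tap-postgres.
import psycopg2

conn = psycopg2.connect(
    host="xxx.xxxxxxxxx.us-east-2.rds.amazonaws.com",
    port=5432,
    user="datastream",
    password="xxxxxxxxxxxxx",
    dbname="staging",
)
with conn.cursor() as cur:
    # confirmed_flush_lsn should advance between runs when the slot is consumed.
    cur.execute("""
        SELECT slot_name, plugin, active, restart_lsn, confirmed_flush_lsn
        FROM pg_replication_slots
        WHERE slot_type = 'logical'
    """)
    for row in cur.fetchall():
        print(row)
conn.close()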
Your environment
Version: 0.52.0
Postgres (LOG_BASED)
BigQuery
A cron job runs the pipeline every 3 minutes.
Additional context
I'm not sure whether this is a bug or I misconfigured the Postgres tap.
Hi @mnifakram, I can give you two places to check for BQ. I changed these sections for the Redshift target, as this repo seems to maintain mostly the Snowflake target and the other targets are lagging behind.
In general I would suggest checking whether your full load runs via fastsync or via tap-postgres each time; based on that information you can tell which runs are not using the existing state, and why.
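A sketch of how to check the persisted state, assuming the default PipelineWise layout where each tap's bookmark lives at ~/.pipelinewise/<target_id>/<tap_id>/state.json (this path is an assumption and may differ in your installation):

# Minimal sketch: print the bookmarks PipelineWise has persisted for this tap.
import json
from pathlib import Path

# Assumed default location; adjust if your setup stores state elsewhere.
state_path = Path.home() / ".pipelinewise" / "staging_raw_bq" / "postgres_staging" / "state.json"

if not state_path.exists():
    # No state file means every run starts from scratch, i.e. a full snapshot.
    print(f"No state file at {state_path}; the tap cannot resume from a bookmark")
else:
    state = json.loads(state_path.read_text())
    # For LOG_BASED streams each bookmark should contain an advancing lsn value.
    for stream, bookmark in state.get("bookmarks", {}).items():
        print(stream, "->", bookmark.get("lsn"), bookmark.get("version"))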