
Log based mode performance #972

Open
podviaznikov opened this issue Jun 7, 2022 · 4 comments
Labels
help wanted (Extra attention is needed)

Comments

@podviaznikov

podviaznikov commented Jun 7, 2022

This one is a follow up on #971 but a different case.

I was also testing LOG_BASED replication and it works, but it feels like it could be faster.

E.g. I've set batch_size_rows to 100000, and I can see in the logs that syncing each batch of 100K rows takes about 1.5 minutes, which looks a bit slow.
Is there any way to speed this up? (I did try fastsync_parallelism.)
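For context, here is a minimal sketch of a tap-postgres YAML with those settings; the connection details, ids and table names are placeholders, and the exact layout may vary between Pipelinewise versions. (As far as I understand, fastsync_parallelism only parallelises the initial fastsync table loads, not the ongoing log-based sync.)

```yaml
# Illustrative Pipelinewise tap YAML; all values are placeholders.
id: "postgres_source"
name: "PostgreSQL source"
type: "tap-postgres"
target: "snowflake"

db_conn:
  host: "db.example.com"
  port: 5432
  user: "replication_user"
  password: "<secret>"
  dbname: "app"

batch_size_rows: 100000        # flush to the target every 100K records
fastsync_parallelism: 8        # parallel table loads during the initial fastsync

schemas:
  - source_schema: "public"
    target_schema: "public"
    tables:
      - table_name: "orders"
        replication_method: "LOG_BASED"
```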

podviaznikov added the help wanted label Jun 7, 2022
@Samira-El
Contributor

What exactly do you want to speed up? There are many moving parts here.

@podviaznikov
Author

General speed, I think. If I have more than 100K inserts/updates/deletes per 1.5-minute sync, the job will start to back up.

@Samira-El
Contributor

Just a disclaimer: there will always be some lag. This will not do real-time, native replication unless you're willing to provision powerful machines or Snowflake warehouses.

You need to investigate where your bottleneck is; the replication is both CPU- and IO-bound:

Is the tap processing insert/update/delete events fast enough for your needs? If not, there is no config to change here; you have to bump the CPU of the machines Pipelinewise runs on.

But if the tap is fast and the Snowflake warehouse you're using is small, or has other workloads running on it, then batches may queue in the warehouse and take longer to flush, and the pipeline will sit idle in the meantime, regardless of how fast the tap is at consuming change logs. If the warehouse is well provisioned, you could try a smaller batch size or time-based batch flushing (check out the pipelinewise-target-snowflake README).
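A minimal sketch of what those settings might look like, assuming the batch_size_rows and batch_wait_limit_seconds options documented in the pipelinewise-target-snowflake README; check the README of the version you run for the exact option names and for whether they go in the tap YAML or in the target connector config:

```yaml
# Illustrative batch-flushing settings; verify the option names against
# your pipelinewise-target-snowflake version before relying on them.
batch_size_rows: 20000           # flush smaller batches more frequently
batch_wait_limit_seconds: 300    # or force a flush after 5 minutes even if
                                 # the batch has not reached batch_size_rows
```

Smaller or time-based batches trade some extra Snowflake overhead for lower end-to-end latency, so they only pay off if the warehouse itself is not the bottleneck.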

@Tolsto
Contributor

Tolsto commented Jun 13, 2022

The main issue is that most of the CPU-intensive operations in log-based replication, in both the tap and the target components, are single-threaded and cannot easily be scaled. Your only options there are to use multiple replication slots with multiple instances of Pipelinewise, each syncing different tables (sketched below), or to use CPU cores with a higher clock speed or IPC.
For better scalability we'd need process-based parallelism when processing the log in the tap, and probably also when building the batches in the target.
If the Snowflake operations are what slow you down, then I'd suggest using larger batch sizes, as this will decrease your Snowflake overhead substantially. If memory management becomes a problem with larger batches, have a look at this PR.
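To illustrate the multi-instance idea, here is a rough sketch of splitting the tables across two tap definitions so that two Pipelinewise instances can work in parallel, each consuming its own replication slot; the ids, schemas and table names are hypothetical, and how the slots are created and named depends on your tap-postgres setup:

```yaml
# Illustrative only: two tap definitions with disjoint table lists, so two
# Pipelinewise instances (and two replication slots) can run side by side.

# --- tap_postgres_part1.yml ---
id: "postgres_part1"
name: "PostgreSQL source, part 1"
type: "tap-postgres"
target: "snowflake"
db_conn:
  host: "db.example.com"
  port: 5432
  user: "replication_user"
  password: "<secret>"
  dbname: "app"
schemas:
  - source_schema: "public"
    target_schema: "public"
    tables:
      - table_name: "orders"
        replication_method: "LOG_BASED"
      - table_name: "customers"
        replication_method: "LOG_BASED"

# --- tap_postgres_part2.yml ---
# Identical db_conn and target; only the id and the table list differ,
# e.g. id: "postgres_part2" with the remaining high-volume tables.
```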

If you have hstore or array-type columns in your tables, then this problem will also seriously slow you down.
