
Log based mode performance #972

Open
podviaznikov opened this issue Jun 7, 2022 · 4 comments
Labels
help wanted (Extra attention is needed)

Comments

@podviaznikov

podviaznikov commented Jun 7, 2022

This one is a follow up on #971 but a different case.

I was also testing LOG_BASED replication and it works, but it feels like it could be faster.

E.g. I've set batch_size_rows to 100000, and I can see in the logs that syncing each batch of 100K rows takes about 1.5 minutes, which looks a bit slow.
Is there any way to speed this up? (I did try fastsync_parallelism.)
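For context, here is a minimal sketch of a tap-postgres YAML with those settings; the connection details, ids and table names are placeholders, and the exact layout may vary between Pipelinewise versions. (As far as I understand, fastsync_parallelism only parallelises the initial fastsync table loads, not the ongoing log-based sync.)

```yaml
# Illustrative Pipelinewise tap YAML; all values are placeholders.
id: "postgres_source"
name: "PostgreSQL source"
type: "tap-postgres"
target: "snowflake"

db_conn:
  host: "db.example.com"
  port: 5432
  user: "replication_user"
  password: "<secret>"
  dbname: "app"

batch_size_rows: 100000        # flush to the target every 100K records
fastsync_parallelism: 8        # parallel table loads during the initial fastsync

schemas:
  - source_schema: "public"
    target_schema: "public"
    tables:
      - table_name: "orders"
        replication_method: "LOG_BASED"
```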

podviaznikov added the help wanted label Jun 7, 2022
@Samira-El
Contributor

What exactly do you want to speed up? There are many moving parts here.

@podviaznikov
Author

General speed, I think. If I have more than 100K inserts/updates/deletes per 1.5-minute sync, the job will start to back up.

@Samira-El
Contributor

Just a disclaimer: there will always be some lag. This will not do real-time, native replication unless you're willing to provision powerful machines or Snowflake warehouses.

You need to investigate where your bottleneck is; the replication is both CPU- and IO-bound:

Is the tap processing insert/update/delete events fast enough for your needs? If not, there is no config to change here; you have to bump the CPU of the machines Pipelinewise runs on.

But if the tap is fast and the Snowflake warehouse you're using is small, or has other workloads running on it, then batches may queue in the warehouse and take longer to flush, and the pipeline will sit idle in the meantime, regardless of how fast the tap is at consuming change logs. If the warehouse is well provisioned, you could try a smaller batch size or time-based batch flushing (check out the pipelinewise-target-snowflake README).
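A minimal sketch of what those settings might look like, assuming the batch_size_rows and batch_wait_limit_seconds options documented in the pipelinewise-target-snowflake README; check the README of the version you run for the exact option names and for whether they go in the tap YAML or in the target connector config:

```yaml
# Illustrative batch-flushing settings; verify the option names against
# your pipelinewise-target-snowflake version before relying on them.
batch_size_rows: 20000           # flush smaller batches more frequently
batch_wait_limit_seconds: 300    # or force a flush after 5 minutes even if
                                 # the batch has not reached batch_size_rows
```

Smaller or time-based batches trade some extra Snowflake overhead for lower end-to-end latency, so they only pay off if the warehouse itself is not the bottleneck.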

@Tolsto
Contributor

Tolsto commented Jun 13, 2022

The main issue is that most of the CPU-intensive operations in log-based replication, in both the tap and the target components, are single-threaded and cannot easily be scaled. Your only options there are to use multiple replication slots with multiple instances of Pipelinewise, each syncing different tables (sketched below), or to use CPU cores with a higher clock speed or IPC.
For better scalability we'd need process-based parallelism when processing the log in the tap, and probably also when building the batches in the target.
If the Snowflake operations are what slow you down, then I'd suggest using larger batch sizes, as this will decrease your Snowflake overhead substantially. If memory management becomes a problem with larger batches, have a look at this PR.
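To illustrate the multi-instance idea, here is a rough sketch of splitting the tables across two tap definitions so that two Pipelinewise instances can work in parallel, each consuming its own replication slot; the ids, schemas and table names are hypothetical, and how the slots are created and named depends on your tap-postgres setup:

```yaml
# Illustrative only: two tap definitions with disjoint table lists, so two
# Pipelinewise instances (and two replication slots) can run side by side.

# --- tap_postgres_part1.yml ---
id: "postgres_part1"
name: "PostgreSQL source, part 1"
type: "tap-postgres"
target: "snowflake"
db_conn:
  host: "db.example.com"
  port: 5432
  user: "replication_user"
  password: "<secret>"
  dbname: "app"
schemas:
  - source_schema: "public"
    target_schema: "public"
    tables:
      - table_name: "orders"
        replication_method: "LOG_BASED"
      - table_name: "customers"
        replication_method: "LOG_BASED"

# --- tap_postgres_part2.yml ---
# Identical db_conn and target; only the id and the table list differ,
# e.g. id: "postgres_part2" with the remaining high-volume tables.
```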

If you have hstore or array-type columns in your tables, then this problem will also seriously slow you down.
