Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restore improvement: transfers #4054

Merged
merged 10 commits into from
Oct 15, 2024
Merged

Conversation

Michal-Leszczynski
Copy link
Collaborator

@Michal-Leszczynski Michal-Leszczynski commented Oct 2, 2024

This PR is using #4068 for extended rclone API.

This PR allows for setting rate limit/transfers on rclone API calls used for uploading/downloading during backup/restore. This is preferred to setting transfers once on the start of backup/restore because it is more resilient to agent restarts. Moreover, it is now possible to check current transfers set

This PR allows for setting transfers in a similar fashion as setting rate limit - meaning that they are global property (in the node), as they need to be changed in global rclone config and in initialized fs.Fs objects. We can't set them just in the context of a single rclone API call.

Fixed issue describes the need to allow for setting transfers only during restore, but because of the above, transfers also need be re-set during backup (so that backup does not use the transfers set by restore). Because of that, I decided to make transfers configurable not only in restore, but also in backup.

Transfers can be set to a special value -1 - meaning that the transfers specified in rclone config file should be used.
For restore, there is another special value 0 - mean that transfers should be set so that we get the fastest download resulting in transfers=2*node_shard_cnt.

The default value for --transfers for backup is -1, and 0 for restore

Fixes #3948

@Michal-Leszczynski Michal-Leszczynski changed the title Restore: allow for setting transfers Restore improvement: allow for setting transfers Oct 2, 2024
@Michal-Leszczynski Michal-Leszczynski changed the title Restore improvement: allow for setting transfers Restore improvement: transfers Oct 2, 2024
@Michal-Leszczynski Michal-Leszczynski force-pushed the ml/restore-transfers branch 6 times, most recently from 0351623 to 7d8da34 Compare October 8, 2024 13:58
@Michal-Leszczynski Michal-Leszczynski marked this pull request as ready for review October 8, 2024 14:44
@Michal-Leszczynski
Copy link
Collaborator Author

Michal-Leszczynski commented Oct 8, 2024

@karol-kokoszka This PR is ready for review!

I have one question - I added transfers as a singular int to both backup and restore tasks.
The current version of this PR always sets transfers before backup/restore - meaning that the default value set in rclone yaml config is overwritten. This creates a regression, as now user can't specify different transfer values for different nodes (except for the special --transfers=0 available only for restore).

It might be more idiomatic for it to be []string defining transfer setting per dc - so that it has the same syntax like backup's --rate-limit, --snapshot-parallel, --upload-parallel.
On the other hand, I'm not sure how useful would be the per dc (not per node) configuration.

I could extend --transfers=X semantic to:

  • if X>0 - set transfers to X on all nodes
  • if X=0 - set transfers to 2*shard_cnt (default, fastest setting like with --parallel flag)
  • if X=-1 - set transfers to the value from rclone config file (allows for more detailed manual configuration)

What are your thoughts on this?

@karol-kokoszka
Copy link
Collaborator

This creates a regression, as now user can't specify different transfer values for different nodes (except for the special --transfers=0 available only for restore)

I'm not sure if I understand. So now, the transfers from scylla-manager-agent.yaml are not respected ?

The goal is to change the number of transfer only in context of the given task (restore).

@Michal-Leszczynski
Copy link
Collaborator Author

I'm not sure if I understand. So now, the transfers from scylla-manager-agent.yaml are not respected ?
The goal is to change the number of transfer only in context of the given task (restore).

Setting transfers works like setting rate limit. We can't just simply set transfers on the level of a single RcloneMoveDir call, as changing their amount requires re-initializing some bigger, global structures, e.g. fs.Fs.

So if restore and backup are supposed to work on a different amount of transfers, each of them should set transfers on their own.
We could make it so restore sets transfers at the beginning of the task execution and then resets them to the default value at the end, but this could create problems when the restore gets interrupted, e.g. SM crasches, and it never gets to reset the transfers amount.

So my solution was to set transfers before each backup and restore task, so I also added --transfers to backup flags, as it might be useful in terms of experimenting with backup speed.

@Michal-Leszczynski
Copy link
Collaborator Author

So the question asked here originates from the fact that changing transfers amount is a global change (in the context of a single node), and also because we wanted to make --transfers=0 the default value - meaning that not setting transfers in restore task flags would still cause SM to change the default amount of transfers taken from rclone yaml config.

So if we start changing the transfers amount, they won't re-set themselves to the default values on their own, and perhaps user would like to have such granular control in some cases, e.g. mixed shard clusters. So that's why I proposed the additional special --transfers=-1 value describing that transfers should be set to the amount taken from the rclone yaml config (note that they would still need to be set by SM, as they might have been changed by SM in the context of previous task execution).

@Michal-Leszczynski
Copy link
Collaborator Author

Of course, we could try to extend --transfers with syntax allowing to specify per node transfers (e.g. --transfers=10,host_id_1:1,host_id_2:2, where 10 without following host_id would be the default value used for unspecified hosts). Or we could replace it with something like --per-shard-transfers, that would take care of the mixed shard cluster case, but would decrease flag's granularity.

@karol-kokoszka
Copy link
Collaborator

As per sync.

  1. Let's make the number of transfers consistent through all the restore task, by making sure that the agent keep the expected value for transfers. It will help to avoid a situation where the agent crashed during the restore and reinitialized the number of transfers to the default value taken from scylla-manager-agent.yaml.
  2. Let's allow to use special value -1 for number of transfers which means that number of transfers is taken from the scylla-manager-agent.yaml.
  3. Let's introduce this new --transfers flag to the backup task as well, but the default should be -1 to not break the existing behavior. Besides, let's do not allow to use special value 0 for backup. Meaning of the 0 is (as per comments above) transfers * number_of_node_shards.

We must block the situation where the restore is running at the same time as backup -> a.k.a. #4045 . By default, a backup job should be mutually exclusive to the restore job.

@Michal-Leszczynski
Copy link
Collaborator Author

@karol-kokoszka I addressed changes from this comment. Could you again take a look?

Also, here is the ground work PR.

This commit allows for greater control over transfers
and rate limit as a part of the upload/download
rclone API used by backup/restore tasks.
It also makes the changes more persistent in case
of agent restarts.

It also adds the core/transfers endpoint that
can be used outside the upload/download rclone API
in order to set the amount of transfers for the following rclone API calls.
This commit exposes the /rclone/core/transfers agent endpoint via agent client.
This commit requires changes in the usage of RcloneCopyPaths in restore pkg.
It also adds Transfers field to restore Target with the default value -1
meaning that transfers will be set to the amount specified in agent
rclone server config.
Our experiments showed that the fastest file download
was achieved for transfers=2*2*node_shard_cnt. In order to make
it easier for the user to use, a new, special value
of transfers=0 will take care of that.
It is set as the default, because we aim to make
not configured restore as fast as possible.
This commit allows user to control the amount of rclone transfers used during restore.
Moreover, during development days of restore task, the default value of --parallel flag
was changed from 1 to 0, but we forgot to change it in the flag documentation.
This is the first step to control transfers in the context of backup.
This commit requires changes in the usage of RcloneMoveDir in backup pkg.
This commit allows user to control the amount of rclone transfers used during backup.
…th transfers

This way this test also checks transfers before and after backup.
It also checks transfers before, in the middle, when paused,
when resumed, and after restore.
This required some changes to the test like swapping
src and dst cluster (increase batch count) and hanging on LAS instead of
copy paths (transfer change is applied as a part of copy paths).
@Michal-Leszczynski Michal-Leszczynski merged commit 2099e2e into master Oct 15, 2024
52 checks passed
@Michal-Leszczynski Michal-Leszczynski deleted the ml/restore-transfers branch October 15, 2024 09:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Control Transfers parameter from Manager Server
2 participants