Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rclone: stream files as they are being listed #4146

Closed
wants to merge 4 commits into from

Conversation

Michal-Leszczynski
Copy link
Collaborator

@Michal-Leszczynski Michal-Leszczynski commented Dec 2, 2024

This PR changes the rclone version used by SM-agent server.
The goal is improve huge dir listing by streaming the files to the SM as the dir is being listed. Rclone lists files in chunks (1000 is the default value for s3), so we can start streaming those files after each chunk has been listed, not only after the whole dir has been listed.
This allows for more reliable timeout handling on SM side, and also decreases memory pressure on agent.

Since this is timeout/performance change, I didn't prepare a dedicated test for it, but I created a benchmark for that (see last commit for details). It measures some interesting averages when listing a flat dir with 5555 files 1000 times. Note that this has been tested on my local docker setup without any rate limiting, but I still think that the results show what needs to be seen.

The results without the fix:

Avg total time: 157.916848ms           (average time of listing the whole dir)
Avg first latency: 146.564896ms        (average time after the first item has been listed)
Avg max cross chunk latency: 19.196µs  (average time between listing the last item from given chunk and the first from the next chunk)
Avg max within chunk latency: 112.92µs (average time between listing files from the same chunk)

The results after the fix:

Avg total time: 143.211603ms
Avg first latency: 26.344503ms
Avg max cross chunk latency: 26.144269ms
Avg max within chunk latency: 320.72µs

What is important, is that this fix reduces the time needed for listing the first item from ~146ms to ~26ms. Let's have in mind that 5555 files is not a huge amount for backed up sstables, and that the local setup has faster connections than the real one. It also slightly reduces the total time needed for listing, but the values are too similar to tell it for sure.

Fixes #4132

It does not make sense, as sstable dirs are flat.
Moreover, using recursion makes it impossible to use
internal rclone ListCB which improves performance and
memory allocation.
The main focus of this benchmark is to test the impact of the issues:
- #4134
- #4133
- #4132
@Michal-Leszczynski Michal-Leszczynski changed the title WIP: fix #4132 Rclone: stream files as they are being listed Dec 4, 2024
@Michal-Leszczynski Michal-Leszczynski marked this pull request as ready for review December 4, 2024 11:23
@VAveryanov8
Copy link
Collaborator

Nice benchmark! LGTM!

As I understand all the heavy lifting was done in scylladb/rclone repository?

Copy link
Collaborator

@karol-kokoszka karol-kokoszka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Please monitor https://jenkins.scylladb.com/view/scylla-manager/job/manager-master/ SCT tests after merging.
It may be even worth to start the tests just on this branch.

@Michal-Leszczynski
Copy link
Collaborator Author

As I understand all the heavy lifting was done in scylladb/rclone repository?

Correct, but we treat it more as a playground, so it would be good to review vendored changes as a part of this PR.

@mykaul
Copy link
Contributor

mykaul commented Dec 5, 2024

Is this another private patch on top of our forked rclone?

@karol-kokoszka
Copy link
Collaborator

Is this another private patch on top of our forked rclone?

yes

@Michal-Leszczynski
Copy link
Collaborator Author

After the discussion with @VAveryanov8 it turns out that this issue/PR does not really make sense (thanks for the effort, you saved us from unnecessary rclone patches that could result in more problems in the future).

When developing the patch, I was under the impression that we use non-recursive RcloneListDirIter when we list node''s sstables in backup location during backup's dedup stage. This came from the fact that this dir should be flat, so using flat listing should be better or not worse than recursive listing.

It turned out that we are using recursive RcloneListDirIter for that (@karol-kokoszka I think that you were implementing it - any idea why the recursive version was chosen)?

Even though this PR "improves" flat dir listing, it's not connected to https://github.com/scylladb/scylla-enterprise/issues/4861, which was a driving force behind that, and we shouldn't risk questionable improvements connected to rclone modifications. The only way in which this patch PR could help with mentioned issue, is that perhaps flat dir listing has indeed noticeably better performance than the recursive version, but I doubt that's the case.

Anyway, I'm closing this PR, perhaps it will be revisited in the future.

@karol-kokoszka
Copy link
Collaborator

It turned out that we are using recursive RcloneListDirIter for that (@karol-kokoszka I think that you were implementing it - any idea why the recursive version was chosen)?

I don't think there is any special cause that lead to choose recursive over the flat approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RcloneListDirIter: call callback before listing the whole dir
4 participants