Restore: add bandwidth metrics #4081

Merged Oct 30, 2024 (5 commits)

Michal-Leszczynski (Collaborator) commented Oct 26, 2024

This PR adds new restore metrics:

  • downloaded_bytes (downloaded bytes) labeled by cluster, location, host
  • download_duration (download duration in ms) labeled by cluster, location, host
  • streamed_bytes (load&streamed bytes) labeled by cluster, host
  • stream_duration (load&stream duration in ms) labeled by cluster, host

Dividing the byte counts by the matching durations gives the download/stream bandwidth.

Ref #4042
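
As an illustration of the bandwidth calculation mentioned above, here is a minimal sketch. It only assumes what the description states: a byte count and a duration in milliseconds. The function name and the choice of 1 MB = 1024 * 1024 bytes are assumptions made for this example, not something taken from the PR.

```python
def bandwidth_mb_per_s(total_bytes: float, duration_ms: float) -> float:
    """Average bandwidth in MB/s from a byte count and a duration in ms.

    Assumes 1 MB = 1024 * 1024 bytes (an assumption of this sketch).
    Returns 0.0 when no time has been recorded yet, to avoid dividing by zero.
    """
    if duration_ms <= 0:
        return 0.0
    seconds = duration_ms / 1000.0
    return total_bytes / seconds / (1024 * 1024)


# Example: 512 MB downloaded in 4000 ms of download time.
example = bandwidth_mb_per_s(512 * 1024 * 1024, 4000)
```

The same division applies to both pairs: downloaded_bytes with download_duration, and streamed_bytes with stream_duration. To get bandwidth over a time window rather than an overall average, the inputs would be the differences between two samples of each metric.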

Michal-Leszczynski marked this pull request as ready for review October 28, 2024 07:15
Michal-Leszczynski (Collaborator, Author) commented Oct 29, 2024

Examples for a single restore worker:

Streamed bytes

image

Downloaded bytes (per location)

image

Stream bandwidth in MB/s

image

Download bandwidth in MB/s

image

Michal-Leszczynski (Collaborator, Author) commented

@karol-kokoszka This PR is ready for review!

Michal-Leszczynski merged commit 07ff683 into master Oct 30, 2024
51 checks passed
Michal-Leszczynski deleted the ml/restore-bw-metrics branch October 30, 2024 07:35
karol-kokoszka pushed a commit that referenced this pull request Nov 4, 2024
* refactor(restore): separate methods for updating metrics/progress

This should make it easier to see what is updated where and when.

* feat(metrics): restore, add bandwidth metrics

They are really useful for evaluating restore performance.

* feat(restore): set download/stream bytes/duration metrics

It's useful for checking/tracking restore performance.

Ref #4042

* fix(restore): don't initialize metrics twice

This was left over from the PR that introduced
indexing (14aef7b). That PR started initializing metrics
as part of the indexing procedure, but it
did not remove the previous metrics initialization
from the code.

* fix(restore): use backup cluster ID in remaining_bytes metric

There was confusion about which cluster ID should
be used for labeling the remaining_bytes metric.
When setting remaining_bytes, we used the backup cluster ID,
but when decreasing it, we used the restore cluster ID.
The backup cluster ID should be used in both places,
as this metric describes how many bytes from
which place are yet to be restored. Since we use the backup
cluster DC, node ID, etc., we should also use the backup
cluster ID.
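
To illustrate why the mismatched label matters, here is a toy sketch. It is not Scylla Manager code: `LabeledGauge`, the label order, and the label values are all hypothetical, chosen only to show that a labeled metric keeps one value per distinct label combination.

```python
from collections import defaultdict


class LabeledGauge:
    """Toy stand-in for a labeled gauge: one value per distinct label tuple."""

    def __init__(self):
        self.series = defaultdict(float)

    def set(self, labels, value):
        self.series[labels] = value

    def sub(self, labels, value):
        self.series[labels] -= value


# Hypothetical label values: (cluster ID, DC, node).
BACKUP = ("backup-cluster-id", "dc1", "node-1")
RESTORE = ("restore-cluster-id", "dc1", "node-1")


def run(decrease_labels):
    """Set 1000 remaining bytes under the backup labels, then restore 400."""
    gauge = LabeledGauge()
    gauge.set(BACKUP, 1000)
    gauge.sub(decrease_labels, 400)
    return dict(gauge.series)


mismatched = run(RESTORE)  # decrease lands in a separate series
consistent = run(BACKUP)   # decrease lands in the series that was set
```

With mismatched labels, the series that was set never decreases and a second series goes negative. With the backup cluster ID used in both places, a single series tracks the bytes still to be restored.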