[Epic] Explore FAST Dataset #43

Open
chauhankaranraj opened this issue Jun 29, 2021 · 0 comments


In this project we have primarily worked with the Backblaze dataset and, more recently, the ceph-telemetry dataset. Both consist mainly of SMART metrics collected from hard disks via the smartctl tool, although ceph-telemetry also contains a substantial amount of metadata in addition to the SMART metrics.

However, some recent research suggests that incorporating disk performance and disk location data on top of SMART data can be valuable in analyzing disk health. Specifically, this paper claims that using these additional metrics improves disk failure prediction models. If this holds for our use cases as well, then Ceph should also collect these metrics from its users as part of ceph-telemetry, so that we can build better models.

In this epic, we will explore this FAST dataset and evaluate the tradeoff between the gain in model performance and the overhead of collecting these metrics from users. This would help us determine the optimal set of additional features that Ceph should collect from users to get the maximum benefit, in terms of better disk health prediction models.

This was referenced Jun 29, 2021