[Epic] Explore FAST Dataset #43

chauhankaranraj · 2021-06-29T21:41:04Z

In this project we have primarily been working with the Backblaze dataset and, more recently, the ceph-telemetry dataset. These datasets mainly consist of SMART metrics collected from hard disks via the smartctl tool (although ceph-telemetry also contains a fair amount of metadata in addition to SMART metrics).

However, some recent research suggests that incorporating disk performance and disk location data on top of SMART data can be valuable in analyzing disk health. Specifically, this paper claims to achieve improvements in disk failure prediction models when using these additional metrics. If this holds for our use cases as well, then Ceph should also collect these metrics from its users as part of ceph-telemetry, so that we can build better models.

In this epic, we will explore the FAST dataset and evaluate the tradeoff between the gain in model performance and the overhead of collecting these metrics from users. This will help us determine which additional features Ceph should collect from users to get the most benefit (in terms of better disk health prediction models).
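
One possible way to frame the gain-versus-overhead comparison is sketched below. It assumes each candidate feature group has already been assigned a measured score gain over a SMART-only baseline and a relative collection cost. The `FeatureGroup` class, the function names, the group names, and every number are hypothetical placeholders for illustration, not results from the paper or from our data. The greedy ranking is a simple heuristic and is not guaranteed to find the optimal set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureGroup:
    """A candidate set of extra metrics (hypothetical structure)."""
    name: str
    score_gain: float        # metric improvement over the SMART-only baseline
    collection_cost: float   # relative overhead of collecting it from users

    def __post_init__(self):
        # The ratio below divides by cost, so cost must be positive.
        if self.collection_cost <= 0:
            raise ValueError("collection_cost must be positive")


def rank_by_gain_per_cost(groups):
    """Sort feature groups by gain per unit of collection cost, best first."""
    return sorted(groups, key=lambda g: g.score_gain / g.collection_cost,
                  reverse=True)


def select_within_budget(groups, budget):
    """Greedily pick helpful groups in ranked order while staying in budget."""
    chosen, spent = [], 0.0
    for group in rank_by_gain_per_cost(groups):
        if group.score_gain <= 0:
            continue  # no benefit, so not worth any overhead
        if spent + group.collection_cost <= budget:
            chosen.append(group)
            spent += group.collection_cost
    return chosen


# Placeholder values only; real gains would come from model experiments.
candidates = [
    FeatureGroup("disk_performance", score_gain=0.06, collection_cost=2.0),
    FeatureGroup("disk_location", score_gain=0.01, collection_cost=0.5),
    FeatureGroup("extra_metadata", score_gain=0.0, collection_cost=1.0),
]

selected = select_within_budget(candidates, budget=2.5)
print([group.name for group in selected])
```

In practice the gains would come from comparing models trained with and without each feature group, and the cost would need its own definition (for example payload size or collection frequency).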

This was referenced Jun 29, 2021

FAST dataset EDA #44

Open

Forecasting using FAST dataset #45

Open
