Monitor disk status #224

Open
rgaudin opened this issue Jul 30, 2024 · 4 comments
Labels: documentation (Improvements or additions to documentation), enhancement (New feature or request)

Comments

@rgaudin
Member

rgaudin commented Jul 30, 2024

At the moment, we don't monitor machine disk status: RAID array status, SMART status, etc.
We don't want or need to integrate this into Grafana, or even to upload the information automatically, but we can at least add some checks to the routines so that we are not completely blind to such problems should they occur.

@rgaudin rgaudin added documentation Improvements or additions to documentation enhancement New feature or request labels Jul 30, 2024
@benoit74 benoit74 self-assigned this Aug 8, 2024
@benoit74
Collaborator

benoit74 commented Aug 8, 2024

I've reconfigured the Grafana Helm chart to export mdadm metrics and push them to Prometheus.

These metrics are now available in our Grafana Cloud instance.

I've configured a minimal dashboard (based on an existing one, tbh). I'm not sure it has everything or is the most convenient layout, but it at least shows the most important information. https://kiwixorg.grafana.net/d/edu6v6ekri77kd/mdadm

I've updated the weekly routine to check this dashboard.

@benoit74 benoit74 closed this as completed Aug 8, 2024
@benoit74 benoit74 reopened this Aug 8, 2024
@benoit74
Collaborator

benoit74 commented Aug 8, 2024

I'm reopening because I've only done the RAID part; we still need to check SMART status (at least).

@rgaudin
Member Author

rgaudin commented Aug 8, 2024

And we want to monitor/check non-node machines as well.

@benoit74
Collaborator

benoit74 commented Aug 8, 2024

My proposal follows.

Every week:

  • run cat /proc/mdstat and check that all arrays are "active", that no resync is ongoing (a resync can be normal, but we should at least be notified of it), and that no other problem is reported
  • run smartctl -H /dev/sdxxx on all disks to check basic health status
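The /proc/mdstat part of the weekly check could be scripted rather than read by eye. The following is only a minimal sketch: the function name check_mdstat is hypothetical, and it assumes the usual /proc/mdstat layout (an "mdX : active ..." header line, a member map such as [UU] or [U_], and progress lines containing "resync =", "recovery =" and the like).

```shell
#!/bin/sh
# check_mdstat [FILE]: print one line per problem found in FILE
# (default: /proc/mdstat) and return non-zero if anything was reported.
check_mdstat() {
    awk '
        # Array header line, e.g. "md0 : active raid1 sdb1[1] sda1[0]"
        /^md[^ ]* :/ {
            array = $1
            if ($3 != "active") { print array ": state is " $3; bad = 1 }
            next
        }
        # Member map such as [UU]; an underscore marks a missing member
        /\[[U_]*_[U_]*\]/ { print array ": degraded"; bad = 1 }
        # Progress or pending lines for resync, recovery, reshape or check
        /(resync|recovery|reshape|check) *=/ {
            print array ": sync operation ongoing"; bad = 1
        }
        END { exit bad + 0 }
    ' "${1:-/proc/mdstat}"
}
```

Running check_mdstat with no argument on a machine prints nothing when every array is active and complete. Any printed line means the raw /proc/mdstat output deserves a manual look, including the resync case, which may well be normal.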

Every month:

  • run smartctl -t short /dev/sdxxx on all disks to start a short self-test; come back a few minutes later (usually 2 minutes) to check the result with smartctl -l selftest /dev/sdxxx (each self-test adds one entry to the log)
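The monthly step could be split into two small helpers, one to start the tests and one to read the results back. This is a sketch under assumptions: the function names are hypothetical, and the result parsing assumes the ATA-style self-test log in which the most recent entry is numbered "# 1" (NVMe and some SCSI devices print a different layout, so the raw log should be checked there).

```shell
#!/bin/sh
# List the device nodes reported by smartctl, one per line.
list_disks() {
    smartctl --scan | awk '{print $1}'
}

# Start a short self-test on every disk.
start_short_tests() {
    for disk in $(list_disks); do
        echo "starting short self-test on $disk"
        smartctl -t short "$disk" > /dev/null
    done
}

# Print the most recent self-test entry for every disk.
# Assumption: ATA-style log where the newest entry is numbered "# 1".
show_latest_results() {
    for disk in $(list_disks); do
        printf '%s: ' "$disk"
        smartctl -l selftest "$disk" | awk '
            /^# *1 / { print; found = 1; exit }
            END { if (!found) print "no self-test entry found" }
        '
    done
}
```

A possible use is to run start_short_tests, wait a few minutes, then run show_latest_results and look for "Completed without error" on every line. The same pair would work for the long test by swapping "short" for "long" and coming back the next day.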

Every year (we can probably include it in the monthly routine and say "run it only once a year, in January"):

  • run smartctl -t long /dev/sdxxx on all disks to start a long self-test; come back the next day to check the result with smartctl -l selftest /dev/sdxxx (each self-test adds one entry to the log)

Note: it is pretty easy to automate looping over the disks with something like:

# Print the overall SMART health status of every disk smartctl detects
for disk in $(smartctl --scan | awk '{print $1}'); do
    echo "$disk"
    smartctl -H "$disk"
done
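If reading the full output for every disk becomes tedious, a variant of the loop above could rely on the exit status of smartctl instead. This is only a sketch: the function name check_all_health is hypothetical, and it treats any non-zero exit status as "needs a closer look" rather than decoding the individual bits that the smartctl man page documents.

```shell
#!/bin/sh
# Print one summary line per disk; return non-zero if any disk made
# smartctl -H exit with a non-zero status.
check_all_health() {
    failed=0
    for disk in $(smartctl --scan | awk '{print $1}'); do
        if smartctl -H "$disk" > /dev/null; then
            echo "$disk: OK"
        else
            # $? still holds the exit status of the smartctl call above
            echo "$disk: smartctl exited with status $? (needs a closer look)"
            failed=1
        fi
    done
    return "$failed"
}
```

For any disk not reported as OK, running smartctl -H on it by hand shows the details.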
