Monitor disk status #224

rgaudin · 2024-07-30T11:46:14Z

At the moment, we don't monitor the disk status of our machines: RAID array status, SMART status, etc.
We don't want or need to integrate it into Grafana, or even to upload information automatically, but we can at least add some checks to the routines so that we're not completely blind to such problems should they occur.

benoit74 · 2024-08-08T09:45:27Z

I've reconfigured the Grafana Helm chart to export mdadm metrics and push them to Prometheus.

These metrics are now available in our Grafana Cloud instance.

I've configured a minimal dashboard (based on an existing one, to be honest). I'm not sure it has everything or is the most convenient one, but at least it shows the most important information: https://kiwixorg.grafana.net/d/edu6v6ekri77kd/mdadm

I've updated the weekly routine to check this dashboard.

benoit74 · 2024-08-08T09:46:07Z

Reopening because I've only done the RAID part; we still need to check SMART status (at least).

rgaudin · 2024-08-08T09:52:34Z

And we want to monitor/check non-node machines as well.

benoit74 · 2024-08-08T10:11:31Z

Here is my proposal.

Every week:

Run cat /proc/mdstat and check that all arrays are "active", that no resync is ongoing (a resync might be normal but should at least be reported), and that no other problem is displayed.
Run smartctl -H /dev/sdxxx on all disks to check their basic health status.
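
The /proc/mdstat part of the weekly check could be scripted along these lines. This is only a sketch: the function name check_mdstat is hypothetical, and it assumes the usual mdstat layout (an "mdX : active ..." header line, a member map such as [UU] where an underscore marks a missing member, and a "resync =", "recovery =", "reshape =" or "check =" line while a sync operation is running or pending). A sync operation is reported as a problem so that it gets noticed, even though it might be normal.

```shell
# check_mdstat FILE: print one line per problem found in an mdstat-style file
# and return non-zero if any problem was found.
check_mdstat() {
    awk '
        BEGIN { bad = 0 }
        # Array header, e.g. "md0 : active raid1 sdb1[1] sda1[0]"
        /^md[^ ]* :/ {
            array = $1
            if ($3 != "active") { print array ": state is " $3; bad = 1 }
        }
        # Member map, e.g. "[UU]"; an underscore marks a missing member
        /\[[U_]*_[U_]*\]/ { print array ": degraded"; bad = 1 }
        # Progress or pending line for a sync operation
        /(resync|recovery|reshape|check) *=/ { print array ": sync activity"; bad = 1 }
        END { exit bad }
    ' "$1"
}
```

On a real machine this would be called as check_mdstat /proc/mdstat; no output and a zero exit status mean nothing suspicious was spotted by these three rules, which is not a guarantee that the arrays are healthy.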

Every month:

Run smartctl -t short /dev/sdxxx on all disks to start a short self-test; come back a few minutes later (a short test usually takes about 2 minutes) to check the result with smartctl -l selftest /dev/sdxxx (one new line is added to the log for every self-test).
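
The monthly step could be split into two small helpers, one to start the tests and one to read the results later. This is a sketch under assumptions: the function names are hypothetical, and the parser assumes the ATA-style self-test log in which smartctl lists the most recent test as entry "# 1" with a status of "Completed without error" on success. NVMe and SCSI disks print a different log layout and would need their own parsing.

```shell
# latest_selftest_ok: read `smartctl -l selftest` output on stdin and succeed
# only if the newest entry ("# 1") reports "Completed without error".
latest_selftest_ok() {
    awk '
        BEGIN { found = 0; ok = 0 }
        $1 == "#" && $2 == "1" {
            found = 1
            if ($0 ~ /Completed without error/) ok = 1
        }
        END { exit !(found && ok) }
    '
}

# start_short_tests: start a short self-test on every detected disk.
start_short_tests() {
    for disk in $(smartctl --scan | awk '{print $1}'); do
        smartctl -t short "$disk"
    done
}

# check_selftests: list the disks whose newest self-test is not clean.
check_selftests() {
    for disk in $(smartctl --scan | awk '{print $1}'); do
        smartctl -l selftest "$disk" | latest_selftest_ok || echo "CHECK: $disk"
    done
}
```

The intended use is to run start_short_tests as root, wait a few minutes, then run check_selftests and look manually at any disk it prints. A disk with an empty log or a test still in progress is also printed, since its newest entry is not a clean completion.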

Every year (we can probably include it in the monthly routine and say "run it only once a year, in January"):

Run smartctl -t long /dev/sdxxx on all disks to start a long self-test; come back the next day to check the result with smartctl -l selftest /dev/sdxxx (one new line is added to the log for every self-test).

Note: it is pretty easy to automate looping over the disks with something like:

# Print the overall SMART health verdict for every disk smartctl detects.
for disk in $(smartctl --scan | awk '{print $1}'); do
    echo "$disk"
    smartctl -H "$disk"
done
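
A possible variant of the loop above prints only the disks that need attention, which is easier to scan during a routine. It is a sketch, not a tested procedure: the function names are hypothetical, and it assumes the two verdict strings smartctl -H commonly prints ("test result: PASSED" for ATA and NVMe disks, "SMART Health Status: OK" for SCSI disks). It also passes along the device type that smartctl --scan reports after -d, which matters for disks behind some RAID controllers.

```shell
# health_ok: read `smartctl -H` output on stdin; succeed if it reports a pass.
health_ok() {
    grep -Eq 'test result: PASSED|SMART Health Status: OK'
}

# scan_devices: read `smartctl --scan` output on stdin and print one
# "device type" pair per line, e.g. "/dev/sda sat".
scan_devices() {
    awk '$2 == "-d" { print $1, $3 }'
}

# report_unhealthy: print every detected disk whose health check does not pass.
report_unhealthy() {
    smartctl --scan | scan_devices | while read -r disk type; do
        smartctl -H -d "$type" "$disk" | health_ok || echo "UNHEALTHY: $disk"
    done
}
```

Running report_unhealthy as root prints nothing when every detected disk reports a passing verdict. A disk that smartctl cannot query at all is also printed, so each reported disk still deserves a manual smartctl -H.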

rgaudin added the documentation and enhancement labels Jul 30, 2024

benoit74 self-assigned this Aug 8, 2024

benoit74 closed this as completed Aug 8, 2024

benoit74 reopened this Aug 8, 2024
