Currently, our monitoring only watches the overall status of a Ceph cluster and alerts if it is in a HEALTH_WARN or HEALTH_ERR state. An engineer has no more information than the cluster health state and cannot determine the severity of an alert from these statuses alone.
We should have more specific alerting around Ceph that states why a cluster is in its current state. Ceph can report health as JSON:
ceph health -f json
{"checks":{"OSD_BACKFILLFULL":{"severity":"HEALTH_WARN","summary":{"message":"6 backfillfull osd(s)"}},"POOL_BACKFILLFULL":{"severity":"HEALTH_WARN","summary":{"message":"14 pool(s) backfillfull"}}},"status":"HEALTH_WARN","summary":[{"severity":"HEALTH_WARN","summary":"'ceph health' JSON format has changed in luminous. If you see this your monitoring system is scraping the wrong fields. Disable this with 'mon health preluminous compat warning = false'"}],"overall_status":"HEALTH_WARN"}
This output gives the reasons why a cluster is in its current state. With better descriptions, Ceph alerts would look less scary and would take less time to resolve. If we get more specific about what we alert on, we could also cut down the number of alerts Ceph creates.
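As a rough illustration of the idea, the sketch below turns the `checks` section of the JSON shown above into one alert line per health check. It assumes the Luminous-style layout from that sample (`checks` -> check name -> `severity` and `summary.message`). The function names `parse_health_checks` and `format_alerts` are hypothetical and not part of any existing monitoring code.

```python
import json

# Trimmed copy of the sample output above (only the fields the sketch reads).
SAMPLE = (
    '{"checks":{"OSD_BACKFILLFULL":{"severity":"HEALTH_WARN",'
    '"summary":{"message":"6 backfillfull osd(s)"}},'
    '"POOL_BACKFILLFULL":{"severity":"HEALTH_WARN",'
    '"summary":{"message":"14 pool(s) backfillfull"}}},'
    '"status":"HEALTH_WARN"}'
)


def parse_health_checks(raw):
    """Return a (check, severity, message) tuple per entry in "checks".

    Assumes the Luminous-style JSON layout shown in this issue; missing
    fields fall back to placeholders instead of raising.
    """
    health = json.loads(raw)
    checks = health.get("checks", {})
    return [
        (
            name,
            detail.get("severity", "UNKNOWN"),
            detail.get("summary", {}).get("message", ""),
        )
        for name, detail in sorted(checks.items())
    ]


def format_alerts(raw):
    """Build one human-readable alert line per health check."""
    return [
        f"[{severity}] {name}: {message}"
        for name, severity, message in parse_health_checks(raw)
    ]


if __name__ == "__main__":
    for line in format_alerts(SAMPLE):
        print(line)
```

In a real check, the input would come from the output of `ceph health -f json` rather than a hard-coded string, and each check name could be mapped to its own alert severity or routing rule.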