filesystem error count metric #3113

anarcat · 2024-09-06T04:01:19Z

We are porting various alerts from Nagios to the Prometheus ecosystem, and we've found one check that is useful in Nagios but seems to be missing from the node exporter. It inspects EXT filesystems with the tune2fs -l command and (basically) greps for the FS Error count field.

This should normally be zero, but under certain circumstances (failing disk, filesystem bug, power outage) it will rise. Running fsck on the filesystem will fix this. Normally a reboot after a power outage runs fsck, but in some cases it might not complete the repair.

So I think the node exporter should do this. I've tried to find metrics about this in our node exporters and couldn't find anything under the node_filesystem_* namespace. There is node_filesystem_readonly and, according to this post, node_filesystem_device_error (but I can't see that metric here), but neither of those is the same as the error count.

Am I missing something, or is this missing from the node exporter?

Here's a copy of the check, called dsa-check-filesystems here:

#!/usr/bin/ruby

require 'filesystem'

# Filesystem types that are never checked (network, virtual and pseudo filesystems).
ignorefs = ["NFS", "nfs", "nfs4", "nfsd", "afs", "binfmt_misc", "proc", "smbfs",
	   "autofs", "iso9660", "ncpfs", "coda", "devpts", "ftpfs", "devfs",
	   "mfs", "shfs", "sysfs", "cifs", "lustre_lite", "tmpfs", "usbfs",
	   "udf", "fusectl", "fuse.snapshotfs", "rpc_pipefs"]
mountpoints = {}

# Collect mounted devices, skipping ignored types and bind mounts.
FileSystem.mounts.each do |m|
	if ((not ignorefs.include?(m.fstype)) && (m.options !~ /bind/))
		mountpoints[m.device] = { 'type' => m.fstype, 'mount' => m.mount }
	end
end

# Despite the name, this is called for any filesystem whose type matches /ext/.
# Returns a message if tune2fs reports a non-zero error count, nil otherwise.
def check_ext3(dev, mnt)
	output=%x{tune2fs -l #{dev}}
	if output =~ /FS Error count:\s*(\d+)/ and $1.to_i > 0
		return "#{dev} (#{mnt}) has #{$1} errors"
	end
end

output = []
mountpoints.keys.each do |m|
	temp = ''
	begin
		if mountpoints[m]['type'] =~ /ext/
			temp = check_ext3(m, mountpoints[m]['mount'])
		end
	rescue Exception => e
		# Any failure while checking a device is silently ignored.
	end
	if temp && (temp.length > 0)
		output << temp
	end
end

# Exit status 1 is reported to Nagios as WARNING, 0 as OK.
if output.length > 0
	puts output.join("\n")
	exit 1
end
puts "OK: All filesystems ok."
exit 0
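
To make the parsing step easier to test without root or a real block device, the match can be pulled out into a pure function. This is only a sketch: the helper name fs_error_count and the sample text are made up for illustration, and treating a missing field as zero is an assumption rather than something the check above states.

```ruby
# Hypothetical helper: extract the error count from `tune2fs -l` output.
# Assumption: a missing "FS Error count" line is treated as zero errors.
def fs_error_count(tune2fs_output)
  match = tune2fs_output.match(/^FS Error count:\s*(\d+)/)
  match ? match[1].to_i : 0
end

# Illustrative sample text, not captured from a real system.
sample = <<~OUT
  Filesystem volume name:   <none>
  FS Error count:           3
OUT

puts fs_error_count(sample)
```

The function takes plain text, so it can be fed saved output in a unit test instead of shelling out.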


SuperQ · 2024-09-06T08:13:57Z

The node_exporter collector policy does not allow subprocess execution. It also does not allow for functions that require root privileges.

This can probably be solved by reading from /sys/fs/ext4/. There is a work in progress to implement this in prometheus/procfs.
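
For illustration, a rough sketch of the sysfs approach in Ruby (the language of the check above, not how node_exporter or prometheus/procfs implement it). It assumes the kernel exposes one directory per mounted ext4 filesystem under /sys/fs/ext4/ with an errors_count file in it. The function names and the ext4_errors_count metric name are hypothetical.

```ruby
# Sketch: read ext4 error counters from sysfs, with no subprocess.
# Assumption: /sys/fs/ext4/<device>/errors_count exists for each mounted
# ext4 filesystem. The root is a parameter so a fake tree can be used in tests.
def ext4_error_counts(sysfs_root = '/sys/fs/ext4')
  counts = {}
  Dir.glob(File.join(sysfs_root, '*', 'errors_count')).sort.each do |path|
    device = File.basename(File.dirname(path))
    begin
      counts[device] = Integer(File.read(path).strip, 10)
    rescue SystemCallError, ArgumentError
      # Unreadable or malformed entry: skip it rather than abort the scan.
      next
    end
  end
  counts
end

# Render the counts as text lines. The metric name is hypothetical.
def render_metrics(counts)
  counts.map { |dev, n| %(ext4_errors_count{device="#{dev}"} #{n}) }.join("\n")
end

puts render_metrics(ext4_error_counts) if __FILE__ == $PROGRAM_NAME
```

Directories without an errors_count file are simply not matched by the glob, so they are ignored.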

anarcat · 2024-09-07T01:36:16Z

right, running tune2fs seemed like an odd idea in the first place, i was hoping for something exactly like that.

the PR you linked to has been merged, so we're getting close? :)

i don't quite understand what it takes to percolate stuff from procfs into the node exporter itself, now we'd need a stub to call that ext4.fs.ProcStat() thing next? or does procfs need to make a release first?

discordianfish · 2024-09-22T12:37:08Z

Yes, let's track this in #3005 and close this one.

discordianfish closed this as completed Sep 22, 2024

anarcat mentioned this issue Sep 23, 2024

Disk and filesystem error metrics #3005

Open
