start process to send email notifications upon btrfs problems? #88

Open
testbird opened this issue Oct 25, 2020 · 6 comments

@testbird
testbird commented Oct 25, 2020

I am wondering how btrfs users can make sure they are automatically notified (only) of errors and warnings, which usually happen in the background and are very likely to go unnoticed.

Btrfsmaintenance seemed like a good find for this, since it covers the necessary background tasks.

But couldn't it also run something like a watchdog task (i.e. bash pipe)?
It could filter the log for any btrfs warnings and errors that occur during daily usage, and send out emails as soon as they occur.
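
A minimal sketch of such a watchdog filter, written in Python rather than as a bash pipe. It assumes kernel messages look like `BTRFS <level> (device <name>): <text>` and treats only the levels `warning`, `error` and `critical` as problems; both the pattern and the names `is_btrfs_problem`, `filter_problems` and `watch` are illustrative assumptions, and the actual mail delivery is left out.

```python
import re

# Assumed message shape: "BTRFS <level> (device <name>): <text>".
# The levels matched here are an assumption, not an exhaustive list.
PROBLEM_RE = re.compile(r"\bBTRFS (warning|error|critical)\b", re.IGNORECASE)


def is_btrfs_problem(line):
    """Return True if the log line looks like a btrfs warning or error."""
    return PROBLEM_RE.search(line) is not None


def filter_problems(lines):
    """Yield only the problem lines from any iterable of log lines."""
    for line in lines:
        if is_btrfs_problem(line):
            yield line.rstrip("\n")


def watch(stream):
    """Print matching lines as they arrive; a mailer could replace print()."""
    for hit in filter_problems(stream):
        print(hit, flush=True)
```

Calling `watch(sys.stdin)` at the end of a script would let it sit behind a following log reader, for example `journalctl -k -f`, with each printed line handed to whatever sends the mail.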

ximion added a commit to ximion/btrfsmaintenance that referenced this issue Mar 19, 2022
This can be very useful for smaller setups where the admin still would
like to receive an email in case a disk in a btrfs RAID array fails.

Partially resolves kdave#88
ximion added a commit to ximion/btrfsmaintenance that referenced this issue Mar 21, 2022
This can be very useful for smaller setups where the admin still would
like to receive an email in case a disk in a btrfs RAID array fails.

Partially resolves kdave#88
@Ultranium
Ultranium commented Oct 16, 2024

Adding some sort of notifications (for example, via email or running a custom script) is a must.
Basically, scrubbing is useless if users don't find out in time that data corruption has happened, while they most likely still have fresh backups and can avoid permanent data loss.

If there are no notifications and users don't monitor logs on a regular basis, they will find out about the filesystem failure much later, maybe after a few years, when it could already be too late to restore from a backup.

@eku
eku commented Oct 16, 2024

Aren't the messages in the journal? Why should btrfsmaintenance take over sending mail itself? Simply monitor the journal.

@Ultranium
Ultranium commented Oct 16, 2024

I doubt many average users monitor logs regularly, or at all.
That's why ZFS has ZED, which can send email notifications if something isn't right (or if everything is alright and you just want to be told when a scrub or a pool resilver has finished). Having something similar for BTRFS would be great.

@Zygo
Zygo commented Oct 16, 2024

mdadm has a similar capability too.

Not everyone has a journal.

Reading the journal (or dmesg directly) is complicated by various issues:

  • how to identify the filesystem from the kernel messages? It's not impossible, but it has a lot of corner cases, particularly with device-mapper aliases, replaced devices, and dropped ratelimited messages. With btrfs dev stats you already know how to identify a filesystem mount point because you had to pass it to btrfs dev stats.
  • kernel messages are ratelimited, so some errors might not be reported, but they are counted in btrfs dev stats.
  • if the system goes down because of the failure, it might come back up without the messages being accessible. Journal files are data blocks, while dev stats are metadata items, and btrfs with the default noflushoncommit mount option will write the metadata blocks before the data blocks. A temporary drive issue that causes a crash might cause the journal update to be lost while the dev stats are retained.
  • it only works if there's one journal. If an error is detected while booting from a live USB stick, it is written to a journal, but not the journal of the machine when it's booted normally. A similar issue arises with removable media that moves from one host to another. btrfs dev stats stores the error counts inside the filesystem, so the counts are always available from the filesystem itself.
  • failures during readahead have possible causes that are unrelated to the device. Readahead operations are generally considered expendable. Some block layers (particularly LVM) will simply fail readahead reads if there's not enough memory available, or the reads are inconvenient to handle correctly. If any process actually requires the reads to be completed then they will later issue non-readahead read requests that won't be dropped.

The last one is a bit complicated.

btrfs dev stats does not count errors that occur during readahead operations. These errors are silently corrected but not recorded in btrfs dev stats (see the discussion of that change; note that the specific bug discussed in the thread was fixed, but the general non-counting of readahead errors remains). For readahead reads that fail because of non-device reasons, this is harmless behavior.

If there's a real device failure that happens during a readahead operation (e.g. a corrupted block is detected on a failing cheap SSD), then dev stats will report no trace of the problem because readahead errors are not counted. With raid1 or dup profiles, btrfs might still self-repair the corrupted block, so the corruption cannot be detected by future reads or scrubs. Data corruption is an important early indicator of SSD failure, so losing some of the detected csum events is a significant problem for a monitoring system. Journal/dmesg monitoring can catch that kind of problem, at the expense of some false positives.
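
A rough sketch of the dev stats side of such a monitor. It assumes the output of `btrfs device stats <mountpoint>` consists of lines shaped like `[/dev/sda1].corruption_errs  3`; the function names are made up for illustration. As described above, readahead errors are not counted, so this would complement journal/dmesg monitoring rather than replace it.

```python
import re
import subprocess

# Assumed line shape of `btrfs device stats <mountpoint>`:
#   [/dev/sda1].write_io_errs    0
STAT_RE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<name>\w+)\s+(?P<value>\d+)$")


def parse_dev_stats(text):
    """Return {(device, counter): value} for every counter line in text."""
    stats = {}
    for line in text.splitlines():
        m = STAT_RE.match(line.strip())
        if m:
            stats[(m["dev"], m["name"])] = int(m["value"])
    return stats


def nonzero_counters(stats):
    """Keep only the counters that have recorded at least one error."""
    return {key: value for key, value in stats.items() if value != 0}


def read_dev_stats(mountpoint):
    """Run `btrfs device stats` for one mount point (needs btrfs-progs)."""
    result = subprocess.run(
        ["btrfs", "device", "stats", mountpoint],
        capture_output=True, text=True, check=True,
    )
    return parse_dev_stats(result.stdout)
```

A periodic job could call `nonzero_counters(read_dev_stats("/mnt/data"))` (the mount point is a placeholder) and notify only when the result is non-empty, or when it differs from the previous run.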

@eku
eku commented Oct 17, 2024

> If there are no notifications

AFAIK the btrfsmaintenance scripts run either via cron or a systemd timer. Both send the output via mail to the system administrator. Don't you use these?

@Ultranium
Ultranium commented Oct 18, 2024

> > If there are no notifications
>
> AFAIK the btrfsmaintenance scripts run either via cron or a systemd timer. Both send the output via mail to the system administrator. Don't you use these?

If you are talking about the MAILTO= cron directive, it will send the command's output regardless of its exit code, thus cluttering the admin's mailbox. In ZFS ZED it's possible to send notifications only if a pool is degraded, which is convenient.

BTW, I'm not saying that btrfsmaintenance must copy ZFS, I'm just pointing out how it could be improved.
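
One way to get "mail only on failure" out of plain cron is a wrapper that stays silent when the job succeeds, so cron has no output to mail. The sketch below is a hypothetical helper, not part of btrfsmaintenance, and it assumes the wrapped command signals problems through a non-zero exit status, which would need to be verified for each maintenance script.

```python
import subprocess
import sys


def run_quiet(cmd):
    """Run cmd; return (exit_code, report), with an empty report on success."""
    result = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    if result.returncode == 0:
        return 0, ""
    # On failure, keep the captured output and say which command failed.
    header = f"{' '.join(cmd)} exited with status {result.returncode}\n"
    return result.returncode, header + result.stdout


def main(argv):
    """Print the report only when the wrapped command failed."""
    code, report = run_quiet(argv)
    if report:
        sys.stdout.write(report)
    return code
```

A script ending in `sys.exit(main(sys.argv[1:]))` could then be put in front of the real command in the crontab entry.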

4 participants