Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
This runbook is about handling unexpected local mail piling up on a server. | ||
|
||
## Display mail | ||
|
||
View mails by running `mail` interactively (this cannot be done via this runbook). | ||
At the `mail` prompt you can view a mail by entering its number. To list the | ||
remaining mails again, type `h` (for headers). To bulk delete, type | ||
something like `d1-1000` (deletes mails number 1 to 1000). | ||
|
||
## Silence mail producers | ||
|
||
Nowadays cron is usually the only system mail producer, so we need to prevent | ||
cron from delivering mails locally. | ||
|
||
### Variant 1: mail relay | ||
|
||
If you have a mail relay configured, you should make cron use it to avoid mails | ||
staying on the server. To do so, insert a `MAILTO="<email address>"` line | ||
into the relevant crontab. | ||
|
||
### Variant 2: just silence mails | ||
|
||
Disable cron mails by inserting `MAILTO=""` in the crontab. | ||
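
As a rough sketch of Variant 2, the helper below rewrites a crontab passed on stdin so that it starts with `MAILTO=""` and contains no other `MAILTO` lines. The function name `silence_crontab` is made up for illustration, and the sketch assumes that no job depends on a different `MAILTO` further down the file.

```shell
# Hypothetical helper: read a crontab on stdin and print it with a single
# MAILTO="" line at the top, dropping any MAILTO lines it already had.
silence_crontab() {
    printf 'MAILTO=""\n'
    # grep -v exits non-zero when it prints nothing; that is fine here.
    grep -v '^[[:space:]]*MAILTO[[:space:]]*=' || true
}

# Possible usage (review the result before installing it):
#   crontab -l | silence_crontab > /tmp/crontab.new
#   crontab /tmp/crontab.new
```

Writing to a temporary file first keeps the current crontab untouched until the new one has been checked.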
|
||
### Variant 3: write to logs | ||
|
||
Append a log redirection like `>>/var/log/somelog.log 2>&1` to every cron job command (`>>` appends, while a single `>` would overwrite the log on each run). |
@@ -0,0 +1,25 @@ | ||
This runbook gives hints on how to interpret system log errors. As the kernel log receives unstructured messages of varying verbosity and formatting, it is very hard to give a definitive guide on how to interpret them. So the following is only a small collection of errors that one can safely ignore or should not ignore. | ||
|
||
## ACPI Errors | ||
|
||
ACPI errors, be it on laptops or servers, are usually irrelevant and merely indicate poor ACPI standard compliance and support by vendors. Ignore them. | ||
|
||
## blk_update_request: I/O error, dev sda, sector xxxx | ||
|
||
Your disk is most likely failing and needs replacement. Confirm with `smartctl`. | ||
|
||
## \[Hardware Error]: error_type: 2, single-bit ECC | ||
|
||
Your RAM modules might be faulty. | ||
|
||
## Hardware event. This is not a software error. | ||
|
||
Indicates a hardware problem. You might want to check the hardware status with `ipmitool`, `mcelog` or `rasdaemon` on the machine or in your iLO / ILOM / iDRAC GUI. | ||
|
||
## Buffer I/O error on device | ||
|
||
Probably a RAID problem. Check physical devices with `smartctl` and your RAID management tool. | ||
|
||
## ADDRCONF(NETDEV_UP): bond0: link is not ready | ||
|
||
Check the bond with `nmcli`. |
@@ -0,0 +1,51 @@ | ||
## ARP Debugging | ||
|
||
### Review Kernel Settings | ||
|
||
Check whether ARP resolving is enabled: the `arp_ignore` setting should be 0 | ||
|
||
sysctl net.ipv4.conf.all.arp_ignore | ||
|
||
Check ARP garbage collection settings: | ||
|
||
sysctl -a 2>/dev/null | grep net.ipv4.neigh.default.gc | ||
|
||
A typical problem on routers or large k8s cluster nodes is that the node's ARP cache runs full | ||
because `gc_thresh3` is too small. Note also that when `gc_thresh1` is larger than the set of | ||
entries you usually cache, no eviction will ever happen. | ||
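
To judge how close a node is to the hard limit, the current number of neighbour entries can be compared against `gc_thresh3`. The following is a minimal sketch; the function name `neigh_usage_percent` is an assumption made for illustration.

```shell
# Hypothetical helper: print how full the neighbour table is, as an
# integer percentage (rounded down) of the hard limit gc_thresh3.
neigh_usage_percent() {
    entries=$1   # current number of neighbour entries
    limit=$2     # value of gc_thresh3
    echo $(( entries * 100 / limit ))
}

# Possible usage on a live system:
#   neigh_usage_percent "$(ip -4 neigh show | wc -l)" \
#       "$(sysctl -n net.ipv4.neigh.default.gc_thresh3)"
```

A value that stays close to 100 suggests the threshold is too small for the node.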
|
||
Check ARP cache timeout settings: | ||
|
||
sysctl -a 2>/dev/null | grep net.ipv4.neigh.default.base_reachable | ||
|
||
While you can query settings per network interface via `sysctl` too, it’s easier to use | ||
`ip ntable show` to get an overview of the effective settings per network interface: | ||
|
||
ip ntable show | ||
|
||
### Check ARP Cache | ||
|
||
Print with | ||
|
||
ip neigh | ||
|
||
Note how valid entries are marked as `REACHABLE`, while outdated entries are marked as `STALE`. | ||
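
For a large cache, a per-state summary is easier to read than the full listing. The sketch below counts entries per state, assuming the state is the last field of each `ip neigh` output line; the function name `neigh_state_counts` is hypothetical.

```shell
# Hypothetical helper: count neighbour entries per state. Expects
# `ip neigh` output on stdin, where the state is the last field.
neigh_state_counts() {
    awk 'NF { count[$NF]++ } END { for (s in count) print s, count[s] }' | sort
}

# Possible usage on a live system:
#   ip neigh | neigh_state_counts
```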
|
||
Or, if the ARP tools (net-tools) are installed, print the ARP cache with `arp -a`, or with `arp -n` for table format. | ||
|
||
arp -a | ||
|
||
### Clear ARP Cache | ||
|
||
To clear the complete cache run | ||
|
||
ip -s -s neigh flush all | ||
|
||
If the `arp` tool is installed, you can delete individual entries with `arp -d`: | ||
|
||
arp -d <ip> | ||
|
||
### Further Reading | ||
|
||
* [https://www.baeldung.com/linux/arp-settings](https://www.baeldung.com/linux/arp-settings) | ||
* [https://manpages.debian.org/bullseye/manpages/arp.7.en.html](https://manpages.debian.org/bullseye/manpages/arp.7.en.html) |
@@ -0,0 +1,71 @@ | ||
This runbook is about analyzing disk usage issues. | ||
|
||
## Preparation | ||
|
||
Determine which mount point is full using | ||
|
||
df -h | ||
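
On hosts with many mounts it can help to filter the listing by usage. The following is a minimal sketch that reads `df -P` output (POSIX format, usage in the 5th column, mount point in the 6th); the function name `full_mounts` and the threshold of 90 in the usage line are assumptions.

```shell
# Hypothetical helper: print mount point and usage for every filesystem
# at or above a usage threshold (percent). Expects `df -P` output on stdin.
# Caveat: mount points containing spaces are not handled.
full_mounts() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= limit + 0) print $6, $5
    }'
}

# Possible usage on a live system:
#   df -P | full_mounts 90
```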
|
||
Change directory to the mount point | ||
|
||
cd /somepath ; pwd | ||
|
||
## Find largest directories | ||
|
||
Increase `-maxdepth` if the granularity is insufficient, but keep it small to get a faster result. | ||
|
||
find . -maxdepth 3 -type d -print0 | xargs -0 -n1 du -sh | sort -hr | head -15 | ||
|
||
## Find largest files | ||
|
||
This can be long running. Try limiting it to the largest directories found | ||
in the step above. | ||
|
||
find . -type f -printf '%k\t%p\n' | sort -nr | head -15 | ||
|
||
## Check for deleted files using disk space | ||
|
||
In case you do not find a cause of the disk being full, there might be large files | ||
that were deleted but are still held open by some process. Until the process | ||
closes its file handles, the disk space will not be freed. | ||
|
||
To find such files run | ||
|
||
lsof -nP +L1 | sort -nr -k 7,7 | head -15 | ||
|
||
Example output with the file size in the 7th column: | ||
|
||
code 8961 lars 28u REG 259,7 3252484 0 1577156 /tmp/.org.chromium.Chromium.zZ11HC (deleted) | ||
gnome-ter 23634 lars 26u REG 259,7 1376256 0 1581144 /tmp/#1581144 (deleted) | ||
gnome-ter 23634 lars 24u REG 259,7 458752 0 1580991 /tmp/#1580991 (deleted) | ||
gnome-ter 23634 lars 27u REG 259,7 131072 0 1581153 /tmp/#1581153 (deleted) | ||
gjs 7684 lars 17r REG 259,7 32768 0 973350 /home/lars/.local/share/gvfs-metadata/home-a1b15d43.log (deleted) | ||
evolution 7010 lars 13r REG 259,7 32768 0 973350 /home/lars/.local/share/gvfs-metadata/home-a1b15d43.log (deleted) | ||
|
||
Next restart those processes to free the disk space. | ||
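
To estimate how much space a restart would free, the sizes in the `lsof` listing can be summed. The sketch below assumes the column layout shown in the example output (device in field 6, size in field 7, inode in field 9) and counts a file only once when several processes hold it open; the function name `deleted_open_bytes` is hypothetical.

```shell
# Hypothetical helper: sum the sizes of deleted-but-open files. Expects
# `lsof -nP +L1` output on stdin (DEVICE in field 6, SIZE/OFF in field 7,
# NODE in field 9). The header line is skipped because its 7th field is
# not numeric. A file held open by several processes is counted once.
deleted_open_bytes() {
    awk '$7 ~ /^[0-9]+$/ && !seen[$6 "/" $9]++ { sum += $7 }
         END { printf "%.0f\n", sum }'
}

# Possible usage on a live system:
#   lsof -nP +L1 | deleted_open_bytes
```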
|
||
## Specific Cleanups | ||
|
||
### Systemd Logs | ||
|
||
Check systemd log usage with | ||
|
||
journalctl --disk-usage | ||
|
||
Shrink the journal to a certain size with vacuum | ||
|
||
journalctl --vacuum-size=100M | ||
|
||
### Package Manager | ||
|
||
If the root partition ran full, it might help temporarily to clean the package | ||
manager cache. For Debian-based systems: | ||
|
||
sudo apt clean | ||
|
||
## Visualize Disk Usage | ||
|
||
To get an overview of disk usage per directory, use these visualisations: | ||
|
||
- [du Radial Map](https://lzone.de/visual-ops/du+Radial+Map) | ||
- [du Tree Map](https://lzone.de/visual-ops/du+Tree+Map) |
@@ -0,0 +1,29 @@ | ||
This runbook describes how to handle dpkg problems with half-configured | ||
or broken package installations. | ||
|
||
## Broken Packages | ||
|
||
### 1. Perform dpkg Audit | ||
|
||
sudo dpkg -C | ||
|
||
If you find a problematic package, try fixing it with APT: | ||
|
||
sudo apt-get clean && sudo apt-get autoremove | ||
sudo apt-get -f install | ||
sudo dpkg --configure -a | ||
|
||
If this fails due to the broken package state, remove the package with dpkg: | ||
|
||
sudo dpkg -r <package> | ||
|
||
If removing with dpkg fails, this is usually due to a failing pre-remove script | ||
(`/var/lib/dpkg/info/<package>.prerm`). In this case, remove the failing script and retry `dpkg -r`. | ||
|
||
### 2. Check for half-configured packages | ||
|
||
dpkg -l | awk '/^iF/ {print $2}' | ||
|
||
If you find half-configured packages run | ||
|
||
sudo dpkg-reconfigure <package> |
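
The `awk` filter above only catches half-configured packages (state `F`). As a sketch, the same idea can be widened to the other incomplete states that `dpkg -l` reports in the second status character (`H` half-installed, `U` unpacked). The function name `incomplete_packages` is made up for illustration.

```shell
# Hypothetical helper: list packages whose dpkg state is half-installed (H),
# unpacked (U) or half-configured (F). Expects `dpkg -l` output on stdin and
# prints the status flags followed by the package name.
incomplete_packages() {
    awk '$1 ~ /^[a-z][HUF]/ { print $1, $2 }'
}

# Possible usage on a live system:
#   dpkg -l | incomplete_packages
```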
@@ -0,0 +1,31 @@ | ||
This runbook is about fixing unexpected systemd unit states. | ||
|
||
## Failed units | ||
|
||
Find out why the unit failed with | ||
|
||
journalctl -u <unit name> | ||
|
||
and restart with | ||
|
||
systemctl restart <unit name> | ||
|
||
In case the unit has failed too often, you might need to reset the failed state with | ||
|
||
systemctl reset-failed <unit name> | ||
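
When several units have failed at once, it can be handy to collect their names first. The following is a minimal sketch that extracts unit names from `systemctl list-units --state=failed --no-legend --plain` output; the function name `failed_unit_names` is an assumption.

```shell
# Hypothetical helper: print the unit name (first field) of each line.
# Expects the output of
#   systemctl list-units --state=failed --no-legend --plain
# on stdin.
failed_unit_names() {
    awk 'NF { print $1 }'
}

# Possible usage on a live system (shows the last log lines per unit):
#   systemctl list-units --state=failed --no-legend --plain \
#       | failed_unit_names \
#       | while read -r unit; do journalctl -u "$unit" -n 20 --no-pager; done
```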
|
||
## Masked units | ||
|
||
If a unit is 'masked', it means someone deliberately disabled it so that it cannot be started. You can undo this with | ||
|
||
systemctl unmask <unit name> | ||
|
||
## Advanced Debugging | ||
|
||
Inspect and override the unit definition with | ||
|
||
systemctl edit <unit> | ||
|
||
If you change unit files directly, make sure to reload systemd: | ||
|
||
systemctl daemon-reload |