Adding runbooks

lwindolf · Jan 24, 2024 · 6e34e12
1 parent e04201e
Showing 6 changed files with 234 additions and 0 deletions.
diff --git a/runbooks/Mail.md b/runbooks/Mail.md
@@ -0,0 +1,27 @@
+This runbook is about handling sudden mail on a server.
+
+## Display mail
+
+View mails by running `mail` interactively (this cannot be done via this runbook).
+At the `mail` prompt you can view a mail by entering its number. To list the
+remaining mails again, type `h` (for headers). To bulk delete, type something
+like `d1-1000` (deletes mails number 1 to 1000).
+
+## Silence mail producers
+
+On most systems the main producer of local mail is cron. So we need to prevent
+cron from delivering mails locally.
+
+### Variant 1: mail relay
+
+If you have a mail relay configured, make cron use it so that mails do not
+stay on the server. To do so, insert a `MAILTO="<email address>"` line
+into the relevant crontab.
+
+### Variant 2: just silence mails
+
+Disable cron mails by inserting `MAILTO=""` in the crontab.
+
+### Variant 3: write to logs
+
+Append a log redirection like `>>/var/log/somelog.log 2>&1` to the command of every cron job.
diff --git a/runbooks/System Log.md b/runbooks/System Log.md
@@ -0,0 +1,25 @@
+This runbook gives hints on how to interpret system log errors. As the kernel log receives unstructured messages of varying verbosity and formatting, it is very hard to give a definitive guide on how to interpret them. So the following is only a small collection of errors that one can safely ignore or should not ignore.
+
+## ACPI Errors
+
+ACPI errors, be it on laptops or servers, are usually irrelevant and merely indicate poor ACPI standard compliance and support by vendors. Ignore them.
+
+## blk_update_request: I/O error, dev sda, sector xxxx
+
+Your disk (here `sda`) is most likely failing and needs replacement.
+
+## \[Hardware Error]: error_type: 2, single-bit ECC
+
+Your RAM modules might be faulty.
+
+## Hardware event. This is not a software error.
+
+Indicates a hardware problem. You might want to check the hardware status with `ipmitool`, `mcelog` or `rasdaemon` on the machine, or in your iLO / ILOM / iDRAC GUI.
+
+## Buffer I/O error on device
+
+Probably a RAID problem. Check physical devices with `smartctl` and your RAID management tool.
+
+## ADDRCONF(NETDEV_UP): bond0: link is not ready
+
+Check the bond with `nmcli`.
diff --git a/runbooks/arp.md b/runbooks/arp.md
@@ -0,0 +1,51 @@
+## ARP Debugging
+
+### Review Kernel Settings
+
+Check that the host answers ARP requests: the `arp_ignore` setting should be 0 (the default)
+
+    sysctl net.ipv4.conf.all.arp_ignore
+
+Check ARP garbage collection settings:
+
+    sysctl -a 2>/dev/null | grep net.ipv4.neigh.default.gc
+
+A typical problem on routers or large k8s cluster nodes is that the node's ARP cache runs full
+because `gc_thresh3` is too small. Note also that when `gc_thresh1` is larger than the number of
+entries you usually cache, no eviction will ever happen.
+
+Check ARP cache timeout settings:
+
+    sysctl -a 2>/dev/null | grep net.ipv4.neigh.default.base_reachable
+
+While you can query settings per network interface via `sysctl` too, it is easier to use
+`ip ntable show` to get an overview of the effective settings per network interface:
+
+    ip ntable show
+
+### Check ARP Cache
+
+Print with
+
+    ip neigh
+
+Note how valid entries are marked as `REACHABLE` while outdated entries are marked as `STALE`.
+
+If the ARP tools are installed, you can also print the ARP cache with `arp -a`, or with `arp -n` for a table format with numeric addresses.
+
+    arp -a
+
+### Clear ARP Cache
+
+To clear the complete cache run
+
+    ip -s -s neigh flush all
+
+The `arp` tool can only delete individual entries. To do so run:
+
+    arp -d <ip>
+
+### Further Reading
+
+* [https://www.baeldung.com/linux/arp-settings](https://www.baeldung.com/linux/arp-settings)
+* [https://manpages.debian.org/bullseye/manpages/arp.7.en.html](https://manpages.debian.org/bullseye/manpages/arp.7.en.html)
diff --git a/runbooks/df.md b/runbooks/df.md
@@ -0,0 +1,71 @@
+This runbook is about analyzing disk usage issues.
+
+## Preparation
+
+Determine the mount point that is full with
+
+    df -h
+
+Change directory to the mount point
+
+    cd /somepath ; pwd
+
+## Find largest directories
+
+Increase `-maxdepth` if the granularity is insufficient, but keep it small to get a faster result.
+
+    find . -maxdepth 3 -type d -print0 | xargs -0 -n1 du -sh | sort -hr | head -15
+
+## Find largest files
+
+This can take a long time. Try limiting it to the largest directories found
+in the step above.
+
+    find . -type f -printf '%k\t%p\n' | sort -nr | head -15
+
+## Check for deleted files using disk space
+
+In case you do not find the cause of the disk being full, there might be large files
+that were deleted but are still held open by some process. As long as the process does
+not close its file handles, the disk space will not be freed.
+
+To find such files run
+
+    lsof -nP +L1 | sort -nr -k 7 | head -15
+
+Example output with the file size in the 7th column:
+
+    code       8961 lars   28u   REG  259,7  3252484     0 1577156 /tmp/.org.chromium.Chromium.zZ11HC (deleted)
+    gnome-ter 23634 lars   26u   REG  259,7  1376256     0 1581144 /tmp/#1581144 (deleted)
+    gnome-ter 23634 lars   24u   REG  259,7   458752     0 1580991 /tmp/#1580991 (deleted)
+    gnome-ter 23634 lars   27u   REG  259,7   131072     0 1581153 /tmp/#1581153 (deleted)
+    gjs        7684 lars   17r   REG  259,7    32768     0  973350 /home/lars/.local/share/gvfs-metadata/home-a1b15d43.log (deleted)
+    evolution  7010 lars   13r   REG  259,7    32768     0  973350 /home/lars/.local/share/gvfs-metadata/home-a1b15d43.log (deleted)
+
+Next restart those processes to free the disk space.
+
+## Specific Cleanups
+
+### Systemd Logs
+
+Check systemd log usage with
+
+    journalctl --disk-usage
+
+Shrink the journal to a certain size with vacuum
+
+    journalctl --vacuum-size=100M
+
+### Package Manager
+
+If the root partition ran full, it might help temporarily to clean the package
+manager cache. For Debian-based systems:
+
+    sudo apt clean
+
+## Visualize Disk Usage
+
+To get an overview of disk usage per directory, use these visualisations:
+
+- [du Radial Map](https://lzone.de/visual-ops/du+Radial+Map)
+- [du Tree Map](https://lzone.de/visual-ops/du+Tree+Map)
diff --git a/runbooks/dpkg.md b/runbooks/dpkg.md
@@ -0,0 +1,29 @@
+This runbook describes how to handle dpkg problems with half-configured
+or broken package installations.
+
+## Broken Packages
+
+### 1. Perform dpkg Audit 
+
+    sudo dpkg -C
+
+If you find a problematic package try fixing it with APT
+
+    sudo apt-get clean && sudo apt-get autoremove
+    sudo apt-get -f install
+    sudo dpkg --configure -a
+
+If this fails due to the broken package state, remove the package with dpkg
+
+    sudo dpkg -r <package>
+
+If removal with dpkg fails, this is usually due to a failing pre-remove script
+(`/var/lib/dpkg/info/<package>.prerm`). In this case remove the failing script and retry `dpkg -r`.
+
+### 2. Check for half-configured packages
+
+    dpkg -l | awk '/^iF/ {print $2}'
+
+If you find half-configured packages, finish their configuration with
+
+    sudo dpkg --configure <package>
diff --git a/runbooks/systemd.md b/runbooks/systemd.md
@@ -0,0 +1,31 @@
+This runbook is about fixing unexpected systemd unit states.
+
+## Failed units
+
+Find out why the unit failed with
+
+    journalctl -u <unit name>
+
+and restart with
+
+    systemctl restart <unit name>
+
+In case the unit has failed too often, you might need to reset the failed state with
+
+    systemctl reset-failed <unit name>
+
+## Masked units
+
+If a unit is 'masked', someone deliberately disabled it so that it cannot be started. You can undo this with
+
+    systemctl unmask <unit name>
+
+## Advanced Debugging
+
+Inspect the unit definition with `systemctl cat <unit name>`, or open an override for editing with
+
+    systemctl edit <unit name>
+
+If you change unit files manually, make sure to reload systemd:
+
+    systemctl daemon-reload