Adding runbooks
lwindolf committed Jan 24, 2024
1 parent e04201e commit 6e34e12
Showing 6 changed files with 234 additions and 0 deletions.
27 changes: 27 additions & 0 deletions runbooks/Mail.md
@@ -0,0 +1,27 @@
This runbook is about handling unexpected local mail piling up on a server.

## Display mail

View mails by running `mail` interactively (this cannot be done via this runbook).
At the `mail` prompt you can view a mail by entering its number. To list the
remaining mails again, type `h` (for headers). To bulk delete, type
something like `d1-1000` (deletes mails number 1 to 1000).

## Silence mail producers

Nowadays cron is usually the only producer of system mail. So we need to
prevent cron from delivering mails locally.

### Variant 1: mail relay

If you have a mail relay configured you should make cron use it to avoid mails
staying on the server. To do so insert a `MAILTO="<email address>"` line
into the relevant crontab.

### Variant 2: just silence mails

Disable cron mails by inserting `MAILTO=""` in the crontab.
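
Before applying variant 1 or 2 it can help to see which crontabs already set `MAILTO`. The following is a minimal sketch; the function name `crontab_sets_mailto` is made up for illustration, and it only looks at the crontab text it is given on stdin.

```shell
# Hypothetical helper: succeeds if the crontab text on stdin
# contains a MAILTO assignment (empty or not).
crontab_sets_mailto() {
    grep -Eq '^[[:space:]]*MAILTO[[:space:]]*='
}
```

For the current user's crontab this could be used as `crontab -l | crontab_sets_mailto || echo "no MAILTO set"`.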

### Variant 3: write to logs

Append a log redirection like `>/var/log/somelog.log 2>&1` to all your cron jobs.
25 changes: 25 additions & 0 deletions runbooks/System Log.md
@@ -0,0 +1,25 @@
This runbook gives hints on how to interpret system log errors. As the kernel log receives unstructured messages of varying verbosity and formatting, it is very hard to give a definitive guide on how to interpret them. So the following is only a small collection of errors that can safely be ignored and errors that should not be ignored.

## ACPI Errors

ACPI errors, be it on laptops or servers, are usually irrelevant and just indicate poor compliance with and support for the ACPI standard by vendors. Ignore them.

## blk_update_request: I/O error, dev sda, sector xxxx

Your disk is failing and needs replacement.

## \[Hardware Error]: error_type: 2, single-bit ECC

Your RAM modules might be faulty.

## Hardware event. This is not a software error.

Indicates a hardware problem. You might want to check HW status with `ipmitool`, `mcelog` or `rasdaemon` on the machine or in your ILO / ILOM / IDRAC GUI.

## Buffer I/O error on device

Probably a RAID problem. Check physical devices with `smartctl` and your RAID management tool.

## ADDRCONF(NETDEV_UP): bond0: link is not ready

Check the bond status with `nmcli`.
51 changes: 51 additions & 0 deletions runbooks/arp.md
@@ -0,0 +1,51 @@
## ARP Debugging

### Review Kernel Settings

Check how the kernel answers ARP requests: the `arp_ignore` setting should be 0 (the default, reply for any local address)

sysctl net.ipv4.conf.all.arp_ignore

Check ARP garbage collection settings:

sysctl -a 2>/dev/null | grep net.ipv4.neigh.default.gc

A typical problem on routers or large k8s cluster nodes is that the node's ARP cache runs full
because `gc_thresh3` is too small. Note also that as long as the number of cached entries stays
below `gc_thresh1`, no eviction will ever happen.
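
To judge how close a node is to the limit, compare the number of cached entries with the hard limit. The sketch below is only an illustration: `arp_cache_usage_pct` is a made-up helper that takes the entry count and the limit as arguments and prints the usage as a whole percentage.

```shell
# Hypothetical helper: print cache usage in percent (integer division).
# $1 = number of cached neighbor entries, $2 = hard limit (gc_thresh3)
arp_cache_usage_pct() {
    if [ "$2" -le 0 ]; then
        echo "limit must be positive" >&2
        return 1
    fi
    echo $(( $1 * 100 / $2 ))
}
```

A rough check for IPv4 could then be `arp_cache_usage_pct "$(ip -4 neigh show | wc -l)" "$(sysctl -n net.ipv4.neigh.default.gc_thresh3)"`. The count includes entries in every state, so treat the result as an estimate.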

Check ARP cache timeout settings:

sysctl -a 2>/dev/null | grep net.ipv4.neigh.default.base_reachable

While you can also query settings per network interface via `sysctl`, it’s easier to use
`ip ntable show` to get an overview of effective settings per network interface:

ip ntable show

### Check ARP Cache

Print with

ip neigh

Note that valid entries are marked as `REACHABLE`, while outdated entries are marked as `STALE`.
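
On a large cache it is easier to look at a summary per state than at every entry. The following sketch assumes that the state is the last field of each `ip neigh` output line; `count_neigh_states` is a made-up helper name.

```shell
# Hypothetical helper: count neighbor entries per state.
# Reads `ip neigh` output on stdin and prints "STATE COUNT" lines.
count_neigh_states() {
    awk 'NF > 0 { count[$NF]++ }
         END { for (state in count) print state, count[state] }' | sort
}
```

Use it as `ip neigh | count_neigh_states`.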

Or, if the ARP tools (net-tools) are installed, print the ARP cache with `arp -a`, or with `arp -n` for numeric table format.

arp -a

### Clear ARP Cache

To clear the complete cache run

ip -s -s neigh flush all

To delete individual entries with the ARP tools run:

arp -d <ip>

### Further Reading

* [https://www.baeldung.com/linux/arp-settings](https://www.baeldung.com/linux/arp-settings)
* [https://manpages.debian.org/bullseye/manpages/arp.7.en.html](https://manpages.debian.org/bullseye/manpages/arp.7.en.html)
71 changes: 71 additions & 0 deletions runbooks/df.md
@@ -0,0 +1,71 @@
This runbook is about analyzing disk usage issues.

## Preparation

Determine which mount point is full with

df -h

Change directory to the mount point

cd /somepath ; pwd

## Find largest directories

Increase `-maxdepth` if the granularity is insufficient, but keep it small to get a faster result.

find . -maxdepth 3 -type d -print0 | xargs -0 -n1 du -sh | sort -hr | head -15

## Find largest files

This can be long running. Try limiting it to the largest directories found
in the step above.

find . -type f -printf '%k\t%p\n' | sort -nr | head -15

## Check for deleted files using disk space

In case you do not find the cause of the disk being full, there might be large files
that were deleted but are still held open by some process. Until the process
closes its file handles, the disk space will not be freed.

To find such files run

lsof -nP +L1 | sort -nr -k7,7 | head -15

Example output (shown unsorted) with the file size in the 7th column:

gnome-ter 23634 lars 27u REG 259,7 131072 0 1581153 /tmp/#1581153 (deleted)
gnome-ter 23634 lars 26u REG 259,7 1376256 0 1581144 /tmp/#1581144 (deleted)
gnome-ter 23634 lars 24u REG 259,7 458752 0 1580991 /tmp/#1580991 (deleted)
code 8961 lars 28u REG 259,7 3252484 0 1577156 /tmp/.org.chromium.Chromium.zZ11HC (deleted)
gjs 7684 lars 17r REG 259,7 32768 0 973350 /home/lars/.local/share/gvfs-metadata/home-a1b15d43.log (deleted)
evolution 7010 lars 13r REG 259,7 32768 0 973350 /home/lars/.local/share/gvfs-metadata/home-a1b15d43.log (deleted)

Next, restart those processes to free the disk space.
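
To see which process is worth restarting first, the sizes can be summed per process. This is a sketch under the assumption that `lsof` prints the column layout of the example above (size in the 7th field, no extra TID column, no spaces in command names); `deleted_bytes_by_process` is a made-up helper name.

```shell
# Hypothetical helper: sum the size of deleted-but-open regular files
# per process. Reads `lsof -nP +L1` output on stdin and prints
# "BYTES PID COMMAND", largest first.
deleted_bytes_by_process() {
    awk '$5 == "REG" && $7 ~ /^[0-9]+$/ { bytes[$2 " " $1] += $7 }
         END { for (key in bytes) printf "%.0f %s\n", bytes[key], key }' |
        sort -k1,1nr -k2,2n
}
```

Use it as `lsof -nP +L1 | deleted_bytes_by_process | head -15`. A file that is held open by several processes is counted once for each of them.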

## Specific Cleanups

### Systemd Logs

Check systemd log usage with

journalctl --disk-usage

Shrink the journal to a given size with vacuum

journalctl --vacuum-size=100M

### Package Manager

If the root partition ran full, it might temporarily help to clean the package
manager cache. For Debian-based systems:

sudo apt clean

## Visualize Disk Usage

To get an overview of disk usage per directory use these visualisations:

- [du Radial Map](https://lzone.de/visual-ops/du+Radial+Map)
- [du Tree Map](https://lzone.de/visual-ops/du+Tree+Map)
29 changes: 29 additions & 0 deletions runbooks/dpkg.md
@@ -0,0 +1,29 @@
This runbook describes how to handle dpkg problems with half-configured
or broken package installations.

## Broken Packages

### 1. Perform dpkg Audit

sudo dpkg -C

If you find a problematic package try fixing it with APT

sudo apt-get clean && sudo apt-get autoremove
sudo apt-get -f install
sudo dpkg --configure -a

If this fails due to the broken package state, remove the package directly with dpkg

sudo dpkg -r <package>

If the removal with dpkg fails, this is usually due to a failing pre-remove script
(usually `/var/lib/dpkg/info/<package>.prerm`). In this case remove the failing script and retry `dpkg -r`.

### 2. Check for half-configured packages

dpkg -l | awk '/^iF/ {print $2}'

If you find half-configured packages run

sudo dpkg-reconfigure <package>
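
The `iF` pattern above matches the first column of `dpkg -l`: the first letter is the desired state (`i` for install) and the second the current state (`F` for half-configured). The sketch below widens the filter to the states half-installed (`H`), unpacked (`U`) and half-configured (`F`); `dpkg_incomplete_packages` is a made-up helper name.

```shell
# Hypothetical helper: list packages whose current state is
# half-installed (H), unpacked (U) or half-configured (F).
# Reads `dpkg -l` output on stdin and prints "STATE PACKAGE".
dpkg_incomplete_packages() {
    awk '$1 ~ /^[a-z][HUF]R?$/ { print $1, $2 }'
}
```

Use it as `dpkg -l | dpkg_incomplete_packages`. The header lines of `dpkg -l` and fully installed (`ii`) packages do not match the pattern.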
31 changes: 31 additions & 0 deletions runbooks/systemd.md
@@ -0,0 +1,31 @@
This runbook is about fixing unexpected systemd unit states.

## Failed units

Find out why the unit failed with

journalctl -u <unit name>

and restart with

systemctl restart <unit name>

In case the unit has failed too often you might need to reset the failed state with

systemctl reset-failed <unit name>
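
If you do not know yet which units failed, you can list them first. The sketch below assumes the column layout of `systemctl list-units --plain --no-legend` (unit name in the first field, active state in the third); `failed_unit_names` is a made-up helper name.

```shell
# Hypothetical helper: print the names of failed units.
# Reads `systemctl list-units --plain --no-legend` output on stdin.
failed_unit_names() {
    awk '$3 == "failed" { print $1 }'
}
```

Use it as `systemctl list-units --state=failed --plain --no-legend | failed_unit_names` and feed each name into the `journalctl -u` step above.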

## Masked units

If a service is 'masked' it means someone deliberately disabled it so that it cannot be started at all. You can undo this with

systemctl unmask <unit name>

## Advanced Debugging

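Inspect the unit definition, including drop-in overrides, with

systemctl cat <unit>

To change it, create an override with

systemctl edit <unit>

`systemctl edit` reloads systemd on its own. If you change unit files by hand, make sure to reload systemd: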

systemctl daemon-reload
