snapshot_create / snapshot_revert default boot failure if newest installed kernel not in use #84

Open
rjo-uk opened this issue Dec 12, 2024 · 2 comments · May be fixed by #86


@rjo-uk

rjo-uk commented Dec 12, 2024

Summary

Although not best practice, it's possible to install a new kernel package without rebooting into it, for example when a maintenance window does not allow downtime.

When a new kernel is installed but not booted into, the snapshot_create role backs up only the /boot files for the currently running kernel. It also backs up grubenv, which typically references the newer kernel. Because of this mismatch, the default grub entry after running snapshot_revert points at a kernel that was never backed up, and booting it fails with the error file /vmlinuz-<newer kernel> not found.

Additional Details / Steps to Reproduce

A RHEL 7.9 install with the default kernel:

[root@rhel7 ~]# rpm -q kernel
kernel-3.10.0-1160.el7.x86_64
[root@rhel7 ~]# grubby --default-kernel
/boot/vmlinuz-3.10.0-1160.el7.x86_64

Update the kernel but do not reboot the server:

[root@rhel7 ~]# yum update -y kernel
[root@rhel7 ~]# rpm -q kernel
kernel-3.10.0-1160.el7.x86_64
kernel-3.10.0-1160.119.1.el7.x86_64
[root@rhel7 ~]# grubby --default-kernel
/boot/vmlinuz-3.10.0-1160.119.1.el7.x86_64
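This state can be spotted before taking a snapshot by comparing the running kernel with the default boot entry. The following is only a minimal sketch: the function names kernel_version_from_path and running_kernel_is_default are hypothetical and are not part of the role.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of the snapshot_create role.

# Strip the directory and the "vmlinuz-" prefix from a kernel image path,
# such as the output of `grubby --default-kernel`.
kernel_version_from_path() {
    local image="${1##*/}"      # drop leading directories
    printf '%s\n' "${image#vmlinuz-}"
}

# Succeed only when the running kernel release ($1, e.g. from `uname -r`)
# matches the version embedded in the default kernel image path ($2).
running_kernel_is_default() {
    [ "$1" = "$(kernel_version_from_path "$2")" ]
}

# Possible use on the host (not executed here):
#   running_kernel_is_default "$(uname -r)" "$(grubby --default-kernel)" \
#       || echo "Newest installed kernel not in use"
```

Taking both values as arguments keeps the comparison testable on a machine without grubby installed.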

Prior to performing an IPU or any change to the system, we run the snapshot_create role as follows:

- hosts: all
  roles:
    - name: snapshot_create
      snapshot_create_set_name: ripu
      snapshot_create_snapshot_autoextend_threshold: 70
      snapshot_create_snapshot_autoextend_percent: 20
      snapshot_create_boot_backup: true
      snapshot_create_volumes:
        - vg: rootvg
          lv: root
          size: 2G
        - vg: rootvg
          lv: var
          size: 2G

The following backup files are created:

[root@rhel7 ~]# tar tvfz boot-backup-ripu.tgz
-rw-r--r-- root/root       167 2020-08-18 19:59 .vmlinuz-3.10.0-1160.el7.x86_64.hmac
-rw------- root/root   3616707 2020-08-18 19:59 System.map-3.10.0-1160.el7.x86_64
-rw-r--r-- root/root    153591 2020-08-18 19:59 config-3.10.0-1160.el7.x86_64
-rw-r--r-- root/root      5002 2024-12-12 10:35 grub2/grub.cfg
-rw-r--r-- root/root      1024 2024-12-12 10:35 grub2/grubenv
-rw------- root/root  21328833 2024-12-11 17:23 initramfs-3.10.0-1160.el7.x86_64.img
-rw-r--r-- root/root    320648 2020-08-18 19:59 symvers-3.10.0-1160.el7.x86_64.gz
-rwxr-xr-x root/root   6769496 2020-08-18 19:59 vmlinuz-3.10.0-1160.el7.x86_64

Note the following about the backup contents:

  • It contains the /boot files for the currently running kernel (derived from ansible_kernel)
  • It does not contain the /boot files for the newer kernel
  • It contains a grubenv file whose saved_entry references the newer kernel:
head -2 /boot/grub2/grubenv
# GRUB Environment Block
saved_entry=Red Hat Enterprise Linux Server (3.10.0-1160.119.1.el7.x86_64) 7.9 (Maipo)
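The inconsistency noted above can be checked mechanically by comparing the kernel named in saved_entry with the archive's member list. This is a rough sketch only. It assumes a RHEL 7 style saved_entry (a menu title with the version in the first pair of parentheses) and member names without a leading ./ as in the listing above. The function names saved_entry_version and listing_has_kernel are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of the snapshot_create role.

# Read grubenv content on stdin and print the kernel version found inside the
# first pair of parentheses of the saved_entry line (RHEL 7 style title).
saved_entry_version() {
    sed -n 's/^saved_entry=[^(]*(\([^)]*\)).*/\1/p' | head -n 1
}

# Read an archive member list on stdin (one name per line, as printed by
# `tar tzf`) and succeed when it contains the kernel image for version $1.
listing_has_kernel() {
    grep -qxF "vmlinuz-$1"
}

# Possible use on the host (not executed here):
#   version=$(saved_entry_version < /boot/grub2/grubenv)
#   tar tzf boot-backup-ripu.tgz | listing_has_kernel "$version" \
#       || echo "backup does not contain kernel $version"
```

On systems where saved_entry holds something other than a menu title, the parsing step would need a different pattern.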

We reboot the server to bring in the new kernel and upgrade it to RHEL 8 via leapp using the infra.leapp collection. (Alternatively, you could likely reproduce this issue by removing the RHEL 7 kernel files from /boot.)

Post-upgrade, /boot no longer contains any RHEL 7 kernels.

[root@rhel7 boot]# ls -alrt
total 109708
drwx------.  3 root root     4096 Nov  5  2020 efi
-rw-r--r--.  1 root root      174 Dec  2 11:35 .vmlinuz-4.18.0-553.32.1.el8_10.x86_64.hmac
-rw-------.  1 root root  4509962 Dec  2 11:36 System.map-4.18.0-553.32.1.el8_10.x86_64
-rw-r--r--.  1 root root   202351 Dec  2 11:36 config-4.18.0-553.32.1.el8_10.x86_64
-rwxr-xr-x.  1 root root 10876536 Dec  2 11:36 vmlinuz-4.18.0-553.32.1.el8_10.x86_64
drwx------.  2 root root    16384 Dec 12 14:33 lost+found
-rw-------.  1 root root 62110325 Dec 12 14:34 initramfs-0-rescue-baab5782d740445d806e2ba8322746a8.img
-rwxr-xr-x.  1 root root  6769496 Dec 12 14:34 vmlinuz-0-rescue-baab5782d740445d806e2ba8322746a8
drwxr-xr-x.  3 root root     4096 Dec 12 15:04 loader
drwx------.  5 root root     4096 Dec 12 15:04 grub2
lrwxrwxrwx.  1 root root       53 Dec 12 15:06 symvers-4.18.0-553.32.1.el8_10.x86_64.gz -> /lib/modules/4.18.0-553.32.1.el8_10.x86_64/symvers.gz
-rw-------.  1 root root 27818128 Dec 12 15:07 initramfs-4.18.0-553.32.1.el8_10.x86_64.img
dr-xr-xr-x. 18 root root     4096 Dec 12 15:11 ..
dr-xr-xr-x.  6 root root     4096 Dec 12 15:13 .

We roll back the server using the snapshot_revert role.

This uses the files from boot-backup-ripu.tgz to populate /boot. As the grubenv file contains the newer kernel entry, the server tries to boot from it. However, the newer kernel files were not backed up and the boot fails with a message such as:

file /vmlinuz-3.10.0-1160.119.1.el7.x86_64 not found

Possible enhancements to improve role behaviour

I'm not fully sure what the expected behaviour should be for this use case. Should the create role stop when this scenario is detected? Or, if it is allowed to continue, what should happen on revert: default back to the older kernel or to the newer one?

Perhaps one of these:

When appropriate, warn the user that "Newest installed kernel not in use" and that they should reboot the server before running the snapshot_create role.

When snapshot_create is run, ensure that the kernel referenced in grubenv is also included in the backup.

When snapshot_revert is run, modify the grubenv file so that the older kernel is selected as the default.
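As a rough illustration of the last option, the kernel to fall back to could be derived from the backup archive itself. This is only a sketch: the function name backed_up_kernel_versions is hypothetical, and where such a step would fit in snapshot_revert is left open.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration; not part of the snapshot_revert role.

# Read an archive member list on stdin (one name per line, as printed by
# `tar tzf`) and print the version of every vmlinuz image it contains.
# Rescue images are skipped because they are not tied to a kernel package.
backed_up_kernel_versions() {
    sed -n '/^vmlinuz-0-rescue-/d; s/^vmlinuz-//p'
}

# Possible use during a revert (not executed here):
#   version=$(tar tzf boot-backup-ripu.tgz | backed_up_kernel_versions | head -n 1)
#   grubby --set-default="/boot/vmlinuz-$version"
```

Choosing the default from the archive contents, rather than from grubenv, would keep the default entry consistent with the files that are actually restored.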

@swapdisk

Not best practice, indeed! I'm leaning toward just failing if the host is not booted from the default kernel.

We already have a check like this for the bigboot and shrink_lv roles. Maybe do it the same way at the top of roles/snapshot_create/tasks/check.yml, like this?

- name: Validate default kernel is booted
  ansible.builtin.include_role:
    name: initramfs
    tasks_from: preflight
  when: snapshot_create_boot_backup

This will fail the play with a message warning "Current kernel version ... is not the default version ..."

Thoughts?

@rjo-uk

rjo-uk commented Dec 16, 2024

Thanks for the help @swapdisk - that's a great suggestion. I've made the change in the attached pull request. Sample invocation:

PLAY [all] ***********************************************************************************************

TASK [Gathering Facts] ***********************************************************************************
ok: [rhel7.london.example.com]

TASK [snapshot_create : Check available disk space] ******************************************************
included: /home/richard/ansible/infra.lvm_snapshots/roles/snapshot_create/tasks/check.yml for rhel7.london.example.com

TASK [Validate default kernel is booted] *****************************************************************

TASK [initramfs : Make sure the required related facts are available] ************************************
ok: [rhel7.london.example.com]

TASK [initramfs : Get kernel version] ********************************************************************
ok: [rhel7.london.example.com]

TASK [initramfs : Get default kernel] ********************************************************************
ok: [rhel7.london.example.com]

TASK [initramfs : Parse default kernel version] **********************************************************
ok: [rhel7.london.example.com]

TASK [initramfs : Check the values] **********************************************************************
fatal: [rhel7.london.example.com]: FAILED! => {
    "assertion": "initramfs_default_kernel == initramfs_kernel_version",
    "changed": false,
    "evaluated_to": false,
    "msg": "Current kernel version '3.10.0-1160.el7.x86_64' is not the default version '3.10.0-1160.119.1.el7.x86_64'"
}

PLAY RECAP ***********************************************************************************************
rhel7.london.example.com   : ok=6    changed=0    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0   
